~maas-committers/maas/+git/temporal:cdf/oss-626

Last commit made on 2023-12-19
Get this branch:
git clone -b cdf/oss-626 https://git.launchpad.net/~maas-committers/maas/+git/temporal

Branch information

Name:
cdf/oss-626
Repository:
lp:~maas-committers/maas/+git/temporal

Recent commits

05a7e42... by Carly de Frondeville <email address hidden>

first pass at new tdbg cmd

fdf2676... by David Reiss <email address hidden>

Add util.Ptr and replace convert.Ptr functions (#5228)

## What changed?
Add generic function to copy a value and return a pointer.
Also restore `build-tests` Makefile target.
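
A minimal sketch of what such a generic helper could look like. The commit message does not show the signature of `util.Ptr`, so the name `Ptr` and the form below are assumptions for illustration only:

```go
package main

import "fmt"

// Ptr copies v and returns a pointer to the copy.
// Assumed shape; the actual util.Ptr in the repository may differ.
func Ptr[T any](v T) *T {
	return &v
}

func main() {
	n := 3
	p := Ptr(n)
	*p = 4
	// The original variable is unaffected because Ptr works on a copy.
	fmt.Println(n, *p) // prints "3 4"

	// Works for any type, including literals that cannot be addressed directly.
	s := Ptr("hello")
	fmt.Println(*s) // prints "hello"
}
```

Because the argument is passed by value, the returned pointer never aliases the caller's variable.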

## Why?
Simplify code.

## How did you test it?
existing tests

dcf3bd1... by David Reiss <email address hidden>

Don't unload task queue while child is polling user data (#5227)

## What changed?
- Mark a task queue manager (tqm) as live while it is doing a long poll
for user data.
- Adjust timeouts so that the user data long poll is slightly shorter
than the idle unload timeout.

## Why?
This prevents the annoying `task queue closed` errors from
`GetTaskQueueUserData` by keeping a parent partition alive while a child
is doing a user data long poll (i.e. as long as any children are
loaded).

## How did you test it?
Mostly manually, with shortened timeouts. (There aren't currently good
tests around the idle unload.)

865e2e1... by Stephan Behnke <email address hidden>

Make target for generating service clients (#5232)

## What changed?
Added a step to re-generate gRPC API clients under `/client`.

## Why?
Whenever proto files are re-compiled, the gRPC clients should also be
re-generated.

Right now the developer has to invoke `go generate` manually (and first
know where to find the clients).

## How did you test it?
Running `make proto` and `make update-proto` - and observing the clients
being updated.

## Potential risks
There could be some subtle ordering assumption in `update-proto` that is
violated now?

## Is hotfix candidate?
No

f387d1b... by Prathyush PV <email address hidden>

Emitting workflow metrics with a different operation tag when workflow execution completes (#5224)

## What changed?
Emit accumulated history size, history count, and state transitions when
a workflow closes, using a different operation tag, "CompletionStats".

## Why?
To understand the history size, history count, and state transitions at
the time a workflow closes.

## How did you test it?
Tested locally.

## Potential risks
None

## Is hotfix candidate?
No

993f441... by Will Duan <email address hidden>

Refactor history events replication apply logic (#5166)

## What changed?
Refactor history events replication apply logic

## Why?
1. To support events import.
2. As a building block for handling the back-and-forth migration scenario.

## How did you test it?
Unit test.

## Potential risks
No risk; not wired into current logic.

## Is hotfix candidate?
No

0cf9b99... by Alex Shtin <email address hidden>

Reject update if worker didn't process update request message (#5169)

## What changed?
Reject update if worker didn't process update request message.

## Why?
If a worker receives update request messages in a workflow task (WT), it
must respond to those messages (accept or reject) when completing that
WT. There are two exceptions:
1. The WT is a heartbeat WT.
2. The workflow (WF) was completed on that WT.

If the worker didn't respond, either it uses an old SDK that is not aware
of messages/updates, or there is a bug in the SDK. The server will now
reject all updates that were sent to the worker but not processed, instead
of repeatedly creating new WTs and resending the same update messages to
the worker.
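
The rule described above can be sketched as a small decision function. This is an illustration only: `taskOutcome`, `updatesToReject`, and the string update IDs are hypothetical names, not the server's actual types.

```go
package main

import "fmt"

// taskOutcome is a hypothetical summary of how a worker completed a workflow task.
type taskOutcome struct {
	heartbeat         bool            // the WT was a heartbeat WT
	workflowCompleted bool            // the WF was completed on this WT
	responded         map[string]bool // update IDs the worker accepted or rejected
}

// updatesToReject returns the IDs of updates that were sent to the worker
// but not answered, unless one of the two exceptions applies.
func updatesToReject(sent []string, out taskOutcome) []string {
	if out.heartbeat || out.workflowCompleted {
		return nil
	}
	var rejected []string
	for _, id := range sent {
		if !out.responded[id] {
			rejected = append(rejected, id)
		}
	}
	return rejected
}

func main() {
	sent := []string{"u1", "u2"}

	// The worker answered u1 only, so u2 is rejected by the server.
	fmt.Println(updatesToReject(sent, taskOutcome{responded: map[string]bool{"u1": true}})) // prints "[u2]"

	// Heartbeat WT: nothing is rejected even though nothing was answered.
	fmt.Println(len(updatesToReject(sent, taskOutcome{heartbeat: true}))) // prints "0"
}
```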

## How did you test it?
Modified existing tests and added new functional test.

## Potential risks
No risks.

## Is hotfix candidate?
No.

8796381... by Tim Deeb-Swihart <email address hidden>

Improve ackManager.completeTask performance by two orders of magnitude (#5216)

## What changed?
I replaced the outstandingTasks map with an ordered treemap and
optimized completeTask to scan only what is necessary to update the ack
level.

## Why?
The old implementation of completeTask required a full scan of the task
map in order to move the ack level, which had terrible performance.
By storing tasks in an ordered set, we can limit the scan's size by
stopping at the first unacked task.

This trades addTask performance for completeTask performance, but since
all added tasks are presumably completed, we should be fine with 1/3 the
performance on addTask for 227x the completeTask performance. With this
change, both operations run in about the same amount of time.

Before:
```
$ go test -bench=AckManager ./service/matching/... -run=FooBarBaz
goos: darwin
goarch: arm64
pkg: go.temporal.io/server/service/matching
BenchmarkAckManager_AddTask-12 22768 52206 ns/op
BenchmarkAckManager_CompleteTask-12 38 29293019 ns/op
```

After:
```
$ go test -bench=AckManager ./service/matching -run=FooBarBaz
goos: darwin
goarch: arm64
pkg: go.temporal.io/server/service/matching
BenchmarkAckManager_AddTask-12 8127 147226 ns/op
BenchmarkAckManager_CompleteTask-12 8626 136614 ns/op
```
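
To illustrate the idea of stopping at the first unacked task, here is a simplified sketch. It is not the code from this commit: it swaps the treemap for a sorted slice to stay within the standard library, and `ackTracker` and its fields are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
)

// ackTracker is a hypothetical, simplified stand-in for the ack manager.
type ackTracker struct {
	ids      []int64        // outstanding task IDs, kept sorted ascending
	acked    map[int64]bool // true once a tracked task has been completed
	ackLevel int64          // highest ID below which every task is completed
}

func newAckTracker() *ackTracker {
	return &ackTracker{acked: make(map[int64]bool)}
}

// addTask inserts id at its sorted position (slower than a plain map insert).
func (t *ackTracker) addTask(id int64) {
	i := sort.Search(len(t.ids), func(i int) bool { return t.ids[i] >= id })
	if i < len(t.ids) && t.ids[i] == id {
		return // already tracked
	}
	t.ids = append(t.ids, 0)
	copy(t.ids[i+1:], t.ids[i:])
	t.ids[i] = id
	t.acked[id] = false
}

// completeTask marks id as done, then advances ackLevel over the leading run
// of completed tasks, stopping at the first unacked one instead of scanning all.
func (t *ackTracker) completeTask(id int64) int64 {
	if done, ok := t.acked[id]; !ok || done {
		return t.ackLevel
	}
	t.acked[id] = true
	n := 0
	for n < len(t.ids) && t.acked[t.ids[n]] {
		t.ackLevel = t.ids[n]
		delete(t.acked, t.ids[n])
		n++
	}
	t.ids = t.ids[n:]
	return t.ackLevel
}

func main() {
	tr := newAckTracker()
	for _, id := range []int64{1, 2, 3} {
		tr.addTask(id)
	}
	fmt.Println(tr.completeTask(2)) // prints "0": task 1 is still outstanding
	fmt.Println(tr.completeTask(1)) // prints "2": tasks 1 and 2 are both done
	fmt.Println(tr.completeTask(3)) // prints "3"
}
```

A sorted slice has linear-time inserts, so it only demonstrates the early-exit scan, not the performance characteristics measured in the benchmarks.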

## How did you test it?
I added both tests and benchmarks to ensure the ackManager works as
before.

## Potential risks
None.

## Is hotfix candidate?
No

c8aba5e... by David Reiss <email address hidden>

Return Unavailable to frontend rpcs until healthy (#5069)

**What changed?**
Add an interceptor to return Unavailable to WorkflowService methods
until the frontend considers itself "healthy", which currently means
"membership is initialized".

**Why?**
Fixes #5015

**How did you test it?**
mostly manually

**Potential risks**
This adds a window of time during which the frontend can return
Unavailable where previously it might have succeeded or returned a
different error code. Note specifically that client.Dial in the Go SDK
(at least) will fail fast on this error, and the caller will need to retry.
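
The gating behavior described in this commit can be sketched without gRPC. The example below is a standard-library stand-in, not the actual interceptor: `healthGate`, `handler`, and `errUnavailable` are hypothetical names, and a real implementation would return a gRPC Unavailable status from a server interceptor.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// errUnavailable stands in for a gRPC Unavailable status in this sketch.
var errUnavailable = errors.New("unavailable: frontend is not healthy yet")

// handler is a hypothetical request handler signature.
type handler func(req string) (string, error)

// healthGate rejects requests until it has been marked healthy.
type healthGate struct {
	healthy atomic.Bool
}

// intercept wraps next so that calls fail fast while the gate is unhealthy.
func (g *healthGate) intercept(next handler) handler {
	return func(req string) (string, error) {
		if !g.healthy.Load() {
			return "", errUnavailable
		}
		return next(req)
	}
}

func main() {
	g := &healthGate{}
	h := g.intercept(func(req string) (string, error) {
		return "ok: " + req, nil
	})

	_, err := h("ping")
	fmt.Println(err) // prints "unavailable: frontend is not healthy yet"

	// For example, once membership is initialized.
	g.healthy.Store(true)
	resp, _ := h("ping")
	fmt.Println(resp) // prints "ok: ping"
}
```

Callers that see the unavailable error before the gate opens are expected to retry, as noted in the risks above.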

---------

Co-authored-by: Tim Deeb-Swihart <email address hidden>

51ea367... by Tim Deeb-Swihart <email address hidden>

Remove the timestamp.Timestamp type (#5220)

## What changed?
I removed the now-unused timestamp.Timestamp type

## Why?
It's no longer necessary

## How did you test it?
Existing tests

## Potential risks
None

## Is hotfix candidate?
No