~maas-committers/maas/+git/temporal:ppv/wfcache

Last commit made on 2023-12-05
Get this branch:
git clone -b ppv/wfcache https://git.launchpad.net/~maas-committers/maas/+git/temporal

Branch merges

Branch information

Name:
ppv/wfcache
Repository:
lp:~maas-committers/maas/+git/temporal

Recent commits

440832d... by Prathyush PV

Fixing test

ba7e75f... by Prathyush PV

Fixing race

384e24e... by Prathyush PV

minor fixes

a61e7fb... by Prathyush PV

Min-max cache for MutableStates

881b4b4... by pdoerner <email address hidden>

Always set RPS on dynamic rate limiter refresh (#5180)

## What changed?
Always set the RPS of the dynamic rate limiter when it is refreshed

## Why?
Previously, changes made to `rateFn` were applied only when the dynamic
multiplier needed to be adjusted. The rate function should always be
checked to pick up any changes to the maximum RPS dynamic config.
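
A minimal Go sketch of the idea, assuming hypothetical names
(`dynamicRateLimiter`, `rateFn`, and `refresh` are illustrative, not the
actual server identifiers): the refresh path re-evaluates the rate
function unconditionally instead of only when the dynamic multiplier
changes.

```go
// Illustrative sketch only; identifiers are assumptions, not the actual
// Temporal dynamic rate limiter implementation.
package ratelimitsketch

import "golang.org/x/time/rate"

type dynamicRateLimiter struct {
	limiter *rate.Limiter
	rateFn  func() float64 // reads the current max RPS from dynamic config
}

// refresh is called periodically. The fix described above is to always
// re-evaluate rateFn and apply the result, rather than only when the
// dynamic multiplier changes, so changes to the max RPS dynamic config
// are always picked up.
func (d *dynamicRateLimiter) refresh() {
	d.limiter.SetLimit(rate.Limit(d.rateFn()))
}
```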

## How did you test it?
Existing tests

## Potential risks
Extra rate limiting

## Is hotfix candidate?

aa34d01... by Yichao Yang <email address hidden>

Fix default persistence request priority (#5177)

f602a59... by Shahab Tajik <email address hidden>

Prevent forwarding and sync match when there is a backlog (#4982)

**What changed?**
- Added a config to define a threshold for *negligible* backlog based on
the age of the oldest task in the backlog. The value is configurable per
TQ, but planning to set it to 5 secs by default in prod. A backlog older
than this threshold is considered non-negligible.
- Stop forwarding tasks when there is a non-negligible backlog (see the
sketch after this list).
- This in turn stops pollers from being forwarded in the presence of a
non-negligible backlog.
- This ensures that all the partitions accumulate and process backlog at
the same rate. Effectively, the backlog age difference between partitions
should not grow significantly larger than the negligibility threshold
(e.g. 5 secs).
- There is an exception to this: forwarding tasks is allowed when the
partition has not received a poll recently (configurable, default
200ms). This ensures that in a TQ with few pollers, tasks and pollers
do not get stuck in different partitions for a long time.
- Prevent sync match when there is a non-negligible backlog. This is
already mostly the case (i.e. history tasks are all spooled and only
sync matched in race conditions when there is a backlog). However,
forwarded tasks are also sync matched in the root partition. In effect,
this change makes sure that the root partition does not dispatch
forwarded tasks when it has a backlog of its own. This helps further
with keeping the partitions balanced.
- Also added the following metrics to be able to monitor poll and task
dispatch behavior at the TQ partition level:
    - `poll_latency` with `forwarded=false/true`
    - `task_dispatch_latency` with `forwarded=false/true` and
      `source=History/DbBacklog`
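
A minimal Go sketch of the forwarding rule described above, under stated
assumptions: the type, field, and function names (`partitionState`,
`forwardConfig`, `shouldForwardTask`, etc.) are hypothetical and not the
actual matching-service identifiers.

```go
// Illustrative sketch of the forwarding decision described in this
// commit; identifiers are hypothetical, not the real matching code.
package matchingsketch

import "time"

type partitionState struct {
	oldestBacklogTaskCreateTime time.Time // zero if the backlog is empty
	lastPollTime                time.Time // last time a poller hit this partition
}

type forwardConfig struct {
	negligibleBacklogThreshold time.Duration // e.g. 5s per task queue
	maxPollWaitBeforeForward   time.Duration // e.g. 200ms
}

// shouldForwardTask returns true when forwarding (and hence sync match)
// is allowed: either the backlog is negligible, or the partition has not
// seen a poller recently, in which case forwarding anyway avoids tasks
// and pollers being stuck on different partitions.
func shouldForwardTask(p partitionState, cfg forwardConfig, now time.Time) bool {
	backlogNegligible := p.oldestBacklogTaskCreateTime.IsZero() ||
		now.Sub(p.oldestBacklogTaskCreateTime) < cfg.negligibleBacklogThreshold
	if backlogNegligible {
		return true
	}
	// Exception: no recent poll on this partition.
	return now.Sub(p.lastPollTime) > cfg.maxPollWaitBeforeForward
}
```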

**Why?**
With the previous forwarding logic, when a backlog started to form, the
root partition effectively received more pollers and was able to process
its backlog faster. This mixed up the order of dispatching tasks: a new
task could be dispatched faster by going to the root partition while
older tasks were backlogged in other partitions. This change prevents
that situation.

**How did you test it?**
- Unit tests
- Manually verified against 9 workload+worker configurations as defined
here:
https://github.com/ShahabT/temporal-tests/blob/main/balanced-partitions/experiments.go

**Potential risks**
The change is config-protected and disabled by default (by setting a
very high threshold). It will be enabled by default in a future release
after successful verification in prod workloads.

**Is hotfix candidate?**
No.

bbb5b2d... by ast2023 <email address hidden>

reimplement nextBackoffInterval tests to cover functional requirements (#5161)

## What changed?
Reimplemented the unit tests for `nextBackoffInterval`.

## Why?
The original tests' descriptions and functionality did not match, so it
was hard to tell what they were testing. This is the most readable
approach I could think of today. I tried a table-based approach, but I
think it was less readable because the call to the function under test
was hidden.
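
For illustration, a hedged sketch of the explicit (non-table-driven)
test style described above; the `nextBackoffInterval` stand-in, its
signature, and the test names are assumptions, not the actual change.

```go
// Illustrative sketch of the explicit test style; the function signature
// and behavior here are stand-ins, not the real retry code.
package retrysketch

import (
	"testing"
	"time"
)

// nextBackoffInterval is a stand-in for the function under test: it
// grows the initial interval by the coefficient on each attempt.
func nextBackoffInterval(attempt int, initial time.Duration, coefficient float64) time.Duration {
	d := float64(initial)
	for i := 1; i < attempt; i++ {
		d *= coefficient
	}
	return time.Duration(d)
}

// Each test spells out the call and the expectation inline, so the reader
// can see exactly what is exercised without decoding a table row.
func TestNextBackoffInterval_FirstAttemptUsesInitialInterval(t *testing.T) {
	got := nextBackoffInterval(1, time.Second, 2.0)
	if got != time.Second {
		t.Fatalf("expected 1s, got %v", got)
	}
}

func TestNextBackoffInterval_GrowsByCoefficient(t *testing.T) {
	got := nextBackoffInterval(3, time.Second, 2.0)
	if got != 4*time.Second {
		t.Fatalf("expected 4s, got %v", got)
	}
}
```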

## How did you test it?
Ran the tests.

## Potential risks
Some edge cases were not covered by the tests.

## Is hotfix candidate?
No

c6aef53... by Alex Shtin <email address hidden>

ScheduleToStart timeout for speculative WT on normal task queue (#5146)

**What changed?**
Add `ScheduleToStart` timeout for speculative WT on normal task queue.

**Why?**
`UpdateWorkflowExecution` API handler tries to add the WT to matching
directly. This call must be made outside of the workflow lock and might
fail (if matching is temporarily unavailable, for example). There is no
good way to handle this error after the lock has been released. This is
why I introduced this `ScheduleToStart` timeout for speculative WT on
the normal task queue. This timeout ensures that the WT will eventually
be added to matching and the workflow won't get stuck.

Also removed the special flag that was used to skip transfer task
creation when a speculative WT is converted to normal before the MS
with the speculative WT is persisted.

Both of these speculative WT logic changes (the timeout on a normal
task queue and skipping transfer task creation) are extracted into a
dedicated `GenerateScheduleSpeculativeWorkflowTaskTasks` method.
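
A minimal sketch of the idea, assuming hypothetical names (the real
logic lives in `GenerateScheduleSpeculativeWorkflowTaskTasks`, which
this does not reproduce): when a speculative WT targets a normal task
queue, also emit a ScheduleToStart timeout timer so the WT is
re-added to matching even if the direct add-to-matching call failed.

```go
// Illustrative sketch only; names and structure are assumptions, not the
// real history-service implementation.
package speculativewt

import "time"

type workflowTask struct {
	ScheduledEventID int64
	TaskQueue        string
	Speculative      bool
}

type timerTask struct {
	FireAt           time.Time
	ScheduledEventID int64
	Kind             string
}

// generateSpeculativeWTTasks mirrors the described behavior: for a
// speculative WT on a normal (named) task queue, emit a ScheduleToStart
// timeout timer so the WT is eventually re-added to matching even if the
// initial direct add-to-matching call (made outside the workflow lock)
// failed.
func generateSpeculativeWTTasks(wt workflowTask, scheduleToStartTimeout time.Duration, now time.Time) []timerTask {
	if !wt.Speculative {
		return nil
	}
	return []timerTask{{
		FireAt:           now.Add(scheduleToStartTimeout),
		ScheduledEventID: wt.ScheduledEventID,
		Kind:             "WORKFLOW_TASK_SCHEDULE_TO_START_TIMEOUT",
	}}
}
```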

**How did you test it?**
Added functional test.

**Potential risks**
Speculative WT times out once if there is no worker available.

**Is hotfix candidate?**
No.

691bd20... by ast2023 <email address hidden>

Mutable state approximate size should not change if we do not retry the activity (#5175)

## What changed?
Keep the approximate size of the mutable state unchanged when an
activity cannot be retried because of a deadline exceeded or another
error returned by `nextBackoffInterval`.

## Why?
The previous behavior was wrong: the mutable state size was reduced
before the activity was replaced.
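
A hedged Go sketch of the corrected invariant (field and function names
are illustrative, not the actual mutable state code): the approximate
size is only adjusted once it is certain the activity info will be
replaced; if the retry is rejected, the size stays unchanged.

```go
// Illustrative sketch; not the actual mutable state implementation.
package mssketch

import "errors"

type activityInfo struct {
	payloadSize int
}

type mutableState struct {
	approximateSize int
}

var errRetryNotAllowed = errors.New("retry not allowed: deadline exceeded or non-retryable error")

// retryActivity only touches approximateSize once it is certain the old
// activity info will be replaced; on an early error the size is untouched.
func (ms *mutableState) retryActivity(old, updated *activityInfo, allowRetry bool) error {
	if !allowRetry {
		// The fix: previously the size was reduced before this point,
		// i.e. before knowing whether the activity would be replaced.
		return errRetryNotAllowed
	}
	ms.approximateSize -= old.payloadSize
	ms.approximateSize += updated.payloadSize
	return nil
}
```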

## How did you test it?
Unit tests

## Potential risks
Not sure what persistence was doing with the wrong size.

## Is hotfix candidate?
No