~maas-committers/maas/+git/temporal:ppv/mutableStateMetrics

Last commit made on 2024-02-17
Get this branch:
git clone -b ppv/mutableStateMetrics https://git.launchpad.net/~maas-committers/maas/+git/temporal

Branch information

Name:
ppv/mutableStateMetrics
Repository:
lp:~maas-committers/maas/+git/temporal

Recent commits

d7e75ce... by Prathyush PV <email address hidden>

Refactor

0b79843... by Prathyush PV <email address hidden>

Adding buffered events to payload size

811f7dd... by Prathyush PV <email address hidden>

Emitting payload size in mutable state

9c0e746... by Tim Deeb-Swihart <email address hidden>

Only update shard info when enough tasks are acked or time has passed (#5399)

## What changed?

Our shard info update logic now monitors how many tasks have been
completed across all queues. If enough changes have occurred it will
persist changes to our database even if `ShardUpdateMinInterval` time
hasn't elapsed since the last change. The time between updates and the
number of tasks completed per update can be monitored using
`tasks_per_shardinfo_update` and `time_between_shardinfo_updates`
metrics.

While here, I extracted the metric calculation from the `updateShardInfo` function.
In the previous code, we'd stop updating metrics whenever we stopped updating shard
info, which isn't the behavior we want. Metrics are now handled by a separate
goroutine that grabs the shard's read lock as needed.

Follow-up work will handle updating shard info based on replication task
completion if it makes sense to.

## Why?

When we have a large number of shards, updating shard info every five
minutes (the default) can be costly. If we persist after enough changes
have occurred, we can increase this interval and reduce the load on our
DB.

## How did you test it?

I added a handful of new unit tests to verify the behavior.

## Potential risks

None

## Is hotfix candidate?

No

b4eb43b... by Prathyush PV <email address hidden>

Adding size and usage metrics to LRU cache (#5427)

## What changed?
Adding size and usage metrics to LRU cache.
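
As a rough illustration of what size and usage metrics on an LRU cache can look like, here is a minimal sketch. It is not the cache from this commit: the `lruCache` type, the caller-supplied `size` argument, and the `Usage` and hit/miss counters are all assumptions made for the example.

```go
package main

import (
	"container/list"
	"fmt"
)

// entry is one cached item together with its caller-reported size.
type entry struct {
	key   string
	value any
	size  int
}

// lruCache is a hypothetical size-bounded LRU that tracks simple metrics.
type lruCache struct {
	maxSize      int
	curSize      int                      // "size" metric: sum of entry sizes
	order        *list.List               // front = most recently used
	items        map[string]*list.Element // key -> element in order
	hits, misses int                      // "usage" counters
}

func newLRU(maxSize int) *lruCache {
	return &lruCache{maxSize: maxSize, order: list.New(), items: map[string]*list.Element{}}
}

// Get returns the value for key and records a hit or a miss.
func (c *lruCache) Get(key string) (any, bool) {
	el, ok := c.items[key]
	if !ok {
		c.misses++
		return nil, false
	}
	c.hits++
	c.order.MoveToFront(el)
	return el.Value.(*entry).value, true
}

// Put inserts or updates key, then evicts least recently used entries
// until the total size fits within maxSize.
func (c *lruCache) Put(key string, value any, size int) {
	if el, ok := c.items[key]; ok {
		e := el.Value.(*entry)
		c.curSize += size - e.size
		e.value, e.size = value, size
		c.order.MoveToFront(el)
	} else {
		c.items[key] = c.order.PushFront(&entry{key: key, value: value, size: size})
		c.curSize += size
	}
	for c.curSize > c.maxSize && c.order.Len() > 0 {
		e := c.order.Remove(c.order.Back()).(*entry)
		delete(c.items, e.key)
		c.curSize -= e.size
	}
}

// Usage is the fraction of capacity currently occupied.
func (c *lruCache) Usage() float64 {
	return float64(c.curSize) / float64(c.maxSize)
}

func main() {
	c := newLRU(10)
	c.Put("a", "A", 4)
	c.Put("b", "B", 4)
	c.Get("a")         // "a" becomes most recently used
	c.Put("c", "C", 4) // total would be 12, so "b" is evicted
	fmt.Println(c.curSize, c.Usage(), c.hits, c.misses)
}
```

In a real service these values would be emitted through the metrics handler rather than read from fields; the point here is only which quantities are worth tracking when tuning cache size.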

## Why?
It will give us more clarity about how the mutable state cache and the
history event cache are being used. It will help us optimize the size of
these caches.

## How did you test it?
Unit tests. Ran bench workload on a local cluster.

## Potential risks
None

## Is hotfix candidate?
No

6b21033... by David Reiss <email address hidden>

Unload matching task queues on membership change (#5345)

## What changed?
Matching pays attention to membership changes and unloads task queue
partitions that it no longer owns. It waits a few seconds before
unloading to avoid situations where a task queue would get immediately
reloaded because the membership change hadn't propagated to some
frontend or history. The delay of 3s was chosen because membership
changes seem to propagate in <1s generally.

## Why?
Fixes #5198

## How did you test it?
Unit test, plus many simulated restarts while monitoring metrics.

## Potential risks
If membership changes are very slow to propagate, this could lead to
more loading/unloading, although that could happen anyway (if two
frontend/history disagree on the owner of a partition and they fight for
it). We could address that by checking membership at load time also, but
that has other risks. It could come in a separate PR.

729f753... by Alex Shtin <email address hidden>

Extract updateutils package (#5424)

## What changed?
Extract `common/testing/updateutils` package.

## Why?
Will be used outside of functional tests.

## How did you test it?
Ran tests locally.

## Potential risks
No risks.

## Is hotfix candidate?
No.

1bbe557... by Yichao Yang <email address hidden>

Persistence store GetHistoryTreeContainingBranch (#5411)

0c2cb7a... by Stephan Behnke <email address hidden>

Rename WorkflowContext to WorkflowLease #2 (#5418)

## What changed?

Renamed the remaining `api.WorkflowContext` occurrences to `api.WorkflowLease`
(including the file name).

## Why?

A couple of occurrences were missed earlier in
https://github.com/temporalio/temporal/pull/5386

## How did you test it?

Compiler

## Potential risks

## Is hotfix candidate?

4b47ecd... by Alex Shtin <email address hidden>

Extract protoutils package (#5422)

## What changed?
Extract `common/testing/protoutils` package.

## Why?
I plan to use this package outside of functional tests.

## How did you test it?
Ran tests.

## Potential risks
No risks.

## Is hotfix candidate?
No.