9c0e746... by Tim Deeb-Swihart <email address hidden>
Only update shard info when enough tasks are acked or time has passed (#5399)
## What changed?
Our shard info update logic now monitors how many tasks have been
completed across all queues. If enough changes have occurred, it will
persist shard info to our database even if `ShardUpdateMinInterval`
hasn't elapsed since the last update. The time between updates and the
number of tasks completed per update can be monitored with the
`tasks_per_shardinfo_update` and `time_between_shardinfo_updates`
metrics.
While here, I extracted the metric calculation from the `updateShardInfo` function.
In the previous code we'd stop updating metrics whenever we stopped updating shard info,
which isn't the behavior we want. Now the metrics are handled by a separate goroutine
that grabs the shard's read lock as needed.
Follow-up work will handle updating shard info based on replication task
completion if it makes sense to.
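To make the trigger concrete, here is a minimal sketch of the "enough tasks acked or enough time passed" check, assuming a hypothetical `shardInfoUpdater` type and a `minTasksPerUpdate` knob; only `ShardUpdateMinInterval` comes from the PR, the rest is illustrative and not the actual server code.

```go
// Hypothetical sketch (not the real shard context code): persist shard info
// when either enough tasks have been acked or ShardUpdateMinInterval elapsed.
package shard

import (
	"sync"
	"time"
)

type shardInfoUpdater struct {
	mu                sync.Mutex
	tasksAcked        int           // tasks completed since the last persist
	lastUpdate        time.Time     // time of the last persist
	minTasksPerUpdate int           // assumed threshold of acked tasks
	minInterval       time.Duration // corresponds to ShardUpdateMinInterval
	persist           func() error  // writes shard info to the database
}

// onTaskAcked is called whenever a queue task completes. It persists shard
// info early if enough tasks have been acked, without waiting for minInterval.
func (u *shardInfoUpdater) onTaskAcked() error {
	u.mu.Lock()
	defer u.mu.Unlock()

	u.tasksAcked++
	enoughTasks := u.tasksAcked >= u.minTasksPerUpdate
	enoughTime := time.Since(u.lastUpdate) >= u.minInterval
	if !enoughTasks && !enoughTime {
		return nil
	}

	if err := u.persist(); err != nil {
		return err
	}
	u.tasksAcked = 0
	u.lastUpdate = time.Now()
	return nil
}
```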
## Why?
When we have a large number of shards, updating shard info every five
minutes (the default) can be costly. If we persist after enough changes
have occurred, we can increase this interval and reduce the load on our
DB.
## How did you test it?
I added a handful of new unit tests to verify the behavior.
Adding size and usage metrics to LRU cache (#5427)
## What changed?
Adding size and usage metrics to LRU cache.
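As a rough illustration of what such metrics can look like, here is a hedged sketch of a cache wrapper that reports its size and usage ratio on every write. The `Cache` interface, `metricsEmitter`, and the metric names are assumptions for illustration, not the actual implementation.

```go
// Hypothetical sketch: wrap an LRU cache and emit size/usage gauges.
package cache

type Cache interface {
	Put(key, value any) any
	Get(key any) any
	Size() int
}

type metricsEmitter interface {
	RecordGauge(name string, value float64)
}

type monitoredCache struct {
	Cache
	maxSize int
	metrics metricsEmitter
}

// Put delegates to the wrapped cache and then records how full it is.
func (c *monitoredCache) Put(key, value any) any {
	evicted := c.Cache.Put(key, value)
	c.reportUsage()
	return evicted
}

func (c *monitoredCache) reportUsage() {
	size := float64(c.Size())
	c.metrics.RecordGauge("cache_size", size)
	if c.maxSize > 0 {
		c.metrics.RecordGauge("cache_usage", size/float64(c.maxSize))
	}
}
```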
## Why?
It will give us more clarity about how the mutable state cache and the
history event cache are being used, and it will help us optimize the size
of these caches.
## How did you test it?
Unit tests. Ran bench workload on a local cluster.
Unload matching task queues on membership change (#5345)
## What changed?
Matching pays attention to membership changes and unload task queue
partitions that it no longer owns. It waits a few seconds before
unloading to avoid situations where a task queue would get immediately
reloaded because the membership change hadn't propagated to some
frontend or history. The delay of 3s was chosen because membership
changes seem to propagate in <1s generally.
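A minimal sketch of that flow, assuming hypothetical `ownerOf`, `loadedPartitions`, and `unload` helpers (the real matching code differs); the point is the delayed re-check of ownership before unloading.

```go
// Hypothetical sketch: on a membership change, wait a short grace period
// and then unload task queue partitions this host no longer owns.
package matching

import "time"

const unloadDelay = 3 * time.Second // membership usually propagates in <1s

type partitionManager struct {
	hostID           string
	ownerOf          func(partition string) string // resolves the owner via the membership ring
	loadedPartitions func() []string               // partitions currently loaded on this host
	unload           func(partition string)        // unloads a partition
}

// onMembershipChange is invoked when the membership ring changes.
func (m *partitionManager) onMembershipChange() {
	time.AfterFunc(unloadDelay, func() {
		// Re-check ownership after the delay so partitions that are still
		// ours (or were reloaded in the meantime) are left alone.
		for _, p := range m.loadedPartitions() {
			if m.ownerOf(p) != m.hostID {
				m.unload(p)
			}
		}
	})
}
```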
## Why?
Fixes #5198
## How did you test it?
Unit tests, plus lots of simulated restarts while monitoring metrics.
## Potential risks
If membership changes are very slow to propagate, this could lead to
more loading/unloading, although that could happen anyway (if two
frontend/history hosts disagree on the owner of a partition and fight
over it). We could address that by also checking membership at load time,
but that has other risks; it could come in a separate PR.