* Self document that the information is refreshed every 30 seconds.
* Reorder the queue length section and the running packages one so that
the queue length is visible immediately, without needing to scroll down
when too many packages are running.
* Some semantic improvements (h2 vs h3)
fix: worker: only send the current state every 30 seconds
This fixes a producer/consumer imbalance that led to RabbitMQ consuming
too much memory and restarting every two hours when the infra is under
high load.
This boils down to this:
* Every worker sends its state every 10s to a RabbitMQ fanout
* This fanout distributes messages to one subscriber per web unit, in
the `amqp-status-collector` service.
* This `amqp-status-collector` must then process every message to
produce `running.json`, which is later used to generate the `/running`
page.
* The performance of this `amqp-status-collector` script boils down
to its `process_message` function, and ultimately to Python's slow
JSON parsing.
* This function was benchmarked on the infra at about 25ms per message,
meaning a maximum consumption throughput of about 40 messages per
second.
* With the currently deployed workers (110+22+22+22+22+50)*2+16*3=544,
the producing side of that system can exceed 50 messages per second.
* That means RabbitMQ will eventually start to accumulate messages in
its queues, and finally get killed by the watchdog monitoring its RAM
consumption.
A quick analysis of the logs shows that basically nobody relies on the
`/running` page refreshing every 10 seconds, so a refresh rate of 30
seconds should be enough for most use cases.
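The fix amounts to rate-limiting the worker's state publication. A
minimal sketch of such a throttle, where `publish` is a hypothetical
stand-in for the real RabbitMQ fanout publish (the actual worker code
may be structured differently):

```python
import time

SEND_INTERVAL = 30.0  # seconds; was effectively 10 before this fix


class StateSender:
    """Publish the worker state at most once per SEND_INTERVAL.

    `publish` and the injectable `clock` are illustrative parameters,
    not names from the real codebase.
    """

    def __init__(self, publish, interval=SEND_INTERVAL, clock=time.monotonic):
        self.publish = publish
        self.interval = interval
        self.clock = clock
        self.last_sent = float("-inf")

    def maybe_send(self, state):
        # Skip the publish if the last one was too recent.
        now = self.clock()
        if now - self.last_sent >= self.interval:
            self.publish(state)
            self.last_sent = now
            return True
        return False
```

Injecting the clock keeps the throttle testable without sleeping: the
caller can drive time forward explicitly and assert which states were
actually published.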
This special syntax draws a visible circle at each data point. This is
useful when a data series appears only once, and thus has no line
linking two points.