Merge lp:~vila/udd/795321-make-tea into lp:udd
Status: | Merged |
---|---|
Merged at revision: | 513 |
Proposed branch: | lp:~vila/udd/795321-make-tea |
Merge into: | lp:udd |
Diff against target: |
587 lines (+397/-17) 7 files modified
mass_import.py (+107/-13) selftest.py (+1/-0) udd/circuit_breaker.dot (+23/-0) udd/circuit_breaker.py (+109/-0) udd/icommon.py (+25/-4) udd/tests/test_circuit_breaker.py (+129/-0) udd/threads.py (+3/-0) |
To merge this branch: | bzr merge lp:~vila/udd/795321-make-tea |
Related bugs: |
Reviewer | Review Type | Date Requested | Status |
---|---|---|---|
James Westby | Approve | ||
Review via email: mp+76058@code.launchpad.net |
Description of the change
This implements a circuit breaker to track launchpad down times and avoid
the associated spurious failures.
Strictly speaking, a circuit breaker protects the usage of a resource by
tracking the attempts, successes and failures of these usages. When a
failure occurs, it blocks further attempts for some grace period and then
waits for a success to close the breaker again.
See the udd/circuit_
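The classic pattern described above can be sketched roughly as follows. This is an illustrative minimal version, not the implementation in this branch; the class name, method names and `grace_period` parameter are all made up for the example:

```python
import time


class CircuitBreaker(object):
    """Minimal closed/open/half-open circuit breaker (illustrative only)."""

    def __init__(self, grace_period=60, clock=time.time):
        self.grace_period = grace_period
        self.clock = clock
        self.state = 'closed'
        self.opened_at = None

    def allow_attempt(self):
        if self.state == 'closed':
            return True
        if self.clock() - self.opened_at >= self.grace_period:
            # Grace period elapsed: let one attempt through to probe
            # whether the resource is back.
            self.state = 'half-open'
            return True
        return False

    def record_success(self):
        # A success closes the breaker again.
        self.state = 'closed'
        self.opened_at = None

    def record_failure(self):
        # Any failure (re)opens the breaker and restarts the grace period.
        self.state = 'open'
        self.opened_at = self.clock()
```

The `clock` parameter is only there so the grace period can be tested without sleeping.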
The pattern roughly applies to launchpad with some tweaks: launchpad is used
in many different ways and not all failure scenarios are currently known
(in the importer code base). Additionally, several imports are running in
parallel, querying launchpad in different ways and producing successes and
failures that can be seen by the circuit breaker in unexpected orders.
I chose to implement a variant that relies on the existing known transient
failures (the ones recorded in the status database in the 'should_retry'
table) and just pauses for a while, without starting new imports, as soon as
a launchpad failure is detected. After the grace period, the *same* imports
are retried and should provide a good test that launchpad is back.
The net effect will be to *limit* the number of spurious failures.
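The variant above could look something like the following sketch. Everything here is hypothetical (the `run_import` callback, the reason strings, the `is_known_transient` helper); only the overall behaviour, pausing on a known-transient failure and retrying the *same* job as a probe, reflects the description:

```python
import time

# Hypothetical list of known-transient reasons; the real importer
# drives this from the 'should_retry' table in the status database.
KNOWN_TRANSIENT = ('timeout', 'lp-down')


def is_known_transient(failure_reason):
    return failure_reason in KNOWN_TRANSIENT


def import_loop(jobs, run_import, grace_period=300, sleep=time.sleep):
    """Pause on a known-transient launchpad failure, then retry the
    *same* job first as a probe that launchpad is back.

    ``run_import`` is a hypothetical callback returning
    ``(ok, failure_reason)``.
    """
    queue = list(jobs)
    while queue:
        job = queue[0]
        ok, reason = run_import(job)
        if ok:
            queue.pop(0)
        elif is_known_transient(reason):
            # Don't start new imports; wait out the grace period and
            # retry the same job, which is a good test of recovery.
            sleep(grace_period)
        else:
            # Hard failure: drop the job here (recorded elsewhere
            # in the real importer).
            queue.pop(0)
```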
The nice property of this table is that we can add more scenarios as we see
them appear with:
requeue_
This means we can complete our knowledge as we discover new failures.
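Consulting such a table could be as simple as the sketch below. The schema and the signature value are assumptions for illustration; the real status database layout may differ:

```python
import sqlite3

# Assumed schema: one row per known-transient failure signature.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE should_retry (signature TEXT PRIMARY KEY)')
conn.execute("INSERT INTO should_retry VALUES ('xmlrpclib.ProtocolError')")


def should_retry(conn, failure_signature):
    # A failure whose signature is recorded is treated as transient,
    # so the matching import is worth retrying.
    row = conn.execute(
        'SELECT 1 FROM should_retry WHERE signature = ?',
        (failure_signature,)).fetchone()
    return row is not None
```

Adding a newly discovered transient failure is then a single INSERT, which matches the "complete our knowledge as we discover new failures" workflow.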
I made only rough tests locally... but couldn't find an easy way to simulate
a local launchpad, let alone a transiently down one :-/
But given that I've witnessed a storm of transient failures last Friday and
one today, I expect more to come.
Note that I added the failure for today's storm (~600 failures) with:
requeue --auto --all-of-type libgnomeui
(requeue is provided after sourcing fixit.sh and is really
requeue_package.py), but the package importer still has 165 outstanding jobs
as I write this (i.e. the time needed to recover from a lp downtime is far
longer than the downtime itself).
Note that since all these failures were for the same traceback, it's highly
likely that this is the first (or one of the first) query made against
launchpad during an import, so it *is* a good candidate to retry imports.
I.e. even if all in-flight imports fail for an unknown reason (or one we
haven't correctly identified), the next one started will likely fail with a
*known* transient one and stop the storm.
This seems to be a good first step to cope with the actual code base without
requiring a full analysis of all the failure modes :)
If this requires too much manual tweaking and leads to too many false
positives, we can focus on finding a better way to identify launchpad
transient failures, but the circuit breaker will still be there, should be
easy to change and, above all, should avoid the huge number of failures.
Hi Vincent,
Thanks for working on this, it looks great, and I'm looking forward
to seeing it in production!
77 + """Launcpad circuit breaker.
Missing "h" in Launchpad.
131 + # We want to retry asap ('priority')
132 + self.status_db.retry(package_name, priority=True)
Maybe try with the same priority as before?
I don't think it's that important though, given that there
will only be a few jobs in that state.
303 + self.state = 'open'
I prefer class "constants" for that sort of thing, so that
typos lead to obvious runtime errors, rather than errant
behaviour.
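The suggestion can be illustrated like this (class and attribute names are made up for the example, not taken from the branch):

```python
class CircuitBreaker(object):
    # Class "constants" for the states: a typo such as
    # ``self.state = self.OPNE`` raises AttributeError at runtime
    # instead of silently setting a misspelled string.
    OPEN = 'open'
    CLOSED = 'closed'

    def __init__(self):
        self.state = self.CLOSED

    def trip(self):
        self.state = self.OPEN
```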
Thanks,
James