Code review comment for lp:~vila/udd/795321-make-tea

Revision history for this message
Vincent Ladeuil (vila) wrote :

>>>>> James Westby <email address hidden> writes:

    > Review: Approve
    > Hi Vincent,

    > Thanks for working on this,

Thanks for the speedy review !

    > it looks great, and I'm looking forward to seeing it in
    > production!

Me too :)

    > 77 + """Launcpad circuit breaker.

    > Missing "h" in Launchpad.

    > 131 + # We want to retry asap ('priority')
    > 132 + self.status_db.retry(package_name, priority=True)

    > Maybe try with the same priority as before?

    > I don't think it's that important though, given that there
    > will only be a few jobs in that state though.

It's a delicate matter and I may change my mind, but the underlying idea
is that if an import is qualified as failing due to lp, it makes it the
ideal candidate to check that lp is back online.

As such, I want this import to be tried *before* any other import as the
others have yet to prove that they qualify as better candidates.

Also, one should keep in mind that *several* imports are always in fly
and all of them can open/close the breaker so the
LaunchpadCircuitBreaker try to cope with false positives by essentially
retrying the imports that enhance the S/N ratio: imports may succeed
when lp is down and fail when it's up so whatever is in fly, we should
retry imports that helps deciding whether lp is up or not.

success when down:
==================

It's a bit unlikely but conceivable that an import manage to *not*
query lp during the down time. Fair enough (and one reason why I didn't
even try to stop the imports still running as soon as lp is seen down).

Such an unexpected success will wrongly close the breaker, but the
breaker will open at the next failure, it's a glitch but we cope with it
because we will open it again at the next failure (and we know we'll
have failures if lp is down).

failure when up:
================

As far as the circuit breaker is concerned, what matters is whether the
failure is related to lp, if it is not, the LaunchpadCircuitBreaker will
ignore it.

This is really where the failure qualification matters, if a failure is
related to lp but not marked as such, the circuit breaker is just
ineffective and we'll enter into a failure "storm".

summary:
========

Finding the Right failures is the real key. Potentially, any lp query
(in the largest interpretation) is Right. marking them all as such is
HARD.

Fortunately, there is an easy way to find Right failures: every time we
see a failure storm, we know it's caused by the *first* lp query done by
an import (which is also the reason we get storms: being the first query
it happens early so the import fails quickly and *another* import is
queued, also failing quickly, etc). Therefore, it's sufficient to mark
this query as transient to make it a Right failure.

We will need discover other Right failures (all over the place in the
import code path) but they are far less relevant because they won't
trigger during storms (i.e. when we retry an import that has failed this
way, the retry will fail on the *first* lp query, not because of this
other Right failure).

    > 303 + self.state = 'open'

    > I prefer class "constants" for that sort of thing, so that
    > typos lead to obvious runtime errors, rather than errant
    > behaviour.

Doh. Thanks, I was concentrating on other parts while writing that, I'll
use constants instead in both circuit breakers.

(I finally used package constants to simplify tests).

« Back to merge proposal