Merge ~johnsca/juju-wait:bug/1680963-settle-retry into juju-wait:master
| Status: | Merged | ||||
|---|---|---|---|---|---|
| Merged at revision: | e07eec16ee4b8d4fca84505beb2152b98658245c | ||||
| Proposed branch: | ~johnsca/juju-wait:bug/1680963-settle-retry | ||||
| Merge into: | juju-wait:master | ||||
| Diff against target: |
91 lines (+44/-6) 1 file modified
juju_wait/__init__.py (+44/-6) |
||||
| Related bugs: |
|
| Reviewer | Review Type | Date Requested | Status |
|---|---|---|---|
| Stuart Bishop | 2017-07-10 | Needs Fixing on 2017-07-11 | |
|
Review via email:
|
|||
Description of the Change
Juju will automatically retry failed hooks with a cool-off period to account for transient errors. Some charms unfortunately (and perhaps unknowingly) rely on this behavior. This option allows juju-wait to be a bit more forgiving before reporting failure.
| Cory Johns (johnsca) wrote : | # |
| Stuart Bishop (stub) wrote : | # |
This certainly shouldn't be the default (which it isn't), as the case described is exactly the sort of failure that juju-wait should be reporting as an error so it can be fixed. While it could be useful for deploy scripts, it would be a mistake to enable this on CI or test environments - if a charm recovers from an error state it does not mean it will always recover, and ignoring these bugs is ignoring tech debt that will likely need to be repaid at deployment time.
I'm flagging this as NEEDS FIX, as we need to emit a warning if retry behaviour is triggered, explaining that the charm failed but we are attempting to continue anyway. At no point should relying on retry behaviour be considered normal, and it should be clearly pointed out at every stage that a bug has been detected when a charm hits an error state. Retry behaviour was added to Juju purely so users could work with crap charms; it was not added to make crap charms acceptable or the status quo. If we want to change this policy and best practice, we should do it with our eyes open rather than let things slide until the decision has been made for us.
| Stuart Bishop (stub) wrote : | # |
(juju-deployer may need a similar update if it is in use, as I think it will still fail if a unit happens to be in an error state when it polls the status)
| Cory Johns (johnsca) wrote : | # |
I agree that this generally encourages not fixing bugs in charms, but we need this functionality for conjure-up as there are an unfortunate number of cases where some of the older charms used by some bundles will go into an error state very briefly which causes conjure-up to fail even though the deployment eventually settles out to fine. +100 to never allowing it in CI.

I went with number of retries rather than a time-frame because it seemed easier than trying to account for hook run times (failure could be due to a download that times out on the order of minutes) or controller load.
It's possible to turn off automatic retries via the model-config, and if this option is used on a model where the automatic retries aren't happening, it could end up blocking indefinitely. But I'm not sure that case is worth investing effort in handling rather than just leaving it up to the user to not ask for contradictory things.