Merge ~johnsca/juju-wait:bug/1680963-settle-retry into juju-wait:master

Proposed by Cory Johns on 2017-07-10
Status: Merged
Merged at revision: e07eec16ee4b8d4fca84505beb2152b98658245c
Proposed branch: ~johnsca/juju-wait:bug/1680963-settle-retry
Merge into: juju-wait:master
Diff against target: 91 lines (+44/-6)
1 file modified
juju_wait/__init__.py (+44/-6)
Reviewer Review Type Date Requested Status
Stuart Bishop 2017-07-10 Needs Fixing on 2017-07-11
Review via email: mp+327122@code.launchpad.net

Description of the Change

Juju will automatically retry failed hooks with a cool-off period to account for transient errors. Some charms unfortunately (and perhaps unknowingly) rely on this behavior. This option allows juju-wait to be a bit more forgiving before reporting failure.

To post a comment you must log in.
Cory Johns (johnsca) wrote :

I went with number of retries rather than a time-frame because it seemed easier than trying to account for hook run times (failure could be due to a download that times out on the order of minutes) or controller load.

It's possible to turn off automatic retries via the model-config, and if this option is used on a model where the automatic retries aren't happening, it could end up blocking indefinitely. But I'm not sure that case is worth investing effort in handling rather than just leaving it up to the user to not ask for contradictory things.

Stuart Bishop (stub) wrote :

This certainly shouldn't be the default (which it isn't), as the case described is exactly the sort of failure that juju-wait should be reporting as an error so it can be fixed. While it could be useful for deploy scripts, it would be a mistake to enable this on CI or test environments - if a charm recovers from an error state it does not mean it will always recover, and ignoring these bugs is ignoring tech debt that will likely need to be repaid at deployment time.

I'm flagging this as NEEDS FIX, as we need to emit a warning if retry behaviour is triggered, explaining that the charm failed but we are attempting to continue anyway. At no point should relying on retry behaviour be considered normal, and it should be clearly pointed out at every stage that a bug has been detected when a charm hits an error state. Retry behaviour was added to Juju purely so users could work with crap charms; it was not added to make crap charms acceptable or the status quo. If we want to change this policy and best practice, we should do it with our eyes open rather than let things slide until the decision has been made for us.

review: Needs Fixing
Stuart Bishop (stub) wrote :

(juju-deployer may need a similar update if it is in use, as I think it will still fail if a unit happens to be in an error state when it polls the status)

Cory Johns (johnsca) wrote :

I agree that this generally encourages not fixing bugs in charms, but we need this functionality for conjure-up as there are an unfortunate number of cases where some of the older charms used by some bundles will go into an error state very briefly which causes conjure-up to fail even though the deployment eventually settles out to fine. +100 to never allowing it in CI.

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1diff --git a/juju_wait/__init__.py b/juju_wait/__init__.py
2index e6d8945..876ef9c 100755
3--- a/juju_wait/__init__.py
4+++ b/juju_wait/__init__.py
5@@ -105,6 +105,14 @@ def juju_exe():
6 return juju
7
8
9+def juju(*args):
10+ # Older juju versions don't support --utc, so force UTC timestamps
11+ # using the environment variable.
12+ env = os.environ.copy()
13+ env['TZ'] = 'UTC'
14+ return run_or_die([juju_exe()] + args, env=env)
15+
16+
17 def juju_run(unit, cmd, timeout=None):
18 if timeout is None:
19 timeout = 6 * 60 * 60
20@@ -207,6 +215,10 @@ def wait_cmd(args=sys.argv[1:]):
21 parser.add_argument('-t', '--max_wait', dest='max_wait',
22 help='Maximum time to wait for readiness (seconds)',
23 action='store', default=None)
24+ parser.add_argument('-r', '--retry_errors', dest='retry_errors',
25+ action='store', default=0, type=int, metavar='N',
26+ help='Allow Juju to retry errors N times '
27+ 'before failing')
28 parser.add_argument('--version', default=False, action='store_true')
29 args = parser.parse_args(args)
30
31@@ -231,9 +243,17 @@ def wait_cmd(args=sys.argv[1:]):
32 else:
33 max_wait = None
34
35+ if args.retry_errors:
36+ if juju('version').startswith('1.'):
37+ logging.warning('Ignoring --retry_errors on Juju 1.x')
38+ args.retry_errors = 0
39+ elif juju('model-config', 'automatically-retry-hooks') != 'True':
40+ logging.warning('Ignoring --retry_errors when retries disabled')
41+ args.retry_errors = 0
42+
43 # Begin watching and waiting
44 try:
45- wait(log, args.wait_for_workload, max_wait)
46+ wait(log, args.wait_for_workload, max_wait, args.retry_errors)
47 return 0
48 except JujuWaitException as x:
49 return x.args[0]
50@@ -248,7 +268,21 @@ def reset_logging():
51 'logging-config=juju=WARNING;unit=INFO'])
52
53
54-def wait(log=None, wait_for_workload=False, max_wait=None):
55+def get_error_repeat_count(uname):
56+ """Determine the number of times the most recent workload status was
57+ repeated.
58+ """
59+ status_log = juju('show-status-log', uname, '--days=1')
60+ status_log = [l.split('\t')
61+ for l in status_log.splitlines()
62+ if 'juju-unit' in l and 'executing' not in l]
63+ for i, status in enumerate(reversed(status_log)):
64+ if status[2].strip() not in WORKLOAD_ERROR_STATES:
65+ return i
66+ return len(status_log)
67+
68+
69+def wait(log=None, wait_for_workload=False, max_wait=None, retry_errors=0):
70 # Note that wait_for_workload is only useful as a workaround for
71 # broken setups. It is impossible for a charm to report that it has
72 # completed setup, because a charm has no information about units
73@@ -359,10 +393,14 @@ def wait(log=None, wait_for_workload=False, max_wait=None):
74
75 # Fail and raise if workload state is ever in error
76 if current in WORKLOAD_ERROR_STATES and wait_for_workload:
77- logging.error('{} failed: workload status is '
78- '{}'.format(uname, current))
79- ready = False
80- raise JujuWaitException(1)
81+ if get_error_repeat_count(uname) < retry_errors:
82+ logging.warning('{} failure ignored, '
83+ 'will be retried'.format(uname))
84+ else:
85+ logging.error('{} failed: workload status is '
86+ '{}'.format(uname, current))
87+ ready = False
88+ raise JujuWaitException(1)
89
90 for uname, astatus in sorted(agent_status.items()):
91 if not ('current' in astatus and 'since' in astatus):

Subscribers

People subscribed via source and target branches

to all changes: