Merge lp:~andrewjbeach/juju-ci-tools/fix-wait-for-started into lp:juju-ci-tools

Proposed by Andrew James Beach
Status: Superseded
Proposed branch: lp:~andrewjbeach/juju-ci-tools/fix-wait-for-started
Merge into: lp:juju-ci-tools
Diff against target: 420 lines (+346/-2)
2 files modified
jujupy.py (+194/-1)
tests/test_jujupy.py (+152/-1)
To merge this branch: bzr merge lp:~andrewjbeach/juju-ci-tools/fix-wait-for-started
Reviewer Review Type Date Requested Status
Juju Release Engineering Pending
Review via email: mp+310553@code.launchpad.net

This proposal has been superseded by a proposal from 2016-11-14.

Description of the change

Adds Status.check_for_errors. Which checks the entire status object for
errors and translates it into an Exception.

Its intended use is in _wait_for_status and other wait_for_* functions so that they can produce meaningful error messages when something goes wrong.

To post a comment you must log in.
1644. By Andrew James Beach

Cleaned up iter_status by using a helper instead of a bunch or repeating internal variables.

Revision history for this message
Curtis Hovey (sinzui) wrote :

Good questions.

1. Lets treat StatusError as the generic error seen in status, we prefer specific errors about what has errored. I want to test for errors in status all the time to exit the test early, but we need extra intelligence to know when to deffer raising :(

2. I know from experience that Juju 2 is great at retries. It sees the errors in status as we do, and attempts to resolve them. So getting status with errors should not cause us to raise an error. We need to wait for a period we believe Juju should have addresses the error.

But I know that machine errors are not recoverable at this time. Once we see on in status, we can give up.

3. The priority is:
MachineError: not recoverable, raise when first seen.
   We might want subtypes in the future. Image not found for example can be a human or canonical error.
   Machine failed to start is a substrate error
UnitError: juju will retry. There isn't a right answer for retries to recover, but 10 minutes often works
   HookError is the most common UnitError and juju reties. (I really love this feature)
   InstallError is almost always fatal
AppError: This is often a summary of UnitError. It might also show config errors which are not unit specific
StatusError: something else

I wonder if the call needs to check status needs a an arg to distinguish between fatal and recoverable errors

1645. By Andrew James Beach

Added a new sub-hierarchy of exceptions for status.

Revision history for this message
Andrew James Beach (andrewjbeach) wrote :

The Original Questions replied to: ==========================================
How should this mix with the existing error checks in Status? And the
existing Exception type ErroredUnit? Should some overhead StatusError class
be used?

Should the exception be raised within the check?

Are there any types of errors we want to give priority over others? If so how
do we keep them from getting shadowed by less important exceptions that just
happen to be found earlier?
=============================================================================

I added the exceptions mentioned as a first step. In addition to just being
new more particular exceptions StatusError and all of its children store a
recoverable value which marks if we think juju will recover from them.

We could store a time value if we want control over how often to re-try,
but I don't think we do.

All the classes are orderable within each other. After a sweep through the
Status any generated errors can be sorted to find the most important ones.
I tried encoding the sorting rules in code, but it was rather complex and
still required some manual ordering.

1646. By Andrew James Beach

Worked on ordering, recoverable and InstallError.

Revision history for this message
Andrew James Beach (andrewjbeach) wrote :

The main change made in the last commit is that recoverable is now a value of the class, not the instance. If two errors are different enough that one is recoverable and another is not, that is different enough for them to be different exception types as well.

1647. By Andrew James Beach

Fixed a few odds and ends for lint, I also changed StatusError sorting to a faster key based method.

1648. By Andrew James Beach

New status->exception translation. A few more types of Errors and some 'surface' functions with ignore_recoverable. Tests have been updated, but should probably be redone.

1649. By Andrew James Beach

Didn't mean to cut that test out.

Revision history for this message
Andrew James Beach (andrewjbeach) wrote :

So most of the logic of translating status to errors is in a helper used by iter_errors. However it and iter_status are not the front-line functions, check_for_errors and raise_highest_error. They may not be actually be used (as they are) but they show the intended use.

The ignore_recoverable flag is used in ongoing situations, where recoverable errors can be ignored as they might still be recovered from.

Also added a few new types of Exceptions to represent errors in agents.

Revision history for this message
Curtis Hovey (sinzui) wrote :

This is a fine start. I have some suggestions inline.

1650. By Andrew James Beach

Clean-up in responce to latest round of feedback. Such as clearing out the debugging code and adding the StatusItem class.

1651. By Andrew James Beach

Small fix in check_for_errors.

Revision history for this message
Andrew James Beach (andrewjbeach) wrote :

That is all the suggestions implemented. I'm going to go through and give it another round of polish. The main thing is a few helpers in StatusItem now that exists.

One other point is renaming AgentLostTimeout. The current check does not search for a type of error as the name of the Exception would suggest. So either another check should be added or I think it should be AgentLongError or AgentNotRecovered, which don't suggest what the problem is, just that it has been around for longer.

Revision history for this message
Aaron Bentley (abentley) wrote :

Suggestion inline.

1652. By Andrew James Beach

to_exception now returns None if the StatusItem is not an error. Changed StatusItem's __init__.

Revision history for this message
Andrew James Beach (andrewjbeach) wrote :

Was debating using a predicate instead of Aaron's idea. But returning None is harder to misuse as compared with checking a predicate, so I went with that.

I also changed the order of the arguments in StatusItem. The way the exception tree is set up the status_name has the greatest effect on the type of the exception (when I set it up it was a tie-breaker). So it is more significant and I made it come first.

1653. By Andrew James Beach

It is datetime.strptime, not timedelta.strptime.

1654. By Andrew James Beach

Clean up, tests and trying to get compatability with Juju 1.X.

Revision history for this message
Andrew James Beach (andrewjbeach) wrote :

OK, we have officially crossed the 400 lines mark, so this will get broken up.

But before that I have a few remaining 'big issues' to sort out. (Polish can happen in the small pieces.) The main one is ensuring compatibility with Juju 1.X versions.

To do that I have added an override of iter_status to Status1X. It goes through a slightly different set of fields than the one in Status. It also, rather crudely, maps all the status information to the format expected by StatusItem.

Perhaps the creating a StatusItem1X might be a better solution. We could also give Status and Status1X Item class members to pick which one to use. However that probably would not be worth it unless there are other parts of StatusItem (besides __init__) we want to change.

Also I realized we are not covering machine containers or unit subordinates which may also have status information we should check.

1655. By Andrew James Beach

Cleaned up Status1X.iter_status.

Unmerged revisions

1655. By Andrew James Beach

Cleaned up Status1X.iter_status.

1654. By Andrew James Beach

Clean up, tests and trying to get compatability with Juju 1.X.

1653. By Andrew James Beach

It is datetime.strptime, not timedelta.strptime.

1652. By Andrew James Beach

to_exception now returns None if the StatusItem is not an error. Changed StatusItem's __init__.

1651. By Andrew James Beach

Small fix in check_for_errors.

1650. By Andrew James Beach

Clean-up in responce to latest round of feedback. Such as clearing out the debugging code and adding the StatusItem class.

1649. By Andrew James Beach

Didn't mean to cut that test out.

1648. By Andrew James Beach

New status->exception translation. A few more types of Errors and some 'surface' functions with ignore_recoverable. Tests have been updated, but should probably be redone.

1647. By Andrew James Beach

Fixed a few odds and ends for lint, I also changed StatusError sorting to a faster key based method.

1646. By Andrew James Beach

Worked on ordering, recoverable and InstallError.

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'jujupy.py'
2--- jujupy.py 2016-11-14 20:10:26 +0000
3+++ jujupy.py 2016-11-14 20:42:47 +0000
4@@ -8,7 +8,10 @@
5 contextmanager,
6 )
7 from copy import deepcopy
8-from datetime import datetime
9+from datetime import (
10+ datetime,
11+ timedelta,
12+ )
13 import errno
14 from itertools import chain
15 import json
16@@ -533,6 +536,133 @@
17 return self.get_cloud_credentials_item()[1]
18
19
20+class StatusError(Exception):
21+ """Generic error for Status."""
22+
23+ recoverable = True
24+
25+ # This has to be filled in after the classes are declared.
26+ ordering = []
27+
28+ @classmethod
29+ def priority(cls):
30+ """Get the priority of the StatusError as an number.
31+
32+ Lower number means higher priority. This can be used as a key
33+ function in sorting."""
34+ return cls.ordering.index(cls)
35+
36+
37+class MachineError(StatusError):
38+ """Error in machine-status."""
39+
40+ recoverable = False
41+
42+
43+class UnitError(StatusError):
44+ """Error in a unit's status."""
45+
46+
47+class HookFailedError(UnitError):
48+ """A unit hook has failed."""
49+
50+ def __init__(self, item_name, msg):
51+ match = re.search('^hook failed: "([^"]+)"$', msg)
52+ if match:
53+ msg = match.group(1)
54+ super(HookFailedError, self).__init__(item_name, msg)
55+
56+
57+class InstallError(HookFailedError):
58+ """The unit's install hook has failed."""
59+
60+ recoverable = False
61+
62+
63+class AppError(StatusError):
64+ """Error in an application's status."""
65+
66+
67+class AgentError(StatusError):
68+ """Error in a juju agent."""
69+
70+
71+class AgentLongError(AgentError):
72+ """Agent error has not recovered in a reasonable time."""
73+
74+ recoverable = False
75+
76+
77+StatusError.ordering = [MachineError, InstallError, AgentLongError,
78+ HookFailedError, UnitError, AppError, AgentError,
79+ StatusError]
80+
81+
82+class StatusItem:
83+
84+ APPLICATION = 'application-status'
85+ WORKLOAD = 'workload-status'
86+ MACHINE = 'machine-status'
87+ JUJU = 'juju-status'
88+
89+ def __init__(self, status_name, item_name, item_value):
90+ self.status_name = status_name
91+ self.item_name = item_name
92+ # (self.item_name, item_value) = item
93+ self.status = item_value[status_name]
94+
95+ @property
96+ def message(self):
97+ return self.status.get('message')
98+
99+ @property
100+ def since(self):
101+ return self.status.get('since')
102+
103+ @property
104+ def current(self):
105+ return self.status.get('current')
106+
107+ @property
108+ def version(self):
109+ return self.status.get('version')
110+
111+ def datetime_since(self):
112+ return datetime.strptime(self.since, '%d %b %Y %H:%M:%SZ')
113+
114+ def to_exception(self):
115+ """Create an exception representing the error if one exists.
116+
117+ :return: StatusError (or subtype) to represent an error or None
118+ to show that there is no error."""
119+ if self.current not in ['error', 'failed']:
120+ return None
121+
122+ if self.APPLICATION == self.status_name:
123+ return AppError(self.item_name, self.message)
124+ elif self.WORKLOAD == self.status_name:
125+ if self.message is None:
126+ return UnitError(self.item_name, self.message)
127+ elif re.match('hook failed ".*install.*"', self.message):
128+ return InstallError(self.item_name, self.message)
129+ elif re.match('hook failed', self.message):
130+ return HookFailedError(self.item_name, self.message)
131+ else:
132+ return UnitError(self.item_name, self.message)
133+ elif self.MACHINE == self.status_name:
134+ return MachineError(self.item_name, self.message)
135+ elif self.JUJU == self.status_name:
136+ time_since = datetime.utcnow() - self.datetime_since()
137+ if time_since > timedelta(minutes=5):
138+ return AgentLongError(self.item_name, self.message,
139+ time_since.total_seconds())
140+ else:
141+ return AgentError(self.item_name, self.message)
142+ else:
143+ raise ValueError('Unknown status:{}'.format(self.status_name),
144+ (self.item_name, self.status_value))
145+
146+
147 class Status:
148
149 def __init__(self, status, status_text):
150@@ -552,6 +682,10 @@
151 def get_applications(self):
152 return self.status.get('applications', {})
153
154+ @property
155+ def machines(self):
156+ return self.status['machines']
157+
158 def iter_machines(self, containers=False, machines=True):
159 for machine_name, machine in sorted(self.status['machines'].items()):
160 if machines:
161@@ -685,12 +819,71 @@
162 """
163 return self.get_unit(unit_name).get('open-ports', [])
164
165+ def iter_status(self):
166+ """Iterate through every status field in the larger status data."""
167+ for machine_name, machine_value in self.machines.items():
168+ yield StatusItem(StatusItem.MACHINE, machine_name, machine_value)
169+ yield StatusItem(StatusItem.JUJU, machine_name, machine_value)
170+ for app_name, app_value in self.get_applications().items():
171+ yield StatusItem(StatusItem.APPLICATION, app_name, app_value)
172+ for unit_name, unit_value in app_value['units'].items():
173+ yield StatusItem(StatusItem.WORKLOAD, unit_name, unit_value)
174+ yield StatusItem(StatusItem.JUJU, unit_name, unit_value)
175+
176+ def iter_errors(self, ignore_recoverable=False):
177+ """Iterate through every error, repersented by exceptions."""
178+ for sub_status in self.iter_status():
179+ error = sub_status.to_exception()
180+ if error is not None:
181+ if not (ignore_recoverable and error.recoverable):
182+ yield error
183+
184+ def check_for_errors(self, ignore_recoverable=False):
185+ """Return a list of errors, in order of their priority."""
186+ return sorted(self.iter_errors(ignore_recoverable),
187+ key=lambda item: item.priority())
188+
189+ def raise_highest_error(self, ignore_recoverable=False):
190+ """Raise an exception reperenting the highest priority error."""
191+ errors = self.check_for_errors(ignore_recoverable)
192+ if errors:
193+ raise errors[0]
194+
195
196 class Status1X(Status):
197
198 def get_applications(self):
199 return self.status.get('services', {})
200
201+ def status_condence(self, item_value):
202+ """Condence the scattered agent-* fields into a status dict."""
203+ return {'current': item_value['agent-state'],
204+ 'version': item_value['agent-version'],
205+ 'message': item_value.get('agent-state-info'),
206+ }
207+
208+ def iter_status(self):
209+ for machine_name, machine_value in self.machines.items():
210+ yield StatusItem(
211+ StatusItem.JUJU, machine_name,
212+ {StatusItem.JUJU: self.status_condence(machine_value)})
213+ for app_name, app_value in self.get_applications().items():
214+ yield StatusItem(
215+ StatusItem.APPLICATION, app_name,
216+ {StatusItem.APPLICATION: app_value['service-status']})
217+ for unit_name, unit_value in app_value['units'].items()
218+ if StatusItem.WORKLOAD is in unit_value:
219+ yield StatusItem(StatusItem.WORKLOAD,
220+ unit_name, unit_value)
221+ if 'agent-status' is in unit_value:
222+ yield StatusItem(
223+ StatusItem.JUJU, unit_name,
224+ {StatusItem.JUJU: unit_value['agent-status']})
225+ else:
226+ yield StatusItem(
227+ StatusItem.JUJU, unit_name,
228+ {StatusItem.JUJU: self.status_condence(unit_value)})
229+
230
231 def describe_substrate(env):
232 if env.provider == 'local':
233
234=== modified file 'tests/test_jujupy.py'
235--- tests/test_jujupy.py 2016-11-14 20:10:26 +0000
236+++ tests/test_jujupy.py 2016-11-14 20:42:47 +0000
237@@ -58,17 +58,21 @@
238 JUJU_DEV_FEATURE_FLAGS,
239 KILL_CONTROLLER,
240 Machine,
241+ MachineError,
242 make_safe_config,
243 NoProvider,
244 parse_new_state_server_from_error,
245 SimpleEnvironment,
246- Status1X,
247 SoftDeadlineExceeded,
248 Status,
249+ Status1X,
250+ StatusError,
251+ StatusItem,
252 SYSTEM,
253 temp_bootstrap_env,
254 _temp_env as temp_env,
255 temp_yaml_file,
256+ UnitError,
257 uniquify_local,
258 UpgradeMongoNotSupported,
259 VersionNotTestedError,
260@@ -6017,6 +6021,138 @@
261 self.assertIsInstance(gen, types.GeneratorType)
262 self.assertEqual(expected, list(gen))
263
264+ def run_iter_status(self):
265+ status = Status({
266+ 'machines': {
267+ '0': {
268+ 'juju-status': {
269+ 'current': 'idle',
270+ 'since': 'DD MM YYYY hh:mm:ss',
271+ 'version': '2.0.0',
272+ },
273+ 'machine-status': {
274+ 'current': 'running',
275+ 'message': 'Running',
276+ 'since': 'DD MM YYYY hh:mm:ss',
277+ },
278+ },
279+ '1': {
280+ 'juju-status': {
281+ 'current': 'idle',
282+ 'since': 'DD MM YYYY hh:mm:ss',
283+ 'version': '2.0.0',
284+ },
285+ 'machine-status': {
286+ 'current': 'running',
287+ 'message': 'Running',
288+ 'since': 'DD MM YYYY hh:mm:ss',
289+ },
290+ },
291+ },
292+ 'applications': {
293+ 'fakejob': {
294+ 'application-status': {
295+ 'current': 'idle',
296+ 'since': 'DD MM YYYY hh:mm:ss',
297+ },
298+ 'units': {
299+ 'fakejob/0': {
300+ 'workload-status': {
301+ 'current': 'maintenance',
302+ 'message': 'Started',
303+ 'since': 'DD MM YYYY hh:mm:ss',
304+ },
305+ 'juju-status': {
306+ 'current': 'idle',
307+ 'since': 'DD MM YYYY hh:mm:ss',
308+ 'version': '2.0.0',
309+ },
310+ },
311+ 'fakejob/1': {
312+ 'workload-status': {
313+ 'current': 'maintenance',
314+ 'message': 'Started',
315+ 'since': 'DD MM YYYY hh:mm:ss',
316+ },
317+ 'juju-status': {
318+ 'current': 'idle',
319+ 'since': 'DD MM YYYY hh:mm:ss',
320+ 'version': '2.0.0',
321+ },
322+ },
323+ },
324+ }
325+ },
326+ }, '')
327+ for sub_status in status.iter_status():
328+ yield sub_status
329+
330+ def test_iter_status_range(self):
331+ status_set = set([(status_item.item_name, status_item.status_name)
332+ for status_item in self.run_iter_status()])
333+ self.assertEqual({
334+ ('0', 'juju-status'), ('0', 'machine-status'),
335+ ('1', 'juju-status'), ('1', 'machine-status'),
336+ ('fakejob', 'application-status'),
337+ ('fakejob/0', 'workload-status'), ('fakejob/0', 'juju-status'),
338+ ('fakejob/1', 'workload-status'), ('fakejob/1', 'juju-status'),
339+ }, status_set)
340+
341+ def test_iter_status_data(self):
342+ min_set = set(['current', 'since'])
343+ max_set = set(['current', 'message', 'since', 'version'])
344+ for status_item in self.run_iter_status():
345+ if 'fakejob' == status_item.item_name:
346+ self.assertEqual(StatusItem.APPLICATION,
347+ status_item.status_name)
348+ self.assertEqual({'current': 'idle',
349+ 'since': 'DD MM YYYY hh:mm:ss',
350+ }, status_item.status)
351+ else:
352+ cur_set = set(status_item.status.keys())
353+ self.assertTrue(min_set < cur_set)
354+ self.assertTrue(cur_set < max_set)
355+
356+ def test_iter_errors(self):
357+ status = Status({}, '')
358+ retval = [StatusItem(StatusItem.WORKLOAD, 'job/0',
359+ {StatusItem.WORKLOAD: {'current': 'error'}}),
360+ StatusItem(StatusItem.APPLICATION, 'job',
361+ {StatusItem.APPLICATION: {'current': 'running'}})
362+ ]
363+ with patch.object(status, 'iter_status', autospec=True,
364+ return_value=retval):
365+ errors = list(status.iter_errors())
366+ self.assertEqual(len(errors), 1)
367+ self.assertIsInstance(errors[0], UnitError)
368+ self.assertEqual(('job/0', None), errors[0].args)
369+
370+ def test_check_for_errors_good(self):
371+ status = Status({}, '')
372+ with patch.object(status, 'iter_errors', autospec=True,
373+ return_value=[]) as error_mock:
374+ self.assertEqual([], status.check_for_errors())
375+ error_mock.assert_called_once_with(False)
376+
377+ @contextmanager
378+ def patch_iter_errors_one(self, status,
379+ item_name, status_name, **kwargs):
380+ retval = [(item_name, status_name, dict(current='error', **kwargs))]
381+ with patch.object(status, 'iter_errors', autospec=True,
382+ return_value=retval) as error_mock:
383+ yield error_mock
384+
385+ def test_check_for_errors(self):
386+ status = Status({}, '')
387+ errors = [MachineError('0'), StatusError('2'), UnitError('1')]
388+ with patch.object(status, 'iter_errors', autospec=True,
389+ return_value=errors) as errors_mock:
390+ sorted_errors = status.check_for_errors()
391+ errors_mock.assert_called_once_with(False)
392+ self.assertEqual(sorted_errors[0].args, ('0',))
393+ self.assertEqual(sorted_errors[1].args, ('1',))
394+ self.assertEqual(sorted_errors[2].args, ('2',))
395+
396 def test_get_applications_gets_applications(self):
397 status = Status({
398 'services': {'service': {}},
399@@ -6035,6 +6171,21 @@
400 self.assertEqual({'service': {}}, status.get_applications())
401
402
403+class TestStatusItem(TestCase):
404+
405+ @staticmethod
406+ def make_status_item(status_name, item_name, **kwargs):
407+ return StatusItem(status_name, item_name, {status_name: kwargs})
408+
409+ def test_datetime_since(self):
410+ item = self.make_status_item(StatusItem.JUJU, '0',
411+ since='19 Aug 2016 05:36:42Z')
412+ target = datetime(2016, 8, 19, 5, 36, 42)
413+ self.assertEqual(item.datetime_since(), target)
414+
415+ # to_exception is going to need a lot of tests.
416+
417+
418 def fast_timeout(count):
419 if False:
420 yield

Subscribers

People subscribed via source and target branches