juju-ci-tools

Merge lp:~andrewjbeach/juju-ci-tools/fix-wait-for-started into lp:juju-ci-tools

fix-wait-for-started
Merge into trunk

Proposed by Andrew James Beach on 2016-11-10

Status:	Superseded
Proposed branch:	lp:~andrewjbeach/juju-ci-tools/fix-wait-for-started
Merge into:	lp:juju-ci-tools
Diff against target:	420 lines (+346/-2) 2 files modified jujupy.py (+194/-1) tests/test_jujupy.py (+152/-1)
To merge this branch:	bzr merge lp:~andrewjbeach/juju-ci-tools/fix-wait-for-started

Reviewer	Review Type	Date Requested	Status
Juju Release Engineering		2016-11-10	Pending
Review via email: mp+310553@code.launchpad.net

This proposal has been superseded by a proposal from 2016-11-14.

Description of the change

Adds Status.check_for_errors, which checks the entire status object for
errors and translates each one into an exception.

Its intended use is in _wait_for_status and other wait_for_* functions so that they can produce meaningful error messages when something goes wrong.
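
As a rough illustration of that intended use, a wait loop might run the new check on every poll. This is a minimal sketch, not code from the branch: wait_until, get_status and is_done are hypothetical names, and the status object is only assumed to offer raise_highest_error(ignore_recoverable=...) in the form this proposal adds.

```python
import time


def wait_until(get_status, is_done, timeout=60, interval=1):
    """Poll until is_done(status) is true, failing early on fatal errors.

    get_status and is_done are hypothetical callables; each status object
    is assumed to provide raise_highest_error(ignore_recoverable=...).
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = get_status()
        # While waiting, only unrecoverable errors end the wait early.
        status.raise_highest_error(ignore_recoverable=True)
        if is_done(status):
            return status
        time.sleep(interval)
    # Out of time: report any remaining error instead of a bare timeout.
    get_status().raise_highest_error(ignore_recoverable=False)
    raise RuntimeError('Timed out waiting for status.')
```

The final check is what would turn an uninformative timeout into a specific exception when a recoverable error never cleared.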

lp:~andrewjbeach/juju-ci-tools/fix-wait-for-started updated on 2016-11-10

1644. By Andrew James Beach on 2016-11-10: Cleaned up iter_status by using a helper instead of a bunch of repeating internal variables.

Revision history for this message

Curtis Hovey (sinzui) wrote on 2016-11-10:

Good questions.

1. Let's treat StatusError as the generic error seen in status; we prefer specific errors about what has errored. I want to test for errors in status all the time to exit the test early, but we need extra intelligence to know when to defer raising :(

2. I know from experience that Juju 2 is great at retries. It sees the errors in status as we do, and attempts to resolve them. So getting status with errors should not cause us to raise an error. We need to wait for a period in which we believe Juju should have addressed the error.

But I know that machine errors are not recoverable at this time. Once we see one in status, we can give up.

3. The priority is:
MachineError: not recoverable, raise when first seen.
   We might want subtypes in the future. Image not found for example can be a human or canonical error.
   Machine failed to start is a substrate error
UnitError: juju will retry. There isn't a right answer for retries to recover, but 10 minutes often works
   HookError is the most common UnitError and juju retries. (I really love this feature)
   InstallError is almost always fatal
AppError: This is often a summary of UnitError. It might also show config errors which are not unit specific
StatusError: something else

I wonder if the call to check status needs an arg to distinguish between fatal and recoverable errors.
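
One way to read the priority list above is as a grace period per error type. The sketch below is only an illustration of that reading: GRACE_PERIODS and should_raise are hypothetical names, and only the "raise when first seen" rule for machine errors, the "almost always fatal" note for install errors and the 10 minute figure for unit errors come from the comment. The values for AppError and StatusError are assumptions.

```python
from datetime import datetime, timedelta

# None means "raise when first seen"; a timedelta is how long Juju is
# given to retry before the error is treated as a failure.
GRACE_PERIODS = {
    'MachineError': None,
    'InstallError': None,
    'HookError': timedelta(minutes=10),
    'UnitError': timedelta(minutes=10),
    'AppError': timedelta(minutes=10),     # assumption
    'StatusError': timedelta(minutes=10),  # assumption
}


def should_raise(kind, first_seen, now):
    """Decide whether an error first seen at first_seen is fatal by now."""
    grace = GRACE_PERIODS[kind]
    if grace is None:
        return True
    return now - first_seen > grace


seen = datetime(2016, 11, 10, 12, 0, 0)
machine_fatal = should_raise('MachineError', seen, seen)
unit_fatal_early = should_raise('UnitError', seen, seen + timedelta(minutes=5))
unit_fatal_late = should_raise('UnitError', seen, seen + timedelta(minutes=11))
```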

lp:~andrewjbeach/juju-ci-tools/fix-wait-for-started updated on 2016-11-10

1645. By Andrew James Beach on 2016-11-10: Added a new sub-hierarchy of exceptions for status.

Revision history for this message

Andrew James Beach (andrewjbeach) wrote on 2016-11-10:

The Original Questions replied to: ==========================================
How should this mix with the existing error checks in Status? And the
existing Exception type ErroredUnit? Should some overhead StatusError class
be used?

Should the exception be raised within the check?

Are there any types of errors we want to give priority over others? If so how
do we keep them from getting shadowed by less important exceptions that just
happen to be found earlier?
=============================================================================

I added the exceptions mentioned as a first step. Besides being new, more
specific exceptions, StatusError and all of its children store a
recoverable value, which marks whether we think juju will recover from them.

We could store a time value if we want control over how often to re-try,
but I don't think we do.

All the classes are orderable within each other. After a sweep through the
Status any generated errors can be sorted to find the most important ones.
I tried encoding the sorting rules in code, but it was rather complex and
still required some manual ordering.
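
The manual ordering described above can be sketched as an explicit list that doubles as a sort key. This is a cut-down illustration with only three of the classes; the full list and class bodies are in the diff for this proposal.

```python
class StatusError(Exception):
    """Generic error for Status."""

    # Filled in after the subclasses exist.
    ordering = []

    @classmethod
    def priority(cls):
        """Lower number means higher priority."""
        return cls.ordering.index(cls)


class MachineError(StatusError):
    """Error in machine-status."""


class UnitError(StatusError):
    """Error in a unit's status."""


# The hand-written order: most important first.
StatusError.ordering = [MachineError, UnitError, StatusError]

errors = [StatusError('2'), UnitError('1'), MachineError('0')]
ranked = sorted(errors, key=lambda error: error.priority())
ranked_types = [type(error).__name__ for error in ranked]
```

Calling priority() on an instance works because a classmethod receives the instance's class, so each error is ranked by its own type.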

lp:~andrewjbeach/juju-ci-tools/fix-wait-for-started updated on 2016-11-10

1646. By Andrew James Beach on 2016-11-10: Worked on ordering, recoverable and InstallError.

Revision history for this message

Andrew James Beach (andrewjbeach) wrote on 2016-11-10:

The main change made in the last commit is that recoverable is now a value of the class, not the instance. If two errors are different enough that one is recoverable and another is not, that is different enough for them to be different exception types as well.
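
A small sketch of what a class-level recoverable value allows: the flag can be read from the type without constructing an exception, and subclasses inherit it unless they override it. The three classes are simplified stand-ins for the ones in the diff.

```python
class StatusError(Exception):
    # Default: assume juju will retry and may recover.
    recoverable = True


class MachineError(StatusError):
    recoverable = False


class UnitError(StatusError):
    pass  # inherits recoverable = True


# The flag is available on the class itself, with no instance needed.
fatal_types = [cls for cls in (StatusError, MachineError, UnitError)
               if not cls.recoverable]
```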

lp:~andrewjbeach/juju-ci-tools/fix-wait-for-started updated on 2016-11-11

1647. By Andrew James Beach on 2016-11-11: Fixed a few odds and ends for lint, I also changed StatusError sorting to a faster key based method.
1648. By Andrew James Beach on 2016-11-11: New status->exception translation. A few more types of Errors and some 'surface' functions with ignore_recoverable. Tests have been updated, but should probably be redone.
1649. By Andrew James Beach on 2016-11-11: Didn't mean to cut that test out.

Revision history for this message

Andrew James Beach (andrewjbeach) wrote on 2016-11-11:

Most of the logic for translating status to errors is in a helper used by iter_errors. However, neither that helper nor iter_status is a front-line function; those are check_for_errors and raise_highest_error. They may not actually be used as they are, but they show the intended use.

The ignore_recoverable flag is used in ongoing situations, where recoverable errors can be ignored as they might still be recovered from.

Also added a few new types of Exceptions to represent errors in agents.

Revision history for this message

Curtis Hovey (sinzui) wrote on 2016-11-11:

This is a fine start. I have some suggestions inline.

lp:~andrewjbeach/juju-ci-tools/fix-wait-for-started updated on 2016-11-11

1650. By Andrew James Beach on 2016-11-11: Clean-up in response to latest round of feedback. Such as clearing out the debugging code and adding the StatusItem class.
1651. By Andrew James Beach on 2016-11-11: Small fix in check_for_errors.

Revision history for this message

Andrew James Beach (andrewjbeach) wrote on 2016-11-11:

That is all the suggestions implemented. I'm going to go through and give it another round of polish. The main thing is adding a few helpers to StatusItem now that it exists.

One other point is renaming AgentLostTimeout. The current check does not search for a type of error as the name of the Exception would suggest. So either another check should be added or I think it should be AgentLongError or AgentNotRecovered, which don't suggest what the problem is, just that it has been around for longer.
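
The check being discussed only measures how long the agent has been in its current state. A minimal sketch of that age test follows, using the same 'since' format and five minute limit that appear in the diff; agent_error_is_stale is a hypothetical name.

```python
from datetime import datetime, timedelta

# Format of the 'since' field as parsed in the diff.
SINCE_FORMAT = '%d %b %Y %H:%M:%SZ'


def agent_error_is_stale(since, now, limit=timedelta(minutes=5)):
    """Return True if an agent error has lasted longer than limit."""
    started = datetime.strptime(since, SINCE_FORMAT)
    return now - started > limit


since = '19 Aug 2016 05:36:42Z'
fresh = agent_error_is_stale(since, datetime(2016, 8, 19, 5, 40, 0))
stale = agent_error_is_stale(since, datetime(2016, 8, 19, 5, 45, 0))
```

Nothing here inspects what kind of error the agent reports, which is the reason a name describing duration fits better than one describing a lost agent.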

Revision history for this message

Aaron Bentley (abentley) wrote on 2016-11-11:

Suggestion inline.

lp:~andrewjbeach/juju-ci-tools/fix-wait-for-started updated on 2016-11-11

1652. By Andrew James Beach on 2016-11-11: to_exception now returns None if the StatusItem is not an error. Changed StatusItem's __init__.

Revision history for this message

Andrew James Beach (andrewjbeach) wrote on 2016-11-11:

I was debating using a predicate instead of Aaron's idea, but returning None is harder to misuse than checking a predicate, so I went with that.
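
A sketch of the return-None style from the caller's side. The function names and the flat status dict are simplified stand-ins, not the StatusItem API itself; the point is that one call both detects and builds the error, so a caller cannot build an exception for a healthy item by forgetting a separate predicate check.

```python
class StatusError(Exception):
    """Stand-in for the branch's StatusError."""


def to_exception(item_name, status):
    """Return a StatusError for an errored status dict, else None."""
    if status.get('current') not in ('error', 'failed'):
        return None
    return StatusError(item_name, status.get('message'))


def collect_errors(statuses):
    """Gather the errors from a mapping of item name to status dict."""
    errors = []
    for item_name, status in sorted(statuses.items()):
        error = to_exception(item_name, status)
        if error is not None:
            errors.append(error)
    return errors


found = collect_errors({
    '0': {'current': 'running'},
    '1': {'current': 'error', 'message': 'boom'},
})
```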

I also changed the order of the arguments in StatusItem. The way the exception tree is set up the status_name has the greatest effect on the type of the exception (when I set it up it was a tie-breaker). So it is more significant and I made it come first.

lp:~andrewjbeach/juju-ci-tools/fix-wait-for-started updated on 2016-11-14

1653. By Andrew James Beach on 2016-11-11: It is datetime.strptime, not timedelta.strptime.
1654. By Andrew James Beach on 2016-11-14: Clean up, tests and trying to get compatibility with Juju 1.X.

Revision history for this message

Andrew James Beach (andrewjbeach) wrote on 2016-11-14:

OK, we have officially crossed the 400 lines mark, so this will get broken up.

But before that I have a few remaining 'big issues' to sort out. (Polish can happen in the small pieces.) The main one is ensuring compatibility with Juju 1.X versions.

To do that I have added an override of iter_status to Status1X. It goes through a slightly different set of fields than the one in Status. It also, rather crudely, maps all the status information to the format expected by StatusItem.

Perhaps creating a StatusItem1X would be a better solution. We could also give Status and Status1X Item class members to pick which one to use. However that probably would not be worth it unless there are other parts of StatusItem (besides __init__) we want to change.
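
The alternative floated above could look roughly like this. It is a sketch under assumptions: StatusItem1X, item_class and iter_machine_agents are hypothetical names, and only the agent-state, agent-version and agent-state-info fields are taken from the diff.

```python
class StatusItem(object):
    def __init__(self, status_name, item_name, item_value):
        self.status_name = status_name
        self.item_name = item_name
        self.status = item_value[status_name]


class StatusItem1X(StatusItem):
    """Hypothetical variant reading Juju 1.X's flat agent-* fields."""

    def __init__(self, status_name, item_name, item_value):
        self.status_name = status_name
        self.item_name = item_name
        self.status = {
            'current': item_value['agent-state'],
            'version': item_value['agent-version'],
            'message': item_value.get('agent-state-info'),
        }


class Status(object):
    # Each Status class names the item class it wants built.
    item_class = StatusItem

    def __init__(self, status):
        self.status = status

    def iter_machine_agents(self):
        for name, value in sorted(self.status['machines'].items()):
            yield self.item_class('juju-status', name, value)


class Status1X(Status):
    item_class = StatusItem1X


old = Status1X({'machines': {
    '0': {'agent-state': 'started', 'agent-version': '1.25.6'}}})
items = list(old.iter_machine_agents())
```

With this shape, iter_status would not need an override in Status1X; the format differences would live in the item class.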

Also I realized we are not covering machine containers or unit subordinates which may also have status information we should check.
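
A sketch of how the iteration might be widened to reach those entries. It assumes containers are nested under a machine's 'containers' key and subordinates under a unit's 'subordinates' key; both key names and both function names are assumptions for illustration and are not part of this branch.

```python
def iter_machines_with_containers(machines):
    """Yield (name, value) for machines and, recursively, containers."""
    for name, value in sorted(machines.items()):
        yield name, value
        nested = value.get('containers', {})
        for pair in iter_machines_with_containers(nested):
            yield pair


def iter_units_with_subordinates(units):
    """Yield (name, value) for units followed by their subordinates."""
    for name, value in sorted(units.items()):
        yield name, value
        for pair in sorted(value.get('subordinates', {}).items()):
            yield pair


machines = {'0': {'containers': {'0/lxd/0': {}}}, '1': {}}
units = {'app/0': {'subordinates': {'sub/0': {}}}}
machine_names = [name for name, _ in iter_machines_with_containers(machines)]
unit_names = [name for name, _ in iter_units_with_subordinates(units)]
```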

lp:~andrewjbeach/juju-ci-tools/fix-wait-for-started updated on 2016-11-14

1655. By Andrew James Beach on 2016-11-14: Cleaned up Status1X.iter_status.

Unmerged revisions

1655. By Andrew James Beach on 2016-11-14: Cleaned up Status1X.iter_status.
1654. By Andrew James Beach on 2016-11-14: Clean up, tests and trying to get compatibility with Juju 1.X.
1653. By Andrew James Beach on 2016-11-11: It is datetime.strptime, not timedelta.strptime.
1652. By Andrew James Beach on 2016-11-11: to_exception now returns None if the StatusItem is not an error. Changed StatusItem's __init__.
1651. By Andrew James Beach on 2016-11-11: Small fix in check_for_errors.
1650. By Andrew James Beach on 2016-11-11: Clean-up in response to latest round of feedback. Such as clearing out the debugging code and adding the StatusItem class.
1649. By Andrew James Beach on 2016-11-11: Didn't mean to cut that test out.
1648. By Andrew James Beach on 2016-11-11: New status->exception translation. A few more types of Errors and some 'surface' functions with ignore_recoverable. Tests have been updated, but should probably be redone.
1647. By Andrew James Beach on 2016-11-11: Fixed a few odds and ends for lint, I also changed StatusError sorting to a faster key based method.
1646. By Andrew James Beach on 2016-11-10: Worked on ordering, recoverable and InstallError.

Preview Diff

 === modified file 'jujupy.py'
 --- jujupy.py	2016-11-14 20:10:26 +0000
 +++ jujupy.py	2016-11-14 20:42:47 +0000
@@ -8,7 +8,10 @@
      contextmanager,
      )
  from copy import deepcopy
--from datetime import datetime
++from datetime import (
++    datetime,
++    timedelta,
++    )
  import errno
  from itertools import chain
  import json
@@ -533,6 +536,133 @@
          return self.get_cloud_credentials_item()[1]
++class StatusError(Exception):
++    """Generic error for Status."""
++
++    recoverable = True
++
++    # This has to be filled in after the classes are declared.
++    ordering = []
++
++    @classmethod
++    def priority(cls):
++        """Get the priority of the StatusError as a number.
++
++        Lower number means higher priority. This can be used as a key
++        function in sorting."""
++        return cls.ordering.index(cls)
++
++
++class MachineError(StatusError):
++    """Error in machine-status."""
++
++    recoverable = False
++
++
++class UnitError(StatusError):
++    """Error in a unit's status."""
++
++
++class HookFailedError(UnitError):
++    """A unit hook has failed."""
++
++    def __init__(self, item_name, msg):
++        match = re.search('^hook failed: "([^"]+)"$', msg)
++        if match:
++            msg = match.group(1)
++        super(HookFailedError, self).__init__(item_name, msg)
++
++
++class InstallError(HookFailedError):
++    """The unit's install hook has failed."""
++
++    recoverable = False
++
++
++class AppError(StatusError):
++    """Error in an application's status."""
++
++
++class AgentError(StatusError):
++    """Error in a juju agent."""
++
++
++class AgentLongError(AgentError):
++    """Agent error has not recovered in a reasonable time."""
++
++    recoverable = False
++
++
++StatusError.ordering = [MachineError, InstallError, AgentLongError,
++                        HookFailedError, UnitError, AppError, AgentError,
++                        StatusError]
++
++
++class StatusItem:
++
++    APPLICATION = 'application-status'
++    WORKLOAD = 'workload-status'
++    MACHINE = 'machine-status'
++    JUJU = 'juju-status'
++
++    def __init__(self, status_name, item_name, item_value):
++        self.status_name = status_name
++        self.item_name = item_name
++        # (self.item_name, item_value) = item
++        self.status = item_value[status_name]
++
++    @property
++    def message(self):
++        return self.status.get('message')
++
++    @property
++    def since(self):
++        return self.status.get('since')
++
++    @property
++    def current(self):
++        return self.status.get('current')
++
++    @property
++    def version(self):
++        return self.status.get('version')
++
++    def datetime_since(self):
++        return datetime.strptime(self.since, '%d %b %Y %H:%M:%SZ')
++
++    def to_exception(self):
++        """Create an exception representing the error if one exists.
++
++        :return: StatusError (or subtype) to represent an error or None
++        to show that there is no error."""
++        if self.current not in ['error', 'failed']:
++            return None
++
++        if self.APPLICATION == self.status_name:
++            return AppError(self.item_name, self.message)
++        elif self.WORKLOAD == self.status_name:
++            if self.message is None:
++                return UnitError(self.item_name, self.message)
++            elif re.match('hook failed: ".*install.*"', self.message):
++                return InstallError(self.item_name, self.message)
++            elif re.match('hook failed', self.message):
++                return HookFailedError(self.item_name, self.message)
++            else:
++                return UnitError(self.item_name, self.message)
++        elif self.MACHINE == self.status_name:
++            return MachineError(self.item_name, self.message)
++        elif self.JUJU == self.status_name:
++            time_since = datetime.utcnow() - self.datetime_since()
++            if time_since > timedelta(minutes=5):
++                return AgentLongError(self.item_name, self.message,
++                                      time_since.total_seconds())
++            else:
++                return AgentError(self.item_name, self.message)
++        else:
++            raise ValueError('Unknown status:{}'.format(self.status_name),
++                             (self.item_name, self.status))
++
++
  class Status:
      def __init__(self, status, status_text):
@@ -552,6 +682,10 @@
      def get_applications(self):
          return self.status.get('applications', {})
++    @property
++    def machines(self):
++        return self.status['machines']
++
      def iter_machines(self, containers=False, machines=True):
          for machine_name, machine in sorted(self.status['machines'].items()):
              if machines:
@@ -685,12 +819,71 @@
          """
          return self.get_unit(unit_name).get('open-ports', [])
++    def iter_status(self):
++        """Iterate through every status field in the larger status data."""
++        for machine_name, machine_value in self.machines.items():
++            yield StatusItem(StatusItem.MACHINE, machine_name, machine_value)
++            yield StatusItem(StatusItem.JUJU, machine_name, machine_value)
++        for app_name, app_value in self.get_applications().items():
++            yield StatusItem(StatusItem.APPLICATION, app_name, app_value)
++            for unit_name, unit_value in app_value['units'].items():
++                yield StatusItem(StatusItem.WORKLOAD, unit_name, unit_value)
++                yield StatusItem(StatusItem.JUJU, unit_name, unit_value)
++
++    def iter_errors(self, ignore_recoverable=False):
++        """Iterate through every error, represented by exceptions."""
++        for sub_status in self.iter_status():
++            error = sub_status.to_exception()
++            if error is not None:
++                if not (ignore_recoverable and error.recoverable):
++                    yield error
++
++    def check_for_errors(self, ignore_recoverable=False):
++        """Return a list of errors, in order of their priority."""
++        return sorted(self.iter_errors(ignore_recoverable),
++                      key=lambda item: item.priority())
++
++    def raise_highest_error(self, ignore_recoverable=False):
++        """Raise an exception representing the highest priority error."""
++        errors = self.check_for_errors(ignore_recoverable)
++        if errors:
++            raise errors[0]
++
  class Status1X(Status):
      def get_applications(self):
          return self.status.get('services', {})
++    def status_condence(self, item_value):
++        """Condense the scattered agent-* fields into a status dict."""
++        return {'current': item_value['agent-state'],
++                'version': item_value['agent-version'],
++                'message': item_value.get('agent-state-info'),
++                }
++
++    def iter_status(self):
++        for machine_name, machine_value in self.machines.items():
++            yield StatusItem(
++                StatusItem.JUJU, machine_name,
++                {StatusItem.JUJU: self.status_condence(machine_value)})
++        for app_name, app_value in self.get_applications().items():
++            yield StatusItem(
++                StatusItem.APPLICATION, app_name,
++                {StatusItem.APPLICATION: app_value['service-status']})
++            for unit_name, unit_value in app_value['units'].items():
++                if StatusItem.WORKLOAD in unit_value:
++                    yield StatusItem(StatusItem.WORKLOAD,
++                                     unit_name, unit_value)
++                if 'agent-status' in unit_value:
++                    yield StatusItem(
++                        StatusItem.JUJU, unit_name,
++                        {StatusItem.JUJU: unit_value['agent-status']})
++                else:
++                    yield StatusItem(
++                        StatusItem.JUJU, unit_name,
++                        {StatusItem.JUJU: self.status_condence(unit_value)})
++
  def describe_substrate(env):
      if env.provider == 'local':
 === modified file 'tests/test_jujupy.py'
 --- tests/test_jujupy.py	2016-11-14 20:10:26 +0000
 +++ tests/test_jujupy.py	2016-11-14 20:42:47 +0000
@@ -58,17 +58,21 @@
      JUJU_DEV_FEATURE_FLAGS,
      KILL_CONTROLLER,
      Machine,
++    MachineError,
      make_safe_config,
      NoProvider,
      parse_new_state_server_from_error,
      SimpleEnvironment,
--    Status1X,
      SoftDeadlineExceeded,
      Status,
++    Status1X,
++    StatusError,
++    StatusItem,
      SYSTEM,
      temp_bootstrap_env,
      _temp_env as temp_env,
      temp_yaml_file,
++    UnitError,
      uniquify_local,
      UpgradeMongoNotSupported,
      VersionNotTestedError,
@@ -6017,6 +6021,138 @@
          self.assertIsInstance(gen, types.GeneratorType)
          self.assertEqual(expected, list(gen))
++    def run_iter_status(self):
++        status = Status({
++            'machines': {
++                '0': {
++                    'juju-status': {
++                        'current': 'idle',
++                        'since': 'DD MM YYYY hh:mm:ss',
++                        'version': '2.0.0',
++                        },
++                    'machine-status': {
++                        'current': 'running',
++                        'message': 'Running',
++                        'since': 'DD MM YYYY hh:mm:ss',
++                        },
++                    },
++                '1': {
++                    'juju-status': {
++                        'current': 'idle',
++                        'since': 'DD MM YYYY hh:mm:ss',
++                        'version': '2.0.0',
++                        },
++                    'machine-status': {
++                        'current': 'running',
++                        'message': 'Running',
++                        'since': 'DD MM YYYY hh:mm:ss',
++                        },
++                    },
++                },
++            'applications': {
++                'fakejob': {
++                    'application-status': {
++                        'current': 'idle',
++                        'since': 'DD MM YYYY hh:mm:ss',
++                        },
++                    'units': {
++                        'fakejob/0': {
++                            'workload-status': {
++                                'current': 'maintenance',
++                                'message': 'Started',
++                                'since': 'DD MM YYYY hh:mm:ss',
++                                },
++                            'juju-status': {
++                                'current': 'idle',
++                                'since': 'DD MM YYYY hh:mm:ss',
++                                'version': '2.0.0',
++                                },
++                            },
++                        'fakejob/1': {
++                            'workload-status': {
++                                'current': 'maintenance',
++                                'message': 'Started',
++                                'since': 'DD MM YYYY hh:mm:ss',
++                                },
++                            'juju-status': {
++                                'current': 'idle',
++                                'since': 'DD MM YYYY hh:mm:ss',
++                                'version': '2.0.0',
++                                },
++                            },
++                        },
++                    }
++                },
++            }, '')
++        for sub_status in status.iter_status():
++            yield sub_status
++
++    def test_iter_status_range(self):
++        status_set = set([(status_item.item_name, status_item.status_name)
++                          for status_item in self.run_iter_status()])
++        self.assertEqual({
++            ('0', 'juju-status'), ('0', 'machine-status'),
++            ('1', 'juju-status'), ('1', 'machine-status'),
++            ('fakejob', 'application-status'),
++            ('fakejob/0', 'workload-status'), ('fakejob/0', 'juju-status'),
++            ('fakejob/1', 'workload-status'), ('fakejob/1', 'juju-status'),
++            }, status_set)
++
++    def test_iter_status_data(self):
++        min_set = set(['current', 'since'])
++        max_set = set(['current', 'message', 'since', 'version'])
++        for status_item in self.run_iter_status():
++            if 'fakejob' == status_item.item_name:
++                self.assertEqual(StatusItem.APPLICATION,
++                                 status_item.status_name)
++                self.assertEqual({'current': 'idle',
++                                  'since': 'DD MM YYYY hh:mm:ss',
++                                  }, status_item.status)
++            else:
++                cur_set = set(status_item.status.keys())
++                self.assertTrue(min_set < cur_set)
++                self.assertTrue(cur_set < max_set)
++
++    def test_iter_errors(self):
++        status = Status({}, '')
++        retval = [StatusItem(StatusItem.WORKLOAD, 'job/0',
++                             {StatusItem.WORKLOAD: {'current': 'error'}}),
++                  StatusItem(StatusItem.APPLICATION, 'job',
++                             {StatusItem.APPLICATION: {'current': 'running'}})
++                  ]
++        with patch.object(status, 'iter_status', autospec=True,
++                          return_value=retval):
++            errors = list(status.iter_errors())
++        self.assertEqual(len(errors), 1)
++        self.assertIsInstance(errors[0], UnitError)
++        self.assertEqual(('job/0', None), errors[0].args)
++
++    def test_check_for_errors_good(self):
++        status = Status({}, '')
++        with patch.object(status, 'iter_errors', autospec=True,
++                          return_value=[]) as error_mock:
++            self.assertEqual([], status.check_for_errors())
++        error_mock.assert_called_once_with(False)
++
++    @contextmanager
++    def patch_iter_errors_one(self, status,
++                              item_name, status_name, **kwargs):
++        retval = [(item_name, status_name, dict(current='error', **kwargs))]
++        with patch.object(status, 'iter_errors', autospec=True,
++                          return_value=retval) as error_mock:
++            yield error_mock
++
++    def test_check_for_errors(self):
++        status = Status({}, '')
++        errors = [MachineError('0'), StatusError('2'), UnitError('1')]
++        with patch.object(status, 'iter_errors', autospec=True,
++                          return_value=errors) as errors_mock:
++            sorted_errors = status.check_for_errors()
++        errors_mock.assert_called_once_with(False)
++        self.assertEqual(sorted_errors[0].args, ('0',))
++        self.assertEqual(sorted_errors[1].args, ('1',))
++        self.assertEqual(sorted_errors[2].args, ('2',))
++
      def test_get_applications_gets_applications(self):
          status = Status({
              'services': {'service': {}},
@@ -6035,6 +6171,21 @@
          self.assertEqual({'service': {}}, status.get_applications())
++class TestStatusItem(TestCase):
++
++    @staticmethod
++    def make_status_item(status_name, item_name, **kwargs):
++        return StatusItem(status_name, item_name, {status_name: kwargs})
++
++    def test_datetime_since(self):
++        item = self.make_status_item(StatusItem.JUJU, '0',
++                                     since='19 Aug 2016 05:36:42Z')
++        target = datetime(2016, 8, 19, 5, 36, 42)
++        self.assertEqual(item.datetime_since(), target)
++
++    # to_exception is going to need a lot of tests.
++
++
  def fast_timeout(count):
      if False:
          yield
