pyjuju

Merge lp:~hazmat/pyjuju/unit-agent-resolved into lp:pyjuju

unit-agent-resolved
Merge into trunk

Proposed by Kapil Thangavelu on 2011-05-02

Status:	Merged
Approved by:	Gustavo Niemeyer on 2011-05-05
Approved revision:	275
Merged at revision:	224
Proposed branch:	lp:~hazmat/pyjuju/unit-agent-resolved
Merge into:	lp:pyjuju
Prerequisite:	lp:~hazmat/pyjuju/ensemble-resolved
Diff against target:	1276 lines (+718/-94) 7 files modified
	ensemble/control/tests/test_resolved.py (+0/-1)
	ensemble/state/service.py (+18/-5)
	ensemble/state/tests/test_service.py (+71/-1)
	ensemble/unit/lifecycle.py (+85/-21)
	ensemble/unit/tests/test_lifecycle.py (+242/-3)
	ensemble/unit/tests/test_workflow.py (+154/-12)
	ensemble/unit/workflow.py (+148/-51)
To merge this branch:	bzr merge lp:~hazmat/pyjuju/unit-agent-resolved

Reviewer	Review Type	Date Requested	Status
Gustavo Niemeyer		2011-05-02	Approve on 2011-05-05
Review via email: mp+59713@code.launchpad.net

This proposal supersedes a proposal from 2011-05-02.

Description of the change

This implements the unit lifecycle resolving unit relations, and additional changes to enable resolution (transition actions receiving hook execution flags).

The branch is large enough that I'm going to split the remaining work (the unit agent resolving unit relations) into another branch.

Gustavo Niemeyer (niemeyer) wrote on 2011-05-03:

Kapil mentioned he's still working on this one.

Gustavo Niemeyer (niemeyer) wrote on 2011-05-04:

[1]

+    def do_retry_start(self, fire_hooks=True):
+        return self._invoke_lifecycle(
+            self._lifecycle.start, fire_hooks=fire_hooks)

We've already debated this live over a voice call, and had already
covered the topic in a previous conversation: I think it is a mistake
to introduce variables which define how the transition should work.

This is increasing the complexity of actions in an unpredictable way
(a single action with parameters could handle *all* the possible
transitions in the state machine).

review: Needs Fixing

Kapil Thangavelu (hazmat) wrote on 2011-05-04:

Changed over to using additional transitions; it's a bit nicer now, thanks.
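To make the trade-off from [1] concrete, here is a minimal sketch of the "additional transitions" shape. It is not the code from this branch: the `Lifecycle` and `Workflow` classes, the `fire_transition` helper, and the transition names `retry_start` and `retry_start_no_hooks` are all hypothetical stand-ins. The point it illustrates is that each transition maps to one parameterless action, so the hook behavior is fixed by which transition is chosen rather than by an argument passed into the action.

```python
class Lifecycle:
    """Hypothetical stand-in for the unit lifecycle; records hooks it would run."""

    def __init__(self):
        self.hooks_run = []

    def start(self, fire_hooks=True):
        if fire_hooks:
            self.hooks_run.append("start")


class Workflow:
    """Hypothetical workflow with one parameterless action per transition."""

    def __init__(self, lifecycle):
        self._lifecycle = lifecycle
        self.state = "start_error"
        # transition name -> (source state, target state)
        self._transitions = {
            "retry_start": ("start_error", "started"),
            "retry_start_no_hooks": ("start_error", "started"),
        }

    # Each action takes no arguments; the transition name carries the intent.
    def do_retry_start(self):
        self._lifecycle.start(fire_hooks=True)

    def do_retry_start_no_hooks(self):
        self._lifecycle.start(fire_hooks=False)

    def fire_transition(self, name):
        source, target = self._transitions[name]
        if self.state != source:
            raise ValueError(
                "cannot fire %r from state %r" % (name, self.state))
        getattr(self, "do_" + name)()
        self.state = target
```

With this shape, the set of possible behaviors can be read off the transition table, whereas a single action with parameters would hide them in its call sites.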

Gustavo Niemeyer (niemeyer) wrote on 2011-05-05:

Nice, thanks for the changes. Looking good!

+1, taking these in consideration:

[2]

+    def do_retry_upgrade_formula_hook(self, fire_hooks=True):

The fire_hooks parameter seems to be a leftover.

[3]

+        return self._invoke_lifecycle(
+            self._lifecycle.start, fire_hooks=False).addCallback(
+            lambda x: self._invoke_lifecycle(
+                self._lifecycle.configure, fire_hooks=False))
(...)
+        return self._invoke_lifecycle(
+            self._lifecycle.start, fire_hooks=False).addCallback(
+            lambda x: self._invoke_lifecycle(
+                self._lifecycle.configure))

Breaking these down would make them more readable than the huge one-liners.

[4]

+# Another interesting issue, process recovery using the on disk state,
+# is complicated by consistency to the the in memory state, which
+# won't be directly recoverable anymore without some state specific

Sweet. Thanks for capturing those ideas.

[5]

+            # If the unit lifecycle isn't running we shouldn't process
+            # any relation resolutions.
+            if not self._running:
+                self._log.debug("stop watch relation resolved changes")
+                self._watching_relation_resolved = False
+                raise StopWatcher()

It's not clear what's the intention here, and test coverage also
looks a bit poor in this area (e.g. replacing everything under the "if" by
"return" still passes tests fine).

What do we want to happen in this case, and what's the rationale
behind it?

[6]

+            if self._client.connected:
+                yield self._process_relation_resolved_changes()

What's the "if connected" test about? It feels like pretty much every
interaction with zk could have such a test. Why is this location
special?

review: Approve

Kapil Thangavelu (hazmat) wrote on 2011-05-05:

Excerpts from Gustavo Niemeyer's message of Thu May 05 02:01:15 UTC 2011:
> Review: Approve
> Nice, thanks for the changes. Looking good!
>
> +1, taking these in consideration:
>
> [5]
>
> +            # If the unit lifecycle isn't running we shouldn't process
> +            # any relation resolutions.
> +            if not self._running:
> +                self._log.debug("stop watch relation resolved changes")
> +                self._watching_relation_resolved = False
> +                raise StopWatcher()
>
> It's not clear what's the intention here, and test coverage also
> looks a bit poor in this area (e.g. replacing everything under the "if" by
> "return" still passes tests fine).
>
> What do we want to happen in this case, and what's the rationale
> behind it?
>

The notion is that if we're not running, the watcher terminates; when we start watching
again, the watcher is restarted if it doesn't already exist. It's designed to
avoid concurrent watchers, and to allow for watch termination.

The design rationale is that if a unit is not running, it needs to be resolved before
its unit relations can be fixed.

Testing this correctly runs into another behavior of the unit lifecycle,
where we automatically repair unit relations when the unit is started. I've
introduced a separate error state for relations in this branch, so we can choose
not to have unit relation errors automatically recovered, in which case the logic
in question can be tested easily by bringing the unit back online and then verifying
that the resolve action was performed.
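The terminate-and-restart behavior described above can be sketched roughly as follows. This is a synchronous, in-memory illustration and not the Twisted/ZooKeeper code from the branch: `ResolvedNode` and this `Lifecycle` are hypothetical stand-ins, and only the `StopWatcher` exception name is taken from the diff. It assumes one-shot watches that are re-armed only when the callback returns normally.

```python
class StopWatcher(Exception):
    """Raised by a watch callback to ask that the watch not be re-armed."""


class ResolvedNode:
    """Hypothetical in-memory stand-in for a watched node."""

    def __init__(self):
        self._watchers = []

    def watch(self, callback):
        self._watchers.append(callback)

    def change(self, event):
        # Watches are one-shot: each fires once and is re-armed only if
        # the callback returns without raising StopWatcher.
        pending, self._watchers = self._watchers, []
        for callback in pending:
            try:
                callback(event)
            except StopWatcher:
                continue
            self._watchers.append(callback)


class Lifecycle:
    """Hypothetical lifecycle that owns at most one watcher at a time."""

    def __init__(self, node):
        self._node = node
        self.running = False
        self.watching = False
        self.processed = []

    def start(self):
        # Only establish a watch if one is not already active, which
        # avoids concurrent watchers across stop/start cycles.
        if not self.watching:
            self._node.watch(self._on_change)
            self.watching = True
        self.running = True

    def stop(self):
        self.running = False

    def _on_change(self, event):
        if not self.running:
            self.watching = False
            raise StopWatcher()
        self.processed.append(event)
```

In this sketch a change that arrives while the lifecycle is stopped is discarded and ends the watch; the next `start()` notices that no watch is active and establishes a fresh one.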

>
> [6]
>
> +            if self._client.connected:
> +                yield self._process_relation_resolved_changes()
>
> What's the "if connected" test about? It feels like pretty much every
> interaction with zk could have such a test. Why is this location
> special?
>

It's typically done in watch callbacks so that an async background execution running while the test
is closing does not trigger a zookeeper closing exception, which minimizes errors.
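As a rough illustration of that guard, the sketch below uses a hypothetical `FakeClient` (not the real ZooKeeper client API) whose calls fail once it is closed. The names `FakeClient`, `get`, and `make_watch_callback` are assumptions made for the example; only the `connected` check mirrors the snippet under discussion.

```python
class FakeClient:
    """Hypothetical client stand-in; calls raise once it is closed."""

    def __init__(self):
        self.connected = True

    def close(self):
        self.connected = False

    def get(self, path):
        if not self.connected:
            raise RuntimeError("client is closed")
        return "value at %s" % path


def make_watch_callback(client, path, results):
    """Build a watch callback that becomes a no-op after the client closes."""

    def callback(event):
        # A watch may fire in the background after teardown has begun;
        # skip the work instead of touching a closed connection.
        if not client.connected:
            return None
        results.append(client.get(path))

    return callback
```

The guard matters for callbacks in particular because they run at a time the caller does not control, whereas ordinary calls are made while the caller already knows the connection is open.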

lp:~hazmat/pyjuju/unit-agent-resolved updated on 2011-05-05

276. By Kapil Thangavelu on 2011-05-05: Merged ensemble-resolved into unit-agent-resolved.
277. By Kapil Thangavelu on 2011-05-05: Merged ensemble-resolved into unit-agent-resolved.
278. By Kapil Thangavelu on 2011-05-05: Merged ensemble-resolved into unit-agent-resolved.
279. By Kapil Thangavelu on 2011-05-05: address review comments re formatting

Kapil Thangavelu (hazmat) wrote on 2011-05-05:

Excerpts from Kapil Thangavelu's message of Thu May 05 13:26:02 -0400 2011:
> Excerpts from Gustavo Niemeyer's message of Thu May 05 02:01:15 UTC 2011:
> > Review: Approve
> > Nice, thanks for the changes. Looking good!
> >
> > +1, taking these in consideration:
> >
> > [5]
> >
> > +            # If the unit lifecycle isn't running we shouldn't process
> > +            # any relation resolutions.
> > +            if not self._running:
> > +                self._log.debug("stop watch relation resolved changes")
> > +                self._watching_relation_resolved = False
> > +                raise StopWatcher()
> >
> > It's not clear what's the intention here, and test coverage also
> > looks a bit poor in this area (e.g. replacing everything under the "if" by
> > "return" still passes tests fine).
> >
> > What do we want to happen in this case, and what's the rationale
> > behind it?
> >
>
> The notion is that if we're not running, the watcher terminates; when we start watching
> again, the watcher is restarted if it doesn't already exist. It's designed to
> avoid concurrent watchers, and to allow for watch termination.
>
> The design rationale is that if a unit is not running, it needs to be resolved before
> its unit relations can be fixed.
>
> Testing this correctly runs into another behavior of the unit lifecycle,
> where we automatically repair unit relations when the unit is started. I've
> introduced a separate error state for relations in this branch, so we can choose
> not to have unit relation errors automatically recovered, in which case the logic
> in question can be tested easily by bringing the unit back online and then verifying
> that the resolve action was performed.

Just to be clear, there is a test for the overall functionality in

test_resolved_relation_watch_unit_lifecycle_not_running

But because of the above constraint regarding automatic recovery on start, it can
only verify that the watcher has stopped, not that the recovery proceeds normally
after restart, so it instead verifies that the resolved setting has persisted
past the end of the watcher.

Preview Diff

 === modified file 'ensemble/control/tests/test_resolved.py'
 --- ensemble/control/tests/test_resolved.py	2011-04-28 17:09:36 +0000
 +++ ensemble/control/tests/test_resolved.py	2011-05-05 17:43:23 +0000
@@ -2,7 +2,6 @@
  from yaml import dump
  from ensemble.control import main
--from ensemble.control.resolved import resolved
  from ensemble.control.tests.common import ControlToolTest
  from ensemble.formula.tests.test_repository import RepositoryTestBase
 === modified file 'ensemble/state/service.py'
 --- ensemble/state/service.py	2011-05-05 16:18:34 +0000
 +++ ensemble/state/service.py	2011-05-05 17:43:23 +0000
@@ -16,11 +16,10 @@
      ServiceUnitStateMachineAlreadyAssigned, ServiceStateNameInUse,
      BadDescriptor, BadServiceStateName, NoUnusedMachines,
      ServiceUnitDebugAlreadyEnabled, ServiceUnitResolvedAlreadyEnabled,
--    ServiceUnitRelationResolvedAlreadyEnabled)
++    ServiceUnitRelationResolvedAlreadyEnabled, StopWatcher)
  from ensemble.state.formula import FormulaStateManager
  from ensemble.state.relation import ServiceRelationState, RelationStateManager
  from ensemble.state.machine import _public_machine_id, MachineState
--
  from ensemble.state.utils import remove_tree, dict_merge, YAMLState
  RETRY_HOOKS = 1000
@@ -516,7 +515,7 @@
          its `write` method invoked to publish the state to Zookeeper.
          """
          config_node = YAMLState(self._client,
--                                "/services/%s/config" % self._internal_id )
++                                "/services/%s/config" % self._internal_id)
          yield config_node.read()
          returnValue(config_node)
@@ -944,9 +943,13 @@
          def watcher(change_event):
              if not self._client.connected:
                  returnValue(None)
++
              exists_d, watch_d = self._client.exists_and_watch(
                  self._unit_resolve_path)
--            yield callback(change_event)
++            try:
++                yield callback(change_event)
++            except StopWatcher:
++                returnValue(None)
              watch_d.addCallback(watcher)
          exists_d, watch_d = self._client.exists_and_watch(
@@ -958,6 +961,8 @@
          callback_d = maybeDeferred(callback, bool(exists))
          callback_d.addCallback(
              lambda x: watch_d.addCallback(watcher) and x)
++        callback_d.addErrback(
++            lambda failure: failure.trap(StopWatcher))
      @property
      def _relation_resolved_path(self):
@@ -1033,6 +1038,8 @@
      @inlineCallbacks
      def clear_relation_resolved(self):
++        """ Clear the relation resolved setting.
++        """
          try:
              yield self._client.delete(self._relation_resolved_path)
          except zookeeper.NoNodeException:
@@ -1055,7 +1062,11 @@
                  returnValue(None)
              exists_d, watch_d = self._client.exists_and_watch(
                  self._relation_resolved_path)
--            yield callback(change_event)
++            try:
++                yield callback(change_event)
++            except StopWatcher:
++                returnValue(None)
++
              watch_d.addCallback(watcher)
          exists_d, watch_d = self._client.exists_and_watch(
@@ -1068,6 +1079,8 @@
          callback_d = maybeDeferred(callback, bool(exists))
          callback_d.addCallback(
              lambda x: watch_d.addCallback(watcher) and x)
++        callback_d.addErrback(
++            lambda failure: failure.trap(StopWatcher))
  def _parse_unit_name(unit_name):
 === modified file 'ensemble/state/tests/test_service.py'
 --- ensemble/state/tests/test_service.py	2011-05-05 14:31:41 +0000
 +++ ensemble/state/tests/test_service.py	2011-05-05 17:43:23 +0000
@@ -17,7 +17,7 @@
      ServiceUnitStateMachineAlreadyAssigned, ServiceStateNameInUse,
      BadDescriptor, BadServiceStateName, ServiceUnitDebugAlreadyEnabled,
      MachineStateNotFound, NoUnusedMachines, ServiceUnitResolvedAlreadyEnabled,
--    ServiceUnitRelationResolvedAlreadyEnabled)
++    ServiceUnitRelationResolvedAlreadyEnabled, StopWatcher)
  from ensemble.state.tests.common import StateTestBase
@@ -790,6 +790,42 @@
              {"retry": NO_HOOKS})
      @inlineCallbacks
++    def test_stop_watch_resolved(self):
++        """A unit resolved watch can be instituted on a permanent basis.
++
++        However the callback can raise StopWatcher at anytime to stop the watch
++        """
++        unit_state = yield self.get_unit_state()
++
++        results = []
++
++        def callback(value):
++            results.append(value)
++            if len(results) == 1:
++                raise StopWatcher()
++            if len(results) == 3:
++                raise StopWatcher()
++
++        unit_state.watch_resolved(callback)
++        yield unit_state.set_resolved(RETRY_HOOKS)
++        yield unit_state.clear_resolved()
++        yield self.poke_zk()
++
++        unit_state.watch_resolved(callback)
++        yield unit_state.set_resolved(NO_HOOKS)
++        yield unit_state.clear_resolved()
++
++        yield self.poke_zk()
++
++        self.assertEqual(len(results), 3)
++        self.assertIdentical(results.pop(0), False)
++        self.assertIdentical(results.pop(0), False)
++        self.assertEqual(results.pop(0).type_name, "created")
++
++        self.assertEqual(
++            (yield unit_state.get_resolved()), None)
++
++    @inlineCallbacks
      def test_get_set_clear_relation_resolved(self):
          """The a unit's realtions can be set to resolved to mark a
          future transition, with an optional retry flag."""
@@ -850,6 +886,40 @@
              {"0": NO_HOOKS})
      @inlineCallbacks
++    def test_stop_watch_relation_resolved(self):
++        """A unit resolved watch can be instituted on a permanent basis."""
++        unit_state = yield self.get_unit_state()
++
++        results = []
++
++        def callback(value):
++            results.append(value)
++
++            if len(results) == 1:
++                raise StopWatcher()
++
++            if len(results) == 3:
++                raise StopWatcher()
++
++        unit_state.watch_relation_resolved(callback)
++        yield unit_state.set_relation_resolved({"0": RETRY_HOOKS})
++        yield unit_state.clear_relation_resolved()
++        yield self.poke_zk()
++        self.assertEqual(len(results), 1)
++
++        unit_state.watch_relation_resolved(callback)
++        yield unit_state.set_relation_resolved({"0": RETRY_HOOKS})
++        yield unit_state.clear_relation_resolved()
++        yield self.poke_zk()
++        self.assertEqual(len(results), 3)
++        self.assertIdentical(results.pop(0), False)
++        self.assertIdentical(results.pop(0), False)
++        self.assertEqual(results.pop(0).type_name, "created")
++
++        self.assertEqual(
++            (yield unit_state.get_relation_resolved()), None)
++
++    @inlineCallbacks
      def test_watch_resolved_slow_callback(self):
          """A slow watch callback is still invoked serially."""
          unit_state = yield self.get_unit_state()
 === modified file 'ensemble/unit/lifecycle.py'
 --- ensemble/unit/lifecycle.py	2011-05-05 14:31:43 +0000
 +++ ensemble/unit/lifecycle.py	2011-05-05 17:43:23 +0000
@@ -2,7 +2,7 @@
  import logging
  from twisted.internet.defer import (
--    inlineCallbacks, DeferredLock)
++    inlineCallbacks, DeferredLock, returnValue)
  from ensemble.hooks.invoker import Invoker
  from ensemble.hooks.scheduler import HookScheduler
@@ -32,7 +32,8 @@
          self._unit_path = unit_path
          self._relations = {}
          self._running = False
--        self._watching = False
++        self._watching_relation_memberships = False
++        self._watching_relation_resolved = False
          self._run_lock = DeferredLock()
          self._log = logging.getLogger("unit.lifecycle")
@@ -45,21 +46,23 @@
          return self._relations[relation_id]
      @inlineCallbacks
--    def install(self):
++    def install(self, fire_hooks=True):
          """Invoke the unit's install hook.
          """
--        yield self._execute_hook("install")
++        if fire_hooks:
++            yield self._execute_hook("install")
      @inlineCallbacks
--    def upgrade_formula(self):
++    def upgrade_formula(self, fire_hooks=True):
          """Invoke the unit's upgrade-formula hook.
          """
--        yield self._execute_hook("upgrade-formula", now=True)
++        if fire_hooks:
++            yield self._execute_hook("upgrade-formula", now=True)
          # Restart hook queued hook execution.
          self._executor.start()
      @inlineCallbacks
--    def start(self):
++    def start(self, fire_hooks=True):
          """Invoke the start hook, and setup relation watching.
          """
          self._log.debug("pre-start acquire, running:%s", self._running)
@@ -70,22 +73,29 @@
              assert not self._running, "Already started"
              # Execute the start hook
--            yield self._execute_hook("start")
++            if fire_hooks:
++                yield self._execute_hook("start")
              # If we have any existing relations in memory, start them.
              if self._relations:
                  self._log.debug("starting relation lifecycles")
              for workflow in self._relations.values():
--                # should not transition an
                  yield workflow.transition_state("up")
              # Establish a watch on the existing relations.
--            if not self._watching:
++            if not self._watching_relation_memberships:
                  self._log.debug("starting service relation watch")
                  yield self._service.watch_relation_states(
                      self._on_service_relation_changes)
--                self._watching = True
++                self._watching_relation_memberships = True
++
++            # Establish a watch for resolved relations
++            if not self._watching_relation_resolved:
++                self._log.debug("starting unit relation resolved watch")
++                yield self._unit.watch_relation_resolved(
++                    self._on_relation_resolved_changes)
++                self._watching_relation_resolved = True
              # Set current status
              self._running = True
@@ -94,7 +104,7 @@
          self._log.debug("started unit lifecycle")
      @inlineCallbacks
--    def stop(self):
++    def stop(self, fire_hooks=True):
          """Stop the unit, executes the stop hook, and stops relation watching.
          """
          self._log.debug("pre-stop acquire, running:%s", self._running)
@@ -110,7 +120,8 @@
              for workflow in self._relations.values():
                  yield workflow.transition_state("down")
--            yield self._execute_hook("stop")
++            if fire_hooks:
++                yield self._execute_hook("stop")
              # Set current status
              self._running = False
@@ -119,9 +130,11 @@
          self._log.debug("stopped unit lifecycle")
      @inlineCallbacks
--    def configure(self):
++    def configure(self, fire_hooks=True):
          """Inform the unit that its service config has changed.
          """
++        if not fire_hooks:
++            returnValue(None)
          yield self._run_lock.acquire()
          try:
              # Verify State
@@ -134,6 +147,49 @@
          self._log.debug("configured unit")
      @inlineCallbacks
++    def _on_relation_resolved_changes(self, event):
++        """Callback for unit relation resolved watching.
++
++        The callback is invoked whenever the relation resolved
++        settings change.
++        """
++        self._log.debug("relation resolved changed")
++        # Acquire the run lock, and process the changes.
++        yield self._run_lock.acquire()
++
++        try:
++            # If the unit lifecycle isn't running we shouldn't process
++            # any relation resolutions.
++            if not self._running:
++                self._log.debug("stop watch relation resolved changes")
++                self._watching_relation_resolved = False
++                raise StopWatcher()
++
++            self._log.info("processing relation resolved changed")
++            if self._client.connected:
++                yield self._process_relation_resolved_changes()
++        finally:
++            yield self._run_lock.release()
++
++    @inlineCallbacks
++    def _process_relation_resolved_changes(self):
++        """Invoke retry transitions on relations if their not running.
++        """
++        relation_resolved = yield self._unit.get_relation_resolved()
++        if relation_resolved is None:
++            returnValue(None)
++        else:
++            yield self._unit.clear_relation_resolved()
++
++        keys = set(relation_resolved).intersection(self._relations)
++        for rel_id in keys:
++            relation_workflow = self._relations[rel_id]
++            relation_state = yield relation_workflow.get_state()
++            if relation_state == "up":
++                continue
++            yield relation_workflow.transition_state("up")
++
++    @inlineCallbacks
      def _on_service_relation_changes(self, old_relations, new_relations):
          """Callback for service relation watching.
@@ -153,7 +209,7 @@
              # If the lifecycle is not running, then stop the watcher
              if not self._running:
                  self._log.debug("stop service-rel watcher, discarding changes")
--                self._watching = False
++                self._watching_relation_memberships = False
                  raise StopWatcher()
              self._log.debug("processing relations changed")
@@ -163,9 +219,9 @@
      @inlineCallbacks
      def _process_service_changes(self, old_relations, new_relations):
--        """Add and remove unit lifecycles per the service relations changes.
++        """Add and remove unit lifecycles per the service relations Determine.
          """
--        # Determine relation delta of global zk state with our memory state.
++        # changes relation delta of global zk state with our memory state.
          new_relations = dict([(service_relation.internal_relation_id,
                                 service_relation) for
                                service_relation in new_relations])
@@ -352,8 +408,11 @@
          self._error_handler = handler
      @inlineCallbacks
--    def start(self):
++    def start(self, watches=True):
          """Start watching related units and executing change hooks.
++
++        @param watches: boolean parameter denoting if relation watches
++               should be started.
          """
          yield self._run_lock.acquire()
          try:
@@ -364,19 +423,24 @@
                  self._watcher = yield self._unit_relation.watch_related_units(
                      self._scheduler.notify_change)
              # And start the watcher.
--            yield self._watcher.start()
++            if watches:
++                yield self._watcher.start()
          finally:
              self._run_lock.release()
          self._log.debug(
              "started relation:%s lifecycle", self._relation_name)
      @inlineCallbacks
--    def stop(self):
++    def stop(self, watches=True):
          """Stop watching changes and stop executing relation change hooks.
++
++        @param watches: boolean parameter denoting if relation watches
++               should be stopped.
          """
          yield self._run_lock.acquire()
          try:
--            self._watcher.stop()
++            if watches and self._watcher:
++                self._watcher.stop()
              self._scheduler.stop()
          finally:
              yield self._run_lock.release()
 === modified file 'ensemble/unit/tests/test_lifecycle.py'
 --- ensemble/unit/tests/test_lifecycle.py	2011-05-05 09:41:11 +0000
 +++ ensemble/unit/tests/test_lifecycle.py	2011-05-05 17:43:23 +0000
@@ -12,6 +12,8 @@
  from ensemble.unit.lifecycle import (
      UnitLifecycle, UnitRelationLifecycle, RelationInvoker)
++from ensemble.unit.workflow import RelationWorkflowState
++
  from ensemble.hooks.invoker import Invoker
  from ensemble.hooks.executor import HookExecutor
@@ -19,9 +21,11 @@
  from ensemble.state.endpoint import RelationEndpoint
  from ensemble.state.relation import ClientServerUnitWatcher
++from ensemble.state.service import NO_HOOKS
  from ensemble.state.tests.test_relation import RelationTestBase
  from ensemble.state.hook import RelationChange
++
  from ensemble.lib.testing import TestCase
  from ensemble.lib.mocker import MATCH
@@ -97,6 +101,8 @@
              results.append(hook_name)
              if debug:
                  print "-> exec hook", hook_name
++            if d.called:
++                return
              if results == sequence:
                  d.callback(True)
              if hook_name == name and count is None:
@@ -145,6 +151,182 @@
          return output
++class LifecycleResolvedTest(LifecycleTestBase):
++
++    @inlineCallbacks
++    def setUp(self):
++        yield super(LifecycleResolvedTest, self).setUp()
++        yield self.setup_default_test_relation()
++        self.lifecycle = UnitLifecycle(
++            self.client, self.states["unit"], self.states["service"],
++            self.unit_directory, self.executor)
++
++    def get_unit_relation_workflow(self, states):
++        state_dir = os.path.join(self.ensemble_directory, "state")
++        lifecycle = UnitRelationLifecycle(
++            self.client,
++            states["unit_relation"],
++            states["service_relation"].relation_name,
++            self.unit_directory,
++            self.executor)
++
++        workflow = RelationWorkflowState(
++            self.client,
++            states["unit_relation"],
++            lifecycle,
++            state_dir)
++
++        return (workflow, lifecycle)
++
++    @inlineCallbacks
++    def test_resolved_relation_watch_unit_lifecycle_not_running(self):
++        """If the unit is not running then no relation resolving is performed.
++        However the resolution value remains the same.
++        """
++        # Start the unit.
++        yield self.lifecycle.start()
++
++        # Wait for the relation to be started.... TODO: async background work
++        yield self.sleep(0.1)
++
++        # Simulate relation down on an individual unit relation
++        workflow = self.lifecycle.get_relation_workflow(
++            self.states["unit_relation"].internal_relation_id)
++        self.assertEqual("up", (yield workflow.get_state()))
++
++        yield workflow.transition_state("down")
++        resolved = self.wait_on_state(workflow, "up")
++
++        # Stop the unit lifecycle
++        yield self.lifecycle.stop()
++
++        # Set the relation to resolved
++        yield self.states["unit"].set_relation_resolved(
++            {self.states["unit_relation"].internal_relation_id: NO_HOOKS})
++
++        # Give a moment for the watch to fire erroneously
++        yield self.sleep(0.2)
++
++        # Ensure we didn't attempt a transition.
++        self.assertFalse(resolved.called)
++        self.assertEqual(
++            {self.states["unit_relation"].internal_relation_id: NO_HOOKS},
++            (yield self.states["unit"].get_relation_resolved()))
++
++        # If the unit is restarted, we currently have the
++        # behavior that the unit relation workflow will automatically
++        # be transitioned back to running, as part of the normal state
++        # transition. Sigh.. we should have a separate error state for
++        # relation hooks, rather than down with state variable usage.
++
++    @inlineCallbacks
++    def test_resolved_relation_watch_relation_up(self):
++        """If a relation marked as to be resolved is already running,
++        then no work is performed.
++        """
++        # Start the unit.
++        yield self.lifecycle.start()
++
++        # Wait for the relation to be started.... TODO: async background work
++        yield self.sleep(0.1)
++
++        # get a hold of the unit relation and verify state
++        workflow = self.lifecycle.get_relation_workflow(
++            self.states["unit_relation"].internal_relation_id)
++        self.assertEqual("up", (yield workflow.get_state()))
++
++        # Set the relation to resolved
++        yield self.states["unit"].set_relation_resolved(
++            {self.states["unit_relation"].internal_relation_id: NO_HOOKS})
++
++        # Give a moment for the async background work.
++        yield self.sleep(0.1)
++
++        # Ensure we're still up and the relation resolved setting has been
++        # cleared.
++        self.assertEqual(
++            None, (yield self.states["unit"].get_relation_resolved()))
++        self.assertEqual("up", (yield workflow.get_state()))
++
++    @inlineCallbacks
++    def test_resolved_relation_watch_from_error(self):
++        """Unit lifecycles will process a unit relation resolved
++        setting, and transition a relation in the error state back to
++        a running state.
++        """
++        log_output = self.capture_logging(
++            "unit.lifecycle", level=logging.DEBUG)
++
++        # Start the unit.
++        yield self.lifecycle.start()
++
++        # Wait for the relation to be started... TODO: async background work
++        yield self.sleep(0.1)
++
++        # Simulate an error condition
++        workflow = self.lifecycle.get_relation_workflow(
++            self.states["unit_relation"].internal_relation_id)
++        self.assertEqual("up", (yield workflow.get_state()))
++        yield workflow.fire_transition("error")
++
++        resolved = self.wait_on_state(workflow, "up")
++
++        # Set the relation to resolved
++        yield self.states["unit"].set_relation_resolved(
++            {self.states["unit_relation"].internal_relation_id: NO_HOOKS})
++
++        # Wait for the relation to come back up
++        value = yield self.states["unit"].get_relation_resolved()
++
++        yield resolved
++
++        # Verify state
++        value = yield workflow.get_state()
++        self.assertEqual(value, "up")
++
++        self.assertIn(
++            "processing relation resolved changed", log_output.getvalue())
++
++    @inlineCallbacks
++    def test_resolved_relation_watch(self):
++        """Unit lifecycles will process a unit relation resolved
++        setting, and transition a down relation back to a running
++        state.
++        """
++        log_output = self.capture_logging(
++            "unit.lifecycle", level=logging.DEBUG)
++
++        # Start the unit.
++        yield self.lifecycle.start()
++
++        # Wait for the relation to be started... TODO: async background work
++        yield self.sleep(0.1)
++
++        # Simulate an error condition
++        workflow = self.lifecycle.get_relation_workflow(
++            self.states["unit_relation"].internal_relation_id)
++        self.assertEqual("up", (yield workflow.get_state()))
++        yield workflow.transition_state("down")
++
++        resolved = self.wait_on_state(workflow, "up")
++
++        # Set the relation to resolved
++        yield self.states["unit"].set_relation_resolved(
++            {self.states["unit_relation"].internal_relation_id: NO_HOOKS})
++
++        # Wait for the relation to come back up
++        value = yield self.states["unit"].get_relation_resolved()
++
++        yield resolved
++
++        # Verify state
++        value = yield workflow.get_state()
++        self.assertEqual(value, "up")
++
++        self.assertIn(
++            "processing relation resolved changed", log_output.getvalue())
++
++
  class UnitLifecycleTest(LifecycleTestBase):
      @inlineCallbacks
@@ -187,6 +369,45 @@
          # verify the sockets are cleaned up.
          self.assertEqual(os.listdir(self.unit_directory), ["formula"])
++    @inlineCallbacks
++    def test_start_sans_hook(self):
++        """The lifecycle start can be invoked without firing hooks."""
++        self.write_hook("start", "#!/bin/sh\n exit 1")
++        start_executed = self.wait_on_hook("start")
++        yield self.lifecycle.start(fire_hooks=False)
++        # Wait for unit relation background processing....
++        yield self.sleep(0.1)
++        self.assertFalse(start_executed.called)
++
++    @inlineCallbacks
++    def test_stop_sans_hook(self):
++        """The lifecycle stop can be invoked without firing hooks."""
++        self.write_hook("stop", "#!/bin/sh\n exit 1")
++        stop_executed = self.wait_on_hook("stop")
++        yield self.lifecycle.start()
++        yield self.lifecycle.stop(fire_hooks=False)
++        # Wait for unit relation background processing....
++        yield self.sleep(0.1)
++        self.assertFalse(stop_executed.called)
++
++    @inlineCallbacks
++    def test_install_sans_hook(self):
++        """The lifecycle install can be invoked without firing hooks."""
++        self.write_hook("install", "#!/bin/sh\n exit 1")
++        install_executed = self.wait_on_hook("install")
++        yield self.lifecycle.install(fire_hooks=False)
++        self.assertFalse(install_executed.called)
++
++    @inlineCallbacks
++    def test_upgrade_sans_hook(self):
++        """The lifecycle upgrade can be invoked without firing hooks."""
++        self.executor.stop()
++        self.write_hook("upgrade-formula", "#!/bin/sh\n exit 1")
++        upgrade_executed = self.wait_on_hook("upgrade-formula")
++        yield self.lifecycle.upgrade_formula(fire_hooks=False)
++        self.assertFalse(upgrade_executed.called)
++        self.assertTrue(self.executor.running)
++
      def test_hook_error(self):
          """Verify hook execution error, raises an exception."""
          self.write_hook("install", '#!/bin/sh\n exit 1')
@@ -196,14 +417,12 @@
      def test_hook_not_executable(self):
          """A hook not executable, raises an exception."""
          self.write_hook("install", '#!/bin/sh\n exit 0', no_exec=True)
--        # It would be preferrable if this was also a formulainvocation error.
          return self.failUnlessFailure(
              self.lifecycle.install(), FormulaError)
      def test_hook_not_formatted_correctly(self):
          """Hook execution error, raises an exception."""
          self.write_hook("install", '!/bin/sh\n exit 0')
--        # It would be preferrable if this was also a formulainvocation error.
          return self.failUnlessFailure(
              self.lifecycle.install(), FormulaInvocationError)
@@ -532,7 +751,7 @@
      @inlineCallbacks
      def test_initial_start_lifecycle_no_related_no_exec(self):
          """
--        If there are no related units on startup, the relation changed hook
++        If there are no related units on startup, the relation joined hook
          is not invoked.
          """
          file_path = self.makeFile()
@@ -545,6 +764,26 @@
          self.assertFalse(os.path.exists(file_path))
      @inlineCallbacks
++    def test_stop_can_continue_watching(self):
++        """Stopping with watches=False keeps the watches but defers hooks
++        until the lifecycle is started again."""
++        file_path = self.makeFile()
++        self.write_hook(
++            "%s-relation-changed" % self.relation_name,
++            ("#!/bin/bash\n" "echo executed >> %s\n" % file_path))
++        rel_states = yield self.add_opposite_service_unit(self.states)
++        yield self.lifecycle.start()
++        yield self.wait_on_hook(
++            sequence=["app-relation-joined", "app-relation-changed"])
++        changed_executed = self.wait_on_hook("app-relation-changed")
++        yield self.lifecycle.stop(watches=False)
++        rel_states["unit_relation"].set_data(yaml.dump(dict(hello="world")))
++        yield self.sleep(0.1)
++        self.assertFalse(changed_executed.called)
++        yield self.lifecycle.start(watches=False)
++        yield changed_executed
++
++    @inlineCallbacks
      def test_initial_start_lifecycle_with_related(self):
          """
          If there are related units on startup, the relation changed hook
 === modified file 'ensemble/unit/tests/test_workflow.py'
 --- ensemble/unit/tests/test_workflow.py	2011-05-05 14:31:43 +0000
 +++ ensemble/unit/tests/test_workflow.py	2011-05-05 17:43:23 +0000
@@ -83,11 +83,30 @@
          self.assertFalse(result)
          current_state = yield self.workflow.get_state()
          yield self.assertEqual(current_state, "install_error")
--        self.write_hook("install", "#!/bin/bash\necho hello\n")
          result = yield self.workflow.fire_transition("retry_install")
          yield self.assertState(self.workflow, "installed")
      @inlineCallbacks
++    def test_install_error_with_retry_hook(self):
++        """If the install hook fails, the workflow is transitioned to the
++        install_error state.
++        """
++        self.write_hook("install", "#!/bin/bash\nexit 1")
++        result = yield self.workflow.fire_transition("install")
++        self.assertFalse(result)
++        current_state = yield self.workflow.get_state()
++        yield self.assertEqual(current_state, "install_error")
++
++        result = yield self.workflow.fire_transition("retry_install_hook")
++        yield self.assertState(self.workflow, "install_error")
++
++        self.write_hook("install", "#!/bin/bash\necho hello\n")
++        hook_deferred = self.wait_on_hook("install")
++        result = yield self.workflow.fire_transition_alias("retry_hook")
++        yield hook_deferred
++        yield self.assertState(self.workflow, "installed")
++
++    @inlineCallbacks
      def test_start(self):
          file_path = self.makeFile()
          self.write_hook(
@@ -131,12 +150,41 @@
          current_state = yield self.workflow.get_state()
          self.assertEqual(current_state, "start_error")
--        self.write_hook("start", "#!/bin/bash\necho hello\n")
          result = yield self.workflow.fire_transition("retry_start")
          yield self.assertState(self.workflow, "started")
--        # If we don't stop, we'll end up with the relation lifecycle
--        # watches firing in the background when the test stops.
--        self.write_hook("stop", "#!/bin/bash\necho hello\n")
++
++        # If we don't stop, we'll end up with the relation lifecycle
++        # watches firing in the background when the test stops.
++        result = yield self.workflow.fire_transition("stop")
++        yield self.assertState(self.workflow, "stopped")
++
++    @inlineCallbacks
++    def test_start_error_with_retry_hook(self):
++        """Executing the start transition with a hook error, results in the
++        workflow going to the start_error state. The start can be retried.
++        """
++        self.write_hook("install", "#!/bin/bash\necho hello\n")
++        result = yield self.workflow.fire_transition("install")
++        self.assertTrue(result)
++        self.write_hook("start", "#!/bin/bash\nexit 1")
++        result = yield self.workflow.fire_transition("start")
++        self.assertFalse(result)
++        current_state = yield self.workflow.get_state()
++        self.assertEqual(current_state, "start_error")
++
++        hook_deferred = self.wait_on_hook("start")
++        result = yield self.workflow.fire_transition("retry_start_hook")
++        yield hook_deferred
++        yield self.assertState(self.workflow, "start_error")
++
++        self.write_hook("start", "#!/bin/bash\nexit 0")
++        hook_deferred = self.wait_on_hook("start")
++        result = yield self.workflow.fire_transition_alias("retry_hook")
++        yield hook_deferred
++        yield self.assertState(self.workflow, "started")
++
++        # If we don't stop, we'll end up with the relation lifecycle
++        # watches firing in the background when the test stops.
          result = yield self.workflow.fire_transition("stop")
          yield self.assertState(self.workflow, "stopped")
@@ -193,13 +241,49 @@
          yield self.assertState(self.workflow, "configure_error")
          # Verify recovery from error state
++        result = yield self.workflow.fire_transition_alias("retry")
++        self.assertTrue(result)
++        yield self.assertState(self.workflow, "started")
++
++        # Stop any background processing
++        yield self.workflow.fire_transition("stop")
++
++    @inlineCallbacks
++    def test_configure_error_and_retry_hook(self):
++        """An error while configuring, transitions the unit and
++        stops the lifecycle."""
++        #self.capture_output()
++        yield self.workflow.fire_transition("install")
++        result = yield self.workflow.fire_transition("start")
++        self.assertTrue(result)
++        self.assertState(self.workflow, "started")
++
++        # Verify transition to error state
++        hook_deferred = self.wait_on_hook("config-changed")
++        self.write_hook("config-changed", "#!/bin/bash\nexit 1")
++        result = yield self.workflow.fire_transition("reconfigure")
++        yield hook_deferred
++        self.assertFalse(result)
++        yield self.assertState(self.workflow, "configure_error")
++
++        # Verify retry hook with hook error stays in error state
++        hook_deferred = self.wait_on_hook("config-changed")
++        result = yield self.workflow.fire_transition("retry_configure_hook")
++
++        self.assertFalse(result)
++        yield hook_deferred
++        yield self.assertState(self.workflow, "configure_error")
++
          hook_deferred = self.wait_on_hook("config-changed")
          self.write_hook("config-changed", "#!/bin/bash\nexit 0")
--        result = yield self.workflow.fire_transition("retry_configure")
++        result = yield self.workflow.fire_transition_alias("retry_hook")
          yield hook_deferred
--        self.assertTrue(result)
          yield self.assertState(self.workflow, "started")
++        # Stop any background processing
++        yield self.workflow.fire_transition("stop")
++        yield self.sleep(0.1)
++
      @inlineCallbacks
      def test_upgrade(self):
          """Upgrading a workflow results in the upgrade hook being
@@ -235,7 +319,7 @@
          yield self.workflow.fire_transition("stop")
      @inlineCallbacks
--    def test_upgrade_error_state(self):
++    def test_upgrade_error_retry(self):
          """A hook error during an upgrade transitions to
          upgrade_error.
          """
@@ -259,6 +343,41 @@
          yield self.workflow.fire_transition("retry_upgrade_formula")
          current_state = yield self.workflow.get_state()
          self.assertEqual(current_state, "started")
++
++        # Stop any background activity
++        yield self.workflow.fire_transition("stop")
++
++    @inlineCallbacks
++    def test_upgrade_error_retry_hook(self):
++        """A hook error during an upgrade transitions to
++        upgrade_error, and can be re-tried with hook execution.
++        """
++        yield self.workflow.fire_transition("install")
++        yield self.workflow.fire_transition("start")
++        current_state = yield self.workflow.get_state()
++        self.assertEqual(current_state, "started")
++
++        # Agent prepares this.
++        self.executor.stop()
++
++        self.write_hook("upgrade-formula", "#!/bin/bash\nexit 1")
++        hook_deferred = self.wait_on_hook("upgrade-formula")
++        yield self.workflow.fire_transition("upgrade_formula")
++        yield hook_deferred
++        current_state = yield self.workflow.get_state()
++        self.assertEqual(current_state, "formula_upgrade_error")
++
++        hook_deferred = self.wait_on_hook("upgrade-formula")
++        self.write_hook("upgrade-formula", "#!/bin/bash\nexit 0")
++        # The upgrade error hook should ensure that the executor is stopped.
++        self.assertFalse(self.executor.running)
++        yield self.workflow.fire_transition_alias("retry_hook")
++        yield hook_deferred
++        current_state = yield self.workflow.get_state()
++        self.assertEqual(current_state, "started")
++        self.assertTrue(self.executor.running)
++
++        # Stop any background activity
          yield self.workflow.fire_transition("stop")
      @inlineCallbacks
@@ -301,13 +420,36 @@
          self.assertTrue(result)
          result = yield self.workflow.fire_transition("start")
          self.assertTrue(result)
++
          self.write_hook("stop", "#!/bin/bash\nexit 1")
          result = yield self.workflow.fire_transition("stop")
          self.assertFalse(result)
--        current_state = yield self.workflow.get_state()
--        self.assertEqual(current_state, "stop_error")
++
++        yield self.assertState(self.workflow, "stop_error")
          self.write_hook("stop", "#!/bin/bash\necho hello\n")
          result = yield self.workflow.fire_transition("retry_stop")
++
++        yield self.assertState(self.workflow, "stopped")
++
++    @inlineCallbacks
++    def test_stop_error_with_retry_hook(self):
++        self.write_hook("install", "#!/bin/bash\necho hello\n")
++        self.write_hook("start", "#!/bin/bash\necho hello\n")
++        result = yield self.workflow.fire_transition("install")
++        self.assertTrue(result)
++        result = yield self.workflow.fire_transition("start")
++        self.assertTrue(result)
++
++        self.write_hook("stop", "#!/bin/bash\nexit 1")
++        result = yield self.workflow.fire_transition("stop")
++        self.assertFalse(result)
++        yield self.assertState(self.workflow, "stop_error")
++
++        result = yield self.workflow.fire_transition_alias("retry_hook")
++        yield self.assertState(self.workflow, "stop_error")
++
++        self.write_hook("stop", "#!/bin/bash\nexit 0")
++        result = yield self.workflow.fire_transition_alias("retry_hook")
          yield self.assertState(self.workflow, "stopped")
      @inlineCallbacks
@@ -447,7 +589,7 @@
          # Add a new unit, and wait for the broken hook to result in
          # the transition to the down state.
          yield self.add_opposite_service_unit(self.states)
--        yield self.wait_on_state(self.workflow, "down")
++        yield self.wait_on_state(self.workflow, "error")
          f_state, history, zk_state = yield self.read_persistent_state(
              history_id=self.workflow.zk_state_id)
@@ -458,7 +600,7 @@
                           "formula", "hooks", "app-relation-changed"))
          self.assertEqual(f_state,
--                         {"state": "down",
++                         {"state": "error",
                            "state_variables": {
                                "change_type": "joined",
                                "error_message": error}})
 === modified file 'ensemble/unit/workflow.py'
 --- ensemble/unit/workflow.py	2011-05-05 14:31:43 +0000
 +++ ensemble/unit/workflow.py	2011-05-05 17:43:23 +0000
@@ -14,31 +14,56 @@
  UnitWorkflow = Workflow(
++    # Install transitions
      Transition("install", "Install", None, "installed",
                 error_transition_id="error_install"),
--    Transition("error_install", "Install Error", None, "install_error"),
--    Transition("retry_install", "Retry Install", "install_error", "installed"),
++    Transition("error_install", "Install error", None, "install_error"),
++
++    Transition("retry_install", "Retry install", "install_error", "installed",
++               alias="retry"),
++    Transition("retry_install_hook", "Retry install with hook",
++               "install_error", "installed", alias="retry_hook"),
++
++    # Start transitions
      Transition("start", "Start", "installed", "started",
                 error_transition_id="error_start"),
--    Transition("error_start", "Start Error", "installed", "start_error"),
--    Transition("retry_start", "Retry Start", "start_error", "started"),
++    Transition("error_start", "Start error", "installed", "start_error"),
++    Transition("retry_start", "Retry start", "start_error", "started",
++               alias="retry"),
++    Transition("retry_start_hook", "Retry start with hook",
++              "start_error", "started",  alias="retry_hook"),
++
++    # Stop transitions
      Transition("stop", "Stop", "started", "stopped",
                 error_transition_id="error_stop"),
--    Transition("error_stop", "Stop Error", "started", "stop_error"),
--    Transition("retry_stop", "Retry Stop", "stop_error", "stopped"),
--
--    # Upgrade Transitions (stay in state, with success transitition)
++    Transition("error_stop", "Stop error", "started", "stop_error"),
++    Transition("retry_stop", "Retry stop", "stop_error", "stopped",
++               alias="retry"),
++    Transition("retry_stop_hook", "Retry stop with hook",
++               "stop_error", "stopped", alias="retry_hook"),
++
++    # Restart transitions
++    Transition("restart", "Restart", "stop", "start",
++               error_transition_id="error_start", alias="retry"),
++    Transition("restart_with_hook", "Restart with hook",
++               "stop", "start", alias="retry_hook",
++               error_transition_id="error_start"),
++
++    # Upgrade transitions
      Transition(
          "upgrade_formula", "Upgrade", "started", "started",
          error_transition_id="upgrade_formula_error"),
      Transition(
--        "upgrade_formula_error", "On upgrade error",
++        "upgrade_formula_error", "Upgrade from stop error",
          "started", "formula_upgrade_error"),
      Transition(
--        "retry_upgrade_formula", "Retry failed upgrade",
--        "formula_upgrade_error", "started"),
++        "retry_upgrade_formula", "Upgrade from stop error",
++        "formula_upgrade_error", "started", alias="retry"),
++    Transition(
++        "retry_upgrade_formula_hook", "Upgrade from stop error with hook",
++        "formula_upgrade_error", "started", alias="retry_hook"),
--    # Configuration Transitions (stay in state, with success transition)
++    # Configuration Transitions
      Transition(
          "reconfigure", "Reconfigure", "started", "started",
          error_transition_id="error_configure"),
@@ -46,28 +71,56 @@
          "error_configure", "On configure error",
          "started", "configure_error"),
      Transition(
++        "retry_error", "On retry configure error",
++        "configure_error", "configure_error"),
++    Transition(
          "retry_configure", "Retry configure",
--        "configure_error", "started")
++        "configure_error", "started", alias="retry",
++        error_transition_id="retry_error"),
++    Transition(
++        "retry_configure_hook", "Retry configure with hooks",
++        "configure_error", "started", alias="retry_hook",
++        error_transition_id="retry_error")
+     )
--# There's been some discussion, if we should have per change type error states
--# here, corresponding to the different changes that the relation-changed hook
--# is invoked for. The important aspects to capture are both observability of
--# error type locally and globally (zk), and per error type and instance
--# recovery of the same. To provide for this functionality without additional
--# states, the error information (change type, and error message) are captured
--# in state variables which are locally and globally observable. Future
--# extension of the restart transition action, will allow for customized
--# recovery based on the change type state variable. Effectively this
--# differs from the unit definition, in that it collapses three possible error
--# states, into a behavior off switch. A separate state will be needed to
--# denote departing.
++# Unit relation error states
++#
++# There's been some discussion, if we should have per change type
++# error states here, corresponding to the different changes that the
++# relation-changed hook is invoked for. The important aspects to
++# capture are both observability of error type locally and globally
++# (zk), and per error type and instance recovery of the same. To
++# provide for this functionality without additional states, the error
++# information (change type, and error message) are captured in state
++# variables which are locally and globally observable. Future
++# extension of the restart transition action, will allow for
++# customized recovery based on the change type state
++# variable. Effectively this differs from the unit definition, in that
++# it collapses three possible error states, into a behavior off
++# switch. A separate state will be needed to denote departing.
++
++
++# Process recovery using on disk workflow state
++#
++# Another interesting issue: process recovery using the on-disk state
++# is complicated by consistency with the in-memory state, which
++# won't be directly recoverable anymore without some state-specific
++# semantics for recovering from on-disk state. For example, a restarted
++# unit agent with a relation in an error state would require special
++# semantics around loading from disk to ensure that the in-memory
++# process state (watching and scheduling but not executing) matches
++# the recovery transition actions (which just restart hook execution,
++# but assume the watch continues). This functionality was added to
++# better allow for the behavior that, while down due to a hook error,
++# the relation continues to schedule pending hooks.
  RelationWorkflow = Workflow(
      Transition("start", "Start", None, "up"),
      Transition("stop", "Stop", "up", "down"),
--    Transition("restart", "Restart", "down", "up"),
++    Transition("restart", "Restart", "down", "up", alias="retry"),
++    Transition("error", "Relation hook error", "up", "error"),
++    Transition("reset", "Recover from hook error", "error", "up"),
      Transition("depart", "Relation broken", "up", "departed"),
      Transition("down_depart", "Relation broken", "down", "departed"),
+     )
@@ -223,7 +276,6 @@
          row per entry with CSV escaping.
          """
          state_serialized = yaml.safe_dump(state_dict)
--
          # State File
          with open(self.state_file_path, "w") as handle:
              handle.write(state_serialized)
@@ -243,6 +295,7 @@
              return {"state": None}
          with open(self.state_file_path, "r") as handle:
              content = handle.read()
++
          return yaml.load(content)
@@ -282,51 +335,77 @@
          self._lifecycle = lifecycle
      @inlineCallbacks
--    def _invoke_lifecycle(self, method):
++    def _invoke_lifecycle(self, method, *args, **kw):
          try:
--            result = yield method()
++            result = yield method(*args, **kw)
          except (FileNotFound, FormulaError, FormulaInvocationError), e:
              raise TransitionError(e)
          returnValue(result)
--    # Transition Actions
++    # Install transitions
      def do_install(self):
          return self._invoke_lifecycle(self._lifecycle.install)
++    def do_retry_install(self):
++        return self._invoke_lifecycle(self._lifecycle.install,
++                                      fire_hooks=False)
++
++    def do_retry_install_hook(self):
++        return self._invoke_lifecycle(self._lifecycle.install)
++
++    # Start transitions
      def do_start(self):
          return self._invoke_lifecycle(self._lifecycle.start)
--    def do_stop(self):
--        return self._invoke_lifecycle(self._lifecycle.stop)
--
      def do_retry_start(self):
++        return self._invoke_lifecycle(self._lifecycle.start,
++                                      fire_hooks=False)
++
++    def do_retry_start_hook(self):
          return self._invoke_lifecycle(self._lifecycle.start)
++    # Stop transitions
++    def do_stop(self):
++        return self._invoke_lifecycle(self._lifecycle.stop)
++
      def do_retry_stop(self):
--        self._invoke_lifecycle(self._lifecycle.stop)
--
--    def do_retry_install(self):
--        return self._invoke_lifecycle(self._lifecycle.install)
++        return self._invoke_lifecycle(self._lifecycle.stop,
++                                      fire_hooks=False)
++
++    def do_retry_stop_hook(self):
++        return self._invoke_lifecycle(self._lifecycle.stop)
++
++    # Upgrade transitions
++    def do_upgrade_formula(self):
++        return self._invoke_lifecycle(self._lifecycle.upgrade_formula)
      def do_retry_upgrade_formula(self):
--        return self._invoke_lifecycle(self._lifecycle.upgrade_formula)
--
--    def do_upgrade_formula(self):
--        return self._invoke_lifecycle(self._lifecycle.upgrade_formula)
--
--    # Some of this needs support from the resolved branches, as we
--    # want to fire some of these lifecycle methods sans hooks.
++        return self._invoke_lifecycle(self._lifecycle.upgrade_formula,
++                                      fire_hooks=False)
++
++    def do_retry_upgrade_formula_hook(self):
++        return self._invoke_lifecycle(self._lifecycle.upgrade_formula)
++
++    # Config transitions
      def do_error_configure(self):
--        # self._invoke_lifecycle(self._lifecycle.stop, fire_hooks=False)
--        pass
++        return self._invoke_lifecycle(self._lifecycle.stop, fire_hooks=False)
      def do_reconfigure(self):
          return self._invoke_lifecycle(self._lifecycle.configure)
++    def do_retry_error(self):
++        return self._invoke_lifecycle(self._lifecycle.stop, fire_hooks=False)
++
++    @inlineCallbacks
      def do_retry_configure(self):
--        # self._invoke_lifecycle(self._lifecycle.start, fire_hooks=False)
--        self._invoke_lifecycle(
--            self._lifecycle.configure)  # fire_hooks=False)
++        yield self._invoke_lifecycle(self._lifecycle.start, fire_hooks=False)
++        yield self._invoke_lifecycle(self._lifecycle.configure,
++                                     fire_hooks=False)
++
++    @inlineCallbacks
++    def do_retry_configure_hook(self):
++        yield self._invoke_lifecycle(self._lifecycle.start, fire_hooks=False)
++        yield self._invoke_lifecycle(self._lifecycle.configure)
  class RelationWorkflowState(DiskWorkflowState):
@@ -360,7 +439,7 @@
          @param: error: The error from hook invocation.
          """
--        yield self.fire_transition("stop",
++        yield self.fire_transition("error",
                                     change_type=relation_change.change_type,
                                     error_message=str(error))
@@ -369,12 +448,30 @@
          """Transition the workflow to the 'down' state.
          Turns off the unit-relation lifecycle monitoring and hook execution.
++
++        :param error_info: If called on relation hook error, contains
++        error variables.
          """
          yield self._lifecycle.stop()
      @inlineCallbacks
++    def do_reset(self):
++        """Transition the workflow to the 'up' state from an error state.
++
++        Turns on the unit-relation lifecycle monitoring and hook execution.
++        """
++        yield self._lifecycle.start(watches=False)
++
++    @inlineCallbacks
++    def do_error(self, **error_info):
++        """A relation hook error stops further hook execution but
++        continues to watch for changes.
++        """
++        yield self._lifecycle.stop(watches=False)
++
++    @inlineCallbacks
      def do_restart(self):
--        """Transition the workflow to the 'up' state.
++        """Transition the workflow to the 'up' state from the down state.
          Turns on the unit-relation lifecycle monitoring and hook execution.
          """

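For readers following the `alias="retry"` / `alias="retry_hook"` arguments in the workflow definition above, here is a minimal sketch of how an alias might be resolved to a state-specific transition id. It is an illustration only: `SketchWorkflow` and `resolve_alias` are hypothetical names, the `Transition` class here is a stand-in, and none of this is taken from ensemble's actual `fire_transition_alias` implementation, which is not shown in this diff. It is written to run on Python 3, unlike the Python 2 code in the branch.

```python
# Hypothetical sketch; not ensemble's real Workflow/Transition classes.
class Transition(object):
    """Stand-in holding the same positional fields used in the diff."""

    def __init__(self, transition_id, label, source, destination,
                 alias=None):
        self.transition_id = transition_id
        self.label = label
        self.source = source
        self.destination = destination
        self.alias = alias


class SketchWorkflow(object):
    """Holds transitions and maps (current state, alias) to one id."""

    def __init__(self, *transitions):
        self.transitions = transitions

    def resolve_alias(self, current_state, alias):
        # Only transitions leaving the current state are candidates, so
        # the same alias can name a different transition in each state.
        matches = [t.transition_id for t in self.transitions
                   if t.source == current_state and t.alias == alias]
        if len(matches) != 1:
            raise LookupError(
                "expected one %r transition from %r, found %d"
                % (alias, current_state, len(matches)))
        return matches[0]


# A subset of the unit transitions, copied from the diff above.
workflow = SketchWorkflow(
    Transition("retry_install", "Retry install",
               "install_error", "installed", alias="retry"),
    Transition("retry_install_hook", "Retry install with hook",
               "install_error", "installed", alias="retry_hook"),
    Transition("retry_start", "Retry start",
               "start_error", "started", alias="retry"),
    Transition("retry_start_hook", "Retry start with hook",
               "start_error", "started", alias="retry_hook"),
)

print(workflow.resolve_alias("start_error", "retry_hook"))
```

Under this reading, a caller that wants to recover a unit only needs to know whether hooks should run again ("retry_hook") or not ("retry"); the current error state selects the concrete transition, which matches how the tests call `fire_transition_alias("retry_hook")` from several different error states.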