pyjuju

docs/source/internals/unit-agent-persistence.rst (+138/-0)
juju/agents/tests/test_unit.py (+79/-141)
juju/agents/unit.py (+59/-136)
juju/control/tests/test_resolved.py (+16/-8)
juju/control/tests/test_status.py (+11/-37)
juju/control/tests/test_upgrade_charm.py (+4/-2)
juju/errors.py (+11/-1)
juju/lib/statemachine.py (+105/-28)
juju/lib/tests/test_statemachine.py (+214/-76)
juju/state/service.py (+10/-10)
juju/tests/test_errors.py (+31/-9)
juju/unit/lifecycle.py (+122/-24)
juju/unit/tests/test_lifecycle.py (+121/-50)
juju/unit/tests/test_workflow.py (+435/-227)
juju/unit/workflow.py (+68/-59)

To merge this branch:

bzr merge lp:~fwereade/pyjuju/fix-charm-upgrade

Related bugs:

Bug #903018: charm upgrade is dangerous

Undecided

Fix Released

Reviewer	Date Requested	Status
Kapil Thangavelu (community)		Approve on 2012-02-21
Jim Baker (community)	2011-12-12	Approve on 2012-01-09
Review via email: mp+85271@code.launchpad.net

Description of the change

I'm pretty sure that charm upgrades will now:

* fail silently, as before, on early errors (when no changes have been made to any state apart from the upgrade flag)
* error out on early errors if recovering from an earlier failed operation (the *whole* *thing* needs to complete successfully...)
* induce charm_upgrade_error workflow states, when the workflow comes up in a "started" state but midway through a charm upgrade
* do the Right Thing re fire_hooks, which is to *ignore* fire_hooks in any invocation in which the charm is actually written to disk; in this case, it is vital to *always* fire the upgraded-charm hook, because we guarantee it's the *first* one fired after the operation completes.
* actually overwrite old charms, instead of just unpacking into the same directory (and thus leaving droppings).

Jim Baker (jimbaker) wrote on 2012-01-09:

+1, looks good to me. The only thing:

$ pep8 juju/unit
juju/unit/tests/test_lifecycle.py:830:1: E303 too many blank lines (3)
juju/unit/tests/test_workflow.py:619:80: E501 line too long (80 characters)

review: Approve

Kapil Thangavelu (hazmat) wrote on 2012-01-11:

[0]

+ def catch_up(self):
+ # There's nothing we can wait on to determine when all the consequences
+ # of the cb_watch_upgrade_flag have come to pass; this seems to be a
+ # "reliable workaround"...
+ for _ in range(10):
+ yield self.poke_zk()

Eek, sleep-and-pray is not kosher ;-). Tests need a definitive observation point: either use the existing wait_on_state helper or create a wait-on-log-message helper to track the doneness of the cb_watch_upgrade_flag callback.

[1]

Charm directories can contain charm state, i.e. they're not read-only, so they can't be overwritten wholesale. They'd need a manifest on install to delta-compute which files to delete on upgrade. I'd go ahead and punt on this one for now; it can be handled separately.

review: Needs Fixing

William Reade (fwereade) wrote on 2012-01-11:

[0]

Replaced with a wait_for_log method.
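
The idea of waiting on a log message, rather than poking ZK a fixed number of times, can be sketched outside Twisted as follows. This is only an illustration under stated assumptions: it uses the standard library's asyncio and logging in place of Twisted and the juju test helpers, and the logger name "demo.agent" and the helper's signature are made up for the example (the branch's real helper appears in the diff below).

```python
import asyncio
import io
import logging


def wait_for_log(logger_name, message, level=logging.DEBUG, interval=0.01):
    """Start capturing `logger_name` now; return an awaitable that
    completes once `message` has been logged (hypothetical helper)."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setLevel(level)
    logger = logging.getLogger(logger_name)
    logger.setLevel(level)
    # Attach the handler immediately, so nothing logged before the
    # caller awaits the result can be missed.
    logger.addHandler(handler)

    async def poll():
        try:
            while message not in stream.getvalue():
                await asyncio.sleep(interval)
        finally:
            logger.removeHandler(handler)

    return poll()


async def demo():
    log_written = wait_for_log("demo.agent", "Upgrade complete")

    async def worker():
        # Stands in for the agent finishing its upgrade some time later.
        await asyncio.sleep(0.02)
        logging.getLogger("demo.agent").debug("Upgrade complete")

    # The timeout turns a missing message into a test failure
    # instead of a hang.
    await asyncio.gather(worker(), asyncio.wait_for(log_written, timeout=1.0))
    return True


result = asyncio.run(demo())
```

The test waits for exactly as long as the operation takes, and fails with a timeout if the message never appears.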

[1]

Fixed.

lp:~fwereade/pyjuju/fix-charm-upgrade updated on 2012-01-12

446. By William Reade on 2012-01-11: merge parent
447. By William Reade on 2012-01-12: merge parent

Kapil Thangavelu (hazmat) wrote on 2012-01-13:

[2]

+ returnValue(state_dict.get("state"))

This seems suspect: we have some state data, but it's invalid, and we transparently assume it's a new/starting workflow. What if this was a previously initialized system? We'd be transparently resetting it without warning.

[3]

    @property
    def _upgrade_flag(self):
        return "/units/%s/upgrade" % self._internal_id

For clarity, this should be suffixed with _path.

[4]

There don't appear to be any tests for charm upgrade prepare failing in the upgrade charm lifecycle method, but irrelevant given the following.

[5]

as discussed in person, let's remove the first_attempt logic, any error in the upgrade transition should result in an error state.

lp:~fwereade/pyjuju/fix-charm-upgrade updated on 2012-01-16

448. By William Reade on 2012-01-16: added upgrade_charm_ready and charm_replaced states, along with suitable transitions and error states

William Reade (fwereade) wrote on 2012-01-16:

[2]

Hmm, very sensible, not sure why I did that :-/.

[3]

Sounds good, done.

[4/5]

OK, following discussion, I've broken the upgrade process into 3 basic transitions (plus associated error/retry ones): "begin_charm_upgrade", "replace_charm", "finish_charm_upgrade", with 2 new (non-error) states: "charm_upgrade_ready" and "charm_replaced".

There's no more first_attempt malarkey, BUT the "begin_charm_upgrade" transition will raise an error if there is no new charm available, hence aborting the upgrade operation (and keeping the workflow in state "started"); and thereby ensuring that any upgrade operation that makes it to the "charm_upgrade_ready" state represents a real upgrade, and must therefore pass through "replace_charm" and "finish_charm_upgrade" before it returns to the "started" state.
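
The three-step pipeline described above can be pictured as a small transition table. The sketch below is not the branch's code: the transition and state names for the happy path are taken from this comment, while the table layout, the `fire` helper and the two "*_error" state names are assumptions made for illustration.

```python
class TransitionError(Exception):
    """Raised when a transition is fired from the wrong state."""


# transition id -> (source state, state on success, state on failure).
# The "*_error" names for the two later steps are placeholders.
TRANSITIONS = {
    "begin_charm_upgrade": ("started", "charm_upgrade_ready", "started"),
    "replace_charm": (
        "charm_upgrade_ready", "charm_replaced", "replace_charm_error"),
    "finish_charm_upgrade": (
        "charm_replaced", "started", "finish_charm_upgrade_error"),
}


def fire(state, transition_id, action):
    """Run `action` and return the state the workflow ends up in."""
    source, on_success, on_failure = TRANSITIONS[transition_id]
    if state != source:
        raise TransitionError(
            "%r cannot be fired from %r" % (transition_id, state))
    try:
        action()
    except Exception:
        return on_failure
    return on_success


def no_new_charm():
    raise RuntimeError("no new charm available")


def succeed():
    pass


# An upgrade with nothing to upgrade to backs out to "started".
aborted = fire("started", "begin_charm_upgrade", no_new_charm)

# A real upgrade must pass through every step to get back to "started".
path = ["started"]
for transition_id in (
        "begin_charm_upgrade", "replace_charm", "finish_charm_upgrade"):
    path.append(fire(path[-1], transition_id, succeed))
```

Only the first transition can fail back to "started"; once the workflow reaches "charm_upgrade_ready" the only way home is through the remaining two steps.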

lp:~fwereade/pyjuju/fix-charm-upgrade updated on 2012-01-30

449. By William Reade on 2012-01-16: address review point
450. By William Reade on 2012-01-16: address final review point
451. By William Reade on 2012-01-23: merge parent
452. By William Reade on 2012-01-30: merge parent

William Reade (fwereade) wrote on 2012-01-31:

Forces in play/further justifications for chosen approach:

1) The use of a state variable for the critical section was criticised as being a bit surprising/magical, so it seemed better to store states as, well, states.
2) first_attempt was also criticised as too complex, so I dropped the silent-early-failure mode (ie once we know an upgrade is a sensible thing to do, we're locked into it and can't escape).
3) I was reluctant to move too much logic into the top-level callback for the upgrade flag watch. It is true that putting the check of service-charm-id/unit-charm-id in there would indeed eliminate the first state transition, but it seemed cleanest to just try the transition and allow the state machinery to handle backing out in response to the lifecycle's complaints.
4) Splitting upgrade-charm and run-hook into two distinct transitions is definitely a win, because (i) the possible errors in the two states are very different, and it's good to distinguish them; (ii) breaking the process into a pipeline like this eliminated further complexity in the original UL.upgrade_charm method (we handle all state tracking with explicit state machine states).
5) The repeated check for service/unit charm ids in UL.replace_charm is not *necessary*, but it seemed helpful to avoid redownloading and reinstalling exactly the same bits. We could drop it without doing any actual *harm*.

Summary: we agreed on having 2 transitions to handle the process of upgrading; farming the initial can-we-actually-upgrade check out to a third new transition, which fails back out to "started", keeps responsibility for state-tracking in the workflow where, IMO, it should be anyway. I don't consider the additional workflow states to represent serious complexity: the only other state it interacts with is "started".

Aside: rereading IRC, I may have missed something re: reentrancy concerns. Each of the transition methods on UL is, AFAICT, safe for restart midway through; and the workflow is responsible for ensuring we only run them at sensible times.

Does any of this help at all?

Kapil Thangavelu (hazmat) wrote on 2012-01-31:

Excerpts from William Reade's message of Tue Jan 31 09:11:25 UTC 2012:
> Forces in play/further justifications for chosen approach:
>
> 1) The use of a state variable for the critical section was criticised as being a bit surprising/magical, so it seemed better to store states as, well, states.

Indeed, it does, but as gustavo pointed out in irc, it's really for lack of a WAL
(write-ahead log) transition (either disk or zk) that it's needed at all, and it's
a better fit to model it as such than as additional state/transitions. Its main
purpose is to signal the transition that's currently in play to allow for recovery
and re-execution of that transition instead of merely entering the last recorded
state.

> 2) first_attempt was also criticised as too complex, so I dropped the silent-early-failure mode (ie once we know an upgrade is a sensible thing to do, we're locked into it and can't escape).

yeah.. the first_attempt distinction was never checked by the tests.

> 3) I was reluctant to move too much logic into the top-level callback for the upgrade flag watch. It is true that putting the check of service-charm-id/unit-charm-id in there would indeed eliminate the first state transition, but it seemed cleanest to just try the transition and allow the state machinery to handle backing out in response to the lifecycle's complaints.

i think the refactoring that's been done to subsume some of the callback
into the transition is good. Capturing the entire operation into the
workflow machinery has some benefits wrt status reporting/recording failures.

> 4) Splitting upgrade-charm and run-hook into two distinct transitions is definitely a win, because (i) the possible errors in the two states are very different, and it's good to distinguish them; (ii) breaking the process into a pipeline like this eliminated further complexity in the original UL.upgrade_charm method (we handle all state tracking with explicit state machine states

The question is: is the recovery harmed by re-executing the whole transition? I.e.
what's the value, from an error recovery point of view, of the separate states?

> 5) The repeated check for service/unit charm ids in UL.replace_charm is not *necessary*, but it seemed helpful to avoid redownloading and reinstalling exactly the same bits. We could drop it without doing anything any actual *harm*.
>

I don't feel this is really a big deal, given its simplicity. There's an
open bug out for creating a formula cache that could alleviate the redownload.

> Summary: we agreed on having 2 transitions to handle the process of upgrading; farming the initial can-we-actually-upgrade check out to a third new transition, which fails back out to "started", keeps responsibility for state-tracking in the workflow where, IMO, it should be anyway. I don't consider the additional workflow states to represent serious complexity: the only other state it interacts with is "started".
>

Yeah.. i've back-tracked on this looking over the implementation and
given gustavo's WAL insight. It feels like the additional states and
transitions are just adding complexity, esp while juggling the executor. It also
doesn't seem like a benefit to a user to have to understand the semantic
differences between the various upgrade states. The previous version of the
branch would work well afaics given a removal of the first_attempt logic.

The removal of the started state variable could follow along with a generic 
transition WAL.

> Aside: rereading IRC, I may have missed something re: reentrancy concerns. Each of the transition methods on UL is, AFAICT, safe for restart midway through; and the workflow is responsible for ensuring we only run them at sensible times.
> 
> Does any of this help at all?

there are three different lifecycle methods all coordinating around the 
executor global variable, and those lifecycle methods may be called by 
any of seven different upgrade transitions (across the four states). Those 
states would all need to appropriately reset the executor for recovery/startup.

Given that combining those lifecycle methods into a single method/entry point
doesn't detract from their functionality at all, it seems like a clear win to me
for simplicity to just have the one lifecycle method, two transitions, and one
state.

lp:~fwereade/pyjuju/fix-charm-upgrade updated on 2012-02-02

453. By William Reade on 2012-02-02: manually back out r448 without screwing up future merges, I hope
454. By William Reade on 2012-02-02: remove UL.upgrade_charm's first_attempt kwarg

William Reade (fwereade) wrote on 2012-02-02:

Still prefer the old way, but done now :).

lp:~fwereade/pyjuju/fix-charm-upgrade updated on 2012-02-08

455. By William Reade on 2012-02-08: merge parent

Kapil Thangavelu (hazmat) wrote on 2012-02-08:

[6]

As we discussed over g+, this is looking good. The main thing is that the callback/deferred passing should be removed in favor of having lib/statemachine.py do saving/clearing of transitions in progress against state variables. This will allow generic recovery of any inflight transition, perhaps with special casing in synchronize.

It should also obviate the need for _notify_upgrading, and upgrade_charm's cb_upgrading.

@inlineCallbacks
def _notify_upgrading(self, is_upgrading):

[7] just a comment, looking at upgrade_charm's implementation.

The distinction between upgrade.prepare, upgrade.run, and upgrade.ready looks a little superficial. The upgrade.prepare should work or raise an error. On a retry of an upgrade-charm error state we should always ..
This also has the interesting side effect of firing hooks on juju resolved if we're just now downloading the charm successfully, but that seems correct.

lp:~fwereade/pyjuju/fix-charm-upgrade updated on 2012-02-21

456. By William Reade on 2012-02-21

WorkflowState now (1) has a synchronize method that re-runs inflight
transitions after restart, and (2) requires external locking on
state-changing methods.

(1) is a required change due to the verdict on fix-charm-upgrade (that is:
we don't want additional states for the upgrade process, and we should use
write-ahead logging to maintain state across process death); (2) is
consequently required because of the interaction between the upgrade flag
watch and the resolved watch, both of which take `state == "started"` to
mean "it's safe to fire a transition". This was a pre-existing bug, but was
IMO exacerbated severely by the fact that we now never leave the "started"
state while upgrading a charm -- enough so that I'm not comfortable
proposing a patch that fails to address this issue.
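
The race between the two watches, and the lock/check/fire pattern that closes it, can be sketched as below. This is a toy model, not the juju code: asyncio.Lock stands in for Twisted's DeferredLock, `ToyWorkflow` is not WorkflowState, and the transition ids "upgrade" and "resolve" are invented for the example.

```python
import asyncio


class ToyWorkflow:
    """Toy stand-in for WorkflowState; callers must lock externally."""

    def __init__(self):
        self._lock = asyncio.Lock()
        self.state = "started"
        self.events = []

    def lock(self):
        return self._lock

    async def fire_transition(self, transition_id):
        # Code that forgets to take the lock fails loudly.
        assert self._lock.locked(), "caller must hold the workflow lock"
        self.events.append(("begin", transition_id))
        # Yield to other tasks, as a long-running hook would; note that
        # the state stays "started" for the whole transition.
        await asyncio.sleep(0)
        self.events.append(("end", transition_id))


async def watcher(workflow, transition_id):
    # Grab the lock, check the state, then fire: nothing can change
    # the state, or start another transition, in between.
    async with workflow.lock():
        if workflow.state == "started":
            await workflow.fire_transition(transition_id)


async def main():
    workflow = ToyWorkflow()
    await asyncio.gather(
        watcher(workflow, "upgrade"), watcher(workflow, "resolve"))
    return workflow.events


events = asyncio.run(main())
```

Because each watcher holds the lock across both the check and the transition, the second transition cannot begin until the first has ended, even though the state alone would have allowed it.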

This diff is *big*, but the vast majority of the changes are as a result of
the WorkflowState locking; in particular, many many tests set workflow state
in one way or another, and the necessary changes add a *lot* of noise. In
hindsight, the locking changes could have been made independently, but I
don't think the resulting pair of branches would actually be significantly
easier to deal with.

Various explanatory notes follow:

* External WorkflowState locking (rather than automagic internal locking) was
  chosen for simplicity's sake on the part of the client as well: when one
  wants to fire a transition conditional on some state being active, it makes
  sense to grab the lock, check, and fire the transition, in the certain
  knowledge that nobody can have changed state underneath you.

* The obvious problem with DeferredLock (that you can tell it's locked, but
  not who owns the lock) has minimal impact, thanks to the unit tests.
  Consider the following scenarios:

  * Code tries to lock, test not holding lock: we're fine.
  * Code tries to lock, test is holding lock: bad test; deadlocks.
  * Code fails to lock, test not holding lock: bad code: asserts.
  * Code fails to lock, test is holding lock: bad code AND bad test.

  The final scenario is the only dangerous one, but it's just a special
  case of the fact that you can *always* write a bad test that passes bad
  code; I think attempting to solve this is out of scope for this universe.

* WorkflowState.fire_transition sets the inflight transition once it knows
  the requested transition is valid, and only clears it explicitly if the
  transition fails without an error transition to follow up (when a
  transition succeeds, set_state implicitly clears inflight).

* WorkflowState.synchronize (1) detects and re-runs inflight transitions and
  (2) has taken over responsibility for the automatic transitions (eg
  None->installed->started) which had previously been handled in UWS/RWS.
  The overlapping concept of success_transition_id has been dropped;
  transitions can now be marked as "automatic".

* WorkflowState.set_inflight really wants to be private, but is exposed for
  testing's sake; get_inflight needs to be exposed so that UWS.synchronize
  doesn't inappropriately start the executor when recovering from mid-
  upgrade process death.

* Er...

* That's it.

R=
CC=
https://codereview.appspot.com/5647064
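
The inflight bookkeeping described in the notes above amounts to a small write-ahead log. The following is a rough sketch under stated assumptions, not the contents of lib/statemachine.py: persistence is a plain dict rather than ZK or disk, `ProcessDeath` is an invented exception that simulates the agent being killed, and automatic transitions and error transitions are left out.

```python
class ProcessDeath(BaseException):
    """Simulates the agent being killed mid-transition (illustrative)."""


class ToyWorkflow:
    """Toy model of inflight tracking; `store` outlives the instance."""

    def __init__(self, store, transitions, actions):
        self.store = store              # stands in for persistent state
        self.transitions = transitions  # id -> (source, destination)
        self.actions = actions          # id -> callable doing the work

    def get_state(self):
        return self.store.get("state")

    def get_inflight(self):
        return self.store.get("inflight")

    def set_state(self, state):
        # Reaching a state implicitly clears the inflight marker.
        self.store["state"] = state
        self.store.pop("inflight", None)

    def fire_transition(self, transition_id):
        source, destination = self.transitions[transition_id]
        if self.get_state() != source:
            raise ValueError("%r is not valid here" % (transition_id,))
        # Write ahead: record the intent before doing any work.
        self.store["inflight"] = transition_id
        try:
            self.actions[transition_id]()
        except Exception:
            # Ordinary failure with no error transition to follow up:
            # clear the marker explicitly.
            self.store.pop("inflight", None)
            raise
        self.set_state(destination)

    def synchronize(self):
        # After a restart, re-run whatever was interrupted.
        inflight = self.get_inflight()
        if inflight is not None:
            self.fire_transition(inflight)


store = {"state": "started"}
attempts = []


def upgrade():
    attempts.append(1)
    if len(attempts) == 1:
        raise ProcessDeath()  # die during the first attempt only


transitions = {"upgrade_charm": ("started", "started")}
actions = {"upgrade_charm": upgrade}

try:
    ToyWorkflow(store, transitions, actions).fire_transition("upgrade_charm")
except ProcessDeath:
    pass
inflight_after_crash = store.get("inflight")

# A fresh instance over the same store models the restarted agent.
restarted = ToyWorkflow(store, transitions, actions)
restarted.synchronize()
```

The state never leaves "started" in this model, so the inflight marker is the only record that an upgrade was interrupted; that is why synchronize has to consult it.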

457. By William Reade on 2012-02-21

break up synchronize test to avoid occasional timeout during full test runs

William Reade (fwereade) wrote on 2012-02-21:

OK, state-machine-sync has been merged in; are we good to go?

lp:~fwereade/pyjuju/fix-charm-upgrade updated on 2012-02-21

458. By William Reade on 2012-02-21: merge parent

Kapil Thangavelu (hazmat) wrote on 2012-02-21:

woohoo! make the magic happen :-)

review: Approve

Preview Diff

 === added file 'docs/source/internals/unit-agent-persistence.rst'
 --- docs/source/internals/unit-agent-persistence.rst	1970-01-01 00:00:00 +0000
 +++ docs/source/internals/unit-agent-persistence.rst	2012-02-21 10:23:29 +0000
@@ -0,0 +1,138 @@
++Notes on unit agent persistence
++===============================
++
++Introduction
++------------
++
++This was first written to explain the extensive changes made in the branch
++lp:~fwereade/juju/restart-transitions; that branch has been split out into
++four separate branches, but this discussion should remain a useful guide to
++the changes made to the unit and relation workflows and lifecycles in the
++course of making the unit agent resumable.
++
++
++Glossary
++--------
++
++UA = UnitAgent
++UL = UnitLifecycle
++UWS = UnitWorkflowState
++URL = UnitRelationLifecycle
++RWS = RelationWorkflowState
++URS = UnitRelationState
++SRS = ServiceRelationState
++
++
++Technical discussion
++--------------------
++
++Probably the most fundamental change is the addition of a "synchronize" method
++to both UWS and RWS. Calling "synchronize" should generally be *all* you need
++to do to put the workflow and associated components into "the right state"; ie
++ZK state will be restored, the appropriate lifecycle will be started (or not),
++and any initial transitions will automatically be fired ("start" for RWS;
++"install", "start" for UWS).
++
++The synchronize method keeps responsibility for the lifecycle's state purely in
++the hands of the workflow; once a workflow is synced, the *only* necessary
++interactions with it should be in response to changes in ZK.
++
++The disadvantage is that lifecycle "start" and "stop" methods have become a
++touch overloaded:
++
++* UL.stop(): now takes "stop_relations" in addition to "fire_hooks", in which
++    "stop_relations" being True causes the original behaviour (transition "up"
++    RWSs to "down", as when transitioning the UWS to a "stopped" or "error"
++    state), but False simply causes them to stop watching for changes (in
++    preparation for an orderly shutdown, for example).
++
++* UL.start(): now takes "start_relations" in addition to "fire_hooks", in which
++    the "start_relations" flag being True causes the original behaviour
++    (automatically transition "down" RWSs to "up", as when restarting/resolving
++    the UWS), while False causes the RWSs only to be synced.
++
++* URL.start(): now takes "scheduler" in addition to "watches", allowing the
++    watching and the contained HookScheduler to be controlled separately
++    (allowing us to actually perform the RWS synchronise correctly).
++
++* URL.stop(): still just takes "watches", because there wasn't a scenario in
++    which I wanted to stop the watches but not the HookScheduler.
++
++I still think it's a win, though: and I don't think that turning them into
++separate methods is the right way to go; "start" and "stop" remain perfectly
++decent and appropriate names for what they do.
++
++Now this has been done, we can always launch directly into whatever state we
++shut down in, and that's great, because sudden process death doesn't hurt us
++any more [0] [1]. Except... when we're upgrading a charm. It emerges that the
++charm upgrade state transition only covers the process of firing the hook, and
++not the process of actually upgrading the charm.
++
++In short, we had a mechanism, completely outside the workflow's purview, for
++potentially *brutal* modifications of state (both in terms of the charm itself,
++on disk, and also in that the hook executor should remain stopped forever while
++in "charm_upgrade_error" state); and this rather scuppered the "restart in the
++same state" goal. The obvious thing to do was to move the charm upgrade
++operation into the "charm_upgrade" transition, so we had a *chance* of being
++able to start in the correct state.
++
++UL.upgrade_charm, called by UWS, does itself have subtleties, but it should be
++reasonably clear when examined in context; the most important point is that it
++will call back at the start and end of the risky period, and that the UWS's
++handler for this callback sets a flag in "started"'s state_vars for the
++duration of the upgrade. If that flag is set when we subsequently start up
++again and synchronize the UWS, then we know to immediately force the
++charm_upgrade_error state and work from there.
++
++[0] Well, it does, because we need to persist more than just the (already-
++persisted) workflow state. This branch includes RWS persistence in the UL, as
++requested in this branch's first pre-review (back in the day...), but does not
++include HookScheduler persistence in the URLs, so it remains possible for
++relation hooks which have been queued, but not yet executed, to be lost if the
++process exits before the queue empties. That will be coming in another
++branch (resolve-unit-relation-diffs).
++
++[1] This seems like a good time to mention the UL's relation-broken handling
++for relations that went away while the process was stopped: every time
++._relations is changed, it writes out enough state to recreate a Frankenstein's
++URS object, which it can then use on load to reconstruct the necessary URL and
++hence RWS.
++
++We don't strictly need to *reconstruct* it in every case -- we can just use
++SRS.get_unit_state if the relation still exists -- but given that sometimes we
++do, it seemed senseless to have two code paths for the same operations. Of the
++RWSs we reconstruct, those with existing SRSs will be synchronized (because we
++know it's safe to do so), and the remainder will be stored untouched (because
++we know that _process_service_changes will fire the "depart" transition for us
++before doing anything else... and the "relation-broken" hook will be executed
++in a DepartedRelationHookContext, which is rather restricted, and so shouldn't
++cause the Frankenstein's URS to hit state we can't be sure exists).
++
++
++Appendix: a rough history of changes to restart-transitions
++-----------------------------------------------------------
++
++* Add UWS transitions from "stopped" to "started", so that process restarts can
++    be made to restart UWSs.
++* Upon review, add RWS persistence to UL, to ensure we can't miss
++    relation-broken hooks; as part of this, as discussed, add
++    DepartedRelationHookContext in which to execute them.
++* Upon discussion, discover that original UWS "started" -> "stopped" behaviour
++    on process shutdown is not actually the desired behaviour (and that the
++    associated RWS "up" -> "down" shouldn't happen either).
++* Make changes to UL.start/stop, and add UWS/RWS.synchronize, to allow us to
++    shut down workflows cleanly without transitions and bring them up again in
++    the same state.
++* Discover that we don't have any other reason to transition UWS to "stopped";
++    to actually fire stop hooks at the right time, we need a more sophisticated
++    system (possibly involving the machine agent telling the unit agent to shut
++    itself down). Remove the newly-added "restart" transitions, because they're
++    meaningless now; ponder what good it does us to have a "stopped" state that
++    we never actually enter; chicken out of actually removing it.
++* Realise that charm upgrades do an end-run around the whole UWS mechanism, and
++    resolve to integrate them so I can actually detect upgrades left incomplete
++    due to process death.
++* Move charm upgrade operation from agent into UL; come to appreciate the
++    subtleties of the charm upgrade process; make necessary tweaks to
++    UL.upgrade_charm, and UWS, to allow for synchronization of incomplete
++    upgrades.
 === modified file 'juju/agents/tests/test_unit.py'
 --- juju/agents/tests/test_unit.py	2011-12-16 09:23:31 +0000
 +++ juju/agents/tests/test_unit.py	2012-02-21 10:23:29 +0000
@@ -3,10 +3,9 @@
  import os
  import yaml
--from twisted.internet.defer import (
--    inlineCallbacks, returnValue, fail, Deferred)
++from twisted.internet.defer import inlineCallbacks, returnValue
--from juju.agents.unit import UnitAgent, CharmUpgradeOperation
++from juju.agents.unit import UnitAgent
  from juju.agents.base import TwistedOptionNamespace
  from juju.charm import get_charm_from_path
  from juju.charm.url import CharmURL
@@ -74,8 +73,10 @@
                  "stop", "#!/bin/bash\necho stop >> %s" % output_file)
          for k in kw.keys():
--            self.write_hook(k.replace("_", "-"),
--                            "#!/bin/bash\necho $0 >> %s" % output_file)
++            hook_name = k.replace("_", "-")
++            self.write_hook(
++                hook_name,
++                "#!/bin/bash\necho %s >> %s" % (hook_name, output_file))
          return output_file
@@ -136,7 +137,8 @@
              self.client, self.states["unit"], lifecycle,
              os.path.join(self.juju_directory, "state"))
--        yield workflow.fire_transition("install")
++        with (yield workflow.lock()):
++            yield workflow.fire_transition("install")
          yield lifecycle.stop(fire_hooks=False, stop_relations=False)
          yield self.agent.startService()
@@ -154,9 +156,10 @@
              self.client, self.states["unit"], lifecycle,
              os.path.join(self.juju_directory, "state"))
--        yield workflow.fire_transition("install")
--        self.write_exit_hook("stop", 1)
--        yield workflow.fire_transition("stop")
++        with (yield workflow.lock()):
++            yield workflow.fire_transition("install")
++            self.write_exit_hook("stop", 1)
++            yield workflow.fire_transition("stop")
          yield self.agent.startService()
          current_state = yield self.agent.workflow.get_state()
@@ -516,7 +519,8 @@
          yield hook_deferred
          hook_deferred = self.wait_on_hook("stop", executor=self.agent.executor)
--        yield self.agent.workflow.fire_transition("stop")
++        with (yield self.agent.workflow.lock()):
++            yield self.agent.workflow.fire_transition("stop")
          yield hook_deferred
          self.assertEqual("stop_error", (yield self.agent.workflow.get_state()))
@@ -544,7 +548,8 @@
          yield hook_deferred
          hook_deferred = self.wait_on_hook("stop", executor=self.agent.executor)
--        yield self.agent.workflow.fire_transition("stop")
++        with (yield self.agent.workflow.lock()):
++            yield self.agent.workflow.fire_transition("stop")
          yield hook_deferred
          self.assertEqual("stop_error", (yield self.agent.workflow.get_state()))
@@ -568,6 +573,12 @@
          self.makeDir(path=os.path.join(self.juju_directory, "charms"))
      @inlineCallbacks
++    def wait_for_log(self, logger_name, message, level=logging.DEBUG):
++        output = self.capture_logging(logger_name, level=level)
++        while message not in output.getvalue():
++            yield self.sleep(0.1)
++
++    @inlineCallbacks
      def mark_charm_upgrade(self):
          # Create a new version of the charm
          repository = self.increment_charm(self.charm)
@@ -592,158 +603,85 @@
          yield self.assertState(self.agent.workflow, "started")
      @inlineCallbacks
--    def test_agent_upgrade_watch_continues_on_unexpected_error(self):
--        """The agent watches for unit upgrades and continues if there is an
--        unexpected error."""
--        yield self.mark_charm_upgrade()
--        self.agent.set_watch_enabled(True)
--
--        output = self.capture_logging(
--            "juju.agents.unit", level=logging.DEBUG)
--
--        upgrade_done = Deferred()
--
--        def operation_has_run():
--            upgrade_done.callback(True)
--
--        operation = self.mocker.patch(CharmUpgradeOperation)
--        operation.run()
--
--        self.mocker.call(operation_has_run)
--        self.mocker.result(fail(ValueError("magic mouse")))
--        self.mocker.replay()
--
--        yield self.agent.startService()
--
--        yield upgrade_done
--        self.assertIn("Error while upgrading", output.getvalue())
--        self.assertIn("magic mouse", output.getvalue())
--
--        yield self.agent.workflow.fire_transition("stop")
--
--    @inlineCallbacks
      def test_agent_upgrade(self):
          """The agent can succesfully upgrade its charm."""
--        self.agent.set_watch_enabled(False)
--        yield self.agent.startService()
--
--        yield self.mark_charm_upgrade()
--
++        log_written = self.wait_for_log("juju.agents.unit", "Upgrade complete")
          hook_done = self.wait_on_hook(
              "upgrade-charm", executor=self.agent.executor)
--        self.write_hook("upgrade-charm", "#!/bin/bash\nexit 0")
--        output = self.capture_logging("unit.upgrade", level=logging.DEBUG)
--
--        # Do the upgrade
--        upgrade = CharmUpgradeOperation(self.agent)
--        value = yield upgrade.run()
--
--        # Verify the upgrade.
--        self.assertIdentical(value, True)
--        self.assertIn("Unit upgraded", output.getvalue())
++
++        self.agent.set_watch_enabled(True)
++        yield self.agent.startService()
++        yield self.mark_charm_upgrade()
          yield hook_done
++        yield log_written
++        self.assertIdentical(
++            (yield self.states["unit"].get_upgrade_flag()),
++            False)
          new_charm = get_charm_from_path(
              os.path.join(self.agent.unit_directory, "charm"))
--
          self.assertEqual(
              self.charm.get_revision() + 1, new_charm.get_revision())
      @inlineCallbacks
++    def test_agent_upgrade_version_current(self):
++        """If the unit is running the latest charm, do nothing."""
++        log_written = self.wait_for_log(
++            "juju.agents.unit",
++            "Upgrade ignored: already running latest charm")
++
++        old_charm_id = yield self.states["unit"].get_charm_id()
++        self.agent.set_watch_enabled(True)
++        yield self.agent.startService()
++        yield self.states["unit"].set_upgrade_flag()
++        yield log_written
++
++        self.assertIdentical(
++            (yield self.states["unit"].get_upgrade_flag()), False)
++        self.assertEquals(
++            (yield self.states["unit"].get_charm_id()), old_charm_id)
++
++    @inlineCallbacks
      def test_agent_upgrade_bad_unit_state(self):
--        """The an upgrade fails if the unit is in a bad state."""
--        self.agent.set_watch_enabled(False)
--        yield self.agent.startService()
--
++        """The upgrade fails if the unit is in a bad state."""
          # Upload a new version of the unit's charm
          repository = self.increment_charm(self.charm)
          charm = yield repository.find(CharmURL.parse("local:series/mysql"))
          charm, charm_state = yield self.publish_charm(charm.path)
++        old_charm_id = yield self.states["unit"].get_charm_id()
++
++        log_written = self.wait_for_log(
++            "juju.agents.unit",
++            "Cannot upgrade: unit is in non-started state configure_error. "
++            "Reissue upgrade command to try again.")
++        self.agent.set_watch_enabled(True)
++        yield self.agent.startService()
          # Mark the unit for upgrade, with an invalid state.
++        with (yield self.agent.workflow.lock()):
++            yield self.agent.workflow.fire_transition("error_configure")
          yield self.states["service"].set_charm_id(charm_state.id)
          yield self.states["unit"].set_upgrade_flag()
--        yield self.agent.workflow.set_state("start_error")
--
--        output = self.capture_logging("unit.upgrade", level=logging.DEBUG)
--
--        # Do the upgrade
--        upgrade = CharmUpgradeOperation(self.agent)
--        value = yield upgrade.run()
--
--        # Verify the upgrade.
--        self.assertIdentical(value, False)
--        self.assertIn("Unit not in an upgradeable state: start_error",
--                      output.getvalue())
++        yield log_written
++
          self.assertIdentical(
--            (yield self.states["unit"].get_upgrade_flag()),
--            False)
++            (yield self.states["unit"].get_upgrade_flag()), False)
++        self.assertEquals(
++            (yield self.states["unit"].get_charm_id()), old_charm_id)
      @inlineCallbacks
      def test_agent_upgrade_no_flag(self):
--        """An upgrade fails if there is no upgrade flag set."""
--        self.agent.set_watch_enabled(False)
--        yield self.agent.startService()
--        output = self.capture_logging("unit.upgrade", level=logging.DEBUG)
--        upgrade = CharmUpgradeOperation(self.agent)
--        value = yield upgrade.run()
--        self.assertIdentical(value, False)
--        self.assertIn("No upgrade flag set", output.getvalue())
--        yield self.agent.workflow.fire_transition("stop")
--
--    @inlineCallbacks
--    def test_agent_upgrade_version_current(self):
--        """An upgrade fails if the unit is running the latest charm."""
--        self.agent.set_watch_enabled(False)
--        yield self.agent.startService()
--        yield self.states["unit"].set_upgrade_flag()
--        output = self.capture_logging("unit.upgrade", level=logging.DEBUG)
--        upgrade = CharmUpgradeOperation(self.agent)
--        value = yield upgrade.run()
--        self.assertIdentical(value, True)
--        self.assertIn("Unit already running latest charm", output.getvalue())
--        self.assertFalse((yield self.states["unit"].get_upgrade_flag()))
--
--    @inlineCallbacks
--    def test_agent_upgrade_hook_failure(self):
--        """An upgrade fails if the upgrade hook errors."""
--        self.agent.set_watch_enabled(False)
--        yield self.agent.startService()
--
--        # Upload a new version of the unit's charm
--        repository = self.increment_charm(self.charm)
--        charm = yield repository.find(CharmURL.parse("local:series/mysql"))
--        charm, charm_state = yield self.publish_charm(charm.path)
--
--        # Mark the unit for upgrade
--        yield self.states["service"].set_charm_id(charm_state.id)
--        yield self.states["unit"].set_upgrade_flag()
--
--        hook_done = self.wait_on_hook(
--            "upgrade-charm", executor=self.agent.executor)
--        self.write_hook("upgrade-charm", "#!/bin/bash\nexit 1")
--        output = self.capture_logging("unit.upgrade", level=logging.DEBUG)
--
--        # Do the upgrade
--        upgrade = CharmUpgradeOperation(self.agent)
--        value = yield upgrade.run()
--
--        # Verify the failed upgrade.
--        self.assertIdentical(value, False)
--        self.assertIn("Invoking upgrade transition", output.getvalue())
--        self.assertIn("Upgrade failed.", output.getvalue())
--        yield hook_done
--
--        # Verify state
--        workflow_state = yield self.agent.workflow.get_state()
--        self.assertEqual("charm_upgrade_error", workflow_state)
--
--        # Verify new charm is in place
--        new_charm = get_charm_from_path(
--            os.path.join(self.agent.unit_directory, "charm"))
--
--        self.assertEqual(
--            self.charm.get_revision() + 1, new_charm.get_revision())
--
--        # Verify upgrade flag is cleared.
--        self.assertFalse((yield self.states["unit"].get_upgrade_flag()))
++        """An upgrade stops if there is no upgrade flag set."""
++        log_written = self.wait_for_log(
++            "juju.agents.unit", "No upgrade flag set")
++        old_charm_id = yield self.states["unit"].get_charm_id()
++        self.agent.set_watch_enabled(True)
++        yield self.agent.startService()
++        yield log_written
++
++        self.assertIdentical(
++            (yield self.states["unit"].get_upgrade_flag()),
++            False)
++        new_charm_id = yield self.states["unit"].get_charm_id()
++        self.assertEquals(new_charm_id, old_charm_id)
 === modified file 'juju/agents/unit.py'
 --- juju/agents/unit.py	2012-01-10 14:14:28 +0000
 +++ juju/agents/unit.py	2012-02-21 10:23:29 +0000
@@ -1,7 +1,5 @@
  import os
  import logging
--import shutil
--import tempfile
  from twisted.internet.defer import inlineCallbacks, returnValue
@@ -14,8 +12,6 @@
  from juju.unit.lifecycle import UnitLifecycle, HOOK_SOCKET_FILE
  from juju.unit.workflow import UnitWorkflowState
--from juju.unit.charm import download_charm
--
  from juju.agents.base import BaseAgent
@@ -66,14 +62,14 @@
      @inlineCallbacks
      def start(self):
          """Start the unit agent process."""
--        self.service_state_manager = ServiceStateManager(self.client)
++        service_state_manager = ServiceStateManager(self.client)
          # Retrieve our unit and configure working directories.
          service_name = self.unit_name.split("/")[0]
--        service_state = yield self.service_state_manager.get_service_state(
++        self.service_state = yield service_state_manager.get_service_state(
              service_name)
--        self.unit_state = yield service_state.get_unit_state(
++        self.unit_state = yield self.service_state.get_unit_state(
              self.unit_name)
          self.unit_directory = os.path.join(
              self.config["juju_directory"], "units",
@@ -101,19 +97,20 @@
          yield self.unit_state.connect_agent()
          self.lifecycle = UnitLifecycle(
--            self.client, self.unit_state, service_state, self.unit_directory,
--            self.state_directory, self.executor)
++            self.client, self.unit_state, self.service_state,
++            self.unit_directory, self.state_directory, self.executor)
          self.workflow = UnitWorkflowState(
              self.client, self.unit_state, self.lifecycle, self.state_directory)
          # Set up correct lifecycle and executor state given the persistent
          # unit workflow state, and fire any starting transitions if necessary.
--        yield self.workflow.synchronize(self.executor)
++        with (yield self.workflow.lock()):
++            yield self.workflow.synchronize(self.executor)
          if self.get_watch_enabled():
              yield self.unit_state.watch_resolved(self.cb_watch_resolved)
--            yield service_state.watch_config_state(
++            yield self.service_state.watch_config_state(
                  self.cb_watch_config_changed)
              yield self.unit_state.watch_upgrade_flag(
                  self.cb_watch_upgrade_flag)
@@ -147,20 +144,17 @@
          # Clear out the setting
          yield self.unit_state.clear_resolved()
--        # Verify its not already running
--        if (yield self.workflow.get_state()) == "started":
--            returnValue(None)
--
--        log.info("Resolved detected, firing retry transition")
--
--        # Fire a resolved transition
--        try:
--            if resolved["retry"] == RETRY_HOOKS:
--                yield self.workflow.fire_transition_alias("retry_hook")
--            else:
--                yield self.workflow.fire_transition_alias("retry")
--        except Exception:
--            log.exception("Unknown error while transitioning for resolved")
++        with (yield self.workflow.lock()):
++            if (yield self.workflow.get_state()) == "started":
++                returnValue(None)
++            try:
++                log.info("Resolved detected, firing retry transition")
++                if resolved["retry"] == RETRY_HOOKS:
++                    yield self.workflow.fire_transition_alias("retry_hook")
++                else:
++                    yield self.workflow.fire_transition_alias("retry")
++            except Exception:
++                log.exception("Unknown error while transitioning for resolved")
      @inlineCallbacks
      def cb_watch_hook_debug(self, change):
@@ -175,122 +169,51 @@
          """Update the unit's charm when requested.
          """
          upgrade_flag = yield self.unit_state.get_upgrade_flag()
--        if upgrade_flag:
--            log.info("Upgrade detected, starting upgrade")
--            upgrade = CharmUpgradeOperation(self)
--            try:
--                yield upgrade.run()
--            except Exception:
--                log.exception("Error while upgrading")
++        if not upgrade_flag:
++            log.info("No upgrade flag set.")
++            return
++
++        log.info("Upgrade detected")
++        # Clear the flag immediately; this means that upgrade requests will
++        # be *ignored* by units which are not "started", and will need to be
++        # reissued when the units are in acceptable states.
++        yield self.unit_state.clear_upgrade_flag()
++
++        new_id = yield self.service_state.get_charm_id()
++        old_id = yield self.unit_state.get_charm_id()
++        if new_id == old_id:
++            log.info("Upgrade ignored: already running latest charm")
++            return
++
++        with (yield self.workflow.lock()):
++            state = yield self.workflow.get_state()
++            if state != "started":
++                log.warning(
++                    "Cannot upgrade: unit is in non-started state %s. Reissue "
++                    "upgrade command to try again.", state)
++                return
++
++            log.info("Starting upgrade")
++            if (yield self.workflow.fire_transition("upgrade_charm")):
++                log.info("Upgrade complete")
++            else:
++                log.info("Upgrade failed")
      @inlineCallbacks
      def cb_watch_config_changed(self, change):
          """Trigger hook on configuration change"""
          # Verify it is running
--        current_state = yield self.workflow.get_state()
--        log.debug("Configuration Changed")
--
--        if  current_state != "started":
--            log.debug(
--                "Configuration updated on service in a non-started state")
--            returnValue(None)
--
--        yield self.workflow.fire_transition("reconfigure")
--
--
--class CharmUpgradeOperation(object):
--    """A unit agent charm upgrade operation."""
--
--    def __init__(self, agent):
--        self._agent = agent
--        self._log = logging.getLogger("unit.upgrade")
--        self._charm_directory = tempfile.mkdtemp(
--            suffix="charm-upgrade", prefix="tmp")
--
--    def retrieve_charm(self, charm_id):
--        return download_charm(
--            self._agent.client, charm_id, self._charm_directory)
--
--    def _remove_tree(self, result):
--        if os.path.exists(self._charm_directory):
--            shutil.rmtree(self._charm_directory)
--        return result
--
--    def run(self):
--        d = self._run()
--        d.addBoth(self._remove_tree)
--        return d
--
--    @inlineCallbacks
--    def _run(self):
--        self._log.info("Starting charm upgrade...")
--
--        # Verify the workflow state
--        workflow_state = yield self._agent.workflow.get_state()
--        if workflow_state != "started":
--            self._log.warning(
--                "Unit not in an upgradeable state: %s", workflow_state)
--            # Upgrades can only be supported while the unit is
--            # running, we clear the flag because we don't support
--            # persistent upgrade requests across unit starts. The
--            # upgrade request will need to be reissued, after
--            # resolving or restarting the unit.
--            yield self._agent.unit_state.clear_upgrade_flag()
--            returnValue(False)
--
--        # Get, check, and clear the flag. Do it first so a second upgrade
--        # will restablish the upgrade request.
--        upgrade_flag = yield self._agent.unit_state.get_upgrade_flag()
--        if not upgrade_flag:
--            self._log.warning("No upgrade flag set.")
--            returnValue(False)
--
--        self._log.debug("Clearing upgrade flag.")
--        yield self._agent.unit_state.clear_upgrade_flag()
--
--        # Retrieve the service state
--        service_state_manager = ServiceStateManager(self._agent.client)
--        service_state = yield service_state_manager.get_service_state(
--            self._agent.unit_name.split("/")[0])
--
--        # Verify unit state, upgrade flag, and newer version requested.
--        service_charm_id = yield service_state.get_charm_id()
--        unit_charm_id = yield self._agent.unit_state.get_charm_id()
--
--        if service_charm_id == unit_charm_id:
--            self._log.debug("Unit already running latest charm")
--            yield self._agent.unit_state.clear_upgrade_flag()
--            returnValue(True)
--
--        # Retrieve charm
--        self._log.debug("Retrieving charm %s", service_charm_id)
--        charm = yield self.retrieve_charm(service_charm_id)
--
--        # Stop hook executions
--        self._log.debug("Stopping hook execution.")
--        yield self._agent.executor.stop()
--
--        # Note the current charm version
--        self._log.debug("Setting unit charm id to %s", service_charm_id)
--        yield self._agent.unit_state.set_charm_id(service_charm_id)
--
--        # Extract charm
--        self._log.debug("Extracting new charm.")
--        charm.extract_to(
--            os.path.join(self._agent.unit_directory, "charm"))
--
--        # Upgrade
--        self._log.debug("Invoking upgrade transition.")
--
--        success = yield self._agent.workflow.fire_transition(
--            "upgrade_charm")
--
--        if success:
--            self._log.debug("Unit upgraded.")
--        else:
--            self._log.warning("Upgrade failed.")
--
--        returnValue(success)
++        with (yield self.workflow.lock()):
++            current_state = yield self.workflow.get_state()
++            log.debug("Configuration Changed")
++
++            if current_state != "started":
++                log.debug(
++                    "Configuration updated on service in a non-started state")
++                returnValue(None)
++
++            yield self.workflow.fire_transition("configure")
++
  if __name__ == '__main__':
      UnitAgent.run()
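
The order of checks in the new `cb_watch_upgrade_flag` can be summarised as a plain synchronous function. This is only an illustrative sketch, not code from the branch: `upgrade_decision` and its arguments are hypothetical names, and the real callback reads these values from ZooKeeper and checks the workflow state while holding the workflow lock.

```python
def upgrade_decision(upgrade_flag, service_charm_id, unit_charm_id, state):
    """Hypothetical helper mirroring the checks in cb_watch_upgrade_flag.

    Returns a (clear_flag, fire_upgrade, message) tuple.
    """
    if not upgrade_flag:
        # Nothing was requested, so nothing is cleared or fired.
        return (False, False, "No upgrade flag set.")

    # From here on the flag is cleared before any other check, so a
    # request made while the unit is not "started" must be reissued.
    if service_charm_id == unit_charm_id:
        return (True, False, "Upgrade ignored: already running latest charm")

    if state != "started":
        message = (
            "Cannot upgrade: unit is in non-started state %s. Reissue "
            "upgrade command to try again." % state)
        return (True, False, message)

    # Only now is the "upgrade_charm" transition fired.
    return (True, True, "Starting upgrade")
```

For example, `upgrade_decision(True, "local:series/mysql-2", "local:series/mysql-1", "configure_error")` reports that the flag is cleared but no transition is fired, which is the behaviour `test_agent_upgrade_bad_unit_state` waits for in the log. The charm ids used here are made-up values.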
 === modified file 'juju/control/tests/test_resolved.py'
 --- juju/control/tests/test_resolved.py	2012-01-12 10:18:07 +0000
 +++ juju/control/tests/test_resolved.py	2012-02-21 10:23:29 +0000
@@ -36,7 +36,8 @@
          self.unit1_workflow = UnitWorkflowState(
              self.client, self.service_unit1, None, self.makeDir())
--        yield self.unit1_workflow.set_state("started")
++        with (yield self.unit1_workflow.lock()):
++            yield self.unit1_workflow.set_state("started")
          self.environment = self.config.get_default()
          self.provider = self.environment.get_machine_provider()
@@ -95,7 +96,8 @@
              workflow_state = RelationWorkflowState(
                  self.client, unit_relation, service_relation.relation_name,
                  lifecycle, self.makeDir())
--            yield workflow_state.set_state(state)
++            with (yield workflow_state.lock()):
++                yield workflow_state.set_state(state)
      @inlineCallbacks
      def test_resolved(self):
@@ -104,7 +106,8 @@
          retrying from an error state.
          """
          # Push the unit into an error state
--        yield self.unit1_workflow.set_state("start_error")
++        with (yield self.unit1_workflow.lock()):
++            yield self.unit1_workflow.set_state("start_error")
          self.setup_exit(0)
          finished = self.setup_cli_reactor()
          self.mocker.replay()
@@ -128,7 +131,8 @@
          for retrying from an error state with a retry of hooks
          executions.
          """
--        yield self.unit1_workflow.set_state("start_error")
++        with (yield self.unit1_workflow.lock()):
++            yield self.unit1_workflow.set_state("start_error")
          self.setup_exit(0)
          finished = self.setup_cli_reactor()
          self.mocker.replay()
@@ -159,7 +163,8 @@
              (self.service_unit1, "down"),
              (self.service_unit2, "up"))
--        yield self.unit1_workflow.set_state("start_error")
++        with (yield self.unit1_workflow.lock()):
++            yield self.unit1_workflow.set_state("start_error")
          self.setup_exit(0)
          finished = self.setup_cli_reactor()
          self.mocker.replay()
@@ -309,7 +314,8 @@
          # Just verify we don't accidentally mark up another unit of the service
          unit2_workflow = UnitWorkflowState(
              self.client, self.service_unit2, None, self.makeDir())
--        unit2_workflow.set_state("start_error")
++        with (yield unit2_workflow.lock()):
++            yield unit2_workflow.set_state("start_error")
          self.setup_exit(0)
          finished = self.setup_cli_reactor()
@@ -335,11 +341,13 @@
          """
          # Mark the unit as resolved and as in an error state.
          yield self.service_unit1.set_resolved(RETRY_HOOKS)
--        yield self.unit1_workflow.set_state("start_error")
++        with (yield self.unit1_workflow.lock()):
++            yield self.unit1_workflow.set_state("start_error")
          unit2_workflow = UnitWorkflowState(
              self.client, self.service_unit1, None, self.makeDir())
--        unit2_workflow.set_state("start_error")
++        with (yield unit2_workflow.lock()):
++            yield unit2_workflow.set_state("start_error")
          self.assertEqual(
              (yield self.service_unit2.get_resolved()), None)
 === modified file 'juju/control/tests/test_status.py'
 --- juju/control/tests/test_status.py	2011-12-07 18:29:12 +0000
 +++ juju/control/tests/test_status.py	2012-02-21 10:23:29 +0000
@@ -10,7 +10,6 @@
  from juju.agents.base import TwistedOptionNamespace
  from juju.agents.machine import MachineAgent
--from juju.agents.unit import UnitAgent
  from juju.environment.environment import Environment
  from juju.control import status
  from juju.control import tests
@@ -39,7 +38,7 @@
      # Status tests set up a large tree every time, make allowances for it.
      # TODO: create minimal trees needed per test.
      timeout = 10
--
++
      @inlineCallbacks
      def setUp(self):
          yield super(StatusTestBase, self).setUp()
@@ -59,22 +58,14 @@
          self.provider = self.environment.get_machine_provider()
          self.output = StringIO()
--        self.agents = []
--
--    @inlineCallbacks
--    def tearDown(self):
--        for agent in self.agents:
--            if getattr(agent, "api_socket", None):
--                yield agent.api_socket.stopListening()
--                agent.api_socket = None
--        yield super(StatusTestBase, self).tearDown()
      @inlineCallbacks
      def set_unit_state(self, unit_state, state, port_protos=()):
          unit_state.set_public_address(
              "%s.example.com" % unit_state.unit_name.replace("/", "-"))
          workflow_client = ZookeeperWorkflowState(self.client, unit_state)
--        yield workflow_client.set_state(state)
++        with (yield workflow_client.lock()):
++            yield workflow_client.set_state(state)
          for port_proto in port_protos:
              yield unit_state.open_port(*port_proto)
@@ -85,7 +76,8 @@
                  unit_state)
              workflow_client = ZookeeperWorkflowState(
                  self.client, relation_unit_state)
--            yield workflow_client.set_state(state)
++            with (yield workflow_client.lock()):
++                yield workflow_client.set_state(state)
      @inlineCallbacks
      def add_relation_with_relation_units(
@@ -102,30 +94,11 @@
              dest_relation_state, dest_units, dest_states)
      @inlineCallbacks
--    def create_agent(self, agent_cls, path, **extra_options):
--        agent = agent_cls()
--        options = TwistedOptionNamespace()
--        options["juju_directory"] = path
--        options["zookeeper_servers"] = get_test_zookeeper_address()
--        for k, v in extra_options.items():
--            options[k] = v
--        agent.configure(options)
--        agent.set_watch_enabled(False)
--        agent.client = self.client
--        yield agent.start()
--        self.agents.append(agent)
--
--    @inlineCallbacks
      def add_unit(self, service, machine, with_agent=lambda _: True):
          unit = yield service.add_unit_state()
          yield unit.assign_to_machine(machine)
--        name = unit.unit_name
--        if with_agent(name):
--            juju_dir = self.makeDir()
--            os.makedirs(os.path.join(juju_dir, "state"))
--            os.makedirs(os.path.join(juju_dir, "units",
--                                     name.replace("/", "-")))
--            yield self.create_agent(UnitAgent, juju_dir, unit_name=name)
++        if with_agent(unit.unit_name):
++            yield unit.connect_agent()
          returnValue(unit)
      @inlineCallbacks
@@ -274,9 +247,10 @@
          _, (peer_rel,) = yield self.relation_state_manager.add_relation_state(
              RelationEndpoint("riak", "peer", "ring", "peer"))
--        yield ZookeeperWorkflowState(
--            self.client,
--            (yield peer_rel.add_unit_state(riak_u1))).set_state("up")
++        riak_u1_relation = yield peer_rel.add_unit_state(riak_u1)
++        riak_u1_workflow = ZookeeperWorkflowState(self.client, riak_u1_relation)
++        with (yield riak_u1_workflow.lock()):
++            yield riak_u1_workflow.set_state("up")
          yield peer_rel.add_unit_state(riak_u2)
          state = yield status.collect(
 === modified file 'juju/control/tests/test_upgrade_charm.py'
 --- juju/control/tests/test_upgrade_charm.py	2011-12-07 02:26:56 +0000
 +++ juju/control/tests/test_upgrade_charm.py	2012-02-21 10:23:29 +0000
@@ -66,7 +66,8 @@
          self.unit1_workflow = UnitWorkflowState(
              self.client, self.service_unit1, None, self.makeDir())
--        yield self.unit1_workflow.set_state("started")
++        with (yield self.unit1_workflow.lock()):
++            yield self.unit1_workflow.set_state("started")
          self.environment = self.config.get_default()
          self.provider = self.environment.get_machine_provider()
@@ -444,7 +445,8 @@
          self.unit1_workflow = UnitWorkflowState(
              self.client, self.service_unit1, None, self.makeDir())
--        yield self.unit1_workflow.set_state("started")
++        with (yield self.unit1_workflow.lock()):
++            yield self.unit1_workflow.set_state("started")
          self.environment = self.config.get_default()
          self.provider = self.environment.get_machine_provider()
 === modified file 'juju/errors.py'
 --- juju/errors.py	2011-09-24 22:21:23 +0000
 +++ juju/errors.py	2012-02-21 10:23:29 +0000
@@ -62,7 +62,7 @@
          return "Error processing %r: %s" % (self.path, self.message)
--class CharmInvocationError(JujuError):
++class CharmInvocationError(CharmError):
      """A charm's hook invocation exited with an error"""
      def __init__(self, path, exit_code):
@@ -74,6 +74,16 @@
              self.path, self.exit_code)
++class CharmUpgradeError(CharmError):
++    """Something went wrong trying to upgrade a charm"""
++
++    def __init__(self, message):
++        self.message = message
++
++    def __str__(self):
++        return "Cannot upgrade charm: %s" % self.message
++
++
  class FileAlreadyExists(JujuError):
      """Raised when something refuses to overwrite an existing file.
 === modified file 'juju/lib/statemachine.py'
 --- juju/lib/statemachine.py	2011-05-04 19:40:59 +0000
 +++ juju/lib/statemachine.py	2012-02-21 10:23:29 +0000
@@ -17,7 +17,7 @@
  import logging
--from twisted.internet.defer import inlineCallbacks, returnValue
++from twisted.internet.defer import DeferredLock, inlineCallbacks, returnValue
  class WorkflowError(Exception):
@@ -43,15 +43,44 @@
      return instance.__class__.__name__.lower()
++class _ExitCaller(object):
++
++    def __init__(self, func):
++        self._func = func
++
++    def __enter__(self):
++        pass
++
++    def __exit__(self, *exc_info):
++        self._func()
++
++
  class WorkflowState(object):
      _workflow = None
      def __init__(self, workflow=None):
--
          if workflow:
              self._workflow = workflow
          self._observer = None
++        self._lock = DeferredLock()
++
++    @inlineCallbacks
++    def lock(self):
++        yield self._lock.acquire()
++        returnValue(_ExitCaller(self._lock.release))
++
++    def _assert_locked(self):
++        """Should be called at the start of any method which changes state.
++
++        This is a frankly pitiful hack that should (handwave handwave) help
++        people to use this correctly; it doesn't stop anyone from calling
++        write methods on this object while someone *else* holds a lock, but
++        hopefully it will help us catch these situations when unit testing.
++
++        This method only exists as a place to put this documentation.
++        """
++        assert self._lock.locked
      @inlineCallbacks
      def get_available_transitions(self):
@@ -82,6 +111,7 @@
          Ambiguous (multiple) or no matching transitions cause an exception
          InvalidTransition to be raised.
          """
++        self._assert_locked()
          found = []
          for t in (yield self.get_available_transitions()):
@@ -113,26 +143,25 @@
          Returns a boolean value based on whether the state
          was achieved.
          """
--        # verify its a valid state id
++        self._assert_locked()
++
++        # verify it's a valid state id
          if not self._workflow.has_state(state_id):
              raise InvalidStateError(state_id)
          transitions = yield self.get_available_transitions()
--        found_transition = False
          for transition in transitions:
              if transition.destination == state_id:
--                found_transition = True
                  break
--
--        if found_transition:
--            log.debug("%s: transition state (%s -> %s)",
--                      class_name(self),
--                      transition.source,
--                      transition.destination)
--            result = yield self.fire_transition(transition.transition_id)
--            returnValue(result)
--
--        returnValue(False)
++        else:
++            returnValue(False)
++
++        log.debug("%s: transition state (%s -> %s)",
++                  class_name(self),
++                  transition.source,
++                  transition.destination)
++        result = yield self.fire_transition(transition.transition_id)
++        returnValue(result)
      @inlineCallbacks
      def fire_transition(self, transition_id, **state_variables):
@@ -141,6 +170,8 @@
          Invokes any transition actions, saves state and state variables, and
          fires error transitions as needed.
          """
++        self._assert_locked()
++
          # Verify and retrieve the transition.
          available = yield self.get_available_transitions()
          available_ids = [t.transition_id for t in available]
@@ -149,6 +180,7 @@
              raise InvalidTransitionError(
                  "%r not a valid transition for state %s" % (
                      transition_id, current_state))
++        yield self.set_inflight(transition_id)
          transition = self._workflow.get_transition(transition_id)
          log.debug("%s: transition %s (%s -> %s) %r",
@@ -181,23 +213,30 @@
                      yield self.fire_transition(
                          transition.error_transition_id)
                  else:
++                    yield self.set_inflight(None)
                      log.debug("%s:  transition %s failed %s",
                                class_name(self), transition_id, e)
                  # Bail, and note the error as a return value.
                  returnValue(False)
--        # Set the state with state variables
++        # Set the state with state variables (and implicitly clear inflight)
          yield self.set_state(transition.destination, **state_variables)
          log.debug("%s: transition complete %s (state %s) %r",
                    class_name(self), transition_id,
                    transition.destination, state_variables)
--        if transition.success_transition_id:
--            log.debug("%s: initiating success transition: %s",
--                      class_name(self), transition.success_transition_id)
--            yield self.fire_transition(transition.success_transition_id)
++        yield self._fire_automatic_transitions()
          returnValue(True)
      @inlineCallbacks
++    def _fire_automatic_transitions(self):
++        self._assert_locked()
++        available = yield self.get_available_transitions()
++        for t in available:
++            if t.automatic:
++                yield self.fire_transition(t.transition_id)
++                return
++
++    @inlineCallbacks
      def get_state(self):
          """Get the current workflow state.
          """
@@ -230,10 +269,48 @@
      def set_state(self, state, **variables):
          """Set the current workflow state, optionally setting state variables.
          """
++        self._assert_locked()
          yield self._store(dict(state=state, state_variables=variables))
          if self._observer:
              self._observer(state, variables)
++    @inlineCallbacks
++    def set_inflight(self, transition_id):
++        """Record intent to perform a transition, or completion of same.
++
++        Ideally, this would not be exposed to the public, but it's necessary
++        for writing sane tests.
++        """
++        self._assert_locked()
++        state = (yield self._load()) or {}
++        state.setdefault("state", None)
++        state.setdefault("state_variables", {})
++        if transition_id is not None:
++            state["transition_id"] = transition_id
++        else:
++            state.pop("transition_id", None)
++        yield self._store(state)
++
++    @inlineCallbacks
++    def get_inflight(self):
++        """Get the id of the transition that is currently executing.
++
++        (Or which was abandoned due to unexpected process death.)
++        """
++        state = (yield self._load()) or {}
++        returnValue(state.get("transition_id"))
++
++    @inlineCallbacks
++    def synchronize(self):
++        """Rerun inflight transition, if any, and any default transitions."""
++        self._assert_locked()
++        # First of all, complete any abandoned transition.
++        transition_id = yield self.get_inflight()
++        if transition_id is not None:
++            yield self.fire_transition(transition_id)
++        else:
++            yield self._fire_automatic_transitions()
++
      def _load(self):
          """ Load the state and variables from persistent storage.
          """
@@ -280,25 +357,25 @@
  class Transition(object):
      """A transition encapsulates an edge in the statemachine graph.
--    :attr:`transition_id` The identity fo the transition.
++    :attr:`transition_id` The identity of the transition.
      :attr:`label` A human readable label of the transition's purpose.
      :attr:`source` The origin/source state of the transition.
      :attr:`destination` The target/destination state of the transition.
      :attr:`action_id` The name of the action method to use for this transition.
      :attr:`error_transition_id`: A transition to fire if the action fails.
--    :attr:`success_transition_id`: A transition to fire if the action succeeds.
++    :attr:`automatic`: If true, always try to fire this transition whenever in
++        `source` state.
      :attr:`alias` See :meth:`WorkflowState.fire_transition_alias`
      """
      def __init__(self, transition_id, label, source, destination,
--                 error_transition_id=None, success_transition_id=None,
--                 alias=None):
++                 error_transition_id=None, automatic=False, alias=None):
          self._transition_id = transition_id
          self._label = label
          self._source = source
          self._destination = destination
          self._error_transition_id = error_transition_id
--        self._success_transition_id = success_transition_id
++        self._automatic = automatic
          self._alias = alias
      @property
@@ -334,7 +411,7 @@
          return self._error_transition_id
      @property
--    def success_transition_id(self):
--        """The id of a transition to fire upon the success of this transition.
++    def automatic(self):
++        """Should this transition always fire whenever possible?
          """
--        return self._success_transition_id
++        return self._automatic
 === modified file 'juju/lib/tests/test_statemachine.py'
 --- juju/lib/tests/test_statemachine.py	2011-09-15 18:50:23 +0000
 +++ juju/lib/tests/test_statemachine.py	2012-02-21 10:23:29 +0000
@@ -1,6 +1,6 @@
  import logging
--from twisted.internet.defer import succeed, fail, inlineCallbacks
++from twisted.internet.defer import succeed, fail, inlineCallbacks, Deferred
  from juju.lib.testing import TestCase
  from juju.lib.statemachine import (
@@ -8,7 +8,7 @@
      InvalidTransitionError, TransitionError)
--class AttributeWorkflowState(WorkflowState):
++class TestWorkflowState(WorkflowState):
      _workflow_state = None
@@ -20,6 +20,9 @@
      def _load(self):
          return self._workflow_state
++
++class AttributeWorkflowState(TestWorkflowState):
++
      # transition handlers.
      def do_jump_puddle(self):
          self._jumped = True
@@ -86,30 +89,35 @@
          yield self.assertEqual(
              transitions, [workflow.get_transition("init_workflow")])
++    @inlineCallbacks
      def test_fire_transition_alias_multiple(self):
          workflow = Workflow(
              Transition("init", "", None, "initialized", alias="init"),
              Transition("init_start", "", None, "started", alias="init"))
          workflow_state = AttributeWorkflowState(workflow)
--        return self.assertFailure(
--            workflow_state.fire_transition_alias("init"),
--            InvalidTransitionError)
++        with (yield workflow_state.lock()):
++            yield self.assertFailure(
++                workflow_state.fire_transition_alias("init"),
++                InvalidTransitionError)
++    @inlineCallbacks
      def test_fire_transition_alias_none(self):
          workflow = Workflow(
              Transition("init_workflow", "", None, "initialized"),
              Transition("start", "", "initialized", "started"))
          workflow_state = AttributeWorkflowState(workflow)
--        return self.assertFailure(
--            workflow_state.fire_transition_alias("dog"),
--            InvalidTransitionError)
++        with (yield workflow_state.lock()):
++            yield self.assertFailure(
++                workflow_state.fire_transition_alias("dog"),
++                InvalidTransitionError)
      @inlineCallbacks
      def test_fire_transition_alias(self):
          workflow = Workflow(
              Transition("init_magic", "", None, "initialized", alias="init"))
          workflow_state = AttributeWorkflowState(workflow)
--        value = yield workflow_state.fire_transition_alias("init")
++        with (yield workflow_state.lock()):
++            value = yield workflow_state.fire_transition_alias("init")
          self.assertEqual(value, True)
      @inlineCallbacks
@@ -121,10 +129,16 @@
          workflow_state = AttributeWorkflowState(workflow)
          current_state = yield workflow_state.get_state()
          self.assertEqual(current_state, None)
--
--        yield workflow_state.set_state("started")
++        current_vars = yield workflow_state.get_state_variables()
++        self.assertEqual(current_vars, {})
++
++        with (yield workflow_state.lock()):
++            yield workflow_state.set_state("started")
++
          current_state = yield workflow_state.get_state()
          self.assertEqual(current_state, "started")
++        current_vars = yield workflow_state.get_state_variables()
++        self.assertEqual(current_vars, {})
      @inlineCallbacks
      def test_state_fire_transition(self):
@@ -132,14 +146,15 @@
              Transition("init_workflow", "", None, "initialized"),
              Transition("start", "", "initialized", "started"))
          workflow_state = AttributeWorkflowState(workflow)
--        yield workflow_state.fire_transition("init_workflow")
--        current_state = yield workflow_state.get_state()
--        self.assertEqual(current_state, "initialized")
--        yield workflow_state.fire_transition("start")
--        current_state = yield workflow_state.get_state()
--        self.assertEqual(current_state, "started")
--        yield self.assertFailure(workflow_state.fire_transition("stop"),
--                                 InvalidTransitionError)
++        with (yield workflow_state.lock()):
++            yield workflow_state.fire_transition("init_workflow")
++            current_state = yield workflow_state.get_state()
++            self.assertEqual(current_state, "initialized")
++            yield workflow_state.fire_transition("start")
++            current_state = yield workflow_state.get_state()
++            self.assertEqual(current_state, "started")
++            yield self.assertFailure(workflow_state.fire_transition("stop"),
++                                     InvalidTransitionError)
          name = "attributeworkflowstate"
          output = (
@@ -159,7 +174,8 @@
              Transition("jump_puddle", "", None, "dry"))
          workflow_state = AttributeWorkflowState(workflow)
--        yield workflow_state.fire_transition("jump_puddle")
++        with (yield workflow_state.lock()):
++            yield workflow_state.fire_transition("jump_puddle")
          current_state = yield workflow_state.get_state()
          self.assertEqual(current_state, "dry")
          self.assertEqual(
@@ -170,13 +186,14 @@
      def test_transition_action_workflow_error(self):
          """If a transition action callback raises a transitionerror, the
          transition does not complete, and the state remains the same.
--        The fire_transition method in this cae returns False.
++        The fire_transition method in this case returns False.
          """
          workflow = Workflow(
              Transition("raises_transition_error", "", None, "next-state"))
          workflow_state = AttributeWorkflowState(workflow)
--        result = yield workflow_state.fire_transition(
--            "raises_transition_error")
++        with (yield workflow_state.lock()):
++            result = yield workflow_state.fire_transition(
++                "raises_transition_error")
          self.assertEqual(result, False)
          current_state = yield workflow_state.get_state()
          self.assertEqual(current_state, None)
@@ -189,6 +206,7 @@
          self.assertEqual(self.log_stream.getvalue(),
                           "\n".join([line % name for line in output]))
++    @inlineCallbacks
      def test_transition_action_unknown_error(self):
          """If an unknown error is raised by a transition action, it
          is raised from the fire transition method.
@@ -197,9 +215,10 @@
              Transition("error_unknown", "", None, "next-state"))
          workflow_state = AttributeWorkflowState(workflow)
--        return self.assertFailure(
--            workflow_state.fire_transition("error_unknown"),
--            AttributeError)
++        with (yield workflow_state.lock()):
++            yield self.assertFailure(
++                workflow_state.fire_transition("error_unknown"),
++                AttributeError)
      @inlineCallbacks
      def test_transition_resets_state_variables(self):
@@ -214,17 +233,18 @@
          state_variables = yield workflow_state.get_state_variables()
          self.assertEqual(state_variables, {})
--        yield workflow_state.fire_transition("transition_variables")
--        current_state = yield workflow_state.get_state()
--        self.assertEqual(current_state, "next-state")
--        state_variables = yield workflow_state.get_state_variables()
--        self.assertEqual(state_variables, {"hello": "world"})
++        with (yield workflow_state.lock()):
++            yield workflow_state.fire_transition("transition_variables")
++            current_state = yield workflow_state.get_state()
++            self.assertEqual(current_state, "next-state")
++            state_variables = yield workflow_state.get_state_variables()
++            self.assertEqual(state_variables, {"hello": "world"})
--        yield workflow_state.fire_transition("some_transition")
--        current_state = yield workflow_state.get_state()
--        self.assertEqual(current_state, "final-state")
--        state_variables = yield workflow_state.get_state_variables()
--        self.assertEqual(state_variables, {})
++            yield workflow_state.fire_transition("some_transition")
++            current_state = yield workflow_state.get_state()
++            self.assertEqual(current_state, "final-state")
++            state_variables = yield workflow_state.get_state_variables()
++            self.assertEqual(state_variables, {})
      @inlineCallbacks
      def test_transition_success_transition(self):
@@ -233,16 +253,13 @@
          action handler are executed.
          """
          workflow = Workflow(
--            Transition("initialized", "", None, "next-state",
--                       success_transition_id="markup"),
--            Transition("markup", "", "next-state", "final-state"),
++            Transition("initialized", "", None, "next"),
++            Transition("markup", "", "next", "final", automatic=True),
+             )
          workflow_state = AttributeWorkflowState(workflow)
--        yield workflow_state.fire_transition("initialized")
--        self.assertEqual((yield workflow_state.get_state()), "final-state")
--        self.assertIn(
--            "initiating success transition: markup",
--            self.log_stream.getvalue())
++        with (yield workflow_state.lock()):
++            yield workflow_state.fire_transition("initialized")
++        self.assertEqual((yield workflow_state.get_state()), "final")
      @inlineCallbacks
      def test_transition_error_transition(self):
@@ -256,7 +273,8 @@
              Transition("error_transition", "", None, "error-state"))
          workflow_state = AttributeWorkflowState(workflow)
--        yield workflow_state.fire_transition("raises_transition_error")
++        with (yield workflow_state.lock()):
++            yield workflow_state.fire_transition("raises_transition_error")
          current_state = yield workflow_state.get_state()
          self.assertEqual(current_state, "error-state")
@@ -278,10 +296,10 @@
              Transition("continue", "", "next-state", "final-state"))
          workflow_state = AttributeWorkflowState(workflow)
--        workflow_state.set_observer(observer)
--
--        yield workflow_state.fire_transition("begin")
--        yield workflow_state.fire_transition("continue")
++        with (yield workflow_state.lock()):
++            workflow_state.set_observer(observer)
++            yield workflow_state.fire_transition("begin")
++            yield workflow_state.fire_transition("continue")
          self.assertEqual(results,
                           [("next-state", {}), ("final-state", {})])
@@ -295,19 +313,19 @@
              Transition("continue", "", "next-state", "final-state"))
          workflow_state = AttributeWorkflowState(workflow)
--
--        yield workflow_state.fire_transition(
--            "begin", rabbit="moon", hello=True)
--        current_state = yield workflow_state.get_state()
--        self.assertEqual(current_state, "next-state")
--        variables = yield workflow_state.get_state_variables()
--        self.assertEqual({"rabbit": "moon", "hello": True}, variables)
--
--        yield workflow_state.fire_transition("continue")
--        current_state = yield workflow_state.get_state()
--        self.assertEqual(current_state, "final-state")
--        variables = yield workflow_state.get_state_variables()
--        self.assertEqual({}, variables)
++        with (yield workflow_state.lock()):
++            yield workflow_state.fire_transition(
++                "begin", rabbit="moon", hello=True)
++            current_state = yield workflow_state.get_state()
++            self.assertEqual(current_state, "next-state")
++            variables = yield workflow_state.get_state_variables()
++            self.assertEqual({"rabbit": "moon", "hello": True}, variables)
++
++            yield workflow_state.fire_transition("continue")
++            current_state = yield workflow_state.get_state()
++            self.assertEqual(current_state, "final-state")
++            variables = yield workflow_state.get_state_variables()
++            self.assertEqual({}, variables)
      @inlineCallbacks
      def test_transition_state(self):
@@ -319,17 +337,137 @@
              Transition("to_house", "", "trail", "house"))
          workflow_state = AttributeWorkflowState(workflow)
--
--        result = yield workflow_state.transition_state("trail")
--        self.assertEqual(result, True)
--        current_state = yield workflow_state.get_state()
--        self.assertEqual(current_state, "trail")
--
--        result = yield workflow_state.transition_state("cabin")
--        self.assertEqual(result, True)
--
--        result = yield workflow_state.transition_state("house")
--        self.assertEqual(result, False)
--
--        self.assertFailure(workflow_state.transition_state("unknown"),
--                           InvalidStateError)
++        with (yield workflow_state.lock()):
++            result = yield workflow_state.transition_state("trail")
++            self.assertEqual(result, True)
++            current_state = yield workflow_state.get_state()
++            self.assertEqual(current_state, "trail")
++
++            result = yield workflow_state.transition_state("cabin")
++            self.assertEqual(result, True)
++
++            result = yield workflow_state.transition_state("house")
++            self.assertEqual(result, False)
++
++            yield self.assertFailure(
++                workflow_state.transition_state("unknown"), InvalidStateError)
++
++    @inlineCallbacks
++    def test_load_bad_state(self):
++        class BadLoadWorkflowState(WorkflowState):
++            def _load(self):
++                return succeed({"some": "other-data"})
++
++        workflow = BadLoadWorkflowState(Workflow())
++        yield self.assertFailure(workflow.get_state(), KeyError)
++        yield self.assertFailure(workflow.get_state_variables(), KeyError)
++
++
++SyncWorkflow = Workflow(
++    Transition("init", "", None, "inited", error_transition_id="error_init"),
++    Transition("error_init", "", None, "borken"),
++    Transition("start", "", "inited", "started", automatic=True),
++
++    # Disjoint states for testing default transition synchronize.
++    Transition("predefault", "", "default_init", "default_start"),
++    Transition("default", "", "default_start", "default_end", automatic=True),
++)
++
++class SyncWorkflowState(TestWorkflowState):
++
++    _workflow = SyncWorkflow
++
++    def __init__(self):
++        super(SyncWorkflowState, self).__init__()
++        self.started = {
++            "init": Deferred(), "error_init": Deferred(), "start": Deferred()}
++        self.blockers = {
++            "init": Deferred(), "error_init": Deferred(), "start": Deferred()}
++
++    def do(self, transition):
++        self.started[transition].callback(None)
++        return self.blockers[transition]
++
++    def do_init(self):
++        return self.do("init")
++
++    def do_error_init(self):
++        return self.do("error_init")
++
++    def do_start(self):
++        return self.do("start")
++
++
++class StateMachineSynchronizeTest(TestCase):
++
++    @inlineCallbacks
++    def setUp(self):
++        yield super(StateMachineSynchronizeTest, self).setUp()
++        self.workflow = SyncWorkflowState()
++
++    @inlineCallbacks
++    def assert_state(self, state, inflight):
++        self.assertEquals((yield self.workflow.get_state()), state)
++        self.assertEquals((yield self.workflow.get_inflight()), inflight)
++
++    @inlineCallbacks
++    def test_plain_synchronize(self):
++        """synchronize does nothing when there is no inflight transition and
++        no applicable default transition"""
++        yield self.assert_state(None, None)
++        with (yield self.workflow.lock()):
++            yield self.workflow.synchronize()
++        yield self.assert_state(None, None)
++
++    @inlineCallbacks
++    def test_synchronize_default_transition(self):
++        """synchronize runs default transitions after inflight recovery"""
++        with (yield self.workflow.lock()):
++            yield self.workflow.set_state("default_init")
++            yield self.workflow.set_inflight("predefault")
++            yield self.workflow.synchronize()
++            yield self.assert_state("default_end", None)
++
++    @inlineCallbacks
++    def test_synchronize_inflight_success(self):
++        """synchronize will complete an unfinished transition and run the
++        success transition where warranted"""
++        with (yield self.workflow.lock()):
++            yield self.workflow.set_inflight("init")
++            d = self.workflow.synchronize()
++            yield self.workflow.started["init"]
++            yield self.assert_state(None, "init")
++            self.workflow.blockers["init"].callback(None)
++            yield self.workflow.started["start"]
++            yield self.assert_state("inited", "start")
++            self.workflow.blockers["start"].callback(None)
++            yield d
++            yield self.assert_state("started", None)
++
++    @inlineCallbacks
++    def test_synchronize_inflight_error(self):
++        """synchronize will complete an unfinished transition and run the
++        error transition where warranted"""
++        with (yield self.workflow.lock()):
++            yield self.workflow.set_inflight("init")
++            d = self.workflow.synchronize()
++            yield self.workflow.started["init"]
++            yield self.assert_state(None, "init")
++            self.workflow.blockers["init"].errback(TransitionError())
++            yield self.workflow.started["error_init"]
++            yield self.assert_state(None, "error_init")
++            self.workflow.blockers["error_init"].callback(None)
++            yield d
++            yield self.assert_state("borken", None)
++
++    @inlineCallbacks
++    def test_error_without_transition_clears_inflight(self):
++        """when a transition fails, it should no longer be marked inflight"""
++        with (yield self.workflow.lock()):
++            yield self.workflow.set_state("inited")
++            d = self.workflow.fire_transition("start")
++            yield self.workflow.started["start"]
++            yield self.assert_state("inited", "start")
++            self.workflow.blockers["start"].errback(TransitionError())
++            yield d
++            yield self.assert_state("inited", None)
 === modified file 'juju/state/service.py'
 --- juju/state/service.py	2012-02-04 00:01:06 +0000
 +++ juju/state/service.py	2012-02-21 10:23:29 +0000
@@ -1020,13 +1020,16 @@
          # Wait on the first callback, reflecting present state, not a zk watch
          yield callback_d
++    @property
++    def _upgrade_flag_path(self):
++        return "/units/%s/upgrade" % self._internal_id
++
      @inlineCallbacks
      def set_upgrade_flag(self):
          """Inform the unit it should perform an upgrade.
          """
--        upgrade_path = "/units/%s/upgrade" % self._internal_id
          try:
--            yield self._client.create(upgrade_path)
++            yield self._client.create(self._upgrade_flag_path)
          except zookeeper.NodeExistsException:
              # We get to the same end state
              pass
@@ -1035,8 +1038,7 @@
      def get_upgrade_flag(self):
          """Returns a boolean denoting if the upgrade flag is set.
          """
--        upgrade_path = "/units/%s/upgrade" % self._internal_id
--        stat = yield self._client.exists(upgrade_path)
++        stat = yield self._client.exists(self._upgrade_flag_path)
          returnValue(bool(stat))
      @inlineCallbacks
@@ -1046,9 +1048,8 @@
          Typically done by the unit agent before beginning the
          upgrade.
          """
--        upgrade_path = "/units/%s/upgrade" % self._internal_id
          try:
--            yield self._client.delete(upgrade_path)
++            yield self._client.delete(self._upgrade_flag_path)
          except zookeeper.NoNodeException:
              # We get to the same end state.
              pass
@@ -1070,20 +1071,19 @@
          happening, the callback should fetch the current value via
          the API, if needed.
          """
--        upgrade_path = "/units/%s/upgrade" % self._internal_id
--
          @inlineCallbacks
          def watcher(change_event):
              if permanent and self._client.connected:
--                exists_d, watch_d = self._client.exists_and_watch(upgrade_path)
++                exists_d, watch_d = self._client.exists_and_watch(
++                    self._upgrade_flag_path)
              yield callback(change_event)
              if permanent:
                  watch_d.addCallback(watcher)
--        exists_d, watch_d = self._client.exists_and_watch(upgrade_path)
++        exists_d, watch_d = self._client.exists_and_watch(self._upgrade_flag_path)
          exists = yield exists_d
 === modified file 'juju/tests/test_errors.py'
 --- juju/tests/test_errors.py	2011-09-24 22:21:23 +0000
 +++ juju/tests/test_errors.py	2012-02-21 10:23:29 +0000
@@ -1,9 +1,9 @@
  from juju.errors import (
--    JujuError, FileNotFound, FileAlreadyExists,
--    NoConnection, InvalidHost, InvalidUser, ProviderError, CloudInitError,
--    ProviderInteractionError, CannotTerminateMachine, MachinesNotFound,
--    EnvironmentPending, EnvironmentNotFound, IncompatibleVersion,
--    InvalidPlacementPolicy)
++    JujuError, FileNotFound, FileAlreadyExists, CharmError,
++    CharmInvocationError, CharmUpgradeError, NoConnection, InvalidHost,
++    InvalidUser, ProviderError, CloudInitError, ProviderInteractionError,
++    CannotTerminateMachine, MachinesNotFound, EnvironmentPending,
++    EnvironmentNotFound, IncompatibleVersion, InvalidPlacementPolicy)
  from juju.lib.testing import TestCase
@@ -73,34 +73,56 @@
      def test_MachinesNotFoundSingular(self):
          error = MachinesNotFound(("i-sublimed",))
++        self.assertIsJujuError(error)
          self.assertEquals(error.instance_ids, ["i-sublimed"])
          self.assertEquals(str(error),
                            "Cannot find machine: i-sublimed")
      def test_MachinesNotFoundPlural(self):
          error = MachinesNotFound(("i-disappeared", "i-exploded"))
++        self.assertIsJujuError(error)
          self.assertEquals(error.instance_ids, ["i-disappeared", "i-exploded"])
          self.assertEquals(str(error),
                            "Cannot find machines: i-disappeared, i-exploded")
--    def testEnvironmentNotFoundWithInfo(self):
++    def test_EnvironmentNotFoundWithInfo(self):
          error = EnvironmentNotFound("problem")
++        self.assertIsJujuError(error)
          self.assertEquals(str(error),
                            "juju environment not found: problem")
--    def testEnvironmentNotFoundNoInfo(self):
++    def test_EnvironmentNotFoundNoInfo(self):
          error = EnvironmentNotFound()
++        self.assertIsJujuError(error)
          self.assertEquals(str(error),
                            "juju environment not found: no details "
                            "available")
--    def testEnvironmentPendingWithInfo(self):
++    def test_EnvironmentPendingWithInfo(self):
          error = EnvironmentPending("problem")
++        self.assertIsJujuError(error)
          self.assertEquals(str(error), "problem")
--    def testInvalidPlacementPolicy(self):
++    def test_InvalidPlacementPolicy(self):
          error = InvalidPlacementPolicy("x", "foobar",  ["a", "b", "c"])
++        self.assertIsJujuError(error)
          self.assertEquals(
              str(error),
              ("Unsupported placement policy: 'x' for provider: 'foobar', "
              "supported policies a, b, c"))
++
++    def test_CharmError(self):
++        error = CharmError("/foo/bar", "blah blah")
++        self.assertIsJujuError(error)
++        self.assertEquals(str(error), "Error processing '/foo/bar': blah blah")
++
++    def test_CharmInvocationError(self):
++        error = CharmInvocationError("/foo/bar", 1)
++        self.assertIsJujuError(error)
++        self.assertEquals(
++            str(error), "Error processing '/foo/bar': exit code 1.")
++
++    def test_CharmUpgradeError(self):
++        error = CharmUpgradeError("blah blah")
++        self.assertIsJujuError(error)
++        self.assertEquals(str(error), "Cannot upgrade charm: blah blah")
 === modified file 'juju/unit/lifecycle.py'
 --- juju/unit/lifecycle.py	2012-01-28 08:45:10 +0000
 +++ juju/unit/lifecycle.py	2012-02-21 10:23:29 +0000
@@ -1,17 +1,21 @@
  import os
  import logging
++import shutil
++import tempfile
  import yaml
  from twisted.internet.defer import (
      inlineCallbacks, DeferredLock, DeferredList, returnValue)
++from juju.errors import CharmUpgradeError
  from juju.hooks.invoker import Invoker
  from juju.hooks.scheduler import HookScheduler
  from juju.state.hook import (
--    DepartedRelationHookContext, RelationChange, HookContext)
++    DepartedRelationHookContext, HookContext, RelationChange)
  from juju.state.errors import StopWatcher, UnitRelationStateNotFound
  from juju.state.relation import RelationStateManager, UnitRelationState
++from juju.unit.charm import download_charm
  from juju.unit.workflow import RelationWorkflowState
@@ -19,6 +23,67 @@
  hook_log = logging.getLogger("hook.output")
++# This is used as `client_id` when constructing Invokers
++_EVIL_CONSTANT = "constant"
++
++
++class _CharmUpgradeOperation(object):
++    """Helper class dealing only with the bare mechanics of upgrading"""
++
++    def __init__(self, client, service, unit, unit_dir):
++        self._client = client
++        self._service = service
++        self._unit = unit
++        self._old_id = None
++        self._new_id = None
++        self._download_dir = tempfile.mkdtemp()
++        self._bundle = None
++        self._charm_dir = os.path.join(unit_dir, "charm")
++        self._log = logging.getLogger("charm.upgrade")
++
++    @inlineCallbacks
++    def prepare(self):
++        self._log.debug("Checking for newer charm...")
++        try:
++            self._new_id = yield self._service.get_charm_id()
++            self._old_id = yield self._unit.get_charm_id()
++            if self._new_id != self._old_id:
++                self._log.debug("Downloading %s...", self._new_id)
++                self._bundle = yield download_charm(
++                    self._client, self._new_id, self._download_dir)
++            else:
++                self._log.debug("Latest charm is already present.")
++        except Exception as e:
++            self._log.exception("Charm upgrade preparation failed.")
++            raise CharmUpgradeError(str(e))
++
++    @property
++    def ready(self):
++        return self._bundle is not None
++
++    @inlineCallbacks
++    def run(self):
++        assert self.ready
++        self._log.debug(
++            "Replacing charm %s with %s.", self._old_id, self._new_id)
++        try:
++            # TODO this will leave droppings from the old charm; but we can't
++            # delete the whole charm dir and replace it, because some charms
++            # store state within their directories. See lp:791035
++            self._bundle.extract_to(self._charm_dir)
++            self._log.debug(
++                "Charm has been upgraded to %s.", self._new_id)
++
++            yield self._unit.set_charm_id(self._new_id)
++            self._log.debug("Upgrade recorded.")
++        except Exception as e:
++            self._log.exception("Charm upgrade failed.")
++            raise CharmUpgradeError(str(e))
++
++    def cleanup(self):
++        if os.path.exists(self._download_dir):
++            shutil.rmtree(self._download_dir)
++
  class UnitLifecycle(object):
      """Manager for a unit lifecycle.
@@ -64,15 +129,6 @@
              yield self._execute_hook("install")
      @inlineCallbacks
--    def upgrade_charm(self, fire_hooks=True):
--        """Invoke the unit's upgrade-charm hook.
--        """
--        if fire_hooks:
--            yield self._execute_hook("upgrade-charm", now=True)
--        # Restart hook queued hook execution.
--        self._executor.start()
--
--    @inlineCallbacks
      def start(self, fire_hooks=True, start_relations=True):
          """Invoke the start hook, and set up relation watching.
@@ -104,9 +160,10 @@
                  # We actually want to transition from "down" to "up" where
                  # applicable (ie a stopped unit is starting up again)
                  for workflow in self._relations.values():
--                    state = yield workflow.get_state()
--                    if state == "down":
--                        yield workflow.transition_state("up")
++                    with (yield workflow.lock()):
++                        state = yield workflow.get_state()
++                        if state == "down":
++                            yield workflow.transition_state("up")
              # Establish a watch on the existing relations.
              if not self._watching_relation_memberships:
@@ -157,7 +214,8 @@
                  # We actually want to transition relation states
                  # (probably because the unit workflow state is stopped/error)
                  for workflow in self._relations.values():
--                    yield workflow.transition_state("down")
++                    with (yield workflow.lock()):
++                        yield workflow.transition_state("down")
              else:
                  # We just want to stop the relations from acting
                  # (probably because the process is going down)
@@ -192,6 +250,42 @@
          self._log.debug("configured unit")
      @inlineCallbacks
++    def upgrade_charm(self, fire_hooks=True):
++        """Upgrade the charm and invoke the upgrade-charm hook if requested.
++
++        :param fire_hooks: if False, *and* the actual upgrade operation is not
++            necessary, skip the upgrade-charm hook. When the actual charm has
++            changed during this invocation, this flag is ignored: hooks will
++            always be fired.
++        """
++        self._log.debug("Upgrading charm")
++        upgrade = _CharmUpgradeOperation(
++            self._client, self._service, self._unit, self._unit_dir)
++        yield self._run_lock.acquire()
++        try:
++            yield upgrade.prepare()
++
++            # Executor may already be stopped if we're retrying.
++            if self._executor.running:
++                self._log.debug("Pausing normal hook execution")
++                yield self._executor.stop()
++
++            if upgrade.ready:
++                yield upgrade.run()
++                # We changed the charm just now: we *must* fire hooks.
++                fire_hooks = True
++            if fire_hooks:
++                yield self._execute_hook("upgrade-charm", now=True)
++
++            # Always restart executor on success; charm upgrade operations and
++            # errors are the only reasons for the executor to be stopped.
++            self._log.debug("Resuming normal hook execution.")
++            self._executor.start()
++        finally:
++            self._run_lock.release()
++            upgrade.cleanup()
++
++    @inlineCallbacks
      def _on_relation_resolved_changes(self, event):
          """Callback for unit relation resolved watching.
@@ -228,11 +322,11 @@
          keys = set(relation_resolved).intersection(self._relations)
          for rel_id in keys:
--            relation_workflow = self._relations[rel_id]
--            relation_state = yield relation_workflow.get_state()
--            if relation_state == "up":
--                continue
--            yield relation_workflow.transition_state("up")
++            workflow = self._relations[rel_id]
++            with (yield workflow.lock()):
++                state = yield workflow.get_state()
++                if state != "up":
++                    yield workflow.transition_state("up")
      @inlineCallbacks
      def _on_service_relation_changes(self, old_relations, new_relations):
@@ -294,7 +388,8 @@
          # Actually depart old relations.
          for relation_id in removed:
              workflow = self._relations.pop(relation_id)
--            yield workflow.transition_state("departed")
++            with (yield workflow.lock()):
++                yield workflow.transition_state("departed")
              self._store_relations()
          # Process new relations.
@@ -336,7 +431,8 @@
              # (according to latest stored state) they will try to start, and
              # it won't go well (no way to watch related units).
              if relation_id in current_ids:
--                yield workflow.synchronize()
++                with (yield workflow.lock()):
++                    yield workflow.synchronize()
              # Put everything into self._relations; adds/departs will be handled
              # as usual in the first call to _process_service_changes.
@@ -399,7 +495,8 @@
              lifecycle, self._state_dir)
          self._relations[service_relation.internal_relation_id] = workflow
--        yield workflow.synchronize()
++        with (yield workflow.lock()):
++            yield workflow.synchronize()
      @inlineCallbacks
      def _execute_hook(self, hook_name, now=False):
@@ -412,7 +509,7 @@
          socket_path = os.path.join(self._unit_dir, HOOK_SOCKET_FILE)
          invoker = Invoker(
              HookContext(self._client, self._unit.unit_name), None,
--            "constant", socket_path, self._unit_dir, hook_log)
++            _EVIL_CONSTANT, socket_path, self._unit_dir, hook_log)
          if now:
              yield self._executor.run_priority_hook(invoker, hook_path)
@@ -482,7 +579,8 @@
      def set_hook_error_handler(self, handler):
          """Set an error handler to be invoked if a hook errors.
--        The handler should accept one parameter, the exception instance.
++        The handler should accept two parameters, the RelationChange that
++        triggered the hook, and the exception instance.
          """
          self._error_handler = handler
 === modified file 'juju/unit/tests/test_lifecycle.py'
 --- juju/unit/tests/test_lifecycle.py	2012-01-11 09:37:48 +0000
 +++ juju/unit/tests/test_lifecycle.py	2012-02-21 10:23:29 +0000
@@ -9,25 +9,37 @@
  from twisted.internet.defer import inlineCallbacks, Deferred, fail, returnValue
--from juju.unit.lifecycle import (
--    UnitLifecycle, UnitRelationLifecycle, RelationInvoker)
--
++from juju.charm.url import CharmURL
++from juju.control.tests.test_upgrade_charm import CharmUpgradeTestBase
++from juju.errors import CharmInvocationError, CharmError, CharmUpgradeError
  from juju.hooks.invoker import Invoker
  from juju.hooks.executor import HookExecutor
--
--from juju.errors import CharmInvocationError, CharmError
--
  from juju.state.endpoint import RelationEndpoint
  from juju.state.relation import ClientServerUnitWatcher
  from juju.state.service import NO_HOOKS
  from juju.state.tests.test_relation import RelationTestBase
  from juju.state.hook import RelationChange
--
++from juju.unit.lifecycle import (
++    UnitLifecycle, UnitRelationLifecycle, RelationInvoker)
++from juju.unit.tests.test_charm import CharmPublisherTestBase
  from juju.lib.testing import TestCase
  from juju.lib.mocker import MATCH
++class UnwriteablePath(object):
++
++    def __init__(self, path):
++        self.path = path
++
++    def __enter__(self):
++        self.mode = os.stat(self.path).st_mode
++        os.chmod(self.path, 0000)
++
++    def __exit__(self, *exc_info):
++        os.chmod(self.path, self.mode)
++
++
  class LifecycleTestBase(RelationTestBase):
      juju_directory = None
@@ -64,6 +76,9 @@
          self.state_directory = os.path.join(self.juju_directory, "state")
          os.makedirs(self.state_directory)
++    def frozen_charm(self):
++        return UnwriteablePath(os.path.join(self.unit_directory, "charm"))
++
      def write_hook(self, name, text, no_exec=False, hooks_dir=None):
          if hooks_dir is None:
              hooks_dir = os.path.join(self.unit_directory, "charm", "hooks")
@@ -218,7 +233,8 @@
              self.states["unit_relation"].internal_relation_id)
          self.assertEqual("up", (yield workflow.get_state()))
--        yield workflow.transition_state("down")
++        with (yield workflow.lock()):
++            yield workflow.transition_state("down")
          resolved = self.wait_on_state(workflow, "up")
          # Stop the unit lifecycle
@@ -235,14 +251,6 @@
              {self.states["unit_relation"].internal_relation_id: NO_HOOKS},
              (yield self.states["unit"].get_relation_resolved()))
--        # If the unit is restarted start, we currently have the
--        # behavior that the unit relation workflow will automatically
--        # be transitioned back to running, as part of the normal state
--        # transition. Sigh.. we should have a separate error
--        # state for relation hooks then down with state variable usage.
--        # The current end behavior though seems like the best outcome, ie.
--        # automatically restart relations.
--
      @inlineCallbacks
      def test_resolved_relation_watch_relation_up(self):
          """If a relation marked as to be resolved is already running,
@@ -285,7 +293,8 @@
          workflow = self.lifecycle.get_relation_workflow(
              self.states["unit_relation"].internal_relation_id)
          self.assertEqual("up", (yield workflow.get_state()))
--        yield workflow.fire_transition("error")
++        with (yield workflow.lock()):
++            yield workflow.fire_transition("error")
          resolved = self.wait_on_state(workflow, "up")
@@ -321,7 +330,8 @@
          workflow = self.lifecycle.get_relation_workflow(
              self.states["unit_relation"].internal_relation_id)
          self.assertEqual("up", (yield workflow.get_state()))
--        yield workflow.transition_state("down")
++        with (yield workflow.lock()):
++            yield workflow.transition_state("down")
          resolved = self.wait_on_state(workflow, "up")
@@ -410,16 +420,6 @@
          self.assertFalse(install_executed.called)
      @inlineCallbacks
--    def test_upgrade_sans_hook(self):
--        """The lifecycle upgrade can be invoked without firing hooks."""
--        self.executor.stop()
--        self.write_hook("upgrade-charm", "#!/bin/sh\n exit 1")
--        upgrade_executed = self.wait_on_hook("upgrade-charm")
--        yield self.lifecycle.upgrade_charm(fire_hooks=False)
--        self.assertFalse(upgrade_executed.called)
--        self.assertTrue(self.executor.running)
--
--    @inlineCallbacks
      def test_running(self):
          self.assertFalse(self.lifecycle.running)
          yield self.lifecycle.install()
@@ -472,23 +472,6 @@
          return file_path
      @inlineCallbacks
--    def test_upgrade_hook_invoked_on_upgrade_charm(self):
--        """Invoking the upgrade_charm lifecycle method executes the
--        upgrade-charm hook.
--        """
--        file_path = self.makeFile("")
--        self.write_hook(
--            "upgrade-charm",
--            ("#!/bin/bash\n" "echo upgraded >> %s\n" % file_path))
--
--        # upgrade requires the external actor that extracts the charm
--        # to stop the hook executor, prior to extraction so the
--        # upgrade is the first hook run.
--        yield self.executor.stop()
--        yield self.lifecycle.upgrade_charm()
--        self.assertEqual(open(file_path).read().strip(), "upgraded")
--
--    @inlineCallbacks
      def test_config_hook_invoked_on_configure(self):
          """Invoke the configure lifecycle method will execute the
          config-changed hook.
@@ -765,7 +748,8 @@
          workflow = self.lifecycle.get_relation_workflow(
              self.states["unit_relation"].internal_relation_id)
          self.assertEqual("up", (yield workflow.get_state()))
--        yield workflow.transition_state("down")
++        with (yield workflow.lock()):
++            yield workflow.transition_state("down")
          resolved = self.wait_on_state(workflow, "up")
          # Stop the unit lifecycle
@@ -854,6 +838,93 @@
          yield new_lifecycle.stop()
++class UnitLifecycleUpgradeTest(
++        LifecycleTestBase, CharmPublisherTestBase, CharmUpgradeTestBase):
++
++    @inlineCallbacks
++    def setUp(self):
++        yield super(UnitLifecycleUpgradeTest, self).setUp()
++        yield self.setup_default_test_relation()
++        self.lifecycle = UnitLifecycle(
++            self.client, self.states["unit"], self.states["service"],
++            self.unit_directory, self.state_directory, self.executor)
++
++    @inlineCallbacks
++    def test_no_actual_upgrade_bad_hook(self):
++        self.write_hook("upgrade-charm", "#!/bin/bash\nexit 1\n")
++        done = self.wait_on_hook("upgrade-charm")
++        yield self.assertFailure(
++            self.lifecycle.upgrade_charm(), CharmInvocationError)
++        yield done
++        self.assertFalse(self.executor.running)
++
++    @inlineCallbacks
++    def test_no_actual_upgrade_good_hook(self):
++        self.write_hook("upgrade-charm", "#!/bin/bash\nexit 0\n")
++        # Ensure we don't actually upgrade
++        with self.frozen_charm():
++            done = self.wait_on_hook("upgrade-charm")
++            yield self.lifecycle.upgrade_charm()
++            yield done
++            self.assertTrue(self.executor.running)
++
++    @inlineCallbacks
++    def test_no_actual_upgrade_dont_fire_hooks(self):
++        self.write_hook("upgrade-charm", "#!/bin/bash\nexit 1\n")
++        with self.frozen_charm():
++            done = self.wait_on_hook("upgrade-charm")
++            yield self.lifecycle.upgrade_charm(fire_hooks=False)
++            yield self.sleep(0.1)
++            self.assertFalse(done.called)
++
++    @inlineCallbacks
++    def prepare_real_upgrade(self, hook_exit):
++        repo = self.increment_charm(self.charm)
++        hooks_dir = os.path.join(repo.path, "series", "mysql", "hooks")
++        self.write_hook(
++            "upgrade-charm",
++            "#!/bin/bash\nexit %s\n" % hook_exit,
++            hooks_dir=hooks_dir)
++        charm = yield repo.find(CharmURL.parse("local:series/mysql"))
++        charm, charm_state = yield self.publish_charm(charm.path)
++        yield self.states["service"].set_charm_id(charm_state.id)
++
++    @inlineCallbacks
++    def test_full_run_bad_write(self):
++        yield self.prepare_real_upgrade(0)
++        with self.frozen_charm():
++            yield self.assertFailure(
++                self.lifecycle.upgrade_charm(), CharmUpgradeError)
++            self.assertFalse(self.executor.running)
++
++    @inlineCallbacks
++    def test_full_run_bad_hook(self):
++        yield self.prepare_real_upgrade(1)
++        done = self.wait_on_hook("upgrade-charm")
++        yield self.assertFailure(
++            self.lifecycle.upgrade_charm(), CharmInvocationError)
++        yield done
++        self.assertFalse(self.executor.running)
++
++    @inlineCallbacks
++    def test_full_run_good_hook(self):
++        yield self.prepare_real_upgrade(0)
++        done = self.wait_on_hook("upgrade-charm")
++        yield self.lifecycle.upgrade_charm()
++        yield done
++        self.assertTrue(self.executor.running)
++
++    @inlineCallbacks
++    def test_full_run_dont_fire_hooks_ignored(self):
++        """Hooks must always be fired if the charm version actually changed
++        in the course of the upgrade"""
++        yield self.prepare_real_upgrade(0)
++        done = self.wait_on_hook("upgrade-charm")
++        yield self.lifecycle.upgrade_charm(fire_hooks=False)
++        yield done
++        self.assertTrue(self.executor.running)
++
++
  class RelationInvokerTest(TestCase):
      def test_relation_invoker_environment(self):
@@ -875,10 +946,10 @@
  class UnitRelationLifecycleTest(LifecycleTestBase):
      hook_template = (
--            "#!/bin/bash\n"
--            "echo %(change_type)s >> %(file_path)s\n"
--            "echo JUJU_RELATION=$JUJU_RELATION >> %(file_path)s\n"
--            "echo JUJU_REMOTE_UNIT=$JUJU_REMOTE_UNIT >> %(file_path)s")
++        "#!/bin/bash\n"
++        "echo %(change_type)s >> %(file_path)s\n"
++        "echo JUJU_RELATION=$JUJU_RELATION >> %(file_path)s\n"
++        "echo JUJU_REMOTE_UNIT=$JUJU_REMOTE_UNIT >> %(file_path)s")
      @inlineCallbacks
      def setUp(self):
 === modified file 'juju/unit/tests/test_workflow.py'
 --- juju/unit/tests/test_workflow.py	2012-01-11 09:37:48 +0000
 +++ juju/unit/tests/test_workflow.py	2012-02-21 10:23:29 +0000
@@ -1,14 +1,19 @@
++import csv
  import itertools
  import logging
++import os
  import yaml
--import csv
--import os
  from twisted.internet.defer import inlineCallbacks, returnValue
++from juju.control.tests.test_upgrade_charm import CharmUpgradeTestBase
++from juju.unit.tests.test_charm import CharmPublisherTestBase
  from juju.unit.tests.test_lifecycle import LifecycleTestBase
++
++from juju.charm.directory import CharmDirectory
++from juju.charm.url import CharmURL
++from juju.lib.statemachine import WorkflowState
  from juju.unit.lifecycle import UnitLifecycle, UnitRelationLifecycle
--
  from juju.unit.workflow import (
      UnitWorkflowState, RelationWorkflowState, WorkflowStateClient,
      is_unit_running, is_relation_running)
@@ -40,6 +45,26 @@
                      [yaml.load(r[0]) for r in csv.reader(history)],
                       yaml.load(zk_state[history_id])))
++    @inlineCallbacks
++    def assert_history(self, expected, **kwargs):
++        f_state, history, zk_state = yield self.read_persistent_state(**kwargs)
++        self.assertEquals(f_state, zk_state)
++        self.assertEquals(f_state, history[-1])
++        self.assertEquals(history, expected)
++
++    def assert_history_concise(self, *chunks, **kwargs):
++        state = None
++        history = []
++        for chunk in chunks:
++            for transition in chunk[:-1]:
++                history.append({
++                    "state": state,
++                    "state_variables": {},
++                    "transition_id": transition})
++            state = chunk[-1]
++            history.append({"state": state, "state_variables": {}})
++        return self.assert_history(history, **kwargs)
++
      def write_exit_hook(self, name, code=0, hooks_dir=None):
          self.write_hook(
              name,
@@ -68,12 +93,14 @@
      @inlineCallbacks
      def assert_transition(self, transition, success=True):
--        result = yield self.workflow.fire_transition(transition)
++        with (yield self.workflow.lock()):
++            result = yield self.workflow.fire_transition(transition)
          self.assertEquals(result, success)
      @inlineCallbacks
      def assert_transition_alias(self, transition, success=True):
--        result = yield self.workflow.fire_transition_alias(transition)
++        with (yield self.workflow.lock()):
++            result = yield self.workflow.fire_transition_alias(transition)
          self.assertEquals(result, success)
      @inlineCallbacks
@@ -86,13 +113,6 @@
              lines = tuple(l.strip() for l in f)
          self.assertEquals(lines, hooks)
--    @inlineCallbacks
--    def assert_history(self, expected):
--        f_state, history, zk_state = yield self.read_persistent_state()
--        self.assertEquals(f_state, zk_state)
--        self.assertEquals(f_state, history[-1])
--        self.assertEquals(history, expected)
--
  class UnitWorkflowTest(UnitWorkflowTestBase):
@@ -101,9 +121,8 @@
          yield self.assert_transition("install")
          yield self.assert_state("started")
          self.assert_hooks("install", "config-changed", "start")
--        yield self.assert_history([
--            {"state": "installed", "state_variables": {}},
--            {"state": "started", "state_variables": {}}])
++        yield self.assert_history_concise(
++            ("install", "installed"), ("start", "started"))
      @inlineCallbacks
      def test_install_with_error_and_retry(self):
@@ -118,10 +137,10 @@
          yield self.assert_transition("retry_install")
          yield self.assert_state("started")
          self.assert_hooks("install", "config-changed", "start")
--        yield self.assert_history([
--            {"state": "install_error", "state_variables": {}},
--            {"state": "installed", "state_variables": {}},
--            {"state": "started", "state_variables": {}}])
++        yield self.assert_history_concise(
++            ("install", "error_install", "install_error"),
++            ("retry_install", "installed"),
++            ("start", "started"))
      @inlineCallbacks
      def test_install_error_with_retry_hook(self):
@@ -139,21 +158,22 @@
          yield self.assert_state("started")
          self.assert_hooks(
              "install", "install", "install", "config-changed", "start")
--        yield self.assert_history([
--            {"state": "install_error", "state_variables": {}},
--            {"state": "installed", "state_variables": {}},
--            {"state": "started", "state_variables": {}}])
++        yield self.assert_history_concise(
++            ("install", "error_install", "install_error"),
++            ("retry_install_hook", "install_error"),
++            ("retry_install_hook", "installed"),
++            ("start", "started"))
      @inlineCallbacks
      def test_start(self):
--        yield self.workflow.set_state("installed")
++        with (yield self.workflow.lock()):
++            yield self.workflow.set_state("installed")
          yield self.assert_transition("start")
          yield self.assert_state("started")
          self.assert_hooks("config-changed", "start")
--        yield self.assert_history([
--            {"state": "installed", "state_variables": {}},
--            {"state": "started", "state_variables": {}}])
++        yield self.assert_history_concise(
++            ("installed",), ("start", "started"))
      @inlineCallbacks
      def test_start_with_error(self):
@@ -169,10 +189,10 @@
          yield self.assert_transition("retry_start")
          yield self.assert_state("started")
          self.assert_hooks("install", "config-changed", "start")
--        yield self.assert_history([
--            {"state": "installed", "state_variables": {}},
--            {"state": "start_error", "state_variables": {}},
--            {"state": "started", "state_variables": {}}])
++        yield self.assert_history_concise(
++            ("install", "installed"),
++            ("start", "error_start", "start_error"),
++            ("retry_start", "started"))
      @inlineCallbacks
      def test_start_error_with_retry_hook(self):
@@ -191,10 +211,11 @@
          self.assert_hooks(
              "install", "config-changed", "start", "config-changed", "start",
              "config-changed", "start")
--        yield self.assert_history([
--            {"state": "installed", "state_variables": {}},
--            {"state": "start_error", "state_variables": {}},
--            {"state": "started", "state_variables": {}}])
++        yield self.assert_history_concise(
++            ("install", "installed"),
++            ("start", "error_start", "start_error"),
++            ("retry_start_hook", "start_error"),
++            ("retry_start_hook", "started"))
      @inlineCallbacks
      def test_stop(self):
@@ -206,10 +227,10 @@
          yield self.assert_transition("stop")
          yield self.assert_state("stopped")
          self.assert_hooks("install", "config-changed", "start", "stop")
--        yield self.assert_history([
--            {"state": "installed", "state_variables": {}},
--            {"state": "started", "state_variables": {}},
--            {"state": "stopped", "state_variables": {}}])
++        yield self.assert_history_concise(
++            ("install", "installed"),
++            ("start", "started"),
++            ("stop", "stopped"))
      @inlineCallbacks
      def test_stop_with_error(self):
@@ -221,11 +242,11 @@
          yield self.assert_transition("retry_stop")
          yield self.assert_state("stopped")
          self.assert_hooks("install", "config-changed", "start", "stop")
--        yield self.assert_history([
--            {"state": "installed", "state_variables": {}},
--            {"state": "started", "state_variables": {}},
--            {"state": "stop_error", "state_variables": {}},
--            {"state": "stopped", "state_variables": {}}])
++        yield self.assert_history_concise(
++            ("install", "installed"),
++            ("start", "started"),
++            ("stop", "error_stop", "stop_error"),
++            ("retry_stop", "stopped"))
      @inlineCallbacks
      def test_stop_error_with_retry_hook(self):
@@ -241,11 +262,12 @@
          yield self.assert_state("stopped")
          self.assert_hooks(
              "install", "config-changed", "start", "stop", "stop", "stop")
--        yield self.assert_history([
--            {"state": "installed", "state_variables": {}},
--            {"state": "started", "state_variables": {}},
--            {"state": "stop_error", "state_variables": {}},
--            {"state": "stopped", "state_variables": {}}])
++        yield self.assert_history_concise(
++            ("install", "installed"),
++            ("start", "started"),
++            ("stop", "error_stop", "stop_error"),
++            ("retry_stop_hook", "stop_error"),
++            ("retry_stop_hook", "stopped"))
      @inlineCallbacks
      def test_configure(self):
@@ -254,14 +276,14 @@
          """
          yield self.assert_transition("install")
--        yield self.assert_transition("reconfigure")
++        yield self.assert_transition("configure")
          yield self.assert_state("started")
          self.assert_hooks(
              "install", "config-changed", "start", "config-changed")
--        yield self.assert_history([
--            {"state": "installed", "state_variables": {}},
--            {"state": "started", "state_variables": {}},
--            {"state": "started", "state_variables": {}}])
++        yield self.assert_history_concise(
++            ("install", "installed"),
++            ("start", "started"),
++            ("configure", "started"))
      @inlineCallbacks
      def test_configure_error_and_retry(self):
@@ -270,17 +292,17 @@
          yield self.assert_transition("install")
          self.write_exit_hook("config-changed", 1)
--        yield self.assert_transition("reconfigure", False)
++        yield self.assert_transition("configure", False)
          yield self.assert_state("configure_error")
          yield self.assert_transition("retry_configure")
          yield self.assert_state("started")
          self.assert_hooks(
              "install", "config-changed", "start", "config-changed")
--        yield self.assert_history([
--            {"state": "installed", "state_variables": {}},
--            {"state": "started", "state_variables": {}},
--            {"state": "configure_error", "state_variables": {}},
--            {"state": "started", "state_variables": {}}])
++        yield self.assert_history_concise(
++            ("install", "installed"),
++            ("start", "started"),
++            ("configure", "error_configure", "configure_error"),
++            ("retry_configure", "started"))
      @inlineCallbacks
      def test_configure_error_and_retry_hook(self):
@@ -289,7 +311,7 @@
          yield self.assert_transition("install")
          self.write_exit_hook("config-changed", 1)
--        yield self.assert_transition("reconfigure", False)
++        yield self.assert_transition("configure", False)
          yield self.assert_state("configure_error")
          yield self.assert_transition("retry_configure_hook", False)
          yield self.assert_state("configure_error")
@@ -299,12 +321,13 @@
          self.assert_hooks(
              "install", "config-changed", "start",
              "config-changed", "config-changed", "config-changed")
--        yield self.assert_history([
--            {"state": "installed", "state_variables": {}},
--            {"state": "started", "state_variables": {}},
--            {"state": "configure_error", "state_variables": {}},
--            {"state": "configure_error", "state_variables": {}},
--            {"state": "started", "state_variables": {}}])
++        yield self.assert_history_concise(
++            ("install", "installed"),
++            ("start", "started"),
++            ("configure", "error_configure", "configure_error"),
++            ("retry_configure_hook", "error_retry_configure",
++                "configure_error"),
++            ("retry_configure_hook", "started"))
      @inlineCallbacks
      def test_is_unit_running(self):
@@ -312,12 +335,14 @@
              self.client, self.states["unit"])
          self.assertIdentical(running, False)
          self.assertIdentical(state, None)
--        yield self.workflow.fire_transition("install")
++        with (yield self.workflow.lock()):
++            yield self.workflow.fire_transition("install")
          running, state = yield is_unit_running(
              self.client, self.states["unit"])
          self.assertIdentical(running, True)
          self.assertEqual(state, "started")
--        yield self.workflow.fire_transition("stop")
++        with (yield self.workflow.lock()):
++            yield self.workflow.fire_transition("stop")
          running, state = yield is_unit_running(
              self.client, self.states["unit"])
          self.assertIdentical(running, False)
@@ -331,57 +356,98 @@
      @inlineCallbacks
      def test_client_with_state(self):
--        yield self.workflow.fire_transition("install")
++        with (yield self.workflow.lock()):
++            yield self.workflow.fire_transition("install")
          workflow_client = WorkflowStateClient(self.client, self.states["unit"])
          self.assertEqual(
              (yield workflow_client.get_state()), "started")
      @inlineCallbacks
      def test_client_readonly(self):
--        yield self.workflow.fire_transition("install")
++        with (yield self.workflow.lock()):
++            yield self.workflow.fire_transition("install")
          workflow_client = WorkflowStateClient(
              self.client, self.states["unit"])
          self.assertEqual(
              (yield workflow_client.get_state()), "started")
--        yield self.assertFailure(
--            workflow_client.set_state("stopped"), NotImplementedError)
++        with (yield workflow_client.lock()):
++            yield self.assertFailure(
++                workflow_client.set_state("stopped"), NotImplementedError)
          self.assertEqual(
              (yield workflow_client.get_state()), "started")
      @inlineCallbacks
--    def assert_synchronize(
--            self, start_state, start_vars,
--            expect_state, expect_lifecycle, expect_executor):
++    def assert_synchronize(self, start_state, state, lifecycle, executor,
++                           sync_lifecycle=None, sync_executor=None,
++                           start_inflight=None):
++        # Handle cases where we expect to be in a different state pre-sync
++        # to the final state post-sync.
++        if sync_lifecycle is None:
++            sync_lifecycle = lifecycle
++        if sync_executor is None:
++            sync_executor = executor
++        super_sync = WorkflowState.synchronize
++
++        @inlineCallbacks
++        def check_sync(obj):
++            # We don't care about RelationWorkflowState syncing here
++            if type(obj) == UnitWorkflowState:
++                self.assertEquals(
++                    self.lifecycle.running, sync_lifecycle)
++                self.assertEquals(
++                    self.executor.running, sync_executor)
++            yield super_sync(obj)
++
          all_start_states = itertools.product((True, False), (True, False))
--        for lifecycle, executor in all_start_states:
--            if executor and not self.executor.running:
++        for initial_lifecycle, initial_executor in all_start_states:
++            if initial_executor and not self.executor.running:
                  self.executor.start()
--            if lifecycle and not self.lifecycle.running:
++            elif not initial_executor and self.executor.running:
++                yield self.executor.stop()
++            if initial_lifecycle and not self.lifecycle.running:
                  yield self.lifecycle.start(fire_hooks=False)
--            yield self.workflow.set_state(start_state, **start_vars)
--            yield self.workflow.synchronize(self.executor)
--
--            state = yield self.workflow.get_state()
--            self.assertEquals(state, expect_state)
++            elif not initial_lifecycle and self.lifecycle.running:
++                yield self.lifecycle.stop(fire_hooks=False)
++            with (yield self.workflow.lock()):
++                yield self.workflow.set_state(start_state)
++                yield self.workflow.set_inflight(start_inflight)
++
++                # self.patch is not suitable because we can't unpatch until
++                # the end of the test, and we don't really want [many] distinct
++                # one-line test_synchronize_foo methods.
++                WorkflowState.synchronize = check_sync
++                try:
++                    yield self.workflow.synchronize(self.executor)
++                finally:
++                    WorkflowState.synchronize = super_sync
++
++            new_inflight = yield self.workflow.get_inflight()
++            self.assertEquals(new_inflight, None)
++            new_state = yield self.workflow.get_state()
++            self.assertEquals(new_state, state)
              vars = yield self.workflow.get_state_variables()
              self.assertEquals(vars, {})
--            self.assertEquals(self.lifecycle.running, expect_lifecycle)
--            self.assertEquals(self.executor.running, expect_executor)
++            self.assertEquals(self.lifecycle.running, lifecycle)
++            self.assertEquals(self.executor.running, executor)
      def assert_default_synchronize(self, state):
--        return self.assert_synchronize(state, {}, state, False, True)
--
--    @inlineCallbacks
--    def test_synchronize(self):
--        yield self.assert_synchronize(
--            None, {}, "started", True, True)
--        yield self.assert_synchronize(
--            "installed", {}, "started", True, True)
--        yield self.assert_synchronize(
--            "started", {}, "started", True, True)
--        yield self.assert_synchronize(
--            "charm_upgrade_error", {}, "charm_upgrade_error", True, False)
++        return self.assert_synchronize(state, state, False, True)
++
++    @inlineCallbacks
++    def test_synchronize_automatic(self):
++        # No transition in flight
++        yield self.assert_synchronize(
++            None, "started", True, True, False, True)
++        yield self.assert_synchronize(
++            "installed", "started", True, True, False, True)
++        yield self.assert_synchronize(
++            "started", "started", True, True)
++        yield self.assert_synchronize(
++            "charm_upgrade_error", "charm_upgrade_error", True, False)
++
++    @inlineCallbacks
++    def test_synchronize_trivial(self):
          yield self.assert_default_synchronize("install_error")
          yield self.assert_default_synchronize("start_error")
          yield self.assert_default_synchronize("configure_error")
@@ -389,88 +455,182 @@
          yield self.assert_default_synchronize("stopped")
      @inlineCallbacks
++    def test_synchronize_inflight(self):
++        # With transition inflight (we check the important one (upgrade_charm)
++        # and a couple of others at random, but testing every single one is
++        # entirely redundant).
++        yield self.assert_synchronize(
++            "started", "started", True, True, True, False, "upgrade_charm")
++        yield self.assert_synchronize(
++            None, "started", True, True, False, True, "install")
++        yield self.assert_synchronize(
++            "configure_error", "started", True, True, False, True,
++            "retry_configure_hook")
++
++
++class UnitWorkflowUpgradeTest(
++        UnitWorkflowTestBase, CharmPublisherTestBase, CharmUpgradeTestBase):
++
++    expected_upgrade = None
++
++    @inlineCallbacks
++    def ready_upgrade(self, bad_hook):
++        repository = self.increment_charm(self.charm)
++        hooks_dir = os.path.join(repository.path, "series", "mysql", "hooks")
++        self.write_exit_hook(
++            "upgrade-charm", int(bad_hook), hooks_dir=hooks_dir)
++
++        charm = yield repository.find(CharmURL.parse("local:series/mysql"))
++        charm, charm_state = yield self.publish_charm(charm.path)
++        yield self.states["service"].set_charm_id(charm_state.id)
++        self.expected_upgrade = charm_state.id
++
++    @inlineCallbacks
++    def assert_charm_upgraded(self, expect_upgraded):
++        charm_id = yield self.states["unit"].get_charm_id()
++        self.assertEquals(charm_id == self.expected_upgrade, expect_upgraded)
++        if expect_upgraded:
++            expect_revision = CharmURL.parse(self.expected_upgrade).revision
++            charm = CharmDirectory(os.path.join(self.unit_directory, "charm"))
++            self.assertEquals(charm.get_revision(), expect_revision)
++
++    @inlineCallbacks
++    def test_upgrade_not_available(self):
++        """Upgrading when there's no new version runs the hook anyway"""
++        yield self.assert_transition("install")
++        yield self.assert_state("started")
++
++        yield self.assert_transition("upgrade_charm")
++        yield self.assert_state("started")
++        self.assert_hooks(
++            "install", "config-changed", "start", "upgrade-charm")
++        yield self.assert_history_concise(
++            ("install", "installed"),
++            ("start", "started"),
++            ("upgrade_charm", "started"))
++
++    @inlineCallbacks
      def test_upgrade(self):
          """Upgrading a workflow results in the upgrade hook being
          executed.
          """
--        self.makeFile()
--        yield self.workflow.fire_transition("install")
--        current_state = yield self.workflow.get_state()
--        self.assertEqual(current_state, "started")
--        file_path = self.makeFile()
--        self.write_hook("upgrade-charm",
--                        ("#!/bin/bash\n"
--                         "echo upgraded >> %s") % file_path)
--        self.executor.stop()
--        yield self.workflow.fire_transition("upgrade_charm")
--        current_state = yield self.workflow.get_state()
--        self.assertEqual(current_state, "started")
--
--    @inlineCallbacks
--    def test_upgrade_without_stopping_hooks_errors(self):
--        """Attempting to execute an upgrade without stopping the
--        executor is an error.
--        """
--        yield self.workflow.fire_transition("install")
--        current_state = yield self.workflow.get_state()
--        self.assertEqual(current_state, "started")
--        yield self.assertFailure(
--            self.workflow.fire_transition("upgrade_charm"),
--            AssertionError)
++        yield self.assert_transition("install")
++        yield self.assert_state("started")
++        yield self.ready_upgrade(False)
++
++        yield self.assert_charm_upgraded(False)
++        yield self.assert_transition("upgrade_charm")
++        yield self.assert_state("started")
++        yield self.assert_charm_upgraded(True)
++
++        self.assert_hooks(
++            "install", "config-changed", "start", "upgrade-charm")
++        yield self.assert_history_concise(
++            ("install", "installed"),
++            ("start", "started"),
++            ("upgrade_charm", "started"))
      @inlineCallbacks
      def test_upgrade_error_retry(self):
          """A hook error during an upgrade transitions to
          upgrade_error.
          """
--        self.write_hook("upgrade-charm", "#!/bin/bash\nexit 1")
--        yield self.workflow.fire_transition("install")
--        current_state = yield self.workflow.get_state()
--        self.assertEqual(current_state, "started")
--        self.executor.stop()
--        yield self.workflow.fire_transition("upgrade_charm")
--
--        current_state = yield self.workflow.get_state()
--        self.assertEqual(current_state, "charm_upgrade_error")
--        file_path = self.makeFile()
--        self.write_hook("upgrade-charm",
--                        ("#!/bin/bash\n"
--                         "echo upgraded >> %s") % file_path)
--
--        # The upgrade error hook should ensure that the executor is stoppped.
++        yield self.assert_transition("install")
++        yield self.assert_state("started")
++        yield self.ready_upgrade(True)
++
++        yield self.assert_charm_upgraded(False)
++        yield self.assert_transition("upgrade_charm", False)
++        yield self.assert_state("charm_upgrade_error")
          self.assertFalse(self.executor.running)
--        yield self.workflow.fire_transition("retry_upgrade_charm")
--        current_state = yield self.workflow.get_state()
--        self.assertEqual(current_state, "started")
++        # The upgrade should complete before the hook blows up.
++        yield self.assert_charm_upgraded(True)
++
++        # The bad hook is still in place, but we don't run it again
++        yield self.assert_transition("retry_upgrade_charm")
++        yield self.assert_state("started")
++        yield self.assert_charm_upgraded(True)
++        self.assertTrue(self.executor.running)
++
++        self.assert_hooks(
++            "install", "config-changed", "start", "upgrade-charm")
++        yield self.assert_history_concise(
++            ("install", "installed"),
++            ("start", "started"),
++            ("upgrade_charm", "upgrade_charm_error", "charm_upgrade_error"),
++            ("retry_upgrade_charm", "started"))
      @inlineCallbacks
      def test_upgrade_error_retry_hook(self):
          """A hook error during an upgrade transitions to
          upgrade_error, and can be re-tried with hook execution.
          """
--        yield self.workflow.fire_transition("install")
--        current_state = yield self.workflow.get_state()
--        self.assertEqual(current_state, "started")
--
--        # Agent prepares this.
--        self.executor.stop()
--
--        self.write_hook("upgrade-charm", "#!/bin/bash\nexit 1")
--        hook_deferred = self.wait_on_hook("upgrade-charm")
--        yield self.workflow.fire_transition("upgrade_charm")
--        yield hook_deferred
--        current_state = yield self.workflow.get_state()
--        self.assertEqual(current_state, "charm_upgrade_error")
--
--        hook_deferred = self.wait_on_hook("upgrade-charm")
--        self.write_hook("upgrade-charm", "#!/bin/bash\nexit 0")
--        # The upgrade error hook should ensure that the executor is stoppped.
--        self.assertFalse(self.executor.running)
--        yield self.workflow.fire_transition_alias("retry_hook")
--        yield hook_deferred
--        current_state = yield self.workflow.get_state()
--        self.assertEqual(current_state, "started")
--        self.assertTrue(self.executor.running)
++        yield self.assert_transition("install")
++        yield self.assert_state("started")
++        yield self.ready_upgrade(True)
++
++        yield self.assert_charm_upgraded(False)
++        yield self.assert_transition("upgrade_charm", False)
++        yield self.assert_state("charm_upgrade_error")
++        self.assertFalse(self.executor.running)
++        # The upgrade should complete before the hook blows up.
++        yield self.assert_charm_upgraded(True)
++
++        yield self.assert_transition("retry_upgrade_charm_hook", False)
++        yield self.assert_state("charm_upgrade_error")
++        self.assertFalse(self.executor.running)
++        yield self.assert_charm_upgraded(True)
++
++        self.write_exit_hook("upgrade-charm")
++        yield self.assert_transition_alias("retry_hook")
++        yield self.assert_state("started")
++        self.assertTrue(self.executor.running)
++        yield self.assert_charm_upgraded(True)
++
++        self.assert_hooks(
++            "install", "config-changed", "start",
++            "upgrade-charm", "upgrade-charm", "upgrade-charm")
++        yield self.assert_history_concise(
++            ("install", "installed"),
++            ("start", "started"),
++            ("upgrade_charm", "upgrade_charm_error", "charm_upgrade_error"),
++            ("retry_upgrade_charm_hook", "retry_upgrade_charm_error",
++                "charm_upgrade_error"),
++            ("retry_upgrade_charm_hook", "started"))
++
++    @inlineCallbacks
++    def test_upgrade_error_before_hook(self):
++        """If we blow up during the critical pre-hook bits, we should still
++        end up in the same error state"""
++        yield self.assert_transition("install")
++        yield self.assert_state("started")
++        yield self.ready_upgrade(False)
++
++        # Induce a surprising error
++        with self.frozen_charm():
++            yield self.assert_charm_upgraded(False)
++            yield self.assert_transition("upgrade_charm", False)
++            yield self.assert_state("charm_upgrade_error")
++            self.assertFalse(self.executor.running)
++            # The upgrade did not complete
++            yield self.assert_charm_upgraded(False)
++
++        yield self.assert_transition("retry_upgrade_charm")
++        yield self.assert_state("started")
++        self.assertTrue(self.executor.running)
++        yield self.assert_charm_upgraded(True)
++
++        # The hook must run here, even though it's a retry, because the actual
++        # charm only just got overwritten: and so we know that we've never even
++        # tried to execute a hook for this upgrade, and we must do so to fulfil
++        # the guarantee that that hook runs first after upgrade.
++        self.assert_hooks(
++            "install", "config-changed", "start", "upgrade-charm")
++        yield self.assert_history_concise(
++            ("install", "installed"),
++            ("start", "started"),
++            ("upgrade_charm", "upgrade_charm_error", "charm_upgrade_error"),
++            ("retry_upgrade_charm", "started"))
  class UnitRelationWorkflowTest(WorkflowTestBase):
@@ -497,27 +657,29 @@
          self.workflow = RelationWorkflowState(
              self.client, self.states["unit_relation"],
--            self.states["unit"].unit_name, self.lifecycle, self.state_directory)
++            self.states["unit"].unit_name, self.lifecycle,
++            self.state_directory)
      @inlineCallbacks
      def test_is_relation_running(self):
          """The unit relation's workflow state can be categorized as a
          boolean.
          """
--        running, state = yield is_relation_running(
--            self.client, self.states["unit_relation"])
--        self.assertIdentical(running, False)
--        self.assertIdentical(state, None)
--        yield self.workflow.fire_transition("start")
--        running, state = yield is_relation_running(
--            self.client, self.states["unit_relation"])
--        self.assertIdentical(running, True)
--        self.assertEqual(state, "up")
--        yield self.workflow.fire_transition("stop")
--        running, state = yield is_relation_running(
--            self.client, self.states["unit_relation"])
--        self.assertIdentical(running, False)
--        self.assertEqual(state, "down")
++        with (yield self.workflow.lock()):
++            running, state = yield is_relation_running(
++                self.client, self.states["unit_relation"])
++            self.assertIdentical(running, False)
++            self.assertIdentical(state, None)
++            yield self.workflow.fire_transition("start")
++            running, state = yield is_relation_running(
++                self.client, self.states["unit_relation"])
++            self.assertIdentical(running, True)
++            self.assertEqual(state, "up")
++            yield self.workflow.fire_transition("stop")
++            running, state = yield is_relation_running(
++                self.client, self.states["unit_relation"])
++            self.assertIdentical(running, False)
++            self.assertEqual(state, "down")
      @inlineCallbacks
      def test_up_down_cycle(self):
@@ -526,40 +688,29 @@
          self.write_hook("%s-relation-changed" % self.relation_name,
                          "#!/bin/bash\nexit 0\n")
--        yield self.workflow.fire_transition("start")
++        with (yield self.workflow.lock()):
++            yield self.workflow.fire_transition("start")
          yield self.assertState(self.workflow, "up")
          hook_executed = self.wait_on_hook("app-relation-changed")
--        # Add a new unit, and this will be scheduled by the time
--        # we finish stopping.
++        # Add a new unit, while we're stopped.
++        with (yield self.workflow.lock()):
++            yield self.workflow.fire_transition("stop")
          yield self.add_opposite_service_unit(self.states)
--        yield self.workflow.fire_transition("stop")
          yield self.assertState(self.workflow, "down")
          self.assertFalse(hook_executed.called)
--        # Currently if we restart, we will only see the previously
--        # queued event, as the last watch active when a lifecycle is
--        # stopped, may already be in flight and will be scheduled, and
--        # will be executed when the lifecycle is started. However any
--        # events that may have occured after the lifecycle is stopped
--        # are currently ignored and un-notified.
--        yield self.workflow.fire_transition("restart")
++        # Come back up; check unit add detected.
++        with (yield self.workflow.lock()):
++            yield self.workflow.fire_transition("restart")
          yield self.assertState(self.workflow, "up")
          yield hook_executed
--        f_state, history, zk_state = yield self.read_persistent_state(
++        yield self.assert_history_concise(
++            ("start", "up"), ("stop", "down"), ("restart", "up"),
              history_id=self.workflow.zk_state_id)
--        self.assertEqual(f_state, zk_state)
--        self.assertEqual(f_state,
--                         {"state": "up", "state_variables": {}})
--
--        self.assertEqual(history,
--                         [{"state": "up", "state_variables": {}},
--                          {"state": "down", "state_variables": {}},
--                          {"state": "up", "state_variables": {}}])
--
      @inlineCallbacks
      def test_change_hook_with_error(self):
          """An error while processing a change hook, results
@@ -571,11 +722,9 @@
          self.write_hook("%s-relation-changed" % self.relation_name,
                          "#!/bin/bash\nexit 1\n")
--        current_state = yield self.workflow.get_state()
--        self.assertEqual(current_state, None)
--        yield self.workflow.fire_transition("start")
++        with (yield self.workflow.lock()):
++            yield self.workflow.fire_transition("start")
          yield self.assertState(self.workflow, "up")
--        current_state = yield self.workflow.get_state()
          # Add a new unit, and wait for the broken hook to result in
          # the transition to the down state.
@@ -602,7 +751,8 @@
          broken hook is executed, and the unit stops responding to relation
          changes.
          """
--        yield self.workflow.fire_transition("start")
++        with (yield self.workflow.lock()):
++            yield self.workflow.fire_transition("start")
          yield self.assertState(self.workflow, "up")
          wait_on_hook = self.wait_on_hook("app-relation-changed")
@@ -611,7 +761,8 @@
          wait_on_hook = self.wait_on_hook("app-relation-broken")
          wait_on_state = self.wait_on_state(self.workflow, "departed")
--        yield self.workflow.fire_transition("depart")
++        with (yield self.workflow.lock()):
++            yield self.workflow.fire_transition("depart")
          yield wait_on_hook
          yield wait_on_state
@@ -643,7 +794,8 @@
      def test_client_read_state(self):
          """The relation workflow client can read the state of a unit
          relation."""
--        yield self.workflow.fire_transition("start")
++        with (yield self.workflow.lock()):
++            yield self.workflow.fire_transition("start")
          yield self.assertState(self.workflow, "up")
          self.write_hook("%s-relation-changed" % self.relation_name,
@@ -660,12 +812,31 @@
      def test_client_read_only(self):
          workflow_client = WorkflowStateClient(
              self.client, self.states["unit_relation"])
--        yield self.assertFailure(
--            workflow_client.set_state("up"),
--            NotImplementedError)
++        with (yield workflow_client.lock()):
++            yield self.assertFailure(
++                workflow_client.set_state("up"),
++                NotImplementedError)
      @inlineCallbacks
--    def assert_synchronize(self, state, expect_state, watches, scheduler):
++    def assert_synchronize(self, start_state, state, watches, scheduler,
++                           sync_watches=None, sync_scheduler=None,
++                           start_inflight=None):
++        # Handle cases where we expect to be in a different state pre-sync
++        # to the final state post-sync.
++        if sync_watches is None:
++            sync_watches = watches
++        if sync_scheduler is None:
++            sync_scheduler = scheduler
++        super_sync = WorkflowState.synchronize
++
++        @inlineCallbacks
++        def check_sync(obj):
++            self.assertEquals(
++                self.workflow.lifecycle.watching, sync_watches)
++            self.assertEquals(
++                self.workflow.lifecycle.executing, sync_scheduler)
++            yield super_sync(obj)
++
          start_states = itertools.product((True, False), (True, False))
          for (initial_watches, initial_scheduler) in start_states:
              yield self.workflow.lifecycle.stop()
@@ -676,22 +847,53 @@
                  self.workflow.lifecycle.watching, initial_watches)
              self.assertEquals(
                  self.workflow.lifecycle.executing, initial_scheduler)
--            yield self.workflow.set_state(state)
--
--            yield self.workflow.synchronize()
++            with (yield self.workflow.lock()):
++                yield self.workflow.set_state(start_state)
++                yield self.workflow.set_inflight(start_inflight)
++
++                # self.patch is not suitable because we can't unpatch until
++                # the end of the test, and we don't really want 13 distinct
++                # one-line test_synchronize_foo methods.
++                WorkflowState.synchronize = check_sync
++                try:
++                    yield self.workflow.synchronize()
++                finally:
++                    WorkflowState.synchronize = super_sync
++
++            new_inflight = yield self.workflow.get_inflight()
++            self.assertEquals(new_inflight, None)
              new_state = yield self.workflow.get_state()
--            self.assertEquals(new_state, expect_state)
++            self.assertEquals(new_state, state)
              self.assertEquals(self.workflow.lifecycle.watching, watches)
              self.assertEquals(self.workflow.lifecycle.executing, scheduler)
      @inlineCallbacks
      def test_synchronize(self):
--        yield self.assert_synchronize(None,  "up", True, True)
++        # No transition in flight
++        yield self.assert_synchronize(None,  "up", True, True, False, False)
          yield self.assert_synchronize("down", "down", False, False)
          yield self.assert_synchronize("departed", "departed", False, False)
          yield self.assert_synchronize("error", "error", True, False)
          yield self.assert_synchronize("up", "up", True, True)
++        # With transition inflight
++        yield self.assert_synchronize(
++            None, "up", True, True, False, False, "start")
++        yield self.assert_synchronize(
++            "up", "down", False, False, True, True, "stop")
++        yield self.assert_synchronize(
++            "down", "up", True, True, False, False, "restart")
++        yield self.assert_synchronize(
++            "up", "error", True, False, True, True, "error")
++        yield self.assert_synchronize(
++            "error", "up", True, True, True, False, "reset")
++        yield self.assert_synchronize(
++            "up", "departed", False, False, True, True, "depart")
++        yield self.assert_synchronize(
++            "down", "departed", False, False, False, False, "down_depart")
++        yield self.assert_synchronize(
++            "error", "departed", False, False, True, False, "error_depart")
++
      @inlineCallbacks
      def test_depart_hook_error(self):
          """A depart hook error, still results in a transition to the
@@ -701,7 +903,8 @@
                          "#!/bin/bash\nexit 1\n")
          error_output = self.capture_logging("unit.relation.workflow")
--        yield self.workflow.fire_transition("start")
++        with (yield self.workflow.lock()):
++            yield self.workflow.fire_transition("start")
          yield self.assertState(self.workflow, "up")
          wait_on_hook = self.wait_on_hook("app-relation-changed")
@@ -710,7 +913,8 @@
          wait_on_hook = self.wait_on_hook("app-relation-broken")
          wait_on_state = self.wait_on_state(self.workflow, "departed")
--        yield self.workflow.fire_transition("depart")
++        with (yield self.workflow.lock()):
++            yield self.workflow.fire_transition("depart")
          yield wait_on_hook
          yield wait_on_state
@@ -756,15 +960,17 @@
          broken hook is executed, and the unit stops responding to relation
          changes.
          """
--        yield self.workflow.fire_transition("start")
--        yield self.assertState(self.workflow, "up")
--        yield self.workflow.fire_transition("stop")
--        yield self.assertState(self.workflow, "down")
++        with (yield self.workflow.lock()):
++            yield self.workflow.fire_transition("start")
++            yield self.assertState(self.workflow, "up")
++            yield self.workflow.fire_transition("stop")
++            yield self.assertState(self.workflow, "down")
          states = yield self.add_opposite_service_unit(self.states)
          wait_on_hook = self.wait_on_hook("app-relation-broken")
          wait_on_state = self.wait_on_state(self.workflow, "departed")
--        yield self.workflow.fire_transition("depart")
++        with (yield self.workflow.lock()):
++            yield self.workflow.fire_transition("depart")
          yield wait_on_hook
          yield wait_on_state
@@ -783,15 +989,17 @@
          self.assertFalse(results)
      def test_depart_error(self):
--        yield self.workflow.fire_transition("start")
--        yield self.assertState(self.workflow, "up")
--        yield self.workflow.fire_transition("error")
--        yield self.assertState(self.workflow, "error")
++        with (yield self.workflow.lock()):
++            yield self.workflow.fire_transition("start")
++            yield self.assertState(self.workflow, "up")
++            yield self.workflow.fire_transition("error")
++            yield self.assertState(self.workflow, "error")
          states = yield self.add_opposite_service_unit(self.states)
          wait_on_hook = self.wait_on_hook("app-relation-broken")
          wait_on_state = self.wait_on_state(self.workflow, "departed")
--        yield self.workflow.fire_transition("depart")
++        with (yield self.workflow.lock()):
++            yield self.workflow.fire_transition("depart")
          yield wait_on_hook
          yield wait_on_state
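Note on the assert_synchronize helpers in the test diff above: both replace WorkflowState.synchronize with a checking wrapper for a single call and put the original back in a finally block, because self.patch would only unpatch at the end of the test. A minimal sketch of that patch-and-restore pattern follows, assuming plain Python without Twisted. The Workflow class and the patched and run_checked_sync helpers are hypothetical stand-ins for illustration and are not part of juju.

```python
import contextlib


class Workflow(object):
    """Hypothetical stand-in for WorkflowState; not the juju class."""

    def synchronize(self):
        return "synced"


@contextlib.contextmanager
def patched(cls, name, replacement):
    """Temporarily replace cls.<name>, restoring it even if the body raises."""
    original = getattr(cls, name)
    setattr(cls, name, replacement)
    try:
        yield original
    finally:
        setattr(cls, name, original)


def run_checked_sync(workflow, seen):
    """Call synchronize() through a wrapper that runs pre-sync checks first."""
    original = Workflow.synchronize

    def check_sync(obj):
        # Pre-sync assertions (lifecycle/executor state) would go here.
        seen.append("checked")
        return original(obj)

    with patched(Workflow, "synchronize", check_sync):
        return workflow.synchronize()
```

Wrapping the swap in a context manager keeps the restore next to the patch, so each iteration of a loop over start states begins with the unpatched method.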
 === modified file 'juju/unit/workflow.py'
 --- juju/unit/workflow.py	2012-01-11 09:37:48 +0000
 +++ juju/unit/workflow.py	2012-02-21 10:23:29 +0000
@@ -8,7 +8,7 @@
  from txzookeeper.utils import retry_change
--from juju.errors import CharmInvocationError, CharmError, FileNotFound
++from juju.errors import CharmError, FileNotFound
  from juju.lib.statemachine import (
      WorkflowState, Workflow, Transition, TransitionError)
@@ -16,18 +16,16 @@
  UnitWorkflow = Workflow(
      # Install transitions
      Transition("install", "Install", None, "installed",
--               error_transition_id="error_install",
--               success_transition_id="start"),
++               error_transition_id="error_install", automatic=True),
      Transition("error_install", "Install error", None, "install_error"),
      Transition("retry_install", "Retry install", "install_error", "installed",
--               alias="retry", success_transition_id="start"),
++               alias="retry"),
      Transition("retry_install_hook", "Retry install with hook",
--               "install_error", "installed", alias="retry_hook",
--               success_transition_id="start"),
++               "install_error", "installed", alias="retry_hook"),
      # Start transitions
      Transition("start", "Start", "installed", "started",
--               error_transition_id="error_start"),
++               error_transition_id="error_start", automatic=True),
      Transition("error_start", "Start error", "installed", "start_error"),
      Transition("retry_start", "Retry start", "start_error", "started",
                 alias="retry"),
@@ -48,33 +46,38 @@
          "upgrade_charm", "Upgrade", "started", "started",
          error_transition_id="upgrade_charm_error"),
      Transition(
--        "upgrade_charm_error", "Upgrade from stop error",
++        "upgrade_charm_error", "Upgrade error",
          "started", "charm_upgrade_error"),
      Transition(
--        "retry_upgrade_charm", "Upgrade from stop error",
--        "charm_upgrade_error", "started", alias="retry"),
--    Transition(
--        "retry_upgrade_charm_hook", "Upgrade from stop error with hook",
--        "charm_upgrade_error", "started", alias="retry_hook"),
++        "retry_upgrade_charm_error", "Upgrade error",
++        "charm_upgrade_error", "charm_upgrade_error"),
++    Transition(
++        "retry_upgrade_charm", "Retry upgrade",
++        "charm_upgrade_error", "started", alias="retry",
++        error_transition_id="retry_upgrade_charm_error"),
++    Transition(
++        "retry_upgrade_charm_hook", "Retry upgrade with hook",
++        "charm_upgrade_error", "started", alias="retry_hook",
++        error_transition_id="retry_upgrade_charm_error"),
      # Configuration Transitions
      Transition(
--        "reconfigure", "Reconfigure", "started", "started",
++        "configure", "Configure", "started", "started",
          error_transition_id="error_configure"),
      Transition(
          "error_configure", "On configure error",
          "started", "configure_error"),
      Transition(
--        "retry_error", "On retry configure error",
++        "error_retry_configure", "On retry configure error",
          "configure_error", "configure_error"),
      Transition(
          "retry_configure", "Retry configure",
          "configure_error", "started", alias="retry",
--        error_transition_id="retry_error"),
++        error_transition_id="error_retry_configure"),
      Transition(
          "retry_configure_hook", "Retry configure with hooks",
          "configure_error", "started", alias="retry_hook",
--        error_transition_id="retry_error")
++        error_transition_id="error_retry_configure")
+     )
@@ -110,7 +113,7 @@
  # relation would continues to schedule pending hooks
  RelationWorkflow = Workflow(
--    Transition("start", "Start", None, "up"),
++    Transition("start", "Start", None, "up", automatic=True),
      Transition("stop", "Stop", "up", "down"),
      Transition("restart", "Restart", "down", "up", alias="retry"),
      Transition("error", "Relation hook error", "up", "error"),
@@ -291,6 +294,7 @@
          with open(self.state_file_path, "r") as handle:
              content = handle.read()
++        # TODO load ZK state and overwrite with disk state if different?
          return yaml.load(content)
@@ -333,28 +337,47 @@
      def _invoke_lifecycle(self, method, *args, **kw):
          try:
              result = yield method(*args, **kw)
--        except (FileNotFound, CharmError, CharmInvocationError), e:
++        except (FileNotFound, CharmError) as e:
              raise TransitionError(e)
          returnValue(result)
      @inlineCallbacks
++    def _get_preconditions(self):
++        """Given StateMachine state, return expected executor/lifecycle state.
++
++        :return: (run_executor, run_lifecycle)
++
++        Once the executor and lifecycle are in the expected state, it should
++        be safe to call StateMachine.synchronize(), and to run other
++        transitions as appropriate.
++        """
++        mid_upgrade = (False, True)
++        started = (True, True)
++        other = (True, False)
++        state = yield self.get_state()
++        if state == "charm_upgrade_error":
++            returnValue(mid_upgrade)
++        if state == "started":
++            if (yield self.get_inflight()) == "upgrade_charm":
++                # We don't want any risk of queued hooks firing while we're in
++                # a potentially-broken mid-upgrade state.
++                returnValue(mid_upgrade)
++            returnValue(started)
++        returnValue(other)
++
++    @inlineCallbacks
      def synchronize(self, executor):
          """Ensure the workflow's lifecycle is in the correct state, given
          current zookeeper state.
          :param executor: the unit agent's shared HookExecutor, which should not
--            run if we come up in (or detect and switch to) the
--            "charm_upgrade_error" state.
++            run if we come up during an incomplete charm upgrade.
          In addition, if the lifecycle has never been started before, the
          necessary state transitions are run.
          """
--        state = yield self.get_state()
--        run_executor, run_lifecycle = True, False
--        if state == "started":
--            run_lifecycle = True
--        elif state == "charm_upgrade_error":
--            run_executor, run_lifecycle = False, True
++        self._assert_locked()
++        run_executor, run_lifecycle = yield self._get_preconditions()
          if run_executor:
              if not executor.running:
@@ -369,13 +392,7 @@
          elif self._lifecycle.running:
              yield self._lifecycle.stop(fire_hooks=False)
--        # At this point, prior state (if any) has been fully restored, and
--        # we can run state transitions as usual; fire the standard startup ones
--        # if they haven't completed yet.
--        if state is None:
--            yield self.fire_transition("install")
--        if state == "installed":
--            yield self.fire_transition("start")
++        yield super(UnitWorkflowState, self).synchronize()
      # Install transitions
      def do_install(self):
@@ -411,6 +428,7 @@
          return self._invoke_lifecycle(self._lifecycle.stop)
      # Upgrade transititions
++
      def do_upgrade_charm(self):
          return self._invoke_lifecycle(self._lifecycle.upgrade_charm)
@@ -425,10 +443,10 @@
      def do_error_configure(self):
          return self._invoke_lifecycle(self._lifecycle.stop, fire_hooks=False)
--    def do_reconfigure(self):
++    def do_configure(self):
          return self._invoke_lifecycle(self._lifecycle.configure)
--    def do_retry_error(self):
++    def do_error_retry_configure(self):
          return self._invoke_lifecycle(self._lifecycle.stop, fire_hooks=False)
      @inlineCallbacks
@@ -465,6 +483,7 @@
          In addition, if the lifecycle has never been started before, the
          necessary state transitions are run.
          """
++        self._assert_locked()
          state = yield self.get_state()
          if state == "up":
              watches, scheduler = True, True
@@ -478,8 +497,7 @@
              yield self._lifecycle.start(
                  start_watches=watches, start_scheduler=scheduler)
--        if state is None:
--            yield self.fire_transition("start")
++        yield super(RelationWorkflowState, self).synchronize()
      @property
      def lifecycle(self):
@@ -500,9 +518,10 @@
          @param: error: The error from hook invocation.
          """
--        yield self.fire_transition("error",
--                                   change_type=relation_change.change_type,
--                                   error_message=str(error))
++        with (yield self.lock()):
++            yield self.fire_transition("error",
++                                       change_type=relation_change.change_type,
++                                       error_message=str(error))
      @inlineCallbacks
      def do_stop(self):
@@ -548,33 +567,23 @@
      @inlineCallbacks
      def do_depart(self):
--        """Transition a relation to the departed state, from the up state.
--        """
--        # Stop related unit watches and change hook execution.
--        yield self._lifecycle.stop()
--        result = yield self._do_depart()
--        returnValue(result)
--
--    def do_down_depart(self):
--        """Transition a relation to the departed state, from the down state.
--        """
--        return self._do_depart()
--
--    @inlineCallbacks
--    def _do_depart(self):
--        """Execute the depart hook.
++        """Transition a relation to the departed state, from any state.
          We ignore hook errors, as we won't logically process any additional
          events for the relation once it doesn't exist. However we do
          note the error in the log.
          """
--        # To avoid the relation-changed hook error handler being used,
--        # set the handler to None, so the exception is raised.
++        # Ensure that no further relation hook executions can occur.
++        yield self._lifecycle.stop()
++
++        # Handle errors ourselves, don't try to transition again
          self._lifecycle.set_hook_error_handler(None)
--
          try:
              yield self._lifecycle.depart()
          except Exception, e:
              self._log.error("Depart hook error, ignoring: %s", str(e))
              returnValue({"change_type": "depart",
                           "error_message": str(e)})
++
++    do_down_depart = do_depart
++    do_error_depart = do_depart
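
The decision table that _get_preconditions introduces in the hunk above can be restated without Twisted as a plain function. This is only an illustrative sketch, not code from the branch: the function name `preconditions` and its `inflight` argument are hypothetical, standing in for the values that get_state() and get_inflight() yield in the patch.

```python
def preconditions(state, inflight=None):
    """Return (run_executor, run_lifecycle) for a unit workflow state.

    Hypothetical stand-alone restatement of the branch's
    _get_preconditions logic; `state` and `inflight` are passed in
    directly instead of being read from ZooKeeper.
    """
    mid_upgrade = (False, True)   # lifecycle runs, hook executor stays stopped
    started = (True, True)        # normal running unit
    other = (True, False)         # executor runs, lifecycle not yet started

    if state == "charm_upgrade_error":
        return mid_upgrade
    if state == "started":
        if inflight == "upgrade_charm":
            # A unit that is "started" but has an upgrade in flight is
            # treated like an upgrade error: no queued hooks may fire.
            return mid_upgrade
        return started
    return other


if __name__ == "__main__":
    # Print the table for a few representative inputs.
    for state, inflight in [
        (None, None),
        ("installed", None),
        ("started", None),
        ("started", "upgrade_charm"),
        ("charm_upgrade_error", None),
    ]:
        print(state, inflight, preconditions(state, inflight))
```

Written this way, the point of the change is easier to see: the old synchronize() only stopped the executor in the "charm_upgrade_error" state, whereas the new code also stops it when the unit comes up "started" with an "upgrade_charm" transition still in flight.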
