Code review comment for lp:~fwereade/pyjuju/fix-charm-upgrade

Revision history for this message
Kapil Thangavelu (hazmat) wrote :

Excerpts from William Reade's message of Tue Jan 31 09:11:25 UTC 2012:
> Forces in play/further justifications for chosen approach:
>
> 1) The use of a state variable for the critical section was criticised as being a bit surprising/magical, so it seemed better to store states as, well, states.

Indeed, it does, but as gustavo pointed out in irc, its really for lack of a WAL
transition (either disk or zk) that its needed at all, and a better fit to model
it as such than as additional state/transitions. Its main purpose is to
signal the transition that's currently in play to allow for recovery and
re-execution of that transition instead of merely entering the last recorded
state.

> 2) first_attempt was also criticised as too complex, so I dropped the silent-early-failure mode (ie once we know an upgrade is a sensible thing to do, we're locked into it and can't escape).

yeah.. the distinction between first_attempt was ever checked by the tests.

> 3) I was reluctant to move too much logic into the top-level callback for the upgrade flag watch. It is true that putting the check of service-charm-id/unit-charm-id in there would indeed eliminate the first state transition, but it seemed cleanest to just try the transition and allow the state machinery to handle backing out in response to the lifecycle's complaints.

i think the refactoring thats been done to subsume some of the callback
into the transition is good. Capturing the entire operation into the
workflow machinery has some benefits wrt to status reporting/recording failures.

> 4) Splitting upgrade-charm and run-hook into two distinct transitions is definitely a win, because (i) the possible errors in the two states are very different, and it's good to distinguish them; (ii) breaking the process into a pipeline like this eliminated further complexity in the original UL.upgrade_charm method (we handle all state tracking with explicit state machine states

The question is the recovery harmed by rexecuting the whole transition? Ie.
what's the value from an error recovery point of view of the separate states.

> 5) The repeated check for service/unit charm ids in UL.replace_charm is not *necessary*, but it seemed helpful to avoid redownloading and reinstalling exactly the same bits. We could drop it without doing anything any actual *harm*.
>

I don't think feel this is really a big deal, given its simplicity. There's an
open bug out for creating a formula cache that could alleviate the redownload.

> Summary: we agreed on having 2 transitions to handle the process of upgrading; farming the initial can-we-actually-upgrade check out to a third new transition, which fails back out to "started", keeps responsibility for state-tracking in the workflow where, IMO, it should be anyway. I don't consider the additional workflow states to represent serious complexity: the only other state it interacts with is "started".
>

Yeah.. i've back-tracked on this looking over the implementation and given
given gustavo's WAL insight. It feels like the additional states and
transitions are just adding complexity esp while juggling the executor. Its also
doesn't seems like a benefit to a user having to understand the semantic
differences between the various upgrade states. The previous version of the
branch would work well afaics given a removal of the first_attempt logic.

The removal of the started state variable could follow along with a generic
transition WAL.

> Aside: rereading IRC, I may have missed something re: reentrancy concerns. Each of the transition methods on UL is, AFAICT, safe for restart midway through; and the workflow is responsible for ensuring we only run them at sensible times.
>
> Does any of this help at all?

there are three different lifecycle methods all coordinating around the
executor global variable, and those lifecycle methods may be called by
any of seven different upgrade transitions (across the four states). Those
states would all need to appropriately reset the executor for recovery/startup.

Given that combining those lifecycle methods into a single method/entry point
doesn't deter from their functionality at all, It seems like a clear win to me
for simplicity too just have the one lifecycle method, two transitions, and one
state.

« Back to merge proposal