Some questions about the testing infrastructure, but overall LGTM https://codereview.appspot.com/10439046/diff/1/cmd/jujud/agent.go File cmd/jujud/agent.go (left): https://codereview.appspot.com/10439046/diff/1/cmd/jujud/agent.go#oldcode130 cmd/jujud/agent.go:130: RunOnce(st *state.State, entity AgentState) error I do really like the rename of RunOnce to StartTasks. As RunOnce can be interpreted as "only ever run this once", which confuses it a bit with why you are calling it in a loop. https://codereview.appspot.com/10439046/diff/1/cmd/jujud/agent_test.go File cmd/jujud/agent_test.go (right): https://codereview.appspot.com/10439046/diff/1/cmd/jujud/agent_test.go#newcode275 cmd/jujud/agent_test.go:275: conf.MachineNonce = state.BootstrapNonce I thought all machines that aren't machine/0 were meant to have their own Nonce's. Is this just to have *something*? (As in, it could just as easily be a random string, as long as you set it and preserve it?) https://codereview.appspot.com/10439046/diff/1/cmd/jujud/agent_test.go#newcode430 cmd/jujud/agent_test.go:430: s.retryDelay, retryDelay = retryDelay, time.Millisecond I appreciate not having delays in the test suite. I wonder about using a more obvious naming convention for package-global variables that will be mutated in separate files. Certainly when I read this, it looks like a function-local 'retryDelay' variable. I realize there is no ':', and I can do the 'grep' to figure out where 'retryDelay' is defined (somewhere in the cmd/jujud package). Nothing to do with your proposed changes, just something about "where is this variable coming from" that I find hard to read with our current idioms. (In python you at least have to say 'global retryDelay' so you know to go look for it somewhere else. And even then it would still be in the same file.) https://codereview.appspot.com/10439046/diff/1/cmd/jujud/agent_test.go#newcode448 cmd/jujud/agent_test.go:448: } Not specifically related to your patch, but "type Agent interface" has no doc strings for me to understand what it is actually trying to do. And AgentState is defined in 'upgrade.go' while 'Agent' is defined in agent.go. Anyway, the point I wanted to make was to think about Mock and what we are trying to do. The concern is always that a Mock gets out of sync with the real thing, and your tests just always keep passing because Mock doesn't migrate. I'm guessing go's "interface" helps us enough here, in that if Agent changes its interface mockAgent will be updated to match (at least the function signatures). Like in this case, you changed from RunOnce to StartTasks, you do have to update the mock or the compiler will complain. In general, I love to see decoupling of the various bits of Juju and make it easier to test just one layer. Though I generally prefer tested Doubles (they act like the thing they are doubling) rather than ultralightweight Mocks (they record what they were called with). It also seems odd that you are able to mock out running 'heavyweight tasks', but you still have to have a State object that connects to a real mongodb running. It feels like if you are going to mock anything, you would want to remove the State connection first. https://codereview.appspot.com/10439046/diff/1/cmd/jujud/agent_test.go#newcode504 cmd/jujud/agent_test.go:504: stderrors.New("abritrary"), arbitrary https://codereview.appspot.com/10439046/diff/1/cmd/jujud/agent_test.go#newcode508 cmd/jujud/agent_test.go:508: worker.ErrTerminateAgent, If you are testing 'retry until something happens', shouldn't you have more errors after the one you expect to terminate the loop, and then you can check that it stopped at the right step? For example, what happens if you just remove 'worker.ErrTerminateAgent', from what I can see it would trigger a panic() in Wait because of a slice-out-of-bounds error. But I think that only triggers in the goroutine, so you would still end up with the loop stopping. (I guess maybe it dies with a timeout instead of a done signal because RunAgentLoop doesn't recover from panics?) https://codereview.appspot.com/10439046/diff/1/cmd/jujud/agent_test.go#newcode518 cmd/jujud/agent_test.go:518: } Maybe adding a 'c.Check(errTask.waitCount, Equals, 6)' to be sure how many times it waited, and what finally terminated the looping. https://codereview.appspot.com/10439046/diff/1/cmd/jujud/agent_test.go#newcode521 cmd/jujud/agent_test.go:521: func (s *RunAgentLoopSuite) TestRetryUntilUpgradedError(c *C) { Similar structure here. https://codereview.appspot.com/10439046/diff/1/cmd/jujud/agent_test.go#newcode578 cmd/jujud/agent_test.go:578: tomb.Kill(stderrors.New("external")) I'm not sure this test does what you want it to. I dug into the tomb details, and it says "only the first non-nil error is recorded as reason", and you are using tomb.Kill(non-nil) to initiate the stop code. So even though your waitTask.Stop() code *also* signals an error, I'm pretty sure this test is functionally identical to the TestRunUntilStopped. I would consider dropping it unless you can actually observe a difference. https://codereview.appspot.com/10439046/diff/1/cmd/jujud/machine_test.go File cmd/jujud/machine_test.go (right): https://codereview.appspot.com/10439046/diff/1/cmd/jujud/machine_test.go#newcode253 cmd/jujud/machine_test.go:253: // This is horrible. But it makes this functionality somewhat testable... I can't say I quite understand why this is horrible. Care to elaborate? https://codereview.appspot.com/10439046/