Some questions about the testing infrastructure, but overall
LGTM


https://codereview.appspot.com/10439046/diff/1/cmd/jujud/agent.go
File cmd/jujud/agent.go (left):

https://codereview.appspot.com/10439046/diff/1/cmd/jujud/agent.go#oldcode130
cmd/jujud/agent.go:130: RunOnce(st *state.State, entity AgentState)
error
I do really like the rename of RunOnce to StartTasks.
As RunOnce can be interpreted as "only ever run this once", which
confuses it a bit with why you are calling it in a loop.

https://codereview.appspot.com/10439046/diff/1/cmd/jujud/agent_test.go
File cmd/jujud/agent_test.go (right):

https://codereview.appspot.com/10439046/diff/1/cmd/jujud/agent_test.go#newcode275
cmd/jujud/agent_test.go:275: conf.MachineNonce = state.BootstrapNonce
I thought all machines that aren't machine/0 were meant to have their
own Nonce's. Is this just to have *something*? (As in, it could just as
easily be a random string, as long as you set it and preserve it?)

https://codereview.appspot.com/10439046/diff/1/cmd/jujud/agent_test.go#newcode430
cmd/jujud/agent_test.go:430: s.retryDelay, retryDelay = retryDelay,
time.Millisecond
I appreciate not having delays in the test suite.
I wonder about using a more obvious naming convention for package-global
variables that will be mutated in separate files.

Certainly when I read this, it looks like a function-local 'retryDelay'
variable. I realize there is no ':', and I can do the 'grep' to figure
out where 'retryDelay' is defined (somewhere in the cmd/jujud package).

Nothing to do with your proposed changes, just something about "where is
this variable coming from" that I find hard to read with our current
idioms. (In python you at least have to say 'global retryDelay' so you
know to go look for it somewhere else. And even then it would still be
in the same file.)

https://codereview.appspot.com/10439046/diff/1/cmd/jujud/agent_test.go#newcode448
cmd/jujud/agent_test.go:448: }
Not specifically related to your patch, but "type Agent interface" has
no doc strings for me to understand what it is actually trying to do.
And AgentState is defined in 'upgrade.go' while 'Agent' is defined in
agent.go.

Anyway, the point I wanted to make was to think about Mock and what we
are trying to do. The concern is always that a Mock gets out of sync
with the real thing, and your tests just always keep passing because
Mock doesn't migrate.

I'm guessing go's "interface" helps us enough here, in that if Agent
changes its interface mockAgent will be updated to match (at least the
function signatures). Like in this case, you changed from RunOnce to
StartTasks, you do have to update the mock or the compiler will
complain.


In general, I love to see decoupling of the various bits of Juju and
make it easier to test just one layer. Though I generally prefer tested
Doubles (they act like the thing they are doubling) rather than
ultralightweight Mocks (they record what they were called with).

It also seems odd that you are able to mock out running 'heavyweight
tasks', but you still have to have a State object that connects to a
real mongodb running. It feels like if you are going to mock anything,
you would want to remove the State connection first.

https://codereview.appspot.com/10439046/diff/1/cmd/jujud/agent_test.go#newcode504
cmd/jujud/agent_test.go:504: stderrors.New("abritrary"),
arbitrary

https://codereview.appspot.com/10439046/diff/1/cmd/jujud/agent_test.go#newcode508
cmd/jujud/agent_test.go:508: worker.ErrTerminateAgent,
If you are testing 'retry until something happens', shouldn't you have
more errors after the one you expect to terminate the loop, and then you
can check that it stopped at the right step?

For example, what happens if you just remove 'worker.ErrTerminateAgent',
from what I can see it would trigger a panic() in Wait because of a
slice-out-of-bounds error. But I think that only triggers in the
goroutine, so you would still end up with the loop stopping. (I guess
maybe it dies with a timeout instead of a done signal because
RunAgentLoop doesn't recover from panics?)

https://codereview.appspot.com/10439046/diff/1/cmd/jujud/agent_test.go#newcode518
cmd/jujud/agent_test.go:518: }
Maybe adding a 'c.Check(errTask.waitCount, Equals, 6)' to be sure how
many times it waited, and what finally terminated the looping.

https://codereview.appspot.com/10439046/diff/1/cmd/jujud/agent_test.go#newcode521
cmd/jujud/agent_test.go:521: func (s *RunAgentLoopSuite)
TestRetryUntilUpgradedError(c *C) {
Similar structure here.

https://codereview.appspot.com/10439046/diff/1/cmd/jujud/agent_test.go#newcode578
cmd/jujud/agent_test.go:578: tomb.Kill(stderrors.New("external"))
I'm not sure this test does what you want it to.

I dug into the tomb details, and it says "only the first non-nil error
is recorded as reason", and you are using tomb.Kill(non-nil) to initiate
the stop code.
So even though your waitTask.Stop() code *also* signals an error, I'm
pretty sure this test is functionally identical to the
TestRunUntilStopped.

I would consider dropping it unless you can actually observe a
difference.

https://codereview.appspot.com/10439046/diff/1/cmd/jujud/machine_test.go
File cmd/jujud/machine_test.go (right):

https://codereview.appspot.com/10439046/diff/1/cmd/jujud/machine_test.go#newcode253
cmd/jujud/machine_test.go:253: // This is horrible. But it makes this
functionality somewhat testable...
I can't say I quite understand why this is horrible. Care to elaborate?

https://codereview.appspot.com/10439046/