Merge lp:~wallyworld/juju-core/provisioner-retry into lp:~go-bot/juju-core/trunk

Proposed by Ian Booth
Status: Merged
Approved by: Ian Booth
Approved revision: no longer in the source branch.
Merged at revision: 2486
Proposed branch: lp:~wallyworld/juju-core/provisioner-retry
Merge into: lp:~go-bot/juju-core/trunk
Prerequisite: lp:~wallyworld/juju-core/machineswithtransienterrors-api
Diff against target: 648 lines (+311/-43)
12 files modified
state/api/provisioner/provisioner.go (+13/-0)
state/api/provisioner/provisioner_test.go (+1/-11)
state/apiserver/provisioner/machineerror.go (+80/-0)
state/apiserver/provisioner/provisioner.go (+32/-4)
state/apiserver/provisioner/provisioner_test.go (+43/-2)
state/machine.go (+5/-1)
state/machine_test.go (+10/-2)
state/status.go (+6/-2)
state/unit.go (+1/-1)
worker/provisioner/provisioner.go (+10/-4)
worker/provisioner/provisioner_task.go (+35/-11)
worker/provisioner/provisioner_test.go (+75/-5)
To merge this branch: bzr merge lp:~wallyworld/juju-core/provisioner-retry
Reviewer Review Type Date Requested Status
Juju Engineering Pending
Review via email: mp+212768@code.launchpad.net

Commit message

Provisioner retries failed machines

Introduce a new watcher - WatchMachineErrorRetry. This
watcher fires when the provisioner task should attempt
to re-provision any machines with transient errors. It
is done as a separate watcher to avoid changing the
semantics of the exisitng machine watcher close to
cutting a release, and to more easily allow the
flexibility of doing things like queuing retries
separately with different priorities etc. It's still
early days and the implemenation is still evolving
because the watcher currently just triggers every minute.
It's an internal implementation detail so can be changed
if required. Future work needs to make the watcher smarter
and able to interpret errors returned by the providers
in order to know when to retry.

https://codereview.appspot.com/80340043/

Description of the change

Provisioner retries failed machines

Introduce a new watcher - WatchMachineErrorRetry. This
watcher fires when the provisioner task should attempt
to re-provision any machines with transient errors. It
is done as a separate watcher to avoid changing the
semantics of the exisitng machine watcher close to
cutting a release, and to more easily allow the
flexibility of doing things like queuing retries
separately with different priorities etc. It's still
early days and the implemenation is still evolving
because the watcher currently just triggers every minute.
It's an internal implementation detail so can be changed
if required. Future work needs to make the watcher smarter
and able to interpret errors returned by the providers
in order to know when to retry.

https://codereview.appspot.com/80340043/

To post a comment you must log in.
Revision history for this message
Ian Booth (wallyworld) wrote :

Reviewers: mp+212768_code.launchpad.net,

Message:
Please take a look.

Description:
Provisioner retries failed machines

Introduce a new watcher - WatchMachineErrorRetry. This
watcher fires when the provisioner task should attempt
to re-provision any machines with transient errors. It
is done as a separate watcher to avoid changing the
semantics of the exisitng machine watcher close to
cutting a release, and to more easily allow the
flexibility of doing things like queuing retries
separately with different priorities etc. It's still
early days and the implemenation is still evolving
because the watcher currently just triggers every minute.
It's an internal implementation detail so can be changed
if required. Future work needs to make the watcher smarter
and able to interpret errors returned by the providers
in order to know when to retry.

https://code.launchpad.net/~wallyworld/juju-core/provisioner-retry/+merge/212768

Requires:
https://code.launchpad.net/~wallyworld/juju-core/machineswithtransienterrors-api/+merge/212749

(do not edit description out of merge proposal)

Please review this at https://codereview.appspot.com/80340043/

Affected files (+286, -26 lines):
   A [revision details]
   M state/api/provisioner/provisioner.go
   A state/apiserver/provisioner/machineerror.go
   M state/apiserver/provisioner/provisioner.go
   M state/apiserver/provisioner/provisioner_test.go
   M worker/provisioner/provisioner.go
   M worker/provisioner/provisioner_task.go
   M worker/provisioner/provisioner_test.go

Revision history for this message
Ian Booth (wallyworld) wrote :
Revision history for this message
John A Meinel (jameinel) wrote :

I like where this is going, I wish our code base would let it be
smaller.

I'm a little concerned that the new test is going to be timing
dependent, but otherwise LGTM.

https://codereview.appspot.com/80340043/diff/20001/state/apiserver/provisioner/machineerror.go
File state/apiserver/provisioner/machineerror.go (right):

https://codereview.appspot.com/80340043/diff/20001/state/apiserver/provisioner/machineerror.go#newcode20
state/apiserver/provisioner/machineerror.go:20:
I really wish we could inherit from a common NotifyWatcher
implementation and then just implement the "loop()" in our custom code.

I think we could actually do that with a NotifyWatchLooper or some such
that did all of this work and just took an

type interface Looper {
   Loop(out chan struct{}) error
}

I've done some bits like that in the Workers, where I implemented
NotifyWorker that takes a Handler that triggers when the changes are
actually interesting.

Anyway, not something you have to do here, because our pattern in code
has definitely been to just copy & paste the boilerplate.

https://codereview.appspot.com/80340043/diff/20001/state/apiserver/provisioner/machineerror.go#newcode66
state/apiserver/provisioner/machineerror.go:66: // triggering every 2
minutes.
Your comment here doesn't match the actual ErrorRetryWaitDelay. Maybe
just reference the variable instead of an explicit time ?

https://codereview.appspot.com/80340043/diff/20001/state/apiserver/provisioner/machineerror.go#newcode73
state/apiserver/provisioner/machineerror.go:73: case
<-time.After(ErrorRetryWaitDelay):
But all of this boiler plate exists *only* for this line. (when this
channel triggers, trigger an output).

https://codereview.appspot.com/80340043/diff/20001/worker/provisioner/provisioner_test.go
File worker/provisioner/provisioner_test.go (right):

https://codereview.appspot.com/80340043/diff/20001/worker/provisioner/provisioner_test.go#newcode773
worker/provisioner/provisioner_test.go:773:
s.PatchValue(&apiserverprovisioner.ErrorRetryWaitDelay,
50*time.Millisecond)
Can we use testing.ShortWait here instead of another hard-coded time?

https://codereview.appspot.com/80340043/diff/20001/worker/provisioner/provisioner_test.go#newcode804
worker/provisioner/provisioner_test.go:804: // Machine 4 is never
provisioned.
I don't understand why Machine 4 is listed as never provisioned, when
the StartInstance calls claim that it will be provisioned after 2
retries.
Is this sensitive to the timeout times? (we assume we poll at exactly
50ms, and we get 2 polls in the 100ms wait above?)

Anything that requires the timing to get right tends to fail on the bot.

It may be that checkStartInstance properly retries until success, or
something else.

Is there even a reason why we need to have ErrorRetryWaitDelay not
something like even 5ms?

https://codereview.appspot.com/80340043/

Revision history for this message
Ian Booth (wallyworld) wrote :
Download full text (3.8 KiB)

Please take a look.

https://codereview.appspot.com/80340043/diff/20001/state/apiserver/provisioner/machineerror.go
File state/apiserver/provisioner/machineerror.go (right):

https://codereview.appspot.com/80340043/diff/20001/state/apiserver/provisioner/machineerror.go#newcode20
state/apiserver/provisioner/machineerror.go:20:
On 2014/03/26 07:22:42, jameinel wrote:
> I really wish we could inherit from a common NotifyWatcher
implementation and
> then just implement the "loop()" in our custom code.

> I think we could actually do that with a NotifyWatchLooper or some
such that did
> all of this work and just took an

> type interface Looper {
> Loop(out chan struct{}) error
> }

> I've done some bits like that in the Workers, where I implemented
NotifyWorker
> that takes a Handler that triggers when the changes are actually
interesting.

> Anyway, not something you have to do here, because our pattern in code
has
> definitely been to just copy & paste the boilerplate.

Yeah, I agree. Tim and I have looked into this before and because Go
doesn't have things like virtual methods or generics it is really hard
to avoid all the boiler plate. I was in a rush for this branch to try
and make the 1.18 deadline so I didn't look into it oo much for this
branch.

https://codereview.appspot.com/80340043/diff/20001/state/apiserver/provisioner/machineerror.go#newcode66
state/apiserver/provisioner/machineerror.go:66: // triggering every 2
minutes.
On 2014/03/26 07:22:42, jameinel wrote:
> Your comment here doesn't match the actual ErrorRetryWaitDelay. Maybe
just
> reference the variable instead of an explicit time ?

Doh, will fix.

https://codereview.appspot.com/80340043/diff/20001/state/apiserver/provisioner/machineerror.go#newcode73
state/apiserver/provisioner/machineerror.go:73: case
<-time.After(ErrorRetryWaitDelay):
On 2014/03/26 07:22:42, jameinel wrote:
> But all of this boiler plate exists *only* for this line. (when this
channel
> triggers, trigger an output).

For now yes. But this is just the start. There's a lot more business
logic to be added. The time based trigger is just to get something
working.

https://codereview.appspot.com/80340043/diff/20001/worker/provisioner/provisioner_test.go
File worker/provisioner/provisioner_test.go (right):

https://codereview.appspot.com/80340043/diff/20001/worker/provisioner/provisioner_test.go#newcode773
worker/provisioner/provisioner_test.go:773:
s.PatchValue(&apiserverprovisioner.ErrorRetryWaitDelay,
50*time.Millisecond)
On 2014/03/26 07:22:42, jameinel wrote:
> Can we use testing.ShortWait here instead of another hard-coded time?

Sure. Originally it wasn't ShortWait but it is now so I'll change it.

https://codereview.appspot.com/80340043/diff/20001/worker/provisioner/provisioner_test.go#newcode804
worker/provisioner/provisioner_test.go:804: // Machine 4 is never
provisioned.
On 2014/03/26 07:22:42, jameinel wrote:
> I don't understand why Machine 4 is listed as never provisioned, when
the
> StartInstance calls claim that it will be provisioned after 2 retries.
> Is this sensitive to the timeout times? (we assume we poll at exactly
50ms, and
> we get 2 polls in the 100ms wait above?)

Only machine 3 ...

Read more...

Revision history for this message
Ian Booth (wallyworld) wrote :
Revision history for this message
William Reade (fwereade) wrote :

I would be so much happier if this was implemented as a StringsWatcher,
that sent down the ids-to-be-retried list, and we were able to call
exactly the same processMachines method -- and thus be able to convert
everything in one go to use the bulk apis that already exist.

https://codereview.appspot.com/80340043/diff/60001/state/machine_test.go
File state/machine_test.go (right):

https://codereview.appspot.com/80340043/diff/60001/state/machine_test.go#newcode1058
state/machine_test.go:1058: c.Assert(err, gc.ErrorMatches, `cannot set
status "pending"`)
We should also be able to set "pending" when it's currently in "error"
state, right?

https://codereview.appspot.com/80340043/

Revision history for this message
Go Bot (go-bot) wrote :
Download full text (54.8 KiB)

The attempt to merge lp:~wallyworld/juju-core/provisioner-retry into lp:juju-core failed. Below is the output from the failed tests.

ok launchpad.net/juju-core 0.012s
ok launchpad.net/juju-core/agent 1.054s
ok launchpad.net/juju-core/agent/mongo 0.574s
ok launchpad.net/juju-core/agent/tools 0.178s
ok launchpad.net/juju-core/bzr 4.875s
ok launchpad.net/juju-core/cert 2.044s
ok launchpad.net/juju-core/charm 0.383s
? launchpad.net/juju-core/charm/hooks [no test files]
? launchpad.net/juju-core/charm/testing [no test files]
ok launchpad.net/juju-core/cloudinit 0.030s
ok launchpad.net/juju-core/cloudinit/sshinit 0.798s
ok launchpad.net/juju-core/cmd 0.179s
ok launchpad.net/juju-core/cmd/charm-admin 0.276s
? launchpad.net/juju-core/cmd/charmd [no test files]
? launchpad.net/juju-core/cmd/charmload [no test files]
ok launchpad.net/juju-core/cmd/juju 209.526s
ok launchpad.net/juju-core/cmd/jujud 64.785s
ok launchpad.net/juju-core/cmd/plugins/juju-metadata 11.406s
? launchpad.net/juju-core/cmd/plugins/juju-restore [no test files]
ok launchpad.net/juju-core/cmd/plugins/local 0.151s
? launchpad.net/juju-core/cmd/plugins/local/juju-local [no test files]
ok launchpad.net/juju-core/constraints 0.024s
ok launchpad.net/juju-core/container 0.043s
ok launchpad.net/juju-core/container/factory 0.045s
ok launchpad.net/juju-core/container/kvm 0.175s
ok launchpad.net/juju-core/container/kvm/mock 0.052s
? launchpad.net/juju-core/container/kvm/testing [no test files]
ok launchpad.net/juju-core/container/lxc 4.331s
? launchpad.net/juju-core/container/lxc/mock [no test files]
? launchpad.net/juju-core/container/lxc/testing [no test files]
? launchpad.net/juju-core/container/testing [no test files]
ok launchpad.net/juju-core/downloader 5.250s
ok launchpad.net/juju-core/environs 2.449s
ok launchpad.net/juju-core/environs/bootstrap 11.134s
ok launchpad.net/juju-core/environs/cloudinit 0.443s
ok launchpad.net/juju-core/environs/config 1.533s
ok launchpad.net/juju-core/environs/configstore 0.034s
ok launchpad.net/juju-core/environs/filestorage 0.030s
ok launchpad.net/juju-core/environs/httpstorage 0.607s
ok launchpad.net/juju-core/environs/imagemetadata 0.422s
? launchpad.net/juju-core/environs/imagemetadata/testing [no test files]
ok launchpad.net/juju-core/environs/instances 0.045s
ok launchpad.net/juju-core/environs/jujutest 0.201s
ok launchpad.net/juju-core/environs/manual 13.449s
ok launchpad.net/juju-core/environs/simplestreams 0.282s
? launchpad.net/juju-core/environs/simplestreams/testing [no test files]
ok launchpad.net/juju-core/environs/sshstorage 0.888s
ok launchpad.net/juju-core/environs/storage 0.960s
ok launchpad.net/juju-core/environs/sync 44.468s
ok launchpad.net/juju-core/environs/testing 0.157s
ok launchpad.net/juju-core/environs/tools 5.056s
? launchpad.net/juju-core/environs/tools/testing [no test files]
ok launchpad.net/juju-core/errors 0.011s
ok launchpad.net/juju-core/instance 0.025s
? launchpad.net/juju-core/instance/testing [no test files]
ok launchpad.net/juju-core/juju 18.263s
ok launchpad.net/juju-core/juju/arch 0.014s...

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'state/api/provisioner/provisioner.go'
2--- state/api/provisioner/provisioner.go 2014-03-25 03:10:55 +0000
3+++ state/api/provisioner/provisioner.go 2014-03-26 11:18:52 +0000
4@@ -70,6 +70,19 @@
5 return w, nil
6 }
7
8+func (st *State) WatchMachineErrorRetry() (watcher.NotifyWatcher, error) {
9+ var result params.NotifyWatchResult
10+ err := st.call("WatchMachineErrorRetry", nil, &result)
11+ if err != nil {
12+ return nil, err
13+ }
14+ if err := result.Error; err != nil {
15+ return nil, result.Error
16+ }
17+ w := watcher.NewNotifyWatcher(st.caller, result)
18+ return w, nil
19+}
20+
21 // StateAddresses returns the list of addresses used to connect to the state.
22 func (st *State) StateAddresses() ([]string, error) {
23 var result params.StringsResult
24
25=== modified file 'state/api/provisioner/provisioner_test.go'
26--- state/api/provisioner/provisioner_test.go 2014-03-26 05:16:31 +0000
27+++ state/api/provisioner/provisioner_test.go 2014-03-26 11:18:52 +0000
28@@ -121,17 +121,7 @@
29 c.Assert(err, gc.IsNil)
30 err = machine.SetStatus(params.StatusError, "blah", params.StatusData{"transient": true})
31 c.Assert(err, gc.IsNil)
32- password, err := utils.RandomPassword()
33- c.Assert(err, gc.IsNil)
34- err = machine.SetPassword(password)
35- c.Assert(err, gc.IsNil)
36- err = machine.SetProvisioned("i-manager", "fake_nonce", nil)
37- c.Assert(err, gc.IsNil)
38- st := s.OpenAPIAsMachine(c, machine.Tag(), password, "fake_nonce")
39- c.Assert(s.st, gc.NotNil)
40- p := st.Provisioner()
41-
42- machines, info, err := p.MachinesWithTransientErrors()
43+ machines, info, err := s.provisioner.MachinesWithTransientErrors()
44 c.Assert(err, gc.IsNil)
45 c.Assert(machines, gc.HasLen, 1)
46 c.Assert(machines[0].Id(), gc.Equals, "1")
47
48=== added file 'state/apiserver/provisioner/machineerror.go'
49--- state/apiserver/provisioner/machineerror.go 1970-01-01 00:00:00 +0000
50+++ state/apiserver/provisioner/machineerror.go 2014-03-26 11:18:52 +0000
51@@ -0,0 +1,80 @@
52+// Copyright 2014 Canonical Ltd.
53+// Licensed under the AGPLv3, see LICENCE file for details.
54+
55+package provisioner
56+
57+import (
58+ "time"
59+
60+ "launchpad.net/tomb"
61+
62+ "launchpad.net/juju-core/state"
63+)
64+
65+// machineErrorRetry is a notify watcher that fires when it is
66+// appropriate to retry provisioning machines with transient errors.
67+type machineErrorRetry struct {
68+ tomb tomb.Tomb
69+ out chan struct{}
70+}
71+
72+func newWatchMachineErrorRetry() state.NotifyWatcher {
73+ w := &machineErrorRetry{
74+ out: make(chan struct{}),
75+ }
76+ go func() {
77+ defer w.tomb.Done()
78+ defer close(w.out)
79+ w.tomb.Kill(w.loop())
80+ }()
81+ return w
82+}
83+
84+// Stop stops the watcher, and returns any error encountered while running
85+// or shutting down.
86+func (w *machineErrorRetry) Stop() error {
87+ w.Kill()
88+ return w.Wait()
89+}
90+
91+// Kill kills the watcher without waiting for it to shut down.
92+func (w *machineErrorRetry) Kill() {
93+ w.tomb.Kill(nil)
94+}
95+
96+// Wait waits for the watcher to die and returns any
97+// error encountered when it was running.
98+func (w *machineErrorRetry) Wait() error {
99+ return w.tomb.Wait()
100+}
101+
102+// Err returns any error encountered while running or shutting down, or
103+// tomb.ErrStillAlive if the watcher is still running.
104+func (w *machineErrorRetry) Err() error {
105+ return w.tomb.Err()
106+}
107+
108+// Changes returns the event channel for the machineErrorRetry watcher.
109+func (w *machineErrorRetry) Changes() <-chan struct{} {
110+ return w.out
111+}
112+
113+// ErrorRetryWaitDelay is the poll time currently used to trigger the watcher.
114+var ErrorRetryWaitDelay = 1 * time.Minute
115+
116+// The initial implementation of this watcher simply acts as a poller,
117+// triggering every ErrorRetryWaitDelay minutes.
118+func (w *machineErrorRetry) loop() error {
119+ out := w.out
120+ for {
121+ select {
122+ case <-w.tomb.Dying():
123+ return tomb.ErrDying
124+ case <-time.After(ErrorRetryWaitDelay):
125+ out = w.out
126+ case out <- struct{}{}:
127+ out = nil
128+ }
129+ }
130+ return nil
131+}
132
133=== modified file 'state/apiserver/provisioner/provisioner.go'
134--- state/apiserver/provisioner/provisioner.go 2014-03-26 05:16:31 +0000
135+++ state/apiserver/provisioner/provisioner.go 2014-03-26 11:18:52 +0000
136@@ -27,10 +27,11 @@
137 *common.EnvironMachinesWatcher
138 *common.InstanceIdGetter
139
140- st *state.State
141- resources *common.Resources
142- authorizer common.Authorizer
143- getAuthFunc common.GetAuthFunc
144+ st *state.State
145+ resources *common.Resources
146+ authorizer common.Authorizer
147+ getAuthFunc common.GetAuthFunc
148+ getCanWatchMachines common.GetAuthFunc
149 }
150
151 // NewProvisionerAPI creates a new server-side ProvisionerAPI facade.
152@@ -87,6 +88,7 @@
153 resources: resources,
154 authorizer: authorizer,
155 getAuthFunc: getAuthFunc,
156+ getCanWatchMachines: getCanReadSecrets,
157 }, nil
158 }
159
160@@ -238,6 +240,11 @@
161 if !canAccessFunc(machine.Tag()) {
162 continue
163 }
164+ if _, provisionedErr := machine.InstanceId(); provisionedErr == nil {
165+ // Machine may have been provisioned but machiner hasn't set the
166+ // status to Started yet.
167+ continue
168+ }
169 result := params.StatusResult{}
170 if result.Status, result.Info, result.Data, err = machine.Status(); err != nil {
171 continue
172@@ -318,3 +325,24 @@
173 }
174 return result, nil
175 }
176+
177+// WatchMachineErrorRetry returns a NotifyWatcher that notifies when
178+// the provisioner should retry provisioning machines with transient errors.
179+func (p *ProvisionerAPI) WatchMachineErrorRetry() (params.NotifyWatchResult, error) {
180+ result := params.NotifyWatchResult{}
181+ canWatch, err := p.getCanWatchMachines()
182+ if err != nil {
183+ return params.NotifyWatchResult{}, err
184+ }
185+ if !canWatch("") {
186+ return result, common.ErrPerm
187+ }
188+ watch := newWatchMachineErrorRetry()
189+ // Consume any initial event and forward it to the result.
190+ if _, ok := <-watch.Changes(); ok {
191+ result.NotifyWatcherId = p.resources.Register(watch)
192+ } else {
193+ return result, watcher.MustErr(watch)
194+ }
195+ return result, nil
196+}
197
198=== modified file 'state/apiserver/provisioner/provisioner_test.go'
199--- state/apiserver/provisioner/provisioner_test.go 2014-03-26 05:16:31 +0000
200+++ state/apiserver/provisioner/provisioner_test.go 2014-03-26 11:18:52 +0000
201@@ -57,7 +57,7 @@
202 if withStateServer {
203 s.machines = append(s.machines, testing.AddStateServerMachine(c, s.State))
204 }
205- for i := 0; i < 4; i++ {
206+ for i := 0; i < 5; i++ {
207 machine, err := s.State.AddMachine("quantal", state.JobHostUnits)
208 c.Check(err, gc.IsNil)
209 s.machines = append(s.machines, machine)
210@@ -120,6 +120,7 @@
211 {Tag: s.machines[1].Tag(), Password: "xxx1-1234567890123457890"},
212 {Tag: s.machines[2].Tag(), Password: "xxx2-1234567890123457890"},
213 {Tag: s.machines[3].Tag(), Password: "xxx3-1234567890123457890"},
214+ {Tag: s.machines[4].Tag(), Password: "xxx4-1234567890123457890"},
215 {Tag: "machine-42", Password: "foo"},
216 {Tag: "unit-foo-0", Password: "zzz"},
217 {Tag: "service-bar", Password: "abc"},
218@@ -133,6 +134,7 @@
219 {nil},
220 {nil},
221 {nil},
222+ {nil},
223 {apiservertesting.NotFoundError("machine 42")},
224 {apiservertesting.ErrUnauthorized},
225 {apiservertesting.ErrUnauthorized},
226@@ -357,6 +359,13 @@
227 c.Assert(err, gc.IsNil)
228 err = s.machines[3].SetStatus(params.StatusError, "error", nil)
229 c.Assert(err, gc.IsNil)
230+ // Machine 4 is provisioned but error not reset yet.
231+ err = s.machines[4].SetStatus(params.StatusError, "transient error",
232+ params.StatusData{"transient": true, "foo": "bar"})
233+ c.Assert(err, gc.IsNil)
234+ hwChars := instance.MustParseHardware("arch=i386", "mem=4G")
235+ err = s.machines[4].SetProvisioned("i-am", "fake_nonce", &hwChars)
236+ c.Assert(err, gc.IsNil)
237
238 result, err := s.provisioner.MachinesWithTransientErrors()
239 c.Assert(err, gc.IsNil)
240@@ -705,7 +714,7 @@
241 c.Assert(err, gc.IsNil)
242 c.Assert(result, gc.DeepEquals, params.StringsWatchResult{
243 StringsWatcherId: "1",
244- Changes: []string{"0", "1", "2", "3"},
245+ Changes: []string{"0", "1", "2", "3", "4"},
246 })
247
248 // Verify the resources were registered and stop them when done.
249@@ -916,3 +925,35 @@
250 Result: s.State.CACert(),
251 })
252 }
253+
254+func (s *withoutStateServerSuite) TestWatchMachineErrorRetry(c *gc.C) {
255+ s.PatchValue(&provisioner.ErrorRetryWaitDelay, 2*coretesting.ShortWait)
256+ c.Assert(s.resources.Count(), gc.Equals, 0)
257+
258+ _, err := s.provisioner.WatchMachineErrorRetry()
259+ c.Assert(err, gc.IsNil)
260+
261+ // Verify the resources were registered and stop them when done.
262+ c.Assert(s.resources.Count(), gc.Equals, 1)
263+ resource := s.resources.Get("1")
264+ defer statetesting.AssertStop(c, resource)
265+
266+ // Check that the Watch has consumed the initial event ("returned"
267+ // in the Watch call)
268+ wc := statetesting.NewNotifyWatcherC(c, s.State, resource.(state.NotifyWatcher))
269+ wc.AssertNoChange()
270+
271+ // We should now get a time triggered change.
272+ wc.AssertOneChange()
273+
274+ // Make sure WatchMachineErrorRetry fails with a machine agent login.
275+ anAuthorizer := s.authorizer
276+ anAuthorizer.MachineAgent = true
277+ anAuthorizer.EnvironManager = false
278+ aProvisioner, err := provisioner.NewProvisionerAPI(s.State, s.resources, anAuthorizer)
279+ c.Assert(err, gc.IsNil)
280+
281+ result, err := aProvisioner.WatchMachineErrorRetry()
282+ c.Assert(err, gc.ErrorMatches, "permission denied")
283+ c.Assert(result, gc.DeepEquals, params.NotifyWatchResult{})
284+}
285
286=== modified file 'state/machine.go'
287--- state/machine.go 2014-03-25 09:15:00 +0000
288+++ state/machine.go 2014-03-26 11:18:52 +0000
289@@ -960,7 +960,11 @@
290 StatusInfo: info,
291 StatusData: data,
292 }
293- if err := doc.validateSet(); err != nil {
294+ // If a machine is not yet provisioned, we allow its status
295+ // to be set back to pending (when a retry is to occur).
296+ _, err := m.InstanceId()
297+ allowPending := IsNotProvisionedError(err)
298+ if err := doc.validateSet(allowPending); err != nil {
299 return err
300 }
301 ops := []txn.Op{{
302
303=== modified file 'state/machine_test.go'
304--- state/machine_test.go 2014-03-26 06:28:38 +0000
305+++ state/machine_test.go 2014-03-26 11:18:52 +0000
306@@ -1052,8 +1052,6 @@
307 func (s *MachineSuite) TestGetSetStatusWhileAlive(c *gc.C) {
308 err := s.machine.SetStatus(params.StatusError, "", nil)
309 c.Assert(err, gc.ErrorMatches, `cannot set status "error" without info`)
310- err = s.machine.SetStatus(params.StatusPending, "", nil)
311- c.Assert(err, gc.ErrorMatches, `cannot set status "pending"`)
312 err = s.machine.SetStatus(params.StatusDown, "", nil)
313 c.Assert(err, gc.ErrorMatches, `cannot set status "down"`)
314 err = s.machine.SetStatus(params.Status("vliegkat"), "orville", nil)
315@@ -1086,6 +1084,16 @@
316 })
317 }
318
319+func (s *MachineSuite) TestSetStatusPending(c *gc.C) {
320+ err := s.machine.SetStatus(params.StatusPending, "", nil)
321+ c.Assert(err, gc.IsNil)
322+ // Cannot set status to pending once a machine is provisioned.
323+ err = s.machine.SetProvisioned("umbrella/0", "fake_nonce", nil)
324+ c.Assert(err, gc.IsNil)
325+ err = s.machine.SetStatus(params.StatusPending, "", nil)
326+ c.Assert(err, gc.ErrorMatches, `cannot set status "pending"`)
327+}
328+
329 func (s *MachineSuite) TestGetSetStatusWhileNotAlive(c *gc.C) {
330 // When Dying set/get should work.
331 err := s.machine.Destroy()
332
333=== modified file 'state/status.go'
334--- state/status.go 2014-03-19 23:07:33 +0000
335+++ state/status.go 2014-03-26 11:18:52 +0000
336@@ -26,12 +26,16 @@
337
338 // validateSet returns an error if the statusDoc does not represent a sane
339 // SetStatus operation.
340-func (doc statusDoc) validateSet() error {
341+func (doc statusDoc) validateSet(allowPending bool) error {
342 if !doc.Status.Valid() {
343 return fmt.Errorf("cannot set invalid status %q", doc.Status)
344 }
345 switch doc.Status {
346- case params.StatusPending, params.StatusDown:
347+ case params.StatusPending:
348+ if !allowPending {
349+ return fmt.Errorf("cannot set status %q", doc.Status)
350+ }
351+ case params.StatusDown:
352 return fmt.Errorf("cannot set status %q", doc.Status)
353 case params.StatusError:
354 if doc.StatusInfo == "" {
355
356=== modified file 'state/unit.go'
357--- state/unit.go 2014-03-19 23:07:33 +0000
358+++ state/unit.go 2014-03-26 11:18:52 +0000
359@@ -562,7 +562,7 @@
360 StatusInfo: info,
361 StatusData: data,
362 }
363- if err := doc.validateSet(); err != nil {
364+ if err := doc.validateSet(false); err != nil {
365 return err
366 }
367 ops := []txn.Op{{
368
369=== modified file 'worker/provisioner/provisioner.go'
370--- worker/provisioner/provisioner.go 2014-03-05 19:41:34 +0000
371+++ worker/provisioner/provisioner.go 2014-03-26 11:18:52 +0000
372@@ -14,6 +14,7 @@
373 "launchpad.net/juju-core/environs/config"
374 "launchpad.net/juju-core/instance"
375 apiprovisioner "launchpad.net/juju-core/state/api/provisioner"
376+ apiwatcher "launchpad.net/juju-core/state/api/watcher"
377 "launchpad.net/juju-core/state/watcher"
378 "launchpad.net/juju-core/worker"
379 )
380@@ -28,7 +29,7 @@
381 type Provisioner interface {
382 worker.Worker
383 Stop() error
384- getWatcher() (Watcher, error)
385+ getMachineWatcher() (apiwatcher.StringsWatcher, error)
386 }
387
388 // environProvisioner represents a running provisioning worker for machine nodes
389@@ -103,7 +104,11 @@
390 }
391 // Start responding to changes in machines, and to any further updates
392 // to the environment config.
393- machineWatcher, err := p.getWatcher()
394+ machineWatcher, err := p.getMachineWatcher()
395+ if err != nil {
396+ return nil, err
397+ }
398+ retryWatcher, err := p.st.WatchMachineErrorRetry()
399 if err != nil {
400 return nil, err
401 }
402@@ -112,6 +117,7 @@
403 safeMode,
404 p.st,
405 machineWatcher,
406+ retryWatcher,
407 p.broker,
408 auth)
409 return task, nil
410@@ -183,7 +189,7 @@
411 }
412 }
413
414-func (p *environProvisioner) getWatcher() (Watcher, error) {
415+func (p *environProvisioner) getMachineWatcher() (apiwatcher.StringsWatcher, error) {
416 return p.st.WatchEnvironMachines()
417 }
418
419@@ -250,7 +256,7 @@
420 return p.machine, nil
421 }
422
423-func (p *containerProvisioner) getWatcher() (Watcher, error) {
424+func (p *containerProvisioner) getMachineWatcher() (apiwatcher.StringsWatcher, error) {
425 machine, err := p.getMachine()
426 if err != nil {
427 return nil, err
428
429=== modified file 'worker/provisioner/provisioner_task.go'
430--- worker/provisioner/provisioner_task.go 2014-03-26 00:24:30 +0000
431+++ worker/provisioner/provisioner_task.go 2014-03-26 11:18:52 +0000
432@@ -16,6 +16,7 @@
433 "launchpad.net/juju-core/names"
434 "launchpad.net/juju-core/state/api/params"
435 apiprovisioner "launchpad.net/juju-core/state/api/provisioner"
436+ apiwatcher "launchpad.net/juju-core/state/api/watcher"
437 "launchpad.net/juju-core/state/watcher"
438 coretools "launchpad.net/juju-core/tools"
439 "launchpad.net/juju-core/utils"
440@@ -35,28 +36,25 @@
441 SetSafeMode(safeMode bool)
442 }
443
444-type Watcher interface {
445- watcher.Errer
446- watcher.Stopper
447- Changes() <-chan []string
448-}
449-
450 type MachineGetter interface {
451 Machine(tag string) (*apiprovisioner.Machine, error)
452+ MachinesWithTransientErrors() ([]*apiprovisioner.Machine, []params.StatusResult, error)
453 }
454
455 func NewProvisionerTask(
456 machineTag string,
457 safeMode bool,
458 machineGetter MachineGetter,
459- watcher Watcher,
460+ machineWatcher apiwatcher.StringsWatcher,
461+ retryWatcher apiwatcher.NotifyWatcher,
462 broker environs.InstanceBroker,
463 auth environs.AuthenticationProvider,
464 ) ProvisionerTask {
465 task := &provisionerTask{
466 machineTag: machineTag,
467 machineGetter: machineGetter,
468- machineWatcher: watcher,
469+ machineWatcher: machineWatcher,
470+ retryWatcher: retryWatcher,
471 broker: broker,
472 auth: auth,
473 safeMode: safeMode,
474@@ -73,7 +71,8 @@
475 type provisionerTask struct {
476 machineTag string
477 machineGetter MachineGetter
478- machineWatcher Watcher
479+ machineWatcher apiwatcher.StringsWatcher
480+ retryWatcher apiwatcher.NotifyWatcher
481 broker environs.InstanceBroker
482 tomb tomb.Tomb
483 auth environs.AuthenticationProvider
484@@ -132,8 +131,6 @@
485 if !ok {
486 return watcher.MustErr(task.machineWatcher)
487 }
488- // TODO(dfc; lp:1042717) fire process machines periodically to shut down unknown
489- // instances.
490 if err := task.processMachines(ids); err != nil {
491 return fmt.Errorf("failed to process updated machines: %v", err)
492 }
493@@ -152,6 +149,10 @@
494 return fmt.Errorf("failed to process machines after safe mode disabled: %v", err)
495 }
496 }
497+ case <-task.retryWatcher.Changes():
498+ if err := task.processMachinesWithTransientErrors(); err != nil {
499+ return fmt.Errorf("failed to process machines with transient errors: %v", err)
500+ }
501 }
502 }
503 }
504@@ -164,6 +165,29 @@
505 }
506 }
507
508+func (task *provisionerTask) processMachinesWithTransientErrors() error {
509+ machines, statusResults, err := task.machineGetter.MachinesWithTransientErrors()
510+ if err != nil {
511+ return nil
512+ }
513+ logger.Tracef("processMachinesWithTransientErrors(%v)", statusResults)
514+ var pending []*apiprovisioner.Machine
515+ for i, status := range statusResults {
516+ if status.Error != nil {
517+ logger.Errorf("cannot retry provisioning of machine %q: %v", status.Id, status.Error)
518+ continue
519+ }
520+ machine := machines[i]
521+ if err := machine.SetStatus(params.StatusPending, "", nil); err != nil {
522+ logger.Errorf("cannot reset status of machine %q: %v", status.Id, err)
523+ continue
524+ }
525+ task.machines[machine.Tag()] = machine
526+ pending = append(pending, machine)
527+ }
528+ return task.startMachines(pending)
529+}
530+
531 func (task *provisionerTask) processMachines(ids []string) error {
532 logger.Tracef("processMachines(%v)", ids)
533 // Populate the tasks maps of current instances and machines.
534
535=== modified file 'worker/provisioner/provisioner_test.go'
536--- worker/provisioner/provisioner_test.go 2014-03-17 22:42:51 +0000
537+++ worker/provisioner/provisioner_test.go 2014-03-26 11:18:52 +0000
538@@ -14,6 +14,8 @@
539 "launchpad.net/juju-core/constraints"
540 "launchpad.net/juju-core/environs"
541 "launchpad.net/juju-core/environs/config"
542+ "launchpad.net/juju-core/environs/simplestreams"
543+ "launchpad.net/juju-core/environs/tools"
544 "launchpad.net/juju-core/errors"
545 "launchpad.net/juju-core/instance"
546 "launchpad.net/juju-core/juju/testing"
547@@ -23,6 +25,7 @@
548 "launchpad.net/juju-core/state/api"
549 "launchpad.net/juju-core/state/api/params"
550 apiprovisioner "launchpad.net/juju-core/state/api/provisioner"
551+ apiserverprovisioner "launchpad.net/juju-core/state/apiserver/provisioner"
552 coretesting "launchpad.net/juju-core/testing"
553 "launchpad.net/juju-core/utils"
554 "launchpad.net/juju-core/utils/set"
555@@ -731,17 +734,21 @@
556 s.waitRemoved(c, m3)
557 }
558
559-func (s *ProvisionerSuite) newProvisionerTask(c *gc.C, safeMode bool) provisioner.ProvisionerTask {
560- env := s.APIConn.Environ
561- watcher, err := s.provisioner.WatchEnvironMachines()
562+func (s *ProvisionerSuite) newProvisionerTask(c *gc.C, safeMode bool,
563+ broker environs.InstanceBroker) provisioner.ProvisionerTask {
564+
565+ machineWatcher, err := s.provisioner.WatchEnvironMachines()
566+ c.Assert(err, gc.IsNil)
567+ retryWatcher, err := s.provisioner.WatchMachineErrorRetry()
568 c.Assert(err, gc.IsNil)
569 auth, err := environs.NewAPIAuthenticator(s.provisioner)
570 c.Assert(err, gc.IsNil)
571- return provisioner.NewProvisionerTask("machine-0", safeMode, s.provisioner, watcher, env, auth)
572+ return provisioner.NewProvisionerTask(
573+ "machine-0", safeMode, s.provisioner, machineWatcher, retryWatcher, broker, auth)
574 }
575
576 func (s *ProvisionerSuite) TestTurningOffSafeModeReapsUnknownInstances(c *gc.C) {
577- task := s.newProvisionerTask(c, true)
578+ task := s.newProvisionerTask(c, true, s.APIConn.Environ)
579 defer stop(c, task)
580
581 // Initially create a machine, and an unknown instance, with safe mode on.
582@@ -761,3 +768,66 @@
583 task.SetSafeMode(false)
584 s.checkStopInstances(c, i1)
585 }
586+
587+func (s *ProvisionerSuite) TestProvisionerRetriesTransientErrors(c *gc.C) {
588+ s.PatchValue(&apiserverprovisioner.ErrorRetryWaitDelay, 5*time.Millisecond)
589+ var e environs.Environ = &mockBroker{Environ: s.APIConn.Environ, retryCount: make(map[string]int)}
590+ task := s.newProvisionerTask(c, false, e)
591+ defer stop(c, task)
592+
593+ // Provision some machines, some will be started first time,
594+ // another will require retries.
595+ m1, err := s.addMachine()
596+ c.Assert(err, gc.IsNil)
597+ m2, err := s.addMachine()
598+ c.Assert(err, gc.IsNil)
599+ m3, err := s.addMachine()
600+ c.Assert(err, gc.IsNil)
601+ m4, err := s.addMachine()
602+ c.Assert(err, gc.IsNil)
603+ s.checkStartInstance(c, m1)
604+ s.checkStartInstance(c, m2)
605+ thatsAllFolks := make(chan struct{})
606+ go func() {
607+ for {
608+ select {
609+ case <-thatsAllFolks:
610+ return
611+ case <-time.After(coretesting.ShortWait):
612+ err := m3.SetStatus(params.StatusError, "info", params.StatusData{"transient": true})
613+ c.Assert(err, gc.IsNil)
614+ }
615+ }
616+ }()
617+ s.checkStartInstance(c, m3)
618+ close(thatsAllFolks)
619+ // Machine 4 is never provisioned.
620+ status, _, _, err := m4.Status()
621+ c.Assert(err, gc.IsNil)
622+ c.Assert(status, gc.Equals, params.StatusError)
623+ _, err = m4.InstanceId()
624+ c.Assert(err, jc.Satisfies, state.IsNotProvisionedError)
625+}
626+
627+type mockBroker struct {
628+ environs.Environ
629+ retryCount map[string]int
630+}
631+
632+func (b *mockBroker) StartInstance(args environs.StartInstanceParams) (instance.Instance, *instance.HardwareCharacteristics, error) {
633+ // All machines except machines 3, 4 are provisioned successfully the first time.
634+ // Machines 3 is provisioned after some attempts have been made.
635+ // Machine 4 is never provisioned.
636+ id := args.MachineConfig.MachineId
637+ retries := b.retryCount[id]
638+ if (id != "3" && id != "4") || retries > 2 {
639+ return b.Environ.StartInstance(args)
640+ } else {
641+ b.retryCount[id] = retries + 1
642+ }
643+ return nil, nil, fmt.Errorf("error: some error")
644+}
645+
646+func (b *mockBroker) GetToolsSources() ([]simplestreams.DataSource, error) {
647+ return b.Environ.(tools.SupportsCustomSources).GetToolsSources()
648+}

Subscribers

People subscribed via source and target branches

to status/vote changes: