Merge lp:~wallyworld/juju-core/provisioner-retry into lp:~go-bot/juju-core/trunk
Status: Merged
Approved by: Ian Booth
Approved revision: no longer in the source branch
Merged at revision: 2486
Proposed branch: lp:~wallyworld/juju-core/provisioner-retry
Merge into: lp:~go-bot/juju-core/trunk
Prerequisite: lp:~wallyworld/juju-core/machineswithtransienterrors-api
Diff against target: 648 lines (+311/-43), 12 files modified
state/api/provisioner/provisioner.go (+13/-0)
state/api/provisioner/provisioner_test.go (+1/-11)
state/apiserver/provisioner/machineerror.go (+80/-0)
state/apiserver/provisioner/provisioner.go (+32/-4)
state/apiserver/provisioner/provisioner_test.go (+43/-2)
state/machine.go (+5/-1)
state/machine_test.go (+10/-2)
state/status.go (+6/-2)
state/unit.go (+1/-1)
worker/provisioner/provisioner.go (+10/-4)
worker/provisioner/provisioner_task.go (+35/-11)
worker/provisioner/provisioner_test.go (+75/-5)
To merge this branch: bzr merge lp:~wallyworld/juju-core/provisioner-retry
Related bugs:
Reviewer: Juju Engineering (Pending)
Review via email: mp+212768@code.launchpad.net |
Commit message
Provisioner retries failed machines
Introduce a new watcher - WatchMachineErrorRetry. This
watcher fires when the provisioner task should attempt
to re-provision any machines with transient errors. It
is done as a separate watcher to avoid changing the
semantics of the existing machine watcher close to
cutting a release, and to more easily allow the
flexibility of doing things like queuing retries
separately with different priorities etc. It's still
early days and the implementation is still evolving
because the watcher currently just triggers every minute.
It's an internal implementation detail so can be changed
if required. Future work needs to make the watcher smarter
and able to interpret errors returned by the providers
in order to know when to retry.
Description of the change
Provisioner retries failed machines
Introduce a new watcher - WatchMachineErrorRetry. This
watcher fires when the provisioner task should attempt
to re-provision any machines with transient errors. It
is done as a separate watcher to avoid changing the
semantics of the existing machine watcher close to
cutting a release, and to more easily allow the
flexibility of doing things like queuing retries
separately with different priorities etc. It's still
early days and the implementation is still evolving
because the watcher currently just triggers every minute.
It's an internal implementation detail so can be changed
if required. Future work needs to make the watcher smarter
and able to interpret errors returned by the providers
in order to know when to retry.
Ian Booth (wallyworld) wrote:
Please take a look.
John A Meinel (jameinel) wrote:
I like where this is going; I wish our code base would let it be
smaller.
I'm a little concerned that the new test is going to be timing
dependent, but otherwise LGTM.
File state/apiserver/provisioner/machineerror.go:
I really wish we could inherit from a common NotifyWatcher
implementation and then just implement the "loop()" in our custom code.
I think we could actually do that with a NotifyWatchLooper or some such
that did all of this work and just took a

type Looper interface {
    Loop(out chan struct{}) error
}
I've done some bits like that in the Workers, where I implemented
NotifyWorker that takes a Handler that triggers when the changes are
actually interesting.
Anyway, not something you have to do here, because our pattern in code
has definitely been to just copy & paste the boilerplate.
File state/apiserver/provisioner/machineerror.go ("triggering every ErrorRetryWaitDelay minutes"):
Your comment here doesn't match the actual ErrorRetryWaitDelay. Can we
just reference the variable instead of an explicit time?
File state/apiserver/provisioner/machineerror.go (case <-time.After(ErrorRetryWaitDelay):):
But all of this boilerplate exists *only* for this line (when this
channel triggers, trigger an output).
File worker/provisioner/provisioner_test.go (s.PatchValue with the 50*time.Millisecond delay):
Can we use testing.ShortWait here instead of another hard-coded time?
File worker/provisioner/provisioner_test.go ("Machine 4 is never provisioned."):
I don't understand why Machine 4 is listed as never provisioned, when
the StartInstance calls claim that it will be provisioned after 2
retries.
Is this sensitive to the timeout times? (we assume we poll at exactly
50ms, and we get 2 polls in the 100ms wait above?)
Anything that requires the timing to get right tends to fail on the bot.
It may be that checkStartInstance properly retries until success, or
something else.
Is there even a reason why ErrorRetryWaitDelay can't be something
small, like 5ms?
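One common way to sidestep the timing dependence John is worried about is to inject the trigger source instead of patching delays. This is a hypothetical sketch, not the branch's approach: the watcher takes a trigger channel, so a test fires a retry pass synchronously rather than tuning ErrorRetryWaitDelay against poll counts.

```go
package main

import "fmt"

// retryWatcher fires a change for each event on an injected trigger
// channel; production code would feed it a real timer, tests feed it
// by hand.
type retryWatcher struct {
	out     chan struct{}
	trigger <-chan struct{}
	dying   chan struct{}
}

func newRetryWatcher(trigger <-chan struct{}) *retryWatcher {
	w := &retryWatcher{
		out:     make(chan struct{}),
		trigger: trigger,
		dying:   make(chan struct{}),
	}
	go func() {
		defer close(w.out)
		for {
			// Wait for a trigger (or shutdown)...
			select {
			case <-w.dying:
				return
			case <-w.trigger:
			}
			// ...then forward exactly one change event.
			select {
			case <-w.dying:
				return
			case w.out <- struct{}{}:
			}
		}
	}()
	return w
}

func (w *retryWatcher) Changes() <-chan struct{} { return w.out }
func (w *retryWatcher) Stop()                    { close(w.dying) }

func main() {
	trigger := make(chan struct{}, 1)
	w := newRetryWatcher(trigger)
	trigger <- struct{}{} // the test decides exactly when a retry fires
	<-w.Changes()
	w.Stop()
	fmt.Println("retry fired deterministically")
}
```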
Ian Booth (wallyworld) wrote:
Please take a look.
File state/apiserver/provisioner/machineerror.go:
On 2014/03/26 07:22:42, jameinel wrote:
> I really wish we could inherit from a common NotifyWatcher
> implementation and then just implement the "loop()" in our custom code.
> I think we could actually do that with a NotifyWatchLooper or some such
> that did all of this work and just took a
> type Looper interface {
>     Loop(out chan struct{}) error
> }
> I've done some bits like that in the Workers, where I implemented
> NotifyWorker that takes a Handler that triggers when the changes are
> actually interesting.
> Anyway, not something you have to do here, because our pattern in code
> has definitely been to just copy & paste the boilerplate.
Yeah, I agree. Tim and I have looked into this before and because Go
doesn't have things like virtual methods or generics it is really hard
to avoid all the boilerplate. I was in a rush for this branch to try
and make the 1.18 deadline so I didn't look into it too much for this
branch.
File state/apiserver/provisioner/machineerror.go ("triggering every ErrorRetryWaitDelay minutes"):
On 2014/03/26 07:22:42, jameinel wrote:
> Your comment here doesn't match the actual ErrorRetryWaitDelay. Can we
> just reference the variable instead of an explicit time?
Doh, will fix.
File state/apiserver/provisioner/machineerror.go (case <-time.After(ErrorRetryWaitDelay):):
On 2014/03/26 07:22:42, jameinel wrote:
> But all of this boilerplate exists *only* for this line (when this
> channel triggers, trigger an output).
For now yes. But this is just the start. There's a lot more business
logic to be added. The time based trigger is just to get something
working.
File worker/provisioner/provisioner_test.go (s.PatchValue with the 50*time.Millisecond delay):
On 2014/03/26 07:22:42, jameinel wrote:
> Can we use testing.ShortWait here instead of another hard-coded time?
Sure. Originally it wasn't ShortWait but it is now so I'll change it.
File worker/provisioner/provisioner_test.go ("Machine 4 is never provisioned."):
On 2014/03/26 07:22:42, jameinel wrote:
> I don't understand why Machine 4 is listed as never provisioned, when the
> StartInstance calls claim that it will be provisioned after 2 retries.
> Is this sensitive to the timeout times? (we assume we poll at exactly 50ms,
> and we get 2 polls in the 100ms wait above?)
Only machine 3 ...
Ian Booth (wallyworld) wrote:
Please take a look.
William Reade (fwereade) wrote:
I would be so much happier if this was implemented as a StringsWatcher,
that sent down the ids-to-be-retried list, and we were able to call
exactly the same processMachines method -- and thus be able to convert
everything in one go to use the bulk apis that already exist.
File state/machine_test.go (`cannot set status "pending"`):
We should also be able to set "pending" when it's currently in "error"
state, right?
Go Bot (go-bot) wrote:
The attempt to merge lp:~wallyworld/juju-core/provisioner-retry into lp:juju-core failed. Below is the output from the failed tests.
[per-package test results truncated: package names lost in extraction]
Preview Diff
1 | === modified file 'state/api/provisioner/provisioner.go' |
2 | --- state/api/provisioner/provisioner.go 2014-03-25 03:10:55 +0000 |
3 | +++ state/api/provisioner/provisioner.go 2014-03-26 11:18:52 +0000 |
4 | @@ -70,6 +70,19 @@ |
5 | return w, nil |
6 | } |
7 | |
8 | +func (st *State) WatchMachineErrorRetry() (watcher.NotifyWatcher, error) { |
9 | + var result params.NotifyWatchResult |
10 | + err := st.call("WatchMachineErrorRetry", nil, &result) |
11 | + if err != nil { |
12 | + return nil, err |
13 | + } |
14 | + if err := result.Error; err != nil { |
15 | + return nil, result.Error |
16 | + } |
17 | + w := watcher.NewNotifyWatcher(st.caller, result) |
18 | + return w, nil |
19 | +} |
20 | + |
21 | // StateAddresses returns the list of addresses used to connect to the state. |
22 | func (st *State) StateAddresses() ([]string, error) { |
23 | var result params.StringsResult |
24 | |
25 | === modified file 'state/api/provisioner/provisioner_test.go' |
26 | --- state/api/provisioner/provisioner_test.go 2014-03-26 05:16:31 +0000 |
27 | +++ state/api/provisioner/provisioner_test.go 2014-03-26 11:18:52 +0000 |
28 | @@ -121,17 +121,7 @@ |
29 | c.Assert(err, gc.IsNil) |
30 | err = machine.SetStatus(params.StatusError, "blah", params.StatusData{"transient": true}) |
31 | c.Assert(err, gc.IsNil) |
32 | - password, err := utils.RandomPassword() |
33 | - c.Assert(err, gc.IsNil) |
34 | - err = machine.SetPassword(password) |
35 | - c.Assert(err, gc.IsNil) |
36 | - err = machine.SetProvisioned("i-manager", "fake_nonce", nil) |
37 | - c.Assert(err, gc.IsNil) |
38 | - st := s.OpenAPIAsMachine(c, machine.Tag(), password, "fake_nonce") |
39 | - c.Assert(s.st, gc.NotNil) |
40 | - p := st.Provisioner() |
41 | - |
42 | - machines, info, err := p.MachinesWithTransientErrors() |
43 | + machines, info, err := s.provisioner.MachinesWithTransientErrors() |
44 | c.Assert(err, gc.IsNil) |
45 | c.Assert(machines, gc.HasLen, 1) |
46 | c.Assert(machines[0].Id(), gc.Equals, "1") |
47 | |
48 | === added file 'state/apiserver/provisioner/machineerror.go' |
49 | --- state/apiserver/provisioner/machineerror.go 1970-01-01 00:00:00 +0000 |
50 | +++ state/apiserver/provisioner/machineerror.go 2014-03-26 11:18:52 +0000 |
51 | @@ -0,0 +1,80 @@ |
52 | +// Copyright 2014 Canonical Ltd. |
53 | +// Licensed under the AGPLv3, see LICENCE file for details. |
54 | + |
55 | +package provisioner |
56 | + |
57 | +import ( |
58 | + "time" |
59 | + |
60 | + "launchpad.net/tomb" |
61 | + |
62 | + "launchpad.net/juju-core/state" |
63 | +) |
64 | + |
65 | +// machineErrorRetry is a notify watcher that fires when it is |
66 | +// appropriate to retry provisioning machines with transient errors. |
67 | +type machineErrorRetry struct { |
68 | + tomb tomb.Tomb |
69 | + out chan struct{} |
70 | +} |
71 | + |
72 | +func newWatchMachineErrorRetry() state.NotifyWatcher { |
73 | + w := &machineErrorRetry{ |
74 | + out: make(chan struct{}), |
75 | + } |
76 | + go func() { |
77 | + defer w.tomb.Done() |
78 | + defer close(w.out) |
79 | + w.tomb.Kill(w.loop()) |
80 | + }() |
81 | + return w |
82 | +} |
83 | + |
84 | +// Stop stops the watcher, and returns any error encountered while running |
85 | +// or shutting down. |
86 | +func (w *machineErrorRetry) Stop() error { |
87 | + w.Kill() |
88 | + return w.Wait() |
89 | +} |
90 | + |
91 | +// Kill kills the watcher without waiting for it to shut down. |
92 | +func (w *machineErrorRetry) Kill() { |
93 | + w.tomb.Kill(nil) |
94 | +} |
95 | + |
96 | +// Wait waits for the watcher to die and returns any |
97 | +// error encountered when it was running. |
98 | +func (w *machineErrorRetry) Wait() error { |
99 | + return w.tomb.Wait() |
100 | +} |
101 | + |
102 | +// Err returns any error encountered while running or shutting down, or |
103 | +// tomb.ErrStillAlive if the watcher is still running. |
104 | +func (w *machineErrorRetry) Err() error { |
105 | + return w.tomb.Err() |
106 | +} |
107 | + |
108 | +// Changes returns the event channel for the machineErrorRetry watcher. |
109 | +func (w *machineErrorRetry) Changes() <-chan struct{} { |
110 | + return w.out |
111 | +} |
112 | + |
113 | +// ErrorRetryWaitDelay is the poll time currently used to trigger the watcher. |
114 | +var ErrorRetryWaitDelay = 1 * time.Minute |
115 | + |
116 | +// The initial implementation of this watcher simply acts as a poller, |
117 | +// triggering every ErrorRetryWaitDelay minutes. |
118 | +func (w *machineErrorRetry) loop() error { |
119 | + out := w.out |
120 | + for { |
121 | + select { |
122 | + case <-w.tomb.Dying(): |
123 | + return tomb.ErrDying |
124 | + case <-time.After(ErrorRetryWaitDelay): |
125 | + out = w.out |
126 | + case out <- struct{}{}: |
127 | + out = nil |
128 | + } |
129 | + } |
130 | + return nil |
131 | +} |
132 | |
133 | === modified file 'state/apiserver/provisioner/provisioner.go' |
134 | --- state/apiserver/provisioner/provisioner.go 2014-03-26 05:16:31 +0000 |
135 | +++ state/apiserver/provisioner/provisioner.go 2014-03-26 11:18:52 +0000 |
136 | @@ -27,10 +27,11 @@ |
137 | *common.EnvironMachinesWatcher |
138 | *common.InstanceIdGetter |
139 | |
140 | - st *state.State |
141 | - resources *common.Resources |
142 | - authorizer common.Authorizer |
143 | - getAuthFunc common.GetAuthFunc |
144 | + st *state.State |
145 | + resources *common.Resources |
146 | + authorizer common.Authorizer |
147 | + getAuthFunc common.GetAuthFunc |
148 | + getCanWatchMachines common.GetAuthFunc |
149 | } |
150 | |
151 | // NewProvisionerAPI creates a new server-side ProvisionerAPI facade. |
152 | @@ -87,6 +88,7 @@ |
153 | resources: resources, |
154 | authorizer: authorizer, |
155 | getAuthFunc: getAuthFunc, |
156 | + getCanWatchMachines: getCanReadSecrets, |
157 | }, nil |
158 | } |
159 | |
160 | @@ -238,6 +240,11 @@ |
161 | if !canAccessFunc(machine.Tag()) { |
162 | continue |
163 | } |
164 | + if _, provisionedErr := machine.InstanceId(); provisionedErr == nil { |
165 | + // Machine may have been provisioned but machiner hasn't set the |
166 | + // status to Started yet. |
167 | + continue |
168 | + } |
169 | result := params.StatusResult{} |
170 | if result.Status, result.Info, result.Data, err = machine.Status(); err != nil { |
171 | continue |
172 | @@ -318,3 +325,24 @@ |
173 | } |
174 | return result, nil |
175 | } |
176 | + |
177 | +// WatchMachineErrorRetry returns a NotifyWatcher that notifies when |
178 | +// the provisioner should retry provisioning machines with transient errors. |
179 | +func (p *ProvisionerAPI) WatchMachineErrorRetry() (params.NotifyWatchResult, error) { |
180 | + result := params.NotifyWatchResult{} |
181 | + canWatch, err := p.getCanWatchMachines() |
182 | + if err != nil { |
183 | + return params.NotifyWatchResult{}, err |
184 | + } |
185 | + if !canWatch("") { |
186 | + return result, common.ErrPerm |
187 | + } |
188 | + watch := newWatchMachineErrorRetry() |
189 | + // Consume any initial event and forward it to the result. |
190 | + if _, ok := <-watch.Changes(); ok { |
191 | + result.NotifyWatcherId = p.resources.Register(watch) |
192 | + } else { |
193 | + return result, watcher.MustErr(watch) |
194 | + } |
195 | + return result, nil |
196 | +} |
197 | |
198 | === modified file 'state/apiserver/provisioner/provisioner_test.go' |
199 | --- state/apiserver/provisioner/provisioner_test.go 2014-03-26 05:16:31 +0000 |
200 | +++ state/apiserver/provisioner/provisioner_test.go 2014-03-26 11:18:52 +0000 |
201 | @@ -57,7 +57,7 @@ |
202 | if withStateServer { |
203 | s.machines = append(s.machines, testing.AddStateServerMachine(c, s.State)) |
204 | } |
205 | - for i := 0; i < 4; i++ { |
206 | + for i := 0; i < 5; i++ { |
207 | machine, err := s.State.AddMachine("quantal", state.JobHostUnits) |
208 | c.Check(err, gc.IsNil) |
209 | s.machines = append(s.machines, machine) |
210 | @@ -120,6 +120,7 @@ |
211 | {Tag: s.machines[1].Tag(), Password: "xxx1-1234567890123457890"}, |
212 | {Tag: s.machines[2].Tag(), Password: "xxx2-1234567890123457890"}, |
213 | {Tag: s.machines[3].Tag(), Password: "xxx3-1234567890123457890"}, |
214 | + {Tag: s.machines[4].Tag(), Password: "xxx4-1234567890123457890"}, |
215 | {Tag: "machine-42", Password: "foo"}, |
216 | {Tag: "unit-foo-0", Password: "zzz"}, |
217 | {Tag: "service-bar", Password: "abc"}, |
218 | @@ -133,6 +134,7 @@ |
219 | {nil}, |
220 | {nil}, |
221 | {nil}, |
222 | + {nil}, |
223 | {apiservertesting.NotFoundError("machine 42")}, |
224 | {apiservertesting.ErrUnauthorized}, |
225 | {apiservertesting.ErrUnauthorized}, |
226 | @@ -357,6 +359,13 @@ |
227 | c.Assert(err, gc.IsNil) |
228 | err = s.machines[3].SetStatus(params.StatusError, "error", nil) |
229 | c.Assert(err, gc.IsNil) |
230 | + // Machine 4 is provisioned but error not reset yet. |
231 | + err = s.machines[4].SetStatus(params.StatusError, "transient error", |
232 | + params.StatusData{"transient": true, "foo": "bar"}) |
233 | + c.Assert(err, gc.IsNil) |
234 | + hwChars := instance.MustParseHardware("arch=i386", "mem=4G") |
235 | + err = s.machines[4].SetProvisioned("i-am", "fake_nonce", &hwChars) |
236 | + c.Assert(err, gc.IsNil) |
237 | |
238 | result, err := s.provisioner.MachinesWithTransientErrors() |
239 | c.Assert(err, gc.IsNil) |
240 | @@ -705,7 +714,7 @@ |
241 | c.Assert(err, gc.IsNil) |
242 | c.Assert(result, gc.DeepEquals, params.StringsWatchResult{ |
243 | StringsWatcherId: "1", |
244 | - Changes: []string{"0", "1", "2", "3"}, |
245 | + Changes: []string{"0", "1", "2", "3", "4"}, |
246 | }) |
247 | |
248 | // Verify the resources were registered and stop them when done. |
249 | @@ -916,3 +925,35 @@ |
250 | Result: s.State.CACert(), |
251 | }) |
252 | } |
253 | + |
254 | +func (s *withoutStateServerSuite) TestWatchMachineErrorRetry(c *gc.C) { |
255 | + s.PatchValue(&provisioner.ErrorRetryWaitDelay, 2*coretesting.ShortWait) |
256 | + c.Assert(s.resources.Count(), gc.Equals, 0) |
257 | + |
258 | + _, err := s.provisioner.WatchMachineErrorRetry() |
259 | + c.Assert(err, gc.IsNil) |
260 | + |
261 | + // Verify the resources were registered and stop them when done. |
262 | + c.Assert(s.resources.Count(), gc.Equals, 1) |
263 | + resource := s.resources.Get("1") |
264 | + defer statetesting.AssertStop(c, resource) |
265 | + |
266 | + // Check that the Watch has consumed the initial event ("returned" |
267 | + // in the Watch call) |
268 | + wc := statetesting.NewNotifyWatcherC(c, s.State, resource.(state.NotifyWatcher)) |
269 | + wc.AssertNoChange() |
270 | + |
271 | + // We should now get a time triggered change. |
272 | + wc.AssertOneChange() |
273 | + |
274 | + // Make sure WatchMachineErrorRetry fails with a machine agent login. |
275 | + anAuthorizer := s.authorizer |
276 | + anAuthorizer.MachineAgent = true |
277 | + anAuthorizer.EnvironManager = false |
278 | + aProvisioner, err := provisioner.NewProvisionerAPI(s.State, s.resources, anAuthorizer) |
279 | + c.Assert(err, gc.IsNil) |
280 | + |
281 | + result, err := aProvisioner.WatchMachineErrorRetry() |
282 | + c.Assert(err, gc.ErrorMatches, "permission denied") |
283 | + c.Assert(result, gc.DeepEquals, params.NotifyWatchResult{}) |
284 | +} |
285 | |
286 | === modified file 'state/machine.go' |
287 | --- state/machine.go 2014-03-25 09:15:00 +0000 |
288 | +++ state/machine.go 2014-03-26 11:18:52 +0000 |
289 | @@ -960,7 +960,11 @@ |
290 | StatusInfo: info, |
291 | StatusData: data, |
292 | } |
293 | - if err := doc.validateSet(); err != nil { |
294 | + // If a machine is not yet provisioned, we allow its status |
295 | + // to be set back to pending (when a retry is to occur). |
296 | + _, err := m.InstanceId() |
297 | + allowPending := IsNotProvisionedError(err) |
298 | + if err := doc.validateSet(allowPending); err != nil { |
299 | return err |
300 | } |
301 | ops := []txn.Op{{ |
302 | |
303 | === modified file 'state/machine_test.go' |
304 | --- state/machine_test.go 2014-03-26 06:28:38 +0000 |
305 | +++ state/machine_test.go 2014-03-26 11:18:52 +0000 |
306 | @@ -1052,8 +1052,6 @@ |
307 | func (s *MachineSuite) TestGetSetStatusWhileAlive(c *gc.C) { |
308 | err := s.machine.SetStatus(params.StatusError, "", nil) |
309 | c.Assert(err, gc.ErrorMatches, `cannot set status "error" without info`) |
310 | - err = s.machine.SetStatus(params.StatusPending, "", nil) |
311 | - c.Assert(err, gc.ErrorMatches, `cannot set status "pending"`) |
312 | err = s.machine.SetStatus(params.StatusDown, "", nil) |
313 | c.Assert(err, gc.ErrorMatches, `cannot set status "down"`) |
314 | err = s.machine.SetStatus(params.Status("vliegkat"), "orville", nil) |
315 | @@ -1086,6 +1084,16 @@ |
316 | }) |
317 | } |
318 | |
319 | +func (s *MachineSuite) TestSetStatusPending(c *gc.C) { |
320 | + err := s.machine.SetStatus(params.StatusPending, "", nil) |
321 | + c.Assert(err, gc.IsNil) |
322 | + // Cannot set status to pending once a machine is provisioned. |
323 | + err = s.machine.SetProvisioned("umbrella/0", "fake_nonce", nil) |
324 | + c.Assert(err, gc.IsNil) |
325 | + err = s.machine.SetStatus(params.StatusPending, "", nil) |
326 | + c.Assert(err, gc.ErrorMatches, `cannot set status "pending"`) |
327 | +} |
328 | + |
329 | func (s *MachineSuite) TestGetSetStatusWhileNotAlive(c *gc.C) { |
330 | // When Dying set/get should work. |
331 | err := s.machine.Destroy() |
332 | |
333 | === modified file 'state/status.go' |
334 | --- state/status.go 2014-03-19 23:07:33 +0000 |
335 | +++ state/status.go 2014-03-26 11:18:52 +0000 |
336 | @@ -26,12 +26,16 @@ |
337 | |
338 | // validateSet returns an error if the statusDoc does not represent a sane |
339 | // SetStatus operation. |
340 | -func (doc statusDoc) validateSet() error { |
341 | +func (doc statusDoc) validateSet(allowPending bool) error { |
342 | if !doc.Status.Valid() { |
343 | return fmt.Errorf("cannot set invalid status %q", doc.Status) |
344 | } |
345 | switch doc.Status { |
346 | - case params.StatusPending, params.StatusDown: |
347 | + case params.StatusPending: |
348 | + if !allowPending { |
349 | + return fmt.Errorf("cannot set status %q", doc.Status) |
350 | + } |
351 | + case params.StatusDown: |
352 | return fmt.Errorf("cannot set status %q", doc.Status) |
353 | case params.StatusError: |
354 | if doc.StatusInfo == "" { |
355 | |
356 | === modified file 'state/unit.go' |
357 | --- state/unit.go 2014-03-19 23:07:33 +0000 |
358 | +++ state/unit.go 2014-03-26 11:18:52 +0000 |
359 | @@ -562,7 +562,7 @@ |
360 | StatusInfo: info, |
361 | StatusData: data, |
362 | } |
363 | - if err := doc.validateSet(); err != nil { |
364 | + if err := doc.validateSet(false); err != nil { |
365 | return err |
366 | } |
367 | ops := []txn.Op{{ |
368 | |
369 | === modified file 'worker/provisioner/provisioner.go' |
370 | --- worker/provisioner/provisioner.go 2014-03-05 19:41:34 +0000 |
371 | +++ worker/provisioner/provisioner.go 2014-03-26 11:18:52 +0000 |
372 | @@ -14,6 +14,7 @@ |
373 | "launchpad.net/juju-core/environs/config" |
374 | "launchpad.net/juju-core/instance" |
375 | apiprovisioner "launchpad.net/juju-core/state/api/provisioner" |
376 | + apiwatcher "launchpad.net/juju-core/state/api/watcher" |
377 | "launchpad.net/juju-core/state/watcher" |
378 | "launchpad.net/juju-core/worker" |
379 | ) |
380 | @@ -28,7 +29,7 @@ |
381 | type Provisioner interface { |
382 | worker.Worker |
383 | Stop() error |
384 | - getWatcher() (Watcher, error) |
385 | + getMachineWatcher() (apiwatcher.StringsWatcher, error) |
386 | } |
387 | |
388 | // environProvisioner represents a running provisioning worker for machine nodes |
389 | @@ -103,7 +104,11 @@ |
390 | } |
391 | // Start responding to changes in machines, and to any further updates |
392 | // to the environment config. |
393 | - machineWatcher, err := p.getWatcher() |
394 | + machineWatcher, err := p.getMachineWatcher() |
395 | + if err != nil { |
396 | + return nil, err |
397 | + } |
398 | + retryWatcher, err := p.st.WatchMachineErrorRetry() |
399 | if err != nil { |
400 | return nil, err |
401 | } |
402 | @@ -112,6 +117,7 @@ |
403 | safeMode, |
404 | p.st, |
405 | machineWatcher, |
406 | + retryWatcher, |
407 | p.broker, |
408 | auth) |
409 | return task, nil |
410 | @@ -183,7 +189,7 @@ |
411 | } |
412 | } |
413 | |
414 | -func (p *environProvisioner) getWatcher() (Watcher, error) { |
415 | +func (p *environProvisioner) getMachineWatcher() (apiwatcher.StringsWatcher, error) { |
416 | return p.st.WatchEnvironMachines() |
417 | } |
418 | |
419 | @@ -250,7 +256,7 @@ |
420 | return p.machine, nil |
421 | } |
422 | |
423 | -func (p *containerProvisioner) getWatcher() (Watcher, error) { |
424 | +func (p *containerProvisioner) getMachineWatcher() (apiwatcher.StringsWatcher, error) { |
425 | machine, err := p.getMachine() |
426 | if err != nil { |
427 | return nil, err |
428 | |
429 | === modified file 'worker/provisioner/provisioner_task.go' |
430 | --- worker/provisioner/provisioner_task.go 2014-03-26 00:24:30 +0000 |
431 | +++ worker/provisioner/provisioner_task.go 2014-03-26 11:18:52 +0000 |
432 | @@ -16,6 +16,7 @@ |
433 | "launchpad.net/juju-core/names" |
434 | "launchpad.net/juju-core/state/api/params" |
435 | apiprovisioner "launchpad.net/juju-core/state/api/provisioner" |
436 | + apiwatcher "launchpad.net/juju-core/state/api/watcher" |
437 | "launchpad.net/juju-core/state/watcher" |
438 | coretools "launchpad.net/juju-core/tools" |
439 | "launchpad.net/juju-core/utils" |
440 | @@ -35,28 +36,25 @@ |
441 | SetSafeMode(safeMode bool) |
442 | } |
443 | |
444 | -type Watcher interface { |
445 | - watcher.Errer |
446 | - watcher.Stopper |
447 | - Changes() <-chan []string |
448 | -} |
449 | - |
450 | type MachineGetter interface { |
451 | Machine(tag string) (*apiprovisioner.Machine, error) |
452 | + MachinesWithTransientErrors() ([]*apiprovisioner.Machine, []params.StatusResult, error) |
453 | } |
454 | |
455 | func NewProvisionerTask( |
456 | machineTag string, |
457 | safeMode bool, |
458 | machineGetter MachineGetter, |
459 | - watcher Watcher, |
460 | + machineWatcher apiwatcher.StringsWatcher, |
461 | + retryWatcher apiwatcher.NotifyWatcher, |
462 | broker environs.InstanceBroker, |
463 | auth environs.AuthenticationProvider, |
464 | ) ProvisionerTask { |
465 | task := &provisionerTask{ |
466 | machineTag: machineTag, |
467 | machineGetter: machineGetter, |
468 | - machineWatcher: watcher, |
469 | + machineWatcher: machineWatcher, |
470 | + retryWatcher: retryWatcher, |
471 | broker: broker, |
472 | auth: auth, |
473 | safeMode: safeMode, |
474 | @@ -73,7 +71,8 @@ |
475 | type provisionerTask struct { |
476 | machineTag string |
477 | machineGetter MachineGetter |
478 | - machineWatcher Watcher |
479 | + machineWatcher apiwatcher.StringsWatcher |
480 | + retryWatcher apiwatcher.NotifyWatcher |
481 | broker environs.InstanceBroker |
482 | tomb tomb.Tomb |
483 | auth environs.AuthenticationProvider |
484 | @@ -132,8 +131,6 @@ |
485 | if !ok { |
486 | return watcher.MustErr(task.machineWatcher) |
487 | } |
488 | - // TODO(dfc; lp:1042717) fire process machines periodically to shut down unknown |
489 | - // instances. |
490 | if err := task.processMachines(ids); err != nil { |
491 | return fmt.Errorf("failed to process updated machines: %v", err) |
492 | } |
493 | @@ -152,6 +149,10 @@ |
494 | return fmt.Errorf("failed to process machines after safe mode disabled: %v", err) |
495 | } |
496 | } |
497 | + case <-task.retryWatcher.Changes(): |
498 | + if err := task.processMachinesWithTransientErrors(); err != nil { |
499 | + return fmt.Errorf("failed to process machines with transient errors: %v", err) |
500 | + } |
501 | } |
502 | } |
503 | } |
504 | @@ -164,6 +165,29 @@ |
505 | } |
506 | } |
507 | |
508 | +func (task *provisionerTask) processMachinesWithTransientErrors() error { |
509 | + machines, statusResults, err := task.machineGetter.MachinesWithTransientErrors() |
510 | + if err != nil { |
511 | + return nil |
512 | + } |
513 | + logger.Tracef("processMachinesWithTransientErrors(%v)", statusResults) |
514 | + var pending []*apiprovisioner.Machine |
515 | + for i, status := range statusResults { |
516 | + if status.Error != nil { |
517 | + logger.Errorf("cannot retry provisioning of machine %q: %v", status.Id, status.Error) |
518 | + continue |
519 | + } |
520 | + machine := machines[i] |
521 | + if err := machine.SetStatus(params.StatusPending, "", nil); err != nil { |
522 | + logger.Errorf("cannot reset status of machine %q: %v", status.Id, err) |
523 | + continue |
524 | + } |
525 | + task.machines[machine.Tag()] = machine |
526 | + pending = append(pending, machine) |
527 | + } |
528 | + return task.startMachines(pending) |
529 | +} |
530 | + |
531 | func (task *provisionerTask) processMachines(ids []string) error { |
532 | logger.Tracef("processMachines(%v)", ids) |
533 | // Populate the tasks maps of current instances and machines. |
534 | |
535 | === modified file 'worker/provisioner/provisioner_test.go' |
536 | --- worker/provisioner/provisioner_test.go 2014-03-17 22:42:51 +0000 |
537 | +++ worker/provisioner/provisioner_test.go 2014-03-26 11:18:52 +0000 |
538 | @@ -14,6 +14,8 @@ |
539 | "launchpad.net/juju-core/constraints" |
540 | "launchpad.net/juju-core/environs" |
541 | "launchpad.net/juju-core/environs/config" |
542 | + "launchpad.net/juju-core/environs/simplestreams" |
543 | + "launchpad.net/juju-core/environs/tools" |
544 | "launchpad.net/juju-core/errors" |
545 | "launchpad.net/juju-core/instance" |
546 | "launchpad.net/juju-core/juju/testing" |
547 | @@ -23,6 +25,7 @@ |
548 | "launchpad.net/juju-core/state/api" |
549 | "launchpad.net/juju-core/state/api/params" |
550 | apiprovisioner "launchpad.net/juju-core/state/api/provisioner" |
551 | + apiserverprovisioner "launchpad.net/juju-core/state/apiserver/provisioner" |
552 | coretesting "launchpad.net/juju-core/testing" |
553 | "launchpad.net/juju-core/utils" |
554 | "launchpad.net/juju-core/utils/set" |
555 | @@ -731,17 +734,21 @@ |
556 | s.waitRemoved(c, m3) |
557 | } |
558 | |
559 | -func (s *ProvisionerSuite) newProvisionerTask(c *gc.C, safeMode bool) provisioner.ProvisionerTask { |
560 | - env := s.APIConn.Environ |
561 | - watcher, err := s.provisioner.WatchEnvironMachines() |
562 | +func (s *ProvisionerSuite) newProvisionerTask(c *gc.C, safeMode bool, |
563 | + broker environs.InstanceBroker) provisioner.ProvisionerTask { |
564 | + |
565 | + machineWatcher, err := s.provisioner.WatchEnvironMachines() |
566 | + c.Assert(err, gc.IsNil) |
567 | + retryWatcher, err := s.provisioner.WatchMachineErrorRetry() |
568 | c.Assert(err, gc.IsNil) |
569 | auth, err := environs.NewAPIAuthenticator(s.provisioner) |
570 | c.Assert(err, gc.IsNil) |
571 | - return provisioner.NewProvisionerTask("machine-0", safeMode, s.provisioner, watcher, env, auth) |
572 | + return provisioner.NewProvisionerTask( |
573 | + "machine-0", safeMode, s.provisioner, machineWatcher, retryWatcher, broker, auth) |
574 | } |
575 | |
576 | func (s *ProvisionerSuite) TestTurningOffSafeModeReapsUnknownInstances(c *gc.C) { |
577 | - task := s.newProvisionerTask(c, true) |
578 | + task := s.newProvisionerTask(c, true, s.APIConn.Environ) |
579 | defer stop(c, task) |
580 | |
581 | // Initially create a machine, and an unknown instance, with safe mode on. |
582 | @@ -761,3 +768,66 @@ |
583 | task.SetSafeMode(false) |
584 | s.checkStopInstances(c, i1) |
585 | } |
586 | + |
587 | +func (s *ProvisionerSuite) TestProvisionerRetriesTransientErrors(c *gc.C) { |
588 | + s.PatchValue(&apiserverprovisioner.ErrorRetryWaitDelay, 5*time.Millisecond) |
589 | + var e environs.Environ = &mockBroker{Environ: s.APIConn.Environ, retryCount: make(map[string]int)} |
590 | + task := s.newProvisionerTask(c, false, e) |
591 | + defer stop(c, task) |
592 | + |
593 | + // Provision some machines, some will be started first time, |
594 | + // another will require retries. |
595 | + m1, err := s.addMachine() |
596 | + c.Assert(err, gc.IsNil) |
597 | + m2, err := s.addMachine() |
598 | + c.Assert(err, gc.IsNil) |
599 | + m3, err := s.addMachine() |
600 | + c.Assert(err, gc.IsNil) |
601 | + m4, err := s.addMachine() |
602 | + c.Assert(err, gc.IsNil) |
603 | + s.checkStartInstance(c, m1) |
604 | + s.checkStartInstance(c, m2) |
605 | + thatsAllFolks := make(chan struct{}) |
606 | + go func() { |
607 | + for { |
608 | + select { |
609 | + case <-thatsAllFolks: |
610 | + return |
611 | + case <-time.After(coretesting.ShortWait): |
612 | + err := m3.SetStatus(params.StatusError, "info", params.StatusData{"transient": true}) |
613 | + c.Assert(err, gc.IsNil) |
614 | + } |
615 | + } |
616 | + }() |
617 | + s.checkStartInstance(c, m3) |
618 | + close(thatsAllFolks) |
619 | + // Machine 4 is never provisioned. |
620 | + status, _, _, err := m4.Status() |
621 | + c.Assert(err, gc.IsNil) |
622 | + c.Assert(status, gc.Equals, params.StatusError) |
623 | + _, err = m4.InstanceId() |
624 | + c.Assert(err, jc.Satisfies, state.IsNotProvisionedError) |
625 | +} |
626 | + |
627 | +type mockBroker struct { |
628 | + environs.Environ |
629 | + retryCount map[string]int |
630 | +} |
631 | + |
632 | +func (b *mockBroker) StartInstance(args environs.StartInstanceParams) (instance.Instance, *instance.HardwareCharacteristics, error) { |
633 | + // All machines except machines 3, 4 are provisioned successfully the first time. |
634 | + // Machines 3 is provisioned after some attempts have been made. |
635 | + // Machine 4 is never provisioned. |
636 | + id := args.MachineConfig.MachineId |
637 | + retries := b.retryCount[id] |
638 | + if (id != "3" && id != "4") || retries > 2 { |
639 | + return b.Environ.StartInstance(args) |
640 | + } else { |
641 | + b.retryCount[id] = retries + 1 |
642 | + } |
643 | + return nil, nil, fmt.Errorf("error: some error") |
644 | +} |
645 | + |
646 | +func (b *mockBroker) GetToolsSources() ([]simplestreams.DataSource, error) { |
647 | + return b.Environ.(tools.SupportsCustomSources).GetToolsSources() |
648 | +} |
Reviewers: mp+212768@code.launchpad.net
Message:
Please take a look.
Description:
Provisioner retries failed machines
Introduce a new watcher - WatchMachineErrorRetry. This
watcher fires when the provisioner task should attempt
to re-provision any machines with transient errors. It
is done as a separate watcher to avoid changing the
semantics of the existing machine watcher close to
cutting a release, and to more easily allow the
flexibility of doing things like queuing retries
separately with different priorities etc. It's still
early days and the implementation is still evolving
because the watcher currently just triggers every minute.
It's an internal implementation detail so can be changed
if required. Future work needs to make the watcher smarter
and able to interpret errors returned by the providers
in order to know when to retry.
https://code.launchpad.net/~wallyworld/juju-core/provisioner-retry/+merge/212768
Requires: https://code.launchpad.net/~wallyworld/juju-core/machineswithtransienterrors-api/+merge/212749
(do not edit description out of merge proposal)
Please review this at https://codereview.appspot.com/80340043/
Affected files (+286, -26 lines):
A [revision details]
M state/api/provisioner/provisioner.go
A state/apiserver/provisioner/machineerror.go
M state/apiserver/provisioner/provisioner.go
M state/apiserver/provisioner/provisioner_test.go
M worker/provisioner/provisioner.go
M worker/provisioner/provisioner_task.go
M worker/provisioner/provisioner_test.go