Merge lp:~thumper/juju-core/provisioner-reget-state-info into lp:~go-bot/juju-core/trunk

Proposed by Tim Penhey
Status: Merged
Approved by: Tim Penhey
Approved revision: no longer in the source branch.
Merged at revision: 1292
Proposed branch: lp:~thumper/juju-core/provisioner-reget-state-info
Merge into: lp:~go-bot/juju-core/trunk
Diff against target: 292 lines (+46/-112)
4 files modified
worker/provisioner/export_test.go (+2/-2)
worker/provisioner/provisioner.go (+11/-15)
worker/provisioner/provisioner_task.go (+33/-22)
worker/provisioner/provisioner_test.go (+0/-73)
To merge this branch: bzr merge lp:~thumper/juju-core/provisioner-reget-state-info
Reviewer          Review Type    Date Requested    Status
Juju Engineering                                   Pending
Review via email: mp+169951@code.launchpad.net

This proposal supersedes a proposal from 2013-06-10.

Commit message

Provisioner gets addresses for each new machine.

The addresses are retrieved each time we attempt to start a machine. This
allows the provisioner to take advantage of any new API or state servers that
are started as part of a block of machines being provisioned at one time.

https://codereview.appspot.com/9824047/

Description of the change

Provisioner gets addresses for each new machine.

The addresses are retrieved each time we attempt to start a machine. This
allows the provisioner to take advantage of any new API or state servers that
are started as part of a block of machines being provisioned at one time.

There is some saving of addresses at this stage to make the existing tests
pass, even though this goes a little against what William has mentioned
before: if we have invalid config, the tests assume that we continue with the
old info, whereas William has suggested that we pause provisioning while the
config is broken.

One behavioural tweak in this branch: if we fail to set up authentication, we
no longer kill the task, but instead log and continue. We will attempt to
start the machine again next time through the loop.

https://codereview.appspot.com/9824047/
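
As an illustration of the approach, a minimal, self-contained Go sketch; the
State type and addresses below are stand-ins, not juju-core code, and the real
change lives in setupAuthentication in the preview diff:

// Sketch only: State stands in for juju's state.State, whose address
// methods return the current state-server addresses.
package main

import "fmt"

type State struct{ addrs []string }

// Addresses returns the state-server addresses as they are right now;
// the result can change between calls as new servers come up.
func (s *State) Addresses() ([]string, error) { return s.addrs, nil }

// startMachine fetches the addresses afresh for every machine it starts,
// instead of caching them once when the provisioner task is created.
func startMachine(st *State, machineId string) error {
	addrs, err := st.Addresses()
	if err != nil {
		return fmt.Errorf("cannot get addresses from state: %v", err)
	}
	fmt.Printf("starting machine %s with state servers %v\n", machineId, addrs)
	return nil
}

func main() {
	st := &State{addrs: []string{"10.0.0.1:37017"}}
	startMachine(st, "1")
	// A new state server comes up mid-batch...
	st.addrs = append(st.addrs, "10.0.0.2:37017")
	// ...and the very next machine picks it up immediately.
	startMachine(st, "2")
}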

Revision history for this message
Tim Penhey (thumper) wrote : Posted in a previous version of this proposal

Reviewers: mp+168562_code.launchpad.net,

Message:
Please take a look.

Description:
Provisioner gets addresses for each new machine.

The addresses are retrieved each time we attempt to start a machine. This
allows the provisioner to take advantage of any new API or state servers that
are started as part of a block of machines being provisioned at one time.

There is some saving of addresses at this stage to make the existing tests
pass, even though this goes a little against what William has mentioned
before: if we have invalid config, the tests assume that we continue with the
old info, whereas William has suggested that we pause provisioning while the
config is broken.

One behavioural tweak in this branch: if we fail to set up authentication, we
no longer kill the task, but instead log and continue. We will attempt to
start the machine again next time through the loop.

https://code.launchpad.net/~thumper/juju-core/provisioner-reget-state-info/+merge/168562

(do not edit description out of merge proposal)

Please review this at https://codereview.appspot.com/9824047/

Affected files:
   A [revision details]
   M worker/provisioner/export_test.go
   M worker/provisioner/provisioner.go
   M worker/provisioner/provisioner_task.go

Revision history for this message
William Reade (fwereade) wrote : Posted in a previous version of this proposal

Strongly -1 on the clever bits here. Just barf on failure and I'll be happy :).

https://codereview.appspot.com/9824047/diff/1/worker/provisioner/provisioner_task.go
File worker/provisioner/provisioner_task.go (right):

https://codereview.appspot.com/9824047/diff/1/worker/provisioner/provisioner_task.go#newcode304
worker/provisioner/provisioner_task.go:304: // Don't return a real error, just try again next time.
This is problematic, because the watcher won't report that machine again until
it changes -- which it won't. For this to work, we need the provisioner to be
restarted regularly... but I'm hoping for a provisioner that can run for weeks
without breaking sweat.

https://codereview.appspot.com/9824047/diff/1/worker/provisioner/provisioner_task.go#newcode360
worker/provisioner/provisioner_task.go:360: // info if config becomes invalid.
Honestly, I think this is sufficient justification to nuke the whole
provisioner; it'll be restarted in a few seconds, and for now I'd rather not
even try to provision a machine (which can be expected to cost our users
actual money) unless I'm as sure as I can be that I'll be able to start a
functioning machine agent...

https://codereview.appspot.com/9824047/
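
To make the watcher concern concrete: a minimal sketch of the loop shape at
issue, using hypothetical stand-in types rather than the code under review.
The watcher delivers a machine id only when that machine changes in state, so
an error that is merely logged and swallowed means the machine is never
retried:

package main

import "fmt"

// startMachine always fails here, standing in for a machine whose
// authentication setup cannot be completed.
func startMachine(id string) error {
	return fmt.Errorf("failed to setup authentication for machine %q", id)
}

// loop receives machine ids from a watcher-style channel. The watcher
// sends an id only when that machine changes in state.
func loop(changes <-chan []string) error {
	for ids := range changes {
		for _, id := range ids {
			if err := startMachine(id); err != nil {
				// Returning the error kills the task; jujud restarts it
				// and every machine is processed again. Logging and
				// continuing instead would drop this machine for good,
				// because no further event arrives until it changes.
				return err
			}
		}
	}
	return nil
}

func main() {
	changes := make(chan []string, 1)
	changes <- []string{"1"}
	close(changes)
	fmt.Println(loop(changes))
}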

Revision history for this message
William Reade (fwereade) wrote : Posted in a previous version of this proposal

Re what we do without a valid environment... hmm. Thinking aloud:

* we certainly start off with an invalid environ config, and only get a valid
one once whoever started the environment has connected for the first time.

* today, on first startup, there will be one machine present; but it'll be
marked as provisioned already, so it won't cause env access directly; and if
we're tolerant of list-instances failures, we'll be able to complete the first
processMachines without difficulty. But I don't think we can necessarily
depend on that forever...

* in theory, it is impossible to set an invalid environ config in state,
except via the special mechanism at bootstrap time.

* in practice, an environ config can most certainly become invalid -- you can
(1) replace the config with *any* valid config, which might ofc be invalid in
this context (changed environ type, for example, will screw us most
comprehensively) and (2) config update is racy anyway, such that two users
making valid changes concurrently can end up with an environment config that
reflects neither of their desires... and I bet there's a way for that to cause
invalidity. (oh, and, on inspection, (3): it looks like values can't be
dependably unset anyway. we probably need that too. grar.)

* BUT the consequences of an invalid environ affect many more things than just
the provisioner.

* AND SO it would be folly to spend time and effort hardening the provisioner
against catastrophic environ failure when we could be fixing the root cause.

* AND THUS you should (1) wait for a valid config at startup before processing
machines and (2) proceed otherwise as though the environment configuration
were always guaranteed consistent and valid from a juju perspective and (3)
use the time thereby saved to fix SetEnvironConfig.

* BUT ALSO remember that a valid *environ config* does not necessarily contain
valid *provider credentials*. So issues can always arise at runtime anyway,
and should, I think, be dealt with as follows (see the sketch after this
comment):

   * StartInstance: record on the machine via SetInstanceError.
   * AllInstances: log and ignore -- I don't think there are negative
consequences (an out-of-date instance list is always possible anyway).
   * StopInstances: log and ignore; they'll be tried again next time we
processMachines. (We should add a periodic processMachines call so that we do
at least handle them in something resembling a timely fashion. Not too hard, I
think?)

https://codereview.appspot.com/9824047/
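
A rough sketch of that runtime policy, with a periodic processMachines pass;
the ticker interval and helper shapes are illustrative, not part of this
branch:

package main

import (
	"log"
	"time"
)

// processMachines reconciles machines with instances. Per the policy
// above: StartInstance failures are recorded on the machine via
// SetInstanceError; AllInstances failures are logged and ignored (an
// out-of-date instance list is always possible anyway); StopInstances
// failures are logged and ignored, to be retried on the next pass.
func processMachines() {
	log.Println("reconciling machines with instances")
}

// loop runs processMachines on watcher events and, additionally, on a
// periodic tick so that ignored StopInstances failures are retried in
// something resembling a timely fashion.
func loop(changes <-chan []string, dying <-chan struct{}) {
	tick := time.NewTicker(5 * time.Minute)
	defer tick.Stop()
	for {
		select {
		case <-changes:
			processMachines()
		case <-tick.C:
			processMachines()
		case <-dying:
			return
		}
	}
}

func main() {
	changes := make(chan []string, 1)
	changes <- []string{"1"}
	dying := make(chan struct{})
	close(dying)
	loop(changes, dying)
}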

Revision history for this message
Tim Penhey (thumper) wrote : Posted in a previous version of this proposal

Please take a look.

https://codereview.appspot.com/9824047/diff/1/worker/provisioner/provisioner_task.go
File worker/provisioner/provisioner_task.go (right):

https://codereview.appspot.com/9824047/diff/1/worker/provisioner/provisioner_task.go#newcode304
worker/provisioner/provisioner_task.go:304: // Don't return a real error, just try again next time.
On 2013/06/11 09:08:21, fwereade wrote:
> This is problematic, because the watcher won't report that machine again
> until it changes -- which it won't. For this to work, we need the
> provisioner to be restarted regularly... but I'm hoping for a provisioner
> that can run for weeks without breaking sweat.

OK, reverted.

https://codereview.appspot.com/9824047/diff/1/worker/provisioner/provisioner_task.go#newcode360
worker/provisioner/provisioner_task.go:360: // info if config becomes invalid.
On 2013/06/11 09:08:21, fwereade wrote:
> Honestly, I think this is sufficient justification to nuke the whole
> provisioner; it'll be restarted in a few seconds, and for now I'd rather
> not even try to provision a machine (which can be expected to cost our
> users actual money) unless I'm as sure as I can be that I'll be able to
> start a functioning machine agent...

OK, nuking in progress.

https://codereview.appspot.com/9824047/

Revision history for this message
William Reade (fwereade) wrote : Posted in a previous version of this proposal
Revision history for this message
Ian Booth (wallyworld) wrote : Posted in a previous version of this proposal

Preview Diff

=== modified file 'worker/provisioner/export_test.go'
--- worker/provisioner/export_test.go	2013-05-02 15:55:42 +0000
+++ worker/provisioner/export_test.go	2013-06-17 21:50:34 +0000
@@ -11,13 +11,13 @@
 // exported so we can manually close the Provisioners underlying
 // state connection.
 func (p *Provisioner) CloseState() error {
-	return p.st.Close()
+	return p.state.Close()
 }
 
 // exported so we can discover all machines visible to the
 // Provisioners state connection.
 func (p *Provisioner) AllMachines() ([]*state.Machine, error) {
-	return p.st.AllMachines()
+	return p.state.AllMachines()
 }
 
 func (o *configObserver) SetObserver(observer chan<- *config.Config) {

=== modified file 'worker/provisioner/provisioner.go'
--- worker/provisioner/provisioner.go	2013-06-11 02:15:31 +0000
+++ worker/provisioner/provisioner.go	2013-06-17 21:50:34 +0000
@@ -19,7 +19,7 @@
 
 // Provisioner represents a running provisioning worker.
 type Provisioner struct {
-	st        *state.State
+	state     *state.State
 	machineId string // Which machine runs the provisioner.
 	environ   environs.Environ
 	tomb      tomb.Tomb
@@ -46,7 +46,7 @@
 // and allocates them to the new machines.
 func NewProvisioner(st *state.State, machineId string) *Provisioner {
 	p := &Provisioner{
-		st:        st,
+		state:     st,
 		machineId: machineId,
 	}
 	go func() {
@@ -57,7 +57,7 @@
 }
 
 func (p *Provisioner) loop() error {
-	environWatcher := p.st.WatchEnvironConfig()
+	environWatcher := p.state.WatchEnvironConfig()
 	defer watcher.Stop(environWatcher, &p.tomb)
 
 	var err error
@@ -66,27 +66,18 @@
 		return err
 	}
 
-	// Get a new StateInfo from the environment: the one used to
-	// launch the agent may refer to localhost, which will be
-	// unhelpful when attempting to run an agent on a new machine.
-	stateInfo, apiInfo, err := p.environ.StateInfo()
-	if err != nil {
-		return err
-	}
-
 	// Start a new worker for the environment provider.
 
 	// Start responding to changes in machines, and to any further updates
 	// to the environment config.
-	machinesWatcher := p.st.WatchEnvironMachines()
+	machinesWatcher := p.state.WatchEnvironMachines()
 	environmentBroker := newEnvironBroker(p.environ)
 	environmentProvisioner := newProvisionerTask(
 		p.machineId,
-		p.st,
+		p.state,
 		machinesWatcher,
 		environmentBroker,
-		stateInfo,
-		apiInfo)
+	)
 	defer watcher.Stop(environmentProvisioner, &p.tomb)
 
 	for {
@@ -103,6 +94,11 @@
 			}
 			if err := p.setConfig(cfg); err != nil {
 				logger.Errorf("loaded invalid environment configuration: %v", err)
+				// We *shouldn't* ever get into a state here where we have an
+				// invalid config. If we do, stop the task and let jujud
+				// restart the provisioner, which will then wait for valid
+				// config.
+				return err
 			}
 		}
 	}

=== modified file 'worker/provisioner/provisioner_task.go'
--- worker/provisioner/provisioner_task.go	2013-06-14 01:02:53 +0000
+++ worker/provisioner/provisioner_task.go	2013-06-17 21:50:34 +0000
@@ -30,25 +30,17 @@
 	Changes() <-chan []string
 }
 
-type MachineGetter interface {
-	Machine(id string) (*state.Machine, error)
-}
-
 func newProvisionerTask(
 	machineId string,
-	machineGetter MachineGetter,
+	st *state.State,
 	watcher Watcher,
 	broker Broker,
-	stateInfo *state.Info,
-	apiInfo *api.Info,
 ) ProvisionerTask {
 	task := &provisionerTask{
 		machineId:      machineId,
-		machineGetter:  machineGetter,
+		state:          st,
 		machineWatcher: watcher,
 		broker:         broker,
-		stateInfo:      stateInfo,
-		apiInfo:        apiInfo,
 		machines:       make(map[string]*state.Machine),
 	}
 	go func() {
@@ -60,12 +52,10 @@
 
 type provisionerTask struct {
 	machineId      string
-	machineGetter  MachineGetter
+	state          *state.State
 	machineWatcher Watcher
 	broker         Broker
 	tomb           tomb.Tomb
-	stateInfo      *state.Info
-	apiInfo        *api.Info
 
 	// instance id -> instance
 	instances map[state.InstanceId]instance.Instance
@@ -174,7 +164,7 @@
 	// change list.
 	// TODO(thumper): update for API server later to get all machines in one go.
 	for _, id := range ids {
-		machine, err := task.machineGetter.Machine(id)
+		machine, err := task.state.Machine(id)
 		switch {
 		case errors.IsNotFoundError(err):
 			logger.Debugf("machine %q not found in state", id)
@@ -306,7 +296,7 @@
 	// state.Info as the PA.
 	stateInfo, apiInfo, err := task.setupAuthentication(machine)
 	if err != nil {
-		logger.Errorf("failed to setup authentication: %v", err)
+		logger.Warningf("failed to setup authentication for machine %q: %v", machine.Id(), err)
 		return err
 	}
 	cons, err := machine.Constraints()
@@ -355,6 +345,22 @@
 }
 
 func (task *provisionerTask) setupAuthentication(machine *state.Machine) (*state.Info, *api.Info, error) {
+	// Grab a new list of state and api addresses each time along with the
+	// cert, as these can change during a processing loop. If for example the
+	// provisioner was starting 100 machines in one go, some of those may be
+	// new api servers. We should take advantage of those ASAP.
+	stateAddresses, err := task.state.Addresses()
+	if err != nil {
+		// This will only return an error if the config becomes invalid. In
+		// those situations, the provisioner bombs out and is restarted.
+		return nil, nil, fmt.Errorf("cannot get addresses from state: %v", err)
+	}
+	apiAddresses, err := task.state.APIAddresses()
+	if err != nil {
+		// Same for the api addresses. We technically shouldn't get here if
+		// we didn't fail before, but best to check.
+		return nil, nil, fmt.Errorf("cannot get api addresses from state: %v", err)
+	}
 	password, err := utils.RandomPassword()
 	if err != nil {
 		return nil, nil, fmt.Errorf("cannot make password for machine %v: %v", machine, err)
@@ -362,11 +368,16 @@
 	if err := machine.SetMongoPassword(password); err != nil {
 		return nil, nil, fmt.Errorf("cannot set password for machine %v: %v", machine, err)
 	}
-	stateInfo := *task.stateInfo
-	stateInfo.Tag = machine.Tag()
-	stateInfo.Password = password
-	apiInfo := *task.apiInfo
-	apiInfo.Tag = machine.Tag()
-	apiInfo.Password = password
-	return &stateInfo, &apiInfo, nil
+	cert := task.state.CACert()
+	return &state.Info{
+		Addrs:    stateAddresses,
+		CACert:   cert,
+		Tag:      machine.Tag(),
+		Password: password,
+	}, &api.Info{
+		Addrs:    apiAddresses,
+		CACert:   cert,
+		Tag:      machine.Tag(),
+		Password: password,
+	}, nil
 }

=== modified file 'worker/provisioner/provisioner_test.go'
--- worker/provisioner/provisioner_test.go	2013-06-14 01:23:48 +0000
+++ worker/provisioner/provisioner_test.go	2013-06-17 21:50:34 +0000
@@ -361,27 +361,6 @@
 	s.checkStartInstance(c, m)
 }
 
-func (s *ProvisionerSuite) TestProvisioningDoesOccurAfterInvalidEnvironmentPublished(c *C) {
-	p := provisioner.NewProvisioner(s.State, "0")
-	defer stop(c, p)
-
-	// place a new machine into the state
-	m, err := s.State.AddMachine(config.DefaultSeries, state.JobHostUnits)
-	c.Assert(err, IsNil)
-
-	s.checkStartInstance(c, m)
-
-	err = s.invalidateEnvironment(c)
-	c.Assert(err, IsNil)
-
-	// create a second machine
-	m, err = s.State.AddMachine(config.DefaultSeries, state.JobHostUnits)
-	c.Assert(err, IsNil)
-
-	// the PA should create it using the old environment
-	s.checkStartInstance(c, m)
-}
-
 func (s *ProvisionerSuite) TestProvisioningDoesNotProvisionTheSameMachineAfterRestart(c *C) {
 	p := provisioner.NewProvisioner(s.State, "0")
 	defer stop(c, p)
@@ -466,55 +445,3 @@
 	c.Assert(err, IsNil)
 	c.Assert(m0.Life(), Equals, state.Dying)
 }
-
-func (s *ProvisionerSuite) TestProvisioningRecoversAfterInvalidEnvironmentPublished(c *C) {
-	p := provisioner.NewProvisioner(s.State, "0")
-	defer stop(c, p)
-
-	// place a new machine into the state
-	m, err := s.State.AddMachine(config.DefaultSeries, state.JobHostUnits)
-	c.Assert(err, IsNil)
-	s.checkStartInstance(c, m)
-
-	err = s.invalidateEnvironment(c)
-	c.Assert(err, IsNil)
-	s.State.StartSync()
-
-	// create a second machine
-	m, err = s.State.AddMachine(config.DefaultSeries, state.JobHostUnits)
-	c.Assert(err, IsNil)
-
-	// the PA should create it using the old environment
-	s.checkStartInstance(c, m)
-
-	err = s.fixEnvironment()
-	c.Assert(err, IsNil)
-
-	// insert our observer
-	cfgObserver := make(chan *config.Config, 1)
-	p.SetObserver(cfgObserver)
-
-	cfg, err := s.State.EnvironConfig()
-	c.Assert(err, IsNil)
-	attrs := cfg.AllAttrs()
-	attrs["secret"] = "beef"
-	cfg, err = config.New(attrs)
-	c.Assert(err, IsNil)
-	err = s.State.SetEnvironConfig(cfg)
-
-	s.State.StartSync()
-
-	// wait for the PA to load the new configuration
-	select {
-	case <-cfgObserver:
-	case <-time.After(200 * time.Millisecond):
-		c.Fatalf("PA did not action config change")
-	}
-
-	// create a third machine
-	m, err = s.State.AddMachine(config.DefaultSeries, state.JobHostUnits)
-	c.Assert(err, IsNil)
-
-	// the PA should create it using the new environment
-	s.checkStartInstanceCustom(c, m, "beef", constraints.Value{})
-}
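
Net effect on construction, as in the provisioner.go hunk above: the task now
derives connection info from state itself, so stateInfo and apiInfo are no
longer threaded through:

	environmentProvisioner := newProvisionerTask(
		p.machineId,
		p.state,
		machinesWatcher,
		environmentBroker,
	)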
