Merge lp:~dave-cheney/juju-core/go-cmd-jujud-provisioner-unknown-machines into lp:~juju/juju-core/trunk

Proposed by Dave Cheney
Status: Merged
Merge reported by: Gustavo Niemeyer
Merged at revision: not available
Proposed branch: lp:~dave-cheney/juju-core/go-cmd-jujud-provisioner-unknown-machines
Merge into: lp:~juju/juju-core/trunk
Diff against target: 222 lines (+132/-15)
2 files modified
cmd/jujud/provisioning.go (+56/-15)
cmd/jujud/provisioning_test.go (+76/-0)
To merge this branch: bzr merge lp:~dave-cheney/juju-core/go-cmd-jujud-provisioner-unknown-machines
Reviewer Review Type Date Requested Status
The Go Language Gophers Pending
Review via email: mp+110777@code.launchpad.net

Description of the change

cmd/jujud: provisioner shuts down unknown instance

This proposal adds the facility for the PA to detect lost or
unknown instances running in the environment that are not
associated with a machine in the state and shut them down.

https://codereview.appspot.com/6315043/

To post a comment you must log in.
Revision history for this message
Gustavo Niemeyer (niemeyer) wrote :

LGTM, with a couple of minors.

https://codereview.appspot.com/6315043/diff/6001/cmd/jujud/provisioning.go
File cmd/jujud/provisioning.go (right):

https://codereview.appspot.com/6315043/diff/6001/cmd/jujud/provisioning.go#newcode110
cmd/jujud/provisioning.go:110: p.tomb.Kill(err)
return?

https://codereview.appspot.com/6315043/diff/6001/cmd/jujud/provisioning.go#newcode197
cmd/jujud/provisioning.go:197: machines, err := p.st.AllMachines()
These All* methods aren't a great API. We can do that for now, but we'll
need something better soon. Think about 100k machines running. This is
fetching 200k items from the DB regularly, without much benefit in most
cases.

Can you please add a note here and file a bug in Launchpad against
juju-core?

https://codereview.appspot.com/6315043/diff/6001/cmd/jujud/provisioning.go#newcode206
cmd/jujud/provisioning.go:206: if id == "" {
Please add the same TODO we have elsewhere. InstanceId should really not
be returning an empty id.

https://codereview.appspot.com/6315043/diff/6001/cmd/jujud/provisioning_test.go
File cmd/jujud/provisioning_test.go (right):

https://codereview.appspot.com/6315043/diff/6001/cmd/jujud/provisioning_test.go#newcode392
cmd/jujud/provisioning_test.go:392: // where the final machine has been
removed from the state which the PA was
"removed from the state which the PA was down"?

https://codereview.appspot.com/6315043/

Revision history for this message
Gustavo Niemeyer (niemeyer) wrote :

LGTM, with a couple of trivials.

https://codereview.appspot.com/6315043/diff/6001/cmd/jujud/provisioning.go
File cmd/jujud/provisioning.go (right):

https://codereview.appspot.com/6315043/diff/6001/cmd/jujud/provisioning.go#newcode206
cmd/jujud/provisioning.go:206: if id == "" {
On 2012/06/19 05:19:17, dfc wrote:
> On 2012/06/19 01:53:28, niemeyer wrote:
> > Please add the same TODO we have elsewhere. InstanceId should really
not be
> > returning an empty id.

> Done.

Not done?

https://codereview.appspot.com/6315043/diff/10001/cmd/jujud/provisioning.go
File cmd/jujud/provisioning.go (right):

https://codereview.appspot.com/6315043/diff/10001/cmd/jujud/provisioning.go#newcode104
cmd/jujud/provisioning.go:104: p.tomb.Kill(err)
return

https://codereview.appspot.com/6315043/

Revision history for this message
Gustavo Niemeyer (niemeyer) wrote :

On 2012/06/20 12:25:53, dfc wrote:
> Please take a look.

The fix looks good, Dave, but this will need a different branch and a
different CL. This one was already submitted.

Feel free to propose+submit it directly with the specific fix.

https://codereview.appspot.com/6315043/

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'cmd/jujud/provisioning.go'
2--- cmd/jujud/provisioning.go 2012-06-18 22:55:56 +0000
3+++ cmd/jujud/provisioning.go 2012-06-19 05:20:25 +0000
4@@ -98,6 +98,12 @@
5 continue
6 }
7 log.Printf("provisioner loaded new environment configuration")
8+
9+ // call processMachines to stop any unknown instances before watching machines.
10+ if err := p.processMachines(nil, nil); err != nil {
11+ p.tomb.Kill(err)
12+ }
13+
14 p.innerLoop()
15 }
16 }
17@@ -132,7 +138,9 @@
18 }
19 return
20 }
21- if err := p.processMachines(machines); err != nil {
22+ // TODO(dfc) fire process machines periodically to shut down unknown
23+ // instances.
24+ if err := p.processMachines(machines.Added, machines.Deleted); err != nil {
25 p.tomb.Kill(err)
26 }
27 }
28@@ -151,29 +159,62 @@
29 return p.tomb.Wait()
30 }
31
32-func (p *Provisioner) processMachines(changes *state.MachinesChange) error {
33+func (p *Provisioner) processMachines(added, removed []*state.Machine) error {
34 // step 1. find which of the added machines have not
35 // yet been allocated a started instance.
36- notstarted, err := p.findNotStarted(changes.Added)
37+ notstarted, err := p.findNotStarted(added)
38 if err != nil {
39 return err
40 }
41
42 // step 2. start an instance for any machines we found.
43- if _, err := p.startMachines(notstarted); err != nil {
44+ if err := p.startMachines(notstarted); err != nil {
45 return err
46 }
47
48 // step 3. stop all machines that were removed from the state.
49- stopping, err := p.instancesForMachines(changes.Deleted)
50+ stopping, err := p.instancesForMachines(removed)
51 if err != nil {
52 return err
53 }
54
55- // TODO(dfc) obtain a list of started instances from the Environ and compare that
56- // with the known instances stored in the machine.InstanceId() config.
57-
58- return p.stopInstances(stopping)
59+ // step 4. find instances which are running but have no machine
60+ // associated with them.
61+ unknown, err := p.findUnknownInstances()
62+
63+ return p.stopInstances(append(stopping, unknown...))
64+}
65+
66+// findUnknownInstances finds instances which are not associated with a machine.
67+func (p *Provisioner) findUnknownInstances() ([]environs.Instance, error) {
68+ all, err := p.environ.AllInstances()
69+ if err != nil {
70+ return nil, err
71+ }
72+ instances := make(map[string]environs.Instance)
73+ for _, i := range all {
74+ instances[i.Id()] = i
75+ }
76+ // TODO(dfc) this is very inefficient, p.machines cache may help.
77+ machines, err := p.st.AllMachines()
78+ if err != nil {
79+ return nil, err
80+ }
81+ for _, m := range machines {
82+ id, err := m.InstanceId()
83+ if err != nil {
84+ return nil, err
85+ }
86+ if id == "" {
87+ continue
88+ }
89+ delete(instances, id)
90+ }
91+ var unknown []environs.Instance
92+ for _, i := range instances {
93+ unknown = append(unknown, i)
94+ }
95+ return unknown, nil
96 }
97
98 // findNotStarted finds machines without an InstanceId set, these are defined as not started.
99@@ -193,15 +234,13 @@
100 return notstarted, nil
101 }
102
103-func (p *Provisioner) startMachines(machines []*state.Machine) ([]*state.Machine, error) {
104- var started []*state.Machine
105+func (p *Provisioner) startMachines(machines []*state.Machine) error {
106 for _, m := range machines {
107 if err := p.startMachine(m); err != nil {
108- return nil, err
109+ return err
110 }
111- started = append(started, m)
112 }
113- return started, nil
114+ return nil
115 }
116
117 func (p *Provisioner) startMachine(m *state.Machine) error {
118@@ -228,7 +267,7 @@
119 }
120
121 func (p *Provisioner) stopInstances(instances []environs.Instance) error {
122- // although calling StopInstance with an empty slice should produce no change in the
123+ // Although calling StopInstance with an empty slice should produce no change in the
124 // provider, environs like dummy do not consider this a noop.
125 if len(instances) == 0 {
126 return nil
127@@ -272,6 +311,8 @@
128 return inst, nil
129 }
130
131+// instancesForMachines returns a list of environs.Instance that represent the list of machines running
132+// in the provider.
133 func (p *Provisioner) instancesForMachines(machines []*state.Machine) ([]environs.Instance, error) {
134 var insts []environs.Instance
135 for _, m := range machines {
136
137=== modified file 'cmd/jujud/provisioning_test.go'
138--- cmd/jujud/provisioning_test.go 2012-06-19 01:31:04 +0000
139+++ cmd/jujud/provisioning_test.go 2012-06-19 05:20:25 +0000
140@@ -350,6 +350,82 @@
141 c.Assert(p.Stop(), IsNil)
142 }
143
144+func (s *ProvisioningSuite) TestProvisioningStopsUnknownInstances(c *C) {
145+ p, err := NewProvisioner(s.zkInfo)
146+ c.Check(err, IsNil)
147+ // we are not using defer s.stopProvisioner(c, p) because we need to control when
148+ // the PA is restarted in this test. Methods like Fatalf and Assert should not be used.
149+ op := make(chan dummy.Operation, 1)
150+ dummy.Listen(op)
151+
152+ // create a machine
153+ m, err := s.st.AddMachine()
154+ c.Check(err, IsNil)
155+
156+ s.checkStartInstance(c, op)
157+ s.checkMachineIdSet(c, m)
158+
159+ // create a second machine
160+ m, err = s.st.AddMachine()
161+ c.Check(err, IsNil)
162+
163+ s.checkStartInstance(c, op)
164+ s.checkMachineIdSet(c, m)
165+
166+ // stop the PA
167+ c.Check(p.Stop(), IsNil)
168+
169+ // remove the machine
170+ err = s.st.RemoveMachine(m.Id())
171+ c.Check(err, IsNil)
172+
173+ // start a new provisioner
174+ p, err = NewProvisioner(s.zkInfo)
175+ c.Check(err, IsNil)
176+
177+ s.checkStopInstance(c, op)
178+
179+ c.Assert(p.Stop(), IsNil)
180+}
181+
182+// This check is different from the one above as it catches the edge case
183+// where the final machine has been removed from the state while the PA was
184+// not running.
185+func (s *ProvisioningSuite) TestProvisioningStopsOnlyUnknownInstances(c *C) {
186+ p, err := NewProvisioner(s.zkInfo)
187+ c.Check(err, IsNil)
188+ // we are not using defer s.stopProvisioner(c, p) because we need to control when
189+ // the PA is restarted in this test. Methods like Fatalf and Assert should not be used.
190+ op := make(chan dummy.Operation, 1)
191+ dummy.Listen(op)
192+
193+ // create a machine
194+ m, err := s.st.AddMachine()
195+ c.Check(err, IsNil)
196+
197+ s.checkStartInstance(c, op)
198+ s.checkMachineIdSet(c, m)
199+
200+ // stop the PA
201+ c.Check(p.Stop(), IsNil)
202+
203+ // remove the machine
204+ err = s.st.RemoveMachine(m.Id())
205+ c.Check(err, IsNil)
206+
207+ machines, err := s.st.AllMachines()
208+ c.Check(err, IsNil)
209+ c.Check(len(machines), Equals, 0) // it's really gone
210+
211+ // start a new provisioner
212+ p, err = NewProvisioner(s.zkInfo)
213+ c.Check(err, IsNil)
214+
215+ s.checkStopInstance(c, op)
216+
217+ c.Assert(p.Stop(), IsNil)
218+}
219+
220 func (s *ProvisioningSuite) TestProvisioningRecoversAfterInvalidEnvironmentPublished(c *C) {
221 p, err := NewProvisioner(s.zkInfo)
222 c.Assert(err, IsNil)

Subscribers

People subscribed via source and target branches