Merge lp:~jk0/nova/xs-rescue-periodic-tasks into lp:~hudson-openstack/nova/trunk

Proposed by Josh Kearney
Status: Merged
Approved by: Rick Harris
Approved revision: 851
Merged at revision: 858
Proposed branch: lp:~jk0/nova/xs-rescue-periodic-tasks
Merge into: lp:~hudson-openstack/nova/trunk
Diff against target: 235 lines (+101/-26)
6 files modified
nova/compute/manager.py (+11/-2)
nova/utils.py (+8/-0)
nova/virt/hyperv.py (+3/-0)
nova/virt/libvirt_conn.py (+4/-0)
nova/virt/xenapi/vmops.py (+71/-24)
nova/virt/xenapi_conn.py (+4/-0)
To merge this branch: bzr merge lp:~jk0/nova/xs-rescue-periodic-tasks
Reviewer Review Type Date Requested Status
Sandy Walsh (community) Needs Fixing
Rick Harris (community) Approve
Matt Dietz (community) Approve
Review via email: mp+54597@code.launchpad.net

Description of the change

Adds a periodic_task that sweeps through rescued instances older than the configured timeout and forcibly unrescues them.

Flag added: rescue_timeout (default is 0 - disabled)
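
For example, to enable the sweep with roughly a 24-hour timeout (a hypothetical value; the flag takes seconds):

    --rescue_timeout=86400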

848. By Josh Kearney

Added docstring

Revision history for this message
Matt Dietz (cerberus) wrote :

39 +def is_then_greater(then, seconds):

That's rather awkwardly named. Perhaps "is_older_than(before, seconds):" ?

Otherwise I think this looks good.

review: Needs Fixing
849. By Josh Kearney

Better method name

Revision history for this message
Matt Dietz (cerberus) wrote :

derp

review: Approve
Revision history for this message
Rick Harris (rconradharris) wrote :

Hey Josh. Overall looks good, just a few suggestions:

> 12 +flags.DEFINE_integer("rescue_timeout", 0,
> 13 + "Automatically unrescue an instance after N hours."
> 14 + " Set to 0 to disable.")

I'd consider using minutes or seconds here for maximum flexibility.

Suppose some other provider wants to auto-unrescue after 15 minutes; since we're using an integer, that would be impossible with hours as the metric.
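
For reference, the flag that ultimately lands (see the preview diff below) is defined in seconds:

    flags.DEFINE_integer("rescue_timeout", 0,
                         "Automatically unrescue an instance after N seconds."
                         " Set to 0 to disable.")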

> 25 + if FLAGS.rescue_timeout > 0:
> 26 + self.driver.poll_rescued_instances(FLAGS.rescue_timeout)

It looks like you only defined an implementation in xenapi_conn.py. We may
want to add `poll_rescued_instances` to the other drivers with a
NotImplementedError so someone can come along and add that in for us.
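
A minimal stub along those lines might look like the following (a sketch of the suggestion; the branch ultimately lands `pass` stubs instead):

    def poll_rescued_instances(self, timeout):
        """Poll for expired rescued instances."""
        raise NotImplementedError()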

> 105 + _vbd_ref = self._session.get_xenapi().VBD.get_record(vbd_ref)
> 106 + if _vbd_ref["userdevice"] == "1":

Per new-convention, this should probably be:

    vbd_rec = self._session.get_xenapi().VBD.get_record(vbd_ref)
    if vbd_rec["userdevice"] == "1":

> 106 + if _vbd_ref["userdevice"] == "1":

Could you clarify the significance of userdevice == "1"? A comment would be
useful, but even better would be setting a nicely named variable:

    something_meaningful_is_true = (vbd_rec["userdevice"] == "1")
    if something_meaningful_is_true:
        # blah

> 186 + vms = []
> 187 + for instance in self.list_instances():
> 188 + if instance.endswith("-rescue"):

Since these are only rescue VMs, the variable might be clearer as `rescue_vms`.

> 196 + original_name = vm["name"].split("-rescue", 1)[0]
> 197 + original_vm_ref = VMHelper.lookup(self._session, original_name)

Might make sense to move these lines down so they're next to where they are used:

    original_name = vm["name"].split("-rescue", 1)[0]
    original_vm_ref = VMHelper.lookup(self._session, original_name)
    self._release_bootlock(original_vm_ref)
    self._session.call_xenapi("VM.start", original_vm_ref, False,
                              False)

> self._destroy_rescue_vbds(rescue_vm_ref)
> self._shutdown_rescue(rescue_vm_ref)
> self._destroy_rescue_vdis(rescue_vm_ref)
> self._destroy_rescue_instance(rescue_vm_ref)
> self._release_bootlock(original_vm_ref)
> self._start(instance, original_vm_ref)

This block of code is effectively repeated twice in vmops. Might be worth
refactoring it out into a separate method to DRY it up.
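
One way to do that, and roughly what the later revision does (see the preview diff below), is to pull the teardown steps into a single helper:

    def _destroy_rescue_instance(self, rescue_vm_ref):
        """Destroy a rescue instance"""
        self._destroy_rescue_vbds(rescue_vm_ref)
        self._shutdown_rescue(rescue_vm_ref)
        self._destroy_rescue_vdis(rescue_vm_ref)
        self._session.call_xenapi("Async.VM.destroy", rescue_vm_ref)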

> 81 + def _shutdown_rescue(self, rescue_vm_ref):
> 82 + """Shutdown a rescue instance"""
> 83 + self._session.call_xenapi("Async.VM.hard_shutdown", rescue_vm_ref)

I'm not sure we need an additional method here (the fewer the better!). I
*think* we could replace it with a call to:

    self._shutdown(instance, rescue_vm_ref, hard=True)

This gives us the benefit of logging (and the wait_for_task code that is
missing).

> 92 + def _destroy_rescue_vdis(self, rescue_vm_ref):
> 93 + """Destroys all VDIs associated with a rescued VM"""
> 94 + vdi_refs = VMHelper.lookup_vm_vdis(self._session, rescue_vm_ref)
> 95 + for vdi_ref in vdi_refs:
> 96 ...


review: Needs Fixing
850. By Josh Kearney

Review feedback

Revision history for this message
Josh Kearney (jk0) wrote :

Thanks Rick, this is some great feedback. I was able to fix everything with the exception of a few things below (comments inline):

> It looks like you only defined an implementation in xenapi_conn.py. We may
> want to add `poll_rescued_instances` to the other drivers with a
> NotImplementedError so someone can come along add at that in for us.

I did implement a `pass` method for libvirt; however, I didn't bother with HyperV since that layer is extremely bare at the moment. I suspect that will improve and change quite rapidly over time.

> As above, I think it would be better to use the regular destroy_*
> implementations and just pass the rescue_vm_ref instead of the usual vm_ref.

Unfortunately, since the `instance` object isn't available in these circumstances, I had to break them out into their own rescue methods. On the bright side, this will allow me to provide more thorough test coverage for Rescue/Unrescue once this lands. Another possibility is the big XenAPI refactor that we've talked about, but that will likely warrant its own BP for a future release.

Revision history for this message
Rick Harris (rconradharris) wrote :

> I didn't bother with HyperV since that layer is extremely bare at the moment.

Not a deal-breaker, but it might be a good idea to add the stub to HyperV anyway so it doesn't fall any further behind. Whoever ends up owning HyperV will benefit greatly from the stubs, and they cost next to nothing to add.

> Another possibility is the big XenAPI refactor that we've talked about

Yeah, I'm pretty much in agreement with you here. As much as I hate to see methods which are near-duplicates, I don't see an easy way around it without some refactoring.

`_poll_tasks` in xenapi_conn.py is particularly nefarious, considering it takes an instance_id, optionally.

So, with both those items cleared up, I think this patch is good-to-go. Great job, Josh!

review: Approve
851. By Josh Kearney

Added hyperv stub

Revision history for this message
Sandy Walsh (sandy-walsh) wrote :

9 +def is_then_greater(then, seconds):
40 + if utcnow() - then > datetime.timedelta(seconds=seconds):
41 + return True
42 + else:
43 + return False

Beyond the naming issue Matt pointed out, how about:

return utcnow() - then > datetime.timedelta(seconds=seconds)
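
Applied to the renamed helper in nova/utils.py, that simplification would look roughly like this (a sketch, not the landed code):

    def is_older_than(before, seconds):
        """Return True if `before` is older than `seconds` ago."""
        return utcnow() - before > datetime.timedelta(seconds=seconds)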

98 + continue
log message?

105 + if _vbd_ref["userdevice"] == "1":

Is there a risk of userdevice not being defined? Perhaps:

    if _vbd_ref.get("userdevice", None) == "1":

173 + if last_ran:
174 + if not utils.is_then_greater(last_ran, timeout * 60 * 60):
175 + # Do not run. Let's bail.
176 + return
177 + else:
178 + # Update the time tracker and proceed.
179 + self.poll_rescue_last_ran = utils.utcnow()
180 + else:
181 + # We need a base time to start tracking.
182 + self.poll_rescue_last_ran = utils.utcnow()
183 + return

I try to keep my returns together to keep the if blocks smaller:

173 + if not last_ran:
181 + # We need a base time to start tracking.
182 + self.poll_rescue_last_ran = utils.utcnow()
183 + return

174 + if not utils.is_then_greater(last_ran, timeout * 60 * 60):
175 + # Do not run. Let's bail.
176 + return

177 + # Update the time tracker and proceed.
179 + self.poll_rescue_last_ran = utils.utcnow()

No tests?

review: Needs Fixing
Revision history for this message
Josh Kearney (jk0) wrote :

Thanks Sandy, I've worked all your suggestions into my other branch (where the unit tests live :)).

Preview Diff

1=== modified file 'nova/compute/manager.py'
2--- nova/compute/manager.py 2011-03-23 18:56:23 +0000
3+++ nova/compute/manager.py 2011-03-24 03:56:32 +0000
4@@ -65,8 +65,11 @@
5 'Console proxy host to use to connect to instances on'
6 'this host.')
7 flags.DEFINE_integer('live_migration_retry_count', 30,
8- ("Retry count needed in live_migration."
9- " sleep 1 sec for each count"))
10+ "Retry count needed in live_migration."
11+ " sleep 1 sec for each count")
12+flags.DEFINE_integer("rescue_timeout", 0,
13+ "Automatically unrescue an instance after N seconds."
14+ " Set to 0 to disable.")
15
16 LOG = logging.getLogger('nova.compute.manager')
17
18@@ -132,6 +135,12 @@
19 """
20 self.driver.init_host(host=self.host)
21
22+ def periodic_tasks(self, context=None):
23+ """Tasks to be run at a periodic interval."""
24+ super(ComputeManager, self).periodic_tasks(context)
25+ if FLAGS.rescue_timeout > 0:
26+ self.driver.poll_rescued_instances(FLAGS.rescue_timeout)
27+
28 def _update_state(self, context, instance_id):
29 """Update the state of an instance from the driver info."""
30 # FIXME(ja): include other fields from state?
31
32=== modified file 'nova/utils.py'
33--- nova/utils.py 2011-03-22 16:13:48 +0000
34+++ nova/utils.py 2011-03-24 03:56:32 +0000
35@@ -335,6 +335,14 @@
36 utcnow.override_time = None
37
38
39+def is_older_than(before, seconds):
40+ """Return True if before is older than 'seconds'"""
41+ if utcnow() - before > datetime.timedelta(seconds=seconds):
42+ return True
43+ else:
44+ return False
45+
46+
47 def utcnow_ts():
48 """Timestamp version of our utcnow function."""
49 return time.mktime(utcnow().timetuple())
50
51=== modified file 'nova/virt/hyperv.py'
52--- nova/virt/hyperv.py 2011-01-27 17:56:54 +0000
53+++ nova/virt/hyperv.py 2011-03-24 03:56:32 +0000
54@@ -467,3 +467,6 @@
55 if vm is None:
56 raise exception.NotFound('Cannot detach volume from missing %s '
57 % instance_name)
58+
59+ def poll_rescued_instances(self, timeout):
60+ pass
61
62=== modified file 'nova/virt/libvirt_conn.py'
63--- nova/virt/libvirt_conn.py 2011-03-23 23:51:08 +0000
64+++ nova/virt/libvirt_conn.py 2011-03-24 03:56:32 +0000
65@@ -417,6 +417,10 @@
66 self.reboot(instance)
67
68 @exception.wrap_exception
69+ def poll_rescued_instances(self, timeout):
70+ pass
71+
72+ @exception.wrap_exception
73 def spawn(self, instance):
74 xml = self.to_xml(instance)
75 db.instance_set_state(context.get_admin_context(),
76
77=== modified file 'nova/virt/xenapi/vmops.py'
78--- nova/virt/xenapi/vmops.py 2011-03-23 18:58:08 +0000
79+++ nova/virt/xenapi/vmops.py 2011-03-24 03:56:32 +0000
80@@ -51,6 +51,7 @@
81 def __init__(self, session):
82 self.XenAPI = session.get_imported_xenapi()
83 self._session = session
84+ self.poll_rescue_last_ran = None
85
86 VMHelper.XenAPI = self.XenAPI
87
88@@ -488,6 +489,10 @@
89 except self.XenAPI.Failure, exc:
90 LOG.exception(exc)
91
92+ def _shutdown_rescue(self, rescue_vm_ref):
93+ """Shutdown a rescue instance"""
94+ self._session.call_xenapi("Async.VM.hard_shutdown", rescue_vm_ref)
95+
96 def _destroy_vdis(self, instance, vm_ref):
97 """Destroys all VDIs associated with a VM"""
98 instance_id = instance.id
99@@ -505,6 +510,24 @@
100 except self.XenAPI.Failure, exc:
101 LOG.exception(exc)
102
103+ def _destroy_rescue_vdis(self, rescue_vm_ref):
104+ """Destroys all VDIs associated with a rescued VM"""
105+ vdi_refs = VMHelper.lookup_vm_vdis(self._session, rescue_vm_ref)
106+ for vdi_ref in vdi_refs:
107+ try:
108+ self._session.call_xenapi("Async.VDI.destroy", vdi_ref)
109+ except self.XenAPI.Failure:
110+ continue
111+
112+ def _destroy_rescue_vbds(self, rescue_vm_ref):
113+ """Destroys all VBDs tied to a rescue VM"""
114+ vbd_refs = self._session.get_xenapi().VM.get_VBDs(rescue_vm_ref)
115+ for vbd_ref in vbd_refs:
116+ vbd_rec = self._session.get_xenapi().VBD.get_record(vbd_ref)
117+ if vbd_rec["userdevice"] == "1": # primary VBD is always 1
118+ VMHelper.unplug_vbd(self._session, vbd_ref)
119+ VMHelper.destroy_vbd(self._session, vbd_ref)
120+
121 def _destroy_kernel_ramdisk(self, instance, vm_ref):
122 """
123 Three situations can occur:
124@@ -555,6 +578,14 @@
125
126 LOG.debug(_("Instance %(instance_id)s VM destroyed") % locals())
127
128+ def _destroy_rescue_instance(self, rescue_vm_ref):
129+ """Destroy a rescue instance"""
130+ self._destroy_rescue_vbds(rescue_vm_ref)
131+ self._shutdown_rescue(rescue_vm_ref)
132+ self._destroy_rescue_vdis(rescue_vm_ref)
133+
134+ self._session.call_xenapi("Async.VM.destroy", rescue_vm_ref)
135+
136 def destroy(self, instance):
137 """
138 Destroy VM instance
139@@ -658,41 +689,57 @@
140
141 """
142 rescue_vm_ref = VMHelper.lookup(self._session,
143- instance.name + "-rescue")
144+ instance.name + "-rescue")
145
146 if not rescue_vm_ref:
147 raise exception.NotFound(_(
148 "Instance is not in Rescue Mode: %s" % instance.name))
149
150 original_vm_ref = self._get_vm_opaque_ref(instance)
151- vbd_refs = self._session.get_xenapi().VM.get_VBDs(rescue_vm_ref)
152-
153 instance._rescue = False
154
155- for vbd_ref in vbd_refs:
156- _vbd_ref = self._session.get_xenapi().VBD.get_record(vbd_ref)
157- if _vbd_ref["userdevice"] == "1":
158- VMHelper.unplug_vbd(self._session, vbd_ref)
159- VMHelper.destroy_vbd(self._session, vbd_ref)
160-
161- task1 = self._session.call_xenapi("Async.VM.hard_shutdown",
162- rescue_vm_ref)
163- self._session.wait_for_task(task1, instance.id)
164-
165- vdi_refs = VMHelper.lookup_vm_vdis(self._session, rescue_vm_ref)
166- for vdi_ref in vdi_refs:
167- try:
168- task = self._session.call_xenapi('Async.VDI.destroy', vdi_ref)
169- self._session.wait_for_task(task, instance.id)
170- except self.XenAPI.Failure:
171- continue
172-
173- task2 = self._session.call_xenapi('Async.VM.destroy', rescue_vm_ref)
174- self._session.wait_for_task(task2, instance.id)
175-
176+ self._destroy_rescue_instance(rescue_vm_ref)
177 self._release_bootlock(original_vm_ref)
178 self._start(instance, original_vm_ref)
179
180+ def poll_rescued_instances(self, timeout):
181+ """Look for expirable rescued instances
182+ - forcibly exit rescue mode for any instances that have been
183+ in rescue mode for >= the provided timeout
184+ """
185+ last_ran = self.poll_rescue_last_ran
186+ if last_ran:
187+ if not utils.is_older_than(last_ran, timeout):
188+ # Do not run. Let's bail.
189+ return
190+ else:
191+ # Update the time tracker and proceed.
192+ self.poll_rescue_last_ran = utils.utcnow()
193+ else:
194+ # We need a base time to start tracking.
195+ self.poll_rescue_last_ran = utils.utcnow()
196+ return
197+
198+ rescue_vms = []
199+ for instance in self.list_instances():
200+ if instance.endswith("-rescue"):
201+ rescue_vms.append(dict(name=instance,
202+ vm_ref=VMHelper.lookup(self._session,
203+ instance)))
204+
205+ for vm in rescue_vms:
206+ rescue_name = vm["name"]
207+ rescue_vm_ref = vm["vm_ref"]
208+
209+ self._destroy_rescue_instance(rescue_vm_ref)
210+
211+ original_name = vm["name"].split("-rescue", 1)[0]
212+ original_vm_ref = VMHelper.lookup(self._session, original_name)
213+
214+ self._release_bootlock(original_vm_ref)
215+ self._session.call_xenapi("VM.start", original_vm_ref, False,
216+ False)
217+
218 def get_info(self, instance):
219 """Return data about VM instance"""
220 vm_ref = self._get_vm_opaque_ref(instance)
221
222=== modified file 'nova/virt/xenapi_conn.py'
223--- nova/virt/xenapi_conn.py 2011-03-17 16:03:07 +0000
224+++ nova/virt/xenapi_conn.py 2011-03-24 03:56:32 +0000
225@@ -223,6 +223,10 @@
226 """Unrescue the specified instance"""
227 self._vmops.unrescue(instance, callback)
228
229+ def poll_rescued_instances(self, timeout):
230+ """Poll for rescued instances"""
231+ self._vmops.poll_rescued_instances(timeout)
232+
233 def reset_network(self, instance):
234 """reset networking for specified instance"""
235 self._vmops.reset_network(instance)