Merge lp:~jk0/nova/xs-rescue-periodic-tasks into lp:~hudson-openstack/nova/trunk

Proposed by Josh Kearney
Status: Merged
Approved by: Rick Harris
Approved revision: 851
Merged at revision: 858
Proposed branch: lp:~jk0/nova/xs-rescue-periodic-tasks
Merge into: lp:~hudson-openstack/nova/trunk
Diff against target: 235 lines (+101/-26)
6 files modified
nova/compute/manager.py (+11/-2)
nova/utils.py (+8/-0)
nova/virt/hyperv.py (+3/-0)
nova/virt/libvirt_conn.py (+4/-0)
nova/virt/xenapi/vmops.py (+71/-24)
nova/virt/xenapi_conn.py (+4/-0)
To merge this branch: bzr merge lp:~jk0/nova/xs-rescue-periodic-tasks
Reviewer Review Type Date Requested Status
Sandy Walsh (community) Needs Fixing
Rick Harris (community) Approve
Matt Dietz (community) Approve
Review via email: mp+54597@code.launchpad.net

Description of the change

Adds a periodic_task that sweeps through rescued instances that have been in rescue mode longer than a configurable timeout and forcibly unrescues them.

Flag added: rescue_timeout (default is 0 - disabled)
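
For illustration, the gating works roughly as in the sketch below; StubDriver and FLAGS_rescue_timeout are stand-ins for the real virt driver and flag object, not names from this branch:

    # Minimal sketch of how rescue_timeout gates the sweep, mirroring the
    # periodic_tasks() hook this branch adds to ComputeManager.
    FLAGS_rescue_timeout = 86400  # seconds (24 hours); 0 disables the sweep


    class StubDriver(object):
        """Stand-in for the virt driver (xenapi_conn.XenAPIConnection)."""

        def poll_rescued_instances(self, timeout):
            print("would unrescue anything rescued > %d seconds ago" % timeout)


    def periodic_tasks(driver):
        """Called on every periodic-task tick by the compute manager."""
        if FLAGS_rescue_timeout > 0:
            driver.poll_rescued_instances(FLAGS_rescue_timeout)


    periodic_tasks(StubDriver())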

848. By Josh Kearney

Added docstring

Revision history for this message
Matt Dietz (cerberus) wrote :

39 +def is_then_greater(then, seconds):

That's rather awkwardly named. Perhaps "is_older_than(before, seconds):" ?

Otherwise I think this looks good.

review: Needs Fixing
849. By Josh Kearney

Better method name

Revision history for this message
Matt Dietz (cerberus) wrote :

derp

review: Approve
Revision history for this message
Rick Harris (rconradharris) wrote :

Hey Josh. Overall looks good, just a few suggestions:

> 12 +flags.DEFINE_integer("rescue_timeout", 0,
> 13 + "Automatically unrescue an instance after N hours."
> 14 + " Set to 0 to disable.")

I'd consider using minutes or seconds here for maximum flexibility.

Suppose some other provider wants to auto-unrescue after 15 minutes; since
we're using an integer, that would be impossible with hours as the metric.

> 25 + if FLAGS.rescue_timeout > 0:
> 26 + self.driver.poll_rescued_instances(FLAGS.rescue_timeout)

It looks like you only defined an implementation in xenapi_conn.py. We may
want to add `poll_rescued_instances` to the other drivers with a
NotImplementedError so someone can come along and add that in for us.
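
A stub along these lines is what's being suggested (sketch only, with a hypothetical class name); note that the branch as merged uses a bare `pass` for libvirt and hyperv instead of raising:

    # Hypothetical non-XenAPI driver stub, as suggested in the review.
    class SomeOtherConnection(object):
        def poll_rescued_instances(self, timeout):
            """Unrescue instances rescued longer than `timeout` seconds."""
            raise NotImplementedError()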

> 105 + _vbd_ref = self._session.get_xenapi().VBD.get_record(vbd_ref)
> 106 + if _vbd_ref["userdevice"] == "1":

Per the new convention, this should probably be:

    vbd_rec = self._session.get_xenapi().VBD.get_record(vbd_ref)
    if vbd_rec["userdevice"] == "1":

> 106 + if _vbd_ref["userdevice"] == "1":

Could you clarify the significance of userdevice == "1"? A comment would be
useful, but even better would be setting a nicely named variable:

    something_meaningful_is_true = (vbd_rec["userdevice"] == "1")
    if something_meaningful_is_true:
        # blah

> 186 + vms = []
> 187 + for instance in self.list_instances():
> 188 + if instance.endswith("-rescue"):

Since these are only rescue VMs, var might be clearer as `rescue_vms`.

> 196 + original_name = vm["name"].split("-rescue", 1)[0]
> 197 + original_vm_ref = VMHelper.lookup(self._session, original_name)

It might make sense to move these lines down so they're next to where they are used:

    original_name = vm["name"].split("-rescue", 1)[0]
    original_vm_ref = VMHelper.lookup(self._session, original_name)
    self._release_bootlock(original_vm_ref)
    self._session.call_xenapi("VM.start", original_vm_ref, False,
                              False)

> self._destroy_rescue_vbds(rescue_vm_ref)
> self._shutdown_rescue(rescue_vm_ref)
> self._destroy_rescue_vdis(rescue_vm_ref)
> self._destroy_rescue_instance(rescue_vm_ref)
> self._release_bootlock(original_vm_ref)
> self._start(instance, original_vm_ref)

This block of code is effectively repeated twice in vmops. Might be worth
refactoring it out into a separate method to DRY it up.
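
(For reference, the consolidation asked for here lands in the diff below as `_destroy_rescue_instance`; pulled out on its own, the helper is roughly:)

    def _destroy_rescue_instance(self, rescue_vm_ref):
        """Tear down a rescue VM: unplug its VBDs, shut it down, destroy
        its VDIs, then destroy the VM record itself."""
        self._destroy_rescue_vbds(rescue_vm_ref)
        self._shutdown_rescue(rescue_vm_ref)
        self._destroy_rescue_vdis(rescue_vm_ref)
        self._session.call_xenapi("Async.VM.destroy", rescue_vm_ref)

The bootlock release and restart of the original VM stay at the call sites, since they need the original VM ref rather than the rescue one.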

> 81 + def _shutdown_rescue(self, rescue_vm_ref):
> 82 + """Shutdown a rescue instance"""
> 83 + self._session.call_xenapi("Async.VM.hard_shutdown", rescue_vm_ref)

I'm not sure we need an additional method here (the less the better!). I
*think* we could replace with a call to:

    self._shutdown(instance, rescue_vm_ref, hard=True)

This gives us the benefit of logging (and the wait_for_task code that is
missing).

> 92 + def _destroy_rescue_vdis(self, rescue_vm_ref):
> 93 + """Destroys all VDIs associated with a rescued VM"""
> 94 + vdi_refs = VMHelper.lookup_vm_vdis(self._session, rescue_vm_ref)
> 95 + for vdi_ref in vdi_refs:
> 96 ...


review: Needs Fixing
850. By Josh Kearney

Review feedback

Revision history for this message
Josh Kearney (jk0) wrote :

Thanks Rick, this is some great feedback. I was able to fix everything with the exception of a few things below (comments inline):

> It looks like you only defined an implementation in xenapi_conn.py. We may
> want to add `poll_rescued_instances` to the other drivers with a
> NotImplementedError so someone can come along add at that in for us.

I did implement a `pass` method for libvirt; however, I didn't bother with HyperV since that layer is extremely bare at the moment. I suspect that will improve and change quite rapidly over time.

> As above, I think it would be better to use the regular destroy_*
> implementations and just pass the rescue_vm_ref instead of the usual vm_ref.

Unfortunately, since the `instance` object isn't available in these circumstances, I had to break them out into their own rescue methods. Looking on the bright side, this will allow me to provide more thorough test coverage for Rescue/Unrescue once this lands. Another possibility is the big XenAPI refactor that we've talked about, but that will likely warrant its own BP for a future release.

Revision history for this message
Rick Harris (rconradharris) wrote :

> I didn't bother with HyperV since that layer is extremely bare at the moment.

Not a deal-breaker, but it might be a good idea to add it to HyperV anyway so it doesn't fall any further behind. Whoever ends up owning HyperV will benefit greatly from the stubs, and they cost next to nothing to add.

> Another possibility is the big XenAPI refactor that we've talked about

Yeah, I'm pretty much in agreement with you here. As much as I hate to see methods which are near-duplicates, I don't see an easy way around it without some refactoring.

`_poll_tasks` in xenapi_conn.py is particularly nefarious, considering it takes an instance_id, optionally.

So, with both those items cleared up, I think this patch is good-to-go. Great job, Josh!

review: Approve
851. By Josh Kearney

Added hyperv stub

Revision history for this message
Sandy Walsh (sandy-walsh) wrote :

9 +def is_then_greater(then, seconds):
40 + if utcnow() - then > datetime.timedelta(seconds=seconds):
41 + return True
42 + else:
43 + return False

Beyond the naming issue Matt pointed out, how about:

return utcnow() - then > datetime.timedelta(seconds=seconds)

98 + continue
log message?

105 + if _vbd_ref["userdevice"] == "1":

Is there a risk of userdevice not being defined?
Perhaps, if _vbd_ref.get("userdevice", None) == "1":

173 + if last_ran:
174 + if not utils.is_then_greater(last_ran, timeout * 60 * 60):
175 + # Do not run. Let's bail.
176 + return
177 + else:
178 + # Update the time tracker and proceed.
179 + self.poll_rescue_last_ran = utils.utcnow()
180 + else:
181 + # We need a base time to start tracking.
182 + self.poll_rescue_last_ran = utils.utcnow()
183 + return

I try to keep my returns together to keep the if blocks smaller:

173 + if not last_ran:
181 + # We need a base time to start tracking.
182 + self.poll_rescue_last_ran = utils.utcnow()
183 + return

174 + if not utils.is_then_greater(last_ran, timeout * 60 * 60):
175 + # Do not run. Let's bail.
176 + return

177 + # Update the time tracker and proceed.
179 + self.poll_rescue_last_ran = utils.utcnow()
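
(Reading the fragments above together, the suggested gating amounts to the standalone sketch below; the function names here are illustrative, while the real code keeps its state on self.poll_rescue_last_ran and uses utils.is_older_than/utils.utcnow.)

    import datetime


    def is_older_than(before, seconds):
        """True if `before` lies more than `seconds` seconds in the past."""
        return datetime.datetime.utcnow() - before > datetime.timedelta(
            seconds=seconds)


    def should_run_sweep(last_ran, timeout):
        """Return (run_now, new_last_ran) using the early-return style above."""
        now = datetime.datetime.utcnow()
        if not last_ran:
            # We need a base time to start tracking.
            return False, now
        if not is_older_than(last_ran, timeout):
            # Not enough time has passed; bail.
            return False, last_ran
        # Update the time tracker and proceed.
        return True, now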

No tests?

review: Needs Fixing
Revision history for this message
Josh Kearney (jk0) wrote :

Thanks Sandy, I've worked all your suggestions into my other branch (where the unit tests live :)).

Preview Diff

=== modified file 'nova/compute/manager.py'
--- nova/compute/manager.py 2011-03-23 18:56:23 +0000
+++ nova/compute/manager.py 2011-03-24 03:56:32 +0000
@@ -65,8 +65,11 @@
                     'Console proxy host to use to connect to instances on'
                     'this host.')
 flags.DEFINE_integer('live_migration_retry_count', 30,
-                     ("Retry count needed in live_migration."
-                      " sleep 1 sec for each count"))
+                     "Retry count needed in live_migration."
+                     " sleep 1 sec for each count")
+flags.DEFINE_integer("rescue_timeout", 0,
+                     "Automatically unrescue an instance after N seconds."
+                     " Set to 0 to disable.")
 
 LOG = logging.getLogger('nova.compute.manager')
 
@@ -132,6 +135,12 @@
         """
         self.driver.init_host(host=self.host)
 
+    def periodic_tasks(self, context=None):
+        """Tasks to be run at a periodic interval."""
+        super(ComputeManager, self).periodic_tasks(context)
+        if FLAGS.rescue_timeout > 0:
+            self.driver.poll_rescued_instances(FLAGS.rescue_timeout)
+
     def _update_state(self, context, instance_id):
         """Update the state of an instance from the driver info."""
         # FIXME(ja): include other fields from state?
=== modified file 'nova/utils.py'
--- nova/utils.py 2011-03-22 16:13:48 +0000
+++ nova/utils.py 2011-03-24 03:56:32 +0000
@@ -335,6 +335,14 @@
 utcnow.override_time = None
 
 
+def is_older_than(before, seconds):
+    """Return True if before is older than 'seconds'"""
+    if utcnow() - before > datetime.timedelta(seconds=seconds):
+        return True
+    else:
+        return False
+
+
 def utcnow_ts():
     """Timestamp version of our utcnow function."""
     return time.mktime(utcnow().timetuple())
=== modified file 'nova/virt/hyperv.py'
--- nova/virt/hyperv.py 2011-01-27 17:56:54 +0000
+++ nova/virt/hyperv.py 2011-03-24 03:56:32 +0000
@@ -467,3 +467,6 @@
         if vm is None:
             raise exception.NotFound('Cannot detach volume from missing %s '
                                      % instance_name)
+
+    def poll_rescued_instances(self, timeout):
+        pass
=== modified file 'nova/virt/libvirt_conn.py'
--- nova/virt/libvirt_conn.py 2011-03-23 23:51:08 +0000
+++ nova/virt/libvirt_conn.py 2011-03-24 03:56:32 +0000
@@ -417,6 +417,10 @@
         self.reboot(instance)
 
     @exception.wrap_exception
+    def poll_rescued_instances(self, timeout):
+        pass
+
+    @exception.wrap_exception
     def spawn(self, instance):
         xml = self.to_xml(instance)
         db.instance_set_state(context.get_admin_context(),
=== modified file 'nova/virt/xenapi/vmops.py'
--- nova/virt/xenapi/vmops.py 2011-03-23 18:58:08 +0000
+++ nova/virt/xenapi/vmops.py 2011-03-24 03:56:32 +0000
@@ -51,6 +51,7 @@
     def __init__(self, session):
         self.XenAPI = session.get_imported_xenapi()
         self._session = session
+        self.poll_rescue_last_ran = None
 
         VMHelper.XenAPI = self.XenAPI
 
@@ -488,6 +489,10 @@
         except self.XenAPI.Failure, exc:
             LOG.exception(exc)
 
+    def _shutdown_rescue(self, rescue_vm_ref):
+        """Shutdown a rescue instance"""
+        self._session.call_xenapi("Async.VM.hard_shutdown", rescue_vm_ref)
+
     def _destroy_vdis(self, instance, vm_ref):
         """Destroys all VDIs associated with a VM"""
         instance_id = instance.id
@@ -505,6 +510,24 @@
         except self.XenAPI.Failure, exc:
             LOG.exception(exc)
 
+    def _destroy_rescue_vdis(self, rescue_vm_ref):
+        """Destroys all VDIs associated with a rescued VM"""
+        vdi_refs = VMHelper.lookup_vm_vdis(self._session, rescue_vm_ref)
+        for vdi_ref in vdi_refs:
+            try:
+                self._session.call_xenapi("Async.VDI.destroy", vdi_ref)
+            except self.XenAPI.Failure:
+                continue
+
+    def _destroy_rescue_vbds(self, rescue_vm_ref):
+        """Destroys all VBDs tied to a rescue VM"""
+        vbd_refs = self._session.get_xenapi().VM.get_VBDs(rescue_vm_ref)
+        for vbd_ref in vbd_refs:
+            vbd_rec = self._session.get_xenapi().VBD.get_record(vbd_ref)
+            if vbd_rec["userdevice"] == "1":  # primary VBD is always 1
+                VMHelper.unplug_vbd(self._session, vbd_ref)
+                VMHelper.destroy_vbd(self._session, vbd_ref)
+
     def _destroy_kernel_ramdisk(self, instance, vm_ref):
         """
         Three situations can occur:
@@ -555,6 +578,14 @@
 
         LOG.debug(_("Instance %(instance_id)s VM destroyed") % locals())
 
+    def _destroy_rescue_instance(self, rescue_vm_ref):
+        """Destroy a rescue instance"""
+        self._destroy_rescue_vbds(rescue_vm_ref)
+        self._shutdown_rescue(rescue_vm_ref)
+        self._destroy_rescue_vdis(rescue_vm_ref)
+
+        self._session.call_xenapi("Async.VM.destroy", rescue_vm_ref)
+
     def destroy(self, instance):
         """
         Destroy VM instance
@@ -658,41 +689,57 @@
 
         """
         rescue_vm_ref = VMHelper.lookup(self._session,
                                         instance.name + "-rescue")
 
         if not rescue_vm_ref:
            raise exception.NotFound(_(
                "Instance is not in Rescue Mode: %s" % instance.name))
 
         original_vm_ref = self._get_vm_opaque_ref(instance)
-        vbd_refs = self._session.get_xenapi().VM.get_VBDs(rescue_vm_ref)
-
         instance._rescue = False
 
-        for vbd_ref in vbd_refs:
-            _vbd_ref = self._session.get_xenapi().VBD.get_record(vbd_ref)
-            if _vbd_ref["userdevice"] == "1":
-                VMHelper.unplug_vbd(self._session, vbd_ref)
-                VMHelper.destroy_vbd(self._session, vbd_ref)
-
-        task1 = self._session.call_xenapi("Async.VM.hard_shutdown",
-                                          rescue_vm_ref)
-        self._session.wait_for_task(task1, instance.id)
-
-        vdi_refs = VMHelper.lookup_vm_vdis(self._session, rescue_vm_ref)
-        for vdi_ref in vdi_refs:
-            try:
-                task = self._session.call_xenapi('Async.VDI.destroy', vdi_ref)
-                self._session.wait_for_task(task, instance.id)
-            except self.XenAPI.Failure:
-                continue
-
-        task2 = self._session.call_xenapi('Async.VM.destroy', rescue_vm_ref)
-        self._session.wait_for_task(task2, instance.id)
-
+        self._destroy_rescue_instance(rescue_vm_ref)
         self._release_bootlock(original_vm_ref)
         self._start(instance, original_vm_ref)
 
+    def poll_rescued_instances(self, timeout):
+        """Look for expirable rescued instances
+            - forcibly exit rescue mode for any instances that have been
+              in rescue mode for >= the provided timeout
+        """
+        last_ran = self.poll_rescue_last_ran
+        if last_ran:
+            if not utils.is_older_than(last_ran, timeout):
+                # Do not run. Let's bail.
+                return
+            else:
+                # Update the time tracker and proceed.
+                self.poll_rescue_last_ran = utils.utcnow()
+        else:
+            # We need a base time to start tracking.
+            self.poll_rescue_last_ran = utils.utcnow()
+            return
+
+        rescue_vms = []
+        for instance in self.list_instances():
+            if instance.endswith("-rescue"):
+                rescue_vms.append(dict(name=instance,
+                                       vm_ref=VMHelper.lookup(self._session,
+                                                              instance)))
+
+        for vm in rescue_vms:
+            rescue_name = vm["name"]
+            rescue_vm_ref = vm["vm_ref"]
+
+            self._destroy_rescue_instance(rescue_vm_ref)
+
+            original_name = vm["name"].split("-rescue", 1)[0]
+            original_vm_ref = VMHelper.lookup(self._session, original_name)
+
+            self._release_bootlock(original_vm_ref)
+            self._session.call_xenapi("VM.start", original_vm_ref, False,
+                                      False)
+
     def get_info(self, instance):
         """Return data about VM instance"""
         vm_ref = self._get_vm_opaque_ref(instance)
=== modified file 'nova/virt/xenapi_conn.py'
--- nova/virt/xenapi_conn.py 2011-03-17 16:03:07 +0000
+++ nova/virt/xenapi_conn.py 2011-03-24 03:56:32 +0000
@@ -223,6 +223,10 @@
         """Unrescue the specified instance"""
         self._vmops.unrescue(instance, callback)
 
+    def poll_rescued_instances(self, timeout):
+        """Poll for rescued instances"""
+        self._vmops.poll_rescued_instances(timeout)
+
     def reset_network(self, instance):
         """reset networking for specified instance"""
         self._vmops.reset_network(instance)