Merge lp:~jk0/nova/xs-rescue-periodic-tasks into lp:~hudson-openstack/nova/trunk

Proposed by Josh Kearney
Status: Merged
Approved by: Rick Harris
Approved revision: 851
Merged at revision: 858
Proposed branch: lp:~jk0/nova/xs-rescue-periodic-tasks
Merge into: lp:~hudson-openstack/nova/trunk
Diff against target: 235 lines (+101/-26)
6 files modified
nova/compute/manager.py (+11/-2)
nova/utils.py (+8/-0)
nova/virt/hyperv.py (+3/-0)
nova/virt/libvirt_conn.py (+4/-0)
nova/virt/xenapi/vmops.py (+71/-24)
nova/virt/xenapi_conn.py (+4/-0)
To merge this branch: bzr merge lp:~jk0/nova/xs-rescue-periodic-tasks
Reviewer Review Type Date Requested Status
Sandy Walsh (community) Needs Fixing
Rick Harris (community) Approve
Matt Dietz (community) Approve
Review via email: mp+54597@code.launchpad.net

Description of the change

Adds a periodic_task that sweeps through rescued instances older than the configured timeout and forcibly unrescues them.

Flag added: rescue_timeout (default is 0 - disabled)
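
For example, to enable the sweep with roughly a 24-hour timeout (a hypothetical value; the flag takes seconds):

    --rescue_timeout=86400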

848. By Josh Kearney

Added docstring

Revision history for this message
Matt Dietz (cerberus) wrote :

39 +def is_then_greater(then, seconds):

That's rather awkwardly named. Perhaps "is_older_than(before, seconds):" ?

Otherwise I think this looks good.

review: Needs Fixing
849. By Josh Kearney

Better method name

Revision history for this message
Matt Dietz (cerberus) wrote :

derp

review: Approve
Revision history for this message
Rick Harris (rconradharris) wrote :

Hey Josh. Overall looks good, just a few suggestions:

> 12 +flags.DEFINE_integer("rescue_timeout", 0,
> 13 + "Automatically unrescue an instance after N hours."
> 14 + " Set to 0 to disable.")

I'd consider using minutes or seconds here for maximum flexibility.

Suppose some other provider wants to auto-unrescue after 15 minutes; since we're using an integer, that would be impossible with hours as the metric.
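
For reference, the flag that ultimately lands (see the preview diff below) is defined in seconds:

    flags.DEFINE_integer("rescue_timeout", 0,
                         "Automatically unrescue an instance after N seconds."
                         " Set to 0 to disable.")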

> 25 + if FLAGS.rescue_timeout > 0:
> 26 + self.driver.poll_rescued_instances(FLAGS.rescue_timeout)

It looks like you only defined an implementation in xenapi_conn.py. We may
want to add `poll_rescued_instances` to the other drivers with a
NotImplementedError so someone can come along and add that in for us.
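
A minimal stub along those lines might look like the following (a sketch of the suggestion; the branch ultimately lands `pass` stubs instead):

    def poll_rescued_instances(self, timeout):
        """Poll for expired rescued instances."""
        raise NotImplementedError()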

> 105 + _vbd_ref = self._session.get_xenapi().VBD.get_record(vbd_ref)
> 106 + if _vbd_ref["userdevice"] == "1":

Per new-convention, this should probably be:

    vbd_rec = self._session.get_xenapi().VBD.get_record(vbd_ref)
    if vbd_rec["userdevice"] == "1":

> 106 + if _vbd_ref["userdevice"] == "1":

Could you clarify the significance of userdevice == "1"? A comment would be
useful, but even better would be setting a nicely named variable:

    something_meaningful_is_true = (vbd_rec["userdevice"] == "1")
    if something_meaningful_is_true:
        # blah

> 186 + vms = []
> 187 + for instance in self.list_instances():
> 188 + if instance.endswith("-rescue"):

Since these are only rescue VMs, the variable might be clearer as `rescue_vms`.

> 196 + original_name = vm["name"].split("-rescue", 1)[0]
> 197 + original_vm_ref = VMHelper.lookup(self._session, original_name)

Might make sense to move these lines down so they're next to where they are used:

    original_name = vm["name"].split("-rescue", 1)[0]
    original_vm_ref = VMHelper.lookup(self._session, original_name)
    self._release_bootlock(original_vm_ref)
    self._session.call_xenapi("VM.start", original_vm_ref, False,
                              False)

> self._destroy_rescue_vbds(rescue_vm_ref)
> self._shutdown_rescue(rescue_vm_ref)
> self._destroy_rescue_vdis(rescue_vm_ref)
> self._destroy_rescue_instance(rescue_vm_ref)
> self._release_bootlock(original_vm_ref)
> self._start(instance, original_vm_ref)

This block of code is effectively repeated twice in vmops. Might be worth
refactoring it out into a separate method to DRY it up.
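
One way to do that, and roughly what the later revision does (see the preview diff below), is to pull the teardown steps into a single helper:

    def _destroy_rescue_instance(self, rescue_vm_ref):
        """Destroy a rescue instance"""
        self._destroy_rescue_vbds(rescue_vm_ref)
        self._shutdown_rescue(rescue_vm_ref)
        self._destroy_rescue_vdis(rescue_vm_ref)
        self._session.call_xenapi("Async.VM.destroy", rescue_vm_ref)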

> 81 + def _shutdown_rescue(self, rescue_vm_ref):
> 82 + """Shutdown a rescue instance"""
> 83 + self._session.call_xenapi("Async.VM.hard_shutdown", rescue_vm_ref)

I'm not sure we need an additional method here (the fewer the better!). I
*think* we could replace it with a call to:

    self._shutdown(instance, rescue_vm_ref, hard=True)

This gives us the benefit of logging (and the wait_for_task code that is
missing).

> 92 + def _destroy_rescue_vdis(self, rescue_vm_ref):
> 93 + """Destroys all VDIs associated with a rescued VM"""
> 94 + vdi_refs = VMHelper.lookup_vm_vdis(self._session, rescue_vm_ref)
> 95 + for vdi_ref in vdi_refs:
> 96 ...


review: Needs Fixing
850. By Josh Kearney

Review feedback

Revision history for this message
Josh Kearney (jk0) wrote :

Thanks Rick, this is some great feedback. I was able to fix everything with the exception of a few things below (comments inline):

> It looks like you only defined an implementation in xenapi_conn.py. We may
> want to add `poll_rescued_instances` to the other drivers with a
> NotImplementedError so someone can come along add at that in for us.

I did implement a `pass` method for libvirt; however, I didn't bother with HyperV since that layer is extremely bare at the moment. I suspect that will improve and change quite rapidly over time.

> As above, I think it would be better to use the regular destroy_*
> implementations and just pass the rescue_vm_ref instead of the usual vm_ref.

Unfortunately, since the `instance` object isn't available in these circumstances, I had to break them out into their own rescue methods. On the bright side, this will allow me to provide more thorough test coverage for Rescue/Unrescue once this lands. Another possibility is the big XenAPI refactor that we've talked about, but that will likely warrant its own BP for a future release.

Revision history for this message
Rick Harris (rconradharris) wrote :

> I didn't bother with HyperV since that layer is extremely bare at the moment.

Not a deal-breaker, but it might be a good idea to add the stub to HyperV anyway so it doesn't fall any further behind. Whoever ends up owning HyperV will benefit greatly from the stubs, and they cost next to nothing to add.

> Another possibility is the big XenAPI refactor that we've talked about

Yeah, I'm pretty much in agreement with you here. As much as I hate to see methods which are near-duplicates, I don't see an easy way around it without some refactoring.

`_poll_tasks` in xenapi_conn.py is particularly nefarious, considering it takes an instance_id, optionally.

So, with both those items cleared up, I think this patch is good-to-go. Great job, Josh!

review: Approve
851. By Josh Kearney

Added hyperv stub

Revision history for this message
Sandy Walsh (sandy-walsh) wrote :

9 +def is_then_greater(then, seconds):
40 + if utcnow() - then > datetime.timedelta(seconds=seconds):
41 + return True
42 + else:
43 + return False

Beyond the naming issue Matt pointed out, how about:

return utcnow() - then > datetime.timedelta(seconds=seconds)
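
Applied to the renamed helper in nova/utils.py, that simplification would look roughly like this (a sketch, not the landed code):

    def is_older_than(before, seconds):
        """Return True if `before` is older than `seconds` ago."""
        return utcnow() - before > datetime.timedelta(seconds=seconds)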

98 + continue
log message?

105 + if _vbd_ref["userdevice"] == "1":

Is there a risk of userdevice not being defined? Perhaps:

    if _vbd_ref.get("userdevice", None) == "1":

173 + if last_ran:
174 + if not utils.is_then_greater(last_ran, timeout * 60 * 60):
175 + # Do not run. Let's bail.
176 + return
177 + else:
178 + # Update the time tracker and proceed.
179 + self.poll_rescue_last_ran = utils.utcnow()
180 + else:
181 + # We need a base time to start tracking.
182 + self.poll_rescue_last_ran = utils.utcnow()
183 + return

I try to keep my returns together to keep the if blocks smaller:

173 + if not last_ran:
181 + # We need a base time to start tracking.
182 + self.poll_rescue_last_ran = utils.utcnow()
183 + return

174 + if not utils.is_then_greater(last_ran, timeout * 60 * 60):
175 + # Do not run. Let's bail.
176 + return

177 + # Update the time tracker and proceed.
179 + self.poll_rescue_last_ran = utils.utcnow()

No tests?

review: Needs Fixing
Revision history for this message
Josh Kearney (jk0) wrote :

Thanks Sandy, I've worked all your suggestions into my other branch (where the unit tests live :)).

Preview Diff

1=== modified file 'nova/compute/manager.py'
2--- nova/compute/manager.py 2011-03-23 18:56:23 +0000
3+++ nova/compute/manager.py 2011-03-24 03:56:32 +0000
4@@ -65,8 +65,11 @@
5 'Console proxy host to use to connect to instances on'
6 'this host.')
7 flags.DEFINE_integer('live_migration_retry_count', 30,
8- ("Retry count needed in live_migration."
9- " sleep 1 sec for each count"))
10+ "Retry count needed in live_migration."
11+ " sleep 1 sec for each count")
12+flags.DEFINE_integer("rescue_timeout", 0,
13+ "Automatically unrescue an instance after N seconds."
14+ " Set to 0 to disable.")
15
16 LOG = logging.getLogger('nova.compute.manager')
17
18@@ -132,6 +135,12 @@
19 """
20 self.driver.init_host(host=self.host)
21
22+ def periodic_tasks(self, context=None):
23+ """Tasks to be run at a periodic interval."""
24+ super(ComputeManager, self).periodic_tasks(context)
25+ if FLAGS.rescue_timeout > 0:
26+ self.driver.poll_rescued_instances(FLAGS.rescue_timeout)
27+
28 def _update_state(self, context, instance_id):
29 """Update the state of an instance from the driver info."""
30 # FIXME(ja): include other fields from state?
31
32=== modified file 'nova/utils.py'
33--- nova/utils.py 2011-03-22 16:13:48 +0000
34+++ nova/utils.py 2011-03-24 03:56:32 +0000
35@@ -335,6 +335,14 @@
36 utcnow.override_time = None
37
38
39+def is_older_than(before, seconds):
40+ """Return True if before is older than 'seconds'"""
41+ if utcnow() - before > datetime.timedelta(seconds=seconds):
42+ return True
43+ else:
44+ return False
45+
46+
47 def utcnow_ts():
48 """Timestamp version of our utcnow function."""
49 return time.mktime(utcnow().timetuple())
50
51=== modified file 'nova/virt/hyperv.py'
52--- nova/virt/hyperv.py 2011-01-27 17:56:54 +0000
53+++ nova/virt/hyperv.py 2011-03-24 03:56:32 +0000
54@@ -467,3 +467,6 @@
55 if vm is None:
56 raise exception.NotFound('Cannot detach volume from missing %s '
57 % instance_name)
58+
59+ def poll_rescued_instances(self, timeout):
60+ pass
61
62=== modified file 'nova/virt/libvirt_conn.py'
63--- nova/virt/libvirt_conn.py 2011-03-23 23:51:08 +0000
64+++ nova/virt/libvirt_conn.py 2011-03-24 03:56:32 +0000
65@@ -417,6 +417,10 @@
66 self.reboot(instance)
67
68 @exception.wrap_exception
69+ def poll_rescued_instances(self, timeout):
70+ pass
71+
72+ @exception.wrap_exception
73 def spawn(self, instance):
74 xml = self.to_xml(instance)
75 db.instance_set_state(context.get_admin_context(),
76
77=== modified file 'nova/virt/xenapi/vmops.py'
78--- nova/virt/xenapi/vmops.py 2011-03-23 18:58:08 +0000
79+++ nova/virt/xenapi/vmops.py 2011-03-24 03:56:32 +0000
80@@ -51,6 +51,7 @@
81 def __init__(self, session):
82 self.XenAPI = session.get_imported_xenapi()
83 self._session = session
84+ self.poll_rescue_last_ran = None
85
86 VMHelper.XenAPI = self.XenAPI
87
88@@ -488,6 +489,10 @@
89 except self.XenAPI.Failure, exc:
90 LOG.exception(exc)
91
92+ def _shutdown_rescue(self, rescue_vm_ref):
93+ """Shutdown a rescue instance"""
94+ self._session.call_xenapi("Async.VM.hard_shutdown", rescue_vm_ref)
95+
96 def _destroy_vdis(self, instance, vm_ref):
97 """Destroys all VDIs associated with a VM"""
98 instance_id = instance.id
99@@ -505,6 +510,24 @@
100 except self.XenAPI.Failure, exc:
101 LOG.exception(exc)
102
103+ def _destroy_rescue_vdis(self, rescue_vm_ref):
104+ """Destroys all VDIs associated with a rescued VM"""
105+ vdi_refs = VMHelper.lookup_vm_vdis(self._session, rescue_vm_ref)
106+ for vdi_ref in vdi_refs:
107+ try:
108+ self._session.call_xenapi("Async.VDI.destroy", vdi_ref)
109+ except self.XenAPI.Failure:
110+ continue
111+
112+ def _destroy_rescue_vbds(self, rescue_vm_ref):
113+ """Destroys all VBDs tied to a rescue VM"""
114+ vbd_refs = self._session.get_xenapi().VM.get_VBDs(rescue_vm_ref)
115+ for vbd_ref in vbd_refs:
116+ vbd_rec = self._session.get_xenapi().VBD.get_record(vbd_ref)
117+ if vbd_rec["userdevice"] == "1": # primary VBD is always 1
118+ VMHelper.unplug_vbd(self._session, vbd_ref)
119+ VMHelper.destroy_vbd(self._session, vbd_ref)
120+
121 def _destroy_kernel_ramdisk(self, instance, vm_ref):
122 """
123 Three situations can occur:
124@@ -555,6 +578,14 @@
125
126 LOG.debug(_("Instance %(instance_id)s VM destroyed") % locals())
127
128+ def _destroy_rescue_instance(self, rescue_vm_ref):
129+ """Destroy a rescue instance"""
130+ self._destroy_rescue_vbds(rescue_vm_ref)
131+ self._shutdown_rescue(rescue_vm_ref)
132+ self._destroy_rescue_vdis(rescue_vm_ref)
133+
134+ self._session.call_xenapi("Async.VM.destroy", rescue_vm_ref)
135+
136 def destroy(self, instance):
137 """
138 Destroy VM instance
139@@ -658,41 +689,57 @@
140
141 """
142 rescue_vm_ref = VMHelper.lookup(self._session,
143- instance.name + "-rescue")
144+ instance.name + "-rescue")
145
146 if not rescue_vm_ref:
147 raise exception.NotFound(_(
148 "Instance is not in Rescue Mode: %s" % instance.name))
149
150 original_vm_ref = self._get_vm_opaque_ref(instance)
151- vbd_refs = self._session.get_xenapi().VM.get_VBDs(rescue_vm_ref)
152-
153 instance._rescue = False
154
155- for vbd_ref in vbd_refs:
156- _vbd_ref = self._session.get_xenapi().VBD.get_record(vbd_ref)
157- if _vbd_ref["userdevice"] == "1":
158- VMHelper.unplug_vbd(self._session, vbd_ref)
159- VMHelper.destroy_vbd(self._session, vbd_ref)
160-
161- task1 = self._session.call_xenapi("Async.VM.hard_shutdown",
162- rescue_vm_ref)
163- self._session.wait_for_task(task1, instance.id)
164-
165- vdi_refs = VMHelper.lookup_vm_vdis(self._session, rescue_vm_ref)
166- for vdi_ref in vdi_refs:
167- try:
168- task = self._session.call_xenapi('Async.VDI.destroy', vdi_ref)
169- self._session.wait_for_task(task, instance.id)
170- except self.XenAPI.Failure:
171- continue
172-
173- task2 = self._session.call_xenapi('Async.VM.destroy', rescue_vm_ref)
174- self._session.wait_for_task(task2, instance.id)
175-
176+ self._destroy_rescue_instance(rescue_vm_ref)
177 self._release_bootlock(original_vm_ref)
178 self._start(instance, original_vm_ref)
179
180+ def poll_rescued_instances(self, timeout):
181+ """Look for expirable rescued instances
182+ - forcibly exit rescue mode for any instances that have been
183+ in rescue mode for >= the provided timeout
184+ """
185+ last_ran = self.poll_rescue_last_ran
186+ if last_ran:
187+ if not utils.is_older_than(last_ran, timeout):
188+ # Do not run. Let's bail.
189+ return
190+ else:
191+ # Update the time tracker and proceed.
192+ self.poll_rescue_last_ran = utils.utcnow()
193+ else:
194+ # We need a base time to start tracking.
195+ self.poll_rescue_last_ran = utils.utcnow()
196+ return
197+
198+ rescue_vms = []
199+ for instance in self.list_instances():
200+ if instance.endswith("-rescue"):
201+ rescue_vms.append(dict(name=instance,
202+ vm_ref=VMHelper.lookup(self._session,
203+ instance)))
204+
205+ for vm in rescue_vms:
206+ rescue_name = vm["name"]
207+ rescue_vm_ref = vm["vm_ref"]
208+
209+ self._destroy_rescue_instance(rescue_vm_ref)
210+
211+ original_name = vm["name"].split("-rescue", 1)[0]
212+ original_vm_ref = VMHelper.lookup(self._session, original_name)
213+
214+ self._release_bootlock(original_vm_ref)
215+ self._session.call_xenapi("VM.start", original_vm_ref, False,
216+ False)
217+
218 def get_info(self, instance):
219 """Return data about VM instance"""
220 vm_ref = self._get_vm_opaque_ref(instance)
221
222=== modified file 'nova/virt/xenapi_conn.py'
223--- nova/virt/xenapi_conn.py 2011-03-17 16:03:07 +0000
224+++ nova/virt/xenapi_conn.py 2011-03-24 03:56:32 +0000
225@@ -223,6 +223,10 @@
226 """Unrescue the specified instance"""
227 self._vmops.unrescue(instance, callback)
228
229+ def poll_rescued_instances(self, timeout):
230+ """Poll for rescued instances"""
231+ self._vmops.poll_rescued_instances(timeout)
232+
233 def reset_network(self, instance):
234 """reset networking for specified instance"""
235 self._vmops.reset_network(instance)