Merge lp:~rackspace-titan/nova/terminate-libvirt-hang-lp754509 into lp:~hudson-openstack/nova/trunk

Proposed by Brian Lamar
Status: Merged
Approved by: Soren Hansen
Approved revision: 971
Merged at revision: 973
Proposed branch: lp:~rackspace-titan/nova/terminate-libvirt-hang-lp754509
Merge into: lp:~hudson-openstack/nova/trunk
Diff against target: 22 lines (+6/-3)
1 file modified
nova/virt/libvirt_conn.py (+6/-3)
To merge this branch: bzr merge lp:~rackspace-titan/nova/terminate-libvirt-hang-lp754509
Reviewer Review Type Date Requested Status
Soren Hansen (community) Approve
Vish Ishaya (community) Approve
Ed Leafe (community) Needs Fixing
justinsb (community) Approve
Titan Pending
Review via email: mp+57003@code.launchpad.net

Description of the change

Terminating an instance currently hangs in a loop; this change allows instances to be deleted when using the libvirt backend. I also couldn't help adding a debug log message where an exception is caught and ignored.

Revision history for this message
Brian Lamar (blamar) wrote :

While I am unsure how this was working before, it currently looks like there is a busy-wait check on the instance state that never exits, because it waits for the instance to reach power_state.SHUTDOWN, which is not the same as power_state.SHUTOFF. These two states are a bit confusing, and I think a long-term fix should involve renaming the states and taking a good hard look at any code that busy-waits in a while True loop like this.
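
For readers unfamiliar with the two constants, here is a minimal sketch of the distinction and the wait loop (illustrative only; the helper name _wait_for_shutoff is not from the branch, and the semantics mirror libvirt's virDomainState, where SHUTDOWN means "being shut down" and SHUTOFF means "the domain is off"):

    from nova.compute import power_state


    def _wait_for_shutoff(driver, instance):
        """Busy-wait until the libvirt domain has actually stopped.

        Sketch only: waiting for power_state.SHUTDOWN (the domain is in
        the process of shutting down) never matches a persistent domain
        that has already stopped, which is the hang this branch fixes by
        checking power_state.SHUTOFF instead.
        """
        while True:
            state = driver.get_info(instance['name'])['state']
            if state == power_state.SHUTOFF:
                break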

Revision history for this message
justinsb (justin-fathomdb) wrote :

I was hitting the bug repeatedly; merged in the patch and it fixed it.

Brian - any chance you could put a little comment about the difference between SHUTOFF & SHUTDOWN in there somewhere? I've seen a lot of confusion about this, and was myself mystified as to the difference up until now!

I think this is probably worthy of whatever exemption it needs, because otherwise the compute node is effectively crashing on machine shutdown.

review: Approve
Revision history for this message
Koji Iida (iida-koji) wrote :

Hi,

Brian, thank you for making a bug fix branch.
With this branch, I was able to terminate an instance.

justinsb, I found two more problems.

(1) euca-reboot-instances fails.
 You need to apply Brian's patch before you can reproduce this issue.

 reboot() simply calls the following code:

          self.destroy(instance, False)
          self._create_new_domain(xml)

 _create_new_domain raises the following exception because the domain is already defined:

libvir: Domain Config error : operation failed: domain 'instance-00000002' already exists with uuid a3a56e76-0ac8-ecbb-7b91-b7d76259ac81
2011-04-09 10:29:49,276 ERROR nova.exception [-] Uncaught exception
(nova.exception): TRACE: Traceback (most recent call last):
(nova.exception): TRACE: File "/home/iida/nova/nova/exception.py", line 120, in _wrap
(nova.exception): TRACE: return f(*args, **kw)
(nova.exception): TRACE: File "/home/iida/nova/nova/virt/libvirt_conn.py", line 478, in reboot
(nova.exception): TRACE: self._create_new_domain(xml)
(nova.exception): TRACE: File "/home/iida/nova/nova/virt/libvirt_conn.py", line 1029, in _create_new_domain
(nova.exception): TRACE: domain = self._conn.defineXML(xml)
(nova.exception): TRACE: File "/usr/lib/python2.6/dist-packages/libvirt.py", line 1368, in defineXML
(nova.exception): TRACE: if ret is None:raise libvirtError('virDomainDefineXML() failed', conn=self)
(nova.exception): TRACE: libvirtError: operation failed: domain 'instance-00000002' already exists with uuid a3a56e76-0ac8-ecbb-7b91-b7d76259ac81
(nova.exception): TRACE:
2011-04-09 10:29:49,286 ERROR nova [-] Exception during message handling
(nova): TRACE: Traceback (most recent call last):
(nova): TRACE: File "/home/iida/nova/nova/rpc.py", line 188, in _receive
(nova): TRACE: rval = node_func(context=ctxt, **node_args)
(nova): TRACE: File "/home/iida/nova/nova/exception.py", line 120, in _wrap
(nova): TRACE: return f(*args, **kw)
(nova): TRACE: File "/home/iida/nova/nova/compute/manager.py", line 105, in decorated_function
(nova): TRACE: function(self, context, instance_id, *args, **kwargs)
(nova): TRACE: File "/home/iida/nova/nova/compute/manager.py", line 319, in reboot_instance
(nova): TRACE: self.driver.reboot(instance_ref)
(nova): TRACE: File "/home/iida/nova/nova/exception.py", line 126, in _wrap
(nova): TRACE: raise Error(str(e))
(nova): TRACE: Error: operation failed: domain 'instance-00000002' already exists with uuid a3a56e76-0ac8-ecbb-7b91-b7d76259ac81
(nova): TRACE:

(2) It seems there is no code calling 'undefine' on the domain XML, so the domain definition is not removed.

For example:

root@ubuntu:/home/iida# virsh list --all
 Id Name State
----------------------------------
  5 instance-00000001 running

root@ubuntu:/home/iida# euca-terminate-instances i-00000001
root@ubuntu:/home/iida# virsh list --all
 Id Name State
----------------------------------
  - instance-00000001 shut off

root@ubuntu:/home/iida#

I think we should undefine the XML definition when we terminate instance-00000001.

FYI:
https://help.ubuntu.com/community/KVM/Managing#Define,%20undefine,%20start,%...
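
As a follow-up to (2), a minimal sketch of what undefining could look like through the libvirt Python bindings (the helper name and connection URI are illustrative; this is not code from this branch):

    import libvirt


    def destroy_and_undefine(instance_name, uri='qemu:///system'):
        """Stop a domain and also drop its persistent definition.

        Sketch only: after destroy(), a persistent domain still shows up
        as "shut off" in `virsh list --all` until undefine() removes its
        XML definition.
        """
        conn = libvirt.open(uri)
        try:
            domain = conn.lookupByName(instance_name)
        except libvirt.libvirtError:
            conn.close()
            return  # domain is already gone
        try:
            domain.destroy()  # stop the domain if it is still running
        except libvirt.libvirtError:
            pass  # already stopped; nothing to do
        domain.undefine()  # remove the persistent XML definition
        conn.close()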


Revision history for this message
Ed Leafe (ed-leafe) wrote :

There is an error in your log message (lines 13-14 of the diff). The result of dict.update() is None, as update() operates in place. So

locals().update({"id": instance["id"]})

...which evaluates to None, will not do what I think you want. It would be better to simply create the dict you want directly:

msg = _("Error encountered when destroying instance '%(id)s': "
        "%(ex)s") % {"ex": ex, "id": instance["id"]}

review: Needs Fixing
971. By Brian Lamar

Fixed log message gaffe.

Revision history for this message
Brian Lamar (blamar) wrote :

@ed

Yeah, good catch there :) Not sure what I was thinking! Should be good now.

@justin

I'm not a huge fan of adding NOTEs to the code; I'd rather bring up at the conference that I think these names should get a do-over. They are used in so many places that noting just one of them seems futile.

Revision history for this message
justinsb (justin-fathomdb) wrote :

Koji: I'm going to open a bug report for the issues you described. I think they are due to my recent patch which changed the way we set up libvirt instances, so that they are persistent and are not lost on reboot. As you say, it looks like I need to add code to remove instance definitions.

Revision history for this message
justinsb (justin-fathomdb) wrote :

Incidentally, Koji's discovery is why this bug came up. When we were using transient instances, libvirt would immediately destroy the machine. Now that our instances survive host reboots, they enter a different stage when shut down, and so this bug gets hit. So I introduced this one - sorry guys. And thanks Koji for explaining it to me :-)

I'm fixing the issues that Koji described: Bug #755666 and related branch lp:~justin-fathomdb/nova/fix-reset-libvirt
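
For context on the transient vs. persistent distinction mentioned above, a rough libvirt-python sketch of the two approaches (illustrative only, not nova code):

    import libvirt


    def start_transient(conn, xml):
        # Transient domain: exists only while running. destroy() removes
        # it entirely, so nothing is left behind in `virsh list --all`.
        return conn.createXML(xml, 0)


    def start_persistent(conn, xml):
        # Persistent domain: the definition survives shutdown and host
        # reboots. After destroy() it remains defined and shows as
        # "shut off" until undefine() is called, which is what the
        # follow-up for Bug #755666 addresses.
        domain = conn.defineXML(xml)
        domain.create()
        return domain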

Revision history for this message
Vish Ishaya (vishvananda) wrote :

lgtm

review: Approve
Revision history for this message
Soren Hansen (soren) wrote :

lgtm

review: Approve

Preview Diff

 1 === modified file 'nova/virt/libvirt_conn.py'
 2 --- nova/virt/libvirt_conn.py 2011-04-08 00:32:34 +0000
 3 +++ nova/virt/libvirt_conn.py 2011-04-09 15:11:25 +0000
 4 @@ -325,12 +325,15 @@
 5                  state = self.get_info(instance['name'])['state']
 6                  db.instance_set_state(context.get_admin_context(),
 7                                        instance['id'], state)
 8 -                if state == power_state.SHUTDOWN:
 9 +                if state == power_state.SHUTOFF:
10                      break
11 -            except Exception:
12 +            except Exception as ex:
13 +                msg = _("Error encountered when destroying instance '%(id)s': "
14 +                        "%(ex)s") % {"id": instance["id"], "ex": ex}
15 +                LOG.debug(msg)
16                  db.instance_set_state(context.get_admin_context(),
17                                        instance['id'],
18 -                                      power_state.SHUTDOWN)
19 +                                      power_state.SHUTOFF)
20                  break
21
22          self.firewall_driver.unfilter_instance(instance)