Merge lp:~rackspace-titan/nova/terminate-libvirt-hang-lp754509 into lp:~hudson-openstack/nova/trunk

Proposed by Brian Lamar
Status: Merged
Approved by: Soren Hansen
Approved revision: 971
Merged at revision: 973
Proposed branch: lp:~rackspace-titan/nova/terminate-libvirt-hang-lp754509
Merge into: lp:~hudson-openstack/nova/trunk
Diff against target: 22 lines (+6/-3)
1 file modified
nova/virt/libvirt_conn.py (+6/-3)
To merge this branch: bzr merge lp:~rackspace-titan/nova/terminate-libvirt-hang-lp754509
Reviewer Review Type Date Requested Status
Soren Hansen (community) Approve
Vish Ishaya (community) Approve
Ed Leafe (community) Needs Fixing
justinsb (community) Approve
Titan Pending
Review via email: mp+57003@code.launchpad.net

Description of the change

Terminating an instance currently hangs in a loop; this change allows instances to be deleted when using the libvirt backend. I also couldn't help adding a debug log message where an exception is caught and ignored.

Revision history for this message
Brian Lamar (blamar) wrote :

While I am unsure how this was working before, it currently looks like there is a busy-wait check on the instance state that never exits, because it waits for the instance to reach power_state.SHUTDOWN, which is not the same as power_state.SHUTOFF. These two states are a bit confusing, and I think a long-term fix should involve renaming the states and taking a good hard look at any code that busy-waits in a while True loop like this.
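
For readers unfamiliar with the two constants, here is a minimal sketch of the distinction and the wait loop (illustrative only; the helper name _wait_for_shutoff is not from the branch, and the semantics mirror libvirt's virDomainState, where SHUTDOWN means "being shut down" and SHUTOFF means "the domain is off"):

    from nova.compute import power_state


    def _wait_for_shutoff(driver, instance):
        """Busy-wait until the libvirt domain has actually stopped.

        Sketch only: waiting for power_state.SHUTDOWN (the domain is in
        the process of shutting down) never matches a persistent domain
        that has already stopped, which is the hang this branch fixes by
        checking power_state.SHUTOFF instead.
        """
        while True:
            state = driver.get_info(instance['name'])['state']
            if state == power_state.SHUTOFF:
                break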

Revision history for this message
justinsb (justin-fathomdb) wrote :

I was hitting the bug repeatedly; merged in the patch and it fixed it.

Brian - any chance you could put a little comment about the difference between SHUTOFF & SHUTDOWN in there somewhere? I've seen a lot of confusion about this, and was myself mystified as to the difference up until now!

I think this is probably worthy of whatever exemption it needs, because otherwise the compute node is effectively crashing on machine shutdown.

review: Approve
Revision history for this message
Koji Iida (iida-koji) wrote :

Hi,

Brian, thank you for making a bug fix branch.
With this branch, I was able to terminate an instance.

justinsb, I found two more problems.

(1) euca-reboot-instances fails.
 You need to apply Brian's patch before you can reproduce this issue.

 reboot() simply calls the following code:

          self.destroy(instance, False)
          self._create_new_domain(xml)

 _create_new_domain raises the following exception because the domain is already defined:

libvir: Domain Config error : operation failed: domain 'instance-00000002' already exists with uuid a3a56e76-0ac8-ecbb-7b91-b7d76259ac81
2011-04-09 10:29:49,276 ERROR nova.exception [-] Uncaught exception
(nova.exception): TRACE: Traceback (most recent call last):
(nova.exception): TRACE: File "/home/iida/nova/nova/exception.py", line 120, in _wrap
(nova.exception): TRACE: return f(*args, **kw)
(nova.exception): TRACE: File "/home/iida/nova/nova/virt/libvirt_conn.py", line 478, in reboot
(nova.exception): TRACE: self._create_new_domain(xml)
(nova.exception): TRACE: File "/home/iida/nova/nova/virt/libvirt_conn.py", line 1029, in _create_new_domain
(nova.exception): TRACE: domain = self._conn.defineXML(xml)
(nova.exception): TRACE: File "/usr/lib/python2.6/dist-packages/libvirt.py", line 1368, in defineXML
(nova.exception): TRACE: if ret is None:raise libvirtError('virDomainDefineXML() failed', conn=self)
(nova.exception): TRACE: libvirtError: operation failed: domain 'instance-00000002' already exists with uuid a3a56e76-0ac8-ecbb-7b91-b7d76259ac81
(nova.exception): TRACE:
2011-04-09 10:29:49,286 ERROR nova [-] Exception during message handling
(nova): TRACE: Traceback (most recent call last):
(nova): TRACE: File "/home/iida/nova/nova/rpc.py", line 188, in _receive
(nova): TRACE: rval = node_func(context=ctxt, **node_args)
(nova): TRACE: File "/home/iida/nova/nova/exception.py", line 120, in _wrap
(nova): TRACE: return f(*args, **kw)
(nova): TRACE: File "/home/iida/nova/nova/compute/manager.py", line 105, in decorated_function
(nova): TRACE: function(self, context, instance_id, *args, **kwargs)
(nova): TRACE: File "/home/iida/nova/nova/compute/manager.py", line 319, in reboot_instance
(nova): TRACE: self.driver.reboot(instance_ref)
(nova): TRACE: File "/home/iida/nova/nova/exception.py", line 126, in _wrap
(nova): TRACE: raise Error(str(e))
(nova): TRACE: Error: operation failed: domain 'instance-00000002' already exists with uuid a3a56e76-0ac8-ecbb-7b91-b7d76259ac81
(nova): TRACE:

(2) It seems there is no code calling 'undefine' on the domain XML, so the domain definition is not removed.

For example:

root@ubuntu:/home/iida# virsh list --all
 Id Name State
----------------------------------
  5 instance-00000001 running

root@ubuntu:/home/iida# euca-terminate-instances i-00000001
root@ubuntu:/home/iida# virsh list --all
 Id Name State
----------------------------------
  - instance-00000001 shut off

root@ubuntu:/home/iida#

I think we should undefine the XML definition when we terminate instance-00000001.

FYI:
https://help.ubuntu.com/community/KVM/Managing#Define,%20undefine,%20start,%...
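
As a follow-up to (2), a minimal sketch of what undefining could look like through the libvirt Python bindings (the helper name and connection URI are illustrative; this is not code from this branch):

    import libvirt


    def destroy_and_undefine(instance_name, uri='qemu:///system'):
        """Stop a domain and also drop its persistent definition.

        Sketch only: after destroy(), a persistent domain still shows up
        as "shut off" in `virsh list --all` until undefine() removes its
        XML definition.
        """
        conn = libvirt.open(uri)
        try:
            domain = conn.lookupByName(instance_name)
        except libvirt.libvirtError:
            conn.close()
            return  # domain is already gone
        try:
            domain.destroy()  # stop the domain if it is still running
        except libvirt.libvirtError:
            pass  # already stopped; nothing to do
        domain.undefine()  # remove the persistent XML definition
        conn.close()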


Revision history for this message
Ed Leafe (ed-leafe) wrote :

There is an error in your log message (lines 13-14 of the diff). The result of dict.update() is None, as update() operates in place. So

locals().update({"id": instance["id"]})

...which evaluates to None, will not do what I think you want. It would be better to simply create the dict you want directly:

msg = _("Error encountered when destroying instance '%(id)s': "
        "%(ex)s") % {"ex": ex, "id": instance["id"]}

review: Needs Fixing
971. By Brian Lamar

Fixed log message gaffe.

Revision history for this message
Brian Lamar (blamar) wrote :

@ed

Yeah, good catch there :) Not sure what I was thinking! Should be good now.

@justin

I'm not a huge fan of adding NOTEs to the code; I'd rather bring up at the conference that I think these names should get a do-over. They are used in so many places that noting just one of them seems futile.

Revision history for this message
justinsb (justin-fathomdb) wrote :

Koji: I'm going to open a bug report for the issues you described. I think they are due to my recent patch which changed the way we set up libvirt instances, so that they are persistent and are not lost on reboot. As you say, it looks like I need to add code to remove instance definitions.

Revision history for this message
justinsb (justin-fathomdb) wrote :

Incidentally, Koji's discovery is why this bug came up. When we were using transient instances, libvirt would immediately destroy the machine. Now that our instances survive host reboots, they enter a different stage when shut down, and so this bug gets hit. So I introduced this one - sorry guys. And thanks Koji for explaining it to me :-)

I'm fixing the issues that Koji described: Bug #755666 and related branch lp:~justin-fathomdb/nova/fix-reset-libvirt
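
For context on the transient vs. persistent distinction mentioned above, a rough libvirt-python sketch of the two approaches (illustrative only, not nova code):

    import libvirt


    def start_transient(conn, xml):
        # Transient domain: exists only while running. destroy() removes
        # it entirely, so nothing is left behind in `virsh list --all`.
        return conn.createXML(xml, 0)


    def start_persistent(conn, xml):
        # Persistent domain: the definition survives shutdown and host
        # reboots. After destroy() it remains defined and shows as
        # "shut off" until undefine() is called, which is what the
        # follow-up for Bug #755666 addresses.
        domain = conn.defineXML(xml)
        domain.create()
        return domain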

Revision history for this message
Vish Ishaya (vishvananda) wrote :

lgtm

review: Approve
Revision history for this message
Soren Hansen (soren) wrote :

lgtm

review: Approve

Preview Diff

 1 === modified file 'nova/virt/libvirt_conn.py'
 2 --- nova/virt/libvirt_conn.py 2011-04-08 00:32:34 +0000
 3 +++ nova/virt/libvirt_conn.py 2011-04-09 15:11:25 +0000
 4 @@ -325,12 +325,15 @@
 5                  state = self.get_info(instance['name'])['state']
 6                  db.instance_set_state(context.get_admin_context(),
 7                                        instance['id'], state)
 8 -                if state == power_state.SHUTDOWN:
 9 +                if state == power_state.SHUTOFF:
10                      break
11 -            except Exception:
12 +            except Exception as ex:
13 +                msg = _("Error encountered when destroying instance '%(id)s': "
14 +                        "%(ex)s") % {"id": instance["id"], "ex": ex}
15 +                LOG.debug(msg)
16                  db.instance_set_state(context.get_admin_context(),
17                                        instance['id'],
18 -                                      power_state.SHUTDOWN)
19 +                                      power_state.SHUTOFF)
20                  break
21
22          self.firewall_driver.unfilter_instance(instance)