[RC] libvirt instance definitions not removed

Bug #755666 reported by justinsb
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: High
Assigned to: justinsb
Milestone: 2011.2

Bug Description

In my recent patch to make sure that libvirt instances didn't disappear on reboot, I changed it so that definitions were persistent. However, I didn't consider the consequences of leaving definitions around.

Koji reported the following issues on a merge proposal; I'm pasting them here so that we can track them as a bug and I can work on them:

(1) euca-reboot-instance fails.
 You need to apply Brian's patch before reproducing this issue.

 reboot() simply calls the following code:

          self.destroy(instance, False)
          self._create_new_domain(xml)

 _create_new_domain causes the following exception because the domain is already defined:

libvir: Domain Config error : operation failed: domain 'instance-00000002' already exists with uuid a3a56e76-0ac8-ecbb-7b91-b7d76259ac81
2011-04-09 10:29:49,276 ERROR nova.exception [-] Uncaught exception
(nova.exception): TRACE: Traceback (most recent call last):
(nova.exception): TRACE: File "/home/iida/nova/nova/exception.py", line 120, in _wrap
(nova.exception): TRACE: return f(*args, **kw)
(nova.exception): TRACE: File "/home/iida/nova/nova/virt/libvirt_conn.py", line 478, in reboot
(nova.exception): TRACE: self._create_new_domain(xml)
(nova.exception): TRACE: File "/home/iida/nova/nova/virt/libvirt_conn.py", line 1029, in _create_new_domain
(nova.exception): TRACE: domain = self._conn.defineXML(xml)
(nova.exception): TRACE: File "/usr/lib/python2.6/dist-packages/libvirt.py", line 1368, in defineXML
(nova.exception): TRACE: if ret is None:raise libvirtError('virDomainDefineXML() failed', conn=self)
(nova.exception): TRACE: libvirtError: operation failed: domain 'instance-00000002' already exists with uuid a3a56e76-0ac8-ecbb-7b91-b7d76259ac81
(nova.exception): TRACE:
2011-04-09 10:29:49,286 ERROR nova [-] Exception during message handling
(nova): TRACE: Traceback (most recent call last):
(nova): TRACE: File "/home/iida/nova/nova/rpc.py", line 188, in _receive
(nova): TRACE: rval = node_func(context=ctxt, **node_args)
(nova): TRACE: File "/home/iida/nova/nova/exception.py", line 120, in _wrap
(nova): TRACE: return f(*args, **kw)
(nova): TRACE: File "/home/iida/nova/nova/compute/manager.py", line 105, in decorated_function
(nova): TRACE: function(self, context, instance_id, *args, **kwargs)
(nova): TRACE: File "/home/iida/nova/nova/compute/manager.py", line 319, in reboot_instance
(nova): TRACE: self.driver.reboot(instance_ref)
(nova): TRACE: File "/home/iida/nova/nova/exception.py", line 126, in _wrap
(nova): TRACE: raise Error(str(e))
(nova): TRACE: Error: operation failed: domain 'instance-00000002' already exists with uuid a3a56e76-0ac8-ecbb-7b91-b7d76259ac81
(nova): TRACE:
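
For context, a minimal sketch of how the define step could tolerate a stale persistent definition, assuming the standard libvirt Python bindings; the function name and signature below are illustrative, not the actual nova code:

    import libvirt

    def define_domain(conn, name, xml):
        # Illustrative only: clear any stale, inactive definition with the
        # same name before calling defineXML(), so it does not fail with
        # "domain already exists".
        try:
            stale = conn.lookupByName(name)
        except libvirt.libvirtError:
            stale = None  # no existing definition under this name

        if stale is not None and not stale.isActive():
            stale.undefine()  # drop the leftover persistent definition

        domain = conn.defineXML(xml)
        domain.create()  # boot the freshly defined domain
        return domain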

(2) It seems that there is no code calling 'undefine' for the domain XML, so the domain XML is not removed.

For example:

root@ubuntu:/home/iida# virsh list --all
 Id Name State
----------------------------------
  5 instance-00000001 running

root@ubuntu:/home/iida# euca-terminate-instances i-00000001
root@ubuntu:/home/iida# virsh list --all
 Id Name State
----------------------------------
  - instance-00000001 shut off

root@ubuntu:/home/iida#

I think we could undefine the XML definition when we terminate instance-00000001.

FYI:
https://help.ubuntu.com/community/KVM/Managing#Define,%20undefine,%20start,%20shutdown,%20destroy%20VMs
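
Along the lines of the virsh commands in that page, a rough Python sketch of what undefining at terminate time could look like (illustrative only, using the libvirt Python bindings; destroy_and_undefine is a hypothetical helper, not the real nova destroy()):

    import libvirt

    def destroy_and_undefine(conn, instance_name):
        # Illustrative: stop the domain and also remove its persistent
        # definition so it no longer lingers as "shut off" in virsh list --all.
        try:
            domain = conn.lookupByName(instance_name)
        except libvirt.libvirtError:
            return  # nothing defined under this name

        if domain.isActive():
            domain.destroy()  # stop the running guest

        domain.undefine()  # remove the persistent XML definition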

And lastly, I have not checked whether rescue mode works or not. Does anyone know whether rescue mode is working properly now?

Changed in nova:
assignee: nobody → justinsb (justin-fathomdb)
summary: - libvirt instance definitions not removed
+ [RC] libvirt instance definitions not removed
Revision history for this message
justinsb (justin-fathomdb) wrote :

Requesting Gamma Freeze exemption...

Benefit: Without this, instance reboot on libvirt-backed instances will not work (because it deletes the domain and recreates it - it probably shouldn't do that anyway, but we can't fix that in Cactus). Any function that involves deleting a domain is likely to be broken without it (e.g. recovery), and in addition deleted domains accumulate in libvirt (visible in virsh list --all).

Risk of regression: Moderate. This is not a trivial fix, but it's not super complicated either - it is just adding one extra call to "undefine". That one call expands into lots of lines of code because it has to cope if the domain is shut off but not deleted, so we can't just keep the naive error handling (see the sketch after this list). Mitigating factors:
1) Testing against my own install using KVM, including with instances in the 'stuck' state (shut down but still defined)
2) Very careful error handling code (which we probably should have throughout the libvirt code anyway)
3) Making the new behaviour as close as possible to the old behaviour (e.g. I would like to see restart reuse the domain definition, because then I think e.g. volume attachments would persist; however that would put a much higher workload on QA)
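
For illustration only, a sketch of the kind of defensive handling described above, assuming the libvirt Python bindings (this is not the merged patch, and ensure_domain_gone is a hypothetical helper):

    import libvirt

    def ensure_domain_gone(conn, instance_name):
        # Sketch: cope with a domain that is running, one that is shut off
        # but still defined (the 'stuck' state mentioned above), or one that
        # has already disappeared.
        try:
            domain = conn.lookupByName(instance_name)
        except libvirt.libvirtError:
            return  # no definition left under this name

        if domain.info()[0] != libvirt.VIR_DOMAIN_SHUTOFF:
            try:
                domain.destroy()  # stop it if it is still running
            except libvirt.libvirtError:
                pass  # it may have stopped on its own in the meantime

        try:
            domain.undefine()  # drop the leftover definition
        except libvirt.libvirtError:
            pass  # already undefined; nothing more to do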

Thierry Carrez (ttx)
Changed in nova:
milestone: none → cactus-rc
importance: Undecided → High
status: New → In Progress
Revision history for this message
Andrey Brindeyev (abrindeyev) wrote :

That patch did not help me with the same error. I patched nova (downloaded the diff from the linked branch and applied it to the bzr 972 revision) and the error is still here:

2011-04-11 12:42:27,730 nova.rpc: MSG_ID is dbe301a61bdd4331b9731e87987beaa9
2011-04-11 12:42:28,258 nova.utils: Running cmd (subprocess): ip link show dev vlan100
2011-04-11 12:42:28,272 nova.utils: Attempting to grab semaphore "ensure_bridge" for method "ensure_bridge"...
2011-04-11 12:42:28,273 nova.utils: Attempting to grab file lock "ensure_bridge" for method "ensure_bridge"...
2011-04-11 12:42:28,274 nova.utils: Running cmd (subprocess): ip link show dev br100
2011-04-11 12:42:28,287 nova.utils: Running cmd (subprocess): sudo route -n
2011-04-11 12:42:28,311 nova.utils: Running cmd (subprocess): sudo ip addr show dev vlan100 scope global
2011-04-11 12:42:28,337 nova.utils: Running cmd (subprocess): sudo brctl addif br100 vlan100
2011-04-11 12:42:28,363 nova.utils: Result was 1
2011-04-11 12:42:28,413 nova.virt.libvirt_conn: instance instance-00000001: starting toXML method
2011-04-11 12:42:28,545 nova.virt.libvirt_conn: instance instance-00000001: finished toXML method
2011-04-11 12:42:28,607 nova: called setup_basic_filtering in nwfilter
2011-04-11 12:42:28,607 nova: ensuring static filters
2011-04-11 12:42:28,727 nova.utils: Attempting to grab semaphore "iptables" for method "apply"...
2011-04-11 12:42:28,728 nova.utils: Attempting to grab file lock "iptables" for method "apply"...
2011-04-11 12:42:28,736 nova.utils: Running cmd (subprocess): sudo iptables-save -t filter
2011-04-11 12:42:28,760 nova.utils: Running cmd (subprocess): sudo iptables-restore
2011-04-11 12:42:28,785 nova.utils: Running cmd (subprocess): sudo iptables-save -t nat
2011-04-11 12:42:28,811 nova.utils: Running cmd (subprocess): sudo iptables-restore
2011-04-11 12:42:28,869 nova.utils: Running cmd (subprocess): mkdir -p /var/lib/nova/instances/instance-00000001/
2011-04-11 12:42:28,888 nova.virt.libvirt_conn: instance instance-00000001: Creating image
2011-04-11 12:42:28,986 nova.utils: Attempting to grab semaphore "73f3cf93" for method "call_if_not_exists"...
2011-04-11 12:42:29,001 nova.utils: Running cmd (subprocess): cp /var/lib/nova/instances/_base/73f3cf93 /var/lib/nova/instances/instance-00000001/kernel
2011-04-11 12:42:29,040 nova.utils: Attempting to grab semaphore "57de2572" for method "call_if_not_exists"...
2011-04-11 12:42:29,055 nova.utils: Running cmd (subprocess): cp /var/lib/nova/instances/_base/57de2572 /var/lib/nova/instances/instance-00000001/ramdisk
2011-04-11 12:42:29,113 nova.utils: Attempting to grab semaphore "58677c0c_sm" for method "call_if_not_exists"...
2011-04-11 12:42:29,380 nova.utils: Running cmd (subprocess): qemu-img create -f qcow2 -o cluster_size=2M,backing_file=/var/lib/nova/instances/_base/58677c0c_sm /var/lib/nova/instances/instance-00000001/disk
2011-04-11 12:42:29,422 nova.virt.libvirt_conn: instance instance-00000001: injecting key into image 1483176972
2011-04-11 12:42:29,438 nova.compute.disk: Mounting disk...
2011-04-11 12:42:34,053 nova.compute.disk: Injecting SSH key...
2011-04-11 12:42:34,800 nova.compute.disk: Deleting guestfs object...
2011-04-11 12:42:36,332 n...

Revision history for this message
justinsb (justin-fathomdb) wrote :

Hi Andrey - sorry about the problem. You're correct: the patch does not address the case where the definition exists _and_ you reset the database, so an instance ID is reused when creating a new machine. I believe that shouldn't happen to production users (because we don't reuse instance IDs unless the DB is reset), and it only affects people who ran the version between the break and the fix anyway. As a workaround, you can run "virsh undefine instance-00000001" to remove the old definition (and do a "virsh list --all" to see if you have instance-00000002 etc). It wouldn't be safe to do that from code for new / unknown domains. However, the problem then shouldn't occur going forward, and shouldn't occur at all if you don't reset your DB.

Thierry Carrez (ttx)
Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
milestone: cactus-rc → 2011.2
status: Fix Committed → Fix Released