VMs paused unbeknownst to nova compute are destroyed

Bug #1097806 reported by Andres Lagar-Cavilla
This bug affects 2 people
Affects                   Status        Importance  Assigned to  Milestone
OpenStack Compute (nova)  Fix Released  Medium      Yun Mao
Folsom                    Fix Released  Medium      Yun Mao

Bug Description

Libvirt-managed qemu/KVM VMs can be paused outside of nova compute's workflow through a variety of means.

* By issuing virsh suspend
* By issuing virsh qemu-monitor-command '{"execute" : "stop"}'
* By causing qemu to emit a STOP event, for example when attaching a GDB debugger and single-stepping
* By connecting through an additional qemu monitor and issuing any commands that may cause qemu to emit a STOP event.

Starting in Folsom (specifically https://github.com/openstack/nova/commit/129b87e17d3333aeaa9e855a70dea51e6581ea63#L6R2502 i.e. commit 129b87e diff line 2502), nova compute will destroy a VM if libvirt reports it as paused and this does not match nova compute's recorded state for the VM.

I surmise the original rationale was to destroy VMs that are paused by IO errors or KVM emulation errors, which also cause qemu to emit STOP events.

The problem is that this will also destroy VMs that are paused for any of the valid reasons outlined above.

The problem is exacerbated by a libvirt bug (https://bugzilla.redhat.com/show_bug.cgi?id=892791) which latches the state of a VM to paused even though the VM is running. The fix is already committed upstream (http://libvirt.org/git/?p=libvirt.git;a=commit;h=aedfcce33e4c2f266668a39fd655574fe34f1265) and we intend for it to reach distros through backports.

Even with libvirt's bug fixed, there are still points in time at which nova-compute will check a VM's state, find it paused for a valid reason, and erroneously decide to destroy it.

The fix is to either remove this behavior, or to further query libvirt for the paused reason, which will show conclusively whether the VM is effectively crashed, or just paused.
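As a rough illustration of the second option, the sketch below classifies a (state, reason) pair such as the one libvirt's Python binding returns from virDomain.state(). The function name and the choice of which pause reasons count as failures are assumptions made for this example, not nova's or libvirt's policy. The numeric constants are intended to mirror libvirt's virDomainState and virDomainPausedReason enums; real code should take them from the libvirt module rather than hard-coding them.

```python
# Values intended to mirror libvirt's virDomainState / virDomainPausedReason
# enums (assumption: real code should use libvirt.VIR_DOMAIN_* instead).
VIR_DOMAIN_PAUSED = 3
VIR_DOMAIN_PAUSED_USER = 1       # paused on request (e.g. virsh suspend)
VIR_DOMAIN_PAUSED_IOERROR = 5    # paused by qemu after a disk I/O error
VIR_DOMAIN_PAUSED_WATCHDOG = 6   # paused by a watchdog event
VIR_DOMAIN_PAUSED_CRASHED = 10   # paused because the guest crashed

# Hypothetical policy: only these reasons are treated as "effectively crashed".
FAILURE_PAUSE_REASONS = frozenset({
    VIR_DOMAIN_PAUSED_IOERROR,
    VIR_DOMAIN_PAUSED_WATCHDOG,
    VIR_DOMAIN_PAUSED_CRASHED,
})


def paused_due_to_failure(state, reason):
    """Return True only if the domain is paused for a failure-like reason."""
    return state == VIR_DOMAIN_PAUSED and reason in FAILURE_PAUSE_REASONS
```

With a live connection, the pair would come from something like `state, reason = dom.state()`. A VM paused by the user or by a debugger would then be left alone, while one paused by an I/O error could still be cleaned up.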

Yun Mao (yunmao)
Changed in nova:
status: New → Triaged
importance: Undecided → Medium
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/19467

Changed in nova:
assignee: nobody → Yun Mao (yunmao)
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/19467
Committed: http://github.com/openstack/nova/commit/f7fbdeb5672bae7d3bffd6fa76de1ce81fc132bf
Submitter: Jenkins
Branch: master

commit f7fbdeb5672bae7d3bffd6fa76de1ce81fc132bf
Author: Yun Mao <email address hidden>
Date: Fri Jan 11 11:59:23 2013 -0500

    Fix state sync logic related to the PAUSED VM state

    A VM may get into the paused state not only because the user request
    via API calls, but also due to (temporary) external instrumentations.
    Before the virt layer can reliably report the reason, we simply ignore
    the state discrepancy. In many cases, the VM state will go back to
    running after the external instrumentation is done.

    Fix bug 1097806.

    Change-Id: I8edef45d60fa79d6ddebf7d0438042a7b3986b55
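
The behavior described in this commit message can be sketched roughly as follows. This is not the code from the commit: the enum, the function name, and the returned action strings are all hypothetical, chosen only to show a state sync that tolerates an unexpected PAUSED report for an instance nova believes is active.

```python
from enum import Enum


class PowerState(Enum):
    """Hypothetical power states as reported by the virt layer."""
    RUNNING = "running"
    PAUSED = "paused"
    SHUTDOWN = "shutdown"
    CRASHED = "crashed"


def action_for_active_instance(hypervisor_state):
    """Decide what to do when the database says the instance is active."""
    if hypervisor_state in (PowerState.SHUTDOWN, PowerState.CRASHED):
        # The VM is really gone, so reconcile the recorded state.
        return "stop"
    if hypervisor_state is PowerState.PAUSED:
        # May be temporary external instrumentation; leave the VM alone.
        return "ignore"
    return "none"
```

Under this sketch, a VM paused by a debugger or by virsh suspend survives the periodic sync and is expected to return to running on its own.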

Changed in nova:
status: In Progress → Fix Committed
tags: added: folsom-backport-potential
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/folsom)

Fix proposed to branch: stable/folsom
Review: https://review.openstack.org/20337

tags: removed: folsom-backport-potential
Thierry Carrez (ttx)
Changed in nova:
milestone: none → grizzly-3
status: Fix Committed → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/folsom)

Reviewed: https://review.openstack.org/20337
Committed: http://github.com/openstack/nova/commit/7ace55fcf9e1b7fea074f6c0331b6feafbbc4178
Submitter: Jenkins
Branch: stable/folsom

commit 7ace55fcf9e1b7fea074f6c0331b6feafbbc4178
Author: Yun Mao <email address hidden>
Date: Fri Jan 11 11:59:23 2013 -0500

    Fix state sync logic related to the PAUSED VM state

    A VM may get into the paused state not only because the user request
    via API calls, but also due to (temporary) external instrumentations.
    Before the virt layer can reliably report the reason, we simply ignore
    the state discrepancy. In many cases, the VM state will go back to
    running after the external instrumentation is done.

    Fix bug 1097806.

    Change-Id: I8edef45d60fa79d6ddebf7d0438042a7b3986b55
    (cherry picked from commit f7fbdeb5672bae7d3bffd6fa76de1ce81fc132bf)

Thierry Carrez (ttx)
Changed in nova:
milestone: grizzly-3 → 2013.1