Locate and clean up orphaned VDIs in XenServer

Bug #809614 reported by Antony Messerli
This bug affects 1 person
Affects                   Status        Importance  Assigned to   Milestone
OpenStack Compute (nova)  Fix Released  High        Josh Kearney

Bug Description

On instance provision, if an exception is generated which stops the build, and then the failed build is deleted, the files pulled from Glance are not cleaned up at all.

Over time, this can stack up and become a very large problem, since there are a lot of junk disks. From what I observed, the disk image is brought down to the host machine and has been scanned into the SR, as there is a VDI record. It's just not removed. We should also add the instance id to the VDI name-description so that we can track which VDIs are associated with which instances. At this point, there's no good way to track and clean up this cruft from the failed builds.

For example:

uuid ( RO)                : 205b5447-e87a-46a5-8f4a-bbc7e8434677
          name-label ( RW): 0
    name-description ( RW):
             sr-uuid ( RO): 65ccc1a6-335d-92fe-72df-1b09a8f483a6
        virtual-size ( RO): 20401094656
            sharable ( RO): false
           read-only ( RO): false
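
The suggestion to record the instance id in name-description could look roughly like the sketch below. The `instance-id=` format and both function names are hypothetical and chosen only for illustration; they are not taken from nova.

```python
import re

# Hypothetical convention (not from nova): store "instance-id=<id>"
# in the VDI name-description so a VDI can be traced to its instance.
_PREFIX = "instance-id="


def build_description(instance_id):
    """Return a name-description string that records the owning instance."""
    return "%s%s" % (_PREFIX, instance_id)


def parse_instance_id(name_description):
    """Return the instance id found in a name-description, or None."""
    match = re.search(r"instance-id=(\S+)", name_description or "")
    return match.group(1) if match else None
```

With a convention like this, a VDI whose name-description is empty (as in the record above) would parse to `None`, which marks it as untracked.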

For reference, I'm running rev 1265. The exception I ran into (in nova) was: TRACE: RemoteError: FixedIpNotFoundForInstance Instance 1 has zero fixed ips. This happened because I had not yet added IPs to the DB.

Brian Lamar (blamar) wrote :

The way I see this we have a couple of options:

1) Clean up all orphaned disks periodically.
2) Provide Admin API calls to list orphaned disks and delete said disks.
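
Either option needs a way to decide which disks count as orphaned. A minimal sketch, assuming "orphaned" means a VDI with no VBDs attached, and assuming the input is a dict of VDI records shaped like the XenAPI VDI record (with a `VBDs` list). The function name and sample data are hypothetical, and a real tool would also need to exclude legitimately detached disks.

```python
def find_unattached_vdis(vdi_records):
    """Return the refs of VDIs that have no VBDs attached, sorted.

    vdi_records maps an opaque VDI ref to a record dict; only the
    "VBDs" field is inspected. "Unattached" is used here as a stand-in
    for "orphaned", which is an assumption rather than nova's definition.
    """
    return sorted(
        ref for ref, record in vdi_records.items()
        if not record.get("VBDs")
    )


# Made-up sample data for illustration only.
sample = {
    "OpaqueRef:a": {"uuid": "uuid-1", "VBDs": []},
    "OpaqueRef:b": {"uuid": "uuid-2", "VBDs": ["OpaqueRef:vbd-1"]},
}
candidates = find_unattached_vdis(sample)
```

Listing candidates separately from deleting them keeps the destructive step under the operator's control, whichever option is chosen.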

I'm a fan of providing this via the Admin API because then any operational team can decide what to do with that information. We don't have an admin API client, though (something we need?).

I don't love the periodic task strategy (it feels like bailing out water when you could be finding the leak), but it might be prudent to make a task now and then file a blueprint for converting it to the admin API, with an admin API client that could be run in a cron job.

Salvatore Orlando (salvatore-orlando) wrote :

At first glance this looks like a duplicate of bug #723301, which addresses the issue of removing VDIs and kernel/ramdisk files when instances fail to spawn.

Brian Lamar (blamar) wrote :

I'm not certain why I didn't think about this before, but the cleanup method you used in your fix won't work in the situation where the cause of the VM spawn failure is loss of connectivity with the hypervisor. I was testing this by killing XenAPI on the hypervisor, and while this is a more specific and rare case, it might be worth looking at.

Salvatore Orlando (salvatore-orlando) wrote :

You're right.
I'm not a great fan of periodic tasks either; ideally, an operational admin API would allow clients such as the dashboard to perform maintenance operations on hypervisors. However, I'm not sure whether there is a place in the OS API for this kind of operation.

At least for the xenapi backend, it would also be worth tagging VDIs created by nova, maybe with a parameter in other-config. This way the cleanup operation, whichever way it is implemented, will remove only orphaned disks created by nova (there might be some orphaned VDIs on the SR used for other purposes).
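
The tagging idea can be sketched as a filter over VDI records. The key name `nova_managed`, the value `"true"`, and the function names are hypothetical; the sketch only assumes that other-config is a string-to-string map and that a record lists its attached VBDs.

```python
# Hypothetical other-config key; the thread does not name one.
NOVA_TAG_KEY = "nova_managed"


def tag_as_nova(other_config):
    """Return a copy of other_config with the nova marker added."""
    tagged = dict(other_config)
    tagged[NOVA_TAG_KEY] = "true"
    return tagged


def is_removable(vdi_record):
    """True only for a nova-tagged VDI that has no VBDs attached.

    Untagged VDIs are never reported, so disks kept on the SR for
    other purposes are left alone.
    """
    other_config = vdi_record.get("other_config", {})
    tagged = other_config.get(NOVA_TAG_KEY) == "true"
    unattached = not vdi_record.get("VBDs")
    return tagged and unattached
```

Requiring both conditions means a cleanup pass would skip an unattached disk that nova did not create, which is the case the comment above is concerned about.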

Thierry Carrez (ttx) wrote :

Also related to bug 814561

Changed in nova:
importance: Undecided → Medium
status: New → Confirmed
Brian Waldon (bcwaldon)
Changed in nova:
assignee: nobody → Brian Waldon (bcwaldon)
Brian Waldon (bcwaldon) wrote :

There is already error handling in place for cleaning up VDIs on spawn failures. This feels more like a side-effect of another bug that should be treated separately. I'd vote for invalidating this.

Brian Waldon (bcwaldon)
Changed in nova:
assignee: Brian Waldon (bcwaldon) → nobody
Josh Kearney (jk0)
Changed in nova:
status: Confirmed → In Progress
assignee: nobody → Josh Kearney (jk0)
Josh Kearney (jk0)
Changed in nova:
importance: Medium → Critical
importance: Critical → High
summary: - Disk Clean up on Build Failure in XenServer
+ Locate and clean up orphaned VDIs in XenServer
Openstack Gerrit (openstack-gerrit) wrote : A change has been merged to openstack/nova

Reviewed: https://review.openstack.org/693
Committed: http://github.com/openstack/nova/commit/04548b067c7c79602332fe2bc2dc89ed77cee7ac
Submitter: Jenkins
Branch: master

 status fixcommitted
 done

commit 04548b067c7c79602332fe2bc2dc89ed77cee7ac
Author: Josh Kearney <email address hidden>
Date: Tue Sep 27 15:21:42 2011 -0500

    Adds a script that can automatically delete orphaned VDIs. Also had to move some flags around to avoid circular imports.

    Fixes bug 809614.

    Change-Id: I635f7eef9ede45bee1ee4a62a3882b55d4222ee3

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
milestone: none → essex-1
Thierry Carrez (ttx)
Changed in nova:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: essex-1 → 2012.1