OpenStack Compute (nova)

Deleting a server will cause temporary 404 from GET /servers

Bug #885267 reported by Brian Waldon on 2011-11-02

This bug affects 7 people

Affects:	OpenStack Compute (nova)
Status:	Fix Released
Importance:	Medium
Assigned to:	Dean Troyer
Milestone:	OpenStack Compute (nova) 2012.1 "essex"

Bug Description

I ran 'nova list' (GET /servers) while server '443c3600-7724-4450-9b21-47ffa8544ad3' was being deleted. Instead of a list that simply omitted the server, I got a 404 back. This should not happen.

Thierry Carrez (ttx) on 2011-11-08

Changed in nova:
importance:	Undecided → Medium
status:	New → Confirmed

Jesse Andrews (anotherjesse) wrote on 2011-12-30:

I have run into this as well. Here is my traceback:

    File "/opt/python-novaclient/novaclient/v1_1/servers.py", line 247, in list
      return self._list("/servers%s%s" % (detail, query_string), "servers")
    File "/opt/python-novaclient/novaclient/base.py", line 69, in _list
      resp, body = self.api.client.get(url)
    File "/opt/python-novaclient/novaclient/client.py", line 131, in get
      return self._cs_request(url, 'GET', **kwargs)
    File "/opt/python-novaclient/novaclient/client.py", line 119, in _cs_request
      **kwargs)
    File "/opt/python-novaclient/novaclient/client.py", line 102, in request
      raise exceptions.from_response(resp, body)

I have a script that creates a VM, waits for it to launch, then deletes it and verifies that the deletion occurs. It fails a significant fraction of the time because of this exception.

Jesse Andrews (anotherjesse) wrote on 2011-12-30:

min.py (2.1 KiB, text/x-python)

My script that triggers this.

It fails on:

if not any([s.id == server_id for s in nc.servers.list()]):
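
Until the server-side race is fixed, a script like this can work around it by treating a 404 from the list call as transient. The following is a minimal sketch, not the attached min.py: `wait_for_delete`, `list_servers`, and the `NotFound` class are hypothetical names introduced here, with `NotFound` standing in for whatever exception the client raises on an HTTP 404.

```python
import time


class NotFound(Exception):
    """Stand-in for the client's HTTP 404 exception (assumption)."""


def wait_for_delete(list_servers, server_id, attempts=10, delay=0.0):
    """Poll until server_id no longer appears in the server listing.

    list_servers is any callable returning objects with an `id` attribute.
    A NotFound raised by the listing call itself is treated as a transient
    symptom of the race and retried instead of aborting the script.
    """
    for _ in range(attempts):
        try:
            servers = list_servers()
        except NotFound:
            # The listing raced with the delete; try again.
            time.sleep(delay)
            continue
        if not any(s.id == server_id for s in servers):
            return True
        time.sleep(delay)
    return False
```

With a real client, the callable would be something like `nc.servers.list` from the script above, and the caught exception would be the client's own 404 exception class.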

Dean Troyer (dtroyer) on 2012-01-05

Changed in nova:
assignee:	nobody → Dean Troyer (dtroyer)

Dean Troyer (dtroyer) wrote on 2012-01-05:

I see this in the logs a bit before the HTTP 404:

2012-01-04 22:40:00,376 DEBUG nova.api.openstack.common [1ca75227-460c-4cca-93bc-1e9d70a571c7 demo 2] Generated ACTIVE from vm_state=active task_state=deleting. from (pid=22694) status_from_state /opt/stack/nova/nova/api/openstack/common.py:93

Is the ACTIVE status here causing this indirectly?
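
One way a log line like this can arise is a two-level lookup that falls back to a per-vm_state default when the task_state has no entry of its own. The sketch below only illustrates that pattern; the table contents and the function body are assumptions made for illustration and are not copied from nova's common.py.

```python
# Hypothetical mapping: vm_state -> {task_state -> API status}.
# The entries are illustrative only and are NOT nova's actual table.
STATE_MAP = {
    "active": {"default": "ACTIVE", "rebooting": "REBOOT"},
    "building": {"default": "BUILD"},
}


def status_from_state(vm_state, task_state="default"):
    """Map (vm_state, task_state) to a status string with a default fallback."""
    task_map = STATE_MAP.get(vm_state, {"default": "UNKNOWN"})
    # A task_state without its own entry (here: "deleting") silently
    # falls back to the vm_state's default status.
    return task_map.get(task_state, task_map["default"])


print(status_from_state("active", "deleting"))
```

Under this assumed table, an instance with vm_state=active and task_state=deleting is still reported as ACTIVE, simply because no entry covers the deleting task.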

Anthony Young (sleepsonthefloor) wrote on 2012-01-05:

Dean - I was thinking the bug was somewhere in the detail server code (perhaps where you cite), which iterates over each of the servers and grabs a whole bunch of extra info about the instance, but I couldn't isolate any specific problem area (I was looking for an errant join). In the meantime, some extra logging could help here: https://github.com/openstack/nova/blob/master/nova/api/openstack/v2/servers.py#L71

Dean Troyer (dtroyer) wrote on 2012-01-06:

I finally found the source of the exception: https://github.com/openstack/nova/blob/master/nova/api/openstack/v2/contrib/extended_status.py#L69. Extensions are still new to me, so the flow here isn't obvious, but it seems to explain why every /servers/detail API call appears duplicated in the logs.

I'm still not certain about the status=ACTIVE when task_state=deleting, but I don't think that is in play here.

Dean Troyer (dtroyer) wrote on 2012-01-06:

The race condition we see here is between the original call to compute.api.get_all() and when extended_status _get_and_extend_all() gets around to looping through the server list to call compute.api.routing_get() for each one. I think the Right Thing here would be to remove the server from body['servers'], log a warning and continue. I'm letting the dev cluster test this overnight.
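
The approach described above (drop the vanished server from body['servers'], log a warning, continue) can be sketched as follows. This is a simplified illustration, not the merged patch: `extend_all`, `get_instance`, the `NotFound` class, and the plain `vm_state`/`task_state` keys are hypothetical stand-ins for the extension's loop, compute.api.routing_get(), and nova's not-found exception.

```python
import logging

LOG = logging.getLogger(__name__)


class NotFound(Exception):
    """Stand-in for the exception raised when an instance is gone (assumption)."""


def extend_all(body, get_instance):
    """Add extended status to each listed server, dropping vanished ones.

    body: dict with a 'servers' list of dicts, each holding an 'id'.
    get_instance: callable(server_id) -> dict; raises NotFound if the
        instance was deleted after the initial listing was built.
    """
    # Iterate over a copy so that removing entries from the list is safe.
    for server in list(body["servers"]):
        try:
            instance = get_instance(server["id"])
        except NotFound:
            LOG.warning("Instance %s disappeared during listing; skipping",
                        server["id"])
            body["servers"].remove(server)
            continue
        server["vm_state"] = instance.get("vm_state")
        server["task_state"] = instance.get("task_state")
    return body
```

The key point is that a per-server lookup failure is handled inside the loop, so one deleted instance no longer turns the whole GET /servers response into a 404.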

Dean Troyer (dtroyer) on 2012-01-06

Changed in nova:
status:	Confirmed → In Progress

OpenStack Infra (hudson-openstack) wrote on 2012-01-06: Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/2874

OpenStack Infra (hudson-openstack) wrote on 2012-01-09: Fix merged to nova (master)

Reviewed: https://review.openstack.org/2874
Committed: http://github.com/openstack/nova/commit/51c0d545253b9f5618d1923aea3f7061da6cd60b
Submitter: Jenkins
Branch: master

commit 51c0d545253b9f5618d1923aea3f7061da6cd60b
Author: Dean Troyer <email address hidden>
Date: Fri Jan 6 00:22:52 2012 -0600

Bug 885267: Fix GET /servers during instance delete

    There is a period during an instance delete when GET /servers
    will fail occasionally. The race condition is during GET /servers
    between the initial get_all() and when the extended_status extension
    re-retrieves individual servers via compute.api.routing_get().
    We log a warning and remove the offending server from the list
    as it no longer exists.

Change-Id: Id75723a21c0d6dc20f446560847e5b8522ec3262