qpid timeout causing compute service to crash

Bug #999698 reported by Derek Higgins
This bug affects 3 people
Affects                    Status        Importance  Assigned to     Milestone
OpenStack Compute (nova)   Fix Released  High        Russell Bryant
  Essex                    Fix Released  Undecided   Pádraig Brady
nova (Ubuntu)              Fix Released  Undecided   Unassigned
  Precise                  Fix Released  Undecided   Unassigned

Bug Description

Following a service restart, the exception below was output to the compute log file and the service crashed.

Once this occurs, the same exception is raised every time I start the compute service.

2012-05-15 11:48:57 AUDIT nova.service [-] Starting compute node (version 2012.1-LOCALBRANCH:LOCALREVISION)
2012-05-15 11:48:57 WARNING nova.utils [req-1cdae63a-8885-4711-a39d-2416a1f6477a None None] /usr/lib64/python2.6/site-packages/SQLAlchemy-0.7.3-py2.6-linux-x86_64.egg/sqlalchemy/pool.py:639: SADeprecationWarning: The 'listeners' argument to Pool (and create_engine()) is deprecated. Use event.listen().
  Pool.__init__(self, creator, **kw)

2012-05-15 11:48:57 WARNING nova.utils [req-1cdae63a-8885-4711-a39d-2416a1f6477a None None] /usr/lib64/python2.6/site-packages/SQLAlchemy-0.7.3-py2.6-linux-x86_64.egg/sqlalchemy/pool.py:145: SADeprecationWarning: Pool.add_listener is deprecated. Use event.listen()
  self.add_listener(l)

2012-05-15 11:48:58 INFO nova.rpc.impl_qpid [req-1cdae63a-8885-4711-a39d-2416a1f6477a None None] Connected to AMQP server on localhost:5672
2012-05-15 11:49:58 ERROR nova.rpc.impl_qpid [req-1cdae63a-8885-4711-a39d-2416a1f6477a None None] Timed out waiting for RPC response: None
2012-05-15 11:49:58 TRACE nova.rpc.impl_qpid Traceback (most recent call last):
2012-05-15 11:49:58 TRACE nova.rpc.impl_qpid File "/usr/lib/python2.6/site-packages/nova/rpc/impl_qpid.py", line 359, in ensure
2012-05-15 11:49:58 TRACE nova.rpc.impl_qpid return method(*args, **kwargs)
2012-05-15 11:49:58 TRACE nova.rpc.impl_qpid File "/usr/lib/python2.6/site-packages/nova/rpc/impl_qpid.py", line 408, in _consume
2012-05-15 11:49:58 TRACE nova.rpc.impl_qpid nxt_receiver = self.session.next_receiver(timeout=timeout)
2012-05-15 11:49:58 TRACE nova.rpc.impl_qpid File "<string>", line 6, in next_receiver
2012-05-15 11:49:58 TRACE nova.rpc.impl_qpid File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 651, in next_receiver
2012-05-15 11:49:58 TRACE nova.rpc.impl_qpid raise Empty
2012-05-15 11:49:58 TRACE nova.rpc.impl_qpid Empty: None
2012-05-15 11:49:58 TRACE nova.rpc.impl_qpid
2012-05-15 11:49:58 CRITICAL nova [-] Timeout while waiting on RPC response.

Revision history for this message
Derek Higgins (derekh) wrote :

Some version information:

qpid-cpp-server-0.14-16.el6.x86_64
qpid-cpp-client-0.14-16.el6.x86_64
qpid-qmf-0.14-7.el6_2.x86_64
qpid-tools-0.14-2.el6_2.noarch
python-qpid-0.14-8.el6.noarch
python-qpid-qmf-0.14-7.el6_2.x86_64

Running on RHEL 6.3 Beta

Revision history for this message
Derek Higgins (derekh) wrote :

Restarting qpidd doesn't help the situation

Revision history for this message
Derek Higgins (derekh) wrote :

Some more information that may be relevant

I noticed nova-compute was subscribing to a queue with a random hex UUID name, e.g. 93c92020b8bc4dab88c166a615002f0b

I was restarting all services together
for srv in api cert compute network objectstore scheduler volume ; do sudo service openstack-nova-$srv restart ; done

nova-compute started working successfully when I restarted it by itself, and it is now subscribed to
compute and compute_fanout_53de51e809cf4c99a390903bbda3d6a0

affects: keystone → nova
Changed in nova:
assignee: nobody → Russell Bryant (russellb)
status: New → Confirmed
importance: Undecided → High
Revision history for this message
Derek Higgins (derekh) wrote :

I can now reproduce this by starting an instance
and then restarting all nova services together.

Starting nova-compute by itself afterwards fixes the problem.

With verbose enabled

CRITICAL nova [-] Timeout while waiting on RPC response.
TRACE nova Traceback (most recent call last):
TRACE nova File "/usr/bin/nova-compute", line 49, in <module>
TRACE nova service.wait()
TRACE nova File "/usr/lib/python2.6/site-packages/nova/service.py", line 413, in wait
TRACE nova _launcher.wait()
TRACE nova File "/usr/lib/python2.6/site-packages/nova/service.py", line 131, in wait
TRACE nova service.wait()
TRACE nova File "/usr/lib/python2.6/site-packages/eventlet/greenthread.py", line 166, in wait
TRACE nova return self._exit_event.wait()
TRACE nova File "/usr/lib/python2.6/site-packages/eventlet/event.py", line 116, in wait
TRACE nova return hubs.get_hub().switch()
TRACE nova File "/usr/lib/python2.6/site-packages/eventlet/hubs/hub.py", line 177, in switch
TRACE nova return self.greenlet.switch()
TRACE nova File "/usr/lib/python2.6/site-packages/eventlet/greenthread.py", line 192, in main
TRACE nova result = function(*args, **kwargs)
TRACE nova File "/usr/lib/python2.6/site-packages/nova/service.py", line 101, in run_server
TRACE nova server.start()
TRACE nova File "/usr/lib/python2.6/site-packages/nova/service.py", line 162, in start
TRACE nova self.manager.init_host()
TRACE nova File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 251, in init_host
TRACE nova net_info = self._get_instance_nw_info(context, instance)
TRACE nova File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 313, in _get_instance_nw_info
TRACE nova instance)
TRACE nova File "/usr/lib/python2.6/site-packages/nova/network/api.py", line 219, in get_instance_nw_info
TRACE nova 'args': args})
TRACE nova File "/usr/lib/python2.6/site-packages/nova/rpc/__init__.py", line 68, in call
TRACE nova return _get_impl().call(context, topic, msg, timeout)
TRACE nova File "/usr/lib/python2.6/site-packages/nova/rpc/impl_qpid.py", line 521, in call
TRACE nova return rpc_amqp.call(context, topic, msg, timeout, Connection.pool)
TRACE nova File "/usr/lib/python2.6/site-packages/nova/rpc/amqp.py", line 338, in call
TRACE nova rv = list(rv)
TRACE nova File "/usr/lib/python2.6/site-packages/nova/rpc/amqp.py", line 299, in __iter__
TRACE nova self._iterator.next()
TRACE nova File "/usr/lib/python2.6/site-packages/nova/rpc/impl_qpid.py", line 417, in iterconsume
TRACE nova yield self.ensure(_error_callback, _consume)
TRACE nova File "/usr/lib/python2.6/site-packages/nova/rpc/impl_qpid.py", line 363, in ensure
TRACE nova error_callback(e)
TRACE nova File "/usr/lib/python2.6/site-packages/nova/rpc/impl_qpid.py", line 402, in _error_callback
TRACE nova raise rpc_common.Timeout()
TRACE nova Timeout: Timeout while waiting on RPC response.
TRACE nova

Revision history for this message
Russell Bryant (russellb) wrote :

Can you show me the exact command(s) you are using to restart all services at the same time? It's starting to sound like:

1) When nova-compute gets restarted, it's making a call out to nova-network.

2) nova-network is getting restarted before this call is completed, resulting in the timeout error that occurred.

3) This code path in nova-compute is not properly handling the timeout exception, resulting in the service stopping.

So if it's just a bash for loop restarting all of the services, try restarting network before compute.
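
For illustration, here is a minimal, self-contained sketch (not nova code) of the failure mode outlined in points 1-3 above; the Timeout class and the queue-based stand-in for the RPC layer are assumptions made purely for this example.

try:
    import queue             # Python 3
except ImportError:
    import Queue as queue    # Python 2, matching the environment in the traceback


class Timeout(Exception):
    """Stand-in for the RPC timeout exception seen in the logs above."""


def rpc_call(reply_queue, timeout):
    # Wait for a reply from the peer service. If the peer never saw the
    # request (it was restarted mid-request, or its queue did not exist
    # yet), no reply ever arrives and we raise after `timeout` seconds.
    try:
        return reply_queue.get(timeout=timeout)
    except queue.Empty:
        raise Timeout("Timeout while waiting on RPC response.")


def init_host(reply_queue):
    # Rough analogue of nova-compute asking nova-network for network info
    # during startup. Nothing on this path catches Timeout, so it unwinds
    # through service startup and the whole process exits (point 3 above).
    return rpc_call(reply_queue, timeout=1)


if __name__ == "__main__":
    init_host(queue.Queue())   # raises Timeout; the "service" crashes

Catching the timeout at this point, or avoiding the startup RPC call entirely, are the two obvious ways to keep the service alive; the fix that was eventually merged (see below) takes the second route.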

Revision history for this message
Derek Higgins (derekh) wrote :

I was using a bash loop to restart the services:
for srv in api cert compute network objectstore scheduler volume ; do sudo service openstack-nova-$srv restart ; done

Changing the order didn't help:
for srv in api cert network objectstore scheduler volume compute ; do sudo service openstack-nova-$srv restart ; done

It does work if I add a sleep between each service restart:
for srv in api cert network objectstore scheduler volume compute ; do sudo service openstack-nova-$srv restart ; sleep 5 ; done

Revision history for this message
Russell Bryant (russellb) wrote :

Ok, thanks. I'm still thinking that what's happening is that nova-network is reading the message off the queue before it restarts, even when the order is changed in your loop. The service restarts must be happening asynchronously.

Regarding the service stopping part ... this error is happening in service startup code, so I think it's pretty much just an expected failure if nova-compute can't talk to nova-network during startup.

There is a patch on gerrit right now that may fix this scenario.

https://review.openstack.org/#/c/6694/

I haven't reviewed the patch yet, but based on the description, it sounds like it should make it so nova-network can't restart while it's in the middle of processing the request from nova-compute. Mind giving that patch a try?

Revision history for this message
Steven Dake (sdake) wrote :

When running in a loop like that, systemd does not guarantee that the process has actually finished starting up. systemd checks for the pidfile to determine when to start dependent processes. The proper solution would be for the systemd developers to fix the problem, but it is unclear whether they intend to, as they consider the pid lockfile consistent with the daemon being in a runnable state. The proper semantics of the lock file should be "the daemon has started; no new ones can start now". This problem now pervades all of Fedora's init scripts for dependent services.

Revision history for this message
Russell Bryant (russellb) wrote :

Related: https://lists.launchpad.net/openstack/msg12443.html

My previous suggestions on this bug were around the case that nova-network is stopped in the middle of processing a message. There is a much more likely and common case, as well as a couple of possible solutions, that I describe in the linked post above.

Revision history for this message
Russell Bryant (russellb) wrote :

Can you give this patch a try and see if it makes the problem go away? It's a bit of a hack, but it will help confirm that the issue is just nova-network not being up yet, meaning that the queue the request is destined for doesn't exist yet.

Revision history for this message
Derek Higgins (derekh) wrote :

I've applied the above patch to Essex and it looks like it is working.

With the patch applied, when I restart all services together I get a timeout traceback in the nova-compute log, but the service stays running and continues with its startup sequence.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/7961

Changed in nova:
status: Confirmed → In Progress
Revision history for this message
Russell Bryant (russellb) wrote :

Derek, I have written a new version of this patch based on discussion on the bug. Can you verify that this still resolves the problem for you? I ran it through the unit tests and the functional tests on smokestack, but I'm not sure that any of those are hitting this code (restarting nova-compute with instances running). I'll do some more targeted testing on it tomorrow.

Revision history for this message
Russell Bryant (russellb) wrote :

I was able to test restarting nova-compute with an instance running and it seems to work. I'll submit the patch.

Jay Pipes (jaypipes)
Changed in nova:
milestone: none → folsom-2
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/7961
Committed: http://github.com/openstack/nova/commit/8db54f3bd590e71c6c6e383c928aa82fc28b3379
Submitter: Jenkins
Branch: master

commit 8db54f3bd590e71c6c6e383c928aa82fc28b3379
Author: Russell Bryant <email address hidden>
Date: Tue May 29 16:04:16 2012 -0400

    Don't query nova-network on startup.

    Fix bug 999698.

    nova-compute requested network info for each instance on startup via rpc.
    If all services get (re)started at the same time, nova-network may not
    be available to take this request, resulting in a lost request. To
    combat this issue, get the network info from the cache in the database
    on startup. If by some chance this information is not correct, it will
    get fixed up by a periodic task.

    Change-Id: I0bbd475e078ac2a67c99c2be4711e86d617c609a
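
As a self-contained illustration (not the actual patch) of the approach described in this commit message, the sketch below reads cached network info at startup instead of making an RPC round-trip to nova-network. The cache structure and field names are assumptions made for the example.

import json

# Stand-in for the per-instance network info cache kept in the nova
# database; here it is simply a dict keyed by instance UUID.
FAKE_INFO_CACHE = {
    "example-instance-uuid": json.dumps(
        [{"network": {"label": "private"}, "address": "10.0.0.2"}]),
}


def get_nw_info_on_startup(instance_uuid):
    # No RPC call to nova-network, so a mass restart in which nova-network
    # is not yet up can no longer lose the request. If the cached data is
    # stale, a periodic task can refresh it later, as the commit notes.
    cached = FAKE_INFO_CACHE.get(instance_uuid)
    return json.loads(cached) if cached else []


if __name__ == "__main__":
    print(get_nw_info_on_startup("example-instance-uuid"))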

Changed in nova:
status: In Progress → Fix Committed
Revision history for this message
Pádraig Brady (p-draigbrady) wrote :

Is this appropriate for essex?
Derek reported it works there.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/essex)

Fix proposed to branch: stable/essex
Review: https://review.openstack.org/8393

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/essex)

Reviewed: https://review.openstack.org/8393
Committed: http://github.com/openstack/nova/commit/d89c2f309221afb8b25cb0afe5291d432f033076
Submitter: Jenkins
Branch: stable/essex

commit d89c2f309221afb8b25cb0afe5291d432f033076
Author: Russell Bryant <email address hidden>
Date: Mon Jun 11 14:07:19 2012 +0100

    Don't query nova-network on startup.

    Backport from commit 8db54f3.
    Fix bug 999698.

    nova-compute requested network info for each instance on startup via rpc.
    If all services get (re)started at the same time, nova-network may not
    be available to take this request, resulting in a lost request. To
    combat this issue, get the network info from the cache in the database
    on startup. If by some chance this information is not correct, it will
    get fixed up by a periodic task.

    Change-Id: Ifb0634e87770f565e4ab36a54f6e9e19e5f31632

Thierry Carrez (ttx)
Changed in nova:
status: Fix Committed → Fix Released
Dave Walker (davewalker)
Changed in nova (Ubuntu):
status: New → Fix Released
Changed in nova (Ubuntu Precise):
status: New → Confirmed
Revision history for this message
Adam Gandelman (gandelman-a) wrote : Verification report.

Please find the attached test log from the Ubuntu Server Team's CI infrastructure. As part of the verification process for this bug, Nova has been deployed and configured across multiple nodes using precise-proposed as an installation source. After successful bring-up and configuration of the cluster, a number of exercises and smoke tests have been invoked to ensure the updated package did not introduce any regressions. A number of test iterations were carried out to catch any possible transient errors.

Please note the list of installed packages at the top and bottom of the report.

For records of upstream test coverage of this update, please see the Jenkins links in the comments of the relevant upstream code-review(s):

Trunk review: https://review.openstack.org/7961
Stable review: https://review.openstack.org/8393

As per the provisional Micro Release Exception granted to this package by the Technical Board, we hope this contributes toward verification of this update.

Revision history for this message
Adam Gandelman (gandelman-a) wrote :

Test coverage log.

tags: added: verification-done
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package nova - 2012.1.3+stable-20120827-4d2a4afe-0ubuntu1

---------------
nova (2012.1.3+stable-20120827-4d2a4afe-0ubuntu1) precise-proposed; urgency=low

  * New upstream snapshot, fixes FTBFS in -proposed. (LP: #1041120)
  * Resynchronize with stable/essex (4d2a4afe):
    - [5d63601] Inappropriate exception handling on kvm live/block migration
      (LP: #917615)
    - [ae280ca] Deleted floating ips can cause instance delete to fail
      (LP: #1038266)

nova (2012.1.3+stable-20120824-86fb7362-0ubuntu1) precise-proposed; urgency=low

  * New upstream snapshot. (LP: #1041120)
  * Dropped, superseded by new snapshot:
    - debian/patches/CVE-2012-3447.patch: [d9577ce]
    - debian/patches/CVE-2012-3371.patch: [25f5bd3]
    - debian/patches/CVE-2012-3360+3361.patch: [b0feaff]
  * Resynchronize with stable/essex (86fb7362):
    - [86fb736] Libvirt driver reports incorrect error when volume-detach fails
      (LP: #1029463)
    - [272b98d] nova delete lxc-instance umounts the wrong rootfs (LP: #971621)
    - [09217ab] Block storage connections are NOT restored on system reboot
      (LP: #1036902)
    - [d9577ce] CVE-2012-3361 not fully addressed (LP: #1031311)
    - [e8ef050] pycrypto is unused and the existing code is potentially insecure
      to use (LP: #1033178)
    - [3b4ac31] cannot umount guestfs (LP: #1013689)
    - [f8255f3] qpid_heartbeat setting in ineffective (LP: #1030430)
    - [413c641] Deallocation of fixed IP occurs before security group refresh
      leading to potential security issue in error / race conditions
      (LP: #1021352)
    - [219c5ca] Race condition in network/deallocate_for_instance() leads to
      security issue (LP: #1021340)
    - [f2bc403] cleanup_file_locks does not remove stale sentinel files
      (LP: #1018586)
    - [4c7d671] Deleting Flavor currently in use by instance creates error
      (LP: #994935)
    - [7e88e39] nova testsuite errors on newer versions of python-boto (e.g.
      2.5.2) (LP: #1027984)
    - [80d3026] NoMoreFloatingIps: Zero floating ips available after repeatedly
      creating and destroying instances over time (LP: #1017418)
    - [4d74631] Launching with source groups under load produces lazy load error
      (LP: #1018721)
    - [08e5128] API 'v1.1/{tenant_id}/os-hosts' does not return a list of hosts
      (LP: #1014925)
    - [801b94a] Restarting nova-compute removes ip packet filters (LP: #1027105)
    - [f6d1f55] instance live migration should create virtual_size disk image
      (LP: #977007)
    - [4b89b4f] [nova][volumes] Exceeding volumes, gigabytes and floating_ips
      quotas returns general uninformative HTTP 500 error (LP: #1021373)
    - [6e873bc] [nova][volumes] Exceeding volumes, gigabytes and floating_ips
      quotas returns general uninformative HTTP 500 error (LP: #1021373)
    - [7b215ed] Use default qemu-img cluster size in libvirt connection driver
    - [d3a87a2] Listing flavors with marker set returns 400 (LP: #956096)
    - [cf6a85a] nova-rootwrap hardcodes paths instead of using
      /sbin:/usr/sbin:/usr/bin:/bin (LP: #1013147)
    - [2efc87c] affinity filters don't work if scheduler_hints is None
      (LP: #1007573)
  ...

Changed in nova (Ubuntu Precise):
status: Confirmed → Fix Released
Revision history for this message
Clint Byrum (clint-fewbar) wrote : Update Released

The verification of this Stable Release Update has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates, please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Thierry Carrez (ttx)
Changed in nova:
milestone: folsom-2 → 2012.2