[SRU] Using SSL with rabbitmq prevents communication between nova-compute and conductor after latest nova updates

Bug #1472712 reported by Dina
This bug affects 1 person
Affects                    Status         Importance   Assigned to          Milestone
OpenStack Compute (nova)   Invalid        Undecided    Unassigned
oslo.messaging             Invalid        Undecided    Unassigned
python-amqp (Ubuntu)       Fix Released   High         Edward Hope-Morley
  Trusty                   Fix Released   High         Edward Hope-Morley

Bug Description

[Impact]

Current oslo.messaging and python-amqp result in repeated connection timeouts in the amqp transport layer (SSLError) and thus excessive reconnect attempts. This is a known issue that was fixed in python-amqp 1.4.4.

[Test Case]

Deploy openstack using current Trusty versions + this version of python-amqp + rabbitmq configured to allow ssl connections only. Once up and running, check the following:

- number of rabbitmq connections: with a single nova-compute, conductor, etc. I see approx. 20 connections, whereas previously I saw well over 100 and rising.

    sudo rabbitmqctl list_connections

- check that messages are being consumed from openstack queues

    sudo rabbitmqctl list_queues -p openstack consumers messages name

- also check e.g. the nova-compute and nova-conductor logs and verify that the errors mentioned below no longer appear
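The consumer check in the list above can be scripted. The following is a minimal sketch, assuming that `rabbitmqctl list_queues` prints one tab-separated row per queue in the column order requested (consumers, messages, name) and that banner lines such as "Listing queues ..." can be skipped. The function names and the sample text are illustrative only and are not part of any RabbitMQ tooling.

```python
def parse_queue_listing(text):
    """Parse `rabbitmqctl list_queues -p openstack consumers messages name` output.

    Assumes tab-separated columns in the order: consumers, messages, name.
    Lines that do not match that shape (banners, trailers) are ignored.
    """
    rows = []
    for line in text.splitlines():
        fields = line.strip().split("\t")
        if len(fields) != 3:
            continue
        consumers, messages, name = fields
        if not (consumers.isdigit() and messages.isdigit()):
            continue
        rows.append((name, int(consumers), int(messages)))
    return rows


def stalled_queues(rows):
    """Return the names of queues that hold messages but have no consumers."""
    return [name for name, consumers, messages in rows
            if consumers == 0 and messages > 0]


# Made-up sample in the assumed output format, for illustration only.
sample = "Listing queues ...\n1\t0\tcompute\n0\t12\tconductor\n...done.\n"
for queue_name in stalled_queues(parse_queue_listing(sample)):
    print("no consumers:", queue_name)
```

In a real check the text would come from running the `rabbitmqctl` command shown above instead of the `sample` string.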

[Regression Potential]

None.

[Other Info]

None.

---- ---- ----- ----

On the latest update of the Ubuntu OpenStack packages, it was discovered that the nova-compute/nova-conductor (1:2014.1.4-0ubuntu2.1) packages encountered a bug with using SSL to connect to rabbitmq.

When this problem occurs, the compute node cannot connect to the controller, and this message is constantly displayed:

WARNING nova.conductor.api [req-4022395c-9501-47cf-bf8e-476e1cc58772 None None] Timed out waiting for nova-conductor. Is it running? Or did this service start before nova-conductor?

Investigation revealed that having rabbitmq configured with SSL was the root cause of this problem. This seems to have been introduced with the current version of the nova packages. Rabbitmq was not updated as part of this distribution update, but the messaging library (python-oslo.messaging 1.3.0-0ubuntu1.1) was updated. So the problem could exist in any of these components.

Versions installed:
Openstack version: Icehouse
Ubuntu 14.04.2 LTS
nova-conductor 1:2014.1.4-0ubuntu2.1
nova-compute 1:2014.1.4-0ubuntu2.1
rabbitmq-server 3.2.4-1
openssl:amd64/trusty-security 1.0.1f-1ubuntu2.15

Dina (dina-salem) wrote :

Upgraded nova-compute to 1:2014.1.5-0ubuntu1

The logs show more details:

2015-07-08 11:32:29.066 13437 INFO oslo.messaging._drivers.impl_rabbit [-] Connected to AMQP server on amqp.wedgecnd.internal:5672
2015-07-08 11:32:38.062 13437 WARNING nova.conductor.api [req-9f555cd9-0e37-4114-919f-f3e5e164d724 None None] Timed out waiting for nova-conductor. Is it running? Or did this service start before nova-conductor?
2015-07-08 11:32:38.069 13437 ERROR oslo.messaging._drivers.impl_rabbit [-] Failed to consume message from queue: <AMQPError: unknown error>
2015-07-08 11:32:38.069 13437 TRACE oslo.messaging._drivers.impl_rabbit Traceback (most recent call last):
2015-07-08 11:32:38.069 13437 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/oslo/messaging/_drivers/impl_rabbit.py", line 624, in ensure
2015-07-08 11:32:38.069 13437 TRACE oslo.messaging._drivers.impl_rabbit return method(*args, **kwargs)
2015-07-08 11:32:38.069 13437 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/oslo/messaging/_drivers/impl_rabbit.py", line 704, in _consume
2015-07-08 11:32:38.069 13437 TRACE oslo.messaging._drivers.impl_rabbit raise self.connection.recoverable_connection_errors[0]
2015-07-08 11:32:38.069 13437 TRACE oslo.messaging._drivers.impl_rabbit RecoverableConnectionError: <AMQPError: unknown error>
2015-07-08 11:32:38.069 13437 TRACE oslo.messaging._drivers.impl_rabbit
2015-07-08 11:32:38.069 13437 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnecting to AMQP server on amqp.wedgecnd.internal:5672
2015-07-08 11:32:38.069 13437 INFO oslo.messaging._drivers.impl_rabbit [-] Delaying reconnect for 1.0 seconds...
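One way to quantify the reconnect loop visible in this log is to count the "Reconnecting to AMQP server" lines per minute. This is a rough sketch that assumes the line layout shown above (date, time, PID, level, logger, message); `reconnects_per_minute` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches the oslo.messaging log layout shown in the excerpt above, e.g.
# "2015-07-08 11:32:38.069 13437 INFO oslo.messaging._drivers.impl_rabbit [-]
#  Reconnecting to AMQP server on <host>:<port>"
RECONNECT_RE = re.compile(
    r"^(?P<minute>\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2}\.\d+ \d+ INFO "
    r"oslo\.messaging\._drivers\.impl_rabbit \[-\] Reconnecting to AMQP server"
)


def reconnects_per_minute(lines):
    """Count reconnect log lines, grouped by 'YYYY-MM-DD HH:MM'."""
    counts = Counter()
    for line in lines:
        match = RECONNECT_RE.match(line)
        if match:
            counts[match.group("minute")] += 1
    return counts
```

Passing the lines of a nova-compute or nova-conductor log file to `reconnects_per_minute` gives a per-minute count; a count that stays high minute after minute would be consistent with the behaviour described in this report.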

Markus Zoeller (markus_z) (mzoeller) wrote :

I added "oslo.messaging" as affected project.

tags: added: oslo
Liam Young (gnuoy) wrote :

If the patches that are carried in the Ubuntu packaging for oslo.messaging are removed and the nova services restarted then ssl + nova + rabbit seems to work.

e.g.

apt-get install --yes quilt
apt-get source oslo.messaging
cd oslo.messaging-1.3.0/
export QUILT_PATCHES=debian/patches
export QUILT_REFRESH_ARGS="-p ab --no-timestamps --no-index"
quilt pop -a

cp oslo/messaging/_drivers/impl_rabbit.py /usr/lib/python2.7/dist-packages/oslo/messaging/_drivers/impl_rabbit.py
cp oslo/messaging/_drivers/common.py /usr/lib/python2.7/dist-packages/oslo/messaging/_drivers/common.py
cp oslo/messaging/_drivers/amqpdriver.py /usr/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py

cd /etc/init/; for i in nova-*conf; do service ${i/.conf/} restart; done

Changed in oslo.messaging:
status: New → Confirmed
Liam Young (gnuoy)
Changed in oslo.messaging:
status: Confirmed → Invalid
Changed in nova:
status: New → Invalid
Changed in python-oslo.messaging (Ubuntu):
status: New → Confirmed
tags: added: sts
Edward Hope-Morley (hopem) wrote :

A bit more info from my end. I've been trying out different scenarios and it seems that this is constrained to Trusty Icehouse using python-oslo.messaging version 1.3.0-0ubuntu1.2 configured to connect to rabbitmq-server using ssl e.g. my nova.conf has:

rabbit_userid = nova
rabbit_virtual_host = openstack
rabbit_password = gr6Mx2FJhC8NH3P4dBRGH8tYT39s6LLcMfJChKM6dtb3rpN5wfkRWVBcMLdhqp58
rabbit_host = 10.5.6.86
rabbit_use_ssl = True
rabbit_port = 5671
kombu_ssl_ca_certs = /etc/nova/rabbit-client-ca.pem

I've played around with reverting back to 1.3.0-0ubuntu1 (which does not appear to exhibit the issue) and re-adding patches one-by-one and have found that simply adding the patch for bug 1400268 causes the issue to occur. So, question is what is it about that patch that causes these issues?

Edward Hope-Morley (hopem) wrote :

OK, upon further investigation I have found some trace of a root cause. Oslo.messaging always uses a timeout of 1 second when polling queues and connections. This appears to be too small when using ssl and frequently results in an SSLError/timeout, which causes all threads to fail, reconnect, and fail again repeatedly. The number of connections therefore rises fast and rpc stops working, which is why compute and conductor are not able to communicate. I've played around with alternative timeout values and I get much better results even with a value of 2s instead of 1s. I'll propose an initial workaround patch shortly so we can get out of this bind for now, but I think we'll ultimately need a more intelligent solution than what oslo.messaging supports in this version.

Changed in python-oslo.messaging (Ubuntu):
status: Confirmed → In Progress
assignee: nobody → Edward Hope-Morley (hopem)
importance: Undecided → High
Edward Hope-Morley (hopem) wrote :

Finally got to the bottom of this. The issue lies in python-amqp rather than python-oslo.messaging. The current trusty version of python-amqp (1.3.3) has a bug that is fixed in 1.4.4 (see http://amqp.readthedocs.org/en/latest/changelog.html#version-1-4-4). I tried backporting the Juno/Utopic version (1.4.5) for Trusty and everything works just fine now. I will shortly propose an SRU to get python-amqp fixed in Trusty.
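To illustrate the class of bug involved: on Python 2, a read timeout on an SSL socket surfaced as an `ssl.SSLError` whose message contains "timed out" rather than as `socket.timeout`. A transport that treats every `SSLError` as fatal would then drop the connection on what is only an idle poll. The sketch below shows the general idea of treating such an error as an ordinary timeout. It is an illustration under those assumptions, not the python-amqp patch itself, and both function names are hypothetical.

```python
import socket
import ssl


def as_timeout_if_ssl_read_timeout(exc):
    """Return socket.timeout for an SSL read timeout, else the original error.

    Assumption: the SSL read timeout is reported as an ssl.SSLError whose
    message contains "timed out", as Python 2 did.
    """
    if isinstance(exc, ssl.SSLError) and "timed out" in str(exc):
        return socket.timeout(str(exc))
    return exc


def read_with_timeout(read):
    """Call read() and re-raise SSL read timeouts as socket.timeout.

    `read` is any zero-argument callable standing in for a socket read.
    Other SSL errors are re-raised unchanged.
    """
    try:
        return read()
    except ssl.SSLError as exc:
        translated = as_timeout_if_ssl_read_timeout(exc)
        if translated is exc:
            raise
        raise translated
```

With this kind of translation, a caller that already handles `socket.timeout` as "nothing to read yet" can keep the connection open instead of reconnecting.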

affects: python-oslo.messaging (Ubuntu) → python-amqp (Ubuntu)
description: updated
summary: - Using SSL with rabbitmq prevents communication between nova-compute and
- conductor after latest nova updates
+ [SRU] Using SSL with rabbitmq prevents communication between nova-
+ compute and conductor after latest nova updates
Changed in python-amqp (Ubuntu Trusty):
status: New → In Progress
importance: Undecided → High
assignee: nobody → Edward Hope-Morley (hopem)
description: updated
James Page (james-page)
Changed in python-amqp (Ubuntu):
status: In Progress → Fix Released
Chris J Arges (arges) wrote : Please test proposed package

Hello Dina, or anyone else affected,

Accepted python-amqp into trusty-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/python-amqp/1.3.3-1ubuntu1.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in python-amqp (Ubuntu Trusty):
status: In Progress → Fix Committed
tags: added: verification-needed
Dina (dina-salem) wrote :

Tested python-amqp version 1.3.3-1ubuntu1.1 from trusty-proposed and it fixes the bug.

tags: added: verification-done
removed: verification-needed
Edward Hope-Morley (hopem) wrote :

I have run an ssl only icehouse deployment with this and can also verify that it does fix the bug.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package python-amqp - 1.3.3-1ubuntu1.1

---------------
python-amqp (1.3.3-1ubuntu1.1) trusty; urgency=medium

  * Ensure SSL read timeouts raised properly (LP: #1472712):
    - d/p/dont-disconnect-transport-on-ssl-read-timeout.patch:
      Backport patch from 1.4.4.

 -- Edward Hope-Morley <email address hidden> Mon, 10 Aug 2015 15:57:44 +0100

Changed in python-amqp (Ubuntu Trusty):
status: Fix Committed → Fix Released
Brian Murray (brian-murray) wrote : Update Released

The verification of the Stable Release Update for python-amqp has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Liam Young (gnuoy) wrote :

This is affecting precise/icehouse deployments, as they have the old version of python-amqp.
