ConnectionError: Too many heartbeats missed

Bug #1436769 reported by Can Zhang
62
This bug affects 11 people
Affects Status Importance Assigned to Milestone
oslo.messaging
Fix Released
Critical
Mehdi Abaakouk
Kilo
Fix Released
Critical
Mehdi Abaakouk
oslo.messaging (Ubuntu)
Fix Released
High
Unassigned
Vivid
Fix Released
High
Unassigned

Bug Description

2015-03-26 17:26:20.163 INFO oslo_messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 192.168.10.102:5672
2015-03-26 17:26:22.123 DEBUG oslo_messaging._drivers.amqpdriver [-] MSG_ID is bd43de9f41eb46138f3785d0eea100cc from (pid=16959) _send /usr/lib/python2.6/site-packages/oslo_messaging/_drivers/amqpdriver.py:311
2015-03-26 17:26:22.146 DEBUG oslo_messaging._drivers.impl_rabbit [-] Received recoverable error from kombu: from (pid=16959) ensure /usr/lib/python2.6/site-packages/oslo_messaging/_drivers/impl_rabbit.py:857
2015-03-26 17:26:22.146 TRACE oslo_messaging._drivers.impl_rabbit Traceback (most recent call last):
2015-03-26 17:26:22.146 TRACE oslo_messaging._drivers.impl_rabbit File "/usr/lib/python2.6/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 852, in ensure
2015-03-26 17:26:22.146 TRACE oslo_messaging._drivers.impl_rabbit ret, channel = autoretry_method()
2015-03-26 17:26:22.146 TRACE oslo_messaging._drivers.impl_rabbit File "/usr/lib/python2.6/site-packages/kombu/connection.py", line 453, in _ensured
2015-03-26 17:26:22.146 TRACE oslo_messaging._drivers.impl_rabbit return fun(*args, **kwargs)
2015-03-26 17:26:22.146 TRACE oslo_messaging._drivers.impl_rabbit File "/usr/lib/python2.6/site-packages/kombu/connection.py", line 522, in __call__
2015-03-26 17:26:22.146 TRACE oslo_messaging._drivers.impl_rabbit return fun(*args, **kwargs), channels[0]
2015-03-26 17:26:22.146 TRACE oslo_messaging._drivers.impl_rabbit File "/usr/lib/python2.6/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 838, in execute_method
2015-03-26 17:26:22.146 TRACE oslo_messaging._drivers.impl_rabbit method()
2015-03-26 17:26:22.146 TRACE oslo_messaging._drivers.impl_rabbit File "/usr/lib/python2.6/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 1031, in _consume
2015-03-26 17:26:22.146 TRACE oslo_messaging._drivers.impl_rabbit rate=self.driver_conf.heartbeat_rate)
2015-03-26 17:26:22.146 TRACE oslo_messaging._drivers.impl_rabbit File "/usr/lib/python2.6/site-packages/kombu/connection.py", line 270, in heartbeat_check
2015-03-26 17:26:22.146 TRACE oslo_messaging._drivers.impl_rabbit return self.transport.heartbeat_check(self.connection, rate=rate)
2015-03-26 17:26:22.146 TRACE oslo_messaging._drivers.impl_rabbit File "/usr/lib/python2.6/site-packages/kombu/transport/pyamqp.py", line 134, in heartbeat_check
2015-03-26 17:26:22.146 TRACE oslo_messaging._drivers.impl_rabbit return connection.heartbeat_tick(rate=rate)
2015-03-26 17:26:22.146 TRACE oslo_messaging._drivers.impl_rabbit File "/usr/lib/python2.6/site-packages/amqp/connection.py", line 844, in heartbeat_tick
2015-03-26 17:26:22.146 TRACE oslo_messaging._drivers.impl_rabbit raise ConnectionError('Too many heartbeats missed')
2015-03-26 17:26:22.146 TRACE oslo_messaging._drivers.impl_rabbit ConnectionError: Too many heartbeats missed
2015-03-26 17:26:22.146 TRACE oslo_messaging._drivers.impl_rabbit

Revision history for this message
Mehdi Abaakouk (sileht) wrote :

This is a debug message, than can be completely normal, where and how have you get this message ?

Changed in oslo.messaging:
status: New → Incomplete
Revision history for this message
Can Zhang (acme-ican) wrote :

Ops sorry. I got this message when I use devstack(icehouse) to install openstack. This error happens forever and the log files eat up my disk space.

Revision history for this message
Jason (jason-ob) wrote :

I am seeing this also. Fresh install of devstack (from master branch 4/3/2015) on Ubuntu 14.04. I noticed this on pretty much all of the services. My log files are also filling up as does the seemingly infinite number of ephemeral port connections to 5672. I can still authenticate and use Horizon, but nova services-list shows nova-compute is down, which is probably related I suppose. Is there some branch I can "back-up" to that doesn't have this probelem?

Revision history for this message
Travis Tripp (travis-tripp) wrote :

I'm getting the same message on devstack. Ubuntu 14.04. Horizon reports nova-compute as down.

Same output in:

n-cpu, n-sched, q-svc, etc

Revision history for this message
Travis Tripp (travis-tripp) wrote :

BTW, I can't launch an instance. Says there are no available compute nodes.

Changed in oslo.messaging:
status: Incomplete → Confirmed
Revision history for this message
Shahbaz Nazir (shahbazn) wrote :

Seeing similar messages in the following services.
g-api, n-api, q-svc, n-cond, n-sch, n-cpu, c-sch, c-vol

Setup: devstack(Master)+Ubunut14.04

Outpu for c-vol:

2015-04-14 13:11:33.153 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 192.168.122.95:5672 is unreachable: Too many heartbeats missed. Trying again in 1 seconds.
2015-04-14 13:11:34.167 INFO oslo_messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 192.168.122.95:5672
2015-04-14 13:11:36.169 DEBUG oslo_messaging._drivers.impl_rabbit [-] Received recoverable error from kombu: from (pid=30954) on_error /usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/impl_rabbit.py:782
2015-04-14 13:11:36.169 TRACE oslo_messaging._drivers.impl_rabbit Traceback (most recent call last):
2015-04-14 13:11:36.169 TRACE oslo_messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/kombu/connection.py", line 440, in _ensured
2015-04-14 13:11:36.169 TRACE oslo_messaging._drivers.impl_rabbit return fun(*args, **kwargs)
2015-04-14 13:11:36.169 TRACE oslo_messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/kombu/connection.py", line 512, in __call__
2015-04-14 13:11:36.169 TRACE oslo_messaging._drivers.impl_rabbit return fun(*args, channel=channels[0], **kwargs), channels[0]
2015-04-14 13:11:36.169 TRACE oslo_messaging._drivers.impl_rabbit File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/impl_rabbit.py", line 831, in execute_method
2015-04-14 13:11:36.169 TRACE oslo_messaging._drivers.impl_rabbit method()
2015-04-14 13:11:36.169 TRACE oslo_messaging._drivers.impl_rabbit File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/impl_rabbit.py", line 1024, in _consume
2015-04-14 13:11:36.169 TRACE oslo_messaging._drivers.impl_rabbit rate=self.driver_conf.heartbeat_rate)
2015-04-14 13:11:36.169 TRACE oslo_messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/kombu/connection.py", line 268, in heartbeat_check
2015-04-14 13:11:36.169 TRACE oslo_messaging._drivers.impl_rabbit return self.transport.heartbeat_check(self.connection, rate=rate)
2015-04-14 13:11:36.169 TRACE oslo_messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/kombu/transport/pyamqp.py", line 127, in heartbeat_check
2015-04-14 13:11:36.169 TRACE oslo_messaging._drivers.impl_rabbit return connection.heartbeat_tick(rate=rate)
2015-04-14 13:11:36.169 TRACE oslo_messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/amqp/connection.py", line 876, in heartbeat_tick
2015-04-14 13:11:36.169 TRACE oslo_messaging._drivers.impl_rabbit raise ConnectionForced('Too many heartbeats missed')
2015-04-14 13:11:36.169 TRACE oslo_messaging._drivers.impl_rabbit ConnectionForced: Too many heartbeats missed
2015-04-14 13:11:36.169 TRACE oslo_messaging._drivers.impl_rabbit
2015-04-14 13:11:36.169 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 192.168.122.95:5672 is unreachable: Too many heartbeats missed. Trying again in 1 seconds.
2015-04-14 13:11:37.187 INFO oslo_messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 192.168.122.95:5672

Revision history for this message
Sean Dague (sdague) wrote :

I just saw this this morning with a new system, it's happening on everything that's talking to the rpc bus every second, over and over and over again.

Revision history for this message
Mehdi Abaakouk (sileht) wrote :

Perhaps: https://github.com/celery/py-amqp/issues/6

What version of py-amqp do you use ? all version of amqp < 1.4.0 seems inpacted

Revision history for this message
Sean Dague (sdague) wrote :

I was on 1.3.3 when I hit this issue, it does appear that forcing the upgrade to 1.4.0 makes it recover. This was a long running devstack box, and we never raised the minimum requirement here, so I can see how I'd still have an older amqp.

Revision history for this message
Sean Dague (sdague) wrote :

From irc logs this morning:

<sdague> I don't know, what's your opinion of the risk of finding other issues here?
<-- sreshetnyak has quit (Ping timeout: 250 seconds)
<sileht> sdague, heartbeat or not, the really complains that if something bad occurs on the communication between oslo.msg and rabbit, we never want to wait the the system tcp timeout to discover it
<sileht> sdague, many people think that implementing heartbeat was fixing the issue
<sileht> sdague, but that not true, to really fix that we have to be able the set timeout on socket
<sileht> sdague, and THEN use heartbeat to use very long timeout
<sdague> ok, that's fair
<sdague> so what's your opinion on the safest option for the kilo release to keep regressions to a minimum?
<sileht> sdague, but kombu doesn't allow us to do that on the socket used for write
<sdague> should we bump the requirement?
<sdague> or turn off hearbeats by default?
<sileht> perhaps turn off hearbeats because the read timeout is still forced to a maximun of 1sec to ensure we detect connection lost on read
<sdague> sileht: ok, so that would be a 1.8.2 with the default behavior changed?
<sileht> on write, heartbeat or not, we can still have issue, if the packet that fail to transfert is the heartbeat one
--> inc0 (~inc0@134.134.137.73) has joined #openstack-oslo
<sileht> sdague, but on the other side ops really really want this feature
<sileht> sdague, even it's not yet perfect yet
<sdague> well, they can turn it back on
<sileht> true
<sdague> call it experimental because there are known issues, which is why it defaults off
<sdague> but explain how it can be enabled if people want to try it
<sileht> sdague, that soon reasonable
<sileht> sdague, I will proposed a patch and see other core dev opinion

So the proposed fix is to turn off heartbeats by default, which will require a 1.8.2 and 1.9.1 release.

In addition it seems prudent to have oslo.messaging versions that support heartbeats to require amqp >= 1.4.0, which addresses a critical bug here.

All of that really needs to be in place for the kilo release.

Changed in oslo.messaging:
importance: Undecided → Critical
Mehdi Abaakouk (sileht)
Changed in oslo.messaging:
assignee: nobody → Mehdi Abaakouk (sileht)
Mehdi Abaakouk (sileht)
tags: added: 1.8.1
tags: added: kilo-backport-potential
Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

+1 to mandate a minimum of amqp > 1.4.x by bumping up kombu version minimum to 3.0.9

https://github.com/celery/kombu/blob/master/requirements/default.txt

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to oslo.messaging (master)

Reviewed: https://review.openstack.org/174929
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=287a4f56f45ed9cd40116a9e7b6e529f3382a925
Submitter: Jenkins
Branch: master

commit 287a4f56f45ed9cd40116a9e7b6e529f3382a925
Author: Mehdi Abaakouk <email address hidden>
Date: Fri Apr 17 17:37:34 2015 +0200

    Disable and mark heartbeat as experimental

    Due to some discovered issues since heartbeat is enabled by default.
    Specially #1436788, that needs to fix the underlying library, too.
    So, according to the discution here:
    https://bugs.launchpad.net/oslo.messaging/+bug/1436769/comments/10

    We decide to mark the implementation as experimental and disable it by default.

    Related-bug: #1436788
    Related-bug: #1436769
    Change-Id: Ib7c55977f976bdbbc8df4ad5915e0433cbf84a17

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to oslo.messaging (stable/kilo)

Related fix proposed to branch: stable/kilo
Review: https://review.openstack.org/177076

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to oslo.messaging (stable/kilo)

Reviewed: https://review.openstack.org/177076
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=a8c06abdd6ea78869b8836f7d20bab63b4fe798b
Submitter: Jenkins
Branch: stable/kilo

commit a8c06abdd6ea78869b8836f7d20bab63b4fe798b
Author: Mehdi Abaakouk <email address hidden>
Date: Fri Apr 17 17:37:34 2015 +0200

    Disable and mark heartbeat as experimental

    Due to some discovered issues since heartbeat is enabled by default.
    Specially #1436788, that needs to fix the underlying library, too.
    So, according to the discution here:
    https://bugs.launchpad.net/oslo.messaging/+bug/1436769/comments/10

    We decide to mark the implementation as experimental and disable it by default.

    Related-bug: #1436788
    Related-bug: #1436769
    Change-Id: Ib7c55977f976bdbbc8df4ad5915e0433cbf84a17
    (cherry picked from commit 287a4f56f45ed9cd40116a9e7b6e529f3382a925)

tags: added: in-stable-kilo
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/177772
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=562c41bb78f578ff9c27633c04ac6dd914c55f5a
Submitter: Jenkins
Branch: stable/kilo

commit 562c41bb78f578ff9c27633c04ac6dd914c55f5a
Author: Mehdi Abaakouk <email address hidden>
Date: Mon Apr 27 15:51:42 2015 +0200

    rabbit: add some notes about heartbeat

    Heartbeat is supported only with recent version
    of kombu (>=3.0.7) and amqp (>=1.4.0)

    Related-bug: #1436769
    Change-Id: I938741d953b4db9e6f56858677220c5d6624af2d

Mehdi Abaakouk (sileht)
Changed in oslo.messaging:
milestone: none → next-liberty
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (master)

Fix proposed to branch: master
Review: https://review.openstack.org/179356

Changed in oslo.messaging:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (master)

Reviewed: https://review.openstack.org/179356
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=0c954cffa2f3710acafa79f01b958a8955823640
Submitter: Jenkins
Branch: master

commit 0c954cffa2f3710acafa79f01b958a8955823640
Author: Mehdi Abaakouk <email address hidden>
Date: Fri May 1 13:27:15 2015 +0200

    Bump kombu and amqp requirements

    We at least need these versions of amqp and kombu to have
    a working heartbeat support.

    Related-bug: #1436788
    Closes-bug: #1436769
    Closes-bug: #1408830

    Change-Id: I61440c5ccf2b540fe9a1e868bdcae9f5d2cf8422

Changed in oslo.messaging:
status: In Progress → Fix Committed
Changed in oslo.messaging:
status: Fix Committed → Fix Released
James Page (james-page)
Changed in oslo.messaging (Ubuntu Wily):
status: New → Fix Released
Changed in oslo.messaging (Ubuntu Vivid):
status: New → Triaged
importance: Undecided → High
Changed in oslo.messaging (Ubuntu Wily):
importance: Undecided → High
Revision history for this message
Chris J Arges (arges) wrote : Please test proposed package

Hello Can, or anyone else affected,

Accepted oslo.messaging into vivid-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/oslo.messaging/1.8.3-0ubuntu0.15.04.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in oslo.messaging (Ubuntu Vivid):
status: Triaged → Fix Committed
tags: added: verification-needed
James Page (james-page)
tags: added: verification-done
removed: verification-needed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package oslo.messaging - 1.8.3-0ubuntu0.15.04.1

---------------
oslo.messaging (1.8.3-0ubuntu0.15.04.1) vivid; urgency=medium

  * New upstream point release (LP: #1467959):
    - RabbitMQ driver:
      + Adding publisher acknowledgements/confirms for better handling
        of messages during broker shutdown/network failure.
      + Ensure consumer connections closed properly (LP: #1458917).
      + Set timeout on the underlying socket (LP: #1436788).
      + Disable and mark heartbeat as experimental (LP: #1436769).
      + Fix ipv6 support.
    - ZeroMQ driver:
      + Don't raise Timeout on no-matchmaker results (LP: #1186310).
      + Fix issue with Redis not deleting expired keys (LP: #1417464).
      + d/p/Fix-changing-keys-during-iteration-in-matchmaker-hea.patch,
        d/p/Add-pluggability-for-matchmakers.patch: Dropped, included
        upstream.

 -- James Page <email address hidden> Tue, 23 Jun 2015 15:28:01 +0100

Changed in oslo.messaging (Ubuntu Vivid):
status: Fix Committed → Fix Released
Revision history for this message
Chris J Arges (arges) wrote : Update Released

The verification of the Stable Release Update for oslo.messaging has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Chris J Arges (arges) wrote : Please test proposed package

Hello Can, or anyone else affected,

Accepted oslo.messaging into vivid-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/oslo.messaging/1.8.3-0ubuntu0.15.04.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: removed: verification-done
tags: added: verification-needed
James Page (james-page)
tags: added: verification-done
removed: verification-needed
Mathew Hodson (mhodson)
no longer affects: oslo.messaging (Ubuntu Wily)
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.