gate-tempest-dsvm-large-ops fails with RPC MessagingTimeout in _try_deallocate_network

Bug #1415106 reported by Matt Riedemann
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Invalid
Critical
Matt Riedemann
PBR
Fix Released
Undecided
Monty Taylor

Bug Description

http://logs.openstack.org/60/147460/7/gate/gate-tempest-dsvm-large-ops/12ba360/logs/screen-n-cpu-1.txt.gz?level=TRACE

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiU2V0dGluZyBpbnN0YW5jZSB2bV9zdGF0ZSB0byBFUlJPUlwiIEFORCBtZXNzYWdlOlwiX3RyeV9kZWFsbG9jYXRlX25ldHdvcmtcIiBBTkQgbWVzc2FnZTpcIk1lc3NhZ2luZ1RpbWVvdXRcIiBBTkQgYnVpbGRfbmFtZTpcImdhdGUtdGVtcGVzdC1kc3ZtLWxhcmdlLW9wc1wiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiI2MDQ4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxNDIyMzc3NTI3MzgwLCJtb2RlIjoiIiwiYW5hbHl6ZV9maWVsZCI6IiJ9

61 hits in 7 days, check and gate, all failures.

2015-01-27 10:55:34.025 ERROR nova.compute.manager [req-34f0ef2f-34f0-4b06-9e30-fd72565a3388 TestLargeOpsScenario-405264936 TestLargeOpsScenario-221766999] [instance: 63145652-811e-4124-8ef7-cc64b594336a] Setting instance vm_state to ERROR
2015-01-27 10:55:34.025 32626 TRACE nova.compute.manager [instance: 63145652-811e-4124-8ef7-cc64b594336a] Traceback (most recent call last):
2015-01-27 10:55:34.025 32626 TRACE nova.compute.manager [instance: 63145652-811e-4124-8ef7-cc64b594336a] File "/opt/stack/new/nova/nova/compute/manager.py", line 2569, in do_terminate_instance
2015-01-27 10:55:34.025 32626 TRACE nova.compute.manager [instance: 63145652-811e-4124-8ef7-cc64b594336a] self._delete_instance(context, instance, bdms, quotas)
2015-01-27 10:55:34.025 32626 TRACE nova.compute.manager [instance: 63145652-811e-4124-8ef7-cc64b594336a] File "/opt/stack/new/nova/nova/hooks.py", line 149, in inner
2015-01-27 10:55:34.025 32626 TRACE nova.compute.manager [instance: 63145652-811e-4124-8ef7-cc64b594336a] rv = f(*args, **kwargs)
2015-01-27 10:55:34.025 32626 TRACE nova.compute.manager [instance: 63145652-811e-4124-8ef7-cc64b594336a] File "/opt/stack/new/nova/nova/compute/manager.py", line 2538, in _delete_instance
2015-01-27 10:55:34.025 32626 TRACE nova.compute.manager [instance: 63145652-811e-4124-8ef7-cc64b594336a] quotas.rollback()
2015-01-27 10:55:34.025 32626 TRACE nova.compute.manager [instance: 63145652-811e-4124-8ef7-cc64b594336a] File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 82, in __exit__
2015-01-27 10:55:34.025 32626 TRACE nova.compute.manager [instance: 63145652-811e-4124-8ef7-cc64b594336a] six.reraise(self.type_, self.value, self.tb)
2015-01-27 10:55:34.025 32626 TRACE nova.compute.manager [instance: 63145652-811e-4124-8ef7-cc64b594336a] File "/opt/stack/new/nova/nova/compute/manager.py", line 2515, in _delete_instance
2015-01-27 10:55:34.025 32626 TRACE nova.compute.manager [instance: 63145652-811e-4124-8ef7-cc64b594336a] self._shutdown_instance(context, instance, bdms)
2015-01-27 10:55:34.025 32626 TRACE nova.compute.manager [instance: 63145652-811e-4124-8ef7-cc64b594336a] File "/opt/stack/new/nova/nova/compute/manager.py", line 2449, in _shutdown_instance
2015-01-27 10:55:34.025 32626 TRACE nova.compute.manager [instance: 63145652-811e-4124-8ef7-cc64b594336a] self._try_deallocate_network(context, instance, requested_networks)
2015-01-27 10:55:34.025 32626 TRACE nova.compute.manager [instance: 63145652-811e-4124-8ef7-cc64b594336a] File "/opt/stack/new/nova/nova/compute/manager.py", line 2379, in _try_deallocate_network
2015-01-27 10:55:34.025 32626 TRACE nova.compute.manager [instance: 63145652-811e-4124-8ef7-cc64b594336a] self._set_instance_error_state(context, instance)
2015-01-27 10:55:34.025 32626 TRACE nova.compute.manager [instance: 63145652-811e-4124-8ef7-cc64b594336a] File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 82, in __exit__
2015-01-27 10:55:34.025 32626 TRACE nova.compute.manager [instance: 63145652-811e-4124-8ef7-cc64b594336a] six.reraise(self.type_, self.value, self.tb)
2015-01-27 10:55:34.025 32626 TRACE nova.compute.manager [instance: 63145652-811e-4124-8ef7-cc64b594336a] File "/opt/stack/new/nova/nova/compute/manager.py", line 2374, in _try_deallocate_network
2015-01-27 10:55:34.025 32626 TRACE nova.compute.manager [instance: 63145652-811e-4124-8ef7-cc64b594336a] self._deallocate_network(context, instance, requested_networks)
2015-01-27 10:55:34.025 32626 TRACE nova.compute.manager [instance: 63145652-811e-4124-8ef7-cc64b594336a] File "/opt/stack/new/nova/nova/compute/manager.py", line 1954, in _deallocate_network
2015-01-27 10:55:34.025 32626 TRACE nova.compute.manager [instance: 63145652-811e-4124-8ef7-cc64b594336a] context, instance, requested_networks=requested_networks)
2015-01-27 10:55:34.025 32626 TRACE nova.compute.manager [instance: 63145652-811e-4124-8ef7-cc64b594336a] File "/opt/stack/new/nova/nova/network/api.py", line 47, in wrapped
2015-01-27 10:55:34.025 32626 TRACE nova.compute.manager [instance: 63145652-811e-4124-8ef7-cc64b594336a] return func(self, context, *args, **kwargs)
2015-01-27 10:55:34.025 32626 TRACE nova.compute.manager [instance: 63145652-811e-4124-8ef7-cc64b594336a] File "/opt/stack/new/nova/nova/network/api.py", line 292, in deallocate_for_instance
2015-01-27 10:55:34.025 32626 TRACE nova.compute.manager [instance: 63145652-811e-4124-8ef7-cc64b594336a] requested_networks=requested_networks)
2015-01-27 10:55:34.025 32626 TRACE nova.compute.manager [instance: 63145652-811e-4124-8ef7-cc64b594336a] File "/opt/stack/new/nova/nova/network/rpcapi.py", line 209, in deallocate_for_instance
2015-01-27 10:55:34.025 32626 TRACE nova.compute.manager [instance: 63145652-811e-4124-8ef7-cc64b594336a] return cctxt.call(ctxt, 'deallocate_for_instance', **kwargs)
2015-01-27 10:55:34.025 32626 TRACE nova.compute.manager [instance: 63145652-811e-4124-8ef7-cc64b594336a] File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/client.py", line 389, in call
2015-01-27 10:55:34.025 32626 TRACE nova.compute.manager [instance: 63145652-811e-4124-8ef7-cc64b594336a] return self.prepare().call(ctxt, method, **kwargs)
2015-01-27 10:55:34.025 32626 TRACE nova.compute.manager [instance: 63145652-811e-4124-8ef7-cc64b594336a] File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/client.py", line 152, in call
2015-01-27 10:55:34.025 32626 TRACE nova.compute.manager [instance: 63145652-811e-4124-8ef7-cc64b594336a] retry=self.retry)
2015-01-27 10:55:34.025 32626 TRACE nova.compute.manager [instance: 63145652-811e-4124-8ef7-cc64b594336a] File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/transport.py", line 90, in _send
2015-01-27 10:55:34.025 32626 TRACE nova.compute.manager [instance: 63145652-811e-4124-8ef7-cc64b594336a] timeout=timeout, retry=retry)
2015-01-27 10:55:34.025 32626 TRACE nova.compute.manager [instance: 63145652-811e-4124-8ef7-cc64b594336a] File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 436, in send
2015-01-27 10:55:34.025 32626 TRACE nova.compute.manager [instance: 63145652-811e-4124-8ef7-cc64b594336a] retry=retry)
2015-01-27 10:55:34.025 32626 TRACE nova.compute.manager [instance: 63145652-811e-4124-8ef7-cc64b594336a] File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 425, in _send
2015-01-27 10:55:34.025 32626 TRACE nova.compute.manager [instance: 63145652-811e-4124-8ef7-cc64b594336a] result = self._waiter.wait(msg_id, timeout)
2015-01-27 10:55:34.025 32626 TRACE nova.compute.manager [instance: 63145652-811e-4124-8ef7-cc64b594336a] File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 328, in wait
2015-01-27 10:55:34.025 32626 TRACE nova.compute.manager [instance: 63145652-811e-4124-8ef7-cc64b594336a] reply, ending, trylock = self._poll_queue(msg_id, timer)
2015-01-27 10:55:34.025 32626 TRACE nova.compute.manager [instance: 63145652-811e-4124-8ef7-cc64b594336a] File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 267, in _poll_queue
2015-01-27 10:55:34.025 32626 TRACE nova.compute.manager [instance: 63145652-811e-4124-8ef7-cc64b594336a] message = self.waiters.get(msg_id, timeout=timer.check_return(msg_id))
2015-01-27 10:55:34.025 32626 TRACE nova.compute.manager [instance: 63145652-811e-4124-8ef7-cc64b594336a] File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 173, in get
2015-01-27 10:55:34.025 32626 TRACE nova.compute.manager [instance: 63145652-811e-4124-8ef7-cc64b594336a] 'to message ID %s' % msg_id)
2015-01-27 10:55:34.025 32626 TRACE nova.compute.manager [instance: 63145652-811e-4124-8ef7-cc64b594336a] MessagingTimeout: Timed out waiting for a reply to message ID 2717e1139f6d414dbebbc85164e05f9c
2015-01-27 10:55:34.025 32626 TRACE nova.compute.manager [instance: 63145652-811e-4124-8ef7-cc64b594336a]

This job uses nova-network, it's doing an rpc call on the network deallocate so if something is holding a lock in the same area would could be timing out that way, need to check the code to see what's all using locks at the same time in these flows.

Tags: network
Matt Riedemann (mriedem)
tags: added: network
Matt Riedemann (mriedem)
Changed in nova:
status: New → Confirmed
Revision history for this message
Matt Riedemann (mriedem) wrote :

From logstash it looks like this showed up on 1/19.

Revision history for this message
Joe Gordon (jogo) wrote :

2015-01-27 22:24:46.812 DEBUG oslo_concurrency.lockutils [req-3392f0ad-b103-4648-8a72-ea1dfd8678a5 TestLargeOpsScenario-1669034559 TestLargeOpsScenario-1787167515] Lock "dnsmasq_start" acquired by "restart_dhcp" :: waited 38.814s inner /usr/local/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py:430

http://logs.openstack.org/65/140565/6/gate/gate-tempest-dsvm-large-ops/d40a5c9/logs/screen-n-net.txt.gz?#_2015-01-27_22_24_46_812

Changed in nova:
importance: Undecided → Critical
Revision history for this message
Matt Riedemann (mriedem) wrote :
Revision history for this message
Joe Gordon (jogo) wrote :

This bug started on the 19th

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/150607

Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
status: Confirmed → In Progress
Revision history for this message
Joe Gordon (jogo) wrote :

Based on logs from before this patch was introduced http://logs.openstack.org/45/142345/1/check//gate-tempest-dsvm-large-ops/4ca3f72/logs/screen-n-net.txt.gz

We are holding the lock for the same amount of time, but waiting longer to acquire it.

Revision history for this message
Joe Gordon (jogo) wrote :

Contrary to my previous comment:

average time the lock was held with this issue:

Average: 0.442208

And before:

Average: 0.261565

Revision history for this message
Joe Gordon (jogo) wrote :

grep "restart_dhcp\" \:\: held" screen-n-net.txt.good | cut -d' ' -f15 | awk '{s+=$1} END {print "Average: " s/NR}'

Revision history for this message
Joe Gordon (jogo) wrote :

After further investigation, it looks like https://review.openstack.org/#/c/38000/ appears to not be working anymore, rootwrap command are taking longer again, and they don't have the "PBR Generated from" line in `which nova-rootwrap`

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

Here's an old one from Nov

stack@bigblue:~$ cat /usr/local/bin/nova-rootwrap
#!/usr/bin/python
# PBR Generated from u'console_scripts'

import sys

from oslo.rootwrap.cmd import main

if __name__ == "__main__":
    sys.exit(main())
stack@bigblue:~$
stack@bigblue:~$ ls -altr /usr/local/bin/nova-rootwrap
-rwxr-xr-x 1 root root 156 Nov 12 13:21 /usr/local/bin/nova-rootwrap

Revision history for this message
Joe Gordon (jogo) wrote :
Revision history for this message
Joe Gordon (jogo) wrote :

current rootwrap:

#!/usr/bin/python
# EASY-INSTALL-ENTRY-SCRIPT: 'nova==2015.1.dev567','console_scripts','nova-rootwrap'
__requires__ = 'nova==2015.1.dev567'
import sys
from pkg_resources import load_entry_point

if __name__ == '__main__':
    sys.exit(
        load_entry_point('nova==2015.1.dev567', 'console_scripts', 'nova-rootwrap')()
    )

Revision history for this message
Joe Gordon (jogo) wrote :

looks like this is the issue:

https://bitbucket.org/pypa/setuptools/commits/b7ff4497056b8b940b6b5c0b313988f54c4e9cd3?at=12.0.5

pip install -e hg+https://bitbucket.org/pypa/setuptools/@8b136e4#egg=setuptools

from python-novaclient:

 python setup.py develop && cat /usr/local/bin/nova

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :
Revision history for this message
Joe Gordon (jogo) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/150904

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.openstack.org/150607
Reason: This is supposed to fix it:

https://review.openstack.org/#/c/150890/

Matt Riedemann (mriedem)
Changed in pbr:
status: New → In Progress
Changed in nova:
status: In Progress → Invalid
Changed in pbr:
assignee: nobody → Monty Taylor (mordred)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/150904
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=10c510ec922792c6c59c9d7fe59951997b3a0aae
Submitter: Jenkins
Branch: master

commit 10c510ec922792c6c59c9d7fe59951997b3a0aae
Author: Chet Burgess <email address hidden>
Date: Wed Jan 28 10:20:24 2015 -0800

    Use a workarounds option to disable rootwrap

    Create a workarounds option to disable rootwrap to work around
    recent setuptools changes that impacted PBRs ability to make
    rootwrap performance not terrible.

    DocImpact
    Related-Bug: #1415106

    Change-Id: Iad1e57b18e490a10b1638f00f21e05e7e2a68b22

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to pbr (feature/0.10)

Reviewed: https://review.openstack.org/152691
Committed: https://git.openstack.org/cgit/openstack-dev/pbr/commit/?id=1457aafa6095da4e33a6eb0a52218f077d991a67
Submitter: Jenkins
Branch: feature/0.10

commit 1457aafa6095da4e33a6eb0a52218f077d991a67
Author: Monty Taylor <email address hidden>
Date: Wed Jan 28 06:57:06 2015 -0800

    Support script text override for newer develop

    setuptools > 12 has a new flow for writing out generated script text.
    It's nicer, actually, because it means we can just subclass and extend
    one method instead of monekypatching.

    Closes-Bug: 1415106
    Change-Id: I56e7bea60df8a59d859575d426ce93c45ffee314

Matt Riedemann (mriedem)
Changed in pbr:
status: In Progress → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.