test_cleanup_stale_devices functional test sporadic failures
Bug #1604115 reported by Assaf Muller
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
neutron | Fix Released | High | Ihar Hrachyshka | | |
Bug Description
19 hits in the last 7 days
build_status:
Example TRACE failure:
http://
Example log from testrunner:
http://
tags: | added: neutron-proactive-backport-potential |
tags: | removed: neutron-proactive-backport-potential |
Both successful and failed runs show the same log output (snipped and simplified):
Found stale device tapfoo_id2, deleting _cleanup_stale_devices neutron/agent/linux/dhcp.py:1215
Running txn command(idx=0): DelPortCommand(bridge=test-br75746765, port=tapfoo_id2, if_exists=True) do_commit neutron/agent/ovsdb/impl_idl.py:83
Found stale device tapfoo_id3, deleting _cleanup_stale_devices neutron/agent/linux/dhcp.py:1215
Running txn command(idx=0): DelPortCommand(bridge=test-br75746765, port=tapfoo_id3, if_exists=True) do_commit neutron/agent/ovsdb/impl_idl.py:83
Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-foo_id', 'find', '/sys/class/net', '-maxdepth', '1', '-type', 'l', '-printf', '%f '] execute_rootwrap_daemon neutron/agent/linux/utils.py:99
We then assert that the find command returns exactly one non-loopback device in the namespace: the DHCP interface itself. In failed runs, however, it returns 2 or 3 (I've seen both), even though the OVS delete command reported no errors. The only explanation I can come up with is that the deletion is asynchronous: the command returns before the device is entirely removed from the Linux network stack. If so, we have to decide whether Neutron's ovs_lib delete_port should return only once the device is gone, or whether the test should be fixed to poll get_devices until it returns the expected number of devices.
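For illustration, the second option (polling in the test) could look roughly like the sketch below. It is not the actual fix and does not use Neutron's own helpers: wait_until_true, wait_for_device_count, WaitTimeout and FakeNamespace are all hypothetical names written for this example, and the DHCP interface name "tap-dhcp" is made up. FakeNamespace only simulates a namespace whose stale devices linger for a few polls after the delete command has returned.

```python
import time


class WaitTimeout(Exception):
    """Raised when the awaited condition does not become true in time."""


def wait_until_true(predicate, timeout=5.0, sleep=0.1):
    """Poll predicate() until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while not predicate():
        if time.monotonic() >= deadline:
            raise WaitTimeout("condition not met within %.2fs" % timeout)
        time.sleep(sleep)


def wait_for_device_count(get_devices, expected, timeout=5.0, sleep=0.1):
    """Block until get_devices() reports exactly `expected` devices."""
    wait_until_true(lambda: len(get_devices()) == expected,
                    timeout=timeout, sleep=sleep)


class FakeNamespace(object):
    """Hypothetical stand-in for a namespace with lingering stale devices.

    The stale devices stay visible for `polls_until_gone` calls to
    get_devices(), mimicking a delete that completes asynchronously.
    """

    def __init__(self, live, stale, polls_until_gone):
        self._live = list(live)
        self._stale = list(stale)
        self._polls_left = polls_until_gone

    def get_devices(self):
        if self._polls_left > 0:
            self._polls_left -= 1
            return self._live + self._stale
        return list(self._live)


if __name__ == "__main__":
    # Device names tapfoo_id2/tapfoo_id3 come from the log above;
    # "tap-dhcp" is an assumed name for the DHCP interface.
    ns = FakeNamespace(["tap-dhcp"], ["tapfoo_id2", "tapfoo_id3"],
                       polls_until_gone=3)
    wait_for_device_count(ns.get_devices, expected=1, timeout=2.0, sleep=0.01)
    print(ns.get_devices())
```

Polling with a timeout keeps the test deterministic when the deletion is merely slow, while still failing (via WaitTimeout) if the stale devices are never removed. The trade-off is that it hides the asynchronous behaviour from callers of delete_port rather than fixing it there.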