VMs don't get ip from dhcp after compute restart

Bug #1853613 reported by Darragh O'Reilly
This bug affects 1 person
Affects           Status        Importance  Assigned to       Milestone
neutron           Fix Released  Undecided   Darragh O'Reilly
neutron (Ubuntu)  Incomplete    Undecided   Unassigned
  Bionic          Fix Released  Undecided   Unassigned

Bug Description

(For SRU template, please see bug 1869808, as the SRU info there applies to this bug also)

Env: pike + ovs + vxlan + l2pop + iptables_hybrid.
DHCP agent on a different node than compute.

Steps:
1. Boot 4 or more vms to same compute and same vxlan net.
2. Wait until they are fully running and reboot compute node.
3. After boot the vms are in status SHUTOFF. Start the vms.

The VMs don't get an IP address from neutron DHCP. The flood-to-tunnels flow (br-tun table 22) for the network is missing, so broadcasts such as DHCP requests never get onto a tunnel to the node with the DHCP agent. The neutron server did not send the flooding entry to the agent; it only does that for the first or second active port, or if the agent has restarted.

After the compute boots, neutron-ovs-cleanup runs first and deletes the qvo ports from br-int [4]. Then the ovs-agent starts, and nova-compute after it. Nova-compute destroys the domains and moves the vms to SHUTOFF status. It also (for some reason) recreates the qbr linux bridges and the qvb/qvo veths connected to br-int. So neutron continues to see the ports as ACTIVE even though the vms are SHUTOFF, and agent_active_ports [1] never drops below 3. Also, nova-compute might start a short time after the ovs-agent, so the new ports are not detected in the first iteration of the ovs agent loop and agent_restarted will be false here [2].
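
The timing described above can be illustrated with a small model. This is only a sketch, not neutron code: `report_devices_up`, the port names and the iteration layout are hypothetical, and the one behavior taken from this report is that agent_restarted is set on the first loop iteration only.

```python
def report_devices_up(iterations):
    """Model the ovs agent loop and return (port, agent_restarted) pairs.

    `iterations` is a list where item i holds the ports first detected
    on loop iteration i. All names here are hypothetical.
    """
    reports = []
    for iter_num, new_ports in enumerate(iterations):
        # Per the report, the flag is only true on the first iteration.
        agent_restarted = iter_num == 0
        for port in new_ports:
            reports.append((port, agent_restarted))
    return reports


# nova-compute starts after the agent, so the recreated qvo ports are
# first seen on iteration 1, when agent_restarted is already false.
late_ports = ["qvo-vm1", "qvo-vm2", "qvo-vm3", "qvo-vm4"]
reports = report_devices_up([[], late_ports])
print(reports)
```

In this model every port is reported with the flag unset, which is the situation where the server has no reason to resend the flooding entries.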

Before [3], agent_restarted was true if the agent had been running for less than agent_boot_time (default 180 sec), and the problem did not show.
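
A simplified model of the server-side decision, before and after [3], may make the difference clearer. It is a sketch under assumptions: the function names are hypothetical, and the `first_ports=2` threshold is taken from the "first or second active port" wording above, so the exact comparison in mech_driver.py may differ.

```python
AGENT_BOOT_TIME = 180  # seconds; the default quoted in this report


def agent_restarted_before_change(agent_uptime, agent_boot_time=AGENT_BOOT_TIME):
    """Before [3]: the agent counted as restarted while its uptime was short."""
    return agent_uptime < agent_boot_time


def sends_flood_entries(agent_active_ports, agent_restarted, first_ports=2):
    """Simplified decision on whether the server sends the flooding entries.

    first_ports=2 is an assumption based on the bug description.
    """
    return agent_active_ports <= first_ports or agent_restarted


# The ports stay ACTIVE across the reboot, so the count is still 4.
after_change = sends_flood_entries(4, agent_restarted=False)
before_change = sends_flood_entries(4, agent_restarted_before_change(60))
print(after_change, before_change)
```

With four ACTIVE ports and an agent that has been up for 60 seconds, the old uptime check would still have triggered the flooding entries, while a flag that is only set on the first loop iteration does not.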

It does not happen if neutron-ovs-cleanup is disabled. Then the ovs agent first treats them as skipped_devices and they get status DOWN.

[1] https://github.com/openstack/neutron/blob/21a52f7ae597f7992f32ff41cedff0c31e35c762/neutron/plugins/ml2/drivers/l2pop/mech_driver.py#L306
[2] https://github.com/openstack/neutron/blob/21a52f7ae597f7992f32ff41cedff0c31e35c762/neutron/plugins/ml2/drivers/l2pop/mech_driver.py#L310
[3] https://opendev.org/openstack/neutron/commit/62fe7852bbd70a24174853997096c52ee015e269
[4] https://bugs.launchpad.net/neutron/+bug/1853582

tags: added: l2-pop ovs
Revision history for this message
Miguel Lavalle (minsel) wrote :

@Darragh,

Given https://bugs.launchpad.net/neutron/+bug/1853582, it seems your assumption was that ovs_all_ports set to False prevented the deletion of ports created by Nova. But as I indicated in that bug, your assumption is not valid. Do you still believe we have to pursue this bug? If yes, how should the report be updated?

Revision history for this message
Darragh O'Reilly (darragh-oreilly) wrote :

@Miguel,

Yes I think we should pursue this bug. I mentioned the ovs-cleanup here because it's needed to reproduce the problem in devstack.

The problem was first seen on a production system installed by a distro, and I was able to reproduce with the distro. But I was having difficulty reproducing on devstack. Unlike the distro, devstack was not running ovs-cleanup at boot, but I didn't think this was relevant because I assumed it was not deleting nova ports (ovs_all_ports was false). After I found that assumption was incorrect, I added a systemd script to run ovs-cleanup, changed the timings so q-agt started 4 sec after ovs-cleanup, and n-cpu started 4 sec after q-agt, and then the problem happened.

I don't think ovs-cleanup is the problem here. The problem was introduced with https://opendev.org/openstack/neutron/commit/62fe7852bbd70a24174853997096c52ee015e269 , but I'm not sure how it should be fixed.

Changed in neutron:
assignee: nobody → Darragh O'Reilly (darragh-oreilly)
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/697655

Revision history for this message
Bence Romsics (bence-romsics) wrote :

I just started digesting this bug, but I may have found a previous occurrence which seems to have quite a deep analysis. Might this be a duplicate of the following bug?

https://bugs.launchpad.net/neutron/+bug/1681979

Revision history for this message
Darragh O'Reilly (darragh-oreilly) wrote :

@Bence

No, that bug is from 2017. This problem is a consequence of agent_boot_time no longer being used, a change that was made this year: https://opendev.org/openstack/neutron/commit/a5244d6d44d2b66de27dc77efa7830fa657260be

Revision history for this message
Darragh O'Reilly (darragh-oreilly) wrote :

Steps to reproduce using multinode devstack:

- Boot 4 or more vms to same compute and same vxlan net.
- Wait until they are fully working.
- Check the flood to tun flow is present:
    ovs-ofctl dump-flows br-tun table=22 | grep priority=1
- On compute, disable services so they won't start automatically on boot:
    systemctl disable <email address hidden> <email address hidden> <email address hidden>
- Reboot compute.
- Run neutron-ovs-cleanup.
- systemctl start <email address hidden>
- journalctl -f -u <email address hidden>
- Wait until loop iteration > 0.
- systemctl start <email address hidden>
- The vms are SHUTOFF; start them:
    nova start vm1 vm2 vm3 vm4
- Wait until the vms show ACTIVE in nova list.
- Check the flood to tun flow is missing:
    ovs-ofctl dump-flows br-tun table=22 | grep priority=1
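
The flow check in the steps above could also be scripted. The following is a minimal sketch: `has_flood_flow` is a hypothetical helper, and the two sample flow lines are illustrative rather than captured from a real node. Unlike the plain grep, it matches the whole field, so a value such as priority=10 is not counted.

```python
def has_flood_flow(dump_flows_output):
    """Return True if any flow line has a field equal to priority=1.

    The input is the text printed by
    `ovs-ofctl dump-flows br-tun table=22`.
    """
    for line in dump_flows_output.splitlines():
        # Fields are separated by commas and spaces; normalise to commas.
        fields = line.replace(" ", ",").split(",")
        if "priority=1" in fields:
            return True
    return False


# Illustrative sample lines only.
with_flood = (
    " cookie=0x0, duration=10.5s, table=22, n_packets=0, n_bytes=0, "
    "priority=1,dl_vlan=1 actions=strip_vlan,output:2\n"
    " cookie=0x0, duration=10.5s, table=22, n_packets=0, n_bytes=0, "
    "priority=0 actions=drop\n"
)
without_flood = (
    " cookie=0x0, duration=10.5s, table=22, n_packets=0, n_bytes=0, "
    "priority=0 actions=drop\n"
)
print(has_flood_flow(with_flood), has_flood_flow(without_flood))
```

On a compute node the input would come from running the ovs-ofctl command shown in the steps, for example through `subprocess.check_output(..., text=True)`.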

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/697655
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=93e9dc5426764b791ac69e62c6d60be7591c16ab
Submitter: Zuul
Branch: master

commit 93e9dc5426764b791ac69e62c6d60be7591c16ab
Author: Darragh O'Reilly <email address hidden>
Date: Fri Dec 6 10:06:21 2019 +0000

    ovs agent: signal to plugin if tunnel refresh needed

    Currently the ovs agent calls update_device_list with the
    agent_restarted flag set only on the first loop iteration. Then the
    server knows to send the l2pop flooding entries for the network to
    the agent. But when a compute node with many instances on many
    networks reboots, it takes time to readd all the active devices and
    some may be readded after the first loop iteration. Then the server
    can fail to send the flooding entries which means there will be no
    flood_to_tuns flow and broadcasts like dhcp will fail.

    This patch fixes that by renaming the agent_restarted flag to
    refresh_tunnels and setting it if the agent has not received the
    flooding entries for the network.

    Change-Id: I607aa8fa399e72b037fd068ad4f02b6210e57e91
    Closes-Bug: #1853613
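
The idea in the commit message above can be sketched as follows. This is not the merged patch: `TunnelRefreshTracker` and its methods are hypothetical names, and the sketch only models the rule that the flag is set when the agent has not yet received the flooding entries for a network.

```python
class TunnelRefreshTracker:
    """Hypothetical helper modelling the refresh_tunnels decision."""

    def __init__(self):
        # Networks for which flooding entries have been received.
        self.networks_with_flood_entries = set()

    def flood_entries_received(self, network_id):
        """Record that the server sent flooding entries for a network."""
        self.networks_with_flood_entries.add(network_id)

    def refresh_tunnels_needed(self, device_networks):
        """True if any network of the reported devices lacks entries."""
        return any(
            net not in self.networks_with_flood_entries
            for net in device_networks
        )


tracker = TunnelRefreshTracker()
# Devices readded after the first loop iteration still trigger a refresh.
needed_late = tracker.refresh_tunnels_needed(["net-a"])
tracker.flood_entries_received("net-a")
needed_after = tracker.refresh_tunnels_needed(["net-a"])
print(needed_late, needed_after)
```

Tying the flag to what the agent has actually received, rather than to the loop iteration number, means devices that show up late after a reboot are still covered.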

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/704173

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/train)

Reviewed: https://review.opendev.org/704173
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=bc0ab0fcd721d3fb01fd83291269a586f50efa0e
Submitter: Zuul
Branch: stable/train

commit bc0ab0fcd721d3fb01fd83291269a586f50efa0e
Author: Darragh O'Reilly <email address hidden>
Date: Fri Jan 24 16:19:30 2020 +0000

    ovs agent: signal to plugin if tunnel refresh needed

    Patch https://review.opendev.org/#/c/697655/ cannot be backported
    because it includes an RPC version change. This patch is for the
    stable branches.

    Currently the ovs agent calls update_device_list with the
    agent_restarted flag set only on the first loop iteration. Then the
    server knows to send the l2pop flooding entries for the network to
    the agent. But when a compute node with many instances on many
    networks reboots, it takes time to readd all the active devices and
    some may be readded after the first loop iteration. Then the server
    can fail to send the flooding entries which means there will be no
    flood_to_tuns flow and broadcasts like dhcp will fail.

    This patch fixes that by also setting the agent_restarted flag if
    the agent has not received the flooding entries for a network.

    Change-Id: Iccc4fe4a785ee042fd76a663d0e76a27facd1809
    Closes-Bug: #1853613

tags: added: in-stable-train
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/708748

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 16.0.0.0b1

This issue was fixed in the openstack/neutron 16.0.0.0b1 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/709445

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/709446

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/stein)

Reviewed: https://review.opendev.org/708748
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=aee87e72b1d456da66d719ccf19054ac1f285a7b
Submitter: Zuul
Branch: stable/stein

commit aee87e72b1d456da66d719ccf19054ac1f285a7b
Author: Darragh O'Reilly <email address hidden>
Date: Fri Jan 24 16:19:30 2020 +0000

    ovs agent: signal to plugin if tunnel refresh needed

    Patch https://review.opendev.org/#/c/697655/ cannot be backported
    because it includes an RPC version change. This patch is for the
    stable branches.

    Currently the ovs agent calls update_device_list with the
    agent_restarted flag set only on the first loop iteration. Then the
    server knows to send the l2pop flooding entries for the network to
    the agent. But when a compute node with many instances on many
    networks reboots, it takes time to readd all the active devices and
    some may be readded after the first loop iteration. Then the server
    can fail to send the flooding entries which means there will be no
    flood_to_tuns flow and broadcasts like dhcp will fail.

    This patch fixes that by also setting the agent_restarted flag if
    the agent has not received the flooding entries for a network.

    Change-Id: Iccc4fe4a785ee042fd76a663d0e76a27facd1809
    Closes-Bug: #1853613
    (cherry picked from commit bc0ab0fcd721d3fb01fd83291269a586f50efa0e)

tags: added: in-stable-stein
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.opendev.org/709683

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/rocky)

Reviewed: https://review.opendev.org/709445
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=d1e2b840b57ba9425d5ea710a0509b677eefcc08
Submitter: Zuul
Branch: stable/rocky

commit d1e2b840b57ba9425d5ea710a0509b677eefcc08
Author: Darragh O'Reilly <email address hidden>
Date: Fri Jan 24 16:19:30 2020 +0000

    ovs agent: signal to plugin if tunnel refresh needed

    Patch https://review.opendev.org/#/c/697655/ cannot be backported
    because it includes an RPC version change. This patch is for the
    stable branches.

    Currently the ovs agent calls update_device_list with the
    agent_restarted flag set only on the first loop iteration. Then the
    server knows to send the l2pop flooding entries for the network to
    the agent. But when a compute node with many instances on many
    networks reboots, it takes time to readd all the active devices and
    some may be readded after the first loop iteration. Then the server
    can fail to send the flooding entries which means there will be no
    flood_to_tuns flow and broadcasts like dhcp will fail.

    This patch fixes that by also setting the agent_restarted flag if
    the agent has not received the flooding entries for a network.

    Change-Id: Iccc4fe4a785ee042fd76a663d0e76a27facd1809
    Closes-Bug: #1853613
    (cherry picked from commit bc0ab0fcd721d3fb01fd83291269a586f50efa0e)
    (cherry picked from commit aee87e72b1d456da66d719ccf19054ac1f285a7b)

tags: added: in-stable-rocky
tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/queens)

Reviewed: https://review.opendev.org/709446
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=576e76f209ce69d99606ce126cfd4a96d99f8e88
Submitter: Zuul
Branch: stable/queens

commit 576e76f209ce69d99606ce126cfd4a96d99f8e88
Author: Darragh O'Reilly <email address hidden>
Date: Fri Jan 24 16:19:30 2020 +0000

    ovs agent: signal to plugin if tunnel refresh needed

    Patch https://review.opendev.org/#/c/697655/ cannot be backported
    because it includes an RPC version change. This patch is for the
    stable branches.

    Currently the ovs agent calls update_device_list with the
    agent_restarted flag set only on the first loop iteration. Then the
    server knows to send the l2pop flooding entries for the network to
    the agent. But when a compute node with many instances on many
    networks reboots, it takes time to readd all the active devices and
    some may be readded after the first loop iteration. Then the server
    can fail to send the flooding entries which means there will be no
    flood_to_tuns flow and broadcasts like dhcp will fail.

    This patch fixes that by also setting the agent_restarted flag if
    the agent has not received the flooding entries for a network.

    Change-Id: Iccc4fe4a785ee042fd76a663d0e76a27facd1809
    Closes-Bug: #1853613
    (cherry picked from commit bc0ab0fcd721d3fb01fd83291269a586f50efa0e)
    (cherry picked from commit aee87e72b1d456da66d719ccf19054ac1f285a7b)
    (cherry picked from commit d1e2b840b57ba9425d5ea710a0509b677eefcc08)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 13.0.7

This issue was fixed in the openstack/neutron 13.0.7 release.

tags: added: in-stable-pike
tags: added: neutron-proactive-backport-potential
tags: removed: neutron-proactive-backport-potential
Revision history for this message
Brian Murray (brian-murray) wrote : Missing SRU information

Thanks for uploading the fix for this bug report to -proposed. However, when reviewing the package in -proposed and the details of this bug report I noticed that the bug description is missing information required for the SRU process. You can find full details at http://wiki.ubuntu.com/StableReleaseUpdates#Procedure but essentially this bug is missing some of the following: a statement of impact, a test case and details regarding the regression potential. Thanks in advance!

Changed in neutron (Ubuntu):
status: New → Incomplete
Dan Streetman (ddstreet)
description: updated
Revision history for this message
Dan Streetman (ddstreet) wrote :

> bug description is missing information required for the SRU process

sorry, I updated the description to refer this bug to the sru template in bug 1869808 as that sru info applies to all other bugs in the upload.

Revision history for this message
Łukasz Zemczak (sil2100) wrote : Please test proposed package

Hello Darragh, or anyone else affected,

Accepted neutron into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/neutron/2:12.1.1-0ubuntu4 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in neutron (Ubuntu Bionic):
status: New → Fix Committed
tags: added: verification-needed verification-needed-bionic
Revision history for this message
Edward Hope-Morley (hopem) wrote :

All SRU verification completed and performed in https://bugs.launchpad.net/neutron/+bug/1869808 so please refer to that LP for the results.

tags: added: verification-done verification-done-bionic
removed: verification-needed verification-needed-bionic
Revision history for this message
Łukasz Zemczak (sil2100) wrote : Update Released

The verification of the Stable Release Update for neutron has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package neutron - 2:12.1.1-0ubuntu4

---------------
neutron (2:12.1.1-0ubuntu4) bionic; urgency=medium

  * Fix interrupt of VLAN traffic on reboot of neutron-ovs-agent:
    - d/p/0001-ovs-agent-signal-to-plugin-if-tunnel-refresh-needed.patch (LP: #1853613)
    - d/p/0002-Do-not-block-connection-between-br-int-and-br-phys-o.patch (LP: #1869808)
    - d/p/0003-Ensure-that-stale-flows-are-cleaned-from-phys_bridge.patch (LP: #1864822)
    - d/p/0004-DVR-Reconfigure-re-created-physical-bridges-for-dvr-.patch (LP: #1864822)
    - d/p/0005-Ensure-drop-flows-on-br-int-at-agent-startup-for-DVR.patch (LP: #1887148)
    - d/p/0006-Don-t-check-if-any-bridges-were-recrected-when-OVS-w.patch (LP: #1864822)
    - d/p/0007-Not-remove-the-running-router-when-MQ-is-unreachable.patch (LP: #1871850)

 -- Edward Hope-Morley <email address hidden> Mon, 22 Feb 2021 16:55:40 +0000

Changed in neutron (Ubuntu Bionic):
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron pike-eol

This issue was fixed in the openstack/neutron pike-eol release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron queens-eol

This issue was fixed in the openstack/neutron queens-eol release.
