L3 HA: multiple agents are active at the same time

Bug #1731595 reported by Xav Paice
This bug affects 4 people
Affects                  Status         Importance  Assigned to    Milestone
Ubuntu Cloud Archive     Fix Released   High        Corey Bryant
  Mitaka                 Fix Released   High        Unassigned
  Newton                 Fix Released   High        Unassigned
  Ocata                  Fix Released   High        Corey Bryant
  Pike                   Fix Released   High        Corey Bryant
  Queens                 Fix Released   High        Corey Bryant
neutron                  Fix Released   High        venkata anil
neutron (Ubuntu)         Fix Released   High        Corey Bryant
  Xenial                 Fix Released   High        Unassigned
  Zesty                  Won't Fix      High        Corey Bryant
  Artful                 Fix Released   High        Corey Bryant
  Bionic                 Fix Released   High        Corey Bryant

Bug Description

OS: Xenial, Ocata from Ubuntu Cloud Archive
We have three neutron-gateway hosts, with L3 HA enabled and a min of 2, max of 3. There are approx. 400 routers defined.

At some point (we weren't monitoring exactly) a number of the routers changed from having one active agent and one or more standby agents to having more than one active agent. Each of the 'active' namespaces had the same IP addresses allocated, which caused traffic problems reaching instances.

Removing the routers from all but one agent, and re-adding, resolved the issue. Restarting one l3 agent also appeared to resolve the issue, but very slowly, to the point where we needed the system alive again faster and reverted to removing/re-adding.

At the same time, a number of routers were listed without any active agents at all. This situation appears to have been resolved by adding the routers to agents, after several minutes of downtime.

I'm finding it very difficult to find relevant keepalived messages to indicate what's going on, but what I do notice is that all the agents have equal priority and are configured as 'backup'.

I am trying to figure out a way to get a reproducer of this, it might be that we need to have a large number of routers configured on a small number of gateways.
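
For anyone wanting to check for the two symptoms described above (more than one active agent, or no active agent), here is a minimal sketch. It assumes you have already collected the per-router HA states into a dictionary; the function name, the input shape and the sample data are hypothetical and not part of neutron.

```python
from typing import Dict, List, Tuple


def find_ha_anomalies(
    router_states: Dict[str, List[str]],
) -> Tuple[List[str], List[str]]:
    """Return (routers with >1 active agent, routers with no active agent).

    router_states maps a router ID to the HA state reported by each
    agent hosting it, e.g. ["active", "standby", "standby"].
    """
    multi_active = []
    none_active = []
    for router_id, states in sorted(router_states.items()):
        active_count = sum(1 for state in states if state == "active")
        if active_count > 1:
            multi_active.append(router_id)
        elif active_count == 0:
            none_active.append(router_id)
    return multi_active, none_active


# Hypothetical sample data: one healthy router, one with two active
# agents, and one with every agent in standby.
sample = {
    "router-a": ["active", "standby", "standby"],
    "router-b": ["active", "active", "standby"],
    "router-c": ["standby", "standby", "standby"],
}
multi, orphaned = find_ha_anomalies(sample)
print("multiple active:", multi)
print("no active:", orphaned)
```

How the states are gathered (CLI output, API calls) is left out on purpose, since that depends on the deployment.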

Revision history for this message
Xav Paice (xavpaice) wrote :

See https://bugs.launchpad.net/neutron/+bug/1597461 which could be related, but we're running 10.0.3-0ubuntu1~cloud0.

Keepalived is 1.2.19-1ubuntu0.2

tags: added: l3-ha
Revision history for this message
Brian Haley (brian-haley) wrote :

Do you see any failures in the keepalived logs? Something like "Netlink: Received message overrun (No buffer space available)" ?

I've seen another report of this, and looking through the keepalived bugs/changes it seems there was a fix for that, then a bigger change in 1.3.6 titled "Add notify FIFO":

https://github.com/acassen/keepalived/commit/04905cdcb7d2b2fe4aaee9eabdf7f6945726f3c4

https://github.com/acassen/keepalived/issues/584
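
A quick way to look for the overrun message quoted above is to filter the log lines for it. This is only a sketch: the helper name is hypothetical, and the sample lines are made up for illustration (only the overrun text itself comes from this comment).

```python
from typing import Iterable, List

# Message text as quoted in this comment.
OVERRUN_MARKER = "Netlink: Received message overrun"


def find_netlink_overruns(lines: Iterable[str]) -> List[str]:
    """Return the log lines that contain the netlink overrun message."""
    return [line.rstrip("\n") for line in lines if OVERRUN_MARKER in line]


# Made-up sample lines; real syslog prefixes will look different.
sample_log = [
    "Keepalived_vrrp[1234]: Netlink: Received message overrun (No buffer space available)\n",
    "Keepalived_vrrp[1234]: some unrelated message\n",
]
for hit in find_netlink_overruns(sample_log):
    print(hit)
```

To run it against a real file, pass an open file object, for example `find_netlink_overruns(open(path))` with `path` pointing at whichever log keepalived writes to on your hosts.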

Revision history for this message
venkata anil (anil-venkata) wrote :

https://bugs.launchpad.net/neutron/+bug/1597461 is not related; it fixes the l3 agent restart scenario. It looks like they are seeing multiple masters without restarting the agent.

Revision history for this message
Brian Haley (brian-haley) wrote :

So I have heard of someone trying with keepalived version 1.3.9 and still seeing this failure, so that "Add notify FIFO" change wasn't the silver bullet it seemed like.

Revision history for this message
Xav Paice (xavpaice) wrote :

In answer to "Do you see any failures in the keepalived logs?", no, unfortunately no indication of the reason for switching to master, just that it did. Same for syslog.

Revision history for this message
Brian Haley (brian-haley) wrote :

Thanks for the info.

I realize it's hard to reproduce, but if you know a time when it happened and could attach logs from the neutron server, l3-agent, and keepalived from around that timeframe, it might help to narrow down the possible problem.

Changed in neutron:
status: New → Confirmed
importance: Undecided → High
Changed in neutron:
assignee: nobody → venkata anil (anil-venkata)
Revision history for this message
venkata anil (anil-venkata) wrote :

In https://review.openstack.org/#/c/470905/4/neutron/api/rpc/handlers/l3_rpc.py we wanted to set the status of all HA network ports (of an l3 agent) to DOWN when that l3 agent is restarted. But we assumed that fetch_and_sync_all_routers (which invokes get_router_ids [1]) is called only once, during l3 agent restart.

In our customer setup, we have sometimes seen the l3 agent unable to report state (because it is busy setting HA network port statuses to DOWN and handling the corresponding router update notifications), which results in the l3 agent state being set to AGENT_REVIVED.
When the agent state is AGENT_REVIVED, the HA network ports are again set to DOWN, resulting in
1) the ovs agent rebinding the ports
2) the l3 agent receiving multiple router updates
Because the server, ovs agent and l3 agent are busy with this unnecessary processing and these RPC calls, the l3 agent again fails to report state (AGENT_REVIVED again) and to run its periodic syncs.

To fix this, we need to make sure _update_ha_network_port_status() is called only when the l3 agent is restarted.

[1] https://github.com/openstack/neutron/blob/master/neutron/agent/l3/agent.py#L593
[2] https://github.com/openstack/neutron/blob/master/neutron/agent/l3/agent.py#L743
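
To make the loop described above concrete, here is a toy model. It is not neutron code: the event names and the function are hypothetical, and it only counts how often the HA ports would be reset to DOWN when the reset is tied to every full sync versus only to agent start.

```python
from typing import Iterable


def count_port_resets(events: Iterable[str], reset_on_every_full_sync: bool) -> int:
    """Count how many times HA network ports are set to DOWN.

    events is a sequence of "start" (agent restarted) and "revived"
    (server marked the agent AGENT_REVIVED, triggering a full sync).
    """
    resets = 0
    for event in events:
        if event == "start":
            # A reset on agent start happens in both variants.
            resets += 1
        elif event == "revived" and reset_on_every_full_sync:
            # Behaviour described in this comment: each AGENT_REVIVED
            # full sync resets the ports again, adding more load.
            resets += 1
    return resets


# One restart followed by three missed state reports.
events = ["start", "revived", "revived", "revived"]
print("reset on every full sync:", count_port_resets(events, True))
print("reset only on start:", count_port_resets(events, False))
```

In the real system each extra reset also makes the next missed state report more likely, which the toy model does not capture.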

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/522641

Changed in neutron:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/522784

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/522792

Ryan Beisner (1chb1n)
Changed in cloud-archive:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Corey Bryant (corey.bryant)
Changed in neutron (Ubuntu):
status: New → Triaged
importance: Undecided → High
assignee: nobody → Corey Bryant (corey.bryant)
Revision history for this message
John George (jog) wrote :

This bug falls under the Canonical Cloud Engineering service-level agreement (SLA) process, as a field critical bug.

Changed in neutron (Ubuntu Artful):
status: New → Triaged
Changed in neutron (Ubuntu Zesty):
status: New → Triaged
importance: Undecided → High
assignee: nobody → Corey Bryant (corey.bryant)
Changed in neutron (Ubuntu Artful):
assignee: nobody → Corey Bryant (corey.bryant)
importance: Undecided → High
Revision history for this message
Ryan Beisner (1chb1n) wrote :

@jog ack, confirmed. We're tracking it as such.

Revision history for this message
Brian Murray (brian-murray) wrote : Please test proposed package

Hello Xav, or anyone else affected,

Accepted neutron into artful-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/neutron/2:11.0.2-0ubuntu1.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-artful to verification-done-artful. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-artful. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in neutron (Ubuntu Artful):
status: Triaged → Fix Committed
tags: added: verification-needed verification-needed-artful
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/522641
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=9ed693228f90251c0f03fb842ef19628b439f9bc
Submitter: Zuul
Branch: master

commit 9ed693228f90251c0f03fb842ef19628b439f9bc
Author: venkata anil <email address hidden>
Date: Thu Nov 23 18:40:30 2017 +0000

    Call update_all_ha_network_port_statuses on agent start

    As explained in bug [1] when l3 agent fails to report state to the
    server, its state is set to AGENT_REVIVED, triggering
    fetch_and_sync_all_routers, which will set all its HA network ports
    to DOWN, resulting in
    1) ovs agent rewiring these ports and setting status to ACTIVE
    2) when these ports are active, server sends router update to l3 agent
    As server, ovs and l3 agents are busy with this processing, l3 agent
    may fail again reporting state, repeating this process.

    As l3 agent is repeatedly processing same routers, SIGHUPs are
    frequently sent to keepalived, resulting in multiple masters.

    To fix this, we call update_all_ha_network_port_statuses in l3 agent
    start instead of calling from fetch_and_sync_all_routers.

    [1] https://bugs.launchpad.net/neutron/+bug/1731595/comments/7

    Change-Id: Ia9d5549f7d53b538c9c9f93fe6aa71ffff15524a
    Related-bug: #1597461
    Closes-Bug: #1731595

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
Corey Bryant (corey.bryant) wrote : Please test proposed package

Hello Xav, or anyone else affected,

Accepted neutron into pike-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:pike-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-pike-needed to verification-pike-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-pike-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in neutron (Ubuntu Bionic):
status: Triaged → Fix Released
tags: added: verification-pike-needed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/pike)

Reviewed: https://review.openstack.org/522784
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=f6560d14b6125906048b74c65f1f974b31206df3
Submitter: Zuul
Branch: stable/pike

commit f6560d14b6125906048b74c65f1f974b31206df3
Author: venkata anil <email address hidden>
Date: Thu Nov 23 18:40:30 2017 +0000

    Call update_all_ha_network_port_statuses on agent start

    As explained in bug [1] when l3 agent fails to report state to the
    server, its state is set to AGENT_REVIVED, triggering
    fetch_and_sync_all_routers, which will set all its HA network ports
    to DOWN, resulting in
    1) ovs agent rewiring these ports and setting status to ACTIVE
    2) when these ports are active, server sends router update to l3 agent
    As server, ovs and l3 agents are busy with this processing, l3 agent
    may fail again reporting state, repeating this process.

    As l3 agent is repeatedly processing same routers, SIGHUPs are
    frequently sent to keepalived, resulting in multiple masters.

    To fix this, we call update_all_ha_network_port_statuses in l3 agent
    start instead of calling from fetch_and_sync_all_routers.

    [1] https://bugs.launchpad.net/neutron/+bug/1731595/comments/7

    Change-Id: Ia9d5549f7d53b538c9c9f93fe6aa71ffff15524a
    Related-bug: #1597461
    Closes-Bug: #1731595
    (cherry picked from commit 9ed693228f90251c0f03fb842ef19628b439f9bc)

Revision history for this message
James Page (james-page) wrote : Please test proposed package

Hello Xav, or anyone else affected,

Accepted neutron into queens-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:queens-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-queens-needed to verification-queens-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-queens-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-queens-needed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/ocata)

Reviewed: https://review.openstack.org/522792
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=385ac553e33f12c34e8a23459337b2f0af0b75eb
Submitter: Zuul
Branch: stable/ocata

commit 385ac553e33f12c34e8a23459337b2f0af0b75eb
Author: venkata anil <email address hidden>
Date: Thu Nov 23 18:40:30 2017 +0000

    Call update_all_ha_network_port_statuses on agent start

    As explained in bug [1] when l3 agent fails to report state to the
    server, its state is set to AGENT_REVIVED, triggering
    fetch_and_sync_all_routers, which will set all its HA network ports
    to DOWN, resulting in
    1) ovs agent rewiring these ports and setting status to ACTIVE
    2) when these ports are active, server sends router update to l3 agent
    As server, ovs and l3 agents are busy with this processing, l3 agent
    may fail again reporting state, repeating this process.

    As l3 agent is repeatedly processing same routers, SIGHUPs are
    frequently sent to keepalived, resulting in multiple masters.

    To fix this, we call update_all_ha_network_port_statuses in l3 agent
    start instead of calling from fetch_and_sync_all_routers.

    [1] https://bugs.launchpad.net/neutron/+bug/1731595/comments/7
    Conflicts:
            neutron/agent/l3/agent.py
            neutron/api/rpc/handlers/l3_rpc.py

    Note: This RPC update_all_ha_network_port_statuses is added in only pike
    and later branches. In older branches, we were using get_router_ids RPC
    to invoke _update_ha_network_port_status. As we need to invoke this
    functionality during l3 agent start and get_service_plugin_list() is the
    only available RPC which is called during l3 agent start, we call
    _update_ha_network_port_status from get_service_plugin_list.

    Change-Id: Ia9d5549f7d53b538c9c9f93fe6aa71ffff15524a
    Related-bug: #1597461
    Closes-Bug: #1731595
    (cherry picked from commit 9ab1ad1433d54fec3e5b04f1edf8ca436e1f7af1)
    (cherry picked from commit a6d985bbca57b5027eecaa43071964b14d9075d9)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 12.0.0.0b2

This issue was fixed in the openstack/neutron 12.0.0.0b2 development milestone.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

SRU details for Ubuntu:

[Impact]
Details of the issue are described thoroughly in this bug report. The fix prevents multiple L3HA masters from existing at the same time, and is already upstream for all affected branches.

[Test Case]
The following SRU process will be followed:
https://wiki.ubuntu.com/OpenStackUpdates

In order to avoid regression of existing consumers, the OpenStack team will run their continuous integration test against the packages that are in -proposed. A successful run of all available tests will be required before the proposed packages can be let into -updates.

The OpenStack team will be in charge of attaching the output summary of the executed tests. The OpenStack team members will not mark ‘verification-done’ until this has happened.

[Regression Potential]
The regression potential is lowered as the fix is cherry-picked without change from corresponding upstream stable branches. In order to mitigate the regression potential, the results of the aforementioned tests are attached to this bug.

[Discussion]

Revision history for this message
Łukasz Zemczak (sil2100) wrote : Please test proposed package

Hello Xav, or anyone else affected,

Accepted neutron into zesty-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/neutron/2:10.0.4-0ubuntu2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-zesty to verification-done-zesty. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-zesty. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in neutron (Ubuntu Zesty):
status: Triaged → Fix Committed
tags: added: verification-needed-zesty
Revision history for this message
Corey Bryant (corey.bryant) wrote :

Hello Xav, or anyone else affected,

Accepted neutron into ocata-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:ocata-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-ocata-needed to verification-ocata-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-ocata-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-ocata-needed
Revision history for this message
Xav Paice (xavpaice) wrote :

Please note, we now have a client affected by this running Mitaka as well.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

Hi Xav,

I took a look at the code and confirmed that this does appear to affect Newton and Mitaka so I've targeted those releases as well. We'll need to backport the Ocata patches to Newton and then Mitaka.

Corey

Changed in neutron (Ubuntu Xenial):
importance: Undecided → High
status: New → Triaged
Revision history for this message
Corey Bryant (corey.bryant) wrote :

SRU addendum for mitaka (xenial) and newton.

[Regression Potential]

For mitaka (xenial) and newton, the regression potential is a little higher than that of ocata+ as the patch(es) aren't available on the corresponding upstream branches (they're EOL). These patches were cherry picked from the upstream stable/ocata branch, and required slight modifications in order to apply to mitaka and newton. For mitaka, an additional patch was required as a pre-req (set-ha-network-port-to-down-when-l3-agent-starts.patch). This patch is already in the upstream branches and packages for newton+.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

I've uploaded new versions of neutron with backported patches to fix this issue to xenial (awaiting SRU team review) and newton-staging.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

Regression testing has completed successfully for artful, zesty, xenial-pike, and xenial-ocata.

artful-pike-proposed with stable charms:

======
Totals
======
Ran: 102 tests in 1551.5835 sec.
 - Passed: 93
 - Skipped: 9
 - Expected Fail: 0
 - Unexpected Success: 0
 - Failed: 0
Sum of execute time for each test: 732.9467 sec.

artful-pike-proposed with dev charms:

======
Totals
======
Ran: 102 tests in 1419.3399 sec.
 - Passed: 93
 - Skipped: 9
 - Expected Fail: 0
 - Unexpected Success: 0
 - Failed: 0
Sum of execute time for each test: 695.6942 sec.

zesty-ocata-proposed with stable charms:

======
Totals
======
Ran: 102 tests in 1665.0215 sec.
 - Passed: 93
 - Skipped: 9
 - Expected Fail: 0
 - Unexpected Success: 0
 - Failed: 0
Sum of execute time for each test: 939.2629 sec.

zesty-ocata-proposed with dev charms:

======
Totals
======
Ran: 102 tests in 1744.4931 sec.
 - Passed: 93
 - Skipped: 9
 - Expected Fail: 0
 - Unexpected Success: 0
 - Failed: 0
Sum of execute time for each test: 916.1659 sec.

xenial-pike-proposed with stable charms:

======
Totals
======
Ran: 102 tests in 1591.8960 sec.
 - Passed: 93
 - Skipped: 9
 - Expected Fail: 0
 - Unexpected Success: 0
 - Failed: 0
Sum of execute time for each test: 695.8174 sec.

xenial-pike-proposed with dev charms:

======
Totals
======
Ran: 102 tests in 1609.1086 sec.
 - Passed: 93
 - Skipped: 9
 - Expected Fail: 0
 - Unexpected Success: 0
 - Failed: 0
Sum of execute time for each test: 708.7850 sec.

xenial-ocata-proposed with stable charms:

======
Totals
======
Ran: 102 tests in 1650.0841 sec.
 - Passed: 93
 - Skipped: 9
 - Expected Fail: 0
 - Unexpected Success: 0
 - Failed: 0
Sum of execute time for each test: 858.3489 sec.

xenial-ocata-proposed with dev charms:

======
Totals
======
Ran: 102 tests in 2173.3217 sec.
 - Passed: 93
 - Skipped: 9
 - Expected Fail: 0
 - Unexpected Success: 0
 - Failed: 0
Sum of execute time for each test: 1031.2947 sec.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

Hello Xav, or anyone else affected,

Accepted neutron into newton-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:newton-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-newton-needed to verification-newton-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-newton-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-newton-needed
Revision history for this message
Xav Paice (xavpaice) wrote :

We have installed the Ocata -proposed package, however the situation is this:

- there's 464 routers configured, on 3 Neutron gateway hosts, using l3-ha, and each router is scheduled to all 3 hosts.
- we installed the package because we were in the middle of an incident with multiple l3 agents active, hoping the package update would solve the problem. One of the gateway hosts was being rebooted at the time to also try to do a King Canute and halt the tidal wave of ARP.
- We later found that openvswitch had run out of filehandles, see LP: #1737866
- Resolving that allowed ovs to create a ton more filehandles.
- Removing/ re-adding the routers to agents seemed to clean things up, we saw some routers with multiple agents active, and some with none active (all 3 agents 'standby').
- After a few iterations of that, things cleaned up.
- 15-20 mins later, we saw more routers with multiple agents active (ones which weren't before), and ran through the same cleanup steps. At this time, there were a large number of keepalived messages in syslog, particularly routers becoming MASTER then BACKUP again. (https://pastebin.canonical.com/205361/)
- after another hour or two, we're still clean.

I can't tell at this stage whether the fix actually fixed the problem or not - I need to dig further to find out if there could have been some process running cleanups.
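
To quantify the MASTER/BACKUP flapping mentioned in the list above, something like the following could count state changes per VRRP instance from syslog. The log line format matched here ("VRRP_Instance(NAME) Entering MASTER STATE") is an assumption about what keepalived writes and may differ between versions; the function name and the sample lines are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable

# Assumed keepalived message format; adjust to match your logs.
STATE_RE = re.compile(
    r"VRRP_Instance\((?P<name>[^)]+)\) Entering (?P<state>MASTER|BACKUP) STATE"
)


def count_state_changes(lines: Iterable[str]) -> Dict[str, int]:
    """Count MASTER<->BACKUP changes for each VRRP instance."""
    last_state: Dict[str, str] = {}
    changes: Dict[str, int] = defaultdict(int)
    for line in lines:
        match = STATE_RE.search(line)
        if match is None:
            continue
        name, state = match.group("name"), match.group("state")
        # Only count a change when a previous, different state is known.
        if name in last_state and last_state[name] != state:
            changes[name] += 1
        last_state[name] = state
    return dict(changes)


# Made-up sample: VR_1 flaps, VR_2 stays in BACKUP.
sample_lines = [
    "Keepalived_vrrp[100]: VRRP_Instance(VR_1) Entering BACKUP STATE",
    "Keepalived_vrrp[100]: VRRP_Instance(VR_1) Entering MASTER STATE",
    "Keepalived_vrrp[200]: VRRP_Instance(VR_2) Entering BACKUP STATE",
    "Keepalived_vrrp[100]: VRRP_Instance(VR_1) Entering BACKUP STATE",
    "Keepalived_vrrp[100]: VRRP_Instance(VR_1) Entering MASTER STATE",
]
print(count_state_changes(sample_lines))
```

Instances that never change state are left out of the result, so a long output points directly at the routers that are flapping.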

Revision history for this message
Corey Bryant (corey.bryant) wrote :

I'm marking this as fix released for artful/pike as it has passed regression testing successfully and it is paired up with a stable point release that has CVE fixes.

Xav, I know you are testing ocata still and it is up in the air still as to whether this has fixed your problem. Please keep us posted on further results.

tags: added: verification-done-artful verification-pike-done
removed: verification-needed-artful verification-pike-needed
Revision history for this message
Corey Bryant (corey.bryant) wrote :

s/fix released/verified

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package neutron - 2:11.0.2-0ubuntu1.1

---------------
neutron (2:11.0.2-0ubuntu1.1) artful; urgency=medium

  * d/gbp.conf: Set debian-branch to stable/pike.
  * New upstream version.
  * New stable point release for OpenStack Pike (LP: #1734990).
  * d/p/call-update_all_ha_network_port_statuses-on-agent-start.patch:
    Cherry-pick from upstream to prevent multiple masters for L3HA
    (LP: #1731595).

 -- Corey Bryant <email address hidden> Tue, 28 Nov 2017 14:55:02 -0500

Changed in neutron (Ubuntu Artful):
status: Fix Committed → Fix Released
Revision history for this message
Łukasz Zemczak (sil2100) wrote : Update Released

The verification of the Stable Release Update for neutron has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

The verification of the Stable Release Update for neutron has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

This bug was fixed in the package neutron - 2:11.0.2-0ubuntu1.1~cloud0
---------------

 neutron (2:11.0.2-0ubuntu1.1~cloud0) xenial-pike; urgency=medium
 .
   * New update for the Ubuntu Cloud Archive.
 .
 neutron (2:11.0.2-0ubuntu1.1) artful; urgency=medium
 .
   * d/gbp.conf: Set debian-branch to stable/pike.
   * New upstream version.
   * New stable point release for OpenStack Pike (LP: #1734990).
   * d/p/call-update_all_ha_network_port_statuses-on-agent-start.patch:
     Cherry-pick from upstream to prevent multiple masters for L3HA
     (LP: #1731595).

Revision history for this message
Chris Halse Rogers (raof) wrote : Please test proposed package

Hello Xav, or anyone else affected,

Accepted neutron into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/neutron/2:8.4.0-0ubuntu6 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-xenial to verification-done-xenial. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-xenial. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in neutron (Ubuntu Xenial):
status: Triaged → Fix Committed
tags: added: verification-needed-xenial
Revision history for this message
Corey Bryant (corey.bryant) wrote :

Hello Xav, or anyone else affected,

Accepted neutron into mitaka-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:mitaka-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-mitaka-needed to verification-mitaka-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-mitaka-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-mitaka-needed
Revision history for this message
Corey Bryant (corey.bryant) wrote :

Hi Xav, do you have any more feedback on ocata-proposed testing?

tags: added: neutron-proactive-backport-potential
Revision history for this message
James Hebden (ec0) wrote :

Hi Corey,

Unfortunately, in the case of the cloud where we are seeing this behaviour, the updated package which Xav installed per his previous comment does not seem to have addressed the issue. This was neutron 10.0.4-0ubuntu1~cloud0 from Cloud Archive xenial-updates/ocata.

I did notice that the packages being released for other Ubuntu releases appear to be a newer version, 2:11.0.2-0ubuntu1.1 - is this intended?

As an update, the workaround in place for this particular issue has been to disable L3HA on individual routers as we detect this issue. We have this particular cloud down to a number of routers where things seem relatively stable, now that we are closer to the 400 L3HA router mark.

Let me know if you need further information or testing performed.

Revision history for this message
Akash (taloleakash) wrote :

Hi,

We are seeing the same issue on OpenStack Pike with openstack-neutron-11.0.3 on CentOS 7.
Is there any solution or patch?

Revision history for this message
Corey Bryant (corey.bryant) wrote :

Hi jhebden, thanks for the feedback. Yes, Pike has a newer package version. Once a release is GA, we stay at the major version (i.e. 10 in the case of Ocata, 10.0.4) so as not to introduce any new features to a stable release.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

It seems as if this bug surfaces due to load issues. While the fix provided by Venkata (https://review.openstack.org/#/c/522641/) should help clean things up at the time of l3 agent restart, issues seem to come back later down the line in some circumstances. xavpaice mentioned he saw multiple routers active at the same time when they had 464 routers configured on 3 neutron gateway hosts using L3HA, and each router was scheduled to all 3 hosts. However, jhebden mentions that things seem stable at the 400 L3HA router mark, and it's worth noting this is the same deployment that xavpaice was referring to.

It seems to me that something is being pushed to its limit, and possibly once that limit is hit, master router advertisements aren't being received, causing a new master to be elected. If this is the case it would be great to get to the bottom of what resource is getting constrained.
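
For anyone trying to spot the symptom described above, here is a minimal sketch of a check for routers that have more than one active agent, or none at all. It assumes the per-agent HA state of each router has already been collected into a dictionary (for example from the L3 agent listing for each router); the function name, the input layout, the host names and the "active"/"standby" strings are illustrative assumptions, not part of the neutron code.

```python
from typing import Dict, List


def classify_ha_routers(states: Dict[str, Dict[str, str]]) -> Dict[str, List[str]]:
    """Group router IDs by how many agents report them as active.

    `states` maps router_id -> {agent_host: ha_state}. The layout and the
    "active"/"standby" values are assumptions made for this sketch.
    """
    result: Dict[str, List[str]] = {"ok": [], "multiple_active": [], "no_active": []}
    for router_id, per_agent in sorted(states.items()):
        # Count the agents that claim to be the active instance.
        active = sum(1 for state in per_agent.values() if state == "active")
        if active == 1:
            result["ok"].append(router_id)
        elif active > 1:
            result["multiple_active"].append(router_id)
        else:
            result["no_active"].append(router_id)
    return result


if __name__ == "__main__":
    # Hypothetical data: three gateway hosts, three routers.
    sample = {
        "router-a": {"gw1": "active", "gw2": "standby", "gw3": "standby"},
        "router-b": {"gw1": "active", "gw2": "active", "gw3": "standby"},
        "router-c": {"gw1": "standby", "gw2": "standby"},
    }
    print(classify_ha_routers(sample))
```

Running a check like this periodically could help correlate the moment routers go multi-active with load on the gateway hosts, which is the open question here.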

Revision history for this message
venkata anil (anil-venkata) wrote :

As I am unable to reproduce it, I will be happy if someone takes over this issue.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

I think we need to get a new bug opened for this. As it's been marked fix released upstream it's probably not on anyone's radar.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

I wasn't able to change the upstream status back to "New" so I've opened a new bug to track this at https://bugs.launchpad.net/ubuntu/artful/+source/neutron/+bug/1744062.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

Since we have another bug opened for this issue and we know the fix provided for this bug hasn't regressed anything, I'm going to mark this bug as verified so that we can get the packages promoted to updates and unblock any new neutron uploads.

Changed in neutron (Ubuntu Zesty):
status: Fix Committed → Won't Fix
Revision history for this message
Corey Bryant (corey.bryant) wrote :

Zesty is EOL now, marking as won't fix.

tags: added: verification-ocata-done
removed: verification-ocata-needed
Revision history for this message
Corey Bryant (corey.bryant) wrote : Update Released

The verification of the Stable Release Update for neutron has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

This bug was fixed in the package neutron - 2:10.0.4-0ubuntu2~cloud0
---------------

 neutron (2:10.0.4-0ubuntu2~cloud0) xenial-ocata; urgency=medium
 .
   * New update for the Ubuntu Cloud Archive.
 .
 neutron (2:10.0.4-0ubuntu2) zesty; urgency=medium
 .
   * d/p/call-update_all_ha_network_port_statuses-on-agent-start.patch:
     Cherry-pick from upstream to prevent multiple masters for L3HA
     (LP: #1731595).

Revision history for this message
Corey Bryant (corey.bryant) wrote :

Regression testing was successful for xenial-newton-proposed.

xenial-newton-proposed with development charms:

======
Totals
======
Ran: 102 tests in 1480.0247 sec.
 - Passed: 94
 - Skipped: 8
 - Expected Fail: 0
 - Unexpected Success: 0
 - Failed: 0
Sum of execute time for each test: 896.7256 sec.

xenial-newton-proposed with stable charms:

======
Totals
======
Ran: 102 tests in 1448.5591 sec.
 - Passed: 94
 - Skipped: 8
 - Expected Fail: 0
 - Unexpected Success: 0
 - Failed: 0
Sum of execute time for each test: 919.5493 sec.

tags: added: verification-newton-done
removed: verification-newton-needed
Revision history for this message
Corey Bryant (corey.bryant) wrote :

Regression testing was successful for xenial-mitaka-proposed and trusty-mitaka-proposed.

xenial-mitaka-proposed with development charms:

======
Totals
======
Ran: 102 tests in 984.1188 sec.
 - Passed: 94
 - Skipped: 8
 - Expected Fail: 0
 - Unexpected Success: 0
 - Failed: 0
Sum of execute time for each test: 639.2812 sec.

xenial-mitaka-proposed with stable charms:

======
Totals
======
Ran: 102 tests in 1031.4567 sec.
 - Passed: 94
 - Skipped: 8
 - Expected Fail: 0
 - Unexpected Success: 0
 - Failed: 0
Sum of execute time for each test: 621.0117 sec.

trusty-mitaka-proposed with development charms:

======
Totals
======
Ran: 102 tests in 909.0828 sec.
 - Passed: 94
 - Skipped: 8
 - Expected Fail: 0
 - Unexpected Success: 0
 - Failed: 0
Sum of execute time for each test: 549.2448 sec.

trusty-mitaka-proposed with stable charms:

======
Totals
======
Ran: 102 tests in 1036.5808 sec.
 - Passed: 94
 - Skipped: 8
 - Expected Fail: 0
 - Unexpected Success: 0
 - Failed: 0
Sum of execute time for each test: 603.3084 sec.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

The verification of the Stable Release Update for neutron has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

This bug was fixed in the package neutron - 2:9.4.1-0ubuntu1~cloud1
---------------

 neutron (2:9.4.1-0ubuntu1~cloud1) xenial-newton; urgency=medium
 .
   * d/p/call-update_all_ha_network_port_statuses-on-agent-start.patch:
     Cherry-pick from upstream stable/ocata branch to prevent multiple
     masters for L3HA (LP: #1731595).

tags: added: verification-done-xenial verification-mitaka-done
removed: verification-mitaka-needed verification-needed-xenial
tags: removed: neutron-proactive-backport-potential
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package neutron - 2:8.4.0-0ubuntu6

---------------
neutron (2:8.4.0-0ubuntu6) xenial; urgency=medium

  * d/p/set-ha-network-port-to-down-when-l3-agent-starts.patch,
    d/p/call-update_all_ha_network_port_statuses-on-agent-start.patch:
    Cherry-pick from upstream stable/ocata branch to prevent multiple
    masters for L3HA (LP: #1731595).

 -- Corey Bryant <email address hidden> Tue, 12 Dec 2017 13:36:08 -0500

Changed in neutron (Ubuntu Xenial):
status: Fix Committed → Fix Released
Revision history for this message
Corey Bryant (corey.bryant) wrote :

The verification of the Stable Release Update for neutron has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

This bug was fixed in the package neutron - 2:8.4.0-0ubuntu6~cloud0
---------------

 neutron (2:8.4.0-0ubuntu6~cloud0) trusty-mitaka; urgency=medium
 .
   * New update for the Ubuntu Cloud Archive.
 .
 neutron (2:8.4.0-0ubuntu6) xenial; urgency=medium
 .
   * d/p/set-ha-network-port-to-down-when-l3-agent-starts.patch,
     d/p/call-update_all_ha_network_port_statuses-on-agent-start.patch:
     Cherry-pick from upstream stable/ocata branch to prevent multiple
     masters for L3HA (LP: #1731595).

Revision history for this message
Seyeong Kim (seyeongkim) wrote :

Hello,

On Xenial, a customer reported that when they upgraded neutron-openvswitch-agent, all ports were deleted on some hosts (not all of them) after the agent service restart was attempted, and all instances on those hosts lost network connectivity.

This could be a race condition, since some hosts had the issue and others did not.

Restarting the daemons was not a solution.
Rebooting the host machine resolved the symptom.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 10.0.5

This issue was fixed in the openstack/neutron 10.0.5 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 11.0.3

This issue was fixed in the openstack/neutron 11.0.3 release.

Revision history for this message
Xav Paice (xavpaice) wrote :

Comment for the folks that are noticing this as 'fix released' but are still affected: see https://github.com/acassen/keepalived/commit/e90a633c34fbe6ebbb891aa98bf29ce579b8b45c for the rest of this fix. We need keepalived to be at least 1.4.0 in order to have this commit.
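
A rough sketch of how the 1.4.0 threshold could be checked against an installed package version string. The helper names are hypothetical, and the parsing simply strips a Debian-style epoch and revision. It compares upstream version numbers only, so a distro package that carries a backport of the commit on an older upstream version would not be detected.

```python
import re
from typing import Tuple


def parse_upstream_version(package_version: str) -> Tuple[int, ...]:
    """Return the upstream version as a tuple of ints.

    Drops an optional epoch ("1:") and an optional revision ("-1").
    """
    upstream = package_version.split(":", 1)[-1].split("-", 1)[0]
    return tuple(int(part) for part in re.findall(r"\d+", upstream))


def has_upstream_fix(package_version: str,
                     minimum: Tuple[int, ...] = (1, 4, 0)) -> bool:
    """True if the upstream version is at least `minimum` (1.4.0 by default)."""
    return parse_upstream_version(package_version) >= minimum


if __name__ == "__main__":
    for version in ("1:1.2.19-1", "1.4.0"):
        print(version, has_upstream_fix(version))
```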

Revision history for this message
Corey Bryant (corey.bryant) wrote :

Xav, do we have a bug open against the keepalived package for this?

Revision history for this message
Xav Paice (xavpaice) wrote :

Corey, as far as I'm aware there isn't a bug open for the keepalived package (for Xenial at least). Are you suggesting that we open a bug for a backport to the current cloudarchive package?

no longer affects: keepalived (Ubuntu Artful)
no longer affects: keepalived (Ubuntu Zesty)
Revision history for this message
Corey Bryant (corey.bryant) wrote :

Xav, I just checked and the patch you referenced can be backported fairly cleanly to at least keepalived 1:1.2.19-1 (xenial/mitaka) and above.

Changed in keepalived (Ubuntu Xenial):
importance: Undecided → Critical
status: New → Triaged
Changed in keepalived (Ubuntu Bionic):
importance: Undecided → High
status: New → Triaged
Changed in keepalived (Ubuntu):
status: New → Triaged
importance: Undecided → High
Changed in keepalived (Ubuntu Xenial):
importance: Critical → High
Revision history for this message
Corey Bryant (corey.bryant) wrote :

Since this bug is already marked as fix released for the cloud-archive, let's please track this issue in the following bug: https://bugs.launchpad.net/ubuntu/+bug/1744062

no longer affects: keepalived (Ubuntu)
no longer affects: keepalived (Ubuntu Xenial)
no longer affects: keepalived (Ubuntu Bionic)