vRouter not working after update to 16.3.1

Bug #1927868 reported by Alexandros Soumplis
This bug affects 10 people
Affects                Status        Importance  Assigned to          Milestone
Ubuntu Cloud Archive   Fix Released  Critical    Unassigned
  Train                Fix Released  Critical    Unassigned
  Ussuri               Fix Released  Critical    Unassigned
  Victoria             Fix Released  Critical    Unassigned
  Wallaby              Fix Released  Critical    Unassigned
  Xena                 Fix Released  Critical    Unassigned
neutron                Fix Released  Critical    Edward Hope-Morley
oslo.privsep           New           Undecided   Unassigned
neutron (Ubuntu)       Fix Released  Critical    Unassigned
  Focal                Fix Released  Critical    Unassigned
  Hirsute              Fix Released  Critical    Unassigned
  Impish               Fix Released  Critical    Unassigned

Bug Description

We run a juju-managed OpenStack Ussuri on Bionic. After updating the neutron packages from 16.3.0 to 16.3.1, all virtual routers stopped working. It seems that most (not all) namespaces are created but contain only the lo interface and sometimes the ha-XYZ interface in DOWN state. The underlying tap interfaces are also down.

neutron-l3-agent has many logs similar to the following:
2021-05-08 15:01:45.286 39411 ERROR neutron.agent.l3.ha_router [-] Gateway interface for router 02945b59-639b-41be-8237-3b7933b4e32d was not set up; router will not work properly

and journal logs report at around the same time
May 08 15:01:40 lar1615.srv-louros.grnet.gr neutron-keepalived-state-change[18596]: 2021-05-08 15:01:40.765 18596 INFO neutron.agent.linux.ip_lib [-] Failed sending gratuitous ARP to 62.62.62.62 on qg-5a6efe8c-6b in namespace qrouter-02945b59-639b-41be-8237-3b7933b4e32d: Exit code: 2; Stdin: ; Stdout: Interface "qg-5a6efe8c-6b" is down
May 08 15:01:40 lar1615.srv-louros.grnet.gr neutron-keepalived-state-change[18596]: 2021-05-08 15:01:40.767 18596 INFO neutron.agent.linux.ip_lib [-] Interface qg-5a6efe8c-6b or address 62.62.62.62 in namespace qrouter-02945b59-639b-41be-8237-3b7933b4e32d was deleted concurrently

The neutron packages installed are:

ii neutron-common 2:16.3.1-0ubuntu1~cloud0 all Neutron is a virtual network service for Openstack - common
ii neutron-dhcp-agent 2:16.3.1-0ubuntu1~cloud0 all Neutron is a virtual network service for Openstack - DHCP agent
ii neutron-l3-agent 2:16.3.1-0ubuntu1~cloud0 all Neutron is a virtual network service for Openstack - l3 agent
ii neutron-metadata-agent 2:16.3.1-0ubuntu1~cloud0 all Neutron is a virtual network service for Openstack - metadata agent
ii neutron-metering-agent 2:16.3.1-0ubuntu1~cloud0 all Neutron is a virtual network service for Openstack - metering agent
ii neutron-openvswitch-agent 2:16.3.1-0ubuntu1~cloud0 all Neutron is a virtual network service for Openstack - Open vSwitch plugin agent
ii python3-neutron 2:16.3.1-0ubuntu1~cloud0 all Neutron is a virtual network service for Openstack - Python library
ii python3-neutron-lib 2.3.0-0ubuntu1~cloud0 all Neutron shared routines and utilities - Python 3.x
ii python3-neutronclient 1:7.1.1-0ubuntu1~cloud0 all client API library for Neutron - Python 3.x

Downgrading to 16.3.0 resolves the issues.

=================================

Ubuntu SRU details:

[Impact]
See above.

[Test Case]
Deploy OpenStack with l3ha enabled and create several HA routers; the number required varies per environment. It is probably best to deploy a known-bad version of the package, confirm that it is failing, upgrade to the version in -proposed, and re-test several times to confirm it is fixed.

After restarting neutron-l3-agent, all HA routers should be restored.

[Regression Potential]
This change fixes a regression by reverting a patch that was introduced in a stable point release of neutron.

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in neutron (Ubuntu):
status: New → Confirmed
Revision history for this message
Jared Baker (shubjero) wrote :

This is a terrible bug. 16.3.1 needs to be pulled from the repositories immediately. This will cause all Ussuri OpenStack deployments on Bionic to be completely down.

Revision history for this message
Jared Baker (shubjero) wrote :

This bug appears to be present in 16.3.2 in the proposed repository as well.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

Hi, I've added upstream neutron to this bug to see if anyone is familiar with it.

Revision history for this message
Corey Bryant (corey.bryant) wrote :
Revision history for this message
Jared Baker (shubjero) wrote :

I don't think so. We don't even use DVR. Our observations were basically identical to the OP's, with the telltale log entry 'Gateway interface for router 02945b59-639b-41be-8237-3b7933b4e32d was not set up; router will not work properly', and all Layer-3 agents for a router remained in standby state. None of them would ever go active.

Revision history for this message
Jared Baker (shubjero) wrote :

I was not able to roll back to 16.3.0 because the packages are not in any repository that I could find, so I had to manually download the neutron Debian packages for Focal, and thankfully they seem to more or less work on Bionic Ussuri. We are now running 16.0.0~b3~git2020041516.5f42488a9a-0ubuntu2, which is not ideal.

Revision history for this message
LIU Yulong (dragon889) wrote :

Seems to be a terrible bug... I have a question about juju-managed OpenStack: does it use the tag from upstream neutron directly, without any private patches or backports? If so, the following are the patches from 16.3.0 to 16.3.1; maybe we can find the problem among them.

7771f16 [L3] Delete DvrFipGatewayPortAgentBindings after no gw ports
7f3aadd Lock sg updates while in _apply_port_filter()
9736efd [OVN] Set mcast_flood_reports on LSPs
ebc9921 Revert "DVR: Remove control plane arp updates for DVR"
9539db1 Add minimum bw qos rule validation for network
f8f1eaf Add some wait time between stopping and starting again ovsdb monitor
b16c29f Ignore python warnings in the fullstack job
590551d [L3][HA] Retry when setting HA router GW status.
664ee1d Fix wrong packet_type set for IPv6 GRE tunnels in OVS
630fc3b Stop metadata proxy gracefully
28cf678 Delete HA metadata proxy PID and config with elevated privileges
2f4ef31 Improve "get_devices_with_ip" performance
580e57b [OVS FW] Allow egress ICMPv6 only for know addresses
0c75fd0 Remove update_initial_state() method from the HA router
87fce78 Migrate "netstat" to oslo.privsep
817c5f2 [OVS FW] Clean conntrack entries with mark == CT_MARK_INVALID
73e1672 Fix deletion of rfp interfaces when router is re-enabled
7ce3c8e Avoid race condition when processing RowEvents
b2dc70e [OVN] ovn-metadata-agent: Retry registering Chassis at startup
f39230d Fix update of trunk subports during live migration
75b8fa7 Add extension unit tests for conntrack_helper plugin
9139f40 Fix incorrect exception catch when update floating ip port forwarding
f849a4c Don't try to create default SG when security groups are disabled
dc01f5b Process DHCP events in order if related
55a82da [OVN] Update metadata port ony for requested subnet
b84dbd6 Fix losses of ovs flows when ovs is restarted
5f78ff5 Make test_agent_show only look for its own agents
c825921 Do not update agents "alive" state in TestAgentApi
76a57e9 Optimize get_ports with QoS extension
a76a3f2 [OVN] Ensure metadata checksum
a2f312d Auto-remove floating agent gw ports on net/subnet delete
1f71ae2 [QoS] Get only min bw rules when extending port dict
e24a66d Optimize get_ports with trunk extension
8cb6130 Improve DHCP agent's debug messages
f7e028b ovn: Support live migration to DPDK nodes
daba68d Add WaitForPortCreateEvent in BaseOVSTestCase

Revision history for this message
Jared Baker (shubjero) wrote :

590551d [L3][HA] Retry when setting HA router GW status.

I'm not a developer, but this change does match the error messages both the OP and I are seeing.

https://github.com/openstack/neutron/commit/8f5a801270f81bd9fe3559fee9c1714c97849b3e

Revision history for this message
Corey Bryant (corey.bryant) wrote :

If we can narrow in on the offending commit we can look at getting an SRU out ASAP with the change reverted.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

@LIU, they're using the Ubuntu package which imports upstream releases from the published tarballs at https://tarballs.opendev.org/openstack/neutron/. So I think looking through the delta of commits that you've listed is the right way to go to narrow down on the offending patch.

tags: added: l3-d
tags: added: l3-dvr-backlog
removed: l3-d
Changed in neutron:
importance: Undecided → Critical
Revision history for this message
Slawek Kaplonski (slaweq) wrote :

This is a bit strange, as we run Ussuri scenario jobs with L3-HA, see e.g. https://zuul.opendev.org/t/openstack/build/508a4cc49b0844978d107b74ed18982d/logs, and they work fine. They also run on Ubuntu Bionic.
Can you check which keepalived version you have, perhaps compare your other packages with what was used in that job, and also share the full L3 agent log with the errors you are seeing?

Revision history for this message
Christian Rohmann (christian-rohmann) wrote :

We run OpenStack Train on Ubuntu Bionic and observe similar issues with L3-HA routers after having updated from 15.3.2 -> 15.3.3.

Currently we are still collecting evidence, but when a router is affected, we have observed that the gateway interfaces on all nodes hosting that router are DOWN. After manually setting them up (via nsenter) and, e.g., restarting the l3 agent or killing keepalived, things get back into a working state.

Revision history for this message
Jared Baker (shubjero) wrote :

keepalived: 1:1.3.9-1ubuntu0.18.04.2
from: http://archive.ubuntu.com/ubuntu bionic-updates/main

neutron-l3-agent.log filled with the following log lines for every single router in our cluster:
2021-06-10 19:59:53.602 328191 ERROR neutron.agent.l3.ha_router [-] Gateway interface for router 9f276b0e-8581-4312-8780-ff4503c3df07 was not set up; router will not work properly

We have 3 controllers and use l3 HA and usually have 1 l3 agent per controller, 2 standby, 1 active.

We rolled back to 16.0.0 (only because we could not find 16.3.0 anywhere) and the issue stopped. OP was able to roll back to 16.3.0 and said the issue went away so I think we should focus on 16.3.1+

Before upgrading to Ussuri we did briefly upgrade to the latest Train packages; give me some time to check whether those error messages started while we were on the latest Train packages. I need to do some cross-referencing for that.

Revision history for this message
Jared Baker (shubjero) wrote :

OK, I did find evidence that this issue began to happen as we upgraded our Train packages just before we upgraded to Ussuri (I believe it's best practice to update all packages, THEN upgrade to a new version of OpenStack).

Upgraded Neutron 15.3.0 to 15.3.3 at 16:40 EDT
2021-06-10 16:40:40 upgrade neutron-l3-agent:all 2:15.3.0-0ubuntu1~cloud2 2:15.3.3-0ubuntu1~cloud0

Evidence of gateway problems at 17:06 EDT (Might have seen this in the logs several minutes earlier but I had to purge a lot of logs from that day because of the sheer volume of them)
2021-06-10 17:06:23.440 33440 ERROR neutron.agent.l3.ha_router [-] Gateway interface for router 04a5a2c3-739f-4cf6-94bb-dd314c2a8e66 was not set up; router will not work properly

Upgraded from latest Train to latest Ussuri happened at 17:12
2021-06-10 17:12:31 upgrade neutron-l3-agent:all 2:15.3.3-0ubuntu1~cloud0 2:16.3.1-0ubuntu1~cloud0

Revision history for this message
Jared Baker (shubjero) wrote :

Hey Ubuntu team,

Is there any way for you to provide the Neutron 16.3.0 packages for Bionic? We are in emergency mode here and the current 16.0.0 we are running, although working better than 16.3.1 is causing major control plane instability.

We are in desperate need of 16.3.0 packages.

Revision history for this message
Ante Karamatić (ivoks) wrote :

For those looking for 16.3.0 packages, they are available at https://launchpad.net/~ubuntu-cloud-archive/+archive/ubuntu/ussuri-staging/+build/21146273

Revision history for this message
Billy Olsen (billy-olsen) wrote :

FWIW, I don't believe that the commit referenced in comment #9 is related to the problem, unless there's a race condition introduced by the added delay. Looking at that commit, the code now tries harder to make sure that the device in the namespace exists before proceeding as it did before. The log shows the error messages because the referenced change introduced a log message for the condition where the device doesn't exist in the namespace; the code in that path continues as it did previously even when the error is logged. As a result, the change primarily waits up to 3 seconds longer for the device to exist, then logs the error for awareness, but still continues down the same code path.
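The shape of the behavior described above can be sketched as follows (a simplified illustration only, not the actual neutron code; the function and parameter names here are hypothetical):

```python
import logging
import time

LOG = logging.getLogger(__name__)


def set_link_up_with_retry(device_exists, set_link_up,
                           timeout=3.0, interval=0.5):
    """Best-effort wait for the device, then proceed either way.

    Waits up to `timeout` seconds for the interface to appear; on
    timeout it logs an error for awareness but still continues down
    the same code path the pre-patch code took.
    """
    deadline = time.monotonic() + timeout
    while not device_exists() and time.monotonic() < deadline:
        time.sleep(interval)
    if not device_exists():
        LOG.error("Gateway interface for router was not set up; "
                  "router will not work properly")
    set_link_up()  # proceed regardless, as before the patch
```

The key point is the last line: whether or not the device appears within the timeout, the link is still set up, so the retry by itself only adds a delay and a log message.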

Revision history for this message
Corey Bryant (corey.bryant) wrote (last edit ):

UPDATE: This was a bad test as I didn't account for db migrations.

I was able to recreate this with an active/passive scenario. The upgrade from stein->train went well and one qg-a88fe206-a2@if19 was UP and the other was DOWN, but after the upgrade from train->ussuri both qg-a88fe206-a2@if19 interfaces were in DOWN state. See details attached.

Changed in neutron (Ubuntu):
status: Confirmed → Triaged
Revision history for this message
Jared Baker (shubjero) wrote :

Ante: Thank you for the link to the 16.3.0 packages. We managed to find 16.2.0 before your post and are currently using them and it seems to be working fine for us. Should the opportunity arise we will install 16.3.0 since we've downloaded the deb files just in case.

Billy: I guess the hunt continues for which commit introduced the bug, but I do see that this change '590551d' is a result of the following bug report: https://bugzilla.redhat.com/show_bug.cgi?id=1929829 where this operator was complaining that his L3 agents would not enter active state. Coincidentally it may have fixed the issue for them, but introduced a bad bug on the Ubuntu side of things.

Corey: It's reassuring that you were able to reproduce this bug. Thank you for your efforts!

Revision history for this message
Christian Rohmann (christian-rohmann) wrote :

Yes, Billy Olsen, we are a little in doubt about this (https://bugs.launchpad.net/neutron/+bug/1927868/comments/18) as well.

We have been observing such non functioning gateways on our Train installation occasionally also before this patch / update.

Usually a "clear gateway" and a recreation via the Horizon GUI by the user fixed this - one more reason to pursue the idea of a race condition as it then works on another attempt.

Killing the master keepalived also worked, as the router then switched over to another node.

With the recent change, though, we appear to have had a larger number of broken gateways after migrations and node reboots.

Revision history for this message
Corey Bryant (corey.bryant) wrote (last edit ):

UPDATE: This was a bad test as I didn't account for db migrations.

Update on testing results from my end, based on https://bugs.launchpad.net/ubuntu/+source/neutron/+bug/1927868/comments/19.

Attempts to upgrade from train to any of the ussuri package versions in the ussuri-staging PPA [1] hit the bug or a similar bad state of routers (both routers are DOWN, or both are UP with no IP address, or one is DOWN and one is UP with no IP address).

[1] https://launchpad.net/~ubuntu-cloud-archive/+archive/ubuntu/ussuri-staging/+packages?field.name_filter=neutron&field.status_filter=&field.series_filter=

So I think this might be an issue between 15.3.4 and commit 3e8abb9a8f (the latter being the upstream commit that the earliest package snapshot in ussuri-staging is based on).

Revision history for this message
Corey Bryant (corey.bryant) wrote (last edit ):

It seems this may only present itself if neutron-server etc. are not upgraded and database migrations (i.e. neutron-db-manage) are not run prior to the network nodes being upgraded.

In a juju deployment this means that neutron-api units need to be upgraded prior to neutron-gateway units.

I'll test this theory again tomorrow from scratch to confirm as my current deployment has seen a lot of manual manipulation today.

I'm curious if anyone who's hit this remembers the order of their upgrades.

I came across this as I was bisecting upgrades of neutron-gateway to various upstream commits and HA routers started misbehaving when I upgraded to this commit https://opendev.org/openstack/neutron/commit/843b5ffd9a8ee3f4d9d8830f43aa3d517cc11e07 which has alembic migrations, and it dawned on me that the database hadn't been upgraded to include them.
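The ordering constraint can be expressed as a simple invariant (a hypothetical checker for illustration only, not part of neutron or juju): no agent should ever run a newer version than the server/database it talks to.

```python
def upgrade_order_ok(steps):
    """Validate a sequence of (component, version) upgrade events.

    Invariant: by the time any agent (l3, dhcp, ovs, ...) reaches
    version v, neutron-server must already be at version >= v with its
    db migrations run; otherwise agents may rely on RPC/schema versions
    the control plane does not yet provide.
    """
    server_version = None
    for component, version in steps:
        if component == "neutron-server":
            server_version = version
        elif server_version is None or version > server_version:
            return False  # agent upgraded ahead of the control plane
    return True
```

For example, upgrading neutron-server to 16.3.2 before the l3 agents satisfies the invariant, while upgrading the agents first violates it.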

Revision history for this message
Jared Baker (shubjero) wrote :

Our order of operations went as such:

- All packages get updated to latest available Ussuri (16.3.1) for Bionic via apt-get dist-upgrade
- Stop services (systemctl stop neutron-dhcp-agent; systemctl stop neutron-metadata-agent; systemctl stop neutron-ovs-cleanup; systemctl stop neutron-l3-agent; systemctl stop neutron-openvswitch-agent; systemctl stop neutron-server)
- Upgrade database (neutron-db-manage current; neutron-db-manage upgrade heads)
- Start services back up (systemctl start neutron-dhcp-agent; systemctl start neutron-metadata-agent; systemctl start neutron-ovs-cleanup; systemctl start neutron-l3-agent; systemctl start neutron-openvswitch-agent; systemctl start neutron-server)
- All L3 agents for all routers elect to be down
- Rebooted control plane several times while troubleshooting, L3 agents stay down
- Found that 16.3.2 was available on staging repository, installed it, all L3 agents go to standby
- Install 16.0.0 for Focal (all I could find during my scramble to fix the outage), L3 agents start to elect a master for each router
- Later that week, re-attempt 16.3.1, all L3 agents go to standby
- Find 16.2.0 for Bionic and downgrade to it, all L3 agents begin to elect masters

Revision history for this message
Alexandros Soumplis (soumplis) wrote :

@Corey In our case there was no release upgrade. We had been on Ussuri for many weeks, running happily. The problem came up only after we updated the relevant Ubuntu packages from 16.3.0 to 16.3.1. Simply reverting to 16.3.0 resolved the issue.

Revision history for this message
Christian Rohmann (christian-rohmann) wrote :

Corey, Jared I believe your analysis is running a little in the wrong direction here:

1) We run OpenStack TRAIN (15) and also experienced the described issues. So there cannot be any relation to the database schema upgrades.

2) We did experience the issue even before the recent upgrade, and we believe it is a coincidence that it was triggered just now. There is also a bug report about exactly the observations I made (all gateway interfaces were down): https://bugs.launchpad.net/neutron/+bug/1916024
Did you observe anything else while running the potentially flawed 16.3.1 and looking at the interfaces of a virtual router?

3) The commit https://opendev.org/openstack/neutron/commit/12c07ba3ea9c6501dd7494561e2920496407c48b was just about trying to fight a race condition with a static 3-second wait, but it now also gives us an error message, and we saw this multiple times during a larger batch of machine migrations. So the issue is real in TRAIN and likely also in USSURI and later.

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello Christian:

Please check the bug and the commit description [1]. As commented in c#4:

What the patch is solving is that situation when the interface disappears and reappears again, while keepalived is configuring it.

If you still see this error message this is because:
- keepalived didn't finish configuring this interface. If the interface doesn't appear in this time (3 seconds), maybe you have another problem.
- the interface was deleted

What you need to identify is what is happening in your system and why the interface is not found in the kernel when it is about to be set UP.

Regards.

[1]https://review.opendev.org/c/openstack/neutron/+/776427

Revision history for this message
Jared Baker (shubjero) wrote :

@Christian, sorry if it wasn't clear but I don't believe this is database migration related.

@Rodolfo, At the very least, the code that was written with timeout=3 is likely problematic and should at least have been made user-configurable. It's quite possible this change is the culprit and breaks Ubuntu deployments.

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello Jared:

How is this change breaking any deployment? If the interface is there, there will be no wait at all. If the interface is not present, then you have a problem, and it is not related to the timeout.

First you need to find out why the GW interface is not found in the kernel when "set_link_status" is called. It could be useful to use the steps provided in [1] to reproduce the issue and find out why the interface is not there. If you don't trust [2], remove it, but you'll hit the same error.

Regards.

[1]https://bugs.launchpad.net/neutron/+bug/1916024
[2]https://review.opendev.org/c/openstack/neutron/+/776427

Revision history for this message
Jared Baker (shubjero) wrote :

@Rodolfo, We still aren't 100% certain that this particular change is the culprit but it's highly suspect because of the nature of the change compared to other commits.

In this ticket you have at least 4 instances of this bug being reproduced by different people, including Ubuntu maintainers and several cloud operators who had entire clouds go completely offline and had to roll back to previous versions of Ussuri AND Train. We're now all in holding patterns, unable to upgrade to this version or beyond as long as this code exists.

I do believe that there could be something specific to Ubuntu deployments and this code as I see that this change was triggered by a RHEL bug ticket, so maybe the change fixed a problem for RHEL deployments but broke things for Ubuntu deployments.

Revision history for this message
Corey Bryant (corey.bryant) wrote (last edit ):

I haven't been able to recreate this when upgrading packages in the correct order and running db migrations as mentioned in https://bugs.launchpad.net/ubuntu/+source/neutron/+bug/1927868/comments/23.

Today I attempted to recreate this a few times, and my HA routers look correct after upgrade.

Alexandros, if you have any more details on steps to take to recreate this please let me know. I understand your deployment is juju managed, and that's what I use for testing, so it may be simple to recreate if I get the right steps.

My general testing steps were: deploy train, create an HA router, upgrade neutron-api to ussuri 16.3.0, upgrade neutron-gateways to 16.3.0, upgrade neutron-api to 16.3.2, and finally upgrade neutron-gateway units to 16.3.2. All the while of course restarting services along the way, and running ip addr list in both router namespaces to ensure one qg-* is UP with an IP and the other is DOWN as well as monitoring L3 agents hosting the router are active/standby.

Changed in neutron (Ubuntu):
status: Triaged → New
Revision history for this message
Corey Bryant (corey.bryant) wrote :

Just a note that I've been running a loop for the last few hours similar to Christian's create/delete scripts in the description of https://bugs.launchpad.net/neutron/+bug/1916024 with no success in recreating. This was on ussuri with neutron packages at 2:16.3.2-0ubuntu3~cloud0.

Revision history for this message
Edward Hope-Morley (hopem) wrote :

I've had a go at deploying Train and upgrading Neutron to the latest Ussuri, and I see the same issue. Looking closer, what I see is that post-upgrade the Neutron l3-agent has not spawned any keepalived processes, which is why no router goes active. When the agent is restarted it would normally receive two router updates: the first to spawn_state_change_monitor and a second to spawn keepalived. On my non-working nodes the second router update is never received by the l3-agent. Here is an example of a working agent https://pastebin.ubuntu.com/p/PFb594wkhB vs. a non-working one https://pastebin.ubuntu.com/p/MtDNrXmvZB/.

I tested restarting all agents and this did not fix things. I then rebooted one of my upgraded nodes and it resolved the issue for that node, i.e. two updates were received, both processes were spawned, and the router went active. I also noticed that on a non-rebooted node, following an ovs agent restart, I see https://pastebin.ubuntu.com/p/2n4KxBv8S2/ which again is not resolved by an agent restart but is fixed by a node reboot. This latter issue is described in old bugs, e.g. https://bugs.launchpad.net/neutron/+bug/1625305

Revision history for this message
Jorge Niedbalski (niedbalski) wrote :

Hello,

I reviewed the code path and the upgrade in my reproducer. Following the approach
of upgrading neutron-gateway and subsequently neutron-api doesn't work, because a mismatch
in the migration/RPC versions causes the HA port to fail to be created/updated;
the keepalived process then cannot be spawned, and finally the state-change-monitor
fails to find the PID for that keepalived process.

If I upgrade neutron-api, run the migrations to head and then upgrade the gateways, all seems correct.

I upgraded from the following versions

root@juju-da864d-1927868-5:/home/ubuntu# dpkg -l |grep keepalived
ii keepalived 1:1.3.9-1ubuntu0.18.04.2 amd64 Failover and monitoring daemon for LVS clusters

root@juju-da864d-1927868-5:/home/ubuntu# dpkg -l |grep neutron-common
ii neutron-common 2:15.3.3-0ubuntu1~cloud0 all Neutron is a virtual network service for Openstack - common

--> To

root@juju-da864d-1927868-5:/home/ubuntu# dpkg -l |grep neutron-common
ii neutron-common 2:16.3.2-0ubuntu3~cloud0 all Neutron is a virtual network service for Openstack - common

I created a router with HA enabled as follows

$ openstack router list
+--------------------------------------+-----------------+--------+-------+----------------------------------+-------------+------+
| ID | Name | Status | State | Project | Distributed | HA |
+--------------------------------------+-----------------+--------+-------+----------------------------------+-------------+------+
| 09fa811f-410c-4360-8cae-687e7e73ff21 | provider-router | ACTIVE | UP | 6f5aaf5130764305a5d37862e3ff18ce | False | True |
+--------------------------------------+-----------------+--------+-------+----------------------------------+-------------+------+

===> Prior to the upgrade I can list the keepalived processes linked to the HA router

root 22999 0.0 0.0 91816 3052 ? Ss 19:17 0:00 keepalived -P -f /var/lib/neutron/ha_confs/09fa811f-410c-4360-8cae-687e7e73ff21/keepalived.conf -p /var/lib/neutron/ha_confs/09fa811f-410c-4360-8cae-687e7e73ff21.pid.keepalived -r /var/lib/neutron/ha_confs/09fa811f-410c-4360-8cae-687e7e73ff21.pid.keepalived-vrrp -D

root 23001 0.0 0.1 92084 4088 ? S 19:17 0:00 keepalived -P -f /var/lib/neutron/ha_confs/09fa811f-410c-4360-8cae-687e7e73ff21/keepalived.conf -p /var/lib/neutron/ha_confs/09fa811f-410c-4360-8cae-687e7e73ff21.pid.keepalived -r /var/lib/neutron/ha_confs/09fa811f-410c-4360-8cae-687e7e73ff21.pid.keepalived-vrrp -D

===> After upgrading -- none are returned; in fact the keepalived processes aren't spawned
after neutron-* is upgraded.

Pre-upgrade:
Jun 24 19:17:07 juju-da864d-1927868-5 Keepalived[22997]: Starting Keepalived v1.3.9 (10/21,2017)
Jun 24 19:17:07 juju-da864d-1927868-5 Keepalived[22999]: Starting VRRP child process, pid=23001

Post - upgrade -- Not started

Jun 24 19:30:41 juju-da864d-1927868-5 Keepalived[22999]: Stopping
Jun 24 19:30:42...


Revision history for this message
Edward Hope-Morley (hopem) wrote :

I have just re-tested all of this as follows:

 * deployed Openstack Train (on Bionic i.e. 2:15.3.3-0ubuntu1~cloud0) with 3 gateway nodes
 * created one HA router, one vm with one fip
 * can ping fip and confirm single active router
 * upgraded neutron-server (api) to 16.3.0-0ubuntu3~cloud0 (ussuri), stopped server, neutron-db-manage upgrade head, start server
 * ping still works
 * upgraded all compute hosts to 16.3.0-0ubuntu3~cloud0, observed vrrp failover and short interruption
 * ping still works
 * upgraded one compute to 2:16.3.2-0ubuntu3~cloud0
 * ping still works
 * upgraded neutron-server (api) to 2:16.3.2-0ubuntu3~cloud0, stopped server, neutron-db-manage upgrade head (observed no migrations), start server
 * ping still works
 * upgraded remaining compute to 2:16.3.2-0ubuntu3~cloud0
 * ping still works

I noticed that after upgrading to 2:16.3.2-0ubuntu3~cloud0 my interfaces went from:

root@juju-f0dfb3-lp1927868-6:~# ip netns exec qrouter-8b5e4130-6688-45c5-bc8e-ee3781d8719c ip a s; pgrep -alf keepalived| grep -v state
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ha-bd1bd9ab-f8@if11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether fa:16:3e:6a:ae:8c brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 169.254.195.91/18 brd 169.254.255.255 scope global ha-bd1bd9ab-f8
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe6a:ae8c/64 scope link
       valid_lft forever preferred_lft forever
3: qg-9e134c20-1f@if13: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether fa:16:3e:c4:cc:84 brd ff:ff:ff:ff:ff:ff link-netnsid 0
4: qr-a125b622-2d@if14: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether fa:16:3e:0b:d3:74 brd ff:ff:ff:ff:ff:ff link-netnsid 0

to:

1: lo: <LOOPBACK,UP,LOWER_UP> ...


Changed in neutron (Ubuntu):
status: New → Incomplete
Revision history for this message
Corey Bryant (corey.bryant) wrote :

I've marked this as Incomplete for now since we haven't been able to recreate the issue. We'll need more information on the steps required to recreate and triage it. If anyone has more details on how to recreate it, please add them to the bug and feel free to move it back to the New state.

Revision history for this message
zibort (zibort) wrote (last edit ):

I have the same problem on my setup. We use kolla-ansible for upgrade tasks, and after upgrading from Train to Ussuri the problem was reproduced.

Revision history for this message
zibort (zibort) wrote (last edit ):

The latest Ussuri kolla neutron-l3-agent container build is based on neutron==16.3.3.dev45.

The L3 agent is stuck in "Starting router update" for some routers.

2021-06-30 14:07:50.863 575 INFO neutron.agent.l3.agent [-] Starting router update for a9cb59da-3058-44ac-a2f7-21add6160e6d, action 3, priority 2, update_id 78aad0a3-f5d2-4e81-9e17-df79484bc07e. Wait time elapsed: 32.982

After a restart, a different set of routers is stuck.

Re-installing neutron==16.3.0 resolved the problem.

Revision history for this message
zibort (zibort) wrote :

I tried the L3 agent with neutron==16.3.1; the problem was reproduced.

Revision history for this message
James Page (james-page) wrote :

On the assumption this is somewhere in the L3/HA code tree:

$ git log --pretty=oneline --no-merges 15.3.2..15.3.3 neutron/agent/l3
12c07ba3ea9c6501dd7494561e2920496407c48b [L3][HA] Retry when setting HA router GW status.
3b2b7f4fe7bacb99028b5cba7ac7a8e6c412d965 Remove update_initial_state() method from the HA router
4360603d8bca4aec7793e1bd415ccb2774afd860 Fix deletion of rfp interfaces when router is re-enabled

$ git log --pretty=oneline --no-merges 16.3.0..16.3.1 neutron/agent/l3
590551dbbff5ba0b2a43772c3ef117377f783db8 [L3][HA] Retry when setting HA router GW status.
0c75fd0d330f67939ad1eec6b0773f5353799439 Remove update_initial_state() method from the HA router
73e1672d6fc8ccabd87c97773d51b9efede9ba55 Fix deletion of rfp interfaces when router is re-enabled

James Troup (elmo)
Changed in neutron (Ubuntu):
status: Incomplete → Confirmed
Revision history for this message
Corey Bryant (corey.bryant) wrote :

Edward Hope-Morley has recreated this and put a lot of effort into debugging. We can now officially triage this since we have a reproducer that we're actively debugging. We'll report back with more details soon.

Changed in neutron (Ubuntu):
status: Confirmed → Triaged
importance: Undecided → Critical
Revision history for this message
Corey Bryant (corey.bryant) wrote :

We believe we've narrowed this down to a regression in the commit "[L3][HA] Retry when setting HA router GW status". Reverting that patch appears to have fixed this issue in our test environment, so we are going to move forward with an SRU for further testing. Full debug details can be found in the attached document.

Thanks to Edward Hope-Morley who put a lot of time into recreating/debugging this and also to Dmitrii Shcherbakov for helping debug.

Changed in neutron (Ubuntu Focal):
status: New → Triaged
Changed in neutron (Ubuntu Hirsute):
importance: Undecided → Critical
Changed in neutron (Ubuntu Focal):
importance: Undecided → Critical
Changed in neutron (Ubuntu Hirsute):
status: New → Triaged
description: updated
Revision history for this message
Brian Murray (brian-murray) wrote : Please test proposed package

Hello Alexandros, or anyone else affected,

Accepted neutron into hirsute-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/neutron/2:18.1.0-0ubuntu2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-hirsute to verification-done-hirsute. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-hirsute. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in neutron (Ubuntu Hirsute):
status: Triaged → Fix Committed
tags: added: verification-needed verification-needed-hirsute
Changed in neutron (Ubuntu Focal):
status: Triaged → Fix Committed
tags: added: verification-needed-focal
Revision history for this message
Brian Murray (brian-murray) wrote :

Hello Alexandros, or anyone else affected,

Accepted neutron into focal-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/neutron/2:16.4.0-0ubuntu3 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-focal to verification-done-focal. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-focal. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

Hello Alexandros, or anyone else affected,

Accepted neutron into victoria-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:victoria-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-victoria-needed to verification-victoria-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-victoria-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-victoria-needed
Revision history for this message
Corey Bryant (corey.bryant) wrote :

Hello Alexandros, or anyone else affected,

Accepted neutron into train-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:train-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-train-needed to verification-train-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-train-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-train-needed
Changed in cloud-archive:
status: Triaged → Fix Committed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package neutron - 2:18.1.0+git2021072117.147830620f-0ubuntu2

---------------
neutron (2:18.1.0+git2021072117.147830620f-0ubuntu2) impish; urgency=medium

  * d/p/revert-l3-ha-retry-when-setting-ha-router-gw-status.patch: Revert
    upstream patch that introduced regression that prevented full restore
    of HA routers on restart of L3 agent (LP: #1927868).

 -- Corey Bryant <email address hidden> Wed, 28 Jul 2021 16:40:07 -0400

Changed in neutron (Ubuntu Impish):
status: Triaged → Fix Released
Revision history for this message
James Page (james-page) wrote :

Hello Alexandros, or anyone else affected,

Accepted neutron into ussuri-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:ussuri-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-ussuri-needed to verification-ussuri-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-ussuri-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-ussuri-needed
Revision history for this message
Corey Bryant (corey.bryant) wrote :

Hello Alexandros, or anyone else affected,

Accepted neutron into wallaby-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:wallaby-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-wallaby-needed to verification-wallaby-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-wallaby-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-wallaby-needed
Revision history for this message
Corey Bryant (corey.bryant) wrote :

I've added upstream oslo.privsep to this bug. It seems that, minimally, an except block with a log message would be useful in the send_recv() method in oslo_privsep/comm.py.
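A minimal sketch of what such an except block could look like, assuming a send_recv()-style method that blocks on a future for the reply. The names `wait_for_reply`, `fut`, and `myid` are illustrative placeholders, not the actual oslo.privsep identifiers:

```python
import logging
from concurrent.futures import Future, TimeoutError

LOG = logging.getLogger(__name__)


def wait_for_reply(fut, myid, timeout=None):
    """Block on a reply future, logging instead of failing silently.

    Hypothetical helper: wraps the blocking wait so a timeout (or any
    unexpected error) is at least logged before propagating.
    """
    try:
        return fut.result(timeout=timeout)
    except TimeoutError:
        # Previously a timeout propagated with no trace in the logs.
        LOG.warning('privsep message %s timed out waiting for a reply', myid)
        raise
    except Exception:
        LOG.exception('privsep message %s failed unexpectedly', myid)
        raise
```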

Revision history for this message
Christian Rohmann (christian-rohmann) wrote :

Thanks all for really digging into the issue!

Quite honestly, reverting that one commit may have fixed the observed issue, but a potential ~3-second delay in the code path should not have this impact at all. What I am trying to say is that there may be a deeper timing issue at play.

We recently found an issue with oslo.privsep, or rather with python-msgpack being slow, and observed strange effects as well as log messages that did not indicate the actual issue at all (https://bugs.launchpad.net/ubuntu/+source/python-msgpack/+bug/1937261).

I am quite keen to hear your analysis of WHY the commit caused the effect you observed.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to oslo.privsep (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/oslo.privsep/+/803225

Revision history for this message
Corey Bryant (corey.bryant) wrote :

This bug was fixed in the package neutron - 2:18.1.0+git2021072117.147830620f-0ubuntu2~cloud0
---------------

 neutron (2:18.1.0+git2021072117.147830620f-0ubuntu2~cloud0) focal-xena; urgency=medium
 .
   * New update for the Ubuntu Cloud Archive.
 .
 neutron (2:18.1.0+git2021072117.147830620f-0ubuntu2) impish; urgency=medium
 .
   * d/p/revert-l3-ha-retry-when-setting-ha-router-gw-status.patch: Revert
     upstream patch that introduced regression that prevented full restore
     of HA routers on restart of L3 agent (LP: #1927868).

Changed in cloud-archive:
status: Fix Committed → Fix Released
Revision history for this message
Edward Hope-Morley (hopem) wrote :

Verified bionic-ussuri/proposed using [Test Case]

tags: added: verification-ussuri-done
removed: verification-ussuri-needed
Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote : Update Released

The verification of the Stable Release Update for neutron has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

This bug was fixed in the package neutron - 2:16.4.0-0ubuntu3~cloud0
---------------

 neutron (2:16.4.0-0ubuntu3~cloud0) bionic-ussuri; urgency=medium
 .
   * New update for the Ubuntu Cloud Archive.
 .
 neutron (2:16.4.0-0ubuntu3) focal; urgency=medium
 .
   * d/p/revert-l3-ha-retry-when-setting-ha-router-gw-status.patch: Revert
     upstream patch that introduced regression that prevented full restore
     of HA routers on restart of L3 agent (LP: #1927868).

tags: added: verification-hirsute-done
removed: verification-needed-hirsute
Revision history for this message
Alin-Gabriel Serdean (alin-serdean) wrote (last edit ):

Verified hirsute/proposed, focal-ussuri/proposed, focal-victoria/proposed, focal-wallaby/proposed using [Test Case]

tags: added: verification-focal-done verification-victoria-done verification-wallaby-done
removed: verification-needed-focal verification-victoria-needed verification-wallaby-needed
Revision history for this message
Alin-Gabriel Serdean (alin-serdean) wrote :

Verified bionic-ussuri/proposed and bionic-train/proposed using [Test Case]

tags: added: verification-bionic-done verification-train-done
removed: verification-train-needed
tags: added: verification-done
removed: verification-needed
Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

The verification of the Stable Release Update for neutron has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

This bug was fixed in the package neutron - 2:18.1.0-0ubuntu2~cloud0
---------------

 neutron (2:18.1.0-0ubuntu2~cloud0) focal-wallaby; urgency=medium
 .
   * New update for the Ubuntu Cloud Archive.
 .
 neutron (2:18.1.0-0ubuntu2) hirsute; urgency=medium
 .
   * d/p/revert-l3-ha-retry-when-setting-ha-router-gw-status.patch: Revert
     upstream patch that introduced regression that prevented full restore
     of HA routers on restart of L3 agent (LP: #1927868).

tags: added: verification-done-focal verification-done-hirsute
removed: verification-focal-done verification-hirsute-done
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package neutron - 2:18.1.0-0ubuntu2

---------------
neutron (2:18.1.0-0ubuntu2) hirsute; urgency=medium

  * d/p/revert-l3-ha-retry-when-setting-ha-router-gw-status.patch: Revert
    upstream patch that introduced regression that prevented full restore
    of HA routers on restart of L3 agent (LP: #1927868).

neutron (2:18.1.0-0ubuntu1) hirsute; urgency=medium

  * New stable point release for OpenStack Wallaby (LP: #1935027).
  * Remove patches that have landed upstream:
    - d/p/remove-leading-zeroes-from-an-ip-address.patch.
    - d/p/initialize-privsep-library-for-neutron-ovs-cleanup.patch.
    - d/p/initialize-privsep-library-in-neutron-commands.patch.

 -- Corey Bryant <email address hidden> Wed, 28 Jul 2021 16:52:11 -0400

Changed in neutron (Ubuntu Hirsute):
status: Fix Committed → Fix Released
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package neutron - 2:16.4.0-0ubuntu3

---------------
neutron (2:16.4.0-0ubuntu3) focal; urgency=medium

  * d/p/revert-l3-ha-retry-when-setting-ha-router-gw-status.patch: Revert
    upstream patch that introduced regression that prevented full restore
    of HA routers on restart of L3 agent (LP: #1927868).

 -- Corey Bryant <email address hidden> Wed, 28 Jul 2021 17:20:43 -0400

Changed in neutron (Ubuntu Focal):
status: Fix Committed → Fix Released
Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

The verification of the Stable Release Update for neutron has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

This bug was fixed in the package neutron - 2:17.2.0-0ubuntu1~cloud1
---------------

 neutron (2:17.2.0-0ubuntu1~cloud1) focal-victoria; urgency=medium
 .
   * d/p/revert-l3-ha-retry-when-setting-ha-router-gw-status.patch: Revert
     upstream patch that introduced regression that prevented full restore
     of HA routers on restart of L3 agent (LP: #1927868).
 .
 neutron (2:17.2.0-0ubuntu1~cloud0) focal-victoria; urgency=medium
 .
   * New upstream release for the Ubuntu Cloud Archive.
 .
 neutron (2:17.2.0-0ubuntu1) groovy; urgency=medium
 .
   * New stable point release for OpenStack Victoria (LP: #1935029).

Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

The verification of the Stable Release Update for neutron has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

This bug was fixed in the package neutron - 2:15.3.4-0ubuntu1~cloud1
---------------

 neutron (2:15.3.4-0ubuntu1~cloud1) bionic-train; urgency=medium
 .
   * d/p/revert-l3-ha-retry-when-setting-ha-router-gw-status.patch: Revert
     upstream patch that introduced regression that prevented full restore
     of HA routers on restart of L3 agent (LP: #1927868).

Revision history for this message
Edward Hope-Morley (hopem) wrote :

@christian-rohmann The problem essentially boils down to the exception at [1] being raised because, prior to that, [2] gets called as a result of a timeout exception, but the code does not actually catch that exception. This was traced to a privileged call being used as an argument to [3] from [4] (which is in the patch we reverted).

So the *real* problem with the privsep code is that if an unexpected exception is raised, it does not get caught, either killing the reader thread and/or never releasing the lock. There is a separate bug [5] raised about the same issue, which led to the fix [6] being added to privsep; crucially, it replaces the raised AssertionError with a continue, stopping it from killing the reader thread. I have not yet tested whether this actually fixes all the agent issues we have seen, and while we should do that, there is still room for improvement in the privsep code, namely [7], which should have an except clause that, if nothing else, logs a message saying the message timed out.

[1] https://github.com/openstack/oslo.privsep/blob/6d41ef9f91b297091aa37721ba10456142fc5107/oslo_privsep/comm.py#L141
[2] https://github.com/openstack/oslo.privsep/blob/6d41ef9f91b297091aa37721ba10456142fc5107/oslo_privsep/comm.py#L174
[3] https://github.com/openstack/neutron/blob/d4b1b4a0729c187551e1fa2b2855db136456d496/neutron/common/utils.py#L689
[4] https://github.com/openstack/neutron/blob/d8f1f1118d3cde0b5264220836a250f14687893e/neutron/agent/linux/interface.py#L328
[5] https://bugs.launchpad.net/neutron/+bug/1930401
[6] https://github.com/openstack/oslo.privsep/commit/f7f3349d6a4def52f810ab1728879521c12fe2d0
[7] https://github.com/openstack/oslo.privsep/blob/f7f3349d6a4def52f810ab1728879521c12fe2d0/oslo_privsep/comm.py#L189
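The failure mode described above can be demonstrated with a small, self-contained model. This is an illustrative sketch, not the real oslo.privsep code: the names (`send_recv`, `outstanding_msgs`, `_reader_main`) mirror the comment above, but the plumbing is simplified:

```python
import threading
import queue
from concurrent.futures import Future


class Client:
    """Toy model of the send_recv()/reader-thread interaction."""

    def __init__(self):
        self.outstanding_msgs = {}
        self.replies = queue.Queue()
        self.reader_alive = True
        threading.Thread(target=self._reader_main, daemon=True).start()

    def send_recv(self, myid, timeout=None):
        fut = Future()
        self.outstanding_msgs[myid] = fut
        try:
            # If this raises (e.g. a timeout), the finally block removes
            # myid from outstanding_msgs before the reader thread has had
            # a chance to deliver the late reply.
            return fut.result(timeout=timeout)
        finally:
            del self.outstanding_msgs[myid]

    def _reader_main(self):
        while True:
            msgid = self.replies.get()
            # The reader assumes the id is still registered; after a
            # timeout in send_recv() it no longer is, so the assertion
            # fails and the reader thread dies -- every later call hangs.
            try:
                assert msgid in self.outstanding_msgs, 'msgid unexpected'
            except AssertionError:
                self.reader_alive = False
                raise
            self.outstanding_msgs[msgid].set_result('ok')
```

Calling send_recv() with a short timeout and then delivering the reply late kills the reader thread, which models the stuck L3 agent.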

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/805366

Changed in neutron:
status: New → In Progress
Changed in neutron:
assignee: nobody → Edward Hope-Morley (hopem)
Revision history for this message
Christian Rohmann (christian-rohmann) wrote :

@hopem thanks for your nice reply and the complete overview of the situation.

I do understand the issue with exception handling and propagation between privsep and the reader.
Since one cannot catch every exception or erroneous condition a system might reach, a major improvement would be to consider ways to reconcile in this and other situations:

1) If the setup of any of the various components (veth interfaces, routes, iptables, ...) fails, switch away from being the keepalived master, giving another node the chance to take over.

2) If a node is the master but setup failed, retry the setup.

To avoid excessive retries, exponential back-off certainly needs to be applied, but a node must not remain the HA router master while it is not ready to service traffic.
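The retry-with-back-off idea suggested above could be sketched as follows. This is a hypothetical illustration, not neutron code; the function name and parameters are assumptions:

```python
import time


def retry_with_backoff(setup, attempts=5, base_delay=1.0, max_delay=30.0):
    """Retry a failing setup step with exponential back-off.

    Illustrative sketch: a node that becomes keepalived master but fails
    to plug its gateway interface would retry setup instead of staying
    master while unable to pass traffic.
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return setup()
        except Exception:
            if attempt == attempts:
                raise  # give up; the caller should then demote the node
            time.sleep(delay)
            delay = min(delay * 2, max_delay)  # double the wait, capped
```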

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to oslo.privsep (master)

Reviewed: https://review.opendev.org/c/openstack/oslo.privsep/+/803225
Committed: https://opendev.org/openstack/oslo.privsep/commit/4f1450677ff7c6b22496c391ab87dc79c86c3ef4
Submitter: "Zuul (22348)"
Branch: master

commit 4f1450677ff7c6b22496c391ab87dc79c86c3ef4
Author: Corey Bryant <email address hidden>
Date: Mon Aug 2 11:47:47 2021 -0400

    Add except path with exception debug to send_recv

    The related bug resulted when an exception occurred within the
    future.result() call. This caused the finally block to be executed,
    and therefore myid to be deleted from self.outstanding_msgs prior
    to _reader_main() checking if the msgid not in self.outstanding_msgs.
    This caused _reader_main() to raise an AssertionError because the
    msgid was no longer in outstanding_msgs. This is a small step forward
    to log a warning when this siutation occurs.

    Related-Bug: #1927868
    Change-Id: I2eed242e0c796b8a2aa3d1b21bd1da4c497f624d

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/805366
Committed: https://opendev.org/openstack/neutron/commit/344fc0c8d2ce7d942606c834a54cb81f0b47aa37
Submitter: "Zuul (22348)"
Branch: master

commit 344fc0c8d2ce7d942606c834a54cb81f0b47aa37
Author: Edward Hope-Morley <email address hidden>
Date: Fri Aug 20 12:25:04 2021 +0100

    Revert "[L3][HA] Retry when setting HA router GW status."

    In short this patch can cause the privsep reader thread to
    die resulting in the l3 agent getting stuck and e.g. not
    processing any router updates. See related LP bug for full
    explanation.

    Closes-Bug: #1927868

    This reverts commit 662f483120972a373e19bde52f16392e2ccb9c82.

    Change-Id: Ide7e9771d08eb623dd75941e425813d9b857b4c6

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 19.0.0.0rc1

This issue was fixed in the openstack/neutron 19.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/neutron/+/809219

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/victoria)

Fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/neutron/+/809382

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/neutron/+/809383

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/c/openstack/neutron/+/809384

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/809219
Committed: https://opendev.org/openstack/neutron/commit/b3a70fe75315fba061ecb7d6ac1d50a04768ec13
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit b3a70fe75315fba061ecb7d6ac1d50a04768ec13
Author: Edward Hope-Morley <email address hidden>
Date: Fri Aug 20 12:25:04 2021 +0100

    Revert "[L3][HA] Retry when setting HA router GW status."

    In short this patch can cause the privsep reader thread to
    die resulting in the l3 agent getting stuck and e.g. not
    processing any router updates. See related LP bug for full
    explanation.

    Closes-Bug: #1927868

    This reverts commit 662f483120972a373e19bde52f16392e2ccb9c82.

    Change-Id: Ide7e9771d08eb623dd75941e425813d9b857b4c6
    (cherry picked from commit 344fc0c8d2ce7d942606c834a54cb81f0b47aa37)

tags: added: in-stable-wallaby
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/train)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/809384
Committed: https://opendev.org/openstack/neutron/commit/f54658c203e66355e94fdd48fbd57682577fc928
Submitter: "Zuul (22348)"
Branch: stable/train

commit f54658c203e66355e94fdd48fbd57682577fc928
Author: Edward Hope-Morley <email address hidden>
Date: Fri Aug 20 12:25:04 2021 +0100

    Revert "[L3][HA] Retry when setting HA router GW status."

    In short this patch can cause the privsep reader thread to
    die resulting in the l3 agent getting stuck and e.g. not
    processing any router updates. See related LP bug for full
    explanation.

    Closes-Bug: #1927868

    This reverts commit 662f483120972a373e19bde52f16392e2ccb9c82.

    Change-Id: Ide7e9771d08eb623dd75941e425813d9b857b4c6
    (cherry picked from commit 344fc0c8d2ce7d942606c834a54cb81f0b47aa37)

tags: added: in-stable-train
tags: added: in-stable-ussuri
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/ussuri)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/809383
Committed: https://opendev.org/openstack/neutron/commit/c45f0fd4bce12d9ee3ef07c1ce5d574d3308959f
Submitter: "Zuul (22348)"
Branch: stable/ussuri

commit c45f0fd4bce12d9ee3ef07c1ce5d574d3308959f
Author: Edward Hope-Morley <email address hidden>
Date: Fri Aug 20 12:25:04 2021 +0100

    Revert "[L3][HA] Retry when setting HA router GW status."

    In short this patch can cause the privsep reader thread to
    die resulting in the l3 agent getting stuck and e.g. not
    processing any router updates. See related LP bug for full
    explanation.

    Closes-Bug: #1927868

    This reverts commit 662f483120972a373e19bde52f16392e2ccb9c82.

    Change-Id: Ide7e9771d08eb623dd75941e425813d9b857b4c6
    (cherry picked from commit 344fc0c8d2ce7d942606c834a54cb81f0b47aa37)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/victoria)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/809382
Committed: https://opendev.org/openstack/neutron/commit/5049a8faf103324add24a09c74b9f41aba37ec75
Submitter: "Zuul (22348)"
Branch: stable/victoria

commit 5049a8faf103324add24a09c74b9f41aba37ec75
Author: Edward Hope-Morley <email address hidden>
Date: Fri Aug 20 12:25:04 2021 +0100

    Revert "[L3][HA] Retry when setting HA router GW status."

    In short this patch can cause the privsep reader thread to
    die resulting in the l3 agent getting stuck and e.g. not
    processing any router updates. See related LP bug for full
    explanation.

    Closes-Bug: #1927868

    This reverts commit 662f483120972a373e19bde52f16392e2ccb9c82.

    Change-Id: Ide7e9771d08eb623dd75941e425813d9b857b4c6
    (cherry picked from commit 344fc0c8d2ce7d942606c834a54cb81f0b47aa37)

tags: added: in-stable-victoria
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 16.4.2

This issue was fixed in the openstack/neutron 16.4.2 release.

tags: added: neutron-proactive-backport-potential
tags: removed: neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 17.3.0

This issue was fixed in the openstack/neutron 17.3.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 18.2.0

This issue was fixed in the openstack/neutron 18.2.0 release.

Revision history for this message
Hua Zhang (zhhuabj) wrote :

Hi, should we change the bug status from Fix Released back to Confirmed? Since the fix was reverted, the problem behind it (LP bug 1916024) still exists.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron train-eol

This issue was fixed in the openstack/neutron train-eol release.
