[SRU] Doesn't regain quorum when tracked process restarts with PID > 32767

Bug #1960036 reported by Jason Grammenos
This bug affects 1 person
Affects              Status         Importance   Assigned to       Milestone
keepalived (Ubuntu)  Fix Released   Undecided    Unassigned
  Focal              Fix Released   High         Lucas Kanashiro

Bug Description

[Impact]

If a tracked process restarts with a PID greater than 32767, keepalived does not regain quorum for it, so process tracking stops working as expected.

This bug was fixed upstream here: https://github.com/acassen/keepalived/commit/23a5b8113bf0b8ec4718443df0406882e8e4d831

[Test Plan]

Launch a Focal VM and run the following commands:

# install keepalived and nginx
$ apt install -y nginx keepalived

# configure keepalived to track the nginx process
$ cat << EOF > /etc/keepalived/keepalived.conf
global_defs {
    enable_script_security
}
vrrp_track_process track_nginx {
    process nginx
    weight 10
    delay 1
}
vrrp_instance lb {
    interface enp5s0
    state MASTER
    priority 100
    virtual_router_id 50
    authentication {
        auth_type PASS
        auth_pass password
    }
    track_process {
        track_nginx
    }
    virtual_ipaddress {
        10.191.226.19
    }
}
EOF
$ systemctl restart keepalived

# stop nginx process to lose quorum
$ systemctl stop nginx
$ journalctl -u keepalived | grep Quorum
Feb 07 20:13:04 keepalived-debug Keepalived_vrrp[3346]: Quorum lost for tracked process track_nginx

# start nginx process to gain quorum
$ systemctl start nginx
$ pidof nginx
3282 3281
$ journalctl -u keepalived | grep Quorum
Feb 07 20:13:04 keepalived-debug Keepalived_vrrp[3346]: Quorum lost for tracked process track_nginx
Feb 07 20:19:58 keepalived-debug Keepalived_vrrp[3346]: Quorum gained for tracked process track_nginx

# stop nginx process to lose quorum again
$ systemctl stop nginx
$ journalctl -u keepalived | grep Quorum
Feb 07 20:13:04 keepalived-debug Keepalived_vrrp[3346]: Quorum lost for tracked process track_nginx
Feb 07 20:19:58 keepalived-debug Keepalived_vrrp[3346]: Quorum gained for tracked process track_nginx
Feb 07 20:21:39 keepalived-debug Keepalived_vrrp[3346]: Quorum lost for tracked process track_nginx

# start nginx process forcing its PID to be > 32767
$ echo 32767 > /proc/sys/kernel/ns_last_pid; systemctl start nginx
$ pidof nginx
32773 32772

# quorum is not gained again
$ journalctl -u keepalived | grep Quorum
Feb 07 20:13:04 keepalived-debug Keepalived_vrrp[3346]: Quorum lost for tracked process track_nginx
Feb 07 20:19:58 keepalived-debug Keepalived_vrrp[3346]: Quorum gained for tracked process track_nginx
Feb 07 20:21:39 keepalived-debug Keepalived_vrrp[3346]: Quorum lost for tracked process track_nginx
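
The reproduction only exercises the bug when the restarted process really lands above PID 32767, which in turn requires kernel.pid_max to allow such PIDs. A minimal sketch of a precondition check follows; the function names and the optional file argument are illustrative and are not part of keepalived or of the original test plan.

```shell
#!/bin/sh
# Hypothetical helpers for the test plan; not part of keepalived.

# Succeeds if pid_max (read from the given file, by default the real
# sysctl file) permits PIDs greater than 32767.
pid_max_allows_high_pids() {
    pid_max_file="${1:-/proc/sys/kernel/pid_max}"
    pid_max="$(cat "$pid_max_file")" || return 2
    # PIDs range up to pid_max - 1, so pid_max must exceed 32768.
    [ "$pid_max" -gt 32768 ]
}

# Succeeds if every PID given as an argument is greater than 32767.
all_pids_high() {
    [ "$#" -gt 0 ] || return 1
    for pid in "$@"; do
        [ "$pid" -gt 32767 ] || return 1
    done
    return 0
}
```

For example, `all_pids_high $(pidof nginx) && echo "nginx PIDs are in the affected range"` confirms that the restart actually hit the failing case before the journal is inspected.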

To make sure the bug is fixed we need to install the fixed keepalived package, then stop and start the nginx process (with PID > 32767). After that, the quorum will be lost again and then regained:

$ systemctl stop nginx
$ systemctl start nginx
$ pidof nginx
33505 33504
$ journalctl -u keepalived | grep Quorum
Feb 08 14:46:47 keepalived-debug2 Keepalived_vrrp[8079]: Quorum lost for tracked process track_nginx
Feb 08 14:47:00 keepalived-debug2 Keepalived_vrrp[8079]: Quorum gained for tracked process track_nginx
Feb 08 14:47:19 keepalived-debug2 Keepalived_vrrp[8079]: Quorum lost for tracked process track_nginx
Feb 08 14:49:01 keepalived-debug2 Keepalived_vrrp[33346]: Quorum lost for tracked process track_nginx
Feb 08 14:49:14 keepalived-debug2 Keepalived_vrrp[33346]: Quorum gained for tracked process track_nginx
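
The pass/fail decision above comes down to whether the most recent Quorum message is "gained". A small sketch of how that check could be scripted, assuming the log message wording shown in this report; the helper name is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: reads log lines on stdin and prints "gained", "lost",
# or "unknown" according to the most recent quorum message for the given
# vrrp_track_process name.
last_quorum_state() {
    name="$1"
    awk -v name="$name" '
        BEGIN {
            gained = "Quorum gained for tracked process " name
            lost   = "Quorum lost for tracked process " name
            state  = "unknown"
        }
        # Compare the end of each line against the expected message text.
        substr($0, length($0) - length(gained) + 1) == gained { state = "gained" }
        substr($0, length($0) - length(lost) + 1)   == lost   { state = "lost" }
        END { print state }
    '
}
```

Usage: `journalctl -u keepalived | last_quorum_state track_nginx`. With the fixed package and nginx running, the expected result is `gained`.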

[Where problems could occur]

The upstream fix is quite straightforward, but if a problem did occur it would show up in keepalived's process tracking feature. Since keepalived is widely used in HA deployments, a regression could affect setups that rely on track_process.

[Original description]

Keepalived doesn't regain quorum when using a tracked process, due to a bug with high-numbered PIDs.
Upstream has already fixed this in patch release 2.0.20.

https://serverfault.com/questions/993432/keepalived-doesnt-gain-quorum-when-tracked-process-comes-up

Could we please get 2.0.20 released to 20.04?

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Thanks for the report Jason!

This is fixed in:
https://github.com/acassen/keepalived/commit/23a5b8113bf0b8ec4718443df0406882e8e4d831

Which is in 2.1.0 and later as well as in the 2.0.20 backport.

Thereby >=Impish are fixed already:
 keepalived | 1:2.0.19-2ubuntu0.1 | focal-updates | source, amd64, arm64, armhf, ppc64el, riscv64, s390x
 keepalived | 1:2.1.5-0.2ubuntu1.1 | impish-updates | source, amd64, arm64, armhf, ppc64el, riscv64, s390x
 keepalived | 1:2.2.4-0.2 | jammy | source, amd64, arm64, armhf, ppc64el, riscv64, s390x

I'd leave it to Lucas, who usually looks after HA bits, to decide whether we want to fix just this or should consider 2.0.20 as a whole for an MRE.
Assigning him for further triaging ...

Changed in keepalived (Ubuntu Focal):
status: New → Confirmed
Changed in keepalived (Ubuntu):
status: New → Fix Released
Changed in keepalived (Ubuntu Focal):
assignee: nobody → Lucas Kanashiro (lucaskanashiro)
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

At a glance, 2.0.20 looks to me too much like a not-just-fixes release for an MRE; see:
https://www.keepalived.org/changelog.html

But not reaching quorum on some PIDs seems severe.

Therefore I'd suggest that we:
- add this to server-next to backport just the fix
- look for someone to work on it this Wednesday
- set importance to high

@Lucas - please have a look if you agree and if you do set these status entries as suggested.

@Jason - for an SRU [1] we will also eventually need a test case. Do you happen to know whether some trivial setup plus (I assume) forcing high PID numbers is enough to test this? If so, it would be very useful to add these "how to test this on a fresh system" steps here.

[1]: https://wiki.ubuntu.com/StableReleaseUpdates

Revision history for this message
Jason Grammenos (jason.grammenos.agility) wrote :

Here are both of the keepalived configs. Both are about as basic as they get. I am not sure how to force the high PIDs. If you need more information from me, just let me know.

primary keepalived config (update ips to match your environment)
```
# Global Settings for notifications
global_defs {
    enable_script_security
    script_user keepalived_script
    enable_dbus
}
vrrp_track_process track_haproxy {
    process haproxy
    weight 10
    delay 1
}
# Configuration for Virtual Interface
vrrp_instance LB_VIP {
    interface ens5
    state MASTER
    priority 101
    virtual_router_id 51

    authentication {
        auth_type PASS
        auth_pass somepassword # Password for accessing vrrpd. Same on all devices, only up to 8 chars
    }
    unicast_src_ip 10.1.1.3 # Private IP address of primary
    unicast_peer {
        10.1.1.4 # Private IP address of the secondary haproxy
    }

    # The virtual ip address shared between the two loadbalancers
    virtual_ipaddress {
        10.1.1.2
    }

    # Use the Defined Script to Check whether to initiate a fail over
    #track_script {
    # chk_haproxy
    #}
    track_process {
        track_haproxy
    }
}
```
backup keepalived config
```
# Global Settings for notifications
global_defs {
    enable_script_security
    script_user keepalived_script
    enable_dbus
}
vrrp_track_process track_haproxy {
    process haproxy
    weight 10
    delay 1
}
# Configuration for Virtual Interface
vrrp_instance LB_VIP {
    interface ens5
    state BACKUP
    priority 100
    virtual_router_id 51

    authentication {
        auth_type PASS
        auth_pass somepass # Password for accessing vrrpd. Same on all devices, only up to 8 chars
    }

    unicast_src_ip 10.1.1.4 # Private IP address of the secondary (this node)
    unicast_peer {
        10.1.1.3 # Private IP address of the primary haproxy
    }

    # The virtual ip address shared between the two loadbalancers
    virtual_ipaddress {
        10.1.1.2
    }

    track_process {
        track_haproxy
    }
}
```
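
For reference, the logs later in this report show how the weight of 10 plays out with these configs: the primary's effective priority is 111 (101 + 10) while haproxy is running and drops to 101 when quorum is lost, which is below the backup's 110 (100 + 10), so the VIP fails over. A small sketch of that arithmetic, assuming a positive weight is simply added while the tracked process has quorum; the function name is illustrative:

```shell
#!/bin/sh
# Hypothetical helper illustrating the priority arithmetic seen in the logs.
# usage: effective_priority BASE WEIGHT QUORUM   (QUORUM is "up" or "down")
effective_priority() {
    base="$1"; weight="$2"; quorum="$3"
    if [ "$quorum" = "up" ]; then
        echo $((base + weight))
    else
        echo "$base"
    fi
}

effective_priority 101 10 up     # primary, haproxy running
effective_priority 101 10 down   # primary, haproxy stopped
effective_priority 100 10 up     # backup, haproxy running
```

This also shows why the bug matters: if quorum is never regained on the primary, its priority stays at 101 and it cannot reclaim the VIP from the backup at 110.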

Revision history for this message
Lucas Kanashiro (lucaskanashiro) wrote :

Thanks for the investigation. I agree with your assessment, Christian; backporting only the fix seems to be the way to go.

I'll be trying to define a test case for this SRU based on the info we already have.

Adding server-next tag and setting the importance to high.

Changed in keepalived (Ubuntu Focal):
importance: Undecided → High
tags: added: server-next
summary: - Doesnt regain quorum when tracked process restarts
+ [SRU] Doesn't regain quorum when tracked process restarts with PID >
+ 32767
description: updated
Changed in keepalived (Ubuntu Focal):
status: Confirmed → In Progress
description: updated
Revision history for this message
Brian Murray (brian-murray) wrote : Please test proposed package

Hello Jason, or anyone else affected,

Accepted keepalived into focal-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/keepalived/1:2.0.19-2ubuntu0.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-focal to verification-done-focal. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-focal. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.
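
As a rough sketch of what enabling -proposed for this test could look like on a Focal machine: the helper below only prints an APT source line, and the mirror URL, component list, and file name in the comments are assumptions, so treat the wiki page linked above as the authoritative procedure.

```shell
#!/bin/sh
# Hypothetical helper: prints an APT source line for a release's -proposed
# pocket. The mirror URL and component list are assumptions; adjust them
# for your architecture and mirror.
proposed_source_line() {
    release="$1"
    echo "deb http://archive.ubuntu.com/ubuntu ${release}-proposed main restricted universe multiverse"
}

# Example (as root):
#   proposed_source_line focal > /etc/apt/sources.list.d/focal-proposed.list
#   apt update
#   apt install -t focal-proposed keepalived
```

Installing with `-t focal-proposed` limits the upgrade to the named package rather than pulling everything from -proposed.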

Changed in keepalived (Ubuntu Focal):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-focal
Revision history for this message
Jason Grammenos (jason.grammenos.agility) wrote :

version tested: 2.0.19-2ubuntu0.2

Tested keepalived's ability to regain quorum after the tracked process restarts with a high-numbered PID.

Based on my testing the version tested fixes the reported issue. I do not know if it introduces new bugs. I only tested to see if it fixes the reported issue.
Note: after downgrading back to 2.0.19-2ubuntu0.1 the issue reappears.

testing output

```
pre patch

----

pp01
stop haproxy service
Feb 16 11:30:26 pp01 systemd[1]: Stopping HAProxy Load Balancer...
Feb 16 11:30:26 pp01 systemd[1]: haproxy.service: Succeeded.
Feb 16 11:30:26 pp01 systemd[1]: Stopped HAProxy Load Balancer.
Feb 16 11:30:27 pp01 Keepalived_vrrp[18854]: Quorum lost for tracked process track_haproxy
Feb 16 11:30:27 pp01 Keepalived_vrrp[18854]: (LB_VIP) Changing effective priority from 111 to 101
Feb 16 11:30:31 pp01 Keepalived_vrrp[18854]: (LB_VIP) Master received advert from 10.4.150.182 with higher priority 110, ours 101
Feb 16 11:30:31 pp01 Keepalived_vrrp[18854]: (LB_VIP) Entering BACKUP STATE
Feb 16 11:30:54 pp01 systemd[1]: Starting HAProxy Load Balancer...
Feb 16 11:30:54 pp01 systemd[1]: Started HAProxy Load Balancer.
start haproxy service
-- nothing
restart haproxy service
Feb 16 11:32:15 pp01 systemd[1]: Stopping Keepalive Daemon (LVS and VRRP)...
Feb 16 11:32:16 pp01 Keepalived_vrrp[18854]: Released DBus
Feb 16 11:32:16 pp01 Keepalived_vrrp[18854]: Stopped
Feb 16 11:32:16 pp01 Keepalived[18853]: Stopped Keepalived v2.0.19 (10/19,2019)
Feb 16 11:32:16 pp01 systemd[1]: keepalived.service: Succeeded.
Feb 16 11:32:16 pp01 systemd[1]: Stopped Keepalive Daemon (LVS and VRRP).
Feb 16 11:32:16 pp01 systemd[1]: Started Keepalive Daemon (LVS and VRRP).
Feb 16 11:32:16 pp01 Keepalived[511543]: Starting Keepalived v2.0.19 (10/19,2019)
Feb 16 11:32:16 pp01 Keepalived[511543]: Running on Linux 5.11.0-1028-aws #31~20.04.1-Ubuntu SMP Fri Jan 14 14:37:50 UTC 2022 (built for Linux 5.4.151)
Feb 16 11:32:16 pp01 Keepalived[511543]: Command line: '/usr/sbin/keepalived' '--dont-fork'
Feb 16 11:32:16 pp01 Keepalived[511543]: Opening file '/etc/keepalived/keepalived.conf'.
Feb 16 11:32:16 pp01 Keepalived[511543]: Starting VRRP child process, pid=511544
Feb 16 11:32:16 pp01 Keepalived_vrrp[511544]: Registering Kernel netlink reflector
Feb 16 11:32:16 pp01 Keepalived_vrrp[511544]: Registering Kernel netlink command channel
Feb 16 11:32:16 pp01 Keepalived_vrrp[511544]: Opening file '/etc/keepalived/keepalived.conf'.
Feb 16 11:32:16 pp01 Keepalived_vrrp[511544]: (LB_VIP) Changing effective priority from 101 to 111
Feb 16 11:32:16 pp01 Keepalived_vrrp[511544]: Registering gratuitous ARP shared channel
Feb 16 11:32:16 pp01 Keepalived_vrrp[511544]: (LB_VIP) Entering BACKUP STATE (init)
Feb 16 11:32:16 pp01 Keepalived_vrrp[511544]: Acquired DBus bus org.keepalived.Vrrp1
Feb 16 11:32:16 pp01 Keepalived_vrrp[511544]: Acquired the name org.keepalived.Vrrp1 on the session bus
Feb 16 11:32:17 pp01 Keepalived_vrrp[511544]: (LB_VIP) received lower priority (110) advert from 10.4.150.182 - discarding
Feb 16 11:32:20 pp01 Keepalived_vrrp[511544]: message repeated 3 times: [ (LB_VIP) received lower priority (110) advert from 10.4.150.182 - discarding]
Feb 16 11:32:...

```

tags: added: verification-done-focal
tags: added: verification-done
removed: verification-needed verification-needed-focal
Revision history for this message
Robie Basak (racb) wrote : Update Released

The verification of the Stable Release Update for keepalived has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package keepalived - 1:2.0.19-2ubuntu0.2

---------------
keepalived (1:2.0.19-2ubuntu0.2) focal; urgency=medium

  * Add patch fixing track_process with PID greater than 32767 (LP: #1960036)

 -- Lucas Kanashiro <email address hidden> Mon, 07 Feb 2022 17:52:31 -0300

Changed in keepalived (Ubuntu Focal):
status: Fix Committed → Fix Released