[SRU] Add support for disabling mlockall() calls in ovs-vswitchd

Bug #1906280 reported by Michael Skalka
This bug affects 5 people
Affects                                Status        Importance  Assigned to    Milestone
OpenStack Neutron Gateway Charm        New           Undecided   Unassigned
OpenStack Neutron Open vSwitch Charm   Fix Released  Critical    Frode Nordahl
Ubuntu Cloud Archive                   Fix Released  Undecided   Unassigned
  Queens                               Fix Released  Critical    Corey Bryant
  Stein                                Fix Released  Critical    Corey Bryant
  Train                                Fix Released  Critical    Corey Bryant
  Ussuri                               Fix Released  Critical    Corey Bryant
charm-ovn-chassis                      Fix Released  Critical    Corey Bryant
charm-ovn-dedicated-chassis            Fix Released  Critical    Unassigned
openvswitch (Ubuntu)                   Fix Released  Critical    Corey Bryant
  Bionic                               Fix Released  Critical    Corey Bryant
  Focal                                Fix Released  Critical    Corey Bryant
  Groovy                               Fix Released  Critical    Corey Bryant
  Hirsute                              Fix Released  Critical    Corey Bryant

Bug Description

[Impact]

Recent changes to the systemd rlimit defaults are resulting in memory exhaustion with ovs-vswitchd's use of mlockall(). mlockall() can be disabled via /etc/default/openvswitch-switch; however, there is currently a bug in the shipped ovs-vswitchd systemd unit file that prevents this. The package will be fixed in this SRU. Additionally, the neutron-openvswitch charm will be updated so that mlockall() use in ovs-vswitchd can be disabled via a config option.

More details on the above summary can be found in the following comments:
https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1906280/comments/16
https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1906280/comments/19

==== Original bug details ===
Original bug title:

Charm stuck waiting for ovsdb 'no key "ovn-remote" in Open_vSwitch record'

Original bug details:

As seen during this Focal Ussuri test run: https://solutions.qa.canonical.com/testruns/testRun/5f7ad510-f57e-40ce-beb7-5f39800fa5f0
Crashdump here: https://oil-jenkins.canonical.com/artifacts/5f7ad510-f57e-40ce-beb7-5f39800fa5f0/generated/generated/openstack/juju-crashdump-openstack-2020-11-28-03.40.36.tar.gz

Full history of occurrences can be found here: https://solutions.qa.canonical.com/bugs/bugs/bug/1906280

Octavia's ovn-chassis units are stuck waiting:

octavia/0 blocked idle 1/lxd/8 10.244.8.170 9876/tcp Awaiting leader to create required resources
  hacluster-octavia/1 active idle 10.244.8.170 Unit is ready and clustered
  logrotated/63 active idle 10.244.8.170 Unit is ready.
  octavia-ovn-chassis/1 waiting executing 10.244.8.170 'ovsdb' incomplete
  public-policy-routing/45 active idle 10.244.8.170 Unit is ready

When the db is reporting healthy:

ovn-central/0* active idle 1/lxd/9 10.246.64.225 6641/tcp,6642/tcp Unit is ready (leader: ovnnb_db, ovnsb_db)
  logrotated/19 active idle 10.246.64.225 Unit is ready.
ovn-central/1 active idle 3/lxd/9 10.246.64.250 6641/tcp,6642/tcp Unit is ready (northd: active)
  logrotated/27 active idle 10.246.64.250 Unit is ready.
ovn-central/2 active idle 5/lxd/9 10.246.65.21 6641/tcp,6642/tcp Unit is ready
  logrotated/52 active idle 10.246.65.21 Unit is ready.

Warning in the juju unit logs indicates that the charm is blocking on a missing key in the ovsdb:

2020-11-27 23:36:57 INFO juju-log ovsdb:195: Invoking reactive handler: hooks/relations/ovsdb-subordinate/provides.py:97:joined:ovsdb-subordinate
2020-11-27 23:36:57 DEBUG jujuc server.go:211 running hook tool "relation-get"
2020-11-27 23:36:57 WARNING ovsdb-relation-changed ovs-vsctl: no key "ovn-remote" in Open_vSwitch record "." column external_ids
2020-11-27 23:36:57 DEBUG jujuc server.go:211 running hook tool "juju-log"
2020-11-27 23:36:57 INFO juju-log ovsdb:195: Invoking reactive handler: hooks/relations/ovsdb/requires.py:34:joined:ovsdb
==============================

[Test Case]
Note: Bionic requires additional testing due to pairing with other SRUs.

The easiest way to test this is to deploy openstack with the neutron-openvswitch charm, using the new charm updates. Once deployed, edit /usr/share/openvswitch/scripts/ovs-ctl with an echo to show what MLOCKALL is set to. Then toggle the charm config option [1] and look at journalctl -xe to find the echo output, which should correspond to the mlockall setting.

[1]
juju config neutron-openvswitch disable-mlockall=true
juju config neutron-openvswitch disable-mlockall=false
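
As an alternative to patching ovs-ctl with an echo, the effective setting can also be read from the running daemon's command line, since ovs-ctl passes `--mlockall` to ovs-vswitchd when locking is enabled (visible in the apport log later in this report). A minimal Python sketch, assuming a Linux /proc filesystem; the helper names are hypothetical and not part of any OVS or charm tooling:

```python
from pathlib import Path


def cmdline_has_mlockall(cmdline: bytes) -> bool:
    """Return True if a NUL-separated /proc/<pid>/cmdline contains --mlockall."""
    args = [a.decode(errors="replace") for a in cmdline.split(b"\0") if a]
    return "--mlockall" in args


def pid_uses_mlockall(pid: int) -> bool:
    """Read /proc/<pid>/cmdline for a live process (hypothetical helper)."""
    return cmdline_has_mlockall(Path(f"/proc/{pid}/cmdline").read_bytes())


# Arguments taken from the apport log in this bug, re-joined with NULs
# the way the kernel exposes them in /proc/<pid>/cmdline.
sample = b"\0".join([
    b"ovs-vswitchd", b"unix:/var/run/openvswitch/db.sock",
    b"-vconsole:emer", b"-vsyslog:err", b"-vfile:info",
    b"--mlockall", b"--no-chdir", b"--detach",
]) + b"\0"
print(cmdline_has_mlockall(sample))
```

After toggling the charm option and letting the service restart, `pid_uses_mlockall(<pid of ovs-vswitchd>)` would be expected to flip accordingly.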

[Regression Potential]
There's potential that this will break users who have come to depend on the incorrect EnvironmentFile setting and environment variable in the systemd unit file for ovs-vswitchd. If that is the case they must be running with modified systemd unit files anyway so it is probably a moot point.

[Discussion]
== Groovy ==
Update (16-12-2020): I chatted briefly with Christian and it sounds like the ltmain-whole-archive.diff may be optional, so I've dropped it from this upload. There are now 2 openvswitch's in the groovy unapproved queue. Please reject the upload from 15-12-2020 and consider accepting the upload from 16-12-2020.
I have a query out to James and Christian about an undocumented commit that is getting picked up in the groovy upload. It is committed to the ubuntu/groovy branch of the package Vcs. See debian/ltmain-whole-archive.diff and debian/rules in the upload debdiff at http://launchpadlibrarian.net/511453613/openvswitch_2.13.1-0ubuntu1_2.13.1-0ubuntu1.1.diff.gz

== Bionic ==
The bionic upload is paired with the following SRUs which will also require verification:
https://bugs.launchpad.net/bugs/1823295
https://bugs.launchpad.net/bugs/1881077

== Package details ==
New package versions are in progress and can be found at:
hirsute: https://launchpad.net/ubuntu/+source/openvswitch/2.14.0-0ubuntu2
groovy: https://launchpad.net/ubuntu/groovy/+queue?queue_state=1&queue_text=openvswitch
focal: https://launchpad.net/ubuntu/focal/+queue?queue_state=1&queue_text=openvswitch
train: https://launchpad.net/~ubuntu-cloud-archive/+archive/ubuntu/train-staging/+packages?field.name_filter=openvswitch&field.status_filter=published&field.series_filter=
stein: https://launchpad.net/~ubuntu-cloud-archive/+archive/ubuntu/stein-staging/+packages?field.name_filter=openvswitch&field.status_filter=published&field.series_filter=
bionic: https://launchpad.net/ubuntu/bionic/+queue?queue_state=1&queue_text=openvswitch

== Charm update ==
https://review.opendev.org/c/openstack/charm-neutron-openvswitch/+/767212

Michael Skalka (mskalka) wrote :

Subscribing field-high as we are seeing this during our Focal Ussuri HOV release runs.

Marian Gasparovic (marosg) wrote :
Michael Skalka (mskalka)
description: updated
Michael Skalka (mskalka) wrote :

Given the frequency and disruption this is causing to our stable B-U sku release I am subbing this critical.

Billy Olsen (billy-olsen) wrote :

The charm looks like it's been executing the hook for quite a while:

          octavia-ovn-chassis/1:
            workload-status:
              current: waiting
              message: '''ovsdb'' incomplete'
              since: 27 Nov 2020 23:36:17Z

while the latest entry in the unit log is about 4 hours later:

2020-11-28 03:43:30 DEBUG juju.worker.uniter.remotestate watcher.go:636 update status timer triggered

And the last thing of relevance that the charm was doing was creating the integration bridge 'br-int':

2020-11-27 23:36:55 INFO juju-log ovsdb:195: Creating bridge br-int

Looking at the syslog, it appears that creating the bridge is causing apport to detect a crash:

Nov 27 23:36:54 juju-7f6c1c-1-lxd-8 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl set open . external-ids:ovn-remote=ssl:10.246.64.225:6642,ssl:10.246.64.250:6642,ssl:10.246.65.21:6642
Nov 27 23:36:55 juju-7f6c1c-1-lxd-8 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl -- --may-exist add-br br-int -- set bridge br-int external-ids:charm-ovn-chassis=managed -- set bridge br-int protocols=OpenFlow13,OpenFlow15 -- set bridge br-int datapath-type=system -- set bridge br-int fail-mode=secure -- set bridge br-int other-config:disable-in-band=true
Nov 27 23:36:55 juju-7f6c1c-1-lxd-8 networkd-dispatcher[222]: WARNING:Unknown index 2 seen, reloading interface list
Nov 27 23:36:55 juju-7f6c1c-1-lxd-8 systemd[1]: system-apport\x2dforward.slice: Failed to reset devices.list: Operation not permitted
Nov 27 23:36:55 juju-7f6c1c-1-lxd-8 systemd[1]: Created slice system-apport\x2dforward.slice.
Nov 27 23:36:55 juju-7f6c1c-1-lxd-8 systemd[1]: Starting Apport crash forwarding receiver...
Nov 27 23:36:56 juju-7f6c1c-1-lxd-8 systemd[1]: Started Apport crash forwarding receiver.

which is confirmed in the apport log file:

ERROR: apport (pid 71613) Fri Nov 27 23:36:55 2020: called for pid 17618, signal 6, core limit 34359738368, dump mode 1
ERROR: apport (pid 71613) Fri Nov 27 23:36:55 2020: executable: /usr/lib/openvswitch-switch/ovs-vswitchd (command line "ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --mlockall --no-chdir --log-file=/var/log/openvswitch/ovs-vswitchd.log --pidfile=/var/run/openvswitch/ovs-vswitchd.pid --detach")
ERROR: apport (pid 71613) Fri Nov 27 23:36:55 2020: is_closing_session(): no DBUS_SESSION_BUS_ADDRESS in environment
ERROR: apport (pid 71613) Fri Nov 27 23:36:56 2020: wrote report /var/crash/_usr_lib_openvswitch-switch_ovs-vswitchd.0.crash
ERROR: apport (pid 71613) Fri Nov 27 23:36:56 2020: writing core dump to core (limit: 34359738368)
ERROR: apport (pid 71613) Fri Nov 27 23:36:56 2020: writing core dump core of size 27217920

So this appears to be ovs crashing when creating the bridge.

Billy Olsen (billy-olsen) wrote :

Can you update the juju crash dump to collect the contents of /var/crash as well? This will give us a chance to see what's causing the crash. I think it could be argued that any crash in /var/crash should be collected via juju crashdump as a good practice.

Changed in charm-ovn-chassis:
assignee: nobody → Billy Olsen (billy-olsen)
Changed in charm-ovn-chassis:
assignee: Billy Olsen (billy-olsen) → Dmitrii Shcherbakov (dmitriis)
Michael Skalka (mskalka) wrote :

We have added the collection of /var/crash to juju-crashdump, so the next runs which fail on this should have the requested information. I'll update the bug once those are captured.

Changed in charm-ovn-chassis:
status: New → Incomplete
Billy Olsen (billy-olsen) wrote :

From the crash made available in a recreate, we get the following:

#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
        set = {__val = {0, 140725298456336, 140664669834388, 0, 140725298456672, 140664668256100, 14583431671641669699, 14583431671641719254, 0, 140664668255802, 4707197592648237900, 7161402270843880775,
            140724610755886, 11, 140725298456672, 140664668255726}}
        pid = <optimized out>
        tid = <optimized out>
        ret = <optimized out>
#1 0x00007fef0b965921 in __GI_abort () at abort.c:79
        save_stage = 1
        act = {__sigaction_handler = {sa_handler = 0x5, sa_sigaction = 0x5}, sa_mask = {__val = {140664672163456, 10, 11, 11, 94494056588192, 140725298456896, 140664668628179, 140664672163456, 140664695139080,
              11, 140664668607515, 0, 94494055958702, 1, 140725298457136, 94494060193760}}, sa_flags = 695003688, sa_restorer = 0x55f11ce422a8}
        sigs = {__val = {32, 0 <repeats 15 times>}}
        __cnt = <optimized out>
        __set = <optimized out>
        __cnt = <optimized out>
        __set = <optimized out>
#2 0x000055f11ca3892e in ovs_abort_valist (err_no=<optimized out>, format=<optimized out>, args=args@entry=0x7ffd296ce940) at ../lib/util.c:419
No locals.
#3 0x000055f11ca389c4 in ovs_abort (err_no=<optimized out>, format=format@entry=0x55f11cad23a0 "pthread_create failed") at ../lib/util.c:411
        args = {{gp_offset = 16, fp_offset = 48, overflow_arg_area = 0x7ffd296cea20, reg_save_area = 0x7ffd296ce960}}
#4 0x000055f11ca048fe in ovs_thread_create (name=name@entry=0x55f11cab2908 "handler", start=start@entry=0x55f11c9413f0 <udpif_upcall_handler>, arg=arg@entry=0x55f11ce422a8) at ../lib/ovs-thread.c:449
        once = {done = true, mutex = {lock = pthread_mutex_t = {Type = Error check, Status = Not acquired, Robust = No, Shared = No, Protocol = None}, where = 0x55f11caaacfe "<unlocked>"}}
        aux = 0x55f11ce427e0
        thread = 8388608
        error = <optimized out>
        attr = {__size = '\000' <repeats 17 times>, "\020", '\000' <repeats 37 times>, __align = 0}
#5 0x000055f11c93e7d6 in udpif_start_threads (udpif=0x55f11ce41b50, n_handlers_=<optimized out>, n_revalidators_=<optimized out>) at ../ofproto/ofproto-dpif-upcall.c:569
        handler = 0x55f11ce422a8
        i = 3

The pthread_create call is failing with EAGAIN, which the pthread_create(3) man page describes as follows:

EAGAIN A system-imposed limit on the number of threads was
              encountered. There are a number of limits that may trigger
              this error: the RLIMIT_NPROC soft resource limit (set via
              setrlimit(2)), which limits the number of processes and
              threads for a real user ID, was reached; the kernel's system-
              wide limit on the number of processes and threads,
              /proc/sys/kernel/threads-max, was reached (see proc(5)); or
              the maximum number of PIDs, /proc/sys/kernel/pid_max, was
              reached (see proc(5)).

Which means one of the system limits was reached.
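
When chasing an EAGAIN from pthread_create, it helps to see which limits are actually in effect. The sketch below prints the soft and hard values of RLIMIT_NPROC (named in the man page excerpt) together with RLIMIT_MEMLOCK and RLIMIT_STACK, which come up later in this report. Note the assumption: it reports the limits of the Python process itself, so it would need to run in the same container and service context as ovs-vswitchd to be representative; for a live daemon, /proc/<pid>/limits shows the same information.

```python
import resource


def fmt(value: int) -> str:
    """Render an rlimit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits() -> dict:
    """Return {name: (soft, hard)} for limits relevant to thread creation."""
    names = ("RLIMIT_NPROC", "RLIMIT_MEMLOCK", "RLIMIT_STACK")
    return {n: resource.getrlimit(getattr(resource, n)) for n in names}


for name, (soft, hard) in current_limits().items():
    print(f"{name}: soft={fmt(soft)} hard={fmt(hard)}")
```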

Billy Olsen (billy-olsen) wrote :

Seems we're running into the memlock limit. A test with the MLOCKALL option disabled for OVS shows that the ovs-vswitchd service no longer crashes.

As Dmitrii pointed out to me, this might be related to the issue and commentary in https://patchwork<email address hidden>/.

However, the LimitSTACK=2M change in the systemd service file patch proposed was not low enough for this environment. I had to set it to 1M to actually get the service to start.

Another option might be to increase the memlock available to the containers, but I'm not sure if that's something that can easily be set in the juju lxd profiles.

Frode Nordahl (fnordahl) wrote :

The LXD documentation has a page about production setup of a server hosting containers [0], it suggests that `memlock` should be set to unlimited on the host.

Perhaps this is something juju should do for all hosts it provisions LXD containers on?

0: https://linuxcontainers.org/lxd/docs/master/production-setup

Dmitrii Shcherbakov (dmitriis) wrote :

Expanding on #8, in my testing on a different environment (Bionic host, Focal container) I found that vswitchd fails when a pthread gets created and tries to mmap some memory for its stack:

13077 20:22:25.392054 mmap(NULL, 8392704, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = -1 EAGAIN (Resource temporarily unavailable)
13077 20:22:25.392096 write(2, "ovs-vswitchd: ", 14) = 14
13077 20:22:25.392140 write(2, "pthread_create failed", 21) = 21
13077 20:22:25.392184 write(2, " (Resource temporarily unavailable)", 35) = 35
13077 20:22:25.392223 write(2, "\n", 1) = 1

The reason for that deserves a detailed description:

--------------------------
1. Stack size:

https://git.launchpad.net/ubuntu/+source/openvswitch/tree/lib/ovs-thread.c?h=applied/ubuntu/focal-updates#n438 (pthread_create in the OVS code)
https://git.launchpad.net/ubuntu/+source/glibc/tree/nptl/allocatestack.c?h=ubuntu/focal-updates&id=e639063e5d4ba5c296c990924eb4f290bc1d06ae#n562 (the glibc code doing the mmap, see the comment about PROT_NONE starting with line 559)

The prlimit for STACK memory is 8388608 and the mmap region includes a guard page (8388608 + 4096 = 8392704) so the size passed to mmap is correct (plus PROT_NONE is used). So this is not because of the stack memory.

STACK max stack size 8388608 unlimited bytes
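
The size check above can be written out as a small calculation. This is only a sketch of the arithmetic, assuming 4 KiB pages and a single guard page as described; the function name is illustrative:

```python
STACK_RLIMIT = 8388608   # soft RLIMIT_STACK reported by prlimit (8 MiB)
GUARD_SIZE = 4096        # one guard page, assuming 4 KiB pages


def thread_mmap_length(stack_size: int, guard_size: int) -> int:
    """Length of the anonymous mapping requested for one thread stack."""
    return stack_size + guard_size


length = thread_mmap_length(STACK_RLIMIT, GUARD_SIZE)
print(length)  # 8392704, the size seen in the strace mmap() call
```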

--------------------------
2. ovs-vswitchd applies memory locking to all memory allocations by default when started via ovs-ctl:

https://git.launchpad.net/ubuntu/+source/openvswitch/tree/vswitchd/ovs-vswitchd.c?h=applied/ubuntu/focal-updates#n93
    if (want_mlockall) {
#ifdef HAVE_MLOCKALL
        if (mlockall(MCL_CURRENT | MCL_FUTURE)) {
            VLOG_ERR("mlockall failed: %s", ovs_strerror(errno));

https://git.launchpad.net/ubuntu/+source/openvswitch/tree/utilities/ovs-ctl.in?h=applied/ubuntu/focal-updates#n321
    MLOCKALL=yes

https://git.launchpad.net/ubuntu/+source/openvswitch/tree/utilities/ovs-ctl.in?h=applied/ubuntu/focal-updates#n210
        if test X"$MLOCKALL" != Xno; then
            set "$@" --mlockall
        fi

--------------------------
3. EAGAIN returned by mmap and memory locking

mmap returns EAGAIN when it cannot lock memory and memory cannot be locked if the process goes beyond the RLIMIT_MEMLOCK (unless it has CAP_IPC_LOCK in the initial user namespace or has uid 0 in it)

https://elixir.bootlin.com/linux/v4.15.18/source/mm/mmap.c#L1385 (do_mmap)
 if (mlock_future_check(mm, vm_flags, len))
  return -EAGAIN;
https://elixir.bootlin.com/linux/v4.15.18/source/mm/mmap.c#L1300 (mlock_future_check)
  if (locked > lock_limit && !capable(CAP_IPC_LOCK))
   return -EAGAIN;
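
To make the effect of this check concrete, here is a rough Python model of it. It is a sketch under stated assumptions, not the kernel code: it assumes 4 KiB pages, assumes the process has already called mlockall(MCL_FUTURE) so that every new mapping is subject to the check, and uses a 64 MiB RLIMIT_MEMLOCK (the value discussed later in this report) purely as an example. The function `stacks_that_fit` is a hypothetical helper.

```python
PAGE_SHIFT = 12  # 4 KiB pages (assumption; true on x86-64)


def mlock_future_check(locked_vm_pages: int, length: int,
                       memlock_limit: int, cap_ipc_lock: bool = False) -> bool:
    """Model of the kernel check: True if the mapping is allowed.

    locked_vm_pages: pages already locked by the process (mm->locked_vm)
    length:          size in bytes of the new mapping
    memlock_limit:   RLIMIT_MEMLOCK in bytes
    """
    locked = (length >> PAGE_SHIFT) + locked_vm_pages
    lock_limit = memlock_limit >> PAGE_SHIFT
    return not (locked > lock_limit and not cap_ipc_lock)


def stacks_that_fit(memlock_limit: int, stack_len: int = 8392704,
                    already_locked: int = 0) -> int:
    """Count how many thread stacks can be mapped before EAGAIN."""
    locked_pages = already_locked >> PAGE_SHIFT
    count = 0
    while mlock_future_check(locked_pages, stack_len, memlock_limit):
        locked_pages += stack_len >> PAGE_SHIFT
        count += 1
    return count


print(stacks_that_fit(64 * 1024 * 1024))  # 7
```

Even with nothing else locked, only a handful of 8 MiB thread stacks fit under such a limit, which is consistent with thread creation failing early on a host with many cores.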

The mlock manpage documents that the use of mlockall(MCL_FUTURE) may lead to future mmap failures if the RLIMIT_MEMLOCK is hit, however, the root user (uid 0) in the initial user namespace will not be affected since it has CAP_IPC_LOCK and the limit will be ignored for it:

https://man7.org/linux/man-pages/man2/mlock.2.html
"In Linux 2.6.8 and earlier, a process must be privileged (CAP_IPC_LOCK) in order to lock memory and the RLIMIT_MEMLOCK soft resource limit defines a limit on how much memory the process may lock.
"Since kernel...


Dmitrii Shcherbakov (dmitriis) wrote :

4. The number of OVS threads and memory usage

On my test host there are 48 HT cores seen by the system and the LXD container:

lscpu | grep On-line
On-line CPU(s) list: 0-47

I changed the ovs-vswitchd start command to include `--no-mlockall` and it started successfully:

ovs-ctl --no-mlockall --no-ovsdb-server --no-monitor start

Which allowed me to look at the actual number of threads it creates (50):

https://paste.ubuntu.com/p/GbF4DPxz9q/
root@right-imp:~# pstree -p 738 | wc -l
50

root@right-imp:~# pstree -p `pgrep -f ovs-vswitchd`
ovs-vswitchd(738)─┬─{ovs-vswitchd}(740)
                  ├─{ovs-vswitchd}(741)
                  ├─{ovs-vswitchd}(742)
# ...
                  ├─{ovs-vswitchd}(785)
                  ├─{ovs-vswitchd}(786)
                  ├─{ovs-vswitchd}(787)
                  ├─{ovs-vswitchd}(788)
                  └─{ovs-vswitchd}(958)

`top` shows that the virtual address space for ovs-vswitchd is ~3.5 GiB while the resident memory is around 20 MiB:

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
  738 root 10 -10 3615.0m 20.3m 4.8m S 2.3 0.0 0:22.52 ovs-vswitchd

When I reduce the amount of cores allocated to the container

# lxc config set right-imp limits.cpu 8
root@right-imp:~# lscpu | grep On-line
On-line CPU(s) list: 1,6,10,12,20,24,38,44

pstree -p `pgrep -f ovs-vswitchd` | wc -l
10

I can see that the resident memory drops for ovs-vswitchd:

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
  696 root 10 -10 734.8m 8.3m 4.9m S 0.3 0.0 0:00.37 ovs-vswitchd

rss_diff = 20.3 - 8.3 = 12 MiB

Given that I reduced the number of cores exposed to the container (and vswitchd) by 40 (from 48 to 8), the following can be used to estimate the RSS memory added per core:

rough_rss_per_core = rss_diff / core_diff = 12 MiB / (48 - 8) = 12 MiB / 40 = 0.3 MiB

So a rough calculation shows that each core adds roughly 0.3 MiB to RSS when OVS is idling.

--------------------------------

Based on that, I can summarize that the issue may appear:

1) Depending on the amount of cores available to the system;
2) Depending on which systemd version is used: the one with the 64 MiB RLIMIT_MEMLOCK used by default will be less prone to that until more memory is put into the RSS by vswitchd (new allocations or more cores).

Changed in charm-ovn-chassis:
status: Incomplete → Confirmed
Dmitrii Shcherbakov (dmitriis) wrote :

5. in #10 I referred to an ability of a process with CAP_IPC_LOCK to bypass the RLIMIT_MEMLOCK:

https://elixir.bootlin.com/linux/v4.15.18/source/mm/mmap.c#L1300 (mlock_future_check)
  if (locked > lock_limit && !capable(CAP_IPC_LOCK))
   return -EAGAIN;

Which raises the question whether this capability needs to be effective (see man 7 capabilities) in the user namespace of the unprivileged container or in the initial user namespace.

Based on what I see, CAP_IPC_LOCK is not dropped for unprivileged containers (also based on a comment from Stephane here https://discuss.linuxcontainers.org/t/how-to-add-cap-ipc-lock-capabilities-to-container/484/2):

$ ps 17228
  PID TTY STAT TIME COMMAND
17228 ? Ss 0:00 /sbin/init

$ grep Cap /proc/17228/status
CapInh: 0000000000000000
CapPrm: 0000003fffffffff
CapEff: 0000003fffffffff
CapBnd: 0000003fffffffff
CapAmb: 0000000000000000

$ capsh --decode=0000003fffffffff
0x0000003fffffffff=cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read
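
The same decoding can be done without capsh by testing a single bit of the mask. A minimal sketch, using the /proc status lines shown above as sample input; CAP_IPC_LOCK is capability number 14 in <linux/capability.h>, and the helper names are illustrative. Note that this only shows the bit is set in the process's own capability sets; it says nothing about which user namespace the kernel checks it against.

```python
CAP_IPC_LOCK = 14  # bit number from <linux/capability.h>


def parse_cap_eff(status_text: str) -> int:
    """Extract the CapEff bitmask from /proc/<pid>/status contents."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split(":", 1)[1].strip(), 16)
    raise ValueError("no CapEff line found")


def has_cap(mask: int, cap: int) -> bool:
    """True if capability number `cap` is set in the bitmask."""
    return bool((mask >> cap) & 1)


sample = """CapInh: 0000000000000000
CapPrm: 0000003fffffffff
CapEff: 0000003fffffffff
CapBnd: 0000003fffffffff
CapAmb: 0000000000000000"""
print(has_cap(parse_cap_eff(sample), CAP_IPC_LOCK))  # True
```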

The "capable" function in the kernel checks the presence of a capability for the **initial** user namespace (the function comments seem to refer to that as having a "superior capability")

https://elixir.bootlin.com/linux/v4.15.18/source/kernel/capability.c#L429 (capable)
https://elixir.bootlin.com/linux/v4.15.18/source/kernel/user.c#L26
struct user_namespace init_user_ns = {

As opposed to the ns_capable function, for example:
https://elixir.bootlin.com/linux/v4.15.18/source/kernel/capability.c#L395 (ns_capable)

Therefore, we will not be able to use CAP_IPC_LOCK for users in unprivileged LXD containers to bypass RLIMIT_MEMLOCK.

Frode Nordahl (fnordahl) wrote :

The Open vSwitch documentation for the --mlockall option, which ovs-ctl uses by default, suggests that ovs-vswitchd should try to lock the memory and gracefully log a message if unsuccessful. If that is indeed the intent, I think we should treat this as an Open vSwitch bug and seek a fix for it there.

That way we can postpone working out a generic way of giving unprivileged containers the memlock headroom they need until we have a clear set of use cases for it.

       --mlockall
              Causes ovs-vswitchd to call the mlockall() function, to attempt
              to lock all of its process memory into physical RAM, preventing
              the kernel from paging any of its memory to disk. This helps to
              avoid networking interruptions due to system memory pressure.

              Some systems do not support mlockall() at all, and other systems
              only allow privileged users, such as the superuser, to use it.
              ovs-vswitchd emits a log message if mlockall() is unavailable or
              unsuccessful.

Dmitrii Shcherbakov (dmitriis) wrote :

Frode,

OVS has the following checks:

1) a compile-time check for HAVE_MLOCKALL;
2) An attempt to make all future memory allocations locked via `mlockall(MCL_CURRENT | MCL_FUTURE)` which fails gracefully and logs a message;
https://github.com/openvswitch/ovs/blob/v2.13.1/vswitchd/ovs-vswitchd.c#L93-L103

As such, it gracefully enables locking *of all future memory allocations* (MCL_FUTURE) but the allocations themselves are not handled gracefully: mmap(2), sbrk(2), malloc(3) may fail when RLIMIT_MEMLOCK gets hit. In other words, memory locking is enabled once and then none of the memory allocations have to explicitly request locking, which makes it hard to fail gracefully.

As a tactical fix, we can modify the LXD profiles shipped with our charms to set RLIMIT_MEMLOCK to "unlimited".

The rationale is described in the commit message:
https://review.opendev.org/c/x/charm-ovn-chassis/+/765492/1//COMMIT_MSG

Reviews:
https://review.opendev.org/c/x/charm-ovn-chassis/+/765492
https://review.opendev.org/c/openstack/charm-neutron-openvswitch/+/765493

Changed in charm-ovn-chassis:
status: Confirmed → Triaged
Changed in charm-neutron-openvswitch:
status: New → Triaged
status: Triaged → In Progress
Changed in charm-ovn-chassis:
status: Triaged → In Progress
Changed in charm-neutron-openvswitch:
importance: Undecided → Critical
Changed in charm-ovn-chassis:
importance: Undecided → Critical
Changed in charm-neutron-openvswitch:
assignee: nobody → Dmitrii Shcherbakov (dmitriis)
Changed in charm-ovn-chassis:
milestone: none → 21.01
Changed in charm-neutron-openvswitch:
milestone: none → 21.01
Billy Olsen (billy-olsen) wrote :

I'm beginning to suspect this is fallout from https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1830746. That bug has several other reports which are very similar, namely that allocations made under mlockall are crashing.

Billy Olsen (billy-olsen) wrote :

I had a conversation with Dan Streetman regarding this. The current working theory is that this worked before because the memlock rlimits inside a container were so low that the mlockall(MCL_CURRENT | MCL_FUTURE) call failed outright; ovs-vswitchd then fell back to running without locked memory and worked just fine.

With the systemd patch from above, the rlimit was increased to 64MB. This increase actually allows the initial mlockall attempt to succeed, but then because enough memory is not available the future spawning of threads fails.
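
The two regimes in this theory can be illustrated with a toy model. This is a sketch under loud assumptions, not a measurement: the 64 KiB "old" limit, the 20 MiB mapped at start-up and the thread count of 50 are illustrative placeholders, the stack size is the 8392704-byte mapping seen in the strace output, and `start_vswitchd` is a hypothetical function rather than anything in OVS.

```python
MiB = 1024 * 1024


def start_vswitchd(memlock_limit: int, n_threads: int,
                   mapped_at_start: int = 20 * MiB,
                   stack_bytes: int = 8392704) -> str:
    """Toy model of daemon start-up under a given RLIMIT_MEMLOCK.

    mapped_at_start and n_threads are illustrative values, not measurements.
    """
    # mlockall(MCL_CURRENT | MCL_FUTURE): fails up front if the memory
    # already mapped exceeds the limit; the daemon logs it and carries on.
    if mapped_at_start > memlock_limit:
        return "mlockall failed; running unlocked, all threads started"

    # mlockall succeeded, so every later mapping counts against the limit.
    locked = mapped_at_start
    for i in range(n_threads):
        if locked + stack_bytes > memlock_limit:
            return f"abort: pthread_create failed (EAGAIN) on thread {i}"
        locked += stack_bytes
    return "all threads started with memory locked"


print(start_vswitchd(64 * 1024, n_threads=50))   # low limit (assumed 64 KiB)
print(start_vswitchd(64 * MiB, n_threads=50))    # limit after the systemd change
```

In this model the low limit is the "safe" case only because locking never takes effect, while the higher limit lets locking succeed and then runs out of budget after a few thread stacks.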

Unfortunately, with the current openvswitch-switch package there's no way to pass the --no-mlockall flag, because it must be specified before the start/stop/restart command for the ovs-ctl script.

I suspect this will hit users that have already deployed openvswitch inside a container and have now upgraded the bionic version of systemd to 237-3ubuntu10.43. For those users, you can set the --no-mlockall flag in the ovs-vswitchd.service file in order to allow it to start.

Syed Mohammad Adnan Karim (karimsye) wrote : Re: Charm stuck waiting for ovsdb 'no key "ovn-remote" in Open_vSwitch record'

Thank you so much Billy, I really appreciate it!

I implemented your suggestion by updating the ovs-vswitchd.service file
with the *--no-mlockall* flag for the start/stop/restart of the service.
After a systemctl daemon-reload and starting the ovs-vswitchd service
again, I was able to get things working as expected.

I can confirm that the containers that were having the problem in the 9
node cloud did indeed have *systemd/bionic-updates,now 237-3ubuntu10.43
amd64 [installed,automatic]* without the *--no-mlockall* flag.
However, the strange thing is that one of their larger clouds (39 node
cloud) that was recently rebuilt (on December 9th or 10th) also has
*systemd/bionic-updates,now 237-3ubuntu10.43 amd64 [installed,automatic]*
without the *--no-mlockall* flag, but it worked from the beginning without
it, so I am not sure why it worked in one cloud and not the other (the
hardware, specs, and charm config are the same in both clouds). The other
2 large clouds (39 nodes) have *systemd/now 237-3ubuntu10.39 amd64
[installed,upgradable to: 237-3ubuntu10.43]*, so it makes sense that they
never hit this, as they were built 6-12 months ago.

On Monday, I will try restarting the ovs-vswitchd service and also the
containers to see if that reveals anything on the 39 node cloud that worked
from the start.

Thanks again!

Changed in charm-neutron-openvswitch:
assignee: Dmitrii Shcherbakov (dmitriis) → Corey Bryant (corey.bryant)
Changed in charm-ovn-chassis:
assignee: Dmitrii Shcherbakov (dmitriis) → Corey Bryant (corey.bryant)
Corey Bryant (corey.bryant) wrote :

On the package level, it appears the ovs-vswitchd and ovsdb-server systemd unit files are sourcing the wrong environment file and using the wrong environment variable.

I think it should look like this:

[Service]
LimitNOFILE=1048576
Type=forking
Restart=on-failure
Environment=HOME=/var/run/openvswitch
EnvironmentFile=-/etc/default/openvswitch-switch
ExecStart=/usr/share/openvswitch/scripts/ovs-ctl \
          --no-ovsdb-server --no-monitor --system-id=random \
          start $OVS_CTL_OPTS
ExecStop=/usr/share/openvswitch/scripts/ovs-ctl --no-ovsdb-server stop
ExecReload=/usr/share/openvswitch/scripts/ovs-ctl --no-ovsdb-server \
          --no-monitor --system-id=random \
          restart $OVS_CTL_OPTS
TimeoutSec=300

That'll allow setting OVS_CTL_OPTS=--no-mlockall in the existing /etc/default/openvswitch-switch file.
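
For anyone scripting that change, the edit to the environment file could look roughly like the following. This is a minimal sketch that works on the file contents as a string; `set_ovs_ctl_opts` is a hypothetical helper, and writing the result back to /etc/default/openvswitch-switch and restarting the service is left to the caller.

```python
import re


def set_ovs_ctl_opts(env_text: str, opts: str = "--no-mlockall") -> str:
    """Return env-file text with OVS_CTL_OPTS set to `opts`.

    Replaces the first active (uncommented) assignment, or appends one.
    """
    line = f"OVS_CTL_OPTS='{opts}'"
    pattern = re.compile(r"^OVS_CTL_OPTS=.*$", re.MULTILINE)
    if pattern.search(env_text):
        # A callable replacement avoids backslash processing of `line`.
        return pattern.sub(lambda _m: line, env_text, count=1)
    sep = "" if env_text.endswith("\n") or not env_text else "\n"
    return f"{env_text}{sep}{line}\n"


print(set_ovs_ctl_opts("# OVS_CTL_OPTS=\n"), end="")
```

Commented-out lines are deliberately left alone so that any explanatory comments shipped in the file survive the edit.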

no longer affects: Ubuntu Hirsute
no longer affects: ubuntu
Changed in openvswitch (Ubuntu Hirsute):
importance: Undecided → Critical
status: New → Triaged
Changed in openvswitch (Ubuntu Focal):
importance: Undecided → Critical
status: New → Triaged
Changed in openvswitch (Ubuntu Bionic):
importance: Undecided → Critical
status: New → Triaged
Changed in openvswitch (Ubuntu Xenial):
importance: Undecided → Critical
status: New → Triaged
Changed in cloud-archive:
importance: Undecided → Critical
status: New → Triaged
status: Triaged → Invalid
importance: Critical → Undecided
Changed in openvswitch (Ubuntu Xenial):
assignee: nobody → Corey Bryant (corey.bryant)
Changed in openvswitch (Ubuntu Bionic):
assignee: nobody → Corey Bryant (corey.bryant)
Changed in openvswitch (Ubuntu Focal):
assignee: nobody → Corey Bryant (corey.bryant)
Changed in openvswitch (Ubuntu Hirsute):
assignee: nobody → Corey Bryant (corey.bryant)
Changed in openvswitch (Ubuntu Groovy):
assignee: nobody → Corey Bryant (corey.bryant)
importance: Undecided → Critical
status: New → Triaged
no longer affects: cloud-archive/mitaka
no longer affects: openvswitch (Ubuntu Xenial)
Corey Bryant (corey.bryant) wrote :

I have package updates prepped for this locally for all affected releases. I just want to get a +1 from James in the morning tomorrow before uploading.

summary: - Charm stuck waiting for ovsdb 'no key "ovn-remote" in Open_vSwitch
- record'
+ [SRU] Add support for disabling memlockall() calls in ovs-vswitchd
description: updated
summary: - [SRU] Add support for disabling memlockall() calls in ovs-vswitchd
+ [SRU] Add support for disabling mlockall() calls in ovs-vswitchd
description: updated
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package openvswitch - 2.14.0-0ubuntu2

---------------
openvswitch (2.14.0-0ubuntu2) hirsute; urgency=medium

  * d/openvswitch-switch.ovs*.service: Update ovs-vswitchd and ovsdb-server
    systemd unit files to use the correct environment file and environment
    variable for ovs-ctl options, /etc/default/openvswitch-switch and
    OVS_CTL_OPTS, respectively (LP: #1906280).

 -- Corey Bryant <email address hidden> Mon, 14 Dec 2020 12:34:26 -0500

Changed in openvswitch (Ubuntu Hirsute):
status: Triaged → Fix Released
Corey Bryant (corey.bryant) wrote :

The charm change for this bug is causing a functional test failure. I'm not sure if it exposes or introduces a bug, so I've opened a bug to document the issue: https://bugs.launchpad.net/charm-ovn-chassis/+bug/1908615

Revision history for this message
Brian Murray (brian-murray) wrote : Please test proposed package

Hello Michael, or anyone else affected,

Accepted openvswitch into focal-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/openvswitch/2.13.1-0ubuntu0.20.04.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-focal to verification-done-focal. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-focal. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in openvswitch (Ubuntu Focal):
status: Triaged → Fix Committed
tags: added: verification-needed verification-needed-focal
Changed in openvswitch (Ubuntu Groovy):
status: Triaged → Fix Committed
tags: added: verification-needed-groovy
Revision history for this message
Brian Murray (brian-murray) wrote :

Hello Michael, or anyone else affected,

Accepted openvswitch into groovy-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/openvswitch/2.13.1-0ubuntu1.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-groovy to verification-done-groovy. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-groovy. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

Hello Michael, or anyone else affected,

Accepted openvswitch into ussuri-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:ussuri-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-ussuri-needed to verification-ussuri-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-ussuri-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-ussuri-needed
Revision history for this message
Corey Bryant (corey.bryant) wrote :

Hello Michael, or anyone else affected,

Accepted openvswitch into train-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:train-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-train-needed to verification-train-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-train-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-train-needed
Revision history for this message
Corey Bryant (corey.bryant) wrote :

Hello Michael, or anyone else affected,

Accepted openvswitch into stein-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:stein-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-stein-needed to verification-stein-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-stein-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-stein-needed
Revision history for this message
Łukasz Zemczak (sil2100) wrote :

Hello Michael, or anyone else affected,

Accepted openvswitch into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/openvswitch/2.9.7-0ubuntu0.18.04.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in openvswitch (Ubuntu Bionic):
status: Triaged → Fix Committed
tags: added: verification-needed-bionic
Revision history for this message
Ubuntu SRU Bot (ubuntu-sru-bot) wrote : Autopkgtest regression report (openvswitch/2.9.7-0ubuntu0.18.04.1)

All autopkgtests for the newly accepted openvswitch (2.9.7-0ubuntu0.18.04.1) for bionic have finished running.
The following regressions have been reported in tests triggered by the package:

mininet/2.2.2-2ubuntu1 (i386)

Please visit the excuses page listed below and investigate the failures, proceeding afterwards as per the StableReleaseUpdates policy regarding autopkgtest regressions [1].

https://people.canonical.com/~ubuntu-archive/proposed-migration/bionic/update_excuses.html#openvswitch

[1] https://wiki.ubuntu.com/StableReleaseUpdates#Autopkgtest_Regressions

Thank you!

Revision history for this message
Corey Bryant (corey.bryant) wrote : Please test proposed package

Hello Michael, or anyone else affected,

Accepted openvswitch into queens-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:queens-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-queens-needed to verification-queens-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-queens-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-queens-needed
Revision history for this message
Corey Bryant (corey.bryant) wrote :

This SRU has been verified on all release combinations. Please see attached document.

tags: added: verification-done verification-done-bionic verification-done-focal verification-done-groovy verification-queens-done verification-stein-done verification-train-done verification-ussuri-done
removed: verification-needed verification-needed-bionic verification-needed-focal verification-needed-groovy verification-queens-needed verification-stein-needed verification-train-needed verification-ussuri-needed
Revision history for this message
Corey Bryant (corey.bryant) wrote :

Charm changes have been merged into master.

Changed in charm-neutron-openvswitch:
status: In Progress → Fix Committed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package openvswitch - 2.9.7-0ubuntu0.18.04.2

---------------
openvswitch (2.9.7-0ubuntu0.18.04.2) bionic-security; urgency=medium

  * SECURITY UPDATE: buffer overflow decoding malformed packets in lldp
    - debian/patches/CVE-2015-8011.patch: check lengths in lib/lldp/lldp.c.
    - CVE-2015-8011
  * SECURITY UPDATE: Externally triggered memory leak in lldp
    - debian/patches/CVE-2020-27827.patch: properly free memory in
      lib/lldp/lldp.c.
    - CVE-2020-27827

 -- Marc Deslauriers <email address hidden> Fri, 08 Jan 2021 07:30:25 -0500

Changed in openvswitch (Ubuntu Bionic):
status: Fix Committed → Fix Released
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package openvswitch - 2.13.1-0ubuntu0.20.04.3

---------------
openvswitch (2.13.1-0ubuntu0.20.04.3) focal-security; urgency=medium

  * SECURITY UPDATE: buffer overflow decoding malformed packets in lldp
    - debian/patches/CVE-2015-8011.patch: check lengths in lib/lldp/lldp.c.
    - CVE-2015-8011
  * SECURITY UPDATE: Externally triggered memory leak in lldp
    - debian/patches/CVE-2020-27827.patch: properly free memory in
      lib/lldp/lldp.c.
    - CVE-2020-27827

 -- Marc Deslauriers <email address hidden> Fri, 08 Jan 2021 07:29:51 -0500

Changed in openvswitch (Ubuntu Focal):
status: Fix Committed → Fix Released
Revision history for this message
Michael Skalka (mskalka) wrote :

Dropping the crit flag as this is nearing fix-released across all releases.

Revision history for this message
Corey Bryant (corey.bryant) wrote : Update Released

The verification of the Stable Release Update for openvswitch has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

This bug was fixed in the package openvswitch - 2.13.1-0ubuntu0.20.04.2~cloud0
---------------

 openvswitch (2.13.1-0ubuntu0.20.04.2~cloud0) bionic-ussuri; urgency=medium
 .
   * New update for the Ubuntu Cloud Archive.
 .
 openvswitch (2.13.1-0ubuntu0.20.04.2) focal; urgency=medium
 .
   * d/openvswitch-switch.ovs*.service: Update ovs-vswitchd and ovsdb-server
     systemd unit files to use the correct environment file and environment
     variable for ovs-ctl options, /etc/default/openvswitch-switch and
     OVS_CTL_OPTS, respectively (LP: #1906280).

Revision history for this message
Corey Bryant (corey.bryant) wrote :

The verification of the Stable Release Update for openvswitch has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

This bug was fixed in the package openvswitch - 2.12.1-0ubuntu0.19.10.1~cloud1
---------------

 openvswitch (2.12.1-0ubuntu0.19.10.1~cloud1) bionic; urgency=medium
 .
   * d/openvswitch-switch.ovs*.service: Update ovs-vswitchd and ovsdb-server
     systemd unit files to use the correct environment file and environment
     variable for ovs-ctl options, /etc/default/openvswitch-switch and
     OVS_CTL_OPTS, respectively (LP: #1906280).

Revision history for this message
Corey Bryant (corey.bryant) wrote :

The verification of the Stable Release Update for openvswitch has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

This bug was fixed in the package openvswitch - 2.11.4-0ubuntu0.19.04.1~cloud1
---------------

 openvswitch (2.11.4-0ubuntu0.19.04.1~cloud1) bionic; urgency=medium
 .
   * d/openvswitch-switch.ovs*.service: Update ovs-vswitchd and ovsdb-server
     systemd unit files to use the correct environment file and environment
     variable for ovs-ctl options, /etc/default/openvswitch-switch and
     OVS_CTL_OPTS, respectively (LP: #1906280).

Revision history for this message
Corey Bryant (corey.bryant) wrote :

The verification of the Stable Release Update for openvswitch has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

This bug was fixed in the package openvswitch - 2.9.7-0ubuntu0.18.04.1~cloud0
---------------

 openvswitch (2.9.7-0ubuntu0.18.04.1~cloud0) xenial-queens; urgency=medium
 .
   * New update for the Ubuntu Cloud Archive.
 .
 openvswitch (2.9.7-0ubuntu0.18.04.1) bionic; urgency=medium
 .
   [ James Page ]
   * d/rules: Skip execution of dh_systemd_start for openvswitch-switch
     subunits avoiding multiple restarts during package configuration
     (LP: #1823295).
   * d/watch: Update for changes in upstream website.
 .
   [ Corey Bryant ]
   * d/openvswitch-switch.ovs*.service: Update ovs-vswitchd and ovsdb-server
     systemd unit files to use the correct environment file and environment
     variable for ovs-ctl options, /etc/default/openvswitch-switch and
     OVS_CTL_OPTS, respectively (LP: #1906280).
   * New upstream point release (LP: #1881077).

Revision history for this message
Corey Bryant (corey.bryant) wrote :

Groovy is released

Changed in openvswitch (Ubuntu Groovy):
status: Fix Committed → Fix Released
Revision history for this message
Frode Nordahl (fnordahl) wrote :

Re-opening this for charm-neutron-openvswitch as I still see the issue. I believe it is due to an ordering issue in the charm: it attempts run-time configuration of Open vSwitch before writing `/etc/default/openvswitch-switch` to disk and restarting `ovs-vswitchd`.

Proposal up here: https://review.opendev.org/c/openstack/charm-neutron-openvswitch/+/771511
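
The ordering described in this comment can be sketched in Python as follows. This is not the charm's code: `configure_unit` and its three callables are hypothetical stand-ins, used only to show that the environment file is rendered and the daemon restarted before any run-time configuration happens.

```python
from typing import Callable, List


def configure_unit(
    render_defaults: Callable[[], None],
    restart_vswitchd: Callable[[], None],
    configure_bridges: Callable[[], None],
) -> None:
    """Run the three steps in the order the comment calls for."""
    render_defaults()    # write /etc/default/openvswitch-switch first
    restart_vswitchd()   # so ovs-vswitchd starts with the new options
    configure_bridges()  # only then do run-time Open vSwitch configuration


# Record the order in which the (hypothetical) steps are invoked.
calls: List[str] = []
configure_unit(
    lambda: calls.append("render"),
    lambda: calls.append("restart"),
    lambda: calls.append("configure"),
)
print(calls)
```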

Changed in charm-neutron-openvswitch:
assignee: Corey Bryant (corey.bryant) → Frode Nordahl (fnordahl)
status: Fix Committed → In Progress
Changed in charm-neutron-openvswitch:
status: In Progress → Fix Committed
Revision history for this message
Aurelien Lourot (aurelien-lourot) wrote :

Landed. This fix will make its way into charm-ovn-chassis when this review lands:
https://review.opendev.org/c/x/charm-ovn-chassis/+/771874

Changed in charm-ovn-chassis:
status: In Progress → Fix Committed
Revision history for this message
Michael Skalka (mskalka) wrote :

We are still seeing this issue using the -next version of the ovn-chassis charm, as seen during this test run for the charm release: https://solutions.qa.canonical.com/testruns/testRun/23d8528d-2931-4be6-a0d1-bad21e3d75a5

Artifacts can be found here: https://oil-jenkins.canonical.com/artifacts/23d8528d-2931-4be6-a0d1-bad21e3d75a5/index.html

And specifically the openstack crashdump here: https://oil-jenkins.canonical.com/artifacts/23d8528d-2931-4be6-a0d1-bad21e3d75a5/generated/generated/openstack/juju-crashdump-openstack-2021-01-27-18.32.08.tar.gz

Symptoms are the same, ovn-chassis units stay blocked:

ubuntu@production-cpe-23d8528d-2931-4be6-a0d1-bad21e3d75a5:~$ juju status octavia-ovn-chassis
Model      Controller        Cloud/Region        Version  SLA          Timestamp
openstack  foundations-maas  maas_cloud/default  2.8.7    unsupported  18:29:58Z

App                    Version  Status   Scale  Charm             Store       Rev  OS      Notes
hacluster-octavia               active       0  hacluster         jujucharms  161  ubuntu
logrotated                      active       0  logrotated        jujucharms    2  ubuntu
octavia                6.1.0    blocked      3  octavia           jujucharms   90  ubuntu
octavia-ovn-chassis    20.03.1  waiting      3  ovn-chassis       jujucharms   49  ubuntu
public-policy-routing           active       0  advanced-routing  jujucharms    3  ubuntu

Unit                        Workload  Agent      Machine  Public address  Ports     Message
octavia/0*                  blocked   idle       1/lxd/8  10.244.40.229   9876/tcp  Awaiting end-user execution of `configure-resources` action to create required resources
  hacluster-octavia/0*      active    idle                10.244.40.229             Unit is ready and clustered
  logrotated/62             active    idle                10.244.40.229             Unit is ready.
  octavia-ovn-chassis/0*    waiting   executing           10.244.40.229             'ovsdb' incomplete
  public-policy-routing/44  active    idle                10.244.40.229             Unit is ready
octavia/1                   blocked   idle       3/lxd/8  10.244.40.244   9876/tcp  Awaiting leader to create required resources
  hacluster-octavia/1       active    idle                10.244.40.244             Unit is ready and clustered
  logrotated/63             active    idle                10.244.40.244             Unit is ready.
  octavia-ovn-chassis/1     waiting   executing           10.244.40.244             'ovsdb' incomplete
  public-policy-routing/45  active    idle                10.244.40.244             Unit is ready
octavia/2                   blocked   idle       5/lxd/8  10.244.40.250   9876/tcp  Awaiting leader to create required resources
  hacluster-octavia/2       active    idle                10.244.40.250             Unit is ready and clustered
  logrotated/64             active    idle                10.244.40.250             Unit is ready.
  octavia-ovn-chassis/2     waiting   executing           10.244.40.250             'ovsdb' incomplete
  public-policy-routing/46  active    idle                10.244.40.250             Unit is ready

Machine Stat...

Michael Skalka (mskalka)
tags: added: cdo-release-blocker
Revision history for this message
Corey Bryant (corey.bryant) wrote :

Michael, thanks for reporting this. This shouldn't have been missed in the initial patch. I've confirmed that if the config option isn't set at deploy time, openvswitch-switch doesn't get restarted by the charm after rendering /etc/default/openvswitch-switch. The charm needs an update to restart the service in this scenario without doing it every time we run through the hook.
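
One way to get "restart when needed, but not on every hook" is to restart only when rendering actually changed the file on disk. The Python sketch below illustrates that idea under stated assumptions: `render_if_changed`, `render_and_maybe_restart` and `demo` are hypothetical names, the restart is passed in as a callable, and this is not the change that was eventually proposed for the charm.

```python
import os
import tempfile
from typing import Callable


def render_if_changed(path: str, new_content: str) -> bool:
    """Write new_content to path; return True only if the file changed."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            if f.read() == new_content:
                return False  # already up to date, nothing to do
    except FileNotFoundError:
        pass  # first render: the file does not exist yet
    with open(path, "w", encoding="utf-8") as f:
        f.write(new_content)
    return True


def render_and_maybe_restart(
    path: str, new_content: str, restart: Callable[[], None]
) -> bool:
    """Render the file and call restart() only when its content changed."""
    changed = render_if_changed(path, new_content)
    if changed:
        restart()
    return changed


def demo() -> int:
    """Render the same content twice; return how many restarts happened."""
    restarts = []
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "openvswitch-switch")
        for _ in range(2):
            render_and_maybe_restart(
                path,
                "OVS_CTL_OPTS='--no-mlockall'\n",
                lambda: restarts.append("ovs-vswitchd"),
            )
    return len(restarts)


print(demo())
```

In the demo the first render creates the file and triggers one restart; the second render finds identical content and leaves the service alone.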

Revision history for this message
Corey Bryant (corey.bryant) wrote :

I have a fix proposed for the issue Michael reported: https://github.com/openstack-charmers/charm-layer-ovn/pull/35.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

Rebuild is up to pick up the charm-layer-ovn PR: https://review.opendev.org/c/x/charm-ovn-chassis/+/772897

Revision history for this message
Drew Freiberger (afreiberger) wrote :

I'm also seeing this affecting neutron-gateway on focal with config-changed hook hanging at:

ovs-vsctl -- --may-exist add-br br-int -- set bridge br-int external-ids:charm-neutron-gateway=managed

This is during LMA charm testing which is performed on LXD provider at the moment.
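
When reproducing a hang like this, it can help to bound the call so it fails instead of blocking the hook. A hedged Python sketch follows: it relies on the `--timeout` option of `ovs-vsctl` and on the `timeout` argument of `subprocess.run`; the helper names and the 30 second value are assumptions for illustration and are not part of the charm.

```python
import subprocess
from typing import List


def add_bridge_cmd(bridge: str, timeout_s: int) -> List[str]:
    """Build the add-br command from the comment, with a client timeout."""
    return [
        "ovs-vsctl", f"--timeout={timeout_s}",
        "--", "--may-exist", "add-br", bridge,
        "--", "set", "bridge", bridge,
        "external-ids:charm-neutron-gateway=managed",
    ]


def run_with_deadline(cmd: List[str], deadline_s: float) -> bool:
    """Run cmd; return False if it is still running after deadline_s.

    A non-zero exit (including ovs-vsctl giving up on its own timeout)
    raises subprocess.CalledProcessError because check=True is set.
    """
    try:
        subprocess.run(cmd, check=True, timeout=deadline_s)
        return True
    except subprocess.TimeoutExpired:
        return False


cmd = add_bridge_cmd("br-int", 30)
print(" ".join(cmd))
# On a host with Open vSwitch installed one could then call:
#   run_with_deadline(cmd, 35)
```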

Revision history for this message
Corey Bryant (corey.bryant) wrote :

This change has been merged into the ovn-chassis charm: https://review.opendev.org/c/x/charm-ovn-chassis/+/772897.

David Ames (thedac)
Changed in charm-ovn-chassis:
status: Fix Committed → Fix Released
Changed in charm-neutron-openvswitch:
status: Fix Committed → Fix Released
Revision history for this message
Ryan Mickler (ryanmickler) wrote :

I just got this ERROR on the install hook of ovn-dedicated-chassis

2021-02-12 04:44:00 INFO juju-log Invoking reactive handler: hooks/relations/tls-certificates/requires.py:109:broken:certificates
2021-02-12 04:44:00 INFO juju-log Invoking reactive handler: reactive/ovn_chassis_charm_handlers.py:28:enable_chassis_reactive_code
2021-02-12 04:44:00 DEBUG juju-log tracer: set flag charms.openstack.do-default-charm.installed
2021-02-12 04:44:00 DEBUG juju-log tracer: set flag charms.openstack.do-default-config.changed
2021-02-12 04:44:00 DEBUG juju-log tracer: set flag charms.openstack.do-default-config-rendered
2021-02-12 04:44:00 DEBUG juju-log tracer: set flag charms.openstack.do-default-update-status
2021-02-12 04:44:00 DEBUG juju-log tracer: set flag charms.openstack.do-default-upgrade-charm
2021-02-12 04:44:00 DEBUG juju-log tracer: set flag charms.openstack.do-default-certificates.available
2021-02-12 04:44:00 INFO juju-log Invoking reactive handler: reactive/ovn_chassis_charm_handlers.py:58:disable_openstack
2021-02-12 04:44:00 INFO juju-log Invoking reactive handler: reactive/layer_openstack.py:14:default_install
2021-02-12 04:44:01 DEBUG install Hit:1 http://archive.ubuntu.com/ubuntu focal InRelease
2021-02-12 04:44:02 DEBUG install Hit:2 http://archive.ubuntu.com/ubuntu focal-updates InRelease
2021-02-12 04:44:02 DEBUG install Hit:3 http://archive.ubuntu.com/ubuntu focal-security InRelease
2021-02-12 04:44:02 DEBUG install Hit:4 http://archive.ubuntu.com/ubuntu focal-backports InRelease
2021-02-12 04:44:03 DEBUG install Reading package lists...
2021-02-12 04:44:04 DEBUG juju-log tracer: set flag ovn-dedicated-chassis-installed
2021-02-12 04:44:04 ERROR juju-log Hook error:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-ovn-dedicated-chassis-0/.venv/lib/python3.8/site-packages/charms/reactive/__init__.py", line 74, in main
    bus.dispatch(restricted=restricted_mode)
  File "/var/lib/juju/agents/unit-ovn-dedicated-chassis-0/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 390, in dispatch
    _invoke(other_handlers)
  File "/var/lib/juju/agents/unit-ovn-dedicated-chassis-0/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 359, in _invoke
    handler.invoke()
  File "/var/lib/juju/agents/unit-ovn-dedicated-chassis-0/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 181, in invoke
    self._action(*args)
  File "/var/lib/juju/agents/unit-ovn-dedicated-chassis-0/charm/reactive/layer_openstack.py", line 27, in default_install
    instance.install()
  File "lib/charm/openstack/ovn_dedicated_chassis.py", line 82, in install
    super().install()
  File "lib/charms/ovn_charm.py", line 287, in install
    self.options.mlockall_disabled):
AttributeError: 'OVNDedicatedChassisConfigurationAdapter' object has no attribute 'mlockall_disabled'

2021-02-12 04:44:04 DEBUG install Traceback (most recent call last):
2021-02-12 04:44:04 DEBUG install File "/var/lib/juju/agents/unit-ovn-dedicated-chassis-0/charm/hooks/install", line 22, in <module>
2021-02-12 04:44:04 DEBUG install main()
2021-02-12 04:44:04 DEBUG install File "/var/lib/juju/agents/unit-ovn-dedicated-chassis-0/.venv/...

Revision history for this message
Corey Bryant (corey.bryant) wrote :

Ryan, extremely sorry about this. For a work-around you can manually revert the code introduced at https://github.com/openstack-charmers/charm-layer-ovn/pull/37

For a quick fix, we could revert that change.

For a fix that prevents unnecessary restarts of ovs, I'm trying to figure out if this code needs mlockall_disabled too: https://opendev.org/x/charm-ovn-dedicated-chassis/src/branch/master/src/lib/charm/openstack/ovn_dedicated_chassis.py#L29

I'm a bit confused by the code, as it seems these should perhaps inherit from the same parent class: https://github.com/openstack-charmers/charm-layer-ovn/blob/master/lib/charms/ovn_charm.py#L34
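
To illustrate the shared-parent idea, here is a minimal Python sketch in which both adapters inherit `mlockall_disabled` from one base class, so the attribute lookup in the traceback above could not fail. All class names here and the `disable-mlockall` config key are assumptions made for the example; they are not the actual classes in charm-layer-ovn or charm-ovn-dedicated-chassis.

```python
class BaseOVNConfigurationAdapter:
    """Hypothetical shared parent holding options common to both charms."""

    def __init__(self, config: dict):
        self._config = config

    @property
    def mlockall_disabled(self) -> bool:
        # 'disable-mlockall' is an assumed config key name.
        return bool(self._config.get("disable-mlockall", False))


class ChassisAdapter(BaseOVNConfigurationAdapter):
    """Stand-in for the ovn-chassis adapter."""


class DedicatedChassisAdapter(BaseOVNConfigurationAdapter):
    """Stand-in for the ovn-dedicated-chassis adapter."""


# Both subclasses expose the attribute, with or without the option set.
print(DedicatedChassisAdapter({}).mlockall_disabled)
print(ChassisAdapter({"disable-mlockall": True}).mlockall_disabled)
```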

Frode Nordahl (fnordahl)
Changed in charm-ovn-dedicated-chassis:
importance: Undecided → Critical
status: New → In Progress
Revision history for this message
Corey Bryant (corey.bryant) wrote :

A partial fix for the issue reported in comment #54 has been merged:
https://github.com/openstack-charmers/charm-layer-ovn/pull/39

The other part of this change is waiting on tests at:
https://review.opendev.org/c/x/charm-ovn-dedicated-chassis/+/775405

There's also an ovn-chassis change here to ensure testing passes. This doesn't necessarily need to get merged at this time:
https://review.opendev.org/c/x/charm-ovn-chassis/+/775439

Note that the changes will need backporting to the latest stable branch once testing on master is successful.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

Ryan, cs:ovn-dedicated-chassis should now be fixed. The fix has been merged into the stable branch via https://review.opendev.org/c/x/charm-ovn-dedicated-chassis/+/775769.

Revision history for this message
Ryan Mickler (ryanmickler) wrote :

Thanks Corey for addressing this so quickly, I'll test it out and report back.

Changed in cloud-archive:
status: Invalid → Fix Committed
Changed in cloud-archive:
status: Fix Committed → Fix Released
Revision history for this message
Corey Bryant (corey.bryant) wrote :

This bug was fixed in the package openvswitch - 2.15.0-0ubuntu2~cloud0
---------------

 openvswitch (2.15.0-0ubuntu2~cloud0) focal-wallaby; urgency=medium
 .
   * New update for the Ubuntu Cloud Archive.
 .
 openvswitch (2.15.0-0ubuntu2) hirsute; urgency=medium
 .
   * Fix recording of FQDN/hostname on startup (LP: #1915829):
     - d/p/ovs-dev-ovs-ctl-Allow-recording-hostname-separately.patch: Cherry
       pick of committed upstream fix to support skip of hostname
       configuration on ovs-vswitchd/ovsdb-server startup.
     - d/openvswitch-switch.ovs-record-hostname.service: Record hostname in
       Open vSwitch after network-online.target using new systemd unit.
     - d/openvswitch-switch.ovs-vswitchd.service: Pass `--no-record-hostname`
       option to `ovs-ctl` to delegate recording of hostname to the separate
       service.
     - d/openvswitch-switch.ovsdb-server.service: Pass `--no-record-hostname`
       option to `ovs-ctl` to delegate recording of hostname to the separate
       service.
     - d/openvswitch-switch.service: Add `Also` reference to
       ovs-record-hostname.service so that the service is enabled on install.
     - d/rules: Add `ovs-record-hostname.service` to package build.
 .
 openvswitch (2.15.0-0ubuntu1) hirsute; urgency=medium
 .
   * New upstream release 2.15
 .
 openvswitch (2.15.0~git20210104.def6eb1ea-0ubuntu3) hirsute; urgency=medium
 .
   * d/openvswitch-switch.ovsdb-server.service: avoid removing the state
     dir on restart (LP: #1910209)
 .
 openvswitch (2.15.0~git20210104.def6eb1ea-0ubuntu2) hirsute; urgency=medium
 .
   * d/rules: Re-align expected test failure with test numbering on
     armhf.
 .
 openvswitch (2.15.0~git20210104.def6eb1ea-0ubuntu1) hirsute; urgency=medium
 .
   * New upstream snapshot in preparation for 2.15.0 release.
   * d/p/*: Refresh
   * d/control: Bump minimum libdpdk-dev version to 20.11.
   * d/control: Add BD on libdbus-1-dev for pcap architectures.
   * d/rules: Set DPDK build to use shared libraries.
 .
 openvswitch (2.14.0-0ubuntu2) hirsute; urgency=medium
 .
   * d/openvswitch-switch.ovs*.service: Update ovs-vswitchd and ovsdb-server
     systemd unit files to use the correct environment file and environment
     variable for ovs-ctl options, /etc/default/openvswitch-switch and
     OVS_CTL_OPTS, respectively (LP: #1906280).
 .
 openvswitch (2.14.0-0ubuntu1) hirsute; urgency=medium
 .
   * New upstream release.
   * d/p/*: Refresh.
 .
 openvswitch (2.13.1-0ubuntu1) groovy; urgency=medium
 .
   [ Chris MacNaughton ]
   * d/openvswitch-switch.ovsdb-server.service: Add local-fs.target to
     systemd service file to ensure that local filesystems are ready
     before the ovsdb service tries to start (LP: #1887177).
   * d/control: Remove Breaks/Replaces that are older than Focal (LP: #1878419).
 .
   [ James Page ]
   * New upstream point release.
   * d/p/py3-compat.patch: Refresh.

Revision history for this message
Felipe Reyes (freyes) wrote :

Marking ovn-dedicated-chassis as fix released based on comment #57

Changed in charm-ovn-dedicated-chassis:
status: In Progress → Fix Released