cgmanager and cgproxy startup is racy

Bug #1302174 reported by Pat McGowan
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
cgmanager (Ubuntu)
Fix Released
Medium
Unassigned
lxc-android-config (Ubuntu)
Fix Released
High
Oliver Grawert

Bug Description

Every 10-100 boots (in a really random manner) the boot of mako devices gets completely stuck. This started with image 274.

This was tracked down to two issues:
 1) lxc-android-config starts on started cgmanager while it actually needs cgproxy due to the rather old kernel. This should be fixed by adding a cgmanager-ready event which will kick in whenever things are ready, whether that's just cgmanager or just cgproxy or both.
 2) cgmanager and cgproxy should only mark themselves started once their main loop is ready to accept queries from clients. This should be done by raising SIGSTOP once they are ready (daemonize would prevent debugging).

Revision history for this message
Oliver Grawert (ogra) wrote :

after the hard hang, directly rebooting into recovery and retrieving /proc/last_kmsg results in a log like http://paste.ubuntu.com/7197970/

obviously the boot hangs right after run-init (swapon is the last step in the /init script in initrd)

Revision history for this message
Oliver Grawert (ogra) wrote :

checking a boot with loglevel=7 like in http://paste.ubuntu.com/7200123/ it seems that systemd-udevd starts very early, it should actually wait until the container is up (which should happen closer to the 10 second mark)

http://people.canonical.com/~ogra/touch-image-stats/274.changes shows that systemd was updated in 274

also checking the udev.override file we ship in lxc-android-config there is the following start condition

start on (android
          and stopped lxc-android-boot)

...

this should better be lxc-android-config, not lxc-android-boot.

Revision history for this message
Oliver Grawert (ogra) wrote :

from a boot with"loglevel=7 --verbose" i got the following output ... which is sadly totally mangled in last_kmsg http://paste.ubuntu.com/7202307/

Revision history for this message
Oliver Grawert (ogra) wrote :

oops, sorry, it was late last night, the change to udev.override indeed needs to read: "and started lxc-android-config"

Revision history for this message
Oliver Grawert (ogra) wrote :

hmm, or probably just drop the and started condition completely ... seems the issue is only tangentially related to udev and rather point to the switch from cgroup-lite to cgmanager ... running tests with that change reverted now.

Revision history for this message
Oliver Grawert (ogra) wrote :

/var/log/lxc/android.log is full of:

      lxc-start 1396572576.808 ERROR lxc_cgfs - Could not find writable mount point for cgroup hierarchy 5 while trying to create cgroup.
      lxc-start 1396572576.812 ERROR lxc_start - failed creating cgroups
      lxc-start 1396572576.812 ERROR lxc_start - failed to spawn 'android'

Revision history for this message
Didier Roche-Tolomelli (didrocks) wrote :

Tried upgraded to latest udev with bug fix for https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1302264. unfortunately, still stuck after 27 reboots.

Revision history for this message
Oliver Grawert (ogra) wrote :

the tests i started are still running fine with teh following changes:

apt-get install cgroups-lite

edit /etc/init/lxc-android-config.conf to start on started cgroups-lite

echo manual >/etc/init/cgmanager.conf
echo manual >/etc/cgproxy.conf

this is in the 70th loop here atm, if it survives until around 250-300 i will upload this change and we should test in the lab with a new image build.

Revision history for this message
Oliver Grawert (ogra) wrote :

err the second echo line should be:

echo manual >/etc/init/cgproxy.conf

indeed :)

Revision history for this message
Oliver Grawert (ogra) wrote :

for lxc-android-config i will apply http://paste.ubuntu.com/7203232/ now, and additionally open a task for cgmanager

Changed in cgmanager (Ubuntu):
importance: Undecided → High
assignee: nobody → Stéphane Graber (stgraber)
summary: - Booting mako hangs randomly right after run-init
+ Booting mako hangs randomly with cgmanager installed
Revision history for this message
Oliver Grawert (ogra) wrote : Re: Booting mako hangs randomly with cgmanager installed

oh, and before i forget, the reboot script used for reproducing:

#! /bin/sh

for i in `seq 1 1000`; do
 stamp="$(date +%H:%M:%S)"
    echo "$stamp LOOP $i"
    adb reboot
    adb wait-for-device
    sleep 20
done

in general my device died between 60 and 120 reboots with the bug in place. after the above revert i stopped the loop at 273 reboots.

there is definitely something going on with cgmanager on the phones

Revision history for this message
Anca Emanuel (anca-emanuel) wrote :

^ Something to add to automatic testing... on some ebay-ed cheap broken screen device.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package lxc-android-config - 0.158

---------------
lxc-android-config (0.158) trusty; urgency=medium

  * revert to using cgroup-lite, with cgmanager running we get random hard
    hangs of the system during image testing (LP: #1302174)
  * ship .override files for cgmanager and cgproxy
  * switch the package dependency back from cgmanager to cgroup-lite
 -- Oliver Grawert <email address hidden> Fri, 04 Apr 2014 15:13:45 +0200

Changed in lxc-android-config (Ubuntu):
status: Confirmed → Fix Released
Revision history for this message
Stéphane Graber (stgraber) wrote :

reducing status and priority to incomplete/undecided as it's not clear cgmanager is actually at fault.

None of the logs provided indicate any kind of crash from cgmanager, /var/log/upstart/cgmanager.log wasn't provided and the LXC error is very likely because you are attempting to start the container before cgmanager is actually started which is a lxc-android-config bug not a cgmanager one.

Changed in cgmanager (Ubuntu):
status: New → Incomplete
assignee: Stéphane Graber (stgraber) → nobody
importance: High → Undecided
Revision history for this message
Oliver Grawert (ogra) wrote :

The cgmanager.log from my testing device is at http://paste.ubuntu.com/7203516/

Changed in cgmanager (Ubuntu):
status: Incomplete → Triaged
importance: Undecided → Medium
summary: - Booting mako hangs randomly with cgmanager installed
+ cgmanager and cgproxy startup is racy
description: updated
description: updated
Revision history for this message
Stéphane Graber (stgraber) wrote :

At last managed to extract the crash:

Program terminated with signal SIGSEGV, Segmentation fault.
#0 _int_malloc (av=av@entry=0xb6eb34e8 <main_arena>, bytes=bytes@entry=100)
    at malloc.c:3489
3489 malloc.c: No such file or directory.
(gdb) bt
#0 _int_malloc (av=av@entry=0xb6eb34e8 <main_arena>, bytes=bytes@entry=100)
    at malloc.c:3489
#1 0xb6e26808 in __GI___libc_malloc (bytes=100) at malloc.c:2891
#2 0xb6efb9a6 in nih_alloc () from /lib/arm-linux-gnueabihf/libnih.so.1
#3 0xb6f4c44a in alloc_scm_sock_data (fd=fd@entry=9,
    t=t@entry=REQ_TYPE_REMOVE, message=0xb8117840) at frontend.c:57
#4 0xb6f4d900 in cgmanager_remove_scm (data=<optimized out>,
    message=message@entry=0xb8117840,
    controller=controller@entry=0xb8110a70 "freezer",
    cgroup=0xb81175e8 "user/0.user/c1.session", recursive=recursive@entry=1,
    sockfd=9) at frontend.c:1014
#5 0xb6f47aaa in cgmanager_org_linuxcontainers_cgmanager0_0_RemoveScm_method
    (object=0xb8110918, message=0xb8117840)
    at org.linuxcontainers.cgmanager.c:3430
#6 0xb6eeb07c in ?? () from /lib/arm-linux-gnueabihf/libnih-dbus.so.1
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Revision history for this message
Stéphane Graber (stgraber) wrote :

I'm now testing an upload candidate for cgmanager which adds sigstop support, updates the upstart jobs to use that, emit the cgmanager-ready event and fix a bug that was preventing cgmanager respawning on crashes.

All of those combined while not fixing the actual crash which still happens occasionally should make the effect go away entirely and give us stability again. My phone is in a reboot loop now currently at 45 reboots. When I beat Oliver's 273, I'll get that in the archive.

I'll also provide the debdiff to turn things back on properly in lxc-android-config, hopefully we can have that land next week.

Revision history for this message
Stéphane Graber (stgraber) wrote :

Got up to 300 successful boots, calling this upload ready.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cgmanager - 0.24-0ubuntu3

---------------
cgmanager (0.24-0ubuntu3) trusty; urgency=medium

  * Use SIGSTOP as a way to tell upstart that we are ready to accept
    queries, this fixes a race condition. (LP: #1302174)
  * Emit cgmanager-ready once all the required daemons have been spawned
    and are ready for requests. (LP: #1302174)
  * Always kill cgproxy when spawning cgmanager, this is required to
    allow proper restart/respawning on systems using both daemons.
    (LP: #1302174)
 -- Stephane Graber <email address hidden> Fri, 04 Apr 2014 21:04:43 -0400

Changed in cgmanager (Ubuntu):
status: Triaged → Fix Released
Revision history for this message
Stéphane Graber (stgraber) wrote :

Re-opening the lxc-android-config task as cgmanager support should be re-enabled there.

Attaching a debdiff.

Changed in lxc-android-config (Ubuntu):
status: Fix Released → Triaged
importance: Critical → High
tags: added: bot-stop-nagging
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package lxc-android-config - 0.161

---------------
lxc-android-config (0.161) trusty; urgency=medium

  [ Stéphane Graber ]
  * e-enable cgmanager support. (LP: #1302174)
 -- Oliver Grawert <email address hidden> Mon, 14 Apr 2014 13:23:45 +0200

Changed in lxc-android-config (Ubuntu):
status: Triaged → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.