lxc on precise is not working with lucid containers (container does not reach runlevel 2)

Bug #924337 reported by Gary Poster
This bug affects 2 people
Affects        Status         Importance   Assigned to    Milestone
lxc (Ubuntu)   Fix Released   High         Serge Hallyn
  (Declined for Maverick by Martin Pitt)
Natty          Fix Released   Undecided    Unassigned
Oneiric        Fix Released   Undecided    Unassigned

Bug Description

============================================
SRU Justification
1. Impact: any upstart jobs which start on lo being raised will not start
2. Development fix: lxcguest emits the signal pretending that lo was created
3. Stable fix: same as development fix
4. Test case (on precise host):
   a. lxc-create -t ubuntu -n test1 -- -r oneiric
   b. cat > /var/lib/lxc/test1/rootfs/etc/init/lxclo.conf << EOF
description "detect net-device-up IFACE=lo"
start on net-device-up IFACE=lo
EOF
   c. lxc-start -n test1
   Log in as root:root, and type 'status lxclo'. With the fix, it will say
   lxclo is 'start/running'.
5. Regression potential: if this were done wrong, containers could fail to start up right.
============================================

On precise with the newest packages as of 2012-01-31 1500Z, trying to start an lxc lucid container hangs. Oneiric and Precise containers work fine.

$ sudo lxc-start -n lp
init: ureadahead-other main process (31) terminated with status 4
init: console-setup main process (32) terminated with status 1
init: procps main process (33) terminated with status 255
init: ureadahead-other main process (37) terminated with status 4
[HANGS HERE, for at least two minutes]

How I created the container:
$ sudo lxc-create -t ubuntu -n lp -f /etc/lxc/local.conf -- -r lucid -a i686 -b gary

local.conf:
$ cat /etc/lxc/local.conf
lxc.network.type=veth
lxc.network.link=lxcbr0
lxc.network.flags=up

/var/lib/lxc/lp/rootfs/etc/resolv.conf was a symlink to /etc/resolvconf/run/resolv.conf. On the basis of previous conversations, I tried replacing it with a file containing 'nameserver 10.0.3.1', with no luck.

Changed in lxc (Ubuntu):
importance: Undecided → High
Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Hi Gary,

When we've seen this before, it has been because the container was waiting for a DHCP response. It has generally been on virbr0, and when virbr0 had STP on.

Can you show what 'brctl show' on the host gives? Do your host logs show any problems with dnsmasq on lxcbr0?
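
As a complement to 'brctl show', the STP state of a bridge can also be read from sysfs. The following is a minimal sketch, not something posted in this thread: the function name bridge_stp_state and the SYSFS_NET override are made up for illustration, and it assumes the kernel exposes /sys/class/net/<bridge>/bridge/stp_state.

```shell
#!/bin/sh
# Sketch: report whether STP is enabled on a Linux bridge by reading sysfs.
# SYSFS_NET is a hypothetical override so the function can be pointed at
# another directory; it defaults to /sys/class/net.

bridge_stp_state() {
    bridge="$1"
    state_file="${SYSFS_NET:-/sys/class/net}/$bridge/bridge/stp_state"
    if [ ! -r "$state_file" ]; then
        # Not a bridge, or the interface does not exist.
        echo "unknown"
        return 1
    fi
    # The kernel reports 0 when STP is disabled; any other value means enabled.
    if [ "$(cat "$state_file")" = "0" ]; then
        echo "off"
    else
        echo "on"
    fi
}

# Example, on the host:
#   bridge_stp_state lxcbr0
#   bridge_stp_state virbr0
```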

Changed in lxc (Ubuntu):
status: New → Incomplete
Revision history for this message
Gary Poster (gary) wrote :

I reported on IRC to Serge that virbr0 did indeed have STP enabled, but that I did not see any obvious dnsmasq errors. He already had diagnosed the problem & said as a workaround I could just change the 'start on' in /etc/init/console.conf in my container to 'start on mounted MOUNTPOINT=/run'. stgraber's new upstart will fix it properly when it hits the archive.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Thanks, Gary.

Note that the upstart 'fix' will basically be the same as the suggested workaround.

I think the stp on for virbr0 is actually a red herring here, for two reasons. First, you show in the Description that you are using lxcbr0, which should have stp off. Second, in my own experiments, when the console has been slow to come up, the network was actually up and I could ssh in before the console came up. So I think the current 'start on', which is 'stopped rc RUNLEVEL=[2345]', is just bad - that is, there is something running long (maybe udev?) which keeps us from getting to *stopped* rc runlevel 2.

Revision history for this message
Gary Poster (gary) wrote :

I tried the fix and I got the same symptom.

For clarity:

- I wiped out /var/cache/lxc before I started because lxc-create wasn't working. That got lxc-create working. I used the same commands I listed in the bug description.

- I changed the container's console.conf as described:

$ cat /var/lib/lxc/lp/rootfs/etc/init/console.conf
# /dev/console - getty
#
# This service maintains a getty on /dev/console from the point the
# system is started until it is shut down again.
# It only runs in lxc containers.
# Libvirt links /dev/tty1 to /dev/pts/0.

## start on stopped rc RUNLEVEL=[2345]

start on mounted MOUNTPOINT=/run
stop on runlevel [!2345]

env container
env LIBVIRT_LXC_UUID
pre-start script
 [ "x$container" = "xlxc" ] && exit 0;
 stop;
 exit 0;
end script

respawn

exec /sbin/getty -8 38400 /dev/console

- sudo lxc-start -n lp then hangs with the exact same symptoms described initially.

Thanks

Changed in lxc (Ubuntu):
status: Incomplete → New
Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Reproduced. Note that during this time I can ssh in just fine. The container does have an IP address! (You can find it with 'tail /var/log/daemon.log' on the host.)

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

@Gary

Terribly sorry, 'start on mounted MOUNTPOINT=/run' won't work in your containers as you are using lxcguest.

You should be safe to just make it 'start on started lxcguest' (really even 'start on startup', which I just tried and worked fine)
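
One way to apply that change from the host is sketched below. This is an illustration only: set_start_on is a made-up helper name, it handles only a single-line 'start on' stanza, and the condition text must not contain '|', '&' or backslashes because it is passed straight to sed.

```shell
#!/bin/sh
# Sketch: replace the active "start on" stanza of an upstart job file.

set_start_on() {
    job_file="$1"
    condition="$2"
    tmp_file="$job_file.tmp.$$"
    # Only lines that begin with "start on" are rewritten, so a commented-out
    # line such as "## start on ..." is left alone.
    sed "s|^start on .*|start on $condition|" "$job_file" > "$tmp_file" || return 1
    # Write back through the original file so its owner and mode are kept.
    cat "$tmp_file" > "$job_file" && rm -f "$tmp_file"
}

# Example, using the container path from this report:
#   sudo sh -c '. ./set_start_on.sh; \
#       set_start_on /var/lib/lxc/lp/rootfs/etc/init/console.conf "started lxcguest"'
```

The file name set_start_on.sh in the example is likewise only a placeholder for wherever the function is saved.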

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

It seems to be a bug in lxcguest. runlevel shows 'unknown', whereas on a real host or precise container it shows 'N 2'

Changed in lxc (Ubuntu):
status: New → Confirmed
Revision history for this message
Gary Poster (gary) wrote :

Serge, cool. The workaround is fine. Thanks.

summary: - lxc on precise is not working with lucid containers
+ lxc on precise is not working with lucid containers (container does not
+ reach runlevel 2)
Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

rc-sysinit.conf never starts because it doesn't see net-device-up for lo. When I do:

sudo initctl emit --no-wait net-device-up IFACE=lo

then the container transitions to runlevel 'N 2'.
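
The manual check and emit can be combined into a small guard, sketched here under the assumption that 'runlevel' prints exactly 'unknown' while the container is stuck, as reported earlier in this thread. The helper name needs_lo_event is made up for illustration.

```shell
#!/bin/sh
# Sketch: decide from the output of `runlevel` whether the container is
# still waiting to enter a runlevel.

needs_lo_event() {
    # $1 is the text printed by `runlevel`; "unknown" means that no runlevel
    # has been recorded yet.
    [ "$1" = "unknown" ]
}

# Usage inside the affected container, as root:
#   if needs_lo_event "$(runlevel)"; then
#       initctl emit --no-wait net-device-up IFACE=lo
#   fi
```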

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

When I first created lxcguest, one of the main things it did was emit that event. Then we stopped having to, but the reasons were always a bit too mysterious for me.

Assigning to stgraber as I *believe* he will know offhand what's going on. If not, please feel free to reassign to me.

Changed in lxc (Ubuntu):
assignee: nobody → Stéphane Graber (stgraber)
Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

It turns out the original problem never went away. The container never gets a net-device-up event for lo, even in precise. You can verify this trivially with two upstart jobs, /etc/init/lxc{lo,eth}.conf:

==== lxclo.conf ====
description "detect net-device-up IFACE=lo"
start on net-device-up IFACE=lo
==== lxceth.conf ===
description "detect net-device-up IFACE=eth0"
start on net-device-up IFACE=eth0
====================

Start the container and check 'status lxclo' and 'status lxceth'.
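
For convenience, the two probe jobs can be written into a container's rootfs from the host with something like the following sketch. The function name install_probe_jobs is made up; the rootfs path in the example is the one used in the SRU test case.

```shell
#!/bin/sh
# Sketch: create lxclo.conf and lxceth.conf under <rootfs>/etc/init.

install_probe_jobs() {
    init_dir="$1/etc/init"
    for iface in lo eth0; do
        case "$iface" in
            lo) name=lxclo ;;
            *)  name=lxceth ;;
        esac
        # Each job only has a description and a start condition, so its
        # status shows whether the event was ever seen.
        cat > "$init_dir/$name.conf" <<EOF
description "detect net-device-up IFACE=$iface"
start on net-device-up IFACE=$iface
EOF
    done
}

# Example, on the host:
#   sudo sh -c '. ./probe.sh; install_probe_jobs /var/lib/lxc/test1/rootfs'
```

The file name probe.sh is only a placeholder for wherever the function is saved.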

The reason this happens is clear in /etc/network/if-up.d/upstart. The reason
this stopped mattering is that rc-sysinit.conf switched from being

start on filesystem and net-device-up IFACE=lo

to being

start on (filesystem and static-network-up) or failsafe-boot

where static-network-up does not require lo to come up.
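
To see which jobs in a given container still depend on the lo event, the job files can be scanned from the host. The sketch below is a rough check only: job_waits_on_lo is a made-up name, and it looks only at the line that begins with 'start on', so a condition continued over several lines would be missed.

```shell
#!/bin/sh
# Sketch: test whether an upstart job's "start on" line waits for
# net-device-up IFACE=lo.

job_waits_on_lo() {
    # Match IFACE=lo as a whole word so that names such as "lo0" do not count.
    grep -Eq '^start on .*net-device-up IFACE=lo([^[:alnum:]]|$)' "$1"
}

# Example, on the host:
#   for f in /var/lib/lxc/lp/rootfs/etc/init/*.conf; do
#       job_waits_on_lo "$f" && echo "$f"
#   done
```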

Revision history for this message
Stéphane Graber (stgraber) wrote :

Right, so anything before Oneiric needs to have that emit in lxcguest. Since Oneiric, it should be properly handled by networking.conf bringing up lo and by the static-network-up event.

Starting with Precise where we won't have lxcguest, the same trick as Oneiric will work so we won't need to emit net-device-up either.

Changed in lxc (Ubuntu):
assignee: Stéphane Graber (stgraber) → Serge Hallyn (serge-hallyn)
status: Confirmed → Fix Released
Changed in lxc (Ubuntu):
status: Fix Released → Confirmed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package lxc - 0.7.5-3ubuntu18

---------------
lxc (0.7.5-3ubuntu18) precise; urgency=low

  * lxcguest.lxcguest.upstart: emit the net-device-up IFACE=lo event, so
    that any upstart jobs waiting on it (esp rc-sysinit before oneiric) will
    proceed. (LP: #924337)
  * 0034-fix-lxc-execute-reboot.patch: fix bad handling of 'exit 0' for
    lxc-execute introduced with the container reboot handling. (LP: #927863)
  * debian/lxcguest.lxcmount.upstart: add '--no-wait' to emit to make sure we
    don't wait for the event to be processed.
  * 0035-lxc-init-ignore-shm.patch: if lxc-init can't mount /dev/shm, don't
    fail on account of that. (LP: #927883)
  * debian/lxc.init: if the network is already up, exit before setting the
    trap EXIT.
 -- Serge Hallyn <email address hidden> Mon, 06 Feb 2012 17:37:37 -0600

Changed in lxc (Ubuntu):
status: Confirmed → Fix Released
Changed in lxc (Ubuntu Natty):
status: New → Invalid
description: updated
Revision history for this message
Clint Byrum (clint-fewbar) wrote : Please test proposed package

Hello Gary, or anyone else affected,

Accepted lxc into oneiric-proposed. The package will build now and be available in a few hours. Please test and give feedback here. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you in advance!

Changed in lxc (Ubuntu Oneiric):
status: New → Fix Committed
tags: added: verification-needed
Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Confirmed, lxcguest from oneiric-proposed fixes this in an oneiric guest.

tags: added: verification-done
removed: verification-needed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package lxc - 0.7.5-0ubuntu8.5

---------------
lxc (0.7.5-0ubuntu8.5) oneiric-proposed; urgency=low

  * lxcguest.lxcguest.upstart: emit the net-device-up IFACE=lo event, so
    that any upstart jobs waiting on it (esp rc-sysinit before oneiric) will
    proceed. (LP: #924337)
  * 0035-lxc-init-ignore-shm.patch: if lxc-init can't mount /dev/shm, don't
    fail on account of that. (LP: #927883)
 -- Serge Hallyn <email address hidden> Mon, 06 Feb 2012 19:06:46 -0600

Changed in lxc (Ubuntu Oneiric):
status: Fix Committed → Fix Released
Revision history for this message
Guilherme Salgado (salgado) wrote :

This happens to me on a Natty container:

salgado@delgadito:~$ cat /var/lib/lxc/patchmetrics/rootfs/etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=11.04
DISTRIB_CODENAME=natty
DISTRIB_DESCRIPTION="Ubuntu 11.04"
salgado@delgadito:~$ sudo lxc-start -n patchmetrics
init: procps main process (26) terminated with status 255
init: udev-fallback-graphics main process (46) terminated with status 1
init: plymouth main process (6) killed by ABRT signal
init: plymouth-splash main process (52) terminated with status 2

Changed in lxc (Ubuntu Natty):
status: Invalid → Confirmed
Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Confirmed - in fact /etc/init/lxcguest.conf is missing altogether, causing other problems as well.

I'll push a fix to natty-proposed. Thanks for pointing this out.

Revision history for this message
Chris Halse Rogers (raof) wrote :

Hello Gary, or anyone else affected,

Accepted lxc into natty-proposed. The package will build now and be available in a few hours. Please test and give feedback here. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you in advance!

Changed in lxc (Ubuntu Natty):
status: Confirmed → Fix Committed
tags: removed: verification-done
tags: added: verification-needed
Revision history for this message
Gary Poster (gary) wrote :

Salgado, would you be able to do this verification? If not, let me know, and I'll do it.

Revision history for this message
Guilherme Salgado (salgado) wrote :

Just confirmed; this did fix a newly created Natty container for me. Thanks!

Revision history for this message
Stéphane Graber (stgraber) wrote :

Based on comment above, marking verification-done

tags: added: verification-done
removed: verification-needed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package lxc - 0.7.4-0ubuntu7.3

---------------
lxc (0.7.4-0ubuntu7.3) natty-proposed; urgency=low

  * lxcguest.lxcguest.upstart: emit the net-device-up IFACE=lo event, so
    that any upstart jobs waiting on it (esp rc-sysinit before oneiric) will
    proceed. (LP: #924337)
  * debian/rules: install lxcguest.lxcguest.upstart (as it was not in the
    natty package before)
 -- Serge Hallyn <email address hidden> Wed, 28 Mar 2012 13:58:23 -0500

Changed in lxc (Ubuntu Natty):
status: Fix Committed → Fix Released
Revision history for this message
Oliver Mueller (oliver-vpr) wrote :

Is it possible that this bug returned?

I updated the host to Ubuntu Saucy and lxc was working fine with the virtual machine "jenkins". Once I updated the guest system to Saucy as well, it seemed not to boot, but I was wrong: it boots, but never reaches the runlevel (5) that it reached before.

Here is the output of lxc-ls

# lxc-ls --fancy
NAME       STATE    IPV4          IPV6  AUTOSTART
-------------------------------------------------
jenkins    RUNNING  192.168.1.30  -     NO
php-54-32  STOPPED  -             -     NO
test       STOPPED  -             -     NO

As you can see, it is running, but it is not reachable by ssh. So I entered:

# lxc-attach -n jenkins -- runlevel
unknown

It doesn't know which runlevel it is in, even though the network is up within the guest system:

# lxc-attach -n jenkins -- ifconfig
eth0 Link encap:Ethernet HWaddr 10:16:3e:53:5b:7a
          inet addr:192.168.1.30 Bcast:192.168.1.255 Mask:255.255.255.0
          inet6 addr: fe80::1216:3eff:fe53:5b7a/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:277 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:44733 (44.7 KB) TX bytes:648 (648.0 B)

As soon as I force the runlevel with:

lxc-attach -n jenkins -- init 5

everything works, until I reboot the guest again.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Hi,

I suspect you're suffering from the veth offloading feature in the saucy kernel. If possible, please flush the lxc cache (sudo rm -rf /var/cache/lxc/*) and re-create the containers, or else chroot into the containers to update them. Newer userspace will work around the "feature".

If that doesn't help, please open a new bug.
