lxc-start-ephemeral's use of dhcp lease table is fragile

Bug #994752 reported by Gary Poster
This bug affects 3 people
Affects          Status         Importance   Assigned to        Milestone
lxc (Ubuntu)     Fix Released   High         Unassigned
Precise          Fix Released   High         Stéphane Graber
Quantal          Fix Released   High         Unassigned

Bug Description

[Impact]
This affects anyone using lxc-start-ephemeral as part of an automated process for which intermittent failures are a problem. This includes the people who developed the initial version of the script, the Launchpad developers. Our automated test suite will fail, stopping our landing tools, whenever this failure is triggered.

[Development Fix]
1. no longer look in the container's file system for a dhcp table to get the ip of the container; instead, look in the host's network information. This is more reliable and ready sooner. r101 of quantal lxc package has this change.
2. increase the timeout waiting for the containers' network and sshd to be ready.

Note that the current increase, from 30 retries @ 1/sec to 60 retries @ 1/sec, is insufficient for the people who filed the bug, unfortunately. Making the retry count configurable would be ideal. Increasing it to 300 would be sufficient, based on our experience so far. stgraber suggested using the `parallel -l maxload` construct to keep the starts from being overwhelmed by load. Unfortunately, we believe that this is insufficient for at least two reasons. First, the point of our effort is to do a lot of work in parallel, with an lxc per core. The work we have to do takes more than half an hour. Waiting for load to decrease would miss the point of the effort. Second, it doesn't seem that cpu contention is always the problem, from watching top.
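
As an illustration of the configurable retry count suggested above, here is a minimal sketch of a wait loop with the retry count and delay as parameters. It is not the code in lxc-start-ephemeral; the function name `wait_for_port` and its parameters are hypothetical, and the default of 300 retries is simply the value mentioned in this description. It only checks that the port accepts TCP connections, which does not guarantee that sshd is ready to authenticate.

```python
import socket
import time


def wait_for_port(host, port, retries=300, delay=1.0, connect_timeout=1.0):
    """Return True once a TCP connection to (host, port) succeeds.

    Tries up to `retries` times, sleeping `delay` seconds between failed
    attempts, and returns False if every attempt fails.
    (Hypothetical helper, not part of the lxc package.)
    """
    for attempt in range(retries):
        try:
            with socket.create_connection((host, port), timeout=connect_timeout):
                return True
        except OSError:
            # Not accepting connections yet (refused, unreachable, timed out).
            if attempt < retries - 1:
                time.sleep(delay)
    return False
```

A caller would pass the container's address and the sshd port, for example `wait_for_port(container_ip, 22, retries=300)`, where `container_ip` is whatever address was found for the container.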

[Stable Fix]
[stgraber will need to specify]

[Test Case]

1. Create an lxc container (which has sshd running and your home directory mounted, as is the default). For the sake of these instructions, we will call it "lxctest".
2. Run something like this. Replace "username" with your user name. You might need to do this more or fewer times; we've seen it most easily on a 32 core (16 core hyperthreaded) machine trying to run 32 concurrent calls. To make this less annoying, you could create a temporary passphraseless ssh key.

parallel -j 16 bash -c "lxc-start-ephemeral -u username -o lxctest -- 'cat /etc/hostname'" -- 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

Do this a few times.

Broken Behavior: At least one of the times, at least one of these fails (emitting an error message rather than the hostname), either because the code could not get the ip address in time or because the container's sshd wasn't ready in time.
Fixed Behavior: You get all 16 hostnames.

[Regression Potential]
The increased timeout might cause some code to wait longer than before to discover that something is wrong. The improved ip code should have no negative effect.

[Original Report]
When lxc-start-ephemeral is given a command to run (-- do_something) it wants to use lxc-attach to run the command, but lxc-attach is not ready yet. Instead, it parses the dhcp leases to figure out the IP for the container, and then tries to use ssh to run the command.

Twice today in tests involving lxc-start-ephemeral, the dhcp leases were unavailable and lxc-start-ephemeral failed. The machine was under fairly heavy load and was virtualized (EC2).

I'd like to try and make this less fragile. As discussed on IRC, using lxcip (http://bazaar.launchpad.net/~launchpad/lpsetup/trunk/files/head:/lplxcip/) should make this more reliable. Perhaps increasing the timeout in that code might be useful as well.

Serge Hallyn (serge-hallyn) wrote :

Thanks, Gary. Do you have a debdiff for fixing this using lxcip in either precise or q?

Changed in lxc (Ubuntu):
status: New → Triaged
importance: Undecided → High
Francesco Banconi (frankban) wrote :

Please find attached a diff for fixing the bug.
This patch introduces the *lxc-ip* Python script, initially developed in lp:lpsetup, and used to retrieve the ip address of a running container.
It also updates *lxc-start-ephemeral* to use lxc-ip and an increased number of connection retries.
The diff is against precise-proposed: I hope the diff format is correct.
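
For readers without the attachment, the following is a rough sketch of one way a script could pull a container's IPv4 address out of `ip -o -4 addr show` output. It is not the attached lxc-ip script. The function name `parse_ipv4_addresses` is hypothetical, and how the output is obtained (for instance by running `ip` inside the container's network namespace) is an assumption left outside the sketch.

```python
import re

# Matches the start of a one-line `ip -o -4 addr show` record, e.g.
# "2: eth0    inet 10.0.3.15/24 brd 10.0.3.255 scope global eth0".
_INET_LINE = re.compile(
    r"^\d+:\s+(?P<iface>\S+)\s+inet\s+(?P<addr>\d+(?:\.\d+){3})/\d+"
)


def parse_ipv4_addresses(ip_output, skip_interfaces=("lo",)):
    """Return (interface, address) pairs found in `ip -o -4 addr show` output.

    Interfaces listed in `skip_interfaces` (loopback by default) are ignored.
    (Hypothetical helper, shown for illustration only.)
    """
    found = []
    for line in ip_output.splitlines():
        match = _INET_LINE.match(line.strip())
        if match and match.group("iface") not in skip_interfaces:
            found.append((match.group("iface"), match.group("addr")))
    return found
```

An empty result would mean the interface has no address yet, which is the case a caller would want to retry rather than treat as a failure.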

Ubuntu Foundations Team Bug Bot (crichton) wrote :

The attachment "bug-994752-lxc-ip.debdiff" of this bug report has been identified as being a patch in the form of a debdiff. The ubuntu-sponsors team has been subscribed to the bug report so that they can review and hopefully sponsor the debdiff. In the event that this is in fact not a patch you can resolve this situation by removing the tag 'patch' from the bug report and editing the attachment so that it is not flagged as a patch. Additionally, if you are member of the ubuntu-sponsors team please also unsubscribe the team from this bug report.

[This is an automated message performed by a Launchpad user owned by Brian Murray. Please contact him regarding any issues with the action taken in this bug report.]

tags: added: patch
Serge Hallyn (serge-hallyn) wrote :

Thanks, the patch looks good. A new package was pushed to precise-proposed today, so it should sit for 6 days. I'll test this one over the next few days and upload it soon.

Serge Hallyn (serge-hallyn) wrote :

Sorry - I forgot this affects quantal as well (though lxc-attach may arrive in time for that). I'll push to there by tomorrow (assuming no testing hiccoughs). Thanks again.

Changed in lxc (Ubuntu Precise):
status: New → Confirmed
importance: Undecided → High
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package lxc - 0.8.0~rc1-4ubuntu7

---------------
lxc (0.8.0~rc1-4ubuntu7) quantal; urgency=low

  [ Francesco Banconi ]
  * Introduced lxc-ip: retrieve the ip addresses of a container.
  * lxc-start-ephemeral: use lxc-ip to ssh to the container (LP: #994752).
 -- Serge Hallyn <email address hidden> Wed, 16 May 2012 10:46:21 -0500

Changed in lxc (Ubuntu Quantal):
status: Triaged → Fix Released
Bryce Harrington (bryce)
description: updated
Bryce Harrington (bryce) wrote :

It appears the package was already uploaded to quantal (#6) and precise (#4) so I take it there's no sponsorship work needed here.

Changed in lxc (Ubuntu Precise):
assignee: nobody → Stéphane Graber (stgraber)
status: Confirmed → In Progress
Changed in lxc (Ubuntu Precise):
status: In Progress → Fix Committed
Gary Poster (gary)
description: updated
Clint Byrum (clint-fewbar) wrote : Please test proposed package

Hello Gary, or anyone else affected,

Accepted lxc into precise-proposed. The package will build now and be available in a few hours. Please test and give feedback here. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you in advance!

tags: added: verification-needed
Gary Poster (gary) wrote :

The improvement in the new version is significant, and there is no regression.

The improvement is still insufficient for our use case, as I noted would be likely in the updated description. For our use case of 32 LXC containers running on a 32 core (16 hyperthreaded) EC2 box, 21 ephemeral containers were able to correctly complete, with 11 timing out. Previously, a maximum of 11 were able to correctly complete experimentally, and 20+ timed out.

In sum, we will need something more--a configurable timeout, or a much larger timeout, like 300. However, the proposed package is good--a very nice improvement in many ways--and I don't see any reason not to release it.

tags: added: verification-done
removed: verification-needed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package lxc - 0.7.5-3ubuntu58

---------------
lxc (0.7.5-3ubuntu58) precise-proposed; urgency=low

  * Fix broken logic in lxc-ubuntu template where lxc.devttydir would be
    set to 'lxc' only for releases that don't support it. (LP: #1007493)

lxc (0.7.5-3ubuntu57) precise-proposed; urgency=low

  [ Serge Hallyn ]
  * 0083-always-close-all-fds.patch: Have lxc-start always run with
    --close-all-fds. There is no advantage to having lxc-start fail with
    inherited fds. (LP: #1003583)
  * debian/lxc-net.upstart: don't put '()' after call to cleanup.
    (LP: #1000174)

  [ Stéphane Graber ]
  * Sync lxc-ubuntu with the one in Quantal:
    - Bugfixes:
      + Update list of extra packages for debootstrap to only include vim
        and ssh. The others were only relevant when we were still using the
        minbase variant. (LP: #996839)
      + Update default /etc/hosts to match that of a regular Ubuntu system.
        (adds missing ipv6 aliases) (LP: #1004108)
      + Make sure /etc/resolv.conf is valid before running any apt command.
        Fixes a potential race condition (no report of it at this time).

    - Improvements we get by pulling the whole patch from Quantal.
      These don't contain any user behaviour change but will make
      cherry-picking any further change much easier.
      + Drop any hardcoded Ubuntu version check and replace by feature
        checks instead. This removes the need for SRUs whenever we release
        a new Ubuntu.
      + Format lxc-ubuntu to consistently use 4-spaces indent instead
        of mixed spaces/tabs.
      + Update default /etc/network/interfaces to include the header.
      + Drop support for never supported releases (gutsy on sparc).
      + Update template help message for release and arch parameters.
        Old string was only listing i386 and amd64, which is no longer
        accurate (as of 12.04).
        (This string isn't translated)
      + Switch default Ubuntu version from lucid to precise for systems
        that don't have lsb_release (won't affect Ubuntu)

  * Sync lxc-start-ephemeral with the one in Quantal:
    - Switch lxc-start-ephemeral from unreliable parsing of DHCP lease files
      to using "ip netns" to retrieve the IP from the container's network
      namespace. (LP: #994752)
    - Fix a race in lxc-start-ephemeral where the container isn't yet
      running when trying to get its IPs.
    - Update a few calls so that lxc-start-ephemeral can be called as a
      user (ensure consistent usage of sudo across the script). (LP: #1004069)
 -- Stephane Graber <email address hidden> Fri, 01 Jun 2012 11:46:50 -0400

Changed in lxc (Ubuntu Precise):
status: Fix Committed → Fix Released