problems starting multiple lxc instances concurrently

Bug #842845 reported by James Page
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
libvirt (Ubuntu)
Fix Released
High
Serge Hallyn
Oneiric
Fix Released
High
Serge Hallyn
nova (Ubuntu)
Invalid
High
Serge Hallyn
Oneiric
Invalid
High
Serge Hallyn

Bug Description

I've seen this issue fairly consistently when trying to launch multiple LXC instances:

euca-run-instances -n 4 -k whatever ami-XXXXXX

Some instances start OK - some fail to start and sit 'pending' with this error message in the nova-compute.log file.

>>>>>>>>>>>>>>>>>>>>>

2011-09-06 15:21:07,106 DEBUG nova.utils [-] Running cmd (subprocess): sudo mount /dev/nbd12 /var/lib/nova/instances/instance-00000006//rootfs from (pid=1229) execute /usr/lib/pymodules/python2.7/nova/utils.py:165
2011-09-06 15:21:07,219 DEBUG nova.virt.libvirt_conn [-] instance instance-00000006: is running from (pid=1229) spawn /usr/lib/pymodules/python2.7/nova/virt/libvirt/connection.py:599
2011-09-06 15:21:07,220 DEBUG nova.compute.manager [-] Checking state of instance-00000006 from (pid=1229) _get_power_state /usr/lib/pymodules/python2.7/nova/compute/manager.py:188
2011-09-06 15:21:07,242 ERROR nova.exception [-] Uncaught exception
(nova.exception): TRACE: Traceback (most recent call last):
(nova.exception): TRACE: File "/usr/lib/pymodules/python2.7/nova/exception.py", line 98, in wrapped
(nova.exception): TRACE: return f(*args, **kw)
(nova.exception): TRACE: File "/usr/lib/pymodules/python2.7/nova/compute/manager.py", line 453, in run_instance
(nova.exception): TRACE: self._run_instance(context, instance_id, **kwargs)
(nova.exception): TRACE: File "/usr/lib/pymodules/python2.7/nova/compute/manager.py", line 431, in _run_instance
(nova.exception): TRACE: current_power_state = self._get_power_state(context, instance)
(nova.exception): TRACE: File "/usr/lib/pymodules/python2.7/nova/compute/manager.py", line 190, in _get_power_state
(nova.exception): TRACE: return self.driver.get_info(instance['name'])["state"]
(nova.exception): TRACE: File "/usr/lib/pymodules/python2.7/nova/virt/libvirt/connection.py", line 1168, in get_info
(nova.exception): TRACE: (state, max_mem, mem, num_cpu, cpu_time) = virt_dom.info()
(nova.exception): TRACE: File "/usr/lib/python2.7/dist-packages/libvirt.py", line 1059, in info
(nova.exception): TRACE: if ret is None: raise libvirtError ('virDomainGetInfo() failed', dom=self)
(nova.exception): TRACE: libvirtError: internal error Unable to get cgroup for instance-00000006
(nova.exception): TRACE:
2011-09-06 15:21:07,262 ERROR nova.rpc [-] Exception during message handling
(nova.rpc): TRACE: Traceback (most recent call last):
(nova.rpc): TRACE: File "/usr/lib/pymodules/python2.7/nova/rpc/impl_kombu.py", line 620, in _process_data
(nova.rpc): TRACE: rval = node_func(context=ctxt, **node_args)
(nova.rpc): TRACE: File "/usr/lib/pymodules/python2.7/nova/exception.py", line 129, in wrapped
(nova.rpc): TRACE: raise Error(str(e))
(nova.rpc): TRACE: Error: internal error Unable to get cgroup for instance-00000006
(nova.rpc): TRACE:
2011-09-06 15:21:07,281 INFO nova.virt.libvirt_conn [-] Instance instance-00000006 spawned successfully.

ProblemType: Bug
DistroRelease: Ubuntu 11.10
Package: nova-compute-lxc 2011.3~rc~20110901.1523-0ubuntu1
ProcVersionSignature: Ubuntu 3.0.0-10.16-server 3.0.4
Uname: Linux 3.0.0-10-server x86_64
Architecture: amd64
Date: Tue Sep 6 16:30:36 2011
NovaConf: Error: [Errno 13] Permission denied: '/etc/nova/nova.conf'
PackageArchitecture: all
ProcEnviron:
 LANGUAGE=en_GB:
 LANG=en_GB.UTF-8
 SHELL=/bin/bash
SourcePackage: nova
UpgradeStatus: No upgrade log present (probably fresh install)

Related branches

Revision history for this message
James Page (james-page) wrote :
Revision history for this message
Clint Byrum (clint-fewbar) wrote :

This looks more like a problem with libvirt, as its an internal libvirt error..

(nova.exception): TRACE: File "/usr/lib/pymodules/python2.7/nova/virt/libvirt/connection.py", line 1168, in get_info
(nova.exception): TRACE: (state, max_mem, mem, num_cpu, cpu_time) = virt_dom.info()
(nova.exception): TRACE: File "/usr/lib/python2.7/dist-packages/libvirt.py", line 1059, in info
(nova.exception): TRACE: if ret is None: raise libvirtError ('virDomainGetInfo() failed', dom=self)
(nova.exception): TRACE: libvirtError: internal error Unable to get cgroup for instance-00000006

Though it could be a logic problem in some assumptions nova makes about libvirt "domains" that is only proved false with LXC.

Marking as also affects libvirt, setting both to importance of High. While this only affects users of LXC, its intermittent which may signal other problems that are harder to detect.

Changed in libvirt (Ubuntu):
importance: Undecided → High
Changed in nova (Ubuntu):
importance: Undecided → High
Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

You don't get this problem at all doing the same thing with kvm containers, right?

Revision history for this message
James Page (james-page) wrote : Re: [Bug 842845] Re: problems starting multiple lxc instances concurrently

On Mon, 2011-09-12 at 13:36 +0000, Serge Hallyn wrote:
> You don't get this problem at all doing the same thing with kvm
> containers, right?

Right

--
James Page
Ubuntu Server Developer

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Can't seem to reproduce this with a simple

  for c in o1 o2 o3 o4; do
     virsh -c lxc:// start $c
  done

so I'll set up an openstack box to test.

Revision history for this message
Chuck Short (zulcss) wrote :

I had a look at this and this caused by libvirt in src/lxc/lxc_driver.c:

 545 if (virCgroupForDomain(driver->cgroup, vm->def->name, &cgroup, 0) != 0) {
 546 lxcError(VIR_ERR_INTERNAL_ERROR,
 547 _("Unable to get cgroup for %s"), vm->def->name);
 548 goto cleanup;
 549 }

I am not exactly sure what happened here, when the domain got created, it didnt register with cgroups maybe James switch from kvm to lxc when this happened (in this case), but I have seen it other cases when libvirt is a bit racey when creating the lxc domains.

Changed in libvirt (Ubuntu):
status: New → Confirmed
Dave Walker (davewalker)
tags: added: server-o-rs
Revision history for this message
Dave Walker (davewalker) wrote :

zul, is this still and issue and are we able to reproduce it?

Is there a nova issue?

Changed in libvirt (Ubuntu Oneiric):
milestone: none → ubuntu-11.10
Changed in nova (Ubuntu Oneiric):
status: New → Incomplete
Dave Walker (davewalker)
Changed in nova (Ubuntu):
milestone: none → ubuntu-11.10
Changed in libvirt (Ubuntu Oneiric):
assignee: nobody → Serge Hallyn (serge-hallyn)
Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

I'm able to reproduce this with 'euca-run-instances -n 4'. But the libvirt instances do in fact start.

virsh -c lxc:/// list shows the, but euca-describe-instances shows them pending.

nova-compute.log shows:

2011-09-27 16:06:23,110 ERROR nova.exception [-] Uncaught exception
(nova.exception): TRACE: Traceback (most recent call last):
(nova.exception): TRACE: File "/usr/lib/python2.7/dist-packages/nova/exception.py", line 98, in wrapped
(nova.exception): TRACE: return f(*args, **kw)
(nova.exception): TRACE: File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 454, in run_instance
(nova.exception): TRACE: self._run_instance(context, instance_id, **kwargs)
(nova.exception): TRACE: File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 432, in _run_instance
(nova.exception): TRACE: current_power_state = self._get_power_state(context, instance)
(nova.exception): TRACE: File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 190, in _get_power_state
(nova.exception): TRACE: return self.driver.get_info(instance['name'])["state"]
(nova.exception): TRACE: File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/connection.py", line 1196, in get_info
(nova.exception): TRACE: (state, max_mem, mem, num_cpu, cpu_time) = virt_dom.info()
(nova.exception): TRACE: File "/usr/lib/python2.7/dist-packages/libvirt.py", line 1059, in info
(nova.exception): TRACE: if ret is None: raise libvirtError ('virDomainGetInfo() failed', dom=self)
(nova.exception): TRACE: libvirtError: internal error Unable to get cgroup for instance-00000008
(nova.rpc): TRACE: Traceback (most recent call last):
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/nova/rpc/impl_kombu.py", line 620, in _process_data
(nova.rpc): TRACE: rval = node_func(context=ctxt, **node_args)
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/nova/exception.py", line 129, in wrapped
(nova.rpc): TRACE: raise Error(str(e))
(nova.rpc): TRACE: Error: internal error Unable to get cgroup for instance-00000008
(nova.rpc): TRACE:
2011-09-27 16:06:23,220 INFO nova.virt.libvirt_conn [-] Instance instance-00000008 spawned successfully.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

I see - nova-compute checks too quickly whether the vm has powered on. It asks libvirt to look up the vm, but libvirt complains that it has not (yet) created the cgroups.

Not sure yet whether this should be changed at libvirt or at nova's end.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

With this debdiff applied, the bug no longer appears on my systems.

Changed in nova (Ubuntu Oneiric):
assignee: nobody → Chuck Short (zulcss)
Changed in libvirt (Ubuntu Oneiric):
assignee: Serge Hallyn (serge-hallyn) → nobody
Changed in nova (Ubuntu Oneiric):
status: Incomplete → In Progress
Changed in libvirt (Ubuntu Oneiric):
status: Confirmed → Invalid
Revision history for this message
Ubuntu Foundations Team Bug Bot (crichton) wrote :

The attachment "nova.debdiff" of this bug report has been identified as being a patch in the form of a debdiff. The ubuntu-sponsors team has been subscribed to the bug report so that they can review and hopefully sponsor the debdiff. In the event that this is in fact not a patch you can resolve this situation by removing the tag 'patch' from the bug report and editing the attachment so that it is not flagged as a patch. Additionally, if you are member of the ubuntu-sponsors please also unsubscribe the team from this bug report.

[This is an automated message performed by a Launchpad user owned by Brian Murray. Please contact him regarding any issues with the action taken in this bug report.]

tags: added: patch
Revision history for this message
Dave Walker (davewalker) wrote :

@hallyn, Can you propose your patch to upstream trunk please - then we'll look to cherry pick it back.

Thanks :)

Changed in nova (Ubuntu Oneiric):
assignee: Chuck Short (zulcss) → Serge Hallyn (serge-hallyn)
Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

With the debdiff above, I can still very rarely reproduce this - it's the same cause - nova is calling get_info while libvirt is still setting the cgroups. The nova-compute.log shows the unhandled exception now at _get_power_state in nova/compute/manager.py. I'm not sure why it is getting to that point, though, because _wait_for_boot at nova/virt/libvirt/connection.py should make it wait until the container is ready.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :
Download full text (3.7 KiB)

Here is the relevant excerpt from nova-compute.log. note that the first line (at /usr/lib/python2.7/dist-packages/nova/virt/libvirt/connection.py:604)is two lines before the timer call which should wait until the container is booted.

2011-09-27 21:20:43,830 DEBUG nova.virt.libvirt_conn [-] instance instance-0000003d: is running from (pid=2100) spawn /usr/lib/python2.7/dist-packages/nova/virt/libvirt/connection.py:604
2011-09-27 21:20:43,831 DEBUG nova.compute.manager [-] Checking state of instance-0000003d from (pid=2100) _get_power_state /usr/lib/python2.7/dist-packages/nova/compute/manager.py:188
2011-09-27 21:20:43,868 ERROR nova.exception [-] Uncaught exception
(nova.exception): TRACE: Traceback (most recent call last):
(nova.exception): TRACE: File "/usr/lib/python2.7/dist-packages/nova/exception.py", line 98, in wrapped
(nova.exception): TRACE: return f(*args, **kw)
(nova.exception): TRACE: File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 454, in run_instance
(nova.exception): TRACE: self._run_instance(context, instance_id, **kwargs)
(nova.exception): TRACE: File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 432, in _run_instance
(nova.exception): TRACE: current_power_state = self._get_power_state(context, instance)
(nova.exception): TRACE: File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 190, in _get_power_state
(nova.exception): TRACE: return self.driver.get_info(instance['name'])["state"]
(nova.exception): TRACE: File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/connection.py", line 1204, in get_info
(nova.exception): TRACE: (state, max_mem, mem, num_cpu, cpu_time) = virt_dom.info()
(nova.exception): TRACE: File "/usr/lib/python2.7/dist-packages/libvirt.py", line 1059, in info
(nova.exception): TRACE: if ret is None: raise libvirtError ('virDomainGetInfo() failed', dom=self)
(nova.exception): TRACE: libvirtError: internal error Unable to get cgroup for instance-0000003d
(nova.exception): TRACE:
2011-09-27 21:20:43,869 ERROR nova.rpc [-] Exception during message handling
(nova.rpc): TRACE: Traceback (most recent call last):
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/nova/rpc/impl_kombu.py", line 620, in _process_data
(nova.rpc): TRACE: rval = node_func(context=ctxt, **node_args)
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/nova/exception.py", line 129, in wrapped
(nova.rpc): TRACE: raise Error(str(e))
(nova.rpc): TRACE: Error: internal error Unable to get cgroup for instance-0000003d
(nova.rpc): TRACE:
2011-09-27 21:20:43,964 INFO nova.virt.libvirt_conn [-] Instance instance-0000003d spawned successfully.

Note also that it says the container spawned successfully. And while the libvirt container is in fact running, euca-describe-instances shows:

serge@stack:~$ euca-describe-instances
RESERVATION r-lj3xi0v8 test-project default
INSTANCE i-0000003d ami-00000003 11.0.0.5 11.0.0.5 pending mykey 3 m1.small 2011-09-28T02:20:25Z nova ami-00000000 ami-00000000
INSTANCE i-0000003c ami-00000003 11.0.0.4 ...

Read more...

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

In the libvirt logs, I see

21:20:43.846: 1205: debug : remoteSerializeError:132 : prog=536903814 ver=1 proc=16 type=1 serial=1004, msg=internal error Unable to get cgroup for instance-0000003d

followed a bit later by

21:20:43.856: 8627: debug : virCgroupMakeGroup:511 : Make group /libvirt/lxc/instance-0000003d
21:20:43.856: 8627: debug : virCgroupMakeGroup:526 : Make controller /sys/fs/cgroup/cpu/libvirt/lxc/instance-0000003d/
...
21:20:43.878: 8627: debug : virCgroupSetValueStr:304 : Set value '/sys/fs/cgroup/cpuacct/libvirt/lxc/instance-0000003d/tasks' to '8627'

etc.

The nova unhandled exception happened at 21:20:43,868 which seems too early.

Changed in libvirt (Ubuntu Oneiric):
status: Invalid → In Progress
Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Actually this libvirt debdiff may be a better solution. I'm going to send it to upstream libvirt (cc: openstack) for comment.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Sent the patch upstream. The tread can be followed here: https://www.redhat.com/archives/libvir-list/2011-September/msg01181.html

Dave Walker (davewalker)
Changed in libvirt (Ubuntu Oneiric):
assignee: nobody → Serge Hallyn (serge-hallyn)
tags: added: rls-mgr-o-tracking
Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

This is actually fixed upstream. I have a new independent patch (which can be found in the thread listed in comment #17) which fixes this, but it will be better if I can cherrypick the lxc patch from git.

Changed in nova (Ubuntu Oneiric):
status: In Progress → Invalid
Revision history for this message
Serge Hallyn (serge-hallyn) wrote :
Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

libvirt with this debdiff attached passes qa-regression-testing

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

@Chuck,

assigning this to you in the hopes that you can either push the debdiff or merge it with any changes you have remaining for nova.

Changed in libvirt (Ubuntu Oneiric):
assignee: Serge Hallyn (serge-hallyn) → Chuck Short (zulcss)
Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

@Chuck,

sorry! wrong bug. I'm aboug to assign 861656 to you however.

Changed in libvirt (Ubuntu Oneiric):
assignee: Chuck Short (zulcss) → Serge Hallyn (serge-hallyn)
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package libvirt - 0.9.2-4ubuntu15

---------------
libvirt (0.9.2-4ubuntu15) oneiric; urgency=low

  * Pull patches from upstream which prevent a race between lxc driver and
    controller while a container is started, easily exposed by nova.
    (LP: #842845)
 -- Serge Hallyn <email address hidden> Mon, 03 Oct 2011 14:39:05 -0500

Changed in libvirt (Ubuntu Oneiric):
status: In Progress → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.