autostart fails on boot time host when network devices not ready

Bug #495394 reported by Heiko Harders
This bug affects 16 people
Affects            Status         Importance   Assigned to    Milestone
libvirt (Ubuntu)   Fix Released   High         Serge Hallyn
Lucid              Fix Released   Undecided    Unassigned
Maverick           Fix Released   Undecided    Unassigned
Natty              Fix Released   Undecided    Unassigned

Bug Description

=====================================================
SRU justification:
1. Impact: servers which auto-start libvirt VMs can result in failed
   VM boots if network devices involved have not been brought up in
   time.
2. How bug is addressed: the libvirt-bin.conf upstart job now waits
   until networking.conf (ifup -a) is done, so that all network devices
   have been brought up.
3. Minimal patch:
--- libvirt-0.9.2/debian/libvirt-bin.upstart 2011-06-23 10:12:47.000000000 -0500
+++ libvirt-0.9.2/debian/libvirt-bin.upstart 2011-07-07 10:23:20.000000000 -0500
@@ -1,7 +1,7 @@
 description "libvirt daemon"
 author "Dustin Kirkland <email address hidden>"

-start on runlevel [2345]
+start on (runlevel [2345] and stopped networking RESULT=ok)
 stop on runlevel [!2345]

 expect daemon

4. TEST CASE:
   Add an entry to /etc/network/interfaces like:
auto lxcbr0
iface lxcbr0 inet static
  pre-up /opt/sleep
  bridge_ports none
  address 192.168.30.1
  netmask 255.255.255.0

where /opt/sleep is an executable script sleeping 1 minute. Then define
a libvirt VM with the following network:

    <interface type='bridge'>
      <source bridge='lxcbr0'/>
      <target dev='veth0'/>
    </interface>

If reproducing inside a kvm VM, you can define a minimal non-kvm qemu
vm booting from a debian businesscard iso cdrom, or use an lxc instance.
Make the VM autostart using

   virsh autostart vmname # or virsh -c lxc:// autostart vmname

On reboot, the VM will fail to start without this patch. With the patch,
for the first minute after reboot (while lxcbr0 is not yet configured) the
command

   status libvirt-bin

will show libvirt is not yet running. When it starts, your vm will be running
as seen by

   virsh list # or virsh -c lxc:// list

5. Regression potential: If a site has auto network interfaces which are
defined but always fail to start, then even though those interfaces may
not be needed at all, libvirt will fail to start until the definitions
are removed or fixed.
=====================================================

host OS:
lsb_release -rd:
1. Release of Ubuntu:
Description: Ubuntu 9.10
Release: 9.10
Linux 2.6.31-16-generic #53-Ubuntu SMP Tue Dec 8 04:02:15 UTC 2009 x86_64 GNU/Linux

2. Version of package:
apt-cache policy libvirt-bin
libvirt-bin:
  Installed: 0.7.0-1ubuntu13.1
  Candidate: 0.7.0-1ubuntu13.1
  Version table:
 *** 0.7.0-1ubuntu13.1 0
        500 http://nl.archive.ubuntu.com karmic-updates/main Packages
        100 /var/lib/dpkg/status
     0.7.0-1ubuntu13 0
        500 http://nl.archive.ubuntu.com karmic/main Packages

3. What I expected to happen:
Domains that are marked `autostart' should be running after the host was booted.

4. What happened instead:
- auto starting domains mostly fails when booting the host OS (Ubuntu 9.10)
- auto starting the same domains does work when using Ubuntu 8.04.3 LTS or Ubuntu 9.04 as host OS
- auto starting the same domains does work when invoking `/etc/init.d/libvirt-bin restart'

Libvirtd is running.
There are symlinks in /etc/libvirt/qemu/autostart.
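
For anyone comparing what should have started with what actually did, the autostart symlinks can be listed with a small shell helper. This is only a sketch: the function name list_autostart_domains is made up for illustration, and it assumes every autostart entry is a <name>.xml symlink in that directory.

```shell
#!/bin/sh
# Hypothetical helper: print the names of domains marked for autostart.
# The default directory is the one mentioned above; pass another to override.
list_autostart_domains() {
    dir="${1:-/etc/libvirt/qemu/autostart}"
    for link in "$dir"/*.xml; do
        # Skip the unexpanded glob when the directory is empty or missing,
        # but keep dangling symlinks so broken entries are still reported.
        [ -e "$link" ] || [ -L "$link" ] || continue
        basename "$link" .xml
    done
}
```

Comparing its output with `virsh list' right after boot shows which autostart domains were skipped.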

Most of the time none of my domains are running, although occasionally one does boot (roughly 1 in 10 attempts during the host boot process). When I run `/etc/init.d/libvirt-bin restart' after the host has booted, all of my domains come up as expected. Autostart works for all domains when using Ubuntu 8.04.3 LTS or Ubuntu 9.04 as host OS.

I'm using Ubuntu 9.04 and 9.10 guest OSes. Some of them were created under Ubuntu 9.04 and some under Ubuntu 9.10. Most of the domains are installed on LVM, but I also tried creating a file-based virtual machine located on the boot device of the host OS. It makes no difference: all of these domains boot only very sporadically while the host OS is booting.

All domains are using a bridge device that I specified myself, and using static IP addresses. I removed the default network created by libvirt, because I don't use it (however: before I deleted that, autostart didn't work either). The bridge device works properly, I can log in my virtual machines via ssh and I use the bridge as well to talk to the internal network.

I tried setting the debug logging level in `/etc/libvirt/libvirt.conf' to 1, but I don't see anything in the files in `/var/log' that explains why my domains are not auto-starting during boot time of the host OS (or at least, nothing that I recognize).

If there is anything else I could try, or any other information I should provide, please let me know.

Heiko Harders (heiko-harders) wrote :

Just rebooted the host, which started checking the file system. Thereafter all domains seemed to be up. Unsure whether this was coincidence (can't remember seeing all domains up after a reboot before), or whether the extra boot time somehow caused the domains to come up as expected.

Heiko Harders (heiko-harders) wrote :

I've been able to start up all domains consistently on each boot of the host OS by changing the parameters of the partition the host OS is installed on. I've forced a check of this filesystem on each system boot, and all domains are running consistently after the host is booted. The filesystem check only takes a couple of seconds. I still don't know whether the extra delay during boot simply gives libvirt the time it needs to get the domains up, or whether something else is going on.

Quetschke (tobias-quetschke) wrote :

I am experiencing the same issue: domains created and managed via libvirt/virsh do not autostart although they are marked as 'autostarting'. The domain runs normally when started manually, and the symlink in /etc/libvirt has been created successfully.

Holger Mauermann (mauermann) wrote :

This may be related to upstart and bridge_utils (bug #498245). Try setting bridge_maxwait=0 in /etc/network/interfaces and see if this fixes the problem.

jeffbl (jeff-mulb) wrote :

I have bridge_maxwait=0 set for both my bridges, and neither VM I have set to autostart does so. Also on 9.10 64 bit host. This used to work, then stopped, even for new virtual machines I create.

Heiko Harders (heiko-harders) wrote :

I tried setting bridge_maxwait=0 and have only booted twice since then to see what happened. On both occasions some of the VMs with autostart booted, but not all of them (first time 2/5, second time 4/5). So at best this helped a bit, but it is not a solution to the problem.

The filesystem check, that only takes a couple of seconds, is still a good solution. When I setup my system so that it checks the filesystem (on which my host OS is installed, not the filesystem on which the VM's are installed), all VM's start consistently.

Heiko Harders (heiko-harders) wrote :

Is there anything we can do to help somebody looking into this? I'm happy to provide more information if necessary. Should we look into other related packages that might cause the problem and file bug reports for those? For me this bug is pretty much a show stopper, autostarting domains is something I really need working.

Chuck Short (zulcss)
Changed in libvirt (Ubuntu):
importance: Undecided → Medium
status: New → Confirmed
Stuart Young (cef) wrote :

I too have come across this issue.

In my case, it appears to be related to br0 (my bridge) not becoming live in time.

As a temporary workaround, I found that simply adding a 'sleep 4' in the init script (just before it runs libvirtd) alleviates the issue. This gives the bridge enough time to become active. It is a total hack until some way of having libvirtd wait until the bridge is up gets implemented.

Attached are two snippets of /var/log/messages:

1. msgs1 - Standard behaviour. First VM tries to spawn, hanging for 30 secs and then gets destroyed. Subsequent VM's start successfully as the bridge is now active.

2. msgs2 - Behaviour if I add 'sleep 4' just before running libvirtd in /etc/init.d/libvirt-bin
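
Instead of a fixed sleep, the init script could poll for the bridge with a timeout. A minimal sketch, assuming that the bridge showing up under /sys/class/net is a good enough signal; it does not prove that the bridge is fully configured. The name wait_for_iface and its parameters are hypothetical and not part of libvirt or the packaged init script.

```shell
#!/bin/sh
# Hypothetical helper: wait until a network interface exists.
# Usage: wait_for_iface IFACE [TIMEOUT_SECONDS] [SYSFS_NET_DIR]
# Returns 0 as soon as the interface appears, 1 once the timeout is reached.
wait_for_iface() {
    iface="$1"
    timeout="${2:-30}"
    sysfs="${3:-/sys/class/net}"
    waited=0
    while [ ! -e "$sysfs/$iface" ]; do
        # Give up once we have waited the full timeout.
        [ "$waited" -ge "$timeout" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

Calling `wait_for_iface br0 30' just before libvirtd is launched would take the place of the 'sleep 4', and would return early when the bridge is already there.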

Stuart Young (cef) wrote :

Second file attached.

Stuart Young (cef) wrote :

Upgraded the libvirt host to lucid (libvirt-bin 0.7.5-5ubuntu7). Since doing so, I have not been able to replicate the issue with guests not starting at boot due to the bridge coming up late (after some/all of the guests have started). Will continue to do more testing.

Notes:
1. eth0 appears to be coming up after the guests are running, but the guests seem to be fine with this and not crash out like they did under karmic.
2. Guests are still running karmic atm (if that's at all relevant?)

Stuart Young (cef) wrote :

Upgraded one of the guests to lucid with no issues. All guests still start fine.

As this is a test machine, I can always create more VM's, if that seems like a good idea?

Suggestions for more tests to perform welcome.

Dustin Kirkland  (kirkland) wrote :

I converted libvirt to an upstart init script in Lucid, and I expect that this should be fixed.

I believe that comment #11 confirms this, so I'm marking this fix-released.

Please reopen if you can reproduce this in Lucid.

Changed in libvirt (Ubuntu):
status: Confirmed → Fix Released
Jan van Oorschot (janvanoorschot) wrote :

I have reproduced this bug on a fresh Lucid install: libvirtd (running on Lucid) with one virtual domain (also Lucid), on hardware with three virtual bridges (br0, br1 and br2, each connected to a physical ethernet interface: eth0, eth1 and eth2).

The temporary fix mentioned by Stuart in #8 fixes the problem for me. Since I am running Lucid, I had to edit /etc/init/libvirt-bin.conf instead:

pre-start script
        mkdir -p /var/run/libvirt
        # Clean up a pidfile that might be left around
        rm -f /var/run/libvirtd.pid
        echo "libvirtd sleep start" >> /tmp/libvirtd.txt
        sleep 40
        echo "libvirtd sleep end" >> /tmp/libvirtd.txt
end script

4 seconds was too fast; 40 seconds seems to work for me (since this machine is going to boot about once a year, this is fine by me).

So this race-condition is still present in lucid (IMHO)

Regards, Jan

ossjunkie (ossjunkie) wrote :

Yes, it is still present on Lucid server.

After unsuccessfully trying something like:

start on (runlevel [2345] and networking)

in the upstart script, I also resorted to a dirty sleep in the pre-start script. Upstart experts needed!

Changed in libvirt (Ubuntu):
status: Fix Released → Confirmed
Mika Båtsman (mika-batsman) wrote :

I'm not an expert in upstart but solved the problem by creating an upstart task that checks whether all automatically started bridge interfaces are up and made libvirt-bin depend on it.

Patch attached.

tags: added: patch
Andreas Ntaflos (daff) wrote :

Mika, thank you for the patch and new upstart job. We are trying it out here and find that the bridged-network job seems to be waiting forever for a "net-device-up" signal to be emitted, thus keeping libvirt-bin from starting.

I only now have begun reading up on upstart but so far I can't find any obvious flaws in your job definition. Waiting on the "net-device-up" signal which is emitted every time a network interface comes up (/etc/network/if-up.d/upstart) and then checking the status of the bridges seems the correct way but it doesn't work for us.

Did you take any additional steps in order to get this upstart job to work correctly?

Mika Båtsman (mika-batsman) wrote :

I made the original patch on karmic. For some reason it stopped working after upgrading to lucid.

I modified it a bit but forgot to post the changes. Here's an updated patch which I've been using successfully on couple of Lucid machines for over a month.

Hope it helps to solve the problem.

Andreas Ntaflos (daff) wrote :

Thanks for the new patch! I tested it just now and it seems to work, but it's always hard to tell when dealing with race conditions. I'll keep testing.

Out of interest, the only change (apart from the more verbose way of testing the $interface variable) is the added "break" statement, right? I am not familiar enough with the Upstart boot process but why does this change make the job work correctly? I also have not found any info on what exactly ifquery does. Does it just read /etc/network/interfaces and extract the interface names?
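
To make the ifquery question concrete: assuming `ifquery --list --allow auto' amounts to reporting the interface names found on `auto' lines, a rough stand-in can be written with awk. This is a sketch for illustration only. The function name list_auto_ifaces is hypothetical, it ignores any `source' includes, and it is not taken from ifupdown or from Mika's patch.

```shell
#!/bin/sh
# Hypothetical stand-in: print every interface named on an "auto" or
# "allow-auto" line of an interfaces(5) file, one name per line.
list_auto_ifaces() {
    file="${1:-/etc/network/interfaces}"
    awk '$1 == "auto" || $1 == "allow-auto" {
        for (i = 2; i <= NF; i++) print $i
    }' "$file"
}
```

Running it next to the real ifquery on the same machine is a quick way to check whether the assumption holds for a given configuration.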

Anyway, thanks again!

Seb James (sebjames) wrote :

I made use of Mika's patch on a 10.04 system and found it solved my problem.

I had been confused and thought that there was something wrong with my/the apparmor config for libvirt, because there were a few apparmor audit messages in the log, but in fact, it was this issue with the bridge interfaces not coming up in time.

Roland Moriz (rmoriz) wrote :

problem still exists in 10.10

Roland Moriz (rmoriz) wrote :

Mika's patch seems not to work with alias interfaces like "eth0:1"

status bridge-network-interface INTERFACE='eth0:1' 2>/dev/null | awk '{print $3}'
=> ""

ifconfig eth0:1
eth0:1 Link encap:Ethernet HWaddr xxxxxxxxxx
          inet addr:xxxxxxxxx Bcast:xxxxxxx Mask:255.255.255.224
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          Interrupt:45 Base address:0xa000
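
One conceivable way for such a job to cope with alias names is to strip the alias label and query the underlying device instead. The sketch below only shows the string handling. The name base_iface is hypothetical, and whether querying the base device gives the right answer for aliases has not been verified here.

```shell
#!/bin/sh
# Hypothetical helper: map an alias such as "eth0:1" to its base device
# "eth0". Names without a colon are returned unchanged.
base_iface() {
    printf '%s\n' "${1%%:*}"
}
```

With this, a status check would be issued for `$(base_iface eth0:1)', that is eth0, rather than for eth0:1.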

Heiko Harders (heiko-harders) wrote :

I can reproduce this on a fresh Ubuntu 11.04 using two 11.04 virtual machines (I had better luck with my previous 10.10 install that did work properly for me).

The patch provided by Mika does not seem to work for me, libvirt does not seem to be started properly with it (my domains are not shown in virsh with a `list --all' for example). My boot.log shows two lines with `Stopping Check if bridged network is up. OK' though.

Serge Hallyn (serge-hallyn) wrote :

I like the idea of Mika's patch. Besides extending it to not break on
eth0:1, should it (and can it) be extended to also support slow network
links which are not bridges?

I'm assigning this temporarily to SpamapS to get his opinion, and give
him a chance to nack the patch if he thinks an imminent new upstart
feature will fix this. If such a feature is not imminent, then I'll
push the fix for o-series and SRU it back.

Changed in libvirt (Ubuntu):
importance: Medium → High
assignee: nobody → Clint Byrum (clint-fewbar)
Mika Båtsman (mika-batsman) wrote :

Part of this problem is that there are 2 competing mechanisms for bringing the bridges up:

1) bridge-network-interface.conf from bridge-utils
2) network-interface.conf from ifupdown

These two seem to race over which one gets to handle the interface. It seems that most of the time network-interface.conf is faster and emits net-device-up. bridged-network.conf (from the patch) relies on net-device-up events. Because of bridge-network-interface.conf, bridged-network.conf didn't seem to get all the net-device-up events, which caused occasional failures. At least for me that happened very seldom, which made the problem more difficult to solve.

I've disabled bridge-network-interface.conf and haven't had any problems with networking and libvirt for a few months now.

The attachment has my current solution for 10.04. I haven't tested it with 10.10 or 11.04. It now checks the status of network-interface instead of bridge-network-interface, so it's a bit more generic and could help with a wider range of network configurations, not just with bridges. I have also renamed bridged-network.conf to network-up.conf. If you have tested the previous patch and are going to test this one, make sure you don't have both bridged-network.conf and network-up.conf.

I don't have any setups using alias interfaces so I don't know if this helps with those, but at least it's not that bridge specific any more.

Heiko Harders (heiko-harders) wrote :

It seems there is another problem with my configuration that causes libvirt to have problems with autostarting virtual machines. I am using software RAID 1 with mdadm and my virtual machines are running on an LVM2 partition on md0. It seems the LVM2 volume is not yet available at the point where libvirt tries to start my virtual machines:

libvirtd: 20:42:16.345: 1389: error : qemuAutostartDomain:275 : Failed to autostart VM 'ns': unable to set user and group to '114:125' on '/dev/mapper/storage-st0': No such file or directory
libvirtd: 20:42:16.346: 1389: error : virSecurityDACSetOwnership:125 : unable to set user and group to '114:125' on '/dev/mapper/storage-st1': No such file or directory

So it seems at this point it is not the bridge that is causing problems, but it is mdadm in combination with LVM2. According to my /var/log/boot.log the mdadm monitoring daemon is started after libvirt. But I'm not sure if the monitoring application has anything to do with it.
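
To see at a glance which devices were missing at autostart time, the paths can be pulled out of messages like the ones above. A small sketch, assuming each message sits on a single line and ends in the "No such file or directory" wording shown; the function name missing_paths is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read libvirtd log lines on stdin and print each
# distinct path that was reported as "No such file or directory".
missing_paths() {
    sed -n "s/.* on '\([^']*\)': No such file or directory.*/\1/p" | sort -u
}
```

Feeding it the relevant log file, for example with `missing_paths < logfile', lists the storage paths that were not yet present.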

Heiko Harders (heiko-harders) wrote :

I changed my upstart script to ensure both the bridge and the md0 device (on which the LVM volume is located) are started before libvirt is started. In my situation this makes sure all my virtual machines can be started. However, different virtual machines can have different dependencies on (possibly slow) hardware being available or not. Perhaps it is a good idea to create separate upstart scripts for each virtual machine? This way it could be ensured that the hardware a specific virtual machine is relying on is brought up.

I fixed my problems with the following `start on' line in /etc/init/libvirt-bin.conf:

start on runlevel [2345] and net-device-added INTERFACE="br0" and block-device-added DEVNAME="/dev/md0"

br0 is the bridge I am using
md0 is the raid volume on which the LVM2 volumes are located, it seems (although I'm not 100% sure) that the block-device-added event is always fired after all LVM volumes on the block device are up

Clint Byrum (clint-fewbar) wrote : Re: [Bug 495394] Re: autostart almost always fails on boot time host

Excerpts from Heiko Harders's message of Tue May 24 18:01:58 UTC 2011:
> I changed my upstart script to ensure both the bridge and the md0 device
> (on which the LVM volume is located) are started before libvirt is
> started. In my situation this makes sure all my virtual machines can be
> started. However, different virtual machines can have different
> dependencies on (possibly slow) hardware being available or not. Perhaps
> it is a good idea to create separate upstart scripts for each virtual
> machine? This way it could be ensured that the hardware a specific
> virtual machine is relying on is brought up.

Another option is to make the start on more broad (start on
net-device-added or block-device-added) and then, in the pre-start or
the daemon's own code, check for the hardware and gracefully refuse to
start if it's not available yet.

This can get racy without instance specifiers, though: if
block-device-added happens between the pre-start deciding that the
block device it needed was not there and the pre-start exiting,
upstart will just consider its job done (it's already in a goal of
start, so upstart won't change it).
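
The pre-start check described here could look roughly like the following. It is only a sketch: the name deps_ready is hypothetical, the example paths are the ones from Heiko's setup, and it does nothing about the race just described.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if every path given as an argument exists.
deps_ready() {
    for path in "$@"; do
        [ -e "$path" ] || return 1
    done
    return 0
}

# Possible use inside a pre-start script (not executed here):
#   deps_ready /sys/class/net/br0 /dev/md0 || { stop; exit 0; }
```

The `stop; exit 0' pair makes the job bow out quietly, so that a later net-device-added or block-device-added event can trigger another attempt.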

Clint Byrum (clint-fewbar) wrote :

Excerpts from Heiko Harders's message of Tue May 24 18:01:58 UTC 2011:
> I changed my upstart script to ensure both the bridge and the md0 device
> (on which the LVM volume is located) are started before libvirt is
> started. In my situation this makes sure all my virtual machines can be
> started. However, different virtual machines can have different
> dependencies on (possibly slow) hardware being available or not. Perhaps
> it is a good idea to create separate upstart scripts for each virtual
> machine? This way it could be ensured that the hardware a specific
> virtual machine is relying on is brought up.
>
> I fixed my problems with the following `start on' line in
> /etc/init/libvirt-bin.conf:
>
> start on runlevel [2345] and net-device-added INTERFACE="br0" and block-device-added DEVNAME="/dev/md0"
>
> br0 is the bridge I am using
> md0 is the raid volume on which the LVM2 volumes are located, it seems (although I'm not 100% sure) that the block-device-added event is always fired after all LVM volumes on the block device are up
>

Interesting finding, though I think it's a fairly dangerous assumption. It
may happen that way simply because the block-device-added event isn't
emitted by udev until the kernel has scanned partitions, hence finding
the LVM and enabling it. Probably something that needs build-time and
maybe even an automated test created if it turns into a generic solution
of any kind.

m m (bk-praca) wrote : Re: autostart almost always fails on boot time host

Could you please look at this thread:
http://<email address hidden>/msg01444.html ?

Do you have all IP tables related modules loaded when libvirt-bin is started? Do you know what can be done to ensure they are loaded before libvirt-bin is started?

Changed in libvirt (Ubuntu):
assignee: Clint Byrum (clint-fewbar) → Serge Hallyn (serge-hallyn)
Serge Hallyn (serge-hallyn) wrote :

@Mika,

thanks for posting your patches along the way. Regarding your third version (in comment #24), is it just the net-device-up for the bridge which you're not seeing? Have you tried leaving bridge-network-interface.conf enabled, and, at the bottom of the loop in its pre-start script, doing

   ifup $i

(to force /etc/network/if-up.d/upstart to do the initctl emit for us). So the script would look like:

pre-start script
        . /lib/bridge-utils/bridge-utils.sh

        mkdir -p /var/run/network
        for i in $(ifquery --list --allow auto); do
                ports=$(ifquery $i | sed -n -e's/^bridge_ports: //p')
                for port in $(bridge_parse_ports $ports); do
                        case $port in
                                $INTERFACE|$INTERFACE.*)
                                        ifup --allow auto $i
                                        brctl addif $i $port && ifconfig $port 0.0.0.0 up
                                        break
                                        ;;
                        esac
                done
                ifup $i
        done
end script

Clint, does that look reasonable to you?

Serge Hallyn (serge-hallyn) wrote :

In oneiric the bridge-network-interface.conf is in fact gone. I've got a setup that reproduces this (using an lxc container connected to a bridge, br3, which is brought up by an upstart job which first sleeps two minutes).

I'll get a version of the libvirt-networking-up.conf that works for me and post a debdiff.

Serge Hallyn (serge-hallyn) wrote :

Following jhunt's terrific suggestion, I changed the start on for
libvirt-bin to:

start on (runlevel [2345] and stopped networking RESULT=ok)

which is working perfectly.

Changed in libvirt (Ubuntu):
status: Confirmed → In Progress
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package libvirt - 0.9.2-4ubuntu3

---------------
libvirt (0.9.2-4ubuntu3) oneiric; urgency=low

  * Fix /etc/init/libvirt-bin.conf start on to wait until networking.conf
    has stopped with success, meaning ifup -a completed successfully and
    all auto-started network devices are up. (LP: #495394)
 -- Serge Hallyn <email address hidden> Thu, 07 Jul 2011 10:23:25 -0500

Changed in libvirt (Ubuntu):
status: In Progress → Fix Released
description: updated
Chris Halse Rogers (raof) wrote :

Before accepting this upload I'd like to check that you don't want to fold the missing fixes for bug #697046 into this upload.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Thanks, Chris. At this point i prefer to finish up with this one and push 697046 separately next. If you prefer that I combine them, please let me know and I'll go happily do it.

Martin Pitt (pitti) wrote : Please test proposed package

Hello Heiko, or anyone else affected,

Accepted libvirt into natty-proposed, the package will build now and be available in a few hours. Please test and give feedback here. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you in advance!

Changed in libvirt (Ubuntu Natty):
status: New → Fix Committed
tags: added: verification-needed
Changed in libvirt (Ubuntu Maverick):
status: New → Fix Committed
Martin Pitt (pitti) wrote :

Hello Heiko, or anyone else affected,

Accepted libvirt into maverick-proposed, the package will build now and be available in a few hours. Please test and give feedback here. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you in advance!

Changed in libvirt (Ubuntu Lucid):
status: New → Fix Committed
Martin Pitt (pitti) wrote :

Hello Heiko, or anyone else affected,

Accepted libvirt into lucid-proposed, the package will build now and be available in a few hours. Please test and give feedback here. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you in advance!

Franck78 (fbourdonnec) wrote : Re: autostart almost always fails on boot time host

Hello,

Using Ubuntu server 11.04.

The fix "start on (runlevel [2345] and stopped networking RESULT=ok)" does nothing for me.
I have the patch applied:
ii libvirt-bin 0.8.8-1ubuntu6.3
ii libvirt0 0.8.8-1ubuntu6.3

BUT I found on my installation that when

1/ the default virbr0 in /etc/libvirt/qemu/networks
is activated, VMs startup is done.

2/ the default virbr0 is disabled (by removing autostart symlink),
VMs never start.

The VMs don't use virbr0 but my defined br0 and br1.

Franck

my /etc/network/interfaces FYI
auto lo br0 br1
# The loopback network interface
iface lo inet loopback

# The real primary network interfaces
iface eth0 inet manual
iface eth1 inet manual

iface br0 inet static
     bridge_ports eth0
     bridge_stp off
    address 10.0.0.200
    netmask 255.255.0.0
    gateway 10.0.0.100

iface br1 inet static
     bridge_ports eth1
     bridge_stp off
     address 10.2.0.200
     netmask 255.255.0.0
     broadcast 10.2.255.255

Rolf Leggewie (r0lf)
tags: added: verification-failed
removed: verification-needed
Serge Hallyn (serge-hallyn) wrote :

@Franck78,

Are you able to tell why the containers failed to start? Can you post the xml contents for the containers which fail to start (result of 'virsh dumpxml <containername>'), and do a

 apport-collect 495394

?

Dave Walker (davewalker) wrote :

bouncing back to verification-needed as it's not clear it has -failed. Thanks.

tags: added: verification-needed
removed: verification-failed
Franck78 (fbourdonnec) wrote :

@Serge,

here is one domain; the other is a duplicate. It is a test machine.
I think apport-collect is unhappy...
see the log after this dumpxml

Maybe the disk subsystem also needs to be ready?
I have two 2TB drives, mirrored + lvm2, on a not-so-slow board
(Phenom II with 6 cores on a Gigabyte GA-890FXA-UD7)

As it is a test server, you can ssh into it if you want. Just tell me.

<domain type='kvm'>
  <name>ipcop</name>
  <uuid>fe2d60ab-4dc8-677e-9876-6e848380dbf3</uuid>
  <description>Un IPcop de test</description>
  <memory>524288</memory>
  <currentMemory>524288</currentMemory>
  <vcpu>1</vcpu>
  <os>
    <type arch='x86_64' machine='pc-0.14'>hvm</type>
    <boot dev='hd'/>
    <bootmenu enable='no'/>
  </os>
  <features>
    <pae/>
  </features>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/bin/kvm</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw'/>
      <source file='/home/fbourdonnec/vm/ipcop/ipcop.raw'/>
      <target dev='hda' bus='ide'/>
      <address type='drive' controller='0' bus='0' unit='0'/>
    </disk>
    <controller type='ide' index='0'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
    </controller>
    <interface type='bridge'>
      <mac address='52:54:00:d3:d8:1a'/>
      <source bridge='br0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>
    <interface type='bridge'>
      <mac address='52:54:00:a3:c1:dd'/>
      <source bridge='br1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </interface>
    <serial type='pty'>
      <target port='0'/>
    </serial>
    <console type='pty'>
      <target type='serial' port='0'/>
    </console>
    <input type='mouse' bus='ps2'/>
    <graphics type='vnc' port='-1' autoport='yes' keymap='fr'/>
    <video>
      <model type='cirrus' vram='9216' heads='1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </video>
    <memballoon model='virtio'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </memballoon>
  </devices>
</domain>

<domain type='kvm'>
  <name>ipcop</name>
  <uuid>fe2d60ab-4dc8-677e-9876-6e848380dbf3</uuid>
  <description>Un IPcop de test
login root:test
green 10.0.0.50 admin:test
</description>
  <memory>524288</memory>
  <currentMemory>524288</currentMemory>
  <vcpu>1</vcpu>
  <os>
    <type arch='x86_64' machine='pc-0.14'>hvm</type>
    <boot dev='hd'/>
    <bootmenu enable='no'/>
  </os>
  <features>
    <pae/>
  </features>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/bin/kvm</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw'/>
      <source file='/home/fbourdonnec/vm/ipcop/ipcop.raw'/>
      <target dev='hda' bus='ide'/>
      <address type='drive' controller='0' bus='0' unit='0'/>
    </disk>
    <controller type='ide' index='0'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' func...

Serge Hallyn (serge-hallyn) wrote :

@Franck78,

Thanks for the info.

Yes, slow storage is not addressed by this and also needs to be fixed. For that one, I want to discuss with upstream, as we'll probably want to use a generic tool to enumerate all the storage needed by autostart domains. Considering the many complicated possibilities, it seems a daunting task.

I suppose a simple, non-intelligent way to go about it would be to grab all 'source file=' and 'source dev=' lines inside a <disk > .. </disk> stanza, and, in a /etc/init/libvirt-storage-waiter.conf, which is

  start on mounted and not started libvirt-bin

do something like

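pre-start script
   # libvirt-bin is already starting or running: nothing to wait for
   if status libvirt-bin | grep -q start; then
      stop
      exit 0
   fi
   # give up for now if any file needed by an autostart domain is unreadable
   for f in `/sbin/enumerate-libvirt-autostart-files`; do
      if [ ! -r "$f" ]; then
         stop
         exit 0
      fi
   done
   initctl emit -n libvirt-storage-ready
end script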

and have /etc/init/libvirt-bin.conf
   start on (runlevel [2345] and stopped networking RESULT=ok and libvirt-storage-ready)
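
The "grab all 'source file=' and 'source dev=' lines inside a <disk> stanza" idea above could look roughly like the following. This is only a sketch, not the debdiff mentioned later in the thread: the function name is hypothetical, the default directory /etc/libvirt/qemu/autostart is an assumption about where autostart definitions are linked, and it assumes the single-quoted, one-element-per-line XML layout seen in the domain dumps in this report.

```shell
#!/bin/sh
# Hypothetical stand-in for /sbin/enumerate-libvirt-autostart-files.
# Prints each <source file='...'> or <source dev='...'> path that appears
# inside a <disk> ... </disk> stanza of the domain XML files in a directory.
# Assumption: attributes are single-quoted and each element is on its own line.
enumerate_autostart_files() {
    dir=${1:-/etc/libvirt/qemu/autostart}   # assumed default location
    for xml in "$dir"/*.xml; do
        # Skip unreadable files (and the unexpanded glob when dir is empty)
        [ -r "$xml" ] || continue
        sed -n \
            -e "/<disk[ >]/,/<\/disk>/{" \
            -e "s/.*<source file='\([^']*\)'.*/\1/p" \
            -e "s/.*<source dev='\([^']*\)'.*/\1/p" \
            -e "}" "$xml"
    done
}
```

Restricting the match to the <disk> range keeps bridge sources such as <source bridge='lxcbr0'/> out of the list. A real tool would more likely use an XML parser than sed.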

Revision history for this message
Franck78 (fbourdonnec) wrote :

Hello,
The disk subsystem may or may not be involved on my system. It is difficult to say, and it has no relation to 'virbr0' (when virbr0 is defined, everything is OK, at least with two VMs booting).
I have compiled 0.9.2 and 0.9.3 for my 11.04 system. I will try again with the updated libvirt-bin (an incredible number of bugs are fixed in every release!).
Btw, 'upstart' is strange: to me, "and stopped networking RESULT=ok" means "when networking has been properly shut down"... I need to find the upstart howto ;-)

Franck

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

For posterity, here is a debdiff doing sort of what I was thinking for slow storage

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

@Franck78

no doubt, counterintuitive :) it means that the upstart job has finished, though.

Thanks for the info. If you find a new bug responsible, please feel free to open a new bug.

Revision history for this message
Clint Byrum (clint-fewbar) wrote : Re: [Bug 495394] Re: autostart almost always fails on boot time host

Excerpts from Serge Hallyn's message of Tue Jul 19 19:47:40 UTC 2011:
> @Franck78
>
> no doubt, counterintuitive :) it means that the upstart job has
> finished, though.
>
> Thanks for the info. If you find a new bug responsible, please feel
> free to open a new bug.
>

That is indeed quite counter-intuitive.. a result of naming a
task something that sounds more like a state. It should be named
'configure-static-network'

The 'stopped networking' bit should go away in oneiric. I'm polishing
off the last bits of a new event, 'static-network-up', which means
all interfaces marked 'auto' in /etc/network/interfaces are "up". This
should allow things that can't handle transient network interfaces to
at least try to start at the right time, which is, for the most part,
what 'start on stopped networking' is an attempt to do.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote : Re: autostart almost always fails on boot time host

The lucid-proposed package just passed my test case. As I wrote the fix I don't know if I can verify.

description: updated
Revision history for this message
Franck78 (fbourdonnec) wrote :

Hello *,
@Serge,

I'm now pretty sure where the problem is: it is disk related.
The patch that waits for 'disk' is not sufficient.

I have replaced libvirt 0.8.8 with a locally compiled 0.9.3.
Nothing changes: only one VM (of the two set to autostart) starts.

Short libvirt log:

20:38:48.728: 1226: info : libvirt version: 0.9.3
20:38:48.728: 1226: error : qemuMonitorIORead:487 : Unable to read from monitor: Connection reset by peer
20:38:48.736: 1234: error : qemuMonitorTextGetPtyPaths:1960 : operation failed: failed to retrieve chardev info in qemu with 'info chardev'
20:38:49.751: 1234: error : qemuAutostartDomain:156 : Failed to autostart VM 'ipcop-clone': operation failed: failed to retrieve chardev info in qemu with 'info chardev'
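
To list which domains failed, the names can be pulled out of "Failed to autostart VM" lines like the one above. A minimal sketch, assuming the log format shown here; the function name is made up for illustration, and the log's location depends on how libvirtd logging is configured.

```shell
#!/bin/sh
# Print the unique names of VMs that libvirt reported as failing to
# autostart. Reads a libvirtd log from stdin or from file arguments.
# Assumes lines of the form: ... Failed to autostart VM 'NAME': ...
failed_autostarts() {
    sed -n "s/.*Failed to autostart VM '\([^']*\)'.*/\1/p" "$@" | sort -u
}
```

Usage would be something like `failed_autostarts < /path/to/libvirtd.log`, where the path is whatever file the daemon's log output is directed to.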

Every VM is tried. Some fail ;-)

The complete debug log shows what is wrong.

Libvirt tries to open some socket/file under "/var/lib/libvirt/......

AND /var is not ready (not mounted, as it resides on LVM).

See the full log in the attachment:

#L1 starts ipcop-clone
#L104 failure declared for this VM
#L876 starts ipcop
#L1005 successfully opens the monitor channel and goes on

Can you fix your patch for this?
It also needs to wait for some utility directories such as /var/run and /var/lib!

Franck

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

@Franck78

libvirt won't start until mounted-varrun has happened, so what you are
describing *should* not be happening. Can you open a new bug, preferably
using 'apport-bug libvirt-bin'? Please include your /etc/fstab and a
description of your storage setup (lvm config etc).

Revision history for this message
Dave Walker (davewalker) wrote :

Verified that this bug is resolved on Lucid with the proposed package, with no obvious regressions (basic functionality works as expected)

Revision history for this message
Dave Walker (davewalker) wrote :

Verified the proposed package for Maverick, as above. Thanks.

Revision history for this message
Clint Byrum (clint-fewbar) wrote :

Still need verification on natty. Thanks!

tags: added: verification-done
removed: verification-needed
tags: added: verification-done-lucid verification-done-maverick verification-needed
removed: verification-done
Revision history for this message
Dave Walker (davewalker) wrote :

Verification for Natty now complete. Thanks.

tags: added: verification-done verification-done-natty
removed: verification-needed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package libvirt - 0.7.5-5ubuntu27.14

---------------
libvirt (0.7.5-5ubuntu27.14) lucid-proposed; urgency=low

  * Fix /etc/init/libvirt-bin.conf start on to wait until networking.conf
    has stopped with success, meaning ifup -a completed successfully and
    all auto-started network devices are up. (LP: #495394)
 -- Serge Hallyn <email address hidden> Thu, 07 Jul 2011 16:41:04 -0500

Changed in libvirt (Ubuntu Lucid):
status: Fix Committed → Fix Released
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package libvirt - 0.8.3-1ubuntu19

---------------
libvirt (0.8.3-1ubuntu19) maverick-proposed; urgency=low

  * Fix /etc/init/libvirt-bin.conf start on to wait until networking.conf
    has stopped with success, meaning ifup -a completed successfully and
    all auto-started network devices are up. (LP: #495394)
 -- Serge Hallyn <email address hidden> Thu, 07 Jul 2011 16:48:36 -0500

Changed in libvirt (Ubuntu Maverick):
status: Fix Committed → Fix Released
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package libvirt - 0.8.8-1ubuntu6.3

---------------
libvirt (0.8.8-1ubuntu6.3) natty-proposed; urgency=low

  * Fix /etc/init/libvirt-bin.conf start on to wait until networking.conf
    has stopped with success, meaning ifup -a completed successfully and
    all auto-started network devices are up. (LP: #495394)
 -- Serge Hallyn <email address hidden> Thu, 07 Jul 2011 16:54:35 -0500

Changed in libvirt (Ubuntu Natty):
status: Fix Committed → Fix Released
Revision history for this message
Franck78 (fbourdonnec) wrote :

well,
if it is not "/var/something"
there is also
"/dev/pts"

Something is wrong around the "qemuMonitor" routines.

find why qemuMonitorIORead:487 is triggered and the bug is closed !

I really don't understand the logic of 'upstart' well enough to trace into it, sorry.
Opening a new bug on the same subject? Why?

Franck

****** when it is OK ********
20:57:10.438: 1234: debug : qemuMonitorTextCommandWithHandler:237 : Send command 'info chardev' for write with FD -1
....
20:57:10.439: 1234: debug : qemuMonitorTextCommandWithHandler:242 : Receive command reply ret=0 rxLength=109 rxBuffer='charmonitor: filename=unix:/var/lib/libvirt/qemu/ipcop.monitor,server
charserial0: filename=pty:/dev/pts/1'

******** when it is NOT ok **********
20:57:08.555: 1234: debug : qemuMonitorTextCommandWithHandler:237 : Send command 'info chardev' for write with FD -1
....
20:57:08.648: 1228: error : qemuMonitorIORead:487 : Unable to read from monitor: Connection reset by peer
20:57:08.648: 1228: debug : qemuMonitorIO:610 : Error on monitor Unable to read from monitor: Connection reset by peer
20:57:08.648: 1228: debug : qemuMonitorIO:644 : Triggering error callback
20:57:08.648: 1228: debug : qemuProcessHandleMonitorError:170 : Received error on 0x142f5a0 'ipcop-clone'
20:57:08.648: 1234: debug : qemuMonitorSend:811 : Send command resulted in error Unable to read from monitor: Connection reset by peer
...
20:57:08.648: 1234: debug : qemuMonitorTextCommandWithHandler:242 : Receive command reply ret=-1 rxLength=0 rxBuffer='(null)'
20:57:08.648: 1234: error : qemuMonitorTextGetPtyPaths:1960 : operation failed: failed to retrieve chardev info in qemu with 'info chardev'
20:57:08.648: 1234: debug : qemuProcessWaitForMonitor:1170 : qemuMonitorGetPtyPaths returned -1
20:57:08.648: 1234: debug : qemuProcessStop:2801 : Shutting down VM 'ipcop-clone' pid=1347 migrated=0

I HAVE checked that /var is mounted, using an "ls /var" in init/libvirt-bin.conf and libvirt-bin-storage.conf.
It is mounted OK.

Franck

Revision history for this message
Serge Hallyn (serge-hallyn) wrote : Re: [Bug 495394] Re: autostart almost always fails on boot time host

Quoting Franck78 (<email address hidden>):
> well,
> if it is not "/var/something"
> there is also
> "/dev/pts"
>
> Something is wrong around the "qemuMonitor" routines.
>
> find why qemuMonitorIORead:487 is triggered and the bug is closed !

Libvirt's monitor is trying to read from an already-opened monitor fd.
Qemu has crashed, perhaps unable to find some backing store, perhaps for
some other reason. /var itself is ok - even if it had gotten
overmounted, libvirt is reading from an fd and the overmount wouldn't
matter.

We need to figure out why qemu crashed. There may be useful info in
/var/log/libvirt/qemu/ipcop-clone.log

> I really don't understand the logic of 'upstart' to trace into it, sorry.
> Opening a new bug on the same subject? Why?

Because yours has a different cause, and is therefore a different bug
with similar symptoms. Globbing the info with that from other bugs
makes it harder to cleanly reason about it and minimizes our chances of
finding the root cause.

summary: - autostart almost always fails on boot time host
+ autostart fails on boot time host when network devices not ready
Revision history for this message
Gary Pope (gaz-6) wrote :

comment #8 fixed me too, but I had to use a longer sleep value. I used sleep 8.
This remedied /etc/init.d/libvirt.bin under Debian v7.1.0 HOST starting a VM for Ubuntu 12.04-3

Gaz

Revision history for this message
Gary Pope (gaz-6) wrote :

Sorry comment #60 was meant to be sleep 60 (60 seconds not 4, like comment #8)

Revision history for this message
Ruben Portier (rubenportier) wrote :

I'm having a similar issue with Ubuntu 16.04 as the host. When the guests are set to autostart and I boot the host, I can see an error message on the guests while they boot: "Failed to start Raise network interfaces".

I've tried the methods above by adding a sleep to the init file. However, the init file /etc/init/libvirt-bin.conf is not used anymore, as echoing into a file from it does not work (it seems this init file is deprecated, so why is it still there?). I've found another file, /etc/init.d/libvirt-bin, which looks a lot newer, although it states the year 2007 at the top of the file. This file is completely different and I have no idea where to put the sleep command to test.

I'm not sure if this is the exact same issue, as I couldn't find any similar issues on the internet. I've tried changing my bridge configuration, without success. I have bridge_maxwait 0 on my bridge, but it does not help. The host is able to use the internet right after boot, so it seems the interface (bridge) is actually up and working. The guests work for a short period of time after boot. I've noticed that the default route (IPv6) has an expire option on it and that it's counting down. When the route expires, the guest is no longer reachable over the internet. After restarting the guest's interface, it works again without any problems.

So, this seems like an issue with the bridge not completely ready when libvirt autostarts my guests. I have no idea why this happens and why it takes longer for the bridge to fully initialize. I hope someone can help me find out if this reported bug is related to my issue, or if I'm having a different one.

Thanks in advance!

Revision history for this message
Ruben Portier (rubenportier) wrote :

Apparently, this issue was caused by the host having high utilisation on the CPU on host boot. This caused the guests to not have their interfaces configured in time. The link on an interface can be up while the interface itself is not yet fully initialised. This causes the networking-services to accept neighbor advertisements and router advertisements which can add IPv6 routes to the routing table.

When the interface configuration is almost complete, it sets the default route, because I had a gateway rule in my interfaces file. This fails ("File exists") because there is already a default IPv6 route for this particular gateway, assigned via RA (router advertisement). To solve this issue, I simply removed the gateway from the interfaces file, as the gateway is assigned automatically via RA. Another fix would be to disable accepting RA, either on this particular interface, as the default value, or on all interfaces, by using:

pre-up sysctl -w net.ipv6.conf.device.accept_ra=0

where "device" is "all", "default" or the actual device name (eth0, em0 etc.).
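
The same setting can also be applied by hand with sysctl. The following is a sketch only: the two function names are made up for illustration, and the sysctl call needs root.

```shell
#!/bin/sh
# Build the sysctl key that controls router-advertisement acceptance.
# $1 is "all", "default" or a device name such as eth0.
accept_ra_key() {
    printf 'net.ipv6.conf.%s.accept_ra\n' "$1"
}

# Disable RA acceptance for the given device (must be run as root).
# Note: device names that themselves contain a dot would need different
# handling, which this sketch does not attempt.
disable_accept_ra() {
    sysctl -w "$(accept_ra_key "$1")=0"
}
```

For example, `disable_accept_ra eth0` runs `sysctl -w net.ipv6.conf.eth0.accept_ra=0`. The change lasts only until the next reboot unless it is also put in the interfaces file or a sysctl configuration file.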

I hope this can help some people suffering from the same problem as I had. It took me way too long to find the cause of this problem. The actual fix was found on this link: http://unix.stackexchange.com/questions/306139/rtnetlink-answers-file-exists-after-adding-ipv6-address.
