Sockfile check retries too short for a busy system boot

Bug #1571209 reported by guessi
This bug affects 1 person
Affects           Status        Importance  Assigned to  Milestone
libvirt (Ubuntu)  Fix Released  High        Unassigned   -
  Precise         Won't Fix     High        Unassigned   -
  Trusty          Fix Released  High        Unassigned   -
  Wily            Won't Fix     High        Unassigned   -
  Xenial          Fix Released  High        Unassigned   -
  Zesty           Fix Released  Undecided   Unassigned   -
  Artful          Fix Released  High        Unassigned   -

Bug Description

[Impact]

 * The libvirt service reports that it is ready before it has spawned the
   libvirt socket, so depending services fail. There was an SRU (#1455608)
   meant to fix that, but it has several deficiencies (not considering the
   config, giving up after 10 seconds, being an unconditional sleep 2, and
   taking up to 2 seconds to stop the service while in post-start).

 * This is a backport and improvement of a change that already landed in
   Yakkety, where it matters less because systemd is used there.

[Test Case]

 * There are two very different ways to "test" this, due to the
   overload-based scenario in which this really becomes important.

 * Version #1 - being lame
   One can just modify the upstart script and replace the check for the
   socket with /bin/true.
   That way it waits forever, which allows you to check the log entries,
   the abort responsiveness and similar.

 * Version #2 - recreating the case
   - This mostly means the system has to be very slow and overloaded.
     You can either slow down the system (e.g. run a qemu guest at
     maximum nice) or stress your host with other things burning
     CPU/memory/disk (see the sketch after this list).
   - We worked with adding autostart guests (see comment #35), but that
     actually takes place after the socket is created. The reported case
     had a RAID rebuilding.
   - TL;DR: get your system slow enough that libvirt needs more than 10
     seconds to start properly (the old limit is 5*2 seconds).
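
 * A rough overload sketch (hypothetical commands; assumes the stress(1)
   package is installed and default paths are used):

     # burn CPU, memory and disk in the background (tune counts to the machine)
     stress --cpu 8 --vm 4 --vm-bytes 1G --hdd 2 &
     # or run everything inside a heavily deprioritized qemu guest
     nice -n 19 qemu-system-x86_64 -enable-kvm -m 2048 -smp 1 disk.qcow2 &
     # then restart libvirt-bin and time how long the socket takes to appear
     sudo service libvirt-bin restart
     time sh -c 'while [ ! -S /var/run/libvirt/libvirt-sock ]; do sleep 0.5; done'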

[Regression Potential]

 * I'd think there might exist (super rare) cases where the post-start
   now spins forever. But by the definition at
   http://upstart.ubuntu.com/cookbook/#post-start this is correct: the
   service is started (yes) but not yet ready. Still, this might appear
   as a regression to some.
 * Other than that, this should clearly fix more issues than it
   (hopefully none) causes.

[Other Info]

 * n/a

--- END SRU Template ---

[ problem description ]

sockfile_check_retries was first introduced by #1455608 to prevent the failure case of the sockfile not being ready, but it defaulted to a hard-coded value of "5", which can be too short for a busy system boot.

#1455608 - https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1455608
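
For context, the check added by #1455608 is essentially a bounded polling loop in post-start; a reconstruction of its shape (illustrative, not the exact script):

    # shape of the old post-start check (reconstructed, names illustrative)
    sockfile_check_retries=5        # hard-coded in the original change
    retries=0
    while [ ! -S /var/run/libvirt/libvirt-sock ]; do
        retries=$((retries + 1))
        if [ "$retries" -gt "$sockfile_check_retries" ]; then
            echo "Giving up waiting for /var/run/libvirt/libvirt-sock"
            exit 1                  # post-start terminates with status 1
        fi
        sleep 2                     # 5 cycles * 2s = the 10s ceiling
    done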

[ steps to reproduce ]

Set up a clean install (Ubuntu Server 14.04.4 LTS) with the OS disk assembled as RAID-1, boot some guest instances (count > 10, start-at-boot), force a shutdown of the host by pressing the power button for 3s ~ 5s or via an IPMI command, then power it on again. The "post-start" script may then sometimes fail to find the sockfile ready, leaving a line of error in /var/log/syslog:

==> kernel: [ 313.059830] init: libvirt-bin post-start process (2430) terminated with status 1 <==

Since multiple VMs were reading/writing before the non-graceful shutdown, the RAID devices need to re-sync after boot, which leads to slow response. The start-up script for libvirt-bin only waits 5 cycles of 2 seconds each, so it times out after 10s and exits with "1".

[ possible solution ]

Extend the retry count for the sockfile wait, and make it configurable by editing the `/etc/default/libvirt-bin` file.

<please see the patch file as attachment>
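
For illustration, such an override file could look like this (the value 60 is a made-up example, not taken from the attached patch):

    # /etc/default/libvirt-bin (sketch)
    # raise the socket wait from the built-in 5 retries to 60 on slow hosts
    sockfile_check_retries=60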

[ sysinfo ]

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 14.04.4 LTS
Release: 14.04
Codename: trusty

$ uname -a
Linux host2 4.2.0-35-generic #40~14.04.1-Ubuntu SMP Fri Mar 18 16:37:35 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

[ related issue ]

#1386465 - https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1386465

Revision history for this message
guessi (guessi) wrote :
Revision history for this message
Ubuntu Foundations Team Bug Bot (crichton) wrote :

The attachment "allow-change-sockfile_check_retries.diff" seems to be a patch. If it isn't, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are a member of the ~ubuntu-reviewers, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issues please contact him.]

tags: added: patch
guessi (guessi)
tags: added: trusty xenial
guessi (guessi)
description: updated
Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

This is the very definition of racy.

It appears to me that the point of the post-start is to make sure libvirt is ready, no matter how long it takes. If libvirt actually fails to start, upstart should catch that and mark it failed and abort post-start, is that right? If not, then we should complicate the loop in the post-start to make that logic happen.

But my point is that having a timeout there at all is wrong - we are waiting for the socket to be ready, not trying to catch failure to start. So we should effectively wait forever for either libvirt to be ready, or to have failed to start. No guessing at timeouts here.
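
In shell terms, the loop Serge describes would look roughly like this (a sketch, assuming the job's upstart status can be polled from within post-start):

    # unbounded wait that still notices a failed/stopped daemon (sketch)
    while [ ! -S /var/run/libvirt/libvirt-sock ]; do
        # bail out once upstart no longer has the goal "start"
        # (stopped manually, respawn limit hit, daemon died, ...)
        status libvirt-bin | grep -q "start/" || exit 1
        sleep 1
    done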

Changed in libvirt (Ubuntu):
assignee: nobody → Edward Hope-Morley (hopem)
importance: Undecided → High
status: New → Confirmed
Revision history for this message
guessi (guessi) wrote :

@serge, @hopem,

I'd like to propose a new approach: wait indefinitely until the sockfile is ready,
and allow the loop to exit if there is an interrupting stop/restart/force-stop event.
* excluding a systemd solution; sorry, I'm not familiar with systemd.

it is inspired by the following PRs from the Docker project,

- https://github.com/docker/docker/pull/9287/files
- https://github.com/docker/docker/pull/11945/files

it should apply smoothly to libvirt-bin as well;
please help to review it again, thanks!

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Thanks, @guessi, that patch looks great. The only thing I wonder is whether a 0.5s sleep is a bit too short. I assume you had a rationale for shortening the sleep time?

Revision history for this message
guessi (guessi) wrote :

@serge,

no, I don't;
it simply shortens the wait until the service is reported back up or down.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote : Re: [Bug 1571209] Re: Sockfile check retries too short for a busy system boot

Quoting guessi (<email address hidden>):
> @serge,
>
> no, I don't;
> it simply shortens the wait until the service is reported back up or down.

Ok - the shorter time itself is probably fine, but printing a
message every 0.2 seconds seemed a bit much. On the other
hand I don't want to complicate the logic to only print every
5 iterations either...

Revision history for this message
guessi (guessi) wrote :

this is the patch file from #4, but keeping the sleep wait time at "2":

https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1571209/comments/4

Revision history for this message
guessi (guessi) wrote :
Revision history for this message
guessi (guessi) wrote :
Revision history for this message
guessi (guessi) wrote :
Revision history for this message
guessi (guessi) wrote :
Revision history for this message
guessi (guessi) wrote :

@serge,
to avoid emitting too much log output, I've changed the delay time back to "2".

@serge, @hopem,
the patch files above for precise/trusty/vivid/wily/xenial have been uploaded
and should work as expected, but still, please take time to review them.

thanks !!!

tags: added: precise vivid wily
Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Heh.

So the importance of this patch is underscored by the fact that it has
basically turned every 'start libvirt-bin' into a sleep 2s.

So I guess we should drop the number. Or perhaps we should increment the
sleep time on every iteration: start at 0.5 seconds and go up to a 2s
sleep at the end? (See the sketch below.)

Or just go back to a 0.5 or 1 second sleep on every loop.

Your patch does work (at least with upstart). Let me know what you think
we should do for the sleep time, and I'll either change the number, or if
you want to do incremental please post a new set of patches.
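
The incremental variant could be as simple as this (illustrative sketch; awk is only used because dash has no floating-point arithmetic):

    # back off from 0.5s, doubling up to a 2s cap (sketch)
    delay=0.5
    while [ ! -S /var/run/libvirt/libvirt-sock ]; do
        sleep "$delay"
        delay=$(awk -v d="$delay" 'BEGIN { d *= 2; if (d > 2) d = 2; print d }')
    done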

Revision history for this message
guessi (guessi) wrote :

@serge,

sorry for the late reply,
we live in different timezones :p

I would prefer to decrease the delay time to 1s, but keep it waiting indefinitely.

According to my original post (boot up with a degraded RAID, 5s -> 10s), it takes up to 10s in my case, so I don't think increasing the delay from 0.5s to 2s is enough for even worse cases; that's why I left an option in default/libvirt-bin.

Do I need to post the patch set again if we just change the number? Or should I create a merge request?

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Hi,

I'll just drop it to 1s, thanks.

Revision history for this message
guessi (guessi) wrote :

thanks, I'll keep tracking this issue.

Revision history for this message
guessi (guessi) wrote :

@serge, is there a milestone for fixing this rare-case problem?

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Sorry, working on it for yakkety now.

Changed in libvirt (Ubuntu):
assignee: Edward Hope-Morley (hopem) → Serge Hallyn (serge-hallyn)
status: Confirmed → In Progress
Changed in libvirt (Ubuntu Precise):
importance: Undecided → High
Changed in libvirt (Ubuntu Trusty):
importance: Undecided → High
Changed in libvirt (Ubuntu Wily):
importance: Undecided → High
Changed in libvirt (Ubuntu Xenial):
importance: Undecided → High
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package libvirt - 1.3.4-1ubuntu4

---------------
libvirt (1.3.4-1ubuntu4) yakkety; urgency=medium

  * Re-enable the upstart job by renaming the file.
  * Include patch by @guessi to continually wait for libvirtd to start when
    using sysvinit or upstart. (LP: #1571209)

 -- Serge Hallyn <email address hidden> Mon, 23 May 2016 13:50:22 -0500

Changed in libvirt (Ubuntu):
status: In Progress → Fix Released
Revision history for this message
guessi (guessi) wrote :

@serge, great thanks! Looking forward to the LTS updates (trusty/xenial).

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Yes, I suppose we could do xenial right now, but I'm waiting on
another package to clear trusty-proposed.

Revision history for this message
Jon Grimm (jgrimm) wrote :

Per IRC chat with hallyn: he'd like to hand this off to someone else to SRU, but is a willing sponsor if needed.

Changed in libvirt (Ubuntu Wily):
status: New → Won't Fix
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

This bug was dormant for quite some time, but the issue still exists for Trusty (where we could take the proposed change), and we have to check the systemd cases in >=Xenial.
Adapting tasks; I will come back once I have tested on >=Xenial.

Changed in libvirt (Ubuntu Artful):
status: Fix Released → Incomplete
Changed in libvirt (Ubuntu Zesty):
status: New → Incomplete
Changed in libvirt (Ubuntu Xenial):
status: New → Incomplete
Changed in libvirt (Ubuntu Trusty):
status: New → Triaged
Changed in libvirt (Ubuntu Precise):
status: New → Won't Fix
Changed in libvirt (Ubuntu Artful):
assignee: Serge Hallyn (serge-hallyn) → nobody
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Found while testing:
BTW - the base path of the socket is configurable, so the fixes need to consider "unix_sock_dir" from /etc/libvirt/libvirtd.conf to be correct.
If anyone changes that in the config, the service becomes "unstartable", as the check would always consider it not up and break it.
That should be wrapped up in the fix for Trusty, where sysv/upstart is used.
Continuing with systemd tests (while there is no explicit libvirtd.target to wait on, I want to know how it behaves).
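
A config-aware lookup might look like this (the sed pattern is an assumed way to read unix_sock_dir, not the shipped code):

    # honour unix_sock_dir from libvirtd.conf instead of hard-coding the path
    sockdir=$(sed -n 's/^ *unix_sock_dir *= *"\(.*\)" *$/\1/p' /etc/libvirt/libvirtd.conf)
    [ -n "$sockdir" ] || sockdir=/var/run/libvirt
    sock="$sockdir/libvirt-sock"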

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

To check on the systemd case I picked a rather slow system and spawned 30 guests.
I set all those to autostart and then shut them down.

Starting them sequentially as well as concurrently takes about ~20 seconds.

/var/run/libvirt/libvirt-sock is gone after the service is stopped.

With the following script this can be tested:
$ cat test-restart.sh
#!/bin/bash
set -uxe
date
sudo systemctl start libvirtd
date
# right after start (1-2 seconds usually) check status of sockets and guests
while /bin/true; do
    ls -laF /var/run/libvirt/libvirt-sock
    virsh list
    sudo systemctl status libvirtd --no-pager
    date
    sleep 2s
done

You will see that "virsh list" more or less blocks until other internal tasks are done.
That pretty much ends after the ~20 seconds I counted; then it goes on.
All guests are started then and things are good.
But the socket is available right after start; if you issue a request you are enqueued and have to wait - which is fine.

Output:
$ ./test-restart.sh
+ date
Mi 6. Sep 04:40:34 EDT 2017
+ sudo systemctl start libvirtd
+ date
Mi 6. Sep 04:40:34 EDT 2017
+ /bin/true
+ ls -laF /var/run/libvirt/libvirt-sock
srwxrwx--- 1 root libvirtd 0 Sep 6 04:40 /var/run/libvirt/libvirt-sock=
+ virsh list
 Id Name State
----------------------------------------------------
 1 artful-test30 running
 2 artful-test9 running
 3 artful-test8 running
 4 artful-test23 running
 5 artful-test13 running
 6 artful-test19 running
 7 artful-test11 running
 8 artful-test15 running
 9 artful-test10 running
 10 artful-test17 running
 11 artful-test5 running
 12 artful-test24 running
 13 artful-test26 running
 14 artful-test20 running
 15 artful-test2 running
 16 artful-test29 running
 17 artful-test22 running
 18 artful-test27 running
 19 artful-test14 running
 20 artful-test6 running
 21 artful-test25 running
 22 artful-test12 running
 23 artful-test3 running
 24 artful-test21 running
 25 artful-test7 running
 26 artful-test16 running
 27 artful-test28 running
 28 artful-test18 running
 29 artful-test1 running
 30 artful-test4 running

+ sudo systemctl status libvirtd --no-pager
● libvirtd.service - Virtualization daemon
   Loaded: loaded (/lib/systemd/system/libvirtd.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2017-09-06 04:40:34 EDT; 23s ago
     Docs: man:libvirtd(8)
           http://libvirt.org
 Main PID: 171137 (libvirtd)
    Tasks: 303 (limit: 32768)
   Memory: 93.7M
      CPU: 15.383s
   CGroup: /system.slice/libvirtd.service
           ├─ 3848 /usr/...


Changed in libvirt (Ubuntu Artful):
status: Incomplete → Fix Released
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Changed the order of status and list to get an even better picture while testing on Xenial.
Also the system was even slower, which I think is good (for the test).

So things are good on Xenial onward (with systemd).

$ sudo ./test-restart.sh
+ date
Mi 6. Sep 09:03:35 UTC 2017
+ sudo systemctl start libvirtd
+ date
Mi 6. Sep 09:03:36 UTC 2017
+ /bin/true
+ ls -laF /var/run/libvirt/libvirt-sock
srwxrwx--- 1 root libvirtd 0 Sep 6 09:03 /var/run/libvirt/libvirt-sock=
+ sudo systemctl status libvirtd --no-pager
* libvirt-bin.service - Virtualization daemon
   Loaded: loaded (/lib/systemd/system/libvirt-bin.service; enabled; vendor preset: enabled)
   Active: active (running) since Mi 2017-09-06 09:03:36 UTC; 48ms ago
     Docs: man:libvirtd(8)
           http://libvirt.org
 Main PID: 261895 (libvirtd)
    Tasks: 20
   Memory: 30.5M
      CPU: 75ms
   CGroup: /system.slice/libvirt-bin.service
           |- 3932 /usr/sbin/dnsmasq --conf-file=/var/lib/libvirt/dnsmasq/default.conf --leasefile-ro --dhcp-script=/usr/lib/libvirt/libvirt_...
           |- 3933 /usr/sbin/dnsmasq --conf-file=/var/lib/libvirt/dnsmasq/default.conf --leasefile-ro --dhcp-script=/usr/lib/libvirt/libvirt_...
           |-261895 /usr/sbin/libvirtd
           `-261913 /usr/sbin/libvirtd

Sep 06 09:03:35 s1lp04 systemd[1]: Starting Virtualization daemon...
Sep 06 09:03:36 s1lp04 systemd[1]: Started Virtualization daemon.
+ virsh list
 Id Name State
----------------------------------------------------
 1 artful-test2 running
 2 artful-test13 running
 3 artful-test27 running
 4 artful-test28 running
 5 artful-test17 running
 6 artful-test26 running
 7 artful-test25 running
 8 artful-test20 running
 9 artful-test21 running
 10 artful-test11 running
 11 artful-test23 running
 12 artful-test16 running
 13 artful-test10 running
 14 artful-test30 running
 15 artful-test8 running
 16 artful-test18 running
 17 artful-test5 running
 18 artful-test3 running
 19 artful-test4 running
 20 artful-test12 running
 21 artful-test24 running
 22 artful-test9 running
 23 artful-test14 running
 24 artful-test19 running
 25 artful-test1 running
 26 artful-test15 running
 27 artful-test29 running
 28 artful-test6 running
 29 artful-test7 running
 30 artful-test22 running

+ date
Mi 6. Sep 09:04:55 UTC 2017
+ sleep 2s

Changed in libvirt (Ubuntu Xenial):
status: Incomplete → Fix Released
Changed in libvirt (Ubuntu Zesty):
status: Incomplete → Fix Released
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

That leaves Trusty to fix:
- do verifications on Trusty before working on the SRU (we need working test steps for the SRU anyway)
If confirmed:
- backport the proposed fix from guessi (thanks for all your work btw, and sorry this fell out of sight)
- adapt it to not print messages too often (adaptive sleep)
- it has to consider the configuration change to the base path (that needs to be done no matter what the startup tests show, btw)

Changed in libvirt (Ubuntu Trusty):
status: Triaged → Incomplete
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

My Trusty system wasn't as slow, so I used 100 guests on autostart.

But that seemed to work as well (test adapted for upstart):

./test-restart.sh
+ date
Wed Sep 6 09:34:58 UTC 2017
+ sudo service libvirt-bin start
libvirt-bin start/running, process 19805
+ date
Wed Sep 6 09:35:00 UTC 2017
+ /bin/true
+ ls -laF /var/run/libvirt/libvirt-sock
srwxrwx--- 1 root libvirtd 0 Sep 6 09:34 /var/run/libvirt/libvirt-sock=
+ sudo service libvirt-bin status
libvirt-bin start/running, process 19805
+ ls -laF /var/run/libvirt/libvirt-sock
srwxrwx--- 1 root libvirtd 0 Sep 6 09:34 /var/run/libvirt/libvirt-sock=
+ virsh list

So the socket is immediately available even with all sorts of slow guests starting.
Unfortunately the initial bug 1455608 wasn't too detailed on an actual case.
Actually I'd have refused that as an SRU team member :-/

Well, we'll live with what we have.
For now I'll try to slow things down further and recheck.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Slowed down further (limited to 2 CPUs); startup of the guests takes minutes now.
Still no issue to be seen.
The socket is around and working, while in the background the guests start slowly but surely.

The test is now:
# cat test-restart.sh
#!/bin/bash
set -uxe
#starts off
date
ls -laF /var/run/libvirt/libvirt-sock
sudo service libvirt-bin status
date
# start service
sudo service libvirt-bin start
date
# check if socket is available immediately
ls -laF /var/run/libvirt/libvirt-sock
sudo service libvirt-bin status
ls -laF /var/run/libvirt/libvirt-sock
# this might take a while as the daemon queues requests
virsh list
date

With 100 guests shut down but set to autostart on a 2 CPU system,
it takes about 5 minutes to start the guests, but the service is up and ready on the socket well before all of that is done.
Did you face something else?

@Ed / @Guessi
I know it has been a long time, but I need something reproducible to go on fixing this for artful. If you can help me with examples that do reproduce it, please let me know.

@Guessi:
I tried with the degraded RAID but it worked for me as well, maybe it is just not the most reliable repro.

@Ed:
the original case lacks some details about what actually failed, e.g. which dependent services and how it was triggered in your case.

Keeping incomplete status on Trusty for now :-/

Log of the last test:
# ./test-restart.sh
+ date
Wed Sep 6 09:56:01 UTC 2017
+ ls -laF /var/run/libvirt/libvirt-sock
ls: cannot access /var/run/libvirt/libvirt-sock: No such file or directory
+ /bin/true
+ sudo service libvirt-bin status
libvirt-bin stop/waiting
+ date
Wed Sep 6 09:56:01 UTC 2017
+ sudo service libvirt-bin start
libvirt-bin start/running, process 9730
+ date
Wed Sep 6 09:56:03 UTC 2017
+ ls -laF /var/run/libvirt/libvirt-sock
srwxrwx--- 1 root libvirtd 0 Sep 6 09:56 /var/run/libvirt/libvirt-sock=
+ sudo service libvirt-bin status
libvirt-bin start/running, process 9730
+ ls -laF /var/run/libvirt/libvirt-sock
srwxrwx--- 1 root libvirtd 0 Sep 6 09:56 /var/run/libvirt/libvirt-sock=
+ virsh list
 Id Name State
----------------------------------------------------
 2 artful-test32 running
[...]
 101 artful-test45 running
+ date
Wed Sep 6 10:01:41 UTC 2017

Revision history for this message
guessi (guessi) wrote :

@paelzer,

glad to see people re-working this issue, but it's been 16+ months,

I can only tell you the scenario I remember.
I think it would not have happened if the system were already booted;
it only happened *at boot time*, as I described in my original message.

key points are:
* it is running with >10 virtual instances
* it is running on a RAID-1 system (traditional disks, *NOT* SSD)
* it was ungracefully shut down (power cord rudely unplugged)

in RAID-1 mode it tries to recover data between the disks after an ungraceful shutdown,
which finally leads to a slow start-up state *during boot*

the latest merge was made by @serge, see comment #19 above;
maybe he could help to add more detail about his test/patch?

Revision history for this message
guessi (guessi) wrote :

@paelzer,

by the way, the final patch merged into Yakkety uses "sleep 0.5";

see "diff from 1.3.4-1ubuntu3 to 1.3.4-1ubuntu4 (1.6 KiB)"
- https://launchpad.net/ubuntu/+source/libvirt/1.3.4-1ubuntu4

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Yeah, nobody minds yakkety anymore (yes - that much time has passed :-/), and on the others there is no change yet (and systemd by default).

I'll go into another repro session closer to the case you had, adding:
- reboots
- unclean shutdown
- maybe a RAID or something similar to stress this even more

I'll start with Trusty, where you surely saw it, and once I have a working case, move forward.

Changed in libvirt (Ubuntu Trusty):
status: Incomplete → Triaged
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Activity ball moved to my side of the field; setting Triaged to make clear that we are not waiting on the reporter - even though we haven't reached "confirmed" yet in a strict sense.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Thank you Guessi for your summary once more.
I tried to get closer to your case.

# Install a Trusty system with Raid-1, that has space for many guests (a lot of memory) but is slow (1 cpu)
$ wget http://releases.ubuntu.com/14.04/ubuntu-14.04.5-server-amd64.iso
$ qemu-img create -f qcow2 trusty-disk1.qcow2 10G
$ qemu-img create -f qcow2 trusty-disk2.qcow2 10G
$ sudo qemu-system-x86_64 -enable-kvm -hda trusty-disk1.qcow2 -hdb trusty-disk2.qcow2 -cdrom ubuntu-14.04.5-server-amd64.iso -m 16384 -smp 1
# On this install SW Raid1 according to [1]
# I set up 20 guests with autostart as before
# system seems slow enough, but in addition break the secondary raid device

# I also took a copy and blasted a 256M hole of zeroes right after its partition table
$ dd if=/dev/zero bs=1024 of=trusty-disk2.qcow2 seek=4096 count=$((4096*64))
# Starting the Trusty system again, that blow was too much for it to recover at all
# given that the startup took so long (I made the qemu guest rather slow in general), I don't think I need that part to slow it down further
# So further on I killed it hard to cause minor FS issues, which matches your power-loss case better anyway

On reboot the service was still fine and not timing out.
Guests were still starting a minute later, but nothing kept libvirt from having its socket ready in time.

The message in /var/log/upstart/libvirt-bin.log is just ".. ready"; no failure to be seen.

Instead of making the system even slower (unusable), I decided to shorten the retries: instead of 5*2 sec I used only 2*1 sec, and with that it finally triggered *yay*.

Checked another reboot and yes, still showing up.

Error is:
...
Giving up waiting for /var/run/libvirt/libvirt-sock
libvirt-bin stop/post-start, process 1217
    post-start process 1218
/usr/sbin/libvirtd: error: unable to determine if daemon is running: No such file or directory

So in >=Xenial, as I mentioned, all of this doesn't matter unless you explicitly opt into sysv/upstart. But in Trusty we now finally have a case to work with.

Trying a fix combining your suggestion with:
- incremental sleep times
- consideration of the potential path change in libvirtd.conf

[1]: https://help.ubuntu.com/community/Installation/SoftwareRAID

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

A backport of what was applied to zesty, plus the fixes discussed here, is building in a PPA [1]:

  * d/libvirt-bin.init, d/libvirt-bin.upstart: fix waiting for the libvirt
    socket (LP: #1571209)
    - avoid timing out on slow systems (only stop when service is stopped)
    - fix whitespace damage formerly added to d/libvirt-bin.init
    - no more sleep without announcing why
    - use incremental sleep to be fast in most fast path cases
    - use incremental sleep to avoid too many repetitive messages
    - fix check if unix_sock_dir path is set in /etc/libvirt/libvirtd.conf

I'll test that (so far untested) now and if successful open a merge proposal for team review.
More tests / verifications are welcome.

[1]: https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/2950
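
Putting the changelog items together, the post-start logic plausibly looks something like this (a sketch under those assumptions, not the uploaded script itself):

    # combined behaviour described in the changelog above (sketch)
    sockdir=$(sed -n 's/^ *unix_sock_dir *= *"\(.*\)" *$/\1/p' /etc/libvirt/libvirtd.conf)
    [ -n "$sockdir" ] || sockdir=/var/run/libvirt
    delay=1
    while [ ! -S "$sockdir/libvirt-sock" ]; do
        # only give up when the job itself is being stopped
        status libvirt-bin | grep -q "start/" || exit 1
        echo "Waiting ${delay}s for $sockdir/libvirt-sock"
        sleep "$delay"
        # incremental sleep: fast at first, capped to bound log spam
        delay=$((delay * 2)); [ "$delay" -gt 30 ] && delay=30
    done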

Revision history for this message
guessi (guessi) wrote :

@paelzer,

glad to see you reproduce the problem successfully,
and simulating it with nested KVM is better than my test, thanks

here I would like to add more history for this bug report:
* it is running under KVM with multiple instances
* it is ungracefully shut down (can be simulated via virsh if testing under nested KVM, or kill)
* the disk of the physical host is running under RAID-1
* the disk of the physical host is formatted with 2 partitions, / and /boot
* one of the KVM instances runs with 2 vHDDs, partitions: / and /data
* * the /data is mounted as the data volume for a DB (yeah, it is reading/writing all the time)

as for the patch in the PPA above,
let's focus on the libvirt-related section only;
I would prefer to keep the sleep interval short (1s),
see the discussion in #14 ~ #16 above.

consider the following scenario:
in a slow environment where the socket needs 65s to become ready,
the loop with incremental sleeps requires 91s, because the checks land at
cumulative 1s, 3s, 7s, 15s, 31s, 61s and 91s - so 26s are wasted
(1s + 2s + 4s + 8s + 16s + 30s + 30s = 91s)

and in the worst case (ready just after the 61s check),
29s are wasted. that slows down the boot;
I think it's not a good user experience, although it does no harm to the system.

BTW,
I saw you marked Xenial/Zesty/Artful as Fix Released in #26, #27;
does that mean test #35 passed on all of Xenial/Zesty/Artful?

Revision history for this message
guessi (guessi) wrote :

forgot to mention:
the RAID system I mentioned above is a software RAID;
there's no RAID card in the server

hope this info helps :)

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

The fix seems to work; I see it waiting 1+2+4+8 seconds overall before declaring the job ready by detecting the socket.

Also the extended test, adapting unix_sock_dir in /etc/libvirt/libvirtd.conf e.g. to "/var/run", worked (it would have failed before). Libvirt comes up and is correctly considered up even with a non-default path.

As long as it is in the post-start phase, it reports to the log at most every 30 seconds.
The status looks like "respawn/post-start, (post-start) process 1177".

Prepping an MP for review and an SRU template.
But heads up - there is another SRU ongoing that has to complete before this one.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

MP open and linked here

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

The MP review seems close to approval; I've added an SRU template here.

description: updated
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Uploaded to trusty-unapproved for SRU Team review.

Please note: there is another upload [1] currently passing through SRU processing.
We want [1] to pass completely before the new upload [2] is accepted.

The former is all complete and just needs a few more days in proposed.
So if all goes right, [1] will be released around next Monday and [2] can then enter trusty-proposed.

[1]: https://launchpad.net/ubuntu/+source/libvirt/1.2.2-0ubuntu13.1.22
[2]: https://launchpadlibrarian.net/336851479/libvirt_1.2.2-0ubuntu13.1.23_source.changes

Revision history for this message
Brian Murray (brian-murray) wrote : Please test proposed package

Hello guessi, or anyone else affected,

Accepted libvirt into trusty-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/libvirt/1.2.2-0ubuntu13.1.23 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-trusty to verification-done-trusty. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-trusty. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in libvirt (Ubuntu Trusty):
status: Triaged → Fix Committed
tags: added: verification-needed verification-needed-trusty
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

I tested again the same way I did with the PPA, and it works for me with the fix that is in proposed.
So the behavior should now be more reliable for those "real" cases on slow systems that opened the window for the former issue.

tags: added: verification-done verification-done-trusty
removed: verification-needed verification-needed-trusty
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package libvirt - 1.2.2-0ubuntu13.1.23

---------------
libvirt (1.2.2-0ubuntu13.1.23) trusty; urgency=medium

  * d/libvirt-bin.init, d/libvirt-bin.upstart: fix waiting for the libvirt
    socket (LP: #1571209)
    - avoid timing out on slow systems (only stop when service is stopped)
    - fix whitespace damage formerly added to d/libvirt-bin.init
    - no more long sleep without announcing to log
    - check socket and service status more often for lower latency on changes
    - fix check if unix_sock_dir path is set in /etc/libvirt/libvirtd.conf
    - fix the upstart service name that is checked

 -- Christian Ehrhardt <email address hidden> Thu, 07 Sep 2017 14:22:45 +0200

Changed in libvirt (Ubuntu Trusty):
status: Fix Committed → Fix Released
Revision history for this message
Brian Murray (brian-murray) wrote : Update Released

The verification of the Stable Release Update for libvirt has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

