boot-time race condition initializing md

Bug #75681 reported by Kees Cook
This bug affects 2 people
Affects                    Status        Importance  Assigned to   Milestone
initramfs-tools (Ubuntu)   Invalid       Undecided   Unassigned
  Feisty                   Invalid       High        Unassigned
lvm2 (Ubuntu)              Fix Released  Undecided   Unassigned
  Feisty                   Fix Released  Undecided   Unassigned
mdadm (Ubuntu)             Fix Released  High        Ian Jackson
  Feisty                   Fix Released  High        Ian Jackson
udev (Ubuntu)              Fix Released  Undecided   Unassigned
  Feisty                   Fix Released  Undecided   Unassigned

Bug Description

Running initramfs-tools 0.69ubuntu26, my system hangs forever at usplash; I assume it is waiting for the root device to show up (I can see the disk I/O LED polling).

Following instructions in /usr/share/doc/mdadm/README.upgrading-2.5.3.gz, I booted with break=mount. The first time, I ran ./scripts/local-top/mdadm by hand, and when I exited the shell, it tried to run it again. The second time I rebooted, I just did a break=mount, and exited from the shell immediately. mdadm started up fine, and things booted fine.

I'm not really sure how to dig into this and debug the issue. I'm running a pair of drives in md arrays, with an LVM root on top of that.

Revision history for this message
Kees Cook (kees) wrote : Re: race condition between sata and md

Tracked this down: there is a race between /scripts/local-top/mdadm running and the SATA drives being brought online. Adding a delay seems to work, and I now look forward to
https://blueprints.launchpad.net/distros/ubuntu/+spec/udev-mdadm
working with initramfs. :)

Revision history for this message
Kees Cook (kees) wrote : Re: initramfs script: race condition between sata and md

Here is a patch that stalls the mdadm initramfs script until the desired devices are available. I set the timeout to 30 seconds.
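(The patch itself is not reproduced here. As a rough sketch of the idea, assuming the script already knows which component devices it expects, the change amounts to a poll-and-timeout loop along these lines; the device list is illustrative, not taken from the actual patch:)

    # Hypothetical sketch: wait up to 30 seconds for the expected
    # RAID component devices to appear before assembling.
    DEVICES="/dev/sda1 /dev/sdb1"   # illustrative component list
    TIMEOUT=30
    while [ "$TIMEOUT" -gt 0 ]; do
        missing=0
        for dev in $DEVICES; do
            [ -b "$dev" ] || missing=1
        done
        [ "$missing" -eq 0 ] && break
        sleep 1
        TIMEOUT=$((TIMEOUT - 1))
    done
    # then assemble as before, e.g.:
    mdadm --assemble --scan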

Revision history for this message
Reinhard Tartler (siretart) wrote :

I can confirm this bug; ajmitch is suffering from this as well.

Changed in mdadm:
importance: Undecided → High
status: Unconfirmed → Confirmed
Revision history for this message
Aurelien Naldi (aurelien.naldi) wrote :

It looks like it was fixed by recent updates, but at the expense of breaking LVM (my initramfs does not contain /scripts/local-top/lvm anymore).

Revision history for this message
Aurelien Naldi (aurelien.naldi) wrote :

It looks like a race condition is still here: the initramfs tries to assemble my RAID upon detection of /dev/sda. Obviously, with 3 drives out of 4 missing, it fails... Then /dev/md0 is left inactive and not correctly created when the other drives are detected.

Is it possible to get a udev event when udev knows that all drives have been detected?
I think the best behaviour looks like this:
- upon disk detection, scan for RAID volumes. If a volume is available, assemble it ONLY if all of its drives are present.
- try to assemble incomplete (degraded) RAID volumes ONLY when ALL drives have been detected (i.e. if a drive is still missing, it will _not_ appear later).

This would make it faster without breaking things when a drive is just slow to answer.
Is that what the "udev-mdadm" and "udev-lvm" specs are for? They seem to only deal with sending udev events for newly created md/lvm volumes, not with detecting/assembling them.
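(A rough sketch of the first rule above, assuming the array's UUID and member count are known from mdadm.conf; the values and the loop are hypothetical, not the behaviour of the shipped script:)

    # Hypothetical: only assemble md0 when every expected member is visible.
    UUID="00000000:00000000:00000000:00000000"   # placeholder array UUID
    EXPECTED=4                                   # members declared for the array
    found=0
    for dev in /dev/sd[a-z][0-9]*; do
        [ -b "$dev" ] || continue
        mdadm --examine "$dev" 2>/dev/null | grep -q "$UUID" && found=$((found + 1))
    done
    if [ "$found" -ge "$EXPECTED" ]; then
        mdadm --assemble --scan --no-degraded
    fi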

Revision history for this message
Reinhard Tartler (siretart) wrote : Re: [Bug 75681] Re: initramfs script: race condition between sata and md

Aurelien Naldi <email address hidden> writes:
> It looks like it was fixed by recent updates, but at the expense of
> breaking LVM (my initramfs does not contain (/scripts/local-top/lvm
> anymore).

I can confirm. Latest updates changed the behavior: now the raid is
activated but not lvm. Booting with 'break=mount' lets me run the mdadm
script by hand, but still does not activate lvm. I have to run 'lvm
vgchange -ay' by hand and press 'CTRL-D' to continue booting.

--
Gruesse/greetings,
Reinhard Tartler, KeyID 945348A4

Revision history for this message
Reinhard Tartler (siretart) wrote : Re: initramfs script: race condition between sata and md

Subscribing Ian and Scott, since they have done some uploads regarding udev and lvm2 in the past which have improved the situation a bit.

Revision history for this message
Reinhard Tartler (siretart) wrote :

Status update: udev does start one raid device on bootup, but not all. After starting the raid devices manually using /scripts/local-top/mdadm, the VGs still don't come up; I need to run 'lvm vgscan ; lvm vgchange -a y' manually.

Revision history for this message
Aurelien Naldi (aurelien.naldi) wrote :

There is still a race condition here. It does not attempt to start the array before detecting devices, but it often starts it too early. My 4-device RAID5 is, most of the time, either not assembled (i.e. 1 or 2 devices out of 4 is not enough) or assembled in degraded mode. An array should be assembled in degraded mode ONLY when no more devices are expected (or at least after a timeout)!

Revision history for this message
snore (sten-bit) wrote :

I've been hit by the same problem with a fresh feisty install (10 Feb install CD);
booting often doesn't work and always results in incomplete raid arrays.

This is with 2x 80gb sata, and 7 raid1 devices. The current mdadm
initramfs script is unsuitable for release.

Revision history for this message
Ian Jackson (ijackson) wrote :

I think I have fixed this in mdadm_2.5.6-7ubuntu4. Please could you install this, which I have just uploaded, and check if it works.

Changed in mdadm:
assignee: nobody → ijackson
status: Confirmed → Fix Committed
Revision history for this message
Ian Jackson (ijackson) wrote : Re: [Bug 83832] Re: [feisty] mounting LVM root broken

I think I have fixed the bug that causes the assembly of the RAID
arrays in degraded mode, in mdadm_2.5.6-7ubuntu4 which I have just
uploaded.

Please let me know if it works ...

Thanks,
Ian.

Revision history for this message
Sten Spans (sten-blinkenlights) wrote : Re: initramfs script: race condition between sata and md

this fixes the issue for me

Revision history for this message
Aurelien Naldi (aurelien.naldi) wrote :

The first boot went fine. The next one hung. And the next one also went fine.
For all of them, I get the message "no devices listed in conf file were found", but it looks harmless.
I do not know where the second one hung, so it may be unrelated, but it was before mounting /.

Revision history for this message
Aurelien Naldi (aurelien.naldi) wrote :

From the changelog of mdadm 2.5.6-7ubuntu4:
     Specify --no-degraded argument to mdadm in initramfs; this
     can be overridden by setting MD_DEGRADED_ARGS to some nonempty value
     (eg, a single space). This ought to fix race problems where RAIDs are
     assembled in degraded mode far too much. (LP #75681 and many dupes.)
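(A hedged reading of that changelog entry, not the literal script source: the option is presumably injected into the assemble call roughly like this, so that setting MD_DEGRADED_ARGS to a single space drops --no-degraded:)

    # Sketch of the described behaviour; the real /scripts/local-top/mdadm
    # may differ in detail.
    MD_DEGRADED_ARGS="${MD_DEGRADED_ARGS:---no-degraded}"
    mdadm --assemble --scan $MD_DEGRADED_ARGS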

This looks much more like a workaround than a real fix! I may have misunderstood it; please ignore me if this is the case :)
Does it mean that I have to enter commands in the initramfs or do some kind of black magic to boot a system with a broken hard drive?
And it only avoids desynchronising a working RAID; it does not fix the race at boot time, as mdadm still tries to assemble the array with too few drives and thus refuses to assemble it properly later.

The array should _not_ be started _at_all_ unless all needed devices have been detected (including spare ones) OR all device detection is done.
So I ask again: is there any way to make udev send a special signal when device detection is finished?
Is there any way to make mdadm scan for arrays and assemble them ONLY if all of their devices are present, doing _nothing_ otherwise?

It may be tricky but I think that these problems can only be fixed this way.
Anyway, it did work before.

Revision history for this message
Ian Jackson (ijackson) wrote : Re: [Bug 75681] Re: initramfs script: race condition between sata and md

Aurelien Naldi writes ("[Bug 75681] Re: initramfs script: race condition between sata and md"):
> The first boot went fine. The next one hanged. And the next one went
> also fine. For all of them, I get the message "no devices listed in
> cinf file were found" but it looks harmless. I do not know were the
> second one hanged so it may be unrelated, but it was before mounting
> /

When one of the boots hangs, can you please wait for it to time out (3
minutes by default, though IIRC you can adjust this by saying
ROOTWAIT=<some-number-of-seconds>) and then when you get the initramfs
prompt try

1. Check that
     cat /proc/partitions
   lists all of the components of your array. If not, then we need to
   understand why not. What are those components ? You will probably
   find that it does not list the array itself. If it does then we
   need to understand why it doesn't seem to be able to mount it.

2. Run
     /scripts/local-top/mdadm from-udev
   (NB `from-udev' is an argument you must pass to the script)
   and see if it fixes it. (If so it will show up in /proc/partitions
   and be mountable.)

3. If that doesn't fix it, see if
     mdadm -As --no-degraded
   fixes it.

Thanks,
Ian.

Revision history for this message
Ian Jackson (ijackson) wrote :

Aurelien Naldi writes ("[Bug 75681] Re: initramfs script: race condition between sata and md"):
> The array should _not_ be started _at_all_ unless all needed devices
> have been detected (including spare ones) OR when all device
> detection is done.

Unfortunately, in Linux 2.6, there is no way to detect when `all
device detection is done' (and indeed depending on which kinds of bus
are available, that may not even be a meaningful concept).

Ian.

Revision history for this message
Reinhard Tartler (siretart) wrote :

Ian Jackson <email address hidden> writes:
> 1. Check that
> cat /proc/partitions
> lists all of the components of your array. If not, then we need to
> understand why not. What are those components ? You will probably
> find that it does not list the array itself. If it does then we
> need to understand why it doesn't seem to be able to mount it.

I notice that my 'real' partitions (/dev/sd[a,b][1..7]) do appear, but
only md0, not my other md[1..3] devices. Moreover, after checking
/proc/mdstat, /dev/md0 is started in degraded mode.

My setup: /dev/md0 is a mirror of /boot. md1 is a mirror of swap, md2 is
a mirrored volume group with my home (and backup of root), md3 is the
rest with a striped volume group.

> 2. Run
> /scripts/local-top/mdadm from-udev
> (NB `from-udev' is an argument you must pass to the script)
> and see if it fixes it. (If so it will show up in /proc/partitions
> and be mountable.)

Running that script makes /dev/md[1..3] come up in non-degraded mode,
md0 stays in degraded mode. LVM is not started automatically. After
typing 'lvm vgscan ; lvm vgchange -ay', my LVMs come up and I can
continue booting with CTRL-D.

Interestingly, this seems to happen most of the time. Just before, I
booted with mdadm_2.5.6-7ubuntu3 and the system came up just fine. Note
that this has happened exactly once to me so far.

--
Gruesse/greetings,
Reinhard Tartler, KeyID 945348A4

Revision history for this message
Ian Jackson (ijackson) wrote : Re: [Bug 75681] Re: initramfs script: race condition between sata and md

Reinhard Tartler writes ("Re: [Bug 75681] Re: initramfs script: race condition between sata and md"):
> Ian Jackson <email address hidden> writes:
> > 2. Run
> > /scripts/local-top/mdadm from-udev
> > (NB `from-udev' is an argument you must pass to the script)
> > and see if it fixes it. (If so it will show up in /proc/partitions
> > and be mountable.)
>
> Running that script makes /dev/md[1..3] come up in non-degraded mode,
> md0 stays in degraded mode. LVM is not started automatically. After
> typing 'lvm vgscan ; lvm vgchange -ay', my LVMs come up and I can
> continue booting with CTRL-D.

udev is supposed to do all of these things. It's supposed to
automatically activate the LVM when the md devices appear, too. I
don't know why it isn't, yet.

Could you please boot with break=premount, and do this:
 udevd --verbose >/tmp/udev-output 2>&1 &
 udevtrigger
At this point I think you will find that your mds and lvms
are not activated properly; check with
 cat /proc/partitions

If as I assume they aren't:
 pkill udevd
 udevd --verbose >/tmp/udev-output2 2>&1 &
        /scripts/local-top/mdadm from-udev

And now your mds should all be activated but hopefully the LVM bug
will recur. So:
 pkill udevd
 lvm vgscan
 lvm vgchange -a y
 mount /dev/my-volume-group/my-volume-name /root
        cp /tmp/udev-output* /root/root/.
        exit

And then when the system boots attach /root/udev-output* and your
initramfs to this bug report.

mdadm_2.5.6-7ubuntu4 ought to fix the fact that md0 comes up in
degraded mode but please don't upgrade to it yet as it may perturb the
other symptoms out of existence.

Thanks,
Ian.

Revision history for this message
Aurelien Naldi (aurelien.naldi) wrote : Re: initramfs script: race condition between sata and md

OK, here are the results of some more tests with mdadm_2.5.6-7ubuntu4:

* My first boot went fine, allowing me to get your advice.
* The second boot hung, so I waited for a while. All my devices were there, both in /dev and in /proc/partitions.
The RAID array was assembled with 3 devices (out of 4) but not started (i.e. not degraded).
Running "/scripts/local-top/mdadm from-udev" did not work; the screen was filled with error messages:
"mdadm: SET_ARRAY_INFO failed for /dev/md0: Device or resource busy"
I could stop it only using sysrq, which took me back to the initramfs shell, with the raid array completely stopped. There, running the same command again assembled the array correctly.
[off topic] I could not leave the shell and continue booting; it took me back to this shell every time, saying:
"can't access tty; job control turned off"
[/off topic]
* The third boot hung too. The array was assembled with one drive only (thus it could not even have started in degraded mode). I tried running "mdadm -S /dev/md0" before doing anything else.
After this, running "/scripts/local-top/mdadm from-udev" assembled the array properly.

[off topic] But I still could not continue the boot. The next 4 boots also hung (but I did not wait). Then it accepted to boot again. [/off topic]

Revision history for this message
Reinhard Tartler (siretart) wrote :

Ian Jackson <email address hidden> writes:

> Could you please boot with break=premount, and do this:
> udevd --verbose >/tmp/udev-output 2>&1 &
> udevtrigger
> At this point I think you will find that your mds and lvms
> are not activated properly; check with
> cat /proc/partitions

The thing is that I cannot reproduce the problem this way, because my
raid and lvm devices come up as expected. The visible difference is
that I get tons of output on the screen after typing
udevtrigger. I suspect that it slows down udev's operation, which avoids
the race condition. Is there some way to avoid this massive output on
the console while preserving the debug info in the temp file?

Changed in mdadm:
status: Fix Committed → Confirmed
Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :

The problem here is clearly that mdadm is being run too early, and is trying to assemble RAIDs that are not yet complete.

Either mdadm needs to be fixed so that doing this is possible and harmless, like it is for lvm, etc., or the script that calls mdadm needs to check whether it is safe to call mdadm yet.

Revision history for this message
Aurelien Naldi (aurelien.naldi) wrote :

With the --no-degraded option that has been added, it IS harmless for the array itself, but it can still block the boot.

If an array has been assembled too early, it should be stopped (running "mdadm -S /dev/md0" by hand worked for me) BEFORE trying to assemble it again; another race may be present here?

mdadm can tell if a device is part of an array; this could be used to check that all devices are present before assembling it, but it would slow things down or require some memory.

If a device is really missing (shit happens...), the array should then be assembled in degraded mode. This should probably happen automatically (or with a confirmation) after the timeout, or by adding a boot option.

Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :

How can it be assembled too early if --no-degraded is given?

Surely with that option, mdadm doesn't assemble the array if some devices are missing, instead of part-assembling it in degraded mode?

Revision history for this message
Aurelien Naldi (aurelien.naldi) wrote :

Perhaps it is another bug on my particular system...
I have written about it in previous comments; when trying to assemble /dev/md0 I get three different results:
* all devices of the array are available: /dev/md0 is created and working
* one device is missing (and --no-degraded is _not_ specified): /dev/md0 is created and working in degraded mode
* several devices are missing (or only one with --no-degraded): /dev/md0 is still created but _not_ working. I get a message like "too many missing devices, not starting the array", but it still appears in /proc/mdstat (not running but present; it may not be completely assembled but it is there!). I cannot assemble it until it has been fully stopped using "mdadm -S /dev/md0".

This looks weird, and I had not seen this before, but I had not tried to launch the array with 2 missing drives...

Revision history for this message
Jyrki Muukkonen (jvtm) wrote :

Can confirm this with a fresh installation from daily/20070301/feisty-server-amd64.iso.

Setup:
- only one disk
- md0 raid1 mounted as / (/dev/sda1 + other mirror missing, the installation ui actually permits this)
- md1 raid1 unused (/dev/sda3 + other mirror missing)

On the first boot I got to the initramfs prompt, with only md1 active:
mdadm: No devices listed in conf file were found.
stdin: error 0
Usage: modprobe ....
mount: Cannot read /etc/fstab: No such file or directory
mount: Mounting /root/dev ... failed
mount: Mounting /sys ... failed
mount: Mounting /proc ...failed
Target filesystem doesn't have /sbin/init

/bin/sh: can't access tty; job control turned off
(initramfs)

Second try gave the same prompt, but now md0 was active. However, it didn't boot:

(initramfs) cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sda1[0]
      15623104 blocks [2/1] [U_]

Finally, on the third try, it booted (well, got some warnings about missing /dev/input/mice or something like that, but that's not the point here).

Now I'm just sticking with / as a normal partition (+ others like /home as raid1). I'm hoping that the migration to raid1 goes fine after this problem has been fixed.

Revision history for this message
octothorp (shawn-leas) wrote :

I have been dealing with this for a while now, and it's a udev problem.

Check out the following:
https://launchpad.net/bugs/90657
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=403136

I have four identical SATA drives; sometimes the sd[a,b] device nodes fail to show up, sometimes they show up but are lagged, and sometimes it's sd[c,d]...

I have to boot with break=premount and cobble things together manually, then exit the shell and let it boot.

Revision history for this message
octothorp (shawn-leas) wrote :

This is very much udev's fault. A fix is reportedly in debian's 105-2 version.

Revision history for this message
octothorp (shawn-leas) wrote :

Is anyone listening???

Revision history for this message
Aurelien Naldi (aurelien.naldi) wrote :

I would surely test a newer udev package if it fixes this problem. Using apt-get source to rebuild the debian package should be fairly easy, but I guess ubuntu maintains a set of additional patches, and merging them might be non-trivial. Are any of the ubuntu devs willing to upload a test package somewhere?

Reading the debian bug, I am not convinced it is the same problem; on my system I have SOME devices created and others lagging behind, while the debian bug seems to be about devices not being created at all.

If the newer udev does not fix this problem, is it doable to detect "stalled" RAID arrays and stop them?
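(On the last point, a rough sketch of what "detect and stop a stalled array" could look like from the initramfs shell, based on the manual "mdadm -S" fix mentioned earlier; the grep pattern is an assumption about how inactive arrays appear in /proc/mdstat:)

    # Stop any md array reported as inactive, then retry assembly.
    for md in $(grep '^md' /proc/mdstat | grep inactive | cut -d' ' -f1); do
        mdadm --stop "/dev/$md"
    done
    /scripts/local-top/mdadm from-udev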

Revision history for this message
octothorp (shawn-leas) wrote : Re: [Bug 75681] Re: boot-time race condition initializing md

It seems it's all related to SATA initialization.

udev does not build like other packages, and I'd hate to miss something
about building it and then wreck my system totally.

On 3/13/07, Reinhard Tartler <email address hidden> wrote:
>
> ** Changed in: mdadm (Ubuntu Feisty)
> Target: None => 7.04-beta
>
> --
> boot-time race condition initializing md
> https://launchpad.net/bugs/75681
>

Revision history for this message
pjwigan (pjwigan) wrote :

I've just tried Herd 5 x86 and hit a similar issue; only this box has no RAID capability.

Setup is:
  - 1 disk (/dev/sda1)
  - 1 DVD+RW (/dev/scd0)

Attempting to boot from the live CD consistently gives me:

  udevd-event[2029]: run_program: '/sbin/modprobe' abnormal exit

  BusyBox v1.1.3 (Debian ...

  /bin/sh: can't access tty; job control turned off
  (initramfs)

I'll try the 64 bit version and report back

Revision history for this message
octothorp (shawn-leas) wrote :

From what I've seen, I don't think this is the same bug. In your case it's
modprobe's fault by extension, probably caused by a module not loading
for whatever reason.

Caveat: I have not done bug hunting for your bug. I just don't think it's
this one.

On 3/13/07, pjwigan <email address hidden> wrote:
>
> I've just tried Herd 5 x86 and hit a similar issue; only this box has no
> RAID capability.
>
> Setup is:
> - 1 disk (/dev/sda1)
> - 1 DVD+RW (/dev/scd0)
>
> Attempting to boot from the live CD consistently gives me:
>
> udevd-event[2029]: run_program: '/sbin/modprobe' abnormal exit
>
> BusyBox v1.1.3 (Debian ...
>
> /bin/sh: can't access tty; job control turned off
> (initramfs)
>
>
> I'll try the 64 bit version and report back
>
> --
> boot-time race condition initializing md
> https://launchpad.net/bugs/75681
>

Revision history for this message
pjwigan (pjwigan) wrote :

Thanks for the tip. Having dug deeper, 84964 is a perfect match.

Revision history for this message
Eamonn Sullivan (eamonn-sullivan) wrote :

I got hit with something that sounds very similar to this when upgrading to the 2.6.20-10-server kernel. My system works fine on -9. I ended up stuck in busybox with no mounted drives. I'm using a DG965 Intel motherboard with three SATA hard disks. The following are details on my RAID5 setup and LVM. My /boot partition is non-raid on the first SATA disk. What else do you need?

        Version : 00.90.03
  Creation Time : Sun Mar 4 14:26:40 2007
     Raid Level : raid5
     Array Size : 617184896 (588.59 GiB 632.00 GB)
    Device Size : 308592448 (294.30 GiB 316.00 GB)
   Raid Devices : 3
  Total Devices : 3
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Wed Mar 14 07:24:03 2007
          State : clean
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           UUID : 9a1cfa02:4eddd96e:18354ce8:e82aff38
         Events : 0.13

    Number   Major   Minor   RaidDevice   State
       0       8       1        0         active sync   /dev/sda1
       1       8      17        1         active sync   /dev/sdb1
       2       8      33        2         active sync   /dev/sdc1

And here's my lvm setup:

  --- Logical volume ---
  LV Name /dev/snape/root
  VG Name snape
  LV UUID 5j5Rh6-YeSD-vr3p-T1ci-RPoi-fpuI-t1k40O
  LV Write Access read/write
  LV Status available
  # open 1
  LV Size 20.00 GB
  Current LE 5120
  Segments 1
  Allocation inherit
  Read ahead sectors 0
  Block device 254:0

  --- Logical volume ---
  LV Name /dev/snape/tmp
  VG Name snape
  LV UUID NP3pnf-uPAz-3jdd-fK6n-A62K-KGqU-UXQviZ
  LV Write Access read/write
  LV Status available
  # open 1
  LV Size 20.00 GB
  Current LE 5120
  Segments 1
  Allocation inherit
  Read ahead sectors 0
  Block device 254:1

  --- Logical volume ---
  LV Name /dev/snape/var
  VG Name snape
  LV UUID 1sxfyk-b22f-ajmE-rtdg-Sg0h-hR2q-MjMzHi
  LV Write Access read/write
  LV Status available
  # open 1
  LV Size 250.00 GB
  Current LE 64000
  Segments 1
  Allocation inherit
  Read ahead sectors 0
  Block device 254:2

  --- Logical volume ---
  LV Name /dev/snape/home
  VG Name snape
  LV UUID 52zbNh-vfpr-UKTZ-moXR-fVEN-Zqzk-061JTK
  LV Write Access read/write
  LV Status available
  # open 1
  LV Size 298.59 GB
  Current LE 76439
  Segments 1
  Allocation inherit
  Read ahead sectors 0
  Block device 254:3

Revision history for this message
Ian Jackson (ijackson) wrote :

pjwigan writes ("[Bug 75681] Re: boot-time race condition initializing md"):
> udevd-event[2029]: run_program: '/sbin/modprobe' abnormal exit

I think this is probably a separate problem. Are you using LILO ?
Can you please email me your lilo.conf ? (Don't attach it to the bug
report since I want to avoid confusing this report.)

Ian.

Revision history for this message
pjwigan (pjwigan) wrote :

The issue (which appears to be bug #84964 BTW) occurs
when trying to boot from the Herd 5 live CD, whether
32 or 64 bit. The PC has an up to date standard
install of 32 bit Edgy, so LILO is not involved.

One oddity tho': the udevd-event line only appears on
my secondary monitor. They usually display exactly
the same text until X starts.

--- Ian Jackson <email address hidden> wrote:

> pjwigan writes ("[Bug 75681] Re: boot-time race
> condition initializing md"):
> > udevd-event[2029]: run_program: '/sbin/modprobe'
> abnormal exit
>
> I think this is probably a separate problem. Are
> you using LILO ?
> Can you please email me your lilo.conf ? (Don't
> attach it to the bug
> report since I want to avoid confusing this report.)
>
> Ian.
>
> --
> boot-time race condition initializing md
> https://launchpad.net/bugs/75681
>


Revision history for this message
octothorp (shawn-leas) wrote :

He has already discovered that his problem likely corresponds to a different
bug already in the system.

On 3/14/07, Ian Jackson <email address hidden> wrote:
>
> pjwigan writes ("[Bug 75681] Re: boot-time race condition initializing
> md"):
> > udevd-event[2029]: run_program: '/sbin/modprobe' abnormal exit
>
> I think this is probably a separate problem. Are you using LILO ?
> Can you please email me your lilo.conf ? (Don't attach it to the bug
> report since I want to avoid confusing this report.)
>
> Ian.
>
> --
> boot-time race condition initializing md
> https://launchpad.net/bugs/75681
>

Revision history for this message
octothorp (shawn-leas) wrote :

I suggest breaking in and "sh -x"ing the script that loads the modules,
tacking "-v" onto any modprobe lines, and capturing kernel messages
using a serial console or something.

You'll need to identify, since there's a hard failure at a consistent
location, where it's failing.

We should probably move further discussion over to that bug before this one
loses its specificity.

On 3/14/07, pjwigan <email address hidden> wrote:
>
> The issue (which appears to be bug #84964 BTW) occurs
> when trying to boot from the Herd 5 live CD, whether
> 32 or 64 bit. The PC has an up to date standard
> install of 32 bit Edgy, so LILO is not involved.
>
> One oddity tho': the udevd-event line only appears on
> my secondary monitor. They usually display exactly
> the same text until X starts.
>

Revision history for this message
Eamonn Sullivan (eamonn-sullivan) wrote :

Just a note that 2.6.20-11-server appears to have solved this issue for me. The server is booting normally after today's update.

Revision history for this message
octothorp (shawn-leas) wrote :

Huh. I'll try.

On 3/16/07, Eamonn Sullivan <email address hidden> wrote:
>
> Just a note that 2.6.20-11-server appears to have solved this issue for
> me. The server is booting normally after today's update.
>
> --
> boot-time race condition initializing md
> https://launchpad.net/bugs/75681
>

Revision history for this message
John Mark (johnmark) wrote :

Just wanted to confirm that the -11 kernel update did not fix the problem for my black macbook. The boot hang first started with -10, and -9 still works fine.

As described previously, the boot stops with ATA errors and dumps me into an initramfs shell.

I hope this is fixed soon :(

Revision history for this message
octothorp (shawn-leas) wrote :

I also still have the issue. Intel QX6700 on a gigabyte GA-965P-DS3.

On 3/17/07, John Mark <email address hidden> wrote:
>
> Just wanted to confirm that the -11 kernel update did not fix the
> problem for my black macbook. The boot hang first started with -10 and
> -9 still works fine.
>
> As described previously, the boot stops with ATA errors and dumps me
> into an initramfs shell.
>
> I hope this is fixed soon :(
>
> --
> boot-time race condition initializing md
> https://launchpad.net/bugs/75681
>

Revision history for this message
Jeff Balderson (jbalders) wrote :

FWIW, I have the problem, exactly as described, with a IDE-only system.

I attempted this with both an Edgy->Feisty upgrade (mid February) and a fresh install of Herd5.

I'm running straight RAID1 for all my volumes:

md0 -> hda1/hdc1 -> /boot
md1 -> hda5/hdc5 -> /
md2 -> hda6/hdc6 -> /usr
md3 -> hda7/hdc7 -> /tmp
md4 -> hda8/hdc8 -> swap
md5 -> hda9/hdc9 -> /var
md6 -> hda10/hdc10 -> /export

EVERY time I boot, mdadm complains that it can't build my arrays. It dumps me in the busybox shell where I discover that all /dev/hda* and /dev/hdc* devices are missing. I ran across a suggestion here: http://ubuntuforums.org/showpost.php?p=2236181&postcount=5. I followed the instructions and so far, it's worked perfectly at every boot (about 5 times now).

Other than just to post a "me too", I thought my comments might help give a possible temporary workaround, as well as document the fact this isn't just a SATA problem.

Revision history for this message
Ronny Becker (ronny-becker) wrote :

I also confirm this problem on a non-SATA system.

On booting it dumps me into a busybox. Like other users, I do not see any hard drive devices in /dev, so the raid array could not be created. I also tried to use the solution above (adding various modules to /etc/initramfs-tools/modules), but it did not work.

I installed with the alternate CD and I was wondering why my IDE drives (normally hda/hdb) are shown as scsi drives (sda/sdb). So I tried to change /etc/mdadm/mdadm.conf according to this information (with the new drive names), but that did not work either.

For me it looks like a problem with how kernel modules are loaded, or whether they are built in or not. If I get the time, I will try another distribution with kernel 2.6.20 to see if the system generally works with this kernel.

Further information: Dell Optiplex 240, 2 IDE hard drives, raid1 & lvm, fresh install of herd-5 (alternate CD); it ran edgy before without any problems.

Revision history for this message
Eamonn Sullivan (eamonn-sullivan) wrote :

-11 and -12 are working for me now (3 SATA drives configured as RAID5 on a DG965 Intel motherboard), but one thing I did differently that *may* be relevant is that I enabled the menu and added a 15-second delay in /boot/grub/menu.lst. I did this so that I have time to change to an earlier kernel during the boot process, but I wonder if that delay is giving the BIOS time to discover everything, even before the init process begins?
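(For reference, the menu.lst change described above amounts to something like the following GRUB legacy excerpt; whether the delay itself is what helps is still an open question:)

    # /boot/grub/menu.lst (excerpt)
    timeout         15
    # hiddenmenu            # commented out so the menu is shown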

Revision history for this message
Ronny Becker (ronny-becker) wrote :

Additional information that's maybe interesting for getting the problem solved:
Because I use raid1 with LVM, Ubuntu uses lilo as the bootloader. So I don't think it makes a difference whether grub or lilo is used.

Furthermore, I already used this system with edgy, without any additional delay or anything else. So in my case it's improbable that the system needs any delay before booting the kernel. Of course this is possible for some systems with additional hard drive controllers (or something like that).

Revision history for this message
octothorp (shawn-leas) wrote :

I guess I'll give it a shot, but the fact that linux does not rely on the
BIOS for SATA or IDE detection and initialization, and the fact that
different versions of software have given me differing results, would suggest
your problem might be different, maybe power related (spin-up)?

I cannot draw on any past experience with my system, as edgy did not support
my SATA. I have a QX6700 on a Gigabyte GA-965P-DS3, so things are pretty
new.

On 3/21/07, Ronny Becker <email address hidden> wrote:
>
> Additional Information that's maybe interesting to get the problem solved:
> Because I use raid1 with LVM Ubuntu uses lilo as Bootloader. So I don't
> think that it makes a difference if grub or lilo is used.
>
> Furthermore I used my system already with edgy - without additional
> delay or something else. So in my case it's improbably that the system
> needs any delay before booting the kernel. Of course this is possible
> for some systems with additional hard drive controllers (or something
> like that).
>
> --
> boot-time race condition initializing md
> https://launchpad.net/bugs/75681
>

Revision history for this message
Reinhard Tartler (siretart) wrote :

After adding a 'sleep 5' after udevtrigger, I noticed that mdadm sometimes segfaults on my machine, with the result that not all volumes start up. Starting /scripts/local-top/mdadm from-udev manually resulted in having all 4 raid volumes up, however.

Even in the cases where all 4 volumes do come up as expected, the lvm volume groups aren't started automatically most of the time; I have to issue 'lvm vgscan && lvm vgchange -ay' by hand.

Revision history for this message
Jeffrey Knockel (jeff250) wrote :

I suspect that this is the issue I'm experiencing with my desktop. My root partition is /dev/md0, but after upgrading to feisty, my system no longer boots and throws me to busybox. I noticed in busybox that my sata drives are missing from /dev, which explains why the array isn't being mounted.

One workaround that seems to work so far (although I have not thoroughly tested it) is booting the kernel with the option 'break=mount'. This seems to somehow fix everything, as by the time I get thrown to busybox, my drives are already in /dev, as well as my raid array, so I just press ctrl+d to continue booting.

Revision history for this message
Jeffrey Knockel (jeff250) wrote :

I tried upgrading to 0.105-3 from debian unstable, and it fixed this bug for me (I used "prevu" to accomplish this). Devices for my drives are now created correctly, and /dev/md0 is now created and mounted correctly at boot-up. Unfortunately, as a side-effect, network-manager is no longer seeing my wireless card now (gah!). I'll have to investigate this, but I wasn't anticipating upgrading to debian unstable's version to be painless.

Revision history for this message
Jeffrey Knockel (jeff250) wrote :

I doubt that this is the ideal place to say this, but in case anyone else goes down this path, getting my wireless working again involved copying /lib/udev/firmware-helper from the old feisty udev package to the same location once the new udev from debian unstable is installed. Apparently the debian udev package does not include this file. This goes to show that ubuntu's udev package is indeed very tricked out. ;-)

Revision history for this message
Aurelien Naldi (aurelien.naldi) wrote :

I have just tested upgrading to the debian udev package and (besides being painful) it completely broke my system.
I think that people here are experiencing two different bugs, as _for_me_ the /dev/sd* files are created normally, just with a few seconds of delay, leaving mdadm blocked after trying to assemble an array for which some devices are still missing.

* I could not use prevu to upgrade to the debian udev package as its version number (0.105) is lower than that of the ubuntu package (103), so lvm2 complains that it would break this old version of udev.
* After bumping the version number, building the packages by hand and rebooting, I got stuck and _NO_ devices were created at all. Maybe some ubuntu patches need to be applied to make it work??
* After reinstalling the older version (thanks to another working feisty install), I could boot again. But booting still requires several attempts, or adding the "break=mount" option and running the mdadm script once my devices are created.

One thing has changed since my last comment: if I wait for the busybox shell, I no longer see a half-created /dev/md0, so something improved at least ;)

Revision history for this message
Reinhard Tartler (siretart) wrote :

In order to verify the assumption that a new udev upstream version would/could fix this issue, I packaged the newer upstream version 105 for edgy. You can find the source package here: http://siretart.tauware.de/udev-edgy-test/udev_105-0ubuntu1.dsc.

To install it: `dget -x http://siretart.tauware.de/udev-edgy-test/udev_105-0ubuntu1.dsc && cd udev-105 && debuild -b`

My first boot showed exactly the same behavior as described in this bug, so I don't think that it is fixed there. I'm on amd64, btw.

Revision history for this message
Jeffrey Knockel (jeff250) wrote :

Reinhard Tartler, am I correct in assuming that you forgot to upload udev_105.orig.tar.gz? Otherwise, where are we expected to get it? Thanks.

To mirror others' sentiments, I do suspect that there are a plurality of bugs with similar symptoms being voiced here.

Revision history for this message
octothorp (shawn-leas) wrote :

I'm very interested in finally testing 105, and the missing udev_105.orig.tar.gz is a bit of a challenge, but at least there's a diff.

As I recall, from my initial attempt at building a deb for 105, there was superficially only one reject. I was just unsure as to how to properly resolve it.

Revision history for this message
Ian Jackson (ijackson) wrote :

Reinhard, you may remember that on irc I asked you to try moving
 /usr/share/initramfs-tools/scripts/local-top/mdrun
aside and rebuilding your initramfs. Did you try this in the end ?
(Please move it somewhere not under /usr/share/initramfs-tools/.)

My supposition is that it is this script which is causing the problem
(for you at least). As far as I can tell, if it runs at a point when
some but not all of the disks in the root raid are available, it will
activate the raid in degraded mode.

Then, later, when the remaining devices turn up, it is too late and
the I-hope-race-proofed script (scripts/local-top/mdadm) runs but
cannot fix it.

So, if you have the file
 /usr/share/initramfs-tools/scripts/local-top/mdadm
then I think it should be safe and correct to move aside the mdrun
script.

Ian.
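(For anyone following along, the concrete steps being suggested are roughly the following; the backup directory is just an example location outside /usr/share/initramfs-tools:)

    mkdir -p /root/initramfs-backup
    mv /usr/share/initramfs-tools/scripts/local-top/mdrun /root/initramfs-backup/
    update-initramfs -u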

Revision history for this message
Amon_Re (ochal) wrote :

My bugreport might be related to this one, Bug #95280.
(https://launchpad.net/ubuntu/+bug/95280)

Revision history for this message
octothorp (shawn-leas) wrote :

The fact that I have been able to boot twice since yesterday's updates is tacit confirmation that something got either mitigated or fixed, and I believe it to have been a bit of a reorganization of the scripting ubuntu uses to help udev initialize things.

I noticed an initramfs-tools update just as I was going to test a udev-105 ubuntu .deb, so I applied that update, and have been able to boot since then.

I'll continue to test things out and report on if my "me too" in this bug should be considered resolved.

Revision history for this message
Ian Jackson (ijackson) wrote : Re: [Bug 75681] Re: initramfs script: race condition between sata and md

Earlier, I suggested a data-collection exercise involving udevd
--verbose, but this didn't do much good because the race went away
when testing in this way.

I think this was probably due to udev writing to the console. I have
now prepared a version of udev which can be made not to write to the
console and I would appreciate it if people (particularly Reinhard)
who are still having this problem would carry out the following test:

1. Install
  http://www.chiark.greenend.org.uk/~ian/d/udev-nosyslog/udev_103-0ubuntu14~iwj1_i386.deb
  (Sources can be found alongside, at
    http://www.chiark.greenend.org.uk/~ian/d/udev-nosyslog/)

2. Boot with break=premount

3. At the initramfs prompt:
  udevd --verbose --suppress-syslog >/tmp/udev-output 2>&1 &
  udevtrigger

At this point I hope you will find that your root raid is degraded
(ie, that the bug has happened); check with
  cat /proc/partitions
        mdadm -D /dev/md0

4. Mount your root filesystem and copy the debug data to it:
        pkill udevd
        mount /dev/my-volume-group/my-volume-name /root
        cp /tmp/udev-output* /root/root/.
        exit

When your system boots up, please attach /root/udev-output to this
bug.

Thanks,
Ian.

Revision history for this message
Aurelien Naldi (aurelien.naldi) wrote :

Here is my output for udevd. I needed several boots to get it, so I hope it helps ;)

Revision history for this message
Ian Jackson (ijackson) wrote : Re: [Bug 75681] Re: boot-time race condition initializing md

Aurelien Naldi writes ("[Bug 75681] Re: boot-time race condition initializing md"):
> Here is my output for udevd. I needed several boots to get it, so I hope
> it helps ;)

Thanks, that's going to help a lot I think. Can you please put up a
copy of your initramfs (/initrd.img) too ? It'll probably be too
large to mail and I'm not sure LP will take it, so you probably want
to put it on a webserver somewhere.

Ideally it would be the initramfs from the same test run as the log
but another will do.

Also, to help me understand this log, can you describe your raid and
LVM setup ?

Ian.

Revision history for this message
Aurelien Naldi (aurelien.naldi) wrote :

I am not on this system right now so I can't upload my initramfs yet;
I will probably do it later tonight.
I have four SATA drives and two installations (i386 and AMD64, both
feisty; this is the i386 one, the AMD64 one is not up to date and I
use it as a rescue system).

Every hard drive has a very similar partitioning scheme:
 * boot partition (two installs, each has a backup of its main
"/boot", not exactly up to date as I have been too lazy to write a
small script)
 * swap partition or small temporary data partition
 * the main RAID partition

I have a single RAID5 array, md0, using sda3, sdb3, sdc3 and sdd3.
md0 is an LVM group (vg_raid if I recall correctly) and contains three volumes:
the "root" partition for each of my two installs and a big shared "/home".

Is there anything else that might help you, besides my initramfs?
I guess you do not need the mdadm config file as it is included...

On 3/28/07, Ian Jackson <email address hidden> wrote:
> Aurelien Naldi writes ("[Bug 75681] Re: boot-time race condition initializing md"):
> > Here is my output for udevd. I needed several boots to get it, so I hope
> > it helps ;)
>
> Thanks, that's going to help a lot I think. Can you please put up a
> copy of your initramfs (/initrd.img) too ? It'll probably be too
> large to mail and I'm not sure LP will take it, so you probably want
> to put it on a webserver somewhere.
>
> Ideally it would be the initramfs from the same test run as the log
> but another will do.
>
> Also, to help me understand this log, can you describe your raid and
> LVM setup ?
>
> Ian.
>
> --
> boot-time race condition initializing md
> https://launchpad.net/bugs/75681
>

--
aurelien naldi

Revision history for this message
Ian Jackson (ijackson) wrote :

Aurelien Naldi writes ("Re: [Bug 75681] Re: boot-time race condition initializing md"):
> I am not on this system right now so I can't upload my initramfs yet,
> I will probably do it later tonight.

Thanks.

> Every hard drive has a very similar partitionning scheme:
> * boot partition (two installs, each has a backup of its main
> "/boot", not exactly up to date as I have been too lazy to write a
> small script)
> * swap partition or small temporary data partition

 sd?1 and sd?2 respectively, I take it ?

> I have a single RAID5 system, md0 using sda3, sdb3, sdc3 and sdd3
> md0 is a LVM group, (vg_raid if I recall correctly) and contains
> three volumes: the "root" partition for my two install and a big
> shared "/home"
>
> is it any other think that may help you, beside my initramfs ?

Do you recall in what state the md0 raid came up ? The symptom as I
understand it is that it came up degraded ? Looking at the log, the
disks were found in the order sda3 sdd3 sdb3 sdc3 so if it came up
degraded was sdc3 missing ?

And the LVM works properly now ?

> I guess you do not need the mdadm config file as it is included...

Quite so.

Ian.

Revision history for this message
Aurelien Naldi (aurelien.naldi) wrote :

On 3/28/07, Ian Jackson <email address hidden> wrote:
> > Every hard drive has a very similar partitionning scheme:
> > * boot partition (two installs, each has a backup of its main
> > "/boot", not exactly up to date as I have been too lazy to write a
> > small script)
> > * swap partition or small temporary data partition
>
> sd?1 and sd?2 respectively, I take it ?

yes sd?1 are boot partitions and sd?2 swap/other

> Do you recall in what state the md0 raid came up ? The symptom as I
> understand it is that it came up degraded ? Looking at the log, the
> disks were found in the order sda3 sdd3 sdb3 sdc3 so if it came up
> degraded was sdc3 missing ?

With older versions (of mdadm, I think) the RAID was assembled but
degraded, or kind of assembled but not startable (when only 1 or 2 disks
were present at first). A "--no-degraded" option was added in the
mdadm script to avoid this. Now the RAID does not try to start when
drive(s) are missing (i.e. it is never assembled in degraded mode), but
weird things are still happening.
I do not recall which devices were in the "assembled" array when I
launched this one, but the RAID was _not_ degraded. But I guess you
have it right: I sometimes had enough time to see only a few of my
drives in /proc/partitions, and the array was assembled with those. Once
all the devices are there, running /scripts/local-top/mdadm again fixes
it all.

>
> And the LVM works properly now ?

After running udevd and udevtrigger, I checked that things were well
broken (it had worked fine in my first two attempts) and then all I had
to do was run "/scripts/local-top/mdadm" once more and the array
was correctly assembled.

I have not started LVM by hand; I copied the udevd.out to one of my
normal partitions, then exited the shell and it finished booting fine.

One more comment: I have tested some other suggested workarounds:
- removing the mdrun script did not help (the first thing this script
does is check for the presence of the mdadm one and then exit)
- adding a timeout in "/usr/share/initramfs-tools/init" works fine for
me: 5 seconds are not enough but 8 are OK ;)

--
aurelien naldi

Revision history for this message
Ian Jackson (ijackson) wrote :

Aurelien Naldi writes ("Re: [Bug 75681] Re: boot-time race condition initializing md"):
> With older versions (of mdadm I think) the RAID was assembled, but
> degraded or kind of assembled but not startable (when only 1 or 2 disk
> were present at first). a "--no-degraded" option was added in the
> mdadm script to avoid this. Now the RAID does not try to start when
> drive(s) are missing (i.e. is never assembled in degraded mode) but
> weird things are still hapening.

OK, so the main symptom in the log that I'm looking at was that you
got an initramfs prompt ?

> I do not recall which devices were in the "assembled" array when I
> launched this one, but the RAID was _not_ degraded. But I guess you
> have it right: I had sometime enough time to see in /proc/partition
> only a few of my drives, and the array was assembled with these. Once
> all the devices are there, runing /scripts/local-top/mdadm again fixes
> it all.

Right. Did you have to rerun vgscan ?

> after running udevd and udevtrigger, I checked that things were well
> broken (it works too fine in my two first attempts) and then all I had
> to do is running "/scripts/local-top/mdadm" once more and the array
> was correctly assembled.

Right.

> I have not started LVM by hand, I copied the udevd.out to one of my
> normal partition, then exited the shell and it finished to boot fine.

You mounted your root fs by hand, though ?

> One more comment: I have tested some other suggested workarround:
> - removing the mdrun script did not help (the first thing done by this
> script is to check the presence of the mdadm one and then to exit)

I think you must have a newer version than the one I was looking at.

> - adding a timeout in "/usr/share/initramfs-tools/init" works fine for
> me: 5 seconds are not enough but 8 are OK ;)

Hmm.

Ian.

Revision history for this message
Oliver Brakmann (obrakmann) wrote :

> > One more comment: I have tested some other suggested workarround:
> > - removing the mdrun script did not help (the first thing done by this
> > script is to check the presence of the mdadm one and then to exit)
>
> I think you must have a newer version than the one I was looking at.

He's right though. This bug broke my system after I upgraded to Feisty
and I looked at the scripts, too. The check is there.
The initramfs-tools version installed here is 0.85eubuntu6

> > - adding a timeout in "/usr/share/initramfs-tools/init" works fine for
> > me: 5 seconds are not enough but 8 are OK ;)
>
> Hmm.

Same here. I added a "sleep 3" to the mdadm script and my system now
consistently boots. Previously, I got an error message from mdadm that
it couldn't find the devices that made up my RAID.

Might adding udevsettle somewhere help in this regard? I thought it was
meant for just such a purpose?
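(If someone wants to try that, a minimal sketch would be to call udevsettle near the top of /usr/share/initramfs-tools/scripts/local-top/mdadm and rebuild the initramfs; this assumes the udevsettle binary is actually present in the initramfs, which may not be the case:)

    # Wait for pending udev events to be processed before assembling.
    [ -x /sbin/udevsettle ] && /sbin/udevsettle --timeout=30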

Revision history for this message
Aurelien Naldi (aurelien.naldi) wrote :

On 3/28/07, Ian Jackson <email address hidden> wrote:
> Aurelien Naldi writes ("Re: [Bug 75681] Re: boot-time race condition initializing md"):
> > With older versions (of mdadm I think) the RAID was assembled, but
> > degraded or kind of assembled but not startable (when only 1 or 2 disk
> > were present at first). a "--no-degraded" option was added in the
> > mdadm script to avoid this. Now the RAID does not try to start when
> > drive(s) are missing (i.e. is never assembled in degraded mode) but
> > weird things are still hapening.
>
> OK, so the main symptom in the log that I'm looking at was that you
> got an initramfs prompt ?

a normal boot would have given me a busybox, yes, but I added
"break=premount" here, so the shell is not exactly a bug ;)
The bug is that my array was not _correctly_ assembled after running udevtrigger.

> > I do not recall which devices were in the "assembled" array when I
> > launched this one, but the RAID was _not_ degraded. But I guess you
> > have it right: I had sometime enough time to see in /proc/partition
> > only a few of my drives, and the array was assembled with these. Once
> > all the devices are there, runing /scripts/local-top/mdadm again fixes
> > it all.
>
> Right. Did you have to rerun vgscan ?

Not this time, but the array was present and exiting the shell
resulted in a working boot; LVM has no problems anymore for me.

> > after running udevd and udevtrigger, I checked that things were well
> > broken (it works too fine in my two first attempts) and then all I had
> > to do is running "/scripts/local-top/mdadm" once more and the array
> > was correctly assembled.
>
> Right.
>
> > I have not started LVM by hand, I copied the udevd.out to one of my
> > normal partition, then exited the shell and it finished to boot fine.
>
> You mounted your root fs by hand, though ?

No, I copied the output to one of my boot partitions, one that is _not_
in the RAID/LVM. I did it to avoid running vgscan and friends by hand,
to see if the next part of the boot goes fine or not.

Revision history for this message
Drew Hess (dhess) wrote :

Oliver Brakmann <email address hidden> writes:

> Same here. I added a "sleep 3" to the mdadm script and my system now
> consistently boots. Previously, I got an error message from mdadm that
> it couldn't find the devices that made up my RAID.

Ditto here, running up-to-date Feisty on amd64. I use "break=mount"
and hit ctrl-d almost immediately after the output stops, and
everything works fine. If I don't do that, 7 times out of 10 my
machine will hang for a few minutes and then drop to the (initramfs)
prompt, complaining that my /dev/mapper/main-root LVM root isn't
found. It's on a software RAID1, /dev/md1, and it comes up degraded
in these cases, even if I shut the machine down cleanly before the
reboot.

Here's my /proc/mdstat:

Personalities : [raid1]
md1 : active raid1 sda2[0] sdb2[1]
      312472192 blocks [2/2] [UU]

md0 : active raid1 sda1[0] sdb1[1]
      96256 blocks [2/2] [UU]

unused devices: <none>

/dev/md0 is my /boot, a software RAID1 running ext3fs.

d

Revision history for this message
Aurelien Naldi (aurelien.naldi) wrote :

My initramfs can be found here: http://gin.univ-mrs.fr/GINsim/download/initrd.img-2.6.20-13-generic
NOTE: it is not the one I booted with this morning, but I think I reverted all changes and built it again, so it should be pretty similar ;)

Revision history for this message
Ian Jackson (ijackson) wrote :

Aurelien Naldi writes ("Re: [Bug 75681] Re: boot-time race condition initializing md"):
> On 3/28/07, Ian Jackson <email address hidden> wrote:
> > OK, so the main symptom in the log that I'm looking at was that you
> > got an initramfs prompt ?
>
> a normal boot would have given me a busybox yes, but I added
> "break=premount here", so the shell is not exactly a bug ;) The bug
> is that my array was not _correctly_ assembled after running
> udevtrigger

I see. Err, actually, I don't see. In what way was the assembly of
the raid incorrect ? You say it wasn't degraded. Was it assembled at
all ? Was it half-assembled ?

> No, I copied the output to one of my boot partition, one that is _not_
> in the RAID/LVM. I did it to avoir running vgscan and friends by hand
> to see if the next part of the boot goes fine or not.

Right.

Ian.

Revision history for this message
Aurelien Naldi (aurelien.naldi) wrote :

On Wednesday 28 March 2007 at 22:55 +0000, Ian Jackson wrote:
> I see. Err, actually, I don't see. In what way was the assembly of
> the raid incorrect ? You say it wasn't degraded. Was it assembled at
> all ? Was it half-assembled ?

My memory does not serve me well, sorry!
In previous versions, the array was listed in /proc/mdstat with a set of
drives, but not really assembled; before running the mdadm script I had
to stop it... This bug is now gone: the array was _not_ assembled and I
could run the mdadm script directly ;)

Revision history for this message
Reinhard Tartler (siretart) wrote :

Ian Jackson <email address hidden> writes:

> Reinhard, you may remember that on irc I asked you to try moving
> /usr/share/initramfs-tools/scripts/local-top/mdrun
> aside and rebuilding your initramfs. Did you try this in the end ?

Yes, I tried that, with the effect that reproducibly none of the raid
devices come up at all :(

Revision history for this message
Ian Jackson (ijackson) wrote :

Aurelien, sorry to ask you to do this tedious test again, but I've
been looking at this logfile and I think I should have told you to use
`>>' rather than `>' when writing the log. Also, I really need to
know more clearly what the fault was and what you did to fix it (if
anything). And while I'm at it I'd like to rule out a possible
failure mode.

So here's some more data collection instructions:

0. General: please write down everything you do. It's very difficult
   to do this debugging at a distance and accurate, detailed and
   reliable information about the sequence of events will make life
   much easier. If you have two computers available, use one for
   making notes. Otherwise make notes on paper.

1. Edit
     /usr/share/initramfs-tools/scripts/local-top/mdadm
   and insert
     echo "running local-top/mdadm $*"
   near the top, just after `set -eu'.

2. Install
     http://www.chiark.greenend.org.uk/~ian/d/udev-nosyslog/udev_103-0ubuntu14~iwj1_i386.deb
   (Sources can be found alongside, at
     http://www.chiark.greenend.org.uk/~ian/d/udev-nosyslog/)
   If you had that installed already then because of step 1 you must
   say
     update-initramfs -u

3. Boot with break=premount

4. At the initramfs prompt:
  udevd --verbose --suppress-syslog >>/tmp/udev-output 2>&1 &
  udevtrigger
   and wait for everything to settle down.

At this point we need to know whether your root filesystem is there.
If it is (/dev/VG/LV exists) then the attempt to reproduce has failed.

If the attempt to reproduce the problem has succeeded:

5. Collect information about the problem symptoms
  (cat /proc/partitions; mdadm -Q /dev/md0; mdadm -D /dev/md0) >>/tmp/extra-info 2>&1

6. Write down what you do to fix it. Preserve
   /tmp/udev-output and /tmp/extra-info eg by copying them to
   your root filesystem. Eg:
        pkill udevd
        mount /dev/my-volume-group/my-volume-name /root
        cp /tmp/udev-output* /root/root/.
        exit

7. Please comment on this bug giving the following information:
     * What you did, exactly
     * What the outcomes were including attaching these files
        udev-output
        extra-info
     * A location where I can download the /initrd.img you were
       using.
     * A description of your raid and lvm setup, if you haven't given
       that already.

Once again, I'm sorry to put you to this trouble. I think it's
essential to fix this bug for the feisty release and I have been
poring over your logs and trying various strategies to reproduce what
I suspect might be relevant failure modes, but without significant
success so far.

Thanks,
Ian.

Revision history for this message
Ian Jackson (ijackson) wrote :

Reinhard Tartler writes ("Re: [Bug 75681] Re: boot-time race condition initializing md"):
> Yes, I tried that, with the effect that, reproducibly, none of the raid
> devices come up at all :(

I find this puzzling. I've double-checked your initramfs again
following other people's comments and the existence of the mdrun
script shouldn't matter nowadays. Anyway, if you can reliably
reproduce the problem this is good because it means I might be able to
fix it.

Can you please follow the instructions I've just given to Aurelien in
my last comment to this bug ? As I say I'm sorry to put you to all
this trouble - I know that repeatedly rebooting and messing about in
the initramfs are a PITA. But I really want to fix this bug.

Ian.

Revision history for this message
Reinhard Tartler (siretart) wrote :

octothorp <email address hidden> writes:

> I'm very interested in finally testing 105, and the missing
> udev_105.orig.tar.gz is a bit of a challenge, but at least there's a
> diff.

I'm terribly sorry, I've just uploaded the forgotten orig.tar.gz

Revision history for this message
Aurelien Naldi (aurelien.naldi) wrote :

OK, I really appreciate that you want to fix this and I really want to help, but this has been driving me nuts.
I can reproduce it *way too* reliably with a "normal" boot (i.e. it happens 8 times out of 10), but when trying to get a log it suddenly refused to happen :/
I started to think that the echo line added to the mdadm script caused enough of a lag to avoid the trap, so I rebooted without this line and could reproduce it and get a log at the first boot! Unfortunately I messed some things up and then tried to make a new one without much success.

Then I put the "echo running...." line back and tried again and again for half an hour to finally get one, but I forgot to use ">>" instead of ">" when redirecting udevd's output.
I give up for now as I have been unable to reproduce it since (except twice with a "normal" boot, but I was not willing to wait 3 minutes...)

so, here it is, I attach a tar.gz of the log I made, and I put my initramfs here: http://crfb.univ-mrs.fr/~naldi/initrd.img-2.6.20-13-generic

What I did:
boot with break=premount
launch udevd & udevtrigger, check that the bug happened, collect some info

to fix it:
/scripts/local-top/mdadm
lvm vgscan
lvm vgchange -a y

collect some more info

mount /dev/sda1 on /mnt # an ext2 partition for /boot
copy my log files here
umount /mnt
pkill udevd (sorry, I might have got nicer logs if I had stopped it before, right ?)
exit

watch it finish booting

Revision history for this message
Akmal Xushvaqov (uzadmin) wrote :

It doesn't upgrade the system. It shows me the following:

Failed to fetch http://uz.archive.ubuntu.com/ubuntu/dists/edgy-updates/Release.gpg Connection broken
Failed to fetch http://uz.archive.ubuntu.com/ubuntu/dists/edgy/Release.gpg Connection broken
Failed to fetch http://uz.archive.ubuntu.com/ubuntu/dists/edgy-backports/Release.gpg Connection broken
Failed to fetch http://kubuntu.org/packages/kde4-3.80.3/dists/edgy/Release.gpg Read error, remote server closed the connection
Failed to fetch http://thomas.enix.org/pub/debian/packages/dists/edgy/main/binary-i386/Packages.gz 302 Found

Revision history for this message
Reinhard Tartler (siretart) wrote :

my setup (taken from a booted system):

siretart-@hades:~
>> cat /proc/mdstat
Personalities : [raid0] [raid1]
md1 : active raid1 sda5[0] sdb5[1]
      1951744 blocks [2/2] [UU]

md0 : active raid1 sda2[0] sdb2[1]
      489856 blocks [2/2] [UU]

md3 : active raid0 sda7[0] sdb7[1]
      395632512 blocks 64k chunks

md2 : active raid1 sda6[0] sdb6[1]
      97659008 blocks [2/2] [UU]

unused devices: <none>

siretart-@hades:~
>> sudo lvs
  LV VG Attr LSize Origin Snap% Move Log Copy%
  backup hades_mirror -wi-ao 42,00G
  home hades_mirror -wi-ao 25,00G
  ubunturoot hades_mirror -wi-ao 25,00G
  chroot_dapper hades_stripe -wi-a- 5,00G
  chroot_dapper32 hades_stripe -wi-a- 5,00G
  chroot_edgy hades_stripe -wi-a- 5,00G
  chroot_edgy32 hades_stripe -wi-a- 5,00G
  chroot_feisty hades_stripe -wi-a- 5,00G
  chroot_feisty32 hades_stripe -wi-a- 5,00G
  chroot_sarge32 hades_stripe -wi-a- 3,00G
  chroot_sid hades_stripe -wi-a- 5,00G
  chroot_sid32 hades_stripe owi-a- 5,00G
  dapper32-snap hades_stripe -wi-a- 2,00G
  mirror hades_stripe -wi-ao 89,00G
  scratch hades_stripe -wi-ao 105,00G
  sid-xine-snap hades_stripe swi-a- 3,00G chroot_sid32 26,43
  ubunturoot hades_stripe -wi-ao 25,00G
siretart-@hades:~
>> sudo pvs
  PV VG Fmt Attr PSize PFree
  /dev/md2 hades_mirror lvm2 a- 93,13G 1,13G
  /dev/md3 hades_stripe lvm2 a- 377,30G 10,30G
siretart-@hades:~
>> sudo vgs
  VG #PV #LV #SN Attr VSize VFree
  hades_mirror 1 3 0 wz--n- 93,13G 1,13G
  hades_stripe 1 14 1 wz--n- 377,30G 10,30G

The (primary) root volume is /dev/hades_stripe/ubunturoot.

I wasn't able to reproduce the problem with the instructions you
gave. However, I modified
/usr/share/initramfs-tools/scripts/init-premount/udev to look like this:

--- /usr/share/initramfs-tools/scripts/init-premount/udev 2007-03-29 20:44:30.000000000 +0200
+++ /usr/share/initramfs-tools/scripts/init-premount/udev~ 2007-03-29 20:30:21.000000000 +0200
@@ -20,9 +20,10 @@
 # It's all over netlink now
 echo "" > /proc/sys/kernel/hotplug

+sleep 3
+
 # Start the udev daemon to process events
-#/sbin/udevd --daemon
-/sbin/udevd --verbose --suppress-syslog >> /tmp/udev-output 2>&1 &
+/sbin/udevd --daemon

 # Iterate sysfs and fire off everything; if we include a rule for it then
 # it'll get handled; otherwise it'll get handled later when we do this again

This way (okay, after two boots), I was able to reproduce...


Revision history for this message
Reinhard Tartler (siretart) wrote :
Revision history for this message
Jeffrey Knockel (jeff250) wrote :

Reinhard, I tried your Ubuntu deb (105-4), and it brought the bug back for me consistently. Then I reverted to Debian unstable's version (0.105-4), and now I'm booting consistently correctly again.

This is interesting. Clearly whatever is causing my bug must be in the Ubuntu packaging? I've already tried gouging out the Ubuntu patches in debian/patches of the Ubuntu package that you gave me (and hacking debian/rules to accommodate), then repackaging and reinstalling it, but still no go. This is as far as I have time to play around with it tonight, but before I try again in the near future, does anyone have any ideas?

Revision history for this message
Reinhard Tartler (siretart) wrote :

Jeff250 <email address hidden> writes:

> Reinhard, I tried your Ubuntu deb (105-4), and it brought the bug back
> for me consistently. Then I reverted to debian unstable's version
> (0.105-4), and now I'm booting consistently correctly again.
>
> This is interesting. Clearly what is causing my bug must be in the
> Ubuntu packaging?

As you might have already noticed, udev is a pretty critical and central
piece of software in the boot procedure of a Debian/Ubuntu system. The
packaging and integration of udev in Ubuntu is quite different from
Debian's. The change which caused this is the implementation of the
UdevMdadm spec, which was approved at the last UDS. I'm pretty sure that
by reverting to the packaging of the edgy udev package, the problem
would not appear. Unfortunately, that is not an option for feisty. :(

Revision history for this message
Codink (herco) wrote :

I think the solution is in bug report #83231:
https://launchpad.net/ubuntu/+source/initramfs-tools/+bug/83231

Adding "udevsettle --timeout 10" at the end of /usr/share/initramfs-tools/scripts/init-premount/udev works too.

Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :

We believe that this problem has been corrected by a series of uploads today. Please update to ensure you have the following package versions:

    dmsetup, libdevmapper1.02 - 1.02.08-1ubuntu6
    lvm-common - 1.5.20ubuntu12
    lvm2 - 2.02.06-2ubuntu9
    mdadm - 2.5.6-7ubuntu5 (not applicable unless you're also using mdadm)
    udev, volumeid, libvolume-id0 - 108-0ubuntu1

The problem was caused by a number of ordering issues and race conditions relating to when lvm and mdadm were called, and how those interacted to ensure the devices were created and their contents examined.

This should work as follows:
 * an underlying block device, sda1, is detected
 * udev (through vol_id) detects that this is a RAID member
 * udev invokes mdadm, which fails to assemble because the RAID-1 is not complete
 * the creation of a new raid, md0, is detected
 * udev fails to detect this device, because it is not yet complete

meanwhile:
 * a second underlying block device, sdb1, is detected
 * udev (through vol_id) detects that this is a RAID member
 * udev invokes mdadm, which can now complete since the set is ready
 * the change of the raid array, md0, is detected
 * udev (through vol_id) detects that this is an LVM physical volume
 * lvm is called to handle the creation of the devmapper devices

then
 * various devmapper devices are detected
 * the devices are created by udev, named correctly under /dev/mapper
 * meanwhile the requesting application spins until the device exists, at which point it carries on
 * udev (through vol_id) detects that these devices contain typical filesystems
 * using vol_id it obtains the LABEL and UUID, which is used to populate /dev/disk

Note that this event-based sequence is substantially different from Debian, so any bugs filed there will not be relevant to helping solve problems in Ubuntu.

This should now work correctly. If it does not, I would ask that you do not re-open this bug, and instead file a new bug on lvm2 for your exact problem, even if someone else has already filed one, with verbose details about your setup and how you cause the error.
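To illustrate the per-device assembly step in the sequence above, here is a rough sketch of the kind of udev rule involved. The rule file name, match keys and exact mdadm options are assumptions for illustration; the rule actually shipped in the Ubuntu packages may differ:

  # 85-mdadm.rules (file name assumed)
  # When vol_id reports a newly added block device as a Linux RAID member,
  # ask mdadm to assemble its array, but refuse to start it degraded
  # (--no-degraded); the event for the remaining member completes the set.
  SUBSYSTEM=="block", ACTION=="add", ENV{ID_FS_TYPE}=="linux_raid*", \
      RUN+="/sbin/mdadm --assemble --scan --no-degraded"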

Changed in lvm2:
status: Unconfirmed → Fix Released
Changed in mdadm:
status: Confirmed → Fix Released
Changed in udev:
status: Unconfirmed → Fix Released
Revision history for this message
Manoj Kasichainula (manoj+launchpad-net) wrote :

I don't use LVM, yet I am seeing the same problem with software RAID. I just dist-upgraded, reran update-initramfs just to be sure, and saw the failure at boot. The package list below confirms I have the most recent versions mentioned above:

> dpkg -l dmsetup libdevmapper1.02 lvm-common lvm2 mdadm udev volumeid libvolume-id0
No packages found matching lvm-common.
No packages found matching lvm2.
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Installed/Config-files/Unpacked/Failed-config/Half-installed
|/ Err?=(none)/Hold/Reinst-required/X=both-problems (Status,Err: uppercase=bad)
||/ Name Version Description
+++-=======================-=======================-==============================================================
ii dmsetup 1.02.08-1ubuntu6 The Linux Kernel Device Mapper userspace library
ii libdevmapper1.02 1.02.08-1ubuntu6 The Linux Kernel Device Mapper userspace library
ii libvolume-id0 108-0ubuntu1 volume identification library
ii mdadm 2.5.6-7ubuntu5 tool to administer Linux MD arrays (software RAID)
ii udev 108-0ubuntu1 rule-based device node and kernel event manager
ii volumeid 108-0ubuntu1 volume identification tool

The udevsettle script from https://launchpad.net/ubuntu/+source/initramfs-tools/+bug/83231 fixed my problem.

Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :

Manoj: as noted above, please file a new bug

Revision history for this message
Sami J. Laine (sjlain) wrote :

Scott James Remnant wrote:
> We believe that this problem has been corrected by a series of uploads
> today. Please update to ensure you have the following package versions:
> [...]
> This should now work correctly. If it does not, I would ask that you do
> not re-open this bug, and instead file a new bug on lvm2 for your exact
> problem, even if someone else has already filed one, with verbose
> details about your setup and how you cause the error.

The problem persists. The only solution is still to use the break=mount
option to boot.

However, I don't use LVM at all, so I don't think I should file a bug on
lvm.

--
Sami Laine @ GMail

Revision history for this message
Wilb (ubuntu-wilb) wrote :

Exactly the same problem for me here too - no LVM in sight, just md0 and md1 as /boot and / respectively; fixed by using break=mount and mounting manually.

Revision history for this message
Oliver Brakmann (obrakmann) wrote :

Did somebody already report a new bug on this?
If not, there are still two other open bugs with the same issue, one of them new: bug #83231 and bug #102410

Revision history for this message
Mathias Lustig (delaylama) wrote :

Hi everyone,

Yesterday evening (here in Germany), I also got into trouble with the latest lvm2 upgrade available for feisty. I upgraded the feisty install on my desktop PC, and during the upgrade everything ran just fine. The PC is equipped with two 160 GB Samsung SATA drives on an Nforce4 SATA controller.

After a reboot, usplash just hangs forever at the point where LVM should activate all its volumes. It hangs and hangs and hangs and nothing happens. This happens with every kernel that's installed (the 386, generic and lowlatency kernels from 2.6.20-11, 2.6.20-12 and 2.6.20-13). Booting in single user / recovery mode and interrupting the LVM step with Ctrl+C brings me to a root shell where mount -a lets me mount all my logical volumes - obviously something the lvm init script just won't do on its own.

Is there some workaround to get my LVM working again until a new, fixed package is available?

Ah, something I forgot: at my workplace there's a feisty workstation, too, which also uses LVM on a single ATA disk. There I experienced no problems after the upgrade; it just worked fine. Now I'm a little afraid that upgrading my notebook will also break the LVM ...

Is there any file or anything else that I should provide to help you find and fix the bug?

Greetings,

Mathias

Revision history for this message
Mathias Lustig (delaylama) wrote :

I forgot to mention that I do NOT use any software RAID underneath the LVM. The LVM just spans my two disks, which are joined together in one volume group.

Revision history for this message
Reinhard Tartler (siretart) wrote :

Could you try removing the evms package and regenerating your initramfs? This fixed the problem for me.

Scott has a fix for evms in the pipeline, but feel free to file another bug to document this issue.

Revision history for this message
Mathias Lustig (delaylama) wrote :

Okay, I can try to remove the evms package and rerun update-initramfs this evening. I'm not sure if evms is installed, but maybe it is ;)
The biggest problem is that the Ubuntu installation on my desktop PC started out as dapper, on which I installed a lot of extra software. Three weeks after the edgy release I upgraded from Dapper to Edgy, and about 3 weeks ago I just wanted to try something new and upgraded to feisty. I can never be sure whether a specific problem is an error in feisty or whether it results from some strange packages left over from an earlier, dist-upgraded Ubuntu version.
Maybe there are potential sources of error under these conditions...

I'll try to follow your advice to track down my screwed-up lvm2 / evms problem. I don't know if it's useful to file a bug report for an already known problem...

Btw (I know that this is probably not the right place for that question), what are the differences between lvm2 and evms? *confused*
Just wondering why evms might be installed on my desktop even though I don't need it, because lvm2 is the logical volume manager of choice ...

Revision history for this message
cionci (cionci-tin) wrote :

My Feisty is up to date. I have the same BusyBox problem at boot up. Can you tell me how to track down the issue? I don't have RAID, but I have an Adaptec 19160 SCSI controller on which the root partition resides.

Revision history for this message
Jeff Balderson (jbalders) wrote :

This isn't fixed for me. I opened a new report (Bug # 103177) as requested by Scott James Remnant.

My problem still has the exact same symptoms that I described above.

Revision history for this message
den (den-elak) wrote :

So there is still no cure!?? The workaround with MD_DEGRADED_ARGS helps to assemble the array only when all of the raid devices have been found by udev. But there is still very BAD behaviour: if we unplug one of the raid disks, the system hangs during the boot process.

PS. Just installed feisty server, and updated.
I use MD, and LVM2 on top of it.
The setup has two SATA drives acting in RAID1 mode.
mdadm 2.5.6-7ubuntu5
udev 108-0ubuntu4
initramfs-tools 0.85eubuntu10
lvm2 2.02.06-2ubuntu9

I fixed the problem for myself by using udevsettle and setting MD_DEGRADED_ARGS=" ", but that's an ugly method...

Can you tell me whether there is any proper plan to fix the problem?

Revision history for this message
dan_linder (dan-linder) wrote :

I was having a similar issue, but a solution in this bug report fixed it for me:

https://bugs.launchpad.net/ubuntu/+source/initramfs-tools/+bug/99439/comments/1

[Quote]
Try adding "/sbin/udevsettle --timeout=10" in /usr/share/initramfs-tools/init before line:
  log_begin_msg "Mounting root file system..."
and then rebuild initrd with:
  sudo update-initramfs -u -k all
[/Quote]
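For reference, a minimal sketch of what that change looks like; only the inserted udevsettle line comes from the quoted suggestion, and the surrounding lines are assumptions about the stock script:

  # ... earlier initialisation in /usr/share/initramfs-tools/init ...

  # Suggested workaround from bug #99439: wait (at most 10 seconds) for
  # udev to finish processing queued events before mounting root.
  /sbin/udevsettle --timeout=10

  log_begin_msg "Mounting root file system..."
  # ... rest of the script unchanged ...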

Does this help someone/anyone on this bug thread?

Dan

Revision history for this message
den (den-elak) wrote :

Thanks Dan!
That's true, I did manage to solve the problem in just the same way.
But why is /sbin/udevsettle --timeout=10 not in the distribution itself? Perhaps something is wrong with it?

Revision history for this message
Reinhard Tartler (siretart) wrote :

den <email address hidden> writes:

> But why /sbin/udevsettle --timeout=10 is not in the distribution
> itself, perhaps something is wrong with that?

That's not the right fix. It is a workaround that seems to work on some
machines, though...

Scott, do you think that workaround is worth an upload to
feisty-updates? I could perhaps upload a test package to my ppa...

--
Gruesse/greetings,
Reinhard Tartler, KeyID 945348A4

Revision history for this message
Reinhard Tartler (siretart) wrote :

Okay, I uploaded a package with the udevsettle line to my ppa. Add this to your /etc/apt/sources.list:

deb http://ppa.dogfood.launchpad.net/siretart/ubuntu/ feisty main

Do an 'apt-get update && apt-get dist-upgrade'. Please give feedback on whether that package lets your system boot.

Changed in initramfs-tools:
importance: Undecided → High
status: New → Incomplete
Revision history for this message
den (den-elak) wrote :

Hello Reinhard!
My system boots fine with your updated initramfs-tools package!

Revision history for this message
den (den-elak) wrote :

Hello!

Not everything is as OK as I expected. I use /sbin/udevsettle --timeout=10 at the end of the /etc/initramfs-tools/scripts/init-premount/udev script. Sometimes the system boots fine, but after one reboot cat /proc/mdstat gave me:
Personalities : [raid1]
md1 : active raid1 sda2[0] sdb2[1]
      116712128 blocks [2/2] [UU]

md0 : active raid1 sda1[0]
      505920 blocks [2/1] [U_]

unused devices: <none>
One of the raid arrays (md0) does not assemble completely!

PS!
I also set MD_DEGRADED_ARGS=" " in local-top/mdadm.

Revision history for this message
Reinhard Tartler (siretart) wrote :

den <email address hidden> writes:

> Not everything is as OK as I expected. I use /sbin/udevsettle
> --timeout=10 at the end of the
> /etc/initramfs-tools/scripts/init-premount/udev script. Sometimes
> the system boots fine, but after one reboot cat /proc/mdstat gave me:
> Personalities : [raid1]

As I said, this is a nasty race condition in the bootup
procedure. Honestly, I don't believe we can fix it in feisty. In gutsy,
my system now seems to boot reliably. It would be great if you could
verify it, so that we can fix it there.

--
Gruesse/greetings,
Reinhard Tartler, KeyID 945348A4

Revision history for this message
den (den-elak) wrote :

I think I can try, but I have put the system into production. If you could tell me exactly which packages are involved (kernel, initramfs-tools ...) and how to get them from the gutsy repository without a full dist-upgrade, I can check that on that system and report back.

Revision history for this message
Peter Haight (peterh-sapros) wrote :

There is a problem with the solution outlined by Scott James Remnant above. (https://bugs.launchpad.net/ubuntu/+source/initramfs-tools/+bug/75681/comments/84)

What happens when one of the raid devices has failed and isn't getting detected? Take Scott's example of a RAID with sda1 and sdb1:

> This should work as follows:
> * an underlying block device, sda1, is detected
> * udev (through vol_id) detects that this is a RAID member
> * udev invokes mdadm, which fails to assemble because the RAID-1 is not complete
> * the creation of a new raid, md0, is detected
> * udev fails to detect this device, because it is not yet complete

At this point, mdadm should assemble the RAID with just sda1 because sdb1 is down, but in the current scheme mdadm only assembles the RAID if all drives are available. This sort of defeats the point of using any of the mirrored RAID schemes.

So, because the only case I know of where this is an issue involves a drive failure, how about trying to run mdadm again after the root mount timeout, but this time without the --no-degraded argument, so that if we can assemble some of the RAID arrays without the missing drives, we do it.
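As a rough illustration of that fallback idea (this is only a sketch of the approach, not the patches attached to this comment; the ${ROOT} test, script location and timeout handling are assumptions about the feisty initramfs):

  # sketch: run once the normal wait for the root device has timed out
  if [ ! -e "${ROOT}" ]; then
      # Some array members never showed up, so retry assembly without
      # --no-degraded and allow a degraded array to be started.
      /sbin/mdadm --assemble --scan
      /sbin/udevsettle --timeout=10   # let udev create the device nodes
  fi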

I'll attach some patches to some /usr/share/initramfs-tools scripts which fix this problem for me.

So then the question is: how do we know that sdb1 is down and that we should go ahead and assemble the RAID array anyway? I'm not sure exactly what kind of information we have this early in the boot process, but how about something like this

Revision history for this message
Peter Haight (peterh-sapros) wrote :

So, there is something wrong with that patch. Actually it seems to be working great, but when I disconnect a drive to fail it, it boots up immediately instead of retrying mdadm after the timeout. So I'm guessing that the mdadm script is getting called without the from-udev parameter somewhere else. But it is working in some sense, because the machine boots nicely with one of the RAID drives disconnected, or with both of them properly set up. So there might be some race problem with this patch.

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for initramfs-tools (Ubuntu Feisty) because there has been no activity for 60 days.]

Revision history for this message
xteejx (xteejx) wrote :

The LP Janitor did not change the status of the initramfs-tools task. Changing it to Invalid. If this is wrong, please change it to Fix Released, etc. Thanks.

Changed in initramfs-tools (Ubuntu):
status: New → Invalid