[gutsy] partitions no longer detected as RAID components after repairing degraded RAID 1 mirror

Bug #133773 reported by Daniel Pittman
Affects         Status        Importance  Assigned to                       Milestone
udev            Fix Released  Undecided   Unassigned                        -
udev (Ubuntu)   Fix Released  Medium      Scott James Remnant (Canonical)   -

Nominated for Gutsy by Daniel Pittman

Bug Description

Binary package hint: udev

My system was correctly booting from a mirrored PATA (via libata) RAID 1 until one of the disks was removed. I then hit the "degraded mode doesn't boot" problem for a while. This all happened under Gutsy (kept up to date daily).

After that I replaced the missing disk, while the system was powered off, and then (a consolidated sketch of this sequence follows the list):
 * booted with 'break=mount' on the kernel command line.
 * waited until detection of hardware had completed.
 * ran 'exec mdadm -As' to detect the degraded RAID and continue to boot
 * ran 'sfdisk -d /dev/sda | sfdisk /dev/sdb' to partition the new disk
 * ran 'mdadm -a /dev/md0 /dev/sdb1'
 * ran 'mdadm -a /dev/md1 /dev/sdb2'
 * waited for the system to resync the data
 * checked that /proc/mdstat showed all healthy
 * rebooted
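
Expressed as shell commands, the sequence above is roughly the following; the first command runs in the initramfs shell reached via break=mount, the rest once the system has booted degraded:

  # assemble the degraded arrays and let the boot continue
  # (exec replaces the initramfs shell)
  exec mdadm -As

  # after the system has booted (degraded): copy the partition table
  # onto the new disk, re-add its partitions, and watch the resync
  sfdisk -d /dev/sda | sfdisk /dev/sdb
  mdadm -a /dev/md0 /dev/sdb1
  mdadm -a /dev/md1 /dev/sdb2
  watch cat /proc/mdstat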

At this point I expected, naturally, to have the system boot cleanly without any problems or delay. Instead the system simply ground to a halt after the three minute boot timeout without the RAID detected.

After some investigation this looks, to me, like a problem with identifying how the component devices are being used.

The udev rules for mdadm depend on ENV{ID_FS_TYPE}=="linux_raid*" to run mdadm at all; for my RAID components I get the following details:

UDEV [1187654052.620855] add /block/sda/sda1 (block)
UDEV_LOG=3
ACTION=add
DEVPATH=/block/sda/sda1
SUBSYSTEM=block
SEQNUM=1750
MINOR=1
MAJOR=8
PHYSDEVPATH=/devices/pci0000:00/0000:00:1f.1/host0/target0:0:0/0:0:0:0
PHYSDEVBUS=scsi
PHYSDEVDRIVER=sd
UDEVD_EVENT=1
DEVTYPE=partition
ID_VENDOR=ATA
ID_MODEL=SAMSUNG_MP0804H
ID_REVISION=UE10
ID_SERIAL=1ATA_SAMSUNG_MP0804H_S042J10Y257961
ID_SERIAL_SHORT=ATA_SAMSUNG_MP0804H_S042J10Y257961
ID_TYPE=disk
ID_BUS=scsi
ID_ATA_COMPAT=SAMSUNG_MP0804H_S042J10Y257961
ID_PATH=pci-0000:00:1f.1-scsi-0:0:0:0
ID_FS_USAGE=filesystem
ID_FS_TYPE=ext3
ID_FS_VERSION=1.0
ID_FS_UUID=f93b5509-6e68-4f2f-9d2f-fcff7a2dfb19
ID_FS_UUID_ENC=f93b5509-6e68-4f2f-9d2f-fcff7a2dfb19
ID_FS_LABEL=enki-root
ID_FS_LABEL_ENC=enki-root
ID_FS_LABEL_SAFE=enki-root
DEVNAME=/dev/sda1
DEVLINKS=/dev/disk/by-id/scsi-1ATA_SAMSUNG_MP0804H_S042J10Y257961-part1 /dev/disk/by-id/ata-SAMSUNG_MP0804H_S042J10Y257961-part1 /dev/disk/by-path/pci-0000:00:1f.1-scsi-0:0:0:0-part1 /dev/disk/by-uuid/f93b5509-6e68-4f2f-9d2f-fcff7a2dfb19 /dev/disk/by-label/enki-root

Note that the 'ID_FS_TYPE' value is ext3, the file system inside the RAID array, rather than a value identifying this partition as a RAID member.

The same misidentification is present for the swap RAID1 and the other component; I can supply logs showing that if it matters.
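
If it helps, the same properties can be queried for a single device after boot; a minimal example using udevinfo (the query tool shipped with udev of this vintage; the exact option spelling is an assumption on my part):

  # dump the ID_* properties udev has recorded for the partition
  udevinfo -q env -n /dev/sda1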

The RAID array itself is a healthy RAID1 with version 1.0 metadata:

daniel@enki:~$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md1 : active raid1 sda2[0] sdb2[1]
      1975984 blocks super 1.0 [2/2] [UU]

md0 : active raid1 sdb1[0] sda1[2]
      76204224 blocks super 1.0 [2/2] [UU]

unused devices: <none>

daniel@enki:~$ sudo mdadm -D /dev/md0
[sudo] password for daniel:
/dev/md0:
        Version : 01.00.03
  Creation Time : Wed May 16 01:08:07 2007
     Raid Level : raid1
     Array Size : 76204224 (72.67 GiB 78.03 GB)
  Used Dev Size : 152408448 (72.67 GiB 78.03 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Tue Aug 21 10:21:04 2007
          State : active
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           Name : enki-root
           UUID : 3a6b05ca:c218089c:f9df5f76:731157e8
         Events : 802619

    Number   Major   Minor   RaidDevice State
       0       8      17          0      active sync   /dev/sdb1
       2       8       1          1      active sync   /dev/sda1

However, one very odd factor in this:

daniel@enki:~$ sudo mdadm -E /dev/sda1
/dev/sda1:
          Magic : a92b4efc
        Version : 01
    Feature Map : 0x0
     Array UUID : 3a6b05ca:c218089c:f9df5f76:731157e8
           Name : enki-root
  Creation Time : Wed May 16 01:08:07 2007
     Raid Level : raid1
   Raid Devices : 2

  Used Dev Size : 152408448 (72.67 GiB 78.03 GB)
     Array Size : 152408448 (72.67 GiB 78.03 GB)
   Super Offset : 152408576 sectors
          State : active
    Device UUID : fc1de0a0:88071249:b753809d:3cd1beee

    Update Time : Tue Aug 21 10:21:24 2007
       Checksum : 655ae10b - correct
         Events : 802619

    Array Slot : 2 (0, failed, 1)
   Array State : uU 1 failed
daniel@enki:~$ sudo mdadm -E /dev/sdb1
/dev/sdb1:
          Magic : a92b4efc
        Version : 01
    Feature Map : 0x0
     Array UUID : 3a6b05ca:c218089c:f9df5f76:731157e8
           Name : enki-root
  Creation Time : Wed May 16 01:08:07 2007
     Raid Level : raid1
   Raid Devices : 2

  Used Dev Size : 152408448 (72.67 GiB 78.03 GB)
     Array Size : 152408448 (72.67 GiB 78.03 GB)
   Super Offset : 152408576 sectors
          State : active
    Device UUID : 71575927:ef34548e:5693969d:1b8f6631

    Update Time : Tue Aug 21 10:21:26 2007
       Checksum : 73d44564 - correct
         Events : 802619

    Array Slot : 0 (0, failed, 1)
   Array State : Uu 1 failed

It looks like the MD device has a happy RAID1 header but the individual components have metadata that indicates that /both/ of them are part of a failed RAID array.

In any case, the misidentification of how the components are used means that mdadm is never called on my system.

Please let me know if I can assist in debugging this further or in providing any specific testing. I am happy to build whatever tools are needed with debugging enabled, etc.

Revision history for this message
Daniel Pittman (daniel-rimspace) wrote :

Please find attached a patch that resolves this issue.

I have tracked it down to the vol_id code using the wrong superblock offset to locate the metadata within the partition, at least for version 1.0 (new, at end of device) superblocks.

The attached patch implements the correct location calculation for the 1.0 superblock based on the code present in the current gutsy version of mdadm, suitably modified to fit the coding style of the udev helper.
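
For illustration, the 1.0 location can also be checked by hand from a shell. The arithmetic below follows my reading of mdadm's calculation (superblock roughly 8KB from the end of the device, aligned down to a 4KB boundary); treat the constants as approximate rather than as a copy of the patch:

  # size of the component partition in 512-byte sectors
  SECTORS=$(blockdev --getsz /dev/sda1)
  # v1.0 superblock: 8KB (16 sectors) back from the end, rounded
  # down to an 8-sector (4KB) boundary
  OFFSET=$(( (SECTORS - 16) & ~7 ))
  # the md magic a92b4efc should appear at the start of that sector
  # (stored little-endian, so the raw bytes read fc 4e 2b a9)
  dd if=/dev/sda1 bs=512 skip=$OFFSET count=1 2>/dev/null | od -A d -t x1 | head -n 2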

I have tested this and verified that it does, correctly, determine the use of my devices as RAID members rather than as simple ext3 file system content.

I think my patch is technically in error, in that it uses both the old and the new calculations to try to locate the superblock on the device, for both version 0.9 and 1.0 metadata. I suspect, but due to illness don't have the time to verify, that we should use the older method only for 0.9 superblocks and the new method only for 1.0 superblocks.

That said, this isn't actually a big problem. The code notes that there isn't a valid RAID superblock at the other offset and simply continues to the next test, so the extra check is harmless.

This is an upstream bug, so far as I can tell, since the vol_id code is not modified in the Ubuntu/Debian patch applied to the package.

I also think this should be pushed into the gutsy release -- at the moment Gutsy will fail to boot from a software RAID device with 1.0 metadata despite the array being fully healthy and correct.

Regards, Daniel

Revision history for this message
Daniel Pittman (daniel-rimspace) wrote :

Oh, in case it is needed, the output of the vol_id tool on my partition after building with the patch is:

ID_FS_USAGE=raid
ID_FS_TYPE=linux_raid_member
ID_FS_VERSION=1.0.0
ID_FS_UUID=9c0818c2:00000000:00000000:00000000
ID_FS_UUID_ENC=9c0818c2:00000000:00000000:00000000
ID_FS_LABEL=
ID_FS_LABEL_ENC=

Regards, Daniel

Changed in udev:
assignee: nobody → keybuk
status: New → Confirmed
Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :

Note that your output with the patch is bogus; the UUID is wrong. See the following from Kay Sievers (udev upstream):

> We've had a bug report filed for Ubuntu caused by vol_id not correctly
> detecting a RAID1 with 1.0 metadata:
>
> https://bugs.launchpad.net/ubuntu/+source/udev/+bug/133773

Seems like a bug for metadata 1.0, yes.

> A patch is attached to the bug (attached again here), does this seem a
> reasonable fix?

But the UUID looks strange:
  ID_FS_UUID=9c0818c2:00000000:00000000:00000000

I see the same thing here with this patch:
  $ mdadm -E /dev/sda7
  Array UUID : 847bf627:97a28f2a:37e21246:25b446b1

  $ extras/volume_id/vol_id /dev/sda7
  ID_FS_UUID=2a8fa297:00000000:00000000:00000000

With a different one-line fix to the lib it seems fine:
  $ extras/volume_id/vol_id /dev/sda7
  ID_FS_UUID_ENC=847bf627:97a28f2a:37e21246:25b446b1
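
In other words, a quick way to confirm the fix on an affected system is to check that the two tools agree on the array UUID; a minimal check, reusing the device and paths from above:

  # both lines should show the same UUID once the fix is applied
  mdadm -E /dev/sda7 | grep 'Array UUID'
  extras/volume_id/vol_id /dev/sda7 | grep 'ID_FS_UUID'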

Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :
Changed in udev:
status: New → Fix Committed
Changed in udev:
status: Confirmed → Fix Committed
Changed in udev:
importance: Undecided → Medium
Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :

udev (113-0ubuntu11) gutsy; urgency=low

  * debian/patches/10-git-linux_raid-1.0-metadata.patch:
    - Upstream patch to fix detection of linux_raid metadata v1.0 and
      produce correct UUIDs for such raid devices. LP: #133773

  * 20-names.rules:
    - Place ucm[0-9]* and rdma_ucm into the infiniband dir. LP: #124990.
  * 80-programs.rules:
    - Fix calling of create_floppy_devices to just $tempnode not
      $root/$tempnode, which is just plain wrong. LP: #132546.

 -- Scott James Remnant <email address hidden> Mon, 24 Sep 2007 13:18:41 +0100

Changed in udev:
status: Fix Committed → Fix Released
Revision history for this message
Daniel Pittman (daniel-rimspace) wrote : Thank you for your effort on this (was Re: [Bug 133773] Re: [gutsy] partitions no longer detected as RAID components after repairing degraded RAID 1 mirror)

G'day Scott.

> udev (113-0ubuntu11) gutsy; urgency=low
>
> * debian/patches/10-git-linux_raid-1.0-metadata.patch:
> - Upstream patch to fix detection of linux_raid metadata v1.0 and
> produce correct UUIDs for such raid devices. LP: #133773

Thank you very much for your work fixing this and pushing the changes
out into Gutsy. I appreciate it, and I am sorry that I left you feeling
I was complaining about your efforts early in the picture.

Regards,
        Daniel
--
Daniel Pittman <email address hidden> Phone: 03 9621 2377
Level 4, 10 Queen St, Melbourne Web: http://www.cyber.com.au
Cybersource: Australia's Leading Linux and Open Source Solutions Company

Martin Pitt (pitti)
Changed in udev:
status: Fix Committed → Fix Released