curtin race on vivid when /dev/sda1 doesn't exist

Bug #1443542 reported by Robie Basak
14
This bug affects 2 people
Affects Status Importance Assigned to Milestone
curtin
Fix Released
Medium
Unassigned
curtin (Ubuntu)
Fix Released
Medium
Unassigned
Trusty
Fix Released
Medium
Unassigned
Vivid
Fix Released
Medium
Unassigned

Bug Description

=== Begin SRU Template ===
[Description]
Installation fails, with log information showing:

  Unexpected error while running command.
  Command: ['mkfs.ext4', '-q', '-L', 'cloudimg-rootfs', '/dev/sda1']
  Exit code: 1
  Reason: -
  Stdout: ''
  Stderr: ''
  Installation failed with exception: Unexpected error while running command.
  Command: ['curtin', 'block-meta', 'simple']

This is a result of curtin having done:
 a.) partition the disk
 b.) invoke wipefs on the block device with an offset to where the partition started

The intent of that was to ensure that there were no fileystem signatures or other interesting metadata (lvm, raid) on the partition.

The issue was two fold:
 a.) wipefs always invokes rereadpt on the disk it wipes
 b.) wipefs opened the block device, not the partition.

Both of these things cause a flurry of udev events that may result in the device having an open filehandle, and thus mkfs refusing to create a filesystem on it.

The solution is to replace 'wipefs' with our own 'wipe_partition' that does not open the block device, but rather only the partition and does not invoke rereadpt.

[Impact]
Installation fails occasionally.

[Test Case]
In the original bug-opener's environment it fails fairly reliably under heavy host load using vmware. He would do a deploy to several guests on the same host at the same time and this would reproduce. Unfortunately I was unable to come up with a test case in a less complex environment.

[Regression Potential]
Not specifically a regression, but it is possible that we need additional 'udevadm settle' after the internal 'wipe_partition'

=== End SRU Template ===

The file /dev/sda1 does not exist and no size was specified.
Unexpected error while running command.
Command: ['mkfs.ext4', '-q', '-L', 'cloudimg-rootfs', '/dev/sda1']
Exit code: 1
Reason: -
Stdout: ''
Stderr: ''
Installation failed with exception: Unexpected error while running command.
Command: ['curtin', 'block-meta', 'simple']
Exit code: 3
Reason: -
Stdout: b"The file /dev/sda1 does not exist and no size was specified.\nUnexpected error while running command.\nCommand: ['mkfs.ext4', '-q', '-L', 'cloudimg-rootfs', '/dev/sda1']\nExit code: 1\nReason: -\nStdout: ''\nStderr: ''\n"
Stderr: ''

This happened 4 out of 10 times when Ryan tested it.

Could this be a race condition because /dev/sda1 hasn't had time to come into existence between when curtin partitions the disk and when curtin tries to create a filesystem on the partition?

Ryan Beisner (1chb1n)
tags: added: openstack uosci
Revision history for this message
Ryan Harper (raharper) wrote :

cloud-init[1226]: /var/lib/cloud/instance/scripts/user_data.sh: 18: /var/lib/cloud/instance/scripts/user_data.sh: initctl: not found

ubuntu@phony-trick:/var/log$ cat /proc/partitions
major minor #blocks name

  11 0 1048575 sr0
   8 0 488386584 sda
   8 1 488385560 sda1
   8 16 488386584 sdb
   8 17 487336967 sdb1
   8 18 1047552 sdb2
   8 32 1433600 sdc
ubuntu@phony-trick:/var/log$ ls -al /dev/sda1
brw-rw---- 1 root disk 8, 1 May 11 21:13 /dev/sda1

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in curtin (Ubuntu):
status: New → Confirmed
Revision history for this message
Scott Moser (smoser) wrote :

i'm looking at http://bazaar.launchpad.net/~curtin-dev/curtin/trunk/view/head:/helpers/common#L330 (which admittedly i just committed to, but did not change this general behavior).

and i'm not really sure how it can get there...
partition_main() line 374 calls wipedev. wipedev wipes start and end, and then calls blockdev --rereadpt and udevadm settle.

then we partition via one of the pt_* things. and each of those do a 'blockdev --rereadpt "$target"' and 'udevadm settle.

that 'partition' is called from
 http://bazaar.launchpad.net/~curtin-dev/curtin/trunk/view/head:/curtin/commands/block_meta.py#L204
and then after that block_meta does the mkfs that apparently finds no /dev/sda1.

Revision history for this message
Scott Moser (smoser) wrote :

so 1 of 2 things seems like it must have happend:
a.) 'isblk' is not getting set (which would mean we dont run blockdev and udevadm settle)
b.) my assumption that changing the partition table on a device, then running 'blockdev --rereadpt' and subsequently 'udevadm settle' would result in the new block devices being present is simply wrong.

we can easily enough add a wait loop in 'assert_partitions' now, but.. that sucks. i'd like a guarantee somehow.

Revision history for this message
Scott Moser (smoser) wrote :

possible / likely explanation is that there is no guarantee that blockdev will exit after kernel has put events into the udev queue.
so udevadm could have run and found an empty queue.

Revision history for this message
Oleg Strikov (strikov-deactivatedaccount) wrote :

Execution of blockdev --rereadpt simply issues a call to ioctl(BLKRRPART) which forces kernel to re-initialize its metadata. Kernel adds information about newly created partition here: http://lxr.linux.no/#linux+v3.19.1/block/partition-generic.c#L269 You can see a call to kobject_uevent() which puts event to udev's queue. So to me it looks like udev settle should have all the information available. Only if blockdev succeeds though. We don't check its return code if I got it correctly.

Another interesting point is that there are two popular ways to re-read the partition table: BLKRRPAR ioctl (used by blockdev) and BLKPG ioctl (used by partprobe). They are very different. While BLKRRPAR lets kernel know that partition table needs to be re-read, BLKPG tells kernel the exact partitioning layout (parted does this because it wants to have ultimate control over things). BLKPG may be more udev-friendly because parted-based tools are more popular. We may want to try partprobe instead of blockdev and see if it solves the issue.

Revision history for this message
Scott Moser (smoser) wrote :
Revision history for this message
Scott Moser (smoser) wrote :

This is believed to be fix-committed. under commit 205.
I removed a 'wipefs' which always calls blockdev --rereadpt. That would trigger deletion and recreation of partition devices.
We replaced it with 'wipe_partitions' which does not issue a rereadpt, but instead just zeros the front of the device.
Note, though, that in vivid and later, even a write to /dev/vdb1 does trigger kernel/udev events.

Revision history for this message
Scott Moser (smoser) wrote :

Marking fix-committed. if you see this issue again, please re-open and ping me.

Changed in curtin:
status: New → Fix Committed
importance: Undecided → Medium
Changed in curtin (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Fix Released
Changed in curtin (Ubuntu Trusty):
status: New → Confirmed
Changed in curtin (Ubuntu Vivid):
status: New → Confirmed
Changed in curtin (Ubuntu Trusty):
importance: Undecided → Medium
Changed in curtin (Ubuntu Vivid):
importance: Undecided → Medium
Scott Moser (smoser)
description: updated
Revision history for this message
Chris J Arges (arges) wrote : Please test proposed package

Hello Robie, or anyone else affected,

Accepted curtin into trusty-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/curtin/0.1.0~bzr221-0ubuntu1~14.04.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in curtin (Ubuntu Trusty):
status: Confirmed → Fix Committed
tags: added: verification-needed
Revision history for this message
Chris J Arges (arges) wrote :

Hello Robie, or anyone else affected,

Accepted curtin into vivid-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/curtin/0.1.0~bzr221-0ubuntu1~14.10.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in curtin (Ubuntu Vivid):
status: Confirmed → Fix Committed
Revision history for this message
Scott Moser (smoser) wrote :

The system where we found these issues is happily deploying using newer curtin.

tags: added: verification-done
removed: verification-needed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package curtin - 0.1.0~bzr221-0ubuntu1~14.04.1

---------------
curtin (0.1.0~bzr221-0ubuntu1~14.04.1) trusty-proposed; urgency=medium

  * New upstream snapshot.
    - support installation to multipath devices. (LP: #1371634)
    - know that kernel version 4.2.0 maps to linux-generic-lts-wily
    - support install to arm64 systems that use UEFI for boot (LP: #1447834)
    - fix remaining usage of 'lsblk --out' rather than 'lsblk --output'
      (LP: #1386275)
    - retry 'apt-get update' on failure to avoid transient failures
      (LP: #1403133)
    - run udevadm settle before unmounting /dev in a target to avoid transient
      failures (LP: #1462139)
    - fixes and additions to tools used in development.
    - Add --no-nvram to the grub-install command for UEFI. (LP: #1311827)
    - avoid race condition and transient failure due busy device in mkfs
      (LP: #1443542)
    - improvements to device and partition naming code which allow installation
      devices with HP cciss smart array drives(LP: #1401190, #1263181)
    - do not consider devices < 1G as installable targets
  * debian/README.source fix doc on how to create new upstream snapshots

 -- Scott Moser <email address hidden> Wed, 24 Jun 2015 14:31:14 -0400

Changed in curtin (Ubuntu Trusty):
status: Fix Committed → Fix Released
Revision history for this message
Chris J Arges (arges) wrote : Update Released

The verification of the Stable Release Update for curtin has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package curtin - 0.1.0~bzr221-0ubuntu1~14.10.1

---------------
curtin (0.1.0~bzr221-0ubuntu1~14.10.1) vivid-proposed; urgency=medium

  * New upstream snapshot.
    - support installation to multipath devices. (LP: #1371634)
    - know that kernel version 4.2.0 maps to linux-generic-lts-wily
    - support install to arm64 systems that use UEFI for boot (LP: #1447834)
    - fix remaining usage of 'lsblk --out' rather than 'lsblk --output'
      (LP: #1386275)
    - retry 'apt-get update' on failure to avoid transient failures
      (LP: #1403133)
    - run udevadm settle before unmounting /dev in a target to avoid transient
      failures (LP: #1462139)
    - fixes and additions to tools used in development.
    - Add --no-nvram to the grub-install command for UEFI. (LP: #1311827)
    - avoid race condition and transient failure due busy device in mkfs
      (LP: #1443542)
    - improvements to device and partition naming code which allow installation
      devices with HP cciss smart array drives(LP: #1401190, #1263181)
    - do not consider devices < 1G as installable targets
  * debian/README.source fix doc on how to create new upstream snapshots

 -- Scott Moser <email address hidden> Wed, 24 Jun 2015 16:12:59 -0400

Changed in curtin (Ubuntu Vivid):
status: Fix Committed → Fix Released
Revision history for this message
Scott Moser (smoser) wrote : Fixed in Curtin 17.1

This bug is believed to be fixed in curtin in 17.1. If this is still a problem for you, please make a comment and set the state back to New

Thank you.

Changed in curtin:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.