Merge ~paul-meyer/cloud-init:paulmey/fix-fs-setup into cloud-init:master

Proposed by Paul Meyer
Status: Rejected
Rejected by: Scott Moser
Proposed branch: ~paul-meyer/cloud-init:paulmey/fix-fs-setup
Merge into: cloud-init:master
Diff against target: 19 lines (+2/-0)
1 file modified
cloudinit/config/cc_disk_setup.py (+2/-0)
Reviewer Review Type Date Requested Status
Scott Moser Needs Fixing
Server Team CI bot continuous-integration Needs Fixing
Review via email: mp+324359@code.launchpad.net

Commit message

Add 'udevadm settle' before enumerating partitions

Fixes lp #1692093

Revision history for this message
Server Team CI bot (server-team-bot) wrote :

FAILED: Continuous integration, rev:8098f58f5def9acd35116de792e16b412f6898c3
No commit message was specified in the merge proposal. Click on the following link and set the commit message (if you want a jenkins rebuild you need to trigger it yourself):
https://code.launchpad.net/~paul-meyer/cloud-init/+git/cloud-init/+merge/324359/+edit-commit-message

https://jenkins.ubuntu.com/server/job/cloud-init-ci/363/
Executed test runs:
    SUCCESS: https://jenkins.ubuntu.com/server/job/cloud-init-ci/nodes=metal-amd64/363
    SUCCESS: https://jenkins.ubuntu.com/server/job/cloud-init-ci/nodes=metal-arm64/363
    SUCCESS: https://jenkins.ubuntu.com/server/job/cloud-init-ci/nodes=metal-ppc64el/363
    SUCCESS: https://jenkins.ubuntu.com/server/job/cloud-init-ci/nodes=metal-s390x/363
    SUCCESS: https://jenkins.ubuntu.com/server/job/cloud-init-ci/nodes=vm-i386/363

Click here to trigger a rebuild:
https://jenkins.ubuntu.com/server/job/cloud-init-ci/363/rebuild

review: Needs Fixing (continuous-integration)
Revision history for this message
Scott Moser (smoser) wrote :

I'm pretty sure the 'udevadm settle' in this particular location is really just the same as time.sleep(1). udevadm settle has an annoying way of "fixing" things like that.

The problem here really is the way that cloud-init works on azure.
The datasource identifies that the ephemeral disk can be wiped.
In order to do that, it removes the marker that cc_disk_setup already ran. Normally that would only run per-instance, so this is not an issue.

Ideally we'd have a more fine grained control on what part of cc_disk_setup were to re-run. We want the recreation of the ephemeral disk to occur, but *only* that.

Adding a 'udevadm settle' to a read-only operation (enumerate_disk) shouldn't be necessary.

review: Needs Fixing
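Scott's point about per-instance semantics can be sketched as follows. This is a hypothetical simplification, not cloud-init's actual API: the helper name and the marker-file mechanism are illustrative, but it shows why deleting the marker forces the whole module to re-run rather than just the ephemeral-disk portion.

```python
import os

def should_run(marker_path, frequency):
    """Decide whether a config module runs on this boot.

    Hypothetical simplification: a 'once-per-instance' module skips
    when its marker file exists. Deleting the marker (as the Azure
    datasource does so the ephemeral disk gets re-wiped) makes the
    WHOLE module run again -- there is no way to re-run only the
    ephemeral-disk part.
    """
    if frequency == 'always':
        return True
    if frequency == 'once-per-instance':
        return not os.path.exists(marker_path)
    raise ValueError('unknown frequency: %s' % frequency)
```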
Revision history for this message
Paul Meyer (paul-meyer) wrote :

I think you're right that it's a timing issue that could be fixed with a sleep. I was perusing the lsblk sources to see where it gets the label and file system type, to see if there's a way to make sure they are read properly. blkid has a parameter to force reading from disk rather than cache, and I wonder if this is something similar.

I wholeheartedly agree that it is a bad thing to have cc_disk_setup run every boot just to handle cases like service-healing or instance resize, where the ephemeral disk needs another wipe. What do you think is a good approach? Another property on disk_setup and fs_setup elements to indicate that a configuration should be applied every boot, and then switching cc_disk_setup to per-boot?
Alternatively, one could consider a cc_azure_ephemeral_disk module to run on every boot that calls into cc_disk_setup with the azure-specific hardcoded disk/fs config?

This issue occurs not only on reboots, but also on the initial boot of instances with an attached disk that has previously been prepared by cloud-init. So even if cc_disk_setup would only run per-instance, we would need a fix.
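For reference, the cache bypass Paul mentions is blkid's low-level probe switch ('-p'), which probes the superblock directly instead of consulting the blkid cache. A small sketch of building such an invocation (the helper name is illustrative, not cloud-init code):

```python
def build_probe_cmd(device, low_level=True):
    """Build a blkid invocation for a device.

    With low_level=True, '-p' asks blkid to probe the device's
    superblock directly instead of consulting its cache file, which
    can return stale fstype/label data right after a mkfs.
    """
    cmd = ['blkid']
    if low_level:
        cmd.append('-p')
    cmd.append(device)
    return cmd
```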

Revision history for this message
Paul Meyer (paul-meyer) wrote :

From the lsblk code:
https://git.kernel.org/pub/scm/utils/util-linux/util-linux.git/tree/misc-utils/lsblk.c#n944
https://git.kernel.org/pub/scm/utils/util-linux/util-linux.git/tree/misc-utils/lsblk.c#n590

I gather that lsblk reads the fstype from udev properties; the debug log for the lsblk execution in cloud-init shows that it hits the udev code via the log line at #572 in lsblk.c. (I added some code to set LSBLK_DEBUG=all in the update_env for the lsblk call in enumerate_disk.)

1137: lsblk: CXT: [0x7ffca0fd36f0]: setting context for sdc [parent=(nil), wholedisk=(nil)]
1137: lsblk: CXT: [0x7ffca0fd36f0]: sdc: filename=/dev/sdc
1137: lsblk: CXT: [0x7ffca0fd36f0]: sdc: npartitions=1, nholders=0, nslaves=0
1137: lsblk: CXT: [0x7ffca0fd36f0]: sdc: context successfully initialized
1137: lsblk: CXT: [0x7ffca0fd3850]: setting context for sdc1 [parent=0x7ffca0fd36f0, wholedisk=0x7ffca0fd36f0]
1137: lsblk: CXT: [0x7ffca0fd3850]: sdc1: filename=/dev/sdc1
1137: lsblk: CXT: [0x7ffca0fd3850]: sdc1: npartitions=0, nholders=0, nslaves=0
1137: lsblk: CXT: [0x7ffca0fd3850]: sdc1: context successfully initialized
1137: lsblk: DEV: [0x7ffca0fd3850]: sdc1: found udev properties
1137: lsblk: CXT: [0x7ffca0fd3850]: sdc1: list dependencies
1137: lsblk: CXT: [0x7ffca0fd3850]: reset
1137: lsblk: CXT: [0x7ffca0fd36f0]: reset
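The trace above was captured by setting LSBLK_DEBUG=all in the environment of the lsblk call, as Paul describes. A sketch of constructing that invocation (the helper is illustrative; cloud-init passes the extra environment via update_env to its util.subp wrapper):

```python
import os

def lsblk_debug_invocation(device):
    """Return the command and environment for a debug-traced lsblk run.

    LSBLK_DEBUG=all makes util-linux emit the CXT/DEV trace lines
    shown above on stderr, revealing whether fstype came from a udev
    property or a direct probe.
    """
    env = dict(os.environ, LSBLK_DEBUG='all')
    cmd = ['lsblk', '--pairs', '--output', 'NAME,TYPE,FSTYPE,LABEL', device]
    return cmd, env
```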

Revision history for this message
Scott Moser (smoser) wrote :

> I think you're right that it's a timing issue that could be fixed with a
> sleep. I was perusing the lsblk sources to see where it gets the label and
> file system type, to see if there's a way to make sure they are read properly.
> blkid has a parameter to force reading from disk rather than cache, and I
> wonder if this is something similar.
>
> I wholeheartedly agree that it is a bad thing to have cc_disk_setup run every
> boot, just because of case of service-healing or instance resize, where the
> ephemeral disk needs another wipe. What do you think is a good approach?
> Another property on disk_setup and fs_setup elements to indicate that a
> configuration should be applied every boot and then switching cc_disk_setup to
> per-boot?

Disk setup in general needs significant improvement. I think that work will
likely end up with a different disk setup format. I think we can fix
this specific issue without substantial overhaul, so for now I think we
should focus on the current issue, because it *should* be achievable.

> This issue occurs not only on reboots, but also on the initial boot of
> instances with an attached disk that has previously been prepared by cloud-
> init. So even if cc_disk_setup would only run per-instance, we would need a
> fix.

ACK. Yeah, that's definitely possible.

Revision history for this message
Scott Moser (smoser) wrote :

It seems like this approach should work.

http://paste.ubuntu.com/24636970/

Paul do you have thoughts on that?

Revision history for this message
Paul Meyer (paul-meyer) wrote :

> It seems like this approach should work.
>
> http://paste.ubuntu.com/24636970/
>
> Paul do you have thoughts on that?
It seems to work well for the reboot case; I haven't gotten a chance to test the re-provision case, but I expect it to work well too. Pretty robust code, it looks like!
Note that your patch has a typo in the third chunk: there's a trailing ':' that shouldn't be there.

Revision history for this message
Paul Meyer (paul-meyer) wrote :

It also works well for the re-provision scenario.
I did find that the assert_... function logs a warning when the device is a partition for mkfs():

2017-05-25 04:48:41,565 - cc_disk_setup.py[DEBUG]: setting up filesystems: [{'label': 'etcd_disk', 'extra_opts': ['-F', '-E', 'lazy_itable_init=1,lazy_journal_init=1'], 'device': '/dev/sdc1', 'filesystem': 'ext4'}]
2017-05-25 04:48:41,565 - cc_disk_setup.py[DEBUG]: Creating new filesystem.
2017-05-25 04:48:41,565 - util.py[DEBUG]: Running command ['/sbin/blockdev', '--rereadpt', '/dev/sdc1'] with allowed return codes [0] (shell=False, capture=True)
2017-05-25 04:48:41,568 - util.py[DEBUG]: out=b''
2017-05-25 04:48:41,568 - util.py[DEBUG]: err=b'blockdev: ioctl error on BLKRRPART: Invalid argument\n'
2017-05-25 04:48:41,568 - util.py[DEBUG]: rc=1
2017-05-25 04:48:41,569 - cc_disk_setup.py[WARNING]: failed command: Unexpected error while running command.
Command: ['/sbin/blockdev', '--rereadpt', '/dev/sdc1']
Exit code: 1
Reason: -
Stdout: -
Stderr: blockdev: ioctl error on BLKRRPART: Invalid argument
2017-05-25 04:48:41,576 - util.py[DEBUG]: Running command ['udevadm', 'settle'] with allowed return codes [0] (shell=False, capture=True)
2017-05-25 04:48:41,630 - util.py[DEBUG]: out=b''
2017-05-25 04:48:41,630 - util.py[DEBUG]: err=b''
2017-05-25 04:48:41,630 - util.py[DEBUG]: rc=0

I do need to figure out why the ephemeral drive was not set up here... it's not even in the fs_setup list.
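The BLKRRPART failure in the log above happens because 'blockdev --rereadpt' is only valid on a whole disk, not on a partition node. A sketch of a guard (the trailing-digit check is a deliberately naive heuristic for illustration; it misfires on names like nvme0n1, so a real check should consult sysfs instead):

```python
def looks_like_partition(device):
    """Naive heuristic: /dev/sdc1 is a partition, /dev/sdc is a disk.

    Breaks for NVMe/MMC names such as /dev/nvme0n1; a robust check
    would look for /sys/class/block/<name>/partition.
    """
    return device.rstrip('0123456789') != device

def rereadpt_cmd(device):
    """Only issue BLKRRPART against whole disks; on a partition node
    the ioctl fails with 'Invalid argument', as in the log above."""
    if looks_like_partition(device):
        return None
    return ['/sbin/blockdev', '--rereadpt', device]
```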

Revision history for this message
Scott Moser (smoser) wrote :

Paul,
if the user provides config that defines the disk setup, then they override the datasource provided config.

merging of items in cloud-config is kind of weird.

You can look in /var/lib/cloud/instance/cloud-config.txt
to see the fully merged value.
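The override behaviour Scott describes can be pictured with a plain dict merge. This is a rough sketch only (cloud-init's actual merge rules are configurable and more involved), but it shows why a user-supplied fs_setup replaces, rather than extends, the datasource-provided list:

```python
def merge_config(datasource_cfg, user_cfg):
    """Shallow merge where user keys win outright.

    Lists are not concatenated under this simple rule, so a user
    'fs_setup' entry completely replaces the datasource-provided
    one -- one plausible reason the ephemeral disk can vanish from
    the merged fs_setup list.
    """
    merged = dict(datasource_cfg)
    merged.update(user_cfg)
    return merged
```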

Revision history for this message
Scott Moser (smoser) wrote :

I'm not sure, though, why the blkid would fail if the disk was in the mkfs.

Revision history for this message
Scott Moser (smoser) wrote :

I've got a MP at https://code.launchpad.net/~paul-meyer/cloud-init/+git/cloud-init/+merge/324359
I'm going to mark this one as 'Rejected'. Sorry for the harsh tone of that word.

Please move further discussion to bug or *that* mp.

Thanks for your help, Paul.
Scott

Revision history for this message
Paul Meyer (paul-meyer) wrote :

No problem, thanks for the patch!


Unmerged commits

8098f58... by Paul Meyer

Add 'udevadm settle' before enumerating partitions

Fixes lp #1692093

Preview Diff

diff --git a/cloudinit/config/cc_disk_setup.py b/cloudinit/config/cc_disk_setup.py
index 29eb5dd..5cefcd1 100644
--- a/cloudinit/config/cc_disk_setup.py
+++ b/cloudinit/config/cc_disk_setup.py
@@ -242,12 +242,14 @@ def enumerate_disk(device, nodeps=False):
 
     lsblk_cmd = [LSBLK_CMD, '--pairs', '--output', 'NAME,TYPE,FSTYPE,LABEL',
                  device]
+    udev_cmd = [UDEVADM_CMD, 'settle']
 
     if nodeps:
         lsblk_cmd.append('--nodeps')
 
     info = None
     try:
+        util.subp(udev_cmd)
         info, _err = util.subp(lsblk_cmd)
     except Exception as e:
         raise Exception("Failed during disk check for %s\n%s" % (device, e))

Subscribers

People subscribed via source and target branches