cc_grub_dpkg: determine idevs in a more robust manner with grub-probe

Bug #1877491 reported by Matthew Ruffell
This bug affects 3 people
Affects     Status        Importance  Assigned to      Milestone
cloud-init  Fix Released  Undecided   Matthew Ruffell

Bug Description

Currently, we populate the debconf database variable grub-pc/install_devices by checking whether any device from a hardcoded list [1] of device paths is present:

- /dev/sda
- /dev/vda
- /dev/xvda
- /dev/sda1
- /dev/vda1
- /dev/xvda1

[1] https://github.com/canonical/cloud-init/blob/master/cloudinit/config/cc_grub_dpkg.py

While this is a simple, elegant solution, the hardcoded list does not match real-world conditions, where grub is often installed to a disk that is not on this list.

The primary example is any cloud which uses NVMe storage, such as AWS c5 instances.

/dev/nvme0n1 is not on the above list, so in this case the module falls back to a hardcoded /dev/sda value for grub-pc/install_devices.
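
The behaviour described above can be sketched roughly as follows. This is an illustrative reconstruction from the description in this report, not the module's actual code: the names CANDIDATE_DEVICES, FALLBACK_DEVICE and pick_install_device are hypothetical, and the exists parameter is there only so the logic can be exercised without real devices.

```python
import os

# Candidate device paths, in the order listed in this report.
CANDIDATE_DEVICES = (
    "/dev/sda", "/dev/vda", "/dev/xvda",
    "/dev/sda1", "/dev/vda1", "/dev/xvda1",
)
# Hardcoded fallback used when no candidate is present (per this report).
FALLBACK_DEVICE = "/dev/sda"


def pick_install_device(exists=os.path.exists):
    """Return the first candidate that exists, else the hardcoded fallback."""
    for dev in CANDIDATE_DEVICES:
        if exists(dev):
            return dev
    return FALLBACK_DEVICE


# On an NVMe-only instance none of the candidates exist, so the fallback
# is returned even though /dev/sda is not present either.
nvme_only = {"/dev/nvme0n1", "/dev/nvme1n1", "/dev/nvme1n1p1"}
print(pick_install_device(exists=lambda path: path in nvme_only))
```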

The problem is that the grub postinstall script [2] checks whether the device named in grub-pc/install_devices exists and, if it does not, shows the user an interactive dpkg prompt where they must select the disk to install grub to. See the screenshot [3].

[2] https://paste.ubuntu.com/p/5FChJxbk5K/
[3] https://launchpadlibrarian.net/478771797/Screenshot%20from%202020-04-14%2014-39-11.png

This breaks scripts that don't set DEBIAN_FRONTEND=noninteractive, because they hang waiting for the user to make a choice.
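
For scripts that drive package operations from Python, a minimal sketch of setting that variable is shown below. The helper name noninteractive_env is hypothetical, and this only avoids the prompt in the calling script; it does not correct the value stored in grub-pc/install_devices.

```python
import os


def noninteractive_env(base=None):
    """Return a copy of the environment with DEBIAN_FRONTEND=noninteractive."""
    env = dict(os.environ if base is None else base)
    env["DEBIAN_FRONTEND"] = "noninteractive"
    return env


# Example (not executed here): pass the environment to a package command.
# import subprocess
# subprocess.run(["apt-get", "install", "-y", "grub-pc"],
#                env=noninteractive_env(), check=True)
```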

I propose that we modify the cc_grub_dpkg module to be more robust at selecting the correct disk grub is installed to.

Why not simply add an extra device to the hardcoded list?

Let's take NVMe storage as an example again. On a c5d.large instance I spun up just now, lsblk returns:

$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
nvme0n1 259:0 0 46.6G 0 disk
nvme1n1 259:1 0 8G 0 disk
└─nvme1n1p1 259:2 0 8G 0 part /

We cannot hardcode /dev/nvme0n1, as NVMe device naming is not stable in the kernel: on some boots the 8G disk will be /dev/nvme0n1, and on others it will be /dev/nvme1n1.

Instead, I propose that we determine which disk grub has been installed to by following the grub2 debian/postinst.in script, and implementing the algorithm behind the usable_partitions(), device_to_id() and available_ids() functions [4].

[4] https://paste.ubuntu.com/p/vKFNSwNyhP/

This uses grub-probe to find the root disk where the /boot directory is located, and then turns the disk name into a /dev/disk/by-id/ value.

This is robust to unstable kernel device naming conventions.

On Nitro, this returns:
/dev/disk/by-id/nvme-Amazon_Elastic_Block_Store_vol0179fff411dd211f0

On Xen, this returns:
/dev/xvda

On a typical QEMU/KVM machine, this returns:
/dev/vda

On my personal desktop computer, this returns:
/dev/disk/by-id/ata-WDC_WD5000AAKX-00PWEA0_WD-WMAYP3497618

I have tested this on AWS, on Xen, Nitro, on KVM, with BIOS and EFI based instances, in LXC, and on bare metal with a BIOS based MAAS machine.

All give the correct results in my testing.

TESTING:

You can fetch grub-pc/install_devices with:

$ echo get grub-pc/install_devices | sudo debconf-communicate grub-pc

Reset with:

$ echo reset grub-pc/install_devices | sudo debconf-communicate grub-pc

Tags: sts
Matthew Ruffell (mruffell) wrote :
Changed in cloud-init:
status: New → In Progress
assignee: nobody → Matthew Ruffell (mruffell)
tags: added: sts
description: updated
Nivedita Singhvi (niveditasinghvi) wrote :

Reviewed Matthew's changes - LGTM, although I haven't tested.

Dimitri John Ledkov (xnox) wrote :

I thought that grub-mkdevicemap is deprecated, and that for the majority of use cases it is neither needed nor does anything (i.e. by default there is no /boot/grub/device.map on my system).

/by-id/ is the preferred way to identify e.g. the ESP partition or the MBR partition. You can always refer to the Ubuntu FSTAB policy for the best symlinks to use: https://wiki.ubuntu.com/FSTAB

I don't like forking the code path between BIOS and UEFI. Our images support both bootloaders and in theory can boot under either method. That is true for the first boot, but currently we do not correctly apply updates to continue upgrading grub2 for both protocols.

In focal, grub-[pc|efi]/install_devices is now a multiselect field that supports multiple values, so it would be nice for cloud-init to support it as multiselect too. We even have support for multiple ESPs, to correctly handle resilient boot with full-disk RAID1.
For example, on my system:
grub-efi-amd64 grub-efi/install_devices multiselect /dev/disk/by-id/nvme-eui.00000000000000006479a71d90513837-part1
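
As a rough illustration of handling such an entry, the sketch below splits a line of the form "owner question type value" and then splits a multiselect value into its entries. The function names are hypothetical, and the comma-separated value format is an assumption about debconf multiselect fields rather than something stated in this report.

```python
def parse_selection(line):
    """Split an 'owner question type value' line into its four fields."""
    parts = line.split(None, 3)
    if len(parts) < 3:
        raise ValueError("expected at least owner, question and type")
    owner, question, qtype = parts[:3]
    value = parts[3].strip() if len(parts) > 3 else ""
    return owner, question, qtype, value


def multiselect_values(value):
    """Split a multiselect value into entries (assumes comma separation)."""
    return [item.strip() for item in value.split(",") if item.strip()]


line = ("grub-efi-amd64 grub-efi/install_devices multiselect "
        "/dev/disk/by-id/nvme-eui.00000000000000006479a71d90513837-part1")
owner, question, qtype, value = parse_selection(line)
print(question, multiselect_values(value))
```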

This is just a light review of the proposal / improvements that could be made to cc_grub_dpkg module.

description: updated
Matthew Ruffell (mruffell) wrote :

Hi Dimitri!

Thanks for your feedback! I have taken it on-board and pushed a new revision to the merge request.

grub-mkdevicemap has been removed in favour of grub-probe.

The fork for BIOS or EFI systems has been completely removed, and the code simplified.

The new code closely follows the algorithm from the usable_partitions(), device_to_id() and available_ids() functions in the grub2 debian/postinst.in script: https://paste.ubuntu.com/p/vKFNSwNyhP/

In short:

1) Fetch the disk the /boot directory is located on with grub-probe
2) If /dev/disk/by-id/ exists, create a mapping of /dev/disk/by-id values to devices.
3) Resolve the symlink from each /dev/disk/by-id value to find the one that matches the disk from 1).
4) If there is no /dev/disk/by-id value, fall back to the plain device name.
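
The four steps above can be sketched in Python as follows. This is a minimal illustration of the approach as described in this comment, not the code from the merge request: the function names are hypothetical, picking the first by-id name in sorted order is an assumption, and grub-probe usually needs to be run as root.

```python
import os
import subprocess


def probe_boot_disk(path="/boot"):
    """Step 1: ask grub-probe which disk holds the given path."""
    result = subprocess.run(
        ["grub-probe", "--target=disk", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def by_id_mapping(by_id_dir="/dev/disk/by-id"):
    """Steps 2-3: map each resolved device path to its by-id symlinks."""
    mapping = {}
    if not os.path.isdir(by_id_dir):
        return mapping
    for name in sorted(os.listdir(by_id_dir)):
        link = os.path.join(by_id_dir, name)
        # realpath follows the symlink to the underlying device node.
        mapping.setdefault(os.path.realpath(link), []).append(link)
    return mapping


def device_to_id(disk, by_id_dir="/dev/disk/by-id"):
    """Step 4: prefer a by-id name, else fall back to the plain device."""
    links = by_id_mapping(by_id_dir).get(os.path.realpath(disk))
    return links[0] if links else disk


# Example (not executed here, needs grub-probe and usually root):
# print(device_to_id(probe_boot_disk()))
```

Keeping the by-id directory as a parameter lets the mapping be exercised against a temporary directory of symlinks instead of a real /dev tree.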

Please review and let me know what you think.

Changed in cloud-init:
status: In Progress → Fix Committed
summary: - cc_grub_dpkg: determine idevs in a more robust manner with grub-mkdevicemap
+ cc_grub_dpkg: determine idevs in a more robust manner with grub-probe
Matthew Ruffell (mruffell) wrote :

The fix is currently being SRU'd in cloud-init version 20.2-45-g5f7825e2-0ubuntu1, which is currently sitting in -proposed.

I have validated the packages on Xenial, Bionic, Eoan, Focal and Groovy, on a t2.micro and a c5d.large instance each on AWS, and everything looks good.

To aid in your testing, I have written some instructions to make testing straightforward.

Instructions to install:

1) sudo -s
2) cat <<EOF >/etc/apt/sources.list.d/ubuntu-$(lsb_release -cs)-proposed.list
# Enable Ubuntu proposed archive
deb http://archive.ubuntu.com/ubuntu/ $(lsb_release -cs)-proposed restricted main multiverse universe
EOF
3) sudo apt update
4) sudo apt install cloud-init
5) sudo apt-cache policy cloud-init | grep Installed
Installed: 20.2-45-g5f7825e2-0ubuntu1~18.04.1

The cc_grub_dpkg module only runs once, at instance creation. To get it to run again, we will clear some configuration and instruct cloud-init to reconfigure the system again from scratch.

We need to clear out pre-existing debconf variables in the database:

6) echo reset grub-pc/install_devices | sudo debconf-communicate grub-pc
7) echo reset grub-pc/install_devices_empty | sudo debconf-communicate grub-pc

cloud-init needs the hostname changed to fully reset to a fresh state:
8) sudo hostname test1
9) sudo cloud-init clean --logs --reboot

The machine will now reboot. Once the machine comes up again, you can attempt to trigger the interactive dialog box by removing and installing grub packages:

10) sudo apt remove grub-common grub-gfxpayload-lists grub-legacy-ec2 grub-pc grub-pc-bin grub2-common
11) sudo apt install grub-common grub-gfxpayload-lists grub-legacy-ec2 grub-pc grub-pc-bin grub2-common

This will now complete without any problems with the cloud-init package from -proposed.

You can also see if the debconf database entries have been set to the correct values:

12) echo get grub-pc/install_devices | sudo debconf-communicate grub-pc

On Xen instances, you will see something like "0 /dev/xvda", and on Nitro instances, you will see something like "0 /dev/disk/by-id/nvme-Amazon_Elastic_Block_Store_vol07ea2e1d719941167".
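
When checking these values from a script, the reply can be split into its numeric status code and the value that follows. The sketch below assumes the reply has the "code, space, text" shape shown above, with 0 indicating success; the function name parse_debconf_reply is hypothetical.

```python
def parse_debconf_reply(reply):
    """Split a reply such as '0 /dev/xvda' into (status_code, text)."""
    code, _, text = reply.strip().partition(" ")
    return int(code), text


status, value = parse_debconf_reply("0 /dev/xvda")
if status == 0:
    print("grub-pc/install_devices is set to:", value)
```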

The new cc_grub_dpkg module in cloud-init 20.2-45-g5f7825e2-0ubuntu1 fixes the grub interactive dialog box problems, and I am happy to mark this as verified.

James Falcon (falcojr) wrote : Fixed in cloud-init version 20.3.

This bug is believed to be fixed in cloud-init version 20.3. If this is still a problem for you, please make a comment and set the state back to New.

Thank you.

Changed in cloud-init:
status: Fix Committed → Fix Released