MaaS/Curtin fail to unmount /target/run way too often

Bug #1928839 reported by Boris Lukashev
This bug affects 1 person
Affects   Status         Importance   Assigned to   Milestone
MAAS      Fix Released   Medium       Unassigned
curtin    Fix Released   Undecided    Unassigned

Bug Description

I'm in the process of iteratively rebuilding OpenStack on Juju+MaaS, and every few deploys I get anywhere from one to all of the physical hosts failing with:
```
TIMED subp(['udevadm', 'settle']): 0.010
        Running command ['umount', '/tmp/tmpj0gpggki/target/sys/firmware/efi/efivars'] with allowed return codes [0] (capture=False)
        Running command ['umount', '/tmp/tmpj0gpggki/target/sys'] with allowed return codes [0] (capture=False)
        Running command ['umount', '/tmp/tmpj0gpggki/target/run'] with allowed return codes [0] (capture=False)
        umount: /tmp/tmpj0gpggki/target/run: target is busy.
        finish: cmd-install/stage-curthooks/builtin/cmd-curthooks/installing-kernel: FAIL: installing kernel
        finish: cmd-install/stage-curthooks/builtin/cmd-curthooks: FAIL: curtin command curthooks
        Traceback (most recent call last):

```
or variants thereof in which unmounting `.../target/run` fails, which in turn fails to provision the node and breaks Juju's attempts at the OpenStack deployment. Juju's inability to `retry-provisioning` on MAAS nodes (and on LXD units, for some reason) is rather confounding when this keeps happening.

The problem is intermittent: on some runs all nodes are good, run after run. Then suddenly one, several, or ALL of them will fail, with varying log contents leading up to the failure to unmount `target/run`. After that, several subsequent deployments fail with these errors, and then it seems to go away.
We see similar behavior in our Arch Linux build process, where gpg-agent instances can get stuck running in the chroot (well, nspawn) context during package operations and have to be killed manually before the target can be unmounted from the parent. We addressed this in our workflows with a check that runs before exiting the chroot: wait up to a minute for remaining processes to finish their work, then kill off anything still running.
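
In Python terms, that pre-unmount cleanup amounts to roughly the following (a sketch with made-up names, not what our Arch tooling or curtin actually runs):
```
import os
import signal
import time

def kill_procs_rooted_in(target, grace=60):
    """Wait up to `grace` seconds for processes rooted in `target` to exit,
    then SIGKILL whatever is left so the target can be unmounted.
    Needs root, since it reads /proc/<pid>/root of other users' processes."""
    target = os.path.realpath(target)
    deadline = time.time() + grace
    while True:
        stragglers = []
        for pid in filter(str.isdigit, os.listdir('/proc')):
            try:
                root = os.path.realpath(os.readlink('/proc/%s/root' % pid))
            except OSError:
                continue  # process already exited or is not inspectable
            if root == target or root.startswith(target + os.sep):
                stragglers.append(int(pid))
        if not stragglers or time.time() > deadline:
            break
        time.sleep(1)
    for pid in stragglers:
        try:
            os.kill(pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # it finished on its own in the meantime
```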

Revision history for this message
Boris Lukashev (rageltman) wrote :

It appears that the last command run before the unmounts is `apt-get clean`, forked into its own PID namespace inside the chroot:
```
        Running command ['unshare', '--fork', '--pid', '--', 'chroot', '/tmp/tmphi85v827/target', 'apt-get', 'clean'] with allowed return codes [0] (capture=False)
        Running command ['udevadm', 'settle'] with allowed return codes [0] (capture=False)
        TIMED subp(['udevadm', 'settle']): 0.013
        Running command ['umount', '/tmp/tmphi85v827/target/sys/firmware/efi/efivars'] with allowed return codes [0] (capture=False)
        Running command ['umount', '/tmp/tmphi85v827/target/sys'] with allowed return codes [0] (capture=False)
        Running command ['umount', '/tmp/tmphi85v827/target/run'] with allowed return codes [0] (capture=False)
        umount: /tmp/tmphi85v827/target/run: target is busy.
```
My guess is that this is still executing when the unmounts occur.

This happens on XFS, EXT4, and my preferred ZFS (which for some reason is deployed with no feature flags enabled, resulting in significantly more IOs issued to the underlying media).

The OS storage media for the compute/OSD nodes is a 128G SD card on the iLO (leaving the storage bus free for Ceph), and these all passed even the abuse FIO threw at them via MAAS testing. They showed ~20MB/s on sync IO, and depending on what apt cached, they are probably still unlinking inodes when Curtin tries to unmount /run from the target.

It's not uncommon for commercial hypervisors to deploy this way, and I think this comes down to a missing safety latch in Curtin: it doesn't wait for slower install media to catch up at the end of the deployment.
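
What I mean by a safety latch is something like the retry loop below. This is only a sketch of the idea against plain subprocess calls, not actual Curtin code:
```
import subprocess
import time

def umount_with_retries(mountpoint, attempts=5, delay=3):
    """Retry umount a few times so slow media or a lingering child process
    gets a chance to finish before we give up.  Sketch only."""
    for attempt in range(1, attempts + 1):
        # Give udev and the kernel a moment to catch up before each try.
        subprocess.run(['udevadm', 'settle'], check=False)
        if subprocess.run(['umount', mountpoint]).returncode == 0:
            return
        if attempt < attempts:
            time.sleep(delay)
    raise RuntimeError('could not unmount %s after %d attempts'
                       % (mountpoint, attempts))
```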

Revision history for this message
Boris Lukashev (rageltman) wrote :

Digging into the Curtin sources, it looks like the apt-clean step happens because of the default `clean=True` in
```
def run_apt_command(mode, args=None, opts=None, env=None, target=None,
                    execute=True, allow_daemons=False, clean=True):
```
which is checked in
```
with ChrootableTarget(target, allow_daemons=allow_daemons) as inchroot:
    cmd_rv = inchroot.subp(cmd, env=env)
    if clean and mode in ['dist-upgrade', 'install', 'upgrade']:
        inchroot.subp(['apt-get', 'clean'])
```
Is there some way via MaaS to change this behavior without having to figure out how to hot-load Curtin patches into a running instance?
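
For what it's worth, in plain Python the knob is just that `clean` default; a caller could skip the trailing `apt-get clean` by passing `clean=False`. The import path below is my assumption, and this still doesn't answer the MAAS-side question:
```
from curtin.distro import run_apt_command  # module path is an assumption

# Hypothetical call: install a package into the target without the extra
# 'apt-get clean' that otherwise runs inside the chroot afterwards.
run_apt_command('install', args=['linux-generic'],
                target='/tmp/tmpXXXXXXXX/target', clean=False)
```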

Revision history for this message
Boris Lukashev (rageltman) wrote :

In the ZFS case, you really want async_destroy enabled when running this. By default, MaaS deploys ZFS with:
```
pool autotrim off default
rpool feature@async_destroy disabled local
rpool feature@empty_bpobj disabled local
rpool feature@lz4_compress disabled local
rpool feature@multi_vdev_crash_dump disabled local
rpool feature@spacemap_histogram disabled local
rpool feature@enabled_txg disabled local
rpool feature@hole_birth disabled local
rpool feature@extensible_dataset disabled local
rpool feature@embedded_data disabled local
rpool feature@bookmarks disabled local
rpool feature@filesystem_limits disabled local
rpool feature@large_blocks disabled local
rpool feature@large_dnode disabled local
rpool feature@sha512 disabled local
rpool feature@skein disabled local
rpool feature@edonr disabled local
rpool feature@userobj_accounting disabled local
rpool feature@encryption disabled local
rpool feature@project_quota disabled local
rpool feature@device_removal disabled local
rpool feature@obsolete_counts disabled local
rpool feature@zpool_checkpoint disabled local
rpool feature@spacemap_v2 disabled local
rpool feature@allocation_classes disabled local
rpool feature@resilver_defer disabled local
rpool feature@bookmark_v2 disabled local

```
Trim might be nice too, though you are on 0.8.3 (OpenZFS is fast approaching 2.1.0), so I'm not sure whether any rough edges remain there. OpenZFS devs have started working on performance again, so there may be a bunch of good reasons for Canonical to resync with upstream.
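
For anyone who wants to flip these on after deployment, it's one `zpool set` per feature. A quick sketch using the pool name from the listing above (note that feature flags can only be enabled, never turned back off):
```
import subprocess

# Enable a few of the features listed above on the deployed pool.
for feature in ('async_destroy', 'lz4_compress', 'spacemap_histogram'):
    subprocess.run(['zpool', 'set', 'feature@%s=enabled' % feature, 'rpool'],
                   check=True)

# autotrim is a regular pool property rather than a feature flag.
subprocess.run(['zpool', 'set', 'autotrim=on', 'rpool'], check=True)
```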

Revision history for this message
Alberto Donato (ack) wrote :

Curtin is a MAAS dependency.
If this is confirmed to be a curtin bug, MAAS will be able to pick up the fix once it's fixed in curtin and backported to focal.

Changed in maas:
status: New → Triaged
importance: Undecided → Medium
Revision history for this message
Boris Lukashev (rageltman) wrote :

In the meantime, is there some way for me to inject a remedial bit of Python process cleanup, or even a sleep, between where curtin installs and then cleans up kernel packages and where it tries to unmount the target's /run?
I'm not an end user; I'm the solutions architect for a partner trying to iron out and standardize the implementation of this stack on hardware we acquired largely for this purpose, so this burns real cycles every time it fails. I need a development-level workaround of some sort to reduce that loss in the meantime. I can try to hack on it myself, but I'm no fan of Python or Go (Ruby and Rust/C are my poisons of choice for those use cases). I figure someone in engineering has their own injection workarounds to keep down operational stalls while the relevant teams resolve this sort of thing, so if there's anything hacky and completely unsupportable under the couch cushions, I'll be happy to adapt it while the bug pipeline works itself out.

Revision history for this message
Boris Lukashev (rageltman) wrote :

To sprinkle a bit of cross-domain-responsibility-hatred on top of this suck-salad, it might not be a purely Curtin problem: restarting the MaaS host makes the behavior go away for a couple of iterations. I honestly see no rhyme or reason to that; it's not as if MAAS tells curtin to wait longer after MAAS is freshly booted.
After letting MAAS stay up for the last week, I started seeing this manifest on properly backed control VMs during provisioning (they live on top of 10k-RPM disks in a ZFS mirrored stripe with SLOG and ARC), so while slow storage seems to trip this issue much more often, it apparently can occur with genuinely fast disks underneath.

Revision history for this message
Boris Lukashev (rageltman) wrote :

Interestingly, on the control VMs this doesn't always happen during the kernel install pass; it can also be tripped while installing the bootloader or setting efivars:
```
        Running command ['mount', '--bind', '/sys/firmware/efi/efivars', '/tmp/tmpks300m4p/target/sys/firmware/efi/efivars'] with allowed return codes [0] (capture=False)
        Running command ['unshare', '--fork', '--pid', '--', 'chroot', '/tmp/tmpks300m4p/target', 'efibootmgr', '-v'] with allowed return codes [0] (capture=True)
        Running command ['udevadm', 'settle'] with allowed return codes [0] (capture=False)
        TIMED subp(['udevadm', 'settle']): 0.023
        Running command ['umount', '/tmp/tmpks300m4p/target/sys/firmware/efi/efivars'] with allowed return codes [0] (capture=False)
        Running command ['umount', '/tmp/tmpks300m4p/target/sys'] with allowed return codes [0] (capture=False)
        Running command ['umount', '/tmp/tmpks300m4p/target/run'] with allowed return codes [0] (capture=False)
        umount: /tmp/tmpks300m4p/target/run: target is busy.
        finish: cmd-install/stage-curthooks/builtin/cmd-curthooks/install-grub: FAIL: installing grub to target devices
        finish: cmd-install/stage-curthooks/builtin/cmd-curthooks/configuring-bootloader: FAIL: configuring target system bootloader
        finish: cmd-install/stage-curthooks/builtin/cmd-curthooks: FAIL: curtin command curthooks
        Traceback (most recent call last):
```

Revision history for this message
Ryan Harper (raharper) wrote :

@Boris

Are you ssh'ing in (or do you have something that logs into the instance) during deployment?

I've seen this happen many times when the ubuntu user logs into the node during deployment. When that happens, systemd mounts per-user directories under /run, and those extra mounts are still present when curtin tries to unmount the /run bind mount under /target, which keeps the target busy.
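
If you can catch a wedged deploy in the ephemeral environment, something along these lines should show what is stacked under the target's /run and what is keeping it busy (a generic diagnostic sketch, not part of curtin; fuser comes from psmisc):
```
import subprocess

def show_run_holders(target):
    """Print mounts stacked under <target>/run and the processes using it."""
    run_path = target.rstrip('/') + '/run'
    with open('/proc/self/mounts') as fp:
        for line in fp:
            mountpoint = line.split()[1]
            if mountpoint == run_path or mountpoint.startswith(run_path + '/'):
                print('mounted:', mountpoint)
    # -v lists the holding processes, -m matches anything on that mount.
    subprocess.run(['fuser', '-vm', run_path], check=False)

show_run_holders('/tmp/tmpj0gpggki/target')  # path from the log above
```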

If you can, please attach the curtin debug logs.
https://discourse.maas.io/t/getting-curtin-debug-logs/169

Possibly a duplicate here: https://bugs.launchpad.net/maas/+bug/1918490

Changed in curtin:
status: New → Incomplete
Revision history for this message
Boris Lukashev (rageltman) wrote :

@Ryan: negative, not during deployment. I log in on occasion to debug the Juju side of this after MAAS is done, but not during MAAS deployment (I actually didn't know you could log in during deployment; it didn't occur to me to try).

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

Is there any way to get syslog or journal output when this happens?

Revision history for this message
Boris Lukashev (rageltman) wrote :

@Michael - I was hoping you could tell me; these are Canonical stacks.
My previous question stands: how do I work around this while it's being fixed upstream? The storage media doesn't seem to be the cause, though it happens very consistently on the compute nodes, which have their OS on slower storage. I've re-bootstrapped my entire stack four times since this morning, and every time 6-10 compute nodes fail, with different nodes failing on different passes.

Revision history for this message
Server Team CI bot (server-team-bot) wrote :

This bug is fixed with commit 1f663e51 to curtin on branch master.
To view that commit see the following URL:
https://git.launchpad.net/curtin/commit/?id=1f663e51

Changed in curtin:
status: Incomplete → Fix Committed
Revision history for this message
Boris Lukashev (rageltman) wrote :

Thanks, I went over the commit diff - it looks rational from here. I'll build a copy off master and test over the weekend, unless upstream expects the snap version to have the fix by then, in which case I'll pull it down from there.

Revision history for this message
Dan Bungert (dbungert) wrote : Fixed in curtin version 21.3.

This bug is believed to be fixed in curtin version 21.3. If this is still a problem for you, please make a comment and set the state back to New.

Thank you.

Changed in curtin:
status: Fix Committed → Fix Released
Revision history for this message
Boris Lukashev (rageltman) wrote :

Thank you!
I will be running some more provisioning passes around the start of the year, at which point I'll recreate the conditions seen during the report and validate.

Alberto Donato (ack)
Changed in maas:
milestone: none → next
Revision history for this message
Jerzy Husakowski (jhusakowski) wrote :

The curtin fix is part of the MAAS 3.2 release.

Changed in maas:
milestone: next → 3.2.0
status: Triaged → Fix Committed
Changed in maas:
milestone: 3.2.0 → 3.2.0-beta3
status: Fix Committed → Fix Released