focal-arm64 install fails. No space left on device on /target

Bug #1865027 reported by Paride Legovini
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
curtin
Invalid
Undecided
Unassigned
subiquity
Fix Released
Critical
Unassigned
probert (Ubuntu)
Invalid
Undecided
Unassigned

Bug Description

When installing Focal on arm64, once the actual installation stage is reached, subiquity fails with this terminal output:

https://paste.ubuntu.com/p/6DWcxNdzPF/

The curtin and subiquity-debug logs for the very same installation are attached. The error seems to be an ENOSPC in /target while curtin is running:

Running command ['sh', '-c', 'mkdir -p "$2" && cd "$2" && rsync -aXHAS --one-file-system "$1/" .', '--', '/media/filesystem', '/target']

but I think this is bogus, as /target is an 8GB partition which should be more than enough, and according to the curtin log the partition is correctly mounted:

{
 "vdb2": {
  "ALIGNMENT": "0",
  "DISC-ALN": "0",
  "DISC-GRAN": "512",
  "DISC-MAX": "2147483136",
  "DISC-ZERO": "0",
  "FSTYPE": "",
  "GROUP": "disk",
  "KNAME": "vdb2",
  "LABEL": "",
  "LOG-SEC": "512",
  "MAJ:MIN": "252:18",
  "MIN-IO": "512",
  "MODE": "brw-rw----",
  "MODEL": "",
  "MOUNTPOINT": "",
  "NAME": "vdb2",
  "OPT-IO": "0",
  "OWNER": "root",
  "PHY-SEC": "512",
  "RM": "0",
  "RO": "0",
  "ROTA": "1",
  "RQ-SIZE": "256",
  "SIZE": "8050966528",
  "STATE": "",
  "TYPE": "part",
  "UUID": "",
  "device_path": "/dev/vdb2"
 }
}

Running command ['mount', '-t', 'ext4', '-o', 'defaults', '/dev/vdb2', '/target/'] with allowed return codes [0] (capture=True)

Unfortunately it is not easy to monitor what's going on by asking subiquity to give us a shell, as when a shell is running *this problem doesn't happen*! At least this is what I observe systematically, and took me a while to figure out. If I drop to a shell then the installation still fails, but this time as described here:

https://bugs.launchpad.net/curtin/+bug/1864992

Likely a different issue.

Given that when the problem happens it is not possible to regain control of subiquity and get a shell, and that monitoring the installation from a shell is not possible, in order to get the log files I had to setup two netcats in background before starting the installation and piped the logs out of the system.

Revision history for this message
Paride Legovini (paride) wrote :
tags: added: rls-ff-incoming
Paride Legovini (paride)
Changed in curtin:
status: New → Fix Committed
Paride Legovini (paride)
Changed in curtin:
status: Fix Committed → New
Revision history for this message
Paride Legovini (paride) wrote :

This seems to be happening again :/ Logs attached. In curtin's install.log I see the following:

dpkg: error processing archive /tmp/apt-dpkg-install-643qVf/19-linux-headers-5.4.0-18_5.4.0-18.22_all.deb (--unpack):
 cannot copy extracted data for './usr/src/linux-headers-5.4.0-18/tools/testing/selftests/sparc64/drivers/drivers_test.sh' to '/usr/src/linux-headers-5.4.0-18/tools/testing/selftests/sparc64/drivers/drivers_test.sh.dpkg-new': failed to write (No space left on device)

but if I mount /dev/vda2 it's an almost empty filesystem:

Filesystem Size Used Avail Use% Mounted on
/dev/vda2 7.4G 34M 6.9G 1% /target

According to the kernel log the mount did happen at install time:

Mar 31 14:42:48 ubuntu-server kernel: [ 295.394200] EXT4-fs (vda2): mounted filesystem with ordered data mode. Opts: (null)

Subiquity's snapcraft.yaml currently points to one of the latest curtin commits, and the following commits do not seem related to the failure, at least at first glance. However I'm not sure on the exact version of curtin being used on the arm64 subiquity snap (LP: 865075).

Revision history for this message
Ryan Harper (raharper) wrote :

 "curthooks_commands": {
  "000-configure-run": [
   "/snap/bin/subiquity.subiquity-configure-run"
  ],
  "001-configure-apt": [
   "/snap/bin/subiquity.subiquity-configure-apt",
   "/snap/subiquity/1572/usr/bin/python3",
   "false"
  ],

subiquity is injecting two custom curthook commands. Can we inspect those to see if they're messing with mounts?

Changed in curtin:
status: New → Incomplete
Revision history for this message
Paride Legovini (paride) wrote :

Here they are:

root@ubuntu-server:/# cat /snap/subiquity/1572/usr/bin/subiquity-configure-run
#!/bin/bash -eux

# curtin should probably change to do this itself. But for now:

mount --bind /run "$TARGET_MOUNT_POINT/run"

root@ubuntu-server:/# cat /snap/subiquity/1572/usr/bin/subiquity-configure-apt
#!/bin/bash -eux

# Configure apt so that installs from the pool on the cdrom are preferred
# during installation but not in the installed system.
#
# This has a few steps.
#
# 1. As the remaining steps mean that any changes to apt configuration are do
# not persist into the installed system, we get curtin to configure apt a
# bit earlier than it would by default.
#
# 2. Bind-mount the cdrom into the installed system as /run/cdrom
#
# 3. Set up an overlay over /target/etc/apt. This means that any changes we
# make will not persist into the installed system and we do not have to
# worry about cleaning up after ourselves.
#
# 4. Configure apt in /target to look at the pool on the cdrom. This has two
# subcases:
#
# a. if we expect the network to be working, this basically means
# prepending "deb file:///run/cdrom $(lsb_release -sc) main restricted"
# to the sources.list file.
#
# b. if the network is not expected to be working, we replace the
# sources.list with a file just referencing the cdrom.
#
# 5. If the network is not expected to be working, we also set up an overlay
# over /target/var/lib/apt/lists (if the network is working, we'll run "apt
# update" after the /target/etc/apt overlay has been cleared).

if [ -z "$TARGET_MOUNT_POINT" ]; then
    # Nothing good can happen from running this script without
    # TARGET_MOUNT_POINT set.
    exit 1;
fi

PY=$1
HAS_NETWORK=$2

$PY -m curtin apt-config

setup_overlay () {
    local dir=$1
    local t=$(mktemp -d)
    mkdir $t/work $t/upper
    mount -t overlay overlay -o lowerdir=$dir,workdir=$t/work,upperdir=$t/upper $dir
}

setup_overlay $TARGET_MOUNT_POINT/etc/apt

mkdir -p "$TARGET_MOUNT_POINT/cdrom"
mount --bind /cdrom "$TARGET_MOUNT_POINT/cdrom"

if [ $HAS_NETWORK = 'true' ]; then
    mv "$TARGET_MOUNT_POINT/etc/apt/sources.list" "$TARGET_MOUNT_POINT/etc/apt/sources.list.d/original.list"
else
    # Do not use the configured proxy during the install if one was configured.
    rm -f $TARGET_MOUNT_POINT/etc/apt/apt.conf.d/90curtin-aptproxy
    setup_overlay $TARGET_MOUNT_POINT/var/lib/apt/lists
fi

cat > "$TARGET_MOUNT_POINT/etc/apt/sources.list" <<EOF
deb file:///cdrom $(lsb_release -sc) main restricted
EOF

$PY -m curtin in-target -- apt-get update

Changed in curtin:
status: Incomplete → New
Revision history for this message
Ubuntu QA Website (ubuntuqa) wrote :

This bug has been reported on the Ubuntu ISO testing tracker.

A list of all reports related to this bug can be found here:
http://iso.qa.ubuntu.com/qatracker/reports/bugs/1865027

tags: added: iso-testing
Revision history for this message
Paride Legovini (paride) wrote :

This can be reproduced with virt-install:

virt-install --name focal-test --memory 4096 --disk focal-test.img,bus=virtio,size=8 --cdrom /path/to/focal-live-server-arm64.iso

Revision history for this message
Ryan Harper (raharper) wrote :

If you bump the ram to 8192 does it still fail with nospace ? Also, could you increase the size of the root disk to 16G?

Revision history for this message
Paride Legovini (paride) wrote :

Still fails with 8G of ram, but without the ENOSPC, apparently. The problem is with the mounts: the target partition is mounted at some point during the install process:

EXT4-fs (vda2): mounted filesystem with ordered data mode. Opts: (null)

but when I inspect the installer system after the failure the installed system is in /target of the *installer*. That is, all the files are there but /target is not mounted:

root@ubuntu-server:/# mount | grep target
tmpfs on /target/run type tmpfs (rw,nosuid,nodev,noexec,relatime,size=813312k,mode=755)
overlay on /target/etc/apt type overlay (rw,relatime,lowerdir=/target/etc/apt,upperdir=/tmp/tmp.0VbbIaFJdR/upper,workdir=/tmp/tmp.0VbbIaFJdR/work)
/dev/sr0 on /target/cdrom type iso9660 (ro,noatime,nojoliet,check=s,map=n,blocksize=2048)

root@ubuntu-server:/# ls /target
bin cdrom etc lib mnt proc run snap sys usr
boot dev home media opt root sbin srv tmp var

root@ubuntu-server:/# df -h
Filesystem Size Used Avail Use% Mounted on
udev 3.9G 0 3.9G 0% /dev
tmpfs 795M 1.2M 794M 1% /run
/dev/sr0 877M 877M 0 100% /cdrom
/dev/loop0 320M 320M 0 100% /rofs
/cow 3.9G 2.2G 1.7G 57% /
/dev/loop2 63M 63M 0 100% /usr/lib/modules
tmpfs 3.9G 0 3.9G 0% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 3.9G 0 3.9G 0% /sys/fs/cgroup
tmpfs 3.9G 32K 3.9G 1% /tmp
/dev/loop3 81M 81M 0 100% /snap/core/8691
/dev/loop4 49M 49M 0 100% /snap/subiquity/1572
overlay 3.9G 32K 3.9G 1% /target/etc/apt

With 4GB of ram / is big enough and it's not filled up. With 4GB of ram we have:

Filesystem Size Used Avail Use% Mounted on
/cow 2.0G 142M 1.8G 8% /

and if fills up.

Revision history for this message
Ryan Harper (raharper) wrote :

Where is /tmp mounted? cat /proc/mounts would be useful.

If possible, can you get into the instance before you start the install so that you can watch /proc/mounts ?

I'm positive theres nothing in curtin that's unmounting /target ... journal shows odd messages of target.mount unit ... those somehow show up in systemd's run dir:

find /run/systemd -name '*.mount

I worry there's some auto mount/unmount of .mount units going on and something is shooting us in the foot. I'm thinking snapd, or other subiquity related services (snap restart/refresh) ...

Also, let's watch a few things with inotify

inotifywait -mr --timefmt "%s" --format "%T %w%f %e" /target
inotifywait -mr --timefmt "%s" --format "%T %w%f %e" /dev/vda2

Revision history for this message
Paride Legovini (paride) wrote :

OK, I think I have an idea that may make sense.

You know that the installer launches a subiquity process *per terminal*. For example on arm64 two subiquity processes are launched right after the boot, one on tty1 and one on ttyAMA0:

root 1558 26.2 1.4 432964 57580 tty1 Ssl+ 10:23 0:07 /snap/subiquity/1572/usr/bin/python3 /snap/subiquity/1572/usr/bin/subiquity
root 1636 31.1 1.4 433220 57780 ttyAMA0 Ssl 10:23 0:06 /snap/subiquity/1572/usr/bin/python3 /snap/subiquity/1572/usr/bin/subiquity

and if I ssh to the instance as the 'installer' use, one new subiquity process is spawned:

root 1826 39.0 1.4 433200 57688 pts/0 Sl+ 10:24 0:05 /snap/subiquity/1572/usr/bin/python3 /snap/subiquity/1572/usr/bin/subiquity

Now what I noticed is that if I install on one terminal while leaving the ssh client open with its subiquity waiting there, it crashes after some time. This seems to be happening even to the other "idling" subiquity instances, and indeed we have several subiquity log files, one per PID. For example, from my "march31" log files tarball, we have:

subiquity-debug.log.1590
subiquity-debug.log.1668
subiquity-debug.log.1839
subiquity-debug.log.2709

When subiquity crashes it it respawned. Now, IIUC, subiquity does an early unmount of /target when it's launched, and this is what interferes with the ongoing install process. This also somehow explain the weird behavior of this failure mode, as it really depends on the moment when the unmount happens.

Now why the idling subiquity process die remains to be seen.

Revision history for this message
Dimitri John Ledkov (xnox) wrote :

Mar 31 15:44:11 ubuntu-server audit[2786]: AVC apparmor="DENIED" operation="file_inherit" profile="/usr/lib/snapd/snap-confine" name="/dev/ttyAMA0" pid=2786 comm="snap-confine" requested_mask="wr" denied_mask=">
Mar 31 15:44:11 ubuntu-server kernel: audit: type=1400 audit(1585665851.227:17): apparmor="DENIED" operation="file_inherit" profile="/usr/lib/snapd/snap-confine" name="/dev/ttyAMA0" pid=2786 comm="snap-confine">

Is concerning.

Also:

Mar 31 15:42:53 ubuntu-server systemd[1]: snap.subiquity.subiquity-service.service: Main process exited, code=killed, status=11/SEGV

Somehow subiquity crashes

Is snapd running uncontrolled refreshing?
Mar 31 15:43:05 ubuntu-server snapd[1401]: storehelpers.go:438: cannot refresh: snap has no updates available: "core", "subiquity"

Revision history for this message
Paride Legovini (paride) wrote :

Still fails with the 20200331 image. Attached is a tarball of /var/crash generated with the 20200331 iso.

Revision history for this message
Paride Legovini (paride) wrote :

As xnox noticed we have this in the journal:

systemd[1]: snap.subiquity.subiquity-service.service: Main process exited, code=killed, status=11/SEGV
systemd[1]: snap.subiquity.subiquity-service.service: Failed with result 'signal'.
systemd[1]: snap.subiquity.subiquity-service.service: Scheduled restart job, restart counter is at 1.
systemd[1]: Stopped Service for snap application subiquity.subiquity-service.
systemd[1]: Started Service for snap application subiquity.subiquity-service.

Revision history for this message
Dimitri John Ledkov (xnox) wrote :

From reading the subiquity log (the very first one) it seems like we are continuously crashing with "block device disk probing failed" and generating crashes. Over and over and over again?

grep crash$ subiquity-debug.log.1590
2020-03-31 14:42:38,808 INFO subiquity.core:278 saving crash report 'block probing crashed with CancelledError' to /var/crash/1585665758.736616135.block_probe_fail.crash
2020-03-31 14:42:39,246 INFO subiquity.core:278 saving crash report 'block probing crashed with CancelledError' to /var/crash/1585665759.208643913.block_probe_fail.crash
2020-03-31 14:42:39,839 INFO subiquity.core:278 saving crash report 'block probing crashed with CancelledError' to /var/crash/1585665759.807823420.block_probe_fail.crash
2020-03-31 14:42:40,026 INFO subiquity.core:278 saving crash report 'block probing crashed with CancelledError' to /var/crash/1585665759.983493328.disk_probe_fail.crash
2020-03-31 14:42:42,420 INFO subiquity.core:278 saving crash report 'block probing crashed with CancelledError' to /var/crash/1585665762.351994991.block_probe_fail.crash
2020-03-31 14:42:42,996 INFO subiquity.core:278 saving crash report 'block probing crashed with CancelledError' to /var/crash/1585665762.995283604.disk_probe_fail.crash
2020-03-31 14:42:43,131 INFO subiquity.core:278 saving crash report 'block probing crashed with CancelledError' to /var/crash/1585665763.130393744.disk_probe_fail.crash
2020-03-31 14:42:43,950 INFO subiquity.core:278 saving crash report 'block probing crashed with CancelledError' to /var/crash/1585665763.904731274.block_probe_fail.crash
2020-03-31 14:42:43,992 INFO subiquity.core:278 saving crash report 'block probing crashed with CancelledError' to /var/crash/1585665763.991884708.disk_probe_fail.crash
2020-03-31 14:42:47,952 INFO subiquity.core:278 saving crash report 'block probing crashed with CancelledError' to /var/crash/1585665767.950375080.block_probe_fail.crash
2020-03-31 14:42:48,513 INFO subiquity.core:278 saving crash report 'block probing crashed with CancelledError' to /var/crash/1585665768.468663692.disk_probe_fail.crash
2020-03-31 14:42:49,695 INFO subiquity.core:278 saving crash report 'block probing crashed with CancelledError' to /var/crash/1585665769.592257977.block_probe_fail.crash
2020-03-31 14:42:50,224 INFO subiquity.core:278 saving crash report 'block probing crashed with CancelledError' to /var/crash/1585665770.223218679.disk_probe_fail.crash
2020-03-31 14:42:50,448 INFO subiquity.core:278 saving crash report 'block probing crashed with CancelledError' to /var/crash/1585665770.425232410.block_probe_fail.crash

Thus yes eventually we will run out of disk space / ram, and die.

Changed in subiquity:
importance: Undecided → Critical
status: New → Confirmed
Revision history for this message
Brian Murray (brian-murray) wrote :

For what it is worth apport has some code to prevent this from happening.

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

So I think Paride is more or less right in #10: One of the subiquities starts the install. One of the subiquities then segfaults (doesn't matter which one). Systemd restarts this subiquity, which then unmounts /target, so the subiquity running the install just copies filesystem.squashfs to RAM and eventually runs out of space. One obvious thing to do would be to only unmount /target just before starting the install, so lets do that.

The 'block probing crashed with CancelledError' things are also bugs (would you believe there are two CancelledErrors??) but I don't think we are logging enough of them to fill up RAM.

I've fixed these in https://github.com/CanonicalLtd/subiquity/pull/675

Now, as to why subiquity is segfaulting: I have no idea :(

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

Ah wait, subiquity is segfaulting when curtin is in net-meta auto. That makes me very much suspect the crash is inside probert. Did libnl3 change recently?

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

It would be amazingly strange, but it's possible that https://github.com/CanonicalLtd/subiquity/pull/670 will make the crash go away.

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

In any case, can someone who can reproduce this try with edge?

Revision history for this message
Paride Legovini (paride) wrote :

The 11/SEGV crash doesn't happen with edge, but the install still fails. Crash file attached.

Revision history for this message
Paride Legovini (paride) wrote :
Changed in subiquity:
status: Confirmed → Fix Committed
Changed in curtin:
status: New → Incomplete
Changed in probert (Ubuntu):
status: New → Incomplete
Revision history for this message
Paride Legovini (paride) wrote :

I had a longer discussion with Ryan today on the "efibootmgr not found" issue that we hit after getting rid of the subiquity crash which caused /target to be unmounted. The conclusion is that curtin should *not* need efibootmgr to be present in the ephemeral system after all, so we didn't actually get to the bottom of the problem.

For the moment having subiquity stage efibootmgr is fine as a workaround, but before releasing Focal I want to fully understand what's going on and get this done right.

I'm not opening a separate bug report for the "efibootmgr not found" problem because a stray umount of /target could be the reason why efibootmgr is not found. IIUC curtin can fallback to run commands on the host system if they are not found in the /target chroot, and this could be why we observe it failing to call efibootmgr in the host. Did we only delay the subiquity crash/respawn/umount?

Further investigation is needed.

Revision history for this message
Joshua Powers (powersj) wrote :

Fixes are committed to edge, will be published to stable soon. Not sure of actual root cause, but not reproducing anymore.

Changed in probert (Ubuntu):
status: Incomplete → Invalid
Changed in curtin:
status: Incomplete → Invalid
Revision history for this message
Paride Legovini (paride) wrote :

For the record: curtin did leak efibootmgr calls in some cases, the bug has been by this curtin commit:

https://git.launchpad.net/curtin/commit/?id=0fed7d68b8f5863cd4a872828608db9f568cc271

This subiquity commit pulls in the fix and drops the staging of efibootmgr:

https://github.com/CanonicalLtd/subiquity/commit/ebd8101b5934b1497599fe00995a56d191a4f79f

I now consider this bug fully understood and properly fixed. Thanks Michael and Ryan!

Changed in subiquity:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.