Zesty deployments failing sporadically

Bug #1680197 reported by Rod Smith
12
This bug affects 2 people
Affects Status Importance Assigned to Milestone
MAAS
Invalid
Undecided
Unassigned
curtin
Fix Released
Undecided
Unassigned

Bug Description

I'm encountering the following deployment failure (as reported in the web UI's "installation output") on about 1 in 4 deployments on two systems, one running MAAS 2.1.4 and the other MAAS 2.1.5:

curtin: Installation started. (0.1.0~bzr470-0ubuntu1~16.04.1)
Failed to exclusively open path: /dev/sda1
Traceback (most recent call last):
  File "/curtin/curtin/block/__init__.py", line 773, in exclusive_open
    fd = os.open(path, os.O_RDWR | os.O_EXCL)
OSError: [Errno 16] Device or resource busy: '/dev/sda1'
Device holders with exclusive access: []
Device mounts: []
[Errno 16] Device or resource busy: '/dev/sda1'
curtin: Installation failed with exception: Unexpected error while running command.
Command: ['curtin', 'block-meta', 'custom']
Exit code: 3
Reason: -
Stdout: Failed to exclusively open path: /dev/sda1
        Traceback (most recent call last):
          File "/curtin/curtin/block/__init__.py", line 773, in exclusive_open
            fd = os.open(path, os.O_RDWR | os.O_EXCL)
        OSError: [Errno 16] Device or resource busy: '/dev/sda1'
        Device holders with exclusive access: []
        Device mounts: []
        [Errno 16] Device or resource busy: '/dev/sda1'

Stderr: ''

I'm attaching the /var/log/maas directory tree from the 2.1.5 system. Here's the version information from the 2.1.5 installation:

dpkg -s maas | grep Version
Version: 2.1.5+bzr5596-0ubuntu1~16.04.1
root@gil:/# dpkg -l '*maas*'|cat
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-===============================-==============================-============-==================================================
ii maas 2.1.5+bzr5596-0ubuntu1~16.04.1 all "Metal as a Service" is a physical cloud and IPAM
ii maas-cert-server 0.2.30-0~74~ubuntu16.04.1 all Ubuntu certification support files for MAAS server
ii maas-cli 2.1.5+bzr5596-0ubuntu1~16.04.1 all MAAS client and command-line interface
un maas-cluster-controller <none> <none> (no description available)
ii maas-common 2.1.5+bzr5596-0ubuntu1~16.04.1 all MAAS server common files
ii maas-dhcp 2.1.5+bzr5596-0ubuntu1~16.04.1 all MAAS DHCP server
ii maas-dns 2.1.5+bzr5596-0ubuntu1~16.04.1 all MAAS DNS server
ii maas-proxy 2.1.5+bzr5596-0ubuntu1~16.04.1 all MAAS Caching Proxy
ii maas-rack-controller 2.1.5+bzr5596-0ubuntu1~16.04.1 all Rack Controller for MAAS
ii maas-region-api 2.1.5+bzr5596-0ubuntu1~16.04.1 all Region controller API service for MAAS
ii maas-region-controller 2.1.5+bzr5596-0ubuntu1~16.04.1 all Region Controller for MAAS
un maas-region-controller-min <none> <none> (no description available)
un python-django-maas <none> <none> (no description available)
un python-maas-client <none> <none> (no description available)
un python-maas-provisioningserver <none> <none> (no description available)
ii python3-django-maas 2.1.5+bzr5596-0ubuntu1~16.04.1 all MAAS server Django web framework (Python 3)
ii python3-maas-client 2.1.5+bzr5596-0ubuntu1~16.04.1 all MAAS python API client (Python 3)
ii python3-maas-provisioningserver 2.1.5+bzr5596-0ubuntu1~16.04.1 all MAAS server provisioning libraries (Python 3)

Related bugs:
 * bug 1666573: transient systemd ordering cycle in boot with overlayroot ver read-only open-iscsi root
 * bug 1723183: transient systemd ordering issue when using overlayroot

Revision history for this message
Rod Smith (rodsmith) wrote :
Revision history for this message
Andres Rodriguez (andreserl) wrote :

This seems like a curtin bug (and may even be an issue with zesty itself).

Changed in maas:
status: New → Incomplete
status: Incomplete → Invalid
Revision history for this message
Andres Rodriguez (andreserl) wrote :

Hi Rod,

 - Can you please provide the curtin version that you are using? On the maas server: dpkg -l | grep curtin
 - That said, provide the full installation log from the UI. or via (maas <user> node-results read system_id=<your_system> result_type=1 | grep "\"data\"" | cut -d"\"" -f4 | base64 --decode)
 - And please provide the config: maas <user> machine get-curtin-config <system_id_of_machine>

Revision history for this message
Ryan Harper (raharper) wrote :

The lack of exclusive access on open suggests that the device has been claimed by some other block service (Like md or lvm, or bcache).

We'll need the curtin storage configuration as Andres suggests to help understand the underlying problem.

here's the original bug that lead to curtin testing for exclusive device
access when asked to wipe a device.

https://bugs.launchpad.net/curtin/+bug/1659509

Changed in curtin:
status: New → Incomplete
Revision history for this message
Rod Smith (rodsmith) wrote :
Download full text (5.7 KiB)

Curtin version information:

$ dpkg -l | grep curtin
ii curtin-common 0.1.0~bzr470-0ubuntu1~16.04.1 all Library and tools for curtin installer
ii python3-curtin 0.1.0~bzr470-0ubuntu1~16.04.1 all Library and tools for curtin installer

Installation log:

$ maas admin node-results read system_id=xr36pd result_type=1 | grep "\"data\"" | cut -d"\"" -f4 | base64 --decode
curtin: Installation started. (0.1.0~bzr470-0ubuntu1~16.04.1)
Failed to exclusively open path: /dev/sda1
Traceback (most recent call last):
  File "/curtin/curtin/block/__init__.py", line 773, in exclusive_open
    fd = os.open(path, os.O_RDWR | os.O_EXCL)
OSError: [Errno 16] Device or resource busy: '/dev/sda1'
Device holders with exclusive access: []
Device mounts: ['/dev/sda1 /media/root-ro ext4 ro,relatime,data=ordered 0 0']
[Errno 16] Device or resource busy: '/dev/sda1'
curtin: Installation failed with exception: Unexpected error while running command.
Command: ['curtin', 'block-meta', 'custom']
Exit code: 3
Reason: -
Stdout: Failed to exclusively open path: /dev/sda1
        Traceback (most recent call last):
          File "/curtin/curtin/block/__init__.py", line 773, in exclusive_open
            fd = os.open(path, os.O_RDWR | os.O_EXCL)
        OSError: [Errno 16] Device or resource busy: '/dev/sda1'
        Device holders with exclusive access: []
        Device mounts: ['/dev/sda1 /media/root-ro ext4 ro,relatime,data=ordered 0 0']
        [Errno 16] Device or resource busy: '/dev/sda1'

Stderr: ''

Curtin machine config:

$ maas admin machine get-curtin-config xr36pd
Success.
Machine-readable output follows:
apt:
  preserve_sources_list: false
  primary:
  - arches:
    - default
    uri: http://nessus.rodsbooks.com/ubuntu/
  proxy: http://172.24.124.1:8000/
  security:
  - arches:
    - default
    uri: http://nessus.rodsbooks.com/ubuntu/
debconf_selections:
  maas: 'cloud-init cloud-init/datasources multiselect MAAS

    cloud-init cloud-init/maas-metadata-url string http://172.24.124.1:5240/MAAS/metadata/

    cloud-init cloud-init/maas-metadata-credentials string oauth_consumer_key=EUDbzueWEvEUz3XsUS&oauth_token_key=Y3MUmumkezRCgtTM93&oauth_token_secret=zuseqQu9bvTHSBJN3wwDphFskTpff2FU

    cloud-init cloud-init/local-cloud-config string apt:\n preserve_sources_list:
    false\n primary:\n - arches: [default]\n uri: http://nessus.rodsbooks.com/ubuntu/\n proxy:
    http://172.24.124.1:8000/\n security:\n - arches: [default]\n uri: http://nessus.rodsbooks.com/ubuntu/\napt_preserve_sources_list:
    true\napt_proxy: http://172.24.124.1:8000/\nmanage_etc_hosts: false\nmanual_cache_clean:
    true\nreporting:\n maas: {consumer_key: EUDbzueWEvEUz3XsUS, endpoint: ''http://172.24.124.1:5240/MAAS/metadata/status/xr36pd'',\n token_key:
    Y3MUmumkezRCgtTM93, token_secret: zuseqQu9bvTHSBJN3wwDphFskTpff2FU,\n type:
    webhook}\nsystem_info:\n package_mirrors:\n - arches: [i386, amd64]\n failsafe:
    {primary: ''http://archive.ubuntu.com/ubuntu'', security: ''http://security.ubuntu.com/ubuntu''}\n search:\n primary:
...

Read more...

Revision history for this message
Ryan Harper (raharper) wrote : Re: [Bug 1680197] Re: Zesty deployments failing sporadically
Download full text (6.9 KiB)

Device mounts: ['/dev/sda1 /media/root-ro ext4 ro,relatime,data=ordered 0
0']

That shouldn't be mounted during deployment; the ephemeral image that the
target machine netboots shouldn't mount anything at all.

Can you also include the full boot log (and if possible getting
/var/log/cloud-init* from
the ephemeral env would be helpful to see if somehow cloud-init mounted
this device).

Note, media/root-ro looks a *lot* like a custom kernel command line with
overlay root settings.

On Wed, Apr 5, 2017 at 2:07 PM, Rod Smith <email address hidden> wrote:

> Curtin version information:
>
> $ dpkg -l | grep curtin
> ii curtin-common
> 0.1.0~bzr470-0ubuntu1~16.04.1 all Library and
> tools for curtin installer
> ii python3-curtin
> 0.1.0~bzr470-0ubuntu1~16.04.1 all Library and
> tools for curtin installer
>
> Installation log:
>
> $ maas admin node-results read system_id=xr36pd result_type=1 | grep
> "\"data\"" | cut -d"\"" -f4 | base64 --decode
> curtin: Installation started. (0.1.0~bzr470-0ubuntu1~16.04.1)
> Failed to exclusively open path: /dev/sda1
> Traceback (most recent call last):
> File "/curtin/curtin/block/__init__.py", line 773, in exclusive_open
> fd = os.open(path, os.O_RDWR | os.O_EXCL)
> OSError: [Errno 16] Device or resource busy: '/dev/sda1'
> Device holders with exclusive access: []
> Device mounts: ['/dev/sda1 /media/root-ro ext4 ro,relatime,data=ordered 0
> 0']
> [Errno 16] Device or resource busy: '/dev/sda1'
> curtin: Installation failed with exception: Unexpected error while running
> command.
> Command: ['curtin', 'block-meta', 'custom']
> Exit code: 3
> Reason: -
> Stdout: Failed to exclusively open path: /dev/sda1
> Traceback (most recent call last):
> File "/curtin/curtin/block/__init__.py", line 773, in
> exclusive_open
> fd = os.open(path, os.O_RDWR | os.O_EXCL)
> OSError: [Errno 16] Device or resource busy: '/dev/sda1'
> Device holders with exclusive access: []
> Device mounts: ['/dev/sda1 /media/root-ro ext4
> ro,relatime,data=ordered 0 0']
> [Errno 16] Device or resource busy: '/dev/sda1'
>
> Stderr: ''
>
> Curtin machine config:
>
> $ maas admin machine get-curtin-config xr36pd
> Success.
> Machine-readable output follows:
> apt:
> preserve_sources_list: false
> primary:
> - arches:
> - default
> uri: http://nessus.rodsbooks.com/ubuntu/
> proxy: http://172.24.124.1:8000/
> security:
> - arches:
> - default
> uri: http://nessus.rodsbooks.com/ubuntu/
> debconf_selections:
> maas: 'cloud-init cloud-init/datasources multiselect MAAS
>
> cloud-init cloud-init/maas-metadata-url string
> http://172.24.124.1:5240/MAAS/metadata/
>
> cloud-init cloud-init/maas-metadata-credentials string
> oauth_consumer_key=EUDbzueWEvEUz3XsUS&oauth_token_key=Y3MUmumkezRCgtTM93&
> oauth_token_secret=zuseqQu9bvTHSBJN3wwDphFskTpff2FU
>
> cloud-init cloud-init/local-cloud-config string apt:\n
> preserve_sources_list:
> false\n primary:\n - arches: [default]\n uri:
> http://nessus.rodsbooks.com/ubuntu/\n proxy:
> http://172.24.124.1:8000/\n security:\n - a...

Read more...

Revision history for this message
Rod Smith (rodsmith) wrote :

Can you provide me with pointers on how to get the information you want? I know there used to be some hackish way to gain access to MAAS ephemerals, but that was several versions ago and I don't know what the current method is.

Revision history for this message
Ryan Harper (raharper) wrote :

On Wed, Apr 5, 2017 at 3:06 PM, Rod Smith <email address hidden> wrote:

> Can you provide me with pointers on how to get the information you want?
> I know there used to be some hackish way to gain access to MAAS
> ephemerals, but that was several versions ago and I don't know what the
> current method is.
>

https://gist.github.com/smoser/2610e9b78b8d7b54319675d9e3986a1b

If possible, Accessing during deployment, or stopping shutdown should
let you ssh in to examine the system.

In particular, the cloud-init.log (which actually shoud be available via
maas cli, you pasted one earlier where you base64 dumped and cut.

 maas admin node-results read system_id=xr36pd result_type=1

decoding the whole blob should have the entire cloud-init log.

--
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1680197
>
> Title:
> Zesty deployments failing sporadically
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/curtin/+bug/1680197/+subscriptions
>

Revision history for this message
Andres Rodriguez (andreserl) wrote :

Rod,

While the machine is deploying, ssh into it and run:

sudo touch /tmp/block-reboot

That will prevent the machine from being rebooted.

On Wed, Apr 5, 2017 at 4:06 PM, Rod Smith <email address hidden> wrote:

> Can you provide me with pointers on how to get the information you want?
> I know there used to be some hackish way to gain access to MAAS
> ephemerals, but that was several versions ago and I don't know what the
> current method is.
>
> --
> You received this bug notification because you are subscribed to MAAS.
> https://bugs.launchpad.net/bugs/1680197
>
> Title:
> Zesty deployments failing sporadically
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/curtin/+bug/1680197/+subscriptions
>

--
Andres Rodriguez (RoAkSoAx)
Ubuntu Server Developer
MSc. Telecom & Networking
Systems Engineer

Revision history for this message
Rod Smith (rodsmith) wrote :

Here's the entire /var/log directory tree from the ephemeral on a failed deployment.

Revision history for this message
Rod Smith (rodsmith) wrote :

FWIW, I've discovered that some nodes seem to be more susceptible than others to this bug. I've encountered it on at least four nodes, but some fail only once every several deployments, whereas others succeed in deploying only once every several tries (or maybe never). The worst seems to be an HP ProBook 6470b, which is an AMT-enabled laptop on my home network. (That's the system whose logs I've provided.) The next worse seems to be either a new Dell PowerEdge C6320p or an older Dell PowerEdge T610, both IPMI-enabled servers in 1SS. I've also encountered it, but with much less frequency, on an AMT-enabled Intel NUC on my home network. Of possible relevance: The NUC uses an SSD, whereas the other systems all have conventional spinning hard disks.

Revision history for this message
Ryan Harper (raharper) wrote :

Something I'm looking further into is:

1) kernel command line using overlayroot=tmpfs

Kernel command line: BOOT_IMAGE=ubuntu/amd64/ga-17.04/zesty/daily/boot-kernel nomodeset iscsi_target_name=iqn.2004-05.com.ubuntu:maas:ephemeral-ubuntu-amd64-ga-17.04-zesty-daily iscsi_target_ip=172.24.124.1 iscsi_target_port=3260 iscsi_initiator=kzanol ip=::::kzanol:BOOTIF ip6=off ro root=/dev/disk/by-path/ip-172.24.124.1:3260-iscsi-iqn.2004-05.com.ubuntu:maas:ephemeral-ubuntu-amd64-ga-17.04-zesty-daily-lun-1 overlayroot=tmpfs cc:{datasource_list: [MAAS]}end_cc cloud-config-url=http://172.24.124.1:5240/MAAS/metadata/latest/by-id/xr36pd/?op=get_preseed log_host=172.24.124.1 log_port=514 BOOTIF=01-f0:92:1c:5e:1e:70

2) issues with systemd mount, likely triggered after partitioning

Apr 6 13:26:53 kzanol cloud-init[2457]: curtin: Installation started. (0.1.0~bzr470-0ubuntu1~16.04.1)
Apr 6 13:26:57 kzanol kernel: [ 81.464816] sda: sda1 sda15
Apr 6 13:26:57 kzanol systemd[1]: Stopped target Local File Systems.
Apr 6 13:26:57 kzanol systemd[1]: Unmounting /media/root-ro...
Apr 6 13:26:57 kzanol systemd[1]: Unmounted /media/root-ro.
Apr 6 13:26:58 kzanol systemd[1]: Mounting /media/root-ro...
Apr 6 13:26:58 kzanol kernel: [ 82.109869] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
Apr 6 13:26:58 kzanol systemd[1]: Mounted /media/root-ro.
Apr 6 13:26:58 kzanol cloud-init[2457]: Failed to exclusively open path: /dev/sda1
Apr 6 13:26:58 kzanol cloud-init[2457]: Traceback (most recent call last):
Apr 6 13:26:58 kzanol cloud-init[2457]: File "/curtin/curtin/block/__init__.py", line 773, in exclusive_open
Apr 6 13:26:58 kzanol cloud-init[2457]: fd = os.open(path, os.O_RDWR | os.O_EXCL)
Apr 6 13:26:58 kzanol cloud-init[2457]: OSError: [Errno 16] Device or resource busy: '/dev/sda1'
Apr 6 13:26:58 kzanol cloud-init[2457]: Device holders with exclusive access: []
Apr 6 13:26:58 kzanol cloud-init[2457]: Device mounts: ['/dev/sda1 /media/root-ro ext4 ro,relatime,data=ordered 0 0']

It's likely that curtin is racying with systemd.mount files/entries related to initramfs configuring overlayroot.

Some details for that here:
https://bugs.launchpad.net/ubuntu/+source/open-iscsi/+bug/1666573

Revision history for this message
Ryan Harper (raharper) wrote :
Download full text (3.3 KiB)

OK,

We believe this is related to deploying in simple mode where curtin formats
the root filesystem and label's it
'cloudimg-rootfs' which matches the /etc/fstab entry in the ephemeral
environment; subsequently, systemd's fstab generator creates a mount unit
for "cloudimg-rootfs" to be mounted at /media/root-ro; This races with
curtin's attempt to wipe the device.

To confirm, this should reproduce 100% if you deploy with no custom storage
config, followed by a custom config.
Also, it would be interesting to see if that also reproduces with Xenial in
the above scenario; that may help narrow down
changes to systemd that make this more likely in zesty.

On Thu, Apr 6, 2017 at 10:18 AM, Ryan Harper <email address hidden>
wrote:

> Something I'm looking further into is:
>
> 1) kernel command line using overlayroot=tmpfs
>
> Kernel command line: BOOT_IMAGE=ubuntu/amd64/ga-17.04/zesty/daily/boot-
> kernel nomodeset iscsi_target_name=iqn.2004-05.com.ubuntu:maas
> :ephemeral-ubuntu-amd64-ga-17.04-zesty-daily
> iscsi_target_ip=172.24.124.1 iscsi_target_port=3260
> iscsi_initiator=kzanol ip=::::kzanol:BOOTIF ip6=off ro root=/dev/disk
> /by-path/ip-172.24.124.1:3260-iscsi-iqn.2004-05.com.ubuntu:maas
> :ephemeral-ubuntu-amd64-ga-17.04-zesty-daily-lun-1 overlayroot=tmpfs
> cc:{datasource_list: [MAAS]}end_cc cloud-config-
> url=http://172.24.124.1:5240/MAAS/metadata/latest/by-
> id/xr36pd/?op=get_preseed log_host=172.24.124.1 log_port=514
> BOOTIF=01-f0:92:1c:5e:1e:70
>
> 2) issues with systemd mount, likely triggered after partitioning
>
> Apr 6 13:26:53 kzanol cloud-init[2457]: curtin: Installation started.
> (0.1.0~bzr470-0ubuntu1~16.04.1)
> Apr 6 13:26:57 kzanol kernel: [ 81.464816] sda: sda1 sda15
> Apr 6 13:26:57 kzanol systemd[1]: Stopped target Local File Systems.
> Apr 6 13:26:57 kzanol systemd[1]: Unmounting /media/root-ro...
> Apr 6 13:26:57 kzanol systemd[1]: Unmounted /media/root-ro.
> Apr 6 13:26:58 kzanol systemd[1]: Mounting /media/root-ro...
> Apr 6 13:26:58 kzanol kernel: [ 82.109869] EXT4-fs (sda1): mounted
> filesystem with ordered data mode. Opts: (null)
> Apr 6 13:26:58 kzanol systemd[1]: Mounted /media/root-ro.
> Apr 6 13:26:58 kzanol cloud-init[2457]: Failed to exclusively open path:
> /dev/sda1
> Apr 6 13:26:58 kzanol cloud-init[2457]: Traceback (most recent call last):
> Apr 6 13:26:58 kzanol cloud-init[2457]: File
> "/curtin/curtin/block/__init__.py", line 773, in exclusive_open
> Apr 6 13:26:58 kzanol cloud-init[2457]: fd = os.open(path, os.O_RDWR
> | os.O_EXCL)
> Apr 6 13:26:58 kzanol cloud-init[2457]: OSError: [Errno 16] Device or
> resource busy: '/dev/sda1'
> Apr 6 13:26:58 kzanol cloud-init[2457]: Device holders with exclusive
> access: []
> Apr 6 13:26:58 kzanol cloud-init[2457]: Device mounts: ['/dev/sda1
> /media/root-ro ext4 ro,relatime,data=ordered 0 0']
>
> It's likely that curtin is racying with systemd.mount files/entries
> related to initramfs configuring overlayroot.
>
> Some details for that here:
> https://bugs.launchpad.net/ubuntu/+source/open-iscsi/+bug/1666573
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad....

Read more...

Revision history for this message
Andres Rodriguez (andreserl) wrote :

I can confirm this happens to me. The storage config looks like:

http://paste.ubuntu.com/24554915/

Note that this machine has:

1. disk.
2. I can deploy 16.04 just fine.
3. Deploying 17.04 fails.

Changed in curtin:
status: Incomplete → Confirmed
Revision history for this message
Scott Moser (smoser) wrote :

I think this is related to bug 1689557, which i think root cause is systemd and overlayroot.

Revision history for this message
Scott Moser (smoser) wrote :

Strike that comment.
The bug number i think is 1666573.

description: updated
Jeff Lane  (bladernr)
tags: added: hwcert-server
Revision history for this message
Scott Moser (smoser) wrote :

Long story short: I believe that SRU of bug 1723183 will fix this.

I believe this is fixed by the changes made for bug 1723183 to overlayroot.
That fix is currently in bionic, but not in xenial, artful, zesty.
The change can be seen at:
 https://git.launchpad.net/~cloud-initramfs-tools/cloud-initramfs-tools/commit/?id=ce092cb57

It is essentially the same change we have made in curtin
when moving to testing with squashfs and overlayroot and maas v3 streams.
The curtin change to avoid this is added at
 https://code.launchpad.net/~raharper/curtin/trunk.vmtests-v3-streams/+merge/332753

description: updated
Revision history for this message
Jeff Lane  (bladernr) wrote :

Going to close this as fix released for now on the assumption that A: the SRU smoser mentioned did fix this, and the fact that B: we really haven't seen this occur on releases after Zesty (All the way up to Eoan and Focal).

Changed in curtin:
status: Confirmed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.