curtin fails to setup bcache when unclean bcache from previous install exists

Bug #1718699 reported by Nobuto Murata
This bug affects 3 people
Affects          Status        Importance  Assigned to  Milestone
curtin           Fix Released  High        Unassigned
curtin (Ubuntu)  Fix Released  Medium      Unassigned

Bug Description

curtin: 0.1.0~bzr505-0ubuntu1~16.04.1

How to reproduce:

Let's say you have two disks, one faster and one slower.

1. create two partitions on the faster one
2. set the first partition as /
3. set the remaining partition as the bcache cache device and use the second, slower disk as the backing device
https://bugs.launchpad.net/curtin/+bug/1718699/+attachment/4970286/+files/bcache_test.png

4. install OS (with MAAS in my case)

$ maas admin machine deploy retc7f
$ maas admin machine get-curtin-config retc7f > curtin-config.yaml

-> first deployment succeeds

5. quick erase all disks (erasing a few megabytes at the beginning and the end of the drives with MAAS in my case)

$ maas admin machine release retc7f erase=true quick_erase=true
-> quick erase succeeds

6. use the same machine again with the same bcache layout, then install OS

$ maas admin machine deploy retc7f
$ maas admin machine get-curtin-config retc7f > curtin-config_after_erase.yaml

7. Get an installation failure with "Device or resource busy"

An error occured handling 'vda-part2': OSError - [Errno 16] Device or resource busy: '/dev/vda2'

Curtin config of the first install:
https://bugs.launchpad.net/curtin/+bug/1718699/+attachment/4970287/+files/curtin-config.yaml

Curtin config after the quick erase:
https://bugs.launchpad.net/curtin/+bug/1718699/+attachment/4970288/+files/curtin-config_after_erase.yaml

My gut feeling is that the backing device's bcache signature was deleted by the quick erase, but the bcache signature on the second partition of the cache device wasn't cleaned up because it sits in the middle of the disk (not covered by the quick erase). bcache was then left in an unclean state, which might cause "Device or resource busy". More bcache cleanup just before installing the OS may be required.
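
The reasoning above can be checked with a little arithmetic. This is only a minimal sketch, not the actual MAAS erase logic: the 1 MiB wipe size, the disk size and the partition offset are illustrative assumptions, and the 4 KiB superblock offset is assumed from the bcache on-disk format.

```python
MiB = 1024 * 1024
GiB = 1024 * MiB

# Assumption: the bcache superblock starts 4 KiB into the device or partition.
BCACHE_SB_OFFSET = 4096


def covered_by_quick_erase(offset, disk_size, wipe_size):
    """Return True if a byte offset lies in the wiped head or tail of the disk."""
    in_head = offset < wipe_size
    in_tail = offset >= disk_size - wipe_size
    return in_head or in_tail


# Hypothetical layout: a 20 GiB disk whose second partition starts at 10 GiB.
disk_size = 20 * GiB
part2_start = 10 * GiB
wipe_size = 1 * MiB  # the report only says "a few megabytes"; value assumed

# Superblock of a whole-disk bcache device: inside the wiped head.
print(covered_by_quick_erase(BCACHE_SB_OFFSET, disk_size, wipe_size))  # True
# Superblock of the second partition, mid-disk: survives the quick erase.
print(covered_by_quick_erase(part2_start + BCACHE_SB_OFFSET, disk_size, wipe_size))  # False
```

With these numbers, only a signature near the start or end of the disk is erased, which is consistent with the partition's signature surviving.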

curtin: Installation started. (0.1.0~bzr505-0ubuntu1~16.04.1)
third party drivers not installed or necessary.
Failed to exclusively open path: /dev/vda2
Device holders with exclusive access: []
Device mounts: []
Possible users of /dev/vda2:
None
An error occured handling 'vda-part2': OSError - [Errno 16] Device or resource busy: '/dev/vda2'
[Errno 16] Device or resource busy: '/dev/vda2'
curtin: Installation failed with exception: Unexpected error while running command.
Command: ['curtin', 'block-meta', 'custom']
Exit code: 3
Reason: -
Stdout: Failed to exclusively open path: /dev/vda2
        Device holders with exclusive access: []
        Device mounts: []
        Possible users of /dev/vda2:
        None
        An error occured handling 'vda-part2': OSError - [Errno 16] Device or resource busy: '/dev/vda2'
        [Errno 16] Device or resource busy: '/dev/vda2'

Stderr: ''

Tags: cpe-onsite

Nobuto Murata (nobuto) wrote :

I was in a hurry to get rolling, so I didn't have enough time to collect more info. I can create a more robust reproducer when I have some time.

Ryan Harper (raharper) wrote : Re: [Bug 1718699] Re: curtin fails to setup bcache when unclean bcache from previous install exists

> 4. quick erase all disks (erasing a few megabytes at the beginning and
the end of the drives with MAAS in my case)

Can you provide the curtin configuration of the first deployment and the
second?

In general, if MAAS is sending wipe: superblock, curtin handles recursively
wiping and clearing holders at each step of the way.

The full curtin log would be helpful to see what the discovered block tree
looks like as curtin starts wiping devices.

On Thu, Sep 21, 2017 at 9:52 AM, Nobuto Murata <email address hidden>
wrote:

> I was in hurry to get rolling, so didn't have enough time to collect
> more info. More robust reproducer could be created when I have some
> time.

Ryan Harper (raharper)
Changed in curtin:
importance: Undecided → High
status: New → Incomplete
Nobuto Murata (nobuto)
tags: added: cpe-onsite
Dmitrii Shcherbakov (dmitriis) wrote :

I can confirm this issue. Encountered it myself on one node. After booting to the rescue mode I found out that a bcache signature was still present.

ubuntu@jkt01z03nova003:~$ sudo blkid
/dev/sdd: UUID="bf3a9242-cb68-40b9-b2a3-d4c72894fb79" TYPE="bcache"

Interestingly, the commissioning process always succeeds while the deployment process consistently fails. During deployment, multiple devices are present: /dev/sdd and /dev/sdf.

Error message (verbose curtin log):
http://paste.ubuntu.com/25730909/

      finish: cmd-install/stage-partitioning/builtin/cmd-block-meta: FAIL: removing previous storage devices
        finish: cmd-install/stage-partitioning/builtin/cmd-block-meta: FAIL: curtin command block-meta
        Traceback (most recent call last):
          File "/curtin/curtin/commands/main.py", line 215, in main
            ret = args.func(args)
          File "/curtin/curtin/commands/block_meta.py", line 67, in block_meta
            meta_custom(args)
          File "/curtin/curtin/commands/block_meta.py", line 1186, in meta_custom
            clear_holders.clear_holders(disk_paths)
...
            return next(self.gen)
          File "/curtin/curtin/block/__init__.py", line 807, in exclusive_open
            fd = os.open(path, os.O_RDWR | os.O_EXCL)
        OSError: [Errno 16] Device or resource busy: '/dev/sdd'
        [Errno 16] Device or resource busy: '/dev/sdd'
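
The failing call in the traceback is an exclusive open. As a rough sketch of why that surfaces as errno 16 (this is not curtin's implementation; `is_busy` and its `opener` parameter are hypothetical names used for illustration): on Linux, opening a block device with `O_EXCL` fails with `EBUSY` while the device is claimed by something else, such as a mount, a holder, or a kernel driver.

```python
import errno
import os


def is_busy(path, opener=os.open):
    """Return True if an exclusive read-write open of path fails with EBUSY.

    `opener` is injectable so the check can be exercised without a real
    block device; by default it is os.open.
    """
    try:
        fd = opener(path, os.O_RDWR | os.O_EXCL)
    except OSError as exc:
        if exc.errno == errno.EBUSY:
            return True
        raise  # any other error (ENOENT, EACCES, ...) is not "busy"
    os.close(fd)
    return False
```

A check like this only tells you that something holds the device; it does not say what, which matches the empty "holders" and "mounts" lists in the log.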

curtin config + lsscsi in the rescue mode:
http://paste.ubuntu.com/25730923/

After a certain point you may start to blame the hardware, which would be incorrect.

`wipefs -a /dev/sdd` successfully removes the bcache signature and solves the issue.
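
To find leftover signatures across many disks before deciding what to wipe, the `blkid` output can be filtered. The following is a small sketch under stated assumptions: `bcache_devices` is a hypothetical helper, the `/dev/sda1` sample line is made up, and the function only parses text that has already been captured; it does not run `blkid` or `wipefs` itself.

```python
import re

# One blkid line looks like: <device>: KEY="value" KEY="value" ...
_PAIR = re.compile(r'(\w+)="([^"]*)"')


def bcache_devices(blkid_output):
    """Return device paths whose blkid TYPE is "bcache"."""
    found = []
    for line in blkid_output.splitlines():
        device, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "<device>: ..." line
        tags = dict(_PAIR.findall(rest))
        if tags.get("TYPE") == "bcache":
            found.append(device.strip())
    return found


sample = (
    '/dev/sdd: UUID="bf3a9242-cb68-40b9-b2a3-d4c72894fb79" TYPE="bcache"\n'
    '/dev/sda1: UUID="0000" TYPE="ext4"\n'  # hypothetical second device
)
print(bcache_devices(sample))  # ['/dev/sdd']
```

Each device reported this way would still need to be reviewed by hand before running `wipefs -a` on it.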

Changed in curtin:
status: Incomplete → Confirmed
Ryan Harper (raharper) wrote :

Can you provide the curtin configuration sent to the end point?

For maas 2.0+

maas <session> machine get-curtin-config <system-id>

Changed in curtin:
status: Confirmed → Incomplete
Nobuto Murata (nobuto) wrote :

I have reproduced the issue on my test bed. Will attach some info here.

Nobuto Murata (nobuto) wrote :

Disk layout settings in MAAS.

Nobuto Murata (nobuto) wrote :

How to reproduce:

$ maas admin machine deploy retc7f
$ maas admin machine get-curtin-config retc7f > curtin-config.yaml

-> first deployment succeeds

$ maas admin machine release retc7f erase=true quick_erase=true
-> quick erase succeeds

$ maas admin machine deploy retc7f
$ maas admin machine get-curtin-config retc7f > curtin-config_after_erase.yaml

-> second deployment with the same configuration fails with:
An error occured handling 'vda-part2': OSError - [Errno 16] Device or resource busy: '/dev/vda2'

Nobuto Murata (nobuto) wrote :
Nobuto Murata (nobuto) wrote :
Nobuto Murata (nobuto)
Changed in curtin:
status: Incomplete → New
Nobuto Murata (nobuto) wrote :
Nobuto Murata (nobuto)
description: updated
Nobuto Murata (nobuto)
description: updated
Ryan Harper (raharper) wrote :

On Fri, Oct 13, 2017 at 1:34 PM, Nobuto Murata <email address hidden>
wrote:

> How to reproduce:
>
> $ maas admin machine deploy retc7f
> $ maas admin machine get-curtin-config retc7f > curtin-config.yaml
>
> -> first deployment succeeds
>
> $ maas admin machine release retc7f erase=true quick_erase=true
> -> quick erase succeeds
>

Do you have the log for this operation?

When I attempt to perform the quick-wipe on disks that have been configured
in bcache mode, that operation fails. I'm interested in whether you're also
seeing this.

>
> $ maas admin machine deploy retc7f
> $ maas admin machine get-curtin-config retc7f >
> curtin-config_after_erase.yaml
>

There is no change in config (which is good).

>
> -> second deployment with the same configuration fails with:
> An error occured handling 'vda-part2': OSError - [Errno 16] Device or
> resource busy: '/dev/vda2'
>

Do you have the entire log here? In the early part of the curtin install
process, curtin dumps the block device tree layout w.r.t. which devices
have a holder (like bcache). I'd like to see what curtin saw on this
second run.

Ryan Harper (raharper) wrote :

OK. I've recreated this. Certainly a subtle bug w.r.t how bcache retains its claim to a device.

Changed in curtin:
status: New → Confirmed
Ryan Harper (raharper) wrote :

First, for a workaround, I believe if you *skip* the quick erase, maas/curtin will already issue wipe: superblock which will detect the previous bcache devices (they've not been wiped yet) and shut them down properly. Please see if that helps. Alternatively, modifying the size of the partitions between runs will prevent the bcache metadata from being found.

In terms of what's happening, here goes.

After the quick-erase, we've "hidden" two bcache devices, first a backing device on /dev/vda2, and second a cache set device on /dev/vdb.

The cache set device does not become active automatically, as it's not paired with a backing device. Pairing would normally result in a bcacheN device being found on the system, and the underlying devices (vda, vdb) would have block devices listed in /sys/class/block/<kname>/holders.

However, since they're not active, no holder is found, and curtin then attempts to open the device exclusively. This fails because the device carries a bcache cache set: /sys/class/block/<kname>/bcache exists, and /sys/class/block/<kname>/bcache/set points to the cache set, which needs to be stopped to release the underlying device.

The same is true for the backing device. These buried bcache devices are not found up front because the quick-erase wiped the partitioning data. When curtin then partitions the device at the exact same offsets, the kernel reads the superblock of the newly created partition, and the kernel bcache driver detects the metadata, "opens" the device and updates sysfs.
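
The sysfs state described above can be inspected with a few lines of Python. This is a sketch only, not the change that went into curtin: `bcache_claims` and `is_hidden_bcache` are hypothetical names, the `sysfs_root` parameter exists so the check can be run against a fake tree, and the code only reports state; it does not stop the cache set or backing device.

```python
import os


def bcache_claims(kname, sysfs_root="/sys/class/block"):
    """Describe what sysfs says about bcache's hold on a block device."""
    dev_dir = os.path.join(sysfs_root, kname)
    holders_dir = os.path.join(dev_dir, "holders")
    bcache_dir = os.path.join(dev_dir, "bcache")
    set_link = os.path.join(bcache_dir, "set")

    holders = sorted(os.listdir(holders_dir)) if os.path.isdir(holders_dir) else []
    return {
        "holders": holders,
        "has_bcache_dir": os.path.isdir(bcache_dir),
        # Resolved target of bcache/set, or None when the link is absent.
        "cache_set": os.path.realpath(set_link) if os.path.lexists(set_link) else None,
    }


def is_hidden_bcache(claims):
    """True for the case in this bug: no holders, yet bcache owns the device."""
    return not claims["holders"] and claims["has_bcache_dir"]
```

For example, `is_hidden_bcache(bcache_claims("vda2"))` would be expected to return True on a machine in the failing state, where the holders directory is empty but the `bcache` directory exists.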

Currently working on some changes to curtin to handle this particular case.

Ryan Harper (raharper)
Changed in curtin:
status: Confirmed → In Progress
Ryan Harper (raharper)
Changed in curtin:
status: In Progress → Fix Committed
Scott Moser (smoser) wrote : Fixed in Curtin 17.1

This bug is believed to be fixed in curtin 17.1. If this is still a problem for you, please make a comment and set the state back to New.

Thank you.

Changed in curtin:
status: Fix Committed → Fix Released
Scott Moser (smoser) wrote :

A fix for this bug is being SRUed to Ubuntu 16.04 and 17.10 under bug 1743618.

Changed in curtin (Ubuntu):
status: New → Fix Released
importance: Undecided → Medium
Pedro Guimarães (pguimaraes) wrote :

I'm facing this bug on a 33-server openstack deployment using xenial-queens.
Curtin version: 18.1-17-gae48e86f-0ubuntu1~16.04.1

According to comment #14, the fix was applied in version 17.1, so I can report that the problem still persists in a later version.

The failure is intermittent, meaning generally MAAS seems to correctly release servers.
But, sometimes, servers get stuck on pending during deployment and fail all subsequent deployments with errors similar to: https://pastebin.canonical.com/p/w6CJXBtHTf/

Releasing with a full erase: maas <session> machine release <system_id> erase=true
(without quick_erase) solves the issue but is very slow.

Ryan Harper (raharper) wrote :

Please attach curtin verbose logs and your curtin config.

https://discourse.maas.io/t/getting-curtin-debug-logs/169

Pedro Guimarães (pguimaraes) wrote :

@raharper
Please refer to: https://bugs.launchpad.net/curtin/+bug/1815018
I am adding all the info I've collected to that bug report instead.
