curtin mishandles removing dm-multipath partition removal

Bug #1857042 reported by Frank Heimes
24
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Ubuntu on IBM z Systems
Fix Released
Critical
Canonical Server
curtin
Fix Released
Undecided
Ryan Harper
subiquity
Fix Released
Undecided
Unassigned

Bug Description

With a closer look at the install we discovered that curtin
was attempting to clear a partition which was still controlled
by device-mapper; this prevents writes to the underlying partition.

Curtin will need to look up a partition by it's parent device,
discover the multipath-id of the parent, construct the correct
device-mapper DM_NAME value, and examine the output from
dmsetup ls to map the DM_NAME to the device-mapper device (dm-XX).

While working on this bug, we discovered a subiquity bug,
LP: #1860444

-- Original Description --
While trying Ubuntu focal ISO image with timestamp Dec 19 that includes subiquity 19.12.2+git6.b024f284 I run into problems with curtin on the disk handling:

2019-12-19 16:17:15,732 WARNING curtin:1242 Extract storage config does not validate.
2019-12-19 16:17:15,734 ERROR block-discover:115 block probing failed restricted=True
Traceback (most recent call last):
  File "/snap/subiquity/1413/lib/python3.6/site-packages/subiquity/controllers/filesystem.py", line 112, in _probe
    await asyncio.wait_for(self._probe_once_task.task, 5.0)
  File "/snap/subiquity/1413/usr/lib/python3.6/asyncio/tasks.py", line 358, in wait_for
    return fut.result()
  File "/snap/subiquity/1413/lib/python3.6/site-packages/subiquity/controllers/filesystem.py", line 99, in _probe_once
    self.model.load_probe_data(storage)
  File "/snap/subiquity/1413/lib/python3.6/site-packages/subiquity/models/filesystem.py", line 1378, in load_probe_data
    self.reset()
  File "/snap/subiquity/1413/lib/python3.6/site-packages/subiquity/models/filesystem.py", line 1202, in reset
    config = storage_config.extract_storage_config(self._probe_data)
  File "/snap/subiquity/1413/lib/python3.6/site-packages/curtin/storage_config.py", line 1257, in extract_storage_config
    tree = get_config_tree(cfg.get('id'), final_config)
  File "/snap/subiquity/1413/lib/python3.6/site-packages/curtin/storage_config.py", line 274, in get_config_tree
    for dep in find_item_dependencies(item, sconfig):
  File "/snap/subiquity/1413/lib/python3.6/site-packages/curtin/storage_config.py", line 244, in find_item_dependencies
    _validate_dep_type(item_id, dep_key, dep, config)
  File "/snap/subiquity/1413/lib/python3.6/site-packages/curtin/storage_config.py", line 192, in _validate_dep_type
    'Invalid dep_id (%s) not in storage config' % dep_id)
ValueError: Invalid dep_id (disk-sda) not in storage config
2019-12-19 16:17:15,741 INFO subiquity.core:193 saving crash report 'block probing crashed with ValueError' to /var/crash/1576772235.740345001.disk_probe_fail.crash

I did two installations that failed this way.
I've attached two sets of logs/crash files.

Related branches

Revision history for this message
Frank Heimes (fheimes) wrote :
Changed in ubuntu-z-systems:
importance: Undecided → High
Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
importance: High → Critical
Revision history for this message
Dan Watkins (oddbloke) wrote :

I think the underlying issue here is that curtin failed during block-meta due to /dev/sda1 being busy:

finish: cmd-install/stage-partitioning/builtin/cmd-block-meta/clear-holders: FAIL: removing previous storage devices
TIMED BLOCK_META: 10.963
finish: cmd-install/stage-partitioning/builtin/cmd-block-meta: FAIL: curtin command block-meta
Traceback (most recent call last):
  File "/snap/subiquity/1413/lib/python3.6/site-packages/curtin/commands/main.py", line 202, in main
    ret = args.func(args)
  File "/snap/subiquity/1413/lib/python3.6/site-packages/curtin/log.py", line 97, in wrapper
    return log_time("TIMED %s: " % msg, func, *args, **kwargs)
  File "/snap/subiquity/1413/lib/python3.6/site-packages/curtin/log.py", line 79, in log_time
    return func(*args, **kwargs)
  File "/snap/subiquity/1413/lib/python3.6/site-packages/curtin/commands/block_meta.py", line 79, in block_meta
    meta_clear(devices, state.get('report_stack_prefix', ''))
  File "/snap/subiquity/1413/lib/python3.6/site-packages/curtin/commands/block_meta.py", line 1662, in meta_clear
    clear_holders.clear_holders(devices)
  File "/snap/subiquity/1413/lib/python3.6/site-packages/curtin/block/clear_holders.py", line 624, in clear_holders
    shutdown_function(dev_info['device'])
  File "/snap/subiquity/1413/lib/python3.6/site-packages/curtin/block/clear_holders.py", line 288, in wipe_superblock
    _wipe_superblock(blockdev)
  File "/snap/subiquity/1413/lib/python3.6/site-packages/curtin/block/clear_holders.py", line 333, in _wipe_superblock
    exclusive=exclusive, strict=strict)
  File "/snap/subiquity/1413/lib/python3.6/site-packages/curtin/block/__init__.py", line 1127, in wipe_volume
    quick_zero(path, partitions=False, exclusive=exclusive, strict=strict)
  File "/snap/subiquity/1413/lib/python3.6/site-packages/curtin/block/__init__.py", line 1041, in quick_zero
    exclusive=exclusive, strict=strict)
  File "/snap/subiquity/1413/lib/python3.6/site-packages/curtin/block/__init__.py", line 1061, in zero_file_at_offsets
    with exclusive_open(path, exclusive=exclusive) as fp:
  File "/snap/subiquity/1413/usr/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/snap/subiquity/1413/lib/python3.6/site-packages/curtin/block/__init__.py", line 954, in exclusive_open
    fd = os.open(path, flags)
OSError: [Errno 16] Device or resource busy: '/dev/sda1'
[Errno 16] Device or resource busy: '/dev/sda1'

And I think that curtin thinks that /dev/sda1 shouldn't be busy because fuser returns non-zero and so doesn't report any users:

wiping /dev/sda1 attempt 4/4
wiping 1M on /dev/sda1 at offsets [0, -1048576]
Failed to exclusively open path: /dev/sda1
devname '/dev/sda1' had holders: []
Device holders with exclusive access: []
Device mounts: []
Running command ['fuser', '--verbose', '--mount', '/dev/sda1'] with allowed return codes [0] (capture=True)
fuser returned non-zero: ''
Possible users of /dev/sda1:
None

Playing around locally, fuser emits its error messages on stderr, so perhaps we're not capturing stderr here to get the full error message? I'm unsure.

Revision history for this message
Ryan Harper (raharper) wrote :

@Frank

Can you provide any more details about the storage setup at the start and what steps you were taking in subiquity? At some point /dev/sda becomes unaccesible and then things fall apart so it would be good to understand what's happening on the system before the failure.

Changed in curtin:
status: New → Incomplete
Revision history for this message
Frank Heimes (fheimes) wrote :
Download full text (3.2 KiB)

@ Ryan, sure.
So the disk setup that I use is actually the simplest that's possible (as of today).
It's 'normal' multipath (two HBAs) zFCP/SCSI setup with a 64GB LUN that results in a mpatha at the end.

root@ubuntu-server:/# multipath -l
mpatha (36005076306ffd6b60000000000002602) dm-0 IBM,2107900
size=64G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=0 status=active
  |- 0:0:0:1073889318 sda 8:0 active undef running
  |- 0:0:1:1073889318 sdb 8:16 active undef running
  |- 1:0:0:1073889318 sdc 8:32 active undef running
  `- 1:0:1:1073889318 sdd 8:48 active undef running
root@ubuntu-server:/#

root@ubuntu-server:/# multipath -ll
mpatha (36005076306ffd6b60000000000002602) dm-0 IBM,2107900
size=64G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=50 status=active
  |- 0:0:0:1073889318 sda 8:0 active ready running
  |- 0:0:1:1073889318 sdb 8:16 active ready running
  |- 1:0:0:1073889318 sdc 8:32 active ready running
  `- 1:0:1:1073889318 sdd 8:48 active ready running
root@ubuntu-server:/#

root@ubuntu-server:/# dmsetup ls --tree -o nodevice
mpatha-part1
       mpatha
root@ubuntu-server:/#

root@ubuntu-server:/# fdisk -l /dev/mapper/mpatha
Disk /dev/mapper/mpatha: 64 GiB, 68719476736 bytes, 134217728 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 3334481D-1BA4-4C8E-B7DD-4B4A1652DF73

Device Start End Sectors Size Type
/dev/mapper/mpatha-part1 2048 134215679 134213632 64G Linux filesystem
root@ubuntu-server:/#

I wondering about the gpt disk label type, since on a alternate 18.04.3 LPAR installation it's dos:
$ sudo fdisk -l /dev/mapper/mpatha
Disk /dev/mapper/mpatha: 64 GiB, 68719476736 bytes, 134217728 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x097ab3e7

Device Boot Start End Sectors Size Id Type
/dev/mapper/mpatha-part1 2048 134217727 134215680 64G 83 Linux

Well, the above output might be a bit outdated (and is maybe from an older version), I'll today redo an installation based o...

Read more...

Revision history for this message
Frank Heimes (fheimes) wrote :

Retried again with latest image (from today, time stamp Jan 8th) and have now more (current) details.
Please see attached (txt) for the steps in the installer, as well as the error log and curtin config snippets, and the entire /var/log in in the tgz.
Looks like the partition is gone ?! at some point in time ...

Revision history for this message
Frank Heimes (fheimes) wrote :
Chad Smith (chad.smith)
Changed in curtin:
status: Incomplete → New
Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
assignee: nobody → Canonical Server Team (canonical-server)
Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
status: New → In Progress
Changed in curtin:
status: New → In Progress
Ryan Harper (raharper)
summary: - subiquity on s390x fails with curtin: block-discover probing
+ curtin mishandles removing dm-multipath partition removal
description: updated
Revision history for this message
Frank Heimes (fheimes) wrote :

So after trying with "subiquity-channel=latest/edge/case" and refreshing,
I guess we hand at a slightly different step (but nearby).
/var/log and /var/crash is attached

Revision history for this message
Frank Heimes (fheimes) wrote :

The last comment (#7) shows that the issue discussed here in this particular bug is fixed, using latest/edge/case.
For a new issue (failing to clean disk / timeout) I think it's best that I open a separate ticket,
it's now in LP 1860778.

Dan Watkins (oddbloke)
Changed in curtin:
assignee: nobody → Ryan Harper (raharper)
Revision history for this message
Server Team CI bot (server-team-bot) wrote :

This bug is fixed with commit b475db2d to curtin on branch master.
To view that commit see the following URL:
https://git.launchpad.net/curtin/commit/?id=b475db2d

Changed in curtin:
status: In Progress → Fix Committed
Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
status: In Progress → Fix Committed
Revision history for this message
Frank Heimes (fheimes) wrote :

That seems to be fine now with 19.12.2+git47.e510f574 from latest/edge/case.

Revision history for this message
Frank Heimes (fheimes) wrote :

I can confirm that it is fixed with subiquity 20.03.
I installed focal on an zfcp multipath system where before a LUKS installation was running.
While selecting 'whole disk' the installer was able to wipe out the old scheme and replaced it by an all-in-on part setup.

Changed in ubuntu-z-systems:
status: Fix Committed → Fix Released
Revision history for this message
Joshua Powers (powersj) wrote :

Marking other tasks fix released so this doesn't show up in the report.

Changed in subiquity:
status: New → Fix Released
Changed in curtin:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.