Curtin fails to deploy on S390X DPM

Bug #1813228 reported by Lee Trager
This bug affects 2 people
Affects                   Status         Importance   Assigned to   Milestone
Ubuntu on IBM z Systems   Fix Released   High         Unassigned
curtin                    Fix Released   High         Unassigned

Bug Description

When I try to deploy Ubuntu with MAAS on S390X DPM, the installation fails when running vgchange --activate=y. This may be due to the system using multipath and may be related to LP:1813227.

Tags: s390x

Lee Trager (ltrager) wrote :
Ryan Harper (raharper) wrote :

Curtin generally expects vgchange -ay to exit with 0; a non-zero exit indicates a failure.

In this install[1], multipath is installed and when it is present, vgchange detects that
the lvm devices it found were created using device names that were not multipath names.

It is not yet clear whether this is fatal to curtin. Curtin probes the LVM subsystem
to discover any LVM devices present; at this early stage, it is possible that *none* of
the devices discovered will be related to the storage config sent to curtin, and curtin
should then ignore all of these devices.

This argues for curtin to ignore the error return code, and possibly not even log
the warning/error from vgchange -ay.

On the other hand, if a node is able to detect multiple paths to devices and potentially
discovers a misconfiguration, it may not be prudent to continue if any of the devices in the storage config are part of the detected LVM devices.
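
The trade-off described above can be sketched in Python. This is only an illustration and not curtin's actual code: `run_tolerant` and its `allowed_rcs` parameter are hypothetical names, and treating exit code 5 from vgchange as tolerable is an assumption taken from the failure log in this report.

```python
import subprocess


def run_tolerant(cmd, allowed_rcs=(0,)):
    """Run cmd and return (rc, stdout, stderr).

    Raises CalledProcessError only when the exit code is not in allowed_rcs.
    Hypothetical helper for illustration; not part of curtin.
    """
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                          universal_newlines=True)
    if proc.returncode not in allowed_rcs:
        raise subprocess.CalledProcessError(
            proc.returncode, cmd, output=proc.stdout, stderr=proc.stderr)
    return proc.returncode, proc.stdout, proc.stderr


# Assumed usage: tolerate exit code 5 during the early LVM probe and keep
# the warnings for later inspection instead of aborting the install.
# rc, out, err = run_tolerant(['vgchange', '--activate=y'], allowed_rcs=(0, 5))
```

Keeping stderr in the return value leaves room for the stricter option: a caller could still abort if the devices named in the warnings overlap with the storage config.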

[1]
Stdout: Unexpected error while running command.
        Command: ['vgchange', '--activate=y']
        Exit code: 5
        Reason: -
        Stdout: 0 logical volume(s) in volume group "zkvm4" now active
                  0 logical volume(s) in volume group "zkvm1" now active
                  0 logical volume(s) in volume group "Z_APPL_ROOT_lvm_473D86E3" now active
                  0 logical volume(s) in volume group "Z_APPL_ROOT_lvm_BBD0E8D4" now active
                  0 logical volume(s) in volume group "Z_APPL_ROOT_lvm_382F5B8E" now active

        Stderr: WARNING: Not using lvmetad because duplicate PVs were found.
                  WARNING: Use multipath or vgimportclone to resolve duplicate PVs?
                  WARNING: After duplicates are resolved, run "pvscan --cache" to enable lvmetad.
                  WARNING: PV tIvQUe-wsuE-RWlZ-Wmdo-uzws-qL5H-Y9zW5l on /dev/sdb2 was already found on /dev/sdfe2.
                  WARNING: PV 3zsO6H-QCQq-eyJZ-8aLc-TGU1-q91J-LFWsKs on /dev/sde2 was already found on /dev/sdfh2.
                  WARNING: PV VEyfqe-Vln5-zMp5-wCeK-hkQN-u8Wp-LkPTMF on /dev/sdba2 was already found on /dev/sdhd2.
                  WARNING: PV H7Ucbo-TJSw-tS4y-AwKT-xdql-NiMp-YvZC1S on /dev/sdfj2 was already found on /dev/sddi2.
                  WARNING: PV H7Ucbo-TJSw-tS4y-AwKT-xdql-NiMp-YvZC1S on /dev/sdg2 was already found on /dev/sddi2.
                  WARNING: PV tIvQUe-wsuE-RWlZ-Wmdo-uzws-qL5H-Y9zW5l on /dev/sdbc2 was already found on /dev/sdfe2.
                  WARNING: PV 3zsO6H-QCQq-eyJZ-8aLc-TGU1-q91J-LFWsKs on /dev/sdbf2 was already found on /dev/sdfh2.
                  WARNING: PV VEyfqe-Vln5-zMp5-wCeK-hkQN-u8Wp-LkPTMF on /dev/sddb2 was already found on /dev/sdhd2.
                  WARNING: PV 6ZwSAD-cFkP-Qjeg-WIDt-g3Hc-m8Tt-i3ex81 on /dev/sddr2 was already found on /dev/sdbq2.
                  WARNING: PV H7Ucbo-TJSw-tS4y-AwKT-xdql-NiMp-YvZC1S on /dev/sdbh2 was already found on /dev/sddi2.
                  WARNING: PV tIvQUe-wsuE-RWlZ-Wmdo-uzws-qL5H-Y9zW5l on /dev/sddd2 was already found on /dev/sdfe2.
                  WARNING: PV 3zsO6H-QCQq-eyJZ-8aLc-TGU1-q91J-LFWsKs on /dev/sddg2 was already found on /dev/sdfh2.
                  WARNING: PV VEyfq...


Changed in curtin:
importance: Undecided → High
status: New → Confirmed
Lee Trager (ltrager) wrote :

Just to clarify: while the system is using multipath, the MAAS ephemeral environment has neither the multipath kernel module nor the multipath userland tools.

Ryan Harper (raharper) wrote :

That's fair. It's the duplicate paths carrying the same LVM_UUID that matter, so we'll see this with or without the multipath daemon/module anyhow.

Installing and enabling the daemon would resolve the issue if the LVMs were created using MPATH block paths; if they weren't (i.e. they used /dev/sdxxx), then we'll still see this error.

And at least on s390x, where it's very possible to have redundant FC adapters pointing to the same LUNs, it is almost certain that we'll see these messages.

What remains is a discussion of what curtin can or should do about them.
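
To see how many paths share each PV, the duplicate-PV warnings from the vgchange stderr earlier in this report can be grouped by UUID. The following is a rough sketch, assuming the warning format stays exactly as logged; `group_duplicate_pvs` is a hypothetical name and not an existing curtin or LVM function.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   WARNING: PV <uuid> on /dev/sdb2 was already found on /dev/sdfe2.
DUP_PV_RE = re.compile(
    r'WARNING: PV (?P<uuid>\S+) on (?P<dup>\S+) '
    r'was already found on (?P<first>\S+?)\.?$')


def group_duplicate_pvs(stderr_text):
    """Map each PV UUID to the sorted list of device paths reporting it."""
    paths = defaultdict(set)
    for line in stderr_text.splitlines():
        match = DUP_PV_RE.search(line.strip())
        if match:
            paths[match.group('uuid')].update(
                (match.group('dup'), match.group('first')))
    return {uuid: sorted(devs) for uuid, devs in paths.items()}
```

A UUID that maps to more than one path is a candidate for a multipath device, which is the situation expected with redundant FC adapters.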

Ryan Harper (raharper) wrote :

I think we'd like to have the commissioning stage ensure that multipath-tools is installed, so that multipathd will have scanned and assembled any paths. This will result in persistent storage paths to the devices, and will allow MAAS to reference a particular LUN via the serial of the LUN and path. For example:

/dev/disk/by-id/dm-uuid-mpath-36005076306ffd6b60000000000002406 -> ../../dm-2

This has an ID_WWN_WITH_EXTENSION value of "36005076306ffd6b60000000000002406",
which can be used in the type: disk wwn field.
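
As a rough illustration of the mapping described above, the WWN can be read off the by-id link name. This sketch assumes, as in the example, that the suffix after `dm-uuid-mpath-` equals the ID_WWN_WITH_EXTENSION value; the function names and the `id` key are hypothetical, and only the `type: disk` and `wwn` fields come from this thread.

```python
import os

MPATH_PREFIX = 'dm-uuid-mpath-'


def wwn_from_by_id(link_path):
    """Return the WWN embedded in a dm-uuid-mpath-* by-id link, or None."""
    name = os.path.basename(link_path)
    if not name.startswith(MPATH_PREFIX):
        return None
    return name[len(MPATH_PREFIX):]


def disk_entry(link_path, disk_id):
    """Build a storage-config style dict for a multipath disk (sketch only)."""
    wwn = wwn_from_by_id(link_path)
    if wwn is None:
        raise ValueError('not a multipath by-id link: %s' % link_path)
    return {'id': disk_id, 'type': 'disk', 'wwn': wwn}
```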

Frank Heimes (fheimes)
tags: added: s390x
Christian Ehrhardt  (paelzer) wrote :

FYI - xnox just added multipath-tools to the seed, so sooner or later this will be available already.
Nevertheless, ensuring it gets installed (we can apt-get there, right?), especially for older images, seems to be the right way.

Ryan Harper (raharper) wrote : Re: [Bug 1813228] Re: Curtin fails to deploy on S390X DPM

Yes, let's add a curtin task to add multipath-tools to its dep list.

Lee Trager (ltrager) wrote :

Adding multipath changes the error message but Curtin still fails.

curtin: Installation started. (18.2-19-g36351dea-0ubuntu1)
third party drivers not installed or necessary.
device-mapper: create ioctl on 36005076307ffc6a60000000000001330p1 part1-mpath-36005076307ffc6a60000000000001330 failed: Device or resource busy
An error occured handling 'sdaw-part1': OSError - could not get path to dev from kname: dm-48p1
could not get path to dev from kname: dm-48p1
curtin: Installation failed with exception: Unexpected error while running command.
Command: ['curtin', 'block-meta', 'custom']
Exit code: 3
Reason: -
Stdout: device-mapper: create ioctl on 36005076307ffc6a60000000000001330p1 part1-mpath-36005076307ffc6a60000000000001330 failed: Device or resource busy
        An error occured handling 'sdaw-part1': OSError - could not get path to dev from kname: dm-48p1
        could not get path to dev from kname: dm-48p1

Stderr: ''

Ryan Harper (raharper) wrote :

Can you enable curtin verbose?

I think I see what's going on, but I'd like to confirm with the verbose
logs.

On Thu, Apr 4, 2019 at 11:35 PM Lee Trager <email address hidden> wrote:

> ** Attachment added: "curtin-logs.tar"
> https://bugs.launchpad.net/curtin/+bug/1813228/+attachment/5252985/+files/curtin-logs.tar

Lee Trager (ltrager) wrote :

Curtin logs with verbose enabled.

Ryan Harper (raharper) wrote :

The curtin-logs.tar.xz appears invalid (only 12.5kb vs. 140kb previously).

Lee Trager (ltrager) wrote :

I xz compressed the logs this time since the uncompressed tar was a bit bigger than before. The tar itself works fine; xz is very efficient :)

Ryan Harper (raharper) wrote :

Sorry, I see. The archive opener didn't notice it.

Lee Trager (ltrager) wrote :

Curtin still fails to install with the attached branch

Ryan Harper (raharper) wrote :

Thanks for the logs. Looks like we'll need some work on clear-holders to deal with these sorts of devices.

Would you be able to run the latest probert on these systems?

https://github.com/CanonicalLtd/probert

Alternatively, if you could import my ssh key (lp:raharper), I can poke at this faster.

Lee Trager (ltrager) wrote :

Attached is the output from probert on the system Curtin is failing on.

The Z13 I am using is on IBM's network and requires IBM VPN access. Frank may be able to set you up with an account.

Ryan Harper (raharper) wrote :

Thanks.

Mpath devices create partitions as additional dm devices. Curtin needs to
untangle those values.
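
One way to picture the untangling is to group device-mapper names so that each multipath map is listed with the dm devices that are its partitions. This is a hypothetical sketch and not curtin's implementation: it assumes partitions are named `<map>-partN`, and `group_mapper_names` is an illustrative name.

```python
import re

# Assumed naming: partitions of map "mpatha" appear as "mpatha-part1", etc.
PART_RE = re.compile(r'^(?P<parent>.+)-part(?P<num>\d+)$')


def group_mapper_names(names):
    """Return {map_name: [partition_name, ...]} from /dev/mapper entry names."""
    names = set(names)
    groups = {}
    # Sorting puts a parent name before its "-partN" children.
    for name in sorted(names):
        match = PART_RE.match(name)
        if match and match.group('parent') in names:
            groups.setdefault(match.group('parent'), []).append(name)
        else:
            groups.setdefault(name, [])
    return groups
```

Entries that are not multipath maps (for example LVM volumes or the `control` node) would simply appear as keys with no partitions, so a caller would still need to filter them.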

On Mon, Apr 22, 2019 at 8:20 PM Lee Trager <email address hidden> wrote:

> ** Attachment added: "probert.json"
> https://bugs.launchpad.net/curtin/+bug/1813228/+attachment/5258060/+files/probert.json

Lee Trager (ltrager) wrote :

Curtin is still failing to deploy

Ryan Harper (raharper) wrote :

If you're on the box now, could you run:

% ls -al /dev/mapper/
% sudo multipath -l
% sudo dmsetup ls
% udevadm info --query=all /sys/class/block/sdcy1

Lee Trager (ltrager) wrote :

root@maas-node-3:~# ls -alh /dev/mapper
total 0
drwxr-xr-x 2 root root 2.1K Apr 27 00:07 .
drwxr-xr-x 20 root root 17K Apr 27 00:07 ..
lrwxrwxrwx 1 root root 8 Apr 27 00:10 Z_APPL_ROOT_lvm_382F5B8E-root_pool -> ../dm-94
lrwxrwxrwx 1 root root 8 Apr 27 00:10 Z_APPL_ROOT_lvm_473D86E3-root_pool -> ../dm-95
lrwxrwxrwx 1 root root 8 Apr 27 00:10 Z_APPL_ROOT_lvm_BBD0E8D4-root_pool -> ../dm-96
crw------- 1 root root 10, 236 Apr 27 00:10 control
lrwxrwxrwx 1 root root 7 Apr 27 00:10 mpatha -> ../dm-0
lrwxrwxrwx 1 root root 8 Apr 27 00:10 mpatha-part1 -> ../dm-58
lrwxrwxrwx 1 root root 8 Apr 27 00:10 mpatha-part2 -> ../dm-59
lrwxrwxrwx 1 root root 8 Apr 27 00:10 mpatha-part5 -> ../dm-60
lrwxrwxrwx 1 root root 8 Apr 27 00:10 mpathaa -> ../dm-26
lrwxrwxrwx 1 root root 8 Apr 27 00:10 mpathab -> ../dm-27
lrwxrwxrwx 1 root root 8 Apr 27 00:10 mpathac -> ../dm-28
lrwxrwxrwx 1 root root 8 Apr 27 00:10 mpathad -> ../dm-29
lrwxrwxrwx 1 root root 8 Apr 27 00:10 mpathae -> ../dm-30
lrwxrwxrwx 1 root root 8 Apr 27 00:10 mpathaf -> ../dm-31
lrwxrwxrwx 1 root root 8 Apr 27 00:10 mpathag -> ../dm-32
lrwxrwxrwx 1 root root 8 Apr 27 00:10 mpathah -> ../dm-33
lrwxrwxrwx 1 root root 8 Apr 27 00:10 mpathai -> ../dm-34
lrwxrwxrwx 1 root root 8 Apr 27 00:10 mpathaj -> ../dm-35
lrwxrwxrwx 1 root root 8 Apr 27 00:10 mpathak -> ../dm-36
lrwxrwxrwx 1 root root 8 Apr 27 00:10 mpathal -> ../dm-37
lrwxrwxrwx 1 root root 8 Apr 27 00:10 mpatham -> ../dm-38
lrwxrwxrwx 1 root root 8 Apr 27 00:10 mpathan -> ../dm-39
lrwxrwxrwx 1 root root 8 Apr 27 00:10 mpathao -> ../dm-40
lrwxrwxrwx 1 root root 8 Apr 27 00:10 mpathap -> ../dm-41
lrwxrwxrwx 1 root root 8 Apr 27 00:10 mpathaq -> ../dm-42
lrwxrwxrwx 1 root root 8 Apr 27 00:10 mpathar -> ../dm-43
lrwxrwxrwx 1 root root 8 Apr 27 00:10 mpathas -> ../dm-44
lrwxrwxrwx 1 root root 8 Apr 27 00:10 mpathat -> ../dm-45
lrwxrwxrwx 1 root root 8 Apr 27 00:10 mpathat-part1 -> ../dm-86
lrwxrwxrwx 1 root root 8 Apr 27 00:10 mpathau -> ../dm-46
lrwxrwxrwx 1 root root 8 Apr 27 00:10 mpathau-part1 -> ../dm-87
lrwxrwxrwx 1 root root 8 Apr 27 00:10 mpathav -> ../dm-47
lrwxrwxrwx 1 root root 8 Apr 27 00:10 mpathav-part1 -> ../dm-88
lrwxrwxrwx 1 root root 8 Apr 27 00:10 mpathaw -> ../dm-48
lrwxrwxrwx 1 root root 8 Apr 27 00:10 mpathax -> ../dm-49
lrwxrwxrwx 1 root root 8 Apr 27 00:10 mpathay -> ../dm-50
lrwxrwxrwx 1 root root 8 Apr 27 00:10 mpathay-part1 -> ../dm-89
lrwxrwxrwx 1 root root 8 Apr 27 00:10 mpathaz -> ../dm-51
lrwxrwxrwx 1 root root 7 Apr 27 00:10 mpathb -> ../dm-1
lrwxrwxrwx 1 root root 8 Apr 27 00:10 mpathb-part1 -> ../dm-54
lrwxrwxrwx 1 root root 8 Apr 27 00:10 mpathb-part2 -> ../dm-55
lrwxrwxrwx 1 root root 8 Apr 27 00:10 mpathba -> ../dm-52
lrwxrwxrwx 1 root root 8 Apr 27 00:10 mpathba-part1 -> ../dm-90
lrwxrwxrwx 1 root root 8 Apr 27 00:10 mpathba-part2 -> ../dm-91
lrwxrwxrwx 1 root root 8 Apr 27 00:10 mpathbb -> ../dm-53
lrwxrw...

Ryan Harper (raharper) wrote :

# udevadm info --query=all /sys/class/block/sdcy1
Unknown device "/sys/class/block/sdcy1": No such device

Ugh, I typo'd; I wanted to query the partition that was still around:

% udevadm info --query=all /sys/class/block/sdcy/sdcy1

I did notice that dmsetup remove is somewhat racy, so I'll likely
switch to combining a settle along with a util.wait_for_removal() which
watches a path in sysfs until it's gone.

Thanks,
Ryan
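
The wait described above could look roughly like the following polling loop. This is a hypothetical sketch: the thread only names `util.wait_for_removal()`, so the signature, the timeout, and the polling interval here are assumptions, and in practice it would follow a `udevadm settle`.

```python
import os
import time


def wait_for_removal(path, timeout=10.0, interval=0.1):
    """Poll until path no longer exists; raise OSError on timeout.

    Hypothetical sketch; the signature of curtin's own helper may differ.
    """
    deadline = time.monotonic() + timeout
    while os.path.exists(path):
        if time.monotonic() >= deadline:
            raise OSError('timed out waiting for removal of %s' % path)
        time.sleep(interval)
```

Polling a sysfs path, rather than trusting the exit status of `dmsetup remove`, is what covers the race mentioned in the comment.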

Lee Trager (ltrager) wrote :

/sys/class/block/sdcy/sdcy1 actually doesn't exist, although this is in rescue mode and not directly after the failed deployment.

# ls /sys/class/block/sdcy/
alignment_offset dev events ext_range inflight power removable slaves trace
bdi device events_async hidden integrity queue ro stat uevent
capability discard_alignment events_poll_msecs holders mq range size subsystem

Ryan Harper (raharper) wrote :

Ok, I think we did the right thing, but as I said, we need to wait for the
symlink/node to be removed after the dmsetup remove.

Ryan Harper (raharper) wrote :

OK, with much debugging, I'm going to push the latest bits to address the multipath issue; but this isn't all that's needed and we need some discussion to move forward.

1) The storage config points to a single path, and curtin has added code to detect and determine the main mapping device; we dmsetup remove the partitions and multipath -f the mpath device to clear things before deploying. Ultimately, maas should collect all of the path and mapping info during commissioning and support multipath; for now I suppose this is enough.

2) MAAS has a preseed to do some chreipl bits; this likely should be pushed into curtin, though we need discussion to know when to do this versus allowing other users (like subiquity) to do it.

3) After install, the first boot is very slow; we don't yet know why.

4) The network configuration rendered into the target from MAAS does not match the hardware, due to the known firmware issue of shifting MAC values on the interfaces. The result is that the network config does not match any of the interfaces and none is applied.

One approach here is to have MAAS send a v2 config that matches not on MAC but on path, since the zdev device paths are stable.

ethernets:
   encf:
     match:
        path: ccwgroup-0.0.000f
     set-name: encf
     dhcp4: yes
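
Returning to item 1 above, the clearing order (partitions first, then the multipath map) can be sketched as a function that only builds the command lists. The function name and the sample device names in the usage comment are hypothetical; `dmsetup remove` and `multipath -f` are the commands named in the comment, and nothing here is taken from the actual commit.

```python
def mpath_teardown_commands(map_name, partition_names):
    """Return the commands to clear a multipath map, partitions first."""
    # Each partition is its own dm device and must go before the parent map.
    cmds = [['dmsetup', 'remove', part] for part in partition_names]
    # Flush the multipath map itself once nothing holds it.
    cmds.append(['multipath', '-f', map_name])
    return cmds


# Assumed usage with illustrative names:
# for cmd in mpath_teardown_commands('mpatha', ['mpatha-part1']):
#     print(' '.join(cmd))
```

Building the list separately from running it keeps the ordering easy to inspect without touching any devices.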

Changed in curtin:
status: Confirmed → In Progress
Server Team CI bot (server-team-bot) wrote :

This bug is fixed with commit 6e82b6f7 to curtin on branch master.
To view that commit see the following URL:
https://git.launchpad.net/curtin/commit/?id=6e82b6f7

Changed in curtin:
status: In Progress → Fix Committed
Dan Watkins (oddbloke) wrote : Fixed in curtin version 19.1.

This bug is believed to be fixed in curtin in version 19.1. If this is still a problem for you, please make a comment and set the state back to New

Thank you.

Changed in curtin:
status: Fix Committed → Fix Released
Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
importance: Undecided → High
status: New → Fix Released