[sru] fail to extend in-use fibre channel volume due to multipath-tools version

Bug #1888675 reported by zhaoleilc
20
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Ubuntu Cloud Archive
Fix Released
Undecided
Unassigned
Yoga
Fix Released
High
Unassigned
Zed
Fix Released
Undecided
Unassigned
os-brick
Fix Released
Medium
Unassigned
python-os-brick (Ubuntu)
Fix Released
Undecided
Unassigned
Jammy
Fix Released
High
Unassigned

Bug Description

[IMPACT]

The `multipathd reconfigure` has became a asynchronous command since the 0.6.1 version of multipath-tools. There is a difference as follows:
https://github.com/openSUSE/multipath-tools/blob/0.6.0/multipathd/main.c#L997
https://github.com/openSUSE/multipath-tools/blob/0.6.1/multipathd/main.c#L1135

That leads to a failure to extend in-use fibre channel volume, because `multipathd resize map` will output 'timeout' before `multipathd reconfigure` command finishes when `multipathd resize map` command will be executed as soon as `multipathd reconfigure` command is executed.

However, current code only considers the 'fail' result and so timeouts are not retried, but instead end up as failed, resulting in the FC volume not extending.

[TEST PLAN]

1. Guarantee that there are enough fibre channel volumes attached on the compute node so that `multipathd reconfigure` requires a huge amount of time.

2. Create a server on the compute node and the server name we call 'c1'.

3. Attach a volume whose name is 'v1' to the server 'c1' and the size of 'v1' is 4G.

$ openstack server add volume c1 v1

4. Extend the volume which is called 'v1' to 8G.

$ cinder --os-volume-api-version 3.42 extend v1 8

Check the size using the command of 'fdisk -l') and verify from the logs (see [OTHER INFO])

Without the fix, after the volume have been extended from 4G to 8G, the volume in the instance is still 4G.The fibre channel volume scsi_wwn has been changed to 8G.

With the fix, the new size will reflect immediately because if multipathd resize map returns a timeout, we keep re-trying the same multipathd resize map command for 120 seconds more, giving a chance for the (now asynchronous) 'multipathd reconfigure' to complete and hence letting multipath resize map run succcessfully when we retry.

[WHERE PROBLEMS COULD OCCUR]

I have verified the code is robust and I do not anticipate any issues. The patch is already merged to master, and at the time of writing this, has received 2 acks for the merge into yoga.(https://review.opendev.org/c/openstack/os-brick/+/888343). "multipathd resize map" will not return anything but 1 or 0, (see https://github.com/openSUSE/multipath-tools/blob/0.6.1/multipathd/cli_handlers.c#L702C1-L719C2 ) and if it returns 1, the ProcessExecutionError exception will indeed be raised, because this exception is raised for any return value from the executed command apart from a default of [0]. (https://docs.openstack.org/oslo.concurrency/latest/reference/processutils.html)

However if the timeout is for genuine reasons, and multipath timeout is set to a smaller value, say 30 seconds, we would be needlessly waiting 120 seconds instead of failing the operation at 30 seconds. Also, we could run into this same issue if the resize map operation takes even longer than 120 seconds but that is unlikely and I anticipate the multipathd timeout will also be set to a max of 120 seconds.

[OTHER INFO]

Logs WITHOUT the fix show
==============
2020-07-23 12:42:46.764 2713929 INFO nova.compute.manager [req-8defc1e3-c514-4673-a3b7-98b5343ba1cd 46ff538c684b4816b9454bfdc0e0ec97 4f20deff2 - 15396630649143a78afa714b3e4a0adb 15396630649143a78afa714b3e4a0adb] [instance: ddd3010f-fdf9-4e50-a363-edd02532e683] Cinder d-c206-4713-8381-1ee47d412f31; extending it to detect new size
2020-07-23 12:42:46.764 2713929 INFO nova.compute.manager [req-8defc1e3-c514-4673-a3b7-98b5343ba1cd 46ff538c684b4816b9454bfdc0e0ec97 4f20deff2 - 15396630649143a78afa714b3e4a0adb 15396630649143a78afa714b3e4a0adb] [instance: ddd3010f-fdf9-4e50-a363-edd02532e683] Cinder d-c206-4713-8381-1ee47d412f31; extending it to detect new size
2020-07-23 12:42:48.254 2713929 INFO os_brick.initiator.linuxscsi [req-8defc1e3-c514-4673-a3b7-98b5343ba1cd 46ff538c684b4816b9454bfdc0825c54e0f20deff2 - 15396630649143a78afa714b3e4a0adb 15396630649143a78afa714b3e4a0adb] Find Multipath device file for volume WWN 3600502196
2020-07-23 12:42:48.355 2713929 INFO os_brick.initiator.linuxscsi [req-8defc1e3-c514-4673-a3b7-98b5343ba1cd 46ff538c684b4816b9454bfdc0825c54e0f20deff2 - 15396630649143a78afa714b3e4a0adb 15396630649143a78afa714b3e4a0adb] mpath(/dev/disk/by-id/dm-uuid-mpath-360050767088current size 4294967296
2020-07-23 12:42:48.449 2713929 INFO os_brick.initiator.linuxscsi [req-8defc1e3-c514-4673-a3b7-98b5343ba1cd 46ff538c684b4816b9454bfdc0825c54e0f20deff2 - 15396630649143a78afa714b3e4a0adb 15396630649143a78afa714b3e4a0adb] mpath(/dev/disk/by-id/dm-uuid-mpath-360050767088new size 4294967296

The logs indicate that the current (i.e older) size (4294967296) is the same as the new size. (4294967296)

Note that the fibre channel volume scsi_wwn has been changed to the new size.

zhaoleilc (zhaoleilc)
description: updated
zhaoleilc (zhaoleilc)
description: updated
zhaoleilc (zhaoleilc)
summary: - fail to extend in-use fibre channel volume
+ fail to extend in-use fibre channel volume due to multipath-tools
summary: fail to extend in-use fibre channel volume due to multipath-tools
+ version
zhaoleilc (zhaoleilc)
description: updated
description: updated
zhaoleilc (zhaoleilc)
description: updated
description: updated
description: updated
zhaoleilc (zhaoleilc)
description: updated
Changed in os-brick:
importance: Undecided → Medium
tags: added: extend fc in multipath use
Revision history for this message
Lucian Petrut (petrutlucian94) wrote : Re: fail to extend in-use fibre channel volume due to multipath-tools version
Revision history for this message
Gorka Eguileor (gorka) wrote :
Download full text (17.2 KiB)

Complete error log:

May 06 08:36:45 localhost.localdomain nova-compute[171993]: DEBUG oslo_concurrency.processutils [-] Running cmd (subprocess): multipathd reconfigure {{(pid=172523) execute /usr/local/lib/python3.6/site-packages/oslo_concurrency/processutils.py:384}}
May 06 08:36:45 localhost.localdomain nova-compute[171993]: DEBUG oslo_concurrency.processutils [-] CMD "multipathd reconfigure" returned: 0 in 0.011s {{(pid=172523) execute /usr/local/lib/python3.6/site-packages/oslo_concurrency/processutils.py:423}}
May 06 08:36:45 localhost.localdomain nova-compute[171993]: DEBUG oslo.privsep.daemon [-] privsep: reply[7dabc28c-3a5c-413b-b742-1802fc9e041b]: (4, ('ok\n', '')) {{(pid=172523) _call_back /usr/local/lib/python3.6/site-packages/oslo_privsep/daemon.py:502}}
May 06 08:36:45 localhost.localdomain nova-compute[171993]: DEBUG oslo_concurrency.processutils [-] Running cmd (subprocess): blockdev --getsize64 /dev/disk/by-id/dm-uuid-mpath-360014057c304960fa9c481ebc9fd6aa9 {{(pid=172523) execute /usr/local/lib/python3.6/site-packages/oslo_concurrency/processutils.py:384}}
May 06 08:36:45 localhost.localdomain nova-compute[171993]: DEBUG oslo_concurrency.processutils [-] CMD "blockdev --getsize64 /dev/disk/by-id/dm-uuid-mpath-360014057c304960fa9c481ebc9fd6aa9" returned: 0 in 0.010s {{(pid=172523) execute /usr/local/lib/python3.6/site-packages/oslo_concurrency/processutils.py:423}}
May 06 08:36:45 localhost.localdomain nova-compute[171993]: DEBUG oslo.privsep.daemon [-] privsep: reply[3c6fdf2f-4170-4cb5-b579-711b7a3dd05a]: (4, ('1073741824\n', '')) {{(pid=172523) _call_back /usr/local/lib/python3.6/site-packages/oslo_privsep/daemon.py:502}}
May 06 08:36:45 localhost.localdomain nova-compute[171993]: INFO os_brick.initiator.linuxscsi [req-6a4238e0-1f53-4119-bf4d-d1225822317c req-4cc646ba-7a4d-4112-9ccd-c0a28f56c624 service nova] mpath(/dev/disk/by-id/dm-uuid-mpath-360014057c304960fa9c481ebc9fd6aa9) current size 1073741824
May 06 08:36:45 localhost.localdomain nova-compute[171993]: DEBUG oslo_concurrency.processutils [-] Running cmd (subprocess): multipathd resize map 360014057c304960fa9c481ebc9fd6aa9 {{(pid=172523) execute /usr/local/lib/python3.6/site-packages/oslo_concurrency/processutils.py:384}}
May 06 08:36:46 localhost.localdomain nova-compute[171993]: DEBUG oslo_service.periodic_task [None req-b354add0-2695-4892-825b-953c48798c07 None None] Running periodic task ComputeManager._poll_rebooting_instances {{(pid=171993) run_periodic_tasks /usr/local/lib/python3.6/site-packages/oslo_service/periodic_task.py:211}}
May 06 08:36:49 localhost.localdomain nova-compute[171993]: DEBUG oslo_concurrency.processutils [-] CMD "multipathd resize map 360014057c304960fa9c481ebc9fd6aa9" returned: 1 in 4.116s {{(pid=172523) execute /usr/local/lib/python3.6/site-packages/oslo_concurrency/processutils.py:423}}
May 06 08:36:49 localhost.localdomain nova-compute[171993]: DEBUG oslo_concurrency.processutils [-] 'multipathd resize map 360014057c304960fa9c481ebc9fd6aa9' failed. Not Retrying. {{(pid=172523) execute /usr/local/lib/python3.6/site-packages/oslo_concurrency/processutils.py:474}}
May 06 08:36:49 localhost.localdomain nova-compute[171993]: DEBUG oslo.pr...

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to os-brick (master)

Reviewed: https://review.opendev.org/c/openstack/os-brick/+/762776
Committed: https://opendev.org/openstack/os-brick/commit/557f38677a07386807b8d284a20b0ecaa61490f9
Submitter: "Zuul (22348)"
Branch: master

commit 557f38677a07386807b8d284a20b0ecaa61490f9
Author: zhaoleilc <15247232416@163.com>
Date: Mon Nov 16 14:08:59 2020 +0800

    Avoid volume extension errors caused by multipath-tools version

    `multipathd reconfigure` is an asynchronous command as of
    multipath-tools 0.6.1 [1][2], potentially even before that [3].

    Extending in-use iSCSI or FC volumes can fail because of that
    as `multipathd resize map` will output "timeout" while the
    "multipathd reconfigure" operation is still in progress.

    This commit will ensure that multipathd errors are handled
    accordingly, retrying in case of timeouts for up to 2 minutes.

    [1] https://github.com/openSUSE/multipath-tools/blob/0.6.0/multipathd/main.c#L997
    [2] https://github.com/openSUSE/multipath-tools/blob/0.6.1/multipathd/main.c#L1135
    [3] https://github.com/opensvc/multipath-tools/blob/b21c567961f518810a1ac3b209c8db45f6dbac2c/multipathd/cli_handlers.c#L847-L851

    Change-Id: I66e866700728eee7160f48455258c3974ada55bf
    Closes-Bug: #1888675

Changed in os-brick:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/os-brick 6.0.0

This issue was fixed in the openstack/os-brick 6.0.0 release.

Revision history for this message
nikhil kshirsagar (nkshirsagar) wrote (last edit ): Re: fail to extend in-use fibre channel volume due to multipath-tools version

Uploading debdiff for yoga cloud archive (see https://review.opendev.org/c/openstack/os-brick/+/888343)

Revision history for this message
nikhil kshirsagar (nkshirsagar) wrote (last edit ):
tags: added: patch
Changed in cloud-archive:
status: New → Fix Released
Changed in python-os-brick (Ubuntu):
status: New → Fix Released
Revision history for this message
Corey Bryant (corey.bryant) wrote :

Thanks Nikhil. Can you add Ubuntu SRU details according to https://wiki.ubuntu.com/StableReleaseUpdates ? ie. [Impact], [Test Plan], and [Where problems could occur].

Changed in python-os-brick (Ubuntu Jammy):
status: New → Triaged
importance: Undecided → High
summary: - fail to extend in-use fibre channel volume due to multipath-tools
+ [SRU] fail to extend in-use fibre channel volume due to multipath-tools
version
summary: - [SRU] fail to extend in-use fibre channel volume due to multipath-tools
+ [sru] fail to extend in-use fibre channel volume due to multipath-tools
version
description: updated
Revision history for this message
nikhil kshirsagar (nkshirsagar) wrote :

I have added [sru] to the bug title and modified the description as per the SRU template.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

Thanks Nikhil. A new package version has been uploaded to the jammy unapproved queue.

Revision history for this message
Timo Aaltonen (tjaalton) wrote : Please test proposed package

Hello zhaoleilc, or anyone else affected,

Accepted python-os-brick into jammy-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/python-os-brick/5.2.2-0ubuntu1.3 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-jammy to verification-done-jammy. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-jammy. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in python-os-brick (Ubuntu Jammy):
status: Triaged → Fix Committed
tags: added: verification-needed verification-needed-jammy
Revision history for this message
Ubuntu SRU Bot (ubuntu-sru-bot) wrote : Autopkgtest regression report (python-os-brick/5.2.2-0ubuntu1.3)

All autopkgtests for the newly accepted python-os-brick (5.2.2-0ubuntu1.3) for jammy have finished running.
The following regressions have been reported in tests triggered by the package:

cinder/2:20.3.0-0ubuntu1 (amd64)

Please visit the excuses page listed below and investigate the failures, proceeding afterwards as per the StableReleaseUpdates policy regarding autopkgtest regressions [1].

https://people.canonical.com/~ubuntu-archive/proposed-migration/jammy/update_excuses.html#python-os-brick

[1] https://wiki.ubuntu.com/StableReleaseUpdates#Autopkgtest_Regressions

Thank you!

Revision history for this message
nikhil kshirsagar (nkshirsagar) wrote (last edit ):

I've restarted the autopkgtest for amd64 to check once more.

seems the autopkgtest failure is due to,

571s Removing autopkgtest-satdep (0) ...
572s autopkgtest [15:29:08]: test cinder-daemons: [-----------------------
573s /usr/lib/python3/dist-packages/cinder/db/sqlalchemy/models.py:152: SAWarning: implicitly coercing SELECT object to scalar subquery; please use the .scalar_subquery() method to produce a scalar subquery.
573s last_heartbeat = column_property(
573s /usr/lib/python3/dist-packages/cinder/db/sqlalchemy/models.py:160: SAWarning: implicitly coercing SELECT object to scalar subquery; please use the .scalar_subquery() method to produce a scalar subquery.
573s num_hosts = column_property(
573s /usr/lib/python3/dist-packages/cinder/db/sqlalchemy/models.py:169: SAWarning: implicitly coercing SELECT object to scalar subquery; please use the .scalar_subquery() method to produce a scalar subquery.
573s num_down_hosts = column_property(
573s 2023-09-15 15:29:09.430 12574 INFO cinder.db.migration [-] Applying migration(s)
573s 2023-09-15 15:29:09.432 12574 INFO alembic.runtime.migration [-] Context impl MySQLImpl.
573s 2023-09-15 15:29:09.432 12574 INFO alembic.runtime.migration [-] Will assume non-transactional DDL.
573s 2023-09-15 15:29:09.454 12574 INFO alembic.runtime.migration [-] Running upgrade -> 921e1a36b076, Initial migration.
575s 2023-09-15 15:29:11.505 12574 INFO cinder.db.migration [-] Migration(s) applied
575s Job for cinder-scheduler.service failed.
575s See "systemctl status cinder-scheduler.service" and "journalctl -xeu cinder-scheduler.service" for details.
576s autopkgtest [15:29:12]: test cinder-daemons: -----------------------]
576s autopkgtest [15:29:12]: test cinder-daemons: - - - - - - - - - - results - - - - - - - - - -
576s cinder-daemons FAIL non-zero exit status 1

This seems fixed through https://review.opendev.org/c/openstack/cinder/+/807083 and https://bugs.launchpad.net/cinder/+bug/1941068

Revision history for this message
nikhil kshirsagar (nkshirsagar) wrote :

I have tested this fix using gdb breakpoints to simulate the "timeout" from multipathd map resize.

I've verified the fix code is called correctly in this situation using the proposed package, and that the resize command is resent.

After multipathd map resize stops timing out, we see the new size reflecting in both the underlying device as well as the mpath device.

I am marking the verification flags accordingly, and also attaching the testing details file to the bug for reference.

Revision history for this message
nikhil kshirsagar (nkshirsagar) wrote :

Jammy proposed package testing details attached.

(our customer has already verified the focal proposed package which we had sent him as a test package - That PPA location is https://launchpad.net/~canonical-support-eng/+archive/ubuntu/fdj-sf363507 for reference)

tags: added: verification-done verification-done-jammy
removed: verification-needed verification-needed-jammy
Revision history for this message
Corey Bryant (corey.bryant) wrote : Please test proposed package

Hello zhaoleilc, or anyone else affected,

Accepted python-os-brick into yoga-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:yoga-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-yoga-needed to verification-yoga-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-yoga-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-yoga-needed
Revision history for this message
nikhil kshirsagar (nkshirsagar) wrote :

I have tested the cloud archive python-os-brick package, attaching testing details. The patch works as expected.

I will mark the flag to verification-yoga-done.

tags: added: verification-yoga-done
removed: verification-yoga-needed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package python-os-brick - 5.2.2-0ubuntu1.3

---------------
python-os-brick (5.2.2-0ubuntu1.3) jammy; urgency=medium

  * d/p/avoid-volume-extension-errors.patch: Avoid volume extension
    errors caused by multipath-tools version (LP: #1888675).

 -- Nikhil Kshirsagar <email address hidden> Mon, 11 Sep 2023 09:34:29 +0000

Changed in python-os-brick (Ubuntu Jammy):
status: Fix Committed → Fix Released
Revision history for this message
Chris Halse Rogers (raof) wrote : Update Released

The verification of the Stable Release Update for python-os-brick has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

The verification of the Stable Release Update for python-os-brick has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

This bug was fixed in the package python-os-brick - 5.2.2-0ubuntu1.3~cloud0
---------------

 python-os-brick (5.2.2-0ubuntu1.3~cloud0) focal-yoga; urgency=medium
 .
   * New update for the Ubuntu Cloud Archive.
 .
 python-os-brick (5.2.2-0ubuntu1.3) jammy; urgency=medium
 .
   * d/p/avoid-volume-extension-errors.patch: Avoid volume extension
     errors caused by multipath-tools version (LP: #1888675).

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to os-brick (stable/yoga)

Reviewed: https://review.opendev.org/c/openstack/os-brick/+/888343
Committed: https://opendev.org/openstack/os-brick/commit/994cfb3f38b7481c8bf06a615bd9959f93bbd142
Submitter: "Zuul (22348)"
Branch: stable/yoga

commit 994cfb3f38b7481c8bf06a615bd9959f93bbd142
Author: zhaoleilc <15247232416@163.com>
Date: Mon Nov 16 14:08:59 2020 +0800

    Avoid volume extension errors caused by multipath-tools version

    `multipathd reconfigure` is an asynchronous command as of
    multipath-tools 0.6.1 [1][2], potentially even before that [3].

    Extending in-use iSCSI or FC volumes can fail because of that
    as `multipathd resize map` will output "timeout" while the
    "multipathd reconfigure" operation is still in progress.

    This commit will ensure that multipathd errors are handled
    accordingly, retrying in case of timeouts for up to 2 minutes.

    [1] https://github.com/openSUSE/multipath-tools/blob/0.6.0/multipathd/main.c#L997
    [2] https://github.com/openSUSE/multipath-tools/blob/0.6.1/multipathd/main.c#L1135
    [3] https://github.com/opensvc/multipath-tools/blob/b21c567961f518810a1ac3b209c8db45f6dbac2c/multipathd/cli_handlers.c#L847-L851

    Change-Id: I66e866700728eee7160f48455258c3974ada55bf
    Closes-Bug: #1888675
    (cherry picked from commit 557f38677a07386807b8d284a20b0ecaa61490f9)

tags: added: in-stable-yoga
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/os-brick yoga-eom

This issue was fixed in the openstack/os-brick yoga-eom release.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.