[FFe][20.10 FEAT] KVM: Enable Channel Path Handling for vfio-ccw - qemu part

Bug #1887930 reported by bugproxy
16
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Ubuntu on IBM z Systems
Fix Released
High
Skipper Bug Screeners
qemu (Ubuntu)
Fix Released
Undecided
Canonical Server

Bug Description

[Feature Freeze Exception]

1. Feature:

s390x is all about resilency, and one of the great features it provides is that on FICON stroage that even is mostly handled by the underlying layers.
Think of it as multipathd not being in userspace/kernel but firmware.
Nowadays that can not yet all be transparently handled for KVM guests.
The feature proposed here is one that enables this for dasd's passed through via vfio-ccw.

Without support for channel path handling changes in the host's channel path for underlying CCW devices may go unnoticed by the guest. This could cause erroneous operation and lead to a crashing guest. Therefore it is necessary to ensure proper handling of channel paths in vfio-ccw, like passing through channel path operations and notification of channel path changes.

This is one of the more visible drawbacks of KVM to the alternatives of running in LPAR or as a z/VM guest. An example could be channel path changes caused by the FICON fabric, or simply offlining and subsequently onlining. This feature is critical to use vfio-ccw in production more reliably.

2. Changes
https://git.qemu.org/?p=qemu.git;a=commit;h=f76b348ec7
https://git.qemu.org/?p=qemu.git;a=commit;h=2a3b9cbaa7
https://git.qemu.org/?p=qemu.git;a=commit;h=46ea3841ed
https://git.qemu.org/?p=qemu.git;a=commit;h=690e29b911
https://git.qemu.org/?p=qemu.git;a=commit;h=f6dde1b012
https://git.qemu.org/?p=qemu.git;a=commit;h=f030532f2a

All of them apply cleanly on top of qemu 5.0 as it is in Groovy right now.

3. Regression risk

All changes are s390x / vfio-ccw (which is exclusive to s390x) only.
So potential regressions of this late addition would be limited to the s390x which in the Form of the IBM Dev&Test teams is the requester of this addition. But if something unexpected happens then in the vfio-ccw use cases.

Related branches

bugproxy (bugproxy)
tags: added: architecture-s39064 bugnameltc-186730 severity-high targetmilestone-inin2010
Changed in ubuntu:
assignee: nobody → Skipper Bug Screeners (skipper-screen-team)
affects: ubuntu → qemu (Ubuntu)
Revision history for this message
Frank Heimes (fheimes) wrote :

Please can you share the qemu target version this code is going to land in?
Please notice that it's 'planned' to have qemu-kvm 5.0 in groovy.

tags: added: qemukvm-20.10
Changed in ubuntu-z-systems:
status: New → Incomplete
importance: Undecided → High
assignee: nobody → Skipper Bug Screeners (skipper-screen-team)
Changed in qemu (Ubuntu):
assignee: Skipper Bug Screeners (skipper-screen-team) → Canonical Server Team (canonical-server)
Frank Heimes (fheimes)
tags: added: qemu-20.10
removed: qemukvm-20.10
Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2020-09-09 01:13 EDT-------
The QEMU commits involved are:

f030532f2a vfio-ccw: Add support for the CRW region and IRQ
f6dde1b012 s390x/css: Refactor the css_queue_crw() routine
690e29b911 vfio-ccw: Refactor ccw irq handler
46ea3841ed vfio-ccw: Add support for the schib region
2a3b9cbaa7 vfio-ccw: Refactor cleanup of regions
f76b348ec7 Linux headers: update

The first one is a Linux headers update to 5.8-rc1, to pick up the corresponding kernel changes.

Revision history for this message
Frank Heimes (fheimes) wrote : Re: [20.10 FEAT] KVM: Enable Channel Path Handling for vfio-ccw - qemu part

Since we are now several days after groovy's feature freeze (https://wiki.ubuntu.com/FeatureFreeze), that was reached on August 27th, a feature request like this is usually moved to be included in the next Ubuntu release, that is 21.04.

A Feature Freeze Exception requires additional effort, a solid and sound justification as well as upfront testing (https://wiki.ubuntu.com/FreezeExceptionProcess).

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

FYI:
The patches would apply in groovy, but I agree to Frank that for the late addition to Groovy we'd need you to drive an FFe process (very soon). And since you usually want to have things in the last LTS we will need to make even more of a good case for this to have a compelling case for an SRU later on.

The current description outlines what technically happens, but if you could add more detail and scenarios to it where it actually fails today and works with the fix. If there are some/no ways for today users to mitigate the issue and so on.
I mean a crash of a guest (which you mentioned is the consequence) is always bad and shold be avouded. But "changes in the host's channel path" is rather generic term. Let me outline the range that comes to my mind for that:
1. A user disables the ccw disk the guest currently uses - well bad luck that will never be smooth I guess.
2. some one sets a checkbox on the storage server UI enabling an unimportant feature for a disk, due to that all guests - even those not using the disks - will crash

I hope you can see why adding more detail would help to decide on this case.

Then we can decide if it is:
a) it is really really important: drive this together through FFe process now and SRU process later
b) it is nice, but not strictly required: then let us just have it flow in via qemu >=5.1 in 21.04 please

Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2020-09-09 10:23 EDT-------
Unless a channel device is used for some critical I/O, Linux is able to recover from channel path changes. Without these changes, KVM guest will not successfully handle situations that Linux would otherwise recover from whem running in LPAR or as a z/VM guest, e.g. channel path changes caused by the FICON fabric, or simply offlining and subsequently onlining. This feature is critical to use vfio-ccw in production.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote : Re: [20.10 FEAT] KVM: Enable Channel Path Handling for vfio-ccw - qemu part

Thanks Victor for explaining what use-case this is about.
Well, what should I say - I'm and was a s390x fanboy due to my history - so I'd like to add it.
But I'm not the one to decide - and in this case that is probably good to not be too influenced.
We will have to present this to release team for groovy and SRU Team later on.

Groovy:
I'll add an FFe template here based on comment #5

## SRUability ##

I've looked forward a bit for an SRU to qemu 4.2 later on that would also need to include
f76b348ec7 Linux headers: update
dc6f8d458a linux-headers: update against Linux 5.7-rc3
ddda37483d linux-headers: update
50fd0c375b linux-headers: Update
2a886794f1 linux-headers: Update

Of which some then won't apply cleanly due to the protvirt backport.
The noise is handleable, but these are more and more changes that pile up.

If you want to drive this forward to FOCAL as well it might be useful to provide us a header update with just exactly what you need instead.

information type: Private → Public
summary: - [20.10 FEAT] KVM: Enable Channel Path Handling for vfio-ccw - qemu part
+ [FFe][20.10 FEAT] KVM: Enable Channel Path Handling for vfio-ccw - qemu
+ part
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Converted to an FFe and subscribed the release Team to sign off on it.

description: updated
Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2020-09-10 11:46 EDT-------
(In reply to comment #19)
...snip...
> If you want to drive this forward to FOCAL as well it might be useful to
> provide us a header update with just exactly what you need instead.

As far as the header update in QEMU goes, these two kernel commits are the ones involved:

d8cac29b1d52 vfio-ccw: Introduce a new CRW region
24c986748ba6 vfio-ccw: Introduce a new schib region

With those applied to kernel 5.4, the following diff is the important hunk for QEMU. Should apply to https://git.launchpad.net/~usd-import-team/ubuntu/+source/qemu branch origin/ubuntu/focal-devel

diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index fb10370d2..9e227348b 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -378,6 +378,8 @@ struct vfio_region_gfx_edid {
/* sub-types for VFIO_REGION_TYPE_CCW */
#define VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD (1)
+#define VFIO_REGION_SUBTYPE_CCW_SCHIB (2)
+#define VFIO_REGION_SUBTYPE_CCW_CRW (3)

/*
* The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
@@ -577,6 +579,7 @@ enum {

enum {
VFIO_CCW_IO_IRQ_INDEX,
+ VFIO_CCW_CRW_IRQ_INDEX,
VFIO_CCW_NUM_IRQS
};

diff --git a/linux-headers/linux/vfio_ccw.h b/linux-headers/linux/vfio_ccw.h
index fcc3e69ef..516496f1d 100644
--- a/linux-headers/linux/vfio_ccw.h
+++ b/linux-headers/linux/vfio_ccw.h
@@ -34,4 +34,23 @@ struct ccw_cmd_region {
__u32 ret_code;
} __attribute__((packed));

+/*
+ * Used for processing commands that read the subchannel-information block
+ * Reading this region triggers a stsch() to hardware
+ * Note: this is controlled by a capability
+ */
+struct ccw_schib_region {
+#define SCHIB_AREA_SIZE 52
+ __u8 schib_area[SCHIB_AREA_SIZE];
+} __attribute__((packed));
+
+/*
+ * Used for returning a Channel Report Word to userspace.
+ * Note: this is controlled by a capability
+ */
+struct ccw_crw_region {
+ __u32 crw;
+ __u32 pad;
+} __attribute__((packed));
+
#endif

Revision history for this message
Frank Heimes (fheimes) wrote :

@<email address hidden>
Just to be sure, we talk about groovy aka 20.10, right? As indicated by the title as well as with the tag 'targetmilestone-inin2010'.
And therefore we talk about the groovy kernel which is 5.8:
linux-generic | 5.8.0.18.22 | groovy | s390x
and not about 5.4, which is the focal / 20.04 kernel (and was only in temp. used in groovy for a short period of time at the very beginning).

And the kernel 5.8 has the commits already in:
user@box:~/ubuntu-groovy-master-next$ git log --oneline --grep "vfio-ccw: Introduce a new CRW region"
d8cac29b1d52 vfio-ccw: Introduce a new CRW region
user@box:~/ubuntu-groovy-master-next$ git tag --contains d8cac29b1d52
Ubuntu-5.8.0-13.14
Ubuntu-5.8.0-14.15
Ubuntu-5.8.0-15.16
Ubuntu-5.8.0-16.17
Ubuntu-5.8.0-17.18
Ubuntu-5.8.0-18.19
v5.8
v5.8-rc1
v5.8-rc2
v5.8-rc3
v5.8-rc4
v5.8-rc5
v5.8-rc6
v5.8-rc7
user@box:~/ubuntu-groovy-master-next$ git log --oneline --grep "vfio-ccw: Introduce a new schib region"
24c986748ba6 vfio-ccw: Introduce a new schib region
user@box:~/ubuntu-groovy-master-next$ git tag --contains 24c986748ba6
Ubuntu-5.8.0-13.14
Ubuntu-5.8.0-14.15
Ubuntu-5.8.0-15.16
Ubuntu-5.8.0-16.17
Ubuntu-5.8.0-17.18
Ubuntu-5.8.0-18.19
v5.8
v5.8-rc1
v5.8-rc2
v5.8-rc3
v5.8-rc4
v5.8-rc5
v5.8-rc6
v5.8-rc7

Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2020-09-10 15:04 EDT-------
(In reply to comment #32)
> @<email address hidden>
> Just to be sure, we talk about groovy aka 20.10, right? As indicated by the
> title as well as with the tag 'targetmilestone-inin2010'.
...snip...

I believe so. The last message from @paelzer mentioned SRUability to Focal, so I looked there. The QEMU linux-headers update for Groovy should be pretty straightforward considering the corresponding kernel version, as you see.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Yeah @Fheimes - we are just pre-checking how doable (or not) this would be for a later Focal SRU.
Currently this is only targetted for Groovy and waits for the FFe ack.

Revision history for this message
Iain Lane (laney) wrote :

Scope is limited, I feel sure you'll keep an eye on any resulting issues after testing and it's requested/wanted by IBM, so feel free to go ahead.

Changed in qemu (Ubuntu):
status: New → Confirmed
Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
status: Incomplete → Confirmed
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :
Changed in qemu (Ubuntu):
status: Confirmed → In Progress
Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
status: Confirmed → In Progress
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Tests, MP review and FFe ack complete - uploaded to groovy.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package qemu - 1:5.0-5ubuntu8

---------------
qemu (1:5.0-5ubuntu8) groovy; urgency=medium

  * d/p/u/lp-1887930-*: Enable Channel Path Handling for vfio-ccw (LP: #1887930)

qemu (1:5.0-5ubuntu7) groovy; urgency=medium

  * d/p/u/lp-1894942-*: fix virtio-ccw host/guest notification (LP: #1894942)

 -- Christian Ehrhardt <email address hidden> Mon, 14 Sep 2020 08:23:49 +0200

Changed in qemu (Ubuntu):
status: In Progress → Fix Released
Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
status: In Progress → Fix Released
Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2020-09-22 02:29 EDT-------
IBM Bugzilla status->closed, Fix Releaesed with groovy

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.