Guest crashed when detaching the ovs interface device

Bug #1812822 reported by Xiao Feng Ren
This bug affects 1 person
Affects                  Status        Importance  Assigned to  Milestone
Ubuntu on IBM z Systems  Fix Released  Medium      Unassigned
linux (Ubuntu)           Invalid       Undecided   Unassigned
qemu (Ubuntu)            Fix Released  Undecided   Unassigned

Bug Description

When an openvswitch interface device is detached with virsh detach-device after its port has been deleted from ovs and the tap interface itself has been deleted, virsh detach-device fails with "error: Unable to read from monitor: Connection reset by peer". qemu terminates and the log shows "TUNSETVNETLE ioctl() failed: File descriptor in bad state".

[Background] This error was originally found in the openstack KVM CI tempest test. On investigation I found it was introduced by an os-vif patch, which deletes the ovs port and the interface before detaching the device. You can find the commit via https://bugs.launchpad.net/os-vif/+bug/1801072

Reproduced:

   root@xxxx:~#  ovs-vsctl del-port br0 tap9273235a-dd
   root@xxxx:~#  ip link del tap9273235a-dd

The interface device tap9273235a-dd has been removed from the host (checked with ifconfig and ovs-vsctl show) but can still be found in the guest (log on to the guest and run ip a; it is in down state).
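
The host-side check described above can be scripted. The following is a minimal sketch, not part of the original report: the function name tap_state and the IP_CMD / OVS_VSCTL_CMD override variables are hypothetical and exist only so the helper can be exercised without a real OVS setup. It relies on `ip link show dev` and `ovs-vsctl port-to-br` returning a non-zero status when the device or port is unknown.

```shell
# Report whether a tap device is still known to the kernel and to Open vSwitch.
# IP_CMD and OVS_VSCTL_CMD are hypothetical override hooks for this sketch;
# they default to the real ip and ovs-vsctl binaries.
tap_state() {
    tap=$1
    # ip exits non-zero when the link does not exist
    if ${IP_CMD:-ip} link show dev "$tap" >/dev/null 2>&1; then
        link=present
    else
        link=absent
    fi
    # port-to-br exits non-zero when no bridge contains the port
    if ${OVS_VSCTL_CMD:-ovs-vsctl} port-to-br "$tap" >/dev/null 2>&1; then
        port=present
    else
        port=absent
    fi
    echo "$tap: link=$link ovs-port=$port"
}
```

Calling `tap_state tap9273235a-dd` on the host after the two delete commands should then report both the link and the ovs port as absent.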

root@xxxx:~# virsh detach-device kvm net.xml
error: Failed to detach device from net.xml
error: Unable to read from monitor: Connection reset by peer

The qemu has terminated and the log in /var/log/libvirt/qemu/kvm.log
TUNSETVNETLE ioctl() failed: File descriptor in bad state.
2019-01-18 08:16:11.304+0000: shutting down, reason=crashed

It seems qemu tried to handle this interface although it had already been deleted. qemu couldn't access the file descriptor and reported the error.
But I don't think the guest should crash outright because of a file descriptor error.
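
To tell this crash apart from other shutdowns, the two log lines quoted above can be used as a signature. This is a minimal sketch under that assumption; the function name qemu_log_shows_tap_crash is hypothetical, and the match strings are copied from the log excerpt in this report.

```shell
# Succeeds only if the libvirt qemu log contains both the TUNSETVNETLE
# failure message and the "reason=crashed" shutdown line.
qemu_log_shows_tap_crash() {
    log=$1
    # -F treats the patterns as fixed strings, so "()" needs no escaping
    grep -qF 'TUNSETVNETLE ioctl() failed' "$log" &&
        grep -qF 'shutting down, reason=crashed' "$log"
}
```

For the guest in this report the call would be `qemu_log_shows_tap_crash /var/log/libvirt/qemu/kvm.log`. The check only says that both lines occur somewhere in the file; it does not verify that they belong to the same guest run.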

Environment:
Ubuntu 16.04.5 LTS
Linux (EC12) 4.4.0-141-generic
QEMU emulator version 2.11.1(Debian 1:2.11+dfsg-1ubuntu7.5~cloud0)
libvirtd (libvirt) 4.0.0

net.xml

<interface type='bridge'>
  <mac address='52:54:00:fb:5c:46'/>
  <source bridge='br0'/>
  <virtualport type='openvswitch'>
    <parameters interfaceid='9273234d-9ad4-4ecf-8869-d63ac17a0e6d'/>
  </virtualport>
  <target dev='tap9273235a-dd'/>
  <model type='virtio'/>
  <mtu size='1450'/>
  <alias name='net1'/>
  <address type='ccw' cssid='0xfe' ssid='0x0' devno='0x0005'/>
</interface>

kvm.xml

<domain type='kvm' id='31'>
  <name>kvm</name>
  <uuid>59f71b47-16e4-401d-9d33-30bc1605a84a</uuid>
  <memory unit='KiB'>524288</memory>
  <currentMemory unit='KiB'>524288</currentMemory>
  <vcpu placement='static'>1</vcpu>
  <resource>
    <partition>/machine</partition>
  </resource>
  <os>
    <type arch='s390x' machine='s390-ccw-virtio-bionic'>hvm</type>
    <boot dev='hd'/>
  </os>
  <cpu>
    <topology sockets='1' cores='1' threads='1'/>
  </cpu>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <devices>
    <emulator>/usr/bin/qemu-system-s390x</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='none'/>
      <source file='/root/xenial-minimal.qcow2'/>
      <backingStore/>
      <target dev='vda' bus='virtio'/>
      <alias name='virtio-disk0'/>
      <address type='ccw' cssid='0xfe' ssid='0x0' devno='0x0000'/>
    </disk>
    <console type='pty' tty='/dev/pts/2'>
      <source path='/dev/pts/2'/>
      <target type='sclp' port='0'/>
      <alias name='console0'/>
    </console>
    <memballoon model='virtio'>
      <alias name='balloon0'/>
      <address type='ccw' cssid='0xfe' ssid='0x0' devno='0x0001'/>
    </memballoon>
    <panic model='s390'/>
  </devices>
  <seclabel type='dynamic' model='apparmor' relabel='yes'>
    <label>libvirt-59f71b47-16e4-401d-9d33-30bc1605a84a</label>
    <imagelabel>libvirt-59f71b47-16e4-401d-9d33-30bc1605a84a</imagelabel>
  </seclabel>
  <seclabel type='dynamic' model='dac' relabel='yes'>
    <label>+0:+0</label>
    <imagelabel>+0:+0</imagelabel>
  </seclabel>
</domain>

Frank Heimes (fheimes)
tags: added: s390x
removed: detach device
bugproxy (bugproxy)
tags: added: architecture-s39064 bugnameltc-174882 severity-high targetmilestone-inin16045
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1812822

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Frank Heimes (fheimes) wrote :

As of comment #1, please share/attach the relevant logs.

Xiao Feng Ren (renxiaof) wrote :

I couldn't use the command apport-collect 1812822

root@****:~# apport-collect 1812822
ERROR: The python-launchpadlib package is not installed. This functionality is not available.

root@****:~# apt install python-launchpadlib -y
Reading package lists... Done
Building dependency tree
Reading state information... Done
python-launchpadlib is already the newest version (1.10.3-3ubuntu0.1).
0 upgraded, 0 newly installed, 0 to remove and 41 not upgraded.

root@****:~# dpkg -l python-launchpadlib
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-=================-=============-=============-========================================
ii python-launchpadl 1.10.3-3ubunt all Launchpad web services client library

Here is the guest log from /var/log/libvirt/qemu/kvm.log

2019-01-31 10:52:55.667+0000: starting up libvirt version: 4.0.0, package: 1ubuntu8.5~cloud0 (Openstack Ubuntu Testing Bot <email address hidden> Fri, 07 Sep 2018 04:25:04 +0000), qemu version: 2.11.1(Debian 1:2.11+dfsg-1ubuntu7.5~cloud0), hostname: zfwcec178.boeblingen.de.ibm.com
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin QEMU_AUDIO_DRV=none /usr/bin/qemu-system-s390x -name guest=kvm,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-36-kvm/master-key.aes -machine s390-ccw-virtio-bionic,accel=kvm,usb=off,dump-guest-core=off -m 512 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid 59f71b47-16e4-401d-9d33-30bc1605a84a -display none -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-36-kvm/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot strict=on -drive file=/root/xenial-minimal.qcow2,format=qcow2,if=none,id=drive-virtio-disk0,cache=none -device virtio-blk-ccw,scsi=off,devno=fe.0.0000,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -chardev pty,id=charconsole0 -device sclpconsole,chardev=charconsole0,id=console0 -device virtio-balloon-ccw,id=balloon0,devno=fe.0.0001 -msg timestamp=on
2019-01-31 10:52:55.667+0000: Domain id=36 is tainted: high-privileges
2019-01-31T10:52:55.749009Z qemu-system-s390x: -chardev pty,id=charconsole0: char device redirected to /dev/pts/1 (label charconsole0)
TUNSETVNETLE ioctl() failed: File descriptor in bad state.
2019-01-31 10:57:29.694+0000: shutting down, reason=crashed

affects: qemu-kvm (Ubuntu) → qemu (Ubuntu)
Christian Ehrhardt  (paelzer) wrote :

I used a recent version of the software stack from Disco
- qemu 3.1
- libvirt 5.0
- openvswitch 2.11

With that I had a guest with an OVS device like that:
    <interface type='bridge'>
      <mac address='52:54:00:22:57:fd'/>
      <source network='ovsbr0' bridge='ovsbr0'/>
      <virtualport type='openvswitch'>
        <parameters interfaceid='f44ac4e9-fe46-48b8-920c-7ba13dd024ba'/>
      </virtualport>
      <target dev='vnet1'/>
      <model type='virtio'/>
      <driver name='vhost' queues='4'/>
      <alias name='net1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </interface>

Not too different from yours, I'd think.
The OVS setup is trivial, just having this interface atm.

$ sudo ovs-vsctl show
596674ef-e4cd-471f-9708-9caa5737961c
    Bridge "ovsbr0"
        Port "eno49"
            Interface "eno49"
        Port "ovsbr0"
            Interface "ovsbr0"
                type: internal
        Port "vnet1"
            Interface "vnet1"
    ovs_version: "2.11.0"

$ ip link show dev vnet1
93: vnet1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ovs-system state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether fe:54:00:22:57:fd brd ff:ff:ff:ff:ff:ff

I have started a second guest on the same vswitch (to check traffic from the first guest later on).

Now let's delete that port:
$ sudo ovs-vsctl del-port ovsbr0 vnet1
$ sudo ovs-vsctl show
596674ef-e4cd-471f-9708-9caa5737961c
    Bridge "ovsbr0"
        Port "vnet3"
            Interface "vnet3"
        Port "eno49"
            Interface "eno49"
        Port "ovsbr0"
            Interface "ovsbr0"
                type: internal
    ovs_version: "2.11.0"

Ok the OVS device is gone.
Obviously traffic on that interface is dead now, but the guest is still alive and happy.

The host dev is still there:
$ ip link show dev vnet1
93: vnet1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether fe:54:00:22:57:fd brd ff:ff:ff:ff:ff:ff
Removing that as well as suggested

$ sudo ip link del vnet1
$ ip link show dev vnet1
Device "vnet1" does not exist.

The guest is still up and running, while traffic still won't work for obvious reasons.
Now let's trigger the hot-unplug of the device.

$ virsh detach-device guest-openvswitch-1 net.xml
Device detached successfully

The guest is still happy and alive.
It lost the device (since we detached it) but that is ok and intentional.
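
The three steps walked through above can be replayed in one go. This is a minimal sketch, not the exact procedure used in this comment: the helper names run and repro_detach_after_delete and the DRY_RUN switch are hypothetical, the commands are expected to run as root (the walkthrough used sudo), and `virsh domstate` is added at the end to show whether the guest survived.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Replay the failing order: delete the OVS port, delete the tap device,
# then hot-unplug the interface and report the domain state.
repro_detach_after_delete() {
    domain=$1
    bridge=$2
    tap=$3
    xml=$4
    run ovs-vsctl del-port "$bridge" "$tap"
    run ip link del "$tap"
    # Keep going even if the detach fails, so the final state is always shown
    run virsh detach-device "$domain" "$xml"
    run virsh domstate "$domain"
}
```

With the names from this comment the call would be `repro_detach_after_delete guest-openvswitch-1 ovsbr0 vnet1 net.xml`; setting DRY_RUN=1 first only prints the commands.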

Christian Ehrhardt  (paelzer) wrote :

To some extent this feels a bit like:
- https://bugzilla.redhat.com/show_bug.cgi?id=1242383
- https://bugzilla.redhat.com/show_bug.cgi?id=1151306

All those got closed as "invalid host config -> won't fix", so we can't find the fix there.
But something happened to make it work fine in my case; we need to find that.

Christian Ehrhardt  (paelzer) wrote :

We now need to find out what the difference is:
a) Your test case is slightly different and you can trigger it on the same SW levels that work for me; then we need to report it upstream, as those are the very latest versions.
b) Your test passes once you use the more recent SW levels; in that case we need to drill down into your crash and identify the fix (which must be somewhere between qemu 2.11 and 3.1) to consider backporting it.
c) This is arch dependent (I tested x86); we would find that out further down the road, as you'd report (a) to happen. After all, TUNSETVNETLE sets big/little-endian operation for linux tap/macvtap, so it could be s390x only.
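
Since option (c) hinges on the host architecture, it can help to record the byte order of each test host next to the result. This is a minimal sketch; the function name host_byte_order is hypothetical, and whether endianness actually matters for this crash is only the hypothesis stated above, not something this check proves.

```shell
# Print "little" or "big" depending on the byte order of the machine
# the script runs on.
host_byte_order() {
    # od -tu2 reads the two bytes 0x01 0x00 as one unsigned 16-bit integer
    # in host byte order: 1 on little-endian, 256 on big-endian machines.
    value=$(printf '\001\000' | od -An -tu2 | tr -d ' \n')
    case "$value" in
        1)   echo little ;;
        256) echo big ;;
        *)   echo unknown ;;
    esac
}
```

An x86 host is little-endian and an s390x host is big-endian, so the two environments compared in this thread differ in exactly this respect.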

I can't re-deploy the system to use the Bionic level components that you use at the moment, and that would also only answer (b) but not (a).
Therefore, to differentiate between the above, I'd like to ask whether you could re-run your test on Ubuntu 19.04 with Proposed enabled [1], as the new openvswitch is still in proposed for now.

Report back if you can still trigger the issue; in that case I'll most likely encourage you to report it upstream and I'd then participate in the discussion there, probably building test PPAs for you as needed.

Also report back if this SW stack works for you as well; in that case I'd like to know whether you get an actual crash file in /var/crash, which would help pinpoint where in the qemu code to look for
  TUNSETVNETLE ioctl() failed: File descriptor in bad state.
I'd assume net/tap-linux.c in tap_fd_set_vnet_le, but let's be sure.

[1]: https://wiki.ubuntu.com/Testing/EnableProposed

Changed in qemu (Ubuntu):
status: New → Incomplete
Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
status: New → Incomplete
Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
importance: Undecided → Medium
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2019-02-14 08:22 EDT-------
Can this be reproduced with the upstream qemu? If yes, can you also report this to the qemu-s390x mailing list?

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2019-02-15 03:09 EDT-------
(In reply to comment #12)
> Can this be reproduced with the upstream qemu? If yes, can you also report
> this to the qemu-s390x mailing list?

Have tested and reproduced this bug with the latest SW version:

qemu-system-s390x : QEMU emulator version 3.1.0 (Debian 1:3.1+dfsg-2ubuntu1)
libvirtd : libvirtd (libvirt) 5.0.0
openvswitch: ovs-vsctl (Open vSwitch) 2.11.0 DB Schema 7.16.1

Distributor: Ubuntu Disco Dingo(development branch)
Linux server 4.19.0-12-generic #13-Ubuntu SMP

I have reported this problem to the qemu-s390x mailing list.

Xiao Feng Ren (renxiaof) wrote :

I couldn't get a crash file on the disco system even though I set up apport, but I did get qemu_system-s390x.crash under /var/crash on the original test system, Ubuntu 16.04 (qemu 2.11.1).

Dan Streetman (ddstreet) wrote :
tags: added: qemu-19.10
Manoj Iyer (manjo)
Changed in linux (Ubuntu):
status: Incomplete → Invalid
Christian Ehrhardt  (paelzer) wrote :

The mentioned commit 6ab79a20af3a7b3bf610ba9aebb446a9f0b05930 is in qemu 4.1 and later.
Since we will have qemu 4.2 in 20.04 please retry with that once available.
I'll set a bug task reference in the changelog of the update.
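
Whether a given host already runs a qemu new enough to contain that commit can be estimated from the version banner. This is a minimal sketch with hypothetical function names; it assumes the banner format shown elsewhere in this report ("QEMU emulator version X.Y.Z...") and GNU `sort -V`. It compares only the upstream version, so a distribution package that carries the fix as a backport would not be recognized.

```shell
# Extract the upstream version from `qemu-system-s390x --version` style output.
qemu_upstream_version() {
    printf '%s\n' "$1" |
        sed -n 's/^QEMU emulator version \([0-9][0-9.]*\).*/\1/p'
}

# True if version $1 is at least version $2 (relies on GNU `sort -V`).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# True if the banner reports qemu 4.1 or later, the first upstream release
# said above to contain the commit.
qemu_has_detach_fix() {
    ver=$(qemu_upstream_version "$1")
    [ -n "$ver" ] && version_ge "$ver" 4.1
}
```

A typical call would be `qemu_has_detach_fix "$(qemu-system-s390x --version | head -n 1)"`. Both banners quoted in this report (2.11.1 and 3.1.0) fail the check.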

tags: added: qemu-20.04
removed: qemu-19.10
Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
status: Incomplete → In Progress
Changed in qemu (Ubuntu):
status: Incomplete → In Progress
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package qemu - 1:4.2-1ubuntu1

---------------
qemu (1:4.2-1ubuntu1) focal; urgency=medium

  * Merge with Debian testing, Among many other things this fixes LP Bugs:
    LP: #1847806 - add mff* instructions to not break on ppc64 with newer glibc
    LP: #1812822 - avoid crashes on detaching vhost_net interfaces
    LP: #1852744 - Crypto Passthrough Interrupt Support
    LP: #1853316 - CCW IPL Support
    Remaining changes:
    - qemu-kvm to systemd unit
      - d/qemu-kvm-init: script for QEMU KVM preparation modules, ksm,
        hugepages and architecture specifics
      - d/qemu-system-common.qemu-kvm.service: systemd unit to call
        qemu-kvm-init
      - d/qemu-system-common.install: install helper script
      - d/qemu-system-common.maintscript: clean old sysv and upstart scripts
      - d/qemu-system-common.qemu-kvm.default: defaults for
        /etc/default/qemu-kvm
      - d/rules: call dh_installinit and dh_installsystemd for qemu-kvm
    - Distribution specific machine type (LP: 1304107 1621042)
      - d/p/ubuntu/define-ubuntu-machine-types.patch: define distro machine
        types
      - d/qemu-system-x86.NEWS Info on fixed machine type definitions
        for host-phys-bits=true (LP: 1776189)
      - add an info about -hpb machine type in debian/qemu-system-x86.NEWS
      - provide pseries-bionic-2.11-sxxm type as convenience with all
        meltdown/spectre workarounds enabled by default. (LP: 1761372).
    - Enable nesting by default
      - d/p/ubuntu/expose-vmx_qemu64cpu.patch: expose nested kvm by default
        in qemu64 cpu type.
      - d/p/ubuntu/enable-svm-by-default.patch: Enable nested svm by default
        in qemu64 on amd
        [ No more strictly needed, but required for backward compatibility ]
    - improved dependencies
      - Make qemu-system-common depend on qemu-block-extra
      - Make qemu-utils depend on qemu-block-extra
      - let qemu-utils recommend sharutils
    - s390x support
      - Create qemu-system-s390x package
      - Enable numa support for s390x
      - d/rules: build s390-ccw.img with upstream Makefile
      - d/rules: build s390-netboot.img with upstream Makefile
    - arch aware kvm wrappers
    - d/control: update VCS links
    - tolerate ipxe size change on migrations to >=18.04 (LP: 1713490)
      - d/p/ubuntu/pre-bionic-256k-ipxe-efi-roms.patch: old machine types
        reference 256k path
      - d/control-in: depend on ipxe-qemu-256k-compat-efi-roms to be able to
        handle incoming migrations from former releases.
    - d/control-in: Disable capstone disassembler library support (universe)
    - d/control: disable bluetooth being deprecated
    - d/not-installed: ignore new interop docs and extra icons for now
    - d/not-installed: do not install elf2dmp until namespaced
    - d/qemu-utils.install: install new tools qemu-edid and qemu-keymap
    - d/control-in: promote qemu-efi/ovmf in Ubuntu (LP 1570617)
    - d/binfmt-update-in: fix binfmt being called in some containers
      (LP 1840956)
  - Dropped changes (in Debian)
    - qemu-guest-agent: freeze-hook fixes (LP: 1484990)
      - d/qemu-guest-agent.install: provide /etc/qemu/fsfree...

Changed in qemu (Ubuntu):
status: In Progress → Fix Released
Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
status: In Progress → Fix Released
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2020-03-12 06:24 EDT-------
IBM bugzilla status -> closed. If the problem is detected again, a new LP will be created.
