[arm64 ocata] newly created instances are unable to raise network interfaces

Bug #1719196 reported by Sean Feole
14
This bug affects 2 people
Affects Status Importance Assigned to Milestone
QEMU
Fix Released
Undecided
Unassigned
Ubuntu Cloud Archive
Fix Released
Undecided
Unassigned
Ocata
Fix Committed
High
Unassigned
libvirt
New
Undecided
Unassigned
libvirt (Ubuntu)
Invalid
Undecided
Unassigned
qemu (Ubuntu)
Fix Released
Undecided
Unassigned
Zesty
Fix Released
Undecided
Unassigned

Bug Description

[Impact]

 * A change in qemu 2.8 (83d768b virtio: set ISR on dataplane
   notifications) broke virtio handling on platforms without a
   controller. Those encounter flaky networking due to missed IRQs

 * Fix is a backport of the upstream fix b4b9862b: virtio: Fix no
   interrupt when not creating msi controller

[Test Case]

 * On Arm with Zesty (or Ocata) run a guest without PCI based devices

 * Example in e.g. c#23

 * Without the fix the networking does not work reliably (as it losses
   IRQs), with the fix it works fine.

[Regression Potential]

 * Changing the IRQ handling of virtio could affect virtio in general.
   But when reviwing the patch you'll see that it is small and actually
   only changes to enable IRQ on one more place. That could cause more
   IRQs than needed in the worst case, but those are usually not
   breaking but only slowing things down. Also this fix is upstream
   quite a while, increasing confidence.

[Other Info]

 * There is currently 1720397 in flight in the SRU queue, so acceptance
   of this upload has to wait until that completes.

---

arm64 Ocata ,

I'm testing to see I can get Ocata running on arm64 and using the openstack-base bundle to deploy it. I have added the bundle to the log file attached to this bug.

When I create a new instance via nova, the VM comes up and runs, however fails to raise its eth0 interface. This occurs on both internal and external networks.

ubuntu@openstackaw:~$ nova list
+--------------------------------------+---------+--------+------------+-------------+--------------------+
| ID | Name | Status | Task State | Power State | Networks |
+--------------------------------------+---------+--------+------------+-------------+--------------------+
| dcaf6d51-f81e-4cbd-ac77-0c5d21bde57c | sfeole1 | ACTIVE | - | Running | internal=10.5.5.3 |
| aa0b8aee-5650-41f4-8fa0-aeccdc763425 | sfeole2 | ACTIVE | - | Running | internal=10.5.5.13 |
+--------------------------------------+---------+--------+------------+-------------+--------------------+
ubuntu@openstackaw:~$ nova show aa0b8aee-5650-41f4-8fa0-aeccdc763425
+--------------------------------------+----------------------------------------------------------+
| Property | Value |
+--------------------------------------+----------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-SRV-ATTR:host | awrep3 |
| OS-EXT-SRV-ATTR:hypervisor_hostname | awrep3.maas |
| OS-EXT-SRV-ATTR:instance_name | instance-00000003 |
| OS-EXT-STS:power_state | 1 |
| OS-EXT-STS:task_state | - |
| OS-EXT-STS:vm_state | active |
| OS-SRV-USG:launched_at | 2017-09-24T14:23:08.000000 |
| OS-SRV-USG:terminated_at | - |
| accessIPv4 | |
| accessIPv6 | |
| config_drive | |
| created | 2017-09-24T14:22:41Z |
| flavor | m1.small (717660ae-0440-4b19-a762-ffeb32a0575c) |
| hostId | 5612a00671c47255d2ebd6737a64ec9bd3a5866d1233ecf3e988b025 |
| id | aa0b8aee-5650-41f4-8fa0-aeccdc763425 |
| image | zestynosplash (e88fd1bd-f040-44d8-9e7c-c462ccf4b945) |
| internal network | 10.5.5.13 |
| key_name | mykey |
| metadata | {} |
| name | sfeole2 |
| os-extended-volumes:volumes_attached | [] |
| progress | 0 |
| security_groups | default |
| status | ACTIVE |
| tenant_id | 9f7a21c1ad264fec81abc09f3960ad1d |
| updated | 2017-09-24T14:23:09Z |
| user_id | e6bb6f5178a248c1b5ae66ed388f9040 |
+--------------------------------------+----------------------------------------------------------+

As seen above the instances boot an run. Full Console output is attached to this bug.

[ OK ] Started Initial cloud-init job (pre-networking).
[ OK ] Reached target Network (Pre).
[ OK ] Started AppArmor initialization.
         Starting Raise network interfaces...
[FAILED] Failed to start Raise network interfaces.
See 'systemctl status networking.service' for details.
         Starting Initial cloud-init job (metadata service crawler)...
[ OK ] Reached target Network.
[ 315.051902] cloud-init[881]: Cloud-init v. 0.7.9 running 'init' at Fri, 22 Sep 2017 18:29:15 +0000. Up 314.70 seconds.
[ 315.057291] cloud-init[881]: ci-info: +++++++++++++++++++++++++++Net device info+++++++++++++++++++++++++++
[ 315.060338] cloud-init[881]: ci-info: +--------+------+-----------+-----------+-------+-------------------+
[ 315.063308] cloud-init[881]: ci-info: | Device | Up | Address | Mask | Scope | Hw-Address |
[ 315.066304] cloud-init[881]: ci-info: +--------+------+-----------+-----------+-------+-------------------+
[ 315.069303] cloud-init[881]: ci-info: | eth0: | True | . | . | . | fa:16:3e:39:4c:48 |
[ 315.072308] cloud-init[881]: ci-info: | eth0: | True | . | . | d | fa:16:3e:39:4c:48 |
[ 315.075260] cloud-init[881]: ci-info: | lo: | True | 127.0.0.1 | 255.0.0.0 | . | . |
[ 315.078258] cloud-init[881]: ci-info: | lo: | True | . | . | d | . |
[ 315.081249] cloud-init[881]: ci-info: +--------+------+-----------+-----------+-------+-------------------+
[ 315.084240] cloud-init[881]: 2017-09-22 18:29:15,393 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [0/120s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0xffffb10794e0>: Failed to establish a new connection: [Errno 101] Network is unreachable',))]

----------------

I have checked all services in neutron and made sure that they are running and restarted in neutron-gateway

 [ + ] neutron-dhcp-agent
 [ + ] neutron-l3-agent
 [ + ] neutron-lbaasv2-agent
 [ + ] neutron-metadata-agent
 [ + ] neutron-metering-agent
 [ + ] neutron-openvswitch-agent
 [ + ] neutron-ovs-cleanup

and have also restarted and checked the nova-compute logs for their neutron counterparts.

 [ + ] neutron-openvswitch-agent
 [ + ] neutron-ovs-cleanup

 There are some warnings/errors in the neutron-gateway logs which I have attached the full tarball in a separate attachment to this bug:

2017-09-24 14:39:33.152 10322 INFO ryu.base.app_manager [-] instantiating app ryu.controller.ofp_handler of OFPHandler
2017-09-24 14:39:33.153 10322 INFO ryu.base.app_manager [-] instantiating app ryu.app.ofctl.service of OfctlService
2017-09-24 14:39:39.577 10322 ERROR neutron.agent.ovsdb.impl_vsctl [req-2f084ae8-13dc-47dc-b351-24c8f3c57067 - - - - -] Unable to execute ['ovs-vsctl', '--timeout=10', '--oneline', '--format=json', '--', '--id=@manager', 'create', 'Manager', 'target="ptcp:6640:127.0.0.1"', '--', 'add', 'Open_vSwitch', '.', 'manager_options', '@manager']. Exception: Exit code: 1; Stdin: ; Stdout: ; Stderr: ovs-vsctl: transaction error: {"details":"Transaction causes multiple rows in \"Manager\" table to have identical values (\"ptcp:6640:127.0.0.1\") for index on column \"target\". First row, with UUID e02a5f7f-bfd2-4a1d-ae3c-0321db4bd3fb, existed in the database before this transaction and was not modified by the transaction. Second row, with UUID 6e9aba3a-471a-4976-bffd-b7131bbe5377, was inserted by this transaction.","error":"constraint violation"}

These warnings/errors also occur on the nova-compute hosts in /var/log/neutron/

2017-09-22 18:54:52.130 387556 INFO ryu.base.app_manager [-] instantiating app ryu.app.ofctl.service of OfctlService
2017-09-22 18:54:56.124 387556 ERROR neutron.agent.ovsdb.impl_vsctl [req-e291c2f9-a123-422c-be7c-fadaeb5decfa - - - - -] Unable to execute ['ovs-vsctl', '--timeout=10', '--oneline', '--format=json', '--', '--id=@manager', 'create', 'Manager', 'target="ptcp:6640:127.0.0.1"', '--', 'add', 'Open_vSwitch', '.', 'manager_options', '@manager']. Exception: Exit code: 1; Stdin: ; Stdout: ; Stderr: ovs-vsctl: transaction error: {"details":"Transaction causes multiple rows in \"Manager\" table to have identical values (\"ptcp:6640:127.0.0.1\") for index on column \"target\". First row, with UUID 9f27ddee-9881-4cbc-9777-2f42fee735e9, was inserted by this transaction. Second row, with UUID ccf0e097-09d5-449c-b353-6b69781dc3f7, existed in the database before this transaction and was not modified by the transaction.","error":"constraint violation"}

I'm not sure if the above error could be pertaining to the failure, I have also attached those logs to this bug as well...

.
(neutron) agent-list
+--------------------------------------+----------------------+--------+-------------------+-------+----------------+---------------------------+
| id | agent_type | host | availability_zone | alive | admin_state_up | binary |
+--------------------------------------+----------------------+--------+-------------------+-------+----------------+---------------------------+
| 0cca03fb-abb2-4704-8b0b-e7d3e117d882 | DHCP agent | awrep1 | nova | :-) | True | neutron-dhcp-agent |
| 14a5fd52-fbc3-450c-96d5-4e9a65776dad | L3 agent | awrep1 | nova | :-) | True | neutron-l3-agent |
| 2ebc7238-5e61-41f8-bc60-df14ec6b226b | Loadbalancerv2 agent | awrep1 | | :-) | True | neutron-lbaasv2-agent |
| 4f6275be-fc8b-4994-bdac-13a4b76f6a83 | Metering agent | awrep1 | | :-) | True | neutron-metering-agent |
| 86ecc6b0-c100-4298-b861-40c17516cc08 | Open vSwitch agent | awrep1 | | :-) | True | neutron-openvswitch-agent |
| 947ad3ab-650b-4b96-a520-00441ecb33e7 | Open vSwitch agent | awrep4 | | :-) | True | neutron-openvswitch-agent |
| 996b0692-7d19-4641-bec3-e057f4a856f6 | Open vSwitch agent | awrep3 | | :-) | True | neutron-openvswitch-agent |
| ab6b1065-0b98-4cf3-9f46-6bddba0c5e75 | Metadata agent | awrep1 | | :-) | True | neutron-metadata-agent |
| fe24f622-b77c-4eed-ae22-18e4195cf763 | Open vSwitch agent | awrep2 | | :-) | True | neutron-openvswitch-agent |
+--------------------------------------+----------------------+--------+-------------------+-------+----------------+---------------------------+

Should neutron-openvswitch be assigned to the 'nova' availability zone?

I have also attached the ovs-vsctl show from a nova-compute host and the neutron-gateway host to ensure that the open v switch routes are correct.

I can supply more logs if required.

Revision history for this message
Sean Feole (sfeole) wrote :
tags: added: arm64
Revision history for this message
Sean Feole (sfeole) wrote :
Revision history for this message
Sean Feole (sfeole) wrote :
Revision history for this message
Sean Feole (sfeole) wrote :
Revision history for this message
Sean Feole (sfeole) wrote :

juju status
bundle file
vm console output
and nova service list

can be found in ocata-arm64-neutronbug.txt

Revision history for this message
Liam Young (gnuoy) wrote :

Running tcpdump on the guests tap device shows the dhcp request leaving and the reply coming back.

Revision history for this message
Andrew McLeod (admcleod) wrote :

This may be qemu related, rather than libvirt - further investigation required - but unlikely to be a problem with neutron-gateway charm

affects: charm-neutron-gateway → libvirt
Revision history for this message
Andrew McLeod (admcleod) wrote :

I have confirmed this bug with another arm64 + ocata deployment.

Revision history for this message
Andrew McLeod (admcleod) wrote :

Further summary:

we can see dhcp requests leaving and returning to the tap interface associated with the guest, but the guest does not register the returned packet (does not acquire an address).

tags: added: libvirt qemu uosci
Revision history for this message
Sean Feole (sfeole) wrote :

I was also able to reproduce this again on Cavium ThunderX hardware,

ii ipxe-qemu 1.0.0+git-20150424.a25a16d-1ubuntu1 all PXE boot firmware - ROM images for qemu
ii qemu-block-extra:arm64 1:2.8+dfsg-3ubuntu2.3~cloud0 arm64 extra block backend modules for qemu-system and qemu-utils
ii qemu-efi 0~20160408.ffea0a2c-2 all UEFI firmware for virtual machines
ii qemu-kvm 1:2.8+dfsg-3ubuntu2.3~cloud0 arm64 QEMU Full virtualization
ii qemu-system-arm 1:2.8+dfsg-3ubuntu2.3~cloud0 arm64 QEMU full system emulation binaries (arm)
ii qemu-system-common 1:2.8+dfsg-3ubuntu2.3~cloud0 arm64 QEMU full system emulation binaries (common files)
ii qemu-utils 1:2.8+dfsg-3ubuntu2.3~cloud0 arm64 QEMU utilities

-------------------

ii libvirt-bin 2.5.0-3ubuntu5.4~cloud0 arm64 programs for the libvirt library
ii libvirt-clients 2.5.0-3ubuntu5.4~cloud0 arm64 Programs for the libvirt library
ii libvirt-daemon 2.5.0-3ubuntu5.4~cloud0 arm64 Virtualization daemon
ii libvirt-daemon-system 2.5.0-3ubuntu5.4~cloud0 arm64 Libvirt daemon configuration files
ii libvirt0:arm64 2.5.0-3ubuntu5.4~cloud0 arm64 library for interfacing with different virtualization systems
ii nova-compute-libvirt 2:15.0.6-0ubuntu1~cloud0 all OpenStack Compute - compute node libvirt support
ii python-libvirt 3.0.0-2~cloud0 arm64 libvirt Python bindings

Revision history for this message
Sean Feole (sfeole) wrote :

Just to further add to comment #6, That this is not a neutron issue, Here is the tcpdump output of the guests tap device shows the dhcp request leaving and the reply coming back. For anyways that may be curious.

Guest MAC:

    <interface type='bridge'>
      <mac address='fa:16:3e:1d:a8:82'/>
      <source bridge='qbrc8faaf66-7f'/>
      <target dev='tapc8faaf66-7f'/>
      <model type='virtio'/>
      <address type='virtio-mmio'/>
    </interface>

$ sudo ip netns exec qdhcp-a9958ab4-8a7e-4ded-b9a0-860bc42f79d9 tcpdump -A -l -i ns-c751afb3-2b
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ns-c751afb3-2b, link-type EN10MB (Ethernet), capture size 262144 bytes
13:51:11.458941 IP6 :: > ff02::16: HBH ICMP6, multicast listener report v2, 1 group record(s), length 28
`....$..................................:...................................
13:51:11.462966 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from fa:16:3e:1d:a8:82 (oui Unknown), length 300
E..H......9..........D.C.4q.....5QN{......................>.............................................................................................................................................................................................................c.Sc5....ubuntu7.......w.,/.y*..................................
13:51:11.463331 IP 10.5.5.2.bootps > 10.5.5.9.bootpc: BOOTP/DHCP, Reply, length 328
E..dY'..@...

Revision history for this message
Sean Feole (sfeole) wrote :

Today I ran some tests and installed a Newton Deployment on arm64, which we already know works. I upgraded QEMU and Libvirt on one of the hypervisors from the xenial-updates/ocata cloud-archive.

See attached notes.

Libvirt - 1.3.1-1ubuntu10.14 -> 2.5.0-3ubuntu5.5~cloud0
QEMU - 1:2.5+dfsg-5ubuntu10.16 -> 1:2.8+dfsg-3ubuntu2.3~cloud0

I was able to reset the already built instance on the hypervisor and was able to receive a dhcp response from the ovs tap device. Eth0 came up as expected with an internal tenant IP.

Steps to reproduce.

1.) Install Newton & start a few VM's
2.) Choose 1 hypervisor , upgrade qemu & libvirt to versions from the Ocata cloud-archive.
3.) Reset the running VM so that it now runs with the latest QEMU/Libvirt
4.) Reset the Instance, see if it boots and network can be reached.

Revision history for this message
Sean Feole (sfeole) wrote :

Taken from the upgraded hypervisor

ubuntu@aw3:/var/lib/nova/instances/2cec409e-de92-4d29-ad68-3f1d1f8be7fc$ sudo qemu-system-aarch64 --version
QEMU emulator version 2.8.0(Debian 1:2.8+dfsg-3ubuntu2.3~cloud0)
Copyright (c) 2003-2016 Fabrice Bellard and the QEMU Project developers

ubuntu@aw3:/var/lib/nova/instances/2cec409e-de92-4d29-ad68-3f1d1f8be7fc$ sudo libvirtd --version
libvirtd (libvirt) 2.5.0

Revision history for this message
Sean Feole (sfeole) wrote :
Download full text (3.7 KiB)

I was able to gather libvirt XMLs from both Newton and Ocata Instances, see attached

sfeole@BSG-75:~$ diff xmlocata xmlnewton
1,2c1,2
<
< main type='kvm' id='1'>
---
> ubuntu@lundmark:/var/lib/nova/instances/358596e4-135d-461d-a514-84116440014c$ sudo virsh dumpxml instance-00000001
> <domain type='kvm' id='1'>
4c4
< <uuid>7c0dcd78-d6b4-4575-a882-ee5d29c64fe0</uuid>
---
> <uuid>358596e4-135d-461d-a514-84116440014c</uuid>
7,9c7,9
< <nova:package version="15.0.6"/>
< <nova:name>sfeole14</nova:name>
< <nova:creationTime>2017-10-05 00:58:15</nova:creationTime>
---
> <nova:package version="14.0.8"/>
> <nova:name>sfeole-newton</nova:name>
> <nova:creationTime>2017-10-05 01:40:39</nova:creationTime>
18,19c18,19
< <nova:user uuid="d9f92a61c37948d9a29b8cc37e1bca05">admin</nova:user>
< <nova:project uuid="701441267bd148d3842f1696530b1c92">admin</nova:project>
---
> <nova:user uuid="12d13712253141ab845b89406507cd6c">admin</nova:user>
> <nova:project uuid="8b8fcaf183954f45b3eb15b27c52ec94">admin</nova:project>
21c21
< <nova:root type="image" uuid="f329117f-5da2-4d61-8341-89a969bf00e7"/>
---
> <nova:root type="image" uuid="fcdf7c26-4238-4594-b8ee-fd59f601fdcb"/>
34c34
< <type arch='aarch64' machine='virt-2.8'>hvm</type>
---
> <type arch='aarch64' machine='virt'>hvm</type>
58c58
< <source file='/var/lib/nova/instances/7c0dcd78-d6b4-4575-a882-ee5d29c64fe0/disk'/>
---
> <source file='/var/lib/nova/instances/358596e4-135d-461d-a514-84116440014c/disk'/>
61c61
< <source file='/var/lib/nova/instances/_base/035458feead7f83be80ed020442ebf815a4067ea'/>
---
> <source file='/var/lib/nova/instances/_base/ab7429e8558ea27fb54f70eb92fbd0c1c07dd595'/>
70c70
< <source file='/var/lib/nova/instances/7c0dcd78-d6b4-4575-a882-ee5d29c64fe0/disk.eph0'/>
---
> <source file='/var/lib/nova/instances/358596e4-135d-461d-a514-84116440014c/disk.eph0'/>
82a83,93
> <controller type='pci' index='1' model='dmi-to-pci-bridge'>
> <model name='i82801b11-bridge'/>
> <alias name='pci.1'/>
> <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/>
> </controller>
> <controller type='pci' index='2' model='pci-bridge'>
> <model name='pci-bridge'/>
> <target chassisNr='2'/>
> <alias name='pci.2'/>
> <address type='pci' domain='0x0000' bus='0x01' slot='0x01' function='0x0'/>
> </controller>
84,86c95,97
< <mac address='fa:16:3e:a6:4b:d4'/>
< <source bridge='qbra7012530-32'/>
< <target dev='tapa7012530-32'/>
---
> <mac address='fa:16:3e:8e:fc:48'/>
> <source bridge='qbrb5d335fc-46'/>
> <target dev='tapb5d335fc-46'/>
92c103
< <source path='/var/lib/nova/instances/7c0dcd78-d6b4-4575-a882-ee5d29c64fe0/console.log'/>
---
> <source path='/var/lib/nova/instances/358596e4-135d-461d-a514-84116440014c/console.log'/>
95a107,111
> <serial type='pty'>
> <source path='/dev/pts/3'/>
> <target port='1'/>
> <alias name='serial1'/>
> </serial>
97c113
< <source path='/var/lib/nova/instances/7c0dcd78-d6b4-4575-a882-ee5d29c64fe0/consol...

Read more...

Revision history for this message
dann frazier (dannf) wrote :

Thanks so much for doing that Sean.

Omitting expected changes (uuid, mac address, etc), here's are the significant changes I see:

1) N uses the QEMU 'virt' model, O uses 'virt-2.8'
2) N and O both expose a pci root, but N also exposed 2 PCI bridges that O does not.
3) N exposes an additional serial device.
4) N and O both use an apparmor seclabel. However, O also has a DAC model.

#4 is the most interesting to me. Is there a way to configure ocata nova to not enable DAC?

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :
Download full text (6.2 KiB)

Hi,
I was reading into this after being back from PTO (actually back next monday).
I was wondering as I tested (without openstack) just that last week with Dannf on the Rally).
And indeed it seems to work for me on Artful (which as we all know is =Ocata in terms of SW stack).

$ cat arm-template.xml
<domain type='kvm'>
        <os>
                <type arch='aarch64' machine='virt'>hvm</type>
                <loader readonly='yes' type='pflash'>/usr/share/AAVMF/AAVMF_CODE.fd</loader>
                <nvram template='/usr/share/AAVMF/AAVMF_CODE.fd'>/tmp/AAVMF_CODE.fd</nvram>
                <boot dev='hd'/>
        </os>
        <features>
                <acpi/>
                <apic/>
                <pae/>
        </features>
        <cpu mode='custom' match='exact'>
                <model fallback='allow'>host</model>
        </cpu>
        <devices>
                <interface type='network'>
                        <source network='default'/>
                        <model type='virtio'/>
                </interface>
                <serial type='pty'>
                        <source path='/dev/pts/3'/>
                        <target port='0'/>
                </serial>
        </devices>
</domain>

$ uvt-kvm create --template arm-template.xml --password=ubuntu artful-test4 release=artful arch=arm64 label=daily

The template is created in a way to let as much as possible for libvirt and qemu to fill in defaults. That way if one of them change we do not have to follow and adapt.
Maybe such a thing is happening to you for the more "defined" xml that openstack is sending.

From the hosts point of view all looks normal in journal, you see start and dhcp discover/offer/ack sequence:
Okt 11 10:45:43 seidel kernel: virbr0: port 2(vnet0) entered learning state
Okt 11 10:45:45 seidel kernel: virbr0: port 2(vnet0) entered forwarding state
Okt 11 10:45:45 seidel kernel: virbr0: topology change detected, propagating
Okt 11 10:46:14 seidel dnsmasq-dhcp[2610]: DHCPDISCOVER(virbr0) 52:54:00:b1:db:89
Okt 11 10:46:14 seidel dnsmasq-dhcp[2610]: DHCPOFFER(virbr0) 192.168.122.13 52:54:00:b1:db:89
Okt 11 10:46:14 seidel dnsmasq-dhcp[2610]: DHCPDISCOVER(virbr0) 52:54:00:b1:db:89
Okt 11 10:46:14 seidel dnsmasq-dhcp[2610]: DHCPOFFER(virbr0) 192.168.122.13 52:54:00:b1:db:89
Okt 11 10:46:14 seidel dnsmasq-dhcp[2610]: DHCPREQUEST(virbr0) 192.168.122.13 52:54:00:b1:db:89
Okt 11 10:46:14 seidel dnsmasq-dhcp[2610]: DHCPACK(virbr0) 192.168.122.13 52:54:00:b1:db:89 ubuntu

The guest also seems "normal" to me:
$ uvt-kvm ssh artful-test4 -- journalctl -u systemd-networkd
-- Logs begin at Wed 2017-10-11 10:46:03 UTC, end at Wed 2017-10-11 11:02:31 UTC. --
Oct 11 10:46:11 ubuntu systemd[1]: Starting Network Service...
Oct 11 10:46:11 ubuntu systemd-networkd[547]: Enumeration completed
Oct 11 10:46:11 ubuntu systemd[1]: Started Network Service.
Oct 11 10:46:11 ubuntu systemd-networkd[547]: enp1s0: IPv6 successfully enabled
Oct 11 10:46:11 ubuntu systemd-networkd[547]: enp1s0: Gained carrier
Oct 11 10:46:12 ubuntu systemd-networkd[547]: enp1s0: Gained IPv6LL
Oct 11 10:46:14 ubuntu systemd-networkd[547]: enp1s0: DHCPv4 address 192.168.122.13/24 via 192.168.122.1
Oct 11 10:4...

Read more...

Changed in libvirt (Ubuntu):
status: New → Incomplete
Changed in qemu (Ubuntu):
status: New → Incomplete
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Note - be careful if you mean upstream qemu/libvirt (as the bug was filed) or the packages in Ubuntu (I added tasks for these).
I try to spot both, but will be more on-track for the latter.

Revision history for this message
Sean Feole (sfeole) wrote :

I spent some time today trying to modify the ocata xml templates produced in /etc/libvirt/qemu/<INSTANCE-ID>.xml for each of the generated instances. Destroying & Undefining the existing instance, making some changes and redefining the xml, however it appears that nova regenerates these templates upon instance reset, thus removing any changes made to the xml template. Is there any way to disable this feature that does not require some serious modifications to the nova underpinnings?

Revision history for this message
Andrew McLeod (admcleod) wrote :

The problem is with virtio-mmio.

https://bugzilla.redhat.com/show_bug.cgi?id=1422413

Instances launched with virtio-mmio on aarch64 will not get DHCP (will not have a nic)

xml with libvirt 2.5.0

    <interface type='bridge'>
      <mac address='fa:16:3e:af:95:2e'/>
      <source bridge='qbrb5abdeb0-a0'/>
      <target dev='tapb5abdeb0-a0'/>
      <model type='virtio'/>
      <alias name='net0'/>
      <address type='virtio-mmio'/>

I have updated libvirt-daemon to 3.6.0 on a particular compute node - when an instance is booted now, the nic section of the virsh xml looks like this:

    <interface type='bridge'>
      <mac address='fa:16:3e:10:0e:22'/>
      <source bridge='qbr274809a0-dc'/>
      <target dev='tap274809a0-dc'/>
      <model type='virtio'/>
      <alias name='net0'/>
      <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>

The instance then gets a NIC and is able to get DHCP and complete cloud-init successfully.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

So,
libvirt knew about some change and picks the right default if you do not specify it.
But if you (Openstack in this case) specify virtio-mmio as type, then it fails - is that correct?

The patch in the referred RH-BZ is already in the qemu we have in Artful (and thereby Ocata).
So if that is supposed to be the issue there has to be a new one after that fix.

Also as I've shown in c#17 (yes it is long sorry) - virtio-mmio works with the bridge that libvirt is usually creating - again my assumption is that it is somehow related to how this bridge is created (openvswitch in your case I assume).

So is the real error "network fails when using virtio-mmio on openvswitch set up by openstack"?
I have no OVS around to quickly try something around that atm.

Would it be reasonable to teach Openstack to not define virtio-mmio in this case?
Libvirt will make the right default (hopefully also when driven by Openstack which sets some force options), and just work then.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Hi,
@admcleod as discussed on IRC I realized Ocata maps to Zesty so that is qemu 2.8.
Therefore the referred patch is released in Artful, but not in Zesty.

I tagged up the bug tasks and provide a fix in [1] to test.

[1]: https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/2995

Changed in qemu (Ubuntu):
status: Incomplete → Fix Released
no longer affects: libvirt (Ubuntu Zesty)
Changed in qemu (Ubuntu Zesty):
status: New → Incomplete
status: Incomplete → In Progress
Revision history for this message
Sean Feole (sfeole) wrote :

To further reinforce what Christian said, the Newton cloud-archive also uses virtio-mmio for its address type. (See my comment#15)

Newton XML:

    <interface type='bridge'>
      <mac address='fa:16:3e:8e:fc:48'/>
      <source bridge='qbrb5d335fc-46'/>
      <target dev='tapb5d335fc-46'/>
      <model type='virtio'/>
      <alias name='net0'/>
      <address type='virtio-mmio'/>
    </interface>

but we have proven this works with Newton. Is it fair to say this could be attributed to a change in OVS, between Newton and Ocata?

Revision history for this message
Ryan Beisner (1chb1n) wrote :
Changed in libvirt (Ubuntu):
status: Incomplete → New
status: New → Incomplete
no longer affects: cloud-archive/pike
Changed in cloud-archive:
status: New → Fix Released
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Yeah, the offending patch in RH-BZ 1422413 appeared in qemu 2.8.
So it would make sense to work with newtwon (2.6.1), and pike (2.10), but fail on Ocata (2.8).

I checked the ppa, in general it seems to work for me, so I'm now waiting for your verification if that really addresses the issue you are facing.
If that would be the case I can pass it through regression tests and afterwards to SRU (which currently holds another update that has to clear before doing so)

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Since we wait for the zesty verification I'm setting the status to incomplete for now.

Changed in qemu (Ubuntu Zesty):
status: In Progress → Incomplete
Changed in libvirt (Ubuntu):
status: Incomplete → Invalid
Revision history for this message
Thomas Huth (th-huth) wrote :

Since the fix mentioned in the previous comments is already in upstream QEMU, I'm setting the upstream status to "Fix released".

Changed in qemu:
status: New → Fix Released
Revision history for this message
Andrew McLeod (admcleod) wrote :

I've tested with the packages from the ppa:

https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/2995

qemu:
  Installed: 1:2.8+dfsg-3ubuntu2.7~ppa5cloud
qemu-system-arm:
  Installed: 1:2.8+dfsg-3ubuntu2.7~ppa5cloud
qemu-system-aarch64:
  Installed: 1:2.8+dfsg-3ubuntu2.7~ppa5cloud

Rebooted the instance and it aquired an IP address and booted.

more info, virsh dumpxml excerpt:

    <interface type='bridge'>
      <mac address='fa:16:3e:4e:1f:8f'/>
      <source bridge='qbrd542e755-45'/>
      <target dev='tapd542e755-45'/>
      <model type='virtio'/>
      <alias name='net0'/>
      <address type='virtio-mmio'/>

Revision history for this message
Sean Feole (sfeole) wrote :

will test these and report back shortly.

Revision history for this message
Sean Feole (sfeole) wrote :

I've testing with the same packages listed in comment #28, Confirmed that this now works..

See attached log

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Ok, driving that into an SRU then - thanks for verifying.

description: updated
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Regression tests on the ppa are good as well, we need to double check the auto-run on proposed then to ensure this doesn't behave differently on other arches.
I cleaned up the changelog and UCA backport noise and made a proper SRU for zesty.

Note: there is currently another SRU in flight (already in verified state for 2 days), so acceptance from zesty-unapproved likely has to wait a bit until the former one clears completely

Changed in qemu (Ubuntu Zesty):
status: Incomplete → Fix Committed
Revision history for this message
Brian Murray (brian-murray) wrote :

I accepted this but Launchpad timed out when the SRU testing comment was being added.

tags: added: verification-needed verification-needed-zesty
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Thanks for the FYI Brian.
I see it in pending-SRUs as it should be.
I added the -needed tags to be "correct".

@Andrew/Sean - could you test proposed and set verified then (assuming it works like the ppa did)

Odd - this really seems to hit everything, not only LP timeouts.
There are also dep8 regressions listed on systemd which make no sense in relation to the fix.
The test history on arm is borked since February [1], so I ask you to override and ignore that.
On s390x it at least works sometimes, but the log [2] doesn't look much better, but there at least it hits a different issue than before - I'm retriggering this for now - but likely ask to ignore that as well - but I'll take a look at the retry first.

[1]: http://autopkgtest.ubuntu.com/packages/s/systemd/zesty/armhf
[2]: http://autopkgtest.ubuntu.com/packages/s/systemd/zesty/s390x

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

As assumed s390x passed now (so a flaky test), and as outlined before armhf we just have to give up.
Looking at the history an override might be the right thing to do.

Other than that all looks good, waiting for the verify by Sean/Andrew now.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Ok, Zesty actually had an override for the arm test (had to learn the placement of those for SRUs).
So only up to the verification now.

Revision history for this message
Corey Bryant (corey.bryant) wrote : Please test proposed package

Hello Sean, or anyone else affected,

Accepted qemu into ocata-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:ocata-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-ocata-needed to verification-ocata-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-ocata-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-ocata-needed
Revision history for this message
dann frazier (dannf) wrote :

Thanks Christian - I've now verified this. I took a stepwise approach:

1) We originally observed this issue w/ the ocata cloud archive on xenial, so I redeployed that. I verified that I was still seeing the problem. I then created a PPA[*] w/ an arm64 build of QEMU from the ocata-staging PPA, which is a backport of the zesty-proposed package, and upgraded my nova-compute nodes to this version. I rebooted my test guests, and the problem was resolved.

2) I then updated my sources.list to point to zesty (w/ proposed enabled), and upgraded qemu-system-arm. This way I could test the actual build in zesty-proposed, as opposed to my backport. This continued to work.

3) Finally, I dist-upgraded this system from xenial to zesty - so that I'm actually testing the zesty build in a zesty environment, and rebooted. Still worked :)

[*] https://launchpad.net/~dannf/+archive/ubuntu/lp1719196

dann frazier (dannf)
tags: added: verification-done-zesty
removed: verification-needed-zesty
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package qemu - 1:2.8+dfsg-3ubuntu2.7

---------------
qemu (1:2.8+dfsg-3ubuntu2.7) zesty; urgency=medium

  * d/p/ubuntu/virtio-Fix-no-interrupt-when-not-creating-msi-contro.patch:
    on Arm fix no interrupt when not creating msi controller. That fixes
    broken networking if running with virtio-mmio only (LP: #1719196).

 -- Christian Ehrhardt <email address hidden> Wed, 18 Oct 2017 16:17:34 +0200

Changed in qemu (Ubuntu Zesty):
status: Fix Committed → Fix Released
Revision history for this message
Brian Murray (brian-murray) wrote : Update Released

The verification of the Stable Release Update for qemu has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

Regression testing has passed successfully.

zesty-ocata-proposed with stable charms:

======
Totals
======
Ran: 102 tests in 1897.0150 sec.
 - Passed: 93
 - Skipped: 9
 - Expected Fail: 0
 - Unexpected Success: 0
 - Failed: 0
Sum of execute time for each test: 1011.5607 sec.

zesty-ocata-proposed with dev charms:

======
Totals
======
Ran: 102 tests in 1933.5299 sec.
 - Passed: 93
 - Skipped: 9
 - Expected Fail: 0
 - Unexpected Success: 0
 - Failed: 0
Sum of execute time for each test: 963.0546 sec.

xenial-ocata-proposed with stable charms:

======
Totals
======
Ran: 102 tests in 1767.3787 sec.
 - Passed: 93
 - Skipped: 9
 - Expected Fail: 0
 - Unexpected Success: 0
 - Failed: 0
Sum of execute time for each test: 906.2188 sec.

xenial-ocata-proposed with dev charms:

======
Totals
======
Ran: 102 tests in 2051.1377 sec.
 - Passed: 93
 - Skipped: 9
 - Expected Fail: 0
 - Unexpected Success: 0
 - Failed: 0
Sum of execute time for each test: 998.9001 sec.

tags: added: verification-ocata-done
removed: verification-ocata-needed
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.