xenial guest on arm64 drops to busybox under openstack bionic-rocky

Bug #1797092 reported by Andrew McLeod
This bug affects 2 people
Affects                     Status        Importance  Assigned to  Milestone
linux (Ubuntu)              Fix Released  High        Unassigned
  Xenial                    Fix Released  High        Unassigned
  Bionic                    Fix Released  High        Unassigned
  Cosmic                    Fix Released  Undecided   Unassigned
linux-raspi2 (Ubuntu)       Fix Released  Undecided   Unassigned
  Xenial                    Fix Released  Undecided   Unassigned
  Bionic                    Fix Released  Undecided   Unassigned
  Cosmic                    Fix Released  Undecided   Unassigned
linux-snapdragon (Ubuntu)   Fix Released  Undecided   Unassigned
  Xenial                    Fix Released  Undecided   Unassigned
  Bionic                    Fix Released  Undecided   Unassigned
  Cosmic                    Fix Released  Undecided   Unassigned

Bug Description

[Impact]
On OpenStack rocky-bionic (with patch, see bug 1771662), xenial guests fail to launch: they drop to the busybox prompt after booting. The reason is that the EFI firmware image switched to ACPI mode by default in bionic. We knew that was happening and backported some minimal ACPI support to keep xenial guests booting (bug 1744754), but now we're seeing guests panic when trying to initialize PCI.

[Test Case]
Deploy rocky-bionic OpenStack and try to launch a xenial guest.

[Fix]
Backport ACPI/PCIe support for arm64 from upstream.

[Regression Risk]
TLDR: Definitely warrants some regression testing on armhf/arm64, but regression risk seems low for other architectures. Here's a patch-by-patch risk analysis:

> 0001-UBUNTU-Config-CONFIG_PCI_ECAM-y.patch
Enables code that will be added in the next patch.

> 0002-PCI-Provide-common-functions-for-ECAM-mapping.patch
Only adds new code w/ no callers yet

> 0003-PCI-generic-thunder-Use-generic-ECAM-API.patch
As noted in the commit message:
"The patch does not introduce any functional changes other than a very minor one: with the new code, on 64-bit platforms, we do just a single ioremap for the whole config space."

> 0004-PCI-of-Move-PCI-I-O-space-management-to-PCI-core-cod.patch
As noted, no functional change.

> 0005-PCI-Move-ecam.h-to-linux-include-pci-ecam.h.patch
As noted, no functional change.

> 0006-PCI-Add-parent-device-field-to-ECAM-struct-pci_confi.patch
This makes changes to the ecam code, but that was added in this series, so no regression risk there. The pci-thunder-pem changes will be regression tested on the hardware that uses that driver (Cavium ThunderX).

> 0007-PCI-Add-pci_unmap_iospace-to-unmap-I-O-resources.patch
Only adds new code, w/ no callers yet

> 0008-PCI-ACPI-Support-I-O-resources-when-parsing-host-bri.patch
Adds and calls a new function that is a no-op on !ARM (PCI_IOBASE is only defined on ARM). Also adds a call to pci_unmap_iospace(), which was added in a previous patch and is also a no-op on !ARM.

> 0009-UBUNTU-Config-CONFIG_ACPI_MCFG-y.patch
Enables code that will be added in the next patch.

> 0010-PCI-ACPI-Add-generic-MCFG-table-handling.patch
Adds parsing of a new ACPI table. This is a possible regression risk for platforms with an MCFG table that were working fine w/ 4.4, if there's a latent bug in this parsing code.

> 0011-PCI-Refactor-pci_bus_assign_domain_nr-for-CONFIG_PCI.patch
As noted, no functional change intended.

> 0012-PCI-Factor-DT-specific-pci_bus_find_domain_nr-code-o.patch
Again, no functional change.

> 0013-ARM64-PCI-Add-acpi_pci_bus_find_domain_nr.patch
Only impacts ARM due to CONFIG_PCI_DOMAINS_GENERIC guard

> 0014-ARM64-PCI-ACPI-support-for-legacy-IRQs-parsing-and-c.patch
arm64-specific PCI initialization code - mitigate risk by regression testing on supported platforms (X-Gene, ThunderX).

> 0015-ARM64-PCI-Support-ACPI-based-PCI-host-controller.patch
Adds new code for arm64-ACPI support. We didn't previously support these devices, so regression risk is negligible.
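
For reviewers unfamiliar with ECAM, the address arithmetic behind patches 0001 to 0003 can be sketched as follows. This is only an illustration of the standard PCIe ECAM layout (4 KiB of config space per function, 1 MiB per bus), not the kernel code from the series; the function name `ecam_offset` is hypothetical.

```python
def ecam_offset(bus: int, device: int, function: int, register: int = 0) -> int:
    """Byte offset of a config register from the ECAM window base.

    Standard PCIe ECAM layout: bus in bits 27:20, device in bits 19:15,
    function in bits 14:12, register offset in bits 11:0.
    """
    if not 0 <= bus <= 0xFF:
        raise ValueError("bus must be 0..255")
    if not 0 <= device <= 0x1F:
        raise ValueError("device must be 0..31")
    if not 0 <= function <= 0x7:
        raise ValueError("function must be 0..7")
    if not 0 <= register <= 0xFFF:
        raise ValueError("register must be 0..4095")
    return (bus << 20) | (device << 15) | (function << 12) | register


# Device 00:00.0 sits at the very start of the window.
print(hex(ecam_offset(0, 0, 0)))
# Each additional bus adds 1 MiB to the offset.
print(hex(ecam_offset(1, 0, 0)))
```

Because every offset is a fixed function of bus, device and function, the whole window for a bus range is one contiguous region, which is consistent with the single ioremap on 64-bit platforms mentioned for patch 0003.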

Andrew McLeod (admcleod) wrote :
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1797092

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: xenial
Changed in linux (Ubuntu):
importance: Undecided → High
Changed in linux (Ubuntu Bionic):
importance: Undecided → High
status: New → Triaged
Changed in linux (Ubuntu):
status: Incomplete → Triaged
tags: added: kernel-da-key
Joseph Salisbury (jsalisbury) wrote : Re: xenial guest on arm64 drops to busybux under openstack bionic-rocky

Would it be possible for you to test out some test kernels? That would allow us to reverse bisect and identify any missing commits from the 4.4 kernel.

tags: added: performing-bisect
Andrew McLeod (admcleod) wrote :

@Joseph I think I have bandwidth, please provide specific links and instructions:)

Joseph Salisbury (jsalisbury) wrote :

I'd like to perform a "Reverse" bisect to figure out what commit fixes this bug. We need to identify the last kernel version that had the bug, and the first kernel version that fixed the bug.

Can you test the following kernels and report back:

v4.6 Final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.6-yakkety/
v4.8 Final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.8/
v4.10 Final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.10/

You don't have to test every kernel, just up until the first kernel that does not have the bug.

Thanks in advance!
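
The reverse bisect described above is a binary search for the first kernel that boots the guest, rather than the first one that breaks it. A minimal sketch, assuming the versions are ordered oldest to newest, the newest one is known to be fixed, and a fix is never lost again once it lands. `first_fixed` and the `is_fixed` callback are hypothetical names; the callback stands in for a manual boot test, and the outcome set in the demo is illustrative only, not test results from this bug.

```python
from typing import Callable, Sequence


def first_fixed(versions: Sequence[str], is_fixed: Callable[[str], bool]) -> str:
    """Return the earliest version for which is_fixed() is true.

    Preconditions (assumed, not checked): versions is non-empty and ordered
    oldest to newest, the last entry is fixed, and every version after the
    first fixed one is also fixed.
    """
    lo, hi = 0, len(versions) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_fixed(versions[mid]):
            hi = mid          # fix is at mid or earlier
        else:
            lo = mid + 1      # fix landed after mid
    return versions[lo]


# Illustrative data only: pretend these are the candidate kernels and that
# a boot test would succeed on the ones listed in `boots`.
candidates = ["v4.4", "v4.5", "v4.6", "v4.7", "v4.8", "v4.9", "v4.10"]
boots = {"v4.8", "v4.9", "v4.10"}
print(first_fixed(candidates, lambda v: v in boots))
```

Each step halves the remaining range, so only a handful of test boots are needed even across many releases.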

dann frazier (dannf) wrote :

There was no v4.6-yakkety build for arm64, but v4.8 booted fine. I further bisected this down to:

0cb0786bac159 ARM64: PCI: Support ACPI-based PCI host controller

Presumably nova in newer OpenStack releases has made changes to the libvirt XML it generates that require additional kernel support. I'll attempt a backport of the necessary changes to get an idea of the feasibility of repairing this w/ a kernel change.

Can someone provide a copy of a guest XML generated by an earlier OpenStack that does successfully boot xenial guests? If the kernel backport proves infeasible, an alternative would be to go back to the older XML format until xenial guest support EOLs.

dann frazier (dannf) wrote :

I've backported the necessary fixes and verified that the result does solve the problem:
  https://git.launchpad.net/~dannf/ubuntu/+source/linux/+git/linux?h=xenial-arm64-acpi-pci

That being said, the changes are not particularly isolated, so they will require careful review and testing for regression risk, particularly on any armhf platforms that support PCI (I don't have any).

Test builds are available in ppa:dannf/test.

tags: removed: performing-bisect
dann frazier (dannf) wrote :

@ppisati: Would you be able to regression test this on armhf?

description: updated
Changed in linux (Ubuntu Xenial):
status: New → Triaged
Changed in linux (Ubuntu Bionic):
status: Triaged → Fix Released
Changed in linux (Ubuntu):
status: Triaged → Fix Released
Changed in linux (Ubuntu Xenial):
importance: Undecided → High
dann frazier (dannf) wrote :

Regression tested on ThunderX/X-Gene, which I believe were the only 2 arm64 servers supported by the 4.4 kernel.

dann frazier (dannf) wrote :

I'm not aware of any armhf hardware that has PCI and was supported by the 4.4 kernel. Calxeda gear maybe - but that was never officially released and I don't have access to it anyway.

I did regression test using the virt model in QEMU. System still boots, lspci smoke tests OK:
ubuntu@ubuntu:~$ lspci
00:00.0 Host bridge: Red Hat, Inc. QEMU PCIe Host bridge

Kernel messages during PCI probe look identical, except we now see ECAM messages (which makes sense, since that support is new):

4.4.0+arm64acpipci.dmesg:[ 1.269192] PCI: CLS 0 bytes, default 64
4.4.0+arm64acpipci.dmesg:[ 6.708312] PCI host bridge /pcie@10000000 ranges:
4.4.0+arm64acpipci.dmesg:[ 6.747553] pci-host-generic 3f000000.pcie: PCI host bridge to bus 0000:00
4.4.0+arm64acpipci.dmesg:[ 6.792745] PCI: bus0: Fast back to back transfers disabled
4.4.0+arm64acpipci.dmesg:[ 7.012429] ehci-pci: EHCI PCI platform driver
4.4.0+arm64acpipci.dmesg:[ 7.037416] ohci-pci: OHCI PCI platform driver
4.4.0.dmesg:[ 1.309199] PCI: CLS 0 bytes, default 64
4.4.0.dmesg:[ 6.733705] PCI host bridge /pcie@10000000 ranges:
4.4.0.dmesg:[ 6.760085] pci-host-generic 3f000000.pcie: PCI host bridge to bus 0000:00
4.4.0.dmesg:[ 6.802961] PCI: bus0: Fast back to back transfers disabled
4.4.0.dmesg:[ 7.016247] ehci-pci: EHCI PCI platform driver
4.4.0.dmesg:[ 7.040325] ohci-pci: OHCI PCI platform driver
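
A comparison like the one above can be mechanized by dropping the timestamps and comparing the remaining message text per log file. The following is a rough sketch, assuming input in the `file:[ timestamp] message` shape of the grep output shown here; the helper name `messages_by_source` is hypothetical, and the sample is a subset of the lines quoted above.

```python
import re

# Matches "<name>.dmesg:[ <seconds>.<micros>] <message>"
LINE = re.compile(r"^(?P<src>[^:]+\.dmesg):\[\s*\d+\.\d+\]\s*(?P<msg>.*)$")


def messages_by_source(lines):
    """Group message text by source file, discarding the timestamps."""
    grouped = {}
    for line in lines:
        match = LINE.match(line)
        if match:
            grouped.setdefault(match.group("src"), []).append(match.group("msg"))
    return grouped


SAMPLE = """\
4.4.0+arm64acpipci.dmesg:[ 1.269192] PCI: CLS 0 bytes, default 64
4.4.0+arm64acpipci.dmesg:[ 6.708312] PCI host bridge /pcie@10000000 ranges:
4.4.0.dmesg:[ 1.309199] PCI: CLS 0 bytes, default 64
4.4.0.dmesg:[ 6.733705] PCI host bridge /pcie@10000000 ranges:
""".splitlines()

grouped = messages_by_source(SAMPLE)
same = grouped["4.4.0.dmesg"] == grouped["4.4.0+arm64acpipci.dmesg"]
print("identical apart from timestamps:", same)
```

Comparing ordered lists rather than sets also catches a change in probe order, not just a missing or extra message.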

dann frazier (dannf)
description: updated
dann frazier (dannf)
summary: - xenial guest on arm64 drops to busybux under openstack bionic-rocky
+ xenial guest on arm64 drops to busybox under openstack bionic-rocky
Changed in linux (Ubuntu Xenial):
status: Triaged → Fix Committed
Changed in linux-raspi2 (Ubuntu Xenial):
status: New → Fix Committed
Changed in linux-snapdragon (Ubuntu Xenial):
status: New → Fix Committed
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
dann frazier (dannf)
Changed in linux (Ubuntu Cosmic):
status: New → Fix Released
Changed in linux-raspi2 (Ubuntu):
status: New → Fix Released
Changed in linux-raspi2 (Ubuntu Bionic):
status: New → Fix Released
Changed in linux-raspi2 (Ubuntu Cosmic):
status: New → Fix Released
Changed in linux-snapdragon (Ubuntu Cosmic):
status: New → Fix Released
Changed in linux-snapdragon (Ubuntu):
status: New → Fix Released
Changed in linux-snapdragon (Ubuntu Bionic):
status: New → Fix Released
dann frazier (dannf) wrote :

Verified - see attached dmesg.

tags: added: verification-done-xenial
removed: verification-needed-xenial
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-140.166

---------------
linux (4.4.0-140.166) xenial; urgency=medium

  * linux: 4.4.0-140.166 -proposed tracker (LP: #1802776)

  * Bypass of mount visibility through userns + mount propagation (LP: #1789161)
    - mount: Retest MNT_LOCKED in do_umount
    - mount: Don't allow copying MNT_UNBINDABLE|MNT_LOCKED mounts

  * kdump fail due to an IRQ storm (LP: #1797990)
    - SAUCE: x86/PCI: Export find_cap() to be used in early PCI code
    - SAUCE: x86/quirks: Add parameter to clear MSIs early on boot
    - SAUCE: x86/quirks: Scan all busses for early PCI quirks

  * crash in ENA driver on removing an interface (LP: #1802341)
    - SAUCE: net: ena: fix crash during ena_remove()

  * xenial guest on arm64 drops to busybox under openstack bionic-rocky
    (LP: #1797092)
    - [Config] CONFIG_PCI_ECAM=y
    - PCI: Provide common functions for ECAM mapping
    - PCI: generic, thunder: Use generic ECAM API
    - PCI, of: Move PCI I/O space management to PCI core code
    - PCI: Move ecam.h to linux/include/pci-ecam.h
    - PCI: Add parent device field to ECAM struct pci_config_window
    - PCI: Add pci_unmap_iospace() to unmap I/O resources
    - PCI/ACPI: Support I/O resources when parsing host bridge resources
    - [Config] CONFIG_ACPI_MCFG=y
    - PCI/ACPI: Add generic MCFG table handling
    - PCI: Refactor pci_bus_assign_domain_nr() for CONFIG_PCI_DOMAINS_GENERIC
    - PCI: Factor DT-specific pci_bus_find_domain_nr() code out
    - ARM64: PCI: Add acpi_pci_bus_find_domain_nr()
    - ARM64: PCI: ACPI support for legacy IRQs parsing and consolidation with DT
      code
    - ARM64: PCI: Support ACPI-based PCI host controller

  * [GLK/CLX] Enhanced IBRS (LP: #1786139)
    - x86/speculation: Remove SPECTRE_V2_IBRS in enum spectre_v2_mitigation
    - x86/speculation: Support Enhanced IBRS on future CPUs

  * Update ENA driver to version 2.0.1K (LP: #1798182)
    - net: ena: remove ndo_poll_controller
    - net: ena: fix warning in rmmod caused by double iounmap
    - net: ena: fix rare bug when failed restart/resume is followed by driver
      removal
    - net: ena: fix NULL dereference due to untimely napi initialization
    - net: ena: fix auto casting to boolean
    - net: ena: minor performance improvement
    - net: ena: complete host info to match latest ENA spec
    - net: ena: introduce Low Latency Queues data structures according to ENA spec
    - net: ena: add functions for handling Low Latency Queues in ena_com
    - net: ena: add functions for handling Low Latency Queues in ena_netdev
    - net: ena: use CSUM_CHECKED device indication to report skb's checksum status
    - net: ena: explicit casting and initialization, and clearer error handling
    - net: ena: limit refill Rx threshold to 256 to avoid latency issues
    - net: ena: change rx copybreak default to reduce kernel memory pressure
    - net: ena: remove redundant parameter in ena_com_admin_init()
    - net: ena: update driver version to 2.0.1
    - net: ena: fix indentations in ena_defs for better readability
    - net: ena: Fix Kconfig dependency on X86
    - net: ena: enable Low Latency Queues
    - net: ena: fix compilat...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-raspi2 - 4.4.0-1101.109

---------------
linux-raspi2 (4.4.0-1101.109) xenial; urgency=medium

  * linux-raspi2: 4.4.0-1101.109 -proposed tracker (LP: #1802780)

  * rpi3bp+: ethernet leds don't blink (LP: #1802320)
    - Revert "lan78xx: Ignore DT MAC address if already valid"
    - Revert "lan78xx: Read MAC address from DT if present"
    - lan78xx: Read MAC address from DT if present
    - Revert "UBUNTU: SAUCE: Revert "lan78xx: Correctly indicate invalid OTP""

  [ Ubuntu: 4.4.0-140.166 ]

  * linux: 4.4.0-140.166 -proposed tracker (LP: #1802776)
  * Bypass of mount visibility through userns + mount propagation (LP: #1789161)
    - mount: Retest MNT_LOCKED in do_umount
    - mount: Don't allow copying MNT_UNBINDABLE|MNT_LOCKED mounts
  * kdump fail due to an IRQ storm (LP: #1797990)
    - SAUCE: x86/PCI: Export find_cap() to be used in early PCI code
    - SAUCE: x86/quirks: Add parameter to clear MSIs early on boot
    - SAUCE: x86/quirks: Scan all busses for early PCI quirks
  * crash in ENA driver on removing an interface (LP: #1802341)
    - SAUCE: net: ena: fix crash during ena_remove()
  * xenial guest on arm64 drops to busybox under openstack bionic-rocky
    (LP: #1797092)
    - [Config] CONFIG_PCI_ECAM=y
    - PCI: Provide common functions for ECAM mapping
    - PCI: generic, thunder: Use generic ECAM API
    - PCI, of: Move PCI I/O space management to PCI core code
    - PCI: Move ecam.h to linux/include/pci-ecam.h
    - PCI: Add parent device field to ECAM struct pci_config_window
    - PCI: Add pci_unmap_iospace() to unmap I/O resources
    - PCI/ACPI: Support I/O resources when parsing host bridge resources
    - [Config] CONFIG_ACPI_MCFG=y
    - PCI/ACPI: Add generic MCFG table handling
    - PCI: Refactor pci_bus_assign_domain_nr() for CONFIG_PCI_DOMAINS_GENERIC
    - PCI: Factor DT-specific pci_bus_find_domain_nr() code out
    - ARM64: PCI: Add acpi_pci_bus_find_domain_nr()
    - ARM64: PCI: ACPI support for legacy IRQs parsing and consolidation with DT
      code
    - ARM64: PCI: Support ACPI-based PCI host controller
  * [GLK/CLX] Enhanced IBRS (LP: #1786139)
    - x86/speculation: Remove SPECTRE_V2_IBRS in enum spectre_v2_mitigation
    - x86/speculation: Support Enhanced IBRS on future CPUs
  * Update ENA driver to version 2.0.1K (LP: #1798182)
    - net: ena: remove ndo_poll_controller
    - net: ena: fix warning in rmmod caused by double iounmap
    - net: ena: fix rare bug when failed restart/resume is followed by driver
      removal
    - net: ena: fix NULL dereference due to untimely napi initialization
    - net: ena: fix auto casting to boolean
    - net: ena: minor performance improvement
    - net: ena: complete host info to match latest ENA spec
    - net: ena: introduce Low Latency Queues data structures according to ENA spec
    - net: ena: add functions for handling Low Latency Queues in ena_com
    - net: ena: add functions for handling Low Latency Queues in ena_netdev
    - net: ena: use CSUM_CHECKED device indication to report skb's checksum status
    - net: ena: explicit casting and initialization, and clearer error handling
    - net: ena: limit refill Rx th...

Changed in linux-raspi2 (Ubuntu Xenial):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-snapdragon - 4.4.0-1105.110

---------------
linux-snapdragon (4.4.0-1105.110) xenial; urgency=medium

  * linux-snapdragon: 4.4.0-1105.110 -proposed tracker (LP: #1802781)

  * xenial guest on arm64 drops to busybox under openstack bionic-rocky
    (LP: #1797092)
    - [Config] CONFIG_PCI_ECAM=y
    - [Config] CONFIG_ACPI_MCFG=y

  [ Ubuntu: 4.4.0-140.166 ]

  * linux: 4.4.0-140.166 -proposed tracker (LP: #1802776)
  * Bypass of mount visibility through userns + mount propagation (LP: #1789161)
    - mount: Retest MNT_LOCKED in do_umount
    - mount: Don't allow copying MNT_UNBINDABLE|MNT_LOCKED mounts
  * kdump fail due to an IRQ storm (LP: #1797990)
    - SAUCE: x86/PCI: Export find_cap() to be used in early PCI code
    - SAUCE: x86/quirks: Add parameter to clear MSIs early on boot
    - SAUCE: x86/quirks: Scan all busses for early PCI quirks
  * crash in ENA driver on removing an interface (LP: #1802341)
    - SAUCE: net: ena: fix crash during ena_remove()
  * xenial guest on arm64 drops to busybox under openstack bionic-rocky
    (LP: #1797092)
    - [Config] CONFIG_PCI_ECAM=y
    - PCI: Provide common functions for ECAM mapping
    - PCI: generic, thunder: Use generic ECAM API
    - PCI, of: Move PCI I/O space management to PCI core code
    - PCI: Move ecam.h to linux/include/pci-ecam.h
    - PCI: Add parent device field to ECAM struct pci_config_window
    - PCI: Add pci_unmap_iospace() to unmap I/O resources
    - PCI/ACPI: Support I/O resources when parsing host bridge resources
    - [Config] CONFIG_ACPI_MCFG=y
    - PCI/ACPI: Add generic MCFG table handling
    - PCI: Refactor pci_bus_assign_domain_nr() for CONFIG_PCI_DOMAINS_GENERIC
    - PCI: Factor DT-specific pci_bus_find_domain_nr() code out
    - ARM64: PCI: Add acpi_pci_bus_find_domain_nr()
    - ARM64: PCI: ACPI support for legacy IRQs parsing and consolidation with DT
      code
    - ARM64: PCI: Support ACPI-based PCI host controller
  * [GLK/CLX] Enhanced IBRS (LP: #1786139)
    - x86/speculation: Remove SPECTRE_V2_IBRS in enum spectre_v2_mitigation
    - x86/speculation: Support Enhanced IBRS on future CPUs
  * Update ENA driver to version 2.0.1K (LP: #1798182)
    - net: ena: remove ndo_poll_controller
    - net: ena: fix warning in rmmod caused by double iounmap
    - net: ena: fix rare bug when failed restart/resume is followed by driver
      removal
    - net: ena: fix NULL dereference due to untimely napi initialization
    - net: ena: fix auto casting to boolean
    - net: ena: minor performance improvement
    - net: ena: complete host info to match latest ENA spec
    - net: ena: introduce Low Latency Queues data structures according to ENA spec
    - net: ena: add functions for handling Low Latency Queues in ena_com
    - net: ena: add functions for handling Low Latency Queues in ena_netdev
    - net: ena: use CSUM_CHECKED device indication to report skb's checksum status
    - net: ena: explicit casting and initialization, and clearer error handling
    - net: ena: limit refill Rx threshold to 256 to avoid latency issues
    - net: ena: change rx copybreak default to reduce kernel memory pressure
    - net: ena: remov...

Changed in linux-snapdragon (Ubuntu Xenial):
status: Fix Committed → Fix Released
Brad Figg (brad-figg)
tags: added: cscc