[Regression] Stuck CPU1-x when booting as Xen HVM guest on certain Intel hosts
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux (Ubuntu) | Won't Fix | High | Unassigned |
Precise | Won't Fix | Undecided | Unassigned |
Quantal | Won't Fix | Undecided | Unassigned |
xen (Ubuntu) | Fix Released | High | Stefan Bader |
Precise | Fix Released | Medium | Unassigned |
Quantal | Fix Released | Medium | Unassigned |
Bug Description
SRU Justification:
Impact: When a kernel of version 3.5 or later boots in an HVM guest with multiple VCPUs on a host that supports Supervisor Mode Execution Protection (SMEP), only the boot processor is running. All additional VCPUs get stuck.
This happens because Xen keeps paging enabled in hardware even while the guest VCPU has not yet enabled paging mode, but the pages it uses are not set up to grant execution rights.
Fix: A set of three patches backported from upstream Xen will mask off the SMEP bit from the hardware register as long as the guest VCPU is not in paging mode.
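The masking idea can be illustrated with a minimal sketch. This is not the backported Xen code; the helper name `hw_cr4_for_guest` and its parameters are hypothetical. The only facts assumed are the architectural bit positions: CR0.PG is bit 31 and CR4.SMEP is bit 20.

```python
# Illustrative sketch only; this is NOT the actual Xen patch.
X86_CR0_PG = 1 << 31    # CR0.PG: guest has enabled paging
X86_CR4_SMEP = 1 << 20  # CR4.SMEP: Supervisor Mode Execution Protection


def hw_cr4_for_guest(guest_cr0: int, cr4: int) -> int:
    """Return the CR4 value to load into hardware for a guest VCPU.

    Hypothetical helper: guest_cr0 is the CR0 value the guest believes
    it has set, cr4 is the CR4 value that would otherwise be loaded.
    """
    if not guest_cr0 & X86_CR0_PG:
        # Guest is not in paging mode yet: hide SMEP so that instruction
        # fetches through the hypervisor-provided pages do not fault.
        cr4 &= ~X86_CR4_SMEP
    return cr4
```

Once the guest sets CR0.PG, the same helper passes SMEP through unchanged, which matches the "as long as the guest VCPU is not in paging mode" condition in the fix description.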
Testcase: Set up a Xen host (Intel CPU that supports SMEP) and install an HVM guest (Quantal or later) with more than one VCPU. After boot, /proc/cpuinfo only shows one CPU and dmesg contains "Stuck" messages. With the fix, all CPUs come up.
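The check in the testcase could be scripted inside the guest roughly as follows. This is a sketch under assumptions: the function names are made up for illustration, the "Stuck" pattern is taken from the dmesg excerpt in this report, and the functions work on text passed in by the caller rather than reading the system themselves.

```python
import re
from typing import List


def count_online_cpus(cpuinfo_text: str) -> int:
    """Count 'processor : N' entries in the contents of /proc/cpuinfo."""
    return len(re.findall(r"^processor\s*:", cpuinfo_text, flags=re.MULTILINE))


def stuck_cpus(dmesg_text: str) -> List[int]:
    """Return the CPU numbers reported as 'CPU<N>: Stuck' in dmesg output."""
    return [int(n) for n in re.findall(r"CPU(\d+): Stuck", dmesg_text)]


def check_guest(expected_vcpus: int, cpuinfo_text: str, dmesg_text: str) -> bool:
    """True if all expected VCPUs are online and none was reported stuck."""
    online = count_online_cpus(cpuinfo_text)
    return online == expected_vcpus and not stuck_cpus(dmesg_text)
```

In a guest configured with two VCPUs, one would pass 2 together with the contents of /proc/cpuinfo and the captured output of dmesg; an affected guest returns False, a fixed one True.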
---
Architecture: amd64
Xen version: 4.2.1
While testing I found that when I boot a Xen HVM guest on newer Intel-based systems (maybe starting with Sandy Bridge), none of the additional VCPUs come online:
cpu 1 spinlock event irq 70
Booting Node 0, Processors #1
CPU1: Stuck ??
This does not happen on my AMD Opteron host, nor on a box with an old i7 (one of the first ones that came out). It started with kernels between 3.4 and 3.5-rc1, so Quantal and onwards. I was able to narrow down the range via bisect (unfortunately the kernel does not build within that range):
323f90a xen-acpi-processor: Add missing #include <xen/xen.h>
2ee93ab acpi, bgrd: Add missing <linux/io.h> to drivers/acpi/bgrt.c
638d957 x86, realmode: Change EFER to a single u64 field
1371270 x86, realmode: Move kernel/realmode.c to realmode/init.c
51edbe6 x86, realmode: Move not-common bits out of trampoline_common.S
7960387 x86, realmode: Mask out EFER.LMA when saving trampoline EFER
34d0b02 x86, realmode: Fix no cache bits test in reboot_32.S
0f6f11eb x86, realmode: Make sure all generated files are listed in targets
c5403ae x86, realmode: build fix: remove duplicate build
cda846f x86, realmode: read cr4 and EFER from kernel for 64-bit trampoline
bf8b88e x86, realmode: fixes compilation issue in tboot.c
f2604c1 x86, realmode: move relocs from scripts/ to arch/x86/tools
f37240f x86, realmode: header for trampoline code
c484547 x86, realmode: flattened rm hierachy
b429dbf x86, realmode: don't copy real_mode_header
8e029fc x86, realmode: fix 64-bit wakeup sequence
6feb592 x86, realmode: Fix always-zero test in reboot_32.S
be60828 x86, realmode: Move trampoline_*.S early in the link order
e5684ec x86, realmode: Replace open-coded ljmpw with a macro
968ff9e x86, realmode: Remove indirect jumps in trampoline_32 and wakeup_asm
056a43a x86, realmode: Remove indirect jumps in trampoline_64.S
f7436a9 x86, realmode: Align .data section in trampoline_32.S
Not sure why this only affects certain Intel CPUs; maybe some VMX feature has a side effect on the changes in the realmode code.
description: updated
tags: added: patch
Changed in linux (Ubuntu Precise):
status: New → Won't Fix
Changed in linux (Ubuntu Quantal):
status: New → Won't Fix
Changed in xen (Ubuntu Precise):
status: New → Triaged
Changed in xen (Ubuntu Quantal):
status: New → Triaged
Changed in xen (Ubuntu Precise):
importance: Undecided → High
importance: High → Medium
Changed in xen (Ubuntu Quantal):
importance: Undecided → Medium
Verified that the issue still exists on v3.9-rc3 upstream.