Machine fails to boot if MAAS server is not available
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
MAAS | Invalid | Critical | Unassigned | |
curtin | Fix Released | Critical | Blake Rouse | |
Bug Description
maas 2.1.3+bzr5573-
Machine type: Intel NUC (power type: Intel AMT, UEFI mode)
Steps to reproduce:
1. Deploy Ubuntu 16.04 on the node (Intel NUC) through MAAS (default flat disk layout, default network config).
2. SSH into the deployed node.
3. On the MAAS server, terminate all MAAS services.
4. On the deployed node, do a reboot.
Observed behavior: after the reboot, the machine stalls waiting to PXE boot.
Expected behavior: the machine should boot from its local disk.
Control environment 1:
1. On the same node, install Ubuntu 16.04 from a USB stick (*no* changes to the AMT settings, boot device order, etc.).
2. Reboot the device.
Observed behavior: properly boots from the disk.
Control environment 2:
1. Deploy Ubuntu 16.04 on a KVM/virsh VM through MAAS (default flat disk layout, default network config).
2. SSH into the deployed VM.
3. On the MAAS server, terminate all MAAS services.
4. On the deployed VM, do a reboot.
Observed behavior: the VM boots from its disk, as expected.
Conclusions:
1. PEBCAK: possible, but the behavior was confirmed on multiple different NUCs, by multiple users.
2. Intel's EFI implementation is fragile: possible, but the scenarios above have been tested with two different firmware versions (one about 1.5 years old and mature, the other the latest release, about 2 months old). Most importantly, when Ubuntu is installed from a USB stick, the machine boots as expected.
3. There is something subtle in the way maas, curtin and cloud-init lay down the system that triggers the observed behavior. If someone can test this on a different hw/power type (e.g. IPMI) I believe it would help with the triage.
The issue is consistently reproducible; I am happy to provide any logs and configuration details as needed.
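For triage, it may help to compare the UEFI boot order on a MAAS-deployed node with the boot order on the same node after a USB install. The report does not establish that boot order is the cause, so the following is only a minimal sketch of how that comparison could be made. It parses the text printed by `efibootmgr` (run as root on the node) and reports which entry the firmware would try first. The function names and the sample text in the usage note are hypothetical and were not captured from the affected NUCs.

```python
import re

# Matches entry lines such as "Boot0000* ubuntu"; the "*" marks an active entry.
_ENTRY_RE = re.compile(r"^Boot([0-9A-Fa-f]{4})(\*?)\s+(.*)$")


def parse_efibootmgr(output):
    """Parse the text printed by `efibootmgr` into a small dictionary.

    Returns {"current": str or None, "order": [str], "entries": {num: {...}}}.
    """
    info = {"current": None, "order": [], "entries": {}}
    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith("BootCurrent:"):
            info["current"] = line.split(":", 1)[1].strip()
        elif line.startswith("BootOrder:"):
            value = line.split(":", 1)[1].strip()
            info["order"] = [num for num in value.split(",") if num]
        else:
            match = _ENTRY_RE.match(line)
            if match:
                number, star, rest = match.groups()
                # Verbose output appends the device path after a tab; keep the label only.
                label = rest.split("\t", 1)[0].strip()
                info["entries"][number] = {"active": star == "*", "label": label}
    return info


def first_boot_label(info):
    """Return the label of the first entry in BootOrder, or None if unknown."""
    for number in info["order"]:
        entry = info["entries"].get(number)
        if entry is not None:
            return entry["label"]
    return None
```

On a node, the input could be collected with `subprocess.run(["efibootmgr"], capture_output=True, text=True).stdout` and passed to `parse_efibootmgr`. If `first_boot_label` returns a network (PXE) entry on the MAAS-deployed node but a disk entry after the USB install, that difference would be worth attaching to the bug.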
Output from dpkg -l '*maas*' http://
Related branches
- Ryan Harper (community): Approve
- Server Team CI bot: Approve (continuous-integration)
Diff: 544 lines (+450/-4), 4 files modified
curtin/commands/curthooks.py (+66/-1)
curtin/util.py (+53/-0)
tests/unittests/test_curthooks.py (+261/-3)
tests/unittests/test_util.py (+70/-0)
Changed in maas:
status: New → Incomplete
Changed in maas:
milestone: 2.2.0rc2 → 2.2.0rc3
Changed in maas:
milestone: 2.2.0rc3 → 2.2.1
Changed in maas:
milestone: 2.2.1 → 2.2.0rc4
assignee: nobody → Blake Rouse (blake-rouse)
Changed in maas:
status: Incomplete → Invalid
Changed in maas:
assignee: Blake Rouse (blake-rouse) → nobody
Changed in curtin:
status: New → Triaged
importance: Undecided → Critical
Changed in curtin:
status: Incomplete → Triaged
assignee: nobody → Blake Rouse (blake-rouse)
status: Triaged → In Progress
Changed in curtin:
status: In Progress → Fix Committed
Many thanks to Francisco Hernandez for fact checking the repro steps and replicating the issue!