Node powered off after 'PXE request - power off' during deployment, does not get marked as failed and goes from Deploying to Allocated state
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
MAAS |
Invalid
|
Critical
|
Unassigned |
Bug Description
Seeing this sequence of events where a node starts to deploy:
- Goes through transitions: Ready-> Allocated -> Deploying states
- PXE boots but no curtin install - PXE power off request instead
- is powered down - and remains powered off for the duration of the juju deployment.
- Sometimes later when juju environment is teared down, it actually transitions from Allocated to Ready. The node did not get to deployed nor failed state and somehow transitioned from Deploying to Allocated state (transition to Allocated not shown in the Event logs nor in the maas logs)
We have seen this happen a number of times. Here's one example:
From Events log:
INFO Sun, 01 March 2015 04:45:23 Node changed status — From 'Allocated' to 'Ready'
INFO Sun, 01 March 2015 04:45:23 Powering node off
INFO Sun, 01 March 2015 04:42:21 Node powered off
INFO Sun, 01 March 2015 04:42:20 Powering node off
DEBUG Sun, 01 March 2015 03:28:34 TFTP Request — ubuntu/
INFO Sun, 01 March 2015 03:28:33 PXE Request — power off
DEBUG Sun, 01 March 2015 03:28:33 TFTP Request — ubuntu/
DEBUG Sun, 01 March 2015 03:28:33 TFTP Request — ifcpu64.c32
DEBUG Sun, 01 March 2015 03:28:33 TFTP Request — pxelinux.
INFO Sun, 01 March 2015 03:28:33 PXE Request — power off
DEBUG Sun, 01 March 2015 03:28:33 TFTP Request — pxelinux.
DEBUG Sun, 01 March 2015 03:28:33 TFTP Request — pxelinux.
DEBUG Sun, 01 March 2015 03:28:32 TFTP Request — pxelinux.0
DEBUG Sun, 01 March 2015 03:28:32 TFTP Request — pxelinux.0
INFO Sun, 01 March 2015 03:25:41 Node powered on
INFO Sun, 01 March 2015 03:25:33 Powering node on
INFO Sun, 01 March 2015 03:25:16 Node changed status — From 'Allocated' to 'Deploying'
INFO Sun, 01 March 2015 03:25:14 Node changed status — From 'Ready' to 'Allocated' (to oil-slave-5)
From Maas logs:
grep "Mar 1 0[3|4]" /var/log/
...
Mar 1 03:25:14 maas-trusty-
Mar 1 03:25:32 maas-trusty-
Mar 1 03:25:33 maas-trusty-
Mar 1 03:25:41 maas-trusty-
Mar 1 03:30:18 maas-trusty-
Mar 1 04:42:20 maas-trusty-
Mar 1 04:42:21 maas-trusty-
Mar 1 04:45:23 maas-trusty-
Mar 1 04:45:23 maas-trusty-
Mar 1 04:45:23 maas-trusty-
Mar 1 04:45:25 maas-trusty-
...
From pserv.log:
ubuntu@
2015-03-01 03:28:32+0000 [TFTP (UDP)] Datagram received from ('10.245.0.199', 2070): <RRQDatagram(
2015-03-01 03:28:32+0000 [TFTP (UDP)] Datagram received from ('10.245.0.199', 2070): <RRQDatagram(
2015-03-01 03:28:32+0000 [TFTP (UDP)] Datagram received from ('10.245.0.199', 2071): <RRQDatagram(
2015-03-01 03:28:32+0000 [TFTP (UDP)] Datagram received from ('10.245.0.199', 2071): <RRQDatagram(
2015-03-01 03:28:32+0000 [TFTP (UDP)] Datagram received from ('10.245.0.199', 49152): <RRQDatagram(
2015-03-01 03:28:32+0000 [TFTP (UDP)] Datagram received from ('10.245.0.199', 49152): <RRQDatagram(
2015-03-01 03:28:33+0000 [TFTP (UDP)] Datagram received from ('10.245.0.199', 49153): <RRQDatagram(
2015-03-01 03:28:33+0000 [TFTP (UDP)] Datagram received from ('10.245.0.199', 49153): <RRQDatagram(
2015-03-01 03:28:33+0000 [TFTP (UDP)] Datagram received from ('10.245.0.199', 49153): <RRQDatagram(
2015-03-01 03:28:33+0000 [TFTP (UDP)] Datagram received from ('10.245.0.199', 49153): <RRQDatagram(
2015-03-01 03:28:33+0000 [TFTP (UDP)] Datagram received from ('10.245.0.199', 49154): <RRQDatagram(
2015-03-01 03:28:33+0000 [TFTP (UDP)] Datagram received from ('10.245.0.199', 49154): <RRQDatagram(
2015-03-01 03:28:33+0000 [TFTP (UDP)] Datagram received from ('10.245.0.199', 49155): <RRQDatagram(
2015-03-01 03:28:33+0000 [TFTP (UDP)] Datagram received from ('10.245.0.199', 49155): <RRQDatagram(
2015-03-01 03:28:34+0000 [TFTP (UDP)] Datagram received from ('10.245.0.199', 49156): <RRQDatagram(
2015-03-01 03:28:34+0000 [TFTP (UDP)] Datagram received from ('10.245.0.199', 49156): <RRQDatagram(
<Larry COMMENT>This next entry is for different build<Larry ENDCOMMENT> 2015-03-01 04:51:37+0000 [TFTP (UDP)] Datagram received from ('10.245.0.199', 2070): <RRQDatagram(
...
From console.txt:
2015-03-01 03:26:43,808 [DEBUG] deployer.import: Adding units...
2015-03-01 03:26:45,098 [WARNING] deployer.import: Config specifies num units for subordinate: ceilometer-agent
2015-03-01 03:26:45,099 [DEBUG] deployer.import: Service 'ceph' does not need any more units added.
2015-03-01 03:26:45,100 [DEBUG] deployer.import: Service 'cinder' does not need any more units added.
2015-03-01 03:26:45,102 [DEBUG] deployer.import: Service 'nova-cloud-
2015-03-01 03:26:45,103 [DEBUG] deployer.import: Service 'ceilometer' does not need any more units added.
2015-03-01 03:26:45,104 [DEBUG] deployer.import: Service 'glance' does not need any more units added.
2015-03-01 03:26:45,104 [DEBUG] deployer.import: Service 'heat' does not need any more units added.
2015-03-01 03:26:45,104 [DEBUG] deployer.import: Service 'keystone' does not need any more units added.
2015-03-01 03:26:45,104 [DEBUG] deployer.import: Service 'mongodb' does not need any more units added.
2015-03-01 03:26:45,105 [DEBUG] deployer.import: Service 'mysql' does not need any more units added.
2015-03-01 03:26:45,107 [INFO] deployer.import: Adding 2 more units to nova-compute
2015-03-01 03:26:55,638 [DEBUG] deployer.import: Service 'openstack-
2015-03-01 03:26:55,639 [DEBUG] deployer.import: Service 'rabbitmq-server' does not need any more units added.
2015-03-01 03:26:55,639 [DEBUG] deployer.import: Waiting for units before adding relations
2015-03-01 03:26:59,325 [DEBUG] deployer.env: Delta unit: nova-compute/2 change:pending
2015-03-01 03:26:59,408 [DEBUG] deployer.env: Delta machine: 3 change:pending
2015-03-01 03:27:29,332 [DEBUG] deployer.env: Delta machine: 4 change:pending
2015-03-01 03:27:29,427 [DEBUG] deployer.env: Delta machine: 5 change:pending
2015-03-01 03:28:24,355 [DEBUG] deployer.env: Delta machine: 5 change:pending
2015-03-01 03:29:14,360 [DEBUG] deployer.env: Delta machine: 4 change:pending
2015-03-01 03:29:24,362 [DEBUG] deployer.env: Delta machine: 5 change:pending
2015-03-01 03:32:29,382 [DEBUG] deployer.env: Delta machine: 1 change:pending
2015-03-01 03:32:34,384 [DEBUG] deployer.env: Delta machine: 1 change:pending
2015-03-01 03:32:34,466 [DEBUG] deployer.env: Delta machine: 1 change:started
2015-03-01 03:32:34,466 [DEBUG] deployer.env: Delta unit: ceph/0 change:pending
...
2015-03-01 04:41:56,117 [ERROR] deployer.import: Reached deployment timeout.. exiting
From juju_status.yaml:
'2':
agent-state: pending
containers:
2/lxc/0:
series: trusty
2/lxc/1:
series: trusty
dns-name: euclid.oil
hardware: arch=amd64 cpu-cores=32 mem=49152M tags=hw-
instance-id: /MAAS/api/
series: trusty
Something to note that may or may not be of relevance: juju-deployer does not show log for machine 2 as change:pending in the console.txt but juju_status.yaml does show it as pending.
summary: |
- Node powered off after PXE power off request during deployment, is not - marked as failed and goes from Deploying to Allocated state + Node powered off after 'PXE request - power off' during deployment, does + not get marked as failed and goes from Deploying to Allocated state |
description: | updated |
Changed in maas: | |
milestone: | none → 1.8.0 |
Changed in maas: | |
assignee: | nobody → Ricardo Bánffy (rbanffy) |
assignee: | Ricardo Bánffy (rbanffy) → nobody |
Changed in maas: | |
status: | Incomplete → Triaged |
Changed in maas: | |
importance: | Critical → High |
Changed in maas: | |
importance: | High → Critical |
I see another system with same issue where the change:pending log does show in the console.txt so that last observation is likely a moot point.