cloud-init fails when rebooting EC2 i3.metal instances
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
cloud-init |
Expired
|
High
|
Unassigned |
Bug Description
In order to collect boot-speed metrics I deploy/
1. Deploy an instance and wait for cloud-init
to finish using `cloud-init status --wait`
2. Collect and retrieve some logs via SSH/SFTP
3. Reboot the instance using boto3's reboot()
4. Collect some more logs
5. Terminate the instance
This works in a fairly reliable way, but on i3.metal instances the instance often fails to survive the reboot step. After a failed reboot the instance state appears as "running", but it's unreachable via SSH.
By detaching the root volume and attaching it to another instance in the same availability zone I've been able to inspect the logs, and problem is a cloud-init failure. At a first glance of the logs it looks like cloud-init doesn't like /var/lib/
2019-08-23 11:31:27,585 - util.py[DEBUG]: Reading from /var/lib/
2019-08-23 11:31:27,585 - util.py[DEBUG]: Read 0 bytes from /var/lib/
2019-08-23 11:31:27,585 - util.py[WARNING]: failed stage init-local
2019-08-23 11:31:27,586 - util.py[DEBUG]: failed stage init-local
Traceback (most recent call last):
File "/usr/lib/
ret = functor(name, args)
File "/usr/lib/
_maybe_
File "/usr/lib/
cc_
File "/usr/lib/
prev_hostname = util.load_
File "/usr/lib/
decoded = json.loads(
File "/usr/lib/
return _default_
File "/usr/lib/
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/
raise JSONDecodeError
json.decoder.
I'm not sure on where the actual problem is here. Is set-hostname supposed to always contain something? Should cloud-init be able to handle an empty set-hostname? Could the fact that the instance is rebooted shortly after being deployed affect this?
The full logs are attached.
Changed in cloud-init: | |
status: | Incomplete → New |
Changed in cloud-init: | |
status: | New → Incomplete |
Changed in cloud-init: | |
status: | Expired → Incomplete |
Changed in cloud-init: | |
status: | Incomplete → Expired |
Thanks for submitting this. Is it possible to collect-logs from the initial boot of the instance before the reboot? In particular I'd like to see a cloud-init collect-logs and a tar of /var/lib/cloud before the reboot.