cloud-init fails when rebooting EC2 i3.metal instances

Bug #1841182 reported by Paride Legovini
12
This bug affects 2 people
Affects Status Importance Assigned to Milestone
cloud-init
Expired
High
Unassigned

Bug Description

In order to collect boot-speed metrics I deploy/reboot/terminate several EC2 instances per day. At a high level this is that the jobs do:

1. Deploy an instance and wait for cloud-init
   to finish using `cloud-init status --wait`
2. Collect and retrieve some logs via SSH/SFTP
3. Reboot the instance using boto3's reboot()
4. Collect some more logs
5. Terminate the instance

This works in a fairly reliable way, but on i3.metal instances the instance often fails to survive the reboot step. After a failed reboot the instance state appears as "running", but it's unreachable via SSH.

By detaching the root volume and attaching it to another instance in the same availability zone I've been able to inspect the logs, and problem is a cloud-init failure. At a first glance of the logs it looks like cloud-init doesn't like /var/lib/cloud/data/set-hostname being empty:

2019-08-23 11:31:27,585 - util.py[DEBUG]: Reading from /var/lib/cloud/data/set-hostname (quiet=False)
2019-08-23 11:31:27,585 - util.py[DEBUG]: Read 0 bytes from /var/lib/cloud/data/set-hostname
2019-08-23 11:31:27,585 - util.py[WARNING]: failed stage init-local
2019-08-23 11:31:27,586 - util.py[DEBUG]: failed stage init-local
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 653, in status_wrapper
    ret = functor(name, args)
  File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 361, in main_init
    _maybe_set_hostname(init, stage='local', retry_stage='network')
  File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 709, in _maybe_set_hostname
    cc_set_hostname.handle('set-hostname', init.cfg, cloud, LOG, None)
  File "/usr/lib/python3/dist-packages/cloudinit/config/cc_set_hostname.py", line 67, in handle
    prev_hostname = util.load_json(util.load_file(prev_fn))
  File "/usr/lib/python3/dist-packages/cloudinit/util.py", line 1586, in load_json
    decoded = json.loads(decode_binary(text))
  File "/usr/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

I'm not sure on where the actual problem is here. Is set-hostname supposed to always contain something? Should cloud-init be able to handle an empty set-hostname? Could the fact that the instance is rebooted shortly after being deployed affect this?

The full logs are attached.

Revision history for this message
Paride Legovini (paride) wrote :
Revision history for this message
Ryan Harper (raharper) wrote :

Thanks for submitting this. Is it possible to collect-logs from the initial boot of the instance before the reboot? In particular I'd like to see a cloud-init collect-logs and a tar of /var/lib/cloud before the reboot.

Changed in cloud-init:
importance: Undecided → High
status: New → Incomplete
Revision history for this message
Paride Legovini (paride) wrote : Re: [Bug 1841182] Re: cloud-init fails when rebooting EC2 i3.metal instances

Ryan Harper wrote on 28/08/2019:
> Thanks for submitting this. Is it possible to collect-logs from the
> initial boot of the instance before the reboot? In particular I'd like
> to see a cloud-init collect-logs and a tar of /var/lib/cloud before the
> reboot.

Here are the first-boot logs from an instance that when rebooted hit the
problem. The post-reboot-console-log-fail.log file is the console log
collected after the reboot and that shows the cloud-init error message.
It has been collected using `aws ec2 get-console-output`.

Paride

Paride Legovini (paride)
Changed in cloud-init:
status: Incomplete → New
Revision history for this message
Dan Watkins (oddbloke) wrote :

OK, it looks like /var/lib/cloud/data/set-hostname ends up empty after the reboot, which is what causes the issue. We should handle that case gracefully, assuming that it isn't cloud-init that's causing it to be removed.

To help confirm what's going on, are you able to capture /var/lib/cloud and logs post-failure (perhaps by attaching the root disk elsewhere)?

Changed in cloud-init:
status: New → Incomplete
Revision history for this message
Paride Legovini (paride) wrote :

Dan Watkins wrote on 03/09/2019:
> OK, it looks like /var/lib/cloud/data/set-hostname ends up empty after
> the reboot, which is what causes the issue. We should handle that case
> gracefully, assuming that it isn't cloud-init that's causing it to be
> removed.
>
> To help confirm what's going on, are you able to capture /var/lib/cloud
> and logs post-failure (perhaps by attaching the root disk elsewhere)?

Here are the first-boot and post-reboot logs collected from an instance
that encountered the problem.

Changed in cloud-init:
status: Incomplete → New
Revision history for this message
Dan Watkins (oddbloke) wrote :

OK, to clarify my previous comment, the issue is not that /var/lib/cloud/data/set-hostname is _missing_, the issue is specifically that it is present but empty. Unfortunately, we don't have any logging from the code path used to write set-hostname (it uses cloudinit.atomic_helper.write_json which uses atomic_helper.write_file), so we can't be 100% sure that we aren't incorrectly writing out this empty file.

So, I think this breaks down into three things:

1) We should improve the logging situation there (just filed bug 1843276)
2) We should work out why the file ends up empty in this case
3) We should gracefully treat an empty file as an absent file (probably with a WARNING?), because I think that's the best we can do once we find ourselves in that situation

(2) and (3) should _probably_ be separate bugs, but without (1) it's not 100% clear that that's the case so I haven't filed anything separate yet.

Revision history for this message
Dan Watkins (oddbloke) wrote :

Paride, are you able to test this again using the cloud-init from the daily PPA[0]? That has some additional logging added, which should make working out (2) easier.

[0] https://launchpad.net/~cloud-init-dev/+archive/ubuntu/daily

Dan Watkins (oddbloke)
Changed in cloud-init:
status: New → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for cloud-init because there has been no activity for 60 days.]

Changed in cloud-init:
status: Incomplete → Expired
Paride Legovini (paride)
Changed in cloud-init:
status: Expired → Incomplete
Changed in cloud-init:
status: Incomplete → Expired
Revision history for this message
Paride Legovini (paride) wrote :

This didn't happen again in quite some time, not sure what fixed it. I'll reopen this bug report if I encounter the issue again.

Revision history for this message
xiaoyi chen (xiachen-rh) wrote :

Hi Dan Watkins,
we got a similar issue that cloud-init fails when /var/lib/cloud/data/set-hostname is present but empty during reboot. We are doing further research that why the file ends up empty in our case, meanwhile, will cloud-init handle the exception as you commented(https://bugs.launchpad.net/cloud-init/+bug/1841182/comments/6)?

Our problem details are here,
https://bugzilla.redhat.com/show_bug.cgi?id=2140893

Revision history for this message
James Falcon (falcojr) wrote :

Xiaoyi, that bugzilla bug isn't public, so we can't see the details. We'll need the resulting tarball from 'cloud-init collect-logs' (with --include-userdata if there's no sensitive userdata) in order to try to debug this issue. Also note that this bug was filed 3 years ago against a version of cloud-init that is 3 years old. Can the problem be reproduced on the current version of cloud-init?

Revision history for this message
Emanuele Esposito (esposem) wrote :

The BZ is now public. However, there is only one suggestion that is private that might help solving the problem, so I am pasting it here:

```
Firstly, IMO, empty '/var/lib/cloud/data/set-hostname' is indeed an exception/case that cloud-init should handle properly than leave the service failed directly.

Back to this issue, hard reboot during boot or trigger sysrq b immediately when system boot up. The content of set-hostname lost after system boot up.
It seems that the content did not synced to disk when system reset.

Here is the instructions of sysrq b.

``b`` Will immediately reboot the system without syncing or unmounting
            your disks.

Below information shows that set-hostname already has content before reset, but it is 0 after boot again.
2022-11-08 01:49:17,759 - util.py[DEBUG]: Reading from /var/lib/cloud/data/set-hostname (quiet=False)
2022-11-08 01:49:17,759 - util.py[DEBUG]: Read 88 bytes from /var/lib/cloud/data/set-hostname
2022-11-08 01:49:17,759 - cc_set_hostname.py[DEBUG]: No hostname changes. Skipping set-hostname
2022-11-08 02:33:58,585 - util.py[DEBUG]: Reading from /var/lib/cloud/data/set-hostname (quiet=False)
2022-11-08 02:33:58,585 - util.py[DEBUG]: Read 0 bytes from /var/lib/cloud/data/set-hostname

Not sure if turn off buffering is helpful to mitigate this, but I did not reproduce this issue after 10+ retry times.

# diff -u /usr/lib/python3.9/site-packages/cloudinit/atomic_helper.py /usr/lib/python3.9/site-packages/cloudinit/atomic_helper.py.orig
--- /usr/lib/python3.9/site-packages/cloudinit/atomic_helper.py 2022-11-08 08:17:56.262679073 +0000
+++ /usr/lib/python3.9/site-packages/cloudinit/atomic_helper.py.orig 2022-11-08 08:17:03.722890860 +0000
@@ -25,7 +25,7 @@
     tf = None
     try:
         tf = tempfile.NamedTemporaryFile(
- dir=os.path.dirname(filename), delete=False, mode=omode, buffering=0
+ dir=os.path.dirname(filename), delete=False, mode=omode
         )
```

Revision history for this message
James Falcon (falcojr) wrote :
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.