cloud-init

Bug #2008952
Comment #18

Chad Smith (chad.smith) wrote on 2023-03-14 (last edit on 2023-03-15):

Thank you all for your attention on this bug.

Sorry, my earlier comment on this bug was ill-informed and incorrect. I can reproduce this through server installer qemu/kvm-based installs too, so I can confirm that this isn't/wasn't NetworkManager and systemd-networkd fighting over management of the device, because server ISOs don't have network-manager installed.

Also, I am concerned about ordering cloud-init.service specifically after systemd-resolved.service on all deployments: that would affect every boot, delaying it on systemd-resolved's DNS setup, when only specific use cases, such as NoCloudNet with an FQDN in the kernel cmdline directive, may need that service to be active.
## Update: from testing in comment #20: systemd-resolved.service doesn't seem to add cost to the underlying boot; the ordering just moves the resolved service earlier. But even though resolved is "up" and active, it can't resolve anything until NetworkManager-wait-online.service completes and registers a connected NIC.

Some other datasources, like GCP, do rely on DNS resolution of the instance metadata service, but cloud images inject a config into /etc/hosts to resolve that locally in the absence of active DNS in early boot. Ec2 also defines instance-data:8773 as a potential fallback IMDS endpoint, but both IPv4 and IPv6 endpoints are defined earlier in the search order, so in practical deployments we never reach that DNS lookup.

## Update per comment #20: retries will work for systemd-networkd managed systems because systemd-networkd-wait-online.service is ordered After=network-pre.target and Before=sysinit.target. Retries won't currently work for NetworkManager because NetworkManager is After=dbus.service, which is After=sysinit.target.

We may be able to avoid the cost of a strict `After=systemd-resolved.service` clause in cloud-init.service by adding sensible retries to the NoCloud datasource, using the following logic:

1. Check if the seed URL's `netloc` is an IP address. If it is an IP, do not retry on failure.

2a. When the seed URL is non-IP, retry only when the specific 'network resolution error' URLError is raised, up to X times for that failure mode.

- or -

2b. When the seed URL is non-IP, invoke socket.getaddrinfo to validate DNS resolution before attempting to download metadata; if the host is not resolvable, retry only as long as systemd-resolved.service isn't yet active.
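
A rough sketch of steps 1 and 2b, to make the proposal concrete. This is not cloud-init code: `netloc_is_ip` and `wait_for_dns` are hypothetical helper names, and the bounded `retries`/`delay` loop is a simplifying assumption that stands in for "retry while systemd-resolved.service isn't yet active".

```python
import ipaddress
import socket
import time
from urllib.parse import urlsplit


def netloc_is_ip(url):
    """Return True when the URL's host is a literal IPv4/IPv6 address."""
    # .hostname drops any port and the brackets around an IPv6 literal.
    host = urlsplit(url).hostname
    if not host:
        return False
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return False
    return True


def wait_for_dns(url, retries=5, delay=1.0,
                 resolver=socket.getaddrinfo, sleep=time.sleep):
    """Return True once the seed URL's host resolves, False otherwise.

    Hypothetical helper: retries/delay are illustrative values only.
    """
    if netloc_is_ip(url):
        return True  # step 1: nothing to resolve, so no retries
    host = urlsplit(url).hostname
    if not host:
        return False
    for attempt in range(retries):
        try:
            resolver(host, None)  # step 2b: validate DNS before fetching
            return True
        except socket.gaierror:
            # Name resolution failed; wait before trying again unless
            # this was the last allowed attempt.
            if attempt < retries - 1:
                sleep(delay)
    return False
```

The datasource would only attempt to download metadata from the seed URL once `wait_for_dns(seed_url)` returns True. For option 2a, the equivalent would be catching URLError around the download itself and retrying only when the underlying reason is a resolution failure.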

These retry approaches should allow us to avoid impacting typical boots on most systems, yet still support DNS-based needs for datasource detection in early boot if FQDN is used for IMDS.
