Staged boot, to fix integration of systemd generators

Bug #1892851 reported by Lukas Märdian
This bug affects 1 person
Affects              Status        Importance  Assigned to  Milestone
Netplan              Fix Released  Undecided   Unassigned
cloud-init           Invalid       Undecided   Unassigned
netplan.io (Ubuntu)  Fix Released  Undecided   Unassigned

Bug Description

[Intro]
Cloud-init makes use of the "netplan" systemd generator, but calls "netplan generate" manually at runtime, while the initial systemd boot transaction is already executing, instead of letting it run as intended at the systemd generator stage (e.g. via "systemctl daemon-reload"). It does so because of restrictions on when it can fetch its data source (e.g. the netplan YAML config).

[Problem]
This leads to problems at first boot. The systemd unit dependencies are calculated after the generator stage (e.g. via "systemctl daemon-reload") but ahead of the boot transaction, so new service units and their dependencies that are generated by calling systemd generators manually are ignored during the first-boot transaction. In subsequent boots (where the cloud-init data source, netplan YAML config and unit files are already in place), everything works as expected.

It is a tricky situation, as cloud-init
 1/ does not yet have the full config needed to run the systemd generators (e.g. netplan YAML) before the systemd boot transaction starts. It first needs to fetch it via a DataSource, possibly via a network connection.
 2/ cannot execute the generators manually (e.g. "netplan generate") during the systemd boot transaction, because this way the newly generated service units and corresponding dependencies will be ignored.
 3/ cannot re-execute the systemd generators after the initial boot transaction, as it is already too late at this point and applications expect to have a readily configured network setup after cloud-final.target has been reached.
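
To illustrate the timing issue described above, here is a toy model in Python. It is not how systemd is implemented; the function plan_transaction and the dependency wiring between the units are assumptions made purely for illustration. It only shows that a transaction planned from a snapshot of the known units does not pick up a unit that is written afterwards.

```python
# Toy model only: a "transaction" is planned once, from the units that are
# known at that moment. This is NOT systemd's real algorithm.

def plan_transaction(units, target):
    """Return the set of units pulled in by `target` (hypothetical helper)."""
    planned, stack = set(), [target]
    while stack:
        name = stack.pop()
        if name in planned or name not in units:
            continue
        planned.add(name)
        stack.extend(units[name])
    return planned

# Units known when the first-boot transaction is planned (illustrative wiring).
units = {
    "default.target": ["cloud-final.target", "systemd-networkd.service"],
    "cloud-final.target": [],
    "systemd-networkd.service": [],
}
first_boot = plan_transaction(units, "default.target")

# Later, during that transaction, "netplan generate" is called manually and
# writes a new unit that should be pulled in alongside networkd.
units["netplan-ovs-ovs0.service"] = []
units["systemd-networkd.service"] = ["netplan-ovs-ovs0.service"]

# The already planned transaction does not contain the new unit ...
print("netplan-ovs-ovs0.service" in first_boot)   # False

# ... but a transaction planned afterwards (e.g. on the next boot) does.
second_boot = plan_transaction(units, "default.target")
print("netplan-ovs-ovs0.service" in second_boot)  # True
```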

[References]
Such problems have been reported and discussed for WiFi on Raspberry Pi (LP: #1870346) and for Open vSwitch setups in MAAS (https://github.com/CanonicalLtd/netplan/pull/157), where some of the generated service units/dependencies (netplan-ovs-*.service or netplan-wpa-*.service, possibly SR-IOV units as well...) are not properly executed on first boot.

[Suggestion]
A possible solution I discussed with @xnox would be to re-engineer how cloud-init targets work a bit, by splitting up the cloud-init boot sequence into multiple stages, e.g.:

* Start "Stage 0" systemd transaction: systemctl isolate cloud-stage0.target
  - execute the init local modules
  - setup basic networking (DHCP on eth0/ens3)
  - fetch data source & place netplan YAML in /etc/netplan/
* Finish "Stage 0" transaction
* Call systemctl daemon-reload
  - This will trigger all systemd generators (incl. netplan generate) and re-calculate all dependencies
* Start "Stage 1" systemd transaction: systemctl isolate default.target
  - execute all the normal cloud-init modules and start all the normal services, e.g. via cloud-final.target
* Finish "Stage 1" transaction
* System is now fully booted

The idea here is to split up the boot sequence into two (or more?) systemd transactions, so we can call "systemctl daemon-reload" in between (but not within a running systemd transaction) to re-run all the generators and re-calculate all the dependencies. This way all generators would be used in their intended way and should work as expected, even on first boot.
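
The proposed sequence can be sketched as a small Python driver. This is only a sketch of the idea, not an existing cloud-init feature: cloud-stage0.target is the hypothetical target named in the proposal, the function names are made up for illustration, and the default is a dry run so nothing is executed.

```python
import subprocess

def staged_boot_commands(stage0_target="cloud-stage0.target",
                         final_target="default.target"):
    """Build the command sequence of the proposed two-stage boot.

    "cloud-stage0.target" does not exist today; it is the hypothetical
    target from the proposal.
    """
    return [
        # Stage 0: init-local modules, basic networking, fetch the data source.
        ["systemctl", "isolate", stage0_target],
        # Between transactions: re-run all generators, recalculate dependencies.
        ["systemctl", "daemon-reload"],
        # Stage 1: the normal boot, including cloud-final.target.
        ["systemctl", "isolate", final_target],
    ]

def run_staged_boot(dry_run=True):
    """Run (or, by default, only print) the staged boot sequence."""
    executed = []
    for cmd in staged_boot_commands():
        if dry_run:
            print("would run:", " ".join(cmd))
        else:
            # systemctl waits for the job to finish unless --no-block is
            # given, so each stage completes before the next command starts.
            subprocess.run(cmd, check=True)
        executed.append(cmd)
    return executed

if __name__ == "__main__":
    run_staged_boot(dry_run=True)
```

The key point is the ordering: the daemon-reload sits between the two isolate calls, never inside a running transaction.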

Doing that would also allow users to do interesting things with systemd via cloud-config, for example changing the default.target from multi-user.target to emergency.target, adding, masking or removing units used in early boot, or "just writing fstab" and letting systemd-fstab-generator process it and mount things, etc.

### Config used to reproduce the problem in a LXD container:
"systemctl status netplan-ovs-ovs0.service" will show that this unit has not been executed on first boot.

config:
  user.network-config: |
    # cloud-config
    version: 2
    bridges:
      ovs0:
        addresses: [10.10.10.20/24]
        interfaces: [eth0.21]
        parameters:
          stp: false
        openvswitch: {}
    ethernets:
      eth0:
        addresses: [10.10.10.30/24]
    vlans:
      eth0.21:
        id: 21
        link: eth0
description: My OVS debugging profile
devices:
  eth0:
    name: eth0
    network: lxdbr0
    type: nic
  root:
    path: /
    pool: default
    type: disk
name: myovs

Changed in cloud-init:
status: New → Confirmed
Lukas Märdian (slyon) wrote :

We found a fix for this problem in netplan itself.
The overall idea of a staged-boot environment should probably still be considered for a future release of cloud-init.

https://github.com/CanonicalLtd/netplan/pull/162

Dimitri John Ledkov (xnox) wrote :

@slyon

I have discussed multi-transaction boot with systemd upstream and with the cloud-init developers.

Overall, it's an expensive operation that may make the boot slower and may have unintended consequences that will be harder to debug.

If more needs to add units during boot arise, IMHO we should do something similar to what was done in netplan: simply start/add the units to the current transaction whenever possible, as that is quick.
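
A rough Python sketch of that alternative, under stated assumptions: it is not the code from the netplan fix. The helper name, the unit directory /run/systemd/system and the glob patterns are assumptions for illustration. It collects freshly generated units and asks systemd to start them with "systemctl start --no-block", which enqueues the jobs without waiting for them. It defaults to a dry run.

```python
import glob
import os
import subprocess

def start_generated_units(unit_dir="/run/systemd/system",
                          patterns=("netplan-ovs-*.service",
                                    "netplan-wpa-*.service"),
                          dry_run=True):
    """Enqueue start jobs for generated units (hypothetical helper).

    Returns the systemctl command, or None if no matching unit was found.
    """
    names = sorted(
        os.path.basename(path)
        for pattern in patterns
        for path in glob.glob(os.path.join(unit_dir, pattern))
    )
    if not names:
        return None
    # --no-block: enqueue the jobs and return without waiting for them, so
    # the caller does not stall while the boot is still in progress.
    cmd = ["systemctl", "start", "--no-block", *names]
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd

if __name__ == "__main__":
    print(start_generated_units(dry_run=True))
```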

Changed in cloud-init:
status: Confirmed → Invalid
Changed in netplan:
status: New → Fix Committed
Dimitri John Ledkov (xnox) wrote :

I think it's best to keep cloud-init task as Invalid for now.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package netplan.io - 0.100-0ubuntu4

---------------
netplan.io (0.100-0ubuntu4) groovy; urgency=medium

  * debian/tests/cloud-init
    - Improve reboot test to avoid failure on arm64

 -- Lukas Märdian <email address hidden> Mon, 21 Sep 2020 12:23:02 +0200

Changed in netplan.io (Ubuntu):
status: New → Fix Released
Lukas Märdian (slyon)
Changed in netplan:
status: Fix Committed → Fix Released
James Falcon (falcojr) wrote :