Unable to SSH Into Instance when deploying Impish 21.10

Bug #1938299 reported by Sean Feole
This bug affects 3 people
Affects                      Status        Importance  Assigned to  Milestone
cloud-init (Ubuntu)          Fix Released  Critical    Unassigned
  Bionic                     Fix Released  Undecided   Unassigned
  Focal                      Fix Released  Undecided   Unassigned
  Hirsute                    Fix Released  Undecided   Unassigned
  Impish                     Fix Released  Critical    Unassigned
google-guest-agent (Ubuntu)  Fix Released  Medium      Unassigned
  Bionic                     Fix Released  Undecided   Unassigned
  Focal                      Fix Released  Undecided   Unassigned
  Hirsute                    Won't Fix     Undecided   Unassigned
  Impish                     Won't Fix     Medium      Unassigned
netplan.io (Ubuntu)          Fix Released  Undecided   Unassigned
  Bionic                     Won't Fix     Undecided   Unassigned
  Focal                      Won't Fix     Undecided   Unassigned
  Hirsute                    Won't Fix     Undecided   Unassigned
  Impish                     Won't Fix     Undecided   Unassigned

Bug Description

=== Begin SRU Template ===
[Impact]
In PR #919 (81299de), we refactored some of the code used to bring up networks across distros. Previously, the call to bring up network interfaces during 'init' stage unintentionally resulted in a no-op such that network interfaces were NEVER brought up by cloud-init, even if new network interfaces were found after crawling the metadata.

In #919, the code was altered to bring up these discovered network interfaces. On Ubuntu, this results in a 'netplan apply' call during 'init' stage for any Ubuntu-based distro on a datasource that has a NETWORK dependency. On GCE, this additional 'netplan apply' conflicts with the google-guest-agent service, shutting that service down due to that project's PartOf= systemd relationship and resulting in an instance that cannot be connected to.

To fix this, we added a new 'disable_network_activation' option that can be set to true in /etc/cloud/cloud.cfg.d/*.cfg by image creators to disable the activation of network interfaces in 'init' stage. This will avoid the 'netplan apply' call on GCE instances.

[Test Case]
An integration test has been added at `tests/integration_tests/datasources/test_network_dependency.py` to test this functionality. To test manually:

1. Launch an instance on GCE
2. Install the cloud-init version with the fix
3. Add a file, '/etc/cloud/cloud.cfg.d/99-disable-network-activation.cfg' with the contents:
disable_network_activation: true

4. Run cloud-init clean --logs
5. Create a new image based on this instance
6. Launch a new instance based on the new image
7. The instance should launch successfully and be reachable over ssh
8. "['netplan', 'apply']" should not be present anywhere in /var/log/cloud-init.log.
9. "Bringing up newly configured network interfaces" should not exist anywhere in /var/log/cloud-init.log

In the failure case, we will fail at step 7.
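
Steps 8 and 9 above can be checked mechanically. The following is a minimal sketch, not part of cloud-init or its integration test: the function names are hypothetical, and the only things taken from this bug are the two log strings and the log path /var/log/cloud-init.log.

```python
from typing import Dict

# The two strings that manual test steps 8 and 9 say must be absent.
FORBIDDEN_MARKERS = (
    "['netplan', 'apply']",
    "Bringing up newly configured network interfaces",
)


def find_activation_markers(log_text: str) -> Dict[str, int]:
    """Count how often each marker occurs in the given log text."""
    return {marker: log_text.count(marker) for marker in FORBIDDEN_MARKERS}


def activation_was_skipped(log_text: str) -> bool:
    """Return True when neither marker appears, i.e. steps 8 and 9 pass."""
    return all(count == 0 for count in find_activation_markers(log_text).values())
```

On a test instance one would read /var/log/cloud-init.log into a string and pass it to activation_was_skipped(); a False result points at the failure case described above.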

[Regression Potential]
The code in question determines whether to bring up interfaces after applying network config. Accidentally not doing this should not be a problem, as we previously (unintentionally) did not bring these interfaces up. Accidentally bringing up interfaces when we shouldn't also generally shouldn't cause a large problem outside of GCE, because outside of GCE there aren't (that we're aware of) other processes independently setting up the network. If this determination code somehow fails, it happens early enough in boot that it could leave an instance unusable. However, the code is small enough and defensive enough that we don't believe that is a possibility.

Additionally, any cloud datasource that is discovered in `init-local` stage (Azure, Ec2, Hetzner, IBMCloud, OpenStack and Oracle) isn't exposed to this code path, because full network config is emitted before system network is brought up, so there is no need to call `netplan apply` at that time.

[Other Info]
Github PR: https://github.com/canonical/cloud-init/pull/1048
Upstream commit: https://github.com/canonical/cloud-init/commit/9c147e8341e287366790e60658f646cdcc59bef2

=== End SRU Template ===
Original bug report:

Google Instances deployed with the Ubuntu 21.10 Daily images are inaccessible via SSH.

gcloud compute instances create sf-impish-v20200720 --zone us-west1-a --network "default" --no-restart-on-failure --image-project ubuntu-os-cloud-devel --image daily-ubuntu-2110-impish-v20210720 --machine-type n1-standard-2

This results in a successful deploy, yet the instance is inaccessible via ssh from the end user's configured laptop.

This appears to affect all daily images after 20210719.

daily-ubuntu-2110-impish-v20210719 ubuntu-os-cloud-devel ubuntu-2110 READY
daily-ubuntu-2110-impish-v20210720 ubuntu-os-cloud-devel ubuntu-2110 READY
daily-ubuntu-2110-impish-v20210721 ubuntu-os-cloud-devel ubuntu-2110 READY
daily-ubuntu-2110-impish-v20210723 ubuntu-os-cloud-devel ubuntu-2110 READY
daily-ubuntu-2110-impish-v20210724 ubuntu-os-cloud-devel ubuntu-2110 READY
daily-ubuntu-2110-impish-v20210725 ubuntu-os-cloud-devel ubuntu-2110 READY
daily-ubuntu-2110-impish-v20210728 ubuntu-os-cloud-devel ubuntu-2110

This problem also appears to be reproducible via the gcloud UI: create a new virtual machine using daily-ubuntu-2110-impish-v20210720 or later and instruct the virtual machine to import an ssh_pub_key in the security tab. The instance will start, yet still be inaccessible via the user's private ssh key.

The google-guest-agent.service appears to be responsible for adding the Google project ssh keys to the instance once it's deployed. Please see below its status when queried on the 20210719 image:

 google-guest-agent.service - Google Compute Engine Guest Agent
     Loaded: loaded (/lib/systemd/system/google-guest-agent.service; enabled; vendor preset: enabled)
     Active: active (running) since Tue 2021-07-27 19:47:48 UTC; 18h ago
   Main PID: 711 (google_guest_ag)
      Tasks: 9 (limit: 8924)
     Memory: 19.7M
     CGroup: /system.slice/google-guest-agent.service
             └─711 /usr/bin/google_guest_agent

Jul 27 19:47:55 sean-imp gpasswd[1469]: user google added by root to group floppy
Jul 27 19:47:55 sean-imp gpasswd[1475]: user google added by root to group audio
Jul 27 19:47:55 sean-imp gpasswd[1481]: user google added by root to group dip
Jul 27 19:47:55 sean-imp gpasswd[1487]: user google added by root to group video
Jul 27 19:47:55 sean-imp gpasswd[1493]: user google added by root to group plugdev
Jul 27 19:47:55 sean-imp gpasswd[1499]: user google added by root to group netdev
Jul 27 19:47:55 sean-imp gpasswd[1505]: user google added by root to group lxd
Jul 27 19:47:55 sean-imp gpasswd[1511]: user google added by root to group google-sudoers
Jul 27 19:47:55 sean-imp GCEGuestAgent[711]: 2021-07-27T19:47:55.1699Z GCEGuestAgent Info: Updating keys for user google.
Jul 27 19:47:55 sean-imp google_guest_agent[711]: 2021/07/27 19:47:55 logging client: rpc error: code = PermissionDenied desc = Clo>

Joshua Powers (powersj)
description: updated
summary: - Unable to SSH Into Instance when deploying Impish 12.10
+ Unable to SSH Into Instance when deploying Impish 21.10
Francis Ginther (fginther) wrote :

[Summary]
I believe this problem is related to a change in behavior in cloud-init version 21.2-43-g184c836a-0ubuntu1, based on comparing two different daily impish images from before and after this update.

The problem appears in a gcp account which currently has multiple global sshkeys associated with different users. For example, we have keys for a 'testuser' and 'testuser2' account. When booting the older serial (as well as older releases), we see accounts created for 'testuser', 'testuser2' and 'ubuntu'. For the newer serial, we only see an account for 'ubuntu'. As our test automation uses one of the 'testuser' keys, it can no longer access impish VMs.

I've included the package list below from my two test systems. Including the google-agent packages since these could also be suspect.

[Expected behavior]
If I have a gcp account with global ssh keys associated with non-ubuntu users, I expect those users to be present in the VM after launch and .ssh/authorized_keys updated with those public keys.

[Current behavior]
Only the 'ubuntu' user is being created
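
The difference between expected and current behavior can be expressed as a small check over the contents of /etc/passwd. This is only an illustrative sketch: the helper names are hypothetical, and the account names 'testuser', 'testuser2' and 'ubuntu' are the ones mentioned in the summary above.

```python
from typing import Iterable, List


def passwd_users(passwd_text: str) -> List[str]:
    """Return the login names found in passwd-formatted text."""
    users = []
    for line in passwd_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The login name is the first colon-separated field.
        users.append(line.split(":", 1)[0])
    return users


def missing_users(passwd_text: str, expected: Iterable[str]) -> List[str]:
    """Return the expected accounts that are absent, in the order given."""
    present = set(passwd_users(passwd_text))
    return [user for user in expected if user not in present]
```

On an affected image, missing_users() called with the three account names above would be expected to return the two 'testuser' accounts; on a working image it would return an empty list.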

[Package list with unexpected behavior - impish 20210728 serial]
$ dpkg -l|grep cloud-init
ii cloud-init 21.2-43-g184c836a-0ubuntu1 all initialization and customization tool for cloud instances
ii cloud-initramfs-copymods 0.47ubuntu1 all copy initramfs modules into root filesystem for later use
ii cloud-initramfs-dyn-netconf 0.47ubuntu1 all write a network interface file in /run for BOOTIF
$ dpkg -l |grep agent
ii google-guest-agent 20210414.00-0ubuntu1 amd64 Google Compute Engine Guest Agent
ii google-osconfig-agent 20210219.00-0ubuntu1 amd64 Google OS Config Agent
ii gpg-agent 2.2.20-1ubuntu4 amd64 GNU privacy guard - cryptographic agent
ii libpolkit-agent-1-0:amd64 0.105-31 amd64 PolicyKit Authentication Agent API
ii lxd-agent-loader 0.4 all LXD - VM agent loader

[Package list with expected behavior - impish 20210719 serial]
$ dpkg -l|grep cloud-init
ii cloud-init 21.2-3-g899bfaa9-0ubuntu2 all initialization and customization tool for cloud instances
ii cloud-initramfs-copymods 0.47ubuntu1 all copy initramfs modules into root filesystem for later use
ii cloud-initramfs-dyn-netconf 0.47ubuntu1 all write a network interface file in /run for BOOTIF
$ dpkg -l |grep agent
ii google-guest-agent 20210414.00-0ubuntu1 amd64 Google Compute Engine Guest Agent
ii google-osconfig-agent 20210219.00-0ubuntu1 amd64 Google OS Config Agent
ii gpg-agent 2.2.20-1ubuntu4 amd64 GNU privacy guard - cryptographic agent
ii libpolkit-agent-1-0:amd64 0.105-31 amd64 PolicyKit Authentication Agent API
ii lxd-agent-loader 0.4 ...

Chad Smith (chad.smith) wrote :

Thanks for this bug Sean Feole and the additional reproducer context fginther. If either of you have cloud-init collect-logs tar.gz from that affected system it'd give us a firm grasp on what fell over here. I'll try reproducing this tomorrow, but I can confirm that launching said image from the command line provided does result in an inaccessible instance for me.

Chad Smith (chad.smith) wrote :

Ok, so on working images (v20210719 or earlier), google-guest-agent.service is running. On 20210720 or later that service is not included in the systemd boot targets for some reason.

I've checked for system ordering dependency issues which would have disabled the g-g-a.service on the "broken" images and there is nothing at fault there. For some reason it's enabled but not participating in the overall boot target. More triage needed here.

# working images have GGA.service enabled and running
ubuntu@working-image:~$ cat /etc/cloud/build.info
build_name: server
serial: 20210719
ubuntu@working-image:~$ sudo systemctl status google-guest-agent.service
● google-guest-agent.service - Google Compute Engine Guest Agent
     Loaded: loaded (/lib/systemd/system/google-guest-agent.service; enabled; vendor preset: enabled)
     Active: active (running) since Fri 2021-08-06 19:10:20 UTC; 27min ago
   Main PID: 717 (google_guest_ag)
      Tasks: 9 (limit: 8924)
     Memory: 19.1M
     CGroup: /system.slice/google-guest-agent.service
             └─717 /usr/bin/google_guest_agent

Aug 06 19:10:26 SRU-worked-gce gpasswd[1323]: user csmith added by root to group dip
Aug 06 19:10:26 SRU-worked-gce gpasswd[1329]: user csmith added by root to group video
Aug 06 19:10:26 SRU-worked-gce gpasswd[1335]: user csmith added by root to group plugdev
Aug 06 19:10:26 SRU-worked-gce gpasswd[1341]: user csmith added by root to group netdev
Aug 06 19:10:26 SRU-worked-gce gpasswd[1347]: user csmith added by root to group lxd
Aug 06 19:10:26 SRU-worked-gce gpasswd[1353]: user csmith added by root to group google-sudoers
Aug 06 19:10:26 SRU-worked-gce GCEGuestAgent[717]: 2021-08-06T19:10:26.2934Z GCEGuestAgent Info: Updating keys for user csmith.
Aug 06 19:10:26 SRU-worked-gce GCEGuestAgent[717]: 2021-08-06T19:10:26.2937Z GCEGuestAgent Info: Adding existing user ubuntu to google-sudoers group.
Aug 06 19:10:26 SRU-worked-gce gpasswd[1359]: user ubuntu added by root to group google-sudoers
Aug 06 19:10:26 SRU-worked-gce GCEGuestAgent[717]: 2021-08-06T19:10:26.3087Z GCEGuestAgent Info: Updating keys for user ubuntu.

# disabled GGA.service on July20 and later
ubuntu@broken-google-guest-image-unit:~$ cat /etc/cloud/build.info
build_name: server
serial: 20210720
ubuntu@broken-google-guest-image-unit:~$ systemctl status google-guest-agent.service
○ google-guest-agent.service - Google Compute Engine Guest Agent
     Loaded: loaded (/lib/systemd/system/google-guest-agent.service; enabled; vendor preset: enabled)
     Active: inactive (dead)

Chad Smith (chad.smith) wrote :

A note as well, a 2nd reboot of these Hirsute "broken" images will end up properly activating google-guest-agent across that reboot. So, for some reason the unit appears installed and enabled but not running.

# after reboot
$ systemctl status google-guest-agent.service
● google-guest-agent.service - Google Compute Engine Guest Agent
     Loaded: loaded (/lib/systemd/system/google-guest-agent.service; enabled; vendor preset: enabled)
     Active: active (running) since Fri 2021-08-06 20:35:26 UTC; 17min ago
   Main PID: 626 (google_guest_ag)
      Tasks: 9 (limit: 8924)
     Memory: 20.1M
     CGroup: /system.slice/google-guest-agent.service
             └─626 /usr/bin/google_guest_agent

Aug 06 20:35:26 SRU-worked-gce GCEGuestAgent[626]: 2021-08-06T20:35:26.9717Z GCEGuestAgent Info: Created google sudoers file
Aug 06 20:35:26 SRU-worked-gce GCEGuestAgent[626]: 2021-08-06T20:35:26.9720Z GCEGuestAgent Info: Creating user powersj.
Aug 06 20:35:27 SRU-worked-gce GCEGuestAgent[626]: 2021-08-06T20:35:27.3118Z GCEGuestAgent Info: Updating keys for user powersj.
Aug 06 20:35:27 SRU-worked-gce GCEGuestAgent[626]: 2021-08-06T20:35:27.3121Z GCEGuestAgent Info: Creating user csmith.
Aug 06 20:35:27 SRU-worked-gce gpasswd[946]: user csmith added by root to group lxd
Aug 06 20:35:27 SRU-worked-gce gpasswd[953]: user csmith added by root to group google-sudoers
Aug 06 20:35:27 SRU-worked-gce GCEGuestAgent[626]: 2021-08-06T20:35:27.5706Z GCEGuestAgent Info: Updating keys for user csmith.
Aug 06 20:35:27 SRU-worked-gce GCEGuestAgent[626]: 2021-08-06T20:35:27.5709Z GCEGuestAgent Info: Adding existing user ubuntu to google-sudoers group.
Aug 06 20:35:27 SRU-worked-gce gpasswd[959]: user ubuntu added by root to group google-sudoers
Aug 06 20:35:27 SRU-worked-gce GCEGuestAgent[626]: 2021-08-06T20:35:27.5859Z GCEGuestAgent Info: Updating keys for user ubuntu.

Chad Smith (chad.smith) wrote :

I found no Ordering cycle breaks that would imply that the google-guest-agent got "unloaded" from the primary boot target objectives in journalctl.

Attached is the journalctl from a machine that was only launched and rebooted once. After the reboot you can egrep 'GCE|gpasswd' to see the operations of google-guest-agent finally start to occur.

James Falcon (falcojr) wrote :

Chad and I investigated this issue together. To summarize, the google-guest-agent.service appears to be properly enabled with the proper systemd symlinks, but it is not running on first boot. We can find no logs anywhere about an attempt to start the service or about any error with the service. Upon reboot, google-guest-agent.service runs as expected. We can't point to any cloud-init changes that would impact this behavior.

Given this, I'm moving this to Incomplete for cloud-init. If there's more reason to believe this is an issue caused by cloud-init, please change status back to New.

Changed in cloud-init (Ubuntu):
status: New → Incomplete
Pat Viafore (patviafore)
tags: added: rls-ii-incoming
tags: added: fr-1631
tags: removed: rls-ii-incoming
Norbert (nrbrtx)
tags: added: impish
Chad Smith (chad.smith) wrote :

This is the same symptom as a GCP issue raised.

The following commit[1] introduced this change in behavior for cloud-init by correcting a bug in the behavior of cloud-init which didn't apply networking changes in cloud-init "init" stage after writing files. This change shipped in cloud-init 21.3.

google-guest-agent and cloud-init race to set up networking on a newly booting VM: cloud-init 21.2-43-g184c836a-0ubuntu1 introduced a change that now calls "netplan apply" after cloud-init writes fallback network configuration files which request dhcp on the primary nic.

We were able to reproduce this same issue with latest images rolled out to GCP containing cloud-init 21.3. The `netplan apply` called by cloud-init does disrupt the dhclient calls that google-guest-agent uses to create the initial network config and sets up google project users and keys on the VM https://github.com/GoogleCloudPlatform/guest-agent#network

Cloud-init team is working this issue as top-priority.
Workarounds for this issue are:
1. rebooting the instance after launch, which allows google-guest-agent.service to come back up without cloud-init interaction

2. launching a VM with updated cloud-init 21.3 with the following user-data:

cat > cloudinit_start_google_guest_agent.yaml <<EOF
#cloud-config
runcmd:
 - systemctl start google-guest-agent.service
EOF

root@publishing-f:~# gcloud compute instances create sf2-impish-v20200720 --zone us-west1-a --network "default" --no-restart-on-failure --image-project ubuntu-os-cloud-devel --image daily-ubuntu-2110-impish-v20210720 --machine-type n1-standard-2 --metadata-from-file user-data=cloudinit_start_google_guest_agent.yaml

# see ssh working on that image launch
gcloud compute ssh sf2-impish-v20200720 --zone us-west1-a
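
For test automation, the same workaround launch can be assembled programmatically. The sketch below only builds the argument list; it does not invoke gcloud. The function name is hypothetical, while the flags, zone, image project and machine type are copied from the command above.

```python
import shlex
from typing import List

# Cloud-config user-data from the workaround: start the agent once
# cloud-init reaches its runcmd stage.
USER_DATA = """#cloud-config
runcmd:
 - systemctl start google-guest-agent.service
"""


def build_create_command(instance: str, image: str, user_data_path: str) -> List[str]:
    """Build the gcloud argument list used in the workaround above."""
    return [
        "gcloud", "compute", "instances", "create", instance,
        "--zone", "us-west1-a",
        "--network", "default",
        "--no-restart-on-failure",
        "--image-project", "ubuntu-os-cloud-devel",
        "--image", image,
        "--machine-type", "n1-standard-2",
        "--metadata-from-file", f"user-data={user_data_path}",
    ]


command = build_create_command(
    "sf2-impish-v20200720",
    "daily-ubuntu-2110-impish-v20210720",
    "cloudinit_start_google_guest_agent.yaml",
)
print(shlex.join(command))
```

Writing USER_DATA to the named file and passing the list to subprocess.run() would reproduce the manual steps; keeping the command as a list avoids shell quoting problems.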

Changed in cloud-init (Ubuntu Impish):
status: Incomplete → Triaged
Changed in google-guest-agent (Ubuntu Impish):
status: New → In Progress
Changed in cloud-init (Ubuntu Impish):
importance: Undecided → Critical
Changed in google-guest-agent (Ubuntu Impish):
importance: Undecided → Critical
Chad Smith (chad.smith) wrote :

For this google-guest-agent task, the daemon itself should probably be a bit more resilient in the face of a `netplan apply` which could be invoked by any admin manipulating network configuration and will teardown any dhclient connections resulting in a dead google-guest-agent.service from that point on. So, I'd leave the google-guest-agent task here, but probably not mark it critical.

Changed in google-guest-agent (Ubuntu Impish):
importance: Critical → Medium
Changed in cloud-init (Ubuntu Impish):
status: Triaged → In Progress
Changed in google-guest-agent (Ubuntu Impish):
status: In Progress → New
Chad Smith (chad.smith) wrote :

Upstream PR with a fix for this behavior on GCP https://github.com/canonical/cloud-init/pull/1048

James Falcon (falcojr)
description: updated
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cloud-init - 21.3-1-g6803368d-0ubuntu3

---------------
cloud-init (21.3-1-g6803368d-0ubuntu3) impish; urgency=medium

  * cherry-pick 9c147e83: Allow disabling of network activation (SC-307)
    (#1048) (LP: #1938299)
  * cherry-pick 612e3908: Add connectivity_url to Oracle's
    EphemeralDHCPv4 (#988) (LP: #1939603)
  * cherry-pick dc227869: Set Azure to apply networking config every BOOT
    (#1023)

 -- James Falcon <email address hidden> Thu, 07 Oct 2021 11:43:55 -0500

Changed in cloud-init (Ubuntu Impish):
status: In Progress → Fix Released
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in cloud-init (Ubuntu Bionic):
status: New → Confirmed
Changed in cloud-init (Ubuntu Focal):
status: New → Confirmed
Changed in cloud-init (Ubuntu Hirsute):
status: New → Confirmed
Changed in google-guest-agent (Ubuntu Bionic):
status: New → Confirmed
Changed in google-guest-agent (Ubuntu Focal):
status: New → Confirmed
Changed in google-guest-agent (Ubuntu Hirsute):
status: New → Confirmed
Changed in google-guest-agent (Ubuntu):
status: New → Confirmed
Joe Slagel (slagelwa) wrote :

I seem to have encountered the same or similar issue in ubuntu focal/20.04 when using the image ubuntu-minimal-2004-focal-v20210928 on GCP. Creating a new custom GCP image using this as the base image and running:

apt-get update
apt-get upgrade -y

created a custom image that exhibited the same issues with GCP OS login. Rebooting the VM also resolved the ssh connection issues. Placing a hold ('apt-mark hold cloud-init') on the cloud-init package before running the upgrade also seemed to resolve the issue.

Timo Aaltonen (tjaalton) wrote : Please test proposed package

Hello Sean, or anyone else affected,

Accepted cloud-init into hirsute-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/cloud-init/21.3-1-g6803368d-0ubuntu1~21.04.4 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-hirsute to verification-done-hirsute. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-hirsute. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in cloud-init (Ubuntu Hirsute):
status: Confirmed → Fix Committed
tags: added: verification-needed verification-needed-hirsute
Changed in cloud-init (Ubuntu Focal):
status: Confirmed → Fix Committed
tags: added: verification-needed-focal
Timo Aaltonen (tjaalton) wrote :

Hello Sean, or anyone else affected,

Accepted cloud-init into focal-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/cloud-init/21.3-1-g6803368d-0ubuntu1~20.04.4 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-focal to verification-done-focal. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-focal. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in cloud-init (Ubuntu Bionic):
status: Confirmed → Fix Committed
tags: added: verification-needed-bionic
Timo Aaltonen (tjaalton) wrote :

Hello Sean, or anyone else affected,

Accepted cloud-init into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/cloud-init/21.3-1-g6803368d-0ubuntu1~18.04.4 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

James Falcon (falcojr) wrote :

Attached file integration_test_results.tar.gz. These tests show the issue fixed in Bionic, Focal, and Hirsute.

Utkarsh Gupta (utkarsh) wrote :

Thank you, James, for doing the testing.

tags: added: verification-done-bionic verification-done-focal verification-done-hirsute
removed: verification-needed-bionic verification-needed-focal verification-needed-hirsute
Liam Hopkins (liamh-google) wrote :

Just to add some info on guest agent here:

the guest agent does not set up the primary interface

there should be no race between guest agent and cloud-init for the primary interface

the guest agent does not start any dhclient process for primary interface, and should not care if any dhclient process on the system is killed

so a number of comments in this bug such as 'killing dhclient leaves guest agent dead' are not true

Chad Smith (chad.smith) wrote :

To clarify the actual root cause here and reflect it back to this original bug:

google-guest-agent defines a `PartOf=` relationship with systemd-networkd.service[1]. This relationship means that if systemd-networkd.service is stopped, google-guest-agent.service gets stopped. When systemd-networkd.service is restarted, so is google-guest-agent.

But if systemd-networkd.service is subsequently started after a previous stop call, google-guest-agent is left in stopped state. The call `netplan apply` (emitted by cloud-init after writing network config) in fact calls systemctl stop systemd-networkd.service and follows it with a 'start' instead of directly invoking systemctl restart systemd-networkd.service[2]. This leaves google-guest-agent in stopped state indefinitely.

I'm not entirely sure netplan can fix this issue due to some other cleanup they are doing between networkd stop and start, but I have reflected this bug to netplan.io folks and we'll see what the consensus is about whether this can be resolved with instrumenting a "systemctl restart" instead of separate "systemctl stop" and "systemctl start" calls.

References:
[1] https://github.com/GoogleCloudPlatform/guest-agent/blob/main/google-guest-agent.service#L13
[2] https://git.launchpad.net/ubuntu/+source/netplan.io/tree/netplan/cli/commands/apply.py?h=applied/ubuntu/devel#n169
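
The PartOf= behavior described above can be illustrated with a toy model. This is a deliberately simplified sketch and not systemd code: ToyUnit and ToyManager are hypothetical names, and the model assumes only that stop and restart propagate to units declaring PartOf=, while start does not.

```python
from dataclasses import dataclass, field
from typing import Dict, List

NETWORKD = "systemd-networkd.service"
AGENT = "google-guest-agent.service"


@dataclass
class ToyUnit:
    name: str
    active: bool = True
    part_of: List[str] = field(default_factory=list)


class ToyManager:
    """Toy model of PartOf= propagation (assumption: not real systemd logic)."""

    def __init__(self, units: List[ToyUnit]) -> None:
        self.units: Dict[str, ToyUnit] = {unit.name: unit for unit in units}

    def _bound_units(self, name: str) -> List[ToyUnit]:
        return [unit for unit in self.units.values() if name in unit.part_of]

    def start(self, name: str) -> None:
        # A plain start is NOT propagated to units that are PartOf= this one.
        self.units[name].active = True

    def stop(self, name: str) -> None:
        self.units[name].active = False
        for unit in self._bound_units(name):
            self.stop(unit.name)

    def restart(self, name: str) -> None:
        # Restart is propagated to bound units that were running.
        running = [unit.name for unit in self._bound_units(name) if unit.active]
        self.units[name].active = True
        for bound in running:
            self.restart(bound)


def agent_active_after(actions: List[str]) -> bool:
    """Apply the given actions to networkd and report the agent's state."""
    manager = ToyManager([ToyUnit(NETWORKD), ToyUnit(AGENT, part_of=[NETWORKD])])
    for action in actions:
        getattr(manager, action)(NETWORKD)
    return manager.units[AGENT].active
```

In this model, agent_active_after(["stop", "start"]) mirrors what `netplan apply` does and leaves the agent stopped, while agent_active_after(["restart"]) leaves it running.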

Lukas Märdian (slyon) wrote :

From a netplan POV this has already been changed upstream as part of https://github.com/canonical/netplan/pull/200

I.e. the "systemctl stop/start systemd-networkd.service" have been replaced by "networkctl reload/reconfigure" calls instead. Due to an incompatibility with systemd v247 that change had to be reverted in the netplan.io Impish package, but going forward netplan will not be calling "systemctl stop/start systemd-networkd.service" anymore.

Chad Smith (chad.smith) wrote :

Thanks Lukas.

Is the expectation then that only releases later than Impish will support this functionality?

If that is the case, we will continue with instrumenting a workaround in cloud-init that avoids calling `netplan apply` on Google, so this symptom is not triggered, until we can ensure google-guest-agent.service defines a RequiredBy= relationship to ensure that the guest-agent gets started when `netplan apply` calls `systemctl start systemd-networkd.service`.

Chad Smith (chad.smith) wrote :

Confirmed that Impish daily image builds on GCP are unblocked and include the necessary fix to prevent `netplan apply` from being invoked by cloud-init. This is a stop gap solution until we can publish a release of google-guest-agent.service which declares a "RequiredBy=" or "WantedBy=" relationship with systemd-networkd.service to ensure google-guest-agent starts anytime after networkd starts.

Verified success on GCP daily images: Ivan released daily Google image builds for Impish with cloud-config updates, which contain cloud-init 21.3-1-g6803368d-0ubuntu3 and an /etc/cloud/cloud.cfg.d/*.cfg file that sets "disable_network_activation: true".

$ gcloud compute instances create sf-impish --zone us-west1-a --network "default" --no-restart-on-failure --image-project ubuntu-os-cloud-devel --image daily-ubuntu-2110-impish-v20211012 --machine-type n1-standard-2

$ gcloud compute ssh --zone us-west1-a sf-impish

root@sf-impish:~# systemctl status google-guest-agent
● google-guest-agent.service - Google Compute Engine Guest Agent
     Loaded: loaded (/lib/systemd/system/google-guest-agent.service; enabled; v>
     Active: active (running) since Tue 2021-10-12 20:34:43 UTC; 7min ago
   Main PID: 731 (google_guest_ag)
      Tasks: 9 (limit: 8923)
     Memory: 19.6M
        CPU: 533ms
     CGroup: /system.slice/google-guest-agent.service
             └─731 /usr/bin/google_guest_agent

Oct 12 20:34:49 sf-impish gpasswd[1201]: user XY added by root to group di>
Oct 12 20:34:49 sf-impish GCEGuestAgent[731]: 2021-10-12T20:34:49.4968Z GCEG....

Given that we can ssh into the VM with `gcloud compute ssh` on initial launch, the problem no longer exists. Also, systemctl shows a healthy google-guest-agent.service.

Lukas Märdian (slyon) wrote :

Yes indeed only releases later than Impish will support this functionality in netplan as it depends on some fixes in systemd v249.

Changed in netplan.io (Ubuntu Bionic):
status: New → Won't Fix
Changed in netplan.io (Ubuntu Focal):
status: New → Won't Fix
Changed in netplan.io (Ubuntu Hirsute):
status: New → Won't Fix
Changed in netplan.io (Ubuntu Impish):
status: New → Won't Fix
Chad Smith (chad.smith) wrote :

From Google internal communication on this bug: "I did a manual test of this fix and it seems to work as expected. We'll also get signal from the daily images when they include this change."

Chad Smith (chad.smith)
description: updated
James Falcon (falcojr)
description: updated
James Falcon (falcojr)
description: updated
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cloud-init - 21.3-1-g6803368d-0ubuntu1~21.04.4

---------------
cloud-init (21.3-1-g6803368d-0ubuntu1~21.04.4) hirsute; urgency=medium

  * cherry-pick 9c147e83: Allow disabling of network activation (SC-307)
    (#1048) (LP: #1938299)

 -- James Falcon <email address hidden> Thu, 07 Oct 2021 11:48:53 -0500

Changed in cloud-init (Ubuntu Hirsute):
status: Fix Committed → Fix Released
Brian Murray (brian-murray) wrote : Update Released

The verification of the Stable Release Update for cloud-init has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cloud-init - 21.3-1-g6803368d-0ubuntu1~20.04.4

---------------
cloud-init (21.3-1-g6803368d-0ubuntu1~20.04.4) focal; urgency=medium

  * cherry-pick 9c147e83: Allow disabling of network activation (SC-307)
    (#1048) (LP: #1938299)

 -- James Falcon <email address hidden> Thu, 07 Oct 2021 11:51:28 -0500

Changed in cloud-init (Ubuntu Focal):
status: Fix Committed → Fix Released
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cloud-init - 21.3-1-g6803368d-0ubuntu1~18.04.4

---------------
cloud-init (21.3-1-g6803368d-0ubuntu1~18.04.4) bionic; urgency=medium

  * cherry-pick 9c147e83: Allow disabling of network activation (SC-307)
    (#1048) (LP: #1938299)

 -- James Falcon <email address hidden> Thu, 07 Oct 2021 11:53:34 -0500

Changed in cloud-init (Ubuntu Bionic):
status: Fix Committed → Fix Released
Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in netplan.io (Ubuntu):
status: New → Confirmed
Revision history for this message
Brian Murray (brian-murray) wrote :

netplan.io (0.103-0ubuntu10) jammy; urgency=medium

  * update gbp.conf branch
  * Drop d/p/0002-Revert-cli-apply-reload-reconfigure-networkd-instead.patch
    This is not needed anymore with systemd v249
  * Refresh d/p/0006-netplan-set-make-it-possible-to-unset-a-whole-devtyp.patch
  * Add d/p/0012-test-bridge-base-give-bridge-some-more-time-to-reach.patch
    To fix flaky test_bridge_anonymous autopkgtest (upstream c6ad8e6)
  * Upstream cherry-picks for snapd dbus config set-try-apply integration fixes
    - dbus-wait-for-netplan-try-to-be-ready-LP-1949893-245.patch (LP: #1949893)
    - get-set-ignore-empty-YAML-hints-and-delete-files-on-.patch (LP: #1946957)

 -- Lukas Märdian <email address hidden> Mon, 29 Nov 2021 17:14:32 +0100

Changed in netplan.io (Ubuntu):
status: Confirmed → Fix Released
Revision history for this message
Brian Murray (brian-murray) wrote :

The Hirsute Hippo has reached End of Life, so this bug will not be fixed for that release.

Changed in google-guest-agent (Ubuntu Hirsute):
status: Confirmed → Won't Fix
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package google-guest-agent - 20220104.00-0ubuntu1

---------------
google-guest-agent (20220104.00-0ubuntu1) jammy; urgency=medium

  * New upstream version 20220104.00. (LP: #1959392)
    - Use IP address for calling the metadata server. (#116)
    - Debug logging. (#122)
    - Support enable-oslogin-sk key. (#120)
    - New integ test. (#124)
    - Restore line. (#127)
    - Correct linux startup script order. (#135)
    - Add WantedBy network dependencies to google-guest-agent
      service. (#136) (LP: #1938299)
    - Don't open ssh tempfile exclusively. (#137)
    - Enable ipv6 on secondary interfaces (#133)
    - Enforce script ordering. (#138)
    - Handle comm errors in script runner. (#140)
    - Integration test: test create and remove google user. (#128)
    - Integration tests: instance setup. (#143)
    - Don't duplicate logs. (#146)
    - Add malformed ssh key unit test. (#142)
    - Add or remove route integration test, utils. (#147)
    - List IPv6 routes. (#150)

 -- Utkarsh Gupta <email address hidden> Fri, 28 Jan 2022 17:36:04 +0530

Changed in google-guest-agent (Ubuntu):
status: Confirmed → Fix Released
Revision history for this message
Brian Murray (brian-murray) wrote :

Ubuntu 21.10 (Impish Indri) has reached end of life, so this bug will not be fixed for that specific release.

Changed in google-guest-agent (Ubuntu Impish):
status: Confirmed → Won't Fix
Revision history for this message
Andrew Cloke (andrew-cloke) wrote :

Note that the google-guest-agent part of this issue will be fixed for bionic and focal once the updates pocket has been upgraded to v20220622, which is being tracked in LP #1959392.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package google-guest-agent - 20220622.00-0ubuntu2~20.04.0

---------------
google-guest-agent (20220622.00-0ubuntu2~20.04.0) focal; urgency=medium

  * No-change rebuild for Focal. (LP: #1980725)

google-guest-agent (20220622.00-0ubuntu2) kinetic; urgency=medium

  * d/rules: don't build google_authorized_keys as it
    conflicts with the same binary shipped by
    src:google-compute-engine-oslogin. (LP: #1980725)

google-guest-agent (20220622.00-0ubuntu1~20.04.0) focal; urgency=medium

  * No-change rebuild for Focal. (LP: #1959392)

 -- Utkarsh Gupta <email address hidden> Tue, 05 Jul 2022 17:08:17 +0530

Changed in google-guest-agent (Ubuntu Focal):
status: Confirmed → Fix Released
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package google-guest-agent - 20220622.00-0ubuntu2~18.04.0

---------------
google-guest-agent (20220622.00-0ubuntu2~18.04.0) bionic; urgency=medium

  * Rebuild for Bionic.
    - Set GO111MODULE to off to avoid internet usage.

google-guest-agent (20220622.00-0ubuntu2) kinetic; urgency=medium

  * d/rules: don't build google_authorized_keys as it
    conflicts with the same binary shipped by
    src:google-compute-engine-oslogin. (LP: #1980725)

google-guest-agent (20220622.00-0ubuntu1~18.04.0) bionic; urgency=medium

  * No-change rebuild for Bionic. (LP: #1959392)

 -- Utkarsh Gupta <email address hidden> Tue, 05 Jul 2022 17:30:39 +0530

Changed in google-guest-agent (Ubuntu Bionic):
status: Confirmed → Fix Released