Merge autopkgtest-cloud:more-fail-strings into autopkgtest-cloud:master

Proposed by Brian Murray
Status: Merged
Merge reported by: Brian Murray
Merged at revision: 88b314fa4018c70cfb455938ef2d5f7e73538b05
Proposed branch: autopkgtest-cloud:more-fail-strings
Merge into: autopkgtest-cloud:master
Diff against target: 14 lines (+2/-1)
1 file modified
charms/focal/autopkgtest-cloud-worker/autopkgtest-cloud/worker/worker (+2/-1)
Reviewer Review Type Date Requested Status
Paride Legovini Approve
Iain Lane Approve
Review via email: mp+428883@code.launchpad.net

Description of the change

The systemd autopkgtests installs the new version systemd, reboots and then tries to 'apt-get source systemd'. Getting the source will fail if name resolution is broken with the new version of systemd and this ended up causing the autopkgtest-cloud worker code to continually retry the tests. This resolves the issue by considering a failure to extract the source code a regression. Then if the test fails in the same way 3 times in a row it will be marked as a failure.

I cowboy'ed this change in staging and ran it with the systemd package version from https://launchpad.net/~enr0n/+archive/ubuntu/systemd-broken. Some of the test log (with the change ) follows:

Aug 24 20:52:04 juju-806ee7-stg-proposed-migration-59 /home/ubuntu/autopkgtest-cloud/worker/worker[1154912]: WARNING: Saw ERROR: testbed failure: rules extract failed with exit code 100 in log, which is a sign of a real (not tmp) failure - seen 3 so far
...

Creating nova instance adt-kinetic-amd64-systemd-20220824-202028-juju-806ee7-stg-proposed-migration-59 from image adt/ubuntu-kinetic-amd64-server-20220824.img (UUID fdf4719c-6dfc-4e5e-8184-7e8b448f215f)...
                                                                                                             autopkgtest [20:52:04]: ERROR: testbed failure: rules extract failed with exit code 100 (apt failure)
Aug 24 20:52:04 juju-806ee7-stg-proposed-migration-59 /home/ubuntu/autopkgtest-cloud/worker/worker[1154912]: WARNING: Three fails in a row - considering this a failure rather than tmpfail
Aug 24 20:52:04 juju-806ee7-stg-proposed-migration-59 /home/ubuntu/autopkgtest-cloud/worker/worker[1154912]: INFO: autopkgtest exited with code 4
Aug 24 20:52:05 juju-806ee7-stg-proposed-migration-59 /home/ubuntu/autopkgtest-cloud/worker/worker[1154912]: WARNING: Code 4 returned and no testpkg-version - returning "unknown" for systemd
Aug 24 20:52:05 juju-806ee7-stg-proposed-migration-59 /home/ubuntu/autopkgtest-cloud/worker/worker[1154912]: INFO: Putting results into swift autopkgtest-kinetic-enr0n-systemd-broken kinetic/amd64/s/systemd/20220824_205205_5b7f6@

To post a comment you must log in.
Revision history for this message
Brian Murray (brian-murray) wrote :

Running the autopkgtest with --shell-fail these are the actual errors seen:

Err:1 https://ppa.launchpadcontent.net/enr0n/systemd-broken/ubuntu kinetic/main systemd 251.4-1ubuntu1~ppa6 (tar)
  Temporary failure resolving 'ppa.launchpadcontent.net'
Err:2 https://ppa.launchpadcontent.net/enr0n/systemd-broken/ubuntu kinetic/main systemd 251.4-1ubuntu1~ppa6 (diff)
  Temporary failure resolving 'ppa.launchpadcontent.net'
Err:3 https://ppa.launchpadcontent.net/enr0n/systemd-broken/ubuntu kinetic/main systemd 251.4-1ubuntu1~ppa6 (dsc)
  Temporary failure resolving 'ppa.launchpadcontent.net'
E: Failed to fetch https://ppa.launchpadcontent.net/enr0n/systemd-broken/ubuntu/pool/main/s/systemd/systemd_251.4.orig.tar.gz Temporary failure resolving 'ppa.launchpadcontent.net'
E: Failed to fetch https://ppa.launchpadcontent.net/enr0n/systemd-broken/ubuntu/pool/main/s/systemd/systemd_251.4-1ubuntu1%7eppa6.debian.tar.xz Temporary failure resolving 'ppa.launchpadcontent.net'
E: Failed to fetch https://ppa.launchpadcontent.net/enr0n/systemd-broken/ubuntu/pool/main/s/systemd/systemd_251.4-1ubuntu1%7eppa6.dsc Temporary failure resolving 'ppa.launchpadcontent.net'
E: Failed to fetch some archives.

Which makes we wonder if we should update the 'Cannot initiate the connection to ppa.launchpad.net:80' string, not because it'll help here but because the domain name has changed.

autopkgtest-cloud:more-fail-strings updated
88b314f... by Brian Murray

Be more specific in our string

Revision history for this message
Iain Lane (laney) wrote :

> Which makes we wonder if we should update the 'Cannot initiate the connection to ppa.launchpad.net:80' string, not because it'll help here but because the domain name has changed.

👍

review: Approve
Revision history for this message
Paride Legovini (paride) wrote :

The change is OK but I'm not entirely happy with it as FAIL_PKG_STRINGS already has a "Temporary failure resolving" entry, see:

{'systemd*': ['timed out waiting for testbed to reboot',
              'Timed out on waiting for ssh connection',
              'Temporary failure resolving',
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^

which is exactly the error which appears in the log excerpt that Brian posted a couple of comments above:

Err:1 https://ppa.launchpadcontent.net/enr0n/systemd-broken/ubuntu kinetic/main systemd 251.4-1ubuntu1~ppa6 (tar)
  Temporary failure resolving 'ppa.launchpadcontent.net'

so this failure case should be already covered. I dug up where that line was added and here is it:

https://git.launchpad.net/autopkgtest-cloud/commit/?id=625889eb9b6a7a8075e20635df4a6c59fb935eb0

note the commit message:

  In case systemd itself breaks DNS;
  we don't want this to be a tmpfail loop.

Iain was trying to avoid the same retry loop we're trying to tackle here, but for some reason that's not working, and I can't see why.

(Also as I understand it FAIL_PKG_STRINGS is meant to cover *specific* breakage caused by the package (see the comment that describes its purpose), while here we're adding a somehow catchall apt error to the list, which is not specific to systemd.)

I don't want to block merging, but if you can think of why 'Temporary failure resolving' doesn't count as a non-temporary failure for systemd then let's fix this in a more proper way.

review: Needs Information
Revision history for this message
Iain Lane (laney) wrote :

It's a fair comment, I guess 'rules extract failed' is kind of a suspect thing to see and maybe there's some more context beforehand.

Perhaps we can see a full log of a failure - is there a link handy (probably not if there's a tmpfail loop)? Or if you're able to run it manually *not* with --shell-fail, save that output and share the log.

Revision history for this message
Brian Murray (brian-murray) wrote :

I ran autopkgtest for the version of systemd with the qemu backend:

autopkgtest --apt-upgrade systemd --add-apt-source="deb https://ppa.launchpadcontent.net/enr0n/systemd-broken/ubuntu kinetic main" --add-apt-source="deb-src https://ppa.launchpadcontent.net/enr0n/systemd-broken/ubuntu kinetic main" -o /tmp/systemd-broken -- qemu --ram-size=
4096 --cpus=4 /srv/vms/autopkgtest-kinetic-amd64.img

and the log file does not contain any lines regarding failure to resolve:

autopkgtest [10:17:56]: rebooting testbed after setup commands that affected boot
autopkgtest [10:18:11]: testbed running kernel: Linux 5.15.0-27-generic #28-Ubuntu SMP Thu Apr 14 04:55:28 UTC 2022
autopkgtest [10:18:12]: testbed dpkg architecture: amd64
autopkgtest [10:18:14]: @@@@@@@@@@@@@@@@@@@@ apt-source systemd
qemu-system-x86_64: terminating on signal 15 from pid 983677 (/usr/bin/python3)
autopkgtest [10:21:21]: ERROR: testbed failure: rules extract failed with exit code 100 (apt failure)

Perhaps autopkgtest is not logging as much it did in the past or the --debug switch was being used?

Revision history for this message
Brian Murray (brian-murray) wrote :

Running autopkgtest with --debug I do see "Temporary failure resolving 'ppa.launchpadcontent.net'"

Revision history for this message
Paride Legovini (paride) wrote (last edit ):

For posterity: we didn't fully understand why in the past the "Temporary failure resolving" error showed up in the logs without --debug, and hence the infinite retry loop was avoided by having that string in FAIL_PKG_STRINGS (assuming that [1] worked as intended in the first place!). This is tracked in LP: #1988223.

In the end we decided the most sensible thing to do is just merging this change.

[1] https://git.launchpad.net/autopkgtest-cloud/commit/?id=625889eb9b6a7a8075e20635df4a6c59fb935eb0

review: Approve
Revision history for this message
Iain Lane (laney) wrote :

Probably not a terrible idea to pass `--debug` all the time to be honest. It can be really hard or time consuming to debug autopkgtest failures and maybe having all the available info up front would help with that.

I think those changes did work at the time; I would merge them and make sure that the infinite retries stopped happening on their own (as a result of the new code). I wouldn't delete things from the queue manually in this kind of case.

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1diff --git a/charms/focal/autopkgtest-cloud-worker/autopkgtest-cloud/worker/worker b/charms/focal/autopkgtest-cloud-worker/autopkgtest-cloud/worker/worker
2index 343a461..8a05766 100755
3--- a/charms/focal/autopkgtest-cloud-worker/autopkgtest-cloud/worker/worker
4+++ b/charms/focal/autopkgtest-cloud-worker/autopkgtest-cloud/worker/worker
5@@ -112,7 +112,8 @@ FAIL_PKG_STRINGS = {'systemd*': ['timed out waiting for testbed to reboot',
6 'Timed out on waiting for ssh connection',
7 'Temporary failure resolving',
8 'VirtSubproc.Timeout',
9- 'ERROR: testbed failure: testbed auxverb failed with exit code 255'],
10+ 'ERROR: testbed failure: testbed auxverb failed with exit code 255',
11+ 'ERROR: testbed failure: rules extract failed with exit code 100 (apt failure)'],
12 'linux-*': ['timed out waiting for testbed to reboot',
13 'Timed out on waiting for ssh connection',
14 'ERROR: testbed failure: testbed auxverb failed',

Subscribers

People subscribed via source and target branches