lp:~andersson123/autopkgtest-cloud

Owned by Tim Andersson
Get this repository:
git clone https://git.launchpad.net/~andersson123/autopkgtest-cloud

Branches

Name Last Modified Last Commit
api-key-howto 2024-07-25 16:56:05 UTC
docs: detail how to create an API key

Author: Tim Andersson
Author Date: 2024-07-22 13:40:50 UTC

docs: detail how to create an API key

Currently, the docs only mention **how** an individual can use an API
key - there is no mention or guideline regarding how an admin should
create an API key for a requested user, group or bot.

This commit amends the issue by adding a new section which details how
an `autopkgtest-cloud` admin should create an API key.

s-n-r-prepend-series-version-to-uuid 2024-07-25 15:52:25 UTC
fix: seed-new-release: modify uuid before copying testinfo.json

Author: Tim Andersson
Author Date: 2024-07-25 13:54:58 UTC

fix: seed-new-release: modify uuid before copying testinfo.json

We realised recently that the uniqueness of the uuid column in the
database was causing issues when running seed-new-release. The summary
of the problem is as follows:

- seed-new-release copies swift objects from the old to new container
    - the two objects, in testinfo.json within result.tar have the
      same uuid
- we then run download-all-results for the new release
    - the way the results are entered into the db is with:
      INSERT OR REPLACE
    - the uuid column is unique, meaning the database entry for the old
      release gets REPLACE'd.
    - We no longer have the result for the old release in the db
    - If one were to then run download-all-results for the old release,
      the results from the new release would then get removed from the
      database also.

This commit amends the issue by prepending the uuid in testinfo.json
with the series version of the old release before uploading it to the
container for the new release, preserving the database entries for both
the old and new releases.
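
The failure mode described above can be reproduced in a minimal sketch
with an in-memory sqlite database. The table schema, the example uuid,
the `prefixed_uuid` helper and the exact prefix format are assumptions
made for illustration, not the real autopkgtest-cloud code:
```python
import sqlite3


def prefixed_uuid(series_version: str, uuid: str) -> str:
    """Prepend a series version to a uuid (the exact format is an assumption)."""
    return f"{series_version}-{uuid}"


def insert_result(db: sqlite3.Connection, uuid: str, release: str) -> None:
    # INSERT OR REPLACE deletes any existing row that collides on a UNIQUE column.
    db.execute(
        "INSERT OR REPLACE INTO result (uuid, release) VALUES (?, ?)",
        (uuid, release),
    )


def demo(copy_uuid_unchanged: bool) -> list:
    """Return the releases left in the db after inserting old and new results."""
    db = sqlite3.connect(":memory:")
    # Hypothetical, simplified schema: only the UNIQUE uuid column matters here.
    db.execute("CREATE TABLE result (uuid TEXT UNIQUE, release TEXT)")
    old_uuid = "1234-abcd"
    insert_result(db, old_uuid, "noble")
    if copy_uuid_unchanged:
        new_uuid = old_uuid
    else:
        new_uuid = prefixed_uuid("24.04", old_uuid)
    insert_result(db, new_uuid, "oracular")
    return sorted(row[0] for row in db.execute("SELECT release FROM result"))
```

With the uuid copied unchanged, only the new release's row survives;
with the prefixed uuid, both rows are kept.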

dump-service-bundle-move-to-terraform 2024-07-24 11:19:46 UTC
asdf

Author: Tim Andersson
Author Date: 2024-07-24 11:19:46 UTC

asdf

bump-bos03-s390x-quota 2024-07-23 08:48:36 UTC
service-bundle: add 20 bos03-s390x workers

Author: Tim Andersson
Author Date: 2024-07-23 08:48:36 UTC

service-bundle: add 20 bos03-s390x workers

fix-osc-lib-traceback 2024-07-22 16:41:40 UTC
fix: worker: import osc_lib in a different manner

Author: Tim Andersson
Author Date: 2024-07-22 16:41:40 UTC

fix: worker: import osc_lib in a different manner

The previous method of importing osc_lib was resulting in the following
traceback:
```
AttributeError: module 'osc_lib' has no attribute 'utils'
```

I tried importing osc_lib.utils in many different ways - both in a venv
on my dev machine and on the servers in prod. The implementation in this
commit is the one that worked both in a venv on my machine, and in a
python shell in production, as well as in the worker code after a
cowboy.

fix-bad-metrics-math 2024-07-22 08:58:13 UTC
fix: worker: fix bad math in tools/metrics

Author: Tim Andersson
Author Date: 2024-07-22 08:58:13 UTC

fix: worker: fix bad math in tools/metrics

I recently made an MP that introduced a new metric - the cloud worker
active unit percentage.

I noticed this morning on the KPI that there were some negative values,
and realised that the math in the previous implementation was only valid
under certain circumstances, so I've modified it in this MP.

apache-request-monitoring-investigation 2024-07-12 16:40:52 UTC
asdf

Author: Tim Andersson
Author Date: 2024-07-12 16:40:52 UTC

asdf

cloud-worker-active-error-percentage 2024-07-12 09:47:43 UTC
feat: cloud: add new metric for percentage of active cloud worker units

Author: Tim Andersson
Author Date: 2024-07-10 14:37:08 UTC

feat: cloud: add new metric for percentage of active cloud worker units

This commit introduces new functionality to the `metrics` script in the
autopkgtest-cloud charm.

It adds a new metric, just for the cloud worker units:
`autopkgtest_unit_status_active_percentage`

This is a new metric which we will use in grafana to alert the team when
the percentage of active cloud worker units drops below 50% for a
specified period of time.

This couldn't be done with just pure grafana, due to limitations
surrounding alerting and transformations.

This has already been tested. This version of the metrics script has
been running in a tmux session for a while and the panel can be seen on
grafana, already active, with the alert already set up.

The data is also delineated by hostname to help with debugging.
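
As a rough illustration of the calculation, and not the actual
`metrics` script, a percentage of active units per hostname could be
computed as below. The input shape (pairs of hostname and status) and
the function name are assumptions:
```python
from collections import defaultdict


def active_percentage_by_host(units):
    """units: iterable of (hostname, status) pairs; returns {hostname: percent active}."""
    totals = defaultdict(int)
    active = defaultdict(int)
    for hostname, status in units:
        totals[hostname] += 1
        if status == "active":
            active[hostname] += 1
    # Dividing the active count by the total keeps every value within 0..100.
    return {host: 100.0 * active[host] / totals[host] for host in totals}
```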

fix-ci-increase-lp-request-timeout 2024-07-12 09:33:24 UTC
fix: web: don't let unit tests make api calls to launchpad

Author: Tim Andersson
Author Date: 2024-07-12 09:33:24 UTC

fix: web: don't let unit tests make api calls to launchpad

This commit refactors a couple of unit tests which weren't mocking
the urllib.url_open function call - resulting in our unit tests in CI
actually making api calls to Launchpad, which isn't appropriate. They
now correctly mock the url_open call and no longer make queries to the
Launchpad API.

This explains our recent flaky CI - perhaps there was some Launchpad
instability, or at least, something causing queries to take a little
longer, which was in turn causing our unit tests to fail.
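
A minimal sketch of the mocking approach, using the standard library's
`unittest.mock`. The `fetch_json` helper and the patched target
`urllib.request.urlopen` are assumptions for illustration; the real
tests may patch a different name:
```python
import io
import json
import unittest
import urllib.request
from unittest import mock


def fetch_json(url):
    """Hypothetical helper standing in for code that queries a remote API."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)


class FetchJsonTest(unittest.TestCase):
    @mock.patch("urllib.request.urlopen")
    def test_no_network(self, mock_urlopen):
        # The mock returns a canned body, so no request leaves the machine.
        canned = io.BytesIO(b'{"ok": true}')
        mock_urlopen.return_value.__enter__.return_value = canned
        self.assertEqual(fetch_json("https://example.invalid/api"), {"ok": True})
        mock_urlopen.assert_called_once()
```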

fix-d-a-r-bug 2024-07-10 12:51:16 UTC
fix: web: fix download-all-results TypeError

Author: Tim Andersson
Author Date: 2024-07-10 12:44:53 UTC

fix: web: fix download-all-results TypeError

When running download-all-results for noble recently, the following
traceback was encountered:
```
DEBUG:__main__:Fetched test result for noble/armhf/python-keycloak/3.9.0+dfsg-1 20240326_084914_e13af@ (triggers: python3-defaults/3.12.2-0ubuntu1): exit code 8
Traceback (most recent call last):
  File "./download-all-results", line 264, in <module>
    fetch_container(
  File "./download-all-results", line 203, in fetch_container
    fetch_one_result(
  File "./download-all-results", line 144, in fetch_one_result
    env_vars.append("=".join([env, value]))
TypeError: sequence item 1: expected str instance, int found
```

This is because of the all_proposed environment variable being
set to 1, which is then read as an integer rather than a string.

This commit amends the issue by explicitly casting the type of the value
in the env_vars to be a string.
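
A sketch of the cast, assuming the environment variables arrive as a
mapping (the `format_env_vars` name is hypothetical):
```python
def format_env_vars(env):
    """Render a mapping of environment variables as NAME=value strings."""
    # str() is needed because a value may be an int, e.g. all_proposed=1,
    # and str.join() only accepts strings.
    return ["=".join([name, str(value)]) for name, value in env.items()]
```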

swift-cleanup 2024-07-10 08:01:05 UTC
feat: web: add script to cleanup broken swift results

Author: Tim Andersson
Author Date: 2024-04-03 13:09:01 UTC

feat: web: add script to cleanup broken swift results

When running seed-new-release, and copying over results from the last
release to the devel release, we encountered a lot of errors in which
the testinfo.json file (something created by autopkgtest) wasn't
decipherable as json.

This commit introduces a script which iterates through testinfo.json
files in swift, and any that aren't valid json, or are corrupted in any
way, are replaced with a dummy dictionary like so:
```
{
 "message": "This file has been added manually and is not a result
    of any test"
}
```

This should reduce the error messages when running seed-new-release, and
also, make it a bit more clear for developers browsing results - having
a clear message like this instead of just an empty file is more
explicit.
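
The validity check can be sketched as follows. This only covers the
decision of whether to keep or replace a file's contents; fetching from
and uploading to swift is left out, and the `repair_testinfo` name is
hypothetical:
```python
import json

DUMMY_TESTINFO = {
    "message": "This file has been added manually and is not a result of any test"
}


def repair_testinfo(raw: bytes) -> bytes:
    """Return raw unchanged if it parses as JSON, else the serialized dummy dictionary."""
    try:
        json.loads(raw)
        return raw
    except ValueError:
        # JSONDecodeError and UnicodeDecodeError are both subclasses of ValueError.
        return json.dumps(DUMMY_TESTINFO).encode("utf-8")
```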

generate-charm-inventory-dont-list-docs-entries 2024-07-08 14:12:16 UTC
fix: generate-charm-inventory: don't display docs commits

Author: Tim Andersson
Author Date: 2024-07-08 14:11:56 UTC

fix: generate-charm-inventory: don't display docs commits

As pointed out by Brian Murray, changes to the documentation of
autopkgtest-cloud have no effect on the functionality of either charm.

This commit amends the issue by ignoring commits with a message that
begins with "docs:".
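
The filter amounts to something like this sketch (the function name and
the list-of-subjects input are assumptions):
```python
def charm_relevant_commits(subjects):
    """Drop commit subjects that start with the "docs:" prefix."""
    return [subject for subject in subjects if not subject.startswith("docs:")]
```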

service-bundle-staging-remove-net-name 2024-07-03 13:03:39 UTC
service-bundle: remove net-name variable from staging autopkgtest-cloud-worker

Author: Tim Andersson
Author Date: 2024-07-03 13:03:39 UTC

service-bundle: remove net-name variable from staging autopkgtest-cloud-worker

This was missed in a previous MP. The net-name variable can no longer be
present as it's no longer an option in the charm's config.yaml.

Currently, this throws an error when deploying the service-bundle in
staging, so this MP will amend that.

service-bundle-add-bos03-s390x-net-name-flavor 2024-07-03 12:48:22 UTC
service-bundle: add flavors and net name for bos03-s390x

Author: Tim Andersson
Author Date: 2024-07-03 12:48:22 UTC

service-bundle: add flavors and net name for bos03-s390x

This commit adds the flavors for (staging) bos03-s390x, and adds the
required net names for both staging and production for bos03-s390x.

bos03-ppc64el-small-n-workers-bump 2024-07-03 10:15:53 UTC
service-bundle: add 20 bos03-ppc64el workers

Author: Tim Andersson
Author Date: 2024-07-03 10:15:53 UTC

service-bundle: add 20 bos03-ppc64el workers

The quota for bos03-ppc64el is limited right now. We have a quota of 80
cores currently. This cannot be increased at the moment - IS have stated
that ppc64el in bos03 is currently far behind bos03-arm64 in terms of
resources.

Thus I've added 20 workers for bos03-ppc64el - I believe this should be
okay for the time being, though we could hit quota issues in the event
that ALL of these workers are using the `big` flavor, but this is
an unlikely occurrence.

fix-package-page-nonexistent-package 2024-06-26 13:05:37 UTC
fix: web: Don't show package pages for packages that don't exist

Author: Tim Andersson
Author Date: 2024-03-19 15:20:15 UTC

fix: web: Don't show package pages for packages that don't exist

This commit changes the behaviour when a user tries to reach a package
or results page for a package that doesn't exist.

The results page used to throw an error stating that the package doesn't
exist, however, I think this is slightly inaccurate - the package could
exist, but we could just have no test results for it. This is in fact
not really an error, and not something we should surface *as* an error.

The behaviour, with this commit, is as follows:
On the results and package pages, if the user goes to one of these pages
for a package that doesn't exist, they get the following message:
```
Oops! Looks like this package has no previous results. The package
itself may not exist - you can check by clicking the Launchpad icon.
```

I think this is better because, a, we are no longer throwing an error,
and b, because the user can now validate the package exists by clicking
on the Launchpad icon.

Overall I think this is just an accurate representation of all the
potential possibilities when going to either of these pages with an
"invalid" package name.

There is the possibility of checking if the package exists via
Launchpad, but that'd be an http request to Launchpad every time a user
views a package or results page, which is a waste of resources, and
seems unnecessary.

Fixes bug LP: #2058059

apache-logging-monitoring-check-error-log-too 2024-06-26 12:12:15 UTC
fix: web: also check error log in apache-request-monitoring

Author: Tim Andersson
Author Date: 2024-06-26 12:07:00 UTC

fix: web: also check error log in apache-request-monitoring

This script had a fatal flaw - we were only checking
/var/log/apache2/access.log and not /var/log/apache2/error.log, meaning
there were no records of 500's in the data and subsequently in the
grafana KPI, which is arguably one of the most important http response
status codes we'd want to keep track of.

This commit amends the issue by checking both the access.log and the
error.log.

It also makes the script a little bit more reliable as one of the
functions before could return None, which isn't ideal.

autopkgtest-db-sha256-fix 2024-06-26 10:40:05 UTC
fix: web: ensure that autopkgtest.db.sha256 is symlinked

Author: Tim Andersson
Author Date: 2024-06-26 08:30:55 UTC

fix: web: ensure that autopkgtest.db.sha256 is symlinked

The functionality to symlink /home/ubuntu/public/autopkgtest.db to the
static directory which the flask web app serves files from was
duplicated for autopkgtest.db.sha256 - however a flag wasn't added for
this new statically served file, and the flag for the db being symlinked
was already set, meaning the `symlink_public_db` function wasn't
executed in production and thus autopkgtest.db.sha256 was never
symlinked or served via the web app.

Additionally to this, symlinking the sha256 file was in the same
try/except block as the db itself, where the exception was a
FileExistsError. The db symlink already existed, meaning the exception
was thrown, and thus the sha256 file wouldn't have been symlinked even
with a separate flag as described above.

This commit amends the issue by adding a second flag for the sha256
file, and symlinking the db and the sha256 file in a loop, separately.
The functionality beforehand in which the public directory is created,
was moved to its own try/except block also.
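
The shape of the fix can be sketched like so. This is not the charm's
actual `symlink_public_db` function; the function name, arguments and
return value here are assumptions, and the reactive flags are left out:
```python
import os
from pathlib import Path


def symlink_public_files(public_dir: Path, static_dir: Path, names) -> list:
    """Symlink each named file from public_dir into static_dir; return names newly linked."""
    try:
        public_dir.mkdir(parents=True)
    except FileExistsError:
        pass  # Directory creation has its own block so it cannot mask link errors.
    linked = []
    for name in names:
        try:
            os.symlink(public_dir / name, static_dir / name)
            linked.append(name)
        except FileExistsError:
            continue  # An existing link for one file must not skip the others.
    return linked
```

Because each file gets its own try/except, an already existing db
symlink no longer prevents the sha256 file from being linked.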

proposed-package-images 2024-06-25 15:09:27 UTC
fix: cloud: make create-nova-image-with-proposed-package up to date

Author: Tim Andersson
Author Date: 2024-05-01 15:15:30 UTC

fix: cloud: make create-nova-image-with-proposed-package up to date

This commit introduces a new mechanism of loading creds for the
aforementioned script, given that we now use a wider variety of
datacentres, and this script was last updated nearly 5 years ago.

This script also adds two new dependencies to the cloud-worker charm:
- qemu-user-static
- binfmt-support
These are required for the script to execute bash commands on VMs of a
different arch to the host VM - i.e. on arm64 from an amd64 host.

It also modifies the mechanism in which the desired package is installed
from proposed, and does away with the sed line that existed before.

create-nova-image-with-proposed is a script, which hasn't been used in
*quite* a while, which rebuilds one of our adt images, with a specified
package from proposed.

We had to use this recently when the version of base-files in the
release pocket was breaking our tests, but the version of base-files
in the proposed pocket would fix said issue.

remove-rabbitmq-auto-restart 2024-06-25 13:10:48 UTC
fix: rabbitmq: remove auto-restart for rabbitmq server

Author: Tim Andersson
Author Date: 2024-06-25 12:54:17 UTC

fix: rabbitmq: remove auto-restart for rabbitmq server

For the last few years, rabbitmq was auto-restarting after using up 2GiB
of ram.

This was a longstanding issue, in which the root cause was addressed in
the following commit:
https://git.launchpad.net/autopkgtest-cloud/commit/?id=f019281e0fe38d3f298b933b9fd9fcb243795a7a

The root cause of the issue was the worker code sending status updates
to a queue at a rate which the consumer couldn't keep up with, causing
the rabbitmq queue to grow in size in perpetuity. You can see a more
complete description of the issue in the message of the commit above.

That being said, we can now remove the script that sets up the service
which auto-restarts the rabbitmq-server.service (with glee).

flexible-net-names 2024-06-25 10:47:16 UTC
feat: cloud: make network names flexible for dc/arch combinations

Author: Tim Andersson
Author Date: 2024-06-24 11:28:04 UTC

feat: cloud: make network names flexible for dc/arch combinations

Much like the recent change to the flavor config, this commit introduces
a mechanism to have specific network names for each datacentre/arch
combination.

It introduces a new config variable to the autopkgtest-cloud-worker
charm, worker-net-names, which is a string with yaml inside of it, the
same as the flavor config.

The values in the yaml are inserted into the
worker-$datacentre-$arch.conf file, as the net-name juju config option
used to be.

Having a specific datacentre/arch network name isn't required as a
default is specified, which is
net_$instance-proposed-migration, where instance is either "prod" or
"stg".

This commit also refactors part of the `write_net_names` function to use
pathlib instead of writing to a file using `with open`.
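
The lookup with its default can be sketched as below. The key format
inside the yaml ("$datacentre-$arch") and the function name are
assumptions; only the default network name comes from the description
above:
```python
def net_name_for(net_names: dict, datacentre: str, arch: str, instance: str) -> str:
    """Return the configured network name, falling back to the default."""
    # instance is expected to be "prod" or "stg".
    default = f"net_{instance}-proposed-migration"
    # Assumed key format; the real config may structure this differently.
    return net_names.get(f"{datacentre}-{arch}", default)
```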

stg-bos03-ppc64el-flavor-names 2024-06-25 10:08:18 UTC
service-bundle: add flavor names for bos03-ppc64el in staging

Author: Tim Andersson
Author Date: 2024-06-25 10:08:18 UTC

service-bundle: add flavor names for bos03-ppc64el in staging

metrics-worker-healthy-percentage 2024-06-24 16:39:26 UTC
worker percentage wip

Author: Tim Andersson
Author Date: 2024-06-24 16:38:55 UTC

worker percentage wip

apache-request-monitoring-timer-fix 2024-06-24 13:21:59 UTC
fix: web: fix timer syntax in apache-request-monitoring.timer

Author: Tim Andersson
Author Date: 2024-06-24 13:16:42 UTC

fix: web: fix timer syntax in apache-request-monitoring.timer

This was missed in a previous MP, but the syntax in the timer file for
this unit was incorrect and thus was only triggering at 5 past midnight
rather than every 5 minutes.

This commit amends the issue by adding the proper syntax for running the
unit every 5 minutes.

add-djlint-to-ci 2024-06-24 09:38:00 UTC
web: lint all templates in line with djlint now in pre-commit and CI

Author: Tim Andersson
Author Date: 2024-06-24 09:38:00 UTC

web: lint all templates in line with djlint now in pre-commit and CI

revert-lxd-armhf-security-nesting 2024-06-19 11:54:16 UTC
Revert "fix: lxd-worker: add security.nesting=true to lxd config"

Author: Tim Andersson
Author Date: 2024-06-19 11:54:16 UTC

Revert "fix: lxd-worker: add security.nesting=true to lxd config"

This reverts commit 7e2db60fdb52a81febf88f462383f557abe5b7dd.

fix-lxd-security-nesting 2024-06-19 11:52:32 UTC
fix: lxd-worker: put security.nesting: true in the profile config

Author: Tim Andersson
Author Date: 2024-06-19 11:42:03 UTC

fix: lxd-worker: put security.nesting: true in the profile config

A recent commit introduced this new config option, however it was in the
wrong place. It shouldn't be in the global lxd config, it should be
configured in the default profile instead [1].

I figured out the correct lxd-init syntax by adding the config option to
the profile with:
lxc profile set default security.nesting=true

And then checking the syntax with:
lxc profile show default

And adding the same syntax to our creation of the lxd-init file in
armhf-lxd.userdata.

And then to double check the config option, I launched an instance and
double checked the config with:
lxc config show <instance-name> --expanded

When setting the config option in the default profile, the config option
doesn't show up without the --expanded flag.

The details on the security.nesting instance option can be found at [2].

[1] https://discuss.linuxcontainers.org/t/where-to-set-lxc-config-defaults-on-a-snap-installation/14595/4
[2] https://documentation.ubuntu.com/lxd/en/latest/reference/instance_options/

lxd-security-nesting-true 2024-06-19 10:01:55 UTC
fix: lxd-worker: add security.nesting=true to lxd config

Author: Tim Andersson
Author Date: 2024-06-19 07:16:53 UTC

fix: lxd-worker: add security.nesting=true to lxd config

There's a version of systemd in oracular-proposed which is purported to
break armhf tests (for oracular) once it migrates to the release pocket.

TL;DR: Any systemd units with credentials on unprivileged containers will
fail on oracular tests with the new version of systemd in proposed.

This would cause systemd-tmpfiles-setup.service to be broken on the lxd
containers, which is a service which creates /var/run/utmp, which is how
runlevel is stored. runlevel is checked in lib/VirtSubproc.py [1] in the
wait_booted function. So, subsequently, wait_booted would eventually
timeout, as systemd-tmpfiles-setup.service would never store runlevel
appropriately on the testbed.

The workaround was discussed [2] between the systemd maintainer (enr0n)
and the lxd team, and the solution was to enable security.nesting for
the lxd containers running our armhf tests.

security.nesting simply allows for nested containerisation. [3]

I tried to find a concrete piece of documentation about where this
specific config flag should go in the lxd config, however, I couldn't
find anything concrete, apart from these comments [4] and [5].

To summarise, we would be hitting [6] because of [7].

[1] https://salsa.debian.org/ubuntu-ci-team/autopkgtest/-/blob/master/lib/VirtSubproc.py?ref_type=heads#L454
[2] https://github.com/canonical/lxd/issues/13631
[3] https://discuss.linuxcontainers.org/t/what-does-security-nesting-true/7156/4
[4] https://discuss.linuxcontainers.org/t/what-does-security-nesting-true/7156/5
[5] https://discuss.linuxcontainers.org/t/what-does-security-nesting-true/7156/6
[6] https://bugs.launchpad.net/ubuntu/+source/autopkgtest/+bug/1998943
[7] https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/2046486

fix-retry-url 2024-06-19 08:50:56 UTC
fix: web: fix retry buttons on results page and user page

Author: Tim Andersson
Author Date: 2024-06-19 06:57:26 UTC

fix: web: fix retry buttons on results page and user page

This was a change that was missed in the user page MP.

The base_url variable wasn't inherited by the macro, causing the retry
url to be the current url appended with the retry arguments (package,
release, arch, triggers, etc). Obviously this didn't work and users
couldn't trigger retries from the webpage.

Another issue was that the package, release and version variables
weren't inherited by the macro - obvious in retrospect.

This commit amends the two issues by:
- using url_for('index_root') [1] instead of base_url
- explicitly passing package, release, and arch to the
  results_table_core macro

[1] https://flask.palletsprojects.com/en/3.0.x/api/#flask.Flask.url_for

web-reactive-status-fix 2024-06-18 13:35:14 UTC
fix: web: Don't let autopkgtest-web unit stay in "maintenance" status

Author: Tim Andersson
Author Date: 2024-06-18 13:26:44 UTC

fix: web: Don't let autopkgtest-web unit stay in "maintenance" status

In prod, with the recent changes to the web unit regarding restarting
the services, this change was missed.

Without it, if the systemd units get written, and none of them have
changed, the status will never get set to active. The status getting set
to active was intended to be done in the
`restart_all_autopkgtest_web_services` function, however, if the files
haven't changed, this won't happen.

This commit amends the issue by setting the status to active at the end of the
`set_up_systemd_units` function. If the units have changed, it'll be set
back to maintenance in the `restart_all_autopkgtest_web_services`
function, so there won't be any false "active" statuses.

cleanup-ppa-containers-small-bugfix 2024-06-17 14:50:13 UTC
fix: cleanup-ppa-containers: datetime type comparison bugfix

Author: Tim Andersson
Author Date: 2024-06-17 14:50:13 UTC

fix: cleanup-ppa-containers: datetime type comparison bugfix

There was a small bug which got missed in the initial MP of this script.

This commit amends the bug by not converting the datetime object to a
timestamp prior to comparison with another datetime object.
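
A sketch of the corrected comparison, with hypothetical names. Python
raises a TypeError when a float timestamp is compared against a
datetime object, so both sides are kept as datetime objects:
```python
from datetime import datetime, timedelta, timezone


def is_older_than(last_modified: datetime, max_age: timedelta, now: datetime) -> bool:
    """Return True if last_modified is further in the past than max_age."""
    # Both operands are datetime objects; no .timestamp() conversion is done.
    return last_modified < now - max_age
```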

lxd-armhf-kill-openstack-server-traceback 2024-06-14 13:02:41 UTC
fix: worker: don't try to kill openstack servers on the lxd-worker

Author: Tim Andersson
Author Date: 2024-06-14 12:27:47 UTC

fix: worker: don't try to kill openstack servers on the lxd-worker

This commit fixes a bug which was causing the following traceback on the
lxd worker:
 Traceback (most recent call last):
   File "/home/ubuntu/autopkgtest-cloud/worker/worker", line 1716, in <module>
     main()
   File "/home/ubuntu/autopkgtest-cloud/worker/worker", line 1709, in main
     queue.wait()
   File "/usr/lib/python3/dist-packages/amqplib/client_0_8/abstract_channel.py", line 97, in wait
     return self.dispatch_method(method_sig, args, content)
   File "/usr/lib/python3/dist-packages/amqplib/client_0_8/abstract_channel.py", line 117, in dispatch_me>
     return amqp_method(self, args, content)
   File "/usr/lib/python3/dist-packages/amqplib/client_0_8/channel.py", line 2060, in _basic_deliver
     func(msg)
   File "/home/ubuntu/autopkgtest-cloud/worker/worker", line 1360, in request
     kill_openstack_server(test_uuid)
   File "/home/ubuntu/autopkgtest-cloud/worker/worker", line 647, in kill_openstack_server
     auth_url=os.environ["OS_AUTH_URL"],
   File "/usr/lib/python3.8/os.py", line 675, in __getitem__
     raise KeyError(key) from None
 KeyError: 'OS_AUTH_URL'

This is fixed by checking to see if the OS_IDENTITY_API_VERSION env
variable is present - if it's not, we're on the lxd worker. In this
case, the kill_openstack_server function does nothing.

This commit also moves some logging surrounding the
kill_openstack_server function into the function itself, to avoid
duplication of code.
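
The guard can be sketched as follows. The real function goes on to talk
to OpenStack; that part is omitted here, and the `environ` parameter and
boolean return value are additions made so the sketch can be exercised
in isolation:
```python
import logging
import os


def kill_openstack_server(test_uuid, environ=os.environ) -> bool:
    """Return False when skipped (no OpenStack env), True when a kill would be attempted."""
    if "OS_IDENTITY_API_VERSION" not in environ:
        # No OpenStack credentials means we are on the lxd worker: do nothing.
        logging.info("Not on a cloud worker, skipping server kill for %s", test_uuid)
        return False
    logging.info("Killing OpenStack server for test %s", test_uuid)
    # The real function would build a client from OS_AUTH_URL etc. here (omitted).
    return True
```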

user-specific-page 2024-06-13 15:54:18 UTC
refactor: web: make browse-results.html use the results_table_core macro

Author: Tim Andersson
Author Date: 2024-04-17 17:10:32 UTC

refactor: web: make browse-results.html use the results_table_core macro

tim-stats 2024-06-13 08:53:39 UTC
check for armhf results with log message

Author: Tim Andersson
Author Date: 2024-06-13 08:53:39 UTC

check for armhf results with log message

publish-db-fix 2024-06-12 16:34:25 UTC
fix: web: use ftpmaster.internal for publish-db

Author: Tim Andersson
Author Date: 2024-06-12 16:15:39 UTC

fix: web: use ftpmaster.internal for publish-db

This commit modifies the publish-db script to use ftpmaster.internal
instead of archive.ubuntu.com. ftpmaster.internal is obviously a lot
faster and should help improve the speed of the publish-db script.

The script was recently failing hitting various urls at
archive.ubuntu.com, with a ConnectionResetError. This commit also adds
that exception to the try except block which attempts to download the
Sources.gz file for a component/pocket/release combination, just to
stop the script from completely failing if this happens again. As
mentioned, the script was failing completely, meaning the public
database wasn't getting updated at all.

cleanup_old_ppa_containers 2024-06-12 13:43:18 UTC
feat: cloud: add script for cleaning up old ppa containers in swift

Author: Tim Andersson
Author Date: 2023-08-03 10:22:19 UTC

feat: cloud: add script for cleaning up old ppa containers in swift

The swift database is currently a bit overloaded with lots of results
from people testing PPA's, with some results being from a very long time
ago. This is obviously less than ideal and after some recent discussion
26 weeks was decided to be an appropriate amount of time for a max age
for PPA results.

This commit adds a script, which iterates through all of the containers
in swift, skips the distro results, the db backups container, and any
upstream containers, and removes all ppa results older than the
specified time.

This script should be run when we open up a new series, and thus this
step has been added to the section in `docs/administration.rst`
regarding opening up a new series. Since the duration between releases
is roughly 26 weeks, I think this makes sense.

restart-worker-on-charm-update 2024-06-11 10:08:22 UTC
fix: web: restart apache2 and autopkgtest-web.target on charm update

Author: Tim Andersson
Author Date: 2024-06-11 09:42:08 UTC

fix: web: restart apache2 and autopkgtest-web.target on charm update

This commit introduces a new mechanism in the `upgrade-charm` hook,
which restarts the apache2.service and the autopkgtest-web.target
systemd units. This will make the webpage take into effect any changes
to the webcontrol code, and make any services that have changes to the
code or service files have those changes take immediate effect.

Prior to this commit, restarting apache2 and the autopkgtest-web target
was something that was done manually by an admin. This can be easily
overlooked and should be automated.

To implement this fix, the `upgrade-charm` [1] hook is utilised. This
is a hook that runs every time a unit is undergoing an upgrade. It is
customisable, and one can add any number of steps to run through in the
event of a charm upgrade. For this purpose, the commit adds two calls:
`systemctl restart apache2.service`
`systemctl restart autopkgtest-web.target`
to the hook.

`apache2.service` is not a part of autopkgtest-web.target.

[1] https://juju.is/docs/sdk/upgrade-charm-event

improve-handling-cache-amqp-failures 2024-06-11 08:09:36 UTC
fix: web: make send_amqp_request function have a timeout

Author: Tim Andersson
Author Date: 2024-01-17 11:26:38 UTC

fix: web: make send_amqp_request function have a timeout

We were recently having some issues with production, where
requesting a test via the webpage would result in the page
eventually telling the user the server has timed out. This
was because the send_amqp_request function was failing, but
the library used to send the request has no internal timeout.

send_amqp_request was failing because the rabbitmq-server ran
out of memory.

This commit amends the above issue by adding a timeout for the
send_amqp_request function in the form of a signal handler,
via the `signal` python module.

The other issue, UX wise, is when rabbitmq is mid-restart, and the user
is given this unhelpful error:
```
local variable `msg` referenced before assignment
```

The try/except block used to catch the TimeoutException also catches the
UnboundLocalError exception. In the case rabbitmq is mid-restart, or
completely down, this block now catches the issue.

This commit also introduces a new exception class, QueueDead. This
exception is used to give the user a more helpful error message, based
on the two issues this commit catches, and overall, just improve the UX.

The message from QueueDead is prepended with our generic "A server error
has occurred" message, which includes details of how to contact members
of the QA team.
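
A minimal sketch of a signal-based timeout, assuming a Unix host and
that the call happens in the main thread (both are requirements of
`signal.alarm`). The `call_with_timeout` wrapper, the use of the builtin
TimeoutError and the error message are assumptions; only the QueueDead
name comes from the description above:
```python
import signal
import time


class QueueDead(Exception):
    """Stand-in for the exception class named in the commit message."""


def _raise_timeout(signum, frame):
    raise TimeoutError("amqp request timed out")


def call_with_timeout(func, seconds: int):
    """Run func(), raising QueueDead if it exceeds `seconds` or hits UnboundLocalError."""
    previous = signal.signal(signal.SIGALRM, _raise_timeout)
    signal.alarm(seconds)
    try:
        return func()
    except (TimeoutError, UnboundLocalError) as exc:
        raise QueueDead(f"The queue is not responding: {exc}") from exc
    finally:
        signal.alarm(0)  # Cancel any pending alarm.
        signal.signal(signal.SIGALRM, previous)  # Restore the old handler.
```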

preserve-set-correct-content-encoding 2024-06-08 12:18:28 UTC
cloud-worker: feat: script for setting correct content encoding for logfiles

Author: Tim Andersson
Author Date: 2023-12-05 16:28:19 UTC

cloud-worker: feat: script for setting correct content encoding for logfiles

There was an issue when migrating our swift storage where the objects
from our old swift storage were copied successfully but had an incorrect
content-encoding for the logfiles, which meant users had to download the
logs and couldn't view them in their browser.

This script added in this commit sets the correct content encoding for
all logfiles in all autopkgtest-$release containers. So it only does so
for the distro related tests, no PPA's, no upstream tests, but this is
intentional - it takes so long to run, and distro test results are a
higher priority.

The script, hopefully, will never be required again, but I'm preserving
it here, just in case.

rabbitmq-cleanup-fix 2024-06-08 10:09:39 UTC
fix: mojo: Fix deployment of rabbitmq cleanup script

Author: Tim Andersson
Author Date: 2024-04-02 09:03:28 UTC

fix: mojo: Fix deployment of rabbitmq cleanup script

Prior to this commit, the rabbitmq cleanup script wasn't getting copied
over properly to the rabbitmq unit, causing the script to fail when
rabbitmq required restarting.

This commit creates a file for the rabbitmq cleanup script, in the
autopkgtest-cloud repo, instead of just in the deployment script, as
well as the necessary service files. The script now gets deployed as
intended and functions as intended.

This is a temporary workaround still until we get to the bottom of the
rabbitmq restarts.

restart-autopkgtest-rgn-arch-services-on-worker-conf-file-change 2024-06-07 13:23:00 UTC
experimental and WIP: restart autopkgtest (specific) services on worker conf ...

Author: Tim Andersson
Author Date: 2024-06-07 13:23:00 UTC

experimental and WIP: restart autopkgtest (specific) services on worker conf file changes

restart-autopkgtest-services-on-worker-conf-file-change 2024-06-07 12:38:44 UTC
fix: worker: restart autopkgtest.target when worker code has changed

Author: Tim Andersson
Author Date: 2024-06-06 10:23:31 UTC

fix: worker: restart autopkgtest.target when worker code has changed

Up until this point, we've had the finicky issue of requiring the
autopkgtest admins to manually restart the autopkgtest.target systemd
target upon a charm update.

This is problematic as it's something we could easily miss/forget, and
also requires us to discuss whether a restart of the services is
necessary when we're deciding whether or not to update the charms.

This commit utilises the `any_file_changed` function from
`charms.reactive.helpers`.

The reactive part of the cloud worker charm now checks whether there
have been any changes to the worker code and, if so, sets the
`autopkgtest.target-restart-needed` flag, restarting autopkgtest.target.
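
As a rough illustration of the idea behind `any_file_changed`, the
sketch below compares the hash recorded for each file against its
current hash. It does not use `charms.reactive`; the `files_changed`
helper and the `seen_hashes` dict are hypothetical stand-ins, and the
real charm code may well differ.

```python
import hashlib
from pathlib import Path


def file_hash(path):
    """Return the SHA-256 hex digest of a file, or None if it is missing."""
    p = Path(path)
    if not p.is_file():
        return None
    return hashlib.sha256(p.read_bytes()).hexdigest()


def files_changed(paths, seen_hashes):
    """Return True if any file differs from the hash recorded last time.

    seen_hashes is a plain dict standing in for persistent charm storage
    (an assumption for this sketch); it is updated in place.
    """
    changed = False
    for path in paths:
        current = file_hash(path)
        if seen_hashes.get(str(path)) != current:
            seen_hashes[str(path)] = current
            changed = True
    return changed
```

A caller would set a restart flag only when `files_changed` returns
True, so an unchanged charm update does not restart the workers.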

apache_logging_monitoring 2024-06-07 10:54:37 UTC
feat: web: add apache monitoring and reporting to autopkgtest-web

Author: Tim Andersson
Author Date: 2023-08-07 15:26:04 UTC

feat: web: add apache monitoring and reporting to autopkgtest-web

This commit introduces a new script, `apache-request-monitoring`, which
monitors the status codes of HTTP requests to our apache server.

It captures each status code and the number of times it was returned in
the last 5 minutes.

This commit also adds the necessary service file, and adds the needed juju
config options for the charm (the influx creds).

The panel for this visualisation already exists on Grafana [1]. Once
this script starts running in prod, we should start seeing some nice
visualisations.

[1] https://ubuntu-release.kpi.ubuntu.com/d/76Oe_0-Gz/autopkgtest?orgId=1&refresh=5m&from=1717736107186&to=1717757707187&var-instance=production&viewPanel=20

dont-let-cache-amqp-hang 2024-06-05 12:58:14 UTC
fix: web: utilise TimeoutStartSec to stop cache-amqp from ever hanging

Author: Tim Andersson
Author Date: 2024-05-02 11:20:54 UTC

fix: web: utilise TimeoutStartSec to stop cache-amqp from ever hanging

cache-amqp would previously hang when there was a fault with the
semaphore queues - a brittle mechanism which we intend to fix in the
future if we decide to move back to a distributed system with regard to
the autopkgtest-web units.

When cache-amqp hangs, our KPI is no longer indicative of our
queue size, which is very detrimental to our observability.

The hang is easily resolved by a systemctl restart of the cache-amqp
service, so here we add a maximum runtime for cache-amqp of ten
minutes. The service is triggered every minute, so ten minutes is quite
a conservative maximum runtime, but we don't want to restart the
service prematurely either: cache-amqp could hang for a few minutes and
subsequently recover.

We integrate this "maximum runtime" by utilising TimeoutStartSec [1],
which monitors the time the ExecStart call is taking to complete.

[1] https://www.freedesktop.org/software/systemd/man/latest/systemd.service.html#TimeoutStartSec=

fix-systemd-stop-restart-losing-jobs 2024-06-04 15:23:57 UTC
fix: worker: actually save messages when service stopped by systemd

Author: Tim Andersson
Author Date: 2024-06-04 10:26:36 UTC

fix: worker: actually save messages when service stopped by systemd

Previously, an iteration of this functionality instead checked the exit
code of the autopkgtest subprocess for a -15 code, using the mechanism
detailed in [1].

This was brittle: if the restart was executed before the VM for the
test was in the BUILD state, the autopkgtest subprocess would exit with
code 1 and the test request would be lost.

This commit amends the issue by instead utilising the signal
handlers in the worker code.
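
A minimal sketch of the signal-handler approach, assuming systemd stops
the service with its default SIGTERM. The `Worker` class, its method
names and the list standing in for the AMQP queue are all hypothetical;
this is not the actual worker code.

```python
import signal


class Worker:
    """Toy worker that remembers whether it was asked to stop."""

    def __init__(self):
        self.stop_requested = False

    def handle_stop(self, signum, frame):
        # Only record the request; the main loop decides what to do next.
        self.stop_requested = True

    def finish(self, message, queue):
        """Requeue the message if we were stopped, otherwise ack it."""
        if self.stop_requested:
            queue.append(message)  # stand-in for an explicit AMQP requeue
            return "requeued"
        return "acked"


def install_handlers(worker):
    """Route SIGTERM to the worker instead of relying on exit codes."""
    signal.signal(signal.SIGTERM, worker.handle_stop)
```

Because the decision is based on a flag set by the handler, it no longer
depends on which exit code the autopkgtest subprocess happens to return.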

This commit also re-introduces the documentation lost by the
revert commit prior to this one.

Fixes LP: #2067714

[1] https://docs.python.org/3/library/subprocess.html#subprocess.CompletedProcess.returncode

revert-innappropriate-fail-strings 2024-06-04 07:55:48 UTC
Revert "fix: worker: add fail strings for systemd failures and postfix failures"

Author: Tim Andersson
Author Date: 2024-06-04 07:55:48 UTC

Revert "fix: worker: add fail strings for systemd failures and postfix failures"

This reverts commit 3a33ae7e20701d36f2dbb680a7875d6b9ded45b5.

grafana-agent-setup 2024-05-23 07:43:49 UTC
asdf needs a lil testing

Author: Tim Andersson
Author Date: 2024-05-23 07:43:49 UTC

asdf needs a lil testing

worker-dont-remove-queue-item-systemctl-restart 2024-05-21 15:30:49 UTC
fix: worker: don't ack message if worker was killed with USR1 (code -15)

Author: Tim Andersson
Author Date: 2024-05-02 16:43:58 UTC

fix: worker: don't ack message if worker was killed with USR1 (code -15)

With the recent introduction [1] of the easing of killing running tests
- we hit an unforeseen issue! systemctl restarting the worker service
currently falls into the block of logic that removes the openstack
server AND removes the test request from the queue - oh no!

This commit amends the issue by causing both systemctl stop and
systemctl restart to be caught by the worker code, caught by our
signal handlers, and in both cases, the amqp message to be
explicitly requeued.

This commit also amends the docs w.r.t. killing running tests. A
different code must be specified to the kill command.

This commit fixes bug LP: #2064582

For information about how python's subprocess module inherits exit codes
from systemd signals, see [2].

[1] https://git.launchpad.net/autopkgtest-cloud/commit/?id=d0e3b2ddb3ac5fe4d62f879d2e1c9ca578797120
[2] https://docs.python.org/3/library/subprocess.html#subprocess.CompletedProcess.returncode

copy-security-group-fix 2024-05-17 07:42:41 UTC
fix: cloud: autopkgtest@.service always copy default security group

Author: Tim Andersson
Author Date: 2024-05-17 07:38:19 UTC

fix: cloud: autopkgtest@.service always copy default security group

This commit changes the service file for our main autopkgtest service.

Prior to this commit, the copy-security-group script copied a
security group based on the name of the service, i.e.
autopkgtest@lcy02...

This change is committed in the hopes that our security group usage will
be more robust if we continuously copy only from the default group.

It'll fail if the default group doesn't exist. But this is probably a
good thing.

We had an issue where our security groups seemed to start having the
wrong rules out of nowhere. An assumption could be that one security
group creation partially failed, without having the correct rules,
causing a cascading effect where subsequent secgroups inherited the
borked rules.

kill-server-on-retries 2024-05-16 07:39:05 UTC
fix: worker: always kill openstack server on retries

Author: Tim Andersson
Author Date: 2024-05-16 07:39:05 UTC

fix: worker: always kill openstack server on retries

remove-dead-lxd-bos02 2024-05-13 14:11:56 UTC
service-bundle: remove lxd-armhf 2,5,7 (bos02)

Author: Tim Andersson
Author Date: 2024-05-13 14:11:56 UTC

service-bundle: remove lxd-armhf 2,5,7 (bos02)

die-roll-mechanism-stable-vs-devel 2024-05-13 08:00:25 UTC
feat: cloud: add die roll mechanism for stable vs devel releases

Author: Tim Andersson
Author Date: 2024-05-10 11:51:47 UTC

feat: cloud: add die roll mechanism for stable vs devel releases

This mechanism also handles the case where the worker unit has an
outdated version of python3-distro-info.

service-bundle-n-workers-update 2024-05-10 14:55:52 UTC
service-bundle: align n-workers with values in prod

Author: Tim Andersson
Author Date: 2024-05-10 14:55:52 UTC

service-bundle: align n-workers with values in prod

admin-page-no-fail-when-empty 2024-05-09 15:02:23 UTC
fix: web: no traceback on empty admin page

Author: Tim Andersson
Author Date: 2024-05-09 10:35:30 UTC

fix: web: no traceback on empty admin page

If the db query in the admin page heuristic returns no results, the
admin page traces back as the heuristic attempts to multiply a NoneType.

This commit fixes the issue by first checking if duration_avg is not
None before attempting the aforementioned multiplication.
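
The failure mode can be reproduced with sqlite alone: AVG over an empty
table returns NULL, which Python sees as None. The sketch below is only
illustrative; the `result` table, the `duration` column and the
multiplication factor are assumptions, not the real admin page query.

```python
import sqlite3


def average_duration(conn):
    """Return the average duration, or None when there are no rows."""
    row = conn.execute("SELECT AVG(duration) FROM result").fetchone()
    return row[0]


def long_running_threshold(conn, factor=2):
    """Scale the average duration, guarding against an empty result set."""
    duration_avg = average_duration(conn)
    if duration_avg is None:
        return None  # nothing to multiply, so no traceback
    return duration_avg * factor
```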

web-charm-config-dir-fix 2024-05-09 13:37:23 UTC
fix: web: make sure to create `.config/autopkgtest-web` directory in reactive...

Author: Tim Andersson
Author Date: 2024-05-09 13:36:18 UTC

fix: web: make sure to create `.config/autopkgtest-web` directory in reactive charm

auto-queue-cleanup 2024-05-08 14:02:48 UTC
feat: web: add queue-cleaner script

Author: Tim Andersson
Author Date: 2024-04-23 10:48:35 UTC

feat: web: add queue-cleaner script

This script removes unnecessary items from the queue by checking for
items in the queue that have triggers that aren't in the proposed
or release pockets.

It also checks the queues for any duplicate items - after a specific
item has been found, if another queue item has the same parameters,
it is removed from the queue.
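
The duplicate check can be sketched as follows. This assumes each queue
item can be reduced to a package name plus a dict of parameters; the
`drop_duplicates` helper is hypothetical and is not the queue-cleaner
code itself.

```python
import json


def drop_duplicates(requests):
    """Keep the first occurrence of each request, preserving order."""
    seen = set()
    kept = []
    for package, params in requests:
        # Serialise with sorted keys so equal dicts compare equal.
        key = (package, json.dumps(params, sort_keys=True))
        if key in seen:
            continue
        seen.add(key)
        kept.append((package, params))
    return kept
```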

This script runs every 15 minutes, and should help our throughput by not
having unnecessary queue items.

This commit also adds a flock call to the cache-amqp and queue-cleaner
services to ensure that they never run at the same time, so as to avoid
any issues with parallel reads of the queues.

stop-tests-from-webpage 2024-05-02 16:37:02 UTC
feat: cloud&web: add option to stop test from webpage

Author: Tim Andersson
Author Date: 2024-04-23 15:02:54 UTC

feat: cloud&web: add option to stop test from webpage

This commit introduces the functionality of being able to kill a
currently running test from the autopkgtest webpage.

*Test-killer*
It introduces a new script, test-killer, which runs on the cloud worker
units as a systemd service.

test-killer listens to requests via amqp on the "tests-to-kill"
exchange. Test uuids are part of the message sent on this exchange to
test-killer, and test-killer then kills the test using the test uuid.

The initial message in the test-killer queue will look like this:
{
    "uuid": "b864593b-82e2-424e-bfe7-f37748dbd047",
    "not-running-on": []
}

The "not-running-on" list gets appended when a worker unit checks for
the test with the given uuid and the test isn't present on that specific
worker unit. test-killer appends the hostname of the current worker unit
to this list.

When the length of the "not-running-on" list is equal to the number of
worker units, the message is removed from the queue if the uuid is not
found in queues.json. In this case we assume the test has finished
before we've had a chance to kill it.

In this way, you can simply pass test-killer a uuid, and via amqp it'll
check for the test on every worker unit.
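
The bookkeeping described above can be sketched as a pure function. The
function name, its arguments and the "requeue" fallback (for a uuid that
is still queued) are assumptions made for illustration; the real
test-killer talks to amqp and to the worker processes directly.

```python
def handle_kill_request(message, hostname, local_uuids, n_workers, queued_uuids):
    """Decide what one worker unit does with a tests-to-kill message.

    Returns (action, message) where action is "kill", "drop" or "requeue".
    """
    uuid = message["uuid"]
    if uuid in local_uuids:
        return "kill", message

    # The test is not on this unit: record that before passing it on.
    not_on = list(message["not-running-on"])
    if hostname not in not_on:
        not_on.append(hostname)
    updated = {"uuid": uuid, "not-running-on": not_on}

    # Every unit has looked and the test is not queued: assume it finished.
    if len(not_on) >= n_workers and uuid not in queued_uuids:
        return "drop", updated
    return "requeue", updated
```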

*web changes*
The running page now displays a link under each running job (for admins
only) which redirects to a new app under webcontrol - test-manager.

test-manager has only one endpoint, similar to request/app.py. This
endpoint is only available to a select few admins.

This list of admins is now a config option for the charm (admin-nicks).
This is in the service-bundle, with a sensible default set.

This endpoint can be passed a uuid (uuid=$uuid), which it then submits
to the tests-to-kill exchange. However, test-manager first checks that
the uuid is present in running.json, so as to avoid wasting resources
on killing a test that isn't running.

If the given uuid is found in running.json, that uuid is sent via amqp
to the test-killer services on the various worker units, where the test
is then killed.

upgrade-charm-to-docs 2024-05-02 14:22:20 UTC
docs: remove mention of new dependencies requiring a unit replacement

Author: Tim Andersson
Author Date: 2024-03-28 14:39:04 UTC

docs: remove mention of new dependencies requiring a unit replacement

filter-amqp-all-queues-option 2024-04-30 13:16:33 UTC
feat: cloud: add option to clean regex from all queues in filter-amqp

Author: Tim Andersson
Author Date: 2024-03-01 11:57:42 UTC

feat: cloud: add option to clean regex from all queues in filter-amqp

If you pass `all` instead of the queue name, filter-amqp will remove
the test requests matching the regex from all queues.

This commit also renames --all to --all-items-in-queue, so as to avoid
confusion when running filter-amqp with the queue name set to `all`.

fix-cache-amqp-creds 2024-04-30 10:57:09 UTC
fix: web: fix cache-amqp incorrectly parsing private jobs

Author: Tim Andersson
Author Date: 2024-04-30 09:04:22 UTC

fix: web: fix cache-amqp incorrectly parsing private jobs

Some private jobs were recently queued without the newline character
present in the test request string.

Due to the try-except we previously had here, we would fall back to
params={}. This was problematic, as for private jobs we rely on checking
the key value pairs of the test request message to accurately denote
them as private jobs.
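
A hedged sketch of stricter parsing, assuming a test request string is
a package name, a newline, and then JSON parameters. The `parse_request`
helper is hypothetical; it returns None for a malformed request rather
than quietly falling back to empty parameters.

```python
import json


def parse_request(body):
    """Split a request into (package, params), or return None if malformed."""
    if "\n" not in body:
        return None  # no separator, so the parameters cannot be trusted
    package, raw_params = body.split("\n", 1)
    try:
        params = json.loads(raw_params)
    except json.JSONDecodeError:
        return None
    return package, params
```

Returning None lets the caller tell a broken request apart from one that
genuinely has no parameters.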

This commit marks all test requests that are in the incorrect format as
"malformed request" in queued.json.

stop-looping-fix 2024-04-29 16:08:21 UTC
fix: worker: also fake up files in the case of unidentified testbed failure

Author: Tim Andersson
Author Date: 2024-04-29 16:08:21 UTC

fix: worker: also fake up files in the case of unidentified testbed failure

seed-new-release-update-auth-version 2024-04-26 19:19:21 UTC
fix: cloud: update swift auth version for seed-new-release

Author: Tim Andersson
Author Date: 2024-04-26 19:19:21 UTC

fix: cloud: update swift auth version for seed-new-release

The auth version on the "new" bastion is 3.0.

uuid-db-column-unique-constraint 2024-04-26 13:34:15 UTC
fix: web: add UNIQUE constraint to uuid column creation in helpers/utils.py `...

Author: Tim Andersson
Author Date: 2024-04-26 13:33:33 UTC

fix: web: add UNIQUE constraint to uuid column creation in helpers/utils.py `init_db`

Whilst this doesn't fix the issue of two "write requests" going into
the sqlite-writer queue, this commit would still mean we get no
duplicate results.

Even in the case of two duplicate queue messages, this commit ensures
that the original entry is simply replaced with the duplicate entry.
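
The effect of the constraint can be seen with an in-memory database.
The table layout and the `write_result` helper below are simplified
assumptions for illustration and do not mirror `init_db`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE result (uuid TEXT UNIQUE, exitcode INTEGER)")


def write_result(conn, uuid, exitcode):
    """Insert a result, replacing any existing row with the same uuid."""
    conn.execute(
        "INSERT OR REPLACE INTO result (uuid, exitcode) VALUES (?, ?)",
        (uuid, exitcode),
    )


write_result(conn, "b864593b", 0)
write_result(conn, "b864593b", 4)  # duplicate write request
rows = conn.execute("SELECT uuid, exitcode FROM result").fetchall()
```

Without UNIQUE on the uuid column, the second write would add a second
row instead of replacing the first.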

no-double-download-results 2024-04-26 13:34:15 UTC
fix: web: add UNIQUE constraint to uuid column creation in helpers/utils.py `...

Author: Tim Andersson
Author Date: 2024-04-26 13:33:33 UTC

fix: web: add UNIQUE constraint to uuid column creation in helpers/utils.py `init_db`

Whilst this doesn't fix the issue of two "write requests" going into
the sqlite-writer queue, this commit would still mean we get no
duplicate results.

Even in the case of two duplicate queue messages, this commit ensures
that the original entry is simply replaced with the duplicate entry.

fix-killing-tests-api-version-parsing 2024-04-25 09:30:05 UTC
fix: worker: fix api version check for datacentres where this isn't explicitl...

Author: Tim Andersson
Author Date: 2024-04-25 08:52:40 UTC

fix: worker: fix api version check for datacentres where this isn't explicitly defined

Not a critical bug, but killing any tests on datacentres without the
api version explicitly defined wasn't killing the test itself in a
structured manner, but just killing the worker service.

Fixes bug LP: #2063429

three-tmpfails-no-looping-please 2024-04-24 15:26:53 UTC
fix: worker: Never, ever let tests permanently loop

Author: Tim Andersson
Author Date: 2024-04-24 13:09:50 UTC

fix: worker: Never, ever let tests permanently loop

This commit completely removes the mechanism by which a worker is
assumed to be "broken" in some way. This mechanism has caused tests to
permanently loop on countless occasions in the past, and we've decided
to kill it with fire.

temp-disable-content-length 2024-04-24 15:09:45 UTC
fix: web: disable content-length header for static files

Author: Tim Andersson
Author Date: 2024-04-24 13:13:46 UTC

fix: web: disable content-length header for static files

As of today, britney still utilises the content-length header if it is
present in the autopkgtest.db download.

However, after recent changes [1] to the apache2 package for focal, we've
discovered that the content-length header is no longer 100% accurate.

Because of this, we will disable the content-length header.

[1] https://bugs.launchpad.net/ubuntu/+source/apache2/+bug/2061816

fix-tims-recent-docs 2024-04-23 15:11:28 UTC
docs: fix missing lines inbetween code-block and code for "Resizing Volumes" ...

Author: Tim Andersson
Author Date: 2024-04-23 15:11:28 UTC

docs: fix missing lines inbetween code-block and code for "Resizing Volumes" and "Killing Running Tests" sections

align-with-prod 2024-04-23 08:16:34 UTC
service-bundle: align with service bundle in prod

Author: Tim Andersson
Author Date: 2024-04-23 08:12:03 UTC

service-bundle: align with service bundle in prod

make-killing-tests-less-painful 2024-04-22 13:26:41 UTC
docs: add section on killing a currently running test

Author: Tim Andersson
Author Date: 2024-04-22 13:06:31 UTC

docs: add section on killing a currently running test

worker-upstream-percentage 2024-04-22 10:58:54 UTC
feat: cloud: add upstream percentage as juju config option

Author: Tim Andersson
Author Date: 2024-04-15 11:20:31 UTC

feat: cloud: add upstream percentage as juju config option

This was a feature request from Brian Murray.

Using this new feature, we can, on-the-fly, change the percentage of
jobs that will willingly take upstream tests.

This can be useful in situations where we'd like to prioritise distro
tests or prioritise upstream tests.

To modify, on the fly:

juju config autopkgtest-$type-worker worker-upstream-percentage="$perc"

Where $type is [cloud | lxd] and $perc is an integer between 1 and 100.

more-exceptions-worker-put-object 2024-04-22 09:19:16 UTC
fix: worker: Catch all exceptions in the try-except in swiftclient put_object...

Author: Tim Andersson
Author Date: 2024-04-22 08:48:40 UTC

fix: worker: Catch all exceptions in the try-except in swiftclient put_object call

We've been seeing recurring swift errors with the following traceback:

```
Apr 22 00:47:27 juju-7f2275-prod-proposed-migration-environment-3 sh[1767534]: File "/home/ubuntu/autopkgtest-cloud/worker/worker", line 1388, in request
Apr 22 00:47:27 juju-7f2275-prod-proposed-migration-environment-3 sh[1767534]: swiftclient.put_object(
...
Apr 22 00:47:27 juju-7f2275-prod-proposed-migration-environment-3 sh[1767534]: raise ConnectionError(err, request=request)
Apr 22 00:47:27 juju-7f2275-prod-proposed-migration-environment-3 sh[1767534]: requests.exceptions.ConnectionError: ('Connection aborted.', OSError("(32, 'EPIPE')"))
```

This exception isn't currently caught by the except statement, meaning
put_object won't retry in this case.
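
A minimal sketch of a retry loop that catches every exception, so that
errors like the ConnectionError above also trigger a retry. The
`put_with_retries` helper and its parameters are hypothetical; `put`
stands in for the real swiftclient put_object call.

```python
import time


def put_with_retries(put, attempts=3, delay=0.0):
    """Call put() until it succeeds or the attempts (at least one) run out."""
    last_error = None
    for _ in range(attempts):
        try:
            return put()
        except Exception as err:  # broad on purpose: any failure retries
            last_error = err
            time.sleep(delay)
    raise last_error
```

Catching Exception is deliberately broad here; the last error is still
raised once the attempts are exhausted, so failures are not hidden.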

fix-trailing-whitespace 2024-04-18 13:23:38 UTC
docs: fix trailing whitespace in administration.rst

Author: Tim Andersson
Author Date: 2024-04-18 13:23:38 UTC

docs: fix trailing whitespace in administration.rst

resize-docs 2024-04-18 09:08:40 UTC
docs: add section on resizing ceph (tmp) partitions

Author: Tim Andersson
Author Date: 2024-04-15 17:00:18 UTC

docs: add section on resizing ceph (tmp) partitions

This commit adds a section detailing the process to increase the size of
our cloud worker partitions, details when this may be a pertinent step
to take, and also concisely details our findings related to the juju
units recognising the increase in disk size.

fix-apache-transfer-encoding-for-autopkgtest-db 2024-04-16 14:45:03 UTC
fix: web: fix no content-length header for static file endpoints

Author: Tim Andersson
Author Date: 2024-04-16 14:06:10 UTC

fix: web: fix no content-length header for static file endpoints

britney recently started having an issue with downloading the
autopkgtest.db - suddenly the static endpoint had stopped returning the
"Content-Length" header for these static get requests.

This turned out to be a recent change from a security update for the
apache2 package, see below:
https://launchpad.net/ubuntu/+source/apache2/2.4.41-4ubuntu3.17
https://launchpadlibrarian.net/724225454/apache2_2.4.41-4ubuntu3.16_2.4.41-4ubuntu3.17.diff.gz

It was verified in staging that apache2/2.4.41-4ubuntu3.17 was the
problematic version - I tested apache2/2.4.41-4ubuntu3.16 and the
Content-Length header was present when downloading the db.

The header was no longer present because apache2 now serves static files
by default with the "chunked" transfer-encoding, rather than the
"identity" transfer-encoding. The "identity" transfer-encoding includes
the Content-Length header, whereas the Content-Length header is not
compatible with a chunked transfer encoding, see [1].

The following bug was opened:
https://bugs.launchpad.net/ubuntu/+source/apache2/+bug/2061816

A discussion with the package maintainer followed, and the conclusion
was that this new default behaviour is intended and is a necessary
security patch.

I was pointed to this thread:
https://bz.apache.org/bugzilla/show_bug.cgi?id=68872

The thread helpfully had a solution that I modified for our web app.

All static files with the config present in this commit now get served
with the "identity" transfer-encoding and have the appropriate
Content-Length header.

[1] https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Transfer-Encoding#directives

docs-add-queue-cleanup-section 2024-04-15 11:22:27 UTC
docs: add section on how to do queue cleanup of obsoleted packages

Author: Tim Andersson
Author Date: 2024-03-19 12:33:25 UTC

docs: add section on how to do queue cleanup of obsoleted packages

lp-2060213-fix 2024-04-11 13:02:02 UTC
fix: cloud: fix unattended upgrades interrupting lxd tests

Author: Tim Andersson
Author Date: 2024-04-11 13:02:02 UTC

fix: cloud: fix unattended upgrades interrupting lxd tests

Prior to this commit, we were seeing instances of lxd armhf tests
failing with the following:
954s E: Could not get lock /var/lib/dpkg/lock-frontend. It is held by process 763 (apt-get)
954s E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?

This is because unattended upgrades weren't disabled in our lxd armhf
runners, whereas they are for all other architectures.

This commit introduces a new setup script, setup-canonical-lxd.sh

This script runs all the necessary setup commands for our lxd-armhf
runners.

This commit also moves the setup-commands from the service bundle for
the autopkgtest-lxd-worker application into the setup-canonical-lxd.sh
script. It replaces the setup-commands entry in the service bundle with
the setup script, rather than the individual commands.

Fixes bug LP: #2060213

service-message 2024-04-11 12:06:07 UTC
feat: web: add the possibility of displaying a service message with juju config

Author: Tim Andersson
Author Date: 2024-04-10 15:43:30 UTC

feat: web: add the possibility of displaying a service message with juju config

This commit adds the possibility of displaying a short service message
at the top of all pages.

This is useful in situations where there's an issue/recent bug that'll
affect all users.

To add a service message, use:

juju config autopkgtest-web important-service-message="my message"

The message is displayed with black text on a yellow background.

Displaying the service message requires a restart of apache2.

backup-worker-logs 2024-04-10 10:10:24 UTC
feat: cloud-worker: Add service and timer for store-worker-logs

Author: Tim Andersson
Author Date: 2023-12-11 11:58:23 UTC

feat: cloud-worker: Add service and timer for store-worker-logs

lxd-cleanup-srv-files 2024-04-10 10:05:01 UTC
autopkgtest-cloud-worker: fix: remove old service files for lxd units

Author: Tim Andersson
Author Date: 2023-12-04 12:35:31 UTC

autopkgtest-cloud-worker: fix: remove old service files for lxd units

Whenever we introduce a new lxd remote and remove an old one, the
systemd services for that specific remote are still left over. This MP
fixes that, and should make debugging armhf problems a bit easier.

amend-inserts-paramstyle-named 2024-04-09 09:30:09 UTC
fix: web: use sqlite3.paramstyled = "named" where necessary

Author: Tim Andersson
Author Date: 2024-03-11 10:52:37 UTC

fix: web: use sqlite3.paramstyled = "named" where necessary

This commit makes other db write operations use the named paramstyle for
sqlite, rather than the default, which is qmark.

This makes DB inserts a bit cleaner: using named parameters rather than
passing a tuple is much safer, with the added bonus of being able to
pass dictionaries to DB inserts.

To quote waveform, the named paramstyle is the "One True Paramstyle"!

https://peps.python.org/pep-0249/#paramstyle
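
For illustration, here is an insert using named placeholders and a
dictionary with Python's sqlite3 module. The table and column names are
made up for the example and are not the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE result (uuid TEXT, package TEXT, exitcode INTEGER)")

# Each :name placeholder is looked up by key, so ordering cannot go wrong.
row = {"uuid": "b864593b", "package": "hello", "exitcode": 0}
conn.execute(
    "INSERT INTO result (uuid, package, exitcode) "
    "VALUES (:uuid, :package, :exitcode)",
    row,
)
stored = conn.execute("SELECT package, exitcode FROM result").fetchone()
```

With qmark placeholders the same insert would depend on the tuple being
in exactly the right order.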

web-fix-config-dir-permissions 2024-04-08 16:40:10 UTC
fix: web: fix file permissions of .config folder

Author: Tim Andersson
Author Date: 2024-04-08 16:24:46 UTC

fix: web: fix file permissions of .config folder

Prior to this commit, only the root user had read and write access to
this directory, which is obviously less than ideal. This is due to some
inherent behaviour of pathlib, see [1]. Because of this behaviour, the
.config directory had these permissions, but the
.config/autopkgtest-web/ directory had correct permissions.

This commit also explicitly sets the permissions for the directories,
just as an added way of ensuring these directories will have the correct
permissions.

[1] https://docs.python.org/3/library/pathlib.html#pathlib.Path.mkdir
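
The pathlib behaviour in question: with parents=True, missing parent
directories are created with default permissions and the mode argument
is ignored for them. A minimal sketch of setting the permissions
explicitly follows; the `make_config_dir` helper and the 0o755 mode are
assumptions, not necessarily what the charm uses.

```python
from pathlib import Path


def make_config_dir(home, mode=0o755):
    """Create home/.config/autopkgtest-web with explicit permissions."""
    config = Path(home) / ".config"
    target = config / "autopkgtest-web"
    target.mkdir(parents=True, exist_ok=True)
    # mkdir(parents=True) ignores `mode` for the parents, and the umask
    # can still mask it for the leaf, so chmod both directories afterwards.
    for directory in (config, target):
        directory.chmod(mode)
    return target
```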

fix-allowed-requestor-teams 2024-04-08 16:07:40 UTC
fix: web: rstrip the team name for allowed-requestor-teams

Author: Tim Andersson
Author Date: 2024-04-08 16:04:31 UTC

fix: web: rstrip the team name for allowed-requestor-teams

Prior to this commit, whitespace characters from the indented yaml juju
config options were preserved, causing LP api calls to " example-team"
or something along those lines.

generate-charm-inventory-amendment 2024-04-08 09:37:35 UTC
fix: generate-charm-inventory: ignore all merge commits

Author: Tim Andersson
Author Date: 2024-04-08 09:37:35 UTC

fix: generate-charm-inventory: ignore all merge commits

worker-re-enable-retries 2024-04-08 08:42:23 UTC
fix: worker: re-enable retries

Author: Tim Andersson
Author Date: 2024-04-08 08:42:23 UTC

fix: worker: re-enable retries

admin-page-running-for-logtail-mismatch-heuristic 2024-04-05 16:28:16 UTC
feat: web: add tests to admin page which have a mismatch between the logtail ...

Author: Tim Andersson
Author Date: 2024-04-05 16:25:22 UTC

feat: web: add tests to admin page which have a mismatch between the logtail timestamp and running_for value

Fixes bug LP: #2058463

04052024-bos02-armhf-update 2024-04-05 16:26:39 UTC
service-bundle: modify IPs for bos02 armhf in line with recent changes

Author: Tim Andersson
Author Date: 2024-04-05 16:26:39 UTC

service-bundle: modify IPs for bos02 armhf in line with recent changes

preserve-a-p-in-queued-tests 2024-04-05 15:39:31 UTC
fix: web: show all-proposed for queued tests on results pages

Author: Tim Andersson
Author Date: 2024-04-05 09:28:56 UTC

fix: web: show all-proposed for queued tests on results pages

fix-ci-failures 2024-04-05 10:51:02 UTC
fix: web: add __init__.py to private_results/ to fix CI failures

Author: Tim Andersson
Author Date: 2024-04-05 10:51:02 UTC

fix: web: add __init__.py to private_results/ to fix CI failures

fix-amqp-status-collector 2024-04-04 13:33:40 UTC
fix: web: Add RuntimeDirectoryPreserve to amqp-status-collector.service

Author: Tim Andersson
Author Date: 2024-04-04 08:06:29 UTC

fix: web: Add RuntimeDirectoryPreserve to amqp-status-collector.service

This fixes the following issue:
- When RabbitMQ is unresponsive, the amqp-status-collector script fails repeatedly
- When amqp-status-collector isn't running, the /run/amqp-status-collector/
  directory is removed, due to the behaviour of RuntimeDirectory
- When that directory is removed, running.json also gets removed
- Lots of other functionality in webcontrol depends upon this file, such as
  requests and browsing the /running or results pages

This commit fixes the issue by adding RuntimeDirectoryPreserve=yes to
amqp-status-collector.service. When this option is set to yes, the runtime
directory is not removed when the systemd unit is stopped; when set to
restart, it is only preserved across restarts of the unit.

allowed-teams-to-juju-config 2024-04-03 12:00:18 UTC
feat: web: move ALLOWED_TEAMS to juju config instead of being hardcoded in re...

Author: Tim Andersson
Author Date: 2024-02-27 14:48:22 UTC

feat: web: move ALLOWED_TEAMS to juju config instead of being hardcoded in request/submit.py

lxd-metrics-update 2024-04-03 11:12:44 UTC
fix: cloud: only check intended ips for autopkgtest-lxd-worker metrics

Author: Tim Andersson
Author Date: 2024-02-07 15:38:30 UTC

fix: cloud: only check intended ips for autopkgtest-lxd-worker metrics

This commit adds a fix to the lxd metrics. We don't currently have a metric
which reports remotes that are specified in the service bundle but aren't
present in lxc remote list on the autopkgtest-lxd-worker.

This commit therefore checks the list of intended remotes and makes note of
any intended remotes which aren't in lxc remote list.

We also currently report on remotes which aren't specified in the service bundle,
but I think that's fine to leave in the metrics as it's indicative of issues.

This commit also writes lxc-remotes.json to the ~ directory on the lxd
worker, as the metrics script now uses this information to more
accurately report the metrics.

cloud-worker-tmp-cleanup 2024-04-03 11:09:56 UTC
feat: cloud: add worker tmp cleanup config

Author: Tim Andersson
Author Date: 2024-02-16 11:23:49 UTC

feat: cloud: add worker tmp cleanup config

tmp is only cleaned up automatically on boot, not periodically. This is
problematic, as any edge case worker error that causes the worker script to
exit before cleaning up the logfile directory leaves the entire directory
in tmp, leading to low disk space errors.

This commit introduces a config file which removes files and directories
in /tmp that haven't been modified in the last 30 days.

It adds the tmp cleanup config to the service bundle common
options for the autopkgtest-cloud-worker application.

It also adds the config option to layer.yaml.

And it also writes the cleanup config to /etc/tmpfiles.d/tmp.conf

pull-amqp-push-amqp 2024-04-03 09:21:50 UTC
feat: cloud: add pull-amqp and push-amqp scripts

Author: Tim Andersson
Author Date: 2024-03-27 12:35:30 UTC

feat: cloud: add pull-amqp and push-amqp scripts

pull-amqp is a script that pulls all messages from a queue. If the script
is passed a regex, it will only pull the messages from the queue that
match said regex. If the --empty arg is passed, it'll remove said
messages.

push-amqp is a script that simply pushes a message to a specified queue.
The two scripts can be used in conjunction to easily shift specific
queue messages from one queue to another, removing the need to craft a
retry-autopkgtest-regressions command to shift tests between queues.

push-amqp can also be used to push messages to other queues, like the
sqlite-writer queue or the download-results queue.

Fixes bug LP: #2059235

configparser-read-refactor 2024-04-02 17:23:40 UTC
refactor: web&cloud: replace configparser.read with configparser.read_file or...

Author: Tim Andersson
Author Date: 2024-03-08 14:17:17 UTC

refactor: web&cloud: replace configparser.read with configparser.read_file or read_string

This commit also refactors all duplicate usage of configparser.read into
common functions shared from helpers/utils.py, and amends the unit tests
in line with these changes.
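
One practical difference between the two calls, shown as a sketch:
configparser's read() silently skips files it cannot open, whereas
opening the file yourself and calling read_file() fails loudly. The
`load_config` helper is hypothetical and is not the function in
helpers/utils.py.

```python
import configparser


def load_config(path):
    """Load an INI file, raising if the file cannot be opened."""
    parser = configparser.ConfigParser()
    with open(path, encoding="utf-8") as handle:
        parser.read_file(handle)
    return parser
```

With read(), a missing file just yields an empty parser and the error
only surfaces later, when a key lookup fails.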

bump-workers-10-percent 2024-04-02 16:48:32 UTC
service-bundle: bump all n-workers by 10% inline with recent quota changes

Author: Tim Andersson
Author Date: 2024-04-02 16:48:10 UTC

service-bundle: bump all n-workers by 10% inline with recent quota changes

drop-all-all-proposed 2024-03-28 15:04:51 UTC
fix: web: remove all all-proposed for noble

Author: Tim Andersson
Author Date: 2024-03-28 13:56:33 UTC

fix: web: remove all all-proposed for noble

exit-code-14 2024-03-28 14:19:01 UTC
fix: web: fix display of exit code 14 tests

Author: Tim Andersson
Author Date: 2024-03-28 14:19:01 UTC

fix: web: fix display of exit code 14 tests

We received a ping on IRC on 27/03/2024 about "exit code 14"
being displayed on the results pages, which is somewhat confusing
and nondescript.

This MP fixes this by giving exit code 14 tests a human-readable exit
code description that follows the autopkgtest man page.

login-button 2024-03-28 10:48:11 UTC
feat: web: share login details between browse.cgi and request.cgi

Author: Tim Andersson
Author Date: 2024-03-27 15:08:28 UTC

feat: web: share login details between browse.cgi and request.cgi

This commit makes our flask app share the flask session between
browse.cgi and request.cgi.

Users, when not logged in, will now see a "Login" button on the navbar.
Clicking this will log them in using the pre-existing mechanism in
request.cgi.

postfix-systemd-fail-strings 2024-03-28 10:10:24 UTC
fix: worker: add fail strings for systemd failures and postfix failures

Author: Tim Andersson
Author Date: 2024-03-28 10:10:24 UTC

fix: worker: add fail strings for systemd failures and postfix failures

These failures seem to cause forever looping tests, so we need to
correctly recognise them as fail strings.
