Merge ~paelzer/ubuntu-archive-tools:identify-removals-from-rebuilds into ubuntu-archive-tools:main

Proposed by Christian Ehrhardt 
Status: Work in progress
Proposed branch: ~paelzer/ubuntu-archive-tools:identify-removals-from-rebuilds
Merge into: ubuntu-archive-tools:main
Diff against target: 824 lines (+812/-0)
2 files modified
identify-removals-from-rebuilds (+677/-0)
uat_lib/cache.py (+135/-0)
Reviewer Review Type Date Requested Status
Steve Langasek Approve
Alex Burrage Pending
Ubuntu Package Archive Administrators Pending
Review via email: mp+433775@code.launchpad.net
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

The --help output is meant to be rather complete and self-explanatory, but please let me know if you need or want anything changed.

$ ./identify-removals-from-rebuilds.py --help
usage: identify-removals-from-rebuilds.py [-h] -s SERIES -c CSV_FILE [--min-arch-fails N] [--no-filtering] [--skip check-id [check-id ...]] [--verbose]

Identify removal candidates based on archive rebuild results.

This helper reads the csv files created alongside the html-based archive
rebuild reports by
  https://git.launchpad.net/~ubuntu-test-rebuild/lp-ftbfs-report/tree/source/build_status.py.

For each package in that list it checks a set of constraints that help to
decide if a package is better off being removed from the archive instead
of spending lots of effort on fixing the FTBFS.

This is meant as a countermeasure against too many FTBFS being left over
in an active release and helps Archive Admins to identify candidates on
which they can then consider a removal.

options:
  -h, --help            show this help message and exit
  -s SERIES, --series SERIES
                        Name of the series to check dependencies against.
  -c CSV_FILE, --csv CSV_FILE
                        Path to CSV file as generated for archive rebuilds.
  --min-arch-fails N    Check enough_failing_arches will ignore cases which
                        have fewer architectures failing than this value. For
                        example, set this to 5 to only show cases which fail
                        on 5 or more (Default: 1).
  --no-filtering        By default the list is filtered by each check, usually
                        making the output small and readable as well as
                        execution fast. By setting this option there will be
                        no filtering; instead the report will include all
                        FTBFS and the result of each check will be reported
                        per package (Warning: this will be slow).
  --skip check-id [check-id ...]
                        Skip the checks passed to this argument. Potential
                        values listed below as "Known checks" (default: none).
  --verbose             Report for each package that was excluded from the
                        final report why it was excluded, and provide info
                        about internal errors. (default: False)

Known checks:
  - enough_failing_arches => Did enough architectures fail to be a problem?
  - inacceptable_state => Is the fail state really inacceptable?
  - no_dependencies => Does nothing depend on this package's binaries?
  - no_build_dependencies => Does nothing build-depend on this package?
  - is_not_seeded => Is this package not seeded in Ubuntu?

It has options to skip particular checks if you want a different look at those lists than the check-everything default.
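
For instance, a run that ignores the seeding and build-dependency constraints could look like this (check ids from the "Known checks" list above; the series and csv values are illustrative):

$ ./identify-removals-from-rebuilds.py --series jammy --csv test-rebuild-20211217-jammy-jammy.csv --skip is_not_seeded no_build_dependencies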

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Example usage against a public archive rebuild report (you can pass a path or URL to either the csv or the html file; in this example I point to the html report which users usually look at manually):

$ ./identify-removals-from-rebuilds.py --series jammy --csv https://people.canonical.com/~ginggs/ftbfs-report/test-rebuild-20211217-jammy-jammy.html
Reading CSV file https://people.canonical.com/~ginggs/ftbfs-report/test-rebuild-20211217-jammy-jammy.html ...
Info: got an html file path, using: https://people.canonical.com/~ginggs/ftbfs-report/test-rebuild-20211217-jammy-jammy.csv instead
Info: http[s]:// path, fetching https://people.canonical.com/~ginggs/ftbfs-report/test-rebuild-20211217-jammy-jammy.csv
1893 cases found
Check "enough_failing_arches": Did enough architectures fail to be a problem?
Check done, 1893 cases left
Check "inacceptable_state": Is the fail state really inacceptable?
Check done, 1851 cases left
Check "no_dependencies": Does nothing depend on this packages binaries?
Check done, 0 cases left
No cases left, exiting ...

As you can see, after applying the checks we discussed in Prague there is nothing left for a removal :-/
But I hope this will not always be the case (otherwise creating this tool was rather pointless), especially if such rebuild reports are done more recently or target a more recent release (in both cases fewer of the failures will already have been fixed).

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

There is also a (slow) non-filtering mode in which you get a tabular view of all checked attributes. Here is an example of that as well (I'm sure Launchpad will mangle the layout, but it looks fine in a console):

$ ./identify-removals-from-rebuilds.py --series jammy --csv /home/paelzer/work/lp-ftbfs-report/lp-ftbfs-report/test-rebuild-20211217-jammy-jammy-short.csv --no-filter
Reading CSV file /home/paelzer/work/lp-ftbfs-report/lp-ftbfs-report/test-rebuild-20211217-jammy-jammy-short.csv ...
14 cases found
Check "enough_failing_arches": Did enough architectures fail to be a problem?
Check "inacceptable_state": Is the fail state really inacceptable?
Check "no_dependencies": Does nothing depend on this packages binaries?
Check "no_build_dependencies": Does nothing build-depend on this package?
Check "is_not_seeded": Is this package not seeded in Ubuntu?
14 (100.0%)

## Final report ##
Found 14 relevant cases
Captions:
Fail = Did enough architectures fail to be a problem?
AccS = Is the fail state really inacceptable?
Dep = Does nothing depend on this package's binaries?
Bdep = Does nothing build-depend on this package?
Seed = Is this package not seeded in Ubuntu?

| Source Package | State | Failing Architectures | Fail | AccS | Dep | Bdep | Seed |
| grubzfs-testsuite | Failed to upload | riscv64 armhf | 2 | Yes | No | No | No |
| zsys | Always FTBFS | riscv64 armhf | 2 | No | Yes | Yes | No |
| zsys | Always DepWait | s390x | 1 | No | Yes | Yes | No |
| cross-toolchain-base | Always FTBFS | amd64 | 1 | No | Yes | Yes | No |
| icu | Always FTBFS | amd64 arm64 armhf ppc64el s390x | 5 | No | Yes | Yes | Yes |
| pyicu | Always FTBFS | arm64 armhf ppc64el s390x amd64 | 5 | No | Yes | Yes | Yes |
| glibc | Always FTBFS | armhf s390x | 2 | No | Yes | Yes | Yes |
| strace | Always FTBFS | arm64 armhf ppc64el amd64 s390x | 5 | No | Yes | Yes | Yes |
| servicelog | Always FTBFS | ppc64el | 1 | No | Yes | No | No |
| ppc64-diag | Always FTBFS | ppc64el | 1 | No | Yes | No | No |
| libreoffice | Always FTBFS | arm64 | 1 | No | Yes | Yes | Yes |
| mir | Always FTBFS | riscv64 armhf arm64 s390x ppc64el amd64 | 6 | No | No | No | Yes |
| squid | Always FTBFS | riscv64 armhf arm64 s390x ppc64el amd64 | 6 | No | Yes | No | No |
| libstatgrab | Always FTBFS | armhf arm64 s390x riscv64 amd64 ppc64el i386 | 7 | No | Yes | Yes | Yes |

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Since it is a new file there will be no conflict no matter how long this MP stays stale.
But I'd like to hear whether this fulfills what we discussed, hence a 2023 ping for a review.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Hi and happy 2023,
There was no feedback yet and it might have fallen out of your inboxes' "most-recent" cache.

But the recent Lunar Test rebuild is a great opportunity to bring this up again.
I've run the tool against [1] and found candidates to consider.

This test rebuild has just started, so this does not mean we should lightly remove all of those.
But it is a great chance to think about what further checks (if any) we might want on top of those already implemented.

It is a nice list pointing to some questionable cases, like [2] for example.

The one thing that comes to my mind is that it might be useful to also point to the package on LP (like [2]) in addition to the build log, and to report the version. But I'll wait to hear what the rest of you think before spending more time; maybe it isn't as useful as we initially thought?

[1]: https://people.canonical.com/~ginggs/ftbfs-report/test-rebuild-20221215-lunar-normal-lunar.html
[2]: https://launchpad.net/ubuntu/+source/audio-convert

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

https://paste.ubuntu.com/p/7Xj3bJkDqW/

The list is still long, hence I'm waiting for suggestions on what to filter further.

Revision history for this message
Steve Langasek (vorlon) wrote :

On a test run of the script, the very first package listed for me is:

| linux-starfive-5.17 | Failed to build | riscv64 |

which is an Ubuntu-specific package and shouldn't be removed just because it FTBFS.

I thought when we discussed this in Prague, one of the criteria for packages going on this list was that it had been removed from Debian testing. Am I misremembering? I definitely don't see that as one of the checks in the implementation. (Checking for the package being present in Debian unstable but absent from Debian testing is also a check that can be done against downloaded Sources files without a lot of separate API calls, so I would expect it to significantly speed up the runtime.)

OTOH, although the first 4 packages reported are Ubuntu-specific packages that should not be removed, the fifth one is a FTBFS Ubuntu-specific package that *should* be pushed towards
removal: bot-sentry had one upload to Ubuntu in 2008, no uploads since. No idea how it's flown under the radar as being FTBFS. However, because it's Ubuntu-specific, I think we have a duty to file bugs in Launchpad first and give a maintainer an opportunity to respond rather than simply removing it as we would for a leaf package synced from Debian that has been removed from testing. Perhaps the output should be partitioned into two categories, for Debian vs Ubuntu removal candidates?

Another category of information that is important for deciding what to do with the package, and takes effort to manually chase down - if the publishing history says the package was originally synced from Debian, but is no longer in testing or unstable, it should definitely be removed - such packages can be removed using process-removals, but were probably missed at the time of the Debian removal because they still had Ubuntu-specific reverse-dependencies.

So looking at the first 10 packages in the report, I have:
- 4 Ubuntu-specific source packages that should not be removed because of a build failure because they are maintained
- 2 Ubuntu-specific packages that are candidates for removal but did not yet have any bugs filed (bot-sentry, add-apt-key)
- 2 Ubuntu-specific packages that had a bug report filed in LP since {2013,2016} about being totally useless, which I've therefore removed directly (foo-plugins, add-apt-key)
- 3 packages that are present in Debian testing and unstable and should therefore be out of scope for this report as they are not low-hanging fruit (glosstex, hotspot, acl2)

That's not a great hit rate at present and I think we need to improve it quite a bit before I would want to spend much more time on the output. Perhaps it can be improved as per some of the above comments; perhaps you have other ideas.

In any case, I would like to see some improvements here before merging.

review: Needs Fixing
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

> I thought when we discussed this in Prague, one of the criteria for packages going on this list was that it had been removed from Debian testing. Am I misremembering?

It might have been mentioned, but it wasn't in the notes we took.
I agree that this check will be quite useful and will consider adding it.

> Perhaps the output should be partitioned into two categories, for Debian vs Ubuntu removal candidates?

I'd not partition it, but add "ubuntu-only" as a non-filtering check, which would then appear as an extra column. That fits the check-and-report architecture more nicely and should be sufficient to allow looking at those two categories separately. It also allows a user to change that.
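
A minimal sketch of how such a check could slot into the CHECKS structure the script uses (the check id "ubuntu_only" and its helper are illustrative; the SOURCE_MAP lookups mirror what later became the "existence" check in the diff below):

    def check_ubuntu_only(_ftbfs, pkg):
        """Illustrative: flag sources present neither in Debian testing nor unstable."""
        in_debian = pkg in SOURCE_MAP['unstable'] or pkg in SOURCE_MAP['testing']
        # Truthy results are kept; a string doubles as the report-column value.
        return True if in_debian else "Ubuntu-only"

    CHECKS["ubuntu_only"] = {
        "enabled": True,
        "reportworthy": True,  # always rendered as an extra column
        "checktxt": "Is this package only present in Ubuntu?",
        "fn": check_ubuntu_only,
        "short": "Ubuntu-only",
    }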

> Another category of information that is important for deciding what to do with the package, and takes effort to manually chase down - if the publishing history says the package was originally synced from Debian, but is no longer in testing or unstable, it should definitely be removed

Yep, that is another good suggestion for a check - again, most likely as a reported field instead of a filtering one.

> That's not a great hit rate at present and I think we need to improve it quite a bit before I would want to spend much more time on the output.
...
> In any case, I would like to see some improvements here before merging.

No problem, hence I was asking for opinions after I completed what my notes from Prague listed.

I'll also add the ideas I had that mostly help to navigate the result (report the version, link to the Launchpad overview and tracker) to ease follow-on activity.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

I think another non-filtering check could be to probe whether the package version was untouched since the last release; if so, it is more likely to be old cruft that is unmaintained.
I think we could report the oldest release "since when it is on that version".
I need to check how slow such a check would be; in the past I was quite happy with a local madison-lite, but that is quite some setup effort.
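
For reference, rmadison prints one pipe-separated row per publication (package | version | release | architectures), so the oldest release still carrying the failing version can be derived by sorting on the version column. A sketch with illustrative values:

$ rmadison -u ubuntu -a source hello
 hello | 2.10-2build2 | focal | source
 hello | 2.10-2build4 | jammy | source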

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Changes:
- check where a package exists to help decision making
- add links to Ubuntu and Debian overview pages to the output
- check and report since when a package was not touched
- some table style polishing
- speed up the is-seeded check

Example of the new output:
https://paste.ubuntu.com/p/643K6dgjgF/

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

I think this is ready for another look

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

From my look at the latest list it really shows quite a lot of good candidates.
There are plenty of the kind Steve described as "removed from testing but still in Ubuntu", and given that we report the last update we can see many date back many years.

The set of Ubuntu-only packages also seems good for cleanup, either by filing bugs or by pinging the owning team. Or consider removing stuff that was important and Ubuntu-only long ago, but not anymore.

I'm happy to learn about further exclusion criteria if others look at the list and find another systematic reason for entries not to belong on this cleanup list. If there are no such arguments, we might consider merging it, as it then seems to be as helpful as it can be.

Revision history for this message
Steve Langasek (vorlon) wrote :

halting my review for the moment because downloading the Debian Sources.gz for me is so unreliable I'm not able to iterate.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote (last edit ):

> halting my review for the moment because downloading the Debian Sources.gz for me is so unreliable I'm not able to iterate.

Interesting, I didn't face anything like that when working on it yesterday or in December.
Is it anything we can fix on our side, or just some kind of server overload or bad connection?
... trying myself ...
Hmm, yeah, this is indeed, ... umm, "rather slow" today.

To help overcome this problem in general I have reworked the downloading code:
1. It now uses the retry mechanisms of python3-tenacity for any kind of flaky issue
2. It allows specifying alternative archive URLs

The former works without the user needing to do anything.
The latter has the usual "give me a prefix" options.
The following worked really well for me today: `--archive-ubuntu http://de.archive.ubuntu.com/ubuntu --archive-ubuntu-ports http://mirror.kumi.systems/ubuntu-ports`
If you have a local archive mirror on your disk anyway you can even use `--archive-ubuntu /var/apt-mirror/ubuntu --archive-ubuntu-ports /var/apt-mirror/ubuntu`, which saves you the download time.

While looking at it I found that the tool has also grown complex enough to deserve proper logging, which I added as well.

Please let me know if you face anything else that I (or rather the code) could help with.

Revision history for this message
Steve Langasek (vorlon) wrote :

On Fri, Jan 13, 2023 at 08:27:38AM -0000, Christian Ehrhardt  wrote:
> > halting my review for the moment because downloading the Debian Sources.gz for me is so unreliable I'm not able to iterate.

> Interesting, I didn't face anything like that when working on it yesterday
> or in December. Is it anything we can fix on our side, or just some kind of
> server overload or bad connection?
> ... trying myself ...
> Hmm, yeah, this is indeed, ... umm, "rather slow" today.

I know there were some Canonical networking issues happening yesterday, so
perhaps it's related. It seems better this morning.

Revision history for this message
Steve Langasek (vorlon) :
review: Needs Fixing
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Thanks for your further suggestions and your experience on how the list could be further trimmed.

Just FYI on the delay: since I implement this mostly in breaks between meetings, it takes a while to complete.

So far I've made the few trivial changes you asked for and added filtering of cases which did not have a binary built on the architectures that are now reported as failing.
That shrunk the list from 245 -> 229 cases.
This is fine, but you said "I find the list is dominated by packages that build on amd64 but ftbfs on one or more of our ports", which sounds like there should be more. If you, by any chance, still have a list of cases from your tests that were in this category, please let me know so that I can cross-check them.

I hope I'm able to find some time to look at the caching you asked for before marking it ready for review again.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

> "wrong semantics of counting number of failed archs instead of number of successful archs."

I'm not sure I agree with that being the wrong semantic.
In general I'd assume one is interested as soon as at least one architecture fails, which is the default. OTOH it is clear that a package is only in the list to begin with if at least one failure happened. So in the default config I admit that all cases will pass this check (it only takes about 20 µs, which should be no problem; otherwise I'd change it to only run if someone specified --min-arch-fails > 1).

But the actual use case is, if some day one wants to look for those cases that are not one-offs, one will want to look for those failing on more than one architecture. For that one might bump up --min-arch-fails.

Remember that the comparison to "former succeeding builds" is done in other stages.
They come later because they need to reach out more and are thereby slower.

This is not what this check is about; counting "successful builds" instead would be odd here, as some packages only try to build on 3 architectures, others on all, others on just one. If the filter here were "at least that many successful builds", what would it really check, and what would be a sensible default number of builds for a package not to be reported?
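
Concretely, the check is just a threshold on the count of failing architectures; this mirrors check_enough_failing_arches from the diff below:

    def check_enough_failing_arches(ftbfs, _pkg):
        """Filter cases with fewer failing architectures than --min-arch-fails."""
        failing_arch_count = len(ftbfs.get("fail_arches"))
        if failing_arch_count < ARGS.min_arch_fails:
            return False  # filtered out of the report
        return failing_arch_count  # truthy: kept, and usable as a report value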

> I would suggest we add cache handling for these ...

In regard to the caching, I have added it but did not (yet) modify any other tool, so that we can first agree on the structure of the cache. For now I have gone with a flat structure in the cache directory based on what the tools fetch - that is rather readable in code and works fine.

Potential further changes (not sure how far we want/need to go):
- Put all caching functions in a file shared between the scripts using it (since we usually run from git I'm not keen on making it a full lib in the lib path, but more like a local relative include)
- Modify other users of caching to use this shared code as well (currently find-proposed-cluster and retry-autopkgtest-regressions do have their own copy)
- Modify other consumers of Package/Source data to use the cache
- Change the flat structure to a directory tree structure that matches where it fetches from, but then it would come awfully close to being a full apt-mirror, which I'd not like

All of these might be overkill - I've held them back until I hear your preference on them and on the basic cache structure to begin with.
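
For illustration, the flat layout simply slugifies the fetched URL into a single file name in the cache directory (see url_to_cache_file in uat_lib/cache.py below); with python-slugify's defaults that comes out roughly as:

    >>> from slugify import slugify
    >>> slugify('https://ftp.debian.org/debian/dists/testing/main/source/Sources.gz',
    ...         separator='.')
    'https.ftp.debian.org.debian.dists.testing.main.source.sources.gz'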

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

FYI: I've started to illustrate how this could be used as a lib inside the UAT repo.
I have some ideas to make it more useful in general; please allow me another day or two to push these new commits once completed.

Revision history for this message
Steve Langasek (vorlon) wrote :

On Wed, Feb 08, 2023 at 08:52:58AM -0000, Christian Ehrhardt  wrote:
> > "wrong semantics of counting number of failed archs instead of number of successful archs."

> I'm not sure I agree with that being the wrong semantic.
> In general I'd assume one is interested as soon as at least one
> architecture fails, which is the default. OTOH it is clear that a package
> is only in the list to begin with if at least one failure happened. So in
> the default config I admit that all cases will pass this check (it only
> takes about 20 µs, which should be no problem; otherwise I'd change it to
> only run if someone specified --min-arch-fails > 1).

> But the actual use case is, if some day one wants to look for those cases
> that are not one-offs, one will want to look for those failing on more
> than one architecture. For that one might bump up --min-arch-fails.

> Remember that the comparison to "former succeeding builds" is done in
> other stages. They come later because they need to reach out more and are
> thereby slower.

> This is not what this check is about; counting "successful builds" instead
> would be odd here, as some packages only try to build on 3 architectures,
> others on all, others on just one. If the filter here were "at least that
> many successful builds", what would it really check, and what would be a
> sensible default number of builds for a package not to be reported?

My earlier comment, buried in the code comments on an earlier revision:

"From my POV, a package is an obvious candidate for source removal and should
be on this report only if it fails to build for all architectures for which
it is tried. This is different than having a fixed number of architectures
it fails on, because some packages limit the architectures they try to build
on, or build only Architecture: all packages, so the necessary number of
packages it fails on could be 1.

Packages that FTBFS on only a subset of architectures for which they
previously built are not good candidates for archive admins to do source
removal on. We should prefer either binary removal or fixing the build -
and deciding which of these is the correct approach requires developer time,
so I don't think they should be on this report."

This is what I mean by counting successful builds.

> - Put all caching functions in a file shared between the scripts using it
>   (since we usually run from git I'm not keen on making it a full lib in
>   the lib path, but more like a local relative include)

> - Modify other users of caching to use this shared code as well (currently
> find-proposed-cluster and retry-autopkgtest-regressions do have their
> own copy)

These are good ideas I think, but probably not as part of this particular
MP?

> - Change the flat structure to a directory tree structure that matches
> where it fetches from, but then it would come awfully close to being a
> full apt-mirror, which I'd not like

I'm not keen on this one

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

> This is what I mean by counting successful builds.

Thanks for explaining!
Since we already have this scenario handled via the "nobinaries" check I think there is no need to change/add things in that regard.

> > - Put all caching functions in a file shared between the scripts using it
> > (since we usually run from git I'm not keen on making it a full lib in
> > the lib path, but more like a local relative include)
>
> > - Modify other users of caching to use this shared code as well (currently
> > find-proposed-cluster and retry-autopkgtest-regressions do have their
> > own copy)
>
> These are good ideas I think, but probably not as part of this particular
> MP?

Ok, I'll not modify the other users/scripts when landing this MP.
But I'd at least like to make it easy to re-use and to show how it can be used.
Therefore I've moved it into an includable source to be the first example.
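
For reference, consumption then boils down to a context manager, just like identify-removals-from-rebuilds itself uses it in the diff below:

    import apt_pkg
    import uat_lib.cache as uat_cache

    apt_pkg.init()
    url = 'http://archive.ubuntu.com/ubuntu/dists/jammy/main/source/Sources.gz'
    # Fetches into ~/.cache/ubuntu-archive-tools/ (or reuses a copy younger
    # than 24 hours), transparently gunzips and yields an open file object.
    with uat_cache.CachedResource(url) as sources_file:
        for section in apt_pkg.TagFile(sources_file):
            print(section['Package'], section['Version'])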

By chance Utkarsh was visiting me last weekend and already had a problem/task which would be able to re-use it :-)

> > - Change the flat structure to a directory tree structure that matches
> > where it fetches from, but then it would come awfully close to being a
> > full apt-mirror, which I'd not like
>
> I'm not keen on this one

Good, then it is already two of us :-)

I have now pushed the latest changes; let me know if you think we can land it this way or if more is needed.

a70e101... by Christian Ehrhardt 

uat_lib/cache.py: update copyright

In the initial form this had some code of Iain which is why I kept
it initially, but in the form it was committed there is actually
nothing left of it, so drop him from the license header.

Signed-off-by: Christian Ehrhardt <email address hidden>

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

By now some of the existing caching was already refactored to be in utils.py
See https://git.launchpad.net/ubuntu-archive-tools/commit/?id=c60760abfefd92838557cf05a413f03ea4af407f

I'll need to revamp my MR to extend and use that; back to WIP until I find the time for that.

Revision history for this message
Steve Langasek (vorlon) wrote :

This is looking pretty good now in terms of behavior. I've reviewed 60 packages so far from the list of recommendations; of these:
- 57 have been removed
- 2 are appropriate for the script to recommend, but I have not removed them
- 1, c-evo-dh, looks wrong to me to be included in the list because it has not failed to build on any of the archs for which binaries are currently published; it only failed on armhf, riscv64, and s390x, which the script correctly reports.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Reminder from MM chat:

[01:11] <vorlon> [05e] one thing that could be a nice improvement would be if it included a link to file a removal bug with a standard template
[01:11] <vorlon> [05f] à la https://bugs.launchpad.net/ubuntu/+source/opendrim-lmp-os/+bug/2015155
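
A minimal sketch of how such a link could be built, assuming Launchpad's +filebug page accepts prefilled field.title and field.comment query parameters (helper name and template text are illustrative):

    from urllib.parse import quote

    def removal_bug_url(pkg):
        """Hypothetical helper: prefilled LP removal-bug link for a source."""
        title = quote(f'Please remove {pkg} from the archive')
        comment = quote('This package FTBFS in the latest archive test rebuild '
                        'and was identified as a removal candidate by '
                        'identify-removals-from-rebuilds.')
        return (f'https://bugs.launchpad.net/ubuntu/+source/{pkg}/+filebug'
                f'?field.title={title}&field.comment={comment}')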

Revision history for this message
Steve Langasek (vorlon) wrote :

It appears the issues with the quality of the recommendations have been resolved. I've been removing dozens of FTBFS packages from lunar over the past two weeks using identify-removals-from-rebuilds as input.

I have some quibbles with the UX of the output and I know Christian still wants to do some code refactoring, but from my perspective this is ready to land and be more widely consumed by archive admins.

review: Approve
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Reminder for a debug session:
[16:56] <vorlon> cpaelzer: so a rerun of identify-removals-from-rebuilds is now reporting candidates that it wasn't reporting yesterday, by almost double... hmm. (example: chibi-scheme)
[16:57] <vorlon> cpaelzer: and chibi-scheme shouldn't be a candidate, the only build failure is an arch for which it's not built in lunar... dunno what's failed there when this check worked previously
[17:09] <cpaelzer> I didn't change any code yet
[17:09] <cpaelzer> since it checks various conditions, like versions in Debian/Ubuntu and such maybe any of that changed and makes it a candidate now
[17:10] <cpaelzer> I could download the ftbfs report, reduce it to just chibi-scheme and then via cmdline we can bump up the loglevel to understand more in detail why it is shown now

Revision history for this message
Steve Langasek (vorlon) wrote :

This change in behavior was root-caused to a change in the contents of the report, so it is not a regression in the tool.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

More reminders for improvement from chat:
... or they are `superseded` failures on the ftbfs report and should not be candidates

TODO: check if we can detect and filter out the superseded cases
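
A possible shape for that filter, reusing the version the script already derives from the build-log URL and the Sources data it loads ("superseded" is assumed here to mean the archive has moved past the version that failed; sketch only, check name hypothetical):

    def check_not_superseded(ftbfs, pkg):
        """Hypothetical: drop cases whose failing build version is superseded."""
        match = re.match(rf'.*{pkg}_(.*)_BUILDING.txt', unquote(ftbfs['buildlog']))
        current = SOURCE_MAP['ubuntu'].get(pkg)
        if not match or current is None:
            return True  # cannot tell, keep the case
        # Keep only if the archive still carries the version that failed.
        return apt_pkg.version_compare(current, match.group(1)) <= 0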

Unmerged commits

a70e101... by Christian Ehrhardt 

uat_lib/cache.py: update copyright

In the initial form this had some code of Iain which is why I kept
it initially, but in the form it was committed there is actually
nothing left of it, so drop him from the license header.

Signed-off-by: Christian Ehrhardt <email address hidden>

5301c02... by Christian Ehrhardt 

make caching usable outside of identify-removals-from-rebuilds

Signed-off-by: Christian Ehrhardt <email address hidden>

fa3869d... by Christian Ehrhardt 

make caching usable outside of identify-removals-from-rebuilds

Signed-off-by: Christian Ehrhardt <email address hidden>

f4797f3... by Christian Ehrhardt 

identify-removals-from-rebuilds: use UAT caching

Use the common path for caching data in regard to ubuntu
archive tools for Packages and Sources.

If users point to a local path with a mirror via commandline arguments
no caching is used as the data already is local and that would
just create redundant data on disk.

Suggested by Steve and inspired by as it is used in
retry-autopkgtest-regressions and find-proposed-cluster.

Signed-off-by: Christian Ehrhardt <email address hidden>

7603642... by Christian Ehrhardt 

identify-removals-from-rebuilds: remove duplicate Error:

On some logger.error calls there was still another 'Error:' from
the pre-logger code which appears as duplicate in the output.
Remove those.

Signed-off-by: Christian Ehrhardt <email address hidden>

5e78466... by Christian Ehrhardt 

identify-removals-from-rebuilds: filter cases that have no binaries

If a package has no binaries on the architecture that is now failing
to build we consider this not a problem, filter those out.

Signed-off-by: Christian Ehrhardt <email address hidden>

9f14293... by Christian Ehrhardt 

identify-removals-from-rebuilds: add missing punctuation

Signed-off-by: Christian Ehrhardt <email address hidden>

ea317fa... by Christian Ehrhardt 

identify-removals-from-rebuilds: readability enhancements

Signed-off-by: Christian Ehrhardt <email address hidden>

cd304fe... by Christian Ehrhardt 

identify-removals-from-rebuilds: make source processing less noisy

Signed-off-by: Christian Ehrhardt <email address hidden>

1e690d9... by Christian Ehrhardt 

identify-removals-from-rebuilds: ignore pkgs that built

The states 'ALWAYSFTBFS' and 'ALWAYSDEPWAIT' have never produced
anything that went into Ubuntu, hence we do not need to include them.
Reducing these false positives by adding them to acceptable-states.

Signed-off-by: Christian Ehrhardt <email address hidden>

Preview Diff

1diff --git a/identify-removals-from-rebuilds b/identify-removals-from-rebuilds
2new file mode 100755
3index 0000000..6037fbd
4--- /dev/null
5+++ b/identify-removals-from-rebuilds
6@@ -0,0 +1,677 @@
7+#!/usr/bin/python3
8+# -*- coding: utf-8 -*-
9+# Copyright © 2022 Christian Ehrhardt <christian.ehrhardt@canonical.com>
10+# License:
11+# GPLv2 (or later), see /usr/share/common-licenses/GPL
12+"""Identify removal candidates based on archive rebuild results.
13+
14+This helper reads the csv files created alongside the html-based archive
15+rebuild reports by
16+  https://git.launchpad.net/~ubuntu-test-rebuild/lp-ftbfs-report/tree/source/build_status.py.
17+
18+For each package in that list it checks a set of constraints that help to
19+decide if a package is better off being removed from the archive instead
20+of spending lots of effort on fixing the FTBFS.
21+
22+This is meant as a countermeasure against too many FTBFS being left over
23+in an active release and helps Archive Admins to identify candidates on
24+which they can then consider a removal.
25+"""
26+
27+# Requirements (Ubuntu Packages):
28+# - distro-info
29+# - ubuntu-dev-tools
30+# - python3-requests
31+# - python3-slugify
32+# - python3-tenacity
33+
34+import argparse
35+import json
36+import logging
37+import re
38+import subprocess
39+import sys
40+import urllib.request
41+
42+from urllib.parse import unquote
43+
44+import apt_pkg
45+
46+from debian.debian_support import Version
47+
48+import uat_lib.cache as uat_cache
49+
50+LOGGER = logging.getLogger(__name__)
51+
52+
53+STATE_TO_TEXT = {
54+ 'FAILEDTOBUILD': 'Failed to build',
55+ 'ALWAYSFTBFS': 'Always FTBFS',
56+ 'ALWAYSDEPWAIT': 'Always DepWait',
57+ 'NOREGRFTBFS': 'NoRegr FTBFS',
58+ 'NOREGRDEPWAIT': 'NoRegr DepWait',
59+ 'MANUALDEPWAIT': 'Dependency wait',
60+ 'CHROOTWAIT': 'Chroot problem',
61+ 'UPLOADFAIL': 'Failed to upload',
62+ 'CANCELLED': 'Cancelled build',
63+}
64+
65+STATES_ACCEPTABLE = ['MANUALDEPWAIT',
66+ 'CHROOTWAIT',
67+ 'UPLOADFAIL',
68+ 'CANCELLED',
69+ 'ALWAYSFTBFS',
70+ 'ALWAYSDEPWAIT']
71+
72+ARCHES = {
73+ "amd64",
74+ "arm64",
75+ "armhf",
76+ "i386",
77+ "ppc64el",
78+ "riscv64",
79+ "s390x",
80+}
81+
82+SEED_URL = 'http://qa.ubuntuwire.org/ubuntu-seeded-packages/seeded.json.gz'
83+
84+ARCHIVE_DEBIAN = "https://ftp.debian.org/debian"
85+ARCHIVE_UBUNTU = "http://archive.ubuntu.com/ubuntu"
86+ARCHIVE_UBUNTU_PORTS = "http://ports.ubuntu.com/ubuntu-ports"
87+
88+
89+def parse_csv_file(lines=None):
90+ """Parse Data written by build_status.py.
91+
92+    Sadly this isn't really CSV conformant; the unquoted inner commas in
93+    the architecture list break standard parsers. But OTOH it isn't too
94+    complex either.
95+ """
96+ ftbfs = []
97+ for row in lines:
98+ if isinstance(row, bytes):
99+ row = row.decode("utf-8")
100+ try:
101+ package, buildlog, explain = row.rstrip('\n').split(',', 2)
102+ except ValueError:
103+ LOGGER.error('row does not match expected csv format: %s',
104+ row)
105+ continue
106+
107+ allarches, delim, state = explain.rpartition(' ')
108+        if not (allarches and delim and state):
109+ LOGGER.error('row does not match compound format: %s',
110+ explain)
111+ continue
112+
113+ arches = [a.strip() for a in allarches.strip('[]').split(',')]
114+ ftbfs.append({"pkg": package,
115+ "buildlog": buildlog,
116+ "fail_arches": arches,
117+ "state": state,
118+ })
119+ return ftbfs
120+
121+
122+def load_csv_file(csv_file_name=None):
123+ """Read the result of an archive rebuild report.
124+
125+ If using .html suffix replace it with .csv
126+ If prefixed with http[s]: fetch it via http
127+ """
128+ if csv_file_name.endswith(".html"):
129+ csv_file_name = re.sub(r'(.*).html', r'\1.csv', csv_file_name)
130+ LOGGER.debug('Got an html file path, using: %s instead', csv_file_name)
131+
132+ if csv_file_name.startswith(('http://', 'https://')):
133+        LOGGER.debug('http[s]:// path, fetching %s', csv_file_name)
134+ try:
135+ with urllib.request.urlopen(csv_file_name) as data:
136+ return parse_csv_file(data)
137+ except (urllib.error.HTTPError, urllib.error.URLError):
138+ LOGGER.error('failed to fetch %s', csv_file_name)
139+ sys.exit(1)
140+ else:
141+ try:
142+ with open(csv_file_name, newline='', encoding="utf-8") as file:
143+ return parse_csv_file(file)
144+ except OSError:
145+ LOGGER.error('failed to read %s', csv_file_name)
146+ sys.exit(1)
147+ return []
148+
149+
150+def load_seeded_sources():
151+ '''Download seeding info and map to sources.
152+
153+ This is slow, but much faster than calling seeded-in-ubuntu each time.
154+ Returns tuple with a dict and a list:
155+ - Dict:
156+ - key: seeded source package
157+ - value: list of binaries related to this source
158+ - List: contains all source packages that relate to a seeded binary
159+ '''
160+ seeded_sources = []
161+ src_to_bin = {}
162+
163+ with uat_cache.CachedResource(SEED_URL) as seed:
164+ seeded_list = json.load(seed)
165+
166+ apt_pkg.init()
167+ components = get_components("ubuntu")
168+ for component in components:
169+ for arch in ARCHES:
170+ url = get_pkg_url(ARGS.series, component, arch)
171+ with uat_cache.CachedResource(url) as sources_file:
172+ apt_sources = apt_pkg.TagFile(sources_file)
173+ for section in apt_sources:
174+ binary = section["Package"]
175+ try:
176+ source = section["Source"]
177+ except KeyError:
178+ source = binary
179+
180+ if source in src_to_bin:
181+ if binary not in src_to_bin[source]:
182+ src_to_bin[source].append(binary)
183+ else:
184+ src_to_bin[source] = [binary]
185+
186+ if binary in seeded_list:
187+ seeded_sources.append(source)
188+
189+ return (src_to_bin, seeded_sources)
190+
191+
192+def check_enough_failing_arches(ftbfs, _pkg):
193+ """Check how many architectures failed and filter if not enough."""
194+ failing_arch_count = len(ftbfs.get("fail_arches"))
195+ if failing_arch_count < ARGS.min_arch_fails:
196+ return False
197+ return failing_arch_count
198+
199+
200+def check_inacceptable_state(ftbfs, _pkg):
201+ """Check if fail state is usually not caused by the PKG (= acceptable)."""
202+ return ftbfs.get("state") not in STATES_ACCEPTABLE
203+
204+
205+def get_rmadison_info(rma):
206+    '''Returns the text output of rmadison as a structure.
207+
208+ Strips common suffixes and epochs if present.
209+ Might raise a KeyError if there was nothing found.'''
210+ rma_info = []
211+ for rma_line in rma.split('\n'):
212+ rma_elements = rma_line.split('|')
213+ if len(rma_elements) == 4:
214+ rma_release = rma_elements[2].strip()
215+ for sfx in ['/universe', '/restricted', '/multiverse']:
216+ rma_release = rma_release.removesuffix(sfx)
217+ for pocket in ["", "-security", "-updates"]:
218+ rma_release = rma_release.removesuffix(pocket)
219+ # might have an epoch prefix to strip
220+ rma_version = re.sub(r'^\d+:', '', rma_elements[1].strip())
221+ rma_version = Version(rma_version)
222+ rma_arches = rma_elements[3].strip().split(',')
223+ rma_info.append({'release': rma_release,
224+ 'ver': rma_version,
225+ 'arch': rma_arches})
226+ return rma_info
227+
228+
229+def check_untouched(ftbfs, pkg):
230+ '''Check if the version of this package didn't change.
231+
232+ Packages passing this check are more likely old cruft that
233+    is unmaintained. Returns the first Ubuntu release that had this version.'''
234+ try:
235+ cmd = ['rmadison', '-u', 'ubuntu', '-a', 'source', pkg]
236+ rma = subprocess.check_output(cmd,
237+ stderr=subprocess.STDOUT,
238+ encoding='utf-8')
239+ rma_info = get_rmadison_info(rma)
240+ except (subprocess.CalledProcessError, KeyError):
241+ LOGGER.debug('Package not found by rmadison: %s', pkg)
242+ return "not found"
243+ if len(rma_info) == 0:
244+ return "not found"
245+ sorted_rma_info = sorted(rma_info, key=lambda d: d['ver'])
246+
247+ # CSV does not contain the version, derive from url
248+
249+ build_log_url = unquote(ftbfs['buildlog'])
250+ versionmatch = re.match(rf'.*{pkg}_(.*)_BUILDING.txt', build_log_url)
251+ if versionmatch and versionmatch.group(1) is not None:
252+ try:
253+ build_version = Version(versionmatch.group(1))
254+ except ValueError:
255+ return "Invalid version"
256+ else:
257+ LOGGER.debug('could not find version in %s', ftbfs["buildlog"])
258+ return "no version"
259+
260+ if build_version < sorted_rma_info[0]['ver']:
261+ return "too old"
262+
263+ # return earliest release that had this version
264+ for release_info in sorted_rma_info:
265+ if release_info['ver'] == build_version:
266+ return release_info['release']
267+
268+ return "not in rmadison"
269+
270+
271+def check_nobinaries(ftbfs, pkg):
272+ '''Check if the failing architectures did have binaries built before.
273+
274+    Such packages are not considered a regression; they are neither a
275+    problem for later maintenance nor a reason for removal.'''
276+ if pkg not in SRC_TO_BIN:
277+ LOGGER.debug('no binaries known for source %s', pkg)
278+ return False
279+
280+ try:
281+ cmd = ['rmadison',
282+ '-u', 'ubuntu',
283+ '-s', ARGS.series,
284+ '-a', ','.join(str(x) for x in ftbfs.get("fail_arches")),
285+ ','.join(SRC_TO_BIN[pkg])]
286+ rma = subprocess.check_output(cmd,
287+ stderr=subprocess.STDOUT,
288+ encoding='utf-8')
289+ if rma:
290+ LOGGER.debug('Got existing builds for src: (%s): "%s"', pkg, rma)
291+ rma_info = get_rmadison_info(rma)
292+ # we could report all, but that wastes space, report one arch
293+ # that we have found plus the number of further arches that
294+ # were listed in the ftbfs, but exist in the target suite.
295+ arches = []
296+ for info in rma_info:
297+ arches.extend(info['arch'])
298+ arches = list(set(arches))
299+ extra_count = len(arches)-1
300+ return f'{rma_info[0]["arch"][0]} +{extra_count}'
301+
302+ LOGGER.debug('Got no binaries in %s for src (%s)', ARGS.series, pkg)
303+ return False
304+
305+ except (subprocess.CalledProcessError, KeyError):
306+ LOGGER.error('Package not found by rmadison: %s', pkg)
307+ return False
308+
309+
310+def check_existence(_ftbfs, pkg):
311+ '''Check existence in Ubuntu, Debian-unstable and Debian-testing.
312+ - If it is in unstable, but absent from testing it is a good candidate
313+ for removal
314+ - If it is only in Ubuntu, keep it as we need to maintain it (but report
315+ it as ubuntu only)
316+    - otherwise filter it from the list as it is not low-hanging
317+      fruit for removal.'''
318+ if pkg not in SOURCE_MAP['unstable'] and pkg not in SOURCE_MAP['testing']:
319+ return "Ubuntu-only"
320+ if pkg not in SOURCE_MAP['ubuntu']:
321+ # already removed
322+ return False
323+ if pkg in SOURCE_MAP['unstable'] and pkg not in SOURCE_MAP['testing']:
324+ if 'ubuntu' in SOURCE_MAP["ubuntu"][pkg]:
325+ return "Not in testing (delta)"
326+ return "Not in testing (sync)"
327+ # Ignore all others
328+ return False
329+
330+
331+def check_has_no_depends(_ftbfs, pkg):
332+ """Check if nothing depends on binaries of this package."""
333+ cmd = ['reverse-depends', f'--release={ARGS.series}', f'src:{pkg}']
334+ try:
335+ dep = subprocess.check_output(cmd, stderr=subprocess.STDOUT)
336+ except subprocess.CalledProcessError:
337+ LOGGER.debug('Package not found by reverse-depends: %s', pkg)
338+ return False
339+ return dep == b'No reverse dependencies found\n'
340+
341+
342+def check_has_no_bdepends(_ftbfs, pkg):
343+ """Check if nothing build-depends on this package."""
344+ cmd = ['reverse-depends', f'--release={ARGS.series}',
345+ '--build-depends', f'src:{pkg}']
346+ try:
347+ bdep = subprocess.check_output(cmd, stderr=subprocess.STDOUT)
348+ except subprocess.CalledProcessError:
349+ LOGGER.debug('Package not found by reverse-depends: %s', pkg)
350+ return False
351+ return bdep == b'No reverse dependencies found\n'
352+
353+
354+def check_is_not_seeded(_ftbfs, pkg):
355+ """Check if this package is not seeded in Ubuntu."""
356+ return pkg not in SEEDED_SOURCES
357+
358+
359+# This list also defines the order in which checks are applied.
360+# To make this more effective the checks are sorted by speed, with those
361+# reaching out externally (like reverse-depends and seeded-in-ubuntu) at
362+# the end, which causes them to be executed rarely.
363+#
364+# Usually what remains in the list after filtering has the same attributes.
365+# Therefore, by default, no check results are individually reported.
366+# OTOH if --no-filter is set each checked attribute is reported.
367+# Defining it as a per-check attribute leaves flexibility for later-added
368+# checks to always be reported by setting reportworthy => True.
369+CHECKS = {
370+ "enough_failing_arches": {
371+ "enabled": True,
372+ "reportworthy": False,
373+ "checktxt": "Did enough architectures fail to be a problem?",
374+ "fn": check_enough_failing_arches,
375+ "short": "Fail",
376+ },
377+ "inacceptable_state": {
378+ "enabled": True,
379+ "reportworthy": False,
380+ "checktxt": "Is the fail state really inacceptable?",
381+ "fn": check_inacceptable_state,
382+ "short": "Acceptable",
383+ },
384+ "is_not_seeded": {
385+ "enabled": True,
386+ "reportworthy": False,
387+ "checktxt": "Is this package not seeded in Ubuntu?",
388+ "fn": check_is_not_seeded,
389+ "short": "Seeded in Ubuntu",
390+ },
391+ "existence": {
392+ "enabled": True,
393+ "reportworthy": True,
394+ "checktxt": "Absent from testing or Ubuntu-only?",
395+ "fn": check_existence,
396+ "short": "Exists in (sync/delta)",
397+ },
398+ "no_dependencies": {
399+ "enabled": True,
400+ "reportworthy": False,
401+        "checktxt": "Does nothing depend on this package's binaries?",
402+ "fn": check_has_no_depends,
403+ "short": "Has Dep",
404+ },
405+ "no_build_dependencies": {
406+ "enabled": True,
407+ "reportworthy": False,
408+ "checktxt": "Does nothing build-depend on this package?",
409+ "fn": check_has_no_bdepends,
410+ "short": "Has Bdep",
411+ },
412+ "nobinaries": {
413+ "enabled": True,
414+ "reportworthy": True,
415+ "checktxt": "Did the failing architectures have binaries published?",
416+ "fn": check_nobinaries,
417+ "short": "Has Binaries in",
418+ },
419+ "untouched": {
420+ "enabled": True,
421+ "reportworthy": True,
422+ "checktxt": "Since when was there no update?",
423+ "fn": check_untouched,
424+ "short": "Untouched since",
425+ },
426+}
427+
428+
429+def filter_ftbfs_list(ftbfs_l):
430+ """Runs all enabled checks on the list of FTBFS and reduces the list.
431+
432+    Note: Strips the list in each iteration to trigger fewer checks each round.
433+ """
434+ ftbfs_remaining = ftbfs_l
435+ for check, checkdetails in CHECKS.items():
436+ if check == "internal-error":
437+ continue
438+ if not checkdetails["enabled"]:
439+ LOGGER.info('Check %s is disabled', check)
440+ continue
441+ LOGGER.info('Check "%s"', checkdetails["checktxt"])
442+ filtered = []
443+ progress = 0
444+ remaining = len(ftbfs_remaining)
445+ for ftbfs in ftbfs_remaining:
446+ pkg = ftbfs.get("pkg")
447+ progress += 1
448+ print(f'{progress} ({progress/remaining*100:5,.1f}%)',
449+ end='\r')
450+ result = checkdetails["fn"](ftbfs, pkg)
451+ ftbfs |= {check: result}
452+ if not result:
453+ LOGGER.debug('PKG "%s" skipped by check "%s".', pkg, check)
454+ continue
455+ filtered.append(ftbfs)
456+
457+ if not ARGS.no_filtering:
458+ LOGGER.info('Check done: %s cases left', len(filtered))
459+ if len(filtered) == 0:
460+ LOGGER.info('No cases left, exiting ...')
461+ sys.exit(0)
462+ ftbfs_remaining = filtered
463+
464+ return ftbfs_remaining
465+
466+
467+# From https://stackoverflow.com/questions/40419276
468+def link(uri, label=None):
469+    """Wrap text into a URL that renders as a link in e.g. vte-based terminals"""
470+ if label is None:
471+ label = uri
472+ parameters = ''
473+
474+ # OSC 8 ; params ; URI ST <name> OSC 8 ;; ST
475+ escape_mask = '\033]8;{};{}\033\\{}\033]8;;\033\\'
476+
477+ return escape_mask.format(parameters, uri, label)
478+
479+
480+def report_ftbfs(ftbfs_l):
481+ """Report the ftbfs that are left after filtering."""
482+ LOGGER.info('## Final report ##')
483+ LOGGER.info('Found %s relevant cases', len(ftbfs_l))
484+
485+ pkg_width = 1
486+ for ftbfs in ftbfs_l:
487+ pkg_width = max(pkg_width, len(ftbfs.get("pkg")))
488+
489+ header = ('| Pkg-Info '
490+ f'| {"Source Package":{pkg_width}} '
491+ f'| {"State":<18} '
492+ f'| {"Failing Architectures":45} ')
493+ for _name, details in CHECKS.items():
494+ if details["enabled"] and details["reportworthy"]:
495+ header += f'| {details["short"]} '
496+ header += '|'
497+ print(header)
498+
499+ for ftbfs in ftbfs_l:
500+ archstr = ' '.join(str(x) for x in ftbfs.get("fail_arches"))
501+ uurl = f'https://launchpad.net/ubuntu/+source/{ftbfs.get("pkg")}'
502+ durl = f'https://tracker.debian.org/pkg/{ftbfs.get("pkg")}'
503+ txt = ('| ' + link(durl, 'Deb') + ' ' + link(uurl, 'Ubu') + ' '
504+ '| ' + link(ftbfs.get("buildlog"),
505+ f'{ftbfs.get("pkg"):{pkg_width}}') + ' '
506+ f'| {STATE_TO_TEXT[ftbfs.get("state")]:<18} '
507+ f'| {archstr:45} ')
508+ for name, details in CHECKS.items():
509+ if details["enabled"] and details["reportworthy"]:
510+ value = ftbfs.get(name)
511+ if value is None:
512+ value = "n/a"
513+                # To be able to carry more than true/false (e.g. a count)
514+                # when a check passes, results are unintuitively mapped to
515+                # boolean values; re-translate them back for the user.
516+ if isinstance(value, bool):
517+ if value is True:
518+ value = "No"
519+ else:
520+ value = "Yes"
521+ clen = len(details["short"])
522+ txt += f'| {value:<{clen}s} '
523+ txt += '|'
524+ print(txt)
525+
526+
527+def get_components(suite):
528+ '''Return components for a suite.'''
529+ if suite == 'ubuntu':
530+ return ("main", "restricted", "universe", "multiverse")
531+ return ("main", "contrib", "non-free-firmware", "non-free")
532+
533+
534+def get_src_url(suite, component):
535+ '''Return the URL to fetch sources from.'''
536+ if suite == 'ubuntu':
537+ return (f'{ARGS.archive_ubuntu}/dists/{ARGS.series}/'
538+ f'{component}/source/Sources.gz')
539+ return (f'{ARGS.archive_debian}/dists/{suite}/'
540+ f'{component}/source/Sources.gz')
541+
542+
543+def get_pkg_url(series, component, arch):
544+ '''Return the URL to fetch package info from.'''
545+ if arch in ['amd64', 'i386']:
546+ base_url = ARGS.archive_ubuntu
547+ else:
548+ base_url = ARGS.archive_ubuntu_ports
549+
550+ return (f'{base_url}/dists/{series}/'
551+ f'{component}/binary-{arch}/Packages.gz')
552+
553+
554+def read_sources():
555+ """Read information from the Ubuntu and Debian Sources files.
556+
557+ Returns a dict containing:
558+ * a Ubuntu mapping of source package names to versions
559+ * a Debian-unstable mapping of source package names to versions
560+ * a Debian-testing mapping of source package names to versions
561+ """
562+ LOGGER.info('Reading Sources ...')
563+ source_map = {'ubuntu': {}, 'testing': {}, 'unstable': {}}
564+ apt_pkg.init()
565+
566+ for suite in source_map:
567+ components = get_components(suite)
568+ for component in components:
569+ url = get_src_url(suite, component)
570+ with uat_cache.CachedResource(url) as sources_file:
571+ apt_sources = apt_pkg.TagFile(sources_file)
572+ for section in apt_sources:
573+ src = section["Package"]
574+ ver = section["Version"]
575+ if (src not in source_map[suite] or
576+ apt_pkg.version_compare(source_map[suite][src], ver) < 0):
577+ source_map[suite][src] = ver
578+
579+ return source_map
580+
581+
582+def setup_parser():
583+ '''Set up the argument parser for this program.'''
584+
585+ epilog = 'Known checks:'
586+ for name, details in CHECKS.items():
587+ epilog += f'\n - {name} => {details["checktxt"]}'
588+ formatter = argparse.RawDescriptionHelpFormatter
589+
590+ parser = argparse.ArgumentParser(epilog=epilog,
591+ description=__doc__,
592+ formatter_class=formatter)
593+ parser.add_argument(
594+ "-s", "--series", required=True,
595+ help="Name of the series to check dependencies against.")
596+ parser.add_argument(
597+ "-c", "--csv", dest="csv_file", required=True,
598+        help="Path to CSV file as generated for archive rebuilds.")
599+
600+ parser.add_argument(
601+ "--min-arch-fails",
602+ default=1, metavar='N', type=int,
603+        help="Check enough_failing_arches will ignore cases which have "
604+             "fewer architectures failing than this value. For example, set "
605+             "this to 5 to only show cases which fail on 5 or more (Default: 1).")
606+
607+ parser.add_argument(
608+ "--no-filtering", action='store_true',
609+ help="By default the list is filtered by each check usually making "
610+ "the output small and readable as well as execution fast. "
611+ "By setting this option there will be no filtering, instead the "
612+ "report will include all FTBFS and the result of each check will "
613+ "be reported per package (Warning: this will be slow).")
614+
615+ parser.add_argument(
616+ "--skip",
617+ type=str,
618+ nargs='+',
619+ metavar='check-id',
620+ help='Skip the checks passed to this argument. Potential values '
621+ 'listed below as "Known checks" (default: none).')
622+
623+ parser.add_argument(
624+ "--archive-debian",
625+ default=ARCHIVE_DEBIAN,
626+ help="URL or local path prefix to access the Debian Archive. "
627+ f'(Default: {ARCHIVE_DEBIAN})')
628+ parser.add_argument(
629+ "--archive-ubuntu",
630+ default=ARCHIVE_UBUNTU,
631+ help="URL or local path prefix to access the Ubuntu Archive. "
632+ f'(Default: {ARCHIVE_UBUNTU})')
633+ parser.add_argument(
634+ "--archive-ubuntu-ports",
635+ default=ARCHIVE_UBUNTU_PORTS,
636+ help="URL or local path prefix to access the Ubuntu-Ports Archive. "
637+ f'(Default: {ARCHIVE_UBUNTU_PORTS})')
638+
639+ parser.add_argument('-l',
640+ '--loglevel',
641+ default='info',
642+ help='Provide logging level. Example '
643+ '--loglevel debug (Default: info)')
644+
645+ return parser.parse_args()
646+
647+
648+if __name__ == '__main__':
649+ ARGS = setup_parser()
650+
651+ logging.basicConfig(level=ARGS.loglevel.upper(),
652+ format="%(levelname)s: %(asctime)s - %(message)s")
653+
654+ if ARGS.series:
655+ try:
656+ subprocess.check_output(['distro-info', '--series', ARGS.series])
657+ except subprocess.CalledProcessError:
658+ LOGGER.error('%s is not a valid series.', ARGS.series)
659+ sys.exit(1)
660+
661+ # Without filtering we want the info of all checks reported
662+ if ARGS.no_filtering:
663+ for _check, checkconfig in CHECKS.items():
664+ checkconfig["reportworthy"] = True
665+
666+ if ARGS.skip:
667+ for checkid in ARGS.skip:
668+ try:
669+ CHECKS[checkid]["enabled"] = False
670+ except KeyError:
671+ LOGGER.error('%s is not a known check.', checkid)
672+ sys.exit(1)
673+
674+ LOGGER.info('Reading CSV file %s', ARGS.csv_file)
675+ ftbfs_list = load_csv_file(csv_file_name=ARGS.csv_file)
676+ LOGGER.info('%s cases found', len(ftbfs_list))
677+
678+ SOURCE_MAP = read_sources()
679+ (SRC_TO_BIN, SEEDED_SOURCES) = load_seeded_sources()
680+
681+ ftbfs_list = filter_ftbfs_list(ftbfs_list)
682+
683+ report_ftbfs(ftbfs_list)
684diff --git a/uat_lib/cache.py b/uat_lib/cache.py
685new file mode 100644
686index 0000000..5bd1e23
687--- /dev/null
688+++ b/uat_lib/cache.py
689@@ -0,0 +1,135 @@
690+# Copyright (C) 2020-2023 Canonical Ltd.
691+"""
691+Helper to cache commonly used resources like Packages.gz and similar
693+in a path common to all ubuntu archive tools. Provides a context manager
694+and auto-fetch as well as auto-decompression to allow the programs using
695+it to focus on what makes them special and not re-implement fetch&open
696+over and over.
697+"""
698+# Author: Christian Ehrhardt <christian.ehrhardt@canonical.com>
699+
700+# This library is free software; you can redistribute it and/or
701+# modify it under the terms of the GNU Lesser General Public
702+# License as published by the Free Software Foundation; either
703+# version 2.1 of the License, or (at your option) any later version.
704+
705+# This library is distributed in the hope that it will be useful,
706+# but WITHOUT ANY WARRANTY; without even the implied warranty of
707+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
708+# Lesser General Public License for more details.
709+
710+# You should have received a copy of the GNU Lesser General Public
711+# License along with this library; if not, write to the Free Software
712+# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301
713+# USA
714+
715+import gzip
716+import logging
717+import os
718+import tempfile
719+
720+from contextlib import closing
721+from urllib.request import urlopen
722+from urllib.error import URLError
723+from datetime import datetime, timedelta
724+from tenacity import retry, stop_after_delay, stop_after_attempt, wait_fixed
725+
726+from slugify import slugify
727+
728+LOGGER = logging.getLogger(__name__)
729+
730+
731+class CachedResource:
732+    '''Makes a resource available independent of location and format.'''
733+ def __init__(self, resource):
734+ '''Open the resource after making it available.
735+
736+        - If it is a URL, fetch it into the cache, otherwise use it directly
737+          from the local path
738+        - If it is gzip compressed, decompress it
739+        - set self.file_obj to the resource as an opened file
740+ '''
741+ self.tmp = False
742+
743+ if os.path.isfile(resource):
744+ LOGGER.debug('Using local file "%s"', resource)
745+ fname = resource
746+ else:
747+ cache_fname = CachedResource.url_to_cache_file(resource)
748+ if CachedResource.is_recent_enough(cache_fname):
749+ LOGGER.debug('Use "%s" from cache for "%s"',
750+ cache_fname, resource)
751+ else:
752+ LOGGER.debug('Caching "%s" as "%s"', resource, cache_fname)
753+ CachedResource.fetch_resource(resource, cache_fname)
754+ fname = cache_fname
755+
756+ # For now simple detection via file suffix (no magic, no try/except)
757+ if fname.endswith('.gz'):
758+ fd, self.file_name = tempfile.mkstemp()
759+ LOGGER.debug('Tempfile for gzip extraction %s', self.file_name)
760+ self.tmp = True
761+ try:
762+ with closing(gzip.GzipFile(fname)) as gz_file:
763+ with os.fdopen(fd, "wb") as out_file:
764+ out_file.write(gz_file.read())
765+ except gzip.BadGzipFile as err:
766+ LOGGER.error('Unable to gunzip %s: %s', fname, err)
767+ raise err
768+ else:
769+ self.file_name = fname
770+
771+ self.file_obj = open(self.file_name, 'r', encoding='utf-8')
772+
773+ def __enter__(self):
774+ '''Return the opened file.'''
775+ return self.file_obj
776+
777+ def __exit__(self, __type, __value, __traceback):
778+ '''Close file and remove temporary file if needed.'''
779+ self.file_obj.close()
780+ if self.tmp:
781+ os.remove(self.file_name)
782+
783+ @staticmethod
784+ def get_cache_dir():
785+ """Returns the cache base directory for ubuntu archive tools"""
786+ fallback = os.path.expanduser(os.path.join('~', '.cache'))
787+ cache_dir = os.environ.get('XDG_CACHE_HOME', fallback)
788+ uat_cache_path = os.path.join(cache_dir, 'ubuntu-archive-tools')
789+ os.makedirs(uat_cache_path, exist_ok=True)
790+ return uat_cache_path
791+
792+ @staticmethod
793+ def is_recent_enough(filename, max_age=timedelta(hours=24)):
794+        """Checks if a file was last modified within the last max_age."""
795+ try:
796+ file_mtime = datetime.utcfromtimestamp(os.path.getmtime(filename))
797+ except FileNotFoundError:
798+ return False
799+ cutoff = datetime.utcnow() - max_age
800+ return file_mtime > cutoff
801+
802+ @staticmethod
803+ def url_to_cache_file(url):
804+ """Derive file name as it will appear in the cache based on the url."""
805+ cache_dir = CachedResource.get_cache_dir()
806+ # remove (file) incompatible elements of an url
807+ return os.path.join(cache_dir, slugify(url, separator='.'))
808+
809+    @staticmethod
810+    @retry(stop=(stop_after_delay(60) | stop_after_attempt(5)),
811+           wait=wait_fixed(2))
812+ def fetch_resource(url, fname):
813+ '''Fetch URL to filename
814+
815+ Use retry functionality for more reliability over network
816+ '''
817+
818+ try:
819+ with closing(urlopen(url)) as url_file:
820+ with open(fname, "wb") as comp_file:
821+ comp_file.write(url_file.read())
822+ except URLError as err:
823+ LOGGER.error('Failed to fetch url %s', url)
824+ raise err
