Merge ~nacc/usd-importer:modernize-scripts-v2 into usd-importer:master

Proposed by Nish Aravamudan on 2017-11-10
Status: Merged
Approved by: Nish Aravamudan on 2017-11-16
Approved revision: 9ed2e8a2ac18665cbc7d966a74b1cf0deee32680
Merged at revision: af30fd0a33d1ea36b514420c5d8867dbfb2d7160
Proposed branch: ~nacc/usd-importer:modernize-scripts-v2
Merge into: usd-importer:master
Prerequisite: ~nacc/usd-importer:lp1730734-cache-importer-progress
Diff against target: 1613 lines (+1122/-290)
10 files modified
dev/null (+0/-199)
gitubuntu/importer.py (+3/-50)
gitubuntu/source-package-blacklist.txt (+7/-0)
gitubuntu/source-package-whitelist.txt (+0/-7)
gitubuntu/source_information.py (+2/-16)
man/man1/git-ubuntu-import.1 (+2/-18)
scripts/import-source-packages.py (+377/-0)
scripts/scriptutils.py (+191/-0)
scripts/source-package-walker.py (+272/-0)
scripts/update-repository-alias.py (+268/-0)
Reviewer            Review Type             Date Requested  Status
Robie Basak                                 2017-11-10      Approve on 2017-11-14
Server Team CI bot  continuous-integration  2017-11-10      Approve on 2017-11-13
Review via email: mp+333500@code.launchpad.net

This proposal supersedes a proposal from 2017-11-01.

Description of the change

Make jenkins happy.

Server Team CI bot (server-team-bot) wrote : Posted in a previous version of this proposal

PASSED: Continuous integration, rev:a341a5c452dde92ab1d0a319c8aac0f729cbd90f
https://jenkins.ubuntu.com/server/job/git-ubuntu-ci/195/
Executed test runs:
    SUCCESS: Checkout
    SUCCESS: Style Check
    SUCCESS: Unit Tests
    SUCCESS: Integration Tests
    IN_PROGRESS: Declarative: Post Actions

Click here to trigger a rebuild:
https://jenkins.ubuntu.com/server/job/git-ubuntu-ci/195/rebuild

review: Approve (continuous-integration)
Robie Basak (racb) wrote : Posted in a previous version of this proposal

Why not put all the code into the gitubuntu module, with thin wrappers in bin/, like git-ubuntu, rather than have a separate infrastructure in scripts/ with a different additional sys.path hack? Or is there some kind of snap-related difficulty with this?

Robie Basak (racb) wrote : Posted in a previous version of this proposal

If we're going with pylint and treating these as new, then none of this is currently pylint clean. I'm not saying it necessarily should be. What's your opinion?

Nish Aravamudan (nacc) wrote : Posted in a previous version of this proposal

On Tue, Nov 7, 2017 at 6:24 AM, Robie Basak <email address hidden> wrote:
> Why not put all the code into the gitubuntu module, with thin wrappers in bin/, like git-ubuntu,

I can do this, but I don't want these scripts in the snap. Every
default change would imply a new snap version, which forces everyone
to download a new update. While the xdelta gunk from the store should
make that minimal, in practice it does not.

> rather than have a separate infrastructure in scripts/ with a different additional sys.path hack?

I can move it to bin/, but it was mostly for keeping the code isolated
and clean. It's not something everyone is going to need to run. I
realize, though, the alias script, at least, should be in the snap
maybe.

Robie Basak (racb) wrote : Posted in a previous version of this proposal

On Tue, Nov 07, 2017 at 04:04:53PM -0000, Nish Aravamudan wrote:
> On Tue, Nov 7, 2017 at 6:24 AM, Robie Basak <email address hidden> wrote:
> > Why not put all the code into the gitubuntu module, with thin wrappers in bin/, like git-ubuntu,
>
> I can do this, but I don't want these scripts in the snap. Every
> default change would imply a new snap version, which forces everyone
> to download a new update. While the xdelta gunk from the store should
> make that minimal, in practice it does not.

Is this because edge is automatically generated? Could we start using
"beta" instead, and push to beta only when there's some other real
change? Or does this make other pain worse?

Nish Aravamudan (nacc) wrote : Posted in a previous version of this proposal

On Tue, Nov 7, 2017 at 6:28 AM, Robie Basak <email address hidden> wrote:
> If we're going with pylint and treating these as new, then none of this is currently pylint clean. I'm not saying it necessarily should be. What's your opinion?

I'm fine with targeting clean runs, but we need a pylintrc, I think,
in order to have it agree with our formatting (always 4-space indented).
And I think the 'too many arguments' and 'too many variables' rules
are dumb, but I'll read why they exist first. I pushed a few cleanups
that I had missed previously.

Nish Aravamudan (nacc) wrote : Posted in a previous version of this proposal

On Tue, Nov 7, 2017 at 8:12 AM, Robie Basak <email address hidden> wrote:
> On Tue, Nov 07, 2017 at 04:04:53PM -0000, Nish Aravamudan wrote:
>> On Tue, Nov 7, 2017 at 6:24 AM, Robie Basak <email address hidden> wrote:
>> > Why not put all the code into the gitubuntu module, with thin wrappers in bin/, like git-ubuntu,
>>
>> I can do this, but I don't want these scripts in the snap. Every
>> default change would imply a new snap version, which forces everyone
>> to download a new update. While the xdelta gunk from the store should
>> make that minimal, in practice it does not.
>
> Is this because edge is automatically generated? Could we start using
> "beta" instead, and push to beta only when there's some other real
> change? Or does this make other pain worse?

Well, and also, these are not really part of the application, they are
part of the wrapper of the application.

Adding another branch for the beta channel would be fine, but then we
have even more management to deal with (merge from edge to beta, merge
from beta to stable).

Server Team CI bot (server-team-bot) wrote : Posted in a previous version of this proposal

PASSED: Continuous integration, rev:5b794876d4e94f6fbcbf44de018535c94f692e9b
https://jenkins.ubuntu.com/server/job/git-ubuntu-ci/198/
Executed test runs:
    SUCCESS: Checkout
    SUCCESS: Style Check
    SUCCESS: Unit Tests
    SUCCESS: Integration Tests
    IN_PROGRESS: Declarative: Post Actions

Click here to trigger a rebuild:
https://jenkins.ubuntu.com/server/job/git-ubuntu-ci/198/rebuild

review: Approve (continuous-integration)
Robie Basak (racb) wrote : Posted in a previous version of this proposal

Looks good in general! I like the multiprocessing.Pool.map use as it is now.

See my review branch for review comments that are easier to describe in code. Individual explanations are in individual commit messages: https://code.launchpad.net/~racb/usd-importer/+git/usd-importer/+ref/modernize-scripts-review

I've left a couple of inline comments.

Some other general comments:

I'm not keen on the return spec of import_srcpkg. It seems rather arbitrary and unnatural, and it's not even convenient as you still have to massage it in the caller. That massaging is also duplicated in both callers. I suggest you change the return spec to (pkgname, success_bool) instead. That would save you from having to zip it back up in the caller.

If you still want a (success_list, fail_list) tuple to process further up the stack more easily, then that can't be done by changing the return type of import_srcpkg further. So you'd still end up having some duplication. To fix that (if you still want success_list, fail_list), I suggest wrapping import_srcpkg and putting the wrapper in scriptutils. The wrapper could do the multiprocessing pool, call the real import_srcpkg, and generate the (success_list, fail_list) result.

An example of getting from a (pkgname, success_bool) list to a (success_list, fail_list):

return (
    [pkg for pkg, success in results if success],
    [pkg for pkg, success in results if not success],
)
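
The wrapper idea above can be sketched roughly as follows. This is only an illustration, not the code that was merged: partition_results, import_all and the import_one parameter are hypothetical names, and import_one stands in for an import_srcpkg that returns (pkgname, success_bool).

```python
import multiprocessing


def partition_results(results):
    """Split (pkgname, success_bool) pairs into (success_list, fail_list)."""
    # Materialise first so one-shot iterators such as map() can be scanned twice.
    results = list(results)
    return (
        [pkg for pkg, success in results if success],
        [pkg for pkg, success in results if not success],
    )


def import_all(pkgnames, import_one, num_workers=1):
    """Hypothetical wrapper: run import_one over pkgnames, then partition.

    import_one must return a (pkgname, success_bool) tuple. With
    num_workers > 1 it must also be picklable (a module-level function),
    because multiprocessing.Pool.map ships it to worker processes.
    """
    if num_workers > 1:
        with multiprocessing.Pool(processes=num_workers) as pool:
            return partition_results(pool.map(import_one, pkgnames))
    return partition_results(map(import_one, pkgnames))
```

Both callers would then share one code path and receive the two lists directly, with no zipping in the caller.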

General comments on source-package-walker.py:

Parsing Sources.xz alarmed me; perhaps add a comment that it's a cjwatson suggestion because the API way will be very slow and there's no better API for it :)

Can we make the components a list, because we'll be adding more later? Otherwise there's code duplication already due to the hardcoding like this.

We also need to import from other pockets and other series (at least the supported ones). Eg. source packages that have been deleted before this cycle, and source packages that were added to older stable releases in the other pockets directly such as HWE stacks.

Please parse using debian.deb822 or similar:

10:42 <rbasak> While you're here (I'm reviewing nacc's MP), any opinion on the actual parsing, as implemented in Python? nacc is doing some Python-based parsing of the Sources file. Which feels ugly, but shelling out to grep-dctrl would also be ugly.
10:42 <rbasak> What would you do?
10:42 <cjwatson> rbasak: I'd use python-debian
10:42 <cjwatson> The stuff in debian.deb822 is generally fine for this
10:43 <rbasak> Thanks!
10:43 <cjwatson> rbasak: (python-apt is also fine; python-debian sometimes makes use of that for speed. Use whichever interface is more comfortable.)
10:44 <cjwatson> rbasak: germinate uses python-apt, so possibly I preferred that at one point. I think when I wrote that python-debian was significantly less good.
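
A rough sketch of how the three suggestions above (a list of components, more pockets and series, and debian.deb822 parsing) might fit together. It is untested against the real walker script: ARCHIVE, sources_urls and source_package_names are hypothetical names, the mirror URL is an assumption, and python-debian must be installed for the parsing function to work.

```python
import itertools
import lzma

# Assumption: the primary Ubuntu archive; any mirror with the same layout works.
ARCHIVE = 'http://archive.ubuntu.com/ubuntu'


def sources_urls(series_list, pocket_suffixes, components):
    """Yield a Sources.xz URL for every series/pocket/component combination."""
    for series, suffix, component in itertools.product(
        series_list, pocket_suffixes, components
    ):
        yield '%s/dists/%s%s/%s/source/Sources.xz' % (
            ARCHIVE, series, suffix, component,
        )


def source_package_names(sources_xz_path):
    """Return the set of source package names in a downloaded Sources.xz."""
    # Imported lazily so the URL helper above works without python-debian.
    from debian import deb822

    with lzma.open(sources_xz_path, 'rt', encoding='utf-8') as f:
        # use_apt_pkg=False keeps parsing in pure Python, since the input
        # is a decompressing wrapper rather than a plain file.
        return {
            para['Package']
            for para in deb822.Sources.iter_paragraphs(f, use_apt_pkg=False)
        }
```

Adding a component or pocket then means extending a list rather than duplicating a code path.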

review: Needs Fixing
Robie Basak (racb) wrote : Posted in a previous version of this proposal

PS. all my suggestions are entirely untested :)

PASSED: Continuous integration, rev:56ea91693c5d49d6f52ef05b41329b0fa55f4c03
https://jenkins.ubuntu.com/server/job/git-ubuntu-ci/204/
Executed test runs:
    SUCCESS: Checkout
    SUCCESS: Style Check
    SUCCESS: Unit Tests
    SUCCESS: Integration Tests
    IN_PROGRESS: Declarative: Post Actions

Click here to trigger a rebuild:
https://jenkins.ubuntu.com/server/job/git-ubuntu-ci/204/rebuild

review: Approve (continuous-integration)

PASSED: Continuous integration, rev:325412302b10decc4dd3ec714d814092d2e2669c
https://jenkins.ubuntu.com/server/job/git-ubuntu-ci/208/
Executed test runs:
    SUCCESS: Checkout
    SUCCESS: Style Check
    SUCCESS: Unit Tests
    SUCCESS: Integration Tests
    IN_PROGRESS: Declarative: Post Actions

Click here to trigger a rebuild:
https://jenkins.ubuntu.com/server/job/git-ubuntu-ci/208/rebuild

review: Approve (continuous-integration)

PASSED: Continuous integration, rev:9ed2e8a2ac18665cbc7d966a74b1cf0deee32680
https://jenkins.ubuntu.com/server/job/git-ubuntu-ci/210/
Executed test runs:
    SUCCESS: Checkout
    SUCCESS: Style Check
    SUCCESS: Unit Tests
    SUCCESS: Integration Tests
    IN_PROGRESS: Declarative: Post Actions

Click here to trigger a rebuild:
https://jenkins.ubuntu.com/server/job/git-ubuntu-ci/210/rebuild

review: Approve (continuous-integration)
Robie Basak (racb) wrote :

import_all_published_sources looks really neat now, thanks! Minor fixups inline, but apart from those, this looks good now.

review: Approve
Nish Aravamudan (nacc) wrote :

On Mon, Nov 13, 2017 at 11:36 PM, Robie Basak <email address hidden> wrote:
> Review: Approve
>
> import_all_published_sources looks really neat now, thanks! Minor fixups inline, but apart from those, this looks good now.

Responses inline.

>
> Diff comments:
>
>> diff --git a/scripts/scriptutils.py b/scripts/scriptutils.py
>> new file mode 100644
>> index 0000000..c89507e
>> --- /dev/null
>> +++ b/scripts/scriptutils.py
>> @@ -0,0 +1,191 @@
>> +from collections import namedtuple
>> +import functools
>> +import hashlib
>> +import multiprocessing
>> +import os
>> +import sys
>> +import subprocess
>> +import time
>> +
>> +import pkg_resources
>> +
>> +# We expect to be running from a git repository in master for this
>> +# script, because the snap's python code is not exposed except within
>> +# the snap
>> +try:
>> + REALPATH = os.readlink(__file__)
>> +except OSError:
>> + REALPATH = __file__
>> +sys.path.insert(
>> + 0,
>> + os.path.abspath(
>> + os.path.join(os.path.dirname(REALPATH), os.path.pardir)
>> + )
>> +)
>> +
>> +from gitubuntu.run import run
>> +
>> +Defaults = namedtuple(
>> + 'Defaults',
>> + [
>> + 'num_workers',
>> + 'whitelist',
>> + 'blacklist',
>> + 'phasing_universe',
>> + 'phasing_main',
>> + 'dry_run',
>> + 'use_whitelist',
>> + ],
>> +)
>> +
>> +DEFAULTS = Defaults(
>> + num_workers=10,
>> + whitelist=pkg_resources.resource_filename(
>> + 'gitubuntu',
>> + 'source-package-whitelist.txt',
>> + ),
>> + blacklist=pkg_resources.resource_filename(
>> + 'gitubuntu',
>> + 'source-package-blacklist.txt',
>> + ),
>> + phasing_universe=0,
>> + phasing_main=1,
>> + dry_run=False,
>> + use_whitelist=True,
>> +)
>> +
>> +
>> +def should_import_srcpkg(
>> + pkgname,
>> + component,
>> + whitelist,
>> + blacklist,
>> + phasing_main,
>> + phasing_universe,
>> +):
>> + """should_import_srcpkg - indicate if a given source package should be imported
>> +
>> + The phasing is implemented similarly to update-manager. If the
>> + md5sum of the source package name is less than the (appropriate
>> + percentage * 2^128) (the maximum representable md5sum), the source
>> + package name is in the appropriate phasing set.
>> +
>> + Arguments:
>> + pkgname - string name of a source package
>> + component - string archive component of @pgkname
>> + whitelist - a list of of packages to always import
>> + blacklist - a list of packages to never import
>> + phasing_main - a integer percentage of all packages in main to import
>> + phasing_universe - a integer percentage of all packages in universe to import
>> +
>> + Returns:
>> + True if @pkgname should be imported, False if not.
>> + """
>> + if pkgname in blacklist:
>> + return False
>> + if pkgname in whitelist:
>> + return True
>> + md5sum = int(
>> + hashlib.md5(
>> + pkgname.encode('utf-8')
>> + ).hexdigest(),
>> + 16,
>> + )
>> + if component == 'main':
>> + if md5sum <= (phasing_main / 100) * (2**128):
>> + return Tru...
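
The quoted diff is cut off at this point. As a standalone illustration of the phasing check described in the docstring above, here is a minimal sketch. in_phasing_set is a hypothetical name, and the comparison uses integer arithmetic instead of the float expression in the quoted code, so it is equivalent in intent but not a copy of it.

```python
import hashlib


def in_phasing_set(pkgname, percentage):
    """Return True if pkgname falls inside the given phasing percentage.

    The md5sum of the name is read as a 128-bit integer and compared
    against percentage% of 2**128. Multiplying both sides by 100 keeps
    the comparison in exact integer arithmetic.
    """
    md5sum = int(hashlib.md5(pkgname.encode('utf-8')).hexdigest(), 16)
    return md5sum * 100 <= percentage * 2**128
```

Because the hash of a name never changes, a package that is in the set at some percentage stays in it at any higher percentage, which lets the percentage be raised gradually without dropping packages that were already being imported.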

Robie Basak (racb) wrote :

On Tue, Nov 14, 2017 at 03:30:56PM -0000, Nish Aravamudan wrote:
> On Mon, Nov 13, 2017 at 11:36 PM, Robie Basak <email address hidden> wrote:
> Heh, and I apparently have an aversion to the below method. In C, the
> following can result in less efficient code (iirc), on some
> architectures.
>
> I'm happy to make your change, though.

IMHO, extra state is more error-prone, and so are bindings that get
redefined (ie. "variables" that change value). OTOH, Python is already
pretty non-performant, and we don't care much for that, in favour of
readability and less-error-prone-ness. Unless we actually hit a
performance issue, or if there are two equivalent ways of doing things
and one is more performant without harming anything else.

> > Style: no trailing comma
>
> I find this a bit confusing, if I had done
>
> pocket_suffixes = [
> '-proposed',
> '-updates',
> '-security',
> ''
> ]
>
> I think our style would have indicated it was wrong because it was missing
> a trailing comma. Why does it matter if it's inline versus multiline for the
> style rule to be applied? Can this be clarified in the style document (it also
> means moving from inline to multi-line is more prone to missing this, IMO).

Python convention is to use [x, y, z] (with no other whitespace or
commas) AIUI. I'm not sure about multi-line, but Python allows the
trailing comma specifically (AIUI) to allow the style of not having to
edit the previous line to add a new one. So in my head, the trailing
comma only applies to the multi-line case.

I'm interested to know what others on the team think. But yeah, I'll
happily edit the style document if you agree.
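
To make the two forms under discussion concrete, a small sketch (the names inline and multiline are only for illustration). Python treats both spellings as the same list; the trailing comma matters only for how later diffs look.

```python
# Inline form: no trailing comma.
inline = ['-proposed', '-updates', '-security', '']

# Multi-line form: a trailing comma after the last element, so that
# appending an element later touches only the new line in a diff.
multiline = [
    '-proposed',
    '-updates',
    '-security',
    '',
]

# The trailing comma does not add an element.
assert inline == multiline
```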

Preview Diff

diff --git a/bin/import-cron b/bin/import-cron
deleted file mode 100755
index 0044b74..0000000
--- a/bin/import-cron
+++ /dev/null
@@ -1,199 +0,0 @@
-#!/usr/bin/env python3
-
-from copy import copy
-import os
-import subprocess
-import sys
-import tempfile
-import time
-# we know that our relative path doesn't include the module
-try:
- realpath = os.readlink(__file__)
-except OSError:
- realpath = __file__
-sys.path.insert(
- 0,
- os.path.abspath(
- os.path.join(os.path.dirname(realpath), os.path.pardir)
- )
-)
-from gitubuntu.source_information import launchpad_login
-from gitubuntu.run import run
-try:
- pkg = 'python3-pkg-resources'
- import pkg_resources
-except ImportError:
- logging.error('Is %s installed?', pkg)
- sys.exit(1)
-
-# we might want this to be contentfully stored in the importer's git
-# repository?
-import_scan_timestamps = dict()
-import_scan_timestamps['debian'] = time.time() - (24 * 60 * 60)
-import_scan_timestamps['ubuntu'] = time.time() - (24 * 60 * 60)
-import_scan_links = dict()
-import_scan_links['debian'] = None
-import_scan_links['ubuntu'] = None
-
-import_cron_log = os.path.join(tempfile.gettempdir(), 'import-cron-log')
-import_cron_packages = pkg_resources.resource_filename('gitubuntu',
- 'import-cron-packages.txt')
-
-def write_timestamps():
- global import_scan_timestamps
- global import_scan_links
-
- with open(import_cron_log, 'w+') as f:
- for dist_name in ['debian', 'ubuntu']:
- f.write('%s timestamp: %f\n' % (dist_name, import_scan_timestamps[dist_name]))
- f.write('%s link: %s\n' % (dist_name, import_scan_links[dist_name]))
-
-def read_timestamps():
- global import_scan_timestamps
- global import_scan_links
-
- # This could be more defensive
- try:
- with open(import_cron_log, 'r') as f:
- for line in f:
- dist_name, log_type, value = line.split()
- if 'timestamp' in log_type:
- import_scan_timestamps[dist_name] = float(value)
- elif 'link' in log_type:
- import_scan_links[dist_name] = value
- except IOError:
- pass
-
-def update_timestamp(dist_name, spphr):
- global import_scan_timestamps
- global import_scan_links
-
- if spphr.date_published:
- timestamp = spphr.date_published.timestamp()
- link = spphr.self_link
- import_scan_timestamps[dist_name] = timestamp
- import_scan_links[dist_name] = str(link)
- return True
- return False
-
-def import_new_published_sources(launchpad, packages, imported_srcpkgs, failed_srcpkgs):
- global import_scan_timestamps
- global import_scan_links
- args = {'order_by_date':True}
-
- for dist_name in ['debian', 'ubuntu']:
- print('Examining publishes in %s since %f' % (dist_name, import_scan_timestamps[dist_name]))
- dist = launchpad.distributions[dist_name]
- spph = dist.main_archive.getPublishedSources(**args)
- if len(spph) == 0:
- print('No publishing data found in %s' % dist_name)
- continue
- if import_scan_timestamps[dist_name]:
- _spph = list()
- for spphr in spph:
- # only check if we should stop iterating if there is a
- # timestamp to compare to
- if spphr.date_published:
- # stop iterating (backwards chronologically) when we see
- # a publish timestamp before the last cron run
- if spphr.date_published.timestamp() < import_scan_timestamps[dist_name]:
- break
- if spphr.source_package_name not in packages:
- continue
- _spph.append(spphr)
- # if no packages in our whitelist have been updated in the
- # scan window, then bump the last scan date manually to the
- # last valid timestamp in the window
- if len(_spph) == 0:
- print('No new relevant publishes found in %s relative to %f' % (dist_name, import_scan_timestamps[dist_name]))
- for spphr in spph:
- if update_timestamp(dist_name, spphr):
- break
- spph = _spph
- spph_iter = reversed(spph)
- caught_up = False
- for spphr in spph_iter:
- # find the matching upload to the last publish seen
- if not caught_up:
- if not import_scan_links[dist_name] or str(spphr.self_link) == import_scan_links[dist_name]:
- caught_up = True
- continue
-
- if spphr.source_package_name in failed_srcpkgs:
- update_timestamp(dist_name, spphr)
- continue
-
- # try up to 3 times before declaring failure, in case of
- # racing with the publisher/transient download failure
- success = False
- for i in range(3):
- if spphr.source_package_name in imported_srcpkgs:
- success = True
- break
- print('git ubuntu import -l usd-importer-bot %s' % spphr.source_package_name)
- try:
- run(['git', 'ubuntu', 'import', '-l', 'usd-importer-bot', spphr.source_package_name], check=True)
- imported_srcpkgs.append(spphr.source_package_name)
- except subprocess.CalledProcessError:
- print('failed to import %s (attempt %d/3)' % (spphr.source_package_name, i+1))
- time.sleep(10)
- continue
-
- if not success:
- print('failed to import %s' % spphr.source_package_name)
- failed_srcpkgs.append(spphr.source_package_name)
- # we want to bump the timestamp for every source package
- # published, regardless of whether we have already run the
- # importer for it in this cron run, since it might be published
- # multiple times (e.g., different series for an SRU).
- update_timestamp(dist_name, spphr)
- return (imported_srcpkgs, failed_srcpkgs)
-
-def main():
- try:
- run(['git', 'config', '--global', 'user.name'], check=True)
- except subprocess.CalledProcessError:
- run(['git', 'config', '--global', 'user.name', 'Ubuntu Git Importer'])
- try:
- run(['git', 'config', '--global', 'user.email'], check=True)
- except subprocess.CalledProcessError:
- run(['git', 'config', '--global', 'user.email', 'usd-importer-announce@list.canonical.com'])
- try:
- with open(import_cron_packages, 'r') as f:
- packages = [line.strip() for line in f if not line.startswith('#')]
- except:
- packages = list()
- read_timestamps()
- # minutes
- sleep_interval = 20
- # hours
- mail_interval = 1
- launchpad = launchpad_login()
- imported_srcpkgs = list()
- failed_srcpkgs = list()
- last_mail_timestamp = time.time()
- while True:
- if (time.time() - last_mail_timestamp >= (mail_interval * 60 * 60) and
- (len(imported_srcpkgs) != 0 or len(failed_srcpkgs) != 0)
- ):
- i = b'Subject: Importer report\n'
- if len(imported_srcpkgs) != 0:
- i += b'Successfully imported the following source packages\n' + b'\n'.join(map(lambda x : x.encode('utf-8'), imported_srcpkgs))
- if len(failed_srcpkgs) != 0:
- i += b'\nFailed to import the following source packages\n' + b'\n'.join(map(lambda x : x.encode('utf-8'), failed_srcpkgs))
- run(['sendmail', '-F', 'Ubuntu Git Importer', '-f', 'usd-importer-do-not-mail@canonical.com', 'usd-import-announce@lists.canonical.com'], input=i)
- last_mail_timestamp = time.time()
- orig_imported_srcpkgs = copy(imported_srcpkgs)
- imported_srcpkgs, failed_srcpkgs = import_new_published_sources(launchpad, packages, imported_srcpkgs, failed_srcpkgs)
- print('Imported %d source packages' % (len(imported_srcpkgs) - len(orig_imported_srcpkgs)))
- if len(imported_srcpkgs) == len(orig_imported_srcpkgs):
- write_timestamps()
- time.sleep(sleep_interval * 60)
- imported_srcpkgs = list()
- failed_srcpkgs = list()
- # I have seen transient network issues lead to timeouts and stalls
- # so reset the connection on each iteration
- launchpad = launchpad_login()
-
-if __name__ == '__main__':
- main()
diff --git a/gitubuntu/importer.py b/gitubuntu/importer.py
index fb5cd77..f7dfe92 100644
--- a/gitubuntu/importer.py
+++ b/gitubuntu/importer.py
@@ -26,7 +26,6 @@

 import argparse
 import atexit
-import dbm
 import functools
 import getpass
 import logging
@@ -151,7 +150,6 @@ def main(
 parentfile=top_level_defaults.parentfile,
 retries=top_level_defaults.retries,
 retry_backoffs=top_level_defaults.retry_backoffs,
- db_cache_dir=None,
 ):
 """Main entry point to the importer

@@ -180,9 +178,6 @@ def main(
 @parentfile: string path to file specifying parent overrides
 @retries: integer number of download retries to attempt
 @retry_backoffs: list of backoff durations to use between retries
- @db_cache_dir: string fileystem directory containing 'ubuntu' and
- 'debian' dbm database files, which store the progress of prior
- importer runs

 If directory is None, a temporary directory is created and used.

@@ -192,11 +187,6 @@ def main(
 If dl_cache is None, CACHE_PATH in the local repository will be
 used.

- If db_cache_dir is None, no database lookups are performed which makes
- it possible that the import will attempt to re-import already
- imported publishes. This should be fine, although less efficient
- than possible.
-
 Returns 0 on successful import (which includes non-fatal failures);
 1 otherwise.
 """
@@ -324,9 +314,6 @@ def main(
 else:
 workdir = dl_cache

- if db_cache_dir:
- os.makedirs(db_cache_dir, exist_ok=True)
-
 os.makedirs(workdir, exist_ok=True)

 # now sets a global _PARENT_OVERRIDES
@@ -339,7 +326,6 @@ def main(
 patches_applied=False,
 debian_head_versions=debian_head_versions,
 ubuntu_head_versions=ubuntu_head_versions,
- db_cache_dir=db_cache_dir,
 debian_sinfo=debian_sinfo,
 ubuntu_sinfo=ubuntu_sinfo,
 active_series_only=active_series_only,
@@ -361,7 +347,6 @@ def main(
 patches_applied=True,
 debian_head_versions=applied_debian_head_versions,
 ubuntu_head_versions=applied_ubuntu_head_versions,
- db_cache_dir=db_cache_dir,
 debian_sinfo=debian_sinfo,
 ubuntu_sinfo=ubuntu_sinfo,
 active_series_only=active_series_only,
@@ -1367,7 +1352,6 @@ def import_publishes(
 patches_applied,
 debian_head_versions,
 ubuntu_head_versions,
- db_cache_dir,
 debian_sinfo,
 ubuntu_sinfo,
 active_series_only,
@@ -1378,8 +1362,6 @@ def import_publishes(
 history_found = False
 only_debian = False
 srcpkg_information = None
- last_debian_spphr = None
- last_ubuntu_spphr = None
 if patches_applied:
 _namespace = namespace
 namespace = '%s/applied' % namespace
@@ -1395,28 +1377,15 @@ def import_publishes(
 import_unapplied_spi,
 skip_orig=skip_orig,
 )
-
- for distname, versions, dist_sinfo, last_spphr in (
- ("debian", debian_head_versions, debian_sinfo, last_debian_spphr),
- ("ubuntu", ubuntu_head_versions, ubuntu_sinfo, last_ubuntu_spphr),
- ):
+ for distname, versions, dist_sinfo in (
+ ("debian", debian_head_versions, debian_sinfo),
+ ("ubuntu", ubuntu_head_versions, ubuntu_sinfo)):
 if active_series_only and distname == "debian":
 continue
-
- last_spphr = None
- if db_cache_dir:
- with dbm.open(os.path.join(db_cache_dir, distname), 'c') as cache:
- try:
- last_spphr = decode_binary(cache[pkgname])
- except KeyError:
- pass
-
 try:
- last_spi = None
 for srcpkg_information in dist_sinfo.launchpad_versions_published_after(
 versions,
 namespace,
- last_spphr,
 workdir=workdir,
 active_series_only=active_series_only
 ):
@@ -1427,11 +1396,6 @@ def import_publishes(
 namespace=_namespace,
 ubuntu_sinfo=ubuntu_sinfo,
 )
- last_spi = srcpkg_information
- if last_spi:
- if db_cache_dir:
- with dbm.open(os.path.join(db_cache_dir, distname), 'w') as db_cache:
- db_cache[pkgname] = str(last_spi.spphr)
 except NoPublicationHistoryException:
 logging.warning("No publication history found for %s in %s.",
 pkgname, distname
@@ -1507,11 +1471,6 @@ def parse_args(subparsers=None, base_subparsers=None):
 action='store_true',
 help=argparse.SUPPRESS,
 )
- parser.add_argument(
- '--db-cache',
- type=str,
- help=argparse.SUPPRESS,
- )
 if not subparsers:
 return parser.parse_args()
 return 'import - %s' % kwargs['description']
@@ -1540,11 +1499,6 @@ def cli_main(args):
 except AttributeError:
 dl_cache = None

- try:
- db_cache = args.db_cache
- except AttributeError:
- db_cache = None
-
 return main(
 pkgname=args.package,
 owner=args.lp_owner,
@@ -1564,5 +1518,4 @@ def cli_main(args):
 parentfile=args.parentfile,
 retries=args.retries,
 retry_backoffs=args.retry_backoffs,
- db_cache_dir=args.db_cache,
 )
diff --git a/gitubuntu/source-package-blacklist.txt b/gitubuntu/source-package-blacklist.txt
new file mode 100644
index 0000000..ac8aef9
--- /dev/null
+++ b/gitubuntu/source-package-blacklist.txt
@@ -0,0 +1,7 @@
+linux
+linux-base
+linux-firmware
+linux-meta
+lxc
+lxcfs
+lxd
diff --git a/gitubuntu/import-cron-packages.txt b/gitubuntu/source-package-whitelist.txt
index 74df660..61b6f35 100644
--- a/gitubuntu/import-cron-packages.txt
+++ b/gitubuntu/source-package-whitelist.txt
@@ -435,11 +435,7 @@ libxml-security-java
 libxml-xpath-perl
 libxmu
 libyaml
-#linux
 linux-atm
-#linux-base
-#linux-firmware
-#linux-meta
 lm-sensors
 lockfile-progs
 logcheck
@@ -454,9 +450,6 @@ ltrace
 lua5.2
 lua-lpeg
 lvm2
-#lxc
-#lxcfs
-#lxd
 lz4
 lzo2
 m2300w
diff --git a/gitubuntu/source_information.py b/gitubuntu/source_information.py
index 6399460..3fef4ec 100644
--- a/gitubuntu/source_information.py
+++ b/gitubuntu/source_information.py
@@ -421,14 +421,7 @@ class GitUbuntuSourceInformation(object):
 for srcpkg in spph:
 yield self.get_corrected_spi(srcpkg, workdir)

- def launchpad_versions_published_after(
- self,
- head_versions,
- namespace,
- last_spphr=None,
- workdir=None,
- active_series_only=False,
- ):
+ def launchpad_versions_published_after(self, head_versions, namespace, workdir=None, active_series_only=False):
 args = {
 'exact_match':True,
 'source_name':self.pkgname,
@@ -456,14 +449,7 @@ class GitUbuntuSourceInformation(object):
 if len(spph) == 0:
 raise NoPublicationHistoryException("Is %s published in %s?" %
 (self.pkgname, self.dist_name))
- if last_spphr:
- _spph = list()
- for spphr in spph:
- if str(spphr) == last_spphr:
- break
- _spph.append(spphr)
- spph = _spph
- elif head_versions:
+ if len(head_versions) > 0:
 _spph = list()
 for spphr in spph:
 spi = GitUbuntuSourcePackageInformation(spphr, self.dist_name,
diff --git a/man/man1/git-ubuntu-import.1 b/man/man1/git-ubuntu-import.1
index 74bfe83..dd7fd9e 100644
--- a/man/man1/git-ubuntu-import.1
+++ b/man/man1/git-ubuntu-import.1
@@ -1,4 +1,4 @@
-.TH "GIT-UBUNTU-IMPORT" "1" "2017-11-08" "Git-Ubuntu 0.6.2" "Git-Ubuntu Manual"
+.TH "GIT-UBUNTU-IMPORT" "1" "2017-07-19" "Git-Ubuntu 0.2" "Git-Ubuntu Manual"

 .SH "NAME"
 git-ubuntu import \- Import Launchpad publishing history to Git
@@ -9,8 +9,7 @@ git-ubuntu import \- Import Launchpad publishing history to Git
 <user>] [\-\-dl-cache <dl_cache>] [\-\-no-fetch] [\-\-no-push]
 [\-\-no-clean] [\-d | \-\-directory <directory>]
 [\-\-active-series-only] [\-\-skip-applied] [\-\-skip-orig]
-[\-\-reimport] [\-\-allow-applied-failures] [\-\-db-cache <db_cache>]
-<package>
+[\-\-reimport] [\-\-allow-applied-failures] <package>
 .FI
 .SP
 .SH "DESCRIPTION"
@@ -198,21 +197,6 @@ After investigation, this flag can be used to indicate the importer is
 allowed to ignore such a failure\&.
 .RE
 .PP
-\-\-db-cache <db_cache>
-.RS 4
-The path to a directory containing Python dbm database disk files for
-importer metadata\&.
-If \fB<db_cache>\fR does not exist, it will be created\&.
-Two files in \fB<db_cache>\fR are used, "ubuntu" and "debian', which are
-created if not already present\&.
-The cache files provide information to the importer about prior imports
-of \fB<package>\fR and which Launchpad publishing record was last
-imported\&.
-This is necessary because the imported Git repository does not
-necessarily maintain any metadata about Launchpad publishing
-information\&.
-.RE
-.PP
 <package>
 .RS 4
 The name of the source package to import\&.
482diff --git a/scripts/import-source-packages.py b/scripts/import-source-packages.py
483new file mode 100755
484index 0000000..698ae35
485--- /dev/null
486+++ b/scripts/import-source-packages.py
487@@ -0,0 +1,377 @@
488+#!/usr/bin/env python3
489+
490+# General design:
491+# Infinite loop:
492+# now = time()
493+# let new_publishes be the set of unique srcpkg names in publishes between last run and now
494+# for srcpkg in new_publishes:
495+# If srcpkg in blacklist: skip
496+# If srcpkg not in PHASING_{component}: skip
497+# try to import srcpkg
498+# Report on successful and failed imports
499+# If no srcpkgs: sleep for some time to let publisher run
500+
501+import argparse
502+import collections
503+import datetime
504+import os
505+import sys
506+import tempfile
507+import time
508+
509+# We expect to be running from a git repository in master for this
510+# script, because the snap's python code is not exposed except within
511+# the snap
512+try:
513+ REALPATH = os.readlink(__file__)
514+except OSError:
515+ REALPATH = __file__
516+sys.path.insert(
517+ 0,
518+ os.path.abspath(
519+ os.path.join(os.path.dirname(REALPATH), os.path.pardir)
520+ )
521+)
522+
523+from gitubuntu.source_information import launchpad_login
524+from gitubuntu.run import run
525+import scriptutils
526+
527+# The 'time' attribute is the publication date, as a timestamp, of the SPPHR
528+# corresponding to the URL stored in the 'link' attribute.
529+Timestamp = collections.namedtuple('Timestamp', ['time', 'link'])
530+
531+LOG_PATH = os.path.join(tempfile.gettempdir(), 'import-source-packages-log')
532+
533+def import_new_published_sources(
534+ num_workers,
535+ whitelist,
536+ blacklist,
537+ phasing_main,
538+ phasing_universe,
539+ dry_run,
540+):
541+ """import_new_published_sources - Import all new publishes since a prior execution
542+
543+ Arguments:
544+ num_workers - integer number of worker processes to use
545+ whitelist - a list of packages to always import
546+ blacklist - a list of packages to never import
547+ phasing_main - an integer percentage of all packages in main to import
548+ phasing_universe - an integer percentage of all packages in universe to import
549+ dry_run - a boolean to indicate a dry-run operation
550+
551+ Returns:
552+ A tuple of two lists, the first containing the names of all
553+ successfully imported source packages, the second containing the
554+ names of all source packages that failed to import.
555+ """
556+ timestamps = read_timestamps()
557+ launchpad = launchpad_login()
558+
559+ # filtered_pkgnames is the list of source package names across all
560+ # distributions we will want to process, based upon
561+ # scriptutils.should_import_srcpkg
562+ filtered_pkgnames = set()
563+
564+ for dist_name in ['debian', 'ubuntu']:
565+ timestamp = timestamps[dist_name]
566+
567+ # dist_newest_spphr is the most recent publication record with a
568+ # valid published date in the distribution
569+ dist_newest_spphr = None
570+
571+ # dist_filtered_pkgnames is the set of source package names in
572+ # the dist_name distribution that qualify for import according to our
573+ # whitelists, blacklists and phasing requirements
574+ dist_filtered_pkgnames = set()
575+
576+ print(
577+ "Examining publishes in %s since %s" % (
578+ dist_name,
579+ datetime.datetime.fromtimestamp(
580+ timestamp.time,
581+ ).strftime("%Y-%m-%d %H:%M:%S"),
582+ )
583+ )
584+
585+ # spph is the raw publication history for a distribution from
586+ # Launchpad, sorted by publication date in reverse chronological order
587+ dist = launchpad.distributions[dist_name]
588+ spph = dist.main_archive.getPublishedSources(order_by_date=True)
589+ if not spph:
590+ print("No publishing data found in %s" % dist_name)
591+ continue
592+
593+ for spphr in spph:
594+ # this is the matching upload to the last publish seen
595+ if str(spphr) == timestamp.link:
596+ break
597+
598+ # only check if we should stop iterating due to the
599+ # timestamps if there is a timestamp to compare to, which
600+ # means the source package is actually published.
601+ if spphr.date_published:
602+ if not dist_newest_spphr:
603+ dist_newest_spphr = spphr
604+ # stop iterating (backwards chronologically) when we see
605+ # a publish timestamp before the last run
606+ if spphr.date_published.timestamp() < timestamp.time:
607+ break
608+ if scriptutils.should_import_srcpkg(
609+ spphr.source_package_name,
610+ spphr.component_name,
611+ whitelist,
612+ blacklist,
613+ phasing_main,
614+ phasing_universe,
615+ ):
616+ dist_filtered_pkgnames.add(spphr.source_package_name)
617+
618+ if not dist_filtered_pkgnames:
619+ print(
620+ "No new relevant publishes found in %s relative to %s" % (
621+ dist_name,
622+ datetime.datetime.fromtimestamp(
623+ timestamp.time
624+ ).strftime("%Y-%m-%d %H:%M:%S"),
625+ )
626+ )
627+
628+ filtered_pkgnames = filtered_pkgnames | dist_filtered_pkgnames
629+ if dist_newest_spphr:
630+ # Update timestamp
631+ timestamps[dist_name] = timestamp = Timestamp(
632+ time=dist_newest_spphr.date_published.timestamp(),
633+ link=str(dist_newest_spphr),
634+ )
635+
636+
637+ ret = scriptutils.pool_map_import_srcpkg(
638+ num_workers=num_workers,
639+ dry_run=dry_run,
640+ pkgnames=filtered_pkgnames,
641+ )
642+
643+ write_timestamps(timestamps)
644+
645+ return ret
646+
647+def read_timestamps():
648+ """read_timestamps - Read saved timestamp values from LOG_PATH
649+
650+ If the log file is not readable (e.g., does not exist), the
651+ timestamps will be set to 24 hours before now.
652+
653+ This method is symmetrical to write_timestamps.
654+
655+ Returns:
656+ A dictionary with 'debian' and 'ubuntu' keys, each of which is a
657+ Timestamp namedtuple.
658+ """
659+ try:
660+ timestamps = dict()
661+ with open(LOG_PATH, 'r') as log_file:
662+ for line in log_file:
663+ dist_name, log_type, value = line.split()
664+ if dist_name not in timestamps:
665+ timestamps[dist_name] = dict()
666+ assert log_type in ['time', 'link']
667+ assert log_type not in timestamps[dist_name]
668+ if log_type == 'link':
669+ timestamps[dist_name][log_type] = value
670+ else:
671+ timestamps[dist_name][log_type] = float(value)
672+ return {
673+ dist: Timestamp(**dict_form)
674+ for dist, dict_form in timestamps.items()
675+ }
676+ except IOError:
677+ _start = time.time() - (24 * 60 * 60)
678+ return dict(
679+ ubuntu=Timestamp(time=_start, link=None),
680+ debian=Timestamp(time=_start, link=None),
681+ )
682+
683+def write_timestamps(timestamps):
684+ """write_timestamps - Write timestamp values to LOG_PATH
685+
686+ Arguments:
687+ timestamps - a dictionary with 'debian' and 'ubuntu' keys, each of which is
688+ a Timestamp namedtuple.
689+
690+ This method is symmetrical to read_timestamps.
691+ """
692+ new_log = LOG_PATH + '.new'
693+ with open(new_log, 'w+') as log_file:
694+ log_file.write(
695+ 'debian time %f\n' %
696+ timestamps['debian'].time
697+ )
698+ log_file.write(
699+ 'debian link %s\n' %
700+ timestamps['debian'].link
701+ )
702+ log_file.write(
703+ 'ubuntu time %f\n' %
704+ timestamps['ubuntu'].time
705+ )
706+ log_file.write(
707+ 'ubuntu link %s\n' %
708+ timestamps['ubuntu'].link
709+ )
710+ os.replace(new_log, LOG_PATH)
711+
712+def main(
713+ num_workers=scriptutils.DEFAULTS.num_workers,
714+ whitelist_path=scriptutils.DEFAULTS.whitelist,
715+ blacklist_path=scriptutils.DEFAULTS.blacklist,
716+ phasing_main=scriptutils.DEFAULTS.phasing_main,
717+ phasing_universe=scriptutils.DEFAULTS.phasing_universe,
718+ dry_run=scriptutils.DEFAULTS.dry_run,
719+):
720+ """main - Main entry point to the script
721+
722+ Arguments:
723+ num_workers - integer number of worker processes to use
724+ whitelist_path - string filesystem path to a text file of packages
725+ to always import
726+ blacklist_path - string filesystem path to a text file of packages
727+ to never import
728+ phasing_main - an integer percentage of all packages in main to
729+ import
730+ phasing_universe - an integer percentage of all packages in universe
731+ to import
732+ dry_run - a boolean to indicate a dry-run operation
733+ """
734+ scriptutils.setup_git_config()
735+
736+ try:
737+ with open(whitelist_path, 'r') as whitelist_file:
738+ whitelist = [
739+ line.strip() for line in whitelist_file
740+ if not line.startswith('#')
741+ ]
742+ except (FileNotFoundError, IOError):
743+ whitelist = list()
744+
745+ try:
746+ with open(blacklist_path, 'r') as blacklist_file:
747+ blacklist = [
748+ line.strip() for line in blacklist_file
749+ if not line.startswith('#')
750+ ]
751+ except (FileNotFoundError, IOError):
752+ blacklist = list()
753+
754+ sleep_interval_minutes = 20
755+ mail_interval_hours = 1
756+
757+ # pretend we sent an e-mail recently
758+ last_mail_timestamp = time.time()
759+ mail_imported_srcpkgs = set()
760+ mail_failed_srcpkgs = set()
761+
762+ while True:
763+ imported_srcpkgs, failed_srcpkgs = import_new_published_sources(
764+ num_workers,
765+ whitelist,
766+ blacklist,
767+ phasing_main,
768+ phasing_universe,
769+ dry_run,
770+ )
771+ print("Imported %d source packages" % len(imported_srcpkgs))
772+ mail_imported_srcpkgs |= set(imported_srcpkgs)
773+ mail_failed_srcpkgs |= set(failed_srcpkgs)
774+ secs_since_last_mail = time.time() - last_mail_timestamp
775+ if (
776+ secs_since_last_mail >= (mail_interval_hours * 60 * 60) and
777+ (mail_imported_srcpkgs or mail_failed_srcpkgs)
778+ ):
779+ msg = b"Subject: Importer report\n"
780+ if mail_imported_srcpkgs:
781+ msg += b"Successfully imported the following source packages:\n"
782+ msg += b"\n".join(
783+ map(lambda x: x.encode('utf-8'), mail_imported_srcpkgs)
784+ )
785+ if mail_failed_srcpkgs:
786+ msg += b"\nFailed to import the following source packages:\n"
787+ msg += b"\n".join(
788+ map(lambda x: x.encode('utf-8'), mail_failed_srcpkgs)
789+ )
790+ if dry_run:
791+ print("Would send email with contents:\n%s" % msg.decode())
792+ else:
793+ run(
794+ [
795+ 'sendmail',
796+ '-F', 'Ubuntu Git Importer',
797+ '-f', 'usd-importer-do-not-mail@canonical.com',
798+ 'usd-import-announce@lists.canonical.com',
799+ ],
800+ input=msg,
801+ )
802+ last_mail_timestamp = time.time()
803+ mail_imported_srcpkgs = set()
804+ mail_failed_srcpkgs = set()
805+ # if we have caught up to the publisher, go to sleep
806+ if not imported_srcpkgs:
807+ time.sleep(sleep_interval_minutes * 60)
808+
809+def cli_main():
810+ """cli_main - CLI entry point to script
811+ """
812+ parser = argparse.ArgumentParser(
813+ description='Script to import all source packages with phasing',
814+ )
815+ parser.add_argument(
816+ '--num-workers',
817+ type=int,
818+ help="Number of worker processes to use",
819+ default=scriptutils.DEFAULTS.num_workers,
820+ )
821+ parser.add_argument(
822+ '--whitelist',
823+ type=str,
824+ help="Path to whitelist file",
825+ default=scriptutils.DEFAULTS.whitelist,
826+ )
827+ parser.add_argument(
828+ '--blacklist',
829+ type=str,
830+ help="Path to blacklist file",
831+ default=scriptutils.DEFAULTS.blacklist,
832+ )
833+ parser.add_argument(
834+ '--phasing-universe',
835+ type=int,
836+ help="Percentage of universe packages to phase",
837+ default=scriptutils.DEFAULTS.phasing_universe,
838+ )
839+ parser.add_argument(
840+ '--phasing-main',
841+ type=int,
842+ help="Percentage of main packages to phase",
843+ default=scriptutils.DEFAULTS.phasing_main,
844+ )
845+ parser.add_argument(
846+ '--dry-run',
847+ action='store_true',
848+ help="Simulate operation but do not actually do anything",
849+ default=scriptutils.DEFAULTS.dry_run,
850+ )
851+
852+ args = parser.parse_args()
853+
854+ main(
855+ num_workers=args.num_workers,
856+ whitelist_path=args.whitelist,
857+ blacklist_path=args.blacklist,
858+ phasing_main=args.phasing_main,
859+ phasing_universe=args.phasing_universe,
860+ dry_run=args.dry_run,
861+ )
862+
863+if __name__ == '__main__':
864+ cli_main()
865diff --git a/scripts/scriptutils.py b/scripts/scriptutils.py
866new file mode 100644
867index 0000000..c89507e
868--- /dev/null
869+++ b/scripts/scriptutils.py
870@@ -0,0 +1,191 @@
871+from collections import namedtuple
872+import functools
873+import hashlib
874+import multiprocessing
875+import os
876+import sys
877+import subprocess
878+import time
879+
880+import pkg_resources
881+
882+# We expect to be running from a git repository in master for this
883+# script, because the snap's python code is not exposed except within
884+# the snap
885+try:
886+ REALPATH = os.readlink(__file__)
887+except OSError:
888+ REALPATH = __file__
889+sys.path.insert(
890+ 0,
891+ os.path.abspath(
892+ os.path.join(os.path.dirname(REALPATH), os.path.pardir)
893+ )
894+)
895+
896+from gitubuntu.run import run
897+
898+Defaults = namedtuple(
899+ 'Defaults',
900+ [
901+ 'num_workers',
902+ 'whitelist',
903+ 'blacklist',
904+ 'phasing_universe',
905+ 'phasing_main',
906+ 'dry_run',
907+ 'use_whitelist',
908+ ],
909+)
910+
911+DEFAULTS = Defaults(
912+ num_workers=10,
913+ whitelist=pkg_resources.resource_filename(
914+ 'gitubuntu',
915+ 'source-package-whitelist.txt',
916+ ),
917+ blacklist=pkg_resources.resource_filename(
918+ 'gitubuntu',
919+ 'source-package-blacklist.txt',
920+ ),
921+ phasing_universe=0,
922+ phasing_main=1,
923+ dry_run=False,
924+ use_whitelist=True,
925+)
926+
927+
928+def should_import_srcpkg(
929+ pkgname,
930+ component,
931+ whitelist,
932+ blacklist,
933+ phasing_main,
934+ phasing_universe,
935+):
936+ """should_import_srcpkg - indicate if a given source package should be imported
937+
938+ The phasing is implemented similarly to update-manager. If the
939+ md5sum of the source package name, read as an integer, is at most
940+ (appropriate percentage / 100) * 2^128 (2^128 being the size of the
941+ md5sum space), the source package name is in the phasing set.
942+
943+ Arguments:
944+ pkgname - string name of a source package
945+ component - string archive component of @pkgname
946+ whitelist - a list of packages to always import
947+ blacklist - a list of packages to never import
948+ phasing_main - an integer percentage of all packages in main to import
949+ phasing_universe - an integer percentage of all packages in universe to import
950+
951+ Returns:
952+ True if @pkgname should be imported, False if not.
953+ """
954+ if pkgname in blacklist:
955+ return False
956+ if pkgname in whitelist:
957+ return True
958+ md5sum = int(
959+ hashlib.md5(
960+ pkgname.encode('utf-8')
961+ ).hexdigest(),
962+ 16,
963+ )
964+ if component == 'main':
965+ if md5sum <= (phasing_main / 100) * (2**128):
966+ return True
967+ elif component == 'universe':
968+ if md5sum <= (phasing_universe / 100) * (2**128):
969+ return True
970+ # skip partner and multiverse for now
971+ return False
972+
973+
974+def import_srcpkg(pkgname, dry_run):
975+ """import_srcpkg - Invoke git ubuntu import on @pkgname
976+
977+ Arguments:
978+ pkgname - string name of a source package
979+ dry_run - a boolean to indicate a dry-run operation
980+
981+ Returns:
982+ A tuple of a string and a boolean, where the string is the package
983+ name and the boolean is the success or failure of the import.
984+ """
985+ ret = False
986+
987+ # try up to 3 times before declaring failure, in case of
988+ # racing with the publisher finalizing files and/or
989+ # transient download failure
990+
991+ for attempt in range(3):
992+ cmd = [
993+ 'git',
994+ 'ubuntu',
995+ 'import',
996+ '-l',
997+ 'usd-importer-bot',
998+ pkgname,
999+ ]
1000+ try:
1001+ print(' '.join(cmd))
1002+ if not dry_run:
1003+ run(cmd, check=True)
1004+ ret = True
1005+ break
1006+ except subprocess.CalledProcessError:
1007+ print(
1008+ "Failed to import %s (attempt %d/3)" % (
1009+ pkgname,
1010+ attempt+1,
1011+ )
1012+ )
1013+ time.sleep(10)
1014+
1015+ return pkgname, ret
1016+
1017+def setup_git_config(
1018+ name='Ubuntu Git Importer',
1019+ email='usd-importer-announce@lists.canonical.com',
1020+):
1021+ """setup_git_config - Ensure global required Git configuration values are set
1022+
1023+ Arguments:
1024+ name - string name to set as user.name in Git config
1025+ email - string email to set as user.email in Git config
1026+ """
1027+ try:
1028+ run(['git', 'config', '--global', 'user.name'], check=True)
1029+ except subprocess.CalledProcessError:
1030+ run(['git', 'config', '--global', 'user.name', name])
1031+ try:
1032+ run(['git', 'config', '--global', 'user.email'], check=True)
1033+ except subprocess.CalledProcessError:
1034+ run(['git', 'config', '--global', 'user.email', email])
1035+
1036+def pool_map_import_srcpkg(
1037+ num_workers,
1038+ dry_run,
1039+ pkgnames,
1040+):
1041+ """pool_map_import_srcpkg - Use a multiprocessing.Pool to parallel
1042+ import source packages
1043+
1044+ Arguments:
1045+ num_workers - integer number of worker processes to use
1046+ dry_run - a boolean to indicate a dry-run operation
1047+ pkgnames - a list of string names of source packages
1048+ """
1049+ with multiprocessing.Pool(processes=num_workers) as pool:
1050+ results = pool.map(
1051+ functools.partial(
1052+ import_srcpkg,
1053+ dry_run=dry_run,
1054+ ),
1055+ pkgnames,
1056+ )
1057+
1058+ return (
1059+ [pkg for pkg, success in results if success],
1060+ [pkg for pkg, success in results if not success],
1061+ )
1062diff --git a/scripts/source-package-walker.py b/scripts/source-package-walker.py
1063new file mode 100755
1064index 0000000..aaa8ef1
1065--- /dev/null
1066+++ b/scripts/source-package-walker.py
1067@@ -0,0 +1,272 @@
1068+#!/usr/bin/env python3
1069+
1070+# General design:
1071+# let publishes be the set of srcpkg names
1072+# for srcpkg in publishes:
1073+# If srcpkg in blacklist: skip
1074+# If srcpkg not in PHASING_{component}: skip
1075+# try to import srcpkg
1076+# Report on successful and failed imports
1077+
1078+import argparse
1079+import bz2
1080+import itertools
1081+import gzip
1082+import lzma
1083+import os
1084+import sys
1085+import urllib.request
1086+
1087+import scriptutils
1088+
1089+from debian.deb822 import Sources
1090+
1091+# We expect to be running from a git repository in master for this
1092+# script, because the snap's python code is not exposed except within
1093+# the snap
1094+try:
1095+ REALPATH = os.readlink(__file__)
1096+except OSError:
1097+ REALPATH = __file__
1098+sys.path.insert(
1099+ 0,
1100+ os.path.abspath(
1101+ os.path.join(os.path.dirname(REALPATH), os.path.pardir)
1102+ )
1103+)
1104+
1105+from gitubuntu.source_information import GitUbuntuSourceInformation
1106+
1107+def import_all_published_sources(
1108+ num_workers,
1109+ whitelist,
1110+ blacklist,
1111+ phasing_main,
1112+ phasing_universe,
1113+ dry_run,
1114+):
1115+ """import_all_published_sources - Import all publishes satisfying a
1116+ {white,black}list and phasing
1117+
1118+ Arguments:
1119+ num_workers - integer number of worker processes to use
1120+ whitelist - a list of packages to always import
1121+ blacklist - a list of packages to never import
1122+ phasing_main - an integer percentage of all packages in main to import
1123+ phasing_universe - an integer percentage of all packages in universe to import
1124+ dry_run - a boolean to indicate a dry-run operation
1125+
1126+ Returns:
1127+ A tuple of two lists, the first containing the names of all
1128+ successfully imported source packages, the second containing the
1129+ names of all source packages that failed to import.
1130+ """
1131+ serieses = GitUbuntuSourceInformation('ubuntu').active_series_name_list
1132+ components = ['main', 'universe',]
1133+ pocket_suffixes = ['-proposed', '-updates', '-security', '',]
1134+ compressions = {
1135+ '.xz': lzma.open,
1136+ '.bz2': bz2.open,
1137+ '.gz': gzip.open,
1138+ }
1139+
1140+ base_sources_url = (
1141+ 'http://archive.ubuntu.com/ubuntu/dists/%s%s/%s/source/Sources'
1142+ )
1143+ pkgnames = set()
1144+ for component in components:
1145+ for series, pocket_suffix in itertools.product(
1146+ serieses,
1147+ pocket_suffixes,
1148+ ):
1149+ url = base_sources_url % (
1150+ series,
1151+ pocket_suffix,
1152+ component,
1153+ )
1154+ for compression, opener in compressions.items():
1155+ try:
1156+ with urllib.request.urlopen(
1157+ url + compression
1158+ ) as source_url_file:
1159+ with opener(source_url_file, mode='r') as sources:
1160+ print(url + compression)
1161+ for src in Sources.iter_paragraphs(
1162+ sources,
1163+ #fields=['Package,'],
1164+ use_apt_pkg=False,
1165+ ):
1166+ pkgname = src['Package']
1167+ if scriptutils.should_import_srcpkg(
1168+ pkgname,
1169+ component,
1170+ whitelist,
1171+ blacklist,
1172+ phasing_main,
1173+ phasing_universe,
1174+ ):
1175+ pkgnames.add(pkgname)
1176+ break
1177+ except urllib.error.HTTPError:
1178+ pass
1179+ else:
1180+ print(
1181+ "Unable to find any Sources file for component=%s, "
1182+ "series=%s, pocket=%s" % (
1183+ component,
1184+ series,
1185+ pocket_suffix,
1186+ )
1187+ )
1188+ sys.exit(1)
1189+
1190+ if not pkgnames:
1191+ print("No relevant publishes found")
1192+ return [], []
1193+
1194+ return scriptutils.pool_map_import_srcpkg(
1195+ num_workers=num_workers,
1196+ dry_run=dry_run,
1197+ pkgnames=pkgnames,
1198+ )
1199+
1200+def main(
1201+ num_workers=scriptutils.DEFAULTS.num_workers,
1202+ whitelist_path=scriptutils.DEFAULTS.whitelist,
1203+ blacklist_path=scriptutils.DEFAULTS.blacklist,
1204+ phasing_main=scriptutils.DEFAULTS.phasing_main,
1205+ phasing_universe=scriptutils.DEFAULTS.phasing_universe,
1206+ dry_run=scriptutils.DEFAULTS.dry_run,
1207+ use_whitelist=scriptutils.DEFAULTS.use_whitelist,
1208+):
1209+ """main - Main entry point to the script
1210+
1211+ Arguments:
1212+ num_workers - integer number of worker processes to use
1213+ whitelist_path - string filesystem path to a text file of packages
1214+ to always import
1215+ blacklist_path - string filesystem path to a text file of packages
1216+ to never import
1217+ phasing_main - an integer percentage of all packages in main to
1218+ import
1219+ phasing_universe - an integer percentage of all packages in universe
1220+ to import
1221+ dry_run - a boolean to indicate a dry-run operation
1222+ use_whitelist - a boolean to control whether the whitelist data is
1223+ used
1224+
1225+ use_whitelist exists because during the rampup of imports, we want
1226+ to import the whitelist packages and the phased packages. But after
1227+ that first operation to import (or possibly reimport), we do not
1228+ want to keep hitting the whitelist set (we only want to adjust the
1229+ phasing). This is more important in other scripts, but is relevant
1230+ here too.
1231+ """
1232+ scriptutils.setup_git_config()
1233+
1234+ if use_whitelist:
1235+ try:
1236+ with open(whitelist_path, 'r') as whitelist_file:
1237+ whitelist = [
1238+ line.strip() for line in whitelist_file
1239+ if not line.startswith('#')
1240+ ]
1241+ except (FileNotFoundError, IOError):
1242+ whitelist = list()
1243+ else:
1244+ whitelist = list()
1245+
1246+ try:
1247+ with open(blacklist_path, 'r') as blacklist_file:
1248+ blacklist = [
1249+ line.strip() for line in blacklist_file
1250+ if not line.startswith('#')
1251+ ]
1252+ except (FileNotFoundError, IOError):
1253+ blacklist = list()
1254+
1255+ imported_srcpkgs, failed_srcpkgs = import_all_published_sources(
1256+ num_workers,
1257+ whitelist,
1258+ blacklist,
1259+ phasing_main,
1260+ phasing_universe,
1261+ dry_run,
1262+ )
1263+ print(
1264+ "Imported %d source packages:\n%s" % (
1265+ len(imported_srcpkgs),
1266+ '\n'.join(imported_srcpkgs),
1267+ )
1268+ )
1269+ print(
1270+ "Failed to import %d source packages:\n%s" % (
1271+ len(failed_srcpkgs),
1272+ '\n'.join(failed_srcpkgs),
1273+ )
1274+ )
1275+
1276+def cli_main():
1277+ """cli_main - CLI entry point to script
1278+ """
1279+ parser = argparse.ArgumentParser(
1280+ description='Script to import all source packages with phasing',
1281+ )
1282+ parser.add_argument(
1283+ '--num-workers',
1284+ type=int,
1285+ help="Number of worker processes to use",
1286+ default=scriptutils.DEFAULTS.num_workers,
1287+ )
1288+ parser.add_argument(
1289+ '--no-whitelist',
1290+ action='store_false',
1291+ dest='use_whitelist',
1292+ help="Do not process packages in the whitelist",
1293+ default=scriptutils.DEFAULTS.use_whitelist,
1294+ )
1295+ parser.add_argument(
1296+ '--whitelist',
1297+ type=str,
1298+ help="Path to whitelist file",
1299+ default=scriptutils.DEFAULTS.whitelist,
1300+ )
1301+ parser.add_argument(
1302+ '--blacklist',
1303+ type=str,
1304+ help="Path to blacklist file",
1305+ default=scriptutils.DEFAULTS.blacklist,
1306+ )
1307+ parser.add_argument(
1308+ '--phasing-universe',
1309+ type=int,
1310+ help="Percentage of universe packages to phase",
1311+ default=scriptutils.DEFAULTS.phasing_universe,
1312+ )
1313+ parser.add_argument(
1314+ '--phasing-main',
1315+ type=int,
1316+ help="Percentage of main packages to phase",
1317+ default=scriptutils.DEFAULTS.phasing_main,
1318+ )
1319+ parser.add_argument(
1320+ '--dry-run',
1321+ action='store_true',
1322+ help="Simulate operation but do not actually do anything",
1323+ default=scriptutils.DEFAULTS.dry_run,
1324+ )
1325+
1326+ args = parser.parse_args()
1327+
1328+ main(
1329+ num_workers=args.num_workers,
1330+ whitelist_path=args.whitelist,
1331+ blacklist_path=args.blacklist,
1332+ phasing_main=args.phasing_main,
1333+ phasing_universe=args.phasing_universe,
1334+ dry_run=args.dry_run,
1335+ use_whitelist=args.use_whitelist,
1336+ )
1337+
1338+if __name__ == '__main__':
1339+ cli_main()
1340diff --git a/scripts/update-repository-alias.py b/scripts/update-repository-alias.py
1341new file mode 100755
1342index 0000000..5f2c8ae
1343--- /dev/null
1344+++ b/scripts/update-repository-alias.py
1345@@ -0,0 +1,268 @@
1346+#!/usr/bin/env python3
1347+
1348+import argparse
1349+import functools
1350+import lzma
1351+import multiprocessing
1352+import os
1353+import sys
1354+ import urllib.parse, urllib.request
1355+
1356+import scriptutils
1357+
1358+# We expect to be running from a git repository in master for this
1359+# script, because the snap's python code is not exposed except within
1360+# the snap
1361+try:
1362+ REALPATH = os.readlink(__file__)
1363+except OSError:
1364+ REALPATH = __file__
1365+sys.path.insert(
1366+ 0,
1367+ os.path.abspath(
1368+ os.path.join(os.path.dirname(REALPATH), os.path.pardir)
1369+ )
1370+)
1371+
1372+from gitubuntu.source_information import launchpad_login_auth
1373+
1374+def update_git_repository(package, dry_run, unset):
1375+ """update_git_repository - set the default Git Repository on Launchpad for a source package
1376+
1377+ Arguments:
1378+ package - string name of a source package
1379+ dry_run - a boolean to indicate a dry-run operation
1380+ unset - a boolean to indicate the URL should be set to None instead
1381+ of the usd-import-team repository
1382+ """
1383+ launchpad = launchpad_login_auth()
1384+ quoted_package = urllib.parse.quote(package)
1385+ target = launchpad.load('ubuntu/+source/%s' % quoted_package)
1386+ current_default = launchpad.git_repositories.getDefaultRepository(
1387+ target=target,
1388+ )
1389+ logmsg = list()
1390+ if current_default:
1391+ logmsg.append(
1392+ "Current default Git repository for %s: %s" % (
1393+ package,
1394+ current_default.git_https_url,
1395+ )
1396+ )
1397+ else:
1398+ logmsg.append("No default Git repository set for %s" % package)
1399+
1400+ if unset:
1401+ logmsg.append("Unsetting default Git repository for %s" % package)
1402+ if not dry_run:
1403+ launchpad.git_repositories.setDefaultRepository(
1404+ repository=None,
1405+ target=target,
1406+ )
1407+ else:
1408+ path = '~usd-import-team/ubuntu/+source/%s/+git/%s' % (
1409+ quoted_package,
1410+ quoted_package,
1411+ )
1412+ repository = launchpad.git_repositories.getByPath(path=path)
1413+ if repository:
1414+ logmsg.append(
1415+ "Setting default Git repository for %s to %s" % (
1416+ package,
1417+ path,
1418+ )
1419+ )
1420+ if not dry_run:
1421+ launchpad.git_repositories.setDefaultRepository(
1422+ repository=repository,
1423+ target=target,
1424+ )
1425+ else:
1426+ logmsg.append("No usd-import-team repository for %s" % package)
1427+
1428+ print('\n'.join(logmsg))
1429+
1430+# add whitelist, blacklist, phasing and loop
1431+def main(
1432+ num_workers=scriptutils.DEFAULTS.num_workers,
1433+ whitelist_path=scriptutils.DEFAULTS.whitelist,
1434+ blacklist_path=scriptutils.DEFAULTS.blacklist,
1435+ phasing_main=scriptutils.DEFAULTS.phasing_main,
1436+ phasing_universe=scriptutils.DEFAULTS.phasing_universe,
1437+ dry_run=scriptutils.DEFAULTS.dry_run,
1438+ unset=False,
1439+ use_whitelist=scriptutils.DEFAULTS.use_whitelist,
1440+):
1441+ """main - Main entry point to the script
1442+
1443+ Set the default Git Repository target on Launchpad for all source
1444+ packages imported to usd-import-team to the usd-import-team
1445+ repository.
1446+
1447+ Arguments:
1448+ num_workers - integer number of worker processes to use
1449+ whitelist_path - string filesystem path to a text file of packages to always import
1450+ blacklist_path - string filesystem path to a text file of packages to never import
1451+ phasing_main - an integer percentage of all packages in main to import
1452+ phasing_universe - an integer percentage of all packages in universe to import
1453+ dry_run - a boolean to indicate a dry-run operation
1454+ unset - a boolean to indicate the URL should be set to None instead
1455+ of the usd-import-team repository
1456+ use_whitelist - a boolean to control whether the whitelist data is
1457+ used
1458+
1459+ use_whitelist exists because during the rampup of imports, we want
1460+ to import the whitelist packages and the phased packages. But after
1461+ that first operation to import (or possibly reimport), we do not
1462+ want to keep hitting the whitelist set (we only want to adjust the
1463+ phasing). This is more important in other scripts, but is relevant
1464+ here too.
1465+ """
1466+ if use_whitelist:
1467+ try:
1468+ with open(whitelist_path, 'r') as whitelist_file:
1469+ whitelist = [
1470+ line.strip() for line in whitelist_file
1471+ if not line.startswith('#')
1472+ ]
1473+ except (FileNotFoundError, IOError):
1474+ whitelist = list()
1475+ else:
1476+ whitelist = list()
1477+
1478+ try:
1479+ with open(blacklist_path, 'r') as blacklist_file:
1480+ blacklist = [
1481+ line.strip() for line in blacklist_file
1482+ if not line.startswith('#')
1483+ ]
1484+ except (FileNotFoundError, IOError):
1485+ blacklist = list()
1486+
1487+ main_packages = list()
1488+ with urllib.request.urlopen(
1489+ 'http://archive.ubuntu.com/ubuntu/dists/devel/main/source/Sources.xz'
1490+ ) as source_url_file:
1491+ with lzma.open(source_url_file, mode='rt') as main_sources:
1492+ for line in main_sources:
1493+ if line.startswith('Package:'):
1494+ _, pkgname = line.split(':')
1495+ main_packages.append(pkgname.strip())
1496+
1497+ universe_packages = list()
1498+ with urllib.request.urlopen(
1499+ 'http://archive.ubuntu.com/ubuntu/dists/devel/universe/source/Sources.xz'
1500+ ) as source_url_file:
1501+ with lzma.open(source_url_file, mode='rt') as universe_sources:
1502+ for line in universe_sources:
1503+ if line.startswith('Package:'):
1504+ _, pkgname = line.split(':')
1505+ universe_packages.append(pkgname.strip())
1506+
1507+ filtered_pkgnames = set()
1508+ for pkgname in main_packages:
1509+ if scriptutils.should_import_srcpkg(
1510+ pkgname,
1511+ 'main',
1512+ whitelist,
1513+ blacklist,
1514+ phasing_main,
1515+ phasing_universe,
1516+ ):
1517+ filtered_pkgnames.add(pkgname)
1518+ for pkgname in universe_packages:
1519+ if scriptutils.should_import_srcpkg(
1520+ pkgname,
1521+ 'universe',
1522+ whitelist,
1523+ blacklist,
1524+ phasing_main,
1525+ phasing_universe,
1526+ ):
1527+ filtered_pkgnames.add(pkgname)
1528+
1529+ if not filtered_pkgnames:
1530+ print("No relevant publishes found")
1531+ return
1532+
1533+ with multiprocessing.Pool(processes=num_workers) as pool:
1534+ pool.map(
1535+ functools.partial(
1536+ update_git_repository,
1537+ dry_run=dry_run,
1538+ unset=unset,
1539+ ),
1540+ filtered_pkgnames,
1541+ )
1542+
1543+def cli_main():
1544+ """cli_main - main entry point for CLI
1545+ """
1546+ parser = argparse.ArgumentParser(
1547+ description='Update the default Git repository for imported source packages',
1548+ )
1549+ parser.add_argument(
1550+ '--num-workers',
1551+ type=int,
1552+ help="Number of worker processes to use",
1553+ default=scriptutils.DEFAULTS.num_workers,
1554+ )
1555+ parser.add_argument(
1556+ '--no-whitelist',
1557+ action='store_false',
1558+ dest='use_whitelist',
1559+ help="Do not process packages in the whitelist",
1560+ default=scriptutils.DEFAULTS.use_whitelist,
1561+ )
1562+ parser.add_argument(
1563+ '--whitelist',
1564+ type=str,
1565+ help="Path to whitelist file",
1566+ default=scriptutils.DEFAULTS.whitelist,
1567+ )
1568+ parser.add_argument(
1569+ '--blacklist',
1570+ type=str,
1571+ help="Path to blacklist file",
1572+ default=scriptutils.DEFAULTS.blacklist,
1573+ )
1574+ parser.add_argument(
1575+ '--phasing-universe',
1576+ type=int,
1577+ help="Percentage of universe packages to phase",
1578+ default=scriptutils.DEFAULTS.phasing_universe,
1579+ )
1580+ parser.add_argument(
1581+ '--phasing-main',
1582+ type=int,
1583+ help="Percentage of main packages to phase",
1584+ default=scriptutils.DEFAULTS.phasing_main,
1585+ )
1586+ parser.add_argument(
1587+ '--dry-run',
1588+ action='store_true',
1589+ help="Simulate operation but do not actually do anything",
1590+ default=False,
1591+ )
1592+ parser.add_argument(
1593+ '--unset',
1594+ action='store_true',
1595+ help="Unset default repository (for testing)",
1596+ default=False,
1597+ )
1598+
1599+ args = parser.parse_args()
1600+
1601+ main(
1602+ num_workers=args.num_workers,
1603+ whitelist_path=args.whitelist,
1604+ blacklist_path=args.blacklist,
1605+ phasing_main=args.phasing_main,
1606+ phasing_universe=args.phasing_universe,
1607+ dry_run=args.dry_run,
1608+ unset=args.unset,
1609+ use_whitelist=args.use_whitelist,
1610+ )
1611+
1612+if __name__ == '__main__':
1613+ cli_main()
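
The phasing rule described in the should_import_srcpkg docstring in scripts/scriptutils.py can be tried on its own. What follows is a minimal sketch, not the code in this branch: the helper names in_phasing_set and phased_subset are hypothetical, and the comparison uses integer arithmetic rather than the float expression in the diff, which is an assumption made here to avoid rounding near the threshold.

```python
import hashlib

MD5_SPACE = 2 ** 128  # number of distinct md5 digests


def in_phasing_set(pkgname, percentage):
    """Return True if pkgname hashes into the first `percentage` percent
    of the md5 space (hypothetical helper, for illustration only)."""
    digest = int(hashlib.md5(pkgname.encode('utf-8')).hexdigest(), 16)
    # Same test as digest <= (percentage / 100) * 2**128, written with
    # integers only.
    return digest * 100 <= percentage * MD5_SPACE


def phased_subset(pkgnames, percentage):
    """Return the set of names in pkgnames selected at this percentage."""
    return {name for name in pkgnames if in_phasing_set(name, percentage)}
```

Because the digest depends only on the package name, a package selected at one percentage stays selected at every higher percentage, so raising the phasing value only ever adds packages.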

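The read_timestamps and write_timestamps pair in scripts/import-source-packages.py is documented as symmetrical, so a round trip is a useful check. The sketch below is a standalone approximation under stated assumptions: it takes the log path as a parameter instead of using the module-level LOG_PATH, loops over the dictionary keys instead of writing the 'debian' and 'ubuntu' entries by hand, and places the temporary file next to the log as a sibling with a '.new' suffix.

```python
import collections
import os
import tempfile

Timestamp = collections.namedtuple('Timestamp', ['time', 'link'])


def write_timestamps(log_path, timestamps):
    """Write one 'time' and one 'link' line per distribution, then
    replace log_path in a single rename."""
    new_log = log_path + '.new'  # sibling file, same directory
    with open(new_log, 'w') as log_file:
        for dist_name in sorted(timestamps):
            stamp = timestamps[dist_name]
            log_file.write('%s time %f\n' % (dist_name, stamp.time))
            log_file.write('%s link %s\n' % (dist_name, stamp.link))
    os.replace(new_log, log_path)


def read_timestamps(log_path):
    """Parse the file written by write_timestamps back into a dict of
    Timestamp namedtuples keyed by distribution name."""
    fields = {}
    with open(log_path, 'r') as log_file:
        for line in log_file:
            dist_name, log_type, value = line.split()
            fields.setdefault(dist_name, {})[log_type] = (
                value if log_type == 'link' else float(value)
            )
    return {dist: Timestamp(**parts) for dist, parts in fields.items()}
```

Two limits of this format are worth keeping in mind: a link value containing whitespace would not survive line.split(), and a link of None is written as the text "None" and read back as a string.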