Merge ~nacc/git-ubuntu:lp1730734-cache-importer-progress into git-ubuntu:master

Proposed by Nish Aravamudan
Status: Work in progress
Proposed branch: ~nacc/git-ubuntu:lp1730734-cache-importer-progress
Merge into: git-ubuntu:master
Diff against target: 475 lines (+136/-8)
6 files modified
gitubuntu/importer.py (+50/-3)
gitubuntu/source_information.py (+16/-2)
man/man1/git-ubuntu-import.1 (+18/-2)
scripts/import-source-packages.py (+18/-0)
scripts/scriptutils.py (+16/-1)
scripts/source-package-walker.py (+18/-0)
Reviewer                Review Type              Date Requested   Status
Server Team CI bot      continuous-integration                    Approve
git-ubuntu developers                                             Pending
Review via email: mp+333499@code.launchpad.net

Description of the change

Make Jenkins happy.

Revision history for this message
Server Team CI bot (server-team-bot) wrote :

PASSED: Continuous integration, rev:16af9e00af3e93c4fcba5a49cd660d01d6d10f61
https://jenkins.ubuntu.com/server/job/git-ubuntu-ci/205/
Executed test runs:
    SUCCESS: Checkout
    SUCCESS: Style Check
    SUCCESS: Unit Tests
    SUCCESS: Integration Tests
    IN_PROGRESS: Declarative: Post Actions

Click here to trigger a rebuild:
https://jenkins.ubuntu.com/server/job/git-ubuntu-ci/205/rebuild

review: Approve (continuous-integration)
Revision history for this message
Nish Aravamudan (nacc) wrote :

As a local test, I built the git-ubuntu snap with this change in place and ran the following:

# 1) prime the cache
$ git ubuntu import --no-push ipsec-tools --no-clean --db-cache /tmp/git-ubuntu-db-cache
11/10/2017 11:58:19 - INFO:Ubuntu Server Team importer v0.6.2
11/10/2017 11:58:19 - INFO:Using git repository at /tmp/tmpso3vrstw
11/10/2017 11:59:25 - INFO:Importing patches-unapplied 1:0.8.2+20140711-10 to ubuntu/bionic
11/10/2017 11:59:39 - WARNING:ubuntu/bionic is identical to 1:0.8.2+20140711-10
11/10/2017 12:00:15 - INFO:Not pushing to remote as specified
11/10/2017 12:00:15 - INFO:Leaving /tmp/tmpso3vrstw as directed

# 2) setup test repositories
$ cp -R /tmp/tmpso3vrstw /tmp/cache-test
$ cp -R /tmp/tmpso3vrstw /tmp/no-cache-test

# 3) Run again using the same cache, no new uploads processed (cache worked)
$ git ubuntu import --no-push ipsec-tools --no-clean --db-cache /tmp/git-ubuntu-db-cache --no-fetch -d /tmp/cache-test
11/10/2017 12:01:45 - INFO:Ubuntu Server Team importer v0.6.2
11/10/2017 12:02:49 - INFO:Not pushing to remote as specified
11/10/2017 12:02:49 - INFO:Leaving /tmp/cache-test as directed

# 4) Run again without the cache (also the default operation); we end up re-processing already-seen records
$ git ubuntu import --no-push ipsec-tools --no-clean --no-fetch -d /tmp/no-cache-test
11/10/2017 12:03:33 - INFO:Ubuntu Server Team importer v0.6.2
11/10/2017 12:04:17 - INFO:Importing patches-unapplied 1:0.8.2+20140711-10 to ubuntu/bionic
11/10/2017 12:04:22 - WARNING:ubuntu/bionic is identical to 1:0.8.2+20140711-10
11/10/2017 12:04:37 - INFO:Importing patches-applied 1:0.8.2+20140711-10 to ubuntu/bionic
11/10/2017 12:04:40 - WARNING:ubuntu/bionic is identical to 1:0.8.2+20140711-10
11/10/2017 12:05:07 - INFO:Not pushing to remote as specified
11/10/2017 12:05:07 - INFO:Leaving /tmp/no-cache-test as directed
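
For reference, the cache directory from step 1 can be inspected with a few lines of Python. This is a sketch only, assuming the layout proposed in this branch (one dbm file per distribution inside the cache directory, keyed by source package name, storing the string form of the last processed SPPHR):

#!/usr/bin/env python3
# Peek at the importer progress cache written by `git ubuntu import --db-cache`.
import dbm
import os
import sys

cache_dir = sys.argv[1] if len(sys.argv) > 1 else '/tmp/git-ubuntu-db-cache'

for distname in ('debian', 'ubuntu'):
    path = os.path.join(cache_dir, distname)
    try:
        with dbm.open(path, 'r') as cache:
            for key in cache.keys():
                # Keys and values come back as bytes; decode for display.
                print('%s: %s -> %s' % (distname, key.decode(), cache[key].decode()))
    except dbm.error:
        print('%s: no cache file at %s' % (distname, path))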

Revision history for this message
Nish Aravamudan (nacc) wrote :

I am fairly confident in the actual changes here. What I think needs deciding is how the linear import script and the publisher-watching script should interact with a shared DBM cache. I think it is relatively safe within one process, but from what I can tell, there is no guarantee the two won't race, and the DBM implementation itself is not concurrency-safe.

Possibly we should switch to sqlite3 for that, even though it means more overhead to set up and query.

Finally, I'm wondering how to describe the race in question. Basically, two processes could read different values (possibly unset in one case) for the last SPPHR seen in the cache. They would then iterate over different sets of Launchpad records, but the imported contents would end up the same.

The question is the branches, I think. Both processes start from the same Git repository (effectively, assuming neither has pushed yet) and move the branch pointers. The one processing more (older) records moves the branches more, but the end result *should* be the same as, or an ancestor of, what the other produces. That ancestor case is the concern, because we are now timing-sensitive about what gets pushed (since we force-push branches).

I wonder if we should treat it more like RCU: we can always read or write, but the data we read may be stale, so it needs to be verified before and after our loop. We could read the value, iterate our records and import them, and then, before writing our last_spphr back into the database, check that the current value is still the one we read. That would shrink the race to the window between re-reading the value and writing the new one (and the linear importer is a one-time operation, in theory).
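
To make that concrete, here is a minimal sketch of the compare-before-write idea, as a hypothetical helper on top of the per-distribution dbm layout from this branch (not code from the branch itself):

import dbm
import os

def update_last_spphr(db_cache_dir, distname, pkgname, expected, new_spphr):
    # Hypothetical helper: write str(new_spphr) as the last-seen SPPHR for
    # pkgname, but only if the cache still holds the value we read before
    # iterating (expected, a string or None).  Returns True if the cache was
    # updated, False if another process advanced it first and the caller
    # should re-read and reconcile.  Note this only shrinks the race window;
    # plain dbm has no atomic compare-and-swap.
    path = os.path.join(db_cache_dir, distname)
    with dbm.open(path, 'c') as cache:
        current = cache.get(pkgname.encode())
        if current is not None and current.decode() != (expected or ''):
            return False
        cache[pkgname] = str(new_spphr)
        return True

With sqlite3, the same check-and-update could be expressed as a single conditional UPDATE, which is probably the strongest argument for switching.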

Revision history for this message
Server Team CI bot (server-team-bot) wrote :

PASSED: Continuous integration, rev:e087180218bfe4276957a93c9375d875c6652a3d
https://jenkins.ubuntu.com/server/job/git-ubuntu-ci/207/
Executed test runs:
    SUCCESS: Checkout
    SUCCESS: Style Check
    SUCCESS: Unit Tests
    SUCCESS: Integration Tests
    IN_PROGRESS: Declarative: Post Actions

Click here to trigger a rebuild:
https://jenkins.ubuntu.com/server/job/git-ubuntu-ci/207/rebuild

review: Approve (continuous-integration)
Revision history for this message
Nish Aravamudan (nacc) wrote :

Another thought, does the cache need to be distinct for debian/ubuntu X unapplied/applied? Given that applied failures may be fixable, they can be at a different version than the unapplied?

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

I'm not sure I can help you with the "implications" you asked about on IRC without thinking about it a lot more. But I have a few questions that might at least help to clarify things.

To do so, I'd like to hear more details on the actual use case that gets us into this race:
1) Do we even want two importers running concurrently on the same system?
   If not, let's just lock the DB for single use - problem solved.

2) If we do want multiple processes, we could make use of the cache exclusive
   per package (I beg your pardon if I'm missing why this would fail), so only
   one worker can be in a given [package] at any time. Processes working on
   different packages should never conflict, right? Another worker with a
   different [package] can continue and we know there won't be a race.
   Essentially this is locking on [package] (yes, this might need a better db
   to support rollback/timeouts on aborted processes if you want to lock in
   the DB itself); see the sketch below.
   The tail of the processing would be updating the spphr and then unlocking
   the given package.
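
A rough sketch of what such a per-package lock could look like, using a lock file next to the dbm files rather than locking inside the database itself (hypothetical, not part of the proposed branch):

import fcntl
import os
from contextlib import contextmanager

@contextmanager
def package_lock(db_cache_dir, pkgname):
    # Hold an exclusive advisory lock for one source package: workers on
    # different packages never contend, two workers on the same package
    # serialize here.  flock() is advisory and POSIX-only, so this is an
    # illustration; locking inside a "better db" (e.g. sqlite3) with
    # timeouts would be the more robust variant.
    os.makedirs(db_cache_dir, exist_ok=True)
    lock_path = os.path.join(db_cache_dir, '%s.lock' % pkgname)
    with open(lock_path, 'w') as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)

# Usage (hypothetical):
# with package_lock('/tmp/git-ubuntu-db-cache', 'ipsec-tools'):
#     ...read the cache, import publishes, write back the last SPPHR...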

In general I think you should add tests along with these commits, e.g. checking that "import with no cache == import into an empty cache == import with a populated cache".
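
One possible shape for such a test, with hypothetical pytest fixtures standing in for the import run and the branch-tip comparison (neither exists in the tree today):

def test_cache_equivalence(tmp_path, run_import, branch_tips):
    # run_import would wrap `git ubuntu import --no-push` against a fixture
    # package; branch_tips would return the resulting branch heads.
    no_cache = run_import('ipsec-tools', db_cache=None)
    cold_cache = run_import('ipsec-tools', db_cache=tmp_path / 'db-cache')
    warm_cache = run_import('ipsec-tools', db_cache=tmp_path / 'db-cache')

    # No cache, importing into an empty cache, and importing again with a
    # populated cache must all leave the branch pointers in the same place.
    assert branch_tips(no_cache) == branch_tips(cold_cache) == branch_tips(warm_cache)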

Revision history for this message
Nish Aravamudan (nacc) wrote :

On 08.01.2018 [20:17:52 -0000], ChristianEhrhardt wrote:
> I'm not sure I can help you with the "implications" you asked about on
> IRC without thinking about it a lot more. But I have a few questions
> that might at least help to clarify things.

Understood and thank you!

> To do so, I'd like to hear more details on the actual use case that gets
> us into this race:

The race is internal, in some ways, to the importer, but you do bring up
an interesting second case.

`git ubuntu import` is managed by a looping script,
scripts/import-source-packages.py, which operates in a keep-up fashion.
Roughly:

Obtain a LIST of publishing events in reverse chronological order
Walk LIST backwards until we find a PUBLISH that we have already
imported (or earlier)
   Walk LIST forward after PUBLISH and IMPORT each unique source
   package name

IMPORT itself is, algorithmically:
   Get the current REPOSITORY
   Get Launchpad publishing DATA for a given source package in reverse
   chronological order
   Walk DATA backwards until we find a PUBLISH that we have already
   imported (or earlier)
      Walk DATA forwards and import each publish record

The issue we are trying to resolve here is the "until we find a PUBLISH
that we have already imported" step in IMPORT.
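
In rough Python terms, that reverse-then-forward walk looks something like
this (a sketch only; in the branch the equivalent logic lives in
launchpad_versions_published_after):

def publishes_to_import(publishing_records, last_imported):
    # publishing_records is newest-first; last_imported identifies the most
    # recent record already imported, or is None on a fresh import.
    pending = []
    for record in publishing_records:          # newest -> oldest
        if last_imported is not None and record == last_imported:
            break                              # everything older is already done
        pending.append(record)
    return list(reversed(pending))             # import oldest -> newest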

In the prior importer code, we had a unique Git commit for every
publication event, because the targeted <series>-<pocket> was part of
the Git commit (via the commit message and parenting). So as we did
IMPORT's reverse walk, we could look at the branch tips and compare
their commit data (where we stored the publishing timestamp) against
the publishing records to find the last imported publishing record.

We dropped that ability by dropping publishing parents altogether. We
now import each published version only once, tying versions together
only by their changelogs. We then forcibly move the branch tips.

So now if we use an unmodified importer, we will end up going back in
IMPORT to the last publish that was the first time we saw a given
version (typically in Debian, therefore) and walking forward from there.
This will be unnecessary iteration of publishing data, at least, and
unnecessary moving of the branch pointers.

My branch modifies the catch-up to move the storage of the publication
event to an external cache, currently DBM.

> 1) Do we even want two importers running concurrently on the same system?
> If not, let's just lock the DB for single use - problem solved.

Right, so we have a mode we know will exist at some point, where there
is a linear script walking all 'to-import' source packages to get them
loaded into Launchpad, and the keep-up script, which ensures that
publishes happening while the linear walker is running still get
integrated. So we do expect two imports of the same source package to
be going on at times (and it shouldn't just break when that happens).
What we don't want is for a slower 'old' import (where the list of
events to import is older) to somehow trump a faster 'new' import and
end up setting the branch pointers to the wrong place(s).

The problem is that if we just outright lock the DB, then the linear
script can't work if it's also using the DB. And the point of it is that
they...


Revision history for this message
Nish Aravamudan (nacc) wrote :

> Another thought, does the cache need to be distinct for debian/ubuntu X
> unapplied/applied? Given that applied failures may be fixable, they can be at
> a different version than the unapplied?

I do think we actually need 4 caches, since the unapplied / applied heads can easily be at a different state. I'll work on this now.

I'll also add some changes to the importer to clean up that code.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

[...]

> So now if we use an unmodified importer, we will end up going back in
> IMPORT to the last publish that was the first time we saw a given
> version (typically in Debian, therefore) and walking forward from there.

To confirm - you go back to "the first time we saw a given version" because, without parenting history, that is the only safe place to walk forward from?

> This will be unnecessary iteration of publishing data

because some might already have been handled.

>, at least, and
> unnecessary moving of the branch pointers.

For the walking

Thanks for the details; it's much clearer to me now what your case actually is.

>
> My branch modifies the catch-up to move the storage of the publication
> event to an external cache, currently DBM.

[...]

> The more I think about it, I think I am spinning myself on this problem
> for no reason :)
>
> I think we can let the linear walker not use the dbm [we can keep the
> code, just not pass anything] (or sqlite) and it will just mean if there
> is a race between when the linear walker hits a source package and when
> the keep-up script does, there will be some no-op looping (unless new
> SPRs are found).

Sounds good, and then you can actually lock the DB to be fully safe against unintentional concurrent use by the catch-up (if one manually starts it twice or so).

[...]

> I do think we actually need 4 caches, since the unapplied / applied heads can easily be at a
> different state. I'll work on this now.

> I'll also add some changes to the importer to clean up that code.

Since you will only ever access it from code and never manually, I think you could even implement it as debian/ubuntu x unapplied/applied x per-package. If you end up locking the DBs in any way, you will then have better granularity on those locks.
And for your code it doesn't matter how finely the file structure is split.
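
If the cache is split that way, the on-disk layout reduces to a simple path computation, something like this sketch (the current branch uses just two files, "debian" and "ubuntu"):

import os

def cache_path(db_cache_dir, distname, patches_applied, pkgname=None):
    # One dbm file per debian/ubuntu x unapplied/applied, optionally further
    # split per package so any future locks stay fine-grained.
    parts = [distname, 'applied' if patches_applied else 'unapplied']
    if pkgname is not None:
        parts.append(pkgname)
    return os.path.join(db_cache_dir, '-'.join(parts))

# cache_path('/tmp/git-ubuntu-db-cache', 'ubuntu', True, 'ipsec-tools')
#   -> '/tmp/git-ubuntu-db-cache/ubuntu-applied-ipsec-tools'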

Unmerged commits

a9960c8... by Nish Aravamudan

git ubuntu import: use a dbm cache to store importer progress

With the recent changes to the importer algorithm, we can no longer
identify a Launchpad publishing event by the commit information in the
repository -- a given SourcePackageRelease (source package name and
version) is only imported once (presuming all future publishes of the
same name and version match exactly). We rely on this in
launchpad_versions_published_after, which iterates the Launchpad data
backwards until we either match the commit data for a branch head, or
see Launchpad data from before one of the branch heads.

Change the importer code to take a --db-cache argument naming a
directory containing a DBM cache for debian and ubuntu (this is needed
because DBM databases are single-level, string-indexed string storage).
Look up the source package name in the relevant cache before we look up
Launchpad data, to obtain the last SPPHR used. Store the latest
processed SPPHR after iterating the Launchpad data.

Also update the scripts to support passing a persistent/consistent value
to the import for the cache.

LP: #1730734

Preview Diff

1diff --git a/gitubuntu/importer.py b/gitubuntu/importer.py
2index 2063f83..6abad7d 100644
3--- a/gitubuntu/importer.py
4+++ b/gitubuntu/importer.py
5@@ -26,6 +26,7 @@
6
7 import argparse
8 import atexit
9+import dbm
10 import functools
11 import getpass
12 import logging
13@@ -150,6 +151,7 @@ def main(
14 parentfile=top_level_defaults.parentfile,
15 retries=top_level_defaults.retries,
16 retry_backoffs=top_level_defaults.retry_backoffs,
17+ db_cache_dir=None,
18 ):
19 """Main entry point to the importer
20
21@@ -178,6 +180,9 @@ def main(
22 @parentfile: string path to file specifying parent overrides
23 @retries: integer number of download retries to attempt
24 @retry_backoffs: list of backoff durations to use between retries
25+ @db_cache_dir: string fileystem directory containing 'ubuntu' and
26+ 'debian' dbm database files, which store the progress of prior
27+ importer runs
28
29 If directory is None, a temporary directory is created and used.
30
31@@ -187,6 +192,11 @@ def main(
32 If dl_cache is None, CACHE_PATH in the local repository will be
33 used.
34
35+ If db_cache_dir is None, no database lookups are performed which makes
36+ it possible that the import will attempt to re-import already
37+ imported publishes. This should be fine, although less efficient
38+ than possible.
39+
40 Returns 0 on successful import (which includes non-fatal failures);
41 1 otherwise.
42 """
43@@ -314,6 +324,9 @@ def main(
44 else:
45 workdir = dl_cache
46
47+ if db_cache_dir:
48+ os.makedirs(db_cache_dir, exist_ok=True)
49+
50 os.makedirs(workdir, exist_ok=True)
51
52 # now sets a global _PARENT_OVERRIDES
53@@ -326,6 +339,7 @@ def main(
54 patches_applied=False,
55 debian_head_versions=debian_head_versions,
56 ubuntu_head_versions=ubuntu_head_versions,
57+ db_cache_dir=db_cache_dir,
58 debian_sinfo=debian_sinfo,
59 ubuntu_sinfo=ubuntu_sinfo,
60 active_series_only=active_series_only,
61@@ -347,6 +361,7 @@ def main(
62 patches_applied=True,
63 debian_head_versions=applied_debian_head_versions,
64 ubuntu_head_versions=applied_ubuntu_head_versions,
65+ db_cache_dir=db_cache_dir,
66 debian_sinfo=debian_sinfo,
67 ubuntu_sinfo=ubuntu_sinfo,
68 active_series_only=active_series_only,
69@@ -1401,6 +1416,7 @@ def import_publishes(
70 patches_applied,
71 debian_head_versions,
72 ubuntu_head_versions,
73+ db_cache_dir,
74 debian_sinfo,
75 ubuntu_sinfo,
76 active_series_only,
77@@ -1411,6 +1427,8 @@ def import_publishes(
78 history_found = False
79 only_debian = False
80 srcpkg_information = None
81+ last_debian_spphr = None
82+ last_ubuntu_spphr = None
83 if patches_applied:
84 _namespace = namespace
85 namespace = '%s/applied' % namespace
86@@ -1426,15 +1444,28 @@ def import_publishes(
87 import_unapplied_spi,
88 skip_orig=skip_orig,
89 )
90- for distname, versions, dist_sinfo in (
91- ("debian", debian_head_versions, debian_sinfo),
92- ("ubuntu", ubuntu_head_versions, ubuntu_sinfo)):
93+
94+ for distname, versions, dist_sinfo, last_spphr in (
95+ ("debian", debian_head_versions, debian_sinfo, last_debian_spphr),
96+ ("ubuntu", ubuntu_head_versions, ubuntu_sinfo, last_ubuntu_spphr),
97+ ):
98 if active_series_only and distname == "debian":
99 continue
100+
101+ last_spphr = None
102+ if db_cache_dir:
103+ with dbm.open(os.path.join(db_cache_dir, distname), 'c') as cache:
104+ try:
105+ last_spphr = decode_binary(cache[pkgname])
106+ except KeyError:
107+ pass
108+
109 try:
110+ last_spi = None
111 for srcpkg_information in dist_sinfo.launchpad_versions_published_after(
112 versions,
113 namespace,
114+ last_spphr,
115 workdir=workdir,
116 active_series_only=active_series_only
117 ):
118@@ -1445,6 +1476,11 @@ def import_publishes(
119 namespace=_namespace,
120 ubuntu_sinfo=ubuntu_sinfo,
121 )
122+ last_spi = srcpkg_information
123+ if last_spi:
124+ if db_cache_dir:
125+ with dbm.open(os.path.join(db_cache_dir, distname), 'w') as db_cache:
126+ db_cache[pkgname] = str(last_spi.spphr)
127 except NoPublicationHistoryException:
128 logging.warning("No publication history found for %s in %s.",
129 pkgname, distname
130@@ -1520,6 +1556,11 @@ def parse_args(subparsers=None, base_subparsers=None):
131 action='store_true',
132 help=argparse.SUPPRESS,
133 )
134+ parser.add_argument(
135+ '--db-cache',
136+ type=str,
137+ help=argparse.SUPPRESS,
138+ )
139 if not subparsers:
140 return parser.parse_args()
141 return 'import - %s' % kwargs['description']
142@@ -1548,6 +1589,11 @@ def cli_main(args):
143 except AttributeError:
144 dl_cache = None
145
146+ try:
147+ db_cache = args.db_cache
148+ except AttributeError:
149+ db_cache = None
150+
151 return main(
152 pkgname=args.package,
153 owner=args.lp_owner,
154@@ -1567,4 +1613,5 @@ def cli_main(args):
155 parentfile=args.parentfile,
156 retries=args.retries,
157 retry_backoffs=args.retry_backoffs,
158+ db_cache_dir=args.db_cache,
159 )
160diff --git a/gitubuntu/source_information.py b/gitubuntu/source_information.py
161index b97bc8a..050d60a 100644
162--- a/gitubuntu/source_information.py
163+++ b/gitubuntu/source_information.py
164@@ -443,7 +443,14 @@ class GitUbuntuSourceInformation(object):
165 for srcpkg in spph:
166 yield self.get_corrected_spi(srcpkg, workdir)
167
168- def launchpad_versions_published_after(self, head_versions, namespace, workdir=None, active_series_only=False):
169+ def launchpad_versions_published_after(
170+ self,
171+ head_versions,
172+ namespace,
173+ last_spphr=None,
174+ workdir=None,
175+ active_series_only=False,
176+ ):
177 args = {
178 'exact_match':True,
179 'source_name':self.pkgname,
180@@ -471,7 +478,14 @@ class GitUbuntuSourceInformation(object):
181 if len(spph) == 0:
182 raise NoPublicationHistoryException("Is %s published in %s?" %
183 (self.pkgname, self.dist_name))
184- if len(head_versions) > 0:
185+ if last_spphr:
186+ _spph = list()
187+ for spphr in spph:
188+ if str(spphr) == last_spphr:
189+ break
190+ _spph.append(spphr)
191+ spph = _spph
192+ elif head_versions:
193 _spph = list()
194 for spphr in spph:
195 spi = GitUbuntuSourcePackageInformation(spphr, self.dist_name,
196diff --git a/man/man1/git-ubuntu-import.1 b/man/man1/git-ubuntu-import.1
197index dd7fd9e..74bfe83 100644
198--- a/man/man1/git-ubuntu-import.1
199+++ b/man/man1/git-ubuntu-import.1
200@@ -1,4 +1,4 @@
201-.TH "GIT-UBUNTU-IMPORT" "1" "2017-07-19" "Git-Ubuntu 0.2" "Git-Ubuntu Manual"
202+.TH "GIT-UBUNTU-IMPORT" "1" "2017-11-08" "Git-Ubuntu 0.6.2" "Git-Ubuntu Manual"
203
204 .SH "NAME"
205 git-ubuntu import \- Import Launchpad publishing history to Git
206@@ -9,7 +9,8 @@ git-ubuntu import \- Import Launchpad publishing history to Git
207 <user>] [\-\-dl-cache <dl_cache>] [\-\-no-fetch] [\-\-no-push]
208 [\-\-no-clean] [\-d | \-\-directory <directory>]
209 [\-\-active-series-only] [\-\-skip-applied] [\-\-skip-orig]
210-[\-\-reimport] [\-\-allow-applied-failures] <package>
211+[\-\-reimport] [\-\-allow-applied-failures] [\-\-db-cache <db_cache>]
212+<package>
213 .FI
214 .SP
215 .SH "DESCRIPTION"
216@@ -197,6 +198,21 @@ After investigation, this flag can be used to indicate the importer is
217 allowed to ignore such a failure\&.
218 .RE
219 .PP
220+\-\-db-cache <db_cache>
221+.RS 4
222+The path to a directory containing Python dbm database disk files for
223+importer metadata\&.
224+If \fB<db_cache>\fR does not exist, it will be created\&.
225+Two files in \fB<db_cache>\fR are used, "ubuntu" and "debian', which are
226+created if not already present\&.
227+The cache files provide information to the importer about prior imports
228+of \fB<package>\fR and which Launchpad publishing record was last
229+imported\&.
230+This is necessary because the imported Git repository does not
231+necessarily maintain any metadata about Launchpad publishing
232+information\&.
233+.RE
234+.PP
235 <package>
236 .RS 4
237 The name of the source package to import\&.
238diff --git a/scripts/import-source-packages.py b/scripts/import-source-packages.py
239index 698ae35..7dd534e 100755
240--- a/scripts/import-source-packages.py
241+++ b/scripts/import-source-packages.py
242@@ -50,6 +50,7 @@ def import_new_published_sources(
243 phasing_main,
244 phasing_universe,
245 dry_run,
246+ db_cache,
247 ):
248 """import_new_published_source - Import all new publishes since a prior execution
249
250@@ -60,6 +61,10 @@ def import_new_published_sources(
251 phasing_main - a integer percentage of all packages in main to import
252 phasing_universe - a integer percentage of all packages in universe to import
253 dry_run - a boolean to indicate a dry-run operation
254+ db_cache - string filesystem path containing DBM files for storing
255+ importer progress, will be created if it does not exist
256+
257+ If db_cache is None, no cache is used.
258
259 Returns:
260 A tuple of two lists, the first containing the names of all
261@@ -150,6 +155,7 @@ def import_new_published_sources(
262 ret = scriptutils.pool_map_import_srcpkg(
263 num_workers=num_workers,
264 dry_run=dry_run,
265+ db_cache=db_cache,
266 pkgnames=filtered_pkgnames,
267 )
268
269@@ -229,6 +235,7 @@ def main(
270 phasing_main=scriptutils.DEFAULTS.phasing_main,
271 phasing_universe=scriptutils.DEFAULTS.phasing_universe,
272 dry_run=scriptutils.DEFAULTS.dry_run,
273+ db_cache=scriptutils.DEFAULTS.db_cache,
274 ):
275 """main - Main entry point to the script
276
277@@ -243,6 +250,8 @@ def main(
278 phasing_universe - a integer percentage of all packages in universe
279 to import
280 dry_run - a boolean to indicate a dry-run operation
281+ db_cache - string filesystem path containing DBM files for storing
282+ importer progress, will be created if it does not exist
283 """
284 scriptutils.setup_git_config()
285
286@@ -280,6 +289,7 @@ def main(
287 phasing_main,
288 phasing_universe,
289 dry_run,
290+ db_cache,
291 )
292 print("Imported %d source packages" % len(imported_srcpkgs))
293 mail_imported_srcpkgs |= set(imported_srcpkgs)
294@@ -361,6 +371,13 @@ def cli_main():
295 help="Simulate operation but do not actually do anything",
296 default=scriptutils.DEFAULTS.dry_run,
297 )
298+ parser.add_argument(
299+ '--db-cache',
300+ type=str,
301+ help="Directory containing Python DBM files which store importer "
302+ "progress",
303+ default=scriptutils.DEFAULTS.db_cache,
304+ )
305
306 args = parser.parse_args()
307
308@@ -371,6 +388,7 @@ def cli_main():
309 phasing_main=args.phasing_main,
310 phasing_universe=args.phasing_universe,
311 dry_run=args.dry_run,
312+ db_cache=args.db_cache,
313 )
314
315 if __name__ == '__main__':
316diff --git a/scripts/scriptutils.py b/scripts/scriptutils.py
317index 912efd1..e24c394 100644
318--- a/scripts/scriptutils.py
319+++ b/scripts/scriptutils.py
320@@ -5,6 +5,7 @@ import multiprocessing
321 import os
322 import sys
323 import subprocess
324+import tempfile
325 import time
326
327 import pkg_resources
328@@ -35,6 +36,7 @@ Defaults = namedtuple(
329 'phasing_main',
330 'dry_run',
331 'use_whitelist',
332+ 'db_cache',
333 ],
334 )
335
336@@ -52,6 +54,7 @@ DEFAULTS = Defaults(
337 phasing_main=1,
338 dry_run=False,
339 use_whitelist=True,
340+ db_cache=os.path.join(tempfile.gettempdir(), 'git-ubuntu-db-cache'),
341 )
342
343
344@@ -101,12 +104,16 @@ def should_import_srcpkg(
345 return False
346
347
348-def import_srcpkg(pkgname, dry_run):
349+def import_srcpkg(pkgname, dry_run, db_cache):
350 """import_srcpkg - Invoke git ubuntu import on @pkgname
351
352 Arguments:
353 pkgname - string name of a source package
354 dry_run - a boolean to indicate a dry-run operation
355+ db_cache - string filesystem path containing DBM files for storing
356+ importer progress, will be created if it does not exist
357+
358+ If db_cache is None, no cache is used.
359
360 Returns:
361 A tuple of a string and a boolean, where the boolean is the success
362@@ -125,6 +132,8 @@ def import_srcpkg(pkgname, dry_run):
363 'usd-importer-bot',
364 pkgname,
365 ]
366+ if db_cache:
367+ cmd.extend(['--db-cache', db_cache,])
368 try:
369 print(' '.join(cmd))
370 if not dry_run:
371@@ -163,6 +172,7 @@ def setup_git_config(
372 def pool_map_import_srcpkg(
373 num_workers,
374 dry_run,
375+ db_cache,
376 pkgnames,
377 ):
378 """pool_map_import_srcpkg - Use a multiprocessing.Pool to parallel
379@@ -171,13 +181,18 @@ def pool_map_import_srcpkg(
380 Arguments:
381 num_workers - integer number of worker processes to use
382 dry_run - a boolean to indicate a dry-run operation
383+ db_cache - string filesystem path containing DBM files for storing
384+ importer progress, will be created if it does not exist
385 pkgnames - a list of string names of source packages
386+
387+ If db_cache is None, no cache is used.
388 """
389 with multiprocessing.Pool(processes=num_workers) as pool:
390 results = pool.map(
391 functools.partial(
392 import_srcpkg,
393 dry_run=dry_run,
394+ db_cache=db_cache,
395 ),
396 pkgnames,
397 )
398diff --git a/scripts/source-package-walker.py b/scripts/source-package-walker.py
399index 58c3d5a..f7b9629 100755
400--- a/scripts/source-package-walker.py
401+++ b/scripts/source-package-walker.py
402@@ -44,6 +44,7 @@ def import_all_published_sources(
403 phasing_main,
404 phasing_universe,
405 dry_run,
406+ db_cache,
407 ):
408 """import_all_published_sources - Import all publishes satisfying a
409 {white,black}list and phasing
410@@ -55,6 +56,10 @@ def import_all_published_sources(
411 phasing_main - a integer percentage of all packages in main to import
412 phasing_universe - a integer percentage of all packages in universe to import
413 dry_run - a boolean to indicate a dry-run operation
414+ db_cache - string filesystem path containing DBM files for storing
415+ importer progress, will be created if it does not exist
416+
417+ If db_cache is None, no cache is used.
418
419 Returns:
420 A tuple of two lists, the first containing the names of all
421@@ -126,6 +131,7 @@ def import_all_published_sources(
422 return scriptutils.pool_map_import_srcpkg(
423 num_workers=num_workers,
424 dry_run=dry_run,
425+ db_cache=db_cache,
426 pkgnames=pkgnames,
427 )
428
429@@ -137,6 +143,7 @@ def main(
430 phasing_universe=scriptutils.DEFAULTS.phasing_universe,
431 dry_run=scriptutils.DEFAULTS.dry_run,
432 use_whitelist=scriptutils.DEFAULTS.use_whitelist,
433+ db_cache=scriptutils.DEFAULTS.db_cache,
434 ):
435 """main - Main entry point to the script
436
437@@ -153,6 +160,8 @@ def main(
438 dry_run - a boolean to indicate a dry-run operation
439 use_whitelist - a boolean to control whether the whitelist data is
440 used
441+ db_cache - string filesystem path containing DBM files for storing
442+ importer progress, will be created if it does not exist
443
444 use_whitelist exists because during the rampup of imports, we want
445 to import the whitelist packages and the phased packages. But after
446@@ -191,6 +200,7 @@ def main(
447 phasing_main,
448 phasing_universe,
449 dry_run,
450+ db_cache,
451 )
452 print(
453 "Imported %d source packages:\n%s" % (
454@@ -254,6 +264,13 @@ def cli_main():
455 help="Simulate operation but do not actually do anything",
456 default=scriptutils.DEFAULTS.dry_run,
457 )
458+ parser.add_argument(
459+ '--db-cache',
460+ type=str,
461+ help="Directory containing Python DBM files which store importer "
462+ "progress",
463+ default=scriptutils.DEFAULTS.db_cache,
464+ )
465
466 args = parser.parse_args()
467
468@@ -265,6 +282,7 @@ def cli_main():
469 phasing_universe=args.phasing_universe,
470 dry_run=args.dry_run,
471 use_whitelist=args.use_whitelist,
472+ db_cache=args.db_cache,
473 )
474
475 if __name__ == '__main__':
