Merge ~nacc/git-ubuntu:lp1730734-cache-importer-progress into git-ubuntu:master

Proposed by Nish Aravamudan
Status: Work in progress
Proposed branch: ~nacc/git-ubuntu:lp1730734-cache-importer-progress
Merge into: git-ubuntu:master
Diff against target: 475 lines (+136/-8)
6 files modified
gitubuntu/importer.py (+50/-3)
gitubuntu/source_information.py (+16/-2)
man/man1/git-ubuntu-import.1 (+18/-2)
scripts/import-source-packages.py (+18/-0)
scripts/scriptutils.py (+16/-1)
scripts/source-package-walker.py (+18/-0)
Reviewer                Review Type              Date Requested   Status
Server Team CI bot      continuous-integration                    Approve
git-ubuntu developers                                             Pending
Review via email: mp+333499@code.launchpad.net

Description of the change

Make Jenkins happy.

Revision history for this message
Server Team CI bot (server-team-bot) wrote :

PASSED: Continuous integration, rev:16af9e00af3e93c4fcba5a49cd660d01d6d10f61
https://jenkins.ubuntu.com/server/job/git-ubuntu-ci/205/
Executed test runs:
    SUCCESS: Checkout
    SUCCESS: Style Check
    SUCCESS: Unit Tests
    SUCCESS: Integration Tests
    IN_PROGRESS: Declarative: Post Actions

Click here to trigger a rebuild:
https://jenkins.ubuntu.com/server/job/git-ubuntu-ci/205/rebuild

review: Approve (continuous-integration)
Revision history for this message
Nish Aravamudan (nacc) wrote :

As a local test, I built the git-ubuntu snap with this change in place and ran the following:

# 1) prime the cache
$ git ubuntu import --no-push ipsec-tools --no-clean --db-cache /tmp/git-ubuntu-db-cache
11/10/2017 11:58:19 - INFO:Ubuntu Server Team importer v0.6.2
11/10/2017 11:58:19 - INFO:Using git repository at /tmp/tmpso3vrstw
11/10/2017 11:59:25 - INFO:Importing patches-unapplied 1:0.8.2+20140711-10 to ubuntu/bionic
11/10/2017 11:59:39 - WARNING:ubuntu/bionic is identical to 1:0.8.2+20140711-10
11/10/2017 12:00:15 - INFO:Not pushing to remote as specified
11/10/2017 12:00:15 - INFO:Leaving /tmp/tmpso3vrstw as directed

# 2) setup test repositories
$ cp -R /tmp/tmpso3vrstw /tmp/cache-test
$ cp -R /tmp/tmpso3vrstw /tmp/no-cache-test

# 3) Run again using the same cache, no new uploads processed (cache worked)
$ git ubuntu import --no-push ipsec-tools --no-clean --db-cache /tmp/git-ubuntu-db-cache --no-fetch -d /tmp/cache-test
11/10/2017 12:01:45 - INFO:Ubuntu Server Team importer v0.6.2
11/10/2017 12:02:49 - INFO:Not pushing to remote as specified
11/10/2017 12:02:49 - INFO:Leaving /tmp/cache-test as directed

# 4) Run again without the cache (also the default operation); we end up re-processing already-seen records
$ git ubuntu import --no-push ipsec-tools --no-clean --no-fetch -d /tmp/no-cache-test
11/10/2017 12:03:33 - INFO:Ubuntu Server Team importer v0.6.2
11/10/2017 12:04:17 - INFO:Importing patches-unapplied 1:0.8.2+20140711-10 to ubuntu/bionic
11/10/2017 12:04:22 - WARNING:ubuntu/bionic is identical to 1:0.8.2+20140711-10
11/10/2017 12:04:37 - INFO:Importing patches-applied 1:0.8.2+20140711-10 to ubuntu/bionic
11/10/2017 12:04:40 - WARNING:ubuntu/bionic is identical to 1:0.8.2+20140711-10
11/10/2017 12:05:07 - INFO:Not pushing to remote as specified
11/10/2017 12:05:07 - INFO:Leaving /tmp/no-cache-test as directed
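
For reference, the cache directory from step 1 can be inspected with a few lines of Python. This is a sketch only, assuming the layout proposed in this branch (one dbm file per distribution inside the cache directory, keyed by source package name, storing the string form of the last processed SPPHR):

#!/usr/bin/env python3
# Peek at the importer progress cache written by `git ubuntu import --db-cache`.
import dbm
import os
import sys

cache_dir = sys.argv[1] if len(sys.argv) > 1 else '/tmp/git-ubuntu-db-cache'

for distname in ('debian', 'ubuntu'):
    path = os.path.join(cache_dir, distname)
    try:
        with dbm.open(path, 'r') as cache:
            for key in cache.keys():
                # Keys and values come back as bytes; decode for display.
                print('%s: %s -> %s' % (distname, key.decode(), cache[key].decode()))
    except dbm.error:
        print('%s: no cache file at %s' % (distname, path))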

Revision history for this message
Nish Aravamudan (nacc) wrote :

I am fairly confident in the actual changes here. What I think needs deciding is how the linear import script and the publisher-watching script should interact with a shared DBM cache. I think it is relatively safe within one process, but from what I can tell, there is no guarantee the two won't race, and the DBM implementation itself is not concurrency-safe.

Possibly we should switch to sqlite3 for that, even though it means more overhead to set up and query.

Finally, I'm wondering how to describe the race in question. Basically, two processes could read different values (possibly unset in one case) for the last SPPHR seen in the cache. They would then iterate over different sets of Launchpad records, but the imported contents would end up the same.

The question is the branches, I think. Both processes start from the same Git repository (effectively, assuming neither has pushed yet) and move the branch pointers. The one processing more (older) records moves the branches more, but the end result *should* be the same as, or an ancestor of, what the other produces. That ancestor case is the concern, because we are now timing-sensitive about what gets pushed (since we force-push branches).

I wonder if we should treat it more like RCU: we can always read or write, but the data we read may be stale, so it needs to be verified before and after our loop. We could read the value, iterate our records and import them, and then, before writing our last_spphr back into the database, check that the current value is still the one we read. That would shrink the race to the window between re-reading the value and writing the new one (and the linear importer is a one-time operation, in theory).
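
To make that concrete, here is a minimal sketch of the compare-before-write idea, as a hypothetical helper on top of the per-distribution dbm layout from this branch (not code from the branch itself):

import dbm
import os

def update_last_spphr(db_cache_dir, distname, pkgname, expected, new_spphr):
    # Hypothetical helper: write str(new_spphr) as the last-seen SPPHR for
    # pkgname, but only if the cache still holds the value we read before
    # iterating (expected, a string or None).  Returns True if the cache was
    # updated, False if another process advanced it first and the caller
    # should re-read and reconcile.  Note this only shrinks the race window;
    # plain dbm has no atomic compare-and-swap.
    path = os.path.join(db_cache_dir, distname)
    with dbm.open(path, 'c') as cache:
        current = cache.get(pkgname.encode())
        if current is not None and current.decode() != (expected or ''):
            return False
        cache[pkgname] = str(new_spphr)
        return True

With sqlite3, the same check-and-update could be expressed as a single conditional UPDATE, which is probably the strongest argument for switching.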

Revision history for this message
Server Team CI bot (server-team-bot) wrote :

PASSED: Continuous integration, rev:e087180218bfe4276957a93c9375d875c6652a3d
https://jenkins.ubuntu.com/server/job/git-ubuntu-ci/207/
Executed test runs:
    SUCCESS: Checkout
    SUCCESS: Style Check
    SUCCESS: Unit Tests
    SUCCESS: Integration Tests
    IN_PROGRESS: Declarative: Post Actions

Click here to trigger a rebuild:
https://jenkins.ubuntu.com/server/job/git-ubuntu-ci/207/rebuild

review: Approve (continuous-integration)
Revision history for this message
Nish Aravamudan (nacc) wrote :

Another thought, does the cache need to be distinct for debian/ubuntu X unapplied/applied? Given that applied failures may be fixable, they can be at a different version than the unapplied?

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

I'm not sure I can help you with the "implications" you asked about on IRC without thinking about it a lot more. But I have a few questions that might at least help to clarify things.

To do so, I'd like to hear more details on the actual use case that gets us into this race:
1) Do we even want two importers running concurrently on the same system?
   If not, let's just lock the DB for single use - problem solved.

2) If we do want multiple processes, we could make use of the cache exclusive
   per package (I beg your pardon if I'm missing why this would fail), so only
   one worker can be in a given [package] at any time. Processes working on
   different packages should never conflict, right? Another worker with a
   different [package] can continue and we know there won't be a race.
   Essentially this is locking on [package] (yes, this might need a better db
   to support rollback/timeouts on aborted processes if you want to lock in
   the DB itself); see the sketch below.
   The tail of the processing would be updating the spphr and then unlocking
   the given package.
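
A rough sketch of what such a per-package lock could look like, using a lock file next to the dbm files rather than locking inside the database itself (hypothetical, not part of the proposed branch):

import fcntl
import os
from contextlib import contextmanager

@contextmanager
def package_lock(db_cache_dir, pkgname):
    # Hold an exclusive advisory lock for one source package: workers on
    # different packages never contend, two workers on the same package
    # serialize here.  flock() is advisory and POSIX-only, so this is an
    # illustration; locking inside a "better db" (e.g. sqlite3) with
    # timeouts would be the more robust variant.
    os.makedirs(db_cache_dir, exist_ok=True)
    lock_path = os.path.join(db_cache_dir, '%s.lock' % pkgname)
    with open(lock_path, 'w') as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)

# Usage (hypothetical):
# with package_lock('/tmp/git-ubuntu-db-cache', 'ipsec-tools'):
#     ...read the cache, import publishes, write back the last SPPHR...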

In general I think you should add tests along with these commits, e.g. checking that "import with no cache == import into an empty cache == import with a populated cache".
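
One possible shape for such a test, with hypothetical pytest fixtures standing in for the import run and the branch-tip comparison (neither exists in the tree today):

def test_cache_equivalence(tmp_path, run_import, branch_tips):
    # run_import would wrap `git ubuntu import --no-push` against a fixture
    # package; branch_tips would return the resulting branch heads.
    no_cache = run_import('ipsec-tools', db_cache=None)
    cold_cache = run_import('ipsec-tools', db_cache=tmp_path / 'db-cache')
    warm_cache = run_import('ipsec-tools', db_cache=tmp_path / 'db-cache')

    # No cache, importing into an empty cache, and importing again with a
    # populated cache must all leave the branch pointers in the same place.
    assert branch_tips(no_cache) == branch_tips(cold_cache) == branch_tips(warm_cache)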

Revision history for this message
Nish Aravamudan (nacc) wrote :

On 08.01.2018 [20:17:52 -0000], ChristianEhrhardt wrote:
> I'm not sure I can help you with the "implications" you asked about on
> IRC without thinking about it a lot more. But I have a few questions
> that might at least help to clarify things.

Understood and thank you!

> To do so, I'd like to hear more details on the actual use case that gets
> us into this race:

The race is internal, in some ways, to the importer, but you do bring up
an interesting second case.

`git ubuntu import` is managed by a looping script,
scripts/import-source-packages.py, which operates in a keep-up fashion.
Roughly:

Obtain a LIST of publishing events in reverse chronological order
Walk LIST backwards until we find a PUBLISH that we have already
imported (or earlier)
   Walk LIST forward after PUBLISH and IMPORT each unique source
   package name

IMPORT itself is, algorithmically:
   Get the current REPOSITORY
   Get Launchpad publishing DATA for a given source package in reverse
   chronological order
   Walk DATA backwards until we find a PUBLISH that we have already
   imported (or earlier)
      Walk DATA forwards and import each publish record

The issue we are trying to resolve here is the "until we find a PUBLISH
that we have already imported" step in IMPORT.
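
In rough Python terms, that reverse-then-forward walk looks something like
this (a sketch only; in the branch the equivalent logic lives in
launchpad_versions_published_after):

def publishes_to_import(publishing_records, last_imported):
    # publishing_records is newest-first; last_imported identifies the most
    # recent record already imported, or is None on a fresh import.
    pending = []
    for record in publishing_records:          # newest -> oldest
        if last_imported is not None and record == last_imported:
            break                              # everything older is already done
        pending.append(record)
    return list(reversed(pending))             # import oldest -> newest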

In the prior importer code, we had a unique Git commit for every
publication event, because the targeted <series>-<pocket> was part of
the Git commit (via the commit message and parenting). So as we did
IMPORT's reverse walk, we could look at the branch tips and compare
their commit data (where we stored the publishing timestamp) against
the publishing records to find the last imported publishing record.

We dropped that ability by dropping publishing parents altogether. We
now import each published version only once, tying versions together
only by their changelogs. We then forcibly move the branch tips.

So now if we use an unmodified importer, we will end up going back in
IMPORT to the last publish that was the first time we saw a given
version (typically in Debian, therefore) and walking forward from there.
This will be unnecessary iteration of publishing data, at least, and
unnecessary moving of the branch pointers.

My branch modifies the catch-up to move the storage of the publication
event to an external cache, currently DBM.

> 1) Do we even want two importers running concurrently on the same system?
> If not, let's just lock the DB for single use - problem solved.

Right, so we have a mode we know will exist at some point, where there
is a linear script walking all 'to-import' source packages to get them
loaded into Launchpad, and the keep-up script, which ensures that
publishes happening while the linear walker is running still get
integrated. So we do expect two imports of the same source package to
be going on at times (and it shouldn't just break when that happens).
What we don't want is for a slower 'old' import (where the list of
events to import is older) to somehow trump a faster 'new' import and
end up setting the branch pointers to the wrong place(s).

The problem is that if we just outright lock the DB, then the linear
script can't work if it's also using the DB. And the point of it is that
they...


Revision history for this message
Nish Aravamudan (nacc) wrote :

> Another thought, does the cache need to be distinct for debian/ubuntu X
> unapplied/applied? Given that applied failures may be fixable, they can be at
> a different version than the unapplied?

I do think we actually need 4 caches, since the unapplied / applied heads can easily be at a different state. I'll work on this now.

I'll also add some changes to the importer to clean up that code.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

[...]

> So now if we use an unmodified importer, we will end up going back in
> IMPORT to the last publish that was the first time we saw a given
> version (typically in Debian, therefore) and walking forward from there.

To confirm - you go back to "the first time we saw a given version" because, without parenting history, that is the only safe place to walk forward from?

> This will be unnecessary iteration of publishing data

because some might already have been handled.

>, at least, and
> unnecessary moving of the branch pointers.

For the walking

Thanks for the details; it's much clearer to me now what your case actually is.

>
> My branch modifies the catch-up to move the storage of the publication
> event to an external cache, currently DBM.

[...]

> The more I think about it, I think I am spinning myself on this problem
> for no reason :)
>
> I think we can let the linear walker not use the dbm [we can keep the
> code, just not pass anything] (or sqlite) and it will just mean if there
> is a race between when the linear walker hits a source package and when
> the keep-up script does, there will be some no-op looping (unless new
> SPRs are found).

Sounds good, and then you can actually lock the DB to be fully safe against unintentional concurrent use by the catch-up (if one manually starts it twice or so).

[...]

> I do think we actually need 4 caches, since the unapplied / applied heads can easily be at a
> different state. I'll work on this now.

> I'll also add some changes to the importer to clean up that code.

Since you will only ever access it from code and never manually, I think you could even implement it as debian/ubuntu x unapplied/applied x per-package. If you end up locking the DBs in any way, you will then have better granularity on those locks.
And for your code it doesn't matter how finely the file structure is split.
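
If the cache is split that way, the on-disk layout reduces to a simple path computation, something like this sketch (the current branch uses just two files, "debian" and "ubuntu"):

import os

def cache_path(db_cache_dir, distname, patches_applied, pkgname=None):
    # One dbm file per debian/ubuntu x unapplied/applied, optionally further
    # split per package so any future locks stay fine-grained.
    parts = [distname, 'applied' if patches_applied else 'unapplied']
    if pkgname is not None:
        parts.append(pkgname)
    return os.path.join(db_cache_dir, '-'.join(parts))

# cache_path('/tmp/git-ubuntu-db-cache', 'ubuntu', True, 'ipsec-tools')
#   -> '/tmp/git-ubuntu-db-cache/ubuntu-applied-ipsec-tools'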

Unmerged commits

a9960c8... by Nish Aravamudan

git ubuntu import: use a dbm cache to store importer progress

With the recent changes to the importer algorithm, we can no longer
identify a Launchpad publishing event by the commit information in the
repository -- a given SourcePackageRelease (source package name and
version) is only imported once (presuming all future publishes of the
same name and version match exactly). We rely on this in
launchpad_versions_published_after, which iterates the Launchpad data
backwards until we either match the commit data for a branch head, or
see Launchpad data from before one of the branch heads.

Change the importer code to take a --db-cache argument naming a
directory containing a DBM cache for debian and ubuntu (this is needed
because DBM databases are single-level, string-indexed string storage).
Look up the source package name in the relevant cache before we look up
Launchpad data, to obtain the last SPPHR used. Store the latest
processed SPPHR after iterating the Launchpad data.

Also update the scripts to support passing a persistent/consistent value
to the import for the cache.

LP: #1730734

Preview Diff

1diff --git a/gitubuntu/importer.py b/gitubuntu/importer.py
2index 2063f83..6abad7d 100644
3--- a/gitubuntu/importer.py
4+++ b/gitubuntu/importer.py
5@@ -26,6 +26,7 @@
6
7 import argparse
8 import atexit
9+import dbm
10 import functools
11 import getpass
12 import logging
13@@ -150,6 +151,7 @@ def main(
14 parentfile=top_level_defaults.parentfile,
15 retries=top_level_defaults.retries,
16 retry_backoffs=top_level_defaults.retry_backoffs,
17+ db_cache_dir=None,
18 ):
19 """Main entry point to the importer
20
21@@ -178,6 +180,9 @@ def main(
22 @parentfile: string path to file specifying parent overrides
23 @retries: integer number of download retries to attempt
24 @retry_backoffs: list of backoff durations to use between retries
25+ @db_cache_dir: string fileystem directory containing 'ubuntu' and
26+ 'debian' dbm database files, which store the progress of prior
27+ importer runs
28
29 If directory is None, a temporary directory is created and used.
30
31@@ -187,6 +192,11 @@ def main(
32 If dl_cache is None, CACHE_PATH in the local repository will be
33 used.
34
35+ If db_cache_dir is None, no database lookups are performed which makes
36+ it possible that the import will attempt to re-import already
37+ imported publishes. This should be fine, although less efficient
38+ than possible.
39+
40 Returns 0 on successful import (which includes non-fatal failures);
41 1 otherwise.
42 """
43@@ -314,6 +324,9 @@ def main(
44 else:
45 workdir = dl_cache
46
47+ if db_cache_dir:
48+ os.makedirs(db_cache_dir, exist_ok=True)
49+
50 os.makedirs(workdir, exist_ok=True)
51
52 # now sets a global _PARENT_OVERRIDES
53@@ -326,6 +339,7 @@ def main(
54 patches_applied=False,
55 debian_head_versions=debian_head_versions,
56 ubuntu_head_versions=ubuntu_head_versions,
57+ db_cache_dir=db_cache_dir,
58 debian_sinfo=debian_sinfo,
59 ubuntu_sinfo=ubuntu_sinfo,
60 active_series_only=active_series_only,
61@@ -347,6 +361,7 @@ def main(
62 patches_applied=True,
63 debian_head_versions=applied_debian_head_versions,
64 ubuntu_head_versions=applied_ubuntu_head_versions,
65+ db_cache_dir=db_cache_dir,
66 debian_sinfo=debian_sinfo,
67 ubuntu_sinfo=ubuntu_sinfo,
68 active_series_only=active_series_only,
69@@ -1401,6 +1416,7 @@ def import_publishes(
70 patches_applied,
71 debian_head_versions,
72 ubuntu_head_versions,
73+ db_cache_dir,
74 debian_sinfo,
75 ubuntu_sinfo,
76 active_series_only,
77@@ -1411,6 +1427,8 @@ def import_publishes(
78 history_found = False
79 only_debian = False
80 srcpkg_information = None
81+ last_debian_spphr = None
82+ last_ubuntu_spphr = None
83 if patches_applied:
84 _namespace = namespace
85 namespace = '%s/applied' % namespace
86@@ -1426,15 +1444,28 @@ def import_publishes(
87 import_unapplied_spi,
88 skip_orig=skip_orig,
89 )
90- for distname, versions, dist_sinfo in (
91- ("debian", debian_head_versions, debian_sinfo),
92- ("ubuntu", ubuntu_head_versions, ubuntu_sinfo)):
93+
94+ for distname, versions, dist_sinfo, last_spphr in (
95+ ("debian", debian_head_versions, debian_sinfo, last_debian_spphr),
96+ ("ubuntu", ubuntu_head_versions, ubuntu_sinfo, last_ubuntu_spphr),
97+ ):
98 if active_series_only and distname == "debian":
99 continue
100+
101+ last_spphr = None
102+ if db_cache_dir:
103+ with dbm.open(os.path.join(db_cache_dir, distname), 'c') as cache:
104+ try:
105+ last_spphr = decode_binary(cache[pkgname])
106+ except KeyError:
107+ pass
108+
109 try:
110+ last_spi = None
111 for srcpkg_information in dist_sinfo.launchpad_versions_published_after(
112 versions,
113 namespace,
114+ last_spphr,
115 workdir=workdir,
116 active_series_only=active_series_only
117 ):
118@@ -1445,6 +1476,11 @@ def import_publishes(
119 namespace=_namespace,
120 ubuntu_sinfo=ubuntu_sinfo,
121 )
122+ last_spi = srcpkg_information
123+ if last_spi:
124+ if db_cache_dir:
125+ with dbm.open(os.path.join(db_cache_dir, distname), 'w') as db_cache:
126+ db_cache[pkgname] = str(last_spi.spphr)
127 except NoPublicationHistoryException:
128 logging.warning("No publication history found for %s in %s.",
129 pkgname, distname
130@@ -1520,6 +1556,11 @@ def parse_args(subparsers=None, base_subparsers=None):
131 action='store_true',
132 help=argparse.SUPPRESS,
133 )
134+ parser.add_argument(
135+ '--db-cache',
136+ type=str,
137+ help=argparse.SUPPRESS,
138+ )
139 if not subparsers:
140 return parser.parse_args()
141 return 'import - %s' % kwargs['description']
142@@ -1548,6 +1589,11 @@ def cli_main(args):
143 except AttributeError:
144 dl_cache = None
145
146+ try:
147+ db_cache = args.db_cache
148+ except AttributeError:
149+ db_cache = None
150+
151 return main(
152 pkgname=args.package,
153 owner=args.lp_owner,
154@@ -1567,4 +1613,5 @@ def cli_main(args):
155 parentfile=args.parentfile,
156 retries=args.retries,
157 retry_backoffs=args.retry_backoffs,
158+ db_cache_dir=args.db_cache,
159 )
160diff --git a/gitubuntu/source_information.py b/gitubuntu/source_information.py
161index b97bc8a..050d60a 100644
162--- a/gitubuntu/source_information.py
163+++ b/gitubuntu/source_information.py
164@@ -443,7 +443,14 @@ class GitUbuntuSourceInformation(object):
165 for srcpkg in spph:
166 yield self.get_corrected_spi(srcpkg, workdir)
167
168- def launchpad_versions_published_after(self, head_versions, namespace, workdir=None, active_series_only=False):
169+ def launchpad_versions_published_after(
170+ self,
171+ head_versions,
172+ namespace,
173+ last_spphr=None,
174+ workdir=None,
175+ active_series_only=False,
176+ ):
177 args = {
178 'exact_match':True,
179 'source_name':self.pkgname,
180@@ -471,7 +478,14 @@ class GitUbuntuSourceInformation(object):
181 if len(spph) == 0:
182 raise NoPublicationHistoryException("Is %s published in %s?" %
183 (self.pkgname, self.dist_name))
184- if len(head_versions) > 0:
185+ if last_spphr:
186+ _spph = list()
187+ for spphr in spph:
188+ if str(spphr) == last_spphr:
189+ break
190+ _spph.append(spphr)
191+ spph = _spph
192+ elif head_versions:
193 _spph = list()
194 for spphr in spph:
195 spi = GitUbuntuSourcePackageInformation(spphr, self.dist_name,
196diff --git a/man/man1/git-ubuntu-import.1 b/man/man1/git-ubuntu-import.1
197index dd7fd9e..74bfe83 100644
198--- a/man/man1/git-ubuntu-import.1
199+++ b/man/man1/git-ubuntu-import.1
200@@ -1,4 +1,4 @@
201-.TH "GIT-UBUNTU-IMPORT" "1" "2017-07-19" "Git-Ubuntu 0.2" "Git-Ubuntu Manual"
202+.TH "GIT-UBUNTU-IMPORT" "1" "2017-11-08" "Git-Ubuntu 0.6.2" "Git-Ubuntu Manual"
203
204 .SH "NAME"
205 git-ubuntu import \- Import Launchpad publishing history to Git
206@@ -9,7 +9,8 @@ git-ubuntu import \- Import Launchpad publishing history to Git
207 <user>] [\-\-dl-cache <dl_cache>] [\-\-no-fetch] [\-\-no-push]
208 [\-\-no-clean] [\-d | \-\-directory <directory>]
209 [\-\-active-series-only] [\-\-skip-applied] [\-\-skip-orig]
210-[\-\-reimport] [\-\-allow-applied-failures] <package>
211+[\-\-reimport] [\-\-allow-applied-failures] [\-\-db-cache <db_cache>]
212+<package>
213 .FI
214 .SP
215 .SH "DESCRIPTION"
216@@ -197,6 +198,21 @@ After investigation, this flag can be used to indicate the importer is
217 allowed to ignore such a failure\&.
218 .RE
219 .PP
220+\-\-db-cache <db_cache>
221+.RS 4
222+The path to a directory containing Python dbm database disk files for
223+importer metadata\&.
224+If \fB<db_cache>\fR does not exist, it will be created\&.
225+Two files in \fB<db_cache>\fR are used, "ubuntu" and "debian', which are
226+created if not already present\&.
227+The cache files provide information to the importer about prior imports
228+of \fB<package>\fR and which Launchpad publishing record was last
229+imported\&.
230+This is necessary because the imported Git repository does not
231+necessarily maintain any metadata about Launchpad publishing
232+information\&.
233+.RE
234+.PP
235 <package>
236 .RS 4
237 The name of the source package to import\&.
238diff --git a/scripts/import-source-packages.py b/scripts/import-source-packages.py
239index 698ae35..7dd534e 100755
240--- a/scripts/import-source-packages.py
241+++ b/scripts/import-source-packages.py
242@@ -50,6 +50,7 @@ def import_new_published_sources(
243 phasing_main,
244 phasing_universe,
245 dry_run,
246+ db_cache,
247 ):
248 """import_new_published_source - Import all new publishes since a prior execution
249
250@@ -60,6 +61,10 @@ def import_new_published_sources(
251 phasing_main - a integer percentage of all packages in main to import
252 phasing_universe - a integer percentage of all packages in universe to import
253 dry_run - a boolean to indicate a dry-run operation
254+ db_cache - string filesystem path containing DBM files for storing
255+ importer progress, will be created if it does not exist
256+
257+ If db_cache is None, no cache is used.
258
259 Returns:
260 A tuple of two lists, the first containing the names of all
261@@ -150,6 +155,7 @@ def import_new_published_sources(
262 ret = scriptutils.pool_map_import_srcpkg(
263 num_workers=num_workers,
264 dry_run=dry_run,
265+ db_cache=db_cache,
266 pkgnames=filtered_pkgnames,
267 )
268
269@@ -229,6 +235,7 @@ def main(
270 phasing_main=scriptutils.DEFAULTS.phasing_main,
271 phasing_universe=scriptutils.DEFAULTS.phasing_universe,
272 dry_run=scriptutils.DEFAULTS.dry_run,
273+ db_cache=scriptutils.DEFAULTS.db_cache,
274 ):
275 """main - Main entry point to the script
276
277@@ -243,6 +250,8 @@ def main(
278 phasing_universe - a integer percentage of all packages in universe
279 to import
280 dry_run - a boolean to indicate a dry-run operation
281+ db_cache - string filesystem path containing DBM files for storing
282+ importer progress, will be created if it does not exist
283 """
284 scriptutils.setup_git_config()
285
286@@ -280,6 +289,7 @@ def main(
287 phasing_main,
288 phasing_universe,
289 dry_run,
290+ db_cache,
291 )
292 print("Imported %d source packages" % len(imported_srcpkgs))
293 mail_imported_srcpkgs |= set(imported_srcpkgs)
294@@ -361,6 +371,13 @@ def cli_main():
295 help="Simulate operation but do not actually do anything",
296 default=scriptutils.DEFAULTS.dry_run,
297 )
298+ parser.add_argument(
299+ '--db-cache',
300+ type=str,
301+ help="Directory containing Python DBM files which store importer "
302+ "progress",
303+ default=scriptutils.DEFAULTS.db_cache,
304+ )
305
306 args = parser.parse_args()
307
308@@ -371,6 +388,7 @@ def cli_main():
309 phasing_main=args.phasing_main,
310 phasing_universe=args.phasing_universe,
311 dry_run=args.dry_run,
312+ db_cache=args.db_cache,
313 )
314
315 if __name__ == '__main__':
316diff --git a/scripts/scriptutils.py b/scripts/scriptutils.py
317index 912efd1..e24c394 100644
318--- a/scripts/scriptutils.py
319+++ b/scripts/scriptutils.py
320@@ -5,6 +5,7 @@ import multiprocessing
321 import os
322 import sys
323 import subprocess
324+import tempfile
325 import time
326
327 import pkg_resources
328@@ -35,6 +36,7 @@ Defaults = namedtuple(
329 'phasing_main',
330 'dry_run',
331 'use_whitelist',
332+ 'db_cache',
333 ],
334 )
335
336@@ -52,6 +54,7 @@ DEFAULTS = Defaults(
337 phasing_main=1,
338 dry_run=False,
339 use_whitelist=True,
340+ db_cache=os.path.join(tempfile.gettempdir(), 'git-ubuntu-db-cache'),
341 )
342
343
344@@ -101,12 +104,16 @@ def should_import_srcpkg(
345 return False
346
347
348-def import_srcpkg(pkgname, dry_run):
349+def import_srcpkg(pkgname, dry_run, db_cache):
350 """import_srcpkg - Invoke git ubuntu import on @pkgname
351
352 Arguments:
353 pkgname - string name of a source package
354 dry_run - a boolean to indicate a dry-run operation
355+ db_cache - string filesystem path containing DBM files for storing
356+ importer progress, will be created if it does not exist
357+
358+ If db_cache is None, no cache is used.
359
360 Returns:
361 A tuple of a string and a boolean, where the boolean is the success
362@@ -125,6 +132,8 @@ def import_srcpkg(pkgname, dry_run):
363 'usd-importer-bot',
364 pkgname,
365 ]
366+ if db_cache:
367+ cmd.extend(['--db-cache', db_cache,])
368 try:
369 print(' '.join(cmd))
370 if not dry_run:
371@@ -163,6 +172,7 @@ def setup_git_config(
372 def pool_map_import_srcpkg(
373 num_workers,
374 dry_run,
375+ db_cache,
376 pkgnames,
377 ):
378 """pool_map_import_srcpkg - Use a multiprocessing.Pool to parallel
379@@ -171,13 +181,18 @@ def pool_map_import_srcpkg(
380 Arguments:
381 num_workers - integer number of worker processes to use
382 dry_run - a boolean to indicate a dry-run operation
383+ db_cache - string filesystem path containing DBM files for storing
384+ importer progress, will be created if it does not exist
385 pkgnames - a list of string names of source packages
386+
387+ If db_cache is None, no cache is used.
388 """
389 with multiprocessing.Pool(processes=num_workers) as pool:
390 results = pool.map(
391 functools.partial(
392 import_srcpkg,
393 dry_run=dry_run,
394+ db_cache=db_cache,
395 ),
396 pkgnames,
397 )
398diff --git a/scripts/source-package-walker.py b/scripts/source-package-walker.py
399index 58c3d5a..f7b9629 100755
400--- a/scripts/source-package-walker.py
401+++ b/scripts/source-package-walker.py
402@@ -44,6 +44,7 @@ def import_all_published_sources(
403 phasing_main,
404 phasing_universe,
405 dry_run,
406+ db_cache,
407 ):
408 """import_all_published_sources - Import all publishes satisfying a
409 {white,black}list and phasing
410@@ -55,6 +56,10 @@ def import_all_published_sources(
411 phasing_main - a integer percentage of all packages in main to import
412 phasing_universe - a integer percentage of all packages in universe to import
413 dry_run - a boolean to indicate a dry-run operation
414+ db_cache - string filesystem path containing DBM files for storing
415+ importer progress, will be created if it does not exist
416+
417+ If db_cache is None, no cache is used.
418
419 Returns:
420 A tuple of two lists, the first containing the names of all
421@@ -126,6 +131,7 @@ def import_all_published_sources(
422 return scriptutils.pool_map_import_srcpkg(
423 num_workers=num_workers,
424 dry_run=dry_run,
425+ db_cache=db_cache,
426 pkgnames=pkgnames,
427 )
428
429@@ -137,6 +143,7 @@ def main(
430 phasing_universe=scriptutils.DEFAULTS.phasing_universe,
431 dry_run=scriptutils.DEFAULTS.dry_run,
432 use_whitelist=scriptutils.DEFAULTS.use_whitelist,
433+ db_cache=scriptutils.DEFAULTS.db_cache,
434 ):
435 """main - Main entry point to the script
436
437@@ -153,6 +160,8 @@ def main(
438 dry_run - a boolean to indicate a dry-run operation
439 use_whitelist - a boolean to control whether the whitelist data is
440 used
441+ db_cache - string filesystem path containing DBM files for storing
442+ importer progress, will be created if it does not exist
443
444 use_whitelist exists because during the rampup of imports, we want
445 to import the whitelist packages and the phased packages. But after
446@@ -191,6 +200,7 @@ def main(
447 phasing_main,
448 phasing_universe,
449 dry_run,
450+ db_cache,
451 )
452 print(
453 "Imported %d source packages:\n%s" % (
454@@ -254,6 +264,13 @@ def cli_main():
455 help="Simulate operation but do not actually do anything",
456 default=scriptutils.DEFAULTS.dry_run,
457 )
458+ parser.add_argument(
459+ '--db-cache',
460+ type=str,
461+ help="Directory containing Python DBM files which store importer "
462+ "progress",
463+ default=scriptutils.DEFAULTS.db_cache,
464+ )
465
466 args = parser.parse_args()
467
468@@ -265,6 +282,7 @@ def cli_main():
469 phasing_universe=args.phasing_universe,
470 dry_run=args.dry_run,
471 use_whitelist=args.use_whitelist,
472+ db_cache=args.db_cache,
473 )
474
475 if __name__ == '__main__':
