Merge ~nacc/git-ubuntu:lp1730734-cache-importer-progress into git-ubuntu:master

Proposed by Nish Aravamudan
Status: Work in progress
Proposed branch: ~nacc/git-ubuntu:lp1730734-cache-importer-progress
Merge into: git-ubuntu:master
Diff against target: 475 lines (+136/-8)
6 files modified
gitubuntu/importer.py (+50/-3)
gitubuntu/source_information.py (+16/-2)
man/man1/git-ubuntu-import.1 (+18/-2)
scripts/import-source-packages.py (+18/-0)
scripts/scriptutils.py (+16/-1)
scripts/source-package-walker.py (+18/-0)
Reviewer Review Type Date Requested Status
Server Team CI bot continuous-integration Approve
git-ubuntu developers Pending
Review via email: mp+333499@code.launchpad.net

Description of the change

Make Jenkins happy.

Revision history for this message
Server Team CI bot (server-team-bot) wrote :

PASSED: Continuous integration, rev:16af9e00af3e93c4fcba5a49cd660d01d6d10f61
https://jenkins.ubuntu.com/server/job/git-ubuntu-ci/205/
Executed test runs:
    SUCCESS: Checkout
    SUCCESS: Style Check
    SUCCESS: Unit Tests
    SUCCESS: Integration Tests
    IN_PROGRESS: Declarative: Post Actions

Click here to trigger a rebuild:
https://jenkins.ubuntu.com/server/job/git-ubuntu-ci/205/rebuild

review: Approve (continuous-integration)
Revision history for this message
Nish Aravamudan (nacc) wrote :

As a local test, I built the git-ubuntu snap with this change in place and ran the following:

# 1) prime the cache
$ git ubuntu import --no-push ipsec-tools --no-clean --db-cache /tmp/git-ubuntu-db-cache
11/10/2017 11:58:19 - INFO:Ubuntu Server Team importer v0.6.2
11/10/2017 11:58:19 - INFO:Using git repository at /tmp/tmpso3vrstw
11/10/2017 11:59:25 - INFO:Importing patches-unapplied 1:0.8.2+20140711-10 to ubuntu/bionic
11/10/2017 11:59:39 - WARNING:ubuntu/bionic is identical to 1:0.8.2+20140711-10
11/10/2017 12:00:15 - INFO:Not pushing to remote as specified
11/10/2017 12:00:15 - INFO:Leaving /tmp/tmpso3vrstw as directed

# 2) setup test repositories
$ cp -R /tmp/tmpso3vrstw /tmp/cache-test
$ cp -R /tmp/tmpso3vrstw /tmp/no-cache-test

# 3) Run again using the same cache; no new uploads are processed (the cache worked)
$ git ubuntu import --no-push ipsec-tools --no-clean --db-cache /tmp/git-ubuntu-db-cache --no-fetch -d /tmp/cache-test
11/10/2017 12:01:45 - INFO:Ubuntu Server Team importer v0.6.2
11/10/2017 12:02:49 - INFO:Not pushing to remote as specified
11/10/2017 12:02:49 - INFO:Leaving /tmp/cache-test as directed

# 4) Run again without the cache (also the default operation); we end up re-processing already-seen records
$ git ubuntu import --no-push ipsec-tools --no-clean --no-fetch -d /tmp/no-cache-test
11/10/2017 12:03:33 - INFO:Ubuntu Server Team importer v0.6.2
11/10/2017 12:04:17 - INFO:Importing patches-unapplied 1:0.8.2+20140711-10 to ubuntu/bionic
11/10/2017 12:04:22 - WARNING:ubuntu/bionic is identical to 1:0.8.2+20140711-10
11/10/2017 12:04:37 - INFO:Importing patches-applied 1:0.8.2+20140711-10 to ubuntu/bionic
11/10/2017 12:04:40 - WARNING:ubuntu/bionic is identical to 1:0.8.2+20140711-10
11/10/2017 12:05:07 - INFO:Not pushing to remote as specified
11/10/2017 12:05:07 - INFO:Leaving /tmp/no-cache-test as directed

Revision history for this message
Nish Aravamudan (nacc) wrote :

I am fairly confident in the actual changes here. What I think needs deciding is how the linear import script and the publisher-watching script should interact with a shared DBM cache. I think it is relatively safe within one process, but as far as I can tell there is no guarantee the two won't race, and the DBM implementation itself is not concurrency-safe.

Possibly we should then switch to sqlite3 for this, even though it is more overhead to configure and query.
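
In the meantime, an advisory lock around each DBM access would at least
serialize the two scripts. A minimal sketch, not part of this branch
(the sidecar lock-file name is made up):

import dbm
import fcntl
import os

def read_last_spphr(db_cache_dir, distname, pkgname):
    # Serialize access to the dbm file with a sidecar advisory lock.
    lockpath = os.path.join(db_cache_dir, distname + '.lock')
    with open(lockpath, 'w') as lockfile:
        fcntl.flock(lockfile, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            with dbm.open(os.path.join(db_cache_dir, distname), 'c') as cache:
                try:
                    return cache[pkgname].decode()  # dbm stores bytes
                except KeyError:
                    return None  # no progress recorded for this package yet
        finally:
            fcntl.flock(lockfile, fcntl.LOCK_UN)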

Finally, I'm wondering how to describe the race in question. Basically, two processes would read different (possibly unset, in one case) values for the last SPPHR seen from the cache. They would then iterate different sets of Launchpad records, but the result would be the same as far as the actual contents of the imports go. The question is the branches, I think. Both processes start with the same git repository (effectively, under the assumption neither has pushed) and move the branches. The one processing more (older) records would move the branches more, but its end result *should* be either the same as the other's, or an ancestor of it. The ancestor case is the concern, as we are now timing-sensitive about what gets pushed (since we force-push branches).

I wonder if we should treat this more like RCU, in that we can always read or write, but the read data may be stale, so it needs to be verified before and after our loop. I imagine we could read the value, iterate our records and import them, and, before we write our last_spphr back into the database, check whether the current value is still the one we read? That would shrink the race to the window between re-reading the value and writing the new one (and the linear importer is a one-time operation, in theory).
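
In rough code, that check-before-write could look like this (a sketch
only; the helper name and calling convention are illustrative):

import dbm
import os

def store_last_spphr(db_cache_dir, distname, pkgname, seen_at_start,
                     new_value):
    # seen_at_start is the raw bytes value read before iterating
    # (None if the key was unset); only write if it is unchanged.
    with dbm.open(os.path.join(db_cache_dir, distname), 'c') as cache:
        try:
            current = cache[pkgname]
        except KeyError:
            current = None
        if current != seen_at_start:
            return False  # another import advanced the record first
        cache[pkgname] = new_value
        return True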

Revision history for this message
Server Team CI bot (server-team-bot) wrote :

PASSED: Continuous integration, rev:e087180218bfe4276957a93c9375d875c6652a3d
https://jenkins.ubuntu.com/server/job/git-ubuntu-ci/207/
Executed test runs:
    SUCCESS: Checkout
    SUCCESS: Style Check
    SUCCESS: Unit Tests
    SUCCESS: Integration Tests
    IN_PROGRESS: Declarative: Post Actions

Click here to trigger a rebuild:
https://jenkins.ubuntu.com/server/job/git-ubuntu-ci/207/rebuild

review: Approve (continuous-integration)
Revision history for this message
Nish Aravamudan (nacc) wrote :

Another thought: does the cache need to be distinct for debian/ubuntu x unapplied/applied? Given that applied failures may be fixable, the applied heads can be at a different version than the unapplied ones?

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

I'm not sure I can help with the "implications" you asked about on IRC without thinking about it a good deal more. But I have a few questions that might at least help to clarify.

To do so, I want to hear more details on the actual use case that gets us into this race:
1) Do we even want two importers running concurrently on the same system?
   If not, let's just lock the DB to single use - problem solved.

2) If we do want to allow multiple processes, we could make use of the
   cache exclusive per package (I beg your pardon if I don't see why this
   would fail), so only one worker can be in a given [package] at any time.
   Processes working on different packages should never conflict, right?
   Another worker with a different [package] can continue, and we know
   there won't be a race. Essentially this is locking on [package] (yes,
   this might need a better DB to support rollback/timeouts on aborted
   processes if you want to lock in the DB); a sketch follows below.
   The tail of the processing would be updating the spphr and then
   unlocking the given package.
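
A rough sketch of that per-[package] locking (lock-file paths and
stale-lock handling left open):

import os

def try_lock_package(db_cache_dir, pkgname):
    # One lock file per [package]; O_EXCL makes creation atomic.
    lockpath = os.path.join(db_cache_dir, 'locks', pkgname + '.lock')
    os.makedirs(os.path.dirname(lockpath), exist_ok=True)
    try:
        fd = os.open(lockpath, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return None  # another worker holds this package
    os.write(fd, str(os.getpid()).encode())  # record the owner for debugging
    os.close(fd)
    return lockpath

def unlock_package(lockpath):
    # Tail of processing: update the spphr in the cache, then unlock.
    os.remove(lockpath)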

In general I think you should add tests along with these commits that check, e.g., "import-no-cache-used == import into cache == import with cache".

Revision history for this message
Nish Aravamudan (nacc) wrote :

On 08.01.2018 [20:17:52 -0000], ChristianEhrhardt wrote:
> I'm not sure I can help you with the "implications" as you asked on
> IRC without thinking much more into it. But I have a few questions
> that might help to clarify at least.

Understood and thank you!

> To do so I want to hear more details on the actual use-case that lets
> us run into this race:

The race is internal, in some ways, to the importer, but you do bring up
an interesting second case.

`git ubuntu import` is managed by a looping script,
scripts/import-source-packages.py, which operates in a keep-up
fashion. Roughly:

   Obtain a LIST of publishing events in reverse chronological order
   Walk LIST backwards until we find a PUBLISH that we have already
   imported (or earlier)
      Walk LIST forward after PUBLISH and IMPORT each unique source
      package name

IMPORT itself is, algorithmically:
   Get current REPOSITORY
   Get Launchpad publishing DATA for a given source package in reverse
   chronological order
   Walk DATA backwards until we find a PUBLISH that we have already
   imported (or earlier)
      Walk DATA forwards and import each publish record

The issue we are trying to resolve here is the "until we find a PUBLISH
that we have already imported" in IMPORT.

In the prior importer code, we had a unique Git commit for every
publication event, because the targetted <series>-<pocket> was part of
the Git commit (via the commit message and parenting). So as we did
IMPORT's reverse walk, we could look at the branch tips and compare
their commit data (where we stored the publishing timestamp) to the
publishing records to find the last imported publishing record.

We dropped that ability by dropping publishing parents altogether. We
now just import all published versions once, tying them together only by
their changelogs. We then forcibly move the branch tips.

So now if we use an unmodified importer, we will end up going back in
IMPORT to the last publish that was the first time we saw a given
version (typically in Debian, therefore) and walking forward from there.
This will be unnecessary iteration of publishing data, at least, and
unnecessary moving of the branch pointers.

My branch modifies the catch-up to move the storage of the publication
event to an external cache, currently DBM.
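
Condensed, the per-distribution flow looks like this (see the preview
diff below; decode_binary, distname, pkgname and last_spi come from the
surrounding importer.py code):

import dbm
import os

# last_spphr: where the previous run stopped for this (pkgname, distname)
last_spphr = None
if db_cache_dir:
    with dbm.open(os.path.join(db_cache_dir, distname), 'c') as cache:
        try:
            last_spphr = decode_binary(cache[pkgname])
        except KeyError:
            pass  # first import of this package with this cache

# ... iterate only the publishing records newer than last_spphr,
# importing each and tracking the newest in last_spi ...

if last_spi and db_cache_dir:
    with dbm.open(os.path.join(db_cache_dir, distname), 'w') as db_cache:
        db_cache[pkgname] = str(last_spi.spphr)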

> 1) Do we even want that two importers run concurrently on the same system?
> If not lets just lock the DB to be of single use - problem solved.

Right, so we have a mode we know will exist at some point, where there
is a linear script walking all 'to-import' source packages to get them
loaded to Launchpad, and the keep-up script ensuring that publishes
happening while the linear walker runs also get integrated. So we do
expect to see two imports of the same source package going on (and that
shouldn't just break). What we don't want is for a slower 'old' import
(where the list of events to import is older) to somehow trump a faster
'new' import and end up setting the branch pointers to the wrong
place(s).

The problem is that if we just outright lock the DB, then the linear
script can't work if it's also using the DB. And the point of it is that
they...


Revision history for this message
Nish Aravamudan (nacc) wrote :

> Another thought: does the cache need to be distinct for debian/ubuntu x
> unapplied/applied? Given that applied failures may be fixable, the applied
> heads can be at a different version than the unapplied ones?

I do think we actually need 4 caches, since the unapplied / applied heads can easily be in different states. I'll work on this now.

I'll also add some changes to the importer to clean up that code.
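
Something like a tiny path helper should cover the 2x2 split
(hypothetical; not in the branch yet):

import os

def cache_path(db_cache_dir, distname, patches_applied):
    # Four dbm files: {debian,ubuntu} x {unapplied,applied}.
    suffix = '-applied' if patches_applied else ''
    return os.path.join(db_cache_dir, distname + suffix)

# e.g. cache_path(cache_dir, 'ubuntu', True) -> <cache_dir>/ubuntu-applied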

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

[...]

> So now if we use an unmodified importer, we will end up going back in
> IMPORT to the last publish that was the first time we saw a given
> version (typically in Debian, therefore) and walking forward from there.

To confirm - you go to "the first time we saw a given version" because without parenting history that is the only safe place to walk forward from?

> This will be unnecessary iteration of publishing data

because some might already have been handled.

>, at least, and
> unnecessary moving of the branch pointers.

For the walking

Thanks for the details, much clearer to me now what your case actually is.

>
> My branch modifies the catch-up to move the storage of the publication
> event to an external cache, currently DBM.

[...]

> The more I think about it, I think I am spinning myself on this problem
> for no reason :)
>
> I think we can let the linear walker not use the dbm [we can keep the
> code, just not pass anything] (or sqlite) and it will just mean if there
> is a race between when the linear walker hits a source package and when
> the keep-up script does, there will be some no-op looping (unless new
> SPRs are found).

Sounds good, and then you can actually lock the DB to be fully safe against unintentional concurrent use by the catch-up script (if one manually starts it twice or so).

[...]

> I do think we actually need 4 caches, since the unapplied / applied heads can easily be at a
> different state. I'll work on this now.

> I'll also add some changes to the importer to clean up that code.

Since you will only ever access it from code and never manually, I think you could even implement it as debian/ubuntu x unapplied/applied x per-package. If you end up locking the DBs in any way, you will then have better granularity on those locks.
And for your code it doesn't matter how finely split the file structure is.

Unmerged commits

a9960c8... by Nish Aravamudan

git ubuntu import: use a dbm cache to store importer progress

With the recent changes to the importer algorithm, we can no longer
identify a Launchpad publishing event by the commit information in the
repository -- a given SourcePackageRelease (source package name and
version) is only imported once (presuming all future publishes of the
same name and version match exactly). We rely on this in
launchpad_versions_published_after, which iterates the Launchpad data
backwards until we either match the commit data for a branch head, or
see Launchpad data from before one of the branch heads.

Change the importer code to take a --db-cache argument naming a
directory that contains a DBM cache each for debian and ubuntu (a
directory is needed because DBM databases are single-level,
string-indexed string storage). Look up the source package name in the
relevant cache before we look up Launchpad data, to obtain the last
SPPHR used. Store the latest processed SPPHR after iterating the
Launchpad data.

Also update the scripts to support passing a persistent/consistent value
to the import for the cache.

LP: #1730734

Preview Diff

diff --git a/gitubuntu/importer.py b/gitubuntu/importer.py
index 2063f83..6abad7d 100644
--- a/gitubuntu/importer.py
+++ b/gitubuntu/importer.py
@@ -26,6 +26,7 @@
 
 import argparse
 import atexit
+import dbm
 import functools
 import getpass
 import logging
@@ -150,6 +151,7 @@ def main(
     parentfile=top_level_defaults.parentfile,
     retries=top_level_defaults.retries,
     retry_backoffs=top_level_defaults.retry_backoffs,
+    db_cache_dir=None,
 ):
     """Main entry point to the importer
 
@@ -178,6 +180,9 @@ def main(
     @parentfile: string path to file specifying parent overrides
     @retries: integer number of download retries to attempt
     @retry_backoffs: list of backoff durations to use between retries
+    @db_cache_dir: string fileystem directory containing 'ubuntu' and
+        'debian' dbm database files, which store the progress of prior
+        importer runs
 
     If directory is None, a temporary directory is created and used.
 
@@ -187,6 +192,11 @@ def main(
     If dl_cache is None, CACHE_PATH in the local repository will be
     used.
 
+    If db_cache_dir is None, no database lookups are performed which makes
+    it possible that the import will attempt to re-import already
+    imported publishes. This should be fine, although less efficient
+    than possible.
+
     Returns 0 on successful import (which includes non-fatal failures);
     1 otherwise.
     """
@@ -314,6 +324,9 @@ def main(
     else:
         workdir = dl_cache
 
+    if db_cache_dir:
+        os.makedirs(db_cache_dir, exist_ok=True)
+
     os.makedirs(workdir, exist_ok=True)
 
     # now sets a global _PARENT_OVERRIDES
@@ -326,6 +339,7 @@ def main(
         patches_applied=False,
         debian_head_versions=debian_head_versions,
         ubuntu_head_versions=ubuntu_head_versions,
+        db_cache_dir=db_cache_dir,
         debian_sinfo=debian_sinfo,
         ubuntu_sinfo=ubuntu_sinfo,
         active_series_only=active_series_only,
@@ -347,6 +361,7 @@ def main(
         patches_applied=True,
         debian_head_versions=applied_debian_head_versions,
         ubuntu_head_versions=applied_ubuntu_head_versions,
+        db_cache_dir=db_cache_dir,
         debian_sinfo=debian_sinfo,
         ubuntu_sinfo=ubuntu_sinfo,
         active_series_only=active_series_only,
@@ -1401,6 +1416,7 @@ def import_publishes(
     patches_applied,
     debian_head_versions,
     ubuntu_head_versions,
+    db_cache_dir,
     debian_sinfo,
     ubuntu_sinfo,
     active_series_only,
@@ -1411,6 +1427,8 @@ def import_publishes(
     history_found = False
     only_debian = False
     srcpkg_information = None
+    last_debian_spphr = None
+    last_ubuntu_spphr = None
     if patches_applied:
         _namespace = namespace
         namespace = '%s/applied' % namespace
@@ -1426,15 +1444,28 @@ def import_publishes(
         import_unapplied_spi,
         skip_orig=skip_orig,
     )
-    for distname, versions, dist_sinfo in (
-        ("debian", debian_head_versions, debian_sinfo),
-        ("ubuntu", ubuntu_head_versions, ubuntu_sinfo)):
+
+    for distname, versions, dist_sinfo, last_spphr in (
+        ("debian", debian_head_versions, debian_sinfo, last_debian_spphr),
+        ("ubuntu", ubuntu_head_versions, ubuntu_sinfo, last_ubuntu_spphr),
+    ):
         if active_series_only and distname == "debian":
             continue
+
+        last_spphr = None
+        if db_cache_dir:
+            with dbm.open(os.path.join(db_cache_dir, distname), 'c') as cache:
+                try:
+                    last_spphr = decode_binary(cache[pkgname])
+                except KeyError:
+                    pass
+
         try:
+            last_spi = None
             for srcpkg_information in dist_sinfo.launchpad_versions_published_after(
                 versions,
                 namespace,
+                last_spphr,
                 workdir=workdir,
                 active_series_only=active_series_only
             ):
@@ -1445,6 +1476,11 @@ def import_publishes(
                     namespace=_namespace,
                     ubuntu_sinfo=ubuntu_sinfo,
                 )
+                last_spi = srcpkg_information
+            if last_spi:
+                if db_cache_dir:
+                    with dbm.open(os.path.join(db_cache_dir, distname), 'w') as db_cache:
+                        db_cache[pkgname] = str(last_spi.spphr)
         except NoPublicationHistoryException:
             logging.warning("No publication history found for %s in %s.",
                 pkgname, distname
@@ -1520,6 +1556,11 @@ def parse_args(subparsers=None, base_subparsers=None):
         action='store_true',
         help=argparse.SUPPRESS,
     )
+    parser.add_argument(
+        '--db-cache',
+        type=str,
+        help=argparse.SUPPRESS,
+    )
     if not subparsers:
         return parser.parse_args()
     return 'import - %s' % kwargs['description']
@@ -1548,6 +1589,11 @@ def cli_main(args):
     except AttributeError:
         dl_cache = None
 
+    try:
+        db_cache = args.db_cache
+    except AttributeError:
+        db_cache = None
+
     return main(
         pkgname=args.package,
         owner=args.lp_owner,
@@ -1567,4 +1613,5 @@ def cli_main(args):
         parentfile=args.parentfile,
         retries=args.retries,
         retry_backoffs=args.retry_backoffs,
+        db_cache_dir=args.db_cache,
     )
diff --git a/gitubuntu/source_information.py b/gitubuntu/source_information.py
index b97bc8a..050d60a 100644
--- a/gitubuntu/source_information.py
+++ b/gitubuntu/source_information.py
@@ -443,7 +443,14 @@ class GitUbuntuSourceInformation(object):
         for srcpkg in spph:
             yield self.get_corrected_spi(srcpkg, workdir)
 
-    def launchpad_versions_published_after(self, head_versions, namespace, workdir=None, active_series_only=False):
+    def launchpad_versions_published_after(
+        self,
+        head_versions,
+        namespace,
+        last_spphr=None,
+        workdir=None,
+        active_series_only=False,
+    ):
         args = {
             'exact_match':True,
             'source_name':self.pkgname,
@@ -471,7 +478,14 @@ class GitUbuntuSourceInformation(object):
         if len(spph) == 0:
             raise NoPublicationHistoryException("Is %s published in %s?" %
                 (self.pkgname, self.dist_name))
-        if len(head_versions) > 0:
+        if last_spphr:
+            _spph = list()
+            for spphr in spph:
+                if str(spphr) == last_spphr:
+                    break
+                _spph.append(spphr)
+            spph = _spph
+        elif head_versions:
             _spph = list()
             for spphr in spph:
                 spi = GitUbuntuSourcePackageInformation(spphr, self.dist_name,
diff --git a/man/man1/git-ubuntu-import.1 b/man/man1/git-ubuntu-import.1
index dd7fd9e..74bfe83 100644
--- a/man/man1/git-ubuntu-import.1
+++ b/man/man1/git-ubuntu-import.1
@@ -1,4 +1,4 @@
-.TH "GIT-UBUNTU-IMPORT" "1" "2017-07-19" "Git-Ubuntu 0.2" "Git-Ubuntu Manual"
+.TH "GIT-UBUNTU-IMPORT" "1" "2017-11-08" "Git-Ubuntu 0.6.2" "Git-Ubuntu Manual"
 
 .SH "NAME"
 git-ubuntu import \- Import Launchpad publishing history to Git
@@ -9,7 +9,8 @@ git-ubuntu import \- Import Launchpad publishing history to Git
 <user>] [\-\-dl-cache <dl_cache>] [\-\-no-fetch] [\-\-no-push]
 [\-\-no-clean] [\-d | \-\-directory <directory>]
 [\-\-active-series-only] [\-\-skip-applied] [\-\-skip-orig]
-[\-\-reimport] [\-\-allow-applied-failures] <package>
+[\-\-reimport] [\-\-allow-applied-failures] [\-\-db-cache <db_cache>]
+<package>
 .FI
 .SP
 .SH "DESCRIPTION"
@@ -197,6 +198,21 @@ After investigation, this flag can be used to indicate the importer is
 allowed to ignore such a failure\&.
 .RE
 .PP
+\-\-db-cache <db_cache>
+.RS 4
+The path to a directory containing Python dbm database disk files for
+importer metadata\&.
+If \fB<db_cache>\fR does not exist, it will be created\&.
+Two files in \fB<db_cache>\fR are used, "ubuntu" and "debian', which are
+created if not already present\&.
+The cache files provide information to the importer about prior imports
+of \fB<package>\fR and which Launchpad publishing record was last
+imported\&.
+This is necessary because the imported Git repository does not
+necessarily maintain any metadata about Launchpad publishing
+information\&.
+.RE
+.PP
 <package>
 .RS 4
 The name of the source package to import\&.
diff --git a/scripts/import-source-packages.py b/scripts/import-source-packages.py
index 698ae35..7dd534e 100755
--- a/scripts/import-source-packages.py
+++ b/scripts/import-source-packages.py
@@ -50,6 +50,7 @@ def import_new_published_sources(
     phasing_main,
     phasing_universe,
     dry_run,
+    db_cache,
 ):
     """import_new_published_source - Import all new publishes since a prior execution
 
@@ -60,6 +61,10 @@ def import_new_published_sources(
     phasing_main - a integer percentage of all packages in main to import
     phasing_universe - a integer percentage of all packages in universe to import
     dry_run - a boolean to indicate a dry-run operation
+    db_cache - string filesystem path containing DBM files for storing
+        importer progress, will be created if it does not exist
+
+    If db_cache is None, no cache is used.
 
     Returns:
         A tuple of two lists, the first containing the names of all
@@ -150,6 +155,7 @@ def import_new_published_sources(
     ret = scriptutils.pool_map_import_srcpkg(
         num_workers=num_workers,
         dry_run=dry_run,
+        db_cache=db_cache,
         pkgnames=filtered_pkgnames,
     )
 
@@ -229,6 +235,7 @@ def main(
     phasing_main=scriptutils.DEFAULTS.phasing_main,
     phasing_universe=scriptutils.DEFAULTS.phasing_universe,
     dry_run=scriptutils.DEFAULTS.dry_run,
+    db_cache=scriptutils.DEFAULTS.db_cache,
 ):
     """main - Main entry point to the script
 
@@ -243,6 +250,8 @@ def main(
     phasing_universe - a integer percentage of all packages in universe
         to import
     dry_run - a boolean to indicate a dry-run operation
+    db_cache - string filesystem path containing DBM files for storing
+        importer progress, will be created if it does not exist
     """
     scriptutils.setup_git_config()
 
@@ -280,6 +289,7 @@ def main(
         phasing_main,
         phasing_universe,
         dry_run,
+        db_cache,
     )
     print("Imported %d source packages" % len(imported_srcpkgs))
     mail_imported_srcpkgs |= set(imported_srcpkgs)
@@ -361,6 +371,13 @@ def cli_main():
         help="Simulate operation but do not actually do anything",
         default=scriptutils.DEFAULTS.dry_run,
     )
+    parser.add_argument(
+        '--db-cache',
+        type=str,
+        help="Directory containing Python DBM files which store importer "
+             "progress",
+        default=scriptutils.DEFAULTS.db_cache,
+    )
 
     args = parser.parse_args()
 
@@ -371,6 +388,7 @@ def cli_main():
         phasing_main=args.phasing_main,
         phasing_universe=args.phasing_universe,
         dry_run=args.dry_run,
+        db_cache=args.db_cache,
     )
 
 if __name__ == '__main__':
diff --git a/scripts/scriptutils.py b/scripts/scriptutils.py
index 912efd1..e24c394 100644
--- a/scripts/scriptutils.py
+++ b/scripts/scriptutils.py
@@ -5,6 +5,7 @@ import multiprocessing
 import os
 import sys
 import subprocess
+import tempfile
 import time
 
 import pkg_resources
@@ -35,6 +36,7 @@ Defaults = namedtuple(
         'phasing_main',
         'dry_run',
         'use_whitelist',
+        'db_cache',
     ],
 )
 
@@ -52,6 +54,7 @@ DEFAULTS = Defaults(
     phasing_main=1,
     dry_run=False,
     use_whitelist=True,
+    db_cache=os.path.join(tempfile.gettempdir(), 'git-ubuntu-db-cache'),
 )
 
 
@@ -101,12 +104,16 @@ def should_import_srcpkg(
         return False
 
 
-def import_srcpkg(pkgname, dry_run):
+def import_srcpkg(pkgname, dry_run, db_cache):
     """import_srcpkg - Invoke git ubuntu import on @pkgname
 
     Arguments:
         pkgname - string name of a source package
         dry_run - a boolean to indicate a dry-run operation
+        db_cache - string filesystem path containing DBM files for storing
+            importer progress, will be created if it does not exist
+
+    If db_cache is None, no cache is used.
 
     Returns:
         A tuple of a string and a boolean, where the boolean is the success
@@ -125,6 +132,8 @@ def import_srcpkg(pkgname, dry_run):
         'usd-importer-bot',
         pkgname,
     ]
+    if db_cache:
+        cmd.extend(['--db-cache', db_cache,])
     try:
         print(' '.join(cmd))
         if not dry_run:
@@ -163,6 +172,7 @@ def setup_git_config(
 def pool_map_import_srcpkg(
     num_workers,
     dry_run,
+    db_cache,
     pkgnames,
 ):
     """pool_map_import_srcpkg - Use a multiprocessing.Pool to parallel
@@ -171,13 +181,18 @@ def pool_map_import_srcpkg(
     Arguments:
         num_workers - integer number of worker processes to use
         dry_run - a boolean to indicate a dry-run operation
+        db_cache - string filesystem path containing DBM files for storing
+            importer progress, will be created if it does not exist
         pkgnames - a list of string names of source packages
+
+    If db_cache is None, no cache is used.
     """
     with multiprocessing.Pool(processes=num_workers) as pool:
         results = pool.map(
             functools.partial(
                 import_srcpkg,
                 dry_run=dry_run,
+                db_cache=db_cache,
             ),
             pkgnames,
         )
diff --git a/scripts/source-package-walker.py b/scripts/source-package-walker.py
index 58c3d5a..f7b9629 100755
--- a/scripts/source-package-walker.py
+++ b/scripts/source-package-walker.py
@@ -44,6 +44,7 @@ def import_all_published_sources(
     phasing_main,
     phasing_universe,
     dry_run,
+    db_cache,
 ):
     """import_all_published_sources - Import all publishes satisfying a
     {white,black}list and phasing
@@ -55,6 +56,10 @@ def import_all_published_sources(
     phasing_main - a integer percentage of all packages in main to import
     phasing_universe - a integer percentage of all packages in universe to import
     dry_run - a boolean to indicate a dry-run operation
+    db_cache - string filesystem path containing DBM files for storing
+        importer progress, will be created if it does not exist
+
+    If db_cache is None, no cache is used.
 
     Returns:
         A tuple of two lists, the first containing the names of all
@@ -126,6 +131,7 @@ def import_all_published_sources(
     return scriptutils.pool_map_import_srcpkg(
         num_workers=num_workers,
         dry_run=dry_run,
+        db_cache=db_cache,
         pkgnames=pkgnames,
     )
 
@@ -137,6 +143,7 @@ def main(
     phasing_universe=scriptutils.DEFAULTS.phasing_universe,
     dry_run=scriptutils.DEFAULTS.dry_run,
     use_whitelist=scriptutils.DEFAULTS.use_whitelist,
+    db_cache=scriptutils.DEFAULTS.db_cache,
 ):
     """main - Main entry point to the script
 
@@ -153,6 +160,8 @@ def main(
     dry_run - a boolean to indicate a dry-run operation
     use_whitelist - a boolean to control whether the whitelist data is
         used
+    db_cache - string filesystem path containing DBM files for storing
+        importer progress, will be created if it does not exist
 
     use_whitelist exists because during the rampup of imports, we want
     to import the whitelist packages and the phased packages. But after
@@ -191,6 +200,7 @@ def main(
         phasing_main,
         phasing_universe,
         dry_run,
+        db_cache,
     )
     print(
         "Imported %d source packages:\n%s" % (
@@ -254,6 +264,13 @@ def cli_main():
         help="Simulate operation but do not actually do anything",
         default=scriptutils.DEFAULTS.dry_run,
     )
+    parser.add_argument(
+        '--db-cache',
+        type=str,
+        help="Directory containing Python DBM files which store importer "
+             "progress",
+        default=scriptutils.DEFAULTS.db_cache,
+    )
 
     args = parser.parse_args()
 
@@ -265,6 +282,7 @@ def cli_main():
         phasing_universe=args.phasing_universe,
         dry_run=args.dry_run,
         use_whitelist=args.use_whitelist,
+        db_cache=args.db_cache,
     )
 
 if __name__ == '__main__':
