Merge lp:~cjwatson/extract-changelogs/order-by-date into lp:extract-changelogs

Proposed by Colin Watson
Status: Merged
Merged at revision: 49
Proposed branch: lp:~cjwatson/extract-changelogs/order-by-date
Merge into: lp:extract-changelogs
Diff against target: 34 lines (+11/-2)
2 files modified
get-published-since.py (+2/-1)
lp-extract-changelogs.py (+9/-1)
To merge this branch: bzr merge lp:~cjwatson/extract-changelogs/order-by-date
Reviewer                      Review Type  Date Requested  Status
Greg Mason (community)                                     Approve
Ubuntu Core Development Team                               Pending
Review via email: mp+291837@code.launchpad.net

Commit message

Use archive.getPublishedSources(order_by_date=True) for a significant speedup.

Description of the change

Use archive.getPublishedSources(order_by_date=True) for a significant speedup.

The query that extract-changelogs is currently relying on is very slow, and there are some subtle ways in which iterating over the collection can go wrong. For ddeb-retriever, we did a fair bit of work on this:

  https://bugs.launchpad.net/launchpad/+bug/1441729
  https://code.launchpad.net/~cjwatson/launchpad/db-index-bpph-datecreated/+merge/255539
  https://code.launchpad.net/~cjwatson/launchpad/getpublishedbinaries-sorting/+merge/255822

In the case of extract-changelogs, it should be sufficient to add order_by_date=True, which joins fewer tables and uses a reasonably well-indexed query to return a collection in decreasing ID order. Provided you don't do any status filtering or similar (as explained in a comment in this branch), the worst case if the collection changes during iteration is that you get the same source package more than once, and extract-changelogs already handles this in LaunchpadChangelogsCrawler._unpack_changelogs_to_target.
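
As a rough illustration of why a repeated publication is harmless, here is a minimal sketch of de-duplicating while iterating. It is not the code in LaunchpadChangelogsCrawler._unpack_changelogs_to_target; the Publication tuple and the unique_publications helper are hypothetical stand-ins, and keying on name plus version is an assumption made for the example.

```python
from collections import namedtuple

# Hypothetical stand-in for a source publication record. Real objects come
# from archive.getPublishedSources() and carry many more fields.
Publication = namedtuple(
    "Publication", ["source_package_name", "source_package_version"])


def unique_publications(publications):
    """Yield each (name, version) pair once, in first-seen order."""
    seen = set()
    for pub in publications:
        key = (pub.source_package_name, pub.source_package_version)
        if key in seen:
            continue  # the same publication came back in a later batch
        seen.add(key)
        yield pub


sample = [
    Publication("hello", "2.10-1"),
    Publication("bash", "4.3-14"),
    Publication("hello", "2.10-1"),  # repeat, as if the collection shifted
]
deduplicated = list(unique_publications(sample))
```

With a guard of this kind in place, seeing the same entry twice only costs a little extra iteration.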

Please do test this! I have not done so. However, I hear that extract-changelogs times out when asked to work from a very old starting date, and this should make it behave a lot better.

Greg Mason (gmason) :
review: Approve

Preview Diff

=== modified file 'get-published-since.py'
--- get-published-since.py 2013-07-24 13:29:08 +0000
+++ get-published-since.py 2016-04-13 23:50:59 +0000
@@ -46,7 +46,8 @@
     def get_changelogs_since(self, date):
         ubuntu = self._launchpad.distributions[DISTRIBUTION]
         archive = ubuntu.main_archive
-        changed = archive.getPublishedSources(created_since_date=date)
+        changed = archive.getPublishedSources(order_by_date=True,
+                                              created_since_date=date)
         for source in changed:
             print(source.source_package_name, source.date_published,
                   source.status)

=== modified file 'lp-extract-changelogs.py'
--- lp-extract-changelogs.py 2016-03-11 16:26:54 +0000
+++ lp-extract-changelogs.py 2016-04-13 23:50:59 +0000
@@ -216,7 +216,15 @@
     def get_changelogs_since(self, archive, date):
         self._time_of_last_check = time.time()
         now = time.time()
-        changed = archive.getPublishedSources(created_since_date=date)
+        # It's important to omit the status filter here, even if we find
+        # ourselves filtering on the status later. This is because the
+        # collection may change as we're iterating over it. Without any
+        # filtering, this is OK because entries can never be removed from
+        # the collection: the worst case is that we encounter the same
+        # publication twice. With filtering on mutable properties, it would
+        # be possible to lose entries between two successive batches.
+        changed = archive.getPublishedSources(order_by_date=True,
+                                              created_since_date=date)
         logging.debug("getPublishedSources() took %i seconds" %
                       (time.time() - now))
         self._get_changelogs_from_source_package_history_collection(changed)
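
The reasoning in the code comment above can be checked with a small simulation. This is a simplified model, not Launchpad's code: it assumes batches are fetched by offset from a collection sorted by decreasing ID, and every name in it (fetch_batch, iterate, the row dictionaries) is hypothetical.

```python
def make_rows():
    """Six publications with IDs 1..6, all initially Published."""
    return [{"id": i, "status": "Published"} for i in range(1, 7)]


def fetch_batch(rows, start, size, status=None):
    """Return one batch of IDs, newest first, optionally filtered by status."""
    matching = [r for r in rows if status is None or r["status"] == status]
    matching.sort(key=lambda r: r["id"], reverse=True)
    return [r["id"] for r in matching[start:start + size]]


def iterate(rows, size, status=None, mutate_after_first_batch=None):
    """Walk the collection batch by batch, mutating it once along the way."""
    seen = []
    start = 0
    while True:
        batch = fetch_batch(rows, start, size, status)
        if not batch:
            break
        seen.extend(batch)
        start += size
        if mutate_after_first_batch is not None and start == size:
            mutate_after_first_batch(rows)
    return seen


def add_new_row(rows):
    # A new publication appears at the head of the collection.
    rows.append({"id": 7, "status": "Published"})


def supersede_newest(rows):
    # An already-returned publication changes status and drops out of the
    # filtered collection, shifting later entries up by one position.
    rows[-1]["status"] = "Superseded"


# No filter: ID 4 is returned twice, but none of IDs 1..6 is skipped.
unfiltered_seen = iterate(make_rows(), 3, mutate_after_first_batch=add_new_row)

# Status filter: ID 3 slides into the already-consumed range and is lost.
filtered_seen = iterate(make_rows(), 3, status="Published",
                        mutate_after_first_batch=supersede_newest)
```

In this model the unfiltered walk repeats one entry and the filtered walk silently skips one, which matches the two cases the comment distinguishes.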
