Merge lp:~allenap/launchpad/localpackagediffs-time-out-bug-798301 into lp:launchpad

Proposed by Gavin Panella
Status: Rejected
Rejected by: Gavin Panella
Proposed branch: lp:~allenap/launchpad/localpackagediffs-time-out-bug-798301
Merge into: lp:launchpad
Diff against target: 119 lines (+56/-11)
1 file modified
lib/lp/registry/model/distroseriesdifference.py (+56/-11)
To merge this branch: bzr merge lp:~allenap/launchpad/localpackagediffs-time-out-bug-798301
Reviewer Review Type Date Requested Status
Jeroen T. Vermeulen Pending
Review via email: mp+65426@code.launchpad.net

Description of the change

A small warning: this branch has been born in an atmosphere of
frustration with PostgreSQL and Storm, most especially Storm.

The query that most_recent_publications() issues results in some
pathological query performance as the number of DSDs queried against
grows. Changing the number of DSDs from 75 to 300 causes a tenfold
increase in execution time.

Some experimentation against production with alternative queries -
forcing inner joins, using temporary tables, using WITH clauses - has
yielded similar or worse performance. The only approach that has shown
promise (and quite a lot of promise: query time down from ~6s to 0.6s)
is breaking the query into small batches and then UNIONing them back
together. A DISTINCT ON clause needs to be applied to the set as a
whole.

However, Storm does not make this easy. See bug 799824 for that. Most
of the complexity in this branch is working around that annoying bug.

To post a comment you must log in.
Revision history for this message
Jeroen T. Vermeulen (jtv) wrote :

As discussed on IRC: I too find the union trick highly suspicious. And I question the shape of the join:

 * Shouldn't the SPPHs be filtered by distroseries?

 * Can't the Archive.status check be done outside of the query?

 * Does the query really need DSD?

In the extreme lucky case where all of these questions hit pay dirt, you'd end up with a join between just SPPH and SPR.

Revision history for this message
Gavin Panella (allenap) wrote :

Thanks Jeroen, your questions here and on IRC have led to some different thinking which seems to provide a similar performance increase with much less complexity.

Revision history for this message
Gavin Panella (allenap) wrote :

Unmerged revisions

13253. By Gavin Panella

Split the query into batches of around 35 DSDs at a time, and UNION it back together.

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'lib/lp/registry/model/distroseriesdifference.py'
2--- lib/lp/registry/model/distroseriesdifference.py 2011-06-14 13:47:51 +0000
3+++ lib/lp/registry/model/distroseriesdifference.py 2011-06-21 22:50:41 +0000
4@@ -10,7 +10,10 @@
5 ]
6
7 from collections import defaultdict
8-from itertools import chain
9+from itertools import (
10+ chain,
11+ islice,
12+ )
13 from operator import itemgetter
14
15 import apt_pkg
16@@ -20,12 +23,17 @@
17 )
18 from lazr.enum import DBItem
19 from sqlobject import StringCol
20+import storm
21 from storm.exceptions import NotOneError
22 from storm.expr import (
23+ Alias,
24 And,
25 Column,
26 Desc,
27+ Select,
28+ SQLRaw,
29 Table,
30+ Union,
31 )
32 from storm.locals import (
33 Int,
34@@ -101,6 +109,21 @@
35 from lp.soyuz.model.sourcepackagerelease import SourcePackageRelease
36
37
38+class FlushingResultSet(storm.store.ResultSet):
39+ """A `storm.store.ResultSet` that flushes the cache before executing.
40+
41+ If implicit flushes are not blocked this will flush the underlying store
42+ before being iterated. The store normally flushes its own caches at the
43+ right times, but when constructing `ResultSet`s by hand this behaviour is
44+ bypassed.
45+ """
46+
47+ def __iter__(self):
48+ if self._store._implicit_flush_block_count == 0:
49+ self._store.flush()
50+ return super(FlushingResultSet, self).__iter__()
51+
52+
53 def most_recent_publications(dsds, in_parent, statuses, match_version=False):
54 """The most recent publications for the given `DistroSeriesDifference`s.
55
56@@ -111,12 +134,16 @@
57 :param in_parent: A boolean indicating if we should look in the parent
58 series' archive instead of the derived series' archive.
59 """
60- columns = (
61- DistroSeriesDifference.source_package_name_id,
62- SourcePackagePublishingHistory,
63- )
64+ # XXX: GavinPanella 2011-06-21 bug=799824: Storm cannot currently do
65+ # DISTINCT ON with a set expression (e.g. on a UNION). A large amount of
66+ # the tomfoolery with FindSpec, SQLRaw, FlushingResultSet et al below is
67+ # because it's bloody hard to get Storm to just run a query of your
68+ # choosing and return a set of *loaded objects*.
69+ spec = storm.store.FindSpec(
70+ (DistroSeriesDifference.source_package_name_id,
71+ SourcePackagePublishingHistory))
72+ columns, tables = spec.get_columns_and_tables()
73 conditions = And(
74- DistroSeriesDifference.id.is_in(dsd.id for dsd in dsds),
75 SourcePackagePublishingHistory.archiveID == Archive.id,
76 SourcePackagePublishingHistory.sourcepackagereleaseID == (
77 SourcePackageRelease.id),
78@@ -149,18 +176,36 @@
79 conditions,
80 SourcePackageRelease.version == version_column,
81 )
82+ # The query performance drops exponentially with number of DSDs, so we
83+ # break it up into a number of smaller queries and then UNION them.
84+ dsd_ids = [dsd.id for dsd in dsds]
85+ batch_count = max(1, len(dsd_ids) / 35)
86+ batches = (
87+ islice(dsd_ids, batch, None, batch_count)
88+ for batch in xrange(batch_count))
89+ batch_conditions = (
90+ DistroSeriesDifference.id.is_in(batch)
91+ for batch in batches)
92+ subselects = (
93+ Select(columns, And(conditions, batch_condition))
94+ for batch_condition in batch_conditions)
95+ union = Union(*subselects)
96+ union_alias = Alias(union)
97+ union_column = lambda column: Column(column.name, union_alias)
98 # The sort order is critical so that the DISTINCT ON clause selects the
99 # most recent publication (i.e. the one with the highest id).
100 order_by = (
101- DistroSeriesDifference.source_package_name_id,
102- Desc(SourcePackagePublishingHistory.id),
103+ union_column(DistroSeriesDifference.source_package_name_id),
104+ Desc(union_column(SourcePackagePublishingHistory.id)),
105 )
106 distinct_on = (
107- DistroSeriesDifference.source_package_name_id,
108+ union_column(DistroSeriesDifference.source_package_name_id),
109 )
110+ select = Select(
111+ SQLRaw("*"), tables=union_alias, order_by=order_by,
112+ distinct=distinct_on)
113 store = IStore(SourcePackagePublishingHistory)
114- return store.find(
115- columns, conditions).order_by(*order_by).config(distinct=distinct_on)
116+ return FlushingResultSet(store, spec, select=select)
117
118
119 def most_recent_comments(dsds):