Merge lp:~cjwatson/launchpad/db-index-bpph-datecreated into lp:launchpad/db-devel

Proposed by Colin Watson on 2015-04-08
Status: Merged
Merged at revision: 12977
Proposed branch: lp:~cjwatson/launchpad/db-index-bpph-datecreated
Merge into: lp:launchpad/db-devel
Diff against target: 13 lines (+9/-0)
1 file modified
database/schema/patch-2209-53-9.sql (+9/-0)
To merge this branch: bzr merge lp:~cjwatson/launchpad/db-index-bpph-datecreated
Reviewer Review Type Date Requested Status
William Grant db 2015-04-08 Approve on 2015-04-13
Stuart Bishop db 2015-04-08 Pending
Review via email: mp+255539@code.launchpad.net

Commit message

Index BPPH (archive, datecreated, id) to speed up Archive.getAllPublishedBinaries(order_by_date=True).

Description of the change

In the new ddeb world order, ddeb-retriever will need to use Archive.getPublishedBinaries(created_since_date=) on the webservice to catch up with binary publications. This is very slow right now because BPPH.datecreated is unindexed, and it often times out on qastaging. There's currently an index on SPPH (datecreated, id), so let's have a BPPH version of that too.

This will take some time to create; it should be applied live using CREATE INDEX CONCURRENTLY.

Timings on dogfood:

launchpad_dogfood=# EXPLAIN (ANALYZE ON, BUFFERS ON) SELECT COUNT(*) FROM BinaryPackageName, BinaryPackagePublishingHistory, BinaryPackageRelease WHERE BinaryPackagePublishingHistory.archive = 1 AND BinaryPackagePublishingHistory.binarypackagerelease = BinaryPackageRelease.id AND BinaryPackagePublishingHistory.binarypackagename = BinaryPackageName.id AND BinaryPackagePublishingHistory.status IN (2) AND BinaryPackagePublishingHistory.datecreated >= '2015-03-01 00:00:00+00:00' AND (1=1);
                                                                                               QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate (cost=885162.09..885162.10 rows=1 width=0) (actual time=4922.615..4922.616 rows=1 loops=1)
   Buffers: shared read=287546
   -> Nested Loop (cost=34913.25..885161.71 rows=154 width=0) (actual time=4922.612..4922.612 rows=0 loops=1)
         Buffers: shared read=287546
         -> Nested Loop (cost=34912.82..883857.71 rows=154 width=4) (actual time=4922.612..4922.612 rows=0 loops=1)
               Buffers: shared read=287546
               -> Bitmap Heap Scan on binarypackagepublishinghistory (cost=34912.39..882600.03 rows=154 width=8) (actual time=4922.610..4922.610 rows=0 loops=1)
                     Recheck Cond: ((archive = 1) AND (status = 2))
                     Rows Removed by Index Recheck: 12289303
                     Filter: (datecreated >= '2015-03-01 00:00:00'::timestamp without time zone)
                     Rows Removed by Filter: 2603144
                     Buffers: shared read=287546
                     -> Bitmap Index Scan on securebinarypackagepublishinghistory__archive__status__idx (cost=0.00..34912.36 rows=1664779 width=0) (actual time=465.737..465.737 rows=2603559 loops=1)
                           Index Cond: ((archive = 1) AND (status = 2))
                           Buffers: shared read=7118
               -> Index Only Scan using binarypackagename_pkey on binarypackagename (cost=0.42..8.16 rows=1 width=4) (never executed)
                     Index Cond: (id = binarypackagepublishinghistory.binarypackagename)
                     Heap Fetches: 0
         -> Index Only Scan using binarypackage_pkey on binarypackagerelease (cost=0.44..8.46 rows=1 width=4) (never executed)
               Index Cond: (id = binarypackagepublishinghistory.binarypackagerelease)
               Heap Fetches: 0
 Total runtime: 4922.678 ms
(22 rows)

Time: 4924.305 ms
launchpad_dogfood=# BEGIN;
BEGIN
Time: 0.176 ms
launchpad_dogfood=# CREATE INDEX binarypackagepublishinghistory__datecreated__id__idx ON binarypackagepublishinghistory (datecreated, id);
CREATE INDEX
Time: 122697.012 ms
launchpad_dogfood=# EXPLAIN (ANALYZE ON, BUFFERS ON) SELECT COUNT(*) FROM BinaryPackageName, BinaryPackagePublishingHistory, BinaryPackageRelease WHERE BinaryPackagePublishingHistory.archive = 1 AND BinaryPackagePublishingHistory.binarypackagerelease = BinaryPackageRelease.id AND BinaryPackagePublishingHistory.binarypackagename = BinaryPackageName.id AND BinaryPackagePublishingHistory.status IN (2) AND BinaryPackagePublishingHistory.datecreated >= '2015-03-01 00:00:00+00:00' AND (1=1);
                                                                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate (cost=78891.93..78891.94 rows=1 width=0) (actual time=617.535..617.535 rows=1 loops=1)
   Buffers: shared hit=28 read=10081, temp written=823
   -> Nested Loop (cost=49327.12..78886.21 rows=2289 width=0) (actual time=617.532..617.532 rows=0 loops=1)
         Buffers: shared hit=28 read=10081, temp written=823
         -> Hash Join (cost=49326.69..59672.10 rows=2289 width=4) (actual time=617.532..617.532 rows=0 loops=1)
               Hash Cond: (binarypackagepublishinghistory.binarypackagename = binarypackagename.id)
               Buffers: shared hit=28 read=10081, temp written=823
               -> Bitmap Heap Scan on binarypackagepublishinghistory (cost=36476.82..45298.46 rows=2289 width=8) (actual time=415.358..415.358 rows=0 loops=1)
                     Recheck Cond: ((datecreated >= '2015-03-01 00:00:00'::timestamp without time zone) AND (archive = 1) AND (status = 2))
                     Buffers: shared hit=8 read=7115
                     -> BitmapAnd (cost=36476.82..36476.82 rows=2289 width=0) (actual time=415.355..415.355 rows=0 loops=1)
                           Buffers: shared hit=8 read=7115
                           -> Bitmap Index Scan on binarypackagepublishinghistory__datecreated__id__idx (cost=0.00..1563.08 rows=68068 width=0) (actual time=0.034..0.034 rows=98 loops=1)
                                 Index Cond: (datecreated >= '2015-03-01 00:00:00'::timestamp without time zone)
                                 Buffers: shared hit=4 read=1
                           -> Bitmap Index Scan on securebinarypackagepublishinghistory__archive__status__idx (cost=0.00..34912.36 rows=1664779 width=0) (actual time=415.313..415.313 rows=2603559 loops=1)
                                 Index Cond: ((archive = 1) AND (status = 2))
                                 Buffers: shared hit=4 read=7114
               -> Hash (cost=6721.05..6721.05 rows=373505 width=4) (actual time=200.514..200.514 rows=373505 loops=1)
                     Buckets: 16384 Batches: 4 Memory Usage: 3289kB
                     Buffers: shared hit=20 read=2966, temp written=820
                     -> Seq Scan on binarypackagename (cost=0.00..6721.05 rows=373505 width=4) (actual time=0.017..87.032 rows=373505 loops=1)
                           Buffers: shared hit=20 read=2966
         -> Index Only Scan using binarypackage_pkey on binarypackagerelease (cost=0.44..8.38 rows=1 width=4) (never executed)
               Index Cond: (id = binarypackagepublishinghistory.binarypackagerelease)
               Heap Fetches: 0
 Total runtime: 617.595 ms
(27 rows)

Time: 620.716 ms

To post a comment you must log in.
William Grant (wgrant) wrote :
Download full text (3.2 KiB)

I'd forgotten quite how ghastly Archive.getAllPublishedBinaries was. Its only redeeming feature is created_since_date, but even that is broken to the point of uselessness.

600ms is still very slow, but fortunately the queries aren't very representative of reality: except in beta and 1.0, the API won't do a COUNT(*) unless you explicitly ask for it. But still, we can cut the COUNT(*) down to about 70ms by adding an (archive, status, datecreated) index and dropping the pointless BPR and BPN joins unless the sort or filter require them. The search over the entire history of the archive is gone, and a very specific single index scan resolves the result set.

But what a client really cares about are the page queries, and before we dive into that detail we need a quick detour to cover some of the API's design flaws around paging.

Implementing a correct client on top of a Launchpad collection API is extremely difficult in general, as the pages can move under you. For example, if I ask for a batch of 75 in some order, and then the collection changes before I request the next batch, my second set of 75 won't immediately follow on from the first. There might be duplicates, or some items might not show up at all. This is most obviously a problem for collections without a defined order, but that just means it's more subtly broken for other collections. In this case, it's likely that some ddebs will be missed, as the collection of publications changes frequently.

So this isn't just a matter of performance, but correctness too. There's only one sensible way to implement a client that correctly incrementally updates: have an API that is sorted by some combination of creation date and/or serial, and use StormRangeFactory so you get a consistent threshold rather than an offset that can move under you. It also fixes performance when you're deep in the collection, as you don't need to traverse millions of rows just find the next 75.

In this case, making the collection correctly usable would actually render created_since_date obsolete. I'd introduce a new parameter that causes getAllPublishedBinaries to sort by (datecreated DESC, id DESC) and use StormRangeFactory. A client can then just iterate down the collection until it decides it's seen enough. A recheck window is still required, as publications can appear out of order if a long write transaction is involved, but that would just mean iterating until say an hour before your last check time. Throw in an index on (archive, status, datecreated, id), and you have a very snappy and reliable collection, without needing created_since_date at all.

So I don't think the general (datecreated, id) index is terribly useful. It makes things almost acceptably fast, but doesn't help with correctness, while a slightly wider index combined with minor code changes fixes both. In summary, this is what I'd recommend:

 - Fix getAllPublishedBinaries to join BPR/BPN only when the sort or filter require them.
 - Introduce a new parameter to getAllPublishedBinaries that causes it to sort by (datecreated, id) and use StormRangeFactory.
 - Index BPPH(archive, status, datecreated, id).
 - Port your client to use the new sort ra...

Read more...

William Grant (wgrant) :
review: Needs Fixing (db)
William Grant (wgrant) :
review: Approve (db)

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== added file 'database/schema/patch-2209-53-9.sql'
2--- database/schema/patch-2209-53-9.sql 1970-01-01 00:00:00 +0000
3+++ database/schema/patch-2209-53-9.sql 2015-04-13 18:32:41 +0000
4@@ -0,0 +1,9 @@
5+-- Copyright 2015 Canonical Ltd. This software is licensed under the
6+-- GNU Affero General Public License version 3 (see the file LICENSE).
7+
8+SET client_min_messages=ERROR;
9+
10+CREATE INDEX binarypackagepublishinghistory__archive__datecreated__id__idx
11+ ON binarypackagepublishinghistory (archive, datecreated, id);
12+
13+INSERT INTO LaunchpadDatabaseRevision VALUES (2209, 53, 9);

Subscribers

People subscribed via source and target branches

to status/vote changes: