Code review comment for lp:~cjwatson/launchpad/db-index-bpph-datecreated

Revision history for this message
William Grant (wgrant) wrote :

I'd forgotten quite how ghastly Archive.getAllPublishedBinaries was. Its only redeeming feature is created_since_date, but even that is broken to the point of uselessness.

600ms is still very slow, but fortunately the queries aren't very representative of reality: except in beta and 1.0, the API won't do a COUNT(*) unless you explicitly ask for it. But still, we can cut the COUNT(*) down to about 70ms by adding an (archive, status, datecreated) index and dropping the pointless BPR and BPN joins unless the sort or filter require them. The search over the entire history of the archive is gone, and a very specific single index scan resolves the result set.

But what a client really cares about are the page queries, and before we dive into that detail we need a quick detour to cover some of the API's design flaws around paging.

Implementing a correct client on top of a Launchpad collection API is extremely difficult in general, as the pages can move under you. For example, if I ask for a batch of 75 in some order, and then the collection changes before I request the next batch, my second set of 75 won't immediately follow on from the first. There might be duplicates, or some items might not show up at all. This is most obviously a problem for collections without a defined order, but that just means it's more subtly broken for other collections. In this case, it's likely that some ddebs will be missed, as the collection of publications changes frequently.

So this isn't just a matter of performance, but correctness too. There's only one sensible way to implement a client that correctly incrementally updates: have an API that is sorted by some combination of creation date and/or serial, and use StormRangeFactory so you get a consistent threshold rather than an offset that can move under you. It also fixes performance when you're deep in the collection, as you don't need to traverse millions of rows just find the next 75.

In this case, making the collection correctly usable would actually render created_since_date obsolete. I'd introduce a new parameter that causes getAllPublishedBinaries to sort by (datecreated DESC, id DESC) and use StormRangeFactory. A client can then just iterate down the collection until it decides it's seen enough. A recheck window is still required, as publications can appear out of order if a long write transaction is involved, but that would just mean iterating until say an hour before your last check time. Throw in an index on (archive, status, datecreated, id), and you have a very snappy and reliable collection, without needing created_since_date at all.

So I don't think the general (datecreated, id) index is terribly useful. It makes things almost acceptably fast, but doesn't help with correctness, while a slightly wider index combined with minor code changes fixes both. In summary, this is what I'd recommend:

 - Fix getAllPublishedBinaries to join BPR/BPN only when the sort or filter require them.
 - Introduce a new parameter to getAllPublishedBinaries that causes it to sort by (datecreated, id) and use StormRangeFactory.
 - Index BPPH(archive, status, datecreated, id).
 - Port your client to use the new sort rather than created_since_date, and iterate down to a point say an hour before it last looked.

« Back to merge proposal