Launchpad itself

Merge lp:~allenap/launchpad/localpackagediffs-time-out-bug-798301 into lp:launchpad

localpackagediffs-time-out-bug-798301
Merge into devel

Proposed by Gavin Panella on 2011-06-21

Status:

Rejected

Rejected by:

Gavin Panella on 2011-06-22

Proposed branch:

lp:~allenap/launchpad/localpackagediffs-time-out-bug-798301

Merge into:

lp:launchpad

Diff against target:

119 lines (+56/-11)

1 file modified

lib/lp/registry/model/distroseriesdifference.py (+56/-11)

To merge this branch:

bzr merge lp:~allenap/launchpad/localpackagediffs-time-out-bug-798301

Critical

Fix Released

Reviewer	Review Type	Date Requested	Status
Jeroen T. Vermeulen		2011-06-21	Pending
Review via email: mp+65426@code.launchpad.net

Description of the change

A small warning: this branch was born in an atmosphere of
frustration with PostgreSQL and Storm, most especially Storm.

The query that most_recent_publications() issues performs
pathologically as the number of DSDs in the query grows. Going from
75 to 300 DSDs (a fourfold increase) causes a tenfold increase in
execution time.

Some experimentation against production with alternative queries -
forcing inner joins, using temporary tables, using WITH clauses - has
yielded similar or worse performance. The only approach that has shown
promise (and quite a lot of promise: query time down from ~6s to 0.6s)
is breaking the query into small batches and then UNIONing them back
together. A DISTINCT ON clause needs to be applied to the set as a
whole.
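
The batching idea can be sketched in plain Python, without Storm or a
database. This is only an illustration under assumptions: the helper names
strided_batches and most_recent_per_name are hypothetical, the rows are
made-up (name, publication id) pairs, and it is written for Python 3 (hence
// and range) where the branch itself targets Python 2. The first helper
mirrors the islice striding used in the diff; the second emulates what
DISTINCT ON with a descending sort on id would pick, and shows why it has to
run over the combined rows, assuming the same name can come back from more
than one batch.

```python
from itertools import islice


def strided_batches(ids, target_size=35):
    """Split ids into batches of roughly target_size by striding.

    Hypothetical helper: batch k takes elements k, k + n, k + 2n, ...
    where n is the number of batches.
    """
    ids = list(ids)
    batch_count = max(1, len(ids) // target_size)
    return [
        list(islice(ids, start, None, batch_count))
        for start in range(batch_count)
    ]


def most_recent_per_name(rows):
    """Emulate DISTINCT ON (name) ... ORDER BY name, id DESC.

    rows is an iterable of (name, publication_id) pairs; the result maps
    each name to its highest publication id.
    """
    latest = {}
    for name, pub_id in sorted(rows, key=lambda row: (row[0], -row[1])):
        # The first row seen for a name is the one with the highest id.
        latest.setdefault(name, pub_id)
    return latest


# 300 ids split into 300 // 35 == 8 batches of 37 or 38 ids each.
batches = strided_batches(range(1, 301))

# Made-up per-batch results; "bash" appears in two different batches.
batch_results = [
    [("bash", 10), ("apt", 7)],
    [("bash", 12)],
]
combined = [row for batch in batch_results for row in batch]
latest = most_recent_per_name(combined)
```

Deduplicating each batch on its own would keep both "bash" rows, so the
selection only yields one row per name when it is applied after the batches
have been combined.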

However, Storm does not make this easy. See bug 799824 for that. Most
of the complexity in this branch is working around that annoying bug.

Jeroen T. Vermeulen (jtv) wrote on 2011-06-22:

As discussed on IRC: I too find the union trick highly suspicious. And I question the shape of the join:

* Shouldn't the SPPHs be filtered by distroseries?

* Can't the Archive.status check be done outside of the query?

* Does the query really need DSD?

In the extremely lucky case where all of these questions hit pay dirt, you'd end up with a join between just SPPH and SPR.

Gavin Panella (allenap) wrote on 2011-06-22:

Thanks Jeroen, your questions here and on IRC have led to some different thinking which seems to provide a similar performance increase with much less complexity.

Gavin Panella (allenap) wrote on 2011-06-22:

lp:~allenap/launchpad/localpackagediffs-time-out-bug-798301-alt is that simpler fix.

Unmerged revisions

13253. By Gavin Panella on 2011-06-21: Split the query into batches of around 35 DSDs at a time, and UNION it back together.

Preview Diff

 === modified file 'lib/lp/registry/model/distroseriesdifference.py'
 --- lib/lp/registry/model/distroseriesdifference.py	2011-06-14 13:47:51 +0000
 +++ lib/lp/registry/model/distroseriesdifference.py	2011-06-21 22:50:41 +0000
@@ -10,7 +10,10 @@
+     ]
  from collections import defaultdict
--from itertools import chain
++from itertools import (
++    chain,
++    islice,
++    )
  from operator import itemgetter
  import apt_pkg
@@ -20,12 +23,17 @@
+     )
  from lazr.enum import DBItem
  from sqlobject import StringCol
++import storm
  from storm.exceptions import NotOneError
  from storm.expr import (
++    Alias,
      And,
      Column,
      Desc,
++    Select,
++    SQLRaw,
      Table,
++    Union,
+     )
  from storm.locals import (
      Int,
@@ -101,6 +109,21 @@
  from lp.soyuz.model.sourcepackagerelease import SourcePackageRelease
++class FlushingResultSet(storm.store.ResultSet):
++    """A `storm.store.ResultSet` that flushes the cache before executing.
++
++    If implicit flushes are not blocked this will flush the underlying store
++    before being iterated. The store normally flushes its own caches at the
++    right times, but when constructing `ResultSet`s by hand this behaviour is
++    bypassed.
++    """
++
++    def __iter__(self):
++        if self._store._implicit_flush_block_count == 0:
++            self._store.flush()
++        return super(FlushingResultSet, self).__iter__()
++
++
  def most_recent_publications(dsds, in_parent, statuses, match_version=False):
      """The most recent publications for the given `DistroSeriesDifference`s.
@@ -111,12 +134,16 @@
      :param in_parent: A boolean indicating if we should look in the parent
          series' archive instead of the derived series' archive.
      """
--    columns = (
--        DistroSeriesDifference.source_package_name_id,
--        SourcePackagePublishingHistory,
--        )
++    # XXX: GavinPanella 2011-06-21 bug=799824: Storm cannot currently do
++    # DISTINCT ON with a set expression (e.g. on a UNION). A large amount of
++    # the tomfoolery with FindSpec, SQLRaw, FlushingResultSet et al below is
++    # because it's bloody hard to get Storm to just run a query of your
++    # choosing and return a set of *loaded objects*.
++    spec = storm.store.FindSpec(
++        (DistroSeriesDifference.source_package_name_id,
++         SourcePackagePublishingHistory))
++    columns, tables = spec.get_columns_and_tables()
      conditions = And(
--        DistroSeriesDifference.id.is_in(dsd.id for dsd in dsds),
          SourcePackagePublishingHistory.archiveID == Archive.id,
          SourcePackagePublishingHistory.sourcepackagereleaseID == (
              SourcePackageRelease.id),
@@ -149,18 +176,36 @@
              conditions,
              SourcePackageRelease.version == version_column,
+             )
++    # The query performance drops exponentially with number of DSDs, so we
++    # break it up into a number of smaller queries and then UNION them.
++    dsd_ids = [dsd.id for dsd in dsds]
++    batch_count = max(1, len(dsd_ids) / 35)
++    batches = (
++        islice(dsd_ids, batch, None, batch_count)
++        for batch in xrange(batch_count))
++    batch_conditions = (
++        DistroSeriesDifference.id.is_in(batch)
++        for batch in batches)
++    subselects = (
++        Select(columns, And(conditions, batch_condition))
++        for batch_condition in batch_conditions)
++    union = Union(*subselects)
++    union_alias = Alias(union)
++    union_column = lambda column: Column(column.name, union_alias)
      # The sort order is critical so that the DISTINCT ON clause selects the
      # most recent publication (i.e. the one with the highest id).
      order_by = (
--        DistroSeriesDifference.source_package_name_id,
--        Desc(SourcePackagePublishingHistory.id),
++        union_column(DistroSeriesDifference.source_package_name_id),
++        Desc(union_column(SourcePackagePublishingHistory.id)),
+         )
      distinct_on = (
--        DistroSeriesDifference.source_package_name_id,
++        union_column(DistroSeriesDifference.source_package_name_id),
+         )
++    select = Select(
++        SQLRaw("*"), tables=union_alias, order_by=order_by,
++        distinct=distinct_on)
      store = IStore(SourcePackagePublishingHistory)
--    return store.find(
--        columns, conditions).order_by(*order_by).config(distinct=distinct_on)
++    return FlushingResultSet(store, spec, select=select)
  def most_recent_comments(dsds):
