Bug #588684 “builders index timing out after 10.05 release” : Bugs : Launchpad itself

Michael Nelson (michael.nelson) on 2010-06-02

Changed in soyuz:
status:	New → In Progress
importance:	Undecided → Critical
assignee:	nobody → Michael Nelson (michael.nelson)

Revision history for this message

Michael Nelson (michael.nelson) wrote on 2010-06-02:

#1

Requesting that we cowboy the following until we sort out the timing issue:

http://pastebin.ubuntu.com/443324/

Revision history for this message

Michael Nelson (michael.nelson) wrote on 2010-06-02:

#2

And the problem statements analysed:
http://pastebin.ubuntu.com/443363/

Revision history for this message

Michael Nelson (michael.nelson) wrote on 2010-06-02:

#3

The cowboy has been applied to edge (and is currently being applied to production) so that the builders page is up again (with 'unknown' queue lengths/times).

stub managed to get a real explain analyse from production, which shows quite a different plan from that on dogfood. At this stage we're still no closer to finding out why the query suddenly blew up. The change was straight-forward: http://pastebin.ubuntu.com/443408/ . stub sees that without the archive.enabled/virtualized clauses, it works fine (but that hasn't changed).

Details:
{{{
16:28 * stub wonders if the database planner statistics had not been updated
16:29 < stub> hmm..
16:36 < stub> noodles: Got a real query from the postgresql logs
16:37 < noodles> stub: great... does it tell you something more useful?
16:37 < stub> noodles: https://pastebin.canonical.com/32859/
16:38 < stub> At one point, it is materializing 920064 rows
16:38 < noodles> a bad join? /me looks closer.
16:38 < stub> But it thought it was only going to materialize 5k rows
16:51 < noodles> stub: were you able to see if it was related to the planner stats?
17:00 < stub> noodles: I think it is because the bulk of our archives are enabled and don't require virtualization, but the planner isn't smart enough to realize that the data is skewed in that way, and it joins tables in a suboptimal order.
17:02 < noodles> stub: s/don't require/do require/? (most ppas require virtualization), but I'm glad you're seeing reasons... I keep going over the query itself but can't see anything strange.
17:02 < stub> The enabled flag makes the biggest difference.
17:03 < noodles> Aha. But why would that result in 900k rows instead of 5k during a join?
17:12 < stub> noodles: I'm not sure if the plan is wrong - the numbers are skewed, but it still might be the fastest plan.
17:13 < noodles> hrm.
17:14 * noodles gets a diff of what changed for that specific query.
17:20 < noodles> stub: the diff itself looks fine http://pastebin.ubuntu.com/443408/ and the indexes are there for the foreign keys etc.
17:21 < noodles> Chex: great, thankyou... I've just checked https://edge.launchpad.net/builders and it worked perfectly. Please update production as well.
17:21 < stub> If I remove the enabled and requires_virtual clauses, it works fine
17:21 < stub> Well... 0.5-1 second
17:22 < Chex> noodles: great, thanks, on it now
17:26 < noodles> stub: there are no indexes on archive.enabled or archive.requires_virtual (nor have there been in the past)... but I assume that would help?
17:27 < stub> no - I've tried that already
17:27 < noodles> :/
}}}

The cowboy has been applied to edge (and is currently being applied to production) so that the builders page is up again (with 'unknown' queue lengths/times).

stub managed to get a real explain analyse from production, which shows quite a different plan from that on dogfood. At this stage we're still no closer to finding out why the query suddenly blew up. The change was straight-forward: http://pastebin.ubuntu.com/443408/ . stub sees that without the archive.enabled/virtualized clauses, it works fine (but that hasn't changed).

Details:
{{{
16:28  * stub wonders if the database planner statistics had not been updated
16:29 < stub> hmm..
16:36 < stub> noodles: Got a real query from the postgresql logs
16:37 < noodles> stub: great... does it tell you something more useful?
16:37 < stub> noodles: https://pastebin.canonical.com/32859/
16:38 < stub> At one point, it is materializing 920064 rows
16:38 < noodles> a bad join? /me looks closer.
16:38 < stub> But it thought it was only going to materialize 5k rows
16:51 < noodles> stub: were you able to see if it was related to the planner stats?
17:00 < stub> noodles: I think it is because the bulk of our archives are enabled and don't require virtualization, but the planner isn't smart enough to realize that the data is skewed in that way, and it joins tables in a suboptimal order.
17:02 < noodles> stub: s/don't require/do require/? (most ppas require virtualization), but I'm glad you're seeing reasons... I keep going over the query itself but can't see anything strange.
17:02 < stub> The enabled flag makes the biggest difference. 
17:03 < noodles> Aha. But why would that result in 900k rows instead of 5k during a join?
17:12 < stub> noodles: I'm not sure if the plan is wrong - the numbers are skewed, but it still might be the fastest plan.
17:13 < noodles> hrm.
17:14  * noodles gets a diff of what changed for that specific query.
17:20 < noodles> stub: the diff itself looks fine http://pastebin.ubuntu.com/443408/ and the indexes are there for the foreign keys etc.
17:21 < noodles> Chex: great, thankyou... I've just checked https://edge.launchpad.net/builders and it worked perfectly. Please update production as well.
17:21 < stub> If I remove the enabled and requires_virtual clauses, it works fine
17:21 < stub> Well... 0.5-1 second
17:22 < Chex> noodles: great, thanks, on it now
17:26 < noodles> stub: there are no indexes on archive.enabled or archive.requires_virtual (nor have there been in the past)... but I assume that would help?
17:27 < stub> no - I've tried that already
17:27 < noodles> :/
}}}

Revision history for this message

Michael Nelson (michael.nelson) wrote on 2010-06-02:

#4

Updating importance now that the patch is applied and the page is no longer timing out.

Changed in soyuz:
importance:	Critical → High

Revision history for this message

Julian Edwards (julian-edwards) wrote on 2010-06-03:

#5

There's currently a disabled copy archive that has thousands of pending builds which will never dispatch. Could this be affecting the query plan with the modified query?

Revision history for this message

Stuart Bishop (stub) wrote on 2010-06-04:

#6

BuildQueue and BuildPackageJob are irrelevant to the problem. The problem area is seen in this query:

SELECT COUNT(*)
FROM
    BinaryPackageBuild
    JOIN PackageBuild ON BinaryPackageBuild.package_build = PackageBuild.id
    JOIN BuildFarmJob ON PackageBuild.build_farm_job = BuildFarmJob.id
    JOIN Archive ON PackageBuild.archive = Archive.id
    JOIN DistroArchSeries
        ON BinaryPackageBuild.distro_arch_series = DistroArchSeries.id
    JOIN Processor ON DistroArchSeries.processorfamily = Processor.family
WHERE
    BuildFarmJob.status = 0
    AND Processor.id = 5
    AND Archive.enabled = true
    AND Archive.require_virtualized = false;

Query plan is at http://paste.ubuntu.com/444486/

One quick note - we should be using IS TRUE and IS FALSE instead of = TRUE and =FALSE to avoid confusion with SQL's three value boolean logic.

The sequential scan on Archive could be fixed with an index, although this isn't a major contributer to the main problem:

CREATE INDEX archive__require_virtualized__idx
ON Archive(require_virtualized);

We also can make use of an index on BuildFarmJob.status:

CREATE INDEX buildfarmjob__status__idx
ON BuildFarmJob(status);

When Archive and PackageBuild are joined, we see the estimated row count of the selected PackageBuild rows off by 30x. This gets magnified by the number of Archive rows, taking a few seconds and resulting in nearly 1 million rows in the working set instead of the estimated 5.5k. This isn't the core problem though - this is just an artifact that Archive #1 (the main Ubuntu archive) has a comparitively huge number of packages, but the PostgreSQL planner doesn't know this when estimating how many rows will match in PackageBuild.

(To be continued...)

Revision history for this message

Michael Nelson (michael.nelson) wrote on 2010-06-04:

#7

{{{
11:02 < stub> noodles775: So that build page that was timing out. Does this query provide all the information we need for the bits that are timing out? http://paste.ubuntu.com/444495/
11:13 < noodles775> stub: checking
11:14 < stub> noodles775: Things also run more than twice as fast on a slave node (that particular query takes just over 5 seconds on the master, but 1.4 seconds on a slave). I don't think anyone will care if the stats are maybe a few seconds out of date.
...
11:21 < noodles775> stub: great, so I can update the storm code to (1) run on the slave and (2) use the count/sum in the findspec rather than querying once for each. Certainly looks much better.
11:22 < stub> Yup.
11:22 < noodles775> Or should I just use the query verbatim (so we know exactly what's being executed)?
11:22 < noodles775> (and thanks!)
11:22 < stub> Using Storm to generate the query should give you pretty much the same thing
11:23 < stub> You can check by turning on the storm SQL tracing. Or getting a user requested oops report.
11:23 -!- zyga [~zyga@ubuntu/member/zyga] has quit [Ping timeout: 252 seconds]
11:24 < stub> I've been looking into indexes - BuildFarmJob.status and Archive.require_virtualized help a little with the existing query, but not much and not at all with the count/sum query
}}}

Revision history for this message

Stuart Bishop (stub) wrote on 2010-06-04:

#8

Two more points....

Queries on these tables is a lot faster on the slave databases than on the master. Use the slave for a big win.

Also, the following single query can replace the several slow queries being attempted to show the total number of jobs and estimated duration of the various build queues:

http://paste.ubuntu.com/444495/

Revision history for this message

Michael Nelson (michael.nelson) wrote on 2010-06-04:

#9

Just some local testing of the linked branch. Note: locally there are only 2 builder groups - 1 386 virt and 1 386 non-virt, leading to 4 queries replaced by 1. So the time difference is only a 40% improvement.
Local testing:

Clean devel:
At least 41 queries issued in 0.20 seconds OOPS-1616X12
requested oops: http://pastebin.ubuntu.com/444684/ (the last 4 queries)

Above branch:
At least 38 queries issued in 0.14 seconds OOPS-1616X3
requested oops: http://pastebin.ubuntu.com/444681/ (just 1 query at the end).

On edge/production we should see much greater improvements - ~11 builder groups, leading to 22 slow queries replaced by 1. Thanks stub!

Revision history for this message

Ursula Junque (ursinha) wrote on 2010-06-06: Bug fixed by a commit

#10

Fixed in stable r10952 <http://bazaar.launchpad.net/~launchpad-pqm/launchpad/stable/revision/10952>

Changed in soyuz:
milestone:	none → 10.06
status:	In Progress → Fix Committed
tags:	added: qa-needstesting

Revision history for this message

Michael Nelson (michael.nelson) wrote on 2010-06-07:

#11

QA'd on edge:
https://edge.launchpad.net/builders

I'll organise a CP for production now.

tags:

added: qa-ok
removed: qa-needstesting

Michael Nelson (michael.nelson) on 2010-07-07

Changed in soyuz:
status:	Fix Committed → Fix Released

Launchpad itself

builders index timing out after 10.05 release

Bug Description

Related branches

Other bug subscribers

Remote bug watches