Merge lp:~flacoste/launchpad/ppr-constant-memory into lp:launchpad

Proposed by Francis J. Lacoste on 2010-10-29
Status: Merged
Approved by: Gary Poster on 2010-11-04
Approved revision: no longer in the source branch.
Merged at revision: 11872
Proposed branch: lp:~flacoste/launchpad/ppr-constant-memory
Merge into: lp:launchpad
Diff against target: 1043 lines (+656/-229)
5 files modified
lib/lp/scripts/utilities/pageperformancereport.py (+359/-227)
lib/lp/scripts/utilities/tests/test_pageperformancereport.py (+290/-0)
setup.py (+0/-1)
utilities/page-performance-report.ini (+7/-0)
versions.cfg (+0/-1)
To merge this branch: bzr merge lp:~flacoste/launchpad/ppr-constant-memory
Reviewer: Gary Poster (community), requested 2010-10-29, approved 2010-11-04
Review via email: mp+39666@code.launchpad.net

Description of the Change

This is my third (and hopefully final) round of refactoring of the page
performance report.

The second, SQLite-based version, which ran in constant memory, was taking a
long time (>8 hours) to compute the report for 8M requests. It was also
thrashing the machine with a lot of IO.

After searching the net, I found a couple of ways to compute the stats we are
gathering on-line, as the data streams in, with minimal storage. Those are the
algorithms I'm implementing here.
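For the mean/std part this boils down to Welford's running-variance update. A
condensed sketch (the full version is the OnlineStatsCalculator class in the
diff below):

import math

class OnlineStatsCalculator:
    """Running count/sum/mean/std needing only constant storage."""

    def __init__(self):
        self.count = 0
        self.sum = 0
        self.mean = 0.0
        self.M2 = 0.0  # Sum of squared differences from the running mean.

    def update(self, x):
        if x is None:
            return
        self.count += 1
        self.sum += x
        delta = x - self.mean
        self.mean = float(self.sum) / self.count
        self.M2 += delta * (x - self.mean)

    @property
    def std(self):
        if self.count == 0:
            return 0
        return math.sqrt(self.M2 / self.count)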

A big change is that we are no longer computing the exact median, but only an
approximation using a variation on the median of medians. That's where the
bulk of the memory goes. It took 900M on my local machine to compute the
report for 300K records. I'll see how much we use on devpad when generating a
daily report. If that's too much, I suggest we simply drop the median
computation. (It took 2m42s of user time to generate, which is way better than
both SQLite algorithms and about 50s better than the keep-everything-in-memory
version.)
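For reference, the estimator keeps a cascade of small buckets: whenever a
bucket fills up, its median is promoted to the next level and the bucket is
reused, so only a handful of values per level are ever kept. Condensed from
the OnlineApproximateMedian class in the diff below:

class OnlineApproximateMedian:
    """Approximate the median while seeing each value only once."""

    def __init__(self, bucket_size=9):
        self.bucket_size = bucket_size
        # Index of the median within a full (odd-sized) bucket.
        self.median_idx = (bucket_size - 1) // 2
        self.buckets = []

    def update(self, x):
        i = 0
        while True:
            # Create the bucket for this level on demand.
            if i == len(self.buckets):
                self.buckets.append([])
            bucket = self.buckets[i]
            bucket.append(x)
            if len(bucket) < self.bucket_size:
                break
            # Bucket is full: promote its median to the next level up.
            x = sorted(bucket)[self.median_idx]
            del bucket[:]
            i += 1

The final estimate is then picked from the values still sitting in the
buckets, weighting each one by the level it reached (see the median property
in the diff).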

Other changes:
- I switched to using the gzip and bz2 Python built-in libraries instead of
  external processes (see the sketch after this list). Hopefully this will
  get rid of some warnings that were happening in production with unclosed
  streams.
- I added saving the 99th percentile and mean values for selected categories
  in cricket format. This will allow us to chart our progress using tuolumne.
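The decompression change amounts to the following (the helper name below is
illustrative; the real change is in the log-opening code shown in the diff):

import bz2
import gzip
import os.path

def open_log(filename, mode='r'):
    # Decompress in-process with the stdlib instead of piping the file
    # through external bunzip2/gunzip processes.
    ext = os.path.splitext(filename)[1]
    if ext == '.bz2':
        return bz2.BZ2File(filename, 'r')
    elif ext == '.gz':
        return gzip.GzipFile(filename, 'r')
    else:
        return open(filename, mode)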

Francis J. Lacoste (flacoste) wrote :

Turns out that all the memory was taken by the url_times cache. In 300k requests, we have nearly 100k different URLs! All of this to generate the top 200. So I now limit the size of that cache to 50 times the top_n we are looking for, culling it by 10% every time we reach the limit.
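The culling step is small; here it is as a standalone sketch (the real code
lives inline in RequestTimes.add_request, and the function name below is just
for illustration):

def cull_url_cache(url_times, cache_size):
    """Drop the cached URLs least likely to end up in the top list.

    url_times maps a URL to its stats object; once the cache grows past
    cache_size, only the 90% of entries with the highest total_time are
    kept.
    """
    if len(url_times) <= cache_size:
        return url_times
    cutoff = int(cache_size * 0.90)
    return dict(
        sorted(url_times.items(),
               key=lambda item: item[1].total_time,
               reverse=True)[:cutoff])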

I've tried various sizes and cull ratios, and this seems a good size/time/accuracy compromise. Generating the report now takes a constant 137M, and the top-200 report is identical to one generated with all of the URLs kept in memory.

It also reduced the time to generate the report by 9 seconds.

I also now persist the RequestTimes data structure to disk. This will allow us to generate aggregate reports very quickly. That structure takes ~1.8M of disk space once compressed.
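Saving is just a bz2-compressed cPickle dump (see the diff). Reading it back
for an aggregate report would look something like this; the load-and-merge
step itself isn't part of this branch, so treat it as a sketch:

import bz2
import cPickle

def load_request_times(path):
    # Reload a stats.pck.bz2 file written by a previous report run.
    stats_file = bz2.BZ2File(path, 'r')
    try:
        return cPickle.load(stats_file)
    finally:
        stats_file.close()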

Gary Poster (gary) wrote :

This is very cool, both practically and in its implementation. Thank you for doing it.

Thank you for working on the median changes that we discussed, and I look forward to seeing them; I think what we discussed makes a lot more sense. Even if that new approach doesn't, as I said, tossing the smaller buckets makes more sense than what was there when I first looked, AFAICT. And even if *that* is wrong, at least a comment describing what is going on and why is good. :-) The paper is not clear on the implementation intent in this case, IMO.

Everything else is small, and mostly optional.

Median docstring comments:

 * In "(It will hold in memory log B of n elements.)" I suggest either deleting "B" (vaguely asserting that it has good characteristics) or clarify that B is bucket_size (AIUI).

 * In "It will compute the median among bucket_size values. And the median among those." I was a bit concerned that it didn't describe what was actually going on but it's at least a start for someone trying to understand the code. Only change it if you want to, and have a good idea on how to improve it. :-) Otherwise, at least it helps.

Also in the median code, I'd prefer that ``self.median_idx = bucket_size/2`` be written with explicit floor division, ``self.median_idx = bucket_size//2``, but again, that's not a big deal.

The docstring of RequestTimes is incomplete (currently """Collect the """). Similarly, the add_request method's docstring is now """Add a request to the .""" Probably ought to fix both one way or another.

The behavior of the top urls cache (with its top_urls_cache_size) makes me a bit nervous. (You clarified on IRC that we get 100k unique URLs within 300k requests, which makes more sense with the code). It is a heuristic that experientially works out well with the *current* state of things. If something changes, it may no longer be reasonable. In theory, I'd love to have some verification that the current heuristic will maintain a relationship of keeping around a decent amount (currently approximately 1/3) of the URLs at a time. The mad scientist in me would be interested in trying out a bloom filter for the purpose of verifying the approximate relative size of the cache to the total set of URLs, though a better test would be to occasionally run this with a bunch of memory and make sure that the results still jibe. In any case, practically and for this landing, I just request a simple comment warning that this is a heuristic, and why we believe this heuristic is reasonable, and how the heuristic might come to fail us (i.e., if the cache keeps around less and less of the full set of unique URLs).

Awesome that you added tests, thank you. Take this or leave it, as you prefer:

 * It strikes me you could seed the random number generator instead of hard-coding the shuffle. You probably thought about it and made a conscious choice. Dunno which is better. <shrug>

 * In your test, you verify that ``self.assertEquals(52, self.estimator.median)``. Since you are using the big juicy lp TestCase with all the testtools stuff, something like ``self.assertIn(self.estimator.median, range(48, 53))`` might convey the test's intent better. Again, <shrug>.

Thanks ...


Francis J. Lacoste (flacoste) wrote :

On November 4, 2010, Gary Poster wrote:
> This is very cool, both practically and in its implementation. Thank you
> for doing it.
>
> Thank you for working on the median changes that we discussed, and I look
> forward to seeing them; I think what we discussed makes a lot more sense.
> Even if that new approach doesn't, as I said, tossing the smaller buckets
> makes more sense than what was there when I first looked, AFAICT. And
> even if *that* is wrong, at least a comment describing what is going on
> and why is good. :-) The paper is not clear on the implementation intent
> in this case, IMO.
>

Your suggested algorithm change works much better in practice:

# 100 rounds, create a 10000 array of random numbers from 0 to 10
# and compare the approximation to true median
>>> simulate(100, 9, 10000)
New: min=-2, max=3, mean=0.190000, median=0, std=1.006926
Old: min=-6, max=9, mean=1.850000, median=2, std=3.188652
>>> simulate(100, 9, 10000)
New: min=-2, max=3, mean=-0.010000, median=0, std=1.072334
Old: min=-4, max=9, mean=1.770000, median=2, std=2.781564
>>> execfile("median.py")
>>> simulate(100, 9, 10000)
New: min=-258, max=214, mean=-20.560000, median=-28, std=92.003730
Old: min=-629, max=800, mean=129.400000, median=118, std=292.351809

# Same thing but with a random number from 0 to 10000
>>> simulate(100, 9, 10000)
New: min=-230, max=262, mean=10.550000, median=8, std=90.401590
Old: min=-612, max=805, mean=154.820000, median=167, std=307.682381
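For reference, the new selection step weights each surviving value by the
number of original inputs it stands for and then picks the weighted median
without expanding the list. Condensed from the median property in the diff
(the standalone function name is just for illustration):

def weighted_median(buckets, bucket_size):
    # Each value still sitting at level i stands for bucket_size**i of
    # the original inputs, so weight it accordingly instead of expanding
    # the list in memory.
    candidates = []
    total_weight = 0
    for level, bucket in enumerate(buckets):
        weight = bucket_size ** level
        for x in bucket:
            total_weight += weight
            candidates.append((x, weight))
    if total_weight == 0:
        return 0
    median = (total_weight - 1) // 2
    weighted_idx = 0
    for x, weight in sorted(candidates):
        weighted_idx += weight
        if weighted_idx > median:
            return x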

> Everything else is small, and mostly optional.
>
> Median docstring comments:
>
> * In "(It will hold in memory log B of n elements.)" I suggest either
> deleting "B" (vaguely asserting that it has good characteristics) or
> clarify that B is bucket_size (AIUI).

Yes, that's a left-over from a previous edit.
>
> * In "It will compute the median among bucket_size values. And the median
> among those." I was a bit concerned that it didn't describe what was
> actually going on but it's at least a start for someone trying to
> understand the code. Only change it if you want to, and have a good idea
> on how to improve it. :-) Otherwise, at least it helps.

I wanted to hint that the algorithm is a variation on Tukey's median of medians.
I left it as is.

>
> Also in the median code, I'd prefer ``self.median_idx = bucket_size/2``
> using the explicit floor division of ``self.median_idx = bucket_size//2``
> but again, that's not a big deal.

Changed, and I now also use the correct formula, (bucket_size-1)//2. Since
bucket_size is always odd it wasn't a problem, but it's better to be
consistent with the other place I do this.

>
> The docstring of RequestTimes is incomplete (currently """Collect the """).
> Similarly, add_request method is now """Add a request to the ."""
> Probably ought to fix that one way or another.

I fixed both.

>
> The behavior of the top urls cache (with its top_urls_cache_size) makes me
> a bit nervous. (You clarified on IRC that we get 100k unique URLs within
> 300k requests, which makes more sense with the code). It is a heuristic
> that experientially works out well with the *current* state of things. If
> something changes, it may no longer be reasonab...


1=== modified file 'lib/lp/scripts/utilities/pageperformancereport.py'
2--- lib/lp/scripts/utilities/pageperformancereport.py 2010-11-01 20:34:41 +0000
3+++ lib/lp/scripts/utilities/pageperformancereport.py 2010-11-04 21:19:55 +0000
4@@ -118,7 +118,7 @@
5 """Approximate the median of a set of elements.
6
7 This implements a space-efficient algorithm which only sees each value
8- once. (It will hold in memory log B of n elements.)
9+ once. (It will hold in memory log bucket_size of n elements.)
10
11 It was described and analysed in
12 D. Cantone and M.Hofri,
13@@ -135,14 +135,14 @@
14
15 It approximates the median by finding the median among each
16 successive bucket_size element. And then using these medians for other
17- round of selection.
18+ rounds of selection.
19
20 The bucket size should be a low odd-integer.
21 """
22 self.count = 0
23 self.bucket_size = bucket_size
24 # Index of the median in a completed bucket.
25- self.median_idx = bucket_size/2
26+ self.median_idx = (bucket_size-1)//2
27 self.buckets = []
28
29 def update(self, x):
30@@ -182,12 +182,19 @@
31 weight = self.bucket_size ** i
32 for x in bucket:
33 total_weight += weight
34- candidates.append([weight, x])
35- # Make weight relative.
36- candidates = [
37- ((float(weight)/total_weight)*x, x)
38- for weight, x in candidates]
39- return sorted(candidates)[(len(candidates)-1)/2][1]
40+ candidates.append([x, weight])
41+ # Each weight is the equivalent of having the candidates appear
42+ # that number of times in the array.
43+ # So buckets like [[1, 2], [2, 3], [4, 2]] would be expanded to
44+ # [1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4,
45+ # 4, 4, 4, 4, 4] and we find the median of that list (2).
46+ # We don't expand the items to conserve memory.
47+ median = (total_weight-1) / 2
48+ weighted_idx = 0
49+ for x, weight in sorted(candidates):
50+ weighted_idx += weight
51+ if weighted_idx > median:
52+ return x
53
54
55 class Stats:
56@@ -336,12 +343,20 @@
57
58
59 class RequestTimes:
60- """Collect the """
61+ """Collect statistics from requests.
62+
63+ Statistics are updated by calling the add_request() method.
64+
65+ Statistics for mean/stddev/total/median for request times, SQL times and
66+ number of SQL statements are collected.
67+
68+ They are grouped by Category, URL or PageID.
69+ """
70
71 def __init__(self, categories, options):
72 self.by_pageids = options.pageids
73 self.top_urls = options.top_urls
74- # We only keep in memory 20 times the number of URLs we want to
75+ # We only keep in memory 50 times the number of URLs we want to
76 # return. The number of URLs can go pretty high (because of the
77 # distinct query parameters).
78 #
79@@ -352,6 +367,9 @@
80 #
81 # Keeping 10 times or culling at 90% generated a near-identical report
82 # (it differed a little in the tail.)
83+ #
84+ # The size/cull parameters might need to change if the requests
85+ # distribution become very different than what it currently is.
86 self.top_urls_cache_size = self.top_urls * 50
87
88 # Histogram has a bin per second up to 1.5 our timeout.
89@@ -363,7 +381,7 @@
90 self.pageid_times = {}
91
92 def add_request(self, request):
93- """Add a request to the ."""
94+ """Add request to the set of requests we collect stats for."""
95 for category, stats in self.category_times:
96 if category.match(request):
97 stats.update(request)
98
99=== modified file 'lib/lp/scripts/utilities/tests/test_pageperformancereport.py'
100--- lib/lp/scripts/utilities/tests/test_pageperformancereport.py 2010-11-01 20:25:55 +0000
101+++ lib/lp/scripts/utilities/tests/test_pageperformancereport.py 2010-11-04 20:47:37 +0000
102@@ -66,11 +66,11 @@
103 # Median is an approximation.
104 # Real values are: 2.50, 2.20, 30
105 (Category('All', ''), FakeStats(
106- total_hits=12, total_time=62.60, mean=5.22, median=1.0, std=5.99,
107- total_sqltime=42, mean_sqltime=3.82, median_sqltime=1.3,
108+ total_hits=12, total_time=62.60, mean=5.22, median=4.20, std=5.99,
109+ total_sqltime=42, mean_sqltime=3.82, median_sqltime=3.0,
110 std_sqltime=3.89,
111 total_sqlstatements=1392, mean_sqlstatements=126.55,
112- median_sqlstatements=16, std_sqlstatements=208.94,
113+ median_sqlstatements=56, std_sqlstatements=208.94,
114 histogram=[[0, 2], [1, 2], [2, 2], [3, 1], [4, 2], [5, 3]],
115 )),
116 (Category('Test', ''), FakeStats()),
117@@ -282,8 +282,8 @@
118 def test_approximage_median_is_good_enough(self):
119 for x in SHUFFLE_RANGE_100:
120 self.estimator.update(x)
121- # True median is 50, 52 is good enough :-)
122- self.assertEquals(52, self.estimator.median)
123+ # True median is 50, 49 is good enough :-)
124+ self.assertIn(self.estimator.median, range(49,52))
125
126
127 def test_suite():
Gary Poster (gary) wrote :

Looks great!

One more fly-by: in the diff I noticed "test_approximage_median_is_good_enough" as the test name: I assume "approximate" is intended.

Thanks again

review: Approve

Preview Diff

1=== modified file 'lib/lp/scripts/utilities/pageperformancereport.py'
2--- lib/lp/scripts/utilities/pageperformancereport.py 2010-10-25 21:47:16 +0000
3+++ lib/lp/scripts/utilities/pageperformancereport.py 2010-11-04 21:26:27 +0000
4@@ -6,19 +6,20 @@
5 __metaclass__ = type
6 __all__ = ['main']
7
8+import bz2
9+import cPickle
10 from cgi import escape as html_quote
11 from ConfigParser import RawConfigParser
12+import csv
13 from datetime import datetime
14+import gzip
15+import math
16 import os.path
17 import re
18-import subprocess
19 from textwrap import dedent
20-import sqlite3
21-import tempfile
22+import textwrap
23 import time
24-import warnings
25
26-import numpy
27 import simplejson as json
28 import sre_constants
29 import zc.zservertracelog.tracereport
30@@ -27,9 +28,6 @@
31 from canonical.launchpad.scripts.logger import log
32 from lp.scripts.helpers import LPOptionParser
33
34-# We don't care about conversion to nan, they are expected.
35-warnings.filterwarnings(
36- 'ignore', '.*converting a masked element to nan.', UserWarning)
37
38 class Request(zc.zservertracelog.tracereport.Request):
39 url = None
40@@ -58,6 +56,7 @@
41
42 Requests belong to a Category if the URL matches a regular expression.
43 """
44+
45 def __init__(self, title, regexp):
46 self.title = title
47 self.regexp = regexp
48@@ -71,8 +70,135 @@
49 return cmp(self.title.lower(), other.title.lower())
50
51
52+class OnlineStatsCalculator:
53+ """Object that can compute count, sum, mean, variance and median.
54+
55+ It computes these value incrementally and using minimal storage
56+ using the Welford / Knuth algorithm described at
57+ http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm
58+ """
59+
60+ def __init__(self):
61+ self.count = 0
62+ self.sum = 0
63+ self.M2 = 0.0 # Sum of square difference
64+ self.mean = 0.0
65+
66+ def update(self, x):
67+ """Incrementally update the stats when adding x to the set.
68+
69+ None values are ignored.
70+ """
71+ if x is None:
72+ return
73+ self.count += 1
74+ self.sum += x
75+ delta = x - self.mean
76+ self.mean = float(self.sum)/self.count
77+ self.M2 += delta*(x - self.mean)
78+
79+ @property
80+ def variance(self):
81+ """Return the population variance."""
82+ if self.count == 0:
83+ return 0
84+ else:
85+ return self.M2/self.count
86+
87+ @property
88+ def std(self):
89+ """Return the standard deviation."""
90+ if self.count == 0:
91+ return 0
92+ else:
93+ return math.sqrt(self.variance)
94+
95+
96+class OnlineApproximateMedian:
97+ """Approximate the median of a set of elements.
98+
99+ This implements a space-efficient algorithm which only sees each value
100+ once. (It will hold in memory log bucket_size of n elements.)
101+
102+ It was described and analysed in
103+ D. Cantone and M.Hofri,
104+ "Analysis of An Approximate Median Selection Algorithm"
105+ ftp://ftp.cs.wpi.edu/pub/techreports/pdf/06-17.pdf
106+
107+ This algorithm is similar to Tukey's median of medians technique.
108+ It will compute the median among bucket_size values. And the median among
109+ those.
110+ """
111+
112+ def __init__(self, bucket_size=9):
113+ """Creates a new estimator.
114+
115+ It approximates the median by finding the median among each
116+ successive bucket_size element. And then using these medians for other
117+ rounds of selection.
118+
119+ The bucket size should be a low odd-integer.
120+ """
121+ self.count = 0
122+ self.bucket_size = bucket_size
123+ # Index of the median in a completed bucket.
124+ self.median_idx = (bucket_size-1)//2
125+ self.buckets = []
126+
127+ def update(self, x):
128+ """Update with x."""
129+ if x is None:
130+ return
131+
132+ self.count += 1
133+ i = 0
134+ while True:
135+ # Create bucket on demand.
136+ if i == len(self.buckets):
137+ self.buckets.append([])
138+ bucket = self.buckets[i]
139+ bucket.append(x)
140+ if len(bucket) == self.bucket_size:
141+ # Select the median in this bucket, and promote it.
142+ x = sorted(bucket)[self.median_idx]
143+ # Free the bucket for the next round.
144+ del bucket[:]
145+ i += 1
146+ continue
147+ else:
148+ break
149+
150+ @property
151+ def median(self):
152+ """Return the median."""
153+ if self.count == 0:
154+ return 0
155+
156+ # Find the 'weighted' median by assigning a weight to each
157+ # element proportional to how far they have been selected.
158+ candidates = []
159+ total_weight = 0
160+ for i, bucket in enumerate(self.buckets):
161+ weight = self.bucket_size ** i
162+ for x in bucket:
163+ total_weight += weight
164+ candidates.append([x, weight])
165+ # Each weight is the equivalent of having the candidates appear
166+ # that number of times in the array.
167+ # So buckets like [[1, 2], [2, 3], [4, 2]] would be expanded to
168+ # [1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4,
169+ # 4, 4, 4, 4, 4] and we find the median of that list (2).
170+ # We don't expand the items to conserve memory.
171+ median = (total_weight-1) / 2
172+ weighted_idx = 0
173+ for x, weight in sorted(candidates):
174+ weighted_idx += weight
175+ if weighted_idx > median:
176+ return x
177+
178+
179 class Stats:
180- """Bag to hold request statistics.
181+ """Bag to hold and compute request statistics.
182
183 All times are in seconds.
184 """
185@@ -82,7 +208,6 @@
186 mean = 0 # Mean time per hit.
187 median = 0 # Median time per hit.
188 std = 0 # Standard deviation per hit.
189- ninetyninth_percentile_time = 0
190 histogram = None # # Request times histogram.
191
192 total_sqltime = 0 # Total time spent waiting for SQL to process.
193@@ -95,212 +220,207 @@
194 median_sqlstatements = 0
195 std_sqlstatements = 0
196
197- def __init__(self, times, timeout):
198- """Compute the stats based on times.
199-
200- Times is a list of (app_time, sql_statements, sql_times).
201-
202- The histogram is a list of request counts per 1 second bucket.
203- ie. histogram[0] contains the number of requests taking between 0 and
204- 1 second, histogram[1] contains the number of requests taking between
205- 1 and 2 seconds etc. histogram is None if there are no requests in
206- this Category.
207+ @property
208+ def ninetyninth_percentile_time(self):
209+ """Time under which 99% of requests are rendered.
210+
211+ This is estimated as 3 std deviations from the mean. Given that
212+ in a daily report, many URLs or PageIds won't have 100 requests, it's
213+ more useful to use this estimator.
214 """
215- if not times:
216- return
217-
218- self.total_hits = len(times)
219-
220- # Ignore missing values (-1) in computation.
221- times_array = numpy.ma.masked_values(
222- numpy.asarray(times, dtype=numpy.float32), -1.)
223-
224- self.total_time, self.total_sqlstatements, self.total_sqltime = (
225- times_array.sum(axis=0))
226-
227- self.mean, self.mean_sqlstatements, self.mean_sqltime = (
228- times_array.mean(axis=0))
229-
230- self.median, self.median_sqlstatements, self.median_sqltime = (
231- numpy.median(times_array, axis=0))
232-
233- self.std, self.std_sqlstatements, self.std_sqltime = (
234- numpy.std(times_array, axis=0))
235-
236- # This is an approximation which may not be true: we don't know if we
237- # have a std distribution or not. We could just find the 99th
238- # percentile by counting. Shock. Horror; however this appears pretty
239- # good based on eyeballing things so far - once we're down in the 2-3
240- # second range for everything we may want to revisit.
241- self.ninetyninth_percentile_time = self.mean + self.std*3
242-
243- histogram_width = int(timeout*1.5)
244- histogram_times = numpy.clip(times_array[:,0], 0, histogram_width)
245- histogram = numpy.histogram(
246- histogram_times, normed=True, range=(0, histogram_width),
247- bins=histogram_width)
248- self.histogram = zip(histogram[1], histogram[0])
249-
250-
251-class SQLiteRequestTimes:
252- """SQLite-based request times computation."""
253+ return self.mean + 3*self.std
254+
255+ @property
256+ def relative_histogram(self):
257+ """Return an histogram where the frequency is relative."""
258+ if self.histogram:
259+ return [[x, float(f)/self.total_hits] for x, f in self.histogram]
260+ else:
261+ return None
262+
263+ def text(self):
264+ """Return a textual version of the stats."""
265+ return textwrap.dedent("""
266+ <Stats for %d requests:
267+ Time: total=%.2f; mean=%.2f; median=%.2f; std=%.2f
268+ SQL time: total=%.2f; mean=%.2f; median=%.2f; std=%.2f
269+ SQL stmt: total=%.f; mean=%.2f; median=%.f; std=%.2f
270+ >""" % (
271+ self.total_hits, self.total_time, self.mean, self.median,
272+ self.std, self.total_sqltime, self.mean_sqltime,
273+ self.median_sqltime, self.std_sqltime,
274+ self.total_sqlstatements, self.mean_sqlstatements,
275+ self.median_sqlstatements, self.std_sqlstatements))
276+
277+
278+class OnlineStats(Stats):
279+ """Implementation of stats that can be computed online.
280+
281+ You call update() for each request and the stats are updated incrementally
282+ with minimum storage space.
283+ """
284+
285+ def __init__(self, histogram_width):
286+ self.time_stats = OnlineStatsCalculator()
287+ self.time_median_approximate = OnlineApproximateMedian()
288+ self.sql_time_stats = OnlineStatsCalculator()
289+ self.sql_time_median_approximate = OnlineApproximateMedian()
290+ self.sql_statements_stats = OnlineStatsCalculator()
291+ self.sql_statements_median_approximate = OnlineApproximateMedian()
292+ self._histogram = [
293+ [x, 0] for x in range(histogram_width)]
294+
295+ @property
296+ def total_hits(self):
297+ return self.time_stats.count
298+
299+ @property
300+ def total_time(self):
301+ return self.time_stats.sum
302+
303+ @property
304+ def mean(self):
305+ return self.time_stats.mean
306+
307+ @property
308+ def median(self):
309+ return self.time_median_approximate.median
310+
311+ @property
312+ def std(self):
313+ return self.time_stats.std
314+
315+ @property
316+ def total_sqltime(self):
317+ return self.sql_time_stats.sum
318+
319+ @property
320+ def mean_sqltime(self):
321+ return self.sql_time_stats.mean
322+
323+ @property
324+ def median_sqltime(self):
325+ return self.sql_time_median_approximate.median
326+
327+ @property
328+ def std_sqltime(self):
329+ return self.sql_time_stats.std
330+
331+ @property
332+ def total_sqlstatements(self):
333+ return self.sql_statements_stats.sum
334+
335+ @property
336+ def mean_sqlstatements(self):
337+ return self.sql_statements_stats.mean
338+
339+ @property
340+ def median_sqlstatements(self):
341+ return self.sql_statements_median_approximate.median
342+
343+ @property
344+ def std_sqlstatements(self):
345+ return self.sql_statements_stats.std
346+
347+ @property
348+ def histogram(self):
349+ if self.time_stats.count:
350+ return self._histogram
351+ else:
352+ return None
353+
354+ def update(self, request):
355+ """Update the stats based on request."""
356+ self.time_stats.update(request.app_seconds)
357+ self.time_median_approximate.update(request.app_seconds)
358+ self.sql_time_stats.update(request.sql_seconds)
359+ self.sql_time_median_approximate.update(request.sql_seconds)
360+ self.sql_statements_stats.update(request.sql_statements)
361+ self.sql_statements_median_approximate.update(request.sql_statements)
362+
363+ idx = int(min(len(self.histogram)-1, request.app_seconds))
364+ self.histogram[idx][1] += 1
365+
366+
367+class RequestTimes:
368+ """Collect statistics from requests.
369+
370+ Statistics are updated by calling the add_request() method.
371+
372+ Statistics for mean/stddev/total/median for request times, SQL times and
373+ number of SQL statements are collected.
374+
375+ They are grouped by Category, URL or PageID.
376+ """
377
378 def __init__(self, categories, options):
379- if options.db_file is None:
380- fd, self.filename = tempfile.mkstemp(suffix='.db', prefix='ppr')
381- os.close(fd)
382- else:
383- self.filename = options.db_file
384- self.con = sqlite3.connect(self.filename, isolation_level='EXCLUSIVE')
385- log.debug('Using request database %s' % self.filename)
386- # Some speed optimization.
387- self.con.execute('PRAGMA synchronous = off')
388- self.con.execute('PRAGMA journal_mode = off')
389-
390- self.categories = categories
391- self.store_all_request = options.pageids or options.top_urls
392- self.timeout = options.timeout
393- self.cur = self.con.cursor()
394-
395- # Create the tables, ignore errors about them being already present.
396- try:
397- self.cur.execute('''
398- CREATE TABLE category_request (
399- category INTEGER,
400- time REAL,
401- sql_statements INTEGER,
402- sql_time REAL)
403- ''');
404- except sqlite3.OperationalError, e:
405- if 'already exists' in str(e):
406- pass
407- else:
408- raise
409-
410- if self.store_all_request:
411- try:
412- self.cur.execute('''
413- CREATE TABLE request (
414- pageid TEXT,
415- url TEXT,
416- time REAL,
417- sql_statements INTEGER,
418- sql_time REAL)
419- ''');
420- except sqlite3.OperationalError, e:
421- if 'already exists' in str(e):
422- pass
423- else:
424- raise
425+ self.by_pageids = options.pageids
426+ self.top_urls = options.top_urls
427+ # We only keep in memory 50 times the number of URLs we want to
428+ # return. The number of URLs can go pretty high (because of the
429+ # distinct query parameters).
430+ #
431+ # Keeping all in memory at once is prohibitive. On a small but
432+ # representative sample, keeping 50 times the possible number of
433+ # candidates and culling to 90% on overflow, generated an identical
434+ # report than keeping all the candidates in-memory.
435+ #
436+ # Keeping 10 times or culling at 90% generated a near-identical report
437+ # (it differed a little in the tail.)
438+ #
439+ # The size/cull parameters might need to change if the requests
440+ # distribution become very different than what it currently is.
441+ self.top_urls_cache_size = self.top_urls * 50
442+
443+ # Histogram has a bin per second up to 1.5 our timeout.
444+ self.histogram_width = int(options.timeout*1.5)
445+ self.category_times = [
446+ (category, OnlineStats(self.histogram_width))
447+ for category in categories]
448+ self.url_times = {}
449+ self.pageid_times = {}
450
451 def add_request(self, request):
452- """Add a request to the cache."""
453- sql_statements = request.sql_statements
454- sql_seconds = request.sql_seconds
455-
456- # Store missing value as -1, as it makes dealing with those
457- # easier with numpy.
458- if sql_statements is None:
459- sql_statements = -1
460- if sql_seconds is None:
461- sql_seconds = -1
462- for idx, category in enumerate(self.categories):
463+ """Add request to the set of requests we collect stats for."""
464+ for category, stats in self.category_times:
465 if category.match(request):
466- self.con.execute(
467- "INSERT INTO category_request VALUES (?,?,?,?)",
468- (idx, request.app_seconds, sql_statements, sql_seconds))
469+ stats.update(request)
470
471- if self.store_all_request:
472+ if self.by_pageids:
473 pageid = request.pageid or 'Unknown'
474- self.con.execute(
475- "INSERT INTO request VALUES (?,?,?,?,?)",
476- (pageid, request.url, request.app_seconds, sql_statements,
477- sql_seconds))
478-
479- def commit(self):
480- """Call commit on the underlying connection."""
481- self.con.commit()
482+ stats = self.pageid_times.setdefault(
483+ pageid, OnlineStats(self.histogram_width))
484+ stats.update(request)
485+
486+ if self.top_urls:
487+ stats = self.url_times.setdefault(
488+ request.url, OnlineStats(self.histogram_width))
489+ stats.update(request)
490+ # Whenever we have more URLs than we need to, discard 10%
491+ # that is less likely to end up in the top.
492+ if len(self.url_times) > self.top_urls_cache_size:
493+ cutoff = int(self.top_urls_cache_size*0.90)
494+ self.url_times = dict(
495+ sorted(self.url_times.items(),
496+ key=lambda x: x[1].total_time,
497+ reverse=True)[:cutoff])
498+
499
500 def get_category_times(self):
501 """Return the times for each category."""
502- category_query = 'SELECT * FROM category_request ORDER BY category'
503-
504- empty_stats = Stats([], 0)
505- categories = dict(self.get_times(category_query))
506- return [
507- (category, categories.get(idx, empty_stats))
508- for idx, category in enumerate(self.categories)]
509-
510- def get_top_urls_times(self, top_n):
511+ return self.category_times
512+
513+ def get_top_urls_times(self):
514 """Return the times for the Top URL by total time"""
515- top_url_query = '''
516- SELECT url, time, sql_statements, sql_time
517- FROM request WHERE url IN (
518- SELECT url FROM (SELECT url, sum(time) FROM request
519- GROUP BY url
520- ORDER BY sum(time) DESC
521- LIMIT %d))
522- ORDER BY url
523- ''' % top_n
524 # Sort the result by total time
525 return sorted(
526- self.get_times(top_url_query), key=lambda x: x[1].total_time,
527- reverse=True)
528+ self.url_times.items(),
529+ key=lambda x: x[1].total_time, reverse=True)[:self.top_urls]
530
531 def get_pageid_times(self):
532 """Return the times for the pageids."""
533- pageid_query = '''
534- SELECT pageid, time, sql_statements, sql_time
535- FROM request
536- ORDER BY pageid
537- '''
538- return self.get_times(pageid_query)
539-
540- def get_times(self, query):
541- """Return a list of key, stats based on the query.
542-
543- The query should return rows of the form:
544- [key, app_time, sql_statements, sql_times]
545-
546- And should be sorted on key.
547- """
548- times = []
549- current_key = None
550- results = []
551- self.cur.execute(query)
552- while True:
553- rows = self.cur.fetchmany()
554- if len(rows) == 0:
555- break
556- for row in rows:
557- # We are encountering a new group...
558- if row[0] != current_key:
559- # Compute the stats of the previous group
560- if current_key != None:
561- results.append(
562- (current_key, Stats(times, self.timeout)))
563- # Initialize the new group.
564- current_key = row[0]
565- times = []
566-
567- times.append(row[1:])
568- # Compute the stats of the last group
569- if current_key != None:
570- results.append((current_key, Stats(times, self.timeout)))
571-
572- return results
573-
574- def close(self, remove=False):
575- """Close the SQLite connection.
576-
577- :param remove: If true, the DB file will be removed.
578- """
579- self.con.close()
580- if remove:
581- log.debug('Deleting request database.')
582- os.unlink(self.filename)
583- else:
584- log.debug('Keeping request database %s.' % self.filename)
585+ # Sort the result by pageid
586+ return sorted(self.pageid_times.items())
587
588
589 def main():
590@@ -339,17 +459,13 @@
591 # Default to 12: the staging timeout.
592 default=12, type="int",
593 help="The configured timeout value : determines high risk page ids.")
594- parser.add_option(
595- "--db-file", dest="db_file",
596- default=None, metavar="FILE",
597- help="Do not parse the records, generate reports from the DB file.")
598
599 options, args = parser.parse_args()
600
601 if not os.path.isdir(options.directory):
602 parser.error("Directory %s does not exist" % options.directory)
603
604- if len(args) == 0 and options.db_file is None:
605+ if len(args) == 0:
606 parser.error("At least one zserver tracelog file must be provided")
607
608 if options.from_ts is not None and options.until_ts is not None:
609@@ -383,22 +499,17 @@
610 if len(categories) == 0:
611 parser.error("No data in [categories] section of configuration.")
612
613- times = SQLiteRequestTimes(categories, options)
614-
615- if len(args) > 0:
616- parse(args, times, options)
617- times.commit()
618-
619- log.debug('Generating category statistics...')
620+ times = RequestTimes(categories, options)
621+
622+ parse(args, times, options)
623+
624 category_times = times.get_category_times()
625
626 pageid_times = []
627 url_times= []
628 if options.top_urls:
629- log.debug('Generating top %d urls statistics...' % options.top_urls)
630- url_times = times.get_top_urls_times(options.top_urls)
631+ url_times = times.get_top_urls_times()
632 if options.pageids:
633- log.debug('Generating pageid statistics...')
634 pageid_times = times.get_pageid_times()
635
636 def _report_filename(filename):
637@@ -436,7 +547,37 @@
638 open(report_filename, 'w'), None, pageid_times, None,
639 options.timeout - 2)
640
641- times.close(options.db_file is None)
642+ # Save the times cache for later merging.
643+ report_filename = _report_filename('stats.pck.bz2')
644+ log.info("Saving times database in %s", report_filename)
645+ stats_file = bz2.BZ2File(report_filename, 'w')
646+ cPickle.dump(times, stats_file, protocol=cPickle.HIGHEST_PROTOCOL)
647+ stats_file.close()
648+
649+ # Output metrics for selected categories.
650+ report_filename = _report_filename('metrics.dat')
651+ log.info('Saving category_metrics %s', report_filename)
652+ metrics_file = open(report_filename, 'w')
653+ writer = csv.writer(metrics_file, delimiter=':')
654+ date = options.until_ts or options.from_ts or datetime.utcnow()
655+ date = time.mktime(date.timetuple())
656+
657+ for option in script_config.options('metrics'):
658+ name = script_config.get('metrics', option)
659+ found = False
660+ for category, stats in category_times:
661+ if category.title == name:
662+ writer.writerows([
663+ ("%s_99" % option, "%f@%d" % (
664+ stats.ninetyninth_percentile_time, date)),
665+ ("%s_mean" % option, "%f@%d" % (stats.mean, date))])
666+ found = True
667+ break
668+ if not found:
669+ log.warning("Can't find category %s for metric %s" % (
670+ option, name))
671+ metrics_file.close()
672+
673 return 0
674
675
676@@ -447,17 +588,9 @@
677 """
678 ext = os.path.splitext(filename)[1]
679 if ext == '.bz2':
680- p = subprocess.Popen(
681- ['bunzip2', '-c', filename],
682- stdout=subprocess.PIPE, stdin=subprocess.PIPE)
683- p.stdin.close()
684- return p.stdout
685+ return bz2.BZ2File(filename, 'r')
686 elif ext == '.gz':
687- p = subprocess.Popen(
688- ['gunzip', '-c', filename],
689- stdout=subprocess.PIPE, stdin=subprocess.PIPE)
690- p.stdin.close()
691- return p.stdout
692+ return gzip.GzipFile(filename, 'r')
693 else:
694 return open(filename, mode)
695
696@@ -684,7 +817,7 @@
697 histograms = []
698
699 def handle_times(html_title, stats):
700- histograms.append(stats.histogram)
701+ histograms.append(stats.relative_histogram)
702 print >> outf, dedent("""\
703 <tr>
704 <th class="category-title">%s</th>
705@@ -810,4 +943,3 @@
706 </body>
707 </html>
708 """)
709-
710
711=== added file 'lib/lp/scripts/utilities/tests/test_pageperformancereport.py'
712--- lib/lp/scripts/utilities/tests/test_pageperformancereport.py 1970-01-01 00:00:00 +0000
713+++ lib/lp/scripts/utilities/tests/test_pageperformancereport.py 2010-11-04 21:26:27 +0000
714@@ -0,0 +1,290 @@
715+# Copyright 2010 Canonical Ltd. This software is licensed under the
716+# GNU Affero General Public License version 3 (see the file LICENSE).
717+
718+"""Test the pageperformancereport script."""
719+
720+__metaclass__ = type
721+
722+import unittest
723+
724+from lp.testing import TestCase
725+
726+from lp.scripts.utilities.pageperformancereport import (
727+ Category,
728+ OnlineApproximateMedian,
729+ OnlineStatsCalculator,
730+ RequestTimes,
731+ Stats,
732+ )
733+
734+
735+class FakeOptions:
736+ timeout = 4
737+ db_file = None
738+ pageids = True
739+ top_urls = 3
740+
741+ def __init__(self, **kwargs):
742+ """Assign all arguments as attributes."""
743+ self.__dict__.update(kwargs)
744+
745+
746+class FakeRequest:
747+ def __init__(self, url, app_seconds, sql_statements=None,
748+ sql_seconds=None, pageid=None):
749+ self.url = url
750+ self.pageid = pageid
751+ self.app_seconds = app_seconds
752+ self.sql_statements = sql_statements
753+ self.sql_seconds = sql_seconds
754+
755+
756+class FakeStats(Stats):
757+ def __init__(self, **kwargs):
758+ # Override the constructor to just store the values.
759+ self.__dict__.update(kwargs)
760+
761+
762+FAKE_REQUESTS = [
763+ FakeRequest('/', 0.5, pageid='+root'),
764+ FakeRequest('/bugs', 4.5, 56, 3.0, pageid='+bugs'),
765+ FakeRequest('/bugs', 4.2, 56, 2.2, pageid='+bugs'),
766+ FakeRequest('/bugs', 5.5, 76, 4.0, pageid='+bugs'),
767+ FakeRequest('/ubuntu', 2.5, 6, 2.0, pageid='+distribution'),
768+ FakeRequest('/launchpad', 3.5, 3, 3.0, pageid='+project'),
769+ FakeRequest('/bzr', 2.5, 4, 2.0, pageid='+project'),
770+ FakeRequest('/bugs/1', 20.5, 567, 14.0, pageid='+bug'),
771+ FakeRequest('/bugs/1', 15.5, 567, 9.0, pageid='+bug'),
772+ FakeRequest('/bugs/5', 1.5, 30, 1.2, pageid='+bug'),
773+ FakeRequest('/lazr', 1.0, 16, 0.3, pageid='+project'),
774+ FakeRequest('/drizzle', 0.9, 11, 1.3, pageid='+project'),
775+ ]
776+
777+
778+# The category stats computed for the above 12 requests.
779+CATEGORY_STATS = [
780+ # Median is an approximation.
781+ # Real values are: 2.50, 2.20, 30
782+ (Category('All', ''), FakeStats(
783+ total_hits=12, total_time=62.60, mean=5.22, median=4.20, std=5.99,
784+ total_sqltime=42, mean_sqltime=3.82, median_sqltime=3.0,
785+ std_sqltime=3.89,
786+ total_sqlstatements=1392, mean_sqlstatements=126.55,
787+ median_sqlstatements=56, std_sqlstatements=208.94,
788+ histogram=[[0, 2], [1, 2], [2, 2], [3, 1], [4, 2], [5, 3]],
789+ )),
790+ (Category('Test', ''), FakeStats()),
791+ (Category('Bugs', ''), FakeStats(
792+ total_hits=6, total_time=51.70, mean=8.62, median=4.5, std=6.90,
793+ total_sqltime=33.40, mean_sqltime=5.57, median_sqltime=3,
794+ std_sqltime=4.52,
795+ total_sqlstatements=1352, mean_sqlstatements=225.33,
796+ median_sqlstatements=56, std_sqlstatements=241.96,
797+ histogram=[[0, 0], [1, 1], [2, 0], [3, 0], [4, 2], [5, 3]],
798+ )),
799+ ]
800+
801+
802+# The top 3 URL stats computed for the above 12 requests.
803+TOP_3_URL_STATS = [
804+ ('/bugs/1', FakeStats(
805+ total_hits=2, total_time=36.0, mean=18.0, median=15.5, std=2.50,
806+ total_sqltime=23.0, mean_sqltime=11.5, median_sqltime=9.0,
807+ std_sqltime=2.50,
808+ total_sqlstatements=1134, mean_sqlstatements=567.0,
809+ median_sqlstatements=567, std_statements=0,
810+ histogram=[[0, 0], [1, 0], [2, 0], [3, 0], [4, 0], [5, 2]],
811+ )),
812+ ('/bugs', FakeStats(
813+ total_hits=3, total_time=14.2, mean=4.73, median=4.5, std=0.56,
814+ total_sqltime=9.2, mean_sqltime=3.07, median_sqltime=3,
815+ std_sqltime=0.74,
816+ total_sqlstatements=188, mean_sqlstatements=62.67,
817+ median_sqlstatements=56, std_sqlstatements=9.43,
818+ histogram=[[0, 0], [1, 0], [2, 0], [3, 0], [4, 2], [5, 1]],
819+ )),
820+ ('/launchpad', FakeStats(
821+ total_hits=1, total_time=3.5, mean=3.5, median=3.5, std=0,
822+ total_sqltime=3.0, mean_sqltime=3, median_sqltime=3, std_sqltime=0,
823+ total_sqlstatements=3, mean_sqlstatements=3,
824+ median_sqlstatements=3, std_sqlstatements=0,
825+ histogram=[[0, 0], [1, 0], [2, 0], [3, 1], [4, 0], [5, 0]],
826+ )),
827+ ]
828+
829+
830+# The pageid stats computed for the above 12 requests.
831+PAGEID_STATS = [
832+ ('+bug', FakeStats(
833+ total_hits=3, total_time=37.5, mean=12.5, median=15.5, std=8.04,
834+ total_sqltime=24.2, mean_sqltime=8.07, median_sqltime=9,
835+ std_sqltime=5.27,
836+ total_sqlstatements=1164, mean_sqlstatements=388,
837+ median_sqlstatements=567, std_sqlstatements=253.14,
838+ histogram=[[0, 0], [1, 1], [2, 0], [3, 0], [4, 0], [5, 2]],
839+ )),
840+ ('+bugs', FakeStats(
841+ total_hits=3, total_time=14.2, mean=4.73, median=4.5, std=0.56,
842+ total_sqltime=9.2, mean_sqltime=3.07, median_sqltime=3,
843+ std_sqltime=0.74,
844+ total_sqlstatements=188, mean_sqlstatements=62.67,
845+ median_sqlstatements=56, std_sqlstatements=9.43,
846+ histogram=[[0, 0], [1, 0], [2, 0], [3, 0], [4, 2], [5, 1]],
847+ )),
848+ ('+distribution', FakeStats(
849+ total_hits=1, total_time=2.5, mean=2.5, median=2.5, std=0,
850+ total_sqltime=2.0, mean_sqltime=2, median_sqltime=2, std_sqltime=0,
851+ total_sqlstatements=6, mean_sqlstatements=6,
852+ median_sqlstatements=6, std_sqlstatements=0,
853+ histogram=[[0, 0], [1, 0], [2, 1], [3, 0], [4, 0], [5, 0]],
854+ )),
855+ ('+project', FakeStats(
856+ total_hits=4, total_time=7.9, mean=1.98, median=1, std=1.08,
857+ total_sqltime=6.6, mean_sqltime=1.65, median_sqltime=1.3,
858+ std_sqltime=0.99,
859+ total_sqlstatements=34, mean_sqlstatements=8.5,
860+ median_sqlstatements=4, std_sqlstatements=5.32,
861+ histogram=[[0, 1], [1, 1], [2, 1], [3, 1], [4, 0], [5, 0]],
862+ )),
863+ ('+root', FakeStats(
864+ total_hits=1, total_time=0.5, mean=0.5, median=0.5, std=0,
865+ histogram=[[0, 1], [1, 0], [2, 0], [3, 0], [4, 0], [5, 0]],
866+ )),
867+ ]
868+
869+
870+class TestRequestTimes(TestCase):
871+ """Tests the RequestTimes backend."""
872+
873+ def setUp(self):
874+ TestCase.setUp(self)
875+ self.categories = [
876+ Category('All', '.*'), Category('Test', '.*test.*'),
877+ Category('Bugs', '.*bugs.*')]
878+ self.db = RequestTimes(self.categories, FakeOptions())
879+
880+ def setUpRequests(self):
881+ """Insert some requests into the db."""
882+ for r in FAKE_REQUESTS:
883+ self.db.add_request(r)
884+
885+ def assertStatsAreEquals(self, expected, results):
886+ self.assertEquals(
887+ len(expected), len(results), 'Wrong number of results')
888+ for idx in range(len(results)):
889+ self.assertEquals(expected[idx][0], results[idx][0],
890+ "Wrong key for results %d" % idx)
891+ key = results[idx][0]
892+ self.assertEquals(expected[idx][1].text(), results[idx][1].text(),
893+ "Wrong stats for results %d (%s)" % (idx, key))
894+ self.assertEquals(
895+ expected[idx][1].histogram, results[idx][1].histogram,
896+ "Wrong histogram for results %d (%s)" % (idx, key))
897+
898+ def test_get_category_times(self):
899+ self.setUpRequests()
900+ category_times = self.db.get_category_times()
901+ self.assertStatsAreEquals(CATEGORY_STATS, category_times)
902+
903+ def test_get_url_times(self):
904+ self.setUpRequests()
905+ url_times = self.db.get_top_urls_times()
906+ self.assertStatsAreEquals(TOP_3_URL_STATS, url_times)
907+
908+ def test_get_pageid_times(self):
909+ self.setUpRequests()
910+ pageid_times = self.db.get_pageid_times()
911+ self.assertStatsAreEquals(PAGEID_STATS, pageid_times)
912+
913+
914+class TestStats(TestCase):
915+ """Tests for the stats class."""
916+
917+ def test_relative_histogram(self):
918+ # Test that relative histogram gives an histogram using
919+ # relative frequency.
920+ stats = Stats()
921+ stats.total_hits = 100
922+ stats.histogram = [[0, 50], [1, 10], [2, 33], [3, 0], [4, 0], [5, 7]]
923+ self.assertEquals(
924+ [[0, 0.5], [1, .1], [2, .33], [3, 0], [4, 0], [5, .07]],
925+ stats.relative_histogram)
926+
927+
928+class TestOnlineStatsCalculator(TestCase):
929+ """Tests for the online stats calculator."""
930+
931+ def setUp(self):
932+ TestCase.setUp(self)
933+ self.stats = OnlineStatsCalculator()
934+
935+ def test_stats_for_empty_set(self):
936+ # Test the stats when there is no input.
937+ self.assertEquals(0, self.stats.count)
938+ self.assertEquals(0, self.stats.sum)
939+ self.assertEquals(0, self.stats.mean)
940+ self.assertEquals(0, self.stats.variance)
941+ self.assertEquals(0, self.stats.std)
942+
943+ def test_stats_for_one_value(self):
944+ # Test the stats when adding one element.
945+ self.stats.update(5)
946+ self.assertEquals(1, self.stats.count)
947+ self.assertEquals(5, self.stats.sum)
948+ self.assertEquals(5, self.stats.mean)
949+ self.assertEquals(0, self.stats.variance)
950+ self.assertEquals(0, self.stats.std)
951+
952+ def test_None_are_ignored(self):
953+ self.stats.update(None)
954+ self.assertEquals(0, self.stats.count)
955+
956+ def test_stats_for_3_values(self):
957+ for x in [3, 6, 9]:
958+ self.stats.update(x)
959+ self.assertEquals(3, self.stats.count)
960+ self.assertEquals(18, self.stats.sum)
961+ self.assertEquals(6, self.stats.mean)
962+ self.assertEquals(6, self.stats.variance)
963+ self.assertEquals("2.45", "%.2f" % self.stats.std)
964+
965+
966+SHUFFLE_RANGE_100 = [
967+ 25, 79, 99, 76, 60, 63, 87, 77, 51, 82, 42, 96, 93, 58, 32, 66, 75,
968+ 2, 26, 22, 11, 73, 61, 83, 65, 68, 44, 81, 64, 3, 33, 34, 15, 1,
969+ 92, 27, 90, 74, 46, 57, 59, 31, 13, 19, 89, 29, 56, 94, 50, 49, 62,
970+ 37, 21, 35, 5, 84, 88, 16, 8, 23, 40, 6, 48, 10, 97, 0, 53, 17, 30,
971+ 18, 43, 86, 12, 71, 38, 78, 36, 7, 45, 47, 80, 54, 39, 91, 98, 24,
972+ 55, 14, 52, 20, 69, 85, 95, 28, 4, 9, 67, 70, 41, 72
973+ ]
974+
975+
976+class TestOnlineApproximateMedian(TestCase):
977+ """Tests for the approximate median computation."""
978+
979+ def setUp(self):
980+ TestCase.setUp(self)
981+ self.estimator = OnlineApproximateMedian()
982+
983+ def test_median_is_0_when_no_input(self):
984+ self.assertEquals(0, self.estimator.median)
985+
986+ def test_median_is_true_median_for_n_lower_than_bucket_size(self):
987+ for x in range(9):
988+ self.estimator.update(x)
989+ self.assertEquals(4, self.estimator.median)
990+
991+ def test_None_input_is_ignored(self):
992+ self.estimator.update(1)
993+ self.estimator.update(None)
994+ self.assertEquals(1, self.estimator.median)
995+
996+ def test_approximage_median_is_good_enough(self):
997+ for x in SHUFFLE_RANGE_100:
998+ self.estimator.update(x)
999+ # True median is 50, 49 is good enough :-)
1000+ self.assertIn(self.estimator.median, range(49,52))
1001+
1002+
1003+def test_suite():
1004+ return unittest.TestLoader().loadTestsFromName(__name__)
1005
1006=== modified file 'setup.py'
1007--- setup.py 2010-09-18 08:00:27 +0000
1008+++ setup.py 2010-11-04 21:26:27 +0000
1009@@ -51,7 +51,6 @@
1010 'meliae',
1011 'mercurial',
1012 'mocker',
1013- 'numpy',
1014 'oauth',
1015 'paramiko',
1016 'python-memcached',
1017
1018=== modified file 'utilities/page-performance-report.ini'
1019--- utilities/page-performance-report.ini 2010-10-24 21:00:11 +0000
1020+++ utilities/page-performance-report.ini 2010-11-04 21:26:27 +0000
1021@@ -45,3 +45,10 @@
1022 Private XML-RPC=^https?://xmlrpc-private\.
1023 Shipit=^https?://shipit\.
1024
1025+[metrics]
1026+ppr_all=All launchpad except opstats
1027+ppr_bugs=Bugs
1028+ppr_api=API
1029+ppr_code=Code
1030+ppr_translations=Translations
1031+ppr_registry=Registry
1032
1033=== modified file 'versions.cfg'
1034--- versions.cfg 2010-11-01 19:59:42 +0000
1035+++ versions.cfg 2010-11-04 21:26:27 +0000
1036@@ -43,7 +43,6 @@
1037 mercurial = 1.6.2
1038 mocker = 0.10.1
1039 mozrunner = 1.3.4
1040-numpy = 1.3.0
1041 oauth = 1.0
1042 paramiko = 1.7.4
1043 Paste = 1.7.2