Merge lp:~flacoste/launchpad/ppr-constant-memory into lp:launchpad

Proposed by Francis J. Lacoste on 2010-10-28
Status: Rejected
Rejected by: Francis J. Lacoste on 2010-10-29
Proposed branch: lp:~flacoste/launchpad/ppr-constant-memory
Merge into: lp:launchpad
Diff against target: 992 lines (+605/-225)
5 files modified
lib/lp/scripts/utilities/pageperformancereport.py (+308/-223)
lib/lp/scripts/utilities/tests/test_pageperformancereport.py (+290/-0)
setup.py (+0/-1)
utilities/page-performance-report.ini (+7/-0)
versions.cfg (+0/-1)
To merge this branch: bzr merge lp:~flacoste/launchpad/ppr-constant-memory
Reviewer: Robert Collins (community)
Date requested: 2010-10-28
Status: Approve on 2010-10-28
Review via email: mp+39578@code.launchpad.net

Description of the Change

This is my second iteration on making the page-performance-report use less
memory.

This time around, I managed to get it to run in constant memory.

I dropped numpy and now compute all statistics using SQL.
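
For illustration only (this is not the branch's code), here is a minimal
sketch of how such stats can be pulled out of SQLite with plain SQL, using a
simplified version of the script's category_request table; stock SQLite has
no stddev() aggregate, so the variance falls back to the E[x^2] - E[x]^2
identity:

import math
import sqlite3

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE category_request (category INTEGER, time REAL)')
con.executemany(
    'INSERT INTO category_request VALUES (?, ?)',
    [(0, 0.5), (0, 4.5), (0, 4.2), (0, 5.5)])

# One aggregate pass gives count, total and mean; derive the variance from
# E[x^2] - E[x]^2 since there is no built-in stddev() aggregate.
count, total, mean, mean_sq = con.execute(
    'SELECT COUNT(time), SUM(time), AVG(time), AVG(time * time) '
    'FROM category_request WHERE category = 0').fetchone()
std = math.sqrt(mean_sq - mean * mean)

# Median: order the values and skip half of them (this picks the
# upper-middle value for an even count, which is fine for a report).
median = con.execute(
    'SELECT time FROM category_request WHERE category = 0 '
    'ORDER BY time LIMIT 1 OFFSET ?', (count // 2,)).fetchone()[0]

print('count=%d total=%.2f mean=%.3f std=%.2f median=%.1f'
      % (count, total, mean, std, median))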

I also added some automated tests. They are very coarse, checking that the
stats computed for a set of requests match expected results that were
hand-calculated in a spreadsheet.

Another drive-by is that I increased the cache_size at Robert's suggestion.
Local testing seems to show that it makes only a small difference (the main
effect seems to be reduced sys time, but that's not the bulk of the run
time).
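
For reference, and assuming the cache_size in question is SQLite's page
cache from the SQLite-backed iteration (the description does not say), the
tweak would look roughly like this; the value shown is purely illustrative:

import sqlite3

con = sqlite3.connect('ppr.db')
# Existing speed-ups in the script, plus a larger page cache (illustrative
# value; the size actually used on Robert's suggestion is not stated here).
con.execute('PRAGMA synchronous = off')
con.execute('PRAGMA journal_mode = off')
con.execute('PRAGMA cache_size = 10000')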

Generating the whole report now seems at least twice as slow as with the
previous algorithm.

The previous version took ~1m52s to generate a report for 300k records. The
new one takes ~4m. But the biggest advantage is that it runs in a constant
amount of memory. This means that we'll be able to generate weekly and
monthly reports (once we have more space on sodium; space is currently too
tight to generate them).

Robert Collins (lifeless) wrote :

175 """Return the times for each category."""
176 - category_query = 'SELECT * FROM category_request ORDER BY category'
177 -
178 - empty_stats = Stats([], 0)
179 - categories = dict(self.get_times(category_query))
180 +
181 + times = self.get_times('category_request', 'category')
182 +
183 return [

The VWS (vertical whitespace) here weirds me out - it breaks up the function without being a useful separation; it's up to you but I'd rather read

"""return ..."
times = self.get_times...
return [
 ...

Similarly on some other functions.

This looks like it will work but I think we can make this massively more efficient by just doing it incrementally. E.g. see http://mathcentral.uregina.ca/QQ/database/QQ.09.04/carlos1.html (though I think they have a sign reversal, you'll want to derive it yourself).

We can do that as a third iteration though.
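
For reference, a minimal sketch of the incremental update being suggested
(essentially Welford's algorithm; the OnlineStatsCalculator in the preview
diff below is a fuller version of the same idea): the running mean and the
sum of squared deviations can be maintained one value at a time in constant
memory.

import math

class RunningStats:
    """Maintain count, mean and standard deviation incrementally."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # Sum of squared deviations from the running mean.

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        if self.count == 0:
            return 0.0
        return math.sqrt(self.m2 / self.count)

stats = RunningStats()
for value in [3, 6, 9]:
    stats.update(value)
print('mean=%.2f std=%.2f' % (stats.mean, stats.std))  # 6.00 and ~2.45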

review: Approve
Robert Collins (lifeless) wrote :

I suggest this because the current implementation running on devpad is -thrashing- machine performance.

Francis J. Lacoste (flacoste) wrote :

This was replaced by an online in-memory version.

Preview Diff

1=== modified file 'lib/lp/scripts/utilities/pageperformancereport.py'
2--- lib/lp/scripts/utilities/pageperformancereport.py 2010-10-25 21:47:16 +0000
3+++ lib/lp/scripts/utilities/pageperformancereport.py 2010-10-29 22:02:04 +0000
4@@ -6,19 +6,19 @@
5 __metaclass__ = type
6 __all__ = ['main']
7
8+import bz2
9 from cgi import escape as html_quote
10 from ConfigParser import RawConfigParser
11+import csv
12 from datetime import datetime
13+import gzip
14+import math
15 import os.path
16 import re
17-import subprocess
18 from textwrap import dedent
19-import sqlite3
20-import tempfile
21+import textwrap
22 import time
23-import warnings
24
25-import numpy
26 import simplejson as json
27 import sre_constants
28 import zc.zservertracelog.tracereport
29@@ -27,9 +27,6 @@
30 from canonical.launchpad.scripts.logger import log
31 from lp.scripts.helpers import LPOptionParser
32
33-# We don't care about conversion to nan, they are expected.
34-warnings.filterwarnings(
35- 'ignore', '.*converting a masked element to nan.', UserWarning)
36
37 class Request(zc.zservertracelog.tracereport.Request):
38 url = None
39@@ -58,6 +55,7 @@
40
41 Requests belong to a Category if the URL matches a regular expression.
42 """
43+
44 def __init__(self, title, regexp):
45 self.title = title
46 self.regexp = regexp
47@@ -71,8 +69,128 @@
48 return cmp(self.title.lower(), other.title.lower())
49
50
51+class OnlineStatsCalculator:
52+ """Object that can compute count, sum, mean, variance and median.
53+
54+ It computes these value incrementally and using minimal storage
55+ using the Welford / Knuth algorithm described at
56+ http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm
57+ """
58+
59+ def __init__(self):
60+ self.count = 0
61+ self.sum = 0
62+ self.M2 = 0.0 # Sum of square difference
63+ self.mean = 0.0
64+
65+ def update(self, x):
66+ """Incrementally update the stats when adding x to the set.
67+
68+ None values are ignored.
69+ """
70+ if x is None:
71+ return
72+ self.count += 1
73+ self.sum += x
74+ delta = x - self.mean
75+ self.mean = float(self.sum)/self.count
76+ self.M2 += delta*(x - self.mean)
77+
78+ @property
79+ def variance(self):
80+ """Return the population variance."""
81+ if self.count == 0:
82+ return 0
83+ else:
84+ return self.M2/self.count
85+
86+ @property
87+ def std(self):
88+ """Return the standard deviation."""
89+ if self.count == 0:
90+ return 0
91+ else:
92+ return math.sqrt(self.variance)
93+
94+
95+class OnlineApproximateMedian:
96+ """Approximate the median of a set of elements.
97+
98+ This implements a space-efficient algorithm which only sees each value
99+ once. (It will hold in memory log B of n elements.)
100+
101+ It was described and analysed in
102+ D. Cantone and M.Hofri,
103+ "Analysis of An Approximate Median Selection Algorithm"
104+ ftp://ftp.cs.wpi.edu/pub/techreports/pdf/06-17.pdf
105+
106+ This algorithm is similar to Tukey's median of medians technique.
107+ It will compute the median among bucket_size values. And the median among
108+ those.
109+ """
110+
111+ def __init__(self, bucket_size=9):
112+ """Creates a new estimator.
113+
114+ It approximates the median by finding the median among each
115+ successive bucket_size element. And then using these medians for other
116+ round of selection.
117+
118+ The bucket size should be a low odd-integer.
119+ """
120+ self.count = 0
121+ self.bucket_size = bucket_size
122+ # Index of the median in a completed bucket.
123+ self.median_idx = bucket_size/2
124+ self.buckets = []
125+
126+ def update(self, x):
127+ """Update with x."""
128+ if x is None:
129+ return
130+
131+ self.count += 1
132+ i = 0
133+ while True:
134+ # Create bucket on demand.
135+ if i == len(self.buckets):
136+ self.buckets.append([])
137+ bucket = self.buckets[i]
138+ bucket.append(x)
139+ if len(bucket) == self.bucket_size:
140+ # Select the median in this bucket, and promote it.
141+ x = sorted(bucket)[self.median_idx]
142+ # Free the bucket for the next round.
143+ del bucket[:]
144+ i += 1
145+ continue
146+ else:
147+ break
148+
149+ @property
150+ def median(self):
151+ """Return the median."""
152+ if self.count == 0:
153+ return 0
154+
155+ # Find the 'weighted' median by assigning a weight to each
156+ # element proportional to how far they have been selected.
157+ candidates = []
158+ total_weight = 0
159+ for i, bucket in enumerate(self.buckets):
160+ weight = self.bucket_size ** i
161+ for x in bucket:
162+ total_weight += weight
163+ candidates.append([weight, x])
164+ # Make weight relative.
165+ candidates = [
166+ ((float(weight)/total_weight)*x, x)
167+ for weight, x in candidates]
168+ return sorted(candidates)[(len(candidates)-1)/2][1]
169+
170+
171 class Stats:
172- """Bag to hold request statistics.
173+ """Bag to hold and compute request statistics.
174
175 All times are in seconds.
176 """
177@@ -82,7 +200,6 @@
178 mean = 0 # Mean time per hit.
179 median = 0 # Median time per hit.
180 std = 0 # Standard deviation per hit.
181- ninetyninth_percentile_time = 0
182 histogram = None # # Request times histogram.
183
184 total_sqltime = 0 # Total time spent waiting for SQL to process.
185@@ -95,212 +212,175 @@
186 median_sqlstatements = 0
187 std_sqlstatements = 0
188
189- def __init__(self, times, timeout):
190- """Compute the stats based on times.
191-
192- Times is a list of (app_time, sql_statements, sql_times).
193-
194- The histogram is a list of request counts per 1 second bucket.
195- ie. histogram[0] contains the number of requests taking between 0 and
196- 1 second, histogram[1] contains the number of requests taking between
197- 1 and 2 seconds etc. histogram is None if there are no requests in
198- this Category.
199+ @property
200+ def ninetyninth_percentile_time(self):
201+ """Time under which 99% of requests are rendered.
202+
203+ This is estimated as 3 std deviations from the mean. Given that
204+ in a daily report, many URLs or PageIds won't have 100 requests, it's
205+ more useful to use this estimator.
206 """
207- if not times:
208- return
209-
210- self.total_hits = len(times)
211-
212- # Ignore missing values (-1) in computation.
213- times_array = numpy.ma.masked_values(
214- numpy.asarray(times, dtype=numpy.float32), -1.)
215-
216- self.total_time, self.total_sqlstatements, self.total_sqltime = (
217- times_array.sum(axis=0))
218-
219- self.mean, self.mean_sqlstatements, self.mean_sqltime = (
220- times_array.mean(axis=0))
221-
222- self.median, self.median_sqlstatements, self.median_sqltime = (
223- numpy.median(times_array, axis=0))
224-
225- self.std, self.std_sqlstatements, self.std_sqltime = (
226- numpy.std(times_array, axis=0))
227-
228- # This is an approximation which may not be true: we don't know if we
229- # have a std distribution or not. We could just find the 99th
230- # percentile by counting. Shock. Horror; however this appears pretty
231- # good based on eyeballing things so far - once we're down in the 2-3
232- # second range for everything we may want to revisit.
233- self.ninetyninth_percentile_time = self.mean + self.std*3
234-
235- histogram_width = int(timeout*1.5)
236- histogram_times = numpy.clip(times_array[:,0], 0, histogram_width)
237- histogram = numpy.histogram(
238- histogram_times, normed=True, range=(0, histogram_width),
239- bins=histogram_width)
240- self.histogram = zip(histogram[1], histogram[0])
241-
242-
243-class SQLiteRequestTimes:
244- """SQLite-based request times computation."""
245+ return self.mean + 3*self.std
246+
247+ @property
248+ def relative_histogram(self):
249+ """Return an histogram where the frequency is relative."""
250+ if self.histogram:
251+ return [[x, float(f)/self.total_hits] for x, f in self.histogram]
252+ else:
253+ return None
254+
255+ def text(self):
256+ """Return a textual version of the stats."""
257+ return textwrap.dedent("""
258+ <Stats for %d requests:
259+ Time: total=%.2f; mean=%.2f; median=%.2f; std=%.2f
260+ SQL time: total=%.2f; mean=%.2f; median=%.2f; std=%.2f
261+ SQL stmt: total=%.f; mean=%.2f; median=%.f; std=%.2f
262+ >""" % (
263+ self.total_hits, self.total_time, self.mean, self.median,
264+ self.std, self.total_sqltime, self.mean_sqltime,
265+ self.median_sqltime, self.std_sqltime,
266+ self.total_sqlstatements, self.mean_sqlstatements,
267+ self.median_sqlstatements, self.std_sqlstatements))
268+
269+
270+class OnlineStats(Stats):
271+ """Implementation of stats that can be computed online.
272+
273+ You call update() for each request and the stats are updated incrementally
274+ with minimum storage space.
275+ """
276+
277+ def __init__(self, histogram_width):
278+ self.time_stats = OnlineStatsCalculator()
279+ self.time_median_approximate = OnlineApproximateMedian()
280+ self.sql_time_stats = OnlineStatsCalculator()
281+ self.sql_time_median_approximate = OnlineApproximateMedian()
282+ self.sql_statements_stats = OnlineStatsCalculator()
283+ self.sql_statements_median_approximate = OnlineApproximateMedian()
284+ self._histogram = [
285+ [x, 0] for x in range(histogram_width)]
286+
287+ @property
288+ def total_hits(self):
289+ return self.time_stats.count
290+
291+ @property
292+ def total_time(self):
293+ return self.time_stats.sum
294+
295+ @property
296+ def mean(self):
297+ return self.time_stats.mean
298+
299+ @property
300+ def median(self):
301+ return self.time_median_approximate.median
302+
303+ @property
304+ def std(self):
305+ return self.time_stats.std
306+
307+ @property
308+ def total_sqltime(self):
309+ return self.sql_time_stats.sum
310+
311+ @property
312+ def mean_sqltime(self):
313+ return self.sql_time_stats.mean
314+
315+ @property
316+ def median_sqltime(self):
317+ return self.sql_time_median_approximate.median
318+
319+ @property
320+ def std_sqltime(self):
321+ return self.sql_time_stats.std
322+
323+ @property
324+ def total_sqlstatements(self):
325+ return self.sql_statements_stats.sum
326+
327+ @property
328+ def mean_sqlstatements(self):
329+ return self.sql_statements_stats.mean
330+
331+ @property
332+ def median_sqlstatements(self):
333+ return self.sql_statements_median_approximate.median
334+
335+ @property
336+ def std_sqlstatements(self):
337+ return self.sql_statements_stats.std
338+
339+ @property
340+ def histogram(self):
341+ if self.time_stats.count:
342+ return self._histogram
343+ else:
344+ return None
345+
346+ def update(self, request):
347+ """Update the stats based on request."""
348+ self.time_stats.update(request.app_seconds)
349+ self.time_median_approximate.update(request.app_seconds)
350+ self.sql_time_stats.update(request.sql_seconds)
351+ self.sql_time_median_approximate.update(request.sql_seconds)
352+ self.sql_statements_stats.update(request.sql_statements)
353+ self.sql_statements_median_approximate.update(request.sql_statements)
354+
355+ idx = int(min(len(self.histogram)-1, request.app_seconds))
356+ self.histogram[idx][1] += 1
357+
358+
359+class RequestTimes:
360+ """Collect the """
361
362 def __init__(self, categories, options):
363- if options.db_file is None:
364- fd, self.filename = tempfile.mkstemp(suffix='.db', prefix='ppr')
365- os.close(fd)
366- else:
367- self.filename = options.db_file
368- self.con = sqlite3.connect(self.filename, isolation_level='EXCLUSIVE')
369- log.debug('Using request database %s' % self.filename)
370- # Some speed optimization.
371- self.con.execute('PRAGMA synchronous = off')
372- self.con.execute('PRAGMA journal_mode = off')
373-
374- self.categories = categories
375- self.store_all_request = options.pageids or options.top_urls
376- self.timeout = options.timeout
377- self.cur = self.con.cursor()
378-
379- # Create the tables, ignore errors about them being already present.
380- try:
381- self.cur.execute('''
382- CREATE TABLE category_request (
383- category INTEGER,
384- time REAL,
385- sql_statements INTEGER,
386- sql_time REAL)
387- ''');
388- except sqlite3.OperationalError, e:
389- if 'already exists' in str(e):
390- pass
391- else:
392- raise
393-
394- if self.store_all_request:
395- try:
396- self.cur.execute('''
397- CREATE TABLE request (
398- pageid TEXT,
399- url TEXT,
400- time REAL,
401- sql_statements INTEGER,
402- sql_time REAL)
403- ''');
404- except sqlite3.OperationalError, e:
405- if 'already exists' in str(e):
406- pass
407- else:
408- raise
409+ self.by_pageids = options.pageids
410+ self.by_urls = options.top_urls
411+
412+ # Histogram has a bin per second up to 1.5 our timeout.
413+ self.histogram_width = int(options.timeout*1.5)
414+ self.category_times = [
415+ (category, OnlineStats(self.histogram_width))
416+ for category in categories]
417+ self.url_times = {}
418+ self.pageid_times = {}
419
420 def add_request(self, request):
421- """Add a request to the cache."""
422- sql_statements = request.sql_statements
423- sql_seconds = request.sql_seconds
424-
425- # Store missing value as -1, as it makes dealing with those
426- # easier with numpy.
427- if sql_statements is None:
428- sql_statements = -1
429- if sql_seconds is None:
430- sql_seconds = -1
431- for idx, category in enumerate(self.categories):
432+ """Add a request to the ."""
433+ for category, stats in self.category_times:
434 if category.match(request):
435- self.con.execute(
436- "INSERT INTO category_request VALUES (?,?,?,?)",
437- (idx, request.app_seconds, sql_statements, sql_seconds))
438+ stats.update(request)
439
440- if self.store_all_request:
441+ if self.by_pageids:
442 pageid = request.pageid or 'Unknown'
443- self.con.execute(
444- "INSERT INTO request VALUES (?,?,?,?,?)",
445- (pageid, request.url, request.app_seconds, sql_statements,
446- sql_seconds))
447+ stats = self.pageid_times.setdefault(
448+ pageid, OnlineStats(self.histogram_width))
449+ stats.update(request)
450
451- def commit(self):
452- """Call commit on the underlying connection."""
453- self.con.commit()
454+ if self.by_urls:
455+ stats = self.url_times.setdefault(
456+ request.url, OnlineStats(self.histogram_width))
457+ stats.update(request)
458
459 def get_category_times(self):
460 """Return the times for each category."""
461- category_query = 'SELECT * FROM category_request ORDER BY category'
462-
463- empty_stats = Stats([], 0)
464- categories = dict(self.get_times(category_query))
465- return [
466- (category, categories.get(idx, empty_stats))
467- for idx, category in enumerate(self.categories)]
468+ return self.category_times
469
470 def get_top_urls_times(self, top_n):
471 """Return the times for the Top URL by total time"""
472- top_url_query = '''
473- SELECT url, time, sql_statements, sql_time
474- FROM request WHERE url IN (
475- SELECT url FROM (SELECT url, sum(time) FROM request
476- GROUP BY url
477- ORDER BY sum(time) DESC
478- LIMIT %d))
479- ORDER BY url
480- ''' % top_n
481 # Sort the result by total time
482 return sorted(
483- self.get_times(top_url_query), key=lambda x: x[1].total_time,
484- reverse=True)
485+ self.url_times.items(),
486+ key=lambda x: x[1].total_time, reverse=True)[:top_n]
487
488 def get_pageid_times(self):
489 """Return the times for the pageids."""
490- pageid_query = '''
491- SELECT pageid, time, sql_statements, sql_time
492- FROM request
493- ORDER BY pageid
494- '''
495- return self.get_times(pageid_query)
496-
497- def get_times(self, query):
498- """Return a list of key, stats based on the query.
499-
500- The query should return rows of the form:
501- [key, app_time, sql_statements, sql_times]
502-
503- And should be sorted on key.
504- """
505- times = []
506- current_key = None
507- results = []
508- self.cur.execute(query)
509- while True:
510- rows = self.cur.fetchmany()
511- if len(rows) == 0:
512- break
513- for row in rows:
514- # We are encountering a new group...
515- if row[0] != current_key:
516- # Compute the stats of the previous group
517- if current_key != None:
518- results.append(
519- (current_key, Stats(times, self.timeout)))
520- # Initialize the new group.
521- current_key = row[0]
522- times = []
523-
524- times.append(row[1:])
525- # Compute the stats of the last group
526- if current_key != None:
527- results.append((current_key, Stats(times, self.timeout)))
528-
529- return results
530-
531- def close(self, remove=False):
532- """Close the SQLite connection.
533-
534- :param remove: If true, the DB file will be removed.
535- """
536- self.con.close()
537- if remove:
538- log.debug('Deleting request database.')
539- os.unlink(self.filename)
540- else:
541- log.debug('Keeping request database %s.' % self.filename)
542+ # Sort the result by pageid
543+ return sorted(self.pageid_times.items())
544
545
546 def main():
547@@ -339,17 +419,13 @@
548 # Default to 12: the staging timeout.
549 default=12, type="int",
550 help="The configured timeout value : determines high risk page ids.")
551- parser.add_option(
552- "--db-file", dest="db_file",
553- default=None, metavar="FILE",
554- help="Do not parse the records, generate reports from the DB file.")
555
556 options, args = parser.parse_args()
557
558 if not os.path.isdir(options.directory):
559 parser.error("Directory %s does not exist" % options.directory)
560
561- if len(args) == 0 and options.db_file is None:
562+ if len(args) == 0:
563 parser.error("At least one zserver tracelog file must be provided")
564
565 if options.from_ts is not None and options.until_ts is not None:
566@@ -383,22 +459,17 @@
567 if len(categories) == 0:
568 parser.error("No data in [categories] section of configuration.")
569
570- times = SQLiteRequestTimes(categories, options)
571-
572- if len(args) > 0:
573- parse(args, times, options)
574- times.commit()
575-
576- log.debug('Generating category statistics...')
577+ times = RequestTimes(categories, options)
578+
579+ parse(args, times, options)
580+
581 category_times = times.get_category_times()
582
583 pageid_times = []
584 url_times= []
585 if options.top_urls:
586- log.debug('Generating top %d urls statistics...' % options.top_urls)
587 url_times = times.get_top_urls_times(options.top_urls)
588 if options.pageids:
589- log.debug('Generating pageid statistics...')
590 pageid_times = times.get_pageid_times()
591
592 def _report_filename(filename):
593@@ -436,7 +507,30 @@
594 open(report_filename, 'w'), None, pageid_times, None,
595 options.timeout - 2)
596
597- times.close(options.db_file is None)
598+ # Output metrics for selected categories.
599+ report_filename = _report_filename('metrics.dat')
600+ log.info('Saving category_metrics %s', report_filename)
601+ metrics_file = open(report_filename, 'w')
602+ writer = csv.writer(metrics_file, delimiter=':')
603+ date = options.until_ts or options.from_ts or datetime.utcnow()
604+ date = time.mktime(date.timetuple())
605+
606+ for option in script_config.options('metrics'):
607+ name = script_config.get('metrics', option)
608+ found = False
609+ for category, stats in category_times:
610+ if category.title == name:
611+ writer.writerows([
612+ ("%s_99" % option, "%f@%d" % (
613+ stats.ninetyninth_percentile_time, date)),
614+ ("%s_mean" % option, "%f@%d" % (stats.mean, date))])
615+ found = True
616+ break
617+ if not found:
618+ log.warning("Can't find category %s for metric %s" % (
619+ option, name))
620+ metrics_file.close()
621+
622 return 0
623
624
625@@ -447,17 +541,9 @@
626 """
627 ext = os.path.splitext(filename)[1]
628 if ext == '.bz2':
629- p = subprocess.Popen(
630- ['bunzip2', '-c', filename],
631- stdout=subprocess.PIPE, stdin=subprocess.PIPE)
632- p.stdin.close()
633- return p.stdout
634+ return bz2.BZ2File(filename, 'r')
635 elif ext == '.gz':
636- p = subprocess.Popen(
637- ['gunzip', '-c', filename],
638- stdout=subprocess.PIPE, stdin=subprocess.PIPE)
639- p.stdin.close()
640- return p.stdout
641+ return gzip.GzipFile(filename, 'r')
642 else:
643 return open(filename, mode)
644
645@@ -684,7 +770,7 @@
646 histograms = []
647
648 def handle_times(html_title, stats):
649- histograms.append(stats.histogram)
650+ histograms.append(stats.relative_histogram)
651 print >> outf, dedent("""\
652 <tr>
653 <th class="category-title">%s</th>
654@@ -810,4 +896,3 @@
655 </body>
656 </html>
657 """)
658-
659
660=== added file 'lib/lp/scripts/utilities/tests/test_pageperformancereport.py'
661--- lib/lp/scripts/utilities/tests/test_pageperformancereport.py 1970-01-01 00:00:00 +0000
662+++ lib/lp/scripts/utilities/tests/test_pageperformancereport.py 2010-10-29 22:02:04 +0000
663@@ -0,0 +1,290 @@
664+# Copyright 2010 Canonical Ltd. This software is licensed under the
665+# GNU Affero General Public License version 3 (see the file LICENSE).
666+
667+"""Test the pageperformancereport script."""
668+
669+__metaclass__ = type
670+
671+import unittest
672+
673+from lp.testing import TestCase
674+
675+from lp.scripts.utilities.pageperformancereport import (
676+ Category,
677+ OnlineApproximateMedian,
678+ OnlineStatsCalculator,
679+ RequestTimes,
680+ Stats,
681+ )
682+
683+
684+class FakeOptions:
685+ timeout = 4
686+ db_file = None
687+ pageids = True
688+ top_urls = True
689+
690+ def __init__(self, **kwargs):
691+ """Assign all arguments as attributes."""
692+ self.__dict__.update(kwargs)
693+
694+
695+class FakeRequest:
696+ def __init__(self, url, app_seconds, sql_statements=None,
697+ sql_seconds=None, pageid=None):
698+ self.url = url
699+ self.pageid = pageid
700+ self.app_seconds = app_seconds
701+ self.sql_statements = sql_statements
702+ self.sql_seconds = sql_seconds
703+
704+
705+class FakeStats(Stats):
706+ def __init__(self, **kwargs):
707+ # Override the constructor to just store the values.
708+ self.__dict__.update(kwargs)
709+
710+
711+FAKE_REQUESTS = [
712+ FakeRequest('/', 0.5, pageid='+root'),
713+ FakeRequest('/bugs', 4.5, 56, 3.0, pageid='+bugs'),
714+ FakeRequest('/bugs', 4.2, 56, 2.2, pageid='+bugs'),
715+ FakeRequest('/bugs', 5.5, 76, 4.0, pageid='+bugs'),
716+ FakeRequest('/ubuntu', 2.5, 6, 2.0, pageid='+distribution'),
717+ FakeRequest('/launchpad', 3.5, 3, 3.0, pageid='+project'),
718+ FakeRequest('/bzr', 2.5, 4, 2.0, pageid='+project'),
719+ FakeRequest('/bugs/1', 20.5, 567, 14.0, pageid='+bug'),
720+ FakeRequest('/bugs/1', 15.5, 567, 9.0, pageid='+bug'),
721+ FakeRequest('/bugs/5', 1.5, 30, 1.2, pageid='+bug'),
722+ FakeRequest('/lazr', 1.0, 16, 0.3, pageid='+project'),
723+ FakeRequest('/drizzle', 0.9, 11, 1.3, pageid='+project'),
724+ ]
725+
726+
727+# The category stats computed for the above 12 requests.
728+CATEGORY_STATS = [
729+ # Median is an approximation.
730+ # Real values are: 2.50, 2.20, 30
731+ (Category('All', ''), FakeStats(
732+ total_hits=12, total_time=62.60, mean=5.22, median=1.0, std=5.99,
733+ total_sqltime=42, mean_sqltime=3.82, median_sqltime=1.3,
734+ std_sqltime=3.89,
735+ total_sqlstatements=1392, mean_sqlstatements=126.55,
736+ median_sqlstatements=16, std_sqlstatements=208.94,
737+ histogram=[[0, 2], [1, 2], [2, 2], [3, 1], [4, 2], [5, 3]],
738+ )),
739+ (Category('Test', ''), FakeStats()),
740+ (Category('Bugs', ''), FakeStats(
741+ total_hits=6, total_time=51.70, mean=8.62, median=4.5, std=6.90,
742+ total_sqltime=33.40, mean_sqltime=5.57, median_sqltime=3,
743+ std_sqltime=4.52,
744+ total_sqlstatements=1352, mean_sqlstatements=225.33,
745+ median_sqlstatements=56, std_sqlstatements=241.96,
746+ histogram=[[0, 0], [1, 1], [2, 0], [3, 0], [4, 2], [5, 3]],
747+ )),
748+ ]
749+
750+
751+# The top 3 URL stats computed for the above 12 requests.
752+TOP_3_URL_STATS = [
753+ ('/bugs/1', FakeStats(
754+ total_hits=2, total_time=36.0, mean=18.0, median=15.5, std=2.50,
755+ total_sqltime=23.0, mean_sqltime=11.5, median_sqltime=9.0,
756+ std_sqltime=2.50,
757+ total_sqlstatements=1134, mean_sqlstatements=567.0,
758+ median_sqlstatements=567, std_statements=0,
759+ histogram=[[0, 0], [1, 0], [2, 0], [3, 0], [4, 0], [5, 2]],
760+ )),
761+ ('/bugs', FakeStats(
762+ total_hits=3, total_time=14.2, mean=4.73, median=4.5, std=0.56,
763+ total_sqltime=9.2, mean_sqltime=3.07, median_sqltime=3,
764+ std_sqltime=0.74,
765+ total_sqlstatements=188, mean_sqlstatements=62.67,
766+ median_sqlstatements=56, std_sqlstatements=9.43,
767+ histogram=[[0, 0], [1, 0], [2, 0], [3, 0], [4, 2], [5, 1]],
768+ )),
769+ ('/launchpad', FakeStats(
770+ total_hits=1, total_time=3.5, mean=3.5, median=3.5, std=0,
771+ total_sqltime=3.0, mean_sqltime=3, median_sqltime=3, std_sqltime=0,
772+ total_sqlstatements=3, mean_sqlstatements=3,
773+ median_sqlstatements=3, std_sqlstatements=0,
774+ histogram=[[0, 0], [1, 0], [2, 0], [3, 1], [4, 0], [5, 0]],
775+ )),
776+ ]
777+
778+
779+# The pageid stats computed for the above 12 requests.
780+PAGEID_STATS = [
781+ ('+bug', FakeStats(
782+ total_hits=3, total_time=37.5, mean=12.5, median=15.5, std=8.04,
783+ total_sqltime=24.2, mean_sqltime=8.07, median_sqltime=9,
784+ std_sqltime=5.27,
785+ total_sqlstatements=1164, mean_sqlstatements=388,
786+ median_sqlstatements=567, std_sqlstatements=253.14,
787+ histogram=[[0, 0], [1, 1], [2, 0], [3, 0], [4, 0], [5, 2]],
788+ )),
789+ ('+bugs', FakeStats(
790+ total_hits=3, total_time=14.2, mean=4.73, median=4.5, std=0.56,
791+ total_sqltime=9.2, mean_sqltime=3.07, median_sqltime=3,
792+ std_sqltime=0.74,
793+ total_sqlstatements=188, mean_sqlstatements=62.67,
794+ median_sqlstatements=56, std_sqlstatements=9.43,
795+ histogram=[[0, 0], [1, 0], [2, 0], [3, 0], [4, 2], [5, 1]],
796+ )),
797+ ('+distribution', FakeStats(
798+ total_hits=1, total_time=2.5, mean=2.5, median=2.5, std=0,
799+ total_sqltime=2.0, mean_sqltime=2, median_sqltime=2, std_sqltime=0,
800+ total_sqlstatements=6, mean_sqlstatements=6,
801+ median_sqlstatements=6, std_sqlstatements=0,
802+ histogram=[[0, 0], [1, 0], [2, 1], [3, 0], [4, 0], [5, 0]],
803+ )),
804+ ('+project', FakeStats(
805+ total_hits=4, total_time=7.9, mean=1.98, median=1, std=1.08,
806+ total_sqltime=6.6, mean_sqltime=1.65, median_sqltime=1.3,
807+ std_sqltime=0.99,
808+ total_sqlstatements=34, mean_sqlstatements=8.5,
809+ median_sqlstatements=4, std_sqlstatements=5.32,
810+ histogram=[[0, 1], [1, 1], [2, 1], [3, 1], [4, 0], [5, 0]],
811+ )),
812+ ('+root', FakeStats(
813+ total_hits=1, total_time=0.5, mean=0.5, median=0.5, std=0,
814+ histogram=[[0, 1], [1, 0], [2, 0], [3, 0], [4, 0], [5, 0]],
815+ )),
816+ ]
817+
818+
819+class TestRequestTimes(TestCase):
820+ """Tests the RequestTimes backend."""
821+
822+ def setUp(self):
823+ TestCase.setUp(self)
824+ self.categories = [
825+ Category('All', '.*'), Category('Test', '.*test.*'),
826+ Category('Bugs', '.*bugs.*')]
827+ self.db = RequestTimes(self.categories, FakeOptions())
828+
829+ def setUpRequests(self):
830+ """Insert some requests into the db."""
831+ for r in FAKE_REQUESTS:
832+ self.db.add_request(r)
833+
834+ def assertStatsAreEquals(self, expected, results):
835+ self.assertEquals(
836+ len(expected), len(results), 'Wrong number of results')
837+ for idx in range(len(results)):
838+ self.assertEquals(expected[idx][0], results[idx][0],
839+ "Wrong key for results %d" % idx)
840+ key = results[idx][0]
841+ self.assertEquals(expected[idx][1].text(), results[idx][1].text(),
842+ "Wrong stats for results %d (%s)" % (idx, key))
843+ self.assertEquals(
844+ expected[idx][1].histogram, results[idx][1].histogram,
845+ "Wrong histogram for results %d (%s)" % (idx, key))
846+
847+ def test_get_category_times(self):
848+ self.setUpRequests()
849+ category_times = self.db.get_category_times()
850+ self.assertStatsAreEquals(CATEGORY_STATS, category_times)
851+
852+ def test_get_url_times(self):
853+ self.setUpRequests()
854+ url_times = self.db.get_top_urls_times(3)
855+ self.assertStatsAreEquals(TOP_3_URL_STATS, url_times)
856+
857+ def test_get_pageid_times(self):
858+ self.setUpRequests()
859+ pageid_times = self.db.get_pageid_times()
860+ self.assertStatsAreEquals(PAGEID_STATS, pageid_times)
861+
862+
863+class TestStats(TestCase):
864+ """Tests for the stats class."""
865+
866+ def test_relative_histogram(self):
867+ # Test that relative histogram gives an histogram using
868+ # relative frequency.
869+ stats = Stats()
870+ stats.total_hits = 100
871+ stats.histogram = [[0, 50], [1, 10], [2, 33], [3, 0], [4, 0], [5, 7]]
872+ self.assertEquals(
873+ [[0, 0.5], [1, .1], [2, .33], [3, 0], [4, 0], [5, .07]],
874+ stats.relative_histogram)
875+
876+
877+class TestOnlineStatsCalculator(TestCase):
878+ """Tests for the online stats calculator."""
879+
880+ def setUp(self):
881+ TestCase.setUp(self)
882+ self.stats = OnlineStatsCalculator()
883+
884+ def test_stats_for_empty_set(self):
885+ # Test the stats when there is no input.
886+ self.assertEquals(0, self.stats.count)
887+ self.assertEquals(0, self.stats.sum)
888+ self.assertEquals(0, self.stats.mean)
889+ self.assertEquals(0, self.stats.variance)
890+ self.assertEquals(0, self.stats.std)
891+
892+ def test_stats_for_one_value(self):
893+ # Test the stats when adding one element.
894+ self.stats.update(5)
895+ self.assertEquals(1, self.stats.count)
896+ self.assertEquals(5, self.stats.sum)
897+ self.assertEquals(5, self.stats.mean)
898+ self.assertEquals(0, self.stats.variance)
899+ self.assertEquals(0, self.stats.std)
900+
901+ def test_None_are_ignored(self):
902+ self.stats.update(None)
903+ self.assertEquals(0, self.stats.count)
904+
905+ def test_stats_for_3_values(self):
906+ for x in [3, 6, 9]:
907+ self.stats.update(x)
908+ self.assertEquals(3, self.stats.count)
909+ self.assertEquals(18, self.stats.sum)
910+ self.assertEquals(6, self.stats.mean)
911+ self.assertEquals(6, self.stats.variance)
912+ self.assertEquals("2.45", "%.2f" % self.stats.std)
913+
914+
915+SHUFFLE_RANGE_100 = [
916+ 25, 79, 99, 76, 60, 63, 87, 77, 51, 82, 42, 96, 93, 58, 32, 66, 75,
917+ 2, 26, 22, 11, 73, 61, 83, 65, 68, 44, 81, 64, 3, 33, 34, 15, 1,
918+ 92, 27, 90, 74, 46, 57, 59, 31, 13, 19, 89, 29, 56, 94, 50, 49, 62,
919+ 37, 21, 35, 5, 84, 88, 16, 8, 23, 40, 6, 48, 10, 97, 0, 53, 17, 30,
920+ 18, 43, 86, 12, 71, 38, 78, 36, 7, 45, 47, 80, 54, 39, 91, 98, 24,
921+ 55, 14, 52, 20, 69, 85, 95, 28, 4, 9, 67, 70, 41, 72
922+ ]
923+
924+
925+class TestOnlineApproximateMedian(TestCase):
926+ """Tests for the approximate median computation."""
927+
928+ def setUp(self):
929+ TestCase.setUp(self)
930+ self.estimator = OnlineApproximateMedian()
931+
932+ def test_median_is_0_when_no_input(self):
933+ self.assertEquals(0, self.estimator.median)
934+
935+ def test_median_is_true_median_for_n_lower_than_bucket_size(self):
936+ for x in range(9):
937+ self.estimator.update(x)
938+ self.assertEquals(4, self.estimator.median)
939+
940+ def test_None_input_is_ignored(self):
941+ self.estimator.update(1)
942+ self.estimator.update(None)
943+ self.assertEquals(1, self.estimator.median)
944+
945+ def test_approximage_median_is_good_enough(self):
946+ for x in SHUFFLE_RANGE_100:
947+ self.estimator.update(x)
948+ # True median is 50, 52 is good enough :-)
949+ self.assertEquals(52, self.estimator.median)
950+
951+
952+def test_suite():
953+ return unittest.TestLoader().loadTestsFromName(__name__)
954
955=== modified file 'setup.py'
956--- setup.py 2010-09-18 08:00:27 +0000
957+++ setup.py 2010-10-29 22:02:04 +0000
958@@ -51,7 +51,6 @@
959 'meliae',
960 'mercurial',
961 'mocker',
962- 'numpy',
963 'oauth',
964 'paramiko',
965 'python-memcached',
966
967=== modified file 'utilities/page-performance-report.ini'
968--- utilities/page-performance-report.ini 2010-10-24 21:00:11 +0000
969+++ utilities/page-performance-report.ini 2010-10-29 22:02:04 +0000
970@@ -45,3 +45,10 @@
971 Private XML-RPC=^https?://xmlrpc-private\.
972 Shipit=^https?://shipit\.
973
974+[metrics]
975+ppr_all=All launchpad except opstats
976+ppr_bugs=Bugs
977+ppr_api=API
978+ppr_code=Code
979+ppr_translations=Translations
980+ppr_registry=Registry
981
982=== modified file 'versions.cfg'
983--- versions.cfg 2010-10-27 04:23:52 +0000
984+++ versions.cfg 2010-10-29 22:02:04 +0000
985@@ -45,7 +45,6 @@
986 mercurial = 1.6.2
987 mocker = 0.10.1
988 mozrunner = 1.3.4
989-numpy = 1.3.0
990 oauth = 1.0
991 paramiko = 1.7.4
992 Paste = 1.7.2