Merge lp:~flacoste/launchpad/ppr-merge into lp:launchpad

Proposed by Francis J. Lacoste
Status: Merged
Approved by: Gavin Panella
Approved revision: no longer in the source branch.
Merged at revision: 11872
Proposed branch: lp:~flacoste/launchpad/ppr-merge
Merge into: lp:launchpad
Prerequisite: lp:~flacoste/launchpad/ppr-constant-memory
Diff against target: 421 lines (+208/-20)
2 files modified
lib/lp/scripts/utilities/pageperformancereport.py (+124/-16)
lib/lp/scripts/utilities/tests/test_pageperformancereport.py (+84/-4)
To merge this branch: bzr merge lp:~flacoste/launchpad/ppr-merge
Reviewer: Gavin Panella (community)
Status: Approve
Review via email: mp+40014@code.launchpad.net

Commit message

Add a --merge option that can be used to aggregate stats saved from individual runs.

Description of the change

Hello,

This is a follow-up to my previous constant-memory branch. It adds the option
of generating a report by aggregating the RequestTimes data structures pickled
in previous runs.

This will allow us to generate weekly, monthly and even yearly page
performance reports with minimal fuss.
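
For illustration, here is a minimal sketch of that aggregation, mirroring the
loop that --merge adds to main() (the filenames here are made up):

    import bz2
    import cPickle

    # Combine several pickled RequestTimes instances into one report input.
    merged = None
    for filename in ['ppr-2010-11-01.bz2', 'ppr-2010-11-02.bz2']:
        f = bz2.BZ2File(filename, 'r')
        day_times = cPickle.load(f)  # a RequestTimes saved by a daily run
        f.close()
        merged = day_times if merged is None else merged + day_times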

I did this by adding an __add__ method to all the data structures involved.
There are only two tricky algorithms. The first is merging the variances: for
that I used the algorithm pointed at by the Wikipedia page
http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance

The transcription on that page didn't actually produce the expected results,
so I went back to the paper and used the original formula.
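
Concretely, the merge combines two running (count, sum, M2) triples. Here is
a self-contained sketch of the formula as used in
OnlineStatsCalculator.__add__ (the helper name merge_stats is mine):

    def merge_stats(count_a, sum_a, m2_a, count_b, sum_b, m2_b):
        # Formula 2.1b from Chan, Golub & LeVeque (1979): merge two
        # running (count, sum, M2) triples without revisiting the data.
        count = count_a + count_b
        total = sum_a + sum_b
        if count_a > 0 and count_b > 0:
            m2 = m2_a + m2_b + (
                (float(count_a) / (count_b * count)) *
                ((float(count_b) / count_a) * sum_a - sum_b)**2)
        else:
            m2 = m2_a + m2_b  # One of the sides is empty.
        return count, total, m2

    # [3, 6, 9] has the triple (3, 18, 18) and [1, 2, 3] has (3, 6, 2);
    # merging gives M2 == 44, the value checked by the new tests.
    print merge_stats(3, 18, 18, 3, 6, 2)  # (6, 24, 44.0)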

The way to merge two approximate medians is home-made. I went with what
seemed to make the most sense, and it seems to produce OK results.
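
In miniature, with the values used in the new tests (bucket size 3): every
surviving candidate from the other estimator re-enters the selection at the
round it had already reached, and full buckets promote their median to the
next level.

    from lp.scripts.utilities.pageperformancereport import (
        OnlineApproximateMedian)

    median1 = OnlineApproximateMedian(3)
    median1.buckets = [[1, 3], [4, 5], [6, 3]]
    median2 = OnlineApproximateMedian(3)
    median2.buckets = [[], [3, 6], [3, 7]]
    # Re-feeding 3 at level 1 fills bucket [4, 5]; its median (4) then
    # cascades upwards, ending up in a new fourth-level bucket.
    print (median1 + median2).buckets  # [[1, 3], [6], [3, 7], [4]]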

Revision history for this message
Gavin Panella (allenap) wrote :

[1]

+            (float(self.count)/(other.count*results.count)) *
+            ((float(other.count)/self.count)*self.sum - other.sum)**2)
...
+        results.mean = float(results.sum)/results.count
...
+            for n in range((i+1)-len(self.buckets)):

PEP8 suggests "Use spaces around arithmetic operators", but also says
"A Foolish Consistency is the Hobgoblin of Little Minds" :)

[2]

+            found = False
+            for i, (category, stats) in enumerate(self.category_times):
+                if category.title == other_category.title:
+                    results.category_times[i] = (
+                        category, stats + other_stats)
+                    found = True
+                    break
+            if not found:
+                results.category_times.append(
+                    (other_category, copy.deepcopy(other_stats)))

I think this is what for...else is for:

            for i, (category, stats) in enumerate(self.category_times):
                if category.title == other_category.title:
                    results.category_times[i] = (
                        category, stats + other_stats)
                    break
            else:
                results.category_times.append(
                    (other_category, copy.deepcopy(other_stats)))

[3]

+                sorted(url_times.items(),
+                    key=lambda x: x[1].total_time,
+                    reverse=True)[:self.top_urls_cache_size])

Indentation, and the key function might be clearer with some
unwrapping:

                sorted(url_times.items(),
                       key=lambda (category, stats): stats.total_time,
                       reverse=True)[:self.top_urls_cache_size])

sorted() is also happy with an iterator - i.e. iteritems() - but I
don't think that url_times will ever be large enough to make much of a
difference.

[4]

+            f = bz2.BZ2File(filename, 'r')
+            times += cPickle.load(f)
+            f.close()

BZ2File doesn't provide the context protocol (or whatever it's called)
but files returned by codecs.open(...) do, so this is possible:

            with codecs.open(filename, 'r', 'bz2') as f:
                times += cPickle.load(f)

Makes bugger all difference though :)

review: Approve
Revision history for this message
Francis J. Lacoste (flacoste) wrote :

On November 4, 2010, Gavin Panella wrote:
> Review: Approve
> [1]
>
> +            (float(self.count)/(other.count*results.count)) *
> +            ((float(other.count)/self.count)*self.sum - other.sum)**2)
> ...
> +        results.mean = float(results.sum)/results.count
> ...
> +            for n in range((i+1)-len(self.buckets)):
>
> PEP8 suggests "Use spaces around arithmetic operators", but also says
> "A Foolish Consistency is the Hobgoblin of Little Minds" :)
>
>

I added spaces, except for the **2 which would make the line go over 80!

> [2]
>
> +            found = False
> +            for i, (category, stats) in enumerate(self.category_times):
> +                if category.title == other_category.title:
> +                    results.category_times[i] = (
> +                        category, stats + other_stats)
> +                    found = True
> +                    break
> +            if not found:
> +                results.category_times.append(
> +                    (other_category, copy.deepcopy(other_stats)))
>
> I think this is what for...else is for:
>
>             for i, (category, stats) in enumerate(self.category_times):
>                 if category.title == other_category.title:
>                     results.category_times[i] = (
>                         category, stats + other_stats)
>                     break
>             else:
>                 results.category_times.append(
>                     (other_category, copy.deepcopy(other_stats)))
>

Indeed, I'm never sure of that one. Thanks for pointing this out. I changed
another instance of that idiom in the same file.

>
> [3]
>
> +                sorted(url_times.items(),
> +                    key=lambda x: x[1].total_time,
> +                    reverse=True)[:self.top_urls_cache_size])
>
> Indentation, and the key function might be clearer with some
> unwrapping:
>
>                 sorted(url_times.items(),
>                        key=lambda (category, stats): stats.total_time,
>                        reverse=True)[:self.top_urls_cache_size])
>
> sorted() is also happy with an iterator - i.e. iteritems() - but I
> don't think that url_times will ever be large enough to make much of a
> difference.
>

Nice suggestion.

>
> [4]
>
> +            f = bz2.BZ2File(filename, 'r')
> +            times += cPickle.load(f)
> +            f.close()
>
> BZ2File doesn't provide the context protocol (or whatever it's called)
> but files returned by codecs.open(...) do, so this is possible:
>
>             with codecs.open(filename, 'r', 'bz2') as f:
>                 times += cPickle.load(f)
>
> Makes bugger all difference though :)

Hmm, I already import bz2, so I kept the less 3733t idiom instead of adding
an import :-p

Thanks for the review!

--
Francis J. Lacoste
<email address hidden>

Preview Diff

=== modified file 'lib/lp/scripts/utilities/pageperformancereport.py'
--- lib/lp/scripts/utilities/pageperformancereport.py 2010-11-05 15:24:12 +0000
+++ lib/lp/scripts/utilities/pageperformancereport.py 2010-11-05 15:24:31 +0000
@@ -9,6 +9,7 @@
 import bz2
 import cPickle
 from cgi import escape as html_quote
+import copy
 from ConfigParser import RawConfigParser
 import csv
 from datetime import datetime
@@ -69,6 +70,11 @@
     def __cmp__(self, other):
         return cmp(self.title.lower(), other.title.lower())
 
+    def __deepcopy__(self, memo):
+        # We provide __deepcopy__ because the module doesn't handle
+        # compiled regular expression by default.
+        return Category(self.title, self.regexp)
+
 
 class OnlineStatsCalculator:
     """Object that can compute count, sum, mean, variance and median.
@@ -113,6 +119,30 @@
         else:
             return math.sqrt(self.variance)
 
+    def __add__(self, other):
+        """Adds this and another OnlineStatsCalculator.
+
+        The result combines the stats of the two objects.
+        """
+        results = OnlineStatsCalculator()
+        results.count = self.count + other.count
+        results.sum = self.sum + other.sum
+        if self.count > 0 and other.count > 0:
+            # This is 2.1b in Chan, Tony F.; Golub, Gene H.; LeVeque,
+            # Randall J. (1979), "Updating Formulae and a Pairwise Algorithm
+            # for Computing Sample Variances.",
+            # Technical Report STAN-CS-79-773,
+            # Department of Computer Science, Stanford University,
+            # ftp://reports.stanford.edu/pub/cstr/reports/cs/tr/79/773/CS-TR-79-773.pdf .
+            results.M2 = self.M2 + other.M2 + (
+                (float(self.count) / (other.count * results.count)) *
+                ((float(other.count) / self.count) * self.sum - other.sum)**2)
+        else:
+            results.M2 = self.M2 + other.M2  # One of them is 0.
+        if results.count > 0:
+            results.mean = float(results.sum) / results.count
+        return results
+
 
 class OnlineApproximateMedian:
     """Approximate the median of a set of elements.
@@ -139,23 +169,22 @@
 
         The bucket size should be a low odd-integer.
        """
-        self.count = 0
         self.bucket_size = bucket_size
         # Index of the median in a completed bucket.
         self.median_idx = (bucket_size-1)//2
         self.buckets = []
 
-    def update(self, x):
+    def update(self, x, order=0):
         """Update with x."""
         if x is None:
             return
 
-        self.count += 1
-        i = 0
+        i = order
         while True:
             # Create bucket on demand.
-            if i == len(self.buckets):
-                self.buckets.append([])
+            if i >= len(self.buckets):
+                for n in range((i+1)-len(self.buckets)):
+                    self.buckets.append([])
             bucket = self.buckets[i]
             bucket.append(x)
             if len(bucket) == self.bucket_size:
@@ -171,9 +200,6 @@
     @property
     def median(self):
         """Return the median."""
-        if self.count == 0:
-            return 0
-
         # Find the 'weighted' median by assigning a weight to each
         # element proportional to how far they have been selected.
         candidates = []
@@ -183,6 +209,9 @@
             for x in bucket:
                 total_weight += weight
                 candidates.append([x, weight])
+        if len(candidates) == 0:
+            return 0
+
         # Each weight is the equivalent of having the candidates appear
         # that number of times in the array.
         # So buckets like [[1, 2], [2, 3], [4, 2]] would be expanded to
@@ -196,6 +225,21 @@
             if weighted_idx > median:
                 return x
 
+    def __add__(self, other):
+        """Merge two approximators together.
+
+        All candidates from the other are merged through the standard
+        algorithm, starting at the same level. So an item that went through
+        two rounds of selection, will be compared with other items having
+        gone through the same number of rounds.
+        """
+        results = OnlineApproximateMedian(self.bucket_size)
+        results.buckets = copy.deepcopy(self.buckets)
+        for i, bucket in enumerate(other.buckets):
+            for x in bucket:
+                results.update(x, i)
+        return results
+
 
 class Stats:
     """Bag to hold and compute request statistics.
@@ -341,6 +385,21 @@
         idx = int(min(len(self.histogram)-1, request.app_seconds))
         self.histogram[idx][1] += 1
 
+    def __add__(self, other):
+        """Merge another OnlineStats with this one."""
+        results = copy.deepcopy(self)
+        results.time_stats += other.time_stats
+        results.time_median_approximate += other.time_median_approximate
+        results.sql_time_stats += other.sql_time_stats
+        results.sql_time_median_approximate += (
+            other.sql_time_median_approximate)
+        results.sql_statements_stats += other.sql_statements_stats
+        results.sql_statements_median_approximate += (
+            other.sql_statements_median_approximate)
+        for i, (n, f) in enumerate(other._histogram):
+            results._histogram[i][1] += f
+        return results
+
 
 class RequestTimes:
     """Collect statistics from requests.
@@ -402,10 +461,9 @@
             cutoff = int(self.top_urls_cache_size*0.90)
             self.url_times = dict(
                 sorted(self.url_times.items(),
-                       key=lambda x: x[1].total_time,
+                       key=lambda (url, stats): stats.total_time,
                        reverse=True)[:cutoff])
 
-
     def get_category_times(self):
         """Return the times for each category."""
         return self.category_times
@@ -415,13 +473,50 @@
         # Sort the result by total time
         return sorted(
             self.url_times.items(),
-            key=lambda x: x[1].total_time, reverse=True)[:self.top_urls]
+            key=lambda (url, stats): stats.total_time,
+            reverse=True)[:self.top_urls]
 
     def get_pageid_times(self):
         """Return the times for the pageids."""
         # Sort the result by pageid
         return sorted(self.pageid_times.items())
 
+    def __add__(self, other):
+        """Merge two RequestTimes together."""
+        results = copy.deepcopy(self)
+        for other_category, other_stats in other.category_times:
+            for i, (category, stats) in enumerate(self.category_times):
+                if category.title == other_category.title:
+                    results.category_times[i] = (
+                        category, stats + other_stats)
+                    break
+            else:
+                results.category_times.append(
+                    (other_category, copy.deepcopy(other_stats)))
+
+        url_times = results.url_times
+        for url, stats in other.url_times.items():
+            if url in url_times:
+                url_times[url] += stats
+            else:
+                url_times[url] = copy.deepcopy(stats)
+        # Only keep top_urls_cache_size entries.
+        if len(self.url_times) > self.top_urls_cache_size:
+            self.url_times = dict(
+                sorted(
+                    url_times.items(),
+                    key=lambda (url, stats): stats.total_time,
+                    reverse=True)[:self.top_urls_cache_size])
+
+        pageid_times = results.pageid_times
+        for pageid, stats in other.pageid_times.items():
+            if pageid in pageid_times:
+                pageid_times[pageid] += stats
+            else:
+                pageid_times[pageid] = copy.deepcopy(stats)
+
+        return results
+
 
 def main():
     parser = LPOptionParser("%prog [args] tracelog [...]")
@@ -459,6 +554,11 @@
         # Default to 12: the staging timeout.
         default=12, type="int",
         help="The configured timeout value : determines high risk page ids.")
+    parser.add_option(
+        "--merge", dest="merge",
+        default=False, action='store_true',
+        help="Files are interpreted as pickled stats and are aggregated for" +
+             "the report.")
 
     options, args = parser.parse_args()
 
@@ -473,6 +573,9 @@
         parser.error(
             "--from timestamp %s is before --until timestamp %s"
             % (options.from_ts, options.until_ts))
+    if options.from_ts is not None or options.until_ts is not None:
+        if options.merge:
+            parser.error('--from and --until cannot be used with --merge')
 
     for filename in args:
         if not os.path.exists(filename):
@@ -501,7 +604,14 @@
 
     times = RequestTimes(categories, options)
 
-    parse(args, times, options)
+    if options.merge:
+        for filename in args:
+            log.info('Merging %s...' % filename)
+            f = bz2.BZ2File(filename, 'r')
+            times += cPickle.load(f)
+            f.close()
+    else:
+        parse(args, times, options)
 
     category_times = times.get_category_times()
 
@@ -564,16 +674,14 @@
 
     for option in script_config.options('metrics'):
         name = script_config.get('metrics', option)
-        found = False
         for category, stats in category_times:
             if category.title == name:
                 writer.writerows([
                     ("%s_99" % option, "%f@%d" % (
                         stats.ninetyninth_percentile_time, date)),
                     ("%s_mean" % option, "%f@%d" % (stats.mean, date))])
-                found = True
                 break
-        if not found:
+        else:
             log.warning("Can't find category %s for metric %s" % (
                 option, name))
     metrics_file.close()
 
=== modified file 'lib/lp/scripts/utilities/tests/test_pageperformancereport.py'
--- lib/lp/scripts/utilities/tests/test_pageperformancereport.py 2010-11-05 15:24:12 +0000
+++ lib/lp/scripts/utilities/tests/test_pageperformancereport.py 2010-11-05 15:24:31 +0000
@@ -12,6 +12,7 @@
 from lp.scripts.utilities.pageperformancereport import (
     Category,
     OnlineApproximateMedian,
+    OnlineStats,
     OnlineStatsCalculator,
     RequestTimes,
     Stats,
@@ -30,6 +31,7 @@
 
 
 class FakeRequest:
+
     def __init__(self, url, app_seconds, sql_statements=None,
                  sql_seconds=None, pageid=None):
         self.url = url
@@ -40,6 +42,7 @@
 
 
 class FakeStats(Stats):
+
     def __init__(self, **kwargs):
         # Override the constructor to just store the values.
         self.__dict__.update(kwargs)
@@ -196,9 +199,28 @@
         pageid_times = self.db.get_pageid_times()
         self.assertStatsAreEquals(PAGEID_STATS, pageid_times)
 
+    def test___add__(self):
+        # Ensure that adding two RequestTimes together result in
+        # a merge of their constituencies.
+        db1 = self.db
+        db2 = RequestTimes(self.categories, FakeOptions())
+        db1.add_request(FakeRequest('/', 1.5, 5, 1.0, '+root'))
+        db1.add_request(FakeRequest('/bugs', 3.5, 15, 1.0, '+bugs'))
+        db2.add_request(FakeRequest('/bugs/1', 5.0, 30, 4.0, '+bug'))
+        results = db1 + db2
+        self.assertEquals(3, results.category_times[0][1].total_hits)
+        self.assertEquals(0, results.category_times[1][1].total_hits)
+        self.assertEquals(2, results.category_times[2][1].total_hits)
+        self.assertEquals(1, results.pageid_times['+root'].total_hits)
+        self.assertEquals(1, results.pageid_times['+bugs'].total_hits)
+        self.assertEquals(1, results.pageid_times['+bug'].total_hits)
+        self.assertEquals(1, results.url_times['/'].total_hits)
+        self.assertEquals(1, results.url_times['/bugs'].total_hits)
+        self.assertEquals(1, results.url_times['/bugs/1'].total_hits)
+
 
 class TestStats(TestCase):
-    """Tests for the stats class."""
+    """Tests for the Stats class."""
 
     def test_relative_histogram(self):
         # Test that relative histogram gives an histogram using
@@ -211,6 +233,26 @@
             stats.relative_histogram)
 
 
+class TestOnlineStats(TestCase):
+    """Tests for the OnlineStats class."""
+
+    def test___add__(self):
+        # Ensure that adding two OnlineStats merge all their constituencies.
+        stats1 = OnlineStats(4)
+        stats1.update(FakeRequest('/', 2.0, 5, 1.5))
+        stats2 = OnlineStats(4)
+        stats2.update(FakeRequest('/', 1.5, 2, 3.0))
+        stats2.update(FakeRequest('/', 5.0, 2, 2.0))
+        results = stats1 + stats2
+        self.assertEquals(3, results.total_hits)
+        self.assertEquals(2, results.median)
+        self.assertEquals(9, results.total_sqlstatements)
+        self.assertEquals(2, results.median_sqlstatements)
+        self.assertEquals(6.5, results.total_sqltime)
+        self.assertEquals(2.0, results.median_sqltime)
+        self.assertEquals([[0, 0], [1, 1], [2, 1], [3, 1]], results.histogram)
+
+
 class TestOnlineStatsCalculator(TestCase):
     """Tests for the online stats calculator."""
 
@@ -248,14 +290,44 @@
         self.assertEquals(6, self.stats.variance)
         self.assertEquals("2.45", "%.2f" % self.stats.std)
 
+    def test___add___two_empty_together(self):
+        stats2 = OnlineStatsCalculator()
+        results = self.stats + stats2
+        self.assertEquals(0, results.count)
+        self.assertEquals(0, results.sum)
+        self.assertEquals(0, results.mean)
+        self.assertEquals(0, results.variance)
+
+    def test___add___one_empty(self):
+        stats2 = OnlineStatsCalculator()
+        for x in [1, 2, 3]:
+            self.stats.update(x)
+        results = self.stats + stats2
+        self.assertEquals(3, results.count)
+        self.assertEquals(6, results.sum)
+        self.assertEquals(2, results.mean)
+        self.assertEquals(2, results.M2)
+
+    def test___add__(self):
+        stats2 = OnlineStatsCalculator()
+        for x in [3, 6, 9]:
+            self.stats.update(x)
+        for x in [1, 2, 3]:
+            stats2.update(x)
+        results = self.stats + stats2
+        self.assertEquals(6, results.count)
+        self.assertEquals(24, results.sum)
+        self.assertEquals(4, results.mean)
+        self.assertEquals(44, results.M2)
+
 
 SHUFFLE_RANGE_100 = [
     25, 79, 99, 76, 60, 63, 87, 77, 51, 82, 42, 96, 93, 58, 32, 66, 75,
-     2, 26, 22, 11, 73, 61, 83, 65, 68, 44, 81, 64, 3, 33, 34, 15, 1,
+    2, 26, 22, 11, 73, 61, 83, 65, 68, 44, 81, 64, 3, 33, 34, 15, 1,
     92, 27, 90, 74, 46, 57, 59, 31, 13, 19, 89, 29, 56, 94, 50, 49, 62,
     37, 21, 35, 5, 84, 88, 16, 8, 23, 40, 6, 48, 10, 97, 0, 53, 17, 30,
     18, 43, 86, 12, 71, 38, 78, 36, 7, 45, 47, 80, 54, 39, 91, 98, 24,
-    55, 14, 52, 20, 69, 85, 95, 28, 4, 9, 67, 70, 41, 72
+    55, 14, 52, 20, 69, 85, 95, 28, 4, 9, 67, 70, 41, 72,
 ]
 
 
@@ -279,12 +351,20 @@
         self.estimator.update(None)
         self.assertEquals(1, self.estimator.median)
 
-    def test_approximage_median_is_good_enough(self):
+    def test_approximate_median_is_good_enough(self):
         for x in SHUFFLE_RANGE_100:
             self.estimator.update(x)
         # True median is 50, 49 is good enough :-)
         self.assertIn(self.estimator.median, range(49,52))
 
+    def test___add__(self):
+        median1 = OnlineApproximateMedian(3)
+        median1.buckets = [[1, 3], [4, 5], [6, 3]]
+        median2 = OnlineApproximateMedian(3)
+        median2.buckets = [[], [3, 6], [3, 7]]
+        results = median1 + median2
+        self.assertEquals([[1, 3], [6], [3, 7], [4]], results.buckets)
+
 
 def test_suite():
     return unittest.TestLoader().loadTestsFromName(__name__)