Merge lp:~flacoste/launchpad/ppr-merge into lp:launchpad
Status: | Merged |
---|---|
Approved by: | Gavin Panella |
Approved revision: | no longer in the source branch. |
Merged at revision: | 11872 |
Proposed branch: | lp:~flacoste/launchpad/ppr-merge |
Merge into: | lp:launchpad |
Prerequisite: | lp:~flacoste/launchpad/ppr-constant-memory |
Diff against target: |
421 lines (+208/-20) 2 files modified
lib/lp/scripts/utilities/pageperformancereport.py (+124/-16) |
lib/lp/scripts/utilities/tests/test_pageperformancereport.py (+84/-4) |
To merge this branch: | bzr merge lp:~flacoste/launchpad/ppr-merge |
Related bugs: |
Reviewer | Review Type | Date Requested | Status |
---|---|---|---|
Gavin Panella (community) | Approve | ||
Review via email: mp+40014@code.launchpad.net |
Commit message
Add a --merge option that can be used to aggregate stats saved from individual runs.
Description of the change
Hello,
This is a follow-up to my previous constant-memory branch. It adds the option
of generating a report by aggregating the RequestTimes data structures pickled
in previous runs.
This will allow us to generate weekly, monthly and even yearly page
performance reports with minimal fuss.
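For example, assuming each daily run has saved its pickled RequestTimes (the
script path and file names below are purely illustrative):

    # Aggregate a week of previously saved stats into a single report.
    ./page-performance-report.py --merge \
        ppr-daily-2010-10-25.bz2 ppr-daily-2010-10-26.bz2 \
        ppr-daily-2010-10-27.bz2 ppr-daily-2010-10-28.bz2 \
        ppr-daily-2010-10-29.bz2 ppr-daily-2010-10-30.bz2 \
        ppr-daily-2010-10-31.bz2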
I did this by adding an __add__ method to all the data structures involved.
There are only two tricky algorithms; one is how to merge the variances. I
used the algorithm pointed at by the Wikipedia page
http://
The transcription on that page didn't actually produce the expected results,
so I went to the paper and used the original formula.
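For reference, with n the count, T the sum and M2 the sum of squared
deviations from the mean in each partition, the combine step is (the paper's
formula 2.1b, as transcribed in a comment in the diff):

    M_2 = M_{2,A} + M_{2,B}
        + \frac{n_A}{n_B\,n}\left(\frac{n_B}{n_A}T_A - T_B\right)^2,
    \qquad n = n_A + n_B

which is algebraically the same as the perhaps more familiar
M_{2,A} + M_{2,B} + \delta^2 n_A n_B / n with \delta = T_B/n_B - T_A/n_A.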
The way to merge two approximate medians is home-made: I went with what seemed
to make the most sense, and it seems to produce OK results.
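Concretely, merging replays every surviving candidate from the other
estimator at the selection round it had already reached, and a bucket that
fills up promotes its median to the next round. The new test pins down the
behaviour (bucket size 3, so a full bucket promotes its middle element):

    median1 = OnlineApproximateMedian(3)
    median1.buckets = [[1, 3], [4, 5], [6, 3]]
    median2 = OnlineApproximateMedian(3)
    median2.buckets = [[], [3, 6], [3, 7]]
    # Replaying 3 at round 1 fills [4, 5] and promotes its median 4,
    # which in turn fills [6, 3] and promotes 4 again into a new round.
    print (median1 + median2).buckets
    # [[1, 3], [6], [3, 7], [4]]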
Francis J. Lacoste (flacoste) wrote:
On November 4, 2010, Gavin Panella wrote:
> Review: Approve
> [1]
>
> +                (float(self.count)/(other.count*results.count)) *
> +                ((float(other.count)/self.count)*self.sum - other.sum)**2)
> ...
> +            results.mean = float(results.sum)/results.count
> ...
> +                for n in range((i+1)-len(self.buckets)):
>
> PEP8 suggests "Use spaces around arithmetic operators", but also says
> "A Foolish Consistency is the Hobgoblin of Little Minds" :)
>
>
I added spaces, except for the **2 which would make the line go over 80!
> [2]
>
> +            found = False
> +            for i, (category, stats) in enumerate(self.category_times):
> +                if category.title == other_category.title:
> +                    results.category_times[i] = (
> +                        category, stats + other_stats)
> +                    found = True
> +                    break
> +            if not found:
> +                results.category_times.append(
> +                    (other_category, copy.deepcopy(other_stats)))
>
> I think this is what for...else is for:
>
>     for i, (category, stats) in enumerate(self.category_times):
>         if category.title == other_category.title:
>             results.category_times[i] = (
>                 category, stats + other_stats)
>             break
>     else:
>         results.category_times.append(
>             (other_category, copy.deepcopy(other_stats)))
>
Indeed, I'm never sure of that one. Thanks for pointing this out. I changed
another instance of that idiom in the same file.
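For anyone else who is never sure of it, a minimal standalone illustration:
the else clause runs only when the loop finishes without hitting break.

    haystack = ['a', 'b', 'c']
    for item in haystack:
        if item == 'q':
            print 'found'
            break
    else:
        # Reached only because no break occurred.
        print 'not found'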
>
> [3]
>
> +            sorted(
> +                url_times.items(),
> +                    key=lambda x: x[1].total_time,
> +                    reverse=True)[:self.top_urls_cache_size])
>
> Indentation, and the key function might be clearer with some
> unwrapping:
>
>     sorted(
>         url_times.items(),
>         key=lambda (category, stats): stats.total_time,
>         reverse=True)[:self.top_urls_cache_size])
>
> sorted() is also happy with an iterator - i.e. iteritems() - but I
> don't think that url_times will ever be large enough to make much of a
> difference.
>
Nice suggestion.
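One caveat worth recording: the lambda (url, stats) style relies on tuple
parameter unpacking, which only works in Python 2 (PEP 3113 removed it in
Python 3) — fine for this codebase. A quick illustration with made-up numbers:

    url_times = {'/': 2.5, '/bugs': 7.0, '/people': 1.0}
    top = sorted(url_times.items(),
                 key=lambda (url, total_time): total_time,
                 reverse=True)
    print top
    # [('/bugs', 7.0), ('/', 2.5), ('/people', 1.0)]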
>
> [4]
>
> +        f = bz2.BZ2File(filename, 'r')
> +        times += cPickle.load(f)
> +        f.close()
>
> BZ2File doesn't provide the context protocol (or whatever it's called)
> but files returned by codecs.open(...) do, so this is possible:
>
>     with codecs.open(filename, 'r', 'bz2') as f:
>         times += cPickle.load(f)
>
> Makes bugger all difference though :)
Hmm, I already import bz2, so I kept the less 3733t idiom instead of adding an
import :-p
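That said, if we ever want the with-statement without the codecs detour,
contextlib.closing wraps BZ2File too — just a sketch, not in this branch:

    import bz2
    import contextlib
    import cPickle

    def load_pickled_stats(filenames, times):
        # Same effect as the --merge loop in main(), with deterministic
        # cleanup even if cPickle.load raises.
        for filename in filenames:
            with contextlib.closing(bz2.BZ2File(filename, 'r')) as f:
                times += cPickle.load(f)
        return times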
Thanks for the review!
--
Francis J. Lacoste
<email address hidden>
Preview Diff
1 | === modified file 'lib/lp/scripts/utilities/pageperformancereport.py' |
2 | --- lib/lp/scripts/utilities/pageperformancereport.py 2010-11-05 15:24:12 +0000 |
3 | +++ lib/lp/scripts/utilities/pageperformancereport.py 2010-11-05 15:24:31 +0000 |
4 | @@ -9,6 +9,7 @@ |
5 | import bz2 |
6 | import cPickle |
7 | from cgi import escape as html_quote |
8 | +import copy |
9 | from ConfigParser import RawConfigParser |
10 | import csv |
11 | from datetime import datetime |
12 | @@ -69,6 +70,11 @@ |
13 | def __cmp__(self, other): |
14 | return cmp(self.title.lower(), other.title.lower()) |
15 | |
16 | + def __deepcopy__(self, memo): |
17 | + # We provide __deepcopy__ because the module doesn't handle |
18 | + # compiled regular expression by default. |
19 | + return Category(self.title, self.regexp) |
20 | + |
21 | |
22 | class OnlineStatsCalculator: |
23 | """Object that can compute count, sum, mean, variance and median. |
24 | @@ -113,6 +119,30 @@ |
25 | else: |
26 | return math.sqrt(self.variance) |
27 | |
28 | + def __add__(self, other): |
29 | + """Adds this and another OnlineStatsCalculator. |
30 | + |
31 | + The result combines the stats of the two objects. |
32 | + """ |
33 | + results = OnlineStatsCalculator() |
34 | + results.count = self.count + other.count |
35 | + results.sum = self.sum + other.sum |
36 | + if self.count > 0 and other.count > 0: |
37 | + # This is 2.1b in Chan, Tony F.; Golub, Gene H.; LeVeque, |
38 | + # Randall J. (1979), "Updating Formulae and a Pairwise Algorithm |
39 | + # for Computing Sample Variances.", |
40 | + # Technical Report STAN-CS-79-773, |
41 | + # Department of Computer Science, Stanford University, |
42 | + # ftp://reports.stanford.edu/pub/cstr/reports/cs/tr/79/773/CS-TR-79-773.pdf . |
43 | + results.M2 = self.M2 + other.M2 + ( |
44 | + (float(self.count) / (other.count * results.count)) * |
45 | + ((float(other.count) / self.count) * self.sum - other.sum)**2) |
46 | + else: |
47 | + results.M2 = self.M2 + other.M2 # One of them is 0. |
48 | + if results.count > 0: |
49 | + results.mean = float(results.sum) / results.count |
50 | + return results |
51 | + |
52 | |
53 | class OnlineApproximateMedian: |
54 | """Approximate the median of a set of elements. |
55 | @@ -139,23 +169,22 @@ |
56 | |
57 | The bucket size should be a low odd-integer. |
58 | """ |
59 | - self.count = 0 |
60 | self.bucket_size = bucket_size |
61 | # Index of the median in a completed bucket. |
62 | self.median_idx = (bucket_size-1)//2 |
63 | self.buckets = [] |
64 | |
65 | - def update(self, x): |
66 | + def update(self, x, order=0): |
67 | """Update with x.""" |
68 | if x is None: |
69 | return |
70 | |
71 | - self.count += 1 |
72 | - i = 0 |
73 | + i = order |
74 | while True: |
75 | # Create bucket on demand. |
76 | - if i == len(self.buckets): |
77 | - self.buckets.append([]) |
78 | + if i >= len(self.buckets): |
79 | + for n in range((i+1)-len(self.buckets)): |
80 | + self.buckets.append([]) |
81 | bucket = self.buckets[i] |
82 | bucket.append(x) |
83 | if len(bucket) == self.bucket_size: |
84 | @@ -171,9 +200,6 @@ |
85 | @property |
86 | def median(self): |
87 | """Return the median.""" |
88 | - if self.count == 0: |
89 | - return 0 |
90 | - |
91 | # Find the 'weighted' median by assigning a weight to each |
92 | # element proportional to how far they have been selected. |
93 | candidates = [] |
94 | @@ -183,6 +209,9 @@ |
95 | for x in bucket: |
96 | total_weight += weight |
97 | candidates.append([x, weight]) |
98 | + if len(candidates) == 0: |
99 | + return 0 |
100 | + |
101 | # Each weight is the equivalent of having the candidates appear |
102 | # that number of times in the array. |
103 | # So buckets like [[1, 2], [2, 3], [4, 2]] would be expanded to |
104 | @@ -196,6 +225,21 @@ |
105 | if weighted_idx > median: |
106 | return x |
107 | |
108 | + def __add__(self, other): |
109 | + """Merge two approximators together. |
110 | + |
111 | + All candidates from the other are merged through the standard |
112 | + algorithm, starting at the same level. So an item that went through |
113 | + two rounds of selection, will be compared with other items having |
114 | + gone through the same number of rounds. |
115 | + """ |
116 | + results = OnlineApproximateMedian(self.bucket_size) |
117 | + results.buckets = copy.deepcopy(self.buckets) |
118 | + for i, bucket in enumerate(other.buckets): |
119 | + for x in bucket: |
120 | + results.update(x, i) |
121 | + return results |
122 | + |
123 | |
124 | class Stats: |
125 | """Bag to hold and compute request statistics. |
126 | @@ -341,6 +385,21 @@ |
127 | idx = int(min(len(self.histogram)-1, request.app_seconds)) |
128 | self.histogram[idx][1] += 1 |
129 | |
130 | + def __add__(self, other): |
131 | + """Merge another OnlineStats with this one.""" |
132 | + results = copy.deepcopy(self) |
133 | + results.time_stats += other.time_stats |
134 | + results.time_median_approximate += other.time_median_approximate |
135 | + results.sql_time_stats += other.sql_time_stats |
136 | + results.sql_time_median_approximate += ( |
137 | + other.sql_time_median_approximate) |
138 | + results.sql_statements_stats += other.sql_statements_stats |
139 | + results.sql_statements_median_approximate += ( |
140 | + other.sql_statements_median_approximate) |
141 | + for i, (n, f) in enumerate(other._histogram): |
142 | + results._histogram[i][1] += f |
143 | + return results |
144 | + |
145 | |
146 | class RequestTimes: |
147 | """Collect statistics from requests. |
148 | @@ -402,10 +461,9 @@ |
149 | cutoff = int(self.top_urls_cache_size*0.90) |
150 | self.url_times = dict( |
151 | sorted(self.url_times.items(), |
152 | - key=lambda x: x[1].total_time, |
153 | + key=lambda (url, stats): stats.total_time, |
154 | reverse=True)[:cutoff]) |
155 | |
156 | - |
157 | def get_category_times(self): |
158 | """Return the times for each category.""" |
159 | return self.category_times |
160 | @@ -415,13 +473,50 @@ |
161 | # Sort the result by total time |
162 | return sorted( |
163 | self.url_times.items(), |
164 | - key=lambda x: x[1].total_time, reverse=True)[:self.top_urls] |
165 | + key=lambda (url, stats): stats.total_time, |
166 | + reverse=True)[:self.top_urls] |
167 | |
168 | def get_pageid_times(self): |
169 | """Return the times for the pageids.""" |
170 | # Sort the result by pageid |
171 | return sorted(self.pageid_times.items()) |
172 | |
173 | + def __add__(self, other): |
174 | + """Merge two RequestTimes together.""" |
175 | + results = copy.deepcopy(self) |
176 | + for other_category, other_stats in other.category_times: |
177 | + for i, (category, stats) in enumerate(self.category_times): |
178 | + if category.title == other_category.title: |
179 | + results.category_times[i] = ( |
180 | + category, stats + other_stats) |
181 | + break |
182 | + else: |
183 | + results.category_times.append( |
184 | + (other_category, copy.deepcopy(other_stats))) |
185 | + |
186 | + url_times = results.url_times |
187 | + for url, stats in other.url_times.items(): |
188 | + if url in url_times: |
189 | + url_times[url] += stats |
190 | + else: |
191 | + url_times[url] = copy.deepcopy(stats) |
192 | + # Only keep top_urls_cache_size entries. |
193 | + if len(self.url_times) > self.top_urls_cache_size: |
194 | + self.url_times = dict( |
195 | + sorted( |
196 | + url_times.items(), |
197 | + key=lambda (url, stats): stats.total_time, |
198 | + reverse=True)[:self.top_urls_cache_size]) |
199 | + |
200 | + pageid_times = results.pageid_times |
201 | + for pageid, stats in other.pageid_times.items(): |
202 | + if pageid in pageid_times: |
203 | + pageid_times[pageid] += stats |
204 | + else: |
205 | + pageid_times[pageid] = copy.deepcopy(stats) |
206 | + |
207 | + return results |
208 | + |
209 | |
210 | def main(): |
211 | parser = LPOptionParser("%prog [args] tracelog [...]") |
212 | @@ -459,6 +554,11 @@ |
213 | # Default to 12: the staging timeout. |
214 | default=12, type="int", |
215 | help="The configured timeout value : determines high risk page ids.") |
216 | + parser.add_option( |
217 | + "--merge", dest="merge", |
218 | + default=False, action='store_true', |
219 | + help="Files are interpreted as pickled stats and are aggregated for" + |
220 | + "the report.") |
221 | |
222 | options, args = parser.parse_args() |
223 | |
224 | @@ -473,6 +573,9 @@ |
225 | parser.error( |
226 | "--from timestamp %s is before --until timestamp %s" |
227 | % (options.from_ts, options.until_ts)) |
228 | + if options.from_ts is not None or options.until_ts is not None: |
229 | + if options.merge: |
230 | + parser.error('--from and --until cannot be used with --merge') |
231 | |
232 | for filename in args: |
233 | if not os.path.exists(filename): |
234 | @@ -501,7 +604,14 @@ |
235 | |
236 | times = RequestTimes(categories, options) |
237 | |
238 | - parse(args, times, options) |
239 | + if options.merge: |
240 | + for filename in args: |
241 | + log.info('Merging %s...' % filename) |
242 | + f = bz2.BZ2File(filename, 'r') |
243 | + times += cPickle.load(f) |
244 | + f.close() |
245 | + else: |
246 | + parse(args, times, options) |
247 | |
248 | category_times = times.get_category_times() |
249 | |
250 | @@ -564,16 +674,14 @@ |
251 | |
252 | for option in script_config.options('metrics'): |
253 | name = script_config.get('metrics', option) |
254 | - found = False |
255 | for category, stats in category_times: |
256 | if category.title == name: |
257 | writer.writerows([ |
258 | ("%s_99" % option, "%f@%d" % ( |
259 | stats.ninetyninth_percentile_time, date)), |
260 | ("%s_mean" % option, "%f@%d" % (stats.mean, date))]) |
261 | - found = True |
262 | break |
263 | - if not found: |
264 | + else: |
265 | log.warning("Can't find category %s for metric %s" % ( |
266 | option, name)) |
267 | metrics_file.close() |
268 | |
269 | === modified file 'lib/lp/scripts/utilities/tests/test_pageperformancereport.py' |
270 | --- lib/lp/scripts/utilities/tests/test_pageperformancereport.py 2010-11-05 15:24:12 +0000 |
271 | +++ lib/lp/scripts/utilities/tests/test_pageperformancereport.py 2010-11-05 15:24:31 +0000 |
272 | @@ -12,6 +12,7 @@ |
273 | from lp.scripts.utilities.pageperformancereport import ( |
274 | Category, |
275 | OnlineApproximateMedian, |
276 | + OnlineStats, |
277 | OnlineStatsCalculator, |
278 | RequestTimes, |
279 | Stats, |
280 | @@ -30,6 +31,7 @@ |
281 | |
282 | |
283 | class FakeRequest: |
284 | + |
285 | def __init__(self, url, app_seconds, sql_statements=None, |
286 | sql_seconds=None, pageid=None): |
287 | self.url = url |
288 | @@ -40,6 +42,7 @@ |
289 | |
290 | |
291 | class FakeStats(Stats): |
292 | + |
293 | def __init__(self, **kwargs): |
294 | # Override the constructor to just store the values. |
295 | self.__dict__.update(kwargs) |
296 | @@ -196,9 +199,28 @@ |
297 | pageid_times = self.db.get_pageid_times() |
298 | self.assertStatsAreEquals(PAGEID_STATS, pageid_times) |
299 | |
300 | + def test___add__(self): |
301 | + # Ensure that adding two RequestTimes together result in |
302 | + # a merge of their constituencies. |
303 | + db1 = self.db |
304 | + db2 = RequestTimes(self.categories, FakeOptions()) |
305 | + db1.add_request(FakeRequest('/', 1.5, 5, 1.0, '+root')) |
306 | + db1.add_request(FakeRequest('/bugs', 3.5, 15, 1.0, '+bugs')) |
307 | + db2.add_request(FakeRequest('/bugs/1', 5.0, 30, 4.0, '+bug')) |
308 | + results = db1 + db2 |
309 | + self.assertEquals(3, results.category_times[0][1].total_hits) |
310 | + self.assertEquals(0, results.category_times[1][1].total_hits) |
311 | + self.assertEquals(2, results.category_times[2][1].total_hits) |
312 | + self.assertEquals(1, results.pageid_times['+root'].total_hits) |
313 | + self.assertEquals(1, results.pageid_times['+bugs'].total_hits) |
314 | + self.assertEquals(1, results.pageid_times['+bug'].total_hits) |
315 | + self.assertEquals(1, results.url_times['/'].total_hits) |
316 | + self.assertEquals(1, results.url_times['/bugs'].total_hits) |
317 | + self.assertEquals(1, results.url_times['/bugs/1'].total_hits) |
318 | + |
319 | |
320 | class TestStats(TestCase): |
321 | - """Tests for the stats class.""" |
322 | + """Tests for the Stats class.""" |
323 | |
324 | def test_relative_histogram(self): |
325 | # Test that relative histogram gives an histogram using |
326 | @@ -211,6 +233,26 @@ |
327 | stats.relative_histogram) |
328 | |
329 | |
330 | +class TestOnlineStats(TestCase): |
331 | + """Tests for the OnlineStats class.""" |
332 | + |
333 | + def test___add__(self): |
334 | + # Ensure that adding two OnlineStats merge all their constituencies. |
335 | + stats1 = OnlineStats(4) |
336 | + stats1.update(FakeRequest('/', 2.0, 5, 1.5)) |
337 | + stats2 = OnlineStats(4) |
338 | + stats2.update(FakeRequest('/', 1.5, 2, 3.0)) |
339 | + stats2.update(FakeRequest('/', 5.0, 2, 2.0)) |
340 | + results = stats1 + stats2 |
341 | + self.assertEquals(3, results.total_hits) |
342 | + self.assertEquals(2, results.median) |
343 | + self.assertEquals(9, results.total_sqlstatements) |
344 | + self.assertEquals(2, results.median_sqlstatements) |
345 | + self.assertEquals(6.5, results.total_sqltime) |
346 | + self.assertEquals(2.0, results.median_sqltime) |
347 | + self.assertEquals([[0, 0], [1, 1], [2, 1], [3, 1]], results.histogram) |
348 | + |
349 | + |
350 | class TestOnlineStatsCalculator(TestCase): |
351 | """Tests for the online stats calculator.""" |
352 | |
353 | @@ -248,14 +290,44 @@ |
354 | self.assertEquals(6, self.stats.variance) |
355 | self.assertEquals("2.45", "%.2f" % self.stats.std) |
356 | |
357 | + def test___add___two_empty_together(self): |
358 | + stats2 = OnlineStatsCalculator() |
359 | + results = self.stats + stats2 |
360 | + self.assertEquals(0, results.count) |
361 | + self.assertEquals(0, results.sum) |
362 | + self.assertEquals(0, results.mean) |
363 | + self.assertEquals(0, results.variance) |
364 | + |
365 | + def test___add___one_empty(self): |
366 | + stats2 = OnlineStatsCalculator() |
367 | + for x in [1, 2, 3]: |
368 | + self.stats.update(x) |
369 | + results = self.stats + stats2 |
370 | + self.assertEquals(3, results.count) |
371 | + self.assertEquals(6, results.sum) |
372 | + self.assertEquals(2, results.mean) |
373 | + self.assertEquals(2, results.M2) |
374 | + |
375 | + def test___add__(self): |
376 | + stats2 = OnlineStatsCalculator() |
377 | + for x in [3, 6, 9]: |
378 | + self.stats.update(x) |
379 | + for x in [1, 2, 3]: |
380 | + stats2.update(x) |
381 | + results = self.stats + stats2 |
382 | + self.assertEquals(6, results.count) |
383 | + self.assertEquals(24, results.sum) |
384 | + self.assertEquals(4, results.mean) |
385 | + self.assertEquals(44, results.M2) |
386 | + |
387 | |
388 | SHUFFLE_RANGE_100 = [ |
389 | 25, 79, 99, 76, 60, 63, 87, 77, 51, 82, 42, 96, 93, 58, 32, 66, 75, |
390 | - 2, 26, 22, 11, 73, 61, 83, 65, 68, 44, 81, 64, 3, 33, 34, 15, 1, |
391 | + 2, 26, 22, 11, 73, 61, 83, 65, 68, 44, 81, 64, 3, 33, 34, 15, 1, |
392 | 92, 27, 90, 74, 46, 57, 59, 31, 13, 19, 89, 29, 56, 94, 50, 49, 62, |
393 | 37, 21, 35, 5, 84, 88, 16, 8, 23, 40, 6, 48, 10, 97, 0, 53, 17, 30, |
394 | 18, 43, 86, 12, 71, 38, 78, 36, 7, 45, 47, 80, 54, 39, 91, 98, 24, |
395 | - 55, 14, 52, 20, 69, 85, 95, 28, 4, 9, 67, 70, 41, 72 |
396 | + 55, 14, 52, 20, 69, 85, 95, 28, 4, 9, 67, 70, 41, 72, |
397 | ] |
398 | |
399 | |
400 | @@ -279,12 +351,20 @@ |
401 | self.estimator.update(None) |
402 | self.assertEquals(1, self.estimator.median) |
403 | |
404 | - def test_approximage_median_is_good_enough(self): |
405 | + def test_approximate_median_is_good_enough(self): |
406 | for x in SHUFFLE_RANGE_100: |
407 | self.estimator.update(x) |
408 | # True median is 50, 49 is good enough :-) |
409 | self.assertIn(self.estimator.median, range(49,52)) |
410 | |
411 | + def test___add__(self): |
412 | + median1 = OnlineApproximateMedian(3) |
413 | + median1.buckets = [[1, 3], [4, 5], [6, 3]] |
414 | + median2 = OnlineApproximateMedian(3) |
415 | + median2.buckets = [[], [3, 6], [3, 7]] |
416 | + results = median1 + median2 |
417 | + self.assertEquals([[1, 3], [6], [3, 7], [4]], results.buckets) |
418 | + |
419 | |
420 | def test_suite(): |
421 | return unittest.TestLoader().loadTestsFromName(__name__) |