Merge lp:~flacoste/launchpad/ppr-constant-memory into lp:launchpad
| Status: | Merged |
|---|---|
| Approved by: | Gary Poster on 2010-11-04 |
| Approved revision: | no longer in the source branch. |
| Merged at revision: | 11872 |
| Proposed branch: | lp:~flacoste/launchpad/ppr-constant-memory |
| Merge into: | lp:launchpad |
| Diff against target: | 1043 lines (+656/-229), 5 files modified: lib/lp/scripts/utilities/pageperformancereport.py (+359/-227), lib/lp/scripts/utilities/tests/test_pageperformancereport.py (+290/-0), setup.py (+0/-1), utilities/page-performance-report.ini (+7/-0), versions.cfg (+0/-1) |
| To merge this branch: | bzr merge lp:~flacoste/launchpad/ppr-constant-memory |
| Related bugs: | |
| Reviewer | Review Type | Date Requested | Status |
|---|---|---|---|
| Gary Poster (community) | | 2010-10-29 | Approve on 2010-11-04 |
Description of the Change
This is my third (and hopefully final) round of refactoring of the page
performance report.
The second, SQLite-based version, which ran in constant memory, was taking a
long time (>8 hours) to compute the report for 8M requests. It was also
thrashing the machine with heavy IO.
After searching the net, I found a couple of ways to compute the stats we are
gathering online, over the stream of data, with minimal storage. Those are the
algorithms I'm implementing here.
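For example, the running mean and standard deviation of request times can be maintained in constant memory with Welford's online algorithm. The sketch below shows the general technique only; it is not the code from this branch, and the class and method names are mine.

```python
class OnlineStats:
    """Running count, mean and (population) standard deviation in O(1) memory.

    Illustrative sketch of Welford's online algorithm, not the
    implementation in pageperformancereport.py.
    """

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # Sum of squared deviations from the running mean.

    def update(self, value):
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def std(self):
        if self.count == 0:
            return 0.0
        return (self._m2 / self.count) ** 0.5
```

Each parsed request would feed its time into such an accumulator, so nothing needs to be kept around besides a handful of numbers per category.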
A big change is that we are not computing the exact median, but only an
approximation using a variation on the median of medians. That's where the
bulk of the memory is going: it took 900M on my local machine to compute the
report for 300K records. I'll see how much memory we are using on devpad when
generating a daily report. If that's too much, I suggest we simply drop the
median computation. (It took 2m42s of user time to generate, which is way
better than both SQLite algorithms and within about 50s of the keep-everything
version.)
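The general shape of such a bucketed median-of-medians estimator is sketched below, assuming a fixed odd bucket_size. This is a simplified illustration of the technique, not the branch's actual class.

```python
class ApproximateMedian:
    """Approximate the median while holding roughly bucket_size * log_B(n) values.

    Simplified sketch of a median-of-medians estimator: each full bucket
    is collapsed to its median, which is promoted to the next level.
    """

    def __init__(self, bucket_size=9):
        assert bucket_size % 2 == 1, "Use an odd bucket size."
        self.bucket_size = bucket_size
        self.median_idx = (bucket_size - 1) // 2
        self.buckets = [[]]

    def update(self, value, level=0):
        if level == len(self.buckets):
            self.buckets.append([])
        bucket = self.buckets[level]
        bucket.append(value)
        if len(bucket) == self.bucket_size:
            bucket.sort()
            median = bucket[self.median_idx]
            del bucket[:]
            # Promote this bucket's median one level up.
            self.update(median, level + 1)

    @property
    def median(self):
        # The highest non-empty level gives the most aggregated estimate;
        # the partially filled buckets are what make this approximate.
        for bucket in reversed(self.buckets):
            if bucket:
                return sorted(bucket)[(len(bucket) - 1) // 2]
        return 0  # No data yet.
```

Feeding every request time through update() and reading median at the end gives the approximation; only the partially filled bucket at each level stays in memory.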
Other changes:
- I switched to using the gzip and bz2 Python built-in libraries instead of
external processes (see the sketch after this list). Hopefully this will get
rid of some warnings about unclosed streams that were happening in production.
- I added saving the 99th percentile and mean values for selected categories
in cricket format. This will allow us to chart our progress using tuolumne.
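As an illustration of the in-process decompression, something along these lines replaces the external zcat/bzcat pipes. The helper name and the extension checks are mine and only hint at what the script does.

```python
import bz2
import gzip

def open_trace(filename):
    """Open a possibly-compressed trace log with the builtin modules.

    Illustrative helper only. Explicitly closing the returned file object
    (or wrapping it with contextlib.closing) is what avoids the
    unclosed-stream warnings mentioned above.
    """
    if filename.endswith('.gz'):
        return gzip.open(filename, 'rb')
    elif filename.endswith('.bz2'):
        return bz2.BZ2File(filename, 'rb')
    else:
        return open(filename, 'rb')
```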
Francis J. Lacoste (flacoste) wrote:
Gary Poster (gary) wrote:
This is very cool, both practically and in its implementation. Thank you for doing it.
Thank you for working on the median changes that we discussed, and I look forward to seeing them; I think what we discussed makes a lot more sense. Even if that new approach doesn't, as I said, tossing the smaller buckets makes more sense than what was there when I first looked, AFAICT. And even if *that* is wrong, at least a comment describing what is going on and why is good. :-) The paper is not clear on the implementation intent in this case, IMO.
Everything else is small, and mostly optional.
Median docstring comments:
* In "(It will hold in memory log B of n elements.)" I suggest either deleting "B" (vaguely asserting that it has good characteristics) or clarify that B is bucket_size (AIUI).
* In "It will compute the median among bucket_size values. And the median among those." I was a bit concerned that it didn't describe what was actually going on but it's at least a start for someone trying to understand the code. Only change it if you want to, and have a good idea on how to improve it. :-) Otherwise, at least it helps.
Also in the median code, I'd prefer the explicit floor division ``self.median_idx = bucket_size//2`` over ``self.median_idx = bucket_size/2``, but again, that's not a big deal.
The docstring of RequestTimes is incomplete (currently """Collect the """). Similarly, the add_request method's docstring is now """Add a request to the .""" Probably ought to fix that one way or another.
The behavior of the top urls cache (with its top_urls_…) makes me a bit nervous. (You clarified on IRC that we get 100k unique URLs within 300k requests, which makes more sense with the code.) It is a heuristic that experientially works out well with the *current* state of things. If something changes, it may no longer be reasonable…
Awesome that you added tests, thank you. Take this or leave it, as you prefer:
* It strikes me you could seed the random number generator instead of hard-coding the shuffle. You probably thought about it and made a conscious choice. Dunno which is better. <shrug>
* In your test, you verify that ``self.
Thanks ...
Francis J. Lacoste (flacoste) wrote:
On November 4, 2010, Gary Poster wrote:
> This is very cool, both practically and in its implementation. Thank you
> for doing it.
>
> Thank you for working on the median changes that we discussed, and I look
> forward to seeing them; I think what we discussed makes a lot more sense.
> Even if that new approach doesn't, as I said, tossing the smaller buckets
> makes more sense than what was there when I first looked, AFAICT. And
> even if *that* is wrong, at least a comment describing what is going on
> and why is good. :-) The paper is not clear on the implementation intent
> in this case, IMO.
>
Your suggested algorithm change works much better in practice:
# 100 rounds: create an array of 10000 random numbers from 0 to 10
# and compare the approximation to the true median
>>> simulate(100, 9, 10000)
New: min=-2, max=3, mean=0.190000, median=0, std=1.006926
Old: min=-6, max=9, mean=1.850000, median=2, std=3.188652
>>> simulate(100, 9, 10000)
New: min=-2, max=3, mean=-0.010000, median=0, std=1.072334
Old: min=-4, max=9, mean=1.770000, median=2, std=2.781564
>>> execfile(
>>> simulate(100, 9, 10000)
New: min=-258, max=214, mean=-20.560000, median=-28, std=92.003730
Old: min=-629, max=800, mean=129.400000, median=118, std=292.351809
# Same thing, but with random numbers from 0 to 10000
>>> simulate(100, 9, 10000)
New: min=-230, max=262, mean=10.550000, median=8, std=90.401590
Old: min=-612, max=805, mean=154.820000, median=167, std=307.682381
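(The simulate() helper itself isn't shown in this thread. For illustration, a comparison of that general shape could look like the sketch below; it reuses the ApproximateMedian sketch from the description above and my own parameter guesses, rather than the actual old and new bucket-handling variants being compared here.)

```python
import random

def simulate(rounds, bucket_size, n, max_value=10):
    """Report error statistics of an approximate median vs. the true median.

    Hypothetical reconstruction for illustration; the real helper compared
    the old and new variants side by side.
    """
    errors = []
    for _ in range(rounds):
        values = [random.randint(0, max_value) for _ in range(n)]
        estimator = ApproximateMedian(bucket_size)
        for value in values:
            estimator.update(value)
        true_median = sorted(values)[(n - 1) // 2]
        errors.append(estimator.median - true_median)
    mean_error = sum(errors) / float(len(errors))
    std = (sum((e - mean_error) ** 2 for e in errors) / len(errors)) ** 0.5
    print("min=%d, max=%d, mean=%f, std=%f" % (
        min(errors), max(errors), mean_error, std))
```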
> Everything else is small, and mostly optional.
>
> Median docstring comments:
>
> * In "(It will hold in memory log B of n elements.)" I suggest either
> deleting "B" (vaguely asserting that it has good characteristics) or
> clarify that B is bucket_size (AIUI).
Yes, that's a left-over from a previous edit.
>
> * In "It will compute the median among bucket_size values. And the median
> among those." I was a bit concerned that it didn't describe what was
> actually going on but it's at least a start for someone trying to
> understand the code. Only change it if you want to, and have a good idea
> on how to improve it. :-) Otherwise, at least it helps.
I wanted to hint that the algorithm is a variation on Tukey median of medians.
I left it as is.
>
> Also in the median code, I'd prefer ``self.median_idx = bucket_size/2``
> using the explicit floor division of ``self.median_idx = bucket_size//2``
> but again, that's not a big deal.
Changed, and I also now use the correct formula, (bucket_size-1)//2. Since
bucket_size is always odd it wasn't a problem, but it's better to be consistent
with the other place I do this.
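To spell out the index arithmetic (my own worked example, not code from the branch):

```python
# The zero-based index of the (lower) median in a sorted sequence of
# length n is (n - 1) // 2.  For an odd bucket_size the two formulas
# coincide, which is why the old expression happened to work.
bucket_size = 9
sorted_bucket = range(bucket_size)              # [0, 1, ..., 8]
assert bucket_size // 2 == 4                    # old formula
assert (bucket_size - 1) // 2 == 4              # corrected formula
assert sorted_bucket[(bucket_size - 1) // 2] == 4
```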
>
> The docstring of RequestTimes is incomplete (currently """Collect the """).
> Similarly, add_request method is now """Add a request to the ."""
> Probably ought to fix that one way or another.
I fixed both.
>
> The behavior of the top urls cache (with its top_urls_
> a bit nervous. (You clarified on IRC that we get 100k unique URLs within
> 300k requests, which makes more sense with the code). It is a heuristic
> that experientially works out well with the *current* state of things. If
> something changes, it may no longer be reasonab...
Gary Poster (gary) wrote:
Looks great!
One more fly-by: in the diff I noticed "test_approxima…"
Thanks again

Turns out that all the memory was taken by the url_times cache. In 300k requests, we have nearly 100k different URLs! All of this to generate the top 200. So I now limit the size of that cache to 50 times the top_n we are looking for, culling it by 10% every time we reach the limit.
I've tried various sizes and cull ratios, and this seems a good size/time/accuracy compromise. Generating the report now takes a constant 137M, and the top 200 report is identical to one generated with all of the URLs kept.
It also actually reduced the time to generate the report by 9 seconds.
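A minimal sketch of that kind of bounded cache, keyed on hit counts for simplicity (the real cache keeps full per-URL timing stats inside RequestTimes, and its attribute names differ):

```python
class BoundedURLCache:
    """Per-URL counters capped at cache_factor * top_n entries.

    Illustrative sketch of the culling heuristic: when the cap is hit,
    the least-seen 10% of URLs are dropped, keeping memory constant
    while leaving the top-N report effectively unchanged in practice.
    """

    def __init__(self, top_n=200, cache_factor=50, cull_fraction=0.10):
        self.top_n = top_n
        self.max_size = top_n * cache_factor
        self.cull_size = int(self.max_size * cull_fraction)
        self.hits = {}

    def add(self, url):
        self.hits[url] = self.hits.get(url, 0) + 1
        if len(self.hits) >= self.max_size:
            # Drop the least frequently seen URLs (10% of the cap).
            victims = sorted(self.hits.items(), key=lambda item: item[1])
            for victim, _ in victims[:self.cull_size]:
                del self.hits[victim]

    def top_urls(self):
        return sorted(
            self.hits.items(), key=lambda item: item[1], reverse=True,
            )[:self.top_n]
```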
I also now persist the RequestTimes data structure to disk. This will allow us to generate aggregate reports very quickly. That structure takes ~1.8M of disk space once compressed.
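Persisting and reloading the aggregated structure could look roughly like this (cPickle plus the builtin bz2 module; the file name and any versioning the script does are not shown here):

```python
import bz2
import cPickle  # This branch targets Python 2; use pickle on Python 3.

def save_stats(request_times, filename='request-times.pickle.bz2'):
    """Dump the aggregated RequestTimes structure to a compressed file."""
    out = bz2.BZ2File(filename, 'w')
    try:
        cPickle.dump(request_times, out, cPickle.HIGHEST_PROTOCOL)
    finally:
        out.close()

def load_stats(filename='request-times.pickle.bz2'):
    """Load a previously saved structure, e.g. to build an aggregate report."""
    infile = bz2.BZ2File(filename, 'r')
    try:
        return cPickle.load(infile)
    finally:
        infile.close()
```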