Merge lp:~flacoste/launchpad/ppr-constant-memory into lp:launchpad
| Status: | Merged |
|---|---|
| Approved by: | Robert Collins on 2010-10-25 |
| Approved revision: | no longer in the source branch. |
| Merged at revision: | 11795 |
| Proposed branch: | lp:~flacoste/launchpad/ppr-constant-memory |
| Merge into: | lp:launchpad |
| Diff against target: | 628 lines (+246/-169), 1 file modified: lib/lp/scripts/utilities/pageperformancereport.py (+246/-169) |
| To merge this branch: | bzr merge lp:~flacoste/launchpad/ppr-constant-memory |
| Related bugs: |
| Reviewer | Review Type | Date Requested | Status |
|---|---|---|---|
| Robert Collins (community) | | 2010-10-25 | Approve on 2010-10-25 |
|
Commit Message
Refactor the page-performance report to reduce memory usage.
Description of the Change
This branch changes the algorithm used by the Page Performance Report in order to reduce memory usage.

The current algorithm builds the statistics in memory as it parses the logs. This uses a great amount of memory because it maintains an array of request times for every key (category, page id, URL) it reports on. It currently fails to generate any weekly or monthly report and has trouble with some daily reports too.
The new algorithm parses all the logs into a SQLite3 database and then generates statistics for one key at a time. The statistics computation itself still happens in memory, which means that memory use still grows linearly with the number of requests: the "all" category requires an array holding every request time.
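For illustration, here is a minimal sketch of that two-phase shape. The schema and function names are hypothetical; the actual pageperformancereport.py code differs.

```python
import sqlite3

import numpy

conn = sqlite3.connect('ppr.db')
conn.execute(
    'CREATE TABLE IF NOT EXISTS request (key TEXT, app_time REAL)')

def store_requests(parsed_requests):
    """Phase 1: stream parsed log records into SQLite instead of RAM."""
    conn.executemany('INSERT INTO request VALUES (?, ?)', parsed_requests)
    conn.commit()

def stats_for_key(key):
    """Phase 2: pull one key's times back and compute its statistics.

    Only a single key's array of request times is in memory at a time.
    """
    times = numpy.array([row[0] for row in conn.execute(
        'SELECT app_time FROM request WHERE key = ?', (key,))])
    return {
        'mean': times.mean(),
        'median': numpy.median(times),
        'std': times.std(),
    }
```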
Other changes:

* I've dropped the variance column from the report. We include the standard deviation, which is its square root and more useful anyway.
* I've used numpy.clip instead of a list comprehension to clamp the input to the histogram (see the sketch after this list).
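For illustration, the clipping change looks roughly like this; the variable names are made up for the example:

```python
import numpy

times = numpy.array([0.3, 1.2, 7.5, 42.0])
top = 10.0  # upper bound of the histogram range

# Before: a list comprehension builds an intermediate Python list.
clipped = [min(t, top) for t in times]

# After: numpy.clip does the clamping in C, with no intermediate list.
clipped = numpy.clip(times, 0, top)

counts, edges = numpy.histogram(clipped, bins=10, range=(0, top))
```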
Locally, on a 300,000-request file, here is the performance diff:

| | Old | New |
|---|---|---|
| User time | 1m33 | 1m52 |
| Sys time | 0m1.6 | 0m5 |
| RSS | 483M | 229M |
QA
I've compared the reports generated using the old algorithm with the new one, and the reports are identical (apart from the removed column).

On sodium, I've been able to generate the problematic daily reports. It peaked at 2.2G for 4 million requests. I'm still not sure that the weekly and monthly reports can be computed. Trying that now.
| Francis J. Lacoste (flacoste) wrote: |
| Robert Collins (lifeless) wrote: |
Seems plausible; it might be better to not put the time and sql time in the same table.
If you used different tables, you could avoid all the masking stuff entirely.
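A minimal sketch of that suggestion, with hypothetical table names: a request with no SQL time simply has no row in the sql_time table, so neither metric needs NULL masking when queried.

```python
import sqlite3

conn = sqlite3.connect('ppr.db')
# One table per metric, instead of one table with a nullable column.
conn.execute('CREATE TABLE app_time (key TEXT, value REAL)')
conn.execute('CREATE TABLE sql_time (key TEXT, value REAL)')

conn.execute("INSERT INTO app_time VALUES ('some-page-id', 1.2)")
# This request had no SQL time, so no sql_time row is inserted.

sql_times = [row[0] for row in conn.execute(
    "SELECT value FROM sql_time WHERE key = 'some-page-id'")]
```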
| Francis J. Lacoste (flacoste) wrote: |
Yeah, computing one statistic at a time would also reduce the peak amount of memory used, at the cost of more processing time. I'll see how it goes for the weekly and monthly reports and assess whether another round is needed.

As far as stats go, I forgot to report that the SQLite3 DB size was 55M for 300,000 requests and 776M for 4.1M.