Merge lp:~julian-edwards/launchpad/log-parser-bug-680463 into lp:launchpad

Proposed by Julian Edwards
Status: Merged
Approved by: Gavin Panella
Approved revision: no longer in the source branch.
Merged at revision: 11988
Proposed branch: lp:~julian-edwards/launchpad/log-parser-bug-680463
Merge into: lp:launchpad
Diff against target: 24 lines (+7/-6)
1 file modified
lib/lp/services/apachelogparser/base.py (+7/-6)
To merge this branch: bzr merge lp:~julian-edwards/launchpad/log-parser-bug-680463
Reviewer                    Review Type  Date Requested  Status
Gavin Panella (community)                                Approve
Review via email: mp+41865@code.launchpad.net

Commit message

When parsing Apache logs, figure out the uncompressed length of gzip log files without reading them into memory in their entirety, which can crash the process when it eats all the available memory.

Description of the change

Figure out the uncompressed length of gzip log files without having to read them into memory.

The existing code tries to read the uncompressed contents of a gzip file into memory in their entirety. This makes the PPA log parser blow up quite horribly as the log files are very large.
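
As a rough illustration of the approach, here is a minimal Python 3 sketch that reads the gzip trailer directly instead of decompressing the file. It is not the code in this branch (which targets Python 2 and reuses helpers from the gzip module); the function name gzip_uncompressed_size and the demo payload are made up for the example. It assumes a single-member gzip file, and the value is only the size modulo 2**32, so files of 4 GiB or more will wrap.

```python
import gzip
import os
import struct
import tempfile


def gzip_uncompressed_size(file_path):
    """Return the ISIZE trailer field: the uncompressed size modulo 2**32.

    Hypothetical helper for illustration. Only the last 4 bytes of the
    compressed file are read, so memory use does not depend on file size.
    """
    with open(file_path, "rb") as raw:
        raw.seek(-4, os.SEEK_END)
        # ISIZE is stored as a little-endian unsigned 32-bit integer.
        (isize,) = struct.unpack("<I", raw.read(4))
    return isize


# Demo with a made-up payload written to a temporary .gz file.
payload = b"example log line\n" * 1000
with tempfile.NamedTemporaryFile(suffix=".gz", delete=False) as tmp:
    tmp_path = tmp.name
with gzip.open(tmp_path, "wb") as gz:
    gz.write(payload)
size = gzip_uncompressed_size(tmp_path)
os.remove(tmp_path)
print(size == len(payload))
```

For a file made of several concatenated gzip members, the trailer describes only the last member, so this shortcut would under-report the total size.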

Run the existing test with:
bin/test -cvv test_apachelogparser Test_get_fd_and_file_size

QA Plan
-------

I have a copy, on dogfood, of the production log files that cause the crash. Running with the fix allows the processing to continue with no increase in memory usage, as observed in "top".

Gavin Panella (allenap) :
review: Approve

Preview Diff

=== modified file 'lib/lp/services/apachelogparser/base.py'
--- lib/lp/services/apachelogparser/base.py 2010-11-17 23:20:07 +0000
+++ lib/lp/services/apachelogparser/base.py 2010-11-26 17:47:08 +0000
@@ -64,13 +64,14 @@
     file_path points to a gzipped file.
     """
     if file_path.endswith('.gz'):
+        # The last 4 bytes of the file contains the uncompressed file's
+        # size, modulo 2**32. This code is somewhat stolen from the gzip
+        # module in Python 2.6.
         fd = gzip.open(file_path)
-        # There doesn't seem to be a better way of figuring out the
-        # uncompressed size of a file, so we'll read the whole file here.
-        file_size = len(fd.read())
-        # Seek back to the beginning of the file as if we had just opened
-        # it.
-        fd.seek(0)
+        fd.fileobj.seek(-4, os.SEEK_END)
+        isize = gzip.read32(fd.fileobj) # may exceed 2GB
+        file_size = isize & 0xffffffffL
+        fd.fileobj.seek(0)
     else:
         fd = open(file_path)
         file_size = os.path.getsize(file_path)