Merge lp:~michael.nelson/launchpad/588288-log-parser-dont-read-entire-file-qa into lp:launchpad

Proposed by Michael Nelson
Status: Merged
Approved by: Michael Nelson
Approved revision: no longer in the source branch.
Merged at revision: 11035
Proposed branch: lp:~michael.nelson/launchpad/588288-log-parser-dont-read-entire-file-qa
Merge into: lp:launchpad
Diff against target: 205 lines (+54/-22)
5 files modified
lib/canonical/config/schema-lazr.conf (+2/-2)
lib/canonical/launchpad/scripts/tests/test_librarian_apache_log_parser.py (+7/-3)
lib/lp/services/apachelogparser/base.py (+9/-4)
lib/lp/services/apachelogparser/tests/apache-log-files/launchpadlibrarian.net.access-log (+1/-0)
lib/lp/services/apachelogparser/tests/test_apachelogparser.py (+35/-13)
To merge this branch: bzr merge lp:~michael.nelson/launchpad/588288-log-parser-dont-read-entire-file-qa
Reviewer Review Type Date Requested Status
Abel Deuring (community) code Approve
Review via email: mp+27849@code.launchpad.net

Commit message

Ensure the next line is incremented when log parser skips lines.

Description of the change

This branch fixes bug 590766 (which was a result of a bad fix landed for bug 588288).

The apache log parser has always ignored lines that are not for an http status 200 OK, or are not GET requests etc. But the current code in devel doesn't increment to the next line when ignoring these lines.

The fix was just a small restructure to ensure that the line is always incremented.

To test:
bin/test -vvm test_apachelogparser

To demo locally:
Apply the following patch so that only 2 lines are read at a time: http://pastebin.ubuntu.com/451153/

Then copy an apache access log from your /var/log/apache2 to /srv/launchpadlibrarian.net-logs

and then run:
cronscripts/parse-librarian-apache-access-logs.py
multiple times. You can watch the data being updated with `psql launchpad_dev` running the query:
select * from parsedapachelog;

No lint.

To post a comment you must log in.
Revision history for this message
Abel Deuring (adeuring) wrote :

Hi Michael,

your changes look good -- but I believe I found another issue in the "while next_line" loop (or I had not have yet enough coffee...):

If we get any exception in the second try/except block of parse_file(), parsed_bytes is set to the value for the start of the line causing the exception, and the loop is terminated via a "break".

So, when parse_file() is called the next time, the line causing the exception is read again. That's fine if the exception of the previous run was due to the line being truncated; in this case, it is fair to assume that a new run with more data will work better. But if the exception is caused for example by a bug in get_method_and_path() or geoip.getRecordByAddress(), we will have an "non-terminating loop", in the sense that sucessive calls of parse_file() will never get past the line causing the exception.

I'm approving this branch but I think it is worth to rethink the exception handling in the loop.

review: Approve (code)
Revision history for this message
Abel Deuring (adeuring) wrote :

...one more comment: Can't we use something like

  line[-1] == '\n'

as an indicator if the current line is complete of if we have reached the end of the file and should not parse this line?

Revision history for this message
Abel Deuring (adeuring) wrote :

I think that would allow us to remove the "exception Exception" handling

Revision history for this message
Michael Nelson (michael.nelson) wrote :

Thanks Abel. Regarding the exception handling, it looks to me as if that is by design (but I didn't design it ;)) - that is, if there is an error the log parser will not continue past the line where the error occurred until it has been resolved (so that we don't end up skipping chunks of the file, but are rather alerted to it). I'm basing this on the test:

    def test_unexpected_error_while_parsing(self):
        # When there's an unexpected error, we log it and return as if we had
        # parsed up to the line before the one where the failure occurred.

I can create a bug with your comments for the foundations guys to look at if you think it's worthwhile.
Thanks!

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'lib/canonical/config/schema-lazr.conf'
2--- lib/canonical/config/schema-lazr.conf 2010-06-14 18:32:58 +0000
3+++ lib/canonical/config/schema-lazr.conf 2010-06-18 11:23:29 +0000
4@@ -1132,8 +1132,8 @@
5 geonames_identity:
6
7 # The maximum number of lines that should be parsed by the launchpad
8-# log parser.
9-logparser_max_parsed_lines: 100000
10+# log parser. The default value of None means there is no maximum.
11+logparser_max_parsed_lines: None
12
13
14 [launchpad_session]
15
16=== modified file 'lib/canonical/launchpad/scripts/tests/test_librarian_apache_log_parser.py'
17--- lib/canonical/launchpad/scripts/tests/test_librarian_apache_log_parser.py 2009-12-22 14:08:17 +0000
18+++ lib/canonical/launchpad/scripts/tests/test_librarian_apache_log_parser.py 2010-06-18 11:23:29 +0000
19@@ -95,10 +95,12 @@
20 downloads, parsed_bytes = parse_file(
21 fd, start_position=0, logger=self.logger,
22 get_download_key=get_library_file_id)
23- self.assertEqual(self.logger.buffer.getvalue(), '')
24+ self.assertEqual(
25+ self.logger.buffer.getvalue().strip(),
26+ 'INFO: Parsed 1 lines resulting in 1 download stats.')
27
28 date = datetime(2008, 6, 13)
29- self.assertEqual(downloads,
30+ self.assertEqual(downloads,
31 {'15018215': {datetime(2008, 6, 13): {'US': 1}}})
32
33 self.assertEqual(parsed_bytes, fd.tell())
34@@ -112,7 +114,9 @@
35 downloads, parsed_bytes = parse_file(
36 fd, start_position=0, logger=self.logger,
37 get_download_key=get_library_file_id)
38- self.assertEqual(self.logger.buffer.getvalue(), '')
39+ self.assertEqual(
40+ self.logger.buffer.getvalue().strip(),
41+ 'INFO: Parsed 1 lines resulting in 0 download stats.')
42 self.assertEqual(downloads, {})
43 self.assertEqual(parsed_bytes, fd.tell())
44
45
46=== modified file 'lib/lp/services/apachelogparser/base.py'
47--- lib/lp/services/apachelogparser/base.py 2010-06-01 14:31:13 +0000
48+++ lib/lp/services/apachelogparser/base.py 2010-06-18 11:23:29 +0000
49@@ -85,7 +85,7 @@
50 """
51 # Seek file to given position, read all lines.
52 fd.seek(start_position)
53- line = fd.readline()
54+ next_line = fd.readline()
55
56 parsed_bytes = start_position
57
58@@ -97,16 +97,17 @@
59 max_parsed_lines = getattr(
60 config.launchpad, 'logparser_max_parsed_lines', None)
61
62- while line:
63+ while next_line:
64 if max_parsed_lines is not None and parsed_lines >= max_parsed_lines:
65 break
66
67+ line = next_line
68+
69 # Always skip the last line as it may be truncated since we're
70 # rsyncing live logs, unless there is only one line for us to
71 # parse, in which case This probably means we're dealing with a
72 # logfile that has been rotated already, so it should be safe to
73 # parse its last line.
74- next_line = ''
75 try:
76 next_line = fd.next()
77 except StopIteration:
78@@ -161,7 +162,11 @@
79 logger.error('Error (%s) while parsing "%s"' % (e, line))
80 break
81
82- line = next_line
83+
84+ if parsed_lines > 0:
85+ logger.info('Parsed %d lines resulting in %d download stats.' % (
86+ parsed_lines, len(downloads)))
87+
88 return downloads, parsed_bytes
89
90
91
92=== modified file 'lib/lp/services/apachelogparser/tests/apache-log-files/launchpadlibrarian.net.access-log'
93--- lib/lp/services/apachelogparser/tests/apache-log-files/launchpadlibrarian.net.access-log 2009-03-27 03:29:31 +0000
94+++ lib/lp/services/apachelogparser/tests/apache-log-files/launchpadlibrarian.net.access-log 2010-06-18 11:23:29 +0000
95@@ -1,3 +1,4 @@
96+121.44.28.210 - - [13/Jun/2008:14:55:06 +0100] "GET /9096290/me-tv-icon-14x14.png HTTP/1.1" 404 730 "https://launchpad.net/me-tv" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9) Gecko/2008060900 Firefox/3.0"
97 121.44.28.210 - - [13/Jun/2008:14:55:06 +0100] "GET /9096290/me-tv-icon-14x14.png HTTP/1.1" 200 730 "https://launchpad.net/me-tv" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9) Gecko/2008060900 Firefox/3.0"
98 121.44.28.210 - - [13/Jun/2008:14:55:13 +0100] "GET /12060796/me-tv-icon-64x64.png HTTP/1.1" 200 6378 "https://launchpad.net/me-tv" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9) Gecko/2008060900 Firefox/3.0"
99 157.92.18.21 - - [13/Jun/2008:14:55:15 +0100] "GET /8196569/mediumubuntulogo.png HTTP/1.1" 200 3420 "https://bugs.launchpad.net/ubuntu/+source/acpi-support/+bug/59695/comments/14" "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9) Gecko/2008052623 Iceweasel/3.0 (Debian-3.0~rc1-1)"
100
101=== modified file 'lib/lp/services/apachelogparser/tests/test_apachelogparser.py'
102--- lib/lp/services/apachelogparser/tests/test_apachelogparser.py 2010-06-01 14:28:40 +0000
103+++ lib/lp/services/apachelogparser/tests/test_apachelogparser.py 2010-06-18 11:23:29 +0000
104@@ -126,7 +126,9 @@
105 downloads, parsed_bytes = parse_file(
106 fd, start_position=0, logger=self.logger,
107 get_download_key=get_path_download_key)
108- self.assertEqual(self.logger.buffer.getvalue(), '')
109+ self.assertEqual(
110+ self.logger.buffer.getvalue().strip(),
111+ 'INFO: Parsed 5 lines resulting in 3 download stats.')
112 date = datetime(2008, 6, 13)
113 self.assertContentEqual(
114 downloads.items(),
115@@ -148,7 +150,9 @@
116 downloads, parsed_bytes = parse_file(
117 fd, start_position=self._getLastLineStart(fd), logger=self.logger,
118 get_download_key=get_path_download_key)
119- self.assertEqual(self.logger.buffer.getvalue(), '')
120+ self.assertEqual(
121+ self.logger.buffer.getvalue().strip(),
122+ 'INFO: Parsed 1 lines resulting in 1 download stats.')
123 self.assertEqual(parsed_bytes, fd.tell())
124
125 self.assertContentEqual(
126@@ -175,7 +179,9 @@
127 downloads, parsed_bytes = parse_file(
128 fd, start_position=0, logger=self.logger,
129 get_download_key=get_path_download_key)
130- self.assertEqual(self.logger.buffer.getvalue(), '')
131+ self.assertEqual(
132+ self.logger.buffer.getvalue().strip(),
133+ 'INFO: Parsed 1 lines resulting in 0 download stats.')
134 self.assertEqual(downloads, {})
135 self.assertEqual(parsed_bytes, fd.tell())
136
137@@ -198,7 +204,9 @@
138 downloads, parsed_bytes = parse_file(
139 fd, start_position=0, logger=self.logger,
140 get_download_key=get_path_download_key)
141- self.assertEqual(self.logger.buffer.getvalue(), '')
142+ self.assertEqual(
143+ self.logger.buffer.getvalue().strip(),
144+ 'INFO: Parsed 1 lines resulting in 0 download stats.')
145 self.assertEqual(downloads, {})
146 self.assertEqual(parsed_bytes, fd.tell())
147
148@@ -214,9 +222,10 @@
149 downloads, parsed_bytes = parse_file(
150 fd, start_position=0, logger=self.logger,
151 get_download_key=get_path_download_key)
152- self.assertEqual(self.logger.buffer.getvalue(), '')
153+ self.assertEqual(
154+ self.logger.buffer.getvalue().strip(),
155+ 'INFO: Parsed 1 lines resulting in 1 download stats.')
156
157- date = datetime(2008, 6, 13)
158 self.assertEqual(downloads,
159 {'/15018215/ul_logo_64x64.png':
160 {datetime(2008, 6, 13): {'US': 1}}})
161@@ -232,24 +241,37 @@
162 [launchpad]
163 logparser_max_parsed_lines: 2
164 '''))
165+ self.addCleanup(config.pop, 'log_parser config')
166 fd = open(os.path.join(
167 here, 'apache-log-files', 'launchpadlibrarian.net.access-log'))
168+ self.addCleanup(fd.close)
169+
170 downloads, parsed_bytes = parse_file(
171 fd, start_position=0, logger=self.logger,
172 get_download_key=get_path_download_key)
173- config.pop("log_parser config")
174
175- self.assertEqual(self.logger.buffer.getvalue(), '')
176+ # We have initially parsed only the first two lines of data,
177+ # corresponding to one download (the first line is a 404 and
178+ # so ignored).
179 date = datetime(2008, 6, 13)
180 self.assertContentEqual(
181 downloads.items(),
182- [('/12060796/me-tv-icon-64x64.png', {date: {'AU': 1}}),
183- ('/9096290/me-tv-icon-14x14.png', {date: {'AU': 1}})])
184-
185- # We should have parsed only the first two lines of data.
186+ [('/9096290/me-tv-icon-14x14.png', {date: {'AU': 1}})])
187 fd.seek(0)
188 lines = fd.readlines()
189- self.assertEqual(parsed_bytes, len(lines[0]) + len(lines[1]))
190+ line_lengths = [len(line) for line in lines]
191+ self.assertEqual(parsed_bytes, sum(line_lengths[:2]))
192+
193+ # And the subsequent parse will be for the 3rd and 4th lines,
194+ # corresponding to two downloads of the same file.
195+ downloads, parsed_bytes = parse_file(
196+ fd, start_position=parsed_bytes, logger=self.logger,
197+ get_download_key=get_path_download_key)
198+ self.assertContentEqual(
199+ downloads.items(),
200+ [('/12060796/me-tv-icon-64x64.png', {date: {'AU': 1}}),
201+ ('/8196569/mediumubuntulogo.png', {date: {'AR': 1}})])
202+ self.assertEqual(parsed_bytes, sum(line_lengths[:4]))
203
204
205 class TestParsedFilesDetection(TestCase):