Merge lp:~michael.nelson/launchpad/588288-log-parser-dont-read-entire-file-qa into lp:launchpad

Proposed by Michael Nelson
Status: Merged
Approved by: Michael Nelson
Approved revision: no longer in the source branch.
Merged at revision: 11035
Proposed branch: lp:~michael.nelson/launchpad/588288-log-parser-dont-read-entire-file-qa
Merge into: lp:launchpad
Diff against target: 205 lines (+54/-22)
5 files modified
lib/canonical/config/schema-lazr.conf (+2/-2)
lib/canonical/launchpad/scripts/tests/test_librarian_apache_log_parser.py (+7/-3)
lib/lp/services/apachelogparser/base.py (+9/-4)
lib/lp/services/apachelogparser/tests/apache-log-files/launchpadlibrarian.net.access-log (+1/-0)
lib/lp/services/apachelogparser/tests/test_apachelogparser.py (+35/-13)
To merge this branch: bzr merge lp:~michael.nelson/launchpad/588288-log-parser-dont-read-entire-file-qa
Reviewer Review Type Date Requested Status
Abel Deuring (community) code Approve
Review via email: mp+27849@code.launchpad.net

Commit message

Ensure the next line is incremented when log parser skips lines.

Description of the change

This branch fixes bug 590766 (which was a result of a bad fix landed for bug 588288).

The apache log parser has always ignored lines that are not for an http status 200 OK, or are not GET requests etc. But the current code in devel doesn't increment to the next line when ignoring these lines.

The fix was just a small restructure to ensure that the line is always incremented.

To test:
bin/test -vvm test_apachelogparser

To demo locally:
Apply the following patch so that only 2 lines are read at a time: http://pastebin.ubuntu.com/451153/

Then copy an apache access log from your /var/log/apache2 to /srv/launchpadlibrarian.net-logs

and then run:
cronscripts/parse-librarian-apache-access-logs.py
multiple times. You can watch the data being updated with `psql launchpad_dev` running the query:
select * from parsedapachelog;

No lint.

To post a comment you must log in.
Revision history for this message
Abel Deuring (adeuring) wrote :

Hi Michael,

your changes look good -- but I believe I found another issue in the "while next_line" loop (or I had not have yet enough coffee...):

If we get any exception in the second try/except block of parse_file(), parsed_bytes is set to the value for the start of the line causing the exception, and the loop is terminated via a "break".

So, when parse_file() is called the next time, the line causing the exception is read again. That's fine if the exception of the previous run was due to the line being truncated; in this case, it is fair to assume that a new run with more data will work better. But if the exception is caused for example by a bug in get_method_and_path() or geoip.getRecordByAddress(), we will have an "non-terminating loop", in the sense that sucessive calls of parse_file() will never get past the line causing the exception.

I'm approving this branch but I think it is worth to rethink the exception handling in the loop.

review: Approve (code)
Revision history for this message
Abel Deuring (adeuring) wrote :

...one more comment: Can't we use something like

  line[-1] == '\n'

as an indicator if the current line is complete of if we have reached the end of the file and should not parse this line?

Revision history for this message
Abel Deuring (adeuring) wrote :

I think that would allow us to remove the "exception Exception" handling

Revision history for this message
Michael Nelson (michael.nelson) wrote :

Thanks Abel. Regarding the exception handling, it looks to me as if that is by design (but I didn't design it ;)) - that is, if there is an error the log parser will not continue past the line where the error occurred until it has been resolved (so that we don't end up skipping chunks of the file, but are rather alerted to it). I'm basing this on the test:

    def test_unexpected_error_while_parsing(self):
        # When there's an unexpected error, we log it and return as if we had
        # parsed up to the line before the one where the failure occurred.

I can create a bug with your comments for the foundations guys to look at if you think it's worthwhile.
Thanks!

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
=== modified file 'lib/canonical/config/schema-lazr.conf'
--- lib/canonical/config/schema-lazr.conf 2010-06-14 18:32:58 +0000
+++ lib/canonical/config/schema-lazr.conf 2010-06-18 11:23:29 +0000
@@ -1132,8 +1132,8 @@
1132geonames_identity:1132geonames_identity:
11331133
1134# The maximum number of lines that should be parsed by the launchpad1134# The maximum number of lines that should be parsed by the launchpad
1135# log parser.1135# log parser. The default value of None means there is no maximum.
1136logparser_max_parsed_lines: 1000001136logparser_max_parsed_lines: None
11371137
11381138
1139[launchpad_session]1139[launchpad_session]
11401140
=== modified file 'lib/canonical/launchpad/scripts/tests/test_librarian_apache_log_parser.py'
--- lib/canonical/launchpad/scripts/tests/test_librarian_apache_log_parser.py 2009-12-22 14:08:17 +0000
+++ lib/canonical/launchpad/scripts/tests/test_librarian_apache_log_parser.py 2010-06-18 11:23:29 +0000
@@ -95,10 +95,12 @@
95 downloads, parsed_bytes = parse_file(95 downloads, parsed_bytes = parse_file(
96 fd, start_position=0, logger=self.logger,96 fd, start_position=0, logger=self.logger,
97 get_download_key=get_library_file_id)97 get_download_key=get_library_file_id)
98 self.assertEqual(self.logger.buffer.getvalue(), '')98 self.assertEqual(
99 self.logger.buffer.getvalue().strip(),
100 'INFO: Parsed 1 lines resulting in 1 download stats.')
99101
100 date = datetime(2008, 6, 13)102 date = datetime(2008, 6, 13)
101 self.assertEqual(downloads, 103 self.assertEqual(downloads,
102 {'15018215': {datetime(2008, 6, 13): {'US': 1}}})104 {'15018215': {datetime(2008, 6, 13): {'US': 1}}})
103105
104 self.assertEqual(parsed_bytes, fd.tell())106 self.assertEqual(parsed_bytes, fd.tell())
@@ -112,7 +114,9 @@
112 downloads, parsed_bytes = parse_file(114 downloads, parsed_bytes = parse_file(
113 fd, start_position=0, logger=self.logger,115 fd, start_position=0, logger=self.logger,
114 get_download_key=get_library_file_id)116 get_download_key=get_library_file_id)
115 self.assertEqual(self.logger.buffer.getvalue(), '')117 self.assertEqual(
118 self.logger.buffer.getvalue().strip(),
119 'INFO: Parsed 1 lines resulting in 0 download stats.')
116 self.assertEqual(downloads, {})120 self.assertEqual(downloads, {})
117 self.assertEqual(parsed_bytes, fd.tell())121 self.assertEqual(parsed_bytes, fd.tell())
118122
119123
=== modified file 'lib/lp/services/apachelogparser/base.py'
--- lib/lp/services/apachelogparser/base.py 2010-06-01 14:31:13 +0000
+++ lib/lp/services/apachelogparser/base.py 2010-06-18 11:23:29 +0000
@@ -85,7 +85,7 @@
85 """85 """
86 # Seek file to given position, read all lines.86 # Seek file to given position, read all lines.
87 fd.seek(start_position)87 fd.seek(start_position)
88 line = fd.readline()88 next_line = fd.readline()
8989
90 parsed_bytes = start_position90 parsed_bytes = start_position
9191
@@ -97,16 +97,17 @@
97 max_parsed_lines = getattr(97 max_parsed_lines = getattr(
98 config.launchpad, 'logparser_max_parsed_lines', None)98 config.launchpad, 'logparser_max_parsed_lines', None)
9999
100 while line:100 while next_line:
101 if max_parsed_lines is not None and parsed_lines >= max_parsed_lines:101 if max_parsed_lines is not None and parsed_lines >= max_parsed_lines:
102 break102 break
103103
104 line = next_line
105
104 # Always skip the last line as it may be truncated since we're106 # Always skip the last line as it may be truncated since we're
105 # rsyncing live logs, unless there is only one line for us to107 # rsyncing live logs, unless there is only one line for us to
106 # parse, in which case This probably means we're dealing with a108 # parse, in which case This probably means we're dealing with a
107 # logfile that has been rotated already, so it should be safe to109 # logfile that has been rotated already, so it should be safe to
108 # parse its last line.110 # parse its last line.
109 next_line = ''
110 try:111 try:
111 next_line = fd.next()112 next_line = fd.next()
112 except StopIteration:113 except StopIteration:
@@ -161,7 +162,11 @@
161 logger.error('Error (%s) while parsing "%s"' % (e, line))162 logger.error('Error (%s) while parsing "%s"' % (e, line))
162 break163 break
163164
164 line = next_line165
166 if parsed_lines > 0:
167 logger.info('Parsed %d lines resulting in %d download stats.' % (
168 parsed_lines, len(downloads)))
169
165 return downloads, parsed_bytes170 return downloads, parsed_bytes
166171
167172
168173
=== modified file 'lib/lp/services/apachelogparser/tests/apache-log-files/launchpadlibrarian.net.access-log'
--- lib/lp/services/apachelogparser/tests/apache-log-files/launchpadlibrarian.net.access-log 2009-03-27 03:29:31 +0000
+++ lib/lp/services/apachelogparser/tests/apache-log-files/launchpadlibrarian.net.access-log 2010-06-18 11:23:29 +0000
@@ -1,3 +1,4 @@
1121.44.28.210 - - [13/Jun/2008:14:55:06 +0100] "GET /9096290/me-tv-icon-14x14.png HTTP/1.1" 404 730 "https://launchpad.net/me-tv" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9) Gecko/2008060900 Firefox/3.0"
1121.44.28.210 - - [13/Jun/2008:14:55:06 +0100] "GET /9096290/me-tv-icon-14x14.png HTTP/1.1" 200 730 "https://launchpad.net/me-tv" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9) Gecko/2008060900 Firefox/3.0"2121.44.28.210 - - [13/Jun/2008:14:55:06 +0100] "GET /9096290/me-tv-icon-14x14.png HTTP/1.1" 200 730 "https://launchpad.net/me-tv" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9) Gecko/2008060900 Firefox/3.0"
2121.44.28.210 - - [13/Jun/2008:14:55:13 +0100] "GET /12060796/me-tv-icon-64x64.png HTTP/1.1" 200 6378 "https://launchpad.net/me-tv" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9) Gecko/2008060900 Firefox/3.0"3121.44.28.210 - - [13/Jun/2008:14:55:13 +0100] "GET /12060796/me-tv-icon-64x64.png HTTP/1.1" 200 6378 "https://launchpad.net/me-tv" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9) Gecko/2008060900 Firefox/3.0"
3157.92.18.21 - - [13/Jun/2008:14:55:15 +0100] "GET /8196569/mediumubuntulogo.png HTTP/1.1" 200 3420 "https://bugs.launchpad.net/ubuntu/+source/acpi-support/+bug/59695/comments/14" "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9) Gecko/2008052623 Iceweasel/3.0 (Debian-3.0~rc1-1)"4157.92.18.21 - - [13/Jun/2008:14:55:15 +0100] "GET /8196569/mediumubuntulogo.png HTTP/1.1" 200 3420 "https://bugs.launchpad.net/ubuntu/+source/acpi-support/+bug/59695/comments/14" "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9) Gecko/2008052623 Iceweasel/3.0 (Debian-3.0~rc1-1)"
45
=== modified file 'lib/lp/services/apachelogparser/tests/test_apachelogparser.py'
--- lib/lp/services/apachelogparser/tests/test_apachelogparser.py 2010-06-01 14:28:40 +0000
+++ lib/lp/services/apachelogparser/tests/test_apachelogparser.py 2010-06-18 11:23:29 +0000
@@ -126,7 +126,9 @@
126 downloads, parsed_bytes = parse_file(126 downloads, parsed_bytes = parse_file(
127 fd, start_position=0, logger=self.logger,127 fd, start_position=0, logger=self.logger,
128 get_download_key=get_path_download_key)128 get_download_key=get_path_download_key)
129 self.assertEqual(self.logger.buffer.getvalue(), '')129 self.assertEqual(
130 self.logger.buffer.getvalue().strip(),
131 'INFO: Parsed 5 lines resulting in 3 download stats.')
130 date = datetime(2008, 6, 13)132 date = datetime(2008, 6, 13)
131 self.assertContentEqual(133 self.assertContentEqual(
132 downloads.items(),134 downloads.items(),
@@ -148,7 +150,9 @@
148 downloads, parsed_bytes = parse_file(150 downloads, parsed_bytes = parse_file(
149 fd, start_position=self._getLastLineStart(fd), logger=self.logger,151 fd, start_position=self._getLastLineStart(fd), logger=self.logger,
150 get_download_key=get_path_download_key)152 get_download_key=get_path_download_key)
151 self.assertEqual(self.logger.buffer.getvalue(), '')153 self.assertEqual(
154 self.logger.buffer.getvalue().strip(),
155 'INFO: Parsed 1 lines resulting in 1 download stats.')
152 self.assertEqual(parsed_bytes, fd.tell())156 self.assertEqual(parsed_bytes, fd.tell())
153157
154 self.assertContentEqual(158 self.assertContentEqual(
@@ -175,7 +179,9 @@
175 downloads, parsed_bytes = parse_file(179 downloads, parsed_bytes = parse_file(
176 fd, start_position=0, logger=self.logger,180 fd, start_position=0, logger=self.logger,
177 get_download_key=get_path_download_key)181 get_download_key=get_path_download_key)
178 self.assertEqual(self.logger.buffer.getvalue(), '')182 self.assertEqual(
183 self.logger.buffer.getvalue().strip(),
184 'INFO: Parsed 1 lines resulting in 0 download stats.')
179 self.assertEqual(downloads, {})185 self.assertEqual(downloads, {})
180 self.assertEqual(parsed_bytes, fd.tell())186 self.assertEqual(parsed_bytes, fd.tell())
181187
@@ -198,7 +204,9 @@
198 downloads, parsed_bytes = parse_file(204 downloads, parsed_bytes = parse_file(
199 fd, start_position=0, logger=self.logger,205 fd, start_position=0, logger=self.logger,
200 get_download_key=get_path_download_key)206 get_download_key=get_path_download_key)
201 self.assertEqual(self.logger.buffer.getvalue(), '')207 self.assertEqual(
208 self.logger.buffer.getvalue().strip(),
209 'INFO: Parsed 1 lines resulting in 0 download stats.')
202 self.assertEqual(downloads, {})210 self.assertEqual(downloads, {})
203 self.assertEqual(parsed_bytes, fd.tell())211 self.assertEqual(parsed_bytes, fd.tell())
204212
@@ -214,9 +222,10 @@
214 downloads, parsed_bytes = parse_file(222 downloads, parsed_bytes = parse_file(
215 fd, start_position=0, logger=self.logger,223 fd, start_position=0, logger=self.logger,
216 get_download_key=get_path_download_key)224 get_download_key=get_path_download_key)
217 self.assertEqual(self.logger.buffer.getvalue(), '')225 self.assertEqual(
226 self.logger.buffer.getvalue().strip(),
227 'INFO: Parsed 1 lines resulting in 1 download stats.')
218228
219 date = datetime(2008, 6, 13)
220 self.assertEqual(downloads,229 self.assertEqual(downloads,
221 {'/15018215/ul_logo_64x64.png':230 {'/15018215/ul_logo_64x64.png':
222 {datetime(2008, 6, 13): {'US': 1}}})231 {datetime(2008, 6, 13): {'US': 1}}})
@@ -232,24 +241,37 @@
232 [launchpad]241 [launchpad]
233 logparser_max_parsed_lines: 2242 logparser_max_parsed_lines: 2
234 '''))243 '''))
244 self.addCleanup(config.pop, 'log_parser config')
235 fd = open(os.path.join(245 fd = open(os.path.join(
236 here, 'apache-log-files', 'launchpadlibrarian.net.access-log'))246 here, 'apache-log-files', 'launchpadlibrarian.net.access-log'))
247 self.addCleanup(fd.close)
248
237 downloads, parsed_bytes = parse_file(249 downloads, parsed_bytes = parse_file(
238 fd, start_position=0, logger=self.logger,250 fd, start_position=0, logger=self.logger,
239 get_download_key=get_path_download_key)251 get_download_key=get_path_download_key)
240 config.pop("log_parser config")
241252
242 self.assertEqual(self.logger.buffer.getvalue(), '')253 # We have initially parsed only the first two lines of data,
254 # corresponding to one download (the first line is a 404 and
255 # so ignored).
243 date = datetime(2008, 6, 13)256 date = datetime(2008, 6, 13)
244 self.assertContentEqual(257 self.assertContentEqual(
245 downloads.items(),258 downloads.items(),
246 [('/12060796/me-tv-icon-64x64.png', {date: {'AU': 1}}),259 [('/9096290/me-tv-icon-14x14.png', {date: {'AU': 1}})])
247 ('/9096290/me-tv-icon-14x14.png', {date: {'AU': 1}})])
248
249 # We should have parsed only the first two lines of data.
250 fd.seek(0)260 fd.seek(0)
251 lines = fd.readlines()261 lines = fd.readlines()
252 self.assertEqual(parsed_bytes, len(lines[0]) + len(lines[1]))262 line_lengths = [len(line) for line in lines]
263 self.assertEqual(parsed_bytes, sum(line_lengths[:2]))
264
265 # And the subsequent parse will be for the 3rd and 4th lines,
266 # corresponding to two downloads of the same file.
267 downloads, parsed_bytes = parse_file(
268 fd, start_position=parsed_bytes, logger=self.logger,
269 get_download_key=get_path_download_key)
270 self.assertContentEqual(
271 downloads.items(),
272 [('/12060796/me-tv-icon-64x64.png', {date: {'AU': 1}}),
273 ('/8196569/mediumubuntulogo.png', {date: {'AR': 1}})])
274 self.assertEqual(parsed_bytes, sum(line_lengths[:4]))
253275
254276
255class TestParsedFilesDetection(TestCase):277class TestParsedFilesDetection(TestCase):