Launchpad itself

Merge lp:~michael.nelson/launchpad/588288-log-parser-dont-read-entire-file-qa into lp:launchpad

588288-log-parser-dont-read-entire-file-qa
Merge into devel

Proposed by Michael Nelson on 2010-06-17

Status:

Merged

Approved by:

Michael Nelson on 2010-06-18

Approved revision:

no longer in the source branch.

Merged at revision:

11035

Proposed branch:

lp:~michael.nelson/launchpad/588288-log-parser-dont-read-entire-file-qa

Merge into:

lp:launchpad

Diff against target:

205 lines (+54/-22)

5 files modified

lib/canonical/config/schema-lazr.conf (+2/-2)
lib/canonical/launchpad/scripts/tests/test_librarian_apache_log_parser.py (+7/-3)
lib/lp/services/apachelogparser/base.py (+9/-4)
lib/lp/services/apachelogparser/tests/apache-log-files/launchpadlibrarian.net.access-log (+1/-0)
lib/lp/services/apachelogparser/tests/test_apachelogparser.py (+35/-13)

To merge this branch:

bzr merge lp:~michael.nelson/launchpad/588288-log-parser-dont-read-entire-file-qa

High

Fix Released

Link a bug report

Reviewer	Review Type	Date Requested	Status
Abel Deuring (community)	code	2010-06-17	Approve on 2010-06-18
Review via email: mp+27849@code.launchpad.net

Commit message

Ensure the next line is incremented when log parser skips lines.

Description of the change

This branch fixes bug 590766 (which was a result of a bad fix landed for bug 588288).

The apache log parser has always ignored lines that are not for an http status 200 OK, or are not GET requests etc. But the current code in devel doesn't increment to the next line when ignoring these lines.

The fix was just a small restructure to ensure that the line is always incremented.

To test:
bin/test -vvm test_apachelogparser

To demo locally:
Apply the following patch so that only 2 lines are read at a time: http://pastebin.ubuntu.com/451153/

Then copy an apache access log from your /var/log/apache2 to /srv/launchpadlibrarian.net-logs

and then run:
cronscripts/parse-librarian-apache-access-logs.py
multiple times. You can watch the data being updated with `psql launchpad_dev` running the query:
select * from parsedapachelog;

No lint.

Revision history for this message

Abel Deuring (adeuring) wrote on 2010-06-18:

Hi Michael,

your changes look good -- but I believe I found another issue in the "while next_line" loop (or I had not have yet enough coffee...):

If we get any exception in the second try/except block of parse_file(), parsed_bytes is set to the value for the start of the line causing the exception, and the loop is terminated via a "break".

So, when parse_file() is called the next time, the line causing the exception is read again. That's fine if the exception of the previous run was due to the line being truncated; in this case, it is fair to assume that a new run with more data will work better. But if the exception is caused for example by a bug in get_method_and_path() or geoip.getRecordByAddress(), we will have an "non-terminating loop", in the sense that sucessive calls of parse_file() will never get past the line causing the exception.

I'm approving this branch but I think it is worth to rethink the exception handling in the loop.

review: Approve (code)

Revision history for this message

Abel Deuring (adeuring) wrote on 2010-06-18:

...one more comment: Can't we use something like

line[-1] == '\n'

as an indicator if the current line is complete of if we have reached the end of the file and should not parse this line?

Revision history for this message

Abel Deuring (adeuring) wrote on 2010-06-18:

I think that would allow us to remove the "exception Exception" handling

Revision history for this message

Michael Nelson (michael.nelson) wrote on 2010-06-18:

Thanks Abel. Regarding the exception handling, it looks to me as if that is by design (but I didn't design it ;)) - that is, if there is an error the log parser will not continue past the line where the error occurred until it has been resolved (so that we don't end up skipping chunks of the file, but are rather alerted to it). I'm basing this on the test:

    def test_unexpected_error_while_parsing(self):
        # When there's an unexpected error, we log it and return as if we had
        # parsed up to the line before the one where the failure occurred.

I can create a bug with your comments for the foundations guys to look at if you think it's worthwhile.
Thanks!

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk

Subscribers

People subscribed via source and target branches

to all changes:

Barki Mustapha

Celso Providelo

Christian Reis

Christy Awad

Colin Watson

Harpianto,ANDI

James Troup

John A Meinel

Kevin bush

Launchpad code reviewers

Launchpad code reviewers from Canonical

Matthew Tanner

Maximiliano Bertacchini

Michael Nelson

Oguz Ersoz

Simon Brakhane

Ubuntu-BR DevOps

William Grant

alexsio nau

alhawiti

api.ng

pedro cavazos

todaioan

wenjingwen

to status/vote changes:

Tzaddi

Tzaddi Belding

1	=== modified file 'lib/canonical/config/schema-lazr.conf'
2	--- lib/canonical/config/schema-lazr.conf 2010-06-14 18:32:58 +0000
3	+++ lib/canonical/config/schema-lazr.conf 2010-06-18 11:23:29 +0000
4	@@ -1132,8 +1132,8 @@
5	geonames_identity:
6
7	# The maximum number of lines that should be parsed by the launchpad
8	-# log parser.
9	-logparser_max_parsed_lines: 100000
10	+# log parser. The default value of None means there is no maximum.
11	+logparser_max_parsed_lines: None
12
13
14	[launchpad_session]
15
16	=== modified file 'lib/canonical/launchpad/scripts/tests/test_librarian_apache_log_parser.py'
17	--- lib/canonical/launchpad/scripts/tests/test_librarian_apache_log_parser.py 2009-12-22 14:08:17 +0000
18	+++ lib/canonical/launchpad/scripts/tests/test_librarian_apache_log_parser.py 2010-06-18 11:23:29 +0000
19	@@ -95,10 +95,12 @@
20	downloads, parsed_bytes = parse_file(
21	fd, start_position=0, logger=self.logger,
22	get_download_key=get_library_file_id)
23	- self.assertEqual(self.logger.buffer.getvalue(), '')
24	+ self.assertEqual(
25	+ self.logger.buffer.getvalue().strip(),
26	+ 'INFO: Parsed 1 lines resulting in 1 download stats.')
27
28	date = datetime(2008, 6, 13)
29	- self.assertEqual(downloads,
30	+ self.assertEqual(downloads,
31	{'15018215': {datetime(2008, 6, 13): {'US': 1}}})
32
33	self.assertEqual(parsed_bytes, fd.tell())
34	@@ -112,7 +114,9 @@
35	downloads, parsed_bytes = parse_file(
36	fd, start_position=0, logger=self.logger,
37	get_download_key=get_library_file_id)
38	- self.assertEqual(self.logger.buffer.getvalue(), '')
39	+ self.assertEqual(
40	+ self.logger.buffer.getvalue().strip(),
41	+ 'INFO: Parsed 1 lines resulting in 0 download stats.')
42	self.assertEqual(downloads, {})
43	self.assertEqual(parsed_bytes, fd.tell())
44
45
46	=== modified file 'lib/lp/services/apachelogparser/base.py'
47	--- lib/lp/services/apachelogparser/base.py 2010-06-01 14:31:13 +0000
48	+++ lib/lp/services/apachelogparser/base.py 2010-06-18 11:23:29 +0000
49	@@ -85,7 +85,7 @@
50	"""
51	# Seek file to given position, read all lines.
52	fd.seek(start_position)
53	- line = fd.readline()
54	+ next_line = fd.readline()
55
56	parsed_bytes = start_position
57
58	@@ -97,16 +97,17 @@
59	max_parsed_lines = getattr(
60	config.launchpad, 'logparser_max_parsed_lines', None)
61
62	- while line:
63	+ while next_line:
64	if max_parsed_lines is not None and parsed_lines >= max_parsed_lines:
65	break
66
67	+ line = next_line
68	+
69	# Always skip the last line as it may be truncated since we're
70	# rsyncing live logs, unless there is only one line for us to
71	# parse, in which case This probably means we're dealing with a
72	# logfile that has been rotated already, so it should be safe to
73	# parse its last line.
74	- next_line = ''
75	try:
76	next_line = fd.next()
77	except StopIteration:
78	@@ -161,7 +162,11 @@
79	logger.error('Error (%s) while parsing "%s"' % (e, line))
80	break
81
82	- line = next_line
83	+
84	+ if parsed_lines > 0:
85	+ logger.info('Parsed %d lines resulting in %d download stats.' % (
86	+ parsed_lines, len(downloads)))
87	+
88	return downloads, parsed_bytes
89
90
91
92	=== modified file 'lib/lp/services/apachelogparser/tests/apache-log-files/launchpadlibrarian.net.access-log'
93	--- lib/lp/services/apachelogparser/tests/apache-log-files/launchpadlibrarian.net.access-log 2009-03-27 03:29:31 +0000
94	+++ lib/lp/services/apachelogparser/tests/apache-log-files/launchpadlibrarian.net.access-log 2010-06-18 11:23:29 +0000
95	@@ -1,3 +1,4 @@
96	+121.44.28.210 - - [13/Jun/2008:14:55:06 +0100] "GET /9096290/me-tv-icon-14x14.png HTTP/1.1" 404 730 "https://launchpad.net/me-tv" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9) Gecko/2008060900 Firefox/3.0"
97	121.44.28.210 - - [13/Jun/2008:14:55:06 +0100] "GET /9096290/me-tv-icon-14x14.png HTTP/1.1" 200 730 "https://launchpad.net/me-tv" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9) Gecko/2008060900 Firefox/3.0"
98	121.44.28.210 - - [13/Jun/2008:14:55:13 +0100] "GET /12060796/me-tv-icon-64x64.png HTTP/1.1" 200 6378 "https://launchpad.net/me-tv" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9) Gecko/2008060900 Firefox/3.0"
99	157.92.18.21 - - [13/Jun/2008:14:55:15 +0100] "GET /8196569/mediumubuntulogo.png HTTP/1.1" 200 3420 "https://bugs.launchpad.net/ubuntu/+source/acpi-support/+bug/59695/comments/14" "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9) Gecko/2008052623 Iceweasel/3.0 (Debian-3.0~rc1-1)"
100
101	=== modified file 'lib/lp/services/apachelogparser/tests/test_apachelogparser.py'
102	--- lib/lp/services/apachelogparser/tests/test_apachelogparser.py 2010-06-01 14:28:40 +0000
103	+++ lib/lp/services/apachelogparser/tests/test_apachelogparser.py 2010-06-18 11:23:29 +0000
104	@@ -126,7 +126,9 @@
105	downloads, parsed_bytes = parse_file(
106	fd, start_position=0, logger=self.logger,
107	get_download_key=get_path_download_key)
108	- self.assertEqual(self.logger.buffer.getvalue(), '')
109	+ self.assertEqual(
110	+ self.logger.buffer.getvalue().strip(),
111	+ 'INFO: Parsed 5 lines resulting in 3 download stats.')
112	date = datetime(2008, 6, 13)
113	self.assertContentEqual(
114	downloads.items(),
115	@@ -148,7 +150,9 @@
116	downloads, parsed_bytes = parse_file(
117	fd, start_position=self._getLastLineStart(fd), logger=self.logger,
118	get_download_key=get_path_download_key)
119	- self.assertEqual(self.logger.buffer.getvalue(), '')
120	+ self.assertEqual(
121	+ self.logger.buffer.getvalue().strip(),
122	+ 'INFO: Parsed 1 lines resulting in 1 download stats.')
123	self.assertEqual(parsed_bytes, fd.tell())
124
125	self.assertContentEqual(
126	@@ -175,7 +179,9 @@
127	downloads, parsed_bytes = parse_file(
128	fd, start_position=0, logger=self.logger,
129	get_download_key=get_path_download_key)
130	- self.assertEqual(self.logger.buffer.getvalue(), '')
131	+ self.assertEqual(
132	+ self.logger.buffer.getvalue().strip(),
133	+ 'INFO: Parsed 1 lines resulting in 0 download stats.')
134	self.assertEqual(downloads, {})
135	self.assertEqual(parsed_bytes, fd.tell())
136
137	@@ -198,7 +204,9 @@
138	downloads, parsed_bytes = parse_file(
139	fd, start_position=0, logger=self.logger,
140	get_download_key=get_path_download_key)
141	- self.assertEqual(self.logger.buffer.getvalue(), '')
142	+ self.assertEqual(
143	+ self.logger.buffer.getvalue().strip(),
144	+ 'INFO: Parsed 1 lines resulting in 0 download stats.')
145	self.assertEqual(downloads, {})
146	self.assertEqual(parsed_bytes, fd.tell())
147
148	@@ -214,9 +222,10 @@
149	downloads, parsed_bytes = parse_file(
150	fd, start_position=0, logger=self.logger,
151	get_download_key=get_path_download_key)
152	- self.assertEqual(self.logger.buffer.getvalue(), '')
153	+ self.assertEqual(
154	+ self.logger.buffer.getvalue().strip(),
155	+ 'INFO: Parsed 1 lines resulting in 1 download stats.')
156
157	- date = datetime(2008, 6, 13)
158	self.assertEqual(downloads,
159	{'/15018215/ul_logo_64x64.png':
160	{datetime(2008, 6, 13): {'US': 1}}})
161	@@ -232,24 +241,37 @@
162	[launchpad]
163	logparser_max_parsed_lines: 2
164	'''))
165	+ self.addCleanup(config.pop, 'log_parser config')
166	fd = open(os.path.join(
167	here, 'apache-log-files', 'launchpadlibrarian.net.access-log'))
168	+ self.addCleanup(fd.close)
169	+
170	downloads, parsed_bytes = parse_file(
171	fd, start_position=0, logger=self.logger,
172	get_download_key=get_path_download_key)
173	- config.pop("log_parser config")
174
175	- self.assertEqual(self.logger.buffer.getvalue(), '')
176	+ # We have initially parsed only the first two lines of data,
177	+ # corresponding to one download (the first line is a 404 and
178	+ # so ignored).
179	date = datetime(2008, 6, 13)
180	self.assertContentEqual(
181	downloads.items(),
182	- [('/12060796/me-tv-icon-64x64.png', {date: {'AU': 1}}),
183	- ('/9096290/me-tv-icon-14x14.png', {date: {'AU': 1}})])
184	-
185	- # We should have parsed only the first two lines of data.
186	+ [('/9096290/me-tv-icon-14x14.png', {date: {'AU': 1}})])
187	fd.seek(0)
188	lines = fd.readlines()
189	- self.assertEqual(parsed_bytes, len(lines[0]) + len(lines[1]))
190	+ line_lengths = [len(line) for line in lines]
191	+ self.assertEqual(parsed_bytes, sum(line_lengths[:2]))
192	+
193	+ # And the subsequent parse will be for the 3rd and 4th lines,
194	+ # corresponding to two downloads of the same file.
195	+ downloads, parsed_bytes = parse_file(
196	+ fd, start_position=parsed_bytes, logger=self.logger,
197	+ get_download_key=get_path_download_key)
198	+ self.assertContentEqual(
199	+ downloads.items(),
200	+ [('/12060796/me-tv-icon-64x64.png', {date: {'AU': 1}}),
201	+ ('/8196569/mediumubuntulogo.png', {date: {'AR': 1}})])
202	+ self.assertEqual(parsed_bytes, sum(line_lengths[:4]))
203
204
205	class TestParsedFilesDetection(TestCase):