Merge lp:~coreygoldberg/lp-dev-utils/ppr-access-parser into lp:lp-dev-utils

Proposed by Corey Goldberg
Status: Merged
Approved by: j.c.sackett
Approved revision: 139
Merged at revision: 125
Proposed branch: lp:~coreygoldberg/lp-dev-utils/ppr-access-parser
Merge into: lp:lp-dev-utils
Diff against target: 447 lines (+227/-63)
4 files modified
page-perf_acccess.ini (+40/-0)
page-performance-report.ini (+3/-0)
page-performance-report.py (+3/-3)
pageperformancereport.py (+181/-60)
To merge this branch: bzr merge lp:~coreygoldberg/lp-dev-utils/ppr-access-parser
Reviewer Review Type Date Requested Status
j.c.sackett (community) Approve
Robert Collins Pending
Review via email: mp+119409@code.launchpad.net

Commit message

 give PPR (page-performance-report) ability to analyze Apache Access logs.

Description of the change

this branch is an enhancement to PPR (page-performance-report)... giving it the ability to analyze Apache Access logs.

when configured to run in "access log mode" (log_format=access), no 3rd party python libs are needed (such as zc.zservertracelog)

Change Summary:
---------------

* added parser for apache access logs

* added new mandatory [parser] config section to ini.
  takes option: (log_format=launchpad|access)

* changed CSS links in HTML to relative URLs,
  so we aren't harcoded to devpad.c.c.

* added new config/ini file for SSO/access logs:
    * page-perf_acccess.ini

* tweaked inline css slightly, and switched to Ubuntu font

Notes:
------

* report generated from sample staging access log:
    * https://chinstrap.canonical.com/~cgoldberg/ppr/sample-ppr/combined.html

* sample log for testing:
    * https://chinstrap.canonical.com/~cgoldberg/ppr/access_sample_sso_staging.log

* command-line example invoking page-performance-report.py
against sample access log, using new ini and a defined
output directory:

$ python page-performance-report.py \
--config=page-perf_acccess.ini \
--directory=/home/cgoldberg/tmp \
--no-partition \
access_sample_sso_staging.log

To post a comment you must log in.
131. By Corey Goldberg

use genex to convert to ints

132. By Corey Goldberg

updated ini and fixed url field in parser

133. By Corey Goldberg

updated example SSO config

134. By Corey Goldberg

updated ini with SSO categories

135. By Corey Goldberg

fixed regexes in .ini categories

136. By Corey Goldberg

fixed config for ppr

137. By Corey Goldberg

ini cleanup

Revision history for this message
j.c.sackett (jcsackett) wrote :

Corey--

This looks like a good start, but there's some confusing and possibly
unnecessary code here, plus a few style issues.

#117: Looks like you've unordered the imports here, please reorder them
appropriately (e.g. alphabetically).

#159 `create_request_class` method seems unnecessary; as you know in
`parse_launchpad` or `parse_access` which format you're using, it would be
cleaner to just define two different classes instead and use the appropriate
one in the appropriate parse function. I'm guessing this to facilitate not
importing from zc unless you need it, but this seems like an unnecessary
optimization that results in less clear code.

review: Needs Fixing
Revision history for this message
Corey Goldberg (coreygoldberg) wrote :

thanks for looking over the code.

> #159 `create_request_class` method seems unnecessary;
> I'm guessing this to facilitate not importing from zc
> unless you need it, but this seems like an unnecessary
> optimization that results in less clear code.

I am going to be using this code on servers that don't have zc libs installed.

Revision history for this message
j.c.sackett (jcsackett) wrote :

As discussed in IRC, given the apparent need to have this script support both functions and not require zc &c, if you just fix the other import issues and leave a comment indicating why this peculiarity exists and that if we separated these two utilities in some way we might be able to clean it up, we can call this good to go.

Thanks, Corey.

138. By Corey Goldberg

reordered imports and left XXX comment per MP comments

139. By Corey Goldberg

fixed comments

Revision history for this message
Corey Goldberg (coreygoldberg) wrote :

> fix the other import issues and leave a comment
> indicating why this peculiarity exists and that
> if we separated these two utilities in some way
> we might be able to clean it up

fixed imports and left comment. see updated branch/diff.

thanks!

Revision history for this message
j.c.sackett (jcsackett) wrote :

Thanks, Corey.

review: Approve

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== added file 'page-perf_acccess.ini'
2--- page-perf_acccess.ini 1970-01-01 00:00:00 +0000
3+++ page-perf_acccess.ini 2012-08-16 19:59:16 +0000
4@@ -0,0 +1,40 @@
5+[categories]
6+#
7+# Category -> Python regular expression.
8+# Remeber to quote ?, ., + & ? characters to match litterally.
9+#
10+Home=^/$
11+Login=/(\+login|\+enter_token)
12+Logout=/\+logout
13+OpenID=/(\+openid|\+id|\+decide)
14+Static Media=/assets/
15+Admin=/admin/
16+API=/api/
17+Registration=/\+new_account
18+Two Factor=/(device-list|two_factor_auth|device-addition)
19+Account=/(\+edit|\+new-email|\+emails)
20+Consumer=/consumer/
21+All Requests=.
22+
23+
24+[metrics]
25+home=Home
26+login=Login
27+logout=Logout
28+open_id=OpenID
29+static_media=Static Media
30+admin=Admin
31+api=API
32+registration=Registration
33+two_factor=Two Factor
34+account=Account
35+consumer=Consumer
36+all_requests=All Requests
37+
38+
39+[partition]
40+All Requests=
41+
42+
43+[parser]
44+log_format=access
45
46=== modified file 'page-performance-report.ini'
47--- page-performance-report.ini 2012-08-09 04:53:16 +0000
48+++ page-performance-report.ini 2012-08-16 19:59:16 +0000
49@@ -77,3 +77,6 @@
50 Public XML-RPC=
51 Web (Non API/non operational/non XML-RPC)=
52 Other=
53+
54+[parser]
55+log_format=launchpad
56
57=== modified file 'page-performance-report.py'
58--- page-performance-report.py 2012-08-09 04:53:16 +0000
59+++ page-performance-report.py 2012-08-16 19:59:16 +0000
60@@ -1,9 +1,9 @@
61 #!/usr/bin/python -S
62 #
63-# Copyright 2010 Canonical Ltd. This software is licensed under the
64+# Copyright 2010 - 2012 Canonical Ltd. This software is licensed under the
65 # GNU Affero General Public License version 3 (see the file LICENSE).
66
67-"""Page performance report generated from zserver tracelogs."""
68+"""Page performance report generated from zserver tracelogs or apache access logs."""
69
70 __metaclass__ = type
71
72@@ -11,7 +11,7 @@
73
74 import sys
75
76-from lp.scripts.utilities.pageperformancereport import main
77+from pageperformancereport import main
78
79
80 if __name__ == '__main__':
81
82=== modified file 'pageperformancereport.py'
83--- pageperformancereport.py 2012-08-09 04:53:16 +0000
84+++ pageperformancereport.py 2012-08-16 19:59:16 +0000
85@@ -1,31 +1,29 @@
86-# Copyright 2010 Canonical Ltd. This software is licensed under the
87+# Copyright 2010-2012 Canonical Ltd. This software is licensed under the
88 # GNU Affero General Public License version 3 (see the file LICENSE).
89
90-"""Page performance report generated from zserver trace logs."""
91+"""Page performance report generated from zserver tracelogs or apache access logs."""
92
93 __metaclass__ = type
94 __all__ = ['main']
95
96+from cgi import escape as html_quote
97 import bz2
98-from cgi import escape as html_quote
99-from ConfigParser import RawConfigParser
100+import ConfigParser
101 import copy
102 import cPickle
103 import csv
104-from datetime import datetime
105+from datetime import (datetime, timedelta)
106 import gzip
107 import logging
108 import math
109 import optparse
110 import os.path
111 import re
112-import textwrap
113-from textwrap import dedent
114-import time
115-
116 import simplejson as json
117 import sre_constants
118-import zc.zservertracelog.tracereport
119+from textwrap import dedent
120+import time
121+
122
123 logging.basicConfig()
124 log = logging
125@@ -61,7 +59,7 @@
126 TYPE_CHECKER[datetime] = _check_datetime
127
128
129-class OptionParser(optparse.OptionParser):
130+class ExtendedOptionParser(optparse.OptionParser):
131 """Extended optparse OptionParser.
132
133 Adds a 'datetime' option type.
134@@ -72,27 +70,63 @@
135 optparse.OptionParser.__init__(self, *args, **kw)
136
137
138-class Request(zc.zservertracelog.tracereport.Request):
139- url = None
140- pageid = None
141- ticks = None
142- sql_statements = None
143- sql_seconds = None
144-
145- # Override the broken version in our superclass that always
146- # returns an integer.
147- @property
148- def app_seconds(self):
149- interval = self.app_time - self.start_app_time
150- return interval.seconds + interval.microseconds / 1000000.0
151-
152- # Override the broken version in our superclass that always
153- # returns an integer.
154- @property
155- def total_seconds(self):
156- interval = self.end - self.start
157- return interval.seconds + interval.microseconds / 1000000.0
158-
159+# XXX: coreygoldberg 2012-08-16 : this function exists to avoid importing
160+# zc libs (zc.zservertracelog.tracereport) when not in log_format=launchpad
161+# parser mode. It could be cleaned up if we refactored and made modular, or
162+# separated this into different projects/utils.
163+def create_request_class(log_format='launchapd'):
164+ """ Define a Request class for use with compatible log_format. """
165+
166+ if log_format == 'launchpad':
167+ import zc.zservertracelog.tracereport
168+ class ZServerRequest(zc.zservertracelog.tracereport.Request):
169+ url = None
170+ pageid = None
171+ ticks = None
172+ sql_statements = None
173+ sql_seconds = None
174+
175+ # Override the broken version in our superclass that always
176+ # returns an integer.
177+ @property
178+ def app_seconds(self):
179+ interval = self.app_time - self.start_app_time
180+ return interval.seconds + interval.microseconds / 1000000.0
181+
182+ # Override the broken version in our superclass that always
183+ # returns an integer.
184+ @property
185+ def total_seconds(self):
186+ interval = self.end - self.start
187+ return interval.seconds + interval.microseconds / 1000000.0
188+ return ZServerRequest
189+
190+ elif log_format == 'access':
191+ class AccessRequest(object):
192+ url = None
193+ pageid = None
194+ ticks = None
195+ sql_statements = None
196+ sql_seconds = None
197+
198+ def __init__(self, dt, method, url, duration):
199+ self.dt = dt
200+ self.method = method
201+ self.url = url
202+ self.duration = duration
203+ self.interval = timedelta(microseconds=self.duration)
204+
205+ @property
206+ def app_seconds(self):
207+ interval = self.interval
208+ return interval.seconds + interval.microseconds / 1000000.0
209+
210+ @property
211+ def total_seconds(self):
212+ interval = self.interval
213+ return interval.seconds + interval.microseconds / 1000000.0
214+ return AccessRequest
215+
216
217 class Category:
218 """A Category in our report.
219@@ -335,7 +369,7 @@
220
221 def text(self):
222 """Return a textual version of the stats."""
223- return textwrap.dedent("""
224+ return dedent("""
225 <Stats for %d requests:
226 Time: total=%.2f; mean=%.2f; median=%.2f; std=%.2f
227 SQL time: total=%.2f; mean=%.2f; median=%.2f; std=%.2f
228@@ -762,10 +796,20 @@
229 parser.error("Config file %s not found." % options.config)
230
231 # Need a better config mechanism as ConfigParser doesn't preserve order.
232- script_config = RawConfigParser()
233+ script_config = ConfigParser.RawConfigParser()
234 script_config.optionxform = str # Make keys case sensitive.
235 script_config.readfp(open(options.config))
236-
237+
238+ try:
239+ parser_options = script_config.options('parser')
240+ except ConfigParser.NoSectionError:
241+ log.fatal("[parser] section not found in ini file.")
242+ return 1
243+ if 'log_format' in parser_options:
244+ log_format = script_config.get('parser', 'log_format')
245+ else:
246+ parser.error("[parser] log_format not defined in ini file.")
247+
248 categories = [] # A list of Category, in report order.
249 for option in script_config.options('categories'):
250 regexp = script_config.get('categories', option)
251@@ -789,7 +833,7 @@
252 log.warning(
253 "In partition definition: %s isn't a defined category",
254 option)
255-
256+
257 times = RequestTimes(categories, options)
258
259 if options.merge:
260@@ -799,8 +843,15 @@
261 times += cPickle.load(f)
262 f.close()
263 else:
264- parse(args, times, options)
265+ if log_format == 'launchpad':
266+ parse_launchpad(args, times, options, log_format)
267+ elif log_format == 'access':
268+ parse_access(args, times, options, log_format)
269+ else:
270+ log.fatal("No parser available for log_format: %r" % log_format)
271+ return 1
272
273+
274 category_times = times.get_category_times()
275
276 pageid_times = []
277@@ -916,21 +967,36 @@
278 """A malformed line was found in the trace log."""
279
280
281-_ts_re = re.compile(
282+_zserver_ts_re = re.compile(
283 '^(\d{4})-(\d\d)-(\d\d)\s(\d\d):(\d\d):(\d\d)(?:.(\d{6}))?$')
284
285-
286-def parse_timestamp(ts_string):
287- match = _ts_re.search(ts_string)
288+_access_ts_re = re.compile(
289+ '^\[(\d{2})/([A-Za-z]+)/(\d{4}):(\d{2}):(\d{2}):(\d{2})[+-]\d{4}\]$')
290+
291+
292+def parse_timestamp_zserverlog(ts_string):
293+ match = _zserver_ts_re.search(ts_string)
294 if match is None:
295 raise ValueError("Invalid timestamp")
296 return datetime(
297 *(int(elem) for elem in match.groups() if elem is not None))
298
299
300-def parse(tracefiles, times, options):
301+def parse_timestamp_accesslog(ts_string):
302+ match = _access_ts_re.search(ts_string)
303+ if match is None:
304+ raise ValueError("Invalid timestamp")
305+ day, alpha_month, year, hour, min, sec = match.groups()
306+ month = time.strptime(alpha_month,'%b').tm_mon # get numeric month
307+ return datetime(
308+ *(int(elem) for elem in (year, month, day, hour, min, sec)))
309+
310+
311+
312+def parse_launchpad(tracefiles, times, options, log_format):
313 requests = {}
314 total_requests = 0
315+ Request = create_request_class(log_format)
316 for tracefile in tracefiles:
317 log.info('Processing %s', tracefile)
318 for line in smart_open(tracefile):
319@@ -949,7 +1015,7 @@
320 # Parse the timestamp.
321 ts_string = '%s %s' % (date, time_)
322 try:
323- dt = parse_timestamp(ts_string)
324+ dt = parse_timestamp_zserverlog(ts_string)
325 except ValueError:
326 raise MalformedLine(
327 'Invalid timestamp %s' % repr(ts_string))
328@@ -1035,6 +1101,57 @@
329 "Unknown extension prefix %s" % prefix)
330
331
332+def parse_access(tracefiles, times, options, log_format):
333+ requests = {}
334+ total_requests = 0
335+ Request = create_request_class(log_format)
336+ for tracefile in tracefiles:
337+ log.info('Processing %s', tracefile)
338+ for line in smart_open(tracefile):
339+ line = line.rstrip()
340+ try:
341+ record = line.split()
342+ try:
343+ request_id = record[6]
344+ date = record[3]
345+ time_ = record[4]
346+ url = record[6]
347+ method = record[5]
348+ duration = int(record[-1])
349+ except ValueError:
350+ raise MalformedLine()
351+
352+ # Parse the timestamp.
353+ ts_string = '%s%s' % (date, time_)
354+ try:
355+ dt = parse_timestamp_accesslog(ts_string)
356+ except ValueError:
357+ raise MalformedLine(
358+ 'Invalid timestamp %s' % repr(ts_string))
359+
360+ # Filter entries by command line date range.
361+ if options.from_ts is not None and dt < options.from_ts:
362+ continue # Skip to next line.
363+ if options.until_ts is not None and dt > options.until_ts:
364+ break # Skip to next log file.
365+
366+ requests[request_id] = Request(dt, method, url, duration)
367+ request = requests.get(request_id, None)
368+ if request is None: # Just ignore partial records.
369+ continue
370+ request.url = url
371+ del requests[request_id]
372+ total_requests += 1
373+ if total_requests % 10000 == 0:
374+ log.debug("Parsed %d requests", total_requests)
375+ # Add the request to any matching categories.
376+ times.add_request(request)
377+
378+ except MalformedLine as x:
379+ log.error(
380+ "Malformed line %s (%s)" % (repr(line), x))
381+
382+
383 def html_report(
384 outf, category_times, pageid_times, url_times,
385 ninetyninth_percentile_threshold=None, histogram_resolution=0.5,
386@@ -1058,28 +1175,30 @@
387 <html>
388 <head>
389 <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
390- <title>Launchpad Page Performance Report %(date)s</title>
391- <script language="javascript" type="text/javascript"
392- src="https://devpad.canonical.com/~lpqateam/ppr/js/flot/jquery.min.js"
393- ></script>
394- <script language="javascript" type="text/javascript"
395- src="https://devpad.canonical.com/~lpqateam/ppr/js/jquery.appear-1.1.1.min.js"
396- ></script>
397- <script language="javascript" type="text/javascript"
398- src="https://devpad.canonical.com/~lpqateam/ppr/js/flot/jquery.flot.min.js"
399- ></script>
400- <script language="javascript" type="text/javascript"
401- src="https://devpad.canonical.com/~lpqateam/ppr/js/sorttable.js"></script>
402+ <title>Page Performance Report %(date)s</title>
403+ <script language="javascript" type="text/javascript"
404+ src="../js/flot/jquery.min.js"
405+ ></script>
406+ <script language="javascript" type="text/javascript"
407+ src="../js/jquery.appear-1.1.1.min.js"
408+ ></script>
409+ <script language="javascript" type="text/javascript"
410+ src="../js/flot/jquery.flot.min.js"
411+ ></script>
412+ <script language="javascript" type="text/javascript"
413+ src="../js/sorttable.js"></script>
414+ <link href='http://fonts.googleapis.com/css?family=Ubuntu' rel='stylesheet' type='text/css'>
415 <style type="text/css">
416+ body { font-family: 'Ubuntu', sans-serif; }
417 h3 { font-weight: normal; font-size: 1em; }
418 thead th { padding-left: 1em; padding-right: 1em; }
419 .category-title { text-align: right; padding-right: 2em;
420- max-width: 25em; }
421+ max-width: 30em; font-size: 0.9em; font-weight: normal; }
422 .regexp { font-size: x-small; font-weight: normal; }
423 .mean { text-align: right; padding-right: 1em; }
424 .median { text-align: right; padding-right: 1em; }
425 .standard-deviation { text-align: right; padding-right: 1em; }
426- .histogram { padding: 0.5em 1em; width:400px; height:250px; }
427+ .histogram { padding: 0.5em 1em; width:400px; height:250px; border-left: 1px dashed #000000; }
428 .odd-row { background-color: #eeeeff; }
429 .even-row { background-color: #ffffee; }
430 table.sortable thead {
431@@ -1094,12 +1213,14 @@
432 padding: 1em;
433 }
434 .clickable { cursor: hand; }
435- .total-hits, .histogram, .median-sqltime,
436- .median-sqlstatements { border-right: 1px dashed #000000; }
437+ .total-hits, .histogram, .category-title,
438+ .median-sqltime, .median-sqlstatements {
439+ border-right: 1px dashed #000000;
440+ }
441 </style>
442 </head>
443 <body>
444- <h1>Launchpad Page Performance Report</h1>
445+ <h1>Page Performance Report</h1>
446 <h3>%(date)s</h3>
447 ''' % {'date': time.ctime()})
448

Subscribers

People subscribed via source and target branches