Merge ~kotodama/container-log-archive-charm:anonymize_web_log-new_options into container-log-archive-charm:master

Proposed by Loïc Gomez
Status: Merged
Approved by: Tom Haddon
Approved revision: 488cf0e384c285e5268f847b8ff499bec14eb2cf
Merged at revision: 136c40098c0c573450b524a69dfb2305e16548c1
Proposed branch: ~kotodama/container-log-archive-charm:anonymize_web_log-new_options
Merge into: container-log-archive-charm:master
Diff against target: 135 lines (+55/-5)
2 files modified
files/anonymize_web_log.py (+53/-5)
files/preprocessor.py (+2/-0)
Reviewer               Review Type  Date Requested  Status
Tom Haddon                                          Approve
Loïc Gomez             +1                           Approve
Joel Sing (community)                               Needs Fixing
Review via email: mp+409516@code.launchpad.net

Commit message

anonymize_web_log.py: allow reading from stdin, add skip_keywords, strip_hashes options

🤖 Canonical IS Merge Bot (canonical-is-mergebot) wrote :

This merge proposal is being monitored by mergebot. Change the status to Approved to merge.

Loïc Gomez (kotodama) wrote :

This MP follows a request to access Apache logs with hashes and PII removed.

Examples of hashes to obfuscate:
login?id=824a1183f5a22c7c4069...
model_uuid=4372e956-eca1-49fd-bd42-4bee02f05b81

Examples of keywords that should cause the log line to be skipped:
/login, to ignore GET /identity/login-legacy?did=b
or:
/usso_macaroon/, to ignore GET /identity/login/usso_macaroon/login?...
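
For illustration, here is a minimal sketch of how the two requests above could be expressed as regular expressions. The hash pattern and the placeholder string are the ones that appear in the preview diff; the helper names `redact_hashes` and `build_keyword_rx` are hypothetical and only serve this example.

```python
import re

# Pattern as used in the preview diff: a run of 30 or more hex digits/dashes.
RX_HASHES = re.compile(r'[0-9a-fA-F-]{30,}')
PLACEHOLDER = "<<HASH/UUID_REDACTED>>"


def redact_hashes(text):
    """Replace every hash/UUID-looking run in text with the placeholder."""
    return RX_HASHES.sub(PLACEHOLDER, text)


def build_keyword_rx(keywords):
    """Compile a single alternation matching any of the literal keywords."""
    # re.escape keeps characters such as '?' or '.' from acting as regex syntax.
    return re.compile('|'.join(re.escape(kw) for kw in keywords))


rx_kw = build_keyword_rx(['/login', '/usso_macaroon/'])

# The UUID example is 36 characters long, so it is redacted.
print(redact_hashes('model_uuid=4372e956-eca1-49fd-bd42-4bee02f05b81'))
# → model_uuid=<<HASH/UUID_REDACTED>>

# A request containing one of the keywords would be skipped.
print(bool(rx_kw.search('GET /identity/login-legacy?did=b')))
# → True
```

Note that with a 30-character threshold, shorter hex strings are left untouched, which is the trade-off the comment in the diff describes.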

Tom Haddon (mthaddon) wrote :

As discussed IRL, it would be good to alphabetise the arguments and, if possible, keep the context manager for opening the log file, so that no cleanup is needed afterwards.
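
One way this suggestion could look, as a rough sketch only: a small context manager that yields either stdin or a regular file, so the caller keeps a single `with` block. The names `open_log` and `use_stdin` are hypothetical and are not taken from the branch; the merged diff instead passes file descriptor 0 to `open()`.

```python
import contextlib
import sys


@contextlib.contextmanager
def open_log(log_path=None, use_stdin=False):
    """Yield a binary file object for log_path, or stdin if use_stdin is set."""
    if use_stdin:
        # stdin belongs to the interpreter, so it is deliberately not closed.
        yield sys.stdin.buffer
    else:
        # The inner with-block closes the file once the caller's block ends.
        with open(log_path, 'rb') as log_file:
            yield log_file
```

A caller would then write `with open_log(args.log_path, args.stdin) as log_file:` and would not need any explicit cleanup in either case.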

Joel Sing (jsing) :
review: Needs Fixing

Joel Sing (jsing) :

Loïc Gomez (kotodama) wrote :

Thanks Joel, I'll push an update soon.

Tom Haddon (mthaddon) wrote :

A few comments/questions inline, and I see Joel is also reviewing this.

Loïc Gomez (kotodama) wrote :

Thanks, I'll alphabetize (and update preprocessor too actually) and change the dummy default value.

Tom Haddon (mthaddon) wrote :

This looks good to me. Will hold off on adding an approval since Joel is also looking at this MP, but fine to merge from my perspective if/when Joel approves.

🤖 Canonical IS Merge Bot (canonical-is-mergebot) wrote :

Change cannot be self approved, setting status to needs review.

Loïc Gomez (kotodama) wrote :

Approved by mthaddon in his previous comment.

review: Approve (+1)

🤖 Canonical IS Merge Bot (canonical-is-mergebot) wrote :

Change cannot be self approved, setting status to needs review.

Tom Haddon (mthaddon) wrote :

LGTM

review: Approve

🤖 Canonical IS Merge Bot (canonical-is-mergebot) wrote :

Change successfully merged at revision 136c40098c0c573450b524a69dfb2305e16548c1

Preview Diff

diff --git a/files/anonymize_web_log.py b/files/anonymize_web_log.py
index 1862847..e2a3712 100755
--- a/files/anonymize_web_log.py
+++ b/files/anonymize_web_log.py
@@ -8,12 +8,15 @@ import re
 import sys
 
 
-def anonymize(new_file, log_file, skip_ips=None, skip_private=False, strip_usernames=False, username_field=2):
+def anonymize(new_file, log_file, skip_ips=None, skip_keywords=None, skip_private=False, strip_hashes=False,
+              strip_usernames=False, username_field=2):
     """
     :param new_file: A file object that will be written out anonymized
     :param log_file: A file object to read from
     :param skip_ips: A netaddr.IPSet of ips to skip
+    :param skip_keywords: A list of keywords to trigger a line skip
     :param skip_private: A boolean, true to skip private ips
+    :param strip_hashes: A boolean, true to strip hashes
     :param strip_usernames: A boolean, true to strip usernames
     :param username_field: An integer specifying the field usernames are indexed by
     :return: (total lines processed, lines skipped)
@@ -21,6 +24,13 @@ def anonymize(new_file, log_file, skip_ips=None, skip_private=False, strip_usern
     ips = {}
     total = 0
     skip = 0
+
+    # Let's consider hashes/uuids are at least 30 char long
+    if strip_hashes:
+        rx_hashes = re.compile(r'[0-9a-fA-F-]{30,}')
+    if skip_keywords:
+        rx_kw = re.compile('({})'.format('|'.join([re.escape(kw) for kw in skip_keywords])))
+
     for line in log_file:
         total += 1
         splits = line.split()
@@ -46,6 +56,23 @@ def anonymize(new_file, log_file, skip_ips=None, skip_private=False, strip_usern
             del splits[username_field]
 
         anonymized_line = "{} {}\n".format(ips[str(ip)], b' '.join(splits[1:]).decode('utf-8'))
+
+        if strip_hashes or skip_keywords:
+            splits = anonymized_line.split('"')
+            if strip_hashes:
+                # anonymize request
+                splits[1] = rx_hashes.sub("<<HASH/UUID_REDACTED>>", splits[1])
+                # anonymize referer
+                splits[3] = rx_hashes.sub("<<HASH/UUID_REDACTED>>", splits[3])
+                # anonymize juju metadata
+                splits[7] = rx_hashes.sub("<<HASH/UUID_REDACTED>>", splits[7])
+                anonymized_line = '"'.join(splits)
+
+            if skip_keywords:
+                if rx_kw.search(splits[1]):
+                    skip += 1
+                    continue
+
         new_file.write(anonymized_line.encode('utf-8'))
 
     return total, skip
@@ -54,16 +81,25 @@ def anonymize(new_file, log_file, skip_ips=None, skip_private=False, strip_usern
 def parse_args(argv):
     # Parse arguments, build skip_set and paths as needed
     parser = argparse.ArgumentParser()
-    parser.add_argument('log_path', help='The path to the log to be anonymized')
-    parser.add_argument('--output', '-o', help='Output file, defaults to input file'
-                        'with .anonymized extension')
+    input_group = parser.add_mutually_exclusive_group(required=True)
+    input_group.add_argument('log_path', default=None, nargs='?',
+                             help='The path to the log to be anonymized')
+    input_group.add_argument('--stdin', default=False, action='store_true',
+                             help='Read from stdin instead of using log_path (requires: --output.')
+    parser.add_argument('--output', '-o',
+                        help='Output file, defaults to input file with .anonymized extension')
+
     parser.add_argument('--skip', '-s', action='append', help='A cidr notation network for which any'
                         'for which any source traffic will be removed'
                         'from the final output. Can be specified multiple'
                         'times')
+    parser.add_argument('--skip_keywords', '-k', action='append', default=None,
+                        help='If specified remove any request matching one of the keywords.')
     parser.add_argument('--skip_private', action='store_true', default=False,
                         help='If specified remove any private source ips from the final file')
 
+    parser.add_argument('--strip_hashes', action='store_true', default=False,
+                        help='Strip hash-looking strings from log lines.')
     parser.add_argument('--strip_usernames', action='store_true', default=False,
                         help='Strip usernames from log lines.')
     parser.add_argument('--username_field', type=int, default=2,
@@ -80,6 +116,11 @@ def parse_args(argv):
     else:
         args.skip_ips = None
 
+    if args.stdin and not args.output:
+        parser.print_help()
+        print('Error: --stdin requires --output.\n')
+        sys.exit(-1)
+
     return args
 
 
@@ -92,12 +133,19 @@ def main():
 
     # Process the lines of the file
     with open(anonymized_path, 'wb') as new_file:
-        with open(args.log_path, 'rb') as log_file:
+        if args.stdin:
+            log_path = 0
+        else:
+            log_path = args.log_path
+
+        with open(log_path, 'rb') as log_file:
             anonymize_args = (
                 new_file,
                 log_file,
                 args.skip_ips,
+                args.skip_keywords,
                 args.skip_private,
+                args.strip_hashes,
                 args.strip_usernames,
                 args.username_field
             )
diff --git a/files/preprocessor.py b/files/preprocessor.py
index 21dd92f..7048e31 100644
--- a/files/preprocessor.py
+++ b/files/preprocessor.py
@@ -67,7 +67,9 @@ class AnonymizeWebLog(PreProcessor):
             anonymize_web_log.anonymize(new_file,
                                         log_file,
                                         parsed_args.skip_ips,
+                                        parsed_args.skip_keywords,
                                         parsed_args.skip_private,
+                                        parsed_args.strip_hashes,
                                         parsed_args.strip_usernames,
                                         parsed_args.username_field)
             new_file.seek(0)
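
As a side note on the `--stdin requires --output` check in the diff above, the sketch below shows one possible alternative using argparse's standard `parser.error()`, which prints the usage line plus the message to stderr and exits with status 2. It is a trimmed-down illustration under the assumption that only the input and output arguments matter; it is not the code that was merged.

```python
import argparse


def parse_args(argv):
    """Parse only the input/output arguments relevant to the --stdin check."""
    parser = argparse.ArgumentParser()
    # Exactly one of log_path or --stdin must be given, as in the diff.
    input_group = parser.add_mutually_exclusive_group(required=True)
    input_group.add_argument('log_path', default=None, nargs='?',
                             help='The path to the log to be anonymized')
    input_group.add_argument('--stdin', default=False, action='store_true',
                             help='Read from stdin instead of using log_path')
    parser.add_argument('--output', '-o', help='Output file')

    args = parser.parse_args(argv)
    if args.stdin and not args.output:
        # No input path exists to derive a default output name from.
        parser.error('--stdin requires --output')
    return args
```

With this variant, `parse_args(['--stdin', '-o', 'out.log'])` succeeds, while `parse_args(['--stdin'])` exits with the error message.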
