Merge lp:~milo/linaro-patchmetrics/parsemail-fixes into lp:linaro-patchmetrics
- parsemail-fixes
- Merge into trunk
Status: | Merged |
---|---|
Merged at revision: | 401 |
Proposed branch: | lp:~milo/linaro-patchmetrics/parsemail-fixes |
Merge into: | lp:linaro-patchmetrics |
Diff against target: |
236 lines (+91/-38) 2 files modified
apps/patchwork/bin/parsemail.py (+64/-37) apps/patchwork/utils.py (+27/-1) |
To merge this branch: | bzr merge lp:~milo/linaro-patchmetrics/parsemail-fixes |
Related bugs: |
Reviewer | Review Type | Date Requested | Status |
---|---|---|---|
Paul Sokolovsky | Needs Fixing | ||
Review via email: mp+197241@code.launchpad.net |
Commit message
Description of the change
This is a fix for the script that parses the emails searching for patches. With the new changes, if the submitter of a patch nor the author of a patch are using Linaro email address, we do not consider the patch and we move on.
Another change made, is to when the saving action on a new Person (in this case a new author) happens. Previously, if a new author was found, it was saved immediately, now we save it only after making sure it is a valid Linaro engineer. In this way we do the exact same thing as it was done with the Submitter person, and we do not save authors that we do not want to track.
Milo Casagrande (milo) wrote : | # |
Thanks Paul for looking into it!
On Wed, Dec 4, 2013 at 2:25 PM, Paul Sokolovsky
<email address hidden> wrote:
> Review: Needs Fixing
>
> Logic looks good. And I usually don't go at such details with stylistic changes, but for few quick looks at linaro-patchmetrics codebase, I felt confused and saw few less-than-ideal practices and hardcodings. So, let's try to do better.
>
> 1. Can "new_author" var be named "is_new_auhor"?
>
> 2.
>
> 8 - return None, None
> 9 + return None, None, None
>
> >From the rest of patch it's possible to figure out what 3rd return value is, and I'm not sure if that func had a docstring, but maybe it might?
I added info in the function doc string.
>
> 57 + # If submitter nor author of a patch are not Linaro, we move on.
> 58 + # TODO: may be worth to integrate with crowd.
> 59 + if "@linaro.org" not in submitter.email:
> 60 + if author is not None and "@linaro.org" not in author.email:
> 61 + return 0
> 62 + return 0
>
> There're 2 issues here: one is that hardcoded "@linaro.org". We can do better ;-) (yeah, I know it takes much more patch lines to use config settings, and sometimes with my changes I wonder if I should bother either; friendly answer looking from outside - yes ;-) ). And we should anticipate need to support multiple patterns - either as array, or better just as a regexp.
I know about the hardcoded @linaro.org values (there are a few more in
Linaro patchwork code), but I'm not going to fix them here. The code
cannot go in production like that, I'm aware of it, that's why I would
like the other MP in first, to be able to address this issue as well.
The other MP, that is bigger, adds some integration with Crowd that we
can use here as well, that's why there is that TODO.
I do not like the regexp approach, at least in this case: we can do
even better than that, and it is taking data out of Crowd.
The problem cannot be solved by simply querying Crowd though:
switching to a subscription model for patches.l.o (we subscribe to the
mailing list, they do not send the patches directly anymore), it means
that we can receive even more than 500 mails per day, that will lead
to 500 Crowd requests. When I tested the old patch, IT were already
looking at who was hitting the server (and it was with far less
requests). We need to locally cache those requests: the other MP tries
to solve that.
> Another is that there's superfluous/
> "if". If code should look like that, there should be a comment (like, "one sweet day, we may want to handle these conditions separately"). Otherwise...
There was one "return 0" too much, fixed.
--
Milo Casagrande | Automation Engineer
Linaro.org <www.linaro.org> │ Open source software for ARM SoCs
- 401. By Milo Casagrande
-
Fixed doc string, refactored variable, fixed if-logic.
- 402. By Milo Casagrande
-
Fixed if clause.
- 403. By Milo Casagrande
-
Merged from trunk.
- 404. By Milo Casagrande
-
Added crowd integration for email checking.
- 405. By Milo Casagrande
-
Fixed if clause.
- 406. By Milo Casagrande
-
Refactored code.
* Moved one function from the parsemail script into the utils.py file.
Had to import one function from utils.py in parsemail.py, but apparently
the import was breaking something (probably circular import).
Milo Casagrande (milo) wrote : | # |
Added the Crowd integration also on the check for valid patch submitter or authors.
Had to refactor some code: I needed to import a function form utils.py, but utils.py was importing a function from parsemail.py, resulting in an import error when the script is executed.
Preview Diff
1 | === modified file 'apps/patchwork/bin/parsemail.py' | |||
2 | --- apps/patchwork/bin/parsemail.py 2013-11-29 16:20:37 +0000 | |||
3 | +++ apps/patchwork/bin/parsemail.py 2013-12-05 18:37:48 +0000 | |||
4 | @@ -20,6 +20,7 @@ | |||
5 | 20 | # Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA | 20 | # Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA |
6 | 21 | 21 | ||
7 | 22 | import _pythonpath | 22 | import _pythonpath |
8 | 23 | import atexit | ||
9 | 23 | import chardet | 24 | import chardet |
10 | 24 | import datetime | 25 | import datetime |
11 | 25 | import operator | 26 | import operator |
12 | @@ -37,8 +38,14 @@ | |||
13 | 37 | from email.Header import Header, decode_header | 38 | from email.Header import Header, decode_header |
14 | 38 | from email.Utils import parsedate_tz, mktime_tz | 39 | from email.Utils import parsedate_tz, mktime_tz |
15 | 39 | 40 | ||
16 | 41 | from patchmetrics.crowd import Crowd | ||
17 | 42 | from patchwork.db import PatchworkDB | ||
18 | 40 | from patchwork.parser import parse_patch | 43 | from patchwork.parser import parse_patch |
19 | 41 | from patchwork.models import Patch, Project, Person, Comment, State | 44 | from patchwork.models import Patch, Project, Person, Comment, State |
20 | 45 | from patchwork.utils import ( | ||
21 | 46 | is_linaro_user, | ||
22 | 47 | parse_name_and_email, | ||
23 | 48 | ) | ||
24 | 42 | from django.contrib.auth.models import User | 49 | from django.contrib.auth.models import User |
25 | 43 | 50 | ||
26 | 44 | default_patch_state = 'New' | 51 | default_patch_state = 'New' |
27 | @@ -118,33 +125,6 @@ | |||
28 | 118 | return project | 125 | return project |
29 | 119 | 126 | ||
30 | 120 | 127 | ||
31 | 121 | def parse_name_and_email(text): | ||
32 | 122 | # tuple of (regex, fn) | ||
33 | 123 | # - where fn returns a (name, email) tuple from the match groups resulting | ||
34 | 124 | # from re.match().groups() | ||
35 | 125 | from_res = [ | ||
36 | 126 | # for "Firstname Lastname" <example@example.com> style addresses | ||
37 | 127 | (re.compile('"?(.*?)"?\s*<([^>]+)>'), (lambda g: (g[0], g[1]))), | ||
38 | 128 | |||
39 | 129 | # for example@example.com (Firstname Lastname) style addresses | ||
40 | 130 | (re.compile('"?(.*?)"?\s*\(([^\)]+)\)'), (lambda g: (g[1], g[0]))), | ||
41 | 131 | |||
42 | 132 | # everything else | ||
43 | 133 | (re.compile('(.*)'), (lambda g: (None, g[0]))), | ||
44 | 134 | ] | ||
45 | 135 | |||
46 | 136 | for regex, fn in from_res: | ||
47 | 137 | match = regex.match(text) | ||
48 | 138 | if match: | ||
49 | 139 | name, email = fn(match.groups()) | ||
50 | 140 | if name is not None: | ||
51 | 141 | name = name.strip() | ||
52 | 142 | if email is not None: | ||
53 | 143 | email = email.strip() | ||
54 | 144 | return name, email | ||
55 | 145 | return None, None | ||
56 | 146 | |||
57 | 147 | |||
58 | 148 | def find_submitter(mail): | 128 | def find_submitter(mail): |
59 | 149 | 129 | ||
60 | 150 | from_header = clean_header(mail.get('From')) | 130 | from_header = clean_header(mail.get('From')) |
61 | @@ -282,18 +262,22 @@ | |||
62 | 282 | 262 | ||
63 | 283 | 263 | ||
64 | 284 | def find_patch_name_and_author(comment): | 264 | def find_patch_name_and_author(comment): |
66 | 285 | """Return the patch name and author as specified in the given comment. | 265 | """Finds the patch name, and the patch author. |
67 | 286 | 266 | ||
68 | 287 | The name and author of a patch are sometimes different from the Subject | 267 | The name and author of a patch are sometimes different from the Subject |
69 | 288 | and From header fields and so are specified in the body of the message, | 268 | and From header fields and so are specified in the body of the message, |
70 | 289 | before the actual patch. This is identified by "From: " and "Subject: " | 269 | before the actual patch. This is identified by "From: " and "Subject: " |
71 | 290 | lines starting the body, which is the format understood by 'git-am'. | 270 | lines starting the body, which is the format understood by 'git-am'. |
72 | 271 | |||
73 | 272 | :return A tuple consisting of: the name of the patch, the author of the | ||
74 | 273 | patch, and a boolean indicating if the author needs to be saved in the | ||
75 | 274 | database. | ||
76 | 291 | """ | 275 | """ |
77 | 292 | if not comment: | 276 | if not comment: |
79 | 293 | return None, None | 277 | return None, None, None |
80 | 294 | lines = comment.content.split('\n') | 278 | lines = comment.content.split('\n') |
81 | 295 | if len(lines) == 0: | 279 | if len(lines) == 0: |
83 | 296 | return None, None | 280 | return None, None, None |
84 | 297 | # Append an extra line at the end so that our code can assume there will | 281 | # Append an extra line at the end so that our code can assume there will |
85 | 298 | # always be at least two lines, simplifying things significantly. | 282 | # always be at least two lines, simplifying things significantly. |
86 | 299 | lines.append('') | 283 | lines.append('') |
87 | @@ -308,25 +292,26 @@ | |||
88 | 308 | if second_line.startswith('From:'): | 292 | if second_line.startswith('From:'): |
89 | 309 | author_line = second_line | 293 | author_line = second_line |
90 | 310 | else: | 294 | else: |
92 | 311 | return None, None | 295 | return None, None, None |
93 | 312 | 296 | ||
94 | 313 | subject = None | 297 | subject = None |
95 | 314 | if subject_line is not None: | 298 | if subject_line is not None: |
96 | 315 | subject = subject_line.replace('Subject:', '').strip() | 299 | subject = subject_line.replace('Subject:', '').strip() |
97 | 316 | 300 | ||
98 | 317 | if author_line is None: | 301 | if author_line is None: |
100 | 318 | return subject, None | 302 | return subject, None, None |
101 | 319 | 303 | ||
102 | 320 | name, email = parse_name_and_email(author_line.replace('From:', '')) | 304 | name, email = parse_name_and_email(author_line.replace('From:', '')) |
103 | 321 | if email is None: | 305 | if email is None: |
105 | 322 | return subject, None | 306 | return subject, None, None |
106 | 323 | 307 | ||
107 | 324 | try: | 308 | try: |
108 | 325 | person = Person.objects.get(email__iexact=email) | 309 | person = Person.objects.get(email__iexact=email) |
109 | 310 | new_person = False | ||
110 | 326 | except Person.DoesNotExist: | 311 | except Person.DoesNotExist: |
111 | 327 | person = Person(name=name, email=email) | 312 | person = Person(name=name, email=email) |
114 | 328 | person.save() | 313 | new_person = True |
115 | 329 | return subject, person | 314 | return subject, person, new_person |
116 | 330 | 315 | ||
117 | 331 | 316 | ||
118 | 332 | def find_patch_for_comment(project, mail): | 317 | def find_patch_for_comment(project, mail): |
119 | @@ -493,6 +478,39 @@ | |||
120 | 493 | return None | 478 | return None |
121 | 494 | 479 | ||
122 | 495 | 480 | ||
123 | 481 | def is_valid_patch(submitter, author): | ||
124 | 482 | """A patch is valid if submitter or author are valid Linaro users. | ||
125 | 483 | |||
126 | 484 | :param submitter: The submitter to check, a Person object. | ||
127 | 485 | :param author: The author to check, a Person object. | ||
128 | 486 | :return True if one between submitter and author is a valid Linaro user. | ||
129 | 487 | """ | ||
130 | 488 | valid_patch = False | ||
131 | 489 | |||
132 | 490 | if (settings.AUTH_CROWD_APPLICATION_USER is None and | ||
133 | 491 | settings.CROWD_USERS_DB_FILE is None): | ||
134 | 492 | # If none of the above settings is defined, fallback to an hardoceded | ||
135 | 493 | # check and accept only linaro.org email addresses. | ||
136 | 494 | if "@linaro.org" in submitter.email: | ||
137 | 495 | valid_patch |= True | ||
138 | 496 | |||
139 | 497 | if author is not None and "@linaro.org" in author.email: | ||
140 | 498 | valid_patch |= True | ||
141 | 499 | else: | ||
142 | 500 | crwd = Crowd(settings.AUTH_CROWD_APPLICATION_USER, | ||
143 | 501 | settings.AUTH_CROWD_APPLICATION_PASSWORD, | ||
144 | 502 | settings.AUTH_CROWD_SERVER_REST_URI) | ||
145 | 503 | |||
146 | 504 | cache_db = PatchworkDB(settings.CROWD_USERS_DB_FILE) | ||
147 | 505 | atexit.register(cache_db.close) | ||
148 | 506 | |||
149 | 507 | valid_patch |= is_linaro_user(submitter.email, cache_db, crwd) | ||
150 | 508 | if author is not None: | ||
151 | 509 | valid_patch |= is_linaro_user(author.email, cache_db, crwd) | ||
152 | 510 | |||
153 | 511 | return valid_patch | ||
154 | 512 | |||
155 | 513 | |||
156 | 496 | def parse_mail(mail): | 514 | def parse_mail(mail): |
157 | 497 | 515 | ||
158 | 498 | # some basic sanity checks | 516 | # some basic sanity checks |
159 | @@ -517,8 +535,14 @@ | |||
160 | 517 | msgid = mail.get('Message-Id').strip() | 535 | msgid = mail.get('Message-Id').strip() |
161 | 518 | 536 | ||
162 | 519 | (submitter, save_required) = find_submitter(mail) | 537 | (submitter, save_required) = find_submitter(mail) |
163 | 520 | |||
164 | 521 | (patch, comment) = find_content(project, mail) | 538 | (patch, comment) = find_content(project, mail) |
165 | 539 | (name, author, is_new_author) = find_patch_name_and_author(comment) | ||
166 | 540 | |||
167 | 541 | # If submitter nor author of a patch are not Linaro, we move on. | ||
168 | 542 | valid_patch = is_valid_patch(submitter, author) | ||
169 | 543 | |||
170 | 544 | if not valid_patch: | ||
171 | 545 | return 0 | ||
172 | 522 | 546 | ||
173 | 523 | if patch: | 547 | if patch: |
174 | 524 | # we delay the saving until we know we have a patch. | 548 | # we delay the saving until we know we have a patch. |
175 | @@ -526,11 +550,14 @@ | |||
176 | 526 | submitter.save() | 550 | submitter.save() |
177 | 527 | save_required = False | 551 | save_required = False |
178 | 528 | patch.submitter = submitter | 552 | patch.submitter = submitter |
180 | 529 | name, author = find_patch_name_and_author(comment) | 553 | |
181 | 554 | if is_new_author: | ||
182 | 555 | author.save() | ||
183 | 530 | if author is not None: | 556 | if author is not None: |
184 | 531 | patch.author = author | 557 | patch.author = author |
185 | 532 | if name is not None: | 558 | if name is not None: |
186 | 533 | patch.name = name | 559 | patch.name = name |
187 | 560 | |||
188 | 534 | patch.msgid = msgid | 561 | patch.msgid = msgid |
189 | 535 | patch.project = project | 562 | patch.project = project |
190 | 536 | patch.state = get_state(mail.get('X-Patchwork-State', '').strip()) | 563 | patch.state = get_state(mail.get('X-Patchwork-State', '').strip()) |
191 | 537 | 564 | ||
192 | === modified file 'apps/patchwork/utils.py' | |||
193 | --- apps/patchwork/utils.py 2013-10-16 13:25:16 +0000 | |||
194 | +++ apps/patchwork/utils.py 2013-12-05 18:37:48 +0000 | |||
195 | @@ -29,7 +29,6 @@ | |||
196 | 29 | from difflib import SequenceMatcher | 29 | from difflib import SequenceMatcher |
197 | 30 | from django.conf import settings | 30 | from django.conf import settings |
198 | 31 | from django.shortcuts import get_object_or_404 | 31 | from django.shortcuts import get_object_or_404 |
199 | 32 | from patchwork.bin.parsemail import parse_name_and_email | ||
200 | 33 | from patchwork.models import Bundle, BundlePatch, Patch, Person, State | 32 | from patchwork.models import Bundle, BundlePatch, Patch, Person, State |
201 | 34 | from patchwork.parser import hash_patch, parse_patch | 33 | from patchwork.parser import hash_patch, parse_patch |
202 | 35 | 34 | ||
203 | @@ -83,6 +82,33 @@ | |||
204 | 83 | return ids | 82 | return ids |
205 | 84 | 83 | ||
206 | 85 | 84 | ||
207 | 85 | def parse_name_and_email(text): | ||
208 | 86 | # tuple of (regex, fn) | ||
209 | 87 | # - where fn returns a (name, email) tuple from the match groups resulting | ||
210 | 88 | # from re.match().groups() | ||
211 | 89 | from_res = [ | ||
212 | 90 | # for "Firstname Lastname" <example@example.com> style addresses | ||
213 | 91 | (re.compile('"?(.*?)"?\s*<([^>]+)>'), (lambda g: (g[0], g[1]))), | ||
214 | 92 | |||
215 | 93 | # for example@example.com (Firstname Lastname) style addresses | ||
216 | 94 | (re.compile('"?(.*?)"?\s*\(([^\)]+)\)'), (lambda g: (g[1], g[0]))), | ||
217 | 95 | |||
218 | 96 | # everything else | ||
219 | 97 | (re.compile('(.*)'), (lambda g: (None, g[0]))), | ||
220 | 98 | ] | ||
221 | 99 | |||
222 | 100 | for regex, fn in from_res: | ||
223 | 101 | match = regex.match(text) | ||
224 | 102 | if match: | ||
225 | 103 | name, email = fn(match.groups()) | ||
226 | 104 | if name is not None: | ||
227 | 105 | name = name.strip() | ||
228 | 106 | if email is not None: | ||
229 | 107 | email = email.strip() | ||
230 | 108 | return name, email | ||
231 | 109 | return None, None | ||
232 | 110 | |||
233 | 111 | |||
234 | 86 | class Order(object): | 112 | class Order(object): |
235 | 87 | order_map = { | 113 | order_map = { |
236 | 88 | 'date': 'date', | 114 | 'date': 'date', |
Logic looks good. And I usually don't go at such details with stylistic changes, but for few quick looks at linaro-patchmetrics codebase, I felt confused and saw few less-than-ideal practices and hardcodings. So, let's try to do better.
1. Can "new_author" var be named "is_new_auhor"?
2.
8 - return None, None
9 + return None, None, None
From the rest of patch it's possible to figure out what 3rd return value is, and I'm not sure if that func had a docstring, but maybe it might?
3.
57 + # If submitter nor author of a patch are not Linaro, we move on.
58 + # TODO: may be worth to integrate with crowd.
59 + if "@linaro.org" not in submitter.email:
60 + if author is not None and "@linaro.org" not in author.email:
61 + return 0
62 + return 0
There're 2 issues here: one is that hardcoded "@linaro.org". We can do better ;-) (yeah, I know it takes much more patch lines to use config settings, and sometimes with my changes I wonder if I should bother either; friendly answer looking from outside - yes ;-) ). And we should anticipate need to support multiple patterns - either as array, or better just as a regexp.
Another is that there's superfluous/ erroneous logic - if first "if" fired, it's "return 0" regardless of second
"if". If code should look like that, there should be a comment (like, "one sweet day, we may want to handle these conditions separately"). Otherwise...