Merge lp:~stefanor/ibid/url-397900 into lp:~ibid-core/ibid/old-trunk-pack-0.92
Status: | Merged
---|---
Approved by: | Jonathan Hitchcock
Approved revision: | 719
Merged at revision: | 714
Proposed branch: | lp:~stefanor/ibid/url-397900
Merge into: | lp:~ibid-core/ibid/old-trunk-pack-0.92
Diff against target: | None lines
To merge this branch: | bzr merge lp:~stefanor/ibid/url-397900
Related bugs: |
Reviewer | Review Type | Date Requested | Status |
---|---|---|---|
Jonathan Hitchcock | Approve | ||
Michael Gorven | Approve | ||
Review via email: mp+8696@code.launchpad.net |
This proposal supersedes a proposal from 2009-07-13.
Commit message
Description of the change
Stefano Rivera (stefanor) wrote: Posted in a previous version of this proposal
Michael Gorven (mgorven) wrote: Posted in a previous version of this proposal
review approve
Stefano Rivera (stefanor) wrote: Posted in a previous version of this proposal
Test cases I used in the bug report (thanks mithrandi for the last one)
Jonathan Hitchcock (vhata): Posted in a previous version of this proposal
Michael Gorven (mgorven) wrote: Posted in a previous version of this proposal
review approve
Stefano Rivera (stefanor) wrote: Posted in a previous version of this proposal
Full TLD support in the grab regex
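As a rough illustration of what "full TLD support" involves, the sketch below parses text in the IANA `tlds-alpha-by-domain.txt` format and joins the entries into a regex alternation. It is written for Python 3 rather than the Python 2 of the branch, and `SAMPLE`, `parse_tlds` and `tld_alternation` are hypothetical names that do not appear in the plugin.

```python
import re

# A few lines in the same format as data/tlds-alpha-by-domain.txt (sample only).
SAMPLE = """\
# Version 2009071300, Last Updated Mon Jul 13 07:07:02 2009 UTC
AC
COM
XN--ZCKZAH
ZA
"""

def parse_tlds(text):
    """Return lowercased TLDs, skipping blank lines and '#' comment lines."""
    return [line.strip().lower() for line in text.splitlines()
            if line.strip() and not line.startswith('#')]

def tld_alternation(tlds):
    """Join TLDs into a regex alternation, escaping each entry defensively."""
    return '|'.join(re.escape(tld) for tld in tlds)

tlds = parse_tlds(SAMPLE)
alternation = tld_alternation(tlds)
print(tlds)
```

The branch itself interpolates the raw TLD strings into the pattern; escaping each entry here is an extra precaution and not something the diff does.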
The test suite I used:
These match:
I: <email address hidden>
O:
I: google.com
O: google.com
I: x <email address hidden>
O:
I: http://
O: http://
I: <email address hidden>
O:
I: aoeuoeu <www.jar.com>
O: www.jar.com
I: so bar http://
O: http://
I: 'http://
O: http://
I: Thingie boo.com/a eue
O: boo.com/a
I: joe (www.google.com) says foo
O: www.google.com
I: http://
O: http://
I: aoeu http://
O: http://
I: aoeu http://
O: http://
I: http://
O: http://
I: ouoe <http://
O: http://
These don't:
I: http://
O: http://
I: Python <http://
O: http://
I: Python <URL:http://
O: http://
I: This is an IDN TLD: ﻢﺛﺎﻟ.ﺈﺨﺘﺑﺍﺭ/
O: ﻢﺛﺎﻟ.ﺈﺨﺘﺑﺍﺭ/
I specifically don't convert the xn--.* TLDs to Unicode form, because then we would have to use Unicode mode for the regex, which messes with \s and \w.
Michael Gorven (mgorven) wrote: Posted in a previous version of this proposal
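A small sketch of the trade-off mentioned here, written for Python 3 as an assumption (the branch targets Python 2, where Unicode matching is opt-in via `re.UNICODE`; in Python 3 it is the default for `str` patterns and `re.ASCII` restores the narrower behaviour). The sample strings are made up for illustration.

```python
import re

text = 'caf\u00e9 bar'

# Unicode-aware matching: \w also covers the accented letter.
unicode_words = re.findall(r'\w+', text)
# ASCII-only matching: \w stops at the accented letter.
ascii_words = re.findall(r'\w+', text, re.ASCII)

# An internationalised name and its ASCII (xn--) form via the stdlib idna codec.
idn = 'm\u00fcnchen.example'
ascii_form = idn.encode('idna')

print(unicode_words, ascii_words, ascii_form)
```

Keeping the TLD list in its ASCII `xn--` form means the pattern can stay in ASCII mode, at the cost of not recognising names typed in their native script.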
+ event.addrespon
+ return
Er, is that supposed to be there? Looks fine otherwise.
Stefano Rivera (stefanor) wrote:
This branch has become rather a cesspool of fixes. Enjoy
Michael Gorven (mgorven) wrote:
Yay for tests!
review approve
- 719. By Stefano Rivera: Remove extraneous imports
Jonathan Hitchcock (vhata) wrote:
Rocking!
Now we need tests for all the other plugins :)
Preview Diff
1 | === removed file 'NOTES' |
2 | --- NOTES 2009-02-18 10:41:40 +0000 |
3 | +++ NOTES 1970-01-01 00:00:00 +0000 |
4 | @@ -1,22 +0,0 @@ |
5 | -general |
6 | -======= |
7 | -- 'object' and 'type' are probably keywords that shouldn't be reused, btw |
8 | -- proper exception handling needed all round (and maybe some more error checks?) |
9 | -- could a Processor dynamically alter priority? of itself, or another class? if so, would it ever want to? "learning"? |
10 | -- the patch on ConfigObj only does one level of list interpolation. Does ConfigObj support multi-level nested lists? |
11 | - |
12 | -plugins |
13 | -======= |
14 | -- process() is still a bit biased towards type=message? |
15 | -- how about having a lambda decorator like @match(pattern) to allow other types of matching? |
16 | -- reload config is a good example of somewhere that error reporting should occur |
17 | -- load/reload is a good example of the reloader not being generic enough... I think? |
18 | -- is ignore strong enough? it just sets processed to true - postprocessors won't ignore this. is that bad? |
19 | -- applause to Michael for the Responses processor: that's clever |
20 | - -- that said, not at all convinced by Address processor |
21 | - -- it takes a basestring only, not a dict as well |
22 | - -- do we even want what it does? cf. the Announce plugin - I think Processors should decide for themselves whether they want to address their replies or not. (But what about private replies - adding an address in to that is crap) |
23 | -- identity plugin: 'identify' is different in different scopes and it's a little confusing - not cool to a first-time reader |
24 | -- twisted helper functions (with that whole blocking-wrapper thing) may as well be written |
25 | -- sources processor: needs a bit of error checking, authentication and... all sorts. |
26 | - -- (no, I know, it was there for debugging, just a note for Future Us) |
27 | |
28 | === renamed file 'dbus-ping.py' => 'attic/dbus-ping.py' |
29 | === added directory 'data' |
30 | === added file 'data/README' |
31 | --- data/README 1970-01-01 00:00:00 +0000 |
32 | +++ data/README 2009-07-13 14:23:03 +0000 |
33 | @@ -0,0 +1,5 @@ |
34 | +These are data files used by plugins, that don't change very much. |
35 | +Thus outdated versions shouldn't be a major issue. |
36 | + |
37 | +Sources: |
38 | +http://data.iana.org/TLD/tlds-alpha-by-domain.txt |
39 | |
40 | === added file 'data/tlds-alpha-by-domain.txt' |
41 | --- data/tlds-alpha-by-domain.txt 1970-01-01 00:00:00 +0000 |
42 | +++ data/tlds-alpha-by-domain.txt 2009-07-13 14:23:03 +0000 |
43 | @@ -0,0 +1,281 @@ |
44 | +# Version 2009071300, Last Updated Mon Jul 13 07:07:02 2009 UTC |
45 | +AC |
46 | +AD |
47 | +AE |
48 | +AERO |
49 | +AF |
50 | +AG |
51 | +AI |
52 | +AL |
53 | +AM |
54 | +AN |
55 | +AO |
56 | +AQ |
57 | +AR |
58 | +ARPA |
59 | +AS |
60 | +ASIA |
61 | +AT |
62 | +AU |
63 | +AW |
64 | +AX |
65 | +AZ |
66 | +BA |
67 | +BB |
68 | +BD |
69 | +BE |
70 | +BF |
71 | +BG |
72 | +BH |
73 | +BI |
74 | +BIZ |
75 | +BJ |
76 | +BM |
77 | +BN |
78 | +BO |
79 | +BR |
80 | +BS |
81 | +BT |
82 | +BV |
83 | +BW |
84 | +BY |
85 | +BZ |
86 | +CA |
87 | +CAT |
88 | +CC |
89 | +CD |
90 | +CF |
91 | +CG |
92 | +CH |
93 | +CI |
94 | +CK |
95 | +CL |
96 | +CM |
97 | +CN |
98 | +CO |
99 | +COM |
100 | +COOP |
101 | +CR |
102 | +CU |
103 | +CV |
104 | +CX |
105 | +CY |
106 | +CZ |
107 | +DE |
108 | +DJ |
109 | +DK |
110 | +DM |
111 | +DO |
112 | +DZ |
113 | +EC |
114 | +EDU |
115 | +EE |
116 | +EG |
117 | +ER |
118 | +ES |
119 | +ET |
120 | +EU |
121 | +FI |
122 | +FJ |
123 | +FK |
124 | +FM |
125 | +FO |
126 | +FR |
127 | +GA |
128 | +GB |
129 | +GD |
130 | +GE |
131 | +GF |
132 | +GG |
133 | +GH |
134 | +GI |
135 | +GL |
136 | +GM |
137 | +GN |
138 | +GOV |
139 | +GP |
140 | +GQ |
141 | +GR |
142 | +GS |
143 | +GT |
144 | +GU |
145 | +GW |
146 | +GY |
147 | +HK |
148 | +HM |
149 | +HN |
150 | +HR |
151 | +HT |
152 | +HU |
153 | +ID |
154 | +IE |
155 | +IL |
156 | +IM |
157 | +IN |
158 | +INFO |
159 | +INT |
160 | +IO |
161 | +IQ |
162 | +IR |
163 | +IS |
164 | +IT |
165 | +JE |
166 | +JM |
167 | +JO |
168 | +JOBS |
169 | +JP |
170 | +KE |
171 | +KG |
172 | +KH |
173 | +KI |
174 | +KM |
175 | +KN |
176 | +KP |
177 | +KR |
178 | +KW |
179 | +KY |
180 | +KZ |
181 | +LA |
182 | +LB |
183 | +LC |
184 | +LI |
185 | +LK |
186 | +LR |
187 | +LS |
188 | +LT |
189 | +LU |
190 | +LV |
191 | +LY |
192 | +MA |
193 | +MC |
194 | +MD |
195 | +ME |
196 | +MG |
197 | +MH |
198 | +MIL |
199 | +MK |
200 | +ML |
201 | +MM |
202 | +MN |
203 | +MO |
204 | +MOBI |
205 | +MP |
206 | +MQ |
207 | +MR |
208 | +MS |
209 | +MT |
210 | +MU |
211 | +MUSEUM |
212 | +MV |
213 | +MW |
214 | +MX |
215 | +MY |
216 | +MZ |
217 | +NA |
218 | +NAME |
219 | +NC |
220 | +NE |
221 | +NET |
222 | +NF |
223 | +NG |
224 | +NI |
225 | +NL |
226 | +NO |
227 | +NP |
228 | +NR |
229 | +NU |
230 | +NZ |
231 | +OM |
232 | +ORG |
233 | +PA |
234 | +PE |
235 | +PF |
236 | +PG |
237 | +PH |
238 | +PK |
239 | +PL |
240 | +PM |
241 | +PN |
242 | +PR |
243 | +PRO |
244 | +PS |
245 | +PT |
246 | +PW |
247 | +PY |
248 | +QA |
249 | +RE |
250 | +RO |
251 | +RS |
252 | +RU |
253 | +RW |
254 | +SA |
255 | +SB |
256 | +SC |
257 | +SD |
258 | +SE |
259 | +SG |
260 | +SH |
261 | +SI |
262 | +SJ |
263 | +SK |
264 | +SL |
265 | +SM |
266 | +SN |
267 | +SO |
268 | +SR |
269 | +ST |
270 | +SU |
271 | +SV |
272 | +SY |
273 | +SZ |
274 | +TC |
275 | +TD |
276 | +TEL |
277 | +TF |
278 | +TG |
279 | +TH |
280 | +TJ |
281 | +TK |
282 | +TL |
283 | +TM |
284 | +TN |
285 | +TO |
286 | +TP |
287 | +TR |
288 | +TRAVEL |
289 | +TT |
290 | +TV |
291 | +TW |
292 | +TZ |
293 | +UA |
294 | +UG |
295 | +UK |
296 | +US |
297 | +UY |
298 | +UZ |
299 | +VA |
300 | +VC |
301 | +VE |
302 | +VG |
303 | +VI |
304 | +VN |
305 | +VU |
306 | +WF |
307 | +WS |
308 | +XN--0ZWM56D |
309 | +XN--11B5BS3A9AJ6G |
310 | +XN--80AKHBYKNJ4F |
311 | +XN--9T4B11YI5A |
312 | +XN--DEBA0AD |
313 | +XN--G6W251D |
314 | +XN--HGBK6AJ7F53BBA |
315 | +XN--HLCJ6AYA9ESC7A |
316 | +XN--JXALPDLP |
317 | +XN--KGBECHTV |
318 | +XN--ZCKZAH |
319 | +YE |
320 | +YT |
321 | +YU |
322 | +ZA |
323 | +ZM |
324 | +ZW |
325 | |
326 | === modified file 'ibid/plugins/http.py' |
327 | --- ibid/plugins/http.py 2009-05-01 12:17:57 +0000 |
328 | +++ ibid/plugins/http.py 2009-07-13 15:19:39 +0000 |
329 | @@ -17,7 +17,7 @@ |
330 | |
331 | max_size = IntOption('max_size', 'Only request this many bytes', 500) |
332 | |
333 | - @match(r'^(get|head)\s+(.+)$') |
334 | + @match(r'^(get|head)\s+(\S+\.\S+)$') |
335 | def handler(self, event, action, url): |
336 | if not url.lower().startswith("http://") and not url.lower().startswith("https://"): |
337 | url = "http://" + url |
338 | |
339 | === modified file 'ibid/plugins/url.py' |
340 | --- ibid/plugins/url.py 2009-07-10 12:01:48 +0000 |
341 | +++ ibid/plugins/url.py 2009-07-13 15:17:24 +0000 |
342 | @@ -5,6 +5,7 @@ |
343 | import logging |
344 | import re |
345 | |
346 | +from pkg_resources import resource_exists, resource_stream |
347 | from sqlalchemy import Column, Integer, Unicode, DateTime, UnicodeText, ForeignKey, Table |
348 | |
349 | import ibid |
350 | @@ -112,7 +113,26 @@ |
351 | password = Option('delicious_password', 'delicious account password') |
352 | delicious = Delicious() |
353 | |
354 | - @match(r'((?:\S+://|(?:www|ftp)\.)\S+|\S+\.(?:com|org|net|za)\S*)') |
355 | + def setup(self): |
356 | + if resource_exists(__name__, '../../data/tlds-alpha-by-domain.txt'): |
357 | + tlds = [tld.strip().lower() for tld |
358 | + in resource_stream(__name__, '../../data/tlds-alpha-by-domain.txt') |
359 | + .readlines() |
360 | + if not tld.startswith('#') |
361 | + ] |
362 | + |
363 | + else: |
364 | + log.warning(u"Couldn't open TLD list, falling back to minimal default") |
365 | + tlds = 'com.org.net.za'.split('.') |
366 | + |
367 | + self.grab.im_func.pattern = re.compile(( |
368 | + r'(?:[^@]\b|\A)(' # Match a boundry, but not on an e-mail address |
369 | + r'(?:\w+://|(?:www|ftp)\.)\S+?' # Match an explicit URL or guess by www. |
370 | + r'|[^@\s:]+\.(?:%s)(?:/\S*?)?' # Guess at the URL based on TLD |
371 | + r')[\[>)\]"\'.]*(?:\s|\Z)' # End Boundry |
372 | + ) % '|'.join(tlds), re.I | re.DOTALL) |
373 | + |
374 | + @handler |
375 | def grab(self, event, url): |
376 | if url.find('://') == -1: |
377 | if url.lower().startswith('ftp'): |
378 | @@ -154,7 +174,7 @@ |
379 | )) |
380 | |
381 | def setup(self): |
382 | - self.lengthen.im_func.pattern = re.compile(r'^((?:%s)\S+)$' % '|'.join([re.escape(service) for service in self.services]), re.I) |
383 | + self.lengthen.im_func.pattern = re.compile(r'^((?:%s)\S+)$' % '|'.join([re.escape(service) for service in self.services]), re.I|re.DOTALL) |
384 | |
385 | @handler |
386 | def lengthen(self, event, url): |
387 | |
388 | === modified file 'ibid/test/plugins/test_core.py' |
389 | --- ibid/test/plugins/test_core.py 2009-03-05 16:33:12 +0000 |
390 | +++ ibid/test/plugins/test_core.py 2009-07-13 14:47:02 +0000 |
391 | @@ -22,15 +22,24 @@ |
392 | def assert_addressed(self, event, addressed, message): |
393 | self.assert_(hasattr(event, 'addressed')) |
394 | self.assertEqual(event.addressed, addressed) |
395 | - self.assertEqual(event.message.strip(), message) |
396 | + self.assertEqual(event.message['deaddressed'].strip(), message) |
397 | + |
398 | + def create_event(self, message, event_type=u'message'): |
399 | + event = Event(u'fakesource', event_type) |
400 | + event.message = { |
401 | + 'raw': message, |
402 | + 'deaddressed': message, |
403 | + 'clean': message, |
404 | + 'stripped': message, |
405 | + } |
406 | + return event |
407 | |
408 | def test_non_messages(self): |
409 | for event_type in [u'timer', u'rpc']: |
410 | - event = Event(u'fakesource', event_type) |
411 | - event.message = u'bot: foo' |
412 | + event = self.create_event(u'bot: foo', event_type) |
413 | self.processor.process(event) |
414 | self.assertFalse(hasattr(event, u'addressed')) |
415 | - self.assertEqual(event.message, u'bot: foo') |
416 | + self.assertEqual(event.message['deaddressed'], u'bot: foo') |
417 | |
418 | happy_prefixes = [ |
419 | (u'bot', u': '), |
420 | @@ -40,8 +49,7 @@ |
421 | |
422 | def test_happy_prefix_names(self): |
423 | for prefix in self.happy_prefixes: |
424 | - event = Event(u'fakesource', u'message') |
425 | - event.message = u'%s%sfoo' % prefix |
426 | + event = self.create_event(u'%s%sfoo' % prefix) |
427 | self.processor.process(event) |
428 | self.assert_addressed(event, prefix[0], u'foo') |
429 | |
430 | @@ -53,8 +61,7 @@ |
431 | |
432 | def test_sad_prefix_names(self): |
433 | for prefix in self.sad_prefixes: |
434 | - event = Event(u'fakesource', u'message') |
435 | - event.message = u'%s%sfoo' % prefix |
436 | + event = self.create_event(u'%s%sfoo' % prefix) |
437 | self.processor.process(event) |
438 | self.assert_addressed(event, False, u'%s%sfoo' % prefix) |
439 | |
440 | @@ -66,8 +73,7 @@ |
441 | |
442 | def test_happy_suffix_names(self): |
443 | for suffix in self.happy_suffixes: |
444 | - event = Event(u'fakesource', u'message') |
445 | - event.message = u'foo%s%s' % suffix |
446 | + event = self.create_event(u'foo%s%s' % suffix) |
447 | self.processor.process(event) |
448 | self.assert_addressed(event, suffix[1], u'foo') |
449 | |
450 | @@ -80,8 +86,7 @@ |
451 | |
452 | def test_sad_suffix_names(self): |
453 | for suffix in self.sad_suffixes: |
454 | - event = Event(u'fakesource', u'message') |
455 | - event.message = u'foo%s%s' % suffix |
456 | + event = self.create_event(u'foo%s%s' % suffix) |
457 | self.processor.process(event) |
458 | self.assert_addressed(event, False, u'foo%s%s' % suffix) |
459 | |
460 | |
461 | === added file 'ibid/test/plugins/test_url.py' |
462 | --- ibid/test/plugins/test_url.py 1970-01-01 00:00:00 +0000 |
463 | +++ ibid/test/plugins/test_url.py 2009-07-13 15:17:19 +0000 |
464 | @@ -0,0 +1,57 @@ |
465 | +from twisted.trial import unittest |
466 | +import ibid.test |
467 | + |
468 | +from ibid.event import Event |
469 | +from ibid.plugins import url |
470 | + |
471 | +class TestURLGrabber(unittest.TestCase): |
472 | + |
473 | + def setUp(self): |
474 | + self.grab = url.Grab(u'testplugin') |
475 | + |
476 | + good_grabs = [ |
477 | + (u'google.com', u'google.com'), |
478 | + (u'http://foo.bar', u'http://foo.bar'), |
479 | + (u'aoeuoeu <www.jar.com>', u'www.jar.com'), |
480 | + (u'aoeuoeu <www.jar.com> def', u'www.jar.com'), |
481 | + (u'<www.jar.com>', u'www.jar.com'), |
482 | + (u'so bar http://foo.bar/baz to jo', u'http://foo.bar/baz'), |
483 | + (u"'http://bar.com'", u'http://bar.com'), |
484 | + (u'Thingie boo.com/a eue', u'boo.com/a'), |
485 | + (u'joe (www.google.com) says foo', u'www.google.com'), |
486 | + (u'http://www.example.net/blog/2008/11/09/debugging-python-regular-expressions/', |
487 | + u'http://www.example.net/blog/2008/11/09/debugging-python-regular-expressions/'), |
488 | + (u'aoeu http://www.example.net/blog/2008/11/09/debugging-python-regular-expressions/ aoeu', |
489 | + u'http://www.example.net/blog/2008/11/09/debugging-python-regular-expressions/'), |
490 | + (u'aoeu http://www.example.net/blog/2008/11/09/debugging-python-regular-expressions/. aoeu', |
491 | + u'http://www.example.net/blog/2008/11/09/debugging-python-regular-expressions/'), |
492 | + (u'http://www.example.net/blog/2008/11/09/debugging-python-regular-expressions/.', |
493 | + u'http://www.example.net/blog/2008/11/09/debugging-python-regular-expressions/'), |
494 | + (u'ouoe <http://www.example.net/blog/2008/11/09/debugging-python-regular-expressions/> aoeuao', |
495 | + u'http://www.example.net/blog/2008/11/09/debugging-python-regular-expressions/'), |
496 | + # We accept that the following are non-optimal |
497 | + (u'http://en.example.org/wiki/Python_(programming_language)', |
498 | + u'http://en.example.org/wiki/Python_(programming_language'), |
499 | + (u'Python <http://en.example.org/wiki/Python_(programming_language)> is a lekker language', |
500 | + u'http://en.example.org/wiki/Python_(programming_language'), |
501 | + (u'Python <URL:http://en.example.org/wiki/Python_(programming_language)> is a lekker language', |
502 | + u'http://en.example.org/wiki/Python_(programming_language'), |
503 | + ] |
504 | + |
505 | + def test_good_grabs(self): |
506 | + for input, url in self.good_grabs: |
507 | + m = self.grab.grab.im_func.pattern.search(input) |
508 | + self.assertEqual(m.group(1), url) |
509 | + |
510 | + bad_grabs = [ |
511 | + u'joe@bar.com', |
512 | + u'x joe@google.com', |
513 | + u'<joe@bar.com>', |
514 | + ] |
515 | + |
516 | + def test_bad_grabs(self): |
517 | + for input in self.bad_grabs: |
518 | + m = self.grab.grab.im_func.pattern.search(input) |
519 | + self.assertEqual(m, None) |
520 | + |
521 | +# vi: set et sta sw=4 ts=4: |
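The `setup()` hunk for `ibid/plugins/url.py` in the diff builds the grab pattern from the TLD list. The following is a minimal Python 3 sketch of that pattern, assuming the four-TLD fallback list and adding `re.ASCII` to approximate the non-Unicode matching of the original Python 2 code. `GRAB` and `grab_url` are illustrative names, not part of the plugin.

```python
import re

# Fallback list used by the plugin when the IANA file is unavailable.
FALLBACK_TLDS = 'com.org.net.za'.split('.')

GRAB = re.compile((
    r'(?:[^@]\b|\A)('                  # a boundary, but not right after an '@'
    r'(?:\w+://|(?:www|ftp)\.)\S+?'    # explicit scheme, or a www./ftp. prefix
    r'|[^@\s:]+\.(?:%s)(?:/\S*?)?'     # otherwise guess from a known TLD
    r')[\[>)\]"\'.]*(?:\s|\Z)'         # trailing punctuation, then space or end
) % '|'.join(FALLBACK_TLDS), re.I | re.DOTALL | re.ASCII)

def grab_url(text):
    """Return the first URL-like substring in text, or None."""
    m = GRAB.search(text)
    return m.group(1) if m else None

for sample in ['google.com', 'aoeuoeu <www.jar.com>', 'joe@bar.com']:
    print(repr(sample), '->', grab_url(sample))
```

The inputs come from `test_url.py` in the diff; the e-mail address is rejected because the TLD branch cannot cross the `@`, and no boundary immediately after it is accepted.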
Yes, I've tested it, but that doesn't mean that you mustn't