Merge lp:~stefanor/ibid/url-397900 into lp:~ibid-core/ibid/old-trunk-pack-0.92

Proposed by Stefano Rivera
Status: Merged
Approved by: Jonathan Hitchcock
Approved revision: 719
Merged at revision: 714
Proposed branch: lp:~stefanor/ibid/url-397900
Merge into: lp:~ibid-core/ibid/old-trunk-pack-0.92
To merge this branch: bzr merge lp:~stefanor/ibid/url-397900
Reviewer            Review Type  Date Requested  Status
Jonathan Hitchcock                               Approve
Michael Gorven                                   Approve
Review via email: mp+8696@code.launchpad.net

This proposal supersedes a proposal from 2009-07-13.

Stefano Rivera (stefanor) wrote : Posted in a previous version of this proposal

Yes, I've tested it, but that doesn't mean you shouldn't test it too.

Michael Gorven (mgorven) wrote : Posted in a previous version of this proposal

 review approve

review: Approve
Stefano Rivera (stefanor) wrote : Posted in a previous version of this proposal

The test cases I used are in the bug report (thanks to mithrandi for the last one).

Jonathan Hitchcock (vhata) : Posted in a previous version of this proposal
review: Approve
Michael Gorven (mgorven) wrote : Posted in a previous version of this proposal

 review approve

review: Approve
Stefano Rivera (stefanor) wrote : Posted in a previous version of this proposal

Full TLD support in the grab regex
The test suite I used:

These match:
I: <email address hidden>
O:
I: google.com
O: google.com
I: x <email address hidden>
O:
I: http://foo.bar
O: http://foo.bar
I: <email address hidden>
O:
I: aoeuoeu <www.jar.com>
O: www.jar.com
I: so bar http://foo.bar/baz to jo
O: http://foo.bar/baz
I: 'http://bar.com'
O: http://bar.com
I: Thingie boo.com/a eue
O: boo.com/a
I: joe (www.google.com) says foo
O: www.google.com
I: http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/
O: http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/
I: aoeu http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/ aoeu
O: http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/
I: aoeu http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/. aoeu
O: http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/
I: http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/.
O: http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/
I: ouoe <http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/> aoeuao
O: http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/

These don't (O: is the output I'd like, not what the regex currently returns):
I: http://en.wikipedia.org/wiki/Python_(programming_language)
O: http://en.wikipedia.org/wiki/Python_(programming_language)
I: Python <http://en.wikipedia.org/wiki/Python_(programming_language)> is a lekker language
O: http://en.wikipedia.org/wiki/Python_(programming_language)
I: Python <URL:http://en.wikipedia.org/wiki/Python_(programming_language)> is a lekker language
O: http://en.wikipedia.org/wiki/Python_(programming_language)
I: This is an IDN TLD: مثال.إختبار/%D8%A7%D9%84%D8%B5%D9%81%D8%AD%D8%A9_%D8%A7%D9%84%D8%B1%D8%A6%D9%8A%D8%B3%D9%8A%D8%A9 ouoeu
O: مثال.إختبار/%D8%A7%D9%84%D8%B5%D9%81%D8%AD%D8%A9_%D8%A7%D9%84%D8%B1%D8%A6%D9%8A%D8%B3%D9%8A%D8%A9
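For anyone who wants to poke at these cases outside the bot, here is a rough standalone sketch of the grab pattern from this branch. It is written for Python 3, so re.ASCII is my assumption for approximating the byte-string behaviour of the original; the four-entry TLDS list and the grab() helper are stand-ins of mine, not part of the plugin.

```python
import re

# Assumption: a tiny stand-in for the full IANA list in data/tlds-alpha-by-domain.txt.
TLDS = ['com', 'org', 'net', 'za']

# Same pattern pieces as the branch; re.ASCII is added here (an assumption)
# so that \w, \s and \b behave like a non-unicode Python 2 regex.
GRAB = re.compile((
    r'(?:[^@]\b|\A)('                 # boundary, but not right after an '@'
    r'(?:\w+://|(?:www|ftp)\.)\S+?'   # explicit scheme, or guess by www./ftp.
    r'|[^@\s:]+\.(?:%s)(?:/\S*?)?'    # guess at the URL based on the TLD
    r')[\[>)\]"\'.]*(?:\s|\Z)'        # trailing punctuation, then end boundary
) % '|'.join(TLDS), re.I | re.DOTALL | re.ASCII)


def grab(text):
    """Hypothetical helper: return the first grabbed URL, or None."""
    m = GRAB.search(text)
    return m.group(1) if m else None


CASES = [
    ('google.com', 'google.com'),
    ('http://foo.bar', 'http://foo.bar'),
    ('so bar http://foo.bar/baz to jo', 'http://foo.bar/baz'),
    ("'http://bar.com'", 'http://bar.com'),
    ('joe (www.google.com) says foo', 'www.google.com'),
    # E-mail addresses must not be grabbed at all.
    ('joe@bar.com', None),
    ('x joe@google.com', None),
    # Known weak spot: the closing parenthesis is eaten as trailing punctuation.
    ('http://en.example.org/wiki/Python_(programming_language)',
     'http://en.example.org/wiki/Python_(programming_language'),
]

for text, expected in CASES:
    assert grab(text) == expected, (text, grab(text))
```

The last case shows why the parenthesised URLs are listed separately: ")" is in the trailing-punctuation class, so the lazy \S+? stops one character early.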

I specifically don't convert the xn--.* TLDs to unicode form, because then we would have to use unicode mode for the regex, which messes with \s and \w.
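To illustrate the \s and \w concern, here is a small Python 3 sketch. In Python 3, str patterns are in unicode mode by default and re.ASCII switches it off, which is the reverse of the Python 2 default this branch was written against. The sample strings are my own picks; the only name taken from this branch is the XN--ZCKZAH entry in the TLD list.

```python
import re

word = 'caf\u00e9'  # 'café'

# ASCII mode: \w stops at the first non-ASCII letter.
ascii_hit = re.match(r'\w+', word, re.ASCII).group(0)
# Unicode mode (the Python 3 default for str patterns): \w takes the whole word.
unicode_hit = re.match(r'\w+', word).group(0)

nbsp = '\u00a0'  # no-break space
ascii_space = re.match(r'\s', nbsp, re.ASCII)   # not whitespace in ASCII mode
unicode_space = re.match(r'\s', nbsp)           # whitespace in unicode mode

# Converting a punycode TLD to its unicode form uses the stdlib 'idna' codec.
idn_tld = b'xn--zckzah'.decode('idna')

# The unicode form can only be matched by \w when the regex is in unicode mode,
# so keeping the xn-- form lets the whole pattern stay in ASCII mode.
assert re.fullmatch(r'\w+', idn_tld, re.ASCII) is None
assert re.fullmatch(r'\w+', idn_tld) is not None
```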

Michael Gorven (mgorven) wrote : Posted in a previous version of this proposal

+ event.addresponse(u'Matched %s', url)
+ return

Er, is that supposed to be there? Looks fine otherwise.

Stefano Rivera (stefanor) wrote :

This branch has become rather a cesspool of fixes. Enjoy.

Michael Gorven (mgorven) wrote :

Yay for tests!
 review approve

review: Approve
lp:~stefanor/ibid/url-397900 updated
719. By Stefano Rivera

Remove extraneous imports

Jonathan Hitchcock (vhata) wrote :

Rocking!

Now we need tests for all the other plugins :)

review: Approve

Preview Diff

1=== removed file 'NOTES'
2--- NOTES 2009-02-18 10:41:40 +0000
3+++ NOTES 1970-01-01 00:00:00 +0000
4@@ -1,22 +0,0 @@
5-general
6-=======
7-- 'object' and 'type' are probably keywords that shouldn't be reused, btw
8-- proper exception handling needed all round (and maybe some more error checks?)
9-- could a Processor dynamically alter priority? of itself, or another class? if so, would it ever want to? "learning"?
10-- the patch on ConfigObj only does one level of list interpolation. Does ConfigObj support multi-level nested lists?
11-
12-plugins
13-=======
14-- process() is still a bit biased towards type=message?
15-- how about having a lambda decorator like @match(pattern) to allow other types of matching?
16-- reload config is a good example of somewhere that error reporting should occur
17-- load/reload is a good example of the reloader not being generic enough... I think?
18-- is ignore strong enough? it just sets processed to true - postprocessors won't ignore this. is that bad?
19-- applause to Michael for the Responses processor: that's clever
20- -- that said, not at all convinced by Address processor
21- -- it takes a basestring only, not a dict as well
22- -- do we even want what it does? cf. the Announce plugin - I think Processors should decide for themselves whether they want to address their replies or not. (But what about private replies - adding an address in to that is crap)
23-- identity plugin: 'identify' is different in different scopes and it's a little confusing - not cool to a first-time reader
24-- twisted helper functions (with that whole blocking-wrapper thing) may as well be written
25-- sources processor: needs a bit of error checking, authentication and... all sorts.
26- -- (no, I know, it was there for debugging, just a note for Future Us)
27
28=== renamed file 'dbus-ping.py' => 'attic/dbus-ping.py'
29=== added directory 'data'
30=== added file 'data/README'
31--- data/README 1970-01-01 00:00:00 +0000
32+++ data/README 2009-07-13 14:23:03 +0000
33@@ -0,0 +1,5 @@
34+These are data files used by plugins, that don't change very much.
35+Thus outdated versions shouldn't be a major issue.
36+
37+Sources:
38+http://data.iana.org/TLD/tlds-alpha-by-domain.txt
39
40=== added file 'data/tlds-alpha-by-domain.txt'
41--- data/tlds-alpha-by-domain.txt 1970-01-01 00:00:00 +0000
42+++ data/tlds-alpha-by-domain.txt 2009-07-13 14:23:03 +0000
43@@ -0,0 +1,281 @@
44+# Version 2009071300, Last Updated Mon Jul 13 07:07:02 2009 UTC
45+AC
46+AD
47+AE
48+AERO
49+AF
50+AG
51+AI
52+AL
53+AM
54+AN
55+AO
56+AQ
57+AR
58+ARPA
59+AS
60+ASIA
61+AT
62+AU
63+AW
64+AX
65+AZ
66+BA
67+BB
68+BD
69+BE
70+BF
71+BG
72+BH
73+BI
74+BIZ
75+BJ
76+BM
77+BN
78+BO
79+BR
80+BS
81+BT
82+BV
83+BW
84+BY
85+BZ
86+CA
87+CAT
88+CC
89+CD
90+CF
91+CG
92+CH
93+CI
94+CK
95+CL
96+CM
97+CN
98+CO
99+COM
100+COOP
101+CR
102+CU
103+CV
104+CX
105+CY
106+CZ
107+DE
108+DJ
109+DK
110+DM
111+DO
112+DZ
113+EC
114+EDU
115+EE
116+EG
117+ER
118+ES
119+ET
120+EU
121+FI
122+FJ
123+FK
124+FM
125+FO
126+FR
127+GA
128+GB
129+GD
130+GE
131+GF
132+GG
133+GH
134+GI
135+GL
136+GM
137+GN
138+GOV
139+GP
140+GQ
141+GR
142+GS
143+GT
144+GU
145+GW
146+GY
147+HK
148+HM
149+HN
150+HR
151+HT
152+HU
153+ID
154+IE
155+IL
156+IM
157+IN
158+INFO
159+INT
160+IO
161+IQ
162+IR
163+IS
164+IT
165+JE
166+JM
167+JO
168+JOBS
169+JP
170+KE
171+KG
172+KH
173+KI
174+KM
175+KN
176+KP
177+KR
178+KW
179+KY
180+KZ
181+LA
182+LB
183+LC
184+LI
185+LK
186+LR
187+LS
188+LT
189+LU
190+LV
191+LY
192+MA
193+MC
194+MD
195+ME
196+MG
197+MH
198+MIL
199+MK
200+ML
201+MM
202+MN
203+MO
204+MOBI
205+MP
206+MQ
207+MR
208+MS
209+MT
210+MU
211+MUSEUM
212+MV
213+MW
214+MX
215+MY
216+MZ
217+NA
218+NAME
219+NC
220+NE
221+NET
222+NF
223+NG
224+NI
225+NL
226+NO
227+NP
228+NR
229+NU
230+NZ
231+OM
232+ORG
233+PA
234+PE
235+PF
236+PG
237+PH
238+PK
239+PL
240+PM
241+PN
242+PR
243+PRO
244+PS
245+PT
246+PW
247+PY
248+QA
249+RE
250+RO
251+RS
252+RU
253+RW
254+SA
255+SB
256+SC
257+SD
258+SE
259+SG
260+SH
261+SI
262+SJ
263+SK
264+SL
265+SM
266+SN
267+SO
268+SR
269+ST
270+SU
271+SV
272+SY
273+SZ
274+TC
275+TD
276+TEL
277+TF
278+TG
279+TH
280+TJ
281+TK
282+TL
283+TM
284+TN
285+TO
286+TP
287+TR
288+TRAVEL
289+TT
290+TV
291+TW
292+TZ
293+UA
294+UG
295+UK
296+US
297+UY
298+UZ
299+VA
300+VC
301+VE
302+VG
303+VI
304+VN
305+VU
306+WF
307+WS
308+XN--0ZWM56D
309+XN--11B5BS3A9AJ6G
310+XN--80AKHBYKNJ4F
311+XN--9T4B11YI5A
312+XN--DEBA0AD
313+XN--G6W251D
314+XN--HGBK6AJ7F53BBA
315+XN--HLCJ6AYA9ESC7A
316+XN--JXALPDLP
317+XN--KGBECHTV
318+XN--ZCKZAH
319+YE
320+YT
321+YU
322+ZA
323+ZM
324+ZW
325
326=== modified file 'ibid/plugins/http.py'
327--- ibid/plugins/http.py 2009-05-01 12:17:57 +0000
328+++ ibid/plugins/http.py 2009-07-13 15:19:39 +0000
329@@ -17,7 +17,7 @@
330
331 max_size = IntOption('max_size', 'Only request this many bytes', 500)
332
333- @match(r'^(get|head)\s+(.+)$')
334+ @match(r'^(get|head)\s+(\S+\.\S+)$')
335 def handler(self, event, action, url):
336 if not url.lower().startswith("http://") and not url.lower().startswith("https://"):
337 url = "http://" + url
338
339=== modified file 'ibid/plugins/url.py'
340--- ibid/plugins/url.py 2009-07-10 12:01:48 +0000
341+++ ibid/plugins/url.py 2009-07-13 15:17:24 +0000
342@@ -5,6 +5,7 @@
343 import logging
344 import re
345
346+from pkg_resources import resource_exists, resource_stream
347 from sqlalchemy import Column, Integer, Unicode, DateTime, UnicodeText, ForeignKey, Table
348
349 import ibid
350@@ -112,7 +113,26 @@
351 password = Option('delicious_password', 'delicious account password')
352 delicious = Delicious()
353
354- @match(r'((?:\S+://|(?:www|ftp)\.)\S+|\S+\.(?:com|org|net|za)\S*)')
355+ def setup(self):
356+ if resource_exists(__name__, '../../data/tlds-alpha-by-domain.txt'):
357+ tlds = [tld.strip().lower() for tld
358+ in resource_stream(__name__, '../../data/tlds-alpha-by-domain.txt')
359+ .readlines()
360+ if not tld.startswith('#')
361+ ]
362+
363+ else:
364+ log.warning(u"Couldn't open TLD list, falling back to minimal default")
365+ tlds = 'com.org.net.za'.split('.')
366+
367+ self.grab.im_func.pattern = re.compile((
368+ r'(?:[^@]\b|\A)(' # Match a boundary, but not on an e-mail address
369+ r'(?:\w+://|(?:www|ftp)\.)\S+?' # Match an explicit URL or guess by www.
370+ r'|[^@\s:]+\.(?:%s)(?:/\S*?)?' # Guess at the URL based on TLD
371+ r')[\[>)\]"\'.]*(?:\s|\Z)' # End boundary
372+ ) % '|'.join(tlds), re.I | re.DOTALL)
373+
374+ @handler
375 def grab(self, event, url):
376 if url.find('://') == -1:
377 if url.lower().startswith('ftp'):
378@@ -154,7 +174,7 @@
379 ))
380
381 def setup(self):
382- self.lengthen.im_func.pattern = re.compile(r'^((?:%s)\S+)$' % '|'.join([re.escape(service) for service in self.services]), re.I)
383+ self.lengthen.im_func.pattern = re.compile(r'^((?:%s)\S+)$' % '|'.join([re.escape(service) for service in self.services]), re.I|re.DOTALL)
384
385 @handler
386 def lengthen(self, event, url):
387
388=== modified file 'ibid/test/plugins/test_core.py'
389--- ibid/test/plugins/test_core.py 2009-03-05 16:33:12 +0000
390+++ ibid/test/plugins/test_core.py 2009-07-13 14:47:02 +0000
391@@ -22,15 +22,24 @@
392 def assert_addressed(self, event, addressed, message):
393 self.assert_(hasattr(event, 'addressed'))
394 self.assertEqual(event.addressed, addressed)
395- self.assertEqual(event.message.strip(), message)
396+ self.assertEqual(event.message['deaddressed'].strip(), message)
397+
398+ def create_event(self, message, event_type=u'message'):
399+ event = Event(u'fakesource', event_type)
400+ event.message = {
401+ 'raw': message,
402+ 'deaddressed': message,
403+ 'clean': message,
404+ 'stripped': message,
405+ }
406+ return event
407
408 def test_non_messages(self):
409 for event_type in [u'timer', u'rpc']:
410- event = Event(u'fakesource', event_type)
411- event.message = u'bot: foo'
412+ event = self.create_event(u'bot: foo', event_type)
413 self.processor.process(event)
414 self.assertFalse(hasattr(event, u'addressed'))
415- self.assertEqual(event.message, u'bot: foo')
416+ self.assertEqual(event.message['deaddressed'], u'bot: foo')
417
418 happy_prefixes = [
419 (u'bot', u': '),
420@@ -40,8 +49,7 @@
421
422 def test_happy_prefix_names(self):
423 for prefix in self.happy_prefixes:
424- event = Event(u'fakesource', u'message')
425- event.message = u'%s%sfoo' % prefix
426+ event = self.create_event(u'%s%sfoo' % prefix)
427 self.processor.process(event)
428 self.assert_addressed(event, prefix[0], u'foo')
429
430@@ -53,8 +61,7 @@
431
432 def test_sad_prefix_names(self):
433 for prefix in self.sad_prefixes:
434- event = Event(u'fakesource', u'message')
435- event.message = u'%s%sfoo' % prefix
436+ event = self.create_event(u'%s%sfoo' % prefix)
437 self.processor.process(event)
438 self.assert_addressed(event, False, u'%s%sfoo' % prefix)
439
440@@ -66,8 +73,7 @@
441
442 def test_happy_suffix_names(self):
443 for suffix in self.happy_suffixes:
444- event = Event(u'fakesource', u'message')
445- event.message = u'foo%s%s' % suffix
446+ event = self.create_event(u'foo%s%s' % suffix)
447 self.processor.process(event)
448 self.assert_addressed(event, suffix[1], u'foo')
449
450@@ -80,8 +86,7 @@
451
452 def test_sad_suffix_names(self):
453 for suffix in self.sad_suffixes:
454- event = Event(u'fakesource', u'message')
455- event.message = u'foo%s%s' % suffix
456+ event = self.create_event(u'foo%s%s' % suffix)
457 self.processor.process(event)
458 self.assert_addressed(event, False, u'foo%s%s' % suffix)
459
460
461=== added file 'ibid/test/plugins/test_url.py'
462--- ibid/test/plugins/test_url.py 1970-01-01 00:00:00 +0000
463+++ ibid/test/plugins/test_url.py 2009-07-13 15:17:19 +0000
464@@ -0,0 +1,57 @@
465+from twisted.trial import unittest
466+import ibid.test
467+
468+from ibid.event import Event
469+from ibid.plugins import url
470+
471+class TestURLGrabber(unittest.TestCase):
472+
473+ def setUp(self):
474+ self.grab = url.Grab(u'testplugin')
475+
476+ good_grabs = [
477+ (u'google.com', u'google.com'),
478+ (u'http://foo.bar', u'http://foo.bar'),
479+ (u'aoeuoeu <www.jar.com>', u'www.jar.com'),
480+ (u'aoeuoeu <www.jar.com> def', u'www.jar.com'),
481+ (u'<www.jar.com>', u'www.jar.com'),
482+ (u'so bar http://foo.bar/baz to jo', u'http://foo.bar/baz'),
483+ (u"'http://bar.com'", u'http://bar.com'),
484+ (u'Thingie boo.com/a eue', u'boo.com/a'),
485+ (u'joe (www.google.com) says foo', u'www.google.com'),
486+ (u'http://www.example.net/blog/2008/11/09/debugging-python-regular-expressions/',
487+ u'http://www.example.net/blog/2008/11/09/debugging-python-regular-expressions/'),
488+ (u'aoeu http://www.example.net/blog/2008/11/09/debugging-python-regular-expressions/ aoeu',
489+ u'http://www.example.net/blog/2008/11/09/debugging-python-regular-expressions/'),
490+ (u'aoeu http://www.example.net/blog/2008/11/09/debugging-python-regular-expressions/. aoeu',
491+ u'http://www.example.net/blog/2008/11/09/debugging-python-regular-expressions/'),
492+ (u'http://www.example.net/blog/2008/11/09/debugging-python-regular-expressions/.',
493+ u'http://www.example.net/blog/2008/11/09/debugging-python-regular-expressions/'),
494+ (u'ouoe <http://www.example.net/blog/2008/11/09/debugging-python-regular-expressions/> aoeuao',
495+ u'http://www.example.net/blog/2008/11/09/debugging-python-regular-expressions/'),
496+ # We accept that the following are non-optimal
497+ (u'http://en.example.org/wiki/Python_(programming_language)',
498+ u'http://en.example.org/wiki/Python_(programming_language'),
499+ (u'Python <http://en.example.org/wiki/Python_(programming_language)> is a lekker language',
500+ u'http://en.example.org/wiki/Python_(programming_language'),
501+ (u'Python <URL:http://en.example.org/wiki/Python_(programming_language)> is a lekker language',
502+ u'http://en.example.org/wiki/Python_(programming_language'),
503+ ]
504+
505+ def test_good_grabs(self):
506+ for input, url in self.good_grabs:
507+ m = self.grab.grab.im_func.pattern.search(input)
508+ self.assertEqual(m.group(1), url)
509+
510+ bad_grabs = [
511+ u'joe@bar.com',
512+ u'x joe@google.com',
513+ u'<joe@bar.com>',
514+ ]
515+
516+ def test_bad_grabs(self):
517+ for input in self.bad_grabs:
518+ m = self.grab.grab.im_func.pattern.search(input)
519+ self.assertEqual(m, None)
520+
521+# vi: set et sta sw=4 ts=4:
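
The setup() hunk in ibid/plugins/url.py reads the IANA list through pkg_resources and falls back to a four-entry default when the file is missing. A minimal Python 3 sketch of the same idea, using plain file access instead, might look like the following. load_tlds() and FALLBACK_TLDS are hypothetical names of mine, and skipping blank lines is my addition rather than something the branch does.

```python
import logging
from pathlib import Path

log = logging.getLogger(__name__)

# Same minimal default as the branch falls back to.
FALLBACK_TLDS = ['com', 'org', 'net', 'za']


def load_tlds(path):
    """Hypothetical helper: read an IANA-style TLD list, one entry per line.

    Comment lines start with '#'. Blank lines are skipped as well (an
    addition in this sketch), since an empty entry would end up as an
    empty alternative once the list is joined with '|' into a regex.
    """
    try:
        lines = Path(path).read_text(encoding='ascii').splitlines()
    except OSError:
        log.warning("Couldn't open TLD list, falling back to minimal default")
        return list(FALLBACK_TLDS)
    return [line.strip().lower() for line in lines
            if line.strip() and not line.startswith('#')]
```

The returned list can then be joined with '|' and interpolated into the grab pattern, as setup() does.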
