Merge lp:~stefanor/ibid/url-397900 into lp:~ibid-core/ibid/old-trunk-pack-0.92

Proposed by Stefano Rivera
Status: Merged
Approved by: Jonathan Hitchcock
Approved revision: 719
Merged at revision: 714
Proposed branch: lp:~stefanor/ibid/url-397900
Merge into: lp:~ibid-core/ibid/old-trunk-pack-0.92
Diff against target: None lines
To merge this branch: bzr merge lp:~stefanor/ibid/url-397900
Reviewer              Review Type    Date Requested    Status
Jonathan Hitchcock                                     Approve
Michael Gorven                                         Approve
Review via email: mp+8696@code.launchpad.net

This proposal supersedes a proposal from 2009-07-13.

Stefano Rivera (stefanor) wrote : Posted in a previous version of this proposal

Yes, I've tested it, but that doesn't mean that you mustn't

Michael Gorven (mgorven) wrote : Posted in a previous version of this proposal

 review approve

review: Approve
Stefano Rivera (stefanor) wrote : Posted in a previous version of this proposal

Test cases I used in the bug report (thanks mithrandi for the last one)

Jonathan Hitchcock (vhata) : Posted in a previous version of this proposal
review: Approve
Michael Gorven (mgorven) wrote : Posted in a previous version of this proposal

 review approve

review: Approve
Stefano Rivera (stefanor) wrote : Posted in a previous version of this proposal

Full TLD support in the grab regex
The test suite I used:

These match:
I: <email address hidden>
O:
I: google.com
O: google.com
I: x <email address hidden>
O:
I: http://foo.bar
O: http://foo.bar
I: <email address hidden>
O:
I: aoeuoeu <www.jar.com>
O: www.jar.com
I: so bar http://foo.bar/baz to jo
O: http://foo.bar/baz
I: 'http://bar.com'
O: http://bar.com
I: Thingie boo.com/a eue
O: boo.com/a
I: joe (www.google.com) says foo
O: www.google.com
I: http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/
O: http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/
I: aoeu http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/ aoeu
O: http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/
I: aoeu http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/. aoeu
O: http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/
I: http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/.
O: http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/
I: ouoe <http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/> aoeuao
O: http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/

These don't:
I: http://en.wikipedia.org/wiki/Python_(programming_language)
O: http://en.wikipedia.org/wiki/Python_(programming_language)
I: Python <http://en.wikipedia.org/wiki/Python_(programming_language)> is a lekker language
O: http://en.wikipedia.org/wiki/Python_(programming_language)
I: Python <URL:http://en.wikipedia.org/wiki/Python_(programming_language)> is a lekker language
O: http://en.wikipedia.org/wiki/Python_(programming_language)
I: This is an IDN TLD: ﻢﺛﺎﻟ.ﺈﺨﺘﺑﺍﺭ/%D8%A7%D9%84%D8%B5%D9%81%D8%AD%D8%A9_%D8%A7%D9%84%D8%B1%D8%A6%D9%8A%D8%B3%D9%8A%D8%A9 ouoeu
O: ﻢﺛﺎﻟ.ﺈﺨﺘﺑﺍﺭ/%D8%A7%D9%84%D8%B5%D9%81%D8%AD%D8%A9_%D8%A7%D9%84%D8%B1%D8%A6%D9%8A%D8%B3%D9%8A%D8%A9
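
For readers who want to try the cases above, here is a minimal Python 3 sketch of how a grab pattern can be built from a TLD list. The pattern text is copied from the setup() method in the preview diff. The helper names (parse_tlds, build_grab_pattern, grab), the four-entry sample list, and the re.ASCII flag (standing in for Python 2's non-unicode matching) are assumptions made for illustration and are not part of the branch.

```python
import re

# Hypothetical sample in the IANA list format: a '#' comment line, then one TLD per line.
SAMPLE_TLD_LINES = [
    "# Version 2009071300, Last Updated Mon Jul 13 07:07:02 2009 UTC",
    "COM",
    "NET",
    "ORG",
    "ZA",
]


def parse_tlds(lines):
    """Return lower-cased TLDs, skipping '#' comment lines and blank lines."""
    return [line.strip().lower() for line in lines
            if line.strip() and not line.startswith("#")]


def build_grab_pattern(tlds):
    """Compile the grab regex from the diff with the given TLD alternatives."""
    return re.compile((
        r'(?:[^@]\b|\A)('                  # boundary, but not directly after an '@'
        r'(?:\w+://|(?:www|ftp)\.)\S+?'    # explicit scheme, or a www./ftp. prefix
        r'|[^@\s:]+\.(?:%s)(?:/\S*?)?'     # bare host guessed from its TLD
        r')[\[>)\]"\'.]*(?:\s|\Z)'         # trailing punctuation, then space or end
    ) % '|'.join(tlds),
        re.I | re.DOTALL | re.ASCII)       # re.ASCII is an assumption (Python 2 str mode)


def grab(pattern, text):
    """Return the first URL-like match in text, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(1) if m else None


PATTERN = build_grab_pattern(parse_tlds(SAMPLE_TLD_LINES))

if __name__ == "__main__":
    for text in ["google.com", "joe (www.google.com) says foo", "joe@bar.com"]:
        print(repr(text), "->", grab(PATTERN, text))
```

The lazy `\S+?` together with the trailing punctuation class is what lets a sentence-final period or a closing bracket fall outside the captured URL; it is also why a URL that legitimately ends in `)` loses that character.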

I specifically don't convert the xn--.* TLDs to unicode form, because then we have to use unicode mode for the regex, which messes with \s and \w
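
A small Python 3 sketch of the concern about unicode mode. It assumes that Python 3's default str matching behaves like Python 2's re.UNICODE, and that re.ASCII approximates Python 2's default byte-string matching; the sample strings are made up for illustration.

```python
import re

WORD_ASCII = re.compile(r'\w+', re.ASCII)    # roughly Python 2's default behaviour
WORD_UNICODE = re.compile(r'\w+')            # Python 3 default for str, like re.UNICODE

SPACE_ASCII = re.compile(r'\s', re.ASCII)
SPACE_UNICODE = re.compile(r'\s')

text = 'caf\u00e9.com'   # hypothetical host name containing a non-ASCII letter
nbsp = '\u00a0'          # NO-BREAK SPACE

# ASCII \w stops before the accented letter; Unicode \w includes it.
print(WORD_ASCII.match(text).group())
print(WORD_UNICODE.match(text).group())

# ASCII \s ignores the no-break space; Unicode \s treats it as whitespace.
print(bool(SPACE_ASCII.match(nbsp)), bool(SPACE_UNICODE.match(nbsp)))
```

Because the grab pattern leans on `\w`, `\s`, `\S` and `\b` for its boundaries, switching modes changes where a match starts and ends, not only which TLDs are recognised.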

Michael Gorven (mgorven) wrote : Posted in a previous version of this proposal

+ event.addresponse(u'Matched %s', url)
+ return

Er, is that supposed to be there? Looks fine otherwise.

Stefano Rivera (stefanor) wrote :

This branch has become rather a cesspool of fixes. Enjoy

Michael Gorven (mgorven) wrote :

Yay for tests!
 review approve

review: Approve
lp:~stefanor/ibid/url-397900 updated
719. By Stefano Rivera

Remove extraneous imports

Jonathan Hitchcock (vhata) wrote :

Rocking!

Now we need tests for all the other plugins :)

review: Approve

Preview Diff

=== removed file 'NOTES'
--- NOTES 2009-02-18 10:41:40 +0000
+++ NOTES 1970-01-01 00:00:00 +0000
@@ -1,22 +0,0 @@
-general
-=======
-- 'object' and 'type' are probably keywords that shouldn't be reused, btw
-- proper exception handling needed all round (and maybe some more error checks?)
-- could a Processor dynamically alter priority? of itself, or another class? if so, would it ever want to? "learning"?
-- the patch on ConfigObj only does one level of list interpolation. Does ConfigObj support multi-level nested lists?
-
-plugins
-=======
-- process() is still a bit biased towards type=message?
-- how about having a lambda decorator like @match(pattern) to allow other types of matching?
-- reload config is a good example of somewhere that error reporting should occur
-- load/reload is a good example of the reloader not being generic enough... I think?
-- is ignore strong enough? it just sets processed to true - postprocessors won't ignore this. is that bad?
-- applause to Michael for the Responses processor: that's clever
- -- that said, not at all convinced by Address processor
- -- it takes a basestring only, not a dict as well
- -- do we even want what it does? cf. the Announce plugin - I think Processors should decide for themselves whether they want to address their replies or not. (But what about private replies - adding an address in to that is crap)
-- identity plugin: 'identify' is different in different scopes and it's a little confusing - not cool to a first-time reader
-- twisted helper functions (with that whole blocking-wrapper thing) may as well be written
-- sources processor: needs a bit of error checking, authentication and... all sorts.
- -- (no, I know, it was there for debugging, just a note for Future Us)
=== renamed file 'dbus-ping.py' => 'attic/dbus-ping.py'
=== added directory 'data'
=== added file 'data/README'
--- data/README 1970-01-01 00:00:00 +0000
+++ data/README 2009-07-13 14:23:03 +0000
@@ -0,0 +1,5 @@
+These are data files used by plugins, that don't change very much.
+Thus outdated versions shouldn't be a major issue.
+
+Sources:
+http://data.iana.org/TLD/tlds-alpha-by-domain.txt
=== added file 'data/tlds-alpha-by-domain.txt'
--- data/tlds-alpha-by-domain.txt 1970-01-01 00:00:00 +0000
+++ data/tlds-alpha-by-domain.txt 2009-07-13 14:23:03 +0000
@@ -0,0 +1,281 @@
+# Version 2009071300, Last Updated Mon Jul 13 07:07:02 2009 UTC
+AC
+AD
+AE
+AERO
+AF
+AG
+AI
+AL
+AM
+AN
+AO
+AQ
+AR
+ARPA
+AS
+ASIA
+AT
+AU
+AW
+AX
+AZ
+BA
+BB
+BD
+BE
+BF
+BG
+BH
+BI
+BIZ
+BJ
+BM
+BN
+BO
+BR
+BS
+BT
+BV
+BW
+BY
+BZ
+CA
+CAT
+CC
+CD
+CF
+CG
+CH
+CI
+CK
+CL
+CM
+CN
+CO
+COM
+COOP
+CR
+CU
+CV
+CX
+CY
+CZ
+DE
+DJ
+DK
+DM
+DO
+DZ
+EC
+EDU
+EE
+EG
+ER
+ES
+ET
+EU
+FI
+FJ
+FK
+FM
+FO
+FR
+GA
+GB
+GD
+GE
+GF
+GG
+GH
+GI
+GL
+GM
+GN
+GOV
+GP
+GQ
+GR
+GS
+GT
+GU
+GW
+GY
+HK
+HM
+HN
+HR
+HT
+HU
+ID
+IE
+IL
+IM
+IN
+INFO
+INT
+IO
+IQ
+IR
+IS
+IT
+JE
+JM
+JO
+JOBS
+JP
+KE
+KG
+KH
+KI
+KM
+KN
+KP
+KR
+KW
+KY
+KZ
+LA
+LB
+LC
+LI
+LK
+LR
+LS
+LT
+LU
+LV
+LY
+MA
+MC
+MD
+ME
+MG
+MH
+MIL
+MK
+ML
+MM
+MN
+MO
+MOBI
+MP
+MQ
+MR
+MS
+MT
+MU
+MUSEUM
+MV
+MW
+MX
+MY
+MZ
+NA
+NAME
+NC
+NE
+NET
+NF
+NG
+NI
+NL
+NO
+NP
+NR
+NU
+NZ
+OM
+ORG
+PA
+PE
+PF
+PG
+PH
+PK
+PL
+PM
+PN
+PR
+PRO
+PS
+PT
+PW
+PY
+QA
+RE
+RO
+RS
+RU
+RW
+SA
+SB
+SC
+SD
+SE
+SG
+SH
+SI
+SJ
+SK
+SL
+SM
+SN
+SO
+SR
+ST
+SU
+SV
+SY
+SZ
+TC
+TD
+TEL
+TF
+TG
+TH
+TJ
+TK
+TL
+TM
+TN
+TO
+TP
+TR
+TRAVEL
+TT
+TV
+TW
+TZ
+UA
+UG
+UK
+US
+UY
+UZ
+VA
+VC
+VE
+VG
+VI
+VN
+VU
+WF
+WS
+XN--0ZWM56D
+XN--11B5BS3A9AJ6G
+XN--80AKHBYKNJ4F
+XN--9T4B11YI5A
+XN--DEBA0AD
+XN--G6W251D
+XN--HGBK6AJ7F53BBA
+XN--HLCJ6AYA9ESC7A
+XN--JXALPDLP
+XN--KGBECHTV
+XN--ZCKZAH
+YE
+YT
+YU
+ZA
+ZM
+ZW
=== modified file 'ibid/plugins/http.py'
--- ibid/plugins/http.py 2009-05-01 12:17:57 +0000
+++ ibid/plugins/http.py 2009-07-13 15:19:39 +0000
@@ -17,7 +17,7 @@
 
     max_size = IntOption('max_size', 'Only request this many bytes', 500)
 
-    @match(r'^(get|head)\s+(.+)$')
+    @match(r'^(get|head)\s+(\S+\.\S+)$')
     def handler(self, event, action, url):
         if not url.lower().startswith("http://") and not url.lower().startswith("https://"):
             url = "http://" + url
=== modified file 'ibid/plugins/url.py'
--- ibid/plugins/url.py 2009-07-10 12:01:48 +0000
+++ ibid/plugins/url.py 2009-07-13 15:17:24 +0000
@@ -5,6 +5,7 @@
 import logging
 import re
 
+from pkg_resources import resource_exists, resource_stream
 from sqlalchemy import Column, Integer, Unicode, DateTime, UnicodeText, ForeignKey, Table
 
 import ibid
@@ -112,7 +113,26 @@
     password = Option('delicious_password', 'delicious account password')
     delicious = Delicious()
 
-    @match(r'((?:\S+://|(?:www|ftp)\.)\S+|\S+\.(?:com|org|net|za)\S*)')
+    def setup(self):
+        if resource_exists(__name__, '../../data/tlds-alpha-by-domain.txt'):
+            tlds = [tld.strip().lower() for tld
+                    in resource_stream(__name__, '../../data/tlds-alpha-by-domain.txt')
+                        .readlines()
+                    if not tld.startswith('#')
+            ]
+
+        else:
+            log.warning(u"Couldn't open TLD list, falling back to minimal default")
+            tlds = 'com.org.net.za'.split('.')
+
+        self.grab.im_func.pattern = re.compile((
+            r'(?:[^@]\b|\A)(' # Match a boundry, but not on an e-mail address
+            r'(?:\w+://|(?:www|ftp)\.)\S+?' # Match an explicit URL or guess by www.
+            r'|[^@\s:]+\.(?:%s)(?:/\S*?)?' # Guess at the URL based on TLD
+            r')[\[>)\]"\'.]*(?:\s|\Z)' # End Boundry
+        ) % '|'.join(tlds), re.I | re.DOTALL)
+
+    @handler
     def grab(self, event, url):
         if url.find('://') == -1:
             if url.lower().startswith('ftp'):
@@ -154,7 +174,7 @@
         ))
 
     def setup(self):
-        self.lengthen.im_func.pattern = re.compile(r'^((?:%s)\S+)$' % '|'.join([re.escape(service) for service in self.services]), re.I)
+        self.lengthen.im_func.pattern = re.compile(r'^((?:%s)\S+)$' % '|'.join([re.escape(service) for service in self.services]), re.I|re.DOTALL)
 
     @handler
     def lengthen(self, event, url):
=== modified file 'ibid/test/plugins/test_core.py'
--- ibid/test/plugins/test_core.py 2009-03-05 16:33:12 +0000
+++ ibid/test/plugins/test_core.py 2009-07-13 14:47:02 +0000
@@ -22,15 +22,24 @@
     def assert_addressed(self, event, addressed, message):
         self.assert_(hasattr(event, 'addressed'))
         self.assertEqual(event.addressed, addressed)
-        self.assertEqual(event.message.strip(), message)
+        self.assertEqual(event.message['deaddressed'].strip(), message)
+
+    def create_event(self, message, event_type=u'message'):
+        event = Event(u'fakesource', event_type)
+        event.message = {
+            'raw': message,
+            'deaddressed': message,
+            'clean': message,
+            'stripped': message,
+        }
+        return event
 
     def test_non_messages(self):
         for event_type in [u'timer', u'rpc']:
-            event = Event(u'fakesource', event_type)
-            event.message = u'bot: foo'
+            event = self.create_event(u'bot: foo', event_type)
             self.processor.process(event)
             self.assertFalse(hasattr(event, u'addressed'))
-            self.assertEqual(event.message, u'bot: foo')
+            self.assertEqual(event.message['deaddressed'], u'bot: foo')
 
     happy_prefixes = [
         (u'bot', u': '),
@@ -40,8 +49,7 @@
 
     def test_happy_prefix_names(self):
         for prefix in self.happy_prefixes:
-            event = Event(u'fakesource', u'message')
-            event.message = u'%s%sfoo' % prefix
+            event = self.create_event(u'%s%sfoo' % prefix)
             self.processor.process(event)
             self.assert_addressed(event, prefix[0], u'foo')
 
@@ -53,8 +61,7 @@
 
     def test_sad_prefix_names(self):
         for prefix in self.sad_prefixes:
-            event = Event(u'fakesource', u'message')
-            event.message = u'%s%sfoo' % prefix
+            event = self.create_event(u'%s%sfoo' % prefix)
             self.processor.process(event)
             self.assert_addressed(event, False, u'%s%sfoo' % prefix)
 
@@ -66,8 +73,7 @@
 
     def test_happy_suffix_names(self):
         for suffix in self.happy_suffixes:
-            event = Event(u'fakesource', u'message')
-            event.message = u'foo%s%s' % suffix
+            event = self.create_event(u'foo%s%s' % suffix)
             self.processor.process(event)
             self.assert_addressed(event, suffix[1], u'foo')
 
@@ -80,8 +86,7 @@
 
     def test_sad_suffix_names(self):
         for suffix in self.sad_suffixes:
-            event = Event(u'fakesource', u'message')
-            event.message = u'foo%s%s' % suffix
+            event = self.create_event(u'foo%s%s' % suffix)
             self.processor.process(event)
             self.assert_addressed(event, False, u'foo%s%s' % suffix)
 
=== added file 'ibid/test/plugins/test_url.py'
--- ibid/test/plugins/test_url.py 1970-01-01 00:00:00 +0000
+++ ibid/test/plugins/test_url.py 2009-07-13 15:17:19 +0000
@@ -0,0 +1,57 @@
+from twisted.trial import unittest
+import ibid.test
+
+from ibid.event import Event
+from ibid.plugins import url
+
+class TestURLGrabber(unittest.TestCase):
+
+    def setUp(self):
+        self.grab = url.Grab(u'testplugin')
+
+    good_grabs = [
+        (u'google.com', u'google.com'),
+        (u'http://foo.bar', u'http://foo.bar'),
+        (u'aoeuoeu <www.jar.com>', u'www.jar.com'),
+        (u'aoeuoeu <www.jar.com> def', u'www.jar.com'),
+        (u'<www.jar.com>', u'www.jar.com'),
+        (u'so bar http://foo.bar/baz to jo', u'http://foo.bar/baz'),
+        (u"'http://bar.com'", u'http://bar.com'),
+        (u'Thingie boo.com/a eue', u'boo.com/a'),
+        (u'joe (www.google.com) says foo', u'www.google.com'),
+        (u'http://www.example.net/blog/2008/11/09/debugging-python-regular-expressions/',
+            u'http://www.example.net/blog/2008/11/09/debugging-python-regular-expressions/'),
+        (u'aoeu http://www.example.net/blog/2008/11/09/debugging-python-regular-expressions/ aoeu',
+            u'http://www.example.net/blog/2008/11/09/debugging-python-regular-expressions/'),
+        (u'aoeu http://www.example.net/blog/2008/11/09/debugging-python-regular-expressions/. aoeu',
+            u'http://www.example.net/blog/2008/11/09/debugging-python-regular-expressions/'),
+        (u'http://www.example.net/blog/2008/11/09/debugging-python-regular-expressions/.',
+            u'http://www.example.net/blog/2008/11/09/debugging-python-regular-expressions/'),
+        (u'ouoe <http://www.example.net/blog/2008/11/09/debugging-python-regular-expressions/> aoeuao',
+            u'http://www.example.net/blog/2008/11/09/debugging-python-regular-expressions/'),
+        # We accept that the following are non-optimal
+        (u'http://en.example.org/wiki/Python_(programming_language)',
+            u'http://en.example.org/wiki/Python_(programming_language'),
+        (u'Python <http://en.example.org/wiki/Python_(programming_language)> is a lekker language',
+            u'http://en.example.org/wiki/Python_(programming_language'),
+        (u'Python <URL:http://en.example.org/wiki/Python_(programming_language)> is a lekker language',
+            u'http://en.example.org/wiki/Python_(programming_language'),
+    ]
+
+    def test_good_grabs(self):
+        for input, url in self.good_grabs:
+            m = self.grab.grab.im_func.pattern.search(input)
+            self.assertEqual(m.group(1), url)
+
+    bad_grabs = [
+        u'joe@bar.com',
+        u'x joe@google.com',
+        u'<joe@bar.com>',
+    ]
+
+    def test_bad_grabs(self):
+        for input in self.bad_grabs:
+            m = self.grab.grab.im_func.pattern.search(input)
+            self.assertEqual(m, None)
+
+# vi: set et sta sw=4 ts=4:
