Merge lp:~stefanor/ibid/url-397900 into lp:~ibid-core/ibid/old-trunk-pack-0.92

Proposed by Stefano Rivera
Status: Superseded
Proposed branch: lp:~stefanor/ibid/url-397900
Merge into: lp:~ibid-core/ibid/old-trunk-pack-0.92
Diff against target: None lines
To merge this branch: bzr merge lp:~stefanor/ibid/url-397900
Reviewer Review Type Date Requested Status
Ibid Core Team Pending
Review via email: mp+8690@code.launchpad.net

This proposal supersedes a proposal from 2009-07-11.

This proposal has been superseded by a proposal from 2009-07-13.

Stefano Rivera (stefanor) wrote : Posted in a previous version of this proposal

Yes, I've tested it, but that doesn't mean that you mustn't

Michael Gorven (mgorven) wrote : Posted in a previous version of this proposal

 review approve

review: Approve
Stefano Rivera (stefanor) wrote : Posted in a previous version of this proposal

Test cases I used in the bug report (thanks mithrandi for the last one)

Jonathan Hitchcock (vhata) : Posted in a previous version of this proposal
review: Approve
Michael Gorven (mgorven) wrote : Posted in a previous version of this proposal

 review approve

review: Approve
Stefano Rivera (stefanor) wrote :

Full TLD support in the grab regex
The test suite I used:

These match (I is the input line, O is the URL the regex grabs; an empty O means nothing is grabbed):
I: <email address hidden>
O:
I: google.com
O: google.com
I: x <email address hidden>
O:
I: http://foo.bar
O: http://foo.bar
I: <email address hidden>
O:
I: aoeuoeu <www.jar.com>
O: www.jar.com
I: so bar http://foo.bar/baz to jo
O: http://foo.bar/baz
I: 'http://bar.com'
O: http://bar.com
I: Thingie boo.com/a eue
O: boo.com/a
I: joe (www.google.com) says foo
O: www.google.com
I: http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/
O: http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/
I: aoeu http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/ aoeu
O: http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/
I: aoeu http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/. aoeu
O: http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/
I: http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/.
O: http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/
I: ouoe <http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/> aoeuao
O: http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/

These don't (O is the URL that should be grabbed, which the regex doesn't produce yet):
I: http://en.wikipedia.org/wiki/Python_(programming_language)
O: http://en.wikipedia.org/wiki/Python_(programming_language)
I: Python <http://en.wikipedia.org/wiki/Python_(programming_language)> is a lekker language
O: http://en.wikipedia.org/wiki/Python_(programming_language)
I: Python <URL:http://en.wikipedia.org/wiki/Python_(programming_language)> is a lekker language
O: http://en.wikipedia.org/wiki/Python_(programming_language)
I: This is an IDN TLD: ﻢﺛﺎﻟ.ﺈﺨﺘﺑﺍﺭ/%D8%A7%D9%84%D8%B5%D9%81%D8%AD%D8%A9_%D8%A7%D9%84%D8%B1%D8%A6%D9%8A%D8%B3%D9%8A%D8%A9 ouoeu
O: ﻢﺛﺎﻟ.ﺈﺨﺘﺑﺍﺭ/%D8%A7%D9%84%D8%B5%D9%81%D8%AD%D8%A9_%D8%A7%D9%84%D8%B1%D8%A6%D9%8A%D8%B3%D9%8A%D8%A9
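
A minimal sketch of how cases like the ones above could be replayed outside the bot. It is written for Python 3 (the branch itself targets Python 2), uses a four-entry stand-in for the TLD list, and passes re.ASCII to approximate the non-Unicode regex mode. The names build_grab_pattern, grab, SAMPLE_TLDS and the address user@example.com are assumptions made for illustration, not part of the branch.

```python
import re

# Assumption: a tiny stand-in for the IANA list; the branch reads the full file.
SAMPLE_TLDS = ['com', 'org', 'net', 'za']


def build_grab_pattern(tlds):
    """Compile the grab regex from the diff (hypothetical helper name)."""
    return re.compile((
        r'(?:[^@]\b|\A)('                # boundary, but not inside an e-mail address
        r'(?:\w+://|(?:www|ftp)\.)\S+?'  # explicit scheme, or a www./ftp. prefix
        r'|[^@\s:]+\.(?:%s)(?:/\S*?)?'   # bare host ending in a known TLD
        r')[\[>)\]"\'.]*(?:\s|\Z)'       # trailing punctuation, then space or end
    ) % '|'.join(tlds), re.I | re.DOTALL | re.ASCII)


def grab(pattern, text):
    """Return the first URL found in text, or None if nothing is grabbed."""
    found = pattern.search(text)
    return found.group(1) if found else None


pattern = build_grab_pattern(SAMPLE_TLDS)

cases = [
    'google.com',
    'so bar http://foo.bar/baz to jo',
    'joe (www.google.com) says foo',
    'user@example.com',  # made-up address; the real ones are hidden on this page
    'http://en.wikipedia.org/wiki/Python_(programming_language)',
]
for text in cases:
    print('I:', text)
    print('O:', grab(pattern, text) or '')
```

In this sketch the Wikipedia URL comes back without its closing parenthesis, because the end-boundary character class swallows the final ")". That is consistent with it being listed among the cases that don't work.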

I deliberately don't convert the xn--.* TLDs to their Unicode form, because the regex would then have to run in Unicode mode, which changes what \s and \w match.
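
A small illustration of that trade-off, as a Python 3 sketch rather than code from the branch. In Python 3, str patterns are Unicode-aware by default and re.ASCII switches \w and \s back to ASCII-only; in the Python 2 this branch targets, the switch goes the other way (re.UNICODE has to be turned on). The variable names are assumptions.

```python
import re

# One of the xn-- entries from the TLD list in this diff, lower-cased.
ascii_tld = 'xn--zckzah'

# The stdlib "idna" codec turns the punycode label into its Unicode form.
unicode_tld = ascii_tld.encode('ascii').decode('idna')

# With ASCII-only character classes, \w does not match the Unicode form...
ascii_mode = re.fullmatch(r'\w+', unicode_tld, re.ASCII)

# ...but it does once the regex uses Unicode rules.
unicode_mode = re.fullmatch(r'\w+', unicode_tld)

# The punycode form needs no Unicode mode at all.
punycode_ok = re.fullmatch(r'[\w-]+', ascii_tld, re.ASCII)

print(ascii_mode is None, unicode_mode is not None, punycode_ok is not None)
```

Keeping the xn-- form means the TLD alternation stays plain ASCII, at the cost of not recognising IDN hosts typed in their native script.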

Michael Gorven (mgorven) wrote :

+ event.addresponse(u'Matched %s', url)
+ return

Er, is that supposed to be there? Looks fine otherwise.

lp:~stefanor/ibid/url-397900 updated
715. By Stefano Rivera

Fix the *far* from complete unit tests to work again

716. By Stefano Rivera

Add URL grabber tests

717. By Stefano Rivera

Forgotten testing code removed

718. By Stefano Rivera

Require a . in http head|get url

719. By Stefano Rivera

Remove extraneous imports

Unmerged revisions

Preview Diff

=== removed file 'NOTES'
--- NOTES 2009-02-18 10:41:40 +0000
+++ NOTES 1970-01-01 00:00:00 +0000
@@ -1,22 +0,0 @@
-general
-=======
-- 'object' and 'type' are probably keywords that shouldn't be reused, btw
-- proper exception handling needed all round (and maybe some more error checks?)
-- could a Processor dynamically alter priority? of itself, or another class? if so, would it ever want to? "learning"?
-- the patch on ConfigObj only does one level of list interpolation. Does ConfigObj support multi-level nested lists?
-
-plugins
-=======
-- process() is still a bit biased towards type=message?
-- how about having a lambda decorator like @match(pattern) to allow other types of matching?
-- reload config is a good example of somewhere that error reporting should occur
-- load/reload is a good example of the reloader not being generic enough... I think?
-- is ignore strong enough? it just sets processed to true - postprocessors won't ignore this. is that bad?
-- applause to Michael for the Responses processor: that's clever
- -- that said, not at all convinced by Address processor
- -- it takes a basestring only, not a dict as well
- -- do we even want what it does? cf. the Announce plugin - I think Processors should decide for themselves whether they want to address their replies or not. (But what about private replies - adding an address in to that is crap)
-- identity plugin: 'identify' is different in different scopes and it's a little confusing - not cool to a first-time reader
-- twisted helper functions (with that whole blocking-wrapper thing) may as well be written
-- sources processor: needs a bit of error checking, authentication and... all sorts.
- -- (no, I know, it was there for debugging, just a note for Future Us)

=== renamed file 'dbus-ping.py' => 'attic/dbus-ping.py'
=== added directory 'data'
=== added file 'data/README'
--- data/README 1970-01-01 00:00:00 +0000
+++ data/README 2009-07-13 14:23:03 +0000
@@ -0,0 +1,5 @@
+These are data files used by plugins, that don't change very much.
+Thus outdated versions shouldn't be a major issue.
+
+Sources:
+http://data.iana.org/TLD/tlds-alpha-by-domain.txt

=== added file 'data/tlds-alpha-by-domain.txt'
--- data/tlds-alpha-by-domain.txt 1970-01-01 00:00:00 +0000
+++ data/tlds-alpha-by-domain.txt 2009-07-13 14:23:03 +0000
@@ -0,0 +1,281 @@
+# Version 2009071300, Last Updated Mon Jul 13 07:07:02 2009 UTC
+AC
+AD
+AE
+AERO
+AF
+AG
+AI
+AL
+AM
+AN
+AO
+AQ
+AR
+ARPA
+AS
+ASIA
+AT
+AU
+AW
+AX
+AZ
+BA
+BB
+BD
+BE
+BF
+BG
+BH
+BI
+BIZ
+BJ
+BM
+BN
+BO
+BR
+BS
+BT
+BV
+BW
+BY
+BZ
+CA
+CAT
+CC
+CD
+CF
+CG
+CH
+CI
+CK
+CL
+CM
+CN
+CO
+COM
+COOP
+CR
+CU
+CV
+CX
+CY
+CZ
+DE
+DJ
+DK
+DM
+DO
+DZ
+EC
+EDU
+EE
+EG
+ER
+ES
+ET
+EU
+FI
+FJ
+FK
+FM
+FO
+FR
+GA
+GB
+GD
+GE
+GF
+GG
+GH
+GI
+GL
+GM
+GN
+GOV
+GP
+GQ
+GR
+GS
+GT
+GU
+GW
+GY
+HK
+HM
+HN
+HR
+HT
+HU
+ID
+IE
+IL
+IM
+IN
+INFO
+INT
+IO
+IQ
+IR
+IS
+IT
+JE
+JM
+JO
+JOBS
+JP
+KE
+KG
+KH
+KI
+KM
+KN
+KP
+KR
+KW
+KY
+KZ
+LA
+LB
+LC
+LI
+LK
+LR
+LS
+LT
+LU
+LV
+LY
+MA
+MC
+MD
+ME
+MG
+MH
+MIL
+MK
+ML
+MM
+MN
+MO
+MOBI
+MP
+MQ
+MR
+MS
+MT
+MU
+MUSEUM
+MV
+MW
+MX
+MY
+MZ
+NA
+NAME
+NC
+NE
+NET
+NF
+NG
+NI
+NL
+NO
+NP
+NR
+NU
+NZ
+OM
+ORG
+PA
+PE
+PF
+PG
+PH
+PK
+PL
+PM
+PN
+PR
+PRO
+PS
+PT
+PW
+PY
+QA
+RE
+RO
+RS
+RU
+RW
+SA
+SB
+SC
+SD
+SE
+SG
+SH
+SI
+SJ
+SK
+SL
+SM
+SN
+SO
+SR
+ST
+SU
+SV
+SY
+SZ
+TC
+TD
+TEL
+TF
+TG
+TH
+TJ
+TK
+TL
+TM
+TN
+TO
+TP
+TR
+TRAVEL
+TT
+TV
+TW
+TZ
+UA
+UG
+UK
+US
+UY
+UZ
+VA
+VC
+VE
+VG
+VI
+VN
+VU
+WF
+WS
+XN--0ZWM56D
+XN--11B5BS3A9AJ6G
+XN--80AKHBYKNJ4F
+XN--9T4B11YI5A
+XN--DEBA0AD
+XN--G6W251D
+XN--HGBK6AJ7F53BBA
+XN--HLCJ6AYA9ESC7A
+XN--JXALPDLP
+XN--KGBECHTV
+XN--ZCKZAH
+YE
+YT
+YU
+ZA
+ZM
+ZW

=== modified file 'ibid/plugins/url.py'
--- ibid/plugins/url.py 2009-07-10 12:01:48 +0000
+++ ibid/plugins/url.py 2009-07-13 14:23:03 +0000
@@ -5,6 +5,7 @@
 import logging
 import re

+from pkg_resources import resource_exists, resource_stream
 from sqlalchemy import Column, Integer, Unicode, DateTime, UnicodeText, ForeignKey, Table

 import ibid
@@ -112,8 +113,29 @@
     password = Option('delicious_password', 'delicious account password')
     delicious = Delicious()

-    @match(r'((?:\S+://|(?:www|ftp)\.)\S+|\S+\.(?:com|org|net|za)\S*)')
+    def setup(self):
+        if resource_exists(__name__, '../../data/tlds-alpha-by-domain.txt'):
+            tlds = [tld.strip().lower() for tld
+                    in resource_stream(__name__, '../../data/tlds-alpha-by-domain.txt')
+                        .readlines()
+                    if not tld.startswith('#')
+            ]
+
+        else:
+            log.warning(u"Couldn't open TLD list, falling back to minimal default")
+            tlds = 'com.org.net.za'.split('.')
+
+        self.grab.im_func.pattern = re.compile((
+            r'(?:[^@]\b|\A)(' # Match a boundry, but not on an e-mail address
+            r'(?:\w+://|(?:www|ftp)\.)\S+?' # Match an explicit URL or guess by www.
+            r'|[^@\s:]+\.(?:%s)(?:/\S*?)?' # Guess at the URL based on TLD
+            r')[\[>)\]"\'.]*(?:\s|\Z)' # End Boundry
+        ) % '|'.join(tlds), re.I | re.DOTALL)
+
+    @handler
     def grab(self, event, url):
+        event.addresponse(u'Matched %s', url)
+        return
         if url.find('://') == -1:
             if url.lower().startswith('ftp'):
                 url = 'ftp://%s' % url
@@ -154,7 +176,7 @@
         ))

     def setup(self):
-        self.lengthen.im_func.pattern = re.compile(r'^((?:%s)\S+)$' % '|'.join([re.escape(service) for service in self.services]), re.I)
+        self.lengthen.im_func.pattern = re.compile(r'^((?:%s)\S+)$' % '|'.join([re.escape(service) for service in self.services]), re.I|re.DOTALL)

     @handler
     def lengthen(self, event, url):
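
For readers who want to try the TLD loading step from setup() on its own, here is a hedged Python 3 sketch. It swaps pkg_resources for a plain open() so it runs without the ibid package, and it skips blank lines as well as comments, which the diff does not do (an empty entry would add an empty alternative to the TLD group). parse_tlds, load_tlds and the fallback argument are hypothetical names, not part of the branch.

```python
def parse_tlds(lines):
    """Return lower-cased TLDs, skipping comment lines and blank lines."""
    return [line.strip().lower() for line in lines
            if line.strip() and not line.startswith('#')]


def load_tlds(path, fallback=('com', 'org', 'net', 'za')):
    """Read the IANA list at path, or fall back to a minimal default.

    The fallback mirrors the 'com.org.net.za' default used in the diff.
    """
    try:
        with open(path, encoding='ascii') as tld_file:
            return parse_tlds(tld_file)
    except OSError:
        # Missing or unreadable file: same idea as the log.warning branch.
        return list(fallback)


sample = [
    '# Version 2009071300, Last Updated Mon Jul 13 07:07:02 2009 UTC\n',
    'AC\n',
    '\n',
    'XN--ZCKZAH\n',
]
print(parse_tlds(sample))
```

The returned list can be joined with '|' and substituted into the grab pattern in the same way setup() does.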
