Merge into old-trunk-pack-0.92 : url-397900 : Code : Ibid

Status:

Superseded

Proposed branch:

lp:~stefanor/ibid/url-397900

Merge into:

lp:~ibid-core/ibid/old-trunk-pack-0.92

To merge this branch:

bzr merge lp:~stefanor/ibid/url-397900

Related bugs:

Bug #397900: [URL] e-mail addresses are not URLs	Low	Fix Released
Bug #398783: Get plugin doesn't validate URLs	Low	Fix Released

Reviewer	Review Type	Date Requested	Status
Ibid Core Team		2009-07-13	Pending
Review via email: mp+8690@code.launchpad.net

This proposal supersedes a proposal from 2009-07-11.

This proposal has been superseded by a proposal from 2009-07-13.

Stefano Rivera (stefanor) wrote on 2009-07-10: Posted in a previous version of this proposal

Yes, I've tested it, but that doesn't mean that you mustn't

Michael Gorven (mgorven) wrote on 2009-07-11: Posted in a previous version of this proposal

review approve

review: Approve

Stefano Rivera (stefanor) wrote on 2009-07-11: Posted in a previous version of this proposal

Test cases I used in the bug report (thanks mithrandi for the last one)

Jonathan Hitchcock (vhata) on 2009-07-12: Posted in a previous version of this proposal

review: Approve

Michael Gorven (mgorven) wrote on 2009-07-12: Posted in a previous version of this proposal

review approve

review: Approve

Stefano Rivera (stefanor) wrote on 2009-07-13:

Full TLD support in the grab regex
The test suite I used:

These match:
I: joe@bar.com
O:
I: google.com
O: google.com
I: x joe@google.com
O:
I: http://foo.bar
O: http://foo.bar
I: <joe@bar.com>
O:
I: aoeuoeu <www.jar.com>
O: www.jar.com
I: so bar http://foo.bar/baz to jo
O: http://foo.bar/baz
I: 'http://bar.com'
O: http://bar.com
I: Thingie boo.com/a eue
O: boo.com/a
I: joe (www.google.com) says foo
O: www.google.com
I: http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/
O: http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/
I: aoeu http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/ aoeu
O: http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/
I: aoeu http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/. aoeu
O: http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/
I: http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/.
O: http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/
I: ouoe <http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/> aoeuao
O: http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/

These don't:
I: http://en.wikipedia.org/wiki/Python_(programming_language)
O: http://en.wikipedia.org/wiki/Python_(programming_language)
I: Python <http://en.wikipedia.org/wiki/Python_(programming_language)> is a lekker language
O: http://en.wikipedia.org/wiki/Python_(programming_language)
I: Python <URL:http://en.wikipedia.org/wiki/Python_(programming_language)> is a lekker language
O: http://en.wikipedia.org/wiki/Python_(programming_language)
I: This is an IDN TLD: ﻢﺛﺎﻟ.ﺈﺨﺘﺑﺍﺭ/%D8%A7%D9%84%D8%B5%D9%81%D8%AD%D8%A9_%D8%A7%D9%84%D8%B1%D8%A6%D9%8A%D8%B3%D9%8A%D8%A9 ouoeu
O: ﻢﺛﺎﻟ.ﺈﺨﺘﺑﺍﺭ/%D8%A7%D9%84%D8%B5%D9%81%D8%AD%D8%A9_%D8%A7%D9%84%D8%B1%D8%A6%D9%8A%D8%B3%D9%8A%D8%A9
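
A few of the "These match" cases above can be replayed with a small standalone sketch. It copies the pattern shape from this branch's setup(), but several things are assumptions rather than Ibid behaviour: it targets Python 3 (the branch is Python 2), adds re.ASCII to approximate Python 2 byte-string matching, uses only the four-entry fallback TLD list, and assumes the handler applies the pattern with search(). The names build_grab_pattern and grab_url are hypothetical.

```python
import re

# Minimal fallback list, as in the branch; the full IANA file is not loaded here.
FALLBACK_TLDS = ["com", "org", "net", "za"]


def build_grab_pattern(tlds):
    """Compile a URL-grabbing regex shaped like the one in the diff."""
    return re.compile((
        r'(?:[^@]\b|\A)('                # boundary whose preceding char is not "@"
        r'(?:\w+://|(?:www|ftp)\.)\S+?'  # explicit scheme, or a www./ftp. guess
        r'|[^@\s:]+\.(?:%s)(?:/\S*?)?'   # bare host ending in a known TLD
        r')[\[>)\]"\'.]*(?:\s|\Z)'       # trailing punctuation, then space or end
    ) % "|".join(tlds), re.I | re.DOTALL | re.ASCII)


def grab_url(pattern, message):
    """Return the first grabbed URL, or None (assumes search() semantics)."""
    found = pattern.search(message)
    return found.group(1) if found else None


# (input, expected) pairs taken from the "These match" list above.
CASES = [
    ("joe@bar.com", None),
    ("google.com", "google.com"),
    ("x joe@google.com", None),
    ("<joe@bar.com>", None),
    ("aoeuoeu <www.jar.com>", "www.jar.com"),
    ("so bar http://foo.bar/baz to jo", "http://foo.bar/baz"),
    ("'http://bar.com'", "http://bar.com"),
    ("Thingie boo.com/a eue", "boo.com/a"),
    ("joe (www.google.com) says foo", "www.google.com"),
]

pattern = build_grab_pattern(FALLBACK_TLDS)
for text, expected in CASES:
    result = grab_url(pattern, text)
    status = "ok" if result == expected else "MISMATCH"
    print("%-8s I: %r  O: %r" % (status, text, result))
```

The lazy \S+? together with the trailing punctuation class is what drops a closing ">", ")" or "." from the captured URL.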

I specifically don't convert the xn--.* TLDs to Unicode form, because then we'd have to use Unicode mode for the regex, which changes what \s and \w match.
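
To illustrate the concern about Unicode mode, here is a minimal Python 3 sketch (not code from this branch). In Python 2, which Ibid targeted at the time, a pattern is ASCII-only unless re.UNICODE is passed; in Python 3 str patterns are Unicode by default and re.ASCII restores the narrower classes. The sample text and variable names are made up for the example.

```python
import re

# An Arabic label written with escapes so the source stays ASCII-only.
label = "\u0625\u062e\u062a\u0628\u0627\u0631"
text = "abc " + label

ascii_word = re.compile(r"\w+", re.ASCII)  # \w is [a-zA-Z0-9_] only
unicode_word = re.compile(r"\w+")          # \w also covers non-ASCII letters

print(ascii_word.findall(text))    # only the ASCII run is found
print(unicode_word.findall(text))  # the Arabic label is found as well

# \s changes too: a no-break space counts as whitespace only in Unicode mode.
nbsp = "\u00a0"
print(re.search(r"\s", nbsp, re.ASCII) is None)
print(re.search(r"\s", nbsp) is not None)
```

Keeping the TLD alternation in its ASCII xn-- form means the character classes in the rest of the pattern keep their ASCII meaning.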

Michael Gorven (mgorven) wrote on 2009-07-13:

+ event.addresponse(u'Matched %s', url)
+ return

Er, is that supposed to be there? Looks fine otherwise.

lp:~stefanor/ibid/url-397900 updated on 2009-07-13

715. By Stefano Rivera on 2009-07-13: Fix the *far* from complete unit tests to work again
716. By Stefano Rivera on 2009-07-13: Add URL grabber tests
717. By Stefano Rivera on 2009-07-13: Forgotten testing code removed
718. By Stefano Rivera on 2009-07-13: Require a . in http head|get url
719. By Stefano Rivera on 2009-07-13: Remove extraneous imports

Preview Diff

 === removed file 'NOTES'
 --- NOTES	2009-02-18 10:41:40 +0000
 +++ NOTES	1970-01-01 00:00:00 +0000
@@ -1,22 +0,0 @@
--general
--=======
--- 'object' and 'type' are probably keywords that shouldn't be reused, btw
--- proper exception handling needed all round (and maybe some more error checks?)
--- could a Processor dynamically alter priority?  of itself, or another class?  if so, would it ever want to?  "learning"?
--- the patch on ConfigObj only does one level of list interpolation.  Does ConfigObj support multi-level nested lists?
--
--plugins
--=======
--- process() is still a bit biased towards type=message?
--- how about having a lambda decorator like @match(pattern) to allow other types of matching?
--- reload config is a good example of somewhere that error reporting should occur
--- load/reload is a good example of the reloader not being generic enough... I think?
--- is ignore strong enough?  it just sets processed to true - postprocessors won't ignore this.  is that bad?
--- applause to Michael for the Responses processor: that's clever
--  -- that said, not at all convinced by Address processor
--    -- it takes a basestring only, not a dict as well
--    -- do we even want what it does?  cf. the Announce plugin - I think Processors should decide for themselves whether they want to address their replies or not.  (But what about private replies - adding an address in to that is crap)
--- identity plugin:  'identify' is different in different scopes and it's a little confusing - not cool to a first-time reader
--- twisted helper functions (with that whole blocking-wrapper thing) may as well be written
--- sources processor:  needs a bit of error checking, authentication and...  all sorts.
--  -- (no, I know, it was there for debugging, just a note for Future Us)
 === renamed file 'dbus-ping.py' => 'attic/dbus-ping.py'
 === added directory 'data'
 === added file 'data/README'
 --- data/README	1970-01-01 00:00:00 +0000
 +++ data/README	2009-07-13 14:23:03 +0000
@@ -0,0 +1,5 @@
++These are data files used by plugins, that don't change very much.
++Thus outdated versions shouldn't be a major issue.
++
++Sources:
++http://data.iana.org/TLD/tlds-alpha-by-domain.txt
 === added file 'data/tlds-alpha-by-domain.txt'
 --- data/tlds-alpha-by-domain.txt	1970-01-01 00:00:00 +0000
 +++ data/tlds-alpha-by-domain.txt	2009-07-13 14:23:03 +0000
@@ -0,0 +1,281 @@
++# Version 2009071300, Last Updated Mon Jul 13 07:07:02 2009 UTC
++AC
++AD
++AE
++AERO
++AF
++AG
++AI
++AL
++AM
++AN
++AO
++AQ
++AR
++ARPA
++AS
++ASIA
++AT
++AU
++AW
++AX
++AZ
++BA
++BB
++BD
++BE
++BF
++BG
++BH
++BI
++BIZ
++BJ
++BM
++BN
++BO
++BR
++BS
++BT
++BV
++BW
++BY
++BZ
++CA
++CAT
++CC
++CD
++CF
++CG
++CH
++CI
++CK
++CL
++CM
++CN
++CO
++COM
++COOP
++CR
++CU
++CV
++CX
++CY
++CZ
++DE
++DJ
++DK
++DM
++DO
++DZ
++EC
++EDU
++EE
++EG
++ER
++ES
++ET
++EU
++FI
++FJ
++FK
++FM
++FO
++FR
++GA
++GB
++GD
++GE
++GF
++GG
++GH
++GI
++GL
++GM
++GN
++GOV
++GP
++GQ
++GR
++GS
++GT
++GU
++GW
++GY
++HK
++HM
++HN
++HR
++HT
++HU
++ID
++IE
++IL
++IM
++IN
++INFO
++INT
++IO
++IQ
++IR
++IS
++IT
++JE
++JM
++JO
++JOBS
++JP
++KE
++KG
++KH
++KI
++KM
++KN
++KP
++KR
++KW
++KY
++KZ
++LA
++LB
++LC
++LI
++LK
++LR
++LS
++LT
++LU
++LV
++LY
++MA
++MC
++MD
++ME
++MG
++MH
++MIL
++MK
++ML
++MM
++MN
++MO
++MOBI
++MP
++MQ
++MR
++MS
++MT
++MU
++MUSEUM
++MV
++MW
++MX
++MY
++MZ
++NA
++NAME
++NC
++NE
++NET
++NF
++NG
++NI
++NL
++NO
++NP
++NR
++NU
++NZ
++OM
++ORG
++PA
++PE
++PF
++PG
++PH
++PK
++PL
++PM
++PN
++PR
++PRO
++PS
++PT
++PW
++PY
++QA
++RE
++RO
++RS
++RU
++RW
++SA
++SB
++SC
++SD
++SE
++SG
++SH
++SI
++SJ
++SK
++SL
++SM
++SN
++SO
++SR
++ST
++SU
++SV
++SY
++SZ
++TC
++TD
++TEL
++TF
++TG
++TH
++TJ
++TK
++TL
++TM
++TN
++TO
++TP
++TR
++TRAVEL
++TT
++TV
++TW
++TZ
++UA
++UG
++UK
++US
++UY
++UZ
++VA
++VC
++VE
++VG
++VI
++VN
++VU
++WF
++WS
++XN--0ZWM56D
++XN--11B5BS3A9AJ6G
++XN--80AKHBYKNJ4F
++XN--9T4B11YI5A
++XN--DEBA0AD
++XN--G6W251D
++XN--HGBK6AJ7F53BBA
++XN--HLCJ6AYA9ESC7A
++XN--JXALPDLP
++XN--KGBECHTV
++XN--ZCKZAH
++YE
++YT
++YU
++ZA
++ZM
++ZW
 === modified file 'ibid/plugins/url.py'
 --- ibid/plugins/url.py	2009-07-10 12:01:48 +0000
 +++ ibid/plugins/url.py	2009-07-13 14:23:03 +0000
@@ -5,6 +5,7 @@
  import logging
  import re
++from pkg_resources import resource_exists, resource_stream
  from sqlalchemy import Column, Integer, Unicode, DateTime, UnicodeText, ForeignKey, Table
  import ibid
@@ -112,8 +113,29 @@
      password  = Option('delicious_password', 'delicious account password')
      delicious = Delicious()
--    @match(r'((?:\S+://|(?:www|ftp)\.)\S+|\S+\.(?:com|org|net|za)\S*)')
++    def setup(self):
++        if resource_exists(__name__, '../../data/tlds-alpha-by-domain.txt'):
++            tlds = [tld.strip().lower() for tld
++                    in resource_stream(__name__, '../../data/tlds-alpha-by-domain.txt')
++                        .readlines()
++                    if not tld.startswith('#')
++            ]
++
++        else:
++            log.warning(u"Couldn't open TLD list, falling back to minimal default")
++            tlds = 'com.org.net.za'.split('.')
++
++        self.grab.im_func.pattern = re.compile((
++            r'(?:[^@]\b|\A)('               # Match a boundry, but not on an e-mail address
++            r'(?:\w+://|(?:www|ftp)\.)\S+?' # Match an explicit URL or guess by www.
++            r'|[^@\s:]+\.(?:%s)(?:/\S*?)?'  # Guess at the URL based on TLD
++            r')[\[>)\]"\'.]*(?:\s|\Z)'      # End Boundry
++        ) % '|'.join(tlds), re.I | re.DOTALL)
++
++    @handler
      def grab(self, event, url):
++        event.addresponse(u'Matched %s', url)
++        return
          if url.find('://') == -1:
              if url.lower().startswith('ftp'):
                  url = 'ftp://%s' % url
@@ -154,7 +176,7 @@
      ))
      def setup(self):
--        self.lengthen.im_func.pattern = re.compile(r'^((?:%s)\S+)$' % '|'.join([re.escape(service) for service in self.services]), re.I)
++        self.lengthen.im_func.pattern = re.compile(r'^((?:%s)\S+)$' % '|'.join([re.escape(service) for service in self.services]), re.I|re.DOTALL)
      @handler
      def lengthen(self, event, url):