Merge into old-trunk-pack-0.92 : url-397900 : Code : Ibid

Reviewer	Date Requested	Status
Jonathan Hitchcock		Approve on 2009-07-13
Michael Gorven	2009-07-13	Approve on 2009-07-13
Review via email: mp+8696@code.launchpad.net

Stefano Rivera (stefanor) wrote on 2009-07-10: Posted in a previous version of this proposal

Yes, I've tested it, but that doesn't mean that you mustn't

Michael Gorven (mgorven) wrote on 2009-07-11: Posted in a previous version of this proposal

review approve

review: Approve

Stefano Rivera (stefanor) wrote on 2009-07-11: Posted in a previous version of this proposal

Test cases I used in the bug report (thanks mithrandi for the last one)

Jonathan Hitchcock (vhata) wrote on 2009-07-12: Posted in a previous version of this proposal

review: Approve

Michael Gorven (mgorven) wrote on 2009-07-12: Posted in a previous version of this proposal

review approve

review: Approve

Stefano Rivera (stefanor) wrote on 2009-07-13: Posted in a previous version of this proposal

Full TLD support in the grab regex
The test suite I used:

These match:
I: joe@bar.com
O:
I: google.com
O: google.com
I: x joe@google.com
O:
I: http://foo.bar
O: http://foo.bar
I: <joe@bar.com>
O:
I: aoeuoeu <www.jar.com>
O: www.jar.com
I: so bar http://foo.bar/baz to jo
O: http://foo.bar/baz
I: 'http://bar.com'
O: http://bar.com
I: Thingie boo.com/a eue
O: boo.com/a
I: joe (www.google.com) says foo
O: www.google.com
I: http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/
O: http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/
I: aoeu http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/ aoeu
O: http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/
I: aoeu http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/. aoeu
O: http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/
I: http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/.
O: http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/
I: ouoe <http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/> aoeuao
O: http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/

These don't:
I: http://en.wikipedia.org/wiki/Python_(programming_language)
O: http://en.wikipedia.org/wiki/Python_(programming_language)
I: Python <http://en.wikipedia.org/wiki/Python_(programming_language)> is a lekker language
O: http://en.wikipedia.org/wiki/Python_(programming_language)
I: Python <URL:http://en.wikipedia.org/wiki/Python_(programming_language)> is a lekker language
O: http://en.wikipedia.org/wiki/Python_(programming_language)
I: This is an IDN TLD: ﻢﺛﺎﻟ.ﺈﺨﺘﺑﺍﺭ/%D8%A7%D9%84%D8%B5%D9%81%D8%AD%D8%A9_%D8%A7%D9%84%D8%B1%D8%A6%D9%8A%D8%B3%D9%8A%D8%A9 ouoeu
O: ﻢﺛﺎﻟ.ﺈﺨﺘﺑﺍﺭ/%D8%A7%D9%84%D8%B5%D9%81%D8%AD%D8%A9_%D8%A7%D9%84%D8%B1%D8%A6%D9%8A%D8%B3%D9%8A%D8%A9

I deliberately don't convert the xn--.* TLDs to Unicode form, because we would then have to compile the regex in Unicode mode, which changes what \s and \w match.
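
A rough illustration of the Unicode-mode concern, written as a Python 3 sketch rather than the Python 2 this branch targets. In Python 2, Unicode matching was opt-in via re.UNICODE; in Python 3 it is the default for str patterns and re.ASCII switches it off, so the two modes are compared that way here. The sample strings and variable names are made up for the example.

```python
import re

# Python 3: str patterns are Unicode-aware by default; re.ASCII restores
# the narrower ASCII-only meaning of \w and \s.
word_unicode = re.compile(r'\w+')
word_ascii = re.compile(r'\w+', re.ASCII)
space_unicode = re.compile(r'\s')
space_ascii = re.compile(r'\s', re.ASCII)

sample = 'caf\u00e9 example'  # "café example" (hypothetical sample text)
nbsp = '\u00a0'               # NO-BREAK SPACE

# \w+ stops at the accented letter only in ASCII mode.
unicode_word = word_unicode.match(sample).group(0)
ascii_word = word_ascii.match(sample).group(0)

# \s treats the no-break space as whitespace only in Unicode mode.
nbsp_is_space_unicode = space_unicode.match(nbsp) is not None
nbsp_is_space_ascii = space_ascii.match(nbsp) is not None

print(repr(unicode_word), repr(ascii_word))
print(nbsp_is_space_unicode, nbsp_is_space_ascii)
```

Because the grab pattern leans on \w, \s and \b for its boundaries, a change of mode would shift where matches start and end.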

Michael Gorven (mgorven) wrote on 2009-07-13: Posted in a previous version of this proposal

+ event.addresponse(u'Matched %s', url)
+ return

Er, is that supposed to be there? Looks fine otherwise.

Stefano Rivera (stefanor) wrote on 2009-07-13:

This branch has become rather a cesspool of fixes. Enjoy

Michael Gorven (mgorven) wrote on 2009-07-13:

Yay for tests!
review approve

review: Approve

lp:~stefanor/ibid/url-397900 updated on 2009-07-13

719. By Stefano Rivera on 2009-07-13: Remove extraneous imports

Jonathan Hitchcock (vhata) wrote on 2009-07-13:

Rocking!

Now we need tests for all the other plugins :)

review: Approve

Merge lp:~stefanor/ibid/url-397900 into lp:~ibid-core/ibid/old-trunk-pack-0.92

Preview Diff

 === removed file 'NOTES'
 --- NOTES	2009-02-18 10:41:40 +0000
 +++ NOTES	1970-01-01 00:00:00 +0000
@@ -1,22 +0,0 @@
--general
--=======
--- 'object' and 'type' are probably keywords that shouldn't be reused, btw
--- proper exception handling needed all round (and maybe some more error checks?)
--- could a Processor dynamically alter priority?  of itself, or another class?  if so, would it ever want to?  "learning"?
--- the patch on ConfigObj only does one level of list interpolation.  Does ConfigObj support multi-level nested lists?
--
--plugins
--=======
--- process() is still a bit biased towards type=message?
--- how about having a lambda decorator like @match(pattern) to allow other types of matching?
--- reload config is a good example of somewhere that error reporting should occur
--- load/reload is a good example of the reloader not being generic enough... I think?
--- is ignore strong enough?  it just sets processed to true - postprocessors won't ignore this.  is that bad?
--- applause to Michael for the Responses processor: that's clever
--  -- that said, not at all convinced by Address processor
--    -- it takes a basestring only, not a dict as well
--    -- do we even want what it does?  cf. the Announce plugin - I think Processors should decide for themselves whether they want to address their replies or not.  (But what about private replies - adding an address in to that is crap)
--- identity plugin:  'identify' is different in different scopes and it's a little confusing - not cool to a first-time reader
--- twisted helper functions (with that whole blocking-wrapper thing) may as well be written
--- sources processor:  needs a bit of error checking, authentication and...  all sorts.
--  -- (no, I know, it was there for debugging, just a note for Future Us)
 === renamed file 'dbus-ping.py' => 'attic/dbus-ping.py'
 === added directory 'data'
 === added file 'data/README'
 --- data/README	1970-01-01 00:00:00 +0000
 +++ data/README	2009-07-13 14:23:03 +0000
@@ -0,0 +1,5 @@
++These are data files used by plugins, that don't change very much.
++Thus outdated versions shouldn't be a major issue.
++
++Sources:
++http://data.iana.org/TLD/tlds-alpha-by-domain.txt
 === added file 'data/tlds-alpha-by-domain.txt'
 --- data/tlds-alpha-by-domain.txt	1970-01-01 00:00:00 +0000
 +++ data/tlds-alpha-by-domain.txt	2009-07-13 14:23:03 +0000
@@ -0,0 +1,281 @@
++# Version 2009071300, Last Updated Mon Jul 13 07:07:02 2009 UTC
++AC
++AD
++AE
++AERO
++AF
++AG
++AI
++AL
++AM
++AN
++AO
++AQ
++AR
++ARPA
++AS
++ASIA
++AT
++AU
++AW
++AX
++AZ
++BA
++BB
++BD
++BE
++BF
++BG
++BH
++BI
++BIZ
++BJ
++BM
++BN
++BO
++BR
++BS
++BT
++BV
++BW
++BY
++BZ
++CA
++CAT
++CC
++CD
++CF
++CG
++CH
++CI
++CK
++CL
++CM
++CN
++CO
++COM
++COOP
++CR
++CU
++CV
++CX
++CY
++CZ
++DE
++DJ
++DK
++DM
++DO
++DZ
++EC
++EDU
++EE
++EG
++ER
++ES
++ET
++EU
++FI
++FJ
++FK
++FM
++FO
++FR
++GA
++GB
++GD
++GE
++GF
++GG
++GH
++GI
++GL
++GM
++GN
++GOV
++GP
++GQ
++GR
++GS
++GT
++GU
++GW
++GY
++HK
++HM
++HN
++HR
++HT
++HU
++ID
++IE
++IL
++IM
++IN
++INFO
++INT
++IO
++IQ
++IR
++IS
++IT
++JE
++JM
++JO
++JOBS
++JP
++KE
++KG
++KH
++KI
++KM
++KN
++KP
++KR
++KW
++KY
++KZ
++LA
++LB
++LC
++LI
++LK
++LR
++LS
++LT
++LU
++LV
++LY
++MA
++MC
++MD
++ME
++MG
++MH
++MIL
++MK
++ML
++MM
++MN
++MO
++MOBI
++MP
++MQ
++MR
++MS
++MT
++MU
++MUSEUM
++MV
++MW
++MX
++MY
++MZ
++NA
++NAME
++NC
++NE
++NET
++NF
++NG
++NI
++NL
++NO
++NP
++NR
++NU
++NZ
++OM
++ORG
++PA
++PE
++PF
++PG
++PH
++PK
++PL
++PM
++PN
++PR
++PRO
++PS
++PT
++PW
++PY
++QA
++RE
++RO
++RS
++RU
++RW
++SA
++SB
++SC
++SD
++SE
++SG
++SH
++SI
++SJ
++SK
++SL
++SM
++SN
++SO
++SR
++ST
++SU
++SV
++SY
++SZ
++TC
++TD
++TEL
++TF
++TG
++TH
++TJ
++TK
++TL
++TM
++TN
++TO
++TP
++TR
++TRAVEL
++TT
++TV
++TW
++TZ
++UA
++UG
++UK
++US
++UY
++UZ
++VA
++VC
++VE
++VG
++VI
++VN
++VU
++WF
++WS
++XN--0ZWM56D
++XN--11B5BS3A9AJ6G
++XN--80AKHBYKNJ4F
++XN--9T4B11YI5A
++XN--DEBA0AD
++XN--G6W251D
++XN--HGBK6AJ7F53BBA
++XN--HLCJ6AYA9ESC7A
++XN--JXALPDLP
++XN--KGBECHTV
++XN--ZCKZAH
++YE
++YT
++YU
++ZA
++ZM
++ZW
 === modified file 'ibid/plugins/http.py'
 --- ibid/plugins/http.py	2009-05-01 12:17:57 +0000
 +++ ibid/plugins/http.py	2009-07-13 15:19:39 +0000
@@ -17,7 +17,7 @@
      max_size = IntOption('max_size', 'Only request this many bytes', 500)
--    @match(r'^(get|head)\s+(.+)$')
++    @match(r'^(get|head)\s+(\S+\.\S+)$')
      def handler(self, event, action, url):
          if not url.lower().startswith("http://") and not url.lower().startswith("https://"):
              url = "http://" + url
 === modified file 'ibid/plugins/url.py'
 --- ibid/plugins/url.py	2009-07-10 12:01:48 +0000
 +++ ibid/plugins/url.py	2009-07-13 15:17:24 +0000
@@ -5,6 +5,7 @@
  import logging
  import re
++from pkg_resources import resource_exists, resource_stream
  from sqlalchemy import Column, Integer, Unicode, DateTime, UnicodeText, ForeignKey, Table
  import ibid
@@ -112,7 +113,26 @@
      password  = Option('delicious_password', 'delicious account password')
      delicious = Delicious()
--    @match(r'((?:\S+://|(?:www|ftp)\.)\S+|\S+\.(?:com|org|net|za)\S*)')
++    def setup(self):
++        if resource_exists(__name__, '../../data/tlds-alpha-by-domain.txt'):
++            tlds = [tld.strip().lower() for tld
++                    in resource_stream(__name__, '../../data/tlds-alpha-by-domain.txt')
++                        .readlines()
++                    if not tld.startswith('#')
++            ]
++
++        else:
++            log.warning(u"Couldn't open TLD list, falling back to minimal default")
++            tlds = 'com.org.net.za'.split('.')
++
++        self.grab.im_func.pattern = re.compile((
++            r'(?:[^@]\b|\A)('               # Match a boundary, but not on an e-mail address
++            r'(?:\w+://|(?:www|ftp)\.)\S+?' # Match an explicit URL or guess by www.
++            r'|[^@\s:]+\.(?:%s)(?:/\S*?)?'  # Guess at the URL based on TLD
++            r')[\[>)\]"\'.]*(?:\s|\Z)'      # End boundary
++        ) % '|'.join(tlds), re.I | re.DOTALL)
++
++    @handler
      def grab(self, event, url):
          if url.find('://') == -1:
              if url.lower().startswith('ftp'):
@@ -154,7 +174,7 @@
      ))
      def setup(self):
--        self.lengthen.im_func.pattern = re.compile(r'^((?:%s)\S+)$' % '|'.join([re.escape(service) for service in self.services]), re.I)
++        self.lengthen.im_func.pattern = re.compile(r'^((?:%s)\S+)$' % '|'.join([re.escape(service) for service in self.services]), re.I|re.DOTALL)
      @handler
      def lengthen(self, event, url):
 === modified file 'ibid/test/plugins/test_core.py'
 --- ibid/test/plugins/test_core.py	2009-03-05 16:33:12 +0000
 +++ ibid/test/plugins/test_core.py	2009-07-13 14:47:02 +0000
@@ -22,15 +22,24 @@
      def assert_addressed(self, event, addressed, message):
          self.assert_(hasattr(event, 'addressed'))
          self.assertEqual(event.addressed, addressed)
--        self.assertEqual(event.message.strip(), message)
++        self.assertEqual(event.message['deaddressed'].strip(), message)
++
++    def create_event(self, message, event_type=u'message'):
++        event = Event(u'fakesource', event_type)
++        event.message = {
++            'raw': message,
++            'deaddressed': message,
++            'clean': message,
++            'stripped': message,
++        }
++        return event
      def test_non_messages(self):
          for event_type in [u'timer', u'rpc']:
--            event = Event(u'fakesource', event_type)
--            event.message = u'bot: foo'
++            event = self.create_event(u'bot: foo', event_type)
              self.processor.process(event)
              self.assertFalse(hasattr(event, u'addressed'))
--            self.assertEqual(event.message, u'bot: foo')
++            self.assertEqual(event.message['deaddressed'], u'bot: foo')
      happy_prefixes = [
          (u'bot', u':  '),
@@ -40,8 +49,7 @@
      def test_happy_prefix_names(self):
          for prefix in self.happy_prefixes:
--            event = Event(u'fakesource', u'message')
--            event.message = u'%s%sfoo' % prefix
++            event = self.create_event(u'%s%sfoo' % prefix)
              self.processor.process(event)
              self.assert_addressed(event, prefix[0], u'foo')
@@ -53,8 +61,7 @@
      def test_sad_prefix_names(self):
          for prefix in self.sad_prefixes:
--            event = Event(u'fakesource', u'message')
--            event.message = u'%s%sfoo' % prefix
++            event = self.create_event(u'%s%sfoo' % prefix)
              self.processor.process(event)
              self.assert_addressed(event, False, u'%s%sfoo' % prefix)
@@ -66,8 +73,7 @@
      def test_happy_suffix_names(self):
          for suffix in self.happy_suffixes:
--            event = Event(u'fakesource', u'message')
--            event.message = u'foo%s%s' % suffix
++            event = self.create_event(u'foo%s%s' % suffix)
              self.processor.process(event)
              self.assert_addressed(event, suffix[1], u'foo')
@@ -80,8 +86,7 @@
      def test_sad_suffix_names(self):
          for suffix in self.sad_suffixes:
--            event = Event(u'fakesource', u'message')
--            event.message = u'foo%s%s' % suffix
++            event = self.create_event(u'foo%s%s' % suffix)
              self.processor.process(event)
              self.assert_addressed(event, False, u'foo%s%s' % suffix)
 === added file 'ibid/test/plugins/test_url.py'
 --- ibid/test/plugins/test_url.py	1970-01-01 00:00:00 +0000
 +++ ibid/test/plugins/test_url.py	2009-07-13 15:17:19 +0000
@@ -0,0 +1,57 @@
++from twisted.trial import unittest
++import ibid.test
++
++from ibid.event import Event
++from ibid.plugins import url
++
++class TestURLGrabber(unittest.TestCase):
++
++    def setUp(self):
++        self.grab = url.Grab(u'testplugin')
++
++    good_grabs = [
++        (u'google.com', u'google.com'),
++        (u'http://foo.bar', u'http://foo.bar'),
++        (u'aoeuoeu <www.jar.com>', u'www.jar.com'),
++        (u'aoeuoeu <www.jar.com> def', u'www.jar.com'),
++        (u'<www.jar.com>', u'www.jar.com'),
++        (u'so bar http://foo.bar/baz to jo', u'http://foo.bar/baz'),
++        (u"'http://bar.com'", u'http://bar.com'),
++        (u'Thingie boo.com/a eue', u'boo.com/a'),
++        (u'joe (www.google.com) says foo', u'www.google.com'),
++        (u'http://www.example.net/blog/2008/11/09/debugging-python-regular-expressions/',
++            u'http://www.example.net/blog/2008/11/09/debugging-python-regular-expressions/'),
++        (u'aoeu http://www.example.net/blog/2008/11/09/debugging-python-regular-expressions/ aoeu',
++            u'http://www.example.net/blog/2008/11/09/debugging-python-regular-expressions/'),
++        (u'aoeu http://www.example.net/blog/2008/11/09/debugging-python-regular-expressions/. aoeu',
++            u'http://www.example.net/blog/2008/11/09/debugging-python-regular-expressions/'),
++        (u'http://www.example.net/blog/2008/11/09/debugging-python-regular-expressions/.',
++            u'http://www.example.net/blog/2008/11/09/debugging-python-regular-expressions/'),
++        (u'ouoe <http://www.example.net/blog/2008/11/09/debugging-python-regular-expressions/> aoeuao',
++            u'http://www.example.net/blog/2008/11/09/debugging-python-regular-expressions/'),
++        # We accept that the following are non-optimal
++        (u'http://en.example.org/wiki/Python_(programming_language)',
++            u'http://en.example.org/wiki/Python_(programming_language'),
++        (u'Python <http://en.example.org/wiki/Python_(programming_language)> is a lekker language',
++            u'http://en.example.org/wiki/Python_(programming_language'),
++        (u'Python <URL:http://en.example.org/wiki/Python_(programming_language)> is a lekker language',
++            u'http://en.example.org/wiki/Python_(programming_language'),
++    ]
++
++    def test_good_grabs(self):
++        for input, url in self.good_grabs:
++            m = self.grab.grab.im_func.pattern.search(input)
++            self.assertEqual(m.group(1), url)
++
++    bad_grabs = [
++        u'joe@bar.com',
++        u'x joe@google.com',
++        u'<joe@bar.com>',
++    ]
++
++    def test_bad_grabs(self):
++        for input in self.bad_grabs:
++            m = self.grab.grab.im_func.pattern.search(input)
++            self.assertEqual(m, None)
++
++# vi: set et sta sw=4 ts=4:
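
For anyone who wants to poke at the grab pattern outside Ibid, the following is a minimal standalone sketch in Python 3, not the plugin code itself. It copies the pattern from the url.py hunk above, but SAMPLE_TLDS, build_grab_pattern and grab_url are hypothetical names introduced here, the short TLD list stands in for data/tlds-alpha-by-domain.txt, and re.escape and re.ASCII are additions of the sketch (the latter to approximate Python 2's default, non-Unicode matching).

```python
import re

# Hypothetical stand-in for the IANA list in data/tlds-alpha-by-domain.txt;
# these four are the same fallback TLDs the branch uses.
SAMPLE_TLDS = ['com', 'org', 'net', 'za']


def build_grab_pattern(tlds):
    """Compile a URL-grabbing pattern for the given TLDs (sketch only)."""
    return re.compile((
        r'(?:[^@]\b|\A)('               # Boundary, but not inside an e-mail address
        r'(?:\w+://|(?:www|ftp)\.)\S+?' # Explicit scheme, or guess from www./ftp.
        r'|[^@\s:]+\.(?:%s)(?:/\S*?)?'  # Guess at a URL based on its TLD
        r')[\[>)\]"\'.]*(?:\s|\Z)'      # Trailing punctuation, then end boundary
    ) % '|'.join(re.escape(tld) for tld in tlds),
        # re.ASCII is an assumption of this sketch, see the note above.
        re.I | re.DOTALL | re.ASCII)


def grab_url(pattern, text):
    """Return the first URL-like substring in text, or None."""
    match = pattern.search(text)
    return match.group(1) if match else None


pattern = build_grab_pattern(SAMPLE_TLDS)

for line in [
    'google.com',
    'so bar http://foo.bar/baz to jo',
    'joe (www.google.com) says foo',
    'Thingie boo.com/a eue',
    'joe@bar.com',
    'x joe@google.com',
]:
    print(repr(line), '->', repr(grab_url(pattern, line)))
```

The inputs are a subset of the cases in test_url.py; the e-mail addresses are expected to yield None, mirroring bad_grabs.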