Code review comment for lp:~stefanor/ibid/url-397900

Stefano Rivera (stefanor) wrote :

Full TLD support in the grab regex
The test suite I used:

These match:
I: <email address hidden>
O:
I: google.com
O: google.com
I: x <email address hidden>
O:
I: http://foo.bar
O: http://foo.bar
I: <email address hidden>
O:
I: aoeuoeu <www.jar.com>
O: www.jar.com
I: so bar http://foo.bar/baz to jo
O: http://foo.bar/baz
I: 'http://bar.com'
O: http://bar.com
I: Thingie boo.com/a eue
O: boo.com/a
I: joe (www.google.com) says foo
O: www.google.com
I: http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/
O: http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/
I: aoeu http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/ aoeu
O: http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/
I: aoeu http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/. aoeu
O: http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/
I: http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/.
O: http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/
I: ouoe <http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/> aoeuao
O: http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/

These don't:
I: http://en.wikipedia.org/wiki/Python_(programming_language)
O: http://en.wikipedia.org/wiki/Python_(programming_language)
I: Python <http://en.wikipedia.org/wiki/Python_(programming_language)> is a lekker language
O: http://en.wikipedia.org/wiki/Python_(programming_language)
I: Python <URL:http://en.wikipedia.org/wiki/Python_(programming_language)> is a lekker language
O: http://en.wikipedia.org/wiki/Python_(programming_language)
I: This is an IDN TLD: ﻢﺛﺎﻟ.ﺈﺨﺘﺑﺍﺭ/%D8%A7%D9%84%D8%B5%D9%81%D8%AD%D8%A9_%D8%A7%D9%84%D8%B1%D8%A6%D9%8A%D8%B3%D9%8A%D8%A9 ouoeu
O: ﻢﺛﺎﻟ.ﺈﺨﺘﺑﺍﺭ/%D8%A7%D9%84%D8%B5%D9%81%D8%AD%D8%A9_%D8%A7%D9%84%D8%B1%D8%A6%D9%8A%D8%B3%D9%8A%D8%A9
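
For readers who want to replay a few of the I:/O: pairs above, here is a minimal sketch of a URL-grabbing regex with a TLD whitelist. It is not the regex from this branch: the names TLDS, URL_RE and grab_urls, the three-entry TLD list, and the rule for stripping trailing punctuation are all assumptions made for illustration, and the email address in the usage note is a made-up stand-in for the hidden ones.

```python
import re

# Hypothetical subset of TLDs, for illustration only; not the full list.
TLDS = ("com", "net", "org")

URL_RE = re.compile(
    r"""
    (?<![\w@.-])                  # not mid-word, and not right after '@' (email address)
    (?:
        https?://[^\s<>'"]+       # anything with an explicit scheme
      |
        (?:[a-z0-9-]+\.)+         # bare host: one or more labels...
        (?:%s)\b                  # ...ending in a whitelisted TLD
        (?:/[^\s<>'"]*)?          # optional path
    )
    """ % "|".join(TLDS),
    re.IGNORECASE | re.VERBOSE | re.ASCII,
)


def grab_urls(text):
    """Return URL-like substrings of text, minus trailing sentence punctuation."""
    return [m.group(0).rstrip(".,;:!?") for m in URL_RE.finditer(text)]
```

With these assumptions, grab_urls("joe (www.google.com) says foo") returns ['www.google.com'], and grab_urls("x joe@example.com") returns an empty list because of the lookbehind on '@'.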

I deliberately don't convert the xn--.* TLDs to their Unicode form: doing so would require compiling the regex in Unicode mode, which changes what \s and \w match.
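
The \s and \w concern can be illustrated with a short sketch. It assumes Python 3, where str patterns are Unicode-aware by default and re.ASCII restores the narrower classes (in Python 2 it was the other way round, with re.UNICODE as an opt-in flag). The sample string and the münchen.de hostname are made up for the demonstration.

```python
import re

sample = "caf\u00e9\u00a0bar"  # 'café', a no-break space, then 'bar'

# ASCII mode: \w is [a-zA-Z0-9_] and \s is [ \t\n\r\f\v]
ascii_words = re.findall(r"\w+", sample, re.ASCII)       # 'é' ends the first word
ascii_split = re.split(r"\s+", sample, flags=re.ASCII)   # no-break space is not \s

# Unicode mode: \w also covers accented letters, \s also covers U+00A0
unicode_words = re.findall(r"\w+", sample)
unicode_split = re.split(r"\s+", sample)

# Keeping an IDN in its ASCII-compatible xn-- form keeps the input pure ASCII,
# so an ASCII-mode regex can still be applied to it.
ace = "m\u00fcnchen.de".encode("idna").decode("ascii")
```

Here ascii_words is ['caf', 'bar'] while unicode_words is ['café', 'bar'], and only the Unicode-mode split breaks the string at the no-break space.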
