Full TLD support in the grab regex

The test suite I used:
These match:
I: <email address hidden>
O:
I: google.com
O: google.com
I: x <email address hidden>
O:
I: http://foo.bar
O: http://foo.bar
I: <email address hidden>
O:
I: aoeuoeu <www.jar.com>
O: www.jar.com
I: so bar http://foo.bar/baz to jo
O: http://foo.bar/baz
I: 'http://bar.com'
O: http://bar.com
I: Thingie boo.com/a eue
O: boo.com/a
I: joe (www.google.com) says foo
O: www.google.com
I: http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/
O: http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/
I: aoeu http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/ aoeu
O: http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/
I: aoeu http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/. aoeu
O: http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/
I: http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/.
O: http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/
I: ouoe <http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/> aoeuao
O: http://www.heikkitoivonen.net/blog/2008/11/09/debugging-python-regular-expressions/

These don't:
I: http://en.wikipedia.org/wiki/Python_(programming_language)
O: http://en.wikipedia.org/wiki/Python_(programming_language)
I: Python <http://en.wikipedia.org/wiki/Python_(programming_language)> is a lekker language
O: http://en.wikipedia.org/wiki/Python_(programming_language)
I: Python <URL:http://en.wikipedia.org/wiki/Python_(programming_language)> is a lekker language
O: http://en.wikipedia.org/wiki/Python_(programming_language)
I: This is an IDN TLD: ﻢﺛﺎﻟ.ﺈﺨﺘﺑﺍﺭ/%D8%A7%D9%84%D8%B5%D9%81%D8%AD%D8%A9_%D8%A7%D9%84%D8%B1%D8%A6%D9%8A%D8%B3%D9%8A%D8%A9 ouoeu
O: ﻢﺛﺎﻟ.ﺈﺨﺘﺑﺍﺭ/%D8%A7%D9%84%D8%B5%D9%81%D8%AD%D8%A9_%D8%A7%D9%84%D8%B1%D8%A6%D9%8A%D8%B3%D9%8A%D8%A9

I specifically don't convert the xn--.* TLDs to Unicode form, because then we would have to use Unicode mode for the regex, which changes what \s and \w match.
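
To make the trade-off concrete, here is a minimal sketch of how \w and \s behave in ASCII versus Unicode matching. It assumes Python 3's re module, where str patterns are Unicode-aware by default and re.ASCII restores the narrower behaviour; the sample string and variable names are made up for illustration and are not taken from the branch. It also shows what converting one xn-- TLD (IANA's Arabic test TLD) to Unicode form would look like.

```python
import re

# An xn-- TLD in the ASCII form it has in a TLD list.
# (IANA's Arabic test TLD, used here only as an illustration.)
ascii_tld = "xn--kgbechtv"

# Converting to Unicode form is possible with the built-in "idna" codec,
# but matching the result would require a Unicode-aware regex.
unicode_tld = ascii_tld.encode("ascii").decode("idna")

# Sample text containing a non-ASCII letter and a non-breaking space.
sample = "caf\u00e9\u00a0bar"

# ASCII mode: \w is [a-zA-Z0-9_], \s is [ \t\n\r\f\v].
ascii_words = re.findall(r"\w+", sample, re.ASCII)
ascii_split = re.split(r"\s", sample, flags=re.ASCII)

# Unicode mode (the Python 3 default for str patterns):
# \w also matches accented letters, \s also matches U+00A0.
unicode_words = re.findall(r"\w+", sample)
unicode_split = re.split(r"\s", sample)

print(ascii_words, unicode_words)
print(ascii_split, unicode_split)
```

In ASCII mode the accented letter ends a word and the non-breaking space is not a separator; in Unicode mode both change. Every boundary in the grab regex that relies on \s or \w would shift in the same way, which is the reason for leaving the TLDs in xn-- form.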