Merge lp:~jcsackett/launchpad/linkifier-bugs-2 into lp:launchpad

Proposed by j.c.sackett on 2010-12-16
Status: Merged
Approved by: j.c.sackett on 2010-12-17
Approved revision: no longer in the source branch.
Merged at revision: 12102
Proposed branch: lp:~jcsackett/launchpad/linkifier-bugs-2
Merge into: lp:launchpad
Diff against target: 218 lines (+135/-18)
2 files modified
lib/lp/app/browser/stringformatter.py (+67/-15)
lib/lp/app/browser/tests/test_stringformatter.py (+68/-3)
To merge this branch: bzr merge lp:~jcsackett/launchpad/linkifier-bugs-2
Reviewer Review Type Date Requested Status
Brad Crittenden (community) code 2010-12-16 Approve on 2010-12-17
Review via email: mp+43980@code.launchpad.net

Commit Message

[r=bac][ui=none][bug=60172,118284] Fixes linkifier so URIs of just a protocol are not linked, and parenthesis are handled better.

Description of the Change

Summary
=======

As part of bugjam2010, a series of bugs having to do with the linkifier have been targeted. This branch deals with standard URI links that need some postprocessing after the regex capture.

URIs are identified in stringformatter via a serious regex, but to do certain things (like reject urls that consist of just a protocol with no domain) you would have to modify the regex to catch invalid urls, which would make it even harder to deal with. Instead it's easier and cleaner to just make decisions about when to linkify based on the output of the regex.

Implementation
==============

Two methods have been added to the FormattersAPI in lib/lp/app/browser/stringformatter.py

The first, _linkify_url_is_blacklisted, adds capacity to reject URIs for linkification if they are in a predefined list. The list is all of the protocols defined in the regex as standalone URIs (e.g. http://, sftp://). This is used in the linkify process to return the URL as text rather than as a link if it's only the protocol.

The second _handle_parens_in_trailers deals with closing parenthesis following the url that should be considered part of the url--e.g. the closing parenthesis is http://wikipedia.com/Fell_(comic). This is used in the linkification process to move the closing parenthesis out of the trailers and back into the URI.

Tests
=====

bin/test -t TestLinkifyingProtocols

QA
==

Create a bug comment on launchpad.dev with the following text:

"http:// isn't a link. http://example.com/something_(with_parens) should be fully linked."

The first http:// shouldn't link. The second URI should link, including the closing paren at the end.

Lint
====

= Launchpad lint =

Checking for conflicts and issues in changed files.

Linting changed files:
  lib/lp/app/browser/stringformatter.py
  lib/lp/app/browser/tests/test_stringformatter.py

./lib/lp/app/browser/stringformatter.py
     450: E501 line too long (90 characters)
     454: E501 line too long (88 characters)
     450: Line exceeds 78 characters.
     454: Line exceeds 78 characters.

The two lines that are too long are part of the regex, which is difficult enough to read without breaking up those sections.

To post a comment you must log in.
Brad Crittenden (bac) wrote :

Hi Jon,

* _handle_parens_in_trailers needs a doc string with explanation of parameters. Just looking at it cold I cannot figure out the purpose. A trailer is something you haul cows to the sale barn in. You also agreed to add some more corner case tests -- thanks!

* In your blacklist method please use triple double-quotes for the docstring open/close and do put in closing punctuation.

* On IRC we discussed a name better than 'blacklist'. How about 'ignored' or something even better if you can think of it?

* Please create a dict and use named targets in:

83 + return '<a rel="nofollow" href="%s">%s</a>%s' % (
84 + cgi.escape(url, quote=True),
85 + add_word_breaks(cgi.escape(url)),
86 + cgi.escape(trailers))

* FormattersAPI._split_url_and_trailers is only used once but its definition is way down below it -- I think some code was re-arranged. Please put it closer to the call site. Also document in English what is being matched for the RE impaired to easily understand it.

Other than those issues the branch looks good and thorough. Thanks. Marking it approved, pending changes.

review: Approve (code)
j.c.sackett (jcsackett) wrote :

> Hi Jon,
>
> * _handle_parens_in_trailers needs a doc string with explanation of
> parameters. Just looking at it cold I cannot figure out the purpose. A
> trailer is something you haul cows to the sale barn in. You also agreed to
> add some more corner case tests -- thanks!

Done.

> * In your blacklist method please use triple double-quotes for the docstring
> open/close and do put in closing punctuation.

Done.

> * On IRC we discussed a name better than 'blacklist'. How about 'ignored' or
> something even better if you can think of it?

Couldn't think of better than ignored. Done.

> * Please create a dict and use named targets in:
>
> 83 + return '<a rel="nofollow" href="%s">%s</a>%s' % (
> 84 + cgi.escape(url, quote=True),
> 85 + add_word_breaks(cgi.escape(url)),
> 86 + cgi.escape(trailers))

Done.

> * FormattersAPI._split_url_and_trailers is only used once but its definition
> is way down below it -- I think some code was re-arranged. Please put it
> closer to the call site.

It's actually called in several places; I've moved the method so it's grouped with _handle_parens and the former blacklist method, that places it closer to the methods it's called in.

> Also document in English what is being matched for
> the RE impaired to easily understand it.

I'm still not sure which RE section you're asking about? Can you include the block or file number so I can find it?

>
> Other than those issues the branch looks good and thorough. Thanks. Marking
> it approved, pending changes.

Brad Crittenden (bac) wrote :

Thanks for the fixes. I was referring to _re_url_trailers.

j.c.sackett (jcsackett) wrote :

> I was referring to _re_url_trailers.

Ah, okay. I was trying to find an regex I had made or altered, therein was my confusion. I've updated comments for it. Thanks for the review.

Robert Collins (lifeless) wrote :

blacklisting protocols seems bizarre.

Why not just create a URL object and check that more than the scheme
is filled out?

j.c.sackett (jcsackett) wrote :

That actually seems bizarre to me; why build a new object when we're working with strings?

This also allows us to expand to other things we might wish to ignore, should another case crop up.

Blacklisting as a notion has been removed, if the loaded definition of that bothers you--we're just treating them as things the linkifier should ignore.

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'lib/lp/app/browser/stringformatter.py'
2--- lib/lp/app/browser/stringformatter.py 2010-12-14 17:33:05 +0000
3+++ lib/lp/app/browser/stringformatter.py 2010-12-17 18:13:52 +0000
4@@ -201,6 +201,34 @@
5 return '&nbsp;' * len(groups[0])
6
7 @staticmethod
8+ def _linkify_bug_number(text, bugnum, trailers=''):
9+ # Don't look up the bug or display anything about the bug, just
10+ # linkify to the general bug url.
11+ url = '/bugs/%s' % bugnum
12+ # The text will have already been cgi escaped.
13+ return '<a href="%s">%s</a>%s' % (url, text, trailers)
14+
15+ @staticmethod
16+ def _handle_parens_in_trailers(url, trailers):
17+ """Move closing parens from the trailer back into the url if needed.
18+
19+ If there are opening parens in the url that are matched by closing
20+ parens at the start of the trailer, those closing parens should be
21+ part of the url."""
22+ opencount = url.count('(')
23+ closedcount = url.count(')')
24+ missing = opencount - closedcount
25+ slice_idx = 0
26+ while slice_idx < missing:
27+ if trailers[slice_idx] == ')':
28+ slice_idx += 1
29+ else:
30+ break
31+ url += trailers[:slice_idx]
32+ trailers = trailers[slice_idx:]
33+ return url, trailers
34+
35+ @staticmethod
36 def _split_url_and_trailers(url):
37 """Given a URL return a tuple of the URL and punctuation trailers.
38
39@@ -216,15 +244,31 @@
40 url = url[:-len(trailers)]
41 else:
42 trailers = ''
43- return url, trailers
44+ return FormattersAPI._handle_parens_in_trailers(url, trailers)
45
46 @staticmethod
47- def _linkify_bug_number(text, bugnum, trailers=''):
48- # Don't look up the bug or display anything about the bug, just
49- # linkify to the general bug url.
50- url = '/bugs/%s' % bugnum
51- # The text will have already been cgi escaped.
52- return '<a href="%s">%s</a>%s' % (url, text, trailers)
53+ def _linkify_url_should_be_ignored(url):
54+ """Don't linkify URIs consisting of just the protocol."""
55+
56+ protocol_bases = [
57+ 'about',
58+ 'gopher',
59+ 'http',
60+ 'https',
61+ 'sftp',
62+ 'news',
63+ 'ftp',
64+ 'mailto',
65+ 'irc',
66+ 'jabber',
67+ 'apt',
68+ 'data',
69+ ]
70+
71+ for base in protocol_bases:
72+ if url in ('%s' % base, '%s:' % base, '%s://' % base):
73+ return True
74+ return False
75
76 @staticmethod
77 def _linkify_substitution(match):
78@@ -235,16 +279,22 @@
79 # The text will already have been cgi escaped. We temporarily
80 # unescape it so that we can strip common trailing characters
81 # that aren't part of the URL.
82- url = match.group('url')
83- url, trailers = FormattersAPI._split_url_and_trailers(url)
84+ full_url = match.group('url')
85+ url, trailers = FormattersAPI._split_url_and_trailers(full_url)
86 # We use nofollow for these links to reduce the value of
87 # adding spam URLs to our comments; it's a way of moderately
88 # devaluing the return on effort for spammers that consider
89 # using Launchpad.
90- return '<a rel="nofollow" href="%s">%s</a>%s' % (
91- cgi.escape(url, quote=True),
92- add_word_breaks(cgi.escape(url)),
93- cgi.escape(trailers))
94+ if not FormattersAPI._linkify_url_should_be_ignored(url):
95+ link_string = ('<a rel="nofollow" '
96+ 'href="%(url)s">%(linked_text)s</a>%(trailers)s' % {
97+ 'url': cgi.escape(url, quote=True),
98+ 'linked_text': add_word_breaks(cgi.escape(url)),
99+ 'trailers': cgi.escape(trailers)
100+ })
101+ return link_string
102+ else:
103+ return full_url
104 elif match.group('faq') is not None:
105 # This is *BAD*. We shouldn't be doing database lookups to
106 # linkify text.
107@@ -423,8 +473,10 @@
108 ''' % {'unreserved': "-a-zA-Z0-9._~%!$&'()*+,;="},
109 re.IGNORECASE | re.VERBOSE)
110
111- # a pattern to match common trailing punctuation for URLs that we
112- # don't want to include in the link.
113+ # There is various punctuation that can occur at the end of a link that
114+ # shouldn't be included. The regex below matches on the set of characters
115+ # we don't generally want. See also _handle_parens_in_trailers, which
116+ # re-attaches parens if we do want them to be part of the url.
117 _re_url_trailers = re.compile(r'([,.?:);>]+)$')
118
119 def text_to_html(self):
120
121=== modified file 'lib/lp/app/browser/tests/test_stringformatter.py'
122--- lib/lp/app/browser/tests/test_stringformatter.py 2010-12-14 19:56:40 +0000
123+++ lib/lp/app/browser/tests/test_stringformatter.py 2010-12-17 18:13:52 +0000
124@@ -107,8 +107,72 @@
125 <tag>1234567890123456</tag>
126 """
127
128+
129 class TestLinkifyingProtocols(TestCase):
130
131+ def test_normal_set(self):
132+ test_strings = [
133+ "http://example.com",
134+ "http://example.com/",
135+ "http://example.com/path",
136+ "http://example.com/path/",
137+ ]
138+
139+ expected_strings = [
140+ ('<p><a rel="nofollow" href="http://example.com">'
141+ 'http://<wbr></wbr>example.<wbr></wbr>com</a></p>'),
142+ ('<p><a rel="nofollow" href="http://example.com/">'
143+ 'http://<wbr></wbr>example.<wbr></wbr>com/</a></p>'),
144+ ('<p><a rel="nofollow" href="http://example.com/path">'
145+ 'http://<wbr></wbr>example.<wbr></wbr>com/path</a></p>'),
146+ ('<p><a rel="nofollow" href="http://example.com/path/">'
147+ 'http://<wbr></wbr>example.<wbr></wbr>com/path/</a></p>'),
148+ ]
149+
150+ self.assertEqual(
151+ expected_strings,
152+ [FormattersAPI(text).text_to_html() for text in test_strings])
153+
154+ def test_parens_handled_well(self):
155+ test_strings = [
156+ '(http://example.com)',
157+ 'http://example.com/path_(with_parens)',
158+ '(http://example.com/path_(with_parens))',
159+ '(http://example.com/path_(with_parens)and_stuff)',
160+ ]
161+
162+ expected_html = [
163+ ('<p>(<a rel="nofollow" href="http://example.com">'
164+ 'http://<wbr></wbr>example.<wbr></wbr>com</a>)</p>'),
165+ ('<p><a rel="nofollow" '
166+ 'href="http://example.com/path_(with_parens)">'
167+ 'http://<wbr></wbr>example.<wbr></wbr>com/path_'
168+ '<wbr></wbr>(with_parens)</a></p>'),
169+ ('<p>(<a rel="nofollow" '
170+ 'href="http://example.com/path_(with_parens)">'
171+ 'http://<wbr></wbr>example.<wbr></wbr>com/path_'
172+ '<wbr></wbr>(with_parens)</a>)</p>'),
173+ ('<p>(<a rel="nofollow" '
174+ 'href="http://example.com/path_(with_parens)and_stuff">'
175+ 'http://<wbr></wbr>example.<wbr></wbr>com'
176+ '/path_<wbr></wbr>(with_parens)<wbr></wbr>and_stuff</a>)</p>'),
177+ ]
178+
179+ self.assertEqual(
180+ expected_html,
181+ [FormattersAPI(text).text_to_html() for text in test_strings])
182+
183+ def test_protocol_alone_does_not_link(self):
184+ test_string = "This doesn't link: apt:"
185+ html = FormattersAPI(test_string).text_to_html()
186+ expected_html = "<p>This doesn't link: apt:</p>"
187+ self.assertEqual(expected_html, html)
188+
189+ test_string = "This doesn't link: http://"
190+ html = FormattersAPI(test_string).text_to_html()
191+ expected_html = "<p>This doesn't link: http://</p>"
192+ self.assertEqual(expected_html, html)
193+
194 def test_apt_is_linked(self):
195 test_string = 'This becomes a link: apt:some-package'
196 html = FormattersAPI(test_string).text_to_html()
197@@ -124,7 +188,7 @@
198 expected_html = (
199 '<p>This becomes a link: '
200 '<a rel="nofollow" '
201- 'href="apt://some-package">apt://some-<wbr></wbr>package</a></p>')
202+ 'href="apt://some-package">apt://some-<wbr></wbr>package</a></p>')
203 self.assertEqual(expected_html, html)
204
205 def test_file_is_not_linked(self):
206@@ -141,9 +205,10 @@
207 expected_html = (
208 "<p>This becomes a link: "
209 '<a rel="nofollow" '
210- 'href="data:text/plain,test">data:text/<wbr></wbr>plain,test</a></p>')
211+ 'href="data:text/plain,test">'
212+ 'data:text/<wbr></wbr>plain,test</a></p>')
213 self.assertEqual(expected_html, html)
214-
215+
216
217 class TestDiffFormatter(TestCase):
218 """Test the string formatter fmt:diff."""