Merge lp:~openerp-dev/openobject-server/trunk-text2html-chm into lp:openobject-server

Proposed by Christophe Matthieu (OpenERP)
Status: Needs review
Proposed branch: lp:~openerp-dev/openobject-server/trunk-text2html-chm
Merge into: lp:openobject-server
Diff against target: 163 lines (+129/-1)
2 files modified
openerp/tests/test_mail.py (+85/-1)
openerp/tools/mail.py (+44/-0)
To merge this branch: bzr merge lp:~openerp-dev/openobject-server/trunk-text2html-chm
Reviewer Review Type Date Requested Status
Thibault Delavallée (OpenERP) (community) Needs Information
Christophe Matthieu (OpenERP) (community) Needs Resubmitting
Vo Minh Thu (community) Disapprove
Christophe Simonis (OpenERP) Needs Fixing
Review via email: mp+135607@code.launchpad.net
To post a comment you must log in.
4582. By Christophe Matthieu (OpenERP)

[IMP] tools mail: improve regexp for link

4583. By Christophe Matthieu (OpenERP)

[IMP] mail: add test for html_regexp_link

4584. By Christophe Matthieu (OpenERP)

[IMP] tools mail: improve regex for link and add comment

4585. By Christophe Matthieu (OpenERP)

[FIX] tools mail: fix test and regexp

Revision history for this message
Vo Minh Thu (thu) wrote :

Howdy (I think this is my first code review for you ;-)

- You have defined a html_regexp_link() function. Please write a proper doc string (not some half-assed comment above).

- Please name the function correctly; its current name doesn't mean anything. In particular its name should make its intent clear. The fact it uses regex for its job is an implementation detail, no need to state it in its name.

- Your variable names are weird: you mix underscore_separated with camelCase, e.g. new_html and innerText. Why not use simple words like original, input, output, processed, expected, and so on.

- Your tests seem weird: why does each link end with $@ ? Why does each line end with a . ?

- Please think your comments are meant to be read by people who don't know what your code do. For instance

  # 7. create link

would probably read better if it was something along the line of

  #7. Convert URIs to HTML links.

- Instead of a big innerText (together with its outText), you can make a list of pairs and loop over it.

Now on the biggest gripe I have: please use a existing tool instead of reinventing the wheel all over again, e.g. https://github.com/jsocol/bleach#readme

If you insist on reimplementing stuff on your own, maybe add a proper test suite, see maybe e.g. https://github.com/jsocol/bleach/tree/master/bleach/tests

In particular, I would try things like

  <a>

alone, or

  <href

alone, or

  hellowww.openerp.com

you know, weird stuff that people actually write.

review: Disapprove
4586. By Christophe Matthieu (OpenERP)

[IMP] tools mail: Convert html_regexp_link into URIs_to_html_link; Improve regexp for URIs_to_html_link; Add test for uri.

Revision history for this message
Christophe Simonis (OpenERP) (kangol) wrote :

https links wont works...

review: Needs Fixing
4587. By Christophe Matthieu (OpenERP)

[IMP] tools mail: add target="_blank" for the external link html.

4588. By Christophe Matthieu (OpenERP)

[IMP] tools mail: add test for protocol http, https, ftp for URIs_to_html_link

4589. By Christophe Matthieu (OpenERP)

[IMP] tools mail: add tests for test_URIs_to_html_link

4590. By Christophe Matthieu (OpenERP)

[IMP] tools mail: change comment

Revision history for this message
Christophe Matthieu (OpenERP) (chm-openerp) :
review: Needs Resubmitting
4591. By Christophe Matthieu (OpenERP)

[MERGE] from trunk

Revision history for this message
Vo Minh Thu (thu) wrote :

Tests seem nice but not enough.

From the top of my head I would add the //www.google.com case, i.e. the protocol part can be left out in href and src attributes.

Make it clear somewhere, like in a spec, if it is normal the www.google.com becomes http://www.google.com in the href attribute (instead of www.google.com).

I believe the code doesn't work for simply google.com.

Does it work for unicode URLs ?

I have pushed a branch at lp:~openerp-dev/openobject-server/trunk-text2html-chm-vmt. It shows additional tests and a way to generate more of them. Some values are commented out because they don't work.

Again, I don't see the point of replicating an existing library. If you do so, you'd better give some good reason in the merge proposal comment (e.g. the desired behavior is not available) and reuse their test suite.

Don't add a Resubmit: it is something the reviewer can say, not the reviewee ;-)

review: Disapprove
Revision history for this message
Vo Minh Thu (thu) wrote :

B.t.w, should it work for email address (and use the mailto scheme) ?

Revision history for this message
Vo Minh Thu (thu) wrote :

The test using hellowww.abcde.fr is wrong: hellowww is a correct subdomain (even if the user didn't intended to write that).

Revision history for this message
Christophe Matthieu (OpenERP) (chm-openerp) wrote :

hellowww.abcde.fr or words.words.it : These links will be processed if they are preceded by http.
Whether to extend the application, I can add processing something.something.something

Revision history for this message
Xavier (Open ERP) (xmo-deactivatedaccount) wrote :

> Make it clear somewhere, like in a spec, if it is normal the www.google.com becomes http://www.google.com in the href attribute (instead of www.google.com).

For what it's worth, Apple's auto-likifier (e.g. in Apple Mail) seems to do the same: `google.com` will be hyperlinked to http://google.com/, and `www.google.com` to http://www.google.com/

Revision history for this message
Xavier (Open ERP) (xmo-deactivatedaccount) wrote :

And bleach.linkify does the same, on your paragraphs with links in them it outputs:

----

Make it clear somewhere, like in a spec, if it is normal the <a href="http://www.google.com" rel="nofollow">www.google.com</a> becomes <a href="http://www.google.com" rel="nofollow">http://www.google.com</a> in the href attribute (instead of <a href="http://www.google.com" rel="nofollow">www.google.com</a>).

I believe the code doesn't work for simply <a href="http://google.com" rel="nofollow">google.com</a>.

<a href="http://hellowww.abcde.fr" rel="nofollow">hellowww.abcde.fr</a> or <a href="http://words.words.it" rel="nofollow">words.words.it</a> : These links will be processed if they are preceded by http.
Whether to extend the application, I can add processing something.something.something

4592. By Christophe Matthieu (OpenERP)

[IMP] tools mail: URIs_to_html_link, convert link with all subdomain (not limited to www and ftp)

4593. By Christophe Matthieu (OpenERP)

[IMP] tools mail: URIs_to_html_link, convert email address into link

4594. By Christophe Matthieu (OpenERP)

[IMP] tools mail: URIs_to_html_link, add test email

4595. By Christophe Matthieu (OpenERP)

[IMP] tools mail: change double quote into quote for email link

Revision history for this message
Christophe Matthieu (OpenERP) (chm-openerp) :
review: Needs Resubmitting
Revision history for this message
Vo Minh Thu (thu) wrote :

Please read again my previous comments.

4596. By Christophe Matthieu (OpenERP)

[IMP]tools mail: add // protocol

4597. By Christophe Matthieu (OpenERP)

[IMP] tools mail: refactoring of URIs_to_html_link

4598. By Christophe Matthieu (OpenERP)

[IMP] tools mail: add test for URIs_to_html_link

Revision history for this message
Thibault Delavallée (OpenERP) (tde-openerp) wrote :

Reported for 7.1 :) .

review: Needs Information

Unmerged revisions

4598. By Christophe Matthieu (OpenERP)

[IMP] tools mail: add test for URIs_to_html_link

4597. By Christophe Matthieu (OpenERP)

[IMP] tools mail: refactoring of URIs_to_html_link

4596. By Christophe Matthieu (OpenERP)

[IMP]tools mail: add // protocol

4595. By Christophe Matthieu (OpenERP)

[IMP] tools mail: change double quote into quote for email link

4594. By Christophe Matthieu (OpenERP)

[IMP] tools mail: URIs_to_html_link, add test email

4593. By Christophe Matthieu (OpenERP)

[IMP] tools mail: URIs_to_html_link, convert email address into link

4592. By Christophe Matthieu (OpenERP)

[IMP] tools mail: URIs_to_html_link, convert link with all subdomain (not limited to www and ftp)

4591. By Christophe Matthieu (OpenERP)

[MERGE] from trunk

4590. By Christophe Matthieu (OpenERP)

[IMP] tools mail: change comment

4589. By Christophe Matthieu (OpenERP)

[IMP] tools mail: add tests for test_URIs_to_html_link

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'openerp/tests/test_mail.py'
2--- openerp/tests/test_mail.py 2012-11-28 13:51:01 +0000
3+++ openerp/tests/test_mail.py 2012-12-13 13:27:23 +0000
4@@ -23,7 +23,7 @@
5 ##############################################################################
6
7 import unittest2
8-from openerp.tools import html_sanitize, html_email_clean, append_content_to_html, plaintext2html
9+from openerp.tools import html_sanitize, html_email_clean, URIs_to_html_link, append_content_to_html, plaintext2html
10
11 HTML_SOURCE = """
12 <font size="2" style="color: rgb(31, 31, 31); font-family: monospace; font-variant: normal; line-height: normal; ">test1</font>
13@@ -177,6 +177,90 @@
14 new_html = html_email_clean(False)
15 self.assertEqual(new_html, False, 'html_email_cleaner did change a False in an other value.')
16
17+ def test_URIs_to_html_link(self):
18+ # Test1: Test if the method convert text links and do not convert html links
19+
20+ schemes = ('http://', 'https://', 'ftp://', '', '//')
21+ pages = ('', '/about/Santa.html', '?debug=&ts=4', '#action=1[3,4]&$%~.-;:=,?@+', ':8080', ':8080/mail.php', ':8080?variables=5', ':8080#title3')
22+ TLDSs = ('be', 'com', 'ac.be')
23+ urls = (
24+ 'google.',
25+ 'www.google.',
26+ 'info.fundp.',
27+ 't.website.',
28+ 'ns.phx4.nearlyfreespeech.')
29+ links = [
30+ ( scheme + url + TLDS + page ,
31+ '<a href="' + (scheme or 'http://') + url + TLDS + page + '" target="_blank">' + scheme + url + TLDS + page + '</a>' )
32+ for TLDS in TLDSs for page in pages for scheme in schemes for url in urls]
33+
34+ links = links + [
35+ # test standard link begin with a protocol http
36+ ( 'balbal http://localhost.truc.be ',
37+ 'balbal <a href="http://localhost.truc.be" target="_blank">http://localhost.truc.be</a> '),
38+ # test standard link begin with a protocol http
39+ ( ' //localhost.truc.be ',
40+ ' <a href="//localhost.truc.be" target="_blank">//localhost.truc.be</a> '),
41+ # test standard link begin with a protocol https
42+ ( 'balbal https://localhost.truc.be ',
43+ 'balbal <a href="https://localhost.truc.be" target="_blank">https://localhost.truc.be</a> '),
44+ # test standard link begin with a protocol ftp
45+ ( 'balbal ftp://localhost.truc.be ',
46+ 'balbal <a href="ftp://localhost.truc.be" target="_blank">ftp://localhost.truc.be</a> '),
47+ # test standard link begin with a protocol, with port and special chars
48+ ( 'balbal http://localhost:8069/?debug=&ts=4#action=1[3,4]&$%~.-;:=,?@+ ',
49+ 'balbal <a href="http://localhost:8069/?debug=&ts=4#action=1[3,4]&$%~.-;:=,?@+" target="_blank">http://localhost:8069/?debug=&ts=4#action=1[3,4]&$%~.-;:=,?@+</a> '),
50+ # test standard link
51+ ( 'toto www.onewebsite.be www.truc<a> hellowww.abcde.fr <abc>www.test.fr</abc>',
52+ 'toto <a href="http://www.onewebsite.be" target="_blank">www.onewebsite.be</a> www.truc<a> <a href="http://hellowww.abcde.fr" target="_blank">hellowww.abcde.fr</a> <abc><a href="http://www.test.fr" target="_blank">www.test.fr</a></abc>'),
53+ # test link wrong html tag
54+ ( 'toto <>www.onewebsite.be<> www.truc.communication<a>',
55+ 'toto <><a href="http://www.onewebsite.be" target="_blank">www.onewebsite.be</a><> www.truc.communication<a>'),
56+ ( 'toto <span www.onewebsite.be www.truc<a>',
57+ 'toto <span <a href="http://www.onewebsite.be" target="_blank">www.onewebsite.be</a> www.truc<a>'),
58+ # test with IP website url
59+ ( '123 http://234.33.45.01:8069 ???',
60+ '123 <a href="http://234.33.45.01:8069" target="_blank">http://234.33.45.01:8069</a> ???'),
61+ # test with smiles
62+ ( ':-> <bold>ftp.ftpsite.be:8069/</bold> :->',
63+ ':-> <bold><a href="http://ftp.ftpsite.be:8069/" target="_blank">ftp.ftpsite.be:8069/</a></bold> :->'),
64+ # test with images and smiles
65+ ( '<:-) <img src="http://www.link.com/tata"/> :->',
66+ '<:-) <img src="http://www.link.com/tata"/> :->'),
67+ # test with image in an external html link and smiles
68+ ( '<a href="http://www.link.com/tata" class="truc"><:-)<img src="http://www.link.com/tata"/>:-></a>',
69+ '<a href="http://www.link.com/tata" class="truc" target="_blank"><:-)<img src="http://www.link.com/tata"/>:-></a>'),
70+ # test image with relative link
71+ ( '<img src="./tata"/>',
72+ '<img src="./tata"/>'),
73+ # test html link with relative link
74+ ( '<a href="./tata">beta</a>',
75+ '<a href="./tata">beta</a>'),
76+ # test relative link
77+ ( ' ./link.com/to/page ',
78+ ' ./link.com/to/page '),
79+ # test short link with good tlds
80+ ( ' google.com google.be ',
81+ ' <a href="http://google.com" target="_blank">google.com</a> <a href="http://google.be" target="_blank">google.be</a> '),
82+ # test standard email
83+ ( 'Abc.123@example.com ',
84+ '<a href=\'mailto:Abc.123@example.com\' target=\'_blank\'>Abc.123@example.com</a> '),
85+ # test standard email
86+ ( 'user+mailbox/department=shipping@example.com ',
87+ '<a href=\'mailto:user+mailbox/department=shipping@example.com\' target=\'_blank\'>user+mailbox/department=shipping@example.com</a> '),
88+ # test complexe email
89+ ( '"Joe.#$%&\'*+-/=? Blow"@example.com ',
90+ '<a href=\'mailto:"Joe.#$%&\'*+-/=? Blow"@example.com\' target=\'_blank\'>"Joe.#$%&\'*+-/=? Blow"@example.com</a> '),
91+ ]
92+
93+ for link in links:
94+ # test for convert
95+ output = URIs_to_html_link( link[0] )
96+ self.assertEqual(output, link[1], 'test_URIs_to_html_link did convert links for case : %s' % link[0])
97+ # converted link don't change
98+ output = URIs_to_html_link( link[1] )
99+ self.assertEqual(output, link[1], 'test_URIs_to_html_link did convert links for case : %s' % link[1] )
100+
101 class TestHtmlTools(unittest2.TestCase):
102 """ Test some of our generic utility functions about html """
103
104
105=== modified file 'openerp/tools/mail.py'
106--- openerp/tools/mail.py 2012-11-26 18:15:27 +0000
107+++ openerp/tools/mail.py 2012-12-13 13:27:23 +0000
108@@ -110,6 +110,47 @@
109 #----------------------------------------------------------
110 # HTML Cleaner
111 #----------------------------------------------------------
112+# Convert URIs to HTML links.
113+def URIs_to_html_link(html):
114+
115+ TLDS = "|".join("""ac ad ae aero af ag ai al am an ao aq ar arpa as asia at au aw ax az
116+ ba bb bd be bf bg bh bi biz bj bm bn bo br bs bt bv bw by bz ca cat
117+ cc cd cf cg ch ci ck cl cm cn co com coop cr cu cv cx cy cz de dj dk
118+ dm do dz ec edu ee eg er es et eu fi fj fk fm fo fr ga gb gd ge gf gg
119+ gh gi gl gm gn gov gp gq gr gs gt gu gw gy hk hm hn hr ht hu id ie il
120+ im in info int io iq ir is it je jm jo jobs jp ke kg kh ki km kn kp
121+ kr kw ky kz la lb lc li lk lr ls lt lu lv ly ma mc md me mg mh mil mk
122+ ml mm mn mo mobi mp mq mr ms mt mu museum mv mw mx my mz na name nc ne
123+ net nf ng ni nl no np nr nu nz om org pa pe pf pg ph pk pl pm pn pr pro
124+ ps pt pw py qa re ro rs ru rw sa sb sc sd se sg sh si sj sk sl sm sn so
125+ sr st su sv sy sz tc td tel tf tg th tj tk tl tm tn to tp tr travel tt
126+ tv tw tz ua ug uk us uy uz va vc ve vg vi vn vu wf ws xn ye yt yu za zm
127+ zw""".split())
128+ IP = r"[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+"
129+ link = r"[\w]([.-]?[\w]+)*"
130+ outerlink = r"(:[0-9]+)?([/?#][/\w\#&$\%~.\-;:=,?@\[\]+]*)?"
131+ protocol = "|".join("""http:// https:// ftp:// // """.split())
132+ email = r"(([\w]+([.+/=-][\w]+)*)|\"[^\"]+\")@(%s)\.(%s)" % (link, TLDS)
133+
134+ # add target blanck on external link
135+ part = re.compile(r"(<a (?:[^t>]|t(?!arget=['\"]_blank['\"]))*?(?:https?|ftp)\://(?:[^t>]|t(?!arget=['\"]_blank['\"]))*?)(>.+?</a>)", re.IGNORECASE | re.DOTALL)
136+ html = part.sub(r'\1 target="_blank"\2', html)
137+
138+ # avoid to have a tag link in an tag image or tag link
139+ REidempotent = r"(^|\s|<>|<[^a][^<>]*[>\s]|<a[^\s][^<>]*[>\s])(?!<a\s[^>]+>|href=[\"']|href=[\"']mailto:|src=[\"']|url=[\"'])"
140+ end = r"(?![\w])"
141+
142+ # find the links are beginning with http://, https://, ftp://...
143+ part = re.compile(r"%s((%s)(%s|%s)(%s))%s" % (REidempotent, protocol, link, IP, outerlink, end), re.IGNORECASE | re.DOTALL)
144+ html = part.sub(r'\1<a href="\2" target="_blank">\2</a>', html)
145+ # find the links are beginning with www., ftp. or other subdomain
146+ part = re.compile(r"%s((%s)\.(%s)(%s))%s" % (REidempotent, link, TLDS, outerlink, end), re.IGNORECASE | re.DOTALL)
147+ html = part.sub(r'\1<a href="http://\2" target="_blank">\2</a>', html)
148+ # find email and convert into link with mailto
149+ part = re.compile(r"%s(%s)%s" % (REidempotent, email, end), re.IGNORECASE | re.DOTALL)
150+ html = part.sub(r"\1<a href='mailto:\2' target='_blank'>\2</a>", html)
151+
152+ return html
153
154 def html_email_clean(html):
155 """ html_email_clean: clean the html to display in the web client.
156@@ -183,6 +224,9 @@
157 br_div_tags = re.compile(r'(<div>\s*<br\s*\/>\s*<\/div>)')
158 html = _replace_matching_regex(br_div_tags, html, '<br />')
159
160+ # 7. Convert URIs to HTML links.
161+ html = URIs_to_html_link(html)
162+
163 return html
164
165