Merge lp:~vauxoo/openobject-server/7.0-backport_new_html_sanitize into lp:openobject-server/7.0

Status: Needs review
Proposed branch: lp:~vauxoo/openobject-server/7.0-backport_new_html_sanitize
Merge into: lp:openobject-server/7.0
Diff against target: 135 lines (+90/-14)
1 file modified
openerp/tools/mail.py (+90/-14)
To merge this branch: bzr merge lp:~vauxoo/openobject-server/7.0-backport_new_html_sanitize
Reviewer Review Type Date Requested Status
Olivier Dony (Odoo) (surface scan) Pending
Review via email: mp+186367@code.launchpad.net

This proposal supersedes a proposal from 2013-09-18.

Description of the change

Overwritten method html_sanitizer from tools, to add 3 new features:

1) Avoid render a link to hosts not allowed, for this we add a new configuration with a host allow list in ir.configuration model

2) Avoide render a link with a scheme not allowed, also added a list with schemes allow like https in ir.configuration model

3) Backport new features from trunk branch

To post a comment you must log in.
Revision history for this message
Olivier Dony (Odoo) (odo-openerp) wrote : Posted in a previous version of this proposal

Thanks for the proposal!
Assuming this MP is targeted to 7.0, you need to re-submit it with the correct target branch, which will solve the huge diff/conflicts problem.
Now, based on the summary and diff it seems more like an improvement/new feature than a bugfix, so it should actually be for trunk and not for 7.0.

Note: calling the normal eval() on a user-provided string is very dangerous, you should use ast.literal_eval or safe_eval.expr_eval.

Thanks,

review: Needs Resubmitting ((surface scan))
Revision history for this message
Jose Antonio Morales Ponce(vauxoo) - - http://www.vauxoo.com (josemoralesp) wrote : Posted in a previous version of this proposal

You will need this library, that I overwrite

5082. By Jose Antonio Morales Ponce(vauxoo) - - http://www.vauxoo.com

[IMP] Added inpect to know the elements for method allowed

5083. By Jose Antonio Morales Ponce(vauxoo) - - http://www.vauxoo.com

[IMP]

Unmerged revisions

5083. By Jose Antonio Morales Ponce(vauxoo) - - http://www.vauxoo.com

[IMP]

5082. By Jose Antonio Morales Ponce(vauxoo) - - http://www.vauxoo.com

[IMP] Added inpect to know the elements for method allowed

5081. By Jose Antonio Morales Ponce(vauxoo) - - http://www.vauxoo.com

[IMP] Change eval for safe_eval to better the code

5080. By Jose Antonio Morales Ponce(vauxoo) - - http://www.vauxoo.com

[IMP] Avoid apply eval if not exist configuration

5079. By Jose Antonio Morales Ponce(vauxoo) - - http://www.vauxoo.com

[IMP] Overwritten method html_sanitizer from tools, to add 3 new features:

1) Avoid render a link to hosts not allowed, for this we add a new configuration with a host allow list in ir.configuration model

2) Avoide render a link with a scheme not allowed, also added a list with schemes allow like https in ir.configuration model

3) Backport new features from trunk branch

5078. By Jose Antonio Morales Ponce(vauxoo) - - http://www.vauxoo.com

[IMP] Backport from trunk to add new features in html_sanitize method

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'openerp/tools/mail.py'
2--- openerp/tools/mail.py 2013-03-29 15:41:30 +0000
3+++ openerp/tools/mail.py 2013-09-19 15:03:35 +0000
4@@ -20,6 +20,7 @@
5 ##############################################################################
6
7 from lxml import etree
8+from openerp.tools.safe_eval import safe_eval
9 import cgi
10 import logging
11 import lxml.html
12@@ -29,7 +30,16 @@
13 import re
14 import socket
15 import threading
16+from lxml.html import xhtml_to_html, _transform_result
17+
18+try:
19+ from urlparse import urlsplit
20+except ImportError:
21+ # Python3
22+ from urllib.parse import urlsplit
23 import time
24+import inspect
25+from openerp import SUPERUSER_ID
26
27 from openerp.loglevels import ustr
28
29@@ -43,28 +53,94 @@
30 tags_to_kill = ["script", "head", "meta", "title", "link", "style", "frame", "iframe", "base", "object", "embed"]
31 tags_to_remove = ['html', 'body', 'font']
32
33-
34-def html_sanitize(src):
35+# allow new semantic HTML5 tags
36+allowed_tags = clean.defs.tags | frozenset('article section header footer hgroup nav aside figure'.split())
37+safe_attrs = clean.defs.safe_attrs | frozenset(['style'])
38+
39+
40+
41+def html_sanitize(src, silent=True):
42 if not src:
43 return src
44+ logger = logging.getLogger(__name__ + '.html_sanitize')
45+ cr = False
46+ host_whitelist = False
47+ scheme_allowed = False
48+ try:
49+ db_name = getattr(threading.currentThread(), 'dbname', None)
50+ if db_name:
51+ local_cr = cr = pooler.get_db(db_name).cursor()
52+ except:
53+ logger.warning.error("Data Base not found")
54+
55 src = ustr(src, errors='replace')
56
57+
58 # html encode email tags
59 part = re.compile(r"(<(([^a<>]|a[^<>\s])[^<>]*)@[^<>]+>)", re.IGNORECASE | re.DOTALL)
60 src = part.sub(lambda m: cgi.escape(m.group(1)), src)
61-
62- # some corner cases make the parser crash (such as <SCRIPT/XSS SRC=\"http://ha.ckers.org/xss.js\"></SCRIPT> in test_mail)
63+
64+ if cr:
65+ ir_config_pool = pooler.get_pool(cr.dbname).get('ir.config_parameter')
66+ host_whitelist = ir_config_pool.get_param(cr, SUPERUSER_ID, 'host_whitelist')
67+ host_whitelist = host_whitelist and safe_eval(host_whitelist)
68+ scheme_allowed = ir_config_pool.get_param(cr, SUPERUSER_ID, 'scheme_allowed')
69+ scheme_allowed = scheme_allowed and safe_eval(scheme_allowed)
70+
71+ kwargs = {
72+ 'page_structure': True,
73+ 'whitelist_tags': None, # do not exclude url for their tag
74+ 'host_whitelist': host_whitelist or (),
75+ 'style': False, # do not remove style attributes
76+ 'forms': True, # remove form tags
77+ 'remove_unknown_tags': False,
78+ 'allow_tags': allowed_tags,
79+ }
80+ if etree.LXML_VERSION >= (2, 3, 1):
81+ # kill_tags attribute has been added in version 2.3.1
82+ kwargs.update({
83+ 'kill_tags': tags_to_kill,
84+ 'remove_tags': tags_to_remove,
85+ })
86+ else:
87+ kwargs['remove_tags'] = tags_to_kill + tags_to_remove
88+
89+ if etree.LXML_VERSION >= (3, 1, 0):
90+ kwargs.update({
91+ 'safe_attrs_only': True,
92+ 'safe_attrs': safe_attrs,
93+ })
94+ else:
95+ # lxml < 3.1.0 does not allow to specify safe_attrs. We keep all attributes in order to keep "style"
96+ kwargs['safe_attrs_only'] = False
97+
98 try:
99- cleaner = clean.Cleaner(page_structure=True, style=False, safe_attrs_only=False, forms=False, kill_tags=tags_to_kill, remove_tags=tags_to_remove)
100- cleaned = cleaner.clean_html(src)
101- except TypeError, e:
102- # lxml.clean version < 2.3.1 does not have a kill_tags attribute
103- # to remove in 2014
104- cleaner = clean.Cleaner(page_structure=True, style=False, safe_attrs_only=False, forms=False, remove_tags=tags_to_kill+tags_to_remove)
105- cleaned = cleaner.clean_html(src)
106- except:
107- _logger.warning('html_sanitize failed to parse %s' % (src))
108- cleaned = '<p>Impossible to parse</p>'
109+ # some corner cases make the parser crash (such as <SCRIPT/XSS SRC=\"http://ha.ckers.org/xss.js\"></SCRIPT> in test_mail)
110+ cleaner = clean.Cleaner(**kwargs)
111+ body = lxml.html.document_fromstring(src)
112+ for el, attrib, link, pos in body.iterlinks():
113+ scheme, netloc, path, query, fragment = urlsplit(link)
114+ elements = inspect.getargspec(cleaner.allow_embedded_url)
115+ elements = elements.args
116+ if len(elements) > 3 and not \
117+ cleaner.allow_embedded_url(el, link, scheme_allowed) or \
118+ cleaner.allow_embedded_url(el, link) and \
119+ scheme and el.tag != 'a':
120+ el.getparent().replace(el, clean.fromstring(cgi.escape(clean.tostring(el))))
121+ cleaned = cleaner.clean_html(body)
122+ if not isinstance(cleaned, basestring):
123+ cleaned = _transform_result(unicode, cleaned)
124+
125+ except etree.ParserError:
126+ if not silent:
127+ raise
128+ logger.warning('ParserError obtained when sanitizing %r', src, exc_info=True)
129+ cleaned = '<p>ParserError when sanitizing</p>'
130+ except Exception:
131+ if not silent:
132+ raise
133+ logger.warning('unknown error obtained when sanitizing %r', src, exc_info=True)
134+ cleaned = '<p>Unknown error when sanitizing</p>'
135 return cleaned
136
137