Merge lp:~salgado/launchpad/bug-61171 into lp:launchpad

Proposed by Guilherme Salgado
Status: Merged
Merged at revision: not available
Proposed branch: lp:~salgado/launchpad/bug-61171
Merge into: lp:launchpad
Diff against target: None lines
To merge this branch: bzr merge lp:~salgado/launchpad/bug-61171
Reviewer Review Type Date Requested Status
Paul Hummer (community) Approve
Leonard Richardson (community) Approve
Review via email: mp+9474@code.launchpad.net
To post a comment you must log in.
Revision history for this message
Guilherme Salgado (salgado) wrote :

= Summary =

Non-ASCII characters in an ASCII-encoded (and escaped) query string
cause our +login page to OOPS: https://launchpad.net/bugs/61171

== Proposed fix ==

Fix the bug by replacing the non-ASCII characters when decoding them
into unicode. That seems to be the only thing we can do.

I was hoping to test it at the view level, exercising just the method we
care about but that didn't work and I don't want to spend much time on
this, so I wrote a pagetest using testbrowser.

== Tests ==

./bin/test -vvt test_login

== Demo and Q/A ==

= Launchpad lint =

Checking for conflicts. and issues in doctests and templates.
Running jslint, xmllint, pyflakes, and pylint.
Using normal rules.

Linting changed files:
  lib/canonical/launchpad/webapp/login.py
  lib/canonical/launchpad/webapp/tests/test_login.py

Revision history for this message
Leonard Richardson (leonardr) wrote :

This is a good start, but stripping non-ASCII characters is an overzealous way to handle clients that generating URLs containing non-utf8 values. (In this case the encoding is windows-1252 or iso-8859-1.) This is bad behavior on the client's part, but we can do better than forcing everything to ASCII, especially since windows-1252 and iso-8859-1 are likely to be the vast majority of the problematic encodings.

It would be better to sniff the encoding (I recommend UnicodeDammit, which comes with Beautiful Soup, but you can just try to .decode with a bunch of different encodings) and only force the data to ASCII if the sniffing fails.

I recommend putting this code in a helper method that can be used at any Launchpad entry point used by a non-browser client.

This is why you're seeing this oops a lot from specific clients (apport and TenCent Traveler) -- those clients are programmed to use some encoding other than utf-8 for everything.

Revision history for this message
Guilherme Salgado (salgado) wrote :
Download full text (3.5 KiB)

Thanks for the help with this branch, Leonard.

On Thu, 2009-07-30 at 20:03 +0000, Leonard Richardson wrote:
> This is a good start, but stripping non-ASCII characters is an
> overzealous way to handle clients that generating URLs containing
> non-utf8 values. (In this case the encoding is windows-1252 or
> iso-8859-1.) This is bad behavior on the client's part, but we can do
> better than forcing everything to ASCII, especially since windows-1252
> and iso-8859-1 are likely to be the vast majority of the problematic
> encodings.

Agreed.
>
> It would be better to sniff the encoding (I recommend UnicodeDammit,
> which comes with Beautiful Soup, but you can just try to .decode with
> a bunch of different encodings) and only force the data to ASCII if
> the sniffing fails.

I've changed my implementation to use UnicodeDammit
>
> I recommend putting this code in a helper method that can be used at
> any Launchpad entry point used by a non-browser client.

I'll leave that for later, when the need arises.
>
> This is why you're seeing this oops a lot from specific clients
> (apport and TenCent Traveler) -- those clients are programmed to use

I think apport uses what is specified on the user's locale but anyway.

I'm including the incremental diff with this change. Please let me know
how it looks like now.

=== modified file 'lib/canonical/launchpad/webapp/login.py'
--- lib/canonical/launchpad/webapp/login.py 2009-07-30 18:00:42 +0000
+++ lib/canonical/launchpad/webapp/login.py 2009-07-30 19:56:28 +0000
@@ -9,6 +9,8 @@
 import urllib
 from datetime import datetime, timedelta

+from BeautifulSoup import UnicodeDammit
+
 from zope.component import getUtility
 from zope.session.interfaces import ISession, IClientIdManager
 from zope.event import notify
@@ -380,7 +382,8 @@
         for name, value in self.iter_form_items():
             # Thanks to apport (https://launchpad.net/bugs/61171), we need to
             # do this here.
- value = value.decode('ascii', 'replace')
+ dammit = UnicodeDammit(value)
+ value = value.decode(dammit.originalEncoding, 'replace')
             L.append(html % (name, cgi.escape(value, quote=True)))
         return '\n'.join(L)

=== modified file 'lib/canonical/launchpad/webapp/tests/test_login.py'
--- lib/canonical/launchpad/webapp/tests/test_login.py 2009-07-30 18:00:42 +0000
+++ lib/canonical/launchpad/webapp/tests/test_login.py 2009-07-30 20:13:46 +0000
@@ -19,14 +19,15 @@
         # Apport can construct ASCII-encoded URLs containing non-ASCII
         # characters (in the query string), where they send users to so that
         # they can finish reporting bugs. Apport shouldn't do that but we
- # can't OOPS when it does, so we decode the query string back and
- # replace those non-ASCII characters. For more details, see
- # https://launchpad.net/bugs/61171.
+ # can't OOPS when it does, so we use UnicodeDammit to figure out its
+ # original encoding and decode it back. If UnicodeDammit can't figure
+ # out the correct encoding, we replace those non-ASCII characters. For
+ # more details, see https://launchpad.net/bugs/61171.
       ...

Read more...

Revision history for this message
Leonard Richardson (leonardr) wrote :

This is much better, but there's no need to call value.decode. UnicodeDammit makes the Unicode string available as .markup. So approved* if you use that instead of decoding again.

review: Approve
Revision history for this message
Paul Hummer (rockstar) :
review: Approve

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'lib/canonical/launchpad/webapp/login.py'
2--- lib/canonical/launchpad/webapp/login.py 2009-06-25 05:30:52 +0000
3+++ lib/canonical/launchpad/webapp/login.py 2009-07-30 18:00:42 +0000
4@@ -376,11 +376,12 @@
5 def preserve_query(self):
6 """Return zero or more hidden inputs that preserve the URL's query."""
7 L = []
8+ html = '<input type="hidden" name="%s" value="%s" />'
9 for name, value in self.iter_form_items():
10- L.append('<input type="hidden" name="%s" value="%s" />' % (
11- name, cgi.escape(value, quote=True)
12- ))
13-
14+ # Thanks to apport (https://launchpad.net/bugs/61171), we need to
15+ # do this here.
16+ value = value.decode('ascii', 'replace')
17+ L.append(html % (name, cgi.escape(value, quote=True)))
18 return '\n'.join(L)
19
20
21
22=== added file 'lib/canonical/launchpad/webapp/tests/test_login.py'
23--- lib/canonical/launchpad/webapp/tests/test_login.py 1970-01-01 00:00:00 +0000
24+++ lib/canonical/launchpad/webapp/tests/test_login.py 2009-07-30 18:00:42 +0000
25@@ -0,0 +1,33 @@
26+# Copyright 2009 Canonical Ltd. This software is licensed under the
27+# GNU Affero General Public License version 3 (see the file LICENSE).
28+
29+__metaclass__ = type
30+
31+import unittest
32+
33+from canonical.launchpad.ftests import logout
34+from canonical.launchpad.testing.pages import setupBrowser
35+from canonical.testing import DatabaseFunctionalLayer
36+
37+from lp.testing import TestCase
38+
39+
40+class TestLoginOrRegister_preserve_query(TestCase):
41+ layer = DatabaseFunctionalLayer
42+
43+ def test_non_ascii_characters_in_query_string_ascii_encoded(self):
44+ # Apport can construct ASCII-encoded URLs containing non-ASCII
45+ # characters (in the query string), where they send users to so that
46+ # they can finish reporting bugs. Apport shouldn't do that but we
47+ # can't OOPS when it does, so we decode the query string back and
48+ # replace those non-ASCII characters. For more details, see
49+ # https://launchpad.net/bugs/61171.
50+ logout()
51+ browser = setupBrowser()
52+ browser.open('http://launchpad.dev/+login?foo=subproc%E9s')
53+ self.assertIn(
54+ 'subproc', browser.getControl(name='foo', index=0).value)
55+
56+
57+def test_suite():
58+ return unittest.TestLoader().loadTestsFromName(__name__)
59