Python2.5 Unicode-bug when using sgmllib.py: UnicodeDecodeError

Bug #240929 reported by Flemming Bjerke
Affects             Status         Importance  Assigned to  Milestone
Python              Fix Committed  Unknown
python2.5 (Ubuntu)  Fix Released   Medium      Unassigned
python2.6 (Ubuntu)  Fix Released   Medium      Unassigned

Bug Description

I had to select "I don't know" for the package because the form couldn't find python2.5. Strange.

The bug is described here:
http://mail.python.org/pipermail/python-bugs-list/2007-February/037082.html

John Nagle has explained and solved the bug:

Found the problem. In sgmllib.py for Python 2.5, in convert_charref, the
code for handling character escapes assumes that ASCII characters have
values up to 255.

But the correct limit is 127, of course.

If a Unicode string is run through SGMLParser, and that string has a
character in an attribute with a value between 128 and 255, which is valid
in Unicode, the value is passed through as a character with "chr", creating a
one-character invalid ASCII string.

Then, when the bad string is later converted to Unicode as the output is
assembled, the UnicodeDecodeError exception is raised.
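
The failure mode described above can be sketched as follows. This is only an illustration, written for Python 3: Python 2's chr(n) byte string is modelled with bytes([n]), and the implicit str-to-unicode conversion is modelled with an explicit decode using the 'ascii' codec. The helper names old_convert and join_as_unicode are hypothetical and are not part of sgmllib.

```python
def old_convert(n):
    """Model of Python 2's chr(n): a one-byte string for 0 <= n <= 255."""
    return bytes([n])


def join_as_unicode(prefix, raw):
    """Model of Python 2 concatenating unicode + str: the byte string is
    implicitly decoded with the default 'ascii' codec."""
    return prefix + raw.decode("ascii")


# 65 is plain ASCII ('A'), so the implicit decode succeeds.
ok = join_as_unicode("x", old_convert(65))

# 183 (0xB7) is a valid code point but not valid ASCII, so the decode
# step raises UnicodeDecodeError, the exception named in this report.
try:
    join_as_unicode("x", old_convert(183))
    failed = False
except UnicodeDecodeError:
    failed = True
```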

So the fix is to change 255 to 127 in convert_charref in sgmllib.py,
as shown below. This forces characters above 127 to be expressed with
escape sequences. Please patch accordingly. Thanks.

def convert_charref(self, name):
    """Convert character reference, may be overridden."""
    try:
        n = int(name)
    except ValueError:
        return
    if not 0 <= n <= 127:  # ASCII ends at 127, not 255
        return
    return self.convert_codepoint(n)
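
As a rough, standalone illustration of what the changed limit does (a sketch, not the sgmllib code itself): the function below copies the check from the patch but takes the limit as a parameter, and the caller named expand is a hypothetical stand-in that assumes a refused reference is simply left as its escape sequence.

```python
def convert_charref(name, limit=127):
    """Return the character for a decimal reference, or None if refused."""
    try:
        n = int(name)
    except ValueError:
        return None
    if not 0 <= n <= limit:
        return None
    return chr(n)


def expand(name, limit=127):
    """Hypothetical caller: keep the escape sequence when conversion
    is refused (assumed behaviour, for illustration only)."""
    return convert_charref(name, limit) or "&#%s;" % name


ascii_result = expand("65")            # converted to a character
high_result = expand("183")            # above 127: stays an escape
old_result = expand("183", limit=255)  # old limit: converted
```

With the limit at 127, only plain ASCII references are turned into characters; anything higher stays as an escape sequence, which is the effect the patch describes.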

Flemming Bjerke (flem) wrote : Re: [Bug 240929] [NEW] Python2.5 Unicode-bug when using sgmllib.py: UnicodeDecodeError

On Wednesday 18 June 2008, Flemming Bjerke wrote:
> Public bug reported:
>
> I HAD TO CHECK "I don't know" the package. It couldn't find python2.5.
> Strange.

The problem turned up after upgrading to python2.5. The HTML parser module
BeautifulSoup (not included in python2.5, but it relies on sgmllib.py)
stopped working.

--
Flemming Bjerke

Matthias Klose (doko)
Changed in python2.5:
importance: Undecided → Medium
status: New → Triaged
Changed in python:
status: Unknown → New
Changed in python:
status: New → Fix Committed
Matthias Klose (doko)
Changed in python2.6 (Ubuntu):
importance: Undecided → Medium
status: New → Triaged
status: Triaged → In Progress
Changed in python2.5 (Ubuntu):
status: Triaged → In Progress
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package python2.5 - 2.5.4-1ubuntu4

---------------
python2.5 (2.5.4-1ubuntu4) jaunty; urgency=low

  * Fix issue #1651995, _convert_ref for non-ASCII characters. LP: #240929.
  * Fix issue #3845, in PyRun_SimpleFileExFlags avoid invalid memory access
    with short file names. LP: #234798.
  * Fix issue #1046, title endtag in HTMLCalendar.formatyearpage().
    Closes: #513335.
  * Py_DECREF: Add `do { ... } while (0)' to avoid compiler warnings.
  * curses.initscr(): raise an error instead of calling exit() in error cases.
    Closes: #478817.
  * Fix comment macro in python manpage.

 -- Matthias Klose <email address hidden> Sat, 04 Apr 2009 19:09:56 +0200

Changed in python2.5 (Ubuntu):
status: In Progress → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package python2.6 - 2.6.1-1ubuntu11

---------------
python2.6 (2.6.1-1ubuntu11) jaunty; urgency=low

  * Update to 20090405, taken from the 2.6 release branch.
    - Fix issue #1651995, _convert_ref for non-ASCII characters. LP: #240929.
    - Fix issue #3845, in PyRun_SimpleFileExFlags avoid invalid memory access
      with short file names. LP: #234798.
    - Fix issues #5190, #5444, #5471, #5615, #5617, #5631, #1326077, #1726172.
    - Fix documentation issues #3427, #4411, #4882, #5018, #5298, #5370,
      #5432, #5563, #5580, #5598, #5601, #5618, #5635, #5642, #5655, #1096310,
      #1530012, #1675026, #1718017, #1742837,
  * Fix issue #1113244: Py_XINCREF, Py_DECREF, Py_XDECREF: Add
   `do { ... } while (0)' to avoid compiler warnings. Closes: #516956.

 -- Matthias Klose <email address hidden> Mon, 06 Apr 2009 00:36:01 +0200

Changed in python2.6 (Ubuntu):
status: In Progress → Fix Released
Tero Karvinen (karvinen+launchpad) wrote :

This doesn't seem to be fixed in 8.04 Hardy, which only has an older version of python2.5 (2.5.2-2ubuntu4.1). Will this be fixed on Hardy at all?

When using Beautiful Soup to read a web page in Finnish, I get: "UnicodeEncodeError: 'ascii' codec can't encode character u'\xb7' in position 5: ordinal not in range(128)"

My /usr/lib/python2.5/sgmllib.py incorrectly compares against 255 (and not the correct 127):

def handle_charref(self, name):
    # ...
    if not 0 <= n <= 255:
        return

I'm using Ubuntu 8.04.2 hardy (cat /etc/lsb-release) and python2.5 2.5.2-2ubuntu4.1 (dpkg --list python2.5).
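
For anyone who wants to repeat this check on their own system, here is a small sketch. It assumes the comparison appears in the file in the same textual form as quoted in this report ("if not 0 <= n <= N"); the function name charref_limit and the regular expression are made up for this example.

```python
import re

# Matches the range check as it is quoted in this report (an assumption
# about how the line is written in the installed file).
LIMIT_RE = re.compile(r"if not 0 <= n <= (\d+)\s*:")


def charref_limit(source_text):
    """Return the upper limit used in the range check, or None if absent."""
    match = LIMIT_RE.search(source_text)
    return int(match.group(1)) if match else None


unpatched = charref_limit("        if not 0 <= n <= 255:\n            return\n")
patched = charref_limit("    if not 0 <= n <= 127 : # ASCII ends at 127\n")
missing = charref_limit("def handle_charref(self, name):\n    pass\n")
```

To check an installed copy, pass the contents of the file (for example /usr/lib/python2.5/sgmllib.py, the path mentioned above) to charref_limit; a result of 255 would suggest the unpatched version.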
