UnicodeDecodeError: 'utf8' codec can't decode byte 0xad in position 7: unexpected code byte on some bson reports

Bug #896959 reported by Robert Collins
This bug affects 2 people
Affects: Launchpad itself
Status: Fix Released
Importance: Critical
Assigned to: Robert Collins

Bug Description

This is raised when loading an apparently normal OOPS that was successfully serialized as bson. It probably indicates a bson bug, but we will need a fixed version or a fixed dependency in oops-datedir-repo.

>>> bson.loads(file('OOPS-3cbd11de34ca8a70afaf668853adc17e').read())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/robertc/source/launchpad/oops-datedir-repo/working/eggs/bson-0.3.2-py2.6.egg/bson/__init__.py", line 75, in loads
    return decode_document(data, 0)[1]
  File "/home/robertc/source/launchpad/oops-datedir-repo/working/eggs/bson-0.3.2-py2.6.egg/bson/codec.py", line 240, in decode_document
    base, name, value = decode_element(data, base)
  File "/home/robertc/source/launchpad/oops-datedir-repo/working/eggs/bson-0.3.2-py2.6.egg/bson/codec.py", line 232, in decode_element
    return decode_func(data, base)
  File "/home/robertc/source/launchpad/oops-datedir-repo/working/eggs/bson-0.3.2-py2.6.egg/bson/codec.py", line 253, in decode_document_element
    base, value = decode_document(data, base)
  File "/home/robertc/source/launchpad/oops-datedir-repo/working/eggs/bson-0.3.2-py2.6.egg/bson/codec.py", line 240, in decode_document
    base, name, value = decode_element(data, base)
  File "/home/robertc/source/launchpad/oops-datedir-repo/working/eggs/bson-0.3.2-py2.6.egg/bson/codec.py", line 232, in decode_element
    return decode_func(data, base)
  File "/home/robertc/source/launchpad/oops-datedir-repo/working/eggs/bson-0.3.2-py2.6.egg/bson/codec.py", line 163, in decode_string_element
    base, name = decode_cstring(data, base + 1)
  File "/home/robertc/source/launchpad/oops-datedir-repo/working/eggs/bson-0.3.2-py2.6.egg/bson/codec.py", line 121, in decode_cstring
    return (base + length, buf.getvalue().decode("utf8"))
  File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x85 in position 5: unexpected code byte

Revision history for this message
Robert Collins (lifeless) wrote :

According to the backtrace, we have a unicode string which, when serialized by the serializer, cannot be deserialized.

(decode_element reads the type code, then dispatches to decode_string_element. I'm a bit confused about why decode_cstring is doing a utf8 decode.)
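
For readers following the backtrace, here is a rough Python 3 sketch of the dispatch pattern described above: read one type byte, then hand off to a per-type decoder, each of which first reads the element name as a cstring. This is not the bson 0.3.2 source; the names read_cstring, DECODERS and the two decoders are made up for illustration, and only the int32 (0x10) and string (0x02) element types are covered.

```python
import struct


def read_cstring(data, pos):
    """Read bytes up to the next NUL and decode them as UTF-8."""
    end = data.index(b"\x00", pos)
    return end + 1, data[pos:end].decode("utf-8")


def decode_int32(data, pos):
    # Element body: cstring name, then a little-endian int32.
    pos, name = read_cstring(data, pos)
    (value,) = struct.unpack_from("<i", data, pos)
    return pos + 4, name, value


def decode_string(data, pos):
    # Element body: cstring name, int32 length (counting the trailing NUL),
    # then the UTF-8 bytes and the NUL itself.
    pos, name = read_cstring(data, pos)
    (length,) = struct.unpack_from("<i", data, pos)
    start = pos + 4
    value = data[start:start + length - 1].decode("utf-8")
    return start + length, name, value


# Hypothetical dispatch table keyed on the BSON element type byte.
DECODERS = {0x02: decode_string, 0x10: decode_int32}


def decode_element(data, pos=0):
    type_code = data[pos]
    return DECODERS[type_code](data, pos + 1)
```

In this sketch every decoder calls read_cstring before touching the value, which is why a bad element name fails inside the cstring decode no matter what type the value has.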

Revision history for this message
Robert Collins (lifeless) wrote :

Ah, the decode_cstring is for the key of the dict. So it looks like encode_cstring has perhaps gotten horribly confused.

Revision history for this message
Robert Collins (lifeless) wrote :

One possibility is a failure to write the key correctly. According to the bson spec, key names must be
cstring ::= (byte*) "\x00"
(where byte* are valid utf8) - it's a utf8 bytestring without embedded \x00.

Revision history for this message
Robert Collins (lifeless) wrote :

b'field\x85ries_filter' is the key that is giving grief; it appears to be one of the keys in args.
>>> bson.loads(bson.dumps({'field\x85ries_filter':0}))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/robertc/source/launchpad/oops-datedir-repo/working/eggs/bson-0.3.2-py2.6.egg/bson/__init__.py", line 75, in loads
    return decode_document(data, 0)[1]
  File "/home/robertc/source/launchpad/oops-datedir-repo/working/eggs/bson-0.3.2-py2.6.egg/bson/codec.py", line 240, in decode_document
    base, name, value = decode_element(data, base)
  File "/home/robertc/source/launchpad/oops-datedir-repo/working/eggs/bson-0.3.2-py2.6.egg/bson/codec.py", line 232, in decode_element
    return decode_func(data, base)
  File "/home/robertc/source/launchpad/oops-datedir-repo/working/eggs/bson-0.3.2-py2.6.egg/bson/codec.py", line 313, in decode_int32_element
    base, name = decode_cstring(data, base + 1)
  File "/home/robertc/source/launchpad/oops-datedir-repo/working/eggs/bson-0.3.2-py2.6.egg/bson/codec.py", line 121, in decode_cstring
    return (base + length, buf.getvalue().decode("utf8"))
  File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x85 in position 5: unexpected code byte

So this is a case of a) bson accepting something it shouldn't and b) LP generating something invalid.
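
The asymmetry can be sketched in Python 3 without the bson library at all. The encoder below mirrors the shape of the encode_int32_element line in the traceback ("\x10" + name + NUL + packed int32) and, like the original, does no checking of a byte-string name; decode_key then does what decode_cstring does. Both function names are hypothetical and this only models a single element, not a whole document.

```python
import struct


def encode_int32_element_unchecked(name: bytes, value: int) -> bytes:
    # Assumes, as the buggy path does, that a byte-string name is already UTF-8.
    return b"\x10" + name + b"\x00" + struct.pack("<i", value)


def decode_key(element: bytes) -> str:
    # Skip the type byte, read up to the NUL, decode as UTF-8.
    end = element.index(b"\x00", 1)
    return element[1:end].decode("utf-8")


element = encode_int32_element_unchecked(b"field\x85ries_filter", 0)

failure = None
try:
    decode_key(element)
except UnicodeDecodeError as exc:
    # Record the offset and the offending byte of the failed decode.
    failure = (exc.start, exc.object[exc.start])
```

Encoding succeeds silently and the decode fails at offset 5 on byte 0x85, which matches the position and byte reported in the traceback above.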

Revision history for this message
Robert Collins (lifeless) wrote :

def encode_cstring(value):
    if isinstance(value, unicode):
        value = value.encode("utf8")
    return value + "\x00"

This is clearly unsafe: while most invalid keys will raise, e.g. a TypeError:
>>> bson.encode({1:0})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'encode'
>>> bson.dumps({1:0})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/robertc/source/launchpad/oops-datedir-repo/working/eggs/bson-0.3.2-py2.6.egg/bson/__init__.py", line 69, in dumps
    return encode_document(obj, [], generator_func = generator)
  File "/home/robertc/source/launchpad/oops-datedir-repo/working/eggs/bson-0.3.2-py2.6.egg/bson/codec.py", line 207, in encode_document
    encode_value(name, value, buf, traversal_stack, generator_func)
  File "/home/robertc/source/launchpad/oops-datedir-repo/working/eggs/bson-0.3.2-py2.6.egg/bson/codec.py", line 193, in encode_value
    buf.write(encode_int32_element(name, value))
  File "/home/robertc/source/launchpad/oops-datedir-repo/working/eggs/bson-0.3.2-py2.6.egg/bson/codec.py", line 310, in encode_int32_element
    return "\x10" + encode_cstring(name) + struct.pack("<i", value)
  File "/home/robertc/source/launchpad/oops-datedir-repo/working/eggs/bson-0.3.2-py2.6.egg/bson/codec.py", line 111, in encode_cstring
    return value + "\x00"
TypeError: unsupported operand type(s) for +: 'int' and 'str'

but a bytestring is assumed to be utf8 already, which is quite clearly not the case here. A stricter version would check it:
def encode_cstring(value):
    if isinstance(value, unicode):
        value = value.encode("utf8")
    elif isinstance(value, str):
        # A byte string must already be valid utf8; decoding raises
        # UnicodeDecodeError if it is not.
        value.decode("utf8")
    else:
        raise TypeError("Invalid type for cstring %r" % (value,))
    return value + "\x00"
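
For anyone reading this on Python 3, a rough equivalent of the stricter check might look like the following. It is a sketch, not the patch that landed: the str/bytes split replaces the Python 2 unicode/str split, and the embedded-NUL check is an extra taken from the spec's cstring rule rather than from the proposal above.

```python
def encode_cstring(value):
    """Encode an element name as a BSON cstring, rejecting invalid input."""
    if isinstance(value, str):
        raw = value.encode("utf-8")
    elif isinstance(value, (bytes, bytearray)):
        raw = bytes(value)
        # Validate only; raises UnicodeDecodeError if raw is not UTF-8.
        raw.decode("utf-8")
    else:
        raise TypeError("Invalid type for cstring %r" % (value,))
    if b"\x00" in raw:
        # A cstring is NUL-terminated, so the name cannot contain NUL.
        raise ValueError("cstring must not contain an embedded NUL byte")
    return raw + b"\x00"
```

With this version the bad key fails at serialization time, where the caller can still see which key was at fault, instead of producing a file that cannot be read back.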

Revision history for this message
Robert Collins (lifeless) wrote :

datedir-repo isn't actually at fault here, nor the cause, so I'm dropping its task to High and moving on to LP itself.

Changed in python-oops-datedir-repo:
importance: Critical → High
Changed in python-oops-datedir-repo:
assignee: nobody → Robert Collins (lifeless)
Changed in launchpad:
assignee: nobody → Robert Collins (lifeless)
Changed in python-oops-datedir-repo:
assignee: Robert Collins (lifeless) → nobody
Changed in launchpad:
status: New → Triaged
importance: Undecided → Critical
Revision history for this message
Launchpad QA Bot (lpqabot) wrote :
tags: added: qa-needstesting
Changed in launchpad:
status: Triaged → Fix Committed
Revision history for this message
Robert Collins (lifeless) wrote :

Can't trigger the oops anymore because the bug with +login is also fixed.

tags: added: qa-untestable
removed: qa-needstesting
William Grant (wgrant)
Changed in launchpad:
status: Fix Committed → Fix Released
no longer affects: python-oops-datedir-repo