Comment 5 for bug 1384463

Joshua Harlow (harlowja) wrote :

OK, I think I see what's happening. In doc8 I am using 'chardet' to detect the file's encoding.

Trying it on this file gives a bad encoding:

>>> import chardet
>>> b = open('test.rst', 'rb').read()
>>> chardet.detect(b)['encoding']
'ISO-8859-2'

So the file contents are then decoded by Python from 'ISO-8859-2' to unicode.

The decoded text then goes into the docutils rst parser, which, when iterated over, gives back each line as it was decoded from 'ISO-8859-2':

>>> y = b'known exploit in the wild, for example – the time between advance notification'
>>> import six
>>> g = six.text_type(y, encoding='ISO-8859-2')
>>> print g
known exploit in the wild, for example – the time between advance notification
>>> len(g)
80

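So that seems to be the cause (bad detection): decoded as 'ISO-8859-2', the three bytes of the dash become three characters, so the line is counted as 80 characters rather than 78.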
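To make the length difference concrete, here is a minimal sketch that decodes the same bytes both ways. It assumes the file is actually stored as UTF-8 (which the length of 80 above suggests); the variable names are only illustrative and are not taken from doc8.

```python
# -*- coding: utf-8 -*-
# Assumption: the source file is UTF-8, so the en dash (U+2013) is stored
# as the three bytes 0xE2 0x80 0x93.
line = u'known exploit in the wild, for example \u2013 the time between advance notification'
raw = line.encode('utf-8')

# Decoding with the correct encoding yields one character for the dash.
as_utf8 = raw.decode('utf-8')

# Decoding with the mis-detected encoding yields one character per byte,
# so the dash alone contributes three characters.
as_latin2 = raw.decode('iso-8859-2')

print(len(as_utf8))    # 78
print(len(as_latin2))  # 80
```

The two extra characters come entirely from the dash, so any length-based check run on the mis-decoded text sees a longer line than the one in the file.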