Line length for lines with non-ascii characters are not computed correctly

Bug #1384463 reported by Alex Gaynor
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
doc8
Fix Released
High
Joshua Harlow

Bug Description

Consider this line:

known exploit in the wild, for example – the time between advance notification

visibly it is below the line length limit, however the "dash" character is a non-ascii unicode character which occupies 3 utf-8 bytes. doc8 complains about this line; apparently because it is counting bytes rather than graphemes.

Joshua Harlow (harlowja)
Changed in doc8:
importance: Undecided → High
Revision history for this message
Joshua Harlow (harlowja) wrote :

Is this in a text file or an rst file? (the rst files are parsed with docutils so I just want to verify its not that code instead).

Revision history for this message
Joshua Harlow (harlowja) wrote :

I believe all lines are treated as utf-8 (or converted to that) and then use the basic len() operation.

>>> x = u'known exploit in the wild, for example – the time between advance notification'
>>> x
u'known exploit in the wild, for example \u2013 the time between advance notification'
>>> len(x)
78

Revision history for this message
Alex Gaynor (alex-gaynor) wrote :

This is a rst file: https://github.com/pyca/cryptography/blob/master/docs/security.rst

I assume the issue is that x is actually `x.encode("utf-8")`:

>>>> x = u'known exploit in the wild, for example – the time between advance notification'
>>>> len(x)
78
>>>> len(x.encode("utf-8"))
80

Revision history for this message
Joshua Harlow (harlowja) wrote :

Cool, let me try that. I'm thinking docutils needs a slight tweaking to avoid that.

Revision history for this message
Joshua Harlow (harlowja) wrote :

Ok, think I see whats happening. I am using in doc8 'chardet' to detect that files encoding.

Trying on this file. It gives a bad encoding:

>>> b = open('test.rst').read()
>>> chardet.detect(b)['encoding']
'ISO-8859-2'

Sooo that is then being decoded in the 'ISO-8859-2' -> unicode by python.

Which then goes into docutils rst parser, which then when iterated over gives back the decoding of that line in 'ISO-8859-2'

>>> y = b'known exploit in the wild, for example – the time between advance notification'
>>> import six
>>> g = six.text_type(y, encoding='ISO-8859-2')
>>> print g
known exploit in the wild, for example – the time between advance notification
>>> len(g)
80

So that seems to be the cause (bad detection).

Revision history for this message
Joshua Harlow (harlowja) wrote :

I can make it default to a configured encoding that overrides the usage of chardet? Acceptable to u.

Could be something in a [doc8] section, like.

[doc8]
file_encoding = 'utf-8'

This would (if present) skip the chardet running. Seem ok?

Revision history for this message
Alex Gaynor (alex-gaynor) wrote :

Sounds great!

Revision history for this message
Joshua Harlow (harlowja) wrote :

https://review.openstack.org/130390 (not sure if the openstack-CI can write to this project). Let me know any comments. Thx!

Joshua Harlow (harlowja)
Changed in doc8:
assignee: nobody → Joshua Harlow (harlowja)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to doc8 (master)

Reviewed: https://review.openstack.org/130390
Committed: https://git.openstack.org/cgit/stackforge/doc8/commit/?id=04a710c687a7cb2da2b2a6bbd86c20aa32d1dc60
Submitter: Jenkins
Branch: master

commit 04a710c687a7cb2da2b2a6bbd86c20aa32d1dc60
Author: Joshua Harlow <email address hidden>
Date: Wed Oct 22 16:13:33 2014 -0700

    Allow overriding file encoding

    Chardet doesn't always seem to correctly detect files encoding
    correctly in all circumstances, to make it so that a user can
    specify the exact encoding of there files allow a new config
    option and a new CLI option that allows for manually overriding
    the encoding that chardet will try to determine.

    If enabled chardet detection will no longer run.

    Fixes bug 1384463

    Change-Id: Ie8baf3f79083e1495c7420a9d0569390cad2115e

Changed in doc8:
status: New → Fix Committed
Joshua Harlow (harlowja)
Changed in doc8:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.