Code review comment for lp:~songofacandy/bzr/i18n-utextwrap

Martin Packman (gz) wrote :

Good work tackling this, reading TR14 again reminds me how not-fun all the details are. I think your overall approach of trying extend the existing stdlib functionality just enough for bzr is sensible. We currently have en, es, ja, and ru docs so there isn't any need to deal with esoteric rules for other languages yet.

Various notes and nitpicks on the diff follow. The main thing that needs fixing is that wrapping for Russian (and Greek) will follow east asian style with this change rather than western, and width measurement will be wrong in those cases. The easiest way to fix this is debatable, I'd be tempted to try and use the language as an input to change the text wrapping behaviour, and possibly the codepage for the width measurement.

general
=======

I find the tests a little hard to follow, but they look like a realy good start. Currently only hiragana and basic latin are covered, which is why the cyrillic and greek problems aren't apparent.

+# Copyright (C) 2011 Canonical Ltd

For bzrlib.utextwrap this line possibly wants the dates and names from textwrap.py as you note later that some code comes from that module.

+ if self.drop_whitespace and chunks[-1].strip() == '' and lines:

The `drop_whitespace` param is 2.6 or later only:

<http://docs.python.org/library/textwrap.html#textwrap.TextWrapper.drop_whitespace>

This is likely a feature bzr doesn't need, but it's not hard to add a little version check, pull <lp:~gz/bzr/tweak_utextwrap> for a fix.

It looks like the `expand_tabs` param won't work correctly for ideographic text as the `_munge_whitespace` method isn't overridden and it uses `unicode.expandtabs` which does the wrong measurement for tab stops. This probably also a feature bzr doesn't need, but the limitation should be documented.

str vs. unicode
===============

Is there any reason to support non-unicode strings in this class? Is is a requirement for your neat trick of running the Python textwrap tests against this module too?

+ if isinstance(s, str):
+ return len(s)
+ assert isinstance(s, unicode)

The two blocks like this in the private `_width` and `_cut` functions would ideally not be needed, the public entrance points should ensure the string is unicode. Currently, they look like they'll just silently do the wrong thing for non-ascii byte strings.

+ def _split(self, text):
+ chunks = textwrap.TextWrapper._split(self, unicode(text))

This is also a private method so shouldn't need to convert the argument, the callers should just use the right object.

+ def wrap(self, text):
+ # ensure text is unicode
+ return textwrap.TextWrapper.wrap(self, unicode(text))

In the public interface this is a reasonable way to support ascii-only byte strings and a convenience for callers, other parts of bzrlib do a similar thing. The main issue is that it's easy for callers to get away with doing the wrong thing, but that's not much of a concern here as it should be obvious when non-english text is passed.

east_asian_width
================

The main problem comes from this logic:

+ w += (c in 'FWA' and 2) or 1
...
+ w += (_eawidth(c) in 'FWA' and 2) or 1
...
+ if _eawidth(char) in 'FWA':

Firstly, it's a bad thing that a similar-but different line appears three times. That's a sign a helper is needed.

For width measurement, `in 'FWA'` isn't a bad check in an ideographic context, I have nearly the same logic in a script I use to work out indents in EUC-JP text with a `eawidth_map = {"F":2,"H":1,"W":2,"Na":1,"A":2,"N":1}` mapping.

However, the "A" class includes various characters that are fullwidth on my CP932 terminal, but will not be on most nix terminals, and not in other windows codepages. For odd punctuation, guessing the width a bit too wide is find (and is better than guessing too narrow, which may result in some lines overflowing), but a Russian user might find only half the screen is used.

The trick on windows is to encode to the terminal codepage and count the bytes to get an accurate width. The rough general solution is perhaps to count "A" as 1 except where an ideographic context is likely.

There may be miscounts from other things like combining characters not combining. Here the rule should just be to avoid using ambiguous constructions, there's no need to complicate this code to cover things translators probably don't need.

For line breaking logic, it's a similar problem, compounded by the fact Python doesn't actually expose the UCD table we want. Not supporting small kana and other details isn't an issue, but allowing a break in the contexts it makes sense to in east asian text will do funny things for western text.

I'm afraid I don't have any easy solutions.

« Back to merge proposal