Bazaar

Code review comment for lp:~songofacandy/bzr/i18n-utextwrap

i18n-utextwrap
Merge into bzr.dev

Revision history for this message

methane (songofacandy) wrote on 2011-05-08:

> +# Copyright (C) 2011 Canonical Ltd
>
> For bzrlib.utextwrap this line possibly wants the dates and names from
> textwrap.py as you note later that some code comes from that module.

I've copied Copyright lines from textwrap.

> + if self.drop_whitespace and chunks[-1].strip() == '' and lines:
>
> The `drop_whitespace` param is 2.6 or later only:
>
> <http://docs.python.org/library/textwrap.html#textwrap.TextWrapper.drop_whites
> pace>
>
> This is likely a feature bzr doesn't need, but it's not hard to add a little
> version check, pull <lp:~gz/bzr/tweak_utextwrap> for a fix.

Thank you. I've merged it.

> It looks like the `expand_tabs` param won't work correctly for ideographic
> text as the `_munge_whitespace` method isn't overridden and it uses
> `unicode.expandtabs` which does the wrong measurement for tab stops. This
> probably also a feature bzr doesn't need, but the limitation should be
> documented.

Could you tell me what is the problem of `unicode.expandtabs`?
I haven't heard about that.

> str vs. unicode
> ===============
>
> Is there any reason to support non-unicode strings in this class? Is is a
> requirement for your neat trick of running the Python textwrap tests against
> this module too?
>
> + if isinstance(s, str):
> + return len(s)
> + assert isinstance(s, unicode)
>
> The two blocks like this in the private `_width` and `_cut` functions would
> ideally not be needed, the public entrance points should ensure the string is
> unicode. Currently, they look like they'll just silently do the wrong thing
> for non-ascii byte strings.

I've removed `if isinstance(...`.

>
> + def _split(self, text):
> + chunks = textwrap.TextWrapper._split(self, unicode(text))
>
> This is also a private method so shouldn't need to convert the argument, the
> callers should just use the right object.

This private method is tested by test_textwrap and it passes byte string.
`unicode(text)` uses default encoding that is 'ascii' on most platform and
raises UnicodeDecodeError when we passes non-ascii string.

>
> + def wrap(self, text):
> + # ensure text is unicode
> + return textwrap.TextWrapper.wrap(self, unicode(text))
>
> In the public interface this is a reasonable way to support ascii-only byte
> strings and a convenience for callers, other parts of bzrlib do a similar
> thing. The main issue is that it's easy for callers to get away with doing the
> wrong thing, but that's not much of a concern here as it should be obvious
> when non-english text is passed.
>
>
> east_asian_width
> ================
>
> The main problem comes from this logic:
>
> + w += (c in 'FWA' and 2) or 1
> ...
> + w += (_eawidth(c) in 'FWA' and 2) or 1
> ...
> + if _eawidth(char) in 'FWA':
>
> Firstly, it's a bad thing that a similar-but different line appears three
> times. That's a sign a helper is needed.

I've split it.

>
> For width measurement, `in 'FWA'` isn't a bad check in an ideographic context,
> I have nearly the same logic in a script I use to work out indents in EUC-JP
> text with a `eawidth_map = {"F":2,"H":1,"W":2,"Na":1,"A":2,"N":1}` mapping.
>
> However, the "A" class includes various characters that are fullwidth on my
> CP932 terminal, but will not be on most nix terminals, and not in other
> windows codepages. For odd punctuation, guessing the width a bit too wide is
> find (and is better than guessing too narrow, which may result in some lines
> overflowing), but a Russian user might find only half the screen is used.

I've added "ambiguous_width" parameter to UTextWrapper.

> The trick on windows is to encode to the terminal codepage and count the bytes
> to get an accurate width. The rough general solution is perhaps to count "A"
> as 1 except where an ideographic context is likely.

I think we can use i18n for it. For example, i18n module may be able to provide a
function like:

def get_ambiguous_width():
    if _installed_language in ('cn', 'ja', 'kr'):
        return 2
    else:
        return 1

For now, I've stopped using utextwrap in bzrlib.help module.

>
> There may be miscounts from other things like combining characters not
> combining. Here the rule should just be to avoid using ambiguous
> constructions, there's no need to complicate this code to cover things
> translators probably don't need.
>
> For line breaking logic, it's a similar problem, compounded by the fact Python
> doesn't actually expose the UCD table we want. Not supporting small kana and
> other details isn't an issue, but allowing a break in the contexts it makes
> sense to in east asian text will do funny things for western text.
>
> I'm afraid I don't have any easy solutions.

Splitting between combined character or surrogate pair may cause a problem.
I'll try to make more clever line breaking if the problem happens.
But if the problem happen, users can show English help (we shoud provide
that way) so nothing goes worse than now.

> +# Copyright (C) 2011 Canonical Ltd
> 
> For bzrlib.utextwrap this line possibly wants the dates and names from
> textwrap.py as you note later that some code comes from that module.

I've copied Copyright lines from textwrap.

> +            if self.drop_whitespace and chunks[-1].strip() == '' and lines:
> 
> The `drop_whitespace` param is 2.6 or later only:
> 
> <http://docs.python.org/library/textwrap.html#textwrap.TextWrapper.drop_whites
> pace>
> 
> This is likely a feature bzr doesn't need, but it's not hard to add a little
> version check, pull <lp:~gz/bzr/tweak_utextwrap> for a fix.

Thank you. I've merged it.

Could you tell me what is the problem of `unicode.expandtabs`?
I haven't heard about that.

> str vs. unicode
> ===============
> 
> Is there any reason to support non-unicode strings in this class? Is is a
> requirement for your neat trick of running the Python textwrap tests against
> this module too?
> 
> +    if isinstance(s, str):
> +        return len(s)
> +    assert isinstance(s, unicode)
> 
> The two blocks like this in the private `_width` and `_cut` functions would
> ideally not be needed, the public entrance points should ensure the string is
> unicode. Currently, they look like they'll just silently do the wrong thing
> for non-ascii byte strings.

I've removed `if isinstance(...`.

> 
> +    def _split(self, text):
> +        chunks = textwrap.TextWrapper._split(self, unicode(text))
> 
> This is also a private method so shouldn't need to convert the argument, the
> callers should just use the right object.

> 
> +    def wrap(self, text):
> +        # ensure text is unicode
> +        return textwrap.TextWrapper.wrap(self, unicode(text))
> 
> In the public interface this is a reasonable way to support ascii-only byte
> strings and a convenience for callers, other parts of bzrlib do a similar
> thing. The main issue is that it's easy for callers to get away with doing the
> wrong thing, but that's not much of a concern here as it should be obvious
> when non-english text is passed.
> 
> 
> east_asian_width
> ================
> 
> The main problem comes from this logic:
> 
> +        w += (c in 'FWA' and 2) or 1
> ...
> +        w += (_eawidth(c) in 'FWA' and 2) or 1
> ...
> +                if _eawidth(char) in 'FWA':
> 
> Firstly, it's a bad thing that a similar-but different line appears three
> times. That's a sign a helper is needed.

I've split it.

> 
> For width measurement, `in 'FWA'` isn't a bad check in an ideographic context,
> I have nearly the same logic in a script I use to work out indents in EUC-JP
> text with a `eawidth_map = {"F":2,"H":1,"W":2,"Na":1,"A":2,"N":1}` mapping.
> 
> However, the "A" class includes various characters that are fullwidth on my
> CP932 terminal, but will not be on most nix terminals, and not in other
> windows codepages. For odd punctuation, guessing the width a bit too wide is
> find (and is better than guessing too narrow, which may result in some lines
> overflowing), but a Russian user might find only half the screen is used.

I've added "ambiguous_width" parameter to UTextWrapper.

I think we can use i18n for it. For example, i18n module may be able to provide a
function like:

def get_ambiguous_width():
    if _installed_language in ('cn', 'ja', 'kr'):
        return 2
    else:
        return 1

For now, I've stopped using utextwrap in bzrlib.help module.

> 
> There may be miscounts from other things like combining characters not
> combining. Here the rule should just be to avoid using ambiguous
> constructions, there's no need to complicate this code to cover things
> translators probably don't need.
> 
> For line breaking logic, it's a similar problem, compounded by the fact Python
> doesn't actually expose the UCD table we want. Not supporting small kana and
> other details isn't an issue, but allowing a break in the contexts it makes
> sense to in east asian text will do funny things for western text.
> 
> I'm afraid I don't have any easy solutions.

« Back to merge proposal