Code review comment for lp:~gz/bzr/2.5_text_progress_view_unicode_966934

Martin Packman (gz) wrote:

> 71 +class _TTYStringIO(StringIO):
> 78 +class _NonTTYStringIO(StringIO):
>
> I don't think we want the leading '_' here. Those are helper classes (and
> there is even "113 + TTYStringIO = _TTYStringIO", which seems even weirder).

Okay, can do, as I'm touching those lines anyway with the module move.

> 142 + self._encoding = getattr(term_file, "encoding", "ascii")
>
> Don't we want to default to utf8 instead?

Not really; this class is just for printing to a terminal, so we don't have to be concerned with the case of redirecting to a file. If we print bytes the terminal doesn't support, at best we get replacement characters, and generally we'll get random junk. Given the way gettext and locales work, users shouldn't ever hit a situation where translations are installed but the terminal doesn't support them.
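
To make that concrete, here's a minimal sketch of the pattern being reviewed (the TerminalWriter name and show_line method are made up for illustration, it's written in the Python 2 style bzr uses, and it assumes term_file is a real tty that reports its encoding):

    import sys

    class TerminalWriter(object):
        """Write unicode progress text straight to a terminal."""

        def __init__(self, term_file):
            self._term_file = term_file
            # Same fallback as the line under review: if the stream
            # doesn't report an encoding, assume plain ascii.
            self._encoding = getattr(term_file, "encoding", "ascii")

        def show_line(self, u):
            # "replace" degrades unsupported characters to '?' rather
            # than letting bytes the terminal can't render leak out.
            self._term_file.write(u.encode(self._encoding, "replace") + "\n")

    TerminalWriter(sys.stdout).show_line(u"building tree \u2713")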

> 159 + # GZ 2012-03-28: Counting bytes is wrong for calculating width of
> 160 + # text but better than counting codepoints.
>
> Can you just briefly add (in this comment) a description of what you think
> should be done (or even better file a bug and just reference it there)?
> Just from reading the comment, I'm not sure if we should blame python for
> not providing an easy way to do that or what *we* need to do there.

It's ridiculously complicated.

On my SJIS term things are reasonably simple. Characters that use a single byte, so ASCII and the halfwidth katakana, are one cell wide. Double-byte characters are two cells wide, including Greek and Cyrillic, which is only appropriate in an ideographic context (unlike native or mixed Latin use).
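
For instance, a quick illustration of why byte counting works tolerably there (not from the branch, just a demonstration):

    # ASCII and halfwidth katakana are one byte and one cell in SJIS,
    # kanji are two bytes and two cells, so encoded length matches the
    # column count. Counting codepoints would give 7 here, not 9.
    text = u"abc\uff76\uff85\u6f22\u5b57"  # abc, 2 halfwidth kana, 2 kanji
    print(len(text.encode("shift_jis")))  # 9 bytes == 9 cells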

With terminals using UTF-8, things are vastly more complex, as the display width of a character no longer follows from the number of bytes used to encode it, but instead depends on what and how the terminal implements various unicode features. A Linux console supports up to 512 glyphs, which as configured here means it can display CP437 characters, and any other (decoded from UTF-8) codepoint appears as a diamond. An xterm is much more capable, and will display most things provided the relevant fonts are installed.

POSIX provides wcwidth and wcswidth functions that aim to resolve the problem of knowing how many columns are needed to display a character or string, but like all things C and locale related, they're not all that great. One of the issues is that there's a gap between what a terminal can actually display and how a string could in theory be displayed. So, they will give a width of 1 for a decomposed string of character plus combining diacritic, but on basic terminals that will still be shown as the width-2 sequence of character and missing-glyph marker.
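
For reference, the POSIX call is reachable from Python through ctypes (a sketch under the assumption of a libc that exports wcswidth, e.g. glibc; the locale has to be initialised first or widths are reported for the C locale):

    import ctypes
    import ctypes.util
    import locale

    # wcswidth() answers for the current locale, so set it up first.
    locale.setlocale(locale.LC_ALL, "")

    libc = ctypes.CDLL(ctypes.util.find_library("c"))
    libc.wcswidth.argtypes = [ctypes.c_wchar_p, ctypes.c_size_t]
    libc.wcswidth.restype = ctypes.c_int

    def display_width(u):
        # Number of columns needed, or -1 for non-printable input.
        return libc.wcswidth(u, len(u))

    print(display_width(u"\u6f22\u5b57"))  # two kanji -> 4 columns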

What I've done in the past is use the unicode tables plus some knowledge about the platform to get the basic cases correct, then try to fudge so that incorrect outcomes err on the less-bad side. So, when trying to blank a whole line, under-measure, so that mistakes result in some junk left at the end of the line rather than wrapping to a new line.
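
As a rough sketch of that table-based approach using what Python ships (unicodedata is stdlib; treating the result as an under-estimate is the fudging described above):

    import unicodedata

    def approx_width(u):
        # East Asian wide/fullwidth forms take two columns, combining
        # marks take none, everything else is assumed to take one.
        width = 0
        for char in u:
            if unicodedata.combining(char):
                continue
            if unicodedata.east_asian_width(char) in ("W", "F"):
                width += 2
            else:
                width += 1
        return width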

Just found a recent Python bug aimed at exposing the POSIX function, which covers a few of these points as well:

<http://bugs.python.org/issue12568>

> 155 + def _show_line(self, u):
>
> Worth a trivial docstring mentioning 'u' means: 'We expect unicode'

As it's a private helper, I'll probably just document in the public class interfaces that unicode should be passed.
