Code review comment for lp:~jameinel/bzr-builddeb/unicode-author-508251

John A Meinel (jameinel) wrote :

This is a basic fix for bug #508251. Specifically it:

1) Tries to decode as utf-8 and, if that fails, falls back to iso-8859-1. For now it also mutters the string it failed to decode. (This might get a bit noisy, but it lets you know if there are issues with a given import.)

2) Applies this to both author decoding *and* the commit message. I think the author failure hid the fact that the commit message was also broken. Basically, find_extra_authors decodes everything before bzr gets a chance at it, and bzr was always decoding 'message' as bzrlib.user_encoding, which I assume was always utf-8 on the import machine. Arguably it was succeeding 'by accident' rather than by design.

3) Changes 'find_thanks()' to allow names to start with a Unicode character, rather than requiring strictly A-Z. If you want, I can bring back "author[0].isupper()" or something like that. Looking at the regex, if I said "Thanks to my cat" it seems reasonable to get 'deb-thanks': ['my cat'] even though it wasn't "Mr Cat". The "Thanks to" and "thank you" prefixes seem to be a decent filter, without having to worry about the exact form of the name.
I can restore the original behavior and change the tests if you prefer, but it seemed reasonable to allow non-ascii as the first letter of someone's name. Given this changelog entry:
    - Translators: Vital Khilko (be), Vladimir Petkov (bg), Hendrik
      Brandt (de), Kostas Papadimas (el), Adam Weinberger (en_CA), Francisco
      Javier F. Serrador (es), Ilkka Tuohela (fi), Ignacio Casal Quinteiro
      (gl), Ankit Patel (gu), Luca Ferretti (it), Takeshi AIHANA (ja),
      Žygimantas Beručka (lt), Øivind Hoel (nb), Reinout van Schouwen (nl),
      Øivind Hoel (no), Evandro Fernandes Giovanini (pt_BR), Слободан Д.
      Средојевић (sr), Theppitak Karoonboonyanan (th), Clytie Siddall (vi),
      Funda Wang (zh_CN)

At least 3 of those people have non-ascii first letters (Žygimantas, Øivind, Слободан).
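
To illustrate the matching behaviour described in (3), here is a minimal Python 3 sketch. The pattern and the name find_thanks_sketch are hypothetical and are not the regex used in bzr-builddeb; the sketch only shows how a "thanks to" / "thank you" prefix can be followed by words that start with any Unicode letter instead of strictly A-Z.

```python
import re

# Hypothetical pattern, NOT the one in bzr-builddeb.
# [^\W\d_] matches any Unicode letter (a word character that is neither a
# digit nor an underscore), so names may start with Ž, Ø, С, and so on.
THANKS_RE = re.compile(
    r"thank(?:s\s+to|\s+you),?\s+"
    r"((?:[^\W\d_][\w'-]*)(?:\s+[^\W\d_][\w'-]*)*)",
    re.IGNORECASE,
)


def find_thanks_sketch(text):
    """Return the names (or phrases) following a thanks prefix in text."""
    return [match.group(1) for match in THANKS_RE.finditer(text)]


names = find_thanks_sketch("Thanks to Žygimantas Beručka")
```

With this pattern "Thanks to my cat." yields ['my cat'], matching the relaxed behaviour discussed above. The sketch stops at punctuation, so initials such as "Д." would cut a name short; a real implementation would need to handle that.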

4) I also made sure to run this locally against 'gnome-panel', which was one of the failing imports. It has certainly gotten a lot farther, and I've checked that it has run into a few of these mixed-encoding sections. Note that this assumes that each changelog block uses a constant encoding (for the purposes of the commit message), but that actually seems reasonable, as dapper/debian/changelog switches back and forth between iso-8859-1 in some blocks and utf-8 in others.
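
The fallback from (1), applied per changelog block as assumed in (4), can be sketched roughly as follows. This is a Python 3 illustration rather than the code in the branch: safe_decode and decode_blocks are hypothetical names, and the standard logging module stands in for bzr's mutter().

```python
import logging

# Stand-in for bzr's mutter(); the logger name is an assumption.
logger = logging.getLogger("changelog-import")


def safe_decode(raw):
    """Decode bytes as utf-8, falling back to iso-8859-1 on failure."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Record the bytes that failed so problem imports can be spotted.
        logger.debug("not valid utf-8, falling back to iso-8859-1: %r", raw)
        # iso-8859-1 maps every byte to a code point, so this cannot fail.
        return raw.decode("iso-8859-1")


def decode_blocks(blocks):
    """Decode each changelog block on its own, one encoding per block."""
    return [safe_decode(block) for block in blocks]


# The same entry, stored once as utf-8 and once as iso-8859-1.
utf8_block = "  * Thanks to Øivind Hoel\n".encode("utf-8")
latin1_block = "  * Thanks to Øivind Hoel\n".encode("iso-8859-1")
decoded = decode_blocks([utf8_block, latin1_block])
```

Both blocks decode to the same text here. A block that mixed the two encodings internally would fall back to iso-8859-1 as a whole and show mojibake for its utf-8 parts, which is why the constant-encoding-per-block assumption matters.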
