Failure to import when decoding changelog authors

Bug #508251 reported by Andrew Starr-Bochicchio
This bug affects 3 people
Affects: Ubuntu Distributed Development
Status: Fix Released
Importance: Medium
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

There are no Bazaar source branches for the “evolution” package in Ubuntu on Launchpad. See:

https://code.edge.launchpad.net/ubuntu/+source/evolution/+branches

http://package-import.ubuntu.com/failures/evolution

Failed at 2010-01-12 21:57:53.778573

Traceback (most recent call last):
  File "./import_package.py", line 788, in <module>
    no_existing=options.no_existing))
  File "./import_package.py", line 713, in main
    import_package(temp_dir, importp, revid_db, bstore, possible_transports=possible_transports)
  File "./import_package.py", line 481, in import_package
    use_time_from_changelog=True)
  File "/srv/package-import.canonical.com/new/scripts/plugins/builddeb/import_dsc.py", line 1481, in import_package
    file_ids_from=file_ids_from)
  File "/srv/package-import.canonical.com/new/scripts/plugins/builddeb/import_dsc.py", line 1376, in _do_import_package
    timestamp=timestamp, file_ids_from=file_ids_from)
  File "/srv/package-import.canonical.com/new/scripts/plugins/builddeb/import_dsc.py", line 1261, in import_debian
    get_commit_info_from_changelog(changelog, self.branch)
  File "/srv/package-import.canonical.com/new/scripts/plugins/builddeb/util.py", line 440, in get_commit_info_from_changelog
    authors += find_extra_authors(changes)
  File "/srv/package-import.canonical.com/new/scripts/plugins/builddeb/util.py", line 388, in find_extra_authors
    match = extra_author_re.match(change.decode("utf-8"))
  File "/usr/lib/python2.5/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 22-27: unsupported Unicode code range

Affects:
  evolution
  gnome-control-center
  totem (slightly different, see comment #1 below).
  gnome-panel


description: updated
John A Meinel (jameinel)
summary: - No source branches for “evolution” package in Ubuntu
+ Failure to import when decoding changelog authors
Changed in udd:
importance: Undecided → Medium
status: New → Confirmed
Robert Collins (lifeless) wrote :

A related failure:
Traceback (most recent call last):
  File "./import_package.py", line 983, in <module>
    extra_debian=options.extra_debian))
  File "./import_package.py", line 941, in main
    import_package(temp_dir, importp, possible_transports=possible_transports)
  File "./import_package.py", line 563, in import_package
    use_time_from_changelog=True)
  File "/srv/package-import.canonical.com/new/scripts/plugins/builddeb/import_dsc.py", line 1541, in import_package
    timestamp=timestamp, author=author)
  File "/srv/package-import.canonical.com/new/scripts/plugins/builddeb/import_dsc.py", line 1428, in _do_import_package
    timestamp=timestamp)
  File "/srv/package-import.canonical.com/new/scripts/plugins/builddeb/import_dsc.py", line 1264, in import_debian
    revprops['authors'] = "\n".join(authors).decode("utf-8")
  File "/usr/lib/python2.5/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 12: ordinal not in range(128)

If authors is always kept as unicode and the handling is overhauled, this will go away.
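For context on the traceback above: in Python 2, calling .decode("utf-8") on an object that is already unicode first encodes it implicitly with the ascii codec, which is why a decode call can raise UnicodeEncodeError. A minimal sketch of the "always unicode" idea, written for Python 3, might look like the following. The helper name to_text and the sample authors are hypothetical and are not taken from bzr-builddeb.

```python
def to_text(value, encoding="utf-8"):
    """Return text, decoding only when the value is still bytes.

    Hypothetical helper: it converts once at the boundary so that
    later code never calls decode() on something already decoded.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Made-up sample data: one author still as UTF-8 bytes, one already text.
authors = [
    b"J\xc3\xb6rg Example <joerg@example.com>",
    "Jane Doe <jane@example.com>",
]

# Normalise every entry first, then join; no second decode is needed.
authors_text = "\n".join(to_text(author) for author in authors)
print(authors_text)
```

With every author normalised up front, the join operates on text only and the mixed bytes/unicode case in the traceback does not arise.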

John A Meinel (jameinel) wrote :

I tried to look into this a bit. As near as I can tell, the python-debian changelog parser doesn't decode anything. I would argue that all of the commit messages and authors should be Unicode strings as soon as we can reasonably make them, which is either in import_dsc.py or in python-debian itself, if we can hack that code.

James, I think I saw that you were one of the authors of python-debian. Is it code that we can reasonably hack? Or is it sort of Debian-specific, so that we should be making the changes in bzr-builddeb?

John A Meinel (jameinel) wrote :

In the case of gnome-panel, this fails during "find_extra_authors". It fails because it iterates the changelog and tries to decode('utf-8') each line, looking for an author.

In the case of gnome-panel, it lists Translators as:

(Pdb) pp changes
['* New upstream version:',

...
 ' Docs Translators:',
 ' - Maxim Dziumanenko (uk)',
 ' Translators:',
 ' - Vital Khilko (be)',
 " - J\xe9r\xe9my Le Floc'h (br)",
 ' - Pema Geyleg (dz)',
 ' - Ivar Smolin (et)',
 ' - Beno\xeet Dejean (fr)',
...

Note that I'm pretty certain this is iso-8859-1 encoding, as '\xe9' => é and '\xee' => î. That said, iso-8859-2 and iso-8859-15 decode these bytes to the same characters, so I guess it could be any of them...
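As a quick check of that observation, this sketch takes one of the byte strings from the pdb output and decodes it under each candidate encoding. The variable names are illustrative only, and it is written for Python 3 rather than the Python 2.5 shown in the traceback.

```python
# One of the changelog lines from the pdb output, as raw bytes.
raw = b" - J\xe9r\xe9my Le Floc'h (br)"

# The bytes are not valid UTF-8, so a strict decode raises.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decode the same bytes with each of the three ISO-8859 candidates.
candidates = ("iso-8859-1", "iso-8859-2", "iso-8859-15")
decoded = {encoding: raw.decode(encoding) for encoding in candidates}

print("valid UTF-8:", utf8_ok)
for encoding, text in decoded.items():
    print(encoding, "->", text)
```

All three encodings agree on this line, so the bytes alone cannot say which one the changelog author used.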

Anyway,

#1) These won't match the extra author information anyway, because they aren't in the form [Author Name]. So we could just wait to decode them until after the match is run. The current author regex is:
extra_author_re = re.compile(r"\s*\[([^\]]+)]\s*", re.UNICODE)

Which, IIRC, says "leading space, [, anything but ], ], trailing space".

However, if this sort of data is then brought into the commit log, etc, it is going to fail anyway, when we try to create a Unicode commit message.

#2) Allow the decode to fail, and just assume there isn't an author there.

#3) Fall back to iso-8859-1 as the decoder.
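A rough sketch of how options #1 and #3 could be combined is below: match on the raw bytes first, then decode only the captured author name, trying UTF-8 and falling back to iso-8859-1. This is not the fix that was actually released. The function names, the bytes form of the regex, and the sample data are all assumptions made for illustration, and the code targets Python 3.

```python
import re

# Bytes version of the author pattern quoted above, so that matching
# happens before any decoding (option #1). This variant is an assumption.
EXTRA_AUTHOR_RE = re.compile(rb"\s*\[([^\]]+)]\s*")


def decode_with_fallback(raw):
    """Decode as UTF-8, falling back to iso-8859-1 (option #3)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # iso-8859-1 maps every byte to a character, so this cannot fail.
        return raw.decode("iso-8859-1")


def find_extra_authors_sketch(changes):
    """Return names found in lines of the form '[ Author Name ]'."""
    authors = []
    for line in changes:
        match = EXTRA_AUTHOR_RE.match(line)
        if match:
            authors.append(decode_with_fallback(match.group(1)).strip())
    return authors


# Made-up changelog lines: the translator line is never decoded at all.
sample_changes = [
    b"* New upstream version:",
    b" - J\xe9r\xe9my Le Floc'h (br)",
    b"  [ Ren\xe9 Example ]",
    "  [ Zo\u00eb Example ]".encode("utf-8"),
]
found = find_extra_authors_sketch(sample_changes)
print(found)
```

Option #2 would amount to catching the UnicodeDecodeError and skipping the line instead of falling back. As noted above, none of this helps if the undecodable text later ends up in a Unicode commit message.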

James Westby (james-w)
Changed in udd:
status: Confirmed → Fix Released