w3mman2html.cgi doesn't correctly underline UTF-8 characters
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Ubuntu Manpage Repository | Fix Released | Undecided | Unassigned |
w3m | Unknown | Unknown | |
w3m (Ubuntu) | Fix Released | Undecided | Unassigned |
Bug Description
Binary package hint: w3m
Ubuntu version: Ubuntu 10.10
Package version: 0.5.2-6
w3mman2html does not handle underlined UTF-8 text correctly: underlining is applied byte by byte rather than character by character. For example, the backspace escape sequence _^Hé is transformed into the HTML code <u>\0xC3</u>\0xA9, which contains two invalid 1-byte UTF-8 sequences.
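The failure can be reproduced outside the CGI script. The following is a minimal Python sketch, not the script's actual Perl code: `underline_bytewise` is a hypothetical helper that treats exactly one byte after `_^H` as the underlined character, which is the behaviour described above.

```python
import re

def underline_bytewise(data: bytes) -> bytes:
    # Hypothetical stand-in for the buggy behaviour: exactly one byte
    # after "_" + backspace (0x08) is wrapped in <u>...</u>.
    return re.sub(rb"_\x08(.)", rb"<u>\1</u>", data, flags=re.DOTALL)

# "_^Hé" as UTF-8 bytes: 0x5F 0x08 0xC3 0xA9
src = b"_\x08" + "é".encode("utf-8")
out = underline_bytewise(src)

# The two bytes of "é" end up on opposite sides of the closing tag.
print(out)

try:
    out.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False
print(is_valid_utf8)
```

The result is no longer decodable as UTF-8, because the lead byte 0xC3 and the continuation byte 0xA9 are separated by `</u>`.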
In fact it assumes that backspace escape codes are of the form __^Hé or é^H__ (with two underscores). However, as far as I could test with the Ubuntu man program, only one underscore is generated, independently of the length of the UTF-8 encoding for that letter (man version 2.5.7-4, groff version 1.20.1-10).
The number of backspace characters in the bold and italic escape codes is only one as far as I could see, hence the match for multiple backspace characters is useless, even if it is innocuous.
I submit a patch that should correctly deal with bold and underline escapes independently of the length of the UTF-8 character. Till now only 2-byte characters were taken into account.
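The idea behind such a fix can be sketched as follows. This is an illustration in Python, not the attached patch (which targets the Perl script and is not reproduced here). The pattern `CHAR` is an assumption: one byte outside the UTF-8 continuation range 0x80-0xBF followed by any number of continuation bytes, so that a character of any encoded length is treated as a unit. The name `man_to_html` is hypothetical, and HTML escaping is left out for brevity.

```python
import re

# Assumed "one character" pattern: a non-continuation byte followed by
# zero or more UTF-8 continuation bytes (0x80-0xBF).
CHAR = rb"[^\x80-\xbf][\x80-\xbf]*"

UNDERLINE = re.compile(rb"_\x08(" + CHAR + rb")")   # _ ^H c
BOLD = re.compile(rb"(" + CHAR + rb")\x08\1")       # c ^H c

def man_to_html(data: bytes) -> bytes:
    """Convert single-backspace underline and bold escapes to HTML tags."""
    data = UNDERLINE.sub(rb"<u>\1</u>", data)
    data = BOLD.sub(rb"<b>\1</b>", data)
    # Merge the tags of adjacent characters so whole words share one tag.
    return data.replace(b"</u><u>", b"").replace(b"</b><b>", b"")

def overstrike(text: str, underline: bool) -> bytes:
    """Build the escape sequence for each character of `text`."""
    parts = []
    for ch in text:
        enc = ch.encode("utf-8")
        parts.append((b"_" if underline else enc) + b"\x08" + enc)
    return b"".join(parts)

src = overstrike("été", underline=True) + b" " + overstrike("nom", underline=False)
html = man_to_html(src)
print(html.decode("utf-8"))

# A 3-byte character is handled by the same pattern.
euro = man_to_html(overstrike("€", underline=True))
print(euro.decode("utf-8"))
```

Because the continuation bytes are matched with `*`, the same expression covers 1-byte through 4-byte characters without listing each length separately.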
If the man page is in a single-byte encoding instead of UTF-8, the underline matching code may match too many characters, as for the combination _^Hé . Such sequences should however be very rare, since usually only whole words are underlined and a backspace escape code will be followed either by a space or by another backspace escape code.
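The over-matching risk can be illustrated with a small Python sketch. It assumes a character pattern of the form "non-continuation byte plus any bytes in 0x80-0xBF", which is a guess at the general approach rather than the patch's exact expression. The Latin-1 pair "é»" is an example chosen here because "»" is byte 0xBB, inside the UTF-8 continuation range; it is not taken from the report.

```python
import re

# Assumed pattern, as a guess at the general approach: one non-continuation
# byte followed by any number of bytes in the UTF-8 continuation range.
CHAR = rb"[^\x80-\xbf][\x80-\xbf]*"
UNDERLINE = re.compile(rb"_\x08(" + CHAR + rb")")

# Latin-1 input: underlined "é" (0xE9) followed by a plain "»" (0xBB).
rare = b"_\x08" + "é»".encode("latin-1")
rare_out = UNDERLINE.sub(rb"<u>\1</u>", rare)
print(rare_out)    # the "»" byte is pulled inside the underline

# The usual case: the underlined character is followed by a space.
usual = b"_\x08" + "é ".encode("latin-1")
usual_out = UNDERLINE.sub(rb"<u>\1</u>", usual)
print(usual_out)   # only "é" is underlined
```

When the next byte is a space or the `_` of another escape sequence, nothing extra is matched, which is why the problem should seldom appear in practice.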
The attachment "Correct underline processing and more UTF-8 support" seems to be a patch. If it isn't, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are a member of the ~ubuntu-reviewers, unsubscribe the team.
[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issues please contact him.]