Code review comment for ~bryce/ubuntu/+source/apache2:merge-v2.4.51-2-jammy

Christian Ehrhardt  (paelzer) wrote :

For the test logs, it runs into:
  'utf-8' codec can't decode byte 0xa0 in position 1404450: invalid start byte

Firefox, Chrome and gunzip can all extract it fine.
But the problem isn't the extraction anyway, it is the UTF-8 conversion that follows.

That is reproducible by fetching the log file and running:
import gzip
with gzip.open("log.gz") as f:
    f.read().decode("utf-8")

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 1404450: invalid start byte

00156E18 6E 48 6F 73 74 3A 20 61 62 63 A0 5C 72 5C 6E 5C 72 5C 6E 0A 23 20 65 78 70 65 63 74 69 6E 67 20 32 30 30 2C 20 67 6F 74 nHost: abc.\r\n\r\n.# expecting 200, got
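
To find such a byte without a hex editor, the exception itself carries the offset. A minimal sketch, assuming a small in-memory sample that mimics the bytes around the failure (the helper name find_bad_byte and the sample are hypothetical, not part of the tool):

```python
import gzip


def find_bad_byte(data: bytes, context: int = 10):
    """Return (offset, byte value, surrounding bytes) for the first
    byte that is not valid UTF-8, or None if everything decodes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start / err.end delimit the offending bytes in err.object
        lo = max(err.start - context, 0)
        return err.start, data[err.start], data[lo:err.end + context]
    return None


# Hypothetical stand-in for the real log: same bytes as in the dump above
sample = gzip.compress(b"Host: abc\xa0\\r\\n\\r\\n\n# expecting 200, got")
raw = gzip.decompress(sample)

result = find_bad_byte(raw)
if result is not None:
    offset, value, around = result
    print(f"offset {offset}: byte 0x{value:02X}, context {around!r}")
```

On the real log the reported offset should match the position from the error message.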

Terminal/Vim renders that byte as a ".", but a real "." is 2E, not A0 (the 0A further along is a newline, which the dump also shows as ".").
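
Why a lone A0 is rejected can be checked quickly. This is only a sketch; that the byte was originally written as Latin-1 is a guess, not something the log confirms:

```python
import unicodedata

bad = b"\xa0"

# 0xA0 is 0b10100000: the 10xxxxxx pattern marks a UTF-8 continuation
# byte, which may only follow a lead byte and can never start a character.
is_continuation = (bad[0] & 0b11000000) == 0b10000000

# Assumption: if the producer meant Latin-1, A0 is a no-break space.
as_latin1 = bad.decode("latin-1")
name = unicodedata.name(as_latin1)

# The same character in valid UTF-8 needs a lead byte in front of A0.
as_utf8 = as_latin1.encode("utf-8")

print(is_continuation, name, as_utf8)
```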

The file type is nevertheless reported as UTF-8:
log-unp: UTF-8 Unicode text, with very long lines, with CRLF, CR, LF line terminators

It isn't the Python code that is wrong; other tools agree:

$ iconv -f UTF-8 log-unp -o /dev/null
iconv: illegal input sequence at position 1404450

So maybe we should make our tool more tolerant as well.
Using errors="replace" makes this work much better.

The other decodes should stay strict IMHO.
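
Roughly what the tolerant decode looks like, as a sketch only: the function name decode_log and the sample bytes are made up for illustration, the actual change is in the tools repo.

```python
import gzip


def decode_log(compressed: bytes) -> str:
    """Decompress a gzip'ed log and decode it as UTF-8, substituting
    U+FFFD for any byte sequence that is not valid UTF-8."""
    return gzip.decompress(compressed).decode("utf-8", errors="replace")


# Hypothetical sample containing the same stray A0 byte as the real log
sample = gzip.compress(b"Host: abc\xa0\\r\\n")
text = decode_log(sample)
print(text)
```

With errors="replace" the stray byte becomes U+FFFD (the replacement character), so the rest of the log stays readable and the spot where data was bad remains visible.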

Pushed to the tools repo
