Code review comment for lp:~andreserl/maas/lp1570609

Gavin Panella (allenap) wrote :

The error:

  builtins.ValueError: Unicode strings with encoding declaration are not
  supported. Please use bytes input or XML fragments without
  declaration.

means that we actually need to go back a step further.

The change:

- for doc in self._parse_multiple_xml_docs(xml)
+ for doc in self._parse_multiple_xml_docs(xml.encode())

encodes `xml` as UTF-8 [1], but if the XML declaration is:

  <?xml version="1.0" encoding="ISO-8859-1"?>

then the XML parser will decode those bytes as ISO-8859-1, which agrees
with UTF-8 only for ASCII. Any other character will be silently
mis-decoded. Substitute ISO-8859-1 with UTF-16, Shift-JIS, or any other
recognised encoding and you'll have similar problems.
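
To make the mismatch concrete, here is a minimal sketch. It assumes the
standard library's xml.etree.ElementTree instead of the parser used in
the branch, and the sample document is made up for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: declares ISO-8859-1 and contains one non-ASCII
# character, U+00E9 (e with acute accent).
text = '<?xml version="1.0" encoding="ISO-8859-1"?><name>caf\u00e9</name>'

# str.encode() defaults to UTF-8, so U+00E9 becomes the bytes 0xC3 0xA9.
wrong_bytes = text.encode()

# The parser trusts the declaration and decodes each byte as ISO-8859-1,
# yielding two characters where there should be one.
garbled = ET.fromstring(wrong_bytes).text

# Bytes that really are ISO-8859-1 round-trip correctly.
right_bytes = text.encode("iso-8859-1")
intact = ET.fromstring(right_bytes).text

print(repr(garbled), repr(intact))
```

Note that nothing fails here: the first parse succeeds and hands back
the wrong text.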

What you need are the bytes that were originally pulled off the wire,
read in from a process, or read in from a file. Those should be passed
to the XML parser which will DTRT with respect to encoding.
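
A rough sketch of what I mean, again assuming xml.etree.ElementTree and
a made-up temporary file standing in for wherever the XML really comes
from in MAAS:

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical document, stored on disk in the encoding it declares.
original = '<?xml version="1.0" encoding="ISO-8859-1"?><name>caf\u00e9</name>'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.xml")
    with open(path, "wb") as fd:
        fd.write(original.encode("iso-8859-1"))

    # Open in binary mode: no decoding happens on our side.
    with open(path, "rb") as fd:
        raw = fd.read()

# The parser reads the declaration itself and picks the right codec.
parsed_name = ET.fromstring(raw).text
print(repr(parsed_name))
```

The same idea applies to a socket or a subprocess pipe: keep the data
as bytes until the parser has seen it.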

[1] FWIW, I *hate* that Python 3 has defined a default encoding for
    str.encode() and bytes.decode().

review: Needs Fixing
