The error:
    builtins.ValueError: Unicode strings with encoding declaration are not
    supported. Please use bytes input or XML fragments without
    declaration.
means that we actually need to go back a step further.
The change:
    - for doc in self._parse_multiple_xml_docs(xml)
    + for doc in self._parse_multiple_xml_docs(xml.encode())
encodes `xml` as UTF-8 [1], but if the XML declaration is:
    <?xml version="1.0" encoding="ISO-8859-1"?>
then the XML parser will decode those bytes as ISO-8859-1, which is subtly
different from UTF-8: every character outside ASCII comes out wrong.
Substitute ISO-8859-1 with UTF-16, Shift-JIS, or any other recognised
encoding and you'll have similar problems.
What you need are the bytes that were originally pulled off the wire,
read in from a process, or read in from a file. Those should be passed
to the XML parser which will DTRT with respect to encoding.
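
To illustrate the mismatch, here is a minimal sketch. It assumes the standard library's xml.etree.ElementTree rather than the parser that raised the error above, and the sample document is made up for the example:

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the declaration claims ISO-8859-1.
text = '<?xml version="1.0" encoding="ISO-8859-1"?><name>café</name>'

# The bytes as they would have come off the wire or out of a file:
# encoded the way the declaration says they are.
wire_bytes = text.encode("iso-8859-1")

# What str.encode() with no argument produces: UTF-8 bytes that
# contradict the declaration still sitting at the top of the document.
reencoded = text.encode()

# The parser trusts the declaration in both cases.
from_wire = ET.fromstring(wire_bytes).text
from_reencoded = ET.fromstring(reencoded).text

print(from_wire)       # the original text
print(from_reencoded)  # mojibake: the UTF-8 bytes were read as ISO-8859-1
```

The parse of the re-encoded bytes does not fail; it quietly returns the wrong text, which is why the original bytes need to be carried through to the parser.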
[1] FWIW, I *hate* that Python 3 has defined a default character set for
str.encode() and bytes.decode().