Priorities of BOM and from_encoding should be switched

Bug #1889014 reported by John Wodder
This bug affects 1 person
Affects: Beautiful Soup
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

If I'm reading the bs4 source correctly, when BeautifulSoup attempts to determine the encoding of a binary document, it tries the user-specified encoding first, and then after that it tries the encoding implied by the BOM (if any). However, the WHATWG standard for determining character encodings (https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding) says that the BOM encoding should take precedence over all other encoding sources, with user-specified encodings (and transport-layer-declared encodings, like the HTTP Content-Type charset, which I would wager is a major source of `from_encoding` values) coming in next. BeautifulSoup4 should thus try the BOM encoding first in order to be conformant.
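The ordering requested above can be illustrated with a minimal, standard-library-only sketch. It is not Beautiful Soup's implementation: `sniff_bom` and `decode_bom_first` are hypothetical helper names, the `cp1252` fallback is an assumption, and only the three BOMs checked here are considered.

```python
import codecs

# BOMs checked by this sketch, paired with the encoding each one implies.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
)


def sniff_bom(data: bytes):
    """Return (encoding, data_without_bom), or (None, data) if no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return None, data


def decode_bom_first(data: bytes, user_encoding=None, fallback="cp1252"):
    """Hypothetical helper: the BOM wins, then the user-supplied encoding,
    then an assumed fallback. Returns (text, encoding_used)."""
    bom_encoding, body = sniff_bom(data)
    encoding = bom_encoding or user_encoding or fallback
    return body.decode(encoding), encoding


# A UTF-8 BOM overrides a conflicting user-specified encoding.
text, used = decode_bom_first(codecs.BOM_UTF8 + "café".encode("utf-8"),
                              user_encoding="latin-1")
```

Unlike a real detector, this sketch commits to a single encoding and does not fall through to the next candidate when decoding fails.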

Leonard Richardson (leonardr) wrote :

Thanks for taking the time to file this bug.

The "override_encodings" argument is designed to handle the "known definite encoding" case (https://html.spec.whatwg.org/multipage/parsing.html#parsing-with-a-known-character-encoding).

But in library code there's not a strong distinction between "known definite encoding" and "user has explicitly instructed the user agent to override the document's character encoding with a specific encoding". There's some passive voice in 12.2.3.1 -- who "knows" that the input has a certain encoding, if not the "user"?

Would it solve your problem if there were two arguments like "override_encodings", one list to be applied before BOM sniffing and one list to be applied afterwards?

John Wodder (jwodder) wrote :

> Would it solve your problem if there were two arguments like "override_encodings", one list to be applied before BOM sniffing and one list to be applied afterwards?

Yes, it would.

Leonard Richardson (leonardr) wrote :

As of revision 598, the UnicodeDammit and EncodingDetector classes take arguments "known_definite_encodings" and "user_encodings", which are named after the corresponding steps in the algorithm laid out in the HTML5 spec. The old argument "override_encodings" is now deprecated and its value gets tacked on to the end of "known_definite_encodings".
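The resulting priority order can be modelled with a small sketch. This is only an illustration of the ordering described in this thread, not the library's code: `candidate_encodings` is a hypothetical function, the parameter names merely mirror the arguments mentioned above, and any later detection steps a real detector may run are omitted.

```python
def candidate_encodings(known_definite_encodings=(), bom_encoding=None,
                        user_encodings=(), override_encodings=()):
    """Yield candidate encodings in priority order, skipping duplicates."""
    # The deprecated override_encodings are tacked on to the end of the
    # known-definite list, as described in the comment above.
    ordered = list(known_definite_encodings) + list(override_encodings)
    # The encoding implied by a BOM (if any) comes next...
    if bom_encoding:
        ordered.append(bom_encoding)
    # ...and user-supplied encodings are only tried after BOM sniffing.
    ordered.extend(user_encodings)

    seen = set()
    for encoding in ordered:
        key = encoding.lower()
        if key not in seen:
            seen.add(key)
            yield encoding


order = list(candidate_encodings(
    known_definite_encodings=["utf-8"],
    bom_encoding="utf-16-le",
    user_encodings=["latin-1"],
    override_encodings=["koi8-r"],
))
```

Under these assumptions, a caller who is certain of the encoding passes it as a known definite encoding, while a value taken from something like an HTTP Content-Type header would go in the user list so that a BOM can still take precedence.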

Changed in beautifulsoup:
status: New → Fix Committed
Leonard Richardson (leonardr) wrote :

Released in 4.10.0.

Changed in beautifulsoup:
status: Fix Committed → Fix Released