Beautiful Soup

Priorities of BOM and from_encoding should be switched

Bug #1889014 reported by John Wodder on 2020-07-26

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
Beautiful Soup	Fix Released	Undecided	Unassigned	

Bug Description

If I'm reading the bs4 source correctly, when BeautifulSoup attempts to determine the encoding of a binary document, it tries the user-specified encoding first, and then after that it tries the encoding implied by the BOM (if any). However, the WHATWG standard for determining character encodings (https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding) says that the BOM encoding should take precedence over all other encoding sources, with user-specified encodings (and transport-layer-declared encodings, like the HTTP Content-Type charset, which I would wager is a major source of `from_encoding` values) coming in next. BeautifulSoup4 should thus try the BOM encoding first in order to be conformant.
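The precedence described in the report can be sketched in a few lines of standard-library Python. This is only an illustration of the ordering (BOM first, then a declared encoding, then a default), not Beautiful Soup's code and not the full WHATWG algorithm, which has further steps such as the `<meta>` prescan. The names `sniff_bom`, `pick_encoding`, `declared`, and `fallback` are hypothetical.

```python
import codecs

# The three byte order marks checked by the WHATWG "BOM sniff" step.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
)


def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def pick_encoding(data: bytes, declared=None, fallback="windows-1252"):
    """Choose an encoding name, giving a BOM priority over `declared`.

    `declared` stands in for a user-specified or transport-layer
    encoding (hypothetical parameter, playing the role of from_encoding).
    """
    bom_encoding = sniff_bom(data)
    if bom_encoding is not None:
        return bom_encoding  # the BOM wins over every other source
    if declared:
        return declared      # user / Content-Type charset comes next
    return fallback          # simplified stand-in for the remaining steps


if __name__ == "__main__":
    doc = codecs.BOM_UTF8 + b"<p>hi</p>"
    print(pick_encoding(doc, declared="latin-1"))
    print(pick_encoding(b"<p>hi</p>", declared="latin-1"))
```

With this ordering, a declared `latin-1` is ignored for a document that starts with a UTF-8 BOM, and is used only when no BOM is present.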

Leonard Richardson (leonardr) wrote on 2020-07-27:

Thanks for taking the time to file this bug.

The "override_encodings" argument is designed to handle the "known definite encoding" case (https://html.spec.whatwg.org/multipage/parsing.html#parsing-with-a-known-character-encoding).

But in library code there's not a strong distinction between "known definite encoding" and "user has explicitly instructed the user agent to override the document's character encoding with a specific encoding". There's some passive voice in 12.2.3.1 -- who "knows" that the input has a certain encoding, if not the "user"?

Would it solve your problem if there were two arguments like "override_encodings", one list to be applied before BOM sniffing and one list to be applied afterwards?

John Wodder (jwodder) wrote on 2020-07-27:

> Would it solve your problem if there were two arguments like "override_encodings", one list to be applied before BOM sniffing and one list to be applied afterwards?

Yes, it would.

Leonard Richardson (leonardr) wrote on 2021-02-13:

As of revision 598, the UnicodeDammit and EncodingDetector classes take arguments "known_definite_encodings" and "user_encodings", which are named after the corresponding steps in the algorithm laid out in the HTML5 spec. The old argument "override_encodings" is now deprecated and its value gets tacked on to the end of "known_definite_encodings".
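As a rough model of the resulting order, the sketch below yields candidate encodings with one list tried before BOM sniffing and one after. It is an assumption-laden illustration, not the code from revision 598: the function `candidate_encodings` is hypothetical and only borrows the argument names from the comment above, and the real classes may consider further sources that are omitted here.

```python
import codecs

# BOMs considered by this sketch (UTF-8 and the two UTF-16 byte orders).
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
)


def candidate_encodings(data, known_definite_encodings=(),
                        user_encodings=(), override_encodings=()):
    """Yield encoding names in the order they would be tried.

    Hypothetical helper: "known definite" encodings come before the
    BOM, "user" encodings after it. The deprecated override_encodings
    list is appended to the end of known_definite_encodings.
    """
    seen = set()

    def fresh(name):
        # Skip names already yielded, compared case-insensitively.
        key = name.lower()
        if key in seen:
            return False
        seen.add(key)
        return True

    # 1. Encodings the caller is certain about, plus deprecated overrides.
    for name in list(known_definite_encodings) + list(override_encodings):
        if fresh(name):
            yield name

    # 2. The encoding implied by a byte order mark, if any.
    for bom, name in _BOMS:
        if data.startswith(bom):
            if fresh(name):
                yield name
            break

    # 3. Encodings suggested by the user, tried only after the BOM.
    for name in user_encodings:
        if fresh(name):
            yield name


if __name__ == "__main__":
    doc = codecs.BOM_UTF8 + b"<p>hi</p>"
    print(list(candidate_encodings(doc, user_encodings=["latin-1"])))
    print(list(candidate_encodings(doc, known_definite_encodings=["latin-1"])))
```

The same `latin-1` hint therefore lands before or after the BOM-derived encoding depending on which list it is passed in, which is the distinction the two new arguments are meant to expose.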

Changed in beautifulsoup:
status:	New → Fix Committed

Leonard Richardson (leonardr) wrote on 2021-09-08:

Released in 4.10.0.

Changed in beautifulsoup:
status:	Fix Committed → Fix Released
