Priorities of BOM and from_encoding should be switched
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Beautiful Soup | Fix Released | Undecided | Unassigned |
Bug Description
If I'm reading the bs4 source correctly, when BeautifulSoup attempts to determine the encoding of a binary document, it tries the user-specified encoding first, and then after that it tries the encoding implied by the BOM (if any). However, the WHATWG standard for determining character encodings (https:/
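To illustrate why the order matters, here is a minimal standard-library sketch. It is not bs4's actual code: `sniff_bom`, `decode_user_first`, and `decode_bom_first` are hypothetical helpers, and only the UTF-8 and UTF-16 BOMs are checked. It compares trying a user-specified encoding before the BOM with trying the BOM first.

```python
import codecs

# Hypothetical helper table, not part of bs4: leading BOM -> encoding name.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Return (encoding, data_without_bom), or (None, data) if no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return None, data


def decode_user_first(data, user_encoding):
    """Try the user-specified encoding first; consult the BOM only on failure."""
    try:
        return data.decode(user_encoding)
    except UnicodeDecodeError:
        encoding, rest = sniff_bom(data)
        return rest.decode(encoding or "utf-8", errors="replace")


def decode_bom_first(data, user_encoding):
    """Let a BOM, if present, take priority over the user-specified encoding."""
    encoding, rest = sniff_bom(data)
    if encoding is not None:
        return rest.decode(encoding)
    return data.decode(user_encoding, errors="replace")


# A UTF-8 document with a BOM, paired with a mismatched user-supplied encoding.
doc = codecs.BOM_UTF8 + "café".encode("utf-8")
user_first = decode_user_first(doc, "latin-1")
bom_first = decode_bom_first(doc, "latin-1")
```

Because `latin-1` can decode any byte sequence, the user-first path never falls back to the BOM and returns mojibake, while the BOM-first path recovers the original text.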
Thanks for taking the time to file this bug.
The "override_encodings" argument is designed to handle the "known definite encoding" case (https://html.spec.whatwg.org/multipage/parsing.html#parsing-with-a-known-character-encoding).
But in library code there's not a strong distinction between "known definite encoding" and "user has explicitly instructed the user agent to override the document's character encoding with a specific encoding". There's some passive voice in 12.2.3.1 -- who "knows" that the input has a certain encoding, if not the "user"?
Would it solve your problem if there were two arguments like "override_encodings", one list to be applied before BOM sniffing and one list to be applied afterwards?
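For concreteness, the two-list idea might look roughly like the sketch below. This is only an illustration of the proposal, not an existing bs4 API: the function `candidate_encodings` and the parameter names `before_bom` and `after_bom` are made up for this example, and only the UTF-8 and UTF-16 BOMs are recognized.

```python
import codecs

# Hypothetical table for this sketch: leading BOM -> encoding name.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def candidate_encodings(data, before_bom=(), after_bom=()):
    """Return encodings to try, in order: before_bom, the BOM's encoding, after_bom."""
    ordered = list(before_bom)
    for bom, name in _BOMS:
        if data.startswith(bom):
            ordered.append(name)
            break
    ordered.extend(after_bom)
    # Drop duplicates while keeping the first occurrence of each name.
    return list(dict.fromkeys(ordered))


with_bom = candidate_encodings(
    codecs.BOM_UTF8 + b"<p>hi</p>",
    before_bom=["windows-1252"],
    after_bom=["latin-1"],
)
without_bom = candidate_encodings(b"<p>hi</p>", after_bom=["latin-1"])
```

With this shape, a caller who wants the BOM to win would pass their encoding in the second list, and a caller with a truly definite encoding would pass it in the first.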