Support case-insensitive DOCTYPE with htmlparser

Bug #1848401 reported by Jibben Nee
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
Beautiful Soup | Fix Released | Undecided | Unassigned |

Bug Description

The current BeautifulSoupHTMLParser.handle_decl in bs4/builder/_htmlparser.py (lines 188-196) only matches uppercase "DOCTYPE " or "DOCTYPE". Many sites write the doctype in lowercase, so the prefix is never stripped and the result is <!DOCTYPE doctype html>. The method should check data.lower() against "doctype", as html.parser does.

Alternatively, the existing first case can apply in all situations: handle_decl is called by html.parser's parse_html_declaration, so the first word is guaranteed to be doctype in some casing. `data = data[len("DOCTYPE "):]` will therefore always work, with no if/elif required.
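To make the difference concrete, here is a minimal sketch of the two approaches as standalone functions. The names strip_doctype_current and strip_doctype_proposed are hypothetical stand-ins for the logic described above, not the actual bs4 code, and the sketch assumes the input always starts with "doctype" in some casing, as argued in this report.

```python
def strip_doctype_current(data):
    """Hypothetical stand-in for the case-sensitive check described above."""
    if data.startswith("DOCTYPE "):
        data = data[len("DOCTYPE "):]
    elif data == "DOCTYPE":
        # i.e. a bare "<!DOCTYPE>"
        data = ""
    return data


def strip_doctype_proposed(data):
    """Hypothetical stand-in for the proposal: drop the first word by length.

    Assumes the caller only passes declarations whose first word is
    "doctype" in some casing, so no comparison is needed.
    """
    return data[len("DOCTYPE "):]


if __name__ == "__main__":
    for decl in ("DOCTYPE html", "doctype html", "DOCTYPE"):
        print(repr(decl),
              repr(strip_doctype_current(decl)),
              repr(strip_doctype_proposed(decl)))
```

With the lowercase input the first function returns the string unchanged, which is how the extra "doctype" ends up in the output; the slice-only version returns "html" for either casing and an empty string for a bare "DOCTYPE".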

html.parser for reference: https://github.com/python/cpython/blob/3.8/Lib/html/parser.py#L265
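For anyone who wants to confirm what html.parser hands to handle_decl, a small sketch using only the standard library follows. DeclCollector and collect_decls are hypothetical names chosen for this illustration; the only library behaviour relied on is that HTMLParser calls handle_decl with the text between "<!" and ">".

```python
from html.parser import HTMLParser


class DeclCollector(HTMLParser):
    """Hypothetical helper that records every string passed to handle_decl."""

    def __init__(self):
        super().__init__()
        self.decls = []

    def handle_decl(self, decl):
        # html.parser passes the declaration text with its original casing.
        self.decls.append(decl)


def collect_decls(markup):
    """Parse markup and return the declarations seen, in order."""
    parser = DeclCollector()
    parser.feed(markup)
    parser.close()
    return parser.decls


if __name__ == "__main__":
    print(collect_decls("<!doctype html><p>hi</p>"))
    print(collect_decls("<!DOCTYPE html><p>hi</p>"))
```

The lowercase declaration arrives as "doctype html", so any handler that tests for an uppercase prefix will miss it.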

Jibben Nee (ziddey)
description: updated
Leonard Richardson (leonardr) wrote:

Fixed in revision 538.

Changed in beautifulsoup:
status: New → Fix Committed
Changed in beautifulsoup:
status: Fix Committed → Fix Released