BeautifulSoup._popToTag will pop every tag in the document if given a mismatched end tag

Bug #1880420 reported by Leonard Richardson
This bug affects 1 person
Affects: Beautiful Soup
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

Code to reproduce:

---
from bs4 import BeautifulSoup
data = """<html><div>b</div></span>c"""
print(BeautifulSoup(data, 'html.parser'))
---

Output:

---
<html><div>b</div></html>c
---

The markup '</span>' makes html.parser call BeautifulSoupHTMLParser.handle_endtag, which calls BeautifulSoup.handle_endtag, which calls BeautifulSoup._popToTag('span'). Since there is no open <span> tag in the stack, every tag in the stack gets popped, including <html>. Parsing of the document then proceeds with <html> already closed, even though </html> is supposed to signify the end of the document. That is why the trailing "c" appears after </html> in the output.
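The failure mode can be sketched with a plain list standing in for the stack of open tags. This is a simplified model for illustration, not the actual Beautiful Soup code; the function name naive_pop_to_tag and the list-based stack are assumptions.

```python
def naive_pop_to_tag(stack, name):
    """Pop tags until one called `name` has been popped.

    Hypothetical model: there is no check that `name` is open at all,
    so a mismatched end tag drains the whole stack.
    """
    popped = []
    while stack:
        popped.append(stack.pop())
        if popped[-1] == name:
            break
    return popped


# State after parsing "<html><div>b</div>": only <html> is still open.
open_tags = ["html"]
popped_tags = naive_pop_to_tag(open_tags, "span")  # handles the stray </span>
print(popped_tags)  # ['html']
print(open_tags)    # []
```

With nothing left open, any later text has no enclosing tag to attach to.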

This is only a problem when using html.parser, since lxml and html5lib know not to treat "</span>" as a real closing tag.

One solution that wouldn't hurt performance much would be to keep a Counter of tag names, and to make _popToTag a no-op if the tag name isn't in the Counter. (It needs to be a Counter so we can keep track of nested tags with the same name while still keeping constant-time lookups.)
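A minimal sketch of the Counter idea follows. It is not the code from the actual fix; the TagStack class and its method names are hypothetical, and the stack holds bare tag names rather than Tag objects.

```python
from collections import Counter


class TagStack:
    """Hypothetical stand-in for the parser's stack of open tags."""

    def __init__(self):
        self.stack = []               # open tag names, outermost first
        self.open_counts = Counter()  # number of open tags per name

    def push(self, name):
        self.stack.append(name)
        self.open_counts[name] += 1

    def pop(self):
        name = self.stack.pop()
        self.open_counts[name] -= 1
        return name

    def pop_to_tag(self, name):
        """Pop up to and including the most recent open `name` tag.

        Returns the popped names, or an empty list (a no-op) when no
        tag with that name is currently open.
        """
        if self.open_counts[name] <= 0:
            return []  # mismatched end tag: leave the stack alone
        popped = []
        while self.stack:
            popped.append(self.pop())
            if popped[-1] == name:
                break
        return popped


tags = TagStack()
for tag_name in ("html", "div"):
    tags.push(tag_name)
tags.pop_to_tag("div")   # closes <div>
tags.pop_to_tag("span")  # no open <span>: nothing is popped
print(tags.stack)        # ['html']
```

A plain set would not be enough here: with two nested <div> tags open, closing the inner one must leave "div" marked as still open, which is what the per-name count provides.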

Leonard Richardson (leonardr) wrote :

Revision 579 has a fix.

Changed in beautifulsoup:
status: New → Fix Committed
Changed in beautifulsoup:
status: Fix Committed → Fix Released