html.parser doesn't acknowledge HTML5 named entities

Bug #1924908 reported by Jean-Christophe Amiel
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Beautiful Soup | Fix Released | Undecided | Unassigned | |

Bug Description

Given the following HTML input:

```
<div>&RightArrowLeftArrow;</div>
```

Parse it with Beautiful Soup:

```
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("<div>&RightArrowLeftArrow;</div>")
>>> soup.decode(formatter=None)
'<div>&RightArrowLeftArrow</div>'
```

The trailing semicolon of the HTML entity is removed.
Other HTML entities, such as &eacute;, work correctly:

```
>>> soup = BeautifulSoup("<div>&eacute;</div>")
>>> soup.decode(formatter=None)
'<div>é</div>'
```

(Tested with Beautiful Soup 4.9.1 and 4.9.3.)
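
One way to see why the two entities might be treated differently is to compare the entity tables in Python's standard library. This is only a sketch: it assumes, without confirmation from this report, that the affected code path recognised the HTML 4 names but not the HTML5 ones. The variable names are illustrative.

```python
from html.entities import html5, name2codepoint

# name2codepoint holds the HTML 4 entity names, without a trailing semicolon.
eacute_in_html4 = "eacute" in name2codepoint
arrow_in_html4 = "RightArrowLeftArrow" in name2codepoint

# html5 holds the HTML5 entity names; its keys include the trailing semicolon.
arrow_in_html5 = "RightArrowLeftArrow;" in html5
arrow_character = html5["RightArrowLeftArrow;"]

print(eacute_in_html4)   # True
print(arrow_in_html4)    # False
print(arrow_in_html5)    # True
print(arrow_character)   # ⇄
```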

description: updated
Leonard Richardson (leonardr) wrote :

RightArrowLeftArrow is a named entity in HTML5, so the simplest way to solve your problem is to use the html5lib parser, which understands that entity:

```
soup = BeautifulSoup("<div>&RightArrowLeftArrow;</div>", "html5lib")
soup.decode(formatter=None)
# '<html><head></head><body><div>⇄</div></body></html>'
```

I spent some time looking into whether it's possible to make the other parsers able to handle HTML5 named entities, and it does seem possible in almost all cases.
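
As a rough illustration of what handling HTML5 named entities on top of the standard-library parser could look like, here is a minimal sketch. It is not the Beautiful Soup implementation: `EntityCollector` and `text_of` are hypothetical names, and the fallback branch assumes the reference was written with its semicolon.

```python
from html.entities import html5
from html.parser import HTMLParser


class EntityCollector(HTMLParser):
    """Hypothetical helper: gathers text, resolving named entities via the HTML5 table."""

    def __init__(self):
        # With convert_charrefs=False, HTMLParser calls handle_entityref
        # for named references instead of decoding them itself.
        super().__init__(convert_charrefs=False)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def handle_entityref(self, name):
        # Keys in html.entities.html5 include the trailing semicolon.
        character = html5.get(name + ";")
        if character is None:
            # Unknown name: keep the reference intact, semicolon included
            # (assumes the input had one).
            character = "&%s;" % name
        self.chunks.append(character)


def text_of(markup):
    """Return the text content of markup with known named entities resolved."""
    parser = EntityCollector()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


print(text_of("<div>&RightArrowLeftArrow;</div>"))  # ⇄
print(text_of("<div>&nosuchentity;</div>"))         # &nosuchentity;
```

The point of the sketch is the fallback: an unrecognised name is written back with its semicolon rather than dropped.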

Changed in beautifulsoup:
status: New → Confirmed
summary: - Comma removed when printing some html entities
+ html.parser and lxml don't acknowledge HTML5 named entities
Jean-Christophe Amiel (jicea) wrote : Re: html.parser and lxml don't acknowledge HTML5 named entities

Thank you!

Jicea

summary: - html.parser and lxml don't acknowledge HTML5 named entities
+ html.parser doesn't acknowledge HTML5 named entities
Leonard Richardson (leonardr) wrote :

Revision 604 makes the html.parser tree builder work the same way as the html5lib tree builder. However, I was not able to give the same treatment to lxml.

Changed in beautifulsoup:
status: Confirmed → Fix Committed
Leonard Richardson (leonardr) wrote :

Released in 4.10.0.

Changed in beautifulsoup:
status: Fix Committed → Fix Released