html.parser doesn't acknowledge HTML5 named entities
Bug #1924908 reported by Jean-Christophe Amiel
This bug affects 1 person
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Beautiful Soup | Fix Released | Undecided | Unassigned | |
Bug Description
Given the following HTML input:
```
<div>&RightArro
```
Parsing it with Beautiful Soup gives:
```
>>> soup = BeautifulSoup(
>>> soup.decode(
'<div>&
```
The trailing semicolon of the HTML entity is removed.
Other HTML entities, such as the one for é, work correctly:
```
>>> soup = BeautifulSoup(
>>> soup.decode(
'<div>é</div>'
```
(Tested with Beautiful Soup 4.9.1 and 4.9.3.)
description: updated
summary:
- html.parser and lxml don't acknowledge HTML5 named entities
+ html.parser doesn't acknowledge HTML5 named entities
RightArrowLeftArrow is a named entity in HTML5, so the simplest way to solve your problem is to use the html5lib parser, which understands that entity:
soup = BeautifulSoup("<div>&RightArrowLeftArrow;</div>", "html5lib")
soup.decode(formatter=None)
# '<html><head></head><body><div>⇄</div></body></html>'
I spent some time looking into whether it's possible to make the other parsers able to handle HTML5 named entities, and it does seem possible in almost all cases.
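As a rough illustration of why the two entities in this report can behave differently, here is a minimal sketch that uses only Python's standard library. It checks whether a name appears in the HTML 4 table (`html.entities.name2codepoint`) and in the HTML5 table (`html.entities.html5`). The helper `describe_entity` is hypothetical and is not part of Beautiful Soup; the sketch does not claim to show how Beautiful Soup itself resolves entities or how this bug was fixed.

```python
from html import unescape
from html.entities import html5, name2codepoint


def describe_entity(name):
    """Report which standard-library entity tables know `name`.

    Hypothetical helper for illustration only; `name` is given
    without the leading '&' or the trailing ';'.
    """
    return {
        # HTML 4 table: keys are bare names such as 'eacute'.
        "in_html4_table": name in name2codepoint,
        # HTML5 table: keys include the trailing semicolon.
        "in_html5_table": (name + ";") in html5,
        # html.unescape() resolves names using the HTML5 table.
        "unescaped": unescape("&" + name + ";"),
    }


print(describe_entity("eacute"))
print(describe_entity("RightArrowLeftArrow"))
```

Under these assumptions, `eacute` is found in both tables, while `RightArrowLeftArrow` is found only in the HTML5 table, which is consistent with the behaviour described in the report.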