availability of lxml heavily changes parsing tree wrt empty tags like <br>

Bug #1398866 reported by Jörn Hees
Affects: Beautiful Soup
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned

Bug Description

Beautiful Soup 4.3.2 silently uses lxml as its parser if lxml is available in the current environment.

This causes nondeterministic behavior: for empty tags, the structure of the parse tree depends on which system the code runs on. For example, <br> tags become <br/> tags if lxml is present, but they remain <br> tags (and effectively become parent elements in the tree when Beautiful Soup inserts the missing closing </br> tag) when lxml is not in the environment.

Example:
import bs4
# No parser is named, so Beautiful Soup picks whichever one is installed.
soup = bs4.BeautifulSoup('<div>foo<br>bar</div>')
print(soup.prettify())

Running this code with lxml in the environment will print this:
<html>
 <body>
  <div>
   foo
   <br/>
   bar
  </div>
 </body>
</html>

Running this code without lxml in the environment will print this:
<div>
 foo
 <br>
  bar
 </br>
</div>

As you can see, the two trees are quite different. Any traversal that relies on a sibling relation across an empty tag (such as <br>) will work on some systems and fail on others.

I suggest either adding lxml as a requirement in your setup.py so that pip pulls it in automatically, or making sure that the tree structure doesn't change when lxml is not present. Yet another option: don't pick a parser implicitly, but make the choice explicit (quoting PEP 20: "Explicit is better than implicit").
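A minimal sketch of the "explicit" option, assuming Python 3.4 or later. The names PARSER_MODULES, require_parser and make_soup are made up for illustration and are not part of Beautiful Soup; the only bs4 call used is the two-argument BeautifulSoup(markup, parser) constructor. The idea is to name the parser at every call site and fail loudly if its backing module is missing, rather than let the tree shape depend on the environment:

```python
from importlib.util import find_spec

# Beautiful Soup parser names mapped to the module each one needs.
# (Hypothetical helper table, not part of bs4.)
PARSER_MODULES = {
    "html.parser": "html.parser",  # standard library
    "lxml": "lxml",                # third-party
    "html5lib": "html5lib",        # third-party
}


def require_parser(name):
    """Return `name` if its backing module is importable, else raise."""
    module = PARSER_MODULES.get(name)
    if module is None:
        raise ValueError("unknown parser name: %r" % name)
    if find_spec(module) is None:
        raise RuntimeError("parser %r is not installed" % name)
    return name


def make_soup(markup, parser="html.parser"):
    """Build a soup with an explicitly named parser."""
    # Imported lazily so require_parser() can be used without bs4 installed.
    from bs4 import BeautifulSoup
    return BeautifulSoup(markup, require_parser(parser))
```

With this, make_soup('<div>foo<br>bar</div>', 'lxml') either uses lxml or raises, on every machine; it never silently falls back to a different parser.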

The current behavior can cost developers several hours figuring out why code works on one system and not on another, with no error message, just a sibling node that isn't found. So if anyone is wondering why their bs4 code works when executed with a system-wide Python but not when run in a virtual environment... you're welcome.

Regarding the bug reporting guidelines... if the parse tree depends so heavily on the underlying parser, why not put a warning note in the quick start docs?

Leonard Richardson (leonardr) wrote :

Because of problems like the ones you've described, I've seriously considered requiring that the constructor call explicitly name a parser. I haven't done that because that would break most existing Beautiful Soup code. Instead, I've instituted a warning:

---
>>> BeautifulSoup("foo")
bs4/__init__.py:160: UserWarning: No parser was explicitly specified, so I'm using the best available parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another computer, or in a different virtual environment, it may use a different parser and behave differently.

To get rid of this warning, change this:

 BeautifulSoup([your markup])

to this:

 BeautifulSoup([your markup], "lxml")
---

After keeping the warning in place for a couple of years, it may make sense to change it to an error.
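Anyone who wants the error behavior in their own project right away can get it with the standard warnings module. The sketch below is only an illustration: parse_without_parser_stub is a hypothetical stand-in that emits the same kind of UserWarning as the message quoted above, so the example runs without bs4 installed, and strict_call is a made-up helper name.

```python
import warnings


def parse_without_parser_stub():
    """Hypothetical stand-in for BeautifulSoup(markup) with no parser named."""
    warnings.warn("No parser was explicitly specified", UserWarning)
    return "soup"


def strict_call(func, *args, **kwargs):
    """Call func with every UserWarning promoted to an exception."""
    with warnings.catch_warnings():
        # "error" turns a matching warning into a raised exception.
        warnings.simplefilter("error", UserWarning)
        return func(*args, **kwargs)


if __name__ == "__main__":
    try:
        strict_call(parse_without_parser_stub)
    except UserWarning as exc:
        print("refused to guess a parser:", exc)
```

In a test suite, the same effect is usually obtained by running Python with -W error::UserWarning, so that any call relying on the implicit parser choice fails on every machine instead of behaving differently on some of them.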

More documentation is not the answer. Two sections of the documentation are currently devoted to this problem, and it didn't help you:

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#differences-between-parsers
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#other-parser-problems

Adding a third discussion of this problem will make the documentation harder to use for everyone who isn't currently experiencing this particular problem. In particular, the quick start section is intended to be understandable by people who don't know what a parser is.

Changed in beautifulsoup:
status: New → Fix Committed
Jörn Hees (joernhees) wrote :

thanks, good solution :)

Changed in beautifulsoup:
status: Fix Committed → Fix Released