elements between head and body cause traversal to fail

Bug #1237763 reported by David Hull
This bug affects 2 people
Affects: Beautiful Soup
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned
Milestone: none

Bug Description

#!/usr/bin/python

# This script attempts to demonstrate what I believe is a parsing or tree
# traversal bug in Beautiful Soup 4.3.2. The SCRIPT element between the HEAD
# and BODY elements causes the children of soup.html to be (head, None, body).
# My guess is that this None element causes Beautiful Soup's various tree
# searching functions to fail to find the body.

from bs4 import BeautifulSoup

content = """
<html>
  <head>
    <title>This is a test</title>
  </head>
  <script type="text/javascript">"hello";</script>
  <body>
    <img src="test.png" alt="This is a test" />
  </body>
</html>
"""

soup = BeautifulSoup(content, 'html5lib')

print('head: %s' % soup.html.head)  # Prints head with script element moved inside head.
print('body: %s' % soup.html.body)  # Prints "body: None"

# Prints: "head\nNone\nbody\n":
for tag in soup.html.children:
    print(tag.name)

print('img: %s' % soup.find('img'))  # Prints "img: None"

Changed in beautifulsoup:
status: New → Confirmed
Leonard Richardson (leonardr) wrote :

The bug happens when a tag between <head> and <body> is one that the parser moves into <head> (script, meta, link, etc.). It does not happen if the tag gets moved into <body> (p, span, etc.).
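
A minimal sketch of the distinction described above, assuming bs4 and html5lib are installed and that the release in use contains the fix. The names TEMPLATE and parent_name are hypothetical helpers introduced only for this illustration; they are not part of Beautiful Soup. It parses the same document twice, once with a stray script and once with a stray p between the head and the body, and reports which element each one ends up inside.

```python
from bs4 import BeautifulSoup  # assumption: bs4 and html5lib are both installed

# Hypothetical test document: the placeholder sits between </head> and <body>.
TEMPLATE = (
    "<html><head><title>This is a test</title></head>"
    "{stray}"
    "<body><img src='test.png' alt='This is a test' /></body></html>"
)


def parent_name(stray_markup, tag_name):
    """Return the name of the element that contains the stray tag after parsing."""
    soup = BeautifulSoup(TEMPLATE.format(stray=stray_markup), "html5lib")
    return soup.find(tag_name).parent.name


# A script element is one of the tags that gets relocated into <head>.
script_parent = parent_name('<script type="text/javascript">"hello";</script>', "script")

# A p element is one of the tags that gets relocated into <body>.
p_parent = parent_name("<p>hello</p>", "p")

print("script ends up in: %s" % script_parent)
print("p ends up in: %s" % p_parent)
```

If the fix is in place, the script should report head as its parent and the p should report body. On an affected release such as 4.3.2, the first case is the one where later lookups like soup.find('img') came back as None.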

Changed in beautifulsoup:
status: Confirmed → Fix Committed
LaunchpadLoginDefectReport (nagle-1) wrote :

Looks like the fix didn't work. See https://bugs.launchpad.net/beautifulsoup/+bug/1430633

Changed in beautifulsoup:
status: Fix Committed → Fix Released