HTML causes 'TypeError: expected string or buffer' in _html5lib.AttrList

Bug #1483781 reported by Roel Kramer
This bug affects 2 people
Affects         Status         Importance   Assigned to   Milestone
Beautiful Soup  Fix Released   Undecided    Unassigned

Bug Description

When I feed this document to BS 4.4, a TypeError is raised; I expected to get data to work with. I extracted this from a larger document and reduced it to this minimal example. If I change p to z, or remove p, it works fine. Please note I'm running Python 3.4.

Document:
<a class="my_class"><p></a>

Example TestCase:
import unittest

from bs4 import BeautifulSoup


class TestCase(unittest.TestCase):

    def test_soup(self):
        # should not raise an exception
        data = '<a class="my_class"><p></a>'
        BeautifulSoup(data, 'html5lib')

Stacktrace:

Error
Traceback (most recent call last):
  File "/Users/roel/PycharmProjects/trctestapi/src/blockconverter/tests/test_blockconverter.py", line 205, in test_convert_to_blocks_error
    BeautifulSoup(data, 'html5lib')
  File "/Users/roel/PycharmProjects/trctestapi/.venv/lib/python3.4/site-packages/bs4/__init__.py", line 215, in __init__
    self._feed()
  File "/Users/roel/PycharmProjects/trctestapi/.venv/lib/python3.4/site-packages/bs4/__init__.py", line 239, in _feed
    self.builder.feed(self.markup)
  File "/Users/roel/PycharmProjects/trctestapi/.venv/lib/python3.4/site-packages/bs4/builder/_html5lib.py", line 50, in feed
    doc = parser.parse(markup, encoding=self.user_specified_encoding)
  File "/Users/roel/PycharmProjects/trctestapi/.venv/lib/python3.4/site-packages/html5lib/html5parser.py", line 236, in parse
    parseMeta=parseMeta, useChardet=useChardet)
  File "/Users/roel/PycharmProjects/trctestapi/.venv/lib/python3.4/site-packages/html5lib/html5parser.py", line 94, in _parse
    self.mainLoop()
  File "/Users/roel/PycharmProjects/trctestapi/.venv/lib/python3.4/site-packages/html5lib/html5parser.py", line 201, in mainLoop
    new_token = phase.processEndTag(new_token)
  File "/Users/roel/PycharmProjects/trctestapi/.venv/lib/python3.4/site-packages/html5lib/html5parser.py", line 493, in processEndTag
    return self.endTagHandler[token["name"]](token)
  File "/Users/roel/PycharmProjects/trctestapi/.venv/lib/python3.4/site-packages/html5lib/html5parser.py", line 1545, in endTagFormatting
    clone = formattingElement.cloneNode()
  File "/Users/roel/PycharmProjects/trctestapi/.venv/lib/python3.4/site-packages/bs4/builder/_html5lib.py", line 308, in cloneNode
    node.attributes[key] = value
  File "/Users/roel/PycharmProjects/trctestapi/.venv/lib/python3.4/site-packages/bs4/builder/_html5lib.py", line 123, in __setitem__
    value = whitespace_re.split(value)
TypeError: expected string or buffer

Library versions on OSX:
beautifulsoup4==4.4.0
html5lib==0.999999

description: updated
Jan Murre (jan-murre-gmail) wrote :

A possible fix for this problem:

--- bs4/builder/_html5lib.py 2015-06-28 19:39:36 +0000
+++ bs4/builder/_html5lib.py 2015-08-27 13:03:59 +0000
@@ -120,7 +120,10 @@
         if (name in list_attr['*']
             or (self.element.name in list_attr
                 and name in list_attr[self.element.name])):
-            value = whitespace_re.split(value)
+            # Node that is being cloned possibly already has
+            # attributes with list values
+            if not isinstance(value, list):
+                value = whitespace_re.split(value)
         self.element[name] = value
     def items(self):
         return list(self.attrs.items())

However, the "<a />" will be duplicated in the output.
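
To illustrate why the guard in the patch avoids the TypeError, here is a minimal standalone sketch. It is not the actual bs4 code: the pattern r"\s+" is an assumed stand-in for bs4's whitespace_re, and split_unguarded and split_guarded are hypothetical helpers that mirror the original and the patched line respectively. The second pass imitates what the traceback shows, where cloneNode passes an attribute value that has already been split into a list.

```python
import re

# Assumed stand-in for the whitespace_re pattern used by bs4.
whitespace_re = re.compile(r"\s+")


def split_unguarded(value):
    # Mirrors the original line: always split, even if value is already a list.
    return whitespace_re.split(value)


def split_guarded(value):
    # Mirrors the patched lines: split only when value is not already a list.
    if not isinstance(value, list):
        value = whitespace_re.split(value)
    return value


# First pass: the parser hands over the raw attribute string.
classes = split_guarded("my_class other")

# Second pass: a cloned node hands over the already-split list.
try:
    split_unguarded(classes)
    unguarded_failed = False
except TypeError:
    # re cannot split a list, which is the error seen in the traceback.
    unguarded_failed = True

# The guarded version returns the list unchanged instead of raising.
reclassed = split_guarded(classes)
```

The sketch covers only the splitting step. It says nothing about the duplicated element mentioned above, which comes from how the tree is built.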

LaunchpadLoginDefectReport (nagle-1) wrote :

I'm seeing that too. I just started up a new production server with beautifulsoup4 (4.4.0) and html5lib (0.999999), instead of beautifulsoup4 (4.3.2) and html5lib (0.999). The new version is getting an error every few minutes.

Some URLs it doesn't like:

http://www.huffingtonpost.ca/2015/09/23/large-employers-want-more-clarification-on-rules-for-new-ontario-pension-plan_n_8180784.html?utm_hp_ref=canada-business

http://www.huffingtonpost.ca/anita-saulite-/money-stress-health-problems_b_8173002.html?utm_hp_ref=canada-business
(Many more Huffington Post URLs)

http://www.ucmerced.edu

LaunchpadLoginDefectReport (nagle-1) wrote :

This is on Python 3.4.3, CentOS 64-bit.

LaunchpadLoginDefectReport (nagle-1) wrote :

Switched back to beautifulsoup4 (4.3.2) and the problem went away.

Leonard Richardson (leonardr) wrote :

Patch applied in revision 395.

Changed in beautifulsoup:
status: New → Fix Committed
Changed in beautifulsoup:
status: Fix Committed → Fix Released