Merge lp:~noskcaj/ubuntu/utopic/beautifulsoup4/merge into lp:ubuntu/utopic/beautifulsoup4

Proposed by Jackson Doak
Status: Needs review
Proposed branch: lp:~noskcaj/ubuntu/utopic/beautifulsoup4/merge
Merge into: lp:ubuntu/utopic/beautifulsoup4
Diff against target: 1665 lines (+709/-333)
20 files modified
NEWS.txt (+62/-0)
PKG-INFO (+1/-1)
bs4/__init__.py (+83/-42)
bs4/builder/__init__.py (+13/-8)
bs4/builder/_html5lib.py (+82/-19)
bs4/builder/_htmlparser.py (+14/-5)
bs4/builder/_lxml.py (+64/-30)
bs4/dammit.py (+165/-163)
bs4/diagnose.py (+28/-2)
bs4/element.py (+34/-21)
bs4/testing.py (+13/-0)
bs4/tests/test_html5lib.py (+13/-0)
bs4/tests/test_lxml.py (+7/-4)
bs4/tests/test_soup.py (+71/-20)
bs4/tests/test_tree.py (+29/-0)
debian/changelog (+16/-0)
debian/control (+1/-1)
debian/copyright (+1/-1)
doc/source/index.rst (+11/-15)
setup.py (+1/-1)
To merge this branch: bzr merge lp:~noskcaj/ubuntu/utopic/beautifulsoup4/merge
Reviewer: Daniel Holbach (community)
Review status: Approve
Review via email: mp+221346@code.launchpad.net

Description of the change

New upstream release from Debian.
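
For context when reviewing: a minimal, hedged sketch (not part of the packaging diff) of the most user-visible change in the 4.3.x series, the warnings Beautiful Soup now issues when it is handed a filename or a URL instead of markup. It is based on the NEWS.txt entries and the new TestWarnings cases in the preview diff below; the warning wording comes from bs4/__init__.py.

    import warnings
    from bs4 import BeautifulSoup

    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        # Passing a URL instead of markup: the string is still parsed as
        # markup, but bs4 4.3.x warns that it is not an HTTP client.
        BeautifulSoup("http://www.crummy.com/")

    for w in caught:
        print(w.message)
    # '"http://www.crummy.com/" looks like a URL. Beautiful Soup is not an
    # HTTP client. ...'

The same length-limited check warns when a short string names an existing file on disk, as exercised by test_disk_file_warning in the diff.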

Revision history for this message
Daniel Holbach (dholbach) wrote:

Thanks. Uploaded.

review: Approve

Unmerged revisions

15. By Jackson Doak

* Merge from debian. Remaining changes:
  - debian/control, debian/rules: Disable pypy-bs4 and Build-Depends on
  pypy since the latter is in universe, while beautifulsoup4 is being
  pulled into main via webtest.

Preview Diff

1=== modified file 'NEWS.txt'
2--- NEWS.txt 2013-08-09 18:39:43 +0000
3+++ NEWS.txt 2014-05-29 09:58:03 +0000
4@@ -1,3 +1,65 @@
5+= 4.3.2 (20131002) =
6+
7+* Fixed a bug in which short Unicode input was improperly encoded to
8+ ASCII when checking whether or not it was the name of a file on
9+ disk. [bug=1227016]
10+
11+* Fixed a crash when a short input contains data not valid in
12+ filenames. [bug=1232604]
13+
14+* Fixed a bug that caused Unicode data put into UnicodeDammit to
15+ return None instead of the original data. [bug=1214983]
16+
17+* Combined two tests to stop a spurious test failure when tests are
18+ run by nosetests. [bug=1212445]
19+
20+= 4.3.1 (20130815) =
21+
22+* Fixed yet another problem with the html5lib tree builder, caused by
23+ html5lib's tendency to rearrange the tree during
24+ parsing. [bug=1189267]
25+
26+* Fixed a bug that caused the optimized version of find_all() to
27+ return nothing. [bug=1212655]
28+
29+= 4.3.0 (20130812) =
30+
31+* Instead of converting incoming data to Unicode and feeding it to the
32+ lxml tree builder in chunks, Beautiful Soup now makes successive
33+ guesses at the encoding of the incoming data, and tells lxml to
34+ parse the data as that encoding. Giving lxml more control over the
35+ parsing process improves performance and avoids a number of bugs and
36+ issues with the lxml parser which had previously required elaborate
37+ workarounds:
38+
39+ - An issue in which lxml refuses to parse Unicode strings on some
40+ systems. [bug=1180527]
41+
42+ - A returning bug that truncated documents longer than a (very
43+ small) size. [bug=963880]
44+
45+ - A returning bug in which extra spaces were added to a document if
46+ the document defined a charset other than UTF-8. [bug=972466]
47+
48+ This required a major overhaul of the tree builder architecture. If
49+ you wrote your own tree builder and didn't tell me, you'll need to
50+ modify your prepare_markup() method.
51+
52+* The UnicodeDammit code that makes guesses at encodings has been
53+ split into its own class, EncodingDetector. A lot of apparently
54+ redundant code has been removed from Unicode, Dammit, and some
55+ undocumented features have also been removed.
56+
57+* Beautiful Soup will issue a warning if instead of markup you pass it
58+ a URL or the name of a file on disk (a common beginner's mistake).
59+
60+* A number of optimizations improve the performance of the lxml tree
61+ builder by about 33%, the html.parser tree builder by about 20%, and
62+ the html5lib tree builder by about 15%.
63+
64+* All find_all calls should now return a ResultSet object. Patch by
65+ Aaron DeVore. [bug=1194034]
66+
67 = 4.2.1 (20130531) =
68
69 * The default XML formatter will now replace ampersands even if they
70
71=== modified file 'PKG-INFO'
72--- PKG-INFO 2013-08-09 18:39:43 +0000
73+++ PKG-INFO 2014-05-29 09:58:03 +0000
74@@ -1,6 +1,6 @@
75 Metadata-Version: 1.1
76 Name: beautifulsoup4
77-Version: 4.2.1
78+Version: 4.3.2
79 Summary: UNKNOWN
80 Home-page: http://www.crummy.com/software/BeautifulSoup/bs4/
81 Author: Leonard Richardson
82
83=== modified file 'bs4/__init__.py'
84--- bs4/__init__.py 2013-08-09 18:39:43 +0000
85+++ bs4/__init__.py 2014-05-29 09:58:03 +0000
86@@ -17,16 +17,17 @@
87 """
88
89 __author__ = "Leonard Richardson (leonardr@segfault.org)"
90-__version__ = "4.2.1"
91+__version__ = "4.3.2"
92 __copyright__ = "Copyright (c) 2004-2013 Leonard Richardson"
93 __license__ = "MIT"
94
95 __all__ = ['BeautifulSoup']
96
97+import os
98 import re
99 import warnings
100
101-from .builder import builder_registry
102+from .builder import builder_registry, ParserRejectedMarkup
103 from .dammit import UnicodeDammit
104 from .element import (
105 CData,
106@@ -74,11 +75,7 @@
107 # want, look for one with these features.
108 DEFAULT_BUILDER_FEATURES = ['html', 'fast']
109
110- # Used when determining whether a text node is all whitespace and
111- # can be replaced with a single space. A text node that contains
112- # fancy Unicode spaces (usually non-breaking) should be left
113- # alone.
114- STRIP_ASCII_SPACES = {9: None, 10: None, 12: None, 13: None, 32: None, }
115+ ASCII_SPACES = '\x20\x0a\x09\x0c\x0d'
116
117 def __init__(self, markup="", features=None, builder=None,
118 parse_only=None, from_encoding=None, **kwargs):
119@@ -160,18 +157,46 @@
120
121 self.parse_only = parse_only
122
123- self.reset()
124-
125 if hasattr(markup, 'read'): # It's a file-type object.
126 markup = markup.read()
127- (self.markup, self.original_encoding, self.declared_html_encoding,
128- self.contains_replacement_characters) = (
129- self.builder.prepare_markup(markup, from_encoding))
130+ elif len(markup) <= 256:
131+ # Print out warnings for a couple beginner problems
132+ # involving passing non-markup to Beautiful Soup.
133+ # Beautiful Soup will still parse the input as markup,
134+ # just in case that's what the user really wants.
135+ if (isinstance(markup, unicode)
136+ and not os.path.supports_unicode_filenames):
137+ possible_filename = markup.encode("utf8")
138+ else:
139+ possible_filename = markup
140+ is_file = False
141+ try:
142+ is_file = os.path.exists(possible_filename)
143+ except Exception, e:
144+ # This is almost certainly a problem involving
145+ # characters not valid in filenames on this
146+ # system. Just let it go.
147+ pass
148+ if is_file:
149+ warnings.warn(
150+ '"%s" looks like a filename, not markup. You should probably open this file and pass the filehandle into Beautiful Soup.' % markup)
151+ if markup[:5] == "http:" or markup[:6] == "https:":
152+ # TODO: This is ugly but I couldn't get it to work in
153+ # Python 3 otherwise.
154+ if ((isinstance(markup, bytes) and not b' ' in markup)
155+ or (isinstance(markup, unicode) and not u' ' in markup)):
156+ warnings.warn(
157+ '"%s" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client to get the document behind the URL, and feed that document to Beautiful Soup.' % markup)
158
159- try:
160- self._feed()
161- except StopParsing:
162- pass
163+ for (self.markup, self.original_encoding, self.declared_html_encoding,
164+ self.contains_replacement_characters) in (
165+ self.builder.prepare_markup(markup, from_encoding)):
166+ self.reset()
167+ try:
168+ self._feed()
169+ break
170+ except ParserRejectedMarkup:
171+ pass
172
173 # Clear out the markup and remove the builder's circular
174 # reference to this object.
175@@ -192,9 +217,10 @@
176 Tag.__init__(self, self, self.builder, self.ROOT_TAG_NAME)
177 self.hidden = 1
178 self.builder.reset()
179- self.currentData = []
180+ self.current_data = []
181 self.currentTag = None
182 self.tagStack = []
183+ self.preserve_whitespace_tag_stack = []
184 self.pushTag(self)
185
186 def new_tag(self, name, namespace=None, nsprefix=None, **attrs):
187@@ -215,6 +241,8 @@
188
189 def popTag(self):
190 tag = self.tagStack.pop()
191+ if self.preserve_whitespace_tag_stack and tag == self.preserve_whitespace_tag_stack[-1]:
192+ self.preserve_whitespace_tag_stack.pop()
193 #print "Pop", tag.name
194 if self.tagStack:
195 self.currentTag = self.tagStack[-1]
196@@ -226,23 +254,37 @@
197 self.currentTag.contents.append(tag)
198 self.tagStack.append(tag)
199 self.currentTag = self.tagStack[-1]
200+ if tag.name in self.builder.preserve_whitespace_tags:
201+ self.preserve_whitespace_tag_stack.append(tag)
202
203 def endData(self, containerClass=NavigableString):
204- if self.currentData:
205- currentData = u''.join(self.currentData)
206- if (currentData.translate(self.STRIP_ASCII_SPACES) == '' and
207- not set([tag.name for tag in self.tagStack]).intersection(
208- self.builder.preserve_whitespace_tags)):
209- if '\n' in currentData:
210- currentData = '\n'
211- else:
212- currentData = ' '
213- self.currentData = []
214+ if self.current_data:
215+ current_data = u''.join(self.current_data)
216+ # If whitespace is not preserved, and this string contains
217+ # nothing but ASCII spaces, replace it with a single space
218+ # or newline.
219+ if not self.preserve_whitespace_tag_stack:
220+ strippable = True
221+ for i in current_data:
222+ if i not in self.ASCII_SPACES:
223+ strippable = False
224+ break
225+ if strippable:
226+ if '\n' in current_data:
227+ current_data = '\n'
228+ else:
229+ current_data = ' '
230+
231+ # Reset the data collector.
232+ self.current_data = []
233+
234+ # Should we add this string to the tree at all?
235 if self.parse_only and len(self.tagStack) <= 1 and \
236 (not self.parse_only.text or \
237- not self.parse_only.search(currentData)):
238+ not self.parse_only.search(current_data)):
239 return
240- o = containerClass(currentData)
241+
242+ o = containerClass(current_data)
243 self.object_was_parsed(o)
244
245 def object_was_parsed(self, o, parent=None, most_recent_element=None):
246@@ -250,6 +292,7 @@
247 parent = parent or self.currentTag
248 most_recent_element = most_recent_element or self._most_recent_element
249 o.setup(parent, most_recent_element)
250+
251 if most_recent_element is not None:
252 most_recent_element.next_element = o
253 self._most_recent_element = o
254@@ -262,22 +305,21 @@
255 the given tag."""
256 #print "Popping to %s" % name
257 if name == self.ROOT_TAG_NAME:
258+ # The BeautifulSoup object itself can never be popped.
259 return
260
261- numPops = 0
262- mostRecentTag = None
263+ most_recently_popped = None
264
265- for i in range(len(self.tagStack) - 1, 0, -1):
266- if (name == self.tagStack[i].name
267- and nsprefix == self.tagStack[i].prefix):
268- numPops = len(self.tagStack) - i
269+ stack_size = len(self.tagStack)
270+ for i in range(stack_size - 1, 0, -1):
271+ t = self.tagStack[i]
272+ if (name == t.name and nsprefix == t.prefix):
273+ if inclusivePop:
274+ most_recently_popped = self.popTag()
275 break
276- if not inclusivePop:
277- numPops = numPops - 1
278+ most_recently_popped = self.popTag()
279
280- for i in range(0, numPops):
281- mostRecentTag = self.popTag()
282- return mostRecentTag
283+ return most_recently_popped
284
285 def handle_starttag(self, name, namespace, nsprefix, attrs):
286 """Push a start tag on to the stack.
287@@ -312,7 +354,7 @@
288 self._popToTag(name, nsprefix)
289
290 def handle_data(self, data):
291- self.currentData.append(data)
292+ self.current_data.append(data)
293
294 def decode(self, pretty_print=False,
295 eventual_encoding=DEFAULT_OUTPUT_ENCODING,
296@@ -353,7 +395,6 @@
297 class StopParsing(Exception):
298 pass
299
300-
301 class FeatureNotFound(ValueError):
302 pass
303
304
305=== modified file 'bs4/builder/__init__.py'
306--- bs4/builder/__init__.py 2013-08-09 18:39:43 +0000
307+++ bs4/builder/__init__.py 2014-05-29 09:58:03 +0000
308@@ -147,16 +147,18 @@
309
310 Modifies its input in place.
311 """
312+ if not attrs:
313+ return attrs
314 if self.cdata_list_attributes:
315 universal = self.cdata_list_attributes.get('*', [])
316 tag_specific = self.cdata_list_attributes.get(
317- tag_name.lower(), [])
318- for cdata_list_attr in itertools.chain(universal, tag_specific):
319- if cdata_list_attr in attrs:
320- # Basically, we have a "class" attribute whose
321- # value is a whitespace-separated list of CSS
322- # classes. Split it into a list.
323- value = attrs[cdata_list_attr]
324+ tag_name.lower(), None)
325+ for attr in attrs.keys():
326+ if attr in universal or (tag_specific and attr in tag_specific):
327+ # We have a "class"-type attribute whose string
328+ # value is a whitespace-separated list of
329+ # values. Split it into a list.
330+ value = attrs[attr]
331 if isinstance(value, basestring):
332 values = whitespace_re.split(value)
333 else:
334@@ -167,7 +169,7 @@
335 # leave the value alone rather than trying to
336 # split it again.
337 values = value
338- attrs[cdata_list_attr] = values
339+ attrs[attr] = values
340 return attrs
341
342 class SAXTreeBuilder(TreeBuilder):
343@@ -296,6 +298,9 @@
344 # Register the builder while we're at it.
345 this_module.builder_registry.register(obj)
346
347+class ParserRejectedMarkup(Exception):
348+ pass
349+
350 # Builders are registered in reverse order of priority, so that custom
351 # builder registrations will take precedence. In general, we want lxml
352 # to take precedence over html5lib, because it's faster. And we only
353
354=== modified file 'bs4/builder/_html5lib.py'
355--- bs4/builder/_html5lib.py 2013-08-09 18:39:43 +0000
356+++ bs4/builder/_html5lib.py 2014-05-29 09:58:03 +0000
357@@ -27,7 +27,7 @@
358 def prepare_markup(self, markup, user_specified_encoding):
359 # Store the user-specified encoding for use later on.
360 self.user_specified_encoding = user_specified_encoding
361- return markup, None, None, False
362+ yield (markup, None, None, False)
363
364 # These methods are defined by Beautiful Soup.
365 def feed(self, markup):
366@@ -123,17 +123,50 @@
367 self.namespace = namespace
368
369 def appendChild(self, node):
370- if (node.element.__class__ == NavigableString and self.element.contents
371+ string_child = child = None
372+ if isinstance(node, basestring):
373+ # Some other piece of code decided to pass in a string
374+ # instead of creating a TextElement object to contain the
375+ # string.
376+ string_child = child = node
377+ elif isinstance(node, Tag):
378+ # Some other piece of code decided to pass in a Tag
379+ # instead of creating an Element object to contain the
380+ # Tag.
381+ child = node
382+ elif node.element.__class__ == NavigableString:
383+ string_child = child = node.element
384+ else:
385+ child = node.element
386+
387+ if not isinstance(child, basestring) and child.parent is not None:
388+ node.element.extract()
389+
390+ if (string_child and self.element.contents
391 and self.element.contents[-1].__class__ == NavigableString):
392- # Concatenate new text onto old text node
393- # XXX This has O(n^2) performance, for input like
394+ # We are appending a string onto another string.
395+ # TODO This has O(n^2) performance, for input like
396 # "a</a>a</a>a</a>..."
397 old_element = self.element.contents[-1]
398- new_element = self.soup.new_string(old_element + node.element)
399+ new_element = self.soup.new_string(old_element + string_child)
400 old_element.replace_with(new_element)
401 self.soup._most_recent_element = new_element
402 else:
403- self.soup.object_was_parsed(node.element, parent=self.element)
404+ if isinstance(node, basestring):
405+ # Create a brand new NavigableString from this string.
406+ child = self.soup.new_string(node)
407+
408+ # Tell Beautiful Soup to act as if it parsed this element
409+ # immediately after the parent's last descendant. (Or
410+ # immediately after the parent, if it has no children.)
411+ if self.element.contents:
412+ most_recent_element = self.element._last_descendant(False)
413+ else:
414+ most_recent_element = self.element
415+
416+ self.soup.object_was_parsed(
417+ child, parent=self.element,
418+ most_recent_element=most_recent_element)
419
420 def getAttributes(self):
421 return AttrList(self.element)
422@@ -162,11 +195,11 @@
423 attributes = property(getAttributes, setAttributes)
424
425 def insertText(self, data, insertBefore=None):
426- text = TextNode(self.soup.new_string(data), self.soup)
427 if insertBefore:
428- self.insertBefore(text, insertBefore)
429+ text = TextNode(self.soup.new_string(data), self.soup)
430+ self.insertBefore(data, insertBefore)
431 else:
432- self.appendChild(text)
433+ self.appendChild(data)
434
435 def insertBefore(self, node, refNode):
436 index = self.element.index(refNode.element)
437@@ -183,16 +216,46 @@
438 def removeChild(self, node):
439 node.element.extract()
440
441- def reparentChildren(self, newParent):
442- while self.element.contents:
443- child = self.element.contents[0]
444- child.extract()
445- if isinstance(child, Tag):
446- newParent.appendChild(
447- Element(child, self.soup, namespaces["html"]))
448- else:
449- newParent.appendChild(
450- TextNode(child, self.soup))
451+ def reparentChildren(self, new_parent):
452+ """Move all of this tag's children into another tag."""
453+ element = self.element
454+ new_parent_element = new_parent.element
455+ # Determine what this tag's next_element will be once all the children
456+ # are removed.
457+ final_next_element = element.next_sibling
458+
459+ new_parents_last_descendant = new_parent_element._last_descendant(False, False)
460+ if len(new_parent_element.contents) > 0:
461+ # The new parent already contains children. We will be
462+ # appending this tag's children to the end.
463+ new_parents_last_child = new_parent_element.contents[-1]
464+ new_parents_last_descendant_next_element = new_parents_last_descendant.next_element
465+ else:
466+ # The new parent contains no children.
467+ new_parents_last_child = None
468+ new_parents_last_descendant_next_element = new_parent_element.next_element
469+
470+ to_append = element.contents
471+ append_after = new_parent.element.contents
472+ if len(to_append) > 0:
473+ # Set the first child's previous_element and previous_sibling
474+ # to elements within the new parent
475+ first_child = to_append[0]
476+ first_child.previous_element = new_parents_last_descendant
477+ first_child.previous_sibling = new_parents_last_child
478+
479+ # Fix the last child's next_element and next_sibling
480+ last_child = to_append[-1]
481+ last_child.next_element = new_parents_last_descendant_next_element
482+ last_child.next_sibling = None
483+
484+ for child in to_append:
485+ child.parent = new_parent_element
486+ new_parent_element.contents.append(child)
487+
488+ # Now that this element has no children, change its .next_element.
489+ element.contents = []
490+ element.next_element = final_next_element
491
492 def cloneNode(self):
493 tag = self.soup.new_tag(self.element.name, self.namespace)
494
495=== modified file 'bs4/builder/_htmlparser.py'
496--- bs4/builder/_htmlparser.py 2013-08-09 18:39:43 +0000
497+++ bs4/builder/_htmlparser.py 2014-05-29 09:58:03 +0000
498@@ -45,7 +45,15 @@
499 class BeautifulSoupHTMLParser(HTMLParser):
500 def handle_starttag(self, name, attrs):
501 # XXX namespace
502- self.soup.handle_starttag(name, None, None, dict(attrs))
503+ attr_dict = {}
504+ for key, value in attrs:
505+ # Change None attribute values to the empty string
506+ # for consistency with the other tree builders.
507+ if value is None:
508+ value = ''
509+ attr_dict[key] = value
510+ attrvalue = '""'
511+ self.soup.handle_starttag(name, None, None, attr_dict)
512
513 def handle_endtag(self, name):
514 self.soup.handle_endtag(name)
515@@ -135,13 +143,14 @@
516 replaced with REPLACEMENT CHARACTER).
517 """
518 if isinstance(markup, unicode):
519- return markup, None, None, False
520+ yield (markup, None, None, False)
521+ return
522
523 try_encodings = [user_specified_encoding, document_declared_encoding]
524 dammit = UnicodeDammit(markup, try_encodings, is_html=True)
525- return (dammit.markup, dammit.original_encoding,
526- dammit.declared_html_encoding,
527- dammit.contains_replacement_characters)
528+ yield (dammit.markup, dammit.original_encoding,
529+ dammit.declared_html_encoding,
530+ dammit.contains_replacement_characters)
531
532 def feed(self, markup):
533 args, kwargs = self.parser_args
534
535=== modified file 'bs4/builder/_lxml.py'
536--- bs4/builder/_lxml.py 2013-08-09 18:39:43 +0000
537+++ bs4/builder/_lxml.py 2014-05-29 09:58:03 +0000
538@@ -13,9 +13,10 @@
539 HTML,
540 HTMLTreeBuilder,
541 PERMISSIVE,
542+ ParserRejectedMarkup,
543 TreeBuilder,
544 XML)
545-from bs4.dammit import UnicodeDammit
546+from bs4.dammit import EncodingDetector
547
548 LXML = 'lxml'
549
550@@ -33,22 +34,30 @@
551 # standard.
552 DEFAULT_NSMAPS = {'http://www.w3.org/XML/1998/namespace' : "xml"}
553
554- @property
555- def default_parser(self):
556+ def default_parser(self, encoding):
557 # This can either return a parser object or a class, which
558 # will be instantiated with default arguments.
559- return etree.XMLParser(target=self, strip_cdata=False, recover=True)
560+ if self._default_parser is not None:
561+ return self._default_parser
562+ return etree.XMLParser(
563+ target=self, strip_cdata=False, recover=True, encoding=encoding)
564+
565+ def parser_for(self, encoding):
566+ # Use the default parser.
567+ parser = self.default_parser(encoding)
568+
569+ if isinstance(parser, collections.Callable):
570+ # Instantiate the parser with default arguments
571+ parser = parser(target=self, strip_cdata=False, encoding=encoding)
572+ return parser
573
574 def __init__(self, parser=None, empty_element_tags=None):
575+ # TODO: Issue a warning if parser is present but not a
576+ # callable, since that means there's no way to create new
577+ # parsers for different encodings.
578+ self._default_parser = parser
579 if empty_element_tags is not None:
580 self.empty_element_tags = set(empty_element_tags)
581- if parser is None:
582- # Use the default parser.
583- parser = self.default_parser
584- if isinstance(parser, collections.Callable):
585- # Instantiate the parser with default arguments
586- parser = parser(target=self, strip_cdata=False)
587- self.parser = parser
588 self.soup = None
589 self.nsmaps = [self.DEFAULT_NSMAPS]
590
591@@ -63,33 +72,53 @@
592 def prepare_markup(self, markup, user_specified_encoding=None,
593 document_declared_encoding=None):
594 """
595- :return: A 3-tuple (markup, original encoding, encoding
596- declared within markup).
597+ :yield: A series of 4-tuples.
598+ (markup, encoding, declared encoding,
599+ has undergone character replacement)
600+
601+ Each 4-tuple represents a strategy for parsing the document.
602 """
603 if isinstance(markup, unicode):
604- return markup, None, None, False
605-
606+ # We were given Unicode. Maybe lxml can parse Unicode on
607+ # this system?
608+ yield markup, None, document_declared_encoding, False
609+
610+ if isinstance(markup, unicode):
611+ # No, apparently not. Convert the Unicode to UTF-8 and
612+ # tell lxml to parse it as UTF-8.
613+ yield (markup.encode("utf8"), "utf8",
614+ document_declared_encoding, False)
615+
616+ # Instead of using UnicodeDammit to convert the bytestring to
617+ # Unicode using different encodings, use EncodingDetector to
618+ # iterate over the encodings, and tell lxml to try to parse
619+ # the document as each one in turn.
620+ is_html = not self.is_xml
621 try_encodings = [user_specified_encoding, document_declared_encoding]
622- dammit = UnicodeDammit(markup, try_encodings, is_html=True)
623- return (dammit.markup, dammit.original_encoding,
624- dammit.declared_html_encoding,
625- dammit.contains_replacement_characters)
626+ detector = EncodingDetector(markup, try_encodings, is_html)
627+ for encoding in detector.encodings:
628+ yield (detector.markup, encoding, document_declared_encoding, False)
629
630 def feed(self, markup):
631 if isinstance(markup, bytes):
632 markup = BytesIO(markup)
633 elif isinstance(markup, unicode):
634 markup = StringIO(markup)
635+
636 # Call feed() at least once, even if the markup is empty,
637 # or the parser won't be initialized.
638 data = markup.read(self.CHUNK_SIZE)
639- self.parser.feed(data)
640- while data != '':
641- # Now call feed() on the rest of the data, chunk by chunk.
642- data = markup.read(self.CHUNK_SIZE)
643- if data != '':
644- self.parser.feed(data)
645- self.parser.close()
646+ try:
647+ self.parser = self.parser_for(self.soup.original_encoding)
648+ self.parser.feed(data)
649+ while len(data) != 0:
650+ # Now call feed() on the rest of the data, chunk by chunk.
651+ data = markup.read(self.CHUNK_SIZE)
652+ if len(data) != 0:
653+ self.parser.feed(data)
654+ self.parser.close()
655+ except (UnicodeDecodeError, LookupError, etree.ParserError), e:
656+ raise ParserRejectedMarkup(str(e))
657
658 def close(self):
659 self.nsmaps = [self.DEFAULT_NSMAPS]
660@@ -186,13 +215,18 @@
661 features = [LXML, HTML, FAST, PERMISSIVE]
662 is_xml = False
663
664- @property
665- def default_parser(self):
666+ def default_parser(self, encoding):
667 return etree.HTMLParser
668
669 def feed(self, markup):
670- self.parser.feed(markup)
671- self.parser.close()
672+ encoding = self.soup.original_encoding
673+ try:
674+ self.parser = self.parser_for(encoding)
675+ self.parser.feed(markup)
676+ self.parser.close()
677+ except (UnicodeDecodeError, LookupError, etree.ParserError), e:
678+ raise ParserRejectedMarkup(str(e))
679+
680
681 def test_fragment_to_document(self, fragment):
682 """See `TreeBuilder`."""
683
684=== modified file 'bs4/dammit.py'
685--- bs4/dammit.py 2013-08-09 18:39:43 +0000
686+++ bs4/dammit.py 2014-05-29 09:58:03 +0000
687@@ -1,16 +1,17 @@
688 # -*- coding: utf-8 -*-
689 """Beautiful Soup bonus library: Unicode, Dammit
690
691-This class forces XML data into a standard format (usually to UTF-8 or
692-Unicode). It is heavily based on code from Mark Pilgrim's Universal
693-Feed Parser. It does not rewrite the XML or HTML to reflect a new
694-encoding; that's the tree builder's job.
695+This library converts a bytestream to Unicode through any means
696+necessary. It is heavily based on code from Mark Pilgrim's Universal
697+Feed Parser. It works best on XML and XML, but it does not rewrite the
698+XML or HTML to reflect a new encoding; that's the tree builder's job.
699 """
700
701 import codecs
702 from htmlentitydefs import codepoint2name
703 import re
704 import logging
705+import string
706
707 # Import a library to autodetect character encodings.
708 chardet_type = None
709@@ -175,7 +176,6 @@
710 value = cls.quoted_attribute_value(value)
711 return value
712
713-
714 @classmethod
715 def substitute_html(cls, s):
716 """Replace certain Unicode characters with named HTML entities.
717@@ -192,6 +192,125 @@
718 cls._substitute_html_entity, s)
719
720
721+class EncodingDetector:
722+ """Suggests a number of possible encodings for a bytestring.
723+
724+ Order of precedence:
725+
726+ 1. Encodings you specifically tell EncodingDetector to try first
727+ (the override_encodings argument to the constructor).
728+
729+ 2. An encoding declared within the bytestring itself, either in an
730+ XML declaration (if the bytestring is to be interpreted as an XML
731+ document), or in a <meta> tag (if the bytestring is to be
732+ interpreted as an HTML document.)
733+
734+ 3. An encoding detected through textual analysis by chardet,
735+ cchardet, or a similar external library.
736+
737+ 4. UTF-8.
738+
739+ 5. Windows-1252.
740+ """
741+ def __init__(self, markup, override_encodings=None, is_html=False):
742+ self.override_encodings = override_encodings or []
743+ self.chardet_encoding = None
744+ self.is_html = is_html
745+ self.declared_encoding = None
746+
747+ # First order of business: strip a byte-order mark.
748+ self.markup, self.sniffed_encoding = self.strip_byte_order_mark(markup)
749+
750+ def _usable(self, encoding, tried):
751+ if encoding is not None:
752+ encoding = encoding.lower()
753+ if encoding not in tried:
754+ tried.add(encoding)
755+ return True
756+ return False
757+
758+ @property
759+ def encodings(self):
760+ """Yield a number of encodings that might work for this markup."""
761+ tried = set()
762+ for e in self.override_encodings:
763+ if self._usable(e, tried):
764+ yield e
765+
766+ # Did the document originally start with a byte-order mark
767+ # that indicated its encoding?
768+ if self._usable(self.sniffed_encoding, tried):
769+ yield self.sniffed_encoding
770+
771+ # Look within the document for an XML or HTML encoding
772+ # declaration.
773+ if self.declared_encoding is None:
774+ self.declared_encoding = self.find_declared_encoding(
775+ self.markup, self.is_html)
776+ if self._usable(self.declared_encoding, tried):
777+ yield self.declared_encoding
778+
779+ # Use third-party character set detection to guess at the
780+ # encoding.
781+ if self.chardet_encoding is None:
782+ self.chardet_encoding = chardet_dammit(self.markup)
783+ if self._usable(self.chardet_encoding, tried):
784+ yield self.chardet_encoding
785+
786+ # As a last-ditch effort, try utf-8 and windows-1252.
787+ for e in ('utf-8', 'windows-1252'):
788+ if self._usable(e, tried):
789+ yield e
790+
791+ @classmethod
792+ def strip_byte_order_mark(cls, data):
793+ """If a byte-order mark is present, strip it and return the encoding it implies."""
794+ encoding = None
795+ if (len(data) >= 4) and (data[:2] == b'\xfe\xff') \
796+ and (data[2:4] != '\x00\x00'):
797+ encoding = 'utf-16be'
798+ data = data[2:]
799+ elif (len(data) >= 4) and (data[:2] == b'\xff\xfe') \
800+ and (data[2:4] != '\x00\x00'):
801+ encoding = 'utf-16le'
802+ data = data[2:]
803+ elif data[:3] == b'\xef\xbb\xbf':
804+ encoding = 'utf-8'
805+ data = data[3:]
806+ elif data[:4] == b'\x00\x00\xfe\xff':
807+ encoding = 'utf-32be'
808+ data = data[4:]
809+ elif data[:4] == b'\xff\xfe\x00\x00':
810+ encoding = 'utf-32le'
811+ data = data[4:]
812+ return data, encoding
813+
814+ @classmethod
815+ def find_declared_encoding(cls, markup, is_html=False, search_entire_document=False):
816+ """Given a document, tries to find its declared encoding.
817+
818+ An XML encoding is declared at the beginning of the document.
819+
820+ An HTML encoding is declared in a <meta> tag, hopefully near the
821+ beginning of the document.
822+ """
823+ if search_entire_document:
824+ xml_endpos = html_endpos = len(markup)
825+ else:
826+ xml_endpos = 1024
827+ html_endpos = max(2048, int(len(markup) * 0.05))
828+
829+ declared_encoding = None
830+ declared_encoding_match = xml_encoding_re.search(markup, endpos=xml_endpos)
831+ if not declared_encoding_match and is_html:
832+ declared_encoding_match = html_meta_re.search(markup, endpos=html_endpos)
833+ if declared_encoding_match is not None:
834+ declared_encoding = declared_encoding_match.groups()[0].decode(
835+ 'ascii')
836+ if declared_encoding:
837+ return declared_encoding.lower()
838+ return None
839+
840 class UnicodeDammit:
841 """A class for detecting the encoding of a *ML document and
842 converting it to a Unicode string. If the source encoding is
843@@ -213,55 +332,38 @@
844
845 def __init__(self, markup, override_encodings=[],
846 smart_quotes_to=None, is_html=False):
847- self.declared_html_encoding = None
848 self.smart_quotes_to = smart_quotes_to
849 self.tried_encodings = []
850 self.contains_replacement_characters = False
851-
852- if markup == '' or isinstance(markup, unicode):
853+ self.is_html = is_html
854+
855+ self.detector = EncodingDetector(markup, override_encodings, is_html)
856+
857+ # Short-circuit if the data is in Unicode to begin with.
858+ if isinstance(markup, unicode) or markup == '':
859 self.markup = markup
860 self.unicode_markup = unicode(markup)
861 self.original_encoding = None
862 return
863
864- new_markup, document_encoding, sniffed_encoding = \
865- self._detectEncoding(markup, is_html)
866- self.markup = new_markup
867+ # The encoding detector may have stripped a byte-order mark.
868+ # Use the stripped markup from this point on.
869+ self.markup = self.detector.markup
870
871 u = None
872- if new_markup != markup:
873- # _detectEncoding modified the markup, then converted it to
874- # Unicode and then to UTF-8. So convert it from UTF-8.
875- u = self._convert_from("utf8")
876- self.original_encoding = sniffed_encoding
877-
878- if not u:
879- for proposed_encoding in (
880- override_encodings + [document_encoding, sniffed_encoding]):
881- if proposed_encoding is not None:
882- u = self._convert_from(proposed_encoding)
883- if u:
884- break
885-
886- # If no luck and we have auto-detection library, try that:
887- if not u and not isinstance(self.markup, unicode):
888- u = self._convert_from(chardet_dammit(self.markup))
889-
890- # As a last resort, try utf-8 and windows-1252:
891- if not u:
892- for proposed_encoding in ("utf-8", "windows-1252"):
893- u = self._convert_from(proposed_encoding)
894- if u:
895- break
896-
897- # As an absolute last resort, try the encodings again with
898- # character replacement.
899- if not u:
900- for proposed_encoding in (
901- override_encodings + [
902- document_encoding, sniffed_encoding, "utf-8", "windows-1252"]):
903- if proposed_encoding != "ascii":
904- u = self._convert_from(proposed_encoding, "replace")
905+ for encoding in self.detector.encodings:
906+ markup = self.detector.markup
907+ u = self._convert_from(encoding)
908+ if u is not None:
909+ break
910+
911+ if not u:
912+ # None of the encodings worked. As an absolute last resort,
913+ # try them again with character replacement.
914+
915+ for encoding in self.detector.encodings:
916+ if encoding != "ascii":
917+ u = self._convert_from(encoding, "replace")
918 if u is not None:
919 logging.warning(
920 "Some characters could not be decoded, and were "
921@@ -269,8 +371,9 @@
922 self.contains_replacement_characters = True
923 break
924
925- # We could at this point force it to ASCII, but that would
926- # destroy so much data that I think giving up is better
927+ # If none of that worked, we could at this point force it to
928+ # ASCII, but that would destroy so much data that I think
929+ # giving up is better.
930 self.unicode_markup = u
931 if not u:
932 self.original_encoding = None
933@@ -301,7 +404,7 @@
934 # Convert smart quotes to HTML if coming from an encoding
935 # that might have them.
936 if (self.smart_quotes_to is not None
937- and proposed.lower() in self.ENCODINGS_WITH_SMART_QUOTES):
938+ and proposed in self.ENCODINGS_WITH_SMART_QUOTES):
939 smart_quotes_re = b"([\x80-\x9f])"
940 smart_quotes_compiled = re.compile(smart_quotes_re)
941 markup = smart_quotes_compiled.sub(self._sub_ms_char, markup)
942@@ -322,99 +425,24 @@
943 def _to_unicode(self, data, encoding, errors="strict"):
944 '''Given a string and its encoding, decodes the string into Unicode.
945 %encoding is a string recognized by encodings.aliases'''
946-
947- # strip Byte Order Mark (if present)
948- if (len(data) >= 4) and (data[:2] == '\xfe\xff') \
949- and (data[2:4] != '\x00\x00'):
950- encoding = 'utf-16be'
951- data = data[2:]
952- elif (len(data) >= 4) and (data[:2] == '\xff\xfe') \
953- and (data[2:4] != '\x00\x00'):
954- encoding = 'utf-16le'
955- data = data[2:]
956- elif data[:3] == '\xef\xbb\xbf':
957- encoding = 'utf-8'
958- data = data[3:]
959- elif data[:4] == '\x00\x00\xfe\xff':
960- encoding = 'utf-32be'
961- data = data[4:]
962- elif data[:4] == '\xff\xfe\x00\x00':
963- encoding = 'utf-32le'
964- data = data[4:]
965- newdata = unicode(data, encoding, errors)
966- return newdata
967-
968- def _detectEncoding(self, xml_data, is_html=False):
969- """Given a document, tries to detect its XML encoding."""
970- xml_encoding = sniffed_xml_encoding = None
971- try:
972- if xml_data[:4] == b'\x4c\x6f\xa7\x94':
973- # EBCDIC
974- xml_data = self._ebcdic_to_ascii(xml_data)
975- elif xml_data[:4] == b'\x00\x3c\x00\x3f':
976- # UTF-16BE
977- sniffed_xml_encoding = 'utf-16be'
978- xml_data = unicode(xml_data, 'utf-16be').encode('utf-8')
979- elif (len(xml_data) >= 4) and (xml_data[:2] == b'\xfe\xff') \
980- and (xml_data[2:4] != b'\x00\x00'):
981- # UTF-16BE with BOM
982- sniffed_xml_encoding = 'utf-16be'
983- xml_data = unicode(xml_data[2:], 'utf-16be').encode('utf-8')
984- elif xml_data[:4] == b'\x3c\x00\x3f\x00':
985- # UTF-16LE
986- sniffed_xml_encoding = 'utf-16le'
987- xml_data = unicode(xml_data, 'utf-16le').encode('utf-8')
988- elif (len(xml_data) >= 4) and (xml_data[:2] == b'\xff\xfe') and \
989- (xml_data[2:4] != b'\x00\x00'):
990- # UTF-16LE with BOM
991- sniffed_xml_encoding = 'utf-16le'
992- xml_data = unicode(xml_data[2:], 'utf-16le').encode('utf-8')
993- elif xml_data[:4] == b'\x00\x00\x00\x3c':
994- # UTF-32BE
995- sniffed_xml_encoding = 'utf-32be'
996- xml_data = unicode(xml_data, 'utf-32be').encode('utf-8')
997- elif xml_data[:4] == b'\x3c\x00\x00\x00':
998- # UTF-32LE
999- sniffed_xml_encoding = 'utf-32le'
1000- xml_data = unicode(xml_data, 'utf-32le').encode('utf-8')
1001- elif xml_data[:4] == b'\x00\x00\xfe\xff':
1002- # UTF-32BE with BOM
1003- sniffed_xml_encoding = 'utf-32be'
1004- xml_data = unicode(xml_data[4:], 'utf-32be').encode('utf-8')
1005- elif xml_data[:4] == b'\xff\xfe\x00\x00':
1006- # UTF-32LE with BOM
1007- sniffed_xml_encoding = 'utf-32le'
1008- xml_data = unicode(xml_data[4:], 'utf-32le').encode('utf-8')
1009- elif xml_data[:3] == b'\xef\xbb\xbf':
1010- # UTF-8 with BOM
1011- sniffed_xml_encoding = 'utf-8'
1012- xml_data = unicode(xml_data[3:], 'utf-8').encode('utf-8')
1013- else:
1014- sniffed_xml_encoding = 'ascii'
1015- pass
1016- except:
1017- xml_encoding_match = None
1018- xml_encoding_match = xml_encoding_re.match(xml_data)
1019- if not xml_encoding_match and is_html:
1020- xml_encoding_match = html_meta_re.search(xml_data)
1021- if xml_encoding_match is not None:
1022- xml_encoding = xml_encoding_match.groups()[0].decode(
1023- 'ascii').lower()
1024- if is_html:
1025- self.declared_html_encoding = xml_encoding
1026- if sniffed_xml_encoding and \
1027- (xml_encoding in ('iso-10646-ucs-2', 'ucs-2', 'csunicode',
1028- 'iso-10646-ucs-4', 'ucs-4', 'csucs4',
1029- 'utf-16', 'utf-32', 'utf_16', 'utf_32',
1030- 'utf16', 'u16')):
1031- xml_encoding = sniffed_xml_encoding
1032- return xml_data, xml_encoding, sniffed_xml_encoding
1033+ return unicode(data, encoding, errors)
1034+
1035+ @property
1036+ def declared_html_encoding(self):
1037+ if not self.is_html:
1038+ return None
1039+ return self.detector.declared_encoding
1040
1041 def find_codec(self, charset):
1042- return self._codec(self.CHARSET_ALIASES.get(charset, charset)) \
1043- or (charset and self._codec(charset.replace("-", ""))) \
1044- or (charset and self._codec(charset.replace("-", "_"))) \
1045+ value = (self._codec(self.CHARSET_ALIASES.get(charset, charset))
1046+ or (charset and self._codec(charset.replace("-", "")))
1047+ or (charset and self._codec(charset.replace("-", "_")))
1048+ or (charset and charset.lower())
1049 or charset
1050+ )
1051+ if value:
1052+ return value.lower()
1053+ return None
1054
1055 def _codec(self, charset):
1056 if not charset:
1057@@ -427,32 +455,6 @@
1058 pass
1059 return codec
1060
1061- EBCDIC_TO_ASCII_MAP = None
1062-
1063- def _ebcdic_to_ascii(self, s):
1064- c = self.__class__
1065- if not c.EBCDIC_TO_ASCII_MAP:
1066- emap = (0,1,2,3,156,9,134,127,151,141,142,11,12,13,14,15,
1067- 16,17,18,19,157,133,8,135,24,25,146,143,28,29,30,31,
1068- 128,129,130,131,132,10,23,27,136,137,138,139,140,5,6,7,
1069- 144,145,22,147,148,149,150,4,152,153,154,155,20,21,158,26,
1070- 32,160,161,162,163,164,165,166,167,168,91,46,60,40,43,33,
1071- 38,169,170,171,172,173,174,175,176,177,93,36,42,41,59,94,
1072- 45,47,178,179,180,181,182,183,184,185,124,44,37,95,62,63,
1073- 186,187,188,189,190,191,192,193,194,96,58,35,64,39,61,34,
1074- 195,97,98,99,100,101,102,103,104,105,196,197,198,199,200,
1075- 201,202,106,107,108,109,110,111,112,113,114,203,204,205,
1076- 206,207,208,209,126,115,116,117,118,119,120,121,122,210,
1077- 211,212,213,214,215,216,217,218,219,220,221,222,223,224,
1078- 225,226,227,228,229,230,231,123,65,66,67,68,69,70,71,72,
1079- 73,232,233,234,235,236,237,125,74,75,76,77,78,79,80,81,
1080- 82,238,239,240,241,242,243,92,159,83,84,85,86,87,88,89,
1081- 90,244,245,246,247,248,249,48,49,50,51,52,53,54,55,56,57,
1082- 250,251,252,253,254,255)
1083- import string
1084- c.EBCDIC_TO_ASCII_MAP = string.maketrans(
1085- ''.join(map(chr, list(range(256)))), ''.join(map(chr, emap)))
1086- return s.translate(c.EBCDIC_TO_ASCII_MAP)
1087
1088 # A partial mapping of ISO-Latin-1 to HTML entities/XML numeric entities.
1089 MS_CHARS = {b'\x80': ('euro', '20AC'),
1090
1091=== modified file 'bs4/diagnose.py'
1092--- bs4/diagnose.py 2013-08-09 18:39:43 +0000
1093+++ bs4/diagnose.py 2014-05-29 09:58:03 +0000
1094@@ -1,10 +1,15 @@
1095 """Diagnostic functions, mainly for use when doing tech support."""
1096+import cProfile
1097 from StringIO import StringIO
1098 from HTMLParser import HTMLParser
1099+import bs4
1100 from bs4 import BeautifulSoup, __version__
1101 from bs4.builder import builder_registry
1102+
1103 import os
1104+import pstats
1105 import random
1106+import tempfile
1107 import time
1108 import traceback
1109 import sys
1110@@ -61,14 +66,14 @@
1111
1112 print "-" * 80
1113
1114-def lxml_trace(data, html=True):
1115+def lxml_trace(data, html=True, **kwargs):
1116 """Print out the lxml events that occur during parsing.
1117
1118 This lets you see how lxml parses a document when no Beautiful
1119 Soup code is running.
1120 """
1121 from lxml import etree
1122- for event, element in etree.iterparse(StringIO(data), html=html):
1123+ for event, element in etree.iterparse(StringIO(data), html=html, **kwargs):
1124 print("%s, %4s, %s" % (event, element.tag, element.text))
1125
1126 class AnnouncingParser(HTMLParser):
1127@@ -174,5 +179,26 @@
1128 b = time.time()
1129 print "Raw lxml parsed the markup in %.2fs." % (b-a)
1130
1131+ import html5lib
1132+ parser = html5lib.HTMLParser()
1133+ a = time.time()
1134+ parser.parse(data)
1135+ b = time.time()
1136+ print "Raw html5lib parsed the markup in %.2fs." % (b-a)
1137+
1138+def profile(num_elements=100000, parser="lxml"):
1139+
1140+ filehandle = tempfile.NamedTemporaryFile()
1141+ filename = filehandle.name
1142+
1143+ data = rdoc(num_elements)
1144+ vars = dict(bs4=bs4, data=data, parser=parser)
1145+ cProfile.runctx('bs4.BeautifulSoup(data, parser)' , vars, vars, filename)
1146+
1147+ stats = pstats.Stats(filename)
1148+ # stats.strip_dirs()
1149+ stats.sort_stats("cumulative")
1150+ stats.print_stats('_html5lib|bs4', 50)
1151+
1152 if __name__ == '__main__':
1153 diagnose(sys.stdin.read())
1154
1155=== modified file 'bs4/element.py'
1156--- bs4/element.py 2013-05-25 21:27:22 +0000
1157+++ bs4/element.py 2014-05-29 09:58:03 +0000
1158@@ -255,11 +255,16 @@
1159 self.previous_sibling = self.next_sibling = None
1160 return self
1161
1162- def _last_descendant(self):
1163+ def _last_descendant(self, is_initialized=True, accept_self=True):
1164 "Finds the last element beneath this object to be parsed."
1165- last_child = self
1166- while hasattr(last_child, 'contents') and last_child.contents:
1167- last_child = last_child.contents[-1]
1168+ if is_initialized and self.next_sibling:
1169+ last_child = self.next_sibling.previous_element
1170+ else:
1171+ last_child = self
1172+ while isinstance(last_child, Tag) and last_child.contents:
1173+ last_child = last_child.contents[-1]
1174+ if not accept_self and last_child == self:
1175+ last_child = None
1176 return last_child
1177 # BS3: Not part of the API!
1178 _lastRecursiveChild = _last_descendant
1179@@ -294,11 +299,11 @@
1180 previous_child = self.contents[position - 1]
1181 new_child.previous_sibling = previous_child
1182 new_child.previous_sibling.next_sibling = new_child
1183- new_child.previous_element = previous_child._last_descendant()
1184+ new_child.previous_element = previous_child._last_descendant(False)
1185 if new_child.previous_element is not None:
1186 new_child.previous_element.next_element = new_child
1187
1188- new_childs_last_element = new_child._last_descendant()
1189+ new_childs_last_element = new_child._last_descendant(False)
1190
1191 if position >= len(self.contents):
1192 new_child.next_sibling = None
1193@@ -475,20 +480,21 @@
1194
1195 if isinstance(name, SoupStrainer):
1196 strainer = name
1197- elif text is None and not limit and not attrs and not kwargs:
1198- # Optimization to find all tags.
1199+ else:
1200+ strainer = SoupStrainer(name, attrs, text, **kwargs)
1201+
1202+ if text is None and not limit and not attrs and not kwargs:
1203 if name is True or name is None:
1204- return [element for element in generator
1205- if isinstance(element, Tag)]
1206- # Optimization to find all tags with a given name.
1207+ # Optimization to find all tags.
1208+ result = (element for element in generator
1209+ if isinstance(element, Tag))
1210+ return ResultSet(strainer, result)
1211 elif isinstance(name, basestring):
1212- return [element for element in generator
1213- if isinstance(element, Tag) and element.name == name]
1214- else:
1215- strainer = SoupStrainer(name, attrs, text, **kwargs)
1216- else:
1217- # Build a SoupStrainer
1218- strainer = SoupStrainer(name, attrs, text, **kwargs)
1219+ # Optimization to find all tags with a given name.
1220+ result = (element for element in generator
1221+ if isinstance(element, Tag)
1222+ and element.name == name)
1223+ return ResultSet(strainer, result)
1224 results = ResultSet(strainer)
1225 while True:
1226 try:
1227@@ -672,6 +678,13 @@
1228 output = self.format_string(self, formatter)
1229 return self.PREFIX + output + self.SUFFIX
1230
1231+ @property
1232+ def name(self):
1233+ return None
1234+
1235+ @name.setter
1236+ def name(self, name):
1237+ raise AttributeError("A NavigableString cannot be given a name.")
1238
1239 class PreformattedString(NavigableString):
1240 """A NavigableString not subject to the normal formatting rules.
1241@@ -746,7 +759,7 @@
1242 self.prefix = prefix
1243 if attrs is None:
1244 attrs = {}
1245- elif builder.cdata_list_attributes:
1246+ elif attrs and builder.cdata_list_attributes:
1247 attrs = builder._replace_cdata_list_attribute_values(
1248 self.name, attrs)
1249 else:
1250@@ -1593,6 +1606,6 @@
1251 class ResultSet(list):
1252 """A ResultSet is just a list that keeps track of the SoupStrainer
1253 that created it."""
1254- def __init__(self, source):
1255- list.__init__([])
1256+ def __init__(self, source, result=()):
1257+ super(ResultSet, self).__init__(result)
1258 self.source = source
1259
1260=== modified file 'bs4/testing.py'
1261--- bs4/testing.py 2013-08-09 18:39:43 +0000
1262+++ bs4/testing.py 2014-05-29 09:58:03 +0000
1263@@ -281,6 +281,14 @@
1264 # to detect any differences between them.
1265 #
1266
1267+ def test_can_parse_unicode_document(self):
1268+ # A seemingly innocuous document... but it's in Unicode! And
1269+ # it contains characters that can't be represented in the
1270+ # encoding found in the declaration! The horror!
1271+ markup = u'<html><head><meta encoding="euc-jp"></head><body>Sacr\N{LATIN SMALL LETTER E WITH ACUTE} bleu!</body>'
1272+ soup = self.soup(markup)
1273+ self.assertEqual(u'Sacr\xe9 bleu!', soup.body.string)
1274+
1275 def test_soupstrainer(self):
1276 """Parsers should be able to work with SoupStrainers."""
1277 strainer = SoupStrainer("b")
1278@@ -484,6 +492,11 @@
1279 encoded = soup.encode()
1280 self.assertTrue(b"&lt; &lt; hey &gt; &gt;" in encoded)
1281
1282+ def test_can_parse_unicode_document(self):
1283+ markup = u'<?xml version="1.0" encoding="euc-jp"><root>Sacr\N{LATIN SMALL LETTER E WITH ACUTE} bleu!</root>'
1284+ soup = self.soup(markup)
1285+ self.assertEqual(u'Sacr\xe9 bleu!', soup.root.string)
1286+
1287 def test_popping_namespaced_tag(self):
1288 markup = '<rss xmlns:dc="foo"><dc:creator>b</dc:creator><dc:date>2012-07-02T20:33:42Z</dc:date><dc:rights>c</dc:rights><image>d</image></rss>'
1289 soup = self.soup(markup)
1290
1291=== modified file 'bs4/tests/test_html5lib.py'
1292--- bs4/tests/test_html5lib.py 2013-08-09 18:39:43 +0000
1293+++ bs4/tests/test_html5lib.py 2014-05-29 09:58:03 +0000
1294@@ -70,3 +70,16 @@
1295 soup = self.soup(markup)
1296 # Verify that we can reach the <p> tag; this means the tree is connected.
1297 self.assertEqual(b"<p>foo</p>", soup.p.encode())
1298+
1299+ def test_reparented_markup(self):
1300+ markup = '<p><em>foo</p>\n<p>bar<a></a></em></p>'
1301+ soup = self.soup(markup)
1302+ self.assertEqual(u"<body><p><em>foo</em></p><em>\n</em><p><em>bar<a></a></em></p></body>", soup.body.decode())
1303+ self.assertEqual(2, len(soup.find_all('p')))
1304+
1305+
1306+ def test_reparented_markup_ends_with_whitespace(self):
1307+ markup = '<p><em>foo</p>\n<p>bar<a></a></em></p>\n'
1308+ soup = self.soup(markup)
1309+ self.assertEqual(u"<body><p><em>foo</em></p><em>\n</em><p><em>bar<a></a></em></p>\n</body>", soup.body.decode())
1310+ self.assertEqual(2, len(soup.find_all('p')))
1311
1312=== modified file 'bs4/tests/test_lxml.py'
1313--- bs4/tests/test_lxml.py 2013-08-09 18:39:43 +0000
1314+++ bs4/tests/test_lxml.py 2014-05-29 09:58:03 +0000
1315@@ -4,14 +4,16 @@
1316 import warnings
1317
1318 try:
1319- from bs4.builder import LXMLTreeBuilder, LXMLTreeBuilderForXML
1320+ import lxml.etree
1321 LXML_PRESENT = True
1322- import lxml.etree
1323 LXML_VERSION = lxml.etree.LXML_VERSION
1324 except ImportError, e:
1325 LXML_PRESENT = False
1326 LXML_VERSION = (0,)
1327
1328+if LXML_PRESENT:
1329+ from bs4.builder import LXMLTreeBuilder, LXMLTreeBuilderForXML
1330+
1331 from bs4 import (
1332 BeautifulSoup,
1333 BeautifulStoneSoup,
1334@@ -58,9 +60,10 @@
1335 def test_beautifulstonesoup_is_xml_parser(self):
1336 # Make sure that the deprecated BSS class uses an xml builder
1337 # if one is installed.
1338- with warnings.catch_warnings(record=False) as w:
1339+ with warnings.catch_warnings(record=True) as w:
1340 soup = BeautifulStoneSoup("<b />")
1341- self.assertEqual(u"<b/>", unicode(soup.b))
1342+ self.assertEqual(u"<b/>", unicode(soup.b))
1343+ self.assertTrue("BeautifulStoneSoup class is deprecated" in str(w[0].message))
1344
1345 def test_real_xhtml_document(self):
1346 """lxml strips the XML definition from an XHTML doc, which is fine."""
1347
1348=== modified file 'bs4/tests/test_soup.py'
1349--- bs4/tests/test_soup.py 2013-08-09 18:39:43 +0000
1350+++ bs4/tests/test_soup.py 2014-05-29 09:58:03 +0000
1351@@ -4,6 +4,8 @@
1352 import logging
1353 import unittest
1354 import sys
1355+import tempfile
1356+
1357 from bs4 import (
1358 BeautifulSoup,
1359 BeautifulStoneSoup,
1360@@ -15,7 +17,10 @@
1361 NamespacedAttribute,
1362 )
1363 import bs4.dammit
1364-from bs4.dammit import EntitySubstitution, UnicodeDammit
1365+from bs4.dammit import (
1366+ EntitySubstitution,
1367+ UnicodeDammit,
1368+)
1369 from bs4.testing import (
1370 SoupTest,
1371 skipIf,
1372@@ -31,6 +36,19 @@
1373 PYTHON_2_PRE_2_7 = (sys.version_info < (2,7))
1374 PYTHON_3_PRE_3_2 = (sys.version_info[0] == 3 and sys.version_info < (3,2))
1375
1376+class TestConstructor(SoupTest):
1377+
1378+ def test_short_unicode_input(self):
1379+ data = u"<h1>éé</h1>"
1380+ soup = self.soup(data)
1381+ self.assertEqual(u"éé", soup.h1.string)
1382+
1383+ def test_embedded_null(self):
1384+ data = u"<h1>foo\0bar</h1>"
1385+ soup = self.soup(data)
1386+ self.assertEqual(u"foo\0bar", soup.h1.string)
1387+
1388+
1389 class TestDeprecatedConstructorArguments(SoupTest):
1390
1391 def test_parseOnlyThese_renamed_to_parse_only(self):
1392@@ -54,14 +72,33 @@
1393 self.assertRaises(
1394 TypeError, self.soup, "<a>", no_such_argument=True)
1395
1396- @skipIf(
1397- not LXML_PRESENT,
1398- "lxml not present, not testing BeautifulStoneSoup.")
1399- def test_beautifulstonesoup(self):
1400- with warnings.catch_warnings(record=True) as w:
1401- soup = BeautifulStoneSoup("<markup>")
1402- self.assertTrue(isinstance(soup, BeautifulSoup))
1403- self.assertTrue("BeautifulStoneSoup class is deprecated")
1404+class TestWarnings(SoupTest):
1405+
1406+ def test_disk_file_warning(self):
1407+ filehandle = tempfile.NamedTemporaryFile()
1408+ filename = filehandle.name
1409+ try:
1410+ with warnings.catch_warnings(record=True) as w:
1411+ soup = self.soup(filename)
1412+ msg = str(w[0].message)
1413+ self.assertTrue("looks like a filename" in msg)
1414+ finally:
1415+ filehandle.close()
1416+
1417+ # The file no longer exists, so Beautiful Soup will no longer issue the warning.
1418+ with warnings.catch_warnings(record=True) as w:
1419+ soup = self.soup(filename)
1420+ self.assertEqual(0, len(w))
1421+
1422+ def test_url_warning(self):
1423+ with warnings.catch_warnings(record=True) as w:
1424+ soup = self.soup("http://www.crummy.com/")
1425+ msg = str(w[0].message)
1426+ self.assertTrue("looks like a URL" in msg)
1427+
1428+ with warnings.catch_warnings(record=True) as w:
1429+ soup = self.soup("http://www.crummy.com/ is great")
1430+ self.assertEqual(0, len(w))
1431
1432 class TestSelectiveParsing(SoupTest):
1433
1434@@ -156,13 +193,23 @@
1435
1436 def test_ascii_in_unicode_out(self):
1437 # ASCII input is converted to Unicode. The original_encoding
1438- # attribute is set.
1439- ascii = b"<foo>a</foo>"
1440- soup_from_ascii = self.soup(ascii)
1441- unicode_output = soup_from_ascii.decode()
1442- self.assertTrue(isinstance(unicode_output, unicode))
1443- self.assertEqual(unicode_output, self.document_for(ascii.decode()))
1444- self.assertEqual(soup_from_ascii.original_encoding.lower(), "ascii")
1445+ # attribute is set to 'utf-8', a superset of ASCII.
1446+ chardet = bs4.dammit.chardet_dammit
1447+ logging.disable(logging.WARNING)
1448+ try:
1449+ def noop(str):
1450+ return None
1451+ # Disable chardet, which will realize that the ASCII is ASCII.
1452+ bs4.dammit.chardet_dammit = noop
1453+ ascii = b"<foo>a</foo>"
1454+ soup_from_ascii = self.soup(ascii)
1455+ unicode_output = soup_from_ascii.decode()
1456+ self.assertTrue(isinstance(unicode_output, unicode))
1457+ self.assertEqual(unicode_output, self.document_for(ascii.decode()))
1458+ self.assertEqual(soup_from_ascii.original_encoding.lower(), "utf-8")
1459+ finally:
1460+ logging.disable(logging.NOTSET)
1461+ bs4.dammit.chardet_dammit = chardet
1462
1463 def test_unicode_in_unicode_out(self):
1464 # Unicode input is left alone. The original_encoding attribute
1465@@ -192,7 +239,12 @@
1466 self.assertEqual(self.soup(markup).div.encode("utf8"), markup.encode("utf8"))
1467
1468 class TestUnicodeDammit(unittest.TestCase):
1469- """Standalone tests of Unicode, Dammit."""
1470+ """Standalone tests of UnicodeDammit."""
1471+
1472+ def test_unicode_input(self):
1473+ markup = u"I'm already Unicode! \N{SNOWMAN}"
1474+ dammit = UnicodeDammit(markup)
1475+ self.assertEqual(dammit.unicode_markup, markup)
1476
1477 def test_smart_quotes_to_unicode(self):
1478 markup = b"<foo>\x91\x92\x93\x94</foo>"
1479@@ -293,9 +345,8 @@
1480 logging.disable(logging.NOTSET)
1481 bs4.dammit.chardet_dammit = chardet
1482
1483- def test_sniffed_xml_encoding(self):
1484- # A document written in UTF-16LE will be converted by a different
1485- # code path that sniffs the byte order markers.
1486+ def test_byte_order_mark_removed(self):
1487+ # A document written in UTF-16LE will have its byte order marker stripped.
1488 data = b'\xff\xfe<\x00a\x00>\x00\xe1\x00\xe9\x00<\x00/\x00a\x00>\x00'
1489 dammit = UnicodeDammit(data)
1490 self.assertEqual(u"<a>áé</a>", dammit.unicode_markup)
1491
1492=== modified file 'bs4/tests/test_tree.py'
1493--- bs4/tests/test_tree.py 2013-08-09 18:39:43 +0000
1494+++ bs4/tests/test_tree.py 2014-05-29 09:58:03 +0000
1495@@ -70,6 +70,16 @@
1496 soup = self.soup(u'<h1>Räksmörgås</h1>')
1497 self.assertEqual(soup.find(text=u'Räksmörgås'), u'Räksmörgås')
1498
1499+ def test_find_everything(self):
1500+ """Test an optimization that finds all tags."""
1501+ soup = self.soup("<a>foo</a><b>bar</b>")
1502+ self.assertEqual(2, len(soup.find_all()))
1503+
1504+ def test_find_everything_with_name(self):
1505+ """Test an optimization that finds all tags with a given name."""
1506+ soup = self.soup("<a>foo</a><b>bar</b><a>baz</a>")
1507+ self.assertEqual(2, len(soup.find_all('a')))
1508+
1509 class TestFindAll(TreeTest):
1510 """Basic tests of the find_all() method."""
1511
1512@@ -115,6 +125,19 @@
1513 # recursion.
1514 self.assertEqual([], soup.find_all(l))
1515
1516+ def test_find_all_resultset(self):
1517+ """All find_all calls return a ResultSet"""
1518+ soup = self.soup("<a></a>")
1519+ result = soup.find_all("a")
1520+ self.assertTrue(hasattr(result, "source"))
1521+
1522+ result = soup.find_all(True)
1523+ self.assertTrue(hasattr(result, "source"))
1524+
1525+ result = soup.find_all(text="foo")
1526+ self.assertTrue(hasattr(result, "source"))
1527+
1528+
1529 class TestFindAllBasicNamespaces(TreeTest):
1530
1531 def test_find_by_namespaced_name(self):
1532@@ -1219,6 +1242,12 @@
1533 # attribute for any other tag.
1534 self.assertEqual('ISO-8859-1 UTF-8', soup.a['accept-charset'])
1535
1536+ def test_string_has_immutable_name_property(self):
1537+ string = self.soup("s").string
1538+ self.assertEqual(None, string.name)
1539+ def t():
1540+ string.name = 'foo'
1541+ self.assertRaises(AttributeError, t)
1542
1543 class TestPersistence(SoupTest):
1544 "Testing features like pickle and deepcopy."
1545
1546=== modified file 'debian/changelog'
1547--- debian/changelog 2014-02-23 13:46:15 +0000
1548+++ debian/changelog 2014-05-29 09:58:03 +0000
1549@@ -1,3 +1,19 @@
1550+beautifulsoup4 (4.3.2-1ubuntu1) utopic; urgency=medium
1551+
1552+ * Merge from debian. Remaining changes:
1553+ - debian/control, debian/rules: Disable pypy-bs4 and Build-Depends on
1554+ pypy since the latter is in universe, while beautifulsoup4 is being
1555+ pulled into main via webtest.
1556+
1557+ -- Jackson Doak <noskcaj@ubuntu.com> Thu, 29 May 2014 19:50:43 +1000
1558+
1559+beautifulsoup4 (4.3.2-1) unstable; urgency=low
1560+
1561+ * New upstream release.
1562+ * Bump Standards-Version to 3.9.5, no changes needed.
1563+
1564+ -- Stefano Rivera <stefanor@debian.org> Sat, 03 May 2014 14:19:04 +0200
1565+
1566 beautifulsoup4 (4.2.1-1ubuntu2) trusty; urgency=medium
1567
1568 * Rebuild to drop files installed into /usr/share/pyshared.
1569
1570=== modified file 'debian/control'
1571--- debian/control 2013-11-15 09:56:34 +0000
1572+++ debian/control 2014-05-29 09:58:03 +0000
1573@@ -15,7 +15,7 @@
1574 python3-lxml,
1575 python3-pkg-resources
1576 X-Python-Version: >= 2.6
1577-Standards-Version: 3.9.4
1578+Standards-Version: 3.9.5
1579 Homepage: http://www.crummy.com/software/BeautifulSoup
1580 Vcs-Svn: svn://anonscm.debian.org/python-modules/packages/beautifulsoup4/trunk/
1581 Vcs-Browser: http://anonscm.debian.org/viewvc/python-modules/packages/beautifulsoup4/trunk/
1582
1583=== modified file 'debian/copyright'
1584--- debian/copyright 2013-05-25 21:27:22 +0000
1585+++ debian/copyright 2014-05-29 09:58:03 +0000
1586@@ -18,7 +18,7 @@
1587 Files: debian/*
1588 Copyright:
1589 2005-2009, Decklin Foster <decklin@red-bean.com>
1590- 2011-2013, Stefano Rivera <stefanor@debian.org>
1591+ 2011-2014, Stefano Rivera <stefanor@debian.org>
1592 License: Expatish
1593
1594 License: Expatish
1595
1596=== modified file 'doc/source/index.rst'
1597--- doc/source/index.rst 2013-08-09 18:39:43 +0000
1598+++ doc/source/index.rst 2014-05-29 09:58:03 +0000
1599@@ -26,6 +26,10 @@
1600 projects. If you want to learn about the differences between Beautiful
1601 Soup 3 and Beautiful Soup 4, see `Porting code to BS4`_.
1602
1603+This documentation has been translated into other languages by its users.
1604+
1605+* 이 문서는 한국어 번역도 가능합니다. (`외부 링크 <http://coreapython.hosting.paran.com/etc/beautifulsoup4.html>`_)
1606+
1607 Getting help
1608 ------------
1609
1610@@ -1209,8 +1213,8 @@
1611 You can filter an attribute based on `a string`_, `a regular
1612 expression`_, `a list`_, `a function`_, or `the value True`_.
1613
1614-This code finds all tags that have an ``id`` attribute, regardless of
1615-what the value is::
1616+This code finds all tags whose ``id`` attribute has a value,
1617+regardless of what the value is::
1618
1619 soup.find_all(id=True)
1620 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
1621@@ -2478,9 +2482,11 @@
1622 dammit.original_encoding
1623 # 'utf-8'
1624
1625-The more data you give Unicode, Dammit, the more accurately it will
1626-guess. If you have your own suspicions as to what the encoding might
1627-be, you can pass them in as a list::
1628+Unicode, Dammit's guesses will get a lot more accurate if you install
1629+the ``chardet`` or ``cchardet`` Python libraries. The more data you
1630+give Unicode, Dammit, the more accurately it will guess. If you have
1631+your own suspicions as to what the encoding might be, you can pass
1632+them in as a list::
1633
1634 dammit = UnicodeDammit("Sacr\xe9 bleu!", ["latin-1", "iso-8859-1"])
1635 print(dammit.unicode_markup)
1636@@ -2823,16 +2829,6 @@
1637 You can speed up encoding detection significantly by installing the
1638 `cchardet <http://pypi.python.org/pypi/cchardet/>`_ library.
1639
1640-Sometimes `Unicode, Dammit`_ can only detect the encoding of a file by
1641-doing a byte-by-byte examination of the file. This slows Beautiful
1642-Soup to a crawl. My tests indicate that this only happened on 2.x
1643-versions of Python, and that it happened most often with documents
1644-using Russian or Chinese encodings. If this is happening to you, you
1645-can fix it by installing cchardet, or by using Python 3 for your
1646-script. If you happen to know a document's encoding, you can pass
1647-it into the ``BeautifulSoup`` constructor as ``from_encoding``, and
1648-bypass encoding detection altogether.
1649-
1650 `Parsing only part of a document`_ won't save you much time parsing
1651 the document, but it can save a lot of memory, and it'll make
1652 `searching` the document much faster.
1653
1654=== modified file 'setup.py'
1655--- setup.py 2013-08-09 18:39:43 +0000
1656+++ setup.py 2014-05-29 09:58:03 +0000
1657@@ -7,7 +7,7 @@
1658 from distutils.command.build_py import build_py
1659
1660 setup(name="beautifulsoup4",
1661- version = "4.2.1",
1662+ version = "4.3.2",
1663 author="Leonard Richardson",
1664 author_email='leonardr@segfault.org',
1665 url="http://www.crummy.com/software/BeautifulSoup/bs4/",
