Merge lp:~noskcaj/ubuntu/utopic/beautifulsoup4/merge into lp:ubuntu/utopic/beautifulsoup4

Proposed by Jackson Doak
Status: Needs review
Proposed branch: lp:~noskcaj/ubuntu/utopic/beautifulsoup4/merge
Merge into: lp:ubuntu/utopic/beautifulsoup4
Diff against target: 1665 lines (+709/-333)
20 files modified
NEWS.txt (+62/-0)
PKG-INFO (+1/-1)
bs4/__init__.py (+83/-42)
bs4/builder/__init__.py (+13/-8)
bs4/builder/_html5lib.py (+82/-19)
bs4/builder/_htmlparser.py (+14/-5)
bs4/builder/_lxml.py (+64/-30)
bs4/dammit.py (+165/-163)
bs4/diagnose.py (+28/-2)
bs4/element.py (+34/-21)
bs4/testing.py (+13/-0)
bs4/tests/test_html5lib.py (+13/-0)
bs4/tests/test_lxml.py (+7/-4)
bs4/tests/test_soup.py (+71/-20)
bs4/tests/test_tree.py (+29/-0)
debian/changelog (+16/-0)
debian/control (+1/-1)
debian/copyright (+1/-1)
doc/source/index.rst (+11/-15)
setup.py (+1/-1)
To merge this branch: bzr merge lp:~noskcaj/ubuntu/utopic/beautifulsoup4/merge
Reviewer: Daniel Holbach (community)
Review status: Approve
Review via email: mp+221346@code.launchpad.net

Description of the change

New upstream release from Debian.
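
For context when reviewing: a minimal, hedged sketch (not part of the packaging diff) of the most user-visible change in the 4.3.x series, the warnings Beautiful Soup now issues when it is handed a filename or a URL instead of markup. It is based on the NEWS.txt entries and the new TestWarnings cases in the preview diff below; the warning wording comes from bs4/__init__.py.

    import warnings
    from bs4 import BeautifulSoup

    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        # Passing a URL instead of markup: the string is still parsed as
        # markup, but bs4 4.3.x warns that it is not an HTTP client.
        BeautifulSoup("http://www.crummy.com/")

    for w in caught:
        print(w.message)
    # '"http://www.crummy.com/" looks like a URL. Beautiful Soup is not an
    # HTTP client. ...'

The same length-limited check warns when a short string names an existing file on disk, as exercised by test_disk_file_warning in the diff.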

Revision history for this message
Daniel Holbach (dholbach) wrote:

Thanks. Uploaded.

review: Approve

Unmerged revisions

15. By Jackson Doak

* Merge from debian. Remaining changes:
  - debian/control, debian/rules: Disable pypy-bs4 and Build-Depends on
  pypy since the latter is in universe, while beautifulsoup4 is being
  pulled into main via webtest.

Preview Diff

1=== modified file 'NEWS.txt'
2--- NEWS.txt 2013-08-09 18:39:43 +0000
3+++ NEWS.txt 2014-05-29 09:58:03 +0000
4@@ -1,3 +1,65 @@
5+= 4.3.2 (20131002) =
6+
7+* Fixed a bug in which short Unicode input was improperly encoded to
8+ ASCII when checking whether or not it was the name of a file on
9+ disk. [bug=1227016]
10+
11+* Fixed a crash when a short input contains data not valid in
12+ filenames. [bug=1232604]
13+
14+* Fixed a bug that caused Unicode data put into UnicodeDammit to
15+ return None instead of the original data. [bug=1214983]
16+
17+* Combined two tests to stop a spurious test failure when tests are
18+ run by nosetests. [bug=1212445]
19+
20+= 4.3.1 (20130815) =
21+
22+* Fixed yet another problem with the html5lib tree builder, caused by
23+ html5lib's tendency to rearrange the tree during
24+ parsing. [bug=1189267]
25+
26+* Fixed a bug that caused the optimized version of find_all() to
27+ return nothing. [bug=1212655]
28+
29+= 4.3.0 (20130812) =
30+
31+* Instead of converting incoming data to Unicode and feeding it to the
32+ lxml tree builder in chunks, Beautiful Soup now makes successive
33+ guesses at the encoding of the incoming data, and tells lxml to
34+ parse the data as that encoding. Giving lxml more control over the
35+ parsing process improves performance and avoids a number of bugs and
36+ issues with the lxml parser which had previously required elaborate
37+ workarounds:
38+
39+ - An issue in which lxml refuses to parse Unicode strings on some
40+ systems. [bug=1180527]
41+
42+ - A returning bug that truncated documents longer than a (very
43+ small) size. [bug=963880]
44+
45+ - A returning bug in which extra spaces were added to a document if
46+ the document defined a charset other than UTF-8. [bug=972466]
47+
48+ This required a major overhaul of the tree builder architecture. If
49+ you wrote your own tree builder and didn't tell me, you'll need to
50+ modify your prepare_markup() method.
51+
52+* The UnicodeDammit code that makes guesses at encodings has been
53+ split into its own class, EncodingDetector. A lot of apparently
54+ redundant code has been removed from Unicode, Dammit, and some
55+ undocumented features have also been removed.
56+
57+* Beautiful Soup will issue a warning if instead of markup you pass it
58+ a URL or the name of a file on disk (a common beginner's mistake).
59+
60+* A number of optimizations improve the performance of the lxml tree
61+ builder by about 33%, the html.parser tree builder by about 20%, and
62+ the html5lib tree builder by about 15%.
63+
64+* All find_all calls should now return a ResultSet object. Patch by
65+ Aaron DeVore. [bug=1194034]
66+
67 = 4.2.1 (20130531) =
68
69 * The default XML formatter will now replace ampersands even if they
70
71=== modified file 'PKG-INFO'
72--- PKG-INFO 2013-08-09 18:39:43 +0000
73+++ PKG-INFO 2014-05-29 09:58:03 +0000
74@@ -1,6 +1,6 @@
75 Metadata-Version: 1.1
76 Name: beautifulsoup4
77-Version: 4.2.1
78+Version: 4.3.2
79 Summary: UNKNOWN
80 Home-page: http://www.crummy.com/software/BeautifulSoup/bs4/
81 Author: Leonard Richardson
82
83=== modified file 'bs4/__init__.py'
84--- bs4/__init__.py 2013-08-09 18:39:43 +0000
85+++ bs4/__init__.py 2014-05-29 09:58:03 +0000
86@@ -17,16 +17,17 @@
87 """
88
89 __author__ = "Leonard Richardson (leonardr@segfault.org)"
90-__version__ = "4.2.1"
91+__version__ = "4.3.2"
92 __copyright__ = "Copyright (c) 2004-2013 Leonard Richardson"
93 __license__ = "MIT"
94
95 __all__ = ['BeautifulSoup']
96
97+import os
98 import re
99 import warnings
100
101-from .builder import builder_registry
102+from .builder import builder_registry, ParserRejectedMarkup
103 from .dammit import UnicodeDammit
104 from .element import (
105 CData,
106@@ -74,11 +75,7 @@
107 # want, look for one with these features.
108 DEFAULT_BUILDER_FEATURES = ['html', 'fast']
109
110- # Used when determining whether a text node is all whitespace and
111- # can be replaced with a single space. A text node that contains
112- # fancy Unicode spaces (usually non-breaking) should be left
113- # alone.
114- STRIP_ASCII_SPACES = {9: None, 10: None, 12: None, 13: None, 32: None, }
115+ ASCII_SPACES = '\x20\x0a\x09\x0c\x0d'
116
117 def __init__(self, markup="", features=None, builder=None,
118 parse_only=None, from_encoding=None, **kwargs):
119@@ -160,18 +157,46 @@
120
121 self.parse_only = parse_only
122
123- self.reset()
124-
125 if hasattr(markup, 'read'): # It's a file-type object.
126 markup = markup.read()
127- (self.markup, self.original_encoding, self.declared_html_encoding,
128- self.contains_replacement_characters) = (
129- self.builder.prepare_markup(markup, from_encoding))
130+ elif len(markup) <= 256:
131+ # Print out warnings for a couple beginner problems
132+ # involving passing non-markup to Beautiful Soup.
133+ # Beautiful Soup will still parse the input as markup,
134+ # just in case that's what the user really wants.
135+ if (isinstance(markup, unicode)
136+ and not os.path.supports_unicode_filenames):
137+ possible_filename = markup.encode("utf8")
138+ else:
139+ possible_filename = markup
140+ is_file = False
141+ try:
142+ is_file = os.path.exists(possible_filename)
143+ except Exception, e:
144+ # This is almost certainly a problem involving
145+ # characters not valid in filenames on this
146+ # system. Just let it go.
147+ pass
148+ if is_file:
149+ warnings.warn(
150+ '"%s" looks like a filename, not markup. You should probably open this file and pass the filehandle into Beautiful Soup.' % markup)
151+ if markup[:5] == "http:" or markup[:6] == "https:":
152+ # TODO: This is ugly but I couldn't get it to work in
153+ # Python 3 otherwise.
154+ if ((isinstance(markup, bytes) and not b' ' in markup)
155+ or (isinstance(markup, unicode) and not u' ' in markup)):
156+ warnings.warn(
157+ '"%s" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client to get the document behind the URL, and feed that document to Beautiful Soup.' % markup)
158
159- try:
160- self._feed()
161- except StopParsing:
162- pass
163+ for (self.markup, self.original_encoding, self.declared_html_encoding,
164+ self.contains_replacement_characters) in (
165+ self.builder.prepare_markup(markup, from_encoding)):
166+ self.reset()
167+ try:
168+ self._feed()
169+ break
170+ except ParserRejectedMarkup:
171+ pass
172
173 # Clear out the markup and remove the builder's circular
174 # reference to this object.
175@@ -192,9 +217,10 @@
176 Tag.__init__(self, self, self.builder, self.ROOT_TAG_NAME)
177 self.hidden = 1
178 self.builder.reset()
179- self.currentData = []
180+ self.current_data = []
181 self.currentTag = None
182 self.tagStack = []
183+ self.preserve_whitespace_tag_stack = []
184 self.pushTag(self)
185
186 def new_tag(self, name, namespace=None, nsprefix=None, **attrs):
187@@ -215,6 +241,8 @@
188
189 def popTag(self):
190 tag = self.tagStack.pop()
191+ if self.preserve_whitespace_tag_stack and tag == self.preserve_whitespace_tag_stack[-1]:
192+ self.preserve_whitespace_tag_stack.pop()
193 #print "Pop", tag.name
194 if self.tagStack:
195 self.currentTag = self.tagStack[-1]
196@@ -226,23 +254,37 @@
197 self.currentTag.contents.append(tag)
198 self.tagStack.append(tag)
199 self.currentTag = self.tagStack[-1]
200+ if tag.name in self.builder.preserve_whitespace_tags:
201+ self.preserve_whitespace_tag_stack.append(tag)
202
203 def endData(self, containerClass=NavigableString):
204- if self.currentData:
205- currentData = u''.join(self.currentData)
206- if (currentData.translate(self.STRIP_ASCII_SPACES) == '' and
207- not set([tag.name for tag in self.tagStack]).intersection(
208- self.builder.preserve_whitespace_tags)):
209- if '\n' in currentData:
210- currentData = '\n'
211- else:
212- currentData = ' '
213- self.currentData = []
214+ if self.current_data:
215+ current_data = u''.join(self.current_data)
216+ # If whitespace is not preserved, and this string contains
217+ # nothing but ASCII spaces, replace it with a single space
218+ # or newline.
219+ if not self.preserve_whitespace_tag_stack:
220+ strippable = True
221+ for i in current_data:
222+ if i not in self.ASCII_SPACES:
223+ strippable = False
224+ break
225+ if strippable:
226+ if '\n' in current_data:
227+ current_data = '\n'
228+ else:
229+ current_data = ' '
230+
231+ # Reset the data collector.
232+ self.current_data = []
233+
234+ # Should we add this string to the tree at all?
235 if self.parse_only and len(self.tagStack) <= 1 and \
236 (not self.parse_only.text or \
237- not self.parse_only.search(currentData)):
238+ not self.parse_only.search(current_data)):
239 return
240- o = containerClass(currentData)
241+
242+ o = containerClass(current_data)
243 self.object_was_parsed(o)
244
245 def object_was_parsed(self, o, parent=None, most_recent_element=None):
246@@ -250,6 +292,7 @@
247 parent = parent or self.currentTag
248 most_recent_element = most_recent_element or self._most_recent_element
249 o.setup(parent, most_recent_element)
250+
251 if most_recent_element is not None:
252 most_recent_element.next_element = o
253 self._most_recent_element = o
254@@ -262,22 +305,21 @@
255 the given tag."""
256 #print "Popping to %s" % name
257 if name == self.ROOT_TAG_NAME:
258+ # The BeautifulSoup object itself can never be popped.
259 return
260
261- numPops = 0
262- mostRecentTag = None
263+ most_recently_popped = None
264
265- for i in range(len(self.tagStack) - 1, 0, -1):
266- if (name == self.tagStack[i].name
267- and nsprefix == self.tagStack[i].prefix):
268- numPops = len(self.tagStack) - i
269+ stack_size = len(self.tagStack)
270+ for i in range(stack_size - 1, 0, -1):
271+ t = self.tagStack[i]
272+ if (name == t.name and nsprefix == t.prefix):
273+ if inclusivePop:
274+ most_recently_popped = self.popTag()
275 break
276- if not inclusivePop:
277- numPops = numPops - 1
278+ most_recently_popped = self.popTag()
279
280- for i in range(0, numPops):
281- mostRecentTag = self.popTag()
282- return mostRecentTag
283+ return most_recently_popped
284
285 def handle_starttag(self, name, namespace, nsprefix, attrs):
286 """Push a start tag on to the stack.
287@@ -312,7 +354,7 @@
288 self._popToTag(name, nsprefix)
289
290 def handle_data(self, data):
291- self.currentData.append(data)
292+ self.current_data.append(data)
293
294 def decode(self, pretty_print=False,
295 eventual_encoding=DEFAULT_OUTPUT_ENCODING,
296@@ -353,7 +395,6 @@
297 class StopParsing(Exception):
298 pass
299
300-
301 class FeatureNotFound(ValueError):
302 pass
303
304
305=== modified file 'bs4/builder/__init__.py'
306--- bs4/builder/__init__.py 2013-08-09 18:39:43 +0000
307+++ bs4/builder/__init__.py 2014-05-29 09:58:03 +0000
308@@ -147,16 +147,18 @@
309
310 Modifies its input in place.
311 """
312+ if not attrs:
313+ return attrs
314 if self.cdata_list_attributes:
315 universal = self.cdata_list_attributes.get('*', [])
316 tag_specific = self.cdata_list_attributes.get(
317- tag_name.lower(), [])
318- for cdata_list_attr in itertools.chain(universal, tag_specific):
319- if cdata_list_attr in attrs:
320- # Basically, we have a "class" attribute whose
321- # value is a whitespace-separated list of CSS
322- # classes. Split it into a list.
323- value = attrs[cdata_list_attr]
324+ tag_name.lower(), None)
325+ for attr in attrs.keys():
326+ if attr in universal or (tag_specific and attr in tag_specific):
327+ # We have a "class"-type attribute whose string
328+ # value is a whitespace-separated list of
329+ # values. Split it into a list.
330+ value = attrs[attr]
331 if isinstance(value, basestring):
332 values = whitespace_re.split(value)
333 else:
334@@ -167,7 +169,7 @@
335 # leave the value alone rather than trying to
336 # split it again.
337 values = value
338- attrs[cdata_list_attr] = values
339+ attrs[attr] = values
340 return attrs
341
342 class SAXTreeBuilder(TreeBuilder):
343@@ -296,6 +298,9 @@
344 # Register the builder while we're at it.
345 this_module.builder_registry.register(obj)
346
347+class ParserRejectedMarkup(Exception):
348+ pass
349+
350 # Builders are registered in reverse order of priority, so that custom
351 # builder registrations will take precedence. In general, we want lxml
352 # to take precedence over html5lib, because it's faster. And we only
353
354=== modified file 'bs4/builder/_html5lib.py'
355--- bs4/builder/_html5lib.py 2013-08-09 18:39:43 +0000
356+++ bs4/builder/_html5lib.py 2014-05-29 09:58:03 +0000
357@@ -27,7 +27,7 @@
358 def prepare_markup(self, markup, user_specified_encoding):
359 # Store the user-specified encoding for use later on.
360 self.user_specified_encoding = user_specified_encoding
361- return markup, None, None, False
362+ yield (markup, None, None, False)
363
364 # These methods are defined by Beautiful Soup.
365 def feed(self, markup):
366@@ -123,17 +123,50 @@
367 self.namespace = namespace
368
369 def appendChild(self, node):
370- if (node.element.__class__ == NavigableString and self.element.contents
371+ string_child = child = None
372+ if isinstance(node, basestring):
373+ # Some other piece of code decided to pass in a string
374+ # instead of creating a TextElement object to contain the
375+ # string.
376+ string_child = child = node
377+ elif isinstance(node, Tag):
378+ # Some other piece of code decided to pass in a Tag
379+ # instead of creating an Element object to contain the
380+ # Tag.
381+ child = node
382+ elif node.element.__class__ == NavigableString:
383+ string_child = child = node.element
384+ else:
385+ child = node.element
386+
387+ if not isinstance(child, basestring) and child.parent is not None:
388+ node.element.extract()
389+
390+ if (string_child and self.element.contents
391 and self.element.contents[-1].__class__ == NavigableString):
392- # Concatenate new text onto old text node
393- # XXX This has O(n^2) performance, for input like
394+ # We are appending a string onto another string.
395+ # TODO This has O(n^2) performance, for input like
396 # "a</a>a</a>a</a>..."
397 old_element = self.element.contents[-1]
398- new_element = self.soup.new_string(old_element + node.element)
399+ new_element = self.soup.new_string(old_element + string_child)
400 old_element.replace_with(new_element)
401 self.soup._most_recent_element = new_element
402 else:
403- self.soup.object_was_parsed(node.element, parent=self.element)
404+ if isinstance(node, basestring):
405+ # Create a brand new NavigableString from this string.
406+ child = self.soup.new_string(node)
407+
408+ # Tell Beautiful Soup to act as if it parsed this element
409+ # immediately after the parent's last descendant. (Or
410+ # immediately after the parent, if it has no children.)
411+ if self.element.contents:
412+ most_recent_element = self.element._last_descendant(False)
413+ else:
414+ most_recent_element = self.element
415+
416+ self.soup.object_was_parsed(
417+ child, parent=self.element,
418+ most_recent_element=most_recent_element)
419
420 def getAttributes(self):
421 return AttrList(self.element)
422@@ -162,11 +195,11 @@
423 attributes = property(getAttributes, setAttributes)
424
425 def insertText(self, data, insertBefore=None):
426- text = TextNode(self.soup.new_string(data), self.soup)
427 if insertBefore:
428- self.insertBefore(text, insertBefore)
429+ text = TextNode(self.soup.new_string(data), self.soup)
430+ self.insertBefore(data, insertBefore)
431 else:
432- self.appendChild(text)
433+ self.appendChild(data)
434
435 def insertBefore(self, node, refNode):
436 index = self.element.index(refNode.element)
437@@ -183,16 +216,46 @@
438 def removeChild(self, node):
439 node.element.extract()
440
441- def reparentChildren(self, newParent):
442- while self.element.contents:
443- child = self.element.contents[0]
444- child.extract()
445- if isinstance(child, Tag):
446- newParent.appendChild(
447- Element(child, self.soup, namespaces["html"]))
448- else:
449- newParent.appendChild(
450- TextNode(child, self.soup))
451+ def reparentChildren(self, new_parent):
452+ """Move all of this tag's children into another tag."""
453+ element = self.element
454+ new_parent_element = new_parent.element
455+ # Determine what this tag's next_element will be once all the children
456+ # are removed.
457+ final_next_element = element.next_sibling
458+
459+ new_parents_last_descendant = new_parent_element._last_descendant(False, False)
460+ if len(new_parent_element.contents) > 0:
461+ # The new parent already contains children. We will be
462+ # appending this tag's children to the end.
463+ new_parents_last_child = new_parent_element.contents[-1]
464+ new_parents_last_descendant_next_element = new_parents_last_descendant.next_element
465+ else:
466+ # The new parent contains no children.
467+ new_parents_last_child = None
468+ new_parents_last_descendant_next_element = new_parent_element.next_element
469+
470+ to_append = element.contents
471+ append_after = new_parent.element.contents
472+ if len(to_append) > 0:
473+ # Set the first child's previous_element and previous_sibling
474+ # to elements within the new parent
475+ first_child = to_append[0]
476+ first_child.previous_element = new_parents_last_descendant
477+ first_child.previous_sibling = new_parents_last_child
478+
479+ # Fix the last child's next_element and next_sibling
480+ last_child = to_append[-1]
481+ last_child.next_element = new_parents_last_descendant_next_element
482+ last_child.next_sibling = None
483+
484+ for child in to_append:
485+ child.parent = new_parent_element
486+ new_parent_element.contents.append(child)
487+
488+ # Now that this element has no children, change its .next_element.
489+ element.contents = []
490+ element.next_element = final_next_element
491
492 def cloneNode(self):
493 tag = self.soup.new_tag(self.element.name, self.namespace)
494
495=== modified file 'bs4/builder/_htmlparser.py'
496--- bs4/builder/_htmlparser.py 2013-08-09 18:39:43 +0000
497+++ bs4/builder/_htmlparser.py 2014-05-29 09:58:03 +0000
498@@ -45,7 +45,15 @@
499 class BeautifulSoupHTMLParser(HTMLParser):
500 def handle_starttag(self, name, attrs):
501 # XXX namespace
502- self.soup.handle_starttag(name, None, None, dict(attrs))
503+ attr_dict = {}
504+ for key, value in attrs:
505+ # Change None attribute values to the empty string
506+ # for consistency with the other tree builders.
507+ if value is None:
508+ value = ''
509+ attr_dict[key] = value
510+ attrvalue = '""'
511+ self.soup.handle_starttag(name, None, None, attr_dict)
512
513 def handle_endtag(self, name):
514 self.soup.handle_endtag(name)
515@@ -135,13 +143,14 @@
516 replaced with REPLACEMENT CHARACTER).
517 """
518 if isinstance(markup, unicode):
519- return markup, None, None, False
520+ yield (markup, None, None, False)
521+ return
522
523 try_encodings = [user_specified_encoding, document_declared_encoding]
524 dammit = UnicodeDammit(markup, try_encodings, is_html=True)
525- return (dammit.markup, dammit.original_encoding,
526- dammit.declared_html_encoding,
527- dammit.contains_replacement_characters)
528+ yield (dammit.markup, dammit.original_encoding,
529+ dammit.declared_html_encoding,
530+ dammit.contains_replacement_characters)
531
532 def feed(self, markup):
533 args, kwargs = self.parser_args
534
535=== modified file 'bs4/builder/_lxml.py'
536--- bs4/builder/_lxml.py 2013-08-09 18:39:43 +0000
537+++ bs4/builder/_lxml.py 2014-05-29 09:58:03 +0000
538@@ -13,9 +13,10 @@
539 HTML,
540 HTMLTreeBuilder,
541 PERMISSIVE,
542+ ParserRejectedMarkup,
543 TreeBuilder,
544 XML)
545-from bs4.dammit import UnicodeDammit
546+from bs4.dammit import EncodingDetector
547
548 LXML = 'lxml'
549
550@@ -33,22 +34,30 @@
551 # standard.
552 DEFAULT_NSMAPS = {'http://www.w3.org/XML/1998/namespace' : "xml"}
553
554- @property
555- def default_parser(self):
556+ def default_parser(self, encoding):
557 # This can either return a parser object or a class, which
558 # will be instantiated with default arguments.
559- return etree.XMLParser(target=self, strip_cdata=False, recover=True)
560+ if self._default_parser is not None:
561+ return self._default_parser
562+ return etree.XMLParser(
563+ target=self, strip_cdata=False, recover=True, encoding=encoding)
564+
565+ def parser_for(self, encoding):
566+ # Use the default parser.
567+ parser = self.default_parser(encoding)
568+
569+ if isinstance(parser, collections.Callable):
570+ # Instantiate the parser with default arguments
571+ parser = parser(target=self, strip_cdata=False, encoding=encoding)
572+ return parser
573
574 def __init__(self, parser=None, empty_element_tags=None):
575+ # TODO: Issue a warning if parser is present but not a
576+ # callable, since that means there's no way to create new
577+ # parsers for different encodings.
578+ self._default_parser = parser
579 if empty_element_tags is not None:
580 self.empty_element_tags = set(empty_element_tags)
581- if parser is None:
582- # Use the default parser.
583- parser = self.default_parser
584- if isinstance(parser, collections.Callable):
585- # Instantiate the parser with default arguments
586- parser = parser(target=self, strip_cdata=False)
587- self.parser = parser
588 self.soup = None
589 self.nsmaps = [self.DEFAULT_NSMAPS]
590
591@@ -63,33 +72,53 @@
592 def prepare_markup(self, markup, user_specified_encoding=None,
593 document_declared_encoding=None):
594 """
595- :return: A 3-tuple (markup, original encoding, encoding
596- declared within markup).
597+ :yield: A series of 4-tuples.
598+ (markup, encoding, declared encoding,
599+ has undergone character replacement)
600+
601+ Each 4-tuple represents a strategy for parsing the document.
602 """
603 if isinstance(markup, unicode):
604- return markup, None, None, False
605-
606+ # We were given Unicode. Maybe lxml can parse Unicode on
607+ # this system?
608+ yield markup, None, document_declared_encoding, False
609+
610+ if isinstance(markup, unicode):
611+ # No, apparently not. Convert the Unicode to UTF-8 and
612+ # tell lxml to parse it as UTF-8.
613+ yield (markup.encode("utf8"), "utf8",
614+ document_declared_encoding, False)
615+
616+ # Instead of using UnicodeDammit to convert the bytestring to
617+ # Unicode using different encodings, use EncodingDetector to
618+ # iterate over the encodings, and tell lxml to try to parse
619+ # the document as each one in turn.
620+ is_html = not self.is_xml
621 try_encodings = [user_specified_encoding, document_declared_encoding]
622- dammit = UnicodeDammit(markup, try_encodings, is_html=True)
623- return (dammit.markup, dammit.original_encoding,
624- dammit.declared_html_encoding,
625- dammit.contains_replacement_characters)
626+ detector = EncodingDetector(markup, try_encodings, is_html)
627+ for encoding in detector.encodings:
628+ yield (detector.markup, encoding, document_declared_encoding, False)
629
630 def feed(self, markup):
631 if isinstance(markup, bytes):
632 markup = BytesIO(markup)
633 elif isinstance(markup, unicode):
634 markup = StringIO(markup)
635+
636 # Call feed() at least once, even if the markup is empty,
637 # or the parser won't be initialized.
638 data = markup.read(self.CHUNK_SIZE)
639- self.parser.feed(data)
640- while data != '':
641- # Now call feed() on the rest of the data, chunk by chunk.
642- data = markup.read(self.CHUNK_SIZE)
643- if data != '':
644- self.parser.feed(data)
645- self.parser.close()
646+ try:
647+ self.parser = self.parser_for(self.soup.original_encoding)
648+ self.parser.feed(data)
649+ while len(data) != 0:
650+ # Now call feed() on the rest of the data, chunk by chunk.
651+ data = markup.read(self.CHUNK_SIZE)
652+ if len(data) != 0:
653+ self.parser.feed(data)
654+ self.parser.close()
655+ except (UnicodeDecodeError, LookupError, etree.ParserError), e:
656+ raise ParserRejectedMarkup(str(e))
657
658 def close(self):
659 self.nsmaps = [self.DEFAULT_NSMAPS]
660@@ -186,13 +215,18 @@
661 features = [LXML, HTML, FAST, PERMISSIVE]
662 is_xml = False
663
664- @property
665- def default_parser(self):
666+ def default_parser(self, encoding):
667 return etree.HTMLParser
668
669 def feed(self, markup):
670- self.parser.feed(markup)
671- self.parser.close()
672+ encoding = self.soup.original_encoding
673+ try:
674+ self.parser = self.parser_for(encoding)
675+ self.parser.feed(markup)
676+ self.parser.close()
677+ except (UnicodeDecodeError, LookupError, etree.ParserError), e:
678+ raise ParserRejectedMarkup(str(e))
679+
680
681 def test_fragment_to_document(self, fragment):
682 """See `TreeBuilder`."""
683
684=== modified file 'bs4/dammit.py'
685--- bs4/dammit.py 2013-08-09 18:39:43 +0000
686+++ bs4/dammit.py 2014-05-29 09:58:03 +0000
687@@ -1,16 +1,17 @@
688 # -*- coding: utf-8 -*-
689 """Beautiful Soup bonus library: Unicode, Dammit
690
691-This class forces XML data into a standard format (usually to UTF-8 or
692-Unicode). It is heavily based on code from Mark Pilgrim's Universal
693-Feed Parser. It does not rewrite the XML or HTML to reflect a new
694-encoding; that's the tree builder's job.
695+This library converts a bytestream to Unicode through any means
696+necessary. It is heavily based on code from Mark Pilgrim's Universal
697+Feed Parser. It works best on XML and XML, but it does not rewrite the
698+XML or HTML to reflect a new encoding; that's the tree builder's job.
699 """
700
701 import codecs
702 from htmlentitydefs import codepoint2name
703 import re
704 import logging
705+import string
706
707 # Import a library to autodetect character encodings.
708 chardet_type = None
709@@ -175,7 +176,6 @@
710 value = cls.quoted_attribute_value(value)
711 return value
712
713-
714 @classmethod
715 def substitute_html(cls, s):
716 """Replace certain Unicode characters with named HTML entities.
717@@ -192,6 +192,125 @@
718 cls._substitute_html_entity, s)
719
720
721+class EncodingDetector:
722+ """Suggests a number of possible encodings for a bytestring.
723+
724+ Order of precedence:
725+
726+ 1. Encodings you specifically tell EncodingDetector to try first
727+ (the override_encodings argument to the constructor).
728+
729+ 2. An encoding declared within the bytestring itself, either in an
730+ XML declaration (if the bytestring is to be interpreted as an XML
731+ document), or in a <meta> tag (if the bytestring is to be
732+ interpreted as an HTML document.)
733+
734+ 3. An encoding detected through textual analysis by chardet,
735+ cchardet, or a similar external library.
736+
737+ 4. UTF-8.
738+
739+ 5. Windows-1252.
740+ """
741+ def __init__(self, markup, override_encodings=None, is_html=False):
742+ self.override_encodings = override_encodings or []
743+ self.chardet_encoding = None
744+ self.is_html = is_html
745+ self.declared_encoding = None
746+
747+ # First order of business: strip a byte-order mark.
748+ self.markup, self.sniffed_encoding = self.strip_byte_order_mark(markup)
749+
750+ def _usable(self, encoding, tried):
751+ if encoding is not None:
752+ encoding = encoding.lower()
753+ if encoding not in tried:
754+ tried.add(encoding)
755+ return True
756+ return False
757+
758+ @property
759+ def encodings(self):
760+ """Yield a number of encodings that might work for this markup."""
761+ tried = set()
762+ for e in self.override_encodings:
763+ if self._usable(e, tried):
764+ yield e
765+
766+ # Did the document originally start with a byte-order mark
767+ # that indicated its encoding?
768+ if self._usable(self.sniffed_encoding, tried):
769+ yield self.sniffed_encoding
770+
771+ # Look within the document for an XML or HTML encoding
772+ # declaration.
773+ if self.declared_encoding is None:
774+ self.declared_encoding = self.find_declared_encoding(
775+ self.markup, self.is_html)
776+ if self._usable(self.declared_encoding, tried):
777+ yield self.declared_encoding
778+
779+ # Use third-party character set detection to guess at the
780+ # encoding.
781+ if self.chardet_encoding is None:
782+ self.chardet_encoding = chardet_dammit(self.markup)
783+ if self._usable(self.chardet_encoding, tried):
784+ yield self.chardet_encoding
785+
786+ # As a last-ditch effort, try utf-8 and windows-1252.
787+ for e in ('utf-8', 'windows-1252'):
788+ if self._usable(e, tried):
789+ yield e
790+
791+ @classmethod
792+ def strip_byte_order_mark(cls, data):
793+ """If a byte-order mark is present, strip it and return the encoding it implies."""
794+ encoding = None
795+ if (len(data) >= 4) and (data[:2] == b'\xfe\xff') \
796+ and (data[2:4] != '\x00\x00'):
797+ encoding = 'utf-16be'
798+ data = data[2:]
799+ elif (len(data) >= 4) and (data[:2] == b'\xff\xfe') \
800+ and (data[2:4] != '\x00\x00'):
801+ encoding = 'utf-16le'
802+ data = data[2:]
803+ elif data[:3] == b'\xef\xbb\xbf':
804+ encoding = 'utf-8'
805+ data = data[3:]
806+ elif data[:4] == b'\x00\x00\xfe\xff':
807+ encoding = 'utf-32be'
808+ data = data[4:]
809+ elif data[:4] == b'\xff\xfe\x00\x00':
810+ encoding = 'utf-32le'
811+ data = data[4:]
812+ return data, encoding
813+
814+ @classmethod
815+ def find_declared_encoding(cls, markup, is_html=False, search_entire_document=False):
816+ """Given a document, tries to find its declared encoding.
817+
818+ An XML encoding is declared at the beginning of the document.
819+
820+ An HTML encoding is declared in a <meta> tag, hopefully near the
821+ beginning of the document.
822+ """
823+ if search_entire_document:
824+ xml_endpos = html_endpos = len(markup)
825+ else:
826+ xml_endpos = 1024
827+ html_endpos = max(2048, int(len(markup) * 0.05))
828+
829+ declared_encoding = None
830+ declared_encoding_match = xml_encoding_re.search(markup, endpos=xml_endpos)
831+ if not declared_encoding_match and is_html:
832+ declared_encoding_match = html_meta_re.search(markup, endpos=html_endpos)
833+ if declared_encoding_match is not None:
834+ declared_encoding = declared_encoding_match.groups()[0].decode(
835+ 'ascii')
836+ if declared_encoding:
837+ return declared_encoding.lower()
838+ return None
839+
840 class UnicodeDammit:
841 """A class for detecting the encoding of a *ML document and
842 converting it to a Unicode string. If the source encoding is
843@@ -213,55 +332,38 @@
844
845 def __init__(self, markup, override_encodings=[],
846 smart_quotes_to=None, is_html=False):
847- self.declared_html_encoding = None
848 self.smart_quotes_to = smart_quotes_to
849 self.tried_encodings = []
850 self.contains_replacement_characters = False
851-
852- if markup == '' or isinstance(markup, unicode):
853+ self.is_html = is_html
854+
855+ self.detector = EncodingDetector(markup, override_encodings, is_html)
856+
857+ # Short-circuit if the data is in Unicode to begin with.
858+ if isinstance(markup, unicode) or markup == '':
859 self.markup = markup
860 self.unicode_markup = unicode(markup)
861 self.original_encoding = None
862 return
863
864- new_markup, document_encoding, sniffed_encoding = \
865- self._detectEncoding(markup, is_html)
866- self.markup = new_markup
867+ # The encoding detector may have stripped a byte-order mark.
868+ # Use the stripped markup from this point on.
869+ self.markup = self.detector.markup
870
871 u = None
872- if new_markup != markup:
873- # _detectEncoding modified the markup, then converted it to
874- # Unicode and then to UTF-8. So convert it from UTF-8.
875- u = self._convert_from("utf8")
876- self.original_encoding = sniffed_encoding
877-
878- if not u:
879- for proposed_encoding in (
880- override_encodings + [document_encoding, sniffed_encoding]):
881- if proposed_encoding is not None:
882- u = self._convert_from(proposed_encoding)
883- if u:
884- break
885-
886- # If no luck and we have auto-detection library, try that:
887- if not u and not isinstance(self.markup, unicode):
888- u = self._convert_from(chardet_dammit(self.markup))
889-
890- # As a last resort, try utf-8 and windows-1252:
891- if not u:
892- for proposed_encoding in ("utf-8", "windows-1252"):
893- u = self._convert_from(proposed_encoding)
894- if u:
895- break
896-
897- # As an absolute last resort, try the encodings again with
898- # character replacement.
899- if not u:
900- for proposed_encoding in (
901- override_encodings + [
902- document_encoding, sniffed_encoding, "utf-8", "windows-1252"]):
903- if proposed_encoding != "ascii":
904- u = self._convert_from(proposed_encoding, "replace")
905+ for encoding in self.detector.encodings:
906+ markup = self.detector.markup
907+ u = self._convert_from(encoding)
908+ if u is not None:
909+ break
910+
911+ if not u:
912+ # None of the encodings worked. As an absolute last resort,
913+ # try them again with character replacement.
914+
915+ for encoding in self.detector.encodings:
916+ if encoding != "ascii":
917+ u = self._convert_from(encoding, "replace")
918 if u is not None:
919 logging.warning(
920 "Some characters could not be decoded, and were "
921@@ -269,8 +371,9 @@
922 self.contains_replacement_characters = True
923 break
924
925- # We could at this point force it to ASCII, but that would
926- # destroy so much data that I think giving up is better
927+ # If none of that worked, we could at this point force it to
928+ # ASCII, but that would destroy so much data that I think
929+ # giving up is better.
930 self.unicode_markup = u
931 if not u:
932 self.original_encoding = None
933@@ -301,7 +404,7 @@
934 # Convert smart quotes to HTML if coming from an encoding
935 # that might have them.
936 if (self.smart_quotes_to is not None
937- and proposed.lower() in self.ENCODINGS_WITH_SMART_QUOTES):
938+ and proposed in self.ENCODINGS_WITH_SMART_QUOTES):
939 smart_quotes_re = b"([\x80-\x9f])"
940 smart_quotes_compiled = re.compile(smart_quotes_re)
941 markup = smart_quotes_compiled.sub(self._sub_ms_char, markup)
942@@ -322,99 +425,24 @@
943 def _to_unicode(self, data, encoding, errors="strict"):
944 '''Given a string and its encoding, decodes the string into Unicode.
945 %encoding is a string recognized by encodings.aliases'''
946-
947- # strip Byte Order Mark (if present)
948- if (len(data) >= 4) and (data[:2] == '\xfe\xff') \
949- and (data[2:4] != '\x00\x00'):
950- encoding = 'utf-16be'
951- data = data[2:]
952- elif (len(data) >= 4) and (data[:2] == '\xff\xfe') \
953- and (data[2:4] != '\x00\x00'):
954- encoding = 'utf-16le'
955- data = data[2:]
956- elif data[:3] == '\xef\xbb\xbf':
957- encoding = 'utf-8'
958- data = data[3:]
959- elif data[:4] == '\x00\x00\xfe\xff':
960- encoding = 'utf-32be'
961- data = data[4:]
962- elif data[:4] == '\xff\xfe\x00\x00':
963- encoding = 'utf-32le'
964- data = data[4:]
965- newdata = unicode(data, encoding, errors)
966- return newdata
967-
968- def _detectEncoding(self, xml_data, is_html=False):
969- """Given a document, tries to detect its XML encoding."""
970- xml_encoding = sniffed_xml_encoding = None
971- try:
972- if xml_data[:4] == b'\x4c\x6f\xa7\x94':
973- # EBCDIC
974- xml_data = self._ebcdic_to_ascii(xml_data)
975- elif xml_data[:4] == b'\x00\x3c\x00\x3f':
976- # UTF-16BE
977- sniffed_xml_encoding = 'utf-16be'
978- xml_data = unicode(xml_data, 'utf-16be').encode('utf-8')
979- elif (len(xml_data) >= 4) and (xml_data[:2] == b'\xfe\xff') \
980- and (xml_data[2:4] != b'\x00\x00'):
981- # UTF-16BE with BOM
982- sniffed_xml_encoding = 'utf-16be'
983- xml_data = unicode(xml_data[2:], 'utf-16be').encode('utf-8')
984- elif xml_data[:4] == b'\x3c\x00\x3f\x00':
985- # UTF-16LE
986- sniffed_xml_encoding = 'utf-16le'
987- xml_data = unicode(xml_data, 'utf-16le').encode('utf-8')
988- elif (len(xml_data) >= 4) and (xml_data[:2] == b'\xff\xfe') and \
989- (xml_data[2:4] != b'\x00\x00'):
990- # UTF-16LE with BOM
991- sniffed_xml_encoding = 'utf-16le'
992- xml_data = unicode(xml_data[2:], 'utf-16le').encode('utf-8')
993- elif xml_data[:4] == b'\x00\x00\x00\x3c':
994- # UTF-32BE
995- sniffed_xml_encoding = 'utf-32be'
996- xml_data = unicode(xml_data, 'utf-32be').encode('utf-8')
997- elif xml_data[:4] == b'\x3c\x00\x00\x00':
998- # UTF-32LE
999- sniffed_xml_encoding = 'utf-32le'
1000- xml_data = unicode(xml_data, 'utf-32le').encode('utf-8')
1001- elif xml_data[:4] == b'\x00\x00\xfe\xff':
1002- # UTF-32BE with BOM
1003- sniffed_xml_encoding = 'utf-32be'
1004- xml_data = unicode(xml_data[4:], 'utf-32be').encode('utf-8')
1005- elif xml_data[:4] == b'\xff\xfe\x00\x00':
1006- # UTF-32LE with BOM
1007- sniffed_xml_encoding = 'utf-32le'
1008- xml_data = unicode(xml_data[4:], 'utf-32le').encode('utf-8')
1009- elif xml_data[:3] == b'\xef\xbb\xbf':
1010- # UTF-8 with BOM
1011- sniffed_xml_encoding = 'utf-8'
1012- xml_data = unicode(xml_data[3:], 'utf-8').encode('utf-8')
1013- else:
1014- sniffed_xml_encoding = 'ascii'
1015- pass
1016- except:
1017- xml_encoding_match = None
1018- xml_encoding_match = xml_encoding_re.match(xml_data)
1019- if not xml_encoding_match and is_html:
1020- xml_encoding_match = html_meta_re.search(xml_data)
1021- if xml_encoding_match is not None:
1022- xml_encoding = xml_encoding_match.groups()[0].decode(
1023- 'ascii').lower()
1024- if is_html:
1025- self.declared_html_encoding = xml_encoding
1026- if sniffed_xml_encoding and \
1027- (xml_encoding in ('iso-10646-ucs-2', 'ucs-2', 'csunicode',
1028- 'iso-10646-ucs-4', 'ucs-4', 'csucs4',
1029- 'utf-16', 'utf-32', 'utf_16', 'utf_32',
1030- 'utf16', 'u16')):
1031- xml_encoding = sniffed_xml_encoding
1032- return xml_data, xml_encoding, sniffed_xml_encoding
1033+ return unicode(data, encoding, errors)
1034+
1035+ @property
1036+ def declared_html_encoding(self):
1037+ if not self.is_html:
1038+ return None
1039+ return self.detector.declared_encoding
1040
1041 def find_codec(self, charset):
1042- return self._codec(self.CHARSET_ALIASES.get(charset, charset)) \
1043- or (charset and self._codec(charset.replace("-", ""))) \
1044- or (charset and self._codec(charset.replace("-", "_"))) \
1045+ value = (self._codec(self.CHARSET_ALIASES.get(charset, charset))
1046+ or (charset and self._codec(charset.replace("-", "")))
1047+ or (charset and self._codec(charset.replace("-", "_")))
1048+ or (charset and charset.lower())
1049 or charset
1050+ )
1051+ if value:
1052+ return value.lower()
1053+ return None
1054
1055 def _codec(self, charset):
1056 if not charset:
1057@@ -427,32 +455,6 @@
1058 pass
1059 return codec
1060
1061- EBCDIC_TO_ASCII_MAP = None
1062-
1063- def _ebcdic_to_ascii(self, s):
1064- c = self.__class__
1065- if not c.EBCDIC_TO_ASCII_MAP:
1066- emap = (0,1,2,3,156,9,134,127,151,141,142,11,12,13,14,15,
1067- 16,17,18,19,157,133,8,135,24,25,146,143,28,29,30,31,
1068- 128,129,130,131,132,10,23,27,136,137,138,139,140,5,6,7,
1069- 144,145,22,147,148,149,150,4,152,153,154,155,20,21,158,26,
1070- 32,160,161,162,163,164,165,166,167,168,91,46,60,40,43,33,
1071- 38,169,170,171,172,173,174,175,176,177,93,36,42,41,59,94,
1072- 45,47,178,179,180,181,182,183,184,185,124,44,37,95,62,63,
1073- 186,187,188,189,190,191,192,193,194,96,58,35,64,39,61,34,
1074- 195,97,98,99,100,101,102,103,104,105,196,197,198,199,200,
1075- 201,202,106,107,108,109,110,111,112,113,114,203,204,205,
1076- 206,207,208,209,126,115,116,117,118,119,120,121,122,210,
1077- 211,212,213,214,215,216,217,218,219,220,221,222,223,224,
1078- 225,226,227,228,229,230,231,123,65,66,67,68,69,70,71,72,
1079- 73,232,233,234,235,236,237,125,74,75,76,77,78,79,80,81,
1080- 82,238,239,240,241,242,243,92,159,83,84,85,86,87,88,89,
1081- 90,244,245,246,247,248,249,48,49,50,51,52,53,54,55,56,57,
1082- 250,251,252,253,254,255)
1083- import string
1084- c.EBCDIC_TO_ASCII_MAP = string.maketrans(
1085- ''.join(map(chr, list(range(256)))), ''.join(map(chr, emap)))
1086- return s.translate(c.EBCDIC_TO_ASCII_MAP)
1087
1088 # A partial mapping of ISO-Latin-1 to HTML entities/XML numeric entities.
1089 MS_CHARS = {b'\x80': ('euro', '20AC'),
1090
1091=== modified file 'bs4/diagnose.py'
1092--- bs4/diagnose.py 2013-08-09 18:39:43 +0000
1093+++ bs4/diagnose.py 2014-05-29 09:58:03 +0000
1094@@ -1,10 +1,15 @@
1095 """Diagnostic functions, mainly for use when doing tech support."""
1096+import cProfile
1097 from StringIO import StringIO
1098 from HTMLParser import HTMLParser
1099+import bs4
1100 from bs4 import BeautifulSoup, __version__
1101 from bs4.builder import builder_registry
1102+
1103 import os
1104+import pstats
1105 import random
1106+import tempfile
1107 import time
1108 import traceback
1109 import sys
1110@@ -61,14 +66,14 @@
1111
1112 print "-" * 80
1113
1114-def lxml_trace(data, html=True):
1115+def lxml_trace(data, html=True, **kwargs):
1116 """Print out the lxml events that occur during parsing.
1117
1118 This lets you see how lxml parses a document when no Beautiful
1119 Soup code is running.
1120 """
1121 from lxml import etree
1122- for event, element in etree.iterparse(StringIO(data), html=html):
1123+ for event, element in etree.iterparse(StringIO(data), html=html, **kwargs):
1124 print("%s, %4s, %s" % (event, element.tag, element.text))
1125
1126 class AnnouncingParser(HTMLParser):
1127@@ -174,5 +179,26 @@
1128 b = time.time()
1129 print "Raw lxml parsed the markup in %.2fs." % (b-a)
1130
1131+ import html5lib
1132+ parser = html5lib.HTMLParser()
1133+ a = time.time()
1134+ parser.parse(data)
1135+ b = time.time()
1136+ print "Raw html5lib parsed the markup in %.2fs." % (b-a)
1137+
1138+def profile(num_elements=100000, parser="lxml"):
1139+
1140+ filehandle = tempfile.NamedTemporaryFile()
1141+ filename = filehandle.name
1142+
1143+ data = rdoc(num_elements)
1144+ vars = dict(bs4=bs4, data=data, parser=parser)
1145+ cProfile.runctx('bs4.BeautifulSoup(data, parser)' , vars, vars, filename)
1146+
1147+ stats = pstats.Stats(filename)
1148+ # stats.strip_dirs()
1149+ stats.sort_stats("cumulative")
1150+ stats.print_stats('_html5lib|bs4', 50)
1151+
1152 if __name__ == '__main__':
1153 diagnose(sys.stdin.read())
1154
1155=== modified file 'bs4/element.py'
1156--- bs4/element.py 2013-05-25 21:27:22 +0000
1157+++ bs4/element.py 2014-05-29 09:58:03 +0000
1158@@ -255,11 +255,16 @@
1159 self.previous_sibling = self.next_sibling = None
1160 return self
1161
1162- def _last_descendant(self):
1163+ def _last_descendant(self, is_initialized=True, accept_self=True):
1164 "Finds the last element beneath this object to be parsed."
1165- last_child = self
1166- while hasattr(last_child, 'contents') and last_child.contents:
1167- last_child = last_child.contents[-1]
1168+ if is_initialized and self.next_sibling:
1169+ last_child = self.next_sibling.previous_element
1170+ else:
1171+ last_child = self
1172+ while isinstance(last_child, Tag) and last_child.contents:
1173+ last_child = last_child.contents[-1]
1174+ if not accept_self and last_child == self:
1175+ last_child = None
1176 return last_child
1177 # BS3: Not part of the API!
1178 _lastRecursiveChild = _last_descendant
1179@@ -294,11 +299,11 @@
1180 previous_child = self.contents[position - 1]
1181 new_child.previous_sibling = previous_child
1182 new_child.previous_sibling.next_sibling = new_child
1183- new_child.previous_element = previous_child._last_descendant()
1184+ new_child.previous_element = previous_child._last_descendant(False)
1185 if new_child.previous_element is not None:
1186 new_child.previous_element.next_element = new_child
1187
1188- new_childs_last_element = new_child._last_descendant()
1189+ new_childs_last_element = new_child._last_descendant(False)
1190
1191 if position >= len(self.contents):
1192 new_child.next_sibling = None
1193@@ -475,20 +480,21 @@
1194
1195 if isinstance(name, SoupStrainer):
1196 strainer = name
1197- elif text is None and not limit and not attrs and not kwargs:
1198- # Optimization to find all tags.
1199+ else:
1200+ strainer = SoupStrainer(name, attrs, text, **kwargs)
1201+
1202+ if text is None and not limit and not attrs and not kwargs:
1203 if name is True or name is None:
1204- return [element for element in generator
1205- if isinstance(element, Tag)]
1206- # Optimization to find all tags with a given name.
1207+ # Optimization to find all tags.
1208+ result = (element for element in generator
1209+ if isinstance(element, Tag))
1210+ return ResultSet(strainer, result)
1211 elif isinstance(name, basestring):
1212- return [element for element in generator
1213- if isinstance(element, Tag) and element.name == name]
1214- else:
1215- strainer = SoupStrainer(name, attrs, text, **kwargs)
1216- else:
1217- # Build a SoupStrainer
1218- strainer = SoupStrainer(name, attrs, text, **kwargs)
1219+ # Optimization to find all tags with a given name.
1220+ result = (element for element in generator
1221+ if isinstance(element, Tag)
1222+ and element.name == name)
1223+ return ResultSet(strainer, result)
1224 results = ResultSet(strainer)
1225 while True:
1226 try:
1227@@ -672,6 +678,13 @@
1228 output = self.format_string(self, formatter)
1229 return self.PREFIX + output + self.SUFFIX
1230
1231+ @property
1232+ def name(self):
1233+ return None
1234+
1235+ @name.setter
1236+ def name(self, name):
1237+ raise AttributeError("A NavigableString cannot be given a name.")
1238
1239 class PreformattedString(NavigableString):
1240 """A NavigableString not subject to the normal formatting rules.
1241@@ -746,7 +759,7 @@
1242 self.prefix = prefix
1243 if attrs is None:
1244 attrs = {}
1245- elif builder.cdata_list_attributes:
1246+ elif attrs and builder.cdata_list_attributes:
1247 attrs = builder._replace_cdata_list_attribute_values(
1248 self.name, attrs)
1249 else:
1250@@ -1593,6 +1606,6 @@
1251 class ResultSet(list):
1252 """A ResultSet is just a list that keeps track of the SoupStrainer
1253 that created it."""
1254- def __init__(self, source):
1255- list.__init__([])
1256+ def __init__(self, source, result=()):
1257+ super(ResultSet, self).__init__(result)
1258 self.source = source
1259
1260=== modified file 'bs4/testing.py'
1261--- bs4/testing.py 2013-08-09 18:39:43 +0000
1262+++ bs4/testing.py 2014-05-29 09:58:03 +0000
1263@@ -281,6 +281,14 @@
1264 # to detect any differences between them.
1265 #
1266
1267+ def test_can_parse_unicode_document(self):
1268+ # A seemingly innocuous document... but it's in Unicode! And
1269+ # it contains characters that can't be represented in the
1270+ # encoding found in the declaration! The horror!
1271+ markup = u'<html><head><meta encoding="euc-jp"></head><body>Sacr\N{LATIN SMALL LETTER E WITH ACUTE} bleu!</body>'
1272+ soup = self.soup(markup)
1273+ self.assertEqual(u'Sacr\xe9 bleu!', soup.body.string)
1274+
1275 def test_soupstrainer(self):
1276 """Parsers should be able to work with SoupStrainers."""
1277 strainer = SoupStrainer("b")
1278@@ -484,6 +492,11 @@
1279 encoded = soup.encode()
1280 self.assertTrue(b"&lt; &lt; hey &gt; &gt;" in encoded)
1281
1282+ def test_can_parse_unicode_document(self):
1283+ markup = u'<?xml version="1.0" encoding="euc-jp"><root>Sacr\N{LATIN SMALL LETTER E WITH ACUTE} bleu!</root>'
1284+ soup = self.soup(markup)
1285+ self.assertEqual(u'Sacr\xe9 bleu!', soup.root.string)
1286+
1287 def test_popping_namespaced_tag(self):
1288 markup = '<rss xmlns:dc="foo"><dc:creator>b</dc:creator><dc:date>2012-07-02T20:33:42Z</dc:date><dc:rights>c</dc:rights><image>d</image></rss>'
1289 soup = self.soup(markup)
1290
1291=== modified file 'bs4/tests/test_html5lib.py'
1292--- bs4/tests/test_html5lib.py 2013-08-09 18:39:43 +0000
1293+++ bs4/tests/test_html5lib.py 2014-05-29 09:58:03 +0000
1294@@ -70,3 +70,16 @@
1295 soup = self.soup(markup)
1296 # Verify that we can reach the <p> tag; this means the tree is connected.
1297 self.assertEqual(b"<p>foo</p>", soup.p.encode())
1298+
1299+ def test_reparented_markup(self):
1300+ markup = '<p><em>foo</p>\n<p>bar<a></a></em></p>'
1301+ soup = self.soup(markup)
1302+ self.assertEqual(u"<body><p><em>foo</em></p><em>\n</em><p><em>bar<a></a></em></p></body>", soup.body.decode())
1303+ self.assertEqual(2, len(soup.find_all('p')))
1304+
1305+
1306+ def test_reparented_markup_ends_with_whitespace(self):
1307+ markup = '<p><em>foo</p>\n<p>bar<a></a></em></p>\n'
1308+ soup = self.soup(markup)
1309+ self.assertEqual(u"<body><p><em>foo</em></p><em>\n</em><p><em>bar<a></a></em></p>\n</body>", soup.body.decode())
1310+ self.assertEqual(2, len(soup.find_all('p')))
1311
1312=== modified file 'bs4/tests/test_lxml.py'
1313--- bs4/tests/test_lxml.py 2013-08-09 18:39:43 +0000
1314+++ bs4/tests/test_lxml.py 2014-05-29 09:58:03 +0000
1315@@ -4,14 +4,16 @@
1316 import warnings
1317
1318 try:
1319- from bs4.builder import LXMLTreeBuilder, LXMLTreeBuilderForXML
1320+ import lxml.etree
1321 LXML_PRESENT = True
1322- import lxml.etree
1323 LXML_VERSION = lxml.etree.LXML_VERSION
1324 except ImportError, e:
1325 LXML_PRESENT = False
1326 LXML_VERSION = (0,)
1327
1328+if LXML_PRESENT:
1329+ from bs4.builder import LXMLTreeBuilder, LXMLTreeBuilderForXML
1330+
1331 from bs4 import (
1332 BeautifulSoup,
1333 BeautifulStoneSoup,
1334@@ -58,9 +60,10 @@
1335 def test_beautifulstonesoup_is_xml_parser(self):
1336 # Make sure that the deprecated BSS class uses an xml builder
1337 # if one is installed.
1338- with warnings.catch_warnings(record=False) as w:
1339+ with warnings.catch_warnings(record=True) as w:
1340 soup = BeautifulStoneSoup("<b />")
1341- self.assertEqual(u"<b/>", unicode(soup.b))
1342+ self.assertEqual(u"<b/>", unicode(soup.b))
1343+ self.assertTrue("BeautifulStoneSoup class is deprecated" in str(w[0].message))
1344
1345 def test_real_xhtml_document(self):
1346 """lxml strips the XML definition from an XHTML doc, which is fine."""
1347
1348=== modified file 'bs4/tests/test_soup.py'
1349--- bs4/tests/test_soup.py 2013-08-09 18:39:43 +0000
1350+++ bs4/tests/test_soup.py 2014-05-29 09:58:03 +0000
1351@@ -4,6 +4,8 @@
1352 import logging
1353 import unittest
1354 import sys
1355+import tempfile
1356+
1357 from bs4 import (
1358 BeautifulSoup,
1359 BeautifulStoneSoup,
1360@@ -15,7 +17,10 @@
1361 NamespacedAttribute,
1362 )
1363 import bs4.dammit
1364-from bs4.dammit import EntitySubstitution, UnicodeDammit
1365+from bs4.dammit import (
1366+ EntitySubstitution,
1367+ UnicodeDammit,
1368+)
1369 from bs4.testing import (
1370 SoupTest,
1371 skipIf,
1372@@ -31,6 +36,19 @@
1373 PYTHON_2_PRE_2_7 = (sys.version_info < (2,7))
1374 PYTHON_3_PRE_3_2 = (sys.version_info[0] == 3 and sys.version_info < (3,2))
1375
1376+class TestConstructor(SoupTest):
1377+
1378+ def test_short_unicode_input(self):
1379+ data = u"<h1>éé</h1>"
1380+ soup = self.soup(data)
1381+ self.assertEqual(u"éé", soup.h1.string)
1382+
1383+ def test_embedded_null(self):
1384+ data = u"<h1>foo\0bar</h1>"
1385+ soup = self.soup(data)
1386+ self.assertEqual(u"foo\0bar", soup.h1.string)
1387+
1388+
1389 class TestDeprecatedConstructorArguments(SoupTest):
1390
1391 def test_parseOnlyThese_renamed_to_parse_only(self):
1392@@ -54,14 +72,33 @@
1393 self.assertRaises(
1394 TypeError, self.soup, "<a>", no_such_argument=True)
1395
1396- @skipIf(
1397- not LXML_PRESENT,
1398- "lxml not present, not testing BeautifulStoneSoup.")
1399- def test_beautifulstonesoup(self):
1400- with warnings.catch_warnings(record=True) as w:
1401- soup = BeautifulStoneSoup("<markup>")
1402- self.assertTrue(isinstance(soup, BeautifulSoup))
1403- self.assertTrue("BeautifulStoneSoup class is deprecated")
1404+class TestWarnings(SoupTest):
1405+
1406+ def test_disk_file_warning(self):
1407+ filehandle = tempfile.NamedTemporaryFile()
1408+ filename = filehandle.name
1409+ try:
1410+ with warnings.catch_warnings(record=True) as w:
1411+ soup = self.soup(filename)
1412+ msg = str(w[0].message)
1413+ self.assertTrue("looks like a filename" in msg)
1414+ finally:
1415+ filehandle.close()
1416+
1417+ # The file no longer exists, so Beautiful Soup will no longer issue the warning.
1418+ with warnings.catch_warnings(record=True) as w:
1419+ soup = self.soup(filename)
1420+ self.assertEqual(0, len(w))
1421+
1422+ def test_url_warning(self):
1423+ with warnings.catch_warnings(record=True) as w:
1424+ soup = self.soup("http://www.crummy.com/")
1425+ msg = str(w[0].message)
1426+ self.assertTrue("looks like a URL" in msg)
1427+
1428+ with warnings.catch_warnings(record=True) as w:
1429+ soup = self.soup("http://www.crummy.com/ is great")
1430+ self.assertEqual(0, len(w))
1431
1432 class TestSelectiveParsing(SoupTest):
1433
1434@@ -156,13 +193,23 @@
1435
1436 def test_ascii_in_unicode_out(self):
1437 # ASCII input is converted to Unicode. The original_encoding
1438- # attribute is set.
1439- ascii = b"<foo>a</foo>"
1440- soup_from_ascii = self.soup(ascii)
1441- unicode_output = soup_from_ascii.decode()
1442- self.assertTrue(isinstance(unicode_output, unicode))
1443- self.assertEqual(unicode_output, self.document_for(ascii.decode()))
1444- self.assertEqual(soup_from_ascii.original_encoding.lower(), "ascii")
1445+ # attribute is set to 'utf-8', a superset of ASCII.
1446+ chardet = bs4.dammit.chardet_dammit
1447+ logging.disable(logging.WARNING)
1448+ try:
1449+ def noop(str):
1450+ return None
1451+ # Disable chardet, which will realize that the ASCII is ASCII.
1452+ bs4.dammit.chardet_dammit = noop
1453+ ascii = b"<foo>a</foo>"
1454+ soup_from_ascii = self.soup(ascii)
1455+ unicode_output = soup_from_ascii.decode()
1456+ self.assertTrue(isinstance(unicode_output, unicode))
1457+ self.assertEqual(unicode_output, self.document_for(ascii.decode()))
1458+ self.assertEqual(soup_from_ascii.original_encoding.lower(), "utf-8")
1459+ finally:
1460+ logging.disable(logging.NOTSET)
1461+ bs4.dammit.chardet_dammit = chardet
1462
1463 def test_unicode_in_unicode_out(self):
1464 # Unicode input is left alone. The original_encoding attribute
1465@@ -192,7 +239,12 @@
1466 self.assertEqual(self.soup(markup).div.encode("utf8"), markup.encode("utf8"))
1467
1468 class TestUnicodeDammit(unittest.TestCase):
1469- """Standalone tests of Unicode, Dammit."""
1470+ """Standalone tests of UnicodeDammit."""
1471+
1472+ def test_unicode_input(self):
1473+ markup = u"I'm already Unicode! \N{SNOWMAN}"
1474+ dammit = UnicodeDammit(markup)
1475+ self.assertEqual(dammit.unicode_markup, markup)
1476
1477 def test_smart_quotes_to_unicode(self):
1478 markup = b"<foo>\x91\x92\x93\x94</foo>"
1479@@ -293,9 +345,8 @@
1480 logging.disable(logging.NOTSET)
1481 bs4.dammit.chardet_dammit = chardet
1482
1483- def test_sniffed_xml_encoding(self):
1484- # A document written in UTF-16LE will be converted by a different
1485- # code path that sniffs the byte order markers.
1486+ def test_byte_order_mark_removed(self):
1487+ # A document written in UTF-16LE will have its byte order marker stripped.
1488 data = b'\xff\xfe<\x00a\x00>\x00\xe1\x00\xe9\x00<\x00/\x00a\x00>\x00'
1489 dammit = UnicodeDammit(data)
1490 self.assertEqual(u"<a>áé</a>", dammit.unicode_markup)
1491
1492=== modified file 'bs4/tests/test_tree.py'
1493--- bs4/tests/test_tree.py 2013-08-09 18:39:43 +0000
1494+++ bs4/tests/test_tree.py 2014-05-29 09:58:03 +0000
1495@@ -70,6 +70,16 @@
1496 soup = self.soup(u'<h1>Räksmörgås</h1>')
1497 self.assertEqual(soup.find(text=u'Räksmörgås'), u'Räksmörgås')
1498
1499+ def test_find_everything(self):
1500+ """Test an optimization that finds all tags."""
1501+ soup = self.soup("<a>foo</a><b>bar</b>")
1502+ self.assertEqual(2, len(soup.find_all()))
1503+
1504+ def test_find_everything_with_name(self):
1505+ """Test an optimization that finds all tags with a given name."""
1506+ soup = self.soup("<a>foo</a><b>bar</b><a>baz</a>")
1507+ self.assertEqual(2, len(soup.find_all('a')))
1508+
1509 class TestFindAll(TreeTest):
1510 """Basic tests of the find_all() method."""
1511
1512@@ -115,6 +125,19 @@
1513 # recursion.
1514 self.assertEqual([], soup.find_all(l))
1515
1516+ def test_find_all_resultset(self):
1517+ """All find_all calls return a ResultSet"""
1518+ soup = self.soup("<a></a>")
1519+ result = soup.find_all("a")
1520+ self.assertTrue(hasattr(result, "source"))
1521+
1522+ result = soup.find_all(True)
1523+ self.assertTrue(hasattr(result, "source"))
1524+
1525+ result = soup.find_all(text="foo")
1526+ self.assertTrue(hasattr(result, "source"))
1527+
1528+
1529 class TestFindAllBasicNamespaces(TreeTest):
1530
1531 def test_find_by_namespaced_name(self):
1532@@ -1219,6 +1242,12 @@
1533 # attribute for any other tag.
1534 self.assertEqual('ISO-8859-1 UTF-8', soup.a['accept-charset'])
1535
1536+ def test_string_has_immutable_name_property(self):
1537+ string = self.soup("s").string
1538+ self.assertEqual(None, string.name)
1539+ def t():
1540+ string.name = 'foo'
1541+ self.assertRaises(AttributeError, t)
1542
1543 class TestPersistence(SoupTest):
1544 "Testing features like pickle and deepcopy."
1545
1546=== modified file 'debian/changelog'
1547--- debian/changelog 2014-02-23 13:46:15 +0000
1548+++ debian/changelog 2014-05-29 09:58:03 +0000
1549@@ -1,3 +1,19 @@
1550+beautifulsoup4 (4.3.2-1ubuntu1) utopic; urgency=medium
1551+
1552+ * Merge from debian. Remaining changes:
1553+ - debian/control, debian/rules: Disable pypy-bs4 and Build-Depends on
1554+ pypy since the latter is in universe, while beautifulsoup4 is being
1555+ pulled into main via webtest.
1556+
1557+ -- Jackson Doak <noskcaj@ubuntu.com> Thu, 29 May 2014 19:50:43 +1000
1558+
1559+beautifulsoup4 (4.3.2-1) unstable; urgency=low
1560+
1561+ * New upstream release.
1562+ * Bump Standards-Version to 3.9.5, no changes needed.
1563+
1564+ -- Stefano Rivera <stefanor@debian.org> Sat, 03 May 2014 14:19:04 +0200
1565+
1566 beautifulsoup4 (4.2.1-1ubuntu2) trusty; urgency=medium
1567
1568 * Rebuild to drop files installed into /usr/share/pyshared.
1569
1570=== modified file 'debian/control'
1571--- debian/control 2013-11-15 09:56:34 +0000
1572+++ debian/control 2014-05-29 09:58:03 +0000
1573@@ -15,7 +15,7 @@
1574 python3-lxml,
1575 python3-pkg-resources
1576 X-Python-Version: >= 2.6
1577-Standards-Version: 3.9.4
1578+Standards-Version: 3.9.5
1579 Homepage: http://www.crummy.com/software/BeautifulSoup
1580 Vcs-Svn: svn://anonscm.debian.org/python-modules/packages/beautifulsoup4/trunk/
1581 Vcs-Browser: http://anonscm.debian.org/viewvc/python-modules/packages/beautifulsoup4/trunk/
1582
1583=== modified file 'debian/copyright'
1584--- debian/copyright 2013-05-25 21:27:22 +0000
1585+++ debian/copyright 2014-05-29 09:58:03 +0000
1586@@ -18,7 +18,7 @@
1587 Files: debian/*
1588 Copyright:
1589 2005-2009, Decklin Foster <decklin@red-bean.com>
1590- 2011-2013, Stefano Rivera <stefanor@debian.org>
1591+ 2011-2014, Stefano Rivera <stefanor@debian.org>
1592 License: Expatish
1593
1594 License: Expatish
1595
1596=== modified file 'doc/source/index.rst'
1597--- doc/source/index.rst 2013-08-09 18:39:43 +0000
1598+++ doc/source/index.rst 2014-05-29 09:58:03 +0000
1599@@ -26,6 +26,10 @@
1600 projects. If you want to learn about the differences between Beautiful
1601 Soup 3 and Beautiful Soup 4, see `Porting code to BS4`_.
1602
1603+This documentation has been translated into other languages by its users.
1604+
1605+* 이 문서는 한국어 번역도 가능합니다. (`외부 링크 <http://coreapython.hosting.paran.com/etc/beautifulsoup4.html>`_)
1606+
1607 Getting help
1608 ------------
1609
1610@@ -1209,8 +1213,8 @@
1611 You can filter an attribute based on `a string`_, `a regular
1612 expression`_, `a list`_, `a function`_, or `the value True`_.
1613
1614-This code finds all tags that have an ``id`` attribute, regardless of
1615-what the value is::
1616+This code finds all tags whose ``id`` attribute has a value,
1617+regardless of what the value is::
1618
1619 soup.find_all(id=True)
1620 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
1621@@ -2478,9 +2482,11 @@
1622 dammit.original_encoding
1623 # 'utf-8'
1624
1625-The more data you give Unicode, Dammit, the more accurately it will
1626-guess. If you have your own suspicions as to what the encoding might
1627-be, you can pass them in as a list::
1628+Unicode, Dammit's guesses will get a lot more accurate if you install
1629+the ``chardet`` or ``cchardet`` Python libraries. The more data you
1630+give Unicode, Dammit, the more accurately it will guess. If you have
1631+your own suspicions as to what the encoding might be, you can pass
1632+them in as a list::
1633
1634 dammit = UnicodeDammit("Sacr\xe9 bleu!", ["latin-1", "iso-8859-1"])
1635 print(dammit.unicode_markup)
1636@@ -2823,16 +2829,6 @@
1637 You can speed up encoding detection significantly by installing the
1638 `cchardet <http://pypi.python.org/pypi/cchardet/>`_ library.
1639
1640-Sometimes `Unicode, Dammit`_ can only detect the encoding of a file by
1641-doing a byte-by-byte examination of the file. This slows Beautiful
1642-Soup to a crawl. My tests indicate that this only happened on 2.x
1643-versions of Python, and that it happened most often with documents
1644-using Russian or Chinese encodings. If this is happening to you, you
1645-can fix it by installing cchardet, or by using Python 3 for your
1646-script. If you happen to know a document's encoding, you can pass
1647-it into the ``BeautifulSoup`` constructor as ``from_encoding``, and
1648-bypass encoding detection altogether.
1649-
1650 `Parsing only part of a document`_ won't save you much time parsing
1651 the document, but it can save a lot of memory, and it'll make
1652 `searching` the document much faster.
1653
1654=== modified file 'setup.py'
1655--- setup.py 2013-08-09 18:39:43 +0000
1656+++ setup.py 2014-05-29 09:58:03 +0000
1657@@ -7,7 +7,7 @@
1658 from distutils.command.build_py import build_py
1659
1660 setup(name="beautifulsoup4",
1661- version = "4.2.1",
1662+ version = "4.3.2",
1663 author="Leonard Richardson",
1664 author_email='leonardr@segfault.org',
1665 url="http://www.crummy.com/software/BeautifulSoup/bs4/",
