Merge lp:~mjumbewu/beautifulsoup/text-white-space-fix into lp:beautifulsoup

Proposed by Mjumbe Wawatu Ukweli
Status: Superseded
Proposed branch: lp:~mjumbewu/beautifulsoup/text-white-space-fix
Merge into: lp:beautifulsoup
Diff against target: 3203 lines (+3075/-72) (has conflicts)
9 files modified
AUTHORS (+0/-34)
BeautifulSoup.py (+2014/-0)
BeautifulSoupTests.py (+903/-0)
NEWS (+79/-0)
PKG-INFO (+19/-0)
docs/__init__.py (+0/-1)
setup.py (+60/-0)
tests/__init__.py (+0/-1)
tests/test_docs.py (+0/-36)
Path conflict: AUTHORS / <deleted>
Contents conflict in CHANGELOG
Path conflict: CHANGELOG / <deleted>
Contents conflict in README.txt
Path conflict: README.txt / <deleted>
Contents conflict in bs4/__init__.py
Path conflict: bs4/__init__.py / <deleted>
Conflict: can't delete bs4/builder because it is not empty.  Not deleting.
Path conflict: bs4/builder / <deleted>
Conflict because bs4/builder is not versioned, but has versioned children.  Versioned directory.
Contents conflict in bs4/builder/__init__.py
Contents conflict in bs4/builder/_lxml.py
Path conflict: bs4/builder/_lxml.py / <deleted>
Contents conflict in bs4/dammit.py
Path conflict: bs4/dammit.py / <deleted>
Contents conflict in bs4/element.py
Path conflict: bs4/element.py / <deleted>
Contents conflict in bs4/testing.py
Path conflict: bs4/testing.py / <deleted>
Path conflict: docs / <deleted>
Conflict: can't delete tests because it is not empty.  Not deleting.
Path conflict: tests / <deleted>
Conflict because tests is not versioned, but has versioned children.  Versioned directory.
Contents conflict in tests/test_lxml.py
Contents conflict in tests/test_soup.py
To merge this branch: bzr merge lp:~mjumbewu/beautifulsoup/text-white-space-fix
Reviewer: Leonard Richardson
Status: Pending
Review via email: mp+62619@code.launchpad.net

This proposal has been superseded by a proposal from 2011-05-27.

Description of the change

BeautifulSoup removes too much white space in getText. For example, the text of "<p>This is a <i>test</i>, ok?" should be "This is a test, ok?", but BS calculates it as "This is atest, ok?".

This fixes bug #788986.
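
A minimal reproduction, as a sketch (assuming the BeautifulSoup 3.x module from this branch is on the Python path; Python 2):

    from BeautifulSoup import BeautifulSoup

    soup = BeautifulSoup("<p>This is a <i>test</i>, ok?")
    # Reported result before this change: u'This is atest, ok?'
    # Expected result:                    u'This is a test, ok?'
    print soup.getText()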

Unmerged revisions

45. By Mjumbe Wawatu Ukweli

In getText, multiple white space characters get truncated to one.

44. By Mjumbe Wawatu Ukweli

Preserve spacing when using getText.

43. By Leonard Richardson

Revved version number.

42. By Leonard Richardson

When creating a Tag object, you can specify its attributes as a dict
rather than as a list of 2-tuples.

41. By Leonard Richardson

Fix a typo and prep for release.

40. By Leonard Richardson

Cleaned up tests.

39. By Leonard Richardson

Applied Aaron's fix for bug 493722.

38. By Leonard Richardson

Added a failing test for bug 493722.

37. By Leonard Richardson

Fixed whitespace.

36. By Leonard Richardson

Changed iterators not to block on empty strings. Restored the set code since 2.2 doesn't work on this code anyway.
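
Revisions 44 and 45 above describe the intended getText behaviour: the spacing between text nodes is preserved, and runs of white space are truncated to a single character. A hypothetical helper sketching that behaviour (an illustration only, not the branch's actual implementation, which is not visible in this conflicted diff):

    import re

    def join_text_nodes(strings, separator=u""):
        # Join the text nodes, then collapse each run of white space
        # to a single space (illustrative sketch).
        return re.sub(ur'\s+', u' ', separator.join(strings))

    print join_text_nodes([u"This is a ", u"test", u", ok?"])
    # u'This is a test, ok?'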
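
Revision 42 above lets Tag attributes be given as a dict. Going by the Tag constructor in the diff below (parser, name, attrs), a usage sketch; the URL is a placeholder:

    from BeautifulSoup import BeautifulSoup, Tag

    soup = BeautifulSoup("")
    # These two forms now build the same tag:
    link1 = Tag(soup, "a", [("href", "http://example.com/")])
    link2 = Tag(soup, "a", {"href": "http://example.com/"})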

Preview Diff

1=== removed file 'AUTHORS'
2--- AUTHORS 2011-01-28 16:39:36 +0000
3+++ AUTHORS 1970-01-01 00:00:00 +0000
4@@ -1,34 +0,0 @@
5-Behold, mortal, the origins of Beautiful Soup...
6-================================================
7-
8-Leonard Richardson is the primary programmer.
9-
10-Sam Ruby helps with a lot of edge cases.
11-
12-Mark Pilgrim provided the encoding detection code that forms the base
13-of UnicodeDammit.
14-
15-Jonathan Ellis was awarded the prestigous Beau Potage D'Or for his
16-work in solving the nestable tags conundrum.
17-
18-The following people have contributed patches to Beautiful Soup:
19-
20- Istvan Albert, Andrew Lin, Anthony Baxter, Andrew Boyko, Tony Chang,
21- Zephyr Fang, Fuzzy, Roman Gaufman, Yoni Gilad, Richie Hindle, Peteris
22- Krumins, Kent Johnson, Ben Last, Robert Leftwich, Staffan Malmgren,
23- Ksenia Marasanova, JP Moins, Adam Monsen, John Nagle, "Jon", Ed
24- Oskiewicz, Greg Phillips, Giles Radford, Arthur Rudolph, Marko
25- Samastur, Jouni Seppänen, Alexander Schmolck, Andy Theyers, Glyn
26- Webster, Paul Wright, Danny Yoo
27-
28-The following people made suggestions or found bugs or found ways to
29-break Beautiful Soup:
30-
31- Hanno Böck, Matteo Bertini, Chris Curvey, Simon Cusack, Matt Ernst,
32- Michael Foord, Tom Harris, Bill de hOra, Donald Howes, Matt
33- Patterson, Scott Roberts, Steve Strassmann, Mike Williams, warchild
34- at redho dot com, Sami Kuisma, Carlos Rocha, Bob Hutchison, Joren Mc,
35- Michal Migurski, John Kleven, Tim Heaney, Tripp Lilley, Ed Summers,
36- Dennis Sutch, Chris Smith, Aaron Sweep^W Swartz, Stuart Turner, Greg
37- Edwards, Kevin J Kalupson, Nikos Kouremenos, Artur de Sousa Rocha,
38- Yichun Wei, Per Vognsen
39
40=== added file 'BeautifulSoup.py'
41--- BeautifulSoup.py 1970-01-01 00:00:00 +0000
42+++ BeautifulSoup.py 2011-05-27 07:52:31 +0000
43@@ -0,0 +1,2014 @@
44+"""Beautiful Soup
45+Elixir and Tonic
46+"The Screen-Scraper's Friend"
47+http://www.crummy.com/software/BeautifulSoup/
48+
49+Beautiful Soup parses a (possibly invalid) XML or HTML document into a
50+tree representation. It provides methods and Pythonic idioms that make
51+it easy to navigate, search, and modify the tree.
52+
53+A well-formed XML/HTML document yields a well-formed data
54+structure. An ill-formed XML/HTML document yields a correspondingly
55+ill-formed data structure. If your document is only locally
56+well-formed, you can use this library to find and process the
57+well-formed part of it.
58+
59+Beautiful Soup works with Python 2.2 and up. It has no external
60+dependencies, but you'll have more success at converting data to UTF-8
61+if you also install these three packages:
62+
63+* chardet, for auto-detecting character encodings
64+ http://chardet.feedparser.org/
65+* cjkcodecs and iconv_codec, which add more encodings to the ones supported
66+ by stock Python.
67+ http://cjkpython.i18n.org/
68+
69+Beautiful Soup defines classes for two main parsing strategies:
70+
71+ * BeautifulStoneSoup, for parsing XML, SGML, or your domain-specific
72+ language that kind of looks like XML.
73+
74+ * BeautifulSoup, for parsing run-of-the-mill HTML code, be it valid
75+ or invalid. This class has web browser-like heuristics for
76+ obtaining a sensible parse tree in the face of common HTML errors.
77+
78+Beautiful Soup also defines a class (UnicodeDammit) for autodetecting
79+the encoding of an HTML or XML document, and converting it to
80+Unicode. Much of this code is taken from Mark Pilgrim's Universal Feed Parser.
81+
82+For more than you ever wanted to know about Beautiful Soup, see the
83+documentation:
84+http://www.crummy.com/software/BeautifulSoup/documentation.html
85+
86+Here, have some legalese:
87+
88+Copyright (c) 2004-2010, Leonard Richardson
89+
90+All rights reserved.
91+
92+Redistribution and use in source and binary forms, with or without
93+modification, are permitted provided that the following conditions are
94+met:
95+
96+ * Redistributions of source code must retain the above copyright
97+ notice, this list of conditions and the following disclaimer.
98+
99+ * Redistributions in binary form must reproduce the above
100+ copyright notice, this list of conditions and the following
101+ disclaimer in the documentation and/or other materials provided
102+ with the distribution.
103+
104+ * Neither the name of the the Beautiful Soup Consortium and All
105+ Night Kosher Bakery nor the names of its contributors may be
106+ used to endorse or promote products derived from this software
107+ without specific prior written permission.
108+
109+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
110+"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
111+LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
112+A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
113+CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
114+EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
115+PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
116+PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
117+LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
118+NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
119+SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE, DAMMIT.
120+
121+"""
122+from __future__ import generators
123+
124+__author__ = "Leonard Richardson (leonardr@segfault.org)"
125+__version__ = "3.2.0"
126+__copyright__ = "Copyright (c) 2004-2010 Leonard Richardson"
127+__license__ = "New-style BSD"
128+
129+from sgmllib import SGMLParser, SGMLParseError
130+import codecs
131+import markupbase
132+import types
133+import re
134+import sgmllib
135+try:
136+ from htmlentitydefs import name2codepoint
137+except ImportError:
138+ name2codepoint = {}
139+try:
140+ set
141+except NameError:
142+ from sets import Set as set
143+
144+#These hacks make Beautiful Soup able to parse XML with namespaces
145+sgmllib.tagfind = re.compile('[a-zA-Z][-_.:a-zA-Z0-9]*')
146+markupbase._declname_match = re.compile(r'[a-zA-Z][-_.:a-zA-Z0-9]*\s*').match
147+
148+DEFAULT_OUTPUT_ENCODING = "utf-8"
149+
150+def _match_css_class(str):
151+ """Build a RE to match the given CSS class."""
152+ return re.compile(r"(^|.*\s)%s($|\s)" % str)
153+
154+# First, the classes that represent markup elements.
155+
156+class PageElement(object):
157+ """Contains the navigational information for some part of the page
158+ (either a tag or a piece of text)"""
159+
160+ def setup(self, parent=None, previous=None):
161+ """Sets up the initial relations between this element and
162+ other elements."""
163+ self.parent = parent
164+ self.previous = previous
165+ self.next = None
166+ self.previousSibling = None
167+ self.nextSibling = None
168+ if self.parent and self.parent.contents:
169+ self.previousSibling = self.parent.contents[-1]
170+ self.previousSibling.nextSibling = self
171+
172+ def replaceWith(self, replaceWith):
173+ oldParent = self.parent
174+ myIndex = self.parent.index(self)
175+ if hasattr(replaceWith, "parent")\
176+ and replaceWith.parent is self.parent:
177+ # We're replacing this element with one of its siblings.
178+ index = replaceWith.parent.index(replaceWith)
179+ if index and index < myIndex:
180+ # Furthermore, it comes before this element. That
181+ # means that when we extract it, the index of this
182+ # element will change.
183+ myIndex = myIndex - 1
184+ self.extract()
185+ oldParent.insert(myIndex, replaceWith)
186+
187+ def replaceWithChildren(self):
188+ myParent = self.parent
189+ myIndex = self.parent.index(self)
190+ self.extract()
191+ reversedChildren = list(self.contents)
192+ reversedChildren.reverse()
193+ for child in reversedChildren:
194+ myParent.insert(myIndex, child)
195+
196+ def extract(self):
197+ """Destructively rips this element out of the tree."""
198+ if self.parent:
199+ try:
200+ del self.parent.contents[self.parent.index(self)]
201+ except ValueError:
202+ pass
203+
204+ #Find the two elements that would be next to each other if
205+ #this element (and any children) hadn't been parsed. Connect
206+ #the two.
207+ lastChild = self._lastRecursiveChild()
208+ nextElement = lastChild.next
209+
210+ if self.previous:
211+ self.previous.next = nextElement
212+ if nextElement:
213+ nextElement.previous = self.previous
214+ self.previous = None
215+ lastChild.next = None
216+
217+ self.parent = None
218+ if self.previousSibling:
219+ self.previousSibling.nextSibling = self.nextSibling
220+ if self.nextSibling:
221+ self.nextSibling.previousSibling = self.previousSibling
222+ self.previousSibling = self.nextSibling = None
223+ return self
224+
225+ def _lastRecursiveChild(self):
226+ "Finds the last element beneath this object to be parsed."
227+ lastChild = self
228+ while hasattr(lastChild, 'contents') and lastChild.contents:
229+ lastChild = lastChild.contents[-1]
230+ return lastChild
231+
232+ def insert(self, position, newChild):
233+ if isinstance(newChild, basestring) \
234+ and not isinstance(newChild, NavigableString):
235+ newChild = NavigableString(newChild)
236+
237+ position = min(position, len(self.contents))
238+ if hasattr(newChild, 'parent') and newChild.parent is not None:
239+ # We're 'inserting' an element that's already one
240+ # of this object's children.
241+ if newChild.parent is self:
242+ index = self.index(newChild)
243+ if index > position:
244+ # Furthermore we're moving it further down the
245+ # list of this object's children. That means that
246+ # when we extract this element, our target index
247+ # will jump down one.
248+ position = position - 1
249+ newChild.extract()
250+
251+ newChild.parent = self
252+ previousChild = None
253+ if position == 0:
254+ newChild.previousSibling = None
255+ newChild.previous = self
256+ else:
257+ previousChild = self.contents[position-1]
258+ newChild.previousSibling = previousChild
259+ newChild.previousSibling.nextSibling = newChild
260+ newChild.previous = previousChild._lastRecursiveChild()
261+ if newChild.previous:
262+ newChild.previous.next = newChild
263+
264+ newChildsLastElement = newChild._lastRecursiveChild()
265+
266+ if position >= len(self.contents):
267+ newChild.nextSibling = None
268+
269+ parent = self
270+ parentsNextSibling = None
271+ while not parentsNextSibling:
272+ parentsNextSibling = parent.nextSibling
273+ parent = parent.parent
274+ if not parent: # This is the last element in the document.
275+ break
276+ if parentsNextSibling:
277+ newChildsLastElement.next = parentsNextSibling
278+ else:
279+ newChildsLastElement.next = None
280+ else:
281+ nextChild = self.contents[position]
282+ newChild.nextSibling = nextChild
283+ if newChild.nextSibling:
284+ newChild.nextSibling.previousSibling = newChild
285+ newChildsLastElement.next = nextChild
286+
287+ if newChildsLastElement.next:
288+ newChildsLastElement.next.previous = newChildsLastElement
289+ self.contents.insert(position, newChild)
290+
291+ def append(self, tag):
292+ """Appends the given tag to the contents of this tag."""
293+ self.insert(len(self.contents), tag)
294+
295+ def findNext(self, name=None, attrs={}, text=None, **kwargs):
296+ """Returns the first item that matches the given criteria and
297+ appears after this Tag in the document."""
298+ return self._findOne(self.findAllNext, name, attrs, text, **kwargs)
299+
300+ def findAllNext(self, name=None, attrs={}, text=None, limit=None,
301+ **kwargs):
302+ """Returns all items that match the given criteria and appear
303+ after this Tag in the document."""
304+ return self._findAll(name, attrs, text, limit, self.nextGenerator,
305+ **kwargs)
306+
307+ def findNextSibling(self, name=None, attrs={}, text=None, **kwargs):
308+ """Returns the closest sibling to this Tag that matches the
309+ given criteria and appears after this Tag in the document."""
310+ return self._findOne(self.findNextSiblings, name, attrs, text,
311+ **kwargs)
312+
313+ def findNextSiblings(self, name=None, attrs={}, text=None, limit=None,
314+ **kwargs):
315+ """Returns the siblings of this Tag that match the given
316+ criteria and appear after this Tag in the document."""
317+ return self._findAll(name, attrs, text, limit,
318+ self.nextSiblingGenerator, **kwargs)
319+ fetchNextSiblings = findNextSiblings # Compatibility with pre-3.x
320+
321+ def findPrevious(self, name=None, attrs={}, text=None, **kwargs):
322+ """Returns the first item that matches the given criteria and
323+ appears before this Tag in the document."""
324+ return self._findOne(self.findAllPrevious, name, attrs, text, **kwargs)
325+
326+ def findAllPrevious(self, name=None, attrs={}, text=None, limit=None,
327+ **kwargs):
328+ """Returns all items that match the given criteria and appear
329+ before this Tag in the document."""
330+ return self._findAll(name, attrs, text, limit, self.previousGenerator,
331+ **kwargs)
332+ fetchPrevious = findAllPrevious # Compatibility with pre-3.x
333+
334+ def findPreviousSibling(self, name=None, attrs={}, text=None, **kwargs):
335+ """Returns the closest sibling to this Tag that matches the
336+ given criteria and appears before this Tag in the document."""
337+ return self._findOne(self.findPreviousSiblings, name, attrs, text,
338+ **kwargs)
339+
340+ def findPreviousSiblings(self, name=None, attrs={}, text=None,
341+ limit=None, **kwargs):
342+ """Returns the siblings of this Tag that match the given
343+ criteria and appear before this Tag in the document."""
344+ return self._findAll(name, attrs, text, limit,
345+ self.previousSiblingGenerator, **kwargs)
346+ fetchPreviousSiblings = findPreviousSiblings # Compatibility with pre-3.x
347+
348+ def findParent(self, name=None, attrs={}, **kwargs):
349+ """Returns the closest parent of this Tag that matches the given
350+ criteria."""
351+ # NOTE: We can't use _findOne because findParents takes a different
352+ # set of arguments.
353+ r = None
354+ l = self.findParents(name, attrs, 1)
355+ if l:
356+ r = l[0]
357+ return r
358+
359+ def findParents(self, name=None, attrs={}, limit=None, **kwargs):
360+ """Returns the parents of this Tag that match the given
361+ criteria."""
362+
363+ return self._findAll(name, attrs, None, limit, self.parentGenerator,
364+ **kwargs)
365+ fetchParents = findParents # Compatibility with pre-3.x
366+
367+ #These methods do the real heavy lifting.
368+
369+ def _findOne(self, method, name, attrs, text, **kwargs):
370+ r = None
371+ l = method(name, attrs, text, 1, **kwargs)
372+ if l:
373+ r = l[0]
374+ return r
375+
376+ def _findAll(self, name, attrs, text, limit, generator, **kwargs):
377+ "Iterates over a generator looking for things that match."
378+
379+ if isinstance(name, SoupStrainer):
380+ strainer = name
381+ # (Possibly) special case some findAll*(...) searches
382+ elif text is None and not limit and not attrs and not kwargs:
383+ # findAll*(True)
384+ if name is True:
385+ return [element for element in generator()
386+ if isinstance(element, Tag)]
387+ # findAll*('tag-name')
388+ elif isinstance(name, basestring):
389+ return [element for element in generator()
390+ if isinstance(element, Tag) and
391+ element.name == name]
392+ else:
393+ strainer = SoupStrainer(name, attrs, text, **kwargs)
394+ # Build a SoupStrainer
395+ else:
396+ strainer = SoupStrainer(name, attrs, text, **kwargs)
397+ results = ResultSet(strainer)
398+ g = generator()
399+ while True:
400+ try:
401+ i = g.next()
402+ except StopIteration:
403+ break
404+ if i:
405+ found = strainer.search(i)
406+ if found:
407+ results.append(found)
408+ if limit and len(results) >= limit:
409+ break
410+ return results
411+
412+ #These Generators can be used to navigate starting from both
413+ #NavigableStrings and Tags.
414+ def nextGenerator(self):
415+ i = self
416+ while i is not None:
417+ i = i.next
418+ yield i
419+
420+ def nextSiblingGenerator(self):
421+ i = self
422+ while i is not None:
423+ i = i.nextSibling
424+ yield i
425+
426+ def previousGenerator(self):
427+ i = self
428+ while i is not None:
429+ i = i.previous
430+ yield i
431+
432+ def previousSiblingGenerator(self):
433+ i = self
434+ while i is not None:
435+ i = i.previousSibling
436+ yield i
437+
438+ def parentGenerator(self):
439+ i = self
440+ while i is not None:
441+ i = i.parent
442+ yield i
443+
444+ # Utility methods
445+ def substituteEncoding(self, str, encoding=None):
446+ encoding = encoding or "utf-8"
447+ return str.replace("%SOUP-ENCODING%", encoding)
448+
449+ def toEncoding(self, s, encoding=None):
450+ """Encodes an object to a string in some encoding, or to Unicode.
451+ ."""
452+ if isinstance(s, unicode):
453+ if encoding:
454+ s = s.encode(encoding)
455+ elif isinstance(s, str):
456+ if encoding:
457+ s = s.encode(encoding)
458+ else:
459+ s = unicode(s)
460+ else:
461+ if encoding:
462+ s = self.toEncoding(str(s), encoding)
463+ else:
464+ s = unicode(s)
465+ return s
466+
467+class NavigableString(unicode, PageElement):
468+
469+ def __new__(cls, value):
470+ """Create a new NavigableString.
471+
472+ When unpickling a NavigableString, this method is called with
473+ the string in DEFAULT_OUTPUT_ENCODING. That encoding needs to be
474+ passed in to the superclass's __new__ or the superclass won't know
475+ how to handle non-ASCII characters.
476+ """
477+ if isinstance(value, unicode):
478+ return unicode.__new__(cls, value)
479+ return unicode.__new__(cls, value, DEFAULT_OUTPUT_ENCODING)
480+
481+ def __getnewargs__(self):
482+ return (NavigableString.__str__(self),)
483+
484+ def __getattr__(self, attr):
485+ """text.string gives you text. This is for backwards
486+ compatibility for Navigable*String, but for CData* it lets you
487+ get the string without the CData wrapper."""
488+ if attr == 'string':
489+ return self
490+ else:
491+ raise AttributeError, "'%s' object has no attribute '%s'" % (self.__class__.__name__, attr)
492+
493+ def __unicode__(self):
494+ return str(self).decode(DEFAULT_OUTPUT_ENCODING)
495+
496+ def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING):
497+ if encoding:
498+ return self.encode(encoding)
499+ else:
500+ return self
501+
502+class CData(NavigableString):
503+
504+ def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING):
505+ return "<![CDATA[%s]]>" % NavigableString.__str__(self, encoding)
506+
507+class ProcessingInstruction(NavigableString):
508+ def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING):
509+ output = self
510+ if "%SOUP-ENCODING%" in output:
511+ output = self.substituteEncoding(output, encoding)
512+ return "<?%s?>" % self.toEncoding(output, encoding)
513+
514+class Comment(NavigableString):
515+ def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING):
516+ return "<!--%s-->" % NavigableString.__str__(self, encoding)
517+
518+class Declaration(NavigableString):
519+ def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING):
520+ return "<!%s>" % NavigableString.__str__(self, encoding)
521+
522+class Tag(PageElement):
523+
524+ """Represents a found HTML tag with its attributes and contents."""
525+
526+ def _invert(h):
527+ "Cheap function to invert a hash."
528+ i = {}
529+ for k,v in h.items():
530+ i[v] = k
531+ return i
532+
533+ XML_ENTITIES_TO_SPECIAL_CHARS = { "apos" : "'",
534+ "quot" : '"',
535+ "amp" : "&",
536+ "lt" : "<",
537+ "gt" : ">" }
538+
539+ XML_SPECIAL_CHARS_TO_ENTITIES = _invert(XML_ENTITIES_TO_SPECIAL_CHARS)
540+
541+ def _convertEntities(self, match):
542+ """Used in a call to re.sub to replace HTML, XML, and numeric
543+ entities with the appropriate Unicode characters. If HTML
544+ entities are being converted, any unrecognized entities are
545+ escaped."""
546+ x = match.group(1)
547+ if self.convertHTMLEntities and x in name2codepoint:
548+ return unichr(name2codepoint[x])
549+ elif x in self.XML_ENTITIES_TO_SPECIAL_CHARS:
550+ if self.convertXMLEntities:
551+ return self.XML_ENTITIES_TO_SPECIAL_CHARS[x]
552+ else:
553+ return u'&%s;' % x
554+ elif len(x) > 0 and x[0] == '#':
555+ # Handle numeric entities
556+ if len(x) > 1 and x[1] == 'x':
557+ return unichr(int(x[2:], 16))
558+ else:
559+ return unichr(int(x[1:]))
560+
561+ elif self.escapeUnrecognizedEntities:
562+ return u'&amp;%s;' % x
563+ else:
564+ return u'&%s;' % x
565+
566+ def __init__(self, parser, name, attrs=None, parent=None,
567+ previous=None):
568+ "Basic constructor."
569+
570+ # We don't actually store the parser object: that lets extracted
571+ # chunks be garbage-collected
572+ self.parserClass = parser.__class__
573+ self.isSelfClosing = parser.isSelfClosingTag(name)
574+ self.name = name
575+ if attrs is None:
576+ attrs = []
577+ elif isinstance(attrs, dict):
578+ attrs = attrs.items()
579+ self.attrs = attrs
580+ self.contents = []
581+ self.setup(parent, previous)
582+ self.hidden = False
583+ self.containsSubstitutions = False
584+ self.convertHTMLEntities = parser.convertHTMLEntities
585+ self.convertXMLEntities = parser.convertXMLEntities
586+ self.escapeUnrecognizedEntities = parser.escapeUnrecognizedEntities
587+
588+ # Convert any HTML, XML, or numeric entities in the attribute values.
589+ convert = lambda(k, val): (k,
590+ re.sub("&(#\d+|#x[0-9a-fA-F]+|\w+);",
591+ self._convertEntities,
592+ val))
593+ self.attrs = map(convert, self.attrs)
594+
595+ def getString(self):
596+ if (len(self.contents) == 1
597+ and isinstance(self.contents[0], NavigableString)):
598+ return self.contents[0]
599+
600+ def setString(self, string):
601+ """Replace the contents of the tag with a string"""
602+ self.clear()
603+ self.append(string)
604+
605+ string = property(getString, setString)
606+
607+ def getText(self, separator=u""):
608+ if not len(self.contents):
609+ return u""
610+ stopNode = self._lastRecursiveChild().next
611+ strings = []
612+ current = self.contents[0]
613+ while current is not stopNode:
614+ if isinstance(current, NavigableString):
615+ strings.append(current)
616+ current = current.next
617+ return separator.join(strings)
618+
619+ text = property(getText)
620+
621+ def get(self, key, default=None):
622+ """Returns the value of the 'key' attribute for the tag, or
623+ the value given for 'default' if it doesn't have that
624+ attribute."""
625+ return self._getAttrMap().get(key, default)
626+
627+ def clear(self):
628+ """Extract all children."""
629+ for child in self.contents[:]:
630+ child.extract()
631+
632+ def index(self, element):
633+ for i, child in enumerate(self.contents):
634+ if child is element:
635+ return i
636+ raise ValueError("Tag.index: element not in tag")
637+
638+ def has_key(self, key):
639+ return self._getAttrMap().has_key(key)
640+
641+ def __getitem__(self, key):
642+ """tag[key] returns the value of the 'key' attribute for the tag,
643+ and throws an exception if it's not there."""
644+ return self._getAttrMap()[key]
645+
646+ def __iter__(self):
647+ "Iterating over a tag iterates over its contents."
648+ return iter(self.contents)
649+
650+ def __len__(self):
651+ "The length of a tag is the length of its list of contents."
652+ return len(self.contents)
653+
654+ def __contains__(self, x):
655+ return x in self.contents
656+
657+ def __nonzero__(self):
658+ "A tag is non-None even if it has no contents."
659+ return True
660+
661+ def __setitem__(self, key, value):
662+ """Setting tag[key] sets the value of the 'key' attribute for the
663+ tag."""
664+ self._getAttrMap()
665+ self.attrMap[key] = value
666+ found = False
667+ for i in range(0, len(self.attrs)):
668+ if self.attrs[i][0] == key:
669+ self.attrs[i] = (key, value)
670+ found = True
671+ if not found:
672+ self.attrs.append((key, value))
673+ self._getAttrMap()[key] = value
674+
675+ def __delitem__(self, key):
676+ "Deleting tag[key] deletes all 'key' attributes for the tag."
677+ for item in self.attrs:
678+ if item[0] == key:
679+ self.attrs.remove(item)
680+ #We don't break because bad HTML can define the same
681+ #attribute multiple times.
682+ self._getAttrMap()
683+ if self.attrMap.has_key(key):
684+ del self.attrMap[key]
685+
686+ def __call__(self, *args, **kwargs):
687+ """Calling a tag like a function is the same as calling its
688+ findAll() method. Eg. tag('a') returns a list of all the A tags
689+ found within this tag."""
690+ return apply(self.findAll, args, kwargs)
691+
692+ def __getattr__(self, tag):
693+ #print "Getattr %s.%s" % (self.__class__, tag)
694+ if len(tag) > 3 and tag.rfind('Tag') == len(tag)-3:
695+ return self.find(tag[:-3])
696+ elif tag.find('__') != 0:
697+ return self.find(tag)
698+ raise AttributeError, "'%s' object has no attribute '%s'" % (self.__class__, tag)
699+
700+ def __eq__(self, other):
701+ """Returns true iff this tag has the same name, the same attributes,
702+ and the same contents (recursively) as the given tag.
703+
704+ NOTE: right now this will return false if two tags have the
705+ same attributes in a different order. Should this be fixed?"""
706+ if other is self:
707+ return True
708+ if not hasattr(other, 'name') or not hasattr(other, 'attrs') or not hasattr(other, 'contents') or self.name != other.name or self.attrs != other.attrs or len(self) != len(other):
709+ return False
710+ for i in range(0, len(self.contents)):
711+ if self.contents[i] != other.contents[i]:
712+ return False
713+ return True
714+
715+ def __ne__(self, other):
716+ """Returns true iff this tag is not identical to the other tag,
717+ as defined in __eq__."""
718+ return not self == other
719+
720+ def __repr__(self, encoding=DEFAULT_OUTPUT_ENCODING):
721+ """Renders this tag as a string."""
722+ return self.__str__(encoding)
723+
724+ def __unicode__(self):
725+ return self.__str__(None)
726+
727+ BARE_AMPERSAND_OR_BRACKET = re.compile("([<>]|"
728+ + "&(?!#\d+;|#x[0-9a-fA-F]+;|\w+;)"
729+ + ")")
730+
731+ def _sub_entity(self, x):
732+ """Used with a regular expression to substitute the
733+ appropriate XML entity for an XML special character."""
734+ return "&" + self.XML_SPECIAL_CHARS_TO_ENTITIES[x.group(0)[0]] + ";"
735+
736+ def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING,
737+ prettyPrint=False, indentLevel=0):
738+ """Returns a string or Unicode representation of this tag and
739+ its contents. To get Unicode, pass None for encoding.
740+
741+ NOTE: since Python's HTML parser consumes whitespace, this
742+ method is not certain to reproduce the whitespace present in
743+ the original string."""
744+
745+ encodedName = self.toEncoding(self.name, encoding)
746+
747+ attrs = []
748+ if self.attrs:
749+ for key, val in self.attrs:
750+ fmt = '%s="%s"'
751+ if isinstance(val, basestring):
752+ if self.containsSubstitutions and '%SOUP-ENCODING%' in val:
753+ val = self.substituteEncoding(val, encoding)
754+
755+ # The attribute value either:
756+ #
757+ # * Contains no embedded double quotes or single quotes.
758+ # No problem: we enclose it in double quotes.
759+ # * Contains embedded single quotes. No problem:
760+ # double quotes work here too.
761+ # * Contains embedded double quotes. No problem:
762+ # we enclose it in single quotes.
763+ # * Embeds both single _and_ double quotes. This
764+ # can't happen naturally, but it can happen if
765+ # you modify an attribute value after parsing
766+ # the document. Now we have a bit of a
767+ # problem. We solve it by enclosing the
768+ # attribute in single quotes, and escaping any
769+ # embedded single quotes to XML entities.
770+ if '"' in val:
771+ fmt = "%s='%s'"
772+ if "'" in val:
773+ # TODO: replace with apos when
774+ # appropriate.
775+ val = val.replace("'", "&squot;")
776+
777+ # Now we're okay w/r/t quotes. But the attribute
778+ # value might also contain angle brackets, or
779+ # ampersands that aren't part of entities. We need
780+ # to escape those to XML entities too.
781+ val = self.BARE_AMPERSAND_OR_BRACKET.sub(self._sub_entity, val)
782+
783+ attrs.append(fmt % (self.toEncoding(key, encoding),
784+ self.toEncoding(val, encoding)))
785+ close = ''
786+ closeTag = ''
787+ if self.isSelfClosing:
788+ close = ' /'
789+ else:
790+ closeTag = '</%s>' % encodedName
791+
792+ indentTag, indentContents = 0, 0
793+ if prettyPrint:
794+ indentTag = indentLevel
795+ space = (' ' * (indentTag-1))
796+ indentContents = indentTag + 1
797+ contents = self.renderContents(encoding, prettyPrint, indentContents)
798+ if self.hidden:
799+ s = contents
800+ else:
801+ s = []
802+ attributeString = ''
803+ if attrs:
804+ attributeString = ' ' + ' '.join(attrs)
805+ if prettyPrint:
806+ s.append(space)
807+ s.append('<%s%s%s>' % (encodedName, attributeString, close))
808+ if prettyPrint:
809+ s.append("\n")
810+ s.append(contents)
811+ if prettyPrint and contents and contents[-1] != "\n":
812+ s.append("\n")
813+ if prettyPrint and closeTag:
814+ s.append(space)
815+ s.append(closeTag)
816+ if prettyPrint and closeTag and self.nextSibling:
817+ s.append("\n")
818+ s = ''.join(s)
819+ return s
820+
821+ def decompose(self):
822+ """Recursively destroys the contents of this tree."""
823+ self.extract()
824+ if len(self.contents) == 0:
825+ return
826+ current = self.contents[0]
827+ while current is not None:
828+ next = current.next
829+ if isinstance(current, Tag):
830+ del current.contents[:]
831+ current.parent = None
832+ current.previous = None
833+ current.previousSibling = None
834+ current.next = None
835+ current.nextSibling = None
836+ current = next
837+
838+ def prettify(self, encoding=DEFAULT_OUTPUT_ENCODING):
839+ return self.__str__(encoding, True)
840+
841+ def renderContents(self, encoding=DEFAULT_OUTPUT_ENCODING,
842+ prettyPrint=False, indentLevel=0):
843+ """Renders the contents of this tag as a string in the given
844+ encoding. If encoding is None, returns a Unicode string.."""
845+ s=[]
846+ for c in self:
847+ text = None
848+ if isinstance(c, NavigableString):
849+ text = c.__str__(encoding)
850+ elif isinstance(c, Tag):
851+ s.append(c.__str__(encoding, prettyPrint, indentLevel))
852+ if text and prettyPrint:
853+ text = text.strip()
854+ if text:
855+ if prettyPrint:
856+ s.append(" " * (indentLevel-1))
857+ s.append(text)
858+ if prettyPrint:
859+ s.append("\n")
860+ return ''.join(s)
861+
862+ #Soup methods
863+
864+ def find(self, name=None, attrs={}, recursive=True, text=None,
865+ **kwargs):
866+ """Return only the first child of this Tag matching the given
867+ criteria."""
868+ r = None
869+ l = self.findAll(name, attrs, recursive, text, 1, **kwargs)
870+ if l:
871+ r = l[0]
872+ return r
873+ findChild = find
874+
875+ def findAll(self, name=None, attrs={}, recursive=True, text=None,
876+ limit=None, **kwargs):
877+ """Extracts a list of Tag objects that match the given
878+ criteria. You can specify the name of the Tag and any
879+ attributes you want the Tag to have.
880+
881+ The value of a key-value pair in the 'attrs' map can be a
882+ string, a list of strings, a regular expression object, or a
883+ callable that takes a string and returns whether or not the
884+ string matches for some custom definition of 'matches'. The
885+ same is true of the tag name."""
886+ generator = self.recursiveChildGenerator
887+ if not recursive:
888+ generator = self.childGenerator
889+ return self._findAll(name, attrs, text, limit, generator, **kwargs)
890+ findChildren = findAll
891+
892+ # Pre-3.x compatibility methods
893+ first = find
894+ fetch = findAll
895+
896+ def fetchText(self, text=None, recursive=True, limit=None):
897+ return self.findAll(text=text, recursive=recursive, limit=limit)
898+
899+ def firstText(self, text=None, recursive=True):
900+ return self.find(text=text, recursive=recursive)
901+
902+ #Private methods
903+
904+ def _getAttrMap(self):
905+ """Initializes a map representation of this tag's attributes,
906+ if not already initialized."""
907+ if not getattr(self, 'attrMap'):
908+ self.attrMap = {}
909+ for (key, value) in self.attrs:
910+ self.attrMap[key] = value
911+ return self.attrMap
912+
913+ #Generator methods
914+ def childGenerator(self):
915+ # Just use the iterator from the contents
916+ return iter(self.contents)
917+
918+ def recursiveChildGenerator(self):
919+ if not len(self.contents):
920+ raise StopIteration
921+ stopNode = self._lastRecursiveChild().next
922+ current = self.contents[0]
923+ while current is not stopNode:
924+ yield current
925+ current = current.next
926+
927+
928+# Next, a couple classes to represent queries and their results.
929+class SoupStrainer:
930+ """Encapsulates a number of ways of matching a markup element (tag or
931+ text)."""
932+
933+ def __init__(self, name=None, attrs={}, text=None, **kwargs):
934+ self.name = name
935+ if isinstance(attrs, basestring):
936+ kwargs['class'] = _match_css_class(attrs)
937+ attrs = None
938+ if kwargs:
939+ if attrs:
940+ attrs = attrs.copy()
941+ attrs.update(kwargs)
942+ else:
943+ attrs = kwargs
944+ self.attrs = attrs
945+ self.text = text
946+
947+ def __str__(self):
948+ if self.text:
949+ return self.text
950+ else:
951+ return "%s|%s" % (self.name, self.attrs)
952+
953+ def searchTag(self, markupName=None, markupAttrs={}):
954+ found = None
955+ markup = None
956+ if isinstance(markupName, Tag):
957+ markup = markupName
958+ markupAttrs = markup
959+ callFunctionWithTagData = callable(self.name) \
960+ and not isinstance(markupName, Tag)
961+
962+ if (not self.name) \
963+ or callFunctionWithTagData \
964+ or (markup and self._matches(markup, self.name)) \
965+ or (not markup and self._matches(markupName, self.name)):
966+ if callFunctionWithTagData:
967+ match = self.name(markupName, markupAttrs)
968+ else:
969+ match = True
970+ markupAttrMap = None
971+ for attr, matchAgainst in self.attrs.items():
972+ if not markupAttrMap:
973+ if hasattr(markupAttrs, 'get'):
974+ markupAttrMap = markupAttrs
975+ else:
976+ markupAttrMap = {}
977+ for k,v in markupAttrs:
978+ markupAttrMap[k] = v
979+ attrValue = markupAttrMap.get(attr)
980+ if not self._matches(attrValue, matchAgainst):
981+ match = False
982+ break
983+ if match:
984+ if markup:
985+ found = markup
986+ else:
987+ found = markupName
988+ return found
989+
990+ def search(self, markup):
991+ #print 'looking for %s in %s' % (self, markup)
992+ found = None
993+ # If given a list of items, scan it for a text element that
994+ # matches.
995+ if hasattr(markup, "__iter__") \
996+ and not isinstance(markup, Tag):
997+ for element in markup:
998+ if isinstance(element, NavigableString) \
999+ and self.search(element):
1000+ found = element
1001+ break
1002+ # If it's a Tag, make sure its name or attributes match.
1003+ # Don't bother with Tags if we're searching for text.
1004+ elif isinstance(markup, Tag):
1005+ if not self.text:
1006+ found = self.searchTag(markup)
1007+ # If it's text, make sure the text matches.
1008+ elif isinstance(markup, NavigableString) or \
1009+ isinstance(markup, basestring):
1010+ if self._matches(markup, self.text):
1011+ found = markup
1012+ else:
1013+ raise Exception, "I don't know how to match against a %s" \
1014+ % markup.__class__
1015+ return found
1016+
1017+ def _matches(self, markup, matchAgainst):
1018+ #print "Matching %s against %s" % (markup, matchAgainst)
1019+ result = False
1020+ if matchAgainst is True:
1021+ result = markup is not None
1022+ elif callable(matchAgainst):
1023+ result = matchAgainst(markup)
1024+ else:
1025+ #Custom match methods take the tag as an argument, but all
1026+ #other ways of matching match the tag name as a string.
1027+ if isinstance(markup, Tag):
1028+ markup = markup.name
1029+ if markup and not isinstance(markup, basestring):
1030+ markup = unicode(markup)
1031+ #Now we know that chunk is either a string, or None.
1032+ if hasattr(matchAgainst, 'match'):
1033+ # It's a regexp object.
1034+ result = markup and matchAgainst.search(markup)
1035+ elif hasattr(matchAgainst, '__iter__'): # list-like
1036+ result = markup in matchAgainst
1037+ elif hasattr(matchAgainst, 'items'):
1038+ result = markup.has_key(matchAgainst)
1039+ elif matchAgainst and isinstance(markup, basestring):
1040+ if isinstance(markup, unicode):
1041+ matchAgainst = unicode(matchAgainst)
1042+ else:
1043+ matchAgainst = str(matchAgainst)
1044+
1045+ if not result:
1046+ result = matchAgainst == markup
1047+ return result
1048+
1049+class ResultSet(list):
1050+ """A ResultSet is just a list that keeps track of the SoupStrainer
1051+ that created it."""
1052+ def __init__(self, source):
1053+ list.__init__([])
1054+ self.source = source
1055+
1056+# Now, some helper functions.
1057+
1058+def buildTagMap(default, *args):
1059+ """Turns a list of maps, lists, or scalars into a single map.
1060+ Used to build the SELF_CLOSING_TAGS, NESTABLE_TAGS, and
1061+ NESTING_RESET_TAGS maps out of lists and partial maps."""
1062+ built = {}
1063+ for portion in args:
1064+ if hasattr(portion, 'items'):
1065+ #It's a map. Merge it.
1066+ for k,v in portion.items():
1067+ built[k] = v
1068+ elif hasattr(portion, '__iter__'): # is a list
1069+ #It's a list. Map each item to the default.
1070+ for k in portion:
1071+ built[k] = default
1072+ else:
1073+ #It's a scalar. Map it to the default.
1074+ built[portion] = default
1075+ return built
1076+
1077+# Now, the parser classes.
1078+
1079+class BeautifulStoneSoup(Tag, SGMLParser):
1080+
1081+ """This class contains the basic parser and search code. It defines
1082+ a parser that knows nothing about tag behavior except for the
1083+ following:
1084+
1085+ You can't close a tag without closing all the tags it encloses.
1086+ That is, "<foo><bar></foo>" actually means
1087+ "<foo><bar></bar></foo>".
1088+
1089+ [Another possible explanation is "<foo><bar /></foo>", but since
1090+ this class defines no SELF_CLOSING_TAGS, it will never use that
1091+ explanation.]
1092+
1093+ This class is useful for parsing XML or made-up markup languages,
1094+ or when BeautifulSoup makes an assumption counter to what you were
1095+ expecting."""
1096+
1097+ SELF_CLOSING_TAGS = {}
1098+ NESTABLE_TAGS = {}
1099+ RESET_NESTING_TAGS = {}
1100+ QUOTE_TAGS = {}
1101+ PRESERVE_WHITESPACE_TAGS = []
1102+
1103+ MARKUP_MASSAGE = [(re.compile('(<[^<>]*)/>'),
1104+ lambda x: x.group(1) + ' />'),
1105+ (re.compile('<!\s+([^<>]*)>'),
1106+ lambda x: '<!' + x.group(1) + '>')
1107+ ]
1108+
1109+ ROOT_TAG_NAME = u'[document]'
1110+
1111+ HTML_ENTITIES = "html"
1112+ XML_ENTITIES = "xml"
1113+ XHTML_ENTITIES = "xhtml"
1114+ # TODO: This only exists for backwards-compatibility
1115+ ALL_ENTITIES = XHTML_ENTITIES
1116+
1117+ # Used when determining whether a text node is all whitespace and
1118+ # can be replaced with a single space. A text node that contains
1119+ # fancy Unicode spaces (usually non-breaking) should be left
1120+ # alone.
1121+ STRIP_ASCII_SPACES = { 9: None, 10: None, 12: None, 13: None, 32: None, }
1122+
1123+ def __init__(self, markup="", parseOnlyThese=None, fromEncoding=None,
1124+ markupMassage=True, smartQuotesTo=XML_ENTITIES,
1125+ convertEntities=None, selfClosingTags=None, isHTML=False):
1126+ """The Soup object is initialized as the 'root tag', and the
1127+ provided markup (which can be a string or a file-like object)
1128+ is fed into the underlying parser.
1129+
1130+ sgmllib will process most bad HTML, and the BeautifulSoup
1131+ class has some tricks for dealing with some HTML that kills
1132+ sgmllib, but Beautiful Soup can nonetheless choke or lose data
1133+ if your data uses self-closing tags or declarations
1134+ incorrectly.
1135+
1136+ By default, Beautiful Soup uses regexes to sanitize input,
1137+ avoiding the vast majority of these problems. If the problems
1138+ don't apply to you, pass in False for markupMassage, and
1139+ you'll get better performance.
1140+
1141+ The default parser massage techniques fix the two most common
1142+ instances of invalid HTML that choke sgmllib:
1143+
1144+ <br/> (No space between name of closing tag and tag close)
1145+ <! --Comment--> (Extraneous whitespace in declaration)
1146+
1147+ You can pass in a custom list of (RE object, replace method)
1148+ tuples to get Beautiful Soup to scrub your input the way you
1149+ want."""
1150+
1151+ self.parseOnlyThese = parseOnlyThese
1152+ self.fromEncoding = fromEncoding
1153+ self.smartQuotesTo = smartQuotesTo
1154+ self.convertEntities = convertEntities
1155+ # Set the rules for how we'll deal with the entities we
1156+ # encounter
1157+ if self.convertEntities:
1158+ # It doesn't make sense to convert encoded characters to
1159+ # entities even while you're converting entities to Unicode.
1160+ # Just convert it all to Unicode.
1161+ self.smartQuotesTo = None
1162+ if convertEntities == self.HTML_ENTITIES:
1163+ self.convertXMLEntities = False
1164+ self.convertHTMLEntities = True
1165+ self.escapeUnrecognizedEntities = True
1166+ elif convertEntities == self.XHTML_ENTITIES:
1167+ self.convertXMLEntities = True
1168+ self.convertHTMLEntities = True
1169+ self.escapeUnrecognizedEntities = False
1170+ elif convertEntities == self.XML_ENTITIES:
1171+ self.convertXMLEntities = True
1172+ self.convertHTMLEntities = False
1173+ self.escapeUnrecognizedEntities = False
1174+ else:
1175+ self.convertXMLEntities = False
1176+ self.convertHTMLEntities = False
1177+ self.escapeUnrecognizedEntities = False
1178+
1179+ self.instanceSelfClosingTags = buildTagMap(None, selfClosingTags)
1180+ SGMLParser.__init__(self)
1181+
1182+ if hasattr(markup, 'read'): # It's a file-type object.
1183+ markup = markup.read()
1184+ self.markup = markup
1185+ self.markupMassage = markupMassage
1186+ try:
1187+ self._feed(isHTML=isHTML)
1188+ except StopParsing:
1189+ pass
1190+ self.markup = None # The markup can now be GCed
1191+
1192+ def convert_charref(self, name):
1193+ """This method fixes a bug in Python's SGMLParser."""
1194+ try:
1195+ n = int(name)
1196+ except ValueError:
1197+ return
1198+ if not 0 <= n <= 127 : # ASCII ends at 127, not 255
1199+ return
1200+ return self.convert_codepoint(n)
1201+
1202+ def _feed(self, inDocumentEncoding=None, isHTML=False):
1203+ # Convert the document to Unicode.
1204+ markup = self.markup
1205+ if isinstance(markup, unicode):
1206+ if not hasattr(self, 'originalEncoding'):
1207+ self.originalEncoding = None
1208+ else:
1209+ dammit = UnicodeDammit\
1210+ (markup, [self.fromEncoding, inDocumentEncoding],
1211+ smartQuotesTo=self.smartQuotesTo, isHTML=isHTML)
1212+ markup = dammit.unicode
1213+ self.originalEncoding = dammit.originalEncoding
1214+ self.declaredHTMLEncoding = dammit.declaredHTMLEncoding
1215+ if markup:
1216+ if self.markupMassage:
1217+ if not hasattr(self.markupMassage, "__iter__"):
1218+ self.markupMassage = self.MARKUP_MASSAGE
1219+ for fix, m in self.markupMassage:
1220+ markup = fix.sub(m, markup)
1221+ # TODO: We get rid of markupMassage so that the
1222+ # soup object can be deepcopied later on. Some
1223+ # Python installations can't copy regexes. If anyone
1224+ # was relying on the existence of markupMassage, this
1225+ # might cause problems.
1226+ del(self.markupMassage)
1227+ self.reset()
1228+
1229+ SGMLParser.feed(self, markup)
1230+ # Close out any unfinished strings and close all the open tags.
1231+ self.endData()
1232+ while self.currentTag.name != self.ROOT_TAG_NAME:
1233+ self.popTag()
1234+
1235+ def __getattr__(self, methodName):
1236+ """This method routes method call requests to either the SGMLParser
1237+ superclass or the Tag superclass, depending on the method name."""
1238+ #print "__getattr__ called on %s.%s" % (self.__class__, methodName)
1239+
1240+ if methodName.startswith('start_') or methodName.startswith('end_') \
1241+ or methodName.startswith('do_'):
1242+ return SGMLParser.__getattr__(self, methodName)
1243+ elif not methodName.startswith('__'):
1244+ return Tag.__getattr__(self, methodName)
1245+ else:
1246+ raise AttributeError
1247+
1248+ def isSelfClosingTag(self, name):
1249+ """Returns true iff the given string is the name of a
1250+ self-closing tag according to this parser."""
1251+ return self.SELF_CLOSING_TAGS.has_key(name) \
1252+ or self.instanceSelfClosingTags.has_key(name)
1253+
1254+ def reset(self):
1255+ Tag.__init__(self, self, self.ROOT_TAG_NAME)
1256+ self.hidden = 1
1257+ SGMLParser.reset(self)
1258+ self.currentData = []
1259+ self.currentTag = None
1260+ self.tagStack = []
1261+ self.quoteStack = []
1262+ self.pushTag(self)
1263+
1264+ def popTag(self):
1265+ tag = self.tagStack.pop()
1266+
1267+ #print "Pop", tag.name
1268+ if self.tagStack:
1269+ self.currentTag = self.tagStack[-1]
1270+ return self.currentTag
1271+
1272+ def pushTag(self, tag):
1273+ #print "Push", tag.name
1274+ if self.currentTag:
1275+ self.currentTag.contents.append(tag)
1276+ self.tagStack.append(tag)
1277+ self.currentTag = self.tagStack[-1]
1278+
1279+ def endData(self, containerClass=NavigableString):
1280+ if self.currentData:
1281+ currentData = u''.join(self.currentData)
1282+ if (currentData.translate(self.STRIP_ASCII_SPACES) == '' and
1283+ not set([tag.name for tag in self.tagStack]).intersection(
1284+ self.PRESERVE_WHITESPACE_TAGS)):
1285+ if '\n' in currentData:
1286+ currentData = '\n'
1287+ else:
1288+ currentData = ' '
1289+ self.currentData = []
1290+ if self.parseOnlyThese and len(self.tagStack) <= 1 and \
1291+ (not self.parseOnlyThese.text or \
1292+ not self.parseOnlyThese.search(currentData)):
1293+ return
1294+ o = containerClass(currentData)
1295+ o.setup(self.currentTag, self.previous)
1296+ if self.previous:
1297+ self.previous.next = o
1298+ self.previous = o
1299+ self.currentTag.contents.append(o)
1300+
1301+
1302+ def _popToTag(self, name, inclusivePop=True):
1303+ """Pops the tag stack up to and including the most recent
1304+ instance of the given tag. If inclusivePop is false, pops the tag
1305+ stack up to but *not* including the most recent instqance of
1306+ the given tag."""
1307+ #print "Popping to %s" % name
1308+ if name == self.ROOT_TAG_NAME:
1309+ return
1310+
1311+ numPops = 0
1312+ mostRecentTag = None
1313+ for i in range(len(self.tagStack)-1, 0, -1):
1314+ if name == self.tagStack[i].name:
1315+ numPops = len(self.tagStack)-i
1316+ break
1317+ if not inclusivePop:
1318+ numPops = numPops - 1
1319+
1320+ for i in range(0, numPops):
1321+ mostRecentTag = self.popTag()
1322+ return mostRecentTag
1323+
1324+ def _smartPop(self, name):
1325+
1326+ """We need to pop up to the previous tag of this type, unless
1327+ one of this tag's nesting reset triggers comes between this
1328+ tag and the previous tag of this type, OR unless this tag is a
1329+ generic nesting trigger and another generic nesting trigger
1330+ comes between this tag and the previous tag of this type.
1331+
1332+ Examples:
1333+ <p>Foo<b>Bar *<p>* should pop to 'p', not 'b'.
1334+ <p>Foo<table>Bar *<p>* should pop to 'table', not 'p'.
1335+ <p>Foo<table><tr>Bar *<p>* should pop to 'tr', not 'p'.
1336+
1337+ <li><ul><li> *<li>* should pop to 'ul', not the first 'li'.
1338+ <tr><table><tr> *<tr>* should pop to 'table', not the first 'tr'
1339+ <td><tr><td> *<td>* should pop to 'tr', not the first 'td'
1340+ """
1341+
1342+ nestingResetTriggers = self.NESTABLE_TAGS.get(name)
1343+ isNestable = nestingResetTriggers != None
1344+ isResetNesting = self.RESET_NESTING_TAGS.has_key(name)
1345+ popTo = None
1346+ inclusive = True
1347+ for i in range(len(self.tagStack)-1, 0, -1):
1348+ p = self.tagStack[i]
1349+ if (not p or p.name == name) and not isNestable:
1350+ #Non-nestable tags get popped to the top or to their
1351+ #last occurance.
1352+ popTo = name
1353+ break
1354+ if (nestingResetTriggers is not None
1355+ and p.name in nestingResetTriggers) \
1356+ or (nestingResetTriggers is None and isResetNesting
1357+ and self.RESET_NESTING_TAGS.has_key(p.name)):
1358+
1359+ #If we encounter one of the nesting reset triggers
1360+ #peculiar to this tag, or we encounter another tag
1361+ #that causes nesting to reset, pop up to but not
1362+ #including that tag.
1363+ popTo = p.name
1364+ inclusive = False
1365+ break
1366+ p = p.parent
1367+ if popTo:
1368+ self._popToTag(popTo, inclusive)
1369+
1370+ def unknown_starttag(self, name, attrs, selfClosing=0):
1371+ #print "Start tag %s: %s" % (name, attrs)
1372+ if self.quoteStack:
1373+ #This is not a real tag.
1374+ #print "<%s> is not real!" % name
1375+ attrs = ''.join([' %s="%s"' % (x, y) for x, y in attrs])
1376+ self.handle_data('<%s%s>' % (name, attrs))
1377+ return
1378+ self.endData()
1379+
1380+ if not self.isSelfClosingTag(name) and not selfClosing:
1381+ self._smartPop(name)
1382+
1383+ if self.parseOnlyThese and len(self.tagStack) <= 1 \
1384+ and (self.parseOnlyThese.text or not self.parseOnlyThese.searchTag(name, attrs)):
1385+ return
1386+
1387+ tag = Tag(self, name, attrs, self.currentTag, self.previous)
1388+ if self.previous:
1389+ self.previous.next = tag
1390+ self.previous = tag
1391+ self.pushTag(tag)
1392+ if selfClosing or self.isSelfClosingTag(name):
1393+ self.popTag()
1394+ if name in self.QUOTE_TAGS:
1395+ #print "Beginning quote (%s)" % name
1396+ self.quoteStack.append(name)
1397+ self.literal = 1
1398+ return tag
1399+
1400+ def unknown_endtag(self, name):
1401+ #print "End tag %s" % name
1402+ if self.quoteStack and self.quoteStack[-1] != name:
1403+ #This is not a real end tag.
1404+ #print "</%s> is not real!" % name
1405+ self.handle_data('</%s>' % name)
1406+ return
1407+ self.endData()
1408+ self._popToTag(name)
1409+ if self.quoteStack and self.quoteStack[-1] == name:
1410+ self.quoteStack.pop()
1411+ self.literal = (len(self.quoteStack) > 0)
1412+
1413+ def handle_data(self, data):
1414+ self.currentData.append(data)
1415+
1416+ def _toStringSubclass(self, text, subclass):
1417+ """Adds a certain piece of text to the tree as a NavigableString
1418+ subclass."""
1419+ self.endData()
1420+ self.handle_data(text)
1421+ self.endData(subclass)
1422+
1423+ def handle_pi(self, text):
1424+ """Handle a processing instruction as a ProcessingInstruction
1425+ object, possibly one with a %SOUP-ENCODING% slot into which an
1426+ encoding will be plugged later."""
1427+ if text[:3] == "xml":
1428+ text = u"xml version='1.0' encoding='%SOUP-ENCODING%'"
1429+ self._toStringSubclass(text, ProcessingInstruction)
1430+
1431+ def handle_comment(self, text):
1432+ "Handle comments as Comment objects."
1433+ self._toStringSubclass(text, Comment)
1434+
1435+ def handle_charref(self, ref):
1436+ "Handle character references as data."
1437+ if self.convertEntities:
1438+ data = unichr(int(ref))
1439+ else:
1440+ data = '&#%s;' % ref
1441+ self.handle_data(data)
1442+
1443+ def handle_entityref(self, ref):
1444+ """Handle entity references as data, possibly converting known
1445+ HTML and/or XML entity references to the corresponding Unicode
1446+ characters."""
1447+ data = None
1448+ if self.convertHTMLEntities:
1449+ try:
1450+ data = unichr(name2codepoint[ref])
1451+ except KeyError:
1452+ pass
1453+
1454+ if not data and self.convertXMLEntities:
1455+ data = self.XML_ENTITIES_TO_SPECIAL_CHARS.get(ref)
1456+
1457+ if not data and self.convertHTMLEntities and \
1458+ not self.XML_ENTITIES_TO_SPECIAL_CHARS.get(ref):
1459+ # TODO: We've got a problem here. We're told this is
1460+ # an entity reference, but it's not an XML entity
1461+ # reference or an HTML entity reference. Nonetheless,
1462+ # the logical thing to do is to pass it through as an
1463+ # unrecognized entity reference.
1464+ #
1465+ # Except: when the input is "&carol;" this function
1466+ # will be called with input "carol". When the input is
1467+ # "AT&T", this function will be called with input
1468+ # "T". We have no way of knowing whether a semicolon
1469+ # was present originally, so we don't know whether
1470+ # this is an unknown entity or just a misplaced
1471+ # ampersand.
1472+ #
1473+ # The more common case is a misplaced ampersand, so I
1474+ # escape the ampersand and omit the trailing semicolon.
1475+ data = "&amp;%s" % ref
1476+ if not data:
1477+ # This case is different from the one above, because we
1478+ # haven't already gone through a supposedly comprehensive
1479+ # mapping of entities to Unicode characters. We might not
1480+ # have gone through any mapping at all. So the chances are
1481+ # very high that this is a real entity, and not a
1482+ # misplaced ampersand.
1483+ data = "&%s;" % ref
1484+ self.handle_data(data)
1485+
1486+ def handle_decl(self, data):
1487+ "Handle DOCTYPEs and the like as Declaration objects."
1488+ self._toStringSubclass(data, Declaration)
1489+
1490+ def parse_declaration(self, i):
1491+ """Treat a bogus SGML declaration as raw data. Treat a CDATA
1492+ declaration as a CData object."""
1493+ j = None
1494+ if self.rawdata[i:i+9] == '<![CDATA[':
1495+ k = self.rawdata.find(']]>', i)
1496+ if k == -1:
1497+ k = len(self.rawdata)
1498+ data = self.rawdata[i+9:k]
1499+ j = k+3
1500+ self._toStringSubclass(data, CData)
1501+ else:
1502+ try:
1503+ j = SGMLParser.parse_declaration(self, i)
1504+ except SGMLParseError:
1505+ toHandle = self.rawdata[i:]
1506+ self.handle_data(toHandle)
1507+ j = i + len(toHandle)
1508+ return j
1509+
1510+class BeautifulSoup(BeautifulStoneSoup):
1511+
1512+ """This parser knows the following facts about HTML:
1513+
1514+ * Some tags have no closing tag and should be interpreted as being
1515+ closed as soon as they are encountered.
1516+
1517+ * The text inside some tags (ie. 'script') may contain tags which
1518+ are not really part of the document and which should be parsed
1519+ as text, not tags. If you want to parse the text as tags, you can
1520+ always fetch it and parse it explicitly.
1521+
1522+ * Tag nesting rules:
1523+
1524+ Most tags can't be nested at all. For instance, the occurance of
1525+ a <p> tag should implicitly close the previous <p> tag.
1526+
1527+ <p>Para1<p>Para2
1528+ should be transformed into:
1529+ <p>Para1</p><p>Para2
1530+
1531+ Some tags can be nested arbitrarily. For instance, the occurrence
1532+ of a <blockquote> tag should _not_ implicitly close the previous
1533+ <blockquote> tag.
1534+
1535+ Alice said: <blockquote>Bob said: <blockquote>Blah
1536+ should NOT be transformed into:
1537+ Alice said: <blockquote>Bob said: </blockquote><blockquote>Blah
1538+
1539+ Some tags can be nested, but the nesting is reset by the
1540+ interposition of other tags. For instance, a <tr> tag should
1541+ implicitly close the previous <tr> tag within the same <table>,
1542+ but not close a <tr> tag in another table.
1543+
1544+ <table><tr>Blah<tr>Blah
1545+ should be transformed into:
1546+ <table><tr>Blah</tr><tr>Blah
1547+ but,
1548+ <tr>Blah<table><tr>Blah
1549+ should NOT be transformed into
1550+ <tr>Blah<table></tr><tr>Blah
1551+
1552+ Differing assumptions about tag nesting rules are a major source
1553+ of problems with the BeautifulSoup class. If BeautifulSoup is not
1554+ treating as nestable a tag your page author treats as nestable,
1555+ try ICantBelieveItsBeautifulSoup, MinimalSoup, or
1556+ BeautifulStoneSoup before writing your own subclass."""
1557+
1558+ def __init__(self, *args, **kwargs):
1559+ if not kwargs.has_key('smartQuotesTo'):
1560+ kwargs['smartQuotesTo'] = self.HTML_ENTITIES
1561+ kwargs['isHTML'] = True
1562+ BeautifulStoneSoup.__init__(self, *args, **kwargs)
1563+
1564+ SELF_CLOSING_TAGS = buildTagMap(None,
1565+ ('br' , 'hr', 'input', 'img', 'meta',
1566+ 'spacer', 'link', 'frame', 'base', 'col'))
1567+
1568+ PRESERVE_WHITESPACE_TAGS = set(['pre', 'textarea'])
1569+
1570+ QUOTE_TAGS = {'script' : None, 'textarea' : None}
1571+
1572+ #According to the HTML standard, each of these inline tags can
1573+ #contain another tag of the same type. Furthermore, it's common
1574+ #to actually use these tags this way.
1575+ NESTABLE_INLINE_TAGS = ('span', 'font', 'q', 'object', 'bdo', 'sub', 'sup',
1576+ 'center')
1577+
1578+ #According to the HTML standard, these block tags can contain
1579+ #another tag of the same type. Furthermore, it's common
1580+ #to actually use these tags this way.
1581+ NESTABLE_BLOCK_TAGS = ('blockquote', 'div', 'fieldset', 'ins', 'del')
1582+
1583+ #Lists can contain other lists, but there are restrictions.
1584+ NESTABLE_LIST_TAGS = { 'ol' : [],
1585+ 'ul' : [],
1586+ 'li' : ['ul', 'ol'],
1587+ 'dl' : [],
1588+ 'dd' : ['dl'],
1589+ 'dt' : ['dl'] }
1590+
1591+ #Tables can contain other tables, but there are restrictions.
1592+ NESTABLE_TABLE_TAGS = {'table' : [],
1593+ 'tr' : ['table', 'tbody', 'tfoot', 'thead'],
1594+ 'td' : ['tr'],
1595+ 'th' : ['tr'],
1596+ 'thead' : ['table'],
1597+ 'tbody' : ['table'],
1598+ 'tfoot' : ['table'],
1599+ }
1600+
1601+ NON_NESTABLE_BLOCK_TAGS = ('address', 'form', 'p', 'pre')
1602+
1603+ #If one of these tags is encountered, all tags up to the next tag of
1604+ #this type are popped.
1605+ RESET_NESTING_TAGS = buildTagMap(None, NESTABLE_BLOCK_TAGS, 'noscript',
1606+ NON_NESTABLE_BLOCK_TAGS,
1607+ NESTABLE_LIST_TAGS,
1608+ NESTABLE_TABLE_TAGS)
1609+
1610+ NESTABLE_TAGS = buildTagMap([], NESTABLE_INLINE_TAGS, NESTABLE_BLOCK_TAGS,
1611+ NESTABLE_LIST_TAGS, NESTABLE_TABLE_TAGS)
1612+
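# A minimal illustrative sketch of how the nesting tables above drive the
# implicit-close behavior from the class docstring (expected output shown
# in comments; assumes this module is importable):
#
#     from BeautifulSoup import BeautifulSoup
#
#     # <p> is in NON_NESTABLE_BLOCK_TAGS, so a second <p> closes the first.
#     print str(BeautifulSoup("<p>Para1<p>Para2"))
#     # <p>Para1</p><p>Para2</p>
#
#     # <blockquote> is in NESTABLE_BLOCK_TAGS, so it nests instead.
#     print str(BeautifulSoup("<blockquote>A<blockquote>B"))
#     # <blockquote>A<blockquote>B</blockquote></blockquote>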
1613+ # Used to detect the charset in a META tag; see start_meta
1614+ CHARSET_RE = re.compile("((^|;)\s*charset=)([^;]*)", re.M)
1615+
1616+ def start_meta(self, attrs):
1617+ """Beautiful Soup can detect a charset included in a META tag,
1618+ try to convert the document to that charset, and re-parse the
1619+ document from the beginning."""
1620+ httpEquiv = None
1621+ contentType = None
1622+ contentTypeIndex = None
1623+ tagNeedsEncodingSubstitution = False
1624+
1625+ for i in range(0, len(attrs)):
1626+ key, value = attrs[i]
1627+ key = key.lower()
1628+ if key == 'http-equiv':
1629+ httpEquiv = value
1630+ elif key == 'content':
1631+ contentType = value
1632+ contentTypeIndex = i
1633+
1634+ if httpEquiv and contentType: # It's an interesting meta tag.
1635+ match = self.CHARSET_RE.search(contentType)
1636+ if match:
1637+ if (self.declaredHTMLEncoding is not None or
1638+ self.originalEncoding == self.fromEncoding):
1639+ # An HTML encoding was sniffed while converting
1640+ # the document to Unicode, or an HTML encoding was
1641+ # sniffed during a previous pass through the
1642+ # document, or an encoding was specified
1643+ # explicitly and it worked. Rewrite the meta tag.
1644+ def rewrite(match):
1645+ return match.group(1) + "%SOUP-ENCODING%"
1646+ newAttr = self.CHARSET_RE.sub(rewrite, contentType)
1647+ attrs[contentTypeIndex] = (attrs[contentTypeIndex][0],
1648+ newAttr)
1649+ tagNeedsEncodingSubstitution = True
1650+ else:
1651+ # This is our first pass through the document.
1652+ # Go through it again with the encoding information.
1653+ newCharset = match.group(3)
1654+ if newCharset and newCharset != self.originalEncoding:
1655+ self.declaredHTMLEncoding = newCharset
1656+ self._feed(self.declaredHTMLEncoding)
1657+ raise StopParsing
1658+ pass
1659+ tag = self.unknown_starttag("meta", attrs)
1660+ if tag and tagNeedsEncodingSubstitution:
1661+ tag.containsSubstitutions = True
1662+
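# A condensed sketch of the re-parse described in start_meta, based on the
# testRewrittenMetaTag case later in this diff ('\x82\xb1\x82\xea' is
# Shift-JIS text):
#
#     from BeautifulSoup import BeautifulSoup
#
#     shift_jis = ('<html><head><meta http-equiv="Content-type" '
#                  'content="text/html; charset=x-sjis" /></head>'
#                  '<body>\x82\xb1\x82\xea</body></html>')
#     soup = BeautifulSoup(shift_jis)
#     print soup.originalEncoding   # shift-jis (x-sjis via CHARSET_ALIASES)
#     print soup.meta['content']    # text/html; charset=%SOUP-ENCODING%
#     print str(soup.meta)          # charset substituted with utf-8 on output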
1663+class StopParsing(Exception):
1664+ pass
1665+
1666+class ICantBelieveItsBeautifulSoup(BeautifulSoup):
1667+
1668+ """The BeautifulSoup class is oriented towards skipping over
1669+ common HTML errors like unclosed tags. However, sometimes it makes
1670+ errors of its own. For instance, consider this fragment:
1671+
1672+ <b>Foo<b>Bar</b></b>
1673+
1674+ This is perfectly valid (if bizarre) HTML. However, the
1675+ BeautifulSoup class will implicitly close the first b tag when it
1676+ encounters the second 'b'. It will think the author wrote
1677+ "<b>Foo<b>Bar", and didn't close the first 'b' tag, because
1678+ there's no real-world reason to bold something that's already
1679+ bold. When it encounters '</b></b>' it will close two more 'b'
1680+ tags, for a grand total of three tags closed instead of two. This
1681+ can throw off the rest of your document structure. The same is
1682+ true of a number of other tags, listed below.
1683+
1684+ It's much more common for someone to forget to close a 'b' tag
1685+ than to actually use nested 'b' tags, and the BeautifulSoup class
1686+ handles the common case. This class handles the not-so-common
1687+ case: where you can't believe someone wrote what they did, but
1688+ it's valid HTML and BeautifulSoup screwed up by assuming it
1689+ wouldn't be."""
1690+
1691+ I_CANT_BELIEVE_THEYRE_NESTABLE_INLINE_TAGS = \
1692+ ('em', 'big', 'i', 'small', 'tt', 'abbr', 'acronym',
1693+ 'strong', 'cite', 'code', 'dfn', 'kbd', 'samp',
1694+ 'var', 'b')
1695+
1696+ I_CANT_BELIEVE_THEYRE_NESTABLE_BLOCK_TAGS = ('noscript',)
1697+
1698+ NESTABLE_TAGS = buildTagMap([], BeautifulSoup.NESTABLE_TAGS,
1699+ I_CANT_BELIEVE_THEYRE_NESTABLE_BLOCK_TAGS,
1700+ I_CANT_BELIEVE_THEYRE_NESTABLE_INLINE_TAGS)
1701+
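# A minimal sketch of the difference this class makes, with expected
# output per the docstring above:
#
#     from BeautifulSoup import BeautifulSoup, ICantBelieveItsBeautifulSoup
#
#     markup = "<b>Foo<b>Bar</b></b>"
#     print str(BeautifulSoup(markup))
#     # <b>Foo</b><b>Bar</b> -- the first <b> was closed early
#     print str(ICantBelieveItsBeautifulSoup(markup))
#     # <b>Foo<b>Bar</b></b> -- the nested <b> survives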
1702+class MinimalSoup(BeautifulSoup):
1703+ """The MinimalSoup class is for parsing HTML that contains
1704+ pathologically bad markup. It makes no assumptions about tag
1705+ nesting, but it does know which tags are self-closing, that
1706+ <script> tags contain Javascript and should not be parsed, that
1707+ META tags may contain encoding information, and so on.
1708+
1709+ This also makes it better for subclassing than BeautifulStoneSoup
1710+ or BeautifulSoup."""
1711+
1712+ RESET_NESTING_TAGS = buildTagMap('noscript')
1713+ NESTABLE_TAGS = {}
1714+
1715+class BeautifulSOAP(BeautifulStoneSoup):
1716+ """This class will push a tag with only a single string child into
1717+ the tag's parent as an attribute. The attribute's name is the tag
1718+ name, and the value is the string child. An example should give
1719+ the flavor of the change:
1720+
1721+ <foo><bar>baz</bar></foo>
1722+ =>
1723+ <foo bar="baz"><bar>baz</bar></foo>
1724+
1725+ You can then access fooTag['bar'] instead of fooTag.barTag.string.
1726+
1727+ This is, of course, useful for scraping structures that tend to
1728+ use subelements instead of attributes, such as SOAP messages. Note
1729+ that it modifies its input, so don't print the modified version
1730+ out.
1731+
1732+ I'm not sure how many people really want to use this class; let me
1733+ know if you do. Mainly I like the name."""
1734+
1735+ def popTag(self):
1736+ if len(self.tagStack) > 1:
1737+ tag = self.tagStack[-1]
1738+ parent = self.tagStack[-2]
1739+ parent._getAttrMap()
1740+ if (isinstance(tag, Tag) and len(tag.contents) == 1 and
1741+ isinstance(tag.contents[0], NavigableString) and
1742+ not parent.attrMap.has_key(tag.name)):
1743+ parent[tag.name] = tag.contents[0]
1744+ BeautifulStoneSoup.popTag(self)
1745+
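# A minimal sketch of the promotion described in the docstring:
#
#     from BeautifulSoup import BeautifulSOAP
#
#     soup = BeautifulSOAP("<foo><bar>baz</bar></foo>")
#     print soup.foo['bar']          # baz -- promoted to an attribute
#     print soup.foo.barTag.string   # baz -- the child tag is still there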
1746+#Enterprise class names! It has come to our attention that some people
1747+#think the names of the Beautiful Soup parser classes are too silly
1748+#and "unprofessional" for use in enterprise screen-scraping. We feel
1749+#your pain! For such-minded folk, the Beautiful Soup Consortium And
1750+#All-Night Kosher Bakery recommends renaming this file to
1751+#"RobustParser.py" (or, in cases of extreme enterprisiness,
1752+#"RobustParserBeanInterface.class") and using the following
1753+#enterprise-friendly class aliases:
1754+class RobustXMLParser(BeautifulStoneSoup):
1755+ pass
1756+class RobustHTMLParser(BeautifulSoup):
1757+ pass
1758+class RobustWackAssHTMLParser(ICantBelieveItsBeautifulSoup):
1759+ pass
1760+class RobustInsanelyWackAssHTMLParser(MinimalSoup):
1761+ pass
1762+class SimplifyingSOAPParser(BeautifulSOAP):
1763+ pass
1764+
1765+######################################################
1766+#
1767+# Bonus library: Unicode, Dammit
1768+#
1769+# This class forces XML data into a standard format (usually to UTF-8
1770+# or Unicode). It is heavily based on code from Mark Pilgrim's
1771+# Universal Feed Parser. It does not rewrite the XML or HTML to
1772+# reflect a new encoding: that happens in BeautifulStoneSoup.handle_pi
1773+# (XML) and BeautifulSoup.start_meta (HTML).
1774+
1775+# Autodetects character encodings.
1776+# Download from http://chardet.feedparser.org/
1777+try:
1778+ import chardet
1779+# import chardet.constants
1780+# chardet.constants._debug = 1
1781+except ImportError:
1782+ chardet = None
1783+
1784+# cjkcodecs and iconv_codec make Python know about more character encodings.
1785+# Both are available from http://cjkpython.i18n.org/
1786+# They're built in if you use Python 2.4.
1787+try:
1788+ import cjkcodecs.aliases
1789+except ImportError:
1790+ pass
1791+try:
1792+ import iconv_codec
1793+except ImportError:
1794+ pass
1795+
1796+class UnicodeDammit:
1797+ """A class for detecting the encoding of a *ML document and
1798+ converting it to a Unicode string. If the source encoding is
1799+ windows-1252, it can replace MS smart quotes with their HTML or XML
1800+ equivalents."""
1801+
1802+ # This dictionary maps commonly seen values for "charset" in HTML
1803+ # meta tags to the corresponding Python codec names. It only covers
1804+ # values that aren't in Python's aliases and can't be determined
1805+ # by the heuristics in find_codec.
1806+ CHARSET_ALIASES = { "macintosh" : "mac-roman",
1807+ "x-sjis" : "shift-jis" }
1808+
1809+ def __init__(self, markup, overrideEncodings=[],
1810+ smartQuotesTo='xml', isHTML=False):
1811+ self.declaredHTMLEncoding = None
1812+ self.markup, documentEncoding, sniffedEncoding = \
1813+ self._detectEncoding(markup, isHTML)
1814+ self.smartQuotesTo = smartQuotesTo
1815+ self.triedEncodings = []
1816+ if markup == '' or isinstance(markup, unicode):
1817+ self.originalEncoding = None
1818+ self.unicode = unicode(markup)
1819+ return
1820+
1821+ u = None
1822+ for proposedEncoding in overrideEncodings:
1823+ u = self._convertFrom(proposedEncoding)
1824+ if u: break
1825+ if not u:
1826+ for proposedEncoding in (documentEncoding, sniffedEncoding):
1827+ u = self._convertFrom(proposedEncoding)
1828+ if u: break
1829+
1830+ # If no luck and we have auto-detection library, try that:
1831+ if not u and chardet and not isinstance(self.markup, unicode):
1832+ u = self._convertFrom(chardet.detect(self.markup)['encoding'])
1833+
1834+ # As a last resort, try utf-8 and windows-1252:
1835+ if not u:
1836+ for proposed_encoding in ("utf-8", "windows-1252"):
1837+ u = self._convertFrom(proposed_encoding)
1838+ if u: break
1839+
1840+ self.unicode = u
1841+ if not u: self.originalEncoding = None
1842+
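# A minimal usage sketch, mirroring the testUnicodeDammitStandalone case
# later in this diff:
#
#     from BeautifulSoup import UnicodeDammit
#
#     dammit = UnicodeDammit("<foo>\x92</foo>")   # windows-1252 smart quote
#     print dammit.unicode             # <foo>&#x2019;</foo>
#
#     dammit = UnicodeDammit("\xed\xe5\xec\xf9", ["iso-8859-8"])
#     print dammit.originalEncoding    # iso-8859-8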
1843+ def _subMSChar(self, orig):
1844+ """Changes a MS smart quote character to an XML or HTML
1845+ entity."""
1846+ sub = self.MS_CHARS.get(orig)
1847+ if isinstance(sub, tuple):
1848+ if self.smartQuotesTo == 'xml':
1849+ sub = '&#x%s;' % sub[1]
1850+ else:
1851+ sub = '&%s;' % sub[0]
1852+ return sub
1853+
1854+ def _convertFrom(self, proposed):
1855+ proposed = self.find_codec(proposed)
1856+ if not proposed or proposed in self.triedEncodings:
1857+ return None
1858+ self.triedEncodings.append(proposed)
1859+ markup = self.markup
1860+
1861+ # Convert smart quotes to HTML if coming from an encoding
1862+ # that might have them.
1863+ if self.smartQuotesTo and proposed.lower() in("windows-1252",
1864+ "iso-8859-1",
1865+ "iso-8859-2"):
1866+ markup = re.compile("([\x80-\x9f])").sub \
1867+ (lambda(x): self._subMSChar(x.group(1)),
1868+ markup)
1869+
1870+ try:
1871+ # print "Trying to convert document to %s" % proposed
1872+ u = self._toUnicode(markup, proposed)
1873+ self.markup = u
1874+ self.originalEncoding = proposed
1875+ except Exception, e:
1876+ # print "That didn't work!"
1877+ # print e
1878+ return None
1879+ #print "Correct encoding: %s" % proposed
1880+ return self.markup
1881+
1882+ def _toUnicode(self, data, encoding):
1883+ '''Given a string and its encoding, decodes the string into Unicode.
1884+ %encoding is a string recognized by encodings.aliases'''
1885+
1886+ # strip Byte Order Mark (if present)
1887+ if (len(data) >= 4) and (data[:2] == '\xfe\xff') \
1888+ and (data[2:4] != '\x00\x00'):
1889+ encoding = 'utf-16be'
1890+ data = data[2:]
1891+ elif (len(data) >= 4) and (data[:2] == '\xff\xfe') \
1892+ and (data[2:4] != '\x00\x00'):
1893+ encoding = 'utf-16le'
1894+ data = data[2:]
1895+ elif data[:3] == '\xef\xbb\xbf':
1896+ encoding = 'utf-8'
1897+ data = data[3:]
1898+ elif data[:4] == '\x00\x00\xfe\xff':
1899+ encoding = 'utf-32be'
1900+ data = data[4:]
1901+ elif data[:4] == '\xff\xfe\x00\x00':
1902+ encoding = 'utf-32le'
1903+ data = data[4:]
1904+ newdata = unicode(data, encoding)
1905+ return newdata
1906+
1907+ def _detectEncoding(self, xml_data, isHTML=False):
1908+ """Given a document, tries to detect its XML encoding."""
1909+ xml_encoding = sniffed_xml_encoding = None
1910+ try:
1911+ if xml_data[:4] == '\x4c\x6f\xa7\x94':
1912+ # EBCDIC
1913+ xml_data = self._ebcdic_to_ascii(xml_data)
1914+ elif xml_data[:4] == '\x00\x3c\x00\x3f':
1915+ # UTF-16BE
1916+ sniffed_xml_encoding = 'utf-16be'
1917+ xml_data = unicode(xml_data, 'utf-16be').encode('utf-8')
1918+ elif (len(xml_data) >= 4) and (xml_data[:2] == '\xfe\xff') \
1919+ and (xml_data[2:4] != '\x00\x00'):
1920+ # UTF-16BE with BOM
1921+ sniffed_xml_encoding = 'utf-16be'
1922+ xml_data = unicode(xml_data[2:], 'utf-16be').encode('utf-8')
1923+ elif xml_data[:4] == '\x3c\x00\x3f\x00':
1924+ # UTF-16LE
1925+ sniffed_xml_encoding = 'utf-16le'
1926+ xml_data = unicode(xml_data, 'utf-16le').encode('utf-8')
1927+ elif (len(xml_data) >= 4) and (xml_data[:2] == '\xff\xfe') and \
1928+ (xml_data[2:4] != '\x00\x00'):
1929+ # UTF-16LE with BOM
1930+ sniffed_xml_encoding = 'utf-16le'
1931+ xml_data = unicode(xml_data[2:], 'utf-16le').encode('utf-8')
1932+ elif xml_data[:4] == '\x00\x00\x00\x3c':
1933+ # UTF-32BE
1934+ sniffed_xml_encoding = 'utf-32be'
1935+ xml_data = unicode(xml_data, 'utf-32be').encode('utf-8')
1936+ elif xml_data[:4] == '\x3c\x00\x00\x00':
1937+ # UTF-32LE
1938+ sniffed_xml_encoding = 'utf-32le'
1939+ xml_data = unicode(xml_data, 'utf-32le').encode('utf-8')
1940+ elif xml_data[:4] == '\x00\x00\xfe\xff':
1941+ # UTF-32BE with BOM
1942+ sniffed_xml_encoding = 'utf-32be'
1943+ xml_data = unicode(xml_data[4:], 'utf-32be').encode('utf-8')
1944+ elif xml_data[:4] == '\xff\xfe\x00\x00':
1945+ # UTF-32LE with BOM
1946+ sniffed_xml_encoding = 'utf-32le'
1947+ xml_data = unicode(xml_data[4:], 'utf-32le').encode('utf-8')
1948+ elif xml_data[:3] == '\xef\xbb\xbf':
1949+ # UTF-8 with BOM
1950+ sniffed_xml_encoding = 'utf-8'
1951+ xml_data = unicode(xml_data[3:], 'utf-8').encode('utf-8')
1952+ else:
1953+ sniffed_xml_encoding = 'ascii'
1954+ pass
1955+ except:
1956+ xml_encoding_match = None
1957+ xml_encoding_match = re.compile(
1958+ '^<\?.*encoding=[\'"](.*?)[\'"].*\?>').match(xml_data)
1959+ if not xml_encoding_match and isHTML:
1960+ regexp = re.compile('<\s*meta[^>]+charset=([^>]*?)[;\'">]', re.I)
1961+ xml_encoding_match = regexp.search(xml_data)
1962+ if xml_encoding_match is not None:
1963+ xml_encoding = xml_encoding_match.groups()[0].lower()
1964+ if isHTML:
1965+ self.declaredHTMLEncoding = xml_encoding
1966+ if sniffed_xml_encoding and \
1967+ (xml_encoding in ('iso-10646-ucs-2', 'ucs-2', 'csunicode',
1968+ 'iso-10646-ucs-4', 'ucs-4', 'csucs4',
1969+ 'utf-16', 'utf-32', 'utf_16', 'utf_32',
1970+ 'utf16', 'u16')):
1971+ xml_encoding = sniffed_xml_encoding
1972+ return xml_data, xml_encoding, sniffed_xml_encoding
1973+
1974+
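# A small sketch of the BOM sniffing above (assumes this module is
# importable; expected output is approximate):
#
#     from BeautifulSoup import UnicodeDammit
#
#     dammit = UnicodeDammit('\xef\xbb\xbf<foo>bar</foo>')   # UTF-8 BOM
#     print dammit.originalEncoding    # utf-8 -- sniffed from the BOM
#     print dammit.unicode             # u'<foo>bar</foo>', BOM stripped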
1975+ def find_codec(self, charset):
1976+ return self._codec(self.CHARSET_ALIASES.get(charset, charset)) \
1977+ or (charset and self._codec(charset.replace("-", ""))) \
1978+ or (charset and self._codec(charset.replace("-", "_"))) \
1979+ or charset
1980+
1981+ def _codec(self, charset):
1982+ if not charset: return charset
1983+ codec = None
1984+ try:
1985+ codecs.lookup(charset)
1986+ codec = charset
1987+ except (LookupError, ValueError):
1988+ pass
1989+ return codec
1990+
1991+ EBCDIC_TO_ASCII_MAP = None
1992+ def _ebcdic_to_ascii(self, s):
1993+ c = self.__class__
1994+ if not c.EBCDIC_TO_ASCII_MAP:
1995+ emap = (0,1,2,3,156,9,134,127,151,141,142,11,12,13,14,15,
1996+ 16,17,18,19,157,133,8,135,24,25,146,143,28,29,30,31,
1997+ 128,129,130,131,132,10,23,27,136,137,138,139,140,5,6,7,
1998+ 144,145,22,147,148,149,150,4,152,153,154,155,20,21,158,26,
1999+ 32,160,161,162,163,164,165,166,167,168,91,46,60,40,43,33,
2000+ 38,169,170,171,172,173,174,175,176,177,93,36,42,41,59,94,
2001+ 45,47,178,179,180,181,182,183,184,185,124,44,37,95,62,63,
2002+ 186,187,188,189,190,191,192,193,194,96,58,35,64,39,61,34,
2003+ 195,97,98,99,100,101,102,103,104,105,196,197,198,199,200,
2004+ 201,202,106,107,108,109,110,111,112,113,114,203,204,205,
2005+ 206,207,208,209,126,115,116,117,118,119,120,121,122,210,
2006+ 211,212,213,214,215,216,217,218,219,220,221,222,223,224,
2007+ 225,226,227,228,229,230,231,123,65,66,67,68,69,70,71,72,
2008+ 73,232,233,234,235,236,237,125,74,75,76,77,78,79,80,81,
2009+ 82,238,239,240,241,242,243,92,159,83,84,85,86,87,88,89,
2010+ 90,244,245,246,247,248,249,48,49,50,51,52,53,54,55,56,57,
2011+ 250,251,252,253,254,255)
2012+ import string
2013+ c.EBCDIC_TO_ASCII_MAP = string.maketrans( \
2014+ ''.join(map(chr, range(256))), ''.join(map(chr, emap)))
2015+ return s.translate(c.EBCDIC_TO_ASCII_MAP)
2016+
2017+ MS_CHARS = { '\x80' : ('euro', '20AC'),
2018+ '\x81' : ' ',
2019+ '\x82' : ('sbquo', '201A'),
2020+ '\x83' : ('fnof', '192'),
2021+ '\x84' : ('bdquo', '201E'),
2022+ '\x85' : ('hellip', '2026'),
2023+ '\x86' : ('dagger', '2020'),
2024+ '\x87' : ('Dagger', '2021'),
2025+ '\x88' : ('circ', '2C6'),
2026+ '\x89' : ('permil', '2030'),
2027+ '\x8A' : ('Scaron', '160'),
2028+ '\x8B' : ('lsaquo', '2039'),
2029+ '\x8C' : ('OElig', '152'),
2030+ '\x8D' : '?',
2031+ '\x8E' : ('#x17D', '17D'),
2032+ '\x8F' : '?',
2033+ '\x90' : '?',
2034+ '\x91' : ('lsquo', '2018'),
2035+ '\x92' : ('rsquo', '2019'),
2036+ '\x93' : ('ldquo', '201C'),
2037+ '\x94' : ('rdquo', '201D'),
2038+ '\x95' : ('bull', '2022'),
2039+ '\x96' : ('ndash', '2013'),
2040+ '\x97' : ('mdash', '2014'),
2041+ '\x98' : ('tilde', '2DC'),
2042+ '\x99' : ('trade', '2122'),
2043+ '\x9a' : ('scaron', '161'),
2044+ '\x9b' : ('rsaquo', '203A'),
2045+ '\x9c' : ('oelig', '153'),
2046+ '\x9d' : '?',
2047+ '\x9e' : ('#x17E', '17E'),
2048+ '\x9f' : ('Yuml', '178'),}
2049+
2050+#######################################################################
2051+
2052+
2053+#By default, act as an HTML pretty-printer.
2054+if __name__ == '__main__':
2055+ import sys
2056+ soup = BeautifulSoup(sys.stdin)
2057+ print soup.prettify()
2058
2059=== added file 'BeautifulSoupTests.py'
2060--- BeautifulSoupTests.py 1970-01-01 00:00:00 +0000
2061+++ BeautifulSoupTests.py 2011-05-27 07:52:31 +0000
2062@@ -0,0 +1,903 @@
2063+# -*- coding: utf-8 -*-
2064+"""Unit tests for Beautiful Soup.
2065+
2066+These tests make sure that Beautiful Soup works as it should. If you
2067+find a bug in Beautiful Soup, the best way to express it is as a test
2068+case like this that fails."""
2069+
2070+import unittest
2071+from BeautifulSoup import *
2072+
2073+class SoupTest(unittest.TestCase):
2074+
2075+ def assertSoupEquals(self, toParse, rep=None, c=BeautifulSoup):
2076+ """Parse the given text and make sure its string rep is the other
2077+ given text."""
2078+ if rep == None:
2079+ rep = toParse
2080+ self.assertEqual(str(c(toParse)), rep)
2081+
2082+
2083+class FollowThatTag(SoupTest):
2084+
2085+ "Tests the various ways of fetching tags from a soup."
2086+
2087+ def setUp(self):
2088+ ml = """
2089+ <a id="x">1</a>
2090+ <A id="a">2</a>
2091+ <b id="b">3</a>
2092+ <b href="foo" id="x">4</a>
2093+ <ac width=100>4</ac>"""
2094+ self.soup = BeautifulStoneSoup(ml)
2095+
2096+ def testFindAllByName(self):
2097+ matching = self.soup('a')
2098+ self.assertEqual(len(matching), 2)
2099+ self.assertEqual(matching[0].name, 'a')
2100+ self.assertEqual(matching, self.soup.findAll('a'))
2101+ self.assertEqual(matching, self.soup.findAll(SoupStrainer('a')))
2102+
2103+ def testFindAllByAttribute(self):
2104+ matching = self.soup.findAll(id='x')
2105+ self.assertEqual(len(matching), 2)
2106+ self.assertEqual(matching[0].name, 'a')
2107+ self.assertEqual(matching[1].name, 'b')
2108+
2109+ matching2 = self.soup.findAll(attrs={'id' : 'x'})
2110+ self.assertEqual(matching, matching2)
2111+
2112+ strainer = SoupStrainer(attrs={'id' : 'x'})
2113+ self.assertEqual(matching, self.soup.findAll(strainer))
2114+
2115+ self.assertEqual(len(self.soup.findAll(id=None)), 1)
2116+
2117+ self.assertEqual(len(self.soup.findAll(width=100)), 1)
2118+ self.assertEqual(len(self.soup.findAll(junk=None)), 5)
2119+ self.assertEqual(len(self.soup.findAll(junk=[1, None])), 5)
2120+
2121+ self.assertEqual(len(self.soup.findAll(junk=re.compile('.*'))), 0)
2122+ self.assertEqual(len(self.soup.findAll(junk=True)), 0)
2123+
2124+ self.assertEqual(len(self.soup.findAll(junk=True)), 0)
2125+ self.assertEqual(len(self.soup.findAll(href=True)), 1)
2126+
2127+ def testFindallByClass(self):
2128+ soup = BeautifulSoup('<b class="foo">Foo</b><a class="1 23 4">Bar</a>')
2129+ self.assertEqual(soup.find(attrs='foo').string, "Foo")
2130+ self.assertEqual(soup.find('a', '1').string, "Bar")
2131+ self.assertEqual(soup.find('a', '23').string, "Bar")
2132+ self.assertEqual(soup.find('a', '4').string, "Bar")
2133+
2134+ self.assertEqual(soup.find('a', '2'), None)
2135+
2136+ def testFindAllByList(self):
2137+ matching = self.soup(['a', 'ac'])
2138+ self.assertEqual(len(matching), 3)
2139+
2140+ def testFindAllByHash(self):
2141+ matching = self.soup({'a' : True, 'b' : True})
2142+ self.assertEqual(len(matching), 4)
2143+
2144+ def testFindAllText(self):
2145+ soup = BeautifulSoup("<html>\xbb</html>")
2146+ self.assertEqual(soup.findAll(text=re.compile('.*')),
2147+ [u'\xbb'])
2148+
2149+ def testFindAllByRE(self):
2150+ import re
2151+ r = re.compile('a.*')
2152+ self.assertEqual(len(self.soup(r)), 3)
2153+
2154+ def testFindAllByMethod(self):
2155+ def matchTagWhereIDMatchesName(tag):
2156+ return tag.name == tag.get('id')
2157+
2158+ matching = self.soup.findAll(matchTagWhereIDMatchesName)
2159+ self.assertEqual(len(matching), 2)
2160+ self.assertEqual(matching[0].name, 'a')
2161+
2162+ def testFindByIndex(self):
2163+ """For when you have the tag and you want to know where it is."""
2164+ tag = self.soup.find('a', id="a")
2165+ self.assertEqual(self.soup.index(tag), 3)
2166+
2167+ # It works for NavigableStrings as well.
2168+ s = tag.string
2169+ self.assertEqual(tag.index(s), 0)
2170+
2171+ # If the tag isn't present, a ValueError is raised.
2172+ soup2 = BeautifulSoup("<b></b>")
2173+ tag2 = soup2.find('b')
2174+ self.assertRaises(ValueError, self.soup.index, tag2)
2175+
2176+ def testConflictingFindArguments(self):
2177+ """The 'text' argument takes precedence."""
2178+ soup = BeautifulSoup('Foo<b>Bar</b>Baz')
2179+ self.assertEqual(soup.find('b', text='Baz'), 'Baz')
2180+ self.assertEqual(soup.findAll('b', text='Baz'), ['Baz'])
2181+
2182+ self.assertEqual(soup.find(True, text='Baz'), 'Baz')
2183+ self.assertEqual(soup.findAll(True, text='Baz'), ['Baz'])
2184+
2185+ def testParents(self):
2186+ soup = BeautifulSoup('<ul id="foo"></ul><ul id="foo"><ul><ul id="foo" a="b"><b>Blah')
2187+ b = soup.b
2188+ self.assertEquals(len(b.findParents('ul', {'id' : 'foo'})), 2)
2189+ self.assertEquals(b.findParent('ul')['a'], 'b')
2190+
2191+ PROXIMITY_TEST = BeautifulSoup('<b id="1"><b id="2"><b id="3"><b id="4">')
2192+
2193+ def testNext(self):
2194+ soup = self.PROXIMITY_TEST
2195+ b = soup.find('b', {'id' : 2})
2196+ self.assertEquals(b.findNext('b')['id'], '3')
2197+ self.assertEquals(b.findNext('b')['id'], '3')
2198+ self.assertEquals(len(b.findAllNext('b')), 2)
2199+ self.assertEquals(len(b.findAllNext('b', {'id' : 4})), 1)
2200+
2201+ def testPrevious(self):
2202+ soup = self.PROXIMITY_TEST
2203+ b = soup.find('b', {'id' : 3})
2204+ self.assertEquals(b.findPrevious('b')['id'], '2')
2205+ self.assertEquals(b.findPrevious('b')['id'], '2')
2206+ self.assertEquals(len(b.findAllPrevious('b')), 2)
2207+ self.assertEquals(len(b.findAllPrevious('b', {'id' : 2})), 1)
2208+
2209+
2210+ SIBLING_TEST = BeautifulSoup('<blockquote id="1"><blockquote id="1.1"></blockquote></blockquote><blockquote id="2"><blockquote id="2.1"></blockquote></blockquote><blockquote id="3"><blockquote id="3.1"></blockquote></blockquote><blockquote id="4">')
2211+
2212+ def testNextSibling(self):
2213+ soup = self.SIBLING_TEST
2214+ tag = 'blockquote'
2215+ b = soup.find(tag, {'id' : 2})
2216+ self.assertEquals(b.findNext(tag)['id'], '2.1')
2217+ self.assertEquals(b.findNextSibling(tag)['id'], '3')
2218+ self.assertEquals(b.findNextSibling(tag)['id'], '3')
2219+ self.assertEquals(len(b.findNextSiblings(tag)), 2)
2220+ self.assertEquals(len(b.findNextSiblings(tag, {'id' : 4})), 1)
2221+
2222+ def testPreviousSibling(self):
2223+ soup = self.SIBLING_TEST
2224+ tag = 'blockquote'
2225+ b = soup.find(tag, {'id' : 3})
2226+ self.assertEquals(b.findPrevious(tag)['id'], '2.1')
2227+ self.assertEquals(b.findPreviousSibling(tag)['id'], '2')
2228+ self.assertEquals(b.findPreviousSibling(tag)['id'], '2')
2229+ self.assertEquals(len(b.findPreviousSiblings(tag)), 2)
2230+ self.assertEquals(len(b.findPreviousSiblings(tag, id=1)), 1)
2231+
2232+ def testTextNavigation(self):
2233+ soup = BeautifulSoup('Foo<b>Bar</b><i id="1"><b>Baz<br />Blee<hr id="1"/></b></i>Blargh')
2234+ baz = soup.find(text='Baz')
2235+ self.assertEquals(baz.findParent("i")['id'], '1')
2236+ self.assertEquals(baz.findNext(text='Blee'), 'Blee')
2237+ self.assertEquals(baz.findNextSibling(text='Blee'), 'Blee')
2238+ self.assertEquals(baz.findNextSibling(text='Blargh'), None)
2239+ self.assertEquals(baz.findNextSibling('hr')['id'], '1')
2240+
2241+class SiblingRivalry(SoupTest):
2242+ "Tests the nextSibling and previousSibling navigation."
2243+
2244+ def testSiblings(self):
2245+ soup = BeautifulSoup("<ul><li>1<p>A</p>B<li>2<li>3</ul>")
2246+ secondLI = soup.find('li').nextSibling
2247+ self.assert_(secondLI.name == 'li' and secondLI.string == '2')
2248+ self.assertEquals(soup.find(text='1').nextSibling.name, 'p')
2249+ self.assertEquals(soup.find('p').nextSibling, 'B')
2250+ self.assertEquals(soup.find('p').nextSibling.previousSibling.nextSibling, 'B')
2251+
2252+class TagsAreObjectsToo(SoupTest):
2253+ "Tests the various built-in functions of Tag objects."
2254+
2255+ def testLen(self):
2256+ soup = BeautifulSoup("<top>1<b>2</b>3</top>")
2257+ self.assertEquals(len(soup.top), 3)
2258+
2259+class StringEmUp(SoupTest):
2260+ "Tests the use of 'string' as an alias for a tag's only content."
2261+
2262+ def testString(self):
2263+ s = BeautifulSoup("<b>foo</b>")
2264+ self.assertEquals(s.b.string, 'foo')
2265+
2266+ def testLackOfString(self):
2267+ s = BeautifulSoup("<b>f<i>e</i>o</b>")
2268+ self.assert_(not s.b.string)
2269+
2270+ def testStringAssign(self):
2271+ s = BeautifulSoup("<b></b>")
2272+ b = s.b
2273+ b.string = "foo"
2274+ string = b.string
2275+ self.assertEquals(string, "foo")
2276+ self.assert_(isinstance(string, NavigableString))
2277+
2278+class AllText(SoupTest):
2279+ "Tests the use of 'text' to get all of string content from the tag."
2280+
2281+ def testText(self):
2282+ soup = BeautifulSoup("<ul><li>spam</li><li>eggs</li><li>cheese</li>")
2283+ self.assertEquals(soup.ul.text, "spameggscheese")
2284+ self.assertEquals(soup.ul.getText('/'), "spam/eggs/cheese")
2285+
2286+ def testTextHasCorrectSpacing(self):
2287+ soup = BeautifulSoup("<p>This is a <i>test</i>.")
2288+ self.assertEquals(soup.text, "This is a test.")
2289+ self.assertEquals(soup.getText('/'), "This is a /test/.")
2290+
2291+class ThatsMyLimit(SoupTest):
2292+ "Tests the limit argument."
2293+
2294+ def testBasicLimits(self):
2295+ s = BeautifulSoup('<br id="1" /><br id="1" /><br id="1" /><br id="1" />')
2296+ self.assertEquals(len(s.findAll('br')), 4)
2297+ self.assertEquals(len(s.findAll('br', limit=2)), 2)
2298+ self.assertEquals(len(s('br', limit=2)), 2)
2299+
2300+class OnlyTheLonely(SoupTest):
2301+ "Tests the parseOnly argument to the constructor."
2302+ def setUp(self):
2303+ x = []
2304+ for i in range(1,6):
2305+ x.append('<a id="%s">' % i)
2306+ for j in range(100,103):
2307+ x.append('<b id="%s.%s">Content %s.%s</b>' % (i,j, i,j))
2308+ x.append('</a>')
2309+ self.x = ''.join(x)
2310+
2311+ def testOnly(self):
2312+ strainer = SoupStrainer("b")
2313+ soup = BeautifulSoup(self.x, parseOnlyThese=strainer)
2314+ self.assertEquals(len(soup), 15)
2315+
2316+ strainer = SoupStrainer(id=re.compile("100.*"))
2317+ soup = BeautifulSoup(self.x, parseOnlyThese=strainer)
2318+ self.assertEquals(len(soup), 5)
2319+
2320+ strainer = SoupStrainer(text=re.compile("10[01].*"))
2321+ soup = BeautifulSoup(self.x, parseOnlyThese=strainer)
2322+ self.assertEquals(len(soup), 10)
2323+
2324+ strainer = SoupStrainer(text=lambda(x):x[8]=='3')
2325+ soup = BeautifulSoup(self.x, parseOnlyThese=strainer)
2326+ self.assertEquals(len(soup), 3)
2327+
2328+class PickleMeThis(SoupTest):
2329+ "Testing features like pickle and deepcopy."
2330+
2331+ def setUp(self):
2332+ self.page = """<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
2333+"http://www.w3.org/TR/REC-html40/transitional.dtd">
2334+<html>
2335+<head>
2336+<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
2337+<title>Beautiful Soup: We called him Tortoise because he taught us.</title>
2338+<link rev="made" href="mailto:leonardr@segfault.org">
2339+<meta name="Description" content="Beautiful Soup: an HTML parser optimized for screen-scraping.">
2340+<meta name="generator" content="Markov Approximation 1.4 (module: leonardr)">
2341+<meta name="author" content="Leonard Richardson">
2342+</head>
2343+<body>
2344+<a href="foo">foo</a>
2345+<a href="foo"><b>bar</b></a>
2346+</body>
2347+</html>"""
2348+
2349+ self.soup = BeautifulSoup(self.page)
2350+
2351+ def testPickle(self):
2352+ import pickle
2353+ dumped = pickle.dumps(self.soup, 2)
2354+ loaded = pickle.loads(dumped)
2355+ self.assertEqual(loaded.__class__, BeautifulSoup)
2356+ self.assertEqual(str(loaded), str(self.soup))
2357+
2358+ def testDeepcopy(self):
2359+ from copy import deepcopy
2360+ copied = deepcopy(self.soup)
2361+ self.assertEqual(str(copied), str(self.soup))
2362+
2363+ def testUnicodePickle(self):
2364+ import cPickle as pickle
2365+ html = "<b>" + chr(0xc3) + "</b>"
2366+ soup = BeautifulSoup(html)
2367+ dumped = pickle.dumps(soup, pickle.HIGHEST_PROTOCOL)
2368+ loaded = pickle.loads(dumped)
2369+ self.assertEqual(str(loaded), str(soup))
2370+
2371+
2372+class WriteOnlyCode(SoupTest):
2373+ "Testing the modification of the tree."
2374+
2375+ def testModifyAttributes(self):
2376+ soup = BeautifulSoup('<a id="1"></a>')
2377+ soup.a['id'] = 2
2378+ self.assertEqual(soup.renderContents(), '<a id="2"></a>')
2379+ del(soup.a['id'])
2380+ self.assertEqual(soup.renderContents(), '<a></a>')
2381+ soup.a['id2'] = 'foo'
2382+ self.assertEqual(soup.renderContents(), '<a id2="foo"></a>')
2383+
2384+ def testNewTagCreation(self):
2385+ "Makes sure tags don't step on each others' toes."
2386+ soup = BeautifulSoup()
2387+ a = Tag(soup, 'a')
2388+ ol = Tag(soup, 'ol')
2389+ a['href'] = 'http://foo.com/'
2390+ self.assertRaises(KeyError, lambda : ol['href'])
2391+
2392+ def testNewTagWithAttributes(self):
2393+ """Makes sure new tags can be created complete with attributes."""
2394+ soup = BeautifulSoup()
2395+ a = Tag(soup, 'a', [('href', 'foo')])
2396+ b = Tag(soup, 'b', {'class':'bar'})
2397+ soup.insert(0,a)
2398+ soup.insert(1,b)
2399+ self.assertEqual(soup.a['href'], 'foo')
2400+ self.assertEqual(soup.b['class'], 'bar')
2401+
2402+ def testTagReplacement(self):
2403+ # Make sure you can replace an element with itself.
2404+ text = "<a><b></b><c>Foo<d></d></c></a><a><e></e></a>"
2405+ soup = BeautifulSoup(text)
2406+ c = soup.c
2407+ soup.c.replaceWith(c)
2408+ self.assertEquals(str(soup), text)
2409+
2410+ # A very simple case
2411+ soup = BeautifulSoup("<b>Argh!</b>")
2412+ soup.find(text="Argh!").replaceWith("Hooray!")
2413+ newText = soup.find(text="Hooray!")
2414+ b = soup.b
2415+ self.assertEqual(newText.previous, b)
2416+ self.assertEqual(newText.parent, b)
2417+ self.assertEqual(newText.previous.next, newText)
2418+ self.assertEqual(newText.next, None)
2419+
2420+ # A more complex case
2421+ soup = BeautifulSoup("<a><b>Argh!</b><c></c><d></d></a>")
2422+ soup.b.insert(1, "Hooray!")
2423+ newText = soup.find(text="Hooray!")
2424+ self.assertEqual(newText.previous, "Argh!")
2425+ self.assertEqual(newText.previous.next, newText)
2426+
2427+ self.assertEqual(newText.previousSibling, "Argh!")
2428+ self.assertEqual(newText.previousSibling.nextSibling, newText)
2429+
2430+ self.assertEqual(newText.nextSibling, None)
2431+ self.assertEqual(newText.next, soup.c)
2432+
2433+ text = "<html>There's <b>no</b> business like <b>show</b> business</html>"
2434+ soup = BeautifulSoup(text)
2435+ no, show = soup.findAll('b')
2436+ show.replaceWith(no)
2437+ self.assertEquals(str(soup), "<html>There's business like <b>no</b> business</html>")
2438+
2439+ # Even more complex
2440+ soup = BeautifulSoup("<a><b>Find</b><c>lady!</c><d></d></a>")
2441+ tag = Tag(soup, 'magictag')
2442+ tag.insert(0, "the")
2443+ soup.a.insert(1, tag)
2444+
2445+ b = soup.b
2446+ c = soup.c
2447+ theText = tag.find(text=True)
2448+ findText = b.find(text="Find")
2449+
2450+ self.assertEqual(findText.next, tag)
2451+ self.assertEqual(tag.previous, findText)
2452+ self.assertEqual(b.nextSibling, tag)
2453+ self.assertEqual(tag.previousSibling, b)
2454+ self.assertEqual(tag.nextSibling, c)
2455+ self.assertEqual(c.previousSibling, tag)
2456+
2457+ self.assertEqual(theText.next, c)
2458+ self.assertEqual(c.previous, theText)
2459+
2460+ # And... incredibly complex.
2461+ soup = BeautifulSoup("""<a>We<b>reserve<c>the</c><d>right</d></b></a><e>to<f>refuse</f><g>service</g></e>""")
2462+ f = soup.f
2463+ a = soup.a
2464+ c = soup.c
2465+ e = soup.e
2466+ weText = a.find(text="We")
2467+ soup.b.replaceWith(soup.f)
2468+ self.assertEqual(str(soup), "<a>We<f>refuse</f></a><e>to<g>service</g></e>")
2469+
2470+ self.assertEqual(f.previous, weText)
2471+ self.assertEqual(weText.next, f)
2472+ self.assertEqual(f.previousSibling, weText)
2473+ self.assertEqual(f.nextSibling, None)
2474+ self.assertEqual(weText.nextSibling, f)
2475+
2476+ def testReplaceWithChildren(self):
2477+ soup = BeautifulStoneSoup(
2478+ "<top><replace><child1/><child2/></replace></top>",
2479+ selfClosingTags=["child1", "child2"])
2480+ soup.replaceTag.replaceWithChildren()
2481+ self.assertEqual(soup.top.contents[0].name, "child1")
2482+ self.assertEqual(soup.top.contents[1].name, "child2")
2483+
2484+ def testAppend(self):
2485+ doc = "<p>Don't leave me <b>here</b>.</p> <p>Don't leave me.</p>"
2486+ soup = BeautifulSoup(doc)
2487+ second_para = soup('p')[1]
2488+ bold = soup.find('b')
2489+ soup('p')[1].append(soup.find('b'))
2490+ self.assertEqual(bold.parent, second_para)
2491+ self.assertEqual(str(soup),
2492+ "<p>Don't leave me .</p> "
2493+ "<p>Don't leave me.<b>here</b></p>")
2494+
2495+ def testTagExtraction(self):
2496+ # A very simple case
2497+ text = '<html><div id="nav">Nav crap</div>Real content here.</html>'
2498+ soup = BeautifulSoup(text)
2499+ extracted = soup.find("div", id="nav").extract()
2500+ self.assertEqual(str(soup), "<html>Real content here.</html>")
2501+ self.assertEqual(str(extracted), '<div id="nav">Nav crap</div>')
2502+
2503+ # A simple case, a more complex test.
2504+ text = "<doc><a>1<b>2</b></a><a>i<b>ii</b></a><a>A<b>B</b></a></doc>"
2505+ soup = BeautifulStoneSoup(text)
2506+ doc = soup.doc
2507+ numbers, roman, letters = soup("a")
2508+
2509+ self.assertEqual(roman.parent, doc)
2510+ oldPrevious = roman.previous
2511+ endOfThisTag = roman.nextSibling.previous
2512+ self.assertEqual(oldPrevious, "2")
2513+ self.assertEqual(roman.next, "i")
2514+ self.assertEqual(endOfThisTag, "ii")
2515+ self.assertEqual(roman.previousSibling, numbers)
2516+ self.assertEqual(roman.nextSibling, letters)
2517+
2518+ roman.extract()
2519+ self.assertEqual(roman.parent, None)
2520+ self.assertEqual(roman.previous, None)
2521+ self.assertEqual(roman.next, "i")
2522+ self.assertEqual(letters.previous, '2')
2523+ self.assertEqual(roman.previousSibling, None)
2524+ self.assertEqual(roman.nextSibling, None)
2525+ self.assertEqual(endOfThisTag.next, None)
2526+ self.assertEqual(roman.b.contents[0].next, None)
2527+ self.assertEqual(numbers.nextSibling, letters)
2528+ self.assertEqual(letters.previousSibling, numbers)
2529+ self.assertEqual(len(doc.contents), 2)
2530+ self.assertEqual(doc.contents[0], numbers)
2531+ self.assertEqual(doc.contents[1], letters)
2532+
2533+ # A more complex case.
2534+ text = "<a>1<b>2<c>Hollywood, baby!</c></b></a>3"
2535+ soup = BeautifulStoneSoup(text)
2536+ one = soup.find(text="1")
2537+ three = soup.find(text="3")
2538+ toExtract = soup.b
2539+ soup.b.extract()
2540+ self.assertEqual(one.next, three)
2541+ self.assertEqual(three.previous, one)
2542+ self.assertEqual(one.parent.nextSibling, three)
2543+ self.assertEqual(three.previousSibling, soup.a)
2544+
2545+ def testClear(self):
2546+ soup = BeautifulSoup("<ul><li></li><li></li></ul>")
2547+ soup.ul.clear()
2548+ self.assertEqual(len(soup.ul.contents), 0)
2549+
2550+class TheManWithoutAttributes(SoupTest):
2551+ "Test attribute access"
2552+
2553+ def testHasKey(self):
2554+ text = "<foo attr='bar'>"
2555+ self.assertEquals(BeautifulSoup(text).foo.has_key('attr'), True)
2556+
2557+class QuoteMeOnThat(SoupTest):
2558+ "Test quoting"
2559+ def testQuotedAttributeValues(self):
2560+ self.assertSoupEquals("<foo attr='bar'></foo>",
2561+ '<foo attr="bar"></foo>')
2562+
2563+ text = """<foo attr='bar "brawls" happen'>a</foo>"""
2564+ soup = BeautifulSoup(text)
2565+ self.assertEquals(soup.renderContents(), text)
2566+
2567+ soup.foo['attr'] = 'Brawls happen at "Bob\'s Bar"'
2568+ newText = """<foo attr='Brawls happen at "Bob&squot;s Bar"'>a</foo>"""
2569+ self.assertSoupEquals(soup.renderContents(), newText)
2570+
2571+ self.assertSoupEquals('<this is="really messed up & stuff">',
2572+ '<this is="really messed up &amp; stuff"></this>')
2573+
2574+ # This is not what the original author had in mind, but it's
2575+ # a legitimate interpretation of what they wrote.
2576+ self.assertSoupEquals("""<a href="foo</a>, </a><a href="bar">baz</a>""",
2577+ '<a href="foo&lt;/a&gt;, &lt;/a&gt;&lt;a href="></a>, <a href="bar">baz</a>')
2578+
2579+ # SGMLParser generates bogus parse events when attribute values
2580+ # contain embedded brackets, but at least Beautiful Soup fixes
2581+ # it up a little.
2582+ self.assertSoupEquals('<a b="<a>">', '<a b="&lt;a&gt;"></a><a>"></a>')
2583+ self.assertSoupEquals('<a href="http://foo.com/<a> and blah and blah',
2584+ """<a href='"http://foo.com/'></a><a> and blah and blah</a>""")
2585+
2586+
2587+
2588+class YoureSoLiteral(SoupTest):
2589+ "Test literal mode."
2590+ def testLiteralMode(self):
2591+ text = "<script>if (i<imgs.length)</script><b>Foo</b>"
2592+ soup = BeautifulSoup(text)
2593+ self.assertEqual(soup.script.contents[0], "if (i<imgs.length)")
2594+ self.assertEqual(soup.b.contents[0], "Foo")
2595+
2596+ def testTextArea(self):
2597+ text = "<textarea><b>This is an example of an HTML tag</b><&<&</textarea>"
2598+ soup = BeautifulSoup(text)
2599+ self.assertEqual(soup.textarea.contents[0],
2600+ "<b>This is an example of an HTML tag</b><&<&")
2601+
2602+class OperatorOverload(SoupTest):
2603+ "Our operators do it all! Call now!"
2604+
2605+ def testTagNameAsFind(self):
2606+ "Tests that referencing a tag name as a member delegates to find()."
2607+ soup = BeautifulSoup('<b id="1">foo<i>bar</i></b><b>Red herring</b>')
2608+ self.assertEqual(soup.b.i, soup.find('b').find('i'))
2609+ self.assertEqual(soup.b.i.string, 'bar')
2610+ self.assertEqual(soup.b['id'], '1')
2611+ self.assertEqual(soup.b.contents[0], 'foo')
2612+ self.assert_(not soup.a)
2613+
2614+ #Test the .fooTag variant of .foo.
2615+ self.assertEqual(soup.bTag.iTag.string, 'bar')
2616+ self.assertEqual(soup.b.iTag.string, 'bar')
2617+ self.assertEqual(soup.find('b').find('i'), soup.bTag.iTag)
2618+
2619+class NestableEgg(SoupTest):
2620+ """Here we test tag nesting. TEST THE NEST, DUDE! X-TREME!"""
2621+
2622+ def testParaInsideBlockquote(self):
2623+ soup = BeautifulSoup('<blockquote><p><b>Foo</blockquote><p>Bar')
2624+ self.assertEqual(soup.blockquote.p.b.string, 'Foo')
2625+ self.assertEqual(soup.blockquote.b.string, 'Foo')
2626+ self.assertEqual(soup.find('p', recursive=False).string, 'Bar')
2627+
2628+ def testNestedTables(self):
2629+ text = """<table id="1"><tr><td>Here's another table:
2630+ <table id="2"><tr><td>Juicy text</td></tr></table></td></tr></table>"""
2631+ soup = BeautifulSoup(text)
2632+ self.assertEquals(soup.table.table.td.string, 'Juicy text')
2633+ self.assertEquals(len(soup.findAll('table')), 2)
2634+ self.assertEquals(len(soup.table.findAll('table')), 1)
2635+ self.assertEquals(soup.find('table', {'id' : 2}).parent.parent.parent.name,
2636+ 'table')
2637+
2638+ text = "<table><tr><td><div><table>Foo</table></div></td></tr></table>"
2639+ soup = BeautifulSoup(text)
2640+ self.assertEquals(soup.table.tr.td.div.table.contents[0], "Foo")
2641+
2642+ text = """<table><thead><tr>Foo</tr></thead><tbody><tr>Bar</tr></tbody>
2643+ <tfoot><tr>Baz</tr></tfoot></table>"""
2644+ soup = BeautifulSoup(text)
2645+ self.assertEquals(soup.table.thead.tr.contents[0], "Foo")
2646+
2647+ def testBadNestedTables(self):
2648+ soup = BeautifulSoup("<table><tr><table><tr id='nested'>")
2649+ self.assertEquals(soup.table.tr.table.tr['id'], 'nested')
2650+
2651+class CleanupOnAisleFour(SoupTest):
2652+ """Here we test cleanup of text that breaks SGMLParser or is just
2653+ obnoxious."""
2654+
2655+ def testSelfClosingtag(self):
2656+ self.assertEqual(str(BeautifulSoup("Foo<br/>Bar").find('br')),
2657+ '<br />')
2658+
2659+ self.assertSoupEquals('<p>test1<br/>test2</p>',
2660+ '<p>test1<br />test2</p>')
2661+
2662+ text = '<p>test1<selfclosing>test2'
2663+ soup = BeautifulStoneSoup(text)
2664+ self.assertEqual(str(soup),
2665+ '<p>test1<selfclosing>test2</selfclosing></p>')
2666+
2667+ soup = BeautifulStoneSoup(text, selfClosingTags='selfclosing')
2668+ self.assertEqual(str(soup),
2669+ '<p>test1<selfclosing />test2</p>')
2670+
2671+ def testSelfClosingTagOrNot(self):
2672+ text = "<item><link>http://foo.com/</link></item>"
2673+ self.assertEqual(BeautifulStoneSoup(text).renderContents(), text)
2674+ self.assertEqual(BeautifulSoup(text).renderContents(),
2675+ '<item><link />http://foo.com/</item>')
2676+
2677+ def testCData(self):
2678+ xml = "<root>foo<![CDATA[foobar]]>bar</root>"
2679+ self.assertSoupEquals(xml, xml)
2680+ r = re.compile("foo.*bar")
2681+ soup = BeautifulSoup(xml)
2682+ self.assertEquals(soup.find(text=r).string, "foobar")
2683+ self.assertEquals(soup.find(text=r).__class__, CData)
2684+
2685+ def testComments(self):
2686+ xml = "foo<!--foobar-->baz"
2687+ self.assertSoupEquals(xml)
2688+ r = re.compile("foo.*bar")
2689+ soup = BeautifulSoup(xml)
2690+ self.assertEquals(soup.find(text=r).string, "foobar")
2691+ self.assertEquals(soup.find(text="foobar").__class__, Comment)
2692+
2693+ def testDeclaration(self):
2694+ xml = "foo<!DOCTYPE foobar>baz"
2695+ self.assertSoupEquals(xml)
2696+ r = re.compile(".*foo.*bar")
2697+ soup = BeautifulSoup(xml)
2698+ text = "DOCTYPE foobar"
2699+ self.assertEquals(soup.find(text=r).string, text)
2700+ self.assertEquals(soup.find(text=text).__class__, Declaration)
2701+
2702+ namespaced_doctype = ('<!DOCTYPE xsl:stylesheet SYSTEM "htmlent.dtd">'
2703+ '<html>foo</html>')
2704+ soup = BeautifulSoup(namespaced_doctype)
2705+ self.assertEquals(soup.contents[0],
2706+ 'DOCTYPE xsl:stylesheet SYSTEM "htmlent.dtd"')
2707+ self.assertEquals(soup.html.contents[0], 'foo')
2708+
2709+ def testEntityConversions(self):
2710+ text = "&lt;&lt;sacr&eacute;&#32;bleu!&gt;&gt;"
2711+ soup = BeautifulStoneSoup(text)
2712+ self.assertSoupEquals(text)
2713+
2714+ xmlEnt = BeautifulStoneSoup.XML_ENTITIES
2715+ htmlEnt = BeautifulStoneSoup.HTML_ENTITIES
2716+ xhtmlEnt = BeautifulStoneSoup.XHTML_ENTITIES
2717+
2718+ soup = BeautifulStoneSoup(text, convertEntities=xmlEnt)
2719+ self.assertEquals(str(soup), "<<sacr&eacute; bleu!>>")
2720+
2721+ soup = BeautifulStoneSoup(text, convertEntities=xmlEnt)
2722+ self.assertEquals(str(soup), "<<sacr&eacute; bleu!>>")
2723+
2724+ soup = BeautifulStoneSoup(text, convertEntities=htmlEnt)
2725+ self.assertEquals(unicode(soup), u"<<sacr\xe9 bleu!>>")
2726+
2727+ # Make sure the "XML", "HTML", and "XHTML" settings work.
2728+ text = "&lt;&trade;&apos;"
2729+ soup = BeautifulStoneSoup(text, convertEntities=xmlEnt)
2730+ self.assertEquals(unicode(soup), u"<&trade;'")
2731+
2732+ soup = BeautifulStoneSoup(text, convertEntities=htmlEnt)
2733+ self.assertEquals(unicode(soup), u"<\u2122&apos;")
2734+
2735+ soup = BeautifulStoneSoup(text, convertEntities=xhtmlEnt)
2736+ self.assertEquals(unicode(soup), u"<\u2122'")
2737+
2738+ invalidEntity = "foo&#bar;baz"
2739+ soup = BeautifulStoneSoup\
2740+ (invalidEntity,
2741+ convertEntities=htmlEnt)
2742+ self.assertEquals(str(soup), invalidEntity)
2743+
2744+ def testNonBreakingSpaces(self):
2745+ soup = BeautifulSoup("<a>&nbsp;&nbsp;</a>",
2746+ convertEntities=BeautifulStoneSoup.HTML_ENTITIES)
2747+ self.assertEquals(unicode(soup), u"<a>\xa0\xa0</a>")
2748+
2749+ def testWhitespaceInDeclaration(self):
2750+ self.assertSoupEquals('<! DOCTYPE>', '<!DOCTYPE>')
2751+
2752+ def testJunkInDeclaration(self):
2753+ self.assertSoupEquals('<! Foo = -8>a', '<!Foo = -8>a')
2754+
2755+ def testIncompleteDeclaration(self):
2756+ self.assertSoupEquals('a<!b <p>c')
2757+
2758+ def testEntityReplacement(self):
2759+ self.assertSoupEquals('<b>hello&nbsp;there</b>')
2760+
2761+ def testEntitiesInAttributeValues(self):
2762+ self.assertSoupEquals('<x t="x&#241;">', '<x t="x\xc3\xb1"></x>')
2763+ self.assertSoupEquals('<x t="x&#xf1;">', '<x t="x\xc3\xb1"></x>')
2764+
2765+ soup = BeautifulSoup('<x t="&gt;&trade;">',
2766+ convertEntities=BeautifulStoneSoup.HTML_ENTITIES)
2767+ self.assertEquals(unicode(soup), u'<x t="&gt;\u2122"></x>')
2768+
2769+ uri = "http://crummy.com?sacr&eacute;&amp;bleu"
2770+ link = '<a href="%s"></a>' % uri
2771+ soup = BeautifulSoup(link)
2772+ self.assertEquals(unicode(soup), link)
2773+ #self.assertEquals(unicode(soup.a['href']), uri)
2774+
2775+ soup = BeautifulSoup(link, convertEntities=BeautifulSoup.HTML_ENTITIES)
2776+ self.assertEquals(unicode(soup),
2777+ link.replace("&eacute;", u"\xe9"))
2778+
2779+ uri = "http://crummy.com?sacr&eacute;&bleu"
2780+ link = '<a href="%s"></a>' % uri
2781+ soup = BeautifulSoup(link, convertEntities=BeautifulSoup.HTML_ENTITIES)
2782+ self.assertEquals(unicode(soup.a['href']),
2783+ uri.replace("&eacute;", u"\xe9"))
2784+
2785+ def testNakedAmpersands(self):
2786+ html = {'convertEntities':BeautifulStoneSoup.HTML_ENTITIES}
2787+ soup = BeautifulStoneSoup("AT&T ", **html)
2788+ self.assertEquals(str(soup), 'AT&amp;T ')
2789+
2790+ nakedAmpersandInASentence = "AT&T was Ma Bell"
2791+ soup = BeautifulStoneSoup(nakedAmpersandInASentence,**html)
2792+ self.assertEquals(str(soup), \
2793+ nakedAmpersandInASentence.replace('&','&amp;'))
2794+
2795+ invalidURL = '<a href="http://example.org?a=1&b=2;3">foo</a>'
2796+ validURL = invalidURL.replace('&','&amp;')
2797+ soup = BeautifulStoneSoup(invalidURL)
2798+ self.assertEquals(str(soup), validURL)
2799+
2800+ soup = BeautifulStoneSoup(validURL)
2801+ self.assertEquals(str(soup), validURL)
2802+
2803+
2804+class EncodeRed(SoupTest):
2805+ """Tests encoding conversion, Unicode conversion, and Microsoft
2806+ smart quote fixes."""
2807+
2808+ def testUnicodeDammitStandalone(self):
2809+ markup = "<foo>\x92</foo>"
2810+ dammit = UnicodeDammit(markup)
2811+ self.assertEquals(dammit.unicode, "<foo>&#x2019;</foo>")
2812+
2813+ hebrew = "\xed\xe5\xec\xf9"
2814+ dammit = UnicodeDammit(hebrew, ["iso-8859-8"])
2815+ self.assertEquals(dammit.unicode, u'\u05dd\u05d5\u05dc\u05e9')
2816+ self.assertEquals(dammit.originalEncoding, 'iso-8859-8')
2817+
2818+ def testGarbageInGarbageOut(self):
2819+ ascii = "<foo>a</foo>"
2820+ asciiSoup = BeautifulStoneSoup(ascii)
2821+ self.assertEquals(ascii, str(asciiSoup))
2822+
2823+ unicodeData = u"<foo>\u00FC</foo>"
2824+ utf8 = unicodeData.encode("utf-8")
2825+ self.assertEquals(utf8, '<foo>\xc3\xbc</foo>')
2826+
2827+ unicodeSoup = BeautifulStoneSoup(unicodeData)
2828+ self.assertEquals(unicodeData, unicode(unicodeSoup))
2829+ self.assertEquals(unicode(unicodeSoup.foo.string), u'\u00FC')
2830+
2831+ utf8Soup = BeautifulStoneSoup(utf8, fromEncoding='utf-8')
2832+ self.assertEquals(utf8, str(utf8Soup))
2833+ self.assertEquals(utf8Soup.originalEncoding, "utf-8")
2834+
2835+ utf8Soup = BeautifulStoneSoup(unicodeData)
2836+ self.assertEquals(utf8, str(utf8Soup))
2837+ self.assertEquals(utf8Soup.originalEncoding, None)
2838+
2839+
2840+ def testHandleInvalidCodec(self):
2841+ for bad_encoding in ['.utf8', '...', 'utF---16.!']:
2842+ soup = BeautifulSoup("Räksmörgås", fromEncoding=bad_encoding)
2843+ self.assertEquals(soup.originalEncoding, 'utf-8')
2844+
2845+ def testUnicodeSearch(self):
2846+ html = u'<html><body><h1>Räksmörgås</h1></body></html>'
2847+ soup = BeautifulSoup(html)
2848+ self.assertEqual(soup.find(text=u'Räksmörgås'),u'Räksmörgås')
2849+
2850+ def testRewrittenXMLHeader(self):
2851+ euc_jp = '<?xml version="1.0 encoding="euc-jp"?>\n<foo>\n\xa4\xb3\xa4\xec\xa4\xcfEUC-JP\xa4\xc7\xa5\xb3\xa1\xbc\xa5\xc7\xa5\xa3\xa5\xf3\xa5\xb0\xa4\xb5\xa4\xec\xa4\xbf\xc6\xfc\xcb\xdc\xb8\xec\xa4\xce\xa5\xd5\xa5\xa1\xa5\xa4\xa5\xeb\xa4\xc7\xa4\xb9\xa1\xa3\n</foo>\n'
2852+ utf8 = "<?xml version='1.0' encoding='utf-8'?>\n<foo>\n\xe3\x81\x93\xe3\x82\x8c\xe3\x81\xafEUC-JP\xe3\x81\xa7\xe3\x82\xb3\xe3\x83\xbc\xe3\x83\x87\xe3\x82\xa3\xe3\x83\xb3\xe3\x82\xb0\xe3\x81\x95\xe3\x82\x8c\xe3\x81\x9f\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e\xe3\x81\xae\xe3\x83\x95\xe3\x82\xa1\xe3\x82\xa4\xe3\x83\xab\xe3\x81\xa7\xe3\x81\x99\xe3\x80\x82\n</foo>\n"
2853+ soup = BeautifulStoneSoup(euc_jp)
2854+ if soup.originalEncoding != "euc-jp":
2855+ raise Exception("Test failed when parsing euc-jp document. "
2856+ "If you're running Python >=2.4, or you have "
2857+ "cjkcodecs installed, this is a real problem. "
2858+ "Otherwise, ignore it.")
2859+
2860+ self.assertEquals(soup.originalEncoding, "euc-jp")
2861+ self.assertEquals(str(soup), utf8)
2862+
2863+ old_text = "<?xml encoding='windows-1252'><foo>\x92</foo>"
2864+ new_text = "<?xml version='1.0' encoding='utf-8'?><foo>&rsquo;</foo>"
2865+ self.assertSoupEquals(old_text, new_text)
2866+
2867+ def testRewrittenMetaTag(self):
2868+ no_shift_jis_html = '''<html><head>\n<meta http-equiv="Content-language" content="ja" /></head><body><pre>\n\x82\xb1\x82\xea\x82\xcdShift-JIS\x82\xc5\x83R\x81[\x83f\x83B\x83\x93\x83O\x82\xb3\x82\xea\x82\xbd\x93\xfa\x96{\x8c\xea\x82\xcc\x83t\x83@\x83C\x83\x8b\x82\xc5\x82\xb7\x81B\n</pre></body></html>'''
2869+ soup = BeautifulSoup(no_shift_jis_html)
2870+
2871+ # Beautiful Soup used to try to rewrite the meta tag even if the
2872+ # meta tag got filtered out by the strainer. This test makes
2873+ # sure that doesn't happen.
2874+ strainer = SoupStrainer('pre')
2875+ soup = BeautifulSoup(no_shift_jis_html, parseOnlyThese=strainer)
2876+ self.assertEquals(soup.contents[0].name, 'pre')
2877+
2878+ meta_tag = ('<meta content="text/html; charset=x-sjis" '
2879+ 'http-equiv="Content-type" />')
2880+ shift_jis_html = (
2881+ '<html><head>\n%s\n'
2882+ '<meta http-equiv="Content-language" content="ja" />'
2883+ '</head><body><pre>\n'
2884+ '\x82\xb1\x82\xea\x82\xcdShift-JIS\x82\xc5\x83R\x81[\x83f'
2885+ '\x83B\x83\x93\x83O\x82\xb3\x82\xea\x82\xbd\x93\xfa\x96{\x8c'
2886+ '\xea\x82\xcc\x83t\x83@\x83C\x83\x8b\x82\xc5\x82\xb7\x81B\n'
2887+ '</pre></body></html>') % meta_tag
2888+ soup = BeautifulSoup(shift_jis_html)
2889+ if soup.originalEncoding != "shift-jis":
2890+ raise Exception("Test failed when parsing shift-jis document "
2891+ "with meta tag '%s'."
2892+ "If you're running Python >=2.4, or you have "
2893+ "cjkcodecs installed, this is a real problem. "
2894+ "Otherwise, ignore it." % meta_tag)
2895+ self.assertEquals(soup.originalEncoding, "shift-jis")
2896+
2897+ content_type_tag = soup.meta['content']
2898+ self.assertEquals(content_type_tag[content_type_tag.find('charset='):],
2899+ 'charset=%SOUP-ENCODING%')
2900+ content_type = str(soup.meta)
2901+ index = content_type.find('charset=')
2902+ self.assertEqual(content_type[index:index+len('charset=utf8')+1],
2903+ 'charset=utf-8')
2904+ content_type = soup.meta.__str__('shift-jis')
2905+ index = content_type.find('charset=')
2906+ self.assertEqual(content_type[index:index+len('charset=shift-jis')],
2907+ 'charset=shift-jis')
2908+
2909+ self.assertEquals(str(soup), (
2910+ '<html><head>\n'
2911+ '<meta content="text/html; charset=utf-8" '
2912+ 'http-equiv="Content-type" />\n'
2913+ '<meta http-equiv="Content-language" content="ja" />'
2914+ '</head><body><pre>\n'
2915+ '\xe3\x81\x93\xe3\x82\x8c\xe3\x81\xafShift-JIS\xe3\x81\xa7\xe3'
2916+ '\x82\xb3\xe3\x83\xbc\xe3\x83\x87\xe3\x82\xa3\xe3\x83\xb3\xe3'
2917+ '\x82\xb0\xe3\x81\x95\xe3\x82\x8c\xe3\x81\x9f\xe6\x97\xa5\xe6'
2918+ '\x9c\xac\xe8\xaa\x9e\xe3\x81\xae\xe3\x83\x95\xe3\x82\xa1\xe3'
2919+ '\x82\xa4\xe3\x83\xab\xe3\x81\xa7\xe3\x81\x99\xe3\x80\x82\n'
2920+ '</pre></body></html>'))
2921+ self.assertEquals(soup.renderContents("shift-jis"),
2922+ shift_jis_html.replace('x-sjis', 'shift-jis'))
2923+
2924+ isolatin ="""<html><meta http-equiv="Content-type" content="text/html; charset=ISO-Latin-1" />Sacr\xe9 bleu!</html>"""
2925+ soup = BeautifulSoup(isolatin)
2926+ self.assertSoupEquals(soup.__str__("utf-8"),
2927+ isolatin.replace("ISO-Latin-1", "utf-8").replace("\xe9", "\xc3\xa9"))
2928+
2929+ def testHebrew(self):
2930+ iso_8859_8= '<HEAD>\n<TITLE>Hebrew (ISO 8859-8) in Visual Directionality</TITLE>\n\n\n\n</HEAD>\n<BODY>\n<H1>Hebrew (ISO 8859-8) in Visual Directionality</H1>\n\xed\xe5\xec\xf9\n</BODY>\n'
2931+ utf8 = '<head>\n<title>Hebrew (ISO 8859-8) in Visual Directionality</title>\n</head>\n<body>\n<h1>Hebrew (ISO 8859-8) in Visual Directionality</h1>\n\xd7\x9d\xd7\x95\xd7\x9c\xd7\xa9\n</body>\n'
2932+ soup = BeautifulStoneSoup(iso_8859_8, fromEncoding="iso-8859-8")
2933+ self.assertEquals(str(soup), utf8)
2934+
2935+ def testSmartQuotesNotSoSmartAnymore(self):
2936+ self.assertSoupEquals("\x91Foo\x92 <!--blah-->",
2937+ '&lsquo;Foo&rsquo; <!--blah-->')
2938+
2939+ def testDontConvertSmartQuotesWhenAlsoConvertingEntities(self):
2940+ smartQuotes = "Il a dit, \x8BSacr&eacute; bl&#101;u!\x9b"
2941+ soup = BeautifulSoup(smartQuotes)
2942+ self.assertEquals(str(soup),
2943+ 'Il a dit, &lsaquo;Sacr&eacute; bl&#101;u!&rsaquo;')
2944+ soup = BeautifulSoup(smartQuotes, convertEntities="html")
2945+ self.assertEquals(str(soup),
2946+ 'Il a dit, \xe2\x80\xb9Sacr\xc3\xa9 bleu!\xe2\x80\xba')
2947+
2948+ def testDontSeeSmartQuotesWhereThereAreNone(self):
2949+ utf_8 = "\343\202\261\343\203\274\343\202\277\343\202\244 Watch"
2950+ self.assertSoupEquals(utf_8)
2951+
2952+
2953+class Whitewash(SoupTest):
2954+ """Test whitespace preservation."""
2955+
2956+ def testPreservedWhitespace(self):
2957+ self.assertSoupEquals("<pre> </pre>")
2958+ self.assertSoupEquals("<pre> woo </pre>")
2959+
2960+ def testCollapsedWhitespace(self):
2961+ self.assertSoupEquals("<p> </p>", "<p> </p>")
2962+
2963+
2964+if __name__ == '__main__':
2965+ unittest.main()
2966
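
For readers skimming the diff, the encoding, entity, and whitespace tests above boil down to a few observable behaviors. A minimal sketch, assuming the BeautifulSoup module from this branch is importable (Python 2 era API; the expected values mirror the assertions in the tests):

    from BeautifulSoup import BeautifulSoup, BeautifulStoneSoup

    # Encoding detection: the original encoding is recorded, and the
    # document is re-rendered as UTF-8 by default.
    soup = BeautifulStoneSoup('Sacr\xe9 bleu!', fromEncoding='iso-8859-1')
    print soup.originalEncoding    # 'iso-8859-1'
    print str(soup)                # UTF-8 bytes: 'Sacr\xc3\xa9 bleu!'

    # Entity conversion: convertEntities="html" turns entity references
    # into the characters they name (as UTF-8 bytes).
    soup = BeautifulSoup('Sacr&eacute; bleu!', convertEntities='html')
    print str(soup)                # 'Sacr\xc3\xa9 bleu!'

    # Whitespace: runs of whitespace inside <pre> survive a round trip.
    soup = BeautifulSoup('<pre>  woo  </pre>')
    print str(soup)                # '<pre>  woo  </pre>'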
2967=== renamed file 'CHANGELOG' => 'CHANGELOG.THIS'
2968=== added file 'NEWS'
2969--- NEWS 1970-01-01 00:00:00 +0000
2970+++ NEWS 2011-05-27 07:52:31 +0000
2971@@ -0,0 +1,79 @@
2972+Beautiful Soup 3.2.x series
2973+***************************
2974+
2975+This is the 'stable' series of Beautiful Soup. It will have only
2976+occasional bugfix releases. It will not work with alternate parsers or
2977+with Python 3.0. If you need these things, you'll need to use the 3.1
2978+series.
2979+
2980+3.2.0
2981+=====
2982+
2983+Gave the stable series a higher version number than the unstable
2984+series, to make it very clear which series most people should be using.
2985+
2986+When creating a Tag object, you can specify its attributes as a dict
2987+rather than as a list of 2-tuples.
2988+
2989+3.0.8.1
2990+=======
2991+
2992+Bug fixes
2993+---------
2994+
2995+Corrected Beautiful Soup's behavior when a findAll() call contained a
2996+value for the "text" argument as well as values for arguments that
2997+imply it should search for tags. (The "text" argument takes priority
2998+and text is returned, not tags.)
2999+
3000+Corrected a typo that made I_CANT_BELIEVE_THEYRE_NESTABLE_BLOCK_TAGS
3001+stop being a tuple.
3002+
3003+3.0.8
3004+=====
3005+
3006+Inauguration of the 3.0.x series as the stable series.
3007+
3008+New features
3009+------------
3010+
3011+Tag.replaceWithChildren()
3012+ Replace a tag with its children.
3013+
3014+Tag.string assignment
3015+ `tag.string = string` replaces the contents of a tag with `string`.
3016+
3017+Tag.text property (NOT A FUNCTION!)
3018+ tag.text gathers together and joins all text children. Much faster than
3019+ "".join(tag.findAll(text=True))
3020+
3021+Tag.getText(separator=u"")
3022+ Same as Tag.text, but a function that allows a custom separator between joined
3023+ text elements.
3024+
3025+Tag.index(element) -> int
3026+ Returns the index of `element` within the tag. Matches the actual
3027+ element instead of using __eq__.
3028+
3029+Tag.clear()
3030+ Remove all child elements.
3031+
3032+Improvements
3033+------------
3034+
3035+Previously, searching by CSS class only matched tags that had the
3036+requested CSS class and no other classes. Now, searching by CSS class
3037+matches every tag that uses that class.
3038+
3039+Performance
3040+-----------
3041+
3042+Beware! Although searching the tree is much faster in 3.0.8 than in
3043+previous versions, you probably won't notice the difference in real
3044+situations, because the time spent searching the tree is typically
3045+dwarfed by the time spent parsing the file in the first place.
3046+
3047+Tag.decompose() is several times faster.
3048+A very basic findAll(...) is several times faster.
3049+findAll(True) is special-cased.
3050+Tag.recursiveChildGenerator is much faster.
3051
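
As a quick illustration of the 3.0.8 and 3.2.0 features listed in the NEWS entry above, here is a minimal sketch against the BS 3.x API (Python 2; output values are illustrative, and with this branch's fix, .text keeps the space around the <i> element):

    from BeautifulSoup import BeautifulSoup, Tag

    soup = BeautifulSoup('<p>This is a <i>test</i>, ok?</p>')
    p = soup.p

    print p.text                       # u'This is a test, ok?' (with this fix)
    print p.getText(separator=u'|')    # custom separator between text children
    print p.index(p.find('i'))         # position of the <i> element in p.contents

    p.string = 'replaced'              # replaces the tag's entire contents
    print str(soup)                    # '<p>replaced</p>'
    p.clear()                          # removes all child elements

    # Improved CSS class search: matches tags that carry other classes too.
    soup2 = BeautifulSoup('<p class="a b">hi</p>')
    print soup2.findAll('p', 'a')      # matches despite the extra class "b"

    # 3.2.0: Tag attributes may be given as a dict instead of 2-tuples.
    div = Tag(soup, 'div', {'id': 'main'})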
3052=== added file 'PKG-INFO'
3053--- PKG-INFO 1970-01-01 00:00:00 +0000
3054+++ PKG-INFO 2011-05-27 07:52:31 +0000
3055@@ -0,0 +1,19 @@
3056+Metadata-Version: 1.0
3057+Name: BeautifulSoup
3058+Version: 3.0.7a
3059+Summary: HTML/XML parser for quick-turnaround applications like screen-scraping.
3060+Home-page: http://www.crummy.com/software/BeautifulSoup/
3061+Author: Leonard Richardson
3062+Author-email: leonardr@segfault.org
3063+License: BSD
3064+Download-URL: http://www.crummy.com/software/BeautifulSoup/download/
3065+Description: Beautiful Soup parses arbitrarily invalid SGML and provides a variety of methods and Pythonic idioms for iterating and searching the parse tree.
3066+Platform: UNKNOWN
3067+Classifier: Development Status :: 5 - Production/Stable
3068+Classifier: Intended Audience :: Developers
3069+Classifier: License :: OSI Approved :: Python Software Foundation License
3070+Classifier: Programming Language :: Python
3071+Classifier: Topic :: Text Processing :: Markup :: HTML
3072+Classifier: Topic :: Text Processing :: Markup :: XML
3073+Classifier: Topic :: Text Processing :: Markup :: SGML
3074+Classifier: Topic :: Software Development :: Libraries :: Python Modules
3075
3076=== renamed file 'README.txt' => 'README.txt.THIS'
3077=== renamed file 'bs4/__init__.py' => 'bs4/__init__.py.THIS'
3078=== renamed file 'bs4/builder/__init__.py' => 'bs4/builder/__init__.py.THIS'
3079=== renamed file 'bs4/builder/_lxml.py' => 'bs4/builder/_lxml.py.THIS'
3080=== renamed file 'bs4/dammit.py' => 'bs4/dammit.py.THIS'
3081=== renamed file 'bs4/element.py' => 'bs4/element.py.THIS'
3082=== renamed file 'bs4/testing.py' => 'bs4/testing.py.THIS'
3083=== removed directory 'docs'
3084=== removed file 'docs/__init__.py'
3085--- docs/__init__.py 2009-04-10 15:48:02 +0000
3086+++ docs/__init__.py 1970-01-01 00:00:00 +0000
3087@@ -1,1 +0,0 @@
3088-"""Executable documentation about beautifulsoup."""
3089
3090=== added file 'setup.py'
3091--- setup.py 1970-01-01 00:00:00 +0000
3092+++ setup.py 2011-05-27 07:52:31 +0000
3093@@ -0,0 +1,60 @@
3094+from distutils.core import setup
3095+import unittest
3096+import warnings
3097+warnings.filterwarnings("ignore", "Unknown distribution option")
3098+
3099+import sys
3100+# patch distutils if it can't cope with the "classifiers" keyword
3101+if sys.version < '2.2.3':
3102+ from distutils.dist import DistributionMetadata
3103+ DistributionMetadata.classifiers = None
3104+ DistributionMetadata.download_url = None
3105+
3106+from BeautifulSoup import __version__
3107+
3108+# Make sure all the tests complete.
3109+import BeautifulSoupTests
3110+loader = unittest.TestLoader()
3111+result = unittest.TestResult()
3112+suite = loader.loadTestsFromModule(BeautifulSoupTests)
3113+suite.run(result)
3114+if not result.wasSuccessful():
3115+ print "Unit tests have failed!"
3116+ for l in result.errors, result.failures:
3117+ for case, error in l:
3118+ print "-" * 80
3119+ desc = case.shortDescription()
3120+ if desc:
3121+ print desc
3122+ print error
3123+ print '''If you see an error like: "'ascii' codec can't encode character...", see\nthe Beautiful Soup documentation:\n http://www.crummy.com/software/BeautifulSoup/documentation.html#Why%20can't%20Beautiful%20Soup%20print%20out%20the%20non-ASCII%20characters%20I%20gave%20it?'''
3124+ print "This might or might not be a problem depending on what you plan to do with\nBeautiful Soup."
3125+ if sys.argv[1] == 'sdist':
3126+ print
3127+ print "I'm not going to make a source distribution since the tests don't pass."
3128+ sys.exit(1)
3129+
3130+setup(name="BeautifulSoup",
3131+ version=__version__,
3132+ py_modules=['BeautifulSoup', 'BeautifulSoupTests'],
3133+ description="HTML/XML parser for quick-turnaround applications like screen-scraping.",
3134+ author="Leonard Richardson",
3135+ author_email = "leonardr@segfault.org",
3136+ long_description="""Beautiful Soup parses arbitrarily invalid SGML and provides a variety of methods and Pythonic idioms for iterating and searching the parse tree.""",
3137+ classifiers=["Development Status :: 5 - Production/Stable",
3138+ "Intended Audience :: Developers",
3139+ "License :: OSI Approved :: Python Software Foundation License",
3140+ "Programming Language :: Python",
3141+ "Topic :: Text Processing :: Markup :: HTML",
3142+ "Topic :: Text Processing :: Markup :: XML",
3143+ "Topic :: Text Processing :: Markup :: SGML",
3144+ "Topic :: Software Development :: Libraries :: Python Modules",
3145+ ],
3146+ url="http://www.crummy.com/software/BeautifulSoup/",
3147+ license="BSD",
3148+ download_url="http://www.crummy.com/software/BeautifulSoup/download/"
3149+ )
3150+
3151+ # Send announce to:
3152+ # python-announce@python.org
3153+ # python-list@python.org
3154
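
One nit on the setup script above: it reads sys.argv[1] unconditionally, which raises IndexError when setup.py is run with no arguments. A minimal sketch of the same "run the tests before packaging" gate with a safer argument check (same BeautifulSoupTests module as above):

    import sys
    import unittest

    import BeautifulSoupTests

    # Run the whole test module and collect the results.
    result = unittest.TestResult()
    suite = unittest.TestLoader().loadTestsFromModule(BeautifulSoupTests)
    suite.run(result)

    if not result.wasSuccessful() and 'sdist' in sys.argv[1:]:
        # Refuse to cut a source release from a failing tree.
        sys.exit("Not building a source distribution: tests failed.")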
3155=== removed file 'tests/__init__.py'
3156--- tests/__init__.py 2009-04-10 15:48:02 +0000
3157+++ tests/__init__.py 1970-01-01 00:00:00 +0000
3158@@ -1,1 +0,0 @@
3159-"The beautifulsoup tests."
3160
3161=== removed file 'tests/test_docs.py'
3162--- tests/test_docs.py 2009-04-10 15:48:02 +0000
3163+++ tests/test_docs.py 1970-01-01 00:00:00 +0000
3164@@ -1,36 +0,0 @@
3165-"Test harness for doctests."
3166-
3167-# pylint: disable-msg=E0611,W0142
3168-
3169-__metaclass__ = type
3170-__all__ = [
3171- 'additional_tests',
3172- ]
3173-
3174-import atexit
3175-import doctest
3176-import os
3177-from pkg_resources import (
3178- resource_filename, resource_exists, resource_listdir, cleanup_resources)
3179-import unittest
3180-
3181-DOCTEST_FLAGS = (
3182- doctest.ELLIPSIS |
3183- doctest.NORMALIZE_WHITESPACE |
3184- doctest.REPORT_NDIFF)
3185-
3186-
3187-def additional_tests():
3188- "Run the doc tests (README.txt and docs/*, if any exist)"
3189- doctest_files = [
3190- os.path.abspath(resource_filename('beautifulsoup', 'README.txt'))]
3191- if resource_exists('beautifulsoup', 'docs'):
3192- for name in resource_listdir('beautifulsoup', 'docs'):
3193- if name.endswith('.txt'):
3194- doctest_files.append(
3195- os.path.abspath(
3196- resource_filename('beautifulsoup', 'docs/%s' % name)))
3197- kwargs = dict(module_relative=False, optionflags=DOCTEST_FLAGS)
3198- atexit.register(cleanup_resources)
3199- return unittest.TestSuite((
3200- doctest.DocFileSuite(*doctest_files, **kwargs)))
3201
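
The removed tests/test_docs.py above collected doctests from README.txt and docs/*.txt via pkg_resources; the core of the pattern is doctest.DocFileSuite. A minimal sketch, assuming a README.txt exists relative to the working directory:

    import doctest
    import unittest

    # Build a test suite from doctests embedded in a text file.
    suite = unittest.TestSuite(doctest.DocFileSuite(
        'README.txt',                  # assumed path, relative to cwd
        module_relative=False,
        optionflags=doctest.ELLIPSIS | doctest.NORMALIZE_WHITESPACE))

    unittest.TextTestRunner().run(suite)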
3202=== renamed file 'tests/test_lxml.py' => 'tests/test_lxml.py.THIS'
3203=== renamed file 'tests/test_soup.py' => 'tests/test_soup.py.THIS'
