Merge lp:~mjumbewu/beautifulsoup/text-white-space-fix into lp:beautifulsoup
Status: Superseded
Proposed branch: lp:~mjumbewu/beautifulsoup/text-white-space-fix
Merge into: lp:beautifulsoup
Diff against target: 3203 lines (+3075/-72) (has conflicts), 9 files modified:
  AUTHORS (+0/-34), BeautifulSoup.py (+2014/-0), BeautifulSoupTests.py (+903/-0),
  NEWS (+79/-0), PKG-INFO (+19/-0), docs/__init__.py (+0/-1), setup.py (+60/-0),
  tests/__init__.py (+0/-1), tests/test_docs.py (+0/-36)
Conflicts reported by the merge:
  Path conflict: AUTHORS / <deleted>
  Contents conflict in CHANGELOG
  Path conflict: CHANGELOG / <deleted>
  Contents conflict in README.txt
  Path conflict: README.txt / <deleted>
  Contents conflict in bs4/__init__.py
  Path conflict: bs4/__init__.py / <deleted>
  Conflict: can't delete bs4/builder because it is not empty. Not deleting.
  Path conflict: bs4/builder / <deleted>
  Conflict because bs4/builder is not versioned, but has versioned children. Versioned directory.
  Contents conflict in bs4/builder/__init__.py
  Contents conflict in bs4/builder/_lxml.py
  Path conflict: bs4/builder/_lxml.py / <deleted>
  Contents conflict in bs4/dammit.py
  Path conflict: bs4/dammit.py / <deleted>
  Contents conflict in bs4/element.py
  Path conflict: bs4/element.py / <deleted>
  Contents conflict in bs4/testing.py
  Path conflict: bs4/testing.py / <deleted>
  Path conflict: docs / <deleted>
  Conflict: can't delete tests because it is not empty. Not deleting.
  Path conflict: tests / <deleted>
  Conflict because tests is not versioned, but has versioned children. Versioned directory.
  Contents conflict in tests/test_lxml.py
  Contents conflict in tests/test_soup.py
To merge this branch: bzr merge lp:~mjumbewu/beautifulsoup/text-white-space-fix
Related bugs: (none listed)
Reviewer: Leonard Richardson (Pending)
Review via email: mp+62619@code.launchpad.net
This proposal has been superseded by a proposal from 2011-05-27.
Commit message
Description of the change
BeautifulSoup removes too much white space in getText. For example, the text of "<p>This is a <i>test</i>, ok?" should be "This is a test, ok?"; instead, BS calculates it as "This is atest, ok?".

This invalidates bug #788986.
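To make the proposed behavior concrete, here is a minimal sketch (not part of the branch) using the BeautifulSoup.py added in this diff; the "after" output is inferred from this description and from revisions 44 and 45 below, not verified against the patched code:

```python
# Python 2; assumes the BeautifulSoup.py from this branch is importable.
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup('<p>This is a <i>test</i>, ok?</p>')

# Before the fix (as described above): the space before the <i> tag is lost.
#   soup.p.getText()  ->  u'This is atest, ok?'
# After the fix (inferred from revisions 44 and 45): spacing between text
# nodes is preserved, and runs of white space are truncated to one character.
#   soup.p.getText()  ->  u'This is a test, ok?'
print soup.p.getText()
```

Note that the getText added in this diff also takes an optional separator argument (default u""), so callers can join the text nodes explicitly, e.g. soup.p.getText(separator=u" ").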
Unmerged revisions
- 45. By Mjumbe Wawatu Ukweli: In getText, multiple white space characters get truncated to one.
- 44. By Mjumbe Wawatu Ukweli: Preserve spacing when using getText.
- 43. By Leonard Richardson: Revved version number.
- 42. By Leonard Richardson: When creating a Tag object, you can specify its attributes as a dict rather than as a list of 2-tuples.
- 41. By Leonard Richardson: Fix a typo and prep for release.
- 40. By Leonard Richardson: Cleaned up tests.
- 39. By Leonard Richardson: Applied Aaron's fix for bug 493722.
- 38. By Leonard Richardson: Added a failing test for bug 493722.
- 37. By Leonard Richardson: Fixed whitespace.
- 36. By Leonard Richardson: Changed iterators not to block on empty strings. Restored the set code since 2.2 doesn't work on this code anyway.
Preview Diff
1 | === removed file 'AUTHORS' |
2 | --- AUTHORS 2011-01-28 16:39:36 +0000 |
3 | +++ AUTHORS 1970-01-01 00:00:00 +0000 |
4 | @@ -1,34 +0,0 @@ |
5 | -Behold, mortal, the origins of Beautiful Soup... |
6 | -================================================ |
7 | - |
8 | -Leonard Richardson is the primary programmer. |
9 | - |
10 | -Sam Ruby helps with a lot of edge cases. |
11 | - |
12 | -Mark Pilgrim provided the encoding detection code that forms the base |
13 | -of UnicodeDammit. |
14 | - |
15 | -Jonathan Ellis was awarded the prestigous Beau Potage D'Or for his |
16 | -work in solving the nestable tags conundrum. |
17 | - |
18 | -The following people have contributed patches to Beautiful Soup: |
19 | - |
20 | - Istvan Albert, Andrew Lin, Anthony Baxter, Andrew Boyko, Tony Chang, |
21 | - Zephyr Fang, Fuzzy, Roman Gaufman, Yoni Gilad, Richie Hindle, Peteris |
22 | - Krumins, Kent Johnson, Ben Last, Robert Leftwich, Staffan Malmgren, |
23 | - Ksenia Marasanova, JP Moins, Adam Monsen, John Nagle, "Jon", Ed |
24 | - Oskiewicz, Greg Phillips, Giles Radford, Arthur Rudolph, Marko |
25 | - Samastur, Jouni Seppänen, Alexander Schmolck, Andy Theyers, Glyn |
26 | - Webster, Paul Wright, Danny Yoo |
27 | - |
28 | -The following people made suggestions or found bugs or found ways to |
29 | -break Beautiful Soup: |
30 | - |
31 | - Hanno Böck, Matteo Bertini, Chris Curvey, Simon Cusack, Matt Ernst, |
32 | - Michael Foord, Tom Harris, Bill de hOra, Donald Howes, Matt |
33 | - Patterson, Scott Roberts, Steve Strassmann, Mike Williams, warchild |
34 | - at redho dot com, Sami Kuisma, Carlos Rocha, Bob Hutchison, Joren Mc, |
35 | - Michal Migurski, John Kleven, Tim Heaney, Tripp Lilley, Ed Summers, |
36 | - Dennis Sutch, Chris Smith, Aaron Sweep^W Swartz, Stuart Turner, Greg |
37 | - Edwards, Kevin J Kalupson, Nikos Kouremenos, Artur de Sousa Rocha, |
38 | - Yichun Wei, Per Vognsen |
39 | |
40 | === added file 'BeautifulSoup.py' |
41 | --- BeautifulSoup.py 1970-01-01 00:00:00 +0000 |
42 | +++ BeautifulSoup.py 2011-05-27 07:52:31 +0000 |
43 | @@ -0,0 +1,2014 @@ |
44 | +"""Beautiful Soup |
45 | +Elixir and Tonic |
46 | +"The Screen-Scraper's Friend" |
47 | +http://www.crummy.com/software/BeautifulSoup/ |
48 | + |
49 | +Beautiful Soup parses a (possibly invalid) XML or HTML document into a |
50 | +tree representation. It provides methods and Pythonic idioms that make |
51 | +it easy to navigate, search, and modify the tree. |
52 | + |
53 | +A well-formed XML/HTML document yields a well-formed data |
54 | +structure. An ill-formed XML/HTML document yields a correspondingly |
55 | +ill-formed data structure. If your document is only locally |
56 | +well-formed, you can use this library to find and process the |
57 | +well-formed part of it. |
58 | + |
59 | +Beautiful Soup works with Python 2.2 and up. It has no external |
60 | +dependencies, but you'll have more success at converting data to UTF-8 |
61 | +if you also install these three packages: |
62 | + |
63 | +* chardet, for auto-detecting character encodings |
64 | + http://chardet.feedparser.org/ |
65 | +* cjkcodecs and iconv_codec, which add more encodings to the ones supported |
66 | + by stock Python. |
67 | + http://cjkpython.i18n.org/ |
68 | + |
69 | +Beautiful Soup defines classes for two main parsing strategies: |
70 | + |
71 | + * BeautifulStoneSoup, for parsing XML, SGML, or your domain-specific |
72 | + language that kind of looks like XML. |
73 | + |
74 | + * BeautifulSoup, for parsing run-of-the-mill HTML code, be it valid |
75 | + or invalid. This class has web browser-like heuristics for |
76 | + obtaining a sensible parse tree in the face of common HTML errors. |
77 | + |
78 | +Beautiful Soup also defines a class (UnicodeDammit) for autodetecting |
79 | +the encoding of an HTML or XML document, and converting it to |
80 | +Unicode. Much of this code is taken from Mark Pilgrim's Universal Feed Parser. |
81 | + |
82 | +For more than you ever wanted to know about Beautiful Soup, see the |
83 | +documentation: |
84 | +http://www.crummy.com/software/BeautifulSoup/documentation.html |
85 | + |
86 | +Here, have some legalese: |
87 | + |
88 | +Copyright (c) 2004-2010, Leonard Richardson |
89 | + |
90 | +All rights reserved. |
91 | + |
92 | +Redistribution and use in source and binary forms, with or without |
93 | +modification, are permitted provided that the following conditions are |
94 | +met: |
95 | + |
96 | + * Redistributions of source code must retain the above copyright |
97 | + notice, this list of conditions and the following disclaimer. |
98 | + |
99 | + * Redistributions in binary form must reproduce the above |
100 | + copyright notice, this list of conditions and the following |
101 | + disclaimer in the documentation and/or other materials provided |
102 | + with the distribution. |
103 | + |
104 | + * Neither the name of the the Beautiful Soup Consortium and All |
105 | + Night Kosher Bakery nor the names of its contributors may be |
106 | + used to endorse or promote products derived from this software |
107 | + without specific prior written permission. |
108 | + |
109 | +THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS |
110 | +"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT |
111 | +LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR |
112 | +A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR |
113 | +CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, |
114 | +EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, |
115 | +PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR |
116 | +PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF |
117 | +LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING |
118 | +NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS |
119 | +SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE, DAMMIT. |
120 | + |
121 | +""" |
122 | +from __future__ import generators |
123 | + |
124 | +__author__ = "Leonard Richardson (leonardr@segfault.org)" |
125 | +__version__ = "3.2.0" |
126 | +__copyright__ = "Copyright (c) 2004-2010 Leonard Richardson" |
127 | +__license__ = "New-style BSD" |
128 | + |
129 | +from sgmllib import SGMLParser, SGMLParseError |
130 | +import codecs |
131 | +import markupbase |
132 | +import types |
133 | +import re |
134 | +import sgmllib |
135 | +try: |
136 | + from htmlentitydefs import name2codepoint |
137 | +except ImportError: |
138 | + name2codepoint = {} |
139 | +try: |
140 | + set |
141 | +except NameError: |
142 | + from sets import Set as set |
143 | + |
144 | +#These hacks make Beautiful Soup able to parse XML with namespaces |
145 | +sgmllib.tagfind = re.compile('[a-zA-Z][-_.:a-zA-Z0-9]*') |
146 | +markupbase._declname_match = re.compile(r'[a-zA-Z][-_.:a-zA-Z0-9]*\s*').match |
147 | + |
148 | +DEFAULT_OUTPUT_ENCODING = "utf-8" |
149 | + |
150 | +def _match_css_class(str): |
151 | + """Build a RE to match the given CSS class.""" |
152 | + return re.compile(r"(^|.*\s)%s($|\s)" % str) |
153 | + |
154 | +# First, the classes that represent markup elements. |
155 | + |
156 | +class PageElement(object): |
157 | + """Contains the navigational information for some part of the page |
158 | + (either a tag or a piece of text)""" |
159 | + |
160 | + def setup(self, parent=None, previous=None): |
161 | + """Sets up the initial relations between this element and |
162 | + other elements.""" |
163 | + self.parent = parent |
164 | + self.previous = previous |
165 | + self.next = None |
166 | + self.previousSibling = None |
167 | + self.nextSibling = None |
168 | + if self.parent and self.parent.contents: |
169 | + self.previousSibling = self.parent.contents[-1] |
170 | + self.previousSibling.nextSibling = self |
171 | + |
172 | + def replaceWith(self, replaceWith): |
173 | + oldParent = self.parent |
174 | + myIndex = self.parent.index(self) |
175 | + if hasattr(replaceWith, "parent")\ |
176 | + and replaceWith.parent is self.parent: |
177 | + # We're replacing this element with one of its siblings. |
178 | + index = replaceWith.parent.index(replaceWith) |
179 | + if index and index < myIndex: |
180 | + # Furthermore, it comes before this element. That |
181 | + # means that when we extract it, the index of this |
182 | + # element will change. |
183 | + myIndex = myIndex - 1 |
184 | + self.extract() |
185 | + oldParent.insert(myIndex, replaceWith) |
186 | + |
187 | + def replaceWithChildren(self): |
188 | + myParent = self.parent |
189 | + myIndex = self.parent.index(self) |
190 | + self.extract() |
191 | + reversedChildren = list(self.contents) |
192 | + reversedChildren.reverse() |
193 | + for child in reversedChildren: |
194 | + myParent.insert(myIndex, child) |
195 | + |
196 | + def extract(self): |
197 | + """Destructively rips this element out of the tree.""" |
198 | + if self.parent: |
199 | + try: |
200 | + del self.parent.contents[self.parent.index(self)] |
201 | + except ValueError: |
202 | + pass |
203 | + |
204 | + #Find the two elements that would be next to each other if |
205 | + #this element (and any children) hadn't been parsed. Connect |
206 | + #the two. |
207 | + lastChild = self._lastRecursiveChild() |
208 | + nextElement = lastChild.next |
209 | + |
210 | + if self.previous: |
211 | + self.previous.next = nextElement |
212 | + if nextElement: |
213 | + nextElement.previous = self.previous |
214 | + self.previous = None |
215 | + lastChild.next = None |
216 | + |
217 | + self.parent = None |
218 | + if self.previousSibling: |
219 | + self.previousSibling.nextSibling = self.nextSibling |
220 | + if self.nextSibling: |
221 | + self.nextSibling.previousSibling = self.previousSibling |
222 | + self.previousSibling = self.nextSibling = None |
223 | + return self |
224 | + |
225 | + def _lastRecursiveChild(self): |
226 | + "Finds the last element beneath this object to be parsed." |
227 | + lastChild = self |
228 | + while hasattr(lastChild, 'contents') and lastChild.contents: |
229 | + lastChild = lastChild.contents[-1] |
230 | + return lastChild |
231 | + |
232 | + def insert(self, position, newChild): |
233 | + if isinstance(newChild, basestring) \ |
234 | + and not isinstance(newChild, NavigableString): |
235 | + newChild = NavigableString(newChild) |
236 | + |
237 | + position = min(position, len(self.contents)) |
238 | + if hasattr(newChild, 'parent') and newChild.parent is not None: |
239 | + # We're 'inserting' an element that's already one |
240 | + # of this object's children. |
241 | + if newChild.parent is self: |
242 | + index = self.index(newChild) |
243 | + if index > position: |
244 | + # Furthermore we're moving it further down the |
245 | + # list of this object's children. That means that |
246 | + # when we extract this element, our target index |
247 | + # will jump down one. |
248 | + position = position - 1 |
249 | + newChild.extract() |
250 | + |
251 | + newChild.parent = self |
252 | + previousChild = None |
253 | + if position == 0: |
254 | + newChild.previousSibling = None |
255 | + newChild.previous = self |
256 | + else: |
257 | + previousChild = self.contents[position-1] |
258 | + newChild.previousSibling = previousChild |
259 | + newChild.previousSibling.nextSibling = newChild |
260 | + newChild.previous = previousChild._lastRecursiveChild() |
261 | + if newChild.previous: |
262 | + newChild.previous.next = newChild |
263 | + |
264 | + newChildsLastElement = newChild._lastRecursiveChild() |
265 | + |
266 | + if position >= len(self.contents): |
267 | + newChild.nextSibling = None |
268 | + |
269 | + parent = self |
270 | + parentsNextSibling = None |
271 | + while not parentsNextSibling: |
272 | + parentsNextSibling = parent.nextSibling |
273 | + parent = parent.parent |
274 | + if not parent: # This is the last element in the document. |
275 | + break |
276 | + if parentsNextSibling: |
277 | + newChildsLastElement.next = parentsNextSibling |
278 | + else: |
279 | + newChildsLastElement.next = None |
280 | + else: |
281 | + nextChild = self.contents[position] |
282 | + newChild.nextSibling = nextChild |
283 | + if newChild.nextSibling: |
284 | + newChild.nextSibling.previousSibling = newChild |
285 | + newChildsLastElement.next = nextChild |
286 | + |
287 | + if newChildsLastElement.next: |
288 | + newChildsLastElement.next.previous = newChildsLastElement |
289 | + self.contents.insert(position, newChild) |
290 | + |
291 | + def append(self, tag): |
292 | + """Appends the given tag to the contents of this tag.""" |
293 | + self.insert(len(self.contents), tag) |
294 | + |
295 | + def findNext(self, name=None, attrs={}, text=None, **kwargs): |
296 | + """Returns the first item that matches the given criteria and |
297 | + appears after this Tag in the document.""" |
298 | + return self._findOne(self.findAllNext, name, attrs, text, **kwargs) |
299 | + |
300 | + def findAllNext(self, name=None, attrs={}, text=None, limit=None, |
301 | + **kwargs): |
302 | + """Returns all items that match the given criteria and appear |
303 | + after this Tag in the document.""" |
304 | + return self._findAll(name, attrs, text, limit, self.nextGenerator, |
305 | + **kwargs) |
306 | + |
307 | + def findNextSibling(self, name=None, attrs={}, text=None, **kwargs): |
308 | + """Returns the closest sibling to this Tag that matches the |
309 | + given criteria and appears after this Tag in the document.""" |
310 | + return self._findOne(self.findNextSiblings, name, attrs, text, |
311 | + **kwargs) |
312 | + |
313 | + def findNextSiblings(self, name=None, attrs={}, text=None, limit=None, |
314 | + **kwargs): |
315 | + """Returns the siblings of this Tag that match the given |
316 | + criteria and appear after this Tag in the document.""" |
317 | + return self._findAll(name, attrs, text, limit, |
318 | + self.nextSiblingGenerator, **kwargs) |
319 | + fetchNextSiblings = findNextSiblings # Compatibility with pre-3.x |
320 | + |
321 | + def findPrevious(self, name=None, attrs={}, text=None, **kwargs): |
322 | + """Returns the first item that matches the given criteria and |
323 | + appears before this Tag in the document.""" |
324 | + return self._findOne(self.findAllPrevious, name, attrs, text, **kwargs) |
325 | + |
326 | + def findAllPrevious(self, name=None, attrs={}, text=None, limit=None, |
327 | + **kwargs): |
328 | + """Returns all items that match the given criteria and appear |
329 | + before this Tag in the document.""" |
330 | + return self._findAll(name, attrs, text, limit, self.previousGenerator, |
331 | + **kwargs) |
332 | + fetchPrevious = findAllPrevious # Compatibility with pre-3.x |
333 | + |
334 | + def findPreviousSibling(self, name=None, attrs={}, text=None, **kwargs): |
335 | + """Returns the closest sibling to this Tag that matches the |
336 | + given criteria and appears before this Tag in the document.""" |
337 | + return self._findOne(self.findPreviousSiblings, name, attrs, text, |
338 | + **kwargs) |
339 | + |
340 | + def findPreviousSiblings(self, name=None, attrs={}, text=None, |
341 | + limit=None, **kwargs): |
342 | + """Returns the siblings of this Tag that match the given |
343 | + criteria and appear before this Tag in the document.""" |
344 | + return self._findAll(name, attrs, text, limit, |
345 | + self.previousSiblingGenerator, **kwargs) |
346 | + fetchPreviousSiblings = findPreviousSiblings # Compatibility with pre-3.x |
347 | + |
348 | + def findParent(self, name=None, attrs={}, **kwargs): |
349 | + """Returns the closest parent of this Tag that matches the given |
350 | + criteria.""" |
351 | + # NOTE: We can't use _findOne because findParents takes a different |
352 | + # set of arguments. |
353 | + r = None |
354 | + l = self.findParents(name, attrs, 1) |
355 | + if l: |
356 | + r = l[0] |
357 | + return r |
358 | + |
359 | + def findParents(self, name=None, attrs={}, limit=None, **kwargs): |
360 | + """Returns the parents of this Tag that match the given |
361 | + criteria.""" |
362 | + |
363 | + return self._findAll(name, attrs, None, limit, self.parentGenerator, |
364 | + **kwargs) |
365 | + fetchParents = findParents # Compatibility with pre-3.x |
366 | + |
367 | + #These methods do the real heavy lifting. |
368 | + |
369 | + def _findOne(self, method, name, attrs, text, **kwargs): |
370 | + r = None |
371 | + l = method(name, attrs, text, 1, **kwargs) |
372 | + if l: |
373 | + r = l[0] |
374 | + return r |
375 | + |
376 | + def _findAll(self, name, attrs, text, limit, generator, **kwargs): |
377 | + "Iterates over a generator looking for things that match." |
378 | + |
379 | + if isinstance(name, SoupStrainer): |
380 | + strainer = name |
381 | + # (Possibly) special case some findAll*(...) searches |
382 | + elif text is None and not limit and not attrs and not kwargs: |
383 | + # findAll*(True) |
384 | + if name is True: |
385 | + return [element for element in generator() |
386 | + if isinstance(element, Tag)] |
387 | + # findAll*('tag-name') |
388 | + elif isinstance(name, basestring): |
389 | + return [element for element in generator() |
390 | + if isinstance(element, Tag) and |
391 | + element.name == name] |
392 | + else: |
393 | + strainer = SoupStrainer(name, attrs, text, **kwargs) |
394 | + # Build a SoupStrainer |
395 | + else: |
396 | + strainer = SoupStrainer(name, attrs, text, **kwargs) |
397 | + results = ResultSet(strainer) |
398 | + g = generator() |
399 | + while True: |
400 | + try: |
401 | + i = g.next() |
402 | + except StopIteration: |
403 | + break |
404 | + if i: |
405 | + found = strainer.search(i) |
406 | + if found: |
407 | + results.append(found) |
408 | + if limit and len(results) >= limit: |
409 | + break |
410 | + return results |
411 | + |
412 | + #These Generators can be used to navigate starting from both |
413 | + #NavigableStrings and Tags. |
414 | + def nextGenerator(self): |
415 | + i = self |
416 | + while i is not None: |
417 | + i = i.next |
418 | + yield i |
419 | + |
420 | + def nextSiblingGenerator(self): |
421 | + i = self |
422 | + while i is not None: |
423 | + i = i.nextSibling |
424 | + yield i |
425 | + |
426 | + def previousGenerator(self): |
427 | + i = self |
428 | + while i is not None: |
429 | + i = i.previous |
430 | + yield i |
431 | + |
432 | + def previousSiblingGenerator(self): |
433 | + i = self |
434 | + while i is not None: |
435 | + i = i.previousSibling |
436 | + yield i |
437 | + |
438 | + def parentGenerator(self): |
439 | + i = self |
440 | + while i is not None: |
441 | + i = i.parent |
442 | + yield i |
443 | + |
444 | + # Utility methods |
445 | + def substituteEncoding(self, str, encoding=None): |
446 | + encoding = encoding or "utf-8" |
447 | + return str.replace("%SOUP-ENCODING%", encoding) |
448 | + |
449 | + def toEncoding(self, s, encoding=None): |
450 | + """Encodes an object to a string in some encoding, or to Unicode. |
451 | + .""" |
452 | + if isinstance(s, unicode): |
453 | + if encoding: |
454 | + s = s.encode(encoding) |
455 | + elif isinstance(s, str): |
456 | + if encoding: |
457 | + s = s.encode(encoding) |
458 | + else: |
459 | + s = unicode(s) |
460 | + else: |
461 | + if encoding: |
462 | + s = self.toEncoding(str(s), encoding) |
463 | + else: |
464 | + s = unicode(s) |
465 | + return s |
466 | + |
467 | +class NavigableString(unicode, PageElement): |
468 | + |
469 | + def __new__(cls, value): |
470 | + """Create a new NavigableString. |
471 | + |
472 | + When unpickling a NavigableString, this method is called with |
473 | + the string in DEFAULT_OUTPUT_ENCODING. That encoding needs to be |
474 | + passed in to the superclass's __new__ or the superclass won't know |
475 | + how to handle non-ASCII characters. |
476 | + """ |
477 | + if isinstance(value, unicode): |
478 | + return unicode.__new__(cls, value) |
479 | + return unicode.__new__(cls, value, DEFAULT_OUTPUT_ENCODING) |
480 | + |
481 | + def __getnewargs__(self): |
482 | + return (NavigableString.__str__(self),) |
483 | + |
484 | + def __getattr__(self, attr): |
485 | + """text.string gives you text. This is for backwards |
486 | + compatibility for Navigable*String, but for CData* it lets you |
487 | + get the string without the CData wrapper.""" |
488 | + if attr == 'string': |
489 | + return self |
490 | + else: |
491 | + raise AttributeError, "'%s' object has no attribute '%s'" % (self.__class__.__name__, attr) |
492 | + |
493 | + def __unicode__(self): |
494 | + return str(self).decode(DEFAULT_OUTPUT_ENCODING) |
495 | + |
496 | + def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING): |
497 | + if encoding: |
498 | + return self.encode(encoding) |
499 | + else: |
500 | + return self |
501 | + |
502 | +class CData(NavigableString): |
503 | + |
504 | + def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING): |
505 | + return "<![CDATA[%s]]>" % NavigableString.__str__(self, encoding) |
506 | + |
507 | +class ProcessingInstruction(NavigableString): |
508 | + def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING): |
509 | + output = self |
510 | + if "%SOUP-ENCODING%" in output: |
511 | + output = self.substituteEncoding(output, encoding) |
512 | + return "<?%s?>" % self.toEncoding(output, encoding) |
513 | + |
514 | +class Comment(NavigableString): |
515 | + def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING): |
516 | + return "<!--%s-->" % NavigableString.__str__(self, encoding) |
517 | + |
518 | +class Declaration(NavigableString): |
519 | + def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING): |
520 | + return "<!%s>" % NavigableString.__str__(self, encoding) |
521 | + |
522 | +class Tag(PageElement): |
523 | + |
524 | + """Represents a found HTML tag with its attributes and contents.""" |
525 | + |
526 | + def _invert(h): |
527 | + "Cheap function to invert a hash." |
528 | + i = {} |
529 | + for k,v in h.items(): |
530 | + i[v] = k |
531 | + return i |
532 | + |
533 | + XML_ENTITIES_TO_SPECIAL_CHARS = { "apos" : "'", |
534 | + "quot" : '"', |
535 | + "amp" : "&", |
536 | + "lt" : "<", |
537 | + "gt" : ">" } |
538 | + |
539 | + XML_SPECIAL_CHARS_TO_ENTITIES = _invert(XML_ENTITIES_TO_SPECIAL_CHARS) |
540 | + |
541 | + def _convertEntities(self, match): |
542 | + """Used in a call to re.sub to replace HTML, XML, and numeric |
543 | + entities with the appropriate Unicode characters. If HTML |
544 | + entities are being converted, any unrecognized entities are |
545 | + escaped.""" |
546 | + x = match.group(1) |
547 | + if self.convertHTMLEntities and x in name2codepoint: |
548 | + return unichr(name2codepoint[x]) |
549 | + elif x in self.XML_ENTITIES_TO_SPECIAL_CHARS: |
550 | + if self.convertXMLEntities: |
551 | + return self.XML_ENTITIES_TO_SPECIAL_CHARS[x] |
552 | + else: |
553 | + return u'&%s;' % x |
554 | + elif len(x) > 0 and x[0] == '#': |
555 | + # Handle numeric entities |
556 | + if len(x) > 1 and x[1] == 'x': |
557 | + return unichr(int(x[2:], 16)) |
558 | + else: |
559 | + return unichr(int(x[1:])) |
560 | + |
561 | + elif self.escapeUnrecognizedEntities: |
562 | + return u'&%s;' % x |
563 | + else: |
564 | + return u'&%s;' % x |
565 | + |
566 | + def __init__(self, parser, name, attrs=None, parent=None, |
567 | + previous=None): |
568 | + "Basic constructor." |
569 | + |
570 | + # We don't actually store the parser object: that lets extracted |
571 | + # chunks be garbage-collected |
572 | + self.parserClass = parser.__class__ |
573 | + self.isSelfClosing = parser.isSelfClosingTag(name) |
574 | + self.name = name |
575 | + if attrs is None: |
576 | + attrs = [] |
577 | + elif isinstance(attrs, dict): |
578 | + attrs = attrs.items() |
579 | + self.attrs = attrs |
580 | + self.contents = [] |
581 | + self.setup(parent, previous) |
582 | + self.hidden = False |
583 | + self.containsSubstitutions = False |
584 | + self.convertHTMLEntities = parser.convertHTMLEntities |
585 | + self.convertXMLEntities = parser.convertXMLEntities |
586 | + self.escapeUnrecognizedEntities = parser.escapeUnrecognizedEntities |
587 | + |
588 | + # Convert any HTML, XML, or numeric entities in the attribute values. |
589 | + convert = lambda(k, val): (k, |
590 | + re.sub("&(#\d+|#x[0-9a-fA-F]+|\w+);", |
591 | + self._convertEntities, |
592 | + val)) |
593 | + self.attrs = map(convert, self.attrs) |
594 | + |
595 | + def getString(self): |
596 | + if (len(self.contents) == 1 |
597 | + and isinstance(self.contents[0], NavigableString)): |
598 | + return self.contents[0] |
599 | + |
600 | + def setString(self, string): |
601 | + """Replace the contents of the tag with a string""" |
602 | + self.clear() |
603 | + self.append(string) |
604 | + |
605 | + string = property(getString, setString) |
606 | + |
607 | + def getText(self, separator=u""): |
608 | + if not len(self.contents): |
609 | + return u"" |
610 | + stopNode = self._lastRecursiveChild().next |
611 | + strings = [] |
612 | + current = self.contents[0] |
613 | + while current is not stopNode: |
614 | + if isinstance(current, NavigableString): |
615 | + strings.append(current) |
616 | + current = current.next |
617 | + return separator.join(strings) |
618 | + |
619 | + text = property(getText) |
620 | + |
621 | + def get(self, key, default=None): |
622 | + """Returns the value of the 'key' attribute for the tag, or |
623 | + the value given for 'default' if it doesn't have that |
624 | + attribute.""" |
625 | + return self._getAttrMap().get(key, default) |
626 | + |
627 | + def clear(self): |
628 | + """Extract all children.""" |
629 | + for child in self.contents[:]: |
630 | + child.extract() |
631 | + |
632 | + def index(self, element): |
633 | + for i, child in enumerate(self.contents): |
634 | + if child is element: |
635 | + return i |
636 | + raise ValueError("Tag.index: element not in tag") |
637 | + |
638 | + def has_key(self, key): |
639 | + return self._getAttrMap().has_key(key) |
640 | + |
641 | + def __getitem__(self, key): |
642 | + """tag[key] returns the value of the 'key' attribute for the tag, |
643 | + and throws an exception if it's not there.""" |
644 | + return self._getAttrMap()[key] |
645 | + |
646 | + def __iter__(self): |
647 | + "Iterating over a tag iterates over its contents." |
648 | + return iter(self.contents) |
649 | + |
650 | + def __len__(self): |
651 | + "The length of a tag is the length of its list of contents." |
652 | + return len(self.contents) |
653 | + |
654 | + def __contains__(self, x): |
655 | + return x in self.contents |
656 | + |
657 | + def __nonzero__(self): |
658 | + "A tag is non-None even if it has no contents." |
659 | + return True |
660 | + |
661 | + def __setitem__(self, key, value): |
662 | + """Setting tag[key] sets the value of the 'key' attribute for the |
663 | + tag.""" |
664 | + self._getAttrMap() |
665 | + self.attrMap[key] = value |
666 | + found = False |
667 | + for i in range(0, len(self.attrs)): |
668 | + if self.attrs[i][0] == key: |
669 | + self.attrs[i] = (key, value) |
670 | + found = True |
671 | + if not found: |
672 | + self.attrs.append((key, value)) |
673 | + self._getAttrMap()[key] = value |
674 | + |
675 | + def __delitem__(self, key): |
676 | + "Deleting tag[key] deletes all 'key' attributes for the tag." |
677 | + for item in self.attrs: |
678 | + if item[0] == key: |
679 | + self.attrs.remove(item) |
680 | + #We don't break because bad HTML can define the same |
681 | + #attribute multiple times. |
682 | + self._getAttrMap() |
683 | + if self.attrMap.has_key(key): |
684 | + del self.attrMap[key] |
685 | + |
686 | + def __call__(self, *args, **kwargs): |
687 | + """Calling a tag like a function is the same as calling its |
688 | + findAll() method. Eg. tag('a') returns a list of all the A tags |
689 | + found within this tag.""" |
690 | + return apply(self.findAll, args, kwargs) |
691 | + |
692 | + def __getattr__(self, tag): |
693 | + #print "Getattr %s.%s" % (self.__class__, tag) |
694 | + if len(tag) > 3 and tag.rfind('Tag') == len(tag)-3: |
695 | + return self.find(tag[:-3]) |
696 | + elif tag.find('__') != 0: |
697 | + return self.find(tag) |
698 | + raise AttributeError, "'%s' object has no attribute '%s'" % (self.__class__, tag) |
699 | + |
700 | + def __eq__(self, other): |
701 | + """Returns true iff this tag has the same name, the same attributes, |
702 | + and the same contents (recursively) as the given tag. |
703 | + |
704 | + NOTE: right now this will return false if two tags have the |
705 | + same attributes in a different order. Should this be fixed?""" |
706 | + if other is self: |
707 | + return True |
708 | + if not hasattr(other, 'name') or not hasattr(other, 'attrs') or not hasattr(other, 'contents') or self.name != other.name or self.attrs != other.attrs or len(self) != len(other): |
709 | + return False |
710 | + for i in range(0, len(self.contents)): |
711 | + if self.contents[i] != other.contents[i]: |
712 | + return False |
713 | + return True |
714 | + |
715 | + def __ne__(self, other): |
716 | + """Returns true iff this tag is not identical to the other tag, |
717 | + as defined in __eq__.""" |
718 | + return not self == other |
719 | + |
720 | + def __repr__(self, encoding=DEFAULT_OUTPUT_ENCODING): |
721 | + """Renders this tag as a string.""" |
722 | + return self.__str__(encoding) |
723 | + |
724 | + def __unicode__(self): |
725 | + return self.__str__(None) |
726 | + |
727 | + BARE_AMPERSAND_OR_BRACKET = re.compile("([<>]|" |
728 | + + "&(?!#\d+;|#x[0-9a-fA-F]+;|\w+;)" |
729 | + + ")") |
730 | + |
731 | + def _sub_entity(self, x): |
732 | + """Used with a regular expression to substitute the |
733 | + appropriate XML entity for an XML special character.""" |
734 | + return "&" + self.XML_SPECIAL_CHARS_TO_ENTITIES[x.group(0)[0]] + ";" |
735 | + |
736 | + def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING, |
737 | + prettyPrint=False, indentLevel=0): |
738 | + """Returns a string or Unicode representation of this tag and |
739 | + its contents. To get Unicode, pass None for encoding. |
740 | + |
741 | + NOTE: since Python's HTML parser consumes whitespace, this |
742 | + method is not certain to reproduce the whitespace present in |
743 | + the original string.""" |
744 | + |
745 | + encodedName = self.toEncoding(self.name, encoding) |
746 | + |
747 | + attrs = [] |
748 | + if self.attrs: |
749 | + for key, val in self.attrs: |
750 | + fmt = '%s="%s"' |
751 | + if isinstance(val, basestring): |
752 | + if self.containsSubstitutions and '%SOUP-ENCODING%' in val: |
753 | + val = self.substituteEncoding(val, encoding) |
754 | + |
755 | + # The attribute value either: |
756 | + # |
757 | + # * Contains no embedded double quotes or single quotes. |
758 | + # No problem: we enclose it in double quotes. |
759 | + # * Contains embedded single quotes. No problem: |
760 | + # double quotes work here too. |
761 | + # * Contains embedded double quotes. No problem: |
762 | + # we enclose it in single quotes. |
763 | + # * Embeds both single _and_ double quotes. This |
764 | + # can't happen naturally, but it can happen if |
765 | + # you modify an attribute value after parsing |
766 | + # the document. Now we have a bit of a |
767 | + # problem. We solve it by enclosing the |
768 | + # attribute in single quotes, and escaping any |
769 | + # embedded single quotes to XML entities. |
770 | + if '"' in val: |
771 | + fmt = "%s='%s'" |
772 | + if "'" in val: |
773 | + # TODO: replace with apos when |
774 | + # appropriate. |
775 | + val = val.replace("'", "&squot;") |
776 | + |
777 | + # Now we're okay w/r/t quotes. But the attribute |
778 | + # value might also contain angle brackets, or |
779 | + # ampersands that aren't part of entities. We need |
780 | + # to escape those to XML entities too. |
781 | + val = self.BARE_AMPERSAND_OR_BRACKET.sub(self._sub_entity, val) |
782 | + |
783 | + attrs.append(fmt % (self.toEncoding(key, encoding), |
784 | + self.toEncoding(val, encoding))) |
785 | + close = '' |
786 | + closeTag = '' |
787 | + if self.isSelfClosing: |
788 | + close = ' /' |
789 | + else: |
790 | + closeTag = '</%s>' % encodedName |
791 | + |
792 | + indentTag, indentContents = 0, 0 |
793 | + if prettyPrint: |
794 | + indentTag = indentLevel |
795 | + space = (' ' * (indentTag-1)) |
796 | + indentContents = indentTag + 1 |
797 | + contents = self.renderContents(encoding, prettyPrint, indentContents) |
798 | + if self.hidden: |
799 | + s = contents |
800 | + else: |
801 | + s = [] |
802 | + attributeString = '' |
803 | + if attrs: |
804 | + attributeString = ' ' + ' '.join(attrs) |
805 | + if prettyPrint: |
806 | + s.append(space) |
807 | + s.append('<%s%s%s>' % (encodedName, attributeString, close)) |
808 | + if prettyPrint: |
809 | + s.append("\n") |
810 | + s.append(contents) |
811 | + if prettyPrint and contents and contents[-1] != "\n": |
812 | + s.append("\n") |
813 | + if prettyPrint and closeTag: |
814 | + s.append(space) |
815 | + s.append(closeTag) |
816 | + if prettyPrint and closeTag and self.nextSibling: |
817 | + s.append("\n") |
818 | + s = ''.join(s) |
819 | + return s |
820 | + |
821 | + def decompose(self): |
822 | + """Recursively destroys the contents of this tree.""" |
823 | + self.extract() |
824 | + if len(self.contents) == 0: |
825 | + return |
826 | + current = self.contents[0] |
827 | + while current is not None: |
828 | + next = current.next |
829 | + if isinstance(current, Tag): |
830 | + del current.contents[:] |
831 | + current.parent = None |
832 | + current.previous = None |
833 | + current.previousSibling = None |
834 | + current.next = None |
835 | + current.nextSibling = None |
836 | + current = next |
837 | + |
838 | + def prettify(self, encoding=DEFAULT_OUTPUT_ENCODING): |
839 | + return self.__str__(encoding, True) |
840 | + |
841 | + def renderContents(self, encoding=DEFAULT_OUTPUT_ENCODING, |
842 | + prettyPrint=False, indentLevel=0): |
843 | + """Renders the contents of this tag as a string in the given |
844 | + encoding. If encoding is None, returns a Unicode string..""" |
845 | + s=[] |
846 | + for c in self: |
847 | + text = None |
848 | + if isinstance(c, NavigableString): |
849 | + text = c.__str__(encoding) |
850 | + elif isinstance(c, Tag): |
851 | + s.append(c.__str__(encoding, prettyPrint, indentLevel)) |
852 | + if text and prettyPrint: |
853 | + text = text.strip() |
854 | + if text: |
855 | + if prettyPrint: |
856 | + s.append(" " * (indentLevel-1)) |
857 | + s.append(text) |
858 | + if prettyPrint: |
859 | + s.append("\n") |
860 | + return ''.join(s) |
861 | + |
862 | + #Soup methods |
863 | + |
864 | + def find(self, name=None, attrs={}, recursive=True, text=None, |
865 | + **kwargs): |
866 | + """Return only the first child of this Tag matching the given |
867 | + criteria.""" |
868 | + r = None |
869 | + l = self.findAll(name, attrs, recursive, text, 1, **kwargs) |
870 | + if l: |
871 | + r = l[0] |
872 | + return r |
873 | + findChild = find |
874 | + |
875 | + def findAll(self, name=None, attrs={}, recursive=True, text=None, |
876 | + limit=None, **kwargs): |
877 | + """Extracts a list of Tag objects that match the given |
878 | + criteria. You can specify the name of the Tag and any |
879 | + attributes you want the Tag to have. |
880 | + |
881 | + The value of a key-value pair in the 'attrs' map can be a |
882 | + string, a list of strings, a regular expression object, or a |
883 | + callable that takes a string and returns whether or not the |
884 | + string matches for some custom definition of 'matches'. The |
885 | + same is true of the tag name.""" |
886 | + generator = self.recursiveChildGenerator |
887 | + if not recursive: |
888 | + generator = self.childGenerator |
889 | + return self._findAll(name, attrs, text, limit, generator, **kwargs) |
890 | + findChildren = findAll |
891 | + |
892 | + # Pre-3.x compatibility methods |
893 | + first = find |
894 | + fetch = findAll |
895 | + |
896 | + def fetchText(self, text=None, recursive=True, limit=None): |
897 | + return self.findAll(text=text, recursive=recursive, limit=limit) |
898 | + |
899 | + def firstText(self, text=None, recursive=True): |
900 | + return self.find(text=text, recursive=recursive) |
901 | + |
902 | + #Private methods |
903 | + |
904 | + def _getAttrMap(self): |
905 | + """Initializes a map representation of this tag's attributes, |
906 | + if not already initialized.""" |
907 | + if not getattr(self, 'attrMap'): |
908 | + self.attrMap = {} |
909 | + for (key, value) in self.attrs: |
910 | + self.attrMap[key] = value |
911 | + return self.attrMap |
912 | + |
913 | + #Generator methods |
914 | + def childGenerator(self): |
915 | + # Just use the iterator from the contents |
916 | + return iter(self.contents) |
917 | + |
918 | + def recursiveChildGenerator(self): |
919 | + if not len(self.contents): |
920 | + raise StopIteration |
921 | + stopNode = self._lastRecursiveChild().next |
922 | + current = self.contents[0] |
923 | + while current is not stopNode: |
924 | + yield current |
925 | + current = current.next |
926 | + |
927 | + |
928 | +# Next, a couple classes to represent queries and their results. |
929 | +class SoupStrainer: |
930 | + """Encapsulates a number of ways of matching a markup element (tag or |
931 | + text).""" |
932 | + |
933 | + def __init__(self, name=None, attrs={}, text=None, **kwargs): |
934 | + self.name = name |
935 | + if isinstance(attrs, basestring): |
936 | + kwargs['class'] = _match_css_class(attrs) |
937 | + attrs = None |
938 | + if kwargs: |
939 | + if attrs: |
940 | + attrs = attrs.copy() |
941 | + attrs.update(kwargs) |
942 | + else: |
943 | + attrs = kwargs |
944 | + self.attrs = attrs |
945 | + self.text = text |
946 | + |
947 | + def __str__(self): |
948 | + if self.text: |
949 | + return self.text |
950 | + else: |
951 | + return "%s|%s" % (self.name, self.attrs) |
952 | + |
953 | + def searchTag(self, markupName=None, markupAttrs={}): |
954 | + found = None |
955 | + markup = None |
956 | + if isinstance(markupName, Tag): |
957 | + markup = markupName |
958 | + markupAttrs = markup |
959 | + callFunctionWithTagData = callable(self.name) \ |
960 | + and not isinstance(markupName, Tag) |
961 | + |
962 | + if (not self.name) \ |
963 | + or callFunctionWithTagData \ |
964 | + or (markup and self._matches(markup, self.name)) \ |
965 | + or (not markup and self._matches(markupName, self.name)): |
966 | + if callFunctionWithTagData: |
967 | + match = self.name(markupName, markupAttrs) |
968 | + else: |
969 | + match = True |
970 | + markupAttrMap = None |
971 | + for attr, matchAgainst in self.attrs.items(): |
972 | + if not markupAttrMap: |
973 | + if hasattr(markupAttrs, 'get'): |
974 | + markupAttrMap = markupAttrs |
975 | + else: |
976 | + markupAttrMap = {} |
977 | + for k,v in markupAttrs: |
978 | + markupAttrMap[k] = v |
979 | + attrValue = markupAttrMap.get(attr) |
980 | + if not self._matches(attrValue, matchAgainst): |
981 | + match = False |
982 | + break |
983 | + if match: |
984 | + if markup: |
985 | + found = markup |
986 | + else: |
987 | + found = markupName |
988 | + return found |
989 | + |
990 | + def search(self, markup): |
991 | + #print 'looking for %s in %s' % (self, markup) |
992 | + found = None |
993 | + # If given a list of items, scan it for a text element that |
994 | + # matches. |
995 | + if hasattr(markup, "__iter__") \ |
996 | + and not isinstance(markup, Tag): |
997 | + for element in markup: |
998 | + if isinstance(element, NavigableString) \ |
999 | + and self.search(element): |
1000 | + found = element |
1001 | + break |
1002 | + # If it's a Tag, make sure its name or attributes match. |
1003 | + # Don't bother with Tags if we're searching for text. |
1004 | + elif isinstance(markup, Tag): |
1005 | + if not self.text: |
1006 | + found = self.searchTag(markup) |
1007 | + # If it's text, make sure the text matches. |
1008 | + elif isinstance(markup, NavigableString) or \ |
1009 | + isinstance(markup, basestring): |
1010 | + if self._matches(markup, self.text): |
1011 | + found = markup |
1012 | + else: |
1013 | + raise Exception, "I don't know how to match against a %s" \ |
1014 | + % markup.__class__ |
1015 | + return found |
1016 | + |
1017 | + def _matches(self, markup, matchAgainst): |
1018 | + #print "Matching %s against %s" % (markup, matchAgainst) |
1019 | + result = False |
1020 | + if matchAgainst is True: |
1021 | + result = markup is not None |
1022 | + elif callable(matchAgainst): |
1023 | + result = matchAgainst(markup) |
1024 | + else: |
1025 | + #Custom match methods take the tag as an argument, but all |
1026 | + #other ways of matching match the tag name as a string. |
1027 | + if isinstance(markup, Tag): |
1028 | + markup = markup.name |
1029 | + if markup and not isinstance(markup, basestring): |
1030 | + markup = unicode(markup) |
1031 | + #Now we know that chunk is either a string, or None. |
1032 | + if hasattr(matchAgainst, 'match'): |
1033 | + # It's a regexp object. |
1034 | + result = markup and matchAgainst.search(markup) |
1035 | + elif hasattr(matchAgainst, '__iter__'): # list-like |
1036 | + result = markup in matchAgainst |
1037 | + elif hasattr(matchAgainst, 'items'): |
1038 | + result = markup.has_key(matchAgainst) |
1039 | + elif matchAgainst and isinstance(markup, basestring): |
1040 | + if isinstance(markup, unicode): |
1041 | + matchAgainst = unicode(matchAgainst) |
1042 | + else: |
1043 | + matchAgainst = str(matchAgainst) |
1044 | + |
1045 | + if not result: |
1046 | + result = matchAgainst == markup |
1047 | + return result |
1048 | + |
1049 | +class ResultSet(list): |
1050 | + """A ResultSet is just a list that keeps track of the SoupStrainer |
1051 | + that created it.""" |
1052 | + def __init__(self, source): |
1053 | + list.__init__([]) |
1054 | + self.source = source |
1055 | + |
1056 | +# Now, some helper functions. |
1057 | + |
1058 | +def buildTagMap(default, *args): |
1059 | + """Turns a list of maps, lists, or scalars into a single map. |
1060 | + Used to build the SELF_CLOSING_TAGS, NESTABLE_TAGS, and |
1061 | + NESTING_RESET_TAGS maps out of lists and partial maps.""" |
1062 | + built = {} |
1063 | + for portion in args: |
1064 | + if hasattr(portion, 'items'): |
1065 | + #It's a map. Merge it. |
1066 | + for k,v in portion.items(): |
1067 | + built[k] = v |
1068 | + elif hasattr(portion, '__iter__'): # is a list |
1069 | + #It's a list. Map each item to the default. |
1070 | + for k in portion: |
1071 | + built[k] = default |
1072 | + else: |
1073 | + #It's a scalar. Map it to the default. |
1074 | + built[portion] = default |
1075 | + return built |
1076 | + |
1077 | +# Now, the parser classes. |
1078 | + |
1079 | +class BeautifulStoneSoup(Tag, SGMLParser): |
1080 | + |
1081 | + """This class contains the basic parser and search code. It defines |
1082 | + a parser that knows nothing about tag behavior except for the |
1083 | + following: |
1084 | + |
1085 | + You can't close a tag without closing all the tags it encloses. |
1086 | + That is, "<foo><bar></foo>" actually means |
1087 | + "<foo><bar></bar></foo>". |
1088 | + |
1089 | + [Another possible explanation is "<foo><bar /></foo>", but since |
1090 | + this class defines no SELF_CLOSING_TAGS, it will never use that |
1091 | + explanation.] |
1092 | + |
1093 | + This class is useful for parsing XML or made-up markup languages, |
1094 | + or when BeautifulSoup makes an assumption counter to what you were |
1095 | + expecting.""" |
1096 | + |
1097 | + SELF_CLOSING_TAGS = {} |
1098 | + NESTABLE_TAGS = {} |
1099 | + RESET_NESTING_TAGS = {} |
1100 | + QUOTE_TAGS = {} |
1101 | + PRESERVE_WHITESPACE_TAGS = [] |
1102 | + |
1103 | + MARKUP_MASSAGE = [(re.compile('(<[^<>]*)/>'), |
1104 | + lambda x: x.group(1) + ' />'), |
1105 | + (re.compile('<!\s+([^<>]*)>'), |
1106 | + lambda x: '<!' + x.group(1) + '>') |
1107 | + ] |
1108 | + |
1109 | + ROOT_TAG_NAME = u'[document]' |
1110 | + |
1111 | + HTML_ENTITIES = "html" |
1112 | + XML_ENTITIES = "xml" |
1113 | + XHTML_ENTITIES = "xhtml" |
1114 | + # TODO: This only exists for backwards-compatibility |
1115 | + ALL_ENTITIES = XHTML_ENTITIES |
1116 | + |
1117 | + # Used when determining whether a text node is all whitespace and |
1118 | + # can be replaced with a single space. A text node that contains |
1119 | + # fancy Unicode spaces (usually non-breaking) should be left |
1120 | + # alone. |
1121 | + STRIP_ASCII_SPACES = { 9: None, 10: None, 12: None, 13: None, 32: None, } |
1122 | + |
1123 | + def __init__(self, markup="", parseOnlyThese=None, fromEncoding=None, |
1124 | + markupMassage=True, smartQuotesTo=XML_ENTITIES, |
1125 | + convertEntities=None, selfClosingTags=None, isHTML=False): |
1126 | + """The Soup object is initialized as the 'root tag', and the |
1127 | + provided markup (which can be a string or a file-like object) |
1128 | + is fed into the underlying parser. |
1129 | + |
1130 | + sgmllib will process most bad HTML, and the BeautifulSoup |
1131 | + class has some tricks for dealing with some HTML that kills |
1132 | + sgmllib, but Beautiful Soup can nonetheless choke or lose data |
1133 | + if your data uses self-closing tags or declarations |
1134 | + incorrectly. |
1135 | + |
1136 | + By default, Beautiful Soup uses regexes to sanitize input, |
1137 | + avoiding the vast majority of these problems. If the problems |
1138 | + don't apply to you, pass in False for markupMassage, and |
1139 | + you'll get better performance. |
1140 | + |
1141 | + The default parser massage techniques fix the two most common |
1142 | + instances of invalid HTML that choke sgmllib: |
1143 | + |
1144 | + <br/> (No space between name of closing tag and tag close) |
1145 | + <! --Comment--> (Extraneous whitespace in declaration) |
1146 | + |
1147 | + You can pass in a custom list of (RE object, replace method) |
1148 | + tuples to get Beautiful Soup to scrub your input the way you |
1149 | + want.""" |
1150 | + |
1151 | + self.parseOnlyThese = parseOnlyThese |
1152 | + self.fromEncoding = fromEncoding |
1153 | + self.smartQuotesTo = smartQuotesTo |
1154 | + self.convertEntities = convertEntities |
1155 | + # Set the rules for how we'll deal with the entities we |
1156 | + # encounter |
1157 | + if self.convertEntities: |
1158 | + # It doesn't make sense to convert encoded characters to |
1159 | + # entities even while you're converting entities to Unicode. |
1160 | + # Just convert it all to Unicode. |
1161 | + self.smartQuotesTo = None |
1162 | + if convertEntities == self.HTML_ENTITIES: |
1163 | + self.convertXMLEntities = False |
1164 | + self.convertHTMLEntities = True |
1165 | + self.escapeUnrecognizedEntities = True |
1166 | + elif convertEntities == self.XHTML_ENTITIES: |
1167 | + self.convertXMLEntities = True |
1168 | + self.convertHTMLEntities = True |
1169 | + self.escapeUnrecognizedEntities = False |
1170 | + elif convertEntities == self.XML_ENTITIES: |
1171 | + self.convertXMLEntities = True |
1172 | + self.convertHTMLEntities = False |
1173 | + self.escapeUnrecognizedEntities = False |
1174 | + else: |
1175 | + self.convertXMLEntities = False |
1176 | + self.convertHTMLEntities = False |
1177 | + self.escapeUnrecognizedEntities = False |
1178 | + |
1179 | + self.instanceSelfClosingTags = buildTagMap(None, selfClosingTags) |
1180 | + SGMLParser.__init__(self) |
1181 | + |
1182 | + if hasattr(markup, 'read'): # It's a file-type object. |
1183 | + markup = markup.read() |
1184 | + self.markup = markup |
1185 | + self.markupMassage = markupMassage |
1186 | + try: |
1187 | + self._feed(isHTML=isHTML) |
1188 | + except StopParsing: |
1189 | + pass |
1190 | + self.markup = None # The markup can now be GCed |
1191 | + |
1192 | + def convert_charref(self, name): |
1193 | + """This method fixes a bug in Python's SGMLParser.""" |
1194 | + try: |
1195 | + n = int(name) |
1196 | + except ValueError: |
1197 | + return |
1198 | + if not 0 <= n <= 127 : # ASCII ends at 127, not 255 |
1199 | + return |
1200 | + return self.convert_codepoint(n) |
1201 | + |
1202 | + def _feed(self, inDocumentEncoding=None, isHTML=False): |
1203 | + # Convert the document to Unicode. |
1204 | + markup = self.markup |
1205 | + if isinstance(markup, unicode): |
1206 | + if not hasattr(self, 'originalEncoding'): |
1207 | + self.originalEncoding = None |
1208 | + else: |
1209 | + dammit = UnicodeDammit\ |
1210 | + (markup, [self.fromEncoding, inDocumentEncoding], |
1211 | + smartQuotesTo=self.smartQuotesTo, isHTML=isHTML) |
1212 | + markup = dammit.unicode |
1213 | + self.originalEncoding = dammit.originalEncoding |
1214 | + self.declaredHTMLEncoding = dammit.declaredHTMLEncoding |
1215 | + if markup: |
1216 | + if self.markupMassage: |
1217 | + if not hasattr(self.markupMassage, "__iter__"): |
1218 | + self.markupMassage = self.MARKUP_MASSAGE |
1219 | + for fix, m in self.markupMassage: |
1220 | + markup = fix.sub(m, markup) |
1221 | + # TODO: We get rid of markupMassage so that the |
1222 | + # soup object can be deepcopied later on. Some |
1223 | + # Python installations can't copy regexes. If anyone |
1224 | + # was relying on the existence of markupMassage, this |
1225 | + # might cause problems. |
1226 | + del(self.markupMassage) |
1227 | + self.reset() |
1228 | + |
1229 | + SGMLParser.feed(self, markup) |
1230 | + # Close out any unfinished strings and close all the open tags. |
1231 | + self.endData() |
1232 | + while self.currentTag.name != self.ROOT_TAG_NAME: |
1233 | + self.popTag() |
1234 | + |
1235 | + def __getattr__(self, methodName): |
1236 | + """This method routes method call requests to either the SGMLParser |
1237 | + superclass or the Tag superclass, depending on the method name.""" |
1238 | + #print "__getattr__ called on %s.%s" % (self.__class__, methodName) |
1239 | + |
1240 | + if methodName.startswith('start_') or methodName.startswith('end_') \ |
1241 | + or methodName.startswith('do_'): |
1242 | + return SGMLParser.__getattr__(self, methodName) |
1243 | + elif not methodName.startswith('__'): |
1244 | + return Tag.__getattr__(self, methodName) |
1245 | + else: |
1246 | + raise AttributeError |
1247 | + |
1248 | + def isSelfClosingTag(self, name): |
1249 | + """Returns true iff the given string is the name of a |
1250 | + self-closing tag according to this parser.""" |
1251 | + return self.SELF_CLOSING_TAGS.has_key(name) \ |
1252 | + or self.instanceSelfClosingTags.has_key(name) |
1253 | + |
1254 | + def reset(self): |
1255 | + Tag.__init__(self, self, self.ROOT_TAG_NAME) |
1256 | + self.hidden = 1 |
1257 | + SGMLParser.reset(self) |
1258 | + self.currentData = [] |
1259 | + self.currentTag = None |
1260 | + self.tagStack = [] |
1261 | + self.quoteStack = [] |
1262 | + self.pushTag(self) |
1263 | + |
1264 | + def popTag(self): |
1265 | + tag = self.tagStack.pop() |
1266 | + |
1267 | + #print "Pop", tag.name |
1268 | + if self.tagStack: |
1269 | + self.currentTag = self.tagStack[-1] |
1270 | + return self.currentTag |
1271 | + |
1272 | + def pushTag(self, tag): |
1273 | + #print "Push", tag.name |
1274 | + if self.currentTag: |
1275 | + self.currentTag.contents.append(tag) |
1276 | + self.tagStack.append(tag) |
1277 | + self.currentTag = self.tagStack[-1] |
1278 | + |
1279 | + def endData(self, containerClass=NavigableString): |
1280 | + if self.currentData: |
1281 | + currentData = u''.join(self.currentData) |
1282 | + if (currentData.translate(self.STRIP_ASCII_SPACES) == '' and |
1283 | + not set([tag.name for tag in self.tagStack]).intersection( |
1284 | + self.PRESERVE_WHITESPACE_TAGS)): |
1285 | + if '\n' in currentData: |
1286 | + currentData = '\n' |
1287 | + else: |
1288 | + currentData = ' ' |
1289 | + self.currentData = [] |
1290 | + if self.parseOnlyThese and len(self.tagStack) <= 1 and \ |
1291 | + (not self.parseOnlyThese.text or \ |
1292 | + not self.parseOnlyThese.search(currentData)): |
1293 | + return |
1294 | + o = containerClass(currentData) |
1295 | + o.setup(self.currentTag, self.previous) |
1296 | + if self.previous: |
1297 | + self.previous.next = o |
1298 | + self.previous = o |
1299 | + self.currentTag.contents.append(o) |
1300 | + |
1301 | + |
1302 | + def _popToTag(self, name, inclusivePop=True): |
1303 | + """Pops the tag stack up to and including the most recent |
1304 | + instance of the given tag. If inclusivePop is false, pops the tag |
1305 | + stack up to but *not* including the most recent instance of |
1306 | + the given tag.""" |
1307 | + #print "Popping to %s" % name |
1308 | + if name == self.ROOT_TAG_NAME: |
1309 | + return |
1310 | + |
1311 | + numPops = 0 |
1312 | + mostRecentTag = None |
1313 | + for i in range(len(self.tagStack)-1, 0, -1): |
1314 | + if name == self.tagStack[i].name: |
1315 | + numPops = len(self.tagStack)-i |
1316 | + break |
1317 | + if not inclusivePop: |
1318 | + numPops = numPops - 1 |
1319 | + |
1320 | + for i in range(0, numPops): |
1321 | + mostRecentTag = self.popTag() |
1322 | + return mostRecentTag |
1323 | + |
1324 | + def _smartPop(self, name): |
1325 | + |
1326 | + """We need to pop up to the previous tag of this type, unless |
1327 | + one of this tag's nesting reset triggers comes between this |
1328 | + tag and the previous tag of this type, OR unless this tag is a |
1329 | + generic nesting trigger and another generic nesting trigger |
1330 | + comes between this tag and the previous tag of this type. |
1331 | + |
1332 | + Examples: |
1333 | + <p>Foo<b>Bar *<p>* should pop to 'p', not 'b'. |
1334 | + <p>Foo<table>Bar *<p>* should pop to 'table', not 'p'. |
1335 | + <p>Foo<table><tr>Bar *<p>* should pop to 'tr', not 'p'. |
1336 | + |
1337 | + <li><ul><li> *<li>* should pop to 'ul', not the first 'li'. |
1338 | + <tr><table><tr> *<tr>* should pop to 'table', not the first 'tr' |
1339 | + <td><tr><td> *<td>* should pop to 'tr', not the first 'td' |
1340 | + """ |
1341 | + |
1342 | + nestingResetTriggers = self.NESTABLE_TAGS.get(name) |
1343 | + isNestable = nestingResetTriggers != None |
1344 | + isResetNesting = self.RESET_NESTING_TAGS.has_key(name) |
1345 | + popTo = None |
1346 | + inclusive = True |
1347 | + for i in range(len(self.tagStack)-1, 0, -1): |
1348 | + p = self.tagStack[i] |
1349 | + if (not p or p.name == name) and not isNestable: |
1350 | + #Non-nestable tags get popped to the top or to their |
1351	 | +                #last occurrence.
1352 | + popTo = name |
1353 | + break |
1354 | + if (nestingResetTriggers is not None |
1355 | + and p.name in nestingResetTriggers) \ |
1356 | + or (nestingResetTriggers is None and isResetNesting |
1357 | + and self.RESET_NESTING_TAGS.has_key(p.name)): |
1358 | + |
1359 | + #If we encounter one of the nesting reset triggers |
1360 | + #peculiar to this tag, or we encounter another tag |
1361 | + #that causes nesting to reset, pop up to but not |
1362 | + #including that tag. |
1363 | + popTo = p.name |
1364 | + inclusive = False |
1365 | + break |
1366 | + p = p.parent |
1367 | + if popTo: |
1368 | + self._popToTag(popTo, inclusive) |
1369 | + |
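The docstring's pop rules as observable behavior (a sketch; outputs are inferred from the logic above, not captured from a run):

    from BeautifulSoup import BeautifulSoup

    print BeautifulSoup("<p>Foo<b>Bar<p>Baz")
    # <p>Foo<b>Bar</b></p><p>Baz</p>  -- the second <p> pops to 'p', closing <b> on the way

    print BeautifulSoup("<p>Foo<table>Bar<p>Baz")
    # <p>Foo<table>Bar<p>Baz</p></table></p>  -- <table> is a nesting reset trigger,
    # so the pop stops there and the second <p> nests inside the table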
1370 | + def unknown_starttag(self, name, attrs, selfClosing=0): |
1371 | + #print "Start tag %s: %s" % (name, attrs) |
1372 | + if self.quoteStack: |
1373 | + #This is not a real tag. |
1374 | + #print "<%s> is not real!" % name |
1375 | + attrs = ''.join([' %s="%s"' % (x, y) for x, y in attrs]) |
1376 | + self.handle_data('<%s%s>' % (name, attrs)) |
1377 | + return |
1378 | + self.endData() |
1379 | + |
1380 | + if not self.isSelfClosingTag(name) and not selfClosing: |
1381 | + self._smartPop(name) |
1382 | + |
1383 | + if self.parseOnlyThese and len(self.tagStack) <= 1 \ |
1384 | + and (self.parseOnlyThese.text or not self.parseOnlyThese.searchTag(name, attrs)): |
1385 | + return |
1386 | + |
1387 | + tag = Tag(self, name, attrs, self.currentTag, self.previous) |
1388 | + if self.previous: |
1389 | + self.previous.next = tag |
1390 | + self.previous = tag |
1391 | + self.pushTag(tag) |
1392 | + if selfClosing or self.isSelfClosingTag(name): |
1393 | + self.popTag() |
1394 | + if name in self.QUOTE_TAGS: |
1395 | + #print "Beginning quote (%s)" % name |
1396 | + self.quoteStack.append(name) |
1397 | + self.literal = 1 |
1398 | + return tag |
1399 | + |
1400 | + def unknown_endtag(self, name): |
1401 | + #print "End tag %s" % name |
1402 | + if self.quoteStack and self.quoteStack[-1] != name: |
1403 | + #This is not a real end tag. |
1404 | + #print "</%s> is not real!" % name |
1405 | + self.handle_data('</%s>' % name) |
1406 | + return |
1407 | + self.endData() |
1408 | + self._popToTag(name) |
1409 | + if self.quoteStack and self.quoteStack[-1] == name: |
1410 | + self.quoteStack.pop() |
1411 | + self.literal = (len(self.quoteStack) > 0) |
1412 | + |
1413 | + def handle_data(self, data): |
1414 | + self.currentData.append(data) |
1415 | + |
1416 | + def _toStringSubclass(self, text, subclass): |
1417 | + """Adds a certain piece of text to the tree as a NavigableString |
1418 | + subclass.""" |
1419 | + self.endData() |
1420 | + self.handle_data(text) |
1421 | + self.endData(subclass) |
1422 | + |
1423 | + def handle_pi(self, text): |
1424 | + """Handle a processing instruction as a ProcessingInstruction |
1425 | + object, possibly one with a %SOUP-ENCODING% slot into which an |
1426 | + encoding will be plugged later.""" |
1427 | + if text[:3] == "xml": |
1428 | + text = u"xml version='1.0' encoding='%SOUP-ENCODING%'" |
1429 | + self._toStringSubclass(text, ProcessingInstruction) |
1430 | + |
1431 | + def handle_comment(self, text): |
1432 | + "Handle comments as Comment objects." |
1433 | + self._toStringSubclass(text, Comment) |
1434 | + |
1435 | + def handle_charref(self, ref): |
1436 | + "Handle character references as data." |
1437 | + if self.convertEntities: |
1438 | + data = unichr(int(ref)) |
1439 | + else: |
1440 | + data = '&#%s;' % ref |
1441 | + self.handle_data(data) |
1442 | + |
1443 | + def handle_entityref(self, ref): |
1444 | + """Handle entity references as data, possibly converting known |
1445 | + HTML and/or XML entity references to the corresponding Unicode |
1446 | + characters.""" |
1447 | + data = None |
1448 | + if self.convertHTMLEntities: |
1449 | + try: |
1450 | + data = unichr(name2codepoint[ref]) |
1451 | + except KeyError: |
1452 | + pass |
1453 | + |
1454 | + if not data and self.convertXMLEntities: |
1455 | + data = self.XML_ENTITIES_TO_SPECIAL_CHARS.get(ref) |
1456 | + |
1457 | + if not data and self.convertHTMLEntities and \ |
1458 | + not self.XML_ENTITIES_TO_SPECIAL_CHARS.get(ref): |
1459 | + # TODO: We've got a problem here. We're told this is |
1460 | + # an entity reference, but it's not an XML entity |
1461 | + # reference or an HTML entity reference. Nonetheless, |
1462 | + # the logical thing to do is to pass it through as an |
1463 | + # unrecognized entity reference. |
1464 | + # |
1465 | + # Except: when the input is "&carol;" this function |
1466 | + # will be called with input "carol". When the input is |
1467 | + # "AT&T", this function will be called with input |
1468 | + # "T". We have no way of knowing whether a semicolon |
1469 | + # was present originally, so we don't know whether |
1470 | + # this is an unknown entity or just a misplaced |
1471 | + # ampersand. |
1472 | + # |
1473 | + # The more common case is a misplaced ampersand, so I |
1474 | + # escape the ampersand and omit the trailing semicolon. |
1475	 | +            data = "&amp;%s" % ref
1476 | + if not data: |
1477 | + # This case is different from the one above, because we |
1478 | + # haven't already gone through a supposedly comprehensive |
1479 | + # mapping of entities to Unicode characters. We might not |
1480 | + # have gone through any mapping at all. So the chances are |
1481 | + # very high that this is a real entity, and not a |
1482 | + # misplaced ampersand. |
1483 | + data = "&%s;" % ref |
1484 | + self.handle_data(data) |
1485 | + |
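Two cases from the comments above, sketched (convertEntities takes the class constants defined earlier in this file; outputs are inferred, not captured):

    from BeautifulSoup import BeautifulSoup

    soup = BeautifulSoup("caf&eacute;", convertEntities=BeautifulSoup.HTML_ENTITIES)
    print repr(unicode(soup))  # u'caf\xe9': a known HTML entity is converted

    # "AT&T, Inc." reaches this handler as the unknown reference "T", so the
    # ampersand is escaped and no semicolon is invented:
    soup = BeautifulSoup("AT&T, Inc.", convertEntities=BeautifulSoup.HTML_ENTITIES)
    print str(soup)            # AT&amp;T, Inc.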
1486 | + def handle_decl(self, data): |
1487 | + "Handle DOCTYPEs and the like as Declaration objects." |
1488 | + self._toStringSubclass(data, Declaration) |
1489 | + |
1490 | + def parse_declaration(self, i): |
1491 | + """Treat a bogus SGML declaration as raw data. Treat a CDATA |
1492 | + declaration as a CData object.""" |
1493 | + j = None |
1494 | + if self.rawdata[i:i+9] == '<![CDATA[': |
1495 | + k = self.rawdata.find(']]>', i) |
1496 | + if k == -1: |
1497 | + k = len(self.rawdata) |
1498 | + data = self.rawdata[i+9:k] |
1499 | + j = k+3 |
1500 | + self._toStringSubclass(data, CData) |
1501 | + else: |
1502 | + try: |
1503 | + j = SGMLParser.parse_declaration(self, i) |
1504 | + except SGMLParseError: |
1505 | + toHandle = self.rawdata[i:] |
1506 | + self.handle_data(toHandle) |
1507 | + j = i + len(toHandle) |
1508 | + return j |
1509 | + |
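A sketch of the CDATA branch (inferred behavior):

    from BeautifulSoup import BeautifulStoneSoup, CData

    soup = BeautifulStoneSoup("<root>foo<![CDATA[x < y]]>bar</root>")
    cdata = soup.root.contents[1]
    print isinstance(cdata, CData)  # True
    print unicode(cdata)            # x < y -- the raw text
    print str(cdata)                # <![CDATA[x < y]]> -- rendered back out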
1510 | +class BeautifulSoup(BeautifulStoneSoup): |
1511 | + |
1512 | + """This parser knows the following facts about HTML: |
1513 | + |
1514 | + * Some tags have no closing tag and should be interpreted as being |
1515 | + closed as soon as they are encountered. |
1516 | + |
1517	 | +    * The text inside some tags (i.e. 'script') may contain tags which
1518 | + are not really part of the document and which should be parsed |
1519 | + as text, not tags. If you want to parse the text as tags, you can |
1520 | + always fetch it and parse it explicitly. |
1521 | + |
1522 | + * Tag nesting rules: |
1523 | + |
1524	 | +      Most tags can't be nested at all. For instance, the occurrence of
1525 | + a <p> tag should implicitly close the previous <p> tag. |
1526 | + |
1527 | + <p>Para1<p>Para2 |
1528 | + should be transformed into: |
1529 | + <p>Para1</p><p>Para2 |
1530 | + |
1531	 | +      Some tags can be nested arbitrarily. For instance, the occurrence
1532 | + of a <blockquote> tag should _not_ implicitly close the previous |
1533 | + <blockquote> tag. |
1534 | + |
1535 | + Alice said: <blockquote>Bob said: <blockquote>Blah |
1536 | + should NOT be transformed into: |
1537 | + Alice said: <blockquote>Bob said: </blockquote><blockquote>Blah |
1538 | + |
1539 | + Some tags can be nested, but the nesting is reset by the |
1540 | + interposition of other tags. For instance, a <tr> tag should |
1541 | + implicitly close the previous <tr> tag within the same <table>, |
1542 | + but not close a <tr> tag in another table. |
1543 | + |
1544 | + <table><tr>Blah<tr>Blah |
1545 | + should be transformed into: |
1546 | + <table><tr>Blah</tr><tr>Blah |
1547 | + but, |
1548 | + <tr>Blah<table><tr>Blah |
1549 | + should NOT be transformed into |
1550 | + <tr>Blah<table></tr><tr>Blah |
1551 | + |
1552 | + Differing assumptions about tag nesting rules are a major source |
1553 | + of problems with the BeautifulSoup class. If BeautifulSoup is not |
1554 | + treating as nestable a tag your page author treats as nestable, |
1555 | + try ICantBelieveItsBeautifulSoup, MinimalSoup, or |
1556 | + BeautifulStoneSoup before writing your own subclass.""" |
1557 | + |
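The three rules in runnable form (a sketch; outputs are what the rules imply):

    from BeautifulSoup import BeautifulSoup

    print BeautifulSoup("<p>Para1<p>Para2")
    # <p>Para1</p><p>Para2</p>

    print BeautifulSoup("Alice said: <blockquote>Bob said: <blockquote>Blah")
    # Alice said: <blockquote>Bob said: <blockquote>Blah</blockquote></blockquote>

    print BeautifulSoup("<table><tr>Blah<tr>Blah")
    # <table><tr>Blah</tr><tr>Blah</tr></table>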
1558 | + def __init__(self, *args, **kwargs): |
1559 | + if not kwargs.has_key('smartQuotesTo'): |
1560 | + kwargs['smartQuotesTo'] = self.HTML_ENTITIES |
1561 | + kwargs['isHTML'] = True |
1562 | + BeautifulStoneSoup.__init__(self, *args, **kwargs) |
1563 | + |
1564 | + SELF_CLOSING_TAGS = buildTagMap(None, |
1565 | + ('br' , 'hr', 'input', 'img', 'meta', |
1566 | + 'spacer', 'link', 'frame', 'base', 'col')) |
1567 | + |
1568 | + PRESERVE_WHITESPACE_TAGS = set(['pre', 'textarea']) |
1569 | + |
1570 | + QUOTE_TAGS = {'script' : None, 'textarea' : None} |
1571 | + |
1572 | + #According to the HTML standard, each of these inline tags can |
1573 | + #contain another tag of the same type. Furthermore, it's common |
1574 | + #to actually use these tags this way. |
1575 | + NESTABLE_INLINE_TAGS = ('span', 'font', 'q', 'object', 'bdo', 'sub', 'sup', |
1576 | + 'center') |
1577 | + |
1578 | + #According to the HTML standard, these block tags can contain |
1579 | + #another tag of the same type. Furthermore, it's common |
1580 | + #to actually use these tags this way. |
1581 | + NESTABLE_BLOCK_TAGS = ('blockquote', 'div', 'fieldset', 'ins', 'del') |
1582 | + |
1583 | + #Lists can contain other lists, but there are restrictions. |
1584 | + NESTABLE_LIST_TAGS = { 'ol' : [], |
1585 | + 'ul' : [], |
1586 | + 'li' : ['ul', 'ol'], |
1587 | + 'dl' : [], |
1588 | + 'dd' : ['dl'], |
1589 | + 'dt' : ['dl'] } |
1590 | + |
1591 | + #Tables can contain other tables, but there are restrictions. |
1592 | + NESTABLE_TABLE_TAGS = {'table' : [], |
1593 | + 'tr' : ['table', 'tbody', 'tfoot', 'thead'], |
1594 | + 'td' : ['tr'], |
1595 | + 'th' : ['tr'], |
1596 | + 'thead' : ['table'], |
1597 | + 'tbody' : ['table'], |
1598 | + 'tfoot' : ['table'], |
1599 | + } |
1600 | + |
1601 | + NON_NESTABLE_BLOCK_TAGS = ('address', 'form', 'p', 'pre') |
1602 | + |
1603 | + #If one of these tags is encountered, all tags up to the next tag of |
1604 | + #this type are popped. |
1605 | + RESET_NESTING_TAGS = buildTagMap(None, NESTABLE_BLOCK_TAGS, 'noscript', |
1606 | + NON_NESTABLE_BLOCK_TAGS, |
1607 | + NESTABLE_LIST_TAGS, |
1608 | + NESTABLE_TABLE_TAGS) |
1609 | + |
1610 | + NESTABLE_TAGS = buildTagMap([], NESTABLE_INLINE_TAGS, NESTABLE_BLOCK_TAGS, |
1611 | + NESTABLE_LIST_TAGS, NESTABLE_TABLE_TAGS) |
1612 | + |
1613 | + # Used to detect the charset in a META tag; see start_meta |
1614 | + CHARSET_RE = re.compile("((^|;)\s*charset=)([^;]*)", re.M) |
1615 | + |
1616 | + def start_meta(self, attrs): |
1617 | + """Beautiful Soup can detect a charset included in a META tag, |
1618 | + try to convert the document to that charset, and re-parse the |
1619 | + document from the beginning.""" |
1620 | + httpEquiv = None |
1621 | + contentType = None |
1622 | + contentTypeIndex = None |
1623 | + tagNeedsEncodingSubstitution = False |
1624 | + |
1625 | + for i in range(0, len(attrs)): |
1626 | + key, value = attrs[i] |
1627 | + key = key.lower() |
1628 | + if key == 'http-equiv': |
1629 | + httpEquiv = value |
1630 | + elif key == 'content': |
1631 | + contentType = value |
1632 | + contentTypeIndex = i |
1633 | + |
1634 | + if httpEquiv and contentType: # It's an interesting meta tag. |
1635 | + match = self.CHARSET_RE.search(contentType) |
1636 | + if match: |
1637 | + if (self.declaredHTMLEncoding is not None or |
1638 | + self.originalEncoding == self.fromEncoding): |
1639 | + # An HTML encoding was sniffed while converting |
1640 | + # the document to Unicode, or an HTML encoding was |
1641 | + # sniffed during a previous pass through the |
1642 | + # document, or an encoding was specified |
1643 | + # explicitly and it worked. Rewrite the meta tag. |
1644 | + def rewrite(match): |
1645 | + return match.group(1) + "%SOUP-ENCODING%" |
1646 | + newAttr = self.CHARSET_RE.sub(rewrite, contentType) |
1647 | + attrs[contentTypeIndex] = (attrs[contentTypeIndex][0], |
1648 | + newAttr) |
1649 | + tagNeedsEncodingSubstitution = True |
1650 | + else: |
1651 | + # This is our first pass through the document. |
1652 | + # Go through it again with the encoding information. |
1653 | + newCharset = match.group(3) |
1654 | + if newCharset and newCharset != self.originalEncoding: |
1655 | + self.declaredHTMLEncoding = newCharset |
1656 | + self._feed(self.declaredHTMLEncoding) |
1657 | + raise StopParsing |
1658 | + pass |
1659 | + tag = self.unknown_starttag("meta", attrs) |
1660 | + if tag and tagNeedsEncodingSubstitution: |
1661 | + tag.containsSubstitutions = True |
1662 | + |
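For instance (a sketch; the Latin-1 byte stands in for a document served in a non-default encoding, and the results assume the charset heuristics land as described):

    from BeautifulSoup import BeautifulSoup

    html = ('<html><head><meta http-equiv="Content-Type" '
            'content="text/html; charset=iso-8859-1"></head>'
            '<body>caf\xe9</body></html>')
    soup = BeautifulSoup(html)
    print soup.originalEncoding          # iso-8859-1, sniffed from the META tag
    print 'charset=utf-8' in str(soup)   # True: the META tag is rewritten on output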
1663 | +class StopParsing(Exception): |
1664 | + pass |
1665 | + |
1666 | +class ICantBelieveItsBeautifulSoup(BeautifulSoup): |
1667 | + |
1668 | + """The BeautifulSoup class is oriented towards skipping over |
1669 | + common HTML errors like unclosed tags. However, sometimes it makes |
1670 | + errors of its own. For instance, consider this fragment: |
1671 | + |
1672 | + <b>Foo<b>Bar</b></b> |
1673 | + |
1674 | + This is perfectly valid (if bizarre) HTML. However, the |
1675 | + BeautifulSoup class will implicitly close the first b tag when it |
1676 | + encounters the second 'b'. It will think the author wrote |
1677 | + "<b>Foo<b>Bar", and didn't close the first 'b' tag, because |
1678 | + there's no real-world reason to bold something that's already |
1679 | + bold. When it encounters '</b></b>' it will close two more 'b' |
1680 | + tags, for a grand total of three tags closed instead of two. This |
1681 | + can throw off the rest of your document structure. The same is |
1682 | + true of a number of other tags, listed below. |
1683 | + |
1684 | + It's much more common for someone to forget to close a 'b' tag |
1685 | + than to actually use nested 'b' tags, and the BeautifulSoup class |
1686	 | +    handles the common case. This class handles the not-so-common
1687 | + case: where you can't believe someone wrote what they did, but |
1688 | + it's valid HTML and BeautifulSoup screwed up by assuming it |
1689 | + wouldn't be.""" |
1690 | + |
1691 | + I_CANT_BELIEVE_THEYRE_NESTABLE_INLINE_TAGS = \ |
1692 | + ('em', 'big', 'i', 'small', 'tt', 'abbr', 'acronym', 'strong', |
1693 | + 'cite', 'code', 'dfn', 'kbd', 'samp', 'strong', 'var', 'b', |
1694 | + 'big') |
1695 | + |
1696 | + I_CANT_BELIEVE_THEYRE_NESTABLE_BLOCK_TAGS = ('noscript',) |
1697 | + |
1698 | + NESTABLE_TAGS = buildTagMap([], BeautifulSoup.NESTABLE_TAGS, |
1699 | + I_CANT_BELIEVE_THEYRE_NESTABLE_BLOCK_TAGS, |
1700 | + I_CANT_BELIEVE_THEYRE_NESTABLE_INLINE_TAGS) |
1701 | + |
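Side by side (a sketch; outputs are inferred from the nesting rules):

    from BeautifulSoup import BeautifulSoup, ICantBelieveItsBeautifulSoup

    print BeautifulSoup("<b>Foo<b>Bar</b></b>")
    # <b>Foo</b><b>Bar</b>  -- the second <b> implicitly closes the first

    print ICantBelieveItsBeautifulSoup("<b>Foo<b>Bar</b></b>")
    # <b>Foo<b>Bar</b></b>  -- nested <b> tags are taken at face value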
1702 | +class MinimalSoup(BeautifulSoup): |
1703 | + """The MinimalSoup class is for parsing HTML that contains |
1704 | + pathologically bad markup. It makes no assumptions about tag |
1705 | + nesting, but it does know which tags are self-closing, that |
1706	 | +    <script> tags contain JavaScript and should not be parsed, that
1707 | + META tags may contain encoding information, and so on. |
1708 | + |
1709 | + This also makes it better for subclassing than BeautifulStoneSoup |
1710 | + or BeautifulSoup.""" |
1711 | + |
1712 | + RESET_NESTING_TAGS = buildTagMap('noscript') |
1713 | + NESTABLE_TAGS = {} |
1714 | + |
1715 | +class BeautifulSOAP(BeautifulStoneSoup): |
1716 | + """This class will push a tag with only a single string child into |
1717 | + the tag's parent as an attribute. The attribute's name is the tag |
1718 | + name, and the value is the string child. An example should give |
1719 | + the flavor of the change: |
1720 | + |
1721 | + <foo><bar>baz</bar></foo> |
1722 | + => |
1723 | + <foo bar="baz"><bar>baz</bar></foo> |
1724 | + |
1725 | + You can then access fooTag['bar'] instead of fooTag.barTag.string. |
1726 | + |
1727 | + This is, of course, useful for scraping structures that tend to |
1728 | + use subelements instead of attributes, such as SOAP messages. Note |
1729 | + that it modifies its input, so don't print the modified version |
1730 | + out. |
1731 | + |
1732 | + I'm not sure how many people really want to use this class; let me |
1733 | + know if you do. Mainly I like the name.""" |
1734 | + |
1735 | + def popTag(self): |
1736 | + if len(self.tagStack) > 1: |
1737 | + tag = self.tagStack[-1] |
1738 | + parent = self.tagStack[-2] |
1739 | + parent._getAttrMap() |
1740 | + if (isinstance(tag, Tag) and len(tag.contents) == 1 and |
1741 | + isinstance(tag.contents[0], NavigableString) and |
1742 | + not parent.attrMap.has_key(tag.name)): |
1743 | + parent[tag.name] = tag.contents[0] |
1744 | + BeautifulStoneSoup.popTag(self) |
1745 | + |
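For example (a sketch, matching the docstring above):

    from BeautifulSoup import BeautifulSOAP

    soup = BeautifulSOAP("<foo><bar>baz</bar></foo>")
    print soup.foo['bar']  # baz -- the lone string child, promoted to an attribute
    print soup             # <foo bar="baz"><bar>baz</bar></foo>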
1746 | +#Enterprise class names! It has come to our attention that some people |
1747 | +#think the names of the Beautiful Soup parser classes are too silly |
1748 | +#and "unprofessional" for use in enterprise screen-scraping. We feel |
1749 | +#your pain! For such-minded folk, the Beautiful Soup Consortium And |
1750 | +#All-Night Kosher Bakery recommends renaming this file to |
1751 | +#"RobustParser.py" (or, in cases of extreme enterprisiness, |
1752 | +#"RobustParserBeanInterface.class") and using the following |
1753 | +#enterprise-friendly class aliases: |
1754 | +class RobustXMLParser(BeautifulStoneSoup): |
1755 | + pass |
1756 | +class RobustHTMLParser(BeautifulSoup): |
1757 | + pass |
1758 | +class RobustWackAssHTMLParser(ICantBelieveItsBeautifulSoup): |
1759 | + pass |
1760 | +class RobustInsanelyWackAssHTMLParser(MinimalSoup): |
1761 | + pass |
1762 | +class SimplifyingSOAPParser(BeautifulSOAP): |
1763 | + pass |
1764 | + |
1765 | +###################################################### |
1766 | +# |
1767 | +# Bonus library: Unicode, Dammit |
1768 | +# |
1769 | +# This class forces XML data into a standard format (usually to UTF-8 |
1770 | +# or Unicode). It is heavily based on code from Mark Pilgrim's |
1771 | +# Universal Feed Parser. It does not rewrite the XML or HTML to |
1772 | +# reflect a new encoding: that happens in BeautifulStoneSoup.handle_pi |
1773 | +# (XML) and BeautifulSoup.start_meta (HTML). |
1774 | + |
1775 | +# Autodetects character encodings. |
1776 | +# Download from http://chardet.feedparser.org/ |
1777 | +try: |
1778 | + import chardet |
1779 | +# import chardet.constants |
1780 | +# chardet.constants._debug = 1 |
1781 | +except ImportError: |
1782 | + chardet = None |
1783 | + |
1784 | +# cjkcodecs and iconv_codec make Python know about more character encodings. |
1785 | +# Both are available from http://cjkpython.i18n.org/ |
1786 | +# They're built in if you use Python 2.4. |
1787 | +try: |
1788 | + import cjkcodecs.aliases |
1789 | +except ImportError: |
1790 | + pass |
1791 | +try: |
1792 | + import iconv_codec |
1793 | +except ImportError: |
1794 | + pass |
1795 | + |
1796 | +class UnicodeDammit: |
1797 | + """A class for detecting the encoding of a *ML document and |
1798 | + converting it to a Unicode string. If the source encoding is |
1799 | + windows-1252, can replace MS smart quotes with their HTML or XML |
1800 | + equivalents.""" |
1801 | + |
1802 | + # This dictionary maps commonly seen values for "charset" in HTML |
1803 | + # meta tags to the corresponding Python codec names. It only covers |
1804 | + # values that aren't in Python's aliases and can't be determined |
1805 | + # by the heuristics in find_codec. |
1806 | + CHARSET_ALIASES = { "macintosh" : "mac-roman", |
1807 | + "x-sjis" : "shift-jis" } |
1808 | + |
1809 | + def __init__(self, markup, overrideEncodings=[], |
1810 | + smartQuotesTo='xml', isHTML=False): |
1811 | + self.declaredHTMLEncoding = None |
1812 | + self.markup, documentEncoding, sniffedEncoding = \ |
1813 | + self._detectEncoding(markup, isHTML) |
1814 | + self.smartQuotesTo = smartQuotesTo |
1815 | + self.triedEncodings = [] |
1816 | + if markup == '' or isinstance(markup, unicode): |
1817 | + self.originalEncoding = None |
1818 | + self.unicode = unicode(markup) |
1819 | + return |
1820 | + |
1821 | + u = None |
1822 | + for proposedEncoding in overrideEncodings: |
1823 | + u = self._convertFrom(proposedEncoding) |
1824 | + if u: break |
1825 | + if not u: |
1826 | + for proposedEncoding in (documentEncoding, sniffedEncoding): |
1827 | + u = self._convertFrom(proposedEncoding) |
1828 | + if u: break |
1829 | + |
1830 | + # If no luck and we have auto-detection library, try that: |
1831 | + if not u and chardet and not isinstance(self.markup, unicode): |
1832 | + u = self._convertFrom(chardet.detect(self.markup)['encoding']) |
1833 | + |
1834 | + # As a last resort, try utf-8 and windows-1252: |
1835 | + if not u: |
1836 | + for proposed_encoding in ("utf-8", "windows-1252"): |
1837 | + u = self._convertFrom(proposed_encoding) |
1838 | + if u: break |
1839 | + |
1840 | + self.unicode = u |
1841 | + if not u: self.originalEncoding = None |
1842 | + |
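Typical use (a sketch; the exact encoding chosen can differ if chardet is installed and answers first):

    from BeautifulSoup import UnicodeDammit

    dammit = UnicodeDammit("Sacr\xe9 bleu!")
    print repr(dammit.unicode)     # u'Sacr\xe9 bleu!'
    print dammit.originalEncoding  # windows-1252, the fallback after utf-8 fails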
1843 | + def _subMSChar(self, orig): |
1844 | + """Changes a MS smart quote character to an XML or HTML |
1845 | + entity.""" |
1846 | + sub = self.MS_CHARS.get(orig) |
1847 | + if isinstance(sub, tuple): |
1848 | + if self.smartQuotesTo == 'xml': |
1849 | + sub = '&#x%s;' % sub[1] |
1850 | + else: |
1851 | + sub = '&%s;' % sub[0] |
1852 | + return sub |
1853 | + |
1854 | + def _convertFrom(self, proposed): |
1855 | + proposed = self.find_codec(proposed) |
1856 | + if not proposed or proposed in self.triedEncodings: |
1857 | + return None |
1858 | + self.triedEncodings.append(proposed) |
1859 | + markup = self.markup |
1860 | + |
1861 | + # Convert smart quotes to HTML if coming from an encoding |
1862 | + # that might have them. |
1863 | + if self.smartQuotesTo and proposed.lower() in("windows-1252", |
1864 | + "iso-8859-1", |
1865 | + "iso-8859-2"): |
1866 | + markup = re.compile("([\x80-\x9f])").sub \ |
1867 | + (lambda(x): self._subMSChar(x.group(1)), |
1868 | + markup) |
1869 | + |
1870 | + try: |
1871 | + # print "Trying to convert document to %s" % proposed |
1872 | + u = self._toUnicode(markup, proposed) |
1873 | + self.markup = u |
1874 | + self.originalEncoding = proposed |
1875 | + except Exception, e: |
1876 | + # print "That didn't work!" |
1877 | + # print e |
1878 | + return None |
1879 | + #print "Correct encoding: %s" % proposed |
1880 | + return self.markup |
1881 | + |
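The smart-quote pass in action (a sketch; assumes the windows-1252 fallback is reached, e.g. with chardet absent):

    from BeautifulSoup import UnicodeDammit

    dammit = UnicodeDammit('Microsoft \x93smart quotes\x94', smartQuotesTo='html')
    print dammit.unicode  # Microsoft &ldquo;smart quotes&rdquo;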
1882 | + def _toUnicode(self, data, encoding): |
1883 | + '''Given a string and its encoding, decodes the string into Unicode. |
1884 | + %encoding is a string recognized by encodings.aliases''' |
1885 | + |
1886 | + # strip Byte Order Mark (if present) |
1887 | + if (len(data) >= 4) and (data[:2] == '\xfe\xff') \ |
1888 | + and (data[2:4] != '\x00\x00'): |
1889 | + encoding = 'utf-16be' |
1890 | + data = data[2:] |
1891 | + elif (len(data) >= 4) and (data[:2] == '\xff\xfe') \ |
1892 | + and (data[2:4] != '\x00\x00'): |
1893 | + encoding = 'utf-16le' |
1894 | + data = data[2:] |
1895 | + elif data[:3] == '\xef\xbb\xbf': |
1896 | + encoding = 'utf-8' |
1897 | + data = data[3:] |
1898 | + elif data[:4] == '\x00\x00\xfe\xff': |
1899 | + encoding = 'utf-32be' |
1900 | + data = data[4:] |
1901 | + elif data[:4] == '\xff\xfe\x00\x00': |
1902 | + encoding = 'utf-32le' |
1903 | + data = data[4:] |
1904 | + newdata = unicode(data, encoding) |
1905 | + return newdata |
1906 | + |
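The BOM branches in practice (a sketch; inferred behavior):

    from BeautifulSoup import UnicodeDammit

    data = '\xef\xbb\xbf' + u'caf\xe9'.encode('utf-8')
    dammit = UnicodeDammit(data)
    print dammit.originalEncoding  # utf-8, chosen from the BOM (which is stripped)
    print repr(dammit.unicode)     # u'caf\xe9'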
1907 | + def _detectEncoding(self, xml_data, isHTML=False): |
1908 | + """Given a document, tries to detect its XML encoding.""" |
1909 | + xml_encoding = sniffed_xml_encoding = None |
1910 | + try: |
1911 | + if xml_data[:4] == '\x4c\x6f\xa7\x94': |
1912 | + # EBCDIC |
1913 | + xml_data = self._ebcdic_to_ascii(xml_data) |
1914 | + elif xml_data[:4] == '\x00\x3c\x00\x3f': |
1915 | + # UTF-16BE |
1916 | + sniffed_xml_encoding = 'utf-16be' |
1917 | + xml_data = unicode(xml_data, 'utf-16be').encode('utf-8') |
1918 | + elif (len(xml_data) >= 4) and (xml_data[:2] == '\xfe\xff') \ |
1919 | + and (xml_data[2:4] != '\x00\x00'): |
1920 | + # UTF-16BE with BOM |
1921 | + sniffed_xml_encoding = 'utf-16be' |
1922 | + xml_data = unicode(xml_data[2:], 'utf-16be').encode('utf-8') |
1923 | + elif xml_data[:4] == '\x3c\x00\x3f\x00': |
1924 | + # UTF-16LE |
1925 | + sniffed_xml_encoding = 'utf-16le' |
1926 | + xml_data = unicode(xml_data, 'utf-16le').encode('utf-8') |
1927 | + elif (len(xml_data) >= 4) and (xml_data[:2] == '\xff\xfe') and \ |
1928 | + (xml_data[2:4] != '\x00\x00'): |
1929 | + # UTF-16LE with BOM |
1930 | + sniffed_xml_encoding = 'utf-16le' |
1931 | + xml_data = unicode(xml_data[2:], 'utf-16le').encode('utf-8') |
1932 | + elif xml_data[:4] == '\x00\x00\x00\x3c': |
1933 | + # UTF-32BE |
1934 | + sniffed_xml_encoding = 'utf-32be' |
1935 | + xml_data = unicode(xml_data, 'utf-32be').encode('utf-8') |
1936 | + elif xml_data[:4] == '\x3c\x00\x00\x00': |
1937 | + # UTF-32LE |
1938 | + sniffed_xml_encoding = 'utf-32le' |
1939 | + xml_data = unicode(xml_data, 'utf-32le').encode('utf-8') |
1940 | + elif xml_data[:4] == '\x00\x00\xfe\xff': |
1941 | + # UTF-32BE with BOM |
1942 | + sniffed_xml_encoding = 'utf-32be' |
1943 | + xml_data = unicode(xml_data[4:], 'utf-32be').encode('utf-8') |
1944 | + elif xml_data[:4] == '\xff\xfe\x00\x00': |
1945 | + # UTF-32LE with BOM |
1946 | + sniffed_xml_encoding = 'utf-32le' |
1947 | + xml_data = unicode(xml_data[4:], 'utf-32le').encode('utf-8') |
1948 | + elif xml_data[:3] == '\xef\xbb\xbf': |
1949 | + # UTF-8 with BOM |
1950 | + sniffed_xml_encoding = 'utf-8' |
1951 | + xml_data = unicode(xml_data[3:], 'utf-8').encode('utf-8') |
1952 | + else: |
1953 | + sniffed_xml_encoding = 'ascii' |
1954 | + pass |
1955 | + except: |
1956 | + xml_encoding_match = None |
1957 | + xml_encoding_match = re.compile( |
1958 | + '^<\?.*encoding=[\'"](.*?)[\'"].*\?>').match(xml_data) |
1959 | + if not xml_encoding_match and isHTML: |
1960 | + regexp = re.compile('<\s*meta[^>]+charset=([^>]*?)[;\'">]', re.I) |
1961 | + xml_encoding_match = regexp.search(xml_data) |
1962 | + if xml_encoding_match is not None: |
1963 | + xml_encoding = xml_encoding_match.groups()[0].lower() |
1964 | + if isHTML: |
1965 | + self.declaredHTMLEncoding = xml_encoding |
1966 | + if sniffed_xml_encoding and \ |
1967 | + (xml_encoding in ('iso-10646-ucs-2', 'ucs-2', 'csunicode', |
1968 | + 'iso-10646-ucs-4', 'ucs-4', 'csucs4', |
1969 | + 'utf-16', 'utf-32', 'utf_16', 'utf_32', |
1970 | + 'utf16', 'u16')): |
1971 | + xml_encoding = sniffed_xml_encoding |
1972 | + return xml_data, xml_encoding, sniffed_xml_encoding |
1973 | + |
1974 | + |
1975 | + def find_codec(self, charset): |
1976 | + return self._codec(self.CHARSET_ALIASES.get(charset, charset)) \ |
1977 | + or (charset and self._codec(charset.replace("-", ""))) \ |
1978 | + or (charset and self._codec(charset.replace("-", "_"))) \ |
1979 | + or charset |
1980 | + |
1981 | + def _codec(self, charset): |
1982 | + if not charset: return charset |
1983 | + codec = None |
1984 | + try: |
1985 | + codecs.lookup(charset) |
1986 | + codec = charset |
1987 | + except (LookupError, ValueError): |
1988 | + pass |
1989 | + return codec |
1990 | + |
1991 | + EBCDIC_TO_ASCII_MAP = None |
1992 | + def _ebcdic_to_ascii(self, s): |
1993 | + c = self.__class__ |
1994 | + if not c.EBCDIC_TO_ASCII_MAP: |
1995 | + emap = (0,1,2,3,156,9,134,127,151,141,142,11,12,13,14,15, |
1996 | + 16,17,18,19,157,133,8,135,24,25,146,143,28,29,30,31, |
1997 | + 128,129,130,131,132,10,23,27,136,137,138,139,140,5,6,7, |
1998 | + 144,145,22,147,148,149,150,4,152,153,154,155,20,21,158,26, |
1999 | + 32,160,161,162,163,164,165,166,167,168,91,46,60,40,43,33, |
2000 | + 38,169,170,171,172,173,174,175,176,177,93,36,42,41,59,94, |
2001 | + 45,47,178,179,180,181,182,183,184,185,124,44,37,95,62,63, |
2002 | + 186,187,188,189,190,191,192,193,194,96,58,35,64,39,61,34, |
2003 | + 195,97,98,99,100,101,102,103,104,105,196,197,198,199,200, |
2004 | + 201,202,106,107,108,109,110,111,112,113,114,203,204,205, |
2005 | + 206,207,208,209,126,115,116,117,118,119,120,121,122,210, |
2006 | + 211,212,213,214,215,216,217,218,219,220,221,222,223,224, |
2007 | + 225,226,227,228,229,230,231,123,65,66,67,68,69,70,71,72, |
2008 | + 73,232,233,234,235,236,237,125,74,75,76,77,78,79,80,81, |
2009 | + 82,238,239,240,241,242,243,92,159,83,84,85,86,87,88,89, |
2010 | + 90,244,245,246,247,248,249,48,49,50,51,52,53,54,55,56,57, |
2011 | + 250,251,252,253,254,255) |
2012 | + import string |
2013 | + c.EBCDIC_TO_ASCII_MAP = string.maketrans( \ |
2014 | + ''.join(map(chr, range(256))), ''.join(map(chr, emap))) |
2015 | + return s.translate(c.EBCDIC_TO_ASCII_MAP) |
2016 | + |
2017 | + MS_CHARS = { '\x80' : ('euro', '20AC'), |
2018 | + '\x81' : ' ', |
2019 | + '\x82' : ('sbquo', '201A'), |
2020 | + '\x83' : ('fnof', '192'), |
2021 | + '\x84' : ('bdquo', '201E'), |
2022 | + '\x85' : ('hellip', '2026'), |
2023 | + '\x86' : ('dagger', '2020'), |
2024 | + '\x87' : ('Dagger', '2021'), |
2025 | + '\x88' : ('circ', '2C6'), |
2026 | + '\x89' : ('permil', '2030'), |
2027 | + '\x8A' : ('Scaron', '160'), |
2028 | + '\x8B' : ('lsaquo', '2039'), |
2029 | + '\x8C' : ('OElig', '152'), |
2030 | + '\x8D' : '?', |
2031 | + '\x8E' : ('#x17D', '17D'), |
2032 | + '\x8F' : '?', |
2033 | + '\x90' : '?', |
2034 | + '\x91' : ('lsquo', '2018'), |
2035 | + '\x92' : ('rsquo', '2019'), |
2036 | + '\x93' : ('ldquo', '201C'), |
2037 | + '\x94' : ('rdquo', '201D'), |
2038 | + '\x95' : ('bull', '2022'), |
2039 | + '\x96' : ('ndash', '2013'), |
2040 | + '\x97' : ('mdash', '2014'), |
2041 | + '\x98' : ('tilde', '2DC'), |
2042 | + '\x99' : ('trade', '2122'), |
2043 | + '\x9a' : ('scaron', '161'), |
2044 | + '\x9b' : ('rsaquo', '203A'), |
2045 | + '\x9c' : ('oelig', '153'), |
2046 | + '\x9d' : '?', |
2047 | + '\x9e' : ('#x17E', '17E'), |
2048 | + '\x9f' : ('Yuml', ''),} |
2049 | + |
2050 | +####################################################################### |
2051 | + |
2052 | + |
2053 | +#By default, act as an HTML pretty-printer. |
2054 | +if __name__ == '__main__': |
2055 | + import sys |
2056 | + soup = BeautifulSoup(sys.stdin) |
2057 | + print soup.prettify() |
2058 | |
2059 | === added file 'BeautifulSoupTests.py' |
2060 | --- BeautifulSoupTests.py 1970-01-01 00:00:00 +0000 |
2061 | +++ BeautifulSoupTests.py 2011-05-27 07:52:31 +0000 |
2062 | @@ -0,0 +1,903 @@ |
2063 | +# -*- coding: utf-8 -*- |
2064 | +"""Unit tests for Beautiful Soup. |
2065 | + |
2066	 | +These tests make sure that Beautiful Soup works as it should. If you
2067 | +find a bug in Beautiful Soup, the best way to express it is as a test |
2068 | +case like this that fails.""" |
2069 | + |
2070 | +import unittest |
2071 | +from BeautifulSoup import * |
2072 | + |
2073 | +class SoupTest(unittest.TestCase): |
2074 | + |
2075 | + def assertSoupEquals(self, toParse, rep=None, c=BeautifulSoup): |
2076 | + """Parse the given text and make sure its string rep is the other |
2077 | + given text.""" |
2078 | + if rep == None: |
2079 | + rep = toParse |
2080 | + self.assertEqual(str(c(toParse)), rep) |
2081 | + |
2082 | + |
2083 | +class FollowThatTag(SoupTest): |
2084 | + |
2085 | + "Tests the various ways of fetching tags from a soup." |
2086 | + |
2087 | + def setUp(self): |
2088 | + ml = """ |
2089 | + <a id="x">1</a> |
2090 | + <A id="a">2</a> |
2091 | + <b id="b">3</a> |
2092 | + <b href="foo" id="x">4</a> |
2093 | + <ac width=100>4</ac>""" |
2094 | + self.soup = BeautifulStoneSoup(ml) |
2095 | + |
2096 | + def testFindAllByName(self): |
2097 | + matching = self.soup('a') |
2098 | + self.assertEqual(len(matching), 2) |
2099 | + self.assertEqual(matching[0].name, 'a') |
2100 | + self.assertEqual(matching, self.soup.findAll('a')) |
2101 | + self.assertEqual(matching, self.soup.findAll(SoupStrainer('a'))) |
2102 | + |
2103 | + def testFindAllByAttribute(self): |
2104 | + matching = self.soup.findAll(id='x') |
2105 | + self.assertEqual(len(matching), 2) |
2106 | + self.assertEqual(matching[0].name, 'a') |
2107 | + self.assertEqual(matching[1].name, 'b') |
2108 | + |
2109 | + matching2 = self.soup.findAll(attrs={'id' : 'x'}) |
2110 | + self.assertEqual(matching, matching2) |
2111 | + |
2112 | + strainer = SoupStrainer(attrs={'id' : 'x'}) |
2113 | + self.assertEqual(matching, self.soup.findAll(strainer)) |
2114 | + |
2115 | + self.assertEqual(len(self.soup.findAll(id=None)), 1) |
2116 | + |
2117 | + self.assertEqual(len(self.soup.findAll(width=100)), 1) |
2118 | + self.assertEqual(len(self.soup.findAll(junk=None)), 5) |
2119 | + self.assertEqual(len(self.soup.findAll(junk=[1, None])), 5) |
2120 | + |
2121 | + self.assertEqual(len(self.soup.findAll(junk=re.compile('.*'))), 0) |
2122 | + self.assertEqual(len(self.soup.findAll(junk=True)), 0) |
2123 | + |
2124 | + self.assertEqual(len(self.soup.findAll(junk=True)), 0) |
2125 | + self.assertEqual(len(self.soup.findAll(href=True)), 1) |
2126 | + |
2127 | + def testFindallByClass(self): |
2128 | + soup = BeautifulSoup('<b class="foo">Foo</b><a class="1 23 4">Bar</a>') |
2129 | + self.assertEqual(soup.find(attrs='foo').string, "Foo") |
2130 | + self.assertEqual(soup.find('a', '1').string, "Bar") |
2131 | + self.assertEqual(soup.find('a', '23').string, "Bar") |
2132 | + self.assertEqual(soup.find('a', '4').string, "Bar") |
2133 | + |
2134 | + self.assertEqual(soup.find('a', '2'), None) |
2135 | + |
2136 | + def testFindAllByList(self): |
2137 | + matching = self.soup(['a', 'ac']) |
2138 | + self.assertEqual(len(matching), 3) |
2139 | + |
2140 | + def testFindAllByHash(self): |
2141 | + matching = self.soup({'a' : True, 'b' : True}) |
2142 | + self.assertEqual(len(matching), 4) |
2143 | + |
2144 | + def testFindAllText(self): |
2145 | + soup = BeautifulSoup("<html>\xbb</html>") |
2146 | + self.assertEqual(soup.findAll(text=re.compile('.*')), |
2147 | + [u'\xbb']) |
2148 | + |
2149 | + def testFindAllByRE(self): |
2150 | + import re |
2151 | + r = re.compile('a.*') |
2152 | + self.assertEqual(len(self.soup(r)), 3) |
2153 | + |
2154 | + def testFindAllByMethod(self): |
2155 | + def matchTagWhereIDMatchesName(tag): |
2156 | + return tag.name == tag.get('id') |
2157 | + |
2158 | + matching = self.soup.findAll(matchTagWhereIDMatchesName) |
2159 | + self.assertEqual(len(matching), 2) |
2160 | + self.assertEqual(matching[0].name, 'a') |
2161 | + |
2162 | + def testFindByIndex(self): |
2163 | + """For when you have the tag and you want to know where it is.""" |
2164 | + tag = self.soup.find('a', id="a") |
2165 | + self.assertEqual(self.soup.index(tag), 3) |
2166 | + |
2167 | + # It works for NavigableStrings as well. |
2168 | + s = tag.string |
2169 | + self.assertEqual(tag.index(s), 0) |
2170 | + |
2171 | + # If the tag isn't present, a ValueError is raised. |
2172 | + soup2 = BeautifulSoup("<b></b>") |
2173 | + tag2 = soup2.find('b') |
2174 | + self.assertRaises(ValueError, self.soup.index, tag2) |
2175 | + |
2176 | + def testConflictingFindArguments(self): |
2177 | + """The 'text' argument takes precedence.""" |
2178 | + soup = BeautifulSoup('Foo<b>Bar</b>Baz') |
2179 | + self.assertEqual(soup.find('b', text='Baz'), 'Baz') |
2180 | + self.assertEqual(soup.findAll('b', text='Baz'), ['Baz']) |
2181 | + |
2182 | + self.assertEqual(soup.find(True, text='Baz'), 'Baz') |
2183 | + self.assertEqual(soup.findAll(True, text='Baz'), ['Baz']) |
2184 | + |
2185 | + def testParents(self): |
2186 | + soup = BeautifulSoup('<ul id="foo"></ul><ul id="foo"><ul><ul id="foo" a="b"><b>Blah') |
2187 | + b = soup.b |
2188 | + self.assertEquals(len(b.findParents('ul', {'id' : 'foo'})), 2) |
2189 | + self.assertEquals(b.findParent('ul')['a'], 'b') |
2190 | + |
2191 | + PROXIMITY_TEST = BeautifulSoup('<b id="1"><b id="2"><b id="3"><b id="4">') |
2192 | + |
2193 | + def testNext(self): |
2194 | + soup = self.PROXIMITY_TEST |
2195 | + b = soup.find('b', {'id' : 2}) |
2196 | + self.assertEquals(b.findNext('b')['id'], '3') |
2197 | + self.assertEquals(b.findNext('b')['id'], '3') |
2198 | + self.assertEquals(len(b.findAllNext('b')), 2) |
2199 | + self.assertEquals(len(b.findAllNext('b', {'id' : 4})), 1) |
2200 | + |
2201 | + def testPrevious(self): |
2202 | + soup = self.PROXIMITY_TEST |
2203 | + b = soup.find('b', {'id' : 3}) |
2204 | + self.assertEquals(b.findPrevious('b')['id'], '2') |
2205 | + self.assertEquals(b.findPrevious('b')['id'], '2') |
2206 | + self.assertEquals(len(b.findAllPrevious('b')), 2) |
2207 | + self.assertEquals(len(b.findAllPrevious('b', {'id' : 2})), 1) |
2208 | + |
2209 | + |
2210 | + SIBLING_TEST = BeautifulSoup('<blockquote id="1"><blockquote id="1.1"></blockquote></blockquote><blockquote id="2"><blockquote id="2.1"></blockquote></blockquote><blockquote id="3"><blockquote id="3.1"></blockquote></blockquote><blockquote id="4">') |
2211 | + |
2212 | + def testNextSibling(self): |
2213 | + soup = self.SIBLING_TEST |
2214 | + tag = 'blockquote' |
2215 | + b = soup.find(tag, {'id' : 2}) |
2216 | + self.assertEquals(b.findNext(tag)['id'], '2.1') |
2217 | + self.assertEquals(b.findNextSibling(tag)['id'], '3') |
2218 | + self.assertEquals(b.findNextSibling(tag)['id'], '3') |
2219 | + self.assertEquals(len(b.findNextSiblings(tag)), 2) |
2220 | + self.assertEquals(len(b.findNextSiblings(tag, {'id' : 4})), 1) |
2221 | + |
2222 | + def testPreviousSibling(self): |
2223 | + soup = self.SIBLING_TEST |
2224 | + tag = 'blockquote' |
2225 | + b = soup.find(tag, {'id' : 3}) |
2226 | + self.assertEquals(b.findPrevious(tag)['id'], '2.1') |
2227 | + self.assertEquals(b.findPreviousSibling(tag)['id'], '2') |
2228 | + self.assertEquals(b.findPreviousSibling(tag)['id'], '2') |
2229 | + self.assertEquals(len(b.findPreviousSiblings(tag)), 2) |
2230 | + self.assertEquals(len(b.findPreviousSiblings(tag, id=1)), 1) |
2231 | + |
2232 | + def testTextNavigation(self): |
2233 | + soup = BeautifulSoup('Foo<b>Bar</b><i id="1"><b>Baz<br />Blee<hr id="1"/></b></i>Blargh') |
2234 | + baz = soup.find(text='Baz') |
2235 | + self.assertEquals(baz.findParent("i")['id'], '1') |
2236 | + self.assertEquals(baz.findNext(text='Blee'), 'Blee') |
2237 | + self.assertEquals(baz.findNextSibling(text='Blee'), 'Blee') |
2238 | + self.assertEquals(baz.findNextSibling(text='Blargh'), None) |
2239 | + self.assertEquals(baz.findNextSibling('hr')['id'], '1') |
2240 | + |
2241 | +class SiblingRivalry(SoupTest): |
2242 | + "Tests the nextSibling and previousSibling navigation." |
2243 | + |
2244 | + def testSiblings(self): |
2245 | + soup = BeautifulSoup("<ul><li>1<p>A</p>B<li>2<li>3</ul>") |
2246 | + secondLI = soup.find('li').nextSibling |
2247 | + self.assert_(secondLI.name == 'li' and secondLI.string == '2') |
2248 | + self.assertEquals(soup.find(text='1').nextSibling.name, 'p') |
2249 | + self.assertEquals(soup.find('p').nextSibling, 'B') |
2250 | + self.assertEquals(soup.find('p').nextSibling.previousSibling.nextSibling, 'B') |
2251 | + |
2252 | +class TagsAreObjectsToo(SoupTest): |
2253 | + "Tests the various built-in functions of Tag objects." |
2254 | + |
2255 | + def testLen(self): |
2256 | + soup = BeautifulSoup("<top>1<b>2</b>3</top>") |
2257 | + self.assertEquals(len(soup.top), 3) |
2258 | + |
2259 | +class StringEmUp(SoupTest): |
2260 | + "Tests the use of 'string' as an alias for a tag's only content." |
2261 | + |
2262 | + def testString(self): |
2263 | + s = BeautifulSoup("<b>foo</b>") |
2264 | + self.assertEquals(s.b.string, 'foo') |
2265 | + |
2266 | + def testLackOfString(self): |
2267 | + s = BeautifulSoup("<b>f<i>e</i>o</b>") |
2268 | + self.assert_(not s.b.string) |
2269 | + |
2270 | + def testStringAssign(self): |
2271 | + s = BeautifulSoup("<b></b>") |
2272 | + b = s.b |
2273 | + b.string = "foo" |
2274 | + string = b.string |
2275 | + self.assertEquals(string, "foo") |
2276 | + self.assert_(isinstance(string, NavigableString)) |
2277 | + |
2278 | +class AllText(SoupTest): |
2279 | + "Tests the use of 'text' to get all of string content from the tag." |
2280 | + |
2281 | + def testText(self): |
2282 | + soup = BeautifulSoup("<ul><li>spam</li><li>eggs</li><li>cheese</li>") |
2283 | + self.assertEquals(soup.ul.text, "spameggscheese") |
2284 | + self.assertEquals(soup.ul.getText('/'), "spam/eggs/cheese") |
2285 | + |
2286 | + def testTextHasCorrectSpacing(self): |
2287 | + soup = BeautifulSoup("<p>This is a <i>test</i>.") |
2288 | + self.assertEquals(soup.text, "This is a test.") |
2289 | + self.assertEquals(soup.getText('/'), "This is a /test/.") |
2290 | + |
2291 | +class ThatsMyLimit(SoupTest): |
2292 | + "Tests the limit argument." |
2293 | + |
2294 | + def testBasicLimits(self): |
2295 | + s = BeautifulSoup('<br id="1" /><br id="1" /><br id="1" /><br id="1" />') |
2296 | + self.assertEquals(len(s.findAll('br')), 4) |
2297 | + self.assertEquals(len(s.findAll('br', limit=2)), 2) |
2298 | + self.assertEquals(len(s('br', limit=2)), 2) |
2299 | + |
2300 | +class OnlyTheLonely(SoupTest): |
2301 | + "Tests the parseOnly argument to the constructor." |
2302 | + def setUp(self): |
2303 | + x = [] |
2304 | + for i in range(1,6): |
2305 | + x.append('<a id="%s">' % i) |
2306 | + for j in range(100,103): |
2307 | + x.append('<b id="%s.%s">Content %s.%s</b>' % (i,j, i,j)) |
2308 | + x.append('</a>') |
2309 | + self.x = ''.join(x) |
2310 | + |
2311 | + def testOnly(self): |
2312 | + strainer = SoupStrainer("b") |
2313 | + soup = BeautifulSoup(self.x, parseOnlyThese=strainer) |
2314 | + self.assertEquals(len(soup), 15) |
2315 | + |
2316 | + strainer = SoupStrainer(id=re.compile("100.*")) |
2317 | + soup = BeautifulSoup(self.x, parseOnlyThese=strainer) |
2318 | + self.assertEquals(len(soup), 5) |
2319 | + |
2320 | + strainer = SoupStrainer(text=re.compile("10[01].*")) |
2321 | + soup = BeautifulSoup(self.x, parseOnlyThese=strainer) |
2322 | + self.assertEquals(len(soup), 10) |
2323 | + |
2324 | + strainer = SoupStrainer(text=lambda(x):x[8]=='3') |
2325 | + soup = BeautifulSoup(self.x, parseOnlyThese=strainer) |
2326 | + self.assertEquals(len(soup), 3) |
2327 | + |
2328 | +class PickleMeThis(SoupTest): |
2329 | + "Testing features like pickle and deepcopy." |
2330 | + |
2331 | + def setUp(self): |
2332 | + self.page = """<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" |
2333 | +"http://www.w3.org/TR/REC-html40/transitional.dtd"> |
2334 | +<html> |
2335 | +<head> |
2336 | +<meta http-equiv="Content-Type" content="text/html; charset=utf-8"> |
2337 | +<title>Beautiful Soup: We called him Tortoise because he taught us.</title> |
2338 | +<link rev="made" href="mailto:leonardr@segfault.org"> |
2339 | +<meta name="Description" content="Beautiful Soup: an HTML parser optimized for screen-scraping."> |
2340 | +<meta name="generator" content="Markov Approximation 1.4 (module: leonardr)"> |
2341 | +<meta name="author" content="Leonard Richardson"> |
2342 | +</head> |
2343 | +<body> |
2344 | +<a href="foo">foo</a> |
2345 | +<a href="foo"><b>bar</b></a> |
2346 | +</body> |
2347 | +</html>""" |
2348 | + |
2349 | + self.soup = BeautifulSoup(self.page) |
2350 | + |
2351 | + def testPickle(self): |
2352 | + import pickle |
2353 | + dumped = pickle.dumps(self.soup, 2) |
2354 | + loaded = pickle.loads(dumped) |
2355 | + self.assertEqual(loaded.__class__, BeautifulSoup) |
2356 | + self.assertEqual(str(loaded), str(self.soup)) |
2357 | + |
2358 | + def testDeepcopy(self): |
2359 | + from copy import deepcopy |
2360 | + copied = deepcopy(self.soup) |
2361 | + self.assertEqual(str(copied), str(self.soup)) |
2362 | + |
2363 | + def testUnicodePickle(self): |
2364 | + import cPickle as pickle |
2365 | + html = "<b>" + chr(0xc3) + "</b>" |
2366 | + soup = BeautifulSoup(html) |
2367 | + dumped = pickle.dumps(soup, pickle.HIGHEST_PROTOCOL) |
2368 | + loaded = pickle.loads(dumped) |
2369 | + self.assertEqual(str(loaded), str(soup)) |
2370 | + |
2371 | + |
2372 | +class WriteOnlyCode(SoupTest): |
2373 | + "Testing the modification of the tree." |
2374 | + |
2375 | + def testModifyAttributes(self): |
2376 | + soup = BeautifulSoup('<a id="1"></a>') |
2377 | + soup.a['id'] = 2 |
2378 | + self.assertEqual(soup.renderContents(), '<a id="2"></a>') |
2379 | + del(soup.a['id']) |
2380 | + self.assertEqual(soup.renderContents(), '<a></a>') |
2381 | + soup.a['id2'] = 'foo' |
2382 | + self.assertEqual(soup.renderContents(), '<a id2="foo"></a>') |
2383 | + |
2384 | + def testNewTagCreation(self): |
2385 | + "Makes sure tags don't step on each others' toes." |
2386 | + soup = BeautifulSoup() |
2387 | + a = Tag(soup, 'a') |
2388 | + ol = Tag(soup, 'ol') |
2389 | + a['href'] = 'http://foo.com/' |
2390 | + self.assertRaises(KeyError, lambda : ol['href']) |
2391 | + |
2392 | + def testNewTagWithAttributes(self): |
2393 | + """Makes sure new tags can be created complete with attributes.""" |
2394 | + soup = BeautifulSoup() |
2395 | + a = Tag(soup, 'a', [('href', 'foo')]) |
2396 | + b = Tag(soup, 'b', {'class':'bar'}) |
2397 | + soup.insert(0,a) |
2398 | + soup.insert(1,b) |
2399 | + self.assertEqual(soup.a['href'], 'foo') |
2400 | + self.assertEqual(soup.b['class'], 'bar') |
2401 | + |
2402 | + def testTagReplacement(self): |
2403 | + # Make sure you can replace an element with itself. |
2404 | + text = "<a><b></b><c>Foo<d></d></c></a><a><e></e></a>" |
2405 | + soup = BeautifulSoup(text) |
2406 | + c = soup.c |
2407 | + soup.c.replaceWith(c) |
2408 | + self.assertEquals(str(soup), text) |
2409 | + |
2410 | + # A very simple case |
2411 | + soup = BeautifulSoup("<b>Argh!</b>") |
2412 | + soup.find(text="Argh!").replaceWith("Hooray!") |
2413 | + newText = soup.find(text="Hooray!") |
2414 | + b = soup.b |
2415 | + self.assertEqual(newText.previous, b) |
2416 | + self.assertEqual(newText.parent, b) |
2417 | + self.assertEqual(newText.previous.next, newText) |
2418 | + self.assertEqual(newText.next, None) |
2419 | + |
2420 | + # A more complex case |
2421 | + soup = BeautifulSoup("<a><b>Argh!</b><c></c><d></d></a>") |
2422 | + soup.b.insert(1, "Hooray!") |
2423 | + newText = soup.find(text="Hooray!") |
2424 | + self.assertEqual(newText.previous, "Argh!") |
2425 | + self.assertEqual(newText.previous.next, newText) |
2426 | + |
2427 | + self.assertEqual(newText.previousSibling, "Argh!") |
2428 | + self.assertEqual(newText.previousSibling.nextSibling, newText) |
2429 | + |
2430 | + self.assertEqual(newText.nextSibling, None) |
2431 | + self.assertEqual(newText.next, soup.c) |
2432 | + |
2433 | + text = "<html>There's <b>no</b> business like <b>show</b> business</html>" |
2434 | + soup = BeautifulSoup(text) |
2435 | + no, show = soup.findAll('b') |
2436 | + show.replaceWith(no) |
2437 | + self.assertEquals(str(soup), "<html>There's business like <b>no</b> business</html>") |
2438 | + |
2439 | + # Even more complex |
2440 | + soup = BeautifulSoup("<a><b>Find</b><c>lady!</c><d></d></a>") |
2441 | + tag = Tag(soup, 'magictag') |
2442 | + tag.insert(0, "the") |
2443 | + soup.a.insert(1, tag) |
2444 | + |
2445 | + b = soup.b |
2446 | + c = soup.c |
2447 | + theText = tag.find(text=True) |
2448 | + findText = b.find(text="Find") |
2449 | + |
2450 | + self.assertEqual(findText.next, tag) |
2451 | + self.assertEqual(tag.previous, findText) |
2452 | + self.assertEqual(b.nextSibling, tag) |
2453 | + self.assertEqual(tag.previousSibling, b) |
2454 | + self.assertEqual(tag.nextSibling, c) |
2455 | + self.assertEqual(c.previousSibling, tag) |
2456 | + |
2457 | + self.assertEqual(theText.next, c) |
2458 | + self.assertEqual(c.previous, theText) |
2459 | + |
2460 | + # Aand... incredibly complex. |
2461 | + soup = BeautifulSoup("""<a>We<b>reserve<c>the</c><d>right</d></b></a><e>to<f>refuse</f><g>service</g></e>""") |
2462 | + f = soup.f |
2463 | + a = soup.a |
2464 | + c = soup.c |
2465 | + e = soup.e |
2466 | + weText = a.find(text="We") |
2467 | + soup.b.replaceWith(soup.f) |
2468 | + self.assertEqual(str(soup), "<a>We<f>refuse</f></a><e>to<g>service</g></e>") |
2469 | + |
2470 | + self.assertEqual(f.previous, weText) |
2471 | + self.assertEqual(weText.next, f) |
2472 | + self.assertEqual(f.previousSibling, weText) |
2473 | + self.assertEqual(f.nextSibling, None) |
2474 | + self.assertEqual(weText.nextSibling, f) |
2475 | + |
2476 | + def testReplaceWithChildren(self): |
2477 | + soup = BeautifulStoneSoup( |
2478 | + "<top><replace><child1/><child2/></replace></top>", |
2479 | + selfClosingTags=["child1", "child2"]) |
2480 | + soup.replaceTag.replaceWithChildren() |
2481 | + self.assertEqual(soup.top.contents[0].name, "child1") |
2482 | + self.assertEqual(soup.top.contents[1].name, "child2") |
2483 | + |
2484 | + def testAppend(self): |
2485 | + doc = "<p>Don't leave me <b>here</b>.</p> <p>Don't leave me.</p>" |
2486 | + soup = BeautifulSoup(doc) |
2487 | + second_para = soup('p')[1] |
2488 | + bold = soup.find('b') |
2489 | + soup('p')[1].append(soup.find('b')) |
2490 | + self.assertEqual(bold.parent, second_para) |
2491 | + self.assertEqual(str(soup), |
2492 | + "<p>Don't leave me .</p> " |
2493 | + "<p>Don't leave me.<b>here</b></p>") |
2494 | + |
2495 | + def testTagExtraction(self): |
2496 | + # A very simple case |
2497 | + text = '<html><div id="nav">Nav crap</div>Real content here.</html>' |
2498 | + soup = BeautifulSoup(text) |
2499 | + extracted = soup.find("div", id="nav").extract() |
2500 | + self.assertEqual(str(soup), "<html>Real content here.</html>") |
2501 | + self.assertEqual(str(extracted), '<div id="nav">Nav crap</div>') |
2502 | + |
2503 | + # A simple case, a more complex test. |
2504 | + text = "<doc><a>1<b>2</b></a><a>i<b>ii</b></a><a>A<b>B</b></a></doc>" |
2505 | + soup = BeautifulStoneSoup(text) |
2506 | + doc = soup.doc |
2507 | + numbers, roman, letters = soup("a") |
2508 | + |
2509 | + self.assertEqual(roman.parent, doc) |
2510 | + oldPrevious = roman.previous |
2511 | + endOfThisTag = roman.nextSibling.previous |
2512 | + self.assertEqual(oldPrevious, "2") |
2513 | + self.assertEqual(roman.next, "i") |
2514 | + self.assertEqual(endOfThisTag, "ii") |
2515 | + self.assertEqual(roman.previousSibling, numbers) |
2516 | + self.assertEqual(roman.nextSibling, letters) |
2517 | + |
2518 | + roman.extract() |
2519 | + self.assertEqual(roman.parent, None) |
2520 | + self.assertEqual(roman.previous, None) |
2521 | + self.assertEqual(roman.next, "i") |
2522 | + self.assertEqual(letters.previous, '2') |
2523 | + self.assertEqual(roman.previousSibling, None) |
2524 | + self.assertEqual(roman.nextSibling, None) |
2525 | + self.assertEqual(endOfThisTag.next, None) |
2526 | + self.assertEqual(roman.b.contents[0].next, None) |
2527 | + self.assertEqual(numbers.nextSibling, letters) |
2528 | + self.assertEqual(letters.previousSibling, numbers) |
2529 | + self.assertEqual(len(doc.contents), 2) |
2530 | + self.assertEqual(doc.contents[0], numbers) |
2531 | + self.assertEqual(doc.contents[1], letters) |
2532 | + |
2533 | + # A more complex case. |
2534 | + text = "<a>1<b>2<c>Hollywood, baby!</c></b></a>3" |
2535 | + soup = BeautifulStoneSoup(text) |
2536 | + one = soup.find(text="1") |
2537 | + three = soup.find(text="3") |
2538 | + toExtract = soup.b |
2539 | + soup.b.extract() |
2540 | + self.assertEqual(one.next, three) |
2541 | + self.assertEqual(three.previous, one) |
2542 | + self.assertEqual(one.parent.nextSibling, three) |
2543 | + self.assertEqual(three.previousSibling, soup.a) |
2544 | + |
2545 | + def testClear(self): |
2546 | + soup = BeautifulSoup("<ul><li></li><li></li></ul>") |
2547 | + soup.ul.clear() |
2548 | + self.assertEqual(len(soup.ul.contents), 0) |
2549 | + |
2550 | +class TheManWithoutAttributes(SoupTest): |
2551 | + "Test attribute access" |
2552 | + |
2553 | + def testHasKey(self): |
2554 | + text = "<foo attr='bar'>" |
2555 | + self.assertEquals(BeautifulSoup(text).foo.has_key('attr'), True) |
2556 | + |
2557 | +class QuoteMeOnThat(SoupTest): |
2558 | + "Test quoting" |
2559 | + def testQuotedAttributeValues(self): |
2560 | + self.assertSoupEquals("<foo attr='bar'></foo>", |
2561 | + '<foo attr="bar"></foo>') |
2562 | + |
2563 | + text = """<foo attr='bar "brawls" happen'>a</foo>""" |
2564 | + soup = BeautifulSoup(text) |
2565 | + self.assertEquals(soup.renderContents(), text) |
2566 | + |
2567 | + soup.foo['attr'] = 'Brawls happen at "Bob\'s Bar"' |
2568 | + newText = """<foo attr='Brawls happen at "Bob&squot;s Bar"'>a</foo>""" |
2569 | + self.assertSoupEquals(soup.renderContents(), newText) |
2570 | + |
2571 | + self.assertSoupEquals('<this is="really messed up & stuff">', |
2572 | + '<this is="really messed up & stuff"></this>') |
2573 | + |
2574 | + # This is not what the original author had in mind, but it's |
2575 | + # a legitimate interpretation of what they wrote. |
2576 | + self.assertSoupEquals("""<a href="foo</a>, </a><a href="bar">baz</a>""", |
2577 | + '<a href="foo</a>, </a><a href="></a>, <a href="bar">baz</a>') |
2578 | + |
2579 | + # SGMLParser generates bogus parse events when attribute values |
2580 | + # contain embedded brackets, but at least Beautiful Soup fixes |
2581 | + # it up a little. |
2582 | + self.assertSoupEquals('<a b="<a>">', '<a b="<a>"></a><a>"></a>') |
2583 | + self.assertSoupEquals('<a href="http://foo.com/<a> and blah and blah', |
2584 | + """<a href='"http://foo.com/'></a><a> and blah and blah</a>""") |
2585 | + |
2586 | + |
2587 | + |
2588 | +class YoureSoLiteral(SoupTest): |
2589 | + "Test literal mode." |
2590 | + def testLiteralMode(self): |
2591 | + text = "<script>if (i<imgs.length)</script><b>Foo</b>" |
2592 | + soup = BeautifulSoup(text) |
2593 | + self.assertEqual(soup.script.contents[0], "if (i<imgs.length)") |
2594 | + self.assertEqual(soup.b.contents[0], "Foo") |
2595 | + |
2596 | + def testTextArea(self): |
2597 | + text = "<textarea><b>This is an example of an HTML tag</b><&<&</textarea>" |
2598 | + soup = BeautifulSoup(text) |
2599 | + self.assertEqual(soup.textarea.contents[0], |
2600 | + "<b>This is an example of an HTML tag</b><&<&") |
2601 | + |
2602 | +class OperatorOverload(SoupTest): |
2603 | + "Our operators do it all! Call now!" |
2604 | + |
2605 | + def testTagNameAsFind(self): |
2606 | + "Tests that referencing a tag name as a member delegates to find()." |
2607 | + soup = BeautifulSoup('<b id="1">foo<i>bar</i></b><b>Red herring</b>') |
2608 | + self.assertEqual(soup.b.i, soup.find('b').find('i')) |
2609 | + self.assertEqual(soup.b.i.string, 'bar') |
2610 | + self.assertEqual(soup.b['id'], '1') |
2611 | + self.assertEqual(soup.b.contents[0], 'foo') |
2612 | + self.assert_(not soup.a) |
2613 | + |
2614 | + #Test the .fooTag variant of .foo. |
2615 | + self.assertEqual(soup.bTag.iTag.string, 'bar') |
2616 | + self.assertEqual(soup.b.iTag.string, 'bar') |
2617 | + self.assertEqual(soup.find('b').find('i'), soup.bTag.iTag) |
2618 | + |
2619 | +class NestableEgg(SoupTest): |
2620 | + """Here we test tag nesting. TEST THE NEST, DUDE! X-TREME!""" |
2621 | + |
2622 | + def testParaInsideBlockquote(self): |
2623 | + soup = BeautifulSoup('<blockquote><p><b>Foo</blockquote><p>Bar') |
2624 | + self.assertEqual(soup.blockquote.p.b.string, 'Foo') |
2625 | + self.assertEqual(soup.blockquote.b.string, 'Foo') |
2626 | + self.assertEqual(soup.find('p', recursive=False).string, 'Bar') |
2627 | + |
2628 | + def testNestedTables(self): |
2629 | + text = """<table id="1"><tr><td>Here's another table: |
2630 | + <table id="2"><tr><td>Juicy text</td></tr></table></td></tr></table>""" |
2631 | + soup = BeautifulSoup(text) |
2632 | + self.assertEquals(soup.table.table.td.string, 'Juicy text') |
2633 | + self.assertEquals(len(soup.findAll('table')), 2) |
2634 | + self.assertEquals(len(soup.table.findAll('table')), 1) |
2635 | + self.assertEquals(soup.find('table', {'id' : 2}).parent.parent.parent.name, |
2636 | + 'table') |
2637 | + |
2638 | + text = "<table><tr><td><div><table>Foo</table></div></td></tr></table>" |
2639 | + soup = BeautifulSoup(text) |
2640 | + self.assertEquals(soup.table.tr.td.div.table.contents[0], "Foo") |
2641 | + |
2642 | + text = """<table><thead><tr>Foo</tr></thead><tbody><tr>Bar</tr></tbody> |
2643 | + <tfoot><tr>Baz</tr></tfoot></table>""" |
2644 | + soup = BeautifulSoup(text) |
2645 | + self.assertEquals(soup.table.thead.tr.contents[0], "Foo") |
2646 | + |
2647 | + def testBadNestedTables(self): |
2648 | + soup = BeautifulSoup("<table><tr><table><tr id='nested'>") |
2649 | + self.assertEquals(soup.table.tr.table.tr['id'], 'nested') |
2650 | + |
2651 | +class CleanupOnAisleFour(SoupTest): |
2652 | + """Here we test cleanup of text that breaks SGMLParser or is just |
2653 | + obnoxious.""" |
2654 | + |
2655 | + def testSelfClosingtag(self): |
2656 | + self.assertEqual(str(BeautifulSoup("Foo<br/>Bar").find('br')), |
2657 | + '<br />') |
2658 | + |
2659 | + self.assertSoupEquals('<p>test1<br/>test2</p>', |
2660 | + '<p>test1<br />test2</p>') |
2661 | + |
2662 | + text = '<p>test1<selfclosing>test2' |
2663 | + soup = BeautifulStoneSoup(text) |
2664 | + self.assertEqual(str(soup), |
2665 | + '<p>test1<selfclosing>test2</selfclosing></p>') |
2666 | + |
2667 | + soup = BeautifulStoneSoup(text, selfClosingTags='selfclosing') |
2668 | + self.assertEqual(str(soup), |
2669 | + '<p>test1<selfclosing />test2</p>') |
2670 | + |
2671 | + def testSelfClosingTagOrNot(self): |
2672 | + text = "<item><link>http://foo.com/</link></item>" |
2673 | + self.assertEqual(BeautifulStoneSoup(text).renderContents(), text) |
2674 | + self.assertEqual(BeautifulSoup(text).renderContents(), |
2675 | + '<item><link />http://foo.com/</item>') |
2676 | + |
2677 | + def testCData(self): |
2678 | + xml = "<root>foo<![CDATA[foobar]]>bar</root>" |
2679 | + self.assertSoupEquals(xml, xml) |
2680 | + r = re.compile("foo.*bar") |
2681 | + soup = BeautifulSoup(xml) |
2682 | + self.assertEquals(soup.find(text=r).string, "foobar") |
2683 | + self.assertEquals(soup.find(text=r).__class__, CData) |
2684 | + |
2685 | + def testComments(self): |
2686 | + xml = "foo<!--foobar-->baz" |
2687 | + self.assertSoupEquals(xml) |
2688 | + r = re.compile("foo.*bar") |
2689 | + soup = BeautifulSoup(xml) |
2690 | + self.assertEquals(soup.find(text=r).string, "foobar") |
2691 | + self.assertEquals(soup.find(text="foobar").__class__, Comment) |
2692 | + |
2693 | + def testDeclaration(self): |
2694 | + xml = "foo<!DOCTYPE foobar>baz" |
2695 | + self.assertSoupEquals(xml) |
2696 | + r = re.compile(".*foo.*bar") |
2697 | + soup = BeautifulSoup(xml) |
2698 | + text = "DOCTYPE foobar" |
2699 | + self.assertEquals(soup.find(text=r).string, text) |
2700 | + self.assertEquals(soup.find(text=text).__class__, Declaration) |
2701 | + |
2702 | + namespaced_doctype = ('<!DOCTYPE xsl:stylesheet SYSTEM "htmlent.dtd">' |
2703 | + '<html>foo</html>') |
2704 | + soup = BeautifulSoup(namespaced_doctype) |
2705 | + self.assertEquals(soup.contents[0], |
2706 | + 'DOCTYPE xsl:stylesheet SYSTEM "htmlent.dtd"') |
2707 | + self.assertEquals(soup.html.contents[0], 'foo') |
2708 | + |
2709 | + def testEntityConversions(self): |
2710 | +        text = "&lt;&lt;sacr&eacute;&#32;bleu!&gt;&gt;"
2711 | + soup = BeautifulStoneSoup(text) |
2712 | + self.assertSoupEquals(text) |
2713 | + |
2714 | + xmlEnt = BeautifulStoneSoup.XML_ENTITIES |
2715 | + htmlEnt = BeautifulStoneSoup.HTML_ENTITIES |
2716 | + xhtmlEnt = BeautifulStoneSoup.XHTML_ENTITIES |
2717 | + |
2718 | + soup = BeautifulStoneSoup(text, convertEntities=xmlEnt) |
2719 | +        self.assertEquals(str(soup), "<<sacr&eacute; bleu!>>")
2720 | + |
2721 | + soup = BeautifulStoneSoup(text, convertEntities=xmlEnt) |
2722 | +        self.assertEquals(str(soup), "<<sacr&eacute; bleu!>>")
2723 | + |
2724 | + soup = BeautifulStoneSoup(text, convertEntities=htmlEnt) |
2725 | + self.assertEquals(unicode(soup), u"<<sacr\xe9 bleu!>>") |
2726 | + |
2727 | + # Make sure the "XML", "HTML", and "XHTML" settings work. |
2728 | +        text = "&lt;&trade;&apos;"
2729 | + soup = BeautifulStoneSoup(text, convertEntities=xmlEnt) |
2730 | +        self.assertEquals(unicode(soup), u"<&trade;'")
2731 | + |
2732 | + soup = BeautifulStoneSoup(text, convertEntities=htmlEnt) |
2733 | +        self.assertEquals(unicode(soup), u"<\u2122&apos;")
2734 | + |
2735 | + soup = BeautifulStoneSoup(text, convertEntities=xhtmlEnt) |
2736 | + self.assertEquals(unicode(soup), u"<\u2122'") |
2737 | + |
2738 | + invalidEntity = "foo&#bar;baz" |
2739 | +        soup = BeautifulStoneSoup(
2740 | +            invalidEntity,
2741 | +            convertEntities=htmlEnt)
2742 | + self.assertEquals(str(soup), invalidEntity) |
2743 | + |
2744 | + def testNonBreakingSpaces(self): |
2745 | +        soup = BeautifulSoup("<a>&nbsp;&nbsp;</a>",
2746 | + convertEntities=BeautifulStoneSoup.HTML_ENTITIES) |
2747 | + self.assertEquals(unicode(soup), u"<a>\xa0\xa0</a>") |
2748 | + |
2749 | + def testWhitespaceInDeclaration(self): |
2750 | + self.assertSoupEquals('<! DOCTYPE>', '<!DOCTYPE>') |
2751 | + |
2752 | + def testJunkInDeclaration(self): |
2753 | + self.assertSoupEquals('<! Foo = -8>a', '<!Foo = -8>a') |
2754 | + |
2755 | + def testIncompleteDeclaration(self): |
2756 | + self.assertSoupEquals('a<!b <p>c') |
2757 | + |
2758 | + def testEntityReplacement(self): |
2759 | +        self.assertSoupEquals('<b>hello&nbsp;there</b>')
2760 | + |
2761 | + def testEntitiesInAttributeValues(self): |
2762 | +        self.assertSoupEquals('<x t="x&#241;">', '<x t="x\xc3\xb1"></x>')
2763 | +        self.assertSoupEquals('<x t="x&#xf1;">', '<x t="x\xc3\xb1"></x>')
2764 | + |
2765 | +        soup = BeautifulSoup('<x t="&gt;&trade;">',
2766 | + convertEntities=BeautifulStoneSoup.HTML_ENTITIES) |
2767 | + self.assertEquals(unicode(soup), u'<x t=">\u2122"></x>') |
2768 | + |
2769 | +        uri = "http://crummy.com?sacr&eacute;&amp;bleu"
2770 | + link = '<a href="%s"></a>' % uri |
2771 | + soup = BeautifulSoup(link) |
2772 | + self.assertEquals(unicode(soup), link) |
2773 | + #self.assertEquals(unicode(soup.a['href']), uri) |
2774 | + |
2775 | + soup = BeautifulSoup(link, convertEntities=BeautifulSoup.HTML_ENTITIES) |
2776 | + self.assertEquals(unicode(soup), |
2777 | +                          link.replace("&eacute;", u"\xe9"))
2778 | + |
2779 | +        uri = "http://crummy.com?sacr&eacute;&bleu"
2780 | + link = '<a href="%s"></a>' % uri |
2781 | + soup = BeautifulSoup(link, convertEntities=BeautifulSoup.HTML_ENTITIES) |
2782 | + self.assertEquals(unicode(soup.a['href']), |
2783 | +                          uri.replace("&eacute;", u"\xe9"))
2784 | + |
2785 | + def testNakedAmpersands(self): |
2786 | + html = {'convertEntities':BeautifulStoneSoup.HTML_ENTITIES} |
2787 | + soup = BeautifulStoneSoup("AT&T ", **html) |
2788 | +        self.assertEquals(str(soup), 'AT&amp;T ')
2789 | + |
2790 | + nakedAmpersandInASentence = "AT&T was Ma Bell" |
2791 | +        soup = BeautifulStoneSoup(nakedAmpersandInASentence, **html)
2792 | +        self.assertEquals(str(soup),
2793 | +                          nakedAmpersandInASentence.replace('&','&amp;'))
2794 | + |
2795 | + invalidURL = '<a href="http://example.org?a=1&b=2;3">foo</a>' |
2796 | +        validURL = invalidURL.replace('&','&amp;')
2797 | + soup = BeautifulStoneSoup(invalidURL) |
2798 | + self.assertEquals(str(soup), validURL) |
2799 | + |
2800 | + soup = BeautifulStoneSoup(validURL) |
2801 | + self.assertEquals(str(soup), validURL) |
2802 | + |
2803 | + |
2804 | +class EncodeRed(SoupTest): |
2805 | + """Tests encoding conversion, Unicode conversion, and Microsoft |
2806 | + smart quote fixes.""" |
2807 | + |
2808 | + def testUnicodeDammitStandalone(self): |
2809 | + markup = "<foo>\x92</foo>" |
2810 | + dammit = UnicodeDammit(markup) |
2811 | +        self.assertEquals(dammit.unicode, "<foo>&#x2019;</foo>")
2812 | + |
2813 | + hebrew = "\xed\xe5\xec\xf9" |
2814 | + dammit = UnicodeDammit(hebrew, ["iso-8859-8"]) |
2815 | + self.assertEquals(dammit.unicode, u'\u05dd\u05d5\u05dc\u05e9') |
2816 | + self.assertEquals(dammit.originalEncoding, 'iso-8859-8') |
2817 | + |
2818 | + def testGarbageInGarbageOut(self): |
2819 | + ascii = "<foo>a</foo>" |
2820 | + asciiSoup = BeautifulStoneSoup(ascii) |
2821 | + self.assertEquals(ascii, str(asciiSoup)) |
2822 | + |
2823 | + unicodeData = u"<foo>\u00FC</foo>" |
2824 | + utf8 = unicodeData.encode("utf-8") |
2825 | + self.assertEquals(utf8, '<foo>\xc3\xbc</foo>') |
2826 | + |
2827 | + unicodeSoup = BeautifulStoneSoup(unicodeData) |
2828 | + self.assertEquals(unicodeData, unicode(unicodeSoup)) |
2829 | + self.assertEquals(unicode(unicodeSoup.foo.string), u'\u00FC') |
2830 | + |
2831 | + utf8Soup = BeautifulStoneSoup(utf8, fromEncoding='utf-8') |
2832 | + self.assertEquals(utf8, str(utf8Soup)) |
2833 | + self.assertEquals(utf8Soup.originalEncoding, "utf-8") |
2834 | + |
2835 | + utf8Soup = BeautifulStoneSoup(unicodeData) |
2836 | + self.assertEquals(utf8, str(utf8Soup)) |
2837 | + self.assertEquals(utf8Soup.originalEncoding, None) |
2838 | + |
2839 | + |
2840 | + def testHandleInvalidCodec(self): |
2841 | + for bad_encoding in ['.utf8', '...', 'utF---16.!']: |
2842 | + soup = BeautifulSoup("Räksmörgås", fromEncoding=bad_encoding) |
2843 | + self.assertEquals(soup.originalEncoding, 'utf-8') |
2844 | + |
2845 | + def testUnicodeSearch(self): |
2846 | + html = u'<html><body><h1>Räksmörgås</h1></body></html>' |
2847 | + soup = BeautifulSoup(html) |
2848 | + self.assertEqual(soup.find(text=u'Räksmörgås'),u'Räksmörgås') |
2849 | + |
2850 | + def testRewrittenXMLHeader(self): |
2851 | + euc_jp = '<?xml version="1.0 encoding="euc-jp"?>\n<foo>\n\xa4\xb3\xa4\xec\xa4\xcfEUC-JP\xa4\xc7\xa5\xb3\xa1\xbc\xa5\xc7\xa5\xa3\xa5\xf3\xa5\xb0\xa4\xb5\xa4\xec\xa4\xbf\xc6\xfc\xcb\xdc\xb8\xec\xa4\xce\xa5\xd5\xa5\xa1\xa5\xa4\xa5\xeb\xa4\xc7\xa4\xb9\xa1\xa3\n</foo>\n' |
2852 | + utf8 = "<?xml version='1.0' encoding='utf-8'?>\n<foo>\n\xe3\x81\x93\xe3\x82\x8c\xe3\x81\xafEUC-JP\xe3\x81\xa7\xe3\x82\xb3\xe3\x83\xbc\xe3\x83\x87\xe3\x82\xa3\xe3\x83\xb3\xe3\x82\xb0\xe3\x81\x95\xe3\x82\x8c\xe3\x81\x9f\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e\xe3\x81\xae\xe3\x83\x95\xe3\x82\xa1\xe3\x82\xa4\xe3\x83\xab\xe3\x81\xa7\xe3\x81\x99\xe3\x80\x82\n</foo>\n" |
2853 | + soup = BeautifulStoneSoup(euc_jp) |
2854 | + if soup.originalEncoding != "euc-jp": |
2855 | + raise Exception("Test failed when parsing euc-jp document. " |
2856 | + "If you're running Python >=2.4, or you have " |
2857 | + "cjkcodecs installed, this is a real problem. " |
2858 | + "Otherwise, ignore it.") |
2859 | + |
2860 | + self.assertEquals(soup.originalEncoding, "euc-jp") |
2861 | + self.assertEquals(str(soup), utf8) |
2862 | + |
2863 | + old_text = "<?xml encoding='windows-1252'><foo>\x92</foo>" |
2864 | +        new_text = "<?xml version='1.0' encoding='utf-8'?><foo>&#x2019;</foo>"
2865 | + self.assertSoupEquals(old_text, new_text) |
2866 | + |
2867 | + def testRewrittenMetaTag(self): |
2868 | + no_shift_jis_html = '''<html><head>\n<meta http-equiv="Content-language" content="ja" /></head><body><pre>\n\x82\xb1\x82\xea\x82\xcdShift-JIS\x82\xc5\x83R\x81[\x83f\x83B\x83\x93\x83O\x82\xb3\x82\xea\x82\xbd\x93\xfa\x96{\x8c\xea\x82\xcc\x83t\x83@\x83C\x83\x8b\x82\xc5\x82\xb7\x81B\n</pre></body></html>''' |
2869 | + soup = BeautifulSoup(no_shift_jis_html) |
2870 | + |
2871 | + # Beautiful Soup used to try to rewrite the meta tag even if the |
2872 | + # meta tag got filtered out by the strainer. This test makes |
2873 | + # sure that doesn't happen. |
2874 | + strainer = SoupStrainer('pre') |
2875 | + soup = BeautifulSoup(no_shift_jis_html, parseOnlyThese=strainer) |
2876 | + self.assertEquals(soup.contents[0].name, 'pre') |
2877 | + |
2878 | + meta_tag = ('<meta content="text/html; charset=x-sjis" ' |
2879 | + 'http-equiv="Content-type" />') |
2880 | + shift_jis_html = ( |
2881 | + '<html><head>\n%s\n' |
2882 | + '<meta http-equiv="Content-language" content="ja" />' |
2883 | + '</head><body><pre>\n' |
2884 | + '\x82\xb1\x82\xea\x82\xcdShift-JIS\x82\xc5\x83R\x81[\x83f' |
2885 | + '\x83B\x83\x93\x83O\x82\xb3\x82\xea\x82\xbd\x93\xfa\x96{\x8c' |
2886 | + '\xea\x82\xcc\x83t\x83@\x83C\x83\x8b\x82\xc5\x82\xb7\x81B\n' |
2887 | + '</pre></body></html>') % meta_tag |
2888 | + soup = BeautifulSoup(shift_jis_html) |
2889 | + if soup.originalEncoding != "shift-jis": |
2890 | + raise Exception("Test failed when parsing shift-jis document " |
2891 | +                            "with meta tag '%s'. "
2892 | + "If you're running Python >=2.4, or you have " |
2893 | + "cjkcodecs installed, this is a real problem. " |
2894 | + "Otherwise, ignore it." % meta_tag) |
2895 | + self.assertEquals(soup.originalEncoding, "shift-jis") |
2896 | + |
2897 | + content_type_tag = soup.meta['content'] |
2898 | + self.assertEquals(content_type_tag[content_type_tag.find('charset='):], |
2899 | + 'charset=%SOUP-ENCODING%') |
2900 | + content_type = str(soup.meta) |
2901 | + index = content_type.find('charset=') |
2902 | + self.assertEqual(content_type[index:index+len('charset=utf8')+1], |
2903 | + 'charset=utf-8') |
2904 | + content_type = soup.meta.__str__('shift-jis') |
2905 | + index = content_type.find('charset=') |
2906 | + self.assertEqual(content_type[index:index+len('charset=shift-jis')], |
2907 | + 'charset=shift-jis') |
2908 | + |
2909 | + self.assertEquals(str(soup), ( |
2910 | + '<html><head>\n' |
2911 | + '<meta content="text/html; charset=utf-8" ' |
2912 | + 'http-equiv="Content-type" />\n' |
2913 | + '<meta http-equiv="Content-language" content="ja" />' |
2914 | + '</head><body><pre>\n' |
2915 | + '\xe3\x81\x93\xe3\x82\x8c\xe3\x81\xafShift-JIS\xe3\x81\xa7\xe3' |
2916 | + '\x82\xb3\xe3\x83\xbc\xe3\x83\x87\xe3\x82\xa3\xe3\x83\xb3\xe3' |
2917 | + '\x82\xb0\xe3\x81\x95\xe3\x82\x8c\xe3\x81\x9f\xe6\x97\xa5\xe6' |
2918 | + '\x9c\xac\xe8\xaa\x9e\xe3\x81\xae\xe3\x83\x95\xe3\x82\xa1\xe3' |
2919 | + '\x82\xa4\xe3\x83\xab\xe3\x81\xa7\xe3\x81\x99\xe3\x80\x82\n' |
2920 | + '</pre></body></html>')) |
2921 | + self.assertEquals(soup.renderContents("shift-jis"), |
2922 | + shift_jis_html.replace('x-sjis', 'shift-jis')) |
2923 | + |
2924 | +        isolatin = """<html><meta http-equiv="Content-type" content="text/html; charset=ISO-Latin-1" />Sacr\xe9 bleu!</html>"""
2925 | + soup = BeautifulSoup(isolatin) |
2926 | + self.assertSoupEquals(soup.__str__("utf-8"), |
2927 | + isolatin.replace("ISO-Latin-1", "utf-8").replace("\xe9", "\xc3\xa9")) |
2928 | + |
2929 | + def testHebrew(self): |
2930 | + iso_8859_8= '<HEAD>\n<TITLE>Hebrew (ISO 8859-8) in Visual Directionality</TITLE>\n\n\n\n</HEAD>\n<BODY>\n<H1>Hebrew (ISO 8859-8) in Visual Directionality</H1>\n\xed\xe5\xec\xf9\n</BODY>\n' |
2931 | + utf8 = '<head>\n<title>Hebrew (ISO 8859-8) in Visual Directionality</title>\n</head>\n<body>\n<h1>Hebrew (ISO 8859-8) in Visual Directionality</h1>\n\xd7\x9d\xd7\x95\xd7\x9c\xd7\xa9\n</body>\n' |
2932 | + soup = BeautifulStoneSoup(iso_8859_8, fromEncoding="iso-8859-8") |
2933 | + self.assertEquals(str(soup), utf8) |
2934 | + |
2935 | + def testSmartQuotesNotSoSmartAnymore(self): |
2936 | + self.assertSoupEquals("\x91Foo\x92 <!--blah-->", |
2937 | +                              '&lsquo;Foo&rsquo; <!--blah-->')
2938 | + |
2939 | + def testDontConvertSmartQuotesWhenAlsoConvertingEntities(self): |
2940 | +        smartQuotes = "Il a dit, \x8BSacr&eacute; bleu!\x9b"
2941 | + soup = BeautifulSoup(smartQuotes) |
2942 | + self.assertEquals(str(soup), |
2943 | +                          'Il a dit, &lsaquo;Sacr&eacute; bleu!&rsaquo;')
2944 | + soup = BeautifulSoup(smartQuotes, convertEntities="html") |
2945 | + self.assertEquals(str(soup), |
2946 | + 'Il a dit, \xe2\x80\xb9Sacr\xc3\xa9 bleu!\xe2\x80\xba') |
2947 | + |
2948 | + def testDontSeeSmartQuotesWhereThereAreNone(self): |
2949 | + utf_8 = "\343\202\261\343\203\274\343\202\277\343\202\244 Watch" |
2950 | + self.assertSoupEquals(utf_8) |
2951 | + |
2952 | + |
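
Before the whitespace tests, a standalone UnicodeDammit sketch mirroring testUnicodeDammitStandalone above (same BeautifulSoup 3.x / Python 2 assumption):

    from BeautifulSoup import UnicodeDammit

    hebrew = "\xed\xe5\xec\xf9"  # ISO-8859-8 bytes
    dammit = UnicodeDammit(hebrew, ["iso-8859-8"])
    print repr(dammit.unicode)      # u'\u05dd\u05d5\u05dc\u05e9'
    print dammit.originalEncoding   # 'iso-8859-8'
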
2953 | +class Whitewash(SoupTest): |
2954 | + """Test whitespace preservation.""" |
2955 | + |
2956 | + def testPreservedWhitespace(self): |
2957 | +        self.assertSoupEquals("<pre>   </pre>")
2958 | +        self.assertSoupEquals("<pre> woo  </pre>")
2959 | + |
2960 | + def testCollapsedWhitespace(self): |
2961 | +        self.assertSoupEquals("<p>   </p>", "<p> </p>")
2962 | + |
2963 | + |
2964 | +if __name__ == '__main__': |
2965 | + unittest.main() |
2966 | |
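
To summarize the convertEntities modes the suite above exercises, a minimal sketch (BeautifulSoup 3.x on Python 2 assumed):

    from BeautifulSoup import BeautifulStoneSoup

    text = "&lt;&lt;sacr&eacute; bleu!&gt;&gt;"

    # XML_ENTITIES converts only the XML entities; &eacute; passes through.
    soup = BeautifulStoneSoup(text,
                              convertEntities=BeautifulStoneSoup.XML_ENTITIES)
    print str(soup)      # <<sacr&eacute; bleu!>>

    # HTML_ENTITIES also converts named HTML entities to Unicode characters.
    soup = BeautifulStoneSoup(text,
                              convertEntities=BeautifulStoneSoup.HTML_ENTITIES)
    print unicode(soup)  # u'<<sacr\xe9 bleu!>>'
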
2967 | === renamed file 'CHANGELOG' => 'CHANGELOG.THIS' |
2968 | === added file 'NEWS' |
2969 | --- NEWS 1970-01-01 00:00:00 +0000 |
2970 | +++ NEWS 2011-05-27 07:52:31 +0000 |
2971 | @@ -0,0 +1,79 @@ |
2972 | +Beautiful Soup 3.2.x series |
2973 | +*************************** |
2974 | + |
2975 | +This is the 'stable' series of Beautiful Soup. It will have only |
2976 | +occasional bugfix releases. It will not work with alternate parsers or |
2977 | +with Python 3.0. If you need these things, you'll need to use the 3.1 |
2978 | +series. |
2979 | + |
2980 | +3.2.0 |
2981 | +===== |
2982 | + |
2983 | +Gave the stable series a higher version number than the unstable |
2984 | +series, to make it very clear which series most people should be using. |
2985 | + |
2986 | +When creating a Tag object, you can specify its attributes as a dict |
2987 | +rather than as a list of 2-tuples. |
2988 | + |
2989 | +3.0.8.1 |
2990 | +======= |
2991 | + |
2992 | +Bug fixes |
2993 | +--------- |
2994 | + |
2995 | +Corrected Beautiful Soup's behavior when a findAll() call contained a |
2996 | +value for the "text" argument as well as values for arguments that |
2997 | +imply it should search for tags. (The "text" argument takes priority |
2998 | +and text is returned, not tags.) |
2999 | + |
3000 | +Corrected a typo that made I_CANT_BELIEVE_THEYRE_NESTABLE_BLOCK_TAGS |
3001 | +stop being a tuple. |
3002 | + |
3003 | +3.0.8 |
3004 | +===== |
3005 | + |
3006 | +Inauguration of the 3.0.x series as the stable series. |
3007 | + |
3008 | +New features |
3009 | +------------ |
3010 | + |
3011 | +Tag.replaceWithChildren() |
3012 | + Replace a tag with its children. |
3013 | + |
3014 | +Tag.string assignment |
3015 | + `tag.string = string` replaces the contents of a tag with `string`. |
3016 | + |
3017 | +Tag.text property (NOT A FUNCTION!) |
3018 | + tag.text gathers together and joins all text children. Much faster than |
3019 | + "".join(tag.findAll(text=True)) |
3020 | + |
3021 | +Tag.getText(separator=u"")
3022 | +  Same as Tag.text, but a function that allows a custom separator between joined
3023 | +  text elements.
3024 | + |
3025 | +Tag.index(element) -> int |
3026 | + Returns the index of `element` within the tag. Matches the actual |
3027 | + element instead of using __eq__. |
3028 | + |
3029 | +Tag.clear() |
3030 | + Remove all child elements. |
3031 | + |
3032 | +Improvements |
3033 | +------------ |
3034 | + |
3035 | +Previously, searching by CSS class only matched tags that had the |
3036 | +requested CSS class and no other classes. Now, searching by CSS class |
3037 | +matches every tag that uses that class. |
3038 | + |
3039 | +Performance |
3040 | +----------- |
3041 | + |
3042 | +Beware! Although searching the tree is much faster in 3.0.8 than in |
3043 | +previous versions, you probably won't notice the difference in real |
3044 | +situations, because the time spent searching the tree is typically |
3045 | +dwarfed by the time spent parsing the file in the first place. |
3046 | + |
3047 | +Tag.decompose() is several times faster. |
3048 | +A very basic findAll(...) is several times faster. |
3049 | +findAll(True) is special cased |
3050 | +Tag.recursiveChildGenerator is much faster |
3051 | |
3052 | === added file 'PKG-INFO' |
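
A compact sketch of the 3.0.8 tree-manipulation features listed above, assuming BeautifulSoup 3.x on Python 2:

    from BeautifulSoup import BeautifulSoup

    soup = BeautifulSoup('<p>This is a <i>test</i>, ok?</p>')
    p = soup.p

    print p.text                     # property: joins all text children
    print p.getText(separator=u' ')  # same, with a custom separator
    print p.index(p.i)               # position of the actual <i> element
    p.i.string = 'quiz'              # tag.string assignment replaces contents
    p.clear()                        # remove all child elements
    print p                          # <p></p>
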
3053 | --- PKG-INFO 1970-01-01 00:00:00 +0000 |
3054 | +++ PKG-INFO 2011-05-27 07:52:31 +0000 |
3055 | @@ -0,0 +1,19 @@ |
3056 | +Metadata-Version: 1.0 |
3057 | +Name: BeautifulSoup |
3058 | +Version: 3.0.7a |
3059 | +Summary: HTML/XML parser for quick-turnaround applications like screen-scraping. |
3060 | +Home-page: http://www.crummy.com/software/BeautifulSoup/ |
3061 | +Author: Leonard Richardson |
3062 | +Author-email: leonardr@segfault.org |
3063 | +License: BSD |
3064 | +Download-URL: http://www.crummy.com/software/BeautifulSoup/download/ |
3065 | +Description: Beautiful Soup parses arbitrarily invalid SGML and provides a variety of methods and Pythonic idioms for iterating and searching the parse tree. |
3066 | +Platform: UNKNOWN |
3067 | +Classifier: Development Status :: 5 - Production/Stable |
3068 | +Classifier: Intended Audience :: Developers |
3069 | +Classifier: License :: OSI Approved :: Python Software Foundation License |
3070 | +Classifier: Programming Language :: Python |
3071 | +Classifier: Topic :: Text Processing :: Markup :: HTML |
3072 | +Classifier: Topic :: Text Processing :: Markup :: XML |
3073 | +Classifier: Topic :: Text Processing :: Markup :: SGML |
3074 | +Classifier: Topic :: Software Development :: Libraries :: Python Modules |
3075 | |
3076 | === renamed file 'README.txt' => 'README.txt.THIS' |
3077 | === renamed file 'bs4/__init__.py' => 'bs4/__init__.py.THIS' |
3078 | === renamed file 'bs4/builder/__init__.py' => 'bs4/builder/__init__.py.THIS' |
3079 | === renamed file 'bs4/builder/_lxml.py' => 'bs4/builder/_lxml.py.THIS' |
3080 | === renamed file 'bs4/dammit.py' => 'bs4/dammit.py.THIS' |
3081 | === renamed file 'bs4/element.py' => 'bs4/element.py.THIS' |
3082 | === renamed file 'bs4/testing.py' => 'bs4/testing.py.THIS' |
3083 | === removed directory 'docs' |
3084 | === removed file 'docs/__init__.py' |
3085 | --- docs/__init__.py 2009-04-10 15:48:02 +0000 |
3086 | +++ docs/__init__.py 1970-01-01 00:00:00 +0000 |
3087 | @@ -1,1 +0,0 @@ |
3088 | -"""Executable documentation about beautifulsoup.""" |
3089 | |
3090 | === added file 'setup.py' |
3091 | --- setup.py 1970-01-01 00:00:00 +0000 |
3092 | +++ setup.py 2011-05-27 07:52:31 +0000 |
3093 | @@ -0,0 +1,60 @@ |
3094 | +from distutils.core import setup |
3095 | +import unittest |
3096 | +import warnings |
3097 | +warnings.filterwarnings("ignore", "Unknown distribution option") |
3098 | + |
3099 | +import sys |
3100 | +# patch distutils if it can't cope with the "classifiers" keyword |
3101 | +if sys.version < '2.2.3': |
3102 | + from distutils.dist import DistributionMetadata |
3103 | + DistributionMetadata.classifiers = None |
3104 | + DistributionMetadata.download_url = None |
3105 | + |
3106 | +from BeautifulSoup import __version__ |
3107 | + |
3108 | +#Make sure all the tests complete. |
3109 | +import BeautifulSoupTests |
3110 | +loader = unittest.TestLoader() |
3111 | +result = unittest.TestResult() |
3112 | +suite = loader.loadTestsFromModule(BeautifulSoupTests) |
3113 | +suite.run(result) |
3114 | +if not result.wasSuccessful(): |
3115 | + print "Unit tests have failed!" |
3116 | + for l in result.errors, result.failures: |
3117 | + for case, error in l: |
3118 | + print "-" * 80 |
3119 | + desc = case.shortDescription() |
3120 | + if desc: |
3121 | + print desc |
3122 | + print error |
3123 | + print '''If you see an error like: "'ascii' codec can't encode character...", see\nthe Beautiful Soup documentation:\n http://www.crummy.com/software/BeautifulSoup/documentation.html#Why%20can't%20Beautiful%20Soup%20print%20out%20the%20non-ASCII%20characters%20I%20gave%20it?''' |
3124 | + print "This might or might not be a problem depending on what you plan to do with\nBeautiful Soup." |
3125 | +    if sys.argv[1] == 'sdist':
3126 | +
3127 | +        print "I'm not going to make a source distribution since the tests don't pass."
3128 | +        sys.exit(1)
3129 | + |
3130 | +setup(name="BeautifulSoup", |
3131 | + version=__version__, |
3132 | + py_modules=['BeautifulSoup', 'BeautifulSoupTests'], |
3133 | + description="HTML/XML parser for quick-turnaround applications like screen-scraping.", |
3134 | + author="Leonard Richardson", |
3135 | + author_email = "leonardr@segfault.org", |
3136 | + long_description="""Beautiful Soup parses arbitrarily invalid SGML and provides a variety of methods and Pythonic idioms for iterating and searching the parse tree.""", |
3137 | + classifiers=["Development Status :: 5 - Production/Stable", |
3138 | + "Intended Audience :: Developers", |
3139 | + "License :: OSI Approved :: Python Software Foundation License", |
3140 | + "Programming Language :: Python", |
3141 | + "Topic :: Text Processing :: Markup :: HTML", |
3142 | + "Topic :: Text Processing :: Markup :: XML", |
3143 | + "Topic :: Text Processing :: Markup :: SGML", |
3144 | + "Topic :: Software Development :: Libraries :: Python Modules", |
3145 | + ], |
3146 | + url="http://www.crummy.com/software/BeautifulSoup/", |
3147 | + license="BSD", |
3148 | + download_url="http://www.crummy.com/software/BeautifulSoup/download/" |
3149 | + ) |
3150 | + |
3151 | + # Send announce to: |
3152 | + # python-announce@python.org |
3153 | + # python-list@python.org |
3154 | |
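
For reference, the pre-flight test run this setup.py performs, reduced to a standalone snippet (Python 2; assumes BeautifulSoupTests.py is importable):

    import unittest

    import BeautifulSoupTests

    suite = unittest.TestLoader().loadTestsFromModule(BeautifulSoupTests)
    result = unittest.TestResult()
    suite.run(result)
    print "Unit tests passed:", result.wasSuccessful()
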
3155 | === removed file 'tests/__init__.py' |
3156 | --- tests/__init__.py 2009-04-10 15:48:02 +0000 |
3157 | +++ tests/__init__.py 1970-01-01 00:00:00 +0000 |
3158 | @@ -1,1 +0,0 @@ |
3159 | -"The beautifulsoup tests." |
3160 | |
3161 | === removed file 'tests/test_docs.py' |
3162 | --- tests/test_docs.py 2009-04-10 15:48:02 +0000 |
3163 | +++ tests/test_docs.py 1970-01-01 00:00:00 +0000 |
3164 | @@ -1,36 +0,0 @@ |
3165 | -"Test harness for doctests." |
3166 | - |
3167 | -# pylint: disable-msg=E0611,W0142 |
3168 | - |
3169 | -__metaclass__ = type |
3170 | -__all__ = [ |
3171 | - 'additional_tests', |
3172 | - ] |
3173 | - |
3174 | -import atexit |
3175 | -import doctest |
3176 | -import os |
3177 | -from pkg_resources import ( |
3178 | - resource_filename, resource_exists, resource_listdir, cleanup_resources) |
3179 | -import unittest |
3180 | - |
3181 | -DOCTEST_FLAGS = ( |
3182 | - doctest.ELLIPSIS | |
3183 | - doctest.NORMALIZE_WHITESPACE | |
3184 | - doctest.REPORT_NDIFF) |
3185 | - |
3186 | - |
3187 | -def additional_tests(): |
3188 | - "Run the doc tests (README.txt and docs/*, if any exist)" |
3189 | - doctest_files = [ |
3190 | - os.path.abspath(resource_filename('beautifulsoup', 'README.txt'))] |
3191 | - if resource_exists('beautifulsoup', 'docs'): |
3192 | - for name in resource_listdir('beautifulsoup', 'docs'): |
3193 | - if name.endswith('.txt'): |
3194 | - doctest_files.append( |
3195 | - os.path.abspath( |
3196 | - resource_filename('beautifulsoup', 'docs/%s' % name))) |
3197 | - kwargs = dict(module_relative=False, optionflags=DOCTEST_FLAGS) |
3198 | - atexit.register(cleanup_resources) |
3199 | - return unittest.TestSuite(( |
3200 | - doctest.DocFileSuite(*doctest_files, **kwargs))) |
3201 | |
3202 | === renamed file 'tests/test_lxml.py' => 'tests/test_lxml.py.THIS' |
3203 | === renamed file 'tests/test_soup.py' => 'tests/test_soup.py.THIS' |