Merge lp:~noskcaj/ubuntu/utopic/beautifulsoup4/merge into lp:ubuntu/utopic/beautifulsoup4
Proposed by: Jackson Doak
Status: Needs review
Proposed branch: lp:~noskcaj/ubuntu/utopic/beautifulsoup4/merge
Merge into: lp:ubuntu/utopic/beautifulsoup4
Diff against target: 1665 lines (+709/-333), 20 files modified
  NEWS.txt (+62/-0)
  PKG-INFO (+1/-1)
  bs4/__init__.py (+83/-42)
  bs4/builder/__init__.py (+13/-8)
  bs4/builder/_html5lib.py (+82/-19)
  bs4/builder/_htmlparser.py (+14/-5)
  bs4/builder/_lxml.py (+64/-30)
  bs4/dammit.py (+165/-163)
  bs4/diagnose.py (+28/-2)
  bs4/element.py (+34/-21)
  bs4/testing.py (+13/-0)
  bs4/tests/test_html5lib.py (+13/-0)
  bs4/tests/test_lxml.py (+7/-4)
  bs4/tests/test_soup.py (+71/-20)
  bs4/tests/test_tree.py (+29/-0)
  debian/changelog (+16/-0)
  debian/control (+1/-1)
  debian/copyright (+1/-1)
  doc/source/index.rst (+11/-15)
  setup.py (+1/-1)
To merge this branch: bzr merge lp:~noskcaj/ubuntu/utopic/beautifulsoup4/merge
Related bugs: none
Reviewer: Daniel Holbach (community), status: Approve
Review via email: mp+221346@code.launchpad.net
Commit message
Description of the change
New upstream release from Debian.
Unmerged revisions
15. By Jackson Doak

  * Merge from Debian. Remaining changes:
    - debian/control, debian/rules: Disable pypy-bs4 and the Build-Depends on
      pypy, since the latter is in universe while beautifulsoup4 is being
      pulled into main via webtest.
Preview Diff
1 | === modified file 'NEWS.txt' |
2 | --- NEWS.txt 2013-08-09 18:39:43 +0000 |
3 | +++ NEWS.txt 2014-05-29 09:58:03 +0000 |
4 | @@ -1,3 +1,65 @@ |
5 | += 4.3.2 (20131002) = |
6 | + |
7 | +* Fixed a bug in which short Unicode input was improperly encoded to |
8 | + ASCII when checking whether or not it was the name of a file on |
9 | + disk. [bug=1227016] |
10 | + |
11 | +* Fixed a crash when a short input contains data not valid in |
12 | + filenames. [bug=1232604] |
13 | + |
14 | +* Fixed a bug that caused Unicode data put into UnicodeDammit to |
15 | + return None instead of the original data. [bug=1214983] |
16 | + |
17 | +* Combined two tests to stop a spurious test failure when tests are |
18 | + run by nosetests. [bug=1212445] |
19 | + |
20 | += 4.3.1 (20130815) = |
21 | + |
22 | +* Fixed yet another problem with the html5lib tree builder, caused by |
23 | + html5lib's tendency to rearrange the tree during |
24 | + parsing. [bug=1189267] |
25 | + |
26 | +* Fixed a bug that caused the optimized version of find_all() to |
27 | + return nothing. [bug=1212655] |
28 | + |
29 | += 4.3.0 (20130812) = |
30 | + |
31 | +* Instead of converting incoming data to Unicode and feeding it to the |
32 | + lxml tree builder in chunks, Beautiful Soup now makes successive |
33 | + guesses at the encoding of the incoming data, and tells lxml to |
34 | + parse the data as that encoding. Giving lxml more control over the |
35 | + parsing process improves performance and avoids a number of bugs and |
36 | + issues with the lxml parser which had previously required elaborate |
37 | + workarounds: |
38 | + |
39 | + - An issue in which lxml refuses to parse Unicode strings on some |
40 | + systems. [bug=1180527] |
41 | + |
42 | + - A returning bug that truncated documents longer than a (very |
43 | + small) size. [bug=963880] |
44 | + |
45 | + - A returning bug in which extra spaces were added to a document if |
46 | + the document defined a charset other than UTF-8. [bug=972466] |
47 | + |
48 | + This required a major overhaul of the tree builder architecture. If |
49 | + you wrote your own tree builder and didn't tell me, you'll need to |
50 | + modify your prepare_markup() method. |
51 | + |
52 | +* The UnicodeDammit code that makes guesses at encodings has been |
53 | + split into its own class, EncodingDetector. A lot of apparently |
54 | + redundant code has been removed from Unicode, Dammit, and some |
55 | + undocumented features have also been removed. |
56 | + |
57 | +* Beautiful Soup will issue a warning if instead of markup you pass it |
58 | + a URL or the name of a file on disk (a common beginner's mistake). |
59 | + |
60 | +* A number of optimizations improve the performance of the lxml tree |
61 | + builder by about 33%, the html.parser tree builder by about 20%, and |
62 | + the html5lib tree builder by about 15%. |
63 | + |
64 | +* All find_all calls should now return a ResultSet object. Patch by |
65 | + Aaron DeVore. [bug=1194034] |
66 | + |
67 | = 4.2.1 (20130531) = |
68 | |
69 | * The default XML formatter will now replace ampersands even if they |
70 | |
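The 4.3.x entries above describe a user-visible behavior change that is easy to demonstrate. A minimal sketch (Python 2, matching this branch) of the new beginner-mistake warning; the warning text comes from the diff to bs4/__init__.py below:

```python
# Passing a URL instead of markup now triggers a warning (the input is
# still parsed as markup, in case that is what the user really wanted).
import warnings
from bs4 import BeautifulSoup

with warnings.catch_warnings(record=True) as w:
    warnings.simplefilter("always")
    BeautifulSoup("http://www.crummy.com/")
print str(w[0].message)  # '"http://www.crummy.com/" looks like a URL...'
```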
71 | === modified file 'PKG-INFO' |
72 | --- PKG-INFO 2013-08-09 18:39:43 +0000 |
73 | +++ PKG-INFO 2014-05-29 09:58:03 +0000 |
74 | @@ -1,6 +1,6 @@ |
75 | Metadata-Version: 1.1 |
76 | Name: beautifulsoup4 |
77 | -Version: 4.2.1 |
78 | +Version: 4.3.2 |
79 | Summary: UNKNOWN |
80 | Home-page: http://www.crummy.com/software/BeautifulSoup/bs4/ |
81 | Author: Leonard Richardson |
82 | |
83 | === modified file 'bs4/__init__.py' |
84 | --- bs4/__init__.py 2013-08-09 18:39:43 +0000 |
85 | +++ bs4/__init__.py 2014-05-29 09:58:03 +0000 |
86 | @@ -17,16 +17,17 @@ |
87 | """ |
88 | |
89 | __author__ = "Leonard Richardson (leonardr@segfault.org)" |
90 | -__version__ = "4.2.1" |
91 | +__version__ = "4.3.2" |
92 | __copyright__ = "Copyright (c) 2004-2013 Leonard Richardson" |
93 | __license__ = "MIT" |
94 | |
95 | __all__ = ['BeautifulSoup'] |
96 | |
97 | +import os |
98 | import re |
99 | import warnings |
100 | |
101 | -from .builder import builder_registry |
102 | +from .builder import builder_registry, ParserRejectedMarkup |
103 | from .dammit import UnicodeDammit |
104 | from .element import ( |
105 | CData, |
106 | @@ -74,11 +75,7 @@ |
107 | # want, look for one with these features. |
108 | DEFAULT_BUILDER_FEATURES = ['html', 'fast'] |
109 | |
110 | - # Used when determining whether a text node is all whitespace and |
111 | - # can be replaced with a single space. A text node that contains |
112 | - # fancy Unicode spaces (usually non-breaking) should be left |
113 | - # alone. |
114 | - STRIP_ASCII_SPACES = {9: None, 10: None, 12: None, 13: None, 32: None, } |
115 | + ASCII_SPACES = '\x20\x0a\x09\x0c\x0d' |
116 | |
117 | def __init__(self, markup="", features=None, builder=None, |
118 | parse_only=None, from_encoding=None, **kwargs): |
119 | @@ -160,18 +157,46 @@ |
120 | |
121 | self.parse_only = parse_only |
122 | |
123 | - self.reset() |
124 | - |
125 | if hasattr(markup, 'read'): # It's a file-type object. |
126 | markup = markup.read() |
127 | - (self.markup, self.original_encoding, self.declared_html_encoding, |
128 | - self.contains_replacement_characters) = ( |
129 | - self.builder.prepare_markup(markup, from_encoding)) |
130 | + elif len(markup) <= 256: |
131 | + # Print out warnings for a couple beginner problems |
132 | + # involving passing non-markup to Beautiful Soup. |
133 | + # Beautiful Soup will still parse the input as markup, |
134 | + # just in case that's what the user really wants. |
135 | + if (isinstance(markup, unicode) |
136 | + and not os.path.supports_unicode_filenames): |
137 | + possible_filename = markup.encode("utf8") |
138 | + else: |
139 | + possible_filename = markup |
140 | + is_file = False |
141 | + try: |
142 | + is_file = os.path.exists(possible_filename) |
143 | + except Exception, e: |
144 | + # This is almost certainly a problem involving |
145 | + # characters not valid in filenames on this |
146 | + # system. Just let it go. |
147 | + pass |
148 | + if is_file: |
149 | + warnings.warn( |
150 | + '"%s" looks like a filename, not markup. You should probably open this file and pass the filehandle into Beautiful Soup.' % markup) |
151 | + if markup[:5] == "http:" or markup[:6] == "https:": |
152 | + # TODO: This is ugly but I couldn't get it to work in |
153 | + # Python 3 otherwise. |
154 | + if ((isinstance(markup, bytes) and not b' ' in markup) |
155 | + or (isinstance(markup, unicode) and not u' ' in markup)): |
156 | + warnings.warn( |
157 | + '"%s" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client to get the document behind the URL, and feed that document to Beautiful Soup.' % markup) |
158 | |
159 | - try: |
160 | - self._feed() |
161 | - except StopParsing: |
162 | - pass |
163 | + for (self.markup, self.original_encoding, self.declared_html_encoding, |
164 | + self.contains_replacement_characters) in ( |
165 | + self.builder.prepare_markup(markup, from_encoding)): |
166 | + self.reset() |
167 | + try: |
168 | + self._feed() |
169 | + break |
170 | + except ParserRejectedMarkup: |
171 | + pass |
172 | |
173 | # Clear out the markup and remove the builder's circular |
174 | # reference to this object. |
175 | @@ -192,9 +217,10 @@ |
176 | Tag.__init__(self, self, self.builder, self.ROOT_TAG_NAME) |
177 | self.hidden = 1 |
178 | self.builder.reset() |
179 | - self.currentData = [] |
180 | + self.current_data = [] |
181 | self.currentTag = None |
182 | self.tagStack = [] |
183 | + self.preserve_whitespace_tag_stack = [] |
184 | self.pushTag(self) |
185 | |
186 | def new_tag(self, name, namespace=None, nsprefix=None, **attrs): |
187 | @@ -215,6 +241,8 @@ |
188 | |
189 | def popTag(self): |
190 | tag = self.tagStack.pop() |
191 | + if self.preserve_whitespace_tag_stack and tag == self.preserve_whitespace_tag_stack[-1]: |
192 | + self.preserve_whitespace_tag_stack.pop() |
193 | #print "Pop", tag.name |
194 | if self.tagStack: |
195 | self.currentTag = self.tagStack[-1] |
196 | @@ -226,23 +254,37 @@ |
197 | self.currentTag.contents.append(tag) |
198 | self.tagStack.append(tag) |
199 | self.currentTag = self.tagStack[-1] |
200 | + if tag.name in self.builder.preserve_whitespace_tags: |
201 | + self.preserve_whitespace_tag_stack.append(tag) |
202 | |
203 | def endData(self, containerClass=NavigableString): |
204 | - if self.currentData: |
205 | - currentData = u''.join(self.currentData) |
206 | - if (currentData.translate(self.STRIP_ASCII_SPACES) == '' and |
207 | - not set([tag.name for tag in self.tagStack]).intersection( |
208 | - self.builder.preserve_whitespace_tags)): |
209 | - if '\n' in currentData: |
210 | - currentData = '\n' |
211 | - else: |
212 | - currentData = ' ' |
213 | - self.currentData = [] |
214 | + if self.current_data: |
215 | + current_data = u''.join(self.current_data) |
216 | + # If whitespace is not preserved, and this string contains |
217 | + # nothing but ASCII spaces, replace it with a single space |
218 | + # or newline. |
219 | + if not self.preserve_whitespace_tag_stack: |
220 | + strippable = True |
221 | + for i in current_data: |
222 | + if i not in self.ASCII_SPACES: |
223 | + strippable = False |
224 | + break |
225 | + if strippable: |
226 | + if '\n' in current_data: |
227 | + current_data = '\n' |
228 | + else: |
229 | + current_data = ' ' |
230 | + |
231 | + # Reset the data collector. |
232 | + self.current_data = [] |
233 | + |
234 | + # Should we add this string to the tree at all? |
235 | if self.parse_only and len(self.tagStack) <= 1 and \ |
236 | (not self.parse_only.text or \ |
237 | - not self.parse_only.search(currentData)): |
238 | + not self.parse_only.search(current_data)): |
239 | return |
240 | - o = containerClass(currentData) |
241 | + |
242 | + o = containerClass(current_data) |
243 | self.object_was_parsed(o) |
244 | |
245 | def object_was_parsed(self, o, parent=None, most_recent_element=None): |
246 | @@ -250,6 +292,7 @@ |
247 | parent = parent or self.currentTag |
248 | most_recent_element = most_recent_element or self._most_recent_element |
249 | o.setup(parent, most_recent_element) |
250 | + |
251 | if most_recent_element is not None: |
252 | most_recent_element.next_element = o |
253 | self._most_recent_element = o |
254 | @@ -262,22 +305,21 @@ |
255 | the given tag.""" |
256 | #print "Popping to %s" % name |
257 | if name == self.ROOT_TAG_NAME: |
258 | + # The BeautifulSoup object itself can never be popped. |
259 | return |
260 | |
261 | - numPops = 0 |
262 | - mostRecentTag = None |
263 | + most_recently_popped = None |
264 | |
265 | - for i in range(len(self.tagStack) - 1, 0, -1): |
266 | - if (name == self.tagStack[i].name |
267 | - and nsprefix == self.tagStack[i].prefix): |
268 | - numPops = len(self.tagStack) - i |
269 | + stack_size = len(self.tagStack) |
270 | + for i in range(stack_size - 1, 0, -1): |
271 | + t = self.tagStack[i] |
272 | + if (name == t.name and nsprefix == t.prefix): |
273 | + if inclusivePop: |
274 | + most_recently_popped = self.popTag() |
275 | break |
276 | - if not inclusivePop: |
277 | - numPops = numPops - 1 |
278 | + most_recently_popped = self.popTag() |
279 | |
280 | - for i in range(0, numPops): |
281 | - mostRecentTag = self.popTag() |
282 | - return mostRecentTag |
283 | + return most_recently_popped |
284 | |
285 | def handle_starttag(self, name, namespace, nsprefix, attrs): |
286 | """Push a start tag on to the stack. |
287 | @@ -312,7 +354,7 @@ |
288 | self._popToTag(name, nsprefix) |
289 | |
290 | def handle_data(self, data): |
291 | - self.currentData.append(data) |
292 | + self.current_data.append(data) |
293 | |
294 | def decode(self, pretty_print=False, |
295 | eventual_encoding=DEFAULT_OUTPUT_ENCODING, |
296 | @@ -353,7 +395,6 @@ |
297 | class StopParsing(Exception): |
298 | pass |
299 | |
300 | - |
301 | class FeatureNotFound(ValueError): |
302 | pass |
303 | |
304 | |
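The constructor change above replaces the single prepare_markup() call with a loop: the builder now yields a series of (markup, encoding, declared encoding, replacement flag) strategies, and parsing is retried until one is not rejected. A hedged, self-contained sketch of that pattern (the helper names here are hypothetical, not part of bs4):

```python
from bs4.builder import ParserRejectedMarkup

def parse_with_fallback(strategies, parse):
    # Try each candidate strategy in turn; ParserRejectedMarkup from the
    # underlying parser means "move on to the next encoding guess".
    for candidate in strategies:
        try:
            return parse(candidate)
        except ParserRejectedMarkup:
            continue
    raise ParserRejectedMarkup("every strategy was rejected")
```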
305 | === modified file 'bs4/builder/__init__.py' |
306 | --- bs4/builder/__init__.py 2013-08-09 18:39:43 +0000 |
307 | +++ bs4/builder/__init__.py 2014-05-29 09:58:03 +0000 |
308 | @@ -147,16 +147,18 @@ |
309 | |
310 | Modifies its input in place. |
311 | """ |
312 | + if not attrs: |
313 | + return attrs |
314 | if self.cdata_list_attributes: |
315 | universal = self.cdata_list_attributes.get('*', []) |
316 | tag_specific = self.cdata_list_attributes.get( |
317 | - tag_name.lower(), []) |
318 | - for cdata_list_attr in itertools.chain(universal, tag_specific): |
319 | - if cdata_list_attr in attrs: |
320 | - # Basically, we have a "class" attribute whose |
321 | - # value is a whitespace-separated list of CSS |
322 | - # classes. Split it into a list. |
323 | - value = attrs[cdata_list_attr] |
324 | + tag_name.lower(), None) |
325 | + for attr in attrs.keys(): |
326 | + if attr in universal or (tag_specific and attr in tag_specific): |
327 | + # We have a "class"-type attribute whose string |
328 | + # value is a whitespace-separated list of |
329 | + # values. Split it into a list. |
330 | + value = attrs[attr] |
331 | if isinstance(value, basestring): |
332 | values = whitespace_re.split(value) |
333 | else: |
334 | @@ -167,7 +169,7 @@ |
335 | # leave the value alone rather than trying to |
336 | # split it again. |
337 | values = value |
338 | - attrs[cdata_list_attr] = values |
339 | + attrs[attr] = values |
340 | return attrs |
341 | |
342 | class SAXTreeBuilder(TreeBuilder): |
343 | @@ -296,6 +298,9 @@ |
344 | # Register the builder while we're at it. |
345 | this_module.builder_registry.register(obj) |
346 | |
347 | +class ParserRejectedMarkup(Exception): |
348 | + pass |
349 | + |
350 | # Builders are registered in reverse order of priority, so that custom |
351 | # builder registrations will take precedence. In general, we want lxml |
352 | # to take precedence over html5lib, because it's faster. And we only |
353 | |
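The _replace_cdata_list_attribute_values() rewrite above changes the iteration order (over the tag's actual attributes rather than all registered cdata-list attributes) but not the observable behavior, which this sketch illustrates:

```python
# "class"-type attributes are split on whitespace into lists; ordinary
# attributes stay plain strings.
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="menu main" id="nav">hi</p>')
print soup.p['class']  # ['menu', 'main']
print soup.p['id']     # u'nav'
```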
354 | === modified file 'bs4/builder/_html5lib.py' |
355 | --- bs4/builder/_html5lib.py 2013-08-09 18:39:43 +0000 |
356 | +++ bs4/builder/_html5lib.py 2014-05-29 09:58:03 +0000 |
357 | @@ -27,7 +27,7 @@ |
358 | def prepare_markup(self, markup, user_specified_encoding): |
359 | # Store the user-specified encoding for use later on. |
360 | self.user_specified_encoding = user_specified_encoding |
361 | - return markup, None, None, False |
362 | + yield (markup, None, None, False) |
363 | |
364 | # These methods are defined by Beautiful Soup. |
365 | def feed(self, markup): |
366 | @@ -123,17 +123,50 @@ |
367 | self.namespace = namespace |
368 | |
369 | def appendChild(self, node): |
370 | - if (node.element.__class__ == NavigableString and self.element.contents |
371 | + string_child = child = None |
372 | + if isinstance(node, basestring): |
373 | + # Some other piece of code decided to pass in a string |
374 | + # instead of creating a TextElement object to contain the |
375 | + # string. |
376 | + string_child = child = node |
377 | + elif isinstance(node, Tag): |
378 | + # Some other piece of code decided to pass in a Tag |
379 | + # instead of creating an Element object to contain the |
380 | + # Tag. |
381 | + child = node |
382 | + elif node.element.__class__ == NavigableString: |
383 | + string_child = child = node.element |
384 | + else: |
385 | + child = node.element |
386 | + |
387 | + if not isinstance(child, basestring) and child.parent is not None: |
388 | + node.element.extract() |
389 | + |
390 | + if (string_child and self.element.contents |
391 | and self.element.contents[-1].__class__ == NavigableString): |
392 | - # Concatenate new text onto old text node |
393 | - # XXX This has O(n^2) performance, for input like |
394 | + # We are appending a string onto another string. |
395 | + # TODO This has O(n^2) performance, for input like |
396 | # "a</a>a</a>a</a>..." |
397 | old_element = self.element.contents[-1] |
398 | - new_element = self.soup.new_string(old_element + node.element) |
399 | + new_element = self.soup.new_string(old_element + string_child) |
400 | old_element.replace_with(new_element) |
401 | self.soup._most_recent_element = new_element |
402 | else: |
403 | - self.soup.object_was_parsed(node.element, parent=self.element) |
404 | + if isinstance(node, basestring): |
405 | + # Create a brand new NavigableString from this string. |
406 | + child = self.soup.new_string(node) |
407 | + |
408 | + # Tell Beautiful Soup to act as if it parsed this element |
409 | + # immediately after the parent's last descendant. (Or |
410 | + # immediately after the parent, if it has no children.) |
411 | + if self.element.contents: |
412 | + most_recent_element = self.element._last_descendant(False) |
413 | + else: |
414 | + most_recent_element = self.element |
415 | + |
416 | + self.soup.object_was_parsed( |
417 | + child, parent=self.element, |
418 | + most_recent_element=most_recent_element) |
419 | |
420 | def getAttributes(self): |
421 | return AttrList(self.element) |
422 | @@ -162,11 +195,11 @@ |
423 | attributes = property(getAttributes, setAttributes) |
424 | |
425 | def insertText(self, data, insertBefore=None): |
426 | - text = TextNode(self.soup.new_string(data), self.soup) |
427 | if insertBefore: |
428 | - self.insertBefore(text, insertBefore) |
429 | + text = TextNode(self.soup.new_string(data), self.soup) |
430 | + self.insertBefore(data, insertBefore) |
431 | else: |
432 | - self.appendChild(text) |
433 | + self.appendChild(data) |
434 | |
435 | def insertBefore(self, node, refNode): |
436 | index = self.element.index(refNode.element) |
437 | @@ -183,16 +216,46 @@ |
438 | def removeChild(self, node): |
439 | node.element.extract() |
440 | |
441 | - def reparentChildren(self, newParent): |
442 | - while self.element.contents: |
443 | - child = self.element.contents[0] |
444 | - child.extract() |
445 | - if isinstance(child, Tag): |
446 | - newParent.appendChild( |
447 | - Element(child, self.soup, namespaces["html"])) |
448 | - else: |
449 | - newParent.appendChild( |
450 | - TextNode(child, self.soup)) |
451 | + def reparentChildren(self, new_parent): |
452 | + """Move all of this tag's children into another tag.""" |
453 | + element = self.element |
454 | + new_parent_element = new_parent.element |
455 | + # Determine what this tag's next_element will be once all the children |
456 | + # are removed. |
457 | + final_next_element = element.next_sibling |
458 | + |
459 | + new_parents_last_descendant = new_parent_element._last_descendant(False, False) |
460 | + if len(new_parent_element.contents) > 0: |
461 | + # The new parent already contains children. We will be |
462 | + # appending this tag's children to the end. |
463 | + new_parents_last_child = new_parent_element.contents[-1] |
464 | + new_parents_last_descendant_next_element = new_parents_last_descendant.next_element |
465 | + else: |
466 | + # The new parent contains no children. |
467 | + new_parents_last_child = None |
468 | + new_parents_last_descendant_next_element = new_parent_element.next_element |
469 | + |
470 | + to_append = element.contents |
471 | + append_after = new_parent.element.contents |
472 | + if len(to_append) > 0: |
473 | + # Set the first child's previous_element and previous_sibling |
474 | + # to elements within the new parent |
475 | + first_child = to_append[0] |
476 | + first_child.previous_element = new_parents_last_descendant |
477 | + first_child.previous_sibling = new_parents_last_child |
478 | + |
479 | + # Fix the last child's next_element and next_sibling |
480 | + last_child = to_append[-1] |
481 | + last_child.next_element = new_parents_last_descendant_next_element |
482 | + last_child.next_sibling = None |
483 | + |
484 | + for child in to_append: |
485 | + child.parent = new_parent_element |
486 | + new_parent_element.contents.append(child) |
487 | + |
488 | + # Now that this element has no children, change its .next_element. |
489 | + element.contents = [] |
490 | + element.next_element = final_next_element |
491 | |
492 | def cloneNode(self): |
493 | tag = self.soup.new_tag(self.element.name, self.namespace) |
494 | |
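The rewritten reparentChildren() does the sibling and next_element bookkeeping by hand because every node in a soup sits on a document-order linked list. A small sketch of that invariant (assumes the html5lib package is installed):

```python
# Walking .next_element from the root visits every tag and string in
# document order; reparentChildren() must keep this chain consistent
# when html5lib moves mis-nested children to a new parent.
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p><em>foo</em></p>\n<p>bar</p>', 'html5lib')
node = soup.next_element
while node is not None:
    print repr(node)
    node = node.next_element
```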
495 | === modified file 'bs4/builder/_htmlparser.py' |
496 | --- bs4/builder/_htmlparser.py 2013-08-09 18:39:43 +0000 |
497 | +++ bs4/builder/_htmlparser.py 2014-05-29 09:58:03 +0000 |
498 | @@ -45,7 +45,15 @@ |
499 | class BeautifulSoupHTMLParser(HTMLParser): |
500 | def handle_starttag(self, name, attrs): |
501 | # XXX namespace |
502 | - self.soup.handle_starttag(name, None, None, dict(attrs)) |
503 | + attr_dict = {} |
504 | + for key, value in attrs: |
505 | + # Change None attribute values to the empty string |
506 | + # for consistency with the other tree builders. |
507 | + if value is None: |
508 | + value = '' |
509 | + attr_dict[key] = value |
510 | + attrvalue = '""' |
511 | + self.soup.handle_starttag(name, None, None, attr_dict) |
512 | |
513 | def handle_endtag(self, name): |
514 | self.soup.handle_endtag(name) |
515 | @@ -135,13 +143,14 @@ |
516 | replaced with REPLACEMENT CHARACTER). |
517 | """ |
518 | if isinstance(markup, unicode): |
519 | - return markup, None, None, False |
520 | + yield (markup, None, None, False) |
521 | + return |
522 | |
523 | try_encodings = [user_specified_encoding, document_declared_encoding] |
524 | dammit = UnicodeDammit(markup, try_encodings, is_html=True) |
525 | - return (dammit.markup, dammit.original_encoding, |
526 | - dammit.declared_html_encoding, |
527 | - dammit.contains_replacement_characters) |
528 | + yield (dammit.markup, dammit.original_encoding, |
529 | + dammit.declared_html_encoding, |
530 | + dammit.contains_replacement_characters) |
531 | |
532 | def feed(self, markup): |
533 | args, kwargs = self.parser_args |
534 | |
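Besides turning prepare_markup() into a generator, the html.parser builder now normalizes valueless attributes, as a quick sketch shows:

```python
# A valueless HTML attribute comes through as the empty string rather
# than None, matching the lxml and html5lib tree builders.
from bs4 import BeautifulSoup

soup = BeautifulSoup('<input checked>', 'html.parser')
print repr(soup.input['checked'])  # ''
```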
535 | === modified file 'bs4/builder/_lxml.py' |
536 | --- bs4/builder/_lxml.py 2013-08-09 18:39:43 +0000 |
537 | +++ bs4/builder/_lxml.py 2014-05-29 09:58:03 +0000 |
538 | @@ -13,9 +13,10 @@ |
539 | HTML, |
540 | HTMLTreeBuilder, |
541 | PERMISSIVE, |
542 | + ParserRejectedMarkup, |
543 | TreeBuilder, |
544 | XML) |
545 | -from bs4.dammit import UnicodeDammit |
546 | +from bs4.dammit import EncodingDetector |
547 | |
548 | LXML = 'lxml' |
549 | |
550 | @@ -33,22 +34,30 @@ |
551 | # standard. |
552 | DEFAULT_NSMAPS = {'http://www.w3.org/XML/1998/namespace' : "xml"} |
553 | |
554 | - @property |
555 | - def default_parser(self): |
556 | + def default_parser(self, encoding): |
557 | # This can either return a parser object or a class, which |
558 | # will be instantiated with default arguments. |
559 | - return etree.XMLParser(target=self, strip_cdata=False, recover=True) |
560 | + if self._default_parser is not None: |
561 | + return self._default_parser |
562 | + return etree.XMLParser( |
563 | + target=self, strip_cdata=False, recover=True, encoding=encoding) |
564 | + |
565 | + def parser_for(self, encoding): |
566 | + # Use the default parser. |
567 | + parser = self.default_parser(encoding) |
568 | + |
569 | + if isinstance(parser, collections.Callable): |
570 | + # Instantiate the parser with default arguments |
571 | + parser = parser(target=self, strip_cdata=False, encoding=encoding) |
572 | + return parser |
573 | |
574 | def __init__(self, parser=None, empty_element_tags=None): |
575 | + # TODO: Issue a warning if parser is present but not a |
576 | + # callable, since that means there's no way to create new |
577 | + # parsers for different encodings. |
578 | + self._default_parser = parser |
579 | if empty_element_tags is not None: |
580 | self.empty_element_tags = set(empty_element_tags) |
581 | - if parser is None: |
582 | - # Use the default parser. |
583 | - parser = self.default_parser |
584 | - if isinstance(parser, collections.Callable): |
585 | - # Instantiate the parser with default arguments |
586 | - parser = parser(target=self, strip_cdata=False) |
587 | - self.parser = parser |
588 | self.soup = None |
589 | self.nsmaps = [self.DEFAULT_NSMAPS] |
590 | |
591 | @@ -63,33 +72,53 @@ |
592 | def prepare_markup(self, markup, user_specified_encoding=None, |
593 | document_declared_encoding=None): |
594 | """ |
595 | - :return: A 3-tuple (markup, original encoding, encoding |
596 | - declared within markup). |
597 | + :yield: A series of 4-tuples. |
598 | + (markup, encoding, declared encoding, |
599 | + has undergone character replacement) |
600 | + |
601 | + Each 4-tuple represents a strategy for parsing the document. |
602 | """ |
603 | if isinstance(markup, unicode): |
604 | - return markup, None, None, False |
605 | - |
606 | + # We were given Unicode. Maybe lxml can parse Unicode on |
607 | + # this system? |
608 | + yield markup, None, document_declared_encoding, False |
609 | + |
610 | + if isinstance(markup, unicode): |
611 | + # No, apparently not. Convert the Unicode to UTF-8 and |
612 | + # tell lxml to parse it as UTF-8. |
613 | + yield (markup.encode("utf8"), "utf8", |
614 | + document_declared_encoding, False) |
615 | + |
616 | + # Instead of using UnicodeDammit to convert the bytestring to |
617 | + # Unicode using different encodings, use EncodingDetector to |
618 | + # iterate over the encodings, and tell lxml to try to parse |
619 | + # the document as each one in turn. |
620 | + is_html = not self.is_xml |
621 | try_encodings = [user_specified_encoding, document_declared_encoding] |
622 | - dammit = UnicodeDammit(markup, try_encodings, is_html=True) |
623 | - return (dammit.markup, dammit.original_encoding, |
624 | - dammit.declared_html_encoding, |
625 | - dammit.contains_replacement_characters) |
626 | + detector = EncodingDetector(markup, try_encodings, is_html) |
627 | + for encoding in detector.encodings: |
628 | + yield (detector.markup, encoding, document_declared_encoding, False) |
629 | |
630 | def feed(self, markup): |
631 | if isinstance(markup, bytes): |
632 | markup = BytesIO(markup) |
633 | elif isinstance(markup, unicode): |
634 | markup = StringIO(markup) |
635 | + |
636 | # Call feed() at least once, even if the markup is empty, |
637 | # or the parser won't be initialized. |
638 | data = markup.read(self.CHUNK_SIZE) |
639 | - self.parser.feed(data) |
640 | - while data != '': |
641 | - # Now call feed() on the rest of the data, chunk by chunk. |
642 | - data = markup.read(self.CHUNK_SIZE) |
643 | - if data != '': |
644 | - self.parser.feed(data) |
645 | - self.parser.close() |
646 | + try: |
647 | + self.parser = self.parser_for(self.soup.original_encoding) |
648 | + self.parser.feed(data) |
649 | + while len(data) != 0: |
650 | + # Now call feed() on the rest of the data, chunk by chunk. |
651 | + data = markup.read(self.CHUNK_SIZE) |
652 | + if len(data) != 0: |
653 | + self.parser.feed(data) |
654 | + self.parser.close() |
655 | + except (UnicodeDecodeError, LookupError, etree.ParserError), e: |
656 | + raise ParserRejectedMarkup(str(e)) |
657 | |
658 | def close(self): |
659 | self.nsmaps = [self.DEFAULT_NSMAPS] |
660 | @@ -186,13 +215,18 @@ |
661 | features = [LXML, HTML, FAST, PERMISSIVE] |
662 | is_xml = False |
663 | |
664 | - @property |
665 | - def default_parser(self): |
666 | + def default_parser(self, encoding): |
667 | return etree.HTMLParser |
668 | |
669 | def feed(self, markup): |
670 | - self.parser.feed(markup) |
671 | - self.parser.close() |
672 | + encoding = self.soup.original_encoding |
673 | + try: |
674 | + self.parser = self.parser_for(encoding) |
675 | + self.parser.feed(markup) |
676 | + self.parser.close() |
677 | + except (UnicodeDecodeError, LookupError, etree.ParserError), e: |
678 | + raise ParserRejectedMarkup(str(e)) |
679 | + |
680 | |
681 | def test_fragment_to_document(self, fragment): |
682 | """See `TreeBuilder`.""" |
683 | |
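The net effect of the lxml changes above is visible from the outside: a document whose declared encoding cannot represent its characters still parses, because each strategy yielded by prepare_markup() is handed to lxml until one is accepted. This sketch is grounded in the test_can_parse_unicode_document test added later in this diff:

```python
from bs4 import BeautifulSoup

markup = u'<html><head><meta encoding="euc-jp"></head><body>Sacr\xe9 bleu!</body>'
soup = BeautifulSoup(markup, 'lxml')
print repr(soup.body.string)  # u'Sacr\xe9 bleu!'
```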
684 | === modified file 'bs4/dammit.py' |
685 | --- bs4/dammit.py 2013-08-09 18:39:43 +0000 |
686 | +++ bs4/dammit.py 2014-05-29 09:58:03 +0000 |
687 | @@ -1,16 +1,17 @@ |
688 | # -*- coding: utf-8 -*- |
689 | """Beautiful Soup bonus library: Unicode, Dammit |
690 | |
691 | -This class forces XML data into a standard format (usually to UTF-8 or |
692 | -Unicode). It is heavily based on code from Mark Pilgrim's Universal |
693 | -Feed Parser. It does not rewrite the XML or HTML to reflect a new |
694 | -encoding; that's the tree builder's job. |
695 | +This library converts a bytestream to Unicode through any means |
696 | +necessary. It is heavily based on code from Mark Pilgrim's Universal |
697 | +Feed Parser. It works best on XML and HTML, but it does not rewrite the |
698 | +XML or HTML to reflect a new encoding; that's the tree builder's job. |
699 | """ |
700 | |
701 | import codecs |
702 | from htmlentitydefs import codepoint2name |
703 | import re |
704 | import logging |
705 | +import string |
706 | |
707 | # Import a library to autodetect character encodings. |
708 | chardet_type = None |
709 | @@ -175,7 +176,6 @@ |
710 | value = cls.quoted_attribute_value(value) |
711 | return value |
712 | |
713 | - |
714 | @classmethod |
715 | def substitute_html(cls, s): |
716 | """Replace certain Unicode characters with named HTML entities. |
717 | @@ -192,6 +192,125 @@ |
718 | cls._substitute_html_entity, s) |
719 | |
720 | |
721 | +class EncodingDetector: |
722 | + """Suggests a number of possible encodings for a bytestring. |
723 | + |
724 | + Order of precedence: |
725 | + |
726 | + 1. Encodings you specifically tell EncodingDetector to try first |
727 | + (the override_encodings argument to the constructor). |
728 | + |
729 | + 2. An encoding declared within the bytestring itself, either in an |
730 | + XML declaration (if the bytestring is to be interpreted as an XML |
731 | + document), or in a <meta> tag (if the bytestring is to be |
732 | + interpreted as an HTML document.) |
733 | + |
734 | + 3. An encoding detected through textual analysis by chardet, |
735 | + cchardet, or a similar external library. |
736 | + |
737 | + 4. UTF-8. |
738 | + |
739 | + 5. Windows-1252. |
740 | + """ |
741 | + def __init__(self, markup, override_encodings=None, is_html=False): |
742 | + self.override_encodings = override_encodings or [] |
743 | + self.chardet_encoding = None |
744 | + self.is_html = is_html |
745 | + self.declared_encoding = None |
746 | + |
747 | + # First order of business: strip a byte-order mark. |
748 | + self.markup, self.sniffed_encoding = self.strip_byte_order_mark(markup) |
749 | + |
750 | + def _usable(self, encoding, tried): |
751 | + if encoding is not None: |
752 | + encoding = encoding.lower() |
753 | + if encoding not in tried: |
754 | + tried.add(encoding) |
755 | + return True |
756 | + return False |
757 | + |
758 | + @property |
759 | + def encodings(self): |
760 | + """Yield a number of encodings that might work for this markup.""" |
761 | + tried = set() |
762 | + for e in self.override_encodings: |
763 | + if self._usable(e, tried): |
764 | + yield e |
765 | + |
766 | + # Did the document originally start with a byte-order mark |
767 | + # that indicated its encoding? |
768 | + if self._usable(self.sniffed_encoding, tried): |
769 | + yield self.sniffed_encoding |
770 | + |
771 | + # Look within the document for an XML or HTML encoding |
772 | + # declaration. |
773 | + if self.declared_encoding is None: |
774 | + self.declared_encoding = self.find_declared_encoding( |
775 | + self.markup, self.is_html) |
776 | + if self._usable(self.declared_encoding, tried): |
777 | + yield self.declared_encoding |
778 | + |
779 | + # Use third-party character set detection to guess at the |
780 | + # encoding. |
781 | + if self.chardet_encoding is None: |
782 | + self.chardet_encoding = chardet_dammit(self.markup) |
783 | + if self._usable(self.chardet_encoding, tried): |
784 | + yield self.chardet_encoding |
785 | + |
786 | + # As a last-ditch effort, try utf-8 and windows-1252. |
787 | + for e in ('utf-8', 'windows-1252'): |
788 | + if self._usable(e, tried): |
789 | + yield e |
790 | + |
791 | + @classmethod |
792 | + def strip_byte_order_mark(cls, data): |
793 | + """If a byte-order mark is present, strip it and return the encoding it implies.""" |
794 | + encoding = None |
795 | + if (len(data) >= 4) and (data[:2] == b'\xfe\xff') \ |
796 | + and (data[2:4] != '\x00\x00'): |
797 | + encoding = 'utf-16be' |
798 | + data = data[2:] |
799 | + elif (len(data) >= 4) and (data[:2] == b'\xff\xfe') \ |
800 | + and (data[2:4] != '\x00\x00'): |
801 | + encoding = 'utf-16le' |
802 | + data = data[2:] |
803 | + elif data[:3] == b'\xef\xbb\xbf': |
804 | + encoding = 'utf-8' |
805 | + data = data[3:] |
806 | + elif data[:4] == b'\x00\x00\xfe\xff': |
807 | + encoding = 'utf-32be' |
808 | + data = data[4:] |
809 | + elif data[:4] == b'\xff\xfe\x00\x00': |
810 | + encoding = 'utf-32le' |
811 | + data = data[4:] |
812 | + return data, encoding |
813 | + |
814 | + @classmethod |
815 | + def find_declared_encoding(cls, markup, is_html=False, search_entire_document=False): |
816 | + """Given a document, tries to find its declared encoding. |
817 | + |
818 | + An XML encoding is declared at the beginning of the document. |
819 | + |
820 | + An HTML encoding is declared in a <meta> tag, hopefully near the |
821 | + beginning of the document. |
822 | + """ |
823 | + if search_entire_document: |
824 | + xml_endpos = html_endpos = len(markup) |
825 | + else: |
826 | + xml_endpos = 1024 |
827 | + html_endpos = max(2048, int(len(markup) * 0.05)) |
828 | + |
829 | + declared_encoding = None |
830 | + declared_encoding_match = xml_encoding_re.search(markup, endpos=xml_endpos) |
831 | + if not declared_encoding_match and is_html: |
832 | + declared_encoding_match = html_meta_re.search(markup, endpos=html_endpos) |
833 | + if declared_encoding_match is not None: |
834 | + declared_encoding = declared_encoding_match.groups()[0].decode( |
835 | + 'ascii') |
836 | + if declared_encoding: |
837 | + return declared_encoding.lower() |
838 | + return None |
839 | + |
840 | class UnicodeDammit: |
841 | """A class for detecting the encoding of a *ML document and |
842 | converting it to a Unicode string. If the source encoding is |
843 | @@ -213,55 +332,38 @@ |
844 | |
845 | def __init__(self, markup, override_encodings=[], |
846 | smart_quotes_to=None, is_html=False): |
847 | - self.declared_html_encoding = None |
848 | self.smart_quotes_to = smart_quotes_to |
849 | self.tried_encodings = [] |
850 | self.contains_replacement_characters = False |
851 | - |
852 | - if markup == '' or isinstance(markup, unicode): |
853 | + self.is_html = is_html |
854 | + |
855 | + self.detector = EncodingDetector(markup, override_encodings, is_html) |
856 | + |
857 | + # Short-circuit if the data is in Unicode to begin with. |
858 | + if isinstance(markup, unicode) or markup == '': |
859 | self.markup = markup |
860 | self.unicode_markup = unicode(markup) |
861 | self.original_encoding = None |
862 | return |
863 | |
864 | - new_markup, document_encoding, sniffed_encoding = \ |
865 | - self._detectEncoding(markup, is_html) |
866 | - self.markup = new_markup |
867 | + # The encoding detector may have stripped a byte-order mark. |
868 | + # Use the stripped markup from this point on. |
869 | + self.markup = self.detector.markup |
870 | |
871 | u = None |
872 | - if new_markup != markup: |
873 | - # _detectEncoding modified the markup, then converted it to |
874 | - # Unicode and then to UTF-8. So convert it from UTF-8. |
875 | - u = self._convert_from("utf8") |
876 | - self.original_encoding = sniffed_encoding |
877 | - |
878 | - if not u: |
879 | - for proposed_encoding in ( |
880 | - override_encodings + [document_encoding, sniffed_encoding]): |
881 | - if proposed_encoding is not None: |
882 | - u = self._convert_from(proposed_encoding) |
883 | - if u: |
884 | - break |
885 | - |
886 | - # If no luck and we have auto-detection library, try that: |
887 | - if not u and not isinstance(self.markup, unicode): |
888 | - u = self._convert_from(chardet_dammit(self.markup)) |
889 | - |
890 | - # As a last resort, try utf-8 and windows-1252: |
891 | - if not u: |
892 | - for proposed_encoding in ("utf-8", "windows-1252"): |
893 | - u = self._convert_from(proposed_encoding) |
894 | - if u: |
895 | - break |
896 | - |
897 | - # As an absolute last resort, try the encodings again with |
898 | - # character replacement. |
899 | - if not u: |
900 | - for proposed_encoding in ( |
901 | - override_encodings + [ |
902 | - document_encoding, sniffed_encoding, "utf-8", "windows-1252"]): |
903 | - if proposed_encoding != "ascii": |
904 | - u = self._convert_from(proposed_encoding, "replace") |
905 | + for encoding in self.detector.encodings: |
906 | + markup = self.detector.markup |
907 | + u = self._convert_from(encoding) |
908 | + if u is not None: |
909 | + break |
910 | + |
911 | + if not u: |
912 | + # None of the encodings worked. As an absolute last resort, |
913 | + # try them again with character replacement. |
914 | + |
915 | + for encoding in self.detector.encodings: |
916 | + if encoding != "ascii": |
917 | + u = self._convert_from(encoding, "replace") |
918 | if u is not None: |
919 | logging.warning( |
920 | "Some characters could not be decoded, and were " |
921 | @@ -269,8 +371,9 @@ |
922 | self.contains_replacement_characters = True |
923 | break |
924 | |
925 | - # We could at this point force it to ASCII, but that would |
926 | - # destroy so much data that I think giving up is better |
927 | + # If none of that worked, we could at this point force it to |
928 | + # ASCII, but that would destroy so much data that I think |
929 | + # giving up is better. |
930 | self.unicode_markup = u |
931 | if not u: |
932 | self.original_encoding = None |
933 | @@ -301,7 +404,7 @@ |
934 | # Convert smart quotes to HTML if coming from an encoding |
935 | # that might have them. |
936 | if (self.smart_quotes_to is not None |
937 | - and proposed.lower() in self.ENCODINGS_WITH_SMART_QUOTES): |
938 | + and proposed in self.ENCODINGS_WITH_SMART_QUOTES): |
939 | smart_quotes_re = b"([\x80-\x9f])" |
940 | smart_quotes_compiled = re.compile(smart_quotes_re) |
941 | markup = smart_quotes_compiled.sub(self._sub_ms_char, markup) |
942 | @@ -322,99 +425,24 @@ |
943 | def _to_unicode(self, data, encoding, errors="strict"): |
944 | '''Given a string and its encoding, decodes the string into Unicode. |
945 | %encoding is a string recognized by encodings.aliases''' |
946 | - |
947 | - # strip Byte Order Mark (if present) |
948 | - if (len(data) >= 4) and (data[:2] == '\xfe\xff') \ |
949 | - and (data[2:4] != '\x00\x00'): |
950 | - encoding = 'utf-16be' |
951 | - data = data[2:] |
952 | - elif (len(data) >= 4) and (data[:2] == '\xff\xfe') \ |
953 | - and (data[2:4] != '\x00\x00'): |
954 | - encoding = 'utf-16le' |
955 | - data = data[2:] |
956 | - elif data[:3] == '\xef\xbb\xbf': |
957 | - encoding = 'utf-8' |
958 | - data = data[3:] |
959 | - elif data[:4] == '\x00\x00\xfe\xff': |
960 | - encoding = 'utf-32be' |
961 | - data = data[4:] |
962 | - elif data[:4] == '\xff\xfe\x00\x00': |
963 | - encoding = 'utf-32le' |
964 | - data = data[4:] |
965 | - newdata = unicode(data, encoding, errors) |
966 | - return newdata |
967 | - |
968 | - def _detectEncoding(self, xml_data, is_html=False): |
969 | - """Given a document, tries to detect its XML encoding.""" |
970 | - xml_encoding = sniffed_xml_encoding = None |
971 | - try: |
972 | - if xml_data[:4] == b'\x4c\x6f\xa7\x94': |
973 | - # EBCDIC |
974 | - xml_data = self._ebcdic_to_ascii(xml_data) |
975 | - elif xml_data[:4] == b'\x00\x3c\x00\x3f': |
976 | - # UTF-16BE |
977 | - sniffed_xml_encoding = 'utf-16be' |
978 | - xml_data = unicode(xml_data, 'utf-16be').encode('utf-8') |
979 | - elif (len(xml_data) >= 4) and (xml_data[:2] == b'\xfe\xff') \ |
980 | - and (xml_data[2:4] != b'\x00\x00'): |
981 | - # UTF-16BE with BOM |
982 | - sniffed_xml_encoding = 'utf-16be' |
983 | - xml_data = unicode(xml_data[2:], 'utf-16be').encode('utf-8') |
984 | - elif xml_data[:4] == b'\x3c\x00\x3f\x00': |
985 | - # UTF-16LE |
986 | - sniffed_xml_encoding = 'utf-16le' |
987 | - xml_data = unicode(xml_data, 'utf-16le').encode('utf-8') |
988 | - elif (len(xml_data) >= 4) and (xml_data[:2] == b'\xff\xfe') and \ |
989 | - (xml_data[2:4] != b'\x00\x00'): |
990 | - # UTF-16LE with BOM |
991 | - sniffed_xml_encoding = 'utf-16le' |
992 | - xml_data = unicode(xml_data[2:], 'utf-16le').encode('utf-8') |
993 | - elif xml_data[:4] == b'\x00\x00\x00\x3c': |
994 | - # UTF-32BE |
995 | - sniffed_xml_encoding = 'utf-32be' |
996 | - xml_data = unicode(xml_data, 'utf-32be').encode('utf-8') |
997 | - elif xml_data[:4] == b'\x3c\x00\x00\x00': |
998 | - # UTF-32LE |
999 | - sniffed_xml_encoding = 'utf-32le' |
1000 | - xml_data = unicode(xml_data, 'utf-32le').encode('utf-8') |
1001 | - elif xml_data[:4] == b'\x00\x00\xfe\xff': |
1002 | - # UTF-32BE with BOM |
1003 | - sniffed_xml_encoding = 'utf-32be' |
1004 | - xml_data = unicode(xml_data[4:], 'utf-32be').encode('utf-8') |
1005 | - elif xml_data[:4] == b'\xff\xfe\x00\x00': |
1006 | - # UTF-32LE with BOM |
1007 | - sniffed_xml_encoding = 'utf-32le' |
1008 | - xml_data = unicode(xml_data[4:], 'utf-32le').encode('utf-8') |
1009 | - elif xml_data[:3] == b'\xef\xbb\xbf': |
1010 | - # UTF-8 with BOM |
1011 | - sniffed_xml_encoding = 'utf-8' |
1012 | - xml_data = unicode(xml_data[3:], 'utf-8').encode('utf-8') |
1013 | - else: |
1014 | - sniffed_xml_encoding = 'ascii' |
1015 | - pass |
1016 | - except: |
1017 | - xml_encoding_match = None |
1018 | - xml_encoding_match = xml_encoding_re.match(xml_data) |
1019 | - if not xml_encoding_match and is_html: |
1020 | - xml_encoding_match = html_meta_re.search(xml_data) |
1021 | - if xml_encoding_match is not None: |
1022 | - xml_encoding = xml_encoding_match.groups()[0].decode( |
1023 | - 'ascii').lower() |
1024 | - if is_html: |
1025 | - self.declared_html_encoding = xml_encoding |
1026 | - if sniffed_xml_encoding and \ |
1027 | - (xml_encoding in ('iso-10646-ucs-2', 'ucs-2', 'csunicode', |
1028 | - 'iso-10646-ucs-4', 'ucs-4', 'csucs4', |
1029 | - 'utf-16', 'utf-32', 'utf_16', 'utf_32', |
1030 | - 'utf16', 'u16')): |
1031 | - xml_encoding = sniffed_xml_encoding |
1032 | - return xml_data, xml_encoding, sniffed_xml_encoding |
1033 | + return unicode(data, encoding, errors) |
1034 | + |
1035 | + @property |
1036 | + def declared_html_encoding(self): |
1037 | + if not self.is_html: |
1038 | + return None |
1039 | + return self.detector.declared_encoding |
1040 | |
1041 | def find_codec(self, charset): |
1042 | - return self._codec(self.CHARSET_ALIASES.get(charset, charset)) \ |
1043 | - or (charset and self._codec(charset.replace("-", ""))) \ |
1044 | - or (charset and self._codec(charset.replace("-", "_"))) \ |
1045 | + value = (self._codec(self.CHARSET_ALIASES.get(charset, charset)) |
1046 | + or (charset and self._codec(charset.replace("-", ""))) |
1047 | + or (charset and self._codec(charset.replace("-", "_"))) |
1048 | + or (charset and charset.lower()) |
1049 | or charset |
1050 | + ) |
1051 | + if value: |
1052 | + return value.lower() |
1053 | + return None |
1054 | |
1055 | def _codec(self, charset): |
1056 | if not charset: |
1057 | @@ -427,32 +455,6 @@ |
1058 | pass |
1059 | return codec |
1060 | |
1061 | - EBCDIC_TO_ASCII_MAP = None |
1062 | - |
1063 | - def _ebcdic_to_ascii(self, s): |
1064 | - c = self.__class__ |
1065 | - if not c.EBCDIC_TO_ASCII_MAP: |
1066 | - emap = (0,1,2,3,156,9,134,127,151,141,142,11,12,13,14,15, |
1067 | - 16,17,18,19,157,133,8,135,24,25,146,143,28,29,30,31, |
1068 | - 128,129,130,131,132,10,23,27,136,137,138,139,140,5,6,7, |
1069 | - 144,145,22,147,148,149,150,4,152,153,154,155,20,21,158,26, |
1070 | - 32,160,161,162,163,164,165,166,167,168,91,46,60,40,43,33, |
1071 | - 38,169,170,171,172,173,174,175,176,177,93,36,42,41,59,94, |
1072 | - 45,47,178,179,180,181,182,183,184,185,124,44,37,95,62,63, |
1073 | - 186,187,188,189,190,191,192,193,194,96,58,35,64,39,61,34, |
1074 | - 195,97,98,99,100,101,102,103,104,105,196,197,198,199,200, |
1075 | - 201,202,106,107,108,109,110,111,112,113,114,203,204,205, |
1076 | - 206,207,208,209,126,115,116,117,118,119,120,121,122,210, |
1077 | - 211,212,213,214,215,216,217,218,219,220,221,222,223,224, |
1078 | - 225,226,227,228,229,230,231,123,65,66,67,68,69,70,71,72, |
1079 | - 73,232,233,234,235,236,237,125,74,75,76,77,78,79,80,81, |
1080 | - 82,238,239,240,241,242,243,92,159,83,84,85,86,87,88,89, |
1081 | - 90,244,245,246,247,248,249,48,49,50,51,52,53,54,55,56,57, |
1082 | - 250,251,252,253,254,255) |
1083 | - import string |
1084 | - c.EBCDIC_TO_ASCII_MAP = string.maketrans( |
1085 | - ''.join(map(chr, list(range(256)))), ''.join(map(chr, emap))) |
1086 | - return s.translate(c.EBCDIC_TO_ASCII_MAP) |
1087 | |
1088 | # A partial mapping of ISO-Latin-1 to HTML entities/XML numeric entities. |
1089 | MS_CHARS = {b'\x80': ('euro', '20AC'), |
1090 | |
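A short usage sketch for the new EncodingDetector class extracted above; the exact candidate list depends on whether chardet or cchardet is installed:

```python
from bs4.dammit import EncodingDetector

# The BOM-stripping helper is now a standalone classmethod.
data, sniffed = EncodingDetector.strip_byte_order_mark(b'\xef\xbb\xbfhello')
print sniffed  # 'utf-8'

# Candidate encodings are yielded in the documented order of precedence:
# overrides, BOM, declared encoding, chardet guess, utf-8, windows-1252.
detector = EncodingDetector(
    b'<?xml version="1.0" encoding="iso-8859-1"?><r>caf\xe9</r>')
print list(detector.encodings)  # e.g. ['iso-8859-1', 'utf-8', 'windows-1252']
```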
1091 | === modified file 'bs4/diagnose.py' |
1092 | --- bs4/diagnose.py 2013-08-09 18:39:43 +0000 |
1093 | +++ bs4/diagnose.py 2014-05-29 09:58:03 +0000 |
1094 | @@ -1,10 +1,15 @@ |
1095 | """Diagnostic functions, mainly for use when doing tech support.""" |
1096 | +import cProfile |
1097 | from StringIO import StringIO |
1098 | from HTMLParser import HTMLParser |
1099 | +import bs4 |
1100 | from bs4 import BeautifulSoup, __version__ |
1101 | from bs4.builder import builder_registry |
1102 | + |
1103 | import os |
1104 | +import pstats |
1105 | import random |
1106 | +import tempfile |
1107 | import time |
1108 | import traceback |
1109 | import sys |
1110 | @@ -61,14 +66,14 @@ |
1111 | |
1112 | print "-" * 80 |
1113 | |
1114 | -def lxml_trace(data, html=True): |
1115 | +def lxml_trace(data, html=True, **kwargs): |
1116 | """Print out the lxml events that occur during parsing. |
1117 | |
1118 | This lets you see how lxml parses a document when no Beautiful |
1119 | Soup code is running. |
1120 | """ |
1121 | from lxml import etree |
1122 | - for event, element in etree.iterparse(StringIO(data), html=html): |
1123 | + for event, element in etree.iterparse(StringIO(data), html=html, **kwargs): |
1124 | print("%s, %4s, %s" % (event, element.tag, element.text)) |
1125 | |
1126 | class AnnouncingParser(HTMLParser): |
1127 | @@ -174,5 +179,26 @@ |
1128 | b = time.time() |
1129 | print "Raw lxml parsed the markup in %.2fs." % (b-a) |
1130 | |
1131 | + import html5lib |
1132 | + parser = html5lib.HTMLParser() |
1133 | + a = time.time() |
1134 | + parser.parse(data) |
1135 | + b = time.time() |
1136 | + print "Raw html5lib parsed the markup in %.2fs." % (b-a) |
1137 | + |
1138 | +def profile(num_elements=100000, parser="lxml"): |
1139 | + |
1140 | + filehandle = tempfile.NamedTemporaryFile() |
1141 | + filename = filehandle.name |
1142 | + |
1143 | + data = rdoc(num_elements) |
1144 | + vars = dict(bs4=bs4, data=data, parser=parser) |
1145 | + cProfile.runctx('bs4.BeautifulSoup(data, parser)' , vars, vars, filename) |
1146 | + |
1147 | + stats = pstats.Stats(filename) |
1148 | + # stats.strip_dirs() |
1149 | + stats.sort_stats("cumulative") |
1150 | + stats.print_stats('_html5lib|bs4', 50) |
1151 | + |
1152 | if __name__ == '__main__': |
1153 | diagnose(sys.stdin.read()) |
1154 | |
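Usage sketch for the new profile() helper added above; it builds a random document of the requested size, parses it under cProfile, and prints cumulative stats:

```python
from bs4 import diagnose

diagnose.profile(num_elements=10000, parser="lxml")
```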
1155 | === modified file 'bs4/element.py' |
1156 | --- bs4/element.py 2013-05-25 21:27:22 +0000 |
1157 | +++ bs4/element.py 2014-05-29 09:58:03 +0000 |
1158 | @@ -255,11 +255,16 @@ |
1159 | self.previous_sibling = self.next_sibling = None |
1160 | return self |
1161 | |
1162 | - def _last_descendant(self): |
1163 | + def _last_descendant(self, is_initialized=True, accept_self=True): |
1164 | "Finds the last element beneath this object to be parsed." |
1165 | - last_child = self |
1166 | - while hasattr(last_child, 'contents') and last_child.contents: |
1167 | - last_child = last_child.contents[-1] |
1168 | + if is_initialized and self.next_sibling: |
1169 | + last_child = self.next_sibling.previous_element |
1170 | + else: |
1171 | + last_child = self |
1172 | + while isinstance(last_child, Tag) and last_child.contents: |
1173 | + last_child = last_child.contents[-1] |
1174 | + if not accept_self and last_child == self: |
1175 | + last_child = None |
1176 | return last_child |
1177 | # BS3: Not part of the API! |
1178 | _lastRecursiveChild = _last_descendant |
1179 | @@ -294,11 +299,11 @@ |
1180 | previous_child = self.contents[position - 1] |
1181 | new_child.previous_sibling = previous_child |
1182 | new_child.previous_sibling.next_sibling = new_child |
1183 | - new_child.previous_element = previous_child._last_descendant() |
1184 | + new_child.previous_element = previous_child._last_descendant(False) |
1185 | if new_child.previous_element is not None: |
1186 | new_child.previous_element.next_element = new_child |
1187 | |
1188 | - new_childs_last_element = new_child._last_descendant() |
1189 | + new_childs_last_element = new_child._last_descendant(False) |
1190 | |
1191 | if position >= len(self.contents): |
1192 | new_child.next_sibling = None |
1193 | @@ -475,20 +480,21 @@ |
1194 | |
1195 | if isinstance(name, SoupStrainer): |
1196 | strainer = name |
1197 | - elif text is None and not limit and not attrs and not kwargs: |
1198 | - # Optimization to find all tags. |
1199 | + else: |
1200 | + strainer = SoupStrainer(name, attrs, text, **kwargs) |
1201 | + |
1202 | + if text is None and not limit and not attrs and not kwargs: |
1203 | if name is True or name is None: |
1204 | - return [element for element in generator |
1205 | - if isinstance(element, Tag)] |
1206 | - # Optimization to find all tags with a given name. |
1207 | + # Optimization to find all tags. |
1208 | + result = (element for element in generator |
1209 | + if isinstance(element, Tag)) |
1210 | + return ResultSet(strainer, result) |
1211 | elif isinstance(name, basestring): |
1212 | - return [element for element in generator |
1213 | - if isinstance(element, Tag) and element.name == name] |
1214 | - else: |
1215 | - strainer = SoupStrainer(name, attrs, text, **kwargs) |
1216 | - else: |
1217 | - # Build a SoupStrainer |
1218 | - strainer = SoupStrainer(name, attrs, text, **kwargs) |
1219 | + # Optimization to find all tags with a given name. |
1220 | + result = (element for element in generator |
1221 | + if isinstance(element, Tag) |
1222 | + and element.name == name) |
1223 | + return ResultSet(strainer, result) |
1224 | results = ResultSet(strainer) |
1225 | while True: |
1226 | try: |
1227 | @@ -672,6 +678,13 @@ |
1228 | output = self.format_string(self, formatter) |
1229 | return self.PREFIX + output + self.SUFFIX |
1230 | |
1231 | + @property |
1232 | + def name(self): |
1233 | + return None |
1234 | + |
1235 | + @name.setter |
1236 | + def name(self, name): |
1237 | + raise AttributeError("A NavigableString cannot be given a name.") |
1238 | |
1239 | class PreformattedString(NavigableString): |
1240 | """A NavigableString not subject to the normal formatting rules. |
1241 | @@ -746,7 +759,7 @@ |
1242 | self.prefix = prefix |
1243 | if attrs is None: |
1244 | attrs = {} |
1245 | - elif builder.cdata_list_attributes: |
1246 | + elif attrs and builder.cdata_list_attributes: |
1247 | attrs = builder._replace_cdata_list_attribute_values( |
1248 | self.name, attrs) |
1249 | else: |
1250 | @@ -1593,6 +1606,6 @@ |
1251 | class ResultSet(list): |
1252 | """A ResultSet is just a list that keeps track of the SoupStrainer |
1253 | that created it.""" |
1254 | - def __init__(self, source): |
1255 | - list.__init__([]) |
1256 | + def __init__(self, source, result=()): |
1257 | + super(ResultSet, self).__init__(result) |
1258 | self.source = source |
1259 | |
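The element.py changes close out the NEWS item about find_all(): the optimized code paths now wrap their generators in a ResultSet instead of returning a bare list, so every result remembers its SoupStrainer. A quick sketch:

```python
from bs4 import BeautifulSoup
from bs4.element import ResultSet

soup = BeautifulSoup('<a>1</a><a>2</a>')
results = soup.find_all('a')
assert isinstance(results, ResultSet)
print results.source  # the SoupStrainer that produced this set
```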
1260 | === modified file 'bs4/testing.py' |
1261 | --- bs4/testing.py 2013-08-09 18:39:43 +0000 |
1262 | +++ bs4/testing.py 2014-05-29 09:58:03 +0000 |
1263 | @@ -281,6 +281,14 @@ |
1264 | # to detect any differences between them. |
1265 | # |
1266 | |
1267 | + def test_can_parse_unicode_document(self): |
1268 | + # A seemingly innocuous document... but it's in Unicode! And |
1269 | + # it contains characters that can't be represented in the |
1270 | + # encoding found in the declaration! The horror! |
1271 | + markup = u'<html><head><meta encoding="euc-jp"></head><body>Sacr\N{LATIN SMALL LETTER E WITH ACUTE} bleu!</body>' |
1272 | + soup = self.soup(markup) |
1273 | + self.assertEqual(u'Sacr\xe9 bleu!', soup.body.string) |
1274 | + |
1275 | def test_soupstrainer(self): |
1276 | """Parsers should be able to work with SoupStrainers.""" |
1277 | strainer = SoupStrainer("b") |
1278 | @@ -484,6 +492,11 @@ |
1279 | encoded = soup.encode() |
1280 | self.assertTrue(b"< < hey > >" in encoded) |
1281 | |
1282 | + def test_can_parse_unicode_document(self): |
1283 | + markup = u'<?xml version="1.0" encoding="euc-jp"><root>Sacr\N{LATIN SMALL LETTER E WITH ACUTE} bleu!</root>' |
1284 | + soup = self.soup(markup) |
1285 | + self.assertEqual(u'Sacr\xe9 bleu!', soup.root.string) |
1286 | + |
1287 | def test_popping_namespaced_tag(self): |
1288 | markup = '<rss xmlns:dc="foo"><dc:creator>b</dc:creator><dc:date>2012-07-02T20:33:42Z</dc:date><dc:rights>c</dc:rights><image>d</image></rss>' |
1289 | soup = self.soup(markup) |
1290 | |
1291 | === modified file 'bs4/tests/test_html5lib.py' |
1292 | --- bs4/tests/test_html5lib.py 2013-08-09 18:39:43 +0000 |
1293 | +++ bs4/tests/test_html5lib.py 2014-05-29 09:58:03 +0000 |
1294 | @@ -70,3 +70,16 @@ |
1295 | soup = self.soup(markup) |
1296 | # Verify that we can reach the <p> tag; this means the tree is connected. |
1297 | self.assertEqual(b"<p>foo</p>", soup.p.encode()) |
1298 | + |
1299 | + def test_reparented_markup(self): |
1300 | + markup = '<p><em>foo</p>\n<p>bar<a></a></em></p>' |
1301 | + soup = self.soup(markup) |
1302 | + self.assertEqual(u"<body><p><em>foo</em></p><em>\n</em><p><em>bar<a></a></em></p></body>", soup.body.decode()) |
1303 | + self.assertEqual(2, len(soup.find_all('p'))) |
1304 | + |
1305 | + |
1306 | + def test_reparented_markup_ends_with_whitespace(self): |
1307 | + markup = '<p><em>foo</p>\n<p>bar<a></a></em></p>\n' |
1308 | + soup = self.soup(markup) |
1309 | + self.assertEqual(u"<body><p><em>foo</em></p><em>\n</em><p><em>bar<a></a></em></p>\n</body>", soup.body.decode()) |
1310 | + self.assertEqual(2, len(soup.find_all('p'))) |
1311 | |
1312 | === modified file 'bs4/tests/test_lxml.py' |
1313 | --- bs4/tests/test_lxml.py 2013-08-09 18:39:43 +0000 |
1314 | +++ bs4/tests/test_lxml.py 2014-05-29 09:58:03 +0000 |
1315 | @@ -4,14 +4,16 @@ |
1316 | import warnings |
1317 | |
1318 | try: |
1319 | - from bs4.builder import LXMLTreeBuilder, LXMLTreeBuilderForXML |
1320 | + import lxml.etree |
1321 | LXML_PRESENT = True |
1322 | - import lxml.etree |
1323 | LXML_VERSION = lxml.etree.LXML_VERSION |
1324 | except ImportError, e: |
1325 | LXML_PRESENT = False |
1326 | LXML_VERSION = (0,) |
1327 | |
1328 | +if LXML_PRESENT: |
1329 | + from bs4.builder import LXMLTreeBuilder, LXMLTreeBuilderForXML |
1330 | + |
1331 | from bs4 import ( |
1332 | BeautifulSoup, |
1333 | BeautifulStoneSoup, |
1334 | @@ -58,9 +60,10 @@ |
1335 | def test_beautifulstonesoup_is_xml_parser(self): |
1336 | # Make sure that the deprecated BSS class uses an xml builder |
1337 | # if one is installed. |
1338 | - with warnings.catch_warnings(record=False) as w: |
1339 | + with warnings.catch_warnings(record=True) as w: |
1340 | soup = BeautifulStoneSoup("<b />") |
1341 | - self.assertEqual(u"<b/>", unicode(soup.b)) |
1342 | + self.assertEqual(u"<b/>", unicode(soup.b)) |
1343 | + self.assertTrue("BeautifulStoneSoup class is deprecated" in str(w[0].message)) |
1344 | |
1345 | def test_real_xhtml_document(self): |
1346 | """lxml strips the XML definition from an XHTML doc, which is fine.""" |
1347 | |
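Note on the record=True fix above: warnings.catch_warnings(record=False) yields None from its context manager, so the old test bound w to nothing and could never have inspected a warning. The corrected idiom, sketched with a hypothetical stand-in function (deprecated_thing is not part of bs4):

    import warnings

    def deprecated_thing():
        warnings.warn("deprecated_thing is deprecated", DeprecationWarning)

    with warnings.catch_warnings(record=True) as w:
        warnings.simplefilter("always")  # keep the warning from being filtered
        deprecated_thing()

    assert "deprecated" in str(w[0].message)  # w is a list of caught warnings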
1348 | === modified file 'bs4/tests/test_soup.py' |
1349 | --- bs4/tests/test_soup.py 2013-08-09 18:39:43 +0000 |
1350 | +++ bs4/tests/test_soup.py 2014-05-29 09:58:03 +0000 |
1351 | @@ -4,6 +4,8 @@ |
1352 | import logging |
1353 | import unittest |
1354 | import sys |
1355 | +import tempfile |
1356 | + |
1357 | from bs4 import ( |
1358 | BeautifulSoup, |
1359 | BeautifulStoneSoup, |
1360 | @@ -15,7 +17,10 @@ |
1361 | NamespacedAttribute, |
1362 | ) |
1363 | import bs4.dammit |
1364 | -from bs4.dammit import EntitySubstitution, UnicodeDammit |
1365 | +from bs4.dammit import ( |
1366 | + EntitySubstitution, |
1367 | + UnicodeDammit, |
1368 | +) |
1369 | from bs4.testing import ( |
1370 | SoupTest, |
1371 | skipIf, |
1372 | @@ -31,6 +36,19 @@ |
1373 | PYTHON_2_PRE_2_7 = (sys.version_info < (2,7)) |
1374 | PYTHON_3_PRE_3_2 = (sys.version_info[0] == 3 and sys.version_info < (3,2)) |
1375 | |
1376 | +class TestConstructor(SoupTest): |
1377 | + |
1378 | + def test_short_unicode_input(self): |
1379 | + data = u"<h1>éé</h1>" |
1380 | + soup = self.soup(data) |
1381 | + self.assertEqual(u"éé", soup.h1.string) |
1382 | + |
1383 | + def test_embedded_null(self): |
1384 | + data = u"<h1>foo\0bar</h1>" |
1385 | + soup = self.soup(data) |
1386 | + self.assertEqual(u"foo\0bar", soup.h1.string) |
1387 | + |
1388 | + |
1389 | class TestDeprecatedConstructorArguments(SoupTest): |
1390 | |
1391 | def test_parseOnlyThese_renamed_to_parse_only(self): |
1392 | @@ -54,14 +72,33 @@ |
1393 | self.assertRaises( |
1394 | TypeError, self.soup, "<a>", no_such_argument=True) |
1395 | |
1396 | - @skipIf( |
1397 | - not LXML_PRESENT, |
1398 | - "lxml not present, not testing BeautifulStoneSoup.") |
1399 | - def test_beautifulstonesoup(self): |
1400 | - with warnings.catch_warnings(record=True) as w: |
1401 | - soup = BeautifulStoneSoup("<markup>") |
1402 | - self.assertTrue(isinstance(soup, BeautifulSoup)) |
1403 | - self.assertTrue("BeautifulStoneSoup class is deprecated") |
1404 | +class TestWarnings(SoupTest): |
1405 | + |
1406 | + def test_disk_file_warning(self): |
1407 | + filehandle = tempfile.NamedTemporaryFile() |
1408 | + filename = filehandle.name |
1409 | + try: |
1410 | + with warnings.catch_warnings(record=True) as w: |
1411 | + soup = self.soup(filename) |
1412 | + msg = str(w[0].message) |
1413 | + self.assertTrue("looks like a filename" in msg) |
1414 | + finally: |
1415 | + filehandle.close() |
1416 | + |
1417 | + # The file no longer exists, so Beautiful Soup will no longer issue the warning. |
1418 | + with warnings.catch_warnings(record=True) as w: |
1419 | + soup = self.soup(filename) |
1420 | + self.assertEqual(0, len(w)) |
1421 | + |
1422 | + def test_url_warning(self): |
1423 | + with warnings.catch_warnings(record=True) as w: |
1424 | + soup = self.soup("http://www.crummy.com/") |
1425 | + msg = str(w[0].message) |
1426 | + self.assertTrue("looks like a URL" in msg) |
1427 | + |
1428 | + with warnings.catch_warnings(record=True) as w: |
1429 | + soup = self.soup("http://www.crummy.com/ is great") |
1430 | + self.assertEqual(0, len(w)) |
1431 | |
1432 | class TestSelectiveParsing(SoupTest): |
1433 | |
1434 | @@ -156,13 +193,23 @@ |
1435 | |
1436 | def test_ascii_in_unicode_out(self): |
1437 | # ASCII input is converted to Unicode. The original_encoding |
1438 | - # attribute is set. |
1439 | - ascii = b"<foo>a</foo>" |
1440 | - soup_from_ascii = self.soup(ascii) |
1441 | - unicode_output = soup_from_ascii.decode() |
1442 | - self.assertTrue(isinstance(unicode_output, unicode)) |
1443 | - self.assertEqual(unicode_output, self.document_for(ascii.decode())) |
1444 | - self.assertEqual(soup_from_ascii.original_encoding.lower(), "ascii") |
1445 | + # attribute is set to 'utf-8', a superset of ASCII. |
1446 | + chardet = bs4.dammit.chardet_dammit |
1447 | + logging.disable(logging.WARNING) |
1448 | + try: |
1449 | + def noop(str): |
1450 | + return None |
1451 | + # Disable chardet, which will realize that the ASCII is ASCII. |
1452 | + bs4.dammit.chardet_dammit = noop |
1453 | + ascii = b"<foo>a</foo>" |
1454 | + soup_from_ascii = self.soup(ascii) |
1455 | + unicode_output = soup_from_ascii.decode() |
1456 | + self.assertTrue(isinstance(unicode_output, unicode)) |
1457 | + self.assertEqual(unicode_output, self.document_for(ascii.decode())) |
1458 | + self.assertEqual(soup_from_ascii.original_encoding.lower(), "utf-8") |
1459 | + finally: |
1460 | + logging.disable(logging.NOTSET) |
1461 | + bs4.dammit.chardet_dammit = chardet |
1462 | |
1463 | def test_unicode_in_unicode_out(self): |
1464 | # Unicode input is left alone. The original_encoding attribute |
1465 | @@ -192,7 +239,12 @@ |
1466 | self.assertEqual(self.soup(markup).div.encode("utf8"), markup.encode("utf8")) |
1467 | |
1468 | class TestUnicodeDammit(unittest.TestCase): |
1469 | - """Standalone tests of Unicode, Dammit.""" |
1470 | + """Standalone tests of UnicodeDammit.""" |
1471 | + |
1472 | + def test_unicode_input(self): |
1473 | + markup = u"I'm already Unicode! \N{SNOWMAN}" |
1474 | + dammit = UnicodeDammit(markup) |
1475 | + self.assertEqual(dammit.unicode_markup, markup) |
1476 | |
1477 | def test_smart_quotes_to_unicode(self): |
1478 | markup = b"<foo>\x91\x92\x93\x94</foo>" |
1479 | @@ -293,9 +345,8 @@ |
1480 | logging.disable(logging.NOTSET) |
1481 | bs4.dammit.chardet_dammit = chardet |
1482 | |
1483 | - def test_sniffed_xml_encoding(self): |
1484 | - # A document written in UTF-16LE will be converted by a different |
1485 | - # code path that sniffs the byte order markers. |
1486 | + def test_byte_order_mark_removed(self): |
1487 | + # A document written in UTF-16LE will have its byte order marker stripped. |
1488 | data = b'\xff\xfe<\x00a\x00>\x00\xe1\x00\xe9\x00<\x00/\x00a\x00>\x00' |
1489 | dammit = UnicodeDammit(data) |
1490 | self.assertEqual(u"<a>áé</a>", dammit.unicode_markup) |
1491 | |
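Note on the new constructor warnings: markup that looks like a filename or URL is still parsed as markup; Beautiful Soup merely warns that you may have meant to open or fetch it first. A minimal sketch of what the tests exercise:

    import warnings

    from bs4 import BeautifulSoup

    with warnings.catch_warnings(record=True) as w:
        soup = BeautifulSoup("http://www.crummy.com/")
    print(str(w[0].message))  # contains "looks like a URL", per the tests

As the second test shows, a string that merely contains a URL among other words no longer resembles a bare locator and parses silently.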
1492 | === modified file 'bs4/tests/test_tree.py' |
1493 | --- bs4/tests/test_tree.py 2013-08-09 18:39:43 +0000 |
1494 | +++ bs4/tests/test_tree.py 2014-05-29 09:58:03 +0000 |
1495 | @@ -70,6 +70,16 @@ |
1496 | soup = self.soup(u'<h1>Räksmörgås</h1>') |
1497 | self.assertEqual(soup.find(text=u'Räksmörgås'), u'Räksmörgås') |
1498 | |
1499 | + def test_find_everything(self): |
1500 | + """Test an optimization that finds all tags.""" |
1501 | + soup = self.soup("<a>foo</a><b>bar</b>") |
1502 | + self.assertEqual(2, len(soup.find_all())) |
1503 | + |
1504 | + def test_find_everything_with_name(self): |
1505 | + """Test an optimization that finds all tags with a given name.""" |
1506 | + soup = self.soup("<a>foo</a><b>bar</b><a>baz</a>") |
1507 | + self.assertEqual(2, len(soup.find_all('a'))) |
1508 | + |
1509 | class TestFindAll(TreeTest): |
1510 | """Basic tests of the find_all() method.""" |
1511 | |
1512 | @@ -115,6 +125,19 @@ |
1513 | # recursion. |
1514 | self.assertEqual([], soup.find_all(l)) |
1515 | |
1516 | + def test_find_all_resultset(self): |
1517 | + """All find_all calls return a ResultSet""" |
1518 | + soup = self.soup("<a></a>") |
1519 | + result = soup.find_all("a") |
1520 | + self.assertTrue(hasattr(result, "source")) |
1521 | + |
1522 | + result = soup.find_all(True) |
1523 | + self.assertTrue(hasattr(result, "source")) |
1524 | + |
1525 | + result = soup.find_all(text="foo") |
1526 | + self.assertTrue(hasattr(result, "source")) |
1527 | + |
1528 | + |
1529 | class TestFindAllBasicNamespaces(TreeTest): |
1530 | |
1531 | def test_find_by_namespaced_name(self): |
1532 | @@ -1219,6 +1242,12 @@ |
1533 | # attribute for any other tag. |
1534 | self.assertEqual('ISO-8859-1 UTF-8', soup.a['accept-charset']) |
1535 | |
1536 | + def test_string_has_immutable_name_property(self): |
1537 | + string = self.soup("s").string |
1538 | + self.assertEqual(None, string.name) |
1539 | + def t(): |
1540 | + string.name = 'foo' |
1541 | + self.assertRaises(AttributeError, t) |
1542 | |
1543 | class TestPersistence(SoupTest): |
1544 | "Testing features like pickle and deepcopy." |
1545 | |
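Note on the tree tests: every flavor of find_all() is expected to return a ResultSet (which carries a .source attribute pointing back at its origin), and NavigableString.name is now a read-only property. A sketch using the stdlib parser so no wrapper tags are added:

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("<a>foo</a><b>bar</b>", "html.parser")
    result = soup.find_all()          # the optimized find-everything path
    print(len(result))                # 2
    print(hasattr(result, "source"))  # True: a ResultSet, not a plain list

    string = soup.a.string
    print(string.name)                # None: strings carry no tag name
    try:
        string.name = "foo"
    except AttributeError:
        print("NavigableString.name is read-only")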
1546 | === modified file 'debian/changelog' |
1547 | --- debian/changelog 2014-02-23 13:46:15 +0000 |
1548 | +++ debian/changelog 2014-05-29 09:58:03 +0000 |
1549 | @@ -1,3 +1,19 @@ |
1550 | +beautifulsoup4 (4.3.2-1ubuntu1) utopic; urgency=medium |
1551 | + |
1552 | + * Merge from debian. Remaining changes: |
1553 | + - debian/control, debian/rules: Disable pypy-bs4 and Build-Depends on |
1554 | + pypy since the latter is in universe, while beautifulsoup4 is being |
1555 | + pulled into main via webtest. |
1556 | + |
1557 | + -- Jackson Doak <noskcaj@ubuntu.com> Thu, 29 May 2014 19:50:43 +1000 |
1558 | + |
1559 | +beautifulsoup4 (4.3.2-1) unstable; urgency=low |
1560 | + |
1561 | + * New upstream release. |
1562 | + * Bump Standards-Version to 3.9.5, no changes needed. |
1563 | + |
1564 | + -- Stefano Rivera <stefanor@debian.org> Sat, 03 May 2014 14:19:04 +0200 |
1565 | + |
1566 | beautifulsoup4 (4.2.1-1ubuntu2) trusty; urgency=medium |
1567 | |
1568 | * Rebuild to drop files installed into /usr/share/pyshared. |
1569 | |
1570 | === modified file 'debian/control' |
1571 | --- debian/control 2013-11-15 09:56:34 +0000 |
1572 | +++ debian/control 2014-05-29 09:58:03 +0000 |
1573 | @@ -15,7 +15,7 @@ |
1574 | python3-lxml, |
1575 | python3-pkg-resources |
1576 | X-Python-Version: >= 2.6 |
1577 | -Standards-Version: 3.9.4 |
1578 | +Standards-Version: 3.9.5 |
1579 | Homepage: http://www.crummy.com/software/BeautifulSoup |
1580 | Vcs-Svn: svn://anonscm.debian.org/python-modules/packages/beautifulsoup4/trunk/ |
1581 | Vcs-Browser: http://anonscm.debian.org/viewvc/python-modules/packages/beautifulsoup4/trunk/ |
1582 | |
1583 | === modified file 'debian/copyright' |
1584 | --- debian/copyright 2013-05-25 21:27:22 +0000 |
1585 | +++ debian/copyright 2014-05-29 09:58:03 +0000 |
1586 | @@ -18,7 +18,7 @@ |
1587 | Files: debian/* |
1588 | Copyright: |
1589 | 2005-2009, Decklin Foster <decklin@red-bean.com> |
1590 | - 2011-2013, Stefano Rivera <stefanor@debian.org> |
1591 | + 2011-2014, Stefano Rivera <stefanor@debian.org> |
1592 | License: Expatish |
1593 | |
1594 | License: Expatish |
1595 | |
1596 | === modified file 'doc/source/index.rst' |
1597 | --- doc/source/index.rst 2013-08-09 18:39:43 +0000 |
1598 | +++ doc/source/index.rst 2014-05-29 09:58:03 +0000 |
1599 | @@ -26,6 +26,10 @@ |
1600 | projects. If you want to learn about the differences between Beautiful |
1601 | Soup 3 and Beautiful Soup 4, see `Porting code to BS4`_. |
1602 | |
1603 | +This documentation has been translated into other languages by its users. |
1604 | + |
1605 | +* This documentation is also available in Korean. (`external link <http://coreapython.hosting.paran.com/etc/beautifulsoup4.html>`_)
1606 | + |
1607 | Getting help |
1608 | ------------ |
1609 | |
1610 | @@ -1209,8 +1213,8 @@ |
1611 | You can filter an attribute based on `a string`_, `a regular |
1612 | expression`_, `a list`_, `a function`_, or `the value True`_. |
1613 | |
1614 | -This code finds all tags that have an ``id`` attribute, regardless of |
1615 | -what the value is:: |
1616 | +This code finds all tags whose ``id`` attribute has a value, |
1617 | +regardless of what the value is:: |
1618 | |
1619 | soup.find_all(id=True) |
1620 | # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, |
1621 | @@ -2478,9 +2482,11 @@ |
1622 | dammit.original_encoding |
1623 | # 'utf-8' |
1624 | |
1625 | -The more data you give Unicode, Dammit, the more accurately it will |
1626 | -guess. If you have your own suspicions as to what the encoding might |
1627 | -be, you can pass them in as a list:: |
1628 | +Unicode, Dammit's guesses will get a lot more accurate if you install |
1629 | +the ``chardet`` or ``cchardet`` Python libraries. The more data you |
1630 | +give Unicode, Dammit, the more accurately it will guess. If you have |
1631 | +your own suspicions as to what the encoding might be, you can pass |
1632 | +them in as a list:: |
1633 | |
1634 | dammit = UnicodeDammit("Sacr\xe9 bleu!", ["latin-1", "iso-8859-1"]) |
1635 | print(dammit.unicode_markup) |
1636 | @@ -2823,16 +2829,6 @@ |
1637 | You can speed up encoding detection significantly by installing the |
1638 | `cchardet <http://pypi.python.org/pypi/cchardet/>`_ library. |
1639 | |
1640 | -Sometimes `Unicode, Dammit`_ can only detect the encoding of a file by |
1641 | -doing a byte-by-byte examination of the file. This slows Beautiful |
1642 | -Soup to a crawl. My tests indicate that this only happened on 2.x |
1643 | -versions of Python, and that it happened most often with documents |
1644 | -using Russian or Chinese encodings. If this is happening to you, you |
1645 | -can fix it by installing cchardet, or by using Python 3 for your |
1646 | -script. If you happen to know a document's encoding, you can pass |
1647 | -it into the ``BeautifulSoup`` constructor as ``from_encoding``, and |
1648 | -bypass encoding detection altogether. |
1649 | - |
1650 | `Parsing only part of a document`_ won't save you much time parsing |
1651 | the document, but it can save a lot of memory, and it'll make |
1652 | `searching` the document much faster. |
1653 | |
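Note on the documentation changes: the removed troubleshooting paragraph's practical advice still applies, and the from_encoding constructor argument it mentioned remains the way to bypass detection when the encoding is already known. A minimal sketch:

    from bs4 import BeautifulSoup

    markup = u"<p>Sacr\N{LATIN SMALL LETTER E WITH ACUTE} bleu!</p>".encode("latin-1")
    # Declare the encoding up front and skip Unicode, Dammit's guessing:
    soup = BeautifulSoup(markup, from_encoding="latin-1")
    print(soup.p.string)           # the accented text decodes correctly
    print(soup.original_encoding)  # latin-1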
1654 | === modified file 'setup.py' |
1655 | --- setup.py 2013-08-09 18:39:43 +0000 |
1656 | +++ setup.py 2014-05-29 09:58:03 +0000 |
1657 | @@ -7,7 +7,7 @@ |
1658 | from distutils.command.build_py import build_py |
1659 | |
1660 | setup(name="beautifulsoup4", |
1661 | - version = "4.2.1", |
1662 | + version = "4.3.2", |
1663 | author="Leonard Richardson", |
1664 | author_email='leonardr@segfault.org', |
1665 | url="http://www.crummy.com/software/BeautifulSoup/bs4/", |
Thanks. Uploaded.