Merge beautifulsoup:more-modular-soupstrainers into beautifulsoup:4.13
Proposed by Leonard Richardson
Status: Merged
Merged at revision: c23dd48ebea467fcf028e14287f07d2c51e62975
Proposed branch: beautifulsoup:more-modular-soupstrainers
Merge into: beautifulsoup:4.13
Diff against target: 2064 lines (+710/-262), 18 files modified
- CHANGELOG (+18/-1)
- bs4/__init__.py (+131/-84)
- bs4/_typing.py (+19/-1)
- bs4/builder/__init__.py (+8/-8)
- bs4/builder/_html5lib.py (+123/-67)
- bs4/builder/_htmlparser.py (+12/-2)
- bs4/builder/_lxml.py (+1/-1)
- bs4/diagnose.py (+27/-15)
- bs4/element.py (+24/-20)
- bs4/filter.py (+167/-36)
- bs4/tests/__init__.py (+1/-1)
- bs4/tests/test_filter.py (+125/-8)
- bs4/tests/test_html5lib.py (+2/-2)
- bs4/tests/test_lxml.py (+1/-1)
- bs4/tests/test_pageelement.py (+1/-1)
- bs4/tests/test_soup.py (+2/-2)
- bs4/tests/test_tree.py (+1/-1)
- doc/index.rst (+47/-11)
Related bugs:
| Reviewer | Review Type | Date Requested | Status |
|---|---|---|---|
| Leonard Richardson | | | Pending |
Review via email: mp+459082@code.launchpad.net
Commit message
Description of the change
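Among the smaller cleanups in the diff below, `BeautifulSoup._markup_is_url` is reworked so the `bytes` and `str` branches are fully separate, which lets each branch type-check on its own. A minimal standalone sketch of that heuristic (the function name and message here are illustrative, not the bs4 internals):

```python
import warnings
from typing import Union

def markup_is_url(markup: Union[str, bytes]) -> bool:
    """Return True if `markup` looks like a bare URL rather than a document.

    The heuristic: input that starts with http:/https: and contains no
    spaces was probably a URL pasted in where markup was expected.
    """
    space: Union[str, bytes]
    if isinstance(markup, bytes):
        prefixes = (b"http:", b"https:")
        space = b" "
    elif isinstance(markup, str):
        prefixes = ("http:", "https:")
        space = " "
    else:
        return False

    problem = markup.startswith(prefixes) and space not in markup
    if problem:
        warnings.warn(
            "The input looks more like a URL than markup. You may want to "
            "fetch the document behind the URL with an HTTP client first.",
            stacklevel=2,
        )
    return problem
```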
Preview Diff
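One behavior change called out in the CHANGELOG hunk below: passing `limit=0` to a find() method now returns zero results instead of all of them. The old bug was a truthiness test on `limit`, under which `0` behaved like "no limit"; the new semantics reserve that meaning for `None`. A generic sketch of the corrected logic (`find_matches` is a hypothetical stand-in, not the bs4 implementation):

```python
from typing import Callable, Iterable, List, Optional, TypeVar

T = TypeVar("T")

def find_matches(
    items: Iterable[T],
    predicate: Callable[[T], bool],
    limit: Optional[int] = None,
) -> List[T]:
    """Collect items satisfying `predicate`, up to `limit` of them.

    Only limit=None means "unlimited". A check like
    `if limit and len(results) >= limit` would let limit=0 fall
    through and return every match; the explicit None test does not.
    """
    results: List[T] = []
    for item in items:
        if limit is not None and len(results) >= limit:
            break
        if predicate(item):
            results.append(item)
    return results
```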
1 | diff --git a/CHANGELOG b/CHANGELOG |
2 | index 69f238d..162e3dc 100644 |
3 | --- a/CHANGELOG |
4 | +++ b/CHANGELOG |
5 | @@ -1,5 +1,7 @@ |
6 | = 4.13.0 (Unreleased) |
7 | |
8 | +TODO: we could stand to put limit inside ResultSet |
9 | + |
10 | * This version drops support for Python 3.6. The minimum supported |
11 | major Python version for Beautiful Soup is now Python 3.7. |
12 | |
13 | @@ -31,6 +33,13 @@ |
14 | you, since you probably use HTMLParserTreeBuilder, not |
15 | BeautifulSoupHTMLParser directly. |
16 | |
17 | +* The TreeBuilderForHtml5lib methods fragmentClass and getFragment |
18 | + now raise NotImplementedError. These methods are called only by |
19 | + html5lib's HTMLParser.parseFragment() method, which Beautiful Soup |
20 | + doesn't use, so they were untested and should have never been called. |
21 | + The getFragment() implementation was also slightly incorrect in a way |
22 | + that should have caused obvious problems for anyone using it. |
23 | + |
24 | * If Tag.get_attribute_list() is used to access an attribute that's not set, |
25 | the return value is now an empty list rather than [None]. |
26 | |
27 | @@ -47,6 +56,10 @@ |
28 | empty list was treated the same as None and False, and you would have |
29 | found the tags which did not have that attribute set at all. [bug=2045469] |
30 | |
31 | +* For similar reasons, if you pass in limit=0 to a find() method for some |
32 | + reason, you will now get zero results. Previously, you would get all |
33 | + matching results. |
34 | + |
35 | * When using one of the find() methods or creating a SoupStrainer, |
36 | if you specify the same attribute value in ``attrs`` and the |
37 | keyword arguments, you'll end up with two different ways to match that |
38 | @@ -88,7 +101,7 @@ |
39 | changed to match the arguments to the superclass, |
40 | TreeBuilder.prepare_markup. Specifically, document_declared_encoding |
41 | now appears before exclude_encodings, not after. If you were calling |
42 | - this method yourself, I recomment switching to using keyword |
43 | + this method yourself, I recommend switching to using keyword |
44 | arguments instead. |
45 | |
46 | * Fixed an error in the lookup table used when converting |
47 | @@ -101,8 +114,12 @@ New deprecations in 4.13.0: |
48 | |
49 | * The SAXTreeBuilder class, which was never officially supported or tested. |
50 | |
51 | +* The private class method BeautifulSoup._decode_markup(), which has not |
52 | + been used inside Beautiful Soup for many years. |
53 | + |
54 | * The first argument to BeautifulSoup.decode has been changed from a bool |
55 | `pretty_print` to an int `indent_level`, to match the signature of Tag.decode. |
56 | + Using a bool will still work but will give you a DeprecationWarning. |
57 | |
58 | * SoupStrainer.text and SoupStrainer.string are both deprecated |
59 | since a single item can't capture all the possibilities of a SoupStrainer |
60 | diff --git a/bs4/__init__.py b/bs4/__init__.py |
61 | index 347cb38..95bd48d 100644 |
62 | --- a/bs4/__init__.py |
63 | +++ b/bs4/__init__.py |
64 | @@ -15,7 +15,7 @@ documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/ |
65 | """ |
66 | |
67 | __author__ = "Leonard Richardson (leonardr@segfault.org)" |
68 | -__version__ = "4.12.3" |
69 | +__version__ = "4.13.0" |
70 | __copyright__ = "Copyright (c) 2004-2024 Leonard Richardson" |
71 | # Use of this source code is governed by the MIT license. |
72 | __license__ = "MIT" |
73 | @@ -42,10 +42,13 @@ from .builder import ( |
74 | ) |
75 | from .builder._htmlparser import HTMLParserTreeBuilder |
76 | from .dammit import UnicodeDammit |
77 | +from .css import ( |
78 | + CSS |
79 | +) |
80 | +from ._deprecation import _deprecated |
81 | from .element import ( |
82 | CData, |
83 | Comment, |
84 | - CSS, |
85 | DEFAULT_OUTPUT_ENCODING, |
86 | Declaration, |
87 | Doctype, |
88 | @@ -60,7 +63,10 @@ from .element import ( |
89 | TemplateString, |
90 | ) |
91 | from .formatter import Formatter |
92 | -from .strainer import SoupStrainer |
93 | +from .filter import ( |
94 | + ElementFilter, |
95 | + SoupStrainer, |
96 | +) |
97 | from typing import ( |
98 | Any, |
99 | cast, |
100 | @@ -70,6 +76,7 @@ from typing import ( |
101 | List, |
102 | Sequence, |
103 | Optional, |
104 | + Tuple, |
105 | Type, |
106 | TYPE_CHECKING, |
107 | Union, |
108 | @@ -81,6 +88,7 @@ from bs4._typing import ( |
109 | _Encoding, |
110 | _Encodings, |
111 | _IncomingMarkup, |
112 | + _RawMarkup, |
113 | ) |
114 | |
115 | # Define some custom warnings. |
116 | @@ -144,20 +152,21 @@ class BeautifulSoup(Tag): |
117 | NO_PARSER_SPECIFIED_WARNING: str = "No parser was explicitly specified, so I'm using the best available %(markup_type)s parser for this system (\"%(parser)s\"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.\n\nThe code that caused this warning is on line %(line_number)s of the file %(filename)s. To get rid of this warning, pass the additional argument 'features=\"%(parser)s\"' to the BeautifulSoup constructor.\n" |
118 | |
119 | # FUTURE PYTHON: |
120 | - element_classes:Dict[Type[PageElement], Type[Any]] #: :meta private: |
121 | + element_classes:Dict[Type[PageElement], Type[PageElement]] #: :meta private: |
122 | builder:TreeBuilder #: :meta private: |
123 | is_xml: bool |
124 | known_xml: Optional[bool] |
125 | parse_only: Optional[SoupStrainer] #: :meta private: |
126 | |
127 | # These members are only used while parsing markup. |
128 | - markup:Optional[Union[str,bytes]] #: :meta private: |
129 | + markup:Optional[_RawMarkup] #: :meta private: |
130 | current_data:List[str] #: :meta private: |
131 | currentTag:Optional[Tag] #: :meta private: |
132 | tagStack:List[Tag] #: :meta private: |
133 | open_tag_counter:CounterType[str] #: :meta private: |
134 | preserve_whitespace_tag_stack:List[Tag] #: :meta private: |
135 | string_container_stack:List[Tag] #: :meta private: |
136 | + _most_recent_element:Optional[PageElement] #: :meta private: |
137 | |
138 | #: Beautiful Soup's best guess as to the character encoding of the |
139 | #: original document. |
140 | @@ -182,7 +191,7 @@ class BeautifulSoup(Tag): |
141 | parse_only:Optional[SoupStrainer]=None, |
142 | from_encoding:Optional[_Encoding]=None, |
143 | exclude_encodings:Optional[_Encodings]=None, |
144 | - element_classes:Optional[Dict[Type[PageElement], Type[Any]]]=None, |
145 | + element_classes:Optional[Dict[Type[PageElement], Type[PageElement]]]=None, |
146 | **kwargs:Any |
147 | ): |
148 | """Constructor. |
149 | @@ -271,7 +280,7 @@ class BeautifulSoup(Tag): |
150 | "features='lxml' for HTML and features='lxml-xml' for " |
151 | "XML.") |
152 | |
153 | - def deprecated_argument(old_name, new_name): |
154 | + def deprecated_argument(old_name:str, new_name:str) -> Optional[Any]: |
155 | if old_name in kwargs: |
156 | warnings.warn( |
157 | 'The "%s" argument to the BeautifulSoup constructor ' |
158 | @@ -284,13 +293,14 @@ class BeautifulSoup(Tag): |
159 | |
160 | parse_only = parse_only or deprecated_argument( |
161 | "parseOnlyThese", "parse_only") |
162 | - if (parse_only is not None |
163 | - and parse_only.string_rules and |
164 | - (parse_only.name_rules or parse_only.attribute_rules)): |
165 | - warnings.warn( |
166 | - f"Value for parse_only will exclude everything, since it puts restrictions on both tags and strings: {parse_only}", |
167 | - UserWarning, stacklevel=3 |
168 | - ) |
169 | + if parse_only is not None: |
170 | + # Issue a warning if we can tell in advance that |
171 | + # parse_only will exclude the entire tree. |
172 | + if parse_only.excludes_everything: |
173 | + warnings.warn( |
174 | + f"The given value for parse_only will exclude everything: {parse_only}", |
175 | + UserWarning, stacklevel=3 |
176 | + ) |
177 | |
178 | from_encoding = from_encoding or deprecated_argument( |
179 | "fromEncoding", "from_encoding") |
180 | @@ -323,7 +333,7 @@ class BeautifulSoup(Tag): |
181 | "Couldn't find a tree builder with the features you " |
182 | "requested: %s. Do you need to install a parser library?" |
183 | % ",".join(features)) |
184 | - builder_class = cast(Type[TreeBuilder], possible_builder_class) |
185 | + builder_class = possible_builder_class |
186 | |
187 | # At this point either we have a TreeBuilder instance in |
188 | # builder, or we have a builder_class that we can instantiate |
189 | @@ -399,7 +409,7 @@ class BeautifulSoup(Tag): |
190 | |
191 | # At this point we know markup is a string or bytestring. If |
192 | # it was a file-type object, we've read from it. |
193 | - markup = cast(Union[str,bytes], markup) |
194 | + markup = cast(_RawMarkup, markup) |
195 | |
196 | rejections = [] |
197 | success = False |
198 | @@ -428,7 +438,7 @@ class BeautifulSoup(Tag): |
199 | self.markup = None |
200 | self.builder.soup = None |
201 | |
202 | - def _clone(self): |
203 | + def _clone(self) -> "BeautifulSoup": |
204 | """Create a new BeautifulSoup object with the same TreeBuilder, |
205 | but not associated with any markup. |
206 | |
207 | @@ -441,7 +451,7 @@ class BeautifulSoup(Tag): |
208 | clone.original_encoding = self.original_encoding |
209 | return clone |
210 | |
211 | - def __getstate__(self): |
212 | + def __getstate__(self) -> dict[str, Any]: |
213 | # Frequently a tree builder can't be pickled. |
214 | d = dict(self.__dict__) |
215 | if 'builder' in d and d['builder'] is not None and not self.builder.picklable: |
216 | @@ -457,7 +467,7 @@ class BeautifulSoup(Tag): |
217 | del d['_most_recent_element'] |
218 | return d |
219 | |
220 | - def __setstate__(self, state): |
221 | + def __setstate__(self, state: dict[str, Any]) -> None: |
222 | # If necessary, restore the TreeBuilder by looking it up. |
223 | self.__dict__ = state |
224 | if isinstance(self.builder, type): |
225 | @@ -469,15 +479,16 @@ class BeautifulSoup(Tag): |
226 | self.builder.soup = self |
227 | self.reset() |
228 | self._feed() |
229 | - return state |
230 | |
231 | |
232 | @classmethod |
233 | - def _decode_markup(cls, markup): |
234 | - """Ensure `markup` is bytes so it's safe to send into warnings.warn. |
235 | + @_deprecated(replaced_by="nothing (private method, will be removed)", version="4.13.0") |
236 | + def _decode_markup(cls, markup:_RawMarkup) -> str: |
237 | + """Ensure `markup` is Unicode so it's safe to send into warnings.warn. |
238 | |
239 | - TODO: warnings.warn had this problem back in 2010 but it might not |
240 | - anymore. |
241 | + warnings.warn had this problem back in 2010 but fortunately |
242 | + not anymore. This has not been used for a long time; I just |
243 | + noticed that fact while working on 4.13.0. |
244 | """ |
245 | if isinstance(markup, bytes): |
246 | decoded = markup.decode('utf-8', 'replace') |
247 | @@ -486,56 +497,76 @@ class BeautifulSoup(Tag): |
248 | return decoded |
249 | |
250 | @classmethod |
251 | - def _markup_is_url(cls, markup): |
252 | + def _markup_is_url(cls, markup:_RawMarkup) -> bool: |
253 | """Error-handling method to raise a warning if incoming markup looks |
254 | like a URL. |
255 | |
256 | - :param markup: A string. |
257 | - :return: Whether or not the markup resembles a URL |
258 | - closely enough to justify a warning. |
259 | + :param markup: A string of markup. |
260 | + :return: Whether or not the markup resembled a URL |
261 | + closely enough to justify issuing a warning. |
262 | """ |
263 | + problem: bool = False |
264 | if isinstance(markup, bytes): |
265 | - space = b' ' |
266 | - cant_start_with = (b"http:", b"https:") |
267 | + cant_start_with_b: Tuple[bytes, bytes] = (b"http:", b"https:") |
268 | + problem = ( |
269 | + any( |
270 | + markup.startswith(prefix) for prefix in |
271 | + (b"http:", b"https:") |
272 | + ) |
273 | + and not b' ' in markup |
274 | + ) |
275 | elif isinstance(markup, str): |
276 | - space = ' ' |
277 | - cant_start_with = ("http:", "https:") |
278 | + problem = ( |
279 | + any( |
280 | + markup.startswith(prefix) for prefix in |
281 | + ("http:", "https:") |
282 | + ) |
283 | + and not ' ' in markup |
284 | + ) |
285 | else: |
286 | return False |
287 | |
288 | - if any(markup.startswith(prefix) for prefix in cant_start_with): |
289 | - if not space in markup: |
290 | - warnings.warn( |
291 | - 'The input looks more like a URL than markup. You may want to use' |
292 | - ' an HTTP client like requests to get the document behind' |
293 | - ' the URL, and feed that document to Beautiful Soup.', |
294 | - MarkupResemblesLocatorWarning, |
295 | - stacklevel=3 |
296 | - ) |
297 | - return True |
298 | - return False |
299 | + if not problem: |
300 | + return False |
301 | + warnings.warn( |
302 | + 'The input looks more like a URL than markup. You may want to use' |
303 | + ' an HTTP client like requests to get the document behind' |
304 | + ' the URL, and feed that document to Beautiful Soup.', |
305 | + MarkupResemblesLocatorWarning, |
306 | + stacklevel=3 |
307 | + ) |
308 | + return True |
309 | |
310 | @classmethod |
311 | - def _markup_resembles_filename(cls, markup): |
312 | - """Error-handling method to raise a warning if incoming markup |
313 | + def _markup_resembles_filename(cls, markup:_RawMarkup) -> bool: |
314 | + """Error-handling method to issue a warning if incoming markup |
315 | resembles a filename. |
316 | |
317 | - :param markup: A bytestring or string. |
318 | - :return: Whether or not the markup resembles a filename |
319 | - closely enough to justify a warning. |
320 | + :param markup: A string of markup. |
321 | + :return: Whether or not the markup resembled a filename |
322 | + closely enough to justify issuing a warning. |
323 | """ |
324 | - path_characters = '/\\' |
325 | - extensions = ['.html', '.htm', '.xml', '.xhtml', '.txt'] |
326 | - if isinstance(markup, bytes): |
327 | - path_characters = path_characters.encode("utf8") |
328 | - extensions = [x.encode('utf8') for x in extensions] |
329 | + path_characters_b = b'/\\' |
330 | + path_characters_s = '/\\' |
331 | + extensions_b = [b'.html', b'.htm', b'.xml', b'.xhtml', b'.txt'] |
332 | + extensions_s = ['.html', '.htm', '.xml', '.xhtml', '.txt'] |
333 | + |
334 | filelike = False |
335 | - if any(x in markup for x in path_characters): |
336 | - filelike = True |
337 | + if isinstance(markup, bytes): |
338 | + if any(x in markup for x in path_characters_b): |
339 | + filelike = True |
340 | + else: |
341 | + lower_b = markup.lower() |
342 | + if any(lower_b.endswith(ext) for ext in extensions_b): |
343 | + filelike = True |
344 | else: |
345 | - lower = markup.lower() |
346 | - if any(lower.endswith(ext) for ext in extensions): |
347 | + if any(x in markup for x in path_characters_s): |
348 | filelike = True |
349 | + else: |
350 | + lower_s = markup.lower() |
351 | + if any(lower_s.endswith(ext) for ext in extensions_s): |
352 | + filelike = True |
353 | + |
354 | if filelike: |
355 | warnings.warn( |
356 | 'The input looks more like a filename than markup. You may' |
357 | @@ -546,20 +577,22 @@ class BeautifulSoup(Tag): |
358 | return True |
359 | return False |
360 | |
361 | - def _feed(self): |
362 | + def _feed(self) -> None: |
363 | """Internal method that parses previously set markup, creating a large |
364 | number of Tag and NavigableString objects. |
365 | """ |
366 | # Convert the document to Unicode. |
367 | self.builder.reset() |
368 | |
369 | - self.builder.feed(self.markup) |
370 | + if self.markup is not None: |
371 | + self.builder.feed(self.markup) |
372 | # Close out any unfinished strings and close all the open tags. |
373 | self.endData() |
374 | - while self.currentTag.name != self.ROOT_TAG_NAME: |
375 | + while (self.currentTag is not None and |
376 | + self.currentTag.name != self.ROOT_TAG_NAME): |
377 | self.popTag() |
378 | |
379 | - def reset(self): |
380 | + def reset(self) -> None: |
381 | """Reset this object to a state as though it had never parsed any |
382 | markup. |
383 | """ |
384 | @@ -585,7 +618,7 @@ class BeautifulSoup(Tag): |
385 | sourcepos:Optional[int]=None, |
386 | string:Optional[str]=None, |
387 | **kwattrs:_AttributeValue, |
388 | - ): |
389 | + ) -> Tag: |
390 | """Create a new Tag associated with this BeautifulSoup object. |
391 | |
392 | :param name: The name of the new Tag. |
393 | @@ -603,10 +636,16 @@ class BeautifulSoup(Tag): |
394 | |
395 | """ |
396 | kwattrs.update(attrs) |
397 | - tag = self.element_classes.get(Tag, Tag)( |
398 | + tag_class = self.element_classes.get(Tag, Tag) |
399 | + |
400 | + # Assume that this is either Tag or a subclass of Tag. If not, |
401 | + # the user brought type-unsafety upon themselves. |
402 | + tag_class = cast(Type[Tag], tag_class) |
403 | + tag = tag_class( |
404 | None, self.builder, name, namespace, nsprefix, kwattrs, |
405 | sourceline=sourceline, sourcepos=sourcepos |
406 | ) |
407 | + |
408 | if string is not None: |
409 | tag.string = string |
410 | return tag |
411 | @@ -622,9 +661,11 @@ class BeautifulSoup(Tag): |
412 | """ |
413 | container = base_class or NavigableString |
414 | |
415 | - # There may be a general override of NavigableString. |
416 | - container = self.element_classes.get( |
417 | - container, container |
418 | + # The user may want us to use some other class (hopefully a |
419 | + # custom subclass) instead of the one we'd use normally. |
420 | + container = cast( |
421 | + type[NavigableString], |
422 | + self.element_classes.get(container, container) |
423 | ) |
424 | |
425 | # On top of that, we may be inside a tag that needs a special |
426 | @@ -728,9 +769,8 @@ class BeautifulSoup(Tag): |
427 | self.current_data = [] |
428 | |
429 | # Should we add this string to the tree at all? |
430 | - if self.parse_only and len(self.tagStack) <= 1 and \ |
431 | - (not self.parse_only.string_rules or \ |
432 | - not self.parse_only.allow_string_creation(current_data)): |
433 | + if (self.parse_only and len(self.tagStack) <= 1 and |
434 | + (not self.parse_only.allow_string_creation(current_data))): |
435 | return |
436 | |
437 | containerClass = self.string_container(containerClass) |
438 | @@ -739,17 +779,16 @@ class BeautifulSoup(Tag): |
439 | |
440 | def object_was_parsed( |
441 | self, o:PageElement, parent:Optional[Tag]=None, |
442 | - most_recent_element:Optional[PageElement]=None): |
443 | + most_recent_element:Optional[PageElement]=None) -> None: |
444 | """Method called by the TreeBuilder to integrate an object into the |
445 | parse tree. |
446 | |
447 | - |
448 | - |
449 | :meta private: |
450 | """ |
451 | if parent is None: |
452 | parent = self.currentTag |
453 | assert parent is not None |
454 | + previous_element: Optional[PageElement] |
455 | if most_recent_element is not None: |
456 | previous_element = most_recent_element |
457 | else: |
458 | @@ -774,12 +813,12 @@ class BeautifulSoup(Tag): |
459 | if fix: |
460 | self._linkage_fixer(parent) |
461 | |
462 | - def _linkage_fixer(self, el): |
463 | + def _linkage_fixer(self, el:Tag) -> None: |
464 | """Make sure linkage of this fragment is sound.""" |
465 | |
466 | first = el.contents[0] |
467 | child = el.contents[-1] |
468 | - descendant = child |
469 | + descendant:PageElement = child |
470 | |
471 | if child is first and el.parent is not None: |
472 | # Parent should be linked to first child |
473 | @@ -797,14 +836,18 @@ class BeautifulSoup(Tag): |
474 | |
475 | # This index is a tag, dig deeper for a "last descendant" |
476 | if isinstance(child, Tag) and child.contents: |
477 | - descendant = child._last_descendant(False) |
478 | + # _last_decendant is typed as returning Optional[PageElement], |
479 | + # but the value can't be None here, because el is a Tag |
480 | + # which we know has contents. |
481 | + descendant = cast(PageElement, child._last_descendant(False)) |
482 | |
483 | # As the final step, link last descendant. It should be linked |
484 | # to the parent's next sibling (if found), else walk up the chain |
485 | # and find a parent with a sibling. It should have no next sibling. |
486 | descendant.next_element = None |
487 | descendant.next_sibling = None |
488 | - target = el |
489 | + |
490 | + target:Optional[Tag] = el |
491 | while True: |
492 | if target is None: |
493 | break |
494 | @@ -814,7 +857,7 @@ class BeautifulSoup(Tag): |
495 | break |
496 | target = target.parent |
497 | |
498 | - def _popToTag(self, name, nsprefix=None, inclusivePop=True) -> Optional[Tag]: |
499 | + def _popToTag(self, name:str, nsprefix:Optional[str]=None, inclusivePop:bool=True) -> Optional[Tag]: |
500 | """Pops the tag stack up to and including the most recent |
501 | instance of the given tag. |
502 | |
503 | @@ -851,7 +894,7 @@ class BeautifulSoup(Tag): |
504 | |
505 | def handle_starttag( |
506 | self, name:str, namespace:Optional[str], |
507 | - nsprefix:Optional[str], attrs:Optional[Dict[str,str]], |
508 | + nsprefix:Optional[str], attrs:_AttributeValues, |
509 | sourceline:Optional[int]=None, sourcepos:Optional[int]=None, |
510 | namespaces:Optional[Dict[str, str]]=None) -> Optional[Tag]: |
511 | """Called by the tree builder when a new tag is encountered. |
512 | @@ -867,7 +910,7 @@ class BeautifulSoup(Tag): |
513 | currently in scope in the document. |
514 | |
515 | If this method returns None, the tag was rejected by an active |
516 | - SoupStrainer. You should proceed as if the tag had not occurred |
517 | + `ElementFilter`. You should proceed as if the tag had not occurred |
518 | in the document. For instance, if this was a self-closing tag, |
519 | don't call handle_endtag. |
520 | |
521 | @@ -877,11 +920,14 @@ class BeautifulSoup(Tag): |
522 | self.endData() |
523 | |
524 | if (self.parse_only and len(self.tagStack) <= 1 |
525 | - and (self.parse_only.string_rules |
526 | - or not self.parse_only.allow_tag_creation(nsprefix, name, attrs))): |
527 | + and not self.parse_only.allow_tag_creation(nsprefix, name, attrs)): |
528 | return None |
529 | |
530 | - tag = self.element_classes.get(Tag, Tag)( |
531 | + tag_class = self.element_classes.get(Tag, Tag) |
532 | + # Assume that this is either Tag or a subclass of Tag. If not, |
533 | + # the user brought type-unsafety upon themselves. |
534 | + tag_class = cast(Type[Tag], tag_class) |
535 | + tag = tag_class( |
536 | self, self.builder, name, namespace, nsprefix, attrs, |
537 | self.currentTag, self._most_recent_element, |
538 | sourceline=sourceline, sourcepos=sourcepos, |
539 | @@ -918,7 +964,8 @@ class BeautifulSoup(Tag): |
540 | def decode(self, indent_level:Optional[int]=None, |
541 | eventual_encoding:_Encoding=DEFAULT_OUTPUT_ENCODING, |
542 | formatter:Union[Formatter,str]="minimal", |
543 | - iterator:Optional[Iterable]=None, **kwargs) -> str: |
544 | + iterator:Optional[Iterable[PageElement]]=None, |
545 | + **kwargs:Any) -> str: |
546 | """Returns a string representation of the parse tree |
547 | as a full HTML or XML document. |
548 | |
549 | @@ -989,7 +1036,7 @@ _soup = BeautifulSoup |
550 | class BeautifulStoneSoup(BeautifulSoup): |
551 | """Deprecated interface to an XML parser.""" |
552 | |
553 | - def __init__(self, *args, **kwargs): |
554 | + def __init__(self, *args:Any, **kwargs:Any): |
555 | kwargs['features'] = 'xml' |
556 | warnings.warn( |
557 | 'The BeautifulStoneSoup class was deprecated in version 4.0.0. Instead of using ' |
558 | diff --git a/bs4/_typing.py b/bs4/_typing.py |
559 | index fed804a..ab8f7a0 100644 |
560 | --- a/bs4/_typing.py |
561 | +++ b/bs4/_typing.py |
562 | @@ -7,6 +7,8 @@ |
563 | # * In 3.10, x|y is an accepted shorthand for Union[x,y]. |
564 | # * In 3.10, TypeAlias gains capabilities that can be used to |
565 | # improve the tree matching types (I don't remember what, exactly). |
566 | +# * 3.8 defines the Protocol type, which can be used to do duck typing |
567 | +# in a statically checkable way. |
568 | |
569 | import re |
570 | from typing_extensions import TypeAlias |
571 | @@ -15,13 +17,14 @@ from typing import ( |
572 | Dict, |
573 | IO, |
574 | Iterable, |
575 | + Optional, |
576 | Pattern, |
577 | TYPE_CHECKING, |
578 | Union, |
579 | ) |
580 | |
581 | if TYPE_CHECKING: |
582 | - from bs4.element import Tag |
583 | + from bs4.element import PageElement, Tag |
584 | |
585 | # Aliases for markup in various stages of processing. |
586 | # |
587 | @@ -52,6 +55,10 @@ _InvertedNamespaceMapping:TypeAlias = Dict[_NamespaceURL, _NamespacePrefix] |
588 | _AttributeValue: TypeAlias = Union[str, Iterable[str]] |
589 | _AttributeValues: TypeAlias = Dict[str, _AttributeValue] |
590 | |
591 | +# The most common form in which attribute values are passed in from a |
592 | +# parser. |
593 | +_RawAttributeValues: TypeAlias = dict[str, str] |
594 | + |
595 | # Aliases to represent the many possibilities for matching bits of a |
596 | # parse tree. |
597 | # |
598 | @@ -60,6 +67,17 @@ _AttributeValues: TypeAlias = Dict[str, _AttributeValue] |
599 | # of the arguments to the SoupStrainer constructor and (more |
600 | # familiarly to Beautiful Soup users) the find* methods. |
601 | |
602 | +# A function that takes a PageElement and returns a yes-or-no answer. |
603 | +_PageElementMatchFunction:TypeAlias = Callable[['PageElement'], bool] |
604 | + |
605 | +# A function that takes the raw parsed ingredients of a markup tag |
606 | +# and returns a yes-or-no answer. |
607 | +_AllowTagCreationFunction:TypeAlias = Callable[[Optional[str], str, Optional[_RawAttributeValues]], bool] |
608 | + |
609 | +# A function that takes the raw parsed ingredients of a markup string node |
610 | +# and returns a yes-or-no answer. |
611 | +_AllowStringCreationFunction:TypeAlias = Callable[[Optional[str]], bool] |
612 | + |
613 | # A function that takes a Tag and returns a yes-or-no answer. |
614 | # A TagNameMatchRule expects this kind of function, if you're |
615 | # going to pass it a function. |
616 | diff --git a/bs4/builder/__init__.py b/bs4/builder/__init__.py |
617 | index fa2b939..b59513e 100644 |
618 | --- a/bs4/builder/__init__.py |
619 | +++ b/bs4/builder/__init__.py |
620 | @@ -277,7 +277,7 @@ class TreeBuilder(object): |
621 | return True |
622 | return tag_name in self.empty_element_tags |
623 | |
624 | - def feed(self, markup:str) -> None: |
625 | + def feed(self, markup:_RawMarkup) -> None: |
626 | """Run some incoming markup through some parsing process, |
627 | populating the `BeautifulSoup` object in `TreeBuilder.soup` |
628 | """ |
629 | @@ -598,8 +598,8 @@ class DetectsXMLParsedAsHTML(object): |
630 | |
631 | # This is typed as str, not `ProcessingInstruction`, because this |
632 | # check may be run before any Beautiful Soup objects are created. |
633 | - _first_processing_instruction: Optional[str] |
634 | - _root_tag: Optional[Tag] |
635 | + _first_processing_instruction: Optional[str] #: :meta private: |
636 | + _root_tag_name: Optional[str] #: :meta private: |
637 | |
638 | @classmethod |
639 | def warn_if_markup_looks_like_xml(cls, markup:Optional[_RawMarkup], stacklevel:int=3) -> bool: |
640 | @@ -648,14 +648,14 @@ class DetectsXMLParsedAsHTML(object): |
641 | def _initialize_xml_detector(self) -> None: |
642 | """Call this method before parsing a document.""" |
643 | self._first_processing_instruction = None |
644 | - self._root_tag = None |
645 | + self._root_tag_name = None |
646 | |
647 | def _document_might_be_xml(self, processing_instruction:str): |
648 | """Call this method when encountering an XML declaration, or a |
649 | "processing instruction" that might be an XML declaration. |
650 | """ |
651 | if (self._first_processing_instruction is not None |
652 | - or self._root_tag is not None): |
653 | + or self._root_tag_name is not None): |
654 | # The document has already started. Don't bother checking |
655 | # anymore. |
656 | return |
657 | @@ -665,18 +665,18 @@ class DetectsXMLParsedAsHTML(object): |
658 | # We won't know until we encounter the first tag whether or |
659 | # not this is actually a problem. |
660 | |
661 | - def _root_tag_encountered(self, name): |
662 | + def _root_tag_encountered(self, name:str) -> None: |
663 | """Call this when you encounter the document's root tag. |
664 | |
665 | This is where we actually check whether an XML document is |
666 | being incorrectly parsed as HTML, and issue the warning. |
667 | """ |
668 | - if self._root_tag is not None: |
669 | + if self._root_tag_name is not None: |
670 | # This method was incorrectly called multiple times. Do |
671 | # nothing. |
672 | return |
673 | |
674 | - self._root_tag = name |
675 | + self._root_tag_name = name |
676 | if (name != 'html' and self._first_processing_instruction is not None |
677 | and self._first_processing_instruction.lower().startswith('xml ')): |
678 | # We encountered an XML declaration and then a tag other |
679 | diff --git a/bs4/builder/_html5lib.py b/bs4/builder/_html5lib.py |
680 | index b7d2924..2ea556c 100644 |
681 | --- a/bs4/builder/_html5lib.py |
682 | +++ b/bs4/builder/_html5lib.py |
683 | @@ -6,6 +6,9 @@ __all__ = [ |
684 | ] |
685 | |
686 | from typing import ( |
687 | + Any, |
688 | + cast, |
689 | + Dict, |
690 | Iterable, |
691 | List, |
692 | Optional, |
693 | @@ -14,8 +17,11 @@ from typing import ( |
694 | Union, |
695 | ) |
696 | from bs4._typing import ( |
697 | + _AttributeValue, |
698 | + _AttributeValues, |
699 | _Encoding, |
700 | _Encodings, |
701 | + _NamespaceURL, |
702 | _RawMarkup, |
703 | ) |
704 | |
705 | @@ -30,6 +36,7 @@ from bs4.builder import ( |
706 | ) |
707 | from bs4.element import ( |
708 | NamespacedAttribute, |
709 | + PageElement, |
710 | nonwhitespace_re, |
711 | ) |
712 | import html5lib |
713 | @@ -42,7 +49,9 @@ from bs4.element import ( |
714 | Doctype, |
715 | NavigableString, |
716 | Tag, |
717 | - ) |
718 | +) |
719 | +if TYPE_CHECKING: |
720 | + from bs4 import BeautifulSoup |
721 | |
722 | from html5lib.treebuilders import base as treebuilder_base |
723 | |
724 | @@ -71,7 +80,9 @@ class HTML5TreeBuilder(HTMLTreeBuilder): |
725 | #: html5lib can tell us which line number and position in the |
726 | #: original file is the source of an element. |
727 | TRACKS_LINE_NUMBERS:bool = True |
728 | - |
729 | + |
730 | + underlying_builder:'TreeBuilderForHtml5lib' #: :meta private: |
731 | + |
732 | def prepare_markup(self, markup:_RawMarkup, |
733 | user_specified_encoding:Optional[_Encoding]=None, |
734 | document_declared_encoding:Optional[_Encoding]=None, |
735 | @@ -102,20 +113,31 @@ class HTML5TreeBuilder(HTMLTreeBuilder): |
736 | yield (markup, None, None, False) |
737 | |
738 | # These methods are defined by Beautiful Soup. |
739 | - def feed(self, markup): |
740 | + def feed(self, markup:_RawMarkup) -> None: |
741 | """Run some incoming markup through some parsing process, |
742 | populating the `BeautifulSoup` object in `HTML5TreeBuilder.soup`. |
743 | """ |
744 | - if self.soup.parse_only is not None: |
745 | + if self.soup is not None and self.soup.parse_only is not None: |
746 | warnings.warn( |
747 | "You provided a value for parse_only, but the html5lib tree builder doesn't support parse_only. The entire document will be parsed.", |
748 | stacklevel=4 |
749 | ) |
750 | + |
751 | + # self.underlying_parser is probably None now, but it'll be set |
752 | + # when self.create_treebuilder is called by html5lib. |
753 | + # |
754 | + # TODO-TYPING: typeshed stubs are incorrect about the return |
755 | + # value of HTMLParser.__init__; it is HTMLParser, not None. |
756 | parser = html5lib.HTMLParser(tree=self.create_treebuilder) |
757 | + assert self.underlying_builder is not None |
758 | self.underlying_builder.parser = parser |
759 | extra_kwargs = dict() |
760 | if not isinstance(markup, str): |
761 | + # kwargs, specifically override_encoding, will eventually |
762 | + # be passed in to html5lib's |
763 | + # HTMLBinaryInputStream.__init__. |
764 | extra_kwargs['override_encoding'] = self.user_specified_encoding |
765 | + |
766 | doc = parser.parse(markup, **extra_kwargs) |
767 | |
768 | # Set the character encoding detected by the tokenizer. |
769 | @@ -131,10 +153,12 @@ class HTML5TreeBuilder(HTMLTreeBuilder): |
770 | doc.original_encoding = original_encoding |
771 | self.underlying_builder.parser = None |
772 | |
773 | - def create_treebuilder(self, namespaceHTMLElements): |
774 | + def create_treebuilder(self, namespaceHTMLElements:bool) -> 'TreeBuilderForHtml5lib': |
775 | """Called by html5lib to instantiate the kind of class it |
776 | calls a 'TreeBuilder'. |
777 | - |
778 | + |
779 | + :param namespaceHTMLElements: Whether or not to namespace HTML elements. |
780 | + |
781 | :meta private: |
782 | """ |
783 | self.underlying_builder = TreeBuilderForHtml5lib( |
784 | @@ -143,15 +167,18 @@ class HTML5TreeBuilder(HTMLTreeBuilder): |
785 | ) |
786 | return self.underlying_builder |
787 | |
788 | - def test_fragment_to_document(self, fragment): |
789 | + def test_fragment_to_document(self, fragment:str) -> str: |
790 | """See `TreeBuilder`.""" |
791 | return '<html><head></head><body>%s</body></html>' % fragment |
792 | |
793 | |
794 | class TreeBuilderForHtml5lib(treebuilder_base.TreeBuilder): |
795 | - |
796 | - def __init__(self, namespaceHTMLElements, soup=None, |
797 | - store_line_numbers=True, **kwargs): |
798 | + |
799 | + soup:'BeautifulSoup' #: :meta private: |
800 | + |
801 | + def __init__(self, namespaceHTMLElements:bool, |
802 | + soup:Optional['BeautifulSoup']=None, |
803 | + store_line_numbers:bool=True, **kwargs:Any): |
804 | if soup: |
805 | self.soup = soup |
806 | else: |
807 | @@ -172,65 +199,68 @@ class TreeBuilderForHtml5lib(treebuilder_base.TreeBuilder): |
808 | self.parser = None |
809 | self.store_line_numbers = store_line_numbers |
810 | |
811 | - def documentClass(self): |
812 | + def documentClass(self) -> 'Element': |
813 | self.soup.reset() |
814 | return Element(self.soup, self.soup, None) |
815 | |
816 | - def insertDoctype(self, token): |
817 | - name = token["name"] |
818 | - publicId = token["publicId"] |
819 | - systemId = token["systemId"] |
820 | + def insertDoctype(self, token:Dict[str, Any]) -> None: |
821 | + name:str = cast(str, token["name"]) |
822 | + publicId:Optional[str] = cast(Optional[str], token["publicId"]) |
823 | + systemId:Optional[str] = cast(Optional[str], token["systemId"]) |
824 | |
825 | doctype = Doctype.for_name_and_ids(name, publicId, systemId) |
826 | self.soup.object_was_parsed(doctype) |
827 | |
828 | - def elementClass(self, name, namespace): |
829 | - kwargs = {} |
830 | + def elementClass(self, name:str, namespace:str) -> 'Element': |
831 | + sourceline:Optional[int] = None |
832 | + sourcepos:Optional[int] = None |
833 | if self.parser and self.store_line_numbers: |
834 | # This represents the point immediately after the end of the |
835 | # tag. We don't know when the tag started, but we do know |
836 | # where it ended -- the character just before this one. |
837 | sourceline, sourcepos = self.parser.tokenizer.stream.position() |
838 | - kwargs['sourceline'] = sourceline |
839 | - kwargs['sourcepos'] = sourcepos-1 |
840 | - tag = self.soup.new_tag(name, namespace, **kwargs) |
841 | + sourcepos = sourcepos-1 |
842 | + tag = self.soup.new_tag( |
843 | + name, namespace, sourceline=sourceline, sourcepos=sourcepos |
844 | + ) |
845 | |
846 | return Element(tag, self.soup, namespace) |
847 | |
848 | - def commentClass(self, data): |
849 | + def commentClass(self, data:str) -> 'TextNode': |
850 | return TextNode(Comment(data), self.soup) |
851 | |
852 | - def fragmentClass(self): |
853 | - from bs4 import BeautifulSoup |
854 | - # TODO: Why is the parser 'html.parser' here? To avoid an |
855 | - # infinite loop? |
856 | - self.soup = BeautifulSoup("", "html.parser") |
857 | - self.soup.name = "[document_fragment]" |
858 | - return Element(self.soup, self.soup, None) |
859 | + def fragmentClass(self) -> 'Element': |
860 | + """This is only used by html5lib HTMLParser.parseFragment(), |
861 | + which is never used by Beautiful Soup.""" |
862 | + raise NotImplementedError() |
863 | + |
864 | + def getFragment(self) -> 'Element': |
865 | + """This is only used by html5lib HTMLParser.parseFragment(), |
866 | + which is never used by Beautiful Soup.""" |
867 | + raise NotImplementedError() |
868 | |
869 | - def appendChild(self, node): |
870 | - # XXX This code is not covered by the BS4 tests. |
871 | + def appendChild(self, node:'Element') -> None: |
872 | + # TODO: This code is not covered by the BS4 tests. |
873 | self.soup.append(node.element) |
874 | |
875 | - def getDocument(self): |
876 | + def getDocument(self) -> 'BeautifulSoup': |
877 | return self.soup |
878 | |
879 | - def getFragment(self): |
880 | - return treebuilder_base.TreeBuilder.getFragment(self).element |
881 | - |
882 | - def testSerializer(self, element): |
883 | + # TODO-TYPING: typeshed stubs are incorrect about this; |
884 | + # testSerializer returns a str, not None. |
885 | + def testSerializer(self, element:'Element') -> str: |
886 | from bs4 import BeautifulSoup |
887 | rv = [] |
888 | doctype_re = re.compile(r'^(.*?)(?: PUBLIC "(.*?)"(?: "(.*?)")?| SYSTEM "(.*?)")?$') |
889 | |
890 | - def serializeElement(element, indent=0): |
891 | + def serializeElement(element:Union['Element', PageElement], indent:int=0) -> None: |
892 | if isinstance(element, BeautifulSoup): |
893 | pass |
894 | if isinstance(element, Doctype): |
895 | m = doctype_re.match(element) |
896 | - if m: |
897 | + if m is not None: |
898 | name = m.group(1) |
899 | - if m.lastindex > 1: |
900 | + if m.lastindex is not None and m.lastindex > 1: |
901 | publicId = m.group(2) or "" |
902 | systemId = m.group(3) or m.group(4) or "" |
903 | rv.append("""|%s<!DOCTYPE %s "%s" "%s">""" % |
904 | @@ -243,7 +273,7 @@ class TreeBuilderForHtml5lib(treebuilder_base.TreeBuilder): |
905 | rv.append("|%s<!-- %s -->" % (' ' * indent, element)) |
906 | elif isinstance(element, NavigableString): |
907 | rv.append("|%s\"%s\"" % (' ' * indent, element)) |
908 | - else: |
909 | + elif isinstance(element, Element): |
910 | if element.namespace: |
911 | name = "%s %s" % (prefixes[element.namespace], |
912 | element.name) |
913 | @@ -269,12 +299,19 @@ class TreeBuilderForHtml5lib(treebuilder_base.TreeBuilder): |
914 | return "\n".join(rv) |
915 | |
916 | class AttrList(object): |
917 | - def __init__(self, element): |
918 | + """Represents a Tag's attributes in a way compatible with html5lib.""" |
919 | + |
920 | + element:Tag |
921 | + attrs:_AttributeValues |
922 | + |
923 | + def __init__(self, element:Tag): |
924 | self.element = element |
925 | self.attrs = dict(self.element.attrs) |
926 | - def __iter__(self): |
927 | + |
928 | + def __iter__(self) -> Iterable[Tuple[str, _AttributeValue]]: |
929 | return list(self.attrs.items()).__iter__() |
930 | - def __setitem__(self, name, value): |
931 | + |
932 | + def __setitem__(self, name:str, value:_AttributeValue) -> None: |
933 | # If this attribute is a multi-valued attribute for this element, |
934 | # turn its value into a list. |
935 | list_attr = self.element.cdata_list_attributes or {} |
936 | @@ -282,40 +319,52 @@ class AttrList(object): |
937 | or (self.element.name in list_attr |
938 | and name in list_attr.get(self.element.name, []))): |
939 | # A node that is being cloned may have already undergone |
940 | - # this procedure. |
941 | + # this procedure. Check for this and skip it. |
942 | if not isinstance(value, list): |
943 | + assert isinstance(value, str) |
944 | value = nonwhitespace_re.findall(value) |
945 | self.element[name] = value |
946 | - def items(self): |
947 | + |
948 | + def items(self) -> Iterable[Tuple[str, _AttributeValue]]: |
949 | return list(self.attrs.items()) |
950 | - def keys(self): |
951 | + |
952 | + def keys(self) -> Iterable[str]: |
953 | return list(self.attrs.keys()) |
954 | - def __len__(self): |
955 | + |
956 | + def __len__(self) -> int: |
957 | return len(self.attrs) |
958 | - def __getitem__(self, name): |
959 | + |
960 | + def __getitem__(self, name:str) -> _AttributeValue: |
961 | return self.attrs[name] |
962 | - def __contains__(self, name): |
963 | + |
964 | + def __contains__(self, name:str) -> bool: |
965 | return name in list(self.attrs.keys()) |
966 | |
967 | |
968 | class Element(treebuilder_base.Node): |
969 | - def __init__(self, element, soup, namespace): |
970 | + |
971 | + element:Tag |
972 | + soup:'BeautifulSoup' |
973 | + namespace:Optional[_NamespaceURL] |
974 | + |
975 | + def __init__(self, element:Tag, soup:'BeautifulSoup', |
976 | + namespace:Optional[_NamespaceURL]): |
977 | treebuilder_base.Node.__init__(self, element.name) |
978 | self.element = element |
979 | self.soup = soup |
980 | self.namespace = namespace |
981 | |
982 | - def appendChild(self, node): |
983 | + def appendChild(self, node:'Element') -> None: |
984 | string_child = child = None |
985 | if isinstance(node, str): |
986 | # Some other piece of code decided to pass in a string |
987 | # instead of creating a TextElement object to contain the |
988 | - # string. |
989 | + # string. This should never happen. |
990 | string_child = child = node |
991 | elif isinstance(node, Tag): |
992 | # Some other piece of code decided to pass in a Tag |
993 | # instead of creating an Element object to contain the |
994 | - # Tag. |
995 | + # Tag. This should never happen. |
996 | child = node |
997 | elif node.element.__class__ == NavigableString: |
998 | string_child = child = node.element |
999 | @@ -324,7 +373,7 @@ class Element(treebuilder_base.Node): |
1000 | child = node.element |
1001 | node.parent = self |
1002 | |
1003 | - if not isinstance(child, str) and child.parent is not None: |
1004 | + if not isinstance(child, str) and child is not None and child.parent is not None: |
1005 | node.element.extract() |
1006 | |
1007 | if (string_child is not None and self.element.contents |
1008 | @@ -359,14 +408,13 @@ class Element(treebuilder_base.Node): |
1009 | child, parent=self.element, |
1010 | most_recent_element=most_recent_element) |
1011 | |
1012 | - def getAttributes(self): |
1013 | + def getAttributes(self) -> AttrList: |
1014 | if isinstance(self.element, Comment): |
1015 | return {} |
1016 | return AttrList(self.element) |
1017 | |
1018 | - def setAttributes(self, attributes): |
1019 | + def setAttributes(self, attributes:Optional[Dict]) -> None: |
1020 | if attributes is not None and len(attributes) > 0: |
1021 | - converted_attributes = [] |
1022 | for name, value in list(attributes.items()): |
1023 | if isinstance(name, tuple): |
1024 | new_name = NamespacedAttribute(*name) |
1025 | @@ -386,14 +434,14 @@ class Element(treebuilder_base.Node): |
1026 | self.soup.builder.set_up_substitutions(self.element) |
1027 | attributes = property(getAttributes, setAttributes) |
1028 | |
1029 | - def insertText(self, data, insertBefore=None): |
1030 | + def insertText(self, data:str, insertBefore:Optional['Element']=None) -> None: |
1031 | text = TextNode(self.soup.new_string(data), self.soup) |
1032 | if insertBefore: |
1033 | self.insertBefore(text, insertBefore) |
1034 | else: |
1035 | self.appendChild(text) |
1036 | |
1037 | - def insertBefore(self, node, refNode): |
1038 | + def insertBefore(self, node:'Element', refNode:'Element') -> None: |
1039 | index = self.element.index(refNode.element) |
1040 | if (node.element.__class__ == NavigableString and self.element.contents |
1041 | and self.element.contents[index-1].__class__ == NavigableString): |
1042 | @@ -405,10 +453,10 @@ class Element(treebuilder_base.Node): |
1043 | self.element.insert(index, node.element) |
1044 | node.parent = self |
1045 | |
1046 | - def removeChild(self, node): |
1047 | + def removeChild(self, node:'Element') -> None: |
1048 | node.element.extract() |
1049 | |
1050 | - def reparentChildren(self, new_parent): |
1051 | + def reparentChildren(self, new_parent:'Element') -> None: |
1052 | """Move all of this tag's children into another tag.""" |
1053 | # print("MOVE", self.element.contents) |
1054 | # print("FROM", self.element) |
1055 | @@ -424,6 +472,10 @@ class Element(treebuilder_base.Node): |
1056 | if len(new_parent_element.contents) > 0: |
1057 | # The new parent already contains children. We will be |
1058 | # appending this tag's children to the end. |
1059 | + |
1060 | + # We can make this assertion since we know new_parent has |
1061 | + # children. |
1062 | + assert new_parents_last_descendant is not None |
1063 | new_parents_last_child = new_parent_element.contents[-1] |
1064 | new_parents_last_descendant_next_element = new_parents_last_descendant.next_element |
1065 | else: |
1066 | @@ -474,17 +526,21 @@ class Element(treebuilder_base.Node): |
1067 | # print("FROM", self.element) |
1068 | # print("TO", new_parent_element) |
1069 | |
1070 | - def cloneNode(self): |
1071 | + # TODO-TYPING: typeshed stubs are incorrect about this; |
1072 | + # cloneNode returns a new Node, not None. |
1073 | + def cloneNode(self) -> treebuilder_base.Node: |
1074 | tag = self.soup.new_tag(self.element.name, self.namespace) |
1075 | node = Element(tag, self.soup, self.namespace) |
1076 | for key,value in self.attributes: |
1077 | node.attributes[key] = value |
1078 | return node |
1079 | |
1080 | - def hasContent(self): |
1081 | - return self.element.contents |
1082 | + # TODO-TYPING: typeshed stubs are incorrect about this; |
1083 | + # hasContent returns a boolean, not None. |
1084 | + def hasContent(self) -> bool: |
1085 | + return len(self.element.contents) > 0 |
1086 | |
1087 | - def getNameTuple(self): |
1088 | + def getNameTuple(self) -> Tuple[str, str]: |
1089 | if self.namespace == None: |
1090 | return namespaces["html"], self.name |
1091 | else: |
1092 | @@ -493,10 +549,10 @@ class Element(treebuilder_base.Node): |
1093 | nameTuple = property(getNameTuple) |
1094 | |
1095 | class TextNode(Element): |
1096 | - def __init__(self, element, soup): |
1097 | + def __init__(self, element:PageElement, soup:'BeautifulSoup'): |
1098 | treebuilder_base.Node.__init__(self, None) |
1099 | self.element = element |
1100 | self.soup = soup |
1101 | |
1102 | - def cloneNode(self): |
1103 | - raise NotImplementedError |
1104 | + def cloneNode(self) -> treebuilder_base.Node: |
1105 | + raise NotImplementedError() |
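The Element and TextNode classes in this file adapt Beautiful Soup's Tag objects to the Node interface that html5lib's treebuilder drives, including the detail that adjacent strings are merged into one text node rather than stored separately. A dependency-free sketch of that adapter idea (all names here are illustrative, not part of bs4):

```python
# A minimal sketch of the adapter pattern used by TreeBuilderForHtml5lib:
# html5lib drives a generic Node-style interface, and each adapter wraps
# the real tree object underneath. SketchElement is our name, not bs4's.

class SketchElement:
    """Wraps an underlying element behind a small Node-style interface."""

    def __init__(self, name):
        self.name = name
        self.contents = []  # children: SketchElement objects or plain strings

    def appendChild(self, node):
        # Mirror the string-merging behavior: if both the last child and
        # the incoming node are plain strings, merge them into a single
        # text node instead of appending a second one.
        if (self.contents and isinstance(self.contents[-1], str)
                and isinstance(node, str)):
            self.contents[-1] = self.contents[-1] + node
        else:
            self.contents.append(node)

    def hasContent(self):
        # Like the typed hasContent() above, return a real boolean.
        return len(self.contents) > 0


root = SketchElement("p")
root.appendChild("Hello, ")
root.appendChild("world")            # merged with the previous string
root.appendChild(SketchElement("b"))
print(root.contents[0])              # the two strings arrive as one node
```

The merge step is the same reason insertBefore() in the real code checks whether the preceding sibling is a NavigableString before inserting.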
1106 | diff --git a/bs4/builder/_htmlparser.py b/bs4/builder/_htmlparser.py |
1107 | index 291f6c6..91cecf7 100644 |
1108 | --- a/bs4/builder/_htmlparser.py |
1109 | +++ b/bs4/builder/_htmlparser.py |
1110 | @@ -188,7 +188,7 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML): |
1111 | # later on. If so, we want to ignore it. |
1112 | self.already_closed_empty_element.append(name) |
1113 | |
1114 | - if self._root_tag is None: |
1115 | + if self._root_tag_name is None: |
1116 | self._root_tag_encountered(name) |
1117 | |
1118 | def handle_endtag(self, name:str, check_already_closed:bool=True) -> None: |
1119 | @@ -422,13 +422,23 @@ class HTMLParserTreeBuilder(HTMLTreeBuilder): |
1120 | dammit.declared_html_encoding, |
1121 | dammit.contains_replacement_characters) |
1122 | |
1123 | - def feed(self, markup:str): |
1124 | + def feed(self, markup:_RawMarkup) -> None: |
1125 | args, kwargs = self.parser_args |
1126 | + |
1127 | + # HTMLParser.feed will only handle str, but |
1128 | + # BeautifulSoup.markup is allowed to be _RawMarkup, because |
1129 | + # it's set by the yield value of |
1130 | + # TreeBuilder.prepare_markup. Fortunately, |
1131 | + # HTMLParserTreeBuilder.prepare_markup always yields a str |
1132 | + # (UnicodeDammit.unicode_markup). |
1133 | + assert isinstance(markup, str) |
1134 | + |
1135 | # We know BeautifulSoup calls TreeBuilder.initialize_soup |
1136 | # before calling feed(), so we can assume self.soup |
1137 | # is set. |
1138 | assert self.soup is not None |
1139 | parser = BeautifulSoupHTMLParser(self.soup, *args, **kwargs) |
1140 | + |
1141 | try: |
1142 | parser.feed(markup) |
1143 | parser.close() |
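The assert added to HTMLParserTreeBuilder.feed() is a standard type-narrowing idiom: a checker such as mypy treats code after `assert isinstance(x, str)` as if `x` were declared str. A minimal stdlib illustration (the function name and signature are ours, not bs4's):

```python
from typing import Union

def feed_sketch(markup: Union[str, bytes]) -> int:
    # Callers in this sketch always pass str; the assert documents that
    # invariant at runtime and lets a static checker narrow the type
    # from Union[str, bytes] to str for the rest of the function.
    assert isinstance(markup, str)
    return len(markup.splitlines())

lines = feed_sketch("first\nsecond")
print(lines)  # → 2
```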
1144 | diff --git a/bs4/builder/_lxml.py b/bs4/builder/_lxml.py |
1145 | index ba87e87..3dfe88a 100644 |
1146 | --- a/bs4/builder/_lxml.py |
1147 | +++ b/bs4/builder/_lxml.py |
1148 | @@ -269,7 +269,7 @@ class LXMLTreeBuilderForXML(TreeBuilder): |
1149 | for encoding in detector.encodings: |
1150 | yield (detector.markup, encoding, document_declared_encoding, False) |
1151 | |
1152 | - def feed(self, markup:Union[bytes,str]) -> None: |
1153 | + def feed(self, markup:_RawMarkup) -> None: |
1154 | io: IO |
1155 | if isinstance(markup, bytes): |
1156 | io = BytesIO(markup) |
1157 | diff --git a/bs4/diagnose.py b/bs4/diagnose.py |
1158 | index 201b879..c2202ad 100644 |
1159 | --- a/bs4/diagnose.py |
1160 | +++ b/bs4/diagnose.py |
1161 | @@ -9,7 +9,15 @@ from html.parser import HTMLParser |
1162 | import bs4 |
1163 | from bs4 import BeautifulSoup, __version__ |
1164 | from bs4.builder import builder_registry |
1165 | -from typing import TYPE_CHECKING |
1166 | +from typing import ( |
1167 | + Any, |
1168 | + IO, |
1169 | + List, |
1170 | + Optional, |
1171 | + Tuple, |
1172 | + TYPE_CHECKING, |
1173 | +) |
1174 | + |
1175 | if TYPE_CHECKING: |
1176 | from bs4._typing import _IncomingMarkup |
1177 | |
1178 | @@ -78,7 +86,7 @@ def diagnose(data:_IncomingMarkup) -> None: |
1179 | |
1180 | print(("-" * 80)) |
1181 | |
1182 | -def lxml_trace(data, html:bool=True, **kwargs) -> None: |
1183 | +def lxml_trace(data:_IncomingMarkup, html:bool=True, **kwargs:Any) -> None: |
1184 | """Print out the lxml events that occur during parsing. |
1185 | |
1186 | This lets you see how lxml parses a document when no Beautiful |
1187 | @@ -94,7 +102,8 @@ def lxml_trace(data, html:bool=True, **kwargs) -> None: |
1188 | recover = kwargs.pop('recover', True) |
1189 | if isinstance(data, str): |
1190 | data = data.encode("utf8") |
1191 | - reader = BytesIO(data) |
1192 | + # bytes (produced above if data was a str) get wrapped; |
1193 | + reader = BytesIO(data) if isinstance(data, bytes) else data |
1194 | for event, element in etree.iterparse( |
1195 | reader, html=html, recover=recover, **kwargs |
1196 | ): |
1197 | @@ -108,37 +117,40 @@ class AnnouncingParser(HTMLParser): |
1198 | document. The easiest way to do this is to call `htmlparser_trace`. |
1199 | """ |
1200 | |
1201 | - def _p(self, s): |
1202 | + def _p(self, s:str) -> None: |
1203 | print(s) |
1204 | |
1205 | - def handle_starttag(self, name, attrs): |
1206 | + def handle_starttag( |
1207 | + self, name:str, attrs:List[Tuple[str, Optional[str]]], |
1208 | + handle_empty_element:bool=True |
1209 | + ) -> None: |
1210 | self._p(f"{name} {attrs} START") |
1211 | |
1212 | - def handle_endtag(self, name): |
1213 | + def handle_endtag(self, name:str, check_already_closed:bool=True) -> None: |
1214 | self._p("%s END" % name) |
1215 | |
1216 | - def handle_data(self, data): |
1217 | + def handle_data(self, data:str) -> None: |
1218 | self._p("%s DATA" % data) |
1219 | |
1220 | - def handle_charref(self, name): |
1221 | + def handle_charref(self, name:str) -> None: |
1222 | self._p("%s CHARREF" % name) |
1223 | |
1224 | - def handle_entityref(self, name): |
1225 | + def handle_entityref(self, name:str) -> None: |
1226 | self._p("%s ENTITYREF" % name) |
1227 | |
1228 | - def handle_comment(self, data): |
1229 | + def handle_comment(self, data:str) -> None: |
1230 | self._p("%s COMMENT" % data) |
1231 | |
1232 | - def handle_decl(self, data): |
1233 | + def handle_decl(self, data:str) -> None: |
1234 | self._p("%s DECL" % data) |
1235 | |
1236 | - def unknown_decl(self, data): |
1237 | + def unknown_decl(self, data:str) -> None: |
1238 | self._p("%s UNKNOWN-DECL" % data) |
1239 | |
1240 | - def handle_pi(self, data): |
1241 | + def handle_pi(self, data:str) -> None: |
1242 | self._p("%s PI" % data) |
1243 | |
1244 | -def htmlparser_trace(data): |
1245 | +def htmlparser_trace(data:str) -> None: |
1246 | """Print out the HTMLParser events that occur during parsing. |
1247 | |
1248 | This lets you see how HTMLParser parses a document when no |
1249 | @@ -226,7 +238,7 @@ def benchmark_parsers(num_elements:int=100000) -> None: |
1250 | b = time.time() |
1251 | print(("Raw html5lib parsed the markup in %.2fs." % (b-a))) |
1252 | |
1253 | -def profile(num_elements:int=100000, parser:str="lxml"): |
1254 | +def profile(num_elements:int=100000, parser:str="lxml") -> None: |
1255 | """Use Python's profiler on a randomly generated document.""" |
1256 | filehandle = tempfile.NamedTemporaryFile() |
1257 | filename = filehandle.name |
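AnnouncingParser above simply prints each html.parser event as it fires. This stdlib-only variant records the events in a list instead, showing the same tracing idea in an assertable form; the class name is ours, not part of bs4.diagnose:

```python
from html.parser import HTMLParser

class EventCollectingParser(HTMLParser):
    """Collects parser events rather than printing them."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, name, attrs):
        self.events.append(("START", name, dict(attrs)))

    def handle_endtag(self, name):
        self.events.append(("END", name))

    def handle_data(self, data):
        self.events.append(("DATA", data))

parser = EventCollectingParser()
parser.feed('<p class="x">hi</p>')
parser.close()
print(parser.events)
# → [('START', 'p', {'class': 'x'}), ('DATA', 'hi'), ('END', 'p')]
```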
1258 | diff --git a/bs4/element.py b/bs4/element.py |
1259 | index 83f4882..f4ab89c 100644 |
1260 | --- a/bs4/element.py |
1261 | +++ b/bs4/element.py |
1262 | @@ -44,6 +44,7 @@ if TYPE_CHECKING: |
1263 | from bs4 import BeautifulSoup |
1264 | from bs4.builder import TreeBuilder |
1265 | from bs4.dammit import _Encoding |
1266 | + from bs4.filter import ElementFilter |
1267 | from bs4.formatter import ( |
1268 | _EntitySubstitutionFunction, |
1269 | _FormatterOrName, |
1270 | @@ -901,7 +902,7 @@ class PageElement(object): |
1271 | limit:Optional[int], |
1272 | generator:Iterator[PageElement], |
1273 | _stacklevel:int=3, |
1274 | - **kwargs:_StrainableAttribute) -> ResultSet[PageElement]: |
1275 | + **kwargs:_StrainableAttribute) -> ResultSet[PageElement]: |
1276 | """Iterates over a generator looking for things that match.""" |
1277 | results: ResultSet[PageElement] |
1278 | |
1279 | @@ -912,11 +913,11 @@ class PageElement(object): |
1280 | DeprecationWarning, stacklevel=_stacklevel |
1281 | ) |
1282 | |
1283 | - from bs4.strainer import SoupStrainer |
1284 | - if isinstance(name, SoupStrainer): |
1285 | - strainer = name |
1286 | + from bs4.filter import ElementFilter |
1287 | + if isinstance(name, ElementFilter): |
1288 | + matcher = name |
1289 | else: |
1290 | - strainer = SoupStrainer(name, attrs, string, **kwargs) |
1291 | + matcher = SoupStrainer(name, attrs, string, **kwargs) |
1292 | |
1293 | result: Iterable[PageElement] |
1294 | if string is None and not limit and not attrs and not kwargs: |
1295 | @@ -924,7 +925,7 @@ class PageElement(object): |
1296 | # Optimization to find all tags. |
1297 | result = (element for element in generator |
1298 | if isinstance(element, Tag)) |
1299 | - return ResultSet(strainer, result) |
1300 | + return ResultSet(matcher, result) |
1301 | elif isinstance(name, str): |
1302 | # Optimization to find all tags with a given name. |
1303 | if name.count(':') == 1: |
1304 | @@ -945,22 +946,25 @@ class PageElement(object): |
1305 | ) |
1306 | ): |
1307 | result.append(element) |
1308 | - return ResultSet(strainer, result) |
1309 | + return ResultSet(matcher, result) |
1310 | + return self.match(generator, matcher, limit) |
1311 | + |
1312 | + def match(self, generator:Iterator[PageElement], matcher:ElementFilter, limit:Optional[int]=None) -> ResultSet[PageElement]: |
1313 | + """The most generic search method offered by Beautiful Soup. |
1314 | |
1315 | - results = ResultSet(strainer) |
1316 | + You can pass in your own technique for iterating over the tree, and your own |
1317 | + technique for matching items. |
1318 | + """ |
1319 | + results:ResultSet = ResultSet(matcher) |
1320 | while True: |
1321 | try: |
1322 | i = next(generator) |
1323 | except StopIteration: |
1324 | break |
1325 | if i: |
1326 | - # TODO: SoupStrainer.search is a confusing method |
1327 | - # that needs to be redone, and this is where |
1328 | - # it's being used. |
1329 | - found = strainer.search(i) |
1330 | - if found: |
1331 | - results.append(found) |
1332 | - if limit and len(results) >= limit: |
1333 | + if matcher.match(i): |
1334 | + results.append(i) |
1335 | + if limit is not None and len(results) >= limit: |
1336 | break |
1337 | return results |
1338 | |
1339 | @@ -1254,7 +1258,7 @@ class Declaration(PreformattedString): |
1340 | class Doctype(PreformattedString): |
1341 | """A `document type declaration <https://www.w3.org/TR/REC-xml/#dt-doctype>`_.""" |
1342 | @classmethod |
1343 | - def for_name_and_ids(cls, name:str, pub_id:str, system_id:str) -> Doctype: |
1344 | + def for_name_and_ids(cls, name:str, pub_id:Optional[str], system_id:Optional[str]) -> Doctype: |
1345 | """Generate an appropriate document type declaration for a given |
1346 | public ID and system ID. |
1347 | |
1348 | @@ -2503,12 +2507,12 @@ class Tag(PageElement): |
1349 | _PageElementT = TypeVar("_PageElementT", bound=PageElement) |
1350 | class ResultSet(List[_PageElementT], Generic[_PageElementT]): |
1351 | """A ResultSet is a list of `PageElement` objects, gathered as the result |
1352 | - of matching a `SoupStrainer` against a parse tree. Basically, a list of |
1353 | + of matching an `ElementFilter` against a parse tree. Basically, a list of |
1354 | search results. |
1355 | """ |
1356 | - source: Optional[SoupStrainer] |
1357 | + source: Optional[ElementFilter] |
1358 | |
1359 | - def __init__(self, source:Optional[SoupStrainer], result: Iterable[_PageElementT]=()) -> None: |
1360 | + def __init__(self, source:Optional[ElementFilter], result: Iterable[_PageElementT]=()) -> None: |
1361 | super(ResultSet, self).__init__(result) |
1362 | self.source = source |
1363 | |
1364 | @@ -2522,4 +2526,4 @@ class ResultSet(List[_PageElementT], Generic[_PageElementT]): |
1365 | # import SoupStrainer itself into this module to preserve the |
1366 | # backwards compatibility of anyone who imports |
1367 | # bs4.element.SoupStrainer. |
1368 | -from bs4.strainer import SoupStrainer |
1369 | +from bs4.filter import SoupStrainer |
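The new PageElement.match() boils down to a small control loop: pull items from any iterator, keep the ones the matcher accepts, and stop early at an optional limit. A dependency-free sketch of that loop (match_sketch is an illustrative name, not a bs4 API):

```python
def match_sketch(generator, match, limit=None):
    """Collect items the match callable accepts, up to limit.

    A sketch of the control flow in PageElement.match(); the real
    method also skips falsy items before applying the matcher.
    """
    results = []
    for item in generator:
        if match(item):
            results.append(item)
            if limit is not None and len(results) >= limit:
                break
    return results

odds = match_sketch(iter(range(20)), lambda n: n % 2 == 1, limit=3)
print(odds)  # → [1, 3, 5]
```

Note the `limit is not None` check, which matches the corrected comparison in the diff: a limit of 0 no longer means "no limit".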
1370 | diff --git a/bs4/strainer.py b/bs4/filter.py |
1371 | similarity index 60% |
1372 | rename from bs4/strainer.py |
1373 | rename to bs4/filter.py |
1374 | index 15b289c..74e26d9 100644 |
1375 | --- a/bs4/strainer.py |
1376 | +++ b/bs4/filter.py |
1377 | @@ -25,6 +25,10 @@ from bs4._deprecation import _deprecated |
1378 | from bs4.element import NavigableString, PageElement, Tag |
1379 | from bs4._typing import ( |
1380 | _AttributeValue, |
1381 | + _AttributeValues, |
1382 | + _AllowStringCreationFunction, |
1383 | + _AllowTagCreationFunction, |
1384 | + _PageElementMatchFunction, |
1385 | _TagMatchFunction, |
1386 | _StringMatchFunction, |
1387 | _StrainableElement, |
1388 | @@ -33,13 +37,96 @@ from bs4._typing import ( |
1389 | _StrainableString, |
1390 | ) |
1391 | |
1392 | + |
1393 | +class ElementFilter(object): |
1394 | + """ElementFilters encapsulate the logic necessary to decide: |
1395 | + |
1396 | + 1. whether a PageElement (a tag or a string) matches a |
1397 | + user-specified query. |
1398 | + |
1399 | + 2. whether a given sequence of markup found during initial parsing |
1400 | + should be turned into a PageElement, or simply discarded. |
1401 | + |
1402 | + The base class is the simplest ElementFilter. By default, it |
1403 | + matches everything and allows all PageElements to be created. You |
1404 | + can make it more selective by passing in user-defined functions. |
1405 | + |
1406 | + Most users of Beautiful Soup will never need to use |
1407 | + ElementFilter, or its more capable subclass |
1408 | + SoupStrainer. Instead, they will use the find_* methods, which |
1409 | + will convert their arguments into SoupStrainer objects and run them |
1410 | + against the tree. |
1411 | + """ |
1412 | + match_function: Optional[_PageElementMatchFunction] |
1413 | + allow_tag_creation_function: Optional[_AllowTagCreationFunction] |
1414 | + allow_string_creation_function: Optional[_AllowStringCreationFunction] |
1415 | + |
1416 | + def __init__( |
1417 | + self, match_function:Optional[_PageElementMatchFunction]=None, |
1418 | + allow_tag_creation_function:Optional[_AllowTagCreationFunction]=None, |
1419 | + allow_string_creation_function:Optional[_AllowStringCreationFunction]=None): |
1420 | + self.match_function = match_function |
1421 | + self.allow_tag_creation_function = allow_tag_creation_function |
1422 | + self.allow_string_creation_function = allow_string_creation_function |
1423 | + |
1424 | + @property |
1425 | + def excludes_everything(self) -> bool: |
1426 | + """Does this ElementFilter obviously exclude everything? If |
1427 | + so, Beautiful Soup will issue a warning if you try to use it |
1428 | + when parsing a document. |
1429 | + |
1430 | + The ElementFilter might turn out to exclude everything even |
1431 | + if this returns False, but it won't do so in an obvious way. |
1432 | + |
1433 | + The default ElementFilter excludes *nothing*, and we don't |
1434 | + have any way of answering questions about more complex |
1435 | + ElementFilters without running their hook functions, so the |
1436 | + base implementation always returns False. |
1437 | + """ |
1438 | + return False |
1439 | + |
1440 | + def match(self, element:PageElement) -> bool: |
1441 | + """Does the given PageElement match the rules set down by this |
1442 | + ElementFilter? |
1443 | + |
1444 | + The base implementation delegates to the function passed in to |
1445 | + the constructor. |
1446 | + """ |
1447 | + if not self.match_function: |
1448 | + return True |
1449 | + return self.match_function(element) |
1450 | + |
1451 | + def allow_tag_creation( |
1452 | + self, nsprefix:Optional[str], name:str, |
1453 | + attrs:Optional[_AttributeValues] |
1454 | + ) -> bool: |
1455 | + """Based on the name and attributes of a tag, see whether this |
1456 | + ElementFilter will allow a Tag object to even be created. |
1457 | + |
1458 | + :param nsprefix: The namespace prefix of the prospective tag. |
1459 | + :param name: The name of the prospective tag. |
1460 | + :param attrs: The attributes of the prospective tag. |
1461 | + if not self.allow_tag_creation_function: |
1462 | + return True |
1463 | + return self.allow_tag_creation_function(nsprefix, name, attrs) |
1464 | + |
1465 | + def allow_string_creation(self, string:str) -> bool: |
1466 | + if not self.allow_string_creation_function: |
1467 | + return True |
1468 | + return self.allow_string_creation_function(string) |
1469 | + |
1470 | + |
1471 | class MatchRule(object): |
1472 | + """Each MatchRule encapsulates the logic behind a single argument |
1473 | + passed in to one of the Beautiful Soup find* methods. |
1474 | + """ |
1475 | + |
1476 | string: Optional[str] |
1477 | pattern: Optional[Pattern[str]] |
1478 | present: Optional[bool] |
1479 | - |
1480 | - # All MatchRule objects also have an attribute ``function``, but |
1481 | - # the type of the function depends on the subclass. |
1482 | + # TODO-TYPING: All MatchRule objects also have an attribute |
1483 | + # ``function``, but the type of the function depends on the |
1484 | + # subclass. |
1485 | |
1486 | def __init__( |
1487 | self, |
1488 | @@ -72,7 +159,7 @@ class MatchRule(object): |
1489 | "At most one of string, pattern, function and present must be provided." |
1490 | ) |
1491 | |
1492 | - def _base_match(self, string:str) -> Optional[bool]: |
1493 | + def _base_match(self, string:Optional[str]) -> Optional[bool]: |
1494 | """Run the 'cheap' portion of a match, trying to get an answer without |
1495 | calling a potentially expensive custom function. |
1496 | |
1497 | @@ -101,7 +188,7 @@ class MatchRule(object): |
1498 | |
1499 | return None |
1500 | |
1501 | - def matches_string(self, string:str) -> bool: |
1502 | + def matches_string(self, string:Optional[str]) -> bool: |
1503 | _base_result = self._base_match(string) |
1504 | if _base_result is not None: |
1505 | # No need to invoke the test function. |
1506 | @@ -125,6 +212,7 @@ class MatchRule(object): |
1507 | ) |
1508 | |
1509 | class TagNameMatchRule(MatchRule): |
1510 | + """A MatchRule implementing the rules for matches against tag name.""" |
1511 | function: Optional[_TagMatchFunction] |
1512 | |
1513 | def matches_tag(self, tag:Tag) -> bool: |
1514 | @@ -140,19 +228,25 @@ class TagNameMatchRule(MatchRule): |
1515 | return False |
1516 | |
1517 | class AttributeValueMatchRule(MatchRule): |
1518 | + """A MatchRule implementing the rules for matches against attribute value.""" |
1519 | function: Optional[_StringMatchFunction] |
1520 | |
1521 | class StringMatchRule(MatchRule): |
1522 | + """A MatchRule implementing the rules for matches against a NavigableString.""" |
1523 | function: Optional[_StringMatchFunction] |
1524 | |
1525 | -class SoupStrainer(object): |
1526 | - """Encapsulates a number of ways of matching a markup element (a tag |
1527 | - or a string). |
1528 | +class SoupStrainer(ElementFilter): |
1529 | + """The ElementFilter subclass used internally by Beautiful Soup. |
1530 | |
1531 | - These are primarily created internally and used to underpin the |
1532 | - find_* methods, but you can create one yourself and pass it in as |
1533 | - ``parse_only`` to the `BeautifulSoup` constructor, to parse a |
1534 | - subset of a large document. |
1535 | + A SoupStrainer encapsulates the logic necessary to perform the |
1536 | + kind of matches supported by the find_* methods. SoupStrainers are |
1537 | + primarily created internally, but you can create one yourself and |
1538 | + pass it in as ``parse_only`` to the `BeautifulSoup` constructor, |
1539 | + to parse a subset of a large document. |
1540 | + |
1541 | + Internally, SoupStrainer objects work by converting the |
1542 | + constructor arguments into MatchRule objects. Incoming |
1543 | + tags/markup are matched against those rules. |
1544 | |
1545 | :param name: One or more restrictions on the tags found in a |
1546 | document. |
1547 | @@ -226,6 +320,17 @@ class SoupStrainer(object): |
1548 | self.__string = string |
1549 | |
1550 | @property |
1551 | + def excludes_everything(self) -> bool: |
1552 | + """Check whether the provided rules will obviously exclude |
1553 | + everything. (They might exclude everything even if this returns False, |
1554 | + but not in an obvious way.) |
1555 | + """ |
1556 | + return True if ( |
1557 | + self.string_rules and |
1558 | + (self.name_rules or self.attribute_rules) |
1559 | + ) else False |
1560 | + |
1561 | + @property |
1562 | def string(self) -> Optional[_StrainableString]: |
1563 | ":meta private:" |
1564 | warnings.warn(f"Access to deprecated property string. (Look at .string_rules instead) -- Deprecated since version 4.13.0.", DeprecationWarning, stacklevel=2) |
1565 | @@ -262,6 +367,15 @@ class SoupStrainer(object): |
1566 | yield rule_class(function=obj) |
1567 | elif isinstance(obj, Pattern): |
1568 | yield rule_class(pattern=obj) |
1569 | + elif hasattr(obj, 'search'): |
1570 | + # We do a little duck typing here to detect usage of the |
1571 | + # third-party regex library, whose pattern objects doesn't |
1572 | + # derive from re.Pattern. |
1573 | + # |
1574 | + # TODO-TYPING: Once we drop support for Python 3.7, we |
1575 | + # might be able to address this by defining an appropriate |
1576 | + # Protocol. |
1577 | + yield rule_class(pattern=obj) |
1578 | elif hasattr(obj, '__iter__'): |
1579 | for o in obj: |
1580 | if not isinstance(o, (bytes, str)) and hasattr(o, '__iter__'): |
1581 | @@ -358,7 +472,7 @@ class SoupStrainer(object): |
1582 | else: |
1583 | attr_values = [cast(str, attr_value)] |
1584 | |
1585 | - def _match_attribute_value_helper(attr_values:Sequence[Optional[str]]): |
1586 | + def _match_attribute_value_helper(attr_values:Sequence[Optional[str]]) -> bool: |
1587 | for rule in rules: |
1588 | for attr_value in attr_values: |
1589 | if rule.matches_string(attr_value): |
1590 | @@ -382,8 +496,8 @@ class SoupStrainer(object): |
1591 | [joined_attr_value] |
1592 | ) |
1593 | return this_attr_match |
1594 | - |
1595 | - def allow_tag_creation(self, nsprefix:Optional[str], name:str, attrs:Optional[dict[str, str]]) -> bool: |
1596 | + |
1597 | + def allow_tag_creation(self, nsprefix:Optional[str], name:str, attrs:Optional[_AttributeValues]) -> bool: |
1598 | """Based on the name and attributes of a tag, see whether this |
1599 | SoupStrainer will allow a Tag object to even be created. |
1600 | |
1601 | @@ -423,17 +537,25 @@ class SoupStrainer(object): |
1602 | return True |
1603 | |
1604 | def allow_string_creation(self, string:str) -> bool: |
1605 | + """Based on the content of a markup string, see whether this |
1606 | + SoupStrainer will allow it to be instantiated as a |
1607 | + NavigableString object, or whether it should be ignored. |
1608 | + """ |
1609 | if self.name_rules or self.attribute_rules: |
1610 | # A SoupStrainer that has name or attribute rules won't |
1611 | # match any strings; it's designed to match tags with |
1612 | # certain properties. |
1613 | return False |
1614 | + if not self.string_rules: |
1615 | + # A SoupStrainer with no string rules will match |
1616 | + # all strings. |
1617 | + return True |
1618 | if not self.matches_any_string_rule(string): |
1619 | return False |
1620 | return True |
1621 | |
1622 | def matches_any_string_rule(self, string:str) -> bool: |
1623 | - """See whether the content of a string, matches any of |
1624 | + """See whether the content of a string matches any of |
1625 | this SoupStrainer's string rules. |
1626 | """ |
1627 | if not self.string_rules: |
1628 | @@ -442,28 +564,37 @@ class SoupStrainer(object): |
1629 | if string_rule.matches_string(string): |
1630 | return True |
1631 | return False |
1632 | - |
1633 | - |
1634 | + |
1635 | + def match(self, element:PageElement) -> bool: |
1636 | + """Does the given PageElement match the rules set down by this |
1637 | + SoupStrainer? |
1638 | + |
1639 | + The find_* methods rely heavily on this method to find matches. |
1640 | + |
1641 | + :param element: A PageElement. |
1642 | + :return: True if the element matches this SoupStrainer's rules; False otherwise. |
1643 | + """ |
1644 | + if isinstance(element, Tag): |
1645 | + return self.matches_tag(element) |
1646 | + assert isinstance(element, NavigableString) |
1647 | + if not (self.name_rules or self.attribute_rules): |
1648 | + # A NavigableString can only match a SoupStrainer that |
1649 | + # does not define any name or attribute restrictions. |
1650 | + for rule in self.string_rules: |
1651 | + if rule.matches_string(element): |
1652 | + return True |
1653 | + return False |
1654 | + |
1655 | @_deprecated("allow_tag_creation", "4.13.0") |
1656 | - def search_tag(self, name, attrs): |
1657 | + def search_tag(self, name:str, attrs:Optional[_AttributeValues]) -> bool: |
1658 | + """A less elegant version of allow_tag_creation().""" |
1659 | ":meta private:" |
1660 | return self.allow_tag_creation(None, name, attrs) |
1661 | |
1662 | - def search(self, element:PageElement): |
1663 | - # TODO: This method needs to be removed or redone. It is |
1664 | - # very confusing but it's used everywhere. |
1665 | - match = None |
1666 | - if isinstance(element, Tag): |
1667 | - match = self.matches_tag(element) |
1668 | - else: |
1669 | - assert isinstance(element, NavigableString) |
1670 | - match = False |
1671 | - if not (self.name_rules or self.attribute_rules): |
1672 | - # A NavigableString can only match a SoupStrainer that |
1673 | - # does not define any name or attribute restrictions. |
1674 | - for rule in self.string_rules: |
1675 | - if rule.matches_string(element): |
1676 | - match = True |
1677 | - break |
1678 | - return element if match else False |
1679 | + @_deprecated("match", "4.13.0") |
1680 | + def search(self, element:PageElement) -> Optional[PageElement]: |
1681 | + """A less elegant version of match(). |
1682 | |
1683 | + :meta private: |
1684 | + """ |
1685 | + return element if self.match(element) else None |
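The MatchRule changes above split each check into a cheap `_base_match` pass (string, pattern, or presence) and a fallback call to a possibly expensive custom function. Here is a self-contained sketch of that precedence; it is illustrative only, with names and the unconfigured-rule default being assumptions rather than the bs4 source:

```python
import re
from typing import Callable, Optional, Pattern, Union

class MiniMatchRule:
    """Illustrative sketch of the MatchRule precedence in bs4/filter.py.

    At most one of string, pattern, function, or present may be set.
    """
    def __init__(
        self,
        string: Optional[str] = None,
        pattern: Union[str, Pattern[str], None] = None,
        function: Optional[Callable[[Optional[str]], bool]] = None,
        present: Optional[bool] = None,
    ):
        self.string = string
        # Accept either a precompiled pattern or a pattern source string.
        self.pattern = re.compile(pattern) if isinstance(pattern, str) else pattern
        self.function = function
        self.present = present
        if sum(x is not None for x in (string, pattern, function, present)) > 1:
            raise ValueError(
                "At most one of string, pattern, function and present must be provided."
            )

    def _base_match(self, s: Optional[str]) -> Optional[bool]:
        # The 'cheap' portion: try to answer without calling the function.
        if self.present is not None:
            return (s is not None) == self.present
        if self.string is not None:
            return s == self.string
        if self.pattern is not None:
            return s is not None and self.pattern.search(s) is not None
        return None  # No cheap answer; defer to the function, if any.

    def matches_string(self, s: Optional[str]) -> bool:
        base = self._base_match(s)
        if base is not None:
            return base
        if self.function is not None:
            return bool(self.function(s))
        # Assumption for this sketch: a rule with no criteria matches everything.
        return True
```

The two-phase structure is the point: `matches_string` only invokes the user's function when none of the cheap criteria can decide the match.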
1686 | diff --git a/bs4/tests/__init__.py b/bs4/tests/__init__.py |
1687 | index 2ef7fd8..3ef999d 100644 |
1688 | --- a/bs4/tests/__init__.py |
1689 | +++ b/bs4/tests/__init__.py |
1690 | @@ -20,7 +20,7 @@ from bs4.element import ( |
1691 | Stylesheet, |
1692 | Tag |
1693 | ) |
1694 | -from bs4.strainer import SoupStrainer |
1695 | +from bs4.filter import SoupStrainer |
1696 | from bs4.builder import ( |
1697 | DetectsXMLParsedAsHTML, |
1698 | XMLParsedAsHTMLWarning, |
1699 | diff --git a/bs4/tests/test_strainer.py b/bs4/tests/test_filter.py |
1700 | similarity index 56% |
1701 | rename from bs4/tests/test_strainer.py |
1702 | rename to bs4/tests/test_filter.py |
1703 | index 4de03f0..8d5da70 100644 |
1704 | --- a/bs4/tests/test_strainer.py |
1705 | +++ b/bs4/tests/test_filter.py |
1706 | @@ -6,20 +6,108 @@ from . import ( |
1707 | SoupTest, |
1708 | ) |
1709 | from bs4.element import Tag |
1710 | -from bs4.strainer import ( |
1711 | +from bs4.filter import ( |
1712 | AttributeValueMatchRule, |
1713 | + ElementFilter, |
1714 | MatchRule, |
1715 | SoupStrainer, |
1716 | StringMatchRule, |
1717 | TagNameMatchRule, |
1718 | ) |
1719 | |
1720 | -class TestMatchrule(SoupTest): |
1721 | +class TestElementFilter(SoupTest): |
1722 | + |
1723 | + def test_default_behavior(self): |
1724 | + # An unconfigured ElementFilter matches absolutely everything. |
1725 | + selector = ElementFilter() |
1726 | + assert not selector.excludes_everything |
1727 | + soup = self.soup("<a>text</a>") |
1728 | + tag = soup.a |
1729 | + string = tag.string |
1730 | + assert True == selector.match(soup) |
1731 | + assert True == selector.match(tag) |
1732 | + assert True == selector.match(string) |
1733 | + assert soup.find(selector).name == "a" |
1734 | + |
1735 | + # And allows any incoming markup to be turned into PageElements. |
1736 | + assert True == selector.allow_tag_creation(None, "tag", None) |
1737 | + assert True == selector.allow_string_creation("some string") |
1738 | + |
1739 | + def test_match(self): |
1740 | + def m(pe): |
1741 | + return (pe.string == "allow" or ( |
1742 | + isinstance(pe, Tag) and pe.name=="allow")) |
1743 | + |
1744 | + soup = self.soup("<allow>deny</allow>allow<deny>deny</deny>") |
1745 | + allow_tag = soup.allow |
1746 | + allow_string = soup.find(string="allow") |
1747 | + deny_tag = soup.deny |
1748 | + deny_string = soup.find(string="deny") |
1749 | + |
1750 | + selector = ElementFilter(match_function=m) |
1751 | + assert True == selector.match(allow_tag) |
1752 | + assert True == selector.match(allow_string) |
1753 | + assert False == selector.match(deny_tag) |
1754 | + assert False == selector.match(deny_string) |
1755 | + |
1756 | + # Since only the match function was provided, there is |
1757 | + # no effect on tag or string creation. |
1758 | + soup = self.soup("<a>text</a>", parse_only=selector) |
1759 | + assert "text" == soup.a.string |
1760 | + |
1761 | + def test_allow_tag_creation(self): |
1762 | + def m(nsprefix, name, attrs): |
1763 | + return nsprefix=="allow" or name=="allow" or "allow" in attrs |
1764 | + selector = ElementFilter(allow_tag_creation_function=m) |
1765 | + f = selector.allow_tag_creation |
1766 | + assert True == f("allow", "ignore", {}) |
1767 | + assert True == f("ignore", "allow", {}) |
1768 | + assert True == f(None, "ignore", {"allow": "1"}) |
1769 | + assert False == f("no", "no", {"no" : "nope"}) |
1770 | + |
1771 | + # Test the ElementFilter as a value for parse_only. |
1772 | + soup = self.soup( |
1773 | + "<deny>deny</deny> <allow>deny</allow> allow", |
1774 | + parse_only=selector |
1775 | + ) |
1776 | |
1777 | - def _tuple(self, rule): |
1778 | - if isinstance(rule.pattern, str): |
1779 | - import pdb; pdb.set_trace() |
1780 | + # The <deny> tag was filtered out, but there was no effect on |
1781 | + # the strings, since only allow_tag_creation_function was |
1782 | + # defined. |
1783 | + assert 'deny <allow>deny</allow> allow' == soup.decode() |
1784 | + |
1785 | + # Similarly, since match_function was not defined, this |
1786 | + # ElementFilter matches everything. |
1787 | + assert soup.find(selector) == "deny" |
1788 | + |
1789 | + def test_allow_string_creation(self): |
1790 | + def m(s): |
1791 | + return s=="allow" |
1792 | + selector = ElementFilter(allow_string_creation_function=m) |
1793 | + f = selector.allow_string_creation |
1794 | + assert True == f("allow") |
1795 | + assert False == f("deny") |
1796 | + assert False == f("please allow") |
1797 | + |
1798 | + # Test the ElementFilter as a value for parse_only. |
1799 | + soup = self.soup( |
1800 | + "<deny>deny</deny> <allow>deny</allow> allow", |
1801 | + parse_only=selector |
1802 | + ) |
1803 | + |
1804 | + # All incoming strings other than "allow" (even whitespace) |
1805 | + # were filtered out, but there was no effect on the tags, |
1806 | + # since only allow_string_creation_function was defined. |
1807 | + assert '<deny>deny</deny><allow>deny</allow>' == soup.decode() |
1808 | + |
1809 | + # Similarly, since match_function was not defined, this |
1810 | + # ElementFilter matches everything. |
1811 | + assert soup.find(selector).name == "deny" |
1812 | |
1813 | + |
1814 | +class TestMatchRule(SoupTest): |
1815 | + |
1816 | + def _tuple(self, rule): |
1817 | return ( |
1818 | rule.string, |
1819 | rule.pattern.pattern if rule.pattern else None, |
1820 | @@ -155,6 +243,28 @@ class TestSoupStrainer(SoupTest): |
1821 | assert w2.filename == __file__ |
1822 | assert msg == "Access to deprecated property text. (Look at .string_rules instead) -- Deprecated since version 4.13.0." |
1823 | |
1824 | + def test_search_tag_deprecated(self): |
1825 | + strainer = SoupStrainer(name="a") |
1826 | + with warnings.catch_warnings(record=True) as w: |
1827 | + assert False == strainer.search_tag("b", {}) |
1828 | + [w1] = w |
1829 | + msg = str(w1.message) |
1830 | + assert w1.filename == __file__ |
1831 | + assert msg == "Call to deprecated method search_tag. (Replaced by allow_tag_creation) -- Deprecated since version 4.13.0." |
1832 | + |
1833 | + def test_search_deprecated(self): |
1834 | + strainer = SoupStrainer(name="a") |
1835 | + soup = self.soup("<a></a><b></b>") |
1836 | + with warnings.catch_warnings(record=True) as w: |
1837 | + assert soup.a == strainer.search(soup.a) |
1838 | + assert None == strainer.search(soup.b) |
1839 | + [w1, w2] = w |
1840 | + msg = str(w1.message) |
1841 | + assert msg == str(w2.message) |
1842 | + assert w1.filename == __file__ |
1843 | + assert msg == "Call to deprecated method search. (Replaced by match) -- Deprecated since version 4.13.0." |
1844 | + |
1845 | + # Dummy function used within tests. |
1846 | def _match_function(x): |
1847 | pass |
1848 | |
1849 | @@ -213,7 +323,7 @@ class TestSoupStrainer(SoupTest): |
1850 | ) |
1851 | |
1852 | def test_constructor_with_overlapping_attributes(self): |
1853 | - # If you specify the same attribute in arts and **kwargs, you end up |
1854 | + # If you specify the same attribute in args and **kwargs, you end up |
1855 | # with two different AttributeValueMatchRule objects. |
1856 | |
1857 | # This happens whether you use the 'class' shortcut on attrs... |
1858 | @@ -437,17 +547,24 @@ class TestSoupStrainer(SoupTest): |
1859 | # because the string restrictions can't be evaluated during |
1860 | # the parsing process, and the tag restrictions eliminate |
1861 | # any strings from consideration. |
1862 | + # |
1863 | + # We can detect this ahead of time, and warn about it, |
1864 | + # thanks to SoupStrainer.excludes_everything |
1865 | markup = "<a><b>one string<div>another string</div></b></a>" |
1866 | |
1867 | with warnings.catch_warnings(record=True) as w: |
1868 | +            assert True == soupstrainer.excludes_everything
1869 | assert "" == self.soup(markup, parse_only=soupstrainer).decode() |
1870 | [warning] = w |
1871 | msg = str(warning.message) |
1872 | assert warning.filename == __file__ |
1873 | assert str(warning.message).startswith( |
1874 | - "Value for parse_only will exclude everything, since it puts restrictions on both tags and strings:" |
1875 | + "The given value for parse_only will exclude everything:" |
1876 | ) |
1877 | - |
1878 | + |
1879 | + # The average SoupStrainer has excludes_everything=False |
1880 | + assert not SoupStrainer().excludes_everything |
1881 | + |
1882 | def test_documentation_examples(self): |
1883 | """Medium-weight real-world tests based on the Beautiful Soup |
1884 | documentation. |
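TestElementFilter above pins down the default behavior: an unconfigured filter matches every element and permits every tag and string to be created. A minimal self-contained model of that three-hook dispatch (an illustrative sketch, not the bs4 implementation; the method and argument names follow the diff):

```python
from typing import Any, Callable, Dict, Optional

class MiniElementFilter:
    """Sketch of ElementFilter's three hooks: each one falls back to
    True when its corresponding function is not configured."""
    def __init__(
        self,
        match_function: Optional[Callable[[Any], bool]] = None,
        allow_tag_creation_function: Optional[
            Callable[[Optional[str], str, Optional[Dict[str, str]]], bool]
        ] = None,
        allow_string_creation_function: Optional[Callable[[str], bool]] = None,
    ):
        self.match_function = match_function
        self.allow_tag_creation_function = allow_tag_creation_function
        self.allow_string_creation_function = allow_string_creation_function

    def match(self, element: Any) -> bool:
        # Used by the find_* methods; True means the element is a match.
        if not self.match_function:
            return True
        return self.match_function(element)

    def allow_tag_creation(
        self, nsprefix: Optional[str], name: str, attrs: Optional[Dict[str, str]]
    ) -> bool:
        # Used during parsing (parse_only) to veto Tag creation.
        if not self.allow_tag_creation_function:
            return True
        return self.allow_tag_creation_function(nsprefix, name, attrs)

    def allow_string_creation(self, string: str) -> bool:
        # Used during parsing (parse_only) to veto NavigableString creation.
        if not self.allow_string_creation_function:
            return True
        return self.allow_string_creation_function(string)
```

Because the three hooks are independent, configuring only `allow_string_creation_function` leaves tag creation and matching untouched, which is exactly what `test_allow_string_creation` asserts.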
1885 | diff --git a/bs4/tests/test_html5lib.py b/bs4/tests/test_html5lib.py |
1886 | index b0f4384..9f6dfa1 100644 |
1887 | --- a/bs4/tests/test_html5lib.py |
1888 | +++ b/bs4/tests/test_html5lib.py |
1889 | @@ -4,7 +4,7 @@ import pytest |
1890 | import warnings |
1891 | |
1892 | from bs4 import BeautifulSoup |
1893 | -from bs4.strainer import SoupStrainer |
1894 | +from bs4.filter import SoupStrainer |
1895 | from . import ( |
1896 | HTML5LIB_PRESENT, |
1897 | HTML5TreeBuilderSmokeTest, |
1898 | @@ -24,7 +24,7 @@ class TestHTML5LibBuilder(SoupTest, HTML5TreeBuilderSmokeTest): |
1899 | return HTML5TreeBuilder |
1900 | |
1901 | def test_soupstrainer(self): |
1902 | - # The html5lib tree builder does not support SoupStrainers. |
1903 | + # The html5lib tree builder does not support parse_only. |
1904 | strainer = SoupStrainer("b") |
1905 | markup = "<p>A <b>bold</b> statement.</p>" |
1906 | with warnings.catch_warnings(record=True) as w: |
1907 | diff --git a/bs4/tests/test_lxml.py b/bs4/tests/test_lxml.py |
1908 | index d450740..9fc04e0 100644 |
1909 | --- a/bs4/tests/test_lxml.py |
1910 | +++ b/bs4/tests/test_lxml.py |
1911 | @@ -14,7 +14,7 @@ from bs4 import ( |
1912 | BeautifulStoneSoup, |
1913 | ) |
1914 | from bs4.element import Comment, Doctype |
1915 | -from bs4.strainer import SoupStrainer |
1916 | +from bs4.filter import SoupStrainer |
1917 | from . import ( |
1918 | HTMLTreeBuilderSmokeTest, |
1919 | XMLTreeBuilderSmokeTest, |
1920 | diff --git a/bs4/tests/test_pageelement.py b/bs4/tests/test_pageelement.py |
1921 | index 19b4d63..7dfdc22 100644 |
1922 | --- a/bs4/tests/test_pageelement.py |
1923 | +++ b/bs4/tests/test_pageelement.py |
1924 | @@ -10,7 +10,7 @@ from bs4.element import ( |
1925 | Comment, |
1926 | ResultSet, |
1927 | ) |
1928 | -from bs4.strainer import SoupStrainer |
1929 | +from bs4.filter import SoupStrainer |
1930 | from . import ( |
1931 | SoupTest, |
1932 | ) |
1933 | diff --git a/bs4/tests/test_soup.py b/bs4/tests/test_soup.py |
1934 | index 4f8ee1a..c95f380 100644 |
1935 | --- a/bs4/tests/test_soup.py |
1936 | +++ b/bs4/tests/test_soup.py |
1937 | @@ -27,7 +27,7 @@ from bs4.element import ( |
1938 | Tag, |
1939 | NavigableString, |
1940 | ) |
1941 | -from bs4.strainer import SoupStrainer |
1942 | +from bs4.filter import SoupStrainer |
1943 | |
1944 | from . import ( |
1945 | default_builder, |
1946 | @@ -293,7 +293,7 @@ class TestWarnings(SoupTest): |
1947 | soup = self.soup("<a><b></b></a>", parse_only=strainer) |
1948 | warning = self._assert_warning(w, UserWarning) |
1949 | msg = str(warning.message) |
1950 | - assert msg.startswith("Value for parse_only will exclude everything, since it puts restrictions on both tags and strings:") |
1951 | + assert msg.startswith("The given value for parse_only will exclude everything:") |
1952 | |
1953 | def test_parseOnlyThese_renamed_to_parse_only(self): |
1954 | with warnings.catch_warnings(record=True) as w: |
1955 | diff --git a/bs4/tests/test_tree.py b/bs4/tests/test_tree.py |
1956 | index 606525f..43afb29 100644 |
1957 | --- a/bs4/tests/test_tree.py |
1958 | +++ b/bs4/tests/test_tree.py |
1959 | @@ -26,7 +26,7 @@ from bs4.element import ( |
1960 | Tag, |
1961 | TemplateString, |
1962 | ) |
1963 | -from bs4.strainer import SoupStrainer |
1964 | +from bs4.filter import SoupStrainer |
1965 | from . import ( |
1966 | SoupTest, |
1967 | ) |
1968 | diff --git a/doc/index.rst b/doc/index.rst |
1969 | index 7beff36..a414830 100755 |
1970 | --- a/doc/index.rst |
1971 | +++ b/doc/index.rst |
1972 | @@ -20,7 +20,7 @@ with examples. I show you what the library is good for, how it works, |
1973 | how to use it, how to make it do what you want, and what to do when it |
1974 | violates your expectations. |
1975 | |
1976 | -This document covers Beautiful Soup version 4.12.2. The examples in |
1977 | +This document covers Beautiful Soup version 4.13.0. The examples in |
1978 | this documentation were written for Python 3.8. |
1979 | |
1980 | You might be looking for the documentation for `Beautiful Soup 3 |
1981 | @@ -2577,6 +2577,11 @@ the human-visible content of the page.* |
1982 | either return the object itself, or nothing, so the only reason to do |
1983 | this is when you're iterating over a mixed list.* |
1984 | |
1985 | +*As of Beautiful Soup version 4.13.0, you can call .string on a |
1986 | +NavigableString object. It will return the object itself, so again, |
1987 | +the only reason to do this is when you're iterating over a mixed |
1988 | +list.* |
1989 | + |
1990 | Specifying the parser to use |
1991 | ============================ |
1992 | |
1993 | @@ -2604,8 +2609,9 @@ specifying one of the following: |
1994 | |
1995 | The section `Installing a parser`_ contrasts the supported parsers. |
1996 | |
1997 | -If you don't have an appropriate parser installed, Beautiful Soup will |
1998 | -ignore your request and pick a different parser. Right now, the only |
1999 | +If you ask for a parser that isn't installed, Beautiful Soup will |
2000 | +raise an exception so that you don't inadvertently parse a document |
2001 | +under an unknown set of rules. For example, right now, the only |
2002 | supported XML parser is lxml. If you don't have lxml installed, asking |
2003 | for an XML parser won't give you one, and asking for "lxml" won't work |
2004 | either. |
2005 | @@ -3018,6 +3024,44 @@ been called on it:: |
2006 | This is because two different :py:class:`Tag` objects can't occupy the same |
2007 | space at the same time. |
2008 | |
2009 | +Advanced search techniques |
2010 | +========================== |
2011 | + |
2012 | +Almost everyone who uses Beautiful Soup to extract information from a |
2013 | +document can get what they need using the methods described in |
2014 | +`Searching the tree`_. However, there's a lower-level interface--the |
2015 | +:py:class:`ElementSelector` class-- which lets you define any matching |
2016 | +behavior whatsoever. |
2017 | + |
2018 | +To use :py:class:`ElementFilter`, define a function that takes a
2019 | +:py:class:`PageElement` object (that is, it might be either a |
2020 | +:py:class:`Tag` or a :py:class:`NavigableString`) and returns ``True``
2021 | +(if the element matches your custom criteria) or ``False`` (if it |
2022 | +doesn't):: |
2023 | + |
2024 | + [example goes here] |
2025 | + |
2026 | +Then, pass the function into an :py:class:`ElementFilter`::
2027 | + |
2028 | +    from bs4.filter import ElementFilter
2029 | +    selector = ElementFilter(match_function=f)
2030 | + |
2031 | +You can then pass the :py:class:`ElementFilter` object as the first
2032 | +argument to any of the `Searching the tree`_ methods:: |
2033 | + |
2034 | + [examples go here] |
2035 | + |
2036 | +Every potential match will be run through your function, and the only |
2037 | +:py:class:`PageElement` objects returned will be the ones for which your
2038 | +function returned ``True``. |
2039 | + |
2040 | +Note that this is different from simply passing `a function`_ as the |
2041 | +first argument to one of the search methods. That's an easy way to |
2042 | +find a tag, but *only* tags will be considered. With an
2043 | +:py:class:`ElementFilter` you can write a single function that makes
2044 | +decisions about both tags and strings. |
2045 | + |
2046 | + |
2047 | Advanced parser customization |
2048 | ============================= |
2049 | |
2050 | @@ -3111,14 +3155,6 @@ The :py:class:`SoupStrainer` behavior is as follows: |
2051 | * When a tag does not match, the tag itself is not kept, but parsing continues |
2052 | into its contents to look for other tags that do match. |
2053 | |
2054 | -You can also pass a :py:class:`SoupStrainer` into any of the methods covered |
2055 | -in `Searching the tree`_. This probably isn't terribly useful, but I |
2056 | -thought I'd mention it:: |
2057 | - |
2058 | - soup = BeautifulSoup(html_doc, 'html.parser') |
2059 | - soup.find_all(only_short_strings) |
2060 | - # ['\n\n', '\n\n', 'Elsie', ',\n', 'Lacie', ' and\n', 'Tillie', |
2061 | - # '\n\n', '...', '\n'] |
2062 | |
2063 | Customizing multi-valued attributes |
2064 | ----------------------------------- |
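The parse_only warning exercised in test_soup.py and test_filter.py reduces to a simple predicate: string rules combined with name or attribute rules can never match, because a string has no name or attributes to satisfy those rules with. A hedged sketch of that check, where plain rule lists stand in for the MatchRule collections (not the bs4 source):

```python
from typing import Any, Sequence

def excludes_everything(
    string_rules: Sequence[Any],
    name_rules: Sequence[Any],
    attribute_rules: Sequence[Any],
) -> bool:
    """Mirrors the logic of SoupStrainer.excludes_everything: string
    restrictions can't be evaluated during the parsing process, and tag
    restrictions eliminate all strings from consideration, so combining
    the two obviously excludes everything."""
    return bool(string_rules and (name_rules or attribute_rules))
```

As the property's docstring notes, this only catches the obvious case: a strainer can still exclude everything in non-obvious ways (for example, via a function rule that always returns False) without this predicate detecting it.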