Merge beautifulsoup:more-modular-soupstrainers into beautifulsoup:4.13

Proposed by Leonard Richardson
Status: Merged
Merged at revision: c23dd48ebea467fcf028e14287f07d2c51e62975
Proposed branch: beautifulsoup:more-modular-soupstrainers
Merge into: beautifulsoup:4.13
Diff against target: 2064 lines (+710/-262)
18 files modified
CHANGELOG (+18/-1)
bs4/__init__.py (+131/-84)
bs4/_typing.py (+19/-1)
bs4/builder/__init__.py (+8/-8)
bs4/builder/_html5lib.py (+123/-67)
bs4/builder/_htmlparser.py (+12/-2)
bs4/builder/_lxml.py (+1/-1)
bs4/diagnose.py (+27/-15)
bs4/element.py (+24/-20)
bs4/filter.py (+167/-36)
bs4/tests/__init__.py (+1/-1)
bs4/tests/test_filter.py (+125/-8)
bs4/tests/test_html5lib.py (+2/-2)
bs4/tests/test_lxml.py (+1/-1)
bs4/tests/test_pageelement.py (+1/-1)
bs4/tests/test_soup.py (+2/-2)
bs4/tests/test_tree.py (+1/-1)
doc/index.rst (+47/-11)
Reviewer: Leonard Richardson (status: Pending)
Review via email: mp+459082@code.launchpad.net

Preview Diff

1diff --git a/CHANGELOG b/CHANGELOG
2index 69f238d..162e3dc 100644
3--- a/CHANGELOG
4+++ b/CHANGELOG
5@@ -1,5 +1,7 @@
6 = 4.13.0 (Unreleased)
7
8+TODO: we could stand to put limit inside ResultSet
9+
10 * This version drops support for Python 3.6. The minimum supported
11 major Python version for Beautiful Soup is now Python 3.7.
12
13@@ -31,6 +33,13 @@
14 you, since you probably use HTMLParserTreeBuilder, not
15 BeautifulSoupHTMLParser directly.
16
17+* The TreeBuilderForHtml5lib methods fragmentClass and getFragment
18+ now raise NotImplementedError. These methods are called only by
19+ html5lib's HTMLParser.parseFragment() method, which Beautiful Soup
20+ doesn't use, so they were untested and should have never been called.
21+ The getFragment() implementation was also slightly incorrect in a way
22+ that should have caused obvious problems for anyone using it.
23+
24 * If Tag.get_attribute_list() is used to access an attribute that's not set,
25 the return value is now an empty list rather than [None].
26
27@@ -47,6 +56,10 @@
28 empty list was treated the same as None and False, and you would have
29 found the tags which did not have that attribute set at all. [bug=2045469]
30
31+* For similar reasons, if you pass in limit=0 to a find() method for some
32+ reason, you will now get zero results. Previously, you would get all
33+ matching results.
34+
35 * When using one of the find() methods or creating a SoupStrainer,
36 if you specify the same attribute value in ``attrs`` and the
37 keyword arguments, you'll end up with two different ways to match that
38@@ -88,7 +101,7 @@
39 changed to match the arguments to the superclass,
40 TreeBuilder.prepare_markup. Specifically, document_declared_encoding
41 now appears before exclude_encodings, not after. If you were calling
42- this method yourself, I recomment switching to using keyword
43+ this method yourself, I recommend switching to using keyword
44 arguments instead.
45
46 * Fixed an error in the lookup table used when converting
47@@ -101,8 +114,12 @@ New deprecations in 4.13.0:
48
49 * The SAXTreeBuilder class, which was never officially supported or tested.
50
51+* The private class method BeautifulSoup._decode_markup(), which has not
52+ been used inside Beautiful Soup for many years.
53+
54 * The first argument to BeautifulSoup.decode has been changed from a bool
55 `pretty_print` to an int `indent_level`, to match the signature of Tag.decode.
56+ Using a bool will still work but will give you a DeprecationWarning.
57
58 * SoupStrainer.text and SoupStrainer.string are both deprecated
59 since a single item can't capture all the possibilities of a SoupStrainer
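The `limit=0` change described in the CHANGELOG above comes down to testing `limit is not None` instead of relying on truthiness. A minimal standalone sketch (hypothetical helper, not the actual `find_all` implementation) shows the difference:

```python
from typing import Iterable, List, Optional

def take_matches(items: Iterable[int], limit: Optional[int] = None) -> List[int]:
    """Collect items until an explicit limit is reached.

    Checking `limit is not None` (rather than the truthy `if limit:`)
    is what makes limit=0 mean "zero results" instead of "no limit".
    """
    results: List[int] = []
    for item in items:
        if limit is not None and len(results) >= limit:
            break
        results.append(item)
    return results
```

With the old truthiness check, `limit=0` would have been indistinguishable from the default and returned every match.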
60diff --git a/bs4/__init__.py b/bs4/__init__.py
61index 347cb38..95bd48d 100644
62--- a/bs4/__init__.py
63+++ b/bs4/__init__.py
64@@ -15,7 +15,7 @@ documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
65 """
66
67 __author__ = "Leonard Richardson (leonardr@segfault.org)"
68-__version__ = "4.12.3"
69+__version__ = "4.13.0"
70 __copyright__ = "Copyright (c) 2004-2024 Leonard Richardson"
71 # Use of this source code is governed by the MIT license.
72 __license__ = "MIT"
73@@ -42,10 +42,13 @@ from .builder import (
74 )
75 from .builder._htmlparser import HTMLParserTreeBuilder
76 from .dammit import UnicodeDammit
77+from .css import (
78+ CSS
79+)
80+from ._deprecation import _deprecated
81 from .element import (
82 CData,
83 Comment,
84- CSS,
85 DEFAULT_OUTPUT_ENCODING,
86 Declaration,
87 Doctype,
88@@ -60,7 +63,10 @@ from .element import (
89 TemplateString,
90 )
91 from .formatter import Formatter
92-from .strainer import SoupStrainer
93+from .filter import (
94+ ElementFilter,
95+ SoupStrainer,
96+)
97 from typing import (
98 Any,
99 cast,
100@@ -70,6 +76,7 @@ from typing import (
101 List,
102 Sequence,
103 Optional,
104+ Tuple,
105 Type,
106 TYPE_CHECKING,
107 Union,
108@@ -81,6 +88,7 @@ from bs4._typing import (
109 _Encoding,
110 _Encodings,
111 _IncomingMarkup,
112+ _RawMarkup,
113 )
114
115 # Define some custom warnings.
116@@ -144,20 +152,21 @@ class BeautifulSoup(Tag):
117 NO_PARSER_SPECIFIED_WARNING: str = "No parser was explicitly specified, so I'm using the best available %(markup_type)s parser for this system (\"%(parser)s\"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.\n\nThe code that caused this warning is on line %(line_number)s of the file %(filename)s. To get rid of this warning, pass the additional argument 'features=\"%(parser)s\"' to the BeautifulSoup constructor.\n"
118
119 # FUTURE PYTHON:
120- element_classes:Dict[Type[PageElement], Type[Any]] #: :meta private:
121+ element_classes:Dict[Type[PageElement], Type[PageElement]] #: :meta private:
122 builder:TreeBuilder #: :meta private:
123 is_xml: bool
124 known_xml: Optional[bool]
125 parse_only: Optional[SoupStrainer] #: :meta private:
126
127 # These members are only used while parsing markup.
128- markup:Optional[Union[str,bytes]] #: :meta private:
129+ markup:Optional[_RawMarkup] #: :meta private:
130 current_data:List[str] #: :meta private:
131 currentTag:Optional[Tag] #: :meta private:
132 tagStack:List[Tag] #: :meta private:
133 open_tag_counter:CounterType[str] #: :meta private:
134 preserve_whitespace_tag_stack:List[Tag] #: :meta private:
135 string_container_stack:List[Tag] #: :meta private:
136+ _most_recent_element:Optional[PageElement] #: :meta private:
137
138 #: Beautiful Soup's best guess as to the character encoding of the
139 #: original document.
140@@ -182,7 +191,7 @@ class BeautifulSoup(Tag):
141 parse_only:Optional[SoupStrainer]=None,
142 from_encoding:Optional[_Encoding]=None,
143 exclude_encodings:Optional[_Encodings]=None,
144- element_classes:Optional[Dict[Type[PageElement], Type[Any]]]=None,
145+ element_classes:Optional[Dict[Type[PageElement], Type[PageElement]]]=None,
146 **kwargs:Any
147 ):
148 """Constructor.
149@@ -271,7 +280,7 @@ class BeautifulSoup(Tag):
150 "features='lxml' for HTML and features='lxml-xml' for "
151 "XML.")
152
153- def deprecated_argument(old_name, new_name):
154+ def deprecated_argument(old_name:str, new_name:str) -> Optional[Any]:
155 if old_name in kwargs:
156 warnings.warn(
157 'The "%s" argument to the BeautifulSoup constructor '
158@@ -284,13 +293,14 @@ class BeautifulSoup(Tag):
159
160 parse_only = parse_only or deprecated_argument(
161 "parseOnlyThese", "parse_only")
162- if (parse_only is not None
163- and parse_only.string_rules and
164- (parse_only.name_rules or parse_only.attribute_rules)):
165- warnings.warn(
166- f"Value for parse_only will exclude everything, since it puts restrictions on both tags and strings: {parse_only}",
167- UserWarning, stacklevel=3
168- )
169+ if parse_only is not None:
170+ # Issue a warning if we can tell in advance that
171+ # parse_only will exclude the entire tree.
172+ if parse_only.excludes_everything:
173+ warnings.warn(
174+ f"The given value for parse_only will exclude everything: {parse_only}",
175+ UserWarning, stacklevel=3
176+ )
177
178 from_encoding = from_encoding or deprecated_argument(
179 "fromEncoding", "from_encoding")
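The warning in the hunk above fires when a strainer can be shown in advance to match nothing. Based on the inline check the hunk removes, the logic behind the new `excludes_everything` property can be sketched standalone (the function below is illustrative, not the real property):

```python
def excludes_everything(string_rules: bool, name_rules: bool,
                        attribute_rules: bool) -> bool:
    """A strainer that restricts both strings and tags matches nothing:
    a parsed node is either a string or a tag, never both, so combining
    string rules with tag-level rules excludes the entire tree."""
    return string_rules and (name_rules or attribute_rules)
```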
180@@ -323,7 +333,7 @@ class BeautifulSoup(Tag):
181 "Couldn't find a tree builder with the features you "
182 "requested: %s. Do you need to install a parser library?"
183 % ",".join(features))
184- builder_class = cast(Type[TreeBuilder], possible_builder_class)
185+ builder_class = possible_builder_class
186
187 # At this point either we have a TreeBuilder instance in
188 # builder, or we have a builder_class that we can instantiate
189@@ -399,7 +409,7 @@ class BeautifulSoup(Tag):
190
191 # At this point we know markup is a string or bytestring. If
192 # it was a file-type object, we've read from it.
193- markup = cast(Union[str,bytes], markup)
194+ markup = cast(_RawMarkup, markup)
195
196 rejections = []
197 success = False
198@@ -428,7 +438,7 @@ class BeautifulSoup(Tag):
199 self.markup = None
200 self.builder.soup = None
201
202- def _clone(self):
203+ def _clone(self) -> "BeautifulSoup":
204 """Create a new BeautifulSoup object with the same TreeBuilder,
205 but not associated with any markup.
206
207@@ -441,7 +451,7 @@ class BeautifulSoup(Tag):
208 clone.original_encoding = self.original_encoding
209 return clone
210
211- def __getstate__(self):
212+ def __getstate__(self) -> dict[str, Any]:
213 # Frequently a tree builder can't be pickled.
214 d = dict(self.__dict__)
215 if 'builder' in d and d['builder'] is not None and not self.builder.picklable:
216@@ -457,7 +467,7 @@ class BeautifulSoup(Tag):
217 del d['_most_recent_element']
218 return d
219
220- def __setstate__(self, state):
221+ def __setstate__(self, state: dict[str, Any]) -> None:
222 # If necessary, restore the TreeBuilder by looking it up.
223 self.__dict__ = state
224 if isinstance(self.builder, type):
225@@ -469,15 +479,16 @@ class BeautifulSoup(Tag):
226 self.builder.soup = self
227 self.reset()
228 self._feed()
229- return state
230
231
232 @classmethod
233- def _decode_markup(cls, markup):
234- """Ensure `markup` is bytes so it's safe to send into warnings.warn.
235+ @_deprecated(replaced_by="nothing (private method, will be removed)", version="4.13.0")
236+ def _decode_markup(cls, markup:_RawMarkup) -> str:
237+ """Ensure `markup` is Unicode so it's safe to send into warnings.warn.
238
239- TODO: warnings.warn had this problem back in 2010 but it might not
240- anymore.
241+ warnings.warn had this problem back in 2010 but fortunately
242+ not anymore. This has not been used for a long time; I just
243+ noticed that fact while working on 4.13.0.
244 """
245 if isinstance(markup, bytes):
246 decoded = markup.decode('utf-8', 'replace')
247@@ -486,56 +497,76 @@ class BeautifulSoup(Tag):
248 return decoded
249
250 @classmethod
251- def _markup_is_url(cls, markup):
252+ def _markup_is_url(cls, markup:_RawMarkup) -> bool:
253 """Error-handling method to raise a warning if incoming markup looks
254 like a URL.
255
256- :param markup: A string.
257- :return: Whether or not the markup resembles a URL
258- closely enough to justify a warning.
259+ :param markup: A string of markup.
260+ :return: Whether or not the markup resembled a URL
261+ closely enough to justify issuing a warning.
262 """
263+ problem: bool = False
264 if isinstance(markup, bytes):
265- space = b' '
266- cant_start_with = (b"http:", b"https:")
267+ cant_start_with_b: Tuple[bytes, bytes] = (b"http:", b"https:")
268+ problem = (
269+ any(
270+ markup.startswith(prefix) for prefix in
271+ (b"http:", b"https:")
272+ )
273+ and not b' ' in markup
274+ )
275 elif isinstance(markup, str):
276- space = ' '
277- cant_start_with = ("http:", "https:")
278+ problem = (
279+ any(
280+ markup.startswith(prefix) for prefix in
281+ ("http:", "https:")
282+ )
283+ and not ' ' in markup
284+ )
285 else:
286 return False
287
288- if any(markup.startswith(prefix) for prefix in cant_start_with):
289- if not space in markup:
290- warnings.warn(
291- 'The input looks more like a URL than markup. You may want to use'
292- ' an HTTP client like requests to get the document behind'
293- ' the URL, and feed that document to Beautiful Soup.',
294- MarkupResemblesLocatorWarning,
295- stacklevel=3
296- )
297- return True
298- return False
299+ if not problem:
300+ return False
301+ warnings.warn(
302+ 'The input looks more like a URL than markup. You may want to use'
303+ ' an HTTP client like requests to get the document behind'
304+ ' the URL, and feed that document to Beautiful Soup.',
305+ MarkupResemblesLocatorWarning,
306+ stacklevel=3
307+ )
308+ return True
309
310 @classmethod
311- def _markup_resembles_filename(cls, markup):
312- """Error-handling method to raise a warning if incoming markup
313+ def _markup_resembles_filename(cls, markup:_RawMarkup) -> bool:
314+ """Error-handling method to issue a warning if incoming markup
315 resembles a filename.
316
317- :param markup: A bytestring or string.
318- :return: Whether or not the markup resembles a filename
319- closely enough to justify a warning.
320+ :param markup: A string of markup.
321+ :return: Whether or not the markup resembled a filename
322+ closely enough to justify issuing a warning.
323 """
324- path_characters = '/\\'
325- extensions = ['.html', '.htm', '.xml', '.xhtml', '.txt']
326- if isinstance(markup, bytes):
327- path_characters = path_characters.encode("utf8")
328- extensions = [x.encode('utf8') for x in extensions]
329+ path_characters_b = b'/\\'
330+ path_characters_s = '/\\'
331+ extensions_b = [b'.html', b'.htm', b'.xml', b'.xhtml', b'.txt']
332+ extensions_s = ['.html', '.htm', '.xml', '.xhtml', '.txt']
333+
334 filelike = False
335- if any(x in markup for x in path_characters):
336- filelike = True
337+ if isinstance(markup, bytes):
338+ if any(x in markup for x in path_characters_b):
339+ filelike = True
340+ else:
341+ lower_b = markup.lower()
342+ if any(lower_b.endswith(ext) for ext in extensions_b):
343+ filelike = True
344 else:
345- lower = markup.lower()
346- if any(lower.endswith(ext) for ext in extensions):
347+ if any(x in markup for x in path_characters_s):
348 filelike = True
349+ else:
350+ lower_s = markup.lower()
351+ if any(lower_s.endswith(ext) for ext in extensions_s):
352+ filelike = True
353+
354 if filelike:
355 warnings.warn(
356 'The input looks more like a filename than markup. You may'
357@@ -546,20 +577,22 @@ class BeautifulSoup(Tag):
358 return True
359 return False
360
361- def _feed(self):
362+ def _feed(self) -> None:
363 """Internal method that parses previously set markup, creating a large
364 number of Tag and NavigableString objects.
365 """
366 # Convert the document to Unicode.
367 self.builder.reset()
368
369- self.builder.feed(self.markup)
370+ if self.markup is not None:
371+ self.builder.feed(self.markup)
372 # Close out any unfinished strings and close all the open tags.
373 self.endData()
374- while self.currentTag.name != self.ROOT_TAG_NAME:
375+ while (self.currentTag is not None and
376+ self.currentTag.name != self.ROOT_TAG_NAME):
377 self.popTag()
378
379- def reset(self):
380+ def reset(self) -> None:
381 """Reset this object to a state as though it had never parsed any
382 markup.
383 """
384@@ -585,7 +618,7 @@ class BeautifulSoup(Tag):
385 sourcepos:Optional[int]=None,
386 string:Optional[str]=None,
387 **kwattrs:_AttributeValue,
388- ):
389+ ) -> Tag:
390 """Create a new Tag associated with this BeautifulSoup object.
391
392 :param name: The name of the new Tag.
393@@ -603,10 +636,16 @@ class BeautifulSoup(Tag):
394
395 """
396 kwattrs.update(attrs)
397- tag = self.element_classes.get(Tag, Tag)(
398+ tag_class = self.element_classes.get(Tag, Tag)
399+
400+ # Assume that this is either Tag or a subclass of Tag. If not,
401+ # the user brought type-unsafety upon themselves.
402+ tag_class = cast(Type[Tag], tag_class)
403+ tag = tag_class(
404 None, self.builder, name, namespace, nsprefix, kwattrs,
405 sourceline=sourceline, sourcepos=sourcepos
406 )
407+
408 if string is not None:
409 tag.string = string
410 return tag
411@@ -622,9 +661,11 @@ class BeautifulSoup(Tag):
412 """
413 container = base_class or NavigableString
414
415- # There may be a general override of NavigableString.
416- container = self.element_classes.get(
417- container, container
418+ # The user may want us to use some other class (hopefully a
419+ # custom subclass) instead of the one we'd use normally.
420+ container = cast(
421+ type[NavigableString],
422+ self.element_classes.get(container, container)
423 )
424
425 # On top of that, we may be inside a tag that needs a special
426@@ -728,9 +769,8 @@ class BeautifulSoup(Tag):
427 self.current_data = []
428
429 # Should we add this string to the tree at all?
430- if self.parse_only and len(self.tagStack) <= 1 and \
431- (not self.parse_only.string_rules or \
432- not self.parse_only.allow_string_creation(current_data)):
433+ if (self.parse_only and len(self.tagStack) <= 1 and
434+ (not self.parse_only.allow_string_creation(current_data))):
435 return
436
437 containerClass = self.string_container(containerClass)
438@@ -739,17 +779,16 @@ class BeautifulSoup(Tag):
439
440 def object_was_parsed(
441 self, o:PageElement, parent:Optional[Tag]=None,
442- most_recent_element:Optional[PageElement]=None):
443+ most_recent_element:Optional[PageElement]=None) -> None:
444 """Method called by the TreeBuilder to integrate an object into the
445 parse tree.
446
447-
448-
449 :meta private:
450 """
451 if parent is None:
452 parent = self.currentTag
453 assert parent is not None
454+ previous_element: Optional[PageElement]
455 if most_recent_element is not None:
456 previous_element = most_recent_element
457 else:
458@@ -774,12 +813,12 @@ class BeautifulSoup(Tag):
459 if fix:
460 self._linkage_fixer(parent)
461
462- def _linkage_fixer(self, el):
463+ def _linkage_fixer(self, el:Tag) -> None:
464 """Make sure linkage of this fragment is sound."""
465
466 first = el.contents[0]
467 child = el.contents[-1]
468- descendant = child
469+ descendant:PageElement = child
470
471 if child is first and el.parent is not None:
472 # Parent should be linked to first child
473@@ -797,14 +836,18 @@ class BeautifulSoup(Tag):
474
475 # This index is a tag, dig deeper for a "last descendant"
476 if isinstance(child, Tag) and child.contents:
477- descendant = child._last_descendant(False)
478+ # _last_decendant is typed as returning Optional[PageElement],
479+ # but the value can't be None here, because el is a Tag
480+ # which we know has contents.
481+ descendant = cast(PageElement, child._last_descendant(False))
482
483 # As the final step, link last descendant. It should be linked
484 # to the parent's next sibling (if found), else walk up the chain
485 # and find a parent with a sibling. It should have no next sibling.
486 descendant.next_element = None
487 descendant.next_sibling = None
488- target = el
489+
490+ target:Optional[Tag] = el
491 while True:
492 if target is None:
493 break
494@@ -814,7 +857,7 @@ class BeautifulSoup(Tag):
495 break
496 target = target.parent
497
498- def _popToTag(self, name, nsprefix=None, inclusivePop=True) -> Optional[Tag]:
499+ def _popToTag(self, name:str, nsprefix:Optional[str]=None, inclusivePop:bool=True) -> Optional[Tag]:
500 """Pops the tag stack up to and including the most recent
501 instance of the given tag.
502
503@@ -851,7 +894,7 @@ class BeautifulSoup(Tag):
504
505 def handle_starttag(
506 self, name:str, namespace:Optional[str],
507- nsprefix:Optional[str], attrs:Optional[Dict[str,str]],
508+ nsprefix:Optional[str], attrs:_AttributeValues,
509 sourceline:Optional[int]=None, sourcepos:Optional[int]=None,
510 namespaces:Optional[Dict[str, str]]=None) -> Optional[Tag]:
511 """Called by the tree builder when a new tag is encountered.
512@@ -867,7 +910,7 @@ class BeautifulSoup(Tag):
513 currently in scope in the document.
514
515 If this method returns None, the tag was rejected by an active
516- SoupStrainer. You should proceed as if the tag had not occurred
517+ `ElementFilter`. You should proceed as if the tag had not occurred
518 in the document. For instance, if this was a self-closing tag,
519 don't call handle_endtag.
520
521@@ -877,11 +920,14 @@ class BeautifulSoup(Tag):
522 self.endData()
523
524 if (self.parse_only and len(self.tagStack) <= 1
525- and (self.parse_only.string_rules
526- or not self.parse_only.allow_tag_creation(nsprefix, name, attrs))):
527+ and not self.parse_only.allow_tag_creation(nsprefix, name, attrs)):
528 return None
529
530- tag = self.element_classes.get(Tag, Tag)(
531+ tag_class = self.element_classes.get(Tag, Tag)
532+ # Assume that this is either Tag or a subclass of Tag. If not,
533+ # the user brought type-unsafety upon themselves.
534+ tag_class = cast(Type[Tag], tag_class)
535+ tag = tag_class(
536 self, self.builder, name, namespace, nsprefix, attrs,
537 self.currentTag, self._most_recent_element,
538 sourceline=sourceline, sourcepos=sourcepos,
539@@ -918,7 +964,8 @@ class BeautifulSoup(Tag):
540 def decode(self, indent_level:Optional[int]=None,
541 eventual_encoding:_Encoding=DEFAULT_OUTPUT_ENCODING,
542 formatter:Union[Formatter,str]="minimal",
543- iterator:Optional[Iterable]=None, **kwargs) -> str:
544+ iterator:Optional[Iterable[PageElement]]=None,
545+ **kwargs:Any) -> str:
546 """Returns a string representation of the parse tree
547 as a full HTML or XML document.
548
549@@ -989,7 +1036,7 @@ _soup = BeautifulSoup
550 class BeautifulStoneSoup(BeautifulSoup):
551 """Deprecated interface to an XML parser."""
552
553- def __init__(self, *args, **kwargs):
554+ def __init__(self, *args:Any, **kwargs:Any):
555 kwargs['features'] = 'xml'
556 warnings.warn(
557 'The BeautifulStoneSoup class was deprecated in version 4.0.0. Instead of using '
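The refactored `_markup_is_url` and `_markup_resembles_filename` checks in the `bs4/__init__.py` diff above reduce to two small predicates. Condensed as standalone functions (warning emission omitted), the heuristics are:

```python
from typing import Union

def markup_is_url(markup: Union[str, bytes]) -> bool:
    # A URL-like input starts with an HTTP scheme and contains no spaces.
    if isinstance(markup, bytes):
        return markup.startswith((b"http:", b"https:")) and b" " not in markup
    if isinstance(markup, str):
        return markup.startswith(("http:", "https:")) and " " not in markup
    return False

def markup_resembles_filename(markup: Union[str, bytes]) -> bool:
    # Path separators or a known markup-file extension suggest a filename.
    if isinstance(markup, bytes):
        return (any(c in markup for c in (b"/", b"\\"))
                or markup.lower().endswith(
                    (b".html", b".htm", b".xml", b".xhtml", b".txt")))
    if isinstance(markup, str):
        return (any(c in markup for c in ("/", "\\"))
                or markup.lower().endswith(
                    (".html", ".htm", ".xml", ".xhtml", ".txt")))
    return False
```

The duplication between the `bytes` and `str` branches is the price of the static typing introduced in this branch: each branch keeps its comparison values in a single type.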
558diff --git a/bs4/_typing.py b/bs4/_typing.py
559index fed804a..ab8f7a0 100644
560--- a/bs4/_typing.py
561+++ b/bs4/_typing.py
562@@ -7,6 +7,8 @@
563 # * In 3.10, x|y is an accepted shorthand for Union[x,y].
564 # * In 3.10, TypeAlias gains capabilities that can be used to
565 # improve the tree matching types (I don't remember what, exactly).
566+# * 3.8 defines the Protocol type, which can be used to do duck typing
567+# in a statically checkable way.
568
569 import re
570 from typing_extensions import TypeAlias
571@@ -15,13 +17,14 @@ from typing import (
572 Dict,
573 IO,
574 Iterable,
575+ Optional,
576 Pattern,
577 TYPE_CHECKING,
578 Union,
579 )
580
581 if TYPE_CHECKING:
582- from bs4.element import Tag
583+ from bs4.element import PageElement, Tag
584
585 # Aliases for markup in various stages of processing.
586 #
587@@ -52,6 +55,10 @@ _InvertedNamespaceMapping:TypeAlias = Dict[_NamespaceURL, _NamespacePrefix]
588 _AttributeValue: TypeAlias = Union[str, Iterable[str]]
589 _AttributeValues: TypeAlias = Dict[str, _AttributeValue]
590
591+# The most common form in which attribute values are passed in from a
592+# parser.
593+_RawAttributeValues: TypeAlias = dict[str, str]
594+
595 # Aliases to represent the many possibilities for matching bits of a
596 # parse tree.
597 #
598@@ -60,6 +67,17 @@ _AttributeValues: TypeAlias = Dict[str, _AttributeValue]
599 # of the arguments to the SoupStrainer constructor and (more
600 # familiarly to Beautiful Soup users) the find* methods.
601
602+# A function that takes a PageElement and returns a yes-or-no answer.
603+_PageElementMatchFunction:TypeAlias = Callable[['PageElement'], bool]
604+
605+# A function that takes the raw parsed ingredients of a markup tag
606+# and returns a yes-or-no answer.
607+_AllowTagCreationFunction:TypeAlias = Callable[[Optional[str], str, Optional[_RawAttributeValues]], bool]
608+
609+# A function that takes the raw parsed ingredients of a markup string node
610+# and returns a yes-or-no answer.
611+_AllowStringCreationFunction:TypeAlias = Callable[[Optional[str]], bool]
612+
613 # A function that takes a Tag and returns a yes-or-no answer.
614 # A TagNameMatchRule expects this kind of function, if you're
615 # going to pass it a function.
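The new callable aliases added to `bs4/_typing.py` above are ordinary `Callable` types. A small illustration of how a function can satisfy the `_AllowTagCreationFunction` shape (reproduced here with plain `typing` types rather than bs4's internal aliases):

```python
from typing import Callable, Dict, Optional

# Same shape as _AllowTagCreationFunction, spelled with plain types:
# (nsprefix, tag name, raw attributes) -> allow or reject.
AllowTagCreationFunction = Callable[
    [Optional[str], str, Optional[Dict[str, str]]], bool
]

def no_scripts(nsprefix: Optional[str], name: str,
               attrs: Optional[Dict[str, str]]) -> bool:
    """Reject <script> tags at parse time."""
    return name != "script"

# A static type checker accepts this assignment because the
# signature matches the alias.
allow: AllowTagCreationFunction = no_scripts
```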
616diff --git a/bs4/builder/__init__.py b/bs4/builder/__init__.py
617index fa2b939..b59513e 100644
618--- a/bs4/builder/__init__.py
619+++ b/bs4/builder/__init__.py
620@@ -277,7 +277,7 @@ class TreeBuilder(object):
621 return True
622 return tag_name in self.empty_element_tags
623
624- def feed(self, markup:str) -> None:
625+ def feed(self, markup:_RawMarkup) -> None:
626 """Run some incoming markup through some parsing process,
627 populating the `BeautifulSoup` object in `TreeBuilder.soup`
628 """
629@@ -598,8 +598,8 @@ class DetectsXMLParsedAsHTML(object):
630
631 # This is typed as str, not `ProcessingInstruction`, because this
632 # check may be run before any Beautiful Soup objects are created.
633- _first_processing_instruction: Optional[str]
634- _root_tag: Optional[Tag]
635+ _first_processing_instruction: Optional[str] #: :meta private:
636+ _root_tag_name: Optional[str] #: :meta private:
637
638 @classmethod
639 def warn_if_markup_looks_like_xml(cls, markup:Optional[_RawMarkup], stacklevel:int=3) -> bool:
640@@ -648,14 +648,14 @@ class DetectsXMLParsedAsHTML(object):
641 def _initialize_xml_detector(self) -> None:
642 """Call this method before parsing a document."""
643 self._first_processing_instruction = None
644- self._root_tag = None
645+ self._root_tag_name = None
646
647 def _document_might_be_xml(self, processing_instruction:str):
648 """Call this method when encountering an XML declaration, or a
649 "processing instruction" that might be an XML declaration.
650 """
651 if (self._first_processing_instruction is not None
652- or self._root_tag is not None):
653+ or self._root_tag_name is not None):
654 # The document has already started. Don't bother checking
655 # anymore.
656 return
657@@ -665,18 +665,18 @@ class DetectsXMLParsedAsHTML(object):
658 # We won't know until we encounter the first tag whether or
659 # not this is actually a problem.
660
661- def _root_tag_encountered(self, name):
662+ def _root_tag_encountered(self, name:str) -> None:
663 """Call this when you encounter the document's root tag.
664
665 This is where we actually check whether an XML document is
666 being incorrectly parsed as HTML, and issue the warning.
667 """
668- if self._root_tag is not None:
669+ if self._root_tag_name is not None:
670 # This method was incorrectly called multiple times. Do
671 # nothing.
672 return
673
674- self._root_tag = name
675+ self._root_tag_name = name
676 if (name != 'html' and self._first_processing_instruction is not None
677 and self._first_processing_instruction.lower().startswith('xml ')):
678 # We encountered an XML declaration and then a tag other
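The `_root_tag` to `_root_tag_name` rename above doesn't change the detection logic, which can be restated as a standalone predicate (an illustrative extraction, not the method itself):

```python
from typing import Optional

def looks_like_misparsed_xml(root_tag_name: str,
                             first_processing_instruction: Optional[str]) -> bool:
    """An XML declaration followed by a non-<html> root tag suggests
    an XML document was fed to an HTML parser."""
    return (root_tag_name != "html"
            and first_processing_instruction is not None
            and first_processing_instruction.lower().startswith("xml "))
```

Storing only the root tag's name (a `str`) rather than a `Tag` is enough for this check, which is why the field's type could be narrowed.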
679diff --git a/bs4/builder/_html5lib.py b/bs4/builder/_html5lib.py
680index b7d2924..2ea556c 100644
681--- a/bs4/builder/_html5lib.py
682+++ b/bs4/builder/_html5lib.py
683@@ -6,6 +6,9 @@ __all__ = [
684 ]
685
686 from typing import (
687+ Any,
688+ cast,
689+ Dict,
690 Iterable,
691 List,
692 Optional,
693@@ -14,8 +17,11 @@ from typing import (
694 Union,
695 )
696 from bs4._typing import (
697+ _AttributeValue,
698+ _AttributeValues,
699 _Encoding,
700 _Encodings,
701+ _NamespaceURL,
702 _RawMarkup,
703 )
704
705@@ -30,6 +36,7 @@ from bs4.builder import (
706 )
707 from bs4.element import (
708 NamespacedAttribute,
709+ PageElement,
710 nonwhitespace_re,
711 )
712 import html5lib
713@@ -42,7 +49,9 @@ from bs4.element import (
714 Doctype,
715 NavigableString,
716 Tag,
717- )
718+)
719+if TYPE_CHECKING:
720+ from bs4 import BeautifulSoup
721
722 from html5lib.treebuilders import base as treebuilder_base
723
724@@ -71,7 +80,9 @@ class HTML5TreeBuilder(HTMLTreeBuilder):
725 #: html5lib can tell us which line number and position in the
726 #: original file is the source of an element.
727 TRACKS_LINE_NUMBERS:bool = True
728-
729+
730+ underlying_builder:'TreeBuilderForHtml5lib' #: :meta private:
731+
732 def prepare_markup(self, markup:_RawMarkup,
733 user_specified_encoding:Optional[_Encoding]=None,
734 document_declared_encoding:Optional[_Encoding]=None,
735@@ -102,20 +113,31 @@ class HTML5TreeBuilder(HTMLTreeBuilder):
736 yield (markup, None, None, False)
737
738 # These methods are defined by Beautiful Soup.
739- def feed(self, markup):
740+ def feed(self, markup:_RawMarkup) -> None:
741 """Run some incoming markup through some parsing process,
742 populating the `BeautifulSoup` object in `HTML5TreeBuilder.soup`.
743 """
744- if self.soup.parse_only is not None:
745+ if self.soup is not None and self.soup.parse_only is not None:
746 warnings.warn(
747 "You provided a value for parse_only, but the html5lib tree builder doesn't support parse_only. The entire document will be parsed.",
748 stacklevel=4
749 )
750+
751+ # self.underlying_parser is probably None now, but it'll be set
752+ # when self.create_treebuilder is called by html5lib.
753+ #
754+ # TODO-TYPING: typeshed stubs are incorrect about the return
755+ # value of HTMLParser.__init__; it is HTMLParser, not None.
756 parser = html5lib.HTMLParser(tree=self.create_treebuilder)
757+ assert self.underlying_builder is not None
758 self.underlying_builder.parser = parser
759 extra_kwargs = dict()
760 if not isinstance(markup, str):
761+ # kwargs, specifically override_encoding, will eventually
762+ # be passed in to html5lib's
763+ # HTMLBinaryInputStream.__init__.
764 extra_kwargs['override_encoding'] = self.user_specified_encoding
765+
766 doc = parser.parse(markup, **extra_kwargs)
767
768 # Set the character encoding detected by the tokenizer.
769@@ -131,10 +153,12 @@ class HTML5TreeBuilder(HTMLTreeBuilder):
770 doc.original_encoding = original_encoding
771 self.underlying_builder.parser = None
772
773- def create_treebuilder(self, namespaceHTMLElements):
774+ def create_treebuilder(self, namespaceHTMLElements:bool) -> 'TreeBuilderForHtml5lib':
775 """Called by html5lib to instantiate the kind of class it
776 calls a 'TreeBuilder'.
777-
778+
779+ :param namespaceHTMLElements: Whether or not to namespace HTML elements.
780+
781 :meta private:
782 """
783 self.underlying_builder = TreeBuilderForHtml5lib(
784@@ -143,15 +167,18 @@ class HTML5TreeBuilder(HTMLTreeBuilder):
785 )
786 return self.underlying_builder
787
788- def test_fragment_to_document(self, fragment):
789+ def test_fragment_to_document(self, fragment:str) -> str:
790 """See `TreeBuilder`."""
791 return '<html><head></head><body>%s</body></html>' % fragment
792
793
794 class TreeBuilderForHtml5lib(treebuilder_base.TreeBuilder):
795-
796- def __init__(self, namespaceHTMLElements, soup=None,
797- store_line_numbers=True, **kwargs):
798+
799+ soup:'BeautifulSoup' #: :meta private:
800+
801+ def __init__(self, namespaceHTMLElements:bool,
802+ soup:Optional['BeautifulSoup']=None,
803+ store_line_numbers:bool=True, **kwargs:Any):
804 if soup:
805 self.soup = soup
806 else:
807@@ -172,65 +199,68 @@ class TreeBuilderForHtml5lib(treebuilder_base.TreeBuilder):
808 self.parser = None
809 self.store_line_numbers = store_line_numbers
810
811- def documentClass(self):
812+ def documentClass(self) -> 'Element':
813 self.soup.reset()
814 return Element(self.soup, self.soup, None)
815
816- def insertDoctype(self, token):
817- name = token["name"]
818- publicId = token["publicId"]
819- systemId = token["systemId"]
820+ def insertDoctype(self, token:Dict[str, Any]) -> None:
821+ name:str = cast(str, token["name"])
822+ publicId:Optional[str] = cast(Optional[str], token["publicId"])
823+ systemId:Optional[str] = cast(Optional[str], token["systemId"])
824
825 doctype = Doctype.for_name_and_ids(name, publicId, systemId)
826 self.soup.object_was_parsed(doctype)
827
828- def elementClass(self, name, namespace):
829- kwargs = {}
830+ def elementClass(self, name:str, namespace:str) -> 'Element':
831+ sourceline:Optional[int] = None
832+ sourcepos:Optional[int] = None
833 if self.parser and self.store_line_numbers:
834 # This represents the point immediately after the end of the
835 # tag. We don't know when the tag started, but we do know
836 # where it ended -- the character just before this one.
837 sourceline, sourcepos = self.parser.tokenizer.stream.position()
838- kwargs['sourceline'] = sourceline
839- kwargs['sourcepos'] = sourcepos-1
840- tag = self.soup.new_tag(name, namespace, **kwargs)
841+ sourcepos = sourcepos-1
842+ tag = self.soup.new_tag(
843+ name, namespace, sourceline=sourceline, sourcepos=sourcepos
844+ )
845
846 return Element(tag, self.soup, namespace)
847
848- def commentClass(self, data):
849+ def commentClass(self, data:str) -> 'TextNode':
850 return TextNode(Comment(data), self.soup)
851
852- def fragmentClass(self):
853- from bs4 import BeautifulSoup
854- # TODO: Why is the parser 'html.parser' here? To avoid an
855- # infinite loop?
856- self.soup = BeautifulSoup("", "html.parser")
857- self.soup.name = "[document_fragment]"
858- return Element(self.soup, self.soup, None)
859+ def fragmentClass(self) -> 'Element':
860+ """This is only used by html5lib HTMLParser.parseFragment(),
861+ which is never used by Beautiful Soup."""
862+ raise NotImplementedError()
863+
864+ def getFragment(self) -> 'Element':
865+ """This is only used by html5lib HTMLParser.parseFragment,
866+ which is never used by Beautiful Soup."""
867+ raise NotImplementedError()
868
869- def appendChild(self, node):
870- # XXX This code is not covered by the BS4 tests.
871+ def appendChild(self, node:'Element') -> None:
872+ # TODO: This code is not covered by the BS4 tests.
873 self.soup.append(node.element)
874
875- def getDocument(self):
876+ def getDocument(self) -> 'BeautifulSoup':
877 return self.soup
878
879- def getFragment(self):
880- return treebuilder_base.TreeBuilder.getFragment(self).element
881-
882- def testSerializer(self, element):
883+ # TODO-TYPING: typeshed stubs are incorrect about this;
884+ # testSerializer returns a str, not None.
885+ def testSerializer(self, element:'Element') -> str:
886 from bs4 import BeautifulSoup
887 rv = []
888 doctype_re = re.compile(r'^(.*?)(?: PUBLIC "(.*?)"(?: "(.*?)")?| SYSTEM "(.*?)")?$')
889
890- def serializeElement(element, indent=0):
891+ def serializeElement(element:Union['Element', PageElement], indent=0) -> None:
892 if isinstance(element, BeautifulSoup):
893 pass
894 if isinstance(element, Doctype):
895 m = doctype_re.match(element)
896- if m:
897+ if m is not None:
898 name = m.group(1)
899- if m.lastindex > 1:
900+ if m.lastindex is not None and m.lastindex > 1:
901 publicId = m.group(2) or ""
902 systemId = m.group(3) or m.group(4) or ""
903 rv.append("""|%s<!DOCTYPE %s "%s" "%s">""" %
904@@ -243,7 +273,7 @@ class TreeBuilderForHtml5lib(treebuilder_base.TreeBuilder):
905 rv.append("|%s<!-- %s -->" % (' ' * indent, element))
906 elif isinstance(element, NavigableString):
907 rv.append("|%s\"%s\"" % (' ' * indent, element))
908- else:
909+ elif isinstance(element, Element):
910 if element.namespace:
911 name = "%s %s" % (prefixes[element.namespace],
912 element.name)
913@@ -269,12 +299,19 @@ class TreeBuilderForHtml5lib(treebuilder_base.TreeBuilder):
914 return "\n".join(rv)
915
916 class AttrList(object):
917- def __init__(self, element):
918+ """Represents a Tag's attributes in a way compatible with html5lib."""
919+
920+ element:Tag
921+ attrs:_AttributeValues
922+
923+ def __init__(self, element:Tag):
924 self.element = element
925 self.attrs = dict(self.element.attrs)
926- def __iter__(self):
927+
928+ def __iter__(self) -> Iterable[Tuple[str, _AttributeValue]]:
929 return list(self.attrs.items()).__iter__()
930- def __setitem__(self, name, value):
931+
932+ def __setitem__(self, name:str, value:_AttributeValue) -> None:
933 # If this attribute is a multi-valued attribute for this element,
934 # turn its value into a list.
935 list_attr = self.element.cdata_list_attributes or {}
936@@ -282,40 +319,52 @@ class AttrList(object):
937 or (self.element.name in list_attr
938 and name in list_attr.get(self.element.name, []))):
939 # A node that is being cloned may have already undergone
940- # this procedure.
941+ # this procedure. Check for this and skip it.
942 if not isinstance(value, list):
943+ assert isinstance(value, str)
944 value = nonwhitespace_re.findall(value)
945 self.element[name] = value
946- def items(self):
947+
948+ def items(self) -> Iterable[Tuple[str, _AttributeValue]]:
949 return list(self.attrs.items())
950- def keys(self):
951+
952+ def keys(self) -> Iterable[str]:
953 return list(self.attrs.keys())
954- def __len__(self):
955+
956+ def __len__(self) -> int:
957 return len(self.attrs)
958- def __getitem__(self, name):
959+
960+ def __getitem__(self, name:str) -> _AttributeValue:
961 return self.attrs[name]
962- def __contains__(self, name):
963+
964+ def __contains__(self, name:str) -> bool:
965 return name in list(self.attrs.keys())
966
967
968 class Element(treebuilder_base.Node):
969- def __init__(self, element, soup, namespace):
970+
971+ element:Tag
972+ soup:'BeautifulSoup'
973+ namespace:Optional[_NamespaceURL]
974+
975+ def __init__(self, element:Tag, soup:'BeautifulSoup',
976+ namespace:Optional[_NamespaceURL]):
977 treebuilder_base.Node.__init__(self, element.name)
978 self.element = element
979 self.soup = soup
980 self.namespace = namespace
981
982- def appendChild(self, node):
983+ def appendChild(self, node:'Element') -> None:
984 string_child = child = None
985 if isinstance(node, str):
986 # Some other piece of code decided to pass in a string
987 # instead of creating a TextElement object to contain the
988- # string.
989+ # string. This should not ever happen.
990 string_child = child = node
991 elif isinstance(node, Tag):
992 # Some other piece of code decided to pass in a Tag
993 # instead of creating an Element object to contain the
994- # Tag.
995+ # Tag. This should not ever happen.
996 child = node
997 elif node.element.__class__ == NavigableString:
998 string_child = child = node.element
999@@ -324,7 +373,7 @@ class Element(treebuilder_base.Node):
1000 child = node.element
1001 node.parent = self
1002
1003- if not isinstance(child, str) and child.parent is not None:
1004+ if not isinstance(child, str) and child is not None and child.parent is not None:
1005 node.element.extract()
1006
1007 if (string_child is not None and self.element.contents
1008@@ -359,14 +408,13 @@ class Element(treebuilder_base.Node):
1009 child, parent=self.element,
1010 most_recent_element=most_recent_element)
1011
1012- def getAttributes(self):
1013+ def getAttributes(self) -> AttrList:
1014 if isinstance(self.element, Comment):
1015 return {}
1016 return AttrList(self.element)
1017
1018- def setAttributes(self, attributes):
1019+ def setAttributes(self, attributes:Optional[Dict]) -> None:
1020 if attributes is not None and len(attributes) > 0:
1021- converted_attributes = []
1022 for name, value in list(attributes.items()):
1023 if isinstance(name, tuple):
1024 new_name = NamespacedAttribute(*name)
1025@@ -386,14 +434,14 @@ class Element(treebuilder_base.Node):
1026 self.soup.builder.set_up_substitutions(self.element)
1027 attributes = property(getAttributes, setAttributes)
1028
1029- def insertText(self, data, insertBefore=None):
1030+ def insertText(self, data:str, insertBefore:Optional['Element']=None) -> None:
1031 text = TextNode(self.soup.new_string(data), self.soup)
1032 if insertBefore:
1033 self.insertBefore(text, insertBefore)
1034 else:
1035 self.appendChild(text)
1036
1037- def insertBefore(self, node, refNode):
1038+ def insertBefore(self, node:'Element', refNode:'Element') -> None:
1039 index = self.element.index(refNode.element)
1040 if (node.element.__class__ == NavigableString and self.element.contents
1041 and self.element.contents[index-1].__class__ == NavigableString):
1042@@ -405,10 +453,10 @@ class Element(treebuilder_base.Node):
1043 self.element.insert(index, node.element)
1044 node.parent = self
1045
1046- def removeChild(self, node):
1047+ def removeChild(self, node:'Element') -> None:
1048 node.element.extract()
1049
1050- def reparentChildren(self, new_parent):
1051+ def reparentChildren(self, new_parent:'Element') -> None:
1052 """Move all of this tag's children into another tag."""
1053 # print("MOVE", self.element.contents)
1054 # print("FROM", self.element)
1055@@ -424,6 +472,10 @@ class Element(treebuilder_base.Node):
1056 if len(new_parent_element.contents) > 0:
1057 # The new parent already contains children. We will be
1058 # appending this tag's children to the end.
1059+
1060+ # We can make this assertion since we know new_parent has
1061+ # children.
1062+ assert new_parents_last_descendant is not None
1063 new_parents_last_child = new_parent_element.contents[-1]
1064 new_parents_last_descendant_next_element = new_parents_last_descendant.next_element
1065 else:
1066@@ -474,17 +526,21 @@ class Element(treebuilder_base.Node):
1067 # print("FROM", self.element)
1068 # print("TO", new_parent_element)
1069
1070- def cloneNode(self):
1071+ # TODO-TYPING: typeshed stubs are incorrect about this;
1072+ # cloneNode returns a new Node, not None.
1073+ def cloneNode(self) -> treebuilder_base.Node:
1074 tag = self.soup.new_tag(self.element.name, self.namespace)
1075 node = Element(tag, self.soup, self.namespace)
1076 for key,value in self.attributes:
1077 node.attributes[key] = value
1078 return node
1079
1080- def hasContent(self):
1081- return self.element.contents
1082+ # TODO-TYPING: typeshed stubs are incorrect about this;
1083+ # hasContent returns a boolean, not None.
1084+ def hasContent(self) -> bool:
1085+ return len(self.element.contents) > 0
1086
1087- def getNameTuple(self):
1088+ def getNameTuple(self) -> Tuple[str, str]:
1089 if self.namespace == None:
1090 return namespaces["html"], self.name
1091 else:
1092@@ -493,10 +549,10 @@ class Element(treebuilder_base.Node):
1093 nameTuple = property(getNameTuple)
1094
1095 class TextNode(Element):
1096- def __init__(self, element, soup):
1097+ def __init__(self, element:PageElement, soup:'BeautifulSoup'):
1098 treebuilder_base.Node.__init__(self, None)
1099 self.element = element
1100 self.soup = soup
1101
1102- def cloneNode(self):
1103- raise NotImplementedError
1104+ def cloneNode(self) -> treebuilder_base.Node:
1105+ raise NotImplementedError()
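The AttrList.__setitem__ change above splits a multi-valued attribute (such as ``class``) on whitespace before storing it, unless a cloned node has already been through the procedure. A standalone sketch of that logic; the ``nonwhitespace_re`` definition here is an assumption for illustration (bs4 keeps its own copy in bs4.element):

```python
import re

# Assumed to match bs4's nonwhitespace_re: runs of non-whitespace characters.
nonwhitespace_re = re.compile(r"\S+")

def as_cdata_list(value):
    """Split a whitespace-separated attribute value into a list,
    skipping values that have already undergone the procedure."""
    if isinstance(value, list):
        # A node that is being cloned may already hold a list.
        return value
    return nonwhitespace_re.findall(value)

print(as_cdata_list("main   nav  active"))  # ['main', 'nav', 'active']
print(as_cdata_list(["main", "nav"]))       # already split; returned unchanged
```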
1106diff --git a/bs4/builder/_htmlparser.py b/bs4/builder/_htmlparser.py
1107index 291f6c6..91cecf7 100644
1108--- a/bs4/builder/_htmlparser.py
1109+++ b/bs4/builder/_htmlparser.py
1110@@ -188,7 +188,7 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML):
1111 # later on. If so, we want to ignore it.
1112 self.already_closed_empty_element.append(name)
1113
1114- if self._root_tag is None:
1115+ if self._root_tag_name is None:
1116 self._root_tag_encountered(name)
1117
1118 def handle_endtag(self, name:str, check_already_closed:bool=True) -> None:
1119@@ -422,13 +422,23 @@ class HTMLParserTreeBuilder(HTMLTreeBuilder):
1120 dammit.declared_html_encoding,
1121 dammit.contains_replacement_characters)
1122
1123- def feed(self, markup:str):
1124+ def feed(self, markup:_RawMarkup) -> None:
1125 args, kwargs = self.parser_args
1126+
1127+ # HTMLParser.feed will only handle str, but
1128+ # BeautifulSoup.markup is allowed to be _RawMarkup, because
1129+ # it's set by the yield value of
1130+ # TreeBuilder.prepare_markup. Fortunately,
1131+ # HTMLParserTreeBuilder.prepare_markup always yields a str
1132+ # (UnicodeDammit.unicode_markup).
1133+ assert isinstance(markup, str)
1134+
1135 # We know BeautifulSoup calls TreeBuilder.initialize_soup
1136 # before calling feed(), so we can assume self.soup
1137 # is set.
1138 assert self.soup is not None
1139 parser = BeautifulSoupHTMLParser(self.soup, *args, **kwargs)
1140+
1141 try:
1142 parser.feed(markup)
1143 parser.close()
1144diff --git a/bs4/builder/_lxml.py b/bs4/builder/_lxml.py
1145index ba87e87..3dfe88a 100644
1146--- a/bs4/builder/_lxml.py
1147+++ b/bs4/builder/_lxml.py
1148@@ -269,7 +269,7 @@ class LXMLTreeBuilderForXML(TreeBuilder):
1149 for encoding in detector.encodings:
1150 yield (detector.markup, encoding, document_declared_encoding, False)
1151
1152- def feed(self, markup:Union[bytes,str]) -> None:
1153+ def feed(self, markup:_RawMarkup) -> None:
1154 io: IO
1155 if isinstance(markup, bytes):
1156 io = BytesIO(markup)
1157diff --git a/bs4/diagnose.py b/bs4/diagnose.py
1158index 201b879..c2202ad 100644
1159--- a/bs4/diagnose.py
1160+++ b/bs4/diagnose.py
1161@@ -9,7 +9,15 @@ from html.parser import HTMLParser
1162 import bs4
1163 from bs4 import BeautifulSoup, __version__
1164 from bs4.builder import builder_registry
1165-from typing import TYPE_CHECKING
1166+from typing import (
1167+ Any,
1168+ IO,
1169+ List,
1170+ Optional,
1171+ Tuple,
1172+ TYPE_CHECKING,
1173+)
1174+
1175 if TYPE_CHECKING:
1176 from bs4._typing import _IncomingMarkup
1177
1178@@ -78,7 +86,7 @@ def diagnose(data:_IncomingMarkup) -> None:
1179
1180 print(("-" * 80))
1181
1182-def lxml_trace(data, html:bool=True, **kwargs) -> None:
1183+def lxml_trace(data:_IncomingMarkup, html:bool=True, **kwargs:Any) -> None:
1184 """Print out the lxml events that occur during parsing.
1185
1186 This lets you see how lxml parses a document when no Beautiful
1187@@ -94,7 +102,8 @@ def lxml_trace(data, html:bool=True, **kwargs) -> None:
1188 recover = kwargs.pop('recover', True)
1189 if isinstance(data, str):
1190 data = data.encode("utf8")
1191- reader = BytesIO(data)
1192+ # Pass a file-like object through as-is; wrap bytes in BytesIO.
1193+ reader = data if isinstance(data, IO) else BytesIO(data)
1194 for event, element in etree.iterparse(
1195 reader, html=html, recover=recover, **kwargs
1196 ):
1197@@ -108,37 +117,40 @@ class AnnouncingParser(HTMLParser):
1198 document. The easiest way to do this is to call `htmlparser_trace`.
1199 """
1200
1201- def _p(self, s):
1202+ def _p(self, s:str) -> None:
1203 print(s)
1204
1205- def handle_starttag(self, name, attrs):
1206+ def handle_starttag(
1207+ self, name:str, attrs:List[Tuple[str, Optional[str]]],
1208+ handle_empty_element:bool=True
1209+ ) -> None:
1210 self._p(f"{name} {attrs} START")
1211
1212- def handle_endtag(self, name):
1213+ def handle_endtag(self, name:str, check_already_closed:bool=True) -> None:
1214 self._p("%s END" % name)
1215
1216- def handle_data(self, data):
1217+ def handle_data(self, data:str) -> None:
1218 self._p("%s DATA" % data)
1219
1220- def handle_charref(self, name):
1221+ def handle_charref(self, name:str) -> None:
1222 self._p("%s CHARREF" % name)
1223
1224- def handle_entityref(self, name):
1225+ def handle_entityref(self, name:str) -> None:
1226 self._p("%s ENTITYREF" % name)
1227
1228- def handle_comment(self, data):
1229+ def handle_comment(self, data:str) -> None:
1230 self._p("%s COMMENT" % data)
1231
1232- def handle_decl(self, data):
1233+ def handle_decl(self, data:str) -> None:
1234 self._p("%s DECL" % data)
1235
1236- def unknown_decl(self, data):
1237+ def unknown_decl(self, data:str) -> None:
1238 self._p("%s UNKNOWN-DECL" % data)
1239
1240- def handle_pi(self, data):
1241+ def handle_pi(self, data:str) -> None:
1242 self._p("%s PI" % data)
1243
1244-def htmlparser_trace(data):
1245+def htmlparser_trace(data:str) -> None:
1246 """Print out the HTMLParser events that occur during parsing.
1247
1248 This lets you see how HTMLParser parses a document when no
1249@@ -226,7 +238,7 @@ def benchmark_parsers(num_elements:int=100000) -> None:
1250 b = time.time()
1251 print(("Raw html5lib parsed the markup in %.2fs." % (b-a)))
1252
1253-def profile(num_elements:int=100000, parser:str="lxml"):
1254+def profile(num_elements:int=100000, parser:str="lxml") -> None:
1255 """Use Python's profiler on a randomly generated document."""
1256 filehandle = tempfile.NamedTemporaryFile()
1257 filename = filehandle.name
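AnnouncingParser above overrides every html.parser callback just to print the event. The same pattern, collecting events instead of printing them, can be sketched with the standard library alone; note this sketch uses the stock HTMLParser signatures, not the bs4-specific extra parameters shown in the diff:

```python
from html.parser import HTMLParser

class TracingParser(HTMLParser):
    """Record each parse event, in the spirit of bs4.diagnose's tracer."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, name, attrs):
        self.events.append(f"{name} START")

    def handle_endtag(self, name):
        self.events.append(f"{name} END")

    def handle_data(self, data):
        self.events.append(f"{data} DATA")

parser = TracingParser()
parser.feed("<b>hi</b>")
# parser.events == ['b START', 'hi DATA', 'b END']
```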
1258diff --git a/bs4/element.py b/bs4/element.py
1259index 83f4882..f4ab89c 100644
1260--- a/bs4/element.py
1261+++ b/bs4/element.py
1262@@ -44,6 +44,7 @@ if TYPE_CHECKING:
1263 from bs4 import BeautifulSoup
1264 from bs4.builder import TreeBuilder
1265 from bs4.dammit import _Encoding
1266+ from bs4.filter import ElementFilter
1267 from bs4.formatter import (
1268 _EntitySubstitutionFunction,
1269 _FormatterOrName,
1270@@ -901,7 +902,7 @@ class PageElement(object):
1271 limit:Optional[int],
1272 generator:Iterator[PageElement],
1273 _stacklevel:int=3,
1274- **kwargs:_StrainableAttribute) -> ResultSet[PageElement]:
1275+ **kwargs:_StrainableAttribute) -> ResultSet[PageElement]:
1276 """Iterates over a generator looking for things that match."""
1277 results: ResultSet[PageElement]
1278
1279@@ -912,11 +913,11 @@ class PageElement(object):
1280 DeprecationWarning, stacklevel=_stacklevel
1281 )
1282
1283- from bs4.strainer import SoupStrainer
1284- if isinstance(name, SoupStrainer):
1285- strainer = name
1286+ from bs4.filter import ElementFilter
1287+ if isinstance(name, ElementFilter):
1288+ matcher = name
1289 else:
1290- strainer = SoupStrainer(name, attrs, string, **kwargs)
1291+ matcher = SoupStrainer(name, attrs, string, **kwargs)
1292
1293 result: Iterable[PageElement]
1294 if string is None and not limit and not attrs and not kwargs:
1295@@ -924,7 +925,7 @@ class PageElement(object):
1296 # Optimization to find all tags.
1297 result = (element for element in generator
1298 if isinstance(element, Tag))
1299- return ResultSet(strainer, result)
1300+ return ResultSet(matcher, result)
1301 elif isinstance(name, str):
1302 # Optimization to find all tags with a given name.
1303 if name.count(':') == 1:
1304@@ -945,22 +946,25 @@ class PageElement(object):
1305 )
1306 ):
1307 result.append(element)
1308- return ResultSet(strainer, result)
1309+ return ResultSet(matcher, result)
1310+ return self.match(generator, matcher, limit)
1311+
1312+ def match(self, generator:Iterator[PageElement], matcher:ElementFilter, limit:Optional[int]=None) -> ResultSet[PageElement]:
1313+ """The most generic search method offered by Beautiful Soup.
1314
1315- results = ResultSet(strainer)
1316+ You can pass in your own technique for iterating over the tree, and your own
1317+ technique for matching items.
1318+ """
1319+ results:ResultSet = ResultSet(matcher)
1320 while True:
1321 try:
1322 i = next(generator)
1323 except StopIteration:
1324 break
1325 if i:
1326- # TODO: SoupStrainer.search is a confusing method
1327- # that needs to be redone, and this is where
1328- # it's being used.
1329- found = strainer.search(i)
1330- if found:
1331- results.append(found)
1332- if limit and len(results) >= limit:
1333+ if matcher.match(i):
1334+ results.append(i)
1335+ if limit is not None and len(results) >= limit:
1336 break
1337 return results
1338
1339@@ -1254,7 +1258,7 @@ class Declaration(PreformattedString):
1340 class Doctype(PreformattedString):
1341 """A `document type declaration <https://www.w3.org/TR/REC-xml/#dt-doctype>`_."""
1342 @classmethod
1343- def for_name_and_ids(cls, name:str, pub_id:str, system_id:str) -> Doctype:
1344+ def for_name_and_ids(cls, name:str, pub_id:Optional[str], system_id:Optional[str]) -> Doctype:
1345 """Generate an appropriate document type declaration for a given
1346 public ID and system ID.
1347
1348@@ -2503,12 +2507,12 @@ class Tag(PageElement):
1349 _PageElementT = TypeVar("_PageElementT", bound=PageElement)
1350 class ResultSet(List[_PageElementT], Generic[_PageElementT]):
1351 """A ResultSet is a list of `PageElement` objects, gathered as the result
1352- of matching a `SoupStrainer` against a parse tree. Basically, a list of
1353+ of matching an `ElementFilter` against a parse tree. Basically, a list of
1354 search results.
1355 """
1356- source: Optional[SoupStrainer]
1357+ source: Optional[ElementFilter]
1358
1359- def __init__(self, source:Optional[SoupStrainer], result: Iterable[_PageElementT]=()) -> None:
1360+ def __init__(self, source:Optional[ElementFilter], result: Iterable[_PageElementT]=()) -> None:
1361 super(ResultSet, self).__init__(result)
1362 self.source = source
1363
1364@@ -2522,4 +2526,4 @@ class ResultSet(List[_PageElementT], Generic[_PageElementT]):
1365 # import SoupStrainer itself into this module to preserve the
1366 # backwards compatibility of anyone who imports
1367 # bs4.element.SoupStrainer.
1368-from bs4.strainer import SoupStrainer
1369+from bs4.filter import SoupStrainer
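The new PageElement.match method above reduces the former SoupStrainer.search machinery to one loop: pull items from a generator, keep the ones the matcher accepts, and stop at the limit. An illustrative standalone version, with names of my own choosing rather than the bs4 API:

```python
def match_all(generator, matcher, limit=None):
    """Collect items from `generator` that satisfy `matcher`,
    stopping once `limit` matches have been gathered."""
    results = []
    for item in generator:
        # Mirrors the diff's `if i:` truthiness check before matching.
        if item and matcher(item):
            results.append(item)
            if limit is not None and len(results) >= limit:
                break
    return results

# Matching even numbers, stopping after two results.
evens = match_all(iter(range(10)), lambda n: n % 2 == 0, limit=2)
# evens == [2, 4]
```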
1370diff --git a/bs4/strainer.py b/bs4/filter.py
1371similarity index 60%
1372rename from bs4/strainer.py
1373rename to bs4/filter.py
1374index 15b289c..74e26d9 100644
1375--- a/bs4/strainer.py
1376+++ b/bs4/filter.py
1377@@ -25,6 +25,10 @@ from bs4._deprecation import _deprecated
1378 from bs4.element import NavigableString, PageElement, Tag
1379 from bs4._typing import (
1380 _AttributeValue,
1381+ _AttributeValues,
1382+ _AllowStringCreationFunction,
1383+ _AllowTagCreationFunction,
1384+ _PageElementMatchFunction,
1385 _TagMatchFunction,
1386 _StringMatchFunction,
1387 _StrainableElement,
1388@@ -33,13 +37,96 @@ from bs4._typing import (
1389 _StrainableString,
1390 )
1391
1392+
1393+class ElementFilter(object):
1394+ """ElementFilters encapsulate the logic necessary to decide:
1395+
1396+ 1. whether a PageElement (a tag or a string) matches a
1397+ user-specified query.
1398+
1399+ 2. whether a given sequence of markup found during initial parsing
1400+ should be turned into a PageElement, or simply discarded.
1401+
1402+ The base class is the simplest ElementFilter. By default, it
1403+ matches everything and allows all PageElements to be created. You
1404+ can make it more selective by passing in user-defined functions.
1405+
1406+ Most users of Beautiful Soup will never need to use
1407+ ElementFilter, or its more capable subclass
1408+ SoupStrainer. Instead, they will use the find_* methods, which
1409+ will convert their arguments into SoupStrainer objects and run them
1410+ against the tree.
1411+ """
1412+ match_hook: Optional[_PageElementMatchFunction]
1413+ allow_tag_creation_function: Optional[_AllowTagCreationFunction]
1414+ allow_string_creation_function: Optional[_AllowStringCreationFunction]
1415+
1416+ def __init__(
1417+ self, match_function:Optional[_PageElementMatchFunction]=None,
1418+ allow_tag_creation_function:Optional[_AllowTagCreationFunction]=None,
1419+ allow_string_creation_function:Optional[_AllowStringCreationFunction]=None):
1420+ self.match_function = match_function
1421+ self.allow_tag_creation_function = allow_tag_creation_function
1422+ self.allow_string_creation_function = allow_string_creation_function
1423+
1424+ @property
1425+ def excludes_everything(self) -> bool:
1426+ """Does this ElementFilter obviously exclude everything? If
1427+ so, Beautiful Soup will issue a warning if you try to use it
1428+ when parsing a document.
1429+
1430+ The ElementFilter might turn out to exclude everything even
1431+ if this returns False, but it won't do so in an obvious way.
1432+
1433+ The default ElementFilter excludes *nothing*, and we don't
1434+ have any way of answering questions about more complex
1435+ ElementFilters without running their hook functions, so the
1436+ base implementation always returns False.
1437+ """
1438+ return False
1439+
1440+ def match(self, element:PageElement) -> bool:
1441+ """Does the given PageElement match the rules set down by this
1442+ ElementFilter?
1443+
1444+ The base implementation delegates to the function passed in to
1445+ the constructor.
1446+ """
1447+ if not self.match_function:
1448+ return True
1449+ return self.match_function(element)
1450+
1451+ def allow_tag_creation(
1452+ self, nsprefix:Optional[str], name:str,
1453+ attrs:Optional[_AttributeValues]
1454+ ) -> bool:
1455+ """Based on the name and attributes of a tag, see whether this
1456+ ElementFilter will allow a Tag object to even be created.
1457+
1458+ :param name: The name of the prospective tag.
1459+ :param attrs: The attributes of the prospective tag.
1460+ """
1461+ if not self.allow_tag_creation_function:
1462+ return True
1463+ return self.allow_tag_creation_function(nsprefix, name, attrs)
1464+
1465+ def allow_string_creation(self, string:str) -> bool:
1466+ if not self.allow_string_creation_function:
1467+ return True
1468+ return self.allow_string_creation_function(string)
1469+
1470+
1471 class MatchRule(object):
1472+ """Each MatchRule encapsulates the logic behind a single argument
1473+ passed in to one of the Beautiful Soup find* methods.
1474+ """
1475+
1476 string: Optional[str]
1477 pattern: Optional[Pattern[str]]
1478 present: Optional[bool]
1479-
1480- # All MatchRule objects also have an attribute ``function``, but
1481- # the type of the function depends on the subclass.
1482+ # TODO-TYPING: All MatchRule objects also have an attribute
1483+ # ``function``, but the type of the function depends on the
1484+ # subclass.
1485
1486 def __init__(
1487 self,
1488@@ -72,7 +159,7 @@ class MatchRule(object):
1489 "At most one of string, pattern, function and present must be provided."
1490 )
1491
1492- def _base_match(self, string:str) -> Optional[bool]:
1493+ def _base_match(self, string:Optional[str]) -> Optional[bool]:
1494 """Run the 'cheap' portion of a match, trying to get an answer without
1495 calling a potentially expensive custom function.
1496
1497@@ -101,7 +188,7 @@ class MatchRule(object):
1498
1499 return None
1500
1501- def matches_string(self, string:str) -> bool:
1502+ def matches_string(self, string:Optional[str]) -> bool:
1503 _base_result = self._base_match(string)
1504 if _base_result is not None:
1505 # No need to invoke the test function.
1506@@ -125,6 +212,7 @@ class MatchRule(object):
1507 )
1508
1509 class TagNameMatchRule(MatchRule):
1510+ """A MatchRule implementing the rules for matches against tag name."""
1511 function: Optional[_TagMatchFunction]
1512
1513 def matches_tag(self, tag:Tag) -> bool:
1514@@ -140,19 +228,25 @@ class TagNameMatchRule(MatchRule):
1515 return False
1516
1517 class AttributeValueMatchRule(MatchRule):
1518+ """A MatchRule implementing the rules for matches against attribute value."""
1519 function: Optional[_StringMatchFunction]
1520
1521 class StringMatchRule(MatchRule):
1522+ """A MatchRule implementing the rules for matches against a NavigableString."""
1523 function: Optional[_StringMatchFunction]
1524
1525-class SoupStrainer(object):
1526- """Encapsulates a number of ways of matching a markup element (a tag
1527- or a string).
1528+class SoupStrainer(ElementFilter):
1529+ """The ElementFilter subclass used internally by Beautiful Soup.
1530
1531- These are primarily created internally and used to underpin the
1532- find_* methods, but you can create one yourself and pass it in as
1533- ``parse_only`` to the `BeautifulSoup` constructor, to parse a
1534- subset of a large document.
1535+ A SoupStrainer encapsulates the logic necessary to perform the
1536+ kind of matches supported by the find_* methods. SoupStrainers are
1537+ primarily created internally, but you can create one yourself and
1538+ pass it in as ``parse_only`` to the `BeautifulSoup` constructor,
1539+ to parse a subset of a large document.
1540+
1541+ Internally, SoupStrainer objects work by converting the
1542+ constructor arguments into MatchRule objects. Incoming
1543+ tags/markup are matched against those rules.
1544
1545 :param name: One or more restrictions on the tags found in a
1546 document.
1547@@ -226,6 +320,17 @@ class SoupStrainer(object):
1548 self.__string = string
1549
1550 @property
1551+ def excludes_everything(self) -> bool:
1552+ """Check whether the provided rules will obviously exclude
1553+ everything. (They might exclude everything even if this returns False,
1554+ but not in an obvious way.)
1555+ """
1556+ return True if (
1557+ self.string_rules and
1558+ (self.name_rules or self.attribute_rules)
1559+ ) else False
1560+
1561+ @property
1562 def string(self) -> Optional[_StrainableString]:
1563 ":meta private:"
1564 warnings.warn(f"Access to deprecated property string. (Look at .string_rules instead) -- Deprecated since version 4.13.0.", DeprecationWarning, stacklevel=2)
1565@@ -262,6 +367,15 @@ class SoupStrainer(object):
1566 yield rule_class(function=obj)
1567 elif isinstance(obj, Pattern):
1568 yield rule_class(pattern=obj)
1569+ elif hasattr(obj, 'search'):
1570+ # We do a little duck typing here to detect usage of the
1571+ # third-party regex library, whose pattern objects don't
1572+ # derive from re.Pattern.
1573+ #
1574+ # TODO-TYPING: Once we drop support for Python 3.7, we
1575+ # might be able to address this by defining an appropriate
1576+ # Protocol.
1577+ yield rule_class(pattern=obj)
1578 elif hasattr(obj, '__iter__'):
1579 for o in obj:
1580 if not isinstance(o, (bytes, str)) and hasattr(o, '__iter__'):
1581@@ -358,7 +472,7 @@ class SoupStrainer(object):
1582 else:
1583 attr_values = [cast(str, attr_value)]
1584
1585- def _match_attribute_value_helper(attr_values:Sequence[Optional[str]]):
1586+ def _match_attribute_value_helper(attr_values:Sequence[Optional[str]]) -> bool:
1587 for rule in rules:
1588 for attr_value in attr_values:
1589 if rule.matches_string(attr_value):
1590@@ -382,8 +496,8 @@ class SoupStrainer(object):
1591 [joined_attr_value]
1592 )
1593 return this_attr_match
1594-
1595- def allow_tag_creation(self, nsprefix:Optional[str], name:str, attrs:Optional[dict[str, str]]) -> bool:
1596+
1597+ def allow_tag_creation(self, nsprefix:Optional[str], name:str, attrs:Optional[_AttributeValues]) -> bool:
1598 """Based on the name and attributes of a tag, see whether this
1599 SoupStrainer will allow a Tag object to even be created.
1600
1601@@ -423,17 +537,25 @@ class SoupStrainer(object):
1602 return True
1603
1604 def allow_string_creation(self, string:str) -> bool:
1605+ """Based on the content of a markup string, see whether this
1606+ SoupStrainer will allow it to be instantiated as a
1607+ NavigableString object, or whether it should be ignored.
1608+ """
1609 if self.name_rules or self.attribute_rules:
1610 # A SoupStrainer that has name or attribute rules won't
1611 # match any strings; it's designed to match tags with
1612 # certain properties.
1613 return False
1614+ if not self.string_rules:
1615+ # A SoupStrainer with no string rules will match
1616+ # all strings.
1617+ return True
1618 if not self.matches_any_string_rule(string):
1619 return False
1620 return True
1621
1622 def matches_any_string_rule(self, string:str) -> bool:
1623- """See whether the content of a string, matches any of
1624+ """See whether the content of a string matches any of
1625 this SoupStrainer's string rules.
1626 """
1627 if not self.string_rules:
1628@@ -442,28 +564,37 @@ class SoupStrainer(object):
1629 if string_rule.matches_string(string):
1630 return True
1631 return False
1632-
1633-
1634+
1635+ def match(self, element:PageElement) -> bool:
1636+ """Does the given PageElement match the rules set down by this
1637+ SoupStrainer?
1638+
1639+ The find_* methods rely heavily on this method to find matches.
1640+
1641+ :param element: A PageElement.
1642+ :return: True if the element matches this SoupStrainer's rules; False otherwise.
1643+ """
1644+ if isinstance(element, Tag):
1645+ return self.matches_tag(element)
1646+ assert isinstance(element, NavigableString)
1647+ if not (self.name_rules or self.attribute_rules):
1648+ # A NavigableString can only match a SoupStrainer that
1649+ # does not define any name or attribute restrictions.
1650+ for rule in self.string_rules:
1651+ if rule.matches_string(element):
1652+ return True
1653+ return False
1654+
1655 @_deprecated("allow_tag_creation", "4.13.0")
1656- def search_tag(self, name, attrs):
1657+ def search_tag(self, name:str, attrs:Optional[_AttributeValues]) -> bool:
1658+ """A less elegant version of allow_tag_creation()."""
1659 ":meta private:"
1660 return self.allow_tag_creation(None, name, attrs)
1661
1662- def search(self, element:PageElement):
1663- # TODO: This method needs to be removed or redone. It is
1664- # very confusing but it's used everywhere.
1665- match = None
1666- if isinstance(element, Tag):
1667- match = self.matches_tag(element)
1668- else:
1669- assert isinstance(element, NavigableString)
1670- match = False
1671- if not (self.name_rules or self.attribute_rules):
1672- # A NavigableString can only match a SoupStrainer that
1673- # does not define any name or attribute restrictions.
1674- for rule in self.string_rules:
1675- if rule.matches_string(element):
1676- match = True
1677- break
1678- return element if match else False
1679+ @_deprecated("match", "4.13.0")
1680+ def search(self, element:PageElement) -> Optional[PageElement]:
1681+ """A less elegant version of match().
1682
1683+ :meta private:
1684+ """
1685+ return element if self.match(element) else None
1686diff --git a/bs4/tests/__init__.py b/bs4/tests/__init__.py
1687index 2ef7fd8..3ef999d 100644
1688--- a/bs4/tests/__init__.py
1689+++ b/bs4/tests/__init__.py
1690@@ -20,7 +20,7 @@ from bs4.element import (
1691 Stylesheet,
1692 Tag
1693 )
1694-from bs4.strainer import SoupStrainer
1695+from bs4.filter import SoupStrainer
1696 from bs4.builder import (
1697 DetectsXMLParsedAsHTML,
1698 XMLParsedAsHTMLWarning,
1699diff --git a/bs4/tests/test_strainer.py b/bs4/tests/test_filter.py
1700similarity index 56%
1701rename from bs4/tests/test_strainer.py
1702rename to bs4/tests/test_filter.py
1703index 4de03f0..8d5da70 100644
1704--- a/bs4/tests/test_strainer.py
1705+++ b/bs4/tests/test_filter.py
1706@@ -6,20 +6,108 @@ from . import (
1707 SoupTest,
1708 )
1709 from bs4.element import Tag
1710-from bs4.strainer import (
1711+from bs4.filter import (
1712 AttributeValueMatchRule,
1713+ ElementFilter,
1714 MatchRule,
1715 SoupStrainer,
1716 StringMatchRule,
1717 TagNameMatchRule,
1718 )
1719
1720-class TestMatchrule(SoupTest):
1721+class TestElementFilter(SoupTest):
1722+
1723+ def test_default_behavior(self):
1724+ # An unconfigured ElementFilter matches absolutely everything.
1725+ selector = ElementFilter()
1726+ assert not selector.excludes_everything
1727+ soup = self.soup("<a>text</a>")
1728+ tag = soup.a
1729+ string = tag.string
1730+ assert True == selector.match(soup)
1731+ assert True == selector.match(tag)
1732+ assert True == selector.match(string)
1733+ assert soup.find(selector).name == "a"
1734+
1735+ # And allows any incoming markup to be turned into PageElements.
1736+ assert True == selector.allow_tag_creation(None, "tag", None)
1737+ assert True == selector.allow_string_creation("some string")
1738+
1739+ def test_match(self):
1740+ def m(pe):
1741+ return (pe.string == "allow" or (
1742+ isinstance(pe, Tag) and pe.name=="allow"))
1743+
1744+ soup = self.soup("<allow>deny</allow>allow<deny>deny</deny>")
1745+ allow_tag = soup.allow
1746+ allow_string = soup.find(string="allow")
1747+ deny_tag = soup.deny
1748+ deny_string = soup.find(string="deny")
1749+
1750+ selector = ElementFilter(match_function=m)
1751+ assert True == selector.match(allow_tag)
1752+ assert True == selector.match(allow_string)
1753+ assert False == selector.match(deny_tag)
1754+ assert False == selector.match(deny_string)
1755+
1756+ # Since only the match function was provided, there is
1757+ # no effect on tag or string creation.
1758+ soup = self.soup("<a>text</a>", parse_only=selector)
1759+ assert "text" == soup.a.string
1760+
1761+ def test_allow_tag_creation(self):
1762+ def m(nsprefix, name, attrs):
1763+ return nsprefix=="allow" or name=="allow" or "allow" in attrs
1764+ selector = ElementFilter(allow_tag_creation_function=m)
1765+ f = selector.allow_tag_creation
1766+ assert True == f("allow", "ignore", {})
1767+ assert True == f("ignore", "allow", {})
1768+ assert True == f(None, "ignore", {"allow": "1"})
1769+ assert False == f("no", "no", {"no" : "nope"})
1770+
1771+ # Test the ElementFilter as a value for parse_only.
1772+ soup = self.soup(
1773+ "<deny>deny</deny> <allow>deny</allow> allow",
1774+ parse_only=selector
1775+ )
1776
1777- def _tuple(self, rule):
1778- if isinstance(rule.pattern, str):
1779- import pdb; pdb.set_trace()
1780+ # The <deny> tag was filtered out, but there was no effect on
1781+ # the strings, since only allow_tag_creation_function was
1782+ # defined.
1783+ assert 'deny <allow>deny</allow> allow' == soup.decode()
1784+
1785+ # Similarly, since match_function was not defined, this
1786+ # ElementFilter matches everything.
1787+ assert soup.find(selector) == "deny"
1788+
1789+ def test_allow_string_creation(self):
1790+ def m(s):
1791+ return s=="allow"
1792+ selector = ElementFilter(allow_string_creation_function=m)
1793+ f = selector.allow_string_creation
1794+ assert True == f("allow")
1795+ assert False == f("deny")
1796+ assert False == f("please allow")
1797+
1798+ # Test the ElementFilter as a value for parse_only.
1799+ soup = self.soup(
1800+ "<deny>deny</deny> <allow>deny</allow> allow",
1801+ parse_only=selector
1802+ )
1803+
1804+ # All incoming strings other than "allow" (even whitespace)
1805+ # were filtered out, but there was no effect on the tags,
1806+ # since only allow_string_creation_function was defined.
1807+ assert '<deny>deny</deny><allow>deny</allow>' == soup.decode()
1808+
1809+ # Similarly, since match_function was not defined, this
1810+ # ElementFilter matches everything.
1811+ assert soup.find(selector).name == "deny"
1812
1813+
1814+class TestMatchRule(SoupTest):
1815+
1816+ def _tuple(self, rule):
1817 return (
1818 rule.string,
1819 rule.pattern.pattern if rule.pattern else None,
1820@@ -155,6 +243,28 @@ class TestSoupStrainer(SoupTest):
1821 assert w2.filename == __file__
1822 assert msg == "Access to deprecated property text. (Look at .string_rules instead) -- Deprecated since version 4.13.0."
1823
1824+ def test_search_tag_deprecated(self):
1825+ strainer = SoupStrainer(name="a")
1826+ with warnings.catch_warnings(record=True) as w:
1827+ assert False == strainer.search_tag("b", {})
1828+ [w1] = w
1829+ msg = str(w1.message)
1830+ assert w1.filename == __file__
1831+ assert msg == "Call to deprecated method search_tag. (Replaced by allow_tag_creation) -- Deprecated since version 4.13.0."
1832+
1833+ def test_search_deprecated(self):
1834+ strainer = SoupStrainer(name="a")
1835+ soup = self.soup("<a></a><b></b>")
1836+ with warnings.catch_warnings(record=True) as w:
1837+ assert soup.a == strainer.search(soup.a)
1838+ assert None == strainer.search(soup.b)
1839+ [w1, w2] = w
1840+ msg = str(w1.message)
1841+ assert msg == str(w2.message)
1842+ assert w1.filename == __file__
1843+ assert msg == "Call to deprecated method search. (Replaced by match) -- Deprecated since version 4.13.0."
1844+
1845+ # Dummy function used within tests.
1846 def _match_function(x):
1847 pass
1848
1849@@ -213,7 +323,7 @@ class TestSoupStrainer(SoupTest):
1850 )
1851
1852 def test_constructor_with_overlapping_attributes(self):
1853- # If you specify the same attribute in arts and **kwargs, you end up
1854+ # If you specify the same attribute in args and **kwargs, you end up
1855 # with two different AttributeValueMatchRule objects.
1856
1857 # This happens whether you use the 'class' shortcut on attrs...
1858@@ -437,17 +547,24 @@ class TestSoupStrainer(SoupTest):
1859 # because the string restrictions can't be evaluated during
1860 # the parsing process, and the tag restrictions eliminate
1861 # any strings from consideration.
1862+ #
1863+ # We can detect this ahead of time, and warn about it,
1864+ # thanks to SoupStrainer.excludes_everything
1865 markup = "<a><b>one string<div>another string</div></b></a>"
1866
1867 with warnings.catch_warnings(record=True) as w:
1868+ assert True == soupstrainer.excludes_everything
1869 assert "" == self.soup(markup, parse_only=soupstrainer).decode()
1870 [warning] = w
1871 msg = str(warning.message)
1872 assert warning.filename == __file__
1873 assert str(warning.message).startswith(
1874- "Value for parse_only will exclude everything, since it puts restrictions on both tags and strings:"
1875+ "The given value for parse_only will exclude everything:"
1876 )
1877-
1878+
1879+ # The average SoupStrainer has excludes_everything=False
1880+ assert not SoupStrainer().excludes_everything
1881+
1882 def test_documentation_examples(self):
1883 """Medium-weight real-world tests based on the Beautiful Soup
1884 documentation.
1885diff --git a/bs4/tests/test_html5lib.py b/bs4/tests/test_html5lib.py
1886index b0f4384..9f6dfa1 100644
1887--- a/bs4/tests/test_html5lib.py
1888+++ b/bs4/tests/test_html5lib.py
1889@@ -4,7 +4,7 @@ import pytest
1890 import warnings
1891
1892 from bs4 import BeautifulSoup
1893-from bs4.strainer import SoupStrainer
1894+from bs4.filter import SoupStrainer
1895 from . import (
1896 HTML5LIB_PRESENT,
1897 HTML5TreeBuilderSmokeTest,
1898@@ -24,7 +24,7 @@ class TestHTML5LibBuilder(SoupTest, HTML5TreeBuilderSmokeTest):
1899 return HTML5TreeBuilder
1900
1901 def test_soupstrainer(self):
1902- # The html5lib tree builder does not support SoupStrainers.
1903+ # The html5lib tree builder does not support parse_only.
1904 strainer = SoupStrainer("b")
1905 markup = "<p>A <b>bold</b> statement.</p>"
1906 with warnings.catch_warnings(record=True) as w:
1907diff --git a/bs4/tests/test_lxml.py b/bs4/tests/test_lxml.py
1908index d450740..9fc04e0 100644
1909--- a/bs4/tests/test_lxml.py
1910+++ b/bs4/tests/test_lxml.py
1911@@ -14,7 +14,7 @@ from bs4 import (
1912 BeautifulStoneSoup,
1913 )
1914 from bs4.element import Comment, Doctype
1915-from bs4.strainer import SoupStrainer
1916+from bs4.filter import SoupStrainer
1917 from . import (
1918 HTMLTreeBuilderSmokeTest,
1919 XMLTreeBuilderSmokeTest,
1920diff --git a/bs4/tests/test_pageelement.py b/bs4/tests/test_pageelement.py
1921index 19b4d63..7dfdc22 100644
1922--- a/bs4/tests/test_pageelement.py
1923+++ b/bs4/tests/test_pageelement.py
1924@@ -10,7 +10,7 @@ from bs4.element import (
1925 Comment,
1926 ResultSet,
1927 )
1928-from bs4.strainer import SoupStrainer
1929+from bs4.filter import SoupStrainer
1930 from . import (
1931 SoupTest,
1932 )
1933diff --git a/bs4/tests/test_soup.py b/bs4/tests/test_soup.py
1934index 4f8ee1a..c95f380 100644
1935--- a/bs4/tests/test_soup.py
1936+++ b/bs4/tests/test_soup.py
1937@@ -27,7 +27,7 @@ from bs4.element import (
1938 Tag,
1939 NavigableString,
1940 )
1941-from bs4.strainer import SoupStrainer
1942+from bs4.filter import SoupStrainer
1943
1944 from . import (
1945 default_builder,
1946@@ -293,7 +293,7 @@ class TestWarnings(SoupTest):
1947 soup = self.soup("<a><b></b></a>", parse_only=strainer)
1948 warning = self._assert_warning(w, UserWarning)
1949 msg = str(warning.message)
1950- assert msg.startswith("Value for parse_only will exclude everything, since it puts restrictions on both tags and strings:")
1951+ assert msg.startswith("The given value for parse_only will exclude everything:")
1952
1953 def test_parseOnlyThese_renamed_to_parse_only(self):
1954 with warnings.catch_warnings(record=True) as w:
1955diff --git a/bs4/tests/test_tree.py b/bs4/tests/test_tree.py
1956index 606525f..43afb29 100644
1957--- a/bs4/tests/test_tree.py
1958+++ b/bs4/tests/test_tree.py
1959@@ -26,7 +26,7 @@ from bs4.element import (
1960 Tag,
1961 TemplateString,
1962 )
1963-from bs4.strainer import SoupStrainer
1964+from bs4.filter import SoupStrainer
1965 from . import (
1966 SoupTest,
1967 )
1968diff --git a/doc/index.rst b/doc/index.rst
1969index 7beff36..a414830 100755
1970--- a/doc/index.rst
1971+++ b/doc/index.rst
1972@@ -20,7 +20,7 @@ with examples. I show you what the library is good for, how it works,
1973 how to use it, how to make it do what you want, and what to do when it
1974 violates your expectations.
1975
1976-This document covers Beautiful Soup version 4.12.2. The examples in
1977+This document covers Beautiful Soup version 4.13.0. The examples in
1978 this documentation were written for Python 3.8.
1979
1980 You might be looking for the documentation for `Beautiful Soup 3
1981@@ -2577,6 +2577,11 @@ the human-visible content of the page.*
1982 either return the object itself, or nothing, so the only reason to do
1983 this is when you're iterating over a mixed list.*
1984
1985+*As of Beautiful Soup version 4.13.0, you can call .string on a
1986+NavigableString object. It will return the object itself, so again,
1987+the only reason to do this is when you're iterating over a mixed
1988+list.*
1989+
1990 Specifying the parser to use
1991 ============================
1992
1993@@ -2604,8 +2609,9 @@ specifying one of the following:
1994
1995 The section `Installing a parser`_ contrasts the supported parsers.
1996
1997-If you don't have an appropriate parser installed, Beautiful Soup will
1998-ignore your request and pick a different parser. Right now, the only
1999+If you ask for a parser that isn't installed, Beautiful Soup will
2000+raise an exception so that you don't inadvertently parse a document
2001+under an unknown set of rules. For example, right now, the only
2002 supported XML parser is lxml. If you don't have lxml installed, asking
2003 for an XML parser won't give you one, and asking for "lxml" won't work
2004 either.
2005@@ -3018,6 +3024,44 @@ been called on it::
2006 This is because two different :py:class:`Tag` objects can't occupy the same
2007 space at the same time.
2008
2009+Advanced search techniques
2010+==========================
2011+
2012+Almost everyone who uses Beautiful Soup to extract information from a
2013+document can get what they need using the methods described in
2014+`Searching the tree`_. However, there's a lower-level interface, the
2015+:py:class:`ElementFilter` class, which lets you define any matching
2016+behavior whatsoever.
2017+
2018+To use :py:class:`ElementFilter`, define a function that takes a
2019+:py:class:`PageElement` object (that is, it might be either a
2020+:py:class:`Tag` or a :py:class:`NavigableString`) and returns ``True``
2021+(if the element matches your custom criteria) or ``False`` (if it
2022+doesn't)::
2023+
2024+ [example goes here]
2025+
2026+Then, pass the function into an :py:class:`ElementFilter`::
2027+
2028+ from bs4.filter import ElementFilter
2029+ selector = ElementFilter(f)
2030+
2031+You can then pass the :py:class:`ElementFilter` object as the first
2032+argument to any of the `Searching the tree`_ methods::
2033+
2034+ [examples go here]
2035+
2036+Every potential match will be run through your function, and the only
2037+:py:class:`PageElement` objects returned will be the ones for which
2038+your function returned ``True``.
2039+
2040+Note that this is different from simply passing `a function`_ as the
2041+first argument to one of the search methods. That's an easy way to
2042+find a tag, but *only* tags will be considered. With an
2043+:py:class:`ElementFilter` you can write a single function that makes
2044+decisions about both tags and strings.
2045+
2046+
2047 Advanced parser customization
2048 =============================
2049
2050@@ -3111,14 +3155,6 @@ The :py:class:`SoupStrainer` behavior is as follows:
2051 * When a tag does not match, the tag itself is not kept, but parsing continues
2052 into its contents to look for other tags that do match.
2053
2054-You can also pass a :py:class:`SoupStrainer` into any of the methods covered
2055-in `Searching the tree`_. This probably isn't terribly useful, but I
2056-thought I'd mention it::
2057-
2058- soup = BeautifulSoup(html_doc, 'html.parser')
2059- soup.find_all(only_short_strings)
2060- # ['\n\n', '\n\n', 'Elsie', ',\n', 'Lacie', ' and\n', 'Tillie',
2061- # '\n\n', '...', '\n']
2062
2063 Customizing multi-valued attributes
2064 -----------------------------------
