Merge ~chrispitude/beautifulsoup:more-modular-soupstrainers-doc into beautifulsoup:more-modular-soupstrainers

Proposed by Chris Papademetrious
Status: Needs review
Proposed branch: ~chrispitude/beautifulsoup:more-modular-soupstrainers-doc
Merge into: beautifulsoup:more-modular-soupstrainers
Diff against target: 2065 lines (+492/-327)
22 files modified
CHANGELOG (+14/-8)
bs4/__init__.py (+13/-10)
bs4/_deprecation.py (+6/-6)
bs4/_typing.py (+19/-8)
bs4/builder/__init__.py (+76/-49)
bs4/builder/_html5lib.py (+104/-97)
bs4/builder/_htmlparser.py (+15/-14)
bs4/builder/_lxml.py (+44/-27)
bs4/dammit.py (+3/-3)
bs4/element.py (+65/-39)
bs4/filter.py (+6/-7)
bs4/tests/__init__.py (+33/-14)
bs4/tests/test_builder_registry.py (+3/-1)
bs4/tests/test_css.py (+8/-2)
bs4/tests/test_filter.py (+16/-4)
bs4/tests/test_fuzz.py (+2/-2)
bs4/tests/test_html5lib.py (+1/-2)
bs4/tests/test_htmlparser.py (+4/-2)
bs4/tests/test_lxml.py (+2/-2)
bs4/tests/test_soup.py (+4/-2)
bs4/tests/test_tree.py (+8/-8)
doc/index.rst (+46/-20)
Reviewer: Leonard Richardson (status: Pending)
Review via email: mp+459970@code.launchpad.net

Commit message

add an example for the new ElementFilter feature
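As a rough standalone sketch of the idea behind that example (a filter object wrapping a user-supplied match function that is consulted for every candidate element), all names below (SimpleFilter, Element, find_all) are invented for illustration and are not the Beautiful Soup API:

```python
# Hypothetical standalone sketch of a match-function-based element
# filter. SimpleFilter, Element, and find_all are illustrative names
# only, not Beautiful Soup's real classes.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Element:
    name: str
    text: str = ""

class SimpleFilter:
    """Wraps a single predicate and answers yes/no for each element."""
    def __init__(self, match_function: Callable[[Element], bool]):
        self.match_function = match_function

    def matches(self, element: Element) -> bool:
        return self.match_function(element)

def find_all(tree: List[Element], f: SimpleFilter) -> List[Element]:
    # Return every element the filter's predicate accepts.
    return [e for e in tree if f.matches(e)]

headers = SimpleFilter(lambda e: e.name in {"h1", "h2", "h3"})
tree = [Element("h1", "Title"), Element("p", "body"), Element("h2", "Section")]
print([e.name for e in find_all(tree, headers)])  # ['h1', 'h2']
```

The real feature plugs the same kind of predicate into Beautiful Soup's tree-matching methods; see the doc/index.rst hunk in the diff for the documented usage.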

Unmerged commits

b17956d... by Chris Papademetrious

add documentation example for ElementFilter

Signed-off-by: Chris Papademetrious <email address hidden>

4e37196... by Leonard Richardson

Added some basic typing to test methods that are called by the actual tests.

fe88554... by Leonard Richardson

Clarified the match protocol thing.

2618f24... by Leonard Richardson

Decided putting limit in ResultSet wasn't a good idea.

cc290e9... by Leonard Richardson

Added note that Protocol is in typing_extensions so it could be used now.

4c0618d... by Leonard Richardson

Made some typing system changes necessary to get the code to run under Python 3.8.

d5db19b... by Leonard Richardson

I feel like it's more consistent to use modified_attrs throughout.

792192c... by Leonard Richardson

Sorted out the _RawAttributeValue/_AttributeValue typing issue.

814f1a6... by Leonard Richardson

All the easy stuff is resolved now, and most of the medium-sized stuff except an issue with _RawAttributeValues versus _AttributeValues.

3898972... by Leonard Richardson

Got the type definitions in _deprecation.py a _little_ better.

Preview Diff

1diff --git a/CHANGELOG b/CHANGELOG
2index 162e3dc..41b1467 100644
3--- a/CHANGELOG
4+++ b/CHANGELOG
5@@ -1,7 +1,5 @@
6 = 4.13.0 (Unreleased)
7
8-TODO: we could stand to put limit inside ResultSet
9-
10 * This version drops support for Python 3.6. The minimum supported
11 major Python version for Beautiful Soup is now Python 3.7.
12
13@@ -33,12 +31,15 @@ TODO: we could stand to put limit inside ResultSet
14 you, since you probably use HTMLParserTreeBuilder, not
15 BeautifulSoupHTMLParser directly.
16
17-* The TreeBuilderForHtml5lib methods fragmentClass and getFragment
18- now raise NotImplementedError. These methods are called only by
19- html5lib's HTMLParser.parseFragment() method, which Beautiful Soup
20- doesn't use, so they were untested and should have never been called.
21- The getFragment() implementation was also slightly incorrect in a way
22- that should have caused obvious problems for anyone using it.
23+* The TreeBuilderForHtml5lib methods fragmentClass(), getFragment(),
24+ and testSerializer() now raise NotImplementedError. These methods
25+ are called only by html5lib's test suite, and Beautiful Soup isn't
26+ integrated into that test suite, so this code was long since unused and
27+ untested.
28+
29+ These methods are _not_ deprecated, since they are methods defined by
30+ html5lib. They may one day have real implementations, as part of a future
31+ effort to integrate Beautiful Soup into html5lib's test suite.
32
33 * If Tag.get_attribute_list() is used to access an attribute that's not set,
34 the return value is now an empty list rather than [None].
35@@ -73,6 +74,11 @@ TODO: we could stand to put limit inside ResultSet
36 * A SoupStrainer can now filter tag creation based on a tag's
37 namespaced name. Previously only the unqualified name could be used.
38
39+* Some of the arguments in the methods of LXMLTreeBuilderForXML
40+ have been renamed for consistency with the names lxml uses for those
41+ arguments in the superclass. This won't affect you unless you were
42+ calling methods like LXMLTreeBuilderForXML.start() directly.
43+
44 * All TreeBuilder constructors now take the empty_element_tags
45 argument. The sets of tags found in HTMLTreeBuilder.empty_element_tags and
46 HTMLTreeBuilder.block_elements are now in
47diff --git a/bs4/__init__.py b/bs4/__init__.py
48index 95bd48d..6f01a75 100644
49--- a/bs4/__init__.py
50+++ b/bs4/__init__.py
51@@ -72,7 +72,7 @@ from typing import (
52 cast,
53 Counter as CounterType,
54 Dict,
55- Iterable,
56+ Iterator,
57 List,
58 Sequence,
59 Optional,
60@@ -84,10 +84,11 @@ from typing import (
61
62 from bs4._typing import (
63 _AttributeValue,
64- _AttributeValues,
65 _Encoding,
66 _Encodings,
67 _IncomingMarkup,
68+ _RawAttributeValue,
69+ _RawAttributeValues,
70 _RawMarkup,
71 )
72
73@@ -451,7 +452,7 @@ class BeautifulSoup(Tag):
74 clone.original_encoding = self.original_encoding
75 return clone
76
77- def __getstate__(self) -> dict[str, Any]:
78+ def __getstate__(self) -> Dict[str, Any]:
79 # Frequently a tree builder can't be pickled.
80 d = dict(self.__dict__)
81 if 'builder' in d and d['builder'] is not None and not self.builder.picklable:
82@@ -467,7 +468,7 @@ class BeautifulSoup(Tag):
83 del d['_most_recent_element']
84 return d
85
86- def __setstate__(self, state: dict[str, Any]) -> None:
87+ def __setstate__(self, state: Dict[str, Any]) -> None:
88 # If necessary, restore the TreeBuilder by looking it up.
89 self.__dict__ = state
90 if isinstance(self.builder, type):
91@@ -613,11 +614,11 @@ class BeautifulSoup(Tag):
92 name:str,
93 namespace:Optional[str]=None,
94 nsprefix:Optional[str]=None,
95- attrs:_AttributeValues={},
96+ attrs:_RawAttributeValues={},
97 sourceline:Optional[int]=None,
98 sourcepos:Optional[int]=None,
99 string:Optional[str]=None,
100- **kwattrs:_AttributeValue,
101+ **kwattrs:_RawAttributeValue,
102 ) -> Tag:
103 """Create a new Tag associated with this BeautifulSoup object.
104
105@@ -664,7 +665,7 @@ class BeautifulSoup(Tag):
106 # The user may want us to use some other class (hopefully a
107 # custom subclass) instead of the one we'd use normally.
108 container = cast(
109- type[NavigableString],
110+ Type[NavigableString],
111 self.element_classes.get(container, container)
112 )
113
114@@ -894,14 +895,16 @@ class BeautifulSoup(Tag):
115
116 def handle_starttag(
117 self, name:str, namespace:Optional[str],
118- nsprefix:Optional[str], attrs:_AttributeValues,
119+ nsprefix:Optional[str], attrs:_RawAttributeValues,
120 sourceline:Optional[int]=None, sourcepos:Optional[int]=None,
121 namespaces:Optional[Dict[str, str]]=None) -> Optional[Tag]:
122 """Called by the tree builder when a new tag is encountered.
123
124 :param name: Name of the tag.
125 :param nsprefix: Namespace prefix for the tag.
126- :param attrs: A dictionary of attribute values.
127+ :param attrs: A dictionary of attribute values. Note that
128+ attribute values are expected to be simple strings; processing
129+ of multi-valued attributes such as "class" comes later.
130 :param sourceline: The line number where this tag was found in its
131 source document.
132 :param sourcepos: The character position within `sourceline` where this
133@@ -964,7 +967,7 @@ class BeautifulSoup(Tag):
134 def decode(self, indent_level:Optional[int]=None,
135 eventual_encoding:_Encoding=DEFAULT_OUTPUT_ENCODING,
136 formatter:Union[Formatter,str]="minimal",
137- iterator:Optional[Iterable[PageElement]]=None,
138+ iterator:Optional[Iterator[PageElement]]=None,
139 **kwargs:Any) -> str:
140 """Returns a string representation of the parse tree
141 as a full HTML or XML document.
142diff --git a/bs4/_deprecation.py b/bs4/_deprecation.py
143index febc1b3..86b22a2 100644
144--- a/bs4/_deprecation.py
145+++ b/bs4/_deprecation.py
146@@ -17,7 +17,7 @@ from typing import (
147 Callable,
148 )
149
150-def _deprecated_alias(old_name, new_name, version):
151+def _deprecated_alias(old_name:str, new_name:str, version:str):
152 """Alias one attribute name to another for backward compatibility
153
154 :meta private:
155@@ -29,23 +29,23 @@ def _deprecated_alias(old_name, new_name, version):
156 return getattr(self, new_name)
157
158 @alias.setter
159- def alias(self, value:str)->Any:
160+ def alias(self, value:str) -> None:
161 ":meta private:"
162 warnings.warn(f"Write to deprecated property {old_name}. (Replaced by {new_name}) -- Deprecated since version {version}.", DeprecationWarning, stacklevel=2)
163 return setattr(self, new_name, value)
164 return alias
165
166-def _deprecated_function_alias(old_name:str, new_name:str, version:str) -> Callable:
167- def alias(self, *args, **kwargs):
168+def _deprecated_function_alias(old_name:str, new_name:str, version:str) -> Callable[[Any], Any]:
169+ def alias(self, *args:Any, **kwargs:Any) -> Any:
170 ":meta private:"
171 warnings.warn(f"Call to deprecated method {old_name}. (Replaced by {new_name}) -- Deprecated since version {version}.", DeprecationWarning, stacklevel=2)
172 return getattr(self, new_name)(*args, **kwargs)
173 return alias
174
175 def _deprecated(replaced_by:str, version:str) -> Callable:
176- def deprecate(func):
177+ def deprecate(func:Callable) -> Callable:
178 @functools.wraps(func)
179- def with_warning(*args, **kwargs):
180+ def with_warning(*args:Any, **kwargs:Any) -> Any:
181 ":meta private:"
182 warnings.warn(
183 f"Call to deprecated method {func.__name__}. (Replaced by {replaced_by}) -- Deprecated since version {version}.",
184diff --git a/bs4/_typing.py b/bs4/_typing.py
185index ab8f7a0..7fff292 100644
186--- a/bs4/_typing.py
187+++ b/bs4/_typing.py
188@@ -8,7 +8,12 @@
189 # * In 3.10, TypeAlias gains capabilities that can be used to
190 # improve the tree matching types (I don't remember what, exactly).
191 # * 3.8 defines the Protocol type, which can be used to do duck typing
192-# in a statically checkable way.
193+# in a statically checkable way. (Protocols are also in typing_extensions,
194+# so I could add this now in the couple places it's needed.)
195+# * In 3.9 it's possible to specialize the re.Match type,
196+# e.g. re.Match[str]. In 3.8 there's a typing.re namespace for this,
197+# but it's removed in 3.12, so to support the widest possible set of
198+# versions I'm not using it.
199
200 import re
201 from typing_extensions import TypeAlias
202@@ -48,16 +53,22 @@ _InvertedNamespaceMapping:TypeAlias = Dict[_NamespaceURL, _NamespacePrefix]
203
204 # Aliases for the attribute values associated with HTML/XML tags.
205 #
206-# Note that these are attribute values in their final form, as stored
207-# in the `Tag` class. Different parsers present attributes to the
208-# `TreeBuilder` subclasses in different formats, which are not defined
209-# here.
210+# These are the relatively unprocessed values Beautiful Soup expects
211+# to come from a `TreeBuilder`.
212+_RawAttributeValue: TypeAlias = str
213+_RawAttributeValues: TypeAlias = Dict[str, _RawAttributeValue]
214+
215+# These are attribute values in their final form, as stored in the
216+# `Tag` class, after they have been processed and (in some cases)
217+# split into lists of strings.
218 _AttributeValue: TypeAlias = Union[str, Iterable[str]]
219 _AttributeValues: TypeAlias = Dict[str, _AttributeValue]
220
221-# The most common form in which attribute values are passed in from a
222-# parser.
223-_RawAttributeValues: TypeAlias = dict[str, str]
224+# The methods that deal with turning _RawAttributeValues into
225+# _AttributeValues may be called several times, even after the values
226+# are already processed (e.g. when cloning a tag), so they need to
227+# be able to accommodate both possibilities.
228+_RawOrProcessedAttributeValues:TypeAlias = Union[_RawAttributeValues, _AttributeValues]
229
230 # Aliases to represent the many possibilities for matching bits of a
231 # parse tree.
232diff --git a/bs4/builder/__init__.py b/bs4/builder/__init__.py
233index b59513e..29726c3 100644
234--- a/bs4/builder/__init__.py
235+++ b/bs4/builder/__init__.py
236@@ -33,15 +33,20 @@ from bs4.element import (
237 nonwhitespace_re
238 )
239
240+from bs4._typing import (
241+ _AttributeValues,
242+ _RawAttributeValue,
243+)
244 if TYPE_CHECKING:
245 from bs4 import BeautifulSoup
246 from bs4.element import (
247 NavigableString, Tag,
248- _AttributeValues, _AttributeValue,
249 )
250 from bs4._typing import (
251+ _AttributeValue,
252 _Encoding,
253 _Encodings,
254+ _RawOrProcessedAttributeValues,
255 _RawMarkup,
256 )
257
258@@ -75,7 +80,7 @@ class TreeBuilderRegistry(object):
259 builders_for_feature: Dict[str, List[Type[TreeBuilder]]]
260 builders: List[Type[TreeBuilder]]
261
262- def __init__(self):
263+ def __init__(self) -> None:
264 self.builders_for_feature = defaultdict(list)
265 self.builders = []
266
267@@ -233,7 +238,7 @@ class TreeBuilder(object):
268 #: no contents--that is, using XML rules. HTMLTreeBuilder
269 #: defines a different set of DEFAULT_EMPTY_ELEMENT_TAGS based on the
270 #: HTML 4 and HTML5 standards.
271- DEFAULT_EMPTY_ELEMENT_TAGS: Optional[Set] = None
272+ DEFAULT_EMPTY_ELEMENT_TAGS: Optional[Set[str]] = None
273
274 #: Most parsers don't keep track of line numbers.
275 TRACKS_LINE_NUMBERS: bool = False
276@@ -347,7 +352,7 @@ class TreeBuilder(object):
277 """
278 return False
279
280- def _replace_cdata_list_attribute_values(self, tag_name:str, attrs:_AttributeValues):
281+ def _replace_cdata_list_attribute_values(self, tag_name:str, attrs:_RawOrProcessedAttributeValues) -> _AttributeValues:
282 """When an attribute value is associated with a tag that can
283 have multiple values for that attribute, convert the string
284 value to a list of strings.
285@@ -359,88 +364,106 @@ class TreeBuilder(object):
286 :param tag_name: The name of a tag.
287 :param attrs: A dictionary containing the tag's attributes.
288 Any appropriate attribute values will be modified in place.
289+ :return: The modified dictionary that was originally passed in.
290 """
291- if not attrs:
292- return attrs
293- if self.cdata_list_attributes:
294- universal: Set[str] = self.cdata_list_attributes.get('*', set())
295- tag_specific = self.cdata_list_attributes.get(
296- tag_name.lower(), None)
297- for attr in list(attrs.keys()):
298- values: _AttributeValue
299- if attr in universal or (tag_specific and attr in tag_specific):
300- # We have a "class"-type attribute whose string
301- # value is a whitespace-separated list of
302- # values. Split it into a list.
303- value = attrs[attr]
304- if isinstance(value, str):
305- values = nonwhitespace_re.findall(value)
306- else:
307- # html5lib sometimes calls setAttributes twice
308- # for the same tag when rearranging the parse
309- # tree. On the second call the attribute value
310- # here is already a list. If this happens,
311- # leave the value alone rather than trying to
312- # split it again.
313- values = value
314- attrs[attr] = values
315- return attrs
316+
317+ # First, cast the attrs dict to _AttributeValues. This might
318+ # not be accurate yet, but it will be by the time this method
319+ # returns.
320+ modified_attrs = cast(_AttributeValues, attrs)
321+ if not modified_attrs or not self.cdata_list_attributes:
322+ # Nothing to do.
323+ return modified_attrs
324+
325+ # There is at least a possibility that we need to modify one of
326+ # the attribute values.
327+ universal: Set[str] = self.cdata_list_attributes.get('*', set())
328+ tag_specific = self.cdata_list_attributes.get(
329+ tag_name.lower(), None)
330+ for attr in list(modified_attrs.keys()):
331+ modified_value:_AttributeValue
332+ if attr in universal or (tag_specific and attr in tag_specific):
333+ # We have a "class"-type attribute whose string
334+ # value is a whitespace-separated list of
335+ # values. Split it into a list.
336+ original_value:_AttributeValue = modified_attrs[attr]
337+ if isinstance(original_value, _RawAttributeValue):
338+ # This is a _RawAttributeValue (a string) that
339+ # needs to be split into a list so it can be an
340+ # _AttributeValue.
341+ modified_value = nonwhitespace_re.findall(original_value)
342+ else:
343+ # html5lib calls setAttributes twice for the
344+ # same tag when rearranging the parse tree. On
345+ # the second call the attribute value here is
346+ # already a list. This can also happen when a
347+ # Tag object is cloned. If this happens, leave
348+ # the value alone rather than trying to split
349+ # it again.
350+ modified_value = original_value
351+ modified_attrs[attr] = modified_value
352+ return modified_attrs
353
354 class SAXTreeBuilder(TreeBuilder):
355 """A Beautiful Soup treebuilder that listens for SAX events.
356
357- This is not currently used for anything, but it demonstrates
358- how a simple TreeBuilder would work.
359+ This is not currently used for anything, and it will be removed
360+ soon. It was a good idea, but it wasn't properly integrated into the
361+ rest of Beautiful Soup, so there have been long stretches where it
362+ hasn't worked properly.
363 """
364-
365- def __init__(self, *args, **kwargs):
366+ def __init__(self, *args:Any, **kwargs:Any) -> None:
367 warnings.warn(
368- f"The SAXTreeBuilder class was deprecated in 4.13.0. It is completely untested and probably doesn't work; use at your own risk.",
369+ f"The SAXTreeBuilder class was deprecated in 4.13.0 and will be removed soon thereafter. It is completely untested and probably doesn't work; do not use it.",
370 DeprecationWarning,
371 stacklevel=2
372 )
373 super(SAXTreeBuilder, self).__init__(*args, **kwargs)
374
375- def feed(self, markup:_RawMarkup):
376+ def feed(self, markup:_RawMarkup) -> None:
377 raise NotImplementedError()
378
379- def close(self):
380+ def close(self) -> None:
381 pass
382
383- def startElement(self, name, attrs):
384+ def startElement(self, name:str, attrs:Dict[str,str]) -> None:
385 attrs = dict((key[1], value) for key, value in list(attrs.items()))
386 #print("Start %s, %r" % (name, attrs))
387- self.soup.handle_starttag(name, attrs)
388+ assert self.soup is not None
389+ self.soup.handle_starttag(name, None, None, attrs)
390
391- def endElement(self, name):
392+ def endElement(self, name:str) -> None:
393 #print("End %s" % name)
394+ assert self.soup is not None
395 self.soup.handle_endtag(name)
396
397- def startElementNS(self, nsTuple, nodeName, attrs):
398+ def startElementNS(self, nsTuple:Tuple[str,str],
399+ nodeName:str, attrs:Dict[str,str]) -> None:
400 # Throw away (ns, nodeName) for now.
401 self.startElement(nodeName, attrs)
402
403- def endElementNS(self, nsTuple, nodeName):
404+ def endElementNS(self, nsTuple:Tuple[str,str], nodeName:str) -> None:
405 # Throw away (ns, nodeName) for now.
406 self.endElement(nodeName)
407 #handler.endElementNS((ns, node.nodeName), node.nodeName)
408
409- def startPrefixMapping(self, prefix, nodeValue):
410+ def startPrefixMapping(self, prefix:str, nodeValue:str) -> None:
411 # Ignore the prefix for now.
412 pass
413
414- def endPrefixMapping(self, prefix):
415+ def endPrefixMapping(self, prefix:str) -> None:
416 # Ignore the prefix for now.
417 # handler.endPrefixMapping(prefix)
418 pass
419
420- def characters(self, content):
421+ def characters(self, content:str) -> None:
422+ assert self.soup is not None
423 self.soup.handle_data(content)
424
425- def startDocument(self):
426+ def startDocument(self) -> None:
427 pass
428
429- def endDocument(self):
430+ def endDocument(self) -> None:
431 pass
432
433
434@@ -620,13 +643,13 @@ class DetectsXMLParsedAsHTML(object):
435 return False
436 markup = markup[:500]
437 if isinstance(markup, bytes):
438- markup_b = cast(bytes, markup)
439+ markup_b:bytes = markup
440 looks_like_xml = (
441 markup_b.startswith(cls.XML_PREFIX_B)
442 and not cls.LOOKS_LIKE_HTML_B.search(markup)
443 )
444 else:
445- markup_s = cast(str, markup)
446+ markup_s:str = markup
447 looks_like_xml = (
448 markup_s.startswith(cls.XML_PREFIX)
449 and not cls.LOOKS_LIKE_HTML.search(markup)
450@@ -650,9 +673,13 @@ class DetectsXMLParsedAsHTML(object):
451 self._first_processing_instruction = None
452 self._root_tag_name = None
453
454- def _document_might_be_xml(self, processing_instruction:str):
455+ def _document_might_be_xml(self, processing_instruction:str) -> None:
456 """Call this method when encountering an XML declaration, or a
457 "processing instruction" that might be an XML declaration.
458+
459+ This helps Beautiful Soup detect potential issues later, if
460+ the XML document turns out to be a non-XHTML document that's
461+ being parsed as XML.
462 """
463 if (self._first_processing_instruction is not None
464 or self._root_tag_name is not None):
465diff --git a/bs4/builder/_html5lib.py b/bs4/builder/_html5lib.py
466index 2ea556c..51e3c97 100644
467--- a/bs4/builder/_html5lib.py
468+++ b/bs4/builder/_html5lib.py
469@@ -12,6 +12,7 @@ from typing import (
470 Iterable,
471 List,
472 Optional,
473+ TypeAlias,
474 TYPE_CHECKING,
475 Tuple,
476 Union,
477@@ -54,6 +55,7 @@ if TYPE_CHECKING:
478 from bs4 import BeautifulSoup
479
480 from html5lib.treebuilders import base as treebuilder_base
481+from html5lib.treewalkers import base as treewalker_base
482
483
484 class HTML5TreeBuilder(HTMLTreeBuilder):
485@@ -138,6 +140,14 @@ class HTML5TreeBuilder(HTMLTreeBuilder):
486 # HTMLBinaryInputStream.__init__.
487 extra_kwargs['override_encoding'] = self.user_specified_encoding
488
489+ # TODO-TYPING: typeshed stub says the second argument to
490+ # HTMLParser.parse is scripting:bool, but the implementation
491+ # treats scripting as one of the kwargs. scripting:bool isn't
492+ # called out separately until we get down into _parse(), and
493+ # there it's the fourth argument, not the second. I'm not
494+ # sure what the stub ought to look like, but I'm confident
495+ # enough that it's better to leave this alone, rather than
496+ # change this call to get rid of the warning.
497 doc = parser.parse(markup, **extra_kwargs)
498
499 # Set the character encoding detected by the tokenizer.
500@@ -146,6 +156,10 @@ class HTML5TreeBuilder(HTMLTreeBuilder):
501 # charEncoding to UTF-8 if it gets Unicode input.
502 doc.original_encoding = None
503 else:
504+ # TODO-TYPING HTMLParser.tokenizer is set by
505+ # HTMLParser._parse(), so it's definitely set by this
506+ # point, but it's not defined as an instance variable, so
507+ # this line gives a warning.
508 original_encoding = parser.tokenizer.stream.charEncoding[0]
509 # The encoding is an html5lib Encoding object. We want to
510 # use a string for compatibility with other tree builders.
511@@ -231,72 +245,35 @@ class TreeBuilderForHtml5lib(treebuilder_base.TreeBuilder):
512
513 def fragmentClass(self) -> 'Element':
514 """This is only used by html5lib HTMLParser.parseFragment(),
515- which is never used by Beautiful Soup."""
516+ which is never used by Beautiful Soup, only by the html5lib
517+ unit tests. Since we don't currently hook into those tests,
518+ the implementation is left blank.
519+ """
520 raise NotImplementedError()
521
522 def getFragment(self) -> 'Element':
523- """This is only used by html5lib HTMLParser.parseFragment,
524- which is never used by Beautiful Soup."""
525+ """This is only used by the html5lib unit tests. Since we
526+ don't currently hook into those tests, the implementation is
527+ left blank.
528+ """
529 raise NotImplementedError()
530
531 def appendChild(self, node:'Element') -> None:
532- # TODO: This code is not covered by the BS4 tests.
533+ # TODO: This code is not covered by the BS4 tests, and
534+ # apparently not triggered by the html5lib test suite either.
535 self.soup.append(node.element)
536
537 def getDocument(self) -> 'BeautifulSoup':
538 return self.soup
539
540 # TODO-TYPING: typeshed stubs are incorrect about this;
541- # cloneNode returns a str, not None.
542+ # testSerializer returns a str, not None.
543 def testSerializer(self, element:'Element') -> str:
544- from bs4 import BeautifulSoup
545- rv = []
546- doctype_re = re.compile(r'^(.*?)(?: PUBLIC "(.*?)"(?: "(.*?)")?| SYSTEM "(.*?)")?$')
547-
548- def serializeElement(element:Union['Element', PageElement], indent=0) -> None:
549- if isinstance(element, BeautifulSoup):
550- pass
551- if isinstance(element, Doctype):
552- m = doctype_re.match(element)
553- if m is not None:
554- name = m.group(1)
555- if m.lastindex is not None and m.lastindex > 1:
556- publicId = m.group(2) or ""
557- systemId = m.group(3) or m.group(4) or ""
558- rv.append("""|%s<!DOCTYPE %s "%s" "%s">""" %
559- (' ' * indent, name, publicId, systemId))
560- else:
561- rv.append("|%s<!DOCTYPE %s>" % (' ' * indent, name))
562- else:
563- rv.append("|%s<!DOCTYPE >" % (' ' * indent,))
564- elif isinstance(element, Comment):
565- rv.append("|%s<!-- %s -->" % (' ' * indent, element))
566- elif isinstance(element, NavigableString):
567- rv.append("|%s\"%s\"" % (' ' * indent, element))
568- elif isinstance(element, Element):
569- if element.namespace:
570- name = "%s %s" % (prefixes[element.namespace],
571- element.name)
572- else:
573- name = element.name
574- rv.append("|%s<%s>" % (' ' * indent, name))
575- if element.attrs:
576- attributes = []
577- for name, value in list(element.attrs.items()):
578- if isinstance(name, NamespacedAttribute):
579- name = "%s %s" % (prefixes[name.namespace], name.name)
580- if isinstance(value, list):
581- value = " ".join(value)
582- attributes.append((name, value))
583-
584- for name, value in sorted(attributes):
585- rv.append('|%s%s="%s"' % (' ' * (indent + 2), name, value))
586- indent += 2
587- for child in element.children:
588- serializeElement(child, indent)
589- serializeElement(element, 0)
590-
591- return "\n".join(rv)
592+ """This is only used by the html5lib unit tests. Since we
593+ don't currently hook into those tests, the implementation is
594+ left blank.
595+ """
596+ raise NotImplementedError()
597
598 class AttrList(object):
599 """Represents a Tag's attributes in a way compatible with html5lib."""
600@@ -340,11 +317,28 @@ class AttrList(object):
601 def __contains__(self, name:str) -> bool:
602 return name in list(self.attrs.keys())
603
604+class BeautifulSoupNode(treebuilder_base.Node):
605+ element:PageElement
606+ soup:'BeautifulSoup'
607+ namespace:Optional[_NamespaceURL]
608+
609+ @property
610+ def nodeType(self) -> int:
611+ """Return the html5lib constant corresponding to the type of
612+ the underlying DOM object.
613
614-class Element(treebuilder_base.Node):
615+ NOTE: This property is only accessed by the html5lib test
616+ suite, not by Beautiful Soup proper.
617+ """
618+ raise NotImplementedError()
619
620+ # TODO-TYPING: typeshed stubs are incorrect about this;
621+ # cloneNode returns a new Node, not None.
622+ def cloneNode(self) -> treebuilder_base.Node:
623+ raise NotImplementedError()
624+
625+class Element(BeautifulSoupNode):
626 element:Tag
627- soup:'BeautifulSoup'
628 namespace:Optional[_NamespaceURL]
629
630 def __init__(self, element:Tag, soup:'BeautifulSoup',
631@@ -354,30 +348,20 @@ class Element(treebuilder_base.Node):
632 self.soup = soup
633 self.namespace = namespace
634
635- def appendChild(self, node:'Element') -> None:
636- string_child = child = None
637- if isinstance(node, str):
638- # Some other piece of code decided to pass in a string
639- # instead of creating a TextElement object to contain the
640- # string. This should not ever happen.
641- string_child = child = node
642- elif isinstance(node, Tag):
643- # Some other piece of code decided to pass in a Tag
644- # instead of creating an Element object to contain the
645- # Tag. This should not ever happen.
646- child = node
647- elif node.element.__class__ == NavigableString:
648+ def appendChild(self, node:'BeautifulSoupNode') -> None:
649+ string_child:Optional[NavigableString] = None
650+ child:PageElement
651+ if type(node.element) == NavigableString:
652 string_child = child = node.element
653- node.parent = self
654 else:
655 child = node.element
656- node.parent = self
657+ node.parent = self
658
659- if not isinstance(child, str) and child is not None and child.parent is not None:
660+ if child is not None and child.parent is not None and not isinstance(child, str):
661 node.element.extract()
662
663 if (string_child is not None and self.element.contents
664- and self.element.contents[-1].__class__ == NavigableString):
665+ and type(self.element.contents[-1]) == NavigableString):
666 # We are appending a string onto another string.
667 # TODO This has O(n^2) performance, for input like
668 # "a</a>a</a>a</a>..."
669@@ -413,18 +397,36 @@ class Element(treebuilder_base.Node):
670 return {}
671 return AttrList(self.element)
672
673- def setAttributes(self, attributes:Optional[Dict]) -> None:
674+ # An HTML5lib attribute name may either be a single string,
675+ # or a tuple (namespace, name).
676+ _Html5libAttributeName: TypeAlias = Union[str, Tuple[str, str]]
677+ # Now we can define the type this method accepts as a dictionary
678+ # mapping those attribute names to single string values.
679+ _Html5libAttributes: TypeAlias = Dict[_Html5libAttributeName, str]
680+ def setAttributes(self, attributes:Optional[_Html5libAttributes]) -> None:
681 if attributes is not None and len(attributes) > 0:
682+
683+ # Replace any namespaced attributes with
684+ # NamespacedAttribute objects.
685 for name, value in list(attributes.items()):
686 if isinstance(name, tuple):
687 new_name = NamespacedAttribute(*name)
688 del attributes[name]
689 attributes[new_name] = value
690
691+ # We can now cast attributes to the type of Dict
692+ # used by Beautiful Soup.
693+ normalized_attributes = cast(_AttributeValues, attributes)
694+
695+ # Values for tags like 'class' came in as single strings;
696+ # replace them with lists of strings as appropriate.
697 self.soup.builder._replace_cdata_list_attribute_values(
698- self.name, attributes)
699- for name, value in list(attributes.items()):
700- self.element[name] = value
701+ self.name, normalized_attributes)
702+
703+ # Then set the attributes on the Tag associated with this
704+ # BeautifulSoupNode.
705+ for name, value_or_values in list(normalized_attributes.items()):
706+ self.element[name] = value_or_values
707
708 # The attributes may contain variables that need substitution.
709 # Call set_up_substitutions manually.
710@@ -434,19 +436,20 @@ class Element(treebuilder_base.Node):
711 self.soup.builder.set_up_substitutions(self.element)
712 attributes = property(getAttributes, setAttributes)
713
714- def insertText(self, data:str, insertBefore:Optional['Element']=None) -> None:
715+ def insertText(self, data:str, insertBefore:Optional['BeautifulSoupNode']=None) -> None:
716 text = TextNode(self.soup.new_string(data), self.soup)
717 if insertBefore:
718 self.insertBefore(text, insertBefore)
719 else:
720 self.appendChild(text)
721
722- def insertBefore(self, node:'Element', refNode:'Element') -> None:
723+ def insertBefore(self, node:'BeautifulSoupNode', refNode:'BeautifulSoupNode') -> None:
724 index = self.element.index(refNode.element)
725- if (node.element.__class__ == NavigableString and self.element.contents
726- and self.element.contents[index-1].__class__ == NavigableString):
727+ if (type(node.element) == NavigableString and self.element.contents
728+ and type(self.element.contents[index-1]) == NavigableString):
729 # (See comments in appendChild)
730 old_node = self.element.contents[index-1]
731+ assert type(old_node) == NavigableString
732 new_str = self.soup.new_string(old_node + node.element)
733 old_node.replace_with(new_str)
734 else:
735@@ -504,13 +507,19 @@ class Element(treebuilder_base.Node):
736 # parent's last descendant. It has no .next_sibling and
737 # its .next_element is whatever the previous last
738 # descendant had.
739- last_childs_last_descendant = to_append[-1]._last_descendant(False, True)
740+ last_childs_last_descendant = to_append[-1]._last_descendant(
741+ is_initialized=False, accept_self=True
742+ )
743
744+ # Since we passed accept_self=True into _last_descendant,
745+ # there's no possibility that the result is None.
746+ assert last_childs_last_descendant is not None
747 last_childs_last_descendant.next_element = new_parents_last_descendant_next_element
748 if new_parents_last_descendant_next_element is not None:
749- # TODO: This code has no test coverage and I'm not sure
750- # how to get html5lib to go through this path, but it's
751- # just the other side of the previous line.
752+ # TODO-COVERAGE: This code has no test coverage and
753+ # I'm not sure how to get html5lib to go through this
754+ # path, but it's just the other side of the previous
755+ # line.
756 new_parents_last_descendant_next_element.previous_element = last_childs_last_descendant
757 last_childs_last_descendant.next_sibling = None
758
759@@ -526,7 +535,12 @@ class Element(treebuilder_base.Node):
760 # print("FROM", self.element)
761 # print("TO", new_parent_element)
762
763- # TODO: typeshed stubs are incorrect about this;
764+ # TODO-TYPING: typeshed stubs are incorrect about this;
765+ # hasContent returns a boolean, not None.
766+ def hasContent(self) -> bool:
767+ return len(self.element.contents) > 0
768+
769+ # TODO-TYPING: typeshed stubs are incorrect about this;
770 # cloneNode returns a new Node, not None.
771 def cloneNode(self) -> treebuilder_base.Node:
772 tag = self.soup.new_tag(self.element.name, self.namespace)
773@@ -535,24 +549,17 @@ class Element(treebuilder_base.Node):
774 node.attributes[key] = value
775 return node
776
777- # TODO-TYPING: typeshed stubs are incorrect about this;
778- # cloneNode returns a boolean, not None.
779- def hasContent(self) -> bool:
780- return len(self.element.contents) > 0
781-
782- def getNameTuple(self) -> Tuple[str, str]:
783+ def getNameTuple(self) -> Tuple[Optional[_NamespaceURL], str]:
784 if self.namespace == None:
785 return namespaces["html"], self.name
786 else:
787 return self.namespace, self.name
788-
789 nameTuple = property(getNameTuple)
790
791-class TextNode(Element):
792- def __init__(self, element:PageElement, soup:'BeautifulSoup'):
793+class TextNode(BeautifulSoupNode):
794+ element:NavigableString
795+
796+ def __init__(self, element:NavigableString, soup:'BeautifulSoup'):
797 treebuilder_base.Node.__init__(self, None)
798 self.element = element
799 self.soup = soup
800-
801- def cloneNode(self) -> treebuilder_base.Node:
802- raise NotImplementedError()
803diff --git a/bs4/builder/_htmlparser.py b/bs4/builder/_htmlparser.py
804index 91cecf7..d8e21d1 100644
805--- a/bs4/builder/_htmlparser.py
806+++ b/bs4/builder/_htmlparser.py
807@@ -49,7 +49,6 @@ if TYPE_CHECKING:
808 from bs4 import BeautifulSoup
809 from bs4.element import NavigableString
810 from bs4._typing import (
811- _AttributeValues,
812 _Encoding,
813 _Encodings,
814 _RawMarkup,
815@@ -60,6 +59,14 @@ HTMLPARSER = 'html.parser'
816 _DuplicateAttributeHandler = Callable[[Dict[str, str], str, str], None]
817
818 class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML):
819+ #: Constant to handle duplicate attributes by replacing earlier values
820+ #: with later ones.
821+ REPLACE:str = 'replace'
822+
823+ #: Constant to handle duplicate attributes by ignoring later values
824+ #: and keeping the earlier ones.
825+ IGNORE:str = 'ignore'
826+
827 """A subclass of the Python standard library's HTMLParser class, which
828 listens for HTMLParser events and translates them into calls
829 to Beautiful Soup's tree construction API.
830@@ -73,11 +80,13 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML):
831 the name of the duplicate attribute, and the most recent value
832 encountered.
833 """
834- def __init__(self, soup:BeautifulSoup, *args, **kwargs):
835+ def __init__(
836+ self, soup:BeautifulSoup, *args:Any,
837+ on_duplicate_attribute:Union[str, _DuplicateAttributeHandler]=REPLACE,
838+ **kwargs:Any
839+ ):
840 self.soup = soup
841- self.on_duplicate_attribute = kwargs.pop(
842- 'on_duplicate_attribute', self.REPLACE
843- )
844+ self.on_duplicate_attribute = on_duplicate_attribute
845 HTMLParser.__init__(self, *args, **kwargs)
846
847 # Keep a list of empty-element tags that were encountered
848@@ -90,14 +99,6 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML):
849 self.already_closed_empty_element = []
850
851 self._initialize_xml_detector()
852-
853- #: Constant to handle duplicate attributes by replacing earlier values
854- #: with later ones.
855- IGNORE:str = 'ignore'
856-
857- #: Constant to handle duplicate attributes by ignoring later values
858- #: and keeping the earlier ones.
859- REPLACE:str = 'replace'
860
861 on_duplicate_attribute:Union[str, _DuplicateAttributeHandler]
862 already_closed_empty_element: List[str]
863@@ -145,7 +146,7 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML):
864 closing tag).
865 """
866 # TODO: handle namespaces here?
867- attr_dict: Dict[str, str] = {}
868+ attr_dict:Dict[str, str] = {}
869 for key, value in attrs:
870 # Change None attribute values to the empty string
871 # for consistency with the other tree builders.
872diff --git a/bs4/builder/_lxml.py b/bs4/builder/_lxml.py
873index 3dfe88a..380164e 100644
874--- a/bs4/builder/_lxml.py
875+++ b/bs4/builder/_lxml.py
876@@ -13,6 +13,7 @@ from collections.abc import Callable
877
878 from typing import (
879 Any,
880+ cast,
881 Dict,
882 IO,
883 Iterable,
884@@ -21,6 +22,7 @@ from typing import (
885 Set,
886 Tuple,
887 Type,
888+ TypeAlias,
889 TYPE_CHECKING,
890 Union,
891 )
892@@ -28,7 +30,6 @@ from typing import (
893 from io import BytesIO
894 from io import StringIO
895 from lxml import etree
896-from bs4.dammit import (_Encoding)
897 from bs4.element import (
898 Comment,
899 Doctype,
900@@ -60,13 +61,16 @@ if TYPE_CHECKING:
901
902 LXML:str = 'lxml'
903
904-def _invert(d):
905+def _invert(d:dict[Any, Any]) -> dict[Any, Any]:
906 "Invert a dictionary."
907 return dict((v,k) for k, v in list(d.items()))
908
909+_LXMLParser:TypeAlias = Union[etree.XMLParser, etree.HTMLParser]
910+_ParserOrParserClass:TypeAlias = Union[_LXMLParser, Type[etree.XMLParser], Type[etree.HTMLParser]]
911+
912 class LXMLTreeBuilderForXML(TreeBuilder):
913
914- DEFAULT_PARSER_CLASS:Type[Any] = etree.XMLParser
915+ DEFAULT_PARSER_CLASS:Type[etree.XMLParser] = etree.XMLParser
916
917 is_xml:bool = True
918
919@@ -93,6 +97,7 @@ class LXMLTreeBuilderForXML(TreeBuilder):
920 nsmaps: List[Optional[_InvertedNamespaceMapping]]
921 empty_element_tags: Set[str]
922 parser: Any
923+ _default_parser: Optional[etree.XMLParser]
924
925 # NOTE: If we parsed Element objects and looked at .sourceline,
926 # we'd be able to see the line numbers from the original document.
927@@ -137,7 +142,7 @@ class LXMLTreeBuilderForXML(TreeBuilder):
928 # prefix, the first one in the document takes precedence.
929 self.soup._namespaces[key] = value
930
931- def default_parser(self, encoding:Optional[_Encoding]) -> Type:
932+ def default_parser(self, encoding:Optional[_Encoding]) -> _ParserOrParserClass:
933 """Find the default parser for the given encoding.
934
935 :return: Either a parser object or a class, which
936@@ -148,7 +153,7 @@ class LXMLTreeBuilderForXML(TreeBuilder):
937 return self.DEFAULT_PARSER_CLASS(
938 target=self, strip_cdata=False, recover=True, encoding=encoding)
939
940- def parser_for(self, encoding: Optional[_Encoding]) -> Any:
941+ def parser_for(self, encoding: Optional[_Encoding]) -> _LXMLParser:
942 """Instantiate an appropriate parser for the given encoding.
943
944 :param encoding: A string.
945@@ -164,8 +169,8 @@ class LXMLTreeBuilderForXML(TreeBuilder):
946 )
947 return parser
948
949- def __init__(self, parser:Optional[Any]=None,
950- empty_element_tags:Optional[Set[str]]=None, **kwargs):
951+ def __init__(self, parser:Optional[etree.XMLParser]=None,
952+ empty_element_tags:Optional[Set[str]]=None, **kwargs:Any):
953 # TODO: Issue a warning if parser is present but not a
954 # callable, since that means there's no way to create new
955 # parsers for different encodings.
956@@ -270,7 +275,7 @@ class LXMLTreeBuilderForXML(TreeBuilder):
957 yield (detector.markup, encoding, document_declared_encoding, False)
958
959 def feed(self, markup:_RawMarkup) -> None:
960- io: IO
961+ io: Union[BytesIO, StringIO]
962 if isinstance(markup, bytes):
963 io = BytesIO(markup)
964 elif isinstance(markup, str):
965@@ -298,14 +303,25 @@ class LXMLTreeBuilderForXML(TreeBuilder):
966 def close(self) -> None:
967 self.nsmaps = [self.DEFAULT_NSMAPS_INVERTED]
968
969- def start(self, name:str, attrs:Dict[str, str], nsmap:_NamespaceMapping={}):
970+ def start(self, tag:str|bytes, attrs:Dict[str|bytes, str|bytes], nsmap:_NamespaceMapping={}) -> None:
971 # This is called by lxml code as a result of calling
972 # BeautifulSoup.feed(), and we know self.soup is set by the time feed()
973 # is called.
974 assert self.soup is not None
975-
976- # Make sure attrs is a mutable dict--lxml may send an immutable dictproxy.
977- attrs = dict(attrs)
978+ assert isinstance(tag, str)
979+
980+ # We need to recreate the attribute dict for three
981+ # reasons. First, for type checking, so we can assert there
982+ # are no bytestrings in the keys or values. Second, because we
983+ # need a mutable dict--lxml might send us an immutable
984+ # dictproxy. Third, so we can handle namespaced attribute
985+ # names by converting the keys to NamespacedAttributes.
986+ new_attrs:Dict[Union[str,NamespacedAttribute], str] = {}
987+ for k, v in attrs.items():
988+ assert isinstance(k, str)
989+ assert isinstance(v, str)
990+ new_attrs[k] = v
991+
992 nsprefix: Optional[_NamespacePrefix] = None
993 namespace: Optional[_NamespaceURL] = None
994 # Invert each namespace map as it comes in.
995@@ -340,30 +356,28 @@ class LXMLTreeBuilderForXML(TreeBuilder):
996
997 # Also treat the namespace mapping as a set of attributes on the
998 # tag, so we can recreate it later.
999- attrs = attrs.copy()
1000 for prefix, namespace in list(nsmap.items()):
1001 attribute = NamespacedAttribute(
1002 "xmlns", prefix, "http://www.w3.org/2000/xmlns/")
1003- attrs[attribute] = namespace
1004+ new_attrs[attribute] = namespace
1005
1006 # Namespaces are in play. Find any attributes that came in
1007 # from lxml with namespaces attached to their names, and
1008 # turn then into NamespacedAttribute objects.
1009- new_attrs:Dict[Union[str,NamespacedAttribute], str] = {}
1010- for attr, value in list(attrs.items()):
1011+ final_attrs:Dict[Union[str,NamespacedAttribute], str] = {}
1012+ for attr, value in list(new_attrs.items()):
1013 namespace, attr = self._getNsTag(attr)
1014 if namespace is None:
1015- new_attrs[attr] = value
1016+ final_attrs[attr] = value
1017 else:
1018 nsprefix = self._prefix_for_namespace(namespace)
1019 attr = NamespacedAttribute(nsprefix, attr, namespace)
1020- new_attrs[attr] = value
1021- attrs = new_attrs
1022+ final_attrs[attr] = value
1023
1024- namespace, name = self._getNsTag(name)
1025+ namespace, tag = self._getNsTag(tag)
1026 nsprefix = self._prefix_for_namespace(namespace)
1027 self.soup.handle_starttag(
1028- name, namespace, nsprefix, attrs,
1029+ tag, namespace, nsprefix, final_attrs,
1030 namespaces=self.active_namespace_prefixes[-1]
1031 )
1032
1033@@ -376,8 +390,9 @@ class LXMLTreeBuilderForXML(TreeBuilder):
1034 return inverted_nsmap[namespace]
1035 return None
1036
1037- def end(self, name:str) -> None:
1038+ def end(self, name:str|bytes) -> None:
1039 assert self.soup is not None
1040+ assert isinstance(name, str)
1041 self.soup.endData()
1042 completed_tag = self.soup.tagStack[-1]
1043 namespace, name = self._getNsTag(name)
1044@@ -406,9 +421,10 @@ class LXMLTreeBuilderForXML(TreeBuilder):
1045 self.soup.handle_data(data)
1046 self.soup.endData(self.processing_instruction_class)
1047
1048- def data(self, content:str) -> None:
1049+ def data(self, data:str|bytes) -> None:
1050 assert self.soup is not None
1051- self.soup.handle_data(content)
1052+ assert isinstance(data, str)
1053+ self.soup.handle_data(data)
1054
1055 def doctype(self, name:str, pubid:str, system:str) -> None:
1056 assert self.soup is not None
1057@@ -416,11 +432,12 @@ class LXMLTreeBuilderForXML(TreeBuilder):
1058 doctype = Doctype.for_name_and_ids(name, pubid, system)
1059 self.soup.object_was_parsed(doctype)
1060
1061- def comment(self, content:str) -> None:
1062+ def comment(self, text:str|bytes) -> None:
1063 "Handle comments as Comment objects."
1064 assert self.soup is not None
1065+ assert isinstance(text, str)
1066 self.soup.endData()
1067- self.soup.handle_data(content)
1068+ self.soup.handle_data(text)
1069 self.soup.endData(Comment)
1070
1071 def test_fragment_to_document(self, fragment:str) -> str:
1072@@ -436,7 +453,7 @@ class LXMLTreeBuilder(HTMLTreeBuilder, LXMLTreeBuilderForXML):
1073 features: Iterable[str] = list(ALTERNATE_NAMES) + [NAME, HTML, FAST, PERMISSIVE]
1074 is_xml: bool = False
1075
1076- def default_parser(self, encoding:Optional[_Encoding]) -> Type[Any]:
1077+ def default_parser(self, encoding:Optional[_Encoding]) -> _ParserOrParserClass:
1078 return etree.HTMLParser
1079
1080 def feed(self, markup:_RawMarkup) -> None:
1081diff --git a/bs4/dammit.py b/bs4/dammit.py
1082index 8c1b631..4e950a2 100644
1083--- a/bs4/dammit.py
1084+++ b/bs4/dammit.py
1085@@ -260,14 +260,14 @@ class EntitySubstitution(object):
1086 AMPERSAND_OR_BRACKET: Pattern[str] = re.compile("([<>&])")
1087
1088 @classmethod
1089- def _substitute_html_entity(cls, matchobj:re.Match[str]) -> str:
1090+ def _substitute_html_entity(cls, matchobj:re.Match) -> str:
1091 """Used with a regular expression to substitute the
1092 appropriate HTML entity for a special character string."""
1093 entity = cls.CHARACTER_TO_HTML_ENTITY.get(matchobj.group(0))
1094 return "&%s;" % entity
1095
1096 @classmethod
1097- def _substitute_xml_entity(cls, matchobj:re.Match[str]) -> str:
1098+ def _substitute_xml_entity(cls, matchobj:re.Match) -> str:
1099 """Used with a regular expression to substitute the
1100 appropriate XML entity for a special character string."""
1101 entity = cls.CHARACTER_TO_XML_ENTITY[matchobj.group(0)]
1102@@ -752,7 +752,7 @@ class UnicodeDammit:
1103
1104 log: Logger #: :meta private:
1105
1106- def _sub_ms_char(self, match:re.Match[bytes]) -> bytes:
1107+ def _sub_ms_char(self, match:re.Match) -> bytes:
1108 """Changes a MS smart quote character to an XML or HTML
1109 entity, or an ASCII character.
1110
1111diff --git a/bs4/element.py b/bs4/element.py
1112index f4ab89c..22115a2 100644
1113--- a/bs4/element.py
1114+++ b/bs4/element.py
1115@@ -39,11 +39,13 @@ from typing import (
1116 Union,
1117 cast,
1118 )
1119-from typing_extensions import Self
1120+from typing_extensions import (
1121+ Self,
1122+ TypeAlias,
1123+)
1124 if TYPE_CHECKING:
1125 from bs4 import BeautifulSoup
1126 from bs4.builder import TreeBuilder
1127- from bs4.dammit import _Encoding
1128 from bs4.filter import ElementFilter
1129 from bs4.formatter import (
1130 _EntitySubstitutionFunction,
1131@@ -52,12 +54,16 @@ if TYPE_CHECKING:
1132 from bs4._typing import (
1133 _AttributeValue,
1134 _AttributeValues,
1135+ _Encoding,
1136+ _RawOrProcessedAttributeValues,
1137 _StrainableElement,
1138 _StrainableAttribute,
1139 _StrainableAttributes,
1140 _StrainableString,
1141 )
1142
1143+_OneOrMoreStringTypes:TypeAlias = Union[Type['NavigableString'], Iterable[Type['NavigableString']]]
1144+
1145 # Deprecated module-level attributes.
1146 # See https://peps.python.org/pep-0562/
1147 _deprecated_names = dict(
1148@@ -66,7 +72,7 @@ _deprecated_names = dict(
1149 #: :meta private:
1150 _deprecated_whitespace_re: Pattern[str] = re.compile(r"\s+")
1151
1152-def __getattr__(name):
1153+def __getattr__(name:str) -> Any:
1154 if name in _deprecated_names:
1155 message = _deprecated_names[name]
1156 warnings.warn(
1157@@ -124,7 +130,8 @@ class NamespacedAttribute(str):
1158 namespace: Optional[str]
1159
1160 def __new__(cls, prefix:Optional[str],
1161- name:Optional[str]=None, namespace:Optional[str]=None):
1162+ name:Optional[str]=None,
1163+ namespace:Optional[str]=None) -> Self:
1164 if not name:
1165 # This is the default namespace. Its name "has no value"
1166 # per https://www.w3.org/TR/xml-names/#defaulting
1167@@ -223,7 +230,7 @@ class ContentMetaAttributeValue(AttributeValueWithCharsetSubstitution):
1168 """
1169 if eventual_encoding in PYTHON_SPECIFIC_ENCODINGS:
1170 return self.CHARSET_RE.sub('', self.original_value)
1171- def rewrite(match):
1172+ def rewrite(match:re.Match[str]) -> str:
1173 return match.group(1) + eventual_encoding
1174 return self.CHARSET_RE.sub(rewrite, self.original_value)
1175
1176@@ -370,7 +377,7 @@ class PageElement(object):
1177 "previousSibling", "previous_sibling", "4.0.0"
1178 )
1179
1180- def __deepcopy__(self, memo:Dict, recursive:bool=False) -> Self:
1181+ def __deepcopy__(self, memo:Dict[Any,Any], recursive:bool=False) -> Self:
1182 raise NotImplementedError()
1183
1184 def __copy__(self) -> Self:
1185@@ -528,6 +535,13 @@ class PageElement(object):
1186 ) -> Optional[PageElement]:
1187 """Finds the last element beneath this object to be parsed.
1188
1189+ Special note to help you figure things out if your type
1190+ checking is tripped up by the fact that this method returns
1191+ Optional[PageElement] instead of PageElement: the only time
1192+ this method returns None is if `accept_self` is False and the
1193+ `PageElement` has no children--either it's a NavigableString
1194+ or an empty Tag.
1195+
1196 :param is_initialized: Has `PageElement.setup` been called on
1197 this `PageElement` yet?
1198
1199@@ -878,9 +892,10 @@ class PageElement(object):
1200
1201 def _find_one(
1202 self,
1203- # TODO: "There is no syntax to indicate optional or keyword
1204- # arguments; such function types are rarely used as
1205- # callback types." - So, not sure how to get more specific here.
1206+ # TODO-TYPING: "There is no syntax to indicate optional or
1207+ # keyword arguments; such function types are rarely used
1208+ # as callback types." - So, not sure how to get more
1209+ # specific here.
1210 method:Callable,
1211 name:Optional[_StrainableElement],
1212 attrs:_StrainableAttributes,
1213@@ -955,7 +970,7 @@ class PageElement(object):
1214 You can pass in your own technique for iterating over the tree, and your own
1215 technique for matching items.
1216 """
1217- results:ResultSet = ResultSet(matcher)
1218+ results:ResultSet[PageElement] = ResultSet(matcher)
1219 while True:
1220 try:
1221 i = next(generator)
1222@@ -1029,27 +1044,27 @@ class PageElement(object):
1223 return getattr(self, '_decomposed', False) or False
1224
1225 @_deprecated("next_elements", "4.0.0")
1226- def nextGenerator(self):
1227+ def nextGenerator(self) -> Iterator[PageElement]:
1228 ":meta private:"
1229 return self.next_elements
1230
1231 @_deprecated("next_siblings", "4.0.0")
1232- def nextSiblingGenerator(self):
1233+ def nextSiblingGenerator(self) -> Iterator[PageElement]:
1234 ":meta private:"
1235 return self.next_siblings
1236
1237 @_deprecated("previous_elements", "4.0.0")
1238- def previousGenerator(self):
1239+ def previousGenerator(self) -> Iterator[PageElement]:
1240 ":meta private:"
1241 return self.previous_elements
1242
1243 @_deprecated("previous_siblings", "4.0.0")
1244- def previousSiblingGenerator(self):
1245+ def previousSiblingGenerator(self) -> Iterator[PageElement]:
1246 ":meta private:"
1247 return self.previous_siblings
1248
1249 @_deprecated("parents", "4.0.0")
1250- def parentGenerator(self):
1251+ def parentGenerator(self) -> Iterator[PageElement]:
1252 ":meta private:"
1253 return self.parents
1254
1255@@ -1087,7 +1102,7 @@ class NavigableString(str, PageElement):
1256 u.setup()
1257 return u
1258
1259- def __deepcopy__(self, memo:Dict, recursive:bool=False) -> Self:
1260+ def __deepcopy__(self, memo:Dict[Any, Any], recursive:bool=False) -> Self:
1261 """A copy of a NavigableString has the same contents and class
1262 as the original, but it is not connected to the parse tree.
1263
1264@@ -1097,7 +1112,7 @@ class NavigableString(str, PageElement):
1265 """
1266 return type(self)(self)
1267
1268- def __getnewargs__(self):
1269+ def __getnewargs__(self) -> Tuple[str]:
1270 return (str(self),)
1271
1272 @property
1273@@ -1134,14 +1149,14 @@ class NavigableString(str, PageElement):
1274 return None
1275
1276 @name.setter
1277- def name(self, name:str):
1278+ def name(self, name:str) -> None:
1279 """Prevent NavigableString.name from ever being set.
1280
1281 :meta private:
1282 """
1283 raise AttributeError("A NavigableString cannot be given a name.")
1284
1285- def _all_strings(self, strip=False, types:Iterable[Type[NavigableString]]=PageElement.default) -> Iterator[str]:
1286+ def _all_strings(self, strip:bool=False, types:_OneOrMoreStringTypes=PageElement.default) -> Iterator[str]:
1287 """Yield all strings of certain classes, possibly stripping them.
1288
1289 This makes it easy for NavigableString to implement methods
1290@@ -1382,7 +1397,7 @@ class Tag(PageElement):
1291 name:Optional[str]=None,
1292 namespace:Optional[str]=None,
1293 prefix:Optional[str]=None,
1294- attrs:Optional[_AttributeValues]=None,
1295+ attrs:Optional[_RawOrProcessedAttributeValues]=None,
1296 parent:Optional[Union[BeautifulSoup, Tag]]=None,
1297 previous:Optional[PageElement]=None,
1298 is_xml:Optional[bool]=None,
1299@@ -1485,7 +1500,7 @@ class Tag(PageElement):
1300 #: :meta private:
1301 parserClass = _deprecated_alias("parserClass", "parser_class", "4.0.0")
1302
1303- def __deepcopy__(self, memo:dict, recursive:bool=True) -> Self:
1304+ def __deepcopy__(self, memo:Dict[Any, Any], recursive:bool=True) -> Self:
1305 """A deepcopy of a Tag is a new Tag, unconnected to the parse tree.
1306 Its contents are a copy of the old Tag's contents.
1307 """
1308@@ -1552,9 +1567,9 @@ class Tag(PageElement):
1309 return len(self.contents) == 0 and self.can_be_empty_element is True
1310
1311 @_deprecated("is_empty_element", "4.0.0")
1312- def isSelfClosing(self):
1313+ def isSelfClosing(self) -> bool:
1314 ": :meta private:"
1315- return is_empty_element()
1316+ return self.is_empty_element
1317
1318 @property
1319 def string(self) -> Optional[str]:
1320@@ -1592,7 +1607,7 @@ class Tag(PageElement):
1321
1322 #: :meta private:
1323 MAIN_CONTENT_STRING_TYPES = {NavigableString, CData}
1324- def _all_strings(self, strip:bool=False, types:Iterable[Type[NavigableString]]=PageElement.default) -> Iterator[str]:
1325+ def _all_strings(self, strip:bool=False, types:_OneOrMoreStringTypes=PageElement.default) -> Iterator[str]:
1326 """Yield all strings of certain classes, possibly stripping them.
1327
1328 :param strip: If True, all strings will be stripped before being
1329@@ -1739,7 +1754,7 @@ class Tag(PageElement):
1330 replace_with_children = unwrap
1331
1332 @_deprecated("unwrap", "4.0.0")
1333- def replaceWithChildren(self):
1334+ def replaceWithChildren(self) -> PageElement:
1335 ": :meta private:"
1336 return self.unwrap()
1337
1338@@ -1914,11 +1929,21 @@ class Tag(PageElement):
1339 "Deleting tag[key] deletes all 'key' attributes for the tag."
1340 self.attrs.pop(key, None)
1341
1342- def __call__(self, *args, **kwargs) -> ResultSet[PageElement]:
1343+ def __call__(self,
1344+ name:Optional[_StrainableElement]=None,
1345+ attrs:_StrainableAttributes={},
1346+ recursive:bool=True,
1347+ string:Optional[_StrainableString]=None,
1348+ limit:Optional[int]=None,
1349+ _stacklevel:int=2,
1350+ **kwargs:_StrainableAttribute
1351+ ) -> ResultSet[PageElement]:
1352 """Calling a Tag like a function is the same as calling its
1353 find_all() method. Eg. tag('a') returns a list of all the A tags
1354 found within this tag."""
1355- return self.find_all(*args, **kwargs)
1356+ return self.find_all(
1357+ name, attrs, recursive, string, limit, _stacklevel, **kwargs
1358+ )
1359
1360 def __getattr__(self, subtag:str) -> Optional[Tag]:
1361 """Calling tag.subtag is the same as calling tag.find(name="subtag")"""
1362@@ -2002,7 +2027,7 @@ class Tag(PageElement):
1363 def decode(self, indent_level:Optional[int]=None,
1364 eventual_encoding:_Encoding=DEFAULT_OUTPUT_ENCODING,
1365 formatter:_FormatterOrName="minimal",
1366- iterator:Optional[Iterable]=None) -> str:
1367+ iterator:Optional[Iterator[PageElement]]=None) -> str:
1368 """Render this `Tag` and its contents as a Unicode string.
1369
1370 :param indent_level: Each line of the rendering will be
1371@@ -2122,9 +2147,10 @@ class Tag(PageElement):
1372 EMPTY_ELEMENT_EVENT = _TreeTraversalEvent() #: :meta private:
1373 STRING_ELEMENT_EVENT = _TreeTraversalEvent() #: :meta private:
1374
1375- def _event_stream(self, iterator=None) -> Iterator[
1376- Tuple[_TreeTraversalEvent, PageElement]
1377- ]:
1378+ def _event_stream(
1379+ self,
1380+ iterator:Optional[Iterator[PageElement]]=None
1381+ ) -> Iterator[Tuple[_TreeTraversalEvent, PageElement]]:
1382 """Yield a sequence of events that can be used to reconstruct the DOM
1383 for this element.
1384
1385@@ -2316,8 +2342,8 @@ class Tag(PageElement):
1386 return contents.encode(encoding)
1387
1388 @_deprecated("encode_contents", "4.0.0")
1389- def renderContents(self, encoding=DEFAULT_OUTPUT_ENCODING,
1390- prettyPrint=False, indentLevel=0):
1391+ def renderContents(self, encoding:str=DEFAULT_OUTPUT_ENCODING,
1392+ prettyPrint:bool=False, indentLevel:Optional[int]=0) -> bytes:
1393 """Deprecated method for BS3 compatibility.
1394
1395 :meta private:
1396@@ -2436,7 +2462,7 @@ class Tag(PageElement):
1397 def select_one(self,
1398 selector:str,
1399 namespaces:Optional[Dict[str, str]]=None,
1400- **kwargs) -> Optional[Tag]:
1401+ **kwargs:Any) -> Optional[Tag]:
1402 """Perform a CSS selection operation on the current element.
1403
1404 :param selector: A CSS selector.
1405@@ -2452,7 +2478,7 @@ class Tag(PageElement):
1406 return self.css.select_one(selector, namespaces, **kwargs)
1407
1408 def select(self, selector:str, namespaces:Optional[Dict[str, str]]=None,
1409- limit:int=0, **kwargs) -> ResultSet[Tag]:
1410+ limit:int=0, **kwargs:Any) -> ResultSet[Tag]:
1411 """Perform a CSS selection operation on the current element.
1412
1413 This uses the SoupSieve library.
1414@@ -2478,7 +2504,7 @@ class Tag(PageElement):
1415
1416 # Old names for backwards compatibility
1417 @_deprecated("children", "4.0.0")
1418- def childGenerator(self):
1419+ def childGenerator(self) -> Iterator[PageElement]:
1420 """Deprecated generator.
1421
1422 :meta private:
1423@@ -2486,7 +2512,7 @@ class Tag(PageElement):
1424 return self.children
1425
1426 @_deprecated("descendants", "4.0.0")
1427- def recursiveChildGenerator(self):
1428+ def recursiveChildGenerator(self) -> Iterator[PageElement]:
1429 """Deprecated generator.
1430
1431 :meta private:
1432@@ -2494,7 +2520,7 @@ class Tag(PageElement):
1433 return self.descendants
1434
1435 @_deprecated("has_attr", "4.0.0")
1436- def has_key(self, key):
1437+ def has_key(self, key:str) -> bool:
1438 """Deprecated method. This was kind of misleading because has_key()
1439 (attributes) was different from __in__ (contents).
1440
1441@@ -2516,7 +2542,7 @@ class ResultSet(List[_PageElementT], Generic[_PageElementT]):
1442 super(ResultSet, self).__init__(result)
1443 self.source = source
1444
1445- def __getattr__(self, key:str):
1446+ def __getattr__(self, key:str) -> None:
1447 """Raise a helpful exception to explain a common code fix."""
1448 raise AttributeError(
1449 f"""ResultSet object has no attribute "{key}". You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?"""
1450diff --git a/bs4/filter.py b/bs4/filter.py
1451index 74e26d9..7632639 100644
1452--- a/bs4/filter.py
1453+++ b/bs4/filter.py
1454@@ -25,10 +25,10 @@ from bs4._deprecation import _deprecated
1455 from bs4.element import NavigableString, PageElement, Tag
1456 from bs4._typing import (
1457 _AttributeValue,
1458- _AttributeValues,
1459 _AllowStringCreationFunction,
1460 _AllowTagCreationFunction,
1461 _PageElementMatchFunction,
1462+ _RawAttributeValues,
1463 _TagMatchFunction,
1464 _StringMatchFunction,
1465 _StrainableElement,
1466@@ -98,7 +98,7 @@ class ElementFilter(object):
1467
1468 def allow_tag_creation(
1469 self, nsprefix:Optional[str], name:str,
1470- attrs:Optional[_AttributeValues]
1471+ attrs:Optional[_RawAttributeValues]
1472 ) -> bool:
1473 """Based on the name and attributes of a tag, see whether this
1474 ElementFilter will allow a Tag object to even be created.
1475@@ -372,9 +372,8 @@ class SoupStrainer(ElementFilter):
1476 # third-party regex library, whose pattern objects doesn't
1477 # derive from re.Pattern.
1478 #
1479- # TODO-TYPING: Once we drop support for Python 3.7, we
1480- # might be able to address this by defining an appropriate
1481- # Protocol.
1482+ # TODO-TYPING: We should be able to bring in a Protocol
1483+ # from typing_extensions to handle this.
1484 yield rule_class(pattern=obj)
1485 elif hasattr(obj, '__iter__'):
1486 for o in obj:
1487@@ -497,7 +496,7 @@ class SoupStrainer(ElementFilter):
1488 )
1489 return this_attr_match
1490
1491- def allow_tag_creation(self, nsprefix:Optional[str], name:str, attrs:Optional[_AttributeValues]) -> bool:
1492+ def allow_tag_creation(self, nsprefix:Optional[str], name:str, attrs:Optional[_RawAttributeValues]) -> bool:
1493 """Based on the name and attributes of a tag, see whether this
1494 SoupStrainer will allow a Tag object to even be created.
1495
1496@@ -586,7 +585,7 @@ class SoupStrainer(ElementFilter):
1497 return False
1498
1499 @_deprecated("allow_tag_creation", "4.13.0")
1500- def search_tag(self, name:str, attrs:Optional[_AttributeValues]) -> bool:
1501+ def search_tag(self, name:str, attrs:Optional[_RawAttributeValues]) -> bool:
1502 """A less elegant version of allow_tag_creation()."""
1503 ":meta private:"
1504 return self.allow_tag_creation(None, name, attrs)
1505diff --git a/bs4/tests/__init__.py b/bs4/tests/__init__.py
1506index 3ef999d..8415645 100644
1507--- a/bs4/tests/__init__.py
1508+++ b/bs4/tests/__init__.py
1509@@ -15,6 +15,7 @@ from bs4.element import (
1510 Comment,
1511 ContentMetaAttributeValue,
1512 Doctype,
1513+ PageElement,
1514 PYTHON_SPECIFIC_ENCODINGS,
1515 Script,
1516 Stylesheet,
1517@@ -25,8 +26,21 @@ from bs4.builder import (
1518 DetectsXMLParsedAsHTML,
1519 XMLParsedAsHTMLWarning,
1520 )
1521+from bs4._typing import (
1522+ _IncomingMarkup
1523+)
1524+
1525+from bs4.builder import TreeBuilder
1526 from bs4.builder._htmlparser import HTMLParserTreeBuilder
1527-default_builder = HTMLParserTreeBuilder
1528+
1529+from typing import (
1530+ Any,
1531+ Iterable,
1532+ List,
1533+ Optional,
1534+ Tuple,
1535+ Type,
1536+)
1537
1538 # Some tests depend on specific third-party libraries. We use
1539 # @pytest.mark.skipIf on the following conditionals to skip them
1540@@ -51,7 +65,9 @@ except ImportError:
1541 LXML_PRESENT = False
1542 LXML_VERSION = (0,)
1543
1544-BAD_DOCUMENT = """A bare string
1545+default_builder:Type[TreeBuilder] = HTMLParserTreeBuilder
1546+
1547+BAD_DOCUMENT:str = """A bare string
1548 <!DOCTYPE xsl:stylesheet SYSTEM "htmlent.dtd">
1549 <!DOCTYPE xsl:stylesheet PUBLIC "htmlent.dtd">
1550 <div><![CDATA[A CDATA section where it doesn't belong]]></div>
1551@@ -91,28 +107,30 @@ BAD_DOCUMENT = """A bare string
1552 class SoupTest(object):
1553
1554 @property
1555- def default_builder(self):
1556+ def default_builder(self) -> Type[TreeBuilder]:
1557 return default_builder
1558
1559- def soup(self, markup, **kwargs):
1560+ def soup(self, markup:_IncomingMarkup, **kwargs:Any) -> BeautifulSoup:
1561 """Build a Beautiful Soup object from markup."""
1562 builder = kwargs.pop('builder', self.default_builder)
1563 return BeautifulSoup(markup, builder=builder, **kwargs)
1564
1565- def document_for(self, markup, **kwargs):
1566+ def document_for(self, markup:str, **kwargs:Any) -> str:
1567 """Turn an HTML fragment into a document.
1568
1569 The details depend on the builder.
1570 """
1571 return self.default_builder(**kwargs).test_fragment_to_document(markup)
1572
1573- def assert_soup(self, to_parse, compare_parsed_to=None):
1574+ def assert_soup(self, to_parse:_IncomingMarkup,
1575+ compare_parsed_to:Optional[str]=None) -> None:
1576 """Parse some markup using Beautiful Soup and verify that
1577 the output markup is as expected.
1578 """
1579 builder = self.default_builder
1580 obj = BeautifulSoup(to_parse, builder=builder)
1581 if compare_parsed_to is None:
1582+ assert isinstance(to_parse, str)
1583 compare_parsed_to = to_parse
1584
1585 # Verify that the documents come out the same.
1586@@ -131,7 +149,7 @@ class SoupTest(object):
1587
1588 assertSoupEquals = assert_soup
1589
1590- def assertConnectedness(self, element):
1591+ def assertConnectedness(self, element:Tag) -> None:
1592 """Ensure that next_element and previous_element are properly
1593 set for all descendants of the given element.
1594 """
1595@@ -142,7 +160,7 @@ class SoupTest(object):
1596 assert earlier == e.previous_element
1597 earlier = e
1598
1599- def linkage_validator(self, el, _recursive_call=False):
1600+ def linkage_validator(self, el:Tag, _recursive_call:bool=False) -> Optional[PageElement]:
1601 """Ensure proper linkage throughout the document."""
1602 descendant = None
1603 # Document element should have no previous element or previous sibling.
1604@@ -209,6 +227,7 @@ class SoupTest(object):
1605
1606 if isinstance(child, Tag) and child.contents:
1607 descendant = self.linkage_validator(child, True)
1608+ assert descendant is not None
1609 # A bubbled up descendant should have no next siblings
1610 assert descendant.next_sibling is None,\
1611 "Bad next_sibling\nNODE: {}\nNEXT {}\nEXPECTED {}".format(
1612@@ -234,7 +253,7 @@ class SoupTest(object):
1613 child = el
1614
1615 if not _recursive_call and child is not None:
1616- target = el
1617+ target:Optional[Tag] = el
1618 while True:
1619 if target is None:
1620 assert child.next_element is None, \
1621@@ -256,7 +275,7 @@ class SoupTest(object):
1622 # Return the child to the recursive caller
1623 return child
1624
1625- def assert_selects(self, tags, should_match):
1626+ def assert_selects(self, tags:Iterable[Tag], should_match:Iterable[str]) -> None:
1627 """Make sure that the given tags have the correct text.
1628
1629 This is used in tests that define a bunch of tags, each
1630@@ -265,7 +284,7 @@ class SoupTest(object):
1631 """
1632 assert [tag.string for tag in tags] == should_match
1633
1634- def assert_selects_ids(self, tags, should_match):
1635+ def assert_selects_ids(self, tags:Iterable[Tag], should_match:Iterable[str]) -> None:
1636 """Make sure that the given tags have the correct IDs.
1637
1638 This is used in tests that define a bunch of tags, each
1639@@ -275,7 +294,7 @@ class SoupTest(object):
1640 assert [tag['id'] for tag in tags] == should_match
1641
1642
1643-class TreeBuilderSmokeTest(object):
1644+class TreeBuilderSmokeTest(SoupTest):
1645 # Tests that are common to HTML and XML tree builders.
1646
1647 @pytest.mark.parametrize(
1648@@ -352,7 +371,7 @@ class HTMLTreeBuilderSmokeTest(TreeBuilderSmokeTest):
1649 assert loaded.__class__ == BeautifulSoup
1650 assert loaded.decode() == tree.decode()
1651
1652- def assertDoctypeHandled(self, doctype_fragment):
1653+ def assertDoctypeHandled(self, doctype_fragment:str) -> None:
1654 """Assert that a given doctype string is handled correctly."""
1655 doctype_str, soup = self._document_with_doctype(doctype_fragment)
1656
1657@@ -366,7 +385,7 @@ class HTMLTreeBuilderSmokeTest(TreeBuilderSmokeTest):
1658 # parse tree and that the rest of the document parsed.
1659 assert soup.p.contents[0] == 'foo'
1660
1661- def _document_with_doctype(self, doctype_fragment, doctype_string="DOCTYPE"):
1662+ def _document_with_doctype(self, doctype_fragment:str, doctype_string:str="DOCTYPE") -> Tuple[bytes, BeautifulSoup]:
1663 """Generate and parse a document with the given doctype."""
1664 doctype = '<!%s %s>' % (doctype_string, doctype_fragment)
1665 markup = doctype + '\n<p>foo</p>'
1666diff --git a/bs4/tests/test_builder_registry.py b/bs4/tests/test_builder_registry.py
1667index 9a9ce1f..da10e5f 100644
1668--- a/bs4/tests/test_builder_registry.py
1669+++ b/bs4/tests/test_builder_registry.py
1670@@ -2,10 +2,12 @@
1671
1672 import pytest
1673 import warnings
1674+from typing import Type
1675
1676 from bs4 import BeautifulSoup
1677 from bs4.builder import (
1678 builder_registry as registry,
1679+ TreeBuilder,
1680 TreeBuilderRegistry,
1681 )
1682 from bs4.builder._htmlparser import HTMLParserTreeBuilder
1683@@ -81,7 +83,7 @@ class TestRegistry(object):
1684 def setup_method(self):
1685 self.registry = TreeBuilderRegistry()
1686
1687- def builder_for_features(self, *feature_list):
1688+ def builder_for_features(self, *feature_list:str) -> Type[TreeBuilder]:
1689 cls = type('Builder_' + '_'.join(feature_list),
1690 (object,), {'features' : feature_list})
1691
1692diff --git a/bs4/tests/test_css.py b/bs4/tests/test_css.py
1693index 359dbcd..2e2baba 100644
1694--- a/bs4/tests/test_css.py
1695+++ b/bs4/tests/test_css.py
1696@@ -8,6 +8,12 @@ from bs4 import (
1697 ResultSet,
1698 )
1699
1700+from typing import (
1701+ Any,
1702+ Iterable,
1703+ Tuple,
1704+)
1705+
1706 from . import (
1707 SoupTest,
1708 SOUP_SIEVE_PRESENT,
1709@@ -78,7 +84,7 @@ class TestCSSSelectors(SoupTest):
1710 def setup_method(self):
1711 self.soup = BeautifulSoup(self.HTML, 'html.parser')
1712
1713- def assert_selects(self, selector, expected_ids, **kwargs):
1714+ def assert_selects(self, selector:str, expected_ids:Iterable[str], **kwargs:Any) -> None:
1715 results = self.soup.select(selector, **kwargs)
1716 assert isinstance(results, ResultSet)
1717 el_ids = [el['id'] for el in results]
1718@@ -90,7 +96,7 @@ class TestCSSSelectors(SoupTest):
1719
1720 assertSelect = assert_selects
1721
1722- def assert_select_multiple(self, *tests):
1723+ def assert_select_multiple(self, *tests:Tuple[str, Iterable[str]]):
1724 for selector, expected_ids in tests:
1725 self.assert_selects(selector, expected_ids)
1726
1727diff --git a/bs4/tests/test_filter.py b/bs4/tests/test_filter.py
1728index 8d5da70..dfd6f18 100644
1729--- a/bs4/tests/test_filter.py
1730+++ b/bs4/tests/test_filter.py
1731@@ -5,6 +5,12 @@ import warnings
1732 from . import (
1733 SoupTest,
1734 )
1735+from typing import (
1736+ Callable,
1737+ Optional,
1738+ Pattern,
1739+ Tuple,
1740+)
1741 from bs4.element import Tag
1742 from bs4.filter import (
1743 AttributeValueMatchRule,
1744@@ -14,6 +20,8 @@ from bs4.filter import (
1745 StringMatchRule,
1746 TagNameMatchRule,
1747 )
1748+from bs4._typing import _RawOrProcessedAttributeValues
1749+
1750
1751 class TestElementFilter(SoupTest):
1752
1753@@ -107,7 +115,10 @@ class TestElementFilter(SoupTest):
1754
1755 class TestMatchRule(SoupTest):
1756
1757- def _tuple(self, rule):
1758+ def _tuple(self, rule:MatchRule) -> Tuple[Optional[str],
1759+ Optional[Pattern[str]],
1760+ Optional[Callable],
1761+ Optional[bool]]:
1762 return (
1763 rule.string,
1764 rule.pattern.pattern if rule.pattern else None,
1765@@ -395,9 +406,10 @@ class TestSoupStrainer(SoupTest):
1766 assert msg == "Ignoring nested list [[...]] to avoid the possibility of infinite recursion."
1767
1768 def tag_matches(
1769- self, strainer, name, attrs=None, string=None, prefix=None,
1770- match_valence=True
1771- ):
1772+ self, strainer:SoupStrainer, name:str,
1773+ attrs:Optional[_RawOrProcessedAttributeValues]=None,
1774+ string:Optional[str]=None, prefix:Optional[str]=None,
1775+ ) -> bool:
1776 # Create a Tag with the given prefix, name and attributes,
1777 # then make sure that strainer.matches_tag and allow_tag_creation
1778 # both approve it.
1779diff --git a/bs4/tests/test_fuzz.py b/bs4/tests/test_fuzz.py
1780index f29802d..579686d 100644
1781--- a/bs4/tests/test_fuzz.py
1782+++ b/bs4/tests/test_fuzz.py
1783@@ -38,7 +38,7 @@ class TestFuzz(object):
1784 # multiple copies of the code must be kept around to run against
1785 # older tests. I'm not sure what to do about this, but I may
1786 # retire old tests after a time.
1787- def fuzz_test_with_css(self, filename):
1788+ def fuzz_test_with_css(self, filename:str) -> None:
1789 data = self.__markup(filename)
1790 parsers = ['lxml-xml', 'html5lib', 'html.parser', 'lxml']
1791 try:
1792@@ -168,7 +168,7 @@ class TestFuzz(object):
1793 def test_html5lib_parse_errors(self, filename):
1794 self.fuzz_test_with_css(filename)
1795
1796- def __markup(self, filename):
1797+ def __markup(self, filename:str) -> bytes:
1798 if not filename.endswith(self.TESTCASE_SUFFIX):
1799 filename += self.TESTCASE_SUFFIX
1800 this_dir = os.path.split(__file__)[0]
1801diff --git a/bs4/tests/test_html5lib.py b/bs4/tests/test_html5lib.py
1802index 9f6dfa1..3c34403 100644
1803--- a/bs4/tests/test_html5lib.py
1804+++ b/bs4/tests/test_html5lib.py
1805@@ -8,14 +8,13 @@ from bs4.filter import SoupStrainer
1806 from . import (
1807 HTML5LIB_PRESENT,
1808 HTML5TreeBuilderSmokeTest,
1809- SoupTest,
1810 )
1811
1812 @pytest.mark.skipif(
1813 not HTML5LIB_PRESENT,
1814 reason="html5lib seems not to be present, not testing its tree builder."
1815 )
1816-class TestHTML5LibBuilder(SoupTest, HTML5TreeBuilderSmokeTest):
1817+class TestHTML5LibBuilder(HTML5TreeBuilderSmokeTest):
1818 """See ``HTML5TreeBuilderSmokeTest``."""
1819
1820 @property
1821diff --git a/bs4/tests/test_htmlparser.py b/bs4/tests/test_htmlparser.py
1822index ff0f305..2a13c99 100644
1823--- a/bs4/tests/test_htmlparser.py
1824+++ b/bs4/tests/test_htmlparser.py
1825@@ -10,12 +10,14 @@ from bs4.builder import (
1826 XMLParsedAsHTMLWarning,
1827 )
1828 from bs4.builder._htmlparser import (
1829+ _DuplicateAttributeHandler,
1830 BeautifulSoupHTMLParser,
1831 HTMLParserTreeBuilder,
1832 )
1833+from typing import Any
1834 from . import SoupTest, HTMLTreeBuilderSmokeTest
1835
1836-class TestHTMLParserTreeBuilder(SoupTest, HTMLTreeBuilderSmokeTest):
1837+class TestHTMLParserTreeBuilder(HTMLTreeBuilderSmokeTest):
1838
1839 default_builder = HTMLParserTreeBuilder
1840
1841@@ -95,7 +97,7 @@ class TestHTMLParserTreeBuilder(SoupTest, HTMLTreeBuilderSmokeTest):
1842 assert "id" == soup.a['id']
1843
1844 # You can also get this behavior explicitly.
1845- def assert_attribute(on_duplicate_attribute, expected):
1846+ def assert_attribute(on_duplicate_attribute:_DuplicateAttributeHandler, expected:Any) -> None:
1847 soup = self.soup(
1848 markup, on_duplicate_attribute=on_duplicate_attribute
1849 )
1850diff --git a/bs4/tests/test_lxml.py b/bs4/tests/test_lxml.py
1851index 9fc04e0..dc20501 100644
1852--- a/bs4/tests/test_lxml.py
1853+++ b/bs4/tests/test_lxml.py
1854@@ -26,7 +26,7 @@ from . import (
1855 not LXML_PRESENT,
1856 reason="lxml seems not to be present, not testing its tree builder."
1857 )
1858-class TestLXMLTreeBuilder(SoupTest, HTMLTreeBuilderSmokeTest):
1859+class TestLXMLTreeBuilder(HTMLTreeBuilderSmokeTest):
1860 """See ``HTMLTreeBuilderSmokeTest``."""
1861
1862 @property
1863@@ -88,7 +88,7 @@ class TestLXMLTreeBuilder(SoupTest, HTMLTreeBuilderSmokeTest):
1864 not LXML_PRESENT,
1865 reason="lxml seems not to be present, not testing its XML tree builder."
1866 )
1867-class TestLXMLXMLTreeBuilder(SoupTest, XMLTreeBuilderSmokeTest):
1868+class TestLXMLXMLTreeBuilder(XMLTreeBuilderSmokeTest):
1869 """See ``HTMLTreeBuilderSmokeTest``."""
1870
1871 @property
1872diff --git a/bs4/tests/test_soup.py b/bs4/tests/test_soup.py
1873index c95f380..61d0235 100644
1874--- a/bs4/tests/test_soup.py
1875+++ b/bs4/tests/test_soup.py
1876@@ -8,6 +8,7 @@ import pickle
1877 import pytest
1878 import sys
1879 import tempfile
1880+from typing import Iterable
1881
1882 from bs4 import (
1883 BeautifulSoup,
1884@@ -260,14 +261,15 @@ class TestWarnings(SoupTest):
1885 # that the code that triggered the warning is in the same
1886 # file as the test.
1887
1888- def _assert_warning(self, warnings, cls):
1889+ def _assert_warning(
1890+ self, warnings:Iterable[Warning], cls:type[Warning]) -> Warning:
1891 for w in warnings:
1892 if isinstance(w.message, cls):
1893 assert w.filename == __file__
1894 return w
1895 raise Exception("%s warning not found in %r" % (cls, warnings))
1896
1897- def _assert_no_parser_specified(self, w):
1898+ def _assert_no_parser_specified(self, w:Warning) -> None:
1899 warning = self._assert_warning(w, GuessedAtParserWarning)
1900 message = str(warning.message)
1901 assert message.startswith(BeautifulSoup.NO_PARSER_SPECIFIED_WARNING[:60])
1902diff --git a/bs4/tests/test_tree.py b/bs4/tests/test_tree.py
1903index 43afb29..fe56e6b 100644
1904--- a/bs4/tests/test_tree.py
1905+++ b/bs4/tests/test_tree.py
1906@@ -135,7 +135,7 @@ class TestFindAllBasicNamespaces(SoupTest):
1907 class TestFindAllByName(SoupTest):
1908 """Test ways of finding tags by tag name."""
1909
1910- def setup_method(self):
1911+ def setup_method(self) -> None:
1912 self.tree = self.soup("""<a>First tag.</a>
1913 <b>Second tag.</b>
1914 <c>Third <a>Nested tag.</a> tag.</c>""")
1915@@ -459,7 +459,7 @@ class TestIndex(SoupTest):
1916 class TestParentOperations(SoupTest):
1917 """Test navigation and searching through an element's parents."""
1918
1919- def setup_method(self):
1920+ def setup_method(self) -> None:
1921 self.tree = self.soup('''<ul id="empty"></ul>
1922 <ul id="top">
1923 <ul id="middle">
1924@@ -508,14 +508,14 @@ class TestParentOperations(SoupTest):
1925
1926 class ProximityTest(SoupTest):
1927
1928- def setup_method(self):
1929+ def setup_method(self) -> None:
1930 self.tree = self.soup(
1931 '<html id="start"><head></head><body><b id="1">One</b><b id="2">Two</b><b id="3">Three</b></body></html>')
1932
1933
1934 class TestNextOperations(ProximityTest):
1935
1936- def setup_method(self):
1937+ def setup_method(self) -> None:
1938 super(TestNextOperations, self).setup_method()
1939 self.start = self.tree.b
1940
1941@@ -555,7 +555,7 @@ class TestNextOperations(ProximityTest):
1942
1943 class TestPreviousOperations(ProximityTest):
1944
1945- def setup_method(self):
1946+ def setup_method(self) -> None:
1947 super(TestPreviousOperations, self).setup_method()
1948 self.end = self.tree.find(string="Three")
1949
1950@@ -604,7 +604,7 @@ class TestPreviousOperations(ProximityTest):
1951
1952 class SiblingTest(SoupTest):
1953
1954- def setup_method(self):
1955+ def setup_method(self) -> None:
1956 markup = '''<html>
1957 <span id="1">
1958 <span id="1.1"></span>
1959@@ -625,7 +625,7 @@ class SiblingTest(SoupTest):
1960
1961 class TestNextSibling(SiblingTest):
1962
1963- def setup_method(self):
1964+ def setup_method(self) -> None:
1965 super(TestNextSibling, self).setup_method()
1966 self.start = self.tree.find(id="1")
1967
1968@@ -670,7 +670,7 @@ class TestNextSibling(SiblingTest):
1969
1970 class TestPreviousSibling(SiblingTest):
1971
1972- def setup_method(self):
1973+ def setup_method(self) -> None:
1974 super(TestPreviousSibling, self).setup_method()
1975 self.end = self.tree.find(id="4")
1976
1977diff --git a/doc/index.rst b/doc/index.rst
1978index a414830..0ff1fb2 100755
1979--- a/doc/index.rst
1980+++ b/doc/index.rst
1981@@ -3029,38 +3029,64 @@ Advanced search techniques
1982
1983 Almost everyone who uses Beautiful Soup to extract information from a
1984 document can get what they need using the methods described in
1985-`Searching the tree`_. However, there's a lower-level interface--the
1986-:py:class:`ElementSelector` class-- which lets you define any matching
1987+`Searching the tree`_. However, there is a lower-level interface--the
1988+:py:class:`ElementFilter` class--that lets you define any matching
1989 behavior whatsoever.
1990
1991-To use :py:class:`ElementSelector`, define a function that takes a
1992-:py:class:`PageElement` object (that is, it might be either a
1993-:py:class:`Tag` or a :py:class`NavigableString`) and returns ``True``
1994+To use :py:class:`ElementFilter`, define a function that takes a
1995+:py:class:`PageElement` object (which can be either a :py:class:`Tag`
1996+*or* :py:class:`NavigableString` object) and returns ``True``
1997 (if the element matches your custom criteria) or ``False`` (if it
1998-doesn't)::
1999+does not)::
2000
2001- [example goes here]
2002+ def _match_non_whitespace(pe):
2003+ """
2004+        Return True for:
2005+ * all Tag objects
2006+ * NavigableString objects that contain non-whitespace text
2007+ """
2008+ return (
2009+ isinstance(pe, Tag) or
2010+ (isinstance(pe, NavigableString) and
2011+ pe.text and not pe.text.isspace()))
2012
2013-Then, pass the function into an :py:class:`ElementSelector`::
2014+Then, construct an :py:class:`ElementFilter` that uses your function::
2015
2016- from bs4.select import ElementSelector
2017- selector = ElementSelector(f)
2018+ from bs4.filter import ElementFilter
2019+ skip_whitespace = ElementFilter(match_function=_match_non_whitespace)
2020
2021-You can then pass the :py:class:`ElementSelector` object as the first
2022+You can now pass this :py:class:`ElementFilter` object as the first
2023 argument to any of the `Searching the tree`_ methods::
2024
2025- [examples go here]
2026+ from bs4 import BeautifulSoup
2027+ html_doc = """
2028+ <p>
2029+ <b>bold</b>
2030+ <i>italic</i>
2031+ and
2032+ <u>underline</u>
2033+ </p>
2034+ """
2035+ soup = BeautifulSoup(html_doc, 'lxml')
2036
2037-Every potential match will be run through your function, and the only
2038-:py:class:`PageElement` objects returned will be the one where your
2039+ soup.find('p').find_all(skip_whitespace, recursive=False)
2040+ # [<b>bold</b>, <i>italic</i>, '\n and\n ', <u>underline</u>]
2041+
2042+Every :py:class:`PageElement` encountered will be evaluated by your
2043+function, and the objects returned will be only the ones where your
2044 function returned ``True``.
2045
2046-Note that this is different from simply passing `a function`_ as the
2047-first argument to one of the search methods. That's an easy way to
2048-find a tag, but _only_ tags will be considered. With an
2049-:py:class:`ElementSelector` you can write a single function that makes
2050-decisions about both tags and strings.
2051-
2052+To summarize the function-based matching behaviors:
2053+
2054+* A function passed as the first argument to a search method
2055+ (or equivalently, using the ``name`` argument) considers only
2056+ :py:class:`Tag` objects.
2057+* A function passed to a search method using the ``string`` argument
2058+ considers only :py:class:`NavigableString` objects.
2059+* A function wrapped in an :py:class:`ElementFilter` object and passed to
2060+  a search method considers both :py:class:`Tag` and
2061+  :py:class:`NavigableString` objects.
2062+
2063
2064 Advanced parser customization
2065 =============================
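The doc changes above summarize three function-based matching behaviors. The first two already exist in released Beautiful Soup and can be sketched as follows (the HTML fragment and lambda names are illustrative; the third behavior, passing an :py:class:`ElementFilter`, is the new API added by this branch and is shown in the doc example itself):

```python
from bs4 import BeautifulSoup

html = "<p><b>bold</b> and <i>italic</i></p>"
soup = BeautifulSoup(html, "html.parser")

# 1. A function passed as the first (name) argument is called
#    only on Tag objects; strings are never considered.
tags = soup.find_all(lambda tag: tag.name in ("b", "i"))
print([t.name for t in tags])  # ['b', 'i']

# 2. A function passed via the string= argument is called
#    only on NavigableString objects; tags are never considered.
strings = soup.find_all(string=lambda s: "and" in s)
print(strings)  # [' and ']
```

An `ElementFilter` built from a single match function unifies these two cases, letting one function decide about both tags and strings in a single traversal.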

Subscribers

People subscribed via source and target branches