Merge ~chrispitude/beautifulsoup:node-filters into beautifulsoup:master

Proposed by Chris Papademetrious
Status: Superseded
Proposed branch: ~chrispitude/beautifulsoup:node-filters
Merge into: beautifulsoup:master
Diff against target: 9961 lines (+4153/-2375)
31 files modified
CHANGELOG (+99/-0)
bs4/__init__.py (+245/-73)
bs4/_deprecation.py (+57/-0)
bs4/_typing.py (+99/-0)
bs4/builder/__init__.py (+278/-184)
bs4/builder/_html5lib.py (+57/-36)
bs4/builder/_htmlparser.py (+120/-66)
bs4/builder/_lxml.py (+137/-68)
bs4/css.py (+124/-96)
bs4/dammit.py (+407/-230)
bs4/diagnose.py (+31/-19)
bs4/element.py (+1154/-1052)
bs4/formatter.py (+96/-41)
bs4/strainer.py (+498/-0)
bs4/tests/__init__.py (+17/-12)
bs4/tests/test_builder_registry.py (+3/-3)
bs4/tests/test_dammit.py (+17/-10)
bs4/tests/test_element.py (+25/-5)
bs4/tests/test_html5lib.py (+17/-1)
bs4/tests/test_htmlparser.py (+5/-3)
bs4/tests/test_lxml.py (+4/-3)
bs4/tests/test_pageelement.py (+8/-4)
bs4/tests/test_soup.py (+13/-5)
bs4/tests/test_strainer.py (+485/-0)
bs4/tests/test_tag.py (+1/-0)
bs4/tests/test_tree.py (+44/-25)
dev/null (+0/-256)
doc/Makefile (+14/-124)
doc/conf.py (+33/-0)
doc/index.rst (+61/-57)
tox.ini (+4/-2)
Reviewer: Leonard Richardson (status: Pending)
Review via email: mp+457782@code.launchpad.net

This proposal has been superseded by a proposal from 2024-01-01.

Commit message

implement "node" filtering that considers all PageElement objects

Description of the change

This is a draft merge proposal that implements a proof of concept for the following wishlist request:

#2047713: enhance find*() methods to filter through all object types
https://bugs.launchpad.net/beautifulsoup/+bug/2047713

Most of the changes thread the new "node" filter argument down through the machinery; the actual matching functionality is just three additional lines in the search() method.
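
To make the shape of that change concrete, here is a minimal sketch of the core check. The class and its structure are hypothetical stand-ins for illustration, not the actual patch:

====
from typing import Callable, Optional

from bs4.element import PageElement  # base class of Tag, NavigableString, etc.

class NodeStrainer:
    """Hypothetical reduction of the idea: a strainer whose only rule
    is a callable applied to every PageElement, whatever its type."""

    def __init__(self, node: Callable[[PageElement], bool]):
        self.node = node

    def search(self, markup: PageElement) -> Optional[PageElement]:
        # The core of the proposal: the node filter alone decides
        # whether this element matches, whether it is a Tag, a
        # NavigableString, a Comment, or anything else.
        if self.node(markup):
            return markup
        return None
====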

With these changes, when the following script is run:

====
#!/usr/bin/env python
from bs4 import BeautifulSoup, NavigableString
html_doc = """
  <p>
    <b>bold</b>
    <i>italic</i>
    <u>underline</u>
    text
    <br />
  </p>
"""
soup = BeautifulSoup(html_doc, 'lxml')

# this is the filter I want to use
def is_non_whitespace(thing) -> bool:
    return not (isinstance(thing, NavigableString) and thing.text.isspace())

# get the first non-whitespace thing in <p>
this_thing = soup.find('p').find(node=is_non_whitespace, recursive=False)

# print all following non-whitespace sibling elements in <p>
while this_thing:
    next_thing = this_thing.find_next_sibling(node=is_non_whitespace)
    print(f"{repr(this_thing)} is followed by {repr(next_thing)}")
    this_thing = next_thing
====

the results are as follows:

====
<b>bold</b> is followed by <i>italic</i>
<i>italic</i> is followed by <u>underline</u>
<u>underline</u> is followed by '\n text\n '
'\n text\n ' is followed by <br/>
<br/> is followed by None
====

Note the mix of tag and text objects!
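
For comparison, approximating this traversal today means collecting and filtering the children by hand. This sketch uses only current Beautiful Soup methods and produces the same output:

====
from bs4 import BeautifulSoup, NavigableString

html_doc = """
  <p>
    <b>bold</b>
    <i>italic</i>
    <u>underline</u>
    text
    <br />
  </p>
"""
soup = BeautifulSoup(html_doc, 'lxml')

def is_non_whitespace(thing) -> bool:
    return not (isinstance(thing, NavigableString) and thing.text.isspace())

# filter the direct children of <p> manually, then pair each
# item with its successor
things = [t for t in soup.find('p').children if is_non_whitespace(t)]
for this_thing, next_thing in zip(things, things[1:] + [None]):
    print(f"{repr(this_thing)} is followed by {repr(next_thing)}")
====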

Some questions and open items:

* Is "PageElement" the correct term for any object in a BeautifulSoup document (Tag, NavigableString, Comment, ProcessingInstruction, etc.)?
* What should this new filter argument be called? (node? page_element? something else?)
* Is there a more elegant approach that doesn't require threading a new argument down into everything?
* Rules for mixing this new filter with existing name/attribute filters must be defined/coded/documented.
  * I think this new filter should be mutually exclusive with tag/attribute filters (see the sketch after this list).
  * I think this new filter should accept only Callable objects, and perhaps also True/False.
* Tests and documentation are needed.
  * I can do this when the implementation is complete.
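
For instance, the mutual-exclusivity and type rules proposed above might be enforced with a validation step along these lines (a sketch only; the function name and signature are assumptions, not part of the patch):

====
from typing import Any, Callable, Union

def validate_node_filter(node: Union[Callable, bool, None],
                         name: Any = None, attrs: Any = None,
                         **kwargs: Any) -> None:
    """Reject argument combinations that the proposed rules disallow."""
    if node is None:
        return
    # Proposed rule 1: the node filter is mutually exclusive with
    # tag name and attribute filters.
    if name is not None or attrs or kwargs:
        raise ValueError("the node filter cannot be combined with "
                         "tag or attribute filters")
    # Proposed rule 2: the node filter accepts only a callable,
    # and perhaps True/False.
    if not (isinstance(node, bool) or callable(node)):
        raise TypeError("the node filter must be a callable or a boolean")
====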

Fingers crossed that this makes it in. It would be enormously powerful.

Revision history for this message
Chris Papademetrious (chrispitude) wrote:

I will redo this merge proposal on top of the 4.13 branch.

Before I start, are there any aspects that you'd like to see implemented differently?

Unmerged commits

a771557... by Chris Papademetrious

implement a filter that considers all PageElement objects

8a6d1dd... by Leonard Richardson

Merged in change to main branch.

4cde600... by Leonard Richardson

Those casts are more trouble than they're worth.

1113a86... by Leonard Richardson

Got css.py to pass mypy strict although it's a little hacky.

26e1772... by Leonard Richardson

Went through formatter.py with mypy strict.

5bf3787... by Leonard Richardson

Went through dammit.py with mypy strict.

7200655... by Leonard Richardson

Merged in main branch.

f3a3619... by Leonard Richardson

Got rid of deprecation warnings in tests.

6f89323... by Leonard Richardson

Get (slightly) more specific about alias.

f8e55c0... by Leonard Richardson

_alias itself is not used anywhere.

Preview Diff

1diff --git a/CHANGELOG b/CHANGELOG
2index 66fcb74..bec1e11 100644
3--- a/CHANGELOG
4+++ b/CHANGELOG
5@@ -1,3 +1,102 @@
6+= 4.13.0 (Unreleased)
7+
8+* This version drops support for Python 3.6. The minimum supported
9+ Python version for Beautiful Soup is now 3.7.
10+
11+* Deprecation warnings have been added for all deprecated methods and
12+ attributes. Most of these were deprecated over ten years ago, and
13+ some were deprecated over fifteen years ago.
14+
15+ Going forward, deprecated names will be subject to removal two
16+ feature releases or one major release after the deprecation warning
17+ is added.
18+
19+* append(), extend(), insert(), and unwrap() were moved from PageElement to
20+ Tag. Those methods manipulate the 'contents' collection, so they would
21+ only have ever worked on Tag objects.
22+
23+* decompose() was moved from Tag to PageElement, since there's no reason
24+ it won't also work on NavigableString objects.
25+
26+* The BeautifulSoupHTMLParser constructor now requires a BeautifulSoup
27+ object as its first argument. This almost certainly does not affect
28+ you, since you probably use HTMLParserTreeBuilder, not
29+ BeautifulSoupHTMLParser directly.
30+
31+* If Tag.get_attribute_list() is used to access an attribute that's not set,
32+ the return value is now an empty list rather than [None].
33+
34+* AttributeValueWithCharsetSubstitution.encode() is renamed to
35+ substitute_encoding, to avoid confusion with the much different str.encode()
36+
37+* Using PageElement.replace_with() to replace an element with itself
38+ returns the element instead of None.
39+
40+* When using one of the find() methods or creating a SoupStrainer,
41+ if you specify the same attribute value in ``attrs`` and the
42+ keyword arguments, you'll end up with two different ways to match that
43+ attribute. Previously the value in keyword arguments would override the
44+ value in ``attrs``.
45+
46+* When using one of the find() methods or creating a SoupStrainer, you can
47+ pass a list of any accepted object (strings, regular expressions, etc.) for
48+ any of the arguments. Previously you could only pass in a list of strings.
49+
50+* A SoupStrainer can now filter tag creation based on a tag's
51+ namespaced name. Previously only the unqualified name could be used.
52+
53+* All TreeBuilder constructors now take the empty_element_tags
54+ argument. The sets of tags found in HTMLTreeBuilder.empty_element_tags and
55+ HTMLTreeBuilder.block_elements are now in
56+ HTMLTreeBuilder.DEFAULT_EMPTY_ELEMENT_TAGS and
57+ HTMLTreeBuilder.DEFAULT_BLOCK_ELEMENTS, to avoid confusing them with
58+ instance variables.
59+
60+* Issue a warning if a document is parsed using a SoupStrainer that's just
61+ going to filter everything. In these cases, filtering everything is the
62+ most consistent thing to do, but there was no indication that was
63+ happening.
64+
65+* UnicodeDammit.markup is now always a bytestring representing the
66+ *original* markup (sans BOM), and UnicodeDammit.unicode_markup is
67+ always the same markup, converted to Unicode. Previously,
68+ UnicodeDammit.markup was treated inconsistently and would often end
69+ up containing Unicode. UnicodeDammit.markup was not a documented
70+ attribute, but if you were using it, you probably want to switch to using
71+ .unicode_markup instead.
72+
73+* Corrected the markup that's output in the unlikely event that you
74+ encode a document to a Python internal encoding (like "palmos")
75+ that's not recognized by the HTML or XML standard.
76+
77+* The arguments to LXMLTreeBuilderForXML.prepare_markup have been
78+ changed to match the arguments to the superclass,
79+ TreeBuilder.prepare_markup. Specifically, document_declared_encoding
80+ now appears before exclude_encodings, not after. If you were calling
81+ this method yourself, I recommend switching to using keyword
82+ arguments instead.
83+
84+* Fixed an error in the lookup table used when converting
85+ ISO-Latin-1 to ASCII, which no one should do anyway.
86+
87+* The unused constant LXMLTreeBuilderForXML.DEFAULT_PARSER_CLASS
88+ has been removed.
89+
90+New deprecations in 4.13.0:
91+
92+* The SAXTreeBuilder class, which was never officially supported or tested.
93+
94+* The first argument to BeautifulSoup.decode has been changed from a bool
95+ `pretty_print` to an int `indent_level`, to match the signature of Tag.decode.
96+
97+* SoupStrainer.text and SoupStrainer.string are both deprecated
98+ since a single item can't capture all the possibilities of a SoupStrainer
99+ designed to match strings.
100+
101+* SoupStrainer.search_tag() is deprecated. It was never a
102+ documented method, but if you use it, you should start using
103+ SoupStrainer.allow_tag_creation() instead.
104+
105 = 4.12.3 (?)
106
107 * Fixed a regression such that if you set .hidden on a tag, the tag
108diff --git a/bs4/__init__.py b/bs4/__init__.py
109index 3d2ab09..a1289c7 100644
110--- a/bs4/__init__.py
111+++ b/bs4/__init__.py
112@@ -7,8 +7,8 @@ Beautiful Soup uses a pluggable XML or HTML parser to parse a
113 provides methods and Pythonic idioms that make it easy to navigate,
114 search, and modify the parse tree.
115
116-Beautiful Soup works with Python 3.6 and up. It works better if lxml
117-and/or html5lib is installed.
118+Beautiful Soup works with Python 3.7 and up. It works better if lxml
119+and/or html5lib is installed, but they are not required.
120
121 For more than you ever wanted to know about Beautiful Soup, see the
122 documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
123@@ -37,9 +37,10 @@ if sys.version_info.major < 3:
124 from .builder import (
125 builder_registry,
126 ParserRejectedMarkup,
127+ TreeBuilder,
128 XMLParsedAsHTMLWarning,
129- HTMLParserTreeBuilder
130 )
131+from .builder._htmlparser import HTMLParserTreeBuilder
132 from .dammit import UnicodeDammit
133 from .element import (
134 CData,
135@@ -55,10 +56,32 @@ from .element import (
136 ResultSet,
137 Script,
138 Stylesheet,
139- SoupStrainer,
140 Tag,
141 TemplateString,
142 )
143+from .formatter import Formatter
144+from .strainer import SoupStrainer
145+from typing import (
146+ Any,
147+ cast,
148+ Counter as CounterType,
149+ Dict,
150+ Iterable,
151+ List,
152+ Sequence,
153+ Optional,
154+ Type,
155+ TYPE_CHECKING,
156+ Union,
157+)
158+
159+from bs4._typing import (
160+ _AttributeValue,
161+ _AttributeValues,
162+ _Encoding,
163+ _Encodings,
164+ _IncomingMarkup,
165+)
166
167 # Define some custom warnings.
168 class GuessedAtParserWarning(UserWarning):
169@@ -104,24 +127,64 @@ class BeautifulSoup(Tag):
170 handle_endtag.
171 """
172
173- # Since BeautifulSoup subclasses Tag, it's possible to treat it as
174- # a Tag with a .name. This name makes it clear the BeautifulSoup
175- # object isn't a real markup tag.
176- ROOT_TAG_NAME = '[document]'
177-
178- # If the end-user gives no indication which tree builder they
179- # want, look for one with these features.
180- DEFAULT_BUILDER_FEATURES = ['html', 'fast']
181-
182- # A string containing all ASCII whitespace characters, used in
183- # endData() to detect data chunks that seem 'empty'.
184- ASCII_SPACES = '\x20\x0a\x09\x0c\x0d'
185-
186- NO_PARSER_SPECIFIED_WARNING = "No parser was explicitly specified, so I'm using the best available %(markup_type)s parser for this system (\"%(parser)s\"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.\n\nThe code that caused this warning is on line %(line_number)s of the file %(filename)s. To get rid of this warning, pass the additional argument 'features=\"%(parser)s\"' to the BeautifulSoup constructor.\n"
187-
188- def __init__(self, markup="", features=None, builder=None,
189- parse_only=None, from_encoding=None, exclude_encodings=None,
190- element_classes=None, **kwargs):
191+ #: Since `BeautifulSoup` subclasses `Tag`, it's possible to treat it as
192+ #: a `Tag` with a `Tag.name`. However, this name makes it clear the
193+ #: `BeautifulSoup` object isn't a real markup tag.
194+ ROOT_TAG_NAME:str = '[document]'
195+
196+ #: If the end-user gives no indication which tree builder they
197+ #: want, look for one with these features.
198+ DEFAULT_BUILDER_FEATURES: Sequence[str] = ['html', 'fast']
199+
200+ #: A string containing all ASCII whitespace characters, used in
201+ #: `BeautifulSoup.endData` to detect data chunks that seem 'empty'.
202+ ASCII_SPACES: str = '\x20\x0a\x09\x0c\x0d'
203+
204+ #: :meta private:
205+ NO_PARSER_SPECIFIED_WARNING: str = "No parser was explicitly specified, so I'm using the best available %(markup_type)s parser for this system (\"%(parser)s\"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.\n\nThe code that caused this warning is on line %(line_number)s of the file %(filename)s. To get rid of this warning, pass the additional argument 'features=\"%(parser)s\"' to the BeautifulSoup constructor.\n"
206+
207+ # FUTURE PYTHON:
208+ element_classes:Dict[Type[PageElement], Type[Any]] #: :meta private:
209+ builder:TreeBuilder #: :meta private:
210+ is_xml: bool
211+ known_xml: Optional[bool]
212+ parse_only: Optional[SoupStrainer] #: :meta private:
213+
214+ # These members are only used while parsing markup.
215+ markup:Optional[Union[str,bytes]] #: :meta private:
216+ current_data:List[str] #: :meta private:
217+ currentTag:Optional[Tag] #: :meta private:
218+ tagStack:List[Tag] #: :meta private:
219+ open_tag_counter:CounterType[str] #: :meta private:
220+ preserve_whitespace_tag_stack:List[Tag] #: :meta private:
221+ string_container_stack:List[Tag] #: :meta private:
222+
223+ #: Beautiful Soup's best guess as to the character encoding of the
224+ #: original document.
225+ original_encoding: Optional[_Encoding]
226+
227+ #: The character encoding, if any, that was explicitly defined
228+ #: in the original document. This may or may not match
229+ #: `BeautifulSoup.original_encoding`.
230+ declared_html_encoding: Optional[_Encoding]
231+
232+ #: This is True if the markup that was parsed contains
233+ #: U+FFFD REPLACEMENT_CHARACTER characters which were not present
234+ #: in the original markup. These mark character sequences that
235+ #: could not be represented in Unicode.
236+ contains_replacement_characters: bool
237+
238+ def __init__(
239+ self,
240+ markup:_IncomingMarkup="",
241+ features:Optional[Union[str,Sequence[str]]]=None,
242+ builder:Optional[Union[TreeBuilder,Type[TreeBuilder]]]=None,
243+ parse_only:Optional[SoupStrainer]=None,
244+ from_encoding:Optional[_Encoding]=None,
245+ exclude_encodings:Optional[_Encodings]=None,
246+ element_classes:Optional[Dict[Type[PageElement], Type[Any]]]=None,
247+ **kwargs:Any
248+ ):
249 """Constructor.
250
251 :param markup: A string or a file-like object representing
252@@ -196,14 +259,14 @@ class BeautifulSoup(Tag):
253 if 'selfClosingTags' in kwargs:
254 del kwargs['selfClosingTags']
255 warnings.warn(
256- "BS4 does not respect the selfClosingTags argument to the "
257+ "Beautiful Soup 4 does not respect the selfClosingTags argument to the "
258 "BeautifulSoup constructor. The tree builder is responsible "
259 "for understanding self-closing tags.")
260
261 if 'isHTML' in kwargs:
262 del kwargs['isHTML']
263 warnings.warn(
264- "BS4 does not respect the isHTML argument to the "
265+ "Beautiful Soup 4 does not respect the isHTML argument to the "
266 "BeautifulSoup constructor. Suggest you use "
267 "features='lxml' for HTML and features='lxml-xml' for "
268 "XML.")
269@@ -212,7 +275,8 @@ class BeautifulSoup(Tag):
270 if old_name in kwargs:
271 warnings.warn(
272 'The "%s" argument to the BeautifulSoup constructor '
273- 'has been renamed to "%s."' % (old_name, new_name),
274+ 'was renamed to "%s" in Beautiful Soup 4.0.0' % (
275+ old_name, new_name),
276 DeprecationWarning, stacklevel=3
277 )
278 return kwargs.pop(old_name)
279@@ -220,7 +284,14 @@ class BeautifulSoup(Tag):
280
281 parse_only = parse_only or deprecated_argument(
282 "parseOnlyThese", "parse_only")
283-
284+ if (parse_only is not None
285+ and parse_only.string_rules and
286+ (parse_only.name_rules or parse_only.attribute_rules)):
287+ warnings.warn(
288+ f"Value for parse_only will exclude everything, since it puts restrictions on both tags and strings: {parse_only}",
289+ UserWarning, stacklevel=3
290+ )
291+
292 from_encoding = from_encoding or deprecated_argument(
293 "fromEncoding", "from_encoding")
294
295@@ -235,7 +306,8 @@ class BeautifulSoup(Tag):
296 # specify a parser' warning.
297 original_builder = builder
298 original_features = features
299-
300+
301+ builder_class: Type[TreeBuilder]
302 if isinstance(builder, type):
303 # A builder class was passed in; it needs to be instantiated.
304 builder_class = builder
305@@ -245,12 +317,13 @@ class BeautifulSoup(Tag):
306 features = [features]
307 if features is None or len(features) == 0:
308 features = self.DEFAULT_BUILDER_FEATURES
309- builder_class = builder_registry.lookup(*features)
310- if builder_class is None:
311+ possible_builder_class = builder_registry.lookup(*features)
312+ if possible_builder_class is None:
313 raise FeatureNotFound(
314 "Couldn't find a tree builder with the features you "
315 "requested: %s. Do you need to install a parser library?"
316 % ",".join(features))
317+ builder_class = cast(Type[TreeBuilder], possible_builder_class)
318
319 # At this point either we have a TreeBuilder instance in
320 # builder, or we have a builder_class that we can instantiate
321@@ -259,7 +332,8 @@ class BeautifulSoup(Tag):
322 builder = builder_class(**kwargs)
323 if not original_builder and not (
324 original_features == builder.NAME or
325- original_features in builder.ALTERNATE_NAMES
326+ (isinstance(original_features, str)
327+ and original_features in builder.ALTERNATE_NAMES)
328 ) and markup:
329 # The user did not tell us which TreeBuilder to use,
330 # and we had to guess. Issue a warning.
331@@ -323,6 +397,10 @@ class BeautifulSoup(Tag):
332 if not self._markup_is_url(markup):
333 self._markup_resembles_filename(markup)
334
335+ # At this point we know markup is a string or bytestring. If
336+ # it was a file-type object, we've read from it.
337+ markup = cast(Union[str,bytes], markup)
338+
339 rejections = []
340 success = False
341 for (self.markup, self.original_encoding, self.declared_html_encoding,
342@@ -486,7 +564,7 @@ class BeautifulSoup(Tag):
343 markup.
344 """
345 Tag.__init__(self, self, self.builder, self.ROOT_TAG_NAME)
346- self.hidden = 1
347+ self.hidden = True
348 self.builder.reset()
349 self.current_data = []
350 self.currentTag = None
351@@ -497,8 +575,16 @@ class BeautifulSoup(Tag):
352 self._most_recent_element = None
353 self.pushTag(self)
354
355- def new_tag(self, name, namespace=None, nsprefix=None, attrs={},
356- sourceline=None, sourcepos=None, **kwattrs):
357+ def new_tag(
358+ self,
359+ name:str,
360+ namespace:Optional[str]=None,
361+ nsprefix:Optional[str]=None,
362+ attrs:_AttributeValues={},
363+ sourceline:Optional[int]=None,
364+ sourcepos:Optional[int]=None,
365+ **kwattrs:_AttributeValue,
366+ ):
367 """Create a new Tag associated with this BeautifulSoup object.
368
369 :param name: The name of the new Tag.
370@@ -509,7 +595,7 @@ class BeautifulSoup(Tag):
371 that are reserved words in Python.
372 :param sourceline: The line number where this tag was
373 (purportedly) found in its source document.
374- :param sourcepos: The character position within `sourceline` where this
375+ :param sourcepos: The character position within ``sourceline`` where this
376 tag was (purportedly) found.
377 :param kwattrs: Keyword arguments for the new Tag's attribute values.
378
379@@ -520,9 +606,17 @@ class BeautifulSoup(Tag):
380 sourceline=sourceline, sourcepos=sourcepos
381 )
382
383- def string_container(self, base_class=None):
384+ def string_container(self,
385+ base_class:Optional[Type[NavigableString]]=None
386+ ) -> Type[NavigableString]:
387+ """Find the class that should be instantiated to hold a given kind of
388+ string.
389+
390+ This may be a built-in Beautiful Soup class or a custom class passed
391+ in to the BeautifulSoup constructor.
392+ """
393 container = base_class or NavigableString
394-
395+
396 # There may be a general override of NavigableString.
397 container = self.element_classes.get(
398 container, container
399@@ -536,27 +630,40 @@ class BeautifulSoup(Tag):
400 )
401 return container
402
403- def new_string(self, s, subclass=None):
404- """Create a new NavigableString associated with this BeautifulSoup
405+ def new_string(self, s:str, subclass:Optional[Type[NavigableString]]=None) -> NavigableString:
406+ """Create a new `NavigableString` associated with this `BeautifulSoup`
407 object.
408+
409+ :param s: The string content of the `NavigableString`
410+
411+ :param subclass: The subclass of `NavigableString`, if any, to
412+ use. If a document is being processed, an appropriate subclass
413+ for the current location in the document will be determined
414+ automatically.
415 """
416 container = self.string_container(subclass)
417 return container(s)
418
419- def insert_before(self, *args):
420+ def insert_before(self, *args:PageElement) -> None:
421 """This method is part of the PageElement API, but `BeautifulSoup` doesn't implement
422 it because there is nothing before or after it in the parse tree.
423 """
424 raise NotImplementedError("BeautifulSoup objects don't support insert_before().")
425
426- def insert_after(self, *args):
427+ def insert_after(self, *args:PageElement) -> None:
428 """This method is part of the PageElement API, but `BeautifulSoup` doesn't implement
429 it because there is nothing before or after it in the parse tree.
430 """
431 raise NotImplementedError("BeautifulSoup objects don't support insert_after().")
432
433- def popTag(self):
434- """Internal method called by _popToTag when a tag is closed."""
435+ def popTag(self) -> Optional[Tag]:
436+ """Internal method called by _popToTag when a tag is closed.
437+
438+ :meta private:
439+ """
440+ if not self.tagStack:
441+ # Nothing to pop. This shouldn't happen.
442+ return None
443 tag = self.tagStack.pop()
444 if tag.name in self.open_tag_counter:
445 self.open_tag_counter[tag.name] -= 1
446@@ -569,8 +676,11 @@ class BeautifulSoup(Tag):
447 self.currentTag = self.tagStack[-1]
448 return self.currentTag
449
450- def pushTag(self, tag):
451- """Internal method called by handle_starttag when a tag is opened."""
452+ def pushTag(self, tag:Tag) -> None:
453+ """Internal method called by handle_starttag when a tag is opened.
454+
455+ :meta private:
456+ """
457 #print("Push", tag.name)
458 if self.currentTag is not None:
459 self.currentTag.contents.append(tag)
460@@ -583,9 +693,14 @@ class BeautifulSoup(Tag):
461 if tag.name in self.builder.string_containers:
462 self.string_container_stack.append(tag)
463
464- def endData(self, containerClass=None):
465+ def endData(self, containerClass:Optional[Type[NavigableString]]=None) -> None:
466 """Method called by the TreeBuilder when the end of a data segment
467 occurs.
468+
469+ :param containerClass: The class to use when incorporating the
470+ data segment into the parse tree.
471+
472+ :meta private:
473 """
474 if self.current_data:
475 current_data = ''.join(self.current_data)
476@@ -609,18 +724,27 @@ class BeautifulSoup(Tag):
477
478 # Should we add this string to the tree at all?
479 if self.parse_only and len(self.tagStack) <= 1 and \
480- (not self.parse_only.text or \
481- not self.parse_only.search(current_data)):
482+ (not self.parse_only.string_rules or \
483+ not self.parse_only.allow_string_creation(current_data)):
484 return
485
486 containerClass = self.string_container(containerClass)
487 o = containerClass(current_data)
488 self.object_was_parsed(o)
489
490- def object_was_parsed(self, o, parent=None, most_recent_element=None):
491- """Method called by the TreeBuilder to integrate an object into the parse tree."""
492+ def object_was_parsed(
493+ self, o:PageElement, parent:Optional[Tag]=None,
494+ most_recent_element:Optional[PageElement]=None):
495+ """Method called by the TreeBuilder to integrate an object into the
496+ parse tree.
497+
498+
499+
500+ :meta private:
501+ """
502 if parent is None:
503 parent = self.currentTag
504+ assert parent is not None
505 if most_recent_element is not None:
506 previous_element = most_recent_element
507 else:
508@@ -685,7 +809,7 @@ class BeautifulSoup(Tag):
509 break
510 target = target.parent
511
512- def _popToTag(self, name, nsprefix=None, inclusivePop=True):
513+ def _popToTag(self, name, nsprefix=None, inclusivePop=True) -> Optional[Tag]:
514 """Pops the tag stack up to and including the most recent
515 instance of the given tag.
516
517@@ -698,11 +822,12 @@ class BeautifulSoup(Tag):
518 to but *not* including the most recent instance of the
519 given tag.
520
521+ :meta private:
522 """
523 #print("Popping to %s" % name)
524 if name == self.ROOT_TAG_NAME:
525 # The BeautifulSoup object itself can never be popped.
526- return
527+ return None
528
529 most_recently_popped = None
530
531@@ -719,8 +844,11 @@ class BeautifulSoup(Tag):
532
533 return most_recently_popped
534
535- def handle_starttag(self, name, namespace, nsprefix, attrs, sourceline=None,
536- sourcepos=None, namespaces=None):
537+ def handle_starttag(
538+ self, name:str, namespace:Optional[str],
539+ nsprefix:Optional[str], attrs:Optional[Dict[str,str]],
540+ sourceline:Optional[int]=None, sourcepos:Optional[int]=None,
541+ namespaces:Optional[Dict[str, str]]=None) -> Optional[Tag]:
542 """Called by the tree builder when a new tag is encountered.
543
544 :param name: Name of the tag.
545@@ -737,13 +865,15 @@ class BeautifulSoup(Tag):
546 SoupStrainer. You should proceed as if the tag had not occurred
547 in the document. For instance, if this was a self-closing tag,
548 don't call handle_endtag.
549+
550+ :meta private:
551 """
552 # print("Start tag %s: %s" % (name, attrs))
553 self.endData()
554
555 if (self.parse_only and len(self.tagStack) <= 1
556- and (self.parse_only.text
557- or not self.parse_only.search_tag(name, attrs))):
558+ and (self.parse_only.string_rules
559+ or not self.parse_only.allow_tag_creation(nsprefix, name, attrs))):
560 return None
561
562 tag = self.element_classes.get(Tag, Tag)(
563@@ -760,48 +890,90 @@ class BeautifulSoup(Tag):
564 self.pushTag(tag)
565 return tag
566
567- def handle_endtag(self, name, nsprefix=None):
568+ def handle_endtag(self, name:str, nsprefix:Optional[str]=None) -> None:
569 """Called by the tree builder when an ending tag is encountered.
570
571 :param name: Name of the tag.
572 :param nsprefix: Namespace prefix for the tag.
573+
574+ :meta private:
575 """
576 #print("End tag: " + name)
577 self.endData()
578 self._popToTag(name, nsprefix)
579
580- def handle_data(self, data):
581- """Called by the tree builder when a chunk of textual data is encountered."""
582+ def handle_data(self, data:str) -> None:
583+ """Called by the tree builder when a chunk of textual data is
584+ encountered.
585+
586+ :meta private:
587+ """
588 self.current_data.append(data)
589
590- def decode(self, pretty_print=False,
591- eventual_encoding=DEFAULT_OUTPUT_ENCODING,
592- formatter="minimal", iterator=None):
593- """Returns a string or Unicode representation of the parse tree
594- as an HTML or XML document.
595-
596- :param pretty_print: If this is True, indentation will be used to
597- make the document more readable.
598+ def decode(self, indent_level:Optional[int]=None,
599+ eventual_encoding:_Encoding=DEFAULT_OUTPUT_ENCODING,
600+ formatter:Union[Formatter,str]="minimal",
601+ iterator:Optional[Iterable]=None, **kwargs) -> str:
602+ """Returns a string representation of the parse tree
603+ as a full HTML or XML document.
604+
605+ :param indent_level: Each line of the rendering will be
606+ indented this many levels. (The ``formatter`` decides what a
607+ 'level' means, in terms of spaces or other characters
608+ output.) This is used internally in recursive calls while
609+ pretty-printing.
610 :param eventual_encoding: The encoding of the final document.
611 If this is None, the document will be a Unicode string.
612+ :param formatter: Either a `Formatter` object, or a string naming one of
613+ the standard formatters.
614+ :param iterator: The iterator to use when navigating over the
615+ parse tree. This is only used by `Tag.decode_contents` and
616+ you probably won't need to use it.
617 """
618 if self.is_xml:
619 # Print the XML declaration
620 encoding_part = ''
621+ declared_encoding: Optional[str] = eventual_encoding
622 if eventual_encoding in PYTHON_SPECIFIC_ENCODINGS:
623 # This is a special Python encoding; it can't actually
624 # go into an XML document because it means nothing
625 # outside of Python.
626- eventual_encoding = None
627- if eventual_encoding != None:
628- encoding_part = ' encoding="%s"' % eventual_encoding
629+ declared_encoding = None
630+ if declared_encoding != None:
631+ encoding_part = ' encoding="%s"' % declared_encoding
632 prefix = '<?xml version="1.0"%s?>\n' % encoding_part
633 else:
634 prefix = ''
635- if not pretty_print:
636- indent_level = None
637+
638+ # Prior to 4.13.0, the first argument to this method was a
639+ # bool called pretty_print, which gave the method a different
640+ # signature from its superclass implementation, Tag.decode.
641+ #
642+ # The signatures of the two methods now match, but just in
643+ # case someone is still passing a boolean in as the first
644+ # argument to this method (or a keyword argument with the old
645+ # name), we can handle it and put out a DeprecationWarning.
646+ warning:Optional[str] = None
647+ if isinstance(indent_level, bool):
648+ if indent_level is True:
649+ indent_level = 0
650+ elif indent_level is False:
651+ indent_level = None
652+ warning = f"As of 4.13.0, the first argument to BeautifulSoup.decode has been changed from bool to int, to match Tag.decode. Pass in a value of {indent_level} instead."
653 else:
654- indent_level = 0
655+ pretty_print = kwargs.pop("pretty_print", None)
656+ assert not kwargs
657+ if pretty_print is not None:
658+ if pretty_print is True:
659+ indent_level = 0
660+ elif pretty_print is False:
661+ indent_level = None
662+ warning = f"As of 4.13.0, the pretty_print argument to BeautifulSoup.decode has been removed, to match Tag.decode. Pass in a value of indent_level={indent_level} instead."
663+
664+ if warning:
665+ warnings.warn(warning, DeprecationWarning, stacklevel=2)
666+ elif indent_level is False or pretty_print is False:
667+ indent_level = None
668 return prefix + super(BeautifulSoup, self).decode(
669 indent_level, eventual_encoding, formatter, iterator)
670
671@@ -815,7 +987,7 @@ class BeautifulStoneSoup(BeautifulSoup):
672 def __init__(self, *args, **kwargs):
673 kwargs['features'] = 'xml'
674 warnings.warn(
675- 'The BeautifulStoneSoup class is deprecated. Instead of using '
676+ 'The BeautifulStoneSoup class was deprecated in version 4.0.0. Instead of using '
677 'it, pass features="xml" into the BeautifulSoup constructor.',
678 DeprecationWarning, stacklevel=2
679 )
680diff --git a/bs4/_deprecation.py b/bs4/_deprecation.py
681new file mode 100644
682index 0000000..febc1b3
683--- /dev/null
684+++ b/bs4/_deprecation.py
685@@ -0,0 +1,57 @@
686+"""Helper functions for deprecation.
687+
688+This interface is itself unstable and may change without warning. Do
689+not use these functions yourself, even as a joke. The underscores are
690+there for a reason.
691+
692+In particular, most of this will go away once Beautiful Soup drops
693+support for Python 3.11, since Python 3.12 defines a
694+`@typing.deprecated() decorator. <https://peps.python.org/pep-0702/>`_
695+"""
696+
697+import functools
698+import warnings
699+
700+from typing import (
701+ Any,
702+ Callable,
703+)
704+
705+def _deprecated_alias(old_name, new_name, version):
706+ """Alias one attribute name to another for backward compatibility
707+
708+ :meta private:
709+ """
710+ @property
711+ def alias(self) -> Any:
712+ ":meta private:"
713+ warnings.warn(f"Access to deprecated property {old_name}. (Replaced by {new_name}) -- Deprecated since version {version}.", DeprecationWarning, stacklevel=2)
714+ return getattr(self, new_name)
715+
716+ @alias.setter
717+ def alias(self, value:str)->Any:
718+ ":meta private:"
719+ warnings.warn(f"Write to deprecated property {old_name}. (Replaced by {new_name}) -- Deprecated since version {version}.", DeprecationWarning, stacklevel=2)
720+ return setattr(self, new_name, value)
721+ return alias
722+
723+def _deprecated_function_alias(old_name:str, new_name:str, version:str) -> Callable:
724+ def alias(self, *args, **kwargs):
725+ ":meta private:"
726+ warnings.warn(f"Call to deprecated method {old_name}. (Replaced by {new_name}) -- Deprecated since version {version}.", DeprecationWarning, stacklevel=2)
727+ return getattr(self, new_name)(*args, **kwargs)
728+ return alias
729+
730+def _deprecated(replaced_by:str, version:str) -> Callable:
731+ def deprecate(func):
732+ @functools.wraps(func)
733+ def with_warning(*args, **kwargs):
734+ ":meta private:"
735+ warnings.warn(
736+ f"Call to deprecated method {func.__name__}. (Replaced by {replaced_by}) -- Deprecated since version {version}.",
737+ DeprecationWarning,
738+ stacklevel=2
739+ )
740+ return func(*args, **kwargs)
741+ return with_warning
742+ return deprecate
743diff --git a/bs4/_typing.py b/bs4/_typing.py
744new file mode 100644
745index 0000000..9fe58c3
746--- /dev/null
747+++ b/bs4/_typing.py
748@@ -0,0 +1,99 @@
749+# Custom type aliases used throughout Beautiful Soup to improve readability.
750+
751+# Notes on improvements to the type system in newer versions of Python
752+# that can be used once Beautiful Soup drops support for older
753+# versions:
754+#
755+# * In 3.10, x|y is an accepted shorthand for Union[x,y].
756+# * In 3.10, TypeAlias gains capabilities that can be used to
757+# improve the tree matching types (I don't remember what, exactly).
758+
759+import re
760+from typing_extensions import TypeAlias
761+from typing import (
762+ Callable,
763+ Dict,
764+ IO,
765+ Iterable,
766+ Pattern,
767+ TYPE_CHECKING,
768+ Union,
769+)
770+
771+if TYPE_CHECKING:
772+ from bs4.element import Tag
773+
774+# Aliases for markup in various stages of processing.
775+#
776+# The rawest form of markup: either a string or an open filehandle.
777+_IncomingMarkup: TypeAlias = Union[str,bytes,IO]
778+
779+# Markup that is in memory but has (potentially) yet to be converted
780+# to Unicode.
781+_RawMarkup: TypeAlias = Union[str,bytes]
782+
783+# Aliases for character encodings
784+#
785+_Encoding:TypeAlias = str
786+_Encodings:TypeAlias = Iterable[_Encoding]
787+
788+# Aliases for XML namespaces
789+_NamespacePrefix:TypeAlias = str
790+_NamespaceURL:TypeAlias = str
791+_NamespaceMapping:TypeAlias = Dict[_NamespacePrefix, _NamespaceURL]
792+_InvertedNamespaceMapping:TypeAlias = Dict[_NamespaceURL, _NamespacePrefix]
793+
794+# Aliases for the attribute values associated with HTML/XML tags.
795+#
796+# Note that these are attribute values in their final form, as stored
797+# in the `Tag` class. Different parsers present attributes to the
798+# `TreeBuilder` subclasses in different formats, which are not defined
799+# here.
800+_AttributeValue: TypeAlias = Union[str, Iterable[str]]
801+_AttributeValues: TypeAlias = Dict[str, _AttributeValue]
802+
803+# Aliases to represent the many possibilities for matching bits of a
804+# parse tree.
805+#
806+# This is very complicated because we're applying a formal type system
807+# to some very DWIM code. The types we end up with will be the types
808+# of the arguments to the SoupStrainer constructor and (more
809+# familiarly to Beautiful Soup users) the find* methods.
810+
811+# A function that takes a Tag and returns a yes-or-no answer.
812+# A TagNameMatchRule expects this kind of function, if you're
813+# going to pass it a function.
814+_TagMatchFunction:TypeAlias = Callable[['Tag'], bool]
815+
816+# A function that takes a single string and returns a yes-or-no
817+# answer. An AttributeValueMatchRule expects this kind of function, if
818+# you're going to pass it a function. So does a StringMatchRule
819+_StringMatchFunction:TypeAlias = Callable[[str], bool]
820+
821+# A function that takes a Tag or string and returns a yes-or-no
822+# answer.
823+_TagOrStringMatchFunction:TypeAlias = Union[_TagMatchFunction, _StringMatchFunction, bool]
824+
825+# Either a tag name, an attribute value or a string can be matched
826+# against a string, bytestring, regular expression, or a boolean.
827+_BaseStrainable:TypeAlias = Union[str, bytes, Pattern[str], bool]
828+
829+# A tag can also be matched using a function that takes the Tag
830+# as its sole argument.
831+_BaseStrainableElement:TypeAlias = Union[_BaseStrainable, _TagMatchFunction]
832+
833+# A tag's attribute value can be matched using a function that takes
834+# the value as its sole argument.
835+_BaseStrainableAttribute:TypeAlias = Union[_BaseStrainable, _StringMatchFunction]
836+
837+# Finally, a tag name, attribute or string can be matched using either
838+# a single criterion or a list of criteria.
839+_StrainableElement:TypeAlias = Union[
840+ _BaseStrainableElement, Iterable[_BaseStrainableElement]
841+]
842+_StrainableAttribute:TypeAlias = Union[
843+ _BaseStrainableAttribute, Iterable[_BaseStrainableAttribute]
844+]
845+
846+_StrainableAttributes:TypeAlias = Dict[str, _StrainableAttribute]
847+_StrainableString:TypeAlias = _StrainableAttribute
848diff --git a/bs4/builder/__init__.py b/bs4/builder/__init__.py
849index 2e39745..671315d 100644
850--- a/bs4/builder/__init__.py
851+++ b/bs4/builder/__init__.py
852@@ -1,9 +1,25 @@
853+from __future__ import annotations
854 # Use of this source code is governed by the MIT license.
855 __license__ = "MIT"
856
857 from collections import defaultdict
858 import itertools
859 import re
860+from types import ModuleType
861+from typing import (
862+ Any,
863+ cast,
864+ Dict,
865+ Iterable,
866+ List,
867+ Optional,
868+ Pattern,
869+ Set,
870+ Tuple,
871+ Type,
872+ TYPE_CHECKING,
873+ Union,
874+)
875 import warnings
876 import sys
877 from bs4.element import (
878@@ -17,6 +33,18 @@ from bs4.element import (
879 nonwhitespace_re
880 )
881
882+if TYPE_CHECKING:
883+ from bs4 import BeautifulSoup
884+ from bs4.element import (
885+ NavigableString, Tag,
886+ _AttributeValues, _AttributeValue,
887+ )
888+ from bs4._typing import (
889+ _Encoding,
890+ _Encodings,
891+ _RawMarkup,
892+ )
893+
894 __all__ = [
895 'HTMLTreeBuilder',
896 'SAXTreeBuilder',
897@@ -36,29 +64,32 @@ class XMLParsedAsHTMLWarning(UserWarning):
898 """The warning issued when an HTML parser is used to parse
899 XML that is not XHTML.
900 """
901- MESSAGE = """It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor."""
902+ MESSAGE:str = """It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.""" #: :meta private:
903
904
905 class TreeBuilderRegistry(object):
906 """A way of looking up TreeBuilder subclasses by their name or by desired
907 features.
908 """
909+
910+ builders_for_feature: Dict[str, List[Type[TreeBuilder]]]
911+ builders: List[Type[TreeBuilder]]
912
913 def __init__(self):
914 self.builders_for_feature = defaultdict(list)
915 self.builders = []
916
917- def register(self, treebuilder_class):
918+ def register(self, treebuilder_class:type[TreeBuilder]) -> None:
919 """Register a treebuilder based on its advertised features.
920
921- :param treebuilder_class: A subclass of Treebuilder. its .features
922- attribute should list its features.
923+ :param treebuilder_class: A subclass of `TreeBuilder`. Its
924+ `TreeBuilder.features` attribute should list its features.
925 """
926 for feature in treebuilder_class.features:
927 self.builders_for_feature[feature].insert(0, treebuilder_class)
928 self.builders.insert(0, treebuilder_class)
929
930- def lookup(self, *features):
931+ def lookup(self, *features:str) -> Optional[Type[TreeBuilder]]:
932 """Look up a TreeBuilder subclass with the desired features.
933
934 :param features: A list of features to look for. If none are
935@@ -78,12 +109,12 @@ class TreeBuilderRegistry(object):
936
937 # Go down the list of features in order, and eliminate any builders
938 # that don't match every feature.
939- features = list(features)
940- features.reverse()
941+ feature_list = list(features)
942+ feature_list.reverse()
943 candidates = None
944 candidate_set = None
945- while len(features) > 0:
946- feature = features.pop()
947+ while len(feature_list) > 0:
948+ feature = feature_list.pop()
949 we_have_the_feature = self.builders_for_feature.get(feature, [])
950 if len(we_have_the_feature) > 0:
951 if candidates is None:
952@@ -97,81 +128,61 @@ class TreeBuilderRegistry(object):
953 # The only valid candidates are the ones in candidate_set.
954 # Go through the original list of candidates and pick the first one
955 # that's in candidate_set.
956- if candidate_set is None:
957+ if candidate_set is None or candidates is None:
958 return None
959 for candidate in candidates:
960 if candidate in candidate_set:
961 return candidate
962 return None
963
964-# The BeautifulSoup class will take feature lists from developers and use them
965-# to look up builders in this registry.
966-builder_registry = TreeBuilderRegistry()
967+#: The `BeautifulSoup` constructor will take a list of features
968+#: and use it to look up `TreeBuilder` classes in this registry.
969+builder_registry:TreeBuilderRegistry = TreeBuilderRegistry()
970
971 class TreeBuilder(object):
972- """Turn a textual document into a Beautiful Soup object tree."""
973-
974- NAME = "[Unknown tree builder]"
975- ALTERNATE_NAMES = []
976- features = []
977-
978- is_xml = False
979- picklable = False
980- empty_element_tags = None # A tag will be considered an empty-element
981- # tag when and only when it has no contents.
982-
983- # A value for these tag/attribute combinations is a space- or
984- # comma-separated list of CDATA, rather than a single CDATA.
985- DEFAULT_CDATA_LIST_ATTRIBUTES = defaultdict(list)
986-
987- # Whitespace should be preserved inside these tags.
988- DEFAULT_PRESERVE_WHITESPACE_TAGS = set()
989-
990- # The textual contents of tags with these names should be
991- # instantiated with some class other than NavigableString.
992- DEFAULT_STRING_CONTAINERS = {}
993-
994- USE_DEFAULT = object()
995+ """Turn a textual document into a Beautiful Soup object tree.
996+
997+ This is an abstract superclass which smooths out the behavior of
998+ different parser libraries into a single, unified interface.
999+
1000+ :param multi_valued_attributes: If this is set to None, the
1001+ TreeBuilder will not turn any values for attributes like
1002+ 'class' into lists. Setting this to a dictionary will
1003+ customize this behavior; look at DEFAULT_CDATA_LIST_ATTRIBUTES
1004+ for an example.
1005+
1006+ Internally, these are called "CDATA list attributes", but that
1007+ probably doesn't make sense to an end-user, so the argument name
1008+ is `multi_valued_attributes`.
1009+
1010+ :param preserve_whitespace_tags: A set of tags to treat
1011+ the way <pre> tags are treated in HTML. Tags in this set
1012+ are immune from pretty-printing; their contents will always be
1013+ output as-is.
1014+
1015+ :param string_containers: A dictionary mapping tag names to
1016+ the classes that should be instantiated to contain the textual
1017+ contents of those tags. The default is to use NavigableString
1018+ for every tag, no matter what the name. You can override the
1019+ default by changing DEFAULT_STRING_CONTAINERS.
1020+
1021+ :param store_line_numbers: If the parser keeps track of the
1022+ line numbers and positions of the original markup, that
1023+ information will, by default, be stored in each corresponding
1024+ `Tag` object. You can turn this off by passing
1025+ store_line_numbers=False. If the parser you're using doesn't
1026+ keep track of this information, then setting store_line_numbers=True
1027+ will do nothing.
1028+ """
1029
1030- # Most parsers don't keep track of line numbers.
1031- TRACKS_LINE_NUMBERS = False
1032+ USE_DEFAULT: Any = object() #: :meta private:
1033
1034- def __init__(self, multi_valued_attributes=USE_DEFAULT,
1035- preserve_whitespace_tags=USE_DEFAULT,
1036- store_line_numbers=USE_DEFAULT,
1037- string_containers=USE_DEFAULT,
1038+ def __init__(self, multi_valued_attributes:Dict[str, Set[str]]=USE_DEFAULT,
1039+ preserve_whitespace_tags:Set[str]=USE_DEFAULT,
1040+ store_line_numbers:bool=USE_DEFAULT,
1041+ string_containers:Dict[str, Type[NavigableString]]=USE_DEFAULT,
1042+ empty_element_tags:Set[str]=USE_DEFAULT
1043 ):
1044- """Constructor.
1045-
1046- :param multi_valued_attributes: If this is set to None, the
1047- TreeBuilder will not turn any values for attributes like
1048- 'class' into lists. Setting this to a dictionary will
1049- customize this behavior; look at DEFAULT_CDATA_LIST_ATTRIBUTES
1050- for an example.
1051-
1052- Internally, these are called "CDATA list attributes", but that
1053- probably doesn't make sense to an end-user, so the argument name
1054- is `multi_valued_attributes`.
1055-
1056- :param preserve_whitespace_tags: A list of tags to treat
1057- the way <pre> tags are treated in HTML. Tags in this list
1058- are immune from pretty-printing; their contents will always be
1059- output as-is.
1060-
1061- :param string_containers: A dictionary mapping tag names to
1062- the classes that should be instantiated to contain the textual
1063- contents of those tags. The default is to use NavigableString
1064- for every tag, no matter what the name. You can override the
1065- default by changing DEFAULT_STRING_CONTAINERS.
1066-
1067- :param store_line_numbers: If the parser keeps track of the
1068- line numbers and positions of the original markup, that
1069- information will, by default, be stored in each corresponding
1070- `Tag` object. You can turn this off by passing
1071- store_line_numbers=False. If the parser you're using doesn't
1072- keep track of this information, then setting store_line_numbers=True
1073- will do nothing.
1074- """
1075 self.soup = None
1076 if multi_valued_attributes is self.USE_DEFAULT:
1077 multi_valued_attributes = self.DEFAULT_CDATA_LIST_ATTRIBUTES
1078@@ -179,14 +190,55 @@ class TreeBuilder(object):
1079 if preserve_whitespace_tags is self.USE_DEFAULT:
1080 preserve_whitespace_tags = self.DEFAULT_PRESERVE_WHITESPACE_TAGS
1081 self.preserve_whitespace_tags = preserve_whitespace_tags
1082+ if empty_element_tags is self.USE_DEFAULT:
1083+ self.empty_element_tags = self.DEFAULT_EMPTY_ELEMENT_TAGS
1084+ else:
1085+ self.empty_element_tags = empty_element_tags
1086 if store_line_numbers == self.USE_DEFAULT:
1087 store_line_numbers = self.TRACKS_LINE_NUMBERS
1088 self.store_line_numbers = store_line_numbers
1089 if string_containers == self.USE_DEFAULT:
1090 string_containers = self.DEFAULT_STRING_CONTAINERS
1091 self.string_containers = string_containers
1092+
1093+ NAME:str = "[Unknown tree builder]"
1094+ ALTERNATE_NAMES: Iterable[str] = []
1095+ features: Iterable[str] = []
1096+
1097+ is_xml: bool = False
1098+ picklable: bool = False
1099+
1100+ soup: Optional[BeautifulSoup] #: :meta private:
1101+
1102+ #: A tag will be considered an empty-element
1103+ #: tag when and only when it has no contents.
1104+ empty_element_tags: Optional[Set[str]] = None #: :meta private:
1105+ cdata_list_attributes: Dict[str, Set[str]] #: :meta private:
1106+ preserve_whitespace_tags: Set[str] #: :meta private:
1107+ string_containers: Dict[str, Type[NavigableString]] #: :meta private:
1108+ tracks_line_numbers: bool #: :meta private:
1109+
1110+ #: A value for these tag/attribute combinations is a space- or
1111+ #: comma-separated list of CDATA, rather than a single CDATA.
1112+ DEFAULT_CDATA_LIST_ATTRIBUTES : Dict[str, Set[str]] = defaultdict(set)
1113+
1114+ #: Whitespace should be preserved inside these tags.
1115+ DEFAULT_PRESERVE_WHITESPACE_TAGS : Set[str] = set()
1116+
1117+ #: The textual contents of tags with these names should be
1118+ #: instantiated with some class other than NavigableString.
1119+ DEFAULT_STRING_CONTAINERS: Dict[str, Type[NavigableString]] = {}
1120+
1121+ #: By default, tags are treated as empty-element tags if they have
1122+ #: no contents--that is, using XML rules. HTMLTreeBuilder
1123+ #: defines a different set of DEFAULT_EMPTY_ELEMENT_TAGS based on the
1124+ #: HTML 4 and HTML5 standards.
1125+ DEFAULT_EMPTY_ELEMENT_TAGS: Optional[Set] = None
1126+
1127+ #: Most parsers don't keep track of line numbers.
1128+ TRACKS_LINE_NUMBERS: bool = False
1129
1130- def initialize_soup(self, soup):
1131+ def initialize_soup(self, soup:BeautifulSoup) -> None:
1132 """The BeautifulSoup object has been initialized and is now
1133 being associated with the TreeBuilder.
1134
1135@@ -194,7 +246,7 @@ class TreeBuilder(object):
1136 """
1137 self.soup = soup
1138
1139- def reset(self):
1140+ def reset(self) -> None:
1141 """Do any work necessary to reset the underlying parser
1142 for a new document.
1143
1144@@ -202,7 +254,7 @@ class TreeBuilder(object):
1145 """
1146 pass
1147
1148- def can_be_empty_element(self, tag_name):
1149+ def can_be_empty_element(self, tag_name:str) -> bool:
1150 """Might a tag with this name be an empty-element tag?
1151
1152 The final markup may or may not actually present this tag as
1153@@ -225,46 +277,48 @@ class TreeBuilder(object):
1154 return True
1155 return tag_name in self.empty_element_tags
1156
1157- def feed(self, markup):
1158+ def feed(self, markup:str) -> None:
1159 """Run some incoming markup through some parsing process,
1160- populating the `BeautifulSoup` object in self.soup.
1161-
1162- This method is not implemented in TreeBuilder; it must be
1163- implemented in subclasses.
1164-
1165- :return: None.
1166+ populating the `BeautifulSoup` object in `TreeBuilder.soup`
1167 """
1168 raise NotImplementedError()
1169
1170- def prepare_markup(self, markup, user_specified_encoding=None,
1171- document_declared_encoding=None, exclude_encodings=None):
1172+ def prepare_markup(
1173+ self, markup:_RawMarkup,
1174+ user_specified_encoding:Optional[_Encoding]=None,
1175+ document_declared_encoding:Optional[_Encoding]=None,
1176+ exclude_encodings:Optional[_Encodings]=None
1177+ ) -> Iterable[Tuple[_RawMarkup, Optional[_Encoding], Optional[_Encoding], bool]]:
1178 """Run any preliminary steps necessary to make incoming markup
1179 acceptable to the parser.
1180
1181- :param markup: Some markup -- probably a bytestring.
1182- :param user_specified_encoding: The user asked to try this encoding.
1183+ :param markup: The markup that's about to be parsed.
1184+ :param user_specified_encoding: The user asked to try this encoding
1185+ to convert the markup into a Unicode string.
1186 :param document_declared_encoding: The markup itself claims to be
1187 in this encoding. NOTE: This argument is not used by the
1188 calling code and can probably be removed.
1189- :param exclude_encodings: The user asked _not_ to try any of
1190+ :param exclude_encodings: The user asked *not* to try any of
1191 these encodings.
1192
1193- :yield: A series of 4-tuples:
1194- (markup, encoding, declared encoding,
1195- has undergone character replacement)
1196+ :yield: A series of 4-tuples: (markup, encoding, declared encoding,
1197+ has undergone character replacement)
1198
1199- Each 4-tuple represents a strategy for converting the
1200- document to Unicode and parsing it. Each strategy will be tried
1201- in turn.
1202+ Each 4-tuple represents a strategy that the parser can try
1203+ to convert the document to Unicode and parse it. Each
1204+ strategy will be tried in turn.
1205
1206 By default, the only strategy is to parse the markup
1207 as-is. See `LXMLTreeBuilderForXML` and
1208 `HTMLParserTreeBuilder` for implementations that take into
1209 account the quirks of particular parsers.
1210+
1211+ :meta private:
1212+
1213 """
1214 yield markup, None, None, False
1215
1216- def test_fragment_to_document(self, fragment):
1217+ def test_fragment_to_document(self, fragment:str) -> str:
1218 """Wrap an HTML fragment to make it look like a document.
1219
1220 Different parsers do this differently. For instance, lxml
1221@@ -273,26 +327,27 @@ class TreeBuilder(object):
1222 which run HTML fragments through the parser and compare the
1223 results against other HTML fragments.
1224
1225- This method should not be used outside of tests.
1226+ This method should not be used outside of unit tests.
1227
1228- :param fragment: A string -- fragment of HTML.
1229- :return: A string -- a full HTML document.
1230+ :param fragment: A fragment of HTML.
1231+ :return: A full HTML document.
1232+ :meta private:
1233 """
1234 return fragment
1235
1236- def set_up_substitutions(self, tag):
1237+ def set_up_substitutions(self, tag:Tag) -> bool:
1238 """Set up any substitutions that will need to be performed on
1239 a `Tag` when it's output as a string.
1240
1241 By default, this does nothing. See `HTMLTreeBuilder` for a
1242 case where this is used.
1243
1244- :param tag: A `Tag`
1245 :return: Whether or not a substitution was performed.
1246+ :meta private:
1247 """
1248 return False
1249
1250- def _replace_cdata_list_attribute_values(self, tag_name, attrs):
1251+ def _replace_cdata_list_attribute_values(self, tag_name:str, attrs:_AttributeValues):
1252 """When an attribute value is associated with a tag that can
1253 have multiple values for that attribute, convert the string
1254 value to a list of strings.
1255@@ -308,10 +363,11 @@ class TreeBuilder(object):
1256 if not attrs:
1257 return attrs
1258 if self.cdata_list_attributes:
1259- universal = self.cdata_list_attributes.get('*', [])
1260+ universal: Set[str] = self.cdata_list_attributes.get('*', set())
1261 tag_specific = self.cdata_list_attributes.get(
1262 tag_name.lower(), None)
1263 for attr in list(attrs.keys()):
1264+ values: _AttributeValue
1265 if attr in universal or (tag_specific and attr in tag_specific):
1266 # We have a "class"-type attribute whose string
1267 # value is a whitespace-separated list of
1268@@ -337,7 +393,15 @@ class SAXTreeBuilder(TreeBuilder):
1269 how a simple TreeBuilder would work.
1270 """
1271
1272- def feed(self, markup):
1273+ def __init__(self, *args, **kwargs):
1274+ warnings.warn(
1275+ f"The SAXTreeBuilder class was deprecated in 4.13.0. It is completely untested and probably doesn't work; use at your own risk.",
1276+ DeprecationWarning,
1277+ stacklevel=2
1278+ )
1279+ super(SAXTreeBuilder, self).__init__(*args, **kwargs)
1280+
1281+ def feed(self, markup:_RawMarkup):
1282 raise NotImplementedError()
1283
1284 def close(self):
1285@@ -381,12 +445,13 @@ class SAXTreeBuilder(TreeBuilder):
1286
1287
1288 class HTMLTreeBuilder(TreeBuilder):
1289- """This TreeBuilder knows facts about HTML.
1290-
1291- Such as which tags are empty-element tags.
1292+ """This TreeBuilder knows facts about HTML, such as which tags are treated
1293+ specially by the HTML standard.
1294 """
1295
1296- empty_element_tags = set([
1297+ #: Some HTML tags are defined as having no contents. Beautiful Soup
1298+ #: treats these specially.
1299+ DEFAULT_EMPTY_ELEMENT_TAGS: Set[str] = set([
1300 # These are from HTML5.
1301 'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input', 'keygen', 'link', 'menuitem', 'meta', 'param', 'source', 'track', 'wbr',
1302
1303@@ -394,29 +459,29 @@ class HTMLTreeBuilder(TreeBuilder):
1304 'basefont', 'bgsound', 'command', 'frame', 'image', 'isindex', 'nextid', 'spacer'
1305 ])
1306
1307- # The HTML standard defines these as block-level elements. Beautiful
1308- # Soup does not treat these elements differently from other elements,
1309- # but it may do so eventually, and this information is available if
1310- # you need to use it.
1311- block_elements = set(["address", "article", "aside", "blockquote", "canvas", "dd", "div", "dl", "dt", "fieldset", "figcaption", "figure", "footer", "form", "h1", "h2", "h3", "h4", "h5", "h6", "header", "hr", "li", "main", "nav", "noscript", "ol", "output", "p", "pre", "section", "table", "tfoot", "ul", "video"])
1312-
1313- # These HTML tags need special treatment so they can be
1314- # represented by a string class other than NavigableString.
1315- #
1316- # For some of these tags, it's because the HTML standard defines
1317- # an unusual content model for them. I made this list by going
1318- # through the HTML spec
1319- # (https://html.spec.whatwg.org/#metadata-content) and looking for
1320- # "metadata content" elements that can contain strings.
1321- #
1322- # The Ruby tags (<rt> and <rp>) are here despite being normal
1323- # "phrasing content" tags, because the content they contain is
1324- # qualitatively different from other text in the document, and it
1325- # can be useful to be able to distinguish it.
1326- #
1327- # TODO: Arguably <noscript> could go here but it seems
1328- # qualitatively different from the other tags.
1329- DEFAULT_STRING_CONTAINERS = {
1330+ #: The HTML standard defines these tags as block-level elements. Beautiful
1331+ #: Soup does not treat these elements differently from other elements,
1332+ #: but it may do so eventually, and this information is available if
1333+ #: you need to use it.
1334+ DEFAULT_BLOCK_ELEMENTS: Set[str] = set(["address", "article", "aside", "blockquote", "canvas", "dd", "div", "dl", "dt", "fieldset", "figcaption", "figure", "footer", "form", "h1", "h2", "h3", "h4", "h5", "h6", "header", "hr", "li", "main", "nav", "noscript", "ol", "output", "p", "pre", "section", "table", "tfoot", "ul", "video"])
1335+
1336+ #: These HTML tags need special treatment so they can be
1337+ #: represented by a string class other than NavigableString.
1338+ #:
1339+ #: For some of these tags, it's because the HTML standard defines
1340+ #: an unusual content model for them. I made this list by going
1341+ #: through the HTML spec
1342+ #: (https://html.spec.whatwg.org/#metadata-content) and looking for
1343+ #: "metadata content" elements that can contain strings.
1344+ #:
1345+ #: The Ruby tags (<rt> and <rp>) are here despite being normal
1346+ #: "phrasing content" tags, because the content they contain is
1347+ #: qualitatively different from other text in the document, and it
1348+ #: can be useful to be able to distinguish it.
1349+ #:
1350+ #: TODO: Arguably <noscript> could go here but it seems
1351+ #: qualitatively different from the other tags.
1352+ DEFAULT_STRING_CONTAINERS: Dict[str, Type[NavigableString]] = {
1353 'rt' : RubyTextString,
1354 'rp' : RubyParenthesisString,
1355 'style': Stylesheet,
1356@@ -424,33 +489,35 @@ class HTMLTreeBuilder(TreeBuilder):
1357 'template': TemplateString,
1358 }
1359
1360- # The HTML standard defines these attributes as containing a
1361- # space-separated list of values, not a single value. That is,
1362- # class="foo bar" means that the 'class' attribute has two values,
1363- # 'foo' and 'bar', not the single value 'foo bar'. When we
1364- # encounter one of these attributes, we will parse its value into
1365- # a list of values if possible. Upon output, the list will be
1366- # converted back into a string.
1367- DEFAULT_CDATA_LIST_ATTRIBUTES = {
1368- "*" : ['class', 'accesskey', 'dropzone'],
1369- "a" : ['rel', 'rev'],
1370- "link" : ['rel', 'rev'],
1371- "td" : ["headers"],
1372- "th" : ["headers"],
1373- "td" : ["headers"],
1374- "form" : ["accept-charset"],
1375- "object" : ["archive"],
1376+ #: The HTML standard defines these attributes as containing a
1377+ #: space-separated list of values, not a single value. That is,
1378+ #: class="foo bar" means that the 'class' attribute has two values,
1379+ #: 'foo' and 'bar', not the single value 'foo bar'. When we
1380+ #: encounter one of these attributes, we will parse its value into
1381+ #: a list of values if possible. Upon output, the list will be
1382+ #: converted back into a string.
1383+ DEFAULT_CDATA_LIST_ATTRIBUTES: Dict[str, Set[str]] = {
1384+ "*" : {'class', 'accesskey', 'dropzone'},
1385+ "a" : {'rel', 'rev'},
1386+ "link" : {'rel', 'rev'},
1387+ "td" : {"headers"},
1388+ "th" : {"headers"},
1389+ "td" : {"headers"},
1390+ "form" : {"accept-charset"},
1391+ "object" : {"archive"},
1392
1393 # These are HTML5 specific, as are *.accesskey and *.dropzone above.
1394- "area" : ["rel"],
1395- "icon" : ["sizes"],
1396- "iframe" : ["sandbox"],
1397- "output" : ["for"],
1398+ "area" : {"rel"},
1399+ "icon" : {"sizes"},
1400+ "iframe" : {"sandbox"},
1401+ "output" : {"for"},
1402 }
1403
1404- DEFAULT_PRESERVE_WHITESPACE_TAGS = set(['pre', 'textarea'])
1405+ #: By default, whitespace inside these HTML tags will be
1406+ #: preserved rather than being collapsed.
1407+    DEFAULT_PRESERVE_WHITESPACE_TAGS: Set[str] = set(['pre', 'textarea'])
1408
1409- def set_up_substitutions(self, tag):
1410+ def set_up_substitutions(self, tag:Tag) -> bool:
1411 """Replace the declared encoding in a <meta> tag with a placeholder,
1412 to be substituted when the tag is output to a string.
1413
1414@@ -458,17 +525,26 @@ class HTMLTreeBuilder(TreeBuilder):
1415 encoding, but exit in a different encoding, and the <meta> tag
1416 needs to be changed to reflect this.
1417
1418- :param tag: A `Tag`
1419 :return: Whether or not a substitution was performed.
1420+
1421+ :meta private:
1422 """
1423 # We are only interested in <meta> tags
1424 if tag.name != 'meta':
1425 return False
1426
1427- http_equiv = tag.get('http-equiv')
1428- content = tag.get('content')
1429- charset = tag.get('charset')
1430-
1431+ # TODO: This cast will fail in the (very unlikely) scenario
1432+ # that the programmer who instantiates the TreeBuilder
1433+ # specifies meta['content'] or meta['charset'] as
1434+ # cdata_list_attributes.
1435+ content:Optional[str] = cast(Optional[str], tag.get('content'))
1436+ charset:Optional[str] = cast(Optional[str], tag.get('charset'))
1437+
1438+ # But we can accommodate meta['http-equiv'] being made a
1439+ # cdata_list_attribute (again, very unlikely) without much
1440+ # trouble.
1441+ http_equiv:List[str] = tag.get_attribute_list('http-equiv')
1442+
1443 # We are interested in <meta> tags that say what encoding the
1444 # document was originally in. This means HTML 5-style <meta>
1445 # tags that provide the "charset" attribute. It also means
1446@@ -478,20 +554,22 @@ class HTMLTreeBuilder(TreeBuilder):
1447 # In both cases we will replace the value of the appropriate
1448 # attribute with a standin object that can take on any
1449 # encoding.
1450- meta_encoding = None
1451+ substituted = False
1452 if charset is not None:
1453 # HTML 5 style:
1454 # <meta charset="utf8">
1455 meta_encoding = charset
1456 tag['charset'] = CharsetMetaAttributeValue(charset)
1457+ substituted = True
1458
1459- elif (content is not None and http_equiv is not None
1460- and http_equiv.lower() == 'content-type'):
1461+ elif (content is not None and
1462+ any(x.lower() == 'content-type' for x in http_equiv)):
1463 # HTML 4 style:
1464 # <meta http-equiv="content-type" content="text/html; charset=utf8">
1465 tag['content'] = ContentMetaAttributeValue(content)
1466+ substituted = True
1467
1468- return (meta_encoding is not None)
1469+ return substituted
1470
1471 class DetectsXMLParsedAsHTML(object):
1472 """A mixin class for any class (a TreeBuilder, or some class used by a
1473@@ -502,19 +580,29 @@ class DetectsXMLParsedAsHTML(object):
1474 This requires being able to observe an incoming processing
1475 instruction that might be an XML declaration, and also able to
1476 observe tags as they're opened. If you can't do that for a given
1477- TreeBuilder, there's a less reliable implementation based on
1478+ `TreeBuilder`, there's a less reliable implementation based on
1479 examining the raw markup.
1480 """
1481
1482- # Regular expression for seeing if markup has an <html> tag.
1483- LOOKS_LIKE_HTML = re.compile("<[^ +]html", re.I)
1484- LOOKS_LIKE_HTML_B = re.compile(b"<[^ +]html", re.I)
1485+ #: Regular expression for seeing if string markup has an <html> tag.
1486+ LOOKS_LIKE_HTML:Pattern[str] = re.compile("<[^ +]html", re.I)
1487
1488- XML_PREFIX = '<?xml'
1489- XML_PREFIX_B = b'<?xml'
1490+ #: Regular expression for seeing if byte markup has an <html> tag.
1491+ LOOKS_LIKE_HTML_B:Pattern[bytes] = re.compile(b"<[^ +]html", re.I)
1492+
1493+ #: The start of an XML document string.
1494+ XML_PREFIX:str = '<?xml'
1495+
1496+ #: The start of an XML document bytestring.
1497+ XML_PREFIX_B:bytes = b'<?xml'
1498+
1499+ # This is typed as str, not `ProcessingInstruction`, because this
1500+ # check may be run before any Beautiful Soup objects are created.
1501+ _first_processing_instruction: Optional[str]
1502+ _root_tag: Optional[Tag]
1503
1504 @classmethod
1505- def warn_if_markup_looks_like_xml(cls, markup):
1506+ def warn_if_markup_looks_like_xml(cls, markup:Optional[_RawMarkup]) -> bool:
1507 """Perform a check on some markup to see if it looks like XML
1508 that's not XHTML. If so, issue a warning.
1509
1510@@ -524,34 +612,40 @@ class DetectsXMLParsedAsHTML(object):
1511 :return: True if the markup looks like non-XHTML XML, False
1512 otherwise.
1513 """
1514+ if markup is None:
1515+ return False
1516+ markup = markup[:500]
1517 if isinstance(markup, bytes):
1518- prefix = cls.XML_PREFIX_B
1519- looks_like_html = cls.LOOKS_LIKE_HTML_B
1520+ markup_b = cast(bytes, markup)
1521+ looks_like_xml = (
1522+ markup_b.startswith(cls.XML_PREFIX_B)
1523+                and not cls.LOOKS_LIKE_HTML_B.search(markup_b)
1524+ )
1525 else:
1526- prefix = cls.XML_PREFIX
1527- looks_like_html = cls.LOOKS_LIKE_HTML
1528-
1529- if (markup is not None
1530- and markup.startswith(prefix)
1531- and not looks_like_html.search(markup[:500])
1532- ):
1533+ markup_s = cast(str, markup)
1534+ looks_like_xml = (
1535+ markup_s.startswith(cls.XML_PREFIX)
1536+                and not cls.LOOKS_LIKE_HTML.search(markup_s)
1537+ )
1538+
1539+ if looks_like_xml:
1540 cls._warn()
1541 return True
1542- return False
1543-
1544+ return False
1545+
1546 @classmethod
1547- def _warn(cls):
1548+ def _warn(cls) -> None:
1549 """Issue a warning about XML being parsed as HTML."""
1550 warnings.warn(
1551 XMLParsedAsHTMLWarning.MESSAGE, XMLParsedAsHTMLWarning
1552 )
1553
1554- def _initialize_xml_detector(self):
1555+ def _initialize_xml_detector(self) -> None:
1556 """Call this method before parsing a document."""
1557 self._first_processing_instruction = None
1558 self._root_tag = None
1559
1560- def _document_might_be_xml(self, processing_instruction):
1561+ def _document_might_be_xml(self, processing_instruction:str):
1562 """Call this method when encountering an XML declaration, or a
1563 "processing instruction" that might be an XML declaration.
1564 """
1565@@ -586,7 +680,7 @@ class DetectsXMLParsedAsHTML(object):
1566 self._warn()
1567
1568
1569-def register_treebuilders_from(module):
1570+def register_treebuilders_from(module:ModuleType) -> None:
1571 """Copy TreeBuilders from the given module into this module."""
1572 this_module = sys.modules[__name__]
1573 for name in module.__all__:
1574@@ -602,7 +696,7 @@ class ParserRejectedMarkup(Exception):
1575 """An Exception to be raised when the underlying parser simply
1576 refuses to parse the given markup.
1577 """
1578- def __init__(self, message_or_exception):
1579+ def __init__(self, message_or_exception:Union[str,Exception]):
1580 """Explain why the parser rejected the given markup, either
1581 with a textual explanation or another exception.
1582 """
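
For reviewer context on the builder/__init__.py changes above: the cdata-list attribute handling and set_up_substitutions() are easiest to judge by their user-visible behavior. A minimal sketch of that behavior (this is stock Beautiful Soup behavior, not something added by this branch; the markup is made up):

====
#!/bin/env python
from bs4 import BeautifulSoup

# DEFAULT_CDATA_LIST_ATTRIBUTES: 'class' is a whitespace-separated
# multi-valued attribute, so its value parses to a list; 'id' is
# single-valued and stays a plain string.
soup = BeautifulSoup('<p class="foo bar" id="baz">text</p>', 'html.parser')
print(soup.p['class'])  # ['foo', 'bar']
print(soup.p['id'])     # baz

# set_up_substitutions(): the declared encoding in a <meta> tag is
# replaced with a stand-in, so it tracks the output encoding.
soup = BeautifulSoup(b'<meta charset="utf8"/>', 'html.parser')
print(soup.encode('latin-1'))  # b'<meta charset="latin-1"/>'
====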
1583diff --git a/bs4/builder/_html5lib.py b/bs4/builder/_html5lib.py
1584index dac2173..560a036 100644
1585--- a/bs4/builder/_html5lib.py
1586+++ b/bs4/builder/_html5lib.py
1587@@ -5,6 +5,20 @@ __all__ = [
1588 'HTML5TreeBuilder',
1589 ]
1590
1591+from typing import (
1592+ Iterable,
1593+ List,
1594+ Optional,
1595+ TYPE_CHECKING,
1596+ Tuple,
1597+ Union,
1598+)
1599+from bs4._typing import (
1600+ _Encoding,
1601+ _Encodings,
1602+ _RawMarkup,
1603+)
1604+
1605 import warnings
1606 import re
1607 from bs4.builder import (
1608@@ -30,50 +44,54 @@ from bs4.element import (
1609 Tag,
1610 )
1611
1612-try:
1613- # Pre-0.99999999
1614- from html5lib.treebuilders import _base as treebuilder_base
1615- new_html5lib = False
1616-except ImportError as e:
1617- # 0.99999999 and up
1618- from html5lib.treebuilders import base as treebuilder_base
1619- new_html5lib = True
1620+from html5lib.treebuilders import base as treebuilder_base
1621+
1622
1623 class HTML5TreeBuilder(HTMLTreeBuilder):
1624- """Use html5lib to build a tree.
1625+ """Use `html5lib <https://github.com/html5lib/html5lib-python>`_ to
1626+ build a tree.
1627
1628- Note that this TreeBuilder does not support some features common
1629- to HTML TreeBuilders. Some of these features could theoretically
1630+ Note that `HTML5TreeBuilder` does not support some common HTML
1631+ `TreeBuilder` features. Some of these features could theoretically
1632 be implemented, but at the very least it's quite difficult,
1633 because html5lib moves the parse tree around as it's being built.
1634
1635- * This TreeBuilder doesn't use different subclasses of NavigableString
1636- based on the name of the tag in which the string was found.
1637+ Specifically:
1638
1639- * You can't use a SoupStrainer to parse only part of a document.
1640+ * This `TreeBuilder` doesn't use different subclasses of
1641+ `NavigableString` (e.g. `Script`) based on the name of the tag
1642+ in which the string was found.
1643+ * You can't use a `SoupStrainer` to parse only part of a document.
1644 """
1645
1646- NAME = "html5lib"
1647+ NAME:str = "html5lib"
1648
1649- features = [NAME, PERMISSIVE, HTML_5, HTML]
1650+ features:Iterable[str] = [NAME, PERMISSIVE, HTML_5, HTML]
1651
1652- # html5lib can tell us which line number and position in the
1653- # original file is the source of an element.
1654- TRACKS_LINE_NUMBERS = True
1655+ #: html5lib can tell us which line number and position in the
1656+ #: original file is the source of an element.
1657+ TRACKS_LINE_NUMBERS:bool = True
1658
1659- def prepare_markup(self, markup, user_specified_encoding,
1660- document_declared_encoding=None, exclude_encodings=None):
1661+ def prepare_markup(self, markup:_RawMarkup,
1662+ user_specified_encoding:Optional[_Encoding]=None,
1663+ document_declared_encoding:Optional[_Encoding]=None,
1664+ exclude_encodings:Optional[_Encodings]=None
1665+ ) -> Iterable[Tuple[_RawMarkup, Optional[_Encoding], Optional[_Encoding], bool]]:
1666 # Store the user-specified encoding for use later on.
1667 self.user_specified_encoding = user_specified_encoding
1668
1669 # document_declared_encoding and exclude_encodings aren't used
1670 # ATM because the html5lib TreeBuilder doesn't use
1671 # UnicodeDammit.
1672- if exclude_encodings:
1673- warnings.warn(
1674- "You provided a value for exclude_encoding, but the html5lib tree builder doesn't support exclude_encoding.",
1675- stacklevel=3
1676- )
1677+ for variable, name in (
1678+ (document_declared_encoding, 'document_declared_encoding'),
1679+ (exclude_encodings, 'exclude_encodings'),
1680+ ):
1681+ if variable:
1682+ warnings.warn(
1683+ f"You provided a value for {name}, but the html5lib tree builder doesn't support {name}.",
1684+ stacklevel=3
1685+ )
1686
1687 # html5lib only parses HTML, so if it's given XML that's worth
1688 # noting.
1689@@ -83,6 +101,9 @@ class HTML5TreeBuilder(HTMLTreeBuilder):
1690
1691 # These methods are defined by Beautiful Soup.
1692 def feed(self, markup):
1693+ """Run some incoming markup through some parsing process,
1694+ populating the `BeautifulSoup` object in `HTML5TreeBuilder.soup`.
1695+ """
1696 if self.soup.parse_only is not None:
1697 warnings.warn(
1698 "You provided a value for parse_only, but the html5lib tree builder doesn't support parse_only. The entire document will be parsed.",
1699@@ -92,10 +113,7 @@ class HTML5TreeBuilder(HTMLTreeBuilder):
1700 self.underlying_builder.parser = parser
1701 extra_kwargs = dict()
1702 if not isinstance(markup, str):
1703- if new_html5lib:
1704- extra_kwargs['override_encoding'] = self.user_specified_encoding
1705- else:
1706- extra_kwargs['encoding'] = self.user_specified_encoding
1707+ extra_kwargs['override_encoding'] = self.user_specified_encoding
1708 doc = parser.parse(markup, **extra_kwargs)
1709
1710 # Set the character encoding detected by the tokenizer.
1711@@ -105,15 +123,18 @@ class HTML5TreeBuilder(HTMLTreeBuilder):
1712 doc.original_encoding = None
1713 else:
1714 original_encoding = parser.tokenizer.stream.charEncoding[0]
1715- if not isinstance(original_encoding, str):
1716- # In 0.99999999 and up, the encoding is an html5lib
1717- # Encoding object. We want to use a string for compatibility
1718- # with other tree builders.
1719- original_encoding = original_encoding.name
1720+ # The encoding is an html5lib Encoding object. We want to
1721+ # use a string for compatibility with other tree builders.
1722+ original_encoding = original_encoding.name
1723 doc.original_encoding = original_encoding
1724 self.underlying_builder.parser = None
1725-
1726+
1727 def create_treebuilder(self, namespaceHTMLElements):
1728+ """Called by html5lib to instantiate the kind of class it
1729+ calls a 'TreeBuilder'.
1730+
1731+ :meta private:
1732+ """
1733 self.underlying_builder = TreeBuilderForHtml5lib(
1734 namespaceHTMLElements, self.soup,
1735 store_line_numbers=self.store_line_numbers
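
For reviewer context on the html5lib limitations documented above: the unsupported-argument paths now warn rather than silently ignoring the argument. A minimal sketch, assuming the html5lib package is installed (the markup is made up):

====
#!/bin/env python
import warnings
from bs4 import BeautifulSoup, SoupStrainer

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # html5lib can't honor parse_only; the whole document is parsed
    # and HTML5TreeBuilder.feed() issues a warning instead.
    soup = BeautifulSoup('<div><p>hi</p></div>', 'html5lib',
                         parse_only=SoupStrainer('p'))
for w in caught:
    print(w.message)
print(soup.div)  # the <div> is still present; parse_only was ignored
====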
1736diff --git a/bs4/builder/_htmlparser.py b/bs4/builder/_htmlparser.py
1737index 3cc187f..291f6c6 100644
1738--- a/bs4/builder/_htmlparser.py
1739+++ b/bs4/builder/_htmlparser.py
1740@@ -1,4 +1,5 @@
1741 # encoding: utf-8
1742+from __future__ import annotations
1743 """Use the HTMLParser library to parse HTML files that aren't too bad."""
1744
1745 # Use of this source code is governed by the MIT license.
1746@@ -11,6 +12,19 @@ __all__ = [
1747 from html.parser import HTMLParser
1748
1749 import sys
1750+from typing import (
1751+ Any,
1752+ Callable,
1753+ cast,
1754+ Dict,
1755+ Iterable,
1756+ List,
1757+ Optional,
1758+ TYPE_CHECKING,
1759+ Tuple,
1760+ Type,
1761+ Union,
1762+)
1763 import warnings
1764
1765 from bs4.element import (
1766@@ -30,21 +44,25 @@ from bs4.builder import (
1767 STRICT,
1768 )
1769
1770-
1771+from bs4.element import Tag
1772+if TYPE_CHECKING:
1773+ from bs4 import BeautifulSoup
1774+ from bs4.element import NavigableString
1775+ from bs4._typing import (
1776+ _AttributeValues,
1777+ _Encoding,
1778+ _Encodings,
1779+ _RawMarkup,
1780+ )
1781+
1782 HTMLPARSER = 'html.parser'
1783
1784+_DuplicateAttributeHandler = Callable[[Dict[str, str], str, str], None]
1785+
1786 class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML):
1787 """A subclass of the Python standard library's HTMLParser class, which
1788 listens for HTMLParser events and translates them into calls
1789 to Beautiful Soup's tree construction API.
1790- """
1791-
1792- # Strategies for handling duplicate attributes
1793- IGNORE = 'ignore'
1794- REPLACE = 'replace'
1795-
1796- def __init__(self, *args, **kwargs):
1797- """Constructor.
1798
1799 :param on_duplicate_attribute: A strategy for what to do if a
1800 tag includes the same attribute more than once. Accepted
1801@@ -53,8 +71,10 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML):
1802 encountered), or a callable. A callable must take three
1803 arguments: the dictionary of attributes already processed,
1804 the name of the duplicate attribute, and the most recent value
1805- encountered.
1806- """
1807+ encountered.
1808+ """
1809+ def __init__(self, soup:BeautifulSoup, *args, **kwargs):
1810+ self.soup = soup
1811 self.on_duplicate_attribute = kwargs.pop(
1812 'on_duplicate_attribute', self.REPLACE
1813 )
1814@@ -70,8 +90,20 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML):
1815 self.already_closed_empty_element = []
1816
1817 self._initialize_xml_detector()
1818+
1819+    #: Constant to handle duplicate attributes by ignoring later values
1820+    #: and keeping the earlier ones.
1821+    IGNORE:str = 'ignore'
1822+
1823+    #: Constant to handle duplicate attributes by replacing earlier values
1824+    #: with later ones.
1825+    REPLACE:str = 'replace'
1826
1827- def error(self, message):
1828+ on_duplicate_attribute:Union[str, _DuplicateAttributeHandler]
1829+ already_closed_empty_element: List[str]
1830+ soup: BeautifulSoup
1831+
1832+ def error(self, message:str) -> None:
1833 # NOTE: This method is required so long as Python 3.9 is
1834 # supported. The corresponding code is removed from HTMLParser
1835 # in 3.5, but not removed from ParserBase until 3.10.
1836@@ -87,32 +119,33 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML):
1837 # catch this error and wrap it in a ParserRejectedMarkup.)
1838 raise ParserRejectedMarkup(message)
1839
1840- def handle_startendtag(self, name, attrs):
1841+ def handle_startendtag(
1842+ self, name:str, attrs:List[Tuple[str, Optional[str]]]
1843+ ) -> None:
1844 """Handle an incoming empty-element tag.
1845
1846- This is only called when the markup looks like <tag/>.
1847-
1848- :param name: Name of the tag.
1849- :param attrs: Dictionary of the tag's attributes.
1850+ html.parser only calls this method when the markup looks like
1851+ <tag/>.
1852 """
1853- # is_startend() tells handle_starttag not to close the tag
1854+ # `handle_empty_element` tells handle_starttag not to close the tag
1855 # just because its name matches a known empty-element tag. We
1856- # know that this is an empty-element tag and we want to call
1857+ # know that this is an empty-element tag, and we want to call
1858 # handle_endtag ourselves.
1859- tag = self.handle_starttag(name, attrs, handle_empty_element=False)
1860+ self.handle_starttag(name, attrs, handle_empty_element=False)
1861 self.handle_endtag(name)
1862
1863- def handle_starttag(self, name, attrs, handle_empty_element=True):
1864+ def handle_starttag(
1865+ self, name:str, attrs:List[Tuple[str, Optional[str]]],
1866+ handle_empty_element:bool=True
1867+ ) -> None:
1868 """Handle an opening tag, e.g. '<tag>'
1869
1870- :param name: Name of the tag.
1871- :param attrs: Dictionary of the tag's attributes.
1872 :param handle_empty_element: True if this tag is known to be
1873 an empty-element tag (i.e. there is not expected to be any
1874 closing tag).
1875 """
1876- # XXX namespace
1877- attr_dict = {}
1878+ # TODO: handle namespaces here?
1879+ attr_dict: Dict[str, str] = {}
1880 for key, value in attrs:
1881 # Change None attribute values to the empty string
1882 # for consistency with the other tree builders.
1883@@ -128,6 +161,7 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML):
1884 elif on_dupe in (None, self.REPLACE):
1885 attr_dict[key] = value
1886 else:
1887+ on_dupe = cast(_DuplicateAttributeHandler, on_dupe)
1888 on_dupe(attr_dict, key, value)
1889 else:
1890 attr_dict[key] = value
1891@@ -157,7 +191,7 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML):
1892 if self._root_tag is None:
1893 self._root_tag_encountered(name)
1894
1895- def handle_endtag(self, name, check_already_closed=True):
1896+ def handle_endtag(self, name:str, check_already_closed:bool=True) -> None:
1897 """Handle a closing tag, e.g. '</tag>'
1898
1899 :param name: A tag name.
1900@@ -175,11 +209,11 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML):
1901 else:
1902 self.soup.handle_endtag(name)
1903
1904- def handle_data(self, data):
1905+ def handle_data(self, data:str) -> None:
1906 """Handle some textual data that shows up between tags."""
1907 self.soup.handle_data(data)
1908
1909- def handle_charref(self, name):
1910+ def handle_charref(self, name:str) -> None:
1911 """Handle a numeric character reference by converting it to the
1912 corresponding Unicode character and treating it as textual
1913 data.
1914@@ -219,7 +253,7 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML):
1915 data = data or "\N{REPLACEMENT CHARACTER}"
1916 self.handle_data(data)
1917
1918- def handle_entityref(self, name):
1919+ def handle_entityref(self, name:str) -> None:
1920 """Handle a named entity reference by converting it to the
1921 corresponding Unicode character(s) and treating it as textual
1922 data.
1923@@ -238,7 +272,7 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML):
1924 data = "&%s" % name
1925 self.handle_data(data)
1926
1927- def handle_comment(self, data):
1928+ def handle_comment(self, data:str) -> None:
1929 """Handle an HTML comment.
1930
1931 :param data: The text of the comment.
1932@@ -247,7 +281,7 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML):
1933 self.soup.handle_data(data)
1934 self.soup.endData(Comment)
1935
1936- def handle_decl(self, data):
1937+ def handle_decl(self, data:str) -> None:
1938 """Handle a DOCTYPE declaration.
1939
1940 :param data: The text of the declaration.
1941@@ -257,11 +291,12 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML):
1942 self.soup.handle_data(data)
1943 self.soup.endData(Doctype)
1944
1945- def unknown_decl(self, data):
1946+ def unknown_decl(self, data:str) -> None:
1947 """Handle a declaration of unknown type -- probably a CDATA block.
1948
1949 :param data: The text of the declaration.
1950 """
1951+ cls: Type[NavigableString]
1952 if data.upper().startswith('CDATA['):
1953 cls = CData
1954 data = data[len('CDATA['):]
1955@@ -271,7 +306,7 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML):
1956 self.soup.handle_data(data)
1957 self.soup.endData(cls)
1958
1959- def handle_pi(self, data):
1960+ def handle_pi(self, data:str) -> None:
1961 """Handle a processing instruction.
1962
1963 :param data: The text of the instruction.
1964@@ -286,16 +321,17 @@ class HTMLParserTreeBuilder(HTMLTreeBuilder):
1965 """A Beautiful soup `TreeBuilder` that uses the `HTMLParser` parser,
1966 found in the Python standard library.
1967 """
1968- is_xml = False
1969- picklable = True
1970- NAME = HTMLPARSER
1971- features = [NAME, HTML, STRICT]
1972+ is_xml:bool = False
1973+ picklable:bool = True
1974+ NAME:str = HTMLPARSER
1975+ features: Iterable[str] = [NAME, HTML, STRICT]
1976
1977- # The html.parser knows which line number and position in the
1978- # original file is the source of an element.
1979- TRACKS_LINE_NUMBERS = True
1980+ #: The html.parser knows which line number and position in the
1981+ #: original file is the source of an element.
1982+ TRACKS_LINE_NUMBERS:bool = True
1983
1984- def __init__(self, parser_args=None, parser_kwargs=None, **kwargs):
1985+ def __init__(self, parser_args:Optional[Iterable[Any]]=None,
1986+ parser_kwargs:Optional[Dict[str, Any]]=None, **kwargs:Any):
1987 """Constructor.
1988
1989 :param parser_args: Positional arguments to pass into
1990@@ -320,9 +356,12 @@ class HTMLParserTreeBuilder(HTMLTreeBuilder):
1991 parser_kwargs['convert_charrefs'] = False
1992 self.parser_args = (parser_args, parser_kwargs)
1993
1994- def prepare_markup(self, markup, user_specified_encoding=None,
1995- document_declared_encoding=None, exclude_encodings=None):
1996-
1997+ def prepare_markup(
1998+ self, markup:_RawMarkup,
1999+ user_specified_encoding:Optional[_Encoding]=None,
2000+ document_declared_encoding:Optional[_Encoding]=None,
2001+ exclude_encodings:Optional[_Encodings]=None
2002+ ) -> Iterable[Tuple[str, Optional[_Encoding], Optional[_Encoding], bool]]:
2003 """Run any preliminary steps necessary to make incoming markup
2004 acceptable to the parser.
2005
2006@@ -333,13 +372,12 @@ class HTMLParserTreeBuilder(HTMLTreeBuilder):
2007 :param exclude_encodings: The user asked _not_ to try any of
2008 these encodings.
2009
2010- :yield: A series of 4-tuples:
2011- (markup, encoding, declared encoding,
2012- has undergone character replacement)
2013+ :yield: A series of 4-tuples: (markup, encoding, declared encoding,
2014+ has undergone character replacement)
2015
2016- Each 4-tuple represents a strategy for converting the
2017- document to Unicode and parsing it. Each strategy will be tried
2018- in turn.
2019+ Each 4-tuple represents a strategy for parsing the document.
2020+ This TreeBuilder uses Unicode, Dammit to convert the markup
2021+ into Unicode, so the `markup` element will always be a string.
2022 """
2023 if isinstance(markup, str):
2024 # Parse Unicode as-is.
2025@@ -348,14 +386,19 @@ class HTMLParserTreeBuilder(HTMLTreeBuilder):
2026
2027 # Ask UnicodeDammit to sniff the most likely encoding.
2028
2029- # This was provided by the end-user; treat it as a known
2030- # definite encoding per the algorithm laid out in the HTML5
2031- # spec. (See the EncodingDetector class for details.)
2032- known_definite_encodings = [user_specified_encoding]
2033+ known_definite_encodings: List[_Encoding] = []
2034+ if user_specified_encoding:
2035+ # This was provided by the end-user; treat it as a known
2036+ # definite encoding per the algorithm laid out in the
2037+ # HTML5 spec. (See the EncodingDetector class for
2038+ # details.)
2039+ known_definite_encodings.append(user_specified_encoding)
2040
2041- # This was found in the document; treat it as a slightly lower-priority
2042- # user encoding.
2043- user_encodings = [document_declared_encoding]
2044+ user_encodings: List[_Encoding] = []
2045+ if document_declared_encoding:
2046+ # This was found in the document; treat it as a slightly
2047+ # lower-priority user encoding.
2048+ user_encodings.append(document_declared_encoding)
2049
2050 try_encodings = [user_specified_encoding, document_declared_encoding]
2051 dammit = UnicodeDammit(
2052@@ -365,17 +408,27 @@ class HTMLParserTreeBuilder(HTMLTreeBuilder):
2053 is_html=True,
2054 exclude_encodings=exclude_encodings
2055 )
2056- yield (dammit.markup, dammit.original_encoding,
2057- dammit.declared_html_encoding,
2058- dammit.contains_replacement_characters)
2059
2060- def feed(self, markup):
2061- """Run some incoming markup through some parsing process,
2062- populating the `BeautifulSoup` object in self.soup.
2063- """
2064+ if dammit.unicode_markup is None:
2065+ # In every case I've seen, Unicode, Dammit is able to
2066+ # convert the markup into Unicode, even if it needs to use
2067+ # REPLACEMENT CHARACTER. But there is a code path that
2068+ # could result in unicode_markup being None, and
2069+ # HTMLParser can only parse Unicode, so here we handle
2070+ # that code path.
2071+ raise ParserRejectedMarkup("Could not convert input to Unicode, and html.parser will not accept bytestrings.")
2072+ else:
2073+ yield (dammit.unicode_markup, dammit.original_encoding,
2074+ dammit.declared_html_encoding,
2075+ dammit.contains_replacement_characters)
2076+
2077+ def feed(self, markup:str):
2078 args, kwargs = self.parser_args
2079- parser = BeautifulSoupHTMLParser(*args, **kwargs)
2080- parser.soup = self.soup
2081+ # We know BeautifulSoup calls TreeBuilder.initialize_soup
2082+ # before calling feed(), so we can assume self.soup
2083+ # is set.
2084+ assert self.soup is not None
2085+ parser = BeautifulSoupHTMLParser(self.soup, *args, **kwargs)
2086 try:
2087 parser.feed(markup)
2088 parser.close()
2089@@ -385,3 +438,4 @@ class HTMLParserTreeBuilder(HTMLTreeBuilder):
2090 # when there's an error in the doctype declaration.
2091 raise ParserRejectedMarkup(e)
2092 parser.already_closed_empty_element = []
2093+
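
For reviewer context on the on_duplicate_attribute machinery above: the strategy constants, or a callable, are reachable through the BeautifulSoup constructor, per Beautiful Soup's documented usage. A minimal sketch (the markup and the accumulate() helper are made up):

====
#!/bin/env python
from bs4 import BeautifulSoup

markup = '<a href="first" href="second">link</a>'

# REPLACE is the default: the later value wins.
print(BeautifulSoup(markup, 'html.parser').a['href'])  # second

# IGNORE keeps the earlier value.
print(BeautifulSoup(markup, 'html.parser',
                    on_duplicate_attribute='ignore').a['href'])  # first

# A callable receives (attrs_so_far, key, latest_value) and may
# combine them however it likes.
def accumulate(attrs, key, value):
    attrs[key] += ' ' + value

print(BeautifulSoup(markup, 'html.parser',
                    on_duplicate_attribute=accumulate).a['href'])  # first second
====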
2094diff --git a/bs4/builder/_lxml.py b/bs4/builder/_lxml.py
2095index 971c81e..44a477f 100644
2096--- a/bs4/builder/_lxml.py
2097+++ b/bs4/builder/_lxml.py
2098@@ -1,3 +1,6 @@
2099+# encoding: utf-8
2100+from __future__ import annotations
2101+
2102 # Use of this source code is governed by the MIT license.
2103 __license__ = "MIT"
2104
2105@@ -6,14 +9,26 @@ __all__ = [
2106 'LXMLTreeBuilder',
2107 ]
2108
2109-try:
2110- from collections.abc import Callable # Python 3.6
2111-except ImportError as e:
2112- from collections import Callable
2113+from collections.abc import Callable
2114+
2115+from typing import (
2116+ Any,
2117+ Dict,
2118+ IO,
2119+ Iterable,
2120+ List,
2121+ Optional,
2122+ Set,
2123+ Tuple,
2124+ Type,
2125+ TYPE_CHECKING,
2126+ Union,
2127+)
2128
2129 from io import BytesIO
2130 from io import StringIO
2131 from lxml import etree
2132+from bs4.dammit import (_Encoding)
2133 from bs4.element import (
2134 Comment,
2135 Doctype,
2136@@ -31,33 +46,54 @@ from bs4.builder import (
2137 TreeBuilder,
2138 XML)
2139 from bs4.dammit import EncodingDetector
2140-
2141-LXML = 'lxml'
2142+if TYPE_CHECKING:
2143+ from bs4._typing import (
2144+ _Encoding,
2145+ _Encodings,
2146+ _NamespacePrefix,
2147+ _NamespaceURL,
2148+ _NamespaceMapping,
2149+ _InvertedNamespaceMapping,
2150+ _RawMarkup,
2151+ )
2152+ from bs4 import BeautifulSoup
2153+
2154+LXML:str = 'lxml'
2155
2156 def _invert(d):
2157 "Invert a dictionary."
2158 return dict((v,k) for k, v in list(d.items()))
2159
2160 class LXMLTreeBuilderForXML(TreeBuilder):
2161- DEFAULT_PARSER_CLASS = etree.XMLParser
2162-
2163- is_xml = True
2164- processing_instruction_class = XMLProcessingInstruction
2165
2166- NAME = "lxml-xml"
2167- ALTERNATE_NAMES = ["xml"]
2168+ DEFAULT_PARSER_CLASS:Type[Any] = etree.XMLParser
2169+
2170+ is_xml:bool = True
2171+
2172+ processing_instruction_class:Type[ProcessingInstruction]
2173+
2174+ NAME:str = "lxml-xml"
2175+ ALTERNATE_NAMES: Iterable[str] = ["xml"]
2176
2177 # Well, it's permissive by XML parser standards.
2178- features = [NAME, LXML, XML, FAST, PERMISSIVE]
2179+ features: Iterable[str] = [NAME, LXML, XML, FAST, PERMISSIVE]
2180
2181- CHUNK_SIZE = 512
2182+ CHUNK_SIZE:int = 512
2183
2184 # This namespace mapping is specified in the XML Namespace
2185 # standard.
2186- DEFAULT_NSMAPS = dict(xml='http://www.w3.org/XML/1998/namespace')
2187+ DEFAULT_NSMAPS: _NamespaceMapping = dict(
2188+ xml='http://www.w3.org/XML/1998/namespace'
2189+ )
2190
2191- DEFAULT_NSMAPS_INVERTED = _invert(DEFAULT_NSMAPS)
2192+ DEFAULT_NSMAPS_INVERTED:_InvertedNamespaceMapping = _invert(
2193+ DEFAULT_NSMAPS
2194+ )
2195
2196+ nsmaps: List[Optional[_InvertedNamespaceMapping]]
2197+ empty_element_tags: Set[str]
2198+ parser: Any
2199+
2200 # NOTE: If we parsed Element objects and looked at .sourceline,
2201 # we'd be able to see the line numbers from the original document.
2202 # But instead we build an XMLParser or HTMLParser object to serve
2203@@ -65,16 +101,18 @@ class LXMLTreeBuilderForXML(TreeBuilder):
2204 # line numbers.
2205 # See: https://bugs.launchpad.net/lxml/+bug/1846906
2206
2207- def initialize_soup(self, soup):
2208+ def initialize_soup(self, soup:BeautifulSoup) -> None:
2209 """Let the BeautifulSoup object know about the standard namespace
2210 mapping.
2211
2212 :param soup: A `BeautifulSoup`.
2213 """
2214+ # Beyond this point, self.soup is set, so we can assume (and
2215+ # assert) it's not None whenever necessary.
2216 super(LXMLTreeBuilderForXML, self).initialize_soup(soup)
2217 self._register_namespaces(self.DEFAULT_NSMAPS)
2218
2219- def _register_namespaces(self, mapping):
2220+ def _register_namespaces(self, mapping:Dict[str, str]) -> None:
2221 """Let the BeautifulSoup object know about namespaces encountered
2222 while parsing the document.
2223
2224@@ -87,6 +125,7 @@ class LXMLTreeBuilderForXML(TreeBuilder):
2225
2226 :param mapping: A dictionary mapping namespace prefixes to URIs.
2227 """
2228+ assert self.soup is not None
2229 for key, value in list(mapping.items()):
2230 # This is 'if key' and not 'if key is not None' because we
2231 # don't track un-prefixed namespaces. Soupselect will
2232@@ -98,19 +137,18 @@ class LXMLTreeBuilderForXML(TreeBuilder):
2233 # prefix, the first one in the document takes precedence.
2234 self.soup._namespaces[key] = value
2235
2236- def default_parser(self, encoding):
1237+    def default_parser(self, encoding:Optional[_Encoding]) -> Any:
2238 """Find the default parser for the given encoding.
2239
2240- :param encoding: A string.
2241 :return: Either a parser object or a class, which
2242 will be instantiated with default arguments.
2243 """
2244 if self._default_parser is not None:
2245 return self._default_parser
2246- return etree.XMLParser(
2247+ return self.DEFAULT_PARSER_CLASS(
2248 target=self, strip_cdata=False, recover=True, encoding=encoding)
2249
2250- def parser_for(self, encoding):
2251+ def parser_for(self, encoding: Optional[_Encoding]) -> Any:
2252 """Instantiate an appropriate parser for the given encoding.
2253
2254 :param encoding: A string.
2255@@ -119,36 +157,39 @@ class LXMLTreeBuilderForXML(TreeBuilder):
2256 # Use the default parser.
2257 parser = self.default_parser(encoding)
2258
2259- if isinstance(parser, Callable):
2260+ if callable(parser):
2261 # Instantiate the parser with default arguments
2262 parser = parser(
2263 target=self, strip_cdata=False, recover=True, encoding=encoding
2264 )
2265 return parser
2266
2267- def __init__(self, parser=None, empty_element_tags=None, **kwargs):
2268+ def __init__(self, parser:Optional[Any]=None,
2269+ empty_element_tags:Optional[Set[str]]=None, **kwargs):
2270 # TODO: Issue a warning if parser is present but not a
2271 # callable, since that means there's no way to create new
2272 # parsers for different encodings.
2273 self._default_parser = parser
2274- if empty_element_tags is not None:
2275- self.empty_element_tags = set(empty_element_tags)
2276 self.soup = None
2277 self.nsmaps = [self.DEFAULT_NSMAPS_INVERTED]
2278 self.active_namespace_prefixes = [dict(self.DEFAULT_NSMAPS)]
2279 super(LXMLTreeBuilderForXML, self).__init__(**kwargs)
2280
2281- def _getNsTag(self, tag):
2282+ def _getNsTag(self, tag:str) -> Tuple[Optional[str], str]:
2283 # Split the namespace URL out of a fully-qualified lxml tag
2284 # name. Copied from lxml's src/lxml/sax.py.
2285 if tag[0] == '{':
2286- return tuple(tag[1:].split('}', 1))
2287+ namespace, name = tag[1:].split('}', 1)
2288+ return (namespace, name)
2289 else:
2290 return (None, tag)
2291
2292- def prepare_markup(self, markup, user_specified_encoding=None,
2293- exclude_encodings=None,
2294- document_declared_encoding=None):
2295+ def prepare_markup(
2296+ self, markup:_RawMarkup,
2297+ user_specified_encoding:Optional[_Encoding]=None,
2298+ document_declared_encoding:Optional[_Encoding]=None,
2299+ exclude_encodings:Optional[_Encodings]=None,
2300+ ) -> Iterable[Tuple[Union[str,bytes], Optional[_Encoding], Optional[_Encoding], bool]]:
2301 """Run any preliminary steps necessary to make incoming markup
2302 acceptable to the parser.
2303
2304@@ -166,13 +207,12 @@ class LXMLTreeBuilderForXML(TreeBuilder):
2305 :param exclude_encodings: The user asked _not_ to try any of
2306 these encodings.
2307
2308- :yield: A series of 4-tuples:
2309- (markup, encoding, declared encoding,
2310- has undergone character replacement)
2311+ :yield: A series of 4-tuples: (markup, encoding, declared encoding,
2312+ has undergone character replacement)
2313
2314- Each 4-tuple represents a strategy for converting the
2315- document to Unicode and parsing it. Each strategy will be tried
2316- in turn.
2317+ Each 4-tuple represents a strategy for converting the
2318+ document to Unicode and parsing it. Each strategy will be tried
2319+ in turn.
2320 """
2321 is_html = not self.is_xml
2322 if is_html:
2323@@ -200,14 +240,25 @@ class LXMLTreeBuilderForXML(TreeBuilder):
2324 yield (markup.encode("utf8"), "utf8",
2325 document_declared_encoding, False)
2326
2327- # This was provided by the end-user; treat it as a known
2328- # definite encoding per the algorithm laid out in the HTML5
2329- # spec. (See the EncodingDetector class for details.)
2330- known_definite_encodings = [user_specified_encoding]
2331+ # Since the document was Unicode in the first place, there
2332+ # is no need to try any more strategies; we know this will
2333+ # work.
2334+ return
2335+
2336+ known_definite_encodings: List[_Encoding] = []
2337+ if user_specified_encoding:
2338+ # This was provided by the end-user; treat it as a known
2339+ # definite encoding per the algorithm laid out in the
2340+ # HTML5 spec. (See the EncodingDetector class for
2341+ # details.)
2342+ known_definite_encodings.append(user_specified_encoding)
2343+
2344+ user_encodings: List[_Encoding] = []
2345+ if document_declared_encoding:
2346+ # This was found in the document; treat it as a slightly
2347+ # lower-priority user encoding.
2348+ user_encodings.append(document_declared_encoding)
2349
2350- # This was found in the document; treat it as a slightly lower-priority
2351- # user encoding.
2352- user_encodings = [document_declared_encoding]
2353 detector = EncodingDetector(
2354 markup, known_definite_encodings=known_definite_encodings,
2355 user_encodings=user_encodings, is_html=is_html,
2356@@ -216,34 +267,45 @@ class LXMLTreeBuilderForXML(TreeBuilder):
2357 for encoding in detector.encodings:
2358 yield (detector.markup, encoding, document_declared_encoding, False)
2359
2360- def feed(self, markup):
2361+ def feed(self, markup:Union[bytes,str]) -> None:
2362+ io: IO
2363 if isinstance(markup, bytes):
2364- markup = BytesIO(markup)
2365+ io = BytesIO(markup)
2366 elif isinstance(markup, str):
2367- markup = StringIO(markup)
2368+ io = StringIO(markup)
2369
2370+ # initialize_soup is called before feed, so we know this
2371+ # is not None.
2372+ assert self.soup is not None
2373+
2374 # Call feed() at least once, even if the markup is empty,
2375 # or the parser won't be initialized.
2376- data = markup.read(self.CHUNK_SIZE)
2377+ data = io.read(self.CHUNK_SIZE)
2378 try:
2379 self.parser = self.parser_for(self.soup.original_encoding)
2380 self.parser.feed(data)
2381 while len(data) != 0:
2382 # Now call feed() on the rest of the data, chunk by chunk.
2383- data = markup.read(self.CHUNK_SIZE)
2384+ data = io.read(self.CHUNK_SIZE)
2385 if len(data) != 0:
2386 self.parser.feed(data)
2387 self.parser.close()
2388 except (UnicodeDecodeError, LookupError, etree.ParserError) as e:
2389 raise ParserRejectedMarkup(e)
2390
2391- def close(self):
2392+ def close(self) -> None:
2393 self.nsmaps = [self.DEFAULT_NSMAPS_INVERTED]
2394
2395- def start(self, name, attrs, nsmap={}):
2396+ def start(self, name:str, attrs:Dict[str, str], nsmap:_NamespaceMapping={}):
2397+ # This is called by lxml code as a result of calling
2398+ # BeautifulSoup.feed(), and we know self.soup is set by the time feed()
2399+ # is called.
2400+ assert self.soup is not None
2401+
2402 # Make sure attrs is a mutable dict--lxml may send an immutable dictproxy.
2403 attrs = dict(attrs)
2404- nsprefix = None
2405+ nsprefix: Optional[_NamespacePrefix] = None
2406+ namespace: Optional[_NamespaceURL] = None
2407 # Invert each namespace map as it comes in.
2408 if len(nsmap) == 0 and len(self.nsmaps) > 1:
2409 # There are no new namespaces for this tag, but
2410@@ -285,7 +347,7 @@ class LXMLTreeBuilderForXML(TreeBuilder):
2411 # Namespaces are in play. Find any attributes that came in
2412 # from lxml with namespaces attached to their names, and
2413 # turn then into NamespacedAttribute objects.
2414- new_attrs = {}
2415+ new_attrs:Dict[Union[str,NamespacedAttribute], str] = {}
2416 for attr, value in list(attrs.items()):
2417 namespace, attr = self._getNsTag(attr)
2418 if namespace is None:
2419@@ -303,7 +365,7 @@ class LXMLTreeBuilderForXML(TreeBuilder):
2420 namespaces=self.active_namespace_prefixes[-1]
2421 )
2422
2423- def _prefix_for_namespace(self, namespace):
2424+ def _prefix_for_namespace(self, namespace:Optional[_NamespaceURL]) -> Optional[_NamespacePrefix]:
2425 """Find the currently active prefix for the given namespace."""
2426 if namespace is None:
2427 return None
2428@@ -312,7 +374,8 @@ class LXMLTreeBuilderForXML(TreeBuilder):
2429 return inverted_nsmap[namespace]
2430 return None
2431
2432- def end(self, name):
2433+ def end(self, name:str) -> None:
2434+ assert self.soup is not None
2435 self.soup.endData()
2436 completed_tag = self.soup.tagStack[-1]
2437 namespace, name = self._getNsTag(name)
2438@@ -334,44 +397,49 @@ class LXMLTreeBuilderForXML(TreeBuilder):
2439 # namespace prefixes.
2440 self.active_namespace_prefixes.pop()
2441
2442- def pi(self, target, data):
2443+ def pi(self, target:str, data:str) -> None:
2444+ assert self.soup is not None
2445 self.soup.endData()
2446 data = target + ' ' + data
2447 self.soup.handle_data(data)
2448 self.soup.endData(self.processing_instruction_class)
2449
2450- def data(self, content):
2451+ def data(self, content:str) -> None:
2452+ assert self.soup is not None
2453 self.soup.handle_data(content)
2454
2455- def doctype(self, name, pubid, system):
2456+ def doctype(self, name:str, pubid:str, system:str) -> None:
2457+ assert self.soup is not None
2458 self.soup.endData()
2459 doctype = Doctype.for_name_and_ids(name, pubid, system)
2460 self.soup.object_was_parsed(doctype)
2461
2462- def comment(self, content):
2463+ def comment(self, content:str) -> None:
2464 "Handle comments as Comment objects."
2465+ assert self.soup is not None
2466 self.soup.endData()
2467 self.soup.handle_data(content)
2468 self.soup.endData(Comment)
2469
2470- def test_fragment_to_document(self, fragment):
2471+ def test_fragment_to_document(self, fragment:str) -> str:
2472 """See `TreeBuilder`."""
2473 return '<?xml version="1.0" encoding="utf-8"?>\n%s' % fragment
2474
2475
2476 class LXMLTreeBuilder(HTMLTreeBuilder, LXMLTreeBuilderForXML):
2477
2478- NAME = LXML
2479- ALTERNATE_NAMES = ["lxml-html"]
2480+ NAME:str = LXML
2481+ ALTERNATE_NAMES: Iterable[str] = ["lxml-html"]
2482
2483- features = ALTERNATE_NAMES + [NAME, HTML, FAST, PERMISSIVE]
2484- is_xml = False
2485- processing_instruction_class = ProcessingInstruction
2486+ features: Iterable[str] = list(ALTERNATE_NAMES) + [NAME, HTML, FAST, PERMISSIVE]
2487+ is_xml: bool = False
2488
2489- def default_parser(self, encoding):
2490+ def default_parser(self, encoding:Optional[_Encoding]) -> Type[Any]:
2491 return etree.HTMLParser
2492
2493- def feed(self, markup):
2494+ def feed(self, markup:_RawMarkup) -> None:
2495+ # We know self.soup is set by the time feed() is called.
2496+ assert self.soup is not None
2497 encoding = self.soup.original_encoding
2498 try:
2499 self.parser = self.parser_for(encoding)
2500@@ -381,6 +449,7 @@ class LXMLTreeBuilder(HTMLTreeBuilder, LXMLTreeBuilderForXML):
2501 raise ParserRejectedMarkup(e)
2502
2503
2504- def test_fragment_to_document(self, fragment):
2505+ def test_fragment_to_document(self, fragment:str) -> str:
2506 """See `TreeBuilder`."""
2507 return '<html><body>%s</body></html>' % fragment
2508+
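
For reviewer context on the namespace typing above: nsmaps, _register_namespaces(), and _prefix_for_namespace() are what let prefixed XML names survive a round trip through the tree. A minimal sketch, assuming lxml is installed (the namespace URI is made up):

====
#!/bin/env python
from bs4 import BeautifulSoup

xml = '<root xmlns:ns="http://example.com/ns"><ns:item>x</ns:item></root>'
soup = BeautifulSoup(xml, 'xml')

item = soup.find('item')
print(item.prefix)     # ns
print(item.namespace)  # http://example.com/ns
print(item)            # <ns:item>x</ns:item>
====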
2509diff --git a/bs4/css.py b/bs4/css.py
2510index 245ac60..0477de8 100644
2511--- a/bs4/css.py
2512+++ b/bs4/css.py
2513@@ -1,6 +1,36 @@
2514-"""Integration code for CSS selectors using Soup Sieve (pypi: soupsieve)."""
2515-
2516+"""Integration code for CSS selectors using `Soup Sieve <https://facelessuser.github.io/soupsieve/>`_ (pypi: ``soupsieve``).
2517+
2518+Acquire a `CSS` object through the `bs4.element.Tag.css` attribute of
2519+the starting point of your CSS selector, or (if you want to run a
2520+selector against the entire document) of the `BeautifulSoup` object
2521+itself.
2522+
2523+The main advantage of doing this instead of using ``soupsieve``
2524+functions is that you don't need to keep passing the `bs4.element.Tag` to be
2525+selected against, since the `CSS` object is permanently scoped to that
2526+`bs4.element.Tag`.
2527+
2528+"""
2529+
2530+from __future__ import annotations
2531+
2532+from types import ModuleType
2533+from typing import (
2534+ Any,
2535+ cast,
2536+ Iterable,
2537+ Iterator,
2538+ Optional,
2539+ TYPE_CHECKING,
2540+)
2541 import warnings
2542+from bs4._typing import _NamespaceMapping
2543+if TYPE_CHECKING:
2544+ from soupsieve import SoupSieve
2545+ from bs4 import element
2546+ from bs4.element import ResultSet, Tag
2547+
2548+soupsieve: Optional[ModuleType]
2549 try:
2550 import soupsieve
2551 except ImportError as e:
2552@@ -9,34 +39,22 @@ except ImportError as e:
2553 'The soupsieve package is not installed. CSS selectors cannot be used.'
2554 )
2555
2556-
2557 class CSS(object):
2558- """A proxy object against the soupsieve library, to simplify its
2559+ """A proxy object against the ``soupsieve`` library, to simplify its
2560 CSS selector API.
2561
2562- Acquire this object through the .css attribute on the
2563- BeautifulSoup object, or on the Tag you want to use as the
2564- starting point for a CSS selector.
2565-
2566- The main advantage of doing this is that the tag to be selected
2567- against doesn't need to be explicitly specified in the function
2568- calls, since it's already scoped to a tag.
2569- """
2570-
2571- def __init__(self, tag, api=soupsieve):
2572- """Constructor.
2573-
2574- You don't need to instantiate this class yourself; instead,
2575- access the .css attribute on the BeautifulSoup object, or on
2576- the Tag you want to use as the starting point for your CSS
2577- selector.
2578+ You don't need to instantiate this class yourself; instead, use
2579+ `element.Tag.css`.
2580
2581- :param tag: All CSS selectors will use this as their starting
2582- point.
2583+ :param tag: All CSS selectors run by this object will use this as
2584+ their starting point.
2585
2586- :param api: A plug-in replacement for the soupsieve module,
2587- designed mainly for use in tests.
2588- """
2589+ :param api: An optional drop-in replacement for the ``soupsieve`` module,
2590+ intended for use in unit tests.
2591+ """
2592+ def __init__(self, tag: element.Tag, api:Optional[ModuleType]=None):
2593+ if api is None:
2594+ api = soupsieve
2595 if api is None:
2596 raise NotImplementedError(
2597 "Cannot execute CSS selectors because the soupsieve package is not installed."
2598@@ -44,19 +62,19 @@ class CSS(object):
2599 self.api = api
2600 self.tag = tag
2601
2602- def escape(self, ident):
2603+ def escape(self, ident:str) -> str:
2604 """Escape a CSS identifier.
2605
2606- This is a simple wrapper around soupselect.escape(). See the
2607+ This is a simple wrapper around `soupsieve.escape() <https://facelessuser.github.io/soupsieve/api/#soupsieveescape>`_. See the
2608 documentation for that function for more information.
2609 """
2610 if soupsieve is None:
2611 raise NotImplementedError(
2612 "Cannot escape CSS identifiers because the soupsieve package is not installed."
2613 )
2614- return self.api.escape(ident)
2615+ return cast(str, self.api.escape(ident))
2616
2617- def _ns(self, ns, select):
2618+ def _ns(self, ns:Optional[_NamespaceMapping], select:str) -> Optional[_NamespaceMapping]:
2619 """Normalize a dictionary of namespaces."""
2620 if not isinstance(select, self.api.SoupSieve) and ns is None:
2621 # If the selector is a precompiled pattern, it already has
2622@@ -65,7 +83,7 @@ class CSS(object):
2623 ns = self.tag._namespaces
2624 return ns
2625
2626- def _rs(self, results):
2627+ def _rs(self, results:Iterable[Tag]) -> ResultSet[Tag]:
2628 """Normalize a list of results to a Resultset.
2629
2630 A ResultSet is more consistent with the rest of Beautiful
2631@@ -77,7 +95,12 @@ class CSS(object):
2632 from bs4.element import ResultSet
2633 return ResultSet(None, results)
2634
2635- def compile(self, select, namespaces=None, flags=0, **kwargs):
2636+ def compile(self,
2637+ select:str,
2638+ namespaces:Optional[_NamespaceMapping]=None,
2639+ flags:int=0,
2640+ **kwargs:Any
2641+ ) -> SoupSieve:
2642 """Pre-compile a selector and return the compiled object.
2643
2644 :param selector: A CSS selector.
2645@@ -88,10 +111,10 @@ class CSS(object):
2646 parsing the document.
2647
2648 :param flags: Flags to be passed into Soup Sieve's
2649- soupsieve.compile() method.
2650+ `soupsieve.compile() <https://facelessuser.github.io/soupsieve/api/#soupsievecompile>`_ method.
2651
2652- :param kwargs: Keyword arguments to be passed into SoupSieve's
2653- soupsieve.compile() method.
2654+ :param kwargs: Keyword arguments to be passed into Soup Sieve's
2655+ `soupsieve.compile() <https://facelessuser.github.io/soupsieve/api/#soupsievecompile>`_ method.
2656
2657 :return: A precompiled selector object.
2658 :rtype: soupsieve.SoupSieve
2659@@ -100,13 +123,16 @@ class CSS(object):
2660 select, self._ns(namespaces, select), flags, **kwargs
2661 )
2662
2663- def select_one(self, select, namespaces=None, flags=0, **kwargs):
2664+ def select_one(
2665+ self, select:str,
2666+ namespaces:Optional[_NamespaceMapping]=None,
2667+ flags:int=0, **kwargs:Any
2668+ )-> element.Tag | None:
2669 """Perform a CSS selection operation on the current Tag and return the
2670- first result.
2671+ first result, if any.
2672
2673 This uses the Soup Sieve library. For more information, see
2674- that library's documentation for the soupsieve.select_one()
2675- method.
2676+ that library's documentation for the `soupsieve.select_one() <https://facelessuser.github.io/soupsieve/api/#soupsieveselect_one>`_ method.
2677
2678 :param selector: A CSS selector.
2679
2680@@ -116,27 +142,24 @@ class CSS(object):
2681 parsing the document.
2682
2683 :param flags: Flags to be passed into Soup Sieve's
2684- soupsieve.select_one() method.
2685-
2686- :param kwargs: Keyword arguments to be passed into SoupSieve's
2687- soupsieve.select_one() method.
2688-
2689- :return: A Tag, or None if the selector has no match.
2690- :rtype: bs4.element.Tag
2691+ `soupsieve.select_one() <https://facelessuser.github.io/soupsieve/api/#soupsieveselect_one>`_ method.
2692
2693+ :param kwargs: Keyword arguments to be passed into Soup Sieve's
2694+ `soupsieve.select_one() <https://facelessuser.github.io/soupsieve/api/#soupsieveselect_one>`_ method.
2695 """
2696 return self.api.select_one(
2697 select, self.tag, self._ns(namespaces, select), flags, **kwargs
2698 )
2699
2700- def select(self, select, namespaces=None, limit=0, flags=0, **kwargs):
2701- """Perform a CSS selection operation on the current Tag.
2702+ def select(self, select:str,
2703+ namespaces:Optional[_NamespaceMapping]=None,
2704+ limit:int=0, flags:int=0, **kwargs:Any) -> ResultSet[Tag]:
2705+ """Perform a CSS selection operation on the current `element.Tag`.
2706
2707 This uses the Soup Sieve library. For more information, see
2708- that library's documentation for the soupsieve.select()
2709- method.
2710+ that library's documentation for the `soupsieve.select() <https://facelessuser.github.io/soupsieve/api/#soupsieveselect>`_ method.
2711
2712- :param selector: A string containing a CSS selector.
2713+ :param selector: A CSS selector.
2714
2715 :param namespaces: A dictionary mapping namespace prefixes
2716 used in the CSS selector to namespace URIs. By default,
2717@@ -146,14 +169,10 @@ class CSS(object):
2718 :param limit: After finding this number of results, stop looking.
2719
2720 :param flags: Flags to be passed into Soup Sieve's
2721- soupsieve.select() method.
2722-
2723- :param kwargs: Keyword arguments to be passed into SoupSieve's
2724- soupsieve.select() method.
2725-
2726- :return: A ResultSet of Tag objects.
2727- :rtype: bs4.element.ResultSet
2728+ `soupsieve.select() <https://facelessuser.github.io/soupsieve/api/#soupsieveselect>`_ method.
2729
2730+ :param kwargs: Keyword arguments to be passed into Soup Sieve's
2731+ `soupsieve.select() <https://facelessuser.github.io/soupsieve/api/#soupsieveselect>`_ method.
2732 """
2733 if limit is None:
2734 limit = 0
2735@@ -165,11 +184,14 @@ class CSS(object):
2736 )
2737 )
2738
2739- def iselect(self, select, namespaces=None, limit=0, flags=0, **kwargs):
2740- """Perform a CSS selection operation on the current Tag.
2741+ def iselect(self, select:str,
2742+ namespaces:Optional[_NamespaceMapping]=None,
2743+ limit:int=0, flags:int=0, **kwargs:Any) -> Iterator[element.Tag]:
2744+ """Perform a CSS selection operation on the current `element.Tag`.
2745
2746 This uses the Soup Sieve library. For more information, see
2747- that library's documentation for the soupsieve.iselect()
2748+ that library's documentation for the `soupsieve.iselect()
2749+ <https://facelessuser.github.io/soupsieve/api/#soupsieveiselect>`_
2750 method. It is the same as select(), but it returns a generator
2751 instead of a list.
2752
2753@@ -183,23 +205,23 @@ class CSS(object):
2754 :param limit: After finding this number of results, stop looking.
2755
2756 :param flags: Flags to be passed into Soup Sieve's
2757- soupsieve.iselect() method.
2758-
2759- :param kwargs: Keyword arguments to be passed into SoupSieve's
2760- soupsieve.iselect() method.
2761+ `soupsieve.iselect() <https://facelessuser.github.io/soupsieve/api/#soupsieveiselect>`_ method.
2762
2763- :return: A generator
2764- :rtype: types.GeneratorType
2765+ :param kwargs: Keyword arguments to be passed into Soup Sieve's
2766+ `soupsieve.iselect() <https://facelessuser.github.io/soupsieve/api/#soupsieveiselect>`_ method.
2767 """
2768 return self.api.iselect(
2769 select, self.tag, self._ns(namespaces, select), limit, flags, **kwargs
2770 )
2771
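
A similar sketch contrasting select() with iselect(), under the same assumptions (the limit argument behaves identically in both):

====
from bs4 import BeautifulSoup

soup = BeautifulSoup('<ul><li>a</li><li>b</li><li>c</li></ul>', 'html.parser')

# select() collects matches into a list-like ResultSet...
print([li.get_text() for li in soup.css.select('li', limit=2)])  # ['a', 'b']

# ...while iselect() yields the same matches lazily, one at a time.
for li in soup.css.iselect('li'):
    print(li.get_text())
====
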
2772- def closest(self, select, namespaces=None, flags=0, **kwargs):
2773- """Find the Tag closest to this one that matches the given selector.
2774+ def closest(self, select:str,
2775+ namespaces:Optional[_NamespaceMapping]=None,
2776+ flags:int=0, **kwargs:Any) -> Optional[element.Tag]:
2777+ """Find the `element.Tag` closest to this one that matches the given selector.
2778
2779 This uses the Soup Sieve library. For more information, see
2780- that library's documentation for the soupsieve.closest()
2781+ that library's documentation for the `soupsieve.closest()
2782+ <https://facelessuser.github.io/soupsieve/api/#soupsieveclosest>`_
2783 method.
2784
2785 :param selector: A string containing a CSS selector.
2786@@ -210,24 +232,24 @@ class CSS(object):
2787 parsing the document.
2788
2789 :param flags: Flags to be passed into Soup Sieve's
2790- soupsieve.closest() method.
2791+ `soupsieve.closest() <https://facelessuser.github.io/soupsieve/api/#soupsieveclosest>`_ method.
2792
2793- :param kwargs: Keyword arguments to be passed into SoupSieve's
2794- soupsieve.closest() method.
2795-
2796- :return: A Tag, or None if there is no match.
2797- :rtype: bs4.Tag
2798+ :param kwargs: Keyword arguments to be passed into Soup Sieve's
2799+ `soupsieve.closest() <https://facelessuser.github.io/soupsieve/api/#soupsieveclosest>`_ method.
2800
2801 """
2802 return self.api.closest(
2803 select, self.tag, self._ns(namespaces, select), flags, **kwargs
2804 )
2805
2806- def match(self, select, namespaces=None, flags=0, **kwargs):
2807- """Check whether this Tag matches the given CSS selector.
2808+ def match(self, select:str,
2809+ namespaces:Optional[_NamespaceMapping]=None,
2810+ flags:int=0, **kwargs:Any) -> bool:
2811+ """Check whether or not this `element.Tag` matches the given CSS selector.
2812
2813 This uses the Soup Sieve library. For more information, see
2814- that library's documentation for the soupsieve.match()
2815+ that library's documentation for the `soupsieve.match()
2816+ <https://facelessuser.github.io/soupsieve/api/#soupsievematch>`_
2817 method.
2818
2819 :param: a CSS selector.
2820@@ -238,25 +260,30 @@ class CSS(object):
2821 parsing the document.
2822
2823 :param flags: Flags to be passed into Soup Sieve's
2824- soupsieve.match() method.
2825+ `soupsieve.match()
2826+ <https://facelessuser.github.io/soupsieve/api/#soupsievematch>`_
2827+ method.
2828
2829 :param kwargs: Keyword arguments to be passed into SoupSieve's
2830- soupsieve.match() method.
2831-
2832- :return: True if this Tag matches the selector; False otherwise.
2833- :rtype: bool
2834+ `soupsieve.match()
2835+ <https://facelessuser.github.io/soupsieve/api/#soupsievematch>`_
2836+ method.
2837 """
2838- return self.api.match(
2839+ return cast(bool, self.api.match(
2840 select, self.tag, self._ns(namespaces, select), flags, **kwargs
2841- )
2842+ ))
2843
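
And one for closest() and match(), illustrating the annotated return types (Optional[element.Tag] and bool); the markup is again invented:

====
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div class="outer"><div class="inner"><b>text</b></div></div>',
    'html.parser')
b = soup.find('b')

# closest() starts from this element and walks up the tree until
# it finds a match, returning None if there is none.
print(b.css.closest('div.outer'))  # the outer <div>

# match() simply tests this element against the selector.
print(b.css.match('div b'))  # True
print(b.css.match('i'))      # False
====
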
2844- def filter(self, select, namespaces=None, flags=0, **kwargs):
2845- """Filter this Tag's direct children based on the given CSS selector.
2846+ def filter(self, select:str,
2847+ namespaces:Optional[_NamespaceMapping]=None,
2848+ flags:int=0, **kwargs:Any) -> ResultSet[Tag]:
2849+ """Filter this `element.Tag`'s direct children based on the given CSS selector.
2850
2851 This uses the Soup Sieve library. It works the same way as
2852- passing this Tag into that library's soupsieve.filter()
2853- method. More information, for more information see the
2854- documentation for soupsieve.filter().
2855+ passing an `element.Tag` into that library's `soupsieve.filter()
2856+ <https://facelessuser.github.io/soupsieve/api/#soupsievefilter>`_
2857+ method. For more information, see the documentation for
2858+ `soupsieve.filter()
2859+ <https://facelessuser.github.io/soupsieve/api/#soupsievefilter>`_.
2860
2861 :param namespaces: A dictionary mapping namespace prefixes
2862 used in the CSS selector to namespace URIs. By default,
2863@@ -264,17 +291,18 @@ class CSS(object):
2864 parsing the document.
2865
2866 :param flags: Flags to be passed into Soup Sieve's
2867- soupsieve.filter() method.
2868+ `soupsieve.filter()
2869+ <https://facelessuser.github.io/soupsieve/api/#soupsievefilter>`_
2870+ method.
2871
2872 :param kwargs: Keyword arguments to be passed into SoupSieve's
2873- soupsieve.filter() method.
2874-
2875- :return: A ResultSet of Tag objects.
2876- :rtype: bs4.element.ResultSet
2877-
2878+ `soupsieve.filter()
2879+ <https://facelessuser.github.io/soupsieve/api/#soupsievefilter>`_
2880+ method.
2881 """
2882 return self._rs(
2883 self.api.filter(
2884 select, self.tag, self._ns(namespaces, select), flags, **kwargs
2885 )
2886 )
2887+
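
Finally, a sketch of filter(), which unlike select() considers only this tag's direct children:

====
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div><p>one</p><span>skip <p>nested</p></span><p>two</p></div>',
    'html.parser')

# Only the two <p> tags that are direct children of <div> are
# returned; the <p> nested inside <span> is not.
for p in soup.div.css.filter('p'):
    print(p)
====
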
2888diff --git a/bs4/dammit.py b/bs4/dammit.py
2889index 692433c..8c1b631 100644
2890--- a/bs4/dammit.py
2891+++ b/bs4/dammit.py
2892@@ -2,9 +2,11 @@
2893 """Beautiful Soup bonus library: Unicode, Dammit
2894
2895 This library converts a bytestream to Unicode through any means
2896-necessary. It is heavily based on code from Mark Pilgrim's Universal
2897-Feed Parser. It works best on XML and HTML, but it does not rewrite the
2898-XML or HTML to reflect a new encoding; that's the tree builder's job.
2899+necessary. It is heavily based on code from Mark Pilgrim's `Universal
2900+Feed Parser <https://pypi.org/project/feedparser/>`_. It works best on
2901+XML and HTML, but it does not rewrite the XML or HTML to reflect a new
2902+encoding; that's the job of `TreeBuilder`.
2903+
2904 """
2905 # Use of this source code is governed by the MIT license.
2906 __license__ = "MIT"
2907@@ -12,9 +14,31 @@ __license__ = "MIT"
2908 from html.entities import codepoint2name
2909 from collections import defaultdict
2910 import codecs
2911+from html.entities import html5
2912 import re
2913-import logging
2914+from logging import Logger, getLogger
2915 import string
2916+from types import ModuleType
2917+from typing import (
2918+ Dict,
2919+ Iterable,
2920+ Iterator,
2921+ List,
2922+ Optional,
2923+ Pattern,
2924+ Sequence,
2925+ Set,
2926+ Tuple,
2927+ Type,
2928+ Union,
2929+ cast,
2930+)
2931+from bs4._typing import (
2932+ _Encoding,
2933+ _Encodings,
2934+ _RawMarkup,
2935+)
2936+import warnings
2937
2938 # Import a library to autodetect character encodings. We'll support
2939 # any of a number of libraries that all support the same API:
2940@@ -22,37 +46,41 @@ import string
2941 # * cchardet
2942 # * chardet
2943 # * charset-normalizer
2944-chardet_module = None
2945+chardet_module: Optional[ModuleType] = None
2946 try:
2947 # PyPI package: cchardet
2948- import cchardet as chardet_module
2949+ import cchardet
2950+ chardet_module = cchardet
2951 except ImportError:
2952 try:
2953 # Debian package: python-chardet
2954 # PyPI package: chardet
2955- import chardet as chardet_module
2956+ import chardet
2957+ chardet_module = chardet
2958 except ImportError:
2959 try:
2960 # PyPI package: charset-normalizer
2961- import charset_normalizer as chardet_module
2962+ import charset_normalizer
2963+ chardet_module = charset_normalizer
2964 except ImportError:
2965 # No chardet available.
2966- chardet_module = None
2967+ pass
2968
2969-if chardet_module:
2970- def chardet_dammit(s):
2971- if isinstance(s, str):
2972- return None
2973- return chardet_module.detect(s)['encoding']
2974-else:
2975- def chardet_dammit(s):
2976+
2977+def _chardet_dammit(s:bytes) -> Optional[str]:
2978+ """Try as hard as possible to detect the encoding of a bytestring."""
2979+ if chardet_module is None or isinstance(s, str):
2980 return None
2981+ module = chardet_module
2982+ return module.detect(s)['encoding']
2983
2984 # Build bytestring and Unicode versions of regular expressions for finding
2985 # a declared encoding inside an XML or HTML document.
2986-xml_encoding = '^\\s*<\\?.*encoding=[\'"](.*?)[\'"].*\\?>'
2987-html_meta = '<\\s*meta[^>]+charset\\s*=\\s*["\']?([^>]*?)[ /;\'">]'
2988-encoding_res = dict()
2989+xml_encoding:str = '^\\s*<\\?.*encoding=[\'"](.*?)[\'"].*\\?>' #: :meta private:
2990+html_meta:str = '<\\s*meta[^>]+charset\\s*=\\s*["\']?([^>]*?)[ /;\'">]' #: :meta private:
2991+
2992+# TODO: The Pattern type here could use more refinement, but it's tricky.
2993+encoding_res: Dict[Type, Dict[str, Pattern]] = dict()
2994 encoding_res[bytes] = {
2995 'html' : re.compile(html_meta.encode("ascii"), re.I),
2996 'xml' : re.compile(xml_encoding.encode("ascii"), re.I),
2997@@ -62,12 +90,29 @@ encoding_res[str] = {
2998 'xml' : re.compile(xml_encoding, re.I)
2999 }
3000
3001-from html.entities import html5
3002-
3003 class EntitySubstitution(object):
3004 """The ability to substitute XML or HTML entities for certain characters."""
3005
3006- def _populate_class_variables():
3007+ #: A map of named HTML entities to the corresponding Unicode string.
3008+ #:
3009+ #: :meta hide-value:
3010+ HTML_ENTITY_TO_CHARACTER: Dict[str, str]
3011+
3012+ #: A map of Unicode strings to the corresponding named HTML entities;
3013+ #: the inverse of HTML_ENTITY_TO_CHARACTER.
3014+ #:
3015+ #: :meta hide-value:
3016+ CHARACTER_TO_HTML_ENTITY: Dict[str, str]
3017+
3018+ #: A regular expression that matches any character (or, in rare
3019+ #: cases, pair of characters) that can be replaced with a named
3020+ #: HTML entity.
3021+ #:
3022+ #: :meta hide-value:
3023+ CHARACTER_TO_HTML_ENTITY_RE: Pattern[str]
3024+
3025+ @classmethod
3026+ def _populate_class_variables(cls) -> None:
3027 """Initialize variables used by this class to manage the plethora of
3028 HTML5 named entities.
3029
3030@@ -184,11 +229,14 @@ class EntitySubstitution(object):
3031 character = chr(codepoint)
3032 unicode_to_name[character] = name
3033
3034- return unicode_to_name, name_to_unicode, re.compile(re_definition)
3035- (CHARACTER_TO_HTML_ENTITY, HTML_ENTITY_TO_CHARACTER,
3036- CHARACTER_TO_HTML_ENTITY_RE) = _populate_class_variables()
3037+ cls.CHARACTER_TO_HTML_ENTITY = unicode_to_name
3038+ cls.HTML_ENTITY_TO_CHARACTER = name_to_unicode
3039+ cls.CHARACTER_TO_HTML_ENTITY_RE = re.compile(re_definition)
3040
3041- CHARACTER_TO_XML_ENTITY = {
3042+ #: A map of Unicode strings to the corresponding named XML entities.
3043+ #:
3044+ #: :meta hide-value:
3045+ CHARACTER_TO_XML_ENTITY: Dict[str, str] = {
3046 "'": "apos",
3047 '"': "quot",
3048 "&": "amp",
3049@@ -196,28 +244,37 @@ class EntitySubstitution(object):
3050 ">": "gt",
3051 }
3052
3053- BARE_AMPERSAND_OR_BRACKET = re.compile("([<>]|"
3054- "&(?!#\\d+;|#x[0-9a-fA-F]+;|\\w+;)"
3055- ")")
3056-
3057- AMPERSAND_OR_BRACKET = re.compile("([<>&])")
3058+ #: A regular expression matching an angle bracket or an ampersand that
3059+ #: is not part of an XML or HTML entity.
3060+ #:
3061+ #: :meta hide-value:
3062+ BARE_AMPERSAND_OR_BRACKET: Pattern[str] = re.compile(
3063+ "([<>]|"
3064+ "&(?!#\\d+;|#x[0-9a-fA-F]+;|\\w+;)"
3065+ ")"
3066+ )
3067+
3068+ #: A regular expression matching an angle bracket or an ampersand.
3069+ #:
3070+ #: :meta hide-value:
3071+ AMPERSAND_OR_BRACKET: Pattern[str] = re.compile("([<>&])")
3072
3073 @classmethod
3074- def _substitute_html_entity(cls, matchobj):
3075+ def _substitute_html_entity(cls, matchobj:re.Match[str]) -> str:
3076 """Used with a regular expression to substitute the
3077 appropriate HTML entity for a special character string."""
3078 entity = cls.CHARACTER_TO_HTML_ENTITY.get(matchobj.group(0))
3079 return "&%s;" % entity
3080
3081 @classmethod
3082- def _substitute_xml_entity(cls, matchobj):
3083+ def _substitute_xml_entity(cls, matchobj:re.Match[str]) -> str:
3084 """Used with a regular expression to substitute the
3085 appropriate XML entity for a special character string."""
3086 entity = cls.CHARACTER_TO_XML_ENTITY[matchobj.group(0)]
3087 return "&%s;" % entity
3088
3089 @classmethod
3090- def quoted_attribute_value(self, value):
3091+ def quoted_attribute_value(cls, value: str) -> str:
3092 """Make a value into a quoted XML attribute, possibly escaping it.
3093
3094 Most strings will be quoted using double quotes.
3095@@ -233,7 +290,10 @@ class EntitySubstitution(object):
3096 double quotes will be escaped, and the string will be quoted
3097 using double quotes.
3098
3099- Welcome to "Bob's Bar" -> "Welcome to &quot;Bob's bar&quot;
3100+ Welcome to "Bob's Bar" -> "Welcome to &quot;Bob's Bar&quot;"
3101+
3102+ :param value: The XML attribute value to quote
3103+ :return: The quoted value
3104 """
3105 quote_with = '"'
3106 if '"' in value:
3107@@ -254,17 +314,22 @@ class EntitySubstitution(object):
3108 return quote_with + value + quote_with
3109
3110 @classmethod
3111- def substitute_xml(cls, value, make_quoted_attribute=False):
3112- """Substitute XML entities for special XML characters.
3113+ def substitute_xml(cls, value:str, make_quoted_attribute:bool=False) -> str:
3114+ """Replace special XML characters with named XML entities.
3115+
3116+ The less-than sign will become &lt;, the greater-than sign
3117+ will become &gt;, and any ampersands will become &amp;. If you
3118+ want ampersands that seem to be part of an entity definition
3119+ to be left alone, use `substitute_xml_containing_entities`
3120+ instead.
3121
3122- :param value: A string to be substituted. The less-than sign
3123- will become &lt;, the greater-than sign will become &gt;,
3124- and any ampersands will become &amp;. If you want ampersands
3125- that appear to be part of an entity definition to be left
3126- alone, use substitute_xml_containing_entities() instead.
3127+ :param value: A string to be substituted.
3128
3129 :param make_quoted_attribute: If True, then the string will be
3130 quoted, as befits an attribute value.
3131+
3132+ :return: A version of ``value`` with special characters replaced
3133+ with named entities.
3134 """
3135 # Escape angle brackets and ampersands.
3136 value = cls.AMPERSAND_OR_BRACKET.sub(
3137@@ -276,7 +341,7 @@ class EntitySubstitution(object):
3138
3139 @classmethod
3140 def substitute_xml_containing_entities(
3141- cls, value, make_quoted_attribute=False):
3142+ cls, value: str, make_quoted_attribute:bool=False) -> str:
3143 """Substitute XML entities for special XML characters.
3144
3145 :param value: A string to be substituted. The less-than sign will
3146@@ -297,10 +362,10 @@ class EntitySubstitution(object):
3147 return value
3148
3149 @classmethod
3150- def substitute_html(cls, s):
3151+ def substitute_html(cls, s: str) -> str:
3152 """Replace certain Unicode characters with named HTML entities.
3153
3154- This differs from data.encode(encoding, 'xmlcharrefreplace')
3155+ This differs from ``data.encode(encoding, 'xmlcharrefreplace')``
3156 in that the goal is to make the result more readable (to those
3157 with ASCII displays) rather than to recover from
3158 errors. There's absolutely nothing wrong with a UTF-8 string
3159@@ -308,109 +373,126 @@ class EntitySubstitution(object):
3160 character with "&eacute;" will make it more readable to some
3161 people.
3162
3163- :param s: A Unicode string.
3164+ :param s: The string to be modified.
3165+ :return: The string with some Unicode characters replaced with
3166+ HTML entities.
3167 """
3168 return cls.CHARACTER_TO_HTML_ENTITY_RE.sub(
3169 cls._substitute_html_entity, s)
3170-
3171+EntitySubstitution._populate_class_variables()
3172
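
Since _populate_class_variables() is now an explicit classmethod invoked once at import time, the public class-level helpers behave as before. A minimal sketch (expected outputs shown as comments):

====
from bs4.dammit import EntitySubstitution

print(EntitySubstitution.substitute_xml('AT&T <html>'))
# AT&amp;T &lt;html&gt;

print(EntitySubstitution.substitute_html('café'))
# caf&eacute;

# The value contains both quote characters, so the double quotes
# are escaped and the result is quoted with double quotes.
print(EntitySubstitution.quoted_attribute_value('Welcome to "Bob\'s Bar"'))
# "Welcome to &quot;Bob's Bar&quot;"
====
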
3173 class EncodingDetector:
3174- """Suggests a number of possible encodings for a bytestring.
3175+ """This class is capable of guessing a number of possible encodings
3176+ for a bytestring.
3177
3178 Order of precedence:
3179
3180 1. Encodings you specifically tell EncodingDetector to try first
3181- (the known_definite_encodings argument to the constructor).
3182-
3183+ (the ``known_definite_encodings`` argument to the constructor).
3184+
3185 2. An encoding determined by sniffing the document's byte-order mark.
3186-
3187+
3188 3. Encodings you specifically tell EncodingDetector to try if
3189- byte-order mark sniffing fails (the user_encodings argument to the
3190- constructor).
3191+ byte-order mark sniffing fails (the ``user_encodings`` argument to the
3192+ constructor).
3193
3194 4. An encoding declared within the bytestring itself, either in an
3195- XML declaration (if the bytestring is to be interpreted as an XML
3196- document), or in a <meta> tag (if the bytestring is to be
3197- interpreted as an HTML document.)
3198+ XML declaration (if the bytestring is to be interpreted as an XML
3199+ document), or in a <meta> tag (if the bytestring is to be
3200+ interpreted as an HTML document.)
3201
3202 5. An encoding detected through textual analysis by chardet,
3203- cchardet, or a similar external library.
3204+ cchardet, or a similar external library.
3205
3206- 4. UTF-8.
3207+ 6. UTF-8.
3208
3209- 5. Windows-1252.
3210+ 7. Windows-1252.
3211
3212- """
3213- def __init__(self, markup, known_definite_encodings=None,
3214- is_html=False, exclude_encodings=None,
3215- user_encodings=None, override_encodings=None):
3216- """Constructor.
3217-
3218- :param markup: Some markup in an unknown encoding.
3219-
3220- :param known_definite_encodings: When determining the encoding
3221- of `markup`, these encodings will be tried first, in
3222- order. In HTML terms, this corresponds to the "known
3223- definite encoding" step defined here:
3224- https://html.spec.whatwg.org/multipage/parsing.html#parsing-with-a-known-character-encoding
3225-
3226- :param user_encodings: These encodings will be tried after the
3227- `known_definite_encodings` have been tried and failed, and
3228- after an attempt to sniff the encoding by looking at a
3229- byte order mark has failed. In HTML terms, this
3230- corresponds to the step "user has explicitly instructed
3231- the user agent to override the document's character
3232- encoding", defined here:
3233- https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding
3234-
3235- :param override_encodings: A deprecated alias for
3236- known_definite_encodings. Any encodings here will be tried
3237- immediately after the encodings in
3238- known_definite_encodings.
3239-
3240- :param is_html: If True, this markup is considered to be
3241- HTML. Otherwise it's assumed to be XML.
3242-
3243- :param exclude_encodings: These encodings will not be tried,
3244- even if they otherwise would be.
3245+ :param markup: Some markup in an unknown encoding.
3246
3247- """
3248+ :param known_definite_encodings: When determining the encoding
3249+ of ``markup``, these encodings will be tried first, in
3250+ order. In HTML terms, this corresponds to the "known
3251+ definite encoding" step defined in `section 13.2.3.1 of the HTML standard <https://html.spec.whatwg.org/multipage/parsing.html#parsing-with-a-known-character-encoding>`_.
3252+
3253+ :param user_encodings: These encodings will be tried after the
3254+ ``known_definite_encodings`` have been tried and failed, and
3255+ after an attempt to sniff the encoding by looking at a
3256+ byte order mark has failed. In HTML terms, this
3257+ corresponds to the step "user has explicitly instructed
3258+ the user agent to override the document's character
3259+ encoding", defined in `section 13.2.3.2 of the HTML standard <https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding>`_.
3260+
3261+ :param override_encodings: A **deprecated** alias for
3262+ ``known_definite_encodings``. Any encodings here will be tried
3263+ immediately after the encodings in
3264+ ``known_definite_encodings``.
3265+
3266+ :param is_html: If True, this markup is considered to be
3267+ HTML. Otherwise it's assumed to be XML.
3268+
3269+ :param exclude_encodings: These encodings will not be tried,
3270+ even if they otherwise would be.
3271+
3272+ """
3273+ def __init__(self, markup:bytes,
3274+ known_definite_encodings:Optional[_Encodings]=None,
3275+ is_html:Optional[bool]=False,
3276+ exclude_encodings:Optional[_Encodings]=None,
3277+ user_encodings:Optional[_Encodings]=None,
3278+ override_encodings:Optional[_Encodings]=None):
3279 self.known_definite_encodings = list(known_definite_encodings or [])
3280 if override_encodings:
3281+ warnings.warn(
3282+ "The 'override_encodings' argument was deprecated in 4.10.0. Use 'known_definite_encodings' instead.",
3283+ DeprecationWarning,
3284+ stacklevel=3
3285+ )
3286 self.known_definite_encodings += override_encodings
3287 self.user_encodings = user_encodings or []
3288 exclude_encodings = exclude_encodings or []
3289 self.exclude_encodings = set([x.lower() for x in exclude_encodings])
3290 self.chardet_encoding = None
3291- self.is_html = is_html
3292- self.declared_encoding = None
3293+ self.is_html = False if is_html is None else is_html
3294+ self.declared_encoding: Optional[str] = None
3295
3296 # First order of business: strip a byte-order mark.
3297 self.markup, self.sniffed_encoding = self.strip_byte_order_mark(markup)
3298
3299- def _usable(self, encoding, tried):
3300+ known_definite_encodings:_Encodings
3301+ user_encodings:_Encodings
3302+ exclude_encodings:_Encodings
3303+ chardet_encoding:Optional[_Encoding]
3304+ is_html:bool
3305+ declared_encoding:Optional[_Encoding]
3306+ markup:bytes
3307+ sniffed_encoding:Optional[_Encoding]
3308+
3309+ def _usable(self, encoding:Optional[_Encoding], tried:Set[_Encoding]) -> bool:
3310 """Should we even bother to try this encoding?
3311
3312 :param encoding: Name of an encoding.
3313- :param tried: Encodings that have already been tried. This will be modified
3314- as a side effect.
3315+ :param tried: Encodings that have already been tried. This
3316+ will be modified as a side effect.
3317 """
3318- if encoding is not None:
3319- encoding = encoding.lower()
3320- if encoding in self.exclude_encodings:
3321- return False
3322- if encoding not in tried:
3323- tried.add(encoding)
3324- return True
3325+ if encoding is None:
3326+ return False
3327+ encoding = encoding.lower()
3328+ if encoding in self.exclude_encodings:
3329+ return False
3330+ if encoding not in tried:
3331+ tried.add(encoding)
3332+ return True
3333 return False
3334
3335 @property
3336- def encodings(self):
3337+ def encodings(self) -> Iterator[_Encoding]:
3338 """Yield a number of encodings that might work for this markup.
3339
3340- :yield: A sequence of strings.
3341+ :yield: A sequence of strings. Each is the name of an encoding
3342+ that *might* work to convert a bytestring into Unicode.
3343 """
3344- tried = set()
3345+ tried:Set[_Encoding] = set()
3346
3347 # First, try the known definite encodings
3348 for e in self.known_definite_encodings:
3349@@ -419,7 +501,9 @@ class EncodingDetector:
3350
3351 # Did the document originally start with a byte-order mark
3352 # that indicated its encoding?
3353- if self._usable(self.sniffed_encoding, tried):
3354+ if self.sniffed_encoding is not None and self._usable(
3355+ self.sniffed_encoding, tried
3356+ ):
3357 yield self.sniffed_encoding
3358
3359 # Sniffing the byte-order mark did nothing; try the user
3360@@ -433,14 +517,18 @@ class EncodingDetector:
3361 if self.declared_encoding is None:
3362 self.declared_encoding = self.find_declared_encoding(
3363 self.markup, self.is_html)
3364- if self._usable(self.declared_encoding, tried):
3365+ if self.declared_encoding is not None and self._usable(
3366+ self.declared_encoding, tried
3367+ ):
3368 yield self.declared_encoding
3369
3370 # Use third-party character set detection to guess at the
3371 # encoding.
3372 if self.chardet_encoding is None:
3373- self.chardet_encoding = chardet_dammit(self.markup)
3374- if self._usable(self.chardet_encoding, tried):
3375+ self.chardet_encoding = _chardet_dammit(self.markup)
3376+ if self.chardet_encoding is not None and self._usable(
3377+ self.chardet_encoding, tried
3378+ ):
3379 yield self.chardet_encoding
3380
3381 # As a last-ditch effort, try utf-8 and windows-1252.
3382@@ -449,22 +537,24 @@ class EncodingDetector:
3383 yield e
3384
3385 @classmethod
3386- def strip_byte_order_mark(cls, data):
3387+ def strip_byte_order_mark(cls, data:bytes) -> Tuple[bytes, Optional[_Encoding]]:
3388 """If a byte-order mark is present, strip it and return the encoding it implies.
3389
3390- :param data: Some markup.
3391- :return: A 2-tuple (modified data, implied encoding)
3392+ :param data: A bytestring that may or may not begin with a
3393+ byte-order mark.
3394+
3395+ :return: A 2-tuple (data stripped of byte-order mark, encoding implied by byte-order mark)
3396 """
3397 encoding = None
3398 if isinstance(data, str):
3399 # Unicode data cannot have a byte-order mark.
3400 return data, encoding
3401 if (len(data) >= 4) and (data[:2] == b'\xfe\xff') \
3402- and (data[2:4] != '\x00\x00'):
3403+ and (data[2:4] != b'\x00\x00'):
3404 encoding = 'utf-16be'
3405 data = data[2:]
3406 elif (len(data) >= 4) and (data[:2] == b'\xff\xfe') \
3407- and (data[2:4] != '\x00\x00'):
3408+ and (data[2:4] != b'\x00\x00'):
3409 encoding = 'utf-16le'
3410 data = data[2:]
3411 elif data[:3] == b'\xef\xbb\xbf':
3412@@ -479,8 +569,9 @@ class EncodingDetector:
3413 return data, encoding
3414
3415 @classmethod
3416- def find_declared_encoding(cls, markup, is_html=False, search_entire_document=False):
3417- """Given a document, tries to find its declared encoding.
3418+ def find_declared_encoding(cls, markup:Union[bytes,str], is_html:bool=False, search_entire_document:bool=False) -> Optional[_Encoding]:
3419+ """Given a document, tries to find an encoding declared within the
3420+ text of the document itself.
3421
3422 An XML encoding is declared at the beginning of the document.
3423
3424@@ -490,9 +581,12 @@ class EncodingDetector:
3425 :param markup: Some markup.
3426 :param is_html: If True, this markup is considered to be HTML. Otherwise
3427 it's assumed to be XML.
3428- :param search_entire_document: Since an encoding is supposed to declared near the beginning
3429- of the document, most of the time it's only necessary to search a few kilobytes of data.
3430- Set this to True to force this method to search the entire document.
3431+ :param search_entire_document: Since an encoding is supposed
3432+ to be declared near the beginning of the document, most of
3433+ the time it's only necessary to search a few kilobytes of
3434+ data. Set this to True to force this method to search the
3435+ entire document.
3436+ :return: The declared encoding, if one is found.
3437 """
3438 if search_entire_document:
3439 xml_endpos = html_endpos = len(markup)
3440@@ -520,74 +614,69 @@ class EncodingDetector:
3441 return None
3442
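
A sketch of the precedence order documented above, using an invented bytestring that carries both a UTF-8 byte-order mark and a conflicting <meta> declaration (the chardet-based guess, if any, depends on which detection library is installed):

====
from bs4.dammit import EncodingDetector

markup = b'\xef\xbb\xbf<html><meta charset="latin-1"><body>hi</body></html>'
detector = EncodingDetector(markup, is_html=True)

# The byte-order mark is stripped and its encoding sniffed (step 2)...
print(detector.sniffed_encoding)  # utf-8

# ...so it is yielded before the declared encoding (step 4) and the
# last-ditch fallbacks (steps 6 and 7).
print(list(detector.encodings))
# e.g. ['utf-8', 'latin-1', ..., 'windows-1252']
====
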
3443 class UnicodeDammit:
3444- """A class for detecting the encoding of a *ML document and
3445- converting it to a Unicode string. If the source encoding is
3446- windows-1252, can replace MS smart quotes with their HTML or XML
3447- equivalents."""
3448-
3449- # This dictionary maps commonly seen values for "charset" in HTML
3450- # meta tags to the corresponding Python codec names. It only covers
3451- # values that aren't in Python's aliases and can't be determined
3452- # by the heuristics in find_codec.
3453- CHARSET_ALIASES = {"macintosh": "mac-roman",
3454- "x-sjis": "shift-jis"}
3455-
3456- ENCODINGS_WITH_SMART_QUOTES = [
3457- "windows-1252",
3458- "iso-8859-1",
3459- "iso-8859-2",
3460- ]
3461+ """A class for detecting the encoding of a bytestring containing an
3462+ HTML or XML document, and decoding it to Unicode. If the source
3463+ encoding is windows-1252, `UnicodeDammit` can also replace
3464+ Microsoft smart quotes with their HTML or XML equivalents.
3465+
3466+ :param markup: HTML or XML markup in an unknown encoding.
3467+
3468+ :param known_definite_encodings: When determining the encoding
3469+ of ``markup``, these encodings will be tried first, in
3470+ order. In HTML terms, this corresponds to the "known
3471+ definite encoding" step defined in `section 13.2.3.1 of the HTML standard <https://html.spec.whatwg.org/multipage/parsing.html#parsing-with-a-known-character-encoding>`_.
3472+
3473+ :param user_encodings: These encodings will be tried after the
3474+ ``known_definite_encodings`` have been tried and failed, and
3475+ after an attempt to sniff the encoding by looking at a
3476+ byte order mark has failed. In HTML terms, this
3477+ corresponds to the step "user has explicitly instructed
3478+ the user agent to override the document's character
3479+ encoding", defined in `section 13.2.3.2 of the HTML standard <https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding>`_.
3480+
3481+ :param override_encodings: A **deprecated** alias for
3482+ ``known_definite_encodings``. Any encodings here will be tried
3483+ immediately after the encodings in
3484+ ``known_definite_encodings``.
3485+
3486+ :param smart_quotes_to: By default, Microsoft smart quotes will,
3487+ like all other characters, be converted to Unicode
3488+ characters. Setting this to ``ascii`` will convert them to ASCII
3489+ quotes instead. Setting it to ``xml`` will convert them to XML
3490+ entity references, and setting it to ``html`` will convert them
3491+ to HTML entity references.
3492+
3493+ :param is_html: If True, ``markup`` is treated as an HTML
3494+ document. Otherwise it's treated as an XML document.
3495+
3496+ :param exclude_encodings: These encodings will not be considered,
3497+ even if the sniffing code thinks they might make sense.
3498
3499- def __init__(self, markup, known_definite_encodings=[],
3500- smart_quotes_to=None, is_html=False, exclude_encodings=[],
3501- user_encodings=None, override_encodings=None
3502+ """
3503+ def __init__(
3504+ self, markup:bytes,
3505+ known_definite_encodings:Optional[_Encodings]=[],
3506+ # TODO PYTHON 3.8 Literal is added to the typing module
3507+ #
3508+ # smart_quotes_to: Literal["ascii", "xml", "html"] | None = None,
3509+ smart_quotes_to: Optional[str] = None,
3510+ is_html: bool = False,
3511+ exclude_encodings:Optional[_Encodings] = [],
3512+ user_encodings:Optional[_Encodings] = None,
3513+ override_encodings:Optional[_Encodings] = None
3514 ):
3515- """Constructor.
3516-
3517- :param markup: A bytestring representing markup in an unknown encoding.
3518-
3519- :param known_definite_encodings: When determining the encoding
3520- of `markup`, these encodings will be tried first, in
3521- order. In HTML terms, this corresponds to the "known
3522- definite encoding" step defined here:
3523- https://html.spec.whatwg.org/multipage/parsing.html#parsing-with-a-known-character-encoding
3524-
3525- :param user_encodings: These encodings will be tried after the
3526- `known_definite_encodings` have been tried and failed, and
3527- after an attempt to sniff the encoding by looking at a
3528- byte order mark has failed. In HTML terms, this
3529- corresponds to the step "user has explicitly instructed
3530- the user agent to override the document's character
3531- encoding", defined here:
3532- https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding
3533-
3534- :param override_encodings: A deprecated alias for
3535- known_definite_encodings. Any encodings here will be tried
3536- immediately after the encodings in
3537- known_definite_encodings.
3538-
3539- :param smart_quotes_to: By default, Microsoft smart quotes will, like all other characters, be converted
3540- to Unicode characters. Setting this to 'ascii' will convert them to ASCII quotes instead.
3541- Setting it to 'xml' will convert them to XML entity references, and setting it to 'html'
3542- will convert them to HTML entity references.
3543- :param is_html: If True, this markup is considered to be HTML. Otherwise
3544- it's assumed to be XML.
3545- :param exclude_encodings: These encodings will not be considered, even
3546- if the sniffing code thinks they might make sense.
3547-
3548- """
3549 self.smart_quotes_to = smart_quotes_to
3550 self.tried_encodings = []
3551 self.contains_replacement_characters = False
3552 self.is_html = is_html
3553- self.log = logging.getLogger(__name__)
3554+ self.log = getLogger(__name__)
3555 self.detector = EncodingDetector(
3556 markup, known_definite_encodings, is_html, exclude_encodings,
3557 user_encodings, override_encodings
3558 )
3559
3560 # Short-circuit if the data is in Unicode to begin with.
3561- if isinstance(markup, str) or markup == '':
3562+ if isinstance(markup, str) or markup == b'':
3563 self.markup = markup
3564 self.unicode_markup = str(markup)
3565 self.original_encoding = None
3566@@ -616,41 +705,117 @@ class UnicodeDammit:
3567 "Some characters could not be decoded, and were "
3568 "replaced with REPLACEMENT CHARACTER."
3569 )
3570+
3571 self.contains_replacement_characters = True
3572 break
3573
3574 # If none of that worked, we could at this point force it to
3575 # ASCII, but that would destroy so much data that I think
3576 # giving up is better.
3577- self.unicode_markup = u
3578- if not u:
3579+ #
3580+ # Note that this is extremely unlikely, probably impossible,
3581+ # because the "replace" strategy is so powerful. Even running
3582+ # the Python binary through Unicode, Dammit gives you Unicode,
3583+ # albeit Unicode riddled with REPLACEMENT CHARACTER.
3584+ if u is None:
3585 self.original_encoding = None
3586+ self.unicode_markup = None
3587+ else:
3588+ self.unicode_markup = u
3589+
3590+ #: The original markup, before it was converted to Unicode.
3591+ #: This is not necessarily the same as what was passed in to the
3592+ #: constructor, since any byte-order mark will be stripped.
3593+ markup:bytes
3594
3595- def _sub_ms_char(self, match):
3596+ #: The Unicode version of the markup, following conversion. This
3597+ #: is set to `None` if there was simply no way to convert the
3598+ #: bytestring to Unicode (as with binary data).
3599+ unicode_markup:Optional[str]
3600+
3601+ #: This is True if `UnicodeDammit.unicode_markup` contains
3602+ #: U+FFFD REPLACEMENT_CHARACTER characters which were not present
3603+ #: in `UnicodeDammit.markup`. These mark character sequences that
3604+ #: could not be represented in Unicode.
3605+ contains_replacement_characters: bool
3606+
3607+ #: Unicode, Dammit's best guess as to the original character
3608+ #: encoding of `UnicodeDammit.markup`.
3609+ original_encoding:Optional[_Encoding]
3610+
3611+ #: The strategy used to handle Microsoft smart quotes.
3612+ smart_quotes_to: Optional[str]
3613+
3614+ #: The (encoding, error handling strategy) 2-tuples that were used to
3615+ #: try and convert the markup to Unicode.
3616+ tried_encodings: List[Tuple[_Encoding, str]]
3617+
3618+ log: Logger #: :meta private:
3619+
3620+ def _sub_ms_char(self, match:re.Match[bytes]) -> bytes:
3621 """Changes a MS smart quote character to an XML or HTML
3622- entity, or an ASCII character."""
3623- orig = match.group(1)
3624+ entity, or an ASCII character.
3625+
3626+ TODO: Since this is only used to convert smart quotes, it
3627+ could be simplified, and MS_CHARS_TO_ASCII made much less
3628+ parochial.
3629+ """
3630+ orig: bytes = match.group(1)
3631+ sub: bytes
3632 if self.smart_quotes_to == 'ascii':
3633- sub = self.MS_CHARS_TO_ASCII.get(orig).encode()
3634+ if orig in self.MS_CHARS_TO_ASCII:
3635+ sub = self.MS_CHARS_TO_ASCII[orig].encode()
3636+ else:
3637+ # Shouldn't happen; substitute the character
3638+ # with itself.
3639+ sub = orig
3640 else:
3641- sub = self.MS_CHARS.get(orig)
3642- if type(sub) == tuple:
3643- if self.smart_quotes_to == 'xml':
3644- sub = '&#x'.encode() + sub[1].encode() + ';'.encode()
3645+ if orig in self.MS_CHARS:
3646+ substitutions = self.MS_CHARS[orig]
3647+ if type(substitutions) == tuple:
3648+ if self.smart_quotes_to == 'xml':
3649+ sub = b'&#x' + substitutions[1].encode() + b';'
3650+ else:
3651+ sub = b'&' + substitutions[0].encode() + b';'
3652 else:
3653- sub = '&'.encode() + sub[0].encode() + ';'.encode()
3654+ substitutions = cast(str, substitutions)
3655+ sub = substitutions.encode()
3656 else:
3657- sub = sub.encode()
3658+ # Shouldn't happen; substitute the character
3659+ # for itself.
3660+ sub = orig
3661 return sub
3662+
3663+ #: This dictionary maps commonly seen values for "charset" in HTML
3664+ #: meta tags to the corresponding Python codec names. It only covers
3665+ #: values that aren't in Python's aliases and can't be determined
3666+ #: by the heuristics in `find_codec`.
3667+ #:
3668+ #: :meta hide-value:
3669+ CHARSET_ALIASES: Dict[str, _Encoding] = {"macintosh": "mac-roman",
3670+ "x-sjis": "shift-jis"}
3671+
3672+ #: A list of encodings that tend to contain Microsoft smart quotes.
3673+ #:
3674+ #: :meta hide-value:
3675+ ENCODINGS_WITH_SMART_QUOTES: _Encodings = [
3676+ "windows-1252",
3677+ "iso-8859-1",
3678+ "iso-8859-2",
3679+ ]
3680
3681- def _convert_from(self, proposed, errors="strict"):
3682+ def _convert_from(self, proposed:_Encoding, errors:str="strict") -> Optional[str]:
3683 """Attempt to convert the markup to the proposed encoding.
3684
3685 :param proposed: The name of a character encoding.
3686+ :param errors: An error handling strategy, used when calling `str`.
3687+ :return: The converted markup, or `None` if the proposed
3688+ encoding/error handling strategy didn't work.
3689 """
3690- proposed = self.find_codec(proposed)
3691- if not proposed or (proposed, errors) in self.tried_encodings:
3692+ lookup_result = self.find_codec(proposed)
3693+ if lookup_result is None or (lookup_result, errors) in self.tried_encodings:
3694 return None
3695+ proposed = lookup_result
3696 self.tried_encodings.append((proposed, errors))
3697 markup = self.markup
3698 # Convert smart quotes to HTML if coming from an encoding
3699@@ -665,36 +830,37 @@ class UnicodeDammit:
3700 #print("Trying to convert document to %s (errors=%s)" % (
3701 # proposed, errors))
3702 u = self._to_unicode(markup, proposed, errors)
3703- self.markup = u
3704+ self.unicode_markup = u
3705 self.original_encoding = proposed
3706 except Exception as e:
3707 #print("That didn't work!")
3708 #print(e)
3709 return None
3710 #print("Correct encoding: %s" % proposed)
3711- return self.markup
3712+ return self.unicode_markup
3713
3714- def _to_unicode(self, data, encoding, errors="strict"):
3715- """Given a string and its encoding, decodes the string into Unicode.
3716+ def _to_unicode(self, data:bytes, encoding:_Encoding, errors:str="strict") -> str:
3717+ """Given a bytestring and its encoding, decodes the string into Unicode.
3718
3719 :param encoding: The name of an encoding.
3720+ :param errors: An error handling strategy, used when calling `str`.
3721 """
3722 return str(data, encoding, errors)
3723
3724 @property
3725- def declared_html_encoding(self):
3726- """If the markup is an HTML document, returns the encoding declared _within_
3727- the document.
3728+ def declared_html_encoding(self) -> Optional[str]:
3729+ """If the markup is an HTML document, returns the encoding, if any,
3730+ declared *inside* the document.
3731 """
3732 if not self.is_html:
3733 return None
3734 return self.detector.declared_encoding
3735
3736- def find_codec(self, charset):
3737- """Convert the name of a character set to a codec name.
3738+ def find_codec(self, charset:_Encoding) -> Optional[str]:
3739+ """Look up the Python codec corresponding to a given character set.
3740
3741 :param charset: The name of a character set.
3742- :return: The name of a codec.
3743+ :return: The name of a Python codec.
3744 """
3745 value = (self._codec(self.CHARSET_ALIASES.get(charset, charset))
3746 or (charset and self._codec(charset.replace("-", "")))
3747@@ -706,7 +872,7 @@ class UnicodeDammit:
3748 return value.lower()
3749 return None
3750
3751- def _codec(self, charset):
3752+ def _codec(self, charset:_Encoding) -> Optional[str]:
3753 if not charset:
3754 return charset
3755 codec = None
3756@@ -718,8 +884,11 @@ class UnicodeDammit:
3757 return codec
3758
3759
3760- # A partial mapping of ISO-Latin-1 to HTML entities/XML numeric entities.
3761- MS_CHARS = {b'\x80': ('euro', '20AC'),
3762+ #: A partial mapping of ISO-Latin-1 to HTML entities/XML numeric entities.
3763+ #:
3764+ #: :meta hide-value:
3765+ MS_CHARS: Dict[bytes, Union[str, Tuple[str, str]]] = {
3766+ b'\x80': ('euro', '20AC'),
3767 b'\x81': ' ',
3768 b'\x82': ('sbquo', '201A'),
3769 b'\x83': ('fnof', '192'),
3770@@ -752,10 +921,15 @@ class UnicodeDammit:
3771 b'\x9e': ('#x17E', '17E'),
3772 b'\x9f': ('Yuml', ''),}
3773
3774- # A parochial partial mapping of ISO-Latin-1 to ASCII. Contains
3775- # horrors like stripping diacritical marks to turn á into a, but also
3776- # contains non-horrors like turning “ into ".
3777- MS_CHARS_TO_ASCII = {
3778+ #: A parochial partial mapping of ISO-Latin-1 to ASCII. Contains
3779+ #: horrors like stripping diacritical marks to turn á into a, but also
3780+ #: contains non-horrors like turning “ into ".
3781+ #:
3782+ #: Seriously, don't use this for anything other than removing smart
3783+ #: quotes.
3784+ #:
3785+ #: :meta private:
3786+ MS_CHARS_TO_ASCII: Dict[bytes, str] = {
3787 b'\x80' : 'EUR',
3788 b'\x81' : ' ',
3789 b'\x82' : ',',
3790@@ -809,7 +983,7 @@ class UnicodeDammit:
3791 b'\xb1' : '+-',
3792 b'\xb2' : '2',
3793 b'\xb3' : '3',
3794- b'\xb4' : ("'", 'acute'),
3795+ b'\xb4' : "'",
3796 b'\xb5' : 'u',
3797 b'\xb6' : 'P',
3798 b'\xb7' : '*',
3799@@ -887,12 +1061,14 @@ class UnicodeDammit:
3800 b'\xff' : 'y',
3801 }
3802
3803- # A map used when removing rogue Windows-1252/ISO-8859-1
3804- # characters in otherwise UTF-8 documents.
3805- #
3806- # Note that \x81, \x8d, \x8f, \x90, and \x9d are undefined in
3807- # Windows-1252.
3808- WINDOWS_1252_TO_UTF8 = {
3809+ #: A map used when removing rogue Windows-1252/ISO-8859-1
3810+ #: characters in otherwise UTF-8 documents.
3811+ #:
3812+ #: Note that \\x81, \\x8d, \\x8f, \\x90, and \\x9d are undefined in
3813+ #: Windows-1252.
3814+ #:
3815+ #: :meta hide-value:
3816+ WINDOWS_1252_TO_UTF8: Dict[int, bytes] = {
3817 0x80 : b'\xe2\x82\xac', # €
3818 0x82 : b'\xe2\x80\x9a', # ‚
3819 0x83 : b'\xc6\x92', # Æ’
3820@@ -1017,33 +1193,37 @@ class UnicodeDammit:
3821 0xfe : b'\xc3\xbe', # þ
3822 }
3823
3824- MULTIBYTE_MARKERS_AND_SIZES = [
3825+ #: :meta private:
3826+ MULTIBYTE_MARKERS_AND_SIZES:List[Tuple[int, int, int]] = [
3827 (0xc2, 0xdf, 2), # 2-byte characters start with a byte C2-DF
3828 (0xe0, 0xef, 3), # 3-byte characters start with E0-EF
3829 (0xf0, 0xf4, 4), # 4-byte characters start with F0-F4
3830 ]
3831
3832- FIRST_MULTIBYTE_MARKER = MULTIBYTE_MARKERS_AND_SIZES[0][0]
3833- LAST_MULTIBYTE_MARKER = MULTIBYTE_MARKERS_AND_SIZES[-1][1]
3834+ #: :meta private:
3835+ FIRST_MULTIBYTE_MARKER:int = MULTIBYTE_MARKERS_AND_SIZES[0][0]
3836+
3837+ #: :meta private:
3838+ LAST_MULTIBYTE_MARKER:int = MULTIBYTE_MARKERS_AND_SIZES[-1][1]
3839
3840 @classmethod
3841- def detwingle(cls, in_bytes, main_encoding="utf8",
3842- embedded_encoding="windows-1252"):
3843+ def detwingle(cls, in_bytes:bytes, main_encoding:_Encoding="utf8",
3844+ embedded_encoding:_Encoding="windows-1252") -> bytes:
3845 """Fix characters from one encoding embedded in some other encoding.
3846
3847 Currently the only situation supported is Windows-1252 (or its
3848 subset ISO-8859-1), embedded in UTF-8.
3849
3850 :param in_bytes: A bytestring that you suspect contains
3851- characters from multiple encodings. Note that this _must_
3852+ characters from multiple encodings. Note that this *must*
3853 be a bytestring. If you've already converted the document
3854 to Unicode, you're too late.
3855- :param main_encoding: The primary encoding of `in_bytes`.
3856+ :param main_encoding: The primary encoding of ``in_bytes``.
3857 :param embedded_encoding: The encoding that was used to embed characters
3858 in the main document.
3859- :return: A bytestring in which `embedded_encoding`
3860- characters have been converted to their `main_encoding`
3861- equivalents.
3862+ :return: A bytestring similar to ``in_bytes``, in which
3863+ ``embedded_encoding`` characters have been converted to
3864+ their ``main_encoding`` equivalents.
3865 """
3866 if embedded_encoding.replace('_', '-').lower() not in (
3867 'windows-1252', 'windows_1252'):
3868@@ -1061,9 +1241,6 @@ class UnicodeDammit:
3869 pos = 0
3870 while pos < len(in_bytes):
3871 byte = in_bytes[pos]
3872- if not isinstance(byte, int):
3873- # Python 2.x
3874- byte = ord(byte)
3875 if (byte >= cls.FIRST_MULTIBYTE_MARKER
3876 and byte <= cls.LAST_MULTIBYTE_MARKER):
3877 # This is the start of a UTF-8 multibyte character. Skip
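
To round out the dammit.py changes, a sketch of the UnicodeDammit constructor arguments and detwingle() as now documented (byte values invented for illustration):

====
from bs4.dammit import UnicodeDammit

# Decode Windows-1252 bytes, turning Microsoft smart quotes into
# HTML entity references via smart_quotes_to='html'.
dammit = UnicodeDammit(b'\x93Hello\x94', smart_quotes_to='html',
                       known_definite_encodings=['windows-1252'])
print(dammit.unicode_markup)     # &ldquo;Hello&rdquo;
print(dammit.original_encoding)  # windows-1252

# detwingle() fixes Windows-1252 bytes embedded in otherwise UTF-8 data.
data = 'caf\u00e9 '.encode('utf8') + b'\x93quoted\x94'
print(UnicodeDammit.detwingle(data).decode('utf8'))  # café “quoted”
====
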
3878diff --git a/bs4/diagnose.py b/bs4/diagnose.py
3879index e079772..201b879 100644
3880--- a/bs4/diagnose.py
3881+++ b/bs4/diagnose.py
3882@@ -7,8 +7,11 @@ import cProfile
3883 from io import BytesIO
3884 from html.parser import HTMLParser
3885 import bs4
3886-from bs4 import BeautifulSoup, __version__
3887+from bs4 import BeautifulSoup, __version__
3888 from bs4.builder import builder_registry
3889+from typing import TYPE_CHECKING
3890+if TYPE_CHECKING:
3891+ from bs4._typing import _IncomingMarkup
3892
3893 import os
3894 import pstats
3895@@ -19,10 +22,10 @@ import traceback
3896 import sys
3897 import cProfile
3898
3899-def diagnose(data):
3900+def diagnose(data:_IncomingMarkup) -> None:
3901 """Diagnostic suite for isolating common problems.
3902
3903- :param data: A string containing markup that needs to be explained.
3904+ :param data: Some markup that needs to be explained.
3905 :return: None; diagnostics are printed to standard output.
3906 """
3907 print(("Diagnostic running on Beautiful Soup %s" % __version__))
3908@@ -75,7 +78,7 @@ def diagnose(data):
3909
3910 print(("-" * 80))
3911
3912-def lxml_trace(data, html=True, **kwargs):
3913+def lxml_trace(data, html:bool=True, **kwargs) -> None:
3914 """Print out the lxml events that occur during parsing.
3915
3916 This lets you see how lxml parses a document when no Beautiful
3917@@ -109,7 +112,7 @@ class AnnouncingParser(HTMLParser):
3918 print(s)
3919
3920 def handle_starttag(self, name, attrs):
3921- self._p("%s START" % name)
3922+ self._p(f"{name} {attrs} START")
3923
3924 def handle_endtag(self, name):
3925 self._p("%s END" % name)
3926@@ -146,11 +149,14 @@ def htmlparser_trace(data):
3927 parser = AnnouncingParser()
3928 parser.feed(data)
3929
3930-_vowels = "aeiou"
3931-_consonants = "bcdfghjklmnpqrstvwxyz"
3932+_vowels:str = "aeiou"
3933+_consonants:str = "bcdfghjklmnpqrstvwxyz"
3934
3935-def rword(length=5):
3936- "Generate a random word-like string."
3937+def rword(length:int=5) -> str:
3938+ """Generate a random word-like string.
3939+
3940+ :meta private:
3941+ """
3942 s = ''
3943 for i in range(length):
3944 if i % 2 == 0:
3945@@ -160,12 +166,18 @@ def rword(length=5):
3946 s += random.choice(t)
3947 return s
3948
3949-def rsentence(length=4):
3950- "Generate a random sentence-like string."
3951+def rsentence(length:int=4) -> str:
3952+ """Generate a random sentence-like string.
3953+
3954+ :meta private:
3955+ """
3956 return " ".join(rword(random.randint(4,9)) for i in range(length))
3957
3958-def rdoc(num_elements=1000):
3959- """Randomly generate an invalid HTML document."""
3960+def rdoc(num_elements:int=1000) -> str:
3961+ """Randomly generate an invalid HTML document.
3962+
3963+ :meta private:
3964+ """
3965 tag_names = ['p', 'div', 'span', 'i', 'b', 'script', 'table']
3966 elements = []
3967 for i in range(num_elements):
3968@@ -182,24 +194,24 @@ def rdoc(num_elements=1000):
3969 elements.append("</%s>" % tag_name)
3970 return "<html>" + "\n".join(elements) + "</html>"
3971
3972-def benchmark_parsers(num_elements=100000):
3973+def benchmark_parsers(num_elements:int=100000) -> None:
3974 """Very basic head-to-head performance benchmark."""
3975 print(("Comparative parser benchmark on Beautiful Soup %s" % __version__))
3976 data = rdoc(num_elements)
3977 print(("Generated a large invalid HTML document (%d bytes)." % len(data)))
3978
3979- for parser in ["lxml", ["lxml", "html"], "html5lib", "html.parser"]:
3980+ for parser_name in ["lxml", ["lxml", "html"], "html5lib", "html.parser"]:
3981 success = False
3982 try:
3983 a = time.time()
3984- soup = BeautifulSoup(data, parser)
3985+ soup = BeautifulSoup(data, parser_name)
3986 b = time.time()
3987 success = True
3988 except Exception as e:
3989- print(("%s could not parse the markup." % parser))
3990+ print(("%s could not parse the markup." % parser_name))
3991 traceback.print_exc()
3992 if success:
3993- print(("BS4+%s parsed the markup in %.2fs." % (parser, b-a)))
3994+ print(("BS4+%s parsed the markup in %.2fs." % (parser_name, b-a)))
3995
3996 from lxml import etree
3997 a = time.time()
3998@@ -214,7 +226,7 @@ def benchmark_parsers(num_elements=100000):
3999 b = time.time()
4000 print(("Raw html5lib parsed the markup in %.2fs." % (b-a)))
4001
4002-def profile(num_elements=100000, parser="lxml"):
4003+def profile(num_elements:int=100000, parser:str="lxml") -> None:
4004 """Use Python's profiler on a randomly generated document."""
4005 filehandle = tempfile.NamedTemporaryFile()
4006 filename = filehandle.name
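
For completeness, the newly annotated diagnose() entry point is used like this (all output goes to standard output):

====
from bs4.diagnose import diagnose

# Parses the markup with every installed tree builder and prints
# what each one produces, along with version information.
diagnose('<p>Is this <b>broken</i>?</p>')
====
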
4007diff --git a/bs4/element.py b/bs4/element.py
4008index 0aefe73..8b3774e 100644
4009--- a/bs4/element.py
4010+++ b/bs4/element.py
4011@@ -1,55 +1,102 @@
4012+from __future__ import annotations
4013 # Use of this source code is governed by the MIT license.
4014 __license__ = "MIT"
4015
4016-try:
4017- from collections.abc import Callable # Python 3.6
4018-except ImportError as e:
4019- from collections import Callable
4020 import re
4021 import sys
4022 import warnings
4023
4024 from bs4.css import CSS
4025+from bs4._deprecation import (
4026+ _deprecated,
4027+ _deprecated_alias,
4028+ _deprecated_function_alias,
4029+)
4030 from bs4.formatter import (
4031 Formatter,
4032 HTMLFormatter,
4033 XMLFormatter,
4034 )
4035
4036-DEFAULT_OUTPUT_ENCODING = "utf-8"
4037-
4038-nonwhitespace_re = re.compile(r"\S+")
4039-
4040-# NOTE: This isn't used as of 4.7.0. I'm leaving it for a little bit on
4041-# the off chance someone imported it for their own use.
4042-whitespace_re = re.compile(r"\s+")
4043+from typing import (
4044+ Any,
4045+ Callable,
4046+ Dict,
4047+ Generator,
4048+ Generic,
4049+ Iterable,
4050+ Iterator,
4051+ List,
4052+ Mapping,
4053+ Optional,
4054+ Pattern,
4055+ Sequence,
4056+ Set,
4057+ TYPE_CHECKING,
4058+ Tuple,
4059+ Type,
4060+ TypeVar,
4061+ Union,
4062+ cast,
4063+)
4064+from typing_extensions import Self
4065+if TYPE_CHECKING:
4066+ from bs4 import BeautifulSoup
4067+ from bs4.builder import TreeBuilder
4068+ from bs4.dammit import _Encoding
4069+ from bs4.formatter import (
4070+ _EntitySubstitutionFunction,
4071+ _FormatterOrName,
4072+ )
4073+ from bs4._typing import (
4074+ _AttributeValue,
4075+ _AttributeValues,
4076+ _StrainableElement,
4077+ _StrainableAttribute,
4078+ _StrainableAttributes,
4079+ _StrainableString,
4080+ )
4081+
4082+# Deprecated module-level attributes.
4083+# See https://peps.python.org/pep-0562/
4084+_deprecated_names = dict(
4085+ whitespace_re = 'The {name} attribute was deprecated in version 4.7.0. If you need it, make your own copy.'
4086+)
4087+#: :meta private:
4088+_deprecated_whitespace_re: Pattern[str] = re.compile(r"\s+")
4089
4090-def _alias(attr):
4091- """Alias one attribute name to another for backward compatibility"""
4092- @property
4093- def alias(self):
4094- return getattr(self, attr)
4095-
4096- @alias.setter
4097- def alias(self):
4098- return setattr(self, attr)
4099- return alias
4100-
4101-
4102-# These encodings are recognized by Python (so PageElement.encode
4103-# could theoretically support them) but XML and HTML don't recognize
4104-# them (so they should not show up in an XML or HTML document as that
4105-# document's encoding).
4106-#
4107-# If an XML document is encoded in one of these encodings, no encoding
4108-# will be mentioned in the XML declaration. If an HTML document is
4109-# encoded in one of these encodings, and the HTML document has a
4110-# <meta> tag that mentions an encoding, the encoding will be given as
4111-# the empty string.
4112-#
4113-# Source:
4114-# https://docs.python.org/3/library/codecs.html#python-specific-encodings
4115-PYTHON_SPECIFIC_ENCODINGS = set([
4116+def __getattr__(name):
4117+ if name in _deprecated_names:
4118+ message = _deprecated_names[name]
4119+ warnings.warn(
4120+ message.format(name=name),
4121+ DeprecationWarning, stacklevel=2
4122+ )
4123+
4124+ return globals()[f"_deprecated_{name}"]
4125+ raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
4126+
4127+#: Documents output by Beautiful Soup will be encoded with
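
A sketch of what the PEP 562 __getattr__ hook above does for callers: the deprecated name still resolves, but access now emits a DeprecationWarning.

====
import warnings

import bs4.element

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    pattern = bs4.element.whitespace_re  # deprecated module-level name

print(caught[0].category is DeprecationWarning)  # True
print(pattern.pattern)                           # \s+
====
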
4128+#: this encoding unless you specify otherwise.
4129+DEFAULT_OUTPUT_ENCODING:str = "utf-8"
4130+
4131+#: A regular expression that can be used to split on whitespace.
4132+nonwhitespace_re: Pattern[str] = re.compile(r"\S+")
4133+
4134+#: These encodings are recognized by Python (so `Tag.encode`
4135+#: could theoretically support them) but XML and HTML don't recognize
4136+#: them (so they should not show up in an XML or HTML document as that
4137+#: document's encoding).
4138+#:
4139+#: If an XML document is encoded in one of these encodings, no encoding
4140+#: will be mentioned in the XML declaration. If an HTML document is
4141+#: encoded in one of these encodings, and the HTML document has a
4142+#: <meta> tag that mentions an encoding, the encoding will be given as
4143+#: the empty string.
4144+#:
4145+#: Source:
4146+#: Python documentation, `Python Specific Encodings <https://docs.python.org/3/library/codecs.html#python-specific-encodings>`_
4147+PYTHON_SPECIFIC_ENCODINGS: Set[_Encoding] = set([
4148 "idna",
4149 "mbcs",
4150 "oem",
4151@@ -66,11 +113,17 @@ PYTHON_SPECIFIC_ENCODINGS = set([
4152
4153
4154 class NamespacedAttribute(str):
4155- """A namespaced string (e.g. 'xml:lang') that remembers the namespace
4156- ('xml') and the name ('lang') that were used to create it.
4157+ """A namespaced attribute (e.g. the 'xml:lang' in 'xml:lang="en"')
4158+ which remembers the namespace prefix ('xml') and the name ('lang')
4159+ that were used to create it.
4160 """
4161
4162- def __new__(cls, prefix, name=None, namespace=None):
4163+ prefix: Optional[str]
4164+ name: Optional[str]
4165+ namespace: Optional[str]
4166+
4167+ def __new__(cls, prefix:Optional[str],
4168+ name:Optional[str]=None, namespace:Optional[str]=None):
4169 if not name:
4170 # This is the default namespace. Its name "has no value"
4171 # per https://www.w3.org/TR/xml-names/#defaulting
4172@@ -89,72 +142,126 @@ class NamespacedAttribute(str):
4173 return obj
4174
4175 class AttributeValueWithCharsetSubstitution(str):
4176- """A stand-in object for a character encoding specified in HTML."""
4177+ """An abstract class standing in for a character encoding specified
4178+ inside an HTML ``<meta>`` tag.
4179
4180-class CharsetMetaAttributeValue(AttributeValueWithCharsetSubstitution):
4181- """A generic stand-in for the value of a meta tag's 'charset' attribute.
4182+ Subclasses exist for each place such a character encoding might be
4183+ found: either inside the ``charset`` attribute
4184+ (`CharsetMetaAttributeValue`) or inside the ``content`` attribute
4185+ (`ContentMetaAttributeValue`).
4186
4187- When Beautiful Soup parses the markup '<meta charset="utf8">', the
4188- value of the 'charset' attribute will be one of these objects.
4189+ This allows Beautiful Soup to replace that part of the HTML file
4190+ with a different encoding when outputting a tree as a string.
4191 """
4192+ #: The original, un-encoded value of the ``charset`` or ``content`` attribute.
4193+ #: :meta private:
4194+ original_value: str
4195+
4196+ def substitute_encoding(self, eventual_encoding:str) -> str:
4197+ """Do whatever's necessary in this implementation-specific
4198+ portion of an HTML document to substitute in a specific encoding.
4199+ """
4200+ raise NotImplementedError()
4201+
4202+class CharsetMetaAttributeValue(AttributeValueWithCharsetSubstitution):
4203+ """A generic stand-in for the value of a ``<meta>`` tag's ``charset``
4204+ attribute.
4205+
4206+ When Beautiful Soup parses the markup ``<meta charset="utf8">``, the
4207+ value of the ``charset`` attribute will become one of these objects.
4208
4209- def __new__(cls, original_value):
4210+ If the document is later encoded to an encoding other than UTF-8, its
4211+ ``<meta>`` tag will mention the new encoding instead of ``utf8``.
4212+ """
4213+ def __new__(cls, original_value:str) -> Self:
4214+ # We don't need to use the original value for anything, but
4215+ # it might be useful for the user to know.
4216 obj = str.__new__(cls, original_value)
4217 obj.original_value = original_value
4218 return obj
4219-
4220- def encode(self, encoding):
4221+
4222+ def substitute_encoding(self, eventual_encoding:_Encoding="utf-8") -> str:
4223 """When an HTML document is being encoded to a given encoding, the
4224- value of a meta tag's 'charset' is the name of the encoding.
4225+ value of a ``<meta>`` tag's ``charset`` becomes the name of
4226+ the encoding.
4227 """
4228- if encoding in PYTHON_SPECIFIC_ENCODINGS:
4229+ if eventual_encoding in PYTHON_SPECIFIC_ENCODINGS:
4230 return ''
4231- return encoding
4232+ return eventual_encoding
4233
4234
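
The user-visible effect of this class, exercised through the public `Tag.encode` (a minimal sketch; exact void-element serialization can vary by parser):

====
from bs4 import BeautifulSoup

soup = BeautifulSoup('<meta charset="utf8">', 'html.parser')
print(soup.meta['charset'])    # utf8 (a CharsetMetaAttributeValue)
# Encoding the tree to another encoding rewrites the charset for us:
print(soup.encode('latin-1'))  # b'<meta charset="latin-1"/>'
====
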
4235 class ContentMetaAttributeValue(AttributeValueWithCharsetSubstitution):
4236- """A generic stand-in for the value of a meta tag's 'content' attribute.
4237+ """A generic stand-in for the value of a ``<meta>`` tag's ``content``
4238+ attribute.
4239
4240 When Beautiful Soup parses the markup:
4241- <meta http-equiv="content-type" content="text/html; charset=utf8">
4242+ ``<meta http-equiv="content-type" content="text/html; charset=utf8">``
4243
4244- The value of the 'content' attribute will be one of these objects.
4245- """
4246-
4247- CHARSET_RE = re.compile(r"((^|;)\s*charset=)([^;]*)", re.M)
4248+ The value of the ``content`` attribute will become one of these objects.
4249
4250- def __new__(cls, original_value):
4251+ If the document is later encoded to an encoding other than UTF-8, its
4252+ ``<meta>`` tag will mention the new encoding instead of ``utf8``.
4253+ """
4254+ #: Match the 'charset' argument inside the 'content' attribute
4255+ #: of a <meta> tag.
4256+ #: :meta private:
4257+ CHARSET_RE: Pattern[str] = re.compile(
4258+ r"((^|;)\s*charset=)([^;]*)", re.M
4259+ )
4260+
4261+ def __new__(cls, original_value:str) -> Self:
4262 match = cls.CHARSET_RE.search(original_value)
4263- if match is None:
4264- # No substitution necessary.
4265- return str.__new__(str, original_value)
4266-
4267 obj = str.__new__(cls, original_value)
4268 obj.original_value = original_value
4269 return obj
4270
4271- def encode(self, encoding):
4272- if encoding in PYTHON_SPECIFIC_ENCODINGS:
4273- return ''
4274+ def substitute_encoding(self, eventual_encoding:_Encoding="utf-8") -> str:
4275+ """When an HTML document is being encoded to a given encoding, the
4276+ value of the ``charset=`` in a ``<meta>`` tag's ``content`` becomes
4277+ the name of the encoding.
4278+ """
4279+ if eventual_encoding in PYTHON_SPECIFIC_ENCODINGS:
4280+ return self.CHARSET_RE.sub('', self.original_value)
4281 def rewrite(match):
4282- return match.group(1) + encoding
4283+ return match.group(1) + eventual_encoding
4284 return self.CHARSET_RE.sub(rewrite, self.original_value)
4285
4286
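
And the ``content`` variant, which rewrites only the charset= portion of the attribute value (a sketch; note that bs4 re-sorts attributes alphabetically on output):

====
from bs4 import BeautifulSoup

markup = '<meta http-equiv="content-type" content="text/html; charset=utf8">'
soup = BeautifulSoup(markup, 'html.parser')
print(soup.encode('iso-8859-1'))
# b'<meta content="text/html; charset=iso-8859-1" http-equiv="content-type"/>'
====
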
4287 class PageElement(object):
4288- """Contains the navigational information for some part of the page:
4289- that is, its current location in the parse tree.
4290+ """An abstract class representing a single element in the parse tree.
4291
4292- NavigableString, Tag, etc. are all subclasses of PageElement.
4293+ `NavigableString`, `Tag`, etc. are all subclasses of
4294+ `PageElement`. For this reason you'll see a lot of methods that
4295+ return `PageElement`, but you'll never see an actual `PageElement`
4296+ object. For the most part you can think of `PageElement` as
4297+ meaning "a `Tag` or a `NavigableString`."
4298 """
4299
4300- # In general, we can't tell just by looking at an element whether
4301- # it's contained in an XML document or an HTML document. But for
4302- # Tags (q.v.) we can store this information at parse time.
4303- known_xml = None
4304-
4305- def setup(self, parent=None, previous_element=None, next_element=None,
4306- previous_sibling=None, next_sibling=None):
4307+ #: In general, we can't tell just by looking at an element whether
4308+ #: it's contained in an XML document or an HTML document. But for
4309+ #: `Tag` objects (q.v.) we can store this information at parse time.
4310+ #: :meta private:
4311+ known_xml: Optional[bool] = None
4312+
4313+ #: Whether or not this element has been decomposed from the tree
4314+ #: it was created in.
4315+ _decomposed: bool
4316+
4317+ parent: Optional[Tag]
4318+ next_element: Optional[PageElement]
4319+ previous_element: Optional[PageElement]
4320+ next_sibling: Optional[PageElement]
4321+ previous_sibling: Optional[PageElement]
4322+
4323+ #: Whether or not this element is hidden from generated output.
4324+ #: Only the `BeautifulSoup` object itself is hidden.
4325+ hidden: bool=False
4326+
4327+ def setup(self, parent:Optional[Tag]=None,
4328+ previous_element:Optional[PageElement]=None,
4329+ next_element:Optional[PageElement]=None,
4330+ previous_sibling:Optional[PageElement]=None,
4331+ next_sibling:Optional[PageElement]=None) -> None:
4332 """Sets up the initial relations between this element and
4333 other elements.
4334
4335@@ -175,7 +282,7 @@ class PageElement(object):
4336 self.parent = parent
4337
4338 self.previous_element = previous_element
4339- if previous_element is not None:
4340+ if self.previous_element is not None:
4341 self.previous_element.next_element = self
4342
4343 self.next_element = next_element
4344@@ -191,10 +298,10 @@ class PageElement(object):
4345 previous_sibling = self.parent.contents[-1]
4346
4347 self.previous_sibling = previous_sibling
4348- if previous_sibling is not None:
4349+ if self.previous_sibling is not None:
4350 self.previous_sibling.next_sibling = self
4351
4352- def format_string(self, s, formatter):
4353+ def format_string(self, s:str, formatter:Optional[_FormatterOrName]) -> str:
4354 """Format the given string using the given formatter.
4355
4356 :param s: A string.
4357@@ -207,28 +314,35 @@ class PageElement(object):
4358 output = formatter.substitute(s)
4359 return output
4360
4361- def formatter_for_name(self, formatter):
4362+ def formatter_for_name(
4363+ self,
4364+ formatter_name:Union[_FormatterOrName, _EntitySubstitutionFunction]
4365+ ) -> Formatter:
4366 """Look up or create a Formatter for the given identifier,
4367 if necessary.
4368
4369- :param formatter: Can be a Formatter object (used as-is), a
4370+ :param formatter_name: Can be a `Formatter` object (used as-is), a
4371 function (used as the entity substitution hook for an
4372- XMLFormatter or HTMLFormatter), or a string (used to look
4373- up an XMLFormatter or HTMLFormatter in the appropriate
4374+ `XMLFormatter` or `HTMLFormatter`), or a string (used to look
4375+ up an `XMLFormatter` or `HTMLFormatter` in the appropriate
4376 registry).
4377 """
4378- if isinstance(formatter, Formatter):
4379- return formatter
4380+ if isinstance(formatter_name, Formatter):
4381+ return formatter_name
4382+ c: type[Formatter]
4383+ registry: Mapping[Optional[str], Formatter]
4384 if self._is_xml:
4385 c = XMLFormatter
4386+ registry = XMLFormatter.REGISTRY
4387 else:
4388 c = HTMLFormatter
4389- if isinstance(formatter, Callable):
4390- return c(entity_substitution=formatter)
4391- return c.REGISTRY[formatter]
4392+ registry = HTMLFormatter.REGISTRY
4393+ if callable(formatter_name):
4394+ return c(entity_substitution=formatter_name)
4395+ return registry[formatter_name]
4396
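
A sketch of the three identifier kinds this method accepts, exercised through the public `Tag.decode`:

====
from bs4 import BeautifulSoup
from bs4.formatter import HTMLFormatter

soup = BeautifulSoup("<p>see &amp; believe</p>", "html.parser")
p = soup.p
print(p.decode(formatter="minimal"))            # registry key: <p>see &amp; believe</p>
print(p.decode(formatter=HTMLFormatter()))      # Formatter object, used as-is
print(p.decode(formatter=lambda s: s.upper()))  # callable hook: <p>SEE & BELIEVE</p>
====
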
4397 @property
4398- def _is_xml(self):
4399+ def _is_xml(self) -> bool:
4400 """Is this element part of an XML tree or an HTML tree?
4401
4402 This is used in formatter_for_name, when deciding whether an
4403@@ -250,28 +364,41 @@ class PageElement(object):
4404 return getattr(self, 'is_xml', False)
4405 return self.parent._is_xml
4406
4407- nextSibling = _alias("next_sibling") # BS3
4408- previousSibling = _alias("previous_sibling") # BS3
4409+ nextSibling = _deprecated_alias("nextSibling", "next_sibling", "4.0.0")
4410+ previousSibling = _deprecated_alias(
4411+ "previousSibling", "previous_sibling", "4.0.0"
4412+ )
4413
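
Assuming the new `_deprecated_alias` machinery warns on access (which is its purpose), the old BS3 spellings keep working but become noisy:

====
import warnings
from bs4 import BeautifulSoup

soup = BeautifulSoup("<a>one</a><b>two</b>", "html.parser")
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    sibling = soup.a.nextSibling        # old spelling still resolves...
print(sibling)                          # <b>two</b>
print(caught[0].category.__name__)      # DeprecationWarning (assumed)
====
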
4414- default = object()
4415- def _all_strings(self, strip=False, types=default):
4416+ def __deepcopy__(self, memo:Dict, recursive:bool=False) -> Self:
4417+ raise NotImplementedError()
4418+
4419+ def __copy__(self) -> Self:
4420+ """A copy of a PageElement can only be a deep copy, because
4421+ only one PageElement can occupy a given place in a parse tree.
4422+ """
4423+ return self.__deepcopy__({})
4424+
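
What that contract means in practice:

====
import copy
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>bold</b></p>", "html.parser")
b = soup.b
b_copy = copy.copy(b)   # silently a deep copy
print(b_copy == b)      # True: same name, attributes, and contents
print(b_copy is b)      # False: a distinct object
print(b_copy.parent)    # None: the copy belongs to no tree
====
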
4425+ default: Iterable[type[NavigableString]] = tuple() #: :meta private:
4426+ def _all_strings(self, strip:bool=False, types:Iterable[type[NavigableString]]=default) -> Iterator[str]:
4427 """Yield all strings of certain classes, possibly stripping them.
4428
4429- This is implemented differently in Tag and NavigableString.
4430+ This is implemented differently in `Tag` and `NavigableString`.
4431 """
4432 raise NotImplementedError()
4433
4434 @property
4435- def stripped_strings(self):
4436- """Yield all strings in this PageElement, stripping them first.
4437+ def stripped_strings(self) -> Iterator[str]:
4438+ """Yield all interesting strings in this PageElement, stripping them
4439+ first.
4440
4441- :yield: A sequence of stripped strings.
4442+ See `Tag` for information on which strings are considered
4443+ interesting in a given context.
4444 """
4445 for string in self._all_strings(True):
4446 yield string
4447
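
A minimal illustration of what "interesting" means here for a `Tag`: whitespace-only strings are skipped, and the survivors are stripped:

====
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>  one  <b>  two  </b>\n</p>", "html.parser")
print(list(soup.stripped_strings))   # ['one', 'two']
====
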
4448- def get_text(self, separator="", strip=False,
4449- types=default):
4450+ def get_text(self, separator:str="", strip:bool=False,
4451+ types:Iterable[Type[NavigableString]]=default) -> str:
4452 """Get all child strings of this PageElement, concatenated using the
4453 given separator.
4454
4455@@ -294,19 +421,19 @@ class PageElement(object):
4456 getText = get_text
4457 text = property(get_text)
4458
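
For reference, the separator/strip combination in action:

====
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b>!</p>", "html.parser")
print(soup.get_text())                  # Hello world!
print(soup.get_text("|", strip=True))   # Hello|world|!
====
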
4459- def replace_with(self, *args):
4460- """Replace this PageElement with one or more PageElements, keeping the
4461- rest of the tree the same.
4462+ def replace_with(self, *args:PageElement) -> PageElement:
4463+ """Replace this `PageElement` with one or more other `PageElements`,
4464+ keeping the rest of the tree the same.
4465
4466- :param args: One or more PageElements.
4467- :return: `self`, no longer part of the tree.
4468+ :return: This `PageElement`, no longer part of the tree.
4469 """
4470 if self.parent is None:
4471 raise ValueError(
4472 "Cannot replace one element with another when the "
4473 "element to be replaced is not part of a tree.")
4474 if len(args) == 1 and args[0] is self:
4475- return
4476+ # Replacing an element with itself is a no-op.
4477+ return self
4478 if any(x is self.parent for x in args):
4479 raise ValueError("Cannot replace a Tag with its parent.")
4480 old_parent = self.parent
4481@@ -315,45 +442,28 @@ class PageElement(object):
4482 for idx, replace_with in enumerate(args, start=my_index):
4483 old_parent.insert(idx, replace_with)
4484 return self
4485- replaceWith = replace_with # BS3
4486+ replaceWith = _deprecated_function_alias(
4487+ "replaceWith", "replace_with", "4.0.0"
4488+ )
4489
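
A sketch of the return-value contract, including the detached original:

====
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Some <b>bold</b> text</p>", "html.parser")
i = soup.new_tag("i")
i.string = "italic"
old = soup.b.replace_with(i)
print(soup)   # <p>Some <i>italic</i> text</p>
print(old)    # <b>bold</b>, returned and no longer part of the tree
====
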
4490- def unwrap(self):
4491- """Replace this PageElement with its contents.
4492+ def wrap(self, wrap_inside:Tag) -> Tag:
4493+ """Wrap this `PageElement` inside a `Tag`.
4494
4495- :return: `self`, no longer part of the tree.
4496- """
4497- my_parent = self.parent
4498- if self.parent is None:
4499- raise ValueError(
4500- "Cannot replace an element with its contents when that"
4501- "element is not part of a tree.")
4502- my_index = self.parent.index(self)
4503- self.extract(_self_index=my_index)
4504- for child in reversed(self.contents[:]):
4505- my_parent.insert(my_index, child)
4506- return self
4507- replace_with_children = unwrap
4508- replaceWithChildren = unwrap # BS3
4509-
4510- def wrap(self, wrap_inside):
4511- """Wrap this PageElement inside another one.
4512-
4513- :param wrap_inside: A PageElement.
4514- :return: `wrap_inside`, occupying the position in the tree that used
4515- to be occupied by `self`, and with `self` inside it.
4516+ :return: ``wrap_inside``, occupying the position in the tree that used
4517+ to be occupied by this object, and with this object now inside it.
4518 """
4519 me = self.replace_with(wrap_inside)
4520 wrap_inside.append(me)
4521 return wrap_inside
4522
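
The classic example from the documentation:

====
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>I wish I was bold.</p>", "html.parser")
soup.p.string.wrap(soup.new_tag("b"))
print(soup)   # <p><b>I wish I was bold.</b></p>
====
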
4523- def extract(self, _self_index=None):
4524+ def extract(self, _self_index:Optional[int]=None) -> PageElement:
4525 """Destructively rips this element out of the tree.
4526
4527 :param _self_index: The location of this element in its parent's
4528 .contents, if known. Passing this in allows for a performance
4529 optimization.
4530
4531- :return: `self`, no longer part of the tree.
4532+ :return: This `PageElement`, no longer part of the tree.
4533 """
4534 if self.parent is not None:
4535 if _self_index is None:
4536@@ -364,11 +474,17 @@ class PageElement(object):
4537 #this element (and any children) hadn't been parsed. Connect
4538 #the two.
4539 last_child = self._last_descendant()
4540+
4541+ # last_child can't be None because we passed accept_self=True
4542+ # into _last_descendant. Worst case, last_child will be
4543+ # self. Making this cast removes several mypy complaints later
4544+ # on as we manipulate last_child.
4545+ last_child = cast(PageElement, last_child)
4546 next_element = last_child.next_element
4547
4548- if (self.previous_element is not None and
4549- self.previous_element is not next_element):
4550- self.previous_element.next_element = next_element
4551+ if self.previous_element is not None:
4552+ if self.previous_element is not next_element:
4553+ self.previous_element.next_element = next_element
4554 if next_element is not None and next_element is not self.previous_element:
4555 next_element.previous_element = self.previous_element
4556 self.previous_element = None
4557@@ -384,12 +500,38 @@ class PageElement(object):
4558 self.previous_sibling = self.next_sibling = None
4559 return self
4560
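
In contrast to `decompose` below, `extract` detaches the element but leaves it fully usable:

====
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>keep <b>pull this out</b></p>", "html.parser")
b = soup.b.extract()
print(soup)      # <p>keep </p>
print(b)         # <b>pull this out</b>
print(b.parent)  # None: detached, but still a working Tag
====
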
4561- def _last_descendant(self, is_initialized=True, accept_self=True):
4562+ def decompose(self) -> None:
4563+ """Recursively destroys this `PageElement` and its children.
4564+
4565+ The element will be removed from the tree and wiped out; so
4566+ will everything beneath it.
4567+
4568+ The behavior of a decomposed `PageElement` is undefined and you
4569+ should never use one for anything, but if you need to *check*
4570+ whether an element has been decomposed, you can use the
4571+ `PageElement.decomposed` property.
4572+ """
4573+ self.extract()
4574+ e: Optional[PageElement] = self
4575+ next_up: Optional[PageElement] = None
4576+ while e is not None:
4577+ next_up = e.next_element
4578+ e.__dict__.clear()
4579+ if isinstance(e, Tag):
4580+ e.contents = []
4581+ e._decomposed = True
4582+ e = next_up
4583+
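
After `decompose`, the `decomposed` property is the one safe thing left to check:

====
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>keep <b>destroy</b></p>", "html.parser")
b = soup.b
b.decompose()
print(soup)          # <p>keep </p>
print(b.decomposed)  # True
====
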
4584+ def _last_descendant(
4585+ self, is_initialized:bool=True, accept_self:bool=True
4586+ ) -> Optional[PageElement]:
4587 """Finds the last element beneath this object to be parsed.
4588
4589- :param is_initialized: Has `setup` been called on this PageElement
4590- yet?
4591- :param accept_self: Is `self` an acceptable answer to the question?
4592+ :param is_initialized: Has `PageElement.setup` been called on
4593+ this `PageElement` yet?
4594+
4595+ :param accept_self: Is ``self`` an acceptable answer to the
4596+ question?
4597 """
4598 if is_initialized and self.next_sibling is not None:
4599 last_child = self.next_sibling.previous_element
4600@@ -400,121 +542,15 @@ class PageElement(object):
4601 if not accept_self and last_child is self:
4602 last_child = None
4603 return last_child
4604- # BS3: Not part of the API!
4605- _lastRecursiveChild = _last_descendant
4606
4607- def insert(self, position, new_child):
4608- """Insert a new PageElement in the list of this PageElement's children.
4609-
4610- This works the same way as `list.insert`.
4611-
4612- :param position: The numeric position that should be occupied
4613- in `self.children` by the new PageElement.
4614- :param new_child: A PageElement.
4615- """
4616- if new_child is None:
4617- raise ValueError("Cannot insert None into a tag.")
4618- if new_child is self:
4619- raise ValueError("Cannot insert a tag into itself.")
4620- if (isinstance(new_child, str)
4621- and not isinstance(new_child, NavigableString)):
4622- new_child = NavigableString(new_child)
4623+ _lastRecursiveChild = _deprecated_alias("_lastRecursiveChild", "_last_descendant", "4.0.0")
4624
4625- from bs4 import BeautifulSoup
4626- if isinstance(new_child, BeautifulSoup):
4627- # We don't want to end up with a situation where one BeautifulSoup
4628- # object contains another. Insert the children one at a time.
4629- for subchild in list(new_child.contents):
4630- self.insert(position, subchild)
4631- position += 1
4632- return
4633- position = min(position, len(self.contents))
4634- if hasattr(new_child, 'parent') and new_child.parent is not None:
4635- # We're 'inserting' an element that's already one
4636- # of this object's children.
4637- if new_child.parent is self:
4638- current_index = self.index(new_child)
4639- if current_index < position:
4640- # We're moving this element further down the list
4641- # of this object's children. That means that when
4642- # we extract this element, our target index will
4643- # jump down one.
4644- position -= 1
4645- new_child.extract()
4646-
4647- new_child.parent = self
4648- previous_child = None
4649- if position == 0:
4650- new_child.previous_sibling = None
4651- new_child.previous_element = self
4652- else:
4653- previous_child = self.contents[position - 1]
4654- new_child.previous_sibling = previous_child
4655- new_child.previous_sibling.next_sibling = new_child
4656- new_child.previous_element = previous_child._last_descendant(False)
4657- if new_child.previous_element is not None:
4658- new_child.previous_element.next_element = new_child
4659-
4660- new_childs_last_element = new_child._last_descendant(False)
4661-
4662- if position >= len(self.contents):
4663- new_child.next_sibling = None
4664-
4665- parent = self
4666- parents_next_sibling = None
4667- while parents_next_sibling is None and parent is not None:
4668- parents_next_sibling = parent.next_sibling
4669- parent = parent.parent
4670- if parents_next_sibling is not None:
4671- # We found the element that comes next in the document.
4672- break
4673- if parents_next_sibling is not None:
4674- new_childs_last_element.next_element = parents_next_sibling
4675- else:
4676- # The last element of this tag is the last element in
4677- # the document.
4678- new_childs_last_element.next_element = None
4679- else:
4680- next_child = self.contents[position]
4681- new_child.next_sibling = next_child
4682- if new_child.next_sibling is not None:
4683- new_child.next_sibling.previous_sibling = new_child
4684- new_childs_last_element.next_element = next_child
4685-
4686- if new_childs_last_element.next_element is not None:
4687- new_childs_last_element.next_element.previous_element = new_childs_last_element
4688- self.contents.insert(position, new_child)
4689-
4690- def append(self, tag):
4691- """Appends the given PageElement to the contents of this one.
4692-
4693- :param tag: A PageElement.
4694- """
4695- self.insert(len(self.contents), tag)
4696-
4697- def extend(self, tags):
4698- """Appends the given PageElements to this one's contents.
4699-
4700- :param tags: A list of PageElements. If a single Tag is
4701- provided instead, this PageElement's contents will be extended
4702- with that Tag's contents.
4703- """
4704- if isinstance(tags, Tag):
4705- tags = tags.contents
4706- if isinstance(tags, list):
4707- # Moving items around the tree may change their position in
4708- # the original list. Make a list that won't change.
4709- tags = list(tags)
4710- for tag in tags:
4711- self.append(tag)
4712-
4713- def insert_before(self, *args):
4714+ def insert_before(self, *args:PageElement) -> None:
4715 """Makes the given element(s) the immediate predecessor of this one.
4716
4717- All the elements will have the same parent, and the given elements
4718- will be immediately before this one.
4719-
4720- :param args: One or more PageElements.
4721+ All the elements will have the same `PageElement.parent` as
4722+ this one, and the given elements will occur immediately before
4723+ this one.
4724 """
4725 parent = self.parent
4726 if parent is None:
4727@@ -530,13 +566,12 @@ class PageElement(object):
4728 index = parent.index(self)
4729 parent.insert(index, predecessor)
4730
4731- def insert_after(self, *args):
4732+ def insert_after(self, *args:PageElement) -> None:
4733 """Makes the given element(s) the immediate successor of this one.
4734
4735- The elements will have the same parent, and the given elements
4736- will be immediately after this one.
4737-
4738- :param args: One or more PageElements.
4739+ The elements will have the same `PageElement.parent` as this
4740+ one, and the given elements will occur immediately after this
4741+ one.
4742 """
4743 # Do all error checking before modifying the tree.
4744 parent = self.parent
4745@@ -556,7 +591,14 @@ class PageElement(object):
4746 parent.insert(index+1+offset, successor)
4747 offset += 1
4748
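
Both methods accept multiple arguments and wrap bare strings automatically:

====
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>two</b></p>", "html.parser")
b = soup.b
b.insert_before("one ")        # bare strings become NavigableStrings
b.insert_after(" ", "three")   # multiple elements keep their order
print(soup.p)                  # <p>one <b>two</b> three</p>
====
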
4749- def find_next(self, name=None, attrs={}, string=None, **kwargs):
4750+ def find_next(
4751+ self,
4752+ name:Optional[_StrainableElement]=None,
4753+ attrs:_StrainableAttributes={},
4754+ string:Optional[_StrainableString]=None,
4755+ node:Optional[_TagOrStringMatchFunction]=None,
4756+ **kwargs:_StrainableAttribute
4757+ ) -> Optional[PageElement]:
4758 """Find the first PageElement that matches the given criteria and
4759 appears later in the document than this PageElement.
4760
4761@@ -564,36 +606,47 @@ class PageElement(object):
4762 documentation for detailed explanations.
4763
4764 :param name: A filter on tag name.
4765- :param attrs: A dictionary of filters on attribute values.
4766+ :param attrs: Additional filters on attribute values.
4767 :param string: A filter for a NavigableString with specific text.
4768- :kwargs: A dictionary of filters on attribute values.
4769- :return: A PageElement.
4770- :rtype: bs4.element.Tag | bs4.element.NavigableString
4771- """
4772- return self._find_one(self.find_all_next, name, attrs, string, **kwargs)
4773- findNext = find_next # BS3
4774-
4775- def find_all_next(self, name=None, attrs={}, string=None, limit=None,
4776- **kwargs):
4777- """Find all PageElements that match the given criteria and appear
4778- later in the document than this PageElement.
4779+ :kwargs: Additional filters on attribute values.
4780+ """
4781+ return self._find_one(self.find_all_next, name, attrs, string, node, **kwargs)
4782+ findNext = _deprecated_function_alias("findNext", "find_next", "4.0.0")
4783+
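
One more sketch of the proposed `node` argument (the name is still under discussion above), showing the kind of query that is awkward with the existing name/attrs/string filters: selecting by object type.

====
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup("<p>text<!-- a comment --><b>bold</b></p>", "html.parser")

# Find the first Comment following the <p> tag:
comment = soup.p.find_next(node=lambda e: isinstance(e, Comment))
print(repr(comment))   # ' a comment '
====
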
4784+ def find_all_next(
4785+ self,
4786+ name:Optional[_StrainableElement]=None,
4787+ attrs:_StrainableAttributes={},
4788+ string:Optional[_StrainableString]=None,
4789+ limit:Optional[int]=None,
4790+ node:Optional[_TagOrStringMatchFunction]=None,
4791+ _stacklevel:int=2,
4792+ **kwargs:_StrainableAttribute
4793+ ) -> ResultSet[PageElement]:
4794+ """Find all `PageElement` objects that match the given criteria and
4795+ appear later in the document than this `PageElement`.
4796
4797 All find_* methods take a common set of arguments. See the online
4798 documentation for detailed explanations.
4799
4800 :param name: A filter on tag name.
4801- :param attrs: A dictionary of filters on attribute values.
4802+ :param attrs: Additional filters on attribute values.
4803 :param string: A filter for a NavigableString with specific text.
4804 :param limit: Stop looking after finding this many results.
4805- :kwargs: A dictionary of filters on attribute values.
4806- :return: A ResultSet containing PageElements.
4807+ :param _stacklevel: Used internally to improve warning messages.
4808+ :kwargs: Additional filters on attribute values.
4809 """
4810- _stacklevel = kwargs.pop('_stacklevel', 2)
4811 return self._find_all(name, attrs, string, limit, self.next_elements,
4812- _stacklevel=_stacklevel+1, **kwargs)
4813- findAllNext = find_all_next # BS3
4814-
4815- def find_next_sibling(self, name=None, attrs={}, string=None, **kwargs):
4816+ node, _stacklevel=_stacklevel+1, **kwargs)
4817+ findAllNext = _deprecated_function_alias("findAllNext", "find_all_next", "4.0.0")
4818+
4819+ def find_next_sibling(
4820+ self,
4821+ name:Optional[_StrainableElement]=None,
4822+ attrs:_StrainableAttributes={},
4823+ string:Optional[_StrainableString]=None,
4824+ node:Optional[_TagOrStringMatchFunction]=None,
4825+ **kwargs:_StrainableAttribute) -> Optional[PageElement]:
4826 """Find the closest sibling to this PageElement that matches the
4827 given criteria and appears later in the document.
4828
4829@@ -601,102 +654,143 @@ class PageElement(object):
4830 online documentation for detailed explanations.
4831
4832 :param name: A filter on tag name.
4833- :param attrs: A dictionary of filters on attribute values.
4834- :param string: A filter for a NavigableString with specific text.
4835- :kwargs: A dictionary of filters on attribute values.
4836- :return: A PageElement.
4837- :rtype: bs4.element.Tag | bs4.element.NavigableString
4838+ :param attrs: Additional filters on attribute values.
4839+ :param string: A filter for a `NavigableString` with specific text.
4840+ :kwargs: Additional filters on attribute values.
4841 """
4842 return self._find_one(self.find_next_siblings, name, attrs, string,
4843- **kwargs)
4844- findNextSibling = find_next_sibling # BS3
4845-
4846- def find_next_siblings(self, name=None, attrs={}, string=None, limit=None,
4847- **kwargs):
4848- """Find all siblings of this PageElement that match the given criteria
4849+ node, **kwargs)
4850+ findNextSibling = _deprecated_function_alias(
4851+ "findNextSibling", "find_next_sibling", "4.0.0"
4852+ )
4853+
4854+ def find_next_siblings(
4855+ self,
4856+ name:Optional[_StrainableElement]=None,
4857+ attrs:_StrainableAttributes={},
4858+ string:Optional[_StrainableString]=None,
4859+ limit:Optional[int]=None,
4860+ node:Optional[_TagOrStringMatchFunction]=None,
4861+ _stacklevel:int=2,
4862+ **kwargs:_StrainableAttribute
4863+ ) -> ResultSet[PageElement]:
4864+ """Find all siblings of this `PageElement` that match the given criteria
4865 and appear later in the document.
4866
4867 All find_* methods take a common set of arguments. See the online
4868 documentation for detailed explanations.
4869
4870 :param name: A filter on tag name.
4871- :param attrs: A dictionary of filters on attribute values.
4872- :param string: A filter for a NavigableString with specific text.
4873+ :param attrs: Additional filters on attribute values.
4874+ :param string: A filter for a `NavigableString` with specific text.
4875 :param limit: Stop looking after finding this many results.
4876- :kwargs: A dictionary of filters on attribute values.
4877- :return: A ResultSet of PageElements.
4878- :rtype: bs4.element.ResultSet
4879+ :param _stacklevel: Used internally to improve warning messages.
4880+ :kwargs: Additional filters on attribute values.
4881 """
4882- _stacklevel = kwargs.pop('_stacklevel', 2)
4883 return self._find_all(
4884 name, attrs, string, limit,
4885- self.next_siblings, _stacklevel=_stacklevel+1, **kwargs
4886+ self.next_siblings, node, _stacklevel=_stacklevel+1, **kwargs
4887 )
4888- findNextSiblings = find_next_siblings # BS3
4889- fetchNextSiblings = find_next_siblings # BS2
4890-
4891- def find_previous(self, name=None, attrs={}, string=None, **kwargs):
4892- """Look backwards in the document from this PageElement and find the
4893- first PageElement that matches the given criteria.
4894+ findNextSiblings = _deprecated_function_alias(
4895+ "findNextSiblings", "find_next_siblings", "4.0.0"
4896+ )
4897+ fetchNextSiblings = _deprecated_function_alias(
4898+ "fetchNextSiblings", "find_next_siblings", "3.0.0"
4899+ )
4900+
4901+ def find_previous(
4902+ self,
4903+ name:Optional[_StrainableElement]=None,
4904+ attrs:_StrainableAttributes={},
4905+ string:Optional[_StrainableString]=None,
4906+ node:Optional[_TagOrStringMatchFunction]=None,
4907+ **kwargs:_StrainableAttribute) -> Optional[PageElement]:
4908+ """Look backwards in the document from this `PageElement` and find the
4909+ first `PageElement` that matches the given criteria.
4910
4911 All find_* methods take a common set of arguments. See the online
4912 documentation for detailed explanations.
4913
4914 :param name: A filter on tag name.
4915- :param attrs: A dictionary of filters on attribute values.
4916- :param string: A filter for a NavigableString with specific text.
4917- :kwargs: A dictionary of filters on attribute values.
4918- :return: A PageElement.
4919- :rtype: bs4.element.Tag | bs4.element.NavigableString
4920+ :param attrs: Additional filters on attribute values.
4921+ :param string: A filter for a `NavigableString` with specific text.
4922+ :kwargs: Additional filters on attribute values.
4923 """
4924 return self._find_one(
4925- self.find_all_previous, name, attrs, string, **kwargs)
4926- findPrevious = find_previous # BS3
4927-
4928- def find_all_previous(self, name=None, attrs={}, string=None, limit=None,
4929- **kwargs):
4930- """Look backwards in the document from this PageElement and find all
4931- PageElements that match the given criteria.
4932+ self.find_all_previous, name, attrs, string, node, **kwargs)
4933+
4934+ findPrevious = _deprecated_function_alias(
4935+ "findPrevious", "find_previous", "3.0.0"
4936+ )
4937+
4938+ def find_all_previous(
4939+ self,
4940+ name:Optional[_StrainableElement]=None,
4941+ attrs:_StrainableAttributes={},
4942+ string:Optional[_StrainableString]=None,
4943+ limit:Optional[int]=None,
4944+ node:Optional[_TagOrStringMatchFunction]=None,
4945+ _stacklevel:int=2,
4946+ **kwargs:_StrainableAttribute
4947+ ) -> ResultSet[PageElement]:
4948+ """Look backwards in the document from this `PageElement` and find all
4949+ `PageElement` objects that match the given criteria.
4950
4951 All find_* methods take a common set of arguments. See the online
4952 documentation for detailed explanations.
4953
4954 :param name: A filter on tag name.
4955- :param attrs: A dictionary of filters on attribute values.
4956- :param string: A filter for a NavigableString with specific text.
4957+ :param attrs: Additional filters on attribute values.
4958+ :param string: A filter for a `NavigableString` with specific text.
4959 :param limit: Stop looking after finding this many results.
4960- :kwargs: A dictionary of filters on attribute values.
4961- :return: A ResultSet of PageElements.
4962- :rtype: bs4.element.ResultSet
4963+ :param _stacklevel: Used internally to improve warning messages.
4964+ :kwargs: Additional filters on attribute values.
4965 """
4966- _stacklevel = kwargs.pop('_stacklevel', 2)
4967 return self._find_all(
4968 name, attrs, string, limit, self.previous_elements,
4969- _stacklevel=_stacklevel+1, **kwargs
4970+ node, _stacklevel=_stacklevel+1, **kwargs
4971 )
4972- findAllPrevious = find_all_previous # BS3
4973- fetchPrevious = find_all_previous # BS2
4974-
4975- def find_previous_sibling(self, name=None, attrs={}, string=None, **kwargs):
4976- """Returns the closest sibling to this PageElement that matches the
4977+ findAllPrevious = _deprecated_function_alias(
4978+ "findAllPrevious", "find_all_previous", "4.0.0"
4979+ )
4980+ fetchAllPrevious = _deprecated_function_alias(
4981+ "fetchAllPrevious", "find_all_previous", "3.0.0"
4982+ )
4983+
4984+ def find_previous_sibling(
4985+ self,
4986+ name:Optional[_StrainableElement]=None,
4987+ attrs:_StrainableAttributes={},
4988+ string:Optional[_StrainableString]=None,
4989+ node:Optional[_TagOrStringMatchFunction]=None,
4990+ **kwargs:_StrainableAttribute) -> Optional[PageElement]:
4991+ """Returns the closest sibling to this `PageElement` that matches the
4992 given criteria and appears earlier in the document.
4993
4994 All find_* methods take a common set of arguments. See the online
4995 documentation for detailed explanations.
4996
4997 :param name: A filter on tag name.
4998- :param attrs: A dictionary of filters on attribute values.
4999- :param string: A filter for a NavigableString with specific text.
5000- :kwargs: A dictionary of filters on attribute values.
The diff has been truncated for viewing.
