Merge ~chrispitude/beautifulsoup:node-filters into beautifulsoup:master

Proposed by Chris Papademetrious
Status: Superseded
Proposed branch: ~chrispitude/beautifulsoup:node-filters
Merge into: beautifulsoup:master
Diff against target: 9961 lines (+4153/-2375)
31 files modified
CHANGELOG (+99/-0)
bs4/__init__.py (+245/-73)
bs4/_deprecation.py (+57/-0)
bs4/_typing.py (+99/-0)
bs4/builder/__init__.py (+278/-184)
bs4/builder/_html5lib.py (+57/-36)
bs4/builder/_htmlparser.py (+120/-66)
bs4/builder/_lxml.py (+137/-68)
bs4/css.py (+124/-96)
bs4/dammit.py (+407/-230)
bs4/diagnose.py (+31/-19)
bs4/element.py (+1154/-1052)
bs4/formatter.py (+96/-41)
bs4/strainer.py (+498/-0)
bs4/tests/__init__.py (+17/-12)
bs4/tests/test_builder_registry.py (+3/-3)
bs4/tests/test_dammit.py (+17/-10)
bs4/tests/test_element.py (+25/-5)
bs4/tests/test_html5lib.py (+17/-1)
bs4/tests/test_htmlparser.py (+5/-3)
bs4/tests/test_lxml.py (+4/-3)
bs4/tests/test_pageelement.py (+8/-4)
bs4/tests/test_soup.py (+13/-5)
bs4/tests/test_strainer.py (+485/-0)
bs4/tests/test_tag.py (+1/-0)
bs4/tests/test_tree.py (+44/-25)
dev/null (+0/-256)
doc/Makefile (+14/-124)
doc/conf.py (+33/-0)
doc/index.rst (+61/-57)
tox.ini (+4/-2)
Reviewer: Leonard Richardson (status: Pending)
Review via email: mp+457782@code.launchpad.net

This proposal has been superseded by a proposal from 2024-01-01.

Commit message

implement "node" filtering that considers all PageElement objects

Description of the change

This is a draft merge request that implements a proof of concept for the following wishlist request:

#2047713: enhance find*() methods to filter through all object types
https://bugs.launchpad.net/beautifulsoup/+bug/2047713

Most of the changes thread a new "node" filter argument down into the search machinery. The actual functionality is just three additional lines in the search() method that use it.
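To illustrate the idea in isolation, here is a self-contained sketch of a node-level predicate applied in a generic search loop. This is not the actual patch; the function name and signature are mine, purely for illustration.

```python
# Hypothetical sketch of a node-level predicate in a search loop.
# The name search_nodes() and its signature are illustrative only,
# not the API implemented by this merge proposal.
def search_nodes(elements, node=None):
    """Return the elements accepted by the node predicate (all, if None)."""
    results = []
    for element in elements:
        # The predicate sees every element, regardless of its type.
        if node is None or node(element):
            results.append(element)
    return results
```

The point is that the predicate is applied uniformly to every element visited, with no tag-versus-string distinction.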

With these changes, when the following script is run:

====
#!/usr/bin/env python
from bs4 import BeautifulSoup, NavigableString
html_doc = """
  <p>
    <b>bold</b>
    <i>italic</i>
    <u>underline</u>
    text
    <br />
  </p>
"""
soup = BeautifulSoup(html_doc, 'lxml')

# this is the filter I want to use
def is_non_whitespace(thing) -> bool:
    return not (isinstance(thing, NavigableString) and thing.text.isspace())

# get the first non-whitespace thing in <p>
this_thing = soup.find('p').find(node=is_non_whitespace, recursive=False)

# print all following non-whitespace sibling elements in <p>
while this_thing:
    next_thing = this_thing.find_next_sibling(node=is_non_whitespace)
    print(f"{repr(this_thing)} is followed by {repr(next_thing)}")
    this_thing = next_thing
====

the results are as follows:

====
<b>bold</b> is followed by <i>italic</i>
<i>italic</i> is followed by <u>underline</u>
<u>underline</u> is followed by '\n text\n '
'\n text\n ' is followed by <br/>
<br/> is followed by None
====

Note the mix of tag and text objects!
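For comparison, approximating this today (without the patch) means iterating .next_siblings by hand, because the find_* methods cannot apply a single predicate across both tags and strings. A rough sketch, using only the current bs4 API (the helper name is mine):

```python
from bs4 import BeautifulSoup, NavigableString

html_doc = "<p><b>bold</b>\n  text\n  <br /></p>"
soup = BeautifulSoup(html_doc, "html.parser")

def non_whitespace_siblings(element):
    """Yield following siblings of any type, skipping whitespace-only strings."""
    for sibling in element.next_siblings:
        # NavigableString subclasses str, so it can be tested directly.
        if isinstance(sibling, NavigableString) and str(sibling).isspace():
            continue
        yield sibling

first = soup.find("b")
for thing in non_whitespace_siblings(first):
    print(repr(thing))
```

The proposed node= argument would fold this hand-rolled loop into the find_* methods themselves.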

Some questions and open items:

* Is "PageElement" the correct term for any object in a BeautifulSoup document (Tag, NavigableString, Comment, ProcessingInstruction, etc.)?
* What should this new filter argument be called? (node? page_element? something else?)
* Is there a more elegant approach that doesn't require threading a new argument down into everything?
* Rules for mixing this new filter with existing name/attribute filters must be defined/coded/documented.
  * I think this new filter should be mutually exclusive with tag/attribute filters.
  * I think this new filter should accept only Callable objects, and perhaps also True/False.
* Tests and documentation are needed.
  * I can do this when the implementation is complete.

Fingers crossed that this makes it in. It would be enormously powerful.

Revision history for this message
Chris Papademetrious (chrispitude) wrote:

I will redo this merge proposal on top of the 4.13 branch.

Before I start, are there any aspects that you'd like to see implemented differently?

a771557... by Chris Papademetrious

implement a filter that considers all PageElement objects

Unmerged commits

a771557... by Chris Papademetrious

implement a filter that considers all PageElement objects

8a6d1dd... by Leonard Richardson

Merged in change to main branch.

4cde600... by Leonard Richardson

Those casts are more trouble than they're worth.

1113a86... by Leonard Richardson

Got css.py to pass mypy strict although it's a little hacky.

26e1772... by Leonard Richardson

Went through formatter.py with mypy strict.

5bf3787... by Leonard Richardson

Went through dammit.py with mypy strict.

7200655... by Leonard Richardson

Merged in main branch.

f3a3619... by Leonard Richardson

Got rid of deprecation warnings in tests.

6f89323... by Leonard Richardson

Get (slightly) more specific about alias.

f8e55c0... by Leonard Richardson

_alias itself is not used anywhere.

Preview Diff

diff --git a/CHANGELOG b/CHANGELOG
index 66fcb74..bec1e11 100644
--- a/CHANGELOG
+++ b/CHANGELOG
@@ -1,3 +1,102 @@
+= 4.13.0 (Unreleased)
+
+* This version drops support for Python 3.6. The minimum supported
+  major version for Beautiful Soup is now Python 3.7.
+
+* Deprecation warnings have been added for all deprecated methods and
+  attributes. Most of these were deprecated over ten years ago, and
+  some were deprecated over fifteen years ago.
+
+  Going forward, deprecated names will be subject to removal two
+  feature releases or one major release after the deprecation warning
+  is added.
+
+* append(), extend(), insert(), and unwrap() were moved from PageElement to
+  Tag. Those methods manipulate the 'contents' collection, so they would
+  only have ever worked on Tag objects.
+
+* decompose() was moved from Tag to PageElement, since there's no reason
+  it won't also work on NavigableString objects.
+
+* The BeautifulSoupHTMLParser constructor now requires a BeautifulSoup
+  object as its first argument. This almost certainly does not affect
+  you, since you probably use HTMLParserTreeBuilder, not
+  BeautifulSoupHTMLParser directly.
+
+* If Tag.get_attribute_list() is used to access an attribute that's not set,
+  the return value is now an empty list rather than [None].
+
+* AttributeValueWithCharsetSubstitution.encode() is renamed to
+  substitute_encoding, to avoid confusion with the much different str.encode()
+
+* Using PageElement.replace_with() to replace an element with itself
+  returns the element instead of None.
+
+* When using one of the find() methods or creating a SoupStrainer,
+  if you specify the same attribute value in ``attrs`` and the
+  keyword arguments, you'll end up with two different ways to match that
+  attribute. Previously the value in keyword arguments would override the
+  value in ``attrs``.
+
+* When using one of the find() methods or creating a SoupStrainer, you can
+  pass a list of any accepted object (strings, regular expressions, etc.) for
+  any of the objects. Previously you could only pass in a list of strings.
+
+* A SoupStrainer can now filter tag creation based on a tag's
+  namespaced name. Previously only the unqualified name could be used.
+
+* All TreeBuilder constructors now take the empty_element_tags
+  argument. The sets of tags found in HTMLTreeBuilder.empty_element_tags and
+  HTMLTreeBuilder.block_elements are now in
+  HTMLTreeBuilder.DEFAULT_EMPTY_ELEMENT_TAGS and
+  HTMLTreeBuilder.DEFAULT_BLOCK_ELEMENTS, to avoid confusing them with
+  instance variables.
+
+* Issue a warning if a document is parsed using a SoupStrainer that's just
+  going to filter everything. In these cases, filtering everything is the
+  most consistent thing to do, but there was no indication that was
+  happening.
+
+* UnicodeDammit.markup is now always a bytestring representing the
+  *original* markup (sans BOM), and UnicodeDammit.unicode_markup is
+  always the same markup, converted to Unicode. Previously,
+  UnicodeDammit.markup was treated inconsistently and would often end
+  up containing Unicode. UnicodeDammit.markup was not a documented
+  attribute, but if you were using it, you probably want to switch to using
+  .unicode_markup instead.
+
+* Corrected the markup that's output in the unlikely event that you
+  encode a document to a Python internal encoding (like "palmos")
+  that's not recognized by the HTML or XML standard.
+
+* The arguments to LXMLTreeBuilderForXML.prepare_markup have been
+  changed to match the arguments to the superclass,
+  TreeBuilder.prepare_markup. Specifically, document_declared_encoding
+  now appears before exclude_encodings, not after. If you were calling
+  this method yourself, I recomment switching to using keyword
+  arguments instead.
+
+* Fixed an error in the lookup table used when converting
+  ISO-Latin-1 to ASCII, which no one should do anyway.
+
+* The unused constant LXMLTreeBuilderForXML.DEFAULT_PARSER_CLASS
+  has been removed.
+
+New deprecations in 4.13.0:
+
+* The SAXTreeBuilder class, which was never officially supported or tested.
+
+* The first argument to BeautifulSoup.decode has been changed from a bool
+  `pretty_print` to an int `indent_level`, to match the signature of Tag.decode.
+
+* SoupStrainer.text and SoupStrainer.string are both deprecated
+  since a single item can't capture all the possibilities of a SoupStrainer
+  designed to match strings.
+
+* SoupStrainer.search_tag() is deprecated. It was never a
+  documented method, but if you use it, you should start using
+  SoupStrainer.allow_tag_creation() instead.
+
 = 4.12.3 (?)
 
 * Fixed a regression such that if you set .hidden on a tag, the tag
diff --git a/bs4/__init__.py b/bs4/__init__.py
index 3d2ab09..a1289c7 100644
--- a/bs4/__init__.py
+++ b/bs4/__init__.py
@@ -7,8 +7,8 @@ Beautiful Soup uses a pluggable XML or HTML parser to parse a
 provides methods and Pythonic idioms that make it easy to navigate,
 search, and modify the parse tree.
 
-Beautiful Soup works with Python 3.6 and up. It works better if lxml
-and/or html5lib is installed.
+Beautiful Soup works with Python 3.7 and up. It works better if lxml
+and/or html5lib is installed, but they are not required.
 
 For more than you ever wanted to know about Beautiful Soup, see the
 documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
@@ -37,9 +37,10 @@ if sys.version_info.major < 3:
 from .builder import (
     builder_registry,
     ParserRejectedMarkup,
+    TreeBuilder,
     XMLParsedAsHTMLWarning,
-    HTMLParserTreeBuilder
 )
+from .builder._htmlparser import HTMLParserTreeBuilder
 from .dammit import UnicodeDammit
 from .element import (
     CData,
@@ -55,10 +56,32 @@ from .element import (
     ResultSet,
     Script,
     Stylesheet,
-    SoupStrainer,
     Tag,
     TemplateString,
     )
+from .formatter import Formatter
+from .strainer import SoupStrainer
+from typing import (
+    Any,
+    cast,
+    Counter as CounterType,
+    Dict,
+    Iterable,
+    List,
+    Sequence,
+    Optional,
+    Type,
+    TYPE_CHECKING,
+    Union,
+)
+
+from bs4._typing import (
+    _AttributeValue,
+    _AttributeValues,
+    _Encoding,
+    _Encodings,
+    _IncomingMarkup,
+)
 
 # Define some custom warnings.
 class GuessedAtParserWarning(UserWarning):
@@ -104,24 +127,64 @@ class BeautifulSoup(Tag):
     handle_endtag.
     """
 
-    # Since BeautifulSoup subclasses Tag, it's possible to treat it as
-    # a Tag with a .name. This name makes it clear the BeautifulSoup
-    # object isn't a real markup tag.
-    ROOT_TAG_NAME = '[document]'
+    #: Since `BeautifulSoup` subclasses `Tag`, it's possible to treat it as
+    #: a `Tag` with a `Tag.name`. Hoever, this name makes it clear the
+    #: `BeautifulSoup` object isn't a real markup tag.
+    ROOT_TAG_NAME:str = '[document]'
 
-    # If the end-user gives no indication which tree builder they
-    # want, look for one with these features.
-    DEFAULT_BUILDER_FEATURES = ['html', 'fast']
+    #: If the end-user gives no indication which tree builder they
+    #: want, look for one with these features.
+    DEFAULT_BUILDER_FEATURES: Sequence[str] = ['html', 'fast']
 
-    # A string containing all ASCII whitespace characters, used in
-    # endData() to detect data chunks that seem 'empty'.
-    ASCII_SPACES = '\x20\x0a\x09\x0c\x0d'
+    #: A string containing all ASCII whitespace characters, used in
+    #: `BeautifulSoup.endData` to detect data chunks that seem 'empty'.
+    ASCII_SPACES: str = '\x20\x0a\x09\x0c\x0d'
 
-    NO_PARSER_SPECIFIED_WARNING = "No parser was explicitly specified, so I'm using the best available %(markup_type)s parser for this system (\"%(parser)s\"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.\n\nThe code that caused this warning is on line %(line_number)s of the file %(filename)s. To get rid of this warning, pass the additional argument 'features=\"%(parser)s\"' to the BeautifulSoup constructor.\n"
-
-    def __init__(self, markup="", features=None, builder=None,
-                 parse_only=None, from_encoding=None, exclude_encodings=None,
-                 element_classes=None, **kwargs):
+    #: :meta private:
+    NO_PARSER_SPECIFIED_WARNING: str = "No parser was explicitly specified, so I'm using the best available %(markup_type)s parser for this system (\"%(parser)s\"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.\n\nThe code that caused this warning is on line %(line_number)s of the file %(filename)s. To get rid of this warning, pass the additional argument 'features=\"%(parser)s\"' to the BeautifulSoup constructor.\n"
+
+    # FUTURE PYTHON:
+    element_classes:Dict[Type[PageElement], Type[Any]] #: :meta private:
+    builder:TreeBuilder #: :meta private:
+    is_xml: bool
+    known_xml: Optional[bool]
+    parse_only: Optional[SoupStrainer] #: :meta private:
+
+    # These members are only used while parsing markup.
+    markup:Optional[Union[str,bytes]] #: :meta private:
+    current_data:List[str] #: :meta private:
+    currentTag:Optional[Tag] #: :meta private:
+    tagStack:List[Tag] #: :meta private:
+    open_tag_counter:CounterType[str] #: :meta private:
+    preserve_whitespace_tag_stack:List[Tag] #: :meta private:
+    string_container_stack:List[Tag] #: :meta private:
+
+    #: Beautiful Soup's best guess as to the character encoding of the
+    #: original document.
+    original_encoding: Optional[_Encoding]
+
+    #: The character encoding, if any, that was explicitly defined
+    #: in the original document. This may or may not match
+    #: `BeautifulSoup.original_encoding`.
+    declared_html_encoding: Optional[_Encoding]
+
+    #: This is True if the markup that was parsed contains
+    #: U+FFFD REPLACEMENT_CHARACTER characters which were not present
+    #: in the original markup. These mark character sequences that
+    #: could not be represented in Unicode.
+    contains_replacement_characters: bool
+
+    def __init__(
+        self,
+        markup:_IncomingMarkup="",
+        features:Optional[Union[str,Sequence[str]]]=None,
+        builder:Optional[Union[TreeBuilder,Type[TreeBuilder]]]=None,
+        parse_only:Optional[SoupStrainer]=None,
+        from_encoding:Optional[_Encoding]=None,
+        exclude_encodings:Optional[_Encodings]=None,
+        element_classes:Optional[Dict[Type[PageElement], Type[Any]]]=None,
+        **kwargs:Any
+    ):
         """Constructor.
 
         :param markup: A string or a file-like object representing
@@ -196,14 +259,14 @@ class BeautifulSoup(Tag):
         if 'selfClosingTags' in kwargs:
             del kwargs['selfClosingTags']
             warnings.warn(
-                "BS4 does not respect the selfClosingTags argument to the "
+                "Beautiful Soup 4 does not respect the selfClosingTags argument to the "
                 "BeautifulSoup constructor. The tree builder is responsible "
                 "for understanding self-closing tags.")
 
         if 'isHTML' in kwargs:
             del kwargs['isHTML']
             warnings.warn(
-                "BS4 does not respect the isHTML argument to the "
+                "Beautiful Soup 4 does not respect the isHTML argument to the "
                 "BeautifulSoup constructor. Suggest you use "
                 "features='lxml' for HTML and features='lxml-xml' for "
                 "XML.")
@@ -212,7 +275,8 @@ class BeautifulSoup(Tag):
             if old_name in kwargs:
                 warnings.warn(
                     'The "%s" argument to the BeautifulSoup constructor '
-                    'has been renamed to "%s."' % (old_name, new_name),
+                    'was renamed to "%s" in Beautiful Soup 4.0.0' % (
+                        old_name, new_name),
                     DeprecationWarning, stacklevel=3
                 )
                 return kwargs.pop(old_name)
@@ -220,7 +284,14 @@ class BeautifulSoup(Tag):
 
         parse_only = parse_only or deprecated_argument(
             "parseOnlyThese", "parse_only")
-
+        if (parse_only is not None
+            and parse_only.string_rules and
+            (parse_only.name_rules or parse_only.attribute_rules)):
+            warnings.warn(
+                f"Value for parse_only will exclude everything, since it puts restrictions on both tags and strings: {parse_only}",
+                UserWarning, stacklevel=3
+            )
+
         from_encoding = from_encoding or deprecated_argument(
             "fromEncoding", "from_encoding")
 
@@ -235,7 +306,8 @@ class BeautifulSoup(Tag):
         # specify a parser' warning.
         original_builder = builder
         original_features = features
-        
+
+        builder_class: Type[TreeBuilder]
         if isinstance(builder, type):
             # A builder class was passed in; it needs to be instantiated.
             builder_class = builder
@@ -245,12 +317,13 @@ class BeautifulSoup(Tag):
                 features = [features]
             if features is None or len(features) == 0:
                 features = self.DEFAULT_BUILDER_FEATURES
-            builder_class = builder_registry.lookup(*features)
-            if builder_class is None:
+            possible_builder_class = builder_registry.lookup(*features)
+            if possible_builder_class is None:
                 raise FeatureNotFound(
                     "Couldn't find a tree builder with the features you "
                     "requested: %s. Do you need to install a parser library?"
                     % ",".join(features))
+            builder_class = cast(Type[TreeBuilder], possible_builder_class)
 
         # At this point either we have a TreeBuilder instance in
         # builder, or we have a builder_class that we can instantiate
@@ -259,7 +332,8 @@ class BeautifulSoup(Tag):
             builder = builder_class(**kwargs)
             if not original_builder and not (
                     original_features == builder.NAME or
-                    original_features in builder.ALTERNATE_NAMES
+                    (isinstance(original_features, str)
+                     and original_features in builder.ALTERNATE_NAMES)
             ) and markup:
                 # The user did not tell us which TreeBuilder to use,
                 # and we had to guess. Issue a warning.
@@ -323,6 +397,10 @@ class BeautifulSoup(Tag):
             if not self._markup_is_url(markup):
                 self._markup_resembles_filename(markup)
 
+        # At this point we know markup is a string or bytestring. If
+        # it was a file-type object, we've read from it.
+        markup = cast(Union[str,bytes], markup)
+
         rejections = []
         success = False
         for (self.markup, self.original_encoding, self.declared_html_encoding,
@@ -486,7 +564,7 @@ class BeautifulSoup(Tag):
         markup.
         """
         Tag.__init__(self, self, self.builder, self.ROOT_TAG_NAME)
-        self.hidden = 1
+        self.hidden = True
         self.builder.reset()
         self.current_data = []
         self.currentTag = None
@@ -497,8 +575,16 @@ class BeautifulSoup(Tag):
         self._most_recent_element = None
         self.pushTag(self)
 
-    def new_tag(self, name, namespace=None, nsprefix=None, attrs={},
-                sourceline=None, sourcepos=None, **kwattrs):
+    def new_tag(
+            self,
+            name:str,
+            namespace:Optional[str]=None,
+            nsprefix:Optional[str]=None,
+            attrs:_AttributeValues={},
+            sourceline:Optional[int]=None,
+            sourcepos:Optional[int]=None,
+            **kwattrs:_AttributeValue,
+    ):
         """Create a new Tag associated with this BeautifulSoup object.
 
         :param name: The name of the new Tag.
@@ -509,7 +595,7 @@ class BeautifulSoup(Tag):
             that are reserved words in Python.
         :param sourceline: The line number where this tag was
             (purportedly) found in its source document.
-        :param sourcepos: The character position within `sourceline` where this
+        :param sourcepos: The character position within ``sourceline`` where this
             tag was (purportedly) found.
         :param kwattrs: Keyword arguments for the new Tag's attribute values.
 
@@ -520,9 +606,17 @@ class BeautifulSoup(Tag):
             sourceline=sourceline, sourcepos=sourcepos
         )
 
-    def string_container(self, base_class=None):
+    def string_container(self,
+            base_class:Optional[Type[NavigableString]]=None
+    ) -> Type[NavigableString]:
+        """Find the class that should be instantiated to hold a given kind of
+        string.
+
+        This may be a built-in Beautiful Soup class or a custom class passed
+        in to the BeautifulSoup constructor.
+        """
         container = base_class or NavigableString
 
         # There may be a general override of NavigableString.
         container = self.element_classes.get(
             container, container
@@ -536,27 +630,40 @@ class BeautifulSoup(Tag):
         )
         return container
 
-    def new_string(self, s, subclass=None):
-        """Create a new NavigableString associated with this BeautifulSoup
+    def new_string(self, s:str, subclass:Optional[Type[NavigableString]]=None) -> NavigableString:
+        """Create a new `NavigableString` associated with this `BeautifulSoup`
         object.
+
+        :param s: The string content of the `NavigableString`
+
+        :param subclass: The subclass of `NavigableString`, if any, to
+           use. If a document is being processed, an appropriate subclass
+           for the current location in the document will be determined
+           automatically.
         """
         container = self.string_container(subclass)
         return container(s)
 
-    def insert_before(self, *args):
+    def insert_before(self, *args:PageElement) -> None:
         """This method is part of the PageElement API, but `BeautifulSoup` doesn't implement
         it because there is nothing before or after it in the parse tree.
         """
         raise NotImplementedError("BeautifulSoup objects don't support insert_before().")
 
-    def insert_after(self, *args):
+    def insert_after(self, *args:PageElement) -> None:
         """This method is part of the PageElement API, but `BeautifulSoup` doesn't implement
         it because there is nothing before or after it in the parse tree.
         """
         raise NotImplementedError("BeautifulSoup objects don't support insert_after().")
 
-    def popTag(self):
-        """Internal method called by _popToTag when a tag is closed."""
+    def popTag(self) -> Optional[Tag]:
+        """Internal method called by _popToTag when a tag is closed.
+
+        :meta private:
+        """
+        if not self.tagStack:
+            # Nothing to pop. This shouldn't happen.
+            return None
         tag = self.tagStack.pop()
         if tag.name in self.open_tag_counter:
             self.open_tag_counter[tag.name] -= 1
@@ -569,8 +676,11 @@ class BeautifulSoup(Tag):
             self.currentTag = self.tagStack[-1]
         return self.currentTag
 
-    def pushTag(self, tag):
-        """Internal method called by handle_starttag when a tag is opened."""
+    def pushTag(self, tag:Tag) -> None:
+        """Internal method called by handle_starttag when a tag is opened.
+
+        :meta private:
+        """
         #print("Push", tag.name)
         if self.currentTag is not None:
             self.currentTag.contents.append(tag)
@@ -583,9 +693,14 @@ class BeautifulSoup(Tag):
         if tag.name in self.builder.string_containers:
             self.string_container_stack.append(tag)
 
-    def endData(self, containerClass=None):
+    def endData(self, containerClass:Optional[Type[NavigableString]]=None) -> None:
         """Method called by the TreeBuilder when the end of a data segment
         occurs.
+
+        :param containerClass: The class to use when incorporating the
+           data segment into the parse tree.
+
+        :meta private:
         """
         if self.current_data:
             current_data = ''.join(self.current_data)
@@ -609,18 +724,27 @@ class BeautifulSoup(Tag):
 
         # Should we add this string to the tree at all?
         if self.parse_only and len(self.tagStack) <= 1 and \
-               (not self.parse_only.text or \
-                not self.parse_only.search(current_data)):
+               (not self.parse_only.string_rules or \
+                not self.parse_only.allow_string_creation(current_data)):
             return
 
         containerClass = self.string_container(containerClass)
         o = containerClass(current_data)
         self.object_was_parsed(o)
 
-    def object_was_parsed(self, o, parent=None, most_recent_element=None):
-        """Method called by the TreeBuilder to integrate an object into the parse tree."""
+    def object_was_parsed(
+            self, o:PageElement, parent:Optional[Tag]=None,
+            most_recent_element:Optional[PageElement]=None):
+        """Method called by the TreeBuilder to integrate an object into the
+        parse tree.
+
+        :meta private:
+        """
         if parent is None:
             parent = self.currentTag
+        assert parent is not None
         if most_recent_element is not None:
             previous_element = most_recent_element
         else:
@@ -685,7 +809,7 @@ class BeautifulSoup(Tag):
                 break
             target = target.parent
 
-    def _popToTag(self, name, nsprefix=None, inclusivePop=True):
+    def _popToTag(self, name, nsprefix=None, inclusivePop=True) -> Optional[Tag]:
         """Pops the tag stack up to and including the most recent
         instance of the given tag.
 
@@ -698,11 +822,12 @@ class BeautifulSoup(Tag):
            to but *not* including the most recent instqance of the
            given tag.
 
+        :meta private:
         """
         #print("Popping to %s" % name)
         if name == self.ROOT_TAG_NAME:
             # The BeautifulSoup object itself can never be popped.
-            return
+            return None
 
         most_recently_popped = None
 
@@ -719,8 +844,11 @@ class BeautifulSoup(Tag):
 
         return most_recently_popped
 
-    def handle_starttag(self, name, namespace, nsprefix, attrs, sourceline=None,
-                        sourcepos=None, namespaces=None):
+    def handle_starttag(
+            self, name:str, namespace:Optional[str],
+            nsprefix:Optional[str], attrs:Optional[Dict[str,str]],
+            sourceline:Optional[int]=None, sourcepos:Optional[int]=None,
+            namespaces:Optional[Dict[str, str]]=None) -> Optional[Tag]:
         """Called by the tree builder when a new tag is encountered.
 
         :param name: Name of the tag.
@@ -737,13 +865,15 @@ class BeautifulSoup(Tag):
            SoupStrainer. You should proceed as if the tag had not occurred
            in the document. For instance, if this was a self-closing tag,
            don't call handle_endtag.
+
+        :meta private:
         """
         # print("Start tag %s: %s" % (name, attrs))
         self.endData()
 
         if (self.parse_only and len(self.tagStack) <= 1
-                and (self.parse_only.text
-                     or not self.parse_only.search_tag(name, attrs))):
+                and (self.parse_only.string_rules
+                     or not self.parse_only.allow_tag_creation(nsprefix, name, attrs))):
             return None
 
         tag = self.element_classes.get(Tag, Tag)(
@@ -760,48 +890,90 @@ class BeautifulSoup(Tag):
         self.pushTag(tag)
         return tag
 
-    def handle_endtag(self, name, nsprefix=None):
+    def handle_endtag(self, name:str, nsprefix:Optional[str]=None) -> None:
         """Called by the tree builder when an ending tag is encountered.
 
         :param name: Name of the tag.
         :param nsprefix: Namespace prefix for the tag.
+
+        :meta private:
         """
         #print("End tag: " + name)
         self.endData()
         self._popToTag(name, nsprefix)
 
-    def handle_data(self, data):
-        """Called by the tree builder when a chunk of textual data is encountered."""
+    def handle_data(self, data:str) -> None:
+        """Called by the tree builder when a chunk of textual data is
+        encountered.
+
+        :meta private:
+        """
         self.current_data.append(data)
 
-    def decode(self, pretty_print=False,
-               eventual_encoding=DEFAULT_OUTPUT_ENCODING,
-               formatter="minimal", iterator=None):
-        """Returns a string or Unicode representation of the parse tree
-        as an HTML or XML document.
-
-        :param pretty_print: If this is True, indentation will be used to
-            make the document more readable.
+    def decode(self, indent_level:Optional[int]=None,
+               eventual_encoding:_Encoding=DEFAULT_OUTPUT_ENCODING,
+               formatter:Union[Formatter,str]="minimal",
+               iterator:Optional[Iterable]=None, **kwargs) -> str:
+        """Returns a string representation of the parse tree
+        as a full HTML or XML document.
+
+        :param indent_level: Each line of the rendering will be
+            indented this many levels. (The ``formatter`` decides what a
+            'level' means, in terms of spaces or other characters
+            output.) This is used internally in recursive calls while
+            pretty-printing.
         :param eventual_encoding: The encoding of the final document.
             If this is None, the document will be a Unicode string.
+        :param formatter: Either a `Formatter` object, or a string naming one of
+            the standard formatters.
+        :param iterator: The iterator to use when navigating over the
+            parse tree. This is only used by `Tag.decode_contents` and
+            you probably won't need to use it.
         """
         if self.is_xml:
             # Print the XML declaration
             encoding_part = ''
+            declared_encoding: Optional[str] = eventual_encoding
             if eventual_encoding in PYTHON_SPECIFIC_ENCODINGS:
                 # This is a special Python encoding; it can't actually
                 # go into an XML document because it means nothing
                 # outside of Python.
-                eventual_encoding = None
-            if eventual_encoding != None:
-                encoding_part = ' encoding="%s"' % eventual_encoding
+                declared_encoding = None
+            if declared_encoding != None:
+                encoding_part = ' encoding="%s"' % declared_encoding
             prefix = '<?xml version="1.0"%s?>\n' % encoding_part
         else:
             prefix = ''
-        if not pretty_print:
-            indent_level = None
+
+        # Prior to 4.13.0, the first argument to this method was a
+        # bool called pretty_print, which gave the method a different
+        # signature from its superclass implementation, Tag.decode.
+        #
+        # The signatures of the two methods now match, but just in
+        # case someone is still passing a boolean in as the first
+        # argument to this method (or a keyword argument with the old
+        # name), we can handle it and put out a DeprecationWarning.
+        warning:Optional[str] = None
+        if isinstance(indent_level, bool):
+            if indent_level is True:
+                indent_level = 0
+            elif indent_level is False:
+                indent_level = None
+            warning = f"As of 4.13.0, the first argument to BeautifulSoup.decode has been changed from bool to int, to match Tag.decode. Pass in a value of {indent_level} instead."
         else:
-            indent_level = 0
+            pretty_print = kwargs.pop("pretty_print", None)
+            assert not kwargs
+            if pretty_print is not None:
+                if pretty_print is True:
+                    indent_level = 0
+                elif pretty_print is False:
+                    indent_level = None
+                warning = f"As of 4.13.0, the pretty_print argument to BeautifulSoup.decode has been removed, to match Tag.decode. Pass in a value of indent_level={indent_level} instead."
+
+        if warning:
+            warnings.warn(warning, DeprecationWarning, stacklevel=2)
+        elif indent_level is False or pretty_print is False:
+            indent_level = None
         return prefix + super(BeautifulSoup, self).decode(
             indent_level, eventual_encoding, formatter, iterator)
 
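The backward-compatibility shim in `decode()` above follows a reusable pattern: accept the legacy boolean (positionally or as the old keyword) in the new integer parameter's position, translate it, and emit a `DeprecationWarning`. A standalone sketch of the pattern — the function body and message text here are illustrative, not bs4's actual code:

```python
import warnings
from typing import Optional, Union

def decode(indent_level: Optional[Union[int, bool]] = None, **kwargs) -> str:
    """Render a document; formerly took pretty_print (bool), now indent_level (int)."""
    warning = None
    if isinstance(indent_level, bool):
        # Old positional style: decode(True) / decode(False).
        indent_level = 0 if indent_level else None
        warning = "Pass an int indent_level instead of a bool."
    elif "pretty_print" in kwargs:
        # Old keyword style: decode(pretty_print=True).
        pretty_print = kwargs.pop("pretty_print")
        indent_level = 0 if pretty_print else None
        warning = "pretty_print is deprecated; pass indent_level instead."
    if warning:
        warnings.warn(warning, DeprecationWarning, stacklevel=2)
    return f"<rendered indent_level={indent_level}>"
```

Note that the `isinstance(indent_level, bool)` check works because `bool` is a subclass of `int`, so checking for `bool` specifically is the only way to tell `decode(True)` apart from `decode(1)`.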
@@ -815,7 +987,7 @@ class BeautifulStoneSoup(BeautifulSoup):
     def __init__(self, *args, **kwargs):
         kwargs['features'] = 'xml'
         warnings.warn(
-            'The BeautifulStoneSoup class is deprecated. Instead of using '
+            'The BeautifulStoneSoup class was deprecated in version 4.0.0. Instead of using '
             'it, pass features="xml" into the BeautifulSoup constructor.',
             DeprecationWarning, stacklevel=2
         )
diff --git a/bs4/_deprecation.py b/bs4/_deprecation.py
new file mode 100644
index 0000000..febc1b3
--- /dev/null
+++ b/bs4/_deprecation.py
@@ -0,0 +1,57 @@
+"""Helper functions for deprecation.
+
+This interface is itself unstable and may change without warning. Do
+not use these functions yourself, even as a joke. The underscores are
+there for a reason.
+
+In particular, most of this will go away once Beautiful Soup drops
+support for Python 3.11, since Python 3.12 defines a
+`@typing.deprecated() decorator. <https://peps.python.org/pep-0702/>`_
+"""
+
+import functools
+import warnings
+
+from typing import (
+    Any,
+    Callable,
+)
+
+def _deprecated_alias(old_name, new_name, version):
+    """Alias one attribute name to another for backward compatibility
+
+    :meta private:
+    """
+    @property
+    def alias(self) -> Any:
+        ":meta private:"
+        warnings.warn(f"Access to deprecated property {old_name}. (Replaced by {new_name}) -- Deprecated since version {version}.", DeprecationWarning, stacklevel=2)
+        return getattr(self, new_name)
+
+    @alias.setter
+    def alias(self, value:str)->Any:
+        ":meta private:"
+        warnings.warn(f"Write to deprecated property {old_name}. (Replaced by {new_name}) -- Deprecated since version {version}.", DeprecationWarning, stacklevel=2)
+        return setattr(self, new_name, value)
+    return alias
+
+def _deprecated_function_alias(old_name:str, new_name:str, version:str) -> Callable:
+    def alias(self, *args, **kwargs):
+        ":meta private:"
+        warnings.warn(f"Call to deprecated method {old_name}. (Replaced by {new_name}) -- Deprecated since version {version}.", DeprecationWarning, stacklevel=2)
+        return getattr(self, new_name)(*args, **kwargs)
+    return alias
+
+def _deprecated(replaced_by:str, version:str) -> Callable:
+    def deprecate(func):
+        @functools.wraps(func)
+        def with_warning(*args, **kwargs):
+            ":meta private:"
+            warnings.warn(
+                f"Call to deprecated method {func.__name__}. (Replaced by {replaced_by}) -- Deprecated since version {version}.",
+                DeprecationWarning,
+                stacklevel=2
+            )
+            return func(*args, **kwargs)
+        return with_warning
+    return deprecate
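To show how a helper like `_deprecated_alias` is meant to be attached to a class, here is a minimal self-contained sketch. It re-declares a simplified version of the helper so the example runs on its own; the `Element` class and warning text are illustrative, not bs4's:

```python
import warnings
from typing import Any

def _deprecated_alias(old_name: str, new_name: str, version: str):
    """Build a property that forwards old_name to new_name, warning on every access."""
    @property
    def alias(self) -> Any:
        warnings.warn(f"{old_name} is deprecated since {version}; use {new_name}.",
                      DeprecationWarning, stacklevel=2)
        return getattr(self, new_name)

    @alias.setter
    def alias(self, value: Any) -> None:
        warnings.warn(f"{old_name} is deprecated since {version}; use {new_name}.",
                      DeprecationWarning, stacklevel=2)
        setattr(self, new_name, value)
    return alias

class Element:
    def __init__(self) -> None:
        self.next_element = None
    # Keep an old camelCase spelling alive as a warning property:
    nextElement = _deprecated_alias("nextElement", "next_element", "4.13.0")
```

Reads and writes through the old name keep working, but each one raises a `DeprecationWarning` that points the caller at the new name.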
diff --git a/bs4/_typing.py b/bs4/_typing.py
new file mode 100644
index 0000000..9fe58c3
--- /dev/null
+++ b/bs4/_typing.py
@@ -0,0 +1,99 @@
+# Custom type aliases used throughout Beautiful Soup to improve readability.
+
+# Notes on improvements to the type system in newer versions of Python
+# that can be used once Beautiful Soup drops support for older
+# versions:
+#
+# * In 3.10, x|y is an accepted shorthand for Union[x,y].
+# * In 3.10, TypeAlias gains capabilities that can be used to
+#   improve the tree matching types (I don't remember what, exactly).
+
+import re
+from typing_extensions import TypeAlias
+from typing import (
+    Callable,
+    Dict,
+    IO,
+    Iterable,
+    Pattern,
+    TYPE_CHECKING,
+    Union,
+)
+
+if TYPE_CHECKING:
+    from bs4.element import Tag
+
+# Aliases for markup in various stages of processing.
+#
+# The rawest form of markup: either a string or an open filehandle.
+_IncomingMarkup: TypeAlias = Union[str,bytes,IO]
+
+# Markup that is in memory but has (potentially) yet to be converted
+# to Unicode.
+_RawMarkup: TypeAlias = Union[str,bytes]
+
+# Aliases for character encodings
+#
+_Encoding:TypeAlias = str
+_Encodings:TypeAlias = Iterable[_Encoding]
+
+# Aliases for XML namespaces
+_NamespacePrefix:TypeAlias = str
+_NamespaceURL:TypeAlias = str
+_NamespaceMapping:TypeAlias = Dict[_NamespacePrefix, _NamespaceURL]
+_InvertedNamespaceMapping:TypeAlias = Dict[_NamespaceURL, _NamespacePrefix]
+
+# Aliases for the attribute values associated with HTML/XML tags.
+#
+# Note that these are attribute values in their final form, as stored
+# in the `Tag` class. Different parsers present attributes to the
+# `TreeBuilder` subclasses in different formats, which are not defined
+# here.
+_AttributeValue: TypeAlias = Union[str, Iterable[str]]
+_AttributeValues: TypeAlias = Dict[str, _AttributeValue]
+
+# Aliases to represent the many possibilities for matching bits of a
+# parse tree.
+#
+# This is very complicated because we're applying a formal type system
+# to some very DWIM code. The types we end up with will be the types
+# of the arguments to the SoupStrainer constructor and (more
+# familiarly to Beautiful Soup users) the find* methods.
+
+# A function that takes a Tag and returns a yes-or-no answer.
+# A TagNameMatchRule expects this kind of function, if you're
+# going to pass it a function.
+_TagMatchFunction:TypeAlias = Callable[['Tag'], bool]
+
+# A function that takes a single string and returns a yes-or-no
+# answer. An AttributeValueMatchRule expects this kind of function, if
+# you're going to pass it a function. So does a StringMatchRule
+_StringMatchFunction:TypeAlias = Callable[[str], bool]
+
+# A function that takes a Tag or string and returns a yes-or-no
+# answer.
+_TagOrStringMatchFunction:TypeAlias = Union[_TagMatchFunction, _StringMatchFunction, bool]
+
+# Either a tag name, an attribute value or a string can be matched
+# against a string, bytestring, regular expression, or a boolean.
+_BaseStrainable:TypeAlias = Union[str, bytes, Pattern[str], bool]
+
+# A tag can also be matched using a function that takes the Tag
+# as its sole argument.
+_BaseStrainableElement:TypeAlias = Union[_BaseStrainable, _TagMatchFunction]
+
+# A tag's attribute value can be matched using a function that takes
+# the value as its sole argument.
+_BaseStrainableAttribute:TypeAlias = Union[_BaseStrainable, _StringMatchFunction]
+
+# Finally, a tag name, attribute or string can be matched using either
+# a single criterion or a list of criteria.
+_StrainableElement:TypeAlias = Union[
+    _BaseStrainableElement, Iterable[_BaseStrainableElement]
+]
+_StrainableAttribute:TypeAlias = Union[
+    _BaseStrainableAttribute, Iterable[_BaseStrainableAttribute]
+]
+
+_StrainableAttributes:TypeAlias = Dict[str, _StrainableAttribute]
+_StrainableString:TypeAlias = _StrainableAttribute
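The `_BaseStrainable` union above formalizes the DWIM matching that the find* methods have always supported: a criterion may be a string, bytestring, compiled regex, boolean, or callable. A rough standalone sketch of how one such criterion might be tested against a tag name — the `matches` helper is hypothetical, not bs4 API:

```python
import re
from typing import Callable, Pattern, Union

# Mirrors _BaseStrainable plus the match-function variants.
Strainable = Union[str, bytes, Pattern[str], bool, Callable[[str], bool]]

def matches(criterion: Strainable, name: str) -> bool:
    """Interpret one match criterion, loosely following find*() semantics."""
    if isinstance(criterion, bool):
        return criterion                      # True matches everything, False nothing
    if isinstance(criterion, bytes):
        criterion = criterion.decode("utf8")  # bytestrings are compared as UTF-8 text
    if isinstance(criterion, str):
        return criterion == name
    if callable(criterion):
        return criterion(name)
    return criterion.search(name) is not None  # compiled regex
```

The `bool` check must come first: `True` would otherwise fall through, and since `bool` subclasses `int` it can't be distinguished later.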
diff --git a/bs4/builder/__init__.py b/bs4/builder/__init__.py
index 2e39745..671315d 100644
--- a/bs4/builder/__init__.py
+++ b/bs4/builder/__init__.py
@@ -1,9 +1,25 @@
+from __future__ import annotations
 # Use of this source code is governed by the MIT license.
 __license__ = "MIT"
 
 from collections import defaultdict
 import itertools
 import re
+from types import ModuleType
+from typing import (
+    Any,
+    cast,
+    Dict,
+    Iterable,
+    List,
+    Optional,
+    Pattern,
+    Set,
+    Tuple,
+    Type,
+    TYPE_CHECKING,
+    Union,
+)
 import warnings
 import sys
 from bs4.element import (
@@ -17,6 +33,18 @@ from bs4.element import (
     nonwhitespace_re
 )
 
+if TYPE_CHECKING:
+    from bs4 import BeautifulSoup
+    from bs4.element import (
+        NavigableString, Tag,
+        _AttributeValues, _AttributeValue,
+    )
+    from bs4._typing import (
+        _Encoding,
+        _Encodings,
+        _RawMarkup,
+    )
+
 __all__ = [
     'HTMLTreeBuilder',
     'SAXTreeBuilder',
@@ -36,29 +64,32 @@ class XMLParsedAsHTMLWarning(UserWarning):
     """The warning issued when an HTML parser is used to parse
     XML that is not XHTML.
     """
-    MESSAGE = """It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor."""
+    MESSAGE:str = """It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.""" #: :meta private:
 
 
 class TreeBuilderRegistry(object):
     """A way of looking up TreeBuilder subclasses by their name or by desired
     features.
     """
+
+    builders_for_feature: Dict[str, List[Type[TreeBuilder]]]
+    builders: List[Type[TreeBuilder]]
 
     def __init__(self):
         self.builders_for_feature = defaultdict(list)
         self.builders = []
 
-    def register(self, treebuilder_class):
+    def register(self, treebuilder_class:type[TreeBuilder]) -> None:
         """Register a treebuilder based on its advertised features.
 
-        :param treebuilder_class: A subclass of TreeBuilder. Its .features
-            attribute should list its features.
+        :param treebuilder_class: A subclass of `TreeBuilder`. Its
+            `TreeBuilder.features` attribute should list its features.
         """
         for feature in treebuilder_class.features:
             self.builders_for_feature[feature].insert(0, treebuilder_class)
         self.builders.insert(0, treebuilder_class)
 
-    def lookup(self, *features):
+    def lookup(self, *features:str) -> Optional[Type[TreeBuilder]]:
         """Look up a TreeBuilder subclass with the desired features.
 
         :param features: A list of features to look for. If none are
@@ -78,12 +109,12 @@ class TreeBuilderRegistry(object):
 
         # Go down the list of features in order, and eliminate any builders
         # that don't match every feature.
-        features = list(features)
-        features.reverse()
+        feature_list = list(features)
+        feature_list.reverse()
         candidates = None
         candidate_set = None
-        while len(features) > 0:
-            feature = features.pop()
+        while len(feature_list) > 0:
+            feature = feature_list.pop()
             we_have_the_feature = self.builders_for_feature.get(feature, [])
             if len(we_have_the_feature) > 0:
                 if candidates is None:
@@ -97,81 +128,61 @@ class TreeBuilderRegistry(object):
         # The only valid candidates are the ones in candidate_set.
         # Go through the original list of candidates and pick the first one
         # that's in candidate_set.
-        if candidate_set is None:
+        if candidate_set is None or candidates is None:
             return None
         for candidate in candidates:
             if candidate in candidate_set:
                 return candidate
         return None
 
-# The BeautifulSoup class will take feature lists from developers and use them
-# to look up builders in this registry.
-builder_registry = TreeBuilderRegistry()
+#: The `BeautifulSoup` constructor will take a list of features
+#: and use it to look up `TreeBuilder` classes in this registry.
+builder_registry:TreeBuilderRegistry = TreeBuilderRegistry()
 
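The register/lookup semantics above boil down to: each registration is prepended, so the most recently registered builder wins, and a multi-feature lookup intersects the per-feature candidate lists. A toy model of that behavior, using plain strings in place of `TreeBuilder` subclasses (a sketch, not bs4's exact algorithm):

```python
from collections import defaultdict
from typing import Dict, List, Optional

class Registry:
    """Toy TreeBuilderRegistry: newest registration wins a lookup."""
    def __init__(self) -> None:
        self.builders_for_feature: Dict[str, List[str]] = defaultdict(list)
        self.builders: List[str] = []

    def register(self, name: str, features: List[str]) -> None:
        # insert(0, ...) puts the newest builder at the front of every list.
        for feature in features:
            self.builders_for_feature[feature].insert(0, name)
        self.builders.insert(0, name)

    def lookup(self, *features: str) -> Optional[str]:
        if not features:
            return self.builders[0] if self.builders else None
        candidates: Optional[List[str]] = None
        for feature in features:
            have = self.builders_for_feature.get(feature, [])
            if candidates is None:
                candidates = list(have)
            else:
                # Keep only builders that also advertise this feature.
                candidates = [c for c in candidates if c in have]
        return candidates[0] if candidates else None

reg = Registry()
reg.register("html.parser", ["html", "strict"])
reg.register("lxml", ["html", "fast"])
```

With both registered, a lookup for `"html"` returns `"lxml"` (registered later), while `"html", "strict"` narrows back to `"html.parser"`.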
 class TreeBuilder(object):
-    """Turn a textual document into a Beautiful Soup object tree."""
+    """Turn a textual document into a Beautiful Soup object tree.
 
-    NAME = "[Unknown tree builder]"
-    ALTERNATE_NAMES = []
-    features = []
-
-    is_xml = False
-    picklable = False
-    empty_element_tags = None # A tag will be considered an empty-element
-                              # tag when and only when it has no contents.
-
-    # A value for these tag/attribute combinations is a space- or
-    # comma-separated list of CDATA, rather than a single CDATA.
-    DEFAULT_CDATA_LIST_ATTRIBUTES = defaultdict(list)
-
-    # Whitespace should be preserved inside these tags.
-    DEFAULT_PRESERVE_WHITESPACE_TAGS = set()
-
-    # The textual contents of tags with these names should be
-    # instantiated with some class other than NavigableString.
-    DEFAULT_STRING_CONTAINERS = {}
-
-    USE_DEFAULT = object()
+    This is an abstract superclass which smooths out the behavior of
+    different parser libraries into a single, unified interface.
+
+    :param multi_valued_attributes: If this is set to None, the
+        TreeBuilder will not turn any values for attributes like
+        'class' into lists. Setting this to a dictionary will
+        customize this behavior; look at DEFAULT_CDATA_LIST_ATTRIBUTES
+        for an example.
+
+        Internally, these are called "CDATA list attributes", but that
+        probably doesn't make sense to an end-user, so the argument name
+        is `multi_valued_attributes`.
+
+    :param preserve_whitespace_tags: A set of tags to treat
+        the way <pre> tags are treated in HTML. Tags in this set
+        are immune from pretty-printing; their contents will always be
+        output as-is.
+
+    :param string_containers: A dictionary mapping tag names to
+        the classes that should be instantiated to contain the textual
+        contents of those tags. The default is to use NavigableString
+        for every tag, no matter what the name. You can override the
+        default by changing DEFAULT_STRING_CONTAINERS.
+
+    :param store_line_numbers: If the parser keeps track of the
+        line numbers and positions of the original markup, that
+        information will, by default, be stored in each corresponding
+        `Tag` object. You can turn this off by passing
+        store_line_numbers=False. If the parser you're using doesn't
+        keep track of this information, then setting store_line_numbers=True
+        will do nothing.
+    """
 
-    # Most parsers don't keep track of line numbers.
-    TRACKS_LINE_NUMBERS = False
+    USE_DEFAULT: Any = object() #: :meta private:
 
-    def __init__(self, multi_valued_attributes=USE_DEFAULT,
-                 preserve_whitespace_tags=USE_DEFAULT,
-                 store_line_numbers=USE_DEFAULT,
-                 string_containers=USE_DEFAULT,
+    def __init__(self, multi_valued_attributes:Dict[str, Set[str]]=USE_DEFAULT,
+                 preserve_whitespace_tags:Set[str]=USE_DEFAULT,
+                 store_line_numbers:bool=USE_DEFAULT,
+                 string_containers:Dict[str, Type[NavigableString]]=USE_DEFAULT,
+                 empty_element_tags:Set[str]=USE_DEFAULT
                  ):
-        """Constructor.
-
-        :param multi_valued_attributes: If this is set to None, the
-           TreeBuilder will not turn any values for attributes like
-           'class' into lists. Setting this to a dictionary will
-           customize this behavior; look at DEFAULT_CDATA_LIST_ATTRIBUTES
-           for an example.
-
-           Internally, these are called "CDATA list attributes", but that
-           probably doesn't make sense to an end-user, so the argument name
-           is `multi_valued_attributes`.
-
-        :param preserve_whitespace_tags: A list of tags to treat
-           the way <pre> tags are treated in HTML. Tags in this list
-           are immune from pretty-printing; their contents will always be
-           output as-is.
-
-        :param string_containers: A dictionary mapping tag names to
-           the classes that should be instantiated to contain the textual
-           contents of those tags. The default is to use NavigableString
-           for every tag, no matter what the name. You can override the
-           default by changing DEFAULT_STRING_CONTAINERS.
-
-        :param store_line_numbers: If the parser keeps track of the
-           line numbers and positions of the original markup, that
-           information will, by default, be stored in each corresponding
-           `Tag` object. You can turn this off by passing
-           store_line_numbers=False. If the parser you're using doesn't
-           keep track of this information, then setting store_line_numbers=True
-           will do nothing.
-        """
         self.soup = None
         if multi_valued_attributes is self.USE_DEFAULT:
             multi_valued_attributes = self.DEFAULT_CDATA_LIST_ATTRIBUTES
@@ -179,14 +190,55 @@ class TreeBuilder(object):
         if preserve_whitespace_tags is self.USE_DEFAULT:
             preserve_whitespace_tags = self.DEFAULT_PRESERVE_WHITESPACE_TAGS
         self.preserve_whitespace_tags = preserve_whitespace_tags
+        if empty_element_tags is self.USE_DEFAULT:
+            self.empty_element_tags = self.DEFAULT_EMPTY_ELEMENT_TAGS
+        else:
+            self.empty_element_tags = empty_element_tags
         if store_line_numbers == self.USE_DEFAULT:
             store_line_numbers = self.TRACKS_LINE_NUMBERS
         self.store_line_numbers = store_line_numbers
         if string_containers == self.USE_DEFAULT:
             string_containers = self.DEFAULT_STRING_CONTAINERS
         self.string_containers = string_containers
+
+    NAME:str = "[Unknown tree builder]"
+    ALTERNATE_NAMES: Iterable[str] = []
+    features: Iterable[str] = []
+
+    is_xml: bool = False
+    picklable: bool = False
+
+    soup: Optional[BeautifulSoup] #: :meta private:
+
+    #: A tag will be considered an empty-element
+    #: tag when and only when it has no contents.
+    empty_element_tags: Optional[Set[str]] = None #: :meta private:
+    cdata_list_attributes: Dict[str, Set[str]] #: :meta private:
+    preserve_whitespace_tags: Set[str] #: :meta private:
+    string_containers: Dict[str, Type[NavigableString]] #: :meta private:
+    tracks_line_numbers: bool #: :meta private:
+
+    #: A value for these tag/attribute combinations is a space- or
+    #: comma-separated list of CDATA, rather than a single CDATA.
+    DEFAULT_CDATA_LIST_ATTRIBUTES : Dict[str, Set[str]] = defaultdict(set)
+
+    #: Whitespace should be preserved inside these tags.
+    DEFAULT_PRESERVE_WHITESPACE_TAGS : Set[str] = set()
+
+    #: The textual contents of tags with these names should be
+    #: instantiated with some class other than NavigableString.
+    DEFAULT_STRING_CONTAINERS: Dict[str, Type[NavigableString]] = {}
+
+    #: By default, tags are treated as empty-element tags if they have
+    #: no contents--that is, using XML rules. HTMLTreeBuilder
+    #: defines a different set of DEFAULT_EMPTY_ELEMENT_TAGS based on the
+    #: HTML 4 and HTML5 standards.
+    DEFAULT_EMPTY_ELEMENT_TAGS: Optional[Set] = None
+
+    #: Most parsers don't keep track of line numbers.
+    TRACKS_LINE_NUMBERS: bool = False
 
-    def initialize_soup(self, soup):
+    def initialize_soup(self, soup:BeautifulSoup) -> None:
         """The BeautifulSoup object has been initialized and is now
         being associated with the TreeBuilder.
 
@@ -194,7 +246,7 @@ class TreeBuilder(object):
         """
         self.soup = soup
 
-    def reset(self):
+    def reset(self) -> None:
         """Do any work necessary to reset the underlying parser
         for a new document.
 
@@ -202,7 +254,7 @@ class TreeBuilder(object):
         """
         pass
 
-    def can_be_empty_element(self, tag_name):
+    def can_be_empty_element(self, tag_name:str) -> bool:
         """Might a tag with this name be an empty-element tag?
 
         The final markup may or may not actually present this tag as
@@ -225,46 +277,48 @@ class TreeBuilder(object):
             return True
         return tag_name in self.empty_element_tags
 
-    def feed(self, markup):
+    def feed(self, markup:str) -> None:
         """Run some incoming markup through some parsing process,
-        populating the `BeautifulSoup` object in self.soup.
-
-        This method is not implemented in TreeBuilder; it must be
-        implemented in subclasses.
-
-        :return: None.
+        populating the `BeautifulSoup` object in `TreeBuilder.soup`
         """
         raise NotImplementedError()
 
-    def prepare_markup(self, markup, user_specified_encoding=None,
-                       document_declared_encoding=None, exclude_encodings=None):
+    def prepare_markup(
+            self, markup:_RawMarkup,
+            user_specified_encoding:Optional[_Encoding]=None,
+            document_declared_encoding:Optional[_Encoding]=None,
+            exclude_encodings:Optional[_Encodings]=None
+    ) -> Iterable[Tuple[_RawMarkup, Optional[_Encoding], Optional[_Encoding], bool]]:
         """Run any preliminary steps necessary to make incoming markup
         acceptable to the parser.
 
-        :param markup: Some markup -- probably a bytestring.
-        :param user_specified_encoding: The user asked to try this encoding.
+        :param markup: The markup that's about to be parsed.
+        :param user_specified_encoding: The user asked to try this encoding
+            to convert the markup into a Unicode string.
         :param document_declared_encoding: The markup itself claims to be
             in this encoding. NOTE: This argument is not used by the
             calling code and can probably be removed.
-        :param exclude_encodings: The user asked _not_ to try any of
+        :param exclude_encodings: The user asked *not* to try any of
             these encodings.
 
-        :yield: A series of 4-tuples:
-            (markup, encoding, declared encoding,
-             has undergone character replacement)
+        :yield: A series of 4-tuples: (markup, encoding, declared encoding,
+            has undergone character replacement)
 
-        Each 4-tuple represents a strategy for converting the
-        document to Unicode and parsing it. Each strategy will be tried
-        in turn.
+        Each 4-tuple represents a strategy that the parser can try
+        to convert the document to Unicode and parse it. Each
+        strategy will be tried in turn.
 
         By default, the only strategy is to parse the markup
         as-is. See `LXMLTreeBuilderForXML` and
         `HTMLParserTreeBuilder` for implementations that take into
         account the quirks of particular parsers.
+
+        :meta private:
+
         """
         yield markup, None, None, False
 
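The 4-tuple contract of `prepare_markup` amounts to a generator of decoding strategies that a caller consumes until one works. A hypothetical consumer loop, simplified to just the decoding step (the function names and encoding list are illustrative, not bs4's):

```python
from typing import Iterable, Optional, Tuple

# (markup, encoding, declared_encoding, has_undergone_character_replacement)
Strategy = Tuple[str, Optional[str], Optional[str], bool]

def prepare_markup(markup: bytes) -> Iterable[Strategy]:
    """Yield one decoding attempt per candidate encoding."""
    for encoding in ("utf-8", "latin-1"):
        try:
            yield markup.decode(encoding), encoding, None, False
        except UnicodeDecodeError:
            continue  # this strategy failed; let the caller try the next one

def parse_with_fallbacks(markup: bytes) -> Tuple[str, Optional[str]]:
    """Consume strategies in turn; the first one that decodes wins."""
    for text, encoding, _declared, _replaced in prepare_markup(markup):
        return text, encoding  # a real caller would attempt a full parse here
    raise ValueError("no decoding strategy succeeded")
```

Because the strategies are yielded lazily, later (more expensive or lossier) conversions are only attempted when earlier ones fail.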
-    def test_fragment_to_document(self, fragment):
+    def test_fragment_to_document(self, fragment:str) -> str:
         """Wrap an HTML fragment to make it look like a document.
 
         Different parsers do this differently. For instance, lxml
@@ -273,26 +327,27 @@ class TreeBuilder(object):
         which run HTML fragments through the parser and compare the
         results against other HTML fragments.
 
-        This method should not be used outside of tests.
+        This method should not be used outside of unit tests.
 
-        :param fragment: A string -- fragment of HTML.
-        :return: A string -- a full HTML document.
+        :param fragment: A fragment of HTML.
+        :return: A full HTML document.
+        :meta private:
         """
         return fragment
 
-    def set_up_substitutions(self, tag):
+    def set_up_substitutions(self, tag:Tag) -> bool:
         """Set up any substitutions that will need to be performed on
         a `Tag` when it's output as a string.
 
         By default, this does nothing. See `HTMLTreeBuilder` for a
         case where this is used.
 
-        :param tag: A `Tag`
         :return: Whether or not a substitution was performed.
+        :meta private:
         """
         return False
 
-    def _replace_cdata_list_attribute_values(self, tag_name, attrs):
+    def _replace_cdata_list_attribute_values(self, tag_name:str, attrs:_AttributeValues):
         """When an attribute value is associated with a tag that can
         have multiple values for that attribute, convert the string
         value to a list of strings.
@@ -308,10 +363,11 @@ class TreeBuilder(object):
         if not attrs:
             return attrs
         if self.cdata_list_attributes:
-            universal = self.cdata_list_attributes.get('*', [])
+            universal: Set[str] = self.cdata_list_attributes.get('*', set())
             tag_specific = self.cdata_list_attributes.get(
                 tag_name.lower(), None)
             for attr in list(attrs.keys()):
+                values: _AttributeValue
                 if attr in universal or (tag_specific and attr in tag_specific):
                     # We have a "class"-type attribute whose string
                     # value is a whitespace-separated list of
@@ -337,7 +393,15 @@ class SAXTreeBuilder(TreeBuilder):
     how a simple TreeBuilder would work.
     """
 
-    def feed(self, markup):
+    def __init__(self, *args, **kwargs):
+        warnings.warn(
+            f"The SAXTreeBuilder class was deprecated in 4.13.0. It is completely untested and probably doesn't work; use at your own risk.",
+            DeprecationWarning,
+            stacklevel=2
+        )
+        super(SAXTreeBuilder, self).__init__(*args, **kwargs)
+
+    def feed(self, markup:_RawMarkup):
         raise NotImplementedError()
 
     def close(self):
@@ -381,12 +445,13 @@ class SAXTreeBuilder(TreeBuilder):
 
 
 class HTMLTreeBuilder(TreeBuilder):
-    """This TreeBuilder knows facts about HTML.
-
-    Such as which tags are empty-element tags.
+    """This TreeBuilder knows facts about HTML, such as which tags are treated
+    specially by the HTML standard.
     """
 
-    empty_element_tags = set([
+    #: Some HTML tags are defined as having no contents. Beautiful Soup
+    #: treats these specially.
+    DEFAULT_EMPTY_ELEMENT_TAGS: Set[str] = set([
         # These are from HTML5.
         'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input', 'keygen', 'link', 'menuitem', 'meta', 'param', 'source', 'track', 'wbr',
 
@@ -394,29 +459,29 @@ class HTMLTreeBuilder(TreeBuilder):
         'basefont', 'bgsound', 'command', 'frame', 'image', 'isindex', 'nextid', 'spacer'
     ])
 
-    # The HTML standard defines these as block-level elements. Beautiful
-    # Soup does not treat these elements differently from other elements,
-    # but it may do so eventually, and this information is available if
-    # you need to use it.
-    block_elements = set(["address", "article", "aside", "blockquote", "canvas", "dd", "div", "dl", "dt", "fieldset", "figcaption", "figure", "footer", "form", "h1", "h2", "h3", "h4", "h5", "h6", "header", "hr", "li", "main", "nav", "noscript", "ol", "output", "p", "pre", "section", "table", "tfoot", "ul", "video"])
+    #: The HTML standard defines these tags as block-level elements. Beautiful
+    #: Soup does not treat these elements differently from other elements,
+    #: but it may do so eventually, and this information is available if
+    #: you need to use it.
+    DEFAULT_BLOCK_ELEMENTS: Set[str] = set(["address", "article", "aside", "blockquote", "canvas", "dd", "div", "dl", "dt", "fieldset", "figcaption", "figure", "footer", "form", "h1", "h2", "h3", "h4", "h5", "h6", "header", "hr", "li", "main", "nav", "noscript", "ol", "output", "p", "pre", "section", "table", "tfoot", "ul", "video"])
 
-    # These HTML tags need special treatment so they can be
-    # represented by a string class other than NavigableString.
-    #
-    # For some of these tags, it's because the HTML standard defines
-    # an unusual content model for them. I made this list by going
-    # through the HTML spec
-    # (https://html.spec.whatwg.org/#metadata-content) and looking for
-    # "metadata content" elements that can contain strings.
-    #
-    # The Ruby tags (<rt> and <rp>) are here despite being normal
-    # "phrasing content" tags, because the content they contain is
-    # qualitatively different from other text in the document, and it
-    # can be useful to be able to distinguish it.
-    #
-    # TODO: Arguably <noscript> could go here but it seems
-    # qualitatively different from the other tags.
-    DEFAULT_STRING_CONTAINERS = {
+    #: These HTML tags need special treatment so they can be
+    #: represented by a string class other than NavigableString.
+    #:
+    #: For some of these tags, it's because the HTML standard defines
+    #: an unusual content model for them. I made this list by going
+    #: through the HTML spec
+    #: (https://html.spec.whatwg.org/#metadata-content) and looking for
+    #: "metadata content" elements that can contain strings.
+    #:
+    #: The Ruby tags (<rt> and <rp>) are here despite being normal
+    #: "phrasing content" tags, because the content they contain is
+    #: qualitatively different from other text in the document, and it
+    #: can be useful to be able to distinguish it.
+    #:
+    #: TODO: Arguably <noscript> could go here but it seems
+    #: qualitatively different from the other tags.
+    DEFAULT_STRING_CONTAINERS: Dict[str, Type[NavigableString]] = {
         'rt' : RubyTextString,
         'rp' : RubyParenthesisString,
         'style': Stylesheet,
@@ -424,33 +489,35 @@ class HTMLTreeBuilder(TreeBuilder):
         'template': TemplateString,
     }
 
-    # The HTML standard defines these attributes as containing a
-    # space-separated list of values, not a single value. That is,
-    # class="foo bar" means that the 'class' attribute has two values,
-    # 'foo' and 'bar', not the single value 'foo bar'. When we
-    # encounter one of these attributes, we will parse its value into
-    # a list of values if possible. Upon output, the list will be
-    # converted back into a string.
-    DEFAULT_CDATA_LIST_ATTRIBUTES = {
-        "*" : ['class', 'accesskey', 'dropzone'],
-        "a" : ['rel', 'rev'],
-        "link" : ['rel', 'rev'],
-        "td" : ["headers"],
-        "th" : ["headers"],
-        "td" : ["headers"],
-        "form" : ["accept-charset"],
-        "object" : ["archive"],
+    #: The HTML standard defines these attributes as containing a
+    #: space-separated list of values, not a single value. That is,
+    #: class="foo bar" means that the 'class' attribute has two values,
+    #: 'foo' and 'bar', not the single value 'foo bar'. When we
+    #: encounter one of these attributes, we will parse its value into
+    #: a list of values if possible. Upon output, the list will be
+    #: converted back into a string.
+    DEFAULT_CDATA_LIST_ATTRIBUTES: Dict[str, Set[str]] = {
+        "*" : {'class', 'accesskey', 'dropzone'},
+        "a" : {'rel', 'rev'},
+        "link" : {'rel', 'rev'},
+        "td" : {"headers"},
+        "th" : {"headers"},
+        "td" : {"headers"},
+        "form" : {"accept-charset"},
+        "object" : {"archive"},
 
         # These are HTML5 specific, as are *.accesskey and *.dropzone above.
-        "area" : ["rel"],
-        "icon" : ["sizes"],
-        "iframe" : ["sandbox"],
-        "output" : ["for"],
+        "area" : {"rel"},
+        "icon" : {"sizes"},
+        "iframe" : {"sandbox"},
+        "output" : {"for"},
     }
 
-    DEFAULT_PRESERVE_WHITESPACE_TAGS = set(['pre', 'textarea'])
+    #: By default, whitespace inside these HTML tags will be
+    #: preserved rather than being collapsed.
+    DEFAULT_PRESERVE_WHITESPACE_TAGS: set[str] = set(['pre', 'textarea'])
 
-    def set_up_substitutions(self, tag):
+    def set_up_substitutions(self, tag:Tag) -> bool:
         """Replace the declared encoding in a <meta> tag with a placeholder,
         to be substituted when the tag is output to a string.
 
@@ -458,17 +525,26 @@ class HTMLTreeBuilder(TreeBuilder):
         encoding, but exit in a different encoding, and the <meta> tag
         needs to be changed to reflect this.
 
-        :param tag: A `Tag`
         :return: Whether or not a substitution was performed.
+
+        :meta private:
         """
         # We are only interested in <meta> tags
         if tag.name != 'meta':
             return False
 
-        http_equiv = tag.get('http-equiv')
-        content = tag.get('content')
-        charset = tag.get('charset')
-
+        # TODO: This cast will fail in the (very unlikely) scenario
+        # that the programmer who instantiates the TreeBuilder
+        # specifies meta['content'] or meta['charset'] as
+        # cdata_list_attributes.
+        content:Optional[str] = cast(Optional[str], tag.get('content'))
+        charset:Optional[str] = cast(Optional[str], tag.get('charset'))
+
+        # But we can accommodate meta['http-equiv'] being made a
+        # cdata_list_attribute (again, very unlikely) without much
+        # trouble.
+        http_equiv:List[str] = tag.get_attribute_list('http-equiv')
+
         # We are interested in <meta> tags that say what encoding the
         # document was originally in. This means HTML 5-style <meta>
         # tags that provide the "charset" attribute. It also means
@@ -478,20 +554,22 @@ class HTMLTreeBuilder(TreeBuilder):
         # In both cases we will replace the value of the appropriate
         # attribute with a standin object that can take on any
         # encoding.
-        meta_encoding = None
+        substituted = False
         if charset is not None:
             # HTML 5 style:
             # <meta charset="utf8">
             meta_encoding = charset
             tag['charset'] = CharsetMetaAttributeValue(charset)
+            substituted = True
 
-        elif (content is not None and http_equiv is not None
-              and http_equiv.lower() == 'content-type'):
+        elif (content is not None and
+              any(x.lower() == 'content-type' for x in http_equiv)):
             # HTML 4 style:
             # <meta http-equiv="content-type" content="text/html; charset=utf8">
             tag['content'] = ContentMetaAttributeValue(content)
+            substituted = True
 
-        return (meta_encoding is not None)
+        return substituted
 
 class DetectsXMLParsedAsHTML(object):
     """A mixin class for any class (a TreeBuilder, or some class used by a
@@ -502,19 +580,29 @@ class DetectsXMLParsedAsHTML(object):
     This requires being able to observe an incoming processing
     instruction that might be an XML declaration, and also able to
     observe tags as they're opened. If you can't do that for a given
-    TreeBuilder, there's a less reliable implementation based on
+    `TreeBuilder`, there's a less reliable implementation based on
     examining the raw markup.
     """
 
-    # Regular expression for seeing if markup has an <html> tag.
-    LOOKS_LIKE_HTML = re.compile("<[^ +]html", re.I)
-    LOOKS_LIKE_HTML_B = re.compile(b"<[^ +]html", re.I)
+    #: Regular expression for seeing if string markup has an <html> tag.
+    LOOKS_LIKE_HTML:Pattern[str] = re.compile("<[^ +]html", re.I)
 
-    XML_PREFIX = '<?xml'
-    XML_PREFIX_B = b'<?xml'
+    #: Regular expression for seeing if byte markup has an <html> tag.
+    LOOKS_LIKE_HTML_B:Pattern[bytes] = re.compile(b"<[^ +]html", re.I)
+
+    #: The start of an XML document string.
+    XML_PREFIX:str = '<?xml'
+
+    #: The start of an XML document bytestring.
+    XML_PREFIX_B:bytes = b'<?xml'
+
+    # This is typed as str, not `ProcessingInstruction`, because this
+    # check may be run before any Beautiful Soup objects are created.
+    _first_processing_instruction: Optional[str]
+    _root_tag: Optional[Tag]
 
     @classmethod
-    def warn_if_markup_looks_like_xml(cls, markup):
+    def warn_if_markup_looks_like_xml(cls, markup:Optional[_RawMarkup]) -> bool:
         """Perform a check on some markup to see if it looks like XML
         that's not XHTML. If so, issue a warning.
 
@@ -524,34 +612,40 @@ class DetectsXMLParsedAsHTML(object):
         :return: True if the markup looks like non-XHTML XML, False
             otherwise.
         """
+        if markup is None:
+            return False
+        markup = markup[:500]
         if isinstance(markup, bytes):
-            prefix = cls.XML_PREFIX_B
-            looks_like_html = cls.LOOKS_LIKE_HTML_B
+            markup_b = cast(bytes, markup)
+            looks_like_xml = (
+                markup_b.startswith(cls.XML_PREFIX_B)
+                and not cls.LOOKS_LIKE_HTML_B.search(markup)
+            )
         else:
-            prefix = cls.XML_PREFIX
-            looks_like_html = cls.LOOKS_LIKE_HTML
-
-        if (markup is not None
-            and markup.startswith(prefix)
-            and not looks_like_html.search(markup[:500])
-        ):
+            markup_s = cast(str, markup)
+            looks_like_xml = (
+                markup_s.startswith(cls.XML_PREFIX)
+                and not cls.LOOKS_LIKE_HTML.search(markup)
+            )
+
+        if looks_like_xml:
             cls._warn()
             return True
         return False
 
     @classmethod
-    def _warn(cls):
+    def _warn(cls) -> None:
         """Issue a warning about XML being parsed as HTML."""
         warnings.warn(
             XMLParsedAsHTMLWarning.MESSAGE, XMLParsedAsHTMLWarning
         )
 
-    def _initialize_xml_detector(self):
+    def _initialize_xml_detector(self) -> None:
         """Call this method before parsing a document."""
         self._first_processing_instruction = None
         self._root_tag = None
 
-    def _document_might_be_xml(self, processing_instruction):
+    def _document_might_be_xml(self, processing_instruction:str):
         """Call this method when encountering an XML declaration, or a
         "processing instruction" that might be an XML declaration.
         """
@@ -586,7 +680,7 @@ class DetectsXMLParsedAsHTML(object):
             self._warn()
 
 
-def register_treebuilders_from(module):
+def register_treebuilders_from(module:ModuleType) -> None:
     """Copy TreeBuilders from the given module into this module."""
     this_module = sys.modules[__name__]
     for name in module.__all__:
@@ -602,7 +696,7 @@ class ParserRejectedMarkup(Exception):
     """An Exception to be raised when the underlying parser simply
     refuses to parse the given markup.
     """
-    def __init__(self, message_or_exception):
+    def __init__(self, message_or_exception:Union[str,Exception]):
         """Explain why the parser rejected the given markup, either
         with a textual explanation or another exception.
         """
diff --git a/bs4/builder/_html5lib.py b/bs4/builder/_html5lib.py
index dac2173..560a036 100644
--- a/bs4/builder/_html5lib.py
+++ b/bs4/builder/_html5lib.py
@@ -5,6 +5,20 @@ __all__ = [
     'HTML5TreeBuilder',
     ]
 
+from typing import (
+    Iterable,
+    List,
+    Optional,
+    TYPE_CHECKING,
+    Tuple,
+    Union,
+)
+from bs4._typing import (
+    _Encoding,
+    _Encodings,
+    _RawMarkup,
+)
+
 import warnings
 import re
 from bs4.builder import (
@@ -30,50 +44,54 @@ from bs4.element import (
     Tag,
     )
 
-try:
-    # Pre-0.99999999
-    from html5lib.treebuilders import _base as treebuilder_base
-    new_html5lib = False
-except ImportError as e:
-    # 0.99999999 and up
-    from html5lib.treebuilders import base as treebuilder_base
-    new_html5lib = True
+from html5lib.treebuilders import base as treebuilder_base
 
 class HTML5TreeBuilder(HTMLTreeBuilder):
-    """Use html5lib to build a tree.
+    """Use `html5lib <https://github.com/html5lib/html5lib-python>`_ to
+    build a tree.
 
-    Note that this TreeBuilder does not support some features common
-    to HTML TreeBuilders. Some of these features could theoretically
+    Note that `HTML5TreeBuilder` does not support some common HTML
+    `TreeBuilder` features. Some of these features could theoretically
     be implemented, but at the very least it's quite difficult,
     because html5lib moves the parse tree around as it's being built.
 
-    * This TreeBuilder doesn't use different subclasses of NavigableString
-      based on the name of the tag in which the string was found.
+    Specifically:
 
-    * You can't use a SoupStrainer to parse only part of a document.
+    * This `TreeBuilder` doesn't use different subclasses of
+      `NavigableString` (e.g. `Script`) based on the name of the tag
+      in which the string was found.
+    * You can't use a `SoupStrainer` to parse only part of a document.
     """
 
-    NAME = "html5lib"
+    NAME:str = "html5lib"
 
-    features = [NAME, PERMISSIVE, HTML_5, HTML]
+    features:Iterable[str] = [NAME, PERMISSIVE, HTML_5, HTML]
 
-    # html5lib can tell us which line number and position in the
-    # original file is the source of an element.
-    TRACKS_LINE_NUMBERS = True
+    #: html5lib can tell us which line number and position in the
+    #: original file is the source of an element.
+    TRACKS_LINE_NUMBERS:bool = True
 
-    def prepare_markup(self, markup, user_specified_encoding,
-                       document_declared_encoding=None, exclude_encodings=None):
+    def prepare_markup(self, markup:_RawMarkup,
+                       user_specified_encoding:Optional[_Encoding]=None,
+                       document_declared_encoding:Optional[_Encoding]=None,
+                       exclude_encodings:Optional[_Encodings]=None
+    ) -> Iterable[Tuple[_RawMarkup, Optional[_Encoding], Optional[_Encoding], bool]]:
         # Store the user-specified encoding for use later on.
         self.user_specified_encoding = user_specified_encoding
 
         # document_declared_encoding and exclude_encodings aren't used
         # ATM because the html5lib TreeBuilder doesn't use
         # UnicodeDammit.
-        if exclude_encodings:
-            warnings.warn(
-                "You provided a value for exclude_encoding, but the html5lib tree builder doesn't support exclude_encoding.",
-                stacklevel=3
-            )
+        for variable, name in (
+            (document_declared_encoding, 'document_declared_encoding'),
+            (exclude_encodings, 'exclude_encodings'),
+        ):
+            if variable:
+                warnings.warn(
+                    f"You provided a value for {name}, but the html5lib tree builder doesn't support {name}.",
+                    stacklevel=3
+                )
 
         # html5lib only parses HTML, so if it's given XML that's worth
         # noting.
@@ -83,6 +101,9 @@ class HTML5TreeBuilder(HTMLTreeBuilder):
 
     # These methods are defined by Beautiful Soup.
     def feed(self, markup):
+        """Run some incoming markup through some parsing process,
+        populating the `BeautifulSoup` object in `HTML5TreeBuilder.soup`.
+        """
         if self.soup.parse_only is not None:
             warnings.warn(
                 "You provided a value for parse_only, but the html5lib tree builder doesn't support parse_only. The entire document will be parsed.",
@@ -92,10 +113,7 @@ class HTML5TreeBuilder(HTMLTreeBuilder):
         self.underlying_builder.parser = parser
         extra_kwargs = dict()
         if not isinstance(markup, str):
-            if new_html5lib:
-                extra_kwargs['override_encoding'] = self.user_specified_encoding
-            else:
-                extra_kwargs['encoding'] = self.user_specified_encoding
+            extra_kwargs['override_encoding'] = self.user_specified_encoding
         doc = parser.parse(markup, **extra_kwargs)
 
         # Set the character encoding detected by the tokenizer.
@@ -105,15 +123,18 @@ class HTML5TreeBuilder(HTMLTreeBuilder):
             doc.original_encoding = None
         else:
             original_encoding = parser.tokenizer.stream.charEncoding[0]
-            if not isinstance(original_encoding, str):
-                # In 0.99999999 and up, the encoding is an html5lib
-                # Encoding object. We want to use a string for compatibility
-                # with other tree builders.
-                original_encoding = original_encoding.name
+            # The encoding is an html5lib Encoding object. We want to
+            # use a string for compatibility with other tree builders.
+            original_encoding = original_encoding.name
             doc.original_encoding = original_encoding
         self.underlying_builder.parser = None
 
     def create_treebuilder(self, namespaceHTMLElements):
+        """Called by html5lib to instantiate the kind of class it
+        calls a 'TreeBuilder'.
+
+        :meta private:
+        """
         self.underlying_builder = TreeBuilderForHtml5lib(
             namespaceHTMLElements, self.soup,
             store_line_numbers=self.store_line_numbers
diff --git a/bs4/builder/_htmlparser.py b/bs4/builder/_htmlparser.py
index 3cc187f..291f6c6 100644
--- a/bs4/builder/_htmlparser.py
+++ b/bs4/builder/_htmlparser.py
@@ -1,4 +1,5 @@
 # encoding: utf-8
+from __future__ import annotations
 """Use the HTMLParser library to parse HTML files that aren't too bad."""
 
 # Use of this source code is governed by the MIT license.
@@ -11,6 +12,19 @@ __all__ = [
 from html.parser import HTMLParser
 
 import sys
+from typing import (
+    Any,
+    Callable,
+    cast,
+    Dict,
+    Iterable,
+    List,
+    Optional,
+    TYPE_CHECKING,
+    Tuple,
+    Type,
+    Union,
+)
 import warnings
 
 from bs4.element import (
@@ -30,21 +44,25 @@ from bs4.builder import (
     STRICT,
     )
 
-
+from bs4.element import Tag
+if TYPE_CHECKING:
+    from bs4 import BeautifulSoup
+    from bs4.element import NavigableString
+    from bs4._typing import (
+        _AttributeValues,
+        _Encoding,
+        _Encodings,
+        _RawMarkup,
+    )
+
 HTMLPARSER = 'html.parser'
 
+_DuplicateAttributeHandler = Callable[[Dict[str, str], str, str], None]
+
 class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML):
     """A subclass of the Python standard library's HTMLParser class, which
     listens for HTMLParser events and translates them into calls
     to Beautiful Soup's tree construction API.
-    """
-
-    # Strategies for handling duplicate attributes
-    IGNORE = 'ignore'
-    REPLACE = 'replace'
-
-    def __init__(self, *args, **kwargs):
-        """Constructor.
 
         :param on_duplicate_attribute: A strategy for what to do if a
             tag includes the same attribute more than once. Accepted
@@ -53,8 +71,10 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML):
             encountered), or a callable. A callable must take three
             arguments: the dictionary of attributes already processed,
             the name of the duplicate attribute, and the most recent value
-            encountered. 
+            encountered.
         """
+    def __init__(self, soup:BeautifulSoup, *args, **kwargs):
+        self.soup = soup
         self.on_duplicate_attribute = kwargs.pop(
             'on_duplicate_attribute', self.REPLACE
         )
@@ -70,8 +90,20 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML):
         self.already_closed_empty_element = []
 
         self._initialize_xml_detector()
+
+    #: Constant to handle duplicate attributes by replacing earlier values
+    #: with later ones.
+    IGNORE:str = 'ignore'
+
+    #: Constant to handle duplicate attributes by ignoring later values
+    #: and keeping the earlier ones.
+    REPLACE:str = 'replace'
 
-    def error(self, message):
+    on_duplicate_attribute:Union[str, _DuplicateAttributeHandler]
+    already_closed_empty_element: List[str]
+    soup: BeautifulSoup
+
+    def error(self, message:str) -> None:
         # NOTE: This method is required so long as Python 3.9 is
         # supported. The corresponding code is removed from HTMLParser
         # in 3.5, but not removed from ParserBase until 3.10.
@@ -87,32 +119,33 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML):
         # catch this error and wrap it in a ParserRejectedMarkup.)
         raise ParserRejectedMarkup(message)
 
-    def handle_startendtag(self, name, attrs):
+    def handle_startendtag(
+        self, name:str, attrs:List[Tuple[str, Optional[str]]]
+    ) -> None:
         """Handle an incoming empty-element tag.
 
-        This is only called when the markup looks like <tag/>.
-
-        :param name: Name of the tag.
-        :param attrs: Dictionary of the tag's attributes.
+        html.parser only calls this method when the markup looks like
+        <tag/>.
         """
-        # is_startend() tells handle_starttag not to close the tag
+        # `handle_empty_element` tells handle_starttag not to close the tag
         # just because its name matches a known empty-element tag. We
-        # know that this is an empty-element tag and we want to call
+        # know that this is an empty-element tag, and we want to call
         # handle_endtag ourselves.
-        tag = self.handle_starttag(name, attrs, handle_empty_element=False)
+        self.handle_starttag(name, attrs, handle_empty_element=False)
         self.handle_endtag(name)
 
-    def handle_starttag(self, name, attrs, handle_empty_element=True):
+    def handle_starttag(
+        self, name:str, attrs:List[Tuple[str, Optional[str]]],
+        handle_empty_element:bool=True
+    ) -> None:
         """Handle an opening tag, e.g. '<tag>'
 
-        :param name: Name of the tag.
-        :param attrs: Dictionary of the tag's attributes.
         :param handle_empty_element: True if this tag is known to be
             an empty-element tag (i.e. there is not expected to be any
             closing tag).
         """
-        # XXX namespace
-        attr_dict = {}
+        # TODO: handle namespaces here?
+        attr_dict: Dict[str, str] = {}
         for key, value in attrs:
             # Change None attribute values to the empty string
             # for consistency with the other tree builders.
@@ -128,6 +161,7 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML):
                 elif on_dupe in (None, self.REPLACE):
                     attr_dict[key] = value
                 else:
+                    on_dupe = cast(_DuplicateAttributeHandler, on_dupe)
                     on_dupe(attr_dict, key, value)
             else:
                 attr_dict[key] = value
@@ -157,7 +191,7 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML):
         if self._root_tag is None:
             self._root_tag_encountered(name)
 
-    def handle_endtag(self, name, check_already_closed=True):
+    def handle_endtag(self, name:str, check_already_closed:bool=True) -> None:
         """Handle a closing tag, e.g. '</tag>'
 
         :param name: A tag name.
@@ -175,11 +209,11 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML):
         else:
             self.soup.handle_endtag(name)
 
-    def handle_data(self, data):
+    def handle_data(self, data:str) -> None:
         """Handle some textual data that shows up between tags."""
         self.soup.handle_data(data)
 
-    def handle_charref(self, name):
+    def handle_charref(self, name:str) -> None:
         """Handle a numeric character reference by converting it to the
         corresponding Unicode character and treating it as textual
         data.
@@ -219,7 +253,7 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML):
219 data = data or "\N{REPLACEMENT CHARACTER}"253 data = data or "\N{REPLACEMENT CHARACTER}"
220 self.handle_data(data)254 self.handle_data(data)
221255
222 def handle_entityref(self, name):256 def handle_entityref(self, name:str) -> None:
223 """Handle a named entity reference by converting it to the257 """Handle a named entity reference by converting it to the
224 corresponding Unicode character(s) and treating it as textual258 corresponding Unicode character(s) and treating it as textual
225 data.259 data.
@@ -238,7 +272,7 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML):
238 data = "&%s" % name272 data = "&%s" % name
239 self.handle_data(data)273 self.handle_data(data)
240274
241 def handle_comment(self, data):275 def handle_comment(self, data:str) -> None:
242 """Handle an HTML comment.276 """Handle an HTML comment.
243277
244 :param data: The text of the comment.278 :param data: The text of the comment.
@@ -247,7 +281,7 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML):
247 self.soup.handle_data(data)281 self.soup.handle_data(data)
248 self.soup.endData(Comment)282 self.soup.endData(Comment)
249283
250 def handle_decl(self, data):284 def handle_decl(self, data:str) -> None:
251 """Handle a DOCTYPE declaration.285 """Handle a DOCTYPE declaration.
252286
253 :param data: The text of the declaration.287 :param data: The text of the declaration.
@@ -257,11 +291,12 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML):
257 self.soup.handle_data(data)291 self.soup.handle_data(data)
258 self.soup.endData(Doctype)292 self.soup.endData(Doctype)
259293
260 def unknown_decl(self, data):294 def unknown_decl(self, data:str) -> None:
261 """Handle a declaration of unknown type -- probably a CDATA block.295 """Handle a declaration of unknown type -- probably a CDATA block.
262296
263 :param data: The text of the declaration.297 :param data: The text of the declaration.
264 """298 """
299 cls: Type[NavigableString]
265 if data.upper().startswith('CDATA['):300 if data.upper().startswith('CDATA['):
266 cls = CData301 cls = CData
267 data = data[len('CDATA['):]302 data = data[len('CDATA['):]
@@ -271,7 +306,7 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML):
271 self.soup.handle_data(data)306 self.soup.handle_data(data)
272 self.soup.endData(cls)307 self.soup.endData(cls)
273308
274 def handle_pi(self, data):309 def handle_pi(self, data:str) -> None:
275 """Handle a processing instruction.310 """Handle a processing instruction.
276311
277 :param data: The text of the instruction.312 :param data: The text of the instruction.
@@ -286,16 +321,17 @@ class HTMLParserTreeBuilder(HTMLTreeBuilder):
286 """A Beautiful soup `TreeBuilder` that uses the `HTMLParser` parser,321 """A Beautiful soup `TreeBuilder` that uses the `HTMLParser` parser,
287 found in the Python standard library.322 found in the Python standard library.
288 """323 """
289 is_xml = False324 is_xml:bool = False
290 picklable = True325 picklable:bool = True
291 NAME = HTMLPARSER326 NAME:str = HTMLPARSER
292 features = [NAME, HTML, STRICT]327 features: Iterable[str] = [NAME, HTML, STRICT]
293328
294 # The html.parser knows which line number and position in the329 #: The html.parser knows which line number and position in the
295 # original file is the source of an element.330 #: original file is the source of an element.
296 TRACKS_LINE_NUMBERS = True331 TRACKS_LINE_NUMBERS:bool = True
297332
298 def __init__(self, parser_args=None, parser_kwargs=None, **kwargs):333 def __init__(self, parser_args:Optional[Iterable[Any]]=None,
334 parser_kwargs:Optional[Dict[str, Any]]=None, **kwargs:Any):
299 """Constructor.335 """Constructor.
300336
301 :param parser_args: Positional arguments to pass into 337 :param parser_args: Positional arguments to pass into
@@ -320,9 +356,12 @@ class HTMLParserTreeBuilder(HTMLTreeBuilder):
320 parser_kwargs['convert_charrefs'] = False356 parser_kwargs['convert_charrefs'] = False
321 self.parser_args = (parser_args, parser_kwargs)357 self.parser_args = (parser_args, parser_kwargs)
322 358
323 def prepare_markup(self, markup, user_specified_encoding=None,359 def prepare_markup(
324 document_declared_encoding=None, exclude_encodings=None):360 self, markup:_RawMarkup,
325361 user_specified_encoding:Optional[_Encoding]=None,
362 document_declared_encoding:Optional[_Encoding]=None,
363 exclude_encodings:Optional[_Encodings]=None
364 ) -> Iterable[Tuple[str, Optional[_Encoding], Optional[_Encoding], bool]]:
326 """Run any preliminary steps necessary to make incoming markup365 """Run any preliminary steps necessary to make incoming markup
327 acceptable to the parser.366 acceptable to the parser.
328367
@@ -333,13 +372,12 @@ class HTMLParserTreeBuilder(HTMLTreeBuilder):
333 :param exclude_encodings: The user asked _not_ to try any of372 :param exclude_encodings: The user asked _not_ to try any of
334 these encodings.373 these encodings.
335374
336 :yield: A series of 4-tuples:375 :yield: A series of 4-tuples: (markup, encoding, declared encoding,
337 (markup, encoding, declared encoding,376 has undergone character replacement)
338 has undergone character replacement)
339377
340 Each 4-tuple represents a strategy for converting the378 Each 4-tuple represents a strategy for parsing the document.
341 document to Unicode and parsing it. Each strategy will be tried 379 This TreeBuilder uses Unicode, Dammit to convert the markup
342 in turn.380 into Unicode, so the `markup` element will always be a string.
343 """381 """
344 if isinstance(markup, str):382 if isinstance(markup, str):
345 # Parse Unicode as-is.383 # Parse Unicode as-is.
@@ -348,14 +386,19 @@ class HTMLParserTreeBuilder(HTMLTreeBuilder):
348386
349 # Ask UnicodeDammit to sniff the most likely encoding.387 # Ask UnicodeDammit to sniff the most likely encoding.
350388
351 # This was provided by the end-user; treat it as a known389 known_definite_encodings: List[_Encoding] = []
352 # definite encoding per the algorithm laid out in the HTML5390 if user_specified_encoding:
353 # spec. (See the EncodingDetector class for details.)391 # This was provided by the end-user; treat it as a known
354 known_definite_encodings = [user_specified_encoding]392 # definite encoding per the algorithm laid out in the
393 # HTML5 spec. (See the EncodingDetector class for
394 # details.)
395 known_definite_encodings.append(user_specified_encoding)
355396
356 # This was found in the document; treat it as a slightly lower-priority397 user_encodings: List[_Encoding] = []
357 # user encoding.398 if document_declared_encoding:
358 user_encodings = [document_declared_encoding]399 # This was found in the document; treat it as a slightly
400 # lower-priority user encoding.
401 user_encodings.append(document_declared_encoding)
359402
360 try_encodings = [user_specified_encoding, document_declared_encoding]403 try_encodings = [user_specified_encoding, document_declared_encoding]
361 dammit = UnicodeDammit(404 dammit = UnicodeDammit(
@@ -365,17 +408,27 @@ class HTMLParserTreeBuilder(HTMLTreeBuilder):
365 is_html=True,408 is_html=True,
366 exclude_encodings=exclude_encodings409 exclude_encodings=exclude_encodings
367 )410 )
368 yield (dammit.markup, dammit.original_encoding,
369 dammit.declared_html_encoding,
370 dammit.contains_replacement_characters)
371411
372 def feed(self, markup):412 if dammit.unicode_markup is None:
373 """Run some incoming markup through some parsing process,413 # In every case I've seen, Unicode, Dammit is able to
374 populating the `BeautifulSoup` object in self.soup.414 # convert the markup into Unicode, even if it needs to use
375 """415 # REPLACEMENT CHARACTER. But there is a code path that
416 # could result in unicode_markup being None, and
417 # HTMLParser can only parse Unicode, so here we handle
418 # that code path.
419 raise ParserRejectedMarkup("Could not convert input to Unicode, and html.parser will not accept bytestrings.")
420 else:
421 yield (dammit.unicode_markup, dammit.original_encoding,
422 dammit.declared_html_encoding,
423 dammit.contains_replacement_characters)
424
425 def feed(self, markup:str):
376 args, kwargs = self.parser_args426 args, kwargs = self.parser_args
377 parser = BeautifulSoupHTMLParser(*args, **kwargs)427 # We know BeautifulSoup calls TreeBuilder.initialize_soup
378 parser.soup = self.soup428 # before calling feed(), so we can assume self.soup
429 # is set.
430 assert self.soup is not None
431 parser = BeautifulSoupHTMLParser(self.soup, *args, **kwargs)
379 try:432 try:
380 parser.feed(markup)433 parser.feed(markup)
381 parser.close()434 parser.close()
@@ -385,3 +438,4 @@ class HTMLParserTreeBuilder(HTMLTreeBuilder):
385 # when there's an error in the doctype declaration.438 # when there's an error in the doctype declaration.
386 raise ParserRejectedMarkup(e)439 raise ParserRejectedMarkup(e)
387 parser.already_closed_empty_element = []440 parser.already_closed_empty_element = []
441
diff --git a/bs4/builder/_lxml.py b/bs4/builder/_lxml.py
index 971c81e..44a477f 100644
--- a/bs4/builder/_lxml.py
+++ b/bs4/builder/_lxml.py
@@ -1,3 +1,6 @@
+# encoding: utf-8
+from __future__ import annotations
+
 # Use of this source code is governed by the MIT license.
 __license__ = "MIT"

@@ -6,14 +9,26 @@ __all__ = [
     'LXMLTreeBuilder',
     ]

-try:
-    from collections.abc import Callable # Python 3.6
-except ImportError as e:
-    from collections import Callable
+from collections.abc import Callable
+
+from typing import (
+    Any,
+    Dict,
+    IO,
+    Iterable,
+    List,
+    Optional,
+    Set,
+    Tuple,
+    Type,
+    TYPE_CHECKING,
+    Union,
+)

 from io import BytesIO
 from io import StringIO
 from lxml import etree
+from bs4.dammit import (_Encoding)
 from bs4.element import (
     Comment,
     Doctype,
@@ -31,33 +46,54 @@ from bs4.builder import (
     TreeBuilder,
     XML)
 from bs4.dammit import EncodingDetector
-
-LXML = 'lxml'
+if TYPE_CHECKING:
+    from bs4._typing import (
+        _Encoding,
+        _Encodings,
+        _NamespacePrefix,
+        _NamespaceURL,
+        _NamespaceMapping,
+        _InvertedNamespaceMapping,
+        _RawMarkup,
+    )
+    from bs4 import BeautifulSoup
+
+LXML:str = 'lxml'

 def _invert(d):
     "Invert a dictionary."
     return dict((v,k) for k, v in list(d.items()))

 class LXMLTreeBuilderForXML(TreeBuilder):
-    DEFAULT_PARSER_CLASS = etree.XMLParser
-
-    is_xml = True
-    processing_instruction_class = XMLProcessingInstruction

-    NAME = "lxml-xml"
-    ALTERNATE_NAMES = ["xml"]
+    DEFAULT_PARSER_CLASS:Type[Any] = etree.XMLParser
+
+    is_xml:bool = True
+
+    processing_instruction_class:Type[ProcessingInstruction]
+
+    NAME:str = "lxml-xml"
+    ALTERNATE_NAMES: Iterable[str] = ["xml"]

     # Well, it's permissive by XML parser standards.
-    features = [NAME, LXML, XML, FAST, PERMISSIVE]
+    features: Iterable[str] = [NAME, LXML, XML, FAST, PERMISSIVE]

-    CHUNK_SIZE = 512
+    CHUNK_SIZE:int = 512

     # This namespace mapping is specified in the XML Namespace
     # standard.
-    DEFAULT_NSMAPS = dict(xml='http://www.w3.org/XML/1998/namespace')
+    DEFAULT_NSMAPS: _NamespaceMapping = dict(
+        xml='http://www.w3.org/XML/1998/namespace'
+    )

-    DEFAULT_NSMAPS_INVERTED = _invert(DEFAULT_NSMAPS)
+    DEFAULT_NSMAPS_INVERTED:_InvertedNamespaceMapping = _invert(
+        DEFAULT_NSMAPS
+    )

+    nsmaps: List[Optional[_InvertedNamespaceMapping]]
+    empty_element_tags: Set[str]
+    parser: Any
+
     # NOTE: If we parsed Element objects and looked at .sourceline,
     # we'd be able to see the line numbers from the original document.
     # But instead we build an XMLParser or HTMLParser object to serve
@@ -65,16 +101,18 @@ class LXMLTreeBuilderForXML(TreeBuilder):
     # line numbers.
     # See: https://bugs.launchpad.net/lxml/+bug/1846906

-    def initialize_soup(self, soup):
+    def initialize_soup(self, soup:BeautifulSoup) -> None:
         """Let the BeautifulSoup object know about the standard namespace
         mapping.

         :param soup: A `BeautifulSoup`.
         """
+        # Beyond this point, self.soup is set, so we can assume (and
+        # assert) it's not None whenever necessary.
         super(LXMLTreeBuilderForXML, self).initialize_soup(soup)
         self._register_namespaces(self.DEFAULT_NSMAPS)

-    def _register_namespaces(self, mapping):
+    def _register_namespaces(self, mapping:Dict[str, str]) -> None:
         """Let the BeautifulSoup object know about namespaces encountered
         while parsing the document.

@@ -87,6 +125,7 @@ class LXMLTreeBuilderForXML(TreeBuilder):

         :param mapping: A dictionary mapping namespace prefixes to URIs.
         """
+        assert self.soup is not None
         for key, value in list(mapping.items()):
             # This is 'if key' and not 'if key is not None' because we
             # don't track un-prefixed namespaces. Soupselect will
@@ -98,19 +137,18 @@ class LXMLTreeBuilderForXML(TreeBuilder):
             # prefix, the first one in the document takes precedence.
             self.soup._namespaces[key] = value

-    def default_parser(self, encoding):
+    def default_parser(self, encoding:Optional[_Encoding]) -> Type:
         """Find the default parser for the given encoding.

-        :param encoding: A string.
         :return: Either a parser object or a class, which
           will be instantiated with default arguments.
         """
         if self._default_parser is not None:
             return self._default_parser
-        return etree.XMLParser(
+        return self.DEFAULT_PARSER_CLASS(
             target=self, strip_cdata=False, recover=True, encoding=encoding)

-    def parser_for(self, encoding):
+    def parser_for(self, encoding: Optional[_Encoding]) -> Any:
         """Instantiate an appropriate parser for the given encoding.

         :param encoding: A string.
@@ -119,36 +157,39 @@ class LXMLTreeBuilderForXML(TreeBuilder):
             # Use the default parser.
             parser = self.default_parser(encoding)

-        if isinstance(parser, Callable):
+        if callable(parser):
             # Instantiate the parser with default arguments
             parser = parser(
                 target=self, strip_cdata=False, recover=True, encoding=encoding
             )
         return parser

-    def __init__(self, parser=None, empty_element_tags=None, **kwargs):
+    def __init__(self, parser:Optional[Any]=None,
+                 empty_element_tags:Optional[Set[str]]=None, **kwargs):
         # TODO: Issue a warning if parser is present but not a
         # callable, since that means there's no way to create new
         # parsers for different encodings.
         self._default_parser = parser
-        if empty_element_tags is not None:
-            self.empty_element_tags = set(empty_element_tags)
         self.soup = None
         self.nsmaps = [self.DEFAULT_NSMAPS_INVERTED]
         self.active_namespace_prefixes = [dict(self.DEFAULT_NSMAPS)]
         super(LXMLTreeBuilderForXML, self).__init__(**kwargs)

-    def _getNsTag(self, tag):
+    def _getNsTag(self, tag:str) -> Tuple[Optional[str], str]:
         # Split the namespace URL out of a fully-qualified lxml tag
         # name. Copied from lxml's src/lxml/sax.py.
         if tag[0] == '{':
-            return tuple(tag[1:].split('}', 1))
+            namespace, name = tag[1:].split('}', 1)
+            return (namespace, name)
         else:
             return (None, tag)

-    def prepare_markup(self, markup, user_specified_encoding=None,
-                       exclude_encodings=None,
-                       document_declared_encoding=None):
+    def prepare_markup(
+            self, markup:_RawMarkup,
+            user_specified_encoding:Optional[_Encoding]=None,
+            document_declared_encoding:Optional[_Encoding]=None,
+            exclude_encodings:Optional[_Encodings]=None,
+    ) -> Iterable[Tuple[Union[str,bytes], Optional[_Encoding], Optional[_Encoding], bool]]:
         """Run any preliminary steps necessary to make incoming markup
         acceptable to the parser.

@@ -166,13 +207,12 @@ class LXMLTreeBuilderForXML(TreeBuilder):
         :param exclude_encodings: The user asked _not_ to try any of
           these encodings.

-        :yield: A series of 4-tuples:
-         (markup, encoding, declared encoding,
-          has undergone character replacement)
+        :yield: A series of 4-tuples: (markup, encoding, declared encoding,
+         has undergone character replacement)

         Each 4-tuple represents a strategy for converting the
         document to Unicode and parsing it. Each strategy will be tried 
         in turn.
         """
         is_html = not self.is_xml
         if is_html:
@@ -200,14 +240,25 @@ class LXMLTreeBuilderForXML(TreeBuilder):
             yield (markup.encode("utf8"), "utf8",
                    document_declared_encoding, False)

-        # This was provided by the end-user; treat it as a known
-        # definite encoding per the algorithm laid out in the HTML5
-        # spec. (See the EncodingDetector class for details.)
-        known_definite_encodings = [user_specified_encoding]
+            # Since the document was Unicode in the first place, there
+            # is no need to try any more strategies; we know this will
+            # work.
+            return
+
+        known_definite_encodings: List[_Encoding] = []
+        if user_specified_encoding:
+            # This was provided by the end-user; treat it as a known
+            # definite encoding per the algorithm laid out in the
+            # HTML5 spec. (See the EncodingDetector class for
+            # details.)
+            known_definite_encodings.append(user_specified_encoding)
+
+        user_encodings: List[_Encoding] = []
+        if document_declared_encoding:
+            # This was found in the document; treat it as a slightly
+            # lower-priority user encoding.
+            user_encodings.append(document_declared_encoding)

-        # This was found in the document; treat it as a slightly lower-priority
-        # user encoding.
-        user_encodings = [document_declared_encoding]
         detector = EncodingDetector(
             markup, known_definite_encodings=known_definite_encodings,
             user_encodings=user_encodings, is_html=is_html,
@@ -216,34 +267,45 @@ class LXMLTreeBuilderForXML(TreeBuilder):
         for encoding in detector.encodings:
             yield (detector.markup, encoding, document_declared_encoding, False)

-    def feed(self, markup):
+    def feed(self, markup:Union[bytes,str]) -> None:
+        io: IO
         if isinstance(markup, bytes):
-            markup = BytesIO(markup)
+            io = BytesIO(markup)
         elif isinstance(markup, str):
-            markup = StringIO(markup)
+            io = StringIO(markup)

+        # initialize_soup is called before feed, so we know this
+        # is not None.
+        assert self.soup is not None
+
         # Call feed() at least once, even if the markup is empty,
         # or the parser won't be initialized.
-        data = markup.read(self.CHUNK_SIZE)
+        data = io.read(self.CHUNK_SIZE)
         try:
             self.parser = self.parser_for(self.soup.original_encoding)
             self.parser.feed(data)
             while len(data) != 0:
                 # Now call feed() on the rest of the data, chunk by chunk.
-                data = markup.read(self.CHUNK_SIZE)
+                data = io.read(self.CHUNK_SIZE)
                 if len(data) != 0:
                     self.parser.feed(data)
             self.parser.close()
         except (UnicodeDecodeError, LookupError, etree.ParserError) as e:
             raise ParserRejectedMarkup(e)

-    def close(self):
+    def close(self) -> None:
         self.nsmaps = [self.DEFAULT_NSMAPS_INVERTED]

-    def start(self, name, attrs, nsmap={}):
+    def start(self, name:str, attrs:Dict[str, str], nsmap:_NamespaceMapping={}):
+        # This is called by lxml code as a result of calling
+        # BeautifulSoup.feed(), and we know self.soup is set by the time feed()
+        # is called.
+        assert self.soup is not None
+
         # Make sure attrs is a mutable dict--lxml may send an immutable dictproxy.
         attrs = dict(attrs)
-        nsprefix = None
+        nsprefix: Optional[_NamespacePrefix] = None
+        namespace: Optional[_NamespaceURL] = None
         # Invert each namespace map as it comes in.
         if len(nsmap) == 0 and len(self.nsmaps) > 1:
             # There are no new namespaces for this tag, but
@@ -285,7 +347,7 @@ class LXMLTreeBuilderForXML(TreeBuilder):
             # Namespaces are in play. Find any attributes that came in
             # from lxml with namespaces attached to their names, and
             # turn then into NamespacedAttribute objects.
-            new_attrs = {}
+            new_attrs:Dict[Union[str,NamespacedAttribute], str] = {}
             for attr, value in list(attrs.items()):
                 namespace, attr = self._getNsTag(attr)
                 if namespace is None:
@@ -303,7 +365,7 @@ class LXMLTreeBuilderForXML(TreeBuilder):
             namespaces=self.active_namespace_prefixes[-1]
         )

-    def _prefix_for_namespace(self, namespace):
+    def _prefix_for_namespace(self, namespace:Optional[_NamespaceURL]) -> Optional[_NamespacePrefix]:
         """Find the currently active prefix for the given namespace."""
         if namespace is None:
             return None
@@ -312,7 +374,8 @@ class LXMLTreeBuilderForXML(TreeBuilder):
                 return inverted_nsmap[namespace]
         return None

-    def end(self, name):
+    def end(self, name:str) -> None:
+        assert self.soup is not None
         self.soup.endData()
         completed_tag = self.soup.tagStack[-1]
         namespace, name = self._getNsTag(name)
@@ -334,44 +397,49 @@ class LXMLTreeBuilderForXML(TreeBuilder):
             # namespace prefixes.
             self.active_namespace_prefixes.pop()

-    def pi(self, target, data):
+    def pi(self, target:str, data:str) -> None:
+        assert self.soup is not None
         self.soup.endData()
         data = target + ' ' + data
         self.soup.handle_data(data)
         self.soup.endData(self.processing_instruction_class)

-    def data(self, content):
+    def data(self, content:str) -> None:
+        assert self.soup is not None
         self.soup.handle_data(content)

-    def doctype(self, name, pubid, system):
+    def doctype(self, name:str, pubid:str, system:str) -> None:
+        assert self.soup is not None
         self.soup.endData()
         doctype = Doctype.for_name_and_ids(name, pubid, system)
         self.soup.object_was_parsed(doctype)

-    def comment(self, content):
+    def comment(self, content:str) -> None:
         "Handle comments as Comment objects."
+        assert self.soup is not None
         self.soup.endData()
         self.soup.handle_data(content)
         self.soup.endData(Comment)

-    def test_fragment_to_document(self, fragment):
+    def test_fragment_to_document(self, fragment:str) -> str:
         """See `TreeBuilder`."""
         return '<?xml version="1.0" encoding="utf-8"?>\n%s' % fragment


 class LXMLTreeBuilder(HTMLTreeBuilder, LXMLTreeBuilderForXML):

-    NAME = LXML
-    ALTERNATE_NAMES = ["lxml-html"]
+    NAME:str = LXML
+    ALTERNATE_NAMES: Iterable[str] = ["lxml-html"]

-    features = ALTERNATE_NAMES + [NAME, HTML, FAST, PERMISSIVE]
-    is_xml = False
-    processing_instruction_class = ProcessingInstruction
+    features: Iterable[str] = list(ALTERNATE_NAMES) + [NAME, HTML, FAST, PERMISSIVE]
+    is_xml: bool = False

-    def default_parser(self, encoding):
+    def default_parser(self, encoding:Optional[_Encoding]) -> Type[Any]:
         return etree.HTMLParser

-    def feed(self, markup):
+    def feed(self, markup:_RawMarkup) -> None:
+        # We know self.soup is set by the time feed() is called.
+        assert self.soup is not None
         encoding = self.soup.original_encoding
         try:
             self.parser = self.parser_for(encoding)
@@ -381,6 +449,7 @@ class LXMLTreeBuilder(HTMLTreeBuilder, LXMLTreeBuilderForXML):
             raise ParserRejectedMarkup(e)


-    def test_fragment_to_document(self, fragment):
+    def test_fragment_to_document(self, fragment:str) -> str:
         """See `TreeBuilder`."""
         return '<html><body>%s</body></html>' % fragment
+
diff --git a/bs4/css.py b/bs4/css.py
index 245ac60..0477de8 100644
--- a/bs4/css.py
+++ b/bs4/css.py
@@ -1,6 +1,36 @@
1"""Integration code for CSS selectors using Soup Sieve (pypi: soupsieve)."""1"""Integration code for CSS selectors using `Soup Sieve <https://facelessuser.github.io/soupsieve/>`_ (pypi: ``soupsieve``).
22
3Acquire a `CSS` object through the `bs4.element.Tag.css` attribute of
4the starting point of your CSS selector, or (if you want to run a
5selector against the entire document) of the `BeautifulSoup` object
6itself.
7
8The main advantage of doing this instead of using ``soupsieve``
9functions is that you don't need to keep passing the `bs4.element.Tag` to be
10selected against, since the `CSS` object is permanently scoped to that
11`bs4.element.Tag`.
12
13"""
14
15from __future__ import annotations
16
17from types import ModuleType
18from typing import (
19 Any,
20 cast,
21 Iterable,
22 Iterator,
23 Optional,
24 TYPE_CHECKING,
25)
3import warnings26import warnings
27from bs4._typing import _NamespaceMapping
28if TYPE_CHECKING:
29 from soupsieve import SoupSieve
30 from bs4 import element
31 from bs4.element import ResultSet, Tag
32
33soupsieve: Optional[ModuleType]
4try:34try:
5 import soupsieve35 import soupsieve
6except ImportError as e:36except ImportError as e:
@@ -9,34 +39,22 @@ except ImportError as e:
         'The soupsieve package is not installed. CSS selectors cannot be used.'
     )
 
-
 class CSS(object):
-    """A proxy object against the soupsieve library, to simplify its
+    """A proxy object against the ``soupsieve`` library, to simplify its
     CSS selector API.
 
-    Acquire this object through the .css attribute on the
-    BeautifulSoup object, or on the Tag you want to use as the
-    starting point for a CSS selector.
-
-    The main advantage of doing this is that the tag to be selected
-    against doesn't need to be explicitly specified in the function
-    calls, since it's already scoped to a tag.
-    """
-
-    def __init__(self, tag, api=soupsieve):
-        """Constructor.
-
-        You don't need to instantiate this class yourself; instead,
-        access the .css attribute on the BeautifulSoup object, or on
-        the Tag you want to use as the starting point for your CSS
-        selector.
+    You don't need to instantiate this class yourself; instead, use
+    `element.Tag.css`.
 
-        :param tag: All CSS selectors will use this as their starting
-        point.
+    :param tag: All CSS selectors run by this object will use this as
+       their starting point.
 
-        :param api: A plug-in replacement for the soupsieve module,
-        designed mainly for use in tests.
-        """
+    :param api: An optional drop-in replacement for the ``soupsieve`` module,
+       intended for use in unit tests.
+    """
+    def __init__(self, tag: element.Tag, api:Optional[ModuleType]=None):
+        if api is None:
+            api = soupsieve
         if api is None:
             raise NotImplementedError(
                 "Cannot execute CSS selectors because the soupsieve package is not installed."
@@ -44,19 +62,19 @@ class CSS(object):
         self.api = api
         self.tag = tag
 
-    def escape(self, ident):
+    def escape(self, ident:str) -> str:
         """Escape a CSS identifier.
 
-        This is a simple wrapper around soupselect.escape(). See the
+        This is a simple wrapper around `soupsieve.escape() <https://facelessuser.github.io/soupsieve/api/#soupsieveescape>`_. See the
         documentation for that function for more information.
         """
         if soupsieve is None:
             raise NotImplementedError(
                 "Cannot escape CSS identifiers because the soupsieve package is not installed."
             )
-        return self.api.escape(ident)
+        return cast(str, self.api.escape(ident))
 
-    def _ns(self, ns, select):
+    def _ns(self, ns:Optional[_NamespaceMapping], select:str) -> Optional[_NamespaceMapping]:
         """Normalize a dictionary of namespaces."""
         if not isinstance(select, self.api.SoupSieve) and ns is None:
             # If the selector is a precompiled pattern, it already has
@@ -65,7 +83,7 @@ class CSS(object):
             ns = self.tag._namespaces
         return ns
 
-    def _rs(self, results):
+    def _rs(self, results:Iterable[Tag]) -> ResultSet[Tag]:
         """Normalize a list of results to a Resultset.
 
         A ResultSet is more consistent with the rest of Beautiful
@@ -77,7 +95,12 @@ class CSS(object):
         from bs4.element import ResultSet
         return ResultSet(None, results)
 
-    def compile(self, select, namespaces=None, flags=0, **kwargs):
+    def compile(self,
+                select:str,
+                namespaces:Optional[_NamespaceMapping]=None,
+                flags:int=0,
+                **kwargs:Any
+    ) -> SoupSieve:
         """Pre-compile a selector and return the compiled object.
 
         :param selector: A CSS selector.
@@ -88,10 +111,10 @@ class CSS(object):
             parsing the document.
 
         :param flags: Flags to be passed into Soup Sieve's
-            soupsieve.compile() method.
+            `soupsieve.compile() <https://facelessuser.github.io/soupsieve/api/#soupsievecompile>`_ method.
 
-        :param kwargs: Keyword arguments to be passed into SoupSieve's
-            soupsieve.compile() method.
+        :param kwargs: Keyword arguments to be passed into Soup Sieve's
+            `soupsieve.compile() <https://facelessuser.github.io/soupsieve/api/#soupsievecompile>`_ method.
 
         :return: A precompiled selector object.
         :rtype: soupsieve.SoupSieve
@@ -100,13 +123,16 @@ class CSS(object):
             select, self._ns(namespaces, select), flags, **kwargs
         )
 
-    def select_one(self, select, namespaces=None, flags=0, **kwargs):
+    def select_one(
+            self, select:str,
+            namespaces:Optional[_NamespaceMapping]=None,
+            flags:int=0, **kwargs:Any
+    )-> element.Tag | None:
         """Perform a CSS selection operation on the current Tag and return the
-        first result.
+        first result, if any.
 
         This uses the Soup Sieve library. For more information, see
-        that library's documentation for the soupsieve.select_one()
-        method.
+        that library's documentation for the `soupsieve.select_one() <https://facelessuser.github.io/soupsieve/api/#soupsieveselect_one>`_ method.
 
         :param selector: A CSS selector.
 
@@ -116,27 +142,24 @@ class CSS(object):
             parsing the document.
 
         :param flags: Flags to be passed into Soup Sieve's
-            soupsieve.select_one() method.
-
-        :param kwargs: Keyword arguments to be passed into SoupSieve's
-            soupsieve.select_one() method.
-
-        :return: A Tag, or None if the selector has no match.
-        :rtype: bs4.element.Tag
+            `soupsieve.select_one() <https://facelessuser.github.io/soupsieve/api/#soupsieveselect_one>`_ method.
 
+        :param kwargs: Keyword arguments to be passed into Soup Sieve's
+            `soupsieve.select_one() <https://facelessuser.github.io/soupsieve/api/#soupsieveselect_one>`_ method.
         """
         return self.api.select_one(
             select, self.tag, self._ns(namespaces, select), flags, **kwargs
         )
 
-    def select(self, select, namespaces=None, limit=0, flags=0, **kwargs):
-        """Perform a CSS selection operation on the current Tag.
+    def select(self, select:str,
+               namespaces:Optional[_NamespaceMapping]=None,
+               limit:int=0, flags:int=0, **kwargs:Any) -> ResultSet[Tag]:
+        """Perform a CSS selection operation on the current `element.Tag`.
 
         This uses the Soup Sieve library. For more information, see
-        that library's documentation for the soupsieve.select()
-        method.
+        that library's documentation for the `soupsieve.select() <https://facelessuser.github.io/soupsieve/api/#soupsieveselect>`_ method.
 
-        :param selector: A string containing a CSS selector.
+        :param selector: A CSS selector.
 
         :param namespaces: A dictionary mapping namespace prefixes
             used in the CSS selector to namespace URIs. By default,
@@ -146,14 +169,10 @@ class CSS(object):
         :param limit: After finding this number of results, stop looking.
 
         :param flags: Flags to be passed into Soup Sieve's
-            soupsieve.select() method.
-
-        :param kwargs: Keyword arguments to be passed into SoupSieve's
-            soupsieve.select() method.
-
-        :return: A ResultSet of Tag objects.
-        :rtype: bs4.element.ResultSet
+            `soupsieve.select() <https://facelessuser.github.io/soupsieve/api/#soupsieveselect>`_ method.
 
+        :param kwargs: Keyword arguments to be passed into Soup Sieve's
+            `soupsieve.select() <https://facelessuser.github.io/soupsieve/api/#soupsieveselect>`_ method.
         """
         if limit is None:
             limit = 0
@@ -165,11 +184,14 @@ class CSS(object):
             )
         )
 
-    def iselect(self, select, namespaces=None, limit=0, flags=0, **kwargs):
-        """Perform a CSS selection operation on the current Tag.
+    def iselect(self, select:str,
+                namespaces:Optional[_NamespaceMapping]=None,
+                limit:int=0, flags:int=0, **kwargs:Any) -> Iterator[element.Tag]:
+        """Perform a CSS selection operation on the current `element.Tag`.
 
         This uses the Soup Sieve library. For more information, see
-        that library's documentation for the soupsieve.iselect()
+        that library's documentation for the `soupsieve.iselect()
+        <https://facelessuser.github.io/soupsieve/api/#soupsieveiselect>`_
         method. It is the same as select(), but it returns a generator
         instead of a list.
 
@@ -183,23 +205,23 @@ class CSS(object):
         :param limit: After finding this number of results, stop looking.
 
         :param flags: Flags to be passed into Soup Sieve's
-            soupsieve.iselect() method.
-
-        :param kwargs: Keyword arguments to be passed into SoupSieve's
-            soupsieve.iselect() method.
+            `soupsieve.iselect() <https://facelessuser.github.io/soupsieve/api/#soupsieveiselect>`_ method.
 
-        :return: A generator
-        :rtype: types.GeneratorType
+        :param kwargs: Keyword arguments to be passed into Soup Sieve's
+            `soupsieve.iselect() <https://facelessuser.github.io/soupsieve/api/#soupsieveiselect>`_ method.
         """
         return self.api.iselect(
             select, self.tag, self._ns(namespaces, select), limit, flags, **kwargs
         )
 
-    def closest(self, select, namespaces=None, flags=0, **kwargs):
-        """Find the Tag closest to this one that matches the given selector.
+    def closest(self, select:str,
+                namespaces:Optional[_NamespaceMapping]=None,
+                flags:int=0, **kwargs:Any) -> Optional[element.Tag]:
+        """Find the `element.Tag` closest to this one that matches the given selector.
 
         This uses the Soup Sieve library. For more information, see
-        that library's documentation for the soupsieve.closest()
+        that library's documentation for the `soupsieve.closest()
+        <https://facelessuser.github.io/soupsieve/api/#soupsieveclosest>`_
         method.
 
         :param selector: A string containing a CSS selector.
@@ -210,24 +232,24 @@ class CSS(object):
             parsing the document.
 
         :param flags: Flags to be passed into Soup Sieve's
-            soupsieve.closest() method.
+            `soupsieve.closest() <https://facelessuser.github.io/soupsieve/api/#soupsieveclosest>`_ method.
 
-        :param kwargs: Keyword arguments to be passed into SoupSieve's
-            soupsieve.closest() method.
-
-        :return: A Tag, or None if there is no match.
-        :rtype: bs4.Tag
+        :param kwargs: Keyword arguments to be passed into Soup Sieve's
+            `soupsieve.closest() <https://facelessuser.github.io/soupsieve/api/#soupsieveclosest>`_ method.
 
         """
         return self.api.closest(
             select, self.tag, self._ns(namespaces, select), flags, **kwargs
         )
 
-    def match(self, select, namespaces=None, flags=0, **kwargs):
-        """Check whether this Tag matches the given CSS selector.
+    def match(self, select:str,
+              namespaces:Optional[_NamespaceMapping]=None,
+              flags:int=0, **kwargs:Any) -> bool:
+        """Check whether or not this `element.Tag` matches the given CSS selector.
 
         This uses the Soup Sieve library. For more information, see
-        that library's documentation for the soupsieve.match()
+        that library's documentation for the `soupsieve.match()
+        <https://facelessuser.github.io/soupsieve/api/#soupsievematch>`_
         method.
 
         :param: a CSS selector.
@@ -238,25 +260,30 @@ class CSS(object):
             parsing the document.
 
         :param flags: Flags to be passed into Soup Sieve's
-            soupsieve.match() method.
+            `soupsieve.match()
+            <https://facelessuser.github.io/soupsieve/api/#soupsievematch>`_
+            method.
 
         :param kwargs: Keyword arguments to be passed into SoupSieve's
-            soupsieve.match() method.
-
-        :return: True if this Tag matches the selector; False otherwise.
-        :rtype: bool
+            `soupsieve.match()
+            <https://facelessuser.github.io/soupsieve/api/#soupsievematch>`_
+            method.
         """
-        return self.api.match(
+        return cast(bool, self.api.match(
             select, self.tag, self._ns(namespaces, select), flags, **kwargs
-        )
+        ))
 
-    def filter(self, select, namespaces=None, flags=0, **kwargs):
-        """Filter this Tag's direct children based on the given CSS selector.
+    def filter(self, select:str,
+               namespaces:Optional[_NamespaceMapping]=None,
+               flags:int=0, **kwargs:Any) -> ResultSet[Tag]:
+        """Filter this `element.Tag`'s direct children based on the given CSS selector.
 
         This uses the Soup Sieve library. It works the same way as
-        passing this Tag into that library's soupsieve.filter()
-        method. More information, for more information see the
-        documentation for soupsieve.filter().
+        passing a `element.Tag` into that library's `soupsieve.filter()
+        <https://facelessuser.github.io/soupsieve/api/#soupsievefilter>`_
+        method. For more information, see the documentation for
+        `soupsieve.filter()
+        <https://facelessuser.github.io/soupsieve/api/#soupsievefilter>`_.
 
         :param namespaces: A dictionary mapping namespace prefixes
             used in the CSS selector to namespace URIs. By default,
@@ -264,17 +291,18 @@ class CSS(object):
             parsing the document.
 
         :param flags: Flags to be passed into Soup Sieve's
-            soupsieve.filter() method.
+            `soupsieve.filter()
+            <https://facelessuser.github.io/soupsieve/api/#soupsievefilter>`_
+            method.
 
         :param kwargs: Keyword arguments to be passed into SoupSieve's
-            soupsieve.filter() method.
-
-        :return: A ResultSet of Tag objects.
-        :rtype: bs4.element.ResultSet
-
+            `soupsieve.filter()
+            <https://facelessuser.github.io/soupsieve/api/#soupsievefilter>`_
+            method.
         """
         return self._rs(
             self.api.filter(
                 select, self.tag, self._ns(namespaces, select), flags, **kwargs
             )
         )
+
diff --git a/bs4/dammit.py b/bs4/dammit.py
index 692433c..8c1b631 100644
--- a/bs4/dammit.py
+++ b/bs4/dammit.py
@@ -2,9 +2,11 @@
 """Beautiful Soup bonus library: Unicode, Dammit
 
 This library converts a bytestream to Unicode through any means
-necessary. It is heavily based on code from Mark Pilgrim's Universal
-Feed Parser. It works best on XML and HTML, but it does not rewrite the
-XML or HTML to reflect a new encoding; that's the tree builder's job.
+necessary. It is heavily based on code from Mark Pilgrim's `Universal
+Feed Parser <https://pypi.org/project/feedparser/>`_. It works best on
+XML and HTML, but it does not rewrite the XML or HTML to reflect a new
+encoding; that's the job of `TreeBuilder`.
+
 """
 # Use of this source code is governed by the MIT license.
 __license__ = "MIT"
@@ -12,9 +14,31 @@ __license__ = "MIT"
 from html.entities import codepoint2name
 from collections import defaultdict
 import codecs
+from html.entities import html5
 import re
-import logging
+from logging import Logger, getLogger
 import string
+from types import ModuleType
+from typing import (
+    Dict,
+    Iterable,
+    Iterator,
+    List,
+    Optional,
+    Pattern,
+    Sequence,
+    Set,
+    Tuple,
+    Type,
+    Union,
+    cast,
+)
+from bs4._typing import (
+    _Encoding,
+    _Encodings,
+    _RawMarkup,
+)
+import warnings
 
 # Import a library to autodetect character encodings. We'll support
 # any of a number of libraries that all support the same API:
@@ -22,37 +46,41 @@ import string
 # * cchardet
 # * chardet
 # * charset-normalizer
-chardet_module = None
+chardet_module: Optional[ModuleType] = None
 try:
     # PyPI package: cchardet
-    import cchardet as chardet_module
+    import cchardet
+    chardet_module = cchardet
 except ImportError:
     try:
         # Debian package: python-chardet
         # PyPI package: chardet
-        import chardet as chardet_module
+        import chardet
+        chardet_module = chardet
     except ImportError:
         try:
             # PyPI package: charset-normalizer
-            import charset_normalizer as chardet_module
+            import charset_normalizer
+            chardet_module = charset_normalizer
         except ImportError:
             # No chardet available.
-            chardet_module = None
+            pass
 
-if chardet_module:
-    def chardet_dammit(s):
-        if isinstance(s, str):
-            return None
-        return chardet_module.detect(s)['encoding']
-else:
-    def chardet_dammit(s):
+
+def _chardet_dammit(s:bytes) -> Optional[str]:
+    """Try as hard as possible to detect the encoding of a bytestring."""
+    if chardet_module is None or isinstance(s, str):
         return None
+    module = chardet_module
+    return module.detect(s)['encoding']
 
 # Build bytestring and Unicode versions of regular expressions for finding
 # a declared encoding inside an XML or HTML document.
-xml_encoding = '^\\s*<\\?.*encoding=[\'"](.*?)[\'"].*\\?>'
-html_meta = '<\\s*meta[^>]+charset\\s*=\\s*["\']?([^>]*?)[ /;\'">]'
-encoding_res = dict()
+xml_encoding:str = '^\\s*<\\?.*encoding=[\'"](.*?)[\'"].*\\?>' #: :meta private:
+html_meta:str = '<\\s*meta[^>]+charset\\s*=\\s*["\']?([^>]*?)[ /;\'">]' #: :meta private:
+
+# TODO: The Pattern type here could use more refinement, but it's tricky.
+encoding_res: Dict[Type, Dict[str, Pattern]] = dict()
 encoding_res[bytes] = {
     'html' : re.compile(html_meta.encode("ascii"), re.I),
     'xml' : re.compile(xml_encoding.encode("ascii"), re.I),
@@ -62,12 +90,29 @@ encoding_res[str] = {
     'xml' : re.compile(xml_encoding, re.I)
 }
 
-from html.entities import html5
-
 class EntitySubstitution(object):
     """The ability to substitute XML or HTML entities for certain characters."""
 
-    def _populate_class_variables():
+    #: A map of named HTML entities to the corresponding Unicode string.
+    #:
+    #: :meta hide-value:
+    HTML_ENTITY_TO_CHARACTER: Dict[str, str]
+
+    #: A map of Unicode strings to the corresponding named HTML entities;
+    #: the inverse of HTML_ENTITY_TO_CHARACTER.
+    #:
+    #: :meta hide-value:
+    CHARACTER_TO_HTML_ENTITY: Dict[str, str]
+
+    #: A regular expression that matches any character (or, in rare
+    #: cases, pair of characters) that can be replaced with a named
+    #: HTML entity.
+    #:
+    #: :meta hide-value:
+    CHARACTER_TO_HTML_ENTITY_RE: Pattern[str]
+
+    @classmethod
+    def _populate_class_variables(cls) -> None:
         """Initialize variables used by this class to manage the plethora of
         HTML5 named entities.
 
@@ -184,11 +229,14 @@ class EntitySubstitution(object):
             character = chr(codepoint)
             unicode_to_name[character] = name
 
-        return unicode_to_name, name_to_unicode, re.compile(re_definition)
-    (CHARACTER_TO_HTML_ENTITY, HTML_ENTITY_TO_CHARACTER,
-     CHARACTER_TO_HTML_ENTITY_RE) = _populate_class_variables()
+        cls.CHARACTER_TO_HTML_ENTITY = unicode_to_name
+        cls.HTML_ENTITY_TO_CHARACTER = name_to_unicode
+        cls.CHARACTER_TO_HTML_ENTITY_RE = re.compile(re_definition)
 
-    CHARACTER_TO_XML_ENTITY = {
+    #: A map of Unicode strings to the corresponding named XML entities.
+    #:
+    #: :meta hide-value:
+    CHARACTER_TO_XML_ENTITY: Dict[str, str] = {
         "'": "apos",
         '"': "quot",
         "&": "amp",
@@ -196,28 +244,37 @@ class EntitySubstitution(object):
         ">": "gt",
         }
 
-    BARE_AMPERSAND_OR_BRACKET = re.compile("([<>]|"
-                                           "&(?!#\\d+;|#x[0-9a-fA-F]+;|\\w+;)"
-                                           ")")
-
-    AMPERSAND_OR_BRACKET = re.compile("([<>&])")
+    #: A regular expression matching an angle bracket or an ampersand that
+    #: is not part of an XML or HTML entity.
+    #:
+    #: :meta hide-value:
+    BARE_AMPERSAND_OR_BRACKET: Pattern[str] = re.compile(
+        "([<>]|"
+        "&(?!#\\d+;|#x[0-9a-fA-F]+;|\\w+;)"
+        ")"
+    )
+
+    #: A regular expression matching an angle bracket or an ampersand.
+    #:
+    #: :meta hide-value:
+    AMPERSAND_OR_BRACKET: Pattern[str] = re.compile("([<>&])")
 
     @classmethod
-    def _substitute_html_entity(cls, matchobj):
+    def _substitute_html_entity(cls, matchobj:re.Match[str]) -> str:
         """Used with a regular expression to substitute the
         appropriate HTML entity for a special character string."""
         entity = cls.CHARACTER_TO_HTML_ENTITY.get(matchobj.group(0))
         return "&%s;" % entity
 
     @classmethod
-    def _substitute_xml_entity(cls, matchobj):
+    def _substitute_xml_entity(cls, matchobj:re.Match[str]) -> str:
         """Used with a regular expression to substitute the
         appropriate XML entity for a special character string."""
         entity = cls.CHARACTER_TO_XML_ENTITY[matchobj.group(0)]
         return "&%s;" % entity
 
     @classmethod
-    def quoted_attribute_value(self, value):
+    def quoted_attribute_value(cls, value: str) -> str:
         """Make a value into a quoted XML attribute, possibly escaping it.
 
         Most strings will be quoted using double quotes.
@@ -233,7 +290,10 @@ class EntitySubstitution(object):
         double quotes will be escaped, and the string will be quoted
         using double quotes.
 
-        Welcome to "Bob's Bar" -> "Welcome to &quot;Bob's bar&quot;
+        Welcome to "Bob's Bar" -> Welcome to &quot;Bob's bar&quot;
+
+        :param value: The XML attribute value to quote
+        :return: The quoted value
         """
         quote_with = '"'
         if '"' in value:
@@ -254,17 +314,22 @@ class EntitySubstitution(object):
         return quote_with + value + quote_with
 
     @classmethod
-    def substitute_xml(cls, value, make_quoted_attribute=False):
-        """Substitute XML entities for special XML characters.
+    def substitute_xml(cls, value:str, make_quoted_attribute:bool=False) -> str:
+        """Replace special XML characters with named XML entities.
+
+        The less-than sign will become &lt;, the greater-than sign
+        will become &gt;, and any ampersands will become &amp;. If you
+        want ampersands that seem to be part of an entity definition
+        to be left alone, use `substitute_xml_containing_entities`
+        instead.
 
-        :param value: A string to be substituted. The less-than sign
-            will become &lt;, the greater-than sign will become &gt;,
-            and any ampersands will become &amp;. If you want ampersands
-            that appear to be part of an entity definition to be left
-            alone, use substitute_xml_containing_entities() instead.
+        :param value: A string to be substituted.
 
         :param make_quoted_attribute: If True, then the string will be
             quoted, as befits an attribute value.
+
+        :return: A version of ``value`` with special characters replaced
+            with named entities.
         """
         # Escape angle brackets and ampersands.
         value = cls.AMPERSAND_OR_BRACKET.sub(
@@ -276,7 +341,7 @@ class EntitySubstitution(object):
 
     @classmethod
     def substitute_xml_containing_entities(
-        cls, value, make_quoted_attribute=False):
+        cls, value: str, make_quoted_attribute:bool=False) -> str:
         """Substitute XML entities for special XML characters.
 
         :param value: A string to be substituted. The less-than sign will
@@ -297,10 +362,10 @@ class EntitySubstitution(object):
         return value
 
     @classmethod
-    def substitute_html(cls, s):
+    def substitute_html(cls, s: str) -> str:
         """Replace certain Unicode characters with named HTML entities.
 
-        This differs from data.encode(encoding, 'xmlcharrefreplace')
+        This differs from ``data.encode(encoding, 'xmlcharrefreplace')``
         in that the goal is to make the result more readable (to those
         with ASCII displays) rather than to recover from
         errors. There's absolutely nothing wrong with a UTF-8 string
@@ -308,109 +373,126 @@ class EntitySubstitution(object):
308 character with "&eacute;" will make it more readable to some373 character with "&eacute;" will make it more readable to some
309 people.374 people.
310375
311 :param s: A Unicode string.376 :param s: The string to be modified.
377 :return: The string with some Unicode characters replaced with
378 HTML entities.
312 """379 """
313 return cls.CHARACTER_TO_HTML_ENTITY_RE.sub(380 return cls.CHARACTER_TO_HTML_ENTITY_RE.sub(
314 cls._substitute_html_entity, s)381 cls._substitute_html_entity, s)
315382EntitySubstitution._populate_class_variables()
316383
317class EncodingDetector:384class EncodingDetector:
318 """Suggests a number of possible encodings for a bytestring.385 """This class is capable of guessing a number of possible encodings
386 for a bytestring.
319387
320 Order of precedence:388 Order of precedence:
321389
322 1. Encodings you specifically tell EncodingDetector to try first390 1. Encodings you specifically tell EncodingDetector to try first
323 (the known_definite_encodings argument to the constructor).391 (the ``known_definite_encodings`` argument to the constructor).
324392
325 2. An encoding determined by sniffing the document's byte-order mark.393 2. An encoding determined by sniffing the document's byte-order mark.
326394
327 3. Encodings you specifically tell EncodingDetector to try if395 3. Encodings you specifically tell EncodingDetector to try if
328 byte-order mark sniffing fails (the user_encodings argument to the396 byte-order mark sniffing fails (the ``user_encodings`` argument to the
329 constructor).397 constructor).
330398
331 4. An encoding declared within the bytestring itself, either in an399 4. An encoding declared within the bytestring itself, either in an
332 XML declaration (if the bytestring is to be interpreted as an XML400 XML declaration (if the bytestring is to be interpreted as an XML
333 document), or in a <meta> tag (if the bytestring is to be401 document), or in a <meta> tag (if the bytestring is to be
334 interpreted as an HTML document.)402 interpreted as an HTML document.)
335403
336 5. An encoding detected through textual analysis by chardet,404 5. An encoding detected through textual analysis by chardet,
     cchardet, or a similar external library.
 
-    4. UTF-8.
+    6. UTF-8.
 
-    5. Windows-1252.
+    7. Windows-1252.
 
-    """
-    def __init__(self, markup, known_definite_encodings=None,
-                 is_html=False, exclude_encodings=None,
-                 user_encodings=None, override_encodings=None):
-        """Constructor.
-
-        :param markup: Some markup in an unknown encoding.
-
-        :param known_definite_encodings: When determining the encoding
-         of `markup`, these encodings will be tried first, in
-         order. In HTML terms, this corresponds to the "known
-         definite encoding" step defined here:
-         https://html.spec.whatwg.org/multipage/parsing.html#parsing-with-a-known-character-encoding
-
-        :param user_encodings: These encodings will be tried after the
-         `known_definite_encodings` have been tried and failed, and
-         after an attempt to sniff the encoding by looking at a
-         byte order mark has failed. In HTML terms, this
-         corresponds to the step "user has explicitly instructed
-         the user agent to override the document's character
-         encoding", defined here:
-         https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding
-
-        :param override_encodings: A deprecated alias for
-         known_definite_encodings. Any encodings here will be tried
-         immediately after the encodings in
-         known_definite_encodings.
-
-        :param is_html: If True, this markup is considered to be
-         HTML. Otherwise it's assumed to be XML.
-
-        :param exclude_encodings: These encodings will not be tried,
-         even if they otherwise would be.
-
-        """
+    :param markup: Some markup in an unknown encoding.
+
+    :param known_definite_encodings: When determining the encoding
+     of ``markup``, these encodings will be tried first, in
+     order. In HTML terms, this corresponds to the "known
+     definite encoding" step defined in `section 13.2.3.1 of the HTML standard <https://html.spec.whatwg.org/multipage/parsing.html#parsing-with-a-known-character-encoding>`_.
+
+    :param user_encodings: These encodings will be tried after the
+     ``known_definite_encodings`` have been tried and failed, and
+     after an attempt to sniff the encoding by looking at a
+     byte order mark has failed. In HTML terms, this
+     corresponds to the step "user has explicitly instructed
+     the user agent to override the document's character
+     encoding", defined in `section 13.2.3.2 of the HTML standard <https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding>`_.
+
+    :param override_encodings: A **deprecated** alias for
+     ``known_definite_encodings``. Any encodings here will be tried
+     immediately after the encodings in
+     ``known_definite_encodings``.
+
+    :param is_html: If True, this markup is considered to be
+     HTML. Otherwise it's assumed to be XML.
+
+    :param exclude_encodings: These encodings will not be tried,
+     even if they otherwise would be.
+
+    """
+    def __init__(self, markup:bytes,
+                 known_definite_encodings:Optional[_Encodings]=None,
+                 is_html:Optional[bool]=False,
+                 exclude_encodings:Optional[_Encodings]=None,
+                 user_encodings:Optional[_Encodings]=None,
+                 override_encodings:Optional[_Encodings]=None):
         self.known_definite_encodings = list(known_definite_encodings or [])
         if override_encodings:
+            warnings.warn(
+                "The 'override_encodings' argument was deprecated in 4.10.0. Use 'known_definite_encodings' instead.",
+                DeprecationWarning,
+                stacklevel=3
+            )
             self.known_definite_encodings += override_encodings
         self.user_encodings = user_encodings or []
         exclude_encodings = exclude_encodings or []
         self.exclude_encodings = set([x.lower() for x in exclude_encodings])
         self.chardet_encoding = None
-        self.is_html = is_html
-        self.declared_encoding = None
+        self.is_html = False if is_html is None else is_html
+        self.declared_encoding: Optional[str] = None
 
         # First order of business: strip a byte-order mark.
         self.markup, self.sniffed_encoding = self.strip_byte_order_mark(markup)
 
-    def _usable(self, encoding, tried):
+    known_definite_encodings:_Encodings
+    user_encodings:_Encodings
+    exclude_encodings:_Encodings
+    chardet_encoding:Optional[_Encoding]
+    is_html:bool
+    declared_encoding:Optional[_Encoding]
+    markup:bytes
+    sniffed_encoding:Optional[_Encoding]
+
+    def _usable(self, encoding:Optional[_Encoding], tried:Set[_Encoding]) -> bool:
         """Should we even bother to try this encoding?
 
         :param encoding: Name of an encoding.
-        :param tried: Encodings that have already been tried. This will be modified
-        as a side effect.
+        :param tried: Encodings that have already been tried. This
+            will be modified as a side effect.
         """
-        if encoding is not None:
-            encoding = encoding.lower()
-            if encoding in self.exclude_encodings:
-                return False
-            if encoding not in tried:
-                tried.add(encoding)
-                return True
+        if encoding is None:
+            return False
+        encoding = encoding.lower()
+        if encoding in self.exclude_encodings:
+            return False
+        if encoding not in tried:
+            tried.add(encoding)
+            return True
         return False
 
     @property
-    def encodings(self):
+    def encodings(self) -> Iterator[_Encoding]:
         """Yield a number of encodings that might work for this markup.
 
-        :yield: A sequence of strings.
+        :yield: A sequence of strings. Each is the name of an encoding
+            that *might* work to convert a bytestring into Unicode.
         """
-        tried = set()
+        tried:Set[_Encoding] = set()
 
         # First, try the known definite encodings
         for e in self.known_definite_encodings:
@@ -419,7 +501,9 @@ class EncodingDetector:
 
         # Did the document originally start with a byte-order mark
         # that indicated its encoding?
-        if self._usable(self.sniffed_encoding, tried):
+        if self.sniffed_encoding is not None and self._usable(
+            self.sniffed_encoding, tried
+        ):
             yield self.sniffed_encoding
 
         # Sniffing the byte-order mark did nothing; try the user
@@ -433,14 +517,18 @@ class EncodingDetector:
         if self.declared_encoding is None:
             self.declared_encoding = self.find_declared_encoding(
                 self.markup, self.is_html)
-        if self._usable(self.declared_encoding, tried):
+        if self.declared_encoding is not None and self._usable(
+            self.declared_encoding, tried
+        ):
             yield self.declared_encoding
 
         # Use third-party character set detection to guess at the
         # encoding.
         if self.chardet_encoding is None:
-            self.chardet_encoding = chardet_dammit(self.markup)
-        if self._usable(self.chardet_encoding, tried):
+            self.chardet_encoding = _chardet_dammit(self.markup)
+        if self.chardet_encoding is not None and self._usable(
+            self.chardet_encoding, tried
+        ):
             yield self.chardet_encoding
 
         # As a last-ditch effort, try utf-8 and windows-1252.
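[Editor's note: the candidate-ordering that these `encodings` hunks thread `None` checks through can be sketched standalone. This is an illustration, not the bs4 implementation; the function and parameter names (`candidate_encodings`, `chardet_guess`, etc.) are hypothetical, and the real property invokes chardet lazily.]

```python
from typing import Iterator, List, Optional, Set

def candidate_encodings(
        known_definite: List[str],
        sniffed: Optional[str],
        user: List[str],
        declared: Optional[str],
        chardet_guess: Optional[str],
        exclude: Optional[Set[str]] = None,
) -> Iterator[str]:
    """Yield encoding names in the priority order the diff establishes:
    known-definite, BOM-sniffed, user-specified, declared-in-document,
    chardet's guess, then the utf-8/windows-1252 fallbacks."""
    excluded = exclude or set()
    tried: Set[str] = set()

    def usable(encoding: str) -> bool:
        # Mirrors EncodingDetector._usable: normalize case, honor the
        # exclusion list, and never yield the same encoding twice.
        lowered = encoding.lower()
        if lowered in excluded or lowered in tried:
            return False
        tried.add(lowered)
        return True

    for group in (known_definite, [sniffed], user, [declared],
                  [chardet_guess], ["utf-8", "windows-1252"]):
        for e in group:
            if e is not None and usable(e):
                yield e
```

Calling `list(candidate_encodings(...))` shows the trial order directly, which is handy for reasoning about why a given document decodes the way it does.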
@@ -449,22 +537,24 @@ class EncodingDetector:
             yield e
 
     @classmethod
-    def strip_byte_order_mark(cls, data):
+    def strip_byte_order_mark(cls, data:bytes) -> Tuple[bytes, Optional[_Encoding]]:
         """If a byte-order mark is present, strip it and return the encoding it implies.
 
-        :param data: Some markup.
-        :return: A 2-tuple (modified data, implied encoding)
+        :param data: A bytestring that may or may not begin with a
+            byte-order mark.
+
+        :return: A 2-tuple (data stripped of byte-order mark, encoding implied by byte-order mark)
         """
         encoding = None
         if isinstance(data, str):
             # Unicode data cannot have a byte-order mark.
             return data, encoding
         if (len(data) >= 4) and (data[:2] == b'\xfe\xff') \
-               and (data[2:4] != '\x00\x00'):
+               and (data[2:4] != b'\x00\x00'):
             encoding = 'utf-16be'
             data = data[2:]
         elif (len(data) >= 4) and (data[:2] == b'\xff\xfe') \
-               and (data[2:4] != '\x00\x00'):
+               and (data[2:4] != b'\x00\x00'):
             encoding = 'utf-16le'
             data = data[2:]
         elif data[:3] == b'\xef\xbb\xbf':
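[Editor's note: the `b'\x00\x00'` change in this hunk is a real bugfix, not just a type annotation: in Python 3, comparing a `bytes` slice against the `str` `'\x00\x00'` is always unequal, so the guard that distinguishes a UTF-16 BOM from a UTF-32 one never fired. A standalone sketch of just the branches visible in this excerpt (the full method may handle more cases):]

```python
from typing import Optional, Tuple

def strip_bom(data: bytes) -> Tuple[bytes, Optional[str]]:
    """Strip a leading byte-order mark and report the encoding it implies.
    The != b'\\x00\\x00' checks keep a UTF-32 BOM (e.g. FF FE 00 00)
    from being misread as a UTF-16 one."""
    if len(data) >= 4 and data[:2] == b'\xfe\xff' and data[2:4] != b'\x00\x00':
        return data[2:], 'utf-16be'
    if len(data) >= 4 and data[:2] == b'\xff\xfe' and data[2:4] != b'\x00\x00':
        return data[2:], 'utf-16le'
    if data[:3] == b'\xef\xbb\xbf':
        return data[3:], 'utf-8'
    return data, None
```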
@@ -479,8 +569,9 @@ class EncodingDetector:
         return data, encoding
 
     @classmethod
-    def find_declared_encoding(cls, markup, is_html=False, search_entire_document=False):
-        """Given a document, tries to find its declared encoding.
+    def find_declared_encoding(cls, markup:Union[bytes,str], is_html:bool=False, search_entire_document:bool=False) -> Optional[_Encoding]:
+        """Given a document, tries to find an encoding declared within the
+        text of the document itself.
 
         An XML encoding is declared at the beginning of the document.
 
@@ -490,9 +581,12 @@ class EncodingDetector:
         :param markup: Some markup.
         :param is_html: If True, this markup is considered to be HTML. Otherwise
            it's assumed to be XML.
-        :param search_entire_document: Since an encoding is supposed to declared near the beginning
-           of the document, most of the time it's only necessary to search a few kilobytes of data.
-           Set this to True to force this method to search the entire document.
+        :param search_entire_document: Since an encoding is supposed
+           to declared near the beginning of the document, most of
+           the time it's only necessary to search a few kilobytes of
+           data. Set this to True to force this method to search the
+           entire document.
+        :return: The declared encoding, if one is found.
         """
         if search_entire_document:
             xml_endpos = html_endpos = len(markup)
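[Editor's note: a minimal sketch of the declared-encoding search described here, assuming illustrative regexes — bs4's actual patterns are more thorough and also handle `<meta http-equiv>` declarations:]

```python
import re
from typing import Optional

# Illustrative patterns only; the real EncodingDetector regexes
# are stricter and cover more declaration styles.
_xml_decl = re.compile(rb'<\?xml[^>]+encoding=["\']?([A-Za-z0-9][A-Za-z0-9._-]*)')
_html_meta = re.compile(rb'<meta[^>]+charset=["\']?([A-Za-z0-9][A-Za-z0-9._-]*)', re.I)

def declared_encoding(markup: bytes, is_html: bool = False,
                      search_entire_document: bool = False) -> Optional[str]:
    # A declaration should appear near the start, so only a small
    # prefix is searched unless the caller insists otherwise.
    endpos = len(markup) if search_entire_document else 1024
    pattern = _html_meta if is_html else _xml_decl
    m = pattern.search(markup, 0, endpos)
    return m.group(1).decode('ascii').lower() if m else None
```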
@@ -520,74 +614,69 @@ class EncodingDetector:
         return None
 
 class UnicodeDammit:
-    """A class for detecting the encoding of a *ML document and
-    converting it to a Unicode string. If the source encoding is
-    windows-1252, can replace MS smart quotes with their HTML or XML
-    equivalents."""
+    """A class for detecting the encoding of a bytestring containing an
+    HTML or XML document, and decoding it to Unicode. If the source
+    encoding is windows-1252, `UnicodeDammit` can also replace
+    Microsoft smart quotes with their HTML or XML equivalents.
 
-    # This dictionary maps commonly seen values for "charset" in HTML
-    # meta tags to the corresponding Python codec names. It only covers
-    # values that aren't in Python's aliases and can't be determined
-    # by the heuristics in find_codec.
-    CHARSET_ALIASES = {"macintosh": "mac-roman",
-                       "x-sjis": "shift-jis"}
-
-    ENCODINGS_WITH_SMART_QUOTES = [
-        "windows-1252",
-        "iso-8859-1",
-        "iso-8859-2",
-    ]
+    :param markup: HTML or XML markup in an unknown encoding.
+
+    :param known_definite_encodings: When determining the encoding
+     of ``markup``, these encodings will be tried first, in
+     order. In HTML terms, this corresponds to the "known
+     definite encoding" step defined in `section 13.2.3.1 of the HTML standard <https://html.spec.whatwg.org/multipage/parsing.html#parsing-with-a-known-character-encoding>`_.
+
+    :param user_encodings: These encodings will be tried after the
+     ``known_definite_encodings`` have been tried and failed, and
+     after an attempt to sniff the encoding by looking at a
+     byte order mark has failed. In HTML terms, this
+     corresponds to the step "user has explicitly instructed
+     the user agent to override the document's character
+     encoding", defined in `section 13.2.3.2 of the HTML standard <https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding>`_.
+
+    :param override_encodings: A **deprecated** alias for
+     ``known_definite_encodings``. Any encodings here will be tried
+     immediately after the encodings in
+     ``known_definite_encodings``.
+
+    :param smart_quotes_to: By default, Microsoft smart quotes will,
+     like all other characters, be converted to Unicode
+     characters. Setting this to ``ascii`` will convert them to ASCII
+     quotes instead. Setting it to ``xml`` will convert them to XML
+     entity references, and setting it to ``html`` will convert them
+     to HTML entity references.
+
+    :param is_html: If True, ``markup`` is treated as an HTML
+     document. Otherwise it's treated as an XML document.
+
+    :param exclude_encodings: These encodings will not be considered,
+     even if the sniffing code thinks they might make sense.
 
-    def __init__(self, markup, known_definite_encodings=[],
-                 smart_quotes_to=None, is_html=False, exclude_encodings=[],
-                 user_encodings=None, override_encodings=None
+    """
+    def __init__(
+            self, markup:bytes,
+            known_definite_encodings:Optional[_Encodings]=[],
+            # TODO PYTHON 3.8 Literal is added to the typing module
+            #
+            # smart_quotes_to: Literal["ascii", "xml", "html"] | None = None,
+            smart_quotes_to: Optional[str] = None,
+            is_html: bool = False,
+            exclude_encodings:Optional[_Encodings] = [],
+            user_encodings:Optional[_Encodings] = None,
+            override_encodings:Optional[_Encodings] = None
     ):
-        """Constructor.
-
-        :param markup: A bytestring representing markup in an unknown encoding.
-
-        :param known_definite_encodings: When determining the encoding
-         of `markup`, these encodings will be tried first, in
-         order. In HTML terms, this corresponds to the "known
-         definite encoding" step defined here:
-         https://html.spec.whatwg.org/multipage/parsing.html#parsing-with-a-known-character-encoding
-
-        :param user_encodings: These encodings will be tried after the
-         `known_definite_encodings` have been tried and failed, and
-         after an attempt to sniff the encoding by looking at a
-         byte order mark has failed. In HTML terms, this
-         corresponds to the step "user has explicitly instructed
-         the user agent to override the document's character
-         encoding", defined here:
-         https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding
-
-        :param override_encodings: A deprecated alias for
-         known_definite_encodings. Any encodings here will be tried
-         immediately after the encodings in
-         known_definite_encodings.
-
-        :param smart_quotes_to: By default, Microsoft smart quotes will, like all other characters, be converted
-           to Unicode characters. Setting this to 'ascii' will convert them to ASCII quotes instead.
-           Setting it to 'xml' will convert them to XML entity references, and setting it to 'html'
-           will convert them to HTML entity references.
-        :param is_html: If True, this markup is considered to be HTML. Otherwise
-           it's assumed to be XML.
-        :param exclude_encodings: These encodings will not be considered, even
-           if the sniffing code thinks they might make sense.
-
-        """
         self.smart_quotes_to = smart_quotes_to
         self.tried_encodings = []
         self.contains_replacement_characters = False
         self.is_html = is_html
-        self.log = logging.getLogger(__name__)
+        self.log = getLogger(__name__)
         self.detector = EncodingDetector(
             markup, known_definite_encodings, is_html, exclude_encodings,
             user_encodings, override_encodings
         )
 
         # Short-circuit if the data is in Unicode to begin with.
-        if isinstance(markup, str) or markup == '':
+        if isinstance(markup, str) or markup == b'':
             self.markup = markup
             self.unicode_markup = str(markup)
             self.original_encoding = None
@@ -616,41 +705,117 @@ class UnicodeDammit:
                     "Some characters could not be decoded, and were "
                     "replaced with REPLACEMENT CHARACTER."
                 )
+
                 self.contains_replacement_characters = True
                 break
 
         # If none of that worked, we could at this point force it to
         # ASCII, but that would destroy so much data that I think
         # giving up is better.
-        self.unicode_markup = u
-        if not u:
+        #
+        # Note that this is extremely unlikely, probably impossible,
+        # because the "replace" strategy is so powerful. Even running
+        # the Python binary through Unicode, Dammit gives you Unicode,
+        # albeit Unicode riddled with REPLACEMENT CHARACTER.
+        if u is None:
             self.original_encoding = None
+            self.unicode_markup = None
+        else:
+            self.unicode_markup = u
+
+    #: The original markup, before it was converted to Unicode.
+    #: This is not necessarily the same as what was passed in to the
+    #: constructor, since any byte-order mark will be stripped.
+    markup:bytes
 
-    def _sub_ms_char(self, match):
+    #: The Unicode version of the markup, following conversion. This
+    #: is set to `None` if there was simply no way to convert the
+    #: bytestring to Unicode (as with binary data).
+    unicode_markup:Optional[str]
+
+    #: This is True if `UnicodeDammit.unicode_markup` contains
+    #: U+FFFD REPLACEMENT_CHARACTER characters which were not present
+    #: in `UnicodeDammit.markup`. These mark character sequences that
+    #: could not be represented in Unicode.
+    contains_replacement_characters: bool
+
+    #: Unicode, Dammit's best guess as to the original character
+    #: encoding of `UnicodeDammit.markup`.
+    original_encoding:Optional[_Encoding]
+
+    #: The strategy used to handle Microsoft smart quotes.
+    smart_quotes_to: Optional[str]
+
+    #: The (encoding, error handling strategy) 2-tuples that were used to
+    #: try and convert the markup to Unicode.
+    tried_encodings: List[Tuple[_Encoding, str]]
+
+    log: Logger  #: :meta private:
+
+    def _sub_ms_char(self, match:re.Match[bytes]) -> bytes:
         """Changes a MS smart quote character to an XML or HTML
-        entity, or an ASCII character."""
-        orig = match.group(1)
+        entity, or an ASCII character.
+
+        TODO: Since this is only used to convert smart quotes, it
+        could be simplified, and MS_CHARS_TO_ASCII made much less
+        parochial.
+        """
+        orig: bytes = match.group(1)
+        sub: bytes
         if self.smart_quotes_to == 'ascii':
-            sub = self.MS_CHARS_TO_ASCII.get(orig).encode()
+            if orig in self.MS_CHARS_TO_ASCII:
+                sub = self.MS_CHARS_TO_ASCII[orig].encode()
+            else:
+                # Shouldn't happen; substitute the character
+                # with itself.
+                sub = orig
         else:
-            sub = self.MS_CHARS.get(orig)
-            if type(sub) == tuple:
-                if self.smart_quotes_to == 'xml':
-                    sub = '&#x'.encode() + sub[1].encode() + ';'.encode()
-                else:
-                    sub = '&'.encode() + sub[0].encode() + ';'.encode()
+            if orig in self.MS_CHARS:
+                substitutions = self.MS_CHARS[orig]
+                if type(substitutions) == tuple:
+                    if self.smart_quotes_to == 'xml':
+                        sub = b'&#x' + substitutions[1].encode() + b';'
+                    else:
+                        sub = b'&' + substitutions[0].encode() + b';'
+                else:
+                    substitutions = cast(str, substitutions)
+                    sub = substitutions.encode()
             else:
-                sub = sub.encode()
+                # Shouldn't happen; substitute the character
+                # for itself.
+                sub = orig
         return sub
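[Editor's note: the smart-quote substitution rewritten in this hunk can be exercised standalone. A minimal sketch covering only the four common quote bytes — the real `MS_CHARS` table maps the whole 0x80-0x9f range:]

```python
import re
from typing import Dict, Tuple

# A tiny subset of the MS_CHARS table from the diff:
# Windows-1252 byte -> (HTML entity name, XML hex code point).
SMART_QUOTES: Dict[bytes, Tuple[str, str]] = {
    b'\x91': ('lsquo', '2018'),
    b'\x92': ('rsquo', '2019'),
    b'\x93': ('ldquo', '201C'),
    b'\x94': ('rdquo', '201D'),
}

def sub_smart_quotes(data: bytes, smart_quotes_to: str) -> bytes:
    """Replace Windows-1252 smart-quote bytes with entity references,
    the way _sub_ms_char does for smart_quotes_to='xml' or 'html'."""
    def _sub(match: "re.Match[bytes]") -> bytes:
        name, codepoint = SMART_QUOTES[match.group(0)]
        if smart_quotes_to == 'xml':
            return b'&#x' + codepoint.encode() + b';'
        return b'&' + name.encode() + b';'
    return re.sub(b'([\x91-\x94])', _sub, data)
```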
 
+    #: This dictionary maps commonly seen values for "charset" in HTML
+    #: meta tags to the corresponding Python codec names. It only covers
+    #: values that aren't in Python's aliases and can't be determined
+    #: by the heuristics in `find_codec`.
+    #:
+    #: :meta hide-value:
+    CHARSET_ALIASES: Dict[str, _Encoding] = {"macintosh": "mac-roman",
+                                             "x-sjis": "shift-jis"}
+
+    #: A list of encodings that tend to contain Microsoft smart quotes.
+    #:
+    #: :meta hide-value:
+    ENCODINGS_WITH_SMART_QUOTES: _Encodings = [
+        "windows-1252",
+        "iso-8859-1",
+        "iso-8859-2",
+    ]
 
-    def _convert_from(self, proposed, errors="strict"):
+    def _convert_from(self, proposed:_Encoding, errors:str="strict") -> Optional[str]:
         """Attempt to convert the markup to the proposed encoding.
 
         :param proposed: The name of a character encoding.
+        :param errors: An error handling strategy, used when calling `str`.
+        :return: The converted markup, or `None` if the proposed
+            encoding/error handling strategy didn't work.
         """
-        proposed = self.find_codec(proposed)
-        if not proposed or (proposed, errors) in self.tried_encodings:
+        lookup_result = self.find_codec(proposed)
+        if lookup_result is None or (lookup_result, errors) in self.tried_encodings:
             return None
+        proposed = lookup_result
         self.tried_encodings.append((proposed, errors))
         markup = self.markup
         # Convert smart quotes to HTML if coming from an encoding
665 #print("Trying to convert document to %s (errors=%s)" % (830 #print("Trying to convert document to %s (errors=%s)" % (
666 # proposed, errors))831 # proposed, errors))
667 u = self._to_unicode(markup, proposed, errors)832 u = self._to_unicode(markup, proposed, errors)
668 self.markup = u833 self.unicode_markup = u
669 self.original_encoding = proposed834 self.original_encoding = proposed
670 except Exception as e:835 except Exception as e:
671 #print("That didn't work!")836 #print("That didn't work!")
672 #print(e)837 #print(e)
673 return None838 return None
674 #print("Correct encoding: %s" % proposed)839 #print("Correct encoding: %s" % proposed)
675 return self.markup840 return self.unicode_markup
676841
677 def _to_unicode(self, data, encoding, errors="strict"):842 def _to_unicode(self, data:bytes, encoding:_Encoding, errors:str="strict") -> str:
678 """Given a string and its encoding, decodes the string into Unicode.843 """Given a bytestring and its encoding, decodes the string into Unicode.
679844
680 :param encoding: The name of an encoding.845 :param encoding: The name of an encoding.
846 :param errors: An error handling strategy, used when calling `str`.
681 """847 """
682 return str(data, encoding, errors)848 return str(data, encoding, errors)
683849
684 @property850 @property
685 def declared_html_encoding(self):851 def declared_html_encoding(self) -> Optional[str]:
686 """If the markup is an HTML document, returns the encoding declared _within_852 """If the markup is an HTML document, returns the encoding, if any,
687 the document.853 declared *inside* the document.
688 """854 """
689 if not self.is_html:855 if not self.is_html:
690 return None856 return None
691 return self.detector.declared_encoding857 return self.detector.declared_encoding
692858
693 def find_codec(self, charset):859 def find_codec(self, charset:_Encoding) -> Optional[str]:
694 """Convert the name of a character set to a codec name.860 """Look up the Python codec corresponding to a given character set.
695861
696 :param charset: The name of a character set.862 :param charset: The name of a character set.
697 :return: The name of a codec.863 :return: The name of a Python codec.
698 """864 """
699 value = (self._codec(self.CHARSET_ALIASES.get(charset, charset))865 value = (self._codec(self.CHARSET_ALIASES.get(charset, charset))
700 or (charset and self._codec(charset.replace("-", "")))866 or (charset and self._codec(charset.replace("-", "")))
@@ -706,7 +872,7 @@ class UnicodeDammit:
706 return value.lower()872 return value.lower()
707 return None873 return None
708874
709 def _codec(self, charset):875 def _codec(self, charset:_Encoding) -> Optional[str]:
710 if not charset:876 if not charset:
711 return charset877 return charset
712 codec = None878 codec = None
@@ -718,8 +884,11 @@ class UnicodeDammit:
718 return codec884 return codec
719885
720886
721 # A partial mapping of ISO-Latin-1 to HTML entities/XML numeric entities.887 #: A partial mapping of ISO-Latin-1 to HTML entities/XML numeric entities.
722 MS_CHARS = {b'\x80': ('euro', '20AC'),888 #:
889 #: :meta hide-value:
890 MS_CHARS: Dict[bytes, Union[str, Tuple[str, str]]] = {
891 b'\x80': ('euro', '20AC'),
723 b'\x81': ' ',892 b'\x81': ' ',
724 b'\x82': ('sbquo', '201A'),893 b'\x82': ('sbquo', '201A'),
725 b'\x83': ('fnof', '192'),894 b'\x83': ('fnof', '192'),
@@ -752,10 +921,15 @@ class UnicodeDammit:
752 b'\x9e': ('#x17E', '17E'),921 b'\x9e': ('#x17E', '17E'),
753 b'\x9f': ('Yuml', ''),}922 b'\x9f': ('Yuml', ''),}
754923
755 # A parochial partial mapping of ISO-Latin-1 to ASCII. Contains924 #: A parochial partial mapping of ISO-Latin-1 to ASCII. Contains
756 # horrors like stripping diacritical marks to turn á into a, but also925 #: horrors like stripping diacritical marks to turn á into a, but also
757 # contains non-horrors like turning “ into ".926 #: contains non-horrors like turning “ into ".
758 MS_CHARS_TO_ASCII = {927 #:
928 #: Seriously, don't use this for anything other than removing smart
929 #: quotes.
930 #:
931 #: :meta private:
932 MS_CHARS_TO_ASCII: Dict[bytes, str] = {
759 b'\x80' : 'EUR',933 b'\x80' : 'EUR',
760 b'\x81' : ' ',934 b'\x81' : ' ',
761 b'\x82' : ',',935 b'\x82' : ',',
@@ -809,7 +983,7 @@ class UnicodeDammit:
809 b'\xb1' : '+-',983 b'\xb1' : '+-',
810 b'\xb2' : '2',984 b'\xb2' : '2',
811 b'\xb3' : '3',985 b'\xb3' : '3',
812 b'\xb4' : ("'", 'acute'),986 b'\xb4' : "'",
813 b'\xb5' : 'u',987 b'\xb5' : 'u',
814 b'\xb6' : 'P',988 b'\xb6' : 'P',
815 b'\xb7' : '*',989 b'\xb7' : '*',
@@ -887,12 +1061,14 @@ class UnicodeDammit:
887 b'\xff' : 'y',1061 b'\xff' : 'y',
888 }1062 }
8891063
890 # A map used when removing rogue Windows-1252/ISO-8859-11064 #: A map used when removing rogue Windows-1252/ISO-8859-1
891 # characters in otherwise UTF-8 documents.1065 #: characters in otherwise UTF-8 documents.
892 #1066 #:
893 # Note that \x81, \x8d, \x8f, \x90, and \x9d are undefined in1067 #: Note that \\x81, \\x8d, \\x8f, \\x90, and \\x9d are undefined in
894 # Windows-1252.1068 #: Windows-1252.
895 WINDOWS_1252_TO_UTF8 = {1069 #:
1070 #: :meta hide-value:
1071 WINDOWS_1252_TO_UTF8: Dict[int, bytes] = {
896 0x80 : b'\xe2\x82\xac', # €1072 0x80 : b'\xe2\x82\xac', # €
897 0x82 : b'\xe2\x80\x9a', # ‚1073 0x82 : b'\xe2\x80\x9a', # ‚
898 0x83 : b'\xc6\x92', # Æ’1074 0x83 : b'\xc6\x92', # Æ’
@@ -1017,33 +1193,37 @@ class UnicodeDammit:
1017 0xfe : b'\xc3\xbe', # þ1193 0xfe : b'\xc3\xbe', # þ
1018 }1194 }
10191195
1020 MULTIBYTE_MARKERS_AND_SIZES = [1196 #: :meta private:
1197 MULTIBYTE_MARKERS_AND_SIZES:List[Tuple[int, int, int]] = [
1021 (0xc2, 0xdf, 2), # 2-byte characters start with a byte C2-DF1198 (0xc2, 0xdf, 2), # 2-byte characters start with a byte C2-DF
1022 (0xe0, 0xef, 3), # 3-byte characters start with E0-EF1199 (0xe0, 0xef, 3), # 3-byte characters start with E0-EF
1023 (0xf0, 0xf4, 4), # 4-byte characters start with F0-F41200 (0xf0, 0xf4, 4), # 4-byte characters start with F0-F4
1024 ]1201 ]
10251202
1026 FIRST_MULTIBYTE_MARKER = MULTIBYTE_MARKERS_AND_SIZES[0][0]1203 #: :meta private:
1027 LAST_MULTIBYTE_MARKER = MULTIBYTE_MARKERS_AND_SIZES[-1][1]1204 FIRST_MULTIBYTE_MARKER:int = MULTIBYTE_MARKERS_AND_SIZES[0][0]
1205
1206 #: :meta private:
1207 LAST_MULTIBYTE_MARKER:int = MULTIBYTE_MARKERS_AND_SIZES[-1][1]
10281208
1029 @classmethod1209 @classmethod
1030 def detwingle(cls, in_bytes, main_encoding="utf8",1210 def detwingle(cls, in_bytes:bytes, main_encoding:_Encoding="utf8",
1031 embedded_encoding="windows-1252"):1211 embedded_encoding:_Encoding="windows-1252") -> bytes:
1032 """Fix characters from one encoding embedded in some other encoding.1212 """Fix characters from one encoding embedded in some other encoding.
10331213
1034 Currently the only situation supported is Windows-1252 (or its1214 Currently the only situation supported is Windows-1252 (or its
1035 subset ISO-8859-1), embedded in UTF-8.1215 subset ISO-8859-1), embedded in UTF-8.
10361216
1037 :param in_bytes: A bytestring that you suspect contains1217 :param in_bytes: A bytestring that you suspect contains
1038 characters from multiple encodings. Note that this _must_1218 characters from multiple encodings. Note that this *must*
1039 be a bytestring. If you've already converted the document1219 be a bytestring. If you've already converted the document
1040 to Unicode, you're too late.1220 to Unicode, you're too late.
1041 :param main_encoding: The primary encoding of `in_bytes`.1221 :param main_encoding: The primary encoding of ``in_bytes``.
1042 :param embedded_encoding: The encoding that was used to embed characters1222 :param embedded_encoding: The encoding that was used to embed characters
1043 in the main document.1223 in the main document.
1044 :return: A bytestring in which `embedded_encoding`1224 :return: A bytestring similar to ``in_bytes``, in which
1045 characters have been converted to their `main_encoding`1225 ``embedded_encoding`` characters have been converted to
1046 equivalents.1226 their ``main_encoding`` equivalents.
1047 """1227 """
1048 if embedded_encoding.replace('_', '-').lower() not in (1228 if embedded_encoding.replace('_', '-').lower() not in (
1049 'windows-1252', 'windows_1252'):1229 'windows-1252', 'windows_1252'):
@@ -1061,9 +1241,6 @@ class UnicodeDammit:
         pos = 0
         while pos < len(in_bytes):
             byte = in_bytes[pos]
-            if not isinstance(byte, int):
-                # Python 2.x
-                byte = ord(byte)
             if (byte >= cls.FIRST_MULTIBYTE_MARKER
                 and byte <= cls.LAST_MULTIBYTE_MARKER):
                 # This is the start of a UTF-8 multibyte character. Skip
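[Editor's note: the detwingle scanning loop shown in this hunk can be sketched standalone. This simplified version uses only a four-entry subset of `WINDOWS_1252_TO_UTF8` and does not validate UTF-8 continuation bytes, so it is an illustration of the technique, not a replacement for `UnicodeDammit.detwingle`:]

```python
from typing import Dict

# A small subset of the WINDOWS_1252_TO_UTF8 table from the diff.
W1252_TO_UTF8: Dict[int, bytes] = {
    0x91: b'\xe2\x80\x98',  # left single quote
    0x92: b'\xe2\x80\x99',  # right single quote
    0x93: b'\xe2\x80\x9c',  # left double quote
    0x94: b'\xe2\x80\x9d',  # right double quote
}

def detwingle(in_bytes: bytes) -> bytes:
    """Scan a mostly-UTF-8 bytestring, passing well-formed UTF-8
    multibyte sequences through and re-encoding stray Windows-1252
    smart-quote bytes as UTF-8."""
    out = bytearray()
    pos = 0
    while pos < len(in_bytes):
        byte = in_bytes[pos]
        if 0xC2 <= byte <= 0xDF:
            size = 2   # 2-byte UTF-8 sequence starts with C2-DF
        elif 0xE0 <= byte <= 0xEF:
            size = 3   # 3-byte UTF-8 sequence starts with E0-EF
        elif 0xF0 <= byte <= 0xF4:
            size = 4   # 4-byte UTF-8 sequence starts with F0-F4
        else:
            size = 1
        if size > 1:
            out += in_bytes[pos:pos + size]   # UTF-8: copy untouched
        elif byte in W1252_TO_UTF8:
            out += W1252_TO_UTF8[byte]        # rogue Windows-1252 byte
        else:
            out.append(byte)
        pos += size
    return bytes(out)
```

The result decodes cleanly as UTF-8 even though the input mixed two encodings, which is the whole point of detwingling.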
diff --git a/bs4/diagnose.py b/bs4/diagnose.py
index e079772..201b879 100644
--- a/bs4/diagnose.py
+++ b/bs4/diagnose.py
@@ -7,8 +7,11 @@ import cProfile
 from io import BytesIO
 from html.parser import HTMLParser
 import bs4
 from bs4 import BeautifulSoup, __version__
 from bs4.builder import builder_registry
+from typing import TYPE_CHECKING
+if TYPE_CHECKING:
+    from bs4._typing import _IncomingMarkup
 
 import os
 import pstats
@@ -19,10 +22,10 @@ import traceback
 import sys
 import cProfile
 
-def diagnose(data):
+def diagnose(data:_IncomingMarkup) -> None:
     """Diagnostic suite for isolating common problems.
 
-    :param data: A string containing markup that needs to be explained.
+    :param data: Some markup that needs to be explained.
     :return: None; diagnostics are printed to standard output.
     """
     print(("Diagnostic running on Beautiful Soup %s" % __version__))
@@ -75,7 +78,7 @@ def diagnose(data):
 
     print(("-" * 80))
 
-def lxml_trace(data, html=True, **kwargs):
+def lxml_trace(data, html:bool=True, **kwargs) -> None:
     """Print out the lxml events that occur during parsing.
 
     This lets you see how lxml parses a document when no Beautiful
@@ -109,7 +112,7 @@ class AnnouncingParser(HTMLParser):
         print(s)
 
     def handle_starttag(self, name, attrs):
-        self._p("%s START" % name)
+        self._p(f"{name} {attrs} START")
 
     def handle_endtag(self, name):
         self._p("%s END" % name)
@@ -146,11 +149,14 @@ def htmlparser_trace(data):
     parser = AnnouncingParser()
     parser.feed(data)
 
-_vowels = "aeiou"
-_consonants = "bcdfghjklmnpqrstvwxyz"
+_vowels:str = "aeiou"
+_consonants:str = "bcdfghjklmnpqrstvwxyz"
 
-def rword(length=5):
-    "Generate a random word-like string."
+def rword(length:int=5) -> str:
+    """Generate a random word-like string.
+
+    :meta private:
+    """
     s = ''
     for i in range(length):
         if i % 2 == 0:
@@ -160,12 +166,18 @@ def rword(length=5):
             s += random.choice(t)
     return s
 
-def rsentence(length=4):
-    "Generate a random sentence-like string."
+def rsentence(length:int=4) -> str:
+    """Generate a random sentence-like string.
+
+    :meta private:
+    """
     return " ".join(rword(random.randint(4,9)) for i in range(length))
 
-def rdoc(num_elements=1000):
-    """Randomly generate an invalid HTML document."""
+def rdoc(num_elements:int=1000) -> str:
+    """Randomly generate an invalid HTML document.
+
+    :meta private:
+    """
     tag_names = ['p', 'div', 'span', 'i', 'b', 'script', 'table']
     elements = []
     for i in range(num_elements):
@@ -182,24 +194,24 @@ def rdoc(num_elements=1000):
         elements.append("</%s>" % tag_name)
     return "<html>" + "\n".join(elements) + "</html>"
 
-def benchmark_parsers(num_elements=100000):
+def benchmark_parsers(num_elements:int=100000) -> None:
     """Very basic head-to-head performance benchmark."""
     print(("Comparative parser benchmark on Beautiful Soup %s" % __version__))
     data = rdoc(num_elements)
     print(("Generated a large invalid HTML document (%d bytes)." % len(data)))
 
-    for parser in ["lxml", ["lxml", "html"], "html5lib", "html.parser"]:
+    for parser_name in ["lxml", ["lxml", "html"], "html5lib", "html.parser"]:
         success = False
         try:
             a = time.time()
-            soup = BeautifulSoup(data, parser)
+            soup = BeautifulSoup(data, parser_name)
             b = time.time()
             success = True
         except Exception as e:
-            print(("%s could not parse the markup." % parser))
+            print(("%s could not parse the markup." % parser_name))
             traceback.print_exc()
         if success:
-            print(("BS4+%s parsed the markup in %.2fs." % (parser, b-a)))
+            print(("BS4+%s parsed the markup in %.2fs." % (parser_name, b-a)))
 
     from lxml import etree
     a = time.time()
@@ -214,7 +226,7 @@ def benchmark_parsers(num_elements=100000):
     b = time.time()
     print(("Raw html5lib parsed the markup in %.2fs." % (b-a)))
 
-def profile(num_elements=100000, parser="lxml"):
+def profile(num_elements:int=100000, parser:str="lxml"):
     """Use Python's profiler on a randomly generated document."""
     filehandle = tempfile.NamedTemporaryFile()
     filename = filehandle.name
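The `benchmark_parsers()` change above keeps the existing timing pattern: bracket each parse with `time.time()` calls and report the delta. As a self-contained sanity check of that pattern using only the standard library (`CountingParser` and `time_parse` are illustrative names, not part of this patch):

```python
import time
from html.parser import HTMLParser

class CountingParser(HTMLParser):
    """Counts start tags, standing in for a real tree builder."""
    def __init__(self):
        super().__init__()
        self.starts = 0

    def handle_starttag(self, name, attrs):
        self.starts += 1

def time_parse(markup):
    """Parse markup once; return (start-tag count, seconds elapsed)."""
    parser = CountingParser()
    a = time.time()
    parser.feed(markup)
    parser.close()
    b = time.time()
    return parser.starts, b - a

count, elapsed = time_parse("<html>" + "<p>word</p>" * 1000 + "</html>")
```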
diff --git a/bs4/element.py b/bs4/element.py
index 0aefe73..8b3774e 100644
--- a/bs4/element.py
+++ b/bs4/element.py
@@ -1,55 +1,102 @@
+from __future__ import annotations
 # Use of this source code is governed by the MIT license.
 __license__ = "MIT"
 
-try:
-    from collections.abc import Callable # Python 3.6
-except ImportError as e:
-    from collections import Callable
 import re
 import sys
 import warnings
 
 from bs4.css import CSS
+from bs4._deprecation import (
+    _deprecated,
+    _deprecated_alias,
+    _deprecated_function_alias,
+)
 from bs4.formatter import (
     Formatter,
     HTMLFormatter,
     XMLFormatter,
 )
 
-DEFAULT_OUTPUT_ENCODING = "utf-8"
-
-nonwhitespace_re = re.compile(r"\S+")
-
-# NOTE: This isn't used as of 4.7.0. I'm leaving it for a little bit on
-# the off chance someone imported it for their own use.
-whitespace_re = re.compile(r"\s+")
+from typing import (
+    Any,
+    Callable,
+    Dict,
+    Generator,
+    Generic,
+    Iterable,
+    Iterator,
+    List,
+    Mapping,
+    Optional,
+    Pattern,
+    Sequence,
+    Set,
+    TYPE_CHECKING,
+    Tuple,
+    Type,
+    TypeVar,
+    Union,
+    cast,
+)
+from typing_extensions import Self
+if TYPE_CHECKING:
+    from bs4 import BeautifulSoup
+    from bs4.builder import TreeBuilder
+    from bs4.dammit import _Encoding
+    from bs4.formatter import (
+        _EntitySubstitutionFunction,
+        _FormatterOrName,
+    )
+    from bs4._typing import (
+        _AttributeValue,
+        _AttributeValues,
+        _StrainableElement,
+        _StrainableAttribute,
+        _StrainableAttributes,
+        _StrainableString,
+    )
+
+# Deprecated module-level attributes.
+# See https://peps.python.org/pep-0562/
+_deprecated_names = dict(
+    whitespace_re = 'The {name} attribute was deprecated in version 4.7.0. If you need it, make your own copy.'
+)
+#: :meta private:
+_deprecated_whitespace_re: Pattern[str] = re.compile(r"\s+")
 
-def _alias(attr):
-    """Alias one attribute name to another for backward compatibility"""
-    @property
-    def alias(self):
-        return getattr(self, attr)
-
-    @alias.setter
-    def alias(self):
-        return setattr(self, attr)
-    return alias
+def __getattr__(name):
+    if name in _deprecated_names:
+        message = _deprecated_names[name]
+        warnings.warn(
+            message.format(name=name),
+            DeprecationWarning, stacklevel=2
+        )
+        return globals()[f"_deprecated_{name}"]
+    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
 
-
-# These encodings are recognized by Python (so PageElement.encode
-# could theoretically support them) but XML and HTML don't recognize
-# them (so they should not show up in an XML or HTML document as that
-# document's encoding).
-#
-# If an XML document is encoded in one of these encodings, no encoding
-# will be mentioned in the XML declaration. If an HTML document is
-# encoded in one of these encodings, and the HTML document has a
-# <meta> tag that mentions an encoding, the encoding will be given as
-# the empty string.
-#
-# Source:
-# https://docs.python.org/3/library/codecs.html#python-specific-encodings
-PYTHON_SPECIFIC_ENCODINGS = set([
+#: Documents output by Beautiful Soup will be encoded with
+#: this encoding unless you specify otherwise.
+DEFAULT_OUTPUT_ENCODING:str = "utf-8"
+
+#: A regular expression that can be used to split on whitespace.
+nonwhitespace_re: Pattern[str] = re.compile(r"\S+")
+
+#: These encodings are recognized by Python (so `Tag.encode`
+#: could theoretically support them) but XML and HTML don't recognize
+#: them (so they should not show up in an XML or HTML document as that
+#: document's encoding).
+#:
+#: If an XML document is encoded in one of these encodings, no encoding
+#: will be mentioned in the XML declaration. If an HTML document is
+#: encoded in one of these encodings, and the HTML document has a
+#: <meta> tag that mentions an encoding, the encoding will be given as
+#: the empty string.
+#:
+#: Source:
+#: Python documentation, `Python Specific Encodings <https://docs.python.org/3/library/codecs.html#python-specific-encodings>`_
+PYTHON_SPECIFIC_ENCODINGS: Set[_Encoding] = set([
     "idna",
     "mbcs",
     "oem",
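The module-level `__getattr__` added in this hunk follows PEP 562: accessing a deprecated module attribute emits a `DeprecationWarning` while still returning the renamed object. A minimal standalone sketch of the same pattern, using a throwaway module object rather than `bs4.element` itself (all names here are hypothetical):

```python
import re
import types
import warnings

# Build a throwaway module demonstrating the PEP 562 pattern: a
# module-level __getattr__ warns when a deprecated name is accessed,
# then hands back the renamed object.
mod = types.ModuleType("demo")
mod._deprecated_names = {
    "whitespace_re": "The {name} attribute was deprecated in version 4.7.0.",
}
mod._deprecated_whitespace_re = re.compile(r"\s+")

def _module_getattr(name):
    if name in mod._deprecated_names:
        warnings.warn(
            mod._deprecated_names[name].format(name=name),
            DeprecationWarning, stacklevel=2,
        )
        return getattr(mod, f"_deprecated_{name}")
    raise AttributeError(f"module 'demo' has no attribute {name!r}")

# Attribute lookup on a module falls back to a __getattr__ found in the
# module's namespace (PEP 562).
mod.__getattr__ = _module_getattr

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    pattern = mod.whitespace_re   # old name still resolves, with a warning
```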
@@ -66,11 +113,17 @@ PYTHON_SPECIFIC_ENCODINGS = set([
 
 
 class NamespacedAttribute(str):
-    """A namespaced string (e.g. 'xml:lang') that remembers the namespace
-    ('xml') and the name ('lang') that were used to create it.
+    """A namespaced attribute (e.g. the 'xml:lang' in 'xml:lang="en"')
+    which remembers the namespace prefix ('xml') and the name ('lang')
+    that were used to create it.
     """
 
-    def __new__(cls, prefix, name=None, namespace=None):
+    prefix: Optional[str]
+    name: Optional[str]
+    namespace: Optional[str]
+
+    def __new__(cls, prefix:Optional[str],
+                name:Optional[str]=None, namespace:Optional[str]=None):
         if not name:
             # This is the default namespace. Its name "has no value"
             # per https://www.w3.org/TR/xml-names/#defaulting
@@ -89,72 +142,126 @@ class NamespacedAttribute(str):
         return obj
 
 class AttributeValueWithCharsetSubstitution(str):
-    """A stand-in object for a character encoding specified in HTML."""
-
-class CharsetMetaAttributeValue(AttributeValueWithCharsetSubstitution):
-    """A generic stand-in for the value of a meta tag's 'charset' attribute.
-
-    When Beautiful Soup parses the markup '<meta charset="utf8">', the
-    value of the 'charset' attribute will be one of these objects.
-    """
+    """An abstract class standing in for a character encoding specified
+    inside an HTML ``<meta>`` tag.
+
+    Subclasses exist for each place such a character encoding might be
+    found: either inside the ``charset`` attribute
+    (`CharsetMetaAttributeValue`) or inside the ``content`` attribute
+    (`ContentMetaAttributeValue`)
+
+    This allows Beautiful Soup to replace that part of the HTML file
+    with a different encoding when ouputting a tree as a string.
+    """
+    # The original, un-encoded value of the ``content`` attribute.
+    #: :meta private:
+    original_value: str
+
+    def substitute_encoding(self, eventual_encoding:str) -> str:
+        """Do whatever's necessary in this implementation-specific
+        portion an HTML document to substitute in a specific encoding.
+        """
+        raise NotImplementedError()
+
+class CharsetMetaAttributeValue(AttributeValueWithCharsetSubstitution):
+    """A generic stand-in for the value of a ``<meta>`` tag's ``charset``
+    attribute.
+
+    When Beautiful Soup parses the markup ``<meta charset="utf8">``, the
+    value of the ``charset`` attribute will become one of these objects.
 
-    def __new__(cls, original_value):
+    If the document is later encoded to an encoding other than UTF-8, its
+    ``<meta>`` tag will mention the new encoding instead of ``utf8``.
+    """
+    def __new__(cls, original_value:str) -> Self:
+        # We don't need to use the original value for anything, but
+        # it might be useful for the user to know.
         obj = str.__new__(cls, original_value)
         obj.original_value = original_value
         return obj
 
-    def encode(self, encoding):
+    def substitute_encoding(self, eventual_encoding:_Encoding="utf-8") -> str:
         """When an HTML document is being encoded to a given encoding, the
-        value of a meta tag's 'charset' is the name of the encoding.
+        value of a ``<meta>`` tag's ``charset`` becomes the name of
+        the encoding.
         """
-        if encoding in PYTHON_SPECIFIC_ENCODINGS:
+        if eventual_encoding in PYTHON_SPECIFIC_ENCODINGS:
             return ''
-        return encoding
+        return eventual_encoding
 
 
 class ContentMetaAttributeValue(AttributeValueWithCharsetSubstitution):
-    """A generic stand-in for the value of a meta tag's 'content' attribute.
+    """A generic stand-in for the value of a ``<meta>`` tag's ``content``
+    attribute.
 
     When Beautiful Soup parses the markup:
-    <meta http-equiv="content-type" content="text/html; charset=utf8">
+    ``<meta http-equiv="content-type" content="text/html; charset=utf8">``
 
-    The value of the 'content' attribute will be one of these objects.
-    """
-
-    CHARSET_RE = re.compile(r"((^|;)\s*charset=)([^;]*)", re.M)
-
-    def __new__(cls, original_value):
+    The value of the ``content`` attribute will become one of these objects.
+
+    If the document is later encoded to an encoding other than UTF-8, its
+    ``<meta>`` tag will mention the new encoding instead of ``utf8``.
+    """
+    #: Match the 'charset' argument inside the 'content' attribute
+    #: of a <meta> tag.
+    #: :meta private:
+    CHARSET_RE: Pattern[str] = re.compile(
+        r"((^|;)\s*charset=)([^;]*)", re.M
+    )
+
+    def __new__(cls, original_value:str) -> Self:
         match = cls.CHARSET_RE.search(original_value)
-        if match is None:
-            # No substitution necessary.
-            return str.__new__(str, original_value)
-
         obj = str.__new__(cls, original_value)
         obj.original_value = original_value
         return obj
 
-    def encode(self, encoding):
-        if encoding in PYTHON_SPECIFIC_ENCODINGS:
-            return ''
+    def substitute_encoding(self, eventual_encoding:_Encoding="utf-8") -> str:
+        """When an HTML document is being encoded to a given encoding, the
+        value of the ``charset=`` in a ``<meta>`` tag's ``content`` becomes
+        the name of the encoding.
+        """
+        if eventual_encoding in PYTHON_SPECIFIC_ENCODINGS:
+            return self.CHARSET_RE.sub('', self.original_value)
         def rewrite(match):
-            return match.group(1) + encoding
+            return match.group(1) + eventual_encoding
         return self.CHARSET_RE.sub(rewrite, self.original_value)
 
 
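The renamed `substitute_encoding()` boils down to a regex rewrite of the `charset=` argument inside the `content` attribute value. A minimal sketch using the same `CHARSET_RE` pattern the patch keeps, with a hypothetical `substitute()` helper standing in for the method:

```python
import re

# The same pattern as ContentMetaAttributeValue.CHARSET_RE: group 1
# captures everything up to and including "charset=", group 3 the value.
CHARSET_RE = re.compile(r"((^|;)\s*charset=)([^;]*)", re.M)

def substitute(content, eventual_encoding):
    """Rewrite the charset= argument inside a <meta> content value.

    A hypothetical helper standing in for substitute_encoding().
    """
    return CHARSET_RE.sub(lambda m: m.group(1) + eventual_encoding, content)

rewritten = substitute("text/html; charset=utf8", "euc-jp")
```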
 class PageElement(object):
-    """Contains the navigational information for some part of the page:
-    that is, its current location in the parse tree.
+    """An abstract class representing a single element in the parse tree.
 
-    NavigableString, Tag, etc. are all subclasses of PageElement.
+    `NavigableString`, `Tag`, etc. are all subclasses of
+    `PageElement`. For this reason you'll see a lot of methods that
+    return `PageElement`, but you'll never see an actual `PageElement`
+    object. For the most part you can think of `PageElement` as
+    meaning "a `Tag` or a `NavigableString`."
     """
 
-    # In general, we can't tell just by looking at an element whether
-    # it's contained in an XML document or an HTML document. But for
-    # Tags (q.v.) we can store this information at parse time.
-    known_xml = None
-
-    def setup(self, parent=None, previous_element=None, next_element=None,
-              previous_sibling=None, next_sibling=None):
+    #: In general, we can't tell just by looking at an element whether
+    #: it's contained in an XML document or an HTML document. But for
+    #: `Tag` objects (q.v.) we can store this information at parse time.
+    #: :meta private:
+    known_xml: Optional[bool] = None
+
+    #: Whether or not this element has been decomposed from the tree
+    #: it was created in.
+    _decomposed: bool
+
+    parent: Optional[Tag]
+    next_element: Optional[PageElement]
+    previous_element: Optional[PageElement]
+    next_sibling: Optional[PageElement]
+    previous_sibling: Optional[PageElement]
+
+    #: Whether or not this element is hidden from generated output.
+    #: Only the `BeautifulSoup` object itself is hidden.
+    hidden: bool=False
+
+    def setup(self, parent:Optional[Tag]=None,
+              previous_element:Optional[PageElement]=None,
+              next_element:Optional[PageElement]=None,
+              previous_sibling:Optional[PageElement]=None,
+              next_sibling:Optional[PageElement]=None) -> None:
         """Sets up the initial relations between this element and
         other elements.
 
@@ -175,7 +282,7 @@ class PageElement(object):
         self.parent = parent
 
         self.previous_element = previous_element
-        if previous_element is not None:
+        if self.previous_element is not None:
             self.previous_element.next_element = self
 
         self.next_element = next_element
@@ -191,10 +298,10 @@ class PageElement(object):
             previous_sibling = self.parent.contents[-1]
 
         self.previous_sibling = previous_sibling
-        if previous_sibling is not None:
+        if self.previous_sibling is not None:
             self.previous_sibling.next_sibling = self
 
-    def format_string(self, s, formatter):
+    def format_string(self, s:str, formatter:Optional[_FormatterOrName]) -> str:
         """Format the given string using the given formatter.
 
         :param s: A string.
@@ -207,28 +314,35 @@ class PageElement(object):
         output = formatter.substitute(s)
         return output
 
-    def formatter_for_name(self, formatter):
+    def formatter_for_name(
+            self,
+            formatter_name:Union[_FormatterOrName, _EntitySubstitutionFunction]
+    ) -> Formatter:
         """Look up or create a Formatter for the given identifier,
         if necessary.
 
-        :param formatter: Can be a Formatter object (used as-is), a
+        :param formatter: Can be a `Formatter` object (used as-is), a
            function (used as the entity substitution hook for an
-           XMLFormatter or HTMLFormatter), or a string (used to look
-           up an XMLFormatter or HTMLFormatter in the appropriate
+           `XMLFormatter` or `HTMLFormatter`), or a string (used to look
+           up an `XMLFormatter` or `HTMLFormatter` in the appropriate
            registry.
         """
-        if isinstance(formatter, Formatter):
-            return formatter
+        if isinstance(formatter_name, Formatter):
+            return formatter_name
+        c: type[Formatter]
+        registry: Mapping[Optional[str], Formatter]
         if self._is_xml:
             c = XMLFormatter
+            registry = XMLFormatter.REGISTRY
         else:
             c = HTMLFormatter
-        if isinstance(formatter, Callable):
-            return c(entity_substitution=formatter)
-        return c.REGISTRY[formatter]
+            registry = HTMLFormatter.REGISTRY
+        if callable(formatter_name):
+            return c(entity_substitution=formatter_name)
+        return registry[formatter_name]
 
     @property
-    def _is_xml(self):
+    def _is_xml(self) -> bool:
         """Is this element part of an XML tree or an HTML tree?
 
         This is used in formatter_for_name, when deciding whether an
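The rewritten `formatter_for_name()` dispatches on three input kinds: a formatter instance is returned as-is, a callable becomes the entity-substitution hook, and a string keys into a registry. A minimal sketch of that dispatch, with hypothetical `Fmt`/`REGISTRY` stand-ins for the real `Formatter` classes:

```python
# Sketch of the three-way lookup formatter_for_name() implements:
# instance -> as-is, callable -> wrapped, string -> registry entry.
class Fmt:
    def __init__(self, substitute=None):
        # substitute is the entity-substitution hook; identity by default.
        self.substitute = substitute or (lambda s: s)

REGISTRY = {"minimal": Fmt(), "shout": Fmt(str.upper)}

def formatter_for_name(formatter_name):
    if isinstance(formatter_name, Fmt):
        return formatter_name            # used as-is
    if callable(formatter_name):
        return Fmt(formatter_name)       # wrapped as the substitution hook
    return REGISTRY[formatter_name]      # string key into the registry
```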
@@ -250,28 +364,41 @@ class PageElement(object):
             return getattr(self, 'is_xml', False)
         return self.parent._is_xml
 
-    nextSibling = _alias("next_sibling") # BS3
-    previousSibling = _alias("previous_sibling") # BS3
+    nextSibling = _deprecated_alias("nextSibling", "next_sibling", "4.0.0")
+    previousSibling = _deprecated_alias(
+        "previousSibling", "previous_sibling", "4.0.0"
+    )
 
-    default = object()
-    def _all_strings(self, strip=False, types=default):
+    def __deepcopy__(self, memo:Dict, recursive:bool=False) -> Self:
+        raise NotImplementedError()
+
+    def __copy__(self) -> Self:
+        """A copy of a PageElement can only be a deep copy, because
+        only one PageElement can occupy a given place in a parse tree.
+        """
+        return self.__deepcopy__({})
+
+    default: Iterable[type[NavigableString]] = tuple() #: :meta private:
+    def _all_strings(self, strip:bool=False, types:Iterable[type[NavigableString]]=default) -> Iterator[str]:
         """Yield all strings of certain classes, possibly stripping them.
 
-        This is implemented differently in Tag and NavigableString.
+        This is implemented differently in `Tag` and `NavigableString`.
         """
         raise NotImplementedError()
 
     @property
-    def stripped_strings(self):
-        """Yield all strings in this PageElement, stripping them first.
+    def stripped_strings(self) -> Iterator[str]:
+        """Yield all interesting strings in this PageElement, stripping them
+        first.
 
-        :yield: A sequence of stripped strings.
+        See `Tag` for information on which strings are considered
+        interesting in a given context.
         """
         for string in self._all_strings(True):
             yield string
 
-    def get_text(self, separator="", strip=False,
-                 types=default):
+    def get_text(self, separator:str="", strip:bool=False,
+                 types:Iterable[Type[NavigableString]]=default) -> str:
         """Get all child strings of this PageElement, concatenated using the
         given separator.
 
@@ -294,19 +421,19 @@ class PageElement(object):
     getText = get_text
     text = property(get_text)
 
-    def replace_with(self, *args):
-        """Replace this PageElement with one or more PageElements, keeping the
-        rest of the tree the same.
+    def replace_with(self, *args:PageElement) -> PageElement:
+        """Replace this `PageElement` with one or more other `PageElements`,
+        keeping the rest of the tree the same.
 
-        :param args: One or more PageElements.
-        :return: `self`, no longer part of the tree.
+        :return: This `PageElement`, no longer part of the tree.
         """
         if self.parent is None:
             raise ValueError(
                 "Cannot replace one element with another when the "
                 "element to be replaced is not part of a tree.")
         if len(args) == 1 and args[0] is self:
-            return
+            # Replacing an element with itself is a no-op.
+            return self
         if any(x is self.parent for x in args):
             raise ValueError("Cannot replace a Tag with its parent.")
         old_parent = self.parent
@@ -315,45 +442,28 @@ class PageElement(object):
         for idx, replace_with in enumerate(args, start=my_index):
             old_parent.insert(idx, replace_with)
         return self
-    replaceWith = replace_with # BS3
+    replaceWith = _deprecated_function_alias(
+        "replaceWith", "replace_with", "4.0.0"
+    )
 
-    def unwrap(self):
-        """Replace this PageElement with its contents.
+    def wrap(self, wrap_inside:Tag) -> Tag:
+        """Wrap this `PageElement` inside a `Tag`.
 
-        :return: `self`, no longer part of the tree.
-        """
-        my_parent = self.parent
-        if self.parent is None:
-            raise ValueError(
-                "Cannot replace an element with its contents when that"
-                "element is not part of a tree.")
-        my_index = self.parent.index(self)
-        self.extract(_self_index=my_index)
-        for child in reversed(self.contents[:]):
-            my_parent.insert(my_index, child)
-        return self
-    replace_with_children = unwrap
-    replaceWithChildren = unwrap # BS3
-
-    def wrap(self, wrap_inside):
-        """Wrap this PageElement inside another one.
-
-        :param wrap_inside: A PageElement.
-        :return: `wrap_inside`, occupying the position in the tree that used
-        to be occupied by `self`, and with `self` inside it.
+        :return: ``wrap_inside``, occupying the position in the tree that used
+        to be occupied by this object, and with this object now inside it.
         """
         me = self.replace_with(wrap_inside)
         wrap_inside.append(me)
         return wrap_inside
 
-    def extract(self, _self_index=None):
+    def extract(self, _self_index:Optional[int]=None) -> PageElement:
         """Destructively rips this element out of the tree.
 
         :param _self_index: The location of this element in its parent's
            .contents, if known. Passing this in allows for a performance
            optimization.
 
-        :return: `self`, no longer part of the tree.
+        :return: this `PageElement`, no longer part of the tree.
         """
         if self.parent is not None:
             if _self_index is None:
@@ -364,11 +474,17 @@ class PageElement(object):
         #this element (and any children) hadn't been parsed. Connect
         #the two.
         last_child = self._last_descendant()
+
+        # last_child can't be None because we passed accept_self=True
+        # into _last_descendant. Worst case, last_child will be
+        # self. Making this cast removes several mypy complaints later
+        # on as we manipulate last_child.
+        last_child = cast(PageElement, last_child)
         next_element = last_child.next_element
 
-        if (self.previous_element is not None and
-            self.previous_element is not next_element):
-            self.previous_element.next_element = next_element
+        if self.previous_element is not None:
+            if self.previous_element is not next_element:
+                self.previous_element.next_element = next_element
         if next_element is not None and next_element is not self.previous_element:
             next_element.previous_element = self.previous_element
         self.previous_element = None
@@ -384,12 +500,38 @@ class PageElement(object):
         self.previous_sibling = self.next_sibling = None
         return self
 
-    def _last_descendant(self, is_initialized=True, accept_self=True):
+    def decompose(self) -> None:
+        """Recursively destroys this `PageElement` and its children.
+
+        The element will be removed from the tree and wiped out; so
+        will everything beneath it.
+
+        The behavior of a decomposed `PageElement` is undefined and you
+        should never use one for anything, but if you need to *check*
+        whether an element has been decomposed, you can use the
+        `PageElement.decomposed` property.
+        """
+        self.extract()
+        e: Optional[PageElement] = self
+        next_up: Optional[PageElement] = None
+        while e is not None:
+            next_up = e.next_element
+            e.__dict__.clear()
+            if isinstance(e, Tag):
+                e.contents = []
+            e._decomposed = True
+            e = next_up
+
+    def _last_descendant(
+            self, is_initialized:bool=True, accept_self:bool=True
+    ) -> Optional[PageElement]:
         """Finds the last element beneath this object to be parsed.
 
-        :param is_initialized: Has `setup` been called on this PageElement
-        yet?
-        :param accept_self: Is `self` an acceptable answer to the question?
+        :param is_initialized: Has `PageElement.setup` been called on
+        this `PageElement` yet?
+
+        :param accept_self: Is ``self`` an acceptable answer to the
+        question?
         """
         if is_initialized and self.next_sibling is not None:
             last_child = self.next_sibling.previous_element
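`extract()` and the new `decompose()` both rest on the same pointer surgery: every element carries `next_element`/`previous_element` links, and removing an element splices its neighbors together. A toy sketch of that operation, with a hypothetical `Node` class in place of `PageElement`:

```python
# A toy model of the splice PageElement.extract() performs: each node
# keeps next_element/previous_element links; extracting a node connects
# its two neighbors and clears its own links.
class Node:
    def __init__(self, name):
        self.name = name
        self.previous_element = None
        self.next_element = None

    def extract(self):
        prev, nxt = self.previous_element, self.next_element
        if prev is not None:
            prev.next_element = nxt
        if nxt is not None:
            nxt.previous_element = prev
        self.previous_element = self.next_element = None
        return self

a, b, c = Node("a"), Node("b"), Node("c")
a.next_element, b.previous_element = b, a
b.next_element, c.previous_element = c, b
b.extract()   # a and c are now directly linked
```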
@@ -400,121 +542,15 @@ class PageElement(object):
         if not accept_self and last_child is self:
             last_child = None
         return last_child
-    # BS3: Not part of the API!
-    _lastRecursiveChild = _last_descendant
 
-    def insert(self, position, new_child):
-        """Insert a new PageElement in the list of this PageElement's children.
-
-        This works the same way as `list.insert`.
-
-        :param position: The numeric position that should be occupied
-        in `self.children` by the new PageElement.
-        :param new_child: A PageElement.
-        """
-        if new_child is None:
-            raise ValueError("Cannot insert None into a tag.")
-        if new_child is self:
-            raise ValueError("Cannot insert a tag into itself.")
-        if (isinstance(new_child, str)
-            and not isinstance(new_child, NavigableString)):
-            new_child = NavigableString(new_child)
+    _lastRecursiveChild = _deprecated_alias("_lastRecursiveChild", "_last_descendant", "4.0.0")
 
-        from bs4 import BeautifulSoup
-        if isinstance(new_child, BeautifulSoup):
-            # We don't want to end up with a situation where one BeautifulSoup
-            # object contains another. Insert the children one at a time.
-            for subchild in list(new_child.contents):
-                self.insert(position, subchild)
-                position += 1
-            return
-        position = min(position, len(self.contents))
-        if hasattr(new_child, 'parent') and new_child.parent is not None:
-            # We're 'inserting' an element that's already one
-            # of this object's children.
-            if new_child.parent is self:
-                current_index = self.index(new_child)
-                if current_index < position:
-                    # We're moving this element further down the list
-                    # of this object's children. That means that when
-                    # we extract this element, our target index will
-                    # jump down one.
-                    position -= 1
-            new_child.extract()
-
-        new_child.parent = self
-        previous_child = None
-        if position == 0:
-            new_child.previous_sibling = None
-            new_child.previous_element = self
-        else:
-            previous_child = self.contents[position - 1]
-            new_child.previous_sibling = previous_child
-            new_child.previous_sibling.next_sibling = new_child
-            new_child.previous_element = previous_child._last_descendant(False)
-        if new_child.previous_element is not None:
-            new_child.previous_element.next_element = new_child
-
-        new_childs_last_element = new_child._last_descendant(False)
-
-        if position >= len(self.contents):
-            new_child.next_sibling = None
-
-            parent = self
-            parents_next_sibling = None
-            while parents_next_sibling is None and parent is not None:
-                parents_next_sibling = parent.next_sibling
-                parent = parent.parent
-                if parents_next_sibling is not None:
-                    # We found the element that comes next in the document.
-                    break
-            if parents_next_sibling is not None:
-                new_childs_last_element.next_element = parents_next_sibling
-            else:
-                # The last element of this tag is the last element in
-                # the document.
-                new_childs_last_element.next_element = None
-        else:
-            next_child = self.contents[position]
-            new_child.next_sibling = next_child
-            if new_child.next_sibling is not None:
-                new_child.next_sibling.previous_sibling = new_child
-            new_childs_last_element.next_element = next_child
-
-        if new_childs_last_element.next_element is not None:
-            new_childs_last_element.next_element.previous_element = new_childs_last_element
-        self.contents.insert(position, new_child)
-
-    def append(self, tag):
-        """Appends the given PageElement to the contents of this one.
-
-        :param tag: A PageElement.
-        """
-        self.insert(len(self.contents), tag)
-
-    def extend(self, tags):
-        """Appends the given PageElements to this one's contents.
-
-        :param tags: A list of PageElements. If a single Tag is
-        provided instead, this PageElement's contents will be extended
-        with that Tag's contents.
-        """
-        if isinstance(tags, Tag):
-            tags = tags.contents
-        if isinstance(tags, list):
-            # Moving items around the tree may change their position in
-            # the original list. Make a list that won't change.
-            tags = list(tags)
-        for tag in tags:
-            self.append(tag)
-
-    def insert_before(self, *args):
+    def insert_before(self, *args:PageElement) -> None:
         """Makes the given element(s) the immediate predecessor of this one.
 
-        All the elements will have the same parent, and the given elements
-        will be immediately before this one.
+        All the elements will have the same `PageElement.parent` as
+        this one, and the given elements will occur immediately before
516553 this one.
517 :param args: One or more PageElements.
518 """554 """
519 parent = self.parent555 parent = self.parent
520 if parent is None:556 if parent is None:
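Reviewer's note (not part of the diff): the `insert()` logic removed above is what keeps the `next_sibling`/`previous_sibling` and `next_element`/`previous_element` links consistent when a node is spliced in. A minimal sketch with the released bs4 API, showing the observable behavior that any rewrite of this machinery must preserve:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>one</b><b>three</b></p>", "html.parser")

# Build a new element and splice it in at position 1; insert() is
# responsible for rewiring the sibling and document-order links so
# traversal stays valid after the splice.
new_b = soup.new_tag("b")
new_b.string = "two"
soup.p.insert(1, new_b)

# Sibling links were updated along with the contents list.
assert [b.string for b in soup.p.find_all("b")] == ["one", "two", "three"]
assert new_b.previous_sibling.string == "one"
assert new_b.next_sibling.string == "three"
```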
@@ -530,13 +566,12 @@ class PageElement(object):
         index = parent.index(self)
         parent.insert(index, predecessor)

-    def insert_after(self, *args):
+    def insert_after(self, *args:PageElement) -> None:
         """Makes the given element(s) the immediate successor of this one.

-        The elements will have the same parent, and the given elements
-        will be immediately after this one.
-
-        :param args: One or more PageElements.
+        The elements will have the same `PageElement.parent` as this
+        one, and the given elements will occur immediately after this
+        one.
         """
         # Do all error checking before modifying the tree.
         parent = self.parent
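The `insert_before()`/`insert_after()` changes above are signature and docstring updates only; the behavior is the released one. A quick sketch against the current bs4 API:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>two</b></p>", "html.parser")
b = soup.p.b

# Both methods operate through the parent: the new strings become
# siblings of <b> inside <p>, before and after it respectively.
b.insert_before(soup.new_string("one "))
b.insert_after(soup.new_string(" three"))

assert soup.p.get_text() == "one two three"
```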
@@ -556,7 +591,14 @@ class PageElement(object):
             parent.insert(index+1+offset, successor)
             offset += 1

-    def find_next(self, name=None, attrs={}, string=None, **kwargs):
+    def find_next(
+            self,
+            name:Optional[_StrainableElement]=None,
+            attrs:_StrainableAttributes={},
+            string:Optional[_StrainableString]=None,
+            node:Optional[_TagOrStringMatchFunction]=None,
+            **kwargs:_StrainableAttribute
+    ) -> Optional[PageElement]:
         """Find the first PageElement that matches the given criteria and
         appears later in the document than this PageElement.

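The branch adds a `node` parameter to `find_next()` but leaves the existing `name`/`attrs`/`string` filters untouched. For reference, the unchanged behavior with the released bs4 API:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<h1>Title</h1><p>first</p><p>second</p>", "html.parser")
h1 = soup.h1

# find_next() returns the first later match; find_all_next() returns
# every later match, in document order.
assert h1.find_next("p").get_text() == "first"
assert [p.get_text() for p in h1.find_all_next("p")] == ["first", "second"]
```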
@@ -564,36 +606,47 @@ class PageElement(object):
         documentation for detailed explanations.

         :param name: A filter on tag name.
-        :param attrs: A dictionary of filters on attribute values.
+        :param attrs: Additional filters on attribute values.
         :param string: A filter for a NavigableString with specific text.
-        :kwargs: A dictionary of filters on attribute values.
-        :return: A PageElement.
-        :rtype: bs4.element.Tag | bs4.element.NavigableString
+        :kwargs: Additional filters on attribute values.
         """
-        return self._find_one(self.find_all_next, name, attrs, string, **kwargs)
-    findNext = find_next # BS3
+        return self._find_one(self.find_all_next, name, attrs, string, node, **kwargs)
+    findNext = _deprecated_function_alias("findNext", "find_next", "4.0.0")

-    def find_all_next(self, name=None, attrs={}, string=None, limit=None,
-                      **kwargs):
-        """Find all PageElements that match the given criteria and appear
-        later in the document than this PageElement.
+    def find_all_next(
+            self,
+            name:Optional[_StrainableElement]=None,
+            attrs:_StrainableAttributes={},
+            string:Optional[_StrainableString]=None,
+            limit:Optional[int]=None,
+            node:Optional[_TagOrStringMatchFunction]=None,
+            _stacklevel:int=2,
+            **kwargs:_StrainableAttribute
+    ) -> ResultSet[PageElement]:
+        """Find all `PageElement` objects that match the given criteria and
+        appear later in the document than this `PageElement`.

         All find_* methods take a common set of arguments. See the online
         documentation for detailed explanations.

         :param name: A filter on tag name.
-        :param attrs: A dictionary of filters on attribute values.
+        :param attrs: Additional filters on attribute values.
         :param string: A filter for a NavigableString with specific text.
         :param limit: Stop looking after finding this many results.
-        :kwargs: A dictionary of filters on attribute values.
-        :return: A ResultSet containing PageElements.
+        :param _stacklevel: Used internally to improve warning messages.
+        :kwargs: Additional filters on attribute values.
         """
-        _stacklevel = kwargs.pop('_stacklevel', 2)
         return self._find_all(name, attrs, string, limit, self.next_elements,
-                              _stacklevel=_stacklevel+1, **kwargs)
-    findAllNext = find_all_next # BS3
+                              node, _stacklevel=_stacklevel+1, **kwargs)
+    findAllNext = _deprecated_function_alias("findAllNext", "find_all_next", "4.0.0")

-    def find_next_sibling(self, name=None, attrs={}, string=None, **kwargs):
+    def find_next_sibling(
+            self,
+            name:Optional[_StrainableElement]=None,
+            attrs:_StrainableAttributes={},
+            string:Optional[_StrainableString]=None,
+            node:Optional[_TagOrStringMatchFunction]=None,
+            **kwargs:_StrainableAttribute) -> Optional[PageElement]:
         """Find the closest sibling to this PageElement that matches the
         given criteria and appears later in the document.

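As with `find_next()`, the sibling variants above only gain the `node` parameter; the sibling-versus-document-order distinction is unchanged. A sketch with the released bs4 API:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>a<span>inner</span></p><p>b</p></div>",
                     "html.parser")
first_p = soup.p

# find_next() walks every element that appears later in the document,
# including descendants of this one; find_next_sibling() only considers
# elements that share the same parent.
assert first_p.find_next("span").get_text() == "inner"
assert first_p.find_next_sibling("span") is None
assert first_p.find_next_sibling("p").get_text() == "b"
```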
@@ -601,102 +654,143 @@ class PageElement(object):
         online documentation for detailed explanations.

         :param name: A filter on tag name.
-        :param attrs: A dictionary of filters on attribute values.
-        :param string: A filter for a NavigableString with specific text.
-        :kwargs: A dictionary of filters on attribute values.
-        :return: A PageElement.
-        :rtype: bs4.element.Tag | bs4.element.NavigableString
+        :param attrs: Additional filters on attribute values.
+        :param string: A filter for a `NavigableString` with specific text.
+        :kwargs: Additional filters on attribute values.
         """
         return self._find_one(self.find_next_siblings, name, attrs, string,
-                              **kwargs)
-    findNextSibling = find_next_sibling # BS3
+                              node, **kwargs)
+    findNextSibling = _deprecated_function_alias(
+        "findNextSibling", "find_next_sibling", "4.0.0"
+    )

-    def find_next_siblings(self, name=None, attrs={}, string=None, limit=None,
-                           **kwargs):
-        """Find all siblings of this PageElement that match the given criteria
+    def find_next_siblings(
+            self,
+            name:Optional[_StrainableElement]=None,
+            attrs:_StrainableAttributes={},
+            string:Optional[_StrainableString]=None,
+            limit:Optional[int]=None,
+            node:Optional[_TagOrStringMatchFunction]=None,
+            _stacklevel:int=2,
+            **kwargs:_StrainableAttribute
+    ) -> ResultSet[PageElement]:
+        """Find all siblings of this `PageElement` that match the given criteria
         and appear later in the document.

         All find_* methods take a common set of arguments. See the online
         documentation for detailed explanations.

         :param name: A filter on tag name.
-        :param attrs: A dictionary of filters on attribute values.
-        :param string: A filter for a NavigableString with specific text.
+        :param attrs: Additional filters on attribute values.
+        :param string: A filter for a `NavigableString` with specific text.
         :param limit: Stop looking after finding this many results.
-        :kwargs: A dictionary of filters on attribute values.
-        :return: A ResultSet of PageElements.
-        :rtype: bs4.element.ResultSet
+        :param _stacklevel: Used internally to improve warning messages.
+        :kwargs: Additional filters on attribute values.
         """
-        _stacklevel = kwargs.pop('_stacklevel', 2)
         return self._find_all(
             name, attrs, string, limit,
-            self.next_siblings, _stacklevel=_stacklevel+1, **kwargs
+            self.next_siblings, node, _stacklevel=_stacklevel+1, **kwargs
         )
-    findNextSiblings = find_next_siblings # BS3
-    fetchNextSiblings = find_next_siblings # BS2
+    findNextSiblings = _deprecated_function_alias(
+        "findNextSiblings", "find_next_siblings", "4.0.0"
+    )
+    fetchNextSiblings = _deprecated_function_alias(
+        "fetchNextSiblings", "find_next_siblings", "3.0.0"
+    )

-    def find_previous(self, name=None, attrs={}, string=None, **kwargs):
-        """Look backwards in the document from this PageElement and find the
-        first PageElement that matches the given criteria.
+    def find_previous(
+            self,
+            name:Optional[_StrainableElement]=None,
+            attrs:_StrainableAttributes={},
+            string:Optional[_StrainableString]=None,
+            node:Optional[_TagOrStringMatchFunction]=None,
+            **kwargs:_StrainableAttribute) -> Optional[PageElement]:
+        """Look backwards in the document from this `PageElement` and find the
+        first `PageElement` that matches the given criteria.

         All find_* methods take a common set of arguments. See the online
         documentation for detailed explanations.

         :param name: A filter on tag name.
-        :param attrs: A dictionary of filters on attribute values.
-        :param string: A filter for a NavigableString with specific text.
-        :kwargs: A dictionary of filters on attribute values.
-        :return: A PageElement.
-        :rtype: bs4.element.Tag | bs4.element.NavigableString
+        :param attrs: Additional filters on attribute values.
+        :param string: A filter for a `NavigableString` with specific text.
+        :kwargs: Additional filters on attribute values.
         """
         return self._find_one(
-            self.find_all_previous, name, attrs, string, **kwargs)
-    findPrevious = find_previous # BS3
+            self.find_all_previous, name, attrs, string, node, **kwargs)
+
+    findPrevious = _deprecated_function_alias(
+        "findPrevious", "find_previous", "3.0.0"
+    )

-    def find_all_previous(self, name=None, attrs={}, string=None, limit=None,
-                          **kwargs):
-        """Look backwards in the document from this PageElement and find all
-        PageElements that match the given criteria.
+    def find_all_previous(
+            self,
+            name:Optional[_StrainableElement]=None,
+            attrs:_StrainableAttributes={},
+            string:Optional[_StrainableString]=None,
+            limit:Optional[int]=None,
+            node:Optional[_TagOrStringMatchFunction]=None,
+            _stacklevel:int=2,
+            **kwargs:_StrainableAttribute
+    ) -> ResultSet[PageElement]:
+        """Look backwards in the document from this `PageElement` and find all
+        `PageElement` that match the given criteria.

         All find_* methods take a common set of arguments. See the online
         documentation for detailed explanations.

         :param name: A filter on tag name.
-        :param attrs: A dictionary of filters on attribute values.
-        :param string: A filter for a NavigableString with specific text.
+        :param attrs: Additional filters on attribute values.
+        :param string: A filter for a `NavigableString` with specific text.
         :param limit: Stop looking after finding this many results.
-        :kwargs: A dictionary of filters on attribute values.
-        :return: A ResultSet of PageElements.
-        :rtype: bs4.element.ResultSet
+        :param _stacklevel: Used internally to improve warning messages.
+        :kwargs: Additional filters on attribute values.
         """
-        _stacklevel = kwargs.pop('_stacklevel', 2)
         return self._find_all(
             name, attrs, string, limit, self.previous_elements,
-            _stacklevel=_stacklevel+1, **kwargs
+            node, _stacklevel=_stacklevel+1, **kwargs
         )
-    findAllPrevious = find_all_previous # BS3
-    fetchPrevious = find_all_previous # BS2
+    findAllPrevious = _deprecated_function_alias(
+        "findAllPrevious", "find_all_previous", "4.0.0"
+    )
+    fetchAllPrevious = _deprecated_function_alias(
+        "fetchAllPrevious", "find_all_previous", "3.0.0"
+    )

-    def find_previous_sibling(self, name=None, attrs={}, string=None, **kwargs):
-        """Returns the closest sibling to this PageElement that matches the
+    def find_previous_sibling(
+            self,
+            name:Optional[_StrainableElement]=None,
+            attrs:_StrainableAttributes={},
+            string:Optional[_StrainableString]=None,
+            node:Optional[_TagOrStringMatchFunction]=None,
+            **kwargs:_StrainableAttribute) -> Optional[PageElement]:
+        """Returns the closest sibling to this `PageElement` that matches the
         given criteria and appears earlier in the document.

         All find_* methods take a common set of arguments. See the online
         documentation for detailed explanations.

         :param name: A filter on tag name.
-        :param attrs: A dictionary of filters on attribute values.
-        :param string: A filter for a NavigableString with specific text.
-        :kwargs: A dictionary of filters on attribute values.
The diff has been truncated for viewing.
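Since the diff is truncated here, a reviewer's sketch of the point of the change: the proposed `node=` filter accepts a match function that is called for every `PageElement`, `Tag` and `NavigableString` alike, whereas today a `name=` function only sees Tags and a `string=` function only sees strings. Until the branch lands, the effect can be emulated with the released API by filtering `descendants` directly; `is_comment_or_b` below is a hypothetical predicate, not part of the proposal:

```python
from bs4 import BeautifulSoup, Comment, Tag

soup = BeautifulSoup("<p>text<!-- note --><b>bold</b></p>", "html.parser")

# Hypothetical predicate: sees every node type, Tag or string alike.
def is_comment_or_b(node):
    return isinstance(node, Comment) or (
        isinstance(node, Tag) and node.name == "b")

# Emulation of the proposed find_all(node=...) with the released API:
# walk all page elements ourselves and apply the predicate.
matches = [el for el in soup.descendants if is_comment_or_b(el)]

assert len(matches) == 2
assert isinstance(matches[0], Comment)
assert matches[1].name == "b"
```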
