Merge ~chrispitude/beautifulsoup:node-filters into beautifulsoup:master
Status: Superseded
Proposed branch: ~chrispitude/beautifulsoup:node-filters
Merge into: beautifulsoup:master
Diff against target: 9961 lines (+4153/-2375), 31 files modified:
CHANGELOG (+99/-0), bs4/__init__.py (+245/-73), bs4/_deprecation.py (+57/-0), bs4/_typing.py (+99/-0), bs4/builder/__init__.py (+278/-184), bs4/builder/_html5lib.py (+57/-36), bs4/builder/_htmlparser.py (+120/-66), bs4/builder/_lxml.py (+137/-68), bs4/css.py (+124/-96), bs4/dammit.py (+407/-230), bs4/diagnose.py (+31/-19), bs4/element.py (+1154/-1052), bs4/formatter.py (+96/-41), bs4/strainer.py (+498/-0), bs4/tests/__init__.py (+17/-12), bs4/tests/test_builder_registry.py (+3/-3), bs4/tests/test_dammit.py (+17/-10), bs4/tests/test_element.py (+25/-5), bs4/tests/test_html5lib.py (+17/-1), bs4/tests/test_htmlparser.py (+5/-3), bs4/tests/test_lxml.py (+4/-3), bs4/tests/test_pageelement.py (+8/-4), bs4/tests/test_soup.py (+13/-5), bs4/tests/test_strainer.py (+485/-0), bs4/tests/test_tag.py (+1/-0), bs4/tests/test_tree.py (+44/-25), dev/null (+0/-256), doc/Makefile (+14/-124), doc/conf.py (+33/-0), doc/index.rst (+61/-57), tox.ini (+4/-2)
Related bugs:
| Reviewer | Review Type | Date Requested | Status |
|---|---|---|---|
| Leonard Richardson | Pending | | |

Review via email: mp+457782@code.launchpad.net
This proposal has been superseded by a proposal from 2024-01-01.
Commit message
implement "node" filtering that considers all PageElement objects
Description of the change
This is a draft merge request that implements a proof-of-concept for the following wishlist request:
#2047713: enhance find*() methods to filter through all object types
https:/
Most of the changes are to thread a new "node" filter down into the machinery. The actual functionality is just three additional lines in the search() method to use it.
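The description above suggests the core change is small. As a rough stdlib-only sketch (not the actual patch; the function shape and names here are assumptions based on this description), the new check inside a search routine might look like:

```python
from typing import Any, Callable, Optional

def search(element: Any,
           node: Optional[Callable[[Any], bool]] = None) -> Optional[Any]:
    """Return element if it passes the optional "node" filter, else None.

    Sketch only: in the real machinery this check would sit alongside
    the existing name/attribute matching in search().
    """
    if node is not None and not node(element):
        # The filter sees every node -- Tag or NavigableString alike --
        # and rejects the ones it returns False for.
        return None
    return element
```

The point is that the filter receives the node itself, so a single callable can discriminate between tags and strings.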
With these changes, when the following script is run:
====
#!/usr/bin/env python
from bs4 import BeautifulSoup, NavigableString

html_doc = """
<p>
<b>bold</b>
<i>italic</i>
<u>underline</u>
text
<br />
</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# this is the filter I want to use
def is_non_whitespace(thing):
    return not (isinstance(thing, NavigableString) and thing.text.isspace())

# get the first non-whitespace thing in <p>
# (the argument name "node" follows this proposal's commit message)
this_thing = soup.find("p").find(node=is_non_whitespace)

# print all following non-whitespace sibling elements in <p>
while this_thing:
    next_thing = this_thing.find_next_sibling(node=is_non_whitespace)
    print(f"{this_thing!r} is followed by {next_thing!r}")
    this_thing = next_thing
====
the results are as follows:
====
<b>bold</b> is followed by <i>italic</i>
<i>italic</i> is followed by <u>underline</u>
<u>underline</u> is followed by '\n text\n '
'\n text\n ' is followed by <br/>
<br/> is followed by None
====
Note the mix of tag and text objects!
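The mixed result can be reproduced in plain Python to make the filtering rule concrete (a stdlib-only illustration; this Tag class is a stand-in for bs4's, not the real one):

```python
class Tag:
    """Stand-in for bs4.element.Tag, for illustration only."""
    def __init__(self, markup: str) -> None:
        self.markup = markup
    def __repr__(self) -> str:
        return self.markup

def is_non_whitespace(thing) -> bool:
    # Same rule as the script above: drop strings that are only
    # whitespace; keep every other node, tag or text.
    return not (isinstance(thing, str) and thing.isspace())

nodes = [Tag("<b>bold</b>"), "\n", Tag("<i>italic</i>"), "\n",
         Tag("<u>underline</u>"), "\n      text\n      ", Tag("<br/>"), "\n"]
kept = [n for n in nodes if is_non_whitespace(n)]
# kept now holds the four tags plus the one non-whitespace string,
# in document order.
```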
Some questions and open items:
* Is "PageElement" the correct term for any object in a BeautifulSoup document (Tag, NavigableString, Comment, ProcessingInstruction, and so on)?
* What should this new filter argument be called? (node? page_element? something else?)
* Is there a more elegant approach that doesn't require threading a new argument down into everything?
* Rules for mixing this new filter with existing name/attribute filters must be defined and documented.
* I think this new filter should be mutually exclusive with tag/attribute filters.
* I think this new filter should accept only Callable objects, and perhaps also True/False.
* Tests and documentation are needed.
* I can do this when the implementation is complete.
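If the callable-or-boolean idea from the list above were adopted, normalizing the argument up front would keep the search machinery simple. A hypothetical helper (name and semantics are assumptions, not part of the patch):

```python
from typing import Any, Callable, Union

def normalize_node_filter(
    node: Union[bool, Callable[[Any], bool]]
) -> Callable[[Any], bool]:
    """Turn a node filter argument into a predicate.

    True matches every node, False matches none, and a callable is
    used as-is. Anything else is rejected early with a clear error.
    """
    if node is True:
        return lambda thing: True
    if node is False:
        return lambda thing: False
    if callable(node):
        return node
    raise TypeError("node filter must be a callable or True/False")
```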
Fingers crossed that this makes it in. It would be enormously powerful.
Chris Papademetrious (chrispitude) wrote:
- a771557... by Chris Papademetrious: implement a filter that considers all PageElement objects

Unmerged commits
- a771557... by Chris Papademetrious: implement a filter that considers all PageElement objects
- 8a6d1dd... by Leonard Richardson: Merged in change to main branch.
- 4cde600... by Leonard Richardson: Those casts are more trouble than they're worth.
- 1113a86... by Leonard Richardson: Got css.py to pass mypy strict although it's a little hacky.
- 26e1772... by Leonard Richardson: Went through formatter.py with mypy strict.
- 5bf3787... by Leonard Richardson: Went through dammit.py with mypy strict.
- 7200655... by Leonard Richardson: Merged in main branch.
- f3a3619... by Leonard Richardson: Got rid of deprecation warnings in tests.
- 6f89323... by Leonard Richardson: Get (slightly) more specific about alias.
- f8e55c0... by Leonard Richardson: _alias itself is not used anywhere.
Preview Diff
1 | diff --git a/CHANGELOG b/CHANGELOG |
2 | index 66fcb74..bec1e11 100644 |
3 | --- a/CHANGELOG |
4 | +++ b/CHANGELOG |
5 | @@ -1,3 +1,102 @@ |
6 | += 4.13.0 (Unreleased) |
7 | + |
8 | +* This version drops support for Python 3.6. The minimum supported |
9 | + major version for Beautiful Soup is now Python 3.7. |
10 | + |
11 | +* Deprecation warnings have been added for all deprecated methods and |
12 | + attributes. Most of these were deprecated over ten years ago, and |
13 | + some were deprecated over fifteen years ago. |
14 | + |
15 | + Going forward, deprecated names will be subject to removal two |
16 | + feature releases or one major release after the deprecation warning |
17 | + is added. |
18 | + |
19 | +* append(), extend(), insert(), and unwrap() were moved from PageElement to |
20 | + Tag. Those methods manipulate the 'contents' collection, so they would |
21 | + only have ever worked on Tag objects. |
22 | + |
23 | +* decompose() was moved from Tag to PageElement, since there's no reason |
24 | + it won't also work on NavigableString objects. |
25 | + |
26 | +* The BeautifulSoupHTMLParser constructor now requires a BeautifulSoup |
27 | + object as its first argument. This almost certainly does not affect |
28 | + you, since you probably use HTMLParserTreeBuilder, not |
29 | + BeautifulSoupHTMLParser directly. |
30 | + |
31 | +* If Tag.get_attribute_list() is used to access an attribute that's not set, |
32 | + the return value is now an empty list rather than [None]. |
33 | + |
34 | +* AttributeValueWithCharsetSubstitution.encode() is renamed to |
35 | + substitute_encoding, to avoid confusion with the much different str.encode() |
36 | + |
37 | +* Using PageElement.replace_with() to replace an element with itself |
38 | + returns the element instead of None. |
39 | + |
40 | +* When using one of the find() methods or creating a SoupStrainer, |
41 | + if you specify the same attribute value in ``attrs`` and the |
42 | + keyword arguments, you'll end up with two different ways to match that |
43 | + attribute. Previously the value in keyword arguments would override the |
44 | + value in ``attrs``. |
45 | + |
46 | +* When using one of the find() methods or creating a SoupStrainer, you can |
47 | + pass a list of any accepted object (strings, regular expressions, etc.) for |
48 | + any of the objects. Previously you could only pass in a list of strings. |
49 | + |
50 | +* A SoupStrainer can now filter tag creation based on a tag's |
51 | + namespaced name. Previously only the unqualified name could be used. |
52 | + |
53 | +* All TreeBuilder constructors now take the empty_element_tags |
54 | + argument. The sets of tags found in HTMLTreeBuilder.empty_element_tags and |
55 | + HTMLTreeBuilder.block_elements are now in |
56 | + HTMLTreeBuilder.DEFAULT_EMPTY_ELEMENT_TAGS and |
57 | + HTMLTreeBuilder.DEFAULT_BLOCK_ELEMENTS, to avoid confusing them with |
58 | + instance variables. |
59 | + |
60 | +* Issue a warning if a document is parsed using a SoupStrainer that's just |
61 | + going to filter everything. In these cases, filtering everything is the |
62 | + most consistent thing to do, but there was no indication that was |
63 | + happening. |
64 | + |
65 | +* UnicodeDammit.markup is now always a bytestring representing the |
66 | + *original* markup (sans BOM), and UnicodeDammit.unicode_markup is |
67 | + always the same markup, converted to Unicode. Previously, |
68 | + UnicodeDammit.markup was treated inconsistently and would often end |
69 | + up containing Unicode. UnicodeDammit.markup was not a documented |
70 | + attribute, but if you were using it, you probably want to switch to using |
71 | + .unicode_markup instead. |
72 | + |
73 | +* Corrected the markup that's output in the unlikely event that you |
74 | + encode a document to a Python internal encoding (like "palmos") |
75 | + that's not recognized by the HTML or XML standard. |
76 | + |
77 | +* The arguments to LXMLTreeBuilderForXML.prepare_markup have been |
78 | + changed to match the arguments to the superclass, |
79 | + TreeBuilder.prepare_markup. Specifically, document_declared_encoding |
80 | + now appears before exclude_encodings, not after. If you were calling |
81 | + this method yourself, I recomment switching to using keyword |
82 | + arguments instead. |
83 | + |
84 | +* Fixed an error in the lookup table used when converting |
85 | + ISO-Latin-1 to ASCII, which no one should do anyway. |
86 | + |
87 | +* The unused constant LXMLTreeBuilderForXML.DEFAULT_PARSER_CLASS |
88 | + has been removed. |
89 | + |
90 | +New deprecations in 4.13.0: |
91 | + |
92 | +* The SAXTreeBuilder class, which was never officially supported or tested. |
93 | + |
94 | +* The first argument to BeautifulSoup.decode has been changed from a bool |
95 | + `pretty_print` to an int `indent_level`, to match the signature of Tag.decode. |
96 | + |
97 | +* SoupStrainer.text and SoupStrainer.string are both deprecated |
98 | + since a single item can't capture all the possibilities of a SoupStrainer |
99 | + designed to match strings. |
100 | + |
101 | +* SoupStrainer.search_tag() is deprecated. It was never a |
102 | + documented method, but if you use it, you should start using |
103 | + SoupStrainer.allow_tag_creation() instead. |
104 | + |
105 | = 4.12.3 (?) |
106 | |
107 | * Fixed a regression such that if you set .hidden on a tag, the tag |
108 | diff --git a/bs4/__init__.py b/bs4/__init__.py |
109 | index 3d2ab09..a1289c7 100644 |
110 | --- a/bs4/__init__.py |
111 | +++ b/bs4/__init__.py |
112 | @@ -7,8 +7,8 @@ Beautiful Soup uses a pluggable XML or HTML parser to parse a |
113 | provides methods and Pythonic idioms that make it easy to navigate, |
114 | search, and modify the parse tree. |
115 | |
116 | -Beautiful Soup works with Python 3.6 and up. It works better if lxml |
117 | -and/or html5lib is installed. |
118 | +Beautiful Soup works with Python 3.7 and up. It works better if lxml |
119 | +and/or html5lib is installed, but they are not required. |
120 | |
121 | For more than you ever wanted to know about Beautiful Soup, see the |
122 | documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/ |
123 | @@ -37,9 +37,10 @@ if sys.version_info.major < 3: |
124 | from .builder import ( |
125 | builder_registry, |
126 | ParserRejectedMarkup, |
127 | + TreeBuilder, |
128 | XMLParsedAsHTMLWarning, |
129 | - HTMLParserTreeBuilder |
130 | ) |
131 | +from .builder._htmlparser import HTMLParserTreeBuilder |
132 | from .dammit import UnicodeDammit |
133 | from .element import ( |
134 | CData, |
135 | @@ -55,10 +56,32 @@ from .element import ( |
136 | ResultSet, |
137 | Script, |
138 | Stylesheet, |
139 | - SoupStrainer, |
140 | Tag, |
141 | TemplateString, |
142 | ) |
143 | +from .formatter import Formatter |
144 | +from .strainer import SoupStrainer |
145 | +from typing import ( |
146 | + Any, |
147 | + cast, |
148 | + Counter as CounterType, |
149 | + Dict, |
150 | + Iterable, |
151 | + List, |
152 | + Sequence, |
153 | + Optional, |
154 | + Type, |
155 | + TYPE_CHECKING, |
156 | + Union, |
157 | +) |
158 | + |
159 | +from bs4._typing import ( |
160 | + _AttributeValue, |
161 | + _AttributeValues, |
162 | + _Encoding, |
163 | + _Encodings, |
164 | + _IncomingMarkup, |
165 | +) |
166 | |
167 | # Define some custom warnings. |
168 | class GuessedAtParserWarning(UserWarning): |
169 | @@ -104,24 +127,64 @@ class BeautifulSoup(Tag): |
170 | handle_endtag. |
171 | """ |
172 | |
173 | - # Since BeautifulSoup subclasses Tag, it's possible to treat it as |
174 | - # a Tag with a .name. This name makes it clear the BeautifulSoup |
175 | - # object isn't a real markup tag. |
176 | - ROOT_TAG_NAME = '[document]' |
177 | - |
178 | - # If the end-user gives no indication which tree builder they |
179 | - # want, look for one with these features. |
180 | - DEFAULT_BUILDER_FEATURES = ['html', 'fast'] |
181 | - |
182 | - # A string containing all ASCII whitespace characters, used in |
183 | - # endData() to detect data chunks that seem 'empty'. |
184 | - ASCII_SPACES = '\x20\x0a\x09\x0c\x0d' |
185 | - |
186 | - NO_PARSER_SPECIFIED_WARNING = "No parser was explicitly specified, so I'm using the best available %(markup_type)s parser for this system (\"%(parser)s\"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.\n\nThe code that caused this warning is on line %(line_number)s of the file %(filename)s. To get rid of this warning, pass the additional argument 'features=\"%(parser)s\"' to the BeautifulSoup constructor.\n" |
187 | - |
188 | - def __init__(self, markup="", features=None, builder=None, |
189 | - parse_only=None, from_encoding=None, exclude_encodings=None, |
190 | - element_classes=None, **kwargs): |
191 | + #: Since `BeautifulSoup` subclasses `Tag`, it's possible to treat it as |
192 | + #: a `Tag` with a `Tag.name`. Hoever, this name makes it clear the |
193 | + #: `BeautifulSoup` object isn't a real markup tag. |
194 | + ROOT_TAG_NAME:str = '[document]' |
195 | + |
196 | + #: If the end-user gives no indication which tree builder they |
197 | + #: want, look for one with these features. |
198 | + DEFAULT_BUILDER_FEATURES: Sequence[str] = ['html', 'fast'] |
199 | + |
200 | + #: A string containing all ASCII whitespace characters, used in |
201 | + #: `BeautifulSoup.endData` to detect data chunks that seem 'empty'. |
202 | + ASCII_SPACES: str = '\x20\x0a\x09\x0c\x0d' |
203 | + |
204 | + #: :meta private: |
205 | + NO_PARSER_SPECIFIED_WARNING: str = "No parser was explicitly specified, so I'm using the best available %(markup_type)s parser for this system (\"%(parser)s\"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.\n\nThe code that caused this warning is on line %(line_number)s of the file %(filename)s. To get rid of this warning, pass the additional argument 'features=\"%(parser)s\"' to the BeautifulSoup constructor.\n" |
206 | + |
207 | + # FUTURE PYTHON: |
208 | + element_classes:Dict[Type[PageElement], Type[Any]] #: :meta private: |
209 | + builder:TreeBuilder #: :meta private: |
210 | + is_xml: bool |
211 | + known_xml: Optional[bool] |
212 | + parse_only: Optional[SoupStrainer] #: :meta private: |
213 | + |
214 | + # These members are only used while parsing markup. |
215 | + markup:Optional[Union[str,bytes]] #: :meta private: |
216 | + current_data:List[str] #: :meta private: |
217 | + currentTag:Optional[Tag] #: :meta private: |
218 | + tagStack:List[Tag] #: :meta private: |
219 | + open_tag_counter:CounterType[str] #: :meta private: |
220 | + preserve_whitespace_tag_stack:List[Tag] #: :meta private: |
221 | + string_container_stack:List[Tag] #: :meta private: |
222 | + |
223 | + #: Beautiful Soup's best guess as to the character encoding of the |
224 | + #: original document. |
225 | + original_encoding: Optional[_Encoding] |
226 | + |
227 | + #: The character encoding, if any, that was explicitly defined |
228 | + #: in the original document. This may or may not match |
229 | + #: `BeautifulSoup.original_encoding`. |
230 | + declared_html_encoding: Optional[_Encoding] |
231 | + |
232 | + #: This is True if the markup that was parsed contains |
233 | + #: U+FFFD REPLACEMENT_CHARACTER characters which were not present |
234 | + #: in the original markup. These mark character sequences that |
235 | + #: could not be represented in Unicode. |
236 | + contains_replacement_characters: bool |
237 | + |
238 | + def __init__( |
239 | + self, |
240 | + markup:_IncomingMarkup="", |
241 | + features:Optional[Union[str,Sequence[str]]]=None, |
242 | + builder:Optional[Union[TreeBuilder,Type[TreeBuilder]]]=None, |
243 | + parse_only:Optional[SoupStrainer]=None, |
244 | + from_encoding:Optional[_Encoding]=None, |
245 | + exclude_encodings:Optional[_Encodings]=None, |
246 | + element_classes:Optional[Dict[Type[PageElement], Type[Any]]]=None, |
247 | + **kwargs:Any |
248 | + ): |
249 | """Constructor. |
250 | |
251 | :param markup: A string or a file-like object representing |
252 | @@ -196,14 +259,14 @@ class BeautifulSoup(Tag): |
253 | if 'selfClosingTags' in kwargs: |
254 | del kwargs['selfClosingTags'] |
255 | warnings.warn( |
256 | - "BS4 does not respect the selfClosingTags argument to the " |
257 | + "Beautiful Soup 4 does not respect the selfClosingTags argument to the " |
258 | "BeautifulSoup constructor. The tree builder is responsible " |
259 | "for understanding self-closing tags.") |
260 | |
261 | if 'isHTML' in kwargs: |
262 | del kwargs['isHTML'] |
263 | warnings.warn( |
264 | - "BS4 does not respect the isHTML argument to the " |
265 | + "Beautiful Soup 4 does not respect the isHTML argument to the " |
266 | "BeautifulSoup constructor. Suggest you use " |
267 | "features='lxml' for HTML and features='lxml-xml' for " |
268 | "XML.") |
269 | @@ -212,7 +275,8 @@ class BeautifulSoup(Tag): |
270 | if old_name in kwargs: |
271 | warnings.warn( |
272 | 'The "%s" argument to the BeautifulSoup constructor ' |
273 | - 'has been renamed to "%s."' % (old_name, new_name), |
274 | + 'was renamed to "%s" in Beautiful Soup 4.0.0' % ( |
275 | + old_name, new_name), |
276 | DeprecationWarning, stacklevel=3 |
277 | ) |
278 | return kwargs.pop(old_name) |
279 | @@ -220,7 +284,14 @@ class BeautifulSoup(Tag): |
280 | |
281 | parse_only = parse_only or deprecated_argument( |
282 | "parseOnlyThese", "parse_only") |
283 | - |
284 | + if (parse_only is not None |
285 | + and parse_only.string_rules and |
286 | + (parse_only.name_rules or parse_only.attribute_rules)): |
287 | + warnings.warn( |
288 | + f"Value for parse_only will exclude everything, since it puts restrictions on both tags and strings: {parse_only}", |
289 | + UserWarning, stacklevel=3 |
290 | + ) |
291 | + |
292 | from_encoding = from_encoding or deprecated_argument( |
293 | "fromEncoding", "from_encoding") |
294 | |
295 | @@ -235,7 +306,8 @@ class BeautifulSoup(Tag): |
296 | # specify a parser' warning. |
297 | original_builder = builder |
298 | original_features = features |
299 | - |
300 | + |
301 | + builder_class: Type[TreeBuilder] |
302 | if isinstance(builder, type): |
303 | # A builder class was passed in; it needs to be instantiated. |
304 | builder_class = builder |
305 | @@ -245,12 +317,13 @@ class BeautifulSoup(Tag): |
306 | features = [features] |
307 | if features is None or len(features) == 0: |
308 | features = self.DEFAULT_BUILDER_FEATURES |
309 | - builder_class = builder_registry.lookup(*features) |
310 | - if builder_class is None: |
311 | + possible_builder_class = builder_registry.lookup(*features) |
312 | + if possible_builder_class is None: |
313 | raise FeatureNotFound( |
314 | "Couldn't find a tree builder with the features you " |
315 | "requested: %s. Do you need to install a parser library?" |
316 | % ",".join(features)) |
317 | + builder_class = cast(Type[TreeBuilder], possible_builder_class) |
318 | |
319 | # At this point either we have a TreeBuilder instance in |
320 | # builder, or we have a builder_class that we can instantiate |
321 | @@ -259,7 +332,8 @@ class BeautifulSoup(Tag): |
322 | builder = builder_class(**kwargs) |
323 | if not original_builder and not ( |
324 | original_features == builder.NAME or |
325 | - original_features in builder.ALTERNATE_NAMES |
326 | + (isinstance(original_features, str) |
327 | + and original_features in builder.ALTERNATE_NAMES) |
328 | ) and markup: |
329 | # The user did not tell us which TreeBuilder to use, |
330 | # and we had to guess. Issue a warning. |
331 | @@ -323,6 +397,10 @@ class BeautifulSoup(Tag): |
332 | if not self._markup_is_url(markup): |
333 | self._markup_resembles_filename(markup) |
334 | |
335 | + # At this point we know markup is a string or bytestring. If |
336 | + # it was a file-type object, we've read from it. |
337 | + markup = cast(Union[str,bytes], markup) |
338 | + |
339 | rejections = [] |
340 | success = False |
341 | for (self.markup, self.original_encoding, self.declared_html_encoding, |
342 | @@ -486,7 +564,7 @@ class BeautifulSoup(Tag): |
343 | markup. |
344 | """ |
345 | Tag.__init__(self, self, self.builder, self.ROOT_TAG_NAME) |
346 | - self.hidden = 1 |
347 | + self.hidden = True |
348 | self.builder.reset() |
349 | self.current_data = [] |
350 | self.currentTag = None |
351 | @@ -497,8 +575,16 @@ class BeautifulSoup(Tag): |
352 | self._most_recent_element = None |
353 | self.pushTag(self) |
354 | |
355 | - def new_tag(self, name, namespace=None, nsprefix=None, attrs={}, |
356 | - sourceline=None, sourcepos=None, **kwattrs): |
357 | + def new_tag( |
358 | + self, |
359 | + name:str, |
360 | + namespace:Optional[str]=None, |
361 | + nsprefix:Optional[str]=None, |
362 | + attrs:_AttributeValues={}, |
363 | + sourceline:Optional[int]=None, |
364 | + sourcepos:Optional[int]=None, |
365 | + **kwattrs:_AttributeValue, |
366 | + ): |
367 | """Create a new Tag associated with this BeautifulSoup object. |
368 | |
369 | :param name: The name of the new Tag. |
370 | @@ -509,7 +595,7 @@ class BeautifulSoup(Tag): |
371 | that are reserved words in Python. |
372 | :param sourceline: The line number where this tag was |
373 | (purportedly) found in its source document. |
374 | - :param sourcepos: The character position within `sourceline` where this |
375 | + :param sourcepos: The character position within ``sourceline`` where this |
376 | tag was (purportedly) found. |
377 | :param kwattrs: Keyword arguments for the new Tag's attribute values. |
378 | |
379 | @@ -520,9 +606,17 @@ class BeautifulSoup(Tag): |
380 | sourceline=sourceline, sourcepos=sourcepos |
381 | ) |
382 | |
383 | - def string_container(self, base_class=None): |
384 | + def string_container(self, |
385 | + base_class:Optional[Type[NavigableString]]=None |
386 | + ) -> Type[NavigableString]: |
387 | + """Find the class that should be instantiated to hold a given kind of |
388 | + string. |
389 | + |
390 | + This may be a built-in Beautiful Soup class or a custom class passed |
391 | + in to the BeautifulSoup constructor. |
392 | + """ |
393 | container = base_class or NavigableString |
394 | - |
395 | + |
396 | # There may be a general override of NavigableString. |
397 | container = self.element_classes.get( |
398 | container, container |
399 | @@ -536,27 +630,40 @@ class BeautifulSoup(Tag): |
400 | ) |
401 | return container |
402 | |
403 | - def new_string(self, s, subclass=None): |
404 | - """Create a new NavigableString associated with this BeautifulSoup |
405 | + def new_string(self, s:str, subclass:Optional[Type[NavigableString]]=None) -> NavigableString: |
406 | + """Create a new `NavigableString` associated with this `BeautifulSoup` |
407 | object. |
408 | + |
409 | + :param s: The string content of the `NavigableString` |
410 | + |
411 | + :param subclass: The subclass of `NavigableString`, if any, to |
412 | + use. If a document is being processed, an appropriate subclass |
413 | + for the current location in the document will be determined |
414 | + automatically. |
415 | """ |
416 | container = self.string_container(subclass) |
417 | return container(s) |
418 | |
419 | - def insert_before(self, *args): |
420 | + def insert_before(self, *args:PageElement) -> None: |
421 | """This method is part of the PageElement API, but `BeautifulSoup` doesn't implement |
422 | it because there is nothing before or after it in the parse tree. |
423 | """ |
424 | raise NotImplementedError("BeautifulSoup objects don't support insert_before().") |
425 | |
426 | - def insert_after(self, *args): |
427 | + def insert_after(self, *args:PageElement) -> None: |
428 | """This method is part of the PageElement API, but `BeautifulSoup` doesn't implement |
429 | it because there is nothing before or after it in the parse tree. |
430 | """ |
431 | raise NotImplementedError("BeautifulSoup objects don't support insert_after().") |
432 | |
433 | - def popTag(self): |
434 | - """Internal method called by _popToTag when a tag is closed.""" |
435 | + def popTag(self) -> Optional[Tag]: |
436 | + """Internal method called by _popToTag when a tag is closed. |
437 | + |
438 | + :meta private: |
439 | + """ |
440 | + if not self.tagStack: |
441 | + # Nothing to pop. This shouldn't happen. |
442 | + return None |
443 | tag = self.tagStack.pop() |
444 | if tag.name in self.open_tag_counter: |
445 | self.open_tag_counter[tag.name] -= 1 |
446 | @@ -569,8 +676,11 @@ class BeautifulSoup(Tag): |
447 | self.currentTag = self.tagStack[-1] |
448 | return self.currentTag |
449 | |
450 | - def pushTag(self, tag): |
451 | - """Internal method called by handle_starttag when a tag is opened.""" |
452 | + def pushTag(self, tag:Tag) -> None: |
453 | + """Internal method called by handle_starttag when a tag is opened. |
454 | + |
455 | + :meta private: |
456 | + """ |
457 | #print("Push", tag.name) |
458 | if self.currentTag is not None: |
459 | self.currentTag.contents.append(tag) |
460 | @@ -583,9 +693,14 @@ class BeautifulSoup(Tag): |
461 | if tag.name in self.builder.string_containers: |
462 | self.string_container_stack.append(tag) |
463 | |
464 | - def endData(self, containerClass=None): |
465 | + def endData(self, containerClass:Optional[Type[NavigableString]]=None) -> None: |
466 | """Method called by the TreeBuilder when the end of a data segment |
467 | occurs. |
468 | + |
469 | + :param containerClass: The class to use when incorporating the |
470 | + data segment into the parse tree. |
471 | + |
472 | + :meta private: |
473 | """ |
474 | if self.current_data: |
475 | current_data = ''.join(self.current_data) |
476 | @@ -609,18 +724,27 @@ class BeautifulSoup(Tag): |
477 | |
478 | # Should we add this string to the tree at all? |
479 | if self.parse_only and len(self.tagStack) <= 1 and \ |
480 | - (not self.parse_only.text or \ |
481 | - not self.parse_only.search(current_data)): |
482 | + (not self.parse_only.string_rules or \ |
483 | + not self.parse_only.allow_string_creation(current_data)): |
484 | return |
485 | |
486 | containerClass = self.string_container(containerClass) |
487 | o = containerClass(current_data) |
488 | self.object_was_parsed(o) |
489 | |
490 | - def object_was_parsed(self, o, parent=None, most_recent_element=None): |
491 | - """Method called by the TreeBuilder to integrate an object into the parse tree.""" |
492 | + def object_was_parsed( |
493 | + self, o:PageElement, parent:Optional[Tag]=None, |
494 | + most_recent_element:Optional[PageElement]=None): |
495 | + """Method called by the TreeBuilder to integrate an object into the |
496 | + parse tree. |
497 | + |
498 | + |
499 | + |
500 | + :meta private: |
501 | + """ |
502 | if parent is None: |
503 | parent = self.currentTag |
504 | + assert parent is not None |
505 | if most_recent_element is not None: |
506 | previous_element = most_recent_element |
507 | else: |
508 | @@ -685,7 +809,7 @@ class BeautifulSoup(Tag): |
509 | break |
510 | target = target.parent |
511 | |
512 | - def _popToTag(self, name, nsprefix=None, inclusivePop=True): |
513 | + def _popToTag(self, name, nsprefix=None, inclusivePop=True) -> Optional[Tag]: |
514 | """Pops the tag stack up to and including the most recent |
515 | instance of the given tag. |
516 | |
517 | @@ -698,11 +822,12 @@ class BeautifulSoup(Tag): |
518 | to but *not* including the most recent instqance of the |
519 | given tag. |
520 | |
521 | + :meta private: |
522 | """ |
523 | #print("Popping to %s" % name) |
524 | if name == self.ROOT_TAG_NAME: |
525 | # The BeautifulSoup object itself can never be popped. |
526 | - return |
527 | + return None |
528 | |
529 | most_recently_popped = None |
530 | |
531 | @@ -719,8 +844,11 @@ class BeautifulSoup(Tag): |
532 | |
533 | return most_recently_popped |
534 | |
535 | - def handle_starttag(self, name, namespace, nsprefix, attrs, sourceline=None, |
536 | - sourcepos=None, namespaces=None): |
537 | + def handle_starttag( |
538 | + self, name:str, namespace:Optional[str], |
539 | + nsprefix:Optional[str], attrs:Optional[Dict[str,str]], |
540 | + sourceline:Optional[int]=None, sourcepos:Optional[int]=None, |
541 | + namespaces:Optional[Dict[str, str]]=None) -> Optional[Tag]: |
542 | """Called by the tree builder when a new tag is encountered. |
543 | |
544 | :param name: Name of the tag. |
545 | @@ -737,13 +865,15 @@ class BeautifulSoup(Tag): |
546 | SoupStrainer. You should proceed as if the tag had not occurred |
547 | in the document. For instance, if this was a self-closing tag, |
548 | don't call handle_endtag. |
549 | + |
550 | + :meta private: |
551 | """ |
552 | # print("Start tag %s: %s" % (name, attrs)) |
553 | self.endData() |
554 | |
555 | if (self.parse_only and len(self.tagStack) <= 1 |
556 | - and (self.parse_only.text |
557 | - or not self.parse_only.search_tag(name, attrs))): |
558 | + and (self.parse_only.string_rules |
559 | + or not self.parse_only.allow_tag_creation(nsprefix, name, attrs))): |
560 | return None |
561 | |
562 | tag = self.element_classes.get(Tag, Tag)( |
563 | @@ -760,48 +890,90 @@ class BeautifulSoup(Tag): |
564 | self.pushTag(tag) |
565 | return tag |
566 | |
567 | - def handle_endtag(self, name, nsprefix=None): |
568 | + def handle_endtag(self, name:str, nsprefix:Optional[str]=None) -> None: |
569 | """Called by the tree builder when an ending tag is encountered. |
570 | |
571 | :param name: Name of the tag. |
572 | :param nsprefix: Namespace prefix for the tag. |
573 | + |
574 | + :meta private: |
575 | """ |
576 | #print("End tag: " + name) |
577 | self.endData() |
578 | self._popToTag(name, nsprefix) |
579 | |
580 | - def handle_data(self, data): |
581 | - """Called by the tree builder when a chunk of textual data is encountered.""" |
582 | + def handle_data(self, data:str) -> None: |
583 | + """Called by the tree builder when a chunk of textual data is |
584 | + encountered. |
585 | + |
586 | + :meta private: |
587 | + """ |
588 | self.current_data.append(data) |
589 | |
590 | - def decode(self, pretty_print=False, |
591 | - eventual_encoding=DEFAULT_OUTPUT_ENCODING, |
592 | - formatter="minimal", iterator=None): |
593 | - """Returns a string or Unicode representation of the parse tree |
594 | - as an HTML or XML document. |
595 | - |
596 | - :param pretty_print: If this is True, indentation will be used to |
597 | - make the document more readable. |
598 | + def decode(self, indent_level:Optional[int]=None, |
599 | + eventual_encoding:_Encoding=DEFAULT_OUTPUT_ENCODING, |
600 | + formatter:Union[Formatter,str]="minimal", |
601 | + iterator:Optional[Iterable]=None, **kwargs) -> str: |
602 | + """Returns a string representation of the parse tree |
603 | + as a full HTML or XML document. |
604 | + |
605 | + :param indent_level: Each line of the rendering will be |
606 | + indented this many levels. (The ``formatter`` decides what a |
607 | + 'level' means, in terms of spaces or other characters |
608 | + output.) This is used internally in recursive calls while |
609 | + pretty-printing. |
610 | :param eventual_encoding: The encoding of the final document. |
611 | If this is None, the document will be a Unicode string. |
612 | + :param formatter: Either a `Formatter` object, or a string naming one of |
613 | + the standard formatters. |
614 | + :param iterator: The iterator to use when navigating over the |
615 | + parse tree. This is only used by `Tag.decode_contents` and |
616 | + you probably won't need to use it. |
617 | """ |
618 | if self.is_xml: |
619 | # Print the XML declaration |
620 | encoding_part = '' |
621 | + declared_encoding: Optional[str] = eventual_encoding |
622 | if eventual_encoding in PYTHON_SPECIFIC_ENCODINGS: |
623 | # This is a special Python encoding; it can't actually |
624 | # go into an XML document because it means nothing |
625 | # outside of Python. |
626 | - eventual_encoding = None |
627 | - if eventual_encoding != None: |
628 | - encoding_part = ' encoding="%s"' % eventual_encoding |
629 | + declared_encoding = None |
630 |         + if declared_encoding is not None:
631 | + encoding_part = ' encoding="%s"' % declared_encoding |
632 | prefix = '<?xml version="1.0"%s?>\n' % encoding_part |
633 | else: |
634 | prefix = '' |
635 | - if not pretty_print: |
636 | - indent_level = None |
637 | + |
638 | + # Prior to 4.13.0, the first argument to this method was a |
639 | + # bool called pretty_print, which gave the method a different |
640 | + # signature from its superclass implementation, Tag.decode. |
641 | + # |
642 | + # The signatures of the two methods now match, but just in |
643 | + # case someone is still passing a boolean in as the first |
644 | + # argument to this method (or a keyword argument with the old |
645 | + # name), we can handle it and put out a DeprecationWarning. |
646 | + warning:Optional[str] = None |
647 | + if isinstance(indent_level, bool): |
648 | + if indent_level is True: |
649 | + indent_level = 0 |
650 | + elif indent_level is False: |
651 | + indent_level = None |
652 | + warning = f"As of 4.13.0, the first argument to BeautifulSoup.decode has been changed from bool to int, to match Tag.decode. Pass in a value of {indent_level} instead." |
653 | else: |
654 | - indent_level = 0 |
655 | + pretty_print = kwargs.pop("pretty_print", None) |
656 | + assert not kwargs |
657 | + if pretty_print is not None: |
658 | + if pretty_print is True: |
659 | + indent_level = 0 |
660 | + elif pretty_print is False: |
661 | + indent_level = None |
662 | + warning = f"As of 4.13.0, the pretty_print argument to BeautifulSoup.decode has been removed, to match Tag.decode. Pass in a value of indent_level={indent_level} instead." |
663 | + |
664 | + if warning: |
665 | + warnings.warn(warning, DeprecationWarning, stacklevel=2) |
666 | + elif indent_level is False or pretty_print is False: |
667 | + indent_level = None |
668 | return prefix + super(BeautifulSoup, self).decode( |
669 | indent_level, eventual_encoding, formatter, iterator) |
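The backward-compatibility shim above maps the old boolean `pretty_print` argument onto the new integer `indent_level` contract. The following standalone sketch reproduces that mapping so it can be seen in isolation; the function name is hypothetical and this is illustrative, not the bs4 implementation itself.

```python
# Sketch of the pretty_print -> indent_level compatibility shim described
# in the diff above. normalize_indent_level is a hypothetical name.
import warnings
from typing import Optional, Union


def normalize_indent_level(indent_level: Union[bool, int, None] = None,
                           **kwargs) -> Optional[int]:
    """Map legacy bool/pretty_print values onto the new indent_level contract."""
    warning = None
    if isinstance(indent_level, bool):
        # A bool positional argument is the pre-4.13.0 pretty_print flag:
        # True means "pretty-print from level 0", False means "compact".
        indent_level = 0 if indent_level else None
        warning = ("As of 4.13.0 the first argument is an int, not a bool; "
                   f"pass indent_level={indent_level} instead.")
    else:
        pretty_print = kwargs.pop("pretty_print", None)
        if pretty_print is not None:
            indent_level = 0 if pretty_print else None
            warning = ("The pretty_print keyword was removed; "
                       f"pass indent_level={indent_level} instead.")
    if warning:
        warnings.warn(warning, DeprecationWarning, stacklevel=2)
    return indent_level
```

Note the asymmetry: `True` becomes `0` (pretty-print at the top level) while `False` becomes `None` (no pretty-printing at all), so the two legacy values land on the two meaningful corners of the new API.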
670 | |
671 | @@ -815,7 +987,7 @@ class BeautifulStoneSoup(BeautifulSoup): |
672 | def __init__(self, *args, **kwargs): |
673 | kwargs['features'] = 'xml' |
674 | warnings.warn( |
675 | - 'The BeautifulStoneSoup class is deprecated. Instead of using ' |
676 | + 'The BeautifulStoneSoup class was deprecated in version 4.0.0. Instead of using ' |
677 | 'it, pass features="xml" into the BeautifulSoup constructor.', |
678 | DeprecationWarning, stacklevel=2 |
679 | ) |
680 | diff --git a/bs4/_deprecation.py b/bs4/_deprecation.py |
681 | new file mode 100644 |
682 | index 0000000..febc1b3 |
683 | --- /dev/null |
684 | +++ b/bs4/_deprecation.py |
685 | @@ -0,0 +1,57 @@ |
686 | +"""Helper functions for deprecation. |
687 | + |
688 | +This interface is itself unstable and may change without warning. Do |
689 | +not use these functions yourself, even as a joke. The underscores are
690 | +there for a reason.
691 | + |
692 | +In particular, most of this will go away once Beautiful Soup drops |
693 | +support for Python 3.11, since Python 3.12 defines a |
694 | +`@typing.deprecated() decorator. <https://peps.python.org/pep-0702/>`_ |
695 | +""" |
696 | + |
697 | +import functools |
698 | +import warnings |
699 | + |
700 | +from typing import ( |
701 | + Any, |
702 | + Callable, |
703 | +) |
704 | + |
705 | +def _deprecated_alias(old_name:str, new_name:str, version:str) -> property:
706 | + """Alias one attribute name to another for backward compatibility |
707 | + |
708 | + :meta private: |
709 | + """ |
710 | + @property |
711 | + def alias(self) -> Any: |
712 | + ":meta private:" |
713 | + warnings.warn(f"Access to deprecated property {old_name}. (Replaced by {new_name}) -- Deprecated since version {version}.", DeprecationWarning, stacklevel=2) |
714 | + return getattr(self, new_name) |
715 | + |
716 | + @alias.setter |
717 | + def alias(self, value:str)->Any: |
718 | + ":meta private:" |
719 | + warnings.warn(f"Write to deprecated property {old_name}. (Replaced by {new_name}) -- Deprecated since version {version}.", DeprecationWarning, stacklevel=2) |
720 | + return setattr(self, new_name, value) |
721 | + return alias |
722 | + |
723 | +def _deprecated_function_alias(old_name:str, new_name:str, version:str) -> Callable: |
724 | + def alias(self, *args, **kwargs): |
725 | + ":meta private:" |
726 | + warnings.warn(f"Call to deprecated method {old_name}. (Replaced by {new_name}) -- Deprecated since version {version}.", DeprecationWarning, stacklevel=2) |
727 | + return getattr(self, new_name)(*args, **kwargs) |
728 | + return alias |
729 | + |
730 | +def _deprecated(replaced_by:str, version:str) -> Callable: |
731 | + def deprecate(func): |
732 | + @functools.wraps(func) |
733 | + def with_warning(*args, **kwargs): |
734 | + ":meta private:" |
735 | + warnings.warn( |
736 | + f"Call to deprecated method {func.__name__}. (Replaced by {replaced_by}) -- Deprecated since version {version}.", |
737 | + DeprecationWarning, |
738 | + stacklevel=2 |
739 | + ) |
740 | + return func(*args, **kwargs) |
741 | + return with_warning |
742 | + return deprecate |
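The `_deprecated_alias` helper above builds a property that forwards reads and writes to a renamed attribute while emitting a `DeprecationWarning`. Here is an illustrative, self-contained usage sketch; the helper is reproduced verbatim in spirit, and the `Example` class and its attribute names are hypothetical, not part of bs4.

```python
# Illustrative use of the _deprecated_alias helper shown in the diff above.
import warnings
from typing import Any


def _deprecated_alias(old_name: str, new_name: str, version: str) -> property:
    """Alias one attribute name to another for backward compatibility."""
    @property
    def alias(self) -> Any:
        warnings.warn(
            f"Access to deprecated property {old_name}. "
            f"(Replaced by {new_name}) -- Deprecated since version {version}.",
            DeprecationWarning, stacklevel=2)
        return getattr(self, new_name)

    @alias.setter
    def alias(self, value: Any) -> None:
        warnings.warn(
            f"Write to deprecated property {old_name}. "
            f"(Replaced by {new_name}) -- Deprecated since version {version}.",
            DeprecationWarning, stacklevel=2)
        setattr(self, new_name, value)
    return alias


class Example:
    # Hypothetical class: the old camelCase name forwards to the new
    # snake_case attribute, warning on every read and write.
    parse_only = None
    parseOnly = _deprecated_alias("parseOnly", "parse_only", "4.13.0")
```

Because the alias is a data descriptor, both `obj.parseOnly` and `obj.parseOnly = x` route through it, so a single class-level assignment covers reads and writes.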
743 | diff --git a/bs4/_typing.py b/bs4/_typing.py |
744 | new file mode 100644 |
745 | index 0000000..9fe58c3 |
746 | --- /dev/null |
747 | +++ b/bs4/_typing.py |
748 | @@ -0,0 +1,99 @@ |
749 | +# Custom type aliases used throughout Beautiful Soup to improve readability. |
750 | + |
751 | +# Notes on improvements to the type system in newer versions of Python |
752 | +# that can be used once Beautiful Soup drops support for older |
753 | +# versions: |
754 | +# |
755 | +# * In 3.10, x|y is an accepted shorthand for Union[x,y]. |
756 | +# * In 3.10, TypeAlias gains capabilities that can be used to |
757 | +# improve the tree matching types (I don't remember what, exactly). |
758 | + |
759 | +import re |
760 | +from typing_extensions import TypeAlias |
761 | +from typing import ( |
762 | + Callable, |
763 | + Dict, |
764 | + IO, |
765 | + Iterable, |
766 | + Pattern, |
767 | + TYPE_CHECKING, |
768 | + Union, |
769 | +) |
770 | + |
771 | +if TYPE_CHECKING: |
772 | + from bs4.element import Tag |
773 | + |
774 | +# Aliases for markup in various stages of processing. |
775 | +# |
776 | +# The rawest form of markup: either a string or an open filehandle. |
777 | +_IncomingMarkup: TypeAlias = Union[str,bytes,IO] |
778 | + |
779 | +# Markup that is in memory but has (potentially) yet to be converted |
780 | +# to Unicode. |
781 | +_RawMarkup: TypeAlias = Union[str,bytes] |
782 | + |
783 | +# Aliases for character encodings |
784 | +# |
785 | +_Encoding:TypeAlias = str |
786 | +_Encodings:TypeAlias = Iterable[_Encoding] |
787 | + |
788 | +# Aliases for XML namespaces |
789 | +_NamespacePrefix:TypeAlias = str |
790 | +_NamespaceURL:TypeAlias = str |
791 | +_NamespaceMapping:TypeAlias = Dict[_NamespacePrefix, _NamespaceURL] |
792 | +_InvertedNamespaceMapping:TypeAlias = Dict[_NamespaceURL, _NamespacePrefix] |
793 | + |
794 | +# Aliases for the attribute values associated with HTML/XML tags. |
795 | +# |
796 | +# Note that these are attribute values in their final form, as stored |
797 | +# in the `Tag` class. Different parsers present attributes to the |
798 | +# `TreeBuilder` subclasses in different formats, which are not defined |
799 | +# here. |
800 | +_AttributeValue: TypeAlias = Union[str, Iterable[str]] |
801 | +_AttributeValues: TypeAlias = Dict[str, _AttributeValue] |
802 | + |
803 | +# Aliases to represent the many possibilities for matching bits of a |
804 | +# parse tree. |
805 | +# |
806 | +# This is very complicated because we're applying a formal type system |
807 | +# to some very DWIM code. The types we end up with will be the types |
808 | +# of the arguments to the SoupStrainer constructor and (more |
809 | +# familiarly to Beautiful Soup users) the find* methods. |
810 | + |
811 | +# A function that takes a Tag and returns a yes-or-no answer. |
812 | +# A TagNameMatchRule expects this kind of function, if you're |
813 | +# going to pass it a function. |
814 | +_TagMatchFunction:TypeAlias = Callable[['Tag'], bool] |
815 | + |
816 | +# A function that takes a single string and returns a yes-or-no |
817 | +# answer. An AttributeValueMatchRule expects this kind of function, if |
818 | +# you're going to pass it a function. So does a StringMatchRule.
819 | +_StringMatchFunction:TypeAlias = Callable[[str], bool] |
820 | + |
821 | +# A function that takes a Tag or string and returns a yes-or-no |
822 | +# answer. |
823 | +_TagOrStringMatchFunction:TypeAlias = Union[_TagMatchFunction, _StringMatchFunction, bool] |
824 | + |
825 | +# Either a tag name, an attribute value or a string can be matched |
826 | +# against a string, bytestring, regular expression, or a boolean. |
827 | +_BaseStrainable:TypeAlias = Union[str, bytes, Pattern[str], bool] |
828 | + |
829 | +# A tag can also be matched using a function that takes the Tag |
830 | +# as its sole argument. |
831 | +_BaseStrainableElement:TypeAlias = Union[_BaseStrainable, _TagMatchFunction] |
832 | + |
833 | +# A tag's attribute value can be matched using a function that takes |
834 | +# the value as its sole argument. |
835 | +_BaseStrainableAttribute:TypeAlias = Union[_BaseStrainable, _StringMatchFunction] |
836 | + |
837 | +# Finally, a tag name, attribute or string can be matched using either |
838 | +# a single criterion or a list of criteria. |
839 | +_StrainableElement:TypeAlias = Union[ |
840 | + _BaseStrainableElement, Iterable[_BaseStrainableElement] |
841 | +] |
842 | +_StrainableAttribute:TypeAlias = Union[ |
843 | + _BaseStrainableAttribute, Iterable[_BaseStrainableAttribute] |
844 | +] |
845 | + |
846 | +_StrainableAttributes:TypeAlias = Dict[str, _StrainableAttribute] |
847 | +_StrainableString:TypeAlias = _StrainableAttribute |
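The aliases above formalize everything the `find*` methods accept as a matching criterion: a string, bytestring, compiled regular expression, boolean, function, or an iterable of any of these. The sketch below shows a simplified matcher covering each alternative of `_StrainableElement`; it is illustrative only and is not the bs4 `SoupStrainer` implementation.

```python
# Simplified dispatch over the _StrainableElement alternatives defined above.
import re
from typing import Callable, Iterable, Pattern, Union

Criterion = Union[str, bytes, Pattern[str], bool, Callable[[str], bool]]


def matches(tag_name: str, criterion: Union[Criterion, Iterable[Criterion]]) -> bool:
    # bool must be checked first: True/False are blanket match/no-match.
    if isinstance(criterion, bool):
        return criterion
    # A string or bytestring matches the tag name exactly.
    if isinstance(criterion, (str, bytes)):
        want = criterion.decode("utf8") if isinstance(criterion, bytes) else criterion
        return tag_name == want
    # A compiled regular expression matches anywhere in the name.
    if hasattr(criterion, "search"):
        return criterion.search(tag_name) is not None
    # A function is called with the name and returns a yes-or-no answer.
    if callable(criterion):
        return criterion(tag_name)
    # Otherwise an iterable of criteria: match if any single criterion matches.
    return any(matches(tag_name, c) for c in criterion)
```

The dispatch order matters: `bool` is a subclass of `int` and strings are themselves iterable, so those cases must be peeled off before the generic iterable fallback.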
848 | diff --git a/bs4/builder/__init__.py b/bs4/builder/__init__.py |
849 | index 2e39745..671315d 100644 |
850 | --- a/bs4/builder/__init__.py |
851 | +++ b/bs4/builder/__init__.py |
852 | @@ -1,9 +1,25 @@ |
853 | +from __future__ import annotations |
854 | # Use of this source code is governed by the MIT license. |
855 | __license__ = "MIT" |
856 | |
857 | from collections import defaultdict |
858 | import itertools |
859 | import re |
860 | +from types import ModuleType |
861 | +from typing import ( |
862 | + Any, |
863 | + cast, |
864 | + Dict, |
865 | + Iterable, |
866 | + List, |
867 | + Optional, |
868 | + Pattern, |
869 | + Set, |
870 | + Tuple, |
871 | + Type, |
872 | + TYPE_CHECKING, |
873 | + Union, |
874 | +) |
875 | import warnings |
876 | import sys |
877 | from bs4.element import ( |
878 | @@ -17,6 +33,18 @@ from bs4.element import ( |
879 | nonwhitespace_re |
880 | ) |
881 | |
882 | +if TYPE_CHECKING: |
883 | + from bs4 import BeautifulSoup |
884 | + from bs4.element import ( |
885 | + NavigableString, Tag, |
886 | + _AttributeValues, _AttributeValue, |
887 | + ) |
888 | + from bs4._typing import ( |
889 | + _Encoding, |
890 | + _Encodings, |
891 | + _RawMarkup, |
892 | + ) |
893 | + |
894 | __all__ = [ |
895 | 'HTMLTreeBuilder', |
896 | 'SAXTreeBuilder', |
897 | @@ -36,29 +64,32 @@ class XMLParsedAsHTMLWarning(UserWarning): |
898 | """The warning issued when an HTML parser is used to parse |
899 | XML that is not XHTML. |
900 | """ |
901 | - MESSAGE = """It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.""" |
902 | + MESSAGE:str = """It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.""" #: :meta private: |
903 | |
904 | |
905 | class TreeBuilderRegistry(object): |
906 | """A way of looking up TreeBuilder subclasses by their name or by desired |
907 | features. |
908 | """ |
909 | + |
910 | + builders_for_feature: Dict[str, List[Type[TreeBuilder]]] |
911 | + builders: List[Type[TreeBuilder]] |
912 | |
913 | def __init__(self): |
914 | self.builders_for_feature = defaultdict(list) |
915 | self.builders = [] |
916 | |
917 | - def register(self, treebuilder_class): |
918 | + def register(self, treebuilder_class:type[TreeBuilder]) -> None: |
919 | """Register a treebuilder based on its advertised features. |
920 | |
921 | - :param treebuilder_class: A subclass of Treebuilder. its .features |
922 | - attribute should list its features. |
923 |         + :param treebuilder_class: A subclass of `TreeBuilder`. Its
924 |         + `TreeBuilder.features` attribute should list its features.
925 | """ |
926 | for feature in treebuilder_class.features: |
927 | self.builders_for_feature[feature].insert(0, treebuilder_class) |
928 | self.builders.insert(0, treebuilder_class) |
929 | |
930 | - def lookup(self, *features): |
931 | + def lookup(self, *features:str) -> Optional[Type[TreeBuilder]]: |
932 | """Look up a TreeBuilder subclass with the desired features. |
933 | |
934 | :param features: A list of features to look for. If none are |
935 | @@ -78,12 +109,12 @@ class TreeBuilderRegistry(object): |
936 | |
937 | # Go down the list of features in order, and eliminate any builders |
938 | # that don't match every feature. |
939 | - features = list(features) |
940 | - features.reverse() |
941 | + feature_list = list(features) |
942 | + feature_list.reverse() |
943 | candidates = None |
944 | candidate_set = None |
945 | - while len(features) > 0: |
946 | - feature = features.pop() |
947 | + while len(feature_list) > 0: |
948 | + feature = feature_list.pop() |
949 | we_have_the_feature = self.builders_for_feature.get(feature, []) |
950 | if len(we_have_the_feature) > 0: |
951 | if candidates is None: |
952 | @@ -97,81 +128,61 @@ class TreeBuilderRegistry(object): |
953 | # The only valid candidates are the ones in candidate_set. |
954 | # Go through the original list of candidates and pick the first one |
955 | # that's in candidate_set. |
956 | - if candidate_set is None: |
957 | + if candidate_set is None or candidates is None: |
958 | return None |
959 | for candidate in candidates: |
960 | if candidate in candidate_set: |
961 | return candidate |
962 | return None |
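The `lookup` method above intersects, per feature, the sets of builders that advertise that feature, then returns the first (most recently registered) builder in the intersection. A minimal sketch of that logic, with hypothetical builder names standing in for `TreeBuilder` subclasses:

```python
# Minimal sketch of TreeBuilderRegistry's feature-intersection lookup.
# Builder names stand in for TreeBuilder classes; illustrative only.
from collections import defaultdict
from typing import Dict, List, Optional


class MiniRegistry:
    def __init__(self) -> None:
        self.builders_for_feature: Dict[str, List[str]] = defaultdict(list)
        self.builders: List[str] = []

    def register(self, name: str, features: List[str]) -> None:
        # insert(0, ...) so the most recent registration is preferred.
        for feature in features:
            self.builders_for_feature[feature].insert(0, name)
        self.builders.insert(0, name)

    def lookup(self, *features: str) -> Optional[str]:
        if not self.builders:
            return None
        if not features:
            return self.builders[0]
        # Keep only builders that advertise every requested feature.
        candidate_set = None
        for feature in features:
            have = self.builders_for_feature.get(feature, [])
            if candidate_set is None:
                candidate_set = set(have)
            else:
                candidate_set &= set(have)
        # Return the first candidate in registration-preference order.
        for candidate in self.builders:
            if candidate in candidate_set:
                return candidate
        return None
```

This mirrors why `lookup()` with no arguments returns the most recently registered builder, and why requesting a feature no builder has yields `None`.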
963 | |
964 | -# The BeautifulSoup class will take feature lists from developers and use them |
965 | -# to look up builders in this registry. |
966 | -builder_registry = TreeBuilderRegistry() |
967 | +#: The `BeautifulSoup` constructor will take a list of features |
968 | +#: and use it to look up `TreeBuilder` classes in this registry. |
969 | +builder_registry:TreeBuilderRegistry = TreeBuilderRegistry() |
970 | |
971 | class TreeBuilder(object): |
972 | - """Turn a textual document into a Beautiful Soup object tree.""" |
973 | - |
974 | - NAME = "[Unknown tree builder]" |
975 | - ALTERNATE_NAMES = [] |
976 | - features = [] |
977 | - |
978 | - is_xml = False |
979 | - picklable = False |
980 | - empty_element_tags = None # A tag will be considered an empty-element |
981 | - # tag when and only when it has no contents. |
982 | - |
983 | - # A value for these tag/attribute combinations is a space- or |
984 | - # comma-separated list of CDATA, rather than a single CDATA. |
985 | - DEFAULT_CDATA_LIST_ATTRIBUTES = defaultdict(list) |
986 | - |
987 | - # Whitespace should be preserved inside these tags. |
988 | - DEFAULT_PRESERVE_WHITESPACE_TAGS = set() |
989 | - |
990 | - # The textual contents of tags with these names should be |
991 | - # instantiated with some class other than NavigableString. |
992 | - DEFAULT_STRING_CONTAINERS = {} |
993 | - |
994 | - USE_DEFAULT = object() |
995 | + """Turn a textual document into a Beautiful Soup object tree. |
996 | + |
997 | + This is an abstract superclass which smooths out the behavior of |
998 | + different parser libraries into a single, unified interface. |
999 | + |
1000 | + :param multi_valued_attributes: If this is set to None, the |
1001 | + TreeBuilder will not turn any values for attributes like |
1002 | + 'class' into lists. Setting this to a dictionary will |
1003 | + customize this behavior; look at DEFAULT_CDATA_LIST_ATTRIBUTES |
1004 | + for an example. |
1005 | + |
1006 | + Internally, these are called "CDATA list attributes", but that |
1007 | + probably doesn't make sense to an end-user, so the argument name |
1008 | + is `multi_valued_attributes`. |
1009 | + |
1010 | + :param preserve_whitespace_tags: A set of tags to treat |
1011 | + the way <pre> tags are treated in HTML. Tags in this set |
1012 | + are immune from pretty-printing; their contents will always be |
1013 | + output as-is. |
1014 | + |
1015 | + :param string_containers: A dictionary mapping tag names to |
1016 | + the classes that should be instantiated to contain the textual |
1017 | + contents of those tags. The default is to use NavigableString |
1018 | + for every tag, no matter what the name. You can override the |
1019 | + default by changing DEFAULT_STRING_CONTAINERS. |
1020 | + |
1021 | + :param store_line_numbers: If the parser keeps track of the |
1022 | + line numbers and positions of the original markup, that |
1023 | + information will, by default, be stored in each corresponding |
1024 | + `Tag` object. You can turn this off by passing |
1025 | + store_line_numbers=False. If the parser you're using doesn't |
1026 | + keep track of this information, then setting store_line_numbers=True |
1027 | + will do nothing. |
1028 | + """ |
1029 | |
1030 | - # Most parsers don't keep track of line numbers. |
1031 | - TRACKS_LINE_NUMBERS = False |
1032 | + USE_DEFAULT: Any = object() #: :meta private: |
1033 | |
1034 | - def __init__(self, multi_valued_attributes=USE_DEFAULT, |
1035 | - preserve_whitespace_tags=USE_DEFAULT, |
1036 | - store_line_numbers=USE_DEFAULT, |
1037 | - string_containers=USE_DEFAULT, |
1038 | + def __init__(self, multi_valued_attributes:Dict[str, Set[str]]=USE_DEFAULT, |
1039 | + preserve_whitespace_tags:Set[str]=USE_DEFAULT, |
1040 | + store_line_numbers:bool=USE_DEFAULT, |
1041 | + string_containers:Dict[str, Type[NavigableString]]=USE_DEFAULT, |
1042 | + empty_element_tags:Set[str]=USE_DEFAULT |
1043 | ): |
1044 | - """Constructor. |
1045 | - |
1046 | - :param multi_valued_attributes: If this is set to None, the |
1047 | - TreeBuilder will not turn any values for attributes like |
1048 | - 'class' into lists. Setting this to a dictionary will |
1049 | - customize this behavior; look at DEFAULT_CDATA_LIST_ATTRIBUTES |
1050 | - for an example. |
1051 | - |
1052 | - Internally, these are called "CDATA list attributes", but that |
1053 | - probably doesn't make sense to an end-user, so the argument name |
1054 | - is `multi_valued_attributes`. |
1055 | - |
1056 | - :param preserve_whitespace_tags: A list of tags to treat |
1057 | - the way <pre> tags are treated in HTML. Tags in this list |
1058 | - are immune from pretty-printing; their contents will always be |
1059 | - output as-is. |
1060 | - |
1061 | - :param string_containers: A dictionary mapping tag names to |
1062 | - the classes that should be instantiated to contain the textual |
1063 | - contents of those tags. The default is to use NavigableString |
1064 | - for every tag, no matter what the name. You can override the |
1065 | - default by changing DEFAULT_STRING_CONTAINERS. |
1066 | - |
1067 | - :param store_line_numbers: If the parser keeps track of the |
1068 | - line numbers and positions of the original markup, that |
1069 | - information will, by default, be stored in each corresponding |
1070 | - `Tag` object. You can turn this off by passing |
1071 | - store_line_numbers=False. If the parser you're using doesn't |
1072 | - keep track of this information, then setting store_line_numbers=True |
1073 | - will do nothing. |
1074 | - """ |
1075 | self.soup = None |
1076 | if multi_valued_attributes is self.USE_DEFAULT: |
1077 | multi_valued_attributes = self.DEFAULT_CDATA_LIST_ATTRIBUTES |
1078 | @@ -179,14 +190,55 @@ class TreeBuilder(object): |
1079 | if preserve_whitespace_tags is self.USE_DEFAULT: |
1080 | preserve_whitespace_tags = self.DEFAULT_PRESERVE_WHITESPACE_TAGS |
1081 | self.preserve_whitespace_tags = preserve_whitespace_tags |
1082 | + if empty_element_tags is self.USE_DEFAULT: |
1083 | + self.empty_element_tags = self.DEFAULT_EMPTY_ELEMENT_TAGS |
1084 | + else: |
1085 | + self.empty_element_tags = empty_element_tags |
1086 | if store_line_numbers == self.USE_DEFAULT: |
1087 | store_line_numbers = self.TRACKS_LINE_NUMBERS |
1088 | self.store_line_numbers = store_line_numbers |
1089 | if string_containers == self.USE_DEFAULT: |
1090 | string_containers = self.DEFAULT_STRING_CONTAINERS |
1091 | self.string_containers = string_containers |
1092 | + |
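The constructor above uses a class-level sentinel object (`USE_DEFAULT`) rather than `None` as the default, so a caller can pass `None` explicitly (meaning "disable this behavior") while "no argument given" still falls back to the class default. A minimal sketch of that pattern, with a hypothetical `Builder` class:

```python
# Sketch of the USE_DEFAULT sentinel pattern used by TreeBuilder.__init__.
from typing import Any, Set


class Builder:
    # A unique object: distinct from None and from any value a caller
    # could plausibly pass, so "not given" is unambiguous.
    USE_DEFAULT: Any = object()
    DEFAULT_PRESERVE_WHITESPACE_TAGS: Set[str] = {"pre", "textarea"}

    def __init__(self, preserve_whitespace_tags: Any = USE_DEFAULT) -> None:
        if preserve_whitespace_tags is self.USE_DEFAULT:
            preserve_whitespace_tags = self.DEFAULT_PRESERVE_WHITESPACE_TAGS
        self.preserve_whitespace_tags = preserve_whitespace_tags
```

With a plain `None` default, `Builder(preserve_whitespace_tags=None)` and `Builder()` would be indistinguishable; the sentinel keeps those two calls separate.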
1093 | + NAME:str = "[Unknown tree builder]" |
1094 | + ALTERNATE_NAMES: Iterable[str] = [] |
1095 | + features: Iterable[str] = [] |
1096 | + |
1097 | + is_xml: bool = False |
1098 | + picklable: bool = False |
1099 | + |
1100 | + soup: Optional[BeautifulSoup] #: :meta private: |
1101 | + |
1102 | + #: A tag will be considered an empty-element |
1103 | + #: tag when and only when it has no contents. |
1104 | + empty_element_tags: Optional[Set[str]] = None #: :meta private: |
1105 | + cdata_list_attributes: Dict[str, Set[str]] #: :meta private: |
1106 | + preserve_whitespace_tags: Set[str] #: :meta private: |
1107 | + string_containers: Dict[str, Type[NavigableString]] #: :meta private: |
1108 | + tracks_line_numbers: bool #: :meta private: |
1109 | + |
1110 | + #: A value for these tag/attribute combinations is a space- or |
1111 | + #: comma-separated list of CDATA, rather than a single CDATA. |
1112 | + DEFAULT_CDATA_LIST_ATTRIBUTES : Dict[str, Set[str]] = defaultdict(set) |
1113 | + |
1114 | + #: Whitespace should be preserved inside these tags. |
1115 | + DEFAULT_PRESERVE_WHITESPACE_TAGS : Set[str] = set() |
1116 | + |
1117 | + #: The textual contents of tags with these names should be |
1118 | + #: instantiated with some class other than NavigableString. |
1119 | + DEFAULT_STRING_CONTAINERS: Dict[str, Type[NavigableString]] = {} |
1120 | + |
1121 | + #: By default, tags are treated as empty-element tags if they have |
1122 | + #: no contents--that is, using XML rules. HTMLTreeBuilder |
1123 | + #: defines a different set of DEFAULT_EMPTY_ELEMENT_TAGS based on the |
1124 | + #: HTML 4 and HTML5 standards. |
1125 |         + DEFAULT_EMPTY_ELEMENT_TAGS: Optional[Set[str]] = None
1126 | + |
1127 | + #: Most parsers don't keep track of line numbers. |
1128 | + TRACKS_LINE_NUMBERS: bool = False |
1129 | |
1130 | - def initialize_soup(self, soup): |
1131 | + def initialize_soup(self, soup:BeautifulSoup) -> None: |
1132 | """The BeautifulSoup object has been initialized and is now |
1133 | being associated with the TreeBuilder. |
1134 | |
1135 | @@ -194,7 +246,7 @@ class TreeBuilder(object): |
1136 | """ |
1137 | self.soup = soup |
1138 | |
1139 | - def reset(self): |
1140 | + def reset(self) -> None: |
1141 | """Do any work necessary to reset the underlying parser |
1142 | for a new document. |
1143 | |
1144 | @@ -202,7 +254,7 @@ class TreeBuilder(object): |
1145 | """ |
1146 | pass |
1147 | |
1148 | - def can_be_empty_element(self, tag_name): |
1149 | + def can_be_empty_element(self, tag_name:str) -> bool: |
1150 | """Might a tag with this name be an empty-element tag? |
1151 | |
1152 | The final markup may or may not actually present this tag as |
1153 | @@ -225,46 +277,48 @@ class TreeBuilder(object): |
1154 | return True |
1155 | return tag_name in self.empty_element_tags |
1156 | |
1157 | - def feed(self, markup): |
1158 | + def feed(self, markup:str) -> None: |
1159 | """Run some incoming markup through some parsing process, |
1160 | - populating the `BeautifulSoup` object in self.soup. |
1161 | - |
1162 | - This method is not implemented in TreeBuilder; it must be |
1163 | - implemented in subclasses. |
1164 | - |
1165 | - :return: None. |
1166 | + populating the `BeautifulSoup` object in `TreeBuilder.soup` |
1167 | """ |
1168 | raise NotImplementedError() |
1169 | |
1170 | - def prepare_markup(self, markup, user_specified_encoding=None, |
1171 | - document_declared_encoding=None, exclude_encodings=None): |
1172 | + def prepare_markup( |
1173 | + self, markup:_RawMarkup, |
1174 | + user_specified_encoding:Optional[_Encoding]=None, |
1175 | + document_declared_encoding:Optional[_Encoding]=None, |
1176 | + exclude_encodings:Optional[_Encodings]=None |
1177 | + ) -> Iterable[Tuple[_RawMarkup, Optional[_Encoding], Optional[_Encoding], bool]]: |
1178 | """Run any preliminary steps necessary to make incoming markup |
1179 | acceptable to the parser. |
1180 | |
1181 | - :param markup: Some markup -- probably a bytestring. |
1182 | - :param user_specified_encoding: The user asked to try this encoding. |
1183 | + :param markup: The markup that's about to be parsed. |
1184 | + :param user_specified_encoding: The user asked to try this encoding |
1185 | + to convert the markup into a Unicode string. |
1186 | :param document_declared_encoding: The markup itself claims to be |
1187 | in this encoding. NOTE: This argument is not used by the |
1188 | calling code and can probably be removed. |
1189 | - :param exclude_encodings: The user asked _not_ to try any of |
1190 | + :param exclude_encodings: The user asked *not* to try any of |
1191 | these encodings. |
1192 | |
1193 | - :yield: A series of 4-tuples: |
1194 | - (markup, encoding, declared encoding, |
1195 | - has undergone character replacement) |
1196 | + :yield: A series of 4-tuples: (markup, encoding, declared encoding, |
1197 | + has undergone character replacement) |
1198 | |
1199 | - Each 4-tuple represents a strategy for converting the |
1200 | - document to Unicode and parsing it. Each strategy will be tried |
1201 | - in turn. |
1202 | + Each 4-tuple represents a strategy that the parser can try |
1203 | + to convert the document to Unicode and parse it. Each |
1204 | + strategy will be tried in turn. |
1205 | |
1206 | By default, the only strategy is to parse the markup |
1207 | as-is. See `LXMLTreeBuilderForXML` and |
1208 | `HTMLParserTreeBuilder` for implementations that take into |
1209 | account the quirks of particular parsers. |
1210 | + |
1211 | + :meta private: |
1212 | + |
1213 | """ |
1214 | yield markup, None, None, False |
1215 | |
1216 | - def test_fragment_to_document(self, fragment): |
1217 | + def test_fragment_to_document(self, fragment:str) -> str: |
1218 | """Wrap an HTML fragment to make it look like a document. |
1219 | |
1220 | Different parsers do this differently. For instance, lxml |
1221 | @@ -273,26 +327,27 @@ class TreeBuilder(object): |
1222 | which run HTML fragments through the parser and compare the |
1223 | results against other HTML fragments. |
1224 | |
1225 | - This method should not be used outside of tests. |
1226 | + This method should not be used outside of unit tests. |
1227 | |
1228 | - :param fragment: A string -- fragment of HTML. |
1229 | - :return: A string -- a full HTML document. |
1230 | + :param fragment: A fragment of HTML. |
1231 | + :return: A full HTML document. |
1232 | + :meta private: |
1233 | """ |
1234 | return fragment |
1235 | |
1236 | - def set_up_substitutions(self, tag): |
1237 | + def set_up_substitutions(self, tag:Tag) -> bool: |
1238 | """Set up any substitutions that will need to be performed on |
1239 | a `Tag` when it's output as a string. |
1240 | |
1241 | By default, this does nothing. See `HTMLTreeBuilder` for a |
1242 | case where this is used. |
1243 | |
1244 | - :param tag: A `Tag` |
1245 | :return: Whether or not a substitution was performed. |
1246 | + :meta private: |
1247 | """ |
1248 | return False |
1249 | |
1250 | - def _replace_cdata_list_attribute_values(self, tag_name, attrs): |
1251 | + def _replace_cdata_list_attribute_values(self, tag_name:str, attrs:_AttributeValues): |
1252 | """When an attribute value is associated with a tag that can |
1253 | have multiple values for that attribute, convert the string |
1254 | value to a list of strings. |
1255 | @@ -308,10 +363,11 @@ class TreeBuilder(object): |
1256 | if not attrs: |
1257 | return attrs |
1258 | if self.cdata_list_attributes: |
1259 | - universal = self.cdata_list_attributes.get('*', []) |
1260 | + universal: Set[str] = self.cdata_list_attributes.get('*', set()) |
1261 | tag_specific = self.cdata_list_attributes.get( |
1262 | tag_name.lower(), None) |
1263 | for attr in list(attrs.keys()): |
1264 | + values: _AttributeValue |
1265 | if attr in universal or (tag_specific and attr in tag_specific): |
1266 | # We have a "class"-type attribute whose string |
1267 | # value is a whitespace-separated list of |
1268 | @@ -337,7 +393,15 @@ class SAXTreeBuilder(TreeBuilder): |
1269 | how a simple TreeBuilder would work. |
1270 | """ |
1271 | |
1272 | - def feed(self, markup): |
1273 | + def __init__(self, *args, **kwargs): |
1274 | + warnings.warn( |
1275 |         + "The SAXTreeBuilder class was deprecated in 4.13.0. It is completely untested and probably doesn't work; use at your own risk.",
1276 | + DeprecationWarning, |
1277 | + stacklevel=2 |
1278 | + ) |
1279 | + super(SAXTreeBuilder, self).__init__(*args, **kwargs) |
1280 | + |
1281 | + def feed(self, markup:_RawMarkup): |
1282 | raise NotImplementedError() |
1283 | |
1284 | def close(self): |
1285 | @@ -381,12 +445,13 @@ class SAXTreeBuilder(TreeBuilder): |
1286 | |
1287 | |
1288 | class HTMLTreeBuilder(TreeBuilder): |
1289 | - """This TreeBuilder knows facts about HTML. |
1290 | - |
1291 | - Such as which tags are empty-element tags. |
1292 | + """This TreeBuilder knows facts about HTML, such as which tags are treated |
1293 | + specially by the HTML standard. |
1294 | """ |
1295 | |
1296 | - empty_element_tags = set([ |
1297 | + #: Some HTML tags are defined as having no contents. Beautiful Soup |
1298 | + #: treats these specially. |
1299 | + DEFAULT_EMPTY_ELEMENT_TAGS: Set[str] = set([ |
1300 | # These are from HTML5. |
1301 | 'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input', 'keygen', 'link', 'menuitem', 'meta', 'param', 'source', 'track', 'wbr', |
1302 | |
1303 | @@ -394,29 +459,29 @@ class HTMLTreeBuilder(TreeBuilder): |
1304 | 'basefont', 'bgsound', 'command', 'frame', 'image', 'isindex', 'nextid', 'spacer' |
1305 | ]) |
1306 | |
1307 | - # The HTML standard defines these as block-level elements. Beautiful |
1308 | - # Soup does not treat these elements differently from other elements, |
1309 | - # but it may do so eventually, and this information is available if |
1310 | - # you need to use it. |
1311 | - block_elements = set(["address", "article", "aside", "blockquote", "canvas", "dd", "div", "dl", "dt", "fieldset", "figcaption", "figure", "footer", "form", "h1", "h2", "h3", "h4", "h5", "h6", "header", "hr", "li", "main", "nav", "noscript", "ol", "output", "p", "pre", "section", "table", "tfoot", "ul", "video"]) |
1312 | - |
1313 | - # These HTML tags need special treatment so they can be |
1314 | - # represented by a string class other than NavigableString. |
1315 | - # |
1316 | - # For some of these tags, it's because the HTML standard defines |
1317 | - # an unusual content model for them. I made this list by going |
1318 | - # through the HTML spec |
1319 | - # (https://html.spec.whatwg.org/#metadata-content) and looking for |
1320 | - # "metadata content" elements that can contain strings. |
1321 | - # |
1322 | - # The Ruby tags (<rt> and <rp>) are here despite being normal |
1323 | - # "phrasing content" tags, because the content they contain is |
1324 | - # qualitatively different from other text in the document, and it |
1325 | - # can be useful to be able to distinguish it. |
1326 | - # |
1327 | - # TODO: Arguably <noscript> could go here but it seems |
1328 | - # qualitatively different from the other tags. |
1329 | - DEFAULT_STRING_CONTAINERS = { |
1330 | + #: The HTML standard defines these tags as block-level elements. Beautiful |
1331 | + #: Soup does not treat these elements differently from other elements, |
1332 | + #: but it may do so eventually, and this information is available if |
1333 | + #: you need to use it. |
1334 | + DEFAULT_BLOCK_ELEMENTS: Set[str] = set(["address", "article", "aside", "blockquote", "canvas", "dd", "div", "dl", "dt", "fieldset", "figcaption", "figure", "footer", "form", "h1", "h2", "h3", "h4", "h5", "h6", "header", "hr", "li", "main", "nav", "noscript", "ol", "output", "p", "pre", "section", "table", "tfoot", "ul", "video"]) |
1335 | + |
1336 | + #: These HTML tags need special treatment so they can be |
1337 | + #: represented by a string class other than NavigableString. |
1338 | + #: |
1339 | + #: For some of these tags, it's because the HTML standard defines |
1340 | + #: an unusual content model for them. I made this list by going |
1341 | + #: through the HTML spec |
1342 | + #: (https://html.spec.whatwg.org/#metadata-content) and looking for |
1343 | + #: "metadata content" elements that can contain strings. |
1344 | + #: |
1345 | + #: The Ruby tags (<rt> and <rp>) are here despite being normal |
1346 | + #: "phrasing content" tags, because the content they contain is |
1347 | + #: qualitatively different from other text in the document, and it |
1348 | + #: can be useful to be able to distinguish it. |
1349 | + #: |
1350 | + #: TODO: Arguably <noscript> could go here but it seems |
1351 | + #: qualitatively different from the other tags. |
1352 | + DEFAULT_STRING_CONTAINERS: Dict[str, Type[NavigableString]] = { |
1353 | 'rt' : RubyTextString, |
1354 | 'rp' : RubyParenthesisString, |
1355 | 'style': Stylesheet, |
1356 | @@ -424,33 +489,35 @@ class HTMLTreeBuilder(TreeBuilder): |
1357 | 'template': TemplateString, |
1358 | } |
1359 | |
1360 | - # The HTML standard defines these attributes as containing a |
1361 | - # space-separated list of values, not a single value. That is, |
1362 | - # class="foo bar" means that the 'class' attribute has two values, |
1363 | - # 'foo' and 'bar', not the single value 'foo bar'. When we |
1364 | - # encounter one of these attributes, we will parse its value into |
1365 | - # a list of values if possible. Upon output, the list will be |
1366 | - # converted back into a string. |
1367 | - DEFAULT_CDATA_LIST_ATTRIBUTES = { |
1368 | - "*" : ['class', 'accesskey', 'dropzone'], |
1369 | - "a" : ['rel', 'rev'], |
1370 | - "link" : ['rel', 'rev'], |
1371 | - "td" : ["headers"], |
1372 | - "th" : ["headers"], |
1373 | - "td" : ["headers"], |
1374 | - "form" : ["accept-charset"], |
1375 | - "object" : ["archive"], |
1376 | + #: The HTML standard defines these attributes as containing a |
1377 | + #: space-separated list of values, not a single value. That is, |
1378 | + #: class="foo bar" means that the 'class' attribute has two values, |
1379 | + #: 'foo' and 'bar', not the single value 'foo bar'. When we |
1380 | + #: encounter one of these attributes, we will parse its value into |
1381 | + #: a list of values if possible. Upon output, the list will be |
1382 | + #: converted back into a string. |
1383 | + DEFAULT_CDATA_LIST_ATTRIBUTES: Dict[str, Set[str]] = { |
1384 | + "*" : {'class', 'accesskey', 'dropzone'}, |
1385 | + "a" : {'rel', 'rev'}, |
1386 | + "link" : {'rel', 'rev'}, |
1387 | + "td" : {"headers"}, |
1388 | + "th" : {"headers"}, |
1390 | + "form" : {"accept-charset"}, |
1391 | + "object" : {"archive"}, |
1392 | |
1393 | # These are HTML5 specific, as are *.accesskey and *.dropzone above. |
1394 | - "area" : ["rel"], |
1395 | - "icon" : ["sizes"], |
1396 | - "iframe" : ["sandbox"], |
1397 | - "output" : ["for"], |
1398 | + "area" : {"rel"}, |
1399 | + "icon" : {"sizes"}, |
1400 | + "iframe" : {"sandbox"}, |
1401 | + "output" : {"for"}, |
1402 | } |
1403 | |
1404 | - DEFAULT_PRESERVE_WHITESPACE_TAGS = set(['pre', 'textarea']) |
1405 | + #: By default, whitespace inside these HTML tags will be |
1406 | + #: preserved rather than being collapsed. |
1407 | + DEFAULT_PRESERVE_WHITESPACE_TAGS: Set[str] = set(['pre', 'textarea']) |
1408 | |
1409 | - def set_up_substitutions(self, tag): |
1410 | + def set_up_substitutions(self, tag:Tag) -> bool: |
1411 | """Replace the declared encoding in a <meta> tag with a placeholder, |
1412 | to be substituted when the tag is output to a string. |
1413 | |
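[Reviewer note, not part of the diff: the multi-valued ("CDATA list") attribute behavior that `DEFAULT_CDATA_LIST_ATTRIBUTES` configures is visible through the public API. A minimal sketch:]

```python
# 'class' is a space-separated ("CDATA list") attribute, so its value
# parses to a list; 'id' is single-valued and stays a plain string.
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="foo bar" id="main">hi</p>', 'html.parser')
assert soup.p['class'] == ['foo', 'bar']
assert soup.p['id'] == 'main'
# On output the list is joined back into a space-separated string.
assert 'class="foo bar"' in str(soup)
```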
1414 | @@ -458,17 +525,26 @@ class HTMLTreeBuilder(TreeBuilder): |
1415 | encoding, but exit in a different encoding, and the <meta> tag |
1416 | needs to be changed to reflect this. |
1417 | |
1418 | - :param tag: A `Tag` |
1419 | :return: Whether or not a substitution was performed. |
1420 | + |
1421 | + :meta private: |
1422 | """ |
1423 | # We are only interested in <meta> tags |
1424 | if tag.name != 'meta': |
1425 | return False |
1426 | |
1427 | - http_equiv = tag.get('http-equiv') |
1428 | - content = tag.get('content') |
1429 | - charset = tag.get('charset') |
1430 | - |
1431 | + # TODO: This cast will fail in the (very unlikely) scenario |
1432 | + # that the programmer who instantiates the TreeBuilder |
1433 | + # specifies meta['content'] or meta['charset'] as |
1434 | + # cdata_list_attributes. |
1435 | + content:Optional[str] = cast(Optional[str], tag.get('content')) |
1436 | + charset:Optional[str] = cast(Optional[str], tag.get('charset')) |
1437 | + |
1438 | + # But we can accommodate meta['http-equiv'] being made a |
1439 | + # cdata_list_attribute (again, very unlikely) without much |
1440 | + # trouble. |
1441 | + http_equiv:List[str] = tag.get_attribute_list('http-equiv') |
1442 | + |
1443 | # We are interested in <meta> tags that say what encoding the |
1444 | # document was originally in. This means HTML 5-style <meta> |
1445 | # tags that provide the "charset" attribute. It also means |
1446 | @@ -478,20 +554,22 @@ class HTMLTreeBuilder(TreeBuilder): |
1447 | # In both cases we will replace the value of the appropriate |
1448 | # attribute with a standin object that can take on any |
1449 | # encoding. |
1450 | - meta_encoding = None |
1451 | + substituted = False |
1452 | if charset is not None: |
1453 | # HTML 5 style: |
1454 | # <meta charset="utf8"> |
1455 | meta_encoding = charset |
1456 | tag['charset'] = CharsetMetaAttributeValue(charset) |
1457 | + substituted = True |
1458 | |
1459 | - elif (content is not None and http_equiv is not None |
1460 | - and http_equiv.lower() == 'content-type'): |
1461 | + elif (content is not None and |
1462 | + any(x.lower() == 'content-type' for x in http_equiv)): |
1463 | # HTML 4 style: |
1464 | # <meta http-equiv="content-type" content="text/html; charset=utf8"> |
1465 | tag['content'] = ContentMetaAttributeValue(content) |
1466 | + substituted = True |
1467 | |
1468 | - return (meta_encoding is not None) |
1469 | + return substituted |
1470 | |
1471 | class DetectsXMLParsedAsHTML(object): |
1472 | """A mixin class for any class (a TreeBuilder, or some class used by a |
1473 | @@ -502,19 +580,29 @@ class DetectsXMLParsedAsHTML(object): |
1474 | This requires being able to observe an incoming processing |
1475 | instruction that might be an XML declaration, and also able to |
1476 | observe tags as they're opened. If you can't do that for a given |
1477 | - TreeBuilder, there's a less reliable implementation based on |
1478 | + `TreeBuilder`, there's a less reliable implementation based on |
1479 | examining the raw markup. |
1480 | """ |
1481 | |
1482 | - # Regular expression for seeing if markup has an <html> tag. |
1483 | - LOOKS_LIKE_HTML = re.compile("<[^ +]html", re.I) |
1484 | - LOOKS_LIKE_HTML_B = re.compile(b"<[^ +]html", re.I) |
1485 | + #: Regular expression for seeing if string markup has an <html> tag. |
1486 | + LOOKS_LIKE_HTML:Pattern[str] = re.compile("<[^ +]html", re.I) |
1487 | |
1488 | - XML_PREFIX = '<?xml' |
1489 | - XML_PREFIX_B = b'<?xml' |
1490 | + #: Regular expression for seeing if byte markup has an <html> tag. |
1491 | + LOOKS_LIKE_HTML_B:Pattern[bytes] = re.compile(b"<[^ +]html", re.I) |
1492 | + |
1493 | + #: The start of an XML document string. |
1494 | + XML_PREFIX:str = '<?xml' |
1495 | + |
1496 | + #: The start of an XML document bytestring. |
1497 | + XML_PREFIX_B:bytes = b'<?xml' |
1498 | + |
1499 | + # This is typed as str, not `ProcessingInstruction`, because this |
1500 | + # check may be run before any Beautiful Soup objects are created. |
1501 | + _first_processing_instruction: Optional[str] |
1502 | + _root_tag: Optional[Tag] |
1503 | |
1504 | @classmethod |
1505 | - def warn_if_markup_looks_like_xml(cls, markup): |
1506 | + def warn_if_markup_looks_like_xml(cls, markup:Optional[_RawMarkup]) -> bool: |
1507 | """Perform a check on some markup to see if it looks like XML |
1508 | that's not XHTML. If so, issue a warning. |
1509 | |
1510 | @@ -524,34 +612,40 @@ class DetectsXMLParsedAsHTML(object): |
1511 | :return: True if the markup looks like non-XHTML XML, False |
1512 | otherwise. |
1513 | """ |
1514 | + if markup is None: |
1515 | + return False |
1516 | + markup = markup[:500] |
1517 | if isinstance(markup, bytes): |
1518 | - prefix = cls.XML_PREFIX_B |
1519 | - looks_like_html = cls.LOOKS_LIKE_HTML_B |
1520 | + markup_b = cast(bytes, markup) |
1521 | + looks_like_xml = ( |
1522 | + markup_b.startswith(cls.XML_PREFIX_B) |
1523 | + and not cls.LOOKS_LIKE_HTML_B.search(markup) |
1524 | + ) |
1525 | else: |
1526 | - prefix = cls.XML_PREFIX |
1527 | - looks_like_html = cls.LOOKS_LIKE_HTML |
1528 | - |
1529 | - if (markup is not None |
1530 | - and markup.startswith(prefix) |
1531 | - and not looks_like_html.search(markup[:500]) |
1532 | - ): |
1533 | + markup_s = cast(str, markup) |
1534 | + looks_like_xml = ( |
1535 | + markup_s.startswith(cls.XML_PREFIX) |
1536 | + and not cls.LOOKS_LIKE_HTML.search(markup) |
1537 | + ) |
1538 | + |
1539 | + if looks_like_xml: |
1540 | cls._warn() |
1541 | return True |
1542 | - return False |
1543 | - |
1544 | + return False |
1545 | + |
1546 | @classmethod |
1547 | - def _warn(cls): |
1548 | + def _warn(cls) -> None: |
1549 | """Issue a warning about XML being parsed as HTML.""" |
1550 | warnings.warn( |
1551 | XMLParsedAsHTMLWarning.MESSAGE, XMLParsedAsHTMLWarning |
1552 | ) |
1553 | |
1554 | - def _initialize_xml_detector(self): |
1555 | + def _initialize_xml_detector(self) -> None: |
1556 | """Call this method before parsing a document.""" |
1557 | self._first_processing_instruction = None |
1558 | self._root_tag = None |
1559 | |
1560 | - def _document_might_be_xml(self, processing_instruction): |
1561 | + def _document_might_be_xml(self, processing_instruction:str): |
1562 | """Call this method when encountering an XML declaration, or a |
1563 | "processing instruction" that might be an XML declaration. |
1564 | """ |
1565 | @@ -586,7 +680,7 @@ class DetectsXMLParsedAsHTML(object): |
1566 | self._warn() |
1567 | |
1568 | |
1569 | -def register_treebuilders_from(module): |
1570 | +def register_treebuilders_from(module:ModuleType) -> None: |
1571 | """Copy TreeBuilders from the given module into this module.""" |
1572 | this_module = sys.modules[__name__] |
1573 | for name in module.__all__: |
1574 | @@ -602,7 +696,7 @@ class ParserRejectedMarkup(Exception): |
1575 | """An Exception to be raised when the underlying parser simply |
1576 | refuses to parse the given markup. |
1577 | """ |
1578 | - def __init__(self, message_or_exception): |
1579 | + def __init__(self, message_or_exception:Union[str,Exception]): |
1580 | """Explain why the parser rejected the given markup, either |
1581 | with a textual explanation or another exception. |
1582 | """ |
1583 | diff --git a/bs4/builder/_html5lib.py b/bs4/builder/_html5lib.py |
1584 | index dac2173..560a036 100644 |
1585 | --- a/bs4/builder/_html5lib.py |
1586 | +++ b/bs4/builder/_html5lib.py |
1587 | @@ -5,6 +5,20 @@ __all__ = [ |
1588 | 'HTML5TreeBuilder', |
1589 | ] |
1590 | |
1591 | +from typing import ( |
1592 | + Iterable, |
1593 | + List, |
1594 | + Optional, |
1595 | + TYPE_CHECKING, |
1596 | + Tuple, |
1597 | + Union, |
1598 | +) |
1599 | +from bs4._typing import ( |
1600 | + _Encoding, |
1601 | + _Encodings, |
1602 | + _RawMarkup, |
1603 | +) |
1604 | + |
1605 | import warnings |
1606 | import re |
1607 | from bs4.builder import ( |
1608 | @@ -30,50 +44,54 @@ from bs4.element import ( |
1609 | Tag, |
1610 | ) |
1611 | |
1612 | -try: |
1613 | - # Pre-0.99999999 |
1614 | - from html5lib.treebuilders import _base as treebuilder_base |
1615 | - new_html5lib = False |
1616 | -except ImportError as e: |
1617 | - # 0.99999999 and up |
1618 | - from html5lib.treebuilders import base as treebuilder_base |
1619 | - new_html5lib = True |
1620 | +from html5lib.treebuilders import base as treebuilder_base |
1621 | + |
1622 | |
1623 | class HTML5TreeBuilder(HTMLTreeBuilder): |
1624 | - """Use html5lib to build a tree. |
1625 | + """Use `html5lib <https://github.com/html5lib/html5lib-python>`_ to |
1626 | + build a tree. |
1627 | |
1628 | - Note that this TreeBuilder does not support some features common |
1629 | - to HTML TreeBuilders. Some of these features could theoretically |
1630 | + Note that `HTML5TreeBuilder` does not support some common HTML |
1631 | + `TreeBuilder` features. Some of these features could theoretically |
1632 | be implemented, but at the very least it's quite difficult, |
1633 | because html5lib moves the parse tree around as it's being built. |
1634 | |
1635 | - * This TreeBuilder doesn't use different subclasses of NavigableString |
1636 | - based on the name of the tag in which the string was found. |
1637 | + Specifically: |
1638 | |
1639 | - * You can't use a SoupStrainer to parse only part of a document. |
1640 | + * This `TreeBuilder` doesn't use different subclasses of |
1641 | + `NavigableString` (e.g. `Script`) based on the name of the tag |
1642 | + in which the string was found. |
1643 | + * You can't use a `SoupStrainer` to parse only part of a document. |
1644 | """ |
1645 | |
1646 | - NAME = "html5lib" |
1647 | + NAME:str = "html5lib" |
1648 | |
1649 | - features = [NAME, PERMISSIVE, HTML_5, HTML] |
1650 | + features:Iterable[str] = [NAME, PERMISSIVE, HTML_5, HTML] |
1651 | |
1652 | - # html5lib can tell us which line number and position in the |
1653 | - # original file is the source of an element. |
1654 | - TRACKS_LINE_NUMBERS = True |
1655 | + #: html5lib can tell us which line number and position in the |
1656 | + #: original file is the source of an element. |
1657 | + TRACKS_LINE_NUMBERS:bool = True |
1658 | |
1659 | - def prepare_markup(self, markup, user_specified_encoding, |
1660 | - document_declared_encoding=None, exclude_encodings=None): |
1661 | + def prepare_markup(self, markup:_RawMarkup, |
1662 | + user_specified_encoding:Optional[_Encoding]=None, |
1663 | + document_declared_encoding:Optional[_Encoding]=None, |
1664 | + exclude_encodings:Optional[_Encodings]=None |
1665 | + ) -> Iterable[Tuple[_RawMarkup, Optional[_Encoding], Optional[_Encoding], bool]]: |
1666 | # Store the user-specified encoding for use later on. |
1667 | self.user_specified_encoding = user_specified_encoding |
1668 | |
1669 | # document_declared_encoding and exclude_encodings aren't used |
1670 | # ATM because the html5lib TreeBuilder doesn't use |
1671 | # UnicodeDammit. |
1672 | - if exclude_encodings: |
1673 | - warnings.warn( |
1674 | - "You provided a value for exclude_encoding, but the html5lib tree builder doesn't support exclude_encoding.", |
1675 | - stacklevel=3 |
1676 | - ) |
1677 | + for variable, name in ( |
1678 | + (document_declared_encoding, 'document_declared_encoding'), |
1679 | + (exclude_encodings, 'exclude_encodings'), |
1680 | + ): |
1681 | + if variable: |
1682 | + warnings.warn( |
1683 | + f"You provided a value for {name}, but the html5lib tree builder doesn't support {name}.", |
1684 | + stacklevel=3 |
1685 | + ) |
1686 | |
1687 | # html5lib only parses HTML, so if it's given XML that's worth |
1688 | # noting. |
1689 | @@ -83,6 +101,9 @@ class HTML5TreeBuilder(HTMLTreeBuilder): |
1690 | |
1691 | # These methods are defined by Beautiful Soup. |
1692 | def feed(self, markup): |
1693 | + """Run some incoming markup through some parsing process, |
1694 | + populating the `BeautifulSoup` object in `HTML5TreeBuilder.soup`. |
1695 | + """ |
1696 | if self.soup.parse_only is not None: |
1697 | warnings.warn( |
1698 | "You provided a value for parse_only, but the html5lib tree builder doesn't support parse_only. The entire document will be parsed.", |
1699 | @@ -92,10 +113,7 @@ class HTML5TreeBuilder(HTMLTreeBuilder): |
1700 | self.underlying_builder.parser = parser |
1701 | extra_kwargs = dict() |
1702 | if not isinstance(markup, str): |
1703 | - if new_html5lib: |
1704 | - extra_kwargs['override_encoding'] = self.user_specified_encoding |
1705 | - else: |
1706 | - extra_kwargs['encoding'] = self.user_specified_encoding |
1707 | + extra_kwargs['override_encoding'] = self.user_specified_encoding |
1708 | doc = parser.parse(markup, **extra_kwargs) |
1709 | |
1710 | # Set the character encoding detected by the tokenizer. |
1711 | @@ -105,15 +123,18 @@ class HTML5TreeBuilder(HTMLTreeBuilder): |
1712 | doc.original_encoding = None |
1713 | else: |
1714 | original_encoding = parser.tokenizer.stream.charEncoding[0] |
1715 | - if not isinstance(original_encoding, str): |
1716 | - # In 0.99999999 and up, the encoding is an html5lib |
1717 | - # Encoding object. We want to use a string for compatibility |
1718 | - # with other tree builders. |
1719 | - original_encoding = original_encoding.name |
1720 | + # The encoding is an html5lib Encoding object. We want to |
1721 | + # use a string for compatibility with other tree builders. |
1722 | + original_encoding = original_encoding.name |
1723 | doc.original_encoding = original_encoding |
1724 | self.underlying_builder.parser = None |
1725 | - |
1726 | + |
1727 | def create_treebuilder(self, namespaceHTMLElements): |
1728 | + """Called by html5lib to instantiate the kind of class it |
1729 | + calls a 'TreeBuilder'. |
1730 | + |
1731 | + :meta private: |
1732 | + """ |
1733 | self.underlying_builder = TreeBuilderForHtml5lib( |
1734 | namespaceHTMLElements, self.soup, |
1735 | store_line_numbers=self.store_line_numbers |
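[Reviewer note, not part of the diff: the html5lib builder above warns that `parse_only` is unsupported. For contrast, a sketch of the same `SoupStrainer` doing real filtering with the stdlib-based 'html.parser' builder:]

```python
from bs4 import BeautifulSoup, SoupStrainer

markup = "<html><body><p>one</p><div>skip</div><p>two</p></body></html>"
soup = BeautifulSoup(markup, "html.parser", parse_only=SoupStrainer("p"))
# Only the <p> tags (and their contents) survive parsing.
assert [t.name for t in soup.find_all(True)] == ["p", "p"]
```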
1736 | diff --git a/bs4/builder/_htmlparser.py b/bs4/builder/_htmlparser.py |
1737 | index 3cc187f..291f6c6 100644 |
1738 | --- a/bs4/builder/_htmlparser.py |
1739 | +++ b/bs4/builder/_htmlparser.py |
1740 | @@ -1,4 +1,5 @@ |
1741 | # encoding: utf-8 |
1742 | """Use the HTMLParser library to parse HTML files that aren't too bad.""" |
1743 | +from __future__ import annotations |
1744 | |
1745 | # Use of this source code is governed by the MIT license. |
1746 | @@ -11,6 +12,19 @@ __all__ = [ |
1747 | from html.parser import HTMLParser |
1748 | |
1749 | import sys |
1750 | +from typing import ( |
1751 | + Any, |
1752 | + Callable, |
1753 | + cast, |
1754 | + Dict, |
1755 | + Iterable, |
1756 | + List, |
1757 | + Optional, |
1758 | + TYPE_CHECKING, |
1759 | + Tuple, |
1760 | + Type, |
1761 | + Union, |
1762 | +) |
1763 | import warnings |
1764 | |
1765 | from bs4.element import ( |
1766 | @@ -30,21 +44,25 @@ from bs4.builder import ( |
1767 | STRICT, |
1768 | ) |
1769 | |
1770 | - |
1771 | +from bs4.element import Tag |
1772 | +if TYPE_CHECKING: |
1773 | + from bs4 import BeautifulSoup |
1774 | + from bs4.element import NavigableString |
1775 | + from bs4._typing import ( |
1776 | + _AttributeValues, |
1777 | + _Encoding, |
1778 | + _Encodings, |
1779 | + _RawMarkup, |
1780 | + ) |
1781 | + |
1782 | HTMLPARSER = 'html.parser' |
1783 | |
1784 | +_DuplicateAttributeHandler = Callable[[Dict[str, str], str, str], None] |
1785 | + |
1786 | class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML): |
1787 | """A subclass of the Python standard library's HTMLParser class, which |
1788 | listens for HTMLParser events and translates them into calls |
1789 | to Beautiful Soup's tree construction API. |
1790 | - """ |
1791 | - |
1792 | - # Strategies for handling duplicate attributes |
1793 | - IGNORE = 'ignore' |
1794 | - REPLACE = 'replace' |
1795 | - |
1796 | - def __init__(self, *args, **kwargs): |
1797 | - """Constructor. |
1798 | |
1799 | :param on_duplicate_attribute: A strategy for what to do if a |
1800 | tag includes the same attribute more than once. Accepted |
1801 | @@ -53,8 +71,10 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML): |
1802 | encountered), or a callable. A callable must take three |
1803 | arguments: the dictionary of attributes already processed, |
1804 | the name of the duplicate attribute, and the most recent value |
1805 | - encountered. |
1806 | - """ |
1807 | + encountered. |
1808 | + """ |
1809 | + def __init__(self, soup:BeautifulSoup, *args, **kwargs): |
1810 | + self.soup = soup |
1811 | self.on_duplicate_attribute = kwargs.pop( |
1812 | 'on_duplicate_attribute', self.REPLACE |
1813 | ) |
1814 | @@ -70,8 +90,20 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML): |
1815 | self.already_closed_empty_element = [] |
1816 | |
1817 | self._initialize_xml_detector() |
1818 | + |
1819 | + #: Constant to handle duplicate attributes by ignoring later values |
1820 | + #: and keeping the earlier ones. |
1821 | + IGNORE:str = 'ignore' |
1822 | + |
1823 | + #: Constant to handle duplicate attributes by replacing earlier values |
1824 | + #: with later ones. |
1825 | + REPLACE:str = 'replace' |
1826 | |
1827 | - def error(self, message): |
1828 | + on_duplicate_attribute:Union[str, _DuplicateAttributeHandler] |
1829 | + already_closed_empty_element: List[str] |
1830 | + soup: BeautifulSoup |
1831 | + |
1832 | + def error(self, message:str) -> None: |
1833 | # NOTE: This method is required so long as Python 3.9 is |
1834 | # supported. The corresponding code is removed from HTMLParser |
1835 | # in 3.5, but not removed from ParserBase until 3.10. |
1836 | @@ -87,32 +119,33 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML): |
1837 | # catch this error and wrap it in a ParserRejectedMarkup.) |
1838 | raise ParserRejectedMarkup(message) |
1839 | |
1840 | - def handle_startendtag(self, name, attrs): |
1841 | + def handle_startendtag( |
1842 | + self, name:str, attrs:List[Tuple[str, Optional[str]]] |
1843 | + ) -> None: |
1844 | """Handle an incoming empty-element tag. |
1845 | |
1846 | - This is only called when the markup looks like <tag/>. |
1847 | - |
1848 | - :param name: Name of the tag. |
1849 | - :param attrs: Dictionary of the tag's attributes. |
1850 | + html.parser only calls this method when the markup looks like |
1851 | + <tag/>. |
1852 | """ |
1853 | - # is_startend() tells handle_starttag not to close the tag |
1854 | + # `handle_empty_element` tells handle_starttag not to close the tag |
1855 | # just because its name matches a known empty-element tag. We |
1856 | - # know that this is an empty-element tag and we want to call |
1857 | + # know that this is an empty-element tag, and we want to call |
1858 | # handle_endtag ourselves. |
1859 | - tag = self.handle_starttag(name, attrs, handle_empty_element=False) |
1860 | + self.handle_starttag(name, attrs, handle_empty_element=False) |
1861 | self.handle_endtag(name) |
1862 | |
1863 | - def handle_starttag(self, name, attrs, handle_empty_element=True): |
1864 | + def handle_starttag( |
1865 | + self, name:str, attrs:List[Tuple[str, Optional[str]]], |
1866 | + handle_empty_element:bool=True |
1867 | + ) -> None: |
1868 | """Handle an opening tag, e.g. '<tag>' |
1869 | |
1870 | - :param name: Name of the tag. |
1871 | - :param attrs: Dictionary of the tag's attributes. |
1872 | :param handle_empty_element: True if this tag is known to be |
1873 | an empty-element tag (i.e. there is not expected to be any |
1874 | closing tag). |
1875 | """ |
1876 | - # XXX namespace |
1877 | - attr_dict = {} |
1878 | + # TODO: handle namespaces here? |
1879 | + attr_dict: Dict[str, str] = {} |
1880 | for key, value in attrs: |
1881 | # Change None attribute values to the empty string |
1882 | # for consistency with the other tree builders. |
1883 | @@ -128,6 +161,7 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML): |
1884 | elif on_dupe in (None, self.REPLACE): |
1885 | attr_dict[key] = value |
1886 | else: |
1887 | + on_dupe = cast(_DuplicateAttributeHandler, on_dupe) |
1888 | on_dupe(attr_dict, key, value) |
1889 | else: |
1890 | attr_dict[key] = value |
1891 | @@ -157,7 +191,7 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML): |
1892 | if self._root_tag is None: |
1893 | self._root_tag_encountered(name) |
1894 | |
1895 | - def handle_endtag(self, name, check_already_closed=True): |
1896 | + def handle_endtag(self, name:str, check_already_closed:bool=True) -> None: |
1897 | """Handle a closing tag, e.g. '</tag>' |
1898 | |
1899 | :param name: A tag name. |
1900 | @@ -175,11 +209,11 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML): |
1901 | else: |
1902 | self.soup.handle_endtag(name) |
1903 | |
1904 | - def handle_data(self, data): |
1905 | + def handle_data(self, data:str) -> None: |
1906 | """Handle some textual data that shows up between tags.""" |
1907 | self.soup.handle_data(data) |
1908 | |
1909 | - def handle_charref(self, name): |
1910 | + def handle_charref(self, name:str) -> None: |
1911 | """Handle a numeric character reference by converting it to the |
1912 | corresponding Unicode character and treating it as textual |
1913 | data. |
1914 | @@ -219,7 +253,7 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML): |
1915 | data = data or "\N{REPLACEMENT CHARACTER}" |
1916 | self.handle_data(data) |
1917 | |
1918 | - def handle_entityref(self, name): |
1919 | + def handle_entityref(self, name:str) -> None: |
1920 | """Handle a named entity reference by converting it to the |
1921 | corresponding Unicode character(s) and treating it as textual |
1922 | data. |
1923 | @@ -238,7 +272,7 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML): |
1924 | data = "&%s" % name |
1925 | self.handle_data(data) |
1926 | |
1927 | - def handle_comment(self, data): |
1928 | + def handle_comment(self, data:str) -> None: |
1929 | """Handle an HTML comment. |
1930 | |
1931 | :param data: The text of the comment. |
1932 | @@ -247,7 +281,7 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML): |
1933 | self.soup.handle_data(data) |
1934 | self.soup.endData(Comment) |
1935 | |
1936 | - def handle_decl(self, data): |
1937 | + def handle_decl(self, data:str) -> None: |
1938 | """Handle a DOCTYPE declaration. |
1939 | |
1940 | :param data: The text of the declaration. |
1941 | @@ -257,11 +291,12 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML): |
1942 | self.soup.handle_data(data) |
1943 | self.soup.endData(Doctype) |
1944 | |
1945 | - def unknown_decl(self, data): |
1946 | + def unknown_decl(self, data:str) -> None: |
1947 | """Handle a declaration of unknown type -- probably a CDATA block. |
1948 | |
1949 | :param data: The text of the declaration. |
1950 | """ |
1951 | + cls: Type[NavigableString] |
1952 | if data.upper().startswith('CDATA['): |
1953 | cls = CData |
1954 | data = data[len('CDATA['):] |
1955 | @@ -271,7 +306,7 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML): |
1956 | self.soup.handle_data(data) |
1957 | self.soup.endData(cls) |
1958 | |
1959 | - def handle_pi(self, data): |
1960 | + def handle_pi(self, data:str) -> None: |
1961 | """Handle a processing instruction. |
1962 | |
1963 | :param data: The text of the instruction. |
1964 | @@ -286,16 +321,17 @@ class HTMLParserTreeBuilder(HTMLTreeBuilder): |
1965 | """A Beautiful soup `TreeBuilder` that uses the `HTMLParser` parser, |
1966 | found in the Python standard library. |
1967 | """ |
1968 | - is_xml = False |
1969 | - picklable = True |
1970 | - NAME = HTMLPARSER |
1971 | - features = [NAME, HTML, STRICT] |
1972 | + is_xml:bool = False |
1973 | + picklable:bool = True |
1974 | + NAME:str = HTMLPARSER |
1975 | + features: Iterable[str] = [NAME, HTML, STRICT] |
1976 | |
1977 | - # The html.parser knows which line number and position in the |
1978 | - # original file is the source of an element. |
1979 | - TRACKS_LINE_NUMBERS = True |
1980 | + #: The html.parser knows which line number and position in the |
1981 | + #: original file is the source of an element. |
1982 | + TRACKS_LINE_NUMBERS:bool = True |
1983 | |
1984 | - def __init__(self, parser_args=None, parser_kwargs=None, **kwargs): |
1985 | + def __init__(self, parser_args:Optional[Iterable[Any]]=None, |
1986 | + parser_kwargs:Optional[Dict[str, Any]]=None, **kwargs:Any): |
1987 | """Constructor. |
1988 | |
1989 | :param parser_args: Positional arguments to pass into |
1990 | @@ -320,9 +356,12 @@ class HTMLParserTreeBuilder(HTMLTreeBuilder): |
1991 | parser_kwargs['convert_charrefs'] = False |
1992 | self.parser_args = (parser_args, parser_kwargs) |
1993 | |
1994 | - def prepare_markup(self, markup, user_specified_encoding=None, |
1995 | - document_declared_encoding=None, exclude_encodings=None): |
1996 | - |
1997 | + def prepare_markup( |
1998 | + self, markup:_RawMarkup, |
1999 | + user_specified_encoding:Optional[_Encoding]=None, |
2000 | + document_declared_encoding:Optional[_Encoding]=None, |
2001 | + exclude_encodings:Optional[_Encodings]=None |
2002 | + ) -> Iterable[Tuple[str, Optional[_Encoding], Optional[_Encoding], bool]]: |
2003 | """Run any preliminary steps necessary to make incoming markup |
2004 | acceptable to the parser. |
2005 | |
2006 | @@ -333,13 +372,12 @@ class HTMLParserTreeBuilder(HTMLTreeBuilder): |
2007 | :param exclude_encodings: The user asked _not_ to try any of |
2008 | these encodings. |
2009 | |
2010 | - :yield: A series of 4-tuples: |
2011 | - (markup, encoding, declared encoding, |
2012 | - has undergone character replacement) |
2013 | + :yield: A series of 4-tuples: (markup, encoding, declared encoding, |
2014 | + has undergone character replacement) |
2015 | |
2016 | - Each 4-tuple represents a strategy for converting the |
2017 | - document to Unicode and parsing it. Each strategy will be tried |
2018 | - in turn. |
2019 | + Each 4-tuple represents a strategy for parsing the document. |
2020 | + This TreeBuilder uses Unicode, Dammit to convert the markup |
2021 | + into Unicode, so the `markup` element will always be a string. |
2022 | """ |
2023 | if isinstance(markup, str): |
2024 | # Parse Unicode as-is. |
2025 | @@ -348,14 +386,19 @@ class HTMLParserTreeBuilder(HTMLTreeBuilder): |
2026 | |
2027 | # Ask UnicodeDammit to sniff the most likely encoding. |
2028 | |
2029 | - # This was provided by the end-user; treat it as a known |
2030 | - # definite encoding per the algorithm laid out in the HTML5 |
2031 | - # spec. (See the EncodingDetector class for details.) |
2032 | - known_definite_encodings = [user_specified_encoding] |
2033 | + known_definite_encodings: List[_Encoding] = [] |
2034 | + if user_specified_encoding: |
2035 | + # This was provided by the end-user; treat it as a known |
2036 | + # definite encoding per the algorithm laid out in the |
2037 | + # HTML5 spec. (See the EncodingDetector class for |
2038 | + # details.) |
2039 | + known_definite_encodings.append(user_specified_encoding) |
2040 | |
2041 | - # This was found in the document; treat it as a slightly lower-priority |
2042 | - # user encoding. |
2043 | - user_encodings = [document_declared_encoding] |
2044 | + user_encodings: List[_Encoding] = [] |
2045 | + if document_declared_encoding: |
2046 | + # This was found in the document; treat it as a slightly |
2047 | + # lower-priority user encoding. |
2048 | + user_encodings.append(document_declared_encoding) |
2049 | |
2050 | try_encodings = [user_specified_encoding, document_declared_encoding] |
2051 | dammit = UnicodeDammit( |
2052 | @@ -365,17 +408,27 @@ class HTMLParserTreeBuilder(HTMLTreeBuilder): |
2053 | is_html=True, |
2054 | exclude_encodings=exclude_encodings |
2055 | ) |
2056 | - yield (dammit.markup, dammit.original_encoding, |
2057 | - dammit.declared_html_encoding, |
2058 | - dammit.contains_replacement_characters) |
2059 | |
2060 | - def feed(self, markup): |
2061 | - """Run some incoming markup through some parsing process, |
2062 | - populating the `BeautifulSoup` object in self.soup. |
2063 | - """ |
2064 | + if dammit.unicode_markup is None: |
2065 | + # In every case I've seen, Unicode, Dammit is able to |
2066 | + # convert the markup into Unicode, even if it needs to use |
2067 | + # REPLACEMENT CHARACTER. But there is a code path that |
2068 | + # could result in unicode_markup being None, and |
2069 | + # HTMLParser can only parse Unicode, so here we handle |
2070 | + # that code path. |
2071 | + raise ParserRejectedMarkup("Could not convert input to Unicode, and html.parser will not accept bytestrings.") |
2072 | + else: |
2073 | + yield (dammit.unicode_markup, dammit.original_encoding, |
2074 | + dammit.declared_html_encoding, |
2075 | + dammit.contains_replacement_characters) |
2076 | + |
2077 | + def feed(self, markup:str): |
2078 | args, kwargs = self.parser_args |
2079 | - parser = BeautifulSoupHTMLParser(*args, **kwargs) |
2080 | - parser.soup = self.soup |
2081 | + # We know BeautifulSoup calls TreeBuilder.initialize_soup |
2082 | + # before calling feed(), so we can assume self.soup |
2083 | + # is set. |
2084 | + assert self.soup is not None |
2085 | + parser = BeautifulSoupHTMLParser(self.soup, *args, **kwargs) |
2086 | try: |
2087 | parser.feed(markup) |
2088 | parser.close() |
2089 | @@ -385,3 +438,4 @@ class HTMLParserTreeBuilder(HTMLTreeBuilder): |
2090 | # when there's an error in the doctype declaration. |
2091 | raise ParserRejectedMarkup(e) |
2092 | parser.already_closed_empty_element = [] |
2093 | + |
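A note on the encoding-list change above: the old code passed `[user_specified_encoding]` to the detector even when that value was `None`; the patch builds both priority lists conditionally instead. A minimal stdlib-only sketch of that priority logic (the helper name is hypothetical, not part of bs4's API):

```python
from typing import List, Optional

def encoding_candidates(
    user_specified_encoding: Optional[str],
    document_declared_encoding: Optional[str],
) -> List[str]:
    """Build candidate encodings in priority order, skipping None values
    rather than passing a [None] list through to the detector."""
    known_definite: List[str] = []
    if user_specified_encoding:
        # Provided by the end-user; a "known definite" encoding per the
        # algorithm in the HTML5 spec.
        known_definite.append(user_specified_encoding)
    user: List[str] = []
    if document_declared_encoding:
        # Declared inside the document; slightly lower priority.
        user.append(document_declared_encoding)
    return known_definite + user

print(encoding_candidates("utf-8", "iso-8859-1"))  # ['utf-8', 'iso-8859-1']
print(encoding_candidates(None, None))             # []
```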
2094 | diff --git a/bs4/builder/_lxml.py b/bs4/builder/_lxml.py |
2095 | index 971c81e..44a477f 100644 |
2096 | --- a/bs4/builder/_lxml.py |
2097 | +++ b/bs4/builder/_lxml.py |
2098 | @@ -1,3 +1,6 @@ |
2099 | +# encoding: utf-8 |
2100 | +from __future__ import annotations |
2101 | + |
2102 | # Use of this source code is governed by the MIT license. |
2103 | __license__ = "MIT" |
2104 | |
2105 | @@ -6,14 +9,26 @@ __all__ = [ |
2106 | 'LXMLTreeBuilder', |
2107 | ] |
2108 | |
2109 | -try: |
2110 | - from collections.abc import Callable # Python 3.6 |
2111 | -except ImportError as e: |
2112 | - from collections import Callable |
2113 | +from collections.abc import Callable |
2114 | + |
2115 | +from typing import ( |
2116 | + Any, |
2117 | + Dict, |
2118 | + IO, |
2119 | + Iterable, |
2120 | + List, |
2121 | + Optional, |
2122 | + Set, |
2123 | + Tuple, |
2124 | + Type, |
2125 | + TYPE_CHECKING, |
2126 | + Union, |
2127 | +) |
2128 | |
2129 | from io import BytesIO |
2130 | from io import StringIO |
2131 | from lxml import etree |
2132 | +from bs4.dammit import (_Encoding) |
2133 | from bs4.element import ( |
2134 | Comment, |
2135 | Doctype, |
2136 | @@ -31,33 +46,54 @@ from bs4.builder import ( |
2137 | TreeBuilder, |
2138 | XML) |
2139 | from bs4.dammit import EncodingDetector |
2140 | - |
2141 | -LXML = 'lxml' |
2142 | +if TYPE_CHECKING: |
2143 | + from bs4._typing import ( |
2144 | + _Encoding, |
2145 | + _Encodings, |
2146 | + _NamespacePrefix, |
2147 | + _NamespaceURL, |
2148 | + _NamespaceMapping, |
2149 | + _InvertedNamespaceMapping, |
2150 | + _RawMarkup, |
2151 | + ) |
2152 | + from bs4 import BeautifulSoup |
2153 | + |
2154 | +LXML:str = 'lxml' |
2155 | |
2156 | def _invert(d): |
2157 | "Invert a dictionary." |
2158 | return dict((v,k) for k, v in list(d.items())) |
2159 | |
2160 | class LXMLTreeBuilderForXML(TreeBuilder): |
2161 | - DEFAULT_PARSER_CLASS = etree.XMLParser |
2162 | - |
2163 | - is_xml = True |
2164 | - processing_instruction_class = XMLProcessingInstruction |
2165 | |
2166 | - NAME = "lxml-xml" |
2167 | - ALTERNATE_NAMES = ["xml"] |
2168 | + DEFAULT_PARSER_CLASS:Type[Any] = etree.XMLParser |
2169 | + |
2170 | + is_xml:bool = True |
2171 | + |
2172 | + processing_instruction_class:Type[ProcessingInstruction] |
2173 | + |
2174 | + NAME:str = "lxml-xml" |
2175 | + ALTERNATE_NAMES: Iterable[str] = ["xml"] |
2176 | |
2177 | # Well, it's permissive by XML parser standards. |
2178 | - features = [NAME, LXML, XML, FAST, PERMISSIVE] |
2179 | + features: Iterable[str] = [NAME, LXML, XML, FAST, PERMISSIVE] |
2180 | |
2181 | - CHUNK_SIZE = 512 |
2182 | + CHUNK_SIZE:int = 512 |
2183 | |
2184 | # This namespace mapping is specified in the XML Namespace |
2185 | # standard. |
2186 | - DEFAULT_NSMAPS = dict(xml='http://www.w3.org/XML/1998/namespace') |
2187 | + DEFAULT_NSMAPS: _NamespaceMapping = dict( |
2188 | + xml='http://www.w3.org/XML/1998/namespace' |
2189 | + ) |
2190 | |
2191 | - DEFAULT_NSMAPS_INVERTED = _invert(DEFAULT_NSMAPS) |
2192 | + DEFAULT_NSMAPS_INVERTED:_InvertedNamespaceMapping = _invert( |
2193 | + DEFAULT_NSMAPS |
2194 | + ) |
2195 | |
2196 | + nsmaps: List[Optional[_InvertedNamespaceMapping]] |
2197 | + empty_element_tags: Set[str] |
2198 | + parser: Any |
2199 | + |
2200 | # NOTE: If we parsed Element objects and looked at .sourceline, |
2201 | # we'd be able to see the line numbers from the original document. |
2202 | # But instead we build an XMLParser or HTMLParser object to serve |
2203 | @@ -65,16 +101,18 @@ class LXMLTreeBuilderForXML(TreeBuilder): |
2204 | # line numbers. |
2205 | # See: https://bugs.launchpad.net/lxml/+bug/1846906 |
2206 | |
2207 | - def initialize_soup(self, soup): |
2208 | + def initialize_soup(self, soup:BeautifulSoup) -> None: |
2209 | """Let the BeautifulSoup object know about the standard namespace |
2210 | mapping. |
2211 | |
2212 | :param soup: A `BeautifulSoup`. |
2213 | """ |
2214 | + # Beyond this point, self.soup is set, so we can assume (and |
2215 | + # assert) it's not None whenever necessary. |
2216 | super(LXMLTreeBuilderForXML, self).initialize_soup(soup) |
2217 | self._register_namespaces(self.DEFAULT_NSMAPS) |
2218 | |
2219 | - def _register_namespaces(self, mapping): |
2220 | + def _register_namespaces(self, mapping:Dict[str, str]) -> None: |
2221 | """Let the BeautifulSoup object know about namespaces encountered |
2222 | while parsing the document. |
2223 | |
2224 | @@ -87,6 +125,7 @@ class LXMLTreeBuilderForXML(TreeBuilder): |
2225 | |
2226 | :param mapping: A dictionary mapping namespace prefixes to URIs. |
2227 | """ |
2228 | + assert self.soup is not None |
2229 | for key, value in list(mapping.items()): |
2230 | # This is 'if key' and not 'if key is not None' because we |
2231 | # don't track un-prefixed namespaces. Soupselect will |
2232 | @@ -98,19 +137,18 @@ class LXMLTreeBuilderForXML(TreeBuilder): |
2233 | # prefix, the first one in the document takes precedence. |
2234 | self.soup._namespaces[key] = value |
2235 | |
2236 | - def default_parser(self, encoding): |
2237 | + def default_parser(self, encoding:Optional[_Encoding]) -> Type: |
2238 | """Find the default parser for the given encoding. |
2239 | |
2240 | - :param encoding: A string. |
2241 | :return: Either a parser object or a class, which |
2242 | will be instantiated with default arguments. |
2243 | """ |
2244 | if self._default_parser is not None: |
2245 | return self._default_parser |
2246 | - return etree.XMLParser( |
2247 | + return self.DEFAULT_PARSER_CLASS( |
2248 | target=self, strip_cdata=False, recover=True, encoding=encoding) |
2249 | |
2250 | - def parser_for(self, encoding): |
2251 | + def parser_for(self, encoding: Optional[_Encoding]) -> Any: |
2252 | """Instantiate an appropriate parser for the given encoding. |
2253 | |
2254 | :param encoding: A string. |
2255 | @@ -119,36 +157,39 @@ class LXMLTreeBuilderForXML(TreeBuilder): |
2256 | # Use the default parser. |
2257 | parser = self.default_parser(encoding) |
2258 | |
2259 | - if isinstance(parser, Callable): |
2260 | + if callable(parser): |
2261 | # Instantiate the parser with default arguments |
2262 | parser = parser( |
2263 | target=self, strip_cdata=False, recover=True, encoding=encoding |
2264 | ) |
2265 | return parser |
2266 | |
2267 | - def __init__(self, parser=None, empty_element_tags=None, **kwargs): |
2268 | + def __init__(self, parser:Optional[Any]=None, |
2269 | + empty_element_tags:Optional[Set[str]]=None, **kwargs): |
2270 | # TODO: Issue a warning if parser is present but not a |
2271 | # callable, since that means there's no way to create new |
2272 | # parsers for different encodings. |
2273 | self._default_parser = parser |
2274 | - if empty_element_tags is not None: |
2275 | - self.empty_element_tags = set(empty_element_tags) |
2276 | self.soup = None |
2277 | self.nsmaps = [self.DEFAULT_NSMAPS_INVERTED] |
2278 | self.active_namespace_prefixes = [dict(self.DEFAULT_NSMAPS)] |
2279 | super(LXMLTreeBuilderForXML, self).__init__(**kwargs) |
2280 | |
2281 | - def _getNsTag(self, tag): |
2282 | + def _getNsTag(self, tag:str) -> Tuple[Optional[str], str]: |
2283 | # Split the namespace URL out of a fully-qualified lxml tag |
2284 | # name. Copied from lxml's src/lxml/sax.py. |
2285 | if tag[0] == '{': |
2286 | - return tuple(tag[1:].split('}', 1)) |
2287 | + namespace, name = tag[1:].split('}', 1) |
2288 | + return (namespace, name) |
2289 | else: |
2290 | return (None, tag) |
2291 | |
2292 | - def prepare_markup(self, markup, user_specified_encoding=None, |
2293 | - exclude_encodings=None, |
2294 | - document_declared_encoding=None): |
2295 | + def prepare_markup( |
2296 | + self, markup:_RawMarkup, |
2297 | + user_specified_encoding:Optional[_Encoding]=None, |
2298 | + document_declared_encoding:Optional[_Encoding]=None, |
2299 | + exclude_encodings:Optional[_Encodings]=None, |
2300 | + ) -> Iterable[Tuple[Union[str,bytes], Optional[_Encoding], Optional[_Encoding], bool]]: |
2301 | """Run any preliminary steps necessary to make incoming markup |
2302 | acceptable to the parser. |
2303 | |
2304 | @@ -166,13 +207,12 @@ class LXMLTreeBuilderForXML(TreeBuilder): |
2305 | :param exclude_encodings: The user asked _not_ to try any of |
2306 | these encodings. |
2307 | |
2308 | - :yield: A series of 4-tuples: |
2309 | - (markup, encoding, declared encoding, |
2310 | - has undergone character replacement) |
2311 | + :yield: A series of 4-tuples: (markup, encoding, declared encoding, |
2312 | + has undergone character replacement) |
2313 | |
2314 | - Each 4-tuple represents a strategy for converting the |
2315 | - document to Unicode and parsing it. Each strategy will be tried |
2316 | - in turn. |
2317 | + Each 4-tuple represents a strategy for converting the |
2318 | + document to Unicode and parsing it. Each strategy will be tried |
2319 | + in turn. |
2320 | """ |
2321 | is_html = not self.is_xml |
2322 | if is_html: |
2323 | @@ -200,14 +240,25 @@ class LXMLTreeBuilderForXML(TreeBuilder): |
2324 | yield (markup.encode("utf8"), "utf8", |
2325 | document_declared_encoding, False) |
2326 | |
2327 | - # This was provided by the end-user; treat it as a known |
2328 | - # definite encoding per the algorithm laid out in the HTML5 |
2329 | - # spec. (See the EncodingDetector class for details.) |
2330 | - known_definite_encodings = [user_specified_encoding] |
2331 | + # Since the document was Unicode in the first place, there |
2332 | + # is no need to try any more strategies; we know this will |
2333 | + # work. |
2334 | + return |
2335 | + |
2336 | + known_definite_encodings: List[_Encoding] = [] |
2337 | + if user_specified_encoding: |
2338 | + # This was provided by the end-user; treat it as a known |
2339 | + # definite encoding per the algorithm laid out in the |
2340 | + # HTML5 spec. (See the EncodingDetector class for |
2341 | + # details.) |
2342 | + known_definite_encodings.append(user_specified_encoding) |
2343 | + |
2344 | + user_encodings: List[_Encoding] = [] |
2345 | + if document_declared_encoding: |
2346 | + # This was found in the document; treat it as a slightly |
2347 | + # lower-priority user encoding. |
2348 | + user_encodings.append(document_declared_encoding) |
2349 | |
2350 | - # This was found in the document; treat it as a slightly lower-priority |
2351 | - # user encoding. |
2352 | - user_encodings = [document_declared_encoding] |
2353 | detector = EncodingDetector( |
2354 | markup, known_definite_encodings=known_definite_encodings, |
2355 | user_encodings=user_encodings, is_html=is_html, |
2356 | @@ -216,34 +267,45 @@ class LXMLTreeBuilderForXML(TreeBuilder): |
2357 | for encoding in detector.encodings: |
2358 | yield (detector.markup, encoding, document_declared_encoding, False) |
2359 | |
2360 | - def feed(self, markup): |
2361 | + def feed(self, markup:Union[bytes,str]) -> None: |
2362 | + io: IO |
2363 | if isinstance(markup, bytes): |
2364 | - markup = BytesIO(markup) |
2365 | + io = BytesIO(markup) |
2366 | elif isinstance(markup, str): |
2367 | - markup = StringIO(markup) |
2368 | + io = StringIO(markup) |
2369 | |
2370 | + # initialize_soup is called before feed, so we know this |
2371 | + # is not None. |
2372 | + assert self.soup is not None |
2373 | + |
2374 | # Call feed() at least once, even if the markup is empty, |
2375 | # or the parser won't be initialized. |
2376 | - data = markup.read(self.CHUNK_SIZE) |
2377 | + data = io.read(self.CHUNK_SIZE) |
2378 | try: |
2379 | self.parser = self.parser_for(self.soup.original_encoding) |
2380 | self.parser.feed(data) |
2381 | while len(data) != 0: |
2382 | # Now call feed() on the rest of the data, chunk by chunk. |
2383 | - data = markup.read(self.CHUNK_SIZE) |
2384 | + data = io.read(self.CHUNK_SIZE) |
2385 | if len(data) != 0: |
2386 | self.parser.feed(data) |
2387 | self.parser.close() |
2388 | except (UnicodeDecodeError, LookupError, etree.ParserError) as e: |
2389 | raise ParserRejectedMarkup(e) |
2390 | |
2391 | - def close(self): |
2392 | + def close(self) -> None: |
2393 | self.nsmaps = [self.DEFAULT_NSMAPS_INVERTED] |
2394 | |
2395 | - def start(self, name, attrs, nsmap={}): |
2396 | + def start(self, name:str, attrs:Dict[str, str], nsmap:_NamespaceMapping={}): |
2397 | + # This is called by lxml code as a result of calling |
2398 | + # BeautifulSoup.feed(), and we know self.soup is set by the time feed() |
2399 | + # is called. |
2400 | + assert self.soup is not None |
2401 | + |
2402 | # Make sure attrs is a mutable dict--lxml may send an immutable dictproxy. |
2403 | attrs = dict(attrs) |
2404 | - nsprefix = None |
2405 | + nsprefix: Optional[_NamespacePrefix] = None |
2406 | + namespace: Optional[_NamespaceURL] = None |
2407 | # Invert each namespace map as it comes in. |
2408 | if len(nsmap) == 0 and len(self.nsmaps) > 1: |
2409 | # There are no new namespaces for this tag, but |
2410 | @@ -285,7 +347,7 @@ class LXMLTreeBuilderForXML(TreeBuilder): |
2411 | # Namespaces are in play. Find any attributes that came in |
2412 | # from lxml with namespaces attached to their names, and |
2413 | # turn them into NamespacedAttribute objects. |
2414 | - new_attrs = {} |
2415 | + new_attrs:Dict[Union[str,NamespacedAttribute], str] = {} |
2416 | for attr, value in list(attrs.items()): |
2417 | namespace, attr = self._getNsTag(attr) |
2418 | if namespace is None: |
2419 | @@ -303,7 +365,7 @@ class LXMLTreeBuilderForXML(TreeBuilder): |
2420 | namespaces=self.active_namespace_prefixes[-1] |
2421 | ) |
2422 | |
2423 | - def _prefix_for_namespace(self, namespace): |
2424 | + def _prefix_for_namespace(self, namespace:Optional[_NamespaceURL]) -> Optional[_NamespacePrefix]: |
2425 | """Find the currently active prefix for the given namespace.""" |
2426 | if namespace is None: |
2427 | return None |
2428 | @@ -312,7 +374,8 @@ class LXMLTreeBuilderForXML(TreeBuilder): |
2429 | return inverted_nsmap[namespace] |
2430 | return None |
2431 | |
2432 | - def end(self, name): |
2433 | + def end(self, name:str) -> None: |
2434 | + assert self.soup is not None |
2435 | self.soup.endData() |
2436 | completed_tag = self.soup.tagStack[-1] |
2437 | namespace, name = self._getNsTag(name) |
2438 | @@ -334,44 +397,49 @@ class LXMLTreeBuilderForXML(TreeBuilder): |
2439 | # namespace prefixes. |
2440 | self.active_namespace_prefixes.pop() |
2441 | |
2442 | - def pi(self, target, data): |
2443 | + def pi(self, target:str, data:str) -> None: |
2444 | + assert self.soup is not None |
2445 | self.soup.endData() |
2446 | data = target + ' ' + data |
2447 | self.soup.handle_data(data) |
2448 | self.soup.endData(self.processing_instruction_class) |
2449 | |
2450 | - def data(self, content): |
2451 | + def data(self, content:str) -> None: |
2452 | + assert self.soup is not None |
2453 | self.soup.handle_data(content) |
2454 | |
2455 | - def doctype(self, name, pubid, system): |
2456 | + def doctype(self, name:str, pubid:str, system:str) -> None: |
2457 | + assert self.soup is not None |
2458 | self.soup.endData() |
2459 | doctype = Doctype.for_name_and_ids(name, pubid, system) |
2460 | self.soup.object_was_parsed(doctype) |
2461 | |
2462 | - def comment(self, content): |
2463 | + def comment(self, content:str) -> None: |
2464 | "Handle comments as Comment objects." |
2465 | + assert self.soup is not None |
2466 | self.soup.endData() |
2467 | self.soup.handle_data(content) |
2468 | self.soup.endData(Comment) |
2469 | |
2470 | - def test_fragment_to_document(self, fragment): |
2471 | + def test_fragment_to_document(self, fragment:str) -> str: |
2472 | """See `TreeBuilder`.""" |
2473 | return '<?xml version="1.0" encoding="utf-8"?>\n%s' % fragment |
2474 | |
2475 | |
2476 | class LXMLTreeBuilder(HTMLTreeBuilder, LXMLTreeBuilderForXML): |
2477 | |
2478 | - NAME = LXML |
2479 | - ALTERNATE_NAMES = ["lxml-html"] |
2480 | + NAME:str = LXML |
2481 | + ALTERNATE_NAMES: Iterable[str] = ["lxml-html"] |
2482 | |
2483 | - features = ALTERNATE_NAMES + [NAME, HTML, FAST, PERMISSIVE] |
2484 | - is_xml = False |
2485 | - processing_instruction_class = ProcessingInstruction |
2486 | + features: Iterable[str] = list(ALTERNATE_NAMES) + [NAME, HTML, FAST, PERMISSIVE] |
2487 | + is_xml: bool = False |
2488 | |
2489 | - def default_parser(self, encoding): |
2490 | + def default_parser(self, encoding:Optional[_Encoding]) -> Type[Any]: |
2491 | return etree.HTMLParser |
2492 | |
2493 | - def feed(self, markup): |
2494 | + def feed(self, markup:_RawMarkup) -> None: |
2495 | + # We know self.soup is set by the time feed() is called. |
2496 | + assert self.soup is not None |
2497 | encoding = self.soup.original_encoding |
2498 | try: |
2499 | self.parser = self.parser_for(encoding) |
2500 | @@ -381,6 +449,7 @@ class LXMLTreeBuilder(HTMLTreeBuilder, LXMLTreeBuilderForXML): |
2501 | raise ParserRejectedMarkup(e) |
2502 | |
2503 | |
2504 | - def test_fragment_to_document(self, fragment): |
2505 | + def test_fragment_to_document(self, fragment:str) -> str: |
2506 | """See `TreeBuilder`.""" |
2507 | return '<html><body>%s</body></html>' % fragment |
2508 | + |
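The `_getNsTag` change above is purely for the type checker: `tuple(tag[1:].split('}', 1))` produces a `tuple` of unknown arity, so the patch unpacks into two names and returns them explicitly to satisfy the `Tuple[Optional[str], str]` annotation. A standalone sketch of the same Clark-notation split (a free function here, not bs4's actual method):

```python
from typing import Optional, Tuple

def get_ns_tag(tag: str) -> Tuple[Optional[str], str]:
    """Split a Clark-notation name like '{uri}local', as lxml produces,
    into (namespace URL, local name); unprefixed names get None."""
    if tag[0] == "{":
        namespace, name = tag[1:].split("}", 1)
        return (namespace, name)
    return (None, tag)

print(get_ns_tag("{http://www.w3.org/XML/1998/namespace}lang"))
print(get_ns_tag("div"))  # (None, 'div')
```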
2509 | diff --git a/bs4/css.py b/bs4/css.py |
2510 | index 245ac60..0477de8 100644 |
2511 | --- a/bs4/css.py |
2512 | +++ b/bs4/css.py |
2513 | @@ -1,6 +1,36 @@ |
2514 | -"""Integration code for CSS selectors using Soup Sieve (pypi: soupsieve).""" |
2515 | - |
2516 | +"""Integration code for CSS selectors using `Soup Sieve <https://facelessuser.github.io/soupsieve/>`_ (pypi: ``soupsieve``). |
2517 | + |
2518 | +Acquire a `CSS` object through the `bs4.element.Tag.css` attribute of |
2519 | +the starting point of your CSS selector, or (if you want to run a |
2520 | +selector against the entire document) of the `BeautifulSoup` object |
2521 | +itself. |
2522 | + |
2523 | +The main advantage of doing this instead of using ``soupsieve`` |
2524 | +functions is that you don't need to keep passing the `bs4.element.Tag` to be |
2525 | +selected against, since the `CSS` object is permanently scoped to that |
2526 | +`bs4.element.Tag`. |
2527 | + |
2528 | +""" |
2529 | + |
2530 | +from __future__ import annotations |
2531 | + |
2532 | +from types import ModuleType |
2533 | +from typing import ( |
2534 | + Any, |
2535 | + cast, |
2536 | + Iterable, |
2537 | + Iterator, |
2538 | + Optional, |
2539 | + TYPE_CHECKING, |
2540 | +) |
2541 | import warnings |
2542 | +from bs4._typing import _NamespaceMapping |
2543 | +if TYPE_CHECKING: |
2544 | + from soupsieve import SoupSieve |
2545 | + from bs4 import element |
2546 | + from bs4.element import ResultSet, Tag |
2547 | + |
2548 | +soupsieve: Optional[ModuleType] |
2549 | try: |
2550 | import soupsieve |
2551 | except ImportError as e: |
2552 | @@ -9,34 +39,22 @@ except ImportError as e: |
2553 | 'The soupsieve package is not installed. CSS selectors cannot be used.' |
2554 | ) |
2555 | |
2556 | - |
2557 | class CSS(object): |
2558 | - """A proxy object against the soupsieve library, to simplify its |
2559 | + """A proxy object against the ``soupsieve`` library, to simplify its |
2560 | CSS selector API. |
2561 | |
2562 | - Acquire this object through the .css attribute on the |
2563 | - BeautifulSoup object, or on the Tag you want to use as the |
2564 | - starting point for a CSS selector. |
2565 | - |
2566 | - The main advantage of doing this is that the tag to be selected |
2567 | - against doesn't need to be explicitly specified in the function |
2568 | - calls, since it's already scoped to a tag. |
2569 | - """ |
2570 | - |
2571 | - def __init__(self, tag, api=soupsieve): |
2572 | - """Constructor. |
2573 | - |
2574 | - You don't need to instantiate this class yourself; instead, |
2575 | - access the .css attribute on the BeautifulSoup object, or on |
2576 | - the Tag you want to use as the starting point for your CSS |
2577 | - selector. |
2578 | + You don't need to instantiate this class yourself; instead, use |
2579 | + `element.Tag.css`. |
2580 | |
2581 | - :param tag: All CSS selectors will use this as their starting |
2582 | - point. |
2583 | + :param tag: All CSS selectors run by this object will use this as |
2584 | + their starting point. |
2585 | |
2586 | - :param api: A plug-in replacement for the soupsieve module, |
2587 | - designed mainly for use in tests. |
2588 | - """ |
2589 | + :param api: An optional drop-in replacement for the ``soupsieve`` module, |
2590 | + intended for use in unit tests. |
2591 | + """ |
2592 | + def __init__(self, tag: element.Tag, api:Optional[ModuleType]=None): |
2593 | + if api is None: |
2594 | + api = soupsieve |
2595 | if api is None: |
2596 | raise NotImplementedError( |
2597 | "Cannot execute CSS selectors because the soupsieve package is not installed." |
2598 | @@ -44,19 +62,19 @@ class CSS(object): |
2599 | self.api = api |
2600 | self.tag = tag |
2601 | |
2602 | - def escape(self, ident): |
2603 | + def escape(self, ident:str) -> str: |
2604 | """Escape a CSS identifier. |
2605 | |
2606 | - This is a simple wrapper around soupselect.escape(). See the |
2607 | + This is a simple wrapper around `soupsieve.escape() <https://facelessuser.github.io/soupsieve/api/#soupsieveescape>`_. See the |
2608 | documentation for that function for more information. |
2609 | """ |
2610 | if soupsieve is None: |
2611 | raise NotImplementedError( |
2612 | "Cannot escape CSS identifiers because the soupsieve package is not installed." |
2613 | ) |
2614 | - return self.api.escape(ident) |
2615 | + return cast(str, self.api.escape(ident)) |
2616 | |
2617 | - def _ns(self, ns, select): |
2618 | + def _ns(self, ns:Optional[_NamespaceMapping], select:str) -> Optional[_NamespaceMapping]: |
2619 | """Normalize a dictionary of namespaces.""" |
2620 | if not isinstance(select, self.api.SoupSieve) and ns is None: |
2621 | # If the selector is a precompiled pattern, it already has |
2622 | @@ -65,7 +83,7 @@ class CSS(object): |
2623 | ns = self.tag._namespaces |
2624 | return ns |
2625 | |
2626 | - def _rs(self, results): |
2627 | + def _rs(self, results:Iterable[Tag]) -> ResultSet[Tag]: |
2628 | """Normalize a list of results to a Resultset. |
2629 | |
2630 | A ResultSet is more consistent with the rest of Beautiful |
2631 | @@ -77,7 +95,12 @@ class CSS(object): |
2632 | from bs4.element import ResultSet |
2633 | return ResultSet(None, results) |
2634 | |
2635 | - def compile(self, select, namespaces=None, flags=0, **kwargs): |
2636 | + def compile(self, |
2637 | + select:str, |
2638 | + namespaces:Optional[_NamespaceMapping]=None, |
2639 | + flags:int=0, |
2640 | + **kwargs:Any |
2641 | + ) -> SoupSieve: |
2642 | """Pre-compile a selector and return the compiled object. |
2643 | |
2644 | :param selector: A CSS selector. |
2645 | @@ -88,10 +111,10 @@ class CSS(object): |
2646 | parsing the document. |
2647 | |
2648 | :param flags: Flags to be passed into Soup Sieve's |
2649 | - soupsieve.compile() method. |
2650 | + `soupsieve.compile() <https://facelessuser.github.io/soupsieve/api/#soupsievecompile>`_ method. |
2651 | |
2652 | - :param kwargs: Keyword arguments to be passed into SoupSieve's |
2653 | - soupsieve.compile() method. |
2654 | + :param kwargs: Keyword arguments to be passed into Soup Sieve's |
2655 | + `soupsieve.compile() <https://facelessuser.github.io/soupsieve/api/#soupsievecompile>`_ method. |
2656 | |
2657 | :return: A precompiled selector object. |
2658 | :rtype: soupsieve.SoupSieve |
2659 | @@ -100,13 +123,16 @@ class CSS(object): |
2660 | select, self._ns(namespaces, select), flags, **kwargs |
2661 | ) |
2662 | |
2663 | - def select_one(self, select, namespaces=None, flags=0, **kwargs): |
2664 | + def select_one( |
2665 | + self, select:str, |
2666 | + namespaces:Optional[_NamespaceMapping]=None, |
2667 | + flags:int=0, **kwargs:Any |
2668 | + )-> element.Tag | None: |
2669 | """Perform a CSS selection operation on the current Tag and return the |
2670 | - first result. |
2671 | + first result, if any. |
2672 | |
2673 | This uses the Soup Sieve library. For more information, see |
2674 | - that library's documentation for the soupsieve.select_one() |
2675 | - method. |
2676 | + that library's documentation for the `soupsieve.select_one() <https://facelessuser.github.io/soupsieve/api/#soupsieveselect_one>`_ method. |
2677 | |
2678 | :param selector: A CSS selector. |
2679 | |
2680 | @@ -116,27 +142,24 @@ class CSS(object): |
2681 | parsing the document. |
2682 | |
2683 | :param flags: Flags to be passed into Soup Sieve's |
2684 | - soupsieve.select_one() method. |
2685 | - |
2686 | - :param kwargs: Keyword arguments to be passed into SoupSieve's |
2687 | - soupsieve.select_one() method. |
2688 | - |
2689 | - :return: A Tag, or None if the selector has no match. |
2690 | - :rtype: bs4.element.Tag |
2691 | + `soupsieve.select_one() <https://facelessuser.github.io/soupsieve/api/#soupsieveselect_one>`_ method. |
2692 | |
2693 | + :param kwargs: Keyword arguments to be passed into Soup Sieve's |
2694 | + `soupsieve.select_one() <https://facelessuser.github.io/soupsieve/api/#soupsieveselect_one>`_ method. |
2695 | """ |
2696 | return self.api.select_one( |
2697 | select, self.tag, self._ns(namespaces, select), flags, **kwargs |
2698 | ) |
2699 | |
2700 | - def select(self, select, namespaces=None, limit=0, flags=0, **kwargs): |
2701 | - """Perform a CSS selection operation on the current Tag. |
2702 | + def select(self, select:str, |
2703 | + namespaces:Optional[_NamespaceMapping]=None, |
2704 | + limit:int=0, flags:int=0, **kwargs:Any) -> ResultSet[Tag]: |
2705 | + """Perform a CSS selection operation on the current `element.Tag`. |
2706 | |
2707 | This uses the Soup Sieve library. For more information, see |
2708 | - that library's documentation for the soupsieve.select() |
2709 | - method. |
2710 | + that library's documentation for the `soupsieve.select() <https://facelessuser.github.io/soupsieve/api/#soupsieveselect>`_ method. |
2711 | |
2712 | - :param selector: A string containing a CSS selector. |
2713 | + :param selector: A CSS selector. |
2714 | |
2715 | :param namespaces: A dictionary mapping namespace prefixes |
2716 | used in the CSS selector to namespace URIs. By default, |
2717 | @@ -146,14 +169,10 @@ class CSS(object): |
2718 | :param limit: After finding this number of results, stop looking. |
2719 | |
2720 | :param flags: Flags to be passed into Soup Sieve's |
2721 | - soupsieve.select() method. |
2722 | - |
2723 | - :param kwargs: Keyword arguments to be passed into SoupSieve's |
2724 | - soupsieve.select() method. |
2725 | - |
2726 | - :return: A ResultSet of Tag objects. |
2727 | - :rtype: bs4.element.ResultSet |
2728 | + `soupsieve.select() <https://facelessuser.github.io/soupsieve/api/#soupsieveselect>`_ method. |
2729 | |
2730 | + :param kwargs: Keyword arguments to be passed into Soup Sieve's |
2731 | + `soupsieve.select() <https://facelessuser.github.io/soupsieve/api/#soupsieveselect>`_ method. |
2732 | """ |
2733 | if limit is None: |
2734 | limit = 0 |
2735 | @@ -165,11 +184,14 @@ class CSS(object): |
2736 | ) |
2737 | ) |
2738 | |
2739 | - def iselect(self, select, namespaces=None, limit=0, flags=0, **kwargs): |
2740 | - """Perform a CSS selection operation on the current Tag. |
2741 | + def iselect(self, select:str, |
2742 | + namespaces:Optional[_NamespaceMapping]=None, |
2743 | + limit:int=0, flags:int=0, **kwargs:Any) -> Iterator[element.Tag]: |
2744 | + """Perform a CSS selection operation on the current `element.Tag`. |
2745 | |
2746 | This uses the Soup Sieve library. For more information, see |
2747 | - that library's documentation for the soupsieve.iselect() |
2748 | + that library's documentation for the `soupsieve.iselect() |
2749 | + <https://facelessuser.github.io/soupsieve/api/#soupsieveiselect>`_ |
2750 | method. It is the same as select(), but it returns a generator |
2751 | instead of a list. |
2752 | |
2753 | @@ -183,23 +205,23 @@ class CSS(object): |
2754 | :param limit: After finding this number of results, stop looking. |
2755 | |
2756 | :param flags: Flags to be passed into Soup Sieve's |
2757 | - soupsieve.iselect() method. |
2758 | - |
2759 | - :param kwargs: Keyword arguments to be passed into SoupSieve's |
2760 | - soupsieve.iselect() method. |
2761 | + `soupsieve.iselect() <https://facelessuser.github.io/soupsieve/api/#soupsieveiselect>`_ method. |
2762 | |
2763 | - :return: A generator |
2764 | - :rtype: types.GeneratorType |
2765 | + :param kwargs: Keyword arguments to be passed into Soup Sieve's |
2766 | + `soupsieve.iselect() <https://facelessuser.github.io/soupsieve/api/#soupsieveiselect>`_ method. |
2767 | """ |
2768 | return self.api.iselect( |
2769 | select, self.tag, self._ns(namespaces, select), limit, flags, **kwargs |
2770 | ) |
2771 | |
2772 | - def closest(self, select, namespaces=None, flags=0, **kwargs): |
2773 | - """Find the Tag closest to this one that matches the given selector. |
2774 | + def closest(self, select:str, |
2775 | + namespaces:Optional[_NamespaceMapping]=None, |
2776 | + flags:int=0, **kwargs:Any) -> Optional[element.Tag]: |
2777 | + """Find the `element.Tag` closest to this one that matches the given selector. |
2778 | |
2779 | This uses the Soup Sieve library. For more information, see |
2780 | - that library's documentation for the soupsieve.closest() |
2781 | + that library's documentation for the `soupsieve.closest() |
2782 | + <https://facelessuser.github.io/soupsieve/api/#soupsieveclosest>`_ |
2783 | method. |
2784 | |
2785 | :param selector: A string containing a CSS selector. |
2786 | @@ -210,24 +232,24 @@ class CSS(object): |
2787 | parsing the document. |
2788 | |
2789 | :param flags: Flags to be passed into Soup Sieve's |
2790 | - soupsieve.closest() method. |
2791 | + `soupsieve.closest() <https://facelessuser.github.io/soupsieve/api/#soupsieveclosest>`_ method. |
2792 | |
2793 | - :param kwargs: Keyword arguments to be passed into SoupSieve's |
2794 | - soupsieve.closest() method. |
2795 | - |
2796 | - :return: A Tag, or None if there is no match. |
2797 | - :rtype: bs4.Tag |
2798 | + :param kwargs: Keyword arguments to be passed into Soup Sieve's |
2799 | + `soupsieve.closest() <https://facelessuser.github.io/soupsieve/api/#soupsieveclosest>`_ method. |
2800 | |
2801 | """ |
2802 | return self.api.closest( |
2803 | select, self.tag, self._ns(namespaces, select), flags, **kwargs |
2804 | ) |
2805 | |
2806 | - def match(self, select, namespaces=None, flags=0, **kwargs): |
2807 | - """Check whether this Tag matches the given CSS selector. |
2808 | + def match(self, select:str, |
2809 | + namespaces:Optional[_NamespaceMapping]=None, |
2810 | + flags:int=0, **kwargs:Any) -> bool: |
2811 | + """Check whether or not this `element.Tag` matches the given CSS selector. |
2812 | |
2813 | This uses the Soup Sieve library. For more information, see |
2814 | - that library's documentation for the soupsieve.match() |
2815 | + that library's documentation for the `soupsieve.match() |
2816 | + <https://facelessuser.github.io/soupsieve/api/#soupsievematch>`_ |
2817 | method. |
2818 | |
2819 | :param: a CSS selector. |
2820 | @@ -238,25 +260,30 @@ class CSS(object): |
2821 | parsing the document. |
2822 | |
2823 | :param flags: Flags to be passed into Soup Sieve's |
2824 | - soupsieve.match() method. |
2825 | + `soupsieve.match() |
2826 | + <https://facelessuser.github.io/soupsieve/api/#soupsievematch>`_ |
2827 | + method. |
2828 | |
2829 | :param kwargs: Keyword arguments to be passed into SoupSieve's |
2830 | - soupsieve.match() method. |
2831 | - |
2832 | - :return: True if this Tag matches the selector; False otherwise. |
2833 | - :rtype: bool |
2834 | + `soupsieve.match() |
2835 | + <https://facelessuser.github.io/soupsieve/api/#soupsievematch>`_ |
2836 | + method. |
2837 | """ |
2838 | - return self.api.match( |
2839 | + return cast(bool, self.api.match( |
2840 | select, self.tag, self._ns(namespaces, select), flags, **kwargs |
2841 | - ) |
2842 | + )) |
2843 | |
2844 | - def filter(self, select, namespaces=None, flags=0, **kwargs): |
2845 | - """Filter this Tag's direct children based on the given CSS selector. |
2846 | + def filter(self, select:str, |
2847 | + namespaces:Optional[_NamespaceMapping]=None, |
2848 | + flags:int=0, **kwargs:Any) -> ResultSet[Tag]: |
2849 | + """Filter this `element.Tag`'s direct children based on the given CSS selector. |
2850 | |
2851 | This uses the Soup Sieve library. It works the same way as |
2852 | - passing this Tag into that library's soupsieve.filter() |
2853 | - method. More information, for more information see the |
2854 | - documentation for soupsieve.filter(). |
2855 | + passing a `element.Tag` into that library's `soupsieve.filter() |
2856 | + <https://facelessuser.github.io/soupsieve/api/#soupsievefilter>`_ |
2857 | + method. For more information, see the documentation for |
2858 | + `soupsieve.filter() |
2859 | + <https://facelessuser.github.io/soupsieve/api/#soupsievefilter>`_. |
2860 | |
2861 | :param namespaces: A dictionary mapping namespace prefixes |
2862 | used in the CSS selector to namespace URIs. By default, |
2863 | @@ -264,17 +291,18 @@ class CSS(object): |
2864 | parsing the document. |
2865 | |
2866 | :param flags: Flags to be passed into Soup Sieve's |
2867 | - soupsieve.filter() method. |
2868 | + `soupsieve.filter() |
2869 | + <https://facelessuser.github.io/soupsieve/api/#soupsievefilter>`_ |
2870 | + method. |
2871 | |
2872 | :param kwargs: Keyword arguments to be passed into SoupSieve's |
2873 | - soupsieve.filter() method. |
2874 | - |
2875 | - :return: A ResultSet of Tag objects. |
2876 | - :rtype: bs4.element.ResultSet |
2877 | - |
2878 | + `soupsieve.filter() |
2879 | + <https://facelessuser.github.io/soupsieve/api/#soupsievefilter>`_ |
2880 | + method. |
2881 | """ |
2882 | return self._rs( |
2883 | self.api.filter( |
2884 | select, self.tag, self._ns(namespaces, select), flags, **kwargs |
2885 | ) |
2886 | ) |
2887 | + |
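(Editor's note: the css.py hunks above mostly rewrite docstrings and add type annotations to the Soup Sieve wrappers. A minimal sketch of the selection API they document, using the long-stable `select()`/`select_one()` entry points — `soup.css.filter()`, `match()`, and `closest()` are the newer wrappers shown in the diff, and their availability depends on the installed bs4 version:)

```python
from bs4 import BeautifulSoup

html = "<div><p class='a'>one</p><p class='b'>two</p></div>"
soup = BeautifulSoup(html, "html.parser")

# select() delegates the CSS selector to Soup Sieve and returns a
# ResultSet of matching Tag objects, in document order.
texts = [p.get_text() for p in soup.select("p")]

# select_one() returns the first matching Tag, or None if no match.
first_b = soup.select_one("p.b")
```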
2888 | diff --git a/bs4/dammit.py b/bs4/dammit.py |
2889 | index 692433c..8c1b631 100644 |
2890 | --- a/bs4/dammit.py |
2891 | +++ b/bs4/dammit.py |
2892 | @@ -2,9 +2,11 @@ |
2893 | """Beautiful Soup bonus library: Unicode, Dammit |
2894 | |
2895 | This library converts a bytestream to Unicode through any means |
2896 | -necessary. It is heavily based on code from Mark Pilgrim's Universal |
2897 | -Feed Parser. It works best on XML and HTML, but it does not rewrite the |
2898 | -XML or HTML to reflect a new encoding; that's the tree builder's job. |
2899 | +necessary. It is heavily based on code from Mark Pilgrim's `Universal |
2900 | +Feed Parser <https://pypi.org/project/feedparser/>`_. It works best on |
2901 | +XML and HTML, but it does not rewrite the XML or HTML to reflect a new |
2902 | +encoding; that's the job of `TreeBuilder`. |
2903 | + |
2904 | """ |
2905 | # Use of this source code is governed by the MIT license. |
2906 | __license__ = "MIT" |
2907 | @@ -12,9 +14,31 @@ __license__ = "MIT" |
2908 | from html.entities import codepoint2name |
2909 | from collections import defaultdict |
2910 | import codecs |
2911 | +from html.entities import html5 |
2912 | import re |
2913 | -import logging |
2914 | +from logging import Logger, getLogger |
2915 | import string |
2916 | +from types import ModuleType |
2917 | +from typing import ( |
2918 | + Dict, |
2919 | + Iterable, |
2920 | + Iterator, |
2921 | + List, |
2922 | + Optional, |
2923 | + Pattern, |
2924 | + Sequence, |
2925 | + Set, |
2926 | + Tuple, |
2927 | + Type, |
2928 | + Union, |
2929 | + cast, |
2930 | +) |
2931 | +from bs4._typing import ( |
2932 | + _Encoding, |
2933 | + _Encodings, |
2934 | + _RawMarkup, |
2935 | +) |
2936 | +import warnings |
2937 | |
2938 | # Import a library to autodetect character encodings. We'll support |
2939 | # any of a number of libraries that all support the same API: |
2940 | @@ -22,37 +46,41 @@ import string |
2941 | # * cchardet |
2942 | # * chardet |
2943 | # * charset-normalizer |
2944 | -chardet_module = None |
2945 | +chardet_module: Optional[ModuleType] = None |
2946 | try: |
2947 | # PyPI package: cchardet |
2948 | - import cchardet as chardet_module |
2949 | + import cchardet |
2950 | + chardet_module = cchardet |
2951 | except ImportError: |
2952 | try: |
2953 | # Debian package: python-chardet |
2954 | # PyPI package: chardet |
2955 | - import chardet as chardet_module |
2956 | + import chardet |
2957 | + chardet_module = chardet |
2958 | except ImportError: |
2959 | try: |
2960 | # PyPI package: charset-normalizer |
2961 | - import charset_normalizer as chardet_module |
2962 | + import charset_normalizer |
2963 | + chardet_module = charset_normalizer |
2964 | except ImportError: |
2965 | # No chardet available. |
2966 | - chardet_module = None |
2967 | + pass |
2968 | |
2969 | -if chardet_module: |
2970 | - def chardet_dammit(s): |
2971 | - if isinstance(s, str): |
2972 | - return None |
2973 | - return chardet_module.detect(s)['encoding'] |
2974 | -else: |
2975 | - def chardet_dammit(s): |
2976 | + |
2977 | +def _chardet_dammit(s:bytes) -> Optional[str]: |
2978 | + """Try as hard as possible to detect the encoding of a bytestring.""" |
2979 | + if chardet_module is None or isinstance(s, str): |
2980 | return None |
2981 | + module = chardet_module |
2982 | + return module.detect(s)['encoding'] |
2983 | |
2984 | # Build bytestring and Unicode versions of regular expressions for finding |
2985 | # a declared encoding inside an XML or HTML document. |
2986 | -xml_encoding = '^\\s*<\\?.*encoding=[\'"](.*?)[\'"].*\\?>' |
2987 | -html_meta = '<\\s*meta[^>]+charset\\s*=\\s*["\']?([^>]*?)[ /;\'">]' |
2988 | -encoding_res = dict() |
2989 | +xml_encoding:str = '^\\s*<\\?.*encoding=[\'"](.*?)[\'"].*\\?>' #: :meta private: |
2990 | +html_meta:str = '<\\s*meta[^>]+charset\\s*=\\s*["\']?([^>]*?)[ /;\'">]' #: :meta private: |
2991 | + |
2992 | +# TODO: The Pattern type here could use more refinement, but it's tricky. |
2993 | +encoding_res: Dict[Type, Dict[str, Pattern]] = dict() |
2994 | encoding_res[bytes] = { |
2995 | 'html' : re.compile(html_meta.encode("ascii"), re.I), |
2996 | 'xml' : re.compile(xml_encoding.encode("ascii"), re.I), |
2997 | @@ -62,12 +90,29 @@ encoding_res[str] = { |
2998 | 'xml' : re.compile(xml_encoding, re.I) |
2999 | } |
3000 | |
3001 | -from html.entities import html5 |
3002 | - |
3003 | class EntitySubstitution(object): |
3004 | """The ability to substitute XML or HTML entities for certain characters.""" |
3005 | |
3006 | - def _populate_class_variables(): |
3007 | + #: A map of named HTML entities to the corresponding Unicode string. |
3008 | + #: |
3009 | + #: :meta hide-value: |
3010 | + HTML_ENTITY_TO_CHARACTER: Dict[str, str] |
3011 | + |
3012 | + #: A map of Unicode strings to the corresponding named HTML entities; |
3013 | + #: the inverse of HTML_ENTITY_TO_CHARACTER. |
3014 | + #: |
3015 | + #: :meta hide-value: |
3016 | + CHARACTER_TO_HTML_ENTITY: Dict[str, str] |
3017 | + |
3018 | + #: A regular expression that matches any character (or, in rare |
3019 | + #: cases, pair of characters) that can be replaced with a named |
3020 | + #: HTML entity. |
3021 | + #: |
3022 | + #: :meta hide-value: |
3023 | + CHARACTER_TO_HTML_ENTITY_RE: Pattern[str] |
3024 | + |
3025 | + @classmethod |
3026 | + def _populate_class_variables(cls) -> None: |
3027 | """Initialize variables used by this class to manage the plethora of |
3028 | HTML5 named entities. |
3029 | |
3030 | @@ -184,11 +229,14 @@ class EntitySubstitution(object): |
3031 | character = chr(codepoint) |
3032 | unicode_to_name[character] = name |
3033 | |
3034 | - return unicode_to_name, name_to_unicode, re.compile(re_definition) |
3035 | - (CHARACTER_TO_HTML_ENTITY, HTML_ENTITY_TO_CHARACTER, |
3036 | - CHARACTER_TO_HTML_ENTITY_RE) = _populate_class_variables() |
3037 | + cls.CHARACTER_TO_HTML_ENTITY = unicode_to_name |
3038 | + cls.HTML_ENTITY_TO_CHARACTER = name_to_unicode |
3039 | + cls.CHARACTER_TO_HTML_ENTITY_RE = re.compile(re_definition) |
3040 | |
3041 | - CHARACTER_TO_XML_ENTITY = { |
3042 | + #: A map of Unicode strings to the corresponding named XML entities. |
3043 | + #: |
3044 | + #: :meta hide-value: |
3045 | + CHARACTER_TO_XML_ENTITY: Dict[str, str] = { |
3046 | "'": "apos", |
3047 | '"': "quot", |
3048 | "&": "amp", |
3049 | @@ -196,28 +244,37 @@ class EntitySubstitution(object): |
3050 | ">": "gt", |
3051 | } |
3052 | |
3053 | - BARE_AMPERSAND_OR_BRACKET = re.compile("([<>]|" |
3054 | - "&(?!#\\d+;|#x[0-9a-fA-F]+;|\\w+;)" |
3055 | - ")") |
3056 | - |
3057 | - AMPERSAND_OR_BRACKET = re.compile("([<>&])") |
3058 | + #: A regular expression matching an angle bracket or an ampersand that |
3059 | + #: is not part of an XML or HTML entity. |
3060 | + #: |
3061 | + #: :meta hide-value: |
3062 | + BARE_AMPERSAND_OR_BRACKET: Pattern[str] = re.compile( |
3063 | + "([<>]|" |
3064 | + "&(?!#\\d+;|#x[0-9a-fA-F]+;|\\w+;)" |
3065 | + ")" |
3066 | + ) |
3067 | + |
3068 | + #: A regular expression matching an angle bracket or an ampersand. |
3069 | + #: |
3070 | + #: :meta hide-value: |
3071 | + AMPERSAND_OR_BRACKET: Pattern[str] = re.compile("([<>&])") |
3072 | |
3073 | @classmethod |
3074 | - def _substitute_html_entity(cls, matchobj): |
3075 | + def _substitute_html_entity(cls, matchobj:re.Match[str]) -> str: |
3076 | """Used with a regular expression to substitute the |
3077 | appropriate HTML entity for a special character string.""" |
3078 | entity = cls.CHARACTER_TO_HTML_ENTITY.get(matchobj.group(0)) |
3079 | return "&%s;" % entity |
3080 | |
3081 | @classmethod |
3082 | - def _substitute_xml_entity(cls, matchobj): |
3083 | + def _substitute_xml_entity(cls, matchobj:re.Match[str]) -> str: |
3084 | """Used with a regular expression to substitute the |
3085 | appropriate XML entity for a special character string.""" |
3086 | entity = cls.CHARACTER_TO_XML_ENTITY[matchobj.group(0)] |
3087 | return "&%s;" % entity |
3088 | |
3089 | @classmethod |
3090 | - def quoted_attribute_value(self, value): |
3091 | + def quoted_attribute_value(cls, value: str) -> str: |
3092 | """Make a value into a quoted XML attribute, possibly escaping it. |
3093 | |
3094 | Most strings will be quoted using double quotes. |
3095 | @@ -233,7 +290,10 @@ class EntitySubstitution(object): |
3096 | double quotes will be escaped, and the string will be quoted |
3097 | using double quotes. |
3098 | |
3099 | - Welcome to "Bob's Bar" -> "Welcome to &quot;Bob's bar&quot;"
3100 | + Welcome to "Bob's Bar" -> Welcome to &quot;Bob's bar&quot;
3101 | + |
3102 | + :param value: The XML attribute value to quote |
3103 | + :return: The quoted value |
3104 | """ |
3105 | quote_with = '"' |
3106 | if '"' in value: |
3107 | @@ -254,17 +314,22 @@ class EntitySubstitution(object): |
3108 | return quote_with + value + quote_with |
3109 | |
3110 | @classmethod |
3111 | - def substitute_xml(cls, value, make_quoted_attribute=False): |
3112 | - """Substitute XML entities for special XML characters. |
3113 | + def substitute_xml(cls, value:str, make_quoted_attribute:bool=False) -> str: |
3114 | + """Replace special XML characters with named XML entities. |
3115 | + |
3116 | + The less-than sign will become &lt;, the greater-than sign
3117 | + will become &gt;, and any ampersands will become &amp;. If you
3118 | + want ampersands that seem to be part of an entity definition |
3119 | + to be left alone, use `substitute_xml_containing_entities` |
3120 | + instead. |
3121 | |
3122 | - :param value: A string to be substituted. The less-than sign |
3123 | - will become &lt;, the greater-than sign will become &gt;,
3124 | - and any ampersands will become &amp;. If you want ampersands
3125 | - that appear to be part of an entity definition to be left |
3126 | - alone, use substitute_xml_containing_entities() instead. |
3127 | + :param value: A string to be substituted. |
3128 | |
3129 | :param make_quoted_attribute: If True, then the string will be |
3130 | quoted, as befits an attribute value. |
3131 | + |
3132 | + :return: A version of ``value`` with special characters replaced |
3133 | + with named entities. |
3134 | """ |
3135 | # Escape angle brackets and ampersands. |
3136 | value = cls.AMPERSAND_OR_BRACKET.sub( |
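(Editor's note: the `BARE_AMPERSAND_OR_BRACKET` pattern introduced earlier in this file's diff can be exercised on its own. This standalone sketch copies that regular expression verbatim and applies the same kind of substitution `substitute_xml_containing_entities()` performs; the `escape_bare` helper name is ours, not part of bs4:)

```python
import re

# Copied from the diff: matches angle brackets, and ampersands that
# do NOT already begin a numeric, hex, or named entity reference.
BARE_AMPERSAND_OR_BRACKET = re.compile(
    "([<>]|"
    "&(?!#\\d+;|#x[0-9a-fA-F]+;|\\w+;)"
    ")"
)

def escape_bare(value: str) -> str:
    # Replace each bare special character with its named XML entity,
    # leaving existing entity references untouched.
    entities = {"<": "lt", ">": "gt", "&": "amp"}
    return BARE_AMPERSAND_OR_BRACKET.sub(
        lambda m: "&%s;" % entities[m.group(0)], value)
```

The negative lookahead is what distinguishes this pattern from the plain `AMPERSAND_OR_BRACKET` used by `substitute_xml()`: an ampersand that already starts `&amp;` or `&#38;` is left alone.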
3137 | @@ -276,7 +341,7 @@ class EntitySubstitution(object): |
3138 | |
3139 | @classmethod |
3140 | def substitute_xml_containing_entities( |
3141 | - cls, value, make_quoted_attribute=False): |
3142 | + cls, value: str, make_quoted_attribute:bool=False) -> str: |
3143 | """Substitute XML entities for special XML characters. |
3144 | |
3145 | :param value: A string to be substituted. The less-than sign will |
3146 | @@ -297,10 +362,10 @@ class EntitySubstitution(object): |
3147 | return value |
3148 | |
3149 | @classmethod |
3150 | - def substitute_html(cls, s): |
3151 | + def substitute_html(cls, s: str) -> str: |
3152 | """Replace certain Unicode characters with named HTML entities. |
3153 | |
3154 | - This differs from data.encode(encoding, 'xmlcharrefreplace') |
3155 | + This differs from ``data.encode(encoding, 'xmlcharrefreplace')`` |
3156 | in that the goal is to make the result more readable (to those |
3157 | with ASCII displays) rather than to recover from |
3158 | errors. There's absolutely nothing wrong with a UTF-8 string |
3159 | @@ -308,109 +373,126 @@ class EntitySubstitution(object): |
3160 | character with "&eacute;" will make it more readable to some
3161 | people. |
3162 | |
3163 | - :param s: A Unicode string. |
3164 | + :param s: The string to be modified. |
3165 | + :return: The string with some Unicode characters replaced with |
3166 | + HTML entities. |
3167 | """ |
3168 | return cls.CHARACTER_TO_HTML_ENTITY_RE.sub( |
3169 | cls._substitute_html_entity, s) |
3170 | - |
3171 | +EntitySubstitution._populate_class_variables() |
3172 | |
3173 | class EncodingDetector: |
3174 | - """Suggests a number of possible encodings for a bytestring. |
3175 | + """This class is capable of guessing a number of possible encodings |
3176 | + for a bytestring. |
3177 | |
3178 | Order of precedence: |
3179 | |
3180 | 1. Encodings you specifically tell EncodingDetector to try first |
3181 | - (the known_definite_encodings argument to the constructor). |
3182 | - |
3183 | + (the ``known_definite_encodings`` argument to the constructor). |
3184 | + |
3185 | 2. An encoding determined by sniffing the document's byte-order mark. |
3186 | - |
3187 | + |
3188 | 3. Encodings you specifically tell EncodingDetector to try if |
3189 | - byte-order mark sniffing fails (the user_encodings argument to the |
3190 | - constructor). |
3191 | + byte-order mark sniffing fails (the ``user_encodings`` argument to the |
3192 | + constructor). |
3193 | |
3194 | 4. An encoding declared within the bytestring itself, either in an |
3195 | - XML declaration (if the bytestring is to be interpreted as an XML |
3196 | - document), or in a <meta> tag (if the bytestring is to be |
3197 | - interpreted as an HTML document.) |
3198 | + XML declaration (if the bytestring is to be interpreted as an XML |
3199 | + document), or in a <meta> tag (if the bytestring is to be |
3200 | + interpreted as an HTML document.) |
3201 | |
3202 | 5. An encoding detected through textual analysis by chardet, |
3203 | - cchardet, or a similar external library. |
3204 | + cchardet, or a similar external library. |
3205 | |
3206 | - 4. UTF-8. |
3207 | + 6. UTF-8. |
3208 | |
3209 | - 5. Windows-1252. |
3210 | + 7. Windows-1252. |
3211 | |
3212 | - """ |
3213 | - def __init__(self, markup, known_definite_encodings=None, |
3214 | - is_html=False, exclude_encodings=None, |
3215 | - user_encodings=None, override_encodings=None): |
3216 | - """Constructor. |
3217 | - |
3218 | - :param markup: Some markup in an unknown encoding. |
3219 | - |
3220 | - :param known_definite_encodings: When determining the encoding |
3221 | - of `markup`, these encodings will be tried first, in |
3222 | - order. In HTML terms, this corresponds to the "known |
3223 | - definite encoding" step defined here: |
3224 | - https://html.spec.whatwg.org/multipage/parsing.html#parsing-with-a-known-character-encoding |
3225 | - |
3226 | - :param user_encodings: These encodings will be tried after the |
3227 | - `known_definite_encodings` have been tried and failed, and |
3228 | - after an attempt to sniff the encoding by looking at a |
3229 | - byte order mark has failed. In HTML terms, this |
3230 | - corresponds to the step "user has explicitly instructed |
3231 | - the user agent to override the document's character |
3232 | - encoding", defined here: |
3233 | - https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding |
3234 | - |
3235 | - :param override_encodings: A deprecated alias for |
3236 | - known_definite_encodings. Any encodings here will be tried |
3237 | - immediately after the encodings in |
3238 | - known_definite_encodings. |
3239 | - |
3240 | - :param is_html: If True, this markup is considered to be |
3241 | - HTML. Otherwise it's assumed to be XML. |
3242 | - |
3243 | - :param exclude_encodings: These encodings will not be tried, |
3244 | - even if they otherwise would be. |
3245 | + :param markup: Some markup in an unknown encoding. |
3246 | |
3247 | - """ |
3248 | + :param known_definite_encodings: When determining the encoding |
3249 | + of ``markup``, these encodings will be tried first, in |
3250 | + order. In HTML terms, this corresponds to the "known |
3251 | + definite encoding" step defined in `section 13.2.3.1 of the HTML standard <https://html.spec.whatwg.org/multipage/parsing.html#parsing-with-a-known-character-encoding>`_. |
3252 | + |
3253 | + :param user_encodings: These encodings will be tried after the |
3254 | + ``known_definite_encodings`` have been tried and failed, and |
3255 | + after an attempt to sniff the encoding by looking at a |
3256 | + byte order mark has failed. In HTML terms, this |
3257 | + corresponds to the step "user has explicitly instructed |
3258 | + the user agent to override the document's character |
3259 | + encoding", defined in `section 13.2.3.2 of the HTML standard <https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding>`_. |
3260 | + |
3261 | + :param override_encodings: A **deprecated** alias for |
3262 | + ``known_definite_encodings``. Any encodings here will be tried |
3263 | + immediately after the encodings in |
3264 | + ``known_definite_encodings``. |
3265 | + |
3266 | + :param is_html: If True, this markup is considered to be |
3267 | + HTML. Otherwise it's assumed to be XML. |
3268 | + |
3269 | + :param exclude_encodings: These encodings will not be tried, |
3270 | + even if they otherwise would be. |
3271 | + |
3272 | + """ |
3273 | + def __init__(self, markup:bytes, |
3274 | + known_definite_encodings:Optional[_Encodings]=None, |
3275 | + is_html:Optional[bool]=False, |
3276 | + exclude_encodings:Optional[_Encodings]=None, |
3277 | + user_encodings:Optional[_Encodings]=None, |
3278 | + override_encodings:Optional[_Encodings]=None): |
3279 | self.known_definite_encodings = list(known_definite_encodings or []) |
3280 | if override_encodings: |
3281 | + warnings.warn( |
3282 | + "The 'override_encodings' argument was deprecated in 4.10.0. Use 'known_definite_encodings' instead.", |
3283 | + DeprecationWarning, |
3284 | + stacklevel=3 |
3285 | + ) |
3286 | self.known_definite_encodings += override_encodings |
3287 | self.user_encodings = user_encodings or [] |
3288 | exclude_encodings = exclude_encodings or [] |
3289 | self.exclude_encodings = set([x.lower() for x in exclude_encodings]) |
3290 | self.chardet_encoding = None |
3291 | - self.is_html = is_html |
3292 | - self.declared_encoding = None |
3293 | + self.is_html = False if is_html is None else is_html |
3294 | + self.declared_encoding: Optional[str] = None |
3295 | |
3296 | # First order of business: strip a byte-order mark. |
3297 | self.markup, self.sniffed_encoding = self.strip_byte_order_mark(markup) |
3298 | |
3299 | - def _usable(self, encoding, tried): |
3300 | + known_definite_encodings:_Encodings |
3301 | + user_encodings:_Encodings |
3302 | + exclude_encodings:_Encodings |
3303 | + chardet_encoding:Optional[_Encoding] |
3304 | + is_html:bool |
3305 | + declared_encoding:Optional[_Encoding] |
3306 | + markup:bytes |
3307 | + sniffed_encoding:Optional[_Encoding] |
3308 | + |
3309 | + def _usable(self, encoding:Optional[_Encoding], tried:Set[_Encoding]) -> bool: |
3310 | """Should we even bother to try this encoding? |
3311 | |
3312 | :param encoding: Name of an encoding. |
3313 | - :param tried: Encodings that have already been tried. This will be modified |
3314 | - as a side effect. |
3315 | + :param tried: Encodings that have already been tried. This |
3316 | + will be modified as a side effect. |
3317 | """ |
3318 | - if encoding is not None: |
3319 | - encoding = encoding.lower() |
3320 | - if encoding in self.exclude_encodings: |
3321 | - return False |
3322 | - if encoding not in tried: |
3323 | - tried.add(encoding) |
3324 | - return True |
3325 | + if encoding is None: |
3326 | + return False |
3327 | + encoding = encoding.lower() |
3328 | + if encoding in self.exclude_encodings: |
3329 | + return False |
3330 | + if encoding not in tried: |
3331 | + tried.add(encoding) |
3332 | + return True |
3333 | return False |
3334 | |
3335 | @property |
3336 | - def encodings(self): |
3337 | + def encodings(self) -> Iterator[_Encoding]: |
3338 | """Yield a number of encodings that might work for this markup. |
3339 | |
3340 | - :yield: A sequence of strings. |
3341 | + :yield: A sequence of strings. Each is the name of an encoding |
3342 | + that *might* work to convert a bytestring into Unicode. |
3343 | """ |
3344 | - tried = set() |
3345 | + tried:Set[_Encoding] = set() |
3346 | |
3347 | # First, try the known definite encodings |
3348 | for e in self.known_definite_encodings: |
3349 | @@ -419,7 +501,9 @@ class EncodingDetector: |
3350 | |
3351 | # Did the document originally start with a byte-order mark |
3352 | # that indicated its encoding? |
3353 | - if self._usable(self.sniffed_encoding, tried): |
3354 | + if self.sniffed_encoding is not None and self._usable( |
3355 | + self.sniffed_encoding, tried |
3356 | + ): |
3357 | yield self.sniffed_encoding |
3358 | |
3359 | # Sniffing the byte-order mark did nothing; try the user |
3360 | @@ -433,14 +517,18 @@ class EncodingDetector: |
3361 | if self.declared_encoding is None: |
3362 | self.declared_encoding = self.find_declared_encoding( |
3363 | self.markup, self.is_html) |
3364 | - if self._usable(self.declared_encoding, tried): |
3365 | + if self.declared_encoding is not None and self._usable( |
3366 | + self.declared_encoding, tried |
3367 | + ): |
3368 | yield self.declared_encoding |
3369 | |
3370 | # Use third-party character set detection to guess at the |
3371 | # encoding. |
3372 | if self.chardet_encoding is None: |
3373 | - self.chardet_encoding = chardet_dammit(self.markup) |
3374 | - if self._usable(self.chardet_encoding, tried): |
3375 | + self.chardet_encoding = _chardet_dammit(self.markup) |
3376 | + if self.chardet_encoding is not None and self._usable( |
3377 | + self.chardet_encoding, tried |
3378 | + ): |
3379 | yield self.chardet_encoding |
3380 | |
3381 | # As a last-ditch effort, try utf-8 and windows-1252. |
3382 | @@ -449,22 +537,24 @@ class EncodingDetector: |
3383 | yield e |
3384 | |
3385 | @classmethod |
3386 | - def strip_byte_order_mark(cls, data): |
3387 | + def strip_byte_order_mark(cls, data:bytes) -> Tuple[bytes, Optional[_Encoding]]: |
3388 | """If a byte-order mark is present, strip it and return the encoding it implies. |
3389 | |
3390 | - :param data: Some markup. |
3391 | - :return: A 2-tuple (modified data, implied encoding) |
3392 | + :param data: A bytestring that may or may not begin with a |
3393 | + byte-order mark. |
3394 | + |
3395 | + :return: A 2-tuple (data stripped of byte-order mark, encoding implied by byte-order mark) |
3396 | """ |
3397 | encoding = None |
3398 | if isinstance(data, str): |
3399 | # Unicode data cannot have a byte-order mark. |
3400 | return data, encoding |
3401 | if (len(data) >= 4) and (data[:2] == b'\xfe\xff') \ |
3402 | - and (data[2:4] != '\x00\x00'): |
3403 | + and (data[2:4] != b'\x00\x00'): |
3404 | encoding = 'utf-16be' |
3405 | data = data[2:] |
3406 | elif (len(data) >= 4) and (data[:2] == b'\xff\xfe') \ |
3407 | - and (data[2:4] != '\x00\x00'): |
3408 | + and (data[2:4] != b'\x00\x00'): |
3409 | encoding = 'utf-16le' |
3410 | data = data[2:] |
3411 | elif data[:3] == b'\xef\xbb\xbf': |
3412 | @@ -479,8 +569,9 @@ class EncodingDetector: |
3413 | return data, encoding |
3414 | |
3415 | @classmethod |
3416 | - def find_declared_encoding(cls, markup, is_html=False, search_entire_document=False): |
3417 | - """Given a document, tries to find its declared encoding. |
3418 | + def find_declared_encoding(cls, markup:Union[bytes,str], is_html:bool=False, search_entire_document:bool=False) -> Optional[_Encoding]: |
3419 | + """Given a document, tries to find an encoding declared within the |
3420 | + text of the document itself. |
3421 | |
3422 | An XML encoding is declared at the beginning of the document. |
3423 | |
3424 | @@ -490,9 +581,12 @@ class EncodingDetector: |
3425 | :param markup: Some markup. |
3426 | :param is_html: If True, this markup is considered to be HTML. Otherwise |
3427 | it's assumed to be XML. |
3428 | - :param search_entire_document: Since an encoding is supposed to declared near the beginning |
3429 | - of the document, most of the time it's only necessary to search a few kilobytes of data. |
3430 | - Set this to True to force this method to search the entire document. |
3431 | + :param search_entire_document: Since an encoding is supposed |
3432 | + to be declared near the beginning of the document, most of
3433 | + the time it's only necessary to search a few kilobytes of |
3434 | + data. Set this to True to force this method to search the |
3435 | + entire document. |
3436 | + :return: The declared encoding, if one is found. |
3437 | """ |
3438 | if search_entire_document: |
3439 | xml_endpos = html_endpos = len(markup) |
3440 | @@ -520,74 +614,69 @@ class EncodingDetector: |
3441 | return None |
3442 | |
3443 | class UnicodeDammit: |
3444 | - """A class for detecting the encoding of a *ML document and |
3445 | - converting it to a Unicode string. If the source encoding is |
3446 | - windows-1252, can replace MS smart quotes with their HTML or XML |
3447 | - equivalents.""" |
3448 | - |
3449 | - # This dictionary maps commonly seen values for "charset" in HTML |
3450 | - # meta tags to the corresponding Python codec names. It only covers |
3451 | - # values that aren't in Python's aliases and can't be determined |
3452 | - # by the heuristics in find_codec. |
3453 | - CHARSET_ALIASES = {"macintosh": "mac-roman", |
3454 | - "x-sjis": "shift-jis"} |
3455 | - |
3456 | - ENCODINGS_WITH_SMART_QUOTES = [ |
3457 | - "windows-1252", |
3458 | - "iso-8859-1", |
3459 | - "iso-8859-2", |
3460 | - ] |
3461 | + """A class for detecting the encoding of a bytestring containing an |
3462 | + HTML or XML document, and decoding it to Unicode. If the source |
3463 | + encoding is windows-1252, `UnicodeDammit` can also replace |
3464 | + Microsoft smart quotes with their HTML or XML equivalents. |
3465 | + |
3466 | + :param markup: HTML or XML markup in an unknown encoding. |
3467 | + |
3468 | + :param known_definite_encodings: When determining the encoding |
3469 | + of ``markup``, these encodings will be tried first, in |
3470 | + order. In HTML terms, this corresponds to the "known |
3471 | + definite encoding" step defined in `section 13.2.3.1 of the HTML standard <https://html.spec.whatwg.org/multipage/parsing.html#parsing-with-a-known-character-encoding>`_. |
3472 | + |
3473 | + :param user_encodings: These encodings will be tried after the |
3474 | + ``known_definite_encodings`` have been tried and failed, and |
3475 | + after an attempt to sniff the encoding by looking at a |
3476 | + byte order mark has failed. In HTML terms, this |
3477 | + corresponds to the step "user has explicitly instructed |
3478 | + the user agent to override the document's character |
3479 | + encoding", defined in `section 13.2.3.2 of the HTML standard <https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding>`_. |
3480 | + |
3481 | + :param override_encodings: A **deprecated** alias for |
3482 | + ``known_definite_encodings``. Any encodings here will be tried |
3483 | + immediately after the encodings in |
3484 | + ``known_definite_encodings``. |
3485 | + |
3486 | + :param smart_quotes_to: By default, Microsoft smart quotes will, |
3487 | + like all other characters, be converted to Unicode |
3488 | + characters. Setting this to ``ascii`` will convert them to ASCII |
3489 | + quotes instead. Setting it to ``xml`` will convert them to XML |
3490 | + entity references, and setting it to ``html`` will convert them |
3491 | + to HTML entity references. |
3492 | + |
3493 | + :param is_html: If True, ``markup`` is treated as an HTML |
3494 | + document. Otherwise it's treated as an XML document. |
3495 | + |
3496 | + :param exclude_encodings: These encodings will not be considered, |
3497 | + even if the sniffing code thinks they might make sense. |
3498 | |
3499 | - def __init__(self, markup, known_definite_encodings=[], |
3500 | - smart_quotes_to=None, is_html=False, exclude_encodings=[], |
3501 | - user_encodings=None, override_encodings=None |
3502 | + """ |
3503 | + def __init__( |
3504 | + self, markup:bytes, |
3505 | + known_definite_encodings:Optional[_Encodings]=[], |
3506 | + # TODO PYTHON 3.8 Literal is added to the typing module |
3507 | + # |
3508 | + # smart_quotes_to: Literal["ascii", "xml", "html"] | None = None, |
3509 | + smart_quotes_to: Optional[str] = None, |
3510 | + is_html: bool = False, |
3511 | + exclude_encodings:Optional[_Encodings] = [], |
3512 | + user_encodings:Optional[_Encodings] = None, |
3513 | + override_encodings:Optional[_Encodings] = None |
3514 | ): |
3515 | - """Constructor. |
3516 | - |
3517 | - :param markup: A bytestring representing markup in an unknown encoding. |
3518 | - |
3519 | - :param known_definite_encodings: When determining the encoding |
3520 | - of `markup`, these encodings will be tried first, in |
3521 | - order. In HTML terms, this corresponds to the "known |
3522 | - definite encoding" step defined here: |
3523 | - https://html.spec.whatwg.org/multipage/parsing.html#parsing-with-a-known-character-encoding |
3524 | - |
3525 | - :param user_encodings: These encodings will be tried after the |
3526 | - `known_definite_encodings` have been tried and failed, and |
3527 | - after an attempt to sniff the encoding by looking at a |
3528 | - byte order mark has failed. In HTML terms, this |
3529 | - corresponds to the step "user has explicitly instructed |
3530 | - the user agent to override the document's character |
3531 | - encoding", defined here: |
3532 | - https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding |
3533 | - |
3534 | - :param override_encodings: A deprecated alias for |
3535 | - known_definite_encodings. Any encodings here will be tried |
3536 | - immediately after the encodings in |
3537 | - known_definite_encodings. |
3538 | - |
3539 | - :param smart_quotes_to: By default, Microsoft smart quotes will, like all other characters, be converted |
3540 | - to Unicode characters. Setting this to 'ascii' will convert them to ASCII quotes instead. |
3541 | - Setting it to 'xml' will convert them to XML entity references, and setting it to 'html' |
3542 | - will convert them to HTML entity references. |
3543 | - :param is_html: If True, this markup is considered to be HTML. Otherwise |
3544 | - it's assumed to be XML. |
3545 | - :param exclude_encodings: These encodings will not be considered, even |
3546 | - if the sniffing code thinks they might make sense. |
3547 | - |
3548 | - """ |
3549 | self.smart_quotes_to = smart_quotes_to |
3550 | self.tried_encodings = [] |
3551 | self.contains_replacement_characters = False |
3552 | self.is_html = is_html |
3553 | - self.log = logging.getLogger(__name__) |
3554 | + self.log = getLogger(__name__) |
3555 | self.detector = EncodingDetector( |
3556 | markup, known_definite_encodings, is_html, exclude_encodings, |
3557 | user_encodings, override_encodings |
3558 | ) |
3559 | |
3560 | # Short-circuit if the data is in Unicode to begin with. |
3561 | - if isinstance(markup, str) or markup == '': |
3562 | + if isinstance(markup, str) or markup == b'': |
3563 | self.markup = markup |
3564 | self.unicode_markup = str(markup) |
3565 | self.original_encoding = None |
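The constructor above records each (encoding, error-strategy) pair it attempts in `tried_encodings` before moving on to the next candidate. A stdlib-only sketch of that trial loop (the function name and the strict-then-replace ordering are illustrative, not the bs4 API):

```python
from typing import List, Optional, Tuple

def try_encodings(
    data: bytes, encodings: List[str]
) -> Tuple[Optional[str], List[Tuple[str, str]]]:
    """Try each candidate encoding with 'strict' errors first, then fall
    back to 'replace', recording every (encoding, errors) pair attempted."""
    tried: List[Tuple[str, str]] = []
    for errors in ("strict", "replace"):
        for encoding in encodings:
            if (encoding, errors) in tried:
                continue
            tried.append((encoding, errors))
            try:
                return data.decode(encoding, errors), tried
            except UnicodeDecodeError:
                continue
    return None, tried

# Latin-1 bytes fail a strict UTF-8 decode, so the second candidate wins.
text, attempts = try_encodings("caf\xe9".encode("latin-1"), ["utf-8", "latin-1"])
```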
3566 | @@ -616,41 +705,117 @@ class UnicodeDammit: |
3567 | "Some characters could not be decoded, and were " |
3568 | "replaced with REPLACEMENT CHARACTER." |
3569 | ) |
3570 | + |
3571 | self.contains_replacement_characters = True |
3572 | break |
3573 | |
3574 | # If none of that worked, we could at this point force it to |
3575 | # ASCII, but that would destroy so much data that I think |
3576 | # giving up is better. |
3577 | - self.unicode_markup = u |
3578 | - if not u: |
3579 | + # |
3580 | + # Note that this is extremely unlikely, probably impossible, |
3581 | + # because the "replace" strategy is so powerful. Even running |
3582 | + # the Python binary through Unicode, Dammit gives you Unicode, |
3583 | + # albeit Unicode riddled with REPLACEMENT CHARACTER. |
3584 | + if u is None: |
3585 | self.original_encoding = None |
3586 | + self.unicode_markup = None |
3587 | + else: |
3588 | + self.unicode_markup = u |
3589 | + |
3590 | + #: The original markup, before it was converted to Unicode. |
3591 | + #: This is not necessarily the same as what was passed in to the |
3592 | + #: constructor, since any byte-order mark will be stripped. |
3593 | + markup:bytes |
3594 | |
3595 | - def _sub_ms_char(self, match): |
3596 | + #: The Unicode version of the markup, following conversion. This |
3597 | + #: is set to `None` if there was simply no way to convert the |
3598 | + #: bytestring to Unicode (as with binary data). |
3599 | + unicode_markup:Optional[str] |
3600 | + |
3601 | + #: This is True if `UnicodeDammit.unicode_markup` contains |
3602 | + #: U+FFFD REPLACEMENT_CHARACTER characters which were not present |
3603 | + #: in `UnicodeDammit.markup`. These mark character sequences that |
3604 | + #: could not be represented in Unicode. |
3605 | + contains_replacement_characters: bool |
3606 | + |
3607 | + #: Unicode, Dammit's best guess as to the original character |
3608 | + #: encoding of `UnicodeDammit.markup`. |
3609 | + original_encoding:Optional[_Encoding] |
3610 | + |
3611 | + #: The strategy used to handle Microsoft smart quotes. |
3612 | + smart_quotes_to: Optional[str] |
3613 | + |
3614 | + #: The (encoding, error handling strategy) 2-tuples that were used to |
3615 | + #: try and convert the markup to Unicode. |
3616 | + tried_encodings: List[Tuple[_Encoding, str]] |
3617 | + |
3618 | + log: Logger #: :meta private: |
3619 | + |
3620 | + def _sub_ms_char(self, match:re.Match[bytes]) -> bytes: |
3621 | """Changes a MS smart quote character to an XML or HTML |
3622 | - entity, or an ASCII character.""" |
3623 | - orig = match.group(1) |
3624 | + entity, or an ASCII character. |
3625 | + |
3626 | + TODO: Since this is only used to convert smart quotes, it |
3627 | + could be simplified, and MS_CHARS_TO_ASCII made much less |
3628 | + parochial. |
3629 | + """ |
3630 | + orig: bytes = match.group(1) |
3631 | + sub: bytes |
3632 | if self.smart_quotes_to == 'ascii': |
3633 | - sub = self.MS_CHARS_TO_ASCII.get(orig).encode() |
3634 | + if orig in self.MS_CHARS_TO_ASCII: |
3635 | + sub = self.MS_CHARS_TO_ASCII[orig].encode() |
3636 | + else: |
3637 | + # Shouldn't happen; substitute the character |
3638 | + # with itself. |
3639 | + sub = orig |
3640 | else: |
3641 | - sub = self.MS_CHARS.get(orig) |
3642 | - if type(sub) == tuple: |
3643 | - if self.smart_quotes_to == 'xml': |
3644 | - sub = '&#x'.encode() + sub[1].encode() + ';'.encode() |
3645 | + if orig in self.MS_CHARS: |
3646 | + substitutions = self.MS_CHARS[orig] |
3647 | + if type(substitutions) == tuple: |
3648 | + if self.smart_quotes_to == 'xml': |
3649 | + sub = b'&#x' + substitutions[1].encode() + b';' |
3650 | + else: |
3651 | + sub = b'&' + substitutions[0].encode() + b';' |
3652 | else: |
3653 | - sub = '&'.encode() + sub[0].encode() + ';'.encode() |
3654 | + substitutions = cast(str, substitutions) |
3655 | + sub = substitutions.encode() |
3656 | else: |
3657 | - sub = sub.encode() |
3658 | + # Shouldn't happen; substitute the character |
3659 | + # for itself. |
3660 | + sub = orig |
3661 | return sub |
3662 | + |
3663 | + #: This dictionary maps commonly seen values for "charset" in HTML |
3664 | + #: meta tags to the corresponding Python codec names. It only covers |
3665 | + #: values that aren't in Python's aliases and can't be determined |
3666 | + #: by the heuristics in `find_codec`. |
3667 | + #: |
3668 | + #: :meta hide-value: |
3669 | + CHARSET_ALIASES: Dict[str, _Encoding] = {"macintosh": "mac-roman", |
3670 | + "x-sjis": "shift-jis"} |
3671 | + |
3672 | + #: A list of encodings that tend to contain Microsoft smart quotes. |
3673 | + #: |
3674 | + #: :meta hide-value: |
3675 | + ENCODINGS_WITH_SMART_QUOTES: _Encodings = [ |
3676 | + "windows-1252", |
3677 | + "iso-8859-1", |
3678 | + "iso-8859-2", |
3679 | + ] |
3680 | |
3681 | - def _convert_from(self, proposed, errors="strict"): |
3682 | + def _convert_from(self, proposed:_Encoding, errors:str="strict") -> Optional[str]: |
3683 | """Attempt to convert the markup to the proposed encoding. |
3684 | |
3685 | :param proposed: The name of a character encoding. |
3686 | + :param errors: An error handling strategy, used when calling `str`. |
3687 | + :return: The converted markup, or `None` if the proposed |
3688 | + encoding/error handling strategy didn't work. |
3689 | """ |
3690 | - proposed = self.find_codec(proposed) |
3691 | - if not proposed or (proposed, errors) in self.tried_encodings: |
3692 | + lookup_result = self.find_codec(proposed) |
3693 | + if lookup_result is None or (lookup_result, errors) in self.tried_encodings: |
3694 | return None |
3695 | + proposed = lookup_result |
3696 | self.tried_encodings.append((proposed, errors)) |
3697 | markup = self.markup |
3698 | # Convert smart quotes to HTML if coming from an encoding |
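`_sub_ms_char` above maps each matched smart-quote byte to an HTML or XML entity depending on `smart_quotes_to`. A minimal sketch of the same substitution over a bytestring, using an illustrative subset of the `MS_CHARS` mapping:

```python
import re

# Illustrative subset of the Windows-1252 smart-quote mapping:
# byte -> (HTML entity name, hex codepoint for an XML reference).
MS_CHARS = {
    b'\x91': ('lsquo', '2018'),
    b'\x92': ('rsquo', '2019'),
    b'\x93': ('ldquo', '201C'),
    b'\x94': ('rdquo', '201D'),
}
SMART_QUOTE_RE = re.compile(b'([\x91-\x94])')

def sub_smart_quotes(markup: bytes, smart_quotes_to: str) -> bytes:
    """Replace smart-quote bytes with XML or HTML entity references."""
    def replace(match: "re.Match[bytes]") -> bytes:
        name, codepoint = MS_CHARS[match.group(1)]
        if smart_quotes_to == 'xml':
            return b'&#x' + codepoint.encode() + b';'
        return b'&' + name.encode() + b';'
    return SMART_QUOTE_RE.sub(replace, markup)

sub_smart_quotes(b'\x93Hi\x94', 'html')  # b'&ldquo;Hi&rdquo;'
```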
3699 | @@ -665,36 +830,37 @@ class UnicodeDammit: |
3700 | #print("Trying to convert document to %s (errors=%s)" % ( |
3701 | # proposed, errors)) |
3702 | u = self._to_unicode(markup, proposed, errors) |
3703 | - self.markup = u |
3704 | + self.unicode_markup = u |
3705 | self.original_encoding = proposed |
3706 | except Exception as e: |
3707 | #print("That didn't work!") |
3708 | #print(e) |
3709 | return None |
3710 | #print("Correct encoding: %s" % proposed) |
3711 | - return self.markup |
3712 | + return self.unicode_markup |
3713 | |
3714 | - def _to_unicode(self, data, encoding, errors="strict"): |
3715 | - """Given a string and its encoding, decodes the string into Unicode. |
3716 | + def _to_unicode(self, data:bytes, encoding:_Encoding, errors:str="strict") -> str: |
3717 | + """Given a bytestring and its encoding, decodes the string into Unicode. |
3718 | |
3719 | :param encoding: The name of an encoding. |
3720 | + :param errors: An error handling strategy, used when calling `str`. |
3721 | """ |
3722 | return str(data, encoding, errors) |
3723 | |
3724 | @property |
3725 | - def declared_html_encoding(self): |
3726 | - """If the markup is an HTML document, returns the encoding declared _within_ |
3727 | - the document. |
3728 | + def declared_html_encoding(self) -> Optional[str]: |
3729 | + """If the markup is an HTML document, returns the encoding, if any, |
3730 | + declared *inside* the document. |
3731 | """ |
3732 | if not self.is_html: |
3733 | return None |
3734 | return self.detector.declared_encoding |
3735 | |
3736 | - def find_codec(self, charset): |
3737 | - """Convert the name of a character set to a codec name. |
3738 | + def find_codec(self, charset:_Encoding) -> Optional[str]: |
3739 | + """Look up the Python codec corresponding to a given character set. |
3740 | |
3741 | :param charset: The name of a character set. |
3742 | - :return: The name of a codec. |
3743 | + :return: The name of a Python codec. |
3744 | """ |
3745 | value = (self._codec(self.CHARSET_ALIASES.get(charset, charset)) |
3746 | or (charset and self._codec(charset.replace("-", ""))) |
3747 | @@ -706,7 +872,7 @@ class UnicodeDammit: |
3748 | return value.lower() |
3749 | return None |
3750 | |
3751 | - def _codec(self, charset): |
3752 | + def _codec(self, charset:_Encoding) -> Optional[str]: |
3753 | if not charset: |
3754 | return charset |
3755 | codec = None |
3756 | @@ -718,8 +884,11 @@ class UnicodeDammit: |
3757 | return codec |
3758 | |
3759 | |
3760 | - # A partial mapping of ISO-Latin-1 to HTML entities/XML numeric entities. |
3761 | - MS_CHARS = {b'\x80': ('euro', '20AC'), |
3762 | + #: A partial mapping of ISO-Latin-1 to HTML entities/XML numeric entities. |
3763 | + #: |
3764 | + #: :meta hide-value: |
3765 | + MS_CHARS: Dict[bytes, Union[str, Tuple[str, str]]] = { |
3766 | + b'\x80': ('euro', '20AC'), |
3767 | b'\x81': ' ', |
3768 | b'\x82': ('sbquo', '201A'), |
3769 | b'\x83': ('fnof', '192'), |
3770 | @@ -752,10 +921,15 @@ class UnicodeDammit: |
3771 | b'\x9e': ('#x17E', '17E'), |
3772 | b'\x9f': ('Yuml', ''),} |
3773 | |
3774 | - # A parochial partial mapping of ISO-Latin-1 to ASCII. Contains |
3775 | - # horrors like stripping diacritical marks to turn á into a, but also |
3776 | - # contains non-horrors like turning “ into ". |
3777 | - MS_CHARS_TO_ASCII = { |
3778 | + #: A parochial partial mapping of ISO-Latin-1 to ASCII. Contains |
3779 | + #: horrors like stripping diacritical marks to turn á into a, but also |
3780 | + #: contains non-horrors like turning “ into ". |
3781 | + #: |
3782 | + #: Seriously, don't use this for anything other than removing smart |
3783 | + #: quotes. |
3784 | + #: |
3785 | + #: :meta private: |
3786 | + MS_CHARS_TO_ASCII: Dict[bytes, str] = { |
3787 | b'\x80' : 'EUR', |
3788 | b'\x81' : ' ', |
3789 | b'\x82' : ',', |
3790 | @@ -809,7 +983,7 @@ class UnicodeDammit: |
3791 | b'\xb1' : '+-', |
3792 | b'\xb2' : '2', |
3793 | b'\xb3' : '3', |
3794 | - b'\xb4' : ("'", 'acute'), |
3795 | + b'\xb4' : "'", |
3796 | b'\xb5' : 'u', |
3797 | b'\xb6' : 'P', |
3798 | b'\xb7' : '*', |
3799 | @@ -887,12 +1061,14 @@ class UnicodeDammit: |
3800 | b'\xff' : 'y', |
3801 | } |
3802 | |
3803 | - # A map used when removing rogue Windows-1252/ISO-8859-1 |
3804 | - # characters in otherwise UTF-8 documents. |
3805 | - # |
3806 | - # Note that \x81, \x8d, \x8f, \x90, and \x9d are undefined in |
3807 | - # Windows-1252. |
3808 | - WINDOWS_1252_TO_UTF8 = { |
3809 | + #: A map used when removing rogue Windows-1252/ISO-8859-1 |
3810 | + #: characters in otherwise UTF-8 documents. |
3811 | + #: |
3812 | + #: Note that \\x81, \\x8d, \\x8f, \\x90, and \\x9d are undefined in |
3813 | + #: Windows-1252. |
3814 | + #: |
3815 | + #: :meta hide-value: |
3816 | + WINDOWS_1252_TO_UTF8: Dict[int, bytes] = { |
3817 | 0x80 : b'\xe2\x82\xac', # € |
3818 | 0x82 : b'\xe2\x80\x9a', # ‚ |
3819 | 0x83 : b'\xc6\x92', # Æ’ |
3820 | @@ -1017,33 +1193,37 @@ class UnicodeDammit: |
3821 | 0xfe : b'\xc3\xbe', # þ |
3822 | } |
3823 | |
3824 | - MULTIBYTE_MARKERS_AND_SIZES = [ |
3825 | + #: :meta private: |
3826 | + MULTIBYTE_MARKERS_AND_SIZES:List[Tuple[int, int, int]] = [ |
3827 | (0xc2, 0xdf, 2), # 2-byte characters start with a byte C2-DF |
3828 | (0xe0, 0xef, 3), # 3-byte characters start with E0-EF |
3829 | (0xf0, 0xf4, 4), # 4-byte characters start with F0-F4 |
3830 | ] |
3831 | |
3832 | - FIRST_MULTIBYTE_MARKER = MULTIBYTE_MARKERS_AND_SIZES[0][0] |
3833 | - LAST_MULTIBYTE_MARKER = MULTIBYTE_MARKERS_AND_SIZES[-1][1] |
3834 | + #: :meta private: |
3835 | + FIRST_MULTIBYTE_MARKER:int = MULTIBYTE_MARKERS_AND_SIZES[0][0] |
3836 | + |
3837 | + #: :meta private: |
3838 | + LAST_MULTIBYTE_MARKER:int = MULTIBYTE_MARKERS_AND_SIZES[-1][1] |
3839 | |
3840 | @classmethod |
3841 | - def detwingle(cls, in_bytes, main_encoding="utf8", |
3842 | - embedded_encoding="windows-1252"): |
3843 | + def detwingle(cls, in_bytes:bytes, main_encoding:_Encoding="utf8", |
3844 | + embedded_encoding:_Encoding="windows-1252") -> bytes: |
3845 | """Fix characters from one encoding embedded in some other encoding. |
3846 | |
3847 | Currently the only situation supported is Windows-1252 (or its |
3848 | subset ISO-8859-1), embedded in UTF-8. |
3849 | |
3850 | :param in_bytes: A bytestring that you suspect contains |
3851 | - characters from multiple encodings. Note that this _must_ |
3852 | + characters from multiple encodings. Note that this *must* |
3853 | be a bytestring. If you've already converted the document |
3854 | to Unicode, you're too late. |
3855 | - :param main_encoding: The primary encoding of `in_bytes`. |
3856 | + :param main_encoding: The primary encoding of ``in_bytes``. |
3857 | :param embedded_encoding: The encoding that was used to embed characters |
3858 | in the main document. |
3859 | - :return: A bytestring in which `embedded_encoding` |
3860 | - characters have been converted to their `main_encoding` |
3861 | - equivalents. |
3862 | + :return: A bytestring similar to ``in_bytes``, in which |
3863 | + ``embedded_encoding`` characters have been converted to |
3864 | + their ``main_encoding`` equivalents. |
3865 | """ |
3866 | if embedded_encoding.replace('_', '-').lower() not in ( |
3867 | 'windows-1252', 'windows_1252'): |
3868 | @@ -1061,9 +1241,6 @@ class UnicodeDammit: |
3869 | pos = 0 |
3870 | while pos < len(in_bytes): |
3871 | byte = in_bytes[pos] |
3872 | - if not isinstance(byte, int): |
3873 | - # Python 2.x |
3874 | - byte = ord(byte) |
3875 | if (byte >= cls.FIRST_MULTIBYTE_MARKER |
3876 | and byte <= cls.LAST_MULTIBYTE_MARKER): |
3877 | # This is the start of a UTF-8 multibyte character. Skip |
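`detwingle` above walks the bytestring, using the ranges in `MULTIBYTE_MARKERS_AND_SIZES` to skip over genuine UTF-8 sequences and re-encode stray Windows-1252 bytes. A simplified stdlib sketch of that scan (the real method consults `WINDOWS_1252_TO_UTF8` directly, which also covers bytes that are undefined in Windows-1252):

```python
MULTIBYTE_MARKERS_AND_SIZES = [
    (0xc2, 0xdf, 2),  # 2-byte UTF-8 characters start with C2-DF
    (0xe0, 0xef, 3),  # 3-byte characters start with E0-EF
    (0xf0, 0xf4, 4),  # 4-byte characters start with F0-F4
]

def detwingle(in_bytes: bytes) -> bytes:
    """Re-encode stray Windows-1252 bytes found inside otherwise
    valid UTF-8, skipping over real UTF-8 multibyte sequences."""
    out = []
    pos = 0
    while pos < len(in_bytes):
        byte = in_bytes[pos]
        for start, end, size in MULTIBYTE_MARKERS_AND_SIZES:
            if start <= byte <= end:
                # A real UTF-8 lead byte: copy the sequence unchanged.
                out.append(in_bytes[pos:pos + size])
                pos += size
                break
        else:
            if byte >= 0x80:
                # A high byte outside every UTF-8 lead range: assume
                # Windows-1252 and re-encode as UTF-8. (A full version
                # would use a lookup table to handle undefined bytes.)
                out.append(bytes([byte]).decode('windows-1252').encode('utf8'))
            else:
                out.append(bytes([byte]))
            pos += 1
    return b''.join(out)
```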
3878 | diff --git a/bs4/diagnose.py b/bs4/diagnose.py |
3879 | index e079772..201b879 100644 |
3880 | --- a/bs4/diagnose.py |
3881 | +++ b/bs4/diagnose.py |
3882 | @@ -7,8 +7,11 @@ import cProfile |
3883 | from io import BytesIO |
3884 | from html.parser import HTMLParser |
3885 | import bs4 |
3886 | -from bs4 import BeautifulSoup, __version__ |
3887 | +from bs4 import BeautifulSoup, __version__ |
3888 | from bs4.builder import builder_registry |
3889 | +from typing import TYPE_CHECKING |
3890 | +if TYPE_CHECKING: |
3891 | + from bs4._typing import _IncomingMarkup |
3892 | |
3893 | import os |
3894 | import pstats |
3895 | @@ -19,10 +22,10 @@ import traceback |
3896 | import sys |
3897 | import cProfile |
3898 | |
3899 | -def diagnose(data): |
3900 | +def diagnose(data:_IncomingMarkup) -> None: |
3901 | """Diagnostic suite for isolating common problems. |
3902 | |
3903 | - :param data: A string containing markup that needs to be explained. |
3904 | + :param data: Some markup that needs to be explained. |
3905 | :return: None; diagnostics are printed to standard output. |
3906 | """ |
3907 | print(("Diagnostic running on Beautiful Soup %s" % __version__)) |
3908 | @@ -75,7 +78,7 @@ def diagnose(data): |
3909 | |
3910 | print(("-" * 80)) |
3911 | |
3912 | -def lxml_trace(data, html=True, **kwargs): |
3913 | +def lxml_trace(data, html:bool=True, **kwargs) -> None: |
3914 | """Print out the lxml events that occur during parsing. |
3915 | |
3916 | This lets you see how lxml parses a document when no Beautiful |
3917 | @@ -109,7 +112,7 @@ class AnnouncingParser(HTMLParser): |
3918 | print(s) |
3919 | |
3920 | def handle_starttag(self, name, attrs): |
3921 | - self._p("%s START" % name) |
3922 | + self._p(f"{name} {attrs} START") |
3923 | |
3924 | def handle_endtag(self, name): |
3925 | self._p("%s END" % name) |
3926 | @@ -146,11 +149,14 @@ def htmlparser_trace(data): |
3927 | parser = AnnouncingParser() |
3928 | parser.feed(data) |
3929 | |
3930 | -_vowels = "aeiou" |
3931 | -_consonants = "bcdfghjklmnpqrstvwxyz" |
3932 | +_vowels:str = "aeiou" |
3933 | +_consonants:str = "bcdfghjklmnpqrstvwxyz" |
3934 | |
3935 | -def rword(length=5): |
3936 | - "Generate a random word-like string." |
3937 | +def rword(length:int=5) -> str: |
3938 | + """Generate a random word-like string. |
3939 | + |
3940 | + :meta private: |
3941 | + """ |
3942 | s = '' |
3943 | for i in range(length): |
3944 | if i % 2 == 0: |
3945 | @@ -160,12 +166,18 @@ def rword(length=5): |
3946 | s += random.choice(t) |
3947 | return s |
3948 | |
3949 | -def rsentence(length=4): |
3950 | - "Generate a random sentence-like string." |
3951 | +def rsentence(length:int=4) -> str: |
3952 | + """Generate a random sentence-like string. |
3953 | + |
3954 | + :meta private: |
3955 | + """ |
3956 | return " ".join(rword(random.randint(4,9)) for i in range(length)) |
3957 | |
3958 | -def rdoc(num_elements=1000): |
3959 | - """Randomly generate an invalid HTML document.""" |
3960 | +def rdoc(num_elements:int=1000) -> str: |
3961 | + """Randomly generate an invalid HTML document. |
3962 | + |
3963 | + :meta private: |
3964 | + """ |
3965 | tag_names = ['p', 'div', 'span', 'i', 'b', 'script', 'table'] |
3966 | elements = [] |
3967 | for i in range(num_elements): |
3968 | @@ -182,24 +194,24 @@ def rdoc(num_elements=1000): |
3969 | elements.append("</%s>" % tag_name) |
3970 | return "<html>" + "\n".join(elements) + "</html>" |
3971 | |
3972 | -def benchmark_parsers(num_elements=100000): |
3973 | +def benchmark_parsers(num_elements:int=100000) -> None: |
3974 | """Very basic head-to-head performance benchmark.""" |
3975 | print(("Comparative parser benchmark on Beautiful Soup %s" % __version__)) |
3976 | data = rdoc(num_elements) |
3977 | print(("Generated a large invalid HTML document (%d bytes)." % len(data))) |
3978 | |
3979 | - for parser in ["lxml", ["lxml", "html"], "html5lib", "html.parser"]: |
3980 | + for parser_name in ["lxml", ["lxml", "html"], "html5lib", "html.parser"]: |
3981 | success = False |
3982 | try: |
3983 | a = time.time() |
3984 | - soup = BeautifulSoup(data, parser) |
3985 | + soup = BeautifulSoup(data, parser_name) |
3986 | b = time.time() |
3987 | success = True |
3988 | except Exception as e: |
3989 | - print(("%s could not parse the markup." % parser)) |
3990 | + print(("%s could not parse the markup." % parser_name)) |
3991 | traceback.print_exc() |
3992 | if success: |
3993 | - print(("BS4+%s parsed the markup in %.2fs." % (parser, b-a))) |
3994 | + print(("BS4+%s parsed the markup in %.2fs." % (parser_name, b-a))) |
3995 | |
3996 | from lxml import etree |
3997 | a = time.time() |
3998 | @@ -214,7 +226,7 @@ def benchmark_parsers(num_elements=100000): |
3999 | b = time.time() |
4000 | print(("Raw html5lib parsed the markup in %.2fs." % (b-a))) |
4001 | |
4002 | -def profile(num_elements=100000, parser="lxml"): |
4003 | +def profile(num_elements:int=100000, parser:str="lxml"): |
4004 | """Use Python's profiler on a randomly generated document.""" |
4005 | filehandle = tempfile.NamedTemporaryFile() |
4006 | filename = filehandle.name |
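`benchmark_parsers` above brackets each parse with wall-clock timestamps, including raw-parser baselines for lxml and html5lib. A stdlib-only sketch of the same measurement against bare `html.parser` (`time_parse` is an illustrative name):

```python
import time
from html.parser import HTMLParser

def time_parse(markup: str) -> float:
    """Time a bare html.parser pass over the markup, the same
    start/stop-timestamp pattern the benchmark uses."""
    parser = HTMLParser()
    start = time.perf_counter()
    parser.feed(markup)
    parser.close()
    return time.perf_counter() - start

elapsed = time_parse("<html>" + "<p>hi</p>" * 1000 + "</html>")
```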
4007 | diff --git a/bs4/element.py b/bs4/element.py |
4008 | index 0aefe73..8b3774e 100644 |
4009 | --- a/bs4/element.py |
4010 | +++ b/bs4/element.py |
4011 | @@ -1,55 +1,102 @@ |
4012 | +from __future__ import annotations |
4013 | # Use of this source code is governed by the MIT license. |
4014 | __license__ = "MIT" |
4015 | |
4016 | -try: |
4017 | - from collections.abc import Callable # Python 3.6 |
4018 | -except ImportError as e: |
4019 | - from collections import Callable |
4020 | import re |
4021 | import sys |
4022 | import warnings |
4023 | |
4024 | from bs4.css import CSS |
4025 | +from bs4._deprecation import ( |
4026 | + _deprecated, |
4027 | + _deprecated_alias, |
4028 | + _deprecated_function_alias, |
4029 | +) |
4030 | from bs4.formatter import ( |
4031 | Formatter, |
4032 | HTMLFormatter, |
4033 | XMLFormatter, |
4034 | ) |
4035 | |
4036 | -DEFAULT_OUTPUT_ENCODING = "utf-8" |
4037 | - |
4038 | -nonwhitespace_re = re.compile(r"\S+") |
4039 | - |
4040 | -# NOTE: This isn't used as of 4.7.0. I'm leaving it for a little bit on |
4041 | -# the off chance someone imported it for their own use. |
4042 | -whitespace_re = re.compile(r"\s+") |
4043 | +from typing import ( |
4044 | + Any, |
4045 | + Callable, |
4046 | + Dict, |
4047 | + Generator, |
4048 | + Generic, |
4049 | + Iterable, |
4050 | + Iterator, |
4051 | + List, |
4052 | + Mapping, |
4053 | + Optional, |
4054 | + Pattern, |
4055 | + Sequence, |
4056 | + Set, |
4057 | + TYPE_CHECKING, |
4058 | + Tuple, |
4059 | + Type, |
4060 | + TypeVar, |
4061 | + Union, |
4062 | + cast, |
4063 | +) |
4064 | +from typing_extensions import Self |
4065 | +if TYPE_CHECKING: |
4066 | + from bs4 import BeautifulSoup |
4067 | + from bs4.builder import TreeBuilder |
4068 | + from bs4.dammit import _Encoding |
4069 | + from bs4.formatter import ( |
4070 | + _EntitySubstitutionFunction, |
4071 | + _FormatterOrName, |
4072 | + ) |
4073 | + from bs4._typing import ( |
4074 | + _AttributeValue, |
4075 | + _AttributeValues, |
4076 | + _StrainableElement, |
4077 | + _StrainableAttribute, |
4078 | + _StrainableAttributes, |
4079 | + _StrainableString, |
4080 | + ) |
4081 | + |
4082 | +# Deprecated module-level attributes. |
4083 | +# See https://peps.python.org/pep-0562/ |
4084 | +_deprecated_names = dict( |
4085 | + whitespace_re = 'The {name} attribute was deprecated in version 4.7.0. If you need it, make your own copy.' |
4086 | +) |
4087 | +#: :meta private: |
4088 | +_deprecated_whitespace_re: Pattern[str] = re.compile(r"\s+") |
4089 | |
4090 | -def _alias(attr): |
4091 | - """Alias one attribute name to another for backward compatibility""" |
4092 | - @property |
4093 | - def alias(self): |
4094 | - return getattr(self, attr) |
4095 | - |
4096 | - @alias.setter |
4097 | - def alias(self): |
4098 | - return setattr(self, attr) |
4099 | - return alias |
4100 | - |
4101 | - |
4102 | -# These encodings are recognized by Python (so PageElement.encode |
4103 | -# could theoretically support them) but XML and HTML don't recognize |
4104 | -# them (so they should not show up in an XML or HTML document as that |
4105 | -# document's encoding). |
4106 | -# |
4107 | -# If an XML document is encoded in one of these encodings, no encoding |
4108 | -# will be mentioned in the XML declaration. If an HTML document is |
4109 | -# encoded in one of these encodings, and the HTML document has a |
4110 | -# <meta> tag that mentions an encoding, the encoding will be given as |
4111 | -# the empty string. |
4112 | -# |
4113 | -# Source: |
4114 | -# https://docs.python.org/3/library/codecs.html#python-specific-encodings |
4115 | -PYTHON_SPECIFIC_ENCODINGS = set([ |
4116 | +def __getattr__(name): |
4117 | + if name in _deprecated_names: |
4118 | + message = _deprecated_names[name] |
4119 | + warnings.warn( |
4120 | + message.format(name=name), |
4121 | + DeprecationWarning, stacklevel=2 |
4122 | + ) |
4123 | + |
4124 | + return globals()[f"_deprecated_{name}"] |
4125 | + raise AttributeError(f"module {__name__!r} has no attribute {name!r}") |
4126 | + |
4127 | +#: Documents output by Beautiful Soup will be encoded with |
4128 | +#: this encoding unless you specify otherwise. |
4129 | +DEFAULT_OUTPUT_ENCODING:str = "utf-8" |
4130 | + |
4131 | +#: A regular expression that can be used to split on whitespace. |
4132 | +nonwhitespace_re: Pattern[str] = re.compile(r"\S+") |
4133 | + |
4134 | +#: These encodings are recognized by Python (so `Tag.encode` |
4135 | +#: could theoretically support them) but XML and HTML don't recognize |
4136 | +#: them (so they should not show up in an XML or HTML document as that |
4137 | +#: document's encoding). |
4138 | +#: |
4139 | +#: If an XML document is encoded in one of these encodings, no encoding |
4140 | +#: will be mentioned in the XML declaration. If an HTML document is |
4141 | +#: encoded in one of these encodings, and the HTML document has a |
4142 | +#: <meta> tag that mentions an encoding, the encoding will be given as |
4143 | +#: the empty string. |
4144 | +#: |
4145 | +#: Source: |
4146 | +#: Python documentation, `Python Specific Encodings <https://docs.python.org/3/library/codecs.html#python-specific-encodings>`_ |
4147 | +PYTHON_SPECIFIC_ENCODINGS: Set[_Encoding] = set([ |
4148 | "idna", |
4149 | "mbcs", |
4150 | "oem", |
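The module-level `__getattr__` hook introduced above follows the PEP 562 deprecation pattern: the old attribute still resolves, but only through a function that first emits a `DeprecationWarning`. A self-contained sketch using a throwaway module (`legacy_demo` is an illustrative name; in bs4 this lives at the top level of bs4/element.py):

```python
import re
import types
import warnings

# Build a throwaway module object to demonstrate the hook.
mod = types.ModuleType("legacy_demo")
mod._deprecated_whitespace_re = re.compile(r"\s+")

def _module_getattr(name: str):
    if name == "whitespace_re":
        warnings.warn(
            "The whitespace_re attribute is deprecated; make your own copy.",
            DeprecationWarning, stacklevel=2,
        )
        return mod._deprecated_whitespace_re
    raise AttributeError(f"module {mod.__name__!r} has no attribute {name!r}")

# Per PEP 562, attribute lookups that miss the module __dict__ fall
# back to its __getattr__.
mod.__getattr__ = _module_getattr

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    pattern = mod.whitespace_re  # triggers the hook and the warning
```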
4151 | @@ -66,11 +113,17 @@ PYTHON_SPECIFIC_ENCODINGS = set([ |
4152 | |
4153 | |
4154 | class NamespacedAttribute(str): |
4155 | - """A namespaced string (e.g. 'xml:lang') that remembers the namespace |
4156 | - ('xml') and the name ('lang') that were used to create it. |
4157 | + """A namespaced attribute (e.g. the 'xml:lang' in 'xml:lang="en"') |
4158 | + which remembers the namespace prefix ('xml') and the name ('lang') |
4159 | + that were used to create it. |
4160 | """ |
4161 | |
4162 | - def __new__(cls, prefix, name=None, namespace=None): |
4163 | + prefix: Optional[str] |
4164 | + name: Optional[str] |
4165 | + namespace: Optional[str] |
4166 | + |
4167 | + def __new__(cls, prefix:Optional[str], |
4168 | + name:Optional[str]=None, namespace:Optional[str]=None): |
4169 | if not name: |
4170 | # This is the default namespace. Its name "has no value" |
4171 | # per https://www.w3.org/TR/xml-names/#defaulting |
4172 | @@ -89,72 +142,126 @@ class NamespacedAttribute(str): |
4173 | return obj |
4174 | |
4175 | class AttributeValueWithCharsetSubstitution(str): |
4176 | - """A stand-in object for a character encoding specified in HTML.""" |
4177 | + """An abstract class standing in for a character encoding specified |
4178 | + inside an HTML ``<meta>`` tag. |
4179 | |
4180 | -class CharsetMetaAttributeValue(AttributeValueWithCharsetSubstitution): |
4181 | - """A generic stand-in for the value of a meta tag's 'charset' attribute. |
4182 | + Subclasses exist for each place such a character encoding might be |
4183 | + found: either inside the ``charset`` attribute |
4184 | + (`CharsetMetaAttributeValue`) or inside the ``content`` attribute |
4185 | + (`ContentMetaAttributeValue`). |
4186 | |
4187 | - When Beautiful Soup parses the markup '<meta charset="utf8">', the |
4188 | - value of the 'charset' attribute will be one of these objects. |
4189 | + This allows Beautiful Soup to replace that part of the HTML file |
4190 | + with a different encoding when outputting a tree as a string. |
4191 | """ |
4192 | + # The original, un-encoded value of the ``content`` attribute. |
4193 | + #: :meta private: |
4194 | + original_value: str |
4195 | + |
4196 | + def substitute_encoding(self, eventual_encoding:str) -> str: |
4197 | + """Do whatever's necessary in this implementation-specific |
4198 | + portion of an HTML document to substitute in a specific encoding. |
4199 | + """ |
4200 | + raise NotImplementedError() |
4201 | + |
4202 | +class CharsetMetaAttributeValue(AttributeValueWithCharsetSubstitution): |
4203 | + """A generic stand-in for the value of a ``<meta>`` tag's ``charset`` |
4204 | + attribute. |
4205 | + |
4206 | + When Beautiful Soup parses the markup ``<meta charset="utf8">``, the |
4207 | + value of the ``charset`` attribute will become one of these objects. |
4208 | |
4209 | - def __new__(cls, original_value): |
4210 | + If the document is later encoded to an encoding other than UTF-8, its |
4211 | + ``<meta>`` tag will mention the new encoding instead of ``utf8``. |
4212 | + """ |
4213 | + def __new__(cls, original_value:str) -> Self: |
4214 | + # We don't need to use the original value for anything, but |
4215 | + # it might be useful for the user to know. |
4216 | obj = str.__new__(cls, original_value) |
4217 | obj.original_value = original_value |
4218 | return obj |
4219 | - |
4220 | - def encode(self, encoding): |
4221 | + |
4222 | + def substitute_encoding(self, eventual_encoding:_Encoding="utf-8") -> str: |
4223 | """When an HTML document is being encoded to a given encoding, the |
4224 | - value of a meta tag's 'charset' is the name of the encoding. |
4225 | + value of a ``<meta>`` tag's ``charset`` becomes the name of |
4226 | + the encoding. |
4227 | """ |
4228 | - if encoding in PYTHON_SPECIFIC_ENCODINGS: |
4229 | + if eventual_encoding in PYTHON_SPECIFIC_ENCODINGS: |
4230 | return '' |
4231 | - return encoding |
4232 | + return eventual_encoding |
4233 | |
4234 | |
4235 | class ContentMetaAttributeValue(AttributeValueWithCharsetSubstitution): |
4236 | - """A generic stand-in for the value of a meta tag's 'content' attribute. |
4237 | + """A generic stand-in for the value of a ``<meta>`` tag's ``content`` |
4238 | + attribute. |
4239 | |
4240 | When Beautiful Soup parses the markup: |
4241 | - <meta http-equiv="content-type" content="text/html; charset=utf8"> |
4242 | + ``<meta http-equiv="content-type" content="text/html; charset=utf8">`` |
4243 | |
4244 | - The value of the 'content' attribute will be one of these objects. |
4245 | - """ |
4246 | - |
4247 | - CHARSET_RE = re.compile(r"((^|;)\s*charset=)([^;]*)", re.M) |
4248 | + The value of the ``content`` attribute will become one of these objects. |
4249 | |
4250 | - def __new__(cls, original_value): |
4251 | + If the document is later encoded to an encoding other than UTF-8, its |
4252 | + ``<meta>`` tag will mention the new encoding instead of ``utf8``. |
4253 | + """ |
4254 | + #: Match the 'charset' argument inside the 'content' attribute |
4255 | + #: of a <meta> tag. |
4256 | + #: :meta private: |
4257 | + CHARSET_RE: Pattern[str] = re.compile( |
4258 | + r"((^|;)\s*charset=)([^;]*)", re.M |
4259 | + ) |
4260 | + |
4261 | + def __new__(cls, original_value:str) -> Self: |
4262 | match = cls.CHARSET_RE.search(original_value) |
4263 | - if match is None: |
4264 | - # No substitution necessary. |
4265 | - return str.__new__(str, original_value) |
4266 | - |
4267 | obj = str.__new__(cls, original_value) |
4268 | obj.original_value = original_value |
4269 | return obj |
4270 | |
4271 | - def encode(self, encoding): |
4272 | - if encoding in PYTHON_SPECIFIC_ENCODINGS: |
4273 | - return '' |
4274 | + def substitute_encoding(self, eventual_encoding:_Encoding="utf-8") -> str: |
4275 | + """When an HTML document is being encoded to a given encoding, the |
4276 | + value of the ``charset=`` in a ``<meta>`` tag's ``content`` becomes |
4277 | + the name of the encoding. |
4278 | + """ |
4279 | + if eventual_encoding in PYTHON_SPECIFIC_ENCODINGS: |
4280 | + return self.CHARSET_RE.sub('', self.original_value) |
4281 | def rewrite(match): |
4282 | - return match.group(1) + encoding |
4283 | + return match.group(1) + eventual_encoding |
4284 | return self.CHARSET_RE.sub(rewrite, self.original_value) |
4285 | |
4286 | |
4287 | class PageElement(object): |
4288 | - """Contains the navigational information for some part of the page: |
4289 | - that is, its current location in the parse tree. |
4290 | + """An abstract class representing a single element in the parse tree. |
4291 | |
4292 | - NavigableString, Tag, etc. are all subclasses of PageElement. |
4293 | + `NavigableString`, `Tag`, etc. are all subclasses of |
4294 | + `PageElement`. For this reason you'll see a lot of methods that |
4295 | + return `PageElement`, but you'll never see an actual `PageElement` |
4296 | + object. For the most part you can think of `PageElement` as |
4297 | + meaning "a `Tag` or a `NavigableString`." |
4298 | """ |
4299 | |
4300 | - # In general, we can't tell just by looking at an element whether |
4301 | - # it's contained in an XML document or an HTML document. But for |
4302 | - # Tags (q.v.) we can store this information at parse time. |
4303 | - known_xml = None |
4304 | - |
4305 | - def setup(self, parent=None, previous_element=None, next_element=None, |
4306 | - previous_sibling=None, next_sibling=None): |
4307 | + #: In general, we can't tell just by looking at an element whether |
4308 | + #: it's contained in an XML document or an HTML document. But for |
4309 | + #: `Tag` objects (q.v.) we can store this information at parse time. |
4310 | + #: :meta private: |
4311 | + known_xml: Optional[bool] = None |
4312 | + |
4313 | + #: Whether or not this element has been decomposed from the tree |
4314 | + #: it was created in. |
4315 | + _decomposed: bool |
4316 | + |
4317 | + parent: Optional[Tag] |
4318 | + next_element: Optional[PageElement] |
4319 | + previous_element: Optional[PageElement] |
4320 | + next_sibling: Optional[PageElement] |
4321 | + previous_sibling: Optional[PageElement] |
4322 | + |
4323 | + #: Whether or not this element is hidden from generated output. |
4324 | + #: Only the `BeautifulSoup` object itself is hidden. |
4325 | + hidden: bool=False |
4326 | + |
4327 | + def setup(self, parent:Optional[Tag]=None, |
4328 | + previous_element:Optional[PageElement]=None, |
4329 | + next_element:Optional[PageElement]=None, |
4330 | + previous_sibling:Optional[PageElement]=None, |
4331 | + next_sibling:Optional[PageElement]=None) -> None: |
4332 | """Sets up the initial relations between this element and |
4333 | other elements. |
4334 | |
4335 | @@ -175,7 +282,7 @@ class PageElement(object): |
4336 | self.parent = parent |
4337 | |
4338 | self.previous_element = previous_element |
4339 | - if previous_element is not None: |
4340 | + if self.previous_element is not None: |
4341 | self.previous_element.next_element = self |
4342 | |
4343 | self.next_element = next_element |
4344 | @@ -191,10 +298,10 @@ class PageElement(object): |
4345 | previous_sibling = self.parent.contents[-1] |
4346 | |
4347 | self.previous_sibling = previous_sibling |
4348 | - if previous_sibling is not None: |
4349 | + if self.previous_sibling is not None: |
4350 | self.previous_sibling.next_sibling = self |
4351 | |
4352 | - def format_string(self, s, formatter): |
4353 | + def format_string(self, s:str, formatter:Optional[_FormatterOrName]) -> str: |
4354 | """Format the given string using the given formatter. |
4355 | |
4356 | :param s: A string. |
4357 | @@ -207,28 +314,35 @@ class PageElement(object): |
4358 | output = formatter.substitute(s) |
4359 | return output |
4360 | |
4361 | - def formatter_for_name(self, formatter): |
4362 | + def formatter_for_name( |
4363 | + self, |
4364 | + formatter_name:Union[_FormatterOrName, _EntitySubstitutionFunction] |
4365 | + ) -> Formatter: |
4366 | """Look up or create a Formatter for the given identifier, |
4367 | if necessary. |
4368 | |
4369 | - :param formatter: Can be a Formatter object (used as-is), a |
4370 | + :param formatter: Can be a `Formatter` object (used as-is), a |
4371 | function (used as the entity substitution hook for an |
4372 | - XMLFormatter or HTMLFormatter), or a string (used to look |
4373 | - up an XMLFormatter or HTMLFormatter in the appropriate |
4374 | + `XMLFormatter` or `HTMLFormatter`), or a string (used to look |
4375 | + up an `XMLFormatter` or `HTMLFormatter` in the appropriate |
4376 | registry. |
4377 | """ |
4378 | - if isinstance(formatter, Formatter): |
4379 | - return formatter |
4380 | + if isinstance(formatter_name, Formatter): |
4381 | + return formatter_name |
4382 | + c: type[Formatter] |
4383 | + registry: Mapping[Optional[str], Formatter] |
4384 | if self._is_xml: |
4385 | c = XMLFormatter |
4386 | + registry = XMLFormatter.REGISTRY |
4387 | else: |
4388 | c = HTMLFormatter |
4389 | - if isinstance(formatter, Callable): |
4390 | - return c(entity_substitution=formatter) |
4391 | - return c.REGISTRY[formatter] |
4392 | + registry = HTMLFormatter.REGISTRY |
4393 | + if callable(formatter_name): |
4394 | + return c(entity_substitution=formatter_name) |
4395 | + return registry[formatter_name] |
4396 | |
4397 | @property |
4398 | - def _is_xml(self): |
4399 | + def _is_xml(self) -> bool: |
4400 | """Is this element part of an XML tree or an HTML tree? |
4401 | |
4402 | This is used in formatter_for_name, when deciding whether an |
4403 | @@ -250,28 +364,41 @@ class PageElement(object): |
4404 | return getattr(self, 'is_xml', False) |
4405 | return self.parent._is_xml |
4406 | |
4407 | - nextSibling = _alias("next_sibling") # BS3 |
4408 | - previousSibling = _alias("previous_sibling") # BS3 |
4409 | + nextSibling = _deprecated_alias("nextSibling", "next_sibling", "4.0.0") |
4410 | + previousSibling = _deprecated_alias( |
4411 | + "previousSibling", "previous_sibling", "4.0.0" |
4412 | + ) |
4413 | |
4414 | - default = object() |
4415 | - def _all_strings(self, strip=False, types=default): |
4416 | + def __deepcopy__(self, memo:Dict, recursive:bool=False) -> Self: |
4417 | + raise NotImplementedError() |
4418 | + |
4419 | + def __copy__(self) -> Self: |
4420 | + """A copy of a PageElement can only be a deep copy, because |
4421 | + only one PageElement can occupy a given place in a parse tree. |
4422 | + """ |
4423 | + return self.__deepcopy__({}) |
4424 | + |
4425 | + default: Iterable[type[NavigableString]] = tuple() #: :meta private: |
4426 | + def _all_strings(self, strip:bool=False, types:Iterable[type[NavigableString]]=default) -> Iterator[str]: |
4427 | """Yield all strings of certain classes, possibly stripping them. |
4428 | |
4429 | - This is implemented differently in Tag and NavigableString. |
4430 | + This is implemented differently in `Tag` and `NavigableString`. |
4431 | """ |
4432 | raise NotImplementedError() |
4433 | |
4434 | @property |
4435 | - def stripped_strings(self): |
4436 | - """Yield all strings in this PageElement, stripping them first. |
4437 | + def stripped_strings(self) -> Iterator[str]: |
4438 | + """Yield all interesting strings in this PageElement, stripping them |
4439 | + first. |
4440 | |
4441 | - :yield: A sequence of stripped strings. |
4442 | + See `Tag` for information on which strings are considered |
4443 | + interesting in a given context. |
4444 | """ |
4445 | for string in self._all_strings(True): |
4446 | yield string |
4447 | |
4448 | - def get_text(self, separator="", strip=False, |
4449 | - types=default): |
4450 | + def get_text(self, separator:str="", strip:bool=False, |
4451 | + types:Iterable[Type[NavigableString]]=default) -> str: |
4452 | """Get all child strings of this PageElement, concatenated using the |
4453 | given separator. |
4454 | |
4455 | @@ -294,19 +421,19 @@ class PageElement(object): |
4456 | getText = get_text |
4457 | text = property(get_text) |
4458 | |
4459 | - def replace_with(self, *args): |
4460 | - """Replace this PageElement with one or more PageElements, keeping the |
4461 | - rest of the tree the same. |
4462 | + def replace_with(self, *args:PageElement) -> PageElement: |
4463 | + """Replace this `PageElement` with one or more other `PageElements`, |
4464 | + keeping the rest of the tree the same. |
4465 | |
4466 | - :param args: One or more PageElements. |
4467 | - :return: `self`, no longer part of the tree. |
4468 | + :return: This `PageElement`, no longer part of the tree. |
4469 | """ |
4470 | if self.parent is None: |
4471 | raise ValueError( |
4472 | "Cannot replace one element with another when the " |
4473 | "element to be replaced is not part of a tree.") |
4474 | if len(args) == 1 and args[0] is self: |
4475 | - return |
4476 | + # Replacing an element with itself is a no-op. |
4477 | + return self |
4478 | if any(x is self.parent for x in args): |
4479 | raise ValueError("Cannot replace a Tag with its parent.") |
4480 | old_parent = self.parent |
4481 | @@ -315,45 +442,28 @@ class PageElement(object): |
4482 | for idx, replace_with in enumerate(args, start=my_index): |
4483 | old_parent.insert(idx, replace_with) |
4484 | return self |
4485 | - replaceWith = replace_with # BS3 |
4486 | + replaceWith = _deprecated_function_alias( |
4487 | + "replaceWith", "replace_with", "4.0.0" |
4488 | + ) |
4489 | |
4490 | - def unwrap(self): |
4491 | - """Replace this PageElement with its contents. |
4492 | + def wrap(self, wrap_inside:Tag) -> Tag: |
4493 | + """Wrap this `PageElement` inside a `Tag`. |
4494 | |
4495 | - :return: `self`, no longer part of the tree. |
4496 | - """ |
4497 | - my_parent = self.parent |
4498 | - if self.parent is None: |
4499 | - raise ValueError( |
4500 | - "Cannot replace an element with its contents when that" |
4501 | - "element is not part of a tree.") |
4502 | - my_index = self.parent.index(self) |
4503 | - self.extract(_self_index=my_index) |
4504 | - for child in reversed(self.contents[:]): |
4505 | - my_parent.insert(my_index, child) |
4506 | - return self |
4507 | - replace_with_children = unwrap |
4508 | - replaceWithChildren = unwrap # BS3 |
4509 | - |
4510 | - def wrap(self, wrap_inside): |
4511 | - """Wrap this PageElement inside another one. |
4512 | - |
4513 | - :param wrap_inside: A PageElement. |
4514 | - :return: `wrap_inside`, occupying the position in the tree that used |
4515 | - to be occupied by `self`, and with `self` inside it. |
4516 | + :return: ``wrap_inside``, occupying the position in the tree that used |
4517 | + to be occupied by this object, and with this object now inside it. |
4518 | """ |
4519 | me = self.replace_with(wrap_inside) |
4520 | wrap_inside.append(me) |
4521 | return wrap_inside |
4522 | |
4523 | - def extract(self, _self_index=None): |
4524 | + def extract(self, _self_index:Optional[int]=None) -> PageElement: |
4525 | """Destructively rips this element out of the tree. |
4526 | |
4527 | :param _self_index: The location of this element in its parent's |
4528 | .contents, if known. Passing this in allows for a performance |
4529 | optimization. |
4530 | |
4531 | - :return: `self`, no longer part of the tree. |
4532 | + :return: this `PageElement`, no longer part of the tree. |
4533 | """ |
4534 | if self.parent is not None: |
4535 | if _self_index is None: |
4536 | @@ -364,11 +474,17 @@ class PageElement(object): |
4537 | #this element (and any children) hadn't been parsed. Connect |
4538 | #the two. |
4539 | last_child = self._last_descendant() |
4540 | + |
4541 | + # last_child can't be None because we passed accept_self=True |
4542 | + # into _last_descendant. Worst case, last_child will be |
4543 | + # self. Making this cast removes several mypy complaints later |
4544 | + # on as we manipulate last_child. |
4545 | + last_child = cast(PageElement, last_child) |
4546 | next_element = last_child.next_element |
4547 | |
4548 | - if (self.previous_element is not None and |
4549 | - self.previous_element is not next_element): |
4550 | - self.previous_element.next_element = next_element |
4551 | + if self.previous_element is not None: |
4552 | + if self.previous_element is not next_element: |
4553 | + self.previous_element.next_element = next_element |
4554 | if next_element is not None and next_element is not self.previous_element: |
4555 | next_element.previous_element = self.previous_element |
4556 | self.previous_element = None |
4557 | @@ -384,12 +500,38 @@ class PageElement(object): |
4558 | self.previous_sibling = self.next_sibling = None |
4559 | return self |
4560 | |
4561 | - def _last_descendant(self, is_initialized=True, accept_self=True): |
4562 | + def decompose(self) -> None: |
4563 | + """Recursively destroys this `PageElement` and its children. |
4564 | + |
4565 | + The element will be removed from the tree and wiped out; so |
4566 | + will everything beneath it. |
4567 | + |
4568 | + The behavior of a decomposed `PageElement` is undefined and you |
4569 | + should never use one for anything, but if you need to *check* |
4570 | + whether an element has been decomposed, you can use the |
4571 | + `PageElement.decomposed` property. |
4572 | + """ |
4573 | + self.extract() |
4574 | + e: Optional[PageElement] = self |
4575 | + next_up: Optional[PageElement] = None |
4576 | + while e is not None: |
4577 | + next_up = e.next_element |
4578 | + e.__dict__.clear() |
4579 | + if isinstance(e, Tag): |
4580 | + e.contents = [] |
4581 | + e._decomposed = True |
4582 | + e = next_up |
4583 | + |
4584 | + def _last_descendant( |
4585 | + self, is_initialized:bool=True, accept_self:bool=True |
4586 | + ) -> Optional[PageElement]: |
4587 | """Finds the last element beneath this object to be parsed. |
4588 | |
4589 | - :param is_initialized: Has `setup` been called on this PageElement |
4590 | - yet? |
4591 | - :param accept_self: Is `self` an acceptable answer to the question? |
4592 | + :param is_initialized: Has `PageElement.setup` been called on |
4593 | + this `PageElement` yet? |
4594 | + |
4595 | + :param accept_self: Is ``self`` an acceptable answer to the |
4596 | + question? |
4597 | """ |
4598 | if is_initialized and self.next_sibling is not None: |
4599 | last_child = self.next_sibling.previous_element |
4600 | @@ -400,121 +542,15 @@ class PageElement(object): |
4601 | if not accept_self and last_child is self: |
4602 | last_child = None |
4603 | return last_child |
4604 | - # BS3: Not part of the API! |
4605 | - _lastRecursiveChild = _last_descendant |
4606 | |
4607 | - def insert(self, position, new_child): |
4608 | - """Insert a new PageElement in the list of this PageElement's children. |
4609 | - |
4610 | - This works the same way as `list.insert`. |
4611 | - |
4612 | - :param position: The numeric position that should be occupied |
4613 | - in `self.children` by the new PageElement. |
4614 | - :param new_child: A PageElement. |
4615 | - """ |
4616 | - if new_child is None: |
4617 | - raise ValueError("Cannot insert None into a tag.") |
4618 | - if new_child is self: |
4619 | - raise ValueError("Cannot insert a tag into itself.") |
4620 | - if (isinstance(new_child, str) |
4621 | - and not isinstance(new_child, NavigableString)): |
4622 | - new_child = NavigableString(new_child) |
4623 | + _lastRecursiveChild = _deprecated_alias("_lastRecursiveChild", "_last_descendant", "4.0.0") |
4624 | |
4625 | - from bs4 import BeautifulSoup |
4626 | - if isinstance(new_child, BeautifulSoup): |
4627 | - # We don't want to end up with a situation where one BeautifulSoup |
4628 | - # object contains another. Insert the children one at a time. |
4629 | - for subchild in list(new_child.contents): |
4630 | - self.insert(position, subchild) |
4631 | - position += 1 |
4632 | - return |
4633 | - position = min(position, len(self.contents)) |
4634 | - if hasattr(new_child, 'parent') and new_child.parent is not None: |
4635 | - # We're 'inserting' an element that's already one |
4636 | - # of this object's children. |
4637 | - if new_child.parent is self: |
4638 | - current_index = self.index(new_child) |
4639 | - if current_index < position: |
4640 | - # We're moving this element further down the list |
4641 | - # of this object's children. That means that when |
4642 | - # we extract this element, our target index will |
4643 | - # jump down one. |
4644 | - position -= 1 |
4645 | - new_child.extract() |
4646 | - |
4647 | - new_child.parent = self |
4648 | - previous_child = None |
4649 | - if position == 0: |
4650 | - new_child.previous_sibling = None |
4651 | - new_child.previous_element = self |
4652 | - else: |
4653 | - previous_child = self.contents[position - 1] |
4654 | - new_child.previous_sibling = previous_child |
4655 | - new_child.previous_sibling.next_sibling = new_child |
4656 | - new_child.previous_element = previous_child._last_descendant(False) |
4657 | - if new_child.previous_element is not None: |
4658 | - new_child.previous_element.next_element = new_child |
4659 | - |
4660 | - new_childs_last_element = new_child._last_descendant(False) |
4661 | - |
4662 | - if position >= len(self.contents): |
4663 | - new_child.next_sibling = None |
4664 | - |
4665 | - parent = self |
4666 | - parents_next_sibling = None |
4667 | - while parents_next_sibling is None and parent is not None: |
4668 | - parents_next_sibling = parent.next_sibling |
4669 | - parent = parent.parent |
4670 | - if parents_next_sibling is not None: |
4671 | - # We found the element that comes next in the document. |
4672 | - break |
4673 | - if parents_next_sibling is not None: |
4674 | - new_childs_last_element.next_element = parents_next_sibling |
4675 | - else: |
4676 | - # The last element of this tag is the last element in |
4677 | - # the document. |
4678 | - new_childs_last_element.next_element = None |
4679 | - else: |
4680 | - next_child = self.contents[position] |
4681 | - new_child.next_sibling = next_child |
4682 | - if new_child.next_sibling is not None: |
4683 | - new_child.next_sibling.previous_sibling = new_child |
4684 | - new_childs_last_element.next_element = next_child |
4685 | - |
4686 | - if new_childs_last_element.next_element is not None: |
4687 | - new_childs_last_element.next_element.previous_element = new_childs_last_element |
4688 | - self.contents.insert(position, new_child) |
4689 | - |
4690 | - def append(self, tag): |
4691 | - """Appends the given PageElement to the contents of this one. |
4692 | - |
4693 | - :param tag: A PageElement. |
4694 | - """ |
4695 | - self.insert(len(self.contents), tag) |
4696 | - |
4697 | - def extend(self, tags): |
4698 | - """Appends the given PageElements to this one's contents. |
4699 | - |
4700 | - :param tags: A list of PageElements. If a single Tag is |
4701 | - provided instead, this PageElement's contents will be extended |
4702 | - with that Tag's contents. |
4703 | - """ |
4704 | - if isinstance(tags, Tag): |
4705 | - tags = tags.contents |
4706 | - if isinstance(tags, list): |
4707 | - # Moving items around the tree may change their position in |
4708 | - # the original list. Make a list that won't change. |
4709 | - tags = list(tags) |
4710 | - for tag in tags: |
4711 | - self.append(tag) |
4712 | - |
4713 | - def insert_before(self, *args): |
4714 | + def insert_before(self, *args:PageElement) -> None: |
4715 | """Makes the given element(s) the immediate predecessor of this one. |
4716 | |
4717 | - All the elements will have the same parent, and the given elements |
4718 | - will be immediately before this one. |
4719 | - |
4720 | - :param args: One or more PageElements. |
4721 | + All the elements will have the same `PageElement.parent` as |
4722 | + this one, and the given elements will occur immediately before |
4723 | + this one. |
4724 | """ |
4725 | parent = self.parent |
4726 | if parent is None: |
4727 | @@ -530,13 +566,12 @@ class PageElement(object): |
4728 | index = parent.index(self) |
4729 | parent.insert(index, predecessor) |
4730 | |
4731 | - def insert_after(self, *args): |
4732 | + def insert_after(self, *args:PageElement) -> None: |
4733 | """Makes the given element(s) the immediate successor of this one. |
4734 | |
4735 | - The elements will have the same parent, and the given elements |
4736 | - will be immediately after this one. |
4737 | - |
4738 | - :param args: One or more PageElements. |
4739 | + The elements will have the same `PageElement.parent` as this |
4740 | + one, and the given elements will occur immediately after this |
4741 | + one. |
4742 | """ |
4743 | # Do all error checking before modifying the tree. |
4744 | parent = self.parent |
4745 | @@ -556,7 +591,14 @@ class PageElement(object): |
4746 | parent.insert(index+1+offset, successor) |
4747 | offset += 1 |
4748 | |
4749 | - def find_next(self, name=None, attrs={}, string=None, **kwargs): |
4750 | + def find_next( |
4751 | + self, |
4752 | + name:Optional[_StrainableElement]=None, |
4753 | + attrs:_StrainableAttributes={}, |
4754 | + string:Optional[_StrainableString]=None, |
4755 | + node:Optional[_TagOrStringMatchFunction]=None, |
4756 | + **kwargs:_StrainableAttribute |
4757 | + ) -> Optional[PageElement]: |
4758 | """Find the first PageElement that matches the given criteria and |
4759 | appears later in the document than this PageElement. |
4760 | |
4761 | @@ -564,36 +606,47 @@ class PageElement(object): |
4762 | documentation for detailed explanations. |
4763 | |
4764 | :param name: A filter on tag name. |
4765 | - :param attrs: A dictionary of filters on attribute values. |
4766 | + :param attrs: Additional filters on attribute values. |
4767 | :param string: A filter for a NavigableString with specific text. |
4768 | - :kwargs: A dictionary of filters on attribute values. |
4769 | - :return: A PageElement. |
4770 | - :rtype: bs4.element.Tag | bs4.element.NavigableString |
4771 | - """ |
4772 | - return self._find_one(self.find_all_next, name, attrs, string, **kwargs) |
4773 | - findNext = find_next # BS3 |
4774 | - |
4775 | - def find_all_next(self, name=None, attrs={}, string=None, limit=None, |
4776 | - **kwargs): |
4777 | - """Find all PageElements that match the given criteria and appear |
4778 | - later in the document than this PageElement. |
4779 | + :kwargs: Additional filters on attribute values. |
4780 | + """ |
4781 | + return self._find_one(self.find_all_next, name, attrs, string, node, **kwargs) |
4782 | + findNext = _deprecated_function_alias("findNext", "find_next", "4.0.0") |
4783 | + |
4784 | + def find_all_next( |
4785 | + self, |
4786 | + name:Optional[_StrainableElement]=None, |
4787 | + attrs:_StrainableAttributes={}, |
4788 | + string:Optional[_StrainableString]=None, |
4789 | + limit:Optional[int]=None, |
4790 | + node:Optional[_TagOrStringMatchFunction]=None, |
4791 | + _stacklevel:int=2, |
4792 | + **kwargs:_StrainableAttribute |
4793 | + ) -> ResultSet[PageElement]: |
4794 | + """Find all `PageElement` objects that match the given criteria and |
4795 | + appear later in the document than this `PageElement`. |
4796 | |
4797 | All find_* methods take a common set of arguments. See the online |
4798 | documentation for detailed explanations. |
4799 | |
4800 | :param name: A filter on tag name. |
4801 | - :param attrs: A dictionary of filters on attribute values. |
4802 | + :param attrs: Additional filters on attribute values. |
4803 | :param string: A filter for a NavigableString with specific text. |
4804 | :param limit: Stop looking after finding this many results. |
4805 | - :kwargs: A dictionary of filters on attribute values. |
4806 | - :return: A ResultSet containing PageElements. |
4807 | + :param _stacklevel: Used internally to improve warning messages. |
4808 | + :kwargs: Additional filters on attribute values. |
4809 | """ |
4810 | - _stacklevel = kwargs.pop('_stacklevel', 2) |
4811 | return self._find_all(name, attrs, string, limit, self.next_elements, |
4812 | - _stacklevel=_stacklevel+1, **kwargs) |
4813 | - findAllNext = find_all_next # BS3 |
4814 | - |
4815 | - def find_next_sibling(self, name=None, attrs={}, string=None, **kwargs): |
4816 | + node, _stacklevel=_stacklevel+1, **kwargs) |
4817 | + findAllNext = _deprecated_function_alias("findAllNext", "find_all_next", "4.0.0") |
4818 | + |
4819 | + def find_next_sibling( |
4820 | + self, |
4821 | + name:Optional[_StrainableElement]=None, |
4822 | + attrs:_StrainableAttributes={}, |
4823 | + string:Optional[_StrainableString]=None, |
4824 | + node:Optional[_TagOrStringMatchFunction]=None, |
4825 | + **kwargs:_StrainableAttribute) -> Optional[PageElement]: |
4826 | """Find the closest sibling to this PageElement that matches the |
4827 | given criteria and appears later in the document. |
4828 | |
4829 | @@ -601,102 +654,143 @@ class PageElement(object): |
4830 | online documentation for detailed explanations. |
4831 | |
4832 | :param name: A filter on tag name. |
4833 | - :param attrs: A dictionary of filters on attribute values. |
4834 | - :param string: A filter for a NavigableString with specific text. |
4835 | - :kwargs: A dictionary of filters on attribute values. |
4836 | - :return: A PageElement. |
4837 | - :rtype: bs4.element.Tag | bs4.element.NavigableString |
4838 | + :param attrs: Additional filters on attribute values. |
4839 | + :param string: A filter for a `NavigableString` with specific text. |
4840 | + :kwargs: Additional filters on attribute values. |
4841 | """ |
4842 | return self._find_one(self.find_next_siblings, name, attrs, string, |
4843 | - **kwargs) |
4844 | - findNextSibling = find_next_sibling # BS3 |
4845 | - |
4846 | - def find_next_siblings(self, name=None, attrs={}, string=None, limit=None, |
4847 | - **kwargs): |
4848 | - """Find all siblings of this PageElement that match the given criteria |
4849 | + node, **kwargs) |
4850 | + findNextSibling = _deprecated_function_alias( |
4851 | + "findNextSibling", "find_next_sibling", "4.0.0" |
4852 | + ) |
4853 | + |
4854 | + def find_next_siblings( |
4855 | + self, |
4856 | + name:Optional[_StrainableElement]=None, |
4857 | + attrs:_StrainableAttributes={}, |
4858 | + string:Optional[_StrainableString]=None, |
4859 | + limit:Optional[int]=None, |
4860 | + node:Optional[_TagOrStringMatchFunction]=None, |
4861 | + _stacklevel:int=2, |
4862 | + **kwargs:_StrainableAttribute |
4863 | + ) -> ResultSet[PageElement]: |
4864 | + """Find all siblings of this `PageElement` that match the given criteria |
4865 | and appear later in the document. |
4866 | |
4867 | All find_* methods take a common set of arguments. See the online |
4868 | documentation for detailed explanations. |
4869 | |
4870 | :param name: A filter on tag name. |
4871 | - :param attrs: A dictionary of filters on attribute values. |
4872 | - :param string: A filter for a NavigableString with specific text. |
4873 | + :param attrs: Additional filters on attribute values. |
4874 | + :param string: A filter for a `NavigableString` with specific text. |
4875 | :param limit: Stop looking after finding this many results. |
4876 | - :kwargs: A dictionary of filters on attribute values. |
4877 | - :return: A ResultSet of PageElements. |
4878 | - :rtype: bs4.element.ResultSet |
4879 | + :param _stacklevel: Used internally to improve warning messages. |
4880 | + :kwargs: Additional filters on attribute values. |
4881 | """ |
4882 | - _stacklevel = kwargs.pop('_stacklevel', 2) |
4883 | return self._find_all( |
4884 | name, attrs, string, limit, |
4885 | - self.next_siblings, _stacklevel=_stacklevel+1, **kwargs |
4886 | + self.next_siblings, node, _stacklevel=_stacklevel+1, **kwargs |
4887 | ) |
4888 | - findNextSiblings = find_next_siblings # BS3 |
4889 | - fetchNextSiblings = find_next_siblings # BS2 |
4890 | - |
4891 | - def find_previous(self, name=None, attrs={}, string=None, **kwargs): |
4892 | - """Look backwards in the document from this PageElement and find the |
4893 | - first PageElement that matches the given criteria. |
4894 | + findNextSiblings = _deprecated_function_alias( |
4895 | + "findNextSiblings", "find_next_siblings", "4.0.0" |
4896 | + ) |
4897 | + fetchNextSiblings = _deprecated_function_alias( |
4898 | + "fetchNextSiblings", "find_next_siblings", "3.0.0" |
4899 | + ) |
4900 | + |
4901 | + def find_previous( |
4902 | + self, |
4903 | + name:Optional[_StrainableElement]=None, |
4904 | + attrs:_StrainableAttributes={}, |
4905 | + string:Optional[_StrainableString]=None, |
4906 | + node:Optional[_TagOrStringMatchFunction]=None, |
4907 | + **kwargs:_StrainableAttribute) -> Optional[PageElement]: |
4908 | + """Look backwards in the document from this `PageElement` and find the |
4909 | + first `PageElement` that matches the given criteria. |
4910 | |
4911 | All find_* methods take a common set of arguments. See the online |
4912 | documentation for detailed explanations. |
4913 | |
4914 | :param name: A filter on tag name. |
4915 | - :param attrs: A dictionary of filters on attribute values. |
4916 | - :param string: A filter for a NavigableString with specific text. |
4917 | - :kwargs: A dictionary of filters on attribute values. |
4918 | - :return: A PageElement. |
4919 | - :rtype: bs4.element.Tag | bs4.element.NavigableString |
4920 | + :param attrs: Additional filters on attribute values. |
4921 | + :param string: A filter for a `NavigableString` with specific text. |
4922 | + :kwargs: Additional filters on attribute values. |
4923 | """ |
4924 | return self._find_one( |
4925 | - self.find_all_previous, name, attrs, string, **kwargs) |
4926 | - findPrevious = find_previous # BS3 |
4927 | - |
4928 | - def find_all_previous(self, name=None, attrs={}, string=None, limit=None, |
4929 | - **kwargs): |
4930 | - """Look backwards in the document from this PageElement and find all |
4931 | - PageElements that match the given criteria. |
4932 | + self.find_all_previous, name, attrs, string, node, **kwargs) |
4933 | + |
4934 | + findPrevious = _deprecated_function_alias( |
4935 | + "findPrevious", "find_previous", "3.0.0" |
4936 | + ) |
4937 | + |
4938 | + def find_all_previous( |
4939 | + self, |
4940 | + name:Optional[_StrainableElement]=None, |
4941 | + attrs:_StrainableAttributes={}, |
4942 | + string:Optional[_StrainableString]=None, |
4943 | + limit:Optional[int]=None, |
4944 | + node:Optional[_TagOrStringMatchFunction]=None, |
4945 | + _stacklevel:int=2, |
4946 | + **kwargs:_StrainableAttribute |
4947 | + ) -> ResultSet[PageElement]: |
4948 | + """Look backwards in the document from this `PageElement` and find all |
4949 | + `PageElement` objects that match the given criteria. |
4950 | |
4951 | All find_* methods take a common set of arguments. See the online |
4952 | documentation for detailed explanations. |
4953 | |
4954 | :param name: A filter on tag name. |
4955 | - :param attrs: A dictionary of filters on attribute values. |
4956 | - :param string: A filter for a NavigableString with specific text. |
4957 | + :param attrs: Additional filters on attribute values. |
4958 | + :param string: A filter for a `NavigableString` with specific text. |
4959 | :param limit: Stop looking after finding this many results. |
4960 | - :kwargs: A dictionary of filters on attribute values. |
4961 | - :return: A ResultSet of PageElements. |
4962 | - :rtype: bs4.element.ResultSet |
4963 | + :param _stacklevel: Used internally to improve warning messages. |
4964 | + :kwargs: Additional filters on attribute values. |
4965 | """ |
4966 | - _stacklevel = kwargs.pop('_stacklevel', 2) |
4967 | return self._find_all( |
4968 | name, attrs, string, limit, self.previous_elements, |
4969 | - _stacklevel=_stacklevel+1, **kwargs |
4970 | + node, _stacklevel=_stacklevel+1, **kwargs |
4971 | ) |
4972 | - findAllPrevious = find_all_previous # BS3 |
4973 | - fetchPrevious = find_all_previous # BS2 |
4974 | - |
4975 | - def find_previous_sibling(self, name=None, attrs={}, string=None, **kwargs): |
4976 | - """Returns the closest sibling to this PageElement that matches the |
4977 | + findAllPrevious = _deprecated_function_alias( |
4978 | + "findAllPrevious", "find_all_previous", "4.0.0" |
4979 | + ) |
4980 | + fetchAllPrevious = _deprecated_function_alias( |
4981 | + "fetchAllPrevious", "find_all_previous", "3.0.0" |
4982 | + ) |
4983 | + |
4984 | + def find_previous_sibling( |
4985 | + self, |
4986 | + name:Optional[_StrainableElement]=None, |
4987 | + attrs:_StrainableAttributes={}, |
4988 | + string:Optional[_StrainableString]=None, |
4989 | + node:Optional[_TagOrStringMatchFunction]=None, |
4990 | + **kwargs:_StrainableAttribute) -> Optional[PageElement]: |
4991 | + """Returns the closest sibling to this `PageElement` that matches the |
4992 | given criteria and appears earlier in the document. |
4993 | |
4994 | All find_* methods take a common set of arguments. See the online |
4995 | documentation for detailed explanations. |
4996 | |
4997 | :param name: A filter on tag name. |
4998 | - :param attrs: A dictionary of filters on attribute values. |
4999 | - :param string: A filter for a NavigableString with specific text. |
5000 | - :kwargs: A dictionary of filters on attribute values. |
I will redo this merge proposal on top of the 4.13 branch.
Before I start, are there any aspects that you'd like to see implemented differently?
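As the description notes, most of this diff threads the new `node` filter down into the find_* machinery; the filtering idea itself is small. Its essence can be sketched with plain-Python stand-ins (the `TagNode` class and flattened `document` list here are hypothetical illustrations, not the real bs4 `Tag`/`NavigableString` types): a single match function is applied to every node, whether it is a tag or a bare string.

```python
class TagNode:
    """Hypothetical stand-in for a parsed tag (not bs4's Tag)."""
    def __init__(self, name):
        self.name = name

# A flattened document: tags and strings interleaved, roughly as they
# appear in document order in a parse tree.
document = [TagNode("p"), "one", TagNode("b"), "two", "three"]

def match(node):
    # One function sees every node type: match <b> tags and any bare
    # string containing "three".
    if isinstance(node, TagNode):
        return node.name == "b"
    return "three" in node

found = [n for n in document if match(n)]
# found holds the <b> TagNode and the string "three"
```

The point of the proposal is that today's name/attrs/string filters each consider only one kind of object, whereas a `node` function like the one above is consulted for every `PageElement` regardless of type.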