Merge beautifulsoup:more-modular-soupstrainers into beautifulsoup:4.13
Proposed by Leonard Richardson
Status: Merged
Merged at revision: c23dd48ebea467fcf028e14287f07d2c51e62975
Proposed branch: beautifulsoup:more-modular-soupstrainers
Merge into: beautifulsoup:4.13
Diff against target: 2064 lines (+710/-262), 18 files modified
- CHANGELOG (+18/-1)
- bs4/__init__.py (+131/-84)
- bs4/_typing.py (+19/-1)
- bs4/builder/__init__.py (+8/-8)
- bs4/builder/_html5lib.py (+123/-67)
- bs4/builder/_htmlparser.py (+12/-2)
- bs4/builder/_lxml.py (+1/-1)
- bs4/diagnose.py (+27/-15)
- bs4/element.py (+24/-20)
- bs4/filter.py (+167/-36)
- bs4/tests/__init__.py (+1/-1)
- bs4/tests/test_filter.py (+125/-8)
- bs4/tests/test_html5lib.py (+2/-2)
- bs4/tests/test_lxml.py (+1/-1)
- bs4/tests/test_pageelement.py (+1/-1)
- bs4/tests/test_soup.py (+2/-2)
- bs4/tests/test_tree.py (+1/-1)
- doc/index.rst (+47/-11)

Related bugs: (none)

Reviewer | Review Type | Date Requested | Status
---|---|---|---
Leonard Richardson | | | Pending

Review via email: mp+459082@code.launchpad.net
Preview Diff
```diff
diff --git a/CHANGELOG b/CHANGELOG
index 69f238d..162e3dc 100644
--- a/CHANGELOG
+++ b/CHANGELOG
@@ -1,5 +1,7 @@
 = 4.13.0 (Unreleased)
 
+TODO: we could stand to put limit inside ResultSet
+
 * This version drops support for Python 3.6. The minimum supported
   major Python version for Beautiful Soup is now Python 3.7.
 
@@ -31,6 +33,13 @@
   you, since you probably use HTMLParserTreeBuilder, not
   BeautifulSoupHTMLParser directly.
 
+* The TreeBuilderForHtml5lib methods fragmentClass and getFragment
+  now raise NotImplementedError. These methods are called only by
+  html5lib's HTMLParser.parseFragment() method, which Beautiful Soup
+  doesn't use, so they were untested and should have never been called.
+  The getFragment() implementation was also slightly incorrect in a way
+  that should have caused obvious problems for anyone using it.
+
 * If Tag.get_attribute_list() is used to access an attribute that's not set,
   the return value is now an empty list rather than [None].
 
@@ -47,6 +56,10 @@
   empty list was treated the same as None and False, and you would have
   found the tags which did not have that attribute set at all. [bug=2045469]
 
+* For similar reasons, if you pass in limit=0 to a find() method for some
+  reason, you will now get zero results. Previously, you would get all
+  matching results.
+
 * When using one of the find() methods or creating a SoupStrainer,
   if you specify the same attribute value in ``attrs`` and the
   keyword arguments, you'll end up with two different ways to match that
@@ -88,7 +101,7 @@
   changed to match the arguments to the superclass,
   TreeBuilder.prepare_markup. Specifically, document_declared_encoding
   now appears before exclude_encodings, not after. If you were calling
-  this method yourself, I recomment switching to using keyword
+  this method yourself, I recommend switching to using keyword
   arguments instead.
 
 * Fixed an error in the lookup table used when converting
@@ -101,8 +114,12 @@ New deprecations in 4.13.0:
 
 * The SAXTreeBuilder class, which was never officially supported or tested.
 
+* The private class method BeautifulSoup._decode_markup(), which has not
+  been used inside Beautiful Soup for many years.
+
 * The first argument to BeautifulSoup.decode has been changed from a bool
   `pretty_print` to an int `indent_level`, to match the signature of Tag.decode.
+  Using a bool will still work but will give you a DeprecationWarning.
 
 * SoupStrainer.text and SoupStrainer.string are both deprecated
   since a single item can't capture all the possibilities of a SoupStrainer
```
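An aside on the `limit=0` changelog entry: the old behavior is a classic truthiness bug, where a guard like `if limit and ...` treats `0` the same as `None`. The sketch below is not Beautiful Soup's actual code, just a minimal standalone illustration of the before/after semantics the entry describes:

```python
from typing import Iterable, List, Optional

def find_all_buggy(items: Iterable[int], limit: Optional[int] = None) -> List[int]:
    """Old behavior: limit is tested for truthiness, so limit=0 means 'no limit'."""
    results: List[int] = []
    for item in items:
        results.append(item)
        if limit and len(results) >= limit:  # 0 is falsy -> never stops early
            break
    return results

def find_all_fixed(items: Iterable[int], limit: Optional[int] = None) -> List[int]:
    """New behavior: only None means 'no limit', so limit=0 yields zero results."""
    results: List[int] = []
    for item in items:
        if limit is not None and len(results) >= limit:
            break
        results.append(item)
    return results
```

With `limit=0`, the buggy version returns every match while the fixed version returns an empty list; any positive `limit` behaves the same in both.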
```diff
diff --git a/bs4/__init__.py b/bs4/__init__.py
index 347cb38..95bd48d 100644
--- a/bs4/__init__.py
+++ b/bs4/__init__.py
@@ -15,7 +15,7 @@ documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
 """
 
 __author__ = "Leonard Richardson (leonardr@segfault.org)"
-__version__ = "4.12.3"
+__version__ = "4.13.0"
 __copyright__ = "Copyright (c) 2004-2024 Leonard Richardson"
 # Use of this source code is governed by the MIT license.
 __license__ = "MIT"
@@ -42,10 +42,13 @@ from .builder import (
 )
 from .builder._htmlparser import HTMLParserTreeBuilder
 from .dammit import UnicodeDammit
+from .css import (
+    CSS
+)
+from ._deprecation import _deprecated
 from .element import (
     CData,
     Comment,
-    CSS,
     DEFAULT_OUTPUT_ENCODING,
     Declaration,
     Doctype,
@@ -60,7 +63,10 @@ from .element import (
     TemplateString,
 )
 from .formatter import Formatter
-from .strainer import SoupStrainer
+from .filter import (
+    ElementFilter,
+    SoupStrainer,
+)
 from typing import (
     Any,
     cast,
@@ -70,6 +76,7 @@ from typing import (
     List,
     Sequence,
     Optional,
+    Tuple,
     Type,
     TYPE_CHECKING,
     Union,
@@ -81,6 +88,7 @@ from bs4._typing import (
     _Encoding,
     _Encodings,
     _IncomingMarkup,
+    _RawMarkup,
 )
 
 # Define some custom warnings.
@@ -144,20 +152,21 @@ class BeautifulSoup(Tag):
     NO_PARSER_SPECIFIED_WARNING: str = "No parser was explicitly specified, so I'm using the best available %(markup_type)s parser for this system (\"%(parser)s\"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.\n\nThe code that caused this warning is on line %(line_number)s of the file %(filename)s. To get rid of this warning, pass the additional argument 'features=\"%(parser)s\"' to the BeautifulSoup constructor.\n"
 
     # FUTURE PYTHON:
-    element_classes:Dict[Type[PageElement], Type[Any]] #: :meta private:
+    element_classes:Dict[Type[PageElement], Type[PageElement]] #: :meta private:
     builder:TreeBuilder #: :meta private:
     is_xml: bool
     known_xml: Optional[bool]
     parse_only: Optional[SoupStrainer] #: :meta private:
 
     # These members are only used while parsing markup.
-    markup:Optional[Union[str,bytes]] #: :meta private:
+    markup:Optional[_RawMarkup] #: :meta private:
     current_data:List[str] #: :meta private:
     currentTag:Optional[Tag] #: :meta private:
     tagStack:List[Tag] #: :meta private:
     open_tag_counter:CounterType[str] #: :meta private:
     preserve_whitespace_tag_stack:List[Tag] #: :meta private:
     string_container_stack:List[Tag] #: :meta private:
+    _most_recent_element:Optional[PageElement] #: :meta private:
 
     #: Beautiful Soup's best guess as to the character encoding of the
     #: original document.
@@ -182,7 +191,7 @@ class BeautifulSoup(Tag):
         parse_only:Optional[SoupStrainer]=None,
         from_encoding:Optional[_Encoding]=None,
         exclude_encodings:Optional[_Encodings]=None,
-        element_classes:Optional[Dict[Type[PageElement], Type[Any]]]=None,
+        element_classes:Optional[Dict[Type[PageElement], Type[PageElement]]]=None,
         **kwargs:Any
     ):
         """Constructor.
@@ -271,7 +280,7 @@ class BeautifulSoup(Tag):
                 "features='lxml' for HTML and features='lxml-xml' for "
                 "XML.")
 
-        def deprecated_argument(old_name, new_name):
+        def deprecated_argument(old_name:str, new_name:str) -> Optional[Any]:
             if old_name in kwargs:
                 warnings.warn(
                     'The "%s" argument to the BeautifulSoup constructor '
@@ -284,13 +293,14 @@ class BeautifulSoup(Tag):
 
         parse_only = parse_only or deprecated_argument(
             "parseOnlyThese", "parse_only")
-        if (parse_only is not None
-            and parse_only.string_rules and
-            (parse_only.name_rules or parse_only.attribute_rules)):
-            warnings.warn(
-                f"Value for parse_only will exclude everything, since it puts restrictions on both tags and strings: {parse_only}",
-                UserWarning, stacklevel=3
-            )
+        if parse_only is not None:
+            # Issue a warning if we can tell in advance that
+            # parse_only will exclude the entire tree.
+            if parse_only.excludes_everything:
+                warnings.warn(
+                    f"The given value for parse_only will exclude everything: {parse_only}",
+                    UserWarning, stacklevel=3
+                )
 
         from_encoding = from_encoding or deprecated_argument(
             "fromEncoding", "from_encoding")
```
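The `deprecated_argument` helper annotated in the hunk above follows a common renamed-keyword pattern: pop the legacy name out of `**kwargs`, warn, and hand the value to the new name. A standalone sketch of that pattern (the real helper is a closure over the constructor's `**kwargs`; the signature here takes the dict explicitly so it can run on its own):

```python
import warnings
from typing import Any, Dict, Optional

def deprecated_argument(kwargs: Dict[str, Any], old_name: str, new_name: str) -> Optional[Any]:
    """Pop a renamed keyword argument out of kwargs, warning the caller."""
    if old_name in kwargs:
        warnings.warn(
            'The "%s" argument has been renamed to "%s".' % (old_name, new_name),
            DeprecationWarning, stacklevel=2
        )
        return kwargs.pop(old_name)
    return None

# A legacy call site: the old name still works, but produces a warning.
legacy_kwargs = {"parseOnlyThese": "some strainer"}
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    value = deprecated_argument(legacy_kwargs, "parseOnlyThese", "parse_only")
```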
```diff
@@ -323,7 +333,7 @@ class BeautifulSoup(Tag):
                     "Couldn't find a tree builder with the features you "
                     "requested: %s. Do you need to install a parser library?"
                     % ",".join(features))
-            builder_class = cast(Type[TreeBuilder], possible_builder_class)
+            builder_class = possible_builder_class
 
         # At this point either we have a TreeBuilder instance in
         # builder, or we have a builder_class that we can instantiate
@@ -399,7 +409,7 @@ class BeautifulSoup(Tag):
 
         # At this point we know markup is a string or bytestring. If
         # it was a file-type object, we've read from it.
-        markup = cast(Union[str,bytes], markup)
+        markup = cast(_RawMarkup, markup)
 
         rejections = []
         success = False
@@ -428,7 +438,7 @@ class BeautifulSoup(Tag):
         self.markup = None
         self.builder.soup = None
 
-    def _clone(self):
+    def _clone(self) -> "BeautifulSoup":
         """Create a new BeautifulSoup object with the same TreeBuilder,
         but not associated with any markup.
 
@@ -441,7 +451,7 @@ class BeautifulSoup(Tag):
         clone.original_encoding = self.original_encoding
         return clone
 
-    def __getstate__(self):
+    def __getstate__(self) -> dict[str, Any]:
         # Frequently a tree builder can't be pickled.
         d = dict(self.__dict__)
         if 'builder' in d and d['builder'] is not None and not self.builder.picklable:
@@ -457,7 +467,7 @@ class BeautifulSoup(Tag):
             del d['_most_recent_element']
         return d
 
-    def __setstate__(self, state):
+    def __setstate__(self, state: dict[str, Any]) -> None:
         # If necessary, restore the TreeBuilder by looking it up.
         self.__dict__ = state
         if isinstance(self.builder, type):
@@ -469,15 +479,16 @@ class BeautifulSoup(Tag):
         self.builder.soup = self
         self.reset()
         self._feed()
-        return state
 
 
     @classmethod
-    def _decode_markup(cls, markup):
-        """Ensure `markup` is bytes so it's safe to send into warnings.warn.
+    @_deprecated(replaced_by="nothing (private method, will be removed)", version="4.13.0")
+    def _decode_markup(cls, markup:_RawMarkup) -> str:
+        """Ensure `markup` is Unicode so it's safe to send into warnings.warn.
 
-        TODO: warnings.warn had this problem back in 2010 but it might not
-        anymore.
+        warnings.warn had this problem back in 2010 but fortunately
+        not anymore. This has not been used for a long time; I just
+        noticed that fact while working on 4.13.0.
         """
         if isinstance(markup, bytes):
             decoded = markup.decode('utf-8', 'replace')
```
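The `@_deprecated(...)` decorator applied to `_decode_markup` comes from the new `bs4._deprecation` module, which isn't part of this chunk of the diff. The sketch below shows one plausible way such a decorator can work; only the call-site parameter names (`replaced_by`, `version`) are taken from the diff, and the body is an assumption, not bs4's implementation:

```python
import functools
import warnings
from typing import Any, Callable, TypeVar

F = TypeVar("F", bound=Callable[..., Any])

def _deprecated(replaced_by: str, version: str) -> Callable[[F], F]:
    """Wrap a callable so that calling it issues a DeprecationWarning."""
    def decorator(func: F) -> F:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            warnings.warn(
                "%s is deprecated since %s; replaced by %s." % (
                    func.__name__, version, replaced_by),
                DeprecationWarning, stacklevel=2
            )
            return func(*args, **kwargs)
        return wrapper  # type: ignore[return-value]
    return decorator

@_deprecated(replaced_by="shout", version="4.13.0")
def old_shout(markup: str) -> str:
    return markup.upper()

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = old_shout("hello")
```

`functools.wraps` keeps the wrapped function's name and docstring intact, so the deprecated method still looks like itself in tracebacks and generated docs.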
```diff
@@ -486,56 +497,76 @@ class BeautifulSoup(Tag):
         return decoded
 
     @classmethod
-    def _markup_is_url(cls, markup):
+    def _markup_is_url(cls, markup:_RawMarkup) -> bool:
         """Error-handling method to raise a warning if incoming markup looks
         like a URL.
 
-        :param markup: A string.
-        :return: Whether or not the markup resembles a URL
-        closely enough to justify a warning.
+        :param markup: A string of markup.
+        :return: Whether or not the markup resembled a URL
+        closely enough to justify issuing a warning.
         """
+        problem: bool = False
         if isinstance(markup, bytes):
-            space = b' '
-            cant_start_with = (b"http:", b"https:")
+            cant_start_with_b: Tuple[bytes, bytes] = (b"http:", b"https:")
+            problem = (
+                any(
+                    markup.startswith(prefix) for prefix in
+                    (b"http:", b"https:")
+                )
+                and not b' ' in markup
+            )
         elif isinstance(markup, str):
-            space = ' '
-            cant_start_with = ("http:", "https:")
+            problem = (
+                any(
+                    markup.startswith(prefix) for prefix in
+                    ("http:", "https:")
+                )
+                and not ' ' in markup
+            )
         else:
             return False
 
-        if any(markup.startswith(prefix) for prefix in cant_start_with):
-            if not space in markup:
-                warnings.warn(
-                    'The input looks more like a URL than markup. You may want to use'
-                    ' an HTTP client like requests to get the document behind'
-                    ' the URL, and feed that document to Beautiful Soup.',
-                    MarkupResemblesLocatorWarning,
-                    stacklevel=3
-                )
-                return True
-        return False
+        if not problem:
+            return False
+        warnings.warn(
+            'The input looks more like a URL than markup. You may want to use'
+            ' an HTTP client like requests to get the document behind'
+            ' the URL, and feed that document to Beautiful Soup.',
+            MarkupResemblesLocatorWarning,
+            stacklevel=3
+        )
+        return True
 
     @classmethod
-    def _markup_resembles_filename(cls, markup):
-        """Error-handling method to raise a warning if incoming markup
+    def _markup_resembles_filename(cls, markup:_RawMarkup) -> bool:
+        """Error-handling method to issue a warning if incoming markup
         resembles a filename.
 
-        :param markup: A bytestring or string.
-        :return: Whether or not the markup resembles a filename
-        closely enough to justify a warning.
+        :param markup: A string of markup.
+        :return: Whether or not the markup resembled a filename
+        closely enough to justify issuing a warning.
         """
-        path_characters = '/\\'
-        extensions = ['.html', '.htm', '.xml', '.xhtml', '.txt']
-        if isinstance(markup, bytes):
-            path_characters = path_characters.encode("utf8")
-            extensions = [x.encode('utf8') for x in extensions]
+        path_characters_b = b'/\\'
+        path_characters_s = '/\\'
+        extensions_b = [b'.html', b'.htm', b'.xml', b'.xhtml', b'.txt']
+        extensions_s = ['.html', '.htm', '.xml', '.xhtml', '.txt']
+
         filelike = False
-        if any(x in markup for x in path_characters):
-            filelike = True
+        if isinstance(markup, bytes):
+            if any(x in markup for x in path_characters_b):
+                filelike = True
+            else:
+                lower_b = markup.lower()
+                if any(lower_b.endswith(ext) for ext in extensions_b):
+                    filelike = True
         else:
-            lower = markup.lower()
-            if any(lower.endswith(ext) for ext in extensions):
-                filelike = True
+            if any(x in markup for x in path_characters_s):
+                filelike = True
+            else:
+                lower_s = markup.lower()
+                if any(lower_s.endswith(ext) for ext in extensions_s):
+                    filelike = True
+
         if filelike:
             warnings.warn(
                 'The input looks more like a filename than markup. You may'
```
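The two heuristics in the hunk above boil down to simple string tests: a URL-like input starts with an HTTP scheme and contains no spaces; a filename-like input contains a path separator or ends in a well-known markup extension. This is a str-only distillation of that logic, not bs4's code: the real methods also handle bytes and issue `MarkupResemblesLocatorWarning` rather than just returning a flag, and a guard at the call site (not visible in this hunk) restricts the checks to short inputs:

```python
def looks_like_url(markup: str) -> bool:
    """Mirror the URL heuristic: an HTTP scheme prefix and no spaces."""
    return (
        any(markup.startswith(prefix) for prefix in ("http:", "https:"))
        and " " not in markup
    )

PATH_CHARACTERS = "/\\"
EXTENSIONS = [".html", ".htm", ".xml", ".xhtml", ".txt"]

def looks_like_filename(markup: str) -> bool:
    """Mirror the filename heuristic: a path separator, or a well-known
    markup file extension (matched case-insensitively)."""
    if any(c in markup for c in PATH_CHARACTERS):
        return True
    lower = markup.lower()
    return any(lower.endswith(ext) for ext in EXTENSIONS)
```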
```diff
@@ -546,20 +577,22 @@ class BeautifulSoup(Tag):
             return True
         return False
 
-    def _feed(self):
+    def _feed(self) -> None:
        """Internal method that parses previously set markup, creating a large
        number of Tag and NavigableString objects.
        """
        # Convert the document to Unicode.
        self.builder.reset()
 
-       self.builder.feed(self.markup)
+       if self.markup is not None:
+           self.builder.feed(self.markup)
        # Close out any unfinished strings and close all the open tags.
        self.endData()
-       while self.currentTag.name != self.ROOT_TAG_NAME:
+       while (self.currentTag is not None and
+              self.currentTag.name != self.ROOT_TAG_NAME):
            self.popTag()
 
-    def reset(self):
+    def reset(self) -> None:
        """Reset this object to a state as though it had never parsed any
        markup.
        """
@@ -585,7 +618,7 @@ class BeautifulSoup(Tag):
         sourcepos:Optional[int]=None,
         string:Optional[str]=None,
         **kwattrs:_AttributeValue,
-    ):
+    ) -> Tag:
         """Create a new Tag associated with this BeautifulSoup object.
 
         :param name: The name of the new Tag.
@@ -603,10 +636,16 @@ class BeautifulSoup(Tag):
 
         """
         kwattrs.update(attrs)
-        tag = self.element_classes.get(Tag, Tag)(
+        tag_class = self.element_classes.get(Tag, Tag)
+
+        # Assume that this is either Tag or a subclass of Tag. If not,
+        # the user brought type-unsafety upon themselves.
+        tag_class = cast(Type[Tag], tag_class)
+        tag = tag_class(
             None, self.builder, name, namespace, nsprefix, kwattrs,
             sourceline=sourceline, sourcepos=sourcepos
         )
+
         if string is not None:
             tag.string = string
         return tag
```
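The `element_classes` lookup in `new_tag` above is a small class-substitution registry: `dict.get(cls, cls)` falls back to the base class when no override is registered, so user-supplied subclasses are instantiated in place of the defaults. A generic sketch of the pattern (the class names here are illustrative, not bs4's actual classes):

```python
from typing import Dict, Type

class Tag:
    def __init__(self, name: str) -> None:
        self.name = name

class MyTag(Tag):
    """A user-supplied subclass the factory should instantiate instead."""

def make_tag(name: str, element_classes: Dict[Type[Tag], Type[Tag]]) -> Tag:
    # Fall back to the base class when no override is registered.
    tag_class = element_classes.get(Tag, Tag)
    return tag_class(name)

default_tag = make_tag("p", {})          # a plain Tag
custom_tag = make_tag("p", {Tag: MyTag})  # the registered subclass
```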
```diff
@@ -622,9 +661,11 @@ class BeautifulSoup(Tag):
         """
         container = base_class or NavigableString
 
-        # There may be a general override of NavigableString.
-        container = self.element_classes.get(
-            container, container
+        # The user may want us to use some other class (hopefully a
+        # custom subclass) instead of the one we'd use normally.
+        container = cast(
+            type[NavigableString],
+            self.element_classes.get(container, container)
         )
 
         # On top of that, we may be inside a tag that needs a special
@@ -728,9 +769,8 @@ class BeautifulSoup(Tag):
         self.current_data = []
 
         # Should we add this string to the tree at all?
-        if self.parse_only and len(self.tagStack) <= 1 and \
-           (not self.parse_only.string_rules or \
-            not self.parse_only.allow_string_creation(current_data)):
+        if (self.parse_only and len(self.tagStack) <= 1 and
+            (not self.parse_only.allow_string_creation(current_data))):
             return
 
         containerClass = self.string_container(containerClass)
@@ -739,17 +779,16 @@ class BeautifulSoup(Tag):
 
     def object_was_parsed(
             self, o:PageElement, parent:Optional[Tag]=None,
-            most_recent_element:Optional[PageElement]=None):
+            most_recent_element:Optional[PageElement]=None) -> None:
         """Method called by the TreeBuilder to integrate an object into the
         parse tree.
 
-
-
         :meta private:
         """
         if parent is None:
             parent = self.currentTag
         assert parent is not None
+        previous_element: Optional[PageElement]
         if most_recent_element is not None:
             previous_element = most_recent_element
         else:
@@ -774,12 +813,12 @@ class BeautifulSoup(Tag):
         if fix:
             self._linkage_fixer(parent)
 
-    def _linkage_fixer(self, el):
+    def _linkage_fixer(self, el:Tag) -> None:
         """Make sure linkage of this fragment is sound."""
 
         first = el.contents[0]
         child = el.contents[-1]
-        descendant = child
+        descendant:PageElement = child
 
         if child is first and el.parent is not None:
             # Parent should be linked to first child
@@ -797,14 +836,18 @@ class BeautifulSoup(Tag):
 
         # This index is a tag, dig deeper for a "last descendant"
         if isinstance(child, Tag) and child.contents:
-            descendant = child._last_descendant(False)
+            # _last_decendant is typed as returning Optional[PageElement],
+            # but the value can't be None here, because el is a Tag
+            # which we know has contents.
+            descendant = cast(PageElement, child._last_descendant(False))
 
         # As the final step, link last descendant. It should be linked
         # to the parent's next sibling (if found), else walk up the chain
         # and find a parent with a sibling. It should have no next sibling.
         descendant.next_element = None
         descendant.next_sibling = None
-        target = el
+
+        target:Optional[Tag] = el
         while True:
             if target is None:
                 break
```
494 | @@ -814,7 +857,7 @@ class BeautifulSoup(Tag): | |||
495 | 814 | break | 857 | break |
496 | 815 | target = target.parent | 858 | target = target.parent |
497 | 816 | 859 | ||
499 | 817 | def _popToTag(self, name, nsprefix=None, inclusivePop=True) -> Optional[Tag]: | 860 | def _popToTag(self, name:str, nsprefix:Optional[str]=None, inclusivePop:bool=True) -> Optional[Tag]: |
500 | 818 | """Pops the tag stack up to and including the most recent | 861 | """Pops the tag stack up to and including the most recent |
501 | 819 | instance of the given tag. | 862 | instance of the given tag. |
502 | 820 | 863 | ||
503 | @@ -851,7 +894,7 @@ class BeautifulSoup(Tag): | |||
504 | 851 | 894 | ||
505 | 852 | def handle_starttag( | 895 | def handle_starttag( |
506 | 853 | self, name:str, namespace:Optional[str], | 896 | self, name:str, namespace:Optional[str], |
508 | 854 | nsprefix:Optional[str], attrs:Optional[Dict[str,str]], | 897 | nsprefix:Optional[str], attrs:_AttributeValues, |
509 | 855 | sourceline:Optional[int]=None, sourcepos:Optional[int]=None, | 898 | sourceline:Optional[int]=None, sourcepos:Optional[int]=None, |
510 | 856 | namespaces:Optional[Dict[str, str]]=None) -> Optional[Tag]: | 899 | namespaces:Optional[Dict[str, str]]=None) -> Optional[Tag]: |
511 | 857 | """Called by the tree builder when a new tag is encountered. | 900 | """Called by the tree builder when a new tag is encountered. |
512 | @@ -867,7 +910,7 @@ class BeautifulSoup(Tag): | |||
513 | 867 | currently in scope in the document. | 910 | currently in scope in the document. |
514 | 868 | 911 | ||
515 | 869 | If this method returns None, the tag was rejected by an active | 912 | If this method returns None, the tag was rejected by an active |
517 | 870 | SoupStrainer. You should proceed as if the tag had not occurred | 913 | `ElementFilter`. You should proceed as if the tag had not occurred |
518 | 871 | in the document. For instance, if this was a self-closing tag, | 914 | in the document. For instance, if this was a self-closing tag, |
519 | 872 | don't call handle_endtag. | 915 | don't call handle_endtag. |
520 | 873 | 916 | ||
521 | @@ -877,11 +920,14 @@ class BeautifulSoup(Tag): | |||
522 | 877 | self.endData() | 920 | self.endData() |
523 | 878 | 921 | ||
524 | 879 | if (self.parse_only and len(self.tagStack) <= 1 | 922 | if (self.parse_only and len(self.tagStack) <= 1 |
527 | 880 | and (self.parse_only.string_rules | 923 | and not self.parse_only.allow_tag_creation(nsprefix, name, attrs)): |
526 | 881 | or not self.parse_only.allow_tag_creation(nsprefix, name, attrs))): | ||
528 | 882 | return None | 924 | return None |
529 | 883 | 925 | ||
531 | 884 | tag = self.element_classes.get(Tag, Tag)( | 926 | tag_class = self.element_classes.get(Tag, Tag) |
532 | 927 | # Assume that this is either Tag or a subclass of Tag. If not, | ||
533 | 928 | # the user brought type-unsafety upon themselves. | ||
534 | 929 | tag_class = cast(Type[Tag], tag_class) | ||
535 | 930 | tag = tag_class( | ||
536 | 885 | self, self.builder, name, namespace, nsprefix, attrs, | 931 | self, self.builder, name, namespace, nsprefix, attrs, |
537 | 886 | self.currentTag, self._most_recent_element, | 932 | self.currentTag, self._most_recent_element, |
538 | 887 | sourceline=sourceline, sourcepos=sourcepos, | 933 | sourceline=sourceline, sourcepos=sourcepos, |
539 | @@ -918,7 +964,8 @@ class BeautifulSoup(Tag): | |||
540 | 918 | def decode(self, indent_level:Optional[int]=None, | 964 | def decode(self, indent_level:Optional[int]=None, |
541 | 919 | eventual_encoding:_Encoding=DEFAULT_OUTPUT_ENCODING, | 965 | eventual_encoding:_Encoding=DEFAULT_OUTPUT_ENCODING, |
542 | 920 | formatter:Union[Formatter,str]="minimal", | 966 | formatter:Union[Formatter,str]="minimal", |
544 | 921 | iterator:Optional[Iterable]=None, **kwargs) -> str: | 967 | iterator:Optional[Iterable[PageElement]]=None, |
545 | 968 | **kwargs:Any) -> str: | ||
546 | 922 | """Returns a string representation of the parse tree | 969 | """Returns a string representation of the parse tree |
547 | 923 | as a full HTML or XML document. | 970 | as a full HTML or XML document. |
548 | 924 | 971 | ||
549 | @@ -989,7 +1036,7 @@ _soup = BeautifulSoup | |||
550 | 989 | class BeautifulStoneSoup(BeautifulSoup): | 1036 | class BeautifulStoneSoup(BeautifulSoup): |
551 | 990 | """Deprecated interface to an XML parser.""" | 1037 | """Deprecated interface to an XML parser.""" |
552 | 991 | 1038 | ||
554 | 992 | def __init__(self, *args, **kwargs): | 1039 | def __init__(self, *args:Any, **kwargs:Any): |
555 | 993 | kwargs['features'] = 'xml' | 1040 | kwargs['features'] = 'xml' |
556 | 994 | warnings.warn( | 1041 | warnings.warn( |
557 | 995 | 'The BeautifulStoneSoup class was deprecated in version 4.0.0. Instead of using ' | 1042 | 'The BeautifulStoneSoup class was deprecated in version 4.0.0. Instead of using ' |
558 | diff --git a/bs4/_typing.py b/bs4/_typing.py | |||
559 | index fed804a..ab8f7a0 100644 | |||
560 | --- a/bs4/_typing.py | |||
561 | +++ b/bs4/_typing.py | |||
562 | @@ -7,6 +7,8 @@ | |||
563 | 7 | # * In 3.10, x|y is an accepted shorthand for Union[x,y]. | 7 | # * In 3.10, x|y is an accepted shorthand for Union[x,y]. |
564 | 8 | # * In 3.10, TypeAlias gains capabilities that can be used to | 8 | # * In 3.10, TypeAlias gains capabilities that can be used to |
565 | 9 | # improve the tree matching types (I don't remember what, exactly). | 9 | # improve the tree matching types (I don't remember what, exactly). |
566 | 10 | # * 3.8 defines the Protocol type, which can be used to do duck typing | ||
567 | 11 | # in a statically checkable way. | ||
568 | 10 | 12 | ||
569 | 11 | import re | 13 | import re |
570 | 12 | from typing_extensions import TypeAlias | 14 | from typing_extensions import TypeAlias |
571 | @@ -15,13 +17,14 @@ from typing import ( | |||
572 | 15 | Dict, | 17 | Dict, |
573 | 16 | IO, | 18 | IO, |
574 | 17 | Iterable, | 19 | Iterable, |
575 | 20 | Optional, | ||
576 | 18 | Pattern, | 21 | Pattern, |
577 | 19 | TYPE_CHECKING, | 22 | TYPE_CHECKING, |
578 | 20 | Union, | 23 | Union, |
579 | 21 | ) | 24 | ) |
580 | 22 | 25 | ||
581 | 23 | if TYPE_CHECKING: | 26 | if TYPE_CHECKING: |
583 | 24 | from bs4.element import Tag | 27 | from bs4.element import PageElement, Tag |
584 | 25 | 28 | ||
585 | 26 | # Aliases for markup in various stages of processing. | 29 | # Aliases for markup in various stages of processing. |
586 | 27 | # | 30 | # |
587 | @@ -52,6 +55,10 @@ _InvertedNamespaceMapping:TypeAlias = Dict[_NamespaceURL, _NamespacePrefix] | |||
588 | 52 | _AttributeValue: TypeAlias = Union[str, Iterable[str]] | 55 | _AttributeValue: TypeAlias = Union[str, Iterable[str]] |
589 | 53 | _AttributeValues: TypeAlias = Dict[str, _AttributeValue] | 56 | _AttributeValues: TypeAlias = Dict[str, _AttributeValue] |
590 | 54 | 57 | ||
591 | 58 | # The most common form in which attribute values are passed in from a | ||
592 | 59 | # parser. | ||
593 | 60 | _RawAttributeValues: TypeAlias = Dict[str, str] | ||
594 | 61 | |||
595 | 55 | # Aliases to represent the many possibilities for matching bits of a | 62 | # Aliases to represent the many possibilities for matching bits of a |
596 | 56 | # parse tree. | 63 | # parse tree. |
597 | 57 | # | 64 | # |
598 | @@ -60,6 +67,17 @@ _AttributeValues: TypeAlias = Dict[str, _AttributeValue] | |||
599 | 60 | # of the arguments to the SoupStrainer constructor and (more | 67 | # of the arguments to the SoupStrainer constructor and (more |
600 | 61 | # familiarly to Beautiful Soup users) the find* methods. | 68 | # familiarly to Beautiful Soup users) the find* methods. |
601 | 62 | 69 | ||
602 | 70 | # A function that takes a PageElement and returns a yes-or-no answer. | ||
603 | 71 | _PageElementMatchFunction:TypeAlias = Callable[['PageElement'], bool] | ||
604 | 72 | |||
605 | 73 | # A function that takes the raw parsed ingredients of a markup tag | ||
606 | 74 | # and returns a yes-or-no answer. | ||
607 | 75 | _AllowTagCreationFunction:TypeAlias = Callable[[Optional[str], str, Optional[_RawAttributeValues]], bool] | ||
608 | 76 | |||
609 | 77 | # A function that takes the raw parsed ingredients of a markup string node | ||
610 | 78 | # and returns a yes-or-no answer. | ||
611 | 79 | _AllowStringCreationFunction:TypeAlias = Callable[[Optional[str]], bool] | ||
612 | 80 | |||
613 | 63 | # A function that takes a Tag and returns a yes-or-no answer. | 81 | # A function that takes a Tag and returns a yes-or-no answer. |
614 | 64 | # A TagNameMatchRule expects this kind of function, if you're | 82 | # A TagNameMatchRule expects this kind of function, if you're |
615 | 65 | # going to pass it a function. | 83 | # going to pass it a function. |
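The `bs4/_typing.py` hunk above introduces callable aliases for the raw-ingredient filter functions. A standalone sketch of those aliases, reproduced with plain `typing` so the shape can be checked outside bs4 (the alias names follow the diff; `allow_only_a` is a hypothetical example function):

```python
# The new aliases describe plain functions: one answering whether a tag may
# be created from its raw parsed parts, one answering the same for a string.
from typing import Callable, Dict, Optional

_RawAttributeValues = Dict[str, str]
_AllowTagCreationFunction = Callable[
    [Optional[str], str, Optional[_RawAttributeValues]], bool
]
_AllowStringCreationFunction = Callable[[Optional[str]], bool]


def allow_only_a(nsprefix: Optional[str], name: str,
                 attrs: Optional[_RawAttributeValues]) -> bool:
    """A function matching _AllowTagCreationFunction: keep only <a> tags."""
    return name == 'a'


# The assignment below type-checks because the signature matches the alias.
tag_filter: _AllowTagCreationFunction = allow_only_a
```

Keeping these as `TypeAlias` definitions (as the diff does) lets mypy verify user-supplied filter functions against one canonical signature instead of repeating the `Callable[...]` spelling at every call site.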
616 | diff --git a/bs4/builder/__init__.py b/bs4/builder/__init__.py | |||
617 | index fa2b939..b59513e 100644 | |||
618 | --- a/bs4/builder/__init__.py | |||
619 | +++ b/bs4/builder/__init__.py | |||
620 | @@ -277,7 +277,7 @@ class TreeBuilder(object): | |||
621 | 277 | return True | 277 | return True |
622 | 278 | return tag_name in self.empty_element_tags | 278 | return tag_name in self.empty_element_tags |
623 | 279 | 279 | ||
625 | 280 | def feed(self, markup:str) -> None: | 280 | def feed(self, markup:_RawMarkup) -> None: |
626 | 281 | """Run some incoming markup through some parsing process, | 281 | """Run some incoming markup through some parsing process, |
627 | 282 | populating the `BeautifulSoup` object in `TreeBuilder.soup` | 282 | populating the `BeautifulSoup` object in `TreeBuilder.soup` |
628 | 283 | """ | 283 | """ |
629 | @@ -598,8 +598,8 @@ class DetectsXMLParsedAsHTML(object): | |||
630 | 598 | 598 | ||
631 | 599 | # This is typed as str, not `ProcessingInstruction`, because this | 599 | # This is typed as str, not `ProcessingInstruction`, because this |
632 | 600 | # check may be run before any Beautiful Soup objects are created. | 600 | # check may be run before any Beautiful Soup objects are created. |
635 | 601 | _first_processing_instruction: Optional[str] | 601 | _first_processing_instruction: Optional[str] #: :meta private: |
636 | 602 | _root_tag: Optional[Tag] | 602 | _root_tag_name: Optional[str] #: :meta private: |
637 | 603 | 603 | ||
638 | 604 | @classmethod | 604 | @classmethod |
639 | 605 | def warn_if_markup_looks_like_xml(cls, markup:Optional[_RawMarkup], stacklevel:int=3) -> bool: | 605 | def warn_if_markup_looks_like_xml(cls, markup:Optional[_RawMarkup], stacklevel:int=3) -> bool: |
640 | @@ -648,14 +648,14 @@ class DetectsXMLParsedAsHTML(object): | |||
641 | 648 | def _initialize_xml_detector(self) -> None: | 648 | def _initialize_xml_detector(self) -> None: |
642 | 649 | """Call this method before parsing a document.""" | 649 | """Call this method before parsing a document.""" |
643 | 650 | self._first_processing_instruction = None | 650 | self._first_processing_instruction = None |
645 | 651 | self._root_tag = None | 651 | self._root_tag_name = None |
646 | 652 | 652 | ||
647 | 653 | def _document_might_be_xml(self, processing_instruction:str): | 653 | def _document_might_be_xml(self, processing_instruction:str): |
648 | 654 | """Call this method when encountering an XML declaration, or a | 654 | """Call this method when encountering an XML declaration, or a |
649 | 655 | "processing instruction" that might be an XML declaration. | 655 | "processing instruction" that might be an XML declaration. |
650 | 656 | """ | 656 | """ |
651 | 657 | if (self._first_processing_instruction is not None | 657 | if (self._first_processing_instruction is not None |
653 | 658 | or self._root_tag is not None): | 658 | or self._root_tag_name is not None): |
654 | 659 | # The document has already started. Don't bother checking | 659 | # The document has already started. Don't bother checking |
655 | 660 | # anymore. | 660 | # anymore. |
656 | 661 | return | 661 | return |
657 | @@ -665,18 +665,18 @@ class DetectsXMLParsedAsHTML(object): | |||
658 | 665 | # We won't know until we encounter the first tag whether or | 665 | # We won't know until we encounter the first tag whether or |
659 | 666 | # not this is actually a problem. | 666 | # not this is actually a problem. |
660 | 667 | 667 | ||
662 | 668 | def _root_tag_encountered(self, name): | 668 | def _root_tag_encountered(self, name:str) -> None: |
663 | 669 | """Call this when you encounter the document's root tag. | 669 | """Call this when you encounter the document's root tag. |
664 | 670 | 670 | ||
665 | 671 | This is where we actually check whether an XML document is | 671 | This is where we actually check whether an XML document is |
666 | 672 | being incorrectly parsed as HTML, and issue the warning. | 672 | being incorrectly parsed as HTML, and issue the warning. |
667 | 673 | """ | 673 | """ |
669 | 674 | if self._root_tag is not None: | 674 | if self._root_tag_name is not None: |
670 | 675 | # This method was incorrectly called multiple times. Do | 675 | # This method was incorrectly called multiple times. Do |
671 | 676 | # nothing. | 676 | # nothing. |
672 | 677 | return | 677 | return |
673 | 678 | 678 | ||
675 | 679 | self._root_tag = name | 679 | self._root_tag_name = name |
676 | 680 | if (name != 'html' and self._first_processing_instruction is not None | 680 | if (name != 'html' and self._first_processing_instruction is not None |
677 | 681 | and self._first_processing_instruction.lower().startswith('xml ')): | 681 | and self._first_processing_instruction.lower().startswith('xml ')): |
678 | 682 | # We encountered an XML declaration and then a tag other | 682 | # We encountered an XML declaration and then a tag other |
679 | diff --git a/bs4/builder/_html5lib.py b/bs4/builder/_html5lib.py | |||
680 | index b7d2924..2ea556c 100644 | |||
681 | --- a/bs4/builder/_html5lib.py | |||
682 | +++ b/bs4/builder/_html5lib.py | |||
683 | @@ -6,6 +6,9 @@ __all__ = [ | |||
684 | 6 | ] | 6 | ] |
685 | 7 | 7 | ||
686 | 8 | from typing import ( | 8 | from typing import ( |
687 | 9 | Any, | ||
688 | 10 | cast, | ||
689 | 11 | Dict, | ||
690 | 9 | Iterable, | 12 | Iterable, |
691 | 10 | List, | 13 | List, |
692 | 11 | Optional, | 14 | Optional, |
693 | @@ -14,8 +17,11 @@ from typing import ( | |||
694 | 14 | Union, | 17 | Union, |
695 | 15 | ) | 18 | ) |
696 | 16 | from bs4._typing import ( | 19 | from bs4._typing import ( |
697 | 20 | _AttributeValue, | ||
698 | 21 | _AttributeValues, | ||
699 | 17 | _Encoding, | 22 | _Encoding, |
700 | 18 | _Encodings, | 23 | _Encodings, |
701 | 24 | _NamespaceURL, | ||
702 | 19 | _RawMarkup, | 25 | _RawMarkup, |
703 | 20 | ) | 26 | ) |
704 | 21 | 27 | ||
705 | @@ -30,6 +36,7 @@ from bs4.builder import ( | |||
706 | 30 | ) | 36 | ) |
707 | 31 | from bs4.element import ( | 37 | from bs4.element import ( |
708 | 32 | NamespacedAttribute, | 38 | NamespacedAttribute, |
709 | 39 | PageElement, | ||
710 | 33 | nonwhitespace_re, | 40 | nonwhitespace_re, |
711 | 34 | ) | 41 | ) |
712 | 35 | import html5lib | 42 | import html5lib |
713 | @@ -42,7 +49,9 @@ from bs4.element import ( | |||
714 | 42 | Doctype, | 49 | Doctype, |
715 | 43 | NavigableString, | 50 | NavigableString, |
716 | 44 | Tag, | 51 | Tag, |
718 | 45 | ) | 52 | ) |
719 | 53 | if TYPE_CHECKING: | ||
720 | 54 | from bs4 import BeautifulSoup | ||
721 | 46 | 55 | ||
722 | 47 | from html5lib.treebuilders import base as treebuilder_base | 56 | from html5lib.treebuilders import base as treebuilder_base |
723 | 48 | 57 | ||
724 | @@ -71,7 +80,9 @@ class HTML5TreeBuilder(HTMLTreeBuilder): | |||
725 | 71 | #: html5lib can tell us which line number and position in the | 80 | #: html5lib can tell us which line number and position in the |
726 | 72 | #: original file is the source of an element. | 81 | #: original file is the source of an element. |
727 | 73 | TRACKS_LINE_NUMBERS:bool = True | 82 | TRACKS_LINE_NUMBERS:bool = True |
729 | 74 | 83 | ||
730 | 84 | underlying_builder:'TreeBuilderForHtml5lib' #: :meta private: | ||
731 | 85 | |||
732 | 75 | def prepare_markup(self, markup:_RawMarkup, | 86 | def prepare_markup(self, markup:_RawMarkup, |
733 | 76 | user_specified_encoding:Optional[_Encoding]=None, | 87 | user_specified_encoding:Optional[_Encoding]=None, |
734 | 77 | document_declared_encoding:Optional[_Encoding]=None, | 88 | document_declared_encoding:Optional[_Encoding]=None, |
735 | @@ -102,20 +113,31 @@ class HTML5TreeBuilder(HTMLTreeBuilder): | |||
736 | 102 | yield (markup, None, None, False) | 113 | yield (markup, None, None, False) |
737 | 103 | 114 | ||
738 | 104 | # These methods are defined by Beautiful Soup. | 115 | # These methods are defined by Beautiful Soup. |
740 | 105 | def feed(self, markup): | 116 | def feed(self, markup:_RawMarkup) -> None: |
741 | 106 | """Run some incoming markup through some parsing process, | 117 | """Run some incoming markup through some parsing process, |
742 | 107 | populating the `BeautifulSoup` object in `HTML5TreeBuilder.soup`. | 118 | populating the `BeautifulSoup` object in `HTML5TreeBuilder.soup`. |
743 | 108 | """ | 119 | """ |
745 | 109 | if self.soup.parse_only is not None: | 120 | if self.soup is not None and self.soup.parse_only is not None: |
746 | 110 | warnings.warn( | 121 | warnings.warn( |
747 | 111 | "You provided a value for parse_only, but the html5lib tree builder doesn't support parse_only. The entire document will be parsed.", | 122 | "You provided a value for parse_only, but the html5lib tree builder doesn't support parse_only. The entire document will be parsed.", |
748 | 112 | stacklevel=4 | 123 | stacklevel=4 |
749 | 113 | ) | 124 | ) |
750 | 125 | |||
751 | 126 | # self.underlying_parser is probably None now, but it'll be set | ||
752 | 127 | # when self.create_treebuilder is called by html5lib. | ||
753 | 128 | # | ||
754 | 129 | # TODO-TYPING: typeshed stubs are incorrect about the return | ||
755 | 130 | # value of HTMLParser.__init__; it is HTMLParser, not None. | ||
756 | 114 | parser = html5lib.HTMLParser(tree=self.create_treebuilder) | 131 | parser = html5lib.HTMLParser(tree=self.create_treebuilder) |
757 | 132 | assert self.underlying_builder is not None | ||
758 | 115 | self.underlying_builder.parser = parser | 133 | self.underlying_builder.parser = parser |
759 | 116 | extra_kwargs = dict() | 134 | extra_kwargs = dict() |
760 | 117 | if not isinstance(markup, str): | 135 | if not isinstance(markup, str): |
761 | 136 | # kwargs, specifically override_encoding, will eventually | ||
762 | 137 | # be passed in to html5lib's | ||
763 | 138 | # HTMLBinaryInputStream.__init__. | ||
764 | 118 | extra_kwargs['override_encoding'] = self.user_specified_encoding | 139 | extra_kwargs['override_encoding'] = self.user_specified_encoding |
765 | 140 | |||
766 | 119 | doc = parser.parse(markup, **extra_kwargs) | 141 | doc = parser.parse(markup, **extra_kwargs) |
767 | 120 | 142 | ||
768 | 121 | # Set the character encoding detected by the tokenizer. | 143 | # Set the character encoding detected by the tokenizer. |
769 | @@ -131,10 +153,12 @@ class HTML5TreeBuilder(HTMLTreeBuilder): | |||
770 | 131 | doc.original_encoding = original_encoding | 153 | doc.original_encoding = original_encoding |
771 | 132 | self.underlying_builder.parser = None | 154 | self.underlying_builder.parser = None |
772 | 133 | 155 | ||
774 | 134 | def create_treebuilder(self, namespaceHTMLElements): | 156 | def create_treebuilder(self, namespaceHTMLElements:bool) -> 'TreeBuilderForHtml5lib': |
775 | 135 | """Called by html5lib to instantiate the kind of class it | 157 | """Called by html5lib to instantiate the kind of class it |
776 | 136 | calls a 'TreeBuilder'. | 158 | calls a 'TreeBuilder'. |
778 | 137 | 159 | ||
779 | 160 | :param namespaceHTMLElements: Whether or not to namespace HTML elements. | ||
780 | 161 | |||
781 | 138 | :meta private: | 162 | :meta private: |
782 | 139 | """ | 163 | """ |
783 | 140 | self.underlying_builder = TreeBuilderForHtml5lib( | 164 | self.underlying_builder = TreeBuilderForHtml5lib( |
784 | @@ -143,15 +167,18 @@ class HTML5TreeBuilder(HTMLTreeBuilder): | |||
785 | 143 | ) | 167 | ) |
786 | 144 | return self.underlying_builder | 168 | return self.underlying_builder |
787 | 145 | 169 | ||
789 | 146 | def test_fragment_to_document(self, fragment): | 170 | def test_fragment_to_document(self, fragment:str) -> str: |
790 | 147 | """See `TreeBuilder`.""" | 171 | """See `TreeBuilder`.""" |
791 | 148 | return '<html><head></head><body>%s</body></html>' % fragment | 172 | return '<html><head></head><body>%s</body></html>' % fragment |
792 | 149 | 173 | ||
793 | 150 | 174 | ||
794 | 151 | class TreeBuilderForHtml5lib(treebuilder_base.TreeBuilder): | 175 | class TreeBuilderForHtml5lib(treebuilder_base.TreeBuilder): |
798 | 152 | 176 | ||
799 | 153 | def __init__(self, namespaceHTMLElements, soup=None, | 177 | soup:'BeautifulSoup' #: :meta private: |
800 | 154 | store_line_numbers=True, **kwargs): | 178 | |
801 | 179 | def __init__(self, namespaceHTMLElements:bool, | ||
802 | 180 | soup:Optional['BeautifulSoup']=None, | ||
803 | 181 | store_line_numbers:bool=True, **kwargs:Any): | ||
804 | 155 | if soup: | 182 | if soup: |
805 | 156 | self.soup = soup | 183 | self.soup = soup |
806 | 157 | else: | 184 | else: |
807 | @@ -172,65 +199,68 @@ class TreeBuilderForHtml5lib(treebuilder_base.TreeBuilder): | |||
808 | 172 | self.parser = None | 199 | self.parser = None |
809 | 173 | self.store_line_numbers = store_line_numbers | 200 | self.store_line_numbers = store_line_numbers |
810 | 174 | 201 | ||
812 | 175 | def documentClass(self): | 202 | def documentClass(self) -> 'Element': |
813 | 176 | self.soup.reset() | 203 | self.soup.reset() |
814 | 177 | return Element(self.soup, self.soup, None) | 204 | return Element(self.soup, self.soup, None) |
815 | 178 | 205 | ||
820 | 179 | def insertDoctype(self, token): | 206 | def insertDoctype(self, token:Dict[str, Any]) -> None: |
821 | 180 | name = token["name"] | 207 | name:str = cast(str, token["name"]) |
822 | 181 | publicId = token["publicId"] | 208 | publicId:Optional[str] = cast(Optional[str], token["publicId"]) |
823 | 182 | systemId = token["systemId"] | 209 | systemId:Optional[str] = cast(Optional[str], token["systemId"]) |
824 | 183 | 210 | ||
825 | 184 | doctype = Doctype.for_name_and_ids(name, publicId, systemId) | 211 | doctype = Doctype.for_name_and_ids(name, publicId, systemId) |
826 | 185 | self.soup.object_was_parsed(doctype) | 212 | self.soup.object_was_parsed(doctype) |
827 | 186 | 213 | ||
830 | 187 | def elementClass(self, name, namespace): | 214 | def elementClass(self, name:str, namespace:str) -> 'Element': |
831 | 188 | kwargs = {} | 215 | sourceline:Optional[int] = None |
832 | 216 | sourcepos:Optional[int] = None | ||
833 | 189 | if self.parser and self.store_line_numbers: | 217 | if self.parser and self.store_line_numbers: |
834 | 190 | # This represents the point immediately after the end of the | 218 | # This represents the point immediately after the end of the |
835 | 191 | # tag. We don't know when the tag started, but we do know | 219 | # tag. We don't know when the tag started, but we do know |
836 | 192 | # where it ended -- the character just before this one. | 220 | # where it ended -- the character just before this one. |
837 | 193 | sourceline, sourcepos = self.parser.tokenizer.stream.position() | 221 | sourceline, sourcepos = self.parser.tokenizer.stream.position() |
841 | 194 | kwargs['sourceline'] = sourceline | 222 | sourcepos = sourcepos-1 |
842 | 195 | kwargs['sourcepos'] = sourcepos-1 | 223 | tag = self.soup.new_tag( |
843 | 196 | tag = self.soup.new_tag(name, namespace, **kwargs) | 224 | name, namespace, sourceline=sourceline, sourcepos=sourcepos |
844 | 225 | ) | ||
845 | 197 | 226 | ||
846 | 198 | return Element(tag, self.soup, namespace) | 227 | return Element(tag, self.soup, namespace) |
847 | 199 | 228 | ||
849 | 200 | def commentClass(self, data): | 229 | def commentClass(self, data:str) -> 'TextNode': |
850 | 201 | return TextNode(Comment(data), self.soup) | 230 | return TextNode(Comment(data), self.soup) |
851 | 202 | 231 | ||
859 | 203 | def fragmentClass(self): | 232 | def fragmentClass(self) -> 'Element': |
860 | 204 | from bs4 import BeautifulSoup | 233 | """This is only used by html5lib HTMLParser.parseFragment(), |
861 | 205 | # TODO: Why is the parser 'html.parser' here? To avoid an | 234 | which is never used by Beautiful Soup.""" |
862 | 206 | # infinite loop? | 235 | raise NotImplementedError() |
863 | 207 | self.soup = BeautifulSoup("", "html.parser") | 236 | |
864 | 208 | self.soup.name = "[document_fragment]" | 237 | def getFragment(self) -> 'Element': |
865 | 209 | return Element(self.soup, self.soup, None) | 238 | """This is only used by html5lib HTMLParser.parseFragment, |
866 | 239 | which is never used by Beautiful Soup.""" | ||
867 | 240 | raise NotImplementedError() | ||
868 | 210 | 241 | ||
871 | 211 | def appendChild(self, node): | 242 | def appendChild(self, node:'Element') -> None: |
872 | 212 | # XXX This code is not covered by the BS4 tests. | 243 | # TODO: This code is not covered by the BS4 tests. |
873 | 213 | self.soup.append(node.element) | 244 | self.soup.append(node.element) |
874 | 214 | 245 | ||
876 | 215 | def getDocument(self): | 246 | def getDocument(self) -> 'BeautifulSoup': |
877 | 216 | return self.soup | 247 | return self.soup |
878 | 217 | 248 | ||
883 | 218 | def getFragment(self): | 249 | # TODO-TYPING: typeshed stubs are incorrect about this; |
884 | 219 | return treebuilder_base.TreeBuilder.getFragment(self).element | 250 | # testSerializer returns a str, not None.
885 | 220 | 251 | def testSerializer(self, element:'Element') -> str: | |
882 | 221 | def testSerializer(self, element): | ||
886 | 222 | from bs4 import BeautifulSoup | 252 | from bs4 import BeautifulSoup |
887 | 223 | rv = [] | 253 | rv = [] |
888 | 224 | doctype_re = re.compile(r'^(.*?)(?: PUBLIC "(.*?)"(?: "(.*?)")?| SYSTEM "(.*?)")?$') | 254 | doctype_re = re.compile(r'^(.*?)(?: PUBLIC "(.*?)"(?: "(.*?)")?| SYSTEM "(.*?)")?$') |
889 | 225 | 255 | ||
891 | 226 | def serializeElement(element, indent=0): | 256 | def serializeElement(element:Union['Element', PageElement], indent=0) -> None: |
892 | 227 | if isinstance(element, BeautifulSoup): | 257 | if isinstance(element, BeautifulSoup): |
893 | 228 | pass | 258 | pass |
894 | 229 | if isinstance(element, Doctype): | 259 | if isinstance(element, Doctype): |
895 | 230 | m = doctype_re.match(element) | 260 | m = doctype_re.match(element) |
897 | 231 | if m: | 261 | if m is not None: |
898 | 232 | name = m.group(1) | 262 | name = m.group(1) |
900 | 233 | if m.lastindex > 1: | 263 | if m.lastindex is not None and m.lastindex > 1: |
901 | 234 | publicId = m.group(2) or "" | 264 | publicId = m.group(2) or "" |
902 | 235 | systemId = m.group(3) or m.group(4) or "" | 265 | systemId = m.group(3) or m.group(4) or "" |
903 | 236 | rv.append("""|%s<!DOCTYPE %s "%s" "%s">""" % | 266 | rv.append("""|%s<!DOCTYPE %s "%s" "%s">""" % |
904 | @@ -243,7 +273,7 @@ class TreeBuilderForHtml5lib(treebuilder_base.TreeBuilder): | |||
905 | 243 | rv.append("|%s<!-- %s -->" % (' ' * indent, element)) | 273 | rv.append("|%s<!-- %s -->" % (' ' * indent, element)) |
906 | 244 | elif isinstance(element, NavigableString): | 274 | elif isinstance(element, NavigableString): |
907 | 245 | rv.append("|%s\"%s\"" % (' ' * indent, element)) | 275 | rv.append("|%s\"%s\"" % (' ' * indent, element)) |
909 | 246 | else: | 276 | elif isinstance(element, Element): |
910 | 247 | if element.namespace: | 277 | if element.namespace: |
911 | 248 | name = "%s %s" % (prefixes[element.namespace], | 278 | name = "%s %s" % (prefixes[element.namespace], |
912 | 249 | element.name) | 279 | element.name) |
913 | @@ -269,12 +299,19 @@ class TreeBuilderForHtml5lib(treebuilder_base.TreeBuilder): | |||
914 | 269 | return "\n".join(rv) | 299 | return "\n".join(rv) |
915 | 270 | 300 | ||
916 | 271 | class AttrList(object): | 301 | class AttrList(object): |
918 | 272 | def __init__(self, element): | 302 | """Represents a Tag's attributes in a way compatible with html5lib.""" |
919 | 303 | |||
920 | 304 | element:Tag | ||
921 | 305 | attrs:_AttributeValues | ||
922 | 306 | |||
923 | 307 | def __init__(self, element:Tag): | ||
924 | 273 | self.element = element | 308 | self.element = element |
925 | 274 | self.attrs = dict(self.element.attrs) | 309 | self.attrs = dict(self.element.attrs) |
927 | 275 | def __iter__(self): | 310 | |
928 | 311 | def __iter__(self) -> Iterable[Tuple[str, _AttributeValue]]: | ||
929 | 276 | return list(self.attrs.items()).__iter__() | 312 | return list(self.attrs.items()).__iter__() |
931 | 277 | def __setitem__(self, name, value): | 313 | |
932 | 314 | def __setitem__(self, name:str, value:_AttributeValue) -> None: | ||
933 | 278 | # If this attribute is a multi-valued attribute for this element, | 315 | # If this attribute is a multi-valued attribute for this element, |
934 | 279 | # turn its value into a list. | 316 | # turn its value into a list. |
935 | 280 | list_attr = self.element.cdata_list_attributes or {} | 317 | list_attr = self.element.cdata_list_attributes or {} |
936 | @@ -282,40 +319,52 @@ class AttrList(object): | |||
937 | 282 | or (self.element.name in list_attr | 319 | or (self.element.name in list_attr |
938 | 283 | and name in list_attr.get(self.element.name, []))): | 320 | and name in list_attr.get(self.element.name, []))): |
939 | 284 | # A node that is being cloned may have already undergone | 321 | # A node that is being cloned may have already undergone |
941 | 285 | # this procedure. | 322 | # this procedure. Check for this and skip it. |
942 | 286 | if not isinstance(value, list): | 323 | if not isinstance(value, list): |
943 | 324 | assert isinstance(value, str) | ||
944 | 287 | value = nonwhitespace_re.findall(value) | 325 | value = nonwhitespace_re.findall(value) |
945 | 288 | self.element[name] = value | 326 | self.element[name] = value |
947 | 289 | def items(self): | 327 | |
948 | 328 | def items(self) -> Iterable[Tuple[str, _AttributeValue]]: | ||
949 | 290 | return list(self.attrs.items()) | 329 | return list(self.attrs.items()) |
951 | 291 | def keys(self): | 330 | |
952 | 331 | def keys(self) -> Iterable[str]: | ||
953 | 292 | return list(self.attrs.keys()) | 332 | return list(self.attrs.keys()) |
955 | 293 | def __len__(self): | 333 | |
956 | 334 | def __len__(self) -> int: | ||
957 | 294 | return len(self.attrs) | 335 | return len(self.attrs) |
959 | 295 | def __getitem__(self, name): | 336 | |
960 | 337 | def __getitem__(self, name:str) -> _AttributeValue: | ||
961 | 296 | return self.attrs[name] | 338 | return self.attrs[name] |
963 | 297 | def __contains__(self, name): | 339 | |
964 | 340 | def __contains__(self, name:str) -> bool: | ||
965 | 298 | return name in list(self.attrs.keys()) | 341 | return name in list(self.attrs.keys()) |
966 | 299 | 342 | ||
967 | 300 | 343 | ||
968 | 301 | class Element(treebuilder_base.Node): | 344 | class Element(treebuilder_base.Node): |
970 | 302 | def __init__(self, element, soup, namespace): | 345 | |
971 | 346 | element:Tag | ||
972 | 347 | soup:'BeautifulSoup' | ||
973 | 348 | namespace:Optional[_NamespaceURL] | ||
974 | 349 | |||
975 | 350 | def __init__(self, element:Tag, soup:'BeautifulSoup', | ||
976 | 351 | namespace:Optional[_NamespaceURL]): | ||
977 | 303 | treebuilder_base.Node.__init__(self, element.name) | 352 | treebuilder_base.Node.__init__(self, element.name) |
978 | 304 | self.element = element | 353 | self.element = element |
979 | 305 | self.soup = soup | 354 | self.soup = soup |
980 | 306 | self.namespace = namespace | 355 | self.namespace = namespace |
981 | 307 | 356 | ||
983 | 308 | def appendChild(self, node): | 357 | def appendChild(self, node:'Element') -> None: |
984 | 309 | string_child = child = None | 358 | string_child = child = None |
985 | 310 | if isinstance(node, str): | 359 | if isinstance(node, str): |
986 | 311 | # Some other piece of code decided to pass in a string | 360 | # Some other piece of code decided to pass in a string |
987 | 312 | # instead of creating a TextElement object to contain the | 361 | # instead of creating a TextElement object to contain the |
989 | 313 | # string. | 362 | # string. This should not ever happen. |
990 | 314 | string_child = child = node | 363 | string_child = child = node |
991 | 315 | elif isinstance(node, Tag): | 364 | elif isinstance(node, Tag): |
992 | 316 | # Some other piece of code decided to pass in a Tag | 365 | # Some other piece of code decided to pass in a Tag |
993 | 317 | # instead of creating an Element object to contain the | 366 | # instead of creating an Element object to contain the |
995 | 318 | # Tag. | 367 | # Tag. This should not ever happen. |
996 | 319 | child = node | 368 | child = node |
997 | 320 | elif node.element.__class__ == NavigableString: | 369 | elif node.element.__class__ == NavigableString: |
998 | 321 | string_child = child = node.element | 370 | string_child = child = node.element |
999 | @@ -324,7 +373,7 @@ class Element(treebuilder_base.Node): | |||
1000 | 324 | child = node.element | 373 | child = node.element |
1001 | 325 | node.parent = self | 374 | node.parent = self |
1002 | 326 | 375 | ||
1004 | 327 | if not isinstance(child, str) and child.parent is not None: | 376 | if not isinstance(child, str) and child is not None and child.parent is not None: |
1005 | 328 | node.element.extract() | 377 | node.element.extract() |
1006 | 329 | 378 | ||
1007 | 330 | if (string_child is not None and self.element.contents | 379 | if (string_child is not None and self.element.contents |
1008 | @@ -359,14 +408,13 @@ class Element(treebuilder_base.Node): | |||
1009 | 359 | child, parent=self.element, | 408 | child, parent=self.element, |
1010 | 360 | most_recent_element=most_recent_element) | 409 | most_recent_element=most_recent_element) |
1011 | 361 | 410 | ||
1013 | 362 | def getAttributes(self): | 411 | def getAttributes(self) -> AttrList: |
1014 | 363 | if isinstance(self.element, Comment): | 412 | if isinstance(self.element, Comment): |
1015 | 364 | return {} | 413 | return {} |
1016 | 365 | return AttrList(self.element) | 414 | return AttrList(self.element) |
1017 | 366 | 415 | ||
1019 | 367 | def setAttributes(self, attributes): | 416 | def setAttributes(self, attributes:Optional[Dict]) -> None: |
1020 | 368 | if attributes is not None and len(attributes) > 0: | 417 | if attributes is not None and len(attributes) > 0: |
1021 | 369 | converted_attributes = [] | ||
1022 | 370 | for name, value in list(attributes.items()): | 418 | for name, value in list(attributes.items()): |
1023 | 371 | if isinstance(name, tuple): | 419 | if isinstance(name, tuple): |
1024 | 372 | new_name = NamespacedAttribute(*name) | 420 | new_name = NamespacedAttribute(*name) |
1025 | @@ -386,14 +434,14 @@ class Element(treebuilder_base.Node): | |||
1026 | 386 | self.soup.builder.set_up_substitutions(self.element) | 434 | self.soup.builder.set_up_substitutions(self.element) |
1027 | 387 | attributes = property(getAttributes, setAttributes) | 435 | attributes = property(getAttributes, setAttributes) |
1028 | 388 | 436 | ||
1030 | 389 | def insertText(self, data, insertBefore=None): | 437 | def insertText(self, data:str, insertBefore:Optional['Element']=None) -> None: |
1031 | 390 | text = TextNode(self.soup.new_string(data), self.soup) | 438 | text = TextNode(self.soup.new_string(data), self.soup) |
1032 | 391 | if insertBefore: | 439 | if insertBefore: |
1033 | 392 | self.insertBefore(text, insertBefore) | 440 | self.insertBefore(text, insertBefore) |
1034 | 393 | else: | 441 | else: |
1035 | 394 | self.appendChild(text) | 442 | self.appendChild(text) |
1036 | 395 | 443 | ||
1038 | 396 | def insertBefore(self, node, refNode): | 444 | def insertBefore(self, node:'Element', refNode:'Element') -> None: |
1039 | 397 | index = self.element.index(refNode.element) | 445 | index = self.element.index(refNode.element) |
1040 | 398 | if (node.element.__class__ == NavigableString and self.element.contents | 446 | if (node.element.__class__ == NavigableString and self.element.contents |
1041 | 399 | and self.element.contents[index-1].__class__ == NavigableString): | 447 | and self.element.contents[index-1].__class__ == NavigableString): |
1042 | @@ -405,10 +453,10 @@ class Element(treebuilder_base.Node): | |||
1043 | 405 | self.element.insert(index, node.element) | 453 | self.element.insert(index, node.element) |
1044 | 406 | node.parent = self | 454 | node.parent = self |
1045 | 407 | 455 | ||
1047 | 408 | def removeChild(self, node): | 456 | def removeChild(self, node:'Element') -> None: |
1048 | 409 | node.element.extract() | 457 | node.element.extract() |
1049 | 410 | 458 | ||
1051 | 411 | def reparentChildren(self, new_parent): | 459 | def reparentChildren(self, new_parent:'Element') -> None: |
1052 | 412 | """Move all of this tag's children into another tag.""" | 460 | """Move all of this tag's children into another tag.""" |
1053 | 413 | # print("MOVE", self.element.contents) | 461 | # print("MOVE", self.element.contents) |
1054 | 414 | # print("FROM", self.element) | 462 | # print("FROM", self.element) |
1055 | @@ -424,6 +472,10 @@ class Element(treebuilder_base.Node): | |||
1056 | 424 | if len(new_parent_element.contents) > 0: | 472 | if len(new_parent_element.contents) > 0: |
1057 | 425 | # The new parent already contains children. We will be | 473 | # The new parent already contains children. We will be |
1058 | 426 | # appending this tag's children to the end. | 474 | # appending this tag's children to the end. |
1059 | 475 | |||
1060 | 476 | # We can make this assertion since we know new_parent has | ||
1061 | 477 | # children. | ||
1062 | 478 | assert new_parents_last_descendant is not None | ||
1063 | 427 | new_parents_last_child = new_parent_element.contents[-1] | 479 | new_parents_last_child = new_parent_element.contents[-1] |
1064 | 428 | new_parents_last_descendant_next_element = new_parents_last_descendant.next_element | 480 | new_parents_last_descendant_next_element = new_parents_last_descendant.next_element |
1065 | 429 | else: | 481 | else: |
1066 | @@ -474,17 +526,21 @@ class Element(treebuilder_base.Node): | |||
1067 | 474 | # print("FROM", self.element) | 526 | # print("FROM", self.element) |
1068 | 475 | # print("TO", new_parent_element) | 527 | # print("TO", new_parent_element) |
1069 | 476 | 528 | ||
1071 | 477 | def cloneNode(self): | 529 | # TODO: typeshed stubs are incorrect about this; |
1072 | 530 | # cloneNode returns a new Node, not None. | ||
1073 | 531 | def cloneNode(self) -> treebuilder_base.Node: | ||
1074 | 478 | tag = self.soup.new_tag(self.element.name, self.namespace) | 532 | tag = self.soup.new_tag(self.element.name, self.namespace) |
1075 | 479 | node = Element(tag, self.soup, self.namespace) | 533 | node = Element(tag, self.soup, self.namespace) |
1076 | 480 | for key,value in self.attributes: | 534 | for key,value in self.attributes: |
1077 | 481 | node.attributes[key] = value | 535 | node.attributes[key] = value |
1078 | 482 | return node | 536 | return node |
1079 | 483 | 537 | ||
1082 | 484 | def hasContent(self): | 538 | # TODO-TYPING: typeshed stubs are incorrect about this; |
1083 | 485 | return self.element.contents | 539 | # hasContent returns a boolean, not None. ||
1084 | 540 | def hasContent(self) -> bool: | ||
1085 | 541 | return len(self.element.contents) > 0 | ||
1086 | 486 | 542 | ||
1088 | 487 | def getNameTuple(self): | 543 | def getNameTuple(self) -> Tuple[str, str]: |
1089 | 488 | if self.namespace == None: | 544 | if self.namespace == None: |
1090 | 489 | return namespaces["html"], self.name | 545 | return namespaces["html"], self.name |
1091 | 490 | else: | 546 | else: |
1092 | @@ -493,10 +549,10 @@ class Element(treebuilder_base.Node): | |||
1093 | 493 | nameTuple = property(getNameTuple) | 549 | nameTuple = property(getNameTuple) |
1094 | 494 | 550 | ||
1095 | 495 | class TextNode(Element): | 551 | class TextNode(Element): |
1097 | 496 | def __init__(self, element, soup): | 552 | def __init__(self, element:PageElement, soup:'BeautifulSoup'): |
1098 | 497 | treebuilder_base.Node.__init__(self, None) | 553 | treebuilder_base.Node.__init__(self, None) |
1099 | 498 | self.element = element | 554 | self.element = element |
1100 | 499 | self.soup = soup | 555 | self.soup = soup |
1101 | 500 | 556 | ||
1104 | 501 | def cloneNode(self): | 557 | def cloneNode(self) -> treebuilder_base.Node: |
1105 | 502 | raise NotImplementedError | 558 | raise NotImplementedError() |
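One change worth calling out in the `_html5lib.py` hunks above is the new assertion in `AttrList.__setitem__`: before splitting a multi-valued attribute (such as `class`) on whitespace, the value is now asserted to be a `str`. A standalone sketch of that normalization, with the helper name and regex ours (bs4's actual pattern lives in `bs4.element`):

```python
import re

# Pattern matching runs of non-whitespace, in the spirit of the
# nonwhitespace_re used in the diff above.
nonwhitespace_re = re.compile(r"\S+")

def normalize_attribute_value(value):
    """Return the attribute value as a list, splitting a string on
    whitespace. Mirrors the AttrList.__setitem__ logic in the hunk
    above; the function name itself is illustrative."""
    if not isinstance(value, list):
        assert isinstance(value, str)
        value = nonwhitespace_re.findall(value)
    return value
```

The assertion documents an invariant rather than enforcing user input: by the time `__setitem__` runs, the value is either already a list or a raw string from the parser.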
1106 | diff --git a/bs4/builder/_htmlparser.py b/bs4/builder/_htmlparser.py | |||
1107 | index 291f6c6..91cecf7 100644 | |||
1108 | --- a/bs4/builder/_htmlparser.py | |||
1109 | +++ b/bs4/builder/_htmlparser.py | |||
1110 | @@ -188,7 +188,7 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML): | |||
1111 | 188 | # later on. If so, we want to ignore it. | 188 | # later on. If so, we want to ignore it. |
1112 | 189 | self.already_closed_empty_element.append(name) | 189 | self.already_closed_empty_element.append(name) |
1113 | 190 | 190 | ||
1115 | 191 | if self._root_tag is None: | 191 | if self._root_tag_name is None: |
1116 | 192 | self._root_tag_encountered(name) | 192 | self._root_tag_encountered(name) |
1117 | 193 | 193 | ||
1118 | 194 | def handle_endtag(self, name:str, check_already_closed:bool=True) -> None: | 194 | def handle_endtag(self, name:str, check_already_closed:bool=True) -> None: |
1119 | @@ -422,13 +422,23 @@ class HTMLParserTreeBuilder(HTMLTreeBuilder): | |||
1120 | 422 | dammit.declared_html_encoding, | 422 | dammit.declared_html_encoding, |
1121 | 423 | dammit.contains_replacement_characters) | 423 | dammit.contains_replacement_characters) |
1122 | 424 | 424 | ||
1124 | 425 | def feed(self, markup:str): | 425 | def feed(self, markup:_RawMarkup) -> None: |
1125 | 426 | args, kwargs = self.parser_args | 426 | args, kwargs = self.parser_args |
1126 | 427 | |||
1127 | 428 | # HTMLParser.feed will only handle str, but | ||
1128 | 429 | # BeautifulSoup.markup is allowed to be _RawMarkup, because | ||
1129 | 430 | # it's set by the yield value of | ||
1130 | 431 | # TreeBuilder.prepare_markup. Fortunately, | ||
1131 | 432 | # HTMLParserTreeBuilder.prepare_markup always yields a str | ||
1132 | 433 | # (UnicodeDammit.unicode_markup). | ||
1133 | 434 | assert isinstance(markup, str) | ||
1134 | 435 | |||
1135 | 427 | # We know BeautifulSoup calls TreeBuilder.initialize_soup | 436 | # We know BeautifulSoup calls TreeBuilder.initialize_soup |
1136 | 428 | # before calling feed(), so we can assume self.soup | 437 | # before calling feed(), so we can assume self.soup |
1137 | 429 | # is set. | 438 | # is set. |
1138 | 430 | assert self.soup is not None | 439 | assert self.soup is not None |
1139 | 431 | parser = BeautifulSoupHTMLParser(self.soup, *args, **kwargs) | 440 | parser = BeautifulSoupHTMLParser(self.soup, *args, **kwargs) |
1140 | 441 | |||
1141 | 432 | try: | 442 | try: |
1142 | 433 | parser.feed(markup) | 443 | parser.feed(markup) |
1143 | 434 | parser.close() | 444 | parser.close() |
1144 | diff --git a/bs4/builder/_lxml.py b/bs4/builder/_lxml.py | |||
1145 | index ba87e87..3dfe88a 100644 | |||
1146 | --- a/bs4/builder/_lxml.py | |||
1147 | +++ b/bs4/builder/_lxml.py | |||
1148 | @@ -269,7 +269,7 @@ class LXMLTreeBuilderForXML(TreeBuilder): | |||
1149 | 269 | for encoding in detector.encodings: | 269 | for encoding in detector.encodings: |
1150 | 270 | yield (detector.markup, encoding, document_declared_encoding, False) | 270 | yield (detector.markup, encoding, document_declared_encoding, False) |
1151 | 271 | 271 | ||
1153 | 272 | def feed(self, markup:Union[bytes,str]) -> None: | 272 | def feed(self, markup:_RawMarkup) -> None: |
1154 | 273 | io: IO | 273 | io: IO |
1155 | 274 | if isinstance(markup, bytes): | 274 | if isinstance(markup, bytes): |
1156 | 275 | io = BytesIO(markup) | 275 | io = BytesIO(markup) |
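The `_lxml.py` change above widens `feed` to accept `_RawMarkup`; the visible branch wraps `bytes` markup in a `BytesIO` before handing it to lxml's incremental parser. A minimal sketch of that dispatch, assuming (as the hidden `else` branch presumably does) that `str` markup goes through `StringIO`; the helper name is ours:

```python
from io import BytesIO, StringIO

def as_file_like(markup):
    """Wrap raw markup (bytes or str) in a file-like object suitable
    for an incremental parser. A sketch of the feed() dispatch above,
    not the exact bs4 implementation."""
    if isinstance(markup, bytes):
        return BytesIO(markup)
    return StringIO(markup)
```

Wrapping in a stream lets the same parser-driving loop read both encoded and decoded markup without special-casing the type at each read.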
1157 | diff --git a/bs4/diagnose.py b/bs4/diagnose.py | |||
1158 | index 201b879..c2202ad 100644 | |||
1159 | --- a/bs4/diagnose.py | |||
1160 | +++ b/bs4/diagnose.py | |||
1161 | @@ -9,7 +9,15 @@ from html.parser import HTMLParser | |||
1162 | 9 | import bs4 | 9 | import bs4 |
1163 | 10 | from bs4 import BeautifulSoup, __version__ | 10 | from bs4 import BeautifulSoup, __version__ |
1164 | 11 | from bs4.builder import builder_registry | 11 | from bs4.builder import builder_registry |
1166 | 12 | from typing import TYPE_CHECKING | 12 | from typing import ( |
1167 | 13 | Any, | ||
1168 | 14 | IO, | ||
1169 | 15 | List, | ||
1170 | 16 | Optional, | ||
1171 | 17 | Tuple, | ||
1172 | 18 | TYPE_CHECKING, | ||
1173 | 19 | ) | ||
1174 | 20 | |||
1175 | 13 | if TYPE_CHECKING: | 21 | if TYPE_CHECKING: |
1176 | 14 | from bs4._typing import _IncomingMarkup | 22 | from bs4._typing import _IncomingMarkup |
1177 | 15 | 23 | ||
1178 | @@ -78,7 +86,7 @@ def diagnose(data:_IncomingMarkup) -> None: | |||
1179 | 78 | 86 | ||
1180 | 79 | print(("-" * 80)) | 87 | print(("-" * 80)) |
1181 | 80 | 88 | ||
1183 | 81 | def lxml_trace(data, html:bool=True, **kwargs) -> None: | 89 | def lxml_trace(data:_IncomingMarkup, html:bool=True, **kwargs:Any) -> None: |
1184 | 82 | """Print out the lxml events that occur during parsing. | 90 | """Print out the lxml events that occur during parsing. |
1185 | 83 | 91 | ||
1186 | 84 | This lets you see how lxml parses a document when no Beautiful | 92 | This lets you see how lxml parses a document when no Beautiful |
1187 | @@ -94,7 +102,8 @@ def lxml_trace(data, html:bool=True, **kwargs) -> None: | |||
1188 | 94 | recover = kwargs.pop('recover', True) | 102 | recover = kwargs.pop('recover', True) |
1189 | 95 | if isinstance(data, str): | 103 | if isinstance(data, str): |
1190 | 96 | data = data.encode("utf8") | 104 | data = data.encode("utf8") |
1192 | 97 | reader = BytesIO(data) | 105 | if not isinstance(data, IO): |
1193 | 106 | reader = BytesIO(data) | ||
1194 | 98 | for event, element in etree.iterparse( | 107 | for event, element in etree.iterparse( |
1195 | 99 | reader, html=html, recover=recover, **kwargs | 108 | reader, html=html, recover=recover, **kwargs |
1196 | 100 | ): | 109 | ): |
1197 | @@ -108,37 +117,40 @@ class AnnouncingParser(HTMLParser): | |||
1198 | 108 | document. The easiest way to do this is to call `htmlparser_trace`. | 117 | document. The easiest way to do this is to call `htmlparser_trace`. |
1199 | 109 | """ | 118 | """ |
1200 | 110 | 119 | ||
1202 | 111 | def _p(self, s): | 120 | def _p(self, s:str) -> None: |
1203 | 112 | print(s) | 121 | print(s) |
1204 | 113 | 122 | ||
1206 | 114 | def handle_starttag(self, name, attrs): | 123 | def handle_starttag( |
1207 | 124 | self, name:str, attrs:List[Tuple[str, Optional[str]]], | ||
1208 | 125 | handle_empty_element:bool=True | ||
1209 | 126 | ) -> None: | ||
1210 | 115 | self._p(f"{name} {attrs} START") | 127 | self._p(f"{name} {attrs} START") |
1211 | 116 | 128 | ||
1213 | 117 | def handle_endtag(self, name): | 129 | def handle_endtag(self, name:str, check_already_closed:bool=True) -> None: |
1214 | 118 | self._p("%s END" % name) | 130 | self._p("%s END" % name) |
1215 | 119 | 131 | ||
1217 | 120 | def handle_data(self, data): | 132 | def handle_data(self, data:str) -> None: |
1218 | 121 | self._p("%s DATA" % data) | 133 | self._p("%s DATA" % data) |
1219 | 122 | 134 | ||
1221 | 123 | def handle_charref(self, name): | 135 | def handle_charref(self, name:str) -> None: |
1222 | 124 | self._p("%s CHARREF" % name) | 136 | self._p("%s CHARREF" % name) |
1223 | 125 | 137 | ||
1225 | 126 | def handle_entityref(self, name): | 138 | def handle_entityref(self, name:str) -> None: |
1226 | 127 | self._p("%s ENTITYREF" % name) | 139 | self._p("%s ENTITYREF" % name) |
1227 | 128 | 140 | ||
1229 | 129 | def handle_comment(self, data): | 141 | def handle_comment(self, data:str) -> None: |
1230 | 130 | self._p("%s COMMENT" % data) | 142 | self._p("%s COMMENT" % data) |
1231 | 131 | 143 | ||
1233 | 132 | def handle_decl(self, data): | 144 | def handle_decl(self, data:str) -> None: |
1234 | 133 | self._p("%s DECL" % data) | 145 | self._p("%s DECL" % data) |
1235 | 134 | 146 | ||
1237 | 135 | def unknown_decl(self, data): | 147 | def unknown_decl(self, data:str) -> None: |
1238 | 136 | self._p("%s UNKNOWN-DECL" % data) | 148 | self._p("%s UNKNOWN-DECL" % data) |
1239 | 137 | 149 | ||
1241 | 138 | def handle_pi(self, data): | 150 | def handle_pi(self, data:str) -> None: |
1242 | 139 | self._p("%s PI" % data) | 151 | self._p("%s PI" % data) |
1243 | 140 | 152 | ||
1245 | 141 | def htmlparser_trace(data): | 153 | def htmlparser_trace(data:str) -> None: |
1246 | 142 | """Print out the HTMLParser events that occur during parsing. | 154 | """Print out the HTMLParser events that occur during parsing. |
1247 | 143 | 155 | ||
1248 | 144 | This lets you see how HTMLParser parses a document when no | 156 | This lets you see how HTMLParser parses a document when no |
1249 | @@ -226,7 +238,7 @@ def benchmark_parsers(num_elements:int=100000) -> None: | |||
1250 | 226 | b = time.time() | 238 | b = time.time() |
1251 | 227 | print(("Raw html5lib parsed the markup in %.2fs." % (b-a))) | 239 | print(("Raw html5lib parsed the markup in %.2fs." % (b-a))) |
1252 | 228 | 240 | ||
1254 | 229 | def profile(num_elements:int=100000, parser:str="lxml"): | 241 | def profile(num_elements:int=100000, parser:str="lxml") -> None: |
1255 | 230 | """Use Python's profiler on a randomly generated document.""" | 242 | """Use Python's profiler on a randomly generated document.""" |
1256 | 231 | filehandle = tempfile.NamedTemporaryFile() | 243 | filehandle = tempfile.NamedTemporaryFile() |
1257 | 232 | filename = filehandle.name | 244 | filename = filehandle.name |
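The `diagnose.py` hunks above mostly add type annotations to `AnnouncingParser`, a debugging subclass of the stdlib `HTMLParser` that prints one line per parse event. A self-contained variant that records events in a list instead of printing them, so the callback behavior is easy to verify (the class name is ours):

```python
from html.parser import HTMLParser

class EventRecorder(HTMLParser):
    """Like bs4.diagnose.AnnouncingParser, but collects each parse
    event in a list rather than printing it."""
    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, name, attrs):
        # attrs arrives as a list of (name, value) tuples.
        self.events.append(f"{name} {attrs} START")

    def handle_endtag(self, name):
        self.events.append("%s END" % name)

    def handle_data(self, data):
        self.events.append("%s DATA" % data)

parser = EventRecorder()
parser.feed("<p>hi</p>")
```

Feeding `<p>hi</p>` produces a start-tag, data, and end-tag event, in document order.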
1258 | diff --git a/bs4/element.py b/bs4/element.py | |||
1259 | index 83f4882..f4ab89c 100644 | |||
1260 | --- a/bs4/element.py | |||
1261 | +++ b/bs4/element.py | |||
1262 | @@ -44,6 +44,7 @@ if TYPE_CHECKING: | |||
1263 | 44 | from bs4 import BeautifulSoup | 44 | from bs4 import BeautifulSoup |
1264 | 45 | from bs4.builder import TreeBuilder | 45 | from bs4.builder import TreeBuilder |
1265 | 46 | from bs4.dammit import _Encoding | 46 | from bs4.dammit import _Encoding |
1266 | 47 | from bs4.filter import ElementFilter | ||
1267 | 47 | from bs4.formatter import ( | 48 | from bs4.formatter import ( |
1268 | 48 | _EntitySubstitutionFunction, | 49 | _EntitySubstitutionFunction, |
1269 | 49 | _FormatterOrName, | 50 | _FormatterOrName, |
1270 | @@ -901,7 +902,7 @@ class PageElement(object): | |||
1271 | 901 | limit:Optional[int], | 902 | limit:Optional[int], |
1272 | 902 | generator:Iterator[PageElement], | 903 | generator:Iterator[PageElement], |
1273 | 903 | _stacklevel:int=3, | 904 | _stacklevel:int=3, |
1275 | 904 | **kwargs:_StrainableAttribute) -> ResultSet[PageElement]: | 905 | **kwargs:_StrainableAttribute) -> ResultSet[PageElement]: |
1276 | 905 | """Iterates over a generator looking for things that match.""" | 906 | """Iterates over a generator looking for things that match.""" |
1277 | 906 | results: ResultSet[PageElement] | 907 | results: ResultSet[PageElement] |
1278 | 907 | 908 | ||
1279 | @@ -912,11 +913,11 @@ class PageElement(object): | |||
1280 | 912 | DeprecationWarning, stacklevel=_stacklevel | 913 | DeprecationWarning, stacklevel=_stacklevel |
1281 | 913 | ) | 914 | ) |
1282 | 914 | 915 | ||
1286 | 915 | from bs4.strainer import SoupStrainer | 916 | from bs4.filter import ElementFilter |
1287 | 916 | if isinstance(name, SoupStrainer): | 917 | if isinstance(name, ElementFilter): |
1288 | 917 | strainer = name | 918 | matcher = name |
1289 | 918 | else: | 919 | else: |
1291 | 919 | strainer = SoupStrainer(name, attrs, string, **kwargs) | 920 | matcher = SoupStrainer(name, attrs, string, **kwargs) |
1292 | 920 | 921 | ||
1293 | 921 | result: Iterable[PageElement] | 922 | result: Iterable[PageElement] |
1294 | 922 | if string is None and not limit and not attrs and not kwargs: | 923 | if string is None and not limit and not attrs and not kwargs: |
1295 | @@ -924,7 +925,7 @@ class PageElement(object): | |||
1296 | 924 | # Optimization to find all tags. | 925 | # Optimization to find all tags. |
1297 | 925 | result = (element for element in generator | 926 | result = (element for element in generator |
1298 | 926 | if isinstance(element, Tag)) | 927 | if isinstance(element, Tag)) |
1300 | 927 | return ResultSet(strainer, result) | 928 | return ResultSet(matcher, result) |
1301 | 928 | elif isinstance(name, str): | 929 | elif isinstance(name, str): |
1302 | 929 | # Optimization to find all tags with a given name. | 930 | # Optimization to find all tags with a given name. |
1303 | 930 | if name.count(':') == 1: | 931 | if name.count(':') == 1: |
1304 | @@ -945,22 +946,25 @@ class PageElement(object): | |||
1305 | 945 | ) | 946 | ) |
1306 | 946 | ): | 947 | ): |
1307 | 947 | result.append(element) | 948 | result.append(element) |
1309 | 948 | return ResultSet(strainer, result) | 949 | return ResultSet(matcher, result) |
1310 | 950 | return self.match(generator, matcher, limit) | ||
1311 | 951 | |||
1312 | 952 | def match(self, generator:Iterator[PageElement], matcher:ElementFilter, limit:Optional[int]=None) -> ResultSet[PageElement]: | ||
1313 | 953 | """The most generic search method offered by Beautiful Soup. | ||
1314 | 949 | 954 | ||
1316 | 950 | results = ResultSet(strainer) | 955 | You can pass in your own technique for iterating over the tree, and your own |
1317 | 956 | technique for matching items. | ||
1318 | 957 | """ | ||
1319 | 958 | results:ResultSet = ResultSet(matcher) | ||
1320 | 951 | while True: | 959 | while True: |
1321 | 952 | try: | 960 | try: |
1322 | 953 | i = next(generator) | 961 | i = next(generator) |
1323 | 954 | except StopIteration: | 962 | except StopIteration: |
1324 | 955 | break | 963 | break |
1325 | 956 | if i: | 964 | if i: |
1333 | 957 | # TODO: SoupStrainer.search is a confusing method | 965 | if matcher.match(i): |
1334 | 958 | # that needs to be redone, and this is where | 966 | results.append(i) |
1335 | 959 | # it's being used. | 967 | if limit is not None and len(results) >= limit: |
1329 | 960 | found = strainer.search(i) | ||
1330 | 961 | if found: | ||
1331 | 962 | results.append(found) | ||
1332 | 963 | if limit and len(results) >= limit: | ||
1336 | 964 | break | 968 | break |
1337 | 965 | return results | 969 | return results |
1338 | 966 | 970 | ||
1339 | @@ -1254,7 +1258,7 @@ class Declaration(PreformattedString): | |||
1340 | 1254 | class Doctype(PreformattedString): | 1258 | class Doctype(PreformattedString): |
1341 | 1255 | """A `document type declaration <https://www.w3.org/TR/REC-xml/#dt-doctype>`_.""" | 1259 | """A `document type declaration <https://www.w3.org/TR/REC-xml/#dt-doctype>`_.""" |
1342 | 1256 | @classmethod | 1260 | @classmethod |
1344 | 1257 | def for_name_and_ids(cls, name:str, pub_id:str, system_id:str) -> Doctype: | 1261 | def for_name_and_ids(cls, name:str, pub_id:Optional[str], system_id:Optional[str]) -> Doctype: |
1345 | 1258 | """Generate an appropriate document type declaration for a given | 1262 | """Generate an appropriate document type declaration for a given |
1346 | 1259 | public ID and system ID. | 1263 | public ID and system ID. |
1347 | 1260 | 1264 | ||
1348 | @@ -2503,12 +2507,12 @@ class Tag(PageElement): | |||
1349 | 2503 | _PageElementT = TypeVar("_PageElementT", bound=PageElement) | 2507 | _PageElementT = TypeVar("_PageElementT", bound=PageElement) |
1350 | 2504 | class ResultSet(List[_PageElementT], Generic[_PageElementT]): | 2508 | class ResultSet(List[_PageElementT], Generic[_PageElementT]): |
1351 | 2505 | """A ResultSet is a list of `PageElement` objects, gathered as the result | 2509 | """A ResultSet is a list of `PageElement` objects, gathered as the result |
1353 | 2506 | of matching a `SoupStrainer` against a parse tree. Basically, a list of | 2510 | of matching an `ElementFilter` against a parse tree. Basically, a list of |
1354 | 2507 | search results. | 2511 | search results. |
1355 | 2508 | """ | 2512 | """ |
1357 | 2509 | source: Optional[SoupStrainer] | 2513 | source: Optional[ElementFilter] |
1358 | 2510 | 2514 | ||
1360 | 2511 | def __init__(self, source:Optional[SoupStrainer], result: Iterable[_PageElementT]=()) -> None: | 2515 | def __init__(self, source:Optional[ElementFilter], result: Iterable[_PageElementT]=()) -> None: |
1361 | 2512 | super(ResultSet, self).__init__(result) | 2516 | super(ResultSet, self).__init__(result) |
1362 | 2513 | self.source = source | 2517 | self.source = source |
1363 | 2514 | 2518 | ||
1364 | @@ -2522,4 +2526,4 @@ class ResultSet(List[_PageElementT], Generic[_PageElementT]): | |||
1365 | 2522 | # import SoupStrainer itself into this module to preserve the | 2526 | # import SoupStrainer itself into this module to preserve the |
1366 | 2523 | # backwards compatibility of anyone who imports | 2527 | # backwards compatibility of anyone who imports |
1367 | 2524 | # bs4.element.SoupStrainer. | 2528 | # bs4.element.SoupStrainer. |
1369 | 2525 | from bs4.strainer import SoupStrainer | 2529 | from bs4.filter import SoupStrainer |
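The new `PageElement.match` method introduced in the `element.py` hunks above is the key refactoring of this branch: tree iteration (the generator) is now decoupled from matching (any object with a `match(element) -> bool` method), replacing the old `SoupStrainer.search` call. The loop, sketched standalone with a toy matcher (names ours; `ResultSet` is replaced by a plain list):

```python
def match(generator, matcher, limit=None):
    """Collect elements from `generator` that `matcher` accepts,
    stopping once `limit` results are found. Mirrors the shape of
    the new PageElement.match loop shown in the diff above."""
    results = []
    for element in generator:
        # The real loop also skips falsy elements ('if i:').
        if element and matcher.match(element):
            results.append(element)
            if limit is not None and len(results) >= limit:
                break
    return results

class Divisible:
    """Toy ElementFilter-style matcher, purely illustrative."""
    def __init__(self, n):
        self.n = n
    def match(self, element):
        return element % self.n == 0
```

Note the diff also tightens the limit check from `if limit` to `if limit is not None`, so an explicit `limit=0` no longer means "unlimited".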
1370 | diff --git a/bs4/strainer.py b/bs4/filter.py | |||
1371 | 2526 | similarity index 60% | 2530 | similarity index 60% |
1372 | 2527 | rename from bs4/strainer.py | 2531 | rename from bs4/strainer.py |
1373 | 2528 | rename to bs4/filter.py | 2532 | rename to bs4/filter.py |
1374 | index 15b289c..74e26d9 100644 | |||
1375 | --- a/bs4/strainer.py | |||
1376 | +++ b/bs4/filter.py | |||
1377 | @@ -25,6 +25,10 @@ from bs4._deprecation import _deprecated | |||
1378 | 25 | from bs4.element import NavigableString, PageElement, Tag | 25 | from bs4.element import NavigableString, PageElement, Tag |
1379 | 26 | from bs4._typing import ( | 26 | from bs4._typing import ( |
1380 | 27 | _AttributeValue, | 27 | _AttributeValue, |
1381 | 28 | _AttributeValues, | ||
1382 | 29 | _AllowStringCreationFunction, | ||
1383 | 30 | _AllowTagCreationFunction, | ||
1384 | 31 | _PageElementMatchFunction, | ||
1385 | 28 | _TagMatchFunction, | 32 | _TagMatchFunction, |
1386 | 29 | _StringMatchFunction, | 33 | _StringMatchFunction, |
1387 | 30 | _StrainableElement, | 34 | _StrainableElement, |
1388 | @@ -33,13 +37,96 @@ from bs4._typing import ( | |||
1389 | 33 | _StrainableString, | 37 | _StrainableString, |
1390 | 34 | ) | 38 | ) |
1391 | 35 | 39 | ||
1392 | 40 | |||
1393 | 41 | class ElementFilter(object): | ||
1394 | 42 | """ElementFilters encapsulate the logic necessary to decide: | ||
1395 | 43 | |||
1396 | 44 | 1. whether a PageElement (a tag or a string) matches a | ||
1397 | 45 | user-specified query. | ||
1398 | 46 | |||
1399 | 47 | 2. whether a given sequence of markup found during initial parsing | ||
1400 | 48 | should be turned into a PageElement, or simply discarded. | ||
1401 | 49 | |||
1402 | 50 | The base class is the simplest ElementFilter. By default, it | ||
1403 | 51 | matches everything and allows all PageElements to be created. You | ||
1404 | 52 | can make it more selective by passing in user-defined functions. | ||
1405 | 53 | |||
1406 | 54 | Most users of Beautiful Soup will never need to use | ||
1407 | 55 | ElementFilter, or its more capable subclass | ||
1408 | 56 | SoupStrainer. Instead, they will use the find_* methods, which | ||
1409 | 57 | will convert their arguments into SoupStrainer objects and run them | ||
1410 | 58 | against the tree. | ||
1411 | 59 | """ | ||
1412 | 60 | match_hook: Optional[_PageElementMatchFunction] | ||
1413 | 61 | allow_tag_creation_function: Optional[_AllowTagCreationFunction] | ||
1414 | 62 | allow_string_creation_function: Optional[_AllowStringCreationFunction] | ||
1415 | 63 | |||
1416 | 64 | def __init__( | ||
1417 | 65 | self, match_function:Optional[_PageElementMatchFunction]=None, | ||
1418 | 66 | allow_tag_creation_function:Optional[_AllowTagCreationFunction]=None, | ||
1419 | 67 | allow_string_creation_function:Optional[_AllowStringCreationFunction]=None): | ||
1420 | 68 | self.match_function = match_function | ||
1421 | 69 | self.allow_tag_creation_function = allow_tag_creation_function | ||
1422 | 70 | self.allow_string_creation_function = allow_string_creation_function | ||
1423 | 71 | |||
1424 | 72 | @property | ||
1425 | 73 | def excludes_everything(self) -> bool: | ||
1426 | 74 | """Does this ElementFilter obviously exclude everything? If | ||
1427 | 75 | so, Beautiful Soup will issue a warning if you try to use it | ||
1428 | 76 | when parsing a document. | ||
1429 | 77 | |||
1430 | 78 | The ElementFilter might turn out to exclude everything even | ||
1431 | 79 | if this returns False, but it won't do so in an obvious way. | ||
1432 | 80 | |||
1433 | 81 | The default ElementFilter excludes *nothing*, and we don't | ||
1434 | 82 | have any way of answering questions about more complex | ||
1435 | 83 | ElementFilters without running their hook functions, so the | ||
1436 | 84 | base implementation always returns False. | ||
1437 | 85 | """ | ||
1438 | 86 | return False | ||
1439 | 87 | |||
1440 | 88 | def match(self, element:PageElement) -> bool: | ||
1441 | 89 | """Does the given PageElement match the rules set down by this | ||
1442 | 90 | ElementFilter? | ||
1443 | 91 | |||
1444 | 92 | The base implementation delegates to the function passed in to | ||
1445 | 93 | the constructor. | ||
1446 | 94 | """ | ||
1447 | 95 | if not self.match_function: | ||
1448 | 96 | return True | ||
1449 | 97 | return self.match_function(element) | ||
1450 | 98 | |||
1451 | 99 | def allow_tag_creation( | ||
1452 | 100 | self, nsprefix:Optional[str], name:str, | ||
1453 | 101 | attrs:Optional[_AttributeValues] | ||
1454 | 102 | ) -> bool: | ||
1455 | 103 | """Based on the name and attributes of a tag, see whether this | ||
1456 | 104 | ElementFilter will allow a Tag object to even be created. | ||
1457 | 105 | |||
1458 | 106 | :param name: The name of the prospective tag. | ||
1459 | 107 | :param attrs: The attributes of the prospective tag. | ||
1460 | 108 | """ | ||
1461 | 109 | if not self.allow_tag_creation_function: | ||
1462 | 110 | return True | ||
1463 | 111 | return self.allow_tag_creation_function(nsprefix, name, attrs) | ||
1464 | 112 | |||
1465 | 113 | def allow_string_creation(self, string:str) -> bool: | ||
1466 | 114 | if not self.allow_string_creation_function: | ||
1467 | 115 | return True | ||
1468 | 116 | return self.allow_string_creation_function(string) | ||
1469 | 117 | |||
1470 | 118 | |||
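The `ElementFilter` base class defined above follows one pattern throughout: each decision delegates to an optional hook function, and a missing hook means "allow everything". A condensed standalone model of that dispatch, trimmed to two of the three hooks (the class name is ours; the real class lives in `bs4.filter`):

```python
class SimpleFilter:
    """Condensed model of the ElementFilter hook dispatch above:
    every optional hook defaults to 'allow everything'."""
    def __init__(self, match_function=None,
                 allow_string_creation_function=None):
        self.match_function = match_function
        self.allow_string_creation_function = allow_string_creation_function

    def match(self, element):
        if not self.match_function:
            return True
        return self.match_function(element)

    def allow_string_creation(self, string):
        if not self.allow_string_creation_function:
            return True
        return self.allow_string_creation_function(string)

# A filter that only matches elements named "a" (illustrative).
links_only = SimpleFilter(match_function=lambda e: e == "a")
```

This is why `excludes_everything` returns `False` in the base implementation: without running the hooks there is no cheap way to prove a hook-based filter rejects everything.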
1471 | 36 | class MatchRule(object): | 119 | class MatchRule(object): |
1472 | 120 | """Each MatchRule encapsulates the logic behind a single argument | ||
1473 | 121 | passed in to one of the Beautiful Soup find* methods. | ||
1474 | 122 | """ | ||
1475 | 123 | |||
1476 | 37 | string: Optional[str] | 124 | string: Optional[str] |
1477 | 38 | pattern: Optional[Pattern[str]] | 125 | pattern: Optional[Pattern[str]] |
1478 | 39 | present: Optional[bool] | 126 | present: Optional[bool] |
1482 | 40 | 127 | # TODO-TYPING: All MatchRule objects also have an attribute | |
1483 | 41 | # All MatchRule objects also have an attribute ``function``, but | 128 | # ``function``, but the type of the function depends on the |
1484 | 42 | # the type of the function depends on the subclass. | 129 | # subclass. |
1485 | 43 | 130 | ||
1486 | 44 | def __init__( | 131 | def __init__( |
1487 | 45 | self, | 132 | self, |
1488 | @@ -72,7 +159,7 @@ class MatchRule(object): | |||
1489 | 72 | "At most one of string, pattern, function and present must be provided." | 159 | "At most one of string, pattern, function and present must be provided." |
1490 | 73 | ) | 160 | ) |
1491 | 74 | 161 | ||
1493 | 75 | def _base_match(self, string:str) -> Optional[bool]: | 162 | def _base_match(self, string:Optional[str]) -> Optional[bool]: |
1494 | 76 | """Run the 'cheap' portion of a match, trying to get an answer without | 163 | """Run the 'cheap' portion of a match, trying to get an answer without |
1495 | 77 | calling a potentially expensive custom function. | 164 | calling a potentially expensive custom function. |
1496 | 78 | 165 | ||
1497 | @@ -101,7 +188,7 @@ class MatchRule(object): | |||
1498 | 101 | 188 | ||
1499 | 102 | return None | 189 | return None |
1500 | 103 | 190 | ||
1502 | 104 | def matches_string(self, string:str) -> bool: | 191 | def matches_string(self, string:Optional[str]) -> bool: |
1503 | 105 | _base_result = self._base_match(string) | 192 | _base_result = self._base_match(string) |
1504 | 106 | if _base_result is not None: | 193 | if _base_result is not None: |
1505 | 107 | # No need to invoke the test function. | 194 | # No need to invoke the test function. |
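The `_base_match`/`matches_string` split in this hunk implements a cheap-checks-first strategy: exact string, regex, and presence tests are tried before any potentially expensive user-supplied function. A stdlib-only sketch of that idea (`MiniMatchRule` is a hypothetical stand-in, not bs4's actual `MatchRule`):

```python
import re
from typing import Callable, Optional, Pattern, Union


class MiniMatchRule:
    """Sketch of the MatchRule idea: run the cheap checks (exact string,
    regex, presence) first, and only fall back to a custom function
    when they are inconclusive."""

    def __init__(self,
                 string: Optional[str] = None,
                 pattern: Optional[Union[str, Pattern[str]]] = None,
                 function: Optional[Callable[[Optional[str]], bool]] = None,
                 present: Optional[bool] = None):
        self.string = string
        self.pattern = re.compile(pattern) if isinstance(pattern, str) else pattern
        self.function = function
        self.present = present

    def _base_match(self, value: Optional[str]) -> Optional[bool]:
        # 'present' only cares whether the value exists at all.
        if self.present is not None:
            return (value is not None) == self.present
        if self.string is not None:
            return value == self.string
        if self.pattern is not None:
            return value is not None and self.pattern.search(value) is not None
        return None  # Inconclusive: defer to the function, if any.

    def matches_string(self, value: Optional[str]) -> bool:
        base = self._base_match(value)
        if base is not None:
            # No need to invoke the test function.
            return base
        if self.function is not None:
            return bool(self.function(value))
        # A rule with no restrictions matches everything.
        return True
```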
1506 | @@ -125,6 +212,7 @@ class MatchRule(object): | |||
1507 | 125 | ) | 212 | ) |
1508 | 126 | 213 | ||
1509 | 127 | class TagNameMatchRule(MatchRule): | 214 | class TagNameMatchRule(MatchRule): |
1510 | 215 | """A MatchRule implementing the rules for matches against tag name.""" | ||
1511 | 128 | function: Optional[_TagMatchFunction] | 216 | function: Optional[_TagMatchFunction] |
1512 | 129 | 217 | ||
1513 | 130 | def matches_tag(self, tag:Tag) -> bool: | 218 | def matches_tag(self, tag:Tag) -> bool: |
1514 | @@ -140,19 +228,25 @@ class TagNameMatchRule(MatchRule): | |||
1515 | 140 | return False | 228 | return False |
1516 | 141 | 229 | ||
1517 | 142 | class AttributeValueMatchRule(MatchRule): | 230 | class AttributeValueMatchRule(MatchRule): |
1518 | 231 | """A MatchRule implementing the rules for matches against attribute value.""" | ||
1519 | 143 | function: Optional[_StringMatchFunction] | 232 | function: Optional[_StringMatchFunction] |
1520 | 144 | 233 | ||
1521 | 145 | class StringMatchRule(MatchRule): | 234 | class StringMatchRule(MatchRule): |
1522 | 235 | """A MatchRule implementing the rules for matches against a NavigableString.""" | ||
1523 | 146 | function: Optional[_StringMatchFunction] | 236 | function: Optional[_StringMatchFunction] |
1524 | 147 | 237 | ||
1528 | 148 | class SoupStrainer(object): | 238 | class SoupStrainer(ElementFilter): |
1529 | 149 | """Encapsulates a number of ways of matching a markup element (a tag | 239 | """The ElementFilter subclass used internally by Beautiful Soup. |
1527 | 150 | or a string). | ||
1530 | 151 | 240 | ||
1535 | 152 | These are primarily created internally and used to underpin the | 241 | A SoupStrainer encapsulates the logic necessary to perform the |
1536 | 153 | find_* methods, but you can create one yourself and pass it in as | 242 | kind of matches supported by the find_* methods. SoupStrainers are |
1537 | 154 | ``parse_only`` to the `BeautifulSoup` constructor, to parse a | 243 | primarily created internally, but you can create one yourself and |
1538 | 155 | subset of a large document. | 244 | pass it in as ``parse_only`` to the `BeautifulSoup` constructor, |
1539 | 245 | to parse a subset of a large document. | ||
1540 | 246 | |||
1541 | 247 | Internally, SoupStrainer objects work by converting the | ||
1542 | 248 | constructor arguments into MatchRule objects. Incoming | ||
1543 | 249 | tags/markup are matched against those rules. | ||
1544 | 156 | 250 | ||
1545 | 157 | :param name: One or more restrictions on the tags found in a | 251 | :param name: One or more restrictions on the tags found in a |
1546 | 158 | document. | 252 | document. |
1547 | @@ -226,6 +320,17 @@ class SoupStrainer(object): | |||
1548 | 226 | self.__string = string | 320 | self.__string = string |
1549 | 227 | 321 | ||
1550 | 228 | @property | 322 | @property |
1551 | 323 | def excludes_everything(self) -> bool: | ||
1552 | 324 | """Check whether the provided rules will obviously exclude | ||
1553 | 325 | everything. (They might exclude everything even if this returns False, | ||
1554 | 326 | but not in an obvious way.) | ||
1555 | 327 | """ | ||
1556 | 328 | return True if ( | ||
1557 | 329 | self.string_rules and | ||
1558 | 330 | (self.name_rules or self.attribute_rules) | ||
1559 | 331 | ) else False | ||
1560 | 332 | |||
1561 | 333 | @property | ||
1562 | 229 | def string(self) -> Optional[_StrainableString]: | 334 | def string(self) -> Optional[_StrainableString]: |
1563 | 230 | ":meta private:" | 335 | ":meta private:" |
1564 | 231 | warnings.warn(f"Access to deprecated property string. (Look at .string_rules instead) -- Deprecated since version 4.13.0.", DeprecationWarning, stacklevel=2) | 336 | warnings.warn(f"Access to deprecated property string. (Look at .string_rules instead) -- Deprecated since version 4.13.0.", DeprecationWarning, stacklevel=2) |
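The rule encoded by the new `excludes_everything` property can be sketched as a free function (hypothetical name, stdlib only): string restrictions discard all tags during parsing, and tag restrictions discard all strings, so specifying both obviously leaves nothing to keep.

```python
from typing import Sequence


def excludes_everything(string_rules: Sequence,
                        name_rules: Sequence,
                        attribute_rules: Sequence) -> bool:
    """Sketch of SoupStrainer.excludes_everything: combining string
    restrictions with tag (name or attribute) restrictions obviously
    excludes every element."""
    return bool(string_rules and (name_rules or attribute_rules))
```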
1565 | @@ -262,6 +367,15 @@ class SoupStrainer(object): | |||
1566 | 262 | yield rule_class(function=obj) | 367 | yield rule_class(function=obj) |
1567 | 263 | elif isinstance(obj, Pattern): | 368 | elif isinstance(obj, Pattern): |
1568 | 264 | yield rule_class(pattern=obj) | 369 | yield rule_class(pattern=obj) |
1569 | 370 | elif hasattr(obj, 'search'): | ||
1570 | 371 | # We do a little duck typing here to detect usage of the | ||
1571 | 372 | # third-party regex library, whose pattern objects don't | ||
1572 | 373 | # derive from re.Pattern. | ||
1573 | 374 | # | ||
1574 | 375 | # TODO-TYPING: Once we drop support for Python 3.7, we | ||
1575 | 376 | # might be able to address this by defining an appropriate | ||
1576 | 377 | # Protocol. | ||
1577 | 378 | yield rule_class(pattern=obj) | ||
1578 | 265 | elif hasattr(obj, '__iter__'): | 379 | elif hasattr(obj, '__iter__'): |
1579 | 266 | for o in obj: | 380 | for o in obj: |
1580 | 267 | if not isinstance(o, (bytes, str)) and hasattr(o, '__iter__'): | 381 | if not isinstance(o, (bytes, str)) and hasattr(o, '__iter__'): |
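The dispatch order in this hunk can be illustrated with a stdlib-only sketch (`classify_restriction` and `FakeRegexPattern` are hypothetical names, not bs4's API); the interesting case is the duck-typed branch, which catches regex-like objects that expose `.search()` without subclassing `re.Pattern`:

```python
import re


def classify_restriction(obj) -> str:
    """Sketch of the dispatch used when a find_* argument is turned into
    match rules: exact strings, callables, re.Pattern objects, regex-like
    duck types, and iterables are each handled differently."""
    if isinstance(obj, (str, bytes)):
        return "string"
    if callable(obj):
        return "function"
    if isinstance(obj, re.Pattern):
        return "pattern"
    if hasattr(obj, "search"):
        # Duck typing: the third-party 'regex' library compiles patterns
        # into objects that expose .search() but don't derive from
        # re.Pattern.
        return "pattern"
    if hasattr(obj, "__iter__"):
        return "iterable"
    return "other"


class FakeRegexPattern:
    """Stand-in for a compiled pattern from a non-stdlib regex library."""
    def search(self, s):
        return None
```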
1581 | @@ -358,7 +472,7 @@ class SoupStrainer(object): | |||
1582 | 358 | else: | 472 | else: |
1583 | 359 | attr_values = [cast(str, attr_value)] | 473 | attr_values = [cast(str, attr_value)] |
1584 | 360 | 474 | ||
1586 | 361 | def _match_attribute_value_helper(attr_values:Sequence[Optional[str]]): | 475 | def _match_attribute_value_helper(attr_values:Sequence[Optional[str]]) -> bool: |
1587 | 362 | for rule in rules: | 476 | for rule in rules: |
1588 | 363 | for attr_value in attr_values: | 477 | for attr_value in attr_values: |
1589 | 364 | if rule.matches_string(attr_value): | 478 | if rule.matches_string(attr_value): |
1590 | @@ -382,8 +496,8 @@ class SoupStrainer(object): | |||
1591 | 382 | [joined_attr_value] | 496 | [joined_attr_value] |
1592 | 383 | ) | 497 | ) |
1593 | 384 | return this_attr_match | 498 | return this_attr_match |
1596 | 385 | 499 | ||
1597 | 386 | def allow_tag_creation(self, nsprefix:Optional[str], name:str, attrs:Optional[dict[str, str]]) -> bool: | 500 | def allow_tag_creation(self, nsprefix:Optional[str], name:str, attrs:Optional[_AttributeValues]) -> bool: |
1598 | 387 | """Based on the name and attributes of a tag, see whether this | 501 | """Based on the name and attributes of a tag, see whether this |
1599 | 388 | SoupStrainer will allow a Tag object to even be created. | 502 | SoupStrainer will allow a Tag object to even be created. |
1600 | 389 | 503 | ||
1601 | @@ -423,17 +537,25 @@ class SoupStrainer(object): | |||
1602 | 423 | return True | 537 | return True |
1603 | 424 | 538 | ||
1604 | 425 | def allow_string_creation(self, string:str) -> bool: | 539 | def allow_string_creation(self, string:str) -> bool: |
1605 | 540 | """Based on the content of a markup string, see whether this | ||
1606 | 541 | SoupStrainer will allow it to be instantiated as a | ||
1607 | 542 | NavigableString object, or whether it should be ignored. | ||
1608 | 543 | """ | ||
1609 | 426 | if self.name_rules or self.attribute_rules: | 544 | if self.name_rules or self.attribute_rules: |
1610 | 427 | # A SoupStrainer that has name or attribute rules won't | 545 | # A SoupStrainer that has name or attribute rules won't |
1611 | 428 | # match any strings; it's designed to match tags with | 546 | # match any strings; it's designed to match tags with |
1612 | 429 | # certain properties. | 547 | # certain properties. |
1613 | 430 | return False | 548 | return False |
1614 | 549 | if not self.string_rules: | ||
1615 | 550 | # A SoupStrainer with no string rules will match | ||
1616 | 551 | # all strings. | ||
1617 | 552 | return True | ||
1618 | 431 | if not self.matches_any_string_rule(string): | 553 | if not self.matches_any_string_rule(string): |
1619 | 432 | return False | 554 | return False |
1620 | 433 | return True | 555 | return True |
1621 | 434 | 556 | ||
1622 | 435 | def matches_any_string_rule(self, string:str) -> bool: | 557 | def matches_any_string_rule(self, string:str) -> bool: |
1624 | 436 | """See whether the content of a string, matches any of | 558 | """See whether the content of a string matches any of |
1625 | 437 | this SoupStrainer's string rules. | 559 | this SoupStrainer's string rules. |
1626 | 438 | """ | 560 | """ |
1627 | 439 | if not self.string_rules: | 561 | if not self.string_rules: |
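The string-creation gate spelled out in this hunk boils down to three cases, sketched here with plain callables standing in for the rule objects (hypothetical function, not bs4's API):

```python
from typing import Callable, Sequence


def allow_string_creation(string: str,
                          name_rules: Sequence,
                          attribute_rules: Sequence,
                          string_rules: Sequence[Callable[[str], bool]]) -> bool:
    """Sketch of the SoupStrainer.allow_string_creation decision:
    tag-oriented strainers reject all strings; a strainer with no string
    rules accepts all strings; otherwise one matching rule suffices."""
    if name_rules or attribute_rules:
        # A strainer with tag restrictions never matches bare strings.
        return False
    if not string_rules:
        # No string restrictions: every string is kept.
        return True
    return any(rule(string) for rule in string_rules)
```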
1628 | @@ -442,28 +564,37 @@ class SoupStrainer(object): | |||
1629 | 442 | if string_rule.matches_string(string): | 564 | if string_rule.matches_string(string): |
1630 | 443 | return True | 565 | return True |
1631 | 444 | return False | 566 | return False |
1634 | 445 | 567 | ||
1635 | 446 | 568 | def match(self, element:PageElement) -> bool: | |
1636 | 569 | """Does the given PageElement match the rules set down by this | ||
1637 | 570 | SoupStrainer? | ||
1638 | 571 | |||
1639 | 572 | The find_* methods rely heavily on this method to find matches. | ||
1640 | 573 | |||
1641 | 574 | :param element: A PageElement. | ||
1642 | 575 | :return: True if the element matches this SoupStrainer's rules; False otherwise. | ||
1643 | 576 | """ | ||
1644 | 577 | if isinstance(element, Tag): | ||
1645 | 578 | return self.matches_tag(element) | ||
1646 | 579 | assert isinstance(element, NavigableString) | ||
1647 | 580 | if not (self.name_rules or self.attribute_rules): | ||
1648 | 581 | # A NavigableString can only match a SoupStrainer that | ||
1649 | 582 | # does not define any name or attribute restrictions. | ||
1650 | 583 | for rule in self.string_rules: | ||
1651 | 584 | if rule.matches_string(element): | ||
1652 | 585 | return True | ||
1653 | 586 | return False | ||
1654 | 587 | |||
1655 | 447 | @_deprecated("allow_tag_creation", "4.13.0") | 588 | @_deprecated("allow_tag_creation", "4.13.0") |
1657 | 448 | def search_tag(self, name, attrs): | 589 | def search_tag(self, name:str, attrs:Optional[_AttributeValues]) -> bool: |
1658 | 590 | """A less elegant version of allow_tag_creation().""" | ||
1659 | 449 | ":meta private:" | 591 | ":meta private:" |
1660 | 450 | return self.allow_tag_creation(None, name, attrs) | 592 | return self.allow_tag_creation(None, name, attrs) |
1661 | 451 | 593 | ||
1679 | 452 | def search(self, element:PageElement): | 594 | @_deprecated("match", "4.13.0") |
1680 | 453 | # TODO: This method needs to be removed or redone. It is | 595 | def search(self, element:PageElement) -> Optional[PageElement]: |
1681 | 454 | # very confusing but it's used everywhere. | 596 | """A less elegant version of match(). |
1665 | 455 | match = None | ||
1666 | 456 | if isinstance(element, Tag): | ||
1667 | 457 | match = self.matches_tag(element) | ||
1668 | 458 | else: | ||
1669 | 459 | assert isinstance(element, NavigableString) | ||
1670 | 460 | match = False | ||
1671 | 461 | if not (self.name_rules or self.attribute_rules): | ||
1672 | 462 | # A NavigableString can only match a SoupStrainer that | ||
1673 | 463 | # does not define any name or attribute restrictions. | ||
1674 | 464 | for rule in self.string_rules: | ||
1675 | 465 | if rule.matches_string(element): | ||
1676 | 466 | match = True | ||
1677 | 467 | break | ||
1678 | 468 | return element if match else False | ||
1682 | 469 | 597 | ||
1683 | 598 | :meta private: | ||
1684 | 599 | """ | ||
1685 | 600 | return element if self.match(element) else None | ||
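The control flow of the new `SoupStrainer.match()` can be sketched without importing bs4, using a minimal tag stand-in (`MiniTag` and `mini_match` are hypothetical names): tags are delegated to the tag-matching machinery, while strings can only match a strainer that carries no tag restrictions, and then only if some string rule accepts them.

```python
from typing import Callable, Sequence


class MiniTag:
    """Minimal stand-in for a Tag; anything else is treated as a string."""
    def __init__(self, name: str):
        self.name = name


def mini_match(element, *,
               matches_tag: Callable[["MiniTag"], bool],
               name_rules: Sequence,
               attribute_rules: Sequence,
               string_rules: Sequence[Callable[[str], bool]]) -> bool:
    """Sketch of SoupStrainer.match(): dispatch on element type."""
    if isinstance(element, MiniTag):
        return matches_tag(element)
    # A string can only match a strainer without tag restrictions.
    if name_rules or attribute_rules:
        return False
    return any(rule(element) for rule in string_rules)
```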
1686 | diff --git a/bs4/tests/__init__.py b/bs4/tests/__init__.py | |||
1687 | index 2ef7fd8..3ef999d 100644 | |||
1688 | --- a/bs4/tests/__init__.py | |||
1689 | +++ b/bs4/tests/__init__.py | |||
1690 | @@ -20,7 +20,7 @@ from bs4.element import ( | |||
1691 | 20 | Stylesheet, | 20 | Stylesheet, |
1692 | 21 | Tag | 21 | Tag |
1693 | 22 | ) | 22 | ) |
1695 | 23 | from bs4.strainer import SoupStrainer | 23 | from bs4.filter import SoupStrainer |
1696 | 24 | from bs4.builder import ( | 24 | from bs4.builder import ( |
1697 | 25 | DetectsXMLParsedAsHTML, | 25 | DetectsXMLParsedAsHTML, |
1698 | 26 | XMLParsedAsHTMLWarning, | 26 | XMLParsedAsHTMLWarning, |
1699 | diff --git a/bs4/tests/test_strainer.py b/bs4/tests/test_filter.py | |||
1700 | 27 | similarity index 56% | 27 | similarity index 56% |
1701 | 28 | rename from bs4/tests/test_strainer.py | 28 | rename from bs4/tests/test_strainer.py |
1702 | 29 | rename to bs4/tests/test_filter.py | 29 | rename to bs4/tests/test_filter.py |
1703 | index 4de03f0..8d5da70 100644 | |||
1704 | --- a/bs4/tests/test_strainer.py | |||
1705 | +++ b/bs4/tests/test_filter.py | |||
1706 | @@ -6,20 +6,108 @@ from . import ( | |||
1707 | 6 | SoupTest, | 6 | SoupTest, |
1708 | 7 | ) | 7 | ) |
1709 | 8 | from bs4.element import Tag | 8 | from bs4.element import Tag |
1711 | 9 | from bs4.strainer import ( | 9 | from bs4.filter import ( |
1712 | 10 | AttributeValueMatchRule, | 10 | AttributeValueMatchRule, |
1713 | 11 | ElementFilter, | ||
1714 | 11 | MatchRule, | 12 | MatchRule, |
1715 | 12 | SoupStrainer, | 13 | SoupStrainer, |
1716 | 13 | StringMatchRule, | 14 | StringMatchRule, |
1717 | 14 | TagNameMatchRule, | 15 | TagNameMatchRule, |
1718 | 15 | ) | 16 | ) |
1719 | 16 | 17 | ||
1721 | 17 | class TestMatchrule(SoupTest): | 18 | class TestElementFilter(SoupTest): |
1722 | 19 | |||
1723 | 20 | def test_default_behavior(self): | ||
1724 | 21 | # An unconfigured ElementFilter matches absolutely everything. | ||
1725 | 22 | selector = ElementFilter() | ||
1726 | 23 | assert not selector.excludes_everything | ||
1727 | 24 | soup = self.soup("<a>text</a>") | ||
1728 | 25 | tag = soup.a | ||
1729 | 26 | string = tag.string | ||
1730 | 27 | assert True == selector.match(soup) | ||
1731 | 28 | assert True == selector.match(tag) | ||
1732 | 29 | assert True == selector.match(string) | ||
1733 | 30 | assert soup.find(selector).name == "a" | ||
1734 | 31 | |||
1735 | 32 | # And allows any incoming markup to be turned into PageElements. | ||
1736 | 33 | assert True == selector.allow_tag_creation(None, "tag", None) | ||
1737 | 34 | assert True == selector.allow_string_creation("some string") | ||
1738 | 35 | |||
1739 | 36 | def test_match(self): | ||
1740 | 37 | def m(pe): | ||
1741 | 38 | return (pe.string == "allow" or ( | ||
1742 | 39 | isinstance(pe, Tag) and pe.name=="allow")) | ||
1743 | 40 | |||
1744 | 41 | soup = self.soup("<allow>deny</allow>allow<deny>deny</deny>") | ||
1745 | 42 | allow_tag = soup.allow | ||
1746 | 43 | allow_string = soup.find(string="allow") | ||
1747 | 44 | deny_tag = soup.deny | ||
1748 | 45 | deny_string = soup.find(string="deny") | ||
1749 | 46 | |||
1750 | 47 | selector = ElementFilter(match_function=m) | ||
1751 | 48 | assert True == selector.match(allow_tag) | ||
1752 | 49 | assert True == selector.match(allow_string) | ||
1753 | 50 | assert False == selector.match(deny_tag) | ||
1754 | 51 | assert False == selector.match(deny_string) | ||
1755 | 52 | |||
1756 | 53 | # Since only the match function was provided, there is | ||
1757 | 54 | # no effect on tag or string creation. | ||
1758 | 55 | soup = self.soup("<a>text</a>", parse_only=selector) | ||
1759 | 56 | assert "text" == soup.a.string | ||
1760 | 57 | |||
1761 | 58 | def test_allow_tag_creation(self): | ||
1762 | 59 | def m(nsprefix, name, attrs): | ||
1763 | 60 | return nsprefix=="allow" or name=="allow" or "allow" in attrs | ||
1764 | 61 | selector = ElementFilter(allow_tag_creation_function=m) | ||
1765 | 62 | f = selector.allow_tag_creation | ||
1766 | 63 | assert True == f("allow", "ignore", {}) | ||
1767 | 64 | assert True == f("ignore", "allow", {}) | ||
1768 | 65 | assert True == f(None, "ignore", {"allow": "1"}) | ||
1769 | 66 | assert False == f("no", "no", {"no" : "nope"}) | ||
1770 | 67 | |||
1771 | 68 | # Test the ElementFilter as a value for parse_only. | ||
1772 | 69 | soup = self.soup( | ||
1773 | 70 | "<deny>deny</deny> <allow>deny</allow> allow", | ||
1774 | 71 | parse_only=selector | ||
1775 | 72 | ) | ||
1776 | 18 | 73 | ||
1780 | 19 | def _tuple(self, rule): | 74 | # The <deny> tag was filtered out, but there was no effect on |
1781 | 20 | if isinstance(rule.pattern, str): | 75 | # the strings, since only allow_tag_creation_function was |
1782 | 21 | import pdb; pdb.set_trace() | 76 | # defined. |
1783 | 77 | assert 'deny <allow>deny</allow> allow' == soup.decode() | ||
1784 | 78 | |||
1785 | 79 | # Similarly, since match_function was not defined, this | ||
1786 | 80 | # ElementFilter matches everything. | ||
1787 | 81 | assert soup.find(selector) == "deny" | ||
1788 | 82 | |||
1789 | 83 | def test_allow_string_creation(self): | ||
1790 | 84 | def m(s): | ||
1791 | 85 | return s=="allow" | ||
1792 | 86 | selector = ElementFilter(allow_string_creation_function=m) | ||
1793 | 87 | f = selector.allow_string_creation | ||
1794 | 88 | assert True == f("allow") | ||
1795 | 89 | assert False == f("deny") | ||
1796 | 90 | assert False == f("please allow") | ||
1797 | 91 | |||
1798 | 92 | # Test the ElementFilter as a value for parse_only. | ||
1799 | 93 | soup = self.soup( | ||
1800 | 94 | "<deny>deny</deny> <allow>deny</allow> allow", | ||
1801 | 95 | parse_only=selector | ||
1802 | 96 | ) | ||
1803 | 97 | |||
1804 | 98 | # All incoming strings other than "allow" (even whitespace) | ||
1805 | 99 | # were filtered out, but there was no effect on the tags, | ||
1806 | 100 | # since only allow_string_creation_function was defined. | ||
1807 | 101 | assert '<deny>deny</deny><allow>deny</allow>' == soup.decode() | ||
1808 | 102 | |||
1809 | 103 | # Similarly, since match_function was not defined, this | ||
1810 | 104 | # ElementFilter matches everything. | ||
1811 | 105 | assert soup.find(selector).name == "deny" | ||
1812 | 22 | 106 | ||
1813 | 107 | |||
1814 | 108 | class TestMatchRule(SoupTest): | ||
1815 | 109 | |||
1816 | 110 | def _tuple(self, rule): | ||
1817 | 23 | return ( | 111 | return ( |
1818 | 24 | rule.string, | 112 | rule.string, |
1819 | 25 | rule.pattern.pattern if rule.pattern else None, | 113 | rule.pattern.pattern if rule.pattern else None, |
1820 | @@ -155,6 +243,28 @@ class TestSoupStrainer(SoupTest): | |||
1821 | 155 | assert w2.filename == __file__ | 243 | assert w2.filename == __file__ |
1822 | 156 | assert msg == "Access to deprecated property text. (Look at .string_rules instead) -- Deprecated since version 4.13.0." | 244 | assert msg == "Access to deprecated property text. (Look at .string_rules instead) -- Deprecated since version 4.13.0." |
1823 | 157 | 245 | ||
1824 | 246 | def test_search_tag_deprecated(self): | ||
1825 | 247 | strainer = SoupStrainer(name="a") | ||
1826 | 248 | with warnings.catch_warnings(record=True) as w: | ||
1827 | 249 | assert False == strainer.search_tag("b", {}) | ||
1828 | 250 | [w1] = w | ||
1829 | 251 | msg = str(w1.message) | ||
1830 | 252 | assert w1.filename == __file__ | ||
1831 | 253 | assert msg == "Call to deprecated method search_tag. (Replaced by allow_tag_creation) -- Deprecated since version 4.13.0." | ||
1832 | 254 | |||
1833 | 255 | def test_search_deprecated(self): | ||
1834 | 256 | strainer = SoupStrainer(name="a") | ||
1835 | 257 | soup = self.soup("<a></a><b></b>") | ||
1836 | 258 | with warnings.catch_warnings(record=True) as w: | ||
1837 | 259 | assert soup.a == strainer.search(soup.a) | ||
1838 | 260 | assert None == strainer.search(soup.b) | ||
1839 | 261 | [w1, w2] = w | ||
1840 | 262 | msg = str(w1.message) | ||
1841 | 263 | assert msg == str(w2.message) | ||
1842 | 264 | assert w1.filename == __file__ | ||
1843 | 265 | assert msg == "Call to deprecated method search. (Replaced by match) -- Deprecated since version 4.13.0." | ||
1844 | 266 | |||
1845 | 267 | # Dummy function used within tests. | ||
1846 | 158 | def _match_function(x): | 268 | def _match_function(x): |
1847 | 159 | pass | 269 | pass |
1848 | 160 | 270 | ||
1849 | @@ -213,7 +323,7 @@ class TestSoupStrainer(SoupTest): | |||
1850 | 213 | ) | 323 | ) |
1851 | 214 | 324 | ||
1852 | 215 | def test_constructor_with_overlapping_attributes(self): | 325 | def test_constructor_with_overlapping_attributes(self): |
1854 | 216 | # If you specify the same attribute in arts and **kwargs, you end up | 326 | # If you specify the same attribute in args and **kwargs, you end up |
1855 | 217 | # with two different AttributeValueMatchRule objects. | 327 | # with two different AttributeValueMatchRule objects. |
1856 | 218 | 328 | ||
1857 | 219 | # This happens whether you use the 'class' shortcut on attrs... | 329 | # This happens whether you use the 'class' shortcut on attrs... |
1858 | @@ -437,17 +547,24 @@ class TestSoupStrainer(SoupTest): | |||
1859 | 437 | # because the string restrictions can't be evaluated during | 547 | # because the string restrictions can't be evaluated during |
1860 | 438 | # the parsing process, and the tag restrictions eliminate | 548 | # the parsing process, and the tag restrictions eliminate |
1861 | 439 | # any strings from consideration. | 549 | # any strings from consideration. |
1862 | 550 | # | ||
1863 | 551 | # We can detect this ahead of time, and warn about it, | ||
1864 | 552 | # thanks to SoupStrainer.excludes_everything | ||
1865 | 440 | markup = "<a><b>one string<div>another string</div></b></a>" | 553 | markup = "<a><b>one string<div>another string</div></b></a>" |
1866 | 441 | 554 | ||
1867 | 442 | with warnings.catch_warnings(record=True) as w: | 555 | with warnings.catch_warnings(record=True) as w: |
1868 | 556 | assert True == soupstrainer.excludes_everything | ||
1869 | 443 | assert "" == self.soup(markup, parse_only=soupstrainer).decode() | 557 | assert "" == self.soup(markup, parse_only=soupstrainer).decode() |
1870 | 444 | [warning] = w | 558 | [warning] = w |
1871 | 445 | msg = str(warning.message) | 559 | msg = str(warning.message) |
1872 | 446 | assert warning.filename == __file__ | 560 | assert warning.filename == __file__ |
1873 | 447 | assert str(warning.message).startswith( | 561 | assert str(warning.message).startswith( |
1875 | 448 | "Value for parse_only will exclude everything, since it puts restrictions on both tags and strings:" | 562 | "The given value for parse_only will exclude everything:" |
1876 | 449 | ) | 563 | ) |
1878 | 450 | 564 | ||
1879 | 565 | # The average SoupStrainer has excludes_everything=False | ||
1880 | 566 | assert not SoupStrainer().excludes_everything | ||
1881 | 567 | |||
1882 | 451 | def test_documentation_examples(self): | 568 | def test_documentation_examples(self): |
1883 | 452 | """Medium-weight real-world tests based on the Beautiful Soup | 569 | """Medium-weight real-world tests based on the Beautiful Soup |
1884 | 453 | documentation. | 570 | documentation. |
1885 | diff --git a/bs4/tests/test_html5lib.py b/bs4/tests/test_html5lib.py | |||
1886 | index b0f4384..9f6dfa1 100644 | |||
1887 | --- a/bs4/tests/test_html5lib.py | |||
1888 | +++ b/bs4/tests/test_html5lib.py | |||
1889 | @@ -4,7 +4,7 @@ import pytest | |||
1890 | 4 | import warnings | 4 | import warnings |
1891 | 5 | 5 | ||
1892 | 6 | from bs4 import BeautifulSoup | 6 | from bs4 import BeautifulSoup |
1894 | 7 | from bs4.strainer import SoupStrainer | 7 | from bs4.filter import SoupStrainer |
1895 | 8 | from . import ( | 8 | from . import ( |
1896 | 9 | HTML5LIB_PRESENT, | 9 | HTML5LIB_PRESENT, |
1897 | 10 | HTML5TreeBuilderSmokeTest, | 10 | HTML5TreeBuilderSmokeTest, |
1898 | @@ -24,7 +24,7 @@ class TestHTML5LibBuilder(SoupTest, HTML5TreeBuilderSmokeTest): | |||
1899 | 24 | return HTML5TreeBuilder | 24 | return HTML5TreeBuilder |
1900 | 25 | 25 | ||
1901 | 26 | def test_soupstrainer(self): | 26 | def test_soupstrainer(self): |
1903 | 27 | # The html5lib tree builder does not support SoupStrainers. | 27 | # The html5lib tree builder does not support parse_only. |
1904 | 28 | strainer = SoupStrainer("b") | 28 | strainer = SoupStrainer("b") |
1905 | 29 | markup = "<p>A <b>bold</b> statement.</p>" | 29 | markup = "<p>A <b>bold</b> statement.</p>" |
1906 | 30 | with warnings.catch_warnings(record=True) as w: | 30 | with warnings.catch_warnings(record=True) as w: |
1907 | diff --git a/bs4/tests/test_lxml.py b/bs4/tests/test_lxml.py | |||
1908 | index d450740..9fc04e0 100644 | |||
1909 | --- a/bs4/tests/test_lxml.py | |||
1910 | +++ b/bs4/tests/test_lxml.py | |||
1911 | @@ -14,7 +14,7 @@ from bs4 import ( | |||
1912 | 14 | BeautifulStoneSoup, | 14 | BeautifulStoneSoup, |
1913 | 15 | ) | 15 | ) |
1914 | 16 | from bs4.element import Comment, Doctype | 16 | from bs4.element import Comment, Doctype |
1916 | 17 | from bs4.strainer import SoupStrainer | 17 | from bs4.filter import SoupStrainer |
1917 | 18 | from . import ( | 18 | from . import ( |
1918 | 19 | HTMLTreeBuilderSmokeTest, | 19 | HTMLTreeBuilderSmokeTest, |
1919 | 20 | XMLTreeBuilderSmokeTest, | 20 | XMLTreeBuilderSmokeTest, |
1920 | diff --git a/bs4/tests/test_pageelement.py b/bs4/tests/test_pageelement.py | |||
1921 | index 19b4d63..7dfdc22 100644 | |||
1922 | --- a/bs4/tests/test_pageelement.py | |||
1923 | +++ b/bs4/tests/test_pageelement.py | |||
1924 | @@ -10,7 +10,7 @@ from bs4.element import ( | |||
1925 | 10 | Comment, | 10 | Comment, |
1926 | 11 | ResultSet, | 11 | ResultSet, |
1927 | 12 | ) | 12 | ) |
1929 | 13 | from bs4.strainer import SoupStrainer | 13 | from bs4.filter import SoupStrainer |
1930 | 14 | from . import ( | 14 | from . import ( |
1931 | 15 | SoupTest, | 15 | SoupTest, |
1932 | 16 | ) | 16 | ) |
1933 | diff --git a/bs4/tests/test_soup.py b/bs4/tests/test_soup.py | |||
1934 | index 4f8ee1a..c95f380 100644 | |||
1935 | --- a/bs4/tests/test_soup.py | |||
1936 | +++ b/bs4/tests/test_soup.py | |||
1937 | @@ -27,7 +27,7 @@ from bs4.element import ( | |||
1938 | 27 | Tag, | 27 | Tag, |
1939 | 28 | NavigableString, | 28 | NavigableString, |
1940 | 29 | ) | 29 | ) |
1942 | 30 | from bs4.strainer import SoupStrainer | 30 | from bs4.filter import SoupStrainer |
1943 | 31 | 31 | ||
1944 | 32 | from . import ( | 32 | from . import ( |
1945 | 33 | default_builder, | 33 | default_builder, |
1946 | @@ -293,7 +293,7 @@ class TestWarnings(SoupTest): | |||
1947 | 293 | soup = self.soup("<a><b></b></a>", parse_only=strainer) | 293 | soup = self.soup("<a><b></b></a>", parse_only=strainer) |
1948 | 294 | warning = self._assert_warning(w, UserWarning) | 294 | warning = self._assert_warning(w, UserWarning) |
1949 | 295 | msg = str(warning.message) | 295 | msg = str(warning.message) |
1951 | 296 | assert msg.startswith("Value for parse_only will exclude everything, since it puts restrictions on both tags and strings:") | 296 | assert msg.startswith("The given value for parse_only will exclude everything:") |
1952 | 297 | 297 | ||
1953 | 298 | def test_parseOnlyThese_renamed_to_parse_only(self): | 298 | def test_parseOnlyThese_renamed_to_parse_only(self): |
1954 | 299 | with warnings.catch_warnings(record=True) as w: | 299 | with warnings.catch_warnings(record=True) as w: |
1955 | diff --git a/bs4/tests/test_tree.py b/bs4/tests/test_tree.py | |||
1956 | index 606525f..43afb29 100644 | |||
1957 | --- a/bs4/tests/test_tree.py | |||
1958 | +++ b/bs4/tests/test_tree.py | |||
1959 | @@ -26,7 +26,7 @@ from bs4.element import ( | |||
1960 | 26 | Tag, | 26 | Tag, |
1961 | 27 | TemplateString, | 27 | TemplateString, |
1962 | 28 | ) | 28 | ) |
1964 | 29 | from bs4.strainer import SoupStrainer | 29 | from bs4.filter import SoupStrainer |
1965 | 30 | from . import ( | 30 | from . import ( |
1966 | 31 | SoupTest, | 31 | SoupTest, |
1967 | 32 | ) | 32 | ) |
1968 | diff --git a/doc/index.rst b/doc/index.rst | |||
1969 | index 7beff36..a414830 100755 | |||
1970 | --- a/doc/index.rst | |||
1971 | +++ b/doc/index.rst | |||
1972 | @@ -20,7 +20,7 @@ with examples. I show you what the library is good for, how it works, | |||
1973 | 20 | how to use it, how to make it do what you want, and what to do when it | 20 | how to use it, how to make it do what you want, and what to do when it |
1974 | 21 | violates your expectations. | 21 | violates your expectations. |
1975 | 22 | 22 | ||
1977 | 23 | This document covers Beautiful Soup version 4.12.2. The examples in | 23 | This document covers Beautiful Soup version 4.13.0. The examples in |
1978 | 24 | this documentation were written for Python 3.8. | 24 | this documentation were written for Python 3.8. |
1979 | 25 | 25 | ||
1980 | 26 | You might be looking for the documentation for `Beautiful Soup 3 | 26 | You might be looking for the documentation for `Beautiful Soup 3 |
1981 | @@ -2577,6 +2577,11 @@ the human-visible content of the page.* | |||
1982 | 2577 | either return the object itself, or nothing, so the only reason to do | 2577 | either return the object itself, or nothing, so the only reason to do |
1983 | 2578 | this is when you're iterating over a mixed list.* | 2578 | this is when you're iterating over a mixed list.* |
1984 | 2579 | 2579 | ||
1985 | 2580 | *As of Beautiful Soup version 4.13.0, you can call .string on a | ||
1986 | 2581 | NavigableString object. It will return the object itself, so again, | ||
1987 | 2582 | the only reason to do this is when you're iterating over a mixed | ||
1988 | 2583 | list.* | ||
1989 | 2584 | |||
1990 | 2580 | Specifying the parser to use | 2585 | Specifying the parser to use |
1991 | 2581 | ============================ | 2586 | ============================ |
1992 | 2582 | 2587 | ||
1993 | @@ -2604,8 +2609,9 @@ specifying one of the following: | |||
1994 | 2604 | 2609 | ||
1995 | 2605 | The section `Installing a parser`_ contrasts the supported parsers. | 2610 | The section `Installing a parser`_ contrasts the supported parsers. |
1996 | 2606 | 2611 | ||
1999 | 2607 | If you don't have an appropriate parser installed, Beautiful Soup will | 2612 | If you ask for a parser that isn't installed, Beautiful Soup will |
2000 | 2608 | ignore your request and pick a different parser. Right now, the only | 2613 | raise an exception so that you don't inadvertently parse a document |
2001 | 2614 | under an unknown set of rules. For example, right now, the only | ||
2002 | 2609 | supported XML parser is lxml. If you don't have lxml installed, asking | 2615 | supported XML parser is lxml. If you don't have lxml installed, asking |
2003 | 2610 | for an XML parser won't give you one, and asking for "lxml" won't work | 2616 | for an XML parser won't give you one, and asking for "lxml" won't work |
2004 | 2611 | either. | 2617 | either. |
2005 | @@ -3018,6 +3024,44 @@ been called on it:: | |||
2006 | 3018 | This is because two different :py:class:`Tag` objects can't occupy the same | 3024 | This is because two different :py:class:`Tag` objects can't occupy the same |
2007 | 3019 | space at the same time. | 3025 | space at the same time. |
2008 | 3020 | 3026 | ||
2009 | 3027 | Advanced search techniques | ||
2010 | 3028 | ========================== | ||
2011 | 3029 | |||
2012 | 3030 | Almost everyone who uses Beautiful Soup to extract information from a | ||
2013 | 3031 | document can get what they need using the methods described in | ||
2014 | 3032 | `Searching the tree`_. However, there's a lower-level interface--the | ||
2015 | 3033 | :py:class:`ElementSelector` class-- which lets you define any matching | ||
2016 | 3034 | behavior whatsoever. | ||
2017 | 3035 | |||
2018 | 3036 | To use :py:class:`ElementSelector`, define a function that takes a | ||
2019 | 3037 | :py:class:`PageElement` object (that is, it might be either a | ||
2020 | 3038 | :py:class:`Tag` or a :py:class:`NavigableString`) and returns ``True`` | ||
2021 | 3039 | (if the element matches your custom criteria) or ``False`` (if it | ||
2022 | 3040 | doesn't):: | ||
2023 | 3041 | |||
2024 | 3042 | [example goes here] | ||
2025 | 3043 | |||
2026 | 3044 | Then, pass the function into an :py:class:`ElementSelector`:: | ||
2027 | 3045 | |||
2028 | 3046 | from bs4.select import ElementSelector | ||
2029 | 3047 | selector = ElementSelector(f) | ||
2030 | 3048 | |||
2031 | 3049 | You can then pass the :py:class:`ElementSelector` object as the first | ||
2032 | 3050 | argument to any of the `Searching the tree`_ methods:: | ||
2033 | 3051 | |||
2034 | 3052 | [examples go here] | ||
2035 | 3053 | |||
2036 | 3054 | Every potential match will be run through your function, and the only | ||
2037 | 3055 | :py:class:`PageElement` objects returned will be the ones for which your | ||
2038 | 3056 | function returned ``True``. | ||
2039 | 3057 | |||
2040 | 3058 | Note that this is different from simply passing `a function`_ as the | ||
2041 | 3059 | first argument to one of the search methods. That's an easy way to | ||
2042 | 3060 | find a tag, but *only* tags will be considered. With an | ||
2043 | 3061 | :py:class:`ElementSelector` you can write a single function that makes | ||
2044 | 3062 | decisions about both tags and strings. | ||
2045 | 3063 | |||
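The contrast drawn in the paragraph above can be seen with the existing function-based search, which is stable API in released versions of Beautiful Soup: a plain function passed to ``find_all`` is only ever called on tags, never on strings (a minimal sketch; the document and function names are illustrative):

```python
from bs4 import BeautifulSoup

html_doc = '<p class="title"><b>Hello</b></p><p>world</p>'
soup = BeautifulSoup(html_doc, "html.parser")

# A plain function passed to find_all is called on Tag objects only;
# the strings "Hello" and "world" are never passed to it.
def has_class(tag):
    return tag.has_attr("class")

print([t.name for t in soup.find_all(has_class)])  # ['p']
```

Wrapping the same kind of function in the class described in this section would additionally expose strings to it; note that the ``ElementSelector`` name and the ``bs4.select`` module path follow this draft and may differ in the released version.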
2046 | 3064 | |||
2047 | 3021 | Advanced parser customization | 3065 | Advanced parser customization |
2048 | 3022 | ============================= | 3066 | ============================= |
2049 | 3023 | 3067 | ||
2050 | @@ -3111,14 +3155,6 @@ The :py:class:`SoupStrainer` behavior is as follows: | |||
2051 | 3111 | * When a tag does not match, the tag itself is not kept, but parsing continues | 3155 | * When a tag does not match, the tag itself is not kept, but parsing continues |
2052 | 3112 | into its contents to look for other tags that do match. | 3156 | into its contents to look for other tags that do match. |
2053 | 3113 | 3157 | ||
2054 | 3114 | You can also pass a :py:class:`SoupStrainer` into any of the methods covered | ||
2055 | 3115 | in `Searching the tree`_. This probably isn't terribly useful, but I | ||
2056 | 3116 | thought I'd mention it:: | ||
2057 | 3117 | |||
2058 | 3118 | soup = BeautifulSoup(html_doc, 'html.parser') | ||
2059 | 3119 | soup.find_all(only_short_strings) | ||
2060 | 3120 | # ['\n\n', '\n\n', 'Elsie', ',\n', 'Lacie', ' and\n', 'Tillie', | ||
2061 | 3121 | # '\n\n', '...', '\n'] | ||
2062 | 3122 | 3158 | ||
2063 | 3123 | Customizing multi-valued attributes | 3159 | Customizing multi-valued attributes |
2064 | 3124 | ----------------------------------- | 3160 | ----------------------------------- |