Merge ~chrispitude/beautifulsoup:more-modular-soupstrainers-doc into beautifulsoup:more-modular-soupstrainers
Status: Needs review
Proposed branch: ~chrispitude/beautifulsoup:more-modular-soupstrainers-doc
Merge into: beautifulsoup:more-modular-soupstrainers
Diff against target: 2065 lines (+492/-327), 22 files modified:
- CHANGELOG (+14/-8)
- bs4/__init__.py (+13/-10)
- bs4/_deprecation.py (+6/-6)
- bs4/_typing.py (+19/-8)
- bs4/builder/__init__.py (+76/-49)
- bs4/builder/_html5lib.py (+104/-97)
- bs4/builder/_htmlparser.py (+15/-14)
- bs4/builder/_lxml.py (+44/-27)
- bs4/dammit.py (+3/-3)
- bs4/element.py (+65/-39)
- bs4/filter.py (+6/-7)
- bs4/tests/__init__.py (+33/-14)
- bs4/tests/test_builder_registry.py (+3/-1)
- bs4/tests/test_css.py (+8/-2)
- bs4/tests/test_filter.py (+16/-4)
- bs4/tests/test_fuzz.py (+2/-2)
- bs4/tests/test_html5lib.py (+1/-2)
- bs4/tests/test_htmlparser.py (+4/-2)
- bs4/tests/test_lxml.py (+2/-2)
- bs4/tests/test_soup.py (+4/-2)
- bs4/tests/test_tree.py (+8/-8)
- doc/index.rst (+46/-20)
Related bugs: (none)
Reviewer | Review Type | Date Requested | Status
---|---|---|---
Leonard Richardson | | | Pending
Review via email: mp+459970@code.launchpad.net
Commit message
add an example for the new ElementFilter feature
Description of the change
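For context, this branch documents the new ElementFilter feature, whose core idea is that one match function decides, uniformly, which elements a find-style search returns. The sketch below illustrates that idea with a simplified stand-in written in plain Python; the class and element shapes here are hypothetical and are not Beautiful Soup's actual API.

```python
# Hypothetical stand-in for the ElementFilter idea: a filter wraps a
# single match function and applies it to every element, tag or text.
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

@dataclass
class FakeElement:
    name: Optional[str] = None          # None marks a text node
    attrs: Dict[str, str] = field(default_factory=dict)
    text: str = ""

class SimpleElementFilter:
    def __init__(self, match_function: Callable[[FakeElement], bool]):
        self.match_function = match_function

    def find_all(self, elements: List[FakeElement]) -> List[FakeElement]:
        # Tags and strings alike go through the same predicate.
        return [el for el in elements if self.match_function(el)]

tree = [
    FakeElement(name="a", attrs={"href": "/home"}),
    FakeElement(text="plain text"),
    FakeElement(name="a", attrs={"class": "external"}),
]
links = SimpleElementFilter(lambda el: el.name == "a").find_all(tree)
```

The real ElementFilter documentation example added by this branch lives in doc/index.rst (see the diff below); this stand-in only mirrors the match-function concept.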
Unmerged commits
- b17956d... by Chris Papademetrious: add documentation example for ElementFilter (Signed-off-by: Chris Papademetrious <email address hidden>)
- 4e37196... by Leonard Richardson: Added some basic typing to test methods that are called by the actual tests.
- fe88554... by Leonard Richardson: Clarified the match protocol thing.
- 2618f24... by Leonard Richardson: Decided putting limit in ResultSet wasn't a good idea.
- cc290e9... by Leonard Richardson: Added note that Protocol is in typing_extensions so it could be used now.
- 4c0618d... by Leonard Richardson: Made some typing system changes necessary to get the code to run under Python 3.8.
- d5db19b... by Leonard Richardson: I feel like it's more consistent to use modified_attrs throughout.
- 792192c... by Leonard Richardson: Sorted out the _RawAttributeValue/_AttributeValue typing issue.
- 814f1a6... by Leonard Richardson: All the easy stuff is resolved now, and most of the medium-sized stuff except an issue with _RawAttributeValues versus _AttributeValues.
- 3898972... by Leonard Richardson: Got the type definitions in _deprecation.py a _little_ better.
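Several of the commits above revolve around the raw-versus-processed attribute distinction: tree builders hand Beautiful Soup attribute values as plain strings, and multi-valued attributes such as "class" are later split on whitespace, unless they were already split (html5lib can present the same tag twice, and cloning a tag re-processes values). A rough sketch of that splitting logic, using simplified types rather than bs4's actual _RawAttributeValues/_AttributeValues aliases:

```python
import re
from typing import Dict, List, Set, Union

# Same pattern bs4 uses to split whitespace-separated attribute values.
nonwhitespace_re = re.compile(r"\S+")

def replace_cdata_list_values(
    attrs: Dict[str, Union[str, List[str]]],
    cdata_list_attributes: Set[str],
) -> Dict[str, Union[str, List[str]]]:
    """Split whitespace-separated values of multi-valued attributes.

    Values that are already lists are left alone, mirroring the case
    where the method runs a second time on processed values.
    """
    for name, value in list(attrs.items()):
        if name in cdata_list_attributes and isinstance(value, str):
            attrs[name] = nonwhitespace_re.findall(value)
    return attrs

processed = replace_cdata_list_values(
    {"class": "btn btn-large", "id": "go"}, {"class", "rel"}
)
# A second pass must leave the already-split list untouched.
processed = replace_cdata_list_values(processed, {"class", "rel"})
```

The idempotence shown by the second call is exactly why the branch introduces the _RawOrProcessedAttributeValues union type in the diff below.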
Preview Diff
1 | diff --git a/CHANGELOG b/CHANGELOG |
2 | index 162e3dc..41b1467 100644 |
3 | --- a/CHANGELOG |
4 | +++ b/CHANGELOG |
5 | @@ -1,7 +1,5 @@ |
6 | = 4.13.0 (Unreleased) |
7 | |
8 | -TODO: we could stand to put limit inside ResultSet |
9 | - |
10 | * This version drops support for Python 3.6. The minimum supported |
11 | major Python version for Beautiful Soup is now Python 3.7. |
12 | |
13 | @@ -33,12 +31,15 @@ TODO: we could stand to put limit inside ResultSet |
14 | you, since you probably use HTMLParserTreeBuilder, not |
15 | BeautifulSoupHTMLParser directly. |
16 | |
17 | -* The TreeBuilderForHtml5lib methods fragmentClass and getFragment |
18 | - now raise NotImplementedError. These methods are called only by |
19 | - html5lib's HTMLParser.parseFragment() method, which Beautiful Soup |
20 | - doesn't use, so they were untested and should have never been called. |
21 | - The getFragment() implementation was also slightly incorrect in a way |
22 | - that should have caused obvious problems for anyone using it. |
23 | +* The TreeBuilderForHtml5lib methods fragmentClass(), getFragment(), |
24 | + and testSerializer() now raise NotImplementedError. These methods |
25 | + are called only by html5lib's test suite, and Beautiful Soup isn't |
26 | + integrated into that test suite, so this code was long since unused and |
27 | + untested. |
28 | + |
29 | + These methods are _not_ deprecated, since they are methods defined by |
30 | + html5lib. They may one day have real implementations, as part of a future |
31 | + effort to integrate Beautiful Soup into html5lib's test suite. |
32 | |
33 | * If Tag.get_attribute_list() is used to access an attribute that's not set, |
34 | the return value is now an empty list rather than [None]. |
35 | @@ -73,6 +74,11 @@ TODO: we could stand to put limit inside ResultSet |
36 | * A SoupStrainer can now filter tag creation based on a tag's |
37 | namespaced name. Previously only the unqualified name could be used. |
38 | |
39 | +* Some of the arguments in the methods of LXMLTreeBuilderForXML |
40 | + have been renamed for consistency with the names lxml uses for those |
41 | + arguments in the superclass. This won't affect you unless you were |
42 | + calling methods like LXMLTreeBuilderForXML.start() directly. |
43 | + |
44 | * All TreeBuilder constructors now take the empty_element_tags |
45 | argument. The sets of tags found in HTMLTreeBuilder.empty_element_tags and |
46 | HTMLTreeBuilder.block_elements are now in |
47 | diff --git a/bs4/__init__.py b/bs4/__init__.py |
48 | index 95bd48d..6f01a75 100644 |
49 | --- a/bs4/__init__.py |
50 | +++ b/bs4/__init__.py |
51 | @@ -72,7 +72,7 @@ from typing import ( |
52 | cast, |
53 | Counter as CounterType, |
54 | Dict, |
55 | - Iterable, |
56 | + Iterator, |
57 | List, |
58 | Sequence, |
59 | Optional, |
60 | @@ -84,10 +84,11 @@ from typing import ( |
61 | |
62 | from bs4._typing import ( |
63 | _AttributeValue, |
64 | - _AttributeValues, |
65 | _Encoding, |
66 | _Encodings, |
67 | _IncomingMarkup, |
68 | + _RawAttributeValue, |
69 | + _RawAttributeValues, |
70 | _RawMarkup, |
71 | ) |
72 | |
73 | @@ -451,7 +452,7 @@ class BeautifulSoup(Tag): |
74 | clone.original_encoding = self.original_encoding |
75 | return clone |
76 | |
77 | - def __getstate__(self) -> dict[str, Any]: |
78 | + def __getstate__(self) -> Dict[str, Any]: |
79 | # Frequently a tree builder can't be pickled. |
80 | d = dict(self.__dict__) |
81 | if 'builder' in d and d['builder'] is not None and not self.builder.picklable: |
82 | @@ -467,7 +468,7 @@ class BeautifulSoup(Tag): |
83 | del d['_most_recent_element'] |
84 | return d |
85 | |
86 | - def __setstate__(self, state: dict[str, Any]) -> None: |
87 | + def __setstate__(self, state: Dict[str, Any]) -> None: |
88 | # If necessary, restore the TreeBuilder by looking it up. |
89 | self.__dict__ = state |
90 | if isinstance(self.builder, type): |
91 | @@ -613,11 +614,11 @@ class BeautifulSoup(Tag): |
92 | name:str, |
93 | namespace:Optional[str]=None, |
94 | nsprefix:Optional[str]=None, |
95 | - attrs:_AttributeValues={}, |
96 | + attrs:_RawAttributeValues={}, |
97 | sourceline:Optional[int]=None, |
98 | sourcepos:Optional[int]=None, |
99 | string:Optional[str]=None, |
100 | - **kwattrs:_AttributeValue, |
101 | + **kwattrs:_RawAttributeValue, |
102 | ) -> Tag: |
103 | """Create a new Tag associated with this BeautifulSoup object. |
104 | |
105 | @@ -664,7 +665,7 @@ class BeautifulSoup(Tag): |
106 | # The user may want us to use some other class (hopefully a |
107 | # custom subclass) instead of the one we'd use normally. |
108 | container = cast( |
109 | - type[NavigableString], |
110 | + Type[NavigableString], |
111 | self.element_classes.get(container, container) |
112 | ) |
113 | |
114 | @@ -894,14 +895,16 @@ class BeautifulSoup(Tag): |
115 | |
116 | def handle_starttag( |
117 | self, name:str, namespace:Optional[str], |
118 | - nsprefix:Optional[str], attrs:_AttributeValues, |
119 | + nsprefix:Optional[str], attrs:_RawAttributeValues, |
120 | sourceline:Optional[int]=None, sourcepos:Optional[int]=None, |
121 | namespaces:Optional[Dict[str, str]]=None) -> Optional[Tag]: |
122 | """Called by the tree builder when a new tag is encountered. |
123 | |
124 | :param name: Name of the tag. |
125 | :param nsprefix: Namespace prefix for the tag. |
126 | - :param attrs: A dictionary of attribute values. |
127 | + :param attrs: A dictionary of attribute values. Note that |
128 | + attribute values are expected to be simple strings; processing |
129 | + of multi-valued attributes such as "class" comes later. |
130 | :param sourceline: The line number where this tag was found in its |
131 | source document. |
132 | :param sourcepos: The character position within `sourceline` where this |
133 | @@ -964,7 +967,7 @@ class BeautifulSoup(Tag): |
134 | def decode(self, indent_level:Optional[int]=None, |
135 | eventual_encoding:_Encoding=DEFAULT_OUTPUT_ENCODING, |
136 | formatter:Union[Formatter,str]="minimal", |
137 | - iterator:Optional[Iterable[PageElement]]=None, |
138 | + iterator:Optional[Iterator[PageElement]]=None, |
139 | **kwargs:Any) -> str: |
140 | """Returns a string representation of the parse tree |
141 | as a full HTML or XML document. |
142 | diff --git a/bs4/_deprecation.py b/bs4/_deprecation.py |
143 | index febc1b3..86b22a2 100644 |
144 | --- a/bs4/_deprecation.py |
145 | +++ b/bs4/_deprecation.py |
146 | @@ -17,7 +17,7 @@ from typing import ( |
147 | Callable, |
148 | ) |
149 | |
150 | -def _deprecated_alias(old_name, new_name, version): |
151 | +def _deprecated_alias(old_name:str, new_name:str, version:str): |
152 | """Alias one attribute name to another for backward compatibility |
153 | |
154 | :meta private: |
155 | @@ -29,23 +29,23 @@ def _deprecated_alias(old_name, new_name, version): |
156 | return getattr(self, new_name) |
157 | |
158 | @alias.setter |
159 | - def alias(self, value:str)->Any: |
160 | + def alias(self, value:str) -> None: |
161 | ":meta private:" |
162 | warnings.warn(f"Write to deprecated property {old_name}. (Replaced by {new_name}) -- Deprecated since version {version}.", DeprecationWarning, stacklevel=2) |
163 | return setattr(self, new_name, value) |
164 | return alias |
165 | |
166 | -def _deprecated_function_alias(old_name:str, new_name:str, version:str) -> Callable: |
167 | - def alias(self, *args, **kwargs): |
168 | +def _deprecated_function_alias(old_name:str, new_name:str, version:str) -> Callable[[Any], Any]: |
169 | + def alias(self, *args:Any, **kwargs:Any) -> Any: |
170 | ":meta private:" |
171 | warnings.warn(f"Call to deprecated method {old_name}. (Replaced by {new_name}) -- Deprecated since version {version}.", DeprecationWarning, stacklevel=2) |
172 | return getattr(self, new_name)(*args, **kwargs) |
173 | return alias |
174 | |
175 | def _deprecated(replaced_by:str, version:str) -> Callable: |
176 | - def deprecate(func): |
177 | + def deprecate(func:Callable) -> Callable: |
178 | @functools.wraps(func) |
179 | - def with_warning(*args, **kwargs): |
180 | + def with_warning(*args:Any, **kwargs:Any) -> Any: |
181 | ":meta private:" |
182 | warnings.warn( |
183 | f"Call to deprecated method {func.__name__}. (Replaced by {replaced_by}) -- Deprecated since version {version}.", |
184 | diff --git a/bs4/_typing.py b/bs4/_typing.py |
185 | index ab8f7a0..7fff292 100644 |
186 | --- a/bs4/_typing.py |
187 | +++ b/bs4/_typing.py |
188 | @@ -8,7 +8,12 @@ |
189 | # * In 3.10, TypeAlias gains capabilities that can be used to |
190 | # improve the tree matching types (I don't remember what, exactly). |
191 | # * 3.8 defines the Protocol type, which can be used to do duck typing |
192 | -# in a statically checkable way. |
193 | +# in a statically checkable way. (Protocols are also in typing_extensions, |
194 | +# so I could add this now in the couple places it's needed.) |
195 | +# * In 3.9 it's possible to specialize the re.Match type, |
196 | +# e.g. re.Match[str]. In 3.8 there's a typing.re namespace for this, |
197 | +# but it's removed in 3.12, so to support the widest possible set of |
198 | +# versions I'm not using it. |
199 | |
200 | import re |
201 | from typing_extensions import TypeAlias |
202 | @@ -48,16 +53,22 @@ _InvertedNamespaceMapping:TypeAlias = Dict[_NamespaceURL, _NamespacePrefix] |
203 | |
204 | # Aliases for the attribute values associated with HTML/XML tags. |
205 | # |
206 | -# Note that these are attribute values in their final form, as stored |
207 | -# in the `Tag` class. Different parsers present attributes to the |
208 | -# `TreeBuilder` subclasses in different formats, which are not defined |
209 | -# here. |
210 | +# These are the relatively unprocessed values Beautiful Soup expects |
211 | +# to come from a `TreeBuilder`. |
212 | +_RawAttributeValue: TypeAlias = str |
213 | +_RawAttributeValues: TypeAlias = Dict[str, _RawAttributeValue] |
214 | + |
215 | +# These are attribute values in their final form, as stored in the |
216 | +# `Tag` class, after they have been processed and (in some cases) |
217 | +# split into lists of strings. |
218 | _AttributeValue: TypeAlias = Union[str, Iterable[str]] |
219 | _AttributeValues: TypeAlias = Dict[str, _AttributeValue] |
220 | |
221 | -# The most common form in which attribute values are passed in from a |
222 | -# parser. |
223 | -_RawAttributeValues: TypeAlias = dict[str, str] |
224 | +# The methods that deal with turning _RawAttributeValues into |
225 | +# _AttributeValues may be called several times, even after the values |
226 | +# are already processed (e.g. when cloning a tag), so they need to |
227 | +# be able to acommodate both possibilities. |
228 | +_RawOrProcessedAttributeValues:TypeAlias = Union[_RawAttributeValues, _AttributeValues] |
229 | |
230 | # Aliases to represent the many possibilities for matching bits of a |
231 | # parse tree. |
232 | diff --git a/bs4/builder/__init__.py b/bs4/builder/__init__.py |
233 | index b59513e..29726c3 100644 |
234 | --- a/bs4/builder/__init__.py |
235 | +++ b/bs4/builder/__init__.py |
236 | @@ -33,15 +33,20 @@ from bs4.element import ( |
237 | nonwhitespace_re |
238 | ) |
239 | |
240 | +from bs4._typing import ( |
241 | + _AttributeValues, |
242 | + _RawAttributeValue, |
243 | +) |
244 | if TYPE_CHECKING: |
245 | from bs4 import BeautifulSoup |
246 | from bs4.element import ( |
247 | NavigableString, Tag, |
248 | - _AttributeValues, _AttributeValue, |
249 | ) |
250 | from bs4._typing import ( |
251 | + _AttributeValue, |
252 | _Encoding, |
253 | _Encodings, |
254 | + _RawOrProcessedAttributeValues, |
255 | _RawMarkup, |
256 | ) |
257 | |
258 | @@ -75,7 +80,7 @@ class TreeBuilderRegistry(object): |
259 | builders_for_feature: Dict[str, List[Type[TreeBuilder]]] |
260 | builders: List[Type[TreeBuilder]] |
261 | |
262 | - def __init__(self): |
263 | + def __init__(self) -> None: |
264 | self.builders_for_feature = defaultdict(list) |
265 | self.builders = [] |
266 | |
267 | @@ -233,7 +238,7 @@ class TreeBuilder(object): |
268 | #: no contents--that is, using XML rules. HTMLTreeBuilder |
269 | #: defines a different set of DEFAULT_EMPTY_ELEMENT_TAGS based on the |
270 | #: HTML 4 and HTML5 standards. |
271 | - DEFAULT_EMPTY_ELEMENT_TAGS: Optional[Set] = None |
272 | + DEFAULT_EMPTY_ELEMENT_TAGS: Optional[Set[str]] = None |
273 | |
274 | #: Most parsers don't keep track of line numbers. |
275 | TRACKS_LINE_NUMBERS: bool = False |
276 | @@ -347,7 +352,7 @@ class TreeBuilder(object): |
277 | """ |
278 | return False |
279 | |
280 | - def _replace_cdata_list_attribute_values(self, tag_name:str, attrs:_AttributeValues): |
281 | + def _replace_cdata_list_attribute_values(self, tag_name:str, attrs:_RawOrProcessedAttributeValues) -> _AttributeValues: |
282 | """When an attribute value is associated with a tag that can |
283 | have multiple values for that attribute, convert the string |
284 | value to a list of strings. |
285 | @@ -359,88 +364,106 @@ class TreeBuilder(object): |
286 | :param tag_name: The name of a tag. |
287 | :param attrs: A dictionary containing the tag's attributes. |
288 | Any appropriate attribute values will be modified in place. |
289 | + :return: The modified dictionary that was originally passed in. |
290 | """ |
291 | - if not attrs: |
292 | - return attrs |
293 | - if self.cdata_list_attributes: |
294 | - universal: Set[str] = self.cdata_list_attributes.get('*', set()) |
295 | - tag_specific = self.cdata_list_attributes.get( |
296 | - tag_name.lower(), None) |
297 | - for attr in list(attrs.keys()): |
298 | - values: _AttributeValue |
299 | - if attr in universal or (tag_specific and attr in tag_specific): |
300 | - # We have a "class"-type attribute whose string |
301 | - # value is a whitespace-separated list of |
302 | - # values. Split it into a list. |
303 | - value = attrs[attr] |
304 | - if isinstance(value, str): |
305 | - values = nonwhitespace_re.findall(value) |
306 | - else: |
307 | - # html5lib sometimes calls setAttributes twice |
308 | - # for the same tag when rearranging the parse |
309 | - # tree. On the second call the attribute value |
310 | - # here is already a list. If this happens, |
311 | - # leave the value alone rather than trying to |
312 | - # split it again. |
313 | - values = value |
314 | - attrs[attr] = values |
315 | - return attrs |
316 | + |
317 | + # First, cast the attrs dict to _AttributeValues. This might |
318 | + # not be accurate yet, but it will be by the time this method |
319 | + # returns. |
320 | + modified_attrs = cast(_AttributeValues, attrs) |
321 | + if not modified_attrs or not self.cdata_list_attributes: |
322 | + # Nothing to do. |
323 | + return modified_attrs |
324 | + |
325 | + # There is at least a possibility that we need to modify one of |
326 | + # the attribute values. |
327 | + universal: Set[str] = self.cdata_list_attributes.get('*', set()) |
328 | + tag_specific = self.cdata_list_attributes.get( |
329 | + tag_name.lower(), None) |
330 | + for attr in list(modified_attrs.keys()): |
331 | + modified_value:_AttributeValue |
332 | + if attr in universal or (tag_specific and attr in tag_specific): |
333 | + # We have a "class"-type attribute whose string |
334 | + # value is a whitespace-separated list of |
335 | + # values. Split it into a list. |
336 | + original_value:_AttributeValue = modified_attrs[attr] |
337 | + if isinstance(original_value, _RawAttributeValue): |
338 | + # This is a _RawAttributeValue (a string) that |
339 | + # needs to be split into a list so it can be an |
340 | + # _AttributeValue. |
341 | + modified_value = nonwhitespace_re.findall(original_value) |
342 | + else: |
343 | + # html5lib calls setAttributes twice for the |
344 | + # same tag when rearranging the parse tree. On |
345 | + # the second call the attribute value here is |
346 | + # already a list. This can also happen when a |
347 | + # Tag object is cloned. If this happens, leave |
348 | + # the value alone rather than trying to split |
349 | + # it again. |
350 | + modified_value = original_value |
351 | + modified_attrs[attr] = modified_value |
352 | + return modified_attrs |
353 | |
354 | class SAXTreeBuilder(TreeBuilder): |
355 | """A Beautiful Soup treebuilder that listens for SAX events. |
356 | |
357 | - This is not currently used for anything, but it demonstrates |
358 | - how a simple TreeBuilder would work. |
359 | + This is not currently used for anything, and it will be removed |
360 | + soon. It was a good idea, but it wasn't properly integrated into the |
361 | + rest of Beautiful Soup, so there have been long stretches where it |
362 | + hasn't worked properly. |
363 | """ |
364 | - |
365 | - def __init__(self, *args, **kwargs): |
366 | + def __init__(self, *args:Any, **kwargs:Any) -> None: |
367 | warnings.warn( |
368 | - f"The SAXTreeBuilder class was deprecated in 4.13.0. It is completely untested and probably doesn't work; use at your own risk.", |
369 | + f"The SAXTreeBuilder class was deprecated in 4.13.0 and will be removed soon thereafter. It is completely untested and probably doesn't work; do not use it.", |
370 | DeprecationWarning, |
371 | stacklevel=2 |
372 | ) |
373 | super(SAXTreeBuilder, self).__init__(*args, **kwargs) |
374 | |
375 | - def feed(self, markup:_RawMarkup): |
376 | + def feed(self, markup:_RawMarkup) -> None: |
377 | raise NotImplementedError() |
378 | |
379 | - def close(self): |
380 | + def close(self) -> None: |
381 | pass |
382 | |
383 | - def startElement(self, name, attrs): |
384 | + def startElement(self, name:str, attrs:Dict[str,str]) -> None: |
385 | attrs = dict((key[1], value) for key, value in list(attrs.items())) |
386 | #print("Start %s, %r" % (name, attrs)) |
387 | - self.soup.handle_starttag(name, attrs) |
388 | + assert self.soup is not None |
389 | + self.soup.handle_starttag(name, None, None, attrs) |
390 | |
391 | - def endElement(self, name): |
392 | + def endElement(self, name:str) -> None: |
393 | #print("End %s" % name) |
394 | + assert self.soup is not None |
395 | self.soup.handle_endtag(name) |
396 | |
397 | - def startElementNS(self, nsTuple, nodeName, attrs): |
398 | + def startElementNS(self, nsTuple:Tuple[str,str], |
399 | + nodeName:str, attrs:Dict[str,str]) -> None: |
400 | # Throw away (ns, nodeName) for now. |
401 | self.startElement(nodeName, attrs) |
402 | |
403 | - def endElementNS(self, nsTuple, nodeName): |
404 | + def endElementNS(self, nsTuple:Tuple[str,str], nodeName:str) -> None: |
405 | # Throw away (ns, nodeName) for now. |
406 | self.endElement(nodeName) |
407 | #handler.endElementNS((ns, node.nodeName), node.nodeName) |
408 | |
409 | - def startPrefixMapping(self, prefix, nodeValue): |
410 | + def startPrefixMapping(self, prefix:str, nodeValue:str) -> None: |
411 | # Ignore the prefix for now. |
412 | pass |
413 | |
414 | - def endPrefixMapping(self, prefix): |
415 | + def endPrefixMapping(self, prefix:str) -> None: |
416 | # Ignore the prefix for now. |
417 | # handler.endPrefixMapping(prefix) |
418 | pass |
419 | |
420 | - def characters(self, content): |
421 | + def characters(self, content:str) -> None: |
422 | + assert self.soup is not None |
423 | self.soup.handle_data(content) |
424 | |
425 | - def startDocument(self): |
426 | + def startDocument(self) -> None: |
427 | pass |
428 | |
429 | - def endDocument(self): |
430 | + def endDocument(self) -> None: |
431 | pass |
432 | |
433 | |
434 | @@ -620,13 +643,13 @@ class DetectsXMLParsedAsHTML(object): |
435 | return False |
436 | markup = markup[:500] |
437 | if isinstance(markup, bytes): |
438 | - markup_b = cast(bytes, markup) |
439 | + markup_b:bytes = markup |
440 | looks_like_xml = ( |
441 | markup_b.startswith(cls.XML_PREFIX_B) |
442 | and not cls.LOOKS_LIKE_HTML_B.search(markup) |
443 | ) |
444 | else: |
445 | - markup_s = cast(str, markup) |
446 | + markup_s:str = markup |
447 | looks_like_xml = ( |
448 | markup_s.startswith(cls.XML_PREFIX) |
449 | and not cls.LOOKS_LIKE_HTML.search(markup) |
450 | @@ -650,9 +673,13 @@ class DetectsXMLParsedAsHTML(object): |
451 | self._first_processing_instruction = None |
452 | self._root_tag_name = None |
453 | |
454 | - def _document_might_be_xml(self, processing_instruction:str): |
455 | + def _document_might_be_xml(self, processing_instruction:str) -> None: |
456 | """Call this method when encountering an XML declaration, or a |
457 | "processing instruction" that might be an XML declaration. |
458 | + |
459 | + This helps Beautiful Soup detect potential issues later, if |
460 | + the XML document turns out to be a non-XHTML document that's |
461 | + being parsed as XML. |
462 | """ |
463 | if (self._first_processing_instruction is not None |
464 | or self._root_tag_name is not None): |
465 | diff --git a/bs4/builder/_html5lib.py b/bs4/builder/_html5lib.py |
466 | index 2ea556c..51e3c97 100644 |
467 | --- a/bs4/builder/_html5lib.py |
468 | +++ b/bs4/builder/_html5lib.py |
469 | @@ -12,6 +12,7 @@ from typing import ( |
470 | Iterable, |
471 | List, |
472 | Optional, |
473 | + TypeAlias, |
474 | TYPE_CHECKING, |
475 | Tuple, |
476 | Union, |
477 | @@ -54,6 +55,7 @@ if TYPE_CHECKING: |
478 | from bs4 import BeautifulSoup |
479 | |
480 | from html5lib.treebuilders import base as treebuilder_base |
481 | +from html5lib.treewalkers import base as treewalker_base |
482 | |
483 | |
484 | class HTML5TreeBuilder(HTMLTreeBuilder): |
485 | @@ -138,6 +140,14 @@ class HTML5TreeBuilder(HTMLTreeBuilder): |
486 | # HTMLBinaryInputStream.__init__. |
487 | extra_kwargs['override_encoding'] = self.user_specified_encoding |
488 | |
489 | + # TODO-TYPING: typeshed stub says the second argument to |
490 | + # HTMLParser.parse is scripting:bool, but the implementation |
491 | + # treats scripting as one of the kwargs. scripting:bool isn't |
492 | + # called out separately until we get down into _parse(), and |
493 | + # there it's the fourth argument, not the second. I'm not |
494 | + # sure what the stub ought to look like, but I'm confident |
495 | + # enough that it's better to leave this alone, rather than |
496 | + # change this call to get rid of the warning. |
497 | doc = parser.parse(markup, **extra_kwargs) |
498 | |
499 | # Set the character encoding detected by the tokenizer. |
500 | @@ -146,6 +156,10 @@ class HTML5TreeBuilder(HTMLTreeBuilder): |
501 | # charEncoding to UTF-8 if it gets Unicode input. |
502 | doc.original_encoding = None |
503 | else: |
504 | + # TODO-TYPING HTMLParser.tokenizer is set by |
505 | + # HTMLParser._parse(), so it's definitely set by this |
506 | + # point, but it's not defined as an instance variable, so |
507 | + # this line gives a warning. |
508 | original_encoding = parser.tokenizer.stream.charEncoding[0] |
509 | # The encoding is an html5lib Encoding object. We want to |
510 | # use a string for compatibility with other tree builders. |
511 | @@ -231,72 +245,35 @@ class TreeBuilderForHtml5lib(treebuilder_base.TreeBuilder): |
512 | |
513 | def fragmentClass(self) -> 'Element': |
514 | """This is only used by html5lib HTMLParser.parseFragment(), |
515 | - which is never used by Beautiful Soup.""" |
516 | + which is never used by Beautiful Soup, only by the html5lib |
517 | + unit tests. Since we don't currently hook into those tests, |
518 | + the implementation is left blank. |
519 | + """ |
520 | raise NotImplementedError() |
521 | |
522 | def getFragment(self) -> 'Element': |
523 | - """This is only used by html5lib HTMLParser.parseFragment, |
524 | - which is never used by Beautiful Soup.""" |
525 | + """This is only used by the html5lib unit tests. Since we |
526 | + don't currently hook into those tests, the implementation is |
527 | + left blank. |
528 | + """ |
529 | raise NotImplementedError() |
530 | |
531 | def appendChild(self, node:'Element') -> None: |
532 | - # TODO: This code is not covered by the BS4 tests. |
533 | + # TODO: This code is not covered by the BS4 tests, and |
534 | + # apparently not triggered by the html5lib test suite either. |
535 | self.soup.append(node.element) |
536 | |
537 | def getDocument(self) -> 'BeautifulSoup': |
538 | return self.soup |
539 | |
540 | # TODO-TYPING: typeshed stubs are incorrect about this; |
541 | - # cloneNode returns a str, not None. |
542 | + # testSerializer returns a str, not None. |
543 | def testSerializer(self, element:'Element') -> str: |
544 | - from bs4 import BeautifulSoup |
545 | - rv = [] |
546 | - doctype_re = re.compile(r'^(.*?)(?: PUBLIC "(.*?)"(?: "(.*?)")?| SYSTEM "(.*?)")?$') |
547 | - |
548 | - def serializeElement(element:Union['Element', PageElement], indent=0) -> None: |
549 | - if isinstance(element, BeautifulSoup): |
550 | - pass |
551 | - if isinstance(element, Doctype): |
552 | - m = doctype_re.match(element) |
553 | - if m is not None: |
554 | - name = m.group(1) |
555 | - if m.lastindex is not None and m.lastindex > 1: |
556 | - publicId = m.group(2) or "" |
557 | - systemId = m.group(3) or m.group(4) or "" |
558 | - rv.append("""|%s<!DOCTYPE %s "%s" "%s">""" % |
559 | - (' ' * indent, name, publicId, systemId)) |
560 | - else: |
561 | - rv.append("|%s<!DOCTYPE %s>" % (' ' * indent, name)) |
562 | - else: |
563 | - rv.append("|%s<!DOCTYPE >" % (' ' * indent,)) |
564 | - elif isinstance(element, Comment): |
565 | - rv.append("|%s<!-- %s -->" % (' ' * indent, element)) |
566 | - elif isinstance(element, NavigableString): |
567 | - rv.append("|%s\"%s\"" % (' ' * indent, element)) |
568 | - elif isinstance(element, Element): |
569 | - if element.namespace: |
570 | - name = "%s %s" % (prefixes[element.namespace], |
571 | - element.name) |
572 | - else: |
573 | - name = element.name |
574 | - rv.append("|%s<%s>" % (' ' * indent, name)) |
575 | - if element.attrs: |
576 | - attributes = [] |
577 | - for name, value in list(element.attrs.items()): |
578 | - if isinstance(name, NamespacedAttribute): |
579 | - name = "%s %s" % (prefixes[name.namespace], name.name) |
580 | - if isinstance(value, list): |
581 | - value = " ".join(value) |
582 | - attributes.append((name, value)) |
583 | - |
584 | - for name, value in sorted(attributes): |
585 | - rv.append('|%s%s="%s"' % (' ' * (indent + 2), name, value)) |
586 | - indent += 2 |
587 | - for child in element.children: |
588 | - serializeElement(child, indent) |
589 | - serializeElement(element, 0) |
590 | - |
591 | - return "\n".join(rv) |
592 | + """This is only used by the html5lib unit tests. Since we |
593 | + don't currently hook into those tests, the implementation is |
594 | + left blank. |
595 | + """ |
596 | + raise NotImplementedError() |
597 | |
598 | class AttrList(object): |
599 | """Represents a Tag's attributes in a way compatible with html5lib.""" |
600 | @@ -340,11 +317,28 @@ class AttrList(object): |
601 | def __contains__(self, name:str) -> bool: |
602 | return name in list(self.attrs.keys()) |
603 | |
604 | +class BeautifulSoupNode(treebuilder_base.Node): |
605 | + element:PageElement |
606 | + soup:'BeautifulSoup' |
607 | + namespace:Optional[_NamespaceURL] |
608 | + |
609 | + @property |
610 | + def nodeType(self) -> int: |
611 | + """Return the html5lib constant corresponding to the type of |
612 | + the underlying DOM object. |
613 | |
614 | -class Element(treebuilder_base.Node): |
615 | + NOTE: This property is only accessed by the html5lib test |
616 | + suite, not by Beautiful Soup proper. |
617 | + """ |
618 | + raise NotImplementedError() |
619 | |
620 | + # TODO-TYPING: typeshed stubs are incorrect about this; |
621 | + # cloneNode returns a new Node, not None. |
622 | + def cloneNode(self) -> treebuilder_base.Node: |
623 | + raise NotImplementedError() |
624 | + |
625 | +class Element(BeautifulSoupNode): |
626 | element:Tag |
627 | - soup:'BeautifulSoup' |
628 | namespace:Optional[_NamespaceURL] |
629 | |
630 | def __init__(self, element:Tag, soup:'BeautifulSoup', |
631 | @@ -354,30 +348,20 @@ class Element(treebuilder_base.Node): |
632 | self.soup = soup |
633 | self.namespace = namespace |
634 | |
635 | - def appendChild(self, node:'Element') -> None: |
636 | - string_child = child = None |
637 | - if isinstance(node, str): |
638 | - # Some other piece of code decided to pass in a string |
639 | - # instead of creating a TextElement object to contain the |
640 | - # string. This should not ever happen. |
641 | - string_child = child = node |
642 | - elif isinstance(node, Tag): |
643 | - # Some other piece of code decided to pass in a Tag |
644 | - # instead of creating an Element object to contain the |
645 | - # Tag. This should not ever happen. |
646 | - child = node |
647 | - elif node.element.__class__ == NavigableString: |
648 | + def appendChild(self, node:'BeautifulSoupNode') -> None: |
649 | + string_child:Optional[NavigableString] = None |
650 | + child:PageElement |
651 | + if type(node.element) == NavigableString: |
652 | string_child = child = node.element |
653 | - node.parent = self |
654 | else: |
655 | child = node.element |
656 | - node.parent = self |
657 | + node.parent = self |
658 | |
659 | - if not isinstance(child, str) and child is not None and child.parent is not None: |
660 | + if child is not None and child.parent is not None and not isinstance(child, str): |
661 | node.element.extract() |
662 | |
663 | if (string_child is not None and self.element.contents |
664 | - and self.element.contents[-1].__class__ == NavigableString): |
665 | + and type(self.element.contents[-1]) == NavigableString): |
666 | # We are appending a string onto another string. |
667 | # TODO This has O(n^2) performance, for input like |
668 | # "a</a>a</a>a</a>..." |
669 | @@ -413,18 +397,36 @@ class Element(treebuilder_base.Node): |
670 | return {} |
671 | return AttrList(self.element) |
672 | |
673 | - def setAttributes(self, attributes:Optional[Dict]) -> None: |
674 | + # An HTML5lib attribute name may either be a single string, |
675 | + # or a tuple (namespace, name). |
676 | + _Html5libAttributeName: TypeAlias = Union[str, Tuple[str, str]] |
677 | + # Now we can define the type this method accepts as a dictionary |
678 | + # mapping those attribute names to single string values. |
679 | + _Html5libAttributes: TypeAlias = Dict[_Html5libAttributeName, str] |
680 | + def setAttributes(self, attributes:Optional[_Html5libAttributes]) -> None: |
681 | if attributes is not None and len(attributes) > 0: |
682 | + |
683 | + # Replace any namespaced attributes with |
684 | + # NamespacedAttribute objects. |
685 | for name, value in list(attributes.items()): |
686 | if isinstance(name, tuple): |
687 | new_name = NamespacedAttribute(*name) |
688 | del attributes[name] |
689 | attributes[new_name] = value |
690 | |
691 | + # We can now cast attributes to the type of Dict |
692 | + # used by Beautiful Soup. |
693 | + normalized_attributes = cast(_AttributeValues, attributes) |
694 | + |
695 | +            # Values for attributes like 'class' came in as single strings; |
696 | + # replace them with lists of strings as appropriate. |
697 | self.soup.builder._replace_cdata_list_attribute_values( |
698 | - self.name, attributes) |
699 | - for name, value in list(attributes.items()): |
700 | - self.element[name] = value |
701 | + self.name, normalized_attributes) |
702 | + |
703 | + # Then set the attributes on the Tag associated with this |
704 | + # BeautifulSoupNode. |
705 | + for name, value_or_values in list(normalized_attributes.items()): |
706 | + self.element[name] = value_or_values |
707 | |
708 | # The attributes may contain variables that need substitution. |
709 | # Call set_up_substitutions manually. |
710 | @@ -434,19 +436,20 @@ class Element(treebuilder_base.Node): |
711 | self.soup.builder.set_up_substitutions(self.element) |
712 | attributes = property(getAttributes, setAttributes) |
713 | |
714 | - def insertText(self, data:str, insertBefore:Optional['Element']=None) -> None: |
715 | + def insertText(self, data:str, insertBefore:Optional['BeautifulSoupNode']=None) -> None: |
716 | text = TextNode(self.soup.new_string(data), self.soup) |
717 | if insertBefore: |
718 | self.insertBefore(text, insertBefore) |
719 | else: |
720 | self.appendChild(text) |
721 | |
722 | - def insertBefore(self, node:'Element', refNode:'Element') -> None: |
723 | + def insertBefore(self, node:'BeautifulSoupNode', refNode:'BeautifulSoupNode') -> None: |
724 | index = self.element.index(refNode.element) |
725 | - if (node.element.__class__ == NavigableString and self.element.contents |
726 | - and self.element.contents[index-1].__class__ == NavigableString): |
727 | + if (type(node.element) == NavigableString and self.element.contents |
728 | + and type(self.element.contents[index-1]) == NavigableString): |
729 | # (See comments in appendChild) |
730 | old_node = self.element.contents[index-1] |
731 | + assert type(old_node) == NavigableString |
732 | new_str = self.soup.new_string(old_node + node.element) |
733 | old_node.replace_with(new_str) |
734 | else: |
735 | @@ -504,13 +507,19 @@ class Element(treebuilder_base.Node): |
736 | # parent's last descendant. It has no .next_sibling and |
737 | # its .next_element is whatever the previous last |
738 | # descendant had. |
739 | - last_childs_last_descendant = to_append[-1]._last_descendant(False, True) |
740 | + last_childs_last_descendant = to_append[-1]._last_descendant( |
741 | + is_initialized=False, accept_self=True |
742 | + ) |
743 | |
744 | + # Since we passed accept_self=True into _last_descendant, |
745 | + # there's no possibility that the result is None. |
746 | + assert last_childs_last_descendant is not None |
747 | last_childs_last_descendant.next_element = new_parents_last_descendant_next_element |
748 | if new_parents_last_descendant_next_element is not None: |
749 | - # TODO: This code has no test coverage and I'm not sure |
750 | - # how to get html5lib to go through this path, but it's |
751 | - # just the other side of the previous line. |
752 | + # TODO-COVERAGE: This code has no test coverage and |
753 | + # I'm not sure how to get html5lib to go through this |
754 | + # path, but it's just the other side of the previous |
755 | + # line. |
756 | new_parents_last_descendant_next_element.previous_element = last_childs_last_descendant |
757 | last_childs_last_descendant.next_sibling = None |
758 | |
759 | @@ -526,7 +535,12 @@ class Element(treebuilder_base.Node): |
760 | # print("FROM", self.element) |
761 | # print("TO", new_parent_element) |
762 | |
763 | - # TODO: typeshed stubs are incorrect about this; |
764 | + # TODO-TYPING: typeshed stubs are incorrect about this; |
765 | + # hasContent returns a boolean, not None. |
766 | + def hasContent(self) -> bool: |
767 | + return len(self.element.contents) > 0 |
768 | + |
769 | + # TODO-TYPING: typeshed stubs are incorrect about this; |
770 | # cloneNode returns a new Node, not None. |
771 | def cloneNode(self) -> treebuilder_base.Node: |
772 | tag = self.soup.new_tag(self.element.name, self.namespace) |
773 | @@ -535,24 +549,17 @@ class Element(treebuilder_base.Node): |
774 | node.attributes[key] = value |
775 | return node |
776 | |
777 | - # TODO-TYPING: typeshed stubs are incorrect about this; |
778 | - # cloneNode returns a boolean, not None. |
779 | - def hasContent(self) -> bool: |
780 | - return len(self.element.contents) > 0 |
781 | - |
782 | - def getNameTuple(self) -> Tuple[str, str]: |
783 | + def getNameTuple(self) -> Tuple[Optional[_NamespaceURL], str]: |
784 | if self.namespace == None: |
785 | return namespaces["html"], self.name |
786 | else: |
787 | return self.namespace, self.name |
788 | - |
789 | nameTuple = property(getNameTuple) |
790 | |
791 | -class TextNode(Element): |
792 | - def __init__(self, element:PageElement, soup:'BeautifulSoup'): |
793 | +class TextNode(BeautifulSoupNode): |
794 | + element:NavigableString |
795 | + |
796 | + def __init__(self, element:NavigableString, soup:'BeautifulSoup'): |
797 | treebuilder_base.Node.__init__(self, None) |
798 | self.element = element |
799 | self.soup = soup |
800 | - |
801 | - def cloneNode(self) -> treebuilder_base.Node: |
802 | - raise NotImplementedError() |
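Review note on the `appendChild`/`insertBefore` changes above: both now coalesce an incoming string node with an adjacent existing string. A minimal stdlib-only sketch of that rule, with hypothetical names and plain `str` standing in for `NavigableString` (the real code uses an exact `type()` check so that `NavigableString` subclasses like `Comment` are never merged):

```python
def append_child(contents: list, node) -> list:
    """Sketch of the coalescing rule: if the incoming node and the last
    existing child are both plain strings, merge them into one node.
    (As the TODO in the diff notes, repeated merging is O(n^2) for
    pathological input like "a</a>a</a>a</a>...")"""
    if contents and isinstance(node, str) and isinstance(contents[-1], str):
        contents[-1] = contents[-1] + node  # merge adjacent strings
    else:
        contents.append(node)
    return contents
```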
803 | diff --git a/bs4/builder/_htmlparser.py b/bs4/builder/_htmlparser.py |
804 | index 91cecf7..d8e21d1 100644 |
805 | --- a/bs4/builder/_htmlparser.py |
806 | +++ b/bs4/builder/_htmlparser.py |
807 | @@ -49,7 +49,6 @@ if TYPE_CHECKING: |
808 | from bs4 import BeautifulSoup |
809 | from bs4.element import NavigableString |
810 | from bs4._typing import ( |
811 | - _AttributeValues, |
812 | _Encoding, |
813 | _Encodings, |
814 | _RawMarkup, |
815 | @@ -60,6 +59,14 @@ HTMLPARSER = 'html.parser' |
816 | _DuplicateAttributeHandler = Callable[[Dict[str, str], str, str], None] |
817 | |
818 | class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML): |
819 | +    #: Constant to handle duplicate attributes by replacing earlier values |
820 | +    #: with later ones. |
821 | +    REPLACE:str = 'replace' |
822 | + |
823 | +    #: Constant to handle duplicate attributes by ignoring later values |
824 | +    #: and keeping the earlier ones. |
825 | +    IGNORE:str = 'ignore' |
826 | + |
827 | """A subclass of the Python standard library's HTMLParser class, which |
828 | listens for HTMLParser events and translates them into calls |
829 | to Beautiful Soup's tree construction API. |
830 | @@ -73,11 +80,13 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML): |
831 | the name of the duplicate attribute, and the most recent value |
832 | encountered. |
833 | """ |
834 | - def __init__(self, soup:BeautifulSoup, *args, **kwargs): |
835 | + def __init__( |
836 | + self, soup:BeautifulSoup, *args:Any, |
837 | + on_duplicate_attribute:Union[str, _DuplicateAttributeHandler]=REPLACE, |
838 | + **kwargs:Any |
839 | + ): |
840 | self.soup = soup |
841 | - self.on_duplicate_attribute = kwargs.pop( |
842 | - 'on_duplicate_attribute', self.REPLACE |
843 | - ) |
844 | + self.on_duplicate_attribute = on_duplicate_attribute |
845 | HTMLParser.__init__(self, *args, **kwargs) |
846 | |
847 | # Keep a list of empty-element tags that were encountered |
848 | @@ -90,14 +99,6 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML): |
849 | self.already_closed_empty_element = [] |
850 | |
851 | self._initialize_xml_detector() |
852 | - |
853 | - #: Constant to handle duplicate attributes by replacing earlier values |
854 | - #: with later ones. |
855 | - IGNORE:str = 'ignore' |
856 | - |
857 | - #: Constant to handle duplicate attributes by ignoring later values |
858 | - #: and keeping the earlier ones. |
859 | - REPLACE:str = 'replace' |
860 | |
861 | on_duplicate_attribute:Union[str, _DuplicateAttributeHandler] |
862 | already_closed_empty_element: List[str] |
863 | @@ -145,7 +146,7 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML): |
864 | closing tag). |
865 | """ |
866 | # TODO: handle namespaces here? |
867 | - attr_dict: Dict[str, str] = {} |
868 | + attr_dict:Dict[str, str] = {} |
869 | for key, value in attrs: |
870 | # Change None attribute values to the empty string |
871 | # for consistency with the other tree builders. |
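For context on the `REPLACE`/`IGNORE` constants and the `on_duplicate_attribute` parameter moved above: the dispatch performed when the parser sees the same attribute twice can be sketched roughly like this (stdlib-only, function name hypothetical; the real logic lives inside the tag handler):

```python
REPLACE = "replace"
IGNORE = "ignore"

def set_attribute(attrs: dict, key: str, value: str, on_duplicate=REPLACE) -> None:
    # First occurrence: always store. On a duplicate, REPLACE keeps the
    # later value, IGNORE keeps the earlier one, and a callable receives
    # (attrs, key, value) and decides for itself.
    if key not in attrs or on_duplicate == REPLACE:
        attrs[key] = value
    elif callable(on_duplicate):
        on_duplicate(attrs, key, value)
    # IGNORE: leave the earlier value in place.
```

In actual use this is what `BeautifulSoup(markup, "html.parser", on_duplicate_attribute="ignore")` controls.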
872 | diff --git a/bs4/builder/_lxml.py b/bs4/builder/_lxml.py |
873 | index 3dfe88a..380164e 100644 |
874 | --- a/bs4/builder/_lxml.py |
875 | +++ b/bs4/builder/_lxml.py |
876 | @@ -13,6 +13,7 @@ from collections.abc import Callable |
877 | |
878 | from typing import ( |
879 | Any, |
880 | + cast, |
881 | Dict, |
882 | IO, |
883 | Iterable, |
884 | @@ -21,6 +22,7 @@ from typing import ( |
885 | Set, |
886 | Tuple, |
887 | Type, |
888 | + TypeAlias, |
889 | TYPE_CHECKING, |
890 | Union, |
891 | ) |
892 | @@ -28,7 +30,6 @@ from typing import ( |
893 | from io import BytesIO |
894 | from io import StringIO |
895 | from lxml import etree |
896 | -from bs4.dammit import (_Encoding) |
897 | from bs4.element import ( |
898 | Comment, |
899 | Doctype, |
900 | @@ -60,13 +61,16 @@ if TYPE_CHECKING: |
901 | |
902 | LXML:str = 'lxml' |
903 | |
904 | -def _invert(d): |
905 | +def _invert(d:dict[Any, Any]) -> dict[Any, Any]: |
906 | "Invert a dictionary." |
907 | return dict((v,k) for k, v in list(d.items())) |
908 | |
909 | +_LXMLParser:TypeAlias = Union[etree.XMLParser, etree.HTMLParser] |
910 | +_ParserOrParserClass:TypeAlias = Union[_LXMLParser, Type[etree.XMLParser], Type[etree.HTMLParser]] |
911 | + |
912 | class LXMLTreeBuilderForXML(TreeBuilder): |
913 | |
914 | - DEFAULT_PARSER_CLASS:Type[Any] = etree.XMLParser |
915 | + DEFAULT_PARSER_CLASS:Type[etree.XMLParser] = etree.XMLParser |
916 | |
917 | is_xml:bool = True |
918 | |
919 | @@ -93,6 +97,7 @@ class LXMLTreeBuilderForXML(TreeBuilder): |
920 | nsmaps: List[Optional[_InvertedNamespaceMapping]] |
921 | empty_element_tags: Set[str] |
922 | parser: Any |
923 | + _default_parser: Optional[etree.XMLParser] |
924 | |
925 | # NOTE: If we parsed Element objects and looked at .sourceline, |
926 | # we'd be able to see the line numbers from the original document. |
927 | @@ -137,7 +142,7 @@ class LXMLTreeBuilderForXML(TreeBuilder): |
928 | # prefix, the first one in the document takes precedence. |
929 | self.soup._namespaces[key] = value |
930 | |
931 | - def default_parser(self, encoding:Optional[_Encoding]) -> Type: |
932 | + def default_parser(self, encoding:Optional[_Encoding]) -> _ParserOrParserClass: |
933 | """Find the default parser for the given encoding. |
934 | |
935 | :return: Either a parser object or a class, which |
936 | @@ -148,7 +153,7 @@ class LXMLTreeBuilderForXML(TreeBuilder): |
937 | return self.DEFAULT_PARSER_CLASS( |
938 | target=self, strip_cdata=False, recover=True, encoding=encoding) |
939 | |
940 | - def parser_for(self, encoding: Optional[_Encoding]) -> Any: |
941 | + def parser_for(self, encoding: Optional[_Encoding]) -> _LXMLParser: |
942 | """Instantiate an appropriate parser for the given encoding. |
943 | |
944 | :param encoding: A string. |
945 | @@ -164,8 +169,8 @@ class LXMLTreeBuilderForXML(TreeBuilder): |
946 | ) |
947 | return parser |
948 | |
949 | - def __init__(self, parser:Optional[Any]=None, |
950 | - empty_element_tags:Optional[Set[str]]=None, **kwargs): |
951 | + def __init__(self, parser:Optional[etree.XMLParser]=None, |
952 | + empty_element_tags:Optional[Set[str]]=None, **kwargs:Any): |
953 | # TODO: Issue a warning if parser is present but not a |
954 | # callable, since that means there's no way to create new |
955 | # parsers for different encodings. |
956 | @@ -270,7 +275,7 @@ class LXMLTreeBuilderForXML(TreeBuilder): |
957 | yield (detector.markup, encoding, document_declared_encoding, False) |
958 | |
959 | def feed(self, markup:_RawMarkup) -> None: |
960 | - io: IO |
961 | + io: Union[BytesIO, StringIO] |
962 | if isinstance(markup, bytes): |
963 | io = BytesIO(markup) |
964 | elif isinstance(markup, str): |
965 | @@ -298,14 +303,25 @@ class LXMLTreeBuilderForXML(TreeBuilder): |
966 | def close(self) -> None: |
967 | self.nsmaps = [self.DEFAULT_NSMAPS_INVERTED] |
968 | |
969 | - def start(self, name:str, attrs:Dict[str, str], nsmap:_NamespaceMapping={}): |
970 | + def start(self, tag:str|bytes, attrs:Dict[str|bytes, str|bytes], nsmap:_NamespaceMapping={}) -> None: |
971 | # This is called by lxml code as a result of calling |
972 | # BeautifulSoup.feed(), and we know self.soup is set by the time feed() |
973 | # is called. |
974 | assert self.soup is not None |
975 | - |
976 | - # Make sure attrs is a mutable dict--lxml may send an immutable dictproxy. |
977 | - attrs = dict(attrs) |
978 | + assert isinstance(tag, str) |
979 | + |
980 | + # We need to recreate the attribute dict for three |
981 | + # reasons. First, for type checking, so we can assert there |
982 | + # are no bytestrings in the keys or values. Second, because we |
983 | + # need a mutable dict--lxml might send us an immutable |
984 | + # dictproxy. Third, so we can handle namespaced attribute |
985 | + # names by converting the keys to NamespacedAttributes. |
986 | + new_attrs:Dict[Union[str,NamespacedAttribute], str] = {} |
987 | + for k, v in attrs.items(): |
988 | + assert isinstance(k, str) |
989 | + assert isinstance(v, str) |
990 | + new_attrs[k] = v |
991 | + |
992 | nsprefix: Optional[_NamespacePrefix] = None |
993 | namespace: Optional[_NamespaceURL] = None |
994 | # Invert each namespace map as it comes in. |
995 | @@ -340,30 +356,28 @@ class LXMLTreeBuilderForXML(TreeBuilder): |
996 | |
997 | # Also treat the namespace mapping as a set of attributes on the |
998 | # tag, so we can recreate it later. |
999 | - attrs = attrs.copy() |
1000 | for prefix, namespace in list(nsmap.items()): |
1001 | attribute = NamespacedAttribute( |
1002 | "xmlns", prefix, "http://www.w3.org/2000/xmlns/") |
1003 | - attrs[attribute] = namespace |
1004 | + new_attrs[attribute] = namespace |
1005 | |
1006 | # Namespaces are in play. Find any attributes that came in |
1007 | # from lxml with namespaces attached to their names, and |
1008 | # turn then into NamespacedAttribute objects. |
1009 | - new_attrs:Dict[Union[str,NamespacedAttribute], str] = {} |
1010 | - for attr, value in list(attrs.items()): |
1011 | + final_attrs:Dict[Union[str,NamespacedAttribute], str] = {} |
1012 | + for attr, value in list(new_attrs.items()): |
1013 | namespace, attr = self._getNsTag(attr) |
1014 | if namespace is None: |
1015 | - new_attrs[attr] = value |
1016 | + final_attrs[attr] = value |
1017 | else: |
1018 | nsprefix = self._prefix_for_namespace(namespace) |
1019 | attr = NamespacedAttribute(nsprefix, attr, namespace) |
1020 | - new_attrs[attr] = value |
1021 | - attrs = new_attrs |
1022 | + final_attrs[attr] = value |
1023 | |
1024 | - namespace, name = self._getNsTag(name) |
1025 | + namespace, tag = self._getNsTag(tag) |
1026 | nsprefix = self._prefix_for_namespace(namespace) |
1027 | self.soup.handle_starttag( |
1028 | - name, namespace, nsprefix, attrs, |
1029 | + tag, namespace, nsprefix, final_attrs, |
1030 | namespaces=self.active_namespace_prefixes[-1] |
1031 | ) |
1032 | |
1033 | @@ -376,8 +390,9 @@ class LXMLTreeBuilderForXML(TreeBuilder): |
1034 | return inverted_nsmap[namespace] |
1035 | return None |
1036 | |
1037 | - def end(self, name:str) -> None: |
1038 | + def end(self, name:str|bytes) -> None: |
1039 | assert self.soup is not None |
1040 | + assert isinstance(name, str) |
1041 | self.soup.endData() |
1042 | completed_tag = self.soup.tagStack[-1] |
1043 | namespace, name = self._getNsTag(name) |
1044 | @@ -406,9 +421,10 @@ class LXMLTreeBuilderForXML(TreeBuilder): |
1045 | self.soup.handle_data(data) |
1046 | self.soup.endData(self.processing_instruction_class) |
1047 | |
1048 | - def data(self, content:str) -> None: |
1049 | + def data(self, data:str|bytes) -> None: |
1050 | assert self.soup is not None |
1051 | - self.soup.handle_data(content) |
1052 | + assert isinstance(data, str) |
1053 | + self.soup.handle_data(data) |
1054 | |
1055 | def doctype(self, name:str, pubid:str, system:str) -> None: |
1056 | assert self.soup is not None |
1057 | @@ -416,11 +432,12 @@ class LXMLTreeBuilderForXML(TreeBuilder): |
1058 | doctype = Doctype.for_name_and_ids(name, pubid, system) |
1059 | self.soup.object_was_parsed(doctype) |
1060 | |
1061 | - def comment(self, content:str) -> None: |
1062 | + def comment(self, text:str|bytes) -> None: |
1063 | "Handle comments as Comment objects." |
1064 | assert self.soup is not None |
1065 | + assert isinstance(text, str) |
1066 | self.soup.endData() |
1067 | - self.soup.handle_data(content) |
1068 | + self.soup.handle_data(text) |
1069 | self.soup.endData(Comment) |
1070 | |
1071 | def test_fragment_to_document(self, fragment:str) -> str: |
1072 | @@ -436,7 +453,7 @@ class LXMLTreeBuilder(HTMLTreeBuilder, LXMLTreeBuilderForXML): |
1073 | features: Iterable[str] = list(ALTERNATE_NAMES) + [NAME, HTML, FAST, PERMISSIVE] |
1074 | is_xml: bool = False |
1075 | |
1076 | - def default_parser(self, encoding:Optional[_Encoding]) -> Type[Any]: |
1077 | + def default_parser(self, encoding:Optional[_Encoding]) -> _ParserOrParserClass: |
1078 | return etree.HTMLParser |
1079 | |
1080 | def feed(self, markup:_RawMarkup) -> None: |
1081 | diff --git a/bs4/dammit.py b/bs4/dammit.py |
1082 | index 8c1b631..4e950a2 100644 |
1083 | --- a/bs4/dammit.py |
1084 | +++ b/bs4/dammit.py |
1085 | @@ -260,14 +260,14 @@ class EntitySubstitution(object): |
1086 | AMPERSAND_OR_BRACKET: Pattern[str] = re.compile("([<>&])") |
1087 | |
1088 | @classmethod |
1089 | - def _substitute_html_entity(cls, matchobj:re.Match[str]) -> str: |
1090 | + def _substitute_html_entity(cls, matchobj:re.Match) -> str: |
1091 | """Used with a regular expression to substitute the |
1092 | appropriate HTML entity for a special character string.""" |
1093 | entity = cls.CHARACTER_TO_HTML_ENTITY.get(matchobj.group(0)) |
1094 | return "&%s;" % entity |
1095 | |
1096 | @classmethod |
1097 | - def _substitute_xml_entity(cls, matchobj:re.Match[str]) -> str: |
1098 | + def _substitute_xml_entity(cls, matchobj:re.Match) -> str: |
1099 | """Used with a regular expression to substitute the |
1100 | appropriate XML entity for a special character string.""" |
1101 | entity = cls.CHARACTER_TO_XML_ENTITY[matchobj.group(0)] |
1102 | @@ -752,7 +752,7 @@ class UnicodeDammit: |
1103 | |
1104 | log: Logger #: :meta private: |
1105 | |
1106 | - def _sub_ms_char(self, match:re.Match[bytes]) -> bytes: |
1107 | + def _sub_ms_char(self, match:re.Match) -> bytes: |
1108 | """Changes a MS smart quote character to an XML or HTML |
1109 | entity, or an ASCII character. |
1110 | |
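The `re.Match` annotations touched above all decorate the same pattern: `re.sub` with a callback that maps each matched character to its replacement. A self-contained sketch of that technique (the entity table here is abbreviated; the real `CHARACTER_TO_HTML_ENTITY` covers many more characters):

```python
import re

# Abbreviated stand-in for EntitySubstitution's lookup table.
CHARACTER_TO_ENTITY = {"<": "lt", ">": "gt", "&": "amp"}
AMPERSAND_OR_BRACKET = re.compile("([<>&])")

def substitute_entities(s: str) -> str:
    def _substitute(matchobj: re.Match) -> str:
        # Called once per match; group(0) is the special character itself.
        return "&%s;" % CHARACTER_TO_ENTITY[matchobj.group(0)]
    return AMPERSAND_OR_BRACKET.sub(_substitute, s)
```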
1111 | diff --git a/bs4/element.py b/bs4/element.py |
1112 | index f4ab89c..22115a2 100644 |
1113 | --- a/bs4/element.py |
1114 | +++ b/bs4/element.py |
1115 | @@ -39,11 +39,13 @@ from typing import ( |
1116 | Union, |
1117 | cast, |
1118 | ) |
1119 | -from typing_extensions import Self |
1120 | +from typing_extensions import ( |
1121 | + Self, |
1122 | + TypeAlias, |
1123 | +) |
1124 | if TYPE_CHECKING: |
1125 | from bs4 import BeautifulSoup |
1126 | from bs4.builder import TreeBuilder |
1127 | - from bs4.dammit import _Encoding |
1128 | from bs4.filter import ElementFilter |
1129 | from bs4.formatter import ( |
1130 | _EntitySubstitutionFunction, |
1131 | @@ -52,12 +54,16 @@ if TYPE_CHECKING: |
1132 | from bs4._typing import ( |
1133 | _AttributeValue, |
1134 | _AttributeValues, |
1135 | + _Encoding, |
1136 | + _RawOrProcessedAttributeValues, |
1137 | _StrainableElement, |
1138 | _StrainableAttribute, |
1139 | _StrainableAttributes, |
1140 | _StrainableString, |
1141 | ) |
1142 | |
1143 | +_OneOrMoreStringTypes:TypeAlias = Union[Type['NavigableString'], Iterable[Type['NavigableString']]] |
1144 | + |
1145 | # Deprecated module-level attributes. |
1146 | # See https://peps.python.org/pep-0562/ |
1147 | _deprecated_names = dict( |
1148 | @@ -66,7 +72,7 @@ _deprecated_names = dict( |
1149 | #: :meta private: |
1150 | _deprecated_whitespace_re: Pattern[str] = re.compile(r"\s+") |
1151 | |
1152 | -def __getattr__(name): |
1153 | +def __getattr__(name:str) -> Any: |
1154 | if name in _deprecated_names: |
1155 | message = _deprecated_names[name] |
1156 | warnings.warn( |
1157 | @@ -124,7 +130,8 @@ class NamespacedAttribute(str): |
1158 | namespace: Optional[str] |
1159 | |
1160 | def __new__(cls, prefix:Optional[str], |
1161 | - name:Optional[str]=None, namespace:Optional[str]=None): |
1162 | + name:Optional[str]=None, |
1163 | + namespace:Optional[str]=None) -> Self: |
1164 | if not name: |
1165 | # This is the default namespace. Its name "has no value" |
1166 | # per https://www.w3.org/TR/xml-names/#defaulting |
1167 | @@ -223,7 +230,7 @@ class ContentMetaAttributeValue(AttributeValueWithCharsetSubstitution): |
1168 | """ |
1169 | if eventual_encoding in PYTHON_SPECIFIC_ENCODINGS: |
1170 | return self.CHARSET_RE.sub('', self.original_value) |
1171 | - def rewrite(match): |
1172 | + def rewrite(match:re.Match[str]) -> str: |
1173 | return match.group(1) + eventual_encoding |
1174 | return self.CHARSET_RE.sub(rewrite, self.original_value) |
1175 | |
1176 | @@ -370,7 +377,7 @@ class PageElement(object): |
1177 | "previousSibling", "previous_sibling", "4.0.0" |
1178 | ) |
1179 | |
1180 | - def __deepcopy__(self, memo:Dict, recursive:bool=False) -> Self: |
1181 | + def __deepcopy__(self, memo:Dict[Any,Any], recursive:bool=False) -> Self: |
1182 | raise NotImplementedError() |
1183 | |
1184 | def __copy__(self) -> Self: |
1185 | @@ -528,6 +535,13 @@ class PageElement(object): |
1186 | ) -> Optional[PageElement]: |
1187 | """Finds the last element beneath this object to be parsed. |
1188 | |
1189 | + Special note to help you figure things out if your type |
1190 | + checking is tripped up by the fact that this method returns |
1191 | + Optional[PageElement] instead of PageElement: the only time |
1192 | + this method returns None is if `accept_self` is False and the |
1193 | + `PageElement` has no children--either it's a NavigableString |
1194 | + or an empty Tag. |
1195 | + |
1196 | :param is_initialized: Has `PageElement.setup` been called on |
1197 | this `PageElement` yet? |
1198 | |
1199 | @@ -878,9 +892,10 @@ class PageElement(object): |
1200 | |
1201 | def _find_one( |
1202 | self, |
1203 | - # TODO: "There is no syntax to indicate optional or keyword |
1204 | - # arguments; such function types are rarely used as |
1205 | - # callback types." - So, not sure how to get more specific here. |
1206 | + # TODO-TYPING: "There is no syntax to indicate optional or |
1207 | + # keyword arguments; such function types are rarely used |
1208 | + # as callback types." - So, not sure how to get more |
1209 | + # specific here. |
1210 | method:Callable, |
1211 | name:Optional[_StrainableElement], |
1212 | attrs:_StrainableAttributes, |
1213 | @@ -955,7 +970,7 @@ class PageElement(object): |
1214 | You can pass in your own technique for iterating over the tree, and your own |
1215 | technique for matching items. |
1216 | """ |
1217 | - results:ResultSet = ResultSet(matcher) |
1218 | + results:ResultSet[PageElement] = ResultSet(matcher) |
1219 | while True: |
1220 | try: |
1221 | i = next(generator) |
1222 | @@ -1029,27 +1044,27 @@ class PageElement(object): |
1223 | return getattr(self, '_decomposed', False) or False |
1224 | |
1225 | @_deprecated("next_elements", "4.0.0") |
1226 | - def nextGenerator(self): |
1227 | + def nextGenerator(self) -> Iterator[PageElement]: |
1228 | ":meta private:" |
1229 | return self.next_elements |
1230 | |
1231 | @_deprecated("next_siblings", "4.0.0") |
1232 | - def nextSiblingGenerator(self): |
1233 | + def nextSiblingGenerator(self) -> Iterator[PageElement]: |
1234 | ":meta private:" |
1235 | return self.next_siblings |
1236 | |
1237 | @_deprecated("previous_elements", "4.0.0") |
1238 | - def previousGenerator(self): |
1239 | + def previousGenerator(self) -> Iterator[PageElement]: |
1240 | ":meta private:" |
1241 | return self.previous_elements |
1242 | |
1243 | @_deprecated("previous_siblings", "4.0.0") |
1244 | - def previousSiblingGenerator(self): |
1245 | + def previousSiblingGenerator(self) -> Iterator[PageElement]: |
1246 | ":meta private:" |
1247 | return self.previous_siblings |
1248 | |
1249 | @_deprecated("parents", "4.0.0") |
1250 | - def parentGenerator(self): |
1251 | + def parentGenerator(self) -> Iterator[PageElement]: |
1252 | ":meta private:" |
1253 | return self.parents |
1254 | |
1255 | @@ -1087,7 +1102,7 @@ class NavigableString(str, PageElement): |
1256 | u.setup() |
1257 | return u |
1258 | |
1259 | - def __deepcopy__(self, memo:Dict, recursive:bool=False) -> Self: |
1260 | + def __deepcopy__(self, memo:Dict[Any, Any], recursive:bool=False) -> Self: |
1261 | """A copy of a NavigableString has the same contents and class |
1262 | as the original, but it is not connected to the parse tree. |
1263 | |
1264 | @@ -1097,7 +1112,7 @@ class NavigableString(str, PageElement): |
1265 | """ |
1266 | return type(self)(self) |
1267 | |
1268 | - def __getnewargs__(self): |
1269 | + def __getnewargs__(self) -> Tuple[str]: |
1270 | return (str(self),) |
1271 | |
1272 | @property |
1273 | @@ -1134,14 +1149,14 @@ class NavigableString(str, PageElement): |
1274 | return None |
1275 | |
1276 | @name.setter |
1277 | - def name(self, name:str): |
1278 | + def name(self, name:str) -> None: |
1279 | """Prevent NavigableString.name from ever being set. |
1280 | |
1281 | :meta private: |
1282 | """ |
1283 | raise AttributeError("A NavigableString cannot be given a name.") |
1284 | |
1285 | - def _all_strings(self, strip=False, types:Iterable[Type[NavigableString]]=PageElement.default) -> Iterator[str]: |
1286 | + def _all_strings(self, strip:bool=False, types:_OneOrMoreStringTypes=PageElement.default) -> Iterator[str]: |
1287 | """Yield all strings of certain classes, possibly stripping them. |
1288 | |
1289 | This makes it easy for NavigableString to implement methods |
1290 | @@ -1382,7 +1397,7 @@ class Tag(PageElement): |
1291 | name:Optional[str]=None, |
1292 | namespace:Optional[str]=None, |
1293 | prefix:Optional[str]=None, |
1294 | - attrs:Optional[_AttributeValues]=None, |
1295 | + attrs:Optional[_RawOrProcessedAttributeValues]=None, |
1296 | parent:Optional[Union[BeautifulSoup, Tag]]=None, |
1297 | previous:Optional[PageElement]=None, |
1298 | is_xml:Optional[bool]=None, |
1299 | @@ -1485,7 +1500,7 @@ class Tag(PageElement): |
1300 | #: :meta private: |
1301 | parserClass = _deprecated_alias("parserClass", "parser_class", "4.0.0") |
1302 | |
1303 | - def __deepcopy__(self, memo:dict, recursive:bool=True) -> Self: |
1304 | + def __deepcopy__(self, memo:Dict[Any, Any], recursive:bool=True) -> Self: |
1305 | """A deepcopy of a Tag is a new Tag, unconnected to the parse tree. |
1306 | Its contents are a copy of the old Tag's contents. |
1307 | """ |
1308 | @@ -1552,9 +1567,9 @@ class Tag(PageElement): |
1309 | return len(self.contents) == 0 and self.can_be_empty_element is True |
1310 | |
1311 | @_deprecated("is_empty_element", "4.0.0") |
1312 | - def isSelfClosing(self): |
1313 | + def isSelfClosing(self) -> bool: |
1314 | ": :meta private:" |
1315 | - return is_empty_element() |
1316 | + return self.is_empty_element |
1317 | |
1318 | @property |
1319 | def string(self) -> Optional[str]: |
1320 | @@ -1592,7 +1607,7 @@ class Tag(PageElement): |
1321 | |
1322 | #: :meta private: |
1323 | MAIN_CONTENT_STRING_TYPES = {NavigableString, CData} |
1324 | - def _all_strings(self, strip:bool=False, types:Iterable[Type[NavigableString]]=PageElement.default) -> Iterator[str]: |
1325 | + def _all_strings(self, strip:bool=False, types:_OneOrMoreStringTypes=PageElement.default) -> Iterator[str]: |
1326 | """Yield all strings of certain classes, possibly stripping them. |
1327 | |
1328 | :param strip: If True, all strings will be stripped before being |
1329 | @@ -1739,7 +1754,7 @@ class Tag(PageElement): |
1330 | replace_with_children = unwrap |
1331 | |
1332 | @_deprecated("unwrap", "4.0.0") |
1333 | - def replaceWithChildren(self): |
1334 | + def replaceWithChildren(self) -> PageElement: |
1335 | ": :meta private:" |
1336 | return self.unwrap() |
1337 | |
1338 | @@ -1914,11 +1929,21 @@ class Tag(PageElement): |
1339 | "Deleting tag[key] deletes all 'key' attributes for the tag." |
1340 | self.attrs.pop(key, None) |
1341 | |
1342 | - def __call__(self, *args, **kwargs) -> ResultSet[PageElement]: |
1343 | + def __call__(self, |
1344 | + name:Optional[_StrainableElement]=None, |
1345 | + attrs:_StrainableAttributes={}, |
1346 | + recursive:bool=True, |
1347 | + string:Optional[_StrainableString]=None, |
1348 | + limit:Optional[int]=None, |
1349 | + _stacklevel:int=2, |
1350 | + **kwargs:_StrainableAttribute |
1351 | + )-> ResultSet[PageElement]: |
1352 | """Calling a Tag like a function is the same as calling its |
1353 | find_all() method. Eg. tag('a') returns a list of all the A tags |
1354 | found within this tag.""" |
1355 | - return self.find_all(*args, **kwargs) |
1356 | + return self.find_all( |
1357 | + name, attrs, recursive, string, limit, _stacklevel, **kwargs |
1358 | + ) |
1359 | |
1360 | def __getattr__(self, subtag:str) -> Optional[Tag]: |
1361 | """Calling tag.subtag is the same as calling tag.find(name="subtag")""" |
1362 | @@ -2002,7 +2027,7 @@ class Tag(PageElement): |
1363 | def decode(self, indent_level:Optional[int]=None, |
1364 | eventual_encoding:_Encoding=DEFAULT_OUTPUT_ENCODING, |
1365 | formatter:_FormatterOrName="minimal", |
1366 | - iterator:Optional[Iterable]=None) -> str: |
1367 | + iterator:Optional[Iterator[PageElement]]=None) -> str: |
1368 | """Render this `Tag` and its contents as a Unicode string. |
1369 | |
1370 | :param indent_level: Each line of the rendering will be |
1371 | @@ -2122,9 +2147,10 @@ class Tag(PageElement): |
1372 | EMPTY_ELEMENT_EVENT = _TreeTraversalEvent() #: :meta private: |
1373 | STRING_ELEMENT_EVENT = _TreeTraversalEvent() #: :meta private: |
1374 | |
1375 | - def _event_stream(self, iterator=None) -> Iterator[ |
1376 | - Tuple[_TreeTraversalEvent, PageElement] |
1377 | - ]: |
1378 | + def _event_stream( |
1379 | + self, |
1380 | + iterator:Optional[Iterator[PageElement]]=None |
1381 | + ) -> Iterator[Tuple[_TreeTraversalEvent, PageElement]]: |
1382 | """Yield a sequence of events that can be used to reconstruct the DOM |
1383 | for this element. |
1384 | |
1385 | @@ -2316,8 +2342,8 @@ class Tag(PageElement): |
1386 | return contents.encode(encoding) |
1387 | |
1388 | @_deprecated("encode_contents", "4.0.0") |
1389 | - def renderContents(self, encoding=DEFAULT_OUTPUT_ENCODING, |
1390 | - prettyPrint=False, indentLevel=0): |
1391 | + def renderContents(self, encoding:str=DEFAULT_OUTPUT_ENCODING, |
1392 | + prettyPrint:bool=False, indentLevel:Optional[int]=0) -> bytes: |
1393 | """Deprecated method for BS3 compatibility. |
1394 | |
1395 | :meta private: |
1396 | @@ -2436,7 +2462,7 @@ class Tag(PageElement): |
1397 | def select_one(self, |
1398 | selector:str, |
1399 | namespaces:Optional[Dict[str, str]]=None, |
1400 | - **kwargs) -> Optional[Tag]: |
1401 | + **kwargs:Any) -> Optional[Tag]: |
1402 | """Perform a CSS selection operation on the current element. |
1403 | |
1404 | :param selector: A CSS selector. |
1405 | @@ -2452,7 +2478,7 @@ class Tag(PageElement): |
1406 | return self.css.select_one(selector, namespaces, **kwargs) |
1407 | |
1408 | def select(self, selector:str, namespaces:Optional[Dict[str, str]]=None, |
1409 | - limit:int=0, **kwargs) -> ResultSet[Tag]: |
1410 | + limit:int=0, **kwargs:Any) -> ResultSet[Tag]: |
1411 | """Perform a CSS selection operation on the current element. |
1412 | |
1413 | This uses the SoupSieve library. |
1414 | @@ -2478,7 +2504,7 @@ class Tag(PageElement): |
1415 | |
1416 | # Old names for backwards compatibility |
1417 | @_deprecated("children", "4.0.0") |
1418 | - def childGenerator(self): |
1419 | + def childGenerator(self) -> Iterator[PageElement]: |
1420 | """Deprecated generator. |
1421 | |
1422 | :meta private: |
1423 | @@ -2486,7 +2512,7 @@ class Tag(PageElement): |
1424 | return self.children |
1425 | |
1426 | @_deprecated("descendants", "4.0.0") |
1427 | - def recursiveChildGenerator(self): |
1428 | + def recursiveChildGenerator(self) -> Iterator[PageElement]: |
1429 | """Deprecated generator. |
1430 | |
1431 | :meta private: |
1432 | @@ -2494,7 +2520,7 @@ class Tag(PageElement): |
1433 | return self.descendants |
1434 | |
1435 | @_deprecated("has_attr", "4.0.0") |
1436 | - def has_key(self, key): |
1437 | + def has_key(self, key:str) -> bool: |
1438 | """Deprecated method. This was kind of misleading because has_key() |
1439 | (attributes) was different from __in__ (contents). |
1440 | |
1441 | @@ -2516,7 +2542,7 @@ class ResultSet(List[_PageElementT], Generic[_PageElementT]): |
1442 | super(ResultSet, self).__init__(result) |
1443 | self.source = source |
1444 | |
1445 | - def __getattr__(self, key:str): |
1446 | + def __getattr__(self, key:str) -> None: |
1447 | """Raise a helpful exception to explain a common code fix.""" |
1448 | raise AttributeError( |
1449 | f"""ResultSet object has no attribute "{key}". You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?""" |
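The `ResultSet.__getattr__` hunk above turns a very common mistake (treating a list of results like a single element) into a descriptive error. Here is a minimal stdlib-only sketch of that design; `DemoResultSet` is an illustrative stand-in, not the bs4 class:

```python
# A list subclass whose __getattr__ converts attribute access on the
# collection into a helpful AttributeError, mirroring the pattern in
# the ResultSet hunk above. Names here are hypothetical stand-ins.
from typing import List, TypeVar

T = TypeVar("T")

class DemoResultSet(List[T]):
    def __getattr__(self, key: str) -> None:
        raise AttributeError(
            f'DemoResultSet object has no attribute "{key}". '
            "You're probably treating a list of elements like a single element."
        )

results = DemoResultSet(["<b>one</b>", "<b>two</b>"])
try:
    # A single element would have .text; the list of elements does not.
    results.text
except AttributeError as e:
    message = str(e)

print(message)
```

Returning `None` as the annotated type (as the diff does) is honest typing: the method never returns, it always raises.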
1450 | diff --git a/bs4/filter.py b/bs4/filter.py |
1451 | index 74e26d9..7632639 100644 |
1452 | --- a/bs4/filter.py |
1453 | +++ b/bs4/filter.py |
1454 | @@ -25,10 +25,10 @@ from bs4._deprecation import _deprecated |
1455 | from bs4.element import NavigableString, PageElement, Tag |
1456 | from bs4._typing import ( |
1457 | _AttributeValue, |
1458 | - _AttributeValues, |
1459 | _AllowStringCreationFunction, |
1460 | _AllowTagCreationFunction, |
1461 | _PageElementMatchFunction, |
1462 | + _RawAttributeValues, |
1463 | _TagMatchFunction, |
1464 | _StringMatchFunction, |
1465 | _StrainableElement, |
1466 | @@ -98,7 +98,7 @@ class ElementFilter(object): |
1467 | |
1468 | def allow_tag_creation( |
1469 | self, nsprefix:Optional[str], name:str, |
1470 | - attrs:Optional[_AttributeValues] |
1471 | + attrs:Optional[_RawAttributeValues] |
1472 | ) -> bool: |
1473 | """Based on the name and attributes of a tag, see whether this |
1474 | ElementFilter will allow a Tag object to even be created. |
1475 | @@ -372,9 +372,8 @@ class SoupStrainer(ElementFilter): |
1476 | # third-party regex library, whose pattern objects doesn't |
1477 | # derive from re.Pattern. |
1478 | # |
1479 | - # TODO-TYPING: Once we drop support for Python 3.7, we |
1480 | - # might be able to address this by defining an appropriate |
1481 | - # Protocol. |
1482 | + # TODO-TYPING: We should be able to bring in a Protocol |
1483 | + # from typing_extensions to handle this. |
1484 | yield rule_class(pattern=obj) |
1485 | elif hasattr(obj, '__iter__'): |
1486 | for o in obj: |
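The TODO-TYPING comment in the hunk above refers to accepting "regex-like" pattern objects that don't derive from `re.Pattern` (for example, patterns from the third-party `regex` library). A structural `Protocol` handles this; the sketch below is a hypothetical illustration of that approach, not bs4 code. `Protocol` lives in `typing` on Python 3.8+, and in `typing_extensions` for older interpreters, as the comment notes:

```python
# Hypothetical Protocol for "anything with a search() method", matching
# both re.Pattern objects and third-party pattern objects structurally.
import re
from typing import Optional, Protocol, runtime_checkable

@runtime_checkable
class PatternLike(Protocol):
    def search(self, string: str) -> Optional[object]:
        ...

def matches(pattern: PatternLike, text: str) -> bool:
    """Return True if the pattern-like object matches the text."""
    return pattern.search(text) is not None

compiled = re.compile(r"beautiful", re.IGNORECASE)
print(isinstance(compiled, PatternLike))   # structural check, no inheritance needed
print(matches(compiled, "Beautiful Soup"))
```

Because the check is structural, any object exposing a compatible `search()` satisfies the type without inheriting from `re.Pattern`.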
1487 | @@ -497,7 +496,7 @@ class SoupStrainer(ElementFilter): |
1488 | ) |
1489 | return this_attr_match |
1490 | |
1491 | - def allow_tag_creation(self, nsprefix:Optional[str], name:str, attrs:Optional[_AttributeValues]) -> bool: |
1492 | + def allow_tag_creation(self, nsprefix:Optional[str], name:str, attrs:Optional[_RawAttributeValues]) -> bool: |
1493 | """Based on the name and attributes of a tag, see whether this |
1494 | SoupStrainer will allow a Tag object to even be created. |
1495 | |
1496 | @@ -586,7 +585,7 @@ class SoupStrainer(ElementFilter): |
1497 | return False |
1498 | |
1499 | @_deprecated("allow_tag_creation", "4.13.0") |
1500 | - def search_tag(self, name:str, attrs:Optional[_AttributeValues]) -> bool: |
1501 | + def search_tag(self, name:str, attrs:Optional[_RawAttributeValues]) -> bool: |
1502 | """A less elegant version of allow_tag_creation().""" |
1503 | ":meta private:" |
1504 | return self.allow_tag_creation(None, name, attrs) |
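The `filter.py` hunks above re-type `allow_tag_creation()` to take raw (pre-processing) attribute values. Conceptually, this hook lets a filter veto creating a `Tag` object at parse time based only on the tag's name and raw attributes. The following stdlib-only sketch illustrates that idea with simplified stand-in names; it is not the bs4 implementation:

```python
# Simplified sketch of a parse-time gate: given a prospective tag's
# name and raw attribute dict, decide whether a Tag object should be
# created at all. DemoStrainer is an illustrative stand-in class.
from typing import Dict, Optional

class DemoStrainer:
    def __init__(self, name: Optional[str] = None,
                 attrs: Optional[Dict[str, str]] = None):
        self.name = name
        self.attrs = attrs or {}

    def allow_tag_creation(self, nsprefix: Optional[str], name: str,
                           attrs: Optional[Dict[str, str]]) -> bool:
        attrs = attrs or {}
        if self.name is not None and name != self.name:
            return False
        # Every attribute rule must match the tag's raw attribute values.
        return all(attrs.get(k) == v for k, v in self.attrs.items())

strainer = DemoStrainer(name="a", attrs={"class": "link"})
allowed = strainer.allow_tag_creation(None, "a", {"class": "link", "href": "/"})
vetoed = strainer.allow_tag_creation(None, "b", {"class": "link"})
print(allowed, vetoed)
```

Deciding on raw values is what makes this usable during parsing, before attribute values have been post-processed into lists or typed wrappers.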
1505 | diff --git a/bs4/tests/__init__.py b/bs4/tests/__init__.py |
1506 | index 3ef999d..8415645 100644 |
1507 | --- a/bs4/tests/__init__.py |
1508 | +++ b/bs4/tests/__init__.py |
1509 | @@ -15,6 +15,7 @@ from bs4.element import ( |
1510 | Comment, |
1511 | ContentMetaAttributeValue, |
1512 | Doctype, |
1513 | + PageElement, |
1514 | PYTHON_SPECIFIC_ENCODINGS, |
1515 | Script, |
1516 | Stylesheet, |
1517 | @@ -25,8 +26,21 @@ from bs4.builder import ( |
1518 | DetectsXMLParsedAsHTML, |
1519 | XMLParsedAsHTMLWarning, |
1520 | ) |
1521 | +from bs4._typing import ( |
1522 | + _IncomingMarkup |
1523 | +) |
1524 | + |
1525 | +from bs4.builder import TreeBuilder |
1526 | from bs4.builder._htmlparser import HTMLParserTreeBuilder |
1527 | -default_builder = HTMLParserTreeBuilder |
1528 | + |
1529 | +from typing import ( |
1530 | + Any, |
1531 | + Iterable, |
1532 | + List, |
1533 | + Optional, |
1534 | + Tuple, |
1535 | + Type, |
1536 | +) |
1537 | |
1538 | # Some tests depend on specific third-party libraries. We use |
1539 | # @pytest.mark.skipIf on the following conditionals to skip them |
1540 | @@ -51,7 +65,9 @@ except ImportError: |
1541 | LXML_PRESENT = False |
1542 | LXML_VERSION = (0,) |
1543 | |
1544 | -BAD_DOCUMENT = """A bare string |
1545 | +default_builder:Type[TreeBuilder] = HTMLParserTreeBuilder |
1546 | + |
1547 | +BAD_DOCUMENT:str = """A bare string |
1548 | <!DOCTYPE xsl:stylesheet SYSTEM "htmlent.dtd"> |
1549 | <!DOCTYPE xsl:stylesheet PUBLIC "htmlent.dtd"> |
1550 | <div><![CDATA[A CDATA section where it doesn't belong]]></div> |
1551 | @@ -91,28 +107,30 @@ BAD_DOCUMENT = """A bare string |
1552 | class SoupTest(object): |
1553 | |
1554 | @property |
1555 | - def default_builder(self): |
1556 | + def default_builder(self) -> Type[TreeBuilder]: |
1557 | return default_builder |
1558 | |
1559 | - def soup(self, markup, **kwargs): |
1560 | + def soup(self, markup:_IncomingMarkup, **kwargs:Any) -> BeautifulSoup: |
1561 | """Build a Beautiful Soup object from markup.""" |
1562 | builder = kwargs.pop('builder', self.default_builder) |
1563 | return BeautifulSoup(markup, builder=builder, **kwargs) |
1564 | |
1565 | - def document_for(self, markup, **kwargs): |
1566 | + def document_for(self, markup:str, **kwargs:Any) -> str: |
1567 | """Turn an HTML fragment into a document. |
1568 | |
1569 | The details depend on the builder. |
1570 | """ |
1571 | return self.default_builder(**kwargs).test_fragment_to_document(markup) |
1572 | |
1573 | - def assert_soup(self, to_parse, compare_parsed_to=None): |
1574 | + def assert_soup(self, to_parse:_IncomingMarkup, |
1575 | + compare_parsed_to:Optional[str]=None) -> None: |
1576 | """Parse some markup using Beautiful Soup and verify that |
1577 | the output markup is as expected. |
1578 | """ |
1579 | builder = self.default_builder |
1580 | obj = BeautifulSoup(to_parse, builder=builder) |
1581 | if compare_parsed_to is None: |
1582 | + assert isinstance(to_parse, str) |
1583 | compare_parsed_to = to_parse |
1584 | |
1585 | # Verify that the documents come out the same. |
1586 | @@ -131,7 +149,7 @@ class SoupTest(object): |
1587 | |
1588 | assertSoupEquals = assert_soup |
1589 | |
1590 | - def assertConnectedness(self, element): |
1591 | + def assertConnectedness(self, element:Tag) -> None: |
1592 | """Ensure that next_element and previous_element are properly |
1593 | set for all descendants of the given element. |
1594 | """ |
1595 | @@ -142,7 +160,7 @@ class SoupTest(object): |
1596 | assert earlier == e.previous_element |
1597 | earlier = e |
1598 | |
1599 | - def linkage_validator(self, el, _recursive_call=False): |
1600 | + def linkage_validator(self, el:Tag, _recursive_call:bool=False) -> Optional[PageElement]: |
1601 | """Ensure proper linkage throughout the document.""" |
1602 | descendant = None |
1603 | # Document element should have no previous element or previous sibling. |
1604 | @@ -209,6 +227,7 @@ class SoupTest(object): |
1605 | |
1606 | if isinstance(child, Tag) and child.contents: |
1607 | descendant = self.linkage_validator(child, True) |
1608 | + assert descendant is not None |
1609 | # A bubbled up descendant should have no next siblings |
1610 | assert descendant.next_sibling is None,\ |
1611 | "Bad next_sibling\nNODE: {}\nNEXT {}\nEXPECTED {}".format( |
1612 | @@ -234,7 +253,7 @@ class SoupTest(object): |
1613 | child = el |
1614 | |
1615 | if not _recursive_call and child is not None: |
1616 | - target = el |
1617 | + target:Optional[Tag] = el |
1618 | while True: |
1619 | if target is None: |
1620 | assert child.next_element is None, \ |
1621 | @@ -256,7 +275,7 @@ class SoupTest(object): |
1622 | # Return the child to the recursive caller |
1623 | return child |
1624 | |
1625 | - def assert_selects(self, tags, should_match): |
1626 | + def assert_selects(self, tags:Iterable[Tag], should_match:Iterable[str]) -> None: |
1627 | """Make sure that the given tags have the correct text. |
1628 | |
1629 | This is used in tests that define a bunch of tags, each |
1630 | @@ -265,7 +284,7 @@ class SoupTest(object): |
1631 | """ |
1632 | assert [tag.string for tag in tags] == should_match |
1633 | |
1634 | - def assert_selects_ids(self, tags, should_match): |
1635 | + def assert_selects_ids(self, tags:Iterable[Tag], should_match:Iterable[str]) -> None: |
1636 | """Make sure that the given tags have the correct IDs. |
1637 | |
1638 | This is used in tests that define a bunch of tags, each |
1639 | @@ -275,7 +294,7 @@ class SoupTest(object): |
1640 | assert [tag['id'] for tag in tags] == should_match |
1641 | |
1642 | |
1643 | -class TreeBuilderSmokeTest(object): |
1644 | +class TreeBuilderSmokeTest(SoupTest): |
1645 | # Tests that are common to HTML and XML tree builders. |
1646 | |
1647 | @pytest.mark.parametrize( |
1648 | @@ -352,7 +371,7 @@ class HTMLTreeBuilderSmokeTest(TreeBuilderSmokeTest): |
1649 | assert loaded.__class__ == BeautifulSoup |
1650 | assert loaded.decode() == tree.decode() |
1651 | |
1652 | - def assertDoctypeHandled(self, doctype_fragment): |
1653 | + def assertDoctypeHandled(self, doctype_fragment:str) -> None: |
1654 | """Assert that a given doctype string is handled correctly.""" |
1655 | doctype_str, soup = self._document_with_doctype(doctype_fragment) |
1656 | |
1657 | @@ -366,7 +385,7 @@ class HTMLTreeBuilderSmokeTest(TreeBuilderSmokeTest): |
1658 | # parse tree and that the rest of the document parsed. |
1659 | assert soup.p.contents[0] == 'foo' |
1660 | |
1661 | - def _document_with_doctype(self, doctype_fragment, doctype_string="DOCTYPE"): |
1662 | + def _document_with_doctype(self, doctype_fragment:str, doctype_string:str="DOCTYPE") -> Tuple[bytes, BeautifulSoup]: |
1663 | """Generate and parse a document with the given doctype.""" |
1664 | doctype = '<!%s %s>' % (doctype_string, doctype_fragment) |
1665 | markup = doctype + '\n<p>foo</p>' |
1666 | diff --git a/bs4/tests/test_builder_registry.py b/bs4/tests/test_builder_registry.py |
1667 | index 9a9ce1f..da10e5f 100644 |
1668 | --- a/bs4/tests/test_builder_registry.py |
1669 | +++ b/bs4/tests/test_builder_registry.py |
1670 | @@ -2,10 +2,12 @@ |
1671 | |
1672 | import pytest |
1673 | import warnings |
1674 | +from typing import Type |
1675 | |
1676 | from bs4 import BeautifulSoup |
1677 | from bs4.builder import ( |
1678 | builder_registry as registry, |
1679 | + TreeBuilder, |
1680 | TreeBuilderRegistry, |
1681 | ) |
1682 | from bs4.builder._htmlparser import HTMLParserTreeBuilder |
1683 | @@ -81,7 +83,7 @@ class TestRegistry(object): |
1684 | def setup_method(self): |
1685 | self.registry = TreeBuilderRegistry() |
1686 | |
1687 | - def builder_for_features(self, *feature_list): |
1688 | + def builder_for_features(self, *feature_list:str) -> Type[TreeBuilder]: |
1689 | cls = type('Builder_' + '_'.join(feature_list), |
1690 | (object,), {'features' : feature_list}) |
1691 | |
1692 | diff --git a/bs4/tests/test_css.py b/bs4/tests/test_css.py |
1693 | index 359dbcd..2e2baba 100644 |
1694 | --- a/bs4/tests/test_css.py |
1695 | +++ b/bs4/tests/test_css.py |
1696 | @@ -8,6 +8,12 @@ from bs4 import ( |
1697 | ResultSet, |
1698 | ) |
1699 | |
1700 | +from typing import ( |
1701 | + Any, |
1702 | + Iterable, |
1703 | + Tuple, |
1704 | +) |
1705 | + |
1706 | from . import ( |
1707 | SoupTest, |
1708 | SOUP_SIEVE_PRESENT, |
1709 | @@ -78,7 +84,7 @@ class TestCSSSelectors(SoupTest): |
1710 | def setup_method(self): |
1711 | self.soup = BeautifulSoup(self.HTML, 'html.parser') |
1712 | |
1713 | - def assert_selects(self, selector, expected_ids, **kwargs): |
1714 | + def assert_selects(self, selector:str, expected_ids:Iterable[str], **kwargs:Any) -> None: |
1715 | results = self.soup.select(selector, **kwargs) |
1716 | assert isinstance(results, ResultSet) |
1717 | el_ids = [el['id'] for el in results] |
1718 | @@ -90,7 +96,7 @@ class TestCSSSelectors(SoupTest): |
1719 | |
1720 | assertSelect = assert_selects |
1721 | |
1722 | - def assert_select_multiple(self, *tests): |
1723 | + def assert_select_multiple(self, *tests:Tuple[str, Iterable[str]]): |
1724 | for selector, expected_ids in tests: |
1725 | self.assert_selects(selector, expected_ids) |
1726 | |
1727 | diff --git a/bs4/tests/test_filter.py b/bs4/tests/test_filter.py |
1728 | index 8d5da70..dfd6f18 100644 |
1729 | --- a/bs4/tests/test_filter.py |
1730 | +++ b/bs4/tests/test_filter.py |
1731 | @@ -5,6 +5,12 @@ import warnings |
1732 | from . import ( |
1733 | SoupTest, |
1734 | ) |
1735 | +from typing import ( |
1736 | + Callable, |
1737 | + Optional, |
1738 | + Pattern, |
1739 | + Tuple, |
1740 | +) |
1741 | from bs4.element import Tag |
1742 | from bs4.filter import ( |
1743 | AttributeValueMatchRule, |
1744 | @@ -14,6 +20,8 @@ from bs4.filter import ( |
1745 | StringMatchRule, |
1746 | TagNameMatchRule, |
1747 | ) |
1748 | +from bs4._typing import _RawOrProcessedAttributeValues |
1749 | + |
1750 | |
1751 | class TestElementFilter(SoupTest): |
1752 | |
1753 | @@ -107,7 +115,10 @@ class TestElementFilter(SoupTest): |
1754 | |
1755 | class TestMatchRule(SoupTest): |
1756 | |
1757 | - def _tuple(self, rule): |
1758 | + def _tuple(self, rule:MatchRule) -> Tuple[Optional[str], |
1759 | + Optional[Pattern[str]], |
1760 | + Optional[Callable], |
1761 | + Optional[bool]]: |
1762 | return ( |
1763 | rule.string, |
1764 | rule.pattern.pattern if rule.pattern else None, |
1765 | @@ -395,9 +406,10 @@ class TestSoupStrainer(SoupTest): |
1766 | assert msg == "Ignoring nested list [[...]] to avoid the possibility of infinite recursion." |
1767 | |
1768 | def tag_matches( |
1769 | - self, strainer, name, attrs=None, string=None, prefix=None, |
1770 | - match_valence=True |
1771 | - ): |
1772 | + self, strainer:SoupStrainer, name:str, |
1773 | + attrs:Optional[_RawOrProcessedAttributeValues]=None, |
1774 | + string:Optional[str]=None, prefix:Optional[str]=None, |
1775 | + ) -> bool: |
1776 | # Create a Tag with the given prefix, name and attributes, |
1777 | # then make sure that strainer.matches_tag and allow_tag_creation |
1778 | # both approve it. |
1779 | diff --git a/bs4/tests/test_fuzz.py b/bs4/tests/test_fuzz.py |
1780 | index f29802d..579686d 100644 |
1781 | --- a/bs4/tests/test_fuzz.py |
1782 | +++ b/bs4/tests/test_fuzz.py |
1783 | @@ -38,7 +38,7 @@ class TestFuzz(object): |
1784 | # multiple copies of the code must be kept around to run against |
1785 | # older tests. I'm not sure what to do about this, but I may |
1786 | # retire old tests after a time. |
1787 | - def fuzz_test_with_css(self, filename): |
1788 | + def fuzz_test_with_css(self, filename:str) -> None: |
1789 | data = self.__markup(filename) |
1790 | parsers = ['lxml-xml', 'html5lib', 'html.parser', 'lxml'] |
1791 | try: |
1792 | @@ -168,7 +168,7 @@ class TestFuzz(object): |
1793 | def test_html5lib_parse_errors(self, filename): |
1794 | self.fuzz_test_with_css(filename) |
1795 | |
1796 | - def __markup(self, filename): |
1797 | + def __markup(self, filename:str) -> bytes: |
1798 | if not filename.endswith(self.TESTCASE_SUFFIX): |
1799 | filename += self.TESTCASE_SUFFIX |
1800 | this_dir = os.path.split(__file__)[0] |
1801 | diff --git a/bs4/tests/test_html5lib.py b/bs4/tests/test_html5lib.py |
1802 | index 9f6dfa1..3c34403 100644 |
1803 | --- a/bs4/tests/test_html5lib.py |
1804 | +++ b/bs4/tests/test_html5lib.py |
1805 | @@ -8,14 +8,13 @@ from bs4.filter import SoupStrainer |
1806 | from . import ( |
1807 | HTML5LIB_PRESENT, |
1808 | HTML5TreeBuilderSmokeTest, |
1809 | - SoupTest, |
1810 | ) |
1811 | |
1812 | @pytest.mark.skipif( |
1813 | not HTML5LIB_PRESENT, |
1814 | reason="html5lib seems not to be present, not testing its tree builder." |
1815 | ) |
1816 | -class TestHTML5LibBuilder(SoupTest, HTML5TreeBuilderSmokeTest): |
1817 | +class TestHTML5LibBuilder(HTML5TreeBuilderSmokeTest): |
1818 | """See ``HTML5TreeBuilderSmokeTest``.""" |
1819 | |
1820 | @property |
1821 | diff --git a/bs4/tests/test_htmlparser.py b/bs4/tests/test_htmlparser.py |
1822 | index ff0f305..2a13c99 100644 |
1823 | --- a/bs4/tests/test_htmlparser.py |
1824 | +++ b/bs4/tests/test_htmlparser.py |
1825 | @@ -10,12 +10,14 @@ from bs4.builder import ( |
1826 | XMLParsedAsHTMLWarning, |
1827 | ) |
1828 | from bs4.builder._htmlparser import ( |
1829 | + _DuplicateAttributeHandler, |
1830 | BeautifulSoupHTMLParser, |
1831 | HTMLParserTreeBuilder, |
1832 | ) |
1833 | +from typing import Any |
1834 | from . import SoupTest, HTMLTreeBuilderSmokeTest |
1835 | |
1836 | -class TestHTMLParserTreeBuilder(SoupTest, HTMLTreeBuilderSmokeTest): |
1837 | +class TestHTMLParserTreeBuilder(HTMLTreeBuilderSmokeTest): |
1838 | |
1839 | default_builder = HTMLParserTreeBuilder |
1840 | |
1841 | @@ -95,7 +97,7 @@ class TestHTMLParserTreeBuilder(SoupTest, HTMLTreeBuilderSmokeTest): |
1842 | assert "id" == soup.a['id'] |
1843 | |
1844 | # You can also get this behavior explicitly. |
1845 | - def assert_attribute(on_duplicate_attribute, expected): |
1846 | + def assert_attribute(on_duplicate_attribute:_DuplicateAttributeHandler, expected:Any) -> None: |
1847 | soup = self.soup( |
1848 | markup, on_duplicate_attribute=on_duplicate_attribute |
1849 | ) |
1850 | diff --git a/bs4/tests/test_lxml.py b/bs4/tests/test_lxml.py |
1851 | index 9fc04e0..dc20501 100644 |
1852 | --- a/bs4/tests/test_lxml.py |
1853 | +++ b/bs4/tests/test_lxml.py |
1854 | @@ -26,7 +26,7 @@ from . import ( |
1855 | not LXML_PRESENT, |
1856 | reason="lxml seems not to be present, not testing its tree builder." |
1857 | ) |
1858 | -class TestLXMLTreeBuilder(SoupTest, HTMLTreeBuilderSmokeTest): |
1859 | +class TestLXMLTreeBuilder(HTMLTreeBuilderSmokeTest): |
1860 | """See ``HTMLTreeBuilderSmokeTest``.""" |
1861 | |
1862 | @property |
1863 | @@ -88,7 +88,7 @@ class TestLXMLTreeBuilder(SoupTest, HTMLTreeBuilderSmokeTest): |
1864 | not LXML_PRESENT, |
1865 | reason="lxml seems not to be present, not testing its XML tree builder." |
1866 | ) |
1867 | -class TestLXMLXMLTreeBuilder(SoupTest, XMLTreeBuilderSmokeTest): |
1868 | +class TestLXMLXMLTreeBuilder(XMLTreeBuilderSmokeTest): |
1869 | """See ``HTMLTreeBuilderSmokeTest``.""" |
1870 | |
1871 | @property |
1872 | diff --git a/bs4/tests/test_soup.py b/bs4/tests/test_soup.py |
1873 | index c95f380..61d0235 100644 |
1874 | --- a/bs4/tests/test_soup.py |
1875 | +++ b/bs4/tests/test_soup.py |
1876 | @@ -8,6 +8,7 @@ import pickle |
1877 | import pytest |
1878 | import sys |
1879 | import tempfile |
1880 | +from typing import Iterable |
1881 | |
1882 | from bs4 import ( |
1883 | BeautifulSoup, |
1884 | @@ -260,14 +261,15 @@ class TestWarnings(SoupTest): |
1885 | # that the code that triggered the warning is in the same |
1886 | # file as the test. |
1887 | |
1888 | - def _assert_warning(self, warnings, cls): |
1889 | + def _assert_warning( |
1890 | + self, warnings:Iterable[Warning], cls:type[Warning]) -> Warning: |
1891 | for w in warnings: |
1892 | if isinstance(w.message, cls): |
1893 | assert w.filename == __file__ |
1894 | return w |
1895 | raise Exception("%s warning not found in %r" % (cls, warnings)) |
1896 | |
1897 | - def _assert_no_parser_specified(self, w): |
1898 | + def _assert_no_parser_specified(self, w:Warning) -> None: |
1899 | warning = self._assert_warning(w, GuessedAtParserWarning) |
1900 | message = str(warning.message) |
1901 | assert message.startswith(BeautifulSoup.NO_PARSER_SPECIFIED_WARNING[:60]) |
1902 | diff --git a/bs4/tests/test_tree.py b/bs4/tests/test_tree.py |
1903 | index 43afb29..fe56e6b 100644 |
1904 | --- a/bs4/tests/test_tree.py |
1905 | +++ b/bs4/tests/test_tree.py |
1906 | @@ -135,7 +135,7 @@ class TestFindAllBasicNamespaces(SoupTest): |
1907 | class TestFindAllByName(SoupTest): |
1908 | """Test ways of finding tags by tag name.""" |
1909 | |
1910 | - def setup_method(self): |
1911 | + def setup_method(self) -> None: |
1912 | self.tree = self.soup("""<a>First tag.</a> |
1913 | <b>Second tag.</b> |
1914 | <c>Third <a>Nested tag.</a> tag.</c>""") |
1915 | @@ -459,7 +459,7 @@ class TestIndex(SoupTest): |
1916 | class TestParentOperations(SoupTest): |
1917 | """Test navigation and searching through an element's parents.""" |
1918 | |
1919 | - def setup_method(self): |
1920 | + def setup_method(self) -> None: |
1921 | self.tree = self.soup('''<ul id="empty"></ul> |
1922 | <ul id="top"> |
1923 | <ul id="middle"> |
1924 | @@ -508,14 +508,14 @@ class TestParentOperations(SoupTest): |
1925 | |
1926 | class ProximityTest(SoupTest): |
1927 | |
1928 | - def setup_method(self): |
1929 | + def setup_method(self) -> None: |
1930 | self.tree = self.soup( |
1931 | '<html id="start"><head></head><body><b id="1">One</b><b id="2">Two</b><b id="3">Three</b></body></html>') |
1932 | |
1933 | |
1934 | class TestNextOperations(ProximityTest): |
1935 | |
1936 | - def setup_method(self): |
1937 | + def setup_method(self) -> None: |
1938 | super(TestNextOperations, self).setup_method() |
1939 | self.start = self.tree.b |
1940 | |
1941 | @@ -555,7 +555,7 @@ class TestNextOperations(ProximityTest): |
1942 | |
1943 | class TestPreviousOperations(ProximityTest): |
1944 | |
1945 | - def setup_method(self): |
1946 | + def setup_method(self) -> None: |
1947 | super(TestPreviousOperations, self).setup_method() |
1948 | self.end = self.tree.find(string="Three") |
1949 | |
1950 | @@ -604,7 +604,7 @@ class TestPreviousOperations(ProximityTest): |
1951 | |
1952 | class SiblingTest(SoupTest): |
1953 | |
1954 | - def setup_method(self): |
1955 | + def setup_method(self) -> None: |
1956 | markup = '''<html> |
1957 | <span id="1"> |
1958 | <span id="1.1"></span> |
1959 | @@ -625,7 +625,7 @@ class SiblingTest(SoupTest): |
1960 | |
1961 | class TestNextSibling(SiblingTest): |
1962 | |
1963 | - def setup_method(self): |
1964 | + def setup_method(self) -> None: |
1965 | super(TestNextSibling, self).setup_method() |
1966 | self.start = self.tree.find(id="1") |
1967 | |
1968 | @@ -670,7 +670,7 @@ class TestNextSibling(SiblingTest): |
1969 | |
1970 | class TestPreviousSibling(SiblingTest): |
1971 | |
1972 | - def setup_method(self): |
1973 | + def setup_method(self) -> None: |
1974 | super(TestPreviousSibling, self).setup_method() |
1975 | self.end = self.tree.find(id="4") |
1976 | |
1977 | diff --git a/doc/index.rst b/doc/index.rst |
1978 | index a414830..0ff1fb2 100755 |
1979 | --- a/doc/index.rst |
1980 | +++ b/doc/index.rst |
1981 | @@ -3029,38 +3029,64 @@ Advanced search techniques |
1982 | |
1983 | Almost everyone who uses Beautiful Soup to extract information from a |
1984 | document can get what they need using the methods described in |
1985 | -`Searching the tree`_. However, there's a lower-level interface--the |
1986 | -:py:class:`ElementSelector` class-- which lets you define any matching |
1987 | +`Searching the tree`_. However, there is a lower-level interface--the |
1988 | +:py:class:`ElementFilter` class--that lets you define any matching |
1989 | behavior whatsoever. |
1990 | |
1991 | -To use :py:class:`ElementSelector`, define a function that takes a |
1992 | -:py:class:`PageElement` object (that is, it might be either a |
1993 | -:py:class:`Tag` or a :py:class`NavigableString`) and returns ``True`` |
1994 | +To use :py:class:`ElementFilter`, define a function that takes a |
1995 | +:py:class:`PageElement` object (which can be either a :py:class:`Tag` |
1996 | +*or* :py:class:`NavigableString` object) and returns ``True`` |
1997 | (if the element matches your custom criteria) or ``False`` (if it |
1998 | -doesn't):: |
1999 | +does not):: |
2000 | |
2001 | - [example goes here] |
2002 | + def _match_non_whitespace(pe): |
2003 | + """ |
2004 | + return True for: |
2005 | + * all Tag objects |
2006 | + * NavigableString objects that contain non-whitespace text |
2007 | + """ |
2008 | + return ( |
2009 | + isinstance(pe, Tag) or |
2010 | + (isinstance(pe, NavigableString) and |
2011 | + pe.text and not pe.text.isspace())) |
2012 | |
2013 | -Then, pass the function into an :py:class:`ElementSelector`:: |
2014 | +Then, construct an :py:class:`ElementFilter` that uses your function:: |
2015 | |
2016 | - from bs4.select import ElementSelector |
2017 | - selector = ElementSelector(f) |
2018 | + from bs4.filter import ElementFilter |
2019 | + skip_whitespace = ElementFilter(match_function=_match_non_whitespace) |
2020 | |
2021 | -You can then pass the :py:class:`ElementSelector` object as the first |
2022 | +You can now pass this :py:class:`ElementFilter` object as the first |
2023 | argument to any of the `Searching the tree`_ methods:: |
2024 | |
2025 | - [examples go here] |
2026 | + from bs4 import BeautifulSoup |
2027 | + html_doc = """ |
2028 | + <p> |
2029 | + <b>bold</b> |
2030 | + <i>italic</i> |
2031 | + and |
2032 | + <u>underline</u> |
2033 | + </p> |
2034 | + """ |
2035 | + soup = BeautifulSoup(html_doc, 'lxml') |
2036 | |
2037 | -Every potential match will be run through your function, and the only |
2038 | -:py:class:`PageElement` objects returned will be the one where your |
2039 | + soup.find('p').find_all(skip_whitespace, recursive=False) |
2040 | + # [<b>bold</b>, <i>italic</i>, '\n and\n ', <u>underline</u>] |
2041 | + |
2042 | +Every :py:class:`PageElement` encountered will be evaluated by your |
2043 | +function, and the objects returned will be only the ones where your |
2044 | function returned ``True``. |
2045 | |
2046 | -Note that this is different from simply passing `a function`_ as the |
2047 | -first argument to one of the search methods. That's an easy way to |
2048 | -find a tag, but _only_ tags will be considered. With an |
2049 | -:py:class:`ElementSelector` you can write a single function that makes |
2050 | -decisions about both tags and strings. |
2051 | - |
2052 | +To summarize the function-based matching behaviors, |
2053 | + |
2054 | +* A function passed as the first argument to a search method |
2055 | + (or equivalently, using the ``name`` argument) considers only |
2056 | + :py:class:`Tag` objects. |
2057 | +* A function passed to a search method using the ``string`` argument |
2058 | + considers only :py:class:`NavigableString` objects. |
2059 | +* A function passed to a search method using an :py:class:`ElementFilter` |
2060 | + object considers both :py:class:`Tag` and :py:class:`NavigableString` |
2061 | + objects. |
2062 | + |
2063 | |
2064 | Advanced parser customization |
2065 | ============================= |