Merge beautifulsoup:more-modular-soupstrainers into beautifulsoup:4.13

Proposed by Leonard Richardson
Status: Merged
Merged at revision: c23dd48ebea467fcf028e14287f07d2c51e62975
Proposed branch: beautifulsoup:more-modular-soupstrainers
Merge into: beautifulsoup:4.13
Diff against target: 2064 lines (+710/-262)
18 files modified
CHANGELOG (+18/-1)
bs4/__init__.py (+131/-84)
bs4/_typing.py (+19/-1)
bs4/builder/__init__.py (+8/-8)
bs4/builder/_html5lib.py (+123/-67)
bs4/builder/_htmlparser.py (+12/-2)
bs4/builder/_lxml.py (+1/-1)
bs4/diagnose.py (+27/-15)
bs4/element.py (+24/-20)
bs4/filter.py (+167/-36)
bs4/tests/__init__.py (+1/-1)
bs4/tests/test_filter.py (+125/-8)
bs4/tests/test_html5lib.py (+2/-2)
bs4/tests/test_lxml.py (+1/-1)
bs4/tests/test_pageelement.py (+1/-1)
bs4/tests/test_soup.py (+2/-2)
bs4/tests/test_tree.py (+1/-1)
doc/index.rst (+47/-11)
Reviewer: Leonard Richardson (status: Pending)
Review via email: mp+459082@code.launchpad.net
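Not part of the proposal itself, just orientation for reviewers: a minimal usage sketch of the public behavior this branch touches, using only the documented parse_only/SoupStrainer API and the new limit=0 rule described in the CHANGELOG hunks below. The markup string and variable names are made up for illustration.

    from bs4 import BeautifulSoup, SoupStrainer

    # Parse only the <a> tags; html.parser honors parse_only
    # (the html5lib builder ignores it and issues a warning).
    only_links = SoupStrainer("a")
    soup = BeautifulSoup(
        "<p>Intro <a href='/one'>one</a> <a href='/two'>two</a></p>",
        "html.parser",
        parse_only=only_links,
    )
    print(soup.find_all("a"))           # only the two links survive parsing
    print(soup.find_all("a", limit=0))  # [] under the new limit=0 behavior in 4.13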

Preview Diff

diff --git a/CHANGELOG b/CHANGELOG
index 69f238d..162e3dc 100644
--- a/CHANGELOG
+++ b/CHANGELOG
@@ -1,5 +1,7 @@
 = 4.13.0 (Unreleased)
 
+TODO: we could stand to put limit inside ResultSet
+
 * This version drops support for Python 3.6. The minimum supported
   major Python version for Beautiful Soup is now Python 3.7.
 
@@ -31,6 +33,13 @@
   you, since you probably use HTMLParserTreeBuilder, not
   BeautifulSoupHTMLParser directly.
 
+* The TreeBuilderForHtml5lib methods fragmentClass and getFragment
+  now raise NotImplementedError. These methods are called only by
+  html5lib's HTMLParser.parseFragment() method, which Beautiful Soup
+  doesn't use, so they were untested and should have never been called.
+  The getFragment() implementation was also slightly incorrect in a way
+  that should have caused obvious problems for anyone using it.
+
 * If Tag.get_attribute_list() is used to access an attribute that's not set,
   the return value is now an empty list rather than [None].
 
@@ -47,6 +56,10 @@
   empty list was treated the same as None and False, and you would have
   found the tags which did not have that attribute set at all. [bug=2045469]
 
+* For similar reasons, if you pass in limit=0 to a find() method for some
+  reason, you will now get zero results. Previously, you would get all
+  matching results.
+
 * When using one of the find() methods or creating a SoupStrainer,
   if you specify the same attribute value in ``attrs`` and the
   keyword arguments, you'll end up with two different ways to match that
@@ -88,7 +101,7 @@
   changed to match the arguments to the superclass,
   TreeBuilder.prepare_markup. Specifically, document_declared_encoding
   now appears before exclude_encodings, not after. If you were calling
-  this method yourself, I recomment switching to using keyword
+  this method yourself, I recommend switching to using keyword
   arguments instead.
 
 * Fixed an error in the lookup table used when converting
@@ -101,8 +114,12 @@ New deprecations in 4.13.0:
 
 * The SAXTreeBuilder class, which was never officially supported or tested.
 
+* The private class method BeautifulSoup._decode_markup(), which has not
+  been used inside Beautiful Soup for many years.
+
 * The first argument to BeautifulSoup.decode has been changed from a bool
   `pretty_print` to an int `indent_level`, to match the signature of Tag.decode.
+  Using a bool will still work but will give you a DeprecationWarning.
 
 * SoupStrainer.text and SoupStrainer.string are both deprecated
   since a single item can't capture all the possibilities of a SoupStrainer
diff --git a/bs4/__init__.py b/bs4/__init__.py
index 347cb38..95bd48d 100644
--- a/bs4/__init__.py
+++ b/bs4/__init__.py
@@ -15,7 +15,7 @@ documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
 """
 
 __author__ = "Leonard Richardson (leonardr@segfault.org)"
-__version__ = "4.12.3"
+__version__ = "4.13.0"
 __copyright__ = "Copyright (c) 2004-2024 Leonard Richardson"
 # Use of this source code is governed by the MIT license.
 __license__ = "MIT"
@@ -42,10 +42,13 @@ from .builder import (
 )
 from .builder._htmlparser import HTMLParserTreeBuilder
 from .dammit import UnicodeDammit
+from .css import (
+    CSS
+)
+from ._deprecation import _deprecated
 from .element import (
     CData,
     Comment,
-    CSS,
     DEFAULT_OUTPUT_ENCODING,
     Declaration,
     Doctype,
@@ -60,7 +63,10 @@ from .element import (
     TemplateString,
     )
 from .formatter import Formatter
-from .strainer import SoupStrainer
+from .filter import (
+    ElementFilter,
+    SoupStrainer,
+)
 from typing import (
     Any,
     cast,
@@ -70,6 +76,7 @@ from typing import (
     List,
     Sequence,
     Optional,
+    Tuple,
     Type,
     TYPE_CHECKING,
     Union,
@@ -81,6 +88,7 @@ from bs4._typing import (
     _Encoding,
     _Encodings,
     _IncomingMarkup,
+    _RawMarkup,
 )
 
 # Define some custom warnings.
@@ -144,20 +152,21 @@ class BeautifulSoup(Tag):
     NO_PARSER_SPECIFIED_WARNING: str = "No parser was explicitly specified, so I'm using the best available %(markup_type)s parser for this system (\"%(parser)s\"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.\n\nThe code that caused this warning is on line %(line_number)s of the file %(filename)s. To get rid of this warning, pass the additional argument 'features=\"%(parser)s\"' to the BeautifulSoup constructor.\n"
 
     # FUTURE PYTHON:
-    element_classes:Dict[Type[PageElement], Type[Any]] #: :meta private:
+    element_classes:Dict[Type[PageElement], Type[PageElement]] #: :meta private:
     builder:TreeBuilder #: :meta private:
     is_xml: bool
     known_xml: Optional[bool]
     parse_only: Optional[SoupStrainer] #: :meta private:
 
     # These members are only used while parsing markup.
-    markup:Optional[Union[str,bytes]] #: :meta private:
+    markup:Optional[_RawMarkup] #: :meta private:
     current_data:List[str] #: :meta private:
     currentTag:Optional[Tag] #: :meta private:
     tagStack:List[Tag] #: :meta private:
     open_tag_counter:CounterType[str] #: :meta private:
     preserve_whitespace_tag_stack:List[Tag] #: :meta private:
     string_container_stack:List[Tag] #: :meta private:
+    _most_recent_element:Optional[PageElement] #: :meta private:
 
     #: Beautiful Soup's best guess as to the character encoding of the
     #: original document.
@@ -182,7 +191,7 @@ class BeautifulSoup(Tag):
                  parse_only:Optional[SoupStrainer]=None,
                  from_encoding:Optional[_Encoding]=None,
                  exclude_encodings:Optional[_Encodings]=None,
-                 element_classes:Optional[Dict[Type[PageElement], Type[Any]]]=None,
+                 element_classes:Optional[Dict[Type[PageElement], Type[PageElement]]]=None,
                  **kwargs:Any
                  ):
         """Constructor.
@@ -271,7 +280,7 @@ class BeautifulSoup(Tag):
                 "features='lxml' for HTML and features='lxml-xml' for "
                 "XML.")
 
-        def deprecated_argument(old_name, new_name):
+        def deprecated_argument(old_name:str, new_name:str) -> Optional[Any]:
             if old_name in kwargs:
                 warnings.warn(
                     'The "%s" argument to the BeautifulSoup constructor '
@@ -284,13 +293,14 @@ class BeautifulSoup(Tag):
 
         parse_only = parse_only or deprecated_argument(
             "parseOnlyThese", "parse_only")
-        if (parse_only is not None
-            and parse_only.string_rules and
-            (parse_only.name_rules or parse_only.attribute_rules)):
-            warnings.warn(
-                f"Value for parse_only will exclude everything, since it puts restrictions on both tags and strings: {parse_only}",
-                UserWarning, stacklevel=3
-            )
+        if parse_only is not None:
+            # Issue a warning if we can tell in advance that
+            # parse_only will exclude the entire tree.
+            if parse_only.excludes_everything:
+                warnings.warn(
+                    f"The given value for parse_only will exclude everything: {parse_only}",
+                    UserWarning, stacklevel=3
+                )
 
         from_encoding = from_encoding or deprecated_argument(
             "fromEncoding", "from_encoding")
@@ -323,7 +333,7 @@ class BeautifulSoup(Tag):
                     "Couldn't find a tree builder with the features you "
                     "requested: %s. Do you need to install a parser library?"
                     % ",".join(features))
-            builder_class = cast(Type[TreeBuilder], possible_builder_class)
+            builder_class = possible_builder_class
 
         # At this point either we have a TreeBuilder instance in
         # builder, or we have a builder_class that we can instantiate
@@ -399,7 +409,7 @@ class BeautifulSoup(Tag):
 
         # At this point we know markup is a string or bytestring. If
         # it was a file-type object, we've read from it.
-        markup = cast(Union[str,bytes], markup)
+        markup = cast(_RawMarkup, markup)
 
         rejections = []
         success = False
@@ -428,7 +438,7 @@ class BeautifulSoup(Tag):
         self.markup = None
         self.builder.soup = None
 
-    def _clone(self):
+    def _clone(self) -> "BeautifulSoup":
         """Create a new BeautifulSoup object with the same TreeBuilder,
         but not associated with any markup.
 
@@ -441,7 +451,7 @@ class BeautifulSoup(Tag):
         clone.original_encoding = self.original_encoding
         return clone
 
-    def __getstate__(self):
+    def __getstate__(self) -> dict[str, Any]:
         # Frequently a tree builder can't be pickled.
         d = dict(self.__dict__)
         if 'builder' in d and d['builder'] is not None and not self.builder.picklable:
@@ -457,7 +467,7 @@ class BeautifulSoup(Tag):
             del d['_most_recent_element']
         return d
 
-    def __setstate__(self, state):
+    def __setstate__(self, state: dict[str, Any]) -> None:
         # If necessary, restore the TreeBuilder by looking it up.
         self.__dict__ = state
         if isinstance(self.builder, type):
@@ -469,15 +479,16 @@ class BeautifulSoup(Tag):
             self.builder.soup = self
         self.reset()
         self._feed()
-        return state
 
 
     @classmethod
-    def _decode_markup(cls, markup):
-        """Ensure `markup` is bytes so it's safe to send into warnings.warn.
+    @_deprecated(replaced_by="nothing (private method, will be removed)", version="4.13.0")
+    def _decode_markup(cls, markup:_RawMarkup) -> str:
+        """Ensure `markup` is Unicode so it's safe to send into warnings.warn.
 
-        TODO: warnings.warn had this problem back in 2010 but it might not
-        anymore.
+        warnings.warn had this problem back in 2010 but fortunately
+        not anymore. This has not been used for a long time; I just
+        noticed that fact while working on 4.13.0.
         """
         if isinstance(markup, bytes):
             decoded = markup.decode('utf-8', 'replace')
@@ -486,56 +497,76 @@ class BeautifulSoup(Tag):
             return decoded
 
     @classmethod
-    def _markup_is_url(cls, markup):
+    def _markup_is_url(cls, markup:_RawMarkup) -> bool:
         """Error-handling method to raise a warning if incoming markup looks
         like a URL.
 
-        :param markup: A string.
-        :return: Whether or not the markup resembles a URL
-        closely enough to justify a warning.
+        :param markup: A string of markup.
+        :return: Whether or not the markup resembled a URL
+        closely enough to justify issuing a warning.
         """
+        problem: bool = False
         if isinstance(markup, bytes):
-            space = b' '
-            cant_start_with = (b"http:", b"https:")
+            cant_start_with_b: Tuple[bytes, bytes] = (b"http:", b"https:")
+            problem = (
+                any(
+                    markup.startswith(prefix) for prefix in
+                    (b"http:", b"https:")
+                )
+                and not b' ' in markup
+            )
         elif isinstance(markup, str):
-            space = ' '
-            cant_start_with = ("http:", "https:")
+            problem = (
+                any(
+                    markup.startswith(prefix) for prefix in
+                    ("http:", "https:")
+                )
+                and not ' ' in markup
+            )
         else:
             return False
 
-        if any(markup.startswith(prefix) for prefix in cant_start_with):
-            if not space in markup:
-                warnings.warn(
-                    'The input looks more like a URL than markup. You may want to use'
-                    ' an HTTP client like requests to get the document behind'
-                    ' the URL, and feed that document to Beautiful Soup.',
-                    MarkupResemblesLocatorWarning,
-                    stacklevel=3
-                )
-                return True
-        return False
+        if not problem:
+            return False
+        warnings.warn(
+            'The input looks more like a URL than markup. You may want to use'
+            ' an HTTP client like requests to get the document behind'
+            ' the URL, and feed that document to Beautiful Soup.',
+            MarkupResemblesLocatorWarning,
+            stacklevel=3
+        )
+        return True
 
     @classmethod
-    def _markup_resembles_filename(cls, markup):
-        """Error-handling method to raise a warning if incoming markup
+    def _markup_resembles_filename(cls, markup:_RawMarkup) -> bool:
+        """Error-handling method to issue a warning if incoming markup
         resembles a filename.
 
-        :param markup: A bytestring or string.
-        :return: Whether or not the markup resembles a filename
-        closely enough to justify a warning.
+        :param markup: A string of markup.
+        :return: Whether or not the markup resembled a filename
+        closely enough to justify issuing a warning.
         """
-        path_characters = '/\\'
-        extensions = ['.html', '.htm', '.xml', '.xhtml', '.txt']
-        if isinstance(markup, bytes):
-            path_characters = path_characters.encode("utf8")
-            extensions = [x.encode('utf8') for x in extensions]
+        path_characters_b = b'/\\'
+        path_characters_s = '/\\'
+        extensions_b = [b'.html', b'.htm', b'.xml', b'.xhtml', b'.txt']
+        extensions_s = ['.html', '.htm', '.xml', '.xhtml', '.txt']
+
         filelike = False
-        if any(x in markup for x in path_characters):
-            filelike = True
+        if isinstance(markup, bytes):
+            if any(x in markup for x in path_characters_b):
+                filelike = True
+            else:
+                lower_b = markup.lower()
+                if any(lower_b.endswith(ext) for ext in extensions_b):
+                    filelike = True
         else:
-            lower = markup.lower()
-            if any(lower.endswith(ext) for ext in extensions):
+            if any(x in markup for x in path_characters_s):
                 filelike = True
+            else:
+                lower_s = markup.lower()
+                if any(lower_s.endswith(ext) for ext in extensions_s):
+                    filelike = True
+
         if filelike:
             warnings.warn(
                 'The input looks more like a filename than markup. You may'
@@ -546,20 +577,22 @@ class BeautifulSoup(Tag):
             return True
         return False
 
-    def _feed(self):
+    def _feed(self) -> None:
         """Internal method that parses previously set markup, creating a large
         number of Tag and NavigableString objects.
         """
         # Convert the document to Unicode.
         self.builder.reset()
 
-        self.builder.feed(self.markup)
+        if self.markup is not None:
+            self.builder.feed(self.markup)
         # Close out any unfinished strings and close all the open tags.
         self.endData()
-        while self.currentTag.name != self.ROOT_TAG_NAME:
+        while (self.currentTag is not None and
+               self.currentTag.name != self.ROOT_TAG_NAME):
             self.popTag()
 
-    def reset(self):
+    def reset(self) -> None:
         """Reset this object to a state as though it had never parsed any
         markup.
         """
@@ -585,7 +618,7 @@ class BeautifulSoup(Tag):
                 sourcepos:Optional[int]=None,
                 string:Optional[str]=None,
                 **kwattrs:_AttributeValue,
-                ):
+                ) -> Tag:
         """Create a new Tag associated with this BeautifulSoup object.
 
         :param name: The name of the new Tag.
@@ -603,10 +636,16 @@ class BeautifulSoup(Tag):
 
         """
         kwattrs.update(attrs)
-        tag = self.element_classes.get(Tag, Tag)(
+        tag_class = self.element_classes.get(Tag, Tag)
+
+        # Assume that this is either Tag or a subclass of Tag. If not,
+        # the user brought type-unsafety upon themselves.
+        tag_class = cast(Type[Tag], tag_class)
+        tag = tag_class(
             None, self.builder, name, namespace, nsprefix, kwattrs,
             sourceline=sourceline, sourcepos=sourcepos
         )
+
         if string is not None:
             tag.string = string
         return tag
@@ -622,9 +661,11 @@ class BeautifulSoup(Tag):
         """
         container = base_class or NavigableString
 
-        # There may be a general override of NavigableString.
-        container = self.element_classes.get(
-            container, container
+        # The user may want us to use some other class (hopefully a
+        # custom subclass) instead of the one we'd use normally.
+        container = cast(
+            type[NavigableString],
+            self.element_classes.get(container, container)
         )
 
         # On top of that, we may be inside a tag that needs a special
@@ -728,9 +769,8 @@ class BeautifulSoup(Tag):
         self.current_data = []
 
         # Should we add this string to the tree at all?
-        if self.parse_only and len(self.tagStack) <= 1 and \
-           (not self.parse_only.string_rules or \
-            not self.parse_only.allow_string_creation(current_data)):
+        if (self.parse_only and len(self.tagStack) <= 1 and
+            (not self.parse_only.allow_string_creation(current_data))):
             return
 
         containerClass = self.string_container(containerClass)
@@ -739,17 +779,16 @@ class BeautifulSoup(Tag):
 
     def object_was_parsed(
             self, o:PageElement, parent:Optional[Tag]=None,
-            most_recent_element:Optional[PageElement]=None):
+            most_recent_element:Optional[PageElement]=None) -> None:
         """Method called by the TreeBuilder to integrate an object into the
         parse tree.
 
-
-
         :meta private:
         """
         if parent is None:
             parent = self.currentTag
         assert parent is not None
+        previous_element: Optional[PageElement]
         if most_recent_element is not None:
             previous_element = most_recent_element
         else:
@@ -774,12 +813,12 @@ class BeautifulSoup(Tag):
         if fix:
             self._linkage_fixer(parent)
 
-    def _linkage_fixer(self, el):
+    def _linkage_fixer(self, el:Tag) -> None:
         """Make sure linkage of this fragment is sound."""
 
         first = el.contents[0]
         child = el.contents[-1]
-        descendant = child
+        descendant:PageElement = child
 
         if child is first and el.parent is not None:
             # Parent should be linked to first child
@@ -797,14 +836,18 @@ class BeautifulSoup(Tag):
 
         # This index is a tag, dig deeper for a "last descendant"
         if isinstance(child, Tag) and child.contents:
-            descendant = child._last_descendant(False)
+            # _last_decendant is typed as returning Optional[PageElement],
+            # but the value can't be None here, because el is a Tag
+            # which we know has contents.
+            descendant = cast(PageElement, child._last_descendant(False))
 
         # As the final step, link last descendant. It should be linked
         # to the parent's next sibling (if found), else walk up the chain
         # and find a parent with a sibling. It should have no next sibling.
         descendant.next_element = None
         descendant.next_sibling = None
-        target = el
+
+        target:Optional[Tag] = el
         while True:
             if target is None:
                 break
@@ -814,7 +857,7 @@ class BeautifulSoup(Tag):
                 break
             target = target.parent
 
-    def _popToTag(self, name, nsprefix=None, inclusivePop=True) -> Optional[Tag]:
+    def _popToTag(self, name:str, nsprefix:Optional[str]=None, inclusivePop:bool=True) -> Optional[Tag]:
         """Pops the tag stack up to and including the most recent
         instance of the given tag.
 
@@ -851,7 +894,7 @@ class BeautifulSoup(Tag):
 
     def handle_starttag(
             self, name:str, namespace:Optional[str],
-            nsprefix:Optional[str], attrs:Optional[Dict[str,str]],
+            nsprefix:Optional[str], attrs:_AttributeValues,
             sourceline:Optional[int]=None, sourcepos:Optional[int]=None,
             namespaces:Optional[Dict[str, str]]=None) -> Optional[Tag]:
         """Called by the tree builder when a new tag is encountered.
@@ -867,7 +910,7 @@ class BeautifulSoup(Tag):
         currently in scope in the document.
 
         If this method returns None, the tag was rejected by an active
-        SoupStrainer. You should proceed as if the tag had not occurred
+        `ElementFilter`. You should proceed as if the tag had not occurred
         in the document. For instance, if this was a self-closing tag,
         don't call handle_endtag.
 
@@ -877,11 +920,14 @@ class BeautifulSoup(Tag):
         self.endData()
 
         if (self.parse_only and len(self.tagStack) <= 1
-            and (self.parse_only.string_rules
-                 or not self.parse_only.allow_tag_creation(nsprefix, name, attrs))):
+            and not self.parse_only.allow_tag_creation(nsprefix, name, attrs)):
             return None
 
-        tag = self.element_classes.get(Tag, Tag)(
+        tag_class = self.element_classes.get(Tag, Tag)
+        # Assume that this is either Tag or a subclass of Tag. If not,
+        # the user brought type-unsafety upon themselves.
+        tag_class = cast(Type[Tag], tag_class)
+        tag = tag_class(
             self, self.builder, name, namespace, nsprefix, attrs,
             self.currentTag, self._most_recent_element,
             sourceline=sourceline, sourcepos=sourcepos,
@@ -918,7 +964,8 @@ class BeautifulSoup(Tag):
     def decode(self, indent_level:Optional[int]=None,
                eventual_encoding:_Encoding=DEFAULT_OUTPUT_ENCODING,
                formatter:Union[Formatter,str]="minimal",
-               iterator:Optional[Iterable]=None, **kwargs) -> str:
+               iterator:Optional[Iterable[PageElement]]=None,
+               **kwargs:Any) -> str:
         """Returns a string representation of the parse tree
         as a full HTML or XML document.
 
@@ -989,7 +1036,7 @@ _soup = BeautifulSoup
 class BeautifulStoneSoup(BeautifulSoup):
     """Deprecated interface to an XML parser."""
 
-    def __init__(self, *args, **kwargs):
+    def __init__(self, *args:Any, **kwargs:Any):
         kwargs['features'] = 'xml'
         warnings.warn(
             'The BeautifulStoneSoup class was deprecated in version 4.0.0. Instead of using '
diff --git a/bs4/_typing.py b/bs4/_typing.py
index fed804a..ab8f7a0 100644
--- a/bs4/_typing.py
+++ b/bs4/_typing.py
@@ -7,6 +7,8 @@
 # * In 3.10, x|y is an accepted shorthand for Union[x,y].
 # * In 3.10, TypeAlias gains capabilities that can be used to
 #   improve the tree matching types (I don't remember what, exactly).
+# * 3.8 defines the Protocol type, which can be used to do duck typing
+#   in a statically checkable way.
 
 import re
 from typing_extensions import TypeAlias
@@ -15,13 +17,14 @@ from typing import (
     Dict,
     IO,
     Iterable,
+    Optional,
     Pattern,
     TYPE_CHECKING,
     Union,
 )
 
 if TYPE_CHECKING:
-    from bs4.element import Tag
+    from bs4.element import PageElement, Tag
 
 # Aliases for markup in various stages of processing.
 #
@@ -52,6 +55,10 @@ _InvertedNamespaceMapping:TypeAlias = Dict[_NamespaceURL, _NamespacePrefix]
 _AttributeValue: TypeAlias = Union[str, Iterable[str]]
 _AttributeValues: TypeAlias = Dict[str, _AttributeValue]
 
+# The most common form in which attribute values are passed in from a
+# parser.
+_RawAttributeValues: TypeAlias = dict[str, str]
+
 # Aliases to represent the many possibilities for matching bits of a
 # parse tree.
 #
@@ -60,6 +67,17 @@ _AttributeValues: TypeAlias = Dict[str, _AttributeValue]
 # of the arguments to the SoupStrainer constructor and (more
 # familiarly to Beautiful Soup users) the find* methods.
 
+# A function that takes a PageElement and returns a yes-or-no answer.
+_PageElementMatchFunction:TypeAlias = Callable[['PageElement'], bool]
+
+# A function that takes the raw parsed ingredients of a markup tag
+# and returns a yes-or-no answer.
+_AllowTagCreationFunction:TypeAlias = Callable[[Optional[str], str, Optional[_RawAttributeValues]], bool]
+
+# A function that takes the raw parsed ingredients of a markup string node
+# and returns a yes-or-no answer.
+_AllowStringCreationFunction:TypeAlias = Callable[[Optional[str]], bool]
+
 # A function that takes a Tag and returns a yes-or-no answer.
 # A TagNameMatchRule expects this kind of function, if you're
 # going to pass it a function.
diff --git a/bs4/builder/__init__.py b/bs4/builder/__init__.py
index fa2b939..b59513e 100644
--- a/bs4/builder/__init__.py
+++ b/bs4/builder/__init__.py
@@ -277,7 +277,7 @@ class TreeBuilder(object):
             return True
         return tag_name in self.empty_element_tags
 
-    def feed(self, markup:str) -> None:
+    def feed(self, markup:_RawMarkup) -> None:
         """Run some incoming markup through some parsing process,
         populating the `BeautifulSoup` object in `TreeBuilder.soup`
         """
@@ -598,8 +598,8 @@ class DetectsXMLParsedAsHTML(object):
 
     # This is typed as str, not `ProcessingInstruction`, because this
     # check may be run before any Beautiful Soup objects are created.
-    _first_processing_instruction: Optional[str]
-    _root_tag: Optional[Tag]
+    _first_processing_instruction: Optional[str] #: :meta private:
+    _root_tag_name: Optional[str] #: :meta private:
 
     @classmethod
     def warn_if_markup_looks_like_xml(cls, markup:Optional[_RawMarkup], stacklevel:int=3) -> bool:
@@ -648,14 +648,14 @@ class DetectsXMLParsedAsHTML(object):
     def _initialize_xml_detector(self) -> None:
         """Call this method before parsing a document."""
         self._first_processing_instruction = None
-        self._root_tag = None
+        self._root_tag_name = None
 
     def _document_might_be_xml(self, processing_instruction:str):
         """Call this method when encountering an XML declaration, or a
         "processing instruction" that might be an XML declaration.
         """
         if (self._first_processing_instruction is not None
-            or self._root_tag is not None):
+            or self._root_tag_name is not None):
             # The document has already started. Don't bother checking
             # anymore.
             return
@@ -665,18 +665,18 @@ class DetectsXMLParsedAsHTML(object):
         # We won't know until we encounter the first tag whether or
         # not this is actually a problem.
 
-    def _root_tag_encountered(self, name):
+    def _root_tag_encountered(self, name:str) -> None:
         """Call this when you encounter the document's root tag.
 
         This is where we actually check whether an XML document is
         being incorrectly parsed as HTML, and issue the warning.
         """
-        if self._root_tag is not None:
+        if self._root_tag_name is not None:
             # This method was incorrectly called multiple times. Do
             # nothing.
             return
 
-        self._root_tag = name
+        self._root_tag_name = name
         if (name != 'html' and self._first_processing_instruction is not None
             and self._first_processing_instruction.lower().startswith('xml ')):
             # We encountered an XML declaration and then a tag other
diff --git a/bs4/builder/_html5lib.py b/bs4/builder/_html5lib.py
index b7d2924..2ea556c 100644
--- a/bs4/builder/_html5lib.py
+++ b/bs4/builder/_html5lib.py
@@ -6,6 +6,9 @@ __all__ = [
6 ]6 ]
77
8from typing import (8from typing import (
9 Any,
10 cast,
11 Dict,
9 Iterable,12 Iterable,
10 List,13 List,
11 Optional,14 Optional,
@@ -14,8 +17,11 @@ from typing import (
14 Union,17 Union,
15)18)
16from bs4._typing import (19from bs4._typing import (
20 _AttributeValue,
21 _AttributeValues,
17 _Encoding,22 _Encoding,
18 _Encodings,23 _Encodings,
24 _NamespaceURL,
19 _RawMarkup,25 _RawMarkup,
20)26)
2127
@@ -30,6 +36,7 @@ from bs4.builder import (
30 )36 )
31from bs4.element import (37from bs4.element import (
32 NamespacedAttribute,38 NamespacedAttribute,
39 PageElement,
33 nonwhitespace_re,40 nonwhitespace_re,
34)41)
35import html5lib42import html5lib
@@ -42,7 +49,9 @@ from bs4.element import (
42 Doctype,49 Doctype,
43 NavigableString,50 NavigableString,
44 Tag,51 Tag,
45 )52)
53if TYPE_CHECKING:
54 from bs4 import BeautifulSoup
4655
47from html5lib.treebuilders import base as treebuilder_base56from html5lib.treebuilders import base as treebuilder_base
4857
@@ -71,7 +80,9 @@ class HTML5TreeBuilder(HTMLTreeBuilder):
71 #: html5lib can tell us which line number and position in the80 #: html5lib can tell us which line number and position in the
72 #: original file is the source of an element.81 #: original file is the source of an element.
73 TRACKS_LINE_NUMBERS:bool = True82 TRACKS_LINE_NUMBERS:bool = True
74 83
84 underlying_builder:'TreeBuilderForHtml5lib' #: :meta private:
85
75 def prepare_markup(self, markup:_RawMarkup,86 def prepare_markup(self, markup:_RawMarkup,
76 user_specified_encoding:Optional[_Encoding]=None,87 user_specified_encoding:Optional[_Encoding]=None,
77 document_declared_encoding:Optional[_Encoding]=None,88 document_declared_encoding:Optional[_Encoding]=None,
@@ -102,20 +113,31 @@ class HTML5TreeBuilder(HTMLTreeBuilder):
102 yield (markup, None, None, False)113 yield (markup, None, None, False)
103114
104 # These methods are defined by Beautiful Soup.115 # These methods are defined by Beautiful Soup.
105 def feed(self, markup):116 def feed(self, markup:_RawMarkup) -> None:
106 """Run some incoming markup through some parsing process,117 """Run some incoming markup through some parsing process,
107 populating the `BeautifulSoup` object in `HTML5TreeBuilder.soup`.118 populating the `BeautifulSoup` object in `HTML5TreeBuilder.soup`.
108 """119 """
109 if self.soup.parse_only is not None:120 if self.soup is not None and self.soup.parse_only is not None:
110 warnings.warn(121 warnings.warn(
111 "You provided a value for parse_only, but the html5lib tree builder doesn't support parse_only. The entire document will be parsed.",122 "You provided a value for parse_only, but the html5lib tree builder doesn't support parse_only. The entire document will be parsed.",
112 stacklevel=4123 stacklevel=4
113 )124 )
125
126 # self.underlying_parser is probably None now, but it'll be set
127 # when self.create_treebuilder is called by html5lib.
128 #
129 # TODO-TYPING: typeshed stubs are incorrect about the return
130 # value of HTMLParser.__init__; it is HTMLParser, not None.
114 parser = html5lib.HTMLParser(tree=self.create_treebuilder)131 parser = html5lib.HTMLParser(tree=self.create_treebuilder)
132 assert self.underlying_builder is not None
115 self.underlying_builder.parser = parser133 self.underlying_builder.parser = parser
116 extra_kwargs = dict()134 extra_kwargs = dict()
117 if not isinstance(markup, str):135 if not isinstance(markup, str):
136 # kwargs, specifically override_encoding, will eventually
137 # be passed in to html5lib's
138 # HTMLBinaryInputStream.__init__.
118 extra_kwargs['override_encoding'] = self.user_specified_encoding139 extra_kwargs['override_encoding'] = self.user_specified_encoding
140
119 doc = parser.parse(markup, **extra_kwargs)141 doc = parser.parse(markup, **extra_kwargs)
120 142
121 # Set the character encoding detected by the tokenizer.143 # Set the character encoding detected by the tokenizer.
@@ -131,10 +153,12 @@ class HTML5TreeBuilder(HTMLTreeBuilder):
131 doc.original_encoding = original_encoding153 doc.original_encoding = original_encoding
132 self.underlying_builder.parser = None154 self.underlying_builder.parser = None
133155
134 def create_treebuilder(self, namespaceHTMLElements):156 def create_treebuilder(self, namespaceHTMLElements:bool) -> 'TreeBuilderForHtml5lib':
135 """Called by html5lib to instantiate the kind of class it157 """Called by html5lib to instantiate the kind of class it
136 calls a 'TreeBuilder'.158 calls a 'TreeBuilder'.
137 159
160 :param namespaceHTMLElements: Whether or not to namespace HTML elements.
161
138 :meta private:162 :meta private:
139 """163 """
140 self.underlying_builder = TreeBuilderForHtml5lib(164 self.underlying_builder = TreeBuilderForHtml5lib(
@@ -143,15 +167,18 @@ class HTML5TreeBuilder(HTMLTreeBuilder):
143 )167 )
144 return self.underlying_builder168 return self.underlying_builder
145169
146 def test_fragment_to_document(self, fragment):170 def test_fragment_to_document(self, fragment:str) -> str:
147 """See `TreeBuilder`."""171 """See `TreeBuilder`."""
148 return '<html><head></head><body>%s</body></html>' % fragment172 return '<html><head></head><body>%s</body></html>' % fragment
149173
150174
151class TreeBuilderForHtml5lib(treebuilder_base.TreeBuilder):175class TreeBuilderForHtml5lib(treebuilder_base.TreeBuilder):
152 176
153 def __init__(self, namespaceHTMLElements, soup=None,177 soup:'BeautifulSoup' #: :meta private:
154 store_line_numbers=True, **kwargs):178
179 def __init__(self, namespaceHTMLElements:bool,
180 soup:Optional['BeautifulSoup']=None,
181 store_line_numbers:bool=True, **kwargs:Any):
155 if soup:182 if soup:
156 self.soup = soup183 self.soup = soup
157 else:184 else:
@@ -172,65 +199,68 @@ class TreeBuilderForHtml5lib(treebuilder_base.TreeBuilder):
172 self.parser = None199 self.parser = None
173 self.store_line_numbers = store_line_numbers200 self.store_line_numbers = store_line_numbers
174 201
175 def documentClass(self):202 def documentClass(self) -> 'Element':
176 self.soup.reset()203 self.soup.reset()
177 return Element(self.soup, self.soup, None)204 return Element(self.soup, self.soup, None)
178205
179 def insertDoctype(self, token):206 def insertDoctype(self, token:Dict[str, Any]) -> None:
180 name = token["name"]207 name:str = cast(str, token["name"])
181 publicId = token["publicId"]208 publicId:Optional[str] = cast(Optional[str], token["publicId"])
182 systemId = token["systemId"]209 systemId:Optional[str] = cast(Optional[str], token["systemId"])
183210
184 doctype = Doctype.for_name_and_ids(name, publicId, systemId)211 doctype = Doctype.for_name_and_ids(name, publicId, systemId)
185 self.soup.object_was_parsed(doctype)212 self.soup.object_was_parsed(doctype)
186213
187 def elementClass(self, name, namespace):214 def elementClass(self, name:str, namespace:str) -> 'Element':
188 kwargs = {}215 sourceline:Optional[int] = None
216 sourcepos:Optional[int] = None
189 if self.parser and self.store_line_numbers:217 if self.parser and self.store_line_numbers:
190 # This represents the point immediately after the end of the218 # This represents the point immediately after the end of the
191 # tag. We don't know when the tag started, but we do know219 # tag. We don't know when the tag started, but we do know
192 # where it ended -- the character just before this one.220 # where it ended -- the character just before this one.
193 sourceline, sourcepos = self.parser.tokenizer.stream.position()221 sourceline, sourcepos = self.parser.tokenizer.stream.position()
194 kwargs['sourceline'] = sourceline222 sourcepos = sourcepos-1
195 kwargs['sourcepos'] = sourcepos-1223 tag = self.soup.new_tag(
196 tag = self.soup.new_tag(name, namespace, **kwargs)224 name, namespace, sourceline=sourceline, sourcepos=sourcepos
225 )
197226
198 return Element(tag, self.soup, namespace)227 return Element(tag, self.soup, namespace)
199228
200 def commentClass(self, data):229 def commentClass(self, data:str) -> 'TextNode':
201 return TextNode(Comment(data), self.soup)230 return TextNode(Comment(data), self.soup)
202231
203 def fragmentClass(self):232 def fragmentClass(self) -> 'Element':
204 from bs4 import BeautifulSoup233 """This is only used by html5lib HTMLParser.parseFragment(),
205 # TODO: Why is the parser 'html.parser' here? To avoid an234 which is never used by Beautiful Soup."""
206 # infinite loop?235 raise NotImplementedError()
207 self.soup = BeautifulSoup("", "html.parser")236
208 self.soup.name = "[document_fragment]"237 def getFragment(self) -> 'Element':
209 return Element(self.soup, self.soup, None)238 """This is only used by html5lib HTMLParser.parseFragment,
239 which is never used by Beautiful Soup."""
240 raise NotImplementedError()
210241
211 def appendChild(self, node):242 def appendChild(self, node:'Element') -> None:
212 # XXX This code is not covered by the BS4 tests.243 # TODO: This code is not covered by the BS4 tests.
213 self.soup.append(node.element)244 self.soup.append(node.element)
214245
215 def getDocument(self):246 def getDocument(self) -> 'BeautifulSoup':
216 return self.soup247 return self.soup
217248
218 def getFragment(self):249 # TODO-TYPING: typeshed stubs are incorrect about this;
219 return treebuilder_base.TreeBuilder.getFragment(self).element250 # cloneNode returns a str, not None.
220251 def testSerializer(self, element:'Element') -> str:
221 def testSerializer(self, element):
222 from bs4 import BeautifulSoup252 from bs4 import BeautifulSoup
223 rv = []253 rv = []
224 doctype_re = re.compile(r'^(.*?)(?: PUBLIC "(.*?)"(?: "(.*?)")?| SYSTEM "(.*?)")?$')254 doctype_re = re.compile(r'^(.*?)(?: PUBLIC "(.*?)"(?: "(.*?)")?| SYSTEM "(.*?)")?$')
225255
226 def serializeElement(element, indent=0):256 def serializeElement(element:Union['Element', PageElement], indent=0) -> None:
227 if isinstance(element, BeautifulSoup):257 if isinstance(element, BeautifulSoup):
228 pass258 pass
229 if isinstance(element, Doctype):259 if isinstance(element, Doctype):
230 m = doctype_re.match(element)260 m = doctype_re.match(element)
231 if m:261 if m is not None:
232 name = m.group(1)262 name = m.group(1)
233 if m.lastindex > 1:263 if m.lastindex is not None and m.lastindex > 1:
234 publicId = m.group(2) or ""264 publicId = m.group(2) or ""
235 systemId = m.group(3) or m.group(4) or ""265 systemId = m.group(3) or m.group(4) or ""
236 rv.append("""|%s<!DOCTYPE %s "%s" "%s">""" %266 rv.append("""|%s<!DOCTYPE %s "%s" "%s">""" %
@@ -243,7 +273,7 @@ class TreeBuilderForHtml5lib(treebuilder_base.TreeBuilder):
243 rv.append("|%s<!-- %s -->" % (' ' * indent, element))273 rv.append("|%s<!-- %s -->" % (' ' * indent, element))
244 elif isinstance(element, NavigableString):274 elif isinstance(element, NavigableString):
245 rv.append("|%s\"%s\"" % (' ' * indent, element))275 rv.append("|%s\"%s\"" % (' ' * indent, element))
246 else:276 elif isinstance(element, Element):
247 if element.namespace:277 if element.namespace:
248 name = "%s %s" % (prefixes[element.namespace],278 name = "%s %s" % (prefixes[element.namespace],
249 element.name)279 element.name)
@@ -269,12 +299,19 @@ class TreeBuilderForHtml5lib(treebuilder_base.TreeBuilder):
269 return "\n".join(rv)299 return "\n".join(rv)
270300
271class AttrList(object):301class AttrList(object):
272 def __init__(self, element):302 """Represents a Tag's attributes in a way compatible with html5lib."""
303
304 element:Tag
305 attrs:_AttributeValues
306
307 def __init__(self, element:Tag):
273 self.element = element308 self.element = element
274 self.attrs = dict(self.element.attrs)309 self.attrs = dict(self.element.attrs)
275 def __iter__(self):310
311 def __iter__(self) -> Iterable[Tuple[str, _AttributeValue]]:
276 return list(self.attrs.items()).__iter__()312 return list(self.attrs.items()).__iter__()
277 def __setitem__(self, name, value):313
314 def __setitem__(self, name:str, value:_AttributeValue) -> None:
278 # If this attribute is a multi-valued attribute for this element,315 # If this attribute is a multi-valued attribute for this element,
279 # turn its value into a list.316 # turn its value into a list.
280 list_attr = self.element.cdata_list_attributes or {}317 list_attr = self.element.cdata_list_attributes or {}
@@ -282,40 +319,52 @@ class AttrList(object):
282 or (self.element.name in list_attr319 or (self.element.name in list_attr
283 and name in list_attr.get(self.element.name, []))):320 and name in list_attr.get(self.element.name, []))):
284 # A node that is being cloned may have already undergone321 # A node that is being cloned may have already undergone
285 # this procedure.322 # this procedure. Check for this and skip it.
286 if not isinstance(value, list):323 if not isinstance(value, list):
324 assert isinstance(value, str)
287 value = nonwhitespace_re.findall(value)325 value = nonwhitespace_re.findall(value)
288 self.element[name] = value326 self.element[name] = value
289 def items(self):327
328 def items(self) -> Iterable[Tuple[str, _AttributeValue]]:
290 return list(self.attrs.items())329 return list(self.attrs.items())
291 def keys(self):330
331 def keys(self) -> Iterable[str]:
292 return list(self.attrs.keys())332 return list(self.attrs.keys())
293 def __len__(self):333
334 def __len__(self) -> int:
294 return len(self.attrs)335 return len(self.attrs)
295 def __getitem__(self, name):336
337 def __getitem__(self, name:str) -> _AttributeValue:
296 return self.attrs[name]338 return self.attrs[name]
297 def __contains__(self, name):339
340 def __contains__(self, name:str) -> bool:
298 return name in list(self.attrs.keys())341 return name in list(self.attrs.keys())
299342
300343
301class Element(treebuilder_base.Node):344class Element(treebuilder_base.Node):
302 def __init__(self, element, soup, namespace):345
346 element:Tag
347 soup:'BeautifulSoup'
348 namespace:Optional[_NamespaceURL]
349
350 def __init__(self, element:Tag, soup:'BeautifulSoup',
351 namespace:Optional[_NamespaceURL]):
303 treebuilder_base.Node.__init__(self, element.name)352 treebuilder_base.Node.__init__(self, element.name)
304 self.element = element353 self.element = element
305 self.soup = soup354 self.soup = soup
306 self.namespace = namespace355 self.namespace = namespace
307356
308 def appendChild(self, node):357 def appendChild(self, node:'Element') -> None:
309 string_child = child = None358 string_child = child = None
310 if isinstance(node, str):359 if isinstance(node, str):
311 # Some other piece of code decided to pass in a string360 # Some other piece of code decided to pass in a string
312 # instead of creating a TextElement object to contain the361 # instead of creating a TextElement object to contain the
313 # string.362 # string. This should not ever happen.
314 string_child = child = node363 string_child = child = node
315 elif isinstance(node, Tag):364 elif isinstance(node, Tag):
316 # Some other piece of code decided to pass in a Tag365 # Some other piece of code decided to pass in a Tag
317 # instead of creating an Element object to contain the366 # instead of creating an Element object to contain the
318 # Tag.367 # Tag. This should not ever happen.
319 child = node368 child = node
320 elif node.element.__class__ == NavigableString:369 elif node.element.__class__ == NavigableString:
321 string_child = child = node.element370 string_child = child = node.element
@@ -324,7 +373,7 @@ class Element(treebuilder_base.Node):
324 child = node.element373 child = node.element
325 node.parent = self374 node.parent = self
326375
327 if not isinstance(child, str) and child.parent is not None:376 if not isinstance(child, str) and child is not None and child.parent is not None:
328 node.element.extract()377 node.element.extract()
329378
330 if (string_child is not None and self.element.contents379 if (string_child is not None and self.element.contents
@@ -359,14 +408,13 @@ class Element(treebuilder_base.Node):
359 child, parent=self.element,408 child, parent=self.element,
360 most_recent_element=most_recent_element)409 most_recent_element=most_recent_element)
361410
362 def getAttributes(self):411 def getAttributes(self) -> AttrList:
363 if isinstance(self.element, Comment):412 if isinstance(self.element, Comment):
364 return {}413 return {}
365 return AttrList(self.element)414 return AttrList(self.element)
366415
367 def setAttributes(self, attributes):416 def setAttributes(self, attributes:Optional[Dict]) -> None:
368 if attributes is not None and len(attributes) > 0:417 if attributes is not None and len(attributes) > 0:
369 converted_attributes = []
370 for name, value in list(attributes.items()):418 for name, value in list(attributes.items()):
371 if isinstance(name, tuple):419 if isinstance(name, tuple):
372 new_name = NamespacedAttribute(*name)420 new_name = NamespacedAttribute(*name)
@@ -386,14 +434,14 @@ class Element(treebuilder_base.Node):
386 self.soup.builder.set_up_substitutions(self.element)434 self.soup.builder.set_up_substitutions(self.element)
387 attributes = property(getAttributes, setAttributes)435 attributes = property(getAttributes, setAttributes)
388436
389 def insertText(self, data, insertBefore=None):437 def insertText(self, data:str, insertBefore:Optional['Element']=None) -> None:
390 text = TextNode(self.soup.new_string(data), self.soup)438 text = TextNode(self.soup.new_string(data), self.soup)
391 if insertBefore:439 if insertBefore:
392 self.insertBefore(text, insertBefore)440 self.insertBefore(text, insertBefore)
393 else:441 else:
394 self.appendChild(text)442 self.appendChild(text)
395443
396 def insertBefore(self, node, refNode):444 def insertBefore(self, node:'Element', refNode:'Element') -> None:
397 index = self.element.index(refNode.element)445 index = self.element.index(refNode.element)
398 if (node.element.__class__ == NavigableString and self.element.contents446 if (node.element.__class__ == NavigableString and self.element.contents
399 and self.element.contents[index-1].__class__ == NavigableString):447 and self.element.contents[index-1].__class__ == NavigableString):
@@ -405,10 +453,10 @@ class Element(treebuilder_base.Node):
405 self.element.insert(index, node.element)453 self.element.insert(index, node.element)
406 node.parent = self454 node.parent = self
407455
408 def removeChild(self, node):456 def removeChild(self, node:'Element') -> None:
409 node.element.extract()457 node.element.extract()
410458
411 def reparentChildren(self, new_parent):459 def reparentChildren(self, new_parent:'Element') -> None:
412 """Move all of this tag's children into another tag."""460 """Move all of this tag's children into another tag."""
413 # print("MOVE", self.element.contents)461 # print("MOVE", self.element.contents)
414 # print("FROM", self.element)462 # print("FROM", self.element)
@@ -424,6 +472,10 @@ class Element(treebuilder_base.Node):
424 if len(new_parent_element.contents) > 0:472 if len(new_parent_element.contents) > 0:
425 # The new parent already contains children. We will be473 # The new parent already contains children. We will be
426 # appending this tag's children to the end.474 # appending this tag's children to the end.
475
476 # We can make this assertion since we know new_parent has
477 # children.
478 assert new_parents_last_descendant is not None
427 new_parents_last_child = new_parent_element.contents[-1]479 new_parents_last_child = new_parent_element.contents[-1]
428 new_parents_last_descendant_next_element = new_parents_last_descendant.next_element480 new_parents_last_descendant_next_element = new_parents_last_descendant.next_element
429 else:481 else:
@@ -474,17 +526,21 @@ class Element(treebuilder_base.Node):
474 # print("FROM", self.element)526 # print("FROM", self.element)
475 # print("TO", new_parent_element)527 # print("TO", new_parent_element)
476528
477 def cloneNode(self):529 # TODO: typeshed stubs are incorrect about this;
530 # cloneNode returns a new Node, not None.
531 def cloneNode(self) -> treebuilder_base.Node:
478 tag = self.soup.new_tag(self.element.name, self.namespace)532 tag = self.soup.new_tag(self.element.name, self.namespace)
479 node = Element(tag, self.soup, self.namespace)533 node = Element(tag, self.soup, self.namespace)
480 for key,value in self.attributes:534 for key,value in self.attributes:
481 node.attributes[key] = value535 node.attributes[key] = value
482 return node536 return node
483537
484 def hasContent(self):538 # TODO-TYPING: typeshed stubs are incorrect about this;
485 return self.element.contents539 # hasContent returns a boolean, not None.
540 def hasContent(self) -> bool:
541 return len(self.element.contents) > 0
486542
487 def getNameTuple(self):543 def getNameTuple(self) -> Tuple[str, str]:
488 if self.namespace == None:544 if self.namespace == None:
489 return namespaces["html"], self.name545 return namespaces["html"], self.name
490 else:546 else:
@@ -493,10 +549,10 @@ class Element(treebuilder_base.Node):
493 nameTuple = property(getNameTuple)549 nameTuple = property(getNameTuple)
494550
495class TextNode(Element):551class TextNode(Element):
496 def __init__(self, element, soup):552 def __init__(self, element:PageElement, soup:'BeautifulSoup'):
497 treebuilder_base.Node.__init__(self, None)553 treebuilder_base.Node.__init__(self, None)
498 self.element = element554 self.element = element
499 self.soup = soup555 self.soup = soup
500556
501 def cloneNode(self):557 def cloneNode(self) -> treebuilder_base.Node:
502 raise NotImplementedError558 raise NotImplementedError()
diff --git a/bs4/builder/_htmlparser.py b/bs4/builder/_htmlparser.py
index 291f6c6..91cecf7 100644
--- a/bs4/builder/_htmlparser.py
+++ b/bs4/builder/_htmlparser.py
@@ -188,7 +188,7 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML):
188 # later on. If so, we want to ignore it.188 # later on. If so, we want to ignore it.
189 self.already_closed_empty_element.append(name)189 self.already_closed_empty_element.append(name)
190190
191 if self._root_tag is None:191 if self._root_tag_name is None:
192 self._root_tag_encountered(name)192 self._root_tag_encountered(name)
193 193
194 def handle_endtag(self, name:str, check_already_closed:bool=True) -> None:194 def handle_endtag(self, name:str, check_already_closed:bool=True) -> None:
@@ -422,13 +422,23 @@ class HTMLParserTreeBuilder(HTMLTreeBuilder):
422 dammit.declared_html_encoding,422 dammit.declared_html_encoding,
423 dammit.contains_replacement_characters)423 dammit.contains_replacement_characters)
424424
425 def feed(self, markup:str):425 def feed(self, markup:_RawMarkup) -> None:
426 args, kwargs = self.parser_args426 args, kwargs = self.parser_args
427
428 # HTMLParser.feed will only handle str, but
429 # BeautifulSoup.markup is allowed to be _RawMarkup, because
430 # it's set by the yield value of
431 # TreeBuilder.prepare_markup. Fortunately,
432 # HTMLParserTreeBuilder.prepare_markup always yields a str
433 # (UnicodeDammit.unicode_markup).
434 assert isinstance(markup, str)
435
427 # We know BeautifulSoup calls TreeBuilder.initialize_soup436 # We know BeautifulSoup calls TreeBuilder.initialize_soup
428 # before calling feed(), so we can assume self.soup437 # before calling feed(), so we can assume self.soup
429 # is set.438 # is set.
430 assert self.soup is not None439 assert self.soup is not None
431 parser = BeautifulSoupHTMLParser(self.soup, *args, **kwargs)440 parser = BeautifulSoupHTMLParser(self.soup, *args, **kwargs)
441
432 try:442 try:
433 parser.feed(markup)443 parser.feed(markup)
434 parser.close()444 parser.close()
diff --git a/bs4/builder/_lxml.py b/bs4/builder/_lxml.py
index ba87e87..3dfe88a 100644
--- a/bs4/builder/_lxml.py
+++ b/bs4/builder/_lxml.py
@@ -269,7 +269,7 @@ class LXMLTreeBuilderForXML(TreeBuilder):
269 for encoding in detector.encodings:269 for encoding in detector.encodings:
270 yield (detector.markup, encoding, document_declared_encoding, False)270 yield (detector.markup, encoding, document_declared_encoding, False)
271271
272 def feed(self, markup:Union[bytes,str]) -> None:272 def feed(self, markup:_RawMarkup) -> None:
273 io: IO273 io: IO
274 if isinstance(markup, bytes):274 if isinstance(markup, bytes):
275 io = BytesIO(markup)275 io = BytesIO(markup)
diff --git a/bs4/diagnose.py b/bs4/diagnose.py
index 201b879..c2202ad 100644
--- a/bs4/diagnose.py
+++ b/bs4/diagnose.py
@@ -9,7 +9,15 @@ from html.parser import HTMLParser
9import bs49import bs4
10from bs4 import BeautifulSoup, __version__ 10from bs4 import BeautifulSoup, __version__
11from bs4.builder import builder_registry11from bs4.builder import builder_registry
12from typing import TYPE_CHECKING12from typing import (
13 Any,
14 IO,
15 List,
16 Optional,
17 Tuple,
18 TYPE_CHECKING,
19)
20
13if TYPE_CHECKING:21if TYPE_CHECKING:
14 from bs4._typing import _IncomingMarkup22 from bs4._typing import _IncomingMarkup
1523
@@ -78,7 +86,7 @@ def diagnose(data:_IncomingMarkup) -> None:
7886
79 print(("-" * 80))87 print(("-" * 80))
8088
81def lxml_trace(data, html:bool=True, **kwargs) -> None:89def lxml_trace(data:_IncomingMarkup, html:bool=True, **kwargs:Any) -> None:
82 """Print out the lxml events that occur during parsing.90 """Print out the lxml events that occur during parsing.
8391
84 This lets you see how lxml parses a document when no Beautiful92 This lets you see how lxml parses a document when no Beautiful
@@ -94,7 +102,8 @@ def lxml_trace(data, html:bool=True, **kwargs) -> None:
94 recover = kwargs.pop('recover', True)102 recover = kwargs.pop('recover', True)
95 if isinstance(data, str):103 if isinstance(data, str):
96 data = data.encode("utf8")104 data = data.encode("utf8")
97 reader = BytesIO(data)105 # Wrap bytes in BytesIO; assume anything else is already file-like.
 106 reader = BytesIO(data) if isinstance(data, bytes) else data
98 for event, element in etree.iterparse(107 for event, element in etree.iterparse(
99 reader, html=html, recover=recover, **kwargs108 reader, html=html, recover=recover, **kwargs
100 ):109 ):
@@ -108,37 +117,40 @@ class AnnouncingParser(HTMLParser):
108 document. The easiest way to do this is to call `htmlparser_trace`.117 document. The easiest way to do this is to call `htmlparser_trace`.
109 """118 """
110119
111 def _p(self, s):120 def _p(self, s:str) -> None:
112 print(s)121 print(s)
113122
114 def handle_starttag(self, name, attrs):123 def handle_starttag(
124 self, name:str, attrs:List[Tuple[str, Optional[str]]],
125 handle_empty_element:bool=True
126 ) -> None:
115 self._p(f"{name} {attrs} START")127 self._p(f"{name} {attrs} START")
116128
117 def handle_endtag(self, name):129 def handle_endtag(self, name:str, check_already_closed:bool=True) -> None:
118 self._p("%s END" % name)130 self._p("%s END" % name)
119131
120 def handle_data(self, data):132 def handle_data(self, data:str) -> None:
121 self._p("%s DATA" % data)133 self._p("%s DATA" % data)
122134
123 def handle_charref(self, name):135 def handle_charref(self, name:str) -> None:
124 self._p("%s CHARREF" % name)136 self._p("%s CHARREF" % name)
125137
126 def handle_entityref(self, name):138 def handle_entityref(self, name:str) -> None:
127 self._p("%s ENTITYREF" % name)139 self._p("%s ENTITYREF" % name)
128140
129 def handle_comment(self, data):141 def handle_comment(self, data:str) -> None:
130 self._p("%s COMMENT" % data)142 self._p("%s COMMENT" % data)
131143
132 def handle_decl(self, data):144 def handle_decl(self, data:str) -> None:
133 self._p("%s DECL" % data)145 self._p("%s DECL" % data)
134146
135 def unknown_decl(self, data):147 def unknown_decl(self, data:str) -> None:
136 self._p("%s UNKNOWN-DECL" % data)148 self._p("%s UNKNOWN-DECL" % data)
137149
138 def handle_pi(self, data):150 def handle_pi(self, data:str) -> None:
139 self._p("%s PI" % data)151 self._p("%s PI" % data)
140152
141def htmlparser_trace(data):153def htmlparser_trace(data:str) -> None:
142 """Print out the HTMLParser events that occur during parsing.154 """Print out the HTMLParser events that occur during parsing.
143155
144 This lets you see how HTMLParser parses a document when no156 This lets you see how HTMLParser parses a document when no
@@ -226,7 +238,7 @@ def benchmark_parsers(num_elements:int=100000) -> None:
226 b = time.time()238 b = time.time()
227 print(("Raw html5lib parsed the markup in %.2fs." % (b-a)))239 print(("Raw html5lib parsed the markup in %.2fs." % (b-a)))
228240
229def profile(num_elements:int=100000, parser:str="lxml"):241def profile(num_elements:int=100000, parser:str="lxml") -> None:
230 """Use Python's profiler on a randomly generated document."""242 """Use Python's profiler on a randomly generated document."""
231 filehandle = tempfile.NamedTemporaryFile()243 filehandle = tempfile.NamedTemporaryFile()
232 filename = filehandle.name244 filename = filehandle.name
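
For context on the bs4.diagnose helpers whose signatures are tightened above, a minimal usage sketch (the markup string is made up; diagnose() and htmlparser_trace() are the functions shown in these hunks):

    from bs4.diagnose import diagnose, htmlparser_trace

    markup = "<p>Some <b>unclosed markup"

    # Reports which tree builders are installed and how each one parses the markup.
    diagnose(markup)

    # Prints one line per html.parser event, via the AnnouncingParser hooks above.
    htmlparser_trace(markup)
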
diff --git a/bs4/element.py b/bs4/element.py
index 83f4882..f4ab89c 100644
--- a/bs4/element.py
+++ b/bs4/element.py
@@ -44,6 +44,7 @@ if TYPE_CHECKING:
44 from bs4 import BeautifulSoup44 from bs4 import BeautifulSoup
45 from bs4.builder import TreeBuilder45 from bs4.builder import TreeBuilder
46 from bs4.dammit import _Encoding46 from bs4.dammit import _Encoding
47 from bs4.filter import ElementFilter
47 from bs4.formatter import (48 from bs4.formatter import (
48 _EntitySubstitutionFunction,49 _EntitySubstitutionFunction,
49 _FormatterOrName,50 _FormatterOrName,
@@ -901,7 +902,7 @@ class PageElement(object):
901 limit:Optional[int],902 limit:Optional[int],
902 generator:Iterator[PageElement],903 generator:Iterator[PageElement],
903 _stacklevel:int=3,904 _stacklevel:int=3,
904 **kwargs:_StrainableAttribute) -> ResultSet[PageElement]: 905 **kwargs:_StrainableAttribute) -> ResultSet[PageElement]:
905 """Iterates over a generator looking for things that match."""906 """Iterates over a generator looking for things that match."""
906 results: ResultSet[PageElement]907 results: ResultSet[PageElement]
907 908
@@ -912,11 +913,11 @@ class PageElement(object):
912 DeprecationWarning, stacklevel=_stacklevel913 DeprecationWarning, stacklevel=_stacklevel
913 )914 )
914915
915 from bs4.strainer import SoupStrainer916 from bs4.filter import ElementFilter
916 if isinstance(name, SoupStrainer):917 if isinstance(name, ElementFilter):
917 strainer = name918 matcher = name
918 else:919 else:
919 strainer = SoupStrainer(name, attrs, string, **kwargs)920 matcher = SoupStrainer(name, attrs, string, **kwargs)
920921
921 result: Iterable[PageElement]922 result: Iterable[PageElement]
922 if string is None and not limit and not attrs and not kwargs:923 if string is None and not limit and not attrs and not kwargs:
@@ -924,7 +925,7 @@ class PageElement(object):
924 # Optimization to find all tags.925 # Optimization to find all tags.
925 result = (element for element in generator926 result = (element for element in generator
926 if isinstance(element, Tag))927 if isinstance(element, Tag))
927 return ResultSet(strainer, result)928 return ResultSet(matcher, result)
928 elif isinstance(name, str):929 elif isinstance(name, str):
929 # Optimization to find all tags with a given name.930 # Optimization to find all tags with a given name.
930 if name.count(':') == 1:931 if name.count(':') == 1:
@@ -945,22 +946,25 @@ class PageElement(object):
945 )946 )
946 ):947 ):
947 result.append(element)948 result.append(element)
948 return ResultSet(strainer, result)949 return ResultSet(matcher, result)
950 return self.match(generator, matcher, limit)
951
952 def match(self, generator:Iterator[PageElement], matcher:ElementFilter, limit:Optional[int]=None) -> ResultSet[PageElement]:
953 """The most generic search method offered by Beautiful Soup.
949954
950 results = ResultSet(strainer)955 You can pass in your own technique for iterating over the tree, and your own
956 technique for matching items.
957 """
958 results:ResultSet = ResultSet(matcher)
951 while True:959 while True:
952 try:960 try:
953 i = next(generator)961 i = next(generator)
954 except StopIteration:962 except StopIteration:
955 break963 break
956 if i:964 if i:
957 # TODO: SoupStrainer.search is a confusing method965 if matcher.match(i):
958 # that needs to be redone, and this is where966 results.append(i)
959 # it's being used.967 if limit is not None and len(results) >= limit:
960 found = strainer.search(i)
961 if found:
962 results.append(found)
963 if limit and len(results) >= limit:
964 break968 break
965 return results969 return results
966970
@@ -1254,7 +1258,7 @@ class Declaration(PreformattedString):
1254class Doctype(PreformattedString):1258class Doctype(PreformattedString):
1255 """A `document type declaration <https://www.w3.org/TR/REC-xml/#dt-doctype>`_."""1259 """A `document type declaration <https://www.w3.org/TR/REC-xml/#dt-doctype>`_."""
1256 @classmethod1260 @classmethod
1257 def for_name_and_ids(cls, name:str, pub_id:str, system_id:str) -> Doctype:1261 def for_name_and_ids(cls, name:str, pub_id:Optional[str], system_id:Optional[str]) -> Doctype:
1258 """Generate an appropriate document type declaration for a given1262 """Generate an appropriate document type declaration for a given
1259 public ID and system ID.1263 public ID and system ID.
12601264
@@ -2503,12 +2507,12 @@ class Tag(PageElement):
2503_PageElementT = TypeVar("_PageElementT", bound=PageElement)2507_PageElementT = TypeVar("_PageElementT", bound=PageElement)
2504class ResultSet(List[_PageElementT], Generic[_PageElementT]):2508class ResultSet(List[_PageElementT], Generic[_PageElementT]):
2505 """A ResultSet is a list of `PageElement` objects, gathered as the result2509 """A ResultSet is a list of `PageElement` objects, gathered as the result
2506 of matching a `SoupStrainer` against a parse tree. Basically, a list of2510 of matching an `ElementFilter` against a parse tree. Basically, a list of
2507 search results.2511 search results.
2508 """2512 """
2509 source: Optional[SoupStrainer]2513 source: Optional[ElementFilter]
25102514
2511 def __init__(self, source:Optional[SoupStrainer], result: Iterable[_PageElementT]=()) -> None:2515 def __init__(self, source:Optional[ElementFilter], result: Iterable[_PageElementT]=()) -> None:
2512 super(ResultSet, self).__init__(result)2516 super(ResultSet, self).__init__(result)
2513 self.source = source2517 self.source = source
25142518
@@ -2522,4 +2526,4 @@ class ResultSet(List[_PageElementT], Generic[_PageElementT]):
2522# import SoupStrainer itself into this module to preserve the2526# import SoupStrainer itself into this module to preserve the
2523# backwards compatibility of anyone who imports2527# backwards compatibility of anyone who imports
2524# bs4.element.SoupStrainer.2528# bs4.element.SoupStrainer.
2525from bs4.strainer import SoupStrainer2529from bs4.filter import SoupStrainer
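
A short sketch of the new PageElement.match() entry point shown in the hunk above, pairing an arbitrary tree iterator with an ElementFilter (ElementFilter comes from bs4/filter.py later in this diff; the markup and lambda are illustrative):

    from bs4 import BeautifulSoup
    from bs4.element import Tag
    from bs4.filter import ElementFilter

    soup = BeautifulSoup("<p>one <b>two</b> three</p>", "html.parser")

    # Match only <b> tags; NavigableStrings short-circuit to False.
    bold_only = ElementFilter(
        match_function=lambda el: isinstance(el, Tag) and el.name == "b"
    )

    # Any iterator over the tree can be combined with any matcher.
    results = soup.match(soup.descendants, bold_only, limit=1)
    print(results)  # [<b>two</b>]
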
diff --git a/bs4/strainer.py b/bs4/filter.py
2526similarity index 60%2530similarity index 60%
2527rename from bs4/strainer.py2531rename from bs4/strainer.py
2528rename to bs4/filter.py2532rename to bs4/filter.py
index 15b289c..74e26d9 100644
--- a/bs4/strainer.py
+++ b/bs4/filter.py
@@ -25,6 +25,10 @@ from bs4._deprecation import _deprecated
25from bs4.element import NavigableString, PageElement, Tag25from bs4.element import NavigableString, PageElement, Tag
26from bs4._typing import (26from bs4._typing import (
27 _AttributeValue,27 _AttributeValue,
28 _AttributeValues,
29 _AllowStringCreationFunction,
30 _AllowTagCreationFunction,
31 _PageElementMatchFunction,
28 _TagMatchFunction,32 _TagMatchFunction,
29 _StringMatchFunction,33 _StringMatchFunction,
30 _StrainableElement,34 _StrainableElement,
@@ -33,13 +37,96 @@ from bs4._typing import (
33 _StrainableString,37 _StrainableString,
34)38)
3539
40
41class ElementFilter(object):
42 """ElementFilters encapsulate the logic necessary to decide:
43
44 1. whether a PageElement (a tag or a string) matches a
45 user-specified query.
46
47 2. whether a given sequence of markup found during initial parsing
48 should be turned into a PageElement, or simply discarded.
49
50 The base class is the simplest ElementFilter. By default, it
51 matches everything and allows all PageElements to be created. You
52 can make it more selective by passing in user-defined functions.
53
54 Most users of Beautiful Soup will never need to use
55 ElementFilter, or its more capable subclass
56 SoupStrainer. Instead, they will use the find_* methods, which
57 will convert their arguments into SoupStrainer objects and run them
58 against the tree.
59 """
 60 match_function: Optional[_PageElementMatchFunction]
61 allow_tag_creation_function: Optional[_AllowTagCreationFunction]
62 allow_string_creation_function: Optional[_AllowStringCreationFunction]
63
64 def __init__(
65 self, match_function:Optional[_PageElementMatchFunction]=None,
66 allow_tag_creation_function:Optional[_AllowTagCreationFunction]=None,
67 allow_string_creation_function:Optional[_AllowStringCreationFunction]=None):
68 self.match_function = match_function
69 self.allow_tag_creation_function = allow_tag_creation_function
70 self.allow_string_creation_function = allow_string_creation_function
71
72 @property
73 def excludes_everything(self) -> bool:
74 """Does this ElementFilter obviously exclude everything? If
75 so, Beautiful Soup will issue a warning if you try to use it
76 when parsing a document.
77
78 The ElementFilter might turn out to exclude everything even
79 if this returns False, but it won't do so in an obvious way.
80
81 The default ElementFilter excludes *nothing*, and we don't
82 have any way of answering questions about more complex
83 ElementFilters without running their hook functions, so the
84 base implementation always returns False.
85 """
86 return False
87
88 def match(self, element:PageElement) -> bool:
89 """Does the given PageElement match the rules set down by this
90 ElementFilter?
91
92 The base implementation delegates to the function passed in to
93 the constructor.
94 """
95 if not self.match_function:
96 return True
97 return self.match_function(element)
98
99 def allow_tag_creation(
100 self, nsprefix:Optional[str], name:str,
101 attrs:Optional[_AttributeValues]
102 ) -> bool:
103 """Based on the name and attributes of a tag, see whether this
104 ElementFilter will allow a Tag object to even be created.
105
106 :param name: The name of the prospective tag.
107 :param attrs: The attributes of the prospective tag.
108 """
109 if not self.allow_tag_creation_function:
110 return True
111 return self.allow_tag_creation_function(nsprefix, name, attrs)
112
113 def allow_string_creation(self, string:str) -> bool:
114 if not self.allow_string_creation_function:
115 return True
116 return self.allow_string_creation_function(string)
117
118
36class MatchRule(object):119class MatchRule(object):
120 """Each MatchRule encapsulates the logic behind a single argument
121 passed in to one of the Beautiful Soup find* methods.
122 """
123
37 string: Optional[str]124 string: Optional[str]
38 pattern: Optional[Pattern[str]]125 pattern: Optional[Pattern[str]]
39 present: Optional[bool]126 present: Optional[bool]
40127 # TODO-TYPING: All MatchRule objects also have an attribute
41 # All MatchRule objects also have an attribute ``function``, but128 # ``function``, but the type of the function depends on the
42 # the type of the function depends on the subclass.129 # subclass.
43 130
44 def __init__(131 def __init__(
45 self,132 self,
@@ -72,7 +159,7 @@ class MatchRule(object):
72 "At most one of string, pattern, function and present must be provided."159 "At most one of string, pattern, function and present must be provided."
73 )160 )
74 161
75 def _base_match(self, string:str) -> Optional[bool]:162 def _base_match(self, string:Optional[str]) -> Optional[bool]:
76 """Run the 'cheap' portion of a match, trying to get an answer without163 """Run the 'cheap' portion of a match, trying to get an answer without
77 calling a potentially expensive custom function.164 calling a potentially expensive custom function.
78165
@@ -101,7 +188,7 @@ class MatchRule(object):
101188
102 return None189 return None
103 190
104 def matches_string(self, string:str) -> bool:191 def matches_string(self, string:Optional[str]) -> bool:
105 _base_result = self._base_match(string)192 _base_result = self._base_match(string)
106 if _base_result is not None:193 if _base_result is not None:
107 # No need to invoke the test function.194 # No need to invoke the test function.
@@ -125,6 +212,7 @@ class MatchRule(object):
125 )212 )
126 213
127class TagNameMatchRule(MatchRule):214class TagNameMatchRule(MatchRule):
215 """A MatchRule implementing the rules for matches against tag name."""
128 function: Optional[_TagMatchFunction]216 function: Optional[_TagMatchFunction]
129217
130 def matches_tag(self, tag:Tag) -> bool:218 def matches_tag(self, tag:Tag) -> bool:
@@ -140,19 +228,25 @@ class TagNameMatchRule(MatchRule):
140 return False228 return False
141 229
142class AttributeValueMatchRule(MatchRule):230class AttributeValueMatchRule(MatchRule):
231 """A MatchRule implementing the rules for matches against attribute value."""
143 function: Optional[_StringMatchFunction]232 function: Optional[_StringMatchFunction]
144233
145class StringMatchRule(MatchRule):234class StringMatchRule(MatchRule):
235 """A MatchRule implementing the rules for matches against a NavigableString."""
146 function: Optional[_StringMatchFunction]236 function: Optional[_StringMatchFunction]
147 237
148class SoupStrainer(object):238class SoupStrainer(ElementFilter):
149 """Encapsulates a number of ways of matching a markup element (a tag239 """The ElementFilter subclass used internally by Beautiful Soup.
150 or a string).
151240
152 These are primarily created internally and used to underpin the241 A SoupStrainer encapsulates the logic necessary to perform the
153 find_* methods, but you can create one yourself and pass it in as242 kind of matches supported by the find_* methods. SoupStrainers are
154 ``parse_only`` to the `BeautifulSoup` constructor, to parse a243 primarily created internally, but you can create one yourself and
155 subset of a large document.244 pass it in as ``parse_only`` to the `BeautifulSoup` constructor,
245 to parse a subset of a large document.
246
247 Internally, SoupStrainer objects work by converting the
248 constructor arguments into MatchRule objects. Incoming
249 tags/markup are matched against those rules.
156250
157 :param name: One or more restrictions on the tags found in a251 :param name: One or more restrictions on the tags found in a
158 document.252 document.
@@ -226,6 +320,17 @@ class SoupStrainer(object):
226 self.__string = string320 self.__string = string
227321
228 @property322 @property
323 def excludes_everything(self) -> bool:
324 """Check whether the provided rules will obviously exclude
325 everything. (They might exclude everything even if this returns False,
326 but not in an obvious way.)
327 """
 328 return bool(
329 self.string_rules and
330 (self.name_rules or self.attribute_rules)
 331 )
332
333 @property
229 def string(self) -> Optional[_StrainableString]:334 def string(self) -> Optional[_StrainableString]:
230 ":meta private:"335 ":meta private:"
231 warnings.warn(f"Access to deprecated property string. (Look at .string_rules instead) -- Deprecated since version 4.13.0.", DeprecationWarning, stacklevel=2)336 warnings.warn(f"Access to deprecated property string. (Look at .string_rules instead) -- Deprecated since version 4.13.0.", DeprecationWarning, stacklevel=2)
@@ -262,6 +367,15 @@ class SoupStrainer(object):
262 yield rule_class(function=obj)367 yield rule_class(function=obj)
263 elif isinstance(obj, Pattern):368 elif isinstance(obj, Pattern):
264 yield rule_class(pattern=obj)369 yield rule_class(pattern=obj)
370 elif hasattr(obj, 'search'):
371 # We do a little duck typing here to detect usage of the
 372 # third-party regex library, whose pattern objects don't
373 # derive from re.Pattern.
374 #
375 # TODO-TYPING: Once we drop support for Python 3.7, we
376 # might be able to address this by defining an appropriate
377 # Protocol.
378 yield rule_class(pattern=obj)
265 elif hasattr(obj, '__iter__'):379 elif hasattr(obj, '__iter__'):
266 for o in obj:380 for o in obj:
267 if not isinstance(o, (bytes, str)) and hasattr(o, '__iter__'):381 if not isinstance(o, (bytes, str)) and hasattr(o, '__iter__'):
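
The duck-typing branch above is there so that compiled patterns from the third-party regex package, whose pattern objects expose .search() without subclassing re.Pattern, work wherever re patterns are accepted. A hypothetical sketch, assuming that package is installed:

    import regex  # third-party package, not the stdlib re module
    from bs4 import BeautifulSoup

    soup = BeautifulSoup('<a href="https://example.com/">link</a>', "html.parser")

    # The compiled pattern is routed through the hasattr(obj, 'search') branch above.
    soup.find_all("a", href=regex.compile(r"https?://"))
    # [<a href="https://example.com/">link</a>]
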
@@ -358,7 +472,7 @@ class SoupStrainer(object):
358 else:472 else:
359 attr_values = [cast(str, attr_value)]473 attr_values = [cast(str, attr_value)]
360474
361 def _match_attribute_value_helper(attr_values:Sequence[Optional[str]]):475 def _match_attribute_value_helper(attr_values:Sequence[Optional[str]]) -> bool:
362 for rule in rules:476 for rule in rules:
363 for attr_value in attr_values:477 for attr_value in attr_values:
364 if rule.matches_string(attr_value):478 if rule.matches_string(attr_value):
@@ -382,8 +496,8 @@ class SoupStrainer(object):
382 [joined_attr_value]496 [joined_attr_value]
383 )497 )
384 return this_attr_match498 return this_attr_match
385 499
386 def allow_tag_creation(self, nsprefix:Optional[str], name:str, attrs:Optional[dict[str, str]]) -> bool:500 def allow_tag_creation(self, nsprefix:Optional[str], name:str, attrs:Optional[_AttributeValues]) -> bool:
387 """Based on the name and attributes of a tag, see whether this501 """Based on the name and attributes of a tag, see whether this
388 SoupStrainer will allow a Tag object to even be created.502 SoupStrainer will allow a Tag object to even be created.
389503
@@ -423,17 +537,25 @@ class SoupStrainer(object):
423 return True537 return True
424538
425 def allow_string_creation(self, string:str) -> bool:539 def allow_string_creation(self, string:str) -> bool:
540 """Based on the content of a markup string, see whether this
541 SoupStrainer will allow it to be instantiated as a
542 NavigableString object, or whether it should be ignored.
543 """
426 if self.name_rules or self.attribute_rules:544 if self.name_rules or self.attribute_rules:
427 # A SoupStrainer that has name or attribute rules won't545 # A SoupStrainer that has name or attribute rules won't
428 # match any strings; it's designed to match tags with546 # match any strings; it's designed to match tags with
429 # certain properties.547 # certain properties.
430 return False548 return False
549 if not self.string_rules:
550 # A SoupStrainer with no string rules will match
551 # all strings.
552 return True
431 if not self.matches_any_string_rule(string):553 if not self.matches_any_string_rule(string):
432 return False554 return False
433 return True555 return True
434 556
435 def matches_any_string_rule(self, string:str) -> bool:557 def matches_any_string_rule(self, string:str) -> bool:
436 """See whether the content of a string, matches any of 558 """See whether the content of a string matches any of
437 this SoupStrainer's string rules.559 this SoupStrainer's string rules.
438 """560 """
439 if not self.string_rules:561 if not self.string_rules:
@@ -442,28 +564,37 @@ class SoupStrainer(object):
442 if string_rule.matches_string(string):564 if string_rule.matches_string(string):
443 return True565 return True
444 return False566 return False
445 567
446 568 def match(self, element:PageElement) -> bool:
569 """Does the given PageElement match the rules set down by this
570 SoupStrainer?
571
572 The find_* methods rely heavily on this method to find matches.
573
574 :param element: A PageElement.
575 :return: True if the element matches this SoupStrainer's rules; False otherwise.
576 """
577 if isinstance(element, Tag):
578 return self.matches_tag(element)
579 assert isinstance(element, NavigableString)
580 if not (self.name_rules or self.attribute_rules):
581 # A NavigableString can only match a SoupStrainer that
582 # does not define any name or attribute restrictions.
583 for rule in self.string_rules:
584 if rule.matches_string(element):
585 return True
586 return False
587
447 @_deprecated("allow_tag_creation", "4.13.0")588 @_deprecated("allow_tag_creation", "4.13.0")
448 def search_tag(self, name, attrs):589 def search_tag(self, name:str, attrs:Optional[_AttributeValues]) -> bool:
590 """A less elegant version of allow_tag_creation()."""
449 ":meta private:"591 ":meta private:"
450 return self.allow_tag_creation(None, name, attrs)592 return self.allow_tag_creation(None, name, attrs)
451 593
452 def search(self, element:PageElement):594 @_deprecated("match", "4.13.0")
453 # TODO: This method needs to be removed or redone. It is595 def search(self, element:PageElement) -> Optional[PageElement]:
454 # very confusing but it's used everywhere.596 """A less elegant version of match().
455 match = None
456 if isinstance(element, Tag):
457 match = self.matches_tag(element)
458 else:
459 assert isinstance(element, NavigableString)
460 match = False
461 if not (self.name_rules or self.attribute_rules):
462 # A NavigableString can only match a SoupStrainer that
463 # does not define any name or attribute restrictions.
464 for rule in self.string_rules:
465 if rule.matches_string(element):
466 match = True
467 break
468 return element if match else False
469597
598 :meta private:
599 """
600 return element if self.match(element) else None
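
To summarize the public surface of the relocated module: SoupStrainer.match() is the boolean replacement for the deprecated search(), and excludes_everything flags strainers that can never match anything. A minimal sketch (markup and names are illustrative):

    from bs4 import BeautifulSoup
    from bs4.filter import SoupStrainer

    soup = BeautifulSoup("<a>link</a><b>bold</b>", "html.parser")
    strainer = SoupStrainer(name="a")

    strainer.match(soup.a)  # True
    strainer.match(soup.b)  # False

    # Combining string rules with tag rules excludes every PageElement,
    # which is exactly what the parse_only warning checks for.
    SoupStrainer(name="a", string="link").excludes_everything  # True
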
diff --git a/bs4/tests/__init__.py b/bs4/tests/__init__.py
index 2ef7fd8..3ef999d 100644
--- a/bs4/tests/__init__.py
+++ b/bs4/tests/__init__.py
@@ -20,7 +20,7 @@ from bs4.element import (
20 Stylesheet,20 Stylesheet,
21 Tag21 Tag
22)22)
23from bs4.strainer import SoupStrainer23from bs4.filter import SoupStrainer
24from bs4.builder import (24from bs4.builder import (
25 DetectsXMLParsedAsHTML,25 DetectsXMLParsedAsHTML,
26 XMLParsedAsHTMLWarning,26 XMLParsedAsHTMLWarning,
diff --git a/bs4/tests/test_strainer.py b/bs4/tests/test_filter.py
27similarity index 56%27similarity index 56%
28rename from bs4/tests/test_strainer.py28rename from bs4/tests/test_strainer.py
29rename to bs4/tests/test_filter.py29rename to bs4/tests/test_filter.py
index 4de03f0..8d5da70 100644
--- a/bs4/tests/test_strainer.py
+++ b/bs4/tests/test_filter.py
@@ -6,20 +6,108 @@ from . import (
6 SoupTest,6 SoupTest,
7)7)
8from bs4.element import Tag8from bs4.element import Tag
9from bs4.strainer import (9from bs4.filter import (
10 AttributeValueMatchRule,10 AttributeValueMatchRule,
11 ElementFilter,
11 MatchRule,12 MatchRule,
12 SoupStrainer,13 SoupStrainer,
13 StringMatchRule,14 StringMatchRule,
14 TagNameMatchRule,15 TagNameMatchRule,
15)16)
1617
17class TestMatchrule(SoupTest):18class TestElementFilter(SoupTest):
19
20 def test_default_behavior(self):
21 # An unconfigured ElementFilter matches absolutely everything.
22 selector = ElementFilter()
23 assert not selector.excludes_everything
24 soup = self.soup("<a>text</a>")
25 tag = soup.a
26 string = tag.string
27 assert True == selector.match(soup)
28 assert True == selector.match(tag)
29 assert True == selector.match(string)
30 assert soup.find(selector).name == "a"
31
32 # And allows any incoming markup to be turned into PageElements.
33 assert True == selector.allow_tag_creation(None, "tag", None)
34 assert True == selector.allow_string_creation("some string")
35
36 def test_match(self):
37 def m(pe):
38 return (pe.string == "allow" or (
39 isinstance(pe, Tag) and pe.name=="allow"))
40
41 soup = self.soup("<allow>deny</allow>allow<deny>deny</deny>")
42 allow_tag = soup.allow
43 allow_string = soup.find(string="allow")
44 deny_tag = soup.deny
45 deny_string = soup.find(string="deny")
46
47 selector = ElementFilter(match_function=m)
48 assert True == selector.match(allow_tag)
49 assert True == selector.match(allow_string)
50 assert False == selector.match(deny_tag)
51 assert False == selector.match(deny_string)
52
53 # Since only the match function was provided, there is
54 # no effect on tag or string creation.
55 soup = self.soup("<a>text</a>", parse_only=selector)
56 assert "text" == soup.a.string
57
58 def test_allow_tag_creation(self):
59 def m(nsprefix, name, attrs):
60 return nsprefix=="allow" or name=="allow" or "allow" in attrs
61 selector = ElementFilter(allow_tag_creation_function=m)
62 f = selector.allow_tag_creation
63 assert True == f("allow", "ignore", {})
64 assert True == f("ignore", "allow", {})
65 assert True == f(None, "ignore", {"allow": "1"})
66 assert False == f("no", "no", {"no" : "nope"})
67
68 # Test the ElementFilter as a value for parse_only.
69 soup = self.soup(
70 "<deny>deny</deny> <allow>deny</allow> allow",
71 parse_only=selector
72 )
1873
19 def _tuple(self, rule):74 # The <deny> tag was filtered out, but there was no effect on
20 if isinstance(rule.pattern, str):75 # the strings, since only allow_tag_creation_function was
21 import pdb; pdb.set_trace()76 # defined.
77 assert 'deny <allow>deny</allow> allow' == soup.decode()
78
79 # Similarly, since match_function was not defined, this
80 # ElementFilter matches everything.
81 assert soup.find(selector) == "deny"
82
83 def test_allow_string_creation(self):
84 def m(s):
85 return s=="allow"
86 selector = ElementFilter(allow_string_creation_function=m)
87 f = selector.allow_string_creation
88 assert True == f("allow")
89 assert False == f("deny")
90 assert False == f("please allow")
91
92 # Test the ElementFilter as a value for parse_only.
93 soup = self.soup(
94 "<deny>deny</deny> <allow>deny</allow> allow",
95 parse_only=selector
96 )
97
98 # All incoming strings other than "allow" (even whitespace)
99 # were filtered out, but there was no effect on the tags,
100 # since only allow_string_creation_function was defined.
101 assert '<deny>deny</deny><allow>deny</allow>' == soup.decode()
102
103 # Similarly, since match_function was not defined, this
104 # ElementFilter matches everything.
105 assert soup.find(selector).name == "deny"
22106
107
108class TestMatchRule(SoupTest):
109
110 def _tuple(self, rule):
23 return (111 return (
24 rule.string,112 rule.string,
25 rule.pattern.pattern if rule.pattern else None,113 rule.pattern.pattern if rule.pattern else None,
@@ -155,6 +243,28 @@ class TestSoupStrainer(SoupTest):
155 assert w2.filename == __file__243 assert w2.filename == __file__
156 assert msg == "Access to deprecated property text. (Look at .string_rules instead) -- Deprecated since version 4.13.0."244 assert msg == "Access to deprecated property text. (Look at .string_rules instead) -- Deprecated since version 4.13.0."
157245
246 def test_search_tag_deprecated(self):
247 strainer = SoupStrainer(name="a")
248 with warnings.catch_warnings(record=True) as w:
249 assert False == strainer.search_tag("b", {})
250 [w1] = w
251 msg = str(w1.message)
252 assert w1.filename == __file__
253 assert msg == "Call to deprecated method search_tag. (Replaced by allow_tag_creation) -- Deprecated since version 4.13.0."
254
255 def test_search_deprecated(self):
256 strainer = SoupStrainer(name="a")
257 soup = self.soup("<a></a><b></b>")
258 with warnings.catch_warnings(record=True) as w:
259 assert soup.a == strainer.search(soup.a)
260 assert None == strainer.search(soup.b)
261 [w1, w2] = w
262 msg = str(w1.message)
263 assert msg == str(w2.message)
264 assert w1.filename == __file__
265 assert msg == "Call to deprecated method search. (Replaced by match) -- Deprecated since version 4.13.0."
266
267 # Dummy function used within tests.
158 def _match_function(x):268 def _match_function(x):
159 pass269 pass
160 270
@@ -213,7 +323,7 @@ class TestSoupStrainer(SoupTest):
213 )323 )
214324
215 def test_constructor_with_overlapping_attributes(self):325 def test_constructor_with_overlapping_attributes(self):
216 # If you specify the same attribute in arts and **kwargs, you end up326 # If you specify the same attribute in args and **kwargs, you end up
217 # with two different AttributeValueMatchRule objects.327 # with two different AttributeValueMatchRule objects.
218328
219 # This happens whether you use the 'class' shortcut on attrs...329 # This happens whether you use the 'class' shortcut on attrs...
@@ -437,17 +547,24 @@ class TestSoupStrainer(SoupTest):
437 # because the string restrictions can't be evaluated during547 # because the string restrictions can't be evaluated during
438 # the parsing process, and the tag restrictions eliminate548 # the parsing process, and the tag restrictions eliminate
439 # any strings from consideration.549 # any strings from consideration.
550 #
551 # We can detect this ahead of time, and warn about it,
552 # thanks to SoupStrainer.excludes_everything
440 markup = "<a><b>one string<div>another string</div></b></a>"553 markup = "<a><b>one string<div>another string</div></b></a>"
441554
442 with warnings.catch_warnings(record=True) as w:555 with warnings.catch_warnings(record=True) as w:
 556 assert True == soupstrainer.excludes_everything
443 assert "" == self.soup(markup, parse_only=soupstrainer).decode()557 assert "" == self.soup(markup, parse_only=soupstrainer).decode()
444 [warning] = w558 [warning] = w
445 msg = str(warning.message)559 msg = str(warning.message)
446 assert warning.filename == __file__560 assert warning.filename == __file__
447 assert str(warning.message).startswith(561 assert str(warning.message).startswith(
448 "Value for parse_only will exclude everything, since it puts restrictions on both tags and strings:"562 "The given value for parse_only will exclude everything:"
449 )563 )
450 564
565 # The average SoupStrainer has excludes_everything=False
566 assert not SoupStrainer().excludes_everything
567
451 def test_documentation_examples(self):568 def test_documentation_examples(self):
452 """Medium-weight real-world tests based on the Beautiful Soup569 """Medium-weight real-world tests based on the Beautiful Soup
453 documentation.570 documentation.
diff --git a/bs4/tests/test_html5lib.py b/bs4/tests/test_html5lib.py
index b0f4384..9f6dfa1 100644
--- a/bs4/tests/test_html5lib.py
+++ b/bs4/tests/test_html5lib.py
@@ -4,7 +4,7 @@ import pytest
4import warnings4import warnings
55
6from bs4 import BeautifulSoup6from bs4 import BeautifulSoup
7from bs4.strainer import SoupStrainer7from bs4.filter import SoupStrainer
8from . import (8from . import (
9 HTML5LIB_PRESENT,9 HTML5LIB_PRESENT,
10 HTML5TreeBuilderSmokeTest,10 HTML5TreeBuilderSmokeTest,
@@ -24,7 +24,7 @@ class TestHTML5LibBuilder(SoupTest, HTML5TreeBuilderSmokeTest):
24 return HTML5TreeBuilder24 return HTML5TreeBuilder
2525
26 def test_soupstrainer(self):26 def test_soupstrainer(self):
27 # The html5lib tree builder does not support SoupStrainers.27 # The html5lib tree builder does not support parse_only.
28 strainer = SoupStrainer("b")28 strainer = SoupStrainer("b")
29 markup = "<p>A <b>bold</b> statement.</p>"29 markup = "<p>A <b>bold</b> statement.</p>"
30 with warnings.catch_warnings(record=True) as w:30 with warnings.catch_warnings(record=True) as w:
diff --git a/bs4/tests/test_lxml.py b/bs4/tests/test_lxml.py
index d450740..9fc04e0 100644
--- a/bs4/tests/test_lxml.py
+++ b/bs4/tests/test_lxml.py
@@ -14,7 +14,7 @@ from bs4 import (
14 BeautifulStoneSoup,14 BeautifulStoneSoup,
15 )15 )
16from bs4.element import Comment, Doctype16from bs4.element import Comment, Doctype
17from bs4.strainer import SoupStrainer17from bs4.filter import SoupStrainer
18from . import (18from . import (
19 HTMLTreeBuilderSmokeTest,19 HTMLTreeBuilderSmokeTest,
20 XMLTreeBuilderSmokeTest,20 XMLTreeBuilderSmokeTest,
diff --git a/bs4/tests/test_pageelement.py b/bs4/tests/test_pageelement.py
index 19b4d63..7dfdc22 100644
--- a/bs4/tests/test_pageelement.py
+++ b/bs4/tests/test_pageelement.py
@@ -10,7 +10,7 @@ from bs4.element import (
10 Comment,10 Comment,
11 ResultSet,11 ResultSet,
12)12)
13from bs4.strainer import SoupStrainer13from bs4.filter import SoupStrainer
14from . import (14from . import (
15 SoupTest,15 SoupTest,
16)16)
diff --git a/bs4/tests/test_soup.py b/bs4/tests/test_soup.py
index 4f8ee1a..c95f380 100644
--- a/bs4/tests/test_soup.py
+++ b/bs4/tests/test_soup.py
@@ -27,7 +27,7 @@ from bs4.element import (
27 Tag,27 Tag,
28 NavigableString,28 NavigableString,
29)29)
30from bs4.strainer import SoupStrainer30from bs4.filter import SoupStrainer
3131
32from . import (32from . import (
33 default_builder,33 default_builder,
@@ -293,7 +293,7 @@ class TestWarnings(SoupTest):
293 soup = self.soup("<a><b></b></a>", parse_only=strainer)293 soup = self.soup("<a><b></b></a>", parse_only=strainer)
294 warning = self._assert_warning(w, UserWarning)294 warning = self._assert_warning(w, UserWarning)
295 msg = str(warning.message)295 msg = str(warning.message)
296 assert msg.startswith("Value for parse_only will exclude everything, since it puts restrictions on both tags and strings:")296 assert msg.startswith("The given value for parse_only will exclude everything:")
297 297
298 def test_parseOnlyThese_renamed_to_parse_only(self):298 def test_parseOnlyThese_renamed_to_parse_only(self):
299 with warnings.catch_warnings(record=True) as w:299 with warnings.catch_warnings(record=True) as w:
diff --git a/bs4/tests/test_tree.py b/bs4/tests/test_tree.py
index 606525f..43afb29 100644
--- a/bs4/tests/test_tree.py
+++ b/bs4/tests/test_tree.py
@@ -26,7 +26,7 @@ from bs4.element import (
26 Tag,26 Tag,
27 TemplateString,27 TemplateString,
28)28)
29from bs4.strainer import SoupStrainer29from bs4.filter import SoupStrainer
30from . import (30from . import (
31 SoupTest,31 SoupTest,
32)32)
diff --git a/doc/index.rst b/doc/index.rst
index 7beff36..a414830 100755
--- a/doc/index.rst
+++ b/doc/index.rst
@@ -20,7 +20,7 @@ with examples. I show you what the library is good for, how it works,
20how to use it, how to make it do what you want, and what to do when it20how to use it, how to make it do what you want, and what to do when it
21violates your expectations.21violates your expectations.
2222
23This document covers Beautiful Soup version 4.12.2. The examples in23This document covers Beautiful Soup version 4.13.0. The examples in
24this documentation were written for Python 3.8.24this documentation were written for Python 3.8.
2525
26You might be looking for the documentation for `Beautiful Soup 326You might be looking for the documentation for `Beautiful Soup 3
@@ -2577,6 +2577,11 @@ the human-visible content of the page.*
2577either return the object itself, or nothing, so the only reason to do2577either return the object itself, or nothing, so the only reason to do
2578this is when you're iterating over a mixed list.*2578this is when you're iterating over a mixed list.*
25792579
2580*As of Beautiful Soup version 4.13.0, you can call .string on a
2581NavigableString object. It will return the object itself, so again,
2582the only reason to do this is when you're iterating over a mixed
2583list.*
2584
2580Specifying the parser to use2585Specifying the parser to use
2581============================2586============================
25822587
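
A one-line illustration of the .string note added in this hunk:

    from bs4.element import NavigableString

    s = NavigableString("some text")
    s.string is s  # True, per the 4.13.0 behavior described above
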
@@ -2604,8 +2609,9 @@ specifying one of the following:
26042609
2605The section `Installing a parser`_ contrasts the supported parsers.2610The section `Installing a parser`_ contrasts the supported parsers.
26062611
2607If you don't have an appropriate parser installed, Beautiful Soup will2612If you ask for a parser that isn't installed, Beautiful Soup will
2608ignore your request and pick a different parser. Right now, the only2613raise an exception so that you don't inadvertently parse a document
2614under an unknown set of rules. For example, right now, the only
2609supported XML parser is lxml. If you don't have lxml installed, asking2615supported XML parser is lxml. If you don't have lxml installed, asking
2610for an XML parser won't give you one, and asking for "lxml" won't work2616for an XML parser won't give you one, and asking for "lxml" won't work
2611either.2617either.
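
The revised wording describes the exception raised when a requested parser isn't available; in bs4 that exception is FeatureNotFound. A sketch of what a caller sees when lxml is missing:

    from bs4 import BeautifulSoup, FeatureNotFound

    try:
        soup = BeautifulSoup("<doc/>", "xml")
    except FeatureNotFound:
        # Raised instead of silently falling back to a different parser.
        print("The 'xml' feature requires lxml; install it to parse XML.")
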
@@ -3018,6 +3024,44 @@ been called on it::
3018This is because two different :py:class:`Tag` objects can't occupy the same3024This is because two different :py:class:`Tag` objects can't occupy the same
3019space at the same time.3025space at the same time.
30203026
3027Advanced search techniques
3028==========================
3029
3030Almost everyone who uses Beautiful Soup to extract information from a
3031document can get what they need using the methods described in
3032`Searching the tree`_. However, there's a lower-level interface--the
3033:py:class:`ElementSelector` class-- which lets you define any matching
3034behavior whatsoever.
3035
3036To use :py:class:`ElementSelector`, define a function that takes a
3037:py:class:`PageElement` object (that is, it might be either a
3038:py:class:`Tag` or a :py:class`NavigableString`) and returns ``True``
3039(if the element matches your custom criteria) or ``False`` (if it
3040doesn't)::
3041
3042 [example goes here]
3043
3044Then, pass the function into an :py:class:`ElementFilter`::
3045
3046 from bs4.filter import ElementFilter
3047 selector = ElementFilter(f)
3048
3049You can then pass the :py:class:`ElementFilter` object as the first
3050argument to any of the `Searching the tree`_ methods::
3051
3052 [examples go here]
3053
3054Every potential match will be run through your function, and the only
3055:py:class:`PageElement` objects returned will be the ones for which your
3056function returned ``True``.
3057
3058Note that this is different from simply passing `a function`_ as the
3059first argument to one of the search methods. That's an easy way to
3060find a tag, but *only* tags will be considered. With an
3061:py:class:`ElementFilter` you can write a single function that makes
3062decisions about both tags and strings.
3063
3064
3021Advanced parser customization3065Advanced parser customization
3022=============================3066=============================
30233067
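
As a concrete companion to the "Advanced search techniques" section added in the hunk above (its "[example goes here]" placeholders are left untouched): a sketch using the ElementFilter class from this branch's bs4/filter.py, with a made-up function name and markup:

    from bs4 import BeautifulSoup
    from bs4.element import Tag
    from bs4.filter import ElementFilter

    def shouting(element):
        # Tags match by name; strings match if they are entirely upper case.
        if isinstance(element, Tag):
            return element.name == "b"
        return element.strip().isupper()

    soup = BeautifulSoup("<p>HELLO <b>there</b> world</p>", "html.parser")
    soup.find_all(ElementFilter(shouting))
    # ['HELLO ', <b>there</b>]
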
@@ -3111,14 +3155,6 @@ The :py:class:`SoupStrainer` behavior is as follows:
3111* When a tag does not match, the tag itself is not kept, but parsing continues3155* When a tag does not match, the tag itself is not kept, but parsing continues
3112 into its contents to look for other tags that do match.3156 into its contents to look for other tags that do match.
31133157
3114You can also pass a :py:class:`SoupStrainer` into any of the methods covered
3115in `Searching the tree`_. This probably isn't terribly useful, but I
3116thought I'd mention it::
3117
3118 soup = BeautifulSoup(html_doc, 'html.parser')
3119 soup.find_all(only_short_strings)
3120 # ['\n\n', '\n\n', 'Elsie', ',\n', 'Lacie', ' and\n', 'Tillie',
3121 # '\n\n', '...', '\n']
31223158
3123Customizing multi-valued attributes3159Customizing multi-valued attributes
3124-----------------------------------3160-----------------------------------
