Merge beautifulsoup:more-modular-soupstrainers into beautifulsoup:4.13

Proposed by Leonard Richardson
Status: Merged
Merged at revision: c23dd48ebea467fcf028e14287f07d2c51e62975
Proposed branch: beautifulsoup:more-modular-soupstrainers
Merge into: beautifulsoup:4.13
Diff against target: 2064 lines (+710/-262)
18 files modified
CHANGELOG (+18/-1)
bs4/__init__.py (+131/-84)
bs4/_typing.py (+19/-1)
bs4/builder/__init__.py (+8/-8)
bs4/builder/_html5lib.py (+123/-67)
bs4/builder/_htmlparser.py (+12/-2)
bs4/builder/_lxml.py (+1/-1)
bs4/diagnose.py (+27/-15)
bs4/element.py (+24/-20)
bs4/filter.py (+167/-36)
bs4/tests/__init__.py (+1/-1)
bs4/tests/test_filter.py (+125/-8)
bs4/tests/test_html5lib.py (+2/-2)
bs4/tests/test_lxml.py (+1/-1)
bs4/tests/test_pageelement.py (+1/-1)
bs4/tests/test_soup.py (+2/-2)
bs4/tests/test_tree.py (+1/-1)
doc/index.rst (+47/-11)
Reviewer: Leonard Richardson (status: Pending)
Review via email: mp+459082@code.launchpad.net
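Not part of the proposal itself, just orientation for reviewers: a minimal usage sketch of the public behavior this branch touches, using only the documented parse_only/SoupStrainer API and the new limit=0 rule described in the CHANGELOG hunks below. The markup string and variable names are made up for illustration.

    from bs4 import BeautifulSoup, SoupStrainer

    # Parse only the <a> tags; html.parser honors parse_only
    # (the html5lib builder ignores it and issues a warning).
    only_links = SoupStrainer("a")
    soup = BeautifulSoup(
        "<p>Intro <a href='/one'>one</a> <a href='/two'>two</a></p>",
        "html.parser",
        parse_only=only_links,
    )
    print(soup.find_all("a"))           # only the two links survive parsing
    print(soup.find_all("a", limit=0))  # [] under the new limit=0 behavior in 4.13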

Preview Diff

diff --git a/CHANGELOG b/CHANGELOG
index 69f238d..162e3dc 100644
--- a/CHANGELOG
+++ b/CHANGELOG
@@ -1,5 +1,7 @@
 = 4.13.0 (Unreleased)
 
+TODO: we could stand to put limit inside ResultSet
+
 * This version drops support for Python 3.6. The minimum supported
   major Python version for Beautiful Soup is now Python 3.7.
 
@@ -31,6 +33,13 @@
   you, since you probably use HTMLParserTreeBuilder, not
   BeautifulSoupHTMLParser directly.
 
+* The TreeBuilderForHtml5lib methods fragmentClass and getFragment
+  now raise NotImplementedError. These methods are called only by
+  html5lib's HTMLParser.parseFragment() method, which Beautiful Soup
+  doesn't use, so they were untested and should have never been called.
+  The getFragment() implementation was also slightly incorrect in a way
+  that should have caused obvious problems for anyone using it.
+
 * If Tag.get_attribute_list() is used to access an attribute that's not set,
   the return value is now an empty list rather than [None].
 
@@ -47,6 +56,10 @@
   empty list was treated the same as None and False, and you would have
   found the tags which did not have that attribute set at all. [bug=2045469]
 
+* For similar reasons, if you pass in limit=0 to a find() method for some
+  reason, you will now get zero results. Previously, you would get all
+  matching results.
+
 * When using one of the find() methods or creating a SoupStrainer,
   if you specify the same attribute value in ``attrs`` and the
   keyword arguments, you'll end up with two different ways to match that
@@ -88,7 +101,7 @@
   changed to match the arguments to the superclass,
   TreeBuilder.prepare_markup. Specifically, document_declared_encoding
   now appears before exclude_encodings, not after. If you were calling
-  this method yourself, I recomment switching to using keyword
+  this method yourself, I recommend switching to using keyword
   arguments instead.
 
 * Fixed an error in the lookup table used when converting
@@ -101,8 +114,12 @@ New deprecations in 4.13.0:
 
 * The SAXTreeBuilder class, which was never officially supported or tested.
 
+* The private class method BeautifulSoup._decode_markup(), which has not
+  been used inside Beautiful Soup for many years.
+
 * The first argument to BeautifulSoup.decode has been changed from a bool
   `pretty_print` to an int `indent_level`, to match the signature of Tag.decode.
+  Using a bool will still work but will give you a DeprecationWarning.
 
 * SoupStrainer.text and SoupStrainer.string are both deprecated
   since a single item can't capture all the possibilities of a SoupStrainer
diff --git a/bs4/__init__.py b/bs4/__init__.py
index 347cb38..95bd48d 100644
--- a/bs4/__init__.py
+++ b/bs4/__init__.py
@@ -15,7 +15,7 @@ documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
 """
 
 __author__ = "Leonard Richardson (leonardr@segfault.org)"
-__version__ = "4.12.3"
+__version__ = "4.13.0"
 __copyright__ = "Copyright (c) 2004-2024 Leonard Richardson"
 # Use of this source code is governed by the MIT license.
 __license__ = "MIT"
@@ -42,10 +42,13 @@ from .builder import (
 )
 from .builder._htmlparser import HTMLParserTreeBuilder
 from .dammit import UnicodeDammit
+from .css import (
+    CSS
+)
+from ._deprecation import _deprecated
 from .element import (
     CData,
     Comment,
-    CSS,
     DEFAULT_OUTPUT_ENCODING,
     Declaration,
     Doctype,
@@ -60,7 +63,10 @@ from .element import (
     TemplateString,
     )
 from .formatter import Formatter
-from .strainer import SoupStrainer
+from .filter import (
+    ElementFilter,
+    SoupStrainer,
+)
 from typing import (
     Any,
     cast,
@@ -70,6 +76,7 @@ from typing import (
     List,
     Sequence,
     Optional,
+    Tuple,
     Type,
     TYPE_CHECKING,
     Union,
@@ -81,6 +88,7 @@ from bs4._typing import (
     _Encoding,
     _Encodings,
     _IncomingMarkup,
+    _RawMarkup,
 )
 
 # Define some custom warnings.
@@ -144,20 +152,21 @@ class BeautifulSoup(Tag):
     NO_PARSER_SPECIFIED_WARNING: str = "No parser was explicitly specified, so I'm using the best available %(markup_type)s parser for this system (\"%(parser)s\"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.\n\nThe code that caused this warning is on line %(line_number)s of the file %(filename)s. To get rid of this warning, pass the additional argument 'features=\"%(parser)s\"' to the BeautifulSoup constructor.\n"
 
     # FUTURE PYTHON:
-    element_classes:Dict[Type[PageElement], Type[Any]] #: :meta private:
+    element_classes:Dict[Type[PageElement], Type[PageElement]] #: :meta private:
     builder:TreeBuilder #: :meta private:
     is_xml: bool
     known_xml: Optional[bool]
     parse_only: Optional[SoupStrainer] #: :meta private:
 
     # These members are only used while parsing markup.
-    markup:Optional[Union[str,bytes]] #: :meta private:
+    markup:Optional[_RawMarkup] #: :meta private:
     current_data:List[str] #: :meta private:
     currentTag:Optional[Tag] #: :meta private:
     tagStack:List[Tag] #: :meta private:
     open_tag_counter:CounterType[str] #: :meta private:
     preserve_whitespace_tag_stack:List[Tag] #: :meta private:
     string_container_stack:List[Tag] #: :meta private:
+    _most_recent_element:Optional[PageElement] #: :meta private:
 
     #: Beautiful Soup's best guess as to the character encoding of the
     #: original document.
@@ -182,7 +191,7 @@ class BeautifulSoup(Tag):
                  parse_only:Optional[SoupStrainer]=None,
                  from_encoding:Optional[_Encoding]=None,
                  exclude_encodings:Optional[_Encodings]=None,
-                 element_classes:Optional[Dict[Type[PageElement], Type[Any]]]=None,
+                 element_classes:Optional[Dict[Type[PageElement], Type[PageElement]]]=None,
                  **kwargs:Any
                  ):
         """Constructor.
@@ -271,7 +280,7 @@ class BeautifulSoup(Tag):
                 "features='lxml' for HTML and features='lxml-xml' for "
                 "XML.")
 
-        def deprecated_argument(old_name, new_name):
+        def deprecated_argument(old_name:str, new_name:str) -> Optional[Any]:
             if old_name in kwargs:
                 warnings.warn(
                     'The "%s" argument to the BeautifulSoup constructor '
@@ -284,13 +293,14 @@ class BeautifulSoup(Tag):
 
         parse_only = parse_only or deprecated_argument(
             "parseOnlyThese", "parse_only")
-        if (parse_only is not None
-            and parse_only.string_rules and
-            (parse_only.name_rules or parse_only.attribute_rules)):
-            warnings.warn(
-                f"Value for parse_only will exclude everything, since it puts restrictions on both tags and strings: {parse_only}",
-                UserWarning, stacklevel=3
-            )
+        if parse_only is not None:
+            # Issue a warning if we can tell in advance that
+            # parse_only will exclude the entire tree.
+            if parse_only.excludes_everything:
+                warnings.warn(
+                    f"The given value for parse_only will exclude everything: {parse_only}",
+                    UserWarning, stacklevel=3
+                )
 
         from_encoding = from_encoding or deprecated_argument(
             "fromEncoding", "from_encoding")
@@ -323,7 +333,7 @@ class BeautifulSoup(Tag):
                     "Couldn't find a tree builder with the features you "
                     "requested: %s. Do you need to install a parser library?"
                     % ",".join(features))
-            builder_class = cast(Type[TreeBuilder], possible_builder_class)
+            builder_class = possible_builder_class
 
         # At this point either we have a TreeBuilder instance in
         # builder, or we have a builder_class that we can instantiate
@@ -399,7 +409,7 @@ class BeautifulSoup(Tag):
 
         # At this point we know markup is a string or bytestring. If
         # it was a file-type object, we've read from it.
-        markup = cast(Union[str,bytes], markup)
+        markup = cast(_RawMarkup, markup)
 
         rejections = []
         success = False
@@ -428,7 +438,7 @@ class BeautifulSoup(Tag):
         self.markup = None
         self.builder.soup = None
 
-    def _clone(self):
+    def _clone(self) -> "BeautifulSoup":
         """Create a new BeautifulSoup object with the same TreeBuilder,
         but not associated with any markup.
 
@@ -441,7 +451,7 @@ class BeautifulSoup(Tag):
         clone.original_encoding = self.original_encoding
         return clone
 
-    def __getstate__(self):
+    def __getstate__(self) -> dict[str, Any]:
         # Frequently a tree builder can't be pickled.
         d = dict(self.__dict__)
         if 'builder' in d and d['builder'] is not None and not self.builder.picklable:
@@ -457,7 +467,7 @@ class BeautifulSoup(Tag):
             del d['_most_recent_element']
         return d
 
-    def __setstate__(self, state):
+    def __setstate__(self, state: dict[str, Any]) -> None:
         # If necessary, restore the TreeBuilder by looking it up.
         self.__dict__ = state
         if isinstance(self.builder, type):
@@ -469,15 +479,16 @@ class BeautifulSoup(Tag):
             self.builder.soup = self
         self.reset()
         self._feed()
-        return state
 
 
     @classmethod
-    def _decode_markup(cls, markup):
-        """Ensure `markup` is bytes so it's safe to send into warnings.warn.
+    @_deprecated(replaced_by="nothing (private method, will be removed)", version="4.13.0")
+    def _decode_markup(cls, markup:_RawMarkup) -> str:
+        """Ensure `markup` is Unicode so it's safe to send into warnings.warn.
 
-        TODO: warnings.warn had this problem back in 2010 but it might not
-        anymore.
+        warnings.warn had this problem back in 2010 but fortunately
+        not anymore. This has not been used for a long time; I just
+        noticed that fact while working on 4.13.0.
         """
         if isinstance(markup, bytes):
             decoded = markup.decode('utf-8', 'replace')
@@ -486,56 +497,76 @@ class BeautifulSoup(Tag):
             return decoded
 
     @classmethod
-    def _markup_is_url(cls, markup):
+    def _markup_is_url(cls, markup:_RawMarkup) -> bool:
         """Error-handling method to raise a warning if incoming markup looks
         like a URL.
 
-        :param markup: A string.
-        :return: Whether or not the markup resembles a URL
-        closely enough to justify a warning.
+        :param markup: A string of markup.
+        :return: Whether or not the markup resembled a URL
+        closely enough to justify issuing a warning.
         """
+        problem: bool = False
         if isinstance(markup, bytes):
-            space = b' '
-            cant_start_with = (b"http:", b"https:")
+            cant_start_with_b: Tuple[bytes, bytes] = (b"http:", b"https:")
+            problem = (
+                any(
+                    markup.startswith(prefix) for prefix in
+                    (b"http:", b"https:")
+                )
+                and not b' ' in markup
+            )
         elif isinstance(markup, str):
-            space = ' '
-            cant_start_with = ("http:", "https:")
+            problem = (
+                any(
+                    markup.startswith(prefix) for prefix in
+                    ("http:", "https:")
+                )
+                and not ' ' in markup
+            )
         else:
             return False
 
-        if any(markup.startswith(prefix) for prefix in cant_start_with):
-            if not space in markup:
-                warnings.warn(
-                    'The input looks more like a URL than markup. You may want to use'
-                    ' an HTTP client like requests to get the document behind'
-                    ' the URL, and feed that document to Beautiful Soup.',
-                    MarkupResemblesLocatorWarning,
-                    stacklevel=3
-                )
-                return True
-        return False
+        if not problem:
+            return False
+        warnings.warn(
+            'The input looks more like a URL than markup. You may want to use'
+            ' an HTTP client like requests to get the document behind'
+            ' the URL, and feed that document to Beautiful Soup.',
+            MarkupResemblesLocatorWarning,
+            stacklevel=3
+        )
+        return True
 
     @classmethod
-    def _markup_resembles_filename(cls, markup):
-        """Error-handling method to raise a warning if incoming markup
+    def _markup_resembles_filename(cls, markup:_RawMarkup) -> bool:
+        """Error-handling method to issue a warning if incoming markup
         resembles a filename.
 
-        :param markup: A bytestring or string.
-        :return: Whether or not the markup resembles a filename
-        closely enough to justify a warning.
+        :param markup: A string of markup.
+        :return: Whether or not the markup resembled a filename
+        closely enough to justify issuing a warning.
         """
-        path_characters = '/\\'
-        extensions = ['.html', '.htm', '.xml', '.xhtml', '.txt']
-        if isinstance(markup, bytes):
-            path_characters = path_characters.encode("utf8")
-            extensions = [x.encode('utf8') for x in extensions]
+        path_characters_b = b'/\\'
+        path_characters_s = '/\\'
+        extensions_b = [b'.html', b'.htm', b'.xml', b'.xhtml', b'.txt']
+        extensions_s = ['.html', '.htm', '.xml', '.xhtml', '.txt']
+
         filelike = False
-        if any(x in markup for x in path_characters):
-            filelike = True
+        if isinstance(markup, bytes):
+            if any(x in markup for x in path_characters_b):
+                filelike = True
+            else:
+                lower_b = markup.lower()
+                if any(lower_b.endswith(ext) for ext in extensions_b):
+                    filelike = True
         else:
-            lower = markup.lower()
-            if any(lower.endswith(ext) for ext in extensions):
+            if any(x in markup for x in path_characters_s):
                 filelike = True
+            else:
+                lower_s = markup.lower()
+                if any(lower_s.endswith(ext) for ext in extensions_s):
+                    filelike = True
+
         if filelike:
             warnings.warn(
                 'The input looks more like a filename than markup. You may'
@@ -546,20 +577,22 @@ class BeautifulSoup(Tag):
             return True
         return False
 
-    def _feed(self):
+    def _feed(self) -> None:
         """Internal method that parses previously set markup, creating a large
         number of Tag and NavigableString objects.
         """
         # Convert the document to Unicode.
         self.builder.reset()
 
-        self.builder.feed(self.markup)
+        if self.markup is not None:
+            self.builder.feed(self.markup)
         # Close out any unfinished strings and close all the open tags.
         self.endData()
-        while self.currentTag.name != self.ROOT_TAG_NAME:
+        while (self.currentTag is not None and
+               self.currentTag.name != self.ROOT_TAG_NAME):
             self.popTag()
 
-    def reset(self):
+    def reset(self) -> None:
         """Reset this object to a state as though it had never parsed any
         markup.
         """
@@ -585,7 +618,7 @@ class BeautifulSoup(Tag):
                 sourcepos:Optional[int]=None,
                 string:Optional[str]=None,
                 **kwattrs:_AttributeValue,
-                ):
+                ) -> Tag:
         """Create a new Tag associated with this BeautifulSoup object.
 
         :param name: The name of the new Tag.
@@ -603,10 +636,16 @@ class BeautifulSoup(Tag):
 
         """
         kwattrs.update(attrs)
-        tag = self.element_classes.get(Tag, Tag)(
+        tag_class = self.element_classes.get(Tag, Tag)
+
+        # Assume that this is either Tag or a subclass of Tag. If not,
+        # the user brought type-unsafety upon themselves.
+        tag_class = cast(Type[Tag], tag_class)
+        tag = tag_class(
             None, self.builder, name, namespace, nsprefix, kwattrs,
             sourceline=sourceline, sourcepos=sourcepos
         )
+
         if string is not None:
             tag.string = string
         return tag
@@ -622,9 +661,11 @@ class BeautifulSoup(Tag):
         """
         container = base_class or NavigableString
 
-        # There may be a general override of NavigableString.
-        container = self.element_classes.get(
-            container, container
+        # The user may want us to use some other class (hopefully a
+        # custom subclass) instead of the one we'd use normally.
+        container = cast(
+            type[NavigableString],
+            self.element_classes.get(container, container)
         )
 
         # On top of that, we may be inside a tag that needs a special
@@ -728,9 +769,8 @@ class BeautifulSoup(Tag):
         self.current_data = []
 
         # Should we add this string to the tree at all?
-        if self.parse_only and len(self.tagStack) <= 1 and \
-           (not self.parse_only.string_rules or \
-            not self.parse_only.allow_string_creation(current_data)):
+        if (self.parse_only and len(self.tagStack) <= 1 and
+            (not self.parse_only.allow_string_creation(current_data))):
             return
 
         containerClass = self.string_container(containerClass)
@@ -739,17 +779,16 @@ class BeautifulSoup(Tag):
 
     def object_was_parsed(
             self, o:PageElement, parent:Optional[Tag]=None,
-            most_recent_element:Optional[PageElement]=None):
+            most_recent_element:Optional[PageElement]=None) -> None:
         """Method called by the TreeBuilder to integrate an object into the
         parse tree.
 
-
-
         :meta private:
         """
         if parent is None:
             parent = self.currentTag
         assert parent is not None
+        previous_element: Optional[PageElement]
         if most_recent_element is not None:
             previous_element = most_recent_element
         else:
@@ -774,12 +813,12 @@ class BeautifulSoup(Tag):
         if fix:
             self._linkage_fixer(parent)
 
-    def _linkage_fixer(self, el):
+    def _linkage_fixer(self, el:Tag) -> None:
         """Make sure linkage of this fragment is sound."""
 
         first = el.contents[0]
         child = el.contents[-1]
-        descendant = child
+        descendant:PageElement = child
 
         if child is first and el.parent is not None:
             # Parent should be linked to first child
@@ -797,14 +836,18 @@ class BeautifulSoup(Tag):
 
         # This index is a tag, dig deeper for a "last descendant"
         if isinstance(child, Tag) and child.contents:
-            descendant = child._last_descendant(False)
+            # _last_decendant is typed as returning Optional[PageElement],
+            # but the value can't be None here, because el is a Tag
+            # which we know has contents.
+            descendant = cast(PageElement, child._last_descendant(False))
 
         # As the final step, link last descendant. It should be linked
         # to the parent's next sibling (if found), else walk up the chain
         # and find a parent with a sibling. It should have no next sibling.
         descendant.next_element = None
         descendant.next_sibling = None
-        target = el
+
+        target:Optional[Tag] = el
         while True:
             if target is None:
                 break
@@ -814,7 +857,7 @@ class BeautifulSoup(Tag):
                 break
             target = target.parent
 
-    def _popToTag(self, name, nsprefix=None, inclusivePop=True) -> Optional[Tag]:
+    def _popToTag(self, name:str, nsprefix:Optional[str]=None, inclusivePop:bool=True) -> Optional[Tag]:
         """Pops the tag stack up to and including the most recent
         instance of the given tag.
 
@@ -851,7 +894,7 @@ class BeautifulSoup(Tag):
 
     def handle_starttag(
             self, name:str, namespace:Optional[str],
-            nsprefix:Optional[str], attrs:Optional[Dict[str,str]],
+            nsprefix:Optional[str], attrs:_AttributeValues,
             sourceline:Optional[int]=None, sourcepos:Optional[int]=None,
             namespaces:Optional[Dict[str, str]]=None) -> Optional[Tag]:
         """Called by the tree builder when a new tag is encountered.
@@ -867,7 +910,7 @@ class BeautifulSoup(Tag):
         currently in scope in the document.
 
         If this method returns None, the tag was rejected by an active
-        SoupStrainer. You should proceed as if the tag had not occurred
+        `ElementFilter`. You should proceed as if the tag had not occurred
         in the document. For instance, if this was a self-closing tag,
         don't call handle_endtag.
 
@@ -877,11 +920,14 @@ class BeautifulSoup(Tag):
         self.endData()
 
         if (self.parse_only and len(self.tagStack) <= 1
-            and (self.parse_only.string_rules
-                 or not self.parse_only.allow_tag_creation(nsprefix, name, attrs))):
+            and not self.parse_only.allow_tag_creation(nsprefix, name, attrs)):
             return None
 
-        tag = self.element_classes.get(Tag, Tag)(
+        tag_class = self.element_classes.get(Tag, Tag)
+        # Assume that this is either Tag or a subclass of Tag. If not,
+        # the user brought type-unsafety upon themselves.
+        tag_class = cast(Type[Tag], tag_class)
+        tag = tag_class(
             self, self.builder, name, namespace, nsprefix, attrs,
             self.currentTag, self._most_recent_element,
             sourceline=sourceline, sourcepos=sourcepos,
@@ -918,7 +964,8 @@ class BeautifulSoup(Tag):
     def decode(self, indent_level:Optional[int]=None,
                eventual_encoding:_Encoding=DEFAULT_OUTPUT_ENCODING,
                formatter:Union[Formatter,str]="minimal",
-               iterator:Optional[Iterable]=None, **kwargs) -> str:
+               iterator:Optional[Iterable[PageElement]]=None,
+               **kwargs:Any) -> str:
         """Returns a string representation of the parse tree
         as a full HTML or XML document.
 
@@ -989,7 +1036,7 @@ _soup = BeautifulSoup
 class BeautifulStoneSoup(BeautifulSoup):
     """Deprecated interface to an XML parser."""
 
-    def __init__(self, *args, **kwargs):
+    def __init__(self, *args:Any, **kwargs:Any):
         kwargs['features'] = 'xml'
         warnings.warn(
             'The BeautifulStoneSoup class was deprecated in version 4.0.0. Instead of using '
diff --git a/bs4/_typing.py b/bs4/_typing.py
index fed804a..ab8f7a0 100644
--- a/bs4/_typing.py
+++ b/bs4/_typing.py
@@ -7,6 +7,8 @@
 # * In 3.10, x|y is an accepted shorthand for Union[x,y].
 # * In 3.10, TypeAlias gains capabilities that can be used to
 #   improve the tree matching types (I don't remember what, exactly).
+# * 3.8 defines the Protocol type, which can be used to do duck typing
+#   in a statically checkable way.
 
 import re
 from typing_extensions import TypeAlias
@@ -15,13 +17,14 @@ from typing import (
     Dict,
     IO,
     Iterable,
+    Optional,
     Pattern,
     TYPE_CHECKING,
     Union,
 )
 
 if TYPE_CHECKING:
-    from bs4.element import Tag
+    from bs4.element import PageElement, Tag
 
 # Aliases for markup in various stages of processing.
 #
@@ -52,6 +55,10 @@ _InvertedNamespaceMapping:TypeAlias = Dict[_NamespaceURL, _NamespacePrefix]
 _AttributeValue: TypeAlias = Union[str, Iterable[str]]
 _AttributeValues: TypeAlias = Dict[str, _AttributeValue]
 
+# The most common form in which attribute values are passed in from a
+# parser.
+_RawAttributeValues: TypeAlias = dict[str, str]
+
 # Aliases to represent the many possibilities for matching bits of a
 # parse tree.
 #
@@ -60,6 +67,17 @@ _AttributeValues: TypeAlias = Dict[str, _AttributeValue]
 # of the arguments to the SoupStrainer constructor and (more
 # familiarly to Beautiful Soup users) the find* methods.
 
+# A function that takes a PageElement and returns a yes-or-no answer.
+_PageElementMatchFunction:TypeAlias = Callable[['PageElement'], bool]
+
+# A function that takes the raw parsed ingredients of a markup tag
+# and returns a yes-or-no answer.
+_AllowTagCreationFunction:TypeAlias = Callable[[Optional[str], str, Optional[_RawAttributeValues]], bool]
+
+# A function that takes the raw parsed ingredients of a markup string node
+# and returns a yes-or-no answer.
+_AllowStringCreationFunction:TypeAlias = Callable[[Optional[str]], bool]
+
 # A function that takes a Tag and returns a yes-or-no answer.
 # A TagNameMatchRule expects this kind of function, if you're
 # going to pass it a function.
diff --git a/bs4/builder/__init__.py b/bs4/builder/__init__.py
index fa2b939..b59513e 100644
--- a/bs4/builder/__init__.py
+++ b/bs4/builder/__init__.py
@@ -277,7 +277,7 @@ class TreeBuilder(object):
             return True
         return tag_name in self.empty_element_tags
 
-    def feed(self, markup:str) -> None:
+    def feed(self, markup:_RawMarkup) -> None:
         """Run some incoming markup through some parsing process,
         populating the `BeautifulSoup` object in `TreeBuilder.soup`
         """
@@ -598,8 +598,8 @@ class DetectsXMLParsedAsHTML(object):
 
     # This is typed as str, not `ProcessingInstruction`, because this
     # check may be run before any Beautiful Soup objects are created.
-    _first_processing_instruction: Optional[str]
-    _root_tag: Optional[Tag]
+    _first_processing_instruction: Optional[str] #: :meta private:
+    _root_tag_name: Optional[str] #: :meta private:
 
     @classmethod
     def warn_if_markup_looks_like_xml(cls, markup:Optional[_RawMarkup], stacklevel:int=3) -> bool:
@@ -648,14 +648,14 @@ class DetectsXMLParsedAsHTML(object):
     def _initialize_xml_detector(self) -> None:
         """Call this method before parsing a document."""
         self._first_processing_instruction = None
-        self._root_tag = None
+        self._root_tag_name = None
 
     def _document_might_be_xml(self, processing_instruction:str):
         """Call this method when encountering an XML declaration, or a
         "processing instruction" that might be an XML declaration.
         """
         if (self._first_processing_instruction is not None
-            or self._root_tag is not None):
+            or self._root_tag_name is not None):
             # The document has already started. Don't bother checking
             # anymore.
             return
@@ -665,18 +665,18 @@ class DetectsXMLParsedAsHTML(object):
         # We won't know until we encounter the first tag whether or
         # not this is actually a problem.
 
-    def _root_tag_encountered(self, name):
+    def _root_tag_encountered(self, name:str) -> None:
         """Call this when you encounter the document's root tag.
 
         This is where we actually check whether an XML document is
         being incorrectly parsed as HTML, and issue the warning.
         """
-        if self._root_tag is not None:
+        if self._root_tag_name is not None:
             # This method was incorrectly called multiple times. Do
             # nothing.
             return
 
-        self._root_tag = name
+        self._root_tag_name = name
         if (name != 'html' and self._first_processing_instruction is not None
             and self._first_processing_instruction.lower().startswith('xml ')):
             # We encountered an XML declaration and then a tag other
diff --git a/bs4/builder/_html5lib.py b/bs4/builder/_html5lib.py
index b7d2924..2ea556c 100644
--- a/bs4/builder/_html5lib.py
+++ b/bs4/builder/_html5lib.py
@@ -6,6 +6,9 @@ __all__ = [
6 ]6 ]
77
8from typing import (8from typing import (
9 Any,
10 cast,
11 Dict,
9 Iterable,12 Iterable,
10 List,13 List,
11 Optional,14 Optional,
@@ -14,8 +17,11 @@ from typing import (
14 Union,17 Union,
15)18)
16from bs4._typing import (19from bs4._typing import (
20 _AttributeValue,
21 _AttributeValues,
17 _Encoding,22 _Encoding,
18 _Encodings,23 _Encodings,
24 _NamespaceURL,
19 _RawMarkup,25 _RawMarkup,
20)26)
2127
@@ -30,6 +36,7 @@ from bs4.builder import (
30 )36 )
31from bs4.element import (37from bs4.element import (
32 NamespacedAttribute,38 NamespacedAttribute,
39 PageElement,
33 nonwhitespace_re,40 nonwhitespace_re,
34)41)
35import html5lib42import html5lib
@@ -42,7 +49,9 @@ from bs4.element import (
42 Doctype,49 Doctype,
43 NavigableString,50 NavigableString,
44 Tag,51 Tag,
45 )52)
53if TYPE_CHECKING:
54 from bs4 import BeautifulSoup
4655
47from html5lib.treebuilders import base as treebuilder_base56from html5lib.treebuilders import base as treebuilder_base
4857
@@ -71,7 +80,9 @@ class HTML5TreeBuilder(HTMLTreeBuilder):
71 #: html5lib can tell us which line number and position in the80 #: html5lib can tell us which line number and position in the
72 #: original file is the source of an element.81 #: original file is the source of an element.
73 TRACKS_LINE_NUMBERS:bool = True82 TRACKS_LINE_NUMBERS:bool = True
74 83
84 underlying_builder:'TreeBuilderForHtml5lib' #: :meta private:
85
75 def prepare_markup(self, markup:_RawMarkup,86 def prepare_markup(self, markup:_RawMarkup,
76 user_specified_encoding:Optional[_Encoding]=None,87 user_specified_encoding:Optional[_Encoding]=None,
77 document_declared_encoding:Optional[_Encoding]=None,88 document_declared_encoding:Optional[_Encoding]=None,
@@ -102,20 +113,31 @@ class HTML5TreeBuilder(HTMLTreeBuilder):
102 yield (markup, None, None, False)113 yield (markup, None, None, False)
103114
104 # These methods are defined by Beautiful Soup.115 # These methods are defined by Beautiful Soup.
105 def feed(self, markup):116 def feed(self, markup:_RawMarkup) -> None:
106 """Run some incoming markup through some parsing process,117 """Run some incoming markup through some parsing process,
107 populating the `BeautifulSoup` object in `HTML5TreeBuilder.soup`.118 populating the `BeautifulSoup` object in `HTML5TreeBuilder.soup`.
108 """119 """
109 if self.soup.parse_only is not None:120 if self.soup is not None and self.soup.parse_only is not None:
110 warnings.warn(121 warnings.warn(
111 "You provided a value for parse_only, but the html5lib tree builder doesn't support parse_only. The entire document will be parsed.",122 "You provided a value for parse_only, but the html5lib tree builder doesn't support parse_only. The entire document will be parsed.",
112 stacklevel=4123 stacklevel=4
113 )124 )
125
126 # self.underlying_parser is probably None now, but it'll be set
127 # when self.create_treebuilder is called by html5lib.
128 #
129 # TODO-TYPING: typeshed stubs are incorrect about the return
130 # value of HTMLParser.__init__; it is HTMLParser, not None.
114 parser = html5lib.HTMLParser(tree=self.create_treebuilder)131 parser = html5lib.HTMLParser(tree=self.create_treebuilder)
132 assert self.underlying_builder is not None
115 self.underlying_builder.parser = parser133 self.underlying_builder.parser = parser
116 extra_kwargs = dict()134 extra_kwargs = dict()
117 if not isinstance(markup, str):135 if not isinstance(markup, str):
136 # kwargs, specifically override_encoding, will eventually
137 # be passed in to html5lib's
138 # HTMLBinaryInputStream.__init__.
118 extra_kwargs['override_encoding'] = self.user_specified_encoding139 extra_kwargs['override_encoding'] = self.user_specified_encoding
140
119 doc = parser.parse(markup, **extra_kwargs)141 doc = parser.parse(markup, **extra_kwargs)
120 142
121 # Set the character encoding detected by the tokenizer.143 # Set the character encoding detected by the tokenizer.
@@ -131,10 +153,12 @@ class HTML5TreeBuilder(HTMLTreeBuilder):
131 doc.original_encoding = original_encoding153 doc.original_encoding = original_encoding
132 self.underlying_builder.parser = None154 self.underlying_builder.parser = None
133155
134 def create_treebuilder(self, namespaceHTMLElements):156 def create_treebuilder(self, namespaceHTMLElements:bool) -> 'TreeBuilderForHtml5lib':
135 """Called by html5lib to instantiate the kind of class it157 """Called by html5lib to instantiate the kind of class it
136 calls a 'TreeBuilder'.158 calls a 'TreeBuilder'.
137 159
160 :param namespaceHTMLElements: Whether or not to namespace HTML elements.
161
138 :meta private:162 :meta private:
139 """163 """
140 self.underlying_builder = TreeBuilderForHtml5lib(164 self.underlying_builder = TreeBuilderForHtml5lib(
@@ -143,15 +167,18 @@ class HTML5TreeBuilder(HTMLTreeBuilder):
143 )167 )
144 return self.underlying_builder168 return self.underlying_builder
145169
146 def test_fragment_to_document(self, fragment):170 def test_fragment_to_document(self, fragment:str) -> str:
147 """See `TreeBuilder`."""171 """See `TreeBuilder`."""
148 return '<html><head></head><body>%s</body></html>' % fragment172 return '<html><head></head><body>%s</body></html>' % fragment
149173
150174
151class TreeBuilderForHtml5lib(treebuilder_base.TreeBuilder):175class TreeBuilderForHtml5lib(treebuilder_base.TreeBuilder):
152 176
153 def __init__(self, namespaceHTMLElements, soup=None,177 soup:'BeautifulSoup' #: :meta private:
154 store_line_numbers=True, **kwargs):178
179 def __init__(self, namespaceHTMLElements:bool,
180 soup:Optional['BeautifulSoup']=None,
181 store_line_numbers:bool=True, **kwargs:Any):
155 if soup:182 if soup:
156 self.soup = soup183 self.soup = soup
157 else:184 else:
@@ -172,65 +199,68 @@ class TreeBuilderForHtml5lib(treebuilder_base.TreeBuilder):
172 self.parser = None199 self.parser = None
173 self.store_line_numbers = store_line_numbers200 self.store_line_numbers = store_line_numbers
174 201
175 def documentClass(self):202 def documentClass(self) -> 'Element':
176 self.soup.reset()203 self.soup.reset()
177 return Element(self.soup, self.soup, None)204 return Element(self.soup, self.soup, None)
178205
179 def insertDoctype(self, token):206 def insertDoctype(self, token:Dict[str, Any]) -> None:
180 name = token["name"]207 name:str = cast(str, token["name"])
181 publicId = token["publicId"]208 publicId:Optional[str] = cast(Optional[str], token["publicId"])
182 systemId = token["systemId"]209 systemId:Optional[str] = cast(Optional[str], token["systemId"])
183210
184 doctype = Doctype.for_name_and_ids(name, publicId, systemId)211 doctype = Doctype.for_name_and_ids(name, publicId, systemId)
185 self.soup.object_was_parsed(doctype)212 self.soup.object_was_parsed(doctype)
186213
187 def elementClass(self, name, namespace):214 def elementClass(self, name:str, namespace:str) -> 'Element':
188 kwargs = {}215 sourceline:Optional[int] = None
216 sourcepos:Optional[int] = None
189 if self.parser and self.store_line_numbers:217 if self.parser and self.store_line_numbers:
190 # This represents the point immediately after the end of the218 # This represents the point immediately after the end of the
191 # tag. We don't know when the tag started, but we do know219 # tag. We don't know when the tag started, but we do know
192 # where it ended -- the character just before this one.220 # where it ended -- the character just before this one.
193 sourceline, sourcepos = self.parser.tokenizer.stream.position()221 sourceline, sourcepos = self.parser.tokenizer.stream.position()
194 kwargs['sourceline'] = sourceline222 sourcepos = sourcepos-1
195 kwargs['sourcepos'] = sourcepos-1223 tag = self.soup.new_tag(
196 tag = self.soup.new_tag(name, namespace, **kwargs)224 name, namespace, sourceline=sourceline, sourcepos=sourcepos
225 )
197226
198 return Element(tag, self.soup, namespace)227 return Element(tag, self.soup, namespace)
199228
200 def commentClass(self, data):229 def commentClass(self, data:str) -> 'TextNode':
201 return TextNode(Comment(data), self.soup)230 return TextNode(Comment(data), self.soup)
202231
203 def fragmentClass(self):232 def fragmentClass(self) -> 'Element':
204 from bs4 import BeautifulSoup233 """This is only used by html5lib HTMLParser.parseFragment(),
205 # TODO: Why is the parser 'html.parser' here? To avoid an234 which is never used by Beautiful Soup."""
206 # infinite loop?235 raise NotImplementedError()
207 self.soup = BeautifulSoup("", "html.parser")236
208 self.soup.name = "[document_fragment]"237 def getFragment(self) -> 'Element':
209 return Element(self.soup, self.soup, None)238 """This is only used by html5lib HTMLParser.parseFragment,
239 which is never used by Beautiful Soup."""
240 raise NotImplementedError()
210241
211 def appendChild(self, node):242 def appendChild(self, node:'Element') -> None:
212 # XXX This code is not covered by the BS4 tests.243 # TODO: This code is not covered by the BS4 tests.
213 self.soup.append(node.element)244 self.soup.append(node.element)
214245
215 def getDocument(self):246 def getDocument(self) -> 'BeautifulSoup':
216 return self.soup247 return self.soup
217248
218 def getFragment(self):249 # TODO-TYPING: typeshed stubs are incorrect about this;
219 return treebuilder_base.TreeBuilder.getFragment(self).element250 # cloneNode returns a str, not None.
220251 def testSerializer(self, element:'Element') -> str:
221 def testSerializer(self, element):
222 from bs4 import BeautifulSoup252 from bs4 import BeautifulSoup
223 rv = []253 rv = []
224 doctype_re = re.compile(r'^(.*?)(?: PUBLIC "(.*?)"(?: "(.*?)")?| SYSTEM "(.*?)")?$')254 doctype_re = re.compile(r'^(.*?)(?: PUBLIC "(.*?)"(?: "(.*?)")?| SYSTEM "(.*?)")?$')
225255
226 def serializeElement(element, indent=0):256 def serializeElement(element:Union['Element', PageElement], indent=0) -> None:
227 if isinstance(element, BeautifulSoup):257 if isinstance(element, BeautifulSoup):
228 pass258 pass
229 if isinstance(element, Doctype):259 if isinstance(element, Doctype):
230 m = doctype_re.match(element)260 m = doctype_re.match(element)
231 if m:261 if m is not None:
232 name = m.group(1)262 name = m.group(1)
233 if m.lastindex > 1:263 if m.lastindex is not None and m.lastindex > 1:
234 publicId = m.group(2) or ""264 publicId = m.group(2) or ""
235 systemId = m.group(3) or m.group(4) or ""265 systemId = m.group(3) or m.group(4) or ""
236 rv.append("""|%s<!DOCTYPE %s "%s" "%s">""" %266 rv.append("""|%s<!DOCTYPE %s "%s" "%s">""" %
@@ -243,7 +273,7 @@ class TreeBuilderForHtml5lib(treebuilder_base.TreeBuilder):
243 rv.append("|%s<!-- %s -->" % (' ' * indent, element))273 rv.append("|%s<!-- %s -->" % (' ' * indent, element))
244 elif isinstance(element, NavigableString):274 elif isinstance(element, NavigableString):
245 rv.append("|%s\"%s\"" % (' ' * indent, element))275 rv.append("|%s\"%s\"" % (' ' * indent, element))
246 else:276 elif isinstance(element, Element):
247 if element.namespace:277 if element.namespace:
248 name = "%s %s" % (prefixes[element.namespace],278 name = "%s %s" % (prefixes[element.namespace],
249 element.name)279 element.name)
@@ -269,12 +299,19 @@ class TreeBuilderForHtml5lib(treebuilder_base.TreeBuilder):
269 return "\n".join(rv)299 return "\n".join(rv)
270300
271class AttrList(object):301class AttrList(object):
272 def __init__(self, element):302 """Represents a Tag's attributes in a way compatible with html5lib."""
303
304 element:Tag
305 attrs:_AttributeValues
306
307 def __init__(self, element:Tag):
273 self.element = element308 self.element = element
274 self.attrs = dict(self.element.attrs)309 self.attrs = dict(self.element.attrs)
275 def __iter__(self):310
311 def __iter__(self) -> Iterable[Tuple[str, _AttributeValue]]:
276 return list(self.attrs.items()).__iter__()312 return list(self.attrs.items()).__iter__()
277 def __setitem__(self, name, value):313
314 def __setitem__(self, name:str, value:_AttributeValue) -> None:
278 # If this attribute is a multi-valued attribute for this element,315 # If this attribute is a multi-valued attribute for this element,
279 # turn its value into a list.316 # turn its value into a list.
280 list_attr = self.element.cdata_list_attributes or {}317 list_attr = self.element.cdata_list_attributes or {}
@@ -282,40 +319,52 @@ class AttrList(object):
282 or (self.element.name in list_attr319 or (self.element.name in list_attr
283 and name in list_attr.get(self.element.name, []))):320 and name in list_attr.get(self.element.name, []))):
284 # A node that is being cloned may have already undergone321 # A node that is being cloned may have already undergone
285 # this procedure.322 # this procedure. Check for this and skip it.
286 if not isinstance(value, list):323 if not isinstance(value, list):
324 assert isinstance(value, str)
287 value = nonwhitespace_re.findall(value)325 value = nonwhitespace_re.findall(value)
288 self.element[name] = value326 self.element[name] = value
289 def items(self):327
328 def items(self) -> Iterable[Tuple[str, _AttributeValue]]:
290 return list(self.attrs.items())329 return list(self.attrs.items())
291 def keys(self):330
331 def keys(self) -> Iterable[str]:
292 return list(self.attrs.keys())332 return list(self.attrs.keys())
293 def __len__(self):333
334 def __len__(self) -> int:
294 return len(self.attrs)335 return len(self.attrs)
295 def __getitem__(self, name):336
337 def __getitem__(self, name:str) -> _AttributeValue:
296 return self.attrs[name]338 return self.attrs[name]
297 def __contains__(self, name):339
340 def __contains__(self, name:str) -> bool:
298 return name in list(self.attrs.keys())341 return name in list(self.attrs.keys())
299342
300343
301class Element(treebuilder_base.Node):344class Element(treebuilder_base.Node):
302 def __init__(self, element, soup, namespace):345
346 element:Tag
347 soup:'BeautifulSoup'
348 namespace:Optional[_NamespaceURL]
349
350 def __init__(self, element:Tag, soup:'BeautifulSoup',
351 namespace:Optional[_NamespaceURL]):
303 treebuilder_base.Node.__init__(self, element.name)352 treebuilder_base.Node.__init__(self, element.name)
304 self.element = element353 self.element = element
305 self.soup = soup354 self.soup = soup
306 self.namespace = namespace355 self.namespace = namespace
307356
308 def appendChild(self, node):357 def appendChild(self, node:'Element') -> None:
309 string_child = child = None358 string_child = child = None
310 if isinstance(node, str):359 if isinstance(node, str):
311 # Some other piece of code decided to pass in a string360 # Some other piece of code decided to pass in a string
312 # instead of creating a TextElement object to contain the361 # instead of creating a TextElement object to contain the
313 # string.362 # string. This should not ever happen.
314 string_child = child = node363 string_child = child = node
315 elif isinstance(node, Tag):364 elif isinstance(node, Tag):
316 # Some other piece of code decided to pass in a Tag365 # Some other piece of code decided to pass in a Tag
317 # instead of creating an Element object to contain the366 # instead of creating an Element object to contain the
318 # Tag.367 # Tag. This should not ever happen.
319 child = node368 child = node
320 elif node.element.__class__ == NavigableString:369 elif node.element.__class__ == NavigableString:
321 string_child = child = node.element370 string_child = child = node.element
@@ -324,7 +373,7 @@ class Element(treebuilder_base.Node):
324 child = node.element373 child = node.element
325 node.parent = self374 node.parent = self
326375
327 if not isinstance(child, str) and child.parent is not None:376 if not isinstance(child, str) and child is not None and child.parent is not None:
328 node.element.extract()377 node.element.extract()
329378
330 if (string_child is not None and self.element.contents379 if (string_child is not None and self.element.contents
@@ -359,14 +408,13 @@ class Element(treebuilder_base.Node):
359 child, parent=self.element,408 child, parent=self.element,
360 most_recent_element=most_recent_element)409 most_recent_element=most_recent_element)
361410
362 def getAttributes(self):411 def getAttributes(self) -> AttrList:
363 if isinstance(self.element, Comment):412 if isinstance(self.element, Comment):
364 return {}413 return {}
365 return AttrList(self.element)414 return AttrList(self.element)
366415
367 def setAttributes(self, attributes):416 def setAttributes(self, attributes:Optional[Dict]) -> None:
368 if attributes is not None and len(attributes) > 0:417 if attributes is not None and len(attributes) > 0:
369 converted_attributes = []
370 for name, value in list(attributes.items()):418 for name, value in list(attributes.items()):
371 if isinstance(name, tuple):419 if isinstance(name, tuple):
372 new_name = NamespacedAttribute(*name)420 new_name = NamespacedAttribute(*name)
@@ -386,14 +434,14 @@ class Element(treebuilder_base.Node):
386 self.soup.builder.set_up_substitutions(self.element)434 self.soup.builder.set_up_substitutions(self.element)
387 attributes = property(getAttributes, setAttributes)435 attributes = property(getAttributes, setAttributes)
388436
389 def insertText(self, data, insertBefore=None):437 def insertText(self, data:str, insertBefore:Optional['Element']=None) -> None:
390 text = TextNode(self.soup.new_string(data), self.soup)438 text = TextNode(self.soup.new_string(data), self.soup)
391 if insertBefore:439 if insertBefore:
392 self.insertBefore(text, insertBefore)440 self.insertBefore(text, insertBefore)
393 else:441 else:
394 self.appendChild(text)442 self.appendChild(text)
395443
396 def insertBefore(self, node, refNode):444 def insertBefore(self, node:'Element', refNode:'Element') -> None:
397 index = self.element.index(refNode.element)445 index = self.element.index(refNode.element)
398 if (node.element.__class__ == NavigableString and self.element.contents446 if (node.element.__class__ == NavigableString and self.element.contents
399 and self.element.contents[index-1].__class__ == NavigableString):447 and self.element.contents[index-1].__class__ == NavigableString):
@@ -405,10 +453,10 @@ class Element(treebuilder_base.Node):
405 self.element.insert(index, node.element)453 self.element.insert(index, node.element)
406 node.parent = self454 node.parent = self
407455
408 def removeChild(self, node):456 def removeChild(self, node:'Element') -> None:
409 node.element.extract()457 node.element.extract()
410458
411 def reparentChildren(self, new_parent):459 def reparentChildren(self, new_parent:'Element') -> None:
412 """Move all of this tag's children into another tag."""460 """Move all of this tag's children into another tag."""
413 # print("MOVE", self.element.contents)461 # print("MOVE", self.element.contents)
414 # print("FROM", self.element)462 # print("FROM", self.element)
@@ -424,6 +472,10 @@ class Element(treebuilder_base.Node):
424 if len(new_parent_element.contents) > 0:472 if len(new_parent_element.contents) > 0:
425 # The new parent already contains children. We will be473 # The new parent already contains children. We will be
426 # appending this tag's children to the end.474 # appending this tag's children to the end.
475
476 # We can make this assertion since we know new_parent has
477 # children.
478 assert new_parents_last_descendant is not None
427 new_parents_last_child = new_parent_element.contents[-1]479 new_parents_last_child = new_parent_element.contents[-1]
428 new_parents_last_descendant_next_element = new_parents_last_descendant.next_element480 new_parents_last_descendant_next_element = new_parents_last_descendant.next_element
429 else:481 else:
@@ -474,17 +526,21 @@ class Element(treebuilder_base.Node):
474 # print("FROM", self.element)526 # print("FROM", self.element)
475 # print("TO", new_parent_element)527 # print("TO", new_parent_element)
476528
477 def cloneNode(self):529 # TODO: typeshed stubs are incorrect about this;
530 # cloneNode returns a new Node, not None.
531 def cloneNode(self) -> treebuilder_base.Node:
478 tag = self.soup.new_tag(self.element.name, self.namespace)532 tag = self.soup.new_tag(self.element.name, self.namespace)
479 node = Element(tag, self.soup, self.namespace)533 node = Element(tag, self.soup, self.namespace)
480 for key,value in self.attributes:534 for key,value in self.attributes:
481 node.attributes[key] = value535 node.attributes[key] = value
482 return node536 return node
483537
484 def hasContent(self):538 # TODO-TYPING: typeshed stubs are incorrect about this;
485 return self.element.contents539 # hasContent returns a boolean, not None.
540 def hasContent(self) -> bool:
541 return len(self.element.contents) > 0
486542
487 def getNameTuple(self):543 def getNameTuple(self) -> Tuple[str, str]:
488 if self.namespace == None:544 if self.namespace == None:
489 return namespaces["html"], self.name545 return namespaces["html"], self.name
490 else:546 else:
@@ -493,10 +549,10 @@ class Element(treebuilder_base.Node):
493 nameTuple = property(getNameTuple)549 nameTuple = property(getNameTuple)
494550
495class TextNode(Element):551class TextNode(Element):
496 def __init__(self, element, soup):552 def __init__(self, element:PageElement, soup:'BeautifulSoup'):
497 treebuilder_base.Node.__init__(self, None)553 treebuilder_base.Node.__init__(self, None)
498 self.element = element554 self.element = element
499 self.soup = soup555 self.soup = soup
500556
501 def cloneNode(self):557 def cloneNode(self) -> treebuilder_base.Node:
502 raise NotImplementedError558 raise NotImplementedError()
diff --git a/bs4/builder/_htmlparser.py b/bs4/builder/_htmlparser.py
index 291f6c6..91cecf7 100644
--- a/bs4/builder/_htmlparser.py
+++ b/bs4/builder/_htmlparser.py
@@ -188,7 +188,7 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML):
188 # later on. If so, we want to ignore it.188 # later on. If so, we want to ignore it.
189 self.already_closed_empty_element.append(name)189 self.already_closed_empty_element.append(name)
190190
191 if self._root_tag is None:191 if self._root_tag_name is None:
192 self._root_tag_encountered(name)192 self._root_tag_encountered(name)
193 193
194 def handle_endtag(self, name:str, check_already_closed:bool=True) -> None:194 def handle_endtag(self, name:str, check_already_closed:bool=True) -> None:
@@ -422,13 +422,23 @@ class HTMLParserTreeBuilder(HTMLTreeBuilder):
422 dammit.declared_html_encoding,422 dammit.declared_html_encoding,
423 dammit.contains_replacement_characters)423 dammit.contains_replacement_characters)
424424
425 def feed(self, markup:str):425 def feed(self, markup:_RawMarkup) -> None:
426 args, kwargs = self.parser_args426 args, kwargs = self.parser_args
427
428 # HTMLParser.feed will only handle str, but
429 # BeautifulSoup.markup is allowed to be _RawMarkup, because
430 # it's set by the yield value of
431 # TreeBuilder.prepare_markup. Fortunately,
432 # HTMLParserTreeBuilder.prepare_markup always yields a str
433 # (UnicodeDammit.unicode_markup).
434 assert isinstance(markup, str)
435
427 # We know BeautifulSoup calls TreeBuilder.initialize_soup436 # We know BeautifulSoup calls TreeBuilder.initialize_soup
428 # before calling feed(), so we can assume self.soup437 # before calling feed(), so we can assume self.soup
429 # is set.438 # is set.
430 assert self.soup is not None439 assert self.soup is not None
431 parser = BeautifulSoupHTMLParser(self.soup, *args, **kwargs)440 parser = BeautifulSoupHTMLParser(self.soup, *args, **kwargs)
441
432 try:442 try:
433 parser.feed(markup)443 parser.feed(markup)
434 parser.close()444 parser.close()
diff --git a/bs4/builder/_lxml.py b/bs4/builder/_lxml.py
index ba87e87..3dfe88a 100644
--- a/bs4/builder/_lxml.py
+++ b/bs4/builder/_lxml.py
@@ -269,7 +269,7 @@ class LXMLTreeBuilderForXML(TreeBuilder):
269 for encoding in detector.encodings:269 for encoding in detector.encodings:
270 yield (detector.markup, encoding, document_declared_encoding, False)270 yield (detector.markup, encoding, document_declared_encoding, False)
271271
272 def feed(self, markup:Union[bytes,str]) -> None:272 def feed(self, markup:_RawMarkup) -> None:
273 io: IO273 io: IO
274 if isinstance(markup, bytes):274 if isinstance(markup, bytes):
275 io = BytesIO(markup)275 io = BytesIO(markup)
diff --git a/bs4/diagnose.py b/bs4/diagnose.py
index 201b879..c2202ad 100644
--- a/bs4/diagnose.py
+++ b/bs4/diagnose.py
@@ -9,7 +9,15 @@ from html.parser import HTMLParser
9import bs49import bs4
10from bs4 import BeautifulSoup, __version__ 10from bs4 import BeautifulSoup, __version__
11from bs4.builder import builder_registry11from bs4.builder import builder_registry
12from typing import TYPE_CHECKING12from typing import (
13 Any,
14 IO,
15 List,
16 Optional,
17 Tuple,
18 TYPE_CHECKING,
19)
20
13if TYPE_CHECKING:21if TYPE_CHECKING:
14 from bs4._typing import _IncomingMarkup22 from bs4._typing import _IncomingMarkup
1523
@@ -78,7 +86,7 @@ def diagnose(data:_IncomingMarkup) -> None:
7886
79 print(("-" * 80))87 print(("-" * 80))
8088
81def lxml_trace(data, html:bool=True, **kwargs) -> None:89def lxml_trace(data:_IncomingMarkup, html:bool=True, **kwargs:Any) -> None:
82 """Print out the lxml events that occur during parsing.90 """Print out the lxml events that occur during parsing.
8391
84 This lets you see how lxml parses a document when no Beautiful92 This lets you see how lxml parses a document when no Beautiful
@@ -94,7 +102,8 @@ def lxml_trace(data, html:bool=True, **kwargs) -> None:
94 recover = kwargs.pop('recover', True)102 recover = kwargs.pop('recover', True)
95 if isinstance(data, str):103 if isinstance(data, str):
96 data = data.encode("utf8")104 data = data.encode("utf8")
97 reader = BytesIO(data)105 # Wrap bytes in BytesIO; assume anything else is already file-like.
 106 reader = BytesIO(data) if isinstance(data, bytes) else data
98 for event, element in etree.iterparse(107 for event, element in etree.iterparse(
99 reader, html=html, recover=recover, **kwargs108 reader, html=html, recover=recover, **kwargs
100 ):109 ):
@@ -108,37 +117,40 @@ class AnnouncingParser(HTMLParser):
108 document. The easiest way to do this is to call `htmlparser_trace`.117 document. The easiest way to do this is to call `htmlparser_trace`.
109 """118 """
110119
111 def _p(self, s):120 def _p(self, s:str) -> None:
112 print(s)121 print(s)
113122
114 def handle_starttag(self, name, attrs):123 def handle_starttag(
124 self, name:str, attrs:List[Tuple[str, Optional[str]]],
125 handle_empty_element:bool=True
126 ) -> None:
115 self._p(f"{name} {attrs} START")127 self._p(f"{name} {attrs} START")
116128
117 def handle_endtag(self, name):129 def handle_endtag(self, name:str, check_already_closed:bool=True) -> None:
118 self._p("%s END" % name)130 self._p("%s END" % name)
119131
120 def handle_data(self, data):132 def handle_data(self, data:str) -> None:
121 self._p("%s DATA" % data)133 self._p("%s DATA" % data)
122134
123 def handle_charref(self, name):135 def handle_charref(self, name:str) -> None:
124 self._p("%s CHARREF" % name)136 self._p("%s CHARREF" % name)
125137
126 def handle_entityref(self, name):138 def handle_entityref(self, name:str) -> None:
127 self._p("%s ENTITYREF" % name)139 self._p("%s ENTITYREF" % name)
128140
129 def handle_comment(self, data):141 def handle_comment(self, data:str) -> None:
130 self._p("%s COMMENT" % data)142 self._p("%s COMMENT" % data)
131143
132 def handle_decl(self, data):144 def handle_decl(self, data:str) -> None:
133 self._p("%s DECL" % data)145 self._p("%s DECL" % data)
134146
135 def unknown_decl(self, data):147 def unknown_decl(self, data:str) -> None:
136 self._p("%s UNKNOWN-DECL" % data)148 self._p("%s UNKNOWN-DECL" % data)
137149
138 def handle_pi(self, data):150 def handle_pi(self, data:str) -> None:
139 self._p("%s PI" % data)151 self._p("%s PI" % data)
140152
141def htmlparser_trace(data):153def htmlparser_trace(data:str) -> None:
142 """Print out the HTMLParser events that occur during parsing.154 """Print out the HTMLParser events that occur during parsing.
143155
144 This lets you see how HTMLParser parses a document when no156 This lets you see how HTMLParser parses a document when no
@@ -226,7 +238,7 @@ def benchmark_parsers(num_elements:int=100000) -> None:
226 b = time.time()238 b = time.time()
227 print(("Raw html5lib parsed the markup in %.2fs." % (b-a)))239 print(("Raw html5lib parsed the markup in %.2fs." % (b-a)))
228240
229def profile(num_elements:int=100000, parser:str="lxml"):241def profile(num_elements:int=100000, parser:str="lxml") -> None:
230 """Use Python's profiler on a randomly generated document."""242 """Use Python's profiler on a randomly generated document."""
231 filehandle = tempfile.NamedTemporaryFile()243 filehandle = tempfile.NamedTemporaryFile()
232 filename = filehandle.name244 filename = filehandle.name
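
For context on the bs4.diagnose helpers whose signatures are tightened above, a minimal usage sketch (the markup string is made up; diagnose() and htmlparser_trace() are the functions shown in these hunks):

    from bs4.diagnose import diagnose, htmlparser_trace

    markup = "<p>Some <b>unclosed markup"

    # Reports which tree builders are installed and how each one parses the markup.
    diagnose(markup)

    # Prints one line per html.parser event, via the AnnouncingParser hooks above.
    htmlparser_trace(markup)
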
diff --git a/bs4/element.py b/bs4/element.py
index 83f4882..f4ab89c 100644
--- a/bs4/element.py
+++ b/bs4/element.py
@@ -44,6 +44,7 @@ if TYPE_CHECKING:
44 from bs4 import BeautifulSoup44 from bs4 import BeautifulSoup
45 from bs4.builder import TreeBuilder45 from bs4.builder import TreeBuilder
46 from bs4.dammit import _Encoding46 from bs4.dammit import _Encoding
47 from bs4.filter import ElementFilter
47 from bs4.formatter import (48 from bs4.formatter import (
48 _EntitySubstitutionFunction,49 _EntitySubstitutionFunction,
49 _FormatterOrName,50 _FormatterOrName,
@@ -901,7 +902,7 @@ class PageElement(object):
901 limit:Optional[int],902 limit:Optional[int],
902 generator:Iterator[PageElement],903 generator:Iterator[PageElement],
903 _stacklevel:int=3,904 _stacklevel:int=3,
904 **kwargs:_StrainableAttribute) -> ResultSet[PageElement]: 905 **kwargs:_StrainableAttribute) -> ResultSet[PageElement]:
905 """Iterates over a generator looking for things that match."""906 """Iterates over a generator looking for things that match."""
906 results: ResultSet[PageElement]907 results: ResultSet[PageElement]
907 908
@@ -912,11 +913,11 @@ class PageElement(object):
912 DeprecationWarning, stacklevel=_stacklevel913 DeprecationWarning, stacklevel=_stacklevel
913 )914 )
914915
915 from bs4.strainer import SoupStrainer916 from bs4.filter import ElementFilter
916 if isinstance(name, SoupStrainer):917 if isinstance(name, ElementFilter):
917 strainer = name918 matcher = name
918 else:919 else:
919 strainer = SoupStrainer(name, attrs, string, **kwargs)920 matcher = SoupStrainer(name, attrs, string, **kwargs)
920921
921 result: Iterable[PageElement]922 result: Iterable[PageElement]
922 if string is None and not limit and not attrs and not kwargs:923 if string is None and not limit and not attrs and not kwargs:
@@ -924,7 +925,7 @@ class PageElement(object):
924 # Optimization to find all tags.925 # Optimization to find all tags.
925 result = (element for element in generator926 result = (element for element in generator
926 if isinstance(element, Tag))927 if isinstance(element, Tag))
927 return ResultSet(strainer, result)928 return ResultSet(matcher, result)
928 elif isinstance(name, str):929 elif isinstance(name, str):
929 # Optimization to find all tags with a given name.930 # Optimization to find all tags with a given name.
930 if name.count(':') == 1:931 if name.count(':') == 1:
@@ -945,22 +946,25 @@ class PageElement(object):
945 )946 )
946 ):947 ):
947 result.append(element)948 result.append(element)
948 return ResultSet(strainer, result)949 return ResultSet(matcher, result)
950 return self.match(generator, matcher, limit)
951
952 def match(self, generator:Iterator[PageElement], matcher:ElementFilter, limit:Optional[int]=None) -> ResultSet[PageElement]:
953 """The most generic search method offered by Beautiful Soup.
949954
950 results = ResultSet(strainer)955 You can pass in your own technique for iterating over the tree, and your own
956 technique for matching items.
957 """
958 results:ResultSet = ResultSet(matcher)
951 while True:959 while True:
952 try:960 try:
953 i = next(generator)961 i = next(generator)
954 except StopIteration:962 except StopIteration:
955 break963 break
956 if i:964 if i:
957 # TODO: SoupStrainer.search is a confusing method965 if matcher.match(i):
958 # that needs to be redone, and this is where966 results.append(i)
959 # it's being used.967 if limit is not None and len(results) >= limit:
960 found = strainer.search(i)
961 if found:
962 results.append(found)
963 if limit and len(results) >= limit:
964 break968 break
965 return results969 return results
966970
@@ -1254,7 +1258,7 @@ class Declaration(PreformattedString):
1254class Doctype(PreformattedString):1258class Doctype(PreformattedString):
1255 """A `document type declaration <https://www.w3.org/TR/REC-xml/#dt-doctype>`_."""1259 """A `document type declaration <https://www.w3.org/TR/REC-xml/#dt-doctype>`_."""
1256 @classmethod1260 @classmethod
1257 def for_name_and_ids(cls, name:str, pub_id:str, system_id:str) -> Doctype:1261 def for_name_and_ids(cls, name:str, pub_id:Optional[str], system_id:Optional[str]) -> Doctype:
1258 """Generate an appropriate document type declaration for a given1262 """Generate an appropriate document type declaration for a given
1259 public ID and system ID.1263 public ID and system ID.
12601264
@@ -2503,12 +2507,12 @@ class Tag(PageElement):
2503_PageElementT = TypeVar("_PageElementT", bound=PageElement)2507_PageElementT = TypeVar("_PageElementT", bound=PageElement)
2504class ResultSet(List[_PageElementT], Generic[_PageElementT]):2508class ResultSet(List[_PageElementT], Generic[_PageElementT]):
2505 """A ResultSet is a list of `PageElement` objects, gathered as the result2509 """A ResultSet is a list of `PageElement` objects, gathered as the result
2506 of matching a `SoupStrainer` against a parse tree. Basically, a list of2510 of matching an `ElementFilter` against a parse tree. Basically, a list of
2507 search results.2511 search results.
2508 """2512 """
2509 source: Optional[SoupStrainer]2513 source: Optional[ElementFilter]
25102514
2511 def __init__(self, source:Optional[SoupStrainer], result: Iterable[_PageElementT]=()) -> None:2515 def __init__(self, source:Optional[ElementFilter], result: Iterable[_PageElementT]=()) -> None:
2512 super(ResultSet, self).__init__(result)2516 super(ResultSet, self).__init__(result)
2513 self.source = source2517 self.source = source
25142518
@@ -2522,4 +2526,4 @@ class ResultSet(List[_PageElementT], Generic[_PageElementT]):
2522# import SoupStrainer itself into this module to preserve the2526# import SoupStrainer itself into this module to preserve the
2523# backwards compatibility of anyone who imports2527# backwards compatibility of anyone who imports
2524# bs4.element.SoupStrainer.2528# bs4.element.SoupStrainer.
2525from bs4.strainer import SoupStrainer2529from bs4.filter import SoupStrainer
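
A short sketch of the new PageElement.match() entry point shown in the hunk above, pairing an arbitrary tree iterator with an ElementFilter (ElementFilter comes from bs4/filter.py later in this diff; the markup and lambda are illustrative):

    from bs4 import BeautifulSoup
    from bs4.element import Tag
    from bs4.filter import ElementFilter

    soup = BeautifulSoup("<p>one <b>two</b> three</p>", "html.parser")

    # Match only <b> tags; NavigableStrings short-circuit to False.
    bold_only = ElementFilter(
        match_function=lambda el: isinstance(el, Tag) and el.name == "b"
    )

    # Any iterator over the tree can be combined with any matcher.
    results = soup.match(soup.descendants, bold_only, limit=1)
    print(results)  # [<b>two</b>]
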
diff --git a/bs4/strainer.py b/bs4/filter.py
2526similarity index 60%2530similarity index 60%
2527rename from bs4/strainer.py2531rename from bs4/strainer.py
2528rename to bs4/filter.py2532rename to bs4/filter.py
index 15b289c..74e26d9 100644
--- a/bs4/strainer.py
+++ b/bs4/filter.py
@@ -25,6 +25,10 @@ from bs4._deprecation import _deprecated
25from bs4.element import NavigableString, PageElement, Tag25from bs4.element import NavigableString, PageElement, Tag
26from bs4._typing import (26from bs4._typing import (
27 _AttributeValue,27 _AttributeValue,
28 _AttributeValues,
29 _AllowStringCreationFunction,
30 _AllowTagCreationFunction,
31 _PageElementMatchFunction,
28 _TagMatchFunction,32 _TagMatchFunction,
29 _StringMatchFunction,33 _StringMatchFunction,
30 _StrainableElement,34 _StrainableElement,
@@ -33,13 +37,96 @@ from bs4._typing import (
33 _StrainableString,37 _StrainableString,
34)38)
3539
40
41class ElementFilter(object):
42 """ElementFilters encapsulate the logic necessary to decide:
43
44 1. whether a PageElement (a tag or a string) matches a
45 user-specified query.
46
47 2. whether a given sequence of markup found during initial parsing
48 should be turned into a PageElement, or simply discarded.
49
50 The base class is the simplest ElementFilter. By default, it
51 matches everything and allows all PageElements to be created. You
52 can make it more selective by passing in user-defined functions.
53
54 Most users of Beautiful Soup will never need to use
55 ElementFilter, or its more capable subclass
56 SoupStrainer. Instead, they will use the find_* methods, which
57 will convert their arguments into SoupStrainer objects and run them
58 against the tree.
59 """
 60 match_function: Optional[_PageElementMatchFunction]
61 allow_tag_creation_function: Optional[_AllowTagCreationFunction]
62 allow_string_creation_function: Optional[_AllowStringCreationFunction]
63
64 def __init__(
65 self, match_function:Optional[_PageElementMatchFunction]=None,
66 allow_tag_creation_function:Optional[_AllowTagCreationFunction]=None,
67 allow_string_creation_function:Optional[_AllowStringCreationFunction]=None):
68 self.match_function = match_function
69 self.allow_tag_creation_function = allow_tag_creation_function
70 self.allow_string_creation_function = allow_string_creation_function
71
72 @property
73 def excludes_everything(self) -> bool:
74 """Does this ElementFilter obviously exclude everything? If
75 so, Beautiful Soup will issue a warning if you try to use it
76 when parsing a document.
77
78 The ElementFilter might turn out to exclude everything even
79 if this returns False, but it won't do so in an obvious way.
80
81 The default ElementFilter excludes *nothing*, and we don't
82 have any way of answering questions about more complex
83 ElementFilters without running their hook functions, so the
84 base implementation always returns False.
85 """
86 return False
87
88 def match(self, element:PageElement) -> bool:
89 """Does the given PageElement match the rules set down by this
90 ElementFilter?
91
92 The base implementation delegates to the function passed in to
93 the constructor.
94 """
95 if not self.match_function:
96 return True
97 return self.match_function(element)
98
99 def allow_tag_creation(
100 self, nsprefix:Optional[str], name:str,
101 attrs:Optional[_AttributeValues]
102 ) -> bool:
103 """Based on the name and attributes of a tag, see whether this
104 ElementFilter will allow a Tag object to even be created.
105
106 :param name: The name of the prospective tag.
107 :param attrs: The attributes of the prospective tag.
108 """
109 if not self.allow_tag_creation_function:
110 return True
111 return self.allow_tag_creation_function(nsprefix, name, attrs)
112
113 def allow_string_creation(self, string:str) -> bool:
114 if not self.allow_string_creation_function:
115 return True
116 return self.allow_string_creation_function(string)
117
118
36class MatchRule(object):119class MatchRule(object):
120 """Each MatchRule encapsulates the logic behind a single argument
121 passed in to one of the Beautiful Soup find* methods.
122 """
123
37 string: Optional[str]124 string: Optional[str]
38 pattern: Optional[Pattern[str]]125 pattern: Optional[Pattern[str]]
39 present: Optional[bool]126 present: Optional[bool]
40127 # TODO-TYPING: All MatchRule objects also have an attribute
41 # All MatchRule objects also have an attribute ``function``, but128 # ``function``, but the type of the function depends on the
42 # the type of the function depends on the subclass.129 # subclass.
43 130
44 def __init__(131 def __init__(
45 self,132 self,
@@ -72,7 +159,7 @@ class MatchRule(object):
72 "At most one of string, pattern, function and present must be provided."159 "At most one of string, pattern, function and present must be provided."
73 )160 )
74 161
75 def _base_match(self, string:str) -> Optional[bool]:162 def _base_match(self, string:Optional[str]) -> Optional[bool]:
76 """Run the 'cheap' portion of a match, trying to get an answer without163 """Run the 'cheap' portion of a match, trying to get an answer without
77 calling a potentially expensive custom function.164 calling a potentially expensive custom function.
78165
@@ -101,7 +188,7 @@ class MatchRule(object):
101188
102 return None189 return None
103 190
104 def matches_string(self, string:str) -> bool:191 def matches_string(self, string:Optional[str]) -> bool:
105 _base_result = self._base_match(string)192 _base_result = self._base_match(string)
106 if _base_result is not None:193 if _base_result is not None:
107 # No need to invoke the test function.194 # No need to invoke the test function.
@@ -125,6 +212,7 @@ class MatchRule(object):
125 )212 )
126 213
127class TagNameMatchRule(MatchRule):214class TagNameMatchRule(MatchRule):
215 """A MatchRule implementing the rules for matches against tag name."""
128 function: Optional[_TagMatchFunction]216 function: Optional[_TagMatchFunction]
129217
130 def matches_tag(self, tag:Tag) -> bool:218 def matches_tag(self, tag:Tag) -> bool:
@@ -140,19 +228,25 @@ class TagNameMatchRule(MatchRule):
140 return False228 return False
141 229
142class AttributeValueMatchRule(MatchRule):230class AttributeValueMatchRule(MatchRule):
231 """A MatchRule implementing the rules for matches against attribute value."""
143 function: Optional[_StringMatchFunction]232 function: Optional[_StringMatchFunction]
144233
145class StringMatchRule(MatchRule):234class StringMatchRule(MatchRule):
235 """A MatchRule implementing the rules for matches against a NavigableString."""
146 function: Optional[_StringMatchFunction]236 function: Optional[_StringMatchFunction]
147 237
148class SoupStrainer(object):238class SoupStrainer(ElementFilter):
149 """Encapsulates a number of ways of matching a markup element (a tag239 """The ElementFilter subclass used internally by Beautiful Soup.
150 or a string).
151240
152 These are primarily created internally and used to underpin the241 A SoupStrainer encapsulates the logic necessary to perform the
153 find_* methods, but you can create one yourself and pass it in as242 kind of matches supported by the find_* methods. SoupStrainers are
154 ``parse_only`` to the `BeautifulSoup` constructor, to parse a243 primarily created internally, but you can create one yourself and
155 subset of a large document.244 pass it in as ``parse_only`` to the `BeautifulSoup` constructor,
245 to parse a subset of a large document.
246
247 Internally, SoupStrainer objects work by converting the
248 constructor arguments into MatchRule objects. Incoming
249 tags/markup are matched against those rules.
156250
157 :param name: One or more restrictions on the tags found in a251 :param name: One or more restrictions on the tags found in a
158 document.252 document.
@@ -226,6 +320,17 @@ class SoupStrainer(object):
226 self.__string = string320 self.__string = string
227321
228 @property322 @property
323 def excludes_everything(self) -> bool:
324 """Check whether the provided rules will obviously exclude
325 everything. (They might exclude everything even if this returns False,
326 but not in an obvious way.)
327 """
 328 return bool(
329 self.string_rules and
330 (self.name_rules or self.attribute_rules)
 331 )
332
333 @property
229 def string(self) -> Optional[_StrainableString]:334 def string(self) -> Optional[_StrainableString]:
230 ":meta private:"335 ":meta private:"
231 warnings.warn(f"Access to deprecated property string. (Look at .string_rules instead) -- Deprecated since version 4.13.0.", DeprecationWarning, stacklevel=2)336 warnings.warn(f"Access to deprecated property string. (Look at .string_rules instead) -- Deprecated since version 4.13.0.", DeprecationWarning, stacklevel=2)
@@ -262,6 +367,15 @@ class SoupStrainer(object):
262 yield rule_class(function=obj)367 yield rule_class(function=obj)
263 elif isinstance(obj, Pattern):368 elif isinstance(obj, Pattern):
264 yield rule_class(pattern=obj)369 yield rule_class(pattern=obj)
370 elif hasattr(obj, 'search'):
371 # We do a little duck typing here to detect usage of the
 372 # third-party regex library, whose pattern objects don't
373 # derive from re.Pattern.
374 #
375 # TODO-TYPING: Once we drop support for Python 3.7, we
376 # might be able to address this by defining an appropriate
377 # Protocol.
378 yield rule_class(pattern=obj)
265 elif hasattr(obj, '__iter__'):379 elif hasattr(obj, '__iter__'):
266 for o in obj:380 for o in obj:
267 if not isinstance(o, (bytes, str)) and hasattr(o, '__iter__'):381 if not isinstance(o, (bytes, str)) and hasattr(o, '__iter__'):
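
The duck-typing branch above is there so that compiled patterns from the third-party regex package, whose pattern objects expose .search() without subclassing re.Pattern, work wherever re patterns are accepted. A hypothetical sketch, assuming that package is installed:

    import regex  # third-party package, not the stdlib re module
    from bs4 import BeautifulSoup

    soup = BeautifulSoup('<a href="https://example.com/">link</a>', "html.parser")

    # The compiled pattern is routed through the hasattr(obj, 'search') branch above.
    soup.find_all("a", href=regex.compile(r"https?://"))
    # [<a href="https://example.com/">link</a>]
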
@@ -358,7 +472,7 @@ class SoupStrainer(object):
358 else:472 else:
359 attr_values = [cast(str, attr_value)]473 attr_values = [cast(str, attr_value)]
360474
361 def _match_attribute_value_helper(attr_values:Sequence[Optional[str]]):475 def _match_attribute_value_helper(attr_values:Sequence[Optional[str]]) -> bool:
362 for rule in rules:476 for rule in rules:
363 for attr_value in attr_values:477 for attr_value in attr_values:
364 if rule.matches_string(attr_value):478 if rule.matches_string(attr_value):
@@ -382,8 +496,8 @@ class SoupStrainer(object):
382 [joined_attr_value]496 [joined_attr_value]
383 )497 )
384 return this_attr_match498 return this_attr_match
385 499
386 def allow_tag_creation(self, nsprefix:Optional[str], name:str, attrs:Optional[dict[str, str]]) -> bool:500 def allow_tag_creation(self, nsprefix:Optional[str], name:str, attrs:Optional[_AttributeValues]) -> bool:
387 """Based on the name and attributes of a tag, see whether this501 """Based on the name and attributes of a tag, see whether this
388 SoupStrainer will allow a Tag object to even be created.502 SoupStrainer will allow a Tag object to even be created.
389503
@@ -423,17 +537,25 @@ class SoupStrainer(object):
423 return True537 return True
424538
425 def allow_string_creation(self, string:str) -> bool:539 def allow_string_creation(self, string:str) -> bool:
540 """Based on the content of a markup string, see whether this
541 SoupStrainer will allow it to be instantiated as a
542 NavigableString object, or whether it should be ignored.
543 """
426 if self.name_rules or self.attribute_rules:544 if self.name_rules or self.attribute_rules:
427 # A SoupStrainer that has name or attribute rules won't545 # A SoupStrainer that has name or attribute rules won't
428 # match any strings; it's designed to match tags with546 # match any strings; it's designed to match tags with
429 # certain properties.547 # certain properties.
430 return False548 return False
549 if not self.string_rules:
550 # A SoupStrainer with no string rules will match
551 # all strings.
552 return True
431 if not self.matches_any_string_rule(string):553 if not self.matches_any_string_rule(string):
432 return False554 return False
433 return True555 return True
434 556
435 def matches_any_string_rule(self, string:str) -> bool:557 def matches_any_string_rule(self, string:str) -> bool:
436 """See whether the content of a string, matches any of 558 """See whether the content of a string matches any of
437 this SoupStrainer's string rules.559 this SoupStrainer's string rules.
438 """560 """
439 if not self.string_rules:561 if not self.string_rules:
@@ -442,28 +564,37 @@ class SoupStrainer(object):
442 if string_rule.matches_string(string):564 if string_rule.matches_string(string):
443 return True565 return True
444 return False566 return False
445 567
446 568 def match(self, element:PageElement) -> bool:
569 """Does the given PageElement match the rules set down by this
570 SoupStrainer?
571
572 The find_* methods rely heavily on this method to find matches.
573
574 :param element: A PageElement.
575 :return: True if the element matches this SoupStrainer's rules; False otherwise.
576 """
577 if isinstance(element, Tag):
578 return self.matches_tag(element)
579 assert isinstance(element, NavigableString)
580 if not (self.name_rules or self.attribute_rules):
581 # A NavigableString can only match a SoupStrainer that
582 # does not define any name or attribute restrictions.
583 for rule in self.string_rules:
584 if rule.matches_string(element):
585 return True
586 return False
587
447 @_deprecated("allow_tag_creation", "4.13.0")588 @_deprecated("allow_tag_creation", "4.13.0")
448 def search_tag(self, name, attrs):589 def search_tag(self, name:str, attrs:Optional[_AttributeValues]) -> bool:
590 """A less elegant version of allow_tag_creation()."""
449 ":meta private:"591 ":meta private:"
450 return self.allow_tag_creation(None, name, attrs)592 return self.allow_tag_creation(None, name, attrs)
451 593
452 def search(self, element:PageElement):594 @_deprecated("match", "4.13.0")
453 # TODO: This method needs to be removed or redone. It is595 def search(self, element:PageElement) -> Optional[PageElement]:
454 # very confusing but it's used everywhere.596 """A less elegant version of match().
455 match = None
456 if isinstance(element, Tag):
457 match = self.matches_tag(element)
458 else:
459 assert isinstance(element, NavigableString)
460 match = False
461 if not (self.name_rules or self.attribute_rules):
462 # A NavigableString can only match a SoupStrainer that
463 # does not define any name or attribute restrictions.
464 for rule in self.string_rules:
465 if rule.matches_string(element):
466 match = True
467 break
468 return element if match else False
469597
598 :meta private:
599 """
600 return element if self.match(element) else None
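
To summarize the public surface of the relocated module: SoupStrainer.match() is the boolean replacement for the deprecated search(), and excludes_everything flags strainers that can never match anything. A minimal sketch (markup and names are illustrative):

    from bs4 import BeautifulSoup
    from bs4.filter import SoupStrainer

    soup = BeautifulSoup("<a>link</a><b>bold</b>", "html.parser")
    strainer = SoupStrainer(name="a")

    strainer.match(soup.a)  # True
    strainer.match(soup.b)  # False

    # Combining string rules with tag rules excludes every PageElement,
    # which is exactly what the parse_only warning checks for.
    SoupStrainer(name="a", string="link").excludes_everything  # True
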
diff --git a/bs4/tests/__init__.py b/bs4/tests/__init__.py
index 2ef7fd8..3ef999d 100644
--- a/bs4/tests/__init__.py
+++ b/bs4/tests/__init__.py
@@ -20,7 +20,7 @@ from bs4.element import (
20 Stylesheet,20 Stylesheet,
21 Tag21 Tag
22)22)
23from bs4.strainer import SoupStrainer23from bs4.filter import SoupStrainer
24from bs4.builder import (24from bs4.builder import (
25 DetectsXMLParsedAsHTML,25 DetectsXMLParsedAsHTML,
26 XMLParsedAsHTMLWarning,26 XMLParsedAsHTMLWarning,
diff --git a/bs4/tests/test_strainer.py b/bs4/tests/test_filter.py
27similarity index 56%27similarity index 56%
28rename from bs4/tests/test_strainer.py28rename from bs4/tests/test_strainer.py
29rename to bs4/tests/test_filter.py29rename to bs4/tests/test_filter.py
index 4de03f0..8d5da70 100644
--- a/bs4/tests/test_strainer.py
+++ b/bs4/tests/test_filter.py
@@ -6,20 +6,108 @@ from . import (
6 SoupTest,6 SoupTest,
7)7)
8from bs4.element import Tag8from bs4.element import Tag
9from bs4.strainer import (9from bs4.filter import (
10 AttributeValueMatchRule,10 AttributeValueMatchRule,
11 ElementFilter,
11 MatchRule,12 MatchRule,
12 SoupStrainer,13 SoupStrainer,
13 StringMatchRule,14 StringMatchRule,
14 TagNameMatchRule,15 TagNameMatchRule,
15)16)
1617
17class TestMatchrule(SoupTest):18class TestElementFilter(SoupTest):
19
20 def test_default_behavior(self):
21 # An unconfigured ElementFilter matches absolutely everything.
22 selector = ElementFilter()
23 assert not selector.excludes_everything
24 soup = self.soup("<a>text</a>")
25 tag = soup.a
26 string = tag.string
27 assert True == selector.match(soup)
28 assert True == selector.match(tag)
29 assert True == selector.match(string)
30 assert soup.find(selector).name == "a"
31
32 # And allows any incoming markup to be turned into PageElements.
33 assert True == selector.allow_tag_creation(None, "tag", None)
34 assert True == selector.allow_string_creation("some string")
35
36 def test_match(self):
37 def m(pe):
38 return (pe.string == "allow" or (
39 isinstance(pe, Tag) and pe.name=="allow"))
40
41 soup = self.soup("<allow>deny</allow>allow<deny>deny</deny>")
42 allow_tag = soup.allow
43 allow_string = soup.find(string="allow")
44 deny_tag = soup.deny
45 deny_string = soup.find(string="deny")
46
47 selector = ElementFilter(match_function=m)
48 assert True == selector.match(allow_tag)
49 assert True == selector.match(allow_string)
50 assert False == selector.match(deny_tag)
51 assert False == selector.match(deny_string)
52
53 # Since only the match function was provided, there is
54 # no effect on tag or string creation.
55 soup = self.soup("<a>text</a>", parse_only=selector)
56 assert "text" == soup.a.string
57
58 def test_allow_tag_creation(self):
59 def m(nsprefix, name, attrs):
60 return nsprefix=="allow" or name=="allow" or "allow" in attrs
61 selector = ElementFilter(allow_tag_creation_function=m)
62 f = selector.allow_tag_creation
63 assert True == f("allow", "ignore", {})
64 assert True == f("ignore", "allow", {})
65 assert True == f(None, "ignore", {"allow": "1"})
66 assert False == f("no", "no", {"no" : "nope"})
67
68 # Test the ElementFilter as a value for parse_only.
69 soup = self.soup(
70 "<deny>deny</deny> <allow>deny</allow> allow",
71 parse_only=selector
72 )
1873
19 def _tuple(self, rule):74 # The <deny> tag was filtered out, but there was no effect on
20 if isinstance(rule.pattern, str):75 # the strings, since only allow_tag_creation_function was
21 import pdb; pdb.set_trace()76 # defined.
77 assert 'deny <allow>deny</allow> allow' == soup.decode()
78
79 # Similarly, since match_function was not defined, this
80 # ElementFilter matches everything.
81 assert soup.find(selector) == "deny"
82
83 def test_allow_string_creation(self):
84 def m(s):
85 return s=="allow"
86 selector = ElementFilter(allow_string_creation_function=m)
87 f = selector.allow_string_creation
88 assert True == f("allow")
89 assert False == f("deny")
90 assert False == f("please allow")
91
92 # Test the ElementFilter as a value for parse_only.
93 soup = self.soup(
94 "<deny>deny</deny> <allow>deny</allow> allow",
95 parse_only=selector
96 )
97
98 # All incoming strings other than "allow" (even whitespace)
99 # were filtered out, but there was no effect on the tags,
100 # since only allow_string_creation_function was defined.
101 assert '<deny>deny</deny><allow>deny</allow>' == soup.decode()
102
103 # Similarly, since match_function was not defined, this
104 # ElementFilter matches everything.
105 assert soup.find(selector).name == "deny"
22106
107
108class TestMatchRule(SoupTest):
109
110 def _tuple(self, rule):
23 return (111 return (
24 rule.string,112 rule.string,
25 rule.pattern.pattern if rule.pattern else None,113 rule.pattern.pattern if rule.pattern else None,
@@ -155,6 +243,28 @@ class TestSoupStrainer(SoupTest):
155 assert w2.filename == __file__243 assert w2.filename == __file__
156 assert msg == "Access to deprecated property text. (Look at .string_rules instead) -- Deprecated since version 4.13.0."244 assert msg == "Access to deprecated property text. (Look at .string_rules instead) -- Deprecated since version 4.13.0."
157245
246 def test_search_tag_deprecated(self):
247 strainer = SoupStrainer(name="a")
248 with warnings.catch_warnings(record=True) as w:
249 assert False == strainer.search_tag("b", {})
250 [w1] = w
251 msg = str(w1.message)
252 assert w1.filename == __file__
253 assert msg == "Call to deprecated method search_tag. (Replaced by allow_tag_creation) -- Deprecated since version 4.13.0."
254
255 def test_search_deprecated(self):
256 strainer = SoupStrainer(name="a")
257 soup = self.soup("<a></a><b></b>")
258 with warnings.catch_warnings(record=True) as w:
259 assert soup.a == strainer.search(soup.a)
260 assert None == strainer.search(soup.b)
261 [w1, w2] = w
262 msg = str(w1.message)
263 assert msg == str(w2.message)
264 assert w1.filename == __file__
265 assert msg == "Call to deprecated method search. (Replaced by match) -- Deprecated since version 4.13.0."
266
267 # Dummy function used within tests.
158 def _match_function(x):268 def _match_function(x):
159 pass269 pass
160 270
@@ -213,7 +323,7 @@ class TestSoupStrainer(SoupTest):
213 )323 )
214324
215 def test_constructor_with_overlapping_attributes(self):325 def test_constructor_with_overlapping_attributes(self):
216 # If you specify the same attribute in arts and **kwargs, you end up326 # If you specify the same attribute in args and **kwargs, you end up
217 # with two different AttributeValueMatchRule objects.327 # with two different AttributeValueMatchRule objects.
218328
219 # This happens whether you use the 'class' shortcut on attrs...329 # This happens whether you use the 'class' shortcut on attrs...
@@ -437,17 +547,24 @@ class TestSoupStrainer(SoupTest):
437 # because the string restrictions can't be evaluated during547 # because the string restrictions can't be evaluated during
438 # the parsing process, and the tag restrictions eliminate548 # the parsing process, and the tag restrictions eliminate
439 # any strings from consideration.549 # any strings from consideration.
550 #
551 # We can detect this ahead of time, and warn about it,
552 # thanks to SoupStrainer.excludes_everything
440 markup = "<a><b>one string<div>another string</div></b></a>"553 markup = "<a><b>one string<div>another string</div></b></a>"
441554
442 with warnings.catch_warnings(record=True) as w:555 with warnings.catch_warnings(record=True) as w:
 556 assert True == soupstrainer.excludes_everything
443 assert "" == self.soup(markup, parse_only=soupstrainer).decode()557 assert "" == self.soup(markup, parse_only=soupstrainer).decode()
444 [warning] = w558 [warning] = w
445 msg = str(warning.message)559 msg = str(warning.message)
446 assert warning.filename == __file__560 assert warning.filename == __file__
447 assert str(warning.message).startswith(561 assert str(warning.message).startswith(
448 "Value for parse_only will exclude everything, since it puts restrictions on both tags and strings:"562 "The given value for parse_only will exclude everything:"
449 )563 )
450 564
565 # The average SoupStrainer has excludes_everything=False
566 assert not SoupStrainer().excludes_everything
567
451 def test_documentation_examples(self):568 def test_documentation_examples(self):
452 """Medium-weight real-world tests based on the Beautiful Soup569 """Medium-weight real-world tests based on the Beautiful Soup
453 documentation.570 documentation.
diff --git a/bs4/tests/test_html5lib.py b/bs4/tests/test_html5lib.py
index b0f4384..9f6dfa1 100644
--- a/bs4/tests/test_html5lib.py
+++ b/bs4/tests/test_html5lib.py
@@ -4,7 +4,7 @@ import pytest
4import warnings4import warnings
55
6from bs4 import BeautifulSoup6from bs4 import BeautifulSoup
7from bs4.strainer import SoupStrainer7from bs4.filter import SoupStrainer
8from . import (8from . import (
9 HTML5LIB_PRESENT,9 HTML5LIB_PRESENT,
10 HTML5TreeBuilderSmokeTest,10 HTML5TreeBuilderSmokeTest,
@@ -24,7 +24,7 @@ class TestHTML5LibBuilder(SoupTest, HTML5TreeBuilderSmokeTest):
24 return HTML5TreeBuilder24 return HTML5TreeBuilder
2525
26 def test_soupstrainer(self):26 def test_soupstrainer(self):
27 # The html5lib tree builder does not support SoupStrainers.27 # The html5lib tree builder does not support parse_only.
28 strainer = SoupStrainer("b")28 strainer = SoupStrainer("b")
29 markup = "<p>A <b>bold</b> statement.</p>"29 markup = "<p>A <b>bold</b> statement.</p>"
30 with warnings.catch_warnings(record=True) as w:30 with warnings.catch_warnings(record=True) as w:
diff --git a/bs4/tests/test_lxml.py b/bs4/tests/test_lxml.py
index d450740..9fc04e0 100644
--- a/bs4/tests/test_lxml.py
+++ b/bs4/tests/test_lxml.py
@@ -14,7 +14,7 @@ from bs4 import (
14 BeautifulStoneSoup,14 BeautifulStoneSoup,
15 )15 )
16from bs4.element import Comment, Doctype16from bs4.element import Comment, Doctype
17from bs4.strainer import SoupStrainer17from bs4.filter import SoupStrainer
18from . import (18from . import (
19 HTMLTreeBuilderSmokeTest,19 HTMLTreeBuilderSmokeTest,
20 XMLTreeBuilderSmokeTest,20 XMLTreeBuilderSmokeTest,
diff --git a/bs4/tests/test_pageelement.py b/bs4/tests/test_pageelement.py
index 19b4d63..7dfdc22 100644
--- a/bs4/tests/test_pageelement.py
+++ b/bs4/tests/test_pageelement.py
@@ -10,7 +10,7 @@ from bs4.element import (
10 Comment,10 Comment,
11 ResultSet,11 ResultSet,
12)12)
13from bs4.strainer import SoupStrainer13from bs4.filter import SoupStrainer
14from . import (14from . import (
15 SoupTest,15 SoupTest,
16)16)
diff --git a/bs4/tests/test_soup.py b/bs4/tests/test_soup.py
index 4f8ee1a..c95f380 100644
--- a/bs4/tests/test_soup.py
+++ b/bs4/tests/test_soup.py
@@ -27,7 +27,7 @@ from bs4.element import (
27 Tag,27 Tag,
28 NavigableString,28 NavigableString,
29)29)
30from bs4.strainer import SoupStrainer30from bs4.filter import SoupStrainer
3131
32from . import (32from . import (
33 default_builder,33 default_builder,
@@ -293,7 +293,7 @@ class TestWarnings(SoupTest):
293 soup = self.soup("<a><b></b></a>", parse_only=strainer)293 soup = self.soup("<a><b></b></a>", parse_only=strainer)
294 warning = self._assert_warning(w, UserWarning)294 warning = self._assert_warning(w, UserWarning)
295 msg = str(warning.message)295 msg = str(warning.message)
296 assert msg.startswith("Value for parse_only will exclude everything, since it puts restrictions on both tags and strings:")296 assert msg.startswith("The given value for parse_only will exclude everything:")
297 297
298 def test_parseOnlyThese_renamed_to_parse_only(self):298 def test_parseOnlyThese_renamed_to_parse_only(self):
299 with warnings.catch_warnings(record=True) as w:299 with warnings.catch_warnings(record=True) as w:
diff --git a/bs4/tests/test_tree.py b/bs4/tests/test_tree.py
index 606525f..43afb29 100644
--- a/bs4/tests/test_tree.py
+++ b/bs4/tests/test_tree.py
@@ -26,7 +26,7 @@ from bs4.element import (
26 Tag,26 Tag,
27 TemplateString,27 TemplateString,
28)28)
29from bs4.strainer import SoupStrainer29from bs4.filter import SoupStrainer
30from . import (30from . import (
31 SoupTest,31 SoupTest,
32)32)
diff --git a/doc/index.rst b/doc/index.rst
index 7beff36..a414830 100755
--- a/doc/index.rst
+++ b/doc/index.rst
@@ -20,7 +20,7 @@ with examples. I show you what the library is good for, how it works,
20how to use it, how to make it do what you want, and what to do when it20how to use it, how to make it do what you want, and what to do when it
21violates your expectations.21violates your expectations.
2222
23This document covers Beautiful Soup version 4.12.2. The examples in23This document covers Beautiful Soup version 4.13.0. The examples in
24this documentation were written for Python 3.8.24this documentation were written for Python 3.8.
2525
26You might be looking for the documentation for `Beautiful Soup 326You might be looking for the documentation for `Beautiful Soup 3
@@ -2577,6 +2577,11 @@ the human-visible content of the page.*
2577either return the object itself, or nothing, so the only reason to do2577either return the object itself, or nothing, so the only reason to do
2578this is when you're iterating over a mixed list.*2578this is when you're iterating over a mixed list.*
25792579
2580*As of Beautiful Soup version 4.13.0, you can call .string on a
2581NavigableString object. It will return the object itself, so again,
2582the only reason to do this is when you're iterating over a mixed
2583list.*
2584
2580Specifying the parser to use2585Specifying the parser to use
2581============================2586============================
25822587
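
A one-line illustration of the .string note added in this hunk:

    from bs4.element import NavigableString

    s = NavigableString("some text")
    s.string is s  # True, per the 4.13.0 behavior described above
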
@@ -2604,8 +2609,9 @@ specifying one of the following:
26042609
2605The section `Installing a parser`_ contrasts the supported parsers.2610The section `Installing a parser`_ contrasts the supported parsers.
26062611
2607If you don't have an appropriate parser installed, Beautiful Soup will2612If you ask for a parser that isn't installed, Beautiful Soup will
2608ignore your request and pick a different parser. Right now, the only2613raise an exception so that you don't inadvertently parse a document
2614under an unknown set of rules. For example, right now, the only
2609supported XML parser is lxml. If you don't have lxml installed, asking2615supported XML parser is lxml. If you don't have lxml installed, asking
2610for an XML parser won't give you one, and asking for "lxml" won't work2616for an XML parser won't give you one, and asking for "lxml" won't work
2611either.2617either.
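
The revised wording describes the exception raised when a requested parser isn't available; in bs4 that exception is FeatureNotFound. A sketch of what a caller sees when lxml is missing:

    from bs4 import BeautifulSoup, FeatureNotFound

    try:
        soup = BeautifulSoup("<doc/>", "xml")
    except FeatureNotFound:
        # Raised instead of silently falling back to a different parser.
        print("The 'xml' feature requires lxml; install it to parse XML.")
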
@@ -3018,6 +3024,44 @@ been called on it::
3018This is because two different :py:class:`Tag` objects can't occupy the same3024This is because two different :py:class:`Tag` objects can't occupy the same
3019space at the same time.3025space at the same time.
30203026
3027Advanced search techniques
3028==========================
3029
3030Almost everyone who uses Beautiful Soup to extract information from a
3031document can get what they need using the methods described in
3032`Searching the tree`_. However, there's a lower-level interface--the
3033:py:class:`ElementSelector` class-- which lets you define any matching
3034behavior whatsoever.
3035
3036To use :py:class:`ElementSelector`, define a function that takes a
3037:py:class:`PageElement` object (that is, it might be either a
3038:py:class:`Tag` or a :py:class`NavigableString`) and returns ``True``
3039(if the element matches your custom criteria) or ``False`` (if it
3040doesn't)::
3041
3042 [example goes here]
3043
3044Then, pass the function into an :py:class:`ElementFilter`::
3045
3046 from bs4.filter import ElementFilter
3047 selector = ElementFilter(f)
3048
3049You can then pass the :py:class:`ElementFilter` object as the first
3050argument to any of the `Searching the tree`_ methods::
3051
3052 [examples go here]
3053
3054Every potential match will be run through your function, and the only
3055:py:class:`PageElement` objects returned will be the ones for which your
3056function returned ``True``.
3057
3058Note that this is different from simply passing `a function`_ as the
3059first argument to one of the search methods. That's an easy way to
3060find a tag, but *only* tags will be considered. With an
3061:py:class:`ElementFilter` you can write a single function that makes
3062decisions about both tags and strings.
3063
3064
3021Advanced parser customization3065Advanced parser customization
3022=============================3066=============================
30233067
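
As a concrete companion to the "Advanced search techniques" section added in the hunk above (its "[example goes here]" placeholders are left untouched): a sketch using the ElementFilter class from this branch's bs4/filter.py, with a made-up function name and markup:

    from bs4 import BeautifulSoup
    from bs4.element import Tag
    from bs4.filter import ElementFilter

    def shouting(element):
        # Tags match by name; strings match if they are entirely upper case.
        if isinstance(element, Tag):
            return element.name == "b"
        return element.strip().isupper()

    soup = BeautifulSoup("<p>HELLO <b>there</b> world</p>", "html.parser")
    soup.find_all(ElementFilter(shouting))
    # ['HELLO ', <b>there</b>]
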
@@ -3111,14 +3155,6 @@ The :py:class:`SoupStrainer` behavior is as follows:
3111* When a tag does not match, the tag itself is not kept, but parsing continues3155* When a tag does not match, the tag itself is not kept, but parsing continues
3112 into its contents to look for other tags that do match.3156 into its contents to look for other tags that do match.
31133157
3114You can also pass a :py:class:`SoupStrainer` into any of the methods covered
3115in `Searching the tree`_. This probably isn't terribly useful, but I
3116thought I'd mention it::
3117
3118 soup = BeautifulSoup(html_doc, 'html.parser')
3119 soup.find_all(only_short_strings)
3120 # ['\n\n', '\n\n', 'Elsie', ',\n', 'Lacie', ' and\n', 'Tillie',
3121 # '\n\n', '...', '\n']
31223158
3123Customizing multi-valued attributes3159Customizing multi-valued attributes
3124-----------------------------------3160-----------------------------------
