Merge ~chrispitude/beautifulsoup:node-filters into beautifulsoup:master

Proposed by Chris Papademetrious
Status: Superseded
Proposed branch: ~chrispitude/beautifulsoup:node-filters
Merge into: beautifulsoup:master
Diff against target: 9961 lines (+4153/-2375)
31 files modified
CHANGELOG (+99/-0)
bs4/__init__.py (+245/-73)
bs4/_deprecation.py (+57/-0)
bs4/_typing.py (+99/-0)
bs4/builder/__init__.py (+278/-184)
bs4/builder/_html5lib.py (+57/-36)
bs4/builder/_htmlparser.py (+120/-66)
bs4/builder/_lxml.py (+137/-68)
bs4/css.py (+124/-96)
bs4/dammit.py (+407/-230)
bs4/diagnose.py (+31/-19)
bs4/element.py (+1154/-1052)
bs4/formatter.py (+96/-41)
bs4/strainer.py (+498/-0)
bs4/tests/__init__.py (+17/-12)
bs4/tests/test_builder_registry.py (+3/-3)
bs4/tests/test_dammit.py (+17/-10)
bs4/tests/test_element.py (+25/-5)
bs4/tests/test_html5lib.py (+17/-1)
bs4/tests/test_htmlparser.py (+5/-3)
bs4/tests/test_lxml.py (+4/-3)
bs4/tests/test_pageelement.py (+8/-4)
bs4/tests/test_soup.py (+13/-5)
bs4/tests/test_strainer.py (+485/-0)
bs4/tests/test_tag.py (+1/-0)
bs4/tests/test_tree.py (+44/-25)
dev/null (+0/-256)
doc/Makefile (+14/-124)
doc/conf.py (+33/-0)
doc/index.rst (+61/-57)
tox.ini (+4/-2)
Reviewer: Leonard Richardson (status: Pending)
Review via email: mp+457782@code.launchpad.net

This proposal has been superseded by a proposal from 2024-01-01.

Commit message

implement "node" filtering that considers all PageElement objects

Description of the change

This is a draft merge request that implements a proof of concept for the following wishlist request:

#2047713: enhance find*() methods to filter through all object types
https://bugs.launchpad.net/beautifulsoup/+bug/2047713

Most of the changes thread a new "node" filter argument down into the search machinery. The actual functionality is just three additional lines in the search() method that use it.
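To illustrate the idea in isolation, here is a self-contained sketch of a node-level predicate applied in a generic search loop. This is not the actual patch; the function name and signature are mine, purely for illustration.

```python
# Hypothetical sketch of a node-level predicate in a search loop.
# The name search_nodes() and its signature are illustrative only,
# not the API implemented by this merge proposal.
def search_nodes(elements, node=None):
    """Return the elements accepted by the node predicate (all, if None)."""
    results = []
    for element in elements:
        # The predicate sees every element, regardless of its type.
        if node is None or node(element):
            results.append(element)
    return results
```

The point is that the predicate is applied uniformly to every element visited, with no tag-versus-string distinction.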

With these changes, when the following script is run:

====
#!/usr/bin/env python
from bs4 import BeautifulSoup, NavigableString
html_doc = """
  <p>
    <b>bold</b>
    <i>italic</i>
    <u>underline</u>
    text
    <br />
  </p>
"""
soup = BeautifulSoup(html_doc, 'lxml')

# this is the filter I want to use
def is_non_whitespace(thing) -> bool:
    return not (isinstance(thing, NavigableString) and thing.text.isspace())

# get the first non-whitespace thing in <p>
this_thing = soup.find('p').find(node=is_non_whitespace, recursive=False)

# print all following non-whitespace sibling elements in <p>
while this_thing:
    next_thing = this_thing.find_next_sibling(node=is_non_whitespace)
    print(f"{repr(this_thing)} is followed by {repr(next_thing)}")
    this_thing = next_thing
====

the results are as follows:

====
<b>bold</b> is followed by <i>italic</i>
<i>italic</i> is followed by <u>underline</u>
<u>underline</u> is followed by '\n text\n '
'\n text\n ' is followed by <br/>
<br/> is followed by None
====

Note the mix of tag and text objects!
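For comparison, approximating this today (without the patch) means iterating .next_siblings by hand, because the find_* methods cannot apply a single predicate across both tags and strings. A rough sketch, using only the current bs4 API (the helper name is mine):

```python
from bs4 import BeautifulSoup, NavigableString

html_doc = "<p><b>bold</b>\n  text\n  <br /></p>"
soup = BeautifulSoup(html_doc, "html.parser")

def non_whitespace_siblings(element):
    """Yield following siblings of any type, skipping whitespace-only strings."""
    for sibling in element.next_siblings:
        # NavigableString subclasses str, so it can be tested directly.
        if isinstance(sibling, NavigableString) and str(sibling).isspace():
            continue
        yield sibling

first = soup.find("b")
for thing in non_whitespace_siblings(first):
    print(repr(thing))
```

The proposed node= argument would fold this hand-rolled loop into the find_* methods themselves.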

Some questions and open items:

* Is "PageElement" the correct term for any object in a BeautifulSoup document (Tag, NavigableString, Comment, ProcessingInstruction, etc.)?
* What should this new filter argument be called? (node? page_element? something else?)
* Is there a more elegant approach that doesn't require threading a new argument down into everything?
* Rules for mixing this new filter with existing name/attribute filters must be defined/coded/documented.
  * I think this new filter should be mutually exclusive with tag/attribute filters.
  * I think this new filter should accept only Callable objects, and perhaps also True/False.
* Tests and documentation are needed.
  * I can do this when the implementation is complete.

Fingers crossed that this makes it in. It would be enormously powerful.

Revision history for this message
Chris Papademetrious (chrispitude) wrote:

I will redo this merge proposal on top of the 4.13 branch.

Before I start, are there any aspects that you'd like to see implemented differently?

a771557... by Chris Papademetrious

implement a filter that considers all PageElement objects

Unmerged commits

a771557... by Chris Papademetrious

implement a filter that considers all PageElement objects

8a6d1dd... by Leonard Richardson

Merged in change to main branch.

4cde600... by Leonard Richardson

Those casts are more trouble than they're worth.

1113a86... by Leonard Richardson

Got css.py to pass mypy strict although it's a little hacky.

26e1772... by Leonard Richardson

Went through formatter.py with mypy strict.

5bf3787... by Leonard Richardson

Went through dammit.py with mypy strict.

7200655... by Leonard Richardson

Merged in main branch.

f3a3619... by Leonard Richardson

Got rid of deprecation warnings in tests.

6f89323... by Leonard Richardson

Get (slightly) more specific about alias.

f8e55c0... by Leonard Richardson

_alias itself is not used anywhere.

Preview Diff

diff --git a/CHANGELOG b/CHANGELOG
index 66fcb74..bec1e11 100644
--- a/CHANGELOG
+++ b/CHANGELOG
@@ -1,3 +1,102 @@
+= 4.13.0 (Unreleased)
+
+* This version drops support for Python 3.6. The minimum supported
+  major version for Beautiful Soup is now Python 3.7.
+
+* Deprecation warnings have been added for all deprecated methods and
+  attributes. Most of these were deprecated over ten years ago, and
+  some were deprecated over fifteen years ago.
+
+  Going forward, deprecated names will be subject to removal two
+  feature releases or one major release after the deprecation warning
+  is added.
+
+* append(), extend(), insert(), and unwrap() were moved from PageElement to
+  Tag. Those methods manipulate the 'contents' collection, so they would
+  only have ever worked on Tag objects.
+
+* decompose() was moved from Tag to PageElement, since there's no reason
+  it won't also work on NavigableString objects.
+
+* The BeautifulSoupHTMLParser constructor now requires a BeautifulSoup
+  object as its first argument. This almost certainly does not affect
+  you, since you probably use HTMLParserTreeBuilder, not
+  BeautifulSoupHTMLParser directly.
+
+* If Tag.get_attribute_list() is used to access an attribute that's not set,
+  the return value is now an empty list rather than [None].
+
+* AttributeValueWithCharsetSubstitution.encode() is renamed to
+  substitute_encoding, to avoid confusion with the much different str.encode()
+
+* Using PageElement.replace_with() to replace an element with itself
+  returns the element instead of None.
+
+* When using one of the find() methods or creating a SoupStrainer,
+  if you specify the same attribute value in ``attrs`` and the
+  keyword arguments, you'll end up with two different ways to match that
+  attribute. Previously the value in keyword arguments would override the
+  value in ``attrs``.
+
+* When using one of the find() methods or creating a SoupStrainer, you can
+  pass a list of any accepted object (strings, regular expressions, etc.) for
+  any of the objects. Previously you could only pass in a list of strings.
+
+* A SoupStrainer can now filter tag creation based on a tag's
+  namespaced name. Previously only the unqualified name could be used.
+
+* All TreeBuilder constructors now take the empty_element_tags
+  argument. The sets of tags found in HTMLTreeBuilder.empty_element_tags and
+  HTMLTreeBuilder.block_elements are now in
+  HTMLTreeBuilder.DEFAULT_EMPTY_ELEMENT_TAGS and
+  HTMLTreeBuilder.DEFAULT_BLOCK_ELEMENTS, to avoid confusing them with
+  instance variables.
+
+* Issue a warning if a document is parsed using a SoupStrainer that's just
+  going to filter everything. In these cases, filtering everything is the
+  most consistent thing to do, but there was no indication that was
+  happening.
+
+* UnicodeDammit.markup is now always a bytestring representing the
+  *original* markup (sans BOM), and UnicodeDammit.unicode_markup is
+  always the same markup, converted to Unicode. Previously,
+  UnicodeDammit.markup was treated inconsistently and would often end
+  up containing Unicode. UnicodeDammit.markup was not a documented
+  attribute, but if you were using it, you probably want to switch to using
+  .unicode_markup instead.
+
+* Corrected the markup that's output in the unlikely event that you
+  encode a document to a Python internal encoding (like "palmos")
+  that's not recognized by the HTML or XML standard.
+
+* The arguments to LXMLTreeBuilderForXML.prepare_markup have been
+  changed to match the arguments to the superclass,
+  TreeBuilder.prepare_markup. Specifically, document_declared_encoding
+  now appears before exclude_encodings, not after. If you were calling
+  this method yourself, I recomment switching to using keyword
+  arguments instead.
+
+* Fixed an error in the lookup table used when converting
+  ISO-Latin-1 to ASCII, which no one should do anyway.
+
+* The unused constant LXMLTreeBuilderForXML.DEFAULT_PARSER_CLASS
+  has been removed.
+
+New deprecations in 4.13.0:
+
+* The SAXTreeBuilder class, which was never officially supported or tested.
+
+* The first argument to BeautifulSoup.decode has been changed from a bool
+  `pretty_print` to an int `indent_level`, to match the signature of Tag.decode.
+
+* SoupStrainer.text and SoupStrainer.string are both deprecated
+  since a single item can't capture all the possibilities of a SoupStrainer
+  designed to match strings.
+
+* SoupStrainer.search_tag() is deprecated. It was never a
+  documented method, but if you use it, you should start using
+  SoupStrainer.allow_tag_creation() instead.
+
 = 4.12.3 (?)
 
 * Fixed a regression such that if you set .hidden on a tag, the tag
diff --git a/bs4/__init__.py b/bs4/__init__.py
index 3d2ab09..a1289c7 100644
--- a/bs4/__init__.py
+++ b/bs4/__init__.py
@@ -7,8 +7,8 @@ Beautiful Soup uses a pluggable XML or HTML parser to parse a
 provides methods and Pythonic idioms that make it easy to navigate,
 search, and modify the parse tree.
 
-Beautiful Soup works with Python 3.6 and up. It works better if lxml
-and/or html5lib is installed.
+Beautiful Soup works with Python 3.7 and up. It works better if lxml
+and/or html5lib is installed, but they are not required.
 
 For more than you ever wanted to know about Beautiful Soup, see the
 documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
@@ -37,9 +37,10 @@ if sys.version_info.major < 3:
 from .builder import (
     builder_registry,
     ParserRejectedMarkup,
+    TreeBuilder,
     XMLParsedAsHTMLWarning,
-    HTMLParserTreeBuilder
 )
+from .builder._htmlparser import HTMLParserTreeBuilder
 from .dammit import UnicodeDammit
 from .element import (
     CData,
@@ -55,10 +56,32 @@ from .element import (
     ResultSet,
     Script,
     Stylesheet,
-    SoupStrainer,
     Tag,
     TemplateString,
     )
+from .formatter import Formatter
+from .strainer import SoupStrainer
+from typing import (
+    Any,
+    cast,
+    Counter as CounterType,
+    Dict,
+    Iterable,
+    List,
+    Sequence,
+    Optional,
+    Type,
+    TYPE_CHECKING,
+    Union,
+)
+
+from bs4._typing import (
+    _AttributeValue,
+    _AttributeValues,
+    _Encoding,
+    _Encodings,
+    _IncomingMarkup,
+)
 
 # Define some custom warnings.
 class GuessedAtParserWarning(UserWarning):
@@ -104,24 +127,64 @@ class BeautifulSoup(Tag):
     handle_endtag.
     """
 
-    # Since BeautifulSoup subclasses Tag, it's possible to treat it as
-    # a Tag with a .name. This name makes it clear the BeautifulSoup
-    # object isn't a real markup tag.
-    ROOT_TAG_NAME = '[document]'
+    #: Since `BeautifulSoup` subclasses `Tag`, it's possible to treat it as
+    #: a `Tag` with a `Tag.name`. Hoever, this name makes it clear the
+    #: `BeautifulSoup` object isn't a real markup tag.
+    ROOT_TAG_NAME:str = '[document]'
 
-    # If the end-user gives no indication which tree builder they
-    # want, look for one with these features.
-    DEFAULT_BUILDER_FEATURES = ['html', 'fast']
+    #: If the end-user gives no indication which tree builder they
+    #: want, look for one with these features.
+    DEFAULT_BUILDER_FEATURES: Sequence[str] = ['html', 'fast']
 
-    # A string containing all ASCII whitespace characters, used in
-    # endData() to detect data chunks that seem 'empty'.
-    ASCII_SPACES = '\x20\x0a\x09\x0c\x0d'
+    #: A string containing all ASCII whitespace characters, used in
+    #: `BeautifulSoup.endData` to detect data chunks that seem 'empty'.
+    ASCII_SPACES: str = '\x20\x0a\x09\x0c\x0d'
 
-    NO_PARSER_SPECIFIED_WARNING = "No parser was explicitly specified, so I'm using the best available %(markup_type)s parser for this system (\"%(parser)s\"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.\n\nThe code that caused this warning is on line %(line_number)s of the file %(filename)s. To get rid of this warning, pass the additional argument 'features=\"%(parser)s\"' to the BeautifulSoup constructor.\n"
-
-    def __init__(self, markup="", features=None, builder=None,
-                 parse_only=None, from_encoding=None, exclude_encodings=None,
-                 element_classes=None, **kwargs):
+    #: :meta private:
+    NO_PARSER_SPECIFIED_WARNING: str = "No parser was explicitly specified, so I'm using the best available %(markup_type)s parser for this system (\"%(parser)s\"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.\n\nThe code that caused this warning is on line %(line_number)s of the file %(filename)s. To get rid of this warning, pass the additional argument 'features=\"%(parser)s\"' to the BeautifulSoup constructor.\n"
+
+    # FUTURE PYTHON:
+    element_classes:Dict[Type[PageElement], Type[Any]] #: :meta private:
+    builder:TreeBuilder #: :meta private:
+    is_xml: bool
+    known_xml: Optional[bool]
+    parse_only: Optional[SoupStrainer] #: :meta private:
+
+    # These members are only used while parsing markup.
+    markup:Optional[Union[str,bytes]] #: :meta private:
+    current_data:List[str] #: :meta private:
+    currentTag:Optional[Tag] #: :meta private:
+    tagStack:List[Tag] #: :meta private:
+    open_tag_counter:CounterType[str] #: :meta private:
+    preserve_whitespace_tag_stack:List[Tag] #: :meta private:
+    string_container_stack:List[Tag] #: :meta private:
+
+    #: Beautiful Soup's best guess as to the character encoding of the
+    #: original document.
+    original_encoding: Optional[_Encoding]
+
+    #: The character encoding, if any, that was explicitly defined
+    #: in the original document. This may or may not match
+    #: `BeautifulSoup.original_encoding`.
+    declared_html_encoding: Optional[_Encoding]
+
+    #: This is True if the markup that was parsed contains
+    #: U+FFFD REPLACEMENT_CHARACTER characters which were not present
+    #: in the original markup. These mark character sequences that
+    #: could not be represented in Unicode.
+    contains_replacement_characters: bool
+
+    def __init__(
+        self,
+        markup:_IncomingMarkup="",
+        features:Optional[Union[str,Sequence[str]]]=None,
+        builder:Optional[Union[TreeBuilder,Type[TreeBuilder]]]=None,
+        parse_only:Optional[SoupStrainer]=None,
+        from_encoding:Optional[_Encoding]=None,
+        exclude_encodings:Optional[_Encodings]=None,
+        element_classes:Optional[Dict[Type[PageElement], Type[Any]]]=None,
+        **kwargs:Any
+    ):
         """Constructor.
 
         :param markup: A string or a file-like object representing
@@ -196,14 +259,14 @@ class BeautifulSoup(Tag):
         if 'selfClosingTags' in kwargs:
             del kwargs['selfClosingTags']
             warnings.warn(
-                "BS4 does not respect the selfClosingTags argument to the "
+                "Beautiful Soup 4 does not respect the selfClosingTags argument to the "
                 "BeautifulSoup constructor. The tree builder is responsible "
                 "for understanding self-closing tags.")
 
         if 'isHTML' in kwargs:
             del kwargs['isHTML']
             warnings.warn(
-                "BS4 does not respect the isHTML argument to the "
+                "Beautiful Soup 4 does not respect the isHTML argument to the "
                 "BeautifulSoup constructor. Suggest you use "
                 "features='lxml' for HTML and features='lxml-xml' for "
                 "XML.")
@@ -212,7 +275,8 @@ class BeautifulSoup(Tag):
             if old_name in kwargs:
                 warnings.warn(
                     'The "%s" argument to the BeautifulSoup constructor '
-                    'has been renamed to "%s."' % (old_name, new_name),
+                    'was renamed to "%s" in Beautiful Soup 4.0.0' % (
+                        old_name, new_name),
                     DeprecationWarning, stacklevel=3
                 )
                 return kwargs.pop(old_name)
@@ -220,7 +284,14 @@ class BeautifulSoup(Tag):
 
         parse_only = parse_only or deprecated_argument(
             "parseOnlyThese", "parse_only")
-
+        if (parse_only is not None
+            and parse_only.string_rules and
+            (parse_only.name_rules or parse_only.attribute_rules)):
+            warnings.warn(
+                f"Value for parse_only will exclude everything, since it puts restrictions on both tags and strings: {parse_only}",
+                UserWarning, stacklevel=3
+            )
+
         from_encoding = from_encoding or deprecated_argument(
             "fromEncoding", "from_encoding")
 
@@ -235,7 +306,8 @@ class BeautifulSoup(Tag):
         # specify a parser' warning.
         original_builder = builder
         original_features = features
-        
+
+        builder_class: Type[TreeBuilder]
         if isinstance(builder, type):
             # A builder class was passed in; it needs to be instantiated.
             builder_class = builder
@@ -245,12 +317,13 @@ class BeautifulSoup(Tag):
                 features = [features]
             if features is None or len(features) == 0:
                 features = self.DEFAULT_BUILDER_FEATURES
-            builder_class = builder_registry.lookup(*features)
-            if builder_class is None:
+            possible_builder_class = builder_registry.lookup(*features)
+            if possible_builder_class is None:
                 raise FeatureNotFound(
                     "Couldn't find a tree builder with the features you "
                     "requested: %s. Do you need to install a parser library?"
                     % ",".join(features))
+            builder_class = cast(Type[TreeBuilder], possible_builder_class)
 
         # At this point either we have a TreeBuilder instance in
         # builder, or we have a builder_class that we can instantiate
@@ -259,7 +332,8 @@ class BeautifulSoup(Tag):
             builder = builder_class(**kwargs)
             if not original_builder and not (
                     original_features == builder.NAME or
-                    original_features in builder.ALTERNATE_NAMES
+                    (isinstance(original_features, str)
+                     and original_features in builder.ALTERNATE_NAMES)
             ) and markup:
                 # The user did not tell us which TreeBuilder to use,
                 # and we had to guess. Issue a warning.
@@ -323,6 +397,10 @@ class BeautifulSoup(Tag):
             if not self._markup_is_url(markup):
                 self._markup_resembles_filename(markup)
 
+        # At this point we know markup is a string or bytestring. If
+        # it was a file-type object, we've read from it.
+        markup = cast(Union[str,bytes], markup)
+
         rejections = []
         success = False
         for (self.markup, self.original_encoding, self.declared_html_encoding,
@@ -486,7 +564,7 @@ class BeautifulSoup(Tag):
         markup.
         """
         Tag.__init__(self, self, self.builder, self.ROOT_TAG_NAME)
-        self.hidden = 1
+        self.hidden = True
         self.builder.reset()
         self.current_data = []
         self.currentTag = None
@@ -497,8 +575,16 @@ class BeautifulSoup(Tag):
         self._most_recent_element = None
         self.pushTag(self)
 
-    def new_tag(self, name, namespace=None, nsprefix=None, attrs={},
-                sourceline=None, sourcepos=None, **kwattrs):
+    def new_tag(
+            self,
+            name:str,
+            namespace:Optional[str]=None,
+            nsprefix:Optional[str]=None,
+            attrs:_AttributeValues={},
+            sourceline:Optional[int]=None,
+            sourcepos:Optional[int]=None,
+            **kwattrs:_AttributeValue,
+    ):
         """Create a new Tag associated with this BeautifulSoup object.
 
         :param name: The name of the new Tag.
@@ -509,7 +595,7 @@ class BeautifulSoup(Tag):
             that are reserved words in Python.
         :param sourceline: The line number where this tag was
             (purportedly) found in its source document.
-        :param sourcepos: The character position within `sourceline` where this
+        :param sourcepos: The character position within ``sourceline`` where this
             tag was (purportedly) found.
         :param kwattrs: Keyword arguments for the new Tag's attribute values.
 
@@ -520,9 +606,17 @@ class BeautifulSoup(Tag):
             sourceline=sourceline, sourcepos=sourcepos
         )
 
-    def string_container(self, base_class=None):
+    def string_container(self,
+            base_class:Optional[Type[NavigableString]]=None
+    ) -> Type[NavigableString]:
+        """Find the class that should be instantiated to hold a given kind of
+        string.
+
+        This may be a built-in Beautiful Soup class or a custom class passed
+        in to the BeautifulSoup constructor.
+        """
         container = base_class or NavigableString
 
         # There may be a general override of NavigableString.
         container = self.element_classes.get(
             container, container
@@ -536,27 +630,40 @@ class BeautifulSoup(Tag):
         )
         return container
 
-    def new_string(self, s, subclass=None):
-        """Create a new NavigableString associated with this BeautifulSoup
+    def new_string(self, s:str, subclass:Optional[Type[NavigableString]]=None) -> NavigableString:
+        """Create a new `NavigableString` associated with this `BeautifulSoup`
         object.
+
+        :param s: The string content of the `NavigableString`
+
+        :param subclass: The subclass of `NavigableString`, if any, to
+           use. If a document is being processed, an appropriate subclass
+           for the current location in the document will be determined
+           automatically.
         """
         container = self.string_container(subclass)
         return container(s)
 
-    def insert_before(self, *args):
+    def insert_before(self, *args:PageElement) -> None:
         """This method is part of the PageElement API, but `BeautifulSoup` doesn't implement
         it because there is nothing before or after it in the parse tree.
         """
         raise NotImplementedError("BeautifulSoup objects don't support insert_before().")
 
-    def insert_after(self, *args):
+    def insert_after(self, *args:PageElement) -> None:
         """This method is part of the PageElement API, but `BeautifulSoup` doesn't implement
         it because there is nothing before or after it in the parse tree.
         """
         raise NotImplementedError("BeautifulSoup objects don't support insert_after().")
 
-    def popTag(self):
-        """Internal method called by _popToTag when a tag is closed."""
+    def popTag(self) -> Optional[Tag]:
+        """Internal method called by _popToTag when a tag is closed.
+
+        :meta private:
+        """
+        if not self.tagStack:
+            # Nothing to pop. This shouldn't happen.
+            return None
         tag = self.tagStack.pop()
         if tag.name in self.open_tag_counter:
             self.open_tag_counter[tag.name] -= 1
@@ -569,8 +676,11 @@ class BeautifulSoup(Tag):
             self.currentTag = self.tagStack[-1]
         return self.currentTag
 
-    def pushTag(self, tag):
-        """Internal method called by handle_starttag when a tag is opened."""
+    def pushTag(self, tag:Tag) -> None:
+        """Internal method called by handle_starttag when a tag is opened.
+
+        :meta private:
+        """
         #print("Push", tag.name)
         if self.currentTag is not None:
             self.currentTag.contents.append(tag)
@@ -583,9 +693,14 @@ class BeautifulSoup(Tag):
         if tag.name in self.builder.string_containers:
             self.string_container_stack.append(tag)
 
-    def endData(self, containerClass=None):
+    def endData(self, containerClass:Optional[Type[NavigableString]]=None) -> None:
         """Method called by the TreeBuilder when the end of a data segment
         occurs.
+
+        :param containerClass: The class to use when incorporating the
+           data segment into the parse tree.
+
+        :meta private:
         """
         if self.current_data:
             current_data = ''.join(self.current_data)
@@ -609,18 +724,27 @@ class BeautifulSoup(Tag):
 
         # Should we add this string to the tree at all?
         if self.parse_only and len(self.tagStack) <= 1 and \
-               (not self.parse_only.text or \
-                not self.parse_only.search(current_data)):
+               (not self.parse_only.string_rules or \
+                not self.parse_only.allow_string_creation(current_data)):
             return
 
         containerClass = self.string_container(containerClass)
         o = containerClass(current_data)
         self.object_was_parsed(o)
 
-    def object_was_parsed(self, o, parent=None, most_recent_element=None):
-        """Method called by the TreeBuilder to integrate an object into the parse tree."""
+    def object_was_parsed(
+            self, o:PageElement, parent:Optional[Tag]=None,
+            most_recent_element:Optional[PageElement]=None):
+        """Method called by the TreeBuilder to integrate an object into the
+        parse tree.
+
+        :meta private:
+        """
         if parent is None:
             parent = self.currentTag
+        assert parent is not None
         if most_recent_element is not None:
             previous_element = most_recent_element
         else:
@@ -685,7 +809,7 @@ class BeautifulSoup(Tag):
                 break
             target = target.parent
 
-    def _popToTag(self, name, nsprefix=None, inclusivePop=True):
+    def _popToTag(self, name, nsprefix=None, inclusivePop=True) -> Optional[Tag]:
         """Pops the tag stack up to and including the most recent
         instance of the given tag.
 
@@ -698,11 +822,12 @@ class BeautifulSoup(Tag):
            to but *not* including the most recent instqance of the
            given tag.
 
+        :meta private:
         """
         #print("Popping to %s" % name)
         if name == self.ROOT_TAG_NAME:
             # The BeautifulSoup object itself can never be popped.
-            return
+            return None
 
         most_recently_popped = None
 
@@ -719,8 +844,11 @@ class BeautifulSoup(Tag):
 
         return most_recently_popped
 
-    def handle_starttag(self, name, namespace, nsprefix, attrs, sourceline=None,
-                        sourcepos=None, namespaces=None):
+    def handle_starttag(
+            self, name:str, namespace:Optional[str],
+            nsprefix:Optional[str], attrs:Optional[Dict[str,str]],
+            sourceline:Optional[int]=None, sourcepos:Optional[int]=None,
+            namespaces:Optional[Dict[str, str]]=None) -> Optional[Tag]:
         """Called by the tree builder when a new tag is encountered.
 
         :param name: Name of the tag.
@@ -737,13 +865,15 @@ class BeautifulSoup(Tag):
            SoupStrainer. You should proceed as if the tag had not occurred
            in the document. For instance, if this was a self-closing tag,
            don't call handle_endtag.
+
+        :meta private:
         """
         # print("Start tag %s: %s" % (name, attrs))
         self.endData()
 
         if (self.parse_only and len(self.tagStack) <= 1
-                and (self.parse_only.text
-                     or not self.parse_only.search_tag(name, attrs))):
+                and (self.parse_only.string_rules
+                     or not self.parse_only.allow_tag_creation(nsprefix, name, attrs))):
             return None
 
         tag = self.element_classes.get(Tag, Tag)(
@@ -760,48 +890,90 @@ class BeautifulSoup(Tag):
         self.pushTag(tag)
         return tag
 
-    def handle_endtag(self, name, nsprefix=None):
+    def handle_endtag(self, name:str, nsprefix:Optional[str]=None) -> None:
         """Called by the tree builder when an ending tag is encountered.
 
         :param name: Name of the tag.
         :param nsprefix: Namespace prefix for the tag.
+
+        :meta private:
         """
         #print("End tag: " + name)
         self.endData()
         self._popToTag(name, nsprefix)
 
-    def handle_data(self, data):
-        """Called by the tree builder when a chunk of textual data is encountered."""
+    def handle_data(self, data:str) -> None:
+        """Called by the tree builder when a chunk of textual data is
+        encountered.
+
+        :meta private:
+        """
         self.current_data.append(data)
 
-    def decode(self, pretty_print=False,
-               eventual_encoding=DEFAULT_OUTPUT_ENCODING,
-               formatter="minimal", iterator=None):
-        """Returns a string or Unicode representation of the parse tree
-        as an HTML or XML document.
-
-        :param pretty_print: If this is True, indentation will be used to
-            make the document more readable.
+    def decode(self, indent_level:Optional[int]=None,
+               eventual_encoding:_Encoding=DEFAULT_OUTPUT_ENCODING,
+               formatter:Union[Formatter,str]="minimal",
+               iterator:Optional[Iterable]=None, **kwargs) -> str:
+        """Returns a string representation of the parse tree
+        as a full HTML or XML document.
+
+        :param indent_level: Each line of the rendering will be
+            indented this many levels. (The ``formatter`` decides what a
+            'level' means, in terms of spaces or other characters
+            output.) This is used internally in recursive calls while
+            pretty-printing.
         :param eventual_encoding: The encoding of the final document.
             If this is None, the document will be a Unicode string.
+        :param formatter: Either a `Formatter` object, or a string naming one of
+            the standard formatters.
+        :param iterator: The iterator to use when navigating over the
+            parse tree. This is only used by `Tag.decode_contents` and
+            you probably won't need to use it.
         """
         if self.is_xml:
             # Print the XML declaration
             encoding_part = ''
+            declared_encoding: Optional[str] = eventual_encoding
             if eventual_encoding in PYTHON_SPECIFIC_ENCODINGS:
                 # This is a special Python encoding; it can't actually
                 # go into an XML document because it means nothing
                 # outside of Python.
-                eventual_encoding = None
-            if eventual_encoding != None:
-                encoding_part = ' encoding="%s"' % eventual_encoding
+                declared_encoding = None
+            if declared_encoding != None:
+                encoding_part = ' encoding="%s"' % declared_encoding
             prefix = '<?xml version="1.0"%s?>\n' % encoding_part
         else:
             prefix = ''
-        if not pretty_print:
-            indent_level = None
+
+        # Prior to 4.13.0, the first argument to this method was a
+        # bool called pretty_print, which gave the method a different
+        # signature from its superclass implementation, Tag.decode.
+        #
+        # The signatures of the two methods now match, but just in
+        # case someone is still passing a boolean in as the first
+        # argument to this method (or a keyword argument with the old
+        # name), we can handle it and put out a DeprecationWarning.
+        warning:Optional[str] = None
+        if isinstance(indent_level, bool):
+            if indent_level is True:
+                indent_level = 0
+            elif indent_level is False:
+                indent_level = None
+            warning = f"As of 4.13.0, the first argument to BeautifulSoup.decode has been changed from bool to int, to match Tag.decode. Pass in a value of {indent_level} instead."
         else:
-            indent_level = 0
+            pretty_print = kwargs.pop("pretty_print", None)
+            assert not kwargs
+            if pretty_print is not None:
+                if pretty_print is True:
+                    indent_level = 0
+                elif pretty_print is False:
+                    indent_level = None
+                warning = f"As of 4.13.0, the pretty_print argument to BeautifulSoup.decode has been removed, to match Tag.decode. Pass in a value of indent_level={indent_level} instead."
+
+        if warning:
+            warnings.warn(warning, DeprecationWarning, stacklevel=2)
+        elif indent_level is False or pretty_print is False:
+            indent_level = None
         return prefix + super(BeautifulSoup, self).decode(
             indent_level, eventual_encoding, formatter, iterator)
 
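The backward-compatibility shim in `decode()` above follows a reusable pattern: accept the legacy boolean (positionally or as the old keyword) in the new integer parameter's position, translate it, and emit a `DeprecationWarning`. A standalone sketch of the pattern — the function body and message text here are illustrative, not bs4's actual code:

```python
import warnings
from typing import Optional, Union

def decode(indent_level: Optional[Union[int, bool]] = None, **kwargs) -> str:
    """Render a document; formerly took pretty_print (bool), now indent_level (int)."""
    warning = None
    if isinstance(indent_level, bool):
        # Old positional style: decode(True) / decode(False).
        indent_level = 0 if indent_level else None
        warning = "Pass an int indent_level instead of a bool."
    elif "pretty_print" in kwargs:
        # Old keyword style: decode(pretty_print=True).
        pretty_print = kwargs.pop("pretty_print")
        indent_level = 0 if pretty_print else None
        warning = "pretty_print is deprecated; pass indent_level instead."
    if warning:
        warnings.warn(warning, DeprecationWarning, stacklevel=2)
    return f"<rendered indent_level={indent_level}>"
```

Note that the `isinstance(indent_level, bool)` check works because `bool` is a subclass of `int`, so checking for `bool` specifically is the only way to tell `decode(True)` apart from `decode(1)`.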
@@ -815,7 +987,7 @@ class BeautifulStoneSoup(BeautifulSoup):
     def __init__(self, *args, **kwargs):
         kwargs['features'] = 'xml'
         warnings.warn(
-            'The BeautifulStoneSoup class is deprecated. Instead of using '
+            'The BeautifulStoneSoup class was deprecated in version 4.0.0. Instead of using '
             'it, pass features="xml" into the BeautifulSoup constructor.',
             DeprecationWarning, stacklevel=2
         )
diff --git a/bs4/_deprecation.py b/bs4/_deprecation.py
new file mode 100644
index 0000000..febc1b3
--- /dev/null
+++ b/bs4/_deprecation.py
@@ -0,0 +1,57 @@
+"""Helper functions for deprecation.
+
+This interface is itself unstable and may change without warning. Do
+not use these functions yourself, even as a joke. The underscores are
+there for a reason.
+
+In particular, most of this will go away once Beautiful Soup drops
+support for Python 3.11, since Python 3.12 defines a
+`@typing.deprecated() decorator. <https://peps.python.org/pep-0702/>`_
+"""
+
+import functools
+import warnings
+
+from typing import (
+    Any,
+    Callable,
+)
+
+def _deprecated_alias(old_name, new_name, version):
+    """Alias one attribute name to another for backward compatibility
+
+    :meta private:
+    """
+    @property
+    def alias(self) -> Any:
+        ":meta private:"
+        warnings.warn(f"Access to deprecated property {old_name}. (Replaced by {new_name}) -- Deprecated since version {version}.", DeprecationWarning, stacklevel=2)
+        return getattr(self, new_name)
+
+    @alias.setter
+    def alias(self, value:str)->Any:
+        ":meta private:"
+        warnings.warn(f"Write to deprecated property {old_name}. (Replaced by {new_name}) -- Deprecated since version {version}.", DeprecationWarning, stacklevel=2)
+        return setattr(self, new_name, value)
+    return alias
+
+def _deprecated_function_alias(old_name:str, new_name:str, version:str) -> Callable:
+    def alias(self, *args, **kwargs):
+        ":meta private:"
+        warnings.warn(f"Call to deprecated method {old_name}. (Replaced by {new_name}) -- Deprecated since version {version}.", DeprecationWarning, stacklevel=2)
+        return getattr(self, new_name)(*args, **kwargs)
+    return alias
+
+def _deprecated(replaced_by:str, version:str) -> Callable:
+    def deprecate(func):
+        @functools.wraps(func)
+        def with_warning(*args, **kwargs):
+            ":meta private:"
+            warnings.warn(
+                f"Call to deprecated method {func.__name__}. (Replaced by {replaced_by}) -- Deprecated since version {version}.",
+                DeprecationWarning,
+                stacklevel=2
+            )
+            return func(*args, **kwargs)
+        return with_warning
+    return deprecate
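To show how a helper like `_deprecated_alias` is meant to be attached to a class, here is a minimal self-contained sketch. It re-declares a simplified version of the helper so the example runs on its own; the `Element` class and warning text are illustrative, not bs4's:

```python
import warnings
from typing import Any

def _deprecated_alias(old_name: str, new_name: str, version: str):
    """Build a property that forwards old_name to new_name, warning on every access."""
    @property
    def alias(self) -> Any:
        warnings.warn(f"{old_name} is deprecated since {version}; use {new_name}.",
                      DeprecationWarning, stacklevel=2)
        return getattr(self, new_name)

    @alias.setter
    def alias(self, value: Any) -> None:
        warnings.warn(f"{old_name} is deprecated since {version}; use {new_name}.",
                      DeprecationWarning, stacklevel=2)
        setattr(self, new_name, value)
    return alias

class Element:
    def __init__(self) -> None:
        self.next_element = None
    # Keep an old camelCase spelling alive as a warning property:
    nextElement = _deprecated_alias("nextElement", "next_element", "4.13.0")
```

Reads and writes through the old name keep working, but each one raises a `DeprecationWarning` that points the caller at the new name.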
diff --git a/bs4/_typing.py b/bs4/_typing.py
new file mode 100644
index 0000000..9fe58c3
--- /dev/null
+++ b/bs4/_typing.py
@@ -0,0 +1,99 @@
+# Custom type aliases used throughout Beautiful Soup to improve readability.
+
+# Notes on improvements to the type system in newer versions of Python
+# that can be used once Beautiful Soup drops support for older
+# versions:
+#
+# * In 3.10, x|y is an accepted shorthand for Union[x,y].
+# * In 3.10, TypeAlias gains capabilities that can be used to
+#   improve the tree matching types (I don't remember what, exactly).
+
+import re
+from typing_extensions import TypeAlias
+from typing import (
+    Callable,
+    Dict,
+    IO,
+    Iterable,
+    Pattern,
+    TYPE_CHECKING,
+    Union,
+)
+
+if TYPE_CHECKING:
+    from bs4.element import Tag
+
+# Aliases for markup in various stages of processing.
+#
+# The rawest form of markup: either a string or an open filehandle.
+_IncomingMarkup: TypeAlias = Union[str,bytes,IO]
+
+# Markup that is in memory but has (potentially) yet to be converted
+# to Unicode.
+_RawMarkup: TypeAlias = Union[str,bytes]
+
+# Aliases for character encodings
+#
+_Encoding:TypeAlias = str
+_Encodings:TypeAlias = Iterable[_Encoding]
+
+# Aliases for XML namespaces
+_NamespacePrefix:TypeAlias = str
+_NamespaceURL:TypeAlias = str
+_NamespaceMapping:TypeAlias = Dict[_NamespacePrefix, _NamespaceURL]
+_InvertedNamespaceMapping:TypeAlias = Dict[_NamespaceURL, _NamespacePrefix]
+
+# Aliases for the attribute values associated with HTML/XML tags.
+#
+# Note that these are attribute values in their final form, as stored
+# in the `Tag` class. Different parsers present attributes to the
+# `TreeBuilder` subclasses in different formats, which are not defined
+# here.
+_AttributeValue: TypeAlias = Union[str, Iterable[str]]
+_AttributeValues: TypeAlias = Dict[str, _AttributeValue]
+
+# Aliases to represent the many possibilities for matching bits of a
+# parse tree.
+#
+# This is very complicated because we're applying a formal type system
+# to some very DWIM code. The types we end up with will be the types
+# of the arguments to the SoupStrainer constructor and (more
+# familiarly to Beautiful Soup users) the find* methods.
+
+# A function that takes a Tag and returns a yes-or-no answer.
+# A TagNameMatchRule expects this kind of function, if you're
+# going to pass it a function.
+_TagMatchFunction:TypeAlias = Callable[['Tag'], bool]
+
+# A function that takes a single string and returns a yes-or-no
+# answer. An AttributeValueMatchRule expects this kind of function, if
+# you're going to pass it a function. So does a StringMatchRule
+_StringMatchFunction:TypeAlias = Callable[[str], bool]
+
+# A function that takes a Tag or string and returns a yes-or-no
+# answer.
+_TagOrStringMatchFunction:TypeAlias = Union[_TagMatchFunction, _StringMatchFunction, bool]
+
+# Either a tag name, an attribute value or a string can be matched
+# against a string, bytestring, regular expression, or a boolean.
+_BaseStrainable:TypeAlias = Union[str, bytes, Pattern[str], bool]
+
+# A tag can also be matched using a function that takes the Tag
+# as its sole argument.
+_BaseStrainableElement:TypeAlias = Union[_BaseStrainable, _TagMatchFunction]
+
+# A tag's attribute value can be matched using a function that takes
+# the value as its sole argument.
+_BaseStrainableAttribute:TypeAlias = Union[_BaseStrainable, _StringMatchFunction]
+
+# Finally, a tag name, attribute or string can be matched using either
+# a single criterion or a list of criteria.
+_StrainableElement:TypeAlias = Union[
+    _BaseStrainableElement, Iterable[_BaseStrainableElement]
+]
+_StrainableAttribute:TypeAlias = Union[
+    _BaseStrainableAttribute, Iterable[_BaseStrainableAttribute]
+]
+
+_StrainableAttributes:TypeAlias = Dict[str, _StrainableAttribute]
+_StrainableString:TypeAlias = _StrainableAttribute
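The `_BaseStrainable` union above formalizes the DWIM matching that the find* methods have always supported: a criterion may be a string, bytestring, compiled regex, boolean, or callable. A rough standalone sketch of how one such criterion might be tested against a tag name — the `matches` helper is hypothetical, not bs4 API:

```python
import re
from typing import Callable, Pattern, Union

# Mirrors _BaseStrainable plus the match-function variants.
Strainable = Union[str, bytes, Pattern[str], bool, Callable[[str], bool]]

def matches(criterion: Strainable, name: str) -> bool:
    """Interpret one match criterion, loosely following find*() semantics."""
    if isinstance(criterion, bool):
        return criterion                      # True matches everything, False nothing
    if isinstance(criterion, bytes):
        criterion = criterion.decode("utf8")  # bytestrings are compared as UTF-8 text
    if isinstance(criterion, str):
        return criterion == name
    if callable(criterion):
        return criterion(name)
    return criterion.search(name) is not None  # compiled regex
```

The `bool` check must come first: `True` would otherwise fall through, and since `bool` subclasses `int` it can't be distinguished later.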
diff --git a/bs4/builder/__init__.py b/bs4/builder/__init__.py
index 2e39745..671315d 100644
--- a/bs4/builder/__init__.py
+++ b/bs4/builder/__init__.py
@@ -1,9 +1,25 @@
+from __future__ import annotations
 # Use of this source code is governed by the MIT license.
 __license__ = "MIT"
 
 from collections import defaultdict
 import itertools
 import re
+from types import ModuleType
+from typing import (
+    Any,
+    cast,
+    Dict,
+    Iterable,
+    List,
+    Optional,
+    Pattern,
+    Set,
+    Tuple,
+    Type,
+    TYPE_CHECKING,
+    Union,
+)
 import warnings
 import sys
 from bs4.element import (
@@ -17,6 +33,18 @@ from bs4.element import (
     nonwhitespace_re
 )
 
+if TYPE_CHECKING:
+    from bs4 import BeautifulSoup
+    from bs4.element import (
+        NavigableString, Tag,
+        _AttributeValues, _AttributeValue,
+    )
+    from bs4._typing import (
+        _Encoding,
+        _Encodings,
+        _RawMarkup,
+    )
+
 __all__ = [
     'HTMLTreeBuilder',
     'SAXTreeBuilder',
@@ -36,29 +64,32 @@ class XMLParsedAsHTMLWarning(UserWarning):
     """The warning issued when an HTML parser is used to parse
     XML that is not XHTML.
     """
-    MESSAGE = """It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor."""
+    MESSAGE:str = """It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.""" #: :meta private:
 
 
 class TreeBuilderRegistry(object):
     """A way of looking up TreeBuilder subclasses by their name or by desired
     features.
     """
+
+    builders_for_feature: Dict[str, List[Type[TreeBuilder]]]
+    builders: List[Type[TreeBuilder]]
 
     def __init__(self):
         self.builders_for_feature = defaultdict(list)
         self.builders = []
 
-    def register(self, treebuilder_class):
+    def register(self, treebuilder_class:type[TreeBuilder]) -> None:
         """Register a treebuilder based on its advertised features.
 
-        :param treebuilder_class: A subclass of TreeBuilder. Its .features
-            attribute should list its features.
+        :param treebuilder_class: A subclass of `TreeBuilder`. Its
+            `TreeBuilder.features` attribute should list its features.
         """
         for feature in treebuilder_class.features:
             self.builders_for_feature[feature].insert(0, treebuilder_class)
         self.builders.insert(0, treebuilder_class)
 
-    def lookup(self, *features):
+    def lookup(self, *features:str) -> Optional[Type[TreeBuilder]]:
         """Look up a TreeBuilder subclass with the desired features.
 
         :param features: A list of features to look for. If none are
@@ -78,12 +109,12 @@ class TreeBuilderRegistry(object):
 
         # Go down the list of features in order, and eliminate any builders
         # that don't match every feature.
-        features = list(features)
-        features.reverse()
+        feature_list = list(features)
+        feature_list.reverse()
         candidates = None
         candidate_set = None
-        while len(features) > 0:
-            feature = features.pop()
+        while len(feature_list) > 0:
+            feature = feature_list.pop()
             we_have_the_feature = self.builders_for_feature.get(feature, [])
             if len(we_have_the_feature) > 0:
                 if candidates is None:
@@ -97,81 +128,61 @@ class TreeBuilderRegistry(object):
         # The only valid candidates are the ones in candidate_set.
         # Go through the original list of candidates and pick the first one
         # that's in candidate_set.
-        if candidate_set is None:
+        if candidate_set is None or candidates is None:
             return None
         for candidate in candidates:
             if candidate in candidate_set:
                 return candidate
         return None
 
-# The BeautifulSoup class will take feature lists from developers and use them
-# to look up builders in this registry.
-builder_registry = TreeBuilderRegistry()
+#: The `BeautifulSoup` constructor will take a list of features
+#: and use it to look up `TreeBuilder` classes in this registry.
+builder_registry:TreeBuilderRegistry = TreeBuilderRegistry()
 
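The register/lookup semantics above boil down to: each registration is prepended, so the most recently registered builder wins, and a multi-feature lookup intersects the per-feature candidate lists. A toy model of that behavior, using plain strings in place of `TreeBuilder` subclasses (a sketch, not bs4's exact algorithm):

```python
from collections import defaultdict
from typing import Dict, List, Optional

class Registry:
    """Toy TreeBuilderRegistry: newest registration wins a lookup."""
    def __init__(self) -> None:
        self.builders_for_feature: Dict[str, List[str]] = defaultdict(list)
        self.builders: List[str] = []

    def register(self, name: str, features: List[str]) -> None:
        # insert(0, ...) puts the newest builder at the front of every list.
        for feature in features:
            self.builders_for_feature[feature].insert(0, name)
        self.builders.insert(0, name)

    def lookup(self, *features: str) -> Optional[str]:
        if not features:
            return self.builders[0] if self.builders else None
        candidates: Optional[List[str]] = None
        for feature in features:
            have = self.builders_for_feature.get(feature, [])
            if candidates is None:
                candidates = list(have)
            else:
                # Keep only builders that also advertise this feature.
                candidates = [c for c in candidates if c in have]
        return candidates[0] if candidates else None

reg = Registry()
reg.register("html.parser", ["html", "strict"])
reg.register("lxml", ["html", "fast"])
```

With both registered, a lookup for `"html"` returns `"lxml"` (registered later), while `"html", "strict"` narrows back to `"html.parser"`.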
 class TreeBuilder(object):
-    """Turn a textual document into a Beautiful Soup object tree."""
+    """Turn a textual document into a Beautiful Soup object tree.
 
-    NAME = "[Unknown tree builder]"
-    ALTERNATE_NAMES = []
-    features = []
-
-    is_xml = False
-    picklable = False
-    empty_element_tags = None # A tag will be considered an empty-element
-                              # tag when and only when it has no contents.
-
-    # A value for these tag/attribute combinations is a space- or
-    # comma-separated list of CDATA, rather than a single CDATA.
-    DEFAULT_CDATA_LIST_ATTRIBUTES = defaultdict(list)
-
-    # Whitespace should be preserved inside these tags.
-    DEFAULT_PRESERVE_WHITESPACE_TAGS = set()
-
-    # The textual contents of tags with these names should be
-    # instantiated with some class other than NavigableString.
-    DEFAULT_STRING_CONTAINERS = {}
-
-    USE_DEFAULT = object()
+    This is an abstract superclass which smooths out the behavior of
+    different parser libraries into a single, unified interface.
+
+    :param multi_valued_attributes: If this is set to None, the
+        TreeBuilder will not turn any values for attributes like
+        'class' into lists. Setting this to a dictionary will
+        customize this behavior; look at DEFAULT_CDATA_LIST_ATTRIBUTES
+        for an example.
+
+        Internally, these are called "CDATA list attributes", but that
+        probably doesn't make sense to an end-user, so the argument name
+        is `multi_valued_attributes`.
+
+    :param preserve_whitespace_tags: A set of tags to treat
+        the way <pre> tags are treated in HTML. Tags in this set
+        are immune from pretty-printing; their contents will always be
+        output as-is.
+
+    :param string_containers: A dictionary mapping tag names to
+        the classes that should be instantiated to contain the textual
+        contents of those tags. The default is to use NavigableString
+        for every tag, no matter what the name. You can override the
+        default by changing DEFAULT_STRING_CONTAINERS.
+
+    :param store_line_numbers: If the parser keeps track of the
+        line numbers and positions of the original markup, that
+        information will, by default, be stored in each corresponding
+        `Tag` object. You can turn this off by passing
+        store_line_numbers=False. If the parser you're using doesn't
+        keep track of this information, then setting store_line_numbers=True
+        will do nothing.
+    """
 
-    # Most parsers don't keep track of line numbers.
-    TRACKS_LINE_NUMBERS = False
+    USE_DEFAULT: Any = object() #: :meta private:
 
-    def __init__(self, multi_valued_attributes=USE_DEFAULT,
-                 preserve_whitespace_tags=USE_DEFAULT,
-                 store_line_numbers=USE_DEFAULT,
-                 string_containers=USE_DEFAULT,
+    def __init__(self, multi_valued_attributes:Dict[str, Set[str]]=USE_DEFAULT,
+                 preserve_whitespace_tags:Set[str]=USE_DEFAULT,
+                 store_line_numbers:bool=USE_DEFAULT,
+                 string_containers:Dict[str, Type[NavigableString]]=USE_DEFAULT,
+                 empty_element_tags:Set[str]=USE_DEFAULT
                  ):
-        """Constructor.
-
-        :param multi_valued_attributes: If this is set to None, the
-           TreeBuilder will not turn any values for attributes like
-           'class' into lists. Setting this to a dictionary will
-           customize this behavior; look at DEFAULT_CDATA_LIST_ATTRIBUTES
-           for an example.
-
-           Internally, these are called "CDATA list attributes", but that
-           probably doesn't make sense to an end-user, so the argument name
-           is `multi_valued_attributes`.
-
-        :param preserve_whitespace_tags: A list of tags to treat
-           the way <pre> tags are treated in HTML. Tags in this list
-           are immune from pretty-printing; their contents will always be
-           output as-is.
-
-        :param string_containers: A dictionary mapping tag names to
-           the classes that should be instantiated to contain the textual
-           contents of those tags. The default is to use NavigableString
-           for every tag, no matter what the name. You can override the
-           default by changing DEFAULT_STRING_CONTAINERS.
-
-        :param store_line_numbers: If the parser keeps track of the
-           line numbers and positions of the original markup, that
-           information will, by default, be stored in each corresponding
-           `Tag` object. You can turn this off by passing
-           store_line_numbers=False. If the parser you're using doesn't
-           keep track of this information, then setting store_line_numbers=True
-           will do nothing.
-        """
         self.soup = None
         if multi_valued_attributes is self.USE_DEFAULT:
             multi_valued_attributes = self.DEFAULT_CDATA_LIST_ATTRIBUTES
@@ -179,14 +190,55 @@ class TreeBuilder(object):
         if preserve_whitespace_tags is self.USE_DEFAULT:
             preserve_whitespace_tags = self.DEFAULT_PRESERVE_WHITESPACE_TAGS
         self.preserve_whitespace_tags = preserve_whitespace_tags
+        if empty_element_tags is self.USE_DEFAULT:
+            self.empty_element_tags = self.DEFAULT_EMPTY_ELEMENT_TAGS
+        else:
+            self.empty_element_tags = empty_element_tags
         if store_line_numbers == self.USE_DEFAULT:
             store_line_numbers = self.TRACKS_LINE_NUMBERS
         self.store_line_numbers = store_line_numbers
         if string_containers == self.USE_DEFAULT:
             string_containers = self.DEFAULT_STRING_CONTAINERS
         self.string_containers = string_containers
+
+    NAME:str = "[Unknown tree builder]"
+    ALTERNATE_NAMES: Iterable[str] = []
+    features: Iterable[str] = []
+
+    is_xml: bool = False
+    picklable: bool = False
+
+    soup: Optional[BeautifulSoup] #: :meta private:
+
+    #: A tag will be considered an empty-element
+    #: tag when and only when it has no contents.
+    empty_element_tags: Optional[Set[str]] = None #: :meta private:
+    cdata_list_attributes: Dict[str, Set[str]] #: :meta private:
+    preserve_whitespace_tags: Set[str] #: :meta private:
+    string_containers: Dict[str, Type[NavigableString]] #: :meta private:
+    tracks_line_numbers: bool #: :meta private:
+
+    #: A value for these tag/attribute combinations is a space- or
+    #: comma-separated list of CDATA, rather than a single CDATA.
+    DEFAULT_CDATA_LIST_ATTRIBUTES : Dict[str, Set[str]] = defaultdict(set)
+
+    #: Whitespace should be preserved inside these tags.
+    DEFAULT_PRESERVE_WHITESPACE_TAGS : Set[str] = set()
+
+    #: The textual contents of tags with these names should be
+    #: instantiated with some class other than NavigableString.
+    DEFAULT_STRING_CONTAINERS: Dict[str, Type[NavigableString]] = {}
+
+    #: By default, tags are treated as empty-element tags if they have
+    #: no contents--that is, using XML rules. HTMLTreeBuilder
+    #: defines a different set of DEFAULT_EMPTY_ELEMENT_TAGS based on the
+    #: HTML 4 and HTML5 standards.
+    DEFAULT_EMPTY_ELEMENT_TAGS: Optional[Set] = None
+
+    #: Most parsers don't keep track of line numbers.
+    TRACKS_LINE_NUMBERS: bool = False
 
-    def initialize_soup(self, soup):
+    def initialize_soup(self, soup:BeautifulSoup) -> None:
         """The BeautifulSoup object has been initialized and is now
         being associated with the TreeBuilder.
 
@@ -194,7 +246,7 @@ class TreeBuilder(object):
         """
         self.soup = soup
 
-    def reset(self):
+    def reset(self) -> None:
         """Do any work necessary to reset the underlying parser
         for a new document.
 
@@ -202,7 +254,7 @@ class TreeBuilder(object):
         """
         pass
 
-    def can_be_empty_element(self, tag_name):
+    def can_be_empty_element(self, tag_name:str) -> bool:
         """Might a tag with this name be an empty-element tag?
 
         The final markup may or may not actually present this tag as
@@ -225,46 +277,48 @@ class TreeBuilder(object):
             return True
         return tag_name in self.empty_element_tags
 
-    def feed(self, markup):
+    def feed(self, markup:str) -> None:
         """Run some incoming markup through some parsing process,
-        populating the `BeautifulSoup` object in self.soup.
-
-        This method is not implemented in TreeBuilder; it must be
-        implemented in subclasses.
-
-        :return: None.
+        populating the `BeautifulSoup` object in `TreeBuilder.soup`
         """
         raise NotImplementedError()
 
-    def prepare_markup(self, markup, user_specified_encoding=None,
-                       document_declared_encoding=None, exclude_encodings=None):
+    def prepare_markup(
+            self, markup:_RawMarkup,
+            user_specified_encoding:Optional[_Encoding]=None,
+            document_declared_encoding:Optional[_Encoding]=None,
+            exclude_encodings:Optional[_Encodings]=None
+    ) -> Iterable[Tuple[_RawMarkup, Optional[_Encoding], Optional[_Encoding], bool]]:
         """Run any preliminary steps necessary to make incoming markup
         acceptable to the parser.
 
-        :param markup: Some markup -- probably a bytestring.
-        :param user_specified_encoding: The user asked to try this encoding.
+        :param markup: The markup that's about to be parsed.
+        :param user_specified_encoding: The user asked to try this encoding
+            to convert the markup into a Unicode string.
         :param document_declared_encoding: The markup itself claims to be
             in this encoding. NOTE: This argument is not used by the
             calling code and can probably be removed.
-        :param exclude_encodings: The user asked _not_ to try any of
+        :param exclude_encodings: The user asked *not* to try any of
             these encodings.
 
-        :yield: A series of 4-tuples:
-            (markup, encoding, declared encoding,
-             has undergone character replacement)
+        :yield: A series of 4-tuples: (markup, encoding, declared encoding,
+            has undergone character replacement)
 
-        Each 4-tuple represents a strategy for converting the
-        document to Unicode and parsing it. Each strategy will be tried
-        in turn.
+        Each 4-tuple represents a strategy that the parser can try
+        to convert the document to Unicode and parse it. Each
+        strategy will be tried in turn.
 
         By default, the only strategy is to parse the markup
         as-is. See `LXMLTreeBuilderForXML` and
         `HTMLParserTreeBuilder` for implementations that take into
         account the quirks of particular parsers.
+
+        :meta private:
+
         """
         yield markup, None, None, False
 
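The 4-tuple contract of `prepare_markup` amounts to a generator of decoding strategies that a caller consumes until one works. A hypothetical consumer loop, simplified to just the decoding step (the function names and encoding list are illustrative, not bs4's):

```python
from typing import Iterable, Optional, Tuple

# (markup, encoding, declared_encoding, has_undergone_character_replacement)
Strategy = Tuple[str, Optional[str], Optional[str], bool]

def prepare_markup(markup: bytes) -> Iterable[Strategy]:
    """Yield one decoding attempt per candidate encoding."""
    for encoding in ("utf-8", "latin-1"):
        try:
            yield markup.decode(encoding), encoding, None, False
        except UnicodeDecodeError:
            continue  # this strategy failed; let the caller try the next one

def parse_with_fallbacks(markup: bytes) -> Tuple[str, Optional[str]]:
    """Consume strategies in turn; the first one that decodes wins."""
    for text, encoding, _declared, _replaced in prepare_markup(markup):
        return text, encoding  # a real caller would attempt a full parse here
    raise ValueError("no decoding strategy succeeded")
```

Because the strategies are yielded lazily, later (more expensive or lossier) conversions are only attempted when earlier ones fail.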
-    def test_fragment_to_document(self, fragment):
+    def test_fragment_to_document(self, fragment:str) -> str:
         """Wrap an HTML fragment to make it look like a document.
 
         Different parsers do this differently. For instance, lxml
@@ -273,26 +327,27 @@ class TreeBuilder(object):
         which run HTML fragments through the parser and compare the
         results against other HTML fragments.
 
-        This method should not be used outside of tests.
+        This method should not be used outside of unit tests.
 
-        :param fragment: A string -- fragment of HTML.
-        :return: A string -- a full HTML document.
+        :param fragment: A fragment of HTML.
+        :return: A full HTML document.
+        :meta private:
         """
         return fragment
 
-    def set_up_substitutions(self, tag):
+    def set_up_substitutions(self, tag:Tag) -> bool:
         """Set up any substitutions that will need to be performed on
         a `Tag` when it's output as a string.
 
         By default, this does nothing. See `HTMLTreeBuilder` for a
         case where this is used.
 
-        :param tag: A `Tag`
         :return: Whether or not a substitution was performed.
+        :meta private:
         """
         return False
 
-    def _replace_cdata_list_attribute_values(self, tag_name, attrs):
+    def _replace_cdata_list_attribute_values(self, tag_name:str, attrs:_AttributeValues):
         """When an attribute value is associated with a tag that can
         have multiple values for that attribute, convert the string
         value to a list of strings.
@@ -308,10 +363,11 @@ class TreeBuilder(object):
         if not attrs:
             return attrs
         if self.cdata_list_attributes:
-            universal = self.cdata_list_attributes.get('*', [])
+            universal: Set[str] = self.cdata_list_attributes.get('*', set())
             tag_specific = self.cdata_list_attributes.get(
                 tag_name.lower(), None)
             for attr in list(attrs.keys()):
+                values: _AttributeValue
                 if attr in universal or (tag_specific and attr in tag_specific):
                     # We have a "class"-type attribute whose string
                     # value is a whitespace-separated list of
@@ -337,7 +393,15 @@ class SAXTreeBuilder(TreeBuilder):
     how a simple TreeBuilder would work.
     """
 
-    def feed(self, markup):
+    def __init__(self, *args, **kwargs):
+        warnings.warn(
+            f"The SAXTreeBuilder class was deprecated in 4.13.0. It is completely untested and probably doesn't work; use at your own risk.",
+            DeprecationWarning,
+            stacklevel=2
+        )
+        super(SAXTreeBuilder, self).__init__(*args, **kwargs)
+
+    def feed(self, markup:_RawMarkup):
         raise NotImplementedError()
 
     def close(self):
@@ -381,12 +445,13 @@ class SAXTreeBuilder(TreeBuilder):
 
 
 class HTMLTreeBuilder(TreeBuilder):
-    """This TreeBuilder knows facts about HTML.
-
-    Such as which tags are empty-element tags.
+    """This TreeBuilder knows facts about HTML, such as which tags are treated
+    specially by the HTML standard.
     """
 
-    empty_element_tags = set([
+    #: Some HTML tags are defined as having no contents. Beautiful Soup
+    #: treats these specially.
+    DEFAULT_EMPTY_ELEMENT_TAGS: Set[str] = set([
         # These are from HTML5.
         'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input', 'keygen', 'link', 'menuitem', 'meta', 'param', 'source', 'track', 'wbr',
 
@@ -394,29 +459,29 @@ class HTMLTreeBuilder(TreeBuilder):
         'basefont', 'bgsound', 'command', 'frame', 'image', 'isindex', 'nextid', 'spacer'
     ])
 
-    # The HTML standard defines these as block-level elements. Beautiful
-    # Soup does not treat these elements differently from other elements,
-    # but it may do so eventually, and this information is available if
-    # you need to use it.
-    block_elements = set(["address", "article", "aside", "blockquote", "canvas", "dd", "div", "dl", "dt", "fieldset", "figcaption", "figure", "footer", "form", "h1", "h2", "h3", "h4", "h5", "h6", "header", "hr", "li", "main", "nav", "noscript", "ol", "output", "p", "pre", "section", "table", "tfoot", "ul", "video"])
+    #: The HTML standard defines these tags as block-level elements. Beautiful
+    #: Soup does not treat these elements differently from other elements,
+    #: but it may do so eventually, and this information is available if
+    #: you need to use it.
+    DEFAULT_BLOCK_ELEMENTS: Set[str] = set(["address", "article", "aside", "blockquote", "canvas", "dd", "div", "dl", "dt", "fieldset", "figcaption", "figure", "footer", "form", "h1", "h2", "h3", "h4", "h5", "h6", "header", "hr", "li", "main", "nav", "noscript", "ol", "output", "p", "pre", "section", "table", "tfoot", "ul", "video"])
 
-    # These HTML tags need special treatment so they can be
-    # represented by a string class other than NavigableString.
-    #
-    # For some of these tags, it's because the HTML standard defines
-    # an unusual content model for them. I made this list by going
-    # through the HTML spec
-    # (https://html.spec.whatwg.org/#metadata-content) and looking for
-    # "metadata content" elements that can contain strings.
-    #
-    # The Ruby tags (<rt> and <rp>) are here despite being normal
-    # "phrasing content" tags, because the content they contain is
-    # qualitatively different from other text in the document, and it
-    # can be useful to be able to distinguish it.
-    #
-    # TODO: Arguably <noscript> could go here but it seems
-    # qualitatively different from the other tags.
-    DEFAULT_STRING_CONTAINERS = {
+    #: These HTML tags need special treatment so they can be
+    #: represented by a string class other than NavigableString.
+    #:
+    #: For some of these tags, it's because the HTML standard defines
+    #: an unusual content model for them. I made this list by going
+    #: through the HTML spec
+    #: (https://html.spec.whatwg.org/#metadata-content) and looking for
+    #: "metadata content" elements that can contain strings.
+    #:
+    #: The Ruby tags (<rt> and <rp>) are here despite being normal
+    #: "phrasing content" tags, because the content they contain is
+    #: qualitatively different from other text in the document, and it
+    #: can be useful to be able to distinguish it.
+    #:
+    #: TODO: Arguably <noscript> could go here but it seems
+    #: qualitatively different from the other tags.
+    DEFAULT_STRING_CONTAINERS: Dict[str, Type[NavigableString]] = {
         'rt' : RubyTextString,
         'rp' : RubyParenthesisString,
         'style': Stylesheet,
@@ -424,33 +489,35 @@ class HTMLTreeBuilder(TreeBuilder):
         'template': TemplateString,
     }
 
-    # The HTML standard defines these attributes as containing a
-    # space-separated list of values, not a single value. That is,
-    # class="foo bar" means that the 'class' attribute has two values,
-    # 'foo' and 'bar', not the single value 'foo bar'. When we
-    # encounter one of these attributes, we will parse its value into
-    # a list of values if possible. Upon output, the list will be
-    # converted back into a string.
-    DEFAULT_CDATA_LIST_ATTRIBUTES = {
-        "*" : ['class', 'accesskey', 'dropzone'],
-        "a" : ['rel', 'rev'],
-        "link" : ['rel', 'rev'],
-        "td" : ["headers"],
-        "th" : ["headers"],
-        "td" : ["headers"],
-        "form" : ["accept-charset"],
-        "object" : ["archive"],
+    #: The HTML standard defines these attributes as containing a
+    #: space-separated list of values, not a single value. That is,
+    #: class="foo bar" means that the 'class' attribute has two values,
+    #: 'foo' and 'bar', not the single value 'foo bar'. When we
+    #: encounter one of these attributes, we will parse its value into
+    #: a list of values if possible. Upon output, the list will be
+    #: converted back into a string.
+    DEFAULT_CDATA_LIST_ATTRIBUTES: Dict[str, Set[str]] = {
+        "*" : {'class', 'accesskey', 'dropzone'},
+        "a" : {'rel', 'rev'},
+        "link" : {'rel', 'rev'},
+        "td" : {"headers"},
+        "th" : {"headers"},
+        "td" : {"headers"},
+        "form" : {"accept-charset"},
+        "object" : {"archive"},
 
         # These are HTML5 specific, as are *.accesskey and *.dropzone above.
-        "area" : ["rel"],
-        "icon" : ["sizes"],
-        "iframe" : ["sandbox"],
-        "output" : ["for"],
+        "area" : {"rel"},
+        "icon" : {"sizes"},
+        "iframe" : {"sandbox"},
+        "output" : {"for"},
     }
 
-    DEFAULT_PRESERVE_WHITESPACE_TAGS = set(['pre', 'textarea'])
+    #: By default, whitespace inside these HTML tags will be
+    #: preserved rather than being collapsed.
+    DEFAULT_PRESERVE_WHITESPACE_TAGS: set[str] = set(['pre', 'textarea'])
 
-    def set_up_substitutions(self, tag):
+    def set_up_substitutions(self, tag:Tag) -> bool:
         """Replace the declared encoding in a <meta> tag with a placeholder,
         to be substituted when the tag is output to a string.
 
@@ -458,17 +525,26 @@ class HTMLTreeBuilder(TreeBuilder):
         encoding, but exit in a different encoding, and the <meta> tag
         needs to be changed to reflect this.
 
-        :param tag: A `Tag`
         :return: Whether or not a substitution was performed.
+
+        :meta private:
         """
         # We are only interested in <meta> tags
         if tag.name != 'meta':
             return False
 
-        http_equiv = tag.get('http-equiv')
-        content = tag.get('content')
-        charset = tag.get('charset')
-
+        # TODO: This cast will fail in the (very unlikely) scenario
+        # that the programmer who instantiates the TreeBuilder
+        # specifies meta['content'] or meta['charset'] as
+        # cdata_list_attributes.
+        content:Optional[str] = cast(Optional[str], tag.get('content'))
+        charset:Optional[str] = cast(Optional[str], tag.get('charset'))
+
+        # But we can accommodate meta['http-equiv'] being made a
+        # cdata_list_attribute (again, very unlikely) without much
+        # trouble.
+        http_equiv:List[str] = tag.get_attribute_list('http-equiv')
+
         # We are interested in <meta> tags that say what encoding the
         # document was originally in. This means HTML 5-style <meta>
         # tags that provide the "charset" attribute. It also means
@@ -478,20 +554,22 @@ class HTMLTreeBuilder(TreeBuilder):
         # In both cases we will replace the value of the appropriate
         # attribute with a standin object that can take on any
         # encoding.
-        meta_encoding = None
+        substituted = False
         if charset is not None:
             # HTML 5 style:
             # <meta charset="utf8">
             meta_encoding = charset
             tag['charset'] = CharsetMetaAttributeValue(charset)
+            substituted = True
 
-        elif (content is not None and http_equiv is not None
-              and http_equiv.lower() == 'content-type'):
+        elif (content is not None and
+              any(x.lower() == 'content-type' for x in http_equiv)):
             # HTML 4 style:
             # <meta http-equiv="content-type" content="text/html; charset=utf8">
             tag['content'] = ContentMetaAttributeValue(content)
+            substituted = True
 
-        return (meta_encoding is not None)
+        return substituted
 
 class DetectsXMLParsedAsHTML(object):
     """A mixin class for any class (a TreeBuilder, or some class used by a
@@ -502,19 +580,29 @@ class DetectsXMLParsedAsHTML(object):
     This requires being able to observe an incoming processing
     instruction that might be an XML declaration, and also able to
     observe tags as they're opened. If you can't do that for a given
-    TreeBuilder, there's a less reliable implementation based on
+    `TreeBuilder`, there's a less reliable implementation based on
     examining the raw markup.
     """
 
-    # Regular expression for seeing if markup has an <html> tag.
-    LOOKS_LIKE_HTML = re.compile("<[^ +]html", re.I)
-    LOOKS_LIKE_HTML_B = re.compile(b"<[^ +]html", re.I)
+    #: Regular expression for seeing if string markup has an <html> tag.
+    LOOKS_LIKE_HTML:Pattern[str] = re.compile("<[^ +]html", re.I)
 
-    XML_PREFIX = '<?xml'
-    XML_PREFIX_B = b'<?xml'
+    #: Regular expression for seeing if byte markup has an <html> tag.
+    LOOKS_LIKE_HTML_B:Pattern[bytes] = re.compile(b"<[^ +]html", re.I)
+
+    #: The start of an XML document string.
+    XML_PREFIX:str = '<?xml'
+
+    #: The start of an XML document bytestring.
+    XML_PREFIX_B:bytes = b'<?xml'
+
+    # This is typed as str, not `ProcessingInstruction`, because this
+    # check may be run before any Beautiful Soup objects are created.
+    _first_processing_instruction: Optional[str]
+    _root_tag: Optional[Tag]
 
     @classmethod
-    def warn_if_markup_looks_like_xml(cls, markup):
+    def warn_if_markup_looks_like_xml(cls, markup:Optional[_RawMarkup]) -> bool:
         """Perform a check on some markup to see if it looks like XML
         that's not XHTML. If so, issue a warning.
 
@@ -524,34 +612,40 @@ class DetectsXMLParsedAsHTML(object):
         :return: True if the markup looks like non-XHTML XML, False
             otherwise.
         """
+        if markup is None:
+            return False
+        markup = markup[:500]
         if isinstance(markup, bytes):
-            prefix = cls.XML_PREFIX_B
-            looks_like_html = cls.LOOKS_LIKE_HTML_B
+            markup_b = cast(bytes, markup)
+            looks_like_xml = (
+                markup_b.startswith(cls.XML_PREFIX_B)
+                and not cls.LOOKS_LIKE_HTML_B.search(markup)
+            )
         else:
-            prefix = cls.XML_PREFIX
-            looks_like_html = cls.LOOKS_LIKE_HTML
-
-        if (markup is not None
-            and markup.startswith(prefix)
-            and not looks_like_html.search(markup[:500])
-        ):
+            markup_s = cast(str, markup)
+            looks_like_xml = (
+                markup_s.startswith(cls.XML_PREFIX)
+                and not cls.LOOKS_LIKE_HTML.search(markup)
+            )
+
+        if looks_like_xml:
             cls._warn()
             return True
         return False
 
     @classmethod
-    def _warn(cls):
+    def _warn(cls) -> None:
         """Issue a warning about XML being parsed as HTML."""
         warnings.warn(
             XMLParsedAsHTMLWarning.MESSAGE, XMLParsedAsHTMLWarning
         )
 
-    def _initialize_xml_detector(self):
+    def _initialize_xml_detector(self) -> None:
         """Call this method before parsing a document."""
         self._first_processing_instruction = None
         self._root_tag = None
 
-    def _document_might_be_xml(self, processing_instruction):
+    def _document_might_be_xml(self, processing_instruction:str):
         """Call this method when encountering an XML declaration, or a
         "processing instruction" that might be an XML declaration.
         """
@@ -586,7 +680,7 @@ class DetectsXMLParsedAsHTML(object):
             self._warn()
 
 
-def register_treebuilders_from(module):
+def register_treebuilders_from(module:ModuleType) -> None:
     """Copy TreeBuilders from the given module into this module."""
     this_module = sys.modules[__name__]
     for name in module.__all__:
@@ -602,7 +696,7 @@ class ParserRejectedMarkup(Exception):
     """An Exception to be raised when the underlying parser simply
     refuses to parse the given markup.
     """
-    def __init__(self, message_or_exception):
+    def __init__(self, message_or_exception:Union[str,Exception]):
         """Explain why the parser rejected the given markup, either
         with a textual explanation or another exception.
         """
diff --git a/bs4/builder/_html5lib.py b/bs4/builder/_html5lib.py
index dac2173..560a036 100644
--- a/bs4/builder/_html5lib.py
+++ b/bs4/builder/_html5lib.py
@@ -5,6 +5,20 @@ __all__ = [
     'HTML5TreeBuilder',
     ]
 
+from typing import (
+    Iterable,
+    List,
+    Optional,
+    TYPE_CHECKING,
+    Tuple,
+    Union,
+)
+from bs4._typing import (
+    _Encoding,
+    _Encodings,
+    _RawMarkup,
+)
+
 import warnings
 import re
 from bs4.builder import (
@@ -30,50 +44,54 @@ from bs4.element import (
     Tag,
     )
 
-try:
-    # Pre-0.99999999
-    from html5lib.treebuilders import _base as treebuilder_base
-    new_html5lib = False
-except ImportError as e:
-    # 0.99999999 and up
-    from html5lib.treebuilders import base as treebuilder_base
-    new_html5lib = True
+from html5lib.treebuilders import base as treebuilder_base
 
 class HTML5TreeBuilder(HTMLTreeBuilder):
-    """Use html5lib to build a tree.
+    """Use `html5lib <https://github.com/html5lib/html5lib-python>`_ to
+    build a tree.
 
-    Note that this TreeBuilder does not support some features common
-    to HTML TreeBuilders. Some of these features could theoretically
+    Note that `HTML5TreeBuilder` does not support some common HTML
+    `TreeBuilder` features. Some of these features could theoretically
     be implemented, but at the very least it's quite difficult,
     because html5lib moves the parse tree around as it's being built.
 
-    * This TreeBuilder doesn't use different subclasses of NavigableString
-      based on the name of the tag in which the string was found.
+    Specifically:
 
-    * You can't use a SoupStrainer to parse only part of a document.
+    * This `TreeBuilder` doesn't use different subclasses of
+      `NavigableString` (e.g. `Script`) based on the name of the tag
+      in which the string was found.
+    * You can't use a `SoupStrainer` to parse only part of a document.
     """
 
-    NAME = "html5lib"
+    NAME:str = "html5lib"
 
-    features = [NAME, PERMISSIVE, HTML_5, HTML]
+    features:Iterable[str] = [NAME, PERMISSIVE, HTML_5, HTML]
 
-    # html5lib can tell us which line number and position in the
-    # original file is the source of an element.
-    TRACKS_LINE_NUMBERS = True
+    #: html5lib can tell us which line number and position in the
+    #: original file is the source of an element.
+    TRACKS_LINE_NUMBERS:bool = True
 
-    def prepare_markup(self, markup, user_specified_encoding,
-                       document_declared_encoding=None, exclude_encodings=None):
+    def prepare_markup(self, markup:_RawMarkup,
+                       user_specified_encoding:Optional[_Encoding]=None,
+                       document_declared_encoding:Optional[_Encoding]=None,
+                       exclude_encodings:Optional[_Encodings]=None
+    ) -> Iterable[Tuple[_RawMarkup, Optional[_Encoding], Optional[_Encoding], bool]]:
         # Store the user-specified encoding for use later on.
         self.user_specified_encoding = user_specified_encoding
 
         # document_declared_encoding and exclude_encodings aren't used
         # ATM because the html5lib TreeBuilder doesn't use
         # UnicodeDammit.
-        if exclude_encodings:
-            warnings.warn(
-                "You provided a value for exclude_encoding, but the html5lib tree builder doesn't support exclude_encoding.",
-                stacklevel=3
-            )
+        for variable, name in (
+            (document_declared_encoding, 'document_declared_encoding'),
+            (exclude_encodings, 'exclude_encodings'),
+        ):
+            if variable:
+                warnings.warn(
+                    f"You provided a value for {name}, but the html5lib tree builder doesn't support {name}.",
+                    stacklevel=3
+                )
 
         # html5lib only parses HTML, so if it's given XML that's worth
         # noting.
@@ -83,6 +101,9 @@ class HTML5TreeBuilder(HTMLTreeBuilder):
 
     # These methods are defined by Beautiful Soup.
     def feed(self, markup):
+        """Run some incoming markup through some parsing process,
+        populating the `BeautifulSoup` object in `HTML5TreeBuilder.soup`.
+        """
         if self.soup.parse_only is not None:
             warnings.warn(
                 "You provided a value for parse_only, but the html5lib tree builder doesn't support parse_only. The entire document will be parsed.",
@@ -92,10 +113,7 @@ class HTML5TreeBuilder(HTMLTreeBuilder):
         self.underlying_builder.parser = parser
         extra_kwargs = dict()
         if not isinstance(markup, str):
-            if new_html5lib:
-                extra_kwargs['override_encoding'] = self.user_specified_encoding
-            else:
-                extra_kwargs['encoding'] = self.user_specified_encoding
+            extra_kwargs['override_encoding'] = self.user_specified_encoding
         doc = parser.parse(markup, **extra_kwargs)
 
         # Set the character encoding detected by the tokenizer.
@@ -105,15 +123,18 @@ class HTML5TreeBuilder(HTMLTreeBuilder):
             doc.original_encoding = None
         else:
             original_encoding = parser.tokenizer.stream.charEncoding[0]
-            if not isinstance(original_encoding, str):
-                # In 0.99999999 and up, the encoding is an html5lib
-                # Encoding object. We want to use a string for compatibility
-                # with other tree builders.
-                original_encoding = original_encoding.name
+            # The encoding is an html5lib Encoding object. We want to
+            # use a string for compatibility with other tree builders.
+            original_encoding = original_encoding.name
             doc.original_encoding = original_encoding
         self.underlying_builder.parser = None
 
     def create_treebuilder(self, namespaceHTMLElements):
+        """Called by html5lib to instantiate the kind of class it
+        calls a 'TreeBuilder'.
+
+        :meta private:
+        """
         self.underlying_builder = TreeBuilderForHtml5lib(
             namespaceHTMLElements, self.soup,
             store_line_numbers=self.store_line_numbers
diff --git a/bs4/builder/_htmlparser.py b/bs4/builder/_htmlparser.py
index 3cc187f..291f6c6 100644
--- a/bs4/builder/_htmlparser.py
+++ b/bs4/builder/_htmlparser.py
@@ -1,4 +1,5 @@
 # encoding: utf-8
+from __future__ import annotations
 """Use the HTMLParser library to parse HTML files that aren't too bad."""
 
 # Use of this source code is governed by the MIT license.
@@ -11,6 +12,19 @@ __all__ = [
 from html.parser import HTMLParser
 
 import sys
+from typing import (
+    Any,
+    Callable,
+    cast,
+    Dict,
+    Iterable,
+    List,
+    Optional,
+    TYPE_CHECKING,
+    Tuple,
+    Type,
+    Union,
+)
 import warnings
 
 from bs4.element import (
@@ -30,21 +44,25 @@ from bs4.builder import (
     STRICT,
     )
 
-
+from bs4.element import Tag
+if TYPE_CHECKING:
+    from bs4 import BeautifulSoup
+    from bs4.element import NavigableString
+    from bs4._typing import (
+        _AttributeValues,
+        _Encoding,
+        _Encodings,
+        _RawMarkup,
+    )
+
 HTMLPARSER = 'html.parser'
 
+_DuplicateAttributeHandler = Callable[[Dict[str, str], str, str], None]
+
 class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML):
     """A subclass of the Python standard library's HTMLParser class, which
     listens for HTMLParser events and translates them into calls
     to Beautiful Soup's tree construction API.
-    """
-
-    # Strategies for handling duplicate attributes
-    IGNORE = 'ignore'
-    REPLACE = 'replace'
-
-    def __init__(self, *args, **kwargs):
-        """Constructor.
 
         :param on_duplicate_attribute: A strategy for what to do if a
             tag includes the same attribute more than once. Accepted
@@ -53,8 +71,10 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML):
             encountered), or a callable. A callable must take three
             arguments: the dictionary of attributes already processed,
             the name of the duplicate attribute, and the most recent value
-            encountered. 
+            encountered.
         """
+    def __init__(self, soup:BeautifulSoup, *args, **kwargs):
+        self.soup = soup
         self.on_duplicate_attribute = kwargs.pop(
             'on_duplicate_attribute', self.REPLACE
         )
@@ -70,8 +90,20 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML):
         self.already_closed_empty_element = []
 
         self._initialize_xml_detector()
+
+    #: Constant to handle duplicate attributes by replacing earlier values
+    #: with later ones.
+    IGNORE:str = 'ignore'
+
+    #: Constant to handle duplicate attributes by ignoring later values
+    #: and keeping the earlier ones.
+    REPLACE:str = 'replace'
 
-    def error(self, message):
+    on_duplicate_attribute:Union[str, _DuplicateAttributeHandler]
+    already_closed_empty_element: List[str]
+    soup: BeautifulSoup
+
+    def error(self, message:str) -> None:
         # NOTE: This method is required so long as Python 3.9 is
         # supported. The corresponding code is removed from HTMLParser
         # in 3.5, but not removed from ParserBase until 3.10.
@@ -87,32 +119,33 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML):
         # catch this error and wrap it in a ParserRejectedMarkup.)
         raise ParserRejectedMarkup(message)
 
-    def handle_startendtag(self, name, attrs):
+    def handle_startendtag(
+        self, name:str, attrs:List[Tuple[str, Optional[str]]]
+    ) -> None:
         """Handle an incoming empty-element tag.
 
-        This is only called when the markup looks like <tag/>.
-
-        :param name: Name of the tag.
-        :param attrs: Dictionary of the tag's attributes.
+        html.parser only calls this method when the markup looks like
+        <tag/>.
         """
-        # is_startend() tells handle_starttag not to close the tag
+        # `handle_empty_element` tells handle_starttag not to close the tag
         # just because its name matches a known empty-element tag. We
-        # know that this is an empty-element tag and we want to call
+        # know that this is an empty-element tag, and we want to call
         # handle_endtag ourselves.
-        tag = self.handle_starttag(name, attrs, handle_empty_element=False)
+        self.handle_starttag(name, attrs, handle_empty_element=False)
         self.handle_endtag(name)
 
-    def handle_starttag(self, name, attrs, handle_empty_element=True):
+    def handle_starttag(
+        self, name:str, attrs:List[Tuple[str, Optional[str]]],
+        handle_empty_element:bool=True
+    ) -> None:
         """Handle an opening tag, e.g. '<tag>'
 
-        :param name: Name of the tag.
-        :param attrs: Dictionary of the tag's attributes.
         :param handle_empty_element: True if this tag is known to be
             an empty-element tag (i.e. there is not expected to be any
             closing tag).
         """
-        # XXX namespace
-        attr_dict = {}
+        # TODO: handle namespaces here?
+        attr_dict: Dict[str, str] = {}
         for key, value in attrs:
             # Change None attribute values to the empty string
             # for consistency with the other tree builders.
@@ -128,6 +161,7 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML):
                 elif on_dupe in (None, self.REPLACE):
                     attr_dict[key] = value
                 else:
+                    on_dupe = cast(_DuplicateAttributeHandler, on_dupe)
                     on_dupe(attr_dict, key, value)
             else:
                 attr_dict[key] = value
@@ -157,7 +191,7 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML):
         if self._root_tag is None:
             self._root_tag_encountered(name)
 
-    def handle_endtag(self, name, check_already_closed=True):
+    def handle_endtag(self, name:str, check_already_closed:bool=True) -> None:
         """Handle a closing tag, e.g. '</tag>'
 
         :param name: A tag name.
@@ -175,11 +209,11 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML):
         else:
             self.soup.handle_endtag(name)
 
-    def handle_data(self, data):
+    def handle_data(self, data:str) -> None:
         """Handle some textual data that shows up between tags."""
         self.soup.handle_data(data)
 
-    def handle_charref(self, name):
+    def handle_charref(self, name:str) -> None:
         """Handle a numeric character reference by converting it to the
         corresponding Unicode character and treating it as textual
         data.
@@ -219,7 +253,7 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML):
219 data = data or "\N{REPLACEMENT CHARACTER}"253 data = data or "\N{REPLACEMENT CHARACTER}"
220 self.handle_data(data)254 self.handle_data(data)
221255
222 def handle_entityref(self, name):256 def handle_entityref(self, name:str) -> None:
223 """Handle a named entity reference by converting it to the257 """Handle a named entity reference by converting it to the
224 corresponding Unicode character(s) and treating it as textual258 corresponding Unicode character(s) and treating it as textual
225 data.259 data.
@@ -238,7 +272,7 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML):
238 data = "&%s" % name272 data = "&%s" % name
239 self.handle_data(data)273 self.handle_data(data)
240274
241 def handle_comment(self, data):275 def handle_comment(self, data:str) -> None:
242 """Handle an HTML comment.276 """Handle an HTML comment.
243277
244 :param data: The text of the comment.278 :param data: The text of the comment.
@@ -247,7 +281,7 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML):
247 self.soup.handle_data(data)281 self.soup.handle_data(data)
248 self.soup.endData(Comment)282 self.soup.endData(Comment)
249283
250 def handle_decl(self, data):284 def handle_decl(self, data:str) -> None:
251 """Handle a DOCTYPE declaration.285 """Handle a DOCTYPE declaration.
252286
253 :param data: The text of the declaration.287 :param data: The text of the declaration.
@@ -257,11 +291,12 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML):
257 self.soup.handle_data(data)291 self.soup.handle_data(data)
258 self.soup.endData(Doctype)292 self.soup.endData(Doctype)
259293
260 def unknown_decl(self, data):294 def unknown_decl(self, data:str) -> None:
261 """Handle a declaration of unknown type -- probably a CDATA block.295 """Handle a declaration of unknown type -- probably a CDATA block.
262296
263 :param data: The text of the declaration.297 :param data: The text of the declaration.
264 """298 """
299 cls: Type[NavigableString]
265 if data.upper().startswith('CDATA['):300 if data.upper().startswith('CDATA['):
266 cls = CData301 cls = CData
267 data = data[len('CDATA['):]302 data = data[len('CDATA['):]
@@ -271,7 +306,7 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML):
271 self.soup.handle_data(data)306 self.soup.handle_data(data)
272 self.soup.endData(cls)307 self.soup.endData(cls)
273308
274 def handle_pi(self, data):309 def handle_pi(self, data:str) -> None:
275 """Handle a processing instruction.310 """Handle a processing instruction.
276311
277 :param data: The text of the instruction.312 :param data: The text of the instruction.
@@ -286,16 +321,17 @@ class HTMLParserTreeBuilder(HTMLTreeBuilder):
286 """A Beautiful soup `TreeBuilder` that uses the `HTMLParser` parser,321 """A Beautiful soup `TreeBuilder` that uses the `HTMLParser` parser,
287 found in the Python standard library.322 found in the Python standard library.
288 """323 """
289 is_xml = False324 is_xml:bool = False
290 picklable = True325 picklable:bool = True
291 NAME = HTMLPARSER326 NAME:str = HTMLPARSER
292 features = [NAME, HTML, STRICT]327 features: Iterable[str] = [NAME, HTML, STRICT]
293328
294 # The html.parser knows which line number and position in the329 #: The html.parser knows which line number and position in the
295 # original file is the source of an element.330 #: original file is the source of an element.
296 TRACKS_LINE_NUMBERS = True331 TRACKS_LINE_NUMBERS:bool = True
297332
298 def __init__(self, parser_args=None, parser_kwargs=None, **kwargs):333 def __init__(self, parser_args:Optional[Iterable[Any]]=None,
334 parser_kwargs:Optional[Dict[str, Any]]=None, **kwargs:Any):
299 """Constructor.335 """Constructor.
300336
301 :param parser_args: Positional arguments to pass into 337 :param parser_args: Positional arguments to pass into
@@ -320,9 +356,12 @@ class HTMLParserTreeBuilder(HTMLTreeBuilder):
320 parser_kwargs['convert_charrefs'] = False356 parser_kwargs['convert_charrefs'] = False
321 self.parser_args = (parser_args, parser_kwargs)357 self.parser_args = (parser_args, parser_kwargs)
322 358
323 def prepare_markup(self, markup, user_specified_encoding=None,359 def prepare_markup(
324 document_declared_encoding=None, exclude_encodings=None):360 self, markup:_RawMarkup,
325361 user_specified_encoding:Optional[_Encoding]=None,
362 document_declared_encoding:Optional[_Encoding]=None,
363 exclude_encodings:Optional[_Encodings]=None
364 ) -> Iterable[Tuple[str, Optional[_Encoding], Optional[_Encoding], bool]]:
326 """Run any preliminary steps necessary to make incoming markup365 """Run any preliminary steps necessary to make incoming markup
327 acceptable to the parser.366 acceptable to the parser.
328367
@@ -333,13 +372,12 @@ class HTMLParserTreeBuilder(HTMLTreeBuilder):
333 :param exclude_encodings: The user asked _not_ to try any of372 :param exclude_encodings: The user asked _not_ to try any of
334 these encodings.373 these encodings.
335374
336 :yield: A series of 4-tuples:375 :yield: A series of 4-tuples: (markup, encoding, declared encoding,
337 (markup, encoding, declared encoding,376 has undergone character replacement)
338 has undergone character replacement)
339377
340 Each 4-tuple represents a strategy for converting the378 Each 4-tuple represents a strategy for parsing the document.
341 document to Unicode and parsing it. Each strategy will be tried 379 This TreeBuilder uses Unicode, Dammit to convert the markup
342 in turn.380 into Unicode, so the `markup` element will always be a string.
343 """381 """
344 if isinstance(markup, str):382 if isinstance(markup, str):
345 # Parse Unicode as-is.383 # Parse Unicode as-is.
@@ -348,14 +386,19 @@ class HTMLParserTreeBuilder(HTMLTreeBuilder):
348386
349 # Ask UnicodeDammit to sniff the most likely encoding.387 # Ask UnicodeDammit to sniff the most likely encoding.
350388
351 # This was provided by the end-user; treat it as a known389 known_definite_encodings: List[_Encoding] = []
352 # definite encoding per the algorithm laid out in the HTML5390 if user_specified_encoding:
353 # spec. (See the EncodingDetector class for details.)391 # This was provided by the end-user; treat it as a known
354 known_definite_encodings = [user_specified_encoding]392 # definite encoding per the algorithm laid out in the
393 # HTML5 spec. (See the EncodingDetector class for
394 # details.)
395 known_definite_encodings.append(user_specified_encoding)
355396
356 # This was found in the document; treat it as a slightly lower-priority397 user_encodings: List[_Encoding] = []
357 # user encoding.398 if document_declared_encoding:
358 user_encodings = [document_declared_encoding]399 # This was found in the document; treat it as a slightly
400 # lower-priority user encoding.
401 user_encodings.append(document_declared_encoding)
359402
360 try_encodings = [user_specified_encoding, document_declared_encoding]403 try_encodings = [user_specified_encoding, document_declared_encoding]
361 dammit = UnicodeDammit(404 dammit = UnicodeDammit(
@@ -365,17 +408,27 @@ class HTMLParserTreeBuilder(HTMLTreeBuilder):
365 is_html=True,408 is_html=True,
366 exclude_encodings=exclude_encodings409 exclude_encodings=exclude_encodings
367 )410 )
368 yield (dammit.markup, dammit.original_encoding,
369 dammit.declared_html_encoding,
370 dammit.contains_replacement_characters)
371411
372 def feed(self, markup):412 if dammit.unicode_markup is None:
373 """Run some incoming markup through some parsing process,413 # In every case I've seen, Unicode, Dammit is able to
374 populating the `BeautifulSoup` object in self.soup.414 # convert the markup into Unicode, even if it needs to use
375 """415 # REPLACEMENT CHARACTER. But there is a code path that
416 # could result in unicode_markup being None, and
417 # HTMLParser can only parse Unicode, so here we handle
418 # that code path.
419 raise ParserRejectedMarkup("Could not convert input to Unicode, and html.parser will not accept bytestrings.")
420 else:
421 yield (dammit.unicode_markup, dammit.original_encoding,
422 dammit.declared_html_encoding,
423 dammit.contains_replacement_characters)
424
425 def feed(self, markup:str):
376 args, kwargs = self.parser_args426 args, kwargs = self.parser_args
377 parser = BeautifulSoupHTMLParser(*args, **kwargs)427 # We know BeautifulSoup calls TreeBuilder.initialize_soup
378 parser.soup = self.soup428 # before calling feed(), so we can assume self.soup
429 # is set.
430 assert self.soup is not None
431 parser = BeautifulSoupHTMLParser(self.soup, *args, **kwargs)
379 try:432 try:
380 parser.feed(markup)433 parser.feed(markup)
381 parser.close()434 parser.close()
@@ -385,3 +438,4 @@ class HTMLParserTreeBuilder(HTMLTreeBuilder):
385 # when there's an error in the doctype declaration.438 # when there's an error in the doctype declaration.
386 raise ParserRejectedMarkup(e)439 raise ParserRejectedMarkup(e)
387 parser.already_closed_empty_element = []440 parser.already_closed_empty_element = []
441
diff --git a/bs4/builder/_lxml.py b/bs4/builder/_lxml.py
index 971c81e..44a477f 100644
--- a/bs4/builder/_lxml.py
+++ b/bs4/builder/_lxml.py
@@ -1,3 +1,6 @@
+# encoding: utf-8
+from __future__ import annotations
+
 # Use of this source code is governed by the MIT license.
 __license__ = "MIT"

@@ -6,14 +9,26 @@ __all__ = [
     'LXMLTreeBuilder',
     ]

-try:
-    from collections.abc import Callable # Python 3.6
-except ImportError as e:
-    from collections import Callable
+from collections.abc import Callable
+
+from typing import (
+    Any,
+    Dict,
+    IO,
+    Iterable,
+    List,
+    Optional,
+    Set,
+    Tuple,
+    Type,
+    TYPE_CHECKING,
+    Union,
+)

 from io import BytesIO
 from io import StringIO
 from lxml import etree
+from bs4.dammit import (_Encoding)
 from bs4.element import (
     Comment,
     Doctype,
@@ -31,33 +46,54 @@ from bs4.builder import (
     TreeBuilder,
     XML)
 from bs4.dammit import EncodingDetector
-
-LXML = 'lxml'
+if TYPE_CHECKING:
+    from bs4._typing import (
+        _Encoding,
+        _Encodings,
+        _NamespacePrefix,
+        _NamespaceURL,
+        _NamespaceMapping,
+        _InvertedNamespaceMapping,
+        _RawMarkup,
+    )
+    from bs4 import BeautifulSoup
+
+LXML:str = 'lxml'

 def _invert(d):
     "Invert a dictionary."
     return dict((v,k) for k, v in list(d.items()))

 class LXMLTreeBuilderForXML(TreeBuilder):
-    DEFAULT_PARSER_CLASS = etree.XMLParser
-
-    is_xml = True
-    processing_instruction_class = XMLProcessingInstruction

-    NAME = "lxml-xml"
-    ALTERNATE_NAMES = ["xml"]
+    DEFAULT_PARSER_CLASS:Type[Any] = etree.XMLParser
+
+    is_xml:bool = True
+
+    processing_instruction_class:Type[ProcessingInstruction]
+
+    NAME:str = "lxml-xml"
+    ALTERNATE_NAMES: Iterable[str] = ["xml"]

     # Well, it's permissive by XML parser standards.
-    features = [NAME, LXML, XML, FAST, PERMISSIVE]
+    features: Iterable[str] = [NAME, LXML, XML, FAST, PERMISSIVE]

-    CHUNK_SIZE = 512
+    CHUNK_SIZE:int = 512

     # This namespace mapping is specified in the XML Namespace
     # standard.
-    DEFAULT_NSMAPS = dict(xml='http://www.w3.org/XML/1998/namespace')
+    DEFAULT_NSMAPS: _NamespaceMapping = dict(
+        xml='http://www.w3.org/XML/1998/namespace'
+    )

-    DEFAULT_NSMAPS_INVERTED = _invert(DEFAULT_NSMAPS)
+    DEFAULT_NSMAPS_INVERTED:_InvertedNamespaceMapping = _invert(
+        DEFAULT_NSMAPS
+    )

+    nsmaps: List[Optional[_InvertedNamespaceMapping]]
+    empty_element_tags: Set[str]
+    parser: Any
+
     # NOTE: If we parsed Element objects and looked at .sourceline,
     # we'd be able to see the line numbers from the original document.
     # But instead we build an XMLParser or HTMLParser object to serve
@@ -65,16 +101,18 @@ class LXMLTreeBuilderForXML(TreeBuilder):
     # line numbers.
     # See: https://bugs.launchpad.net/lxml/+bug/1846906

-    def initialize_soup(self, soup):
+    def initialize_soup(self, soup:BeautifulSoup) -> None:
         """Let the BeautifulSoup object know about the standard namespace
         mapping.

         :param soup: A `BeautifulSoup`.
         """
+        # Beyond this point, self.soup is set, so we can assume (and
+        # assert) it's not None whenever necessary.
         super(LXMLTreeBuilderForXML, self).initialize_soup(soup)
         self._register_namespaces(self.DEFAULT_NSMAPS)

-    def _register_namespaces(self, mapping):
+    def _register_namespaces(self, mapping:Dict[str, str]) -> None:
         """Let the BeautifulSoup object know about namespaces encountered
         while parsing the document.

@@ -87,6 +125,7 @@ class LXMLTreeBuilderForXML(TreeBuilder):

         :param mapping: A dictionary mapping namespace prefixes to URIs.
         """
+        assert self.soup is not None
         for key, value in list(mapping.items()):
             # This is 'if key' and not 'if key is not None' because we
             # don't track un-prefixed namespaces. Soupselect will
@@ -98,19 +137,18 @@ class LXMLTreeBuilderForXML(TreeBuilder):
             # prefix, the first one in the document takes precedence.
             self.soup._namespaces[key] = value

-    def default_parser(self, encoding):
+    def default_parser(self, encoding:Optional[_Encoding]) -> Type:
         """Find the default parser for the given encoding.

-        :param encoding: A string.
         :return: Either a parser object or a class, which
           will be instantiated with default arguments.
         """
         if self._default_parser is not None:
             return self._default_parser
-        return etree.XMLParser(
+        return self.DEFAULT_PARSER_CLASS(
             target=self, strip_cdata=False, recover=True, encoding=encoding)

-    def parser_for(self, encoding):
+    def parser_for(self, encoding: Optional[_Encoding]) -> Any:
         """Instantiate an appropriate parser for the given encoding.

         :param encoding: A string.
@@ -119,36 +157,39 @@ class LXMLTreeBuilderForXML(TreeBuilder):
             # Use the default parser.
             parser = self.default_parser(encoding)

-        if isinstance(parser, Callable):
+        if callable(parser):
             # Instantiate the parser with default arguments
             parser = parser(
                 target=self, strip_cdata=False, recover=True, encoding=encoding
             )
         return parser

-    def __init__(self, parser=None, empty_element_tags=None, **kwargs):
+    def __init__(self, parser:Optional[Any]=None,
+                 empty_element_tags:Optional[Set[str]]=None, **kwargs):
         # TODO: Issue a warning if parser is present but not a
         # callable, since that means there's no way to create new
         # parsers for different encodings.
         self._default_parser = parser
-        if empty_element_tags is not None:
-            self.empty_element_tags = set(empty_element_tags)
         self.soup = None
         self.nsmaps = [self.DEFAULT_NSMAPS_INVERTED]
         self.active_namespace_prefixes = [dict(self.DEFAULT_NSMAPS)]
         super(LXMLTreeBuilderForXML, self).__init__(**kwargs)

-    def _getNsTag(self, tag):
+    def _getNsTag(self, tag:str) -> Tuple[Optional[str], str]:
         # Split the namespace URL out of a fully-qualified lxml tag
         # name. Copied from lxml's src/lxml/sax.py.
         if tag[0] == '{':
-            return tuple(tag[1:].split('}', 1))
+            namespace, name = tag[1:].split('}', 1)
+            return (namespace, name)
         else:
             return (None, tag)

-    def prepare_markup(self, markup, user_specified_encoding=None,
-                       exclude_encodings=None,
-                       document_declared_encoding=None):
+    def prepare_markup(
+            self, markup:_RawMarkup,
+            user_specified_encoding:Optional[_Encoding]=None,
+            document_declared_encoding:Optional[_Encoding]=None,
+            exclude_encodings:Optional[_Encodings]=None,
+    ) -> Iterable[Tuple[Union[str,bytes], Optional[_Encoding], Optional[_Encoding], bool]]:
         """Run any preliminary steps necessary to make incoming markup
         acceptable to the parser.

@@ -166,13 +207,12 @@ class LXMLTreeBuilderForXML(TreeBuilder):
         :param exclude_encodings: The user asked _not_ to try any of
           these encodings.

-        :yield: A series of 4-tuples:
-         (markup, encoding, declared encoding,
-          has undergone character replacement)
+        :yield: A series of 4-tuples: (markup, encoding, declared encoding,
+         has undergone character replacement)

         Each 4-tuple represents a strategy for converting the
         document to Unicode and parsing it. Each strategy will be tried 
         in turn.
         """
         is_html = not self.is_xml
         if is_html:
@@ -200,14 +240,25 @@ class LXMLTreeBuilderForXML(TreeBuilder):
             yield (markup.encode("utf8"), "utf8",
                    document_declared_encoding, False)

-        # This was provided by the end-user; treat it as a known
-        # definite encoding per the algorithm laid out in the HTML5
-        # spec. (See the EncodingDetector class for details.)
-        known_definite_encodings = [user_specified_encoding]
+            # Since the document was Unicode in the first place, there
+            # is no need to try any more strategies; we know this will
+            # work.
+            return
+
+        known_definite_encodings: List[_Encoding] = []
+        if user_specified_encoding:
+            # This was provided by the end-user; treat it as a known
+            # definite encoding per the algorithm laid out in the
+            # HTML5 spec. (See the EncodingDetector class for
+            # details.)
+            known_definite_encodings.append(user_specified_encoding)
+
+        user_encodings: List[_Encoding] = []
+        if document_declared_encoding:
+            # This was found in the document; treat it as a slightly
+            # lower-priority user encoding.
+            user_encodings.append(document_declared_encoding)

-        # This was found in the document; treat it as a slightly lower-priority
-        # user encoding.
-        user_encodings = [document_declared_encoding]
         detector = EncodingDetector(
             markup, known_definite_encodings=known_definite_encodings,
             user_encodings=user_encodings, is_html=is_html,
@@ -216,34 +267,45 @@ class LXMLTreeBuilderForXML(TreeBuilder):
         for encoding in detector.encodings:
             yield (detector.markup, encoding, document_declared_encoding, False)

-    def feed(self, markup):
+    def feed(self, markup:Union[bytes,str]) -> None:
+        io: IO
         if isinstance(markup, bytes):
-            markup = BytesIO(markup)
+            io = BytesIO(markup)
         elif isinstance(markup, str):
-            markup = StringIO(markup)
+            io = StringIO(markup)

+        # initialize_soup is called before feed, so we know this
+        # is not None.
+        assert self.soup is not None
+
         # Call feed() at least once, even if the markup is empty,
         # or the parser won't be initialized.
-        data = markup.read(self.CHUNK_SIZE)
+        data = io.read(self.CHUNK_SIZE)
         try:
             self.parser = self.parser_for(self.soup.original_encoding)
             self.parser.feed(data)
             while len(data) != 0:
                 # Now call feed() on the rest of the data, chunk by chunk.
-                data = markup.read(self.CHUNK_SIZE)
+                data = io.read(self.CHUNK_SIZE)
                 if len(data) != 0:
                     self.parser.feed(data)
             self.parser.close()
         except (UnicodeDecodeError, LookupError, etree.ParserError) as e:
             raise ParserRejectedMarkup(e)

-    def close(self):
+    def close(self) -> None:
         self.nsmaps = [self.DEFAULT_NSMAPS_INVERTED]

-    def start(self, name, attrs, nsmap={}):
+    def start(self, name:str, attrs:Dict[str, str], nsmap:_NamespaceMapping={}):
+        # This is called by lxml code as a result of calling
+        # BeautifulSoup.feed(), and we know self.soup is set by the time feed()
+        # is called.
+        assert self.soup is not None
+
         # Make sure attrs is a mutable dict--lxml may send an immutable dictproxy.
         attrs = dict(attrs)
-        nsprefix = None
+        nsprefix: Optional[_NamespacePrefix] = None
+        namespace: Optional[_NamespaceURL] = None
         # Invert each namespace map as it comes in.
         if len(nsmap) == 0 and len(self.nsmaps) > 1:
             # There are no new namespaces for this tag, but
@@ -285,7 +347,7 @@ class LXMLTreeBuilderForXML(TreeBuilder):
             # Namespaces are in play. Find any attributes that came in
             # from lxml with namespaces attached to their names, and
             # turn then into NamespacedAttribute objects.
-            new_attrs = {}
+            new_attrs:Dict[Union[str,NamespacedAttribute], str] = {}
             for attr, value in list(attrs.items()):
                 namespace, attr = self._getNsTag(attr)
                 if namespace is None:
@@ -303,7 +365,7 @@ class LXMLTreeBuilderForXML(TreeBuilder):
             namespaces=self.active_namespace_prefixes[-1]
         )

-    def _prefix_for_namespace(self, namespace):
+    def _prefix_for_namespace(self, namespace:Optional[_NamespaceURL]) -> Optional[_NamespacePrefix]:
         """Find the currently active prefix for the given namespace."""
         if namespace is None:
             return None
@@ -312,7 +374,8 @@ class LXMLTreeBuilderForXML(TreeBuilder):
                 return inverted_nsmap[namespace]
         return None

-    def end(self, name):
+    def end(self, name:str) -> None:
+        assert self.soup is not None
         self.soup.endData()
         completed_tag = self.soup.tagStack[-1]
         namespace, name = self._getNsTag(name)
@@ -334,44 +397,49 @@ class LXMLTreeBuilderForXML(TreeBuilder):
             # namespace prefixes.
             self.active_namespace_prefixes.pop()

-    def pi(self, target, data):
+    def pi(self, target:str, data:str) -> None:
+        assert self.soup is not None
         self.soup.endData()
         data = target + ' ' + data
         self.soup.handle_data(data)
         self.soup.endData(self.processing_instruction_class)

-    def data(self, content):
+    def data(self, content:str) -> None:
+        assert self.soup is not None
         self.soup.handle_data(content)

-    def doctype(self, name, pubid, system):
+    def doctype(self, name:str, pubid:str, system:str) -> None:
+        assert self.soup is not None
         self.soup.endData()
         doctype = Doctype.for_name_and_ids(name, pubid, system)
         self.soup.object_was_parsed(doctype)

-    def comment(self, content):
+    def comment(self, content:str) -> None:
         "Handle comments as Comment objects."
+        assert self.soup is not None
         self.soup.endData()
         self.soup.handle_data(content)
         self.soup.endData(Comment)

-    def test_fragment_to_document(self, fragment):
+    def test_fragment_to_document(self, fragment:str) -> str:
         """See `TreeBuilder`."""
         return '<?xml version="1.0" encoding="utf-8"?>\n%s' % fragment


 class LXMLTreeBuilder(HTMLTreeBuilder, LXMLTreeBuilderForXML):

-    NAME = LXML
-    ALTERNATE_NAMES = ["lxml-html"]
+    NAME:str = LXML
+    ALTERNATE_NAMES: Iterable[str] = ["lxml-html"]

-    features = ALTERNATE_NAMES + [NAME, HTML, FAST, PERMISSIVE]
-    is_xml = False
-    processing_instruction_class = ProcessingInstruction
+    features: Iterable[str] = list(ALTERNATE_NAMES) + [NAME, HTML, FAST, PERMISSIVE]
+    is_xml: bool = False

-    def default_parser(self, encoding):
+    def default_parser(self, encoding:Optional[_Encoding]) -> Type[Any]:
         return etree.HTMLParser

-    def feed(self, markup):
+    def feed(self, markup:_RawMarkup) -> None:
+        # We know self.soup is set by the time feed() is called.
+        assert self.soup is not None
         encoding = self.soup.original_encoding
         try:
             self.parser = self.parser_for(encoding)
@@ -381,6 +449,7 @@ class LXMLTreeBuilder(HTMLTreeBuilder, LXMLTreeBuilderForXML):
             raise ParserRejectedMarkup(e)


-    def test_fragment_to_document(self, fragment):
+    def test_fragment_to_document(self, fragment:str) -> str:
         """See `TreeBuilder`."""
         return '<html><body>%s</body></html>' % fragment
+
diff --git a/bs4/css.py b/bs4/css.py
index 245ac60..0477de8 100644
--- a/bs4/css.py
+++ b/bs4/css.py
@@ -1,6 +1,36 @@
1"""Integration code for CSS selectors using Soup Sieve (pypi: soupsieve)."""1"""Integration code for CSS selectors using `Soup Sieve <https://facelessuser.github.io/soupsieve/>`_ (pypi: ``soupsieve``).
22
3Acquire a `CSS` object through the `bs4.element.Tag.css` attribute of
4the starting point of your CSS selector, or (if you want to run a
5selector against the entire document) of the `BeautifulSoup` object
6itself.
7
8The main advantage of doing this instead of using ``soupsieve``
9functions is that you don't need to keep passing the `bs4.element.Tag` to be
10selected against, since the `CSS` object is permanently scoped to that
11`bs4.element.Tag`.
12
13"""
14
15from __future__ import annotations
16
17from types import ModuleType
18from typing import (
19 Any,
20 cast,
21 Iterable,
22 Iterator,
23 Optional,
24 TYPE_CHECKING,
25)
3import warnings26import warnings
27from bs4._typing import _NamespaceMapping
28if TYPE_CHECKING:
29 from soupsieve import SoupSieve
30 from bs4 import element
31 from bs4.element import ResultSet, Tag
32
33soupsieve: Optional[ModuleType]
4try:34try:
5 import soupsieve35 import soupsieve
6except ImportError as e:36except ImportError as e:
@@ -9,34 +39,22 @@ except ImportError as e:
         'The soupsieve package is not installed. CSS selectors cannot be used.'
     )
 
-
 class CSS(object):
-    """A proxy object against the soupsieve library, to simplify its
+    """A proxy object against the ``soupsieve`` library, to simplify its
     CSS selector API.
 
-    Acquire this object through the .css attribute on the
-    BeautifulSoup object, or on the Tag you want to use as the
-    starting point for a CSS selector.
-
-    The main advantage of doing this is that the tag to be selected
-    against doesn't need to be explicitly specified in the function
-    calls, since it's already scoped to a tag.
-    """
-
-    def __init__(self, tag, api=soupsieve):
-        """Constructor.
-
-        You don't need to instantiate this class yourself; instead,
-        access the .css attribute on the BeautifulSoup object, or on
-        the Tag you want to use as the starting point for your CSS
-        selector.
+    You don't need to instantiate this class yourself; instead, use
+    `element.Tag.css`.
 
-        :param tag: All CSS selectors will use this as their starting
-        point.
+    :param tag: All CSS selectors run by this object will use this as
+       their starting point.
 
-        :param api: A plug-in replacement for the soupsieve module,
-        designed mainly for use in tests.
-        """
+    :param api: An optional drop-in replacement for the ``soupsieve`` module,
+       intended for use in unit tests.
+    """
+    def __init__(self, tag: element.Tag, api:Optional[ModuleType]=None):
+        if api is None:
+            api = soupsieve
         if api is None:
             raise NotImplementedError(
                 "Cannot execute CSS selectors because the soupsieve package is not installed."
@@ -44,19 +62,19 @@ class CSS(object):
         self.api = api
         self.tag = tag
 
-    def escape(self, ident):
+    def escape(self, ident:str) -> str:
         """Escape a CSS identifier.
 
-        This is a simple wrapper around soupselect.escape(). See the
+        This is a simple wrapper around `soupsieve.escape() <https://facelessuser.github.io/soupsieve/api/#soupsieveescape>`_. See the
         documentation for that function for more information.
         """
         if soupsieve is None:
             raise NotImplementedError(
                 "Cannot escape CSS identifiers because the soupsieve package is not installed."
             )
-        return self.api.escape(ident)
+        return cast(str, self.api.escape(ident))
 
-    def _ns(self, ns, select):
+    def _ns(self, ns:Optional[_NamespaceMapping], select:str) -> Optional[_NamespaceMapping]:
         """Normalize a dictionary of namespaces."""
         if not isinstance(select, self.api.SoupSieve) and ns is None:
             # If the selector is a precompiled pattern, it already has
@@ -65,7 +83,7 @@ class CSS(object):
             ns = self.tag._namespaces
         return ns
 
-    def _rs(self, results):
+    def _rs(self, results:Iterable[Tag]) -> ResultSet[Tag]:
         """Normalize a list of results to a Resultset.
 
         A ResultSet is more consistent with the rest of Beautiful
@@ -77,7 +95,12 @@ class CSS(object):
         from bs4.element import ResultSet
         return ResultSet(None, results)
 
-    def compile(self, select, namespaces=None, flags=0, **kwargs):
+    def compile(self,
+                select:str,
+                namespaces:Optional[_NamespaceMapping]=None,
+                flags:int=0,
+                **kwargs:Any
+    ) -> SoupSieve:
         """Pre-compile a selector and return the compiled object.
 
         :param selector: A CSS selector.
@@ -88,10 +111,10 @@ class CSS(object):
             parsing the document.
 
         :param flags: Flags to be passed into Soup Sieve's
-            soupsieve.compile() method.
+            `soupsieve.compile() <https://facelessuser.github.io/soupsieve/api/#soupsievecompile>`_ method.
 
-        :param kwargs: Keyword arguments to be passed into SoupSieve's
-            soupsieve.compile() method.
+        :param kwargs: Keyword arguments to be passed into Soup Sieve's
+            `soupsieve.compile() <https://facelessuser.github.io/soupsieve/api/#soupsievecompile>`_ method.
 
         :return: A precompiled selector object.
         :rtype: soupsieve.SoupSieve
@@ -100,13 +123,16 @@ class CSS(object):
             select, self._ns(namespaces, select), flags, **kwargs
         )
 
-    def select_one(self, select, namespaces=None, flags=0, **kwargs):
+    def select_one(
+            self, select:str,
+            namespaces:Optional[_NamespaceMapping]=None,
+            flags:int=0, **kwargs:Any
+    )-> element.Tag | None:
         """Perform a CSS selection operation on the current Tag and return the
-        first result.
+        first result, if any.
 
         This uses the Soup Sieve library. For more information, see
-        that library's documentation for the soupsieve.select_one()
-        method.
+        that library's documentation for the `soupsieve.select_one() <https://facelessuser.github.io/soupsieve/api/#soupsieveselect_one>`_ method.
 
         :param selector: A CSS selector.
 
@@ -116,27 +142,24 @@ class CSS(object):
             parsing the document.
 
         :param flags: Flags to be passed into Soup Sieve's
-            soupsieve.select_one() method.
-
-        :param kwargs: Keyword arguments to be passed into SoupSieve's
-            soupsieve.select_one() method.
-
-        :return: A Tag, or None if the selector has no match.
-        :rtype: bs4.element.Tag
+            `soupsieve.select_one() <https://facelessuser.github.io/soupsieve/api/#soupsieveselect_one>`_ method.
 
+        :param kwargs: Keyword arguments to be passed into Soup Sieve's
+            `soupsieve.select_one() <https://facelessuser.github.io/soupsieve/api/#soupsieveselect_one>`_ method.
         """
         return self.api.select_one(
             select, self.tag, self._ns(namespaces, select), flags, **kwargs
         )
 
-    def select(self, select, namespaces=None, limit=0, flags=0, **kwargs):
-        """Perform a CSS selection operation on the current Tag.
+    def select(self, select:str,
+               namespaces:Optional[_NamespaceMapping]=None,
+               limit:int=0, flags:int=0, **kwargs:Any) -> ResultSet[Tag]:
+        """Perform a CSS selection operation on the current `element.Tag`.
 
         This uses the Soup Sieve library. For more information, see
-        that library's documentation for the soupsieve.select()
-        method.
+        that library's documentation for the `soupsieve.select() <https://facelessuser.github.io/soupsieve/api/#soupsieveselect>`_ method.
 
-        :param selector: A string containing a CSS selector.
+        :param selector: A CSS selector.
 
         :param namespaces: A dictionary mapping namespace prefixes
             used in the CSS selector to namespace URIs. By default,
@@ -146,14 +169,10 @@ class CSS(object):
         :param limit: After finding this number of results, stop looking.
 
         :param flags: Flags to be passed into Soup Sieve's
-            soupsieve.select() method.
-
-        :param kwargs: Keyword arguments to be passed into SoupSieve's
-            soupsieve.select() method.
-
-        :return: A ResultSet of Tag objects.
-        :rtype: bs4.element.ResultSet
+            `soupsieve.select() <https://facelessuser.github.io/soupsieve/api/#soupsieveselect>`_ method.
 
+        :param kwargs: Keyword arguments to be passed into Soup Sieve's
+            `soupsieve.select() <https://facelessuser.github.io/soupsieve/api/#soupsieveselect>`_ method.
         """
         if limit is None:
             limit = 0
@@ -165,11 +184,14 @@ class CSS(object):
             )
         )
 
-    def iselect(self, select, namespaces=None, limit=0, flags=0, **kwargs):
-        """Perform a CSS selection operation on the current Tag.
+    def iselect(self, select:str,
+                namespaces:Optional[_NamespaceMapping]=None,
+                limit:int=0, flags:int=0, **kwargs:Any) -> Iterator[element.Tag]:
+        """Perform a CSS selection operation on the current `element.Tag`.
 
         This uses the Soup Sieve library. For more information, see
-        that library's documentation for the soupsieve.iselect()
+        that library's documentation for the `soupsieve.iselect()
+        <https://facelessuser.github.io/soupsieve/api/#soupsieveiselect>`_
         method. It is the same as select(), but it returns a generator
         instead of a list.
 
@@ -183,23 +205,23 @@ class CSS(object):
         :param limit: After finding this number of results, stop looking.
 
         :param flags: Flags to be passed into Soup Sieve's
-            soupsieve.iselect() method.
-
-        :param kwargs: Keyword arguments to be passed into SoupSieve's
-            soupsieve.iselect() method.
+            `soupsieve.iselect() <https://facelessuser.github.io/soupsieve/api/#soupsieveiselect>`_ method.
 
-        :return: A generator
-        :rtype: types.GeneratorType
+        :param kwargs: Keyword arguments to be passed into Soup Sieve's
+            `soupsieve.iselect() <https://facelessuser.github.io/soupsieve/api/#soupsieveiselect>`_ method.
         """
         return self.api.iselect(
             select, self.tag, self._ns(namespaces, select), limit, flags, **kwargs
         )
 
-    def closest(self, select, namespaces=None, flags=0, **kwargs):
-        """Find the Tag closest to this one that matches the given selector.
+    def closest(self, select:str,
+                namespaces:Optional[_NamespaceMapping]=None,
+                flags:int=0, **kwargs:Any) -> Optional[element.Tag]:
+        """Find the `element.Tag` closest to this one that matches the given selector.
 
         This uses the Soup Sieve library. For more information, see
-        that library's documentation for the soupsieve.closest()
+        that library's documentation for the `soupsieve.closest()
+        <https://facelessuser.github.io/soupsieve/api/#soupsieveclosest>`_
         method.
 
         :param selector: A string containing a CSS selector.
@@ -210,24 +232,24 @@ class CSS(object):
             parsing the document.
 
         :param flags: Flags to be passed into Soup Sieve's
-            soupsieve.closest() method.
+            `soupsieve.closest() <https://facelessuser.github.io/soupsieve/api/#soupsieveclosest>`_ method.
 
-        :param kwargs: Keyword arguments to be passed into SoupSieve's
-            soupsieve.closest() method.
-
-        :return: A Tag, or None if there is no match.
-        :rtype: bs4.Tag
+        :param kwargs: Keyword arguments to be passed into Soup Sieve's
+            `soupsieve.closest() <https://facelessuser.github.io/soupsieve/api/#soupsieveclosest>`_ method.
 
         """
         return self.api.closest(
             select, self.tag, self._ns(namespaces, select), flags, **kwargs
         )
 
-    def match(self, select, namespaces=None, flags=0, **kwargs):
-        """Check whether this Tag matches the given CSS selector.
+    def match(self, select:str,
+              namespaces:Optional[_NamespaceMapping]=None,
+              flags:int=0, **kwargs:Any) -> bool:
+        """Check whether or not this `element.Tag` matches the given CSS selector.
 
         This uses the Soup Sieve library. For more information, see
-        that library's documentation for the soupsieve.match()
+        that library's documentation for the `soupsieve.match()
+        <https://facelessuser.github.io/soupsieve/api/#soupsievematch>`_
         method.
 
         :param: a CSS selector.
@@ -238,25 +260,30 @@ class CSS(object):
             parsing the document.
 
         :param flags: Flags to be passed into Soup Sieve's
-            soupsieve.match() method.
+            `soupsieve.match()
+            <https://facelessuser.github.io/soupsieve/api/#soupsievematch>`_
+            method.
 
         :param kwargs: Keyword arguments to be passed into SoupSieve's
-            soupsieve.match() method.
-
-        :return: True if this Tag matches the selector; False otherwise.
-        :rtype: bool
+            `soupsieve.match()
+            <https://facelessuser.github.io/soupsieve/api/#soupsievematch>`_
+            method.
         """
-        return self.api.match(
+        return cast(bool, self.api.match(
             select, self.tag, self._ns(namespaces, select), flags, **kwargs
-        )
+        ))
 
-    def filter(self, select, namespaces=None, flags=0, **kwargs):
-        """Filter this Tag's direct children based on the given CSS selector.
+    def filter(self, select:str,
+               namespaces:Optional[_NamespaceMapping]=None,
+               flags:int=0, **kwargs:Any) -> ResultSet[Tag]:
+        """Filter this `element.Tag`'s direct children based on the given CSS selector.
 
         This uses the Soup Sieve library. It works the same way as
-        passing this Tag into that library's soupsieve.filter()
-        method. More information, for more information see the
-        documentation for soupsieve.filter().
+        passing a `element.Tag` into that library's `soupsieve.filter()
+        <https://facelessuser.github.io/soupsieve/api/#soupsievefilter>`_
+        method. For more information, see the documentation for
+        `soupsieve.filter()
+        <https://facelessuser.github.io/soupsieve/api/#soupsievefilter>`_.
 
         :param namespaces: A dictionary mapping namespace prefixes
             used in the CSS selector to namespace URIs. By default,
@@ -264,17 +291,18 @@ class CSS(object):
             parsing the document.
 
         :param flags: Flags to be passed into Soup Sieve's
-            soupsieve.filter() method.
+            `soupsieve.filter()
+            <https://facelessuser.github.io/soupsieve/api/#soupsievefilter>`_
+            method.
 
         :param kwargs: Keyword arguments to be passed into SoupSieve's
-            soupsieve.filter() method.
-
-        :return: A ResultSet of Tag objects.
-        :rtype: bs4.element.ResultSet
-
+            `soupsieve.filter()
+            <https://facelessuser.github.io/soupsieve/api/#soupsievefilter>`_
+            method.
         """
         return self._rs(
             self.api.filter(
                 select, self.tag, self._ns(namespaces, select), flags, **kwargs
             )
         )
+
diff --git a/bs4/dammit.py b/bs4/dammit.py
index 692433c..8c1b631 100644
--- a/bs4/dammit.py
+++ b/bs4/dammit.py
@@ -2,9 +2,11 @@
 """Beautiful Soup bonus library: Unicode, Dammit
 
 This library converts a bytestream to Unicode through any means
-necessary. It is heavily based on code from Mark Pilgrim's Universal
-Feed Parser. It works best on XML and HTML, but it does not rewrite the
-XML or HTML to reflect a new encoding; that's the tree builder's job.
+necessary. It is heavily based on code from Mark Pilgrim's `Universal
+Feed Parser <https://pypi.org/project/feedparser/>`_. It works best on
+XML and HTML, but it does not rewrite the XML or HTML to reflect a new
+encoding; that's the job of `TreeBuilder`.
+
 """
 # Use of this source code is governed by the MIT license.
 __license__ = "MIT"
@@ -12,9 +14,31 @@ __license__ = "MIT"
 from html.entities import codepoint2name
 from collections import defaultdict
 import codecs
+from html.entities import html5
 import re
-import logging
+from logging import Logger, getLogger
 import string
+from types import ModuleType
+from typing import (
+    Dict,
+    Iterable,
+    Iterator,
+    List,
+    Optional,
+    Pattern,
+    Sequence,
+    Set,
+    Tuple,
+    Type,
+    Union,
+    cast,
+)
+from bs4._typing import (
+    _Encoding,
+    _Encodings,
+    _RawMarkup,
+)
+import warnings
 
 # Import a library to autodetect character encodings. We'll support
 # any of a number of libraries that all support the same API:
@@ -22,37 +46,41 @@ import string
 # * cchardet
 # * chardet
 # * charset-normalizer
-chardet_module = None
+chardet_module: Optional[ModuleType] = None
 try:
     # PyPI package: cchardet
-    import cchardet as chardet_module
+    import cchardet
+    chardet_module = cchardet
 except ImportError:
     try:
         # Debian package: python-chardet
         # PyPI package: chardet
-        import chardet as chardet_module
+        import chardet
+        chardet_module = chardet
     except ImportError:
         try:
             # PyPI package: charset-normalizer
-            import charset_normalizer as chardet_module
+            import charset_normalizer
+            chardet_module = charset_normalizer
         except ImportError:
             # No chardet available.
-            chardet_module = None
+            pass
 
-if chardet_module:
-    def chardet_dammit(s):
-        if isinstance(s, str):
-            return None
-        return chardet_module.detect(s)['encoding']
-else:
-    def chardet_dammit(s):
+
+def _chardet_dammit(s:bytes) -> Optional[str]:
+    """Try as hard as possible to detect the encoding of a bytestring."""
+    if chardet_module is None or isinstance(s, str):
         return None
+    module = chardet_module
+    return module.detect(s)['encoding']
 
 # Build bytestring and Unicode versions of regular expressions for finding
 # a declared encoding inside an XML or HTML document.
-xml_encoding = '^\\s*<\\?.*encoding=[\'"](.*?)[\'"].*\\?>'
-html_meta = '<\\s*meta[^>]+charset\\s*=\\s*["\']?([^>]*?)[ /;\'">]'
-encoding_res = dict()
+xml_encoding:str = '^\\s*<\\?.*encoding=[\'"](.*?)[\'"].*\\?>' #: :meta private:
+html_meta:str = '<\\s*meta[^>]+charset\\s*=\\s*["\']?([^>]*?)[ /;\'">]' #: :meta private:
+
+# TODO: The Pattern type here could use more refinement, but it's tricky.
+encoding_res: Dict[Type, Dict[str, Pattern]] = dict()
 encoding_res[bytes] = {
     'html' : re.compile(html_meta.encode("ascii"), re.I),
     'xml' : re.compile(xml_encoding.encode("ascii"), re.I),
@@ -62,12 +90,29 @@ encoding_res[str] = {
     'xml' : re.compile(xml_encoding, re.I)
 }
 
-from html.entities import html5
-
 class EntitySubstitution(object):
     """The ability to substitute XML or HTML entities for certain characters."""
 
-    def _populate_class_variables():
+    #: A map of named HTML entities to the corresponding Unicode string.
+    #:
+    #: :meta hide-value:
+    HTML_ENTITY_TO_CHARACTER: Dict[str, str]
+
+    #: A map of Unicode strings to the corresponding named HTML entities;
+    #: the inverse of HTML_ENTITY_TO_CHARACTER.
+    #:
+    #: :meta hide-value:
+    CHARACTER_TO_HTML_ENTITY: Dict[str, str]
+
+    #: A regular expression that matches any character (or, in rare
+    #: cases, pair of characters) that can be replaced with a named
+    #: HTML entity.
+    #:
+    #: :meta hide-value:
+    CHARACTER_TO_HTML_ENTITY_RE: Pattern[str]
+
+    @classmethod
+    def _populate_class_variables(cls) -> None:
         """Initialize variables used by this class to manage the plethora of
         HTML5 named entities.
 
@@ -184,11 +229,14 @@ class EntitySubstitution(object):
             character = chr(codepoint)
             unicode_to_name[character] = name
 
-        return unicode_to_name, name_to_unicode, re.compile(re_definition)
-    (CHARACTER_TO_HTML_ENTITY, HTML_ENTITY_TO_CHARACTER,
-     CHARACTER_TO_HTML_ENTITY_RE) = _populate_class_variables()
+        cls.CHARACTER_TO_HTML_ENTITY = unicode_to_name
+        cls.HTML_ENTITY_TO_CHARACTER = name_to_unicode
+        cls.CHARACTER_TO_HTML_ENTITY_RE = re.compile(re_definition)
 
-    CHARACTER_TO_XML_ENTITY = {
+    #: A map of Unicode strings to the corresponding named XML entities.
+    #:
+    #: :meta hide-value:
+    CHARACTER_TO_XML_ENTITY: Dict[str, str] = {
         "'": "apos",
         '"': "quot",
         "&": "amp",
@@ -196,28 +244,37 @@ class EntitySubstitution(object):
         ">": "gt",
         }
 
-    BARE_AMPERSAND_OR_BRACKET = re.compile("([<>]|"
-                                           "&(?!#\\d+;|#x[0-9a-fA-F]+;|\\w+;)"
-                                           ")")
-
-    AMPERSAND_OR_BRACKET = re.compile("([<>&])")
+    #: A regular expression matching an angle bracket or an ampersand that
+    #: is not part of an XML or HTML entity.
+    #:
+    #: :meta hide-value:
+    BARE_AMPERSAND_OR_BRACKET: Pattern[str] = re.compile(
+        "([<>]|"
+        "&(?!#\\d+;|#x[0-9a-fA-F]+;|\\w+;)"
+        ")"
+    )
+
+    #: A regular expression matching an angle bracket or an ampersand.
+    #:
+    #: :meta hide-value:
+    AMPERSAND_OR_BRACKET: Pattern[str] = re.compile("([<>&])")
 
     @classmethod
-    def _substitute_html_entity(cls, matchobj):
+    def _substitute_html_entity(cls, matchobj:re.Match[str]) -> str:
         """Used with a regular expression to substitute the
         appropriate HTML entity for a special character string."""
         entity = cls.CHARACTER_TO_HTML_ENTITY.get(matchobj.group(0))
         return "&%s;" % entity
 
     @classmethod
-    def _substitute_xml_entity(cls, matchobj):
+    def _substitute_xml_entity(cls, matchobj:re.Match[str]) -> str:
         """Used with a regular expression to substitute the
         appropriate XML entity for a special character string."""
         entity = cls.CHARACTER_TO_XML_ENTITY[matchobj.group(0)]
         return "&%s;" % entity
 
     @classmethod
-    def quoted_attribute_value(self, value):
+    def quoted_attribute_value(cls, value: str) -> str:
         """Make a value into a quoted XML attribute, possibly escaping it.
 
         Most strings will be quoted using double quotes.
@@ -233,7 +290,10 @@ class EntitySubstitution(object):
         double quotes will be escaped, and the string will be quoted
         using double quotes.
 
-        Welcome to "Bob's Bar" -> "Welcome to &quot;Bob's bar&quot;
+        Welcome to "Bob's Bar" -> Welcome to &quot;Bob's bar&quot;
+
+        :param value: The XML attribute value to quote
+        :return: The quoted value
         """
         quote_with = '"'
         if '"' in value:
@@ -254,17 +314,22 @@ class EntitySubstitution(object):
         return quote_with + value + quote_with
 
     @classmethod
-    def substitute_xml(cls, value, make_quoted_attribute=False):
-        """Substitute XML entities for special XML characters.
+    def substitute_xml(cls, value:str, make_quoted_attribute:bool=False) -> str:
+        """Replace special XML characters with named XML entities.
+
+        The less-than sign will become &lt;, the greater-than sign
+        will become &gt;, and any ampersands will become &amp;. If you
+        want ampersands that seem to be part of an entity definition
+        to be left alone, use `substitute_xml_containing_entities`
+        instead.
 
-        :param value: A string to be substituted. The less-than sign
-            will become &lt;, the greater-than sign will become &gt;,
-            and any ampersands will become &amp;. If you want ampersands
-            that appear to be part of an entity definition to be left
-            alone, use substitute_xml_containing_entities() instead.
+        :param value: A string to be substituted.
 
         :param make_quoted_attribute: If True, then the string will be
             quoted, as befits an attribute value.
+
+        :return: A version of ``value`` with special characters replaced
+            with named entities.
         """
         # Escape angle brackets and ampersands.
         value = cls.AMPERSAND_OR_BRACKET.sub(
@@ -276,7 +341,7 @@ class EntitySubstitution(object):
 
     @classmethod
     def substitute_xml_containing_entities(
-        cls, value, make_quoted_attribute=False):
+        cls, value: str, make_quoted_attribute:bool=False) -> str:
         """Substitute XML entities for special XML characters.
 
         :param value: A string to be substituted. The less-than sign will
@@ -297,10 +362,10 @@ class EntitySubstitution(object):
         return value
 
     @classmethod
-    def substitute_html(cls, s):
+    def substitute_html(cls, s: str) -> str:
         """Replace certain Unicode characters with named HTML entities.
 
-        This differs from data.encode(encoding, 'xmlcharrefreplace')
+        This differs from ``data.encode(encoding, 'xmlcharrefreplace')``
         in that the goal is to make the result more readable (to those
         with ASCII displays) rather than to recover from
         errors. There's absolutely nothing wrong with a UTF-8 string
@@ -308,109 +373,126 @@ class EntitySubstitution(object):
308 character with "&eacute;" will make it more readable to some373 character with "&eacute;" will make it more readable to some
309 people.374 people.
310375
311 :param s: A Unicode string.376 :param s: The string to be modified.
377 :return: The string with some Unicode characters replaced with
378 HTML entities.
312 """379 """
313 return cls.CHARACTER_TO_HTML_ENTITY_RE.sub(380 return cls.CHARACTER_TO_HTML_ENTITY_RE.sub(
314 cls._substitute_html_entity, s)381 cls._substitute_html_entity, s)
315382EntitySubstitution._populate_class_variables()
316383
317class EncodingDetector:384class EncodingDetector:
318 """Suggests a number of possible encodings for a bytestring.385 """This class is capable of guessing a number of possible encodings
386 for a bytestring.
319387
320 Order of precedence:388 Order of precedence:
321389
322 1. Encodings you specifically tell EncodingDetector to try first390 1. Encodings you specifically tell EncodingDetector to try first
323 (the known_definite_encodings argument to the constructor).391 (the ``known_definite_encodings`` argument to the constructor).
324392
325 2. An encoding determined by sniffing the document's byte-order mark.393 2. An encoding determined by sniffing the document's byte-order mark.
326394
327 3. Encodings you specifically tell EncodingDetector to try if395 3. Encodings you specifically tell EncodingDetector to try if
328 byte-order mark sniffing fails (the user_encodings argument to the396 byte-order mark sniffing fails (the ``user_encodings`` argument to the
329 constructor).397 constructor).
330398
331 4. An encoding declared within the bytestring itself, either in an399 4. An encoding declared within the bytestring itself, either in an
332 XML declaration (if the bytestring is to be interpreted as an XML400 XML declaration (if the bytestring is to be interpreted as an XML
333 document), or in a <meta> tag (if the bytestring is to be401 document), or in a <meta> tag (if the bytestring is to be
334 interpreted as an HTML document.)402 interpreted as an HTML document.)
335403
336 5. An encoding detected through textual analysis by chardet,404 5. An encoding detected through textual analysis by chardet,
     cchardet, or a similar external library.
 
-    4. UTF-8.
+    6. UTF-8.
 
-    5. Windows-1252.
+    7. Windows-1252.
 
-    """
-    def __init__(self, markup, known_definite_encodings=None,
-                 is_html=False, exclude_encodings=None,
-                 user_encodings=None, override_encodings=None):
-        """Constructor.
-
-        :param markup: Some markup in an unknown encoding.
-
-        :param known_definite_encodings: When determining the encoding
-         of `markup`, these encodings will be tried first, in
-         order. In HTML terms, this corresponds to the "known
-         definite encoding" step defined here:
-         https://html.spec.whatwg.org/multipage/parsing.html#parsing-with-a-known-character-encoding
-
-        :param user_encodings: These encodings will be tried after the
-         `known_definite_encodings` have been tried and failed, and
-         after an attempt to sniff the encoding by looking at a
-         byte order mark has failed. In HTML terms, this
-         corresponds to the step "user has explicitly instructed
-         the user agent to override the document's character
-         encoding", defined here:
-         https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding
-
-        :param override_encodings: A deprecated alias for
-         known_definite_encodings. Any encodings here will be tried
-         immediately after the encodings in
-         known_definite_encodings.
-
-        :param is_html: If True, this markup is considered to be
-         HTML. Otherwise it's assumed to be XML.
-
-        :param exclude_encodings: These encodings will not be tried,
-         even if they otherwise would be.
-
-        """
+    :param markup: Some markup in an unknown encoding.
+
+    :param known_definite_encodings: When determining the encoding
+     of ``markup``, these encodings will be tried first, in
+     order. In HTML terms, this corresponds to the "known
+     definite encoding" step defined in `section 13.2.3.1 of the HTML standard <https://html.spec.whatwg.org/multipage/parsing.html#parsing-with-a-known-character-encoding>`_.
+
+    :param user_encodings: These encodings will be tried after the
+     ``known_definite_encodings`` have been tried and failed, and
+     after an attempt to sniff the encoding by looking at a
+     byte order mark has failed. In HTML terms, this
+     corresponds to the step "user has explicitly instructed
+     the user agent to override the document's character
+     encoding", defined in `section 13.2.3.2 of the HTML standard <https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding>`_.
+
+    :param override_encodings: A **deprecated** alias for
+     ``known_definite_encodings``. Any encodings here will be tried
+     immediately after the encodings in
+     ``known_definite_encodings``.
+
+    :param is_html: If True, this markup is considered to be
+     HTML. Otherwise it's assumed to be XML.
+
+    :param exclude_encodings: These encodings will not be tried,
+     even if they otherwise would be.
+
+    """
+    def __init__(self, markup:bytes,
+                 known_definite_encodings:Optional[_Encodings]=None,
+                 is_html:Optional[bool]=False,
+                 exclude_encodings:Optional[_Encodings]=None,
+                 user_encodings:Optional[_Encodings]=None,
+                 override_encodings:Optional[_Encodings]=None):
         self.known_definite_encodings = list(known_definite_encodings or [])
         if override_encodings:
+            warnings.warn(
+                "The 'override_encodings' argument was deprecated in 4.10.0. Use 'known_definite_encodings' instead.",
+                DeprecationWarning,
+                stacklevel=3
+            )
             self.known_definite_encodings += override_encodings
         self.user_encodings = user_encodings or []
         exclude_encodings = exclude_encodings or []
         self.exclude_encodings = set([x.lower() for x in exclude_encodings])
         self.chardet_encoding = None
-        self.is_html = is_html
-        self.declared_encoding = None
+        self.is_html = False if is_html is None else is_html
+        self.declared_encoding: Optional[str] = None
 
         # First order of business: strip a byte-order mark.
         self.markup, self.sniffed_encoding = self.strip_byte_order_mark(markup)
 
-    def _usable(self, encoding, tried):
+    known_definite_encodings:_Encodings
+    user_encodings:_Encodings
+    exclude_encodings:_Encodings
+    chardet_encoding:Optional[_Encoding]
+    is_html:bool
+    declared_encoding:Optional[_Encoding]
+    markup:bytes
+    sniffed_encoding:Optional[_Encoding]
+
+    def _usable(self, encoding:Optional[_Encoding], tried:Set[_Encoding]) -> bool:
         """Should we even bother to try this encoding?
 
         :param encoding: Name of an encoding.
-        :param tried: Encodings that have already been tried. This will be modified
-        as a side effect.
+        :param tried: Encodings that have already been tried. This
+            will be modified as a side effect.
         """
-        if encoding is not None:
-            encoding = encoding.lower()
-            if encoding in self.exclude_encodings:
-                return False
-            if encoding not in tried:
-                tried.add(encoding)
-                return True
+        if encoding is None:
+            return False
+        encoding = encoding.lower()
+        if encoding in self.exclude_encodings:
+            return False
+        if encoding not in tried:
+            tried.add(encoding)
+            return True
         return False
 
     @property
-    def encodings(self):
+    def encodings(self) -> Iterator[_Encoding]:
         """Yield a number of encodings that might work for this markup.
 
-        :yield: A sequence of strings.
+        :yield: A sequence of strings. Each is the name of an encoding
+            that *might* work to convert a bytestring into Unicode.
         """
-        tried = set()
+        tried:Set[_Encoding] = set()
 
         # First, try the known definite encodings
         for e in self.known_definite_encodings:
@@ -419,7 +501,9 @@ class EncodingDetector:
 
         # Did the document originally start with a byte-order mark
         # that indicated its encoding?
-        if self._usable(self.sniffed_encoding, tried):
+        if self.sniffed_encoding is not None and self._usable(
+            self.sniffed_encoding, tried
+        ):
             yield self.sniffed_encoding
 
         # Sniffing the byte-order mark did nothing; try the user
@@ -433,14 +517,18 @@ class EncodingDetector:
         if self.declared_encoding is None:
             self.declared_encoding = self.find_declared_encoding(
                 self.markup, self.is_html)
-        if self._usable(self.declared_encoding, tried):
+        if self.declared_encoding is not None and self._usable(
+            self.declared_encoding, tried
+        ):
             yield self.declared_encoding
 
         # Use third-party character set detection to guess at the
         # encoding.
         if self.chardet_encoding is None:
-            self.chardet_encoding = chardet_dammit(self.markup)
-        if self._usable(self.chardet_encoding, tried):
+            self.chardet_encoding = _chardet_dammit(self.markup)
+        if self.chardet_encoding is not None and self._usable(
+            self.chardet_encoding, tried
+        ):
             yield self.chardet_encoding
 
         # As a last-ditch effort, try utf-8 and windows-1252.
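[Editor's note: the candidate-ordering that these `encodings` hunks thread `None` checks through can be sketched standalone. This is an illustration, not the bs4 implementation; the function and parameter names (`candidate_encodings`, `chardet_guess`, etc.) are hypothetical, and the real property invokes chardet lazily.]

```python
from typing import Iterator, List, Optional, Set

def candidate_encodings(
        known_definite: List[str],
        sniffed: Optional[str],
        user: List[str],
        declared: Optional[str],
        chardet_guess: Optional[str],
        exclude: Optional[Set[str]] = None,
) -> Iterator[str]:
    """Yield encoding names in the priority order the diff establishes:
    known-definite, BOM-sniffed, user-specified, declared-in-document,
    chardet's guess, then the utf-8/windows-1252 fallbacks."""
    excluded = exclude or set()
    tried: Set[str] = set()

    def usable(encoding: str) -> bool:
        # Mirrors EncodingDetector._usable: normalize case, honor the
        # exclusion list, and never yield the same encoding twice.
        lowered = encoding.lower()
        if lowered in excluded or lowered in tried:
            return False
        tried.add(lowered)
        return True

    for group in (known_definite, [sniffed], user, [declared],
                  [chardet_guess], ["utf-8", "windows-1252"]):
        for e in group:
            if e is not None and usable(e):
                yield e
```

Calling `list(candidate_encodings(...))` shows the trial order directly, which is handy for reasoning about why a given document decodes the way it does.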
@@ -449,22 +537,24 @@ class EncodingDetector:
             yield e
 
     @classmethod
-    def strip_byte_order_mark(cls, data):
+    def strip_byte_order_mark(cls, data:bytes) -> Tuple[bytes, Optional[_Encoding]]:
         """If a byte-order mark is present, strip it and return the encoding it implies.
 
-        :param data: Some markup.
-        :return: A 2-tuple (modified data, implied encoding)
+        :param data: A bytestring that may or may not begin with a
+            byte-order mark.
+
+        :return: A 2-tuple (data stripped of byte-order mark, encoding implied by byte-order mark)
         """
         encoding = None
         if isinstance(data, str):
             # Unicode data cannot have a byte-order mark.
             return data, encoding
         if (len(data) >= 4) and (data[:2] == b'\xfe\xff') \
-               and (data[2:4] != '\x00\x00'):
+               and (data[2:4] != b'\x00\x00'):
             encoding = 'utf-16be'
             data = data[2:]
         elif (len(data) >= 4) and (data[:2] == b'\xff\xfe') \
-               and (data[2:4] != '\x00\x00'):
+               and (data[2:4] != b'\x00\x00'):
             encoding = 'utf-16le'
             data = data[2:]
         elif data[:3] == b'\xef\xbb\xbf':
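[Editor's note: the `b'\x00\x00'` change in this hunk is a real bugfix, not just a type annotation: in Python 3, comparing a `bytes` slice against the `str` `'\x00\x00'` is always unequal, so the guard that distinguishes a UTF-16 BOM from a UTF-32 one never fired. A standalone sketch of just the branches visible in this excerpt (the full method may handle more cases):]

```python
from typing import Optional, Tuple

def strip_bom(data: bytes) -> Tuple[bytes, Optional[str]]:
    """Strip a leading byte-order mark and report the encoding it implies.
    The != b'\\x00\\x00' checks keep a UTF-32 BOM (e.g. FF FE 00 00)
    from being misread as a UTF-16 one."""
    if len(data) >= 4 and data[:2] == b'\xfe\xff' and data[2:4] != b'\x00\x00':
        return data[2:], 'utf-16be'
    if len(data) >= 4 and data[:2] == b'\xff\xfe' and data[2:4] != b'\x00\x00':
        return data[2:], 'utf-16le'
    if data[:3] == b'\xef\xbb\xbf':
        return data[3:], 'utf-8'
    return data, None
```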
@@ -479,8 +569,9 @@ class EncodingDetector:
         return data, encoding
 
     @classmethod
-    def find_declared_encoding(cls, markup, is_html=False, search_entire_document=False):
-        """Given a document, tries to find its declared encoding.
+    def find_declared_encoding(cls, markup:Union[bytes,str], is_html:bool=False, search_entire_document:bool=False) -> Optional[_Encoding]:
+        """Given a document, tries to find an encoding declared within the
+        text of the document itself.
 
         An XML encoding is declared at the beginning of the document.
 
@@ -490,9 +581,12 @@ class EncodingDetector:
         :param markup: Some markup.
         :param is_html: If True, this markup is considered to be HTML. Otherwise
            it's assumed to be XML.
-        :param search_entire_document: Since an encoding is supposed to declared near the beginning
-           of the document, most of the time it's only necessary to search a few kilobytes of data.
-           Set this to True to force this method to search the entire document.
+        :param search_entire_document: Since an encoding is supposed
+           to declared near the beginning of the document, most of
+           the time it's only necessary to search a few kilobytes of
+           data. Set this to True to force this method to search the
+           entire document.
+        :return: The declared encoding, if one is found.
         """
         if search_entire_document:
             xml_endpos = html_endpos = len(markup)
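[Editor's note: a minimal sketch of the declared-encoding search described here, assuming illustrative regexes — bs4's actual patterns are more thorough and also handle `<meta http-equiv>` declarations:]

```python
import re
from typing import Optional

# Illustrative patterns only; the real EncodingDetector regexes
# are stricter and cover more declaration styles.
_xml_decl = re.compile(rb'<\?xml[^>]+encoding=["\']?([A-Za-z0-9][A-Za-z0-9._-]*)')
_html_meta = re.compile(rb'<meta[^>]+charset=["\']?([A-Za-z0-9][A-Za-z0-9._-]*)', re.I)

def declared_encoding(markup: bytes, is_html: bool = False,
                      search_entire_document: bool = False) -> Optional[str]:
    # A declaration should appear near the start, so only a small
    # prefix is searched unless the caller insists otherwise.
    endpos = len(markup) if search_entire_document else 1024
    pattern = _html_meta if is_html else _xml_decl
    m = pattern.search(markup, 0, endpos)
    return m.group(1).decode('ascii').lower() if m else None
```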
@@ -520,74 +614,69 @@ class EncodingDetector:
         return None
 
 class UnicodeDammit:
-    """A class for detecting the encoding of a *ML document and
-    converting it to a Unicode string. If the source encoding is
-    windows-1252, can replace MS smart quotes with their HTML or XML
-    equivalents."""
+    """A class for detecting the encoding of a bytestring containing an
+    HTML or XML document, and decoding it to Unicode. If the source
+    encoding is windows-1252, `UnicodeDammit` can also replace
+    Microsoft smart quotes with their HTML or XML equivalents.
 
-    # This dictionary maps commonly seen values for "charset" in HTML
-    # meta tags to the corresponding Python codec names. It only covers
-    # values that aren't in Python's aliases and can't be determined
-    # by the heuristics in find_codec.
-    CHARSET_ALIASES = {"macintosh": "mac-roman",
-                       "x-sjis": "shift-jis"}
-
-    ENCODINGS_WITH_SMART_QUOTES = [
-        "windows-1252",
-        "iso-8859-1",
-        "iso-8859-2",
-    ]
+    :param markup: HTML or XML markup in an unknown encoding.
+
+    :param known_definite_encodings: When determining the encoding
+     of ``markup``, these encodings will be tried first, in
+     order. In HTML terms, this corresponds to the "known
+     definite encoding" step defined in `section 13.2.3.1 of the HTML standard <https://html.spec.whatwg.org/multipage/parsing.html#parsing-with-a-known-character-encoding>`_.
+
+    :param user_encodings: These encodings will be tried after the
+     ``known_definite_encodings`` have been tried and failed, and
+     after an attempt to sniff the encoding by looking at a
+     byte order mark has failed. In HTML terms, this
+     corresponds to the step "user has explicitly instructed
+     the user agent to override the document's character
+     encoding", defined in `section 13.2.3.2 of the HTML standard <https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding>`_.
+
+    :param override_encodings: A **deprecated** alias for
+     ``known_definite_encodings``. Any encodings here will be tried
+     immediately after the encodings in
+     ``known_definite_encodings``.
+
+    :param smart_quotes_to: By default, Microsoft smart quotes will,
+     like all other characters, be converted to Unicode
+     characters. Setting this to ``ascii`` will convert them to ASCII
+     quotes instead. Setting it to ``xml`` will convert them to XML
+     entity references, and setting it to ``html`` will convert them
+     to HTML entity references.
+
+    :param is_html: If True, ``markup`` is treated as an HTML
+     document. Otherwise it's treated as an XML document.
+
+    :param exclude_encodings: These encodings will not be considered,
+     even if the sniffing code thinks they might make sense.
 
-    def __init__(self, markup, known_definite_encodings=[],
-                 smart_quotes_to=None, is_html=False, exclude_encodings=[],
-                 user_encodings=None, override_encodings=None
+    """
+    def __init__(
+            self, markup:bytes,
+            known_definite_encodings:Optional[_Encodings]=[],
+            # TODO PYTHON 3.8 Literal is added to the typing module
+            #
+            # smart_quotes_to: Literal["ascii", "xml", "html"] | None = None,
+            smart_quotes_to: Optional[str] = None,
+            is_html: bool = False,
+            exclude_encodings:Optional[_Encodings] = [],
+            user_encodings:Optional[_Encodings] = None,
+            override_encodings:Optional[_Encodings] = None
     ):
-        """Constructor.
-
-        :param markup: A bytestring representing markup in an unknown encoding.
-
-        :param known_definite_encodings: When determining the encoding
-         of `markup`, these encodings will be tried first, in
-         order. In HTML terms, this corresponds to the "known
-         definite encoding" step defined here:
-         https://html.spec.whatwg.org/multipage/parsing.html#parsing-with-a-known-character-encoding
-
-        :param user_encodings: These encodings will be tried after the
-         `known_definite_encodings` have been tried and failed, and
-         after an attempt to sniff the encoding by looking at a
-         byte order mark has failed. In HTML terms, this
-         corresponds to the step "user has explicitly instructed
-         the user agent to override the document's character
-         encoding", defined here:
-         https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding
-
-        :param override_encodings: A deprecated alias for
-         known_definite_encodings. Any encodings here will be tried
-         immediately after the encodings in
-         known_definite_encodings.
-
-        :param smart_quotes_to: By default, Microsoft smart quotes will, like all other characters, be converted
-           to Unicode characters. Setting this to 'ascii' will convert them to ASCII quotes instead.
-           Setting it to 'xml' will convert them to XML entity references, and setting it to 'html'
-           will convert them to HTML entity references.
-        :param is_html: If True, this markup is considered to be HTML. Otherwise
-           it's assumed to be XML.
-        :param exclude_encodings: These encodings will not be considered, even
-           if the sniffing code thinks they might make sense.
-
-        """
         self.smart_quotes_to = smart_quotes_to
         self.tried_encodings = []
         self.contains_replacement_characters = False
         self.is_html = is_html
-        self.log = logging.getLogger(__name__)
+        self.log = getLogger(__name__)
         self.detector = EncodingDetector(
             markup, known_definite_encodings, is_html, exclude_encodings,
             user_encodings, override_encodings
         )
 
         # Short-circuit if the data is in Unicode to begin with.
-        if isinstance(markup, str) or markup == '':
+        if isinstance(markup, str) or markup == b'':
             self.markup = markup
             self.unicode_markup = str(markup)
             self.original_encoding = None
@@ -616,41 +705,117 @@ class UnicodeDammit:
                     "Some characters could not be decoded, and were "
                     "replaced with REPLACEMENT CHARACTER."
                 )
+
                 self.contains_replacement_characters = True
                 break
 
         # If none of that worked, we could at this point force it to
         # ASCII, but that would destroy so much data that I think
         # giving up is better.
-        self.unicode_markup = u
-        if not u:
+        #
+        # Note that this is extremely unlikely, probably impossible,
+        # because the "replace" strategy is so powerful. Even running
+        # the Python binary through Unicode, Dammit gives you Unicode,
+        # albeit Unicode riddled with REPLACEMENT CHARACTER.
+        if u is None:
             self.original_encoding = None
+            self.unicode_markup = None
+        else:
+            self.unicode_markup = u
+
+    #: The original markup, before it was converted to Unicode.
+    #: This is not necessarily the same as what was passed in to the
+    #: constructor, since any byte-order mark will be stripped.
+    markup:bytes
 
-    def _sub_ms_char(self, match):
+    #: The Unicode version of the markup, following conversion. This
+    #: is set to `None` if there was simply no way to convert the
+    #: bytestring to Unicode (as with binary data).
+    unicode_markup:Optional[str]
+
+    #: This is True if `UnicodeDammit.unicode_markup` contains
+    #: U+FFFD REPLACEMENT_CHARACTER characters which were not present
+    #: in `UnicodeDammit.markup`. These mark character sequences that
+    #: could not be represented in Unicode.
+    contains_replacement_characters: bool
+
+    #: Unicode, Dammit's best guess as to the original character
+    #: encoding of `UnicodeDammit.markup`.
+    original_encoding:Optional[_Encoding]
+
+    #: The strategy used to handle Microsoft smart quotes.
+    smart_quotes_to: Optional[str]
+
+    #: The (encoding, error handling strategy) 2-tuples that were used to
+    #: try and convert the markup to Unicode.
+    tried_encodings: List[Tuple[_Encoding, str]]
+
+    log: Logger  #: :meta private:
+
+    def _sub_ms_char(self, match:re.Match[bytes]) -> bytes:
         """Changes a MS smart quote character to an XML or HTML
-        entity, or an ASCII character."""
-        orig = match.group(1)
+        entity, or an ASCII character.
+
+        TODO: Since this is only used to convert smart quotes, it
+        could be simplified, and MS_CHARS_TO_ASCII made much less
+        parochial.
+        """
+        orig: bytes = match.group(1)
+        sub: bytes
         if self.smart_quotes_to == 'ascii':
-            sub = self.MS_CHARS_TO_ASCII.get(orig).encode()
+            if orig in self.MS_CHARS_TO_ASCII:
+                sub = self.MS_CHARS_TO_ASCII[orig].encode()
+            else:
+                # Shouldn't happen; substitute the character
+                # with itself.
+                sub = orig
         else:
-            sub = self.MS_CHARS.get(orig)
-            if type(sub) == tuple:
-                if self.smart_quotes_to == 'xml':
-                    sub = '&#x'.encode() + sub[1].encode() + ';'.encode()
-                else:
-                    sub = '&'.encode() + sub[0].encode() + ';'.encode()
+            if orig in self.MS_CHARS:
+                substitutions = self.MS_CHARS[orig]
+                if type(substitutions) == tuple:
+                    if self.smart_quotes_to == 'xml':
+                        sub = b'&#x' + substitutions[1].encode() + b';'
+                    else:
+                        sub = b'&' + substitutions[0].encode() + b';'
+                else:
+                    substitutions = cast(str, substitutions)
+                    sub = substitutions.encode()
             else:
-                sub = sub.encode()
+                # Shouldn't happen; substitute the character
+                # for itself.
+                sub = orig
         return sub
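[Editor's note: the smart-quote substitution rewritten in this hunk can be exercised standalone. A minimal sketch covering only the four common quote bytes — the real `MS_CHARS` table maps the whole 0x80-0x9f range:]

```python
import re
from typing import Dict, Tuple

# A tiny subset of the MS_CHARS table from the diff:
# Windows-1252 byte -> (HTML entity name, XML hex code point).
SMART_QUOTES: Dict[bytes, Tuple[str, str]] = {
    b'\x91': ('lsquo', '2018'),
    b'\x92': ('rsquo', '2019'),
    b'\x93': ('ldquo', '201C'),
    b'\x94': ('rdquo', '201D'),
}

def sub_smart_quotes(data: bytes, smart_quotes_to: str) -> bytes:
    """Replace Windows-1252 smart-quote bytes with entity references,
    the way _sub_ms_char does for smart_quotes_to='xml' or 'html'."""
    def _sub(match: "re.Match[bytes]") -> bytes:
        name, codepoint = SMART_QUOTES[match.group(0)]
        if smart_quotes_to == 'xml':
            return b'&#x' + codepoint.encode() + b';'
        return b'&' + name.encode() + b';'
    return re.sub(b'([\x91-\x94])', _sub, data)
```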
 
+    #: This dictionary maps commonly seen values for "charset" in HTML
+    #: meta tags to the corresponding Python codec names. It only covers
+    #: values that aren't in Python's aliases and can't be determined
+    #: by the heuristics in `find_codec`.
+    #:
+    #: :meta hide-value:
+    CHARSET_ALIASES: Dict[str, _Encoding] = {"macintosh": "mac-roman",
+                                             "x-sjis": "shift-jis"}
+
+    #: A list of encodings that tend to contain Microsoft smart quotes.
+    #:
+    #: :meta hide-value:
+    ENCODINGS_WITH_SMART_QUOTES: _Encodings = [
+        "windows-1252",
+        "iso-8859-1",
+        "iso-8859-2",
+    ]
 
-    def _convert_from(self, proposed, errors="strict"):
+    def _convert_from(self, proposed:_Encoding, errors:str="strict") -> Optional[str]:
         """Attempt to convert the markup to the proposed encoding.
 
         :param proposed: The name of a character encoding.
+        :param errors: An error handling strategy, used when calling `str`.
+        :return: The converted markup, or `None` if the proposed
+            encoding/error handling strategy didn't work.
         """
-        proposed = self.find_codec(proposed)
-        if not proposed or (proposed, errors) in self.tried_encodings:
+        lookup_result = self.find_codec(proposed)
+        if lookup_result is None or (lookup_result, errors) in self.tried_encodings:
             return None
+        proposed = lookup_result
         self.tried_encodings.append((proposed, errors))
         markup = self.markup
         # Convert smart quotes to HTML if coming from an encoding
665 #print("Trying to convert document to %s (errors=%s)" % (830 #print("Trying to convert document to %s (errors=%s)" % (
666 # proposed, errors))831 # proposed, errors))
667 u = self._to_unicode(markup, proposed, errors)832 u = self._to_unicode(markup, proposed, errors)
668 self.markup = u833 self.unicode_markup = u
669 self.original_encoding = proposed834 self.original_encoding = proposed
670 except Exception as e:835 except Exception as e:
671 #print("That didn't work!")836 #print("That didn't work!")
672 #print(e)837 #print(e)
673 return None838 return None
674 #print("Correct encoding: %s" % proposed)839 #print("Correct encoding: %s" % proposed)
675 return self.markup840 return self.unicode_markup
676841
677 def _to_unicode(self, data, encoding, errors="strict"):842 def _to_unicode(self, data:bytes, encoding:_Encoding, errors:str="strict") -> str:
678 """Given a string and its encoding, decodes the string into Unicode.843 """Given a bytestring and its encoding, decodes the string into Unicode.
679844
680 :param encoding: The name of an encoding.845 :param encoding: The name of an encoding.
846 :param errors: An error handling strategy, used when calling `str`.
681 """847 """
682 return str(data, encoding, errors)848 return str(data, encoding, errors)
683849
684 @property850 @property
685 def declared_html_encoding(self):851 def declared_html_encoding(self) -> Optional[str]:
686 """If the markup is an HTML document, returns the encoding declared _within_852 """If the markup is an HTML document, returns the encoding, if any,
687 the document.853 declared *inside* the document.
688 """854 """
689 if not self.is_html:855 if not self.is_html:
690 return None856 return None
691 return self.detector.declared_encoding857 return self.detector.declared_encoding
692858
693 def find_codec(self, charset):859 def find_codec(self, charset:_Encoding) -> Optional[str]:
694 """Convert the name of a character set to a codec name.860 """Look up the Python codec corresponding to a given character set.
695861
696 :param charset: The name of a character set.862 :param charset: The name of a character set.
697 :return: The name of a codec.863 :return: The name of a Python codec.
698 """864 """
699 value = (self._codec(self.CHARSET_ALIASES.get(charset, charset))865 value = (self._codec(self.CHARSET_ALIASES.get(charset, charset))
700 or (charset and self._codec(charset.replace("-", "")))866 or (charset and self._codec(charset.replace("-", "")))
@@ -706,7 +872,7 @@ class UnicodeDammit:
706 return value.lower()872 return value.lower()
707 return None873 return None
708874
709 def _codec(self, charset):875 def _codec(self, charset:_Encoding) -> Optional[str]:
710 if not charset:876 if not charset:
711 return charset877 return charset
712 codec = None878 codec = None
@@ -718,8 +884,11 @@ class UnicodeDammit:
718 return codec884 return codec
719885
720886
721 # A partial mapping of ISO-Latin-1 to HTML entities/XML numeric entities.887 #: A partial mapping of ISO-Latin-1 to HTML entities/XML numeric entities.
722 MS_CHARS = {b'\x80': ('euro', '20AC'),888 #:
889 #: :meta hide-value:
890 MS_CHARS: Dict[bytes, Union[str, Tuple[str, str]]] = {
891 b'\x80': ('euro', '20AC'),
723 b'\x81': ' ',892 b'\x81': ' ',
724 b'\x82': ('sbquo', '201A'),893 b'\x82': ('sbquo', '201A'),
725 b'\x83': ('fnof', '192'),894 b'\x83': ('fnof', '192'),
@@ -752,10 +921,15 @@ class UnicodeDammit:
752 b'\x9e': ('#x17E', '17E'),921 b'\x9e': ('#x17E', '17E'),
753 b'\x9f': ('Yuml', ''),}922 b'\x9f': ('Yuml', ''),}
754923
755 # A parochial partial mapping of ISO-Latin-1 to ASCII. Contains924 #: A parochial partial mapping of ISO-Latin-1 to ASCII. Contains
756 # horrors like stripping diacritical marks to turn á into a, but also925 #: horrors like stripping diacritical marks to turn á into a, but also
757 # contains non-horrors like turning “ into ".926 #: contains non-horrors like turning “ into ".
758 MS_CHARS_TO_ASCII = {927 #:
928 #: Seriously, don't use this for anything other than removing smart
929 #: quotes.
930 #:
931 #: :meta private:
932 MS_CHARS_TO_ASCII: Dict[bytes, str] = {
759 b'\x80' : 'EUR',933 b'\x80' : 'EUR',
760 b'\x81' : ' ',934 b'\x81' : ' ',
761 b'\x82' : ',',935 b'\x82' : ',',
@@ -809,7 +983,7 @@ class UnicodeDammit:
809 b'\xb1' : '+-',983 b'\xb1' : '+-',
810 b'\xb2' : '2',984 b'\xb2' : '2',
811 b'\xb3' : '3',985 b'\xb3' : '3',
812 b'\xb4' : ("'", 'acute'),986 b'\xb4' : "'",
813 b'\xb5' : 'u',987 b'\xb5' : 'u',
814 b'\xb6' : 'P',988 b'\xb6' : 'P',
815 b'\xb7' : '*',989 b'\xb7' : '*',
@@ -887,12 +1061,14 @@ class UnicodeDammit:
887 b'\xff' : 'y',1061 b'\xff' : 'y',
888 }1062 }
8891063
890 # A map used when removing rogue Windows-1252/ISO-8859-11064 #: A map used when removing rogue Windows-1252/ISO-8859-1
891 # characters in otherwise UTF-8 documents.1065 #: characters in otherwise UTF-8 documents.
892 #1066 #:
893 # Note that \x81, \x8d, \x8f, \x90, and \x9d are undefined in1067 #: Note that \\x81, \\x8d, \\x8f, \\x90, and \\x9d are undefined in
894 # Windows-1252.1068 #: Windows-1252.
895 WINDOWS_1252_TO_UTF8 = {1069 #:
1070 #: :meta hide-value:
1071 WINDOWS_1252_TO_UTF8: Dict[int, bytes] = {
896 0x80 : b'\xe2\x82\xac', # €1072 0x80 : b'\xe2\x82\xac', # €
897 0x82 : b'\xe2\x80\x9a', # ‚1073 0x82 : b'\xe2\x80\x9a', # ‚
898 0x83 : b'\xc6\x92', # Æ’1074 0x83 : b'\xc6\x92', # Æ’
@@ -1017,33 +1193,37 @@ class UnicodeDammit:
1017 0xfe : b'\xc3\xbe', # þ1193 0xfe : b'\xc3\xbe', # þ
1018 }1194 }
10191195
1020 MULTIBYTE_MARKERS_AND_SIZES = [1196 #: :meta private:
1197 MULTIBYTE_MARKERS_AND_SIZES:List[Tuple[int, int, int]] = [
1021 (0xc2, 0xdf, 2), # 2-byte characters start with a byte C2-DF1198 (0xc2, 0xdf, 2), # 2-byte characters start with a byte C2-DF
1022 (0xe0, 0xef, 3), # 3-byte characters start with E0-EF1199 (0xe0, 0xef, 3), # 3-byte characters start with E0-EF
1023 (0xf0, 0xf4, 4), # 4-byte characters start with F0-F41200 (0xf0, 0xf4, 4), # 4-byte characters start with F0-F4
1024 ]1201 ]
10251202
1026 FIRST_MULTIBYTE_MARKER = MULTIBYTE_MARKERS_AND_SIZES[0][0]1203 #: :meta private:
1027 LAST_MULTIBYTE_MARKER = MULTIBYTE_MARKERS_AND_SIZES[-1][1]1204 FIRST_MULTIBYTE_MARKER:int = MULTIBYTE_MARKERS_AND_SIZES[0][0]
1205
1206 #: :meta private:
1207 LAST_MULTIBYTE_MARKER:int = MULTIBYTE_MARKERS_AND_SIZES[-1][1]
10281208
1029 @classmethod1209 @classmethod
1030 def detwingle(cls, in_bytes, main_encoding="utf8",1210 def detwingle(cls, in_bytes:bytes, main_encoding:_Encoding="utf8",
1031 embedded_encoding="windows-1252"):1211 embedded_encoding:_Encoding="windows-1252") -> bytes:
1032 """Fix characters from one encoding embedded in some other encoding.1212 """Fix characters from one encoding embedded in some other encoding.
10331213
1034 Currently the only situation supported is Windows-1252 (or its1214 Currently the only situation supported is Windows-1252 (or its
1035 subset ISO-8859-1), embedded in UTF-8.1215 subset ISO-8859-1), embedded in UTF-8.
10361216
1037 :param in_bytes: A bytestring that you suspect contains1217 :param in_bytes: A bytestring that you suspect contains
1038 characters from multiple encodings. Note that this _must_1218 characters from multiple encodings. Note that this *must*
1039 be a bytestring. If you've already converted the document1219 be a bytestring. If you've already converted the document
1040 to Unicode, you're too late.1220 to Unicode, you're too late.
1041 :param main_encoding: The primary encoding of `in_bytes`.1221 :param main_encoding: The primary encoding of ``in_bytes``.
1042 :param embedded_encoding: The encoding that was used to embed characters1222 :param embedded_encoding: The encoding that was used to embed characters
1043 in the main document.1223 in the main document.
1044 :return: A bytestring in which `embedded_encoding`1224 :return: A bytestring similar to ``in_bytes``, in which
1045 characters have been converted to their `main_encoding`1225 ``embedded_encoding`` characters have been converted to
1046 equivalents.1226 their ``main_encoding`` equivalents.
1047 """1227 """
1048 if embedded_encoding.replace('_', '-').lower() not in (1228 if embedded_encoding.replace('_', '-').lower() not in (
1049 'windows-1252', 'windows_1252'):1229 'windows-1252', 'windows_1252'):
@@ -1061,9 +1241,6 @@ class UnicodeDammit:
         pos = 0
         while pos < len(in_bytes):
             byte = in_bytes[pos]
-            if not isinstance(byte, int):
-                # Python 2.x
-                byte = ord(byte)
             if (byte >= cls.FIRST_MULTIBYTE_MARKER
                 and byte <= cls.LAST_MULTIBYTE_MARKER):
                 # This is the start of a UTF-8 multibyte character. Skip
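[Editor's note: the detwingle scanning loop shown in this hunk can be sketched standalone. This simplified version uses only a four-entry subset of `WINDOWS_1252_TO_UTF8` and does not validate UTF-8 continuation bytes, so it is an illustration of the technique, not a replacement for `UnicodeDammit.detwingle`:]

```python
from typing import Dict

# A small subset of the WINDOWS_1252_TO_UTF8 table from the diff.
W1252_TO_UTF8: Dict[int, bytes] = {
    0x91: b'\xe2\x80\x98',  # left single quote
    0x92: b'\xe2\x80\x99',  # right single quote
    0x93: b'\xe2\x80\x9c',  # left double quote
    0x94: b'\xe2\x80\x9d',  # right double quote
}

def detwingle(in_bytes: bytes) -> bytes:
    """Scan a mostly-UTF-8 bytestring, passing well-formed UTF-8
    multibyte sequences through and re-encoding stray Windows-1252
    smart-quote bytes as UTF-8."""
    out = bytearray()
    pos = 0
    while pos < len(in_bytes):
        byte = in_bytes[pos]
        if 0xC2 <= byte <= 0xDF:
            size = 2   # 2-byte UTF-8 sequence starts with C2-DF
        elif 0xE0 <= byte <= 0xEF:
            size = 3   # 3-byte UTF-8 sequence starts with E0-EF
        elif 0xF0 <= byte <= 0xF4:
            size = 4   # 4-byte UTF-8 sequence starts with F0-F4
        else:
            size = 1
        if size > 1:
            out += in_bytes[pos:pos + size]   # UTF-8: copy untouched
        elif byte in W1252_TO_UTF8:
            out += W1252_TO_UTF8[byte]        # rogue Windows-1252 byte
        else:
            out.append(byte)
        pos += size
    return bytes(out)
```

The result decodes cleanly as UTF-8 even though the input mixed two encodings, which is the whole point of detwingling.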
diff --git a/bs4/diagnose.py b/bs4/diagnose.py
index e079772..201b879 100644
--- a/bs4/diagnose.py
+++ b/bs4/diagnose.py
@@ -7,8 +7,11 @@ import cProfile
 from io import BytesIO
 from html.parser import HTMLParser
 import bs4
 from bs4 import BeautifulSoup, __version__
 from bs4.builder import builder_registry
+from typing import TYPE_CHECKING
+if TYPE_CHECKING:
+    from bs4._typing import _IncomingMarkup
 
 import os
 import pstats
@@ -19,10 +22,10 @@ import traceback
 import sys
 import cProfile
 
-def diagnose(data):
+def diagnose(data:_IncomingMarkup) -> None:
     """Diagnostic suite for isolating common problems.
 
-    :param data: A string containing markup that needs to be explained.
+    :param data: Some markup that needs to be explained.
     :return: None; diagnostics are printed to standard output.
     """
     print(("Diagnostic running on Beautiful Soup %s" % __version__))
@@ -75,7 +78,7 @@ def diagnose(data):
 
     print(("-" * 80))
 
-def lxml_trace(data, html=True, **kwargs):
+def lxml_trace(data, html:bool=True, **kwargs) -> None:
     """Print out the lxml events that occur during parsing.
 
     This lets you see how lxml parses a document when no Beautiful
@@ -109,7 +112,7 @@ class AnnouncingParser(HTMLParser):
         print(s)
 
     def handle_starttag(self, name, attrs):
-        self._p("%s START" % name)
+        self._p(f"{name} {attrs} START")
 
     def handle_endtag(self, name):
         self._p("%s END" % name)
@@ -146,11 +149,14 @@ def htmlparser_trace(data):
     parser = AnnouncingParser()
     parser.feed(data)
 
-_vowels = "aeiou"
-_consonants = "bcdfghjklmnpqrstvwxyz"
+_vowels:str = "aeiou"
+_consonants:str = "bcdfghjklmnpqrstvwxyz"
 
-def rword(length=5):
-    "Generate a random word-like string."
+def rword(length:int=5) -> str:
+    """Generate a random word-like string.
+
+    :meta private:
+    """
     s = ''
     for i in range(length):
         if i % 2 == 0:
@@ -160,12 +166,18 @@ def rword(length=5):
             s += random.choice(t)
     return s
 
-def rsentence(length=4):
-    "Generate a random sentence-like string."
+def rsentence(length:int=4) -> str:
+    """Generate a random sentence-like string.
+
+    :meta private:
+    """
     return " ".join(rword(random.randint(4,9)) for i in range(length))
 
-def rdoc(num_elements=1000):
-    """Randomly generate an invalid HTML document."""
+def rdoc(num_elements:int=1000) -> str:
+    """Randomly generate an invalid HTML document.
+
+    :meta private:
+    """
     tag_names = ['p', 'div', 'span', 'i', 'b', 'script', 'table']
     elements = []
     for i in range(num_elements):
@@ -182,24 +194,24 @@ def rdoc(num_elements=1000):
         elements.append("</%s>" % tag_name)
     return "<html>" + "\n".join(elements) + "</html>"
 
-def benchmark_parsers(num_elements=100000):
+def benchmark_parsers(num_elements:int=100000) -> None:
     """Very basic head-to-head performance benchmark."""
     print(("Comparative parser benchmark on Beautiful Soup %s" % __version__))
     data = rdoc(num_elements)
     print(("Generated a large invalid HTML document (%d bytes)." % len(data)))
 
-    for parser in ["lxml", ["lxml", "html"], "html5lib", "html.parser"]:
+    for parser_name in ["lxml", ["lxml", "html"], "html5lib", "html.parser"]:
         success = False
         try:
             a = time.time()
-            soup = BeautifulSoup(data, parser)
+            soup = BeautifulSoup(data, parser_name)
             b = time.time()
             success = True
         except Exception as e:
-            print(("%s could not parse the markup." % parser))
+            print(("%s could not parse the markup." % parser_name))
             traceback.print_exc()
         if success:
-            print(("BS4+%s parsed the markup in %.2fs." % (parser, b-a)))
+            print(("BS4+%s parsed the markup in %.2fs." % (parser_name, b-a)))
 
     from lxml import etree
     a = time.time()
@@ -214,7 +226,7 @@ def benchmark_parsers(num_elements=100000):
     b = time.time()
     print(("Raw html5lib parsed the markup in %.2fs." % (b-a)))
 
-def profile(num_elements=100000, parser="lxml"):
+def profile(num_elements:int=100000, parser:str="lxml"):
     """Use Python's profiler on a randomly generated document."""
     filehandle = tempfile.NamedTemporaryFile()
     filename = filehandle.name
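The `benchmark_parsers()` change above keeps the existing timing pattern: bracket each parse with `time.time()` calls and report the delta. As a self-contained sanity check of that pattern using only the standard library (`CountingParser` and `time_parse` are illustrative names, not part of this patch):

```python
import time
from html.parser import HTMLParser

class CountingParser(HTMLParser):
    """Counts start tags, standing in for a real tree builder."""
    def __init__(self):
        super().__init__()
        self.starts = 0

    def handle_starttag(self, name, attrs):
        self.starts += 1

def time_parse(markup):
    """Parse markup once; return (start-tag count, seconds elapsed)."""
    parser = CountingParser()
    a = time.time()
    parser.feed(markup)
    parser.close()
    b = time.time()
    return parser.starts, b - a

count, elapsed = time_parse("<html>" + "<p>word</p>" * 1000 + "</html>")
```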
diff --git a/bs4/element.py b/bs4/element.py
index 0aefe73..8b3774e 100644
--- a/bs4/element.py
+++ b/bs4/element.py
@@ -1,55 +1,102 @@
+from __future__ import annotations
 # Use of this source code is governed by the MIT license.
 __license__ = "MIT"
 
-try:
-    from collections.abc import Callable # Python 3.6
-except ImportError as e:
-    from collections import Callable
 import re
 import sys
 import warnings
 
 from bs4.css import CSS
+from bs4._deprecation import (
+    _deprecated,
+    _deprecated_alias,
+    _deprecated_function_alias,
+)
 from bs4.formatter import (
     Formatter,
     HTMLFormatter,
     XMLFormatter,
 )
 
-DEFAULT_OUTPUT_ENCODING = "utf-8"
-
-nonwhitespace_re = re.compile(r"\S+")
-
-# NOTE: This isn't used as of 4.7.0. I'm leaving it for a little bit on
-# the off chance someone imported it for their own use.
-whitespace_re = re.compile(r"\s+")
+from typing import (
+    Any,
+    Callable,
+    Dict,
+    Generator,
+    Generic,
+    Iterable,
+    Iterator,
+    List,
+    Mapping,
+    Optional,
+    Pattern,
+    Sequence,
+    Set,
+    TYPE_CHECKING,
+    Tuple,
+    Type,
+    TypeVar,
+    Union,
+    cast,
+)
+from typing_extensions import Self
+if TYPE_CHECKING:
+    from bs4 import BeautifulSoup
+    from bs4.builder import TreeBuilder
+    from bs4.dammit import _Encoding
+    from bs4.formatter import (
+        _EntitySubstitutionFunction,
+        _FormatterOrName,
+    )
+    from bs4._typing import (
+        _AttributeValue,
+        _AttributeValues,
+        _StrainableElement,
+        _StrainableAttribute,
+        _StrainableAttributes,
+        _StrainableString,
+    )
+
+# Deprecated module-level attributes.
+# See https://peps.python.org/pep-0562/
+_deprecated_names = dict(
+    whitespace_re = 'The {name} attribute was deprecated in version 4.7.0. If you need it, make your own copy.'
+)
+#: :meta private:
+_deprecated_whitespace_re: Pattern[str] = re.compile(r"\s+")
 
-def _alias(attr):
-    """Alias one attribute name to another for backward compatibility"""
-    @property
-    def alias(self):
-        return getattr(self, attr)
-
-    @alias.setter
-    def alias(self):
-        return setattr(self, attr)
-    return alias
+def __getattr__(name):
+    if name in _deprecated_names:
+        message = _deprecated_names[name]
+        warnings.warn(
+            message.format(name=name),
+            DeprecationWarning, stacklevel=2
+        )
+        return globals()[f"_deprecated_{name}"]
+    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
 
-
-# These encodings are recognized by Python (so PageElement.encode
-# could theoretically support them) but XML and HTML don't recognize
-# them (so they should not show up in an XML or HTML document as that
-# document's encoding).
-#
-# If an XML document is encoded in one of these encodings, no encoding
-# will be mentioned in the XML declaration. If an HTML document is
-# encoded in one of these encodings, and the HTML document has a
-# <meta> tag that mentions an encoding, the encoding will be given as
-# the empty string.
-#
-# Source:
-# https://docs.python.org/3/library/codecs.html#python-specific-encodings
-PYTHON_SPECIFIC_ENCODINGS = set([
+#: Documents output by Beautiful Soup will be encoded with
+#: this encoding unless you specify otherwise.
+DEFAULT_OUTPUT_ENCODING:str = "utf-8"
+
+#: A regular expression that can be used to split on whitespace.
+nonwhitespace_re: Pattern[str] = re.compile(r"\S+")
+
+#: These encodings are recognized by Python (so `Tag.encode`
+#: could theoretically support them) but XML and HTML don't recognize
+#: them (so they should not show up in an XML or HTML document as that
+#: document's encoding).
+#:
+#: If an XML document is encoded in one of these encodings, no encoding
+#: will be mentioned in the XML declaration. If an HTML document is
+#: encoded in one of these encodings, and the HTML document has a
+#: <meta> tag that mentions an encoding, the encoding will be given as
+#: the empty string.
+#:
+#: Source:
+#: Python documentation, `Python Specific Encodings <https://docs.python.org/3/library/codecs.html#python-specific-encodings>`_
+PYTHON_SPECIFIC_ENCODINGS: Set[_Encoding] = set([
     "idna",
     "mbcs",
     "oem",
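The module-level `__getattr__` added in this hunk follows PEP 562: accessing a deprecated module attribute emits a `DeprecationWarning` while still returning the renamed object. A minimal standalone sketch of the same pattern, using a throwaway module object rather than `bs4.element` itself (all names here are hypothetical):

```python
import re
import types
import warnings

# Build a throwaway module demonstrating the PEP 562 pattern: a
# module-level __getattr__ warns when a deprecated name is accessed,
# then hands back the renamed object.
mod = types.ModuleType("demo")
mod._deprecated_names = {
    "whitespace_re": "The {name} attribute was deprecated in version 4.7.0.",
}
mod._deprecated_whitespace_re = re.compile(r"\s+")

def _module_getattr(name):
    if name in mod._deprecated_names:
        warnings.warn(
            mod._deprecated_names[name].format(name=name),
            DeprecationWarning, stacklevel=2,
        )
        return getattr(mod, f"_deprecated_{name}")
    raise AttributeError(f"module 'demo' has no attribute {name!r}")

# Attribute lookup on a module falls back to a __getattr__ found in the
# module's namespace (PEP 562).
mod.__getattr__ = _module_getattr

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    pattern = mod.whitespace_re   # old name still resolves, with a warning
```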
@@ -66,11 +113,17 @@ PYTHON_SPECIFIC_ENCODINGS = set([
 
 
 class NamespacedAttribute(str):
-    """A namespaced string (e.g. 'xml:lang') that remembers the namespace
-    ('xml') and the name ('lang') that were used to create it.
+    """A namespaced attribute (e.g. the 'xml:lang' in 'xml:lang="en"')
+    which remembers the namespace prefix ('xml') and the name ('lang')
+    that were used to create it.
     """
 
-    def __new__(cls, prefix, name=None, namespace=None):
+    prefix: Optional[str]
+    name: Optional[str]
+    namespace: Optional[str]
+
+    def __new__(cls, prefix:Optional[str],
+                name:Optional[str]=None, namespace:Optional[str]=None):
         if not name:
             # This is the default namespace. Its name "has no value"
             # per https://www.w3.org/TR/xml-names/#defaulting
@@ -89,72 +142,126 @@ class NamespacedAttribute(str):
         return obj
 
 class AttributeValueWithCharsetSubstitution(str):
-    """A stand-in object for a character encoding specified in HTML."""
-
-class CharsetMetaAttributeValue(AttributeValueWithCharsetSubstitution):
-    """A generic stand-in for the value of a meta tag's 'charset' attribute.
-
-    When Beautiful Soup parses the markup '<meta charset="utf8">', the
-    value of the 'charset' attribute will be one of these objects.
-    """
+    """An abstract class standing in for a character encoding specified
+    inside an HTML ``<meta>`` tag.
+
+    Subclasses exist for each place such a character encoding might be
+    found: either inside the ``charset`` attribute
+    (`CharsetMetaAttributeValue`) or inside the ``content`` attribute
+    (`ContentMetaAttributeValue`)
+
+    This allows Beautiful Soup to replace that part of the HTML file
+    with a different encoding when ouputting a tree as a string.
+    """
+    # The original, un-encoded value of the ``content`` attribute.
+    #: :meta private:
+    original_value: str
+
+    def substitute_encoding(self, eventual_encoding:str) -> str:
+        """Do whatever's necessary in this implementation-specific
+        portion an HTML document to substitute in a specific encoding.
+        """
+        raise NotImplementedError()
+
+class CharsetMetaAttributeValue(AttributeValueWithCharsetSubstitution):
+    """A generic stand-in for the value of a ``<meta>`` tag's ``charset``
+    attribute.
+
+    When Beautiful Soup parses the markup ``<meta charset="utf8">``, the
+    value of the ``charset`` attribute will become one of these objects.
 
-    def __new__(cls, original_value):
+    If the document is later encoded to an encoding other than UTF-8, its
+    ``<meta>`` tag will mention the new encoding instead of ``utf8``.
+    """
+    def __new__(cls, original_value:str) -> Self:
+        # We don't need to use the original value for anything, but
+        # it might be useful for the user to know.
         obj = str.__new__(cls, original_value)
         obj.original_value = original_value
         return obj
 
-    def encode(self, encoding):
+    def substitute_encoding(self, eventual_encoding:_Encoding="utf-8") -> str:
         """When an HTML document is being encoded to a given encoding, the
-        value of a meta tag's 'charset' is the name of the encoding.
+        value of a ``<meta>`` tag's ``charset`` becomes the name of
+        the encoding.
         """
-        if encoding in PYTHON_SPECIFIC_ENCODINGS:
+        if eventual_encoding in PYTHON_SPECIFIC_ENCODINGS:
             return ''
-        return encoding
+        return eventual_encoding
 
 
 class ContentMetaAttributeValue(AttributeValueWithCharsetSubstitution):
-    """A generic stand-in for the value of a meta tag's 'content' attribute.
+    """A generic stand-in for the value of a ``<meta>`` tag's ``content``
+    attribute.
 
     When Beautiful Soup parses the markup:
-    <meta http-equiv="content-type" content="text/html; charset=utf8">
+    ``<meta http-equiv="content-type" content="text/html; charset=utf8">``
 
-    The value of the 'content' attribute will be one of these objects.
-    """
-
-    CHARSET_RE = re.compile(r"((^|;)\s*charset=)([^;]*)", re.M)
-
-    def __new__(cls, original_value):
+    The value of the ``content`` attribute will become one of these objects.
+
+    If the document is later encoded to an encoding other than UTF-8, its
+    ``<meta>`` tag will mention the new encoding instead of ``utf8``.
+    """
+    #: Match the 'charset' argument inside the 'content' attribute
+    #: of a <meta> tag.
+    #: :meta private:
+    CHARSET_RE: Pattern[str] = re.compile(
+        r"((^|;)\s*charset=)([^;]*)", re.M
+    )
+
+    def __new__(cls, original_value:str) -> Self:
         match = cls.CHARSET_RE.search(original_value)
-        if match is None:
-            # No substitution necessary.
-            return str.__new__(str, original_value)
-
         obj = str.__new__(cls, original_value)
         obj.original_value = original_value
         return obj
 
-    def encode(self, encoding):
-        if encoding in PYTHON_SPECIFIC_ENCODINGS:
-            return ''
+    def substitute_encoding(self, eventual_encoding:_Encoding="utf-8") -> str:
+        """When an HTML document is being encoded to a given encoding, the
+        value of the ``charset=`` in a ``<meta>`` tag's ``content`` becomes
+        the name of the encoding.
+        """
+        if eventual_encoding in PYTHON_SPECIFIC_ENCODINGS:
+            return self.CHARSET_RE.sub('', self.original_value)
         def rewrite(match):
-            return match.group(1) + encoding
+            return match.group(1) + eventual_encoding
         return self.CHARSET_RE.sub(rewrite, self.original_value)
 
 
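The renamed `substitute_encoding()` boils down to a regex rewrite of the `charset=` argument inside the `content` attribute value. A minimal sketch using the same `CHARSET_RE` pattern the patch keeps, with a hypothetical `substitute()` helper standing in for the method:

```python
import re

# The same pattern as ContentMetaAttributeValue.CHARSET_RE: group 1
# captures everything up to and including "charset=", group 3 the value.
CHARSET_RE = re.compile(r"((^|;)\s*charset=)([^;]*)", re.M)

def substitute(content, eventual_encoding):
    """Rewrite the charset= argument inside a <meta> content value.

    A hypothetical helper standing in for substitute_encoding().
    """
    return CHARSET_RE.sub(lambda m: m.group(1) + eventual_encoding, content)

rewritten = substitute("text/html; charset=utf8", "euc-jp")
```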
 class PageElement(object):
-    """Contains the navigational information for some part of the page:
-    that is, its current location in the parse tree.
+    """An abstract class representing a single element in the parse tree.
 
-    NavigableString, Tag, etc. are all subclasses of PageElement.
+    `NavigableString`, `Tag`, etc. are all subclasses of
+    `PageElement`. For this reason you'll see a lot of methods that
+    return `PageElement`, but you'll never see an actual `PageElement`
+    object. For the most part you can think of `PageElement` as
+    meaning "a `Tag` or a `NavigableString`."
     """
 
-    # In general, we can't tell just by looking at an element whether
-    # it's contained in an XML document or an HTML document. But for
-    # Tags (q.v.) we can store this information at parse time.
-    known_xml = None
-
-    def setup(self, parent=None, previous_element=None, next_element=None,
-              previous_sibling=None, next_sibling=None):
+    #: In general, we can't tell just by looking at an element whether
+    #: it's contained in an XML document or an HTML document. But for
+    #: `Tag` objects (q.v.) we can store this information at parse time.
+    #: :meta private:
+    known_xml: Optional[bool] = None
+
+    #: Whether or not this element has been decomposed from the tree
+    #: it was created in.
+    _decomposed: bool
+
+    parent: Optional[Tag]
+    next_element: Optional[PageElement]
+    previous_element: Optional[PageElement]
+    next_sibling: Optional[PageElement]
+    previous_sibling: Optional[PageElement]
+
+    #: Whether or not this element is hidden from generated output.
+    #: Only the `BeautifulSoup` object itself is hidden.
+    hidden: bool=False
+
+    def setup(self, parent:Optional[Tag]=None,
+              previous_element:Optional[PageElement]=None,
+              next_element:Optional[PageElement]=None,
+              previous_sibling:Optional[PageElement]=None,
+              next_sibling:Optional[PageElement]=None) -> None:
         """Sets up the initial relations between this element and
         other elements.
 
@@ -175,7 +282,7 @@ class PageElement(object):
         self.parent = parent
 
         self.previous_element = previous_element
-        if previous_element is not None:
+        if self.previous_element is not None:
             self.previous_element.next_element = self
 
         self.next_element = next_element
@@ -191,10 +298,10 @@ class PageElement(object):
             previous_sibling = self.parent.contents[-1]
 
         self.previous_sibling = previous_sibling
-        if previous_sibling is not None:
+        if self.previous_sibling is not None:
             self.previous_sibling.next_sibling = self
 
-    def format_string(self, s, formatter):
+    def format_string(self, s:str, formatter:Optional[_FormatterOrName]) -> str:
         """Format the given string using the given formatter.
 
         :param s: A string.
@@ -207,28 +314,35 @@ class PageElement(object):
         output = formatter.substitute(s)
         return output
 
-    def formatter_for_name(self, formatter):
+    def formatter_for_name(
+            self,
+            formatter_name:Union[_FormatterOrName, _EntitySubstitutionFunction]
+    ) -> Formatter:
         """Look up or create a Formatter for the given identifier,
         if necessary.
 
-        :param formatter: Can be a Formatter object (used as-is), a
+        :param formatter: Can be a `Formatter` object (used as-is), a
            function (used as the entity substitution hook for an
-           XMLFormatter or HTMLFormatter), or a string (used to look
-           up an XMLFormatter or HTMLFormatter in the appropriate
+           `XMLFormatter` or `HTMLFormatter`), or a string (used to look
+           up an `XMLFormatter` or `HTMLFormatter` in the appropriate
            registry.
         """
-        if isinstance(formatter, Formatter):
-            return formatter
+        if isinstance(formatter_name, Formatter):
+            return formatter_name
+        c: type[Formatter]
+        registry: Mapping[Optional[str], Formatter]
         if self._is_xml:
             c = XMLFormatter
+            registry = XMLFormatter.REGISTRY
         else:
             c = HTMLFormatter
-        if isinstance(formatter, Callable):
-            return c(entity_substitution=formatter)
-        return c.REGISTRY[formatter]
+            registry = HTMLFormatter.REGISTRY
+        if callable(formatter_name):
+            return c(entity_substitution=formatter_name)
+        return registry[formatter_name]
 
     @property
-    def _is_xml(self):
+    def _is_xml(self) -> bool:
         """Is this element part of an XML tree or an HTML tree?
 
         This is used in formatter_for_name, when deciding whether an
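The rewritten `formatter_for_name()` dispatches on three input kinds: a formatter instance is returned as-is, a callable becomes the entity-substitution hook, and a string keys into a registry. A minimal sketch of that dispatch, with hypothetical `Fmt`/`REGISTRY` stand-ins for the real `Formatter` classes:

```python
# Sketch of the three-way lookup formatter_for_name() implements:
# instance -> as-is, callable -> wrapped, string -> registry entry.
class Fmt:
    def __init__(self, substitute=None):
        # substitute is the entity-substitution hook; identity by default.
        self.substitute = substitute or (lambda s: s)

REGISTRY = {"minimal": Fmt(), "shout": Fmt(str.upper)}

def formatter_for_name(formatter_name):
    if isinstance(formatter_name, Fmt):
        return formatter_name            # used as-is
    if callable(formatter_name):
        return Fmt(formatter_name)       # wrapped as the substitution hook
    return REGISTRY[formatter_name]      # string key into the registry
```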
@@ -250,28 +364,41 @@ class PageElement(object):
             return getattr(self, 'is_xml', False)
         return self.parent._is_xml
 
-    nextSibling = _alias("next_sibling") # BS3
-    previousSibling = _alias("previous_sibling") # BS3
+    nextSibling = _deprecated_alias("nextSibling", "next_sibling", "4.0.0")
+    previousSibling = _deprecated_alias(
+        "previousSibling", "previous_sibling", "4.0.0"
+    )
 
-    default = object()
-    def _all_strings(self, strip=False, types=default):
+    def __deepcopy__(self, memo:Dict, recursive:bool=False) -> Self:
+        raise NotImplementedError()
+
+    def __copy__(self) -> Self:
+        """A copy of a PageElement can only be a deep copy, because
+        only one PageElement can occupy a given place in a parse tree.
+        """
+        return self.__deepcopy__({})
+
+    default: Iterable[type[NavigableString]] = tuple() #: :meta private:
+    def _all_strings(self, strip:bool=False, types:Iterable[type[NavigableString]]=default) -> Iterator[str]:
         """Yield all strings of certain classes, possibly stripping them.
 
-        This is implemented differently in Tag and NavigableString.
+        This is implemented differently in `Tag` and `NavigableString`.
         """
         raise NotImplementedError()
 
     @property
-    def stripped_strings(self):
-        """Yield all strings in this PageElement, stripping them first.
+    def stripped_strings(self) -> Iterator[str]:
+        """Yield all interesting strings in this PageElement, stripping them
+        first.
 
-        :yield: A sequence of stripped strings.
+        See `Tag` for information on which strings are considered
+        interesting in a given context.
         """
         for string in self._all_strings(True):
             yield string
 
-    def get_text(self, separator="", strip=False,
-                 types=default):
+    def get_text(self, separator:str="", strip:bool=False,
+                 types:Iterable[Type[NavigableString]]=default) -> str:
         """Get all child strings of this PageElement, concatenated using the
         given separator.
 
@@ -294,19 +421,19 @@ class PageElement(object):
     getText = get_text
     text = property(get_text)
 
-    def replace_with(self, *args):
-        """Replace this PageElement with one or more PageElements, keeping the
-        rest of the tree the same.
+    def replace_with(self, *args:PageElement) -> PageElement:
+        """Replace this `PageElement` with one or more other `PageElements`,
+        keeping the rest of the tree the same.
 
-        :param args: One or more PageElements.
-        :return: `self`, no longer part of the tree.
+        :return: This `PageElement`, no longer part of the tree.
         """
         if self.parent is None:
             raise ValueError(
                 "Cannot replace one element with another when the "
                 "element to be replaced is not part of a tree.")
         if len(args) == 1 and args[0] is self:
-            return
+            # Replacing an element with itself is a no-op.
+            return self
         if any(x is self.parent for x in args):
             raise ValueError("Cannot replace a Tag with its parent.")
         old_parent = self.parent
@@ -315,45 +442,28 @@ class PageElement(object):
         for idx, replace_with in enumerate(args, start=my_index):
             old_parent.insert(idx, replace_with)
         return self
-    replaceWith = replace_with # BS3
+    replaceWith = _deprecated_function_alias(
+        "replaceWith", "replace_with", "4.0.0"
+    )
 
-    def unwrap(self):
-        """Replace this PageElement with its contents.
+    def wrap(self, wrap_inside:Tag) -> Tag:
+        """Wrap this `PageElement` inside a `Tag`.
 
-        :return: `self`, no longer part of the tree.
-        """
-        my_parent = self.parent
-        if self.parent is None:
-            raise ValueError(
-                "Cannot replace an element with its contents when that"
-                "element is not part of a tree.")
-        my_index = self.parent.index(self)
-        self.extract(_self_index=my_index)
-        for child in reversed(self.contents[:]):
-            my_parent.insert(my_index, child)
-        return self
-    replace_with_children = unwrap
-    replaceWithChildren = unwrap # BS3
-
-    def wrap(self, wrap_inside):
-        """Wrap this PageElement inside another one.
-
-        :param wrap_inside: A PageElement.
-        :return: `wrap_inside`, occupying the position in the tree that used
-        to be occupied by `self`, and with `self` inside it.
+        :return: ``wrap_inside``, occupying the position in the tree that used
+        to be occupied by this object, and with this object now inside it.
         """
         me = self.replace_with(wrap_inside)
         wrap_inside.append(me)
         return wrap_inside
 
-    def extract(self, _self_index=None):
+    def extract(self, _self_index:Optional[int]=None) -> PageElement:
         """Destructively rips this element out of the tree.
 
         :param _self_index: The location of this element in its parent's
            .contents, if known. Passing this in allows for a performance
            optimization.
 
-        :return: `self`, no longer part of the tree.
+        :return: this `PageElement`, no longer part of the tree.
         """
         if self.parent is not None:
             if _self_index is None:
@@ -364,11 +474,17 @@ class PageElement(object):
         #this element (and any children) hadn't been parsed. Connect
         #the two.
         last_child = self._last_descendant()
+
+        # last_child can't be None because we passed accept_self=True
+        # into _last_descendant. Worst case, last_child will be
+        # self. Making this cast removes several mypy complaints later
+        # on as we manipulate last_child.
+        last_child = cast(PageElement, last_child)
         next_element = last_child.next_element
 
-        if (self.previous_element is not None and
-            self.previous_element is not next_element):
-            self.previous_element.next_element = next_element
+        if self.previous_element is not None:
+            if self.previous_element is not next_element:
+                self.previous_element.next_element = next_element
         if next_element is not None and next_element is not self.previous_element:
             next_element.previous_element = self.previous_element
         self.previous_element = None
@@ -384,12 +500,38 @@ class PageElement(object):
         self.previous_sibling = self.next_sibling = None
         return self
 
-    def _last_descendant(self, is_initialized=True, accept_self=True):
+    def decompose(self) -> None:
+        """Recursively destroys this `PageElement` and its children.
+
+        The element will be removed from the tree and wiped out; so
+        will everything beneath it.
+
+        The behavior of a decomposed `PageElement` is undefined and you
+        should never use one for anything, but if you need to *check*
+        whether an element has been decomposed, you can use the
+        `PageElement.decomposed` property.
+        """
+        self.extract()
+        e: Optional[PageElement] = self
+        next_up: Optional[PageElement] = None
+        while e is not None:
+            next_up = e.next_element
+            e.__dict__.clear()
+            if isinstance(e, Tag):
+                e.contents = []
+            e._decomposed = True
+            e = next_up
+
+    def _last_descendant(
+            self, is_initialized:bool=True, accept_self:bool=True
+    ) -> Optional[PageElement]:
         """Finds the last element beneath this object to be parsed.
 
-        :param is_initialized: Has `setup` been called on this PageElement
-        yet?
-        :param accept_self: Is `self` an acceptable answer to the question?
+        :param is_initialized: Has `PageElement.setup` been called on
+        this `PageElement` yet?
+
+        :param accept_self: Is ``self`` an acceptable answer to the
+        question?
         """
         if is_initialized and self.next_sibling is not None:
             last_child = self.next_sibling.previous_element
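`extract()` and the new `decompose()` both rest on the same pointer surgery: every element carries `next_element`/`previous_element` links, and removing an element splices its neighbors together. A toy sketch of that operation, with a hypothetical `Node` class in place of `PageElement`:

```python
# A toy model of the splice PageElement.extract() performs: each node
# keeps next_element/previous_element links; extracting a node connects
# its two neighbors and clears its own links.
class Node:
    def __init__(self, name):
        self.name = name
        self.previous_element = None
        self.next_element = None

    def extract(self):
        prev, nxt = self.previous_element, self.next_element
        if prev is not None:
            prev.next_element = nxt
        if nxt is not None:
            nxt.previous_element = prev
        self.previous_element = self.next_element = None
        return self

a, b, c = Node("a"), Node("b"), Node("c")
a.next_element, b.previous_element = b, a
b.next_element, c.previous_element = c, b
b.extract()   # a and c are now directly linked
```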
@@ -400,121 +542,15 @@ class PageElement(object):
         if not accept_self and last_child is self:
             last_child = None
         return last_child
-    # BS3: Not part of the API!
-    _lastRecursiveChild = _last_descendant
 
-    def insert(self, position, new_child):
-        """Insert a new PageElement in the list of this PageElement's children.
-
-        This works the same way as `list.insert`.
-
-        :param position: The numeric position that should be occupied
-        in `self.children` by the new PageElement.
-        :param new_child: A PageElement.
-        """
-        if new_child is None:
-            raise ValueError("Cannot insert None into a tag.")
-        if new_child is self:
-            raise ValueError("Cannot insert a tag into itself.")
-        if (isinstance(new_child, str)
-            and not isinstance(new_child, NavigableString)):
-            new_child = NavigableString(new_child)
+    _lastRecursiveChild = _deprecated_alias("_lastRecursiveChild", "_last_descendant", "4.0.0")
 
-        from bs4 import BeautifulSoup
-        if isinstance(new_child, BeautifulSoup):
-            # We don't want to end up with a situation where one BeautifulSoup
-            # object contains another. Insert the children one at a time.
-            for subchild in list(new_child.contents):
-                self.insert(position, subchild)
-                position += 1
-            return
-        position = min(position, len(self.contents))
-        if hasattr(new_child, 'parent') and new_child.parent is not None:
-            # We're 'inserting' an element that's already one
-            # of this object's children.
-            if new_child.parent is self:
-                current_index = self.index(new_child)
-                if current_index < position:
-                    # We're moving this element further down the list
-                    # of this object's children. That means that when
-                    # we extract this element, our target index will
-                    # jump down one.
-                    position -= 1
-            new_child.extract()
-
-        new_child.parent = self
-        previous_child = None
-        if position == 0:
-            new_child.previous_sibling = None
-            new_child.previous_element = self
-        else:
-            previous_child = self.contents[position - 1]
-            new_child.previous_sibling = previous_child
-            new_child.previous_sibling.next_sibling = new_child
-            new_child.previous_element = previous_child._last_descendant(False)
-        if new_child.previous_element is not None:
-            new_child.previous_element.next_element = new_child
-
-        new_childs_last_element = new_child._last_descendant(False)
-
-        if position >= len(self.contents):
-            new_child.next_sibling = None
-
-            parent = self
-            parents_next_sibling = None
-            while parents_next_sibling is None and parent is not None:
-                parents_next_sibling = parent.next_sibling
-                parent = parent.parent
-                if parents_next_sibling is not None:
-                    # We found the element that comes next in the document.
-                    break
-            if parents_next_sibling is not None:
-                new_childs_last_element.next_element = parents_next_sibling
-            else:
-                # The last element of this tag is the last element in
-                # the document.
-                new_childs_last_element.next_element = None
-        else:
-            next_child = self.contents[position]
-            new_child.next_sibling = next_child
-            if new_child.next_sibling is not None:
-                new_child.next_sibling.previous_sibling = new_child
-            new_childs_last_element.next_element = next_child
-
-        if new_childs_last_element.next_element is not None:
-            new_childs_last_element.next_element.previous_element = new_childs_last_element
-        self.contents.insert(position, new_child)
-
-    def append(self, tag):
-        """Appends the given PageElement to the contents of this one.
-
-        :param tag: A PageElement.
-        """
-        self.insert(len(self.contents), tag)
-
-    def extend(self, tags):
-        """Appends the given PageElements to this one's contents.
-
-        :param tags: A list of PageElements. If a single Tag is
-        provided instead, this PageElement's contents will be extended
-        with that Tag's contents.
-        """
-        if isinstance(tags, Tag):
-            tags = tags.contents
-        if isinstance(tags, list):
-            # Moving items around the tree may change their position in
-            # the original list. Make a list that won't change.
-            tags = list(tags)
-        for tag in tags:
-            self.append(tag)
-
-    def insert_before(self, *args):
+    def insert_before(self, *args:PageElement) -> None:
         """Makes the given element(s) the immediate predecessor of this one.
 
-        All the elements will have the same parent, and the given elements
-        will be immediately before this one.
+        All the elements will have the same `PageElement.parent` as
+        this one, and the given elements will occur immediately before
516553 this one.
517 :param args: One or more PageElements.
518 """554 """
519 parent = self.parent555 parent = self.parent
520 if parent is None:556 if parent is None:
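Reviewer's note (not part of the diff): the `insert()` logic removed above is what keeps the `next_sibling`/`previous_sibling` and `next_element`/`previous_element` links consistent when a node is spliced in. A minimal sketch with the released bs4 API, showing the observable behavior that any rewrite of this machinery must preserve:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>one</b><b>three</b></p>", "html.parser")

# Build a new element and splice it in at position 1; insert() is
# responsible for rewiring the sibling and document-order links so
# traversal stays valid after the splice.
new_b = soup.new_tag("b")
new_b.string = "two"
soup.p.insert(1, new_b)

# Sibling links were updated along with the contents list.
assert [b.string for b in soup.p.find_all("b")] == ["one", "two", "three"]
assert new_b.previous_sibling.string == "one"
assert new_b.next_sibling.string == "three"
```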
@@ -530,13 +566,12 @@ class PageElement(object):
         index = parent.index(self)
         parent.insert(index, predecessor)

-    def insert_after(self, *args):
+    def insert_after(self, *args:PageElement) -> None:
         """Makes the given element(s) the immediate successor of this one.

-        The elements will have the same parent, and the given elements
-        will be immediately after this one.
-
-        :param args: One or more PageElements.
+        The elements will have the same `PageElement.parent` as this
+        one, and the given elements will occur immediately after this
+        one.
         """
         # Do all error checking before modifying the tree.
         parent = self.parent
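The `insert_before()`/`insert_after()` changes above are signature and docstring updates only; the behavior is the released one. A quick sketch against the current bs4 API:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>two</b></p>", "html.parser")
b = soup.p.b

# Both methods operate through the parent: the new strings become
# siblings of <b> inside <p>, before and after it respectively.
b.insert_before(soup.new_string("one "))
b.insert_after(soup.new_string(" three"))

assert soup.p.get_text() == "one two three"
```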
@@ -556,7 +591,14 @@ class PageElement(object):
             parent.insert(index+1+offset, successor)
             offset += 1

-    def find_next(self, name=None, attrs={}, string=None, **kwargs):
+    def find_next(
+            self,
+            name:Optional[_StrainableElement]=None,
+            attrs:_StrainableAttributes={},
+            string:Optional[_StrainableString]=None,
+            node:Optional[_TagOrStringMatchFunction]=None,
+            **kwargs:_StrainableAttribute
+    ) -> Optional[PageElement]:
         """Find the first PageElement that matches the given criteria and
         appears later in the document than this PageElement.

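The branch adds a `node` parameter to `find_next()` but leaves the existing `name`/`attrs`/`string` filters untouched. For reference, the unchanged behavior with the released bs4 API:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<h1>Title</h1><p>first</p><p>second</p>", "html.parser")
h1 = soup.h1

# find_next() returns the first later match; find_all_next() returns
# every later match, in document order.
assert h1.find_next("p").get_text() == "first"
assert [p.get_text() for p in h1.find_all_next("p")] == ["first", "second"]
```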
@@ -564,36 +606,47 @@ class PageElement(object):
         documentation for detailed explanations.

         :param name: A filter on tag name.
-        :param attrs: A dictionary of filters on attribute values.
+        :param attrs: Additional filters on attribute values.
         :param string: A filter for a NavigableString with specific text.
-        :kwargs: A dictionary of filters on attribute values.
-        :return: A PageElement.
-        :rtype: bs4.element.Tag | bs4.element.NavigableString
+        :kwargs: Additional filters on attribute values.
         """
-        return self._find_one(self.find_all_next, name, attrs, string, **kwargs)
-    findNext = find_next # BS3
+        return self._find_one(self.find_all_next, name, attrs, string, node, **kwargs)
+    findNext = _deprecated_function_alias("findNext", "find_next", "4.0.0")

-    def find_all_next(self, name=None, attrs={}, string=None, limit=None,
-                      **kwargs):
-        """Find all PageElements that match the given criteria and appear
-        later in the document than this PageElement.
+    def find_all_next(
+            self,
+            name:Optional[_StrainableElement]=None,
+            attrs:_StrainableAttributes={},
+            string:Optional[_StrainableString]=None,
+            limit:Optional[int]=None,
+            node:Optional[_TagOrStringMatchFunction]=None,
+            _stacklevel:int=2,
+            **kwargs:_StrainableAttribute
+    ) -> ResultSet[PageElement]:
+        """Find all `PageElement` objects that match the given criteria and
+        appear later in the document than this `PageElement`.

         All find_* methods take a common set of arguments. See the online
         documentation for detailed explanations.

         :param name: A filter on tag name.
-        :param attrs: A dictionary of filters on attribute values.
+        :param attrs: Additional filters on attribute values.
         :param string: A filter for a NavigableString with specific text.
         :param limit: Stop looking after finding this many results.
-        :kwargs: A dictionary of filters on attribute values.
-        :return: A ResultSet containing PageElements.
+        :param _stacklevel: Used internally to improve warning messages.
+        :kwargs: Additional filters on attribute values.
         """
-        _stacklevel = kwargs.pop('_stacklevel', 2)
         return self._find_all(name, attrs, string, limit, self.next_elements,
-                              _stacklevel=_stacklevel+1, **kwargs)
-    findAllNext = find_all_next # BS3
+                              node, _stacklevel=_stacklevel+1, **kwargs)
+    findAllNext = _deprecated_function_alias("findAllNext", "find_all_next", "4.0.0")

-    def find_next_sibling(self, name=None, attrs={}, string=None, **kwargs):
+    def find_next_sibling(
+            self,
+            name:Optional[_StrainableElement]=None,
+            attrs:_StrainableAttributes={},
+            string:Optional[_StrainableString]=None,
+            node:Optional[_TagOrStringMatchFunction]=None,
+            **kwargs:_StrainableAttribute) -> Optional[PageElement]:
         """Find the closest sibling to this PageElement that matches the
         given criteria and appears later in the document.

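As with `find_next()`, the sibling variants above only gain the `node` parameter; the sibling-versus-document-order distinction is unchanged. A sketch with the released bs4 API:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>a<span>inner</span></p><p>b</p></div>",
                     "html.parser")
first_p = soup.p

# find_next() walks every element that appears later in the document,
# including descendants of this one; find_next_sibling() only considers
# elements that share the same parent.
assert first_p.find_next("span").get_text() == "inner"
assert first_p.find_next_sibling("span") is None
assert first_p.find_next_sibling("p").get_text() == "b"
```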
@@ -601,102 +654,143 @@ class PageElement(object):
         online documentation for detailed explanations.

         :param name: A filter on tag name.
-        :param attrs: A dictionary of filters on attribute values.
-        :param string: A filter for a NavigableString with specific text.
-        :kwargs: A dictionary of filters on attribute values.
-        :return: A PageElement.
-        :rtype: bs4.element.Tag | bs4.element.NavigableString
+        :param attrs: Additional filters on attribute values.
+        :param string: A filter for a `NavigableString` with specific text.
+        :kwargs: Additional filters on attribute values.
         """
         return self._find_one(self.find_next_siblings, name, attrs, string,
-                              **kwargs)
-    findNextSibling = find_next_sibling # BS3
+                              node, **kwargs)
+    findNextSibling = _deprecated_function_alias(
+        "findNextSibling", "find_next_sibling", "4.0.0"
+    )

-    def find_next_siblings(self, name=None, attrs={}, string=None, limit=None,
-                           **kwargs):
-        """Find all siblings of this PageElement that match the given criteria
+    def find_next_siblings(
+            self,
+            name:Optional[_StrainableElement]=None,
+            attrs:_StrainableAttributes={},
+            string:Optional[_StrainableString]=None,
+            limit:Optional[int]=None,
+            node:Optional[_TagOrStringMatchFunction]=None,
+            _stacklevel:int=2,
+            **kwargs:_StrainableAttribute
+    ) -> ResultSet[PageElement]:
+        """Find all siblings of this `PageElement` that match the given criteria
         and appear later in the document.

         All find_* methods take a common set of arguments. See the online
         documentation for detailed explanations.

         :param name: A filter on tag name.
-        :param attrs: A dictionary of filters on attribute values.
-        :param string: A filter for a NavigableString with specific text.
+        :param attrs: Additional filters on attribute values.
+        :param string: A filter for a `NavigableString` with specific text.
         :param limit: Stop looking after finding this many results.
-        :kwargs: A dictionary of filters on attribute values.
-        :return: A ResultSet of PageElements.
-        :rtype: bs4.element.ResultSet
+        :param _stacklevel: Used internally to improve warning messages.
+        :kwargs: Additional filters on attribute values.
         """
-        _stacklevel = kwargs.pop('_stacklevel', 2)
         return self._find_all(
             name, attrs, string, limit,
-            self.next_siblings, _stacklevel=_stacklevel+1, **kwargs
+            self.next_siblings, node, _stacklevel=_stacklevel+1, **kwargs
         )
-    findNextSiblings = find_next_siblings # BS3
-    fetchNextSiblings = find_next_siblings # BS2
+    findNextSiblings = _deprecated_function_alias(
+        "findNextSiblings", "find_next_siblings", "4.0.0"
+    )
+    fetchNextSiblings = _deprecated_function_alias(
+        "fetchNextSiblings", "find_next_siblings", "3.0.0"
+    )

-    def find_previous(self, name=None, attrs={}, string=None, **kwargs):
-        """Look backwards in the document from this PageElement and find the
-        first PageElement that matches the given criteria.
+    def find_previous(
+            self,
+            name:Optional[_StrainableElement]=None,
+            attrs:_StrainableAttributes={},
+            string:Optional[_StrainableString]=None,
+            node:Optional[_TagOrStringMatchFunction]=None,
+            **kwargs:_StrainableAttribute) -> Optional[PageElement]:
+        """Look backwards in the document from this `PageElement` and find the
+        first `PageElement` that matches the given criteria.

         All find_* methods take a common set of arguments. See the online
         documentation for detailed explanations.

         :param name: A filter on tag name.
-        :param attrs: A dictionary of filters on attribute values.
-        :param string: A filter for a NavigableString with specific text.
-        :kwargs: A dictionary of filters on attribute values.
-        :return: A PageElement.
-        :rtype: bs4.element.Tag | bs4.element.NavigableString
+        :param attrs: Additional filters on attribute values.
+        :param string: A filter for a `NavigableString` with specific text.
+        :kwargs: Additional filters on attribute values.
         """
         return self._find_one(
-            self.find_all_previous, name, attrs, string, **kwargs)
-    findPrevious = find_previous # BS3
+            self.find_all_previous, name, attrs, string, node, **kwargs)
+
+    findPrevious = _deprecated_function_alias(
+        "findPrevious", "find_previous", "3.0.0"
+    )

-    def find_all_previous(self, name=None, attrs={}, string=None, limit=None,
-                          **kwargs):
-        """Look backwards in the document from this PageElement and find all
-        PageElements that match the given criteria.
+    def find_all_previous(
+            self,
+            name:Optional[_StrainableElement]=None,
+            attrs:_StrainableAttributes={},
+            string:Optional[_StrainableString]=None,
+            limit:Optional[int]=None,
+            node:Optional[_TagOrStringMatchFunction]=None,
+            _stacklevel:int=2,
+            **kwargs:_StrainableAttribute
+    ) -> ResultSet[PageElement]:
+        """Look backwards in the document from this `PageElement` and find all
+        `PageElement` that match the given criteria.

         All find_* methods take a common set of arguments. See the online
         documentation for detailed explanations.

         :param name: A filter on tag name.
-        :param attrs: A dictionary of filters on attribute values.
-        :param string: A filter for a NavigableString with specific text.
+        :param attrs: Additional filters on attribute values.
+        :param string: A filter for a `NavigableString` with specific text.
         :param limit: Stop looking after finding this many results.
-        :kwargs: A dictionary of filters on attribute values.
-        :return: A ResultSet of PageElements.
-        :rtype: bs4.element.ResultSet
+        :param _stacklevel: Used internally to improve warning messages.
+        :kwargs: Additional filters on attribute values.
         """
-        _stacklevel = kwargs.pop('_stacklevel', 2)
         return self._find_all(
             name, attrs, string, limit, self.previous_elements,
-            _stacklevel=_stacklevel+1, **kwargs
+            node, _stacklevel=_stacklevel+1, **kwargs
         )
-    findAllPrevious = find_all_previous # BS3
-    fetchPrevious = find_all_previous # BS2
+    findAllPrevious = _deprecated_function_alias(
+        "findAllPrevious", "find_all_previous", "4.0.0"
+    )
+    fetchAllPrevious = _deprecated_function_alias(
+        "fetchAllPrevious", "find_all_previous", "3.0.0"
+    )

-    def find_previous_sibling(self, name=None, attrs={}, string=None, **kwargs):
-        """Returns the closest sibling to this PageElement that matches the
+    def find_previous_sibling(
+            self,
+            name:Optional[_StrainableElement]=None,
+            attrs:_StrainableAttributes={},
+            string:Optional[_StrainableString]=None,
+            node:Optional[_TagOrStringMatchFunction]=None,
+            **kwargs:_StrainableAttribute) -> Optional[PageElement]:
+        """Returns the closest sibling to this `PageElement` that matches the
         given criteria and appears earlier in the document.

         All find_* methods take a common set of arguments. See the online
         documentation for detailed explanations.

         :param name: A filter on tag name.
-        :param attrs: A dictionary of filters on attribute values.
-        :param string: A filter for a NavigableString with specific text.
-        :kwargs: A dictionary of filters on attribute values.
The diff has been truncated for viewing.
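Since the diff is truncated here, a reviewer's sketch of the point of the change: the proposed `node=` filter accepts a match function that is called for every `PageElement`, `Tag` and `NavigableString` alike, whereas today a `name=` function only sees Tags and a `string=` function only sees strings. Until the branch lands, the effect can be emulated with the released API by filtering `descendants` directly; `is_comment_or_b` below is a hypothetical predicate, not part of the proposal:

```python
from bs4 import BeautifulSoup, Comment, Tag

soup = BeautifulSoup("<p>text<!-- note --><b>bold</b></p>", "html.parser")

# Hypothetical predicate: sees every node type, Tag or string alike.
def is_comment_or_b(node):
    return isinstance(node, Comment) or (
        isinstance(node, Tag) and node.name == "b")

# Emulation of the proposed find_all(node=...) with the released API:
# walk all page elements ourselves and apply the predicate.
matches = [el for el in soup.descendants if is_comment_or_b(el)]

assert len(matches) == 2
assert isinstance(matches[0], Comment)
assert matches[1].name == "b"
```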
