Beautiful Soup

Bug #1868861
Comment #10

Leonard Richardson (leonardr) wrote on 2020-10-13:

I'm reopening this issue to get more information, and will file a new issue once I have a better idea of what to do. I think almost all the issues you're having can be improved by more documentation.

The methods that deal with text do give the caller a chance to configure what is considered 'text'. The get_text method takes an argument 'types':

    def get_text(self, separator=u"", strip=False,
                 types=(NavigableString, CData)):
...
        :types: A tuple of NavigableString subclasses. Any strings of
            a subclass not found in this list will be ignored. By
            default, this means only NavigableString and CData objects
            will be considered. So no comments, processing instructions,
            stylesheets, etc.

To make get_text consider scripts to be 'text', you would include bs4.Script in the types tuple, alongside the default classes.
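As a minimal sketch of how this plays out on a whole document rather than a single tag: the markup and variable names below are made up for illustration, and html.parser is assumed so that the script contents are stored as bs4.Script strings.

```python
from bs4 import BeautifulSoup, CData, NavigableString, Script

# Illustrative markup: one visible paragraph and one script.
markup = "<p>Hello</p><script>var x = 1;</script>"
soup = BeautifulSoup(markup, "html.parser")

# Default types: the script contents are not treated as text.
visible = soup.get_text()

# Adding Script to the tuple makes get_text() include the script contents.
with_scripts = soup.get_text(types=(NavigableString, CData, Script))

print(repr(visible))
print(repr(with_scripts))
```

Passing only Script would drop the ordinary strings, which is why the default classes stay in the tuple.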

The '.strings' property calls the '_all_strings' internal method with default arguments, and '_all_strings' also takes the 'types' argument. (In fact, get_text calls _all_strings and joins the results.) But _all_strings is an internal method, and I can't make .strings stop being a property. This is the piece that can't be fixed with documentation.
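One possible way to get the same filtering without touching the internal method is to walk the public .descendants generator and filter by class. This is only a sketch: iter_strings is a hypothetical helper, not part of Beautiful Soup, and it assumes the same exact-class matching that the 'types' docstring describes.

```python
from bs4 import BeautifulSoup, CData, NavigableString, Script


def iter_strings(tag, types=(NavigableString, CData)):
    """Hypothetical helper: yield strings under `tag` whose exact class is in `types`."""
    for node in tag.descendants:
        # Exact class match, so subclasses such as Script or Comment
        # are skipped unless they are listed explicitly.
        if type(node) in types:
            yield node


# Illustrative markup, parsed with html.parser (an assumption of this sketch).
soup = BeautifulSoup(
    "<div><p>Hi</p><script>var x = 1;</script></div>", "html.parser"
)

default_strings = [str(s) for s in iter_strings(soup.div)]
all_strings = [
    str(s)
    for s in iter_strings(soup.div, types=(NavigableString, CData, Script))
]

print(default_strings)
print(all_strings)
```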

With respect to getting the contents of an element, there's a method encode_contents() that I think will do what you want. It's not documented because it's a deprecated method that provides compatibility with Beautiful Soup 3. But if people want to use it I can un-deprecate it and document it. String types aren't an issue with encode_contents(), because it's rendering HTML markup as it would appear in a web page, not trying to apply a human notion (or HTML-spec notion) of what is "text".
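A short sketch of what encode_contents() returns, next to its string-returning sibling decode_contents(). The markup is invented for illustration and html.parser is assumed. Note that encode_contents() returns bytes, not str.

```python
from bs4 import BeautifulSoup

# Illustrative markup mixing ordinary text, a child tag, and a script.
soup = BeautifulSoup(
    "<div><b>bold</b> text<script>var x = 1;</script></div>", "html.parser"
)
div = soup.div

# Markup inside the <div>, rendered as bytes (UTF-8 by default).
as_bytes = div.encode_contents()

# The same markup rendered as a str.
as_str = div.decode_contents()

print(as_bytes)
print(as_str)
```

Both calls render everything inside the tag, script included, since no notion of 'text' is applied.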

So, to put all of that into an example Python script:

---
from bs4 import BeautifulSoup, NavigableString, CData, Script

markup = """<script>some javascript</script>"""
soup = BeautifulSoup(markup, 'lxml')
script = soup.script

print("get_text()")
print(script.get_text())
# (empty: Script strings are excluded by default)
print(script.get_text(types=[NavigableString, CData, Script]))
# some javascript

print("")
print(".strings")
print(list(script.strings))
# []
print(list(script._all_strings(types=[NavigableString, CData, Script])))
# ['some javascript']

print("")
print("encode_contents()")
print(script.encode_contents())
# b'some javascript'  (encode_contents() returns bytes)
---

Will this solve the problems you bring up?