Beautiful Soup

Merge ~andres-he/beautifulsoup:add_new_line_on_br_tags into beautifulsoup:master

Proposed by Andrés Herrera on 2024-03-21

Status:	Needs review
Proposed branch:	~andres-he/beautifulsoup:add_new_line_on_br_tags
Merge into:	beautifulsoup:master
Diff against target:	62 lines (+37/-0) 2 files modified bs4/element.py (+15/-0) test_explaining_issue-remove_this.py (+22/-0)

Reviewer	Review Type	Date Requested	Status
Leonard Richardson		2024-03-21	Pending
Review via email: mp+462910@code.launchpad.net

Commit message

add line break on text acquisition when the element is a br tag to avoid strings unexpectedly joined

Description of the change

Some HTML pages contain markup like this:
URL 1: www.example-url-1.com<br/>URL 2: www.example-url-2.com
(For a real example, search for "Blog" in the HTML of this page: https://www.linkedin.com/posts/sakana-ai_introducing-evolutionary-model-merge-a-new-activity-7176384016978178048-izIp?utm_source=share&utm_medium=member_desktop)
(In the browser's inspector tool the HTML is prettified, so the <br> is surrounded by newlines, but the downloaded HTML looks like the example above.)

The current implementation of text acquisition (get_text()) ignores <br> tags, so the example above yields the following string: URL 1: www.example-url-1.comURL 2: www.example-url-2.com
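
A minimal sketch of the behavior described above, plus two workarounds that need no change to the library. It assumes the built-in `html.parser` backend (the report does not name a parser), and the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html = 'URL 1: www.example-url-1.com<br/>URL 2: www.example-url-2.com'

# Current behavior: <br> contributes no text, so the two strings are joined.
soup = BeautifulSoup(html, 'html.parser')
joined = soup.get_text()

# Workaround 1: pass a separator. Note that it is inserted between every
# pair of strings in the tree, not only where a <br> tag appears.
separated = soup.get_text(separator='\n')

# Workaround 2: replace each <br> with a newline string before extracting
# the text. This modifies the tree, so a fresh parse is used here.
soup_copy = BeautifulSoup(html, 'html.parser')
for br in soup_copy.find_all('br'):
    br.replace_with('\n')
replaced = soup_copy.get_text()

print(repr(joined))
print(repr(separated))
print(repr(replaced))
```

Both workarounds put the burden on the caller, which is the gap this proposal tries to close inside get_text() itself.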

I included a test in the root directory of the repository (test_explaining_issue-remove_this.py). It could be moved into the existing test suite.

With this change, the new test passes along with the existing tests.

Thank you for your work on this super helpful library.

Andrés Herrera (andres-he) wrote on 2024-03-22:

Fixes #2058695: https://bugs.launchpad.net/beautifulsoup/+bug/2058695

Andrés Herrera (andres-he) wrote on 2024-03-22 (last edit on 2024-03-22):

An alternative solution would be to:
- Create a new subclass of bs4.Tag used only for br tags.
- On initialization, set the text of this new class to '\n'.
- By default, add this new class to INTERESTING_STRING_TYPES.

This would probably make parsing slower (I'm not sure) and text acquisition faster, but it would seemingly give more control and make the behavior more explicit.

I can open a merge proposal for this approach if needed.

Unmerged commits

2916fb4... by Andrés Herrera on 2024-03-21: add_new_line_on_br_tags

Preview Diff

diff --git a/bs4/element.py b/bs4/element.py
index 0aefe73..5a78e95 100644
--- a/bs4/element.py
+++ b/bs4/element.py
@@ -1187,6 +1187,15 @@ class RubyParenthesisString(NavigableString):
     """
     pass
+# This allows to do "NavigableString in ensure_iterable(self.interesting_string_types)"
+# in the function _all_strings of the class Tag
+def ensure_iterable(obj):
+    try:
+        iter(obj)
+        return obj
+    except TypeError:
+        return (obj,)
+
 class Tag(PageElement):
     """Represents an HTML or XML tag that is part of a parse tree, along
@@ -1435,6 +1444,12 @@ class Tag(PageElement):
             types = self.interesting_string_types
         for descendant in self.descendants:
+            # I thought of adding the addition of \n for brs here, but it is a
+            #  more profound design choice
+            if types is None or NavigableString in ensure_iterable(self.interesting_string_types):
+                if getattr(descendant, 'name', None) == 'br':
+                    yield NavigableString('\n')
+
             if (types is None and not isinstance(descendant, NavigableString)):
                 continue
             descendant_type = type(descendant)
diff --git a/test_explaining_issue-remove_this.py b/test_explaining_issue-remove_this.py
new file mode 100644
index 0000000..f9118bd
--- /dev/null
+++ b/test_explaining_issue-remove_this.py
@@ -0,0 +1,22 @@
+
+"""
+This test is meant to explain the problematic found, which is the reason for
+this PR. A real example of this html can be found here:
+https://www.linkedin.com/posts/sakana-ai_introducing-evolutionary-model-merge-a-new-activity-7176384016978178048-izIp?utm_source=share&utm_medium=member_desktop
+When I see it in the inspector tool it is prettified, but when downloading the
+html <br> is just as in the example below, with no spaces before or after the
+urls
+"""
+
+import pytest
+from bs4 import BeautifulSoup
+
+class TestNavigableString(object):
+
+    def test_text_acquisition_with_no_space_around_br_tag(self):
+        html = 'URL 1: www.example-url-1.com<br/>URL 2: www.example-url-2.com'
+        bs = BeautifulSoup(html)
+        assert 'www.example-url-1.comURL' not in bs.text
+
+if __name__ == '__main__':
+    TestNavigableString().test_text_acquisition_with_no_space_around_br_tag()
\ No newline at end of file
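
The ensure_iterable helper in the diff wraps a single class in a tuple so that the `in` check works whether interesting_string_types is one class or a collection of classes. The sketch below reproduces the helper and exercises it with two hypothetical stand-in classes (`Plain` and `Special` are not bs4 names). It also shows two properties worth keeping in mind when reviewing: the membership test matches exact classes, not subclasses, and a plain string is itself iterable, so it is returned unwrapped.

```python
def ensure_iterable(obj):
    """Return obj unchanged if it is iterable, otherwise wrap it in a 1-tuple."""
    try:
        iter(obj)
        return obj
    except TypeError:
        return (obj,)


class Plain(str):
    """Hypothetical stand-in for a string class such as NavigableString."""


class Special(Plain):
    """Hypothetical stand-in for a specialised string subclass."""


# A bare class is not iterable, so it gets wrapped in a tuple.
single = ensure_iterable(Plain)

# A tuple of classes is already iterable and passes through unchanged.
several = ensure_iterable((Plain, Special))

# Membership compares the classes themselves; subclasses do not match.
plain_in_single = Plain in single
special_in_single = Special in single

# Caveat: a string is iterable, so it is returned as-is rather than wrapped.
text_result = ensure_iterable("br")
```

Since the `in` test does not consider inheritance, a configuration that lists only a subclass of NavigableString would not trigger the newline under the diff's check.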