Merge ~andres-he/beautifulsoup:add_new_line_on_br_tags into beautifulsoup:master

Proposed by Andrés Herrera
Status: Needs review
Proposed branch: ~andres-he/beautifulsoup:add_new_line_on_br_tags
Merge into: beautifulsoup:master
Diff against target: 62 lines (+37/-0)
2 files modified
bs4/element.py (+15/-0)
test_explaining_issue-remove_this.py (+22/-0)
Reviewer Review Type Date Requested Status
Leonard Richardson Pending
Review via email: mp+462910@code.launchpad.net

Commit message

add line break on text acquisition when the element is a br tag to avoid strings unexpectedly joined

Description of the change

There are cases when you find something like this on html pages:
URL 1: www.example-url-1.com<br/>URL 2: www.example-url-2.com
(you can find a real example on this link, look for "Blog" in the html: https://www.linkedin.com/posts/sakana-ai_introducing-evolutionary-model-merge-a-new-activity-7176384016978178048-izIp?utm_source=share&utm_medium=member_desktop)
(when I see that page with the inspector tool, the html is prettified, so the <br> is surrounded by new lines, but when downloading the html, it is a case like the above).

The current implementation of text acquisition (get_text()) ignores <br> tags, resulting in a string like the following (for the example above): URL 1: www.example-url-1.comURL 2: www.example-url-2.com

I included a test in the root directory of the repository (called test_explaining_issue-remove_this.py). This could be added to the rest of the tests.

With this new implementation, the test I added passes, along with the already existing tests.

Thank you for your work on this super helpful library.

To post a comment you must log in.
Revision history for this message
Andrés Herrera (andres-he) wrote :
Revision history for this message
Andrés Herrera (andres-he) wrote (last edit ):

An alternate solution could be to:
- Make a new class child of bs4.Tag, just for br tags
- On initialization, set the text of this new class as '\n'
- By deffalut, add this new class to INTERESTING_STRING_TYPES.

This would probably make parsing slower (I'm not sure), and text acquisition faster, but would seemingly add more control and make things more splicit.

I could do a merge proposal on this if needed.

Unmerged commits

2916fb4... by Andrés Herrera

add_new_line_on_br_tags

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1diff --git a/bs4/element.py b/bs4/element.py
2index 0aefe73..5a78e95 100644
3--- a/bs4/element.py
4+++ b/bs4/element.py
5@@ -1187,6 +1187,15 @@ class RubyParenthesisString(NavigableString):
6 """
7 pass
8
9+# This allows to do "NavigableString in ensure_iterable(self.interesting_string_types)"
10+# in the function _all_strings of the class Tag
11+def ensure_iterable(obj):
12+ try:
13+ iter(obj)
14+ return obj
15+ except TypeError:
16+ return (obj,)
17+
18
19 class Tag(PageElement):
20 """Represents an HTML or XML tag that is part of a parse tree, along
21@@ -1435,6 +1444,12 @@ class Tag(PageElement):
22 types = self.interesting_string_types
23
24 for descendant in self.descendants:
25+ # I thought of adding the addition of \n for brs here, but it is a
26+ # more profound design choice
27+ if types is None or NavigableString in ensure_iterable(self.interesting_string_types):
28+ if getattr(descendant, 'name', None) == 'br':
29+ yield NavigableString('\n')
30+
31 if (types is None and not isinstance(descendant, NavigableString)):
32 continue
33 descendant_type = type(descendant)
34diff --git a/test_explaining_issue-remove_this.py b/test_explaining_issue-remove_this.py
35new file mode 100644
36index 0000000..f9118bd
37--- /dev/null
38+++ b/test_explaining_issue-remove_this.py
39@@ -0,0 +1,22 @@
40+
41+"""
42+This test is meant to explain the problematic found, which is the reason for
43+this PR. A real example of this html can be found here:
44+https://www.linkedin.com/posts/sakana-ai_introducing-evolutionary-model-merge-a-new-activity-7176384016978178048-izIp?utm_source=share&utm_medium=member_desktop
45+When I see it in the inspector tool it is prettified, but when downloading the
46+html <br> is just as in the example below, with no spaces before or after the
47+urls
48+"""
49+
50+import pytest
51+from bs4 import BeautifulSoup
52+
53+class TestNavigableString(object):
54+
55+ def test_text_acquisition_with_no_space_around_br_tag(self):
56+ html = 'URL 1: www.example-url-1.com<br/>URL 2: www.example-url-2.com'
57+ bs = BeautifulSoup(html)
58+ assert 'www.example-url-1.comURL' not in bs.text
59+
60+if __name__ == '__main__':
61+ TestNavigableString().test_text_acquisition_with_no_space_around_br_tag()
62\ No newline at end of file

Subscribers

People subscribed via source and target branches