Better handling of HTML <ruby> tags

Bug #1941980 reported by mumumu
This bug affects 1 person
Affects: Beautiful Soup
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned
Milestone: (none shown)

Bug Description

Sorry for my English. Please let me know if anything is unclear.

https://developer.mozilla.org/en-US/docs/Web/HTML/Element/ruby

The <ruby> tag is often found in Japanese ebooks.

html='<p>ある<ruby>日<rp>(</rp><rt>ひ</rt><rp>)</rp></ruby>の<ruby>放課<rp>(</rp><rt>ほうか</rt><rp>)</rp>後<rp>(</rp><rt>ご</rt><rp>)</rp></ruby>だった。</p>'

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
soup.get_text()

'ある日(ひ)の放課(ほうか)後(ご)だった。'

We may not need the parts in parentheses, but the string is still readable.

However, if the ebook publisher omits the fallback <rp> tags:

html='<p>ある<ruby>日<rt>ひ</rt></ruby>の<ruby>放課<rt>ほうか</rt>後<rt>ご</rt></ruby>だった。</p>'
soup = BeautifulSoup(html,'lxml')
soup.get_text()

'ある日ひの放課ほうか後ごだった。'

This one is quite confusing, since each kanji is immediately followed by its hiragana reading (pronunciation), so the same word appears twice in a row.

The strings inside <rp> and <rt> can be ignored by get_text() using an undocumented approach that I found in bs4's element.py and test.py:

from bs4 import BeautifulSoup
from bs4.element import NavigableString, Script, Stylesheet, TemplateString

class RTString(NavigableString):
    '''String found inside an <rt> tag (the ruby annotation text).'''

class RPString(NavigableString):
    '''String found inside an <rp> tag (the fallback parenthesis).'''

# Map tag names to the string class used for text directly inside them.
string_containers = {
    'rp': RPString,
    'rt': RTString,
    'style': Stylesheet,
    'script': Script,
    'template': TemplateString,
}

html = '<p>ある<ruby>日<rp>(</rp><rt>ひ</rt><rp>)</rp></ruby>の<ruby>放課<rp>(</rp><rt>ほうか</rt><rp>)</rp>後<rp>(</rp><rt>ご</rt><rp>)</rp></ruby>だった。</p>'
# The keyword argument must be separated from the parser name by a comma.
soup = BeautifulSoup(html, 'lxml', string_containers=string_containers)
soup.get_text()

'ある日の放課後だった。'

soup.get_text(types={NavigableString, RTString, RPString})

'ある日(ひ)の放課(ほうか)後(ご)だった。'
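
For versions without built-in ruby support, a simpler workaround can be sketched without custom string classes: remove the <rt> and <rp> elements from the parsed tree before extracting the text. This is only a sketch under stated assumptions. The helper name strip_ruby_annotations is hypothetical, html.parser is used instead of lxml to avoid an extra dependency, and the annotations are discarded for good rather than being selectable later.

```python
from bs4 import BeautifulSoup


def strip_ruby_annotations(markup: str) -> str:
    """Return the text of `markup` with <rt> and <rp> content removed.

    Hypothetical helper for illustration; it is not part of Beautiful Soup.
    """
    # Assumption: html.parser stands in for the lxml parser used in the report.
    soup = BeautifulSoup(markup, 'html.parser')
    # find_all() returns a list-like snapshot, so removing tags while
    # looping over it does not disturb the iteration.
    for annotation in soup.find_all(['rt', 'rp']):
        annotation.decompose()
    return soup.get_text()


html = '<p>ある<ruby>日<rt>ひ</rt></ruby>の<ruby>放課<rt>ほうか</rt>後<rt>ご</rt></ruby>だった。</p>'
base_text = strip_ruby_annotations(html)
print(base_text)  # 'ある日の放課後だった。'
```

Unlike the string_containers approach, this modifies the parsed tree, so it suits cases where the readings are never needed again.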

But could Beautiful Soup handle <rp> and <rt> tags internally, the way it handles <script>, rather than leaving this to the end user to implement?
Could a short note also be added to the documentation?
These are standard HTML tags, not custom user tags. Many East Asian languages use them, for example for Pinyin in Chinese.

OS: Windows 10
beautifulsoup4==4.9.3
lxml==4.6.3

mumumu (mumumu42)
Leonard Richardson (leonardr) wrote :

Revision 614 includes RubyTextString and RubyParenthesisString classes for this purpose.

Changed in beautifulsoup:
status: New → Fix Committed
Leonard Richardson (leonardr) wrote :

Fix released in version 4.11.0.

Changed in beautifulsoup:
status: Fix Committed → Fix Released