Beautiful Soup

Better handling of HTML <ruby> tags

Bug #1941980 reported by mumumu on 2021-08-29

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	Beautiful Soup	Fix Released	Undecided	Unassigned

Bug Description

Sorry for my English. Please let me know if anything is unclear.

https://developer.mozilla.org/en-US/docs/Web/HTML/Element/ruby

The <ruby> tag is often found in Japanese ebooks.

html='ある<ruby>日<rp>（</rp><rt>ひ</rt><rp>）</rp></ruby>の<ruby>放課<rp>（</rp><rt>ほうか</rt><rp>）</rp>後<rp>（</rp><rt>ご</rt><rp>）</rp></ruby>だった。'

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
soup.get_text()

'ある日（ひ）の放課（ほうか）後（ご）だった。'

The readings in parentheses may not be needed, but the string is still readable.

However, when the ebook publisher omits the fallback <rp> tags:

html='ある<ruby>日<rt>ひ</rt></ruby>の<ruby>放課<rt>ほうか</rt>後<rt>ご</rt></ruby>だった。'
soup = BeautifulSoup(html,'lxml')
soup.get_text()

'ある日ひの放課ほうか後ごだった。'

This output is quite confusing: each kanji is followed directly by its hiragana reading (pronunciation), so the same word appears twice with nothing to separate the two.

The strings inside <rp> and <rt> can be excluded from get_text() using an undocumented mechanism I found in bs4's element.py and test.py:
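For readers hitting the same output, one way to drop the readings without relying on any special string classes is to delete the <rt> and <rp> elements before extracting text. This is only a minimal sketch: the helper name text_without_ruby_annotations is hypothetical, it uses the built-in "html.parser" builder instead of lxml so that no extra dependency is assumed, and it modifies the parsed tree rather than changing how get_text() behaves.

```python
from bs4 import BeautifulSoup


def text_without_ruby_annotations(html: str) -> str:
    """Return the text of `html` with <rt> and <rp> content removed.

    Hypothetical helper for illustration; not part of Beautiful Soup.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Delete every ruby annotation (<rt>) and fallback parenthesis (<rp>)
    # element, together with the text it contains.
    for tag in soup.find_all(["rt", "rp"]):
        tag.decompose()
    return soup.get_text()


html = "ある<ruby>日<rt>ひ</rt></ruby>の<ruby>放課<rt>ほうか</rt>後<rt>ご</rt></ruby>だった。"
print(text_without_ruby_annotations(html))
```

Because decompose() destroys the elements, this approach only suits cases where the readings are not needed later from the same soup object.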

from bs4 import BeautifulSoup
from bs4.element import NavigableString, Script, Stylesheet, TemplateString, Tag

class RTString(NavigableString):
    '''class for <rt> tag'''

class RPString(NavigableString):
    '''class for <rp> tag'''

string_containers = {
    'rp': RPString,
    'rt': RTString,
    'style': Stylesheet,
    'script': Script,
    'template': TemplateString,
}

html='ある<ruby>日<rp>（</rp><rt>ひ</rt><rp>）</rp></ruby>の<ruby>放課<rp>（</rp><rt>ほうか</rt><rp>）</rp>後<rp>（</rp><rt>ご</rt><rp>）</rp></ruby>だった。'
soup = BeautifulSoup(html, 'lxml', string_containers=string_containers)
soup.get_text()

'ある日の放課後だった。'

soup.get_text(types={NavigableString,RTString,RPString,})

'ある日（ひ）の放課（ほうか）後（ご）だった。'

Could Beautiful Soup handle the <rp> and <rt> tags internally, the way it handles <script>, instead of leaving this to the end user?
A short note in the documentation would also help.
These are standard HTML tags, not custom user tags. Other East Asian languages use them as well, for example for Pinyin in Chinese.

OS: Windows 10
beautifulsoup4==4.9.3
lxml==4.6.3

mumumu (mumumu42) on 2021-08-29

information type:	Public → Public Security
information type:	Public Security → Private Security
information type:	Private Security → Public

Leonard Richardson (leonardr) wrote on 2021-10-11:

Revision 614 includes RubyTextString and RubyParenthesisString classes for this purpose.

Changed in beautifulsoup:
status:	New → Fix Committed

Leonard Richardson (leonardr) wrote on 2022-04-08:

Fix released in version 4.11.0.

Changed in beautifulsoup:
status:	Fix Committed → Fix Released
