Better handling of HTML <ruby> tags
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Beautiful Soup | Fix Released | Undecided | Unassigned | | |
Bug Description
Sorry for my English. Please let me know if anything is unclear.
https:/
The <ruby> tag is often found in Japanese ebooks.
html='<
from bs4 import BeautifulSoup
soup = BeautifulSoup(
soup.get_text()
'ある日(ひ)
We may not need the part in parentheses, but the string is still readable.
But when the fallback <rp> tag is omitted by the ebook publisher:
html='<
soup = BeautifulSoup(
soup.get_text()
'ある日ひの放課ほうか後ごだった。'
This one is quite confusing, since kanji and its hiragana(
Yes, the strings in <rp> and <rt> can be ignored by get_text() using an undocumented approach I found in bs4's element.py and test.py:
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Script, Stylesheet, TemplateString, Tag
class RTString(
    '''class for <rt> tag'''
    pass
class RPString(
    '''class for <rp> tag'''
    pass
string_containers = {
    'rp': RPString,
    'rt': RTString,
    'style': Stylesheet,
    'script': Script,
    'template': TemplateString,
}
html='<
soup = BeautifulSoup(
soup.get_text()
'ある日の放課後だった。'
soup.get_
'ある日(ひ)
But could Beautiful Soup handle the <rp> and <rt> tags internally, as it does <script>, rather than leaving it to the end user to implement?
Could the documentation also include a short note about this?
I think these are standard HTML tags, not user-defined custom tags. Many East Asian languages also use them, for example for Pinyin in Chinese.
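The behaviour being asked for can be sketched without Beautiful Soup at all, using only the standard library's html.parser. This is a minimal illustration of the idea (skip any text inside <rt> and <rp>), not how Beautiful Soup implements it. The class name RubyBaseTextExtractor, the helper base_text, and the sample markup are hypothetical and made up for this example. The sketch assumes every <rt> and <rp> has an explicit closing tag.

```python
from html.parser import HTMLParser


class RubyBaseTextExtractor(HTMLParser):
    """Collect document text, skipping ruby annotations (<rt>) and
    fallback parentheses (<rp>). Hypothetical helper for illustration."""

    SKIPPED_TAGS = {"rt", "rp"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside <rt> or <rp>
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is outside every <rt>/<rp> element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def base_text(markup):
    """Return the text of `markup` with ruby annotations removed."""
    parser = RubyBaseTextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()


# Illustrative markup with the <rp> fallback omitted (not the reporter's original).
sample = (
    "ある<ruby>日<rt>ひ</rt></ruby>の"
    "<ruby>放課<rt>ほうか</rt>後<rt>ご</rt></ruby>だった。"
)
print(base_text(sample))  # ある日の放課後だった。
```

Because the annotation text is dropped regardless of whether <rp> is present, the result stays readable even when the publisher omits the fallback parentheses.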
OS: Windows 10
beautifulsoup4=
lxml==4.6.3
Revision 614 includes RubyTextString and RubyParenthesisString classes for this purpose.