Beautiful Soup

bytes like regex failed on string like markup

Bug #1838877 reported by Kamil Mahmood on 2019-08-04


Affects:	Beautiful Soup
Status:	Fix Released
Importance:	Undecided
Assigned to:	Unassigned
Milestone:	none listed

Bug Description

# Script Start
from bs4 import BeautifulSoup

markup = """
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://新2网址（www.ydsjyj.com）-时时彩平台,(www.xinyushishicai.com)-澳门赌场（www.amdc999.com）">
<html xmlns="http://www.w3.org/1999/xhtml">

<head>
    <meta content="text/html; charset=utf-8" http-equiv="Content-Type" />
    <title>时时彩娱乐-首页</title>
    <meta content="时时彩娱乐,时时彩娱乐网址,时时彩娱乐平台,时时彩娱乐官网" name="keywords" />
    <meta content="时时彩娱乐官网✅✅ 是全网最诚信,口碑最好的彩票平台！提款速度最快,赔率高达9.999 极力为您提供注册、登陆、下载、测速等服务.时时彩娱乐祝您玩的愉快开心。" name="description" />
    <title>时时彩娱乐-首页</title>
</head>

<body>
<h1><a href="http://4b2s.com/">时时彩娱乐</a></h1>
</body>
</html>
"""

# Raises TypeError: cannot use a bytes pattern on a string-like object
soup = BeautifulSoup(markup, features="lxml")

# Passing UTF-8 encoded bytes avoids the exception, but the parsed result is empty
soup = BeautifulSoup(markup.encode("utf-8"), features="lxml", from_encoding="utf-8")
# Prints an empty string
print(str(soup))

# Script End

The HTML markup above is a small portion of a larger HTML file.
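For context on the error message quoted in the script: it is the message Python's standard `re` module raises when a pattern compiled from bytes is applied to a `str`. The sketch below reproduces that message with the standard library only. The pattern and the sample string are made up for illustration and are not taken from Beautiful Soup's source, so this shows the class of error rather than the actual code path that failed.

```python
import re

# Illustrative values only; not taken from Beautiful Soup's source.
bytes_pattern = re.compile(rb"<!DOCTYPE")        # pattern compiled from bytes
text_markup = "<!DOCTYPE html><html></html>"     # markup held as str

error_message = None
try:
    # Mixing a bytes pattern with a str subject is rejected by the re module.
    bytes_pattern.search(text_markup)
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# The same search works once pattern and subject have matching types.
str_match = re.compile(r"<!DOCTYPE").search(text_markup)
bytes_match = bytes_pattern.search(text_markup.encode("utf-8"))
```

Keeping the pattern type and the subject type aligned (both `str` or both `bytes`) is what avoids the error in this sketch.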

System information
Uname Result: 5.0.0-23-generic #24~18.04.1-Ubuntu SMP Mon Jul 29 16:12:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Python Version: 3.6.8

Libraries
beautifulsoup4==4.7.1
lxml==4.3.3


Kamil Mahmood (kamilmahmood) on 2019-08-04

description:	updated


Leonard Richardson (leonardr) wrote on 2019-09-02:

It looks like there are three problems here.

1. The TypeError. This is in Beautiful Soup code and easy to fix.

2. The lxml parser doesn't deal well with Unicode documents. It's rejecting your markup, with this exception:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe6 in position 79: unexpected end of data

But you don't get any visibility into that exception. I fixed this by propagating the exception upwards so you can see it.

3. By encoding the data as UTF-8, you can get lxml to accept the markup without raising an exception. But whatever problem lxml is having with this particular document doesn't go away, and lxml still can't handle the document. It ignores the entire thing, because of whatever problem it perceives in the DOCTYPE, and you're left with an empty BeautifulSoup object.

The fixes to 1 and 2 are in revision 526. To actually parse the document I recommend using html5lib as the parser instead of lxml.
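As a rough illustration of the recommendation above, here is a minimal sketch of selecting html5lib as the parser. The helper name `parse_html`, the sample markup, and the fallback to Python's built-in `html.parser` when html5lib is not installed are all assumptions added for this example; the report itself only suggests html5lib. The sample is a shortened stand-in for the reporter's document, not the original file.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Shortened, made-up stand-in for the reported document: a DOCTYPE whose
# system identifier contains non-ASCII text.
sample_markup = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://example.invalid/新">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Sample</title></head>
<body><h1><a href="http://example.invalid/">Sample link</a></h1></body>
</html>
"""


def parse_html(markup, parsers=("html5lib", "html.parser")):
    """Hypothetical helper: try each parser in order, skipping missing ones."""
    for name in parsers:
        try:
            return BeautifulSoup(markup, features=name)
        except FeatureNotFound:
            # Raised by Beautiful Soup when the requested parser is not installed.
            continue
    raise RuntimeError("none of the requested parsers is installed")


soup = parse_html(sample_markup)
link = soup.find("a")
link_text = link.get_text()
link_href = link["href"]
print(link_text, link_href)
```

html5lib is a separate package and has to be installed alongside Beautiful Soup for the first choice to take effect.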

Changed in beautifulsoup:
status:	New → Fix Committed


Kamil Mahmood (kamilmahmood) wrote on 2019-09-04:

Thanks for fixing the bug.

Leonard Richardson (leonardr) on 2019-11-11

Changed in beautifulsoup:
status:	Fix Committed → Fix Released
