1
diff --git a/doc.zh/source/conf.py b/doc.zh/source/conf.py
2
index 102c3cf..1014dad 100644
3
--- a/doc.zh/source/conf.py
4
+++ b/doc.zh/source/conf.py
5
@@ -50,7 +50,7 @@ copyright = u'2012, Leonard Richardson'
6
50
# The short X.Y version.
50
# The short X.Y version.
7
51
version = '4'
51
version = '4'
8
52
# The full version, including alpha/beta/rc tags.
52
# The full version, including alpha/beta/rc tags.
10
53
release = '4.2.0'
53
release = '4.12.0'
11
54
54
12
55
# The language for content autogenerated by Sphinx. Refer to documentation
55
# The language for content autogenerated by Sphinx. Refer to documentation
13
56
# for a list of supported languages.
56
# for a list of supported languages.
14
diff --git a/doc.zh/source/index.rst b/doc.zh/source/index.rst
15
index 05b9cfc..4e099cd 100644
16
--- a/doc.zh/source/index.rst
17
+++ b/doc.zh/source/index.rst
18
@@ -1,38 +1,47 @@
19
1
.. BeautifulSoup文档 documentation master file, created by
20
2
   Deron Wang on Fri Nov 29 13:49:30 2013.
21
3
   You can adapt this file completely to your liking, but it should at least
22
4
   contain the root `toctree` directive.
23
5
1
25
6
Beautiful Soup 4.4.0 文档
2
Beautiful Soup 4.12.0 文档
26
7
==========================
3
==========================
27
8
4
29
9
`Beautiful Soup <http://www.crummy.com/software/BeautifulSoup/>`_ 是一个可以从HTML或XML文件中提取数据的Python库.它能够通过你喜欢的转换器实现惯用的文档导航,查找,修改文档的方式.Beautiful Soup会帮你节省数小时甚至数天的工作时间.
5
.. py:module:: bs4
30
10
6
32
11
这篇文档介绍了BeautifulSoup4中所有主要特性,并且有小例子.让我来向你展示它适合做什么,如何工作,怎样使用,如何达到你想要的效果,和处理异常情况.
7
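`Beautiful Soup <http://www.crummy.com/software/BeautifulSoup/>`_ is a
Python library for extracting data from HTML and XML files. It works with your favorite parser
to provide idiomatic ways of navigating, searching, and modifying the parse tree. It can save you hours or even days of work.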
35
12
10
37
13
文档中出现的例子在Python2.7和Python3.2中的执行结果相同
11
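This document describes all the major features of Beautiful Soup 4, with examples. It shows what the library is good for,
how it works, how to use it, how to make it do what you want, and what to do when it violates your expectations.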
39
14
13
41
15
你可能在寻找 `Beautiful Soup3 <http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html>`_ 的文档,Beautiful Soup 3 目前已经停止开发,我们推荐在现在的项目中使用Beautiful Soup 4, `移植到BS4 <http://www.baidu.com>`_
14
This document covers Beautiful Soup version 4.12.0. The examples were written for Python 3.8.
42
16
15
44
17
这篇帮助文档已经被翻译成了其它语言:
16
You might be looking for the documentation for `Beautiful Soup3 <http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html>`_.
Beautiful Soup 3 is no longer being developed, and support for it ended on December 31, 2020.
To learn about the differences between Beautiful Soup 3 and Beautiful Soup 4, see `迁移到 BS4`_.
47
18
19
49
19
* `这篇文档当然还有中文版. <https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/>`_
20
This documentation has been translated into other languages:
50
21
51
22
* `这篇文档当然还有中文版 <https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/>`_ , 
52
23
  (`Github 地址 <https://github.com/DeronW/beautifulsoup>`_).
53
20
* このページは日本語で利用できます(`外部リンク <http://kondou.com/BS4/>`_)
24
* このページは日本語で利用できます(`外部リンク <http://kondou.com/BS4/>`_)
54
21
* `이 문서는 한국어 번역도 가능합니다. <https://www.crummy.com/software/BeautifulSoup/bs4/doc.ko/>`_
25
* `이 문서는 한국어 번역도 가능합니다. <https://www.crummy.com/software/BeautifulSoup/bs4/doc.ko/>`_
59
22
* `Este documento também está disponível em Português do Brasil. <https://www.crummy.com/software/BeautifulSoup/bs4/doc.ptbr/>`_
26
* `Este documento também está disponível em Português do Brasil. 
60
23
* `Este documento también está disponible en una traducción al español. <https://www.crummy.com/software/BeautifulSoup/bs4/doc.es/>`_
27
  <https://www.crummy.com/software/BeautifulSoup/bs4/doc.ptbr>`_
61
24
* `Эта документация доступна на русском языке. <https://www.crummy.com/software/BeautifulSoup/bs4/doc.ru/>`_
28
* `Эта документация доступна на русском языке. 
62
25
  
29
  <https://www.crummy.com/software/BeautifulSoup/bs4/doc.ru/>`_
63
26
30
64
27
寻求帮助
31
寻求帮助
65
28
--------
32
--------
66
29
33
68
30
如果你有关于BeautifulSoup的问题,可以发送邮件到 `讨论组 <https://groups.google.com/forum/?fromgroups#!forum/beautifulsoup>`_ .如果你的问题包含了一段需要转换的HTML代码,那么确保你提的问题描述中附带这段HTML文档的 `代码诊断`_ [1]_
34
If you have questions about Beautiful Soup 4, or run into problems, send mail to `the discussion group
<https://groups.google.com/forum/?fromgroups#!forum/beautifulsoup>`_.
70
36
71
37
If your problem involves parsing an HTML document, be sure to include what the `代码诊断`_ function says about that document [1]_.
72
38
73
39
When reporting an error in the documentation, please say which translation you are reading.
74
31
40
75
32
快速开始
41
快速开始
76
33
========
42
========
77
34
43
79
35
下面的一段HTML代码将作为例子被多次用到.这是 *爱丽丝梦游仙境的* 的一段内容(以后内容中简称为 *爱丽丝* 的文档):
44
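The following HTML document is used as an example throughout this document. It is part of a story from `Alice in Wonderland` (referred to below as the *Alice* document):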
80
36
45
81
37
::
46
::
82
38
47
83
@@ -50,7 +59,8 @@ Beautiful Soup 4.4.0 文档
84
50
    <p class="story">...</p>
59
    <p class="story">...</p>
85
51
    """
60
    """
86
52
61
88
53
使用BeautifulSoup解析这段代码,能够得到一个 ``BeautifulSoup`` 的对象,并能按照标准的缩进格式的结构输出:
62
Running the *Alice* document through Beautiful Soup gives us a :py:class:`BeautifulSoup` object,
which represents the document as a nested data structure:
90
54
64
91
55
::
65
::
92
56
66
93
@@ -80,7 +90,7 @@ Beautiful Soup 4.4.0 文档
94
80
    #     Lacie
90
    #     Lacie
95
81
    #    </a>
91
    #    </a>
96
82
    #    and
92
    #    and
98
83
    #    <a class="sister" href="http://example.com/tillie" id="link3">
93
    #    <a class="sister" href="http://example.com/tillie" id="link3">
99
84
    #     Tillie
94
    #     Tillie
100
85
    #    </a>
95
    #    </a>
101
86
    #    ; and they lived at the bottom of a well.
96
    #    ; and they lived at the bottom of a well.
102
@@ -91,7 +101,7 @@ Beautiful Soup 4.4.0 文档
103
91
    #  </body>
101
    #  </body>
104
92
    # </html>
102
    # </html>
105
93
103
107
94
几个简单的浏览结构化数据的方法:
104
Here are some simple ways to navigate that data structure:
108
95
105
109
96
::
106
::
110
97
107
111
@@ -124,7 +134,7 @@ Beautiful Soup 4.4.0 文档
112
124
    soup.find(id="link3")
134
    soup.find(id="link3")
113
125
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
135
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
114
126
136
116
127
从文档中找到所有<a>标签的链接:
137
One common task is extracting all the URLs found within a page's <a> tags:
117
128
138
118
129
::
139
::
119
130
140
120
@@ -134,7 +144,7 @@ Beautiful Soup 4.4.0 文档
121
134
        # http://example.com/lacie
144
        # http://example.com/lacie
122
135
        # http://example.com/tillie
145
        # http://example.com/tillie
123
136
146
125
137
从文档中获取所有文字内容:
147
Another common task is extracting all the text from a page:
126
138
148
127
139
::
149
::
128
140
150
129
@@ -151,152 +161,142 @@ Beautiful Soup 4.4.0 文档
130
151
    #
161
    #
131
152
    # ...
162
    # ...
132
153
163
134
154
这是你想要的吗?别着急,还有更好用的
164
Does this look like what you need? If so, read on.
135
155
165
136
156
安装 Beautiful Soup
166
安装 Beautiful Soup
137
157
======================
167
======================
138
158
168
146
159
如果你用的是新版的Debain或ubuntu,那么可以通过系统的软件包管理来安装:
169
If you're using a recent version of Debian or Ubuntu, you can install Beautiful Soup with the system package manager:
140
160
141
161
``$ apt-get install Python-bs4``
142
162
143
163
Beautiful Soup 4 通过PyPi发布,所以如果你无法使用系统包管理安装,那么也可以通过 ``easy_install`` 或 ``pip`` 来安装.包的名字是 ``beautifulsoup4`` ,这个包兼容Python2和Python3.
144
164
145
165
``$ easy_install beautifulsoup4``
147
166
170
162
167
``$ pip install beautifulsoup4``
171
:kbd:`$ apt-get install python3-bs4`
149
168
150
169
(在PyPi中还有一个名字是 ``BeautifulSoup`` 的包,但那可能不是你想要的,那是 `Beautiful Soup3 <http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html>`_ 的发布版本,因为很多项目还在使用BS3, 所以 ``BeautifulSoup`` 包依然有效.但是如果你在编写新项目,那么你应该安装的 ``beautifulsoup4`` )
151
170
152
171
如果你没有安装 ``easy_install`` 或 ``pip`` ,那你也可以 `下载BS4的源码 <http://www.crummy.com/software/BeautifulSoup/download/4.x/>`_ ,然后通过setup.py来安装.
153
172
154
173
``$ Python setup.py install``
155
174
156
175
如果上述安装方法都行不通,Beautiful Soup的发布协议允许你将BS4的代码打包在你的项目中,这样无须安装即可使用.
157
176
158
177
作者在Python2.7和Python3.2的版本下开发Beautiful Soup, 理论上Beautiful Soup应该在所有当前的Python版本中正常工作
159
178
160
179
安装完成后的问题
161
180
-----------------
163
181
172
165
182
Beautiful Soup发布时打包成Python2版本的代码,在Python3环境下安装时,会自动转换成Python3的代码,如果没有一个安装的过程,那么代码就不会被转换.
173
Beautiful Soup 4 is published through PyPI, so if you can't install it with the system package manager,
you can install it with ``easy_install`` or ``pip``. The package name is ``beautifulsoup4``.
Make sure you use the ``pip`` or ``easy_install`` that matches your Python version
(they may be named ``pip3`` and ``easy_install3``).
169
183
177
171
184
如果代码抛出了 ``ImportError`` 的异常: "No module named HTMLParser", 这是因为你在Python3版本中执行Python2版本的代码.
178
:kbd:`$ easy_install beautifulsoup4`
172
185
179
173
180
:kbd:`$ pip install beautifulsoup4`
174
186
181
176
187
如果代码抛出了 ``ImportError`` 的异常: "No module named html.parser", 这是因为你在Python2版本中执行Python3版本的代码.
182
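(There is also a package named ``BeautifulSoup`` on PyPI, but it is probably not what you want. That is the
`Beautiful Soup3`_ release. Many projects still use BS3, so the ``BeautifulSoup``
package is still available. New projects should install ``beautifulsoup4``.)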
179
188
185
181
189
如果遇到上述2种情况,最好的解决方法是重新安装BeautifulSoup4.
186
If you don't have ``easy_install`` or ``pip`` installed, you can `download the BS4 source
<http://www.crummy.com/software/BeautifulSoup/download/4.x/>`_
and install it with ``setup.py``.
184
190
189
186
191
如果在ROOT_TAG_NAME = u'[document]'代码处遇到 ``SyntaxError`` "Invalid syntax"错误,需要将把BS4的Python代码版本从Python2转换到Python3. 可以重新安装BS4:
190
:kbd:`$ python setup.py install`
187
192
191
189
193
``$ Python3 setup.py install``
192
If all else fails, the Beautiful Soup license allows you to package the entire library
with your application, so you can use it without installing it.
191
194
194
195
195
或在bs4的目录中执行Python代码版本转换脚本
195
Beautiful Soup is developed with Python 3.10, but it should work with other recent versions.
193
196
194
197
``$ 2to3-3.2 -w bs4``
196
198
196
197
199
安装解析器
197
安装解析器
199
200
------------
198
--------------
200
201
199
202
202
Beautiful Soup支持Python标准库中的HTML解析器,还支持一些第三方的解析器,其中一个是 `lxml <http://lxml.de/>`_ .根据操作系统不同,可以选择下列方法来安装lxml:
200
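Beautiful Soup supports the HTML parser included in Python's standard library, and it also supports several third-party parsers.
One of them is the `lxml parser <http://lxml.de/>`_. Depending on your setup,
you can install lxml with one of these commands: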
205
203
203
207
204
``$ apt-get install Python-lxml``
204
:kbd:`$ apt-get install python-lxml`
208
205
205
210
206
``$ easy_install lxml``
206
:kbd:`$ easy_install lxml`
211
207
207
213
208
``$ pip install lxml``
208
:kbd:`$ pip install lxml`
214
209
209
216
210
另一个可供选择的解析器是纯Python实现的 `html5lib <http://code.google.com/p/html5lib/>`_ , html5lib的解析方式与浏览器相同,可以选择下列方法来安装html5lib:
210
Another alternative is the pure-Python `html5lib <http://code.google.com/p/html5lib/>`_ parser,
which parses HTML the way a web browser does. Depending on your setup, you can install html5lib with one of these commands:
218
211
212
220
212
``$ apt-get install Python-html5lib``
213
:kbd:`$ apt-get install python-html5lib`
221
213
214
223
214
``$ easy_install html5lib``
215
:kbd:`$ easy_install html5lib`
224
215
216
226
216
``$ pip install html5lib``
217
:kbd:`$ pip install html5lib`
227
217
218
229
218
下表列出了主要的解析器,以及它们的优缺点:
219
This table summarizes the advantages and disadvantages of each parser:
230
219
220
253
220
+-------------------------+------------------------------------------+------------------------------------------------+------------------------------------------------+
| Parser                  | Typical usage                            | Advantages                                     | Disadvantages                                  |
+=========================+==========================================+================================================+================================================+
| Python standard library | ``BeautifulSoup(markup, "html.parser")`` | - Built into Python's standard library         | - Less lenient before Python 2.7.3 or 3.2.2    |
|                         |                                          | - Decent speed                                 |                                                |
|                         |                                          | - Lenient                                      |                                                |
+-------------------------+------------------------------------------+------------------------------------------------+------------------------------------------------+
| lxml HTML parser        | ``BeautifulSoup(markup, "lxml")``        | - Fast                                         | - Requires a C library                         |
|                         |                                          | - Lenient                                      |                                                |
+-------------------------+------------------------------------------+------------------------------------------------+------------------------------------------------+
| lxml XML parser         | ``BeautifulSoup(markup, ["lxml-xml"])``  | - Fast                                         | - Requires a C library                         |
|                         | ``BeautifulSoup(markup, "xml")``         | - The only supported XML parser                |                                                |
+-------------------------+------------------------------------------+------------------------------------------------+------------------------------------------------+
| html5lib                | ``BeautifulSoup(markup, "html5lib")``    | - Extremely lenient                            | - Slow                                         |
|                         |                                          | - Parses pages the same way a web browser does | - No external dependencies                     |
|                         |                                          | - Creates valid HTML5                          |                                                |
+-------------------------+------------------------------------------+------------------------------------------------+------------------------------------------------+

+-------------------------+------------------------------------------+------------------------------------------------+------------------------------------------------+
| Parser                  | Typical usage                            | Advantages                                     | Disadvantages                                  |
+=========================+==========================================+================================================+================================================+
| Python standard library | ``BeautifulSoup(markup, "html.parser")`` | - Built into Python's standard library         | - Slower than lxml, less lenient than html5lib |
|                         |                                          | - Fairly fast                                  |                                                |
|                         |                                          | - Lenient                                      |                                                |
+-------------------------+------------------------------------------+------------------------------------------------+------------------------------------------------+
| lxml HTML parser        | ``BeautifulSoup(markup, "lxml")``        | - Fast                                         | - External C dependency                        |
|                         |                                          | - Lenient                                      |                                                |
+-------------------------+------------------------------------------+------------------------------------------------+------------------------------------------------+
| lxml XML parser         | ``BeautifulSoup(markup, ["lxml-xml"])``  | - Fast                                         | - External C dependency                        |
|                         | ``BeautifulSoup(markup, "xml")``         | - The only supported XML parser                |                                                |
+-------------------------+------------------------------------------+------------------------------------------------+------------------------------------------------+
| html5lib                | ``BeautifulSoup(markup, "html5lib")``    | - Extremely lenient                            | - Slow                                         |
|                         |                                          | - Parses pages the same way a web browser does | - External Python dependency                   |
|                         |                                          | - Creates valid HTML5                          |                                                |
+-------------------------+------------------------------------------+------------------------------------------------+------------------------------------------------+
271
242
239
273
243
推荐使用lxml作为解析器,因为效率更高. 在Python2.7.3之前的版本和Python3中3.2.2之前的版本,必须安装lxml或html5lib, 因为那些Python版本的标准库中内置的HTML解析方法不够稳定.
240
If you can, install and use lxml for speed.
274
244
241
276
245
提示: 如果一段HTML或XML文档格式不正确的话,那么在不同的解析器中返回的结果可能是不一样的,查看 `解析器之间的区别`_  了解更多细节
242
Note that if a document is invalid, different parsers may generate different Beautiful Soup trees for it.
See `解析器之间的区别`_ for details.
278
246
244
279
247
如何使用
245
如何使用
280
248
========
246
========
281
249
247
283
250
将一段文档传入BeautifulSoup 的构造方法,就能得到一个文档的对象, 可以传入一段字符串或一个文件句柄.
248
To parse a document, pass it into the :py:class:`BeautifulSoup` constructor. You can pass in a string
or an open filehandle:
285
251
250
286
252
::
251
::
287
253
252
288
254
    from bs4 import BeautifulSoup
253
    from bs4 import BeautifulSoup
289
255
254
291
256
    soup = BeautifulSoup(open("index.html"))
255
    with open("index.html") as fp:
292
256
        soup = BeautifulSoup(fp, 'html.parser')
293
257
257
295
258
    soup = BeautifulSoup("<html>data</html>")
258
    soup = BeautifulSoup("<html>a web page</html>", 'html.parser')
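
A minimal, self-contained sketch of both input styles, assuming Beautiful Soup 4 is installed and using only the built-in ``html.parser``. The file name ``index.html`` and the sample markup are invented for this illustration, and the file is written to a temporary directory so nothing is left behind:

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical sample page, used only for this illustration.
html = "<html><head><title>Demo</title></head><body><p>Sacr&eacute; bleu!</p></body></html>"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "index.html")
    with open(path, "w", encoding="utf-8") as out:
        out.write(html)

    # Parse from an open filehandle.
    with open(path, encoding="utf-8") as fp:
        soup_from_file = BeautifulSoup(fp, "html.parser")

# Parse the same markup from a string.
soup_from_string = BeautifulSoup(html, "html.parser")

print(soup_from_file.title.string)
print(soup_from_string.p.string)   # the &eacute; entity has become a Unicode character
```

Naming the parser explicitly keeps the result the same on machines that have different third-party parsers installed.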
296
259
259
298
260
首先,文档被转换成Unicode,并且HTML的实例都被转换成Unicode编码
260
First, the document is converted to Unicode, and HTML entities are converted to Unicode characters:
299
261
261
300
262
::
262
::
301
263
263
304
264
    BeautifulSoup("Sacr&eacute; bleu!")
264
    print(BeautifulSoup("<html><head></head><body>Sacr&eacute; bleu!</body></html>", "html.parser"))
305
265
    <html><head></head><body>Sacré bleu!</body></html>
265
    # <html><head></head><body>Sacré bleu!</body></html>
306
266
266
308
267
然后,Beautiful Soup选择最合适的解析器来解析这段文档,如果手动指定解析器那么Beautiful Soup会选择指定的解析器来解析文档.(参考 `解析成XML`_ ).
267
Beautiful Soup then parses the document using the best available parser. If you specify a parser, Beautiful Soup
uses that one instead (see `解析成XML`_).
310
268
269
311
269
对象的种类
270
对象的种类
312
270
==========
271
==========
313
271
272
316
272
Beautiful Soup将复杂HTML文档转换成一个复杂的树形结构,每个节点都是Python对象,所有对象可以归纳为4种:
273
Beautiful Soup 将复杂的 HTML 文档转换成一个复杂的由 Python 对象构成的树形结构，但处理对象
317
273
``Tag`` , ``NavigableString`` , ``BeautifulSoup`` , ``Comment`` .
274
的过程只包含 4 种类型的对象: :py:class:`Tag`, :py:class:`NavigableString`, 
318
275
:py:class:`BeautifulSoup`, 和 :py:class:`Comment`。
319
274
276
324
275
Tag
277
:py:class:`Tag`
325
276
-----
278
``Tag`` 对象与 XML 或 HTML 原生文档中的 tag 相同:
322
277
323
278
``Tag`` 对象与XML或HTML原生文档中的tag相同:
326
279
279
327
280
::
280
::
328
281
281
330
282
    soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
282
    soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
331
283
    tag = soup.b
283
    tag = soup.b
332
284
    type(tag)
284
    type(tag)
333
285
    # <class 'bs4.element.Tag'>
285
    # <class 'bs4.element.Tag'>
334
286
286
336
287
Tag有很多方法和属性,在 `遍历文档树`_ 和 `搜索文档树`_ 中有详细解释.现在介绍一下tag中最重要的属性: name和attributes
287
Tags have a lot of attributes and methods, which are covered in detail in `遍历文档树`_ and `搜索文档树`_.
For now, the most important features of a tag are its name and attributes.
338
288
289
341
289
Name
290
.. py:attribute:: name
340
290
.....
342
291
291
344
292
每个tag都有自己的名字,通过 ``.name`` 来获取:
292
Every tag has a name:
345
293
293
346
294
::
294
::
347
295
295
348
296
    tag.name
296
    tag.name
349
297
    # u'b'
297
    # u'b'
350
298
298
352
299
如果改变了tag的name,那将影响所有通过当前Beautiful Soup对象生成的HTML文档:
299
If you change a tag's name, the change is reflected in any HTML markup generated by Beautiful Soup:
353
300
300
354
301
::
301
::
355
302
302
356
@@ -304,107 +304,161 @@ Name
357
304
    tag
304
    tag
358
305
    # <blockquote class="boldest">Extremely bold</blockquote>
305
    # <blockquote class="boldest">Extremely bold</blockquote>
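
The effect of renaming can be checked with a self-contained sketch, assuming the built-in ``html.parser``. After the rename, a lookup by the old name finds nothing, while a lookup by the new name returns the same object:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b

tag.name = "blockquote"   # rename the tag in place; attributes and children are kept

print(soup)                      # markup generated from the soup uses the new name
print(soup.b)                    # no <b> tag is left in the tree
print(soup.blockquote is tag)    # the renamed tag is the same Python object
```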
359
306
306
362
307
Attributes
307
.. py:attribute:: attrs
361
308
............
363
309
308
365
310
一个tag可能有很多个属性. tag ``<b class="boldest">`` 有一个 "class" 的属性,值为 "boldest" . tag的属性的操作方法与字典相同:
309
An HTML or XML tag may have any number of attributes. The tag ``<b id="boldest">`` has
an attribute "id" whose value is "boldest". You can access a tag's attributes by treating the tag like a dictionary:
367
311
311
368
312
::
312
::
369
313
313
372
314
    tag['class']
314
   tag = BeautifulSoup('<b id="boldest">bold</b>', 'html.parser').b
373
315
    # u'boldest'
315
   tag['id']
374
316
   # 'boldest'
375
316
317
377
317
也可以直接"点"取属性, 比如: ``.attrs`` :
318
You can access that dictionary directly as ``.attrs``:
378
318
319
379
319
::
320
::
380
320
321
381
321
    tag.attrs
322
    tag.attrs
382
322
    # {u'class': u'boldest'}
323
    # {'id': 'boldest'}
383
323
324
385
324
tag的属性可以被添加,删除或修改. 再说一次, tag的属性操作方法与字典一样
325
You can add, remove, and modify a tag's attributes. Again, this is done by treating the tag as a dictionary:
386
325
326
387
326
::
327
::
388
327
328
393
328
    tag['class'] = 'verybold'
329
   tag['id'] = 'verybold'
394
329
    tag['id'] = 1
330
   tag['another-attribute'] = 1
395
330
    tag
331
   tag
396
331
    # <blockquote class="verybold" id="1">Extremely bold</blockquote>
332
   # <b another-attribute="1" id="verybold">bold</b>
397
332
333
402
333
    del tag['class']
334
   del tag['id']
403
334
    del tag['id']
335
   del tag['another-attribute']
404
335
    tag
336
   tag
405
336
    # <blockquote>Extremely bold</blockquote>
337
   # <b>bold</b>
406
337
338
411
338
    tag['class']
339
   tag['id']
412
339
    # KeyError: 'class'
340
   # KeyError: 'id'
413
340
    print(tag.get('class'))
341
   tag.get('id')
414
341
    # None
342
   # None
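
The dictionary-style operations above can be tried in one self-contained sketch. It assumes the built-in ``html.parser``; the variable names ``after_update`` and ``missing`` are only illustrative:

```python
from bs4 import BeautifulSoup

tag = BeautifulSoup('<b id="boldest">bold</b>', "html.parser").b

tag["id"] = "verybold"           # modify an existing attribute
tag["another-attribute"] = 1     # add a new attribute (non-string values are converted on output)
after_update = str(tag)

del tag["another-attribute"]     # remove an attribute
missing = tag.get("another-attribute")   # .get() returns None instead of raising KeyError

print(after_update)
print(tag.attrs)
print(missing)
```

Using ``.get()`` is the safer choice whenever an attribute may be absent, since plain indexing raises ``KeyError``.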
415
343
416
344
.. _multivalue:
417
342
345
418
343
多值属性
346
多值属性
420
344
``````````
347
----------
421
345
348
423
346
HTML 4定义了一系列可以包含多个值的属性.在HTML5中移除了一些,却增加更多.最常见的多值的属性是 class (一个tag可以有多个CSS的class). 还有一些属性 ``rel`` , ``rev`` , ``accept-charset`` , ``headers`` , ``accesskey`` . 在Beautiful Soup中多值属性的返回类型是list:
349
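HTML 4 defines a few attributes that can have multiple values. HTML 5 removes a couple of them, but defines a few more.
The most common multi-valued attribute is ``class`` (a tag can have more than one CSS class). Others
include ``rel``, ``rev``, ``accept-charset``, ``headers``, and ``accesskey``.
By default, Beautiful Soup parses the value of a multi-valued attribute into a list: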
427
347
353
428
348
::
354
::
429
349
355
433
350
    css_soup = BeautifulSoup('<p class="body strikeout"></p>')
356
   css_soup = BeautifulSoup('<p class="body"></p>', 'html.parser')
434
351
    css_soup.p['class']
357
   css_soup.p['class']
435
352
    # ["body", "strikeout"]
358
   # ['body']
436
359
  
437
360
   css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
438
361
   css_soup.p['class']
439
362
   # ['body', 'strikeout']
440
363
441
364
  If an attribute `looks` like it has more than one value, but it's not
442
365
  a multi-valued attribute as defined by any version of the HTML
443
366
  standard, Beautiful Soup will leave the attribute alone::
444
353
367
448
354
    css_soup = BeautifulSoup('<p class="body"></p>')
368
   id_soup = BeautifulSoup('<p id="my id"></p>', 'html.parser')
449
355
    css_soup.p['class']
369
   id_soup.p['id']
450
356
    # ["body"]
370
   # 'my id'
451
357
371
453
358
如果某个属性看起来好像有多个值,但在任何版本的HTML定义中都没有被定义为多值属性,那么Beautiful Soup会将这个属性作为字符串返回
372
如果某个属性看起来好像有多个值，但在任何版本的 HTML 定义中都没有将其定义为多值属性，
454
373
那么 Beautiful Soup 会将这个属性作为单值返回
455
359
374
456
360
::
375
::
457
361
376
461
362
    id_soup = BeautifulSoup('<p id="my id"></p>')
377
   id_soup = BeautifulSoup('<p id="my id"></p>', 'html.parser')
462
363
    id_soup.p['id']
378
   id_soup.p['id']
463
364
    # 'my id'
379
   # 'my id'
464
365
380
466
366
将tag转换成字符串时,多值属性会合并为一个值
381
When you turn a tag back into a string, multiple attribute values are consolidated:
467
367
382
468
368
::
383
::
469
369
384
471
370
    rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>')
385
    rel_soup = BeautifulSoup('<p>Back to the <a rel="index first">homepage</a></p>', 'html.parser')
472
371
    rel_soup.a['rel']
386
    rel_soup.a['rel']
474
372
    # ['index']
387
    # ['index', 'first']
475
373
    rel_soup.a['rel'] = ['index', 'contents']
388
    rel_soup.a['rel'] = ['index', 'contents']
476
374
    print(rel_soup.p)
389
    print(rel_soup.p)
477
375
    # <p>Back to the <a rel="index contents">homepage</a></p>
390
    # <p>Back to the <a rel="index contents">homepage</a></p>
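
As a small illustration of the difference between multi-valued and single-valued attributes (a sketch assuming the built-in ``html.parser``; the ``highlight`` class is a made-up value), the list returned for ``class`` can be edited like any Python list, and it is joined back into one string on output:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="body strikeout" id="my id"></p>', "html.parser")
p = soup.p

print(p["class"])   # class is multi-valued, so it is parsed into a list
print(p["id"])      # id is not multi-valued, so it stays a single string

p["class"].append("highlight")   # edit the list in place
print(str(p))                    # the list is joined with spaces on output
```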
478
376
391
480
377
如果转换的文档是XML格式,那么tag中不包含多值属性
392
To force every attribute to be parsed as a single string, pass
``multi_valued_attributes=None`` to the :py:class:`BeautifulSoup` constructor:
482
394
483
395
::
484
396
485
397
    no_list_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser', multi_valued_attributes=None)
486
398
    no_list_soup.p['class']
487
399
    # 'body strikeout'
488
400
489
401
You can use ``get_attribute_list`` to get a value that is always a list, whether or not the attribute is multi-valued:
490
402
491
403
::
492
404
493
405
    id_soup.p.get_attribute_list('id')
494
406
    # ["my id"]
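
The two options can be compared side by side. This sketch assumes a Beautiful Soup release recent enough to accept the ``multi_valued_attributes`` argument, and the built-in ``html.parser``:

```python
from bs4 import BeautifulSoup

markup = '<p class="body strikeout" id="my id"></p>'

default_soup = BeautifulSoup(markup, "html.parser")
single_soup = BeautifulSoup(markup, "html.parser", multi_valued_attributes=None)

print(default_soup.p["class"])                     # parsed into a list by default
print(single_soup.p["class"])                      # kept as one string
print(default_soup.p.get_attribute_list("id"))     # always a list, even for a single value
print(single_soup.p.get_attribute_list("class"))   # the single string wrapped in a list
```

``get_attribute_list`` is convenient when the calling code wants one code path for both kinds of attribute.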
495
407
496
408
If you parse a document as XML, there are no multi-valued attributes:
497
378
409
498
379
::
410
::
499
380
411
500
381
    xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
412
    xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
501
382
    xml_soup.p['class']
413
    xml_soup.p['class']
503
383
    # u'body strikeout'
414
    # 'body strikeout'
504
415
505
416
Again, you can change this with the ``multi_valued_attributes`` argument:
506
417
507
418
::
508
419
509
420
    class_is_multi = {'*': 'class'}
510
421
    xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml', multi_valued_attributes=class_is_multi)
511
422
    xml_soup.p['class']
512
423
    # ['body', 'strikeout']
513
424
514
425
You probably won't need to change the defaults, which implement the rules described in the HTML specification:
515
426
516
427
::
517
384
428
519
385
可以遍历的字符串
429
    from bs4.builder import builder_registry
520
430
    builder_registry.lookup('html').DEFAULT_CDATA_LIST_ATTRIBUTES
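
To see which attributes are treated as multi-valued by default, the mapping can be printed. In this sketch the key ``"*"`` is assumed to hold the attributes that apply to every tag, with the remaining keys being tag names; the exact contents may differ between Beautiful Soup versions:

```python
from bs4.builder import builder_registry

# Look up a registered builder class for HTML and read its default mapping.
builder_class = builder_registry.lookup("html")
defaults = builder_class.DEFAULT_CDATA_LIST_ATTRIBUTES

for tag_name in sorted(defaults):
    print(tag_name, sorted(defaults[tag_name]))
```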
521
431
522
432
.. py:class:: NavigableString
523
433
524
434
可遍历的字符串
525
386
----------------
435
----------------
526
387
436
528
388
字符串常被包含在tag内.Beautiful Soup用 ``NavigableString`` 类来包装tag中的字符串:
437
A string corresponds to a bit of text within a tag. Beautiful Soup uses the :py:class:`NavigableString`
class to contain these bits of text:
530
389
439
531
390
::
440
::
532
391
441
533
442
    soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
534
443
    tag = soup.b
535
392
    tag.string
444
    tag.string
537
393
    # u'Extremely bold'
445
    # 'Extremely bold'
538
394
    type(tag.string)
446
    type(tag.string)
539
395
    # <class 'bs4.element.NavigableString'>
447
    # <class 'bs4.element.NavigableString'>
540
396
448
542
397
一个 ``NavigableString`` 字符串与Python中的Unicode字符串相同,并且还支持包含在 `遍历文档树`_ 和 `搜索文档树`_ 中的一些特性. 通过 ``unicode()`` 方法可以直接将 ``NavigableString`` 对象转换成Unicode字符串:
449
A :py:class:`NavigableString` is just like a Python Unicode string,
except that it also supports some of the features described in `遍历文档树`_ and `搜索文档树`_. You can convert a
:py:class:`NavigableString` to a plain Unicode string with ``str()``:
545
398
452
546
399
::
453
::
547
400
454
549
401
    unicode_string = unicode(tag.string)
455
    unicode_string = str(tag.string)
550
402
    unicode_string
456
    unicode_string
552
403
    # u'Extremely bold'
457
    # 'Extremely bold'
553
404
    type(unicode_string)
458
    type(unicode_string)
555
405
    # <type 'unicode'>
459
    # <class 'str'>
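
A short check of the conversion, assuming the built-in ``html.parser``. It shows that the converted value is a plain ``str`` while the original :py:class:`NavigableString` still knows where it sits in the tree:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
navigable = soup.b.string

plain = str(navigable)   # a plain Python string with no link back to the tree

print(type(navigable) is NavigableString)
print(type(plain) is str)
print(navigable.parent.name)   # the NavigableString still knows its parent tag
print(plain == navigable)      # the text itself is unchanged
```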
556
406
460
558
407
tag中包含的字符串不能编辑,但是可以被替换成其它的字符串,用 `replace_with()`_ 方法:
461
You can't edit a string in place, but you can replace one string with another, using :ref:`replace_with()`:
559
408
462
560
409
::
463
::
561
410
464
562
@@ -412,16 +466,24 @@ tag中包含的字符串不能编辑,但是可以被替换成其它的字符串,
563
412
    tag
466
    tag
564
413
    # <blockquote>No longer bold</blockquote>
467
    # <blockquote>No longer bold</blockquote>
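
A self-contained sketch of the replacement, assuming the built-in ``html.parser``. It also shows that the replaced string is detached from the tree afterwards:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<blockquote>Extremely bold</blockquote>", "html.parser")
tag = soup.blockquote

old_string = tag.string
old_string.replace_with("No longer bold")   # swap the string node inside the tag

print(tag)
print(old_string.parent)   # the old string no longer belongs to the tree
```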
565
414
468
567
415
``NavigableString`` 对象支持 `遍历文档树`_ 和 `搜索文档树`_ 中定义的大部分属性, 并非全部.尤其是,一个字符串不能包含其它内容(tag能够包含字符串或是其它tag),字符串不支持 ``.contents`` 或 ``.string`` 属性或 ``find()`` 方法.
469
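:py:class:`NavigableString` supports most of the features described in `遍历文档树`_ and `搜索文档树`_,
but not all of them. In particular, since a string can't contain anything (the way a tag may contain a string or another tag), strings don't support
the ``.contents`` or ``.string`` attributes, or the ``find()`` method.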
570
416
472
572
417
如果想在Beautiful Soup之外使用 ``NavigableString`` 对象,需要调用 ``unicode()`` 方法,将该对象转换成普通的Unicode字符串,否则就算Beautiful Soup已方法已经执行结束,该对象的输出也会带有对象的引用地址.这样会浪费内存.
473
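If you want to use a :py:class:`NavigableString` outside of Beautiful Soup, call ``str()``
on it to turn it into a plain Python string. Otherwise the string carries a reference to the entire Beautiful Soup parse tree, even after you are done with Beautiful Soup,
which wastes memory.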
575
418
476
578
419
BeautifulSoup
477
.. py:class:: BeautifulSoup
579
420
----------------
478
580
479
-------------------------------
581
421
480
583
422
``BeautifulSoup`` 对象表示的是一个文档的全部内容.大部分时候,可以把它当作 ``Tag`` 对象,它支持 `遍历文档树`_ 和 `搜索文档树`_ 中描述的大部分的方法.
481
The ``BeautifulSoup`` object represents the parsed document as a whole. For most purposes, you can treat it as a ``Tag`` object;
it supports most of the methods described in `遍历文档树`_ and `搜索文档树`_.
585
423
483
587
424
因为 ``BeautifulSoup`` 对象并不是真正的HTML或XML的tag,所以它没有name和attribute属性.但有时查看它的 ``.name`` 属性是很方便的,所以 ``BeautifulSoup`` 对象包含了一个值为 "[document]" 的特殊属性 ``.name``
484
Since the ``BeautifulSoup`` object doesn't correspond to an actual HTML or XML tag, it has no name and no attributes.
But sometimes it's useful to look at its ``.name``, so it has been given the special
``.name`` "[document]":
590
425
487
591
426
::
488
::
592
427
489
593
@@ -431,24 +493,25 @@ BeautifulSoup
594
431
注释及特殊字符串
493
注释及特殊字符串
595
432
-----------------
494
-----------------
596
433
495
598
434
``Tag`` , ``NavigableString`` , ``BeautifulSoup`` 几乎覆盖了html和xml中的所有内容,但是还有一些特殊对象.容易让人担心的内容是文档的注释部分:
496
:py:class:`Tag`, :py:class:`NavigableString`, and :py:class:`BeautifulSoup`
cover almost everything you'll see in an HTML or XML file, but there are a few leftover bits. The main one you'll probably encounter is the comment:
600
435
498
601
436
::
499
::
602
437
500
603
438
    markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
501
    markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
605
439
    soup = BeautifulSoup(markup)
502
    soup = BeautifulSoup(markup, 'html.parser')
606
440
    comment = soup.b.string
503
    comment = soup.b.string
607
441
    type(comment)
504
    type(comment)
608
442
    # <class 'bs4.element.Comment'>
505
    # <class 'bs4.element.Comment'>
609
443
506
611
444
``Comment`` 对象是一个特殊类型的 ``NavigableString`` 对象:
507
The :py:class:`Comment` object is just a special type of :py:class:`NavigableString`:
612
445
508
613
446
::
509
::
614
447
510
615
448
    comment
511
    comment
616
449
    # u'Hey, buddy. Want to buy a used parser'
512
    # 'Hey, buddy. Want to buy a used parser?'
617
450
513
619
451
但是当它出现在HTML文档中时, ``Comment`` 对象会使用特殊的格式输出:
514
But when it appears as part of an HTML document, a :py:class:`Comment` is displayed with special formatting:
620
452
515
621
453
::
516
::
622
454
517
623
@@ -457,23 +520,61 @@ BeautifulSoup
624
457
    #  <!--Hey, buddy. Want to buy a used parser?-->
520
    #  <!--Hey, buddy. Want to buy a used parser?-->
625
458
    # </b>
521
    # </b>
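
Because :py:class:`Comment` is a subclass of :py:class:`NavigableString`, ``isinstance()`` can be used to separate comments from ordinary text. A minimal sketch with made-up markup, assuming the built-in ``html.parser``:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

markup = "<p>Visible text<!--hidden note--></p>"
soup = BeautifulSoup(markup, "html.parser")

# Every Comment is also a NavigableString, so test for Comment explicitly.
comments = [node for node in soup.p.contents if isinstance(node, Comment)]
text_only = [
    node for node in soup.p.contents
    if isinstance(node, NavigableString) and not isinstance(node, Comment)
]

print(comments)
print(text_only)
print(soup.p.get_text())   # get_text() leaves the comment out
```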
626
459
522
627
460
Beautiful Soup中定义的其它类型都可能会出现在XML的文档中: ``CData`` , ``ProcessingInstruction`` , ``Declaration`` , ``Doctype`` .与 ``Comment`` 对象类似,这些类都是 ``NavigableString`` 的子类,只是添加了一些额外的方法的字符串独享.下面是用CDATA来替代注释的例子:
628
461
523
630
462
::
524
针对 HTML 文档
631
525
^^^^^^^^^^^^^^^^^^
632
463
526
636
464
    from bs4 import CData
527
Beautiful Soup defines a few :py:class:`NavigableString` subclasses to hold strings found inside specific HTML tags.
637
465
    cdata = CData("A CDATA block")
528
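This makes it easier to pick out the main body of a page, by ignoring strings that probably represent programming directives embedded in the page.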
638
466
    comment.replace_with(cdata)
529
(These classes were added in Beautiful Soup 4.9.0, and the html5lib parser doesn't use them.)
639
467
530
644
468
    print(soup.b.prettify())
531
.. py:class:: Stylesheet
645
469
    # <b>
532
646
470
    #  <![CDATA[A CDATA block]]>
533
A :py:class:`NavigableString` subclass that represents embedded CSS stylesheets.
647
471
    # </b>
534
It holds every string found inside a ``<style>`` tag.
648
535
649
536
.. py:class:: Script
650
537
651
538
A :py:class:`NavigableString` subclass that represents embedded JavaScript.
It holds every string found inside a ``<script>`` tag.
653
540
654
541
.. py:class:: Template
655
542
656
543
A :py:class:`NavigableString` subclass that represents embedded HTML templates.
It holds every string found inside a ``<template>`` tag.
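
A brief sketch of these classes in action, assuming Beautiful Soup 4.9.0 or later and the built-in ``html.parser``; the markup is invented for the example. ``Stylesheet`` and ``Script`` are imported here from ``bs4.element``:

```python
from bs4 import BeautifulSoup
from bs4.element import Script, Stylesheet

markup = (
    "<html><head><style>p { color: red; }</style>"
    "<script>var x = 1;</script></head>"
    "<body><p>Hello</p></body></html>"
)
soup = BeautifulSoup(markup, "html.parser")

print(type(soup.style.string).__name__)    # class of the string inside <style>
print(type(soup.script.string).__name__)   # class of the string inside <script>

# In recent versions, get_text() is expected to skip stylesheet and script strings.
text = soup.get_text()
print(text)
```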
658
545
659
546
针对 XML 文档
660
547
^^^^^^^^^^^^^^^^^
661
548
662
549
Beautiful Soup defines some :py:class:`NavigableString` subclasses to hold special types of strings found in XML documents.
Like :py:class:`Comment`, these :py:class:`NavigableString` subclasses add something extra
to the string on output.
665
552
666
553
.. py:class:: Declaration
667
554
668
555
A :py:class:`NavigableString` subclass representing the
`declaration <https://www.w3.org/TR/REC-xml/#sec-prolog-dtd>`_ at the beginning of an XML document.
670
557
671
558
.. py:class:: Doctype
672
559
673
560
A :py:class:`NavigableString` subclass representing the
`document type
declaration <https://www.w3.org/TR/REC-xml/#dt-doctype>`_ that may appear near the beginning of an XML document.
676
563
677
564
.. py:class:: CData
678
565
679
566
A :py:class:`NavigableString` subclass that represents a
`CData section <https://www.w3.org/TR/REC-xml/#sec-cdata-sect>`_.
681
568
682
569
.. py:class:: ProcessingInstruction
683
570
684
571
A :py:class:`NavigableString` subclass that represents the contents of an `XML processing instruction
<https://www.w3.org/TR/REC-xml/#sec-pi>`_.
686
472
573
687
473
遍历文档树
574
遍历文档树
688
474
==========
575
==========
689
475
576
691
476
还拿"爱丽丝梦游仙境"的文档来做例子:
577
Here is the "Alice" document again:
692
477
578
693
478
::
579
::
694
479
580
695
@@ -499,14 +600,15 @@ Beautiful Soup中定义的其它类型都可能会出现在XML的文档中: ``CD
696
499
子节点
600
子节点
697
500
-------
601
-------
698
501
602
700
502
一个Tag可能包含多个字符串或其它的Tag,这些都是这个Tag的子节点.Beautiful Soup提供了许多操作和遍历子节点的属性.
603
A tag may contain strings and other tags. These elements are the tag's children. Beautiful Soup provides many ways of finding
and working with a tag's children.
702
503
605
704
504
注意: Beautiful Soup中字符串节点不支持这些属性,因为字符串没有子节点
606
Note that Beautiful Soup strings don't support any of these attributes, because a string can't have children.
705
505
607
708
506
tag的名字
608
Tag 的名字
709
507
..........
609
^^^^^^^^^^^^
710
508
610
712
509
操作文档树最简单的方法就是告诉它你想获取的tag的name.如果想获取 <head> 标签,只要用 ``soup.head`` :
611
The simplest way to navigate the parse tree is to say the name of the tag you want. If you want the <head> tag, just say ``soup.head``:
713
510
612
714
511
::
613
::
715
512
614
716
@@ -516,7 +618,8 @@ tag的名字
717
516
    soup.title
618
    soup.title
718
517
    # <title>The Dormouse's story</title>
619
    # <title>The Dormouse's story</title>
719
518
620
721
519
这是个获取tag的小窍门,可以在文档树的tag中多次调用这个方法.下面的代码可以获取<body>标签中的第一个<b>标签:
621
这是个获取tag的小窍门，可以在文档树的tag中多次调用这个方法。下面的代码可以获取 <body> 标签中的
722
622
第一个 <b> 标签:
723
520
623
724
521
::
624
::
725
522
625
726
@@ -530,7 +633,8 @@ tag的名字
727
530
    soup.a
633
    soup.a
728
531
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
634
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
729
532
635
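As a minimal sketch of the tag-name shortcut described above (the markup and variable names are made up for illustration and are not part of the "Alice" document), each attribute access returns the first matching tag at that step, and a tag name that does not occur in the document yields ``None``:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for this illustration.
doc = "<html><head><title>T</title></head><body><p><b>first</b> and <b>second</b></p></body></html>"
soup = BeautifulSoup(doc, "html.parser")

first_b = soup.body.b               # first <b> tag found beneath <body>
title_name = soup.head.title.name   # chained access: <head>, then its <title>
missing = soup.table                # no <table> in the document, so this is None

print(first_b)
print(title_name)
print(missing)
```

Because only the first match is returned, this shortcut cannot reach the second <b> tag.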
731
533
如果想要得到所有的<a>标签,或是通过名字得到比一个tag更多的内容的时候,就需要用到 `Searching the tree` 中描述的方法,比如: find_all()
636
To get every <a> tag, or anything more complicated than the first tag with a
given name, use one of the methods described in `搜索文档树`_, such as ``find_all()``:
733
534
638
734
535
::
639
::
735
536
640
736
@@ -539,10 +643,10 @@ tag的名字
737
539
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
643
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
738
540
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
644
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
739
541
645
742
542
.contents 和 .children
646
``.contents`` 和 ``.children``
743
543
........................
647
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
744
544
648
746
545
tag的 ``.contents`` 属性可以将tag的子节点以列表的方式输出:
649
Tag 的 ``.contents`` 属性可以将 tag 的全部子节点以列表的方式输出:
747
546
650
748
547
::
651
::
749
548
652
750
@@ -559,7 +663,8 @@ tag的 ``.contents`` 属性可以将tag的子节点以列表的方式输出:
751
559
    title_tag.contents
663
    title_tag.contents
752
560
    # [u'The Dormouse's story']
664
    # [u'The Dormouse's story']
753
561
665
755
562
``BeautifulSoup`` 对象本身一定会包含子节点,也就是说<html>标签也是 ``BeautifulSoup`` 对象的子节点:
666
:py:class:`BeautifulSoup` 对象一定会包含子节点。下面例子中 <html> 标签就是 ``BeautifulSoup`` 
756
667
对象的子节点:
757
563
668
758
564
::
669
::
759
565
670
760
@@ -568,7 +673,7 @@ tag的 ``.contents`` 属性可以将tag的子节点以列表的方式输出:
761
568
    soup.contents[0].name
673
    soup.contents[0].name
762
569
    # u'html'
674
    # u'html'
763
570
675
765
571
字符串没有 ``.contents`` 属性,因为字符串没有子节点:
676
字符串没有 ``.contents`` 属性，因为字符串没有子节点:
766
572
677
767
573
::
678
::
768
574
679
769
@@ -576,7 +681,7 @@ tag的 ``.contents`` 属性可以将tag的子节点以列表的方式输出:
770
576
    text.contents
681
    text.contents
771
577
    # AttributeError: 'NavigableString' object has no attribute 'contents'
682
    # AttributeError: 'NavigableString' object has no attribute 'contents'
772
578
683
774
579
通过tag的 ``.children`` 生成器,可以对tag的子节点进行循环:
684
通过 tag 的 ``.children`` 生成器，可以对 tag 的子节点进行循环:
775
580
685
776
581
::
686
::
777
582
687
778
@@ -584,17 +689,23 @@ tag的 ``.contents`` 属性可以将tag的子节点以列表的方式输出:
779
584
        print(child)
689
        print(child)
780
585
        # The Dormouse's story
690
        # The Dormouse's story
781
586
691
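The difference between ``.contents`` and ``.children`` can be sketched as follows. The one-line document is an assumption chosen for brevity: ``.contents`` is an ordinary list of direct children, while ``.children`` is meant for iteration.

```python
from bs4 import BeautifulSoup

# Hypothetical one-paragraph document for illustration.
soup = BeautifulSoup("<p>Hello <b>big</b> world</p>", "html.parser")
p_tag = soup.p

as_list = p_tag.contents   # a list holding the three direct children
kinds = [type(child).__name__ for child in p_tag.children]  # iterate over the same children

print(len(as_list))
print(kinds)
```

The string inside <b> does not appear in either result, because it is a child of <b>, not of <p>.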
784
587
.descendants
692
如果想要修改 tag 的子节点，使用 `修改文档树`_ 中描述的方法。不要直接修改 ``contents`` 列表:
785
588
..............
693
那样会导致细微且难以定位的问题。
786
694
787
695
``.descendants``
788
696
^^^^^^^^^^^^^^^^
789
589
697
791
590
``.contents`` 和 ``.children`` 属性仅包含tag的直接子节点.例如,<head>标签只有一个直接子节点<title>
698
``.contents`` 和 ``.children`` 属性仅包含 tag 的直接子节点。例如，<head> 标签只有一个直接
792
699
子节点 <title>
793
591
700
794
592
::
701
::
795
593
702
796
594
    head_tag.contents
703
    head_tag.contents
797
595
    # [<title>The Dormouse's story</title>]
704
    # [<title>The Dormouse's story</title>]
798
596
705
800
597
但是<title>标签也包含一个子节点:字符串 “The Dormouse’s story”,这种情况下字符串 “The Dormouse’s story”也属于<head>标签的子孙节点. ``.descendants`` 属性可以对所有tag的子孙节点进行递归循环 [5]_ :
706
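However, the <title> tag itself has a child: the string "The Dormouse's story".
That string is therefore also a descendant of the <head> tag. The
``.descendants`` attribute iterates recursively over all of a tag's descendants
[5]_: its direct children, the children of those children, and so on: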
803
598
709
804
599
::
710
::
805
600
711
806
@@ -603,7 +714,8 @@ tag的 ``.contents`` 属性可以将tag的子节点以列表的方式输出:
807
603
        # <title>The Dormouse's story</title>
714
        # <title>The Dormouse's story</title>
808
604
        # The Dormouse's story
715
        # The Dormouse's story
809
605
716
811
606
上面的例子中, <head>标签只有一个子节点,但是有2个子孙节点:<head>节点和<head>的子节点, ``BeautifulSoup`` 有一个直接子节点(<html>节点),却有很多子孙节点:
717
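In the example above, the <head> tag has only one child but two descendants:
the <title> tag and the <title> tag's child. The :py:class:`BeautifulSoup`
object has only one direct child (the <html> tag), but it has many descendants: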
813
607
719
814
608
::
720
::
815
609
721
816
@@ -612,17 +724,22 @@ tag的 ``.contents`` 属性可以将tag的子节点以列表的方式输出:
817
612
    len(list(soup.descendants))
724
    len(list(soup.descendants))
818
613
    # 25
725
    # 25
819
614
726
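A small sketch that counts children and descendants side by side may make the distinction concrete. It parses only the <head> fragment from the example, which is an assumption made to keep the numbers easy to check by hand:

```python
from bs4 import BeautifulSoup

# Only the <head> fragment is parsed here, to keep the counts small.
soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>", "html.parser")
head_tag = soup.head

children = head_tag.contents               # direct children only
descendants = list(head_tag.descendants)   # children, grandchildren, and so on

print(len(children), len(descendants))
for node in descendants:
    print(repr(node))
```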
820
615
.string
821
616
........
822
617
727
824
618
如果tag只有一个 ``NavigableString`` 类型子节点,那么这个tag可以使用 ``.string`` 得到子节点:
728
.. _.string:
825
729
826
730
``.string``
827
731
^^^^^^^^^^^
828
732
829
733
如果 tag 只有一个 ``NavigableString`` 类型子节点，那么这个tag可以使用 ``.string`` 
830
734
得到子节点:
831
619
735
832
620
::
736
::
833
621
737
834
622
    title_tag.string
738
    title_tag.string
835
623
    # u'The Dormouse's story'
739
    # u'The Dormouse's story'
836
624
740
838
625
如果一个tag仅有一个子节点,那么这个tag也可以使用 ``.string`` 方法,输出结果与当前唯一子节点的 ``.string`` 结果相同:
741
If a tag's only child is another tag, and that child has a ``.string``, then
the parent tag is considered to have the same ``.string`` as its child:
840
626
743
841
627
::
744
::
842
628
745
843
@@ -632,17 +749,20 @@ tag的 ``.contents`` 属性可以将tag的子节点以列表的方式输出:
844
632
    head_tag.string
749
    head_tag.string
845
633
    # u'The Dormouse's story'
750
    # u'The Dormouse's story'
846
634
751
848
635
如果tag包含了多个子节点,tag就无法确定 ``.string`` 方法应该调用哪个子节点的内容, ``.string`` 的输出结果是 ``None`` :
752
如果tag包含了多个子节点，tag就无法确定 ``.string`` 方法应该调用哪个子节点的内容， 
849
753
``.string`` 的输出结果是 ``None`` :
850
636
754
851
637
::
755
::
852
638
756
853
639
    print(soup.html.string)
757
    print(soup.html.string)
854
640
    # None
758
    # None
855
641
759
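The three cases for ``.string`` can be checked in one short sketch. The markup below is hypothetical; ``get_text()`` is shown as one way to collect text when ``.string`` is ``None``.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first <p> has a single child chain, the second has two children.
soup = BeautifulSoup("<div><p><b>only</b></p><p>one <i>two</i></p></div>", "html.parser")
first_p, second_p = soup.find_all("p")

single = first_p.string          # <p> has one child (<b>), which has one string
ambiguous = second_p.string      # two children, so .string is None
all_text = second_p.get_text()   # concatenates every string beneath the tag

print(repr(single), repr(ambiguous), repr(all_text))
```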
856
760
.. _string-generators:
857
761
858
642
.strings 和 stripped_strings
762
.strings 和 stripped_strings
860
643
.............................
763
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
861
644
764
863
645
如果tag中包含多个字符串 [2]_ ,可以使用 ``.strings`` 来循环获取:
765
如果 tag 中包含多个字符串 [2]_ ,可以使用 ``.strings`` 来循环获取:
864
646
766
865
647
::
767
::
866
648
768
867
@@ -663,7 +783,7 @@ tag的 ``.contents`` 属性可以将tag的子节点以列表的方式输出:
868
663
        # u'...'
783
        # u'...'
869
664
        # u'\n'
784
        # u'\n'
870
665
785
872
666
输出的字符串中可能包含了很多空格或空行,使用 ``.stripped_strings`` 可以去除多余空白内容:
786
输出的字符串中可能包含了很多空格或空行，使用 ``.stripped_strings`` 可以去除多余空白内容:
873
667
787
874
668
::
788
::
875
669
789
876
@@ -680,17 +800,20 @@ tag的 ``.contents`` 属性可以将tag的子节点以列表的方式输出:
877
680
        # u';\nand they lived at the bottom of a well.'
800
        # u';\nand they lived at the bottom of a well.'
878
681
        # u'...'
801
        # u'...'
879
682
802
881
683
全部是空格的行会被忽略掉,段首和段末的空白会被删除
803
全部是空格的行会被忽略掉，段首和段末的空白会被删除
882
684
804
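To see what ``.stripped_strings`` removes, the following sketch compares it with ``.strings`` on a small paragraph. The markup is invented for this example and contains deliberate leading whitespace and a trailing newline.

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph with extra whitespace around the text.
doc = "<p>\n  Once upon a time\n  <a>Elsie</a>,\n  <a>Lacie</a>\n</p>"
soup = BeautifulSoup(doc, "html.parser")

raw = list(soup.p.strings)             # every string, whitespace included
clean = list(soup.p.stripped_strings)  # stripped, whitespace-only strings dropped

print(len(raw))
print(clean)
```

The final string of the paragraph consists of a single newline, so it appears in ``raw`` but not in ``clean``.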
883
685
父节点
805
父节点
884
686
-------
806
-------
885
687
807
887
688
继续分析文档树,每个tag或字符串都有父节点:被包含在某个tag中
808
继续分析文档树，每个 tag 或字符串都有父节点: 包含当前内容的 tag
888
809
889
810
.. _.parent:
890
689
811
891
690
.parent
812
.parent
893
691
........
813
^^^^^^^^^^^^^
894
692
814
896
693
通过 ``.parent`` 属性来获取某个元素的父节点.在例子“爱丽丝”的文档中,<head>标签是<title>标签的父节点:
815
通过 ``.parent`` 属性来获取某个元素的父节点。在例子“爱丽丝”的文档中，<head> 标签是
897
816
<title> 标签的父节点:
898
694
817
899
695
::
818
::
900
696
819
901
@@ -700,14 +823,14 @@ tag的 ``.contents`` 属性可以将tag的子节点以列表的方式输出:
902
700
    title_tag.parent
823
    title_tag.parent
903
701
    # <head><title>The Dormouse's story</title></head>
824
    # <head><title>The Dormouse's story</title></head>
904
702
825
906
703
文档title的字符串也有父节点:<title>标签
826
文档的 title 字符串也有父节点: <title> 标签
907
704
827
908
705
::
828
::
909
706
829
910
707
    title_tag.string.parent
830
    title_tag.string.parent
911
708
    # <title>The Dormouse's story</title>
831
    # <title>The Dormouse's story</title>
912
709
832
914
710
文档的顶层节点比如<html>的父节点是 ``BeautifulSoup`` 对象:
833
文档的顶层节点比如 <html> 的父节点是 ``BeautifulSoup`` 对象:
915
711
834
916
712
::
835
::
917
713
836
918
@@ -722,10 +845,13 @@ tag的 ``.contents`` 属性可以将tag的子节点以列表的方式输出:
919
722
    print(soup.parent)
845
    print(soup.parent)
920
723
    # None
846
    # None
921
724
847
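The ``.parent`` relationships above can be followed by hand until ``None`` is reached. In this sketch the helper ``ancestors`` is a hypothetical function written for the illustration, not part of Beautiful Soup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><head><title>T</title></head></html>", "html.parser")
title_tag = soup.title

def ancestors(node):
    """Hypothetical helper: collect the names of all ancestors by following .parent."""
    chain = []
    current = node.parent
    while current is not None:
        chain.append(current.name)
        current = current.parent
    return chain

print(title_tag.parent.name)
print(title_tag.string.parent is title_tag)
print(ancestors(title_tag))
```

The last name in the chain is ``[document]``, the name of the ``BeautifulSoup`` object itself, whose own ``.parent`` is ``None``.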
922
848
.. _.parents:
923
849
924
725
.parents
850
.parents
926
726
..........
851
^^^^^^^^^^^^
927
727
852
929
728
通过元素的 ``.parents`` 属性可以递归得到元素的所有父辈节点,下面的例子使用了 ``.parents`` 方法遍历了<a>标签到根节点的所有节点.
853
The ``.parents`` attribute lets you iterate over all of an element's ancestors.
The example below uses ``.parents`` to travel from an <a> tag up to the top of
the document.
931
729
855
932
730
::
856
::
933
731
857
934
@@ -750,10 +876,8 @@ tag的 ``.contents`` 属性可以将tag的子节点以列表的方式输出:
935
750
876
936
751
::
877
::
937
752
878
939
753
    sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></b></a>")
879
    sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", 'html.parser')
940
754
    print(sibling_soup.prettify())
880
    print(sibling_soup.prettify())
941
755
    # <html>
942
756
    #  <body>
943
757
    #   <a>
881
    #   <a>
944
758
    #    <b>
882
    #    <b>
945
759
    #     text1
883
    #     text1
946
@@ -762,15 +886,14 @@ tag的 ``.contents`` 属性可以将tag的子节点以列表的方式输出:
947
762
    #     text2
886
    #     text2
948
763
    #    </c>
887
    #    </c>
949
764
    #   </a>
888
    #   </a>
950
765
    #  </body>
951
766
    # </html>
952
767
889
954
768
因为<b>标签和<c>标签是同一层:他们是同一个元素的子节点,所以<b>和<c>可以被称为兄弟节点.一段文档以标准格式输出时,兄弟节点有相同的缩进级别.在代码中也可以使用这种关系.
890
因为 <b> 标签和 <c> 标签是同一层: 他们是同一个元素的子节点，所以 <b> 和 <c> 可以被称为兄弟节点。
955
891
一段文档以标准格式输出时，兄弟节点有相同的缩进级别。在代码中也可以使用这种关系。
956
769
892
957
770
.next_sibling 和 .previous_sibling
893
.next_sibling 和 .previous_sibling
959
771
....................................
894
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
960
772
895
962
773
在文档树中,使用 ``.next_sibling`` 和 ``.previous_sibling`` 属性来查询兄弟节点:
896
在文档树中，使用 ``.next_sibling`` 和 ``.previous_sibling`` 属性来查询兄弟节点:
963
774
897
964
775
::
898
::
965
776
899
966
@@ -780,7 +903,9 @@ tag的 ``.contents`` 属性可以将tag的子节点以列表的方式输出:
967
780
    sibling_soup.c.previous_sibling
903
    sibling_soup.c.previous_sibling
968
781
    # <b>text1</b>
904
    # <b>text1</b>
969
782
905
971
783
<b>标签有 ``.next_sibling`` 属性,但是没有 ``.previous_sibling`` 属性,因为<b>标签在同级节点中是第一个.同理,<c>标签有 ``.previous_sibling`` 属性,却没有 ``.next_sibling`` 属性:
906
<b> 标签有 ``.next_sibling`` 属性，但是没有 ``.previous_sibling`` 属性，
972
907
因为 <b> 标签在同级节点中是第一个。同理，<c>标签有 ``.previous_sibling`` 属性，
973
908
却没有 ``.next_sibling`` 属性:
974
784
909
975
785
::
910
::
976
786
911
977
@@ -789,7 +914,7 @@ tag的 ``.contents`` 属性可以将tag的子节点以列表的方式输出:
978
789
    print(sibling_soup.c.next_sibling)
914
    print(sibling_soup.c.next_sibling)
979
790
    # None
915
    # None
980
791
916
982
792
例子中的字符串“text1”和“text2”不是兄弟节点,因为它们的父节点不同:
917
例子中的字符串 "text1" 和 "text2" 不是兄弟节点，因为它们的父节点不同:
983
793
918
984
794
::
919
::
985
795
920
986
@@ -799,7 +924,8 @@ tag的 ``.contents`` 属性可以将tag的子节点以列表的方式输出:
987
799
    print(sibling_soup.b.string.next_sibling)
924
    print(sibling_soup.b.string.next_sibling)
988
800
    # None
925
    # None
989
801
926
991
802
实际文档中的tag的 ``.next_sibling`` 和 ``.previous_sibling`` 属性通常是字符串或空白. 看看“爱丽丝”文档:
927
实际文档中的 tag 的 ``.next_sibling`` 和 ``.previous_sibling`` 属性通常是字符串或空白。
992
928
看看“爱丽丝”文档:
993
803
929
994
804
::
930
::
995
805
931
996
@@ -807,7 +933,8 @@ tag的 ``.contents`` 属性可以将tag的子节点以列表的方式输出:
997
807
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
933
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
998
808
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
934
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
999
809
935
1001
810
如果以为第一个<a>标签的 ``.next_sibling`` 结果是第二个<a>标签,那就错了,真实结果是第一个<a>标签和第二个<a>标签之间的顿号和换行符:
936
You might expect the ``.next_sibling`` of the first <a> tag to be the second
<a> tag. It is not: the actual result is the comma and newline that separate
the first <a> tag from the second:
1003
811
938
1004
812
::
939
::
1005
813
940
1006
@@ -825,8 +952,10 @@ tag的 ``.contents`` 属性可以将tag的子节点以列表的方式输出:
1007
825
    link.next_sibling.next_sibling
952
    link.next_sibling.next_sibling
1008
826
    # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
953
    # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
1009
827
954
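When only the next tag is wanted, skipping the text between tags, one option is ``find_next_sibling()``. The sketch below uses a shortened, hypothetical version of the link markup:

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the link markup.
doc = "<p><a id='link1'>Elsie</a>,\n<a id='link2'>Lacie</a></p>"
soup = BeautifulSoup(doc, "html.parser")
link = soup.a

text_between = link.next_sibling               # the string between the two tags
by_hand = link.next_sibling.next_sibling       # step over the string manually
by_search = link.find_next_sibling("a")        # or search the following siblings for an <a>

print(repr(text_between))
print(by_hand["id"], by_search["id"])
```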
1010
955
.. _sibling-generators:
1011
956
1012
828
.next_siblings 和 .previous_siblings
957
.next_siblings 和 .previous_siblings
1014
829
......................................
958
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1015
830
959
1016
831
通过 ``.next_siblings`` 和 ``.previous_siblings`` 属性可以对当前节点的兄弟节点迭代输出:
960
通过 ``.next_siblings`` 和 ``.previous_siblings`` 属性可以对当前节点的兄弟节点迭代输出:
1017
832
961
1018
@@ -834,21 +963,19 @@ tag的 ``.contents`` 属性可以将tag的子节点以列表的方式输出:
1019
834
963
1020
835
    for sibling in soup.a.next_siblings:
964
    for sibling in soup.a.next_siblings:
1021
836
        print(repr(sibling))
965
        print(repr(sibling))
1028
837
        # u',\n'
966
    # ',\n'
1029
838
        # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
967
    # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
1030
839
        # u' and\n'
968
    # ' and\n'
1031
840
        # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
969
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
1032
841
        # u'; and they lived at the bottom of a well.'
970
    # '; and they lived at the bottom of a well.'
1027
842
        # None
1033
843
971
1034
844
    for sibling in soup.find(id="link3").previous_siblings:
972
    for sibling in soup.find(id="link3").previous_siblings:
1035
845
        print(repr(sibling))
973
        print(repr(sibling))
1042
846
        # ' and\n'
974
    # ' and\n'
1043
847
        # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
975
    # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
1044
848
        # u',\n'
976
    # ',\n'
1045
849
        # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
977
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
1046
850
        # u'Once upon a time there were three little sisters; and their names were\n'
978
    # 'Once upon a time there were three little sisters; and their names were\n'
1041
851
        # None
1047
852
979
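Since sibling iteration yields strings as well as tags, a common follow-up is to keep only the tags. A minimal sketch, assuming a simplified sentence with three links:

```python
from bs4 import BeautifulSoup, Tag

# Simplified, hypothetical sentence with three links.
doc = "<p>Names: <a>Elsie</a>, <a>Lacie</a> and <a>Tillie</a>.</p>"
soup = BeautifulSoup(doc, "html.parser")
first_link = soup.a

everything_after = list(first_link.next_siblings)   # strings and tags, in document order
tags_after = [s.get_text() for s in first_link.next_siblings if isinstance(s, Tag)]

last_link = soup.find_all("a")[-1]
nearest_before = next(iter(last_link.previous_siblings))  # closest sibling comes first

print(len(everything_after), tags_after, repr(nearest_before))
```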
1048
853
回退和前进
980
回退和前进
1049
854
----------
981
----------
1050
@@ -860,14 +987,20 @@ tag的 ``.contents`` 属性可以将tag的子节点以列表的方式输出:
1051
860
    <html><head><title>The Dormouse's story</title></head>
987
    <html><head><title>The Dormouse's story</title></head>
1052
861
    <p class="title"><b>The Dormouse's story</b></p>
988
    <p class="title"><b>The Dormouse's story</b></p>
1053
862
989
1055
863
HTML解析器把这段字符串转换成一连串的事件: "打开<html>标签","打开一个<head>标签","打开一个<title>标签","添加一段字符串","关闭<title>标签","打开<p>标签",等等.Beautiful Soup提供了重现解析器初始化过程的方法.
990
HTML解析器把这段字符串转换成一连串的事件: "打开<html>标签","打开一个<head>标签",
1056
991
"打开一个<title>标签","添加一段字符串","关闭<title>标签","打开<p>标签",等等。
1057
992
Beautiful Soup提供了重现解析器初始化过程的方法。
1058
993
1059
994
.. _element-generators:
1060
864
995
1061
865
.next_element 和 .previous_element
996
.next_element 和 .previous_element
1063
866
...................................
997
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1064
867
998
1066
868
``.next_element`` 属性指向解析过程中下一个被解析的对象(字符串或tag),结果可能与 ``.next_sibling`` 相同,但通常是不一样的.
999
``.next_element`` 属性指向解析过程中下一个被解析的对象(字符串或tag),
1067
1000
结果可能与 ``.next_sibling`` 相同，但通常是不一样的。
1068
869
1001
1070
870
这是“爱丽丝”文档中最后一个<a>标签,它的 ``.next_sibling`` 结果是一个字符串,因为当前的解析过程 [2]_ 因为当前的解析过程因为遇到了<a>标签而中断了:
1002
This is the last <a> tag in the "Alice" document. Its ``.next_sibling`` is a
string: the remainder of the sentence [2]_ that was interrupted by the start of
the <a> tag:
1072
871
1004
1073
872
::
1005
::
1074
873
1006
1075
@@ -878,16 +1011,20 @@ HTML解析器把这段字符串转换成一连串的事件: "打开<html>标签"
1076
878
    last_a_tag.next_sibling
1011
    last_a_tag.next_sibling
1077
879
    # '; and they lived at the bottom of a well.'
1012
    # '; and they lived at the bottom of a well.'
1078
880
1013
1080
881
但这个<a>标签的 ``.next_element`` 属性结果是在<a>标签被解析之后的解析内容,不是<a>标签后的句子部分,应该是字符串"Tillie":
1014
但这个 <a> 标签的 ``.next_element`` 属性结果是在 <a> 标签被解析之后的解析内容，
1081
1015
不是 <a> 标签后的句子部分，而是字符串 "Tillie":
1082
882
1016
1083
883
::
1017
::
1084
884
1018
1085
885
    last_a_tag.next_element
1019
    last_a_tag.next_element
1086
886
    # u'Tillie'
1020
    # u'Tillie'
1087
887
1021
1089
888
这是因为在原始文档中,字符串“Tillie” 在分号前出现,解析器先进入<a>标签,然后是字符串“Tillie”,然后关闭</a>标签,然后是分号和剩余部分.分号与<a>标签在同一层级,但是字符串“Tillie”会被先解析.
1022
这是因为在原始文档中，字符串 “Tillie” 在分号前出现，解析器先进入 <a> 标签，
1090
1023
然后是字符串 “Tillie”，然后关闭 </a> 标签，然后是分号和剩余部分。
1091
1024
分号与 <a> 标签在同一层级，但是字符串 “Tillie” 会先被解析。
1092
889
1025
1094
890
``.previous_element`` 属性刚好与 ``.next_element`` 相反,它指向当前被解析的对象的前一个解析对象:
1026
``.previous_element`` 属性刚好与 ``.next_element`` 相反，
1095
1027
它指向当前被解析的对象的前一个解析对象:
1096
891
1028
1097
892
::
1029
::
1098
893
1030
1099
@@ -897,9 +1034,10 @@ HTML解析器把这段字符串转换成一连串的事件: "打开<html>标签"
1100
897
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
1034
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
1101
898
1035
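The contrast between ``.next_sibling`` and ``.next_element`` can be reproduced on a fragment as small as one link followed by text. The fragment is a trimmed-down assumption based on the "Alice" sentence:

```python
from bs4 import BeautifulSoup

# Trimmed-down fragment: one link followed by the rest of the sentence.
doc = "<p><a>Tillie</a>; and they lived at the bottom of a well.</p>"
soup = BeautifulSoup(doc, "html.parser")
a_tag = soup.a

sibling = a_tag.next_sibling    # same level in the tree: the text after </a>
element = a_tag.next_element    # next thing parsed: the string inside <a>

print(repr(sibling))
print(repr(element))
print(element.previous_element is a_tag)
```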
1102
899
.next_elements 和 .previous_elements
1036
.next_elements 和 .previous_elements
1104
900
.....................................
1037
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1105
901
1038
1107
902
通过 ``.next_elements`` 和 ``.previous_elements`` 的迭代器就可以向前或向后访问文档的解析内容,就好像文档正在被解析一样:
1039
通过 ``.next_elements`` 和 ``.previous_elements`` 的迭代器就可以向前或向后
1108
1040
访问文档的解析内容，就好像文档正在被解析一样:
1109
903
1041
1110
904
::
1042
::
1111
905
1043
1112
@@ -914,9 +1052,10 @@ HTML解析器把这段字符串转换成一连串的事件: "打开<html>标签"
1113
914
    # None
1052
    # None
1114
915
1053
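To visualise parse order, the sketch below lists everything that follows the <html> tag. The tiny document and the ``label`` helper are assumptions made for the illustration:

```python
from bs4 import BeautifulSoup, Tag

# Tiny hypothetical document.
doc = "<html><head><title>T</title></head><body><p>x</p></body></html>"
soup = BeautifulSoup(doc, "html.parser")

def label(element):
    """Hypothetical helper: tag name for tags, the text itself for strings."""
    return element.name if isinstance(element, Tag) else str(element)

forward = [label(el) for el in soup.html.next_elements]
print(forward)
```

Each tag is immediately followed by its own contents, which is the order in which the parser encountered them.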
1115
916
搜索文档树
1054
搜索文档树
1117
917
==========
1055
============
1118
918
1056
1120
919
Beautiful Soup定义了很多搜索方法,这里着重介绍2个: ``find()`` 和 ``find_all()`` .其它方法的参数和用法类似,请读者举一反三.
1057
Beautiful Soup 定义了很多相似的文档搜索方法，这里着重介绍2个: ``find()`` 和 ``find_all()``，
1121
1058
其它方法的参数和用法类似，所以一笔带过。
1122
920
1059
1123
921
再以“爱丽丝”文档作为例子:
1060
再以“爱丽丝”文档作为例子:
1124
922
1061
1125
@@ -939,29 +1078,38 @@ Beautiful Soup定义了很多搜索方法,这里着重介绍2个: ``find()`` 和
1126
939
    from bs4 import BeautifulSoup
1078
    from bs4 import BeautifulSoup
1127
940
    soup = BeautifulSoup(html_doc, 'html.parser')
1079
    soup = BeautifulSoup(html_doc, 'html.parser')
1128
941
1080
1130
942
使用 ``find_all()`` 类似的方法可以查找到想要查找的文档内容
1081
使用 ``find_all()`` 这种过滤方法，就可以检索想要查找的文档内容。
1131
1082
1132
1083
过滤器类型
1133
1084
-------------
1134
943
1085
1137
944
过滤器
1086
介绍 ``find_all()`` 或类似方法前，先介绍一下这些方法可以使用哪些过滤器的类型 [3]_,
1138
945
------
1087
这些过滤器在搜索的 API 中反复出现。过滤器可以作用在 tag 的 name 上，节点的属性上，
1139
1088
字符串上或与他们混合使用。
1140
946
1089
1142
947
介绍 ``find_all()`` 方法前,先介绍一下过滤器的类型 [3]_ ,这些过滤器贯穿整个搜索的API.过滤器可以被用在tag的name中,节点的属性中,字符串中或他们的混合中.
1090
.. _字符串:
1143
948
1091
1144
949
字符串
1092
字符串
1146
950
............
1093
^^^^^^^^
1147
951
1094
1149
952
最简单的过滤器是字符串.在搜索方法中传入一个字符串参数,Beautiful Soup会查找与字符串完整匹配的内容,下面的例子用于查找文档中所有的<b>标签:
1095
最简单的过滤器是字符串。在搜索方法中传入一个字符串参数，Beautiful Soup
1150
1096
会查找与字符串完整匹配的内容，下面的例子用于查找文档中所有的 <b> 标签:
1151
953
1097
1152
954
::
1098
::
1153
955
1099
1154
956
    soup.find_all('b')
1100
    soup.find_all('b')
1155
957
    # [<b>The Dormouse's story</b>]
1101
    # [<b>The Dormouse's story</b>]
1156
958
1102
1158
959
如果传入字节码参数,Beautiful Soup会当作UTF-8编码,可以传入一段Unicode 编码来避免Beautiful Soup解析编码出错
1103
If you pass in a byte string, Beautiful Soup assumes it is encoded as UTF-8.
You can avoid this by passing in a Unicode string instead.
1160
1105
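A string filter matches the tag name exactly, not as a prefix. The following sketch shows this on hypothetical markup, and decodes a byte string explicitly before searching, which is one way to follow the advice above:

```python
from bs4 import BeautifulSoup

# Hypothetical markup containing three tag names that start with "b".
soup = BeautifulSoup("<body><b>bold</b><blockquote>quoted</blockquote></body>", "html.parser")

exact = [tag.name for tag in soup.find_all("b")]   # only <b>, not <body> or <blockquote>

raw_name = b"blockquote"                           # a name that arrived as bytes
decoded = [tag.name for tag in soup.find_all(raw_name.decode("utf-8"))]

print(exact, decoded)
```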
1161
1106
.. _正则表达式:
1162
960
1107
1163
961
正则表达式
1108
正则表达式
1165
962
..........
1109
^^^^^^^^^^^
1166
963
1110
1168
964
如果传入正则表达式作为参数,Beautiful Soup会通过正则表达式的 ``search()`` 来匹配内容.下面例子中找出所有以b开头的标签,这表示<body>和<b>标签都应该被找到:
1111
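If you pass in a regular expression object, Beautiful Soup filters against it
using its ``search()`` method. The example below finds every tag whose name
starts with "b", so both the <body> tag and the <b> tag are found: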
1170
965
1113
1171
966
::
1114
::
1172
967
1115
1173
@@ -971,7 +1119,7 @@ Beautiful Soup定义了很多搜索方法,这里着重介绍2个: ``find()`` 和
1174
971
    # body
1119
    # body
1175
972
    # b
1120
    # b
1176
973
1121
1178
974
下面代码找出所有名字中包含"t"的标签:
1122
下面代码找出所有名字中包含 "t" 的标签:
1179
975
1123
1180
976
::
1124
::
1181
977
1125
1182
@@ -980,10 +1128,13 @@ Beautiful Soup定义了很多搜索方法,这里着重介绍2个: ``find()`` 和
1183
980
    # html
1128
    # html
1184
981
    # title
1129
    # title
1185
982
1130
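Because the pattern is applied with ``search()``, it can match anywhere in the tag name unless it is anchored. A short sketch on a made-up document:

```python
import re
from bs4 import BeautifulSoup

# Made-up document with five different tag names.
doc = "<html><head><title>T</title></head><body><b>x</b></body></html>"
soup = BeautifulSoup(doc, "html.parser")

starts_with_b = [tag.name for tag in soup.find_all(re.compile("^b"))]  # anchored at the start
contains_t = [tag.name for tag in soup.find_all(re.compile("t"))]      # matches anywhere

print(starts_with_b)
print(contains_t)
```

Without the ``^`` anchor, ``re.compile("b")`` would also match any tag name that merely contains a "b".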
1186
1131
.. _列表:
1187
1132
1188
983
列表
1133
列表
1190
984
....
1134
^^^^^^
1191
985
1135
1193
986
如果传入列表参数,Beautiful Soup会将与列表中任一元素匹配的内容返回.下面代码找到文档中所有<a>标签和<b>标签:
1136
If you pass in a list, Beautiful Soup returns everything that matches any item
in the list. The code below finds all the <a> tags and all the <b> tags:
1195
987
1138
1196
988
::
1139
::
1197
989
1140
1198
@@ -993,10 +1144,12 @@ Beautiful Soup定义了很多搜索方法,这里着重介绍2个: ``find()`` 和
1199
993
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
1144
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
1200
994
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
1145
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
1201
995
1146
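Results from a list filter come back in document order, not grouped by list item. A minimal sketch with invented markup:

```python
from bs4 import BeautifulSoup

# Invented markup mixing three tag names.
soup = BeautifulSoup("<p><a>1</a><b>2</b><i>3</i><a>4</a></p>", "html.parser")

matched = [tag.name for tag in soup.find_all(["a", "b"])]
print(matched)
```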
1202
1147
.. _True:
1203
1148
1204
996
True
1149
True
1206
997
.....
1150
^^^^^^
1207
998
1151
1209
999
``True`` 可以匹配任何值,下面代码查找到所有的tag,但是不会返回字符串节点
1152
``True`` 可以匹配任何值，下面代码查找到所有的 tag，但是不会返回字符串节点
1210
1000
1153
1211
1001
::
1154
::
1212
1002
1155
1213
@@ -1014,19 +1167,23 @@ True
1214
1014
    # a
1167
    # a
1215
1015
    # p
1168
    # p
1216
1016
1169
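As a sketch of what ``True`` does and does not match, the example below runs it once as a tag filter and once through the ``string`` argument. The markup is hypothetical:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one <div> containing text and a <b> tag.
soup = BeautifulSoup("<div>text <b>bold</b> tail</div>", "html.parser")

tag_names = [tag.name for tag in soup.find_all(True)]        # every tag, no strings
strings = [str(s) for s in soup.find_all(string=True)]       # every string, no tags

print(tag_names)
print(strings)
```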
1219
1017
方法
1170
.. _函数:
1218
1018
....
1220
1019
1171
1222
1020
如果没有合适过滤器,那么还可以定义一个方法,方法只接受一个元素参数 [4]_ ,如果这个方法返回 ``True`` 表示当前元素匹配并且被找到,如果不是则反回 ``False``
1172
函数
1223
1173
^^^^^^
1224
1021
1174
1226
1022
下面方法校验了当前元素,如果包含 ``class`` 属性却不包含 ``id`` 属性,那么将返回 ``True``:
1175
If none of the other filters fits, you can define a function that takes an
element as its only argument [4]_. The function should return ``True`` if the
element matches, and ``False`` otherwise.
1228
1177
1229
1178
The function below returns ``True`` if a tag defines the ``class`` attribute
but does not define the ``id`` attribute:
1231
1023
1180
1232
1024
::
1181
::
1233
1025
1182
1234
1026
    def has_class_but_no_id(tag):
1183
    def has_class_but_no_id(tag):
1235
1027
        return tag.has_attr('class') and not tag.has_attr('id')
1184
        return tag.has_attr('class') and not tag.has_attr('id')
1236
1028
1185
1238
1029
将这个方法作为参数传入 ``find_all()`` 方法,将得到所有<p>标签:
1186
将这个方法作为参数传入 ``find_all()`` 方法，将得到所有 <p> 标签:
1239
1030
1187
1240
1031
::
1188
::
1241
1032
1189
1242
@@ -1035,21 +1192,20 @@ True
1243
1035
    #  <p class="story">Once upon a time there were...</p>,
1192
    #  <p class="story">Once upon a time there were...</p>,
1244
1036
    #  <p class="story">...</p>]
1193
    #  <p class="story">...</p>]
1245
1037
1194
1252
1038
返回结果中只有<p>标签没有<a>标签,因为<a>标签还定义了"id",没有返回<html>和<head>,因为<html>和<head>中没有定义"class"属性.
1195
返回结果中只有 <p> 标签，没有 <a> 标签，因为 <a> 标签还定义了"id"，
1253
1039
1196
没有返回 <html> 和 <head>，因为 <html> 和 <head> 中没有定义 "class" 属性。
1248
1040
通过一个方法来过滤一类标签属性的时候, 这个方法的参数是要被过滤的属性的值, 而不是这个标签.
1249
1041
下面的例子是找出 ``href`` 属性不符合指定正则的 ``a`` 标签.
1250
1042
1251
1043
::
1254
1044
1197
1255
1198
如果通过方法来筛选特殊属性，比如 ``href``，传入方法的参数应该是对应属性的值，
1256
1199
而不是整个元素。下面的例子是找出那些 ``a`` 标签中的 ``href`` 属性不匹配指定正则::
1257
1045
1200
1258
1046
	def not_lacie(href):
1201
	def not_lacie(href):
1259
1047
		return href and not re.compile("lacie").search(href)
1202
		return href and not re.compile("lacie").search(href)
1260
1203
1261
1048
	soup.find_all(href=not_lacie)
1204
	soup.find_all(href=not_lacie)
1262
1049
	# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
1205
	# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
1263
1050
	#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
1206
	#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
1264
1051
1207
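The two kinds of function filter receive different arguments, which the following sketch puts side by side. The function names ``is_link_with_href`` and ``is_https`` and the markup are hypothetical:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two links with href, one anchor without.
doc = ('<p><a href="https://example.com/a">A</a>'
       '<a href="/local">B</a>'
       '<a name="anchor">C</a></p>')
soup = BeautifulSoup(doc, "html.parser")

def is_link_with_href(tag):
    # Tag-level filter: receives the whole Tag object.
    return tag.name == "a" and tag.has_attr("href")

def is_https(href):
    # Attribute-level filter: receives the attribute value, or None if absent.
    return href is not None and href.startswith("https://")

with_href = [a.get_text() for a in soup.find_all(is_link_with_href)]
secure = [a.get_text() for a in soup.find_all(href=is_https)]

print(with_href, secure)
```

The attribute-level function has to tolerate ``None``, because it is also consulted for tags that do not define the attribute.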
1266
1052
标签过滤方法可以使用复杂方法. 下面的例子可以过滤出前后都有文字的标签.
1208
标签过滤方法可以使用复杂方法。下面的例子可以过滤出前后都有文字的标签。
1267
1053
1209
1268
1054
::
1210
::
1269
1055
1211
1270
@@ -1071,9 +1227,11 @@ True
1271
1071
find_all()
1227
find_all()
1272
1072
-----------
1228
-----------
1273
1073
1229
1275
1074
find_all( `name`_ , `attrs`_ , `recursive`_ , `string`_ , `**kwargs`_ )
1230
find_all(`name`_ , `attrs`_ , `recursive <recursive>`_ , `string <string>`_ , 
1276
1231
`**kwargs <kwargs>`_ )
1277
1075
1232
1279
1076
``find_all()`` 方法搜索当前tag的所有tag子节点,并判断是否符合过滤器的条件.这里有几个例子:
1233
The ``find_all()`` method looks through a tag's descendants and retrieves all
of the descendants that match your filters. `过滤器类型`_ already showed a few
examples; here are a few more:
1281
1077
1235
1282
1078
::
1236
::
1283
1079
1237
1284
@@ -1095,12 +1253,17 @@ find_all( `name`_ , `attrs`_ , `recursive`_ , `string`_ , `**kwargs`_ )
1285
1095
    soup.find(string=re.compile("sisters"))
1253
    soup.find(string=re.compile("sisters"))
1286
1096
    # u'Once upon a time there were three little sisters; and their names were\n'
1254
    # u'Once upon a time there were three little sisters; and their names were\n'
1287
1097
1255
1289
1098
有几个方法很相似,还有几个方法是新的,参数中的 ``string`` 和 ``id`` 是什么含义? 为什么 ``find_all("p", "title")`` 返回的是CSS Class为"title"的<p>标签? 我们来仔细看一下 ``find_all()`` 的参数
1256
有几个方法很相似，还有几个方法是新的，参数中的 ``string`` 和 ``id`` 是什么含义? 
1290
1257
为什么 ``find_all("p", "title")`` 返回的是CSS Class为"title"的<p>标签? 
1291
1258
我们来仔细看一下 ``find_all()`` 的参数
1292
1259
1293
1260
.. _name:
1294
1099
1261
1295
1100
name 参数
1262
name 参数
1297
1101
..........
1263
^^^^^^^^^^^
1298
1102
1264
1300
1103
``name`` 参数可以查找所有名字为 ``name`` 的tag,字符串对象会被自动忽略掉.
1265
传一个值给 ``name`` 参数，就可以查找所有名字为 ``name`` 的 tag。所有文本都会被忽略掉，
1301
1266
因为它们不匹配标签名字。
1302
1104
1267
1303
1105
简单的用法如下:
1268
简单的用法如下:
1304
1106
1269
1305
@@ -1109,19 +1272,24 @@ name 参数
1306
1109
    soup.find_all("title")
1272
    soup.find_all("title")
1307
1110
    # [<title>The Dormouse's story</title>]
1273
    # [<title>The Dormouse's story</title>]
1308
1111
1274
1310
1112
重申: 搜索 ``name`` 参数的值可以使任一类型的 `过滤器`_ ,字符窜,正则表达式,列表,方法或是 ``True`` .
1275
回忆 `过滤器类型`_ 中描述的内容，搜索 ``name`` 的参数值可以是：
1311
1276
字符串、正则表达式、列表、方法或是 ``True`` 。
1312
1277
1313
1278
.. _kwargs:
1314
1113
1279
1315
1114
keyword 参数
1280
keyword 参数
1317
1115
..............
1281
^^^^^^^^^^^^^^^
1318
1116
1282
1320
1117
如果一个指定名字的参数不是搜索内置的参数名,搜索时会把该参数当作指定名字tag的属性来搜索,如果包含一个名字为 ``id`` 的参数,Beautiful Soup会搜索每个tag的"id"属性.
1283
Any keyword argument that is not recognized is turned into a filter on one of a
tag's attributes. If you pass in a value for an argument called ``id``,
Beautiful Soup filters against each tag's ``id`` attribute:
1323
1118
1286
1324
1119
::
1287
::
1325
1120
1288
1326
1121
    soup.find_all(id='link2')
1289
    soup.find_all(id='link2')
1327
1122
    # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
1290
    # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
1328
1123
1291
1330
1124
如果传入 ``href`` 参数,Beautiful Soup会搜索每个tag的"href"属性:
1292
如果传入 ``href`` 参数，Beautiful Soup会搜索每个 tag 的 ``href`` 属性
1331
1125
1293
1332
1126
::
1294
::
1333
1127
1295
1334
@@ -1130,7 +1298,7 @@ keyword 参数
1335
1130
1298
1336
1131
搜索指定名字的属性时可以使用的参数值包括 `字符串`_ , `正则表达式`_ , `列表`_, `True`_ .
1299
搜索指定名字的属性时可以使用的参数值包括 `字符串`_ , `正则表达式`_ , `列表`_, `True`_ .
1337
1132
1300
1339
1133
下面的例子在文档树中查找所有包含 ``id`` 属性的tag,无论 ``id`` 的值是什么:
1301
下面的例子在文档树中查找所有包含 ``id`` 属性的 tag，无论 ``id`` 的值是什么:
1340
1134
1302
1341
1135
::
1303
::
1342
1136
1304
1343
@@ -1139,14 +1307,14 @@ keyword 参数
1344
1139
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
1307
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
1345
1140
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
1308
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
1346
1141
1309
1348
1142
使用多个指定名字的参数可以同时过滤tag的多个属性:
1310
使用多个指定名字的参数可以同时过滤多个 tag 属性:
1349
1143
1311
1350
1144
::
1312
::
1351
1145
1313
1352
1146
    soup.find_all(href=re.compile("elsie"), id='link1')
1314
    soup.find_all(href=re.compile("elsie"), id='link1')
1353
1147
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
1315
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
1354
1148
1316
1356
1149
有些tag属性在搜索不能使用,比如HTML5中的 data-* 属性:
1317
有些 tag 属性在搜索不能使用，比如HTML5中的 data-* 属性:
1357
1150
1318
1358
1151
::
1319
::
1359
1152
1320
1360
@@ -1154,17 +1322,33 @@ keyword 参数
1361
1154
    data_soup.find_all(data-foo="value")
1322
    data_soup.find_all(data-foo="value")
1362
1155
    # SyntaxError: keyword can't be an expression
1323
    # SyntaxError: keyword can't be an expression
1363
1156
1324
1365
1157
但是可以通过 ``find_all()`` 方法的 ``attrs`` 参数定义一个字典参数来搜索包含特殊属性的tag:
1325
这种情况下可以通过 ``find_all()`` 方法的 ``attrs`` 参数定义一个字典参数
1366
1326
来搜索包含特殊属性的 tag:
1367
1158
1327
1368
1159
::
1328
::
1369
1160
1329
1370
1161
    data_soup.find_all(attrs={"data-foo": "value"})
1330
    data_soup.find_all(attrs={"data-foo": "value"})
1371
1162
    # [<div data-foo="value">foo!</div>]
1331
    # [<div data-foo="value">foo!</div>]
1372
1163
1332
1373
1333
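Do not use ``name`` as a keyword argument to search for HTML elements by their
"name" attribute, because Beautiful Soup uses the ``name`` argument for the name
of the tag itself. Instead, pass the "name" attribute in the ``attrs`` dictionary: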
1375
1335
1376
1336
::
1377
1337
1378
1338
    name_soup = BeautifulSoup('<input name="email"/>', 'html.parser')
1379
1339
    name_soup.find_all(name="email")
1380
1340
    # []
1381
1341
    name_soup.find_all(attrs={"name": "email"})
1382
1342
    # [<input name="email"/>]
1383
1343
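The keyword-argument rules above can be summarised in one sketch. The form markup is an assumption invented for the example; it combines an ordinary attribute, a ``data-*`` attribute, and a ``name`` attribute:

```python
from bs4 import BeautifulSoup

# Invented form markup for illustration.
doc = ('<form><input name="email" id="e1" data-role="field"/>'
       '<input name="phone" id="p1"/></form>')
soup = BeautifulSoup(doc, "html.parser")

by_id = soup.find_all(id="e1")                          # keyword becomes an attribute filter
any_id = soup.find_all(id=True)                         # any tag that defines id
by_data = soup.find_all(attrs={"data-role": "field"})   # not a valid Python keyword, so use attrs
by_name_attr = soup.find_all(attrs={"name": "email"})   # the "name" attribute
by_tag_name = soup.find_all(name="email")               # looks for <email> tags instead

print(len(by_id), len(any_id), len(by_data), len(by_name_attr), len(by_tag_name))
```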
1384
1344
.. _attrs:
1385
1345
1386
1164
按CSS搜索
1346
按CSS搜索
1388
1165
..........
1347
^^^^^^^^^^^
1389
1166
1348
1391
1167
按照CSS类名搜索tag的功能非常实用,但标识CSS类名的关键字 ``class`` 在Python中是保留字,使用 ``class`` 做参数会导致语法错误.从Beautiful Soup的4.1.1版本开始,可以通过 ``class_`` 参数搜索有指定CSS类名的tag:
1349
按照 CSS 类名搜索的功能非常实用，但标识 CSS 类名的关键字 ``class`` 在Python中是保留字，
1392
1350
使用 ``class`` 做参数会导致语法错误。从 Beautiful Soup 4.1.2 版本开始，可以通过 ``class_`` 
1393
1351
参数搜索有指定CSS类名的 tag:
1394
1168
1352
1395
1169
::
1353
::
1396
1170
1354
1397
@@ -1173,7 +1357,8 @@ keyword 参数
1398
1173
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
1357
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
1399
1174
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
1358
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
1400
1175
1359
1402
1176
``class_`` 参数同样接受不同类型的 ``过滤器`` ,字符串,正则表达式,方法或 ``True`` :
1360
作为关键字形式的参数 ``class_`` 同样接受不同类型的 ``过滤器``，字符串、正则表达式、
1403
1361
方法或 ``True`` :
1404
1177
1362
1405
1178
::
1363
::
1406
1179
1364
1407
@@ -1188,7 +1373,7 @@ keyword 参数
1408
1188
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
1373
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
1409
1189
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
1374
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
1410
1190
1375
1412
1191
tag的 ``class`` 属性是 `多值属性`_ .按照CSS类名搜索tag时,可以分别搜索tag中的每个CSS类名:
1376
tag 的 ``class`` 属性是 `多值属性`_ 。按照 CSS 类名搜索时，表示匹配到 tag 中任意 CSS 类名:
1413
1192
1377
1414
1193
::
1378
::
1415
1194
1379
1416
@@ -1199,14 +1384,29 @@ tag的 ``class`` 属性是 `多值属性`_ .按照CSS类名搜索tag时,可以
1417
1199
    css_soup.find_all("p", class_="body")
1384
    css_soup.find_all("p", class_="body")
1418
1200
    # [<p class="body strikeout"></p>]
1385
    # [<p class="body strikeout"></p>]
1419
1201
1386
1421
1202
搜索 ``class`` 属性时也可以通过CSS值完全匹配:
1387
搜索 ``class`` 属性时也可以通过 CSS 值进行完全匹配:
1422
1203
1388
1423
1204
::
1389
::
1424
1205
1390
1425
1206
    css_soup.find_all("p", class_="body strikeout")
1391
    css_soup.find_all("p", class_="body strikeout")
1426
1207
    # [<p class="body strikeout"></p>]
1392
    # [<p class="body strikeout"></p>]
1427
1208
1393
1429
1209
完全匹配 ``class`` 的值时,如果CSS类名的顺序与实际不符,将搜索不到结果:
1394
完全匹配 ``class`` 的值时，如果CSS类名的顺序与实际不符，将搜索不到结果:
1430
1395
1431
1396
::
1432
1397
1433
1398
    css_soup.find_all("p", class_="strikeout body")
1434
1399
    # []
1435
1400
1436
1401
To search for tags that match two or more CSS classes, use a CSS selector:
1437
1402
1438
1403
::
1439
1404
1440
1405
    css_soup.select("p.strikeout.body")
1441
1406
    # [<p class="body strikeout"></p>]
1442
1407
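The class-matching rules can be compared directly in a short sketch. The two-paragraph document is hypothetical, and ``select()`` assumes the Soup Sieve package that normally ships with Beautiful Soup is installed:

```python
from bs4 import BeautifulSoup

# Hypothetical document: one <p> with two classes, one <p> with a single class.
css_soup = BeautifulSoup('<p class="body strikeout"></p><p class="body"></p>', "html.parser")

any_class = css_soup.find_all("p", class_="body")               # matches either paragraph
exact_order = css_soup.find_all("p", class_="body strikeout")   # exact attribute string
wrong_order = css_soup.find_all("p", class_="strikeout body")   # order differs: no match
selector = css_soup.select("p.strikeout.body")                  # order does not matter

print(len(any_class), len(exact_order), len(wrong_order), len(selector))
```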
1443
1408
在旧版本的 Beautiful Soup 中，可能不支持 ``class_``，这时可以使用 ``attrs`` 实现相同效果。
1444
1409
创建一个字典，包含要搜索的 class 类名（或者正则表达式等形式）
1445
1210
1410
1446
1211
::
1411
::
1447
1212
1412
1448
@@ -1215,10 +1415,13 @@ tag的 ``class`` 属性是 `多值属性`_ .按照CSS类名搜索tag时,可以
1449
1215
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
1415
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
1450
1216
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
1416
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
1451
1217
1417
1454
1218
``string`` 参数
1418
.. _string2:
1455
1219
...............
1419
1456
1420
string 参数
1457
1421
^^^^^^^^^^^^^^^^^^^^^^^
1458
1220
1422
1460
1221
通过 ``string`` 参数可以搜搜文档中的字符串内容.与 ``name`` 参数的可选值一样, ``string`` 参数接受 `字符串`_ , `正则表达式`_ , `列表`_, `True`_ . 看例子:
1423
通过 ``string`` 参数可以搜索文档中的字符串内容。与 ``name`` 参数接受的值一样，
1461
1424
``string`` 参数接受 `字符串`_ , `正则表达式`_ , `列表`_, `函数`_, `True`_ 。看例子:
1462
1222
1425
1463
1223
::
1426
::
1464
1224
1427
1465
@@ -1238,19 +1441,32 @@ tag的 ``class`` 属性是 `多值属性`_ .按照CSS类名搜索tag时,可以
1466
1238
    soup.find_all(string=is_the_only_string_within_a_tag)
1441
    soup.find_all(string=is_the_only_string_within_a_tag)
1467
1239
    # [u"The Dormouse's story", u"The Dormouse's story", u'Elsie', u'Lacie', u'Tillie', u'...']
1442
    # [u"The Dormouse's story", u"The Dormouse's story", u'Elsie', u'Lacie', u'Tillie', u'...']
1468
1240
1443
1470
1241
虽然 ``string`` 参数用于搜索字符串,还可以与其它参数混合使用来过滤tag.Beautiful Soup会找到 ``.string`` 方法与 ``string`` 参数值相符的tag.下面代码用来搜索内容里面包含“Elsie”的<a>标签:
1444
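Although ``string`` is for finding strings, you can also combine it with arguments that find tags.
1471
1445
Beautiful Soup will then find the tags whose ``.string`` matches the value you passed for ``string``.
1472
1446
The following code finds the <a> tags whose ``.string`` is "Elsie":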
1473
1242
1447
1474
1243
::
1448
::
1475
1244
1449
1476
1245
    soup.find_all("a", string="Elsie")
1450
    soup.find_all("a", string="Elsie")
1477
1246
    # [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]
1451
    # [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]
1478
1247
1452
1479
1453
``string`` 参数是在 4.4.0 中新增的。早期版本中该参数名为 ``text``。
1480
1454
1481
1455
::
1482
1456
1483
1457
    soup.find_all("a", text="Elsie")
1484
1458
    # [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]
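As a rough illustration of the filter types that ``string`` accepts, the sketch below runs each kind against a shortened copy of the "three sisters" document. The shortened markup and the variable names are assumptions made for this example.

```python
import re
from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body><p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

# A plain string must match the whole text node.
exact = soup.find_all(string="Elsie")

# A list matches any of its items; results come back in document order.
from_list = soup.find_all(string=["Tillie", "Elsie", "Lacie"])

# A regular expression is applied with search(), so a partial match is enough.
by_regex = soup.find_all(string=re.compile("Dormouse"))

def is_the_only_string_within_a_tag(s):
    """Return True if this string is the only child of its parent tag."""
    return s == s.parent.string

only_strings = soup.find_all(string=is_the_only_string_within_a_tag)

# Combined with a tag name, string filters tags by their .string value.
links = soup.find_all("a", string="Elsie")
print([a["id"] for a in links])
```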
1485
1459
1486
1248
``limit`` 参数
1460
``limit`` 参数
1488
1249
...............
1461
^^^^^^^^^^^^^^^^
1489
1250
1462
1491
1251
``find_all()`` 方法返回全部的搜索结构,如果文档树很大那么搜索会很慢.如果我们不需要全部结果,可以使用 ``limit`` 参数限制返回结果的数量.效果与SQL中的limit关键字类似,当搜索到的结果数量达到 ``limit`` 的限制时,就停止搜索返回结果.
1463
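``find_all()`` returns every result that matches your filters, which can be slow on a very large document.
1492
1464
If you don't need all the results, pass in a number for ``limit`` to cap how many are returned.
1493
1465
This works like the LIMIT keyword in SQL: once Beautiful Soup has found ``limit`` results,
1494
1466
it stops searching and returns what it has.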
1495
1252
1467
1497
1253
文档树中有3个tag符合搜索条件,但结果只返回了2个,因为我们限制了返回数量:
1468
“爱丽丝”文档例子中有 3 个 tag 符合搜索条件，但下面例子中的结果只返回了 2 个，
1498
1469
因为我们限制了返回数量:
1499
1254
1470
1500
1255
::
1471
::
1501
1256
1472
1502
@@ -1258,24 +1474,13 @@ tag的 ``class`` 属性是 `多值属性`_ .按照CSS类名搜索tag时,可以
1503
1258
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
1474
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
1504
1259
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
1475
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
1505
1260
1476
1512
1261
``recursive`` 参数
1477
.. _recursive2:
1507
1262
...................
1508
1263
1509
1264
调用tag的 ``find_all()`` 方法时,Beautiful Soup会检索当前tag的所有子孙节点,如果只想搜索tag的直接子节点,可以使用参数 ``recursive=False`` .
1510
1265
1511
1266
一段简单的文档:
1513
1267
1478
1523
1268
::
1479
``recursive`` 参数
1524
1269
1480
^^^^^^^^^^^^^^^^^^^^^^^
1516
1270
    <html>
1517
1271
     <head>
1518
1272
      <title>
1519
1273
       The Dormouse's story
1520
1274
      </title>
1521
1275
     </head>
1522
1276
    ...
1525
1277
1481
1527
1278
是否使用 ``recursive`` 参数的搜索结果:
1482
如果调用 ``mytag.find_all()`` 方法，Beautiful Soup 会检索 ``mytag`` 的所有子孙节点，
1528
1483
如果只想搜索直接子节点，可以使用参数 ``recursive=False``。查看下面例子
1529
1279
1484
1530
1280
::
1485
::
1531
1281
1486
1532
@@ -1285,30 +1490,32 @@ tag的 ``class`` 属性是 `多值属性`_ .按照CSS类名搜索tag时,可以
1533
1285
    soup.html.find_all("title", recursive=False)
1490
    soup.html.find_all("title", recursive=False)
1534
1286
    # []
1491
    # []
1535
1287
1492
1537
1288
这是文档片段
1493
下面一段简单的文档:
1538
1289
1494
1539
1290
::
1495
::
1540
1291
1496
1548
1292
	<html>
1497
    <html>
1549
1293
		<head>
1498
     <head>
1550
1294
		<title>
1499
      <title>
1551
1295
		The Dormouse's story
1500
       The Dormouse's story
1552
1296
	    </title>
1501
      </title>
1553
1297
		</head>
1502
     </head>
1554
1298
		...
1503
    ...
1555
1299
1504
1559
1300
<title>标签在 <html> 标签下, 但并不是直接子节点, <head> 标签才是直接子节点.
1505
<title> 标签在 <html> 标签之下，但并不是直接子节点，<head> 标签才是直接子节点。
1560
1301
在允许查询所有后代节点时 Beautiful Soup 能够查找到 <title> 标签.
1506
在允许查询所有后代节点时 Beautiful Soup 能够查找到 <title> 标签。
1561
1302
但是使用了 ``recursive=False``  参数之后,只能查找直接子节点,这样就查不到 <title> 标签了.
1507
但是使用了 ``recursive=False``  参数之后，只能查找直接子节点，这样就查不到 <title> 标签了。
1562
1303
1508
1564
1304
Beautiful Soup 提供了多种DOM树搜索方法. 这些方法都使用了类似的参数定义.
1509
Beautiful Soup 提供了多种 DOM 树搜索方法。这些方法都使用了类似的参数定义。
1565
1305
比如这些方法: ``find_all()``: ``name``, ``attrs``, ``text``, ``limit``.
1510
They mostly take the same arguments as ``find_all()``: ``name``, ``attrs``, ``string``, ``limit``, and attribute keyword arguments.
1567
1306
但是只有 ``find_all()`` 和 ``find()`` 支持 ``recursive`` 参数.
1511
但是只有 ``find_all()`` 和 ``find()`` 支持 ``recursive`` 参数。
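A minimal sketch of the ``recursive`` behaviour described above, assuming Beautiful Soup 4 with the built-in ``html.parser`` and a document reduced to the three nested tags from the example:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>",
    "html.parser",
)

# Default: every descendant of <html> is considered, so <title> is found.
deep = soup.html.find_all("title")

# recursive=False: only direct children of <html> are considered.
shallow = soup.html.find_all("title", recursive=False)

# <head> is a direct child of <html>, so it is still found.
direct = soup.html.find_all("head", recursive=False)

print(len(deep), len(shallow), len(direct))
# prints: 1 0 1
```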
1568
1307
1512
1569
1308
像调用 ``find_all()`` 一样调用tag
1513
像调用 ``find_all()`` 一样调用tag
1570
1309
----------------------------------
1514
----------------------------------
1571
1310
1515
1573
1311
``find_all()`` 几乎是Beautiful Soup中最常用的搜索方法,所以我们定义了它的简写方法. ``BeautifulSoup`` 对象和 ``tag`` 对象可以被当作一个方法来使用,这个方法的执行结果与调用这个对象的 ``find_all()`` 方法相同,下面两行代码是等价的:
1516
``find_all()`` 几乎是 Beautiful Soup 中最常用的搜索方法，所以我们定义了它的简写方法。
1574
1517
``BeautifulSoup`` 对象和 ``Tag`` 对象可以被当作一个方法来使用，这个方法的执行结果与
1575
1518
调用这个对象的 ``find_all()`` 方法相同，下面两行代码是等价的:
1576
1312
1519
1577
1313
::
1520
::
1578
1314
1521
1579
@@ -1325,9 +1532,13 @@ Beautiful Soup 提供了多种DOM树搜索方法. 这些方法都使用了类似
1580
1325
find()
1532
find()
1581
1326
-------
1533
-------
1582
1327
1534
1584
1328
find( `name`_ , `attrs`_ , `recursive`_ , `string`_ , `**kwargs`_ )
1535
find(`name`_ , `attrs`_ , `recursive <recursive>`_ , `string <string>`_ , 
1585
1536
`**kwargs <kwargs>`_ )
1586
1329
1537
1588
1330
``find_all()`` 方法将返回文档中符合条件的所有tag,尽管有时候我们只想得到一个结果.比如文档中只有一个<body>标签,那么使用 ``find_all()`` 方法来查找<body>标签就不太合适, 使用 ``find_all`` 方法并设置 ``limit=1`` 参数不如直接使用  ``find()`` 方法.下面两行代码是等价的:
1538
``find_all()`` 方法将返回文档中符合条件的所有 tag，尽管有时候我们只想得到一个结果。
1589
1539
比如文档中只有一个 <body> 标签，那么使用 ``find_all()`` 方法来查找 <body> 标签就
1590
1540
不太合适，使用 ``find_all`` 方法并设置 ``limit=1`` 参数不如直接使用  ``find()`` 方法。
1591
1541
下面两行代码是等价的:
1592
1331
1542
1593
1332
::
1543
::
1594
1333
1544
1595
@@ -1337,16 +1548,18 @@ find( `name`_ , `attrs`_ , `recursive`_ , `string`_ , `**kwargs`_ )
1596
1337
    soup.find('title')
1548
    soup.find('title')
1597
1338
    # <title>The Dormouse's story</title>
1549
    # <title>The Dormouse's story</title>
1598
1339
1550
1600
1340
唯一的区别是 ``find_all()`` 方法的返回结果是值包含一个元素的列表,而 ``find()`` 方法直接返回结果.
1551
The only difference is that ``find_all()`` returns a list containing the single result, while ``find()``
1601
1552
returns the result itself.
1602
1341
1553
1604
1342
``find_all()`` 方法没有找到目标是返回空列表, ``find()`` 方法找不到目标时,返回 ``None`` .
1554
If ``find_all()`` can't find anything, it returns an empty list. If ``find()`` can't find anything, it returns ``None``:
1605
1343
1555
1606
1344
::
1556
::
1607
1345
1557
1608
1346
    print(soup.find("nosuchtag"))
1558
    print(soup.find("nosuchtag"))
1609
1347
    # None
1559
    # None
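Because ``find()`` can return ``None``, calling code usually needs a guard before using the result. The sketch below is one way to write it, assuming a tiny stand-in document; the names ``missing`` and ``missing_text`` are hypothetical.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head><body></body></html>",
    "html.parser",
)

# find_all(limit=1) wraps the single match in a list; find() returns it directly.
as_list = soup.find_all("title", limit=1)
single = soup.find("title")

# Nothing matches: find_all() gives an empty list, find() gives None.
none_found = soup.find_all("nosuchtag")
missing = soup.find("nosuchtag")

# Guard against None before touching attributes or methods of the result.
missing_text = missing.get_text() if missing is not None else ""

# Attribute-style navigation is equivalent to chained find() calls.
chained = soup.find("head").find("title")
print(single == chained, missing is None)
# prints: True True
```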
1610
1348
1560
1612
1349
``soup.head.title`` 是 `tag的名字`_ 方法的简写.这个简写的原理就是多次调用当前tag的 ``find()`` 方法:
1561
``soup.head.title`` 是 `Tag 的名字`_ 方法的简写。这个简写就是通过多次调用 ``find()`` 方
1613
1562
法实现的:
1614
1350
1563
1615
1351
::
1564
::
1616
1352
1565
1617
@@ -1359,13 +1572,20 @@ find( `name`_ , `attrs`_ , `recursive`_ , `string`_ , `**kwargs`_ )
1618
1359
find_parents() 和 find_parent()
1572
find_parents() 和 find_parent()
1619
1360
--------------------------------
1573
--------------------------------
1620
1361
1574
1622
1362
find_parents( `name`_ , `attrs`_ , `recursive`_ , `string`_ , `**kwargs`_ )
1575
find_parents( `name`_ , `attrs`_ , `recursive <recursive>`_ , 
1623
1576
`string <string>`_ , `**kwargs <kwargs>`_ )
1624
1363
1577
1626
1364
find_parent( `name`_ , `attrs`_ , `recursive`_ , `string`_ , `**kwargs`_ )
1578
find_parent( `name`_ , `attrs`_ , `recursive <recursive>`_ , 
1627
1579
`string <string>`_ , `**kwargs <kwargs>`_ )
1628
1365
1580
1630
1366
我们已经用了很大篇幅来介绍 ``find_all()`` 和 ``find()`` 方法,Beautiful Soup中还有10个用于搜索的API.它们中的五个用的是与 ``find_all()`` 相同的搜索参数,另外5个与 ``find()`` 方法的搜索参数类似.区别仅是它们搜索文档的不同部分.
1581
我们已经用了很大篇幅来介绍 ``find_all()`` 和 ``find()`` 方法，Beautiful Soup 中
1631
1582
还有 10 个用于搜索的 API。它们中有 5 个用的是与 ``find_all()`` 相同的搜索参数，
1632
1583
另外 5 个与 ``find()`` 方法的搜索参数类似。区别仅是它们搜索文档的位置不同。
1633
1367
1584
1635
1368
记住: ``find_all()`` 和 ``find()`` 只搜索当前节点的所有子节点,孙子节点等. ``find_parents()`` 和 ``find_parent()`` 用来搜索当前节点的父辈节点,搜索方法与普通tag的搜索方法相同,搜索文档\搜索文档包含的内容. 我们从一个文档中的一个叶子节点开始:
1585
首先来看看 ``find_parents()`` 和 ``find_parent()``。
1636
1586
记住: ``find_all()`` 和 ``find()`` 只搜索当前节点的所有子节点，孙子节点等。
1637
1587
而这 2 个方法刚好相反，它们用来搜索当前节点的父辈节点。
1638
1588
我们来试试看，从例子文档中的一个深层叶子节点开始:
1639
1369
1589
1640
1370
::
1590
::
1641
1371
1591
1642
@@ -1383,21 +1603,29 @@ find_parent( `name`_ , `attrs`_ , `recursive`_ , `string`_ , `**kwargs`_ )
1643
1383
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
1603
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
1644
1384
    #  and they lived at the bottom of a well.</p>
1604
    #  and they lived at the bottom of a well.</p>
1645
1385
1605
1647
1386
    a_string.find_parents("p", class_="title")
1606
    a_string.find_parents("p", class_="title")
1648
1387
    # []
1607
    # []
1649
1388
1608
1651
1389
文档中的一个<a>标签是是当前叶子节点的直接父节点,所以可以被找到.还有一个<p>标签,是目标叶子节点的间接父辈节点,所以也可以被找到.包含class值为"title"的<p>标签不是不是目标叶子节点的父辈节点,所以通过 ``find_parents()`` 方法搜索不到.
1609
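One of the <a> tags is the direct parent of the string in question, so the search finds it.
1652
1610
One of the <p> tags is an indirect parent of the string, so the search finds that as well.
1653
1611
There is also a <p> tag with the CSS class "title" somewhere in the document, but it is not one of this string's parents,
1654
1612
so ``find_parents()`` cannot find it.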
1655
1390
1613
1657
1391
``find_parent()`` 和 ``find_parents()`` 方法会让人联想到 `.parent`_ 和 `.parents`_ 属性.它们之间的联系非常紧密.搜索父辈节点的方法实际上就是对 ``.parents`` 属性的迭代搜索.
1614
``find_parent()`` 和 ``find_parents()`` 方法会让人联想到 `.parent`_ 和 `.parents`_ 属性。
1658
1615
它们之间的联系非常紧密。搜索父辈节点的方法实际上就是对 ``.parents`` 属性的迭代搜索。
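The relationship between ``find_parents()`` and ``.parents`` can be sketched as follows. The list comprehension is an approximation written for this example, not Beautiful Soup's actual implementation, and the shortened document is an assumption.

```python
from bs4 import BeautifulSoup

html_doc = """<html><body><p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>.</p>
</body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

# Start from a leaf: the text node inside the second link.
a_string = soup.find(string="Lacie")

direct_parent = a_string.find_parents("a")      # the enclosing <a>
indirect_parent = a_string.find_parent("p")     # the enclosing <p class="story">
not_a_parent = a_string.find_parents("p", class_="title")

# Roughly the same search written by hand over the .parents generator.
manual = [tag for tag in a_string.parents if tag.name == "p"]

print(direct_parent[0]["id"], indirect_parent["class"], not_a_parent)
# prints: link2 ['story'] []
```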
1659
1392
1616
1660
1393
find_next_siblings() 和 find_next_sibling()
1617
find_next_siblings() 和 find_next_sibling()
1661
1394
-------------------------------------------
1618
-------------------------------------------
1662
1395
1619
1664
1396
find_next_siblings( `name`_ , `attrs`_ , `recursive`_ , `string`_ , `**kwargs`_ )
1620
find_next_siblings( `name`_ , `attrs`_ , `recursive <recursive>`_ , 
1665
1621
`string <string>`_ , `**kwargs <kwargs>`_ )
1666
1397
1622
1668
1398
find_next_sibling( `name`_ , `attrs`_ , `recursive`_ , `string`_ , `**kwargs`_ )
1623
find_next_sibling( `name`_ , `attrs`_ , `recursive <recursive>`_ , 
1669
1624
`string <string>`_ , `**kwargs <kwargs>`_ )
1670
1399
1625
1672
1400
这2个方法通过 `.next_siblings`_ 属性对当tag的所有后面解析 [5]_ 的兄弟tag节点进行迭代, ``find_next_siblings()`` 方法返回所有符合条件的后面的兄弟节点, ``find_next_sibling()`` 只返回符合条件的后面的第一个tag节点.
1626
These methods use `.next_siblings <sibling-generators>`_ to iterate over the siblings of the current tag that come after it in the tree [5]_ .
1673
1627
``find_next_siblings()`` returns all the following siblings that match,
1674
1628
and ``find_next_sibling()`` returns only the first one:
1675
1401
1629
1676
1402
::
1630
::
1677
1403
1631
1678
@@ -1416,11 +1644,15 @@ find_next_sibling( `name`_ , `attrs`_ , `recursive`_ , `string`_ , `**kwargs`_ )
1679
1416
find_previous_siblings() 和 find_previous_sibling()
1644
find_previous_siblings() 和 find_previous_sibling()
1680
1417
-----------------------------------------------------
1645
-----------------------------------------------------
1681
1418
1646
1683
1419
find_previous_siblings( `name`_ , `attrs`_ , `recursive`_ , `string`_ , `**kwargs`_ )
1647
find_previous_siblings( `name`_ , `attrs`_ , `recursive <recursive>`_ , 
1684
1648
`string <string>`_ , `**kwargs <kwargs>`_ )
1685
1420
1649
1687
1421
find_previous_sibling( `name`_ , `attrs`_ , `recursive`_ , `string`_ , `**kwargs`_ )
1650
find_previous_sibling( `name`_ , `attrs`_ , `recursive <recursive>`_ , 
1688
1651
`string <string>`_ , `**kwargs <kwargs>`_ )
1689
1422
1652
1691
1423
这2个方法通过 `.previous_siblings`_ 属性对当前tag的前面解析 [5]_ 的兄弟tag节点进行迭代, ``find_previous_siblings()`` 方法返回所有符合条件的前面的兄弟节点, ``find_previous_sibling()`` 方法返回第一个符合条件的前面的兄弟节点:
1653
这 2 个方法通过 `.previous_siblings <sibling-generators>`_ 属性对当前 tag 的前面解析 [5]_ 
1692
1654
的兄弟 tag 节点进行迭代， ``find_previous_siblings()`` 方法返回所有符合条件的前面的兄弟节点，
1693
1655
``find_previous_sibling()`` 方法返回第一个符合条件的前面的兄弟节点:
1694
1424
1656
1695
1425
::
1657
::
1696
1426
1658
1697
@@ -1439,11 +1671,15 @@ find_previous_sibling( `name`_ , `attrs`_ , `recursive`_ , `string`_ , `**kwargs
1698
1439
find_all_next() 和 find_next()
1671
find_all_next() 和 find_next()
1699
1440
--------------------------------
1672
--------------------------------
1700
1441
1673
1702
1442
find_all_next( `name`_ , `attrs`_ , `recursive`_ , `string`_ , `**kwargs`_ )
1674
find_all_next( `name`_ , `attrs`_ , `recursive <recursive>`_ , 
1703
1675
`string <string>`_ , `**kwargs <kwargs>`_ )
1704
1443
1676
1706
1444
find_next( `name`_ , `attrs`_ , `recursive`_ , `string`_ , `**kwargs`_ )
1677
find_next( `name`_ , `attrs`_ , `recursive <recursive>`_ , 
1707
1678
`string <string>`_ , `**kwargs <kwargs>`_ )
1708
1445
1679
1710
1446
这2个方法通过 `.next_elements`_ 属性对当前tag的之后的 [5]_ tag和字符串进行迭代, ``find_all_next()`` 方法返回所有符合条件的节点, ``find_next()`` 方法返回第一个符合条件的节点:
1680
这 2 个方法通过 `.next_elements <element-generators>`_ 属性对当前 tag 的之后的 [5]_ 
1711
1681
tag 和字符串进行迭代， ``find_all_next()`` 方法返回所有符合条件的节点， ``find_next()`` 
1712
1682
方法返回第一个符合条件的节点:
1713
1447
1683
1714
1448
::
1684
::
1715
1449
1685
1716
@@ -1458,16 +1694,22 @@ find_next( `name`_ , `attrs`_ , `recursive`_ , `string`_ , `**kwargs`_ )
1717
1458
    first_link.find_next("p")
1694
    first_link.find_next("p")
1718
1459
    # <p class="story">...</p>
1695
    # <p class="story">...</p>
1719
1460
1696
1721
1461
第一个例子中,字符串 “Elsie”也被显示出来,尽管它被包含在我们开始查找的<a>标签的里面.第二个例子中,最后一个<p>标签也被显示出来,尽管它与我们开始查找位置的<a>标签不属于同一部分.例子中,搜索的重点是要匹配过滤器的条件,并且在文档中出现的顺序而不是开始查找的元素的位置.
1697
第一个例子中，字符串 “Elsie”也被显示出来，尽管它被包含在我们开始查找的 <a> 标签的里面。
1722
1698
第二个例子中，最后一个<p>标签也被显示出来，尽管它与我们开始查找位置的 <a> 标签不属于同一部分。
1723
1699
All that matters is that an element matches the filter and appears later in the document than the starting element.
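To make the "document order" point concrete, here is a small sketch that starts at the first link and looks forward. It assumes a shortened copy of the example document; the hand-written loop over ``.next_elements`` only approximates what ``find_next()`` does.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html_doc = """<html><body>
<p class="story">Sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>.</p>
<p class="story">...</p>
</body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")
first_link = soup.a

# Strings after the start of the first link, including the link's own text.
following_strings = first_link.find_all_next(string=True)

# The next <p> in document order, even though it is not a sibling of the link.
next_paragraph = first_link.find_next("p")

# Approximately the same search written by hand.
manual = next(
    el for el in first_link.next_elements
    if isinstance(el, Tag) and el.name == "p"
)
print(following_strings[0], next_paragraph.string)
# prints: Elsie ...
```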
1724
1462
1700
1725
1463
find_all_previous() 和 find_previous()
1701
find_all_previous() 和 find_previous()
1726
1464
---------------------------------------
1702
---------------------------------------
1727
1465
1703
1729
1466
find_all_previous( `name`_ , `attrs`_ , `recursive`_ , `string`_ , `**kwargs`_ )
1704
find_all_previous( `name`_ , `attrs`_ , `recursive <recursive>`_ , 
1730
1705
`string <string>`_ , `**kwargs <kwargs>`_ )
1731
1467
1706
1733
1468
find_previous( `name`_ , `attrs`_ , `recursive`_ , `string`_ , `**kwargs`_ )
1707
find_previous( `name`_ , `attrs`_ , `recursive <recursive>`_ , 
1734
1708
`string <string>`_ , `**kwargs <kwargs>`_ )
1735
1469
1709
1737
1470
这2个方法通过 `.previous_elements`_ 属性对当前节点前面 [5]_ 的tag和字符串进行迭代, ``find_all_previous()`` 方法返回所有符合条件的节点, ``find_previous()`` 方法返回第一个符合条件的节点.
1710
These methods use `.previous_elements <element-generators>`_ to iterate over the tags and strings that come before [5]_ the current node in the document.
1738
1711
``find_all_previous()`` returns all the matches, and ``find_previous()``
1739
1712
returns only the first one:
1740
1471
1713
1741
1472
::
1714
::
1742
1473
1715
1743
@@ -1482,105 +1724,110 @@ find_previous( `name`_ , `attrs`_ , `recursive`_ , `string`_ , `**kwargs`_ )
1744
1482
    first_link.find_previous("title")
1724
    first_link.find_previous("title")
1745
1483
    # <title>The Dormouse's story</title>
1725
    # <title>The Dormouse's story</title>
1746
1484
1726
1748
1485
``find_all_previous("p")`` 返回了文档中的第一段(class="title"的那段),但还返回了第二段,<p>标签包含了我们开始查找的<a>标签.不要惊讶,这段代码的功能是查找所有出现在指定<a>标签之前的<p>标签,因为这个<p>标签包含了开始的<a>标签,所以<p>标签一定是在<a>之前出现的.
1727
``find_all_previous("p")`` 既返回了文档中的第一段(class="title"的那段)，还返回了第二段，
1749
1728
包含了我们开始查找的 <a> 标签的那段。不用惊讶，这段代码的功能是查找所有出现在指定 <a> 标签之前
1750
1729
的 <p> 标签，因为这个 <p> 标签包含了开始的 <a> 标签，所以 <p> 标签当然是在 <a> 之前出现的。
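The same behaviour can be reproduced in a few lines. This is a sketch on a shortened document (an assumption for the example); it shows that the enclosing paragraph is returned first, because results come back nearest first.

```python
from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body><p class="title"><b>The Dormouse's story</b></p>
<p class="story">Sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>.</p>
</body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")
first_link = soup.a

# Every <p> that starts before the link, nearest first. The enclosing
# <p class="story"> counts because its opening tag precedes the link.
previous_paragraphs = first_link.find_all_previous("p")
classes = [p["class"] for p in previous_paragraphs]

# Only the nearest match.
previous_title = first_link.find_previous("title")

print(classes, previous_title.string)
# prints: [['story'], ['title']] The Dormouse's story
```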
1751
1486
1730
1753
1487
CSS选择器
1731
CSS 选择器
1754
1488
------------
1732
------------
1755
1489
1733
1759
1490
Beautiful Soup支持大部分的CSS选择器 `<http://www.w3.org/TR/CSS2/selector.html>`_ [6]_ ,
1734
BeautifulSoup 对象和 Tag 对象支持通过 ``.css`` 属性实现 CSS 选择器。具体选择功能是通过
1760
1491
在 ``Tag`` 或 ``BeautifulSoup`` 对象的 ``.select()`` 方法中传入字符串参数,
1735
`Soup Sieve <https://facelessuser.github.io/soupsieve/>`_ 库实现的，在 PyPI 上通
1761
1492
即可使用CSS选择器的语法找到tag:
1736
过关键字 ``soupsieve`` 可以找到。通过 pip 安装 Beautiful Soup 时，Soup Sieve 也会自
1762
1737
动安装，不用其它额外操作。
1763
1738
1764
1739
Soup Sieve 文档列出了 `当前支持的 CSS 选择器 <https://facelessuser.github.io/soupsieve/selectors/>`_，
1765
1740
下面是一些基本应用
1766
1493
1741
1767
1494
::
1742
::
1768
1495
1743
1770
1496
    soup.select("title")
1744
    soup.css.select("title")
1771
1497
    # [<title>The Dormouse's story</title>]
1745
    # [<title>The Dormouse's story</title>]
1772
1498
1746
1774
1499
    soup.select("p:nth-of-type(3)")
1747
    soup.css.select("p:nth-of-type(3)")
1775
1500
    # [<p class="story">...</p>]
1748
    # [<p class="story">...</p>]
1776
1501
1749
1778
1502
通过tag标签逐层查找:
1750
查找指定层级的 tag:
1779
1503
1751
1780
1504
::
1752
::
1781
1505
1753
1783
1506
    soup.select("body a")
1754
    soup.css.select("body a")
1784
1507
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
1755
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
1785
1508
    #  <a class="sister" href="http://example.com/lacie"  id="link2">Lacie</a>,
1756
    #  <a class="sister" href="http://example.com/lacie"  id="link2">Lacie</a>,
1786
1509
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
1757
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
1787
1510
1758
1789
1511
    soup.select("html head title")
1759
    soup.css.select("html head title")
1790
1512
    # [<title>The Dormouse's story</title>]
1760
    # [<title>The Dormouse's story</title>]
1791
1513
1761
1793
1514
找到某个tag标签下的直接子标签 [6]_ :
1762
找到某个 tag 标签下的直接子标签 [6]_ :
1794
1515
1763
1795
1516
::
1764
::
1796
1517
1765
1798
1518
    soup.select("head > title")
1766
    soup.css.select("head > title")
1799
1519
    # [<title>The Dormouse's story</title>]
1767
    # [<title>The Dormouse's story</title>]
1800
1520
1768
1802
1521
    soup.select("p > a")
1769
    soup.css.select("p > a")
1803
1522
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
1770
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
1804
1523
    #  <a class="sister" href="http://example.com/lacie"  id="link2">Lacie</a>,
1771
    #  <a class="sister" href="http://example.com/lacie"  id="link2">Lacie</a>,
1805
1524
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
1772
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
1806
1525
1773
1808
1526
    soup.select("p > a:nth-of-type(2)")
1774
    soup.css.select("p > a:nth-of-type(2)")
1809
1527
    # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
1775
    # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
1810
1528
1776
1812
1529
    soup.select("p > #link1")
1777
    soup.css.select("p > #link1")
1813
1530
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
1778
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
1814
1531
1779
1816
1532
    soup.select("body > a")
1780
    soup.css.select("body > a")
1817
1533
    # []
1781
    # []
1818
1534
1782
1819
1535
找到兄弟节点标签:
1783
找到兄弟节点标签:
1820
1536
1784
1821
1537
::
1785
::
1822
1538
1786
1824
1539
    soup.select("#link1 ~ .sister")
1787
    soup.css.select("#link1 ~ .sister")
1825
1540
    # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
1788
    # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
1826
1541
    #  <a class="sister" href="http://example.com/tillie"  id="link3">Tillie</a>]
1789
    #  <a class="sister" href="http://example.com/tillie"  id="link3">Tillie</a>]
1827
1542
1790
1829
1543
    soup.select("#link1 + .sister")
1791
    soup.css.select("#link1 + .sister")
1830
1544
    # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
1792
    # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
1831
1545
1793
1833
1546
通过CSS的类名查找:
1794
通过 CSS 的类名查找:
1834
1547
1795
1835
1548
::
1796
::
1836
1549
1797
1838
1550
    soup.select(".sister")
1798
    soup.css.select(".sister")
1839
1551
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
1799
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
1840
1552
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
1800
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
1841
1553
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
1801
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
1842
1554
1802
1844
1555
    soup.select("[class~=sister]")
1803
    soup.css.select("[class~=sister]")
1845
1556
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
1804
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
1846
1557
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
1805
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
1847
1558
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
1806
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
1848
1559
1807
1850
1560
通过tag的id查找:
1808
通过 id 查找 tag:
1851
1561
1809
1852
1562
::
1810
::
1853
1563
1811
1855
1564
    soup.select("#link1")
1812
    soup.css.select("#link1")
1856
1565
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
1813
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
1857
1566
1814
1859
1567
    soup.select("a#link2")
1815
    soup.css.select("a#link2")
1860
1568
    # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
1816
    # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
1861
1569
1817
1863
1570
同时用多种CSS选择器查询元素:
1818
查找符合列表中任意一个选择器的 tag：
1864
1571
1819
1865
1572
::
1820
::
1866
1573
1821
1871
1574
	soup.select("#link1,#link2")
1822
    soup.css.select("#link1,#link2")
1872
1575
	# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
1823
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
1873
1576
	#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
1824
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
1870
1577
1874
1578
1825
1875
1579
通过是否存在某个属性来查找:
1826
通过是否存在某个属性来查找:
1876
1580
1827
1877
1581
::
1828
::
1878
1582
1829
1880
1583
    soup.select('a[href]')
1830
    soup.css.select('a[href]')
1881
1584
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
1831
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
1882
1585
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
1832
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
1883
1586
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
1833
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
1884
@@ -1589,62 +1836,143 @@ Beautiful Soup支持大部分的CSS选择器 `<http://www.w3.org/TR/CSS2/selecto
1885
1589
1836
1886
1590
::
1837
::
1887
1591
1838
1889
1592
    soup.select('a[href="http://example.com/elsie"]')
1839
    soup.css.select('a[href="http://example.com/elsie"]')
1890
1593
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
1840
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
1891
1594
1841
1893
1595
    soup.select('a[href^="http://example.com/"]')
1842
    soup.css.select('a[href^="http://example.com/"]')
1894
1596
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
1843
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
1895
1597
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
1844
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
1896
1598
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
1845
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
1897
1599
1846
1899
1600
    soup.select('a[href$="tillie"]')
1847
    soup.css.select('a[href$="tillie"]')
1900
1601
    # [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
1848
    # [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
1901
1602
1849
1903
1603
    soup.select('a[href*=".com/el"]')
1850
    soup.css.select('a[href*=".com/el"]')
1904
1604
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
1851
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
1905
1605
1852
1907
1606
通过语言设置来查找:
1853
还有一个 ``select_one()`` 方法，它会返回符合筛选条件的元素列表中的第一个
1908
1607
1854
1909
1608
::
1855
::
1910
1609
1856
1922
1610
    multilingual_markup = """
1857
    soup.css.select_one(".sister")
1923
1611
     <p lang="en">Hello</p>
1858
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
1913
1612
     <p lang="en-us">Howdy, y'all</p>
1914
1613
     <p lang="en-gb">Pip-pip, old fruit</p>
1915
1614
     <p lang="fr">Bonjour mes amis</p>
1916
1615
    """
1917
1616
    multilingual_soup = BeautifulSoup(multilingual_markup)
1918
1617
    multilingual_soup.select('p[lang|=en]')
1919
1618
    # [<p lang="en">Hello</p>,
1920
1619
    #  <p lang="en-us">Howdy, y'all</p>,
1921
1620
    #  <p lang="en-gb">Pip-pip, old fruit</p>]
1924
1621
1859
1926
1622
返回查找到的元素的第一个
1860
As a convenience, you can call ``select()`` and ``select_one()`` directly on a ``BeautifulSoup`` or ``Tag`` object,
1927
1861
omitting the ``.css`` attribute:
1928
1623
1862
1929
1624
::
1863
::
1930
1625
1864
1933
1626
	soup.select_one(".sister")
1865
    soup.select('a[href$="tillie"]')
1934
1627
	# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
1866
    # [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
1935
1628
1867
1936
1868
    soup.select_one(".sister")
1937
1869
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
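The selector forms shown above can be exercised together in one sketch. It uses the ``select()`` and ``select_one()`` shortcuts rather than ``.css`` so that it also runs on releases before 4.12.0; the shortened document and variable names are assumptions for the example.

```python
from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body><p class="story">Sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

def ids(tags):
    """Collect the id attribute of each tag, to keep the output short."""
    return [tag["id"] for tag in tags]

children = ids(soup.select("p > a"))                     # direct children
second = ids(soup.select("p > a:nth-of-type(2)"))        # positional
later_siblings = ids(soup.select("#link1 ~ .sister"))    # all later siblings
next_sibling = ids(soup.select("#link1 + .sister"))      # adjacent sibling only
by_suffix = ids(soup.select('a[href$="tillie"]'))        # attribute suffix
either = ids(soup.select("#link1,#link2"))               # selector list
first = soup.select_one(".sister")                       # first match or None

# Selectors and the regular search API can be mixed freely.
mixed = ids(soup.select_one("p.story").find_all("a", limit=2))

print(children, second, later_siblings, next_sibling, by_suffix, either, first["id"], mixed)
```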
1938
1870
1939
1871
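CSS selectors are a great convenience for anyone who already knows the syntax, and you can use them directly in Beautiful Soup.
1940
1872
If CSS selectors are all you need, though, you should parse the document with ``lxml``: it is a lot faster.
1941
1873
The advantage of Soup Sieve is that it lets you *combine* CSS selectors with the Beautiful Soup API.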
1942
1874
1943
1875
Soup Sieve 高级特性
1944
1876
-------------------
1945
1877
1946
1878
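Soup Sieve offers a substantial API beyond ``select()`` and ``select_one()``, and most of it is reachable through the ``.css`` attribute
1947
1879
of a ``Tag`` or ``BeautifulSoup`` object. What follows is only a list of the supported methods;
1948
1880
see the `Soup Sieve <https://facelessuser.github.io/soupsieve/>`_ documentation for the full details.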
1949
1881
1950
1882
``iselect()`` 方法与 ``select()`` 效果相同，区别是返回的结果是迭代器。
1951
1883
::
1952
1884
1953
1885
    [tag['id'] for tag in soup.css.iselect(".sister")]
1954
1886
    # ['link1', 'link2', 'link3']
1955
1887
1956
1888
``closest()`` 方法与 ``find_parent()`` 方法相似，返回符合 CSS 选择器的 Tag 对象的最近父级。
1957
1889
1958
1890
::
1959
1891
1960
1892
    elsie = soup.css.select_one(".sister")
1961
1893
    elsie.css.closest("p.story")
1962
1894
    # <p class="story">Once upon a time there were three little sisters; and their names were
1963
1895
    #  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
1964
1896
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
1965
1897
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
1966
1898
    #  and they lived at the bottom of a well.</p>
1967
1899
1968
1900
``match()`` 方法返回布尔结果，标记指定 Tag 是否符合指定筛选器
1969
1901
1970
1902
::
1971
1903
1972
1904
    elsie.css.match("#link1")
1973
1905
    # True
1974
1906
1975
1907
    elsie.css.match("#link2")
1976
1908
    # False
1977
1909
1978
1910
``filter()`` 方法返回 tag 直接子节点中符合筛选器的节点列表
1979
1911
1980
1912
::
1981
1913
 
1982
1914
    [tag.string for tag in soup.find('p', 'story').css.filter('a')]
1983
1915
    # ['Elsie', 'Lacie', 'Tillie']
1984
1916
1985
1917
The ``escape()`` method escapes special characters in a string that would otherwise be an invalid CSS identifier:
1986
1629
1918
1990
1630
对于熟悉CSS选择器语法的人来说这是个非常方便的方法.Beautiful Soup也支持CSS选择器API,
1919
::
1991
1631
如果你仅仅需要CSS选择器的功能,那么直接使用 ``lxml`` 也可以,
1920
1992
1632
而且速度更快,支持更多的CSS选择器语法,但Beautiful Soup整合了CSS选择器的语法和自身方便使用API.
1921
    soup.css.escape("1-strange-identifier")
1993
1922
    # '\\31 -strange-identifier'
1994
1923
1995
1924
CSS 筛选器中的命名空间
1996
1925
------------------------
1997
1926
1998
1927
If you have parsed XML that defines namespaces, you can use them in CSS selectors:
1999
1928
2000
1929
::
2001
1930
2002
1931
    from bs4 import BeautifulSoup
2003
1932
    xml = """<tag xmlns:ns1="http://namespace1/" xmlns:ns2="http://namespace2/">
2004
1933
    <ns1:child>I'm in namespace 1</ns1:child>
2005
1934
    <ns2:child>I'm in namespace 2</ns2:child>
2006
1935
    </tag> """
2007
1936
    namespace_soup = BeautifulSoup(xml, "xml")
2008
1937
2009
1938
    namespace_soup.css.select("child")
2010
1939
    # [<ns1:child>I'm in namespace 1</ns1:child>, <ns2:child>I'm in namespace 2</ns2:child>]
2011
1940
2012
1941
    namespace_soup.css.select("ns1|child")
2013
1942
    # [<ns1:child>I'm in namespace 1</ns1:child>]
2014
1943
2015
1944
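Beautiful Soup tries to use the namespace prefixes it found while parsing the document, but you can always provide your own dictionary of abbreviations: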
2016
1945
2017
1946
::
2018
1947
2019
1948
    namespaces = dict(first="http://namespace1/", second="http://namespace2/")
2020
1949
    namespace_soup.css.select("second|child", namespaces=namespaces)
2021
1950
    # [<ns2:child>I'm in namespace 2</ns2:child>]
2022
1951
2023
1952
支持 CSS 筛选器的历史版本
2024
1953
-------------------------
2025
1954
2026
1955
The ``.css`` attribute was added in Beautiful Soup 4.12.0. Before that, only the ``.select()`` and
2027
1956
``.select_one()`` convenience methods were supported.
2028
1957
2029
1958
Soup Sieve 是在 Beautiful Soup 4.7.0 开始集成的。早期版本中有 ``.select()`` 方法，但
2030
1959
仅能支持最常用的 CSS 选择器。
2031
1633
1960
2032
1634
1961
2033
1635
修改文档树
1962
修改文档树
2034
1636
===========
1963
===========
2035
1637
1964
2037
1638
Beautiful Soup的强项是文档树的搜索,但同时也可以方便的修改文档树
1965
Beautiful Soup's main strength is searching the parse tree, but you can also modify the tree and write your changes out as a new HTML or XML document.
2038
1639
1966
2041
1640
修改tag的名称和属性
1967
修改 tag 的名称和属性
2042
1641
-------------------
1968
----------------------
2043
1642
1969
2045
1643
在 `Attributes`_ 的章节中已经介绍过这个功能,但是再看一遍也无妨. 重命名一个tag,改变属性的值,添加或删除属性:
1970
在 :py:attr:`Tag.attrs` 的章节中已经介绍过这个功能，但是再看一遍也无妨。重命名一个 tag, 
2046
1971
改变属性的值，添加或删除属性
2047
1644
1972
2048
1645
::
1973
::
2049
1646
1974
2051
1647
    soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
1975
    soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
2052
1648
    tag = soup.b
1976
    tag = soup.b
2053
1649
1977
2054
1650
    tag.name = "blockquote"
1978
    tag.name = "blockquote"
2055
@@ -1658,47 +1986,64 @@ Beautiful Soup的强项是文档树的搜索,但同时也可以方便的修改
2056
1658
    tag
1986
    tag
2057
1659
    # <blockquote>Extremely bold</blockquote>
1987
    # <blockquote>Extremely bold</blockquote>
2058
1660
1988
2061
1661
修改 .string
1989
修改 ``.string``
2062
1662
-------------
1990
------------------
2063
1663
1991
2065
1664
给tag的 ``.string`` 属性赋值,就相当于用当前的内容替代了原来的内容:
1992
如果设置 tag 的 ``.string`` 属性值，就相当于用新的内容替代了原来的内容:
2066
1665
1993
2067
1666
::
1994
::
2068
1667
1995
2069
1668
    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
1996
    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
2071
1669
    soup = BeautifulSoup(markup)
1997
    soup = BeautifulSoup(markup, 'html.parser')
2072
1670
1998
2073
1671
    tag = soup.a
1999
    tag = soup.a
2074
1672
    tag.string = "New link text."
2000
    tag.string = "New link text."
2075
1673
    tag
2001
    tag
2076
1674
    # <a href="http://example.com/">New link text.</a>
2002
    # <a href="http://example.com/">New link text.</a>
2077
1675
2003
2079
1676
注意: 如果当前的tag包含了其它tag,那么给它的 ``.string`` 属性赋值会覆盖掉原有的所有内容包括子tag
2004
注意：如果 tag 原本包含了其它子节点，原有的所有内容包括子 tag 都会被覆盖掉。
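The warning above is easy to confirm. A minimal sketch, assuming the same markup as the example and the built-in ``html.parser``:

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
tag = soup.a

# Before the assignment the link contains a string and an <i> child.
had_child = tag.i is not None

# Assigning to .string replaces all existing contents, child tags included.
tag.string = "New link text."

print(had_child, tag.i is None, str(tag))
# prints: True True <a href="http://example.com/">New link text.</a>
```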
2080
1677
2005
2083
1678
append()
2006
``append()``
2084
1679
----------
2007
--------------
2085
1680
2008
2087
1681
``Tag.append()`` 方法想tag中添加内容,就好像Python的列表的 ``.append()`` 方法:
2009
向 tag 中添加内容可以使用 ``Tag.append()`` 方法，就好像调用 Python 列表的 ``.append()`` 方法:
2088
1682
2010
2089
1683
::
2011
::
2090
1684
2012
2092
1685
    soup = BeautifulSoup("<a>Foo</a>")
2013
    soup = BeautifulSoup("<a>Foo</a>", 'html.parser')
2093
1686
    soup.a.append("Bar")
2014
    soup.a.append("Bar")
2094
1687
2015
2095
1688
    soup
2016
    soup
2097
1689
    # <html><head></head><body><a>FooBar</a></body></html>
2017
    # <a>FooBar</a>
2098
2018
    soup.a.contents
2099
2019
    # ['Foo', 'Bar']
2100
2020
2101
2021
``extend()``
2102
2022
--------------
2103
2023
2104
2024
从 Beautiful Soup 4.7.0 版本开始，tag 增加了 ``.extend()`` 方法，可以把一个列表中内容，
2105
2025
按顺序全部添加到一个 tag 当中
2106
2026
2107
2027
::
2108
2028
2109
2029
    soup = BeautifulSoup("<a>Soup</a>", 'html.parser')
2110
2030
    soup.a.extend(["'s", " ", "on"])
2111
2031
2112
2032
    soup
2113
2033
    # <a>Soup's on</a>
2114
1690
    soup.a.contents
2034
    soup.a.contents
2116
1691
    # [u'Foo', u'Bar']
2035
    # ['Soup', "'s", ' ', 'on']
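``extend()`` is not limited to strings. The following sketch, which assumes Beautiful Soup 4.7.0 or later, builds a few ``<li>`` tags first and then adds them in one call. The list labels are made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul></ul>", "html.parser")

# Build the new children first, then add them in one call.
items = []
for label in ["one", "two", "three"]:
    li = soup.new_tag("li")
    li.string = label
    items.append(li)

soup.ul.extend(items)
result = str(soup)
print(result)
```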
2117
1692
2036
2118
1693
NavigableString() 和 .new_tag()
2037
NavigableString() 和 .new_tag()
2119
1694
-----------------------------------------
2038
-----------------------------------------
2120
1695
2039
2123
1696
如果想添加一段文本内容到文档中也没问题,可以调用Python的 ``append()`` 方法
2040
如果想添加一段文本内容到文档中，可以将一个 Python 字符串对象传给 ``append()`` 方法，
2124
1697
或调用 ``NavigableString`` 的构造方法:
2041
或调用 ``NavigableString`` 构造方法:
2125
1698
2042
2126
1699
::
2043
::
2127
1700
2044
2129
1701
    soup = BeautifulSoup("<b></b>")
2045
    from bs4 import NavigableString
2130
2046
    soup = BeautifulSoup("<b></b>", 'html.parser')
2131
1702
    tag = soup.b
2047
    tag = soup.b
2132
1703
    tag.append("Hello")
2048
    tag.append("Hello")
2133
1704
    new_string = NavigableString(" there")
2049
    new_string = NavigableString(" there")
2134
@@ -1706,27 +2051,27 @@ NavigableString() 和 .new_tag()
2135
1706
    tag
2051
    tag
2136
1707
    # <b>Hello there</b>
2052
    # <b>Hello there</b>
2137
1708
    tag.contents
2053
    tag.contents
2139
1709
    # [u'Hello', u' there']
2054
    # ['Hello', ' there']
2140
1710
2055
2142
1711
如果想要创建一段注释,或 ``NavigableString`` 的任何子类, 只要调用 NavigableString 的构造方法:
2056
如果想要创建一段注释，或其它 ``NavigableString`` 的子类，只要调用构造方法:
2143
1712
2057
2144
1713
::
2058
::
2145
1714
2059
2146
1715
    from bs4 import Comment
2060
    from bs4 import Comment
2148
1716
    new_comment = soup.new_string("Nice to see you.", Comment)
2061
    new_comment = Comment("Nice to see you.")
2149
1717
    tag.append(new_comment)
2062
    tag.append(new_comment)
2150
1718
    tag
2063
    tag
2151
1719
    # <b>Hello there<!--Nice to see you.--></b>
2064
    # <b>Hello there<!--Nice to see you.--></b>
2152
1720
    tag.contents
2065
    tag.contents
2154
1721
    # [u'Hello', u' there', u'Nice to see you.']
2066
    # ['Hello', ' there', 'Nice to see you.']
2155
1722
2067
2157
1723
# 这是Beautiful Soup 4.2.1 中新增的方法
2068
`(这是 Beautiful Soup 4.4.0 中新增的方法)`
2158
1724
2069
2160
1725
创建一个tag最好的方法是调用工厂方法 ``BeautifulSoup.new_tag()`` :
2070
如果需要新创建一个 tag，最好的方法是调用工厂方法 ``BeautifulSoup.new_tag()`` 
2161
1726
2071
2162
1727
::
2072
::
2163
1728
2073
2165
1729
    soup = BeautifulSoup("<b></b>")
2074
    soup = BeautifulSoup("<b></b>", 'html.parser')
2166
1730
    original_tag = soup.b
2075
    original_tag = soup.b
2167
1731
2076
2168
1732
    new_tag = soup.new_tag("a", href="http://www.example.com")
2077
    new_tag = soup.new_tag("a", href="http://www.example.com")
2169
@@ -1738,58 +2083,62 @@ NavigableString() 和 .new_tag()
2170
1738
    original_tag
2083
    original_tag
2171
1739
    # <b><a href="http://www.example.com">Link text.</a></b>
2084
    # <b><a href="http://www.example.com">Link text.</a></b>
2172
1740
2085
2174
1741
第一个参数作为tag的name,是必填,其它参数选填
2086
只有第一个参数用作 tag 的 name，是必填的。
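Keyword arguments cannot express attribute names such as ``class`` (a Python keyword) or ``data-id`` (not a valid identifier). A possible workaround, assuming a release whose ``new_tag()`` accepts the ``attrs`` dictionary argument, is sketched below. The attribute values are illustrative only.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b></b>", "html.parser")

# Keyword arguments work for ordinary names; the attrs dictionary covers
# names that are not valid Python identifiers or that clash with keywords.
link = soup.new_tag(
    "a",
    href="http://www.example.com",
    attrs={"class": "external", "data-id": "7"},
)
link.string = "Link text."
soup.b.append(link)

print(link["href"], link["data-id"])
```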
2175
1742
2087
2176
1743
insert()
2088
insert()
2177
1744
--------
2089
--------
2178
1745
2090
2180
1746
``Tag.insert()`` 方法与 ``Tag.append()`` 方法类似,区别是不会把新元素添加到父节点 ``.contents`` 属性的最后,而是把元素插入到指定的位置.与Python列表总的 ``.insert()`` 方法的用法下同:
2091
``Tag.insert()`` 方法与 ``Tag.append()`` 方法类似，区别是不会把新元素添加到
2181
2092
父节点 ``.contents`` 属性的最后。而是把元素插入到按顺序指定的位置。与 Python 列表
2182
2093
中的 ``.insert()`` 方法的用法相同
2183
1747
2094
2184
1748
::
2095
::
2185
1749
2096
2186
1750
    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
2097
    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
2188
1751
    soup = BeautifulSoup(markup)
2098
    soup = BeautifulSoup(markup, 'html.parser')
2189
1752
    tag = soup.a
2099
    tag = soup.a
2190
1753
2100
2191
1754
    tag.insert(1, "but did not endorse ")
2101
    tag.insert(1, "but did not endorse ")
2192
1755
    tag
2102
    tag
2193
1756
    # <a href="http://example.com/">I linked to but did not endorse <i>example.com</i></a>
2103
    # <a href="http://example.com/">I linked to but did not endorse <i>example.com</i></a>
2194
1757
    tag.contents
2104
    tag.contents
2196
1758
    # [u'I linked to ', u'but did not endorse', <i>example.com</i>]
2105
    # ['I linked to ', 'but did not endorse ', <i>example.com</i>]
2197
1759
2106
2198
1760
insert_before() 和 insert_after()
2107
insert_before() 和 insert_after()
2199
1761
-----------------------------------
2108
-----------------------------------
2200
1762
2109
2202
1763
``insert_before()`` 方法在当前tag或文本节点前插入内容:
2110
``insert_before()`` 方法可以在文档树中直接在目标之前添加 tag 或文本
2203
1764
2111
2204
1765
::
2112
::
2205
1766
2113
2207
1767
    soup = BeautifulSoup("<b>stop</b>")
2114
    soup = BeautifulSoup("<b>leave</b>", 'html.parser')
2208
1768
    tag = soup.new_tag("i")
2115
    tag = soup.new_tag("i")
2209
1769
    tag.string = "Don't"
2116
    tag.string = "Don't"
2210
1770
    soup.b.string.insert_before(tag)
2117
    soup.b.string.insert_before(tag)
2211
1771
    soup.b
2118
    soup.b
2213
1772
    # <b><i>Don't</i>stop</b>
2119
    # <b><i>Don't</i>leave</b>
2214
1773
2120
2216
1774
``insert_after()`` 方法在当前tag或文本节点后插入内容:
2121
``insert_after()`` 方法可以在文档树中直接在目标之后添加 tag 或文本
2217
1775
2122
2218
1776
::
2123
::
2219
1777
2124
2221
1778
    soup.b.i.insert_after(soup.new_string(" ever "))
2125
    div = soup.new_tag('div')
2222
2126
    div.string = 'ever'
2223
2127
    soup.b.i.insert_after(" you ", div)
2224
1779
    soup.b
2128
    soup.b
2226
1780
    # <b><i>Don't</i> ever stop</b>
2129
    # <b><i>Don't</i> you <div>ever</div>leave</b>
2227
1781
    soup.b.contents
2130
    soup.b.contents
2229
1782
    # [<i>Don't</i>, u' ever ', u'stop']
2131
    # [<i>Don't</i>, ' you ', <div>ever</div>, 'leave']
2230
1783
2132
2231
1784
clear()
2133
clear()
2232
1785
--------
2134
--------
2233
1786
2135
2235
1787
``Tag.clear()`` 方法移除当前tag的内容:
2136
``Tag.clear()`` 方法可以移除 tag 的内容:
2236
1788
2137
2237
1789
::
2138
::
2238
1790
2139
2239
1791
    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
2140
    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
2241
1792
    soup = BeautifulSoup(markup)
2141
    soup = BeautifulSoup(markup, 'html.parser')
2242
1793
    tag = soup.a
2142
    tag = soup.a
2243
1794
2143
2244
1795
    tag.clear()
2144
    tag.clear()
2245
@@ -1799,12 +2148,12 @@ clear()
2246
1799
extract()
2148
extract()
2247
1800
----------
2149
----------
2248
1801
2150
2250
1802
``PageElement.extract()`` 方法将当前tag移除文档树,并作为方法结果返回:
2151
``PageElement.extract()`` 方法将当前 tag 或文本从文档树中移除，并返回被删除的内容:
2251
1803
2152
2252
1804
::
2153
::
2253
1805
2154
2254
1806
    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
2155
    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
2256
1807
    soup = BeautifulSoup(markup)
2156
    soup = BeautifulSoup(markup, 'html.parser')
2257
1808
    a_tag = soup.a
2157
    a_tag = soup.a
2258
1809
2158
2259
1810
    i_tag = soup.i.extract()
2159
    i_tag = soup.i.extract()
2260
@@ -1816,15 +2165,16 @@ extract()
2261
1816
    # <i>example.com</i>
2165
    # <i>example.com</i>
2262
1817
2166
2263
1818
    print(i_tag.parent)
2167
    print(i_tag.parent)
2265
1819
    None
2168
    # None
2266
1820
2169
2268
1821
这个方法实际上产生了2个文档树: 一个是用来解析原始文档的 ``BeautifulSoup`` 对象,另一个是被移除并且返回的tag.被移除并返回的tag可以继续调用 ``extract`` 方法:
2170
这个方法实际上产生了 2 个文档树: 一个是原始文档的 ``BeautifulSoup`` 对象，
2269
2171
另一个是被移除并且返回的文档树。还可以在新生成的文档树上继续调用 ``extract`` 方法:
2270
1822
2172
2271
1823
::
2173
::
2272
1824
2174
2273
1825
    my_string = i_tag.string.extract()
2175
    my_string = i_tag.string.extract()
2274
1826
    my_string
2176
    my_string
2276
1827
    # u'example.com'
2177
    # 'example.com'
2277
1828
2178
2278
1829
    print(my_string.parent)
2179
    print(my_string.parent)
2279
1830
    # None
2180
    # None
2280
@@ -1834,71 +2184,135 @@ extract()
2281
1834
decompose()
2184
decompose()
2282
1835
------------
2185
------------
2283
1836
2186
2285
1837
``Tag.decompose()`` 方法将当前节点移除文档树并完全销毁:
2187
``Tag.decompose()`` removes the current node from the tree, then completely destroys it and its contents:
2286
1838
2188
2287
1839
::
2189
::
2288
1840
2190
2289
1841
    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
2191
    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
2291
1842
    soup = BeautifulSoup(markup)
2192
    soup = BeautifulSoup(markup, 'html.parser')
2292
1843
    a_tag = soup.a
2193
    a_tag = soup.a
2293
2194
    i_tag = soup.i
2294
1844
2195
2297
1845
    soup.i.decompose()
2196
    i_tag.decompose()
2296
1846
2298
1847
    a_tag
2197
    a_tag
2299
1848
    # <a href="http://example.com/">I linked to</a>
2198
    # <a href="http://example.com/">I linked to</a>
2300
1849
2199
2301
2200
被 decompose 的 Tag 或者 `NavigableString` 是不稳定的，什么时候都不要使用它。如果不确定
2302
2201
某些内容是否被 decompose 了，可以通过 ``.decomposed`` 属性进行检查 `(Beautiful Soup 4.9.0 新增)`
2303
2202
2304
2203
::
2305
2204
2306
2205
    i_tag.decomposed
2307
2206
    # True
2308
2207
2309
2208
    a_tag.decomposed
2310
2209
    # False
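To make the difference between ``extract()`` and ``decompose()`` concrete, here is a small sketch under the assumption that Beautiful Soup 4.9.0 or later is installed (for ``.decomposed``). The markup and the second document are invented for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<div><p>keep</p><span>move</span><em>drop</em></div>", "html.parser"
)
other = BeautifulSoup("<section></section>", "html.parser")

# extract() detaches the node but keeps it usable, so it can be re-attached.
span = soup.span.extract()
other.section.append(span)

# decompose() detaches the node and destroys it; afterwards only
# the .decomposed flag should be consulted.
em = soup.em
em.decompose()

print(soup)
print(other)
print(span.decomposed, em.decomposed)
```

Use ``extract()`` when the removed element is needed again, and ``decompose()`` when it can be thrown away.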
2311
2210
2312
2211
.. _replace_with():
2313
2212
2314
1850
replace_with()
2213
replace_with()
2315
1851
---------------
2214
---------------
2316
1852
2215
2318
1853
``PageElement.replace_with()`` 方法移除文档树中的某段内容,并用新tag或文本节点替代它:
2216
``PageElement.replace_with()`` 方法移除文档树中的某段内容，并用新 tag 或文本节点替代它:
2319
1854
2217
2320
1855
::
2218
::
2321
1856
2219
2322
1857
    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
2220
    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
2324
1858
    soup = BeautifulSoup(markup)
2221
    soup = BeautifulSoup(markup, 'html.parser')
2325
1859
    a_tag = soup.a
2222
    a_tag = soup.a
2326
1860
2223
2327
1861
    new_tag = soup.new_tag("b")
2224
    new_tag = soup.new_tag("b")
2329
1862
    new_tag.string = "example.net"
2225
    new_tag.string = "example.com"
2330
1863
    a_tag.i.replace_with(new_tag)
2226
    a_tag.i.replace_with(new_tag)
2331
1864
2227
2332
1865
    a_tag
2228
    a_tag
2334
1866
    # <a href="http://example.com/">I linked to <b>example.net</b></a>
2229
    # <a href="http://example.com/">I linked to <b>example.com</b></a>
2335
2230
2336
2231
    bold_tag = soup.new_tag("b")
2337
2232
    bold_tag.string = "example"
2338
2233
    i_tag = soup.new_tag("i")
2339
2234
    i_tag.string = "net"
2340
2235
    a_tag.b.replace_with(bold_tag, ".", i_tag)
2341
2236
2342
2237
    a_tag
2343
2238
    # <a href="http://example.com/">I linked to <b>example</b>.<i>net</i></a>
2344
1867
2239
2346
1868
``replace_with()`` 方法返回被替代的tag或文本节点,可以用来浏览或添加到文档树其它地方
2240
``replace_with()`` 方法返回被替代的 tag 或文本节点，可以用来检查或添加到文档树其它地方。
2347
2241
2348
2242
`传递多个参数给 replace_with() 方法在 Beautiful Soup 4.10.0 版本中新增`
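Since ``replace_with()`` returns the element it removed, that element can be moved elsewhere in the tree. The sketch below illustrates this with made-up markup; the variable names are not part of the library.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Old <b>bold</b> text</p><div></div>", "html.parser")

em = soup.new_tag("em")
em.string = "emphasis"

# replace_with() hands back the element that was removed from the tree.
removed = soup.b.replace_with(em)
detached = removed.parent is None

# The removed element can be attached somewhere else.
soup.div.append(removed)
print(soup)
```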
2349
1869
2243
2350
1870
wrap()
2244
wrap()
2352
1871
------
2245
--------
2353
1872
2246
2355
1873
``PageElement.wrap()`` 方法可以对指定的tag元素进行包装 [8]_ ,并返回包装后的结果:
2247
``PageElement.wrap()`` 方法可以对指定的tag元素进行包装 [8]_ ，并返回包装后的结果:
2356
1874
2248
2357
1875
::
2249
::
2358
1876
2250
2360
1877
    soup = BeautifulSoup("<p>I wish I was bold.</p>")
2251
    soup = BeautifulSoup("<p>I wish I was bold.</p>", 'html.parser')
2361
1878
    soup.p.string.wrap(soup.new_tag("b"))
2252
    soup.p.string.wrap(soup.new_tag("b"))
2362
1879
    # <b>I wish I was bold.</b>
2253
    # <b>I wish I was bold.</b>
2363
1880
2254
2364
1881
    soup.p.wrap(soup.new_tag("div"))
2255
    soup.p.wrap(soup.new_tag("div"))
2365
1882
    # <div><p><b>I wish I was bold.</b></p></div>
2256
    # <div><p><b>I wish I was bold.</b></p></div>
2366
1883
2257
2368
1884
该方法在 Beautiful Soup 4.0.5 中添加
2258
该方法在 Beautiful Soup 4.0.5 中添加。
2369
1885
2259
2370
1886
unwrap()
2260
unwrap()
2371
1887
---------
2261
---------
2372
1888
2262
2374
1889
``Tag.unwrap()`` 方法与 ``wrap()`` 方法相反.将移除tag内的所有tag标签,该方法常被用来进行标记的解包:
2263
``Tag.unwrap()`` is the opposite of ``wrap()``. It replaces a tag with whatever is inside that tag,
2375
2264
which is useful for stripping out markup:
2376
1890
2265
2377
1891
::
2266
::
2378
1892
2267
2379
1893
    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
2268
    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
2381
1894
    soup = BeautifulSoup(markup)
2269
    soup = BeautifulSoup(markup, 'html.parser')
2382
1895
    a_tag = soup.a
2270
    a_tag = soup.a
2383
1896
2271
2384
1897
    a_tag.i.unwrap()
2272
    a_tag.i.unwrap()
2385
1898
    a_tag
2273
    a_tag
2386
1899
    # <a href="http://example.com/">I linked to example.com</a>
2274
    # <a href="http://example.com/">I linked to example.com</a>
2387
1900
2275
2389
1901
与 ``replace_with()`` 方法相同, ``unwrap()`` 方法返回被移除的tag
2276
与 ``replace_with()`` 方法相同，``unwrap()`` 方法会返回被移除的 tag。
2390
2277
2391
2278
smooth()
2392
2279
----------
2393
2280
2394
2281
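After calling a series of tree-modifying methods, you may end up with two or more ``NavigableString`` objects next to each other.
2395
2282
Beautiful Soup has no problem with this, but since it cannot happen in a freshly parsed document, the following behavior may be unexpected: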
2396
2283
2397
2284
::
2398
2285
2399
2286
    soup = BeautifulSoup("<p>A one</p>", 'html.parser')
2400
2287
    soup.p.append(", a two")
2401
2288
2402
2289
    soup.p.contents
2403
2290
    # ['A one', ', a two']
2404
2291
2405
2292
    print(soup.p.encode())
2406
2293
    # b'<p>A one, a two</p>'
2407
2294
2408
2295
    print(soup.p.prettify())
2409
2296
    # <p>
2410
2297
    #  A one
2411
2298
    #  , a two
2412
2299
    # </p>
2413
2300
2414
2301
In that case, call ``Tag.smooth()`` to clean up the tree by merging adjacent strings into one:
2415
2302
2416
2303
::
2417
2304
2418
2305
    soup.smooth()
2419
2306
2420
2307
    soup.p.contents
2421
2308
    # ['A one, a two']
2422
2309
2423
2310
    print(soup.p.prettify())
2424
2311
    # <p>
2425
2312
    #  A one, a two
2426
2313
    # </p>
2427
2314
2428
2315
该方法在 Beautiful Soup 4.8.0 中添加。
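A typical case where adjacent strings appear is after ``unwrap()``. The following sketch assumes Beautiful Soup 4.8.0 or later and uses illustrative markup.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>A <span>b</span> c</p>", "html.parser")

# Unwrapping every <span> leaves its text behind as a separate string.
for span in soup.find_all("span"):
    span.unwrap()
pieces_before = len(soup.p.contents)

# smooth() merges the neighbouring strings into a single one.
soup.smooth()
pieces_after = len(soup.p.contents)

print(pieces_before, pieces_after, repr(soup.p.string))
```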
2429
1902
2316
2430
1903
输出
2317
输出
2431
1904
====
2318
====
2432
@@ -1906,12 +2320,13 @@ unwrap()
2433
1906
格式化输出
2320
格式化输出
2434
1907
-----------
2321
-----------
2435
1908
2322
2437
1909
``prettify()`` 方法将Beautiful Soup的文档树格式化后以Unicode编码输出,每个XML/HTML标签都独占一行
2323
``prettify()`` 方法将 Beautiful Soup 的文档树格式化后以 Unicode 编码输出，
2438
2324
每个 XML/HTML 标签都独占一行
2439
1910
2325
2440
1911
::
2326
::
2441
1912
2327
2444
1913
    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
2328
    markup = '<html><head><body><a href="http://example.com/">I linked to <i>example.com</i></a>'
2445
1914
    soup = BeautifulSoup(markup)
2329
    soup = BeautifulSoup(markup, 'html.parser')
2446
1915
    soup.prettify()
2330
    soup.prettify()
2447
1916
    # '<html>\n <head>\n </head>\n <body>\n  <a href="http://example.com/">\n...'
2331
    # '<html>\n <head>\n </head>\n <body>\n  <a href="http://example.com/">\n...'
2448
1917
2332
2449
@@ -1929,7 +2344,7 @@ unwrap()
2450
1929
    #  </body>
2344
    #  </body>
2451
1930
    # </html>
2345
    # </html>
2452
1931
2346
2454
1932
``BeautifulSoup`` 对象和它的tag节点都可以调用 ``prettify()`` 方法:
2347
``BeautifulSoup`` 对象的根节点和它的所有 tag 节点都可以调用 ``prettify()`` 方法:
2455
1933
2348
2456
1934
::
2349
::
2457
1935
2350
2458
@@ -1941,237 +2356,439 @@ unwrap()
2459
1941
    #  </i>
2356
    #  </i>
2460
1942
    # </a>
2357
    # </a>
2461
1943
2358
2462
2359
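Because it adds whitespace (in the form of newlines), ``prettify()`` changes the meaning of an HTML document,
2463
2360
so do not use it to reformat one. The goal of ``prettify()`` is to help you visually understand the structure of a document.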
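The effect of the added whitespace can be checked by re-parsing the prettified markup. This is only a sketch with illustrative markup: the exact whitespace may vary, so the code compares the texts rather than asserting a specific layout.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>a<b>b</b>c</p>", "html.parser")
original_text = soup.get_text()

# Re-parse the prettified markup and compare the text content.
reparsed = BeautifulSoup(soup.prettify(), "html.parser")
reparsed_text = reparsed.get_text()

print(repr(original_text))
print(repr(reparsed_text))
```

The two texts differ only in whitespace, but in HTML that whitespace can change how inline content is rendered.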
2464
2361
2465
1944
压缩输出
2362
压缩输出
2466
1945
----------
2363
----------
2467
1946
2364
2469
1947
如果只想得到结果字符串,不重视格式,那么可以对一个 ``BeautifulSoup`` 对象或 ``Tag`` 对象使用Python的 ``unicode()`` 或 ``str()`` 方法:
2365
If you only want the string and do not care about formatting, call Python's ``str()`` on a ``BeautifulSoup`` object
2470
2366
or on a ``Tag`` within it (``unicode()`` exists only in Python 2):
2471
1948
2367
2472
1949
::
2368
::
2473
1950
2369
2474
1951
    str(soup)
2370
    str(soup)
2475
1952
    # '<html><head></head><body><a href="http://example.com/">I linked to <i>example.com</i></a></body></html>'
2371
    # '<html><head></head><body><a href="http://example.com/">I linked to <i>example.com</i></a></body></html>'
2476
1953
2372
2479
1954
    unicode(soup.a)
2373
    str(soup.a)
2480
1955
    # u'<a href="http://example.com/">I linked to <i>example.com</i></a>'
2374
    # '<a href="http://example.com/">I linked to <i>example.com</i></a>'
2481
1956
2375
2483
1957
``str()`` 方法返回UTF-8编码的字符串,可以指定 `编码`_ 的设置.
2376
``str()`` returns a Unicode string in Python 3 (under Python 2 it returned a UTF-8 encoded string); see `编码`_ for other options.
2484
1958
2377
2486
1959
还可以调用 ``encode()`` 方法获得字节码或调用 ``decode()`` 方法获得Unicode.
2378
还可以调用 ``encode()`` 方法获得字节码或调用 ``decode()`` 方法获得Unicode。
2487
1960
2379
2488
1961
输出格式
2380
输出格式
2489
1962
---------
2381
---------
2490
1963
2382
2492
1964
Beautiful Soup输出是会将HTML中的特殊字符转换成Unicode,比如“&lquot;”:
2383
If a document contains HTML entities such as “&ldquo;”, Beautiful Soup converts them to Unicode characters:
2493
1965
2384
2494
1966
::
2385
::
2495
1967
2386
2499
1968
    soup = BeautifulSoup("&ldquo;Dammit!&rdquo; he said.")
2387
    soup = BeautifulSoup("&ldquo;Dammit!&rdquo; he said.", 'html.parser')
2500
1969
    unicode(soup)
2388
    str(soup)
2501
1970
    # u'<html><head></head><body>\u201cDammit!\u201d he said.</body></html>'
2389
    # '“Dammit!” he said.'
2502
1971
2390
2504
1972
如果将文档转换成字符串,Unicode编码会被编码成UTF-8.这样就无法正确显示HTML特殊字符了:
2391
If you then convert the document to a bytestring, the Unicode characters are encoded as UTF-8. The original HTML entities are not restored:
2505
1973
2392
2506
1974
::
2393
::
2507
1975
2394
2510
1976
    str(soup)
2395
    soup.encode("utf8")
2511
1977
    # '<html><head></head><body>\xe2\x80\x9cDammit!\xe2\x80\x9d he said.</body></html>'
2396
    # b'\xe2\x80\x9cDammit!\xe2\x80\x9d he said.'
2512
2397
2513
2398
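By default, the only characters escaped on output are bare ampersands and angle brackets. These become "&amp;", "&lt;" and "&gt;", so that Beautiful Soup
2514
2399
does not inadvertently generate invalid HTML or XML: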
2515
2400
2516
2401
::
2517
2402
2518
2403
    soup = BeautifulSoup("<p>The law firm of Dewey, Cheatem, & Howe</p>", 'html.parser')
2519
2404
    soup.p
2520
2405
    # <p>The law firm of Dewey, Cheatem, &amp; Howe</p>
2521
2406
2522
2407
    soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>', 'html.parser')
2523
2408
    soup.a
2524
2409
    # <a href="http://example.com/?foo=val1&amp;bar=val2">A link</a>
2525
2410
2526
2411
修改默认转义规则的方法是，设置 ``prettify()``, ``encode()``, 或 ``decode()`` 方法的 ``formatter``
2527
2412
参数。Beautiful Soup 可以识别 5 种 ``formatter`` 值。
2528
2413
2529
2414
The default is ``formatter="minimal"``. Strings are processed only enough to ensure that Beautiful Soup generates valid HTML/XML:
2530
2415
2531
2416
::
2532
2417
2533
2418
    french = "<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>"
2534
2419
    soup = BeautifulSoup(french, 'html.parser')
2535
2420
    print(soup.prettify(formatter="minimal"))
2536
2421
    # <p>
2537
2422
    #  Il a dit &lt;&lt;Sacré bleu!&gt;&gt;
2538
2423
    # </p>
2539
2424
2540
2425
设置为 ``formatter="html"`` 时，Beautiful Soup 会尽可能把 Unicode 字符转换为 HTML 实体
2541
2426
2542
2427
::
2543
2428
2544
2429
    print(soup.prettify(formatter="html"))
2545
2430
    # <p>
2546
2431
    #  Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;
2547
2432
    # </p>
2548
2433
2549
2434
``formatter="html5"`` is similar to ``formatter="html"``, except that Beautiful Soup
2550
2435
omits the closing slash in HTML void tags such as “br”:
2551
2436
2552
2437
::
2553
2438
2554
2439
    br = BeautifulSoup("<br>", 'html.parser').br
2555
2440
2556
2441
    print(br.encode(formatter="html"))
2557
2442
    # b'<br/>'
2558
2443
2559
2444
    print(br.encode(formatter="html5"))
2560
2445
    # b'<br>'
2561
2446
2562
2447
另外，如果属性的值为空字符串的，它会变为 HTML 风格的 boolean 属性
2563
2448
2564
2449
::
2565
2450
2566
2451
    option = BeautifulSoup('<option selected=""></option>', 'html.parser').option
2567
2452
    print(option.encode(formatter="html"))
2568
2453
    # b'<option selected=""></option>'
2569
2454
2570
2455
    print(option.encode(formatter="html5"))
2571
2456
    # b'<option selected></option>'
2572
2457
2573
2458
这种机制在 Beautiful Soup 4.10.0 中添加。
2574
2459
2575
2460
设置为 ``formatter=None`` 时，Beautiful Soup 在输出时不会修改任何字符串内容。这是效率最高的选项，
2576
2461
但可能导致输出非法的 HTML/XML，比如下面例子
2577
2462
2578
2463
::
2579
2464
2580
2465
    print(soup.prettify(formatter=None))
2581
2466
    # <p>
2582
2467
    #  Il a dit <<Sacré bleu!>>
2583
2468
    # </p>
2584
2469
2585
2470
    link_soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>', 'html.parser')
2586
2471
    print(link_soup.a.encode(formatter=None))
2587
2472
    # b'<a href="http://example.com/?foo=val1&bar=val2">A link</a>'
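The three string-valued formatters can be compared side by side with ``decode()``. This is a sketch with an invented French phrase; the expected behaviour follows the descriptions above.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>caf&eacute; &lt; th&eacute;</p>", "html.parser")

minimal = soup.p.decode(formatter="minimal")  # escapes only what is required
as_html = soup.p.decode(formatter="html")     # uses named entities where possible
raw = soup.p.decode(formatter=None)           # no escaping at all

print(minimal)
print(as_html)
print(raw)
```

Only the last form is at risk of producing invalid markup, because the bare ``<`` is written out unescaped.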
2588
2473
2589
2474
格式化对象
2590
2475
---------------
2591
2476
2592
2477
如果需要更复杂的机制来控制输出内容，可以实例化 Beautiful Soup 的 formatter 实例，
2593
2478
然后用作 ``formatter`` 参数。
2594
2479
2595
2480
.. py:class:: HTMLFormatter
2596
2481
2597
2482
可以用来自定义 HTML 文档的格式化规则。
2598
2483
2599
2484
下面的 formatter 例子，可以将字符串全部转化为大写，不论是文字节点中的字符还是属性值 
2600
2485
2601
2486
::
2602
2487
2603
2488
    from bs4.formatter import HTMLFormatter
2604
2489
    def uppercase(text):
2605
2490
        return text.upper()
2606
2491
2607
2492
    formatter = HTMLFormatter(uppercase)
2608
2493
2609
2494
    print(soup.prettify(formatter=formatter))
2610
2495
    # <p>
2611
2496
    #  IL A DIT <<SACRÉ BLEU!>>
2612
2497
    # </p>
2613
2498
2614
2499
    print(link_soup.a.prettify(formatter=formatter))
2615
2500
    # <a href="HTTP://EXAMPLE.COM/?FOO=VAL1&BAR=VAL2">
2616
2501
    #  A LINK
2617
2502
    # </a>
2618
2503
2619
2504
下面的 formatter 例子，在美化文档时增加缩进长度
2620
2505
2621
2506
::
2622
2507
2623
2508
    formatter = HTMLFormatter(indent=8)
2624
2509
    print(link_soup.a.prettify(formatter=formatter))
2625
2510
    # <a href="http://example.com/?foo=val1&bar=val2">
2626
2511
    #         A link
2627
2512
    # </a>
2628
2513
2629
2514
.. py:class:: XMLFormatter
2630
2515
2631
2516
可以用来自定义 XML 文档的格式化规则。
2632
2517
2633
2518
编写自定义 formatter
2634
2519
----------------------
2635
2520
2636
2521
Subclassing :py:class:`HTMLFormatter` or :py:class:`XMLFormatter` gives you even more control over the output.
2637
2522
For example, Beautiful Soup sorts the attributes in every tag by default:
2638
2523
2639
2524
::
2640
2525
2641
2526
    attr_soup = BeautifulSoup(b'<p z="1" m="2" a="3"></p>', 'html.parser')
2642
2527
    print(attr_soup.p.encode())
2643
2528
    # b'<p a="3" m="2" z="1"></p>'
2644
2529
2645
2530
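To turn this off, subclass the formatter and override ``Formatter.attributes()``, which controls which attributes are output
2646
2531
and in what order. The following implementation also filters out any attribute called “m”: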
2647
2532
2648
2533
::
2649
2534
2650
2535
    class UnsortedAttributes(HTMLFormatter):
2651
2536
        def attributes(self, tag):
2652
2537
            for k, v in tag.attrs.items():
2653
2538
                if k == 'm':
2654
2539
                    continue
2655
2540
                yield k, v
2656
2541
2657
2542
    print(attr_soup.p.encode(formatter=UnsortedAttributes())) 
2658
2543
    # b'<p z="1" a="3"></p>'
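If the goal is only to keep the attributes in the order in which they are stored, without filtering anything, a smaller subclass may be enough. ``SourceOrderAttributes`` is a hypothetical name chosen for this sketch, and the example assumes the parser stores attributes in source order, as ``html.parser`` does.

```python
from bs4 import BeautifulSoup
from bs4.formatter import HTMLFormatter


class SourceOrderAttributes(HTMLFormatter):
    """Hypothetical formatter that keeps attributes in their stored order."""

    def attributes(self, tag):
        # tag.attrs is a dict, so iterating it follows insertion order.
        return list(tag.attrs.items())


attr_soup = BeautifulSoup('<p z="1" m="2" a="3"></p>', "html.parser")
default_output = attr_soup.p.decode()
unsorted_output = attr_soup.p.decode(formatter=SourceOrderAttributes())

print(default_output)
print(unsorted_output)
```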
2659
2544
2660
2545
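One last caveat: if you create a `CData` object, the text inside it is always presented exactly as it appears, with no formatting.
2661
2546
Beautiful Soup still calls your entity substitution function, in case you wrote a custom function that counts all the strings in the document,
2662
2547
but it ignores the return value: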
2663
2548
2664
2549
::
2665
2550
2666
2551
    from bs4.element import CData
2667
2552
    soup = BeautifulSoup("<a></a>", 'html.parser')
2668
2553
    soup.a.string = CData("one < three")
2669
2554
    print(soup.a.prettify(formatter="html"))
2670
2555
    # <a>
2671
2556
    #  <![CDATA[one < three]]>
2672
2557
    # </a>
2673
1978
2558
2674
1979
get_text()
2559
get_text()
2675
1980
----------
2560
----------
2676
1981
2561
2678
1982
如果只想得到tag中包含的文本内容,那么可以调用 ``get_text()`` 方法,这个方法获取到tag中包含的所有文版内容包括子孙tag中的内容,并将结果作为Unicode字符串返回:
2562
如果只想得到 tag 中包含的文本内容，那么可以调用 ``get_text()`` 方法，这个方法获取到 tag
2679
2563
包含的所有文本内容，包括子孙 tag 中的可读内容，并将结果作为单独的一个 Unicode 编码字符串返回:
2680
1983
2564
2681
1984
::
2565
::
2682
1985
2566
2683
1986
    markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
2567
    markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
2685
1987
    soup = BeautifulSoup(markup)
2568
    soup = BeautifulSoup(markup, 'html.parser')
2686
1988
2569
2687
1989
    soup.get_text()
2570
    soup.get_text()
2689
1990
    u'\nI linked to example.com\n'
2571
    '\nI linked to example.com\n'
2690
1991
    soup.i.get_text()
2572
    soup.i.get_text()
2692
1992
    u'example.com'
2573
    'example.com'
2693
1993
2574
2695
1994
可以通过参数指定tag的文本内容的分隔符:
2575
可以通过参数指定 tag 的文本内容的连接符:
2696
1995
2576
2697
1996
::
2577
::
2698
1997
2578
2699
1998
    # soup.get_text("|")
2579
    # soup.get_text("|")
2701
1999
    u'\nI linked to |example.com|\n'
2580
    '\nI linked to |example.com|\n'
2702
2000
2581
2704
2001
还可以去除获得文本内容的前后空白:
2582
还可以去除每一个文本片段内容的前后空白:
2705
2002
2583
2706
2003
::
2584
::
2707
2004
2585
2708
2005
    # soup.get_text("|", strip=True)
2586
    # soup.get_text("|", strip=True)
2710
2006
    u'I linked to|example.com'
2587
    'I linked to|example.com'
2711
2007
2588
2713
2008
或者使用 `.stripped_strings`_ 生成器,获得文本列表后手动处理列表:
2589
但这种情况，你可能应该使用 `.stripped_strings <string-generators>`_ 生成器，
2714
2590
获得文本列表后手动处理内容:
2715
2009
2591
2716
2010
::
2592
::
2717
2011
2593
2718
2012
    [text for text in soup.stripped_strings]
2594
    [text for text in soup.stripped_strings]
2720
2013
    # [u'I linked to', u'example.com']
2595
    # ['I linked to', 'example.com']
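The relationship between ``get_text()`` with ``strip=True`` and ``.stripped_strings`` can be seen in the following sketch, which uses illustrative markup. Joining the generator by hand gives the same text and leaves room for custom handling of each fragment.

```python
from bs4 import BeautifulSoup

markup = "<div>\n<h1> Title </h1>\n<p>First <b>bold</b> line.</p>\n</div>"
soup = BeautifulSoup(markup, "html.parser")

# Both approaches strip each text fragment and drop the empty ones.
joined_by_get_text = soup.get_text(" ", strip=True)
joined_manually = " ".join(soup.stripped_strings)

print(joined_by_get_text)
print(joined_manually)
```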
2721
2596
2722
2597
*As of Beautiful Soup 4.9.0, when lxml or html.parser is in use, the contents of <script>, <style>, and
2723
2598
<template> tags are generally not treated as 'text', since those tags are not part of the
2724
2599
human-visible content of the page.*
2725
2600
2726
2601
*Beautiful Soup 4.10.0 版本以后，可以在 NavigableString 对象上调用 get_text()，.strings 
2727
2602
或 .stripped_strings 属性，结果会返回对象本身或空，这种用法只有在对混合类型列表迭代时才会用到。*
2728
2014
2603
2729
2015
指定文档解析器
2604
指定文档解析器
2730
2016
==============
2605
==============
2731
2017
2606
2733
2018
如果仅是想要解析HTML文档,只要用文档创建 ``BeautifulSoup`` 对象就可以了.Beautiful Soup会自动选择一个解析器来解析文档.但是还可以通过参数指定使用那种解析器来解析当前文档.
2607
如果仅是想要解析HTML文档，只需要创建 ``BeautifulSoup`` 对象时传入文档就可以了。Beautiful Soup
2734
2608
会自动选择一个解析器来解析文档。同时还可以使用额外参数，来指定文档解析器。
2735
2609
2736
2610
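The first argument to the ``BeautifulSoup`` constructor is the markup to parse, given as a string or an open filehandle.
2737
2611
The second argument specifies how the markup should be parsed.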
2738
2019
2612
2740
2020
``BeautifulSoup`` 第一个参数应该是要被解析的文档字符串或是文件句柄,第二个参数用来标识怎样解析文档.如果第二个参数为空,那么Beautiful Soup根据当前系统安装的库自动选择解析器,解析器的优先数序: lxml, html5lib, Python标准库.在下面两种条件下解析器优先顺序会变化:
2613
如果不指定解析器，默认使用已安装的 `最佳` HTML 解析器。Beautiful Soup 把 lxml 解析器排在第一, 
2741
2614
然后是 html5lib, 然后是 Python 标准库。在下面两种条件下解析器优先顺序会变化:
2742
2021
2615
2745
2022
    * 要解析的文档是什么类型: 目前支持,  “html”, “xml”, 和 “html5”
2616
    * 要解析的文档是什么类型: 目前支持， “html”，“xml”，和 “html5”
2746
2023
    * 指定使用哪种解析器: 目前支持, “lxml”, “html5lib”, 和 “html.parser”
2617
    * 指定使用哪种解析器: 目前支持，“lxml”，“html5lib”，和 “html.parser”（Python 标准库）
2747
2024
2618
2749
2025
`安装解析器`_ 章节介绍了可以使用哪种解析器,以及如何安装.
2619
`安装解析器`_ 章节介绍了可以使用哪种解析器，以及如何安装。
2750
2026
2620
2752
2027
如果指定的解析器没有安装,Beautiful Soup会自动选择其它方案.目前只有 lxml 解析器支持XML文档的解析,在没有安装lxml库的情况下,创建 ``beautifulsoup`` 对象时无论是否指定使用lxml,都无法得到解析后的对象
2621
如果指定的解析器没有安装，Beautiful Soup会自动选择其它方案。目前只有 lxml 解析器支持XML文档的解析，
2753
2622
在没有安装 lxml 库的情况下，无法自动选择 XML 文档解析器，手动指定 lxml 也不行。
2754
2028
2623
2755
2029
解析器之间的区别
2624
解析器之间的区别
2756
2030
-----------------
2625
-----------------
2757
2031
2626
2759
2032
Beautiful Soup为不同的解析器提供了相同的接口,但解析器本身是有区别的.同一篇文档被不同的解析器解析后可能会生成不同结构的树型文档.区别最大的是HTML解析器和XML解析器.以下是被Python自带的HTML解析器解析成的HTML片段:
2627
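Beautiful Soup presents the same interface to a number of different parsers, but each parser is different. Different parsers can create
2760
2628
different trees from the same document. The biggest differences are between the HTML parsers and the XML parser. Here is a short document parsed as HTML: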
2761
2033
2629
2762
2034
::
2630
::
2763
2035
2631
2766
2036
    BeautifulSoup("<a><b /></a>")
2632
    BeautifulSoup("<a><b/></a>", "html.parser")
2767
2037
    # <html><head></head><body><a><b></b></a></body></html>
2633
    # <a><b></b></a>
2768
2038
2634
2770
2039
因为空标签<b />不符合HTML标准,所以解析器把它解析成<b></b>
2635
因为空标签 <b /> 不符合 HTML 标准，html.parser 解析器把它解析成一对儿 <b></b>。
2771
2040
2636
2773
2041
同样的文档使用XML解析如下(解析XML需要安装lxml库).注意,空标签<b />依然被保留,并且文档前添加了XML头,而不是被包含在<html>标签内:
2637
同样的文档使用 XML 解析结果如下(解析 XML 需要安装 lxml 库)。注意，空标签 <b /> 依然被保留，
2774
2638
并且文档前添加了 XML 头，而不是被包含在 <html> 标签内:
2775
2042
2639
2776
2043
::
2640
::
2777
2044
2641
2779
2045
    BeautifulSoup("<a><b /></a>", "xml")
2642
    print(BeautifulSoup("<a><b/></a>", "xml"))
2780
2046
    # <?xml version="1.0" encoding="utf-8"?>
2643
    # <?xml version="1.0" encoding="utf-8"?>
2781
2047
    # <a><b/></a>
2644
    # <a><b/></a>
2782
2048
2645
2784
2049
HTML解析器之间也有区别,如果被解析的HTML文档是标准格式,那么解析器之间没有任何差别,只是解析速度不同,结果都会返回正确的文档树.
2646
HTML 解析器之间也有区别，如果被解析的HTML文档是标准格式，那么解析器之间没有任何差别。
2785
2647
只是解析速度不同，结果都会返回正确的文档树。
2786
2050
2648
2788
2051
但是如果被解析文档不是标准格式,那么不同的解析器返回结果可能不同.下面例子中,使用lxml解析错误格式的文档,结果</p>标签被直接忽略掉了:
2649
但是如果被解析文档不是标准格式，那么不同的解析器返回结果可能不同。下面例子中，使用 lxml 
2789
2650
解析错误格式的文档，结果 </p> 标签被直接忽略掉了:
2790
2052
2651
2791
2053
::
2652
::
2792
2054
2653
2793
2055
    BeautifulSoup("<a></p>", "lxml")
2654
    BeautifulSoup("<a></p>", "lxml")
2794
2056
    # <html><body><a></a></body></html>
2655
    # <html><body><a></a></body></html>
2795
2057
2656
2797
2058
使用html5lib库解析相同文档会得到不同的结果:
2657
使用 html5lib 库解析相同文档会得到不同的结果:
2798
2059
2658
2799
2060
::
2659
::
2800
2061
2660
2801
2062
    BeautifulSoup("<a></p>", "html5lib")
2661
    BeautifulSoup("<a></p>", "html5lib")
2802
2063
    # <html><head></head><body><a><p></p></a></body></html>
2662
    # <html><head></head><body><a><p></p></a></body></html>
2803
2064
2663
2805
2065
html5lib库没有忽略掉</p>标签,而是自动补全了标签,还给文档树添加了<head>标签.
2664
html5lib 库没有忽略掉 </p> 标签，而是自动补全了标签，还给文档树添加了 <head> 标签。
2806
2066
2665
2808
2067
使用pyhton内置库解析结果如下:
2666
Here is the same document parsed with Python's built-in HTML parser:
2809
2068
2667
2810
2069
::
2668
::
2811
2070
2669
2812
2071
    BeautifulSoup("<a></p>", "html.parser")
2670
    BeautifulSoup("<a></p>", "html.parser")
2813
2072
    # <a></a>
2671
    # <a></a>
2814
2073
2672
2816
2074
与lxml [7]_ 库类似的,Python内置库忽略掉了</p>标签,与html5lib库不同的是标准库没有尝试创建符合标准的文档格式或将文档片段包含在<body>标签内,与lxml不同的是标准库甚至连<html>标签都没有尝试去添加.
2673
与 lxml [7]_ 库类似的，Python 内置库忽略掉了 </p> 标签，与 html5lib 库不同的是标准库没有
2817
2674
尝试创建符合标准的文档格式或将文档片段包含在 <body> 标签内，与lxml不同的是标准库甚至连 <html>
2818
2675
标签都没有尝试去添加。
2819
2075
2676
2821
2076
因为文档片段“<a></p>”是错误格式,所以以上解析方式都能算作"正确",html5lib库使用的是HTML5的部分标准,所以最接近"正确".不过所有解析器的结构都能够被认为是"正常"的.
2677
因为文档片段 “<a></p>” 是错误格式，所以以上解析方式都能算作 "正确"，html5lib 库使用的是 HTML5
2822
2678
的部分标准，所以最接近"正确"。不过所有解析器的结构都能够被认为是"正常"的。
2823
2077
2679
2825
2078
不同的解析器可能影响代码执行结果,如果在分发给别人的代码中使用了 ``BeautifulSoup`` ,那么最好注明使用了哪种解析器,以减少不必要的麻烦.
2680
不同的解析器可能影响代码执行结果，如果在分发给别人的代码中使用了 ``BeautifulSoup`` ,
2826
2681
那么最好注明使用了哪种解析器，以减少不必要的麻烦。
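One way to make the parser choice explicit in distributed code is to try a fixed list of parsers in order. ``parse_html`` is a hypothetical helper written for this sketch, which assumes that naming a parser that is not installed raises ``bs4.FeatureNotFound``, as recent releases do.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_html(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: parse with the first installed parser in `preferred`."""
    for parser_name in preferred:
        try:
            return BeautifulSoup(markup, parser_name), parser_name
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is installed")


soup, used_parser = parse_html("<a><b/></a>")
print(used_parser, soup.a)
```

Returning the parser name alongside the tree makes it easy to log which parser produced a given result, which helps when two machines disagree.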
2827
2079
2682
2828
2080
编码
2683
编码
2829
2081
====
2684
====
2830
2082
2685
2832
2083
任何HTML或XML文档都有自己的编码方式,比如ASCII 或 UTF-8,但是使用Beautiful Soup解析后,文档都被转换成了Unicode:
2686
任何 HTML 或 XML 文档都有自己的编码方式，比如ASCII 或 UTF-8。但是使用 Beautiful Soup 解析后，
2833
2687
文档都被转换成了 Unicode:
2834
2084
2688
2835
2085
::
2689
::
2836
2086
2690
2837
2087
    markup = "<h1>Sacr\xc3\xa9 bleu!</h1>"
2691
    markup = b"<h1>Sacr\xc3\xa9 bleu!</h1>"
2839
2088
    soup = BeautifulSoup(markup)
2692
    soup = BeautifulSoup(markup, 'html.parser')
2840
2089
    soup.h1
2693
    soup.h1
2841
2090
    # <h1>Sacré bleu!</h1>
2694
    # <h1>Sacré bleu!</h1>
2842
2091
    soup.h1.string
2695
    soup.h1.string
2844
2092
    # u'Sacr\xe9 bleu!'
2696
    # 'Sacr\xe9 bleu!'
2845
2093
2697
2847
2094
这不是魔术(但很神奇),Beautiful Soup用了 `编码自动检测`_ 子库来识别当前文档编码并转换成Unicode编码. ``BeautifulSoup`` 对象的 ``.original_encoding`` 属性记录了自动识别编码的结果:
2698
这不是魔术(但很神奇)，Beautiful Soup 用了 `编码自动检测 <Unicode, Dammit>`_ 子库来识别当前
2848
2699
文档编码并转换成 Unicode 编码。``BeautifulSoup`` 对象的 ``.original_encoding`` 属性记录了
2849
2700
自动识别编码的结果:
2850
2095
2701
2851
2096
::
2702
::
2852
2097
2703
2853
2098
    soup.original_encoding
2704
    soup.original_encoding
2854
2099
    'utf-8'
2705
    'utf-8'
2855
2100
2706
2857
2101
`编码自动检测`_ 功能大部分时候都能猜对编码格式,但有时候也会出错.有时候即使猜测正确,也是在逐个字节的遍历整个文档后才猜对的,这样很慢.如果预先知道文档编码,可以设置编码参数来减少自动检查编码出错的概率并且提高文档解析速度.在创建 ``BeautifulSoup`` 对象的时候设置 ``from_encoding`` 参数.
2707
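`Unicode, Dammit`_ guesses the right encoding most of the time, but it can be wrong. Even when it
guesses correctly, it may do so only after a byte-by-byte scan of the whole document, which is slow.
If you know the encoding in advance, pass it as the ``from_encoding`` argument when you create the
``BeautifulSoup`` object; this avoids both the mistakes and the delay.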
2861
2102
2711
2863
2103
下面一段文档用了ISO-8859-8编码方式,这段文档太短,结果Beautiful Soup以为文档是用ISO-8859-7编码:
2712
下面一段文档用了 ISO-8859-8 编码方式，这段文档太短，结果 Beautiful Soup 以为文档是用 ISO-8859-7 编码:
2864
2104
2713
2865
2105
::
2714
::
2866
2106
2715
2867
2107
    markup = b"<h1>\xed\xe5\xec\xf9</h1>"
2716
    markup = b"<h1>\xed\xe5\xec\xf9</h1>"
2873
2108
    soup = BeautifulSoup(markup)
2717
    soup = BeautifulSoup(markup, 'html.parser')
2874
2109
    soup.h1
2718
    print(soup.h1)
2875
2110
    <h1>νεμω</h1>
2719
    # <h1>νεμω</h1>
2876
2111
    soup.original_encoding
2720
    print(soup.original_encoding)
2877
2112
    'ISO-8859-7'
2721
    # iso-8859-7
2878
2113
2722
2879
2114
通过传入 ``from_encoding`` 参数来指定编码方式:
2723
通过传入 ``from_encoding`` 参数来指定编码方式:
2880
2115
2724
2881
2116
::
2725
::
2882
2117
2726
2888
2118
    soup = BeautifulSoup(markup, from_encoding="iso-8859-8")
2727
    soup = BeautifulSoup(markup, 'html.parser', from_encoding="iso-8859-8")
2889
2119
    soup.h1
2728
    print(soup.h1)
2890
2120
    <h1>םולש</h1>
2729
    # <h1>םולש</h1>
2891
2121
    soup.original_encoding
2730
    print(soup.original_encoding)
2892
2122
    'iso8859-8'
2731
    # iso8859-8
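
Why such a short document is hard to guess can be seen without Beautiful Soup at all. The sketch
below uses only Python's standard codecs (it does not call Beautiful Soup or Unicode, Dammit): the
same four bytes decode without error under both ISO-8859-7 and ISO-8859-8, so nothing in the bytes
themselves says which reading was intended.

```python
# The four bytes from the example above, outside any markup.
raw = b"\xed\xe5\xec\xf9"

# Both codecs accept every byte, so neither decode raises an error.
as_greek = raw.decode("iso-8859-7")
as_hebrew = raw.decode("iso-8859-8")

print(as_greek)   # Greek letters
print(as_hebrew)  # Hebrew letters
print(as_greek == as_hebrew)  # False: same bytes, different text
```

This is why supplying ``from_encoding`` is the more reliable choice when the encoding is known.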
2893
2732
2894
2733
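If you don't know the correct encoding, but you can see that Unicode, Dammit is guessing wrong
(the output is still garbled), pass the wrong guesses in as ``exclude_encodings`` so that those
encodings are not tried again.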
2896
2123
2735
2900
2124
如果仅知道文档采用了Unicode编码, 但不知道具体编码. 可以先自己猜测, 猜测错误(依旧是乱码)时,
2736
译者备注: 在没有指定编码的情况下，BS会自己猜测编码，把不正确的编码排除掉，BS就更容易猜到正确编码。
2898
2125
可以把错误编码作为 ``exclude_encodings`` 参数, 这样文档就不会尝试使用这种编码了解码了.
2899
2126
译者备注: 在没有指定编码的情况下, BS会自己猜测编码, 把不正确的编码排除掉, BS就更容易猜到正确编码.
2901
2127
2737
2902
2128
::
2738
::
2903
2129
2739
2909
2130
	soup = BeautifulSoup(markup, exclude_encodings=["ISO-8859-7"])
2740
    soup = BeautifulSoup(markup, 'html.parser', exclude_encodings=["iso-8859-7"])
2910
2131
	soup.h1
2741
    print(soup.h1)
2911
2132
	<h1>םולש</h1>
2742
    # <h1>םולש</h1>
2912
2133
	soup.original_encoding
2743
    print(soup.original_encoding)
2913
2134
	'WINDOWS-1255'
2744
    # WINDOWS-1255
2914
2135
2745
2917
2136
猜测结果是 Windows-1255 编码, 猜测结果可能不够准确, 但是 Windows-1255 编码是 ISO-8859-8 的扩展集,
2746
猜测的结果 Windows-1255 可能不是 100% 准确，但是 Windows-1255 编码是 ISO-8859-8 的扩展集，
2918
2137
所以猜测结果已经十分接近了, 并且不影响使用. (``exclude_encodings`` 参数是 4.4.0版本的新功能)
2747
所以猜测结果已经十分接近了，并不影响使用。(``exclude_encodings`` 参数是 4.4.0版本的新功能)
2919
2138
2748
2921
2139
少数情况下(通常是UTF-8编码的文档中包含了其它编码格式的文件),想获得正确的Unicode编码就不得不将文档中少数特殊编码字符替换成特殊Unicode编码,“REPLACEMENT CHARACTER” (U+FFFD, �) [9]_ . 如果Beautifu Soup猜测文档编码时作了特殊字符的替换,那么Beautiful Soup会把 ``UnicodeDammit`` 或 ``BeautifulSoup`` 对象的 ``.contains_replacement_characters`` 属性标记为 ``True`` .这样就可以知道当前文档进行Unicode编码后丢失了一部分特殊内容字符.如果文档中包含�而 ``.contains_replacement_characters`` 属性是 ``False`` ,则表示�就是文档中原来的字符,不是转码失败.
2749
In rare cases (usually when a UTF-8 document contains text written in a completely different
encoding), the only way to get Unicode is to replace some characters with the special Unicode
character "REPLACEMENT CHARACTER" (U+FFFD, �) [9]_ . If Unicode, Dammit has to do this, it sets the
``.contains_replacement_characters`` attribute to ``True`` on the ``UnicodeDammit`` or
``BeautifulSoup`` object. That tells you the Unicode text is not an exact representation of the
original and that some data was lost. If a document contains � but
``.contains_replacement_characters`` is ``False``, the � was in the document originally and does
not stand in for missing data.
2928
2140
2756
2929
2141
输出编码
2757
输出编码
2930
2142
--------
2758
--------
2931
2143
2759
2933
2144
通过Beautiful Soup输出文档时,不管输入文档是什么编码方式,输出编码均为UTF-8编码,下面例子输入文档是Latin-1编码:
2760
通过 Beautiful Soup 输出文档时，不管输入文档是什么编码方式，输出编码均为UTF-8编码，
2934
2761
下面例子输入文档是 Latin-1 编码:
2935
2145
2762
2936
2146
::
2763
::
2937
2147
2764
2961
2148
    markup = b'''
2765
 markup = b'''
2962
2149
    <html>
2766
  <html>
2963
2150
      <head>
2767
   <head>
2964
2151
        <meta content="text/html; charset=ISO-Latin-1" http-equiv="Content-type" />
2768
    <meta content="text/html; charset=ISO-Latin-1" http-equiv="Content-type" />
2965
2152
      </head>
2769
   </head>
2966
2153
      <body>
2770
   <body>
2967
2154
        <p>Sacr\xe9 bleu!</p>
2771
    <p>Sacr\xe9 bleu!</p>
2968
2155
      </body>
2772
   </body>
2969
2156
    </html>
2773
  </html>
2970
2157
    '''
2774
 '''
2948
2158
2949
2159
    soup = BeautifulSoup(markup)
2950
2160
    print(soup.prettify())
2951
2161
    # <html>
2952
2162
    #  <head>
2953
2163
    #   <meta content="text/html; charset=utf-8" http-equiv="Content-type" />
2954
2164
    #  </head>
2955
2165
    #  <body>
2956
2166
    #   <p>
2957
2167
    #    Sacré bleu!
2958
2168
    #   </p>
2959
2169
    #  </body>
2960
2170
    # </html>
2971
2171
2775
2973
2172
注意,输出文档中的<meta>标签的编码设置已经修改成了与输出编码一致的UTF-8.
2776
 soup = BeautifulSoup(markup, 'html.parser')
2974
2777
 print(soup.prettify())
2975
2778
 # <html>
2976
2779
 #  <head>
2977
2780
 #   <meta content="text/html; charset=utf-8" http-equiv="Content-type" />
2978
2781
 #  </head>
2979
2782
 #  <body>
2980
2783
 #   <p>
2981
2784
 #    Sacré bleu!
2982
2785
 #   </p>
2983
2786
 #  </body>
2984
2787
 # </html>
2985
2173
2788
2987
2174
如果不想用UTF-8编码输出,可以将编码方式传入 ``prettify()`` 方法:
2789
注意，输出文档中的 <meta> 标签内容中的编码信息已经修改成了与输出编码一致的 UTF-8。
2988
2790
2989
2791
如果不想用 UTF-8 编码输出，可以将编码方式传入 ``prettify()`` 方法:
2990
2175
2792
2991
2176
::
2793
::
2992
2177
2794
2993
@@ -2181,103 +2798,107 @@ html5lib库没有忽略掉</p>标签,而是自动补全了标签,还给文档树
2994
2181
    #   <meta content="text/html; charset=latin-1" http-equiv="Content-type" />
2798
    #   <meta content="text/html; charset=latin-1" http-equiv="Content-type" />
2995
2182
    # ...
2799
    # ...
2996
2183
2800
2998
2184
还可以调用 ``BeautifulSoup`` 对象或任意节点的 ``encode()`` 方法,就像Python的字符串调用 ``encode()`` 方法一样:
2801
还可以调用 ``BeautifulSoup`` 对象或任意节点的 ``encode()`` 方法，就像 Python 的字符串
2999
2802
调用 ``encode()`` 方法一样:
3000
2185
2803
3001
2186
::
2804
::
3002
2187
2805
3003
2188
    soup.p.encode("latin-1")
2806
    soup.p.encode("latin-1")
3005
2189
    # '<p>Sacr\xe9 bleu!</p>'
2807
    # b'<p>Sacr\xe9 bleu!</p>'
3006
2190
2808
3007
2191
    soup.p.encode("utf-8")
2809
    soup.p.encode("utf-8")
3009
2192
    # '<p>Sacr\xc3\xa9 bleu!</p>'
2810
    # b'<p>Sacr\xc3\xa9 bleu!</p>'
3010
2193
2811
3012
2194
如果文档中包含当前编码不支持的字符,那么这些字符将被转换成一系列XML特殊字符引用,下面例子中包含了Unicode编码字符SNOWMAN:
2812
如果文档中包含当前编码不支持的字符，那么这些字符将被转换成一系列 XML 特殊字符引用，下面例子中
3013
2813
包含了 Unicode 编码字符 SNOWMAN:
3014
2195
2814
3015
2196
::
2815
::
3016
2197
2816
3017
2198
    markup = u"<b>\N{SNOWMAN}</b>"
2817
    markup = u"<b>\N{SNOWMAN}</b>"
3019
2199
    snowman_soup = BeautifulSoup(markup)
2818
    snowman_soup = BeautifulSoup(markup, 'html.parser')
3020
2200
    tag = snowman_soup.b
2819
    tag = snowman_soup.b
3021
2201
2820
3023
2202
SNOWMAN字符在UTF-8编码中可以正常显示(看上去像是☃),但有些编码不支持SNOWMAN字符,比如ISO-Latin-1或ASCII,那么在这些编码中SNOWMAN字符会被转换成“&#9731”:
2821
SNOWMAN 字符在 UTF-8 编码中可以正常显示(看上去是 ☃)，但有些编码不支持 SNOWMAN 字符，比如 
3024
2822
ISO-Latin-1 或 ASCII，那么在这些编码中 SNOWMAN 字符会被转换成 “&#9731”:
3025
2203
2823
3026
2204
::
2824
::
3027
2205
2825
3028
2206
    print(tag.encode("utf-8"))
2826
    print(tag.encode("utf-8"))
3030
2207
    # <b>☃</b>
2827
    # b'<b>\xe2\x98\x83</b>'
3031
2208
2828
3034
2209
    print tag.encode("latin-1")
2829
    print(tag.encode("latin-1"))
3035
2210
    # <b>&#9731;</b>
2830
    # b'<b>&#9731;</b>'
3036
2211
2831
3039
2212
    print tag.encode("ascii")
2832
    print(tag.encode("ascii"))
3040
2213
    # <b>&#9731;</b>
2833
    # b'<b>&#9731;</b>'
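
The number in that reference is the decimal code point of the character. A rough standard-library
analogy (an illustration only, not a description of how Beautiful Soup implements its
substitution) is Python's ``xmlcharrefreplace`` error handler:

```python
snowman = "\N{SNOWMAN}"

# ord() gives the code point that appears in the numeric reference.
print(ord(snowman))

# ASCII and Latin-1 cannot represent the character, so it is replaced
# with an XML character reference instead of raising UnicodeEncodeError.
as_ascii = f"<b>{snowman}</b>".encode("ascii", errors="xmlcharrefreplace")
as_latin1 = f"<b>{snowman}</b>".encode("latin-1", errors="xmlcharrefreplace")

# UTF-8 can represent it, so the handler is never triggered.
as_utf8 = f"<b>{snowman}</b>".encode("utf-8", errors="xmlcharrefreplace")

print(as_ascii)
print(as_latin1)
print(as_utf8)
```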
3041
2214
2834
3044
2215
Unicode, Dammit! (乱码, 靠!)
2835
Unicode, Dammit
3045
2216
-----------------------------
2836
----------------------
3046
2217
2837
3048
2218
译者备注: UnicodeDammit 是BS内置库, 主要用来猜测文档编码.
2838
译者备注: Unicode Dammit 是 Beautiful Soup 内置库，主要用来猜测文档编码。
3049
2219
2839
3051
2220
`编码自动检测`_ 功能可以在Beautiful Soup以外使用,检测某段未知编码时,可以使用这个方法:
2840
`编码自动检测 <Unicode, Dammit>`_ 功能可以在 Beautiful Soup 以外使用。当遇到一段未知编码
3052
2841
的文档时，可以通过下面方法把它转换为 Unicode 编码
3053
2221
2842
3054
2222
::
2843
::
3055
2223
2844
3056
2224
    from bs4 import UnicodeDammit
2845
    from bs4 import UnicodeDammit
3058
2225
    dammit = UnicodeDammit("Sacr\xc3\xa9 bleu!")
2846
    dammit = UnicodeDammit(b"\xc2\xabSacr\xc3\xa9 bleu!\xc2\xbb")
3059
2226
    print(dammit.unicode_markup)
2847
    print(dammit.unicode_markup)
3061
2227
    # Sacré bleu!
2848
    # «Sacré bleu!»
3062
2228
    dammit.original_encoding
2849
    dammit.original_encoding
3063
2229
    # 'utf-8'
2850
    # 'utf-8'
3064
2230
2851
3068
2231
如果Python中安装了 ``chardet`` 或 ``cchardet`` 那么编码检测功能的准确率将大大提高.
2852
如果安装了 Python 的 ``chardet`` 或 ``cchardet`` 库，那么编码检测功能的准确率将大大提高。
3069
2232
输入的字符越多,检测结果越精确,如果事先猜测到一些可能编码,
2853
输入的字符越多，检测结果越准确，如果事先猜测到一些可能编码，那么可以将猜测的编码作为参数，
3070
2233
那么可以将猜测的编码作为参数,这样将优先检测这些编码:
2854
这样将优先检测这些编码:
3071
2234
2855
3072
2235
::
2856
::
3073
2236
2857
3074
2237
3075
2238
    dammit = UnicodeDammit("Sacr\xe9 bleu!", ["latin-1", "iso-8859-1"])
2858
    dammit = UnicodeDammit(b"Sacr\xe9 bleu!", ["latin-1", "iso-8859-1"])
3076
2239
    print(dammit.unicode_markup)
2859
    print(dammit.unicode_markup)
3077
2240
    # Sacré bleu!
2860
    # Sacré bleu!
3078
2241
    dammit.original_encoding
2861
    dammit.original_encoding
3079
2242
    # 'latin-1'
2862
    # 'latin-1'
3080
2243
2863
3082
2244
`编码自动检测`_ 功能中有2项功能是Beautiful Soup库中用不到的
2864
`编码自动检测 <Unicode, Dammit>`_ 功能中有 2 项功能是 Beautiful Soup 库中用不到的
3083
2245
2865
3084
2246
智能引号
2866
智能引号
3086
2247
...........
2867
^^^^^^^^^^^
3087
2248
2868
3089
2249
使用Unicode时,Beautiful Soup还会智能的把引号 [10]_ 转换成HTML或XML中的特殊字符:
2869
Unicode, Dammit can convert Microsoft smart quotes [10]_ to HTML or XML entities:
3090
2250
2870
3091
2251
::
2871
::
3092
2252
2872
3093
2253
    markup = b"<p>I just \x93love\x94 Microsoft Word\x92s smart quotes</p>"
2873
    markup = b"<p>I just \x93love\x94 Microsoft Word\x92s smart quotes</p>"
3094
2254
2874
3095
2255
    UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="html").unicode_markup
2875
    UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="html").unicode_markup
3097
2256
    # u'<p>I just &ldquo;love&rdquo; Microsoft Word&rsquo;s smart quotes</p>'
2876
    # '<p>I just &ldquo;love&rdquo; Microsoft Word&rsquo;s smart quotes</p>'
3098
2257
2877
3099
2258
    UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="xml").unicode_markup
2878
    UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="xml").unicode_markup
3101
2259
    # u'<p>I just &#x201C;love&#x201D; Microsoft Word&#x2019;s smart quotes</p>'
2879
    # '<p>I just &#x201C;love&#x201D; Microsoft Word&#x2019;s smart quotes</p>'
3102
2260
2880
3104
2261
也可以把引号转换为ASCII码:
2881
也可以把引号转换为 ASCII 码:
3105
2262
2882
3106
2263
::
2883
::
3107
2264
2884
3108
2265
    UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="ascii").unicode_markup
2885
    UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="ascii").unicode_markup
3110
2266
    # u'<p>I just "love" Microsoft Word\'s smart quotes</p>'
2886
    # '<p>I just "love" Microsoft Word\'s smart quotes</p>'
3111
2267
2887
3113
2268
很有用的功能,但是Beautiful Soup没有使用这种方式.默认情况下,Beautiful Soup把引号转换成Unicode:
2888
虽然这个功能很有用，但是 Beautiful Soup 没有使用这种方式。默认情况下，Beautiful Soup 
3114
2889
把引号转换成 Unicode:
3115
2269
2890
3116
2270
::
2891
::
3117
2271
2892
3118
2272
    UnicodeDammit(markup, ["windows-1252"]).unicode_markup
2893
    UnicodeDammit(markup, ["windows-1252"]).unicode_markup
3120
2273
    # u'<p>I just \u201clove\u201d Microsoft Word\u2019s smart quotes</p>'
2894
    # '<p>I just “love” Microsoft Word’s smart quotes</p>'
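
As a rough picture of what the ``smart_quotes_to="html"`` conversion amounts to, the sketch below
reproduces the same result with the standard library only. It is not Unicode, Dammit's
implementation, and ``to_html_entities`` is a hypothetical helper written for this illustration.

```python
from html.entities import codepoint2name

markup = b"<p>I just \x93love\x94 Microsoft Word\x92s smart quotes</p>"

# Windows-1252 maps 0x93/0x94/0x92 to the curly quotes U+201C/U+201D/U+2019.
text = markup.decode("windows-1252")

def to_html_entities(s):
    """Replace each non-ASCII character that has a named HTML entity."""
    out = []
    for ch in s:
        name = codepoint2name.get(ord(ch)) if ord(ch) > 127 else None
        out.append(f"&{name};" if name else ch)
    return "".join(out)

converted = to_html_entities(text)
print(converted)
```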
3121
2274
2895
3122
2275
矛盾的编码
2896
矛盾的编码
3124
2276
...........
2897
^^^^^^^^^^^^^
3125
2277
2898
3129
2278
有时文档的大部分都是用UTF-8,但同时还包含了Windows-1252编码的字符,就像微软的智能引号 [10]_ 一样.
2899
有时文档的大部分都是用 UTF-8，但同时还包含了 Windows-1252 编码的字符，就像微软的智能引号 [10]_ 一样。
3130
2279
一些包含多个信息的来源网站容易出现这种情况. ``UnicodeDammit.detwingle()``
2900
一些包含多个信息的来源网站容易出现这种情况。``UnicodeDammit.detwingle()`` 方法可以把这类文档转换成纯
3131
2280
方法可以把这类文档转换成纯UTF-8编码格式,看个简单的例子:
2901
UTF-8 编码格式，看个简单的例子:
3132
2281
2902
3133
2282
::
2903
::
3134
2283
2904
3135
@@ -2285,7 +2906,8 @@ Unicode, Dammit! (乱码, 靠!)
3136
2285
    quote = (u"\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!\N{RIGHT DOUBLE QUOTATION MARK}")
2906
    quote = (u"\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!\N{RIGHT DOUBLE QUOTATION MARK}")
3137
2286
    doc = snowmen.encode("utf8") + quote.encode("windows_1252")
2907
    doc = snowmen.encode("utf8") + quote.encode("windows_1252")
3138
2287
2908
3140
2288
这段文档很杂乱,snowmen是UTF-8编码,引号是Windows-1252编码,直接输出时不能同时显示snowmen和引号,因为它们编码不同:
2909
这段文档很杂乱，snowmen 是 UTF-8 编码，引号是 Windows-1252 编码，直接输出时不能同时显示
3141
2910
snowmen 和引号，因为它们编码不同:
3142
2289
2911
3143
2290
::
2912
::
3144
2291
2913
3145
@@ -2295,8 +2917,9 @@ Unicode, Dammit! (乱码, 靠!)
3146
2295
    print(doc.decode("windows-1252"))
2917
    print(doc.decode("windows-1252"))
3147
2296
    # â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”
2918
    # â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”
3148
2297
2919
3151
2298
如果对这段文档用UTF-8解码就会得到 ``UnicodeDecodeError`` 异常,如果用Windows-1252解码就回得到一堆乱码.
2920
如果对这段文档用 UTF-8 解码就会产生 ``UnicodeDecodeError`` 异常，如果用 Windows-1252 
3152
2299
幸好, ``UnicodeDammit.detwingle()`` 方法会把这段字符串转换成UTF-8编码,允许我们同时显示出文档中的snowmen和引号:
2921
解码就会得到一堆乱码。幸好，``UnicodeDammit.detwingle()`` 方法会把这段字符串转换成 UTF-8 
3153
2922
编码，允许我们同时显示出文档中的 snowmen 和引号:
3154
2300
2923
3155
2301
::
2924
::
3156
2302
2925
3157
@@ -2304,29 +2927,71 @@ Unicode, Dammit! (乱码, 靠!)
3158
2304
    print(new_doc.decode("utf8"))
2927
    print(new_doc.decode("utf8"))
3159
2305
    # ☃☃☃“I like snowmen!”
2928
    # ☃☃☃“I like snowmen!”
3160
2306
2929
3162
2307
``UnicodeDammit.detwingle()`` 方法只能解码包含在UTF-8编码中的Windows-1252编码内容,但这解决了最常见的一类问题.
2930
``UnicodeDammit.detwingle()`` only knows how to handle Windows-1252 embedded in UTF-8
(or, presumably, the reverse), but this is the most common case.
3164
2932
3165
2933
在创建 ``BeautifulSoup`` 或 ``UnicodeDammit`` 对象前一定要先对文档调用 
3166
2934
``UnicodeDammit.detwingle()`` 确保文档的编码方式正确。Beautiful Soup 
3167
2935
会假设文档只包含一种编码，如果尝试去解析一段同时包含 UTF-8 和 Windows-1252 编码的文档，
3168
2936
就有可能被误判成整个文档都是 Windows-1252 编码，解析结果就会得到一堆乱码，
3169
2937
比如: â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”。
3170
2308
2938
3172
2309
在创建 ``BeautifulSoup`` 或 ``UnicodeDammit`` 对象前一定要先对文档调用 ``UnicodeDammit.detwingle()`` 确保文档的编码方式正确.如果尝试去解析一段包含Windows-1252编码的UTF-8文档,就会得到一堆乱码,比如: â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”.
2939
``UnicodeDammit.detwingle()`` 方法在 Beautiful Soup 4.1.0 版本中新增。
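
The failure that ``detwingle()`` works around can be reproduced with the standard library alone.
This sketch does not call Beautiful Soup; it only shows that neither single codec decodes the mixed
byte string from the example above faithfully.

```python
snowmen = "\N{SNOWMAN}" * 3
quote = "\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!\N{RIGHT DOUBLE QUOTATION MARK}"
doc = snowmen.encode("utf8") + quote.encode("windows_1252")

# Strict UTF-8 decoding fails on the Windows-1252 quote bytes.
try:
    doc.decode("utf8")
    strict_utf8_ok = True
except UnicodeDecodeError:
    strict_utf8_ok = False

# Lenient UTF-8 decoding keeps the snowmen but loses the quotes,
# substituting U+FFFD REPLACEMENT CHARACTER for each bad byte.
lossy = doc.decode("utf8", errors="replace")

# Windows-1252 decoding keeps the quotes but garbles the snowmen.
mojibake = doc.decode("windows-1252")

print(strict_utf8_ok)
print(lossy)
print(mojibake)
```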
3173
2310
2940
3175
2311
``UnicodeDammit.detwingle()`` 方法在Beautiful Soup 4.1.0版本中新增
2941
行编号
3176
2942
==========
3177
2943
3178
2944
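The ``html.parser`` and ``html5lib`` parsers can keep track of where in the original document each
``Tag`` was found. You can read this information from ``Tag.sourceline`` (the line number) and
``Tag.sourcepos`` (the position of the start tag within that line):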
3180
2946
3181
2947
::
3182
2948
3183
2949
    markup = "<p\n>Paragraph 1</p>\n    <p>Paragraph 2</p>"
3184
2950
    soup = BeautifulSoup(markup, 'html.parser')
3185
2951
    for tag in soup.find_all('p'):
3186
2952
        print(repr((tag.sourceline, tag.sourcepos, tag.string)))
3187
2953
    # (1, 0, 'Paragraph 1')
3188
2954
    # (3, 4, 'Paragraph 2')
3189
2955
3190
2956
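Note that the two parsers mean slightly different things by ``sourceline`` and ``sourcepos``.
For ``html.parser``, the numbers give the position of the opening less-than sign of the start tag.
For ``html5lib``, they give the position of the closing greater-than sign of the start tag: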
3192
2958
3193
2959
::
3194
2960
3195
2961
    soup = BeautifulSoup(markup, 'html5lib')
3196
2962
    for tag in soup.find_all('p'):
3197
2963
        print(repr((tag.sourceline, tag.sourcepos, tag.string)))
3198
2964
    # (2, 0, 'Paragraph 1')
3199
2965
    # (3, 6, 'Paragraph 2')
3200
2966
3201
2967
可以在 BeautifulSoup 构造函数中配置 ``store_line_numbers=False`` 来关闭这个功能
3202
2968
3203
2969
::
3204
2970
3205
2971
    markup = "<p\n>Paragraph 1</p>\n    <p>Paragraph 2</p>"
3206
2972
    soup = BeautifulSoup(markup, 'html.parser', store_line_numbers=False)
3207
2973
    print(soup.p.sourceline)
3208
2974
    # None
3209
2975
3210
2976
这个功能在 4.8.1 版本中引入，lxml 解析器不支持这个功能。
3211
2312
2977
3212
2313
比较对象是否相同
2978
比较对象是否相同
3213
2314
=================
2979
=================
3214
2315
2980
3218
2316
两个 ``NavigableString`` 或 ``Tag`` 对象具有相同的HTML或XML结构时,
2981
两个 ``NavigableString`` 或 ``Tag`` 对象具有相同的 HTML 或 XML 结构时，
3219
2317
Beautiful Soup就判断这两个对象相同. 这个例子中, 2个 <b> 标签在 BS 中是相同的,
2982
Beautiful Soup就判断这两个对象相同。这个例子中，2个 <b> 标签在 BS 中是相同的，
3220
2318
尽管他们在文档树的不同位置, 但是具有相同的表象: "<b>pizza</b>"
2983
尽管他们在文档树的不同位置，但是具有相同的表象: "<b>pizza</b>"
3221
2319
2984
3222
2320
::
2985
::
3223
2321
2986
3229
2322
	markup = "<p>I want <b>pizza</b> and more <b>pizza</b>!</p>"
2987
    markup = "<p>I want <b>pizza</b> and more <b>pizza</b>!</p>"
3230
2323
	soup = BeautifulSoup(markup, 'html.parser')
2988
    soup = BeautifulSoup(markup, 'html.parser')
3231
2324
	first_b, second_b = soup.find_all('b')
2989
    first_b, second_b = soup.find_all('b')
3232
2325
	print first_b == second_b
2990
    print(first_b == second_b)
3233
2326
	# True
2991
    # True
3234
2327
2992
3237
2328
	print first_b.previous_element == second_b.previous_element
2993
    print(first_b.previous_element == second_b.previous_element)
3238
2329
	# False
2994
    # False
3239
2330
2995
3240
2331
如果想判断两个对象是否严格的指向同一个对象可以通过 ``is`` 来判断
2996
如果想判断两个对象是否严格的指向同一个对象可以通过 ``is`` 来判断
3241
2332
2997
3242
@@ -2342,12 +3007,12 @@ Beautiful Soup就判断这两个对象相同. 这个例子中, 2个 <b> 标签
3243
2342
3007
3244
2343
::
3008
::
3245
2344
3009
3250
2345
	import copy
3010
    import copy
3251
2346
	p_copy = copy.copy(soup.p)
3011
    p_copy = copy.copy(soup.p)
3252
2347
	print p_copy
3012
    print(p_copy)
3253
2348
	# <p>I want <b>pizza</b> and more <b>pizza</b>!</p>
3013
    # <p>I want <b>pizza</b> and more <b>pizza</b>!</p>
3254
2349
3014
3256
2350
复制后的对象跟与对象是相等的, 但指向不同的内存地址
3015
The copy is considered equal to the original, since it represents the same markup, but it is not
the same object:
3257
2351
3016
3258
2352
::
3017
::
3259
2353
3018
3260
@@ -2357,8 +3022,8 @@ Beautiful Soup就判断这两个对象相同. 这个例子中, 2个 <b> 标签
3261
2357
	print(soup.p is p_copy)
3022
	print(soup.p is p_copy)
3262
2358
	# False
3023
	# False
3263
2359
3024
3266
2360
源对象和复制对象的区别是源对象在文档树中, 而复制后的对象是独立的还没有添加到文档树中.
3025
源对象和复制对象的区别是源对象在文档树中，而复制后的对象是独立的还没有添加到文档树中。
3267
2361
复制后对象的效果跟调用了 ``extract()`` 方法相同.
3026
复制后对象的效果跟调用了 ``extract()`` 方法相同。
3268
2362
3027
3269
2363
::
3028
::
3270
2364
3029
3271
@@ -2367,16 +3032,29 @@ Beautiful Soup就判断这两个对象相同. 这个例子中, 2个 <b> 标签
3272
2367
3032
3273
2368
这是因为相等的对象不能同时插入相同的位置
3033
这是因为相等的对象不能同时插入相同的位置
3274
2369
3034
3275
3035
高级自定义解析
3276
3036
================
3277
3037
3278
3038
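Beautiful Soup offers a number of ways to customize how the parser treats incoming HTML and XML.
This section covers the most commonly used customization techniques.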
3279
2370
3039
3280
2371
解析部分文档
3040
解析部分文档
3282
2372
============
3041
-------------
3283
2373
3042
3285
2374
如果仅仅因为想要查找文档中的<a>标签而将整片文档进行解析,实在是浪费内存和时间.最快的方法是从一开始就把<a>标签以外的东西都忽略掉. ``SoupStrainer`` 类可以定义文档的某段内容,这样搜索文档时就不必先解析整篇文档,只会解析在 ``SoupStrainer`` 中定义过的文档. 创建一个 ``SoupStrainer`` 对象并作为 ``parse_only`` 参数给 ``BeautifulSoup`` 的构造方法即可.
3043
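Parsing an entire document only to look for its <a> tags wastes time and memory. It is much faster
to ignore everything that isn't an <a> tag in the first place. The ``SoupStrainer`` class lets you
choose which parts of an incoming document are parsed: create a ``SoupStrainer`` and pass it to
the ``BeautifulSoup`` constructor as the ``parse_only`` argument.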
3289
3047
3290
3048
(注意，*这个功能在 html5lib 解析器中无法使用*。如果使用 html5lib 解析器，整篇文档都会被解析，
3291
3049
这是因为 html5lib 会重新排列文档树的结构，如果部分节点不在文档树中，会导致崩溃。为了避免混淆，
3292
3050
下面的例子中 Beautiful Soup 都强制指定使用了 Python 内置解析器。)
3293
2375
3051
3294
2376
SoupStrainer
3052
SoupStrainer
3295
2377
-------------
3053
-------------
3296
2378
3054
3298
2379
``SoupStrainer`` 类接受与典型搜索方法相同的参数：`name`_ , `attrs`_ , `recursive`_ , `string`_ , `**kwargs`_ 。下面举例说明三种 ``SoupStrainer`` 对象：
3055
``SoupStrainer`` 类接受与典型搜索方法相同的参数： `name`_ , `attrs`_ , 
3299
3056
`recursive <recursive>`_ , `string <string>`_ , `**kwargs <kwargs>`_ 。
3300
3057
下面举例说明三种 ``SoupStrainer`` 对象：
3301
2380
3058
3302
2381
::
3059
::
3303
2382
3060
3304
@@ -2387,7 +3065,7 @@ SoupStrainer
3305
2387
    only_tags_with_id_link2 = SoupStrainer(id="link2")
3065
    only_tags_with_id_link2 = SoupStrainer(id="link2")
3306
2388
3066
3307
2389
    def is_short_string(string):
3067
    def is_short_string(string):
3309
2390
        return len(string) < 10
3068
        return string is not None and len(string) < 10
3310
2391
3069
3311
2392
    only_short_strings = SoupStrainer(string=is_short_string)
3070
    only_short_strings = SoupStrainer(string=is_short_string)
3312
2393
3071
3313
@@ -2395,9 +3073,8 @@ SoupStrainer
3314
2395
3073
3315
2396
::
3074
::
3316
2397
3075
3320
2398
    html_doc = """
3076
    html_doc = """<html><head><title>The Dormouse's story</title></head>
3321
2399
    <html><head><title>The Dormouse's story</title></head>
3077
    <body>
3319
2400
	<body>
3322
2401
    <p class="title"><b>The Dormouse's story</b></p>
3078
    <p class="title"><b>The Dormouse's story</b></p>
3323
2402
3079
3324
2403
    <p class="story">Once upon a time there were three little sisters; and their names were
3080
    <p class="story">Once upon a time there were three little sisters; and their names were
3325
@@ -2434,27 +3111,147 @@ SoupStrainer
3326
2434
    # ...
3111
    # ...
3327
2435
    #
3112
    #
3328
2436
3113
3330
2437
还可以将 ``SoupStrainer`` 作为参数传入 `搜索文档树`_ 中提到的方法.这可能不是个常用用法,所以还是提一下:
3114
还可以将 ``SoupStrainer`` 作为参数传入 `搜索文档树`_ 中提到的方法。虽然不常用，但还是提一下:
3331
2438
3115
3332
2439
::
3116
::
3333
2440
3117
3335
2441
    soup = BeautifulSoup(html_doc)
3118
    soup = BeautifulSoup(html_doc, 'html.parser')
3336
2442
    soup.find_all(only_short_strings)
3119
    soup.find_all(only_short_strings)
3339
2443
    # [u'\n\n', u'\n\n', u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie',
3120
    # ['\n\n', '\n\n', 'Elsie', ',\n', 'Lacie', ' and\n', 'Tillie',
3340
2444
    #  u'\n\n', u'...', u'\n']
3121
    #  '\n\n', '...', '\n']
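
The ``is not None`` check in ``is_short_string`` guards against the filter being handed ``None``
instead of a string, in which case ``len(None)`` would raise ``TypeError``. The sketch below
exercises the function on plain Python values, without Beautiful Soup; the sample list is made up
for illustration, and the assumption that ``None`` can reach the filter is inferred from the guard
itself.

```python
def is_short_string(string):
    # Guard first: len(None) would raise TypeError.
    return string is not None and len(string) < 10

# Hypothetical inputs: short strings, a long string, and None.
samples = ["Elsie", None, "Once upon a time there were", "...", "\n\n"]
short = [s for s in samples if is_short_string(s)]
print(short)
```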
3341
3122
3342
3123
自定义包含多个值的属性
3343
3124
----------------------
3344
3125
3345
3126
在 HTML 文档中，像 ``class`` 这样的属性的值是一个列表，像 ``id`` 这样的属性的值是一个单一字符串，
3346
3127
因为 HTML 标准定义了这些属性的不同行为
3347
3128
3348
3129
::
3349
3130
3350
3131
    markup = '<a class="cls1 cls2" id="id1 id2">'
3351
3132
    soup = BeautifulSoup(markup, 'html.parser')
3352
3133
    soup.a['class']
3353
3134
    # ['cls1', 'cls2']
3354
3135
    soup.a['id']
3355
3136
    # 'id1 id2'
3356
3137
3357
3138
设置 ``multi_valued_attributes=None`` 可以禁用多值的自动识别，然后全部属性的值都变成一个字符串
3358
3139
3359
3140
::
3360
3141
3361
3142
    soup = BeautifulSoup(markup, 'html.parser', multi_valued_attributes=None)
3362
3143
    soup.a['class']
3363
3144
    # 'cls1 cls2'
3364
3145
    soup.a['id']
3365
3146
    # 'id1 id2'
3366
3147
3367
3148
如果给 ``multi_valued_attributes`` 参数传入一个字典，可以实现一点点解析自定义。如果需要这么做，
3368
3149
查看 ``HTMLTreeBuilder.DEFAULT_CDATA_LIST_ATTRIBUTES`` 了解 Beautiful Soup 的默认配置，
3369
3150
这些均是基于 HTML 标准配置的。
3370
3151
3371
3152
`(这个功能添加于 Beautiful Soup 4.8.0)`
3372
3153
3373
3154
处理重复属性
3374
3155
---------------
3375
3156
3376
3157
使用 ``html.parser`` 解析器时，可以通过设置 ``on_duplicate_attribute`` 参数，来定义当
3377
3158
Beautiful Soup 在 tag 中发现重复的属性名字时如何处理
3378
3159
::
3379
3160
3380
3161
    markup = '<a href="http://url1/" href="http://url2/">'
3381
3162
3382
3163
默认行为是，重名属性会使用最后出现的值
3383
3164
3384
3165
::
3385
3166
3386
3167
    soup = BeautifulSoup(markup, 'html.parser')
3387
3168
    soup.a['href']
3388
3169
    # http://url2/
3389
3170
3390
3171
    soup = BeautifulSoup(markup, 'html.parser', on_duplicate_attribute='replace')
3391
3172
    soup.a['href']
3392
3173
    # http://url2/
3393
3174
3394
3175
当 ``on_duplicate_attribute='ignore'`` 时，Beautiful Soup 会使用第一个出现的值，然后忽略
3395
3176
后出现的值
3396
3177
3397
3178
::
3398
3179
3399
3180
    soup = BeautifulSoup(markup, 'html.parser', on_duplicate_attribute='ignore')
3400
3181
    soup.a['href']
3401
3182
    # http://url1/
3402
3183
3403
3184
（lxml 和 html5lib 总是采用这种处理方式，它们的默认行为不能通过 Beautiful Soup 配置。）
3404
3185
3405
3186
如果需要复杂的控制，可以传入一个方法，当属性值重复时会被调用
3406
3187
3407
3188
::
3408
3189
3409
3190
    def accumulate(attributes_so_far, key, value):
3410
3191
        if not isinstance(attributes_so_far[key], list):
3411
3192
            attributes_so_far[key] = [attributes_so_far[key]]
3412
3193
        attributes_so_far[key].append(value)
3413
3194
3414
3195
    soup = BeautifulSoup(markup, 'html.parser', on_duplicate_attribute=accumulate)
3415
3196
    soup.a['href']
3416
3197
    # ["http://url1/", "http://url2/"]
3417
3198
3418
3199
这个特性新增于 Beautiful Soup 4.9.1。
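
To see what the ``accumulate`` callback does in isolation, it can be called by hand on an ordinary
dictionary. This is only a simulation, under the assumption (taken from the example above) that
the callback receives the attributes collected so far, the duplicated key, and the new value. It
does not run the parser, and ``http://url3/`` is a made-up third value.

```python
def accumulate(attributes_so_far, key, value):
    # Turn the existing value into a list on the first duplicate,
    # then keep appending later duplicates.
    if not isinstance(attributes_so_far[key], list):
        attributes_so_far[key] = [attributes_so_far[key]]
    attributes_so_far[key].append(value)

# State after the first href has been stored.
attrs = {"href": "http://url1/"}

# Simulate the parser meeting a second and a third href.
accumulate(attrs, "href", "http://url2/")
accumulate(attrs, "href", "http://url3/")
print(attrs["href"])
```

The ``isinstance`` check is what keeps the list from being nested when more than two duplicates
appear.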
3419
3200
3420
3201
实例化自定义子类
3421
3202
------------------
3422
3203
3423
3204
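When a parser tells Beautiful Soup about a tag or a string, Beautiful Soup instantiates a ``Tag``
or ``NavigableString`` object to hold that information. To change this default behavior, you can
tell Beautiful Soup to instantiate subclasses of ``Tag`` or ``NavigableString`` that you define
with custom behavior: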
3426
3207
3427
3208
::
3428
3209
3429
3210
    from bs4 import Tag, NavigableString
3430
3211
    class MyTag(Tag):
3431
3212
        pass
3432
3213
3433
3214
3434
3215
    class MyString(NavigableString):
3435
3216
        pass
3436
3217
3437
3218
3438
3219
    markup = "<div>some text</div>"
3439
3220
    soup = BeautifulSoup(markup, 'html.parser')
3440
3221
    isinstance(soup.div, MyTag)
3441
3222
    # False
3442
3223
    isinstance(soup.div.string, MyString)
3443
3224
    # False 
3444
3225
3445
3226
    my_classes = { Tag: MyTag, NavigableString: MyString }
3446
3227
    soup = BeautifulSoup(markup, 'html.parser', element_classes=my_classes)
3447
3228
    isinstance(soup.div, MyTag)
3448
3229
    # True
3449
3230
    isinstance(soup.div.string, MyString)
3450
3231
    # True  
3451
3232
3452
3233
This can be useful when incorporating Beautiful Soup into a test framework.
3453
3234
3454
3235
这个特性新增于 Beautiful Soup 4.8.1。
3455
2445
3236
3456
2446
常见问题
3237
常见问题
3457
2447
========
3238
========
3458
2448
3239
3459
3240
.. _diagnose:
3460
3241
3461
2449
代码诊断
3242
代码诊断
3462
2450
----------
3243
----------
3463
2451
3244
3465
2452
如果想知道Beautiful Soup到底怎样处理一份文档,可以将文档传入 ``diagnose()`` 方法(Beautiful Soup 4.2.0中新增),Beautiful Soup会输出一份报告,说明不同的解析器会怎样处理这段文档,并标出当前的解析过程会使用哪种解析器:
3245
如果想知道 Beautiful Soup 到底怎样处理一份文档，可以将文档传入 ``diagnose()`` 
3466
3246
方法(Beautiful Soup 4.2.0中新增)， Beautiful Soup 会输出一份报告，
3467
3247
说明不同的解析器会怎样处理这段文档，并标出当前的解析过程会使用哪种解析器:
3468
2453
3248
3469
2454
::
3249
::
3470
2455
3250
3471
2456
    from bs4.diagnose import diagnose
3251
    from bs4.diagnose import diagnose
3473
2457
    data = open("bad.html").read()
3252
    with open("bad.html") as fp:
3474
3253
        data = fp.read()
3475
3254
3476
2458
    diagnose(data)
3255
    diagnose(data)
3477
2459
3256
3478
2460
    # Diagnostic running on Beautiful Soup 4.2.0
3257
    # Diagnostic running on Beautiful Soup 4.2.0
3479
@@ -2466,93 +3263,150 @@ SoupStrainer
3480
2466
    # Here's what html.parser did with the document:
3263
    # Here's what html.parser did with the document:
3481
2467
    # ...
3264
    # ...
3482
2468
3265
3484
2469
``diagnose()`` 方法的输出结果可能帮助你找到问题的原因,如果不行,还可以把结果复制出来以便寻求他人的帮助
3266
``diagnose()`` 方法的输出结果可能帮助你找到问题的原因，如果不行，还可以把结果复制出来以
3485
3267
便寻求他人的帮助。
3486
2470
3268
3487
2471
文档解析错误
3269
文档解析错误
3488
2472
-------------
3270
-------------
3489
2473
3271
3491
2474
文档解析错误有两种.一种是崩溃,Beautiful Soup尝试解析一段文档结果却抛除了异常,通常是 ``HTMLParser.HTMLParseError`` .还有一种异常情况,是Beautiful Soup解析后的文档树看起来与原来的内容相差很多.
3272
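There are two kinds of parse errors. The first is a crash: you feed a document to Beautiful Soup
and it raises an exception, usually an ``HTMLParser.HTMLParseError``. The second is unexpected
behavior: the parse tree Beautiful Soup produces looks very different from the document used to
create it.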
3494
2475
3275
3496
2476
这些错误几乎都不是Beautiful Soup的原因,这不会是因为Beautiful Soup的代码写的太优秀,而是因为Beautiful Soup没有包含任何文档解析代码.异常产生自被依赖的解析器,如果解析器不能很好的解析出当前的文档,那么最好的办法是换一个解析器.更多细节查看 `安装解析器`_ 章节.
3276
这些错误几乎都不是 Beautiful Soup 的原因，这不是因为 Beautiful Soup 的代码写得多优秀，
3497
3277
而是因为 Beautiful Soup 没有包含任何文档解析代码。异常产生自被依赖的解析器，如果解析器不能
3498
3278
很好的解析出当前的文档，那么最好的办法是换一个解析器。更多细节查看 `安装解析器`_ 章节。
3499
2477
3279
3501
2478
最常见的解析错误是 ``HTMLParser.HTMLParseError: malformed start tag`` 和 ``HTMLParser.HTMLParseError: bad end tag`` .这都是由Python内置的解析器引起的,解决方法是 `安装lxml或html5lib`_
3280
最常见的解析错误是 ``HTMLParser.HTMLParseError: malformed start tag`` 和 
3502
3281
``HTMLParser.HTMLParseError: bad end tag`` 。这都是由Python内置的解析器引起的，
3503
3282
解决方法是 `安装 lxml 或 html5lib <安装解析器>`_ 。
3504
2479
3283
3506
2480
最常见的异常现象是当前文档找不到指定的Tag,而这个Tag光是用眼睛就足够发现的了. ``find_all()`` 方法返回 [] ,而 ``find()`` 方法返回 None .这是Python内置解析器的又一个问题: 解析器会跳过那些它不知道的tag.解决方法还是 `安装lxml或html5lib`_
3284
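The most common type of unexpected behavior is that you can't find a tag that you know is in the
document. You can see it in the markup, but ``find_all()`` returns ``[]`` or ``find()`` returns
``None``. This is another common problem with Python's built-in HTML parser, which sometimes skips
tags it doesn't understand. Again, the solution is to install lxml or html5lib (see `安装解析器`_).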
3509
2481
3287
3510
2482
版本错误
3288
版本错误
3511
2483
----------
3289
----------
3512
2484
3290
3514
2485
* ``SyntaxError: Invalid syntax`` (异常位置在代码行: ``ROOT_TAG_NAME = u'[document]'`` ),因为Python2语法的代码(没有经过迁移)直接在Python3中运行
3291
* ``SyntaxError: Invalid syntax`` (异常位置在代码行: ``ROOT_TAG_NAME = u'[document]'`` )，
3515
3292
  原因是用 Python2 版本的 Beautiful Soup 未经过代码转换，直接在 Python3 中运行。
3516
2486
3293
3518
2487
* ``ImportError: No module named HTMLParser`` 因为在Python3中执行Python2版本的Beautiful Soup
3294
* ``ImportError: No module named HTMLParser`` 因为在 Python3 中执行 Python2 版本的 Beautiful Soup。
3519
2488
3295
3521
2489
* ``ImportError: No module named html.parser`` 因为在Python2中执行Python3版本的Beautiful Soup
3296
* ``ImportError: No module named html.parser`` 因为在 Python2 中执行 Python3 版本的 Beautiful Soup
3522
2490
3297
3524
2491
* ``ImportError: No module named BeautifulSoup`` 因为在没有安装BeautifulSoup3库的Python环境下执行代码,或忘记了BeautifulSoup4的代码需要从 ``bs4`` 包中引入
3298
* ``ImportError: No module named BeautifulSoup`` 因为在没有安装 Beautiful Soup3 库的 Python 环境下执行代码，
3525
3299
  或忘记了 Beautiful Soup4 的代码需要从 ``bs4`` 包中引入。
3526
2492
3300
3528
2493
* ``ImportError: No module named bs4`` 因为当前Python环境下还没有安装BeautifulSoup4
3301
* ``ImportError: No module named bs4`` 因为当前 Python 环境下还没有安装 Beautiful Soup4。
3529
2494
3302
3530
2495
解析成XML
3303
解析成XML
3531
2496
----------
3304
----------
3532
2497
3305
3534
2498
默认情况下,Beautiful Soup会将当前文档作为HTML格式解析,如果要解析XML文档,要在 ``BeautifulSoup`` 构造方法中加入第二个参数 "xml":
3306
默认情况下，Beautiful Soup 会将当前文档作为 HTML 格式解析，如果要解析 XML 文档，要在 
3535
3307
``BeautifulSoup`` 构造方法中加入第二个参数 "xml":
3536
2499
3308
3537
2500
::
3309
::
3538
2501
3310
3539
2502
    soup = BeautifulSoup(markup, "xml")
3311
    soup = BeautifulSoup(markup, "xml")
3540
2503
3312
3542
2504
当然,还需要 `安装lxml`_
3313
当然，还需要 `安装 lxml <安装解析器>`_
3543
2505
3314
3546
2506
解析器的错误
3315
其它解析器的错误
3547
2507
------------
3316
------------------
3548
2508
3317
3550
2509
* 如果同样的代码在不同环境下结果不同,可能是因为两个环境下使用不同的解析器造成的.例如这个环境中安装了lxml,而另一个环境中只有html5lib, `解析器之间的区别`_ 中说明了原因.修复方法是在 ``BeautifulSoup`` 的构造方法中中指定解析器
3318
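* If the same code gives different results in different environments, the two environments
  probably have different parser libraries available. For example, one may have lxml installed
  while the other only has html5lib. See `解析器之间的区别`_ for why this matters. The fix is to
  name a specific parser in the ``BeautifulSoup`` constructor.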
3553
2510
3321
3555
2511
* 因为HTML标签是 `大小写敏感 <http://www.w3.org/TR/html5/syntax.html#syntax>`_ 的,所以3种解析器再出来文档时都将tag和属性转换成小写.例如文档中的 <TAG></TAG> 会被转换为 <tag></tag> .如果想要保留tag的大写的话,那么应该将文档 `解析成XML`_ .
3322
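* Because `HTML tags and attributes are case-insensitive <http://www.w3.org/TR/html5/syntax.html#syntax>`_ ,
  the HTML parsers convert tag and attribute names to lowercase. For example, the markup
  <TAG></TAG> is converted to <tag></tag>. To preserve mixed-case or uppercase tags and
  attributes, parse the document as XML (see `解析成XML`_).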
3558
2512
3325
 杂项错误
 --------

-* ``UnicodeEncodeError: 'charmap' codec can't encode character u'\xfoo' in position bar`` (或其它类型的 ``UnicodeEncodeError`` )的错误,主要是两方面的错误(都不是Beautiful Soup的原因),第一种是正在使用的终端(console)无法显示部分Unicode,参考 `Python wiki <http://wiki.Python.org/moin/PrintFails>`_ ,第二种是向文件写入时,被写入文件不支持部分Unicode,这时只要用 ``u.encode("utf8")`` 方法将编码转换为UTF-8.
+* ``UnicodeEncodeError: 'charmap' codec can't encode character
+  '\xfoo' in position bar`` (或其它类型的 ``UnicodeEncodeError`` )的错误，
+  主要是两方面的原因，第一种是正在使用的终端(console)无法显示部分Unicode，参考
+  `Python wiki <http://wiki.Python.org/moin/PrintFails>`_ ，第二种是向文件
+  写入时，被写入文件不支持部分 Unicode，这时需要用 ``u.encode("utf8")`` 方法将
+  编码转换为UTF-8。
+
+* ``KeyError: [attr]`` 因为调用 ``tag['attr']`` 方法而引起，因为这个 tag 没有定义
+  该属性。出错最多的是 ``KeyError: 'href'`` 和 ``KeyError: 'class'`` 。如果不确定
+  某个属性是否存在时，用 ``tag.get('attr')`` 方法去获取它，跟获取 Python 字典的 key 一样。

-* ``KeyError: [attr]`` 因为调用 ``tag['attr']`` 方法而引起,因为这个tag没有定义该属性.出错最多的是 ``KeyError: 'href'`` 和 ``KeyError: 'class'`` .如果不确定某个属性是否存在时,用 ``tag.get('attr')`` 方法去获取它,跟获取Python字典的key一样
+* ``AttributeError: 'ResultSet' object has no attribute 'foo'`` 错误通常是
+  因为把 ``find_all()`` 的返回结果当作一个 tag 或文本节点使用，实际上返回结果是一个
+  列表或 ``ResultSet`` 对象的字符串，需要对结果进行循环才能得到每个节点的 ``.foo``
+  属性。 或者使用 ``find()`` 方法仅获取到一个节点。

-* ``AttributeError: 'ResultSet' object has no attribute 'foo'`` 错误通常是因为把 ``find_all()`` 的返回结果当作一个tag或文本节点使用,实际上返回结果是一个列表或 ``ResultSet`` 对象的字符串,需要对结果进行循环才能得到每个节点的 ``.foo`` 属性.或者使用 ``find()`` 方法仅获取到一个节点
+* ``AttributeError: 'NoneType' object has no attribute 'foo'`` 这个错误通常是
+  在调用了 ``find()`` 方法后直节点取某个属性 foo。但是 ``find()`` 方法并没有找到任何
+  结果，所以它的返回值是 ``None`` 。需要找出为什么 ``find()`` 的返回值是 ``None``。

-* ``AttributeError: 'NoneType' object has no attribute 'foo'`` 这个错误通常是在调用了 ``find()`` 方法后直节点取某个属性 .foo 但是 ``find()`` 方法并没有找到任何结果,所以它的返回值是 ``None`` .需要找出为什么 ``find()`` 的返回值是 ``None`` .
+* ``AttributeError: 'NavigableString' object has no attribute 'foo'`` 这种问题
+  通常是因为吧一个字符串当做一个 tag 来处理。可能在迭代一个列表时，期望其中都是 tag，但实际上
+  列表里既包含 tag 也包含字符串。

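The ``KeyError`` and ``AttributeError`` entries in the list above can be reproduced with a few lines of Python. This is only a sketch: the markup and variable names are invented for the illustration, and it assumes Beautiful Soup 4 is installed and uses the built-in ``html.parser``.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second link has no href attribute.
html = '<p><a href="/one">one</a><a>two</a></p>'
soup = BeautifulSoup(html, "html.parser")

# find_all() returns a list-like ResultSet, so loop over it
# instead of treating it as a single tag.
links = soup.find_all("a")

# tag.get() behaves like dict.get(): None when the attribute is missing.
hrefs = [a.get("href") for a in links]

# tag['attr'] raises KeyError when the attribute is not defined.
try:
    links[1]["href"]
    missing = None
except KeyError as exc:
    missing = exc.args[0]

# find() returns None when nothing matches, so check before using the result.
table = soup.find("table")
row_count = len(table.find_all("tr")) if table is not None else 0
```

Checking for ``None`` before touching the result of ``find()`` avoids the ``'NoneType' object has no attribute`` error described in the list.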
 如何提高效率
 ------------

-Beautiful Soup对文档的解析速度不会比它所依赖的解析器更快,如果对计算时间要求很高或者计算机的时间比程序员的时间更值钱,那么就应该直接使用 `lxml <http://lxml.de/>`_ .
+Beautiful Soup对文档的解析速度不会比它所依赖的解析器更快，如果对计算时间要求很高或者
+计算机的时间比程序员的时间更值钱，那么就应该直接使用 `lxml <http://lxml.de/>`_ 。
+
+换句话说，还有提高 Beautiful Soup 效率的办法，使用lxml作为解析器。Beautiful Soup
+用 lxml 做解析器比用 html5lib 或 Python 内置解析器速度快很多。
+
+安装 `cchardet <http://pypi.Python.org/pypi/cchardet/>`_ 后文档的解码的编码检测
+速度会更快。
+
+`解析部分文档`_ 不会节省多少解析时间，但是会节省很多内存，并且搜索时也会变得更快。
+
+翻译这篇文档
+=============
+
+非常感谢欢迎翻译 Beautiful Soup 的文档。翻译内容应当基于 MIT 协议，就像 Beautiful Soup 和
+英文文档一样。

-换句话说,还有提高Beautiful Soup效率的办法,使用lxml作为解析器.Beautiful Soup用lxml做解析器比用html5lib或Python内置解析器速度快很多.
+有两种方式将翻译内容添加到 Beautiful Soup 的网站上：

-安装 `cchardet <http://pypi.Python.org/pypi/cchardet/>`_ 后文档的解码的编码检测会速度更快
+1. 在 Beautiful Soup 代码库上创建一个分支，添加翻译，然后合并到主分支。就像修改源代码一样。

-`解析部分文档`_ 不会节省多少解析时间,但是会节省很多内存,并且搜索时也会变得更快.
+2. 向 Beautiful Soup 讨论组里发送一个消息，带上翻译的链接，或翻译内容的附件。
+
+使用中文或葡萄牙语的翻译作为模型。尤其注意，请翻译源文件 ``doc/source/index.rst``，
+而不是 HTML 版本的文档。这样才能将文档发布为多种格式，而不局限于 HTML。

 Beautiful Soup 3
 =================

-Beautiful Soup 3是上一个发布版本,目前已经停止维护.Beautiful Soup 3库目前已经被几个主要的linux平台添加到源里:
+Beautiful Soup 3 是上一个发布版本，目前已经停止维护。Beautiful Soup 3 库目前已经被
+几个主要的 linux 发行版添加到源里:

 ``$ apt-get install Python-beautifulsoup``

-在PyPi中分发的包名字是 ``BeautifulSoup`` :
+在 PyPi 中分发的包名字是 ``BeautifulSoup`` :

 ``$ easy_install BeautifulSoup``

 ``$ pip install BeautifulSoup``

-或通过 `Beautiful Soup 3.2.0源码包 <http://www.crummy.com/software/BeautifulSoup/bs3/download/3.x/BeautifulSoup-3.2.0.tar.gz>`_ 安装
+或通过 `Beautiful Soup 3.2.0源码包
+<http://www.crummy.com/software/BeautifulSoup/bs3/download/3.x/BeautifulSoup-3.2.0.tar.gz>`_ 安装。
+
+如果是通过 ``easy_install beautifulsoup`` 或 ``easy_install
+BeautifulSoup`` 安装，然后代码无法运行，那么可能是安装了错误的 Beautiful Soup 3 版本。
+应该这样安装 ``easy_install beautifulsoup4``。

-Beautiful Soup 3的在线文档查看 `这里 <http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html>`_ .
+Beautiful Soup 3 的在线文档查看 `这里 <http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html>`_ .

-迁移到BS4
+迁移到 BS4
 ----------

-只要一个小变动就能让大部分的Beautiful Soup 3代码使用Beautiful Soup 4的库和方法----修改 ``BeautifulSoup`` 对象的引入方式:
+大部分使用 Beautiful Soup 3 编写的代码都可以在 Beautiful Soup 4 上运行，
+只有一个小变动。只要把引用包的名字从 :py:class:`BeautifulSoup` 改为 ``bs4``，比如:

 ::

@@ -2564,23 +3418,31 @@ Beautiful Soup 3的在线文档查看 `这里 <http://www.crummy.com/software/Be

     from bs4 import BeautifulSoup

-* 如果代码抛出 ``ImportError`` 异常“No module named BeautifulSoup”,原因可能是尝试执行Beautiful Soup 3,但环境中只安装了Beautiful Soup 4库
+* 如果代码抛出 ``ImportError`` 异常 "No module named BeautifulSoup"，
+  原因可能是尝试执行 Beautiful Soup 3，但环境中只安装了 Beautiful Soup 4。

-* 如果代码跑出 ``ImportError`` 异常“No module named bs4”,原因可能是尝试运行Beautiful Soup 4的代码,但环境中只安装了Beautiful Soup 3.
+* 如果代码抛出 ``ImportError`` 异常 "No module named bs4"，原因可能是尝试
+  运行 Beautiful Soup 4 的代码，但环境中只安装了Beautiful Soup 3。

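The two ``ImportError`` cases above come down to which top-level module is importable: ``bs4`` (Beautiful Soup 4) or ``BeautifulSoup`` (Beautiful Soup 3). A minimal sketch of that check, using only the standard library, might look like the following. The helper names ``installed_soup_packages`` and ``import_hint`` are hypothetical and not part of Beautiful Soup.

```python
import importlib.util


def installed_soup_packages():
    """Report which Beautiful Soup top-level modules can be imported."""
    names = ("bs4", "BeautifulSoup")  # BS4 module name, BS3 module name
    return {name: importlib.util.find_spec(name) is not None for name in names}


def import_hint(found):
    """Turn the availability map into a short hint (hypothetical helper)."""
    if found["bs4"]:
        return "from bs4 import BeautifulSoup"
    if found["BeautifulSoup"]:
        return "only Beautiful Soup 3 is installed; install beautifulsoup4"
    return "Beautiful Soup is not installed; install beautifulsoup4"


hint = import_hint(installed_soup_packages())
```

Running this in each environment shows whether the code and the installed package belong to the same major version.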
-虽然BS4兼容绝大部分BS3的功能,但BS3中的大部分方法已经不推荐使用了,就方法按照 `PEP8标准 <http://www.Python.org/dev/peps/pep-0008/>`_ 重新定义了方法名.很多方法都重新定义了方法名,但只有少数几个方法没有向下兼容.
+尽管 BS4 兼容绝大部分 BS3 的功能，但 BS3 中的大部分方法已经不推荐使用了，旧方法标记废弃，
+并按照 `PEP8标准 <http://www.Python.org/dev/peps/pep-0008/>`_ 重新命名了新方法。
+虽然有大量的重命名和修改，但只有少数几个方法没有向下兼容。

-上述内容就是BS3迁移到BS4的注意事项
+下面内容就是 BS3 迁移到 BS4 的注意事项:

-需要的解析器
+解析器的变化
-............
+^^^^^^^^^^^^^^^

-Beautiful Soup 3曾使用Python的 ``SGMLParser`` 解析器,这个模块在Python3中已经被移除了.Beautiful Soup 4默认使用系统的 ``html.parser`` ,也可以使用lxml或html5lib扩展库代替.查看 `安装解析器`_ 章节
+Beautiful Soup 3 曾使用 Python 的 ``SGMLParser`` 解析器，这个模块在 Python3
+中已经被移除了。Beautiful Soup 4 默认使用系统的 ``html.parser`` , 也可以使用
+lxml 或 html5lib 扩展库代替。查看 `安装解析器`_ 章节。

-因为解析器 ``html.parser`` 与 ``SGMLParser`` 不同. BS4 和 BS3 处理相同的文档会产生不同的对象结构. 使用lxml或html5lib解析文档的时候, 如果添加了 ``html.parser`` 参数, 解析的对象又回发生变化. 如果发生了这种情况, 只能修改对应的处文档结果处理代码了.
+因为解析器 ``html.parser`` 与 ``SGMLParser`` 不同。BS4 和 BS3 处理相同的文档会
+产生不同的对象结构。使用 lxml 或 html5lib 解析文档的时候，如果添加了 ``html.parser``
+参数，解析的对象又会发生变化。如果发生了这种情况，只能修改对应的文档处理代码了。

 方法名的变化
-............
+^^^^^^^^^^^^^^

 * ``renderContents`` -> ``encode_contents``

@@ -2614,32 +3476,33 @@ Beautiful Soup 3曾使用Python的 ``SGMLParser`` 解析器,这个模块在Pytho

 * ``previousSibling`` -> ``previous_sibling``

-Beautiful Soup构造方法的参数部分也有名字变化:
+Beautiful Soup 构造方法的参数部分也有名字变化:

 * ``BeautifulSoup(parseOnlyThese=...)`` -> ``BeautifulSoup(parse_only=...)``

 * ``BeautifulSoup(fromEncoding=...)`` -> ``BeautifulSoup(from_encoding=...)``

-为了适配Python3,修改了一个方法名:
+为了适配 Python3，修改了一个方法名:

 * ``Tag.has_key()`` -> ``Tag.has_attr()``

-修改了一个属性名,让它看起来更专业点:
+修改了一个属性名，让它看起来更专业点:

 * ``Tag.isSelfClosing`` -> ``Tag.is_empty_element``

-修改了下面3个属性的名字,以免雨Python保留字冲突.这些变动不是向下兼容的,如果在BS3中使用了这些属性,那么在BS4中这些代码无法执行.
+修改了下面 3 个属性的名字，以免与 Python 保留字冲突。这些变动不是向下兼容的，如果在 BS3
+中使用了这些属性，那么在 BS4 中这些代码无法执行。

-* UnicodeDammit.Unicode -> UnicodeDammit.Unicode_markup``
+* ``UnicodeDammit.Unicode -> UnicodeDammit.Unicode_markup``

 * ``Tag.next`` -> ``Tag.next_element``

 * ``Tag.previous`` -> ``Tag.previous_element``

 生成器
-.......
+^^^^^^^

-将下列生成器按照PEP8标准重新命名,并转换成对象的属性:
+将下列生成器按照 PEP8 标准重新命名，并转换成对象的属性:

 * ``childGenerator()`` -> ``children``

@@ -2655,7 +3518,7 @@ Beautiful Soup构造方法的参数部分也有名字变化:

 * ``parentGenerator()`` -> ``parents``

-所以迁移到BS4版本时要替换这些代码:
+所以要把这样的代码:

 ::

@@ -2669,74 +3532,74 @@ Beautiful Soup构造方法的参数部分也有名字变化:
     for parent in tag.parents:
         ...

-(两种调用方法现在都能使用)
+(其实老方法也可以继续使用)

-BS3中有的生成器循环结束后会返回 ``None`` 然后结束.这是个bug.新版生成器不再返回 ``None`` .
+有的生成器循环结束后会返回 ``None`` 然后结束。这是个 bug。新版生成器不再返回 ``None`` 。

-BS4中增加了2个新的生成器, `.strings 和 stripped_strings`_ . ``.strings`` 生成器返回NavigableString对象, ``.stripped_strings`` 方法返回去除前后空白的Python的string对象.
+BS4 中增加了 2 个新的生成器， `.strings 和 stripped_strings`_ 。 ``.strings`` 生成器
+返回 NavigableString 对象， ``.stripped_strings`` 方法返回去除前后空白的 Python 的
+string 对象。

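The renamed generators and the two new ones described above (``children``, ``parents``, ``.strings``, ``.stripped_strings``), together with ``Tag.has_attr()``, can be tried out with a short sketch. The markup here is made up for the example, and the code assumes Beautiful Soup 4 with the built-in ``html.parser``.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
soup = BeautifulSoup("<div><p> one </p><p>two</p></div>", "html.parser")
div = soup.div

# BS4 exposes the old generator methods as properties.
child_names = [child.name for child in div.children]
parent_names = [parent.name for parent in soup.p.parents]

# .strings yields NavigableString objects with whitespace preserved;
# .stripped_strings yields plain strings with surrounding whitespace removed.
raw = list(div.strings)
clean = list(div.stripped_strings)

# has_attr() replaces the BS3 method has_key().
has_id = div.has_attr("id")
```

The last item of ``parent_names`` is the ``BeautifulSoup`` object itself, whose name is ``[document]``.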
 XML
-....
+^^^^^

-BS4中移除了解析XML的 ``BeautifulStoneSoup`` 类.如果要解析一段XML文档,使用 ``BeautifulSoup`` 构造方法并在第二个参数设置为“xml”.同时 ``BeautifulSoup`` 构造方法也不再识别 ``isHTML`` 参数.
+BS4 中移除了解析 XML 的 ``BeautifulStoneSoup`` 类。如果要解析一段 XML 文档，使用
+``BeautifulSoup`` 构造方法并在第二个参数设置为“xml”。同时 ``BeautifulSoup`` 构造
+方法也不再识别 ``isHTML`` 参数。

-Beautiful Soup处理XML空标签的方法升级了.旧版本中解析XML时必须指明哪个标签是空标签. 构造方法的 ``selfClosingTags`` 参数已经不再使用.新版Beautiful Soup将所有空标签解析为空元素,如果向空元素中添加子节点,那么这个元素就不再是空元素了.
+Beautiful Soup 处理 XML 空标签的方法升级了。旧版本中解析 XML 时必须指明哪个标签是空标签。
+构造方法的 ``selfClosingTags`` 参数已经不再使用。新版 Beautiful Soup 将所有空标签解析
+为空元素，如果向空元素中添加子节点，那么这个元素就不再是空元素了。

 实体
-.....
+^^^^^

-HTML或XML实体都会被解析成Unicode字符,Beautiful Soup 3版本中有很多处理实体的方法,在新版中都被移除了. ``BeautifulSoup`` 构造方法也不再接受 ``smartQuotesTo`` 或 ``convertEntities`` 参数. `编码自动检测`_ 方法依然有 ``smart_quotes_to`` 参数,但是默认会将引号转换成Unicode.内容配置项 ``HTML_ENTITIES`` , ``XML_ENTITIES`` 和 ``XHTML_ENTITIES`` 在新版中被移除.因为它们代表的特性已经不再被支持.
+输入的 HTML 或 XML 实体都会被解析成 Unicode 字符。Beautiful Soup 3 版本中有很多相似处理
+实体的方法，在新版中都被移除了。 ``BeautifulSoup`` 构造方法也不再接受 ``smartQuotesTo``
+或 ``convertEntities`` 参数。 `编码自动检测 <Unicode, Dammit>`_ 方法依然有 ``smart_quotes_to``
+参数，但是默认会将引号转换成 Unicode。内容配置项 ``HTML_ENTITIES`` , ``XML_ENTITIES`` 和
+``XHTML_ENTITIES`` 在新版中被移除。因为它们代表的特性（转换部分而不是全部实体到 Unicode 字符）
+已经不再支持。

-如果在输出文档时想把Unicode字符转换成HTML实体,而不是输出成UTF-8编码,那就需要用到 `输出格式`_ 的方法.
+如果在输出文档时想把 Unicode 字符转回 HTML 实体，而不是输出成 UTF-8 编码，那就需要用到
+`输出格式`_ 的方法。

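The entity handling described above can be illustrated with a small sketch: entities in the input become Unicode characters, and an output formatter turns them back into HTML entities. The sample markup is invented for the example, and the code assumes Beautiful Soup 4 with the built-in ``html.parser``.

```python
from bs4 import BeautifulSoup

# Hypothetical input that uses named HTML entities.
soup = BeautifulSoup("<p>caf&eacute; &amp; cr&egrave;me</p>", "html.parser")

# Incoming entities are converted to Unicode characters.
text = soup.p.string

# The default ("minimal") formatter only escapes what is needed for valid markup.
default_markup = soup.p.decode()

# formatter="html" converts Unicode characters back to HTML entities on output.
entity_markup = soup.p.decode(formatter="html")
```

The same ``formatter`` argument is also accepted by ``prettify()`` and ``encode()``.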
 迁移杂项
-.........
+^^^^^^^^^
+
+`Tag.string <string>`_ 属性现在是一个递归操作。如果 A 标签只包含了一个 B 标签，那么 A 标签的。
+string 属性值与 B 标签的 string 属性值相同。

-`Tag.string`_ 属性现在是一个递归操作.如果A标签只包含了一个B标签,那么A标签的.string属性值与B标签的.string属性值相同.
+`多值属性`_ 比如 ``class`` 属性包含一个他们的值的列表，而不是一个字符串。这可能会影响到如何按照
+CSS 类名哦搜索 tag。

-`多值属性`_ 比如 ``class`` 属性包含一个他们的值的列表,而不是一个字符串.这可能会影响到如何按照CSS类名哦搜索tag.
+Tag 对象实现了一个 ``__hash__`` 方法，这样当两个 Tag 对象生成相同的摘要时会被认为相等。这可能会
+改变你的脚本行为，如果你吧 Tag 对象放到字典或集合中。

-如果使用 ``find*`` 方法时同时传入了 `string 参数`_ 和 `name 参数`_ .Beautiful Soup会搜索指定name的tag,并且这个tag的 `Tag.string`_ 属性包含text参数的内容.结果中不会包含字符串本身.旧版本中Beautiful Soup会忽略掉tag参数,只搜索text参数.
+如果使用 ``find*`` 方法时同时传入了 `string 参数`_ 和一个指定 tag 的参数比如 `name 参数`_ 。
+Beautiful Soup 会搜索符合指定参数的 tag，并且这个 tag 的 `Tag.string <string>`_ 属性包含
+`string 参数`_ 参数的内容。结果中 `不会` 包含字符串本身。旧版本中 Beautiful Soup 会忽略掉指定
+tag 的参数，只搜索符合 string 的内容。

-``BeautifulSoup`` 构造方法不再支持 markupMassage 参数.现在由解析器负责文档的解析正确性.
+``BeautifulSoup`` 构造方法不再支持 markupMassage 参数。现在由解析器负责文档的解析正确性。

-很少被用到的几个解析器方法在新版中被移除,比如 ``ICantBelieveItsBeautifulSoup`` 和 ``BeautifulSOAP`` .现在由解析器完全负责如何解释模糊不清的文档标记.
+很少被用到的几个解析器方法在新版中被移除，比如 ``ICantBelieveItsBeautifulSoup`` 和 ``BeautifulSOAP`` 。
+现在由解析器完全负责如何解释模糊不清的文档标记。

-``prettify()`` 方法在新版中返回Unicode字符串,不再返回字节流.
+``prettify()`` 方法在新版中返回 Unicode 字符串，不再返回字节串。

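Several of the porting notes above (the recursive ``.string``, list-valued ``class``, passing ``string`` together with a tag name, and ``prettify()`` returning text) can be checked with a short sketch. The markup is hypothetical, and the code assumes Beautiful Soup 4.4.0 or later (for the ``string`` argument) with the built-in ``html.parser``.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
soup = BeautifulSoup('<p class="lead intro"><b>Hello</b></p>', "html.parser")

# Multi-valued attributes such as class come back as a list, not a string.
classes = soup.p["class"]

# .string recurses: <p> contains only <b>, so it reports <b>'s string.
inner = soup.p.string

# Searching by one CSS class still matches a tag that carries several classes.
by_class = soup.find_all("p", class_="intro")

# With both a tag name and string=..., the result holds tags, not bare strings.
matched = [tag.name for tag in soup.find_all("b", string="Hello")]

# prettify() returns a Unicode string (str) rather than bytes.
pretty = soup.prettify()
```

Code ported from Beautiful Soup 3 that compares ``tag['class']`` with a plain string is the most likely place for this change to surface.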
 附录
 =====

-.. _`BeautifulSoup3 文档`: http://www.crummy.com/software/BeautifulSoup/bs3/documentation.zh.html
+.. [1] BeautifulSoup 的 google 讨论组不是很活跃，可能是因为库已经比较完善了吧，但是作者还是会很热心的尽量帮你解决问题的。
-.. _name: `name 参数`_
+.. [2] 文档被解析成树形结构，所以下一步解析过程应该是当前节点的子节点
-.. _attrs: `按CSS搜索`_
+.. [3] 过滤器只能作为搜索文档的参数，或者说应该叫参数类型更为贴切，原文中用了 ``filter`` 因此翻译为过滤器
-.. _recursive: `recursive 参数`_
+.. [4] 元素参数，HTML 文档中的一个 tag 节点，不能是文本节点
-.. _string: `string 参数`_
-.. _**kwargs: `keyword 参数`_
-.. _.next_siblings: `.next_siblings 和 .previous_siblings`_
-.. _.previous_siblings: `.next_siblings 和 .previous_siblings`_
-.. _.next_elements: `.next_elements 和 .previous_elements`_
-.. _.previous_elements: `.next_elements 和 .previous_elements`_
-.. _.stripped_strings: `.strings 和 stripped_strings`_
-.. _安装lxml: `安装解析器`_
-.. _安装lxml或html5lib: `安装解析器`_
-.. _编码自动检测: `Unicode, Dammit! (乱码, 靠!)`_
-.. _Tag.string: `.string`_
-
-
-.. [1] BeautifulSoup的google讨论组不是很活跃,可能是因为库已经比较完善了吧,但是作者还是会很热心的尽量帮你解决问题的.
-.. [2] 文档被解析成树形结构,所以下一步解析过程应该是当前节点的子节点
-.. [3] 过滤器只能作为搜索文档的参数,或者说应该叫参数类型更为贴切,原文中用了 ``filter`` 因此翻译为过滤器
-.. [4] 元素参数,HTML文档中的一个tag节点,不能是文本节点
 .. [5] 采用先序遍历方式
-.. [6] CSS选择器是一种单独的文档搜索语法, 参考 http://www.w3school.com.cn/css/css_selector_type.asp
+.. [6] CSS选择器是一种单独的文档搜索语法，参考 http://www.w3school.com.cn/css/css_selector_type.asp
 .. [7] 原文写的是 html5lib, 译者觉得这是原文档的一个笔误
-.. [8] wrap含有包装,打包的意思,但是这里的包装不是在外部包装而是将当前tag的内部内容包装在一个tag里.包装原来内容的新tag依然在执行 `wrap()`_ 方法的tag内
+.. [8] wrap含有包装，打包的意思，但是这里的包装不是在外部包装而是将当前tag的内部内容包装在一个tag里。
-.. [9] 文档中特殊编码字符被替换成特殊字符(通常是�)的过程是Beautful Soup自动实现的,如果想要多种编码格式的文档被完全转换正确,那么,只好,预先手动处理,统一编码格式
+        包装原来内容的新tag依然在执行 `wrap()`_ 方法的tag内
-.. [10] 智能引号,常出现在microsoft的word软件中,即在某一段落中按引号出现的顺序每个引号都被自动转换为左引号,或右引号.
+.. [9] 文档中特殊编码字符被替换成特殊字符(通常是�)的过程是Beautful Soup自动实现的，
-
+        如果想要多种编码格式的文档被完全转换正确，那么，只好，预先手动处理，统一编码格式
-原文: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
+.. [10] 智能引号，常出现在 microsoft 的 word 软件中，即在某一段落中按引号出现的顺序每个引号都被自动转换为左引号，或右引号。
-
-翻译: Deron Wang
-
-查看 `BeautifulSoup3 文档`_
Status:	Merged
Merged at revision:	59f273d402dc5c5ce48f78d043b701ab0b1f7764
Proposed branch:	~phoenixsite/beautifulsoup:master
Merge into:	beautifulsoup:master
Diff against target:	3872 lines (+1586/-723) 2 files modified doc.zh/source/conf.py (+1/-1) doc.zh/source/index.rst (+1585/-722)
Reviewer:	Leonard Richardson
Date requested:	2024-01-15
Review status:	Approve on 2024-01-16
Review via email: mp+458648@code.launchpad.net