Python BeautifulSoup

Last modified January 29, 2024

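This Python BeautifulSoup tutorial is an introduction to the BeautifulSoup Python library. The examples show how to find tags, traverse the document tree, modify the document, and scrape web pages.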

BeautifulSoup

BeautifulSoup is a Python library for parsing HTML and XML documents. It is often used for web scraping. BeautifulSoup transforms a complex HTML document into a tree of Python objects, such as tag, navigable string, or comment.
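The object types mentioned above can be inspected directly. The following is a minimal sketch, assuming a made-up inline snippet and Python's built-in 'html.parser' in place of lxml so that no extra parser is needed.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical inline snippet; 'html.parser' is Python's built-in parser.
html = '<p>Hello<!-- a note --></p>'
soup = BeautifulSoup(html, 'html.parser')

p = soup.p
first, second = p.contents  # the text node and the comment node

print(type(p).__name__)       # Tag
print(type(first).__name__)   # NavigableString
print(type(second).__name__)  # Comment
```

Comment is a subclass of NavigableString, so an isinstance check against NavigableString matches both text and comments.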

Installing BeautifulSoup

We use the pip3 command to install the necessary modules.

$ sudo pip3 install lxml

We need to install the lxml module, which BeautifulSoup uses as its parser in these examples.

$ sudo pip3 install bs4

BeautifulSoup is installed with the above command.

HTML file

In the examples, we use the following HTML file:

index.html

<!DOCTYPE html>
<html>
    <head>
        <title>Header</title>
        <meta charset="utf-8">
    </head>

    <body>
        <h2>Operating systems</h2>

        <ul id="mylist" style="width:150px">
            <li>Solaris</li>
            <li>FreeBSD</li>
            <li>Debian</li>
            <li>NetBSD</li>
            <li>Windows</li>
        </ul>

        <p>
          FreeBSD is an advanced computer operating system used to
          power modern servers, desktops, and embedded platforms.
        </p>

        <p>
          Debian is a Unix-like computer operating system that is
          composed entirely of free software.
        </p>

    </body>
</html>

Python BeautifulSoup simple example

In the first example, we use the BeautifulSoup module to get three tags.

simple.py

#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:

    contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    print(soup.h2)
    print(soup.head)
    print(soup.li)

The code example prints the HTML code of three tags.

from bs4 import BeautifulSoup

We import the BeautifulSoup class from the bs4 module. BeautifulSoup is the main class that does the work.

with open('index.html', 'r') as f:

    contents = f.read()

We open the index.html file and read its contents with the read method.

soup = BeautifulSoup(contents, 'lxml')

A BeautifulSoup object is created; the HTML data is passed to the constructor. The second argument specifies the parser.

print(soup.h2)
print(soup.head)

Here we print the HTML code of two tags: h2 and head.

print(soup.li)

There are multiple li elements; this line prints the first one.

$ ./simple.py
<h2>Operating systems</h2>
<head>
<title>Header</title>
<meta charset="utf-8"/>
</head>
<li>Solaris</li>

BeautifulSoup tags, name, text

The name attribute of a tag gives its name, and the text attribute gives its text content.

tags_names.py

#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:

    contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    print(f'HTML: {soup.h2}, name: {soup.h2.name}, text: {soup.h2.text}')

The code example prints the HTML code, name, and text of the h2 tag.

$ ./tags_names.py
HTML: <h2>Operating systems</h2>, name: h2, text: Operating systems
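Besides name and text, a tag also exposes its HTML attributes. The sketch below assumes a shortened inline copy of the ul element and uses the built-in 'html.parser'; indexing a tag like a dictionary and calling get are shown as two ways to read an attribute.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical copy of the list from index.html.
html = '<ul id="mylist" style="width:150px"><li>Solaris</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

ul = soup.ul
print(ul.name)          # ul
print(ul['id'])         # mylist
print(ul.get('class'))  # None, because the attribute is absent
print(ul.get_text())    # Solaris
```

Indexing with a missing attribute name raises a KeyError, so get is the safer choice when an attribute may not be present.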

BeautifulSoup traverse tags

With the recursiveChildGenerator method we traverse the HTML document.

traverse_tree.py

#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:

    contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    for child in soup.recursiveChildGenerator():

        if child.name:

            print(child.name)

The example goes through the document tree and prints the names of all HTML tags.

$ ./traverse_tree.py
html
head
title
meta
body
h2
ul
li
li
li
li
li
p
p

These are the tags that we have in the HTML document.
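As far as we know, newer Beautiful Soup releases treat recursiveChildGenerator as a legacy name for the descendants attribute. The same traversal can be sketched with descendants, here on a shortened hypothetical document parsed with the built-in 'html.parser'.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened, hypothetical version of index.html.
html = ('<html><head><title>Header</title></head>'
        '<body><h2>Operating systems</h2><p>Text</p></body></html>')
soup = BeautifulSoup(html, 'html.parser')

# Keep only Tag nodes; text nodes are skipped.
names = [node.name for node in soup.descendants if isinstance(node, Tag)]
print(names)
# ['html', 'head', 'title', 'body', 'h2', 'p']
```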

BeautifulSoup element children

With the children attribute, we can get the children of a tag.

get_children.py

#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:

    contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    root = soup.html

    root_childs = [e.name for e in root.children if e.name is not None]
    print(root_childs)

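The example retrieves the children of the html tag, places them into a Python list, and prints them to the console. Because the children attribute also returns the whitespace between tags, we add a condition to include only the tag names.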

$ ./get_children.py
['head', 'body']

The html tag has two children: head and body.
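To see why the filter on e.name is needed, the following sketch counts the raw children of a small list. The inline snippet is hypothetical and is parsed with the built-in 'html.parser'.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with newlines between the tags.
html = '<ul>\n<li>Solaris</li>\n<li>FreeBSD</li>\n</ul>'
soup = BeautifulSoup(html, 'html.parser')

children = list(soup.ul.children)
print(len(children))  # 5: three whitespace strings and two li tags

# Text nodes have no tag name, so their name is None.
tags_only = [c.name for c in children if c.name is not None]
print(tags_only)  # ['li', 'li']
```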

BeautifulSoup element descendants

With the descendants attribute we get all descendants (children at all levels) of a tag.

get_descendants.py

#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:

    contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    root = soup.body

    root_childs = [e.name for e in root.descendants if e.name is not None]
    print(root_childs)

The example retrieves all descendants of the body tag.

$ ./get_descendants.py
['h2', 'ul', 'li', 'li', 'li', 'li', 'li', 'p', 'p']

These are all the descendants of the body tag.

BeautifulSoup web scraping

Requests is a simple Python HTTP library. It provides methods for accessing web resources via HTTP.

scraping.py

#!/usr/bin/python

from bs4 import BeautifulSoup
import requests as req

resp = req.get('http://webcode.me')

soup = BeautifulSoup(resp.text, 'lxml')

print(soup.title)
print(soup.title.text)
print(soup.title.parent)

The example retrieves the title of a simple web page. It also prints its parent.

resp = req.get('http://webcode.me')

soup = BeautifulSoup(resp.text, 'lxml')

We get the HTML data of the page.

print(soup.title)
print(soup.title.text)
print(soup.title.parent)

We retrieve the HTML code of the title, its text, and the HTML code of its parent.

$ ./scraping.py 
<title>My html page</title>
My html page
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<title>My html page</title>
</head>
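In practice, a scraping script usually guards against slow or failing requests. The following is a sketch under our own assumptions: the helper names title_from_html and fetch_title are hypothetical, the 10-second timeout is an arbitrary choice, and the built-in 'html.parser' stands in for lxml.

```python
from bs4 import BeautifulSoup
import requests


def title_from_html(html):
    """Return the text of the <title> tag, or None if the page has none."""
    soup = BeautifulSoup(html, 'html.parser')
    return soup.title.get_text(strip=True) if soup.title else None


def fetch_title(url, timeout=10):
    """Download a page and return its title text (hypothetical helper)."""
    resp = requests.get(url, timeout=timeout)
    resp.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return title_from_html(resp.text)
```

Calling fetch_title('http://webcode.me') would perform the same request as the example above, but it stops with an exception instead of parsing an error page.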

BeautifulSoup prettify code

With the prettify method, we can make the HTML code look better.

prettify.py

#!/usr/bin/python

from bs4 import BeautifulSoup
import requests as req

resp = req.get('http://webcode.me')

soup = BeautifulSoup(resp.text, 'lxml')

print(soup.prettify())

We prettify the HTML code of a simple web page.

$ ./prettify.py 
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8"/>
    <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
    <title>
      My html page
    </title>
  </head>
  <body>
    <p>
      Today is a beautiful day. We go swimming and fishing.
    </p>
    <p>
      Hello there. How are you?
    </p>
  </body>
</html>

BeautifulSoup scraping with built-in web server

We can also serve HTML pages with a simple built-in HTTP server.

$ mkdir public
$ cp index.html public/

We create a public directory and copy index.html there.

$ python -m http.server --directory public
Serving HTTP on 0.0.0.0 port 8000 (http://0.0.0.0:8000/) ...

Then we start the Python HTTP server.

scraping2.py

#!/usr/bin/python

from bs4 import BeautifulSoup
import requests as req

resp = req.get('http://localhost:8000/')

soup = BeautifulSoup(resp.text, 'lxml')

print(soup.title)
print(soup.body)

Now we get the document from the locally running server.

BeautifulSoup find elements by Id

With the find method we can find elements by various means, including the element id.

find_by_id.py

#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:

    contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    #print(soup.find('ul', attrs={ 'id' : 'mylist'}))
    print(soup.find('ul', id='mylist'))

The code example finds the ul tag that has the mylist id. The commented-out line is an alternative way of doing the same task.
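When no element matches, find returns None rather than raising an error, so the result is worth checking before use. A minimal sketch, assuming a shortened inline snippet, the built-in 'html.parser', and a made-up id named nolist:

```python
from bs4 import BeautifulSoup

html = '<ul id="mylist"><li>Solaris</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

found = soup.find('ul', id='mylist')
missing = soup.find('ul', id='nolist')  # hypothetical id, not in the document

print(found is not None)  # True
print(missing)            # None

if missing is None:
    print('no such element')
```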

BeautifulSoup find all tags

With the find_all method we can find all elements that meet some criteria.

find_all.py

#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:

    contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    for tag in soup.find_all('li'):
        print(f'{tag.name}: {tag.text}')

The code example finds and prints all li tags.

$ ./find_all.py 
li: Solaris
li: FreeBSD
li: Debian
li: NetBSD
li: Windows

The find_all method can take a list of elements to search for.

find_all2.py

#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:

    contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    tags = soup.find_all(['h2', 'p'])

    for tag in tags:
        print(' '.join(tag.text.split()))

The example finds all h2 and p elements and prints their text.
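Results from a list search come back in document order, not grouped by tag name. The sketch below illustrates this on a hypothetical inline snippet parsed with the built-in 'html.parser', and also uses the limit parameter to stop after a given number of matches.

```python
from bs4 import BeautifulSoup

html = '<h2>Operating systems</h2><p>FreeBSD</p><p>Debian</p>'
soup = BeautifulSoup(html, 'html.parser')

tags = soup.find_all(['h2', 'p'])
print([t.name for t in tags])  # ['h2', 'p', 'p']

# limit caps the number of returned matches.
first_two = soup.find_all(['h2', 'p'], limit=2)
print([t.text for t in first_two])  # ['Operating systems', 'FreeBSD']
```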

The find_all method can also take a function that determines which elements should be returned.

find_by_fun.py

#!/usr/bin/python

from bs4 import BeautifulSoup

def myfun(tag):

    return tag.is_empty_element


with open('index.html', 'r') as f:

    contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    tags = soup.find_all(myfun)
    print(tags)

The example prints empty elements.

$ ./find_by_fun.py
[<meta charset="utf-8"/>]

The only empty element in the document is meta.
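Any function that takes a tag and returns a truthy or falsy value can serve as a filter. As another sketch, the hypothetical has_id function below keeps only tags that carry an id attribute; the inline snippet and the 'html.parser' choice are our assumptions.

```python
from bs4 import BeautifulSoup


def has_id(tag):
    # Keep a tag only if it defines an id attribute.
    return tag.has_attr('id')


html = '<ul id="mylist"><li>Solaris</li></ul><p id="note">Text</p><p>More</p>'
soup = BeautifulSoup(html, 'html.parser')

tags = soup.find_all(has_id)
print([t.name for t in tags])  # ['ul', 'p']
```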

It is also possible to find elements by using regular expressions.

regex.py

#!/usr/bin/python

import re

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:

    contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    strings = soup.find_all(string=re.compile('BSD'))

    for txt in strings:

        print(' '.join(txt.split()))

The example prints the text strings in the document that contain the 'BSD' string.

$ ./regex.py
FreeBSD
NetBSD
FreeBSD is an advanced computer operating system used to power modern servers, desktops, and embedded platforms.
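A search with the string argument returns text nodes rather than tags. If the enclosing tag is needed, it can be reached through the parent attribute, as in this sketch on a hypothetical inline list parsed with the built-in 'html.parser'.

```python
import re

from bs4 import BeautifulSoup

html = '<ul><li>FreeBSD</li><li>Debian</li><li>NetBSD</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

matches = soup.find_all(string=re.compile('BSD'))

for s in matches:
    # s is a text node; s.parent is the tag that contains it.
    print(f'{s.parent.name}: {s}')
# li: FreeBSD
# li: NetBSD
```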

BeautifulSoup CSS selectors

With the select and select_one methods, we can use some CSS selectors to find elements.

select_nth_tag.py

#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:

    contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    print(soup.select('li:nth-of-type(3)'))

This example uses a CSS selector to print the HTML code of the third li element.

$ ./select_nth_tag.py
[<li>Debian</li>]

This is the third li element; select returns its matches in a list.

The # character is used in CSS to select tags by their id attributes.

select_by_id.py

#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:

    contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    print(soup.select_one('#mylist'))

The example prints the element that has the mylist id.
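The two methods differ in what they return: select gives a list of all matches, while select_one gives the first match or None. A short sketch, assuming a hypothetical inline list, the built-in 'html.parser', and a made-up id named nolist:

```python
from bs4 import BeautifulSoup

html = '<ul id="mylist"><li>Solaris</li><li>FreeBSD</li><li>Debian</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

# select returns every match as a list.
items = soup.select('#mylist > li')
print([li.text for li in items])  # ['Solaris', 'FreeBSD', 'Debian']

# select_one returns a single tag, or None when nothing matches.
third = soup.select_one('li:nth-of-type(3)')
print(third)                       # <li>Debian</li>
print(soup.select_one('#nolist'))  # None
```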

BeautifulSoup append element

The append method appends a new tag to the HTML document.

append_tag.py

#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:

    contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    newtag = soup.new_tag('li')
    newtag.string='OpenBSD'

    ultag = soup.ul

    ultag.append(newtag)

    print(ultag.prettify())

The example appends a new li tag.

newtag = soup.new_tag('li')
newtag.string='OpenBSD'

First, we create a new tag with the new_tag method.

ultag = soup.ul

We get a reference to the ul tag.

ultag.append(newtag)

We append the newly created tag to the ul tag.

print(ultag.prettify())

We print the ul tag in a neat format.

BeautifulSoup insert element

The insert method inserts a tag at the specified position.

insert_tag.py

#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:

    contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    newtag = soup.new_tag('li')
    newtag.string='OpenBSD'

    ultag = soup.ul

    ultag.insert(2, newtag)

    print(ultag.prettify())

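The example inserts a li tag at the third position among the child nodes of the ul tag. Whitespace between tags also counts as a child node, so the new tag lands right after the first li element.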
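Because insert counts whitespace text nodes, a numeric position can be hard to reason about. One alternative, sketched here on a hypothetical inline list parsed with the built-in 'html.parser', is to locate a neighbouring tag and call insert_after on it.

```python
from bs4 import BeautifulSoup

html = '<ul>\n<li>Solaris</li>\n<li>FreeBSD</li>\n<li>Debian</li>\n</ul>'
soup = BeautifulSoup(html, 'html.parser')

newtag = soup.new_tag('li')
newtag.string = 'OpenBSD'

# Place the new tag directly after the second li element.
second_li = soup.find_all('li')[1]
second_li.insert_after(newtag)

print([li.text for li in soup.find_all('li')])
# ['Solaris', 'FreeBSD', 'OpenBSD', 'Debian']
```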

BeautifulSoup replace text

The replace_with method replaces an element with another one; here we use it to replace the text of an element.

replace_text.py

#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:

    contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    tag = soup.find(string='Windows')
    tag.replace_with('OpenBSD')

    print(soup.ul.prettify())

The example finds a specific element with the find method and replaces its content with the replace_with method.
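The same replacement can be tried on a small inline list. This is only a sketch: the snippet is hypothetical and is parsed with the built-in 'html.parser'.

```python
from bs4 import BeautifulSoup

html = '<ul><li>Debian</li><li>Windows</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

# find with the string argument returns the text node itself.
text_node = soup.find(string='Windows')
text_node.replace_with('OpenBSD')

print(soup.ul)
# <ul><li>Debian</li><li>OpenBSD</li></ul>
```

Only the text node is swapped; the surrounding li tag stays in place.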

BeautifulSoup remove element

The decompose method removes a tag from the tree and destroys it.

decompose_tag.py

#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:

    contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    ptag2 = soup.select_one('p:nth-of-type(2)')

    ptag2.decompose()

    print(soup.body.prettify())

The example removes the second p element.
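If the removed tag is still needed afterwards, the extract method can be used instead: it detaches the tag from the tree and returns it. The sketch below contrasts the two on a hypothetical inline snippet parsed with the built-in 'html.parser'.

```python
from bs4 import BeautifulSoup

html = '<body><p>FreeBSD</p><p>Debian</p></body>'
soup = BeautifulSoup(html, 'html.parser')

first, second = soup.find_all('p')

# extract removes the tag from the tree and hands it back.
removed = first.extract()
print(removed)  # <p>FreeBSD</p>

# decompose removes the tag and destroys it; nothing is returned.
second.decompose()
print(soup.body)  # <body></body>
```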

Source

Python Beautiful Soup documentation

In this article we have worked with the Python BeautifulSoup library.

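Author

My name is Jan Bodnar, and I am a passionate programmer with extensive programming experience. I have been writing programming articles since 2007. To date, I have authored over 1,400 articles and 8 e-books. I have more than ten years of experience in teaching programming.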
