Python lxml

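last modified January 29, 2024

In this article we show how to parse and generate XML and HTML data in Python with the lxml library.

The lxml library provides Python bindings for the C libraries libxml2 and libxslt.

The following file is used in the examples.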

words.html

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Words</title>
</head>
<body>

<ul>
    <li>sky</li>
    <li>cup</li>
    <li>water</li>
    <li>cloud</li>
    <li>bear</li>
    <li>wolf</li>
</ul>

<div id="output">
    ...
</div>

</body>
</html>

This is a simple HTML document.

Python lxml iterate tags

In the first example, we iterate over the tags of the document.

tags.py

#!/usr/bin/python

from lxml import html

fname = 'words.html'
tree = html.parse(fname)

for e in tree.iter():
    print(e.tag)

The program prints the tag name of every element in the document.

from lxml import html

We import the html module.

fname = 'words.html'
tree = html.parse(fname)

We parse the document from the given file with parse, which returns an ElementTree object.

for e in tree.iter():
    print(e.tag)

We iterate over the elements with iter, which walks the tree in document order.

$ ./tags.py
html
head
meta
title
body
ul
li
li
li
li
li
li
div
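
The iter method also accepts a tag name, which limits the iteration to matching elements. The following is a minimal sketch of that variant; it assumes an inline HTML string (a shortened stand-in for words.html) so that it runs without a file on disk.

```python
from lxml import html

# Inline stand-in for words.html (an assumption, so the sketch is self-contained)
content = """<html><head><title>Words</title></head>
<body><ul><li>sky</li><li>cup</li><li>water</li></ul></body></html>"""

doc = html.fromstring(content)

# iter('li') yields only the li elements, in document order
words = [e.text for e in doc.iter('li')]
print(words)
```

Filtering inside iter avoids checking e.tag by hand in the loop body.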

Python lxml root element

The root element is retrieved with getroot.

root.py

#!/usr/bin/python

from lxml import html
import re

fname = 'words.html'

tree = html.parse(fname)
root = tree.getroot()

print(root.tag)

print('----------------')

print(root.head.tag)
print(root.head.text_content().strip())

print('----------------')

print(root.body.tag)
print(re.sub(r'\s+', ' ', root.body.text_content()).strip())

In the program, we get the root element. We print the head and body tags and their text content.

tree = html.parse(fname)
root = tree.getroot()

From the document tree, we get the root with the getroot method.

print(root.tag)

We print the tag name of the root element (html).

print(root.head.tag)
print(root.head.text_content().strip())

We print the head tag and its text content.

print(root.body.tag)
print(re.sub(r'\s+', ' ', root.body.text_content()).strip())

Similarly, we print the body tag and its text content. To collapse runs of whitespace into single spaces, we use a regular expression.

$ ./root.py
html
----------------
head
Words
----------------
body
sky cup water cloud bear wolf ...
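
As an alternative to the regular expression, whitespace can also be normalized with str.split and str.join. The sketch below is only an illustration under the assumption of an inline, shortened copy of words.html; the variable names content and body_text are made up for this example.

```python
from lxml import html

# Shortened inline copy of words.html (an assumption for a self-contained sketch)
content = """<html><head><title>Words</title></head>
<body>
<ul>
    <li>sky</li>
    <li>cup</li>
</ul>
<div id="output">
    ...
</div>
</body></html>"""

root = html.fromstring(content)

print(root.head.text_content().strip())

# split() without arguments splits on any run of whitespace and drops
# empty strings, so joining with a single space normalizes the text
body_text = ' '.join(root.body.text_content().split())
print(body_text)
```

Both approaches give the same result here; the split and join form simply avoids importing re.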

Python lxml create document

The lxml module allows us to create HTML documents.

create_doc.py

#!/usr/bin/python

from lxml import etree

root = etree.Element('html', lang='en')

head = etree.SubElement(root, 'head')
title = etree.SubElement(head, 'title')
title.text = 'HTML document'
body = etree.SubElement(root, 'body')

p = etree.SubElement(body, 'p')
p.text = 'A simple HTML document'

with open('new.html', 'wb') as f:
    f.write(etree.tostring(root, pretty_print=True))

We use the etree module to generate the document.

root = etree.Element('html', lang='en')

We create the root element.

head = etree.SubElement(root, 'head')
title = etree.SubElement(head, 'title')

We create the head element inside the root element and the title element inside head.

title.text = 'HTML document'

We insert text through the text attribute.

with open('new.html', 'wb') as f:
    f.write(etree.tostring(root, pretty_print=True))

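Finally, we write the document to a file. The tostring function returns bytes, so the file is opened in binary mode.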
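
By default, etree.tostring serializes with XML rules. When the result is meant to be read as HTML, the method and doctype parameters of tostring may be useful. The following is a minimal sketch of that option, not a required step; it prints the result instead of writing new.html, and the variable name data is chosen for this example.

```python
from lxml import etree

root = etree.Element('html', lang='en')

head = etree.SubElement(root, 'head')
title = etree.SubElement(head, 'title')
title.text = 'HTML document'
body = etree.SubElement(root, 'body')

p = etree.SubElement(body, 'p')
p.text = 'A simple HTML document'

# method='html' applies HTML serialization rules;
# doctype places the given declaration before the root element
data = etree.tostring(root, pretty_print=True, method='html',
                      doctype='<!DOCTYPE html>')

print(data.decode('utf-8'))
```

The returned value is still a bytes object, so it can be written to a file opened with 'wb' in the same way as in create_doc.py.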

Python lxml findall

The findall method returns all elements that match the given path.

find_all.py

#!/usr/bin/python

from lxml import html

fname = 'words.html'

root = html.parse(fname)
els = root.findall('body/ul/li')

for e in els:
    print(e.text)

The program finds all li tags and prints their content.

els = root.findall('body/ul/li')

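We find all elements with findall. We pass the exact path to the elements; the path is evaluated relative to the root html element of the parsed tree.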

for e in els:
    print(e.text)

We iterate over the tags and print their text content.

$ ./find_all.py
sky
cup
water
cloud
bear
wolf
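
An exact path only matches elements at that position in the tree. To match elements at any depth, findall accepts the './/' prefix, and the xpath method offers full XPath expressions. The sketch below compares the three forms; the inline document with an extra ol list is an assumption made up for this illustration.

```python
from lxml import html

# Made-up document with li elements in two different lists
content = """<html><body>
<ul><li>sky</li><li>cup</li></ul>
<ol><li>water</li></ol>
</body></html>"""

root = html.fromstring(content)

# exact path: only li elements directly under body/ul
exact = [e.text for e in root.findall('body/ul/li')]

# './/li' matches li elements anywhere below the current element
anywhere = [e.text for e in root.findall('.//li')]

# xpath can select the text nodes directly
via_xpath = root.xpath('//li/text()')

print(exact)
print(anywhere)
print(via_xpath)
```

The exact path is stricter and breaks if the page layout changes; the descendant form is more tolerant but may pick up unrelated elements.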

Python lxml find by id

A particular element can be found with get_element_by_id.

find_by_id.py

#!/usr/bin/python

from lxml import html

fname = 'words.html'

tree = html.parse(fname)
root = tree.getroot()

e = root.get_element_by_id('output')
print(e.tag)
print(e.text.strip())

The program finds the div element by its id and prints its tag name and text content.

$ ./find_by_id.py
div
...
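
When no element has the requested id, get_element_by_id raises a KeyError unless a default value is passed as the second argument. The following is a minimal sketch of both cases; the inline document and the id 'no-such-id' are assumptions made for the example.

```python
from lxml import html

# Inline fragment of words.html (an assumption for a self-contained sketch)
content = """<html><body>
<div id="output">
    ...
</div>
</body></html>"""

root = html.fromstring(content)

found = root.get_element_by_id('output')
print(found.tag, found.text.strip())

# With a default, a missing id returns the default instead of raising
missing = root.get_element_by_id('no-such-id', None)
print(missing)

# Without a default, a missing id raises KeyError
try:
    root.get_element_by_id('no-such-id')
except KeyError as err:
    print('not found:', err)
```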

Python lxml web scraping

The lxml module can be used for web scraping.

scrape.py

#!/usr/bin/python

import urllib3
import re
from lxml import html

http = urllib3.PoolManager()

url = 'http://webcode.me/countries.html'
resp = http.request('GET', url)

content = resp.data.decode('utf-8')
doc = html.fromstring(content)

els = doc.findall('body/table/tbody/tr')

for e in els[:10]:
    row = e.text_content().strip()
    row2 = re.sub(r'\s+', ' ', row)
    print(row2)

The program fetches an HTML document that contains a list of the most populous countries. It prints the first ten countries from the table.

import urllib3

To fetch the web page, we use the urllib3 library.

http = urllib3.PoolManager()

url = 'http://webcode.me/countries.html'
resp = http.request('GET', url)

We generate a GET request to the resource.

content = resp.data.decode('utf-8')
doc = html.fromstring(content)

We decode the content and parse the document.

els = doc.findall('body/table/tbody/tr')

We find all tr tags, which contain the data.

for e in els[:10]:
    row = e.text_content().strip()
    row2 = re.sub(r'\s+', ' ', row)
    print(row2)

We iterate over the list of rows and print the first ten.

$ ./scrape.py
1 China 1382050000
2 India 1313210000
3 USA 324666000
4 Indonesia 260581000
5 Brazil 207221000
6 Pakistan 196626000
7 Nigeria 186988000
8 Bangladesh 162099000
9 Russia 146838000
10 Japan 126830000
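
Instead of flattening each row into one string, the cells can be read individually and converted to typed values. The sketch below works on an inline table so that it runs without network access; the table markup (three td cells per row) is an assumption about the page structure, and only the two sample rows are taken from the output above.

```python
from lxml import html

# Inline stand-in for the fetched page; the markup is assumed, not verified
content = """<html><body>
<table>
  <tbody>
    <tr><td>1</td><td>China</td><td>1382050000</td></tr>
    <tr><td>2</td><td>India</td><td>1313210000</td></tr>
  </tbody>
</table>
</body></html>"""

doc = html.fromstring(content)

rows = []

# './/tr' matches rows whether or not the source contains a tbody element
for tr in doc.findall('.//tr'):
    cells = [td.text_content().strip() for td in tr.findall('td')]

    # skip rows without exactly three data cells, such as header rows
    if len(cells) != 3:
        continue

    rank, name, population = cells
    rows.append((int(rank), name, int(population)))

for rank, name, population in rows:
    print(rank, name, population)
```

Keeping the values as tuples makes later steps such as sorting or summing populations straightforward.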

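Source

lxml - XML and HTML with Python

In this article we have processed XML/HTML data in Python with lxml.

Author

My name is Jan Bodnar and I am a passionate programmer with many years of programming experience. I have been writing programming articles since 2007. So far, I have written over 1,400 articles and 8 e-books. I have over ten years of experience in teaching programming.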
