Python XML with SAX

Last modified: February 15, 2025

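In this article, we show how to use SAX (Simple API for XML) for event-driven XML parsing in Python. SAX parses XML documents in a memory-efficient way, which makes it suitable for large files. Unlike DOM (Document Object Model), SAX does not load the entire XML document into memory. Instead, it processes the document sequentially and triggers events as it encounters elements and text; an element's attributes are delivered together with its start event.

The xml.sax module is part of the Python standard library, so no additional installation is needed.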

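Basic SAX parsing

The following example demonstrates how to parse an XML document with SAX. We create a custom handler class to handle events such as the start of an element, the end of an element, and character data.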

main.py

import xml.sax
from io import StringIO

class MyHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()  # let ContentHandler initialize its own state
        self.current_element = ""
        self.current_data = ""

    # Called when an element starts
    def startElement(self, tag, attributes):
        self.current_element = tag
        if tag == "book":
            print("Book Id:", attributes["id"])

    # Called when an element ends
    def endElement(self, tag):
        # Strip once here; characters() may deliver text in several chunks
        text = self.current_data.strip()
        if tag == "title":
            print("Title:", text)
        elif tag == "author":
            print("Author:", text)
        elif tag == "year":
            print("Year:", text)
        self.current_element = ""
        self.current_data = ""

    # Called when character data is found
    def characters(self, content):
        if self.current_element in ["title", "author", "year"]:
            self.current_data += content

# XML data
xml_data = """
<catalog>
    <book id="1">
        <title>The Great Gatsby</title>
        <author>F. Scott Fitzgerald</author>
        <year>1925</year>
    </book>
    <book id="2">
        <title>1984</title>
        <author>George Orwell</author>
        <year>1949</year>
    </book>
</catalog>
"""

# Create a SAX parser
parser = xml.sax.make_parser()
handler = MyHandler()
parser.setContentHandler(handler)

# Parse the XML data
parser.parse(StringIO(xml_data))

In this program, the MyHandler class inherits from xml.sax.ContentHandler and overrides the startElement, endElement, and characters methods to handle XML events.

parser.parse(StringIO(xml_data))

StringIO creates an in-memory file-like object from the xml_data string. This lets the parser.parse method read the XML data as if it were reading from a file.

$ python main.py
Book Id: 1
Title: The Great Gatsby
Author: F. Scott Fitzgerald
Year: 1925
Book Id: 2
Title: 1984
Author: George Orwell
Year: 1949
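
As an alternative to wrapping the string in StringIO, the xml.sax module also provides the parseString function, which takes the document and a handler directly. The following is a minimal sketch of that approach; the TitleHandler class and its attributes are hypothetical names introduced for illustration. It also collects text chunks in a list and strips the joined result once, because characters may be called more than once for a single text node.

```python
import xml.sax


class TitleHandler(xml.sax.ContentHandler):
    """Hypothetical handler that collects the text of every <title>."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._chunks = []

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self._chunks = []

    def characters(self, content):
        # One text node may arrive in several pieces, so only accumulate here
        if self._in_title:
            self._chunks.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._chunks).strip())
            self._in_title = False


xml_data = """<catalog>
    <book id="1"><title>The Great Gatsby</title></book>
    <book id="2"><title>1984</title></book>
</catalog>"""

handler = TitleHandler()
# parseString takes the document itself, so no file-like wrapper is needed
xml.sax.parseString(xml_data.encode("utf-8"), handler)
print(handler.titles)
```

Storing the results on the handler instead of printing them inside the callbacks makes the parsed data available to the rest of the program after parsing finishes.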

Handling attributes

The following example demonstrates how to handle the attributes of XML elements with SAX.

main.py

import xml.sax
from io import StringIO


class MyHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()  # let ContentHandler initialize its own state
        self.current_element = ""

    # Called when an element starts
    def startElement(self, tag, attributes):
        self.current_element = tag
        if tag == "book":
            print("Book Id:", attributes["id"])
            print("Category:", attributes["category"])

    # Called when an element ends
    def endElement(self, tag):
        pass

    # Called when character data is found
    def characters(self, content):
        pass


# XML data
xml_data = """
<catalog>
    <book id="1" category="fiction">
        <title>The Great Gatsby</title>
        <author>F. Scott Fitzgerald</author>
        <year>1925</year>
    </book>
    <book id="2" category="dystopian">
        <title>1984</title>
        <author>George Orwell</author>
        <year>1949</year>
    </book>
    <book id="3" category="fiction">
        <title>War and Peace</title>
        <author>Leo Tolstoy</author>
        <year>1869</year>
    </book>
</catalog>
"""

# Create a SAX parser
parser = xml.sax.make_parser()
handler = MyHandler()
parser.setContentHandler(handler)

# Parse the XML data
parser.parse(StringIO(xml_data))

In this program, the startElement method handles the attributes of the book element, such as id and category.

$ python main.py
Book Id: 1
Category: fiction
Book Id: 2
Category: dystopian
Book Id: 3
Category: fiction
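
Indexing the attributes object with a missing name raises a KeyError, so a document in which some book elements lack a category would stop the program above. The sketch below shows one way to deal with optional attributes, using the get method of the attributes object with a fallback value. The BookAttrHandler class, the "unknown" default, and the sample data are assumptions made for this illustration.

```python
import xml.sax


class BookAttrHandler(xml.sax.ContentHandler):
    """Hypothetical handler that records (id, category) for each book."""

    def __init__(self):
        super().__init__()
        self.books = []

    def startElement(self, name, attrs):
        if name == "book":
            book_id = attrs.get("id")
            # Fall back to a placeholder when the attribute is absent
            category = attrs.get("category", "unknown")
            self.books.append((book_id, category))


xml_data = """<catalog>
    <book id="1" category="fiction"/>
    <book id="2"/>
</catalog>"""

handler = BookAttrHandler()
xml.sax.parseString(xml_data.encode("utf-8"), handler)

for book_id, category in handler.books:
    print("Book Id:", book_id, "Category:", category)
```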

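Parsing XML files

The following example demonstrates how to parse an XML file with SAX. This approach is memory efficient because it processes the file sequentially without loading it entirely into memory.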

products.xml

<products>
    <product>
        <id>1</id>
        <name>Product 1</name>
        <price>10.99</price>
        <quantity>30</quantity>
    </product>
    <product>
        <id>2</id>
        <name>Product 2</name>
        <price>20.99</price>
        <quantity>130</quantity>
    </product>
    <product>
        <id>4</id>
        <name>Product 4</name>
        <price>24.59</price>
        <quantity>350</quantity>
    </product>
    <product>
        <id>5</id>
        <name>Product 5</name>
        <price>9.9</price>
        <quantity>650</quantity>
    </product>
    <product>
        <id>6</id>
        <name>Product 6</name>
        <price>45</price>
        <quantity>290</quantity>
    </product>
</products>

This is the products.xml file.

main.py

from xml.sax import make_parser, ContentHandler

class ProductHandler(ContentHandler):
  def __init__(self):
    super().__init__()
    self.current_data = ""
    self.product = {}

  def startElement(self, name, attrs):
    # Reset the text buffer for every new element
    self.current_data = ""
    if name == "product":
      self.product = {}

  def characters(self, content):
    # May be called several times per text node, so only accumulate here
    self.current_data += content

  def endElement(self, name):
    if name == "product":
      print(f"Id: {self.product['id']}, Name: {self.product['name']}")
    elif name != "products":
      self.product[name] = self.current_data.strip()

parser = make_parser()
parser.setContentHandler(ProductHandler())
parser.parse("products.xml")

In this program, the parser.parse method parses the XML file named products.xml. The SAX parser processes the file sequentially, which makes it suitable for large files.

$ python main.py
Id: 1, Name: Product 1
Id: 2, Name: Product 2
Id: 4, Name: Product 4
Id: 5, Name: Product 5
Id: 6, Name: Product 6
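
To make the sequential processing more visible, the sketch below feeds the document to the parser in small pieces with the feed and close methods of the incremental parser interface. It assumes the default expat-based parser that make_parser returns in CPython. The ProductCollector class, the 16-byte chunk size, and the shortened sample data are hypothetical choices for illustration; in practice the chunks would typically come from a file or a network stream.

```python
import xml.sax


class ProductCollector(xml.sax.ContentHandler):
    """Hypothetical handler that gathers each product into a dict."""

    def __init__(self):
        super().__init__()
        self.products = []
        self._current = None
        self._chunks = []

    def startElement(self, name, attrs):
        self._chunks = []
        if name == "product":
            self._current = {}

    def characters(self, content):
        self._chunks.append(content)

    def endElement(self, name):
        if name == "product":
            self.products.append(self._current)
            self._current = None
        elif self._current is not None:
            # Join the pieces of the text node and strip once
            self._current[name] = "".join(self._chunks).strip()


xml_data = (
    b"<products>"
    b"<product><id>1</id><name>Product 1</name><price>10.99</price></product>"
    b"<product><id>2</id><name>Product 2</name><price>20.99</price></product>"
    b"</products>"
)

handler = ProductCollector()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)

# Feed the document piece by piece; events fire as soon as enough data arrives
chunk_size = 16  # deliberately tiny so that elements span several chunks
for start in range(0, len(xml_data), chunk_size):
    parser.feed(xml_data[start:start + chunk_size])
parser.close()  # signal the end of the document

for product in handler.products:
    print(product["id"], product["name"], product["price"])
```

Because element text can be split at any chunk boundary, the handler joins the collected pieces in endElement instead of treating each characters call as a complete value.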

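Source

Python SAX - documentation

In this article, we have shown how to use the SAX API for event-driven XML parsing in Python. The SAX approach is memory efficient and well suited to large XML files.

Author

My name is Jan Bodnar, and I am a passionate programmer with extensive programming experience. I have been writing programming articles since 2007. To date, I have authored over 1,400 articles and 8 e-books. I have more than ten years of experience in teaching programming.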