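Python regular expressions

Last modified January 29, 2024

This Python regular expressions tutorial shows how to use regular expressions in Python. In Python, we work with regular expressions through the re module.

Regular expressions are used for text searching and more advanced text manipulation. They are built into tools such as grep and sed, text editors such as vi and emacs, and programming languages such as Tcl, Perl, and Python.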

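Python re module

In Python, the re module provides regular expression matching operations.

A pattern is a regular expression that defines the text we are searching for or manipulating. It consists of text literals and metacharacters. The pattern is compiled with the compile function. Because regular expressions often contain special characters, it is recommended to use raw strings. (Raw strings are prefixed with the r character.) This way, backslash sequences are not interpreted by Python before the string is compiled into a pattern.

After we have compiled a pattern, we can apply it to a text string with one of the functions. These include match, search, findall, and finditer.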
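To see why raw strings are recommended, the following minimal sketch compares a regular string literal with a raw string. The sample text and the variable names are illustrative assumptions, not part of the tutorial's own examples.

```python
import re

# In a regular string literal, '\b' is a single backspace character.
# In a raw string, r'\b' is a backslash followed by 'b' (two characters),
# which the regex engine reads as a word boundary.
plain = '\bcat\b'
raw = r'\bcat\b'

text = 'a cat sat'

# Looks for backspace + 'cat' + backspace, which the text does not contain.
plain_hit = re.search(plain, text)

# Compiles the raw string and looks for the whole word 'cat'.
raw_hit = re.compile(raw).search(text)

print(len(plain), len(raw))
print(plain_hit)
print(raw_hit.group())
```

The regular string silently turns into a different pattern, which is the kind of surprise raw strings are meant to avoid.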

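Regular expressions

The following table shows some basic regular expressions:

Regex	Meaning
`.`	Matches any single character (by default, any character except a newline).
`?`	Matches the preceding element once or not at all.
`+`	Matches the preceding element one or more times.
`*`	Matches the preceding element zero or more times.
`^`	Matches the starting position within the string.
`$`	Matches the ending position within the string.
`|`	Alternation operator.
`[abc]`	Matches a, b, or c.
`[a-c]`	Range; matches a, b, or c.
`[^abc]`	Negation; matches any character except a, b, or c.
`\s`	Matches a whitespace character.
`\w`	Matches a word character; equivalent to `[a-zA-Z_0-9]` for ASCII text (for str patterns, Python 3 also matches Unicode word characters unless re.ASCII is set).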
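The three quantifiers in the table are easy to mix up, so here is a small sketch that applies each of them to the same samples. The sample strings and the results dictionary are assumptions chosen for illustration.

```python
import re

# Illustrative samples with zero, one, and two 'a' characters.
samples = ('ct', 'cat', 'caat')

results = {}
for regex in (r'ca?t', r'ca+t', r'ca*t'):
    # fullmatch requires the whole sample to match the pattern.
    results[regex] = [s for s in samples if re.fullmatch(regex, s)]
    print(f'{regex}: {results[regex]}')
```

Here ? accepts zero or one "a", + accepts one or more, and * accepts any number, including none.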

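Regex functions

We look for matches with the regex functions.

Function	Description
`match`	Determines whether the RE matches at the beginning of the string.
`fullmatch`	Determines whether the RE matches the whole string.
`search`	Scans through the string, looking for the first location where the RE matches.
`findall`	Finds all substrings where the RE matches and returns them as a list.
`finditer`	Finds all substrings where the RE matches and returns them as an iterator of match objects.
`split`	Splits the string by the RE pattern.

The match, fullmatch, and search functions return a match object if they succeed. Otherwise, they return None.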
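The split function is listed in the table but not demonstrated elsewhere, so the sketch below shows it together with the match object and None return values. The sample line and the patterns are illustrative assumptions.

```python
import re

# Illustrative input with mixed separators.
line = 'apple, banana;cherry  date'

# Split on a comma or semicolon with optional surrounding whitespace,
# or on a run of whitespace.
parts = re.split(r'\s*[,;]\s*|\s+', line)
print(parts)

# search returns a match object on success ...
hit = re.search(r'an+', line)
# ... and None when nothing matches (the line contains no digit).
miss = re.search(r'\d', line)

if hit:
    print(hit.group(), hit.span())

if miss is None:
    print('no digit found')
```

Because a failed match is None, testing the result in an if statement is the usual way to guard against calling methods on a missing match.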

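The match function

The match function returns a match object if zero or more characters at the beginning of the string match the regular expression pattern.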

match_fun.py

#!/usr/bin/python

import re

words = ('book', 'bookworm', 'Bible', 
    'bookish','cookbook', 'bookstore', 'pocketbook')

pattern = re.compile(r'book')

for word in words:

    if re.match(pattern, word):
        print(f'The {word} matches')

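In this example, we have a tuple of words. The compiled pattern is tested against the beginning of each word, looking for the "book" string.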

pattern = re.compile(r'book')

With the compile function, we create a pattern. The regular expression is a raw string and consists of four ordinary characters.

for word in words:

    if re.match(pattern, word):
        print(f'The {word} matches')

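We go through the tuple and call the match function, which applies the pattern to each word. The match function returns a match object if there is a match at the beginning of the string, and None otherwise.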

$ ./match_fun.py 
The book matches 
The bookworm matches 
The bookish matches 
The bookstore matches

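Four of the words in the tuple match the pattern. Note that words that do not start with "book" do not match. To include these words, we use the search function.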

The fullmatch function

The fullmatch function looks for an exact match: the pattern must match the whole string.

fullmatch_fun.py

#!/usr/bin/python

import re

words = ('book', 'bookworm', 'Bible', 
    'bookish','cookbook', 'bookstore', 'pocketbook')

pattern = re.compile(r'book')

for word in words:

    if re.fullmatch(pattern, word):
        print(f'The {word} matches')

In this example, we use the fullmatch function to look for the exact "book" term.

$ ./fullmatch_fun.py 
The book matches

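There is only one match.

The search function

The search function looks for the first location where the regular expression pattern produces a match.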

search_fun.py

#!/usr/bin/python

import re

words = ('book', 'bookworm', 'Bible', 
    'bookish','cookbook', 'bookstore', 'pocketbook')

pattern = re.compile(r'book')

for word in words:

    if re.search(pattern, word):
        print(f'The {word} matches')

In this example, we use the search function to look for the "book" term.

$ ./search_fun.py 
The book matches 
The bookworm matches 
The bookish matches 
The cookbook matches 
The bookstore matches 
The pocketbook matches

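This time, the words cookbook and pocketbook are included as well.

The dot metacharacter

The dot (.) metacharacter stands for any single character in the text (by default, any character except a newline).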

dot_meta.py

#!/usr/bin/python

import re

words = ('seven', 'even', 'prevent', 'revenge', 'maven', 
    'eleven', 'amen', 'event')

pattern = re.compile(r'.even')

for word in words:
    if re.match(pattern, word):
        print(f'The {word} matches')

In this example, we have a tuple of eight words. We apply a pattern containing the dot metacharacter to each word.

pattern = re.compile(r'.even')

The dot stands for any single character in the text. The character must be present.

$ ./dot_meta.py 
The seven matches 
The revenge matches

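Two words match the pattern: seven and revenge.

The question mark metacharacter

The question mark (?) metacharacter is a quantifier that matches the preceding element zero or one time.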

question_mark_meta.py

#!/usr/bin/python

import re

words = ('seven', 'even','prevent', 'revenge', 'maven', 
    'eleven', 'amen', 'event')

pattern = re.compile(r'.?even')

for word in words:

    if re.match(pattern, word):
        print(f'The {word} matches')

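In this example, we add a question mark after the dot character. This means that the pattern allows either one arbitrary character or no character at that position.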

$ ./question_mark_meta.py 
The seven matches 
The even matches 
The revenge matches 
The event matches

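This time, in addition to seven and revenge, the words even and event match as well.

Anchors

Anchors match positions within a given text rather than characters. With the ^ anchor, the match must occur at the beginning of the string; with the $ anchor, the match must occur at the end of the string.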

anchors.py

#!/usr/bin/python

import re

sentences = ('I am looking for Jane.',
    'Jane was walking along the river.',
    'Kate and Jane are close friends.')

pattern = re.compile(r'^Jane')

for sentence in sentences:
    
    if re.search(pattern, sentence):
        print(sentence)

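In this example, we have three sentences. The search pattern is ^Jane. The pattern checks whether the "Jane" string is located at the beginning of the text. The Jane\.$ pattern would look for "Jane." at the end of a sentence.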
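The end-of-string case can be sketched in the same way. The snippet below reuses the three sentences and adds a second pattern, Jane\.$, where the dot is escaped so that it matches a literal period; the list names are assumptions for illustration.

```python
import re

sentences = ('I am looking for Jane.',
    'Jane was walking along the river.',
    'Kate and Jane are close friends.')

# ^ anchors the match at the beginning of the string.
start = re.compile(r'^Jane')
# \. is a literal period and $ anchors the match at the end of the string.
end = re.compile(r'Jane\.$')

starts_with = [s for s in sentences if start.search(s)]
ends_with = [s for s in sentences if end.search(s)]

print(starts_with)
print(ends_with)
```

Only the second sentence starts with "Jane", and only the first one ends with "Jane.".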

Exact match

An exact match can be performed with the fullmatch function or by placing the term between the ^ and $ anchors.

exact_match.py

#!/usr/bin/python

import re

words = ('book', 'bookworm', 'Bible', 
    'bookish','cookbook', 'bookstore', 'pocketbook')

pattern = re.compile(r'^book$')

for word in words:

    if re.search(pattern, word):
        print(f'The {word} matches')

In this example, we look for an exact match of the "book" term.

$ ./exact_match.py 
The book matches

Character classes

A character class defines a set of characters, any one of which can occur in the input string for the match to succeed.

character_class.py

#!/usr/bin/python

import re

words = ('a gray bird', 'grey hair', 'great look')

pattern = re.compile(r'gr[ea]y')

for word in words:

    if re.search(pattern, word):
        print(f'{word} matches')

In this example, we use a character class to include both the gray and grey words.

pattern = re.compile(r'gr[ea]y')

The [ea] class allows either the "e" or the "a" character at that position in the pattern.
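Character classes also support ranges and negation, as listed in the table of basic regular expressions. The following sketch, with made-up sample words, shows both forms side by side.

```python
import re

# Illustrative samples; only the first character differs.
words = ('cat', 'bat', 'rat', 'hat', '9at')

# [a-c] is a range: the first character must be a, b, or c.
range_hits = [w for w in words if re.fullmatch(r'[a-c]at', w)]

# [^a-c] is a negated class: the first character must not be a, b, or c.
negated_hits = [w for w in words if re.fullmatch(r'[^a-c]at', w)]

print(range_hits)
print(negated_hits)
```

Note that a negated class still consumes one character, so it matches the digit in 9at as well.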

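Named character classes

There are some predefined character classes. The \s matches a whitespace character [ \t\n\r\f\v], the \d a digit [0-9], and the \w a word character [a-zA-Z0-9_]. For str patterns in Python 3, these classes also match the corresponding Unicode characters unless the re.ASCII flag is used.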

named_character_class.py

#!/usr/bin/python

import re

text = 'We met in 2013. She must be now about 27 years old.'

pattern = re.compile(r'\d+')

found = re.findall(pattern, text)

if found:
    print(f'There are {len(found)} numbers')

In this example, we count the numbers in the text.

pattern = re.compile(r'\d+')

The \d+ pattern looks for sequences of one or more digits in the text.

found = re.findall(pattern, text)

With the findall function, we look up all numbers in the text.

$ ./named_character_class.py
There are 2 numbers
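The bracketed equivalents given for the named classes hold for ASCII text. As a small sketch of the difference, the snippet below runs \w+ with and without the re.ASCII flag on a sample string that contains accented letters; the sample text is an assumption, written with escape sequences so that the characters are unambiguous.

```python
import re

# 'na\u00efve caf\u00e9 42' is 'naive cafe 42' with accented i and e.
text = 'na\u00efve caf\u00e9 42'

# Default for str patterns: \w also matches Unicode letters and digits.
unicode_words = re.findall(r'\w+', text)

# With re.ASCII, \w is limited to [a-zA-Z0-9_].
ascii_words = re.findall(r'\w+', text, re.ASCII)

print(unicode_words)
print(ascii_words)
```

With the flag set, the accented letters act as separators, so the words are cut into smaller pieces.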

Case-insensitive match

By default, pattern matching is case sensitive. By passing re.IGNORECASE to the compile function, we can make it case insensitive.

case_insensitive.py

#!/usr/bin/python

import re

words = ('dog', 'Dog', 'DOG', 'Doggy')

pattern = re.compile(r'dog', re.IGNORECASE)

for word in words:
    if re.match(pattern, word):
        print(f'{word} matches')

In this example, we apply the pattern to the words regardless of case.

$ ./case_insensitive.py 
dog matches
Dog matches
DOG matches
Doggy matches

All four words match the pattern.
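For comparison, here is a brief sketch that runs the same words with and without the flag. It passes the flag directly to re.match instead of compile, which the module-level functions also accept; the list names are assumptions.

```python
import re

words = ('dog', 'Dog', 'DOG', 'Doggy')

# Default behaviour: matching is case sensitive.
sensitive = [w for w in words if re.match(r'dog', w)]

# With re.IGNORECASE, letter case is ignored.
insensitive = [w for w in words if re.match(r'dog', w, re.IGNORECASE)]

print(sensitive)
print(insensitive)
```

Without the flag, only the lowercase dog matches.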

Alternation

The alternation operator | creates a regular expression with several choices.

alternations.py

#!/usr/bin/python

import re

words = ("Jane", "Thomas", "Robert",
    "Lucy", "Beky", "John", "Peter", "Andy")

pattern = re.compile(r'Jane|Beky|Robert')

for word in words:
    
    if re.match(pattern, word):
        print(word)

There are eight names in the tuple.

pattern = re.compile(r'Jane|Beky|Robert')

This regular expression looks for the "Jane", "Beky", or "Robert" strings.
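Since match only anchors the pattern at the beginning of the string, a longer name that starts with one of the alternatives also matches. The sketch below adds a hypothetical extra name, Janet, to show the difference between match and fullmatch with the same alternation.

```python
import re

# "Janet" is a hypothetical addition used to show prefix matching.
names = ("Jane", "Thomas", "Robert", "Lucy", "Beky",
    "John", "Peter", "Andy", "Janet")

pattern = re.compile(r'Jane|Beky|Robert')

# match accepts any name that starts with one of the alternatives.
prefix_hits = [n for n in names if pattern.match(n)]

# fullmatch accepts only names that equal one of the alternatives.
exact_hits = [n for n in names if pattern.fullmatch(n)]

print(prefix_hits)
print(exact_hits)
```

Which of the two is appropriate depends on whether prefixes should count as matches.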

The finditer function

The finditer function returns an iterator yielding match objects for all non-overlapping matches of the pattern in a string.

finditer_fun.py

#!/usr/bin/python

import re

text = 'I saw a fox in the wood. The fox had red fur.'

pattern = re.compile(r'fox')

found = re.finditer(pattern, text)

for item in found:

    s = item.start()
    e = item.end()
    print(f'Found {text[s:e]} at {s}:{e}')

In this example, we search for the "fox" term in the text. We go over the iterator of found matches and print them together with their indexes.

s = item.start()
e = item.end()

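The start and end methods of the match object return the starting and ending index, respectively. The ending index is exclusive.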

$ ./finditer_fun.py 
Found fox at 8:11
Found fox at 29:32
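Match objects also offer the group and span methods, which return the matched text and the (start, end) pair directly. The following sketch is an alternative formulation of the same loop, not a replacement for it.

```python
import re

text = 'I saw a fox in the wood. The fox had red fur.'

pattern = re.compile(r'fox')

# group() returns the matched text, span() returns (start, end).
found = [(m.group(), m.span()) for m in pattern.finditer(text)]

for word, (s, e) in found:
    print(f'Found {word} at {s}:{e}')
```

This avoids slicing the text manually with the start and end indexes.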

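Capturing groups

Capturing groups are a way to treat multiple characters as a single unit. They are created by placing characters inside a pair of round brackets. For instance, (book) is a single group containing the "b", "o", "o", "k" characters.

The capturing groups technique allows us to find out which parts of a string match the regular expression pattern.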

capturing_groups.py

#!/usr/bin/python

import re

content = '''<p>The <code>Pattern</code> is a compiled
representation of a regular expression.</p>'''

pattern = re.compile(r'(</?[a-z]*>)')

found = re.findall(pattern, content)

for tag in found:
    print(tag)

The code example prints all HTML tags from the supplied string by capturing a group of characters.

found = re.findall(pattern, content)

To find all tags, we use the findall function.

$ ./capturing_groups.py 
<p>
<code>
</code>
</p>

We have found four HTML tags.

Python regex email example

In the following example, we create a regex pattern for checking email addresses.
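A pattern can contain more than one group, and each group can be read separately. As a sketch under that assumption, the variant below splits each tag into an optional slash and a tag name; the pattern and the shortened sample string are illustrative, not taken from the example above.

```python
import re

content = '<p>The <code>Pattern</code> is a compiled representation.</p>'

# Group 1 captures an optional slash, group 2 captures the tag name.
pattern = re.compile(r'<(/?)([a-z]+)>')

# With several groups, findall returns a list of tuples, one per match.
tags = pattern.findall(content)
print(tags)

# On a match object, group(0) is the whole match and group(n) is the nth group.
first = pattern.search(content)
print(first.group(0), first.group(2))
```

An empty string in the first position of a tuple marks an opening tag, and a slash marks a closing tag.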

emails.py

#!/usr/bin/python

import re

emails = ("luke@gmail.com", "andy@yahoocom", 
    "34234sdfa#2345", "f344@gmail.com")

pattern = re.compile(r'^[a-zA-Z0-9._-]+@[a-zA-Z0-9-]+\.[a-zA-Z.]{2,18}$')

for email in emails:

    if re.match(pattern, email):
        print(f'{email} matches')
    else:
        print(f'{email} does not match')

This example provides one possible solution.

pattern = re.compile(r'^[a-zA-Z0-9._-]+@[a-zA-Z0-9-]+\.[a-zA-Z.]{2,18}$')

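The first ^ and the last $ characters provide an exact pattern match. No characters before or after the pattern are allowed. The email is divided into five parts. The first part is the local part. This is usually the name of a company, an individual, or a nickname. The [a-zA-Z0-9._-]+ lists all possible characters that we can use in the local part. They can be used one or more times.

The second part consists of the literal @ character. The third part is the domain part. It is usually the domain name of the email provider, such as Yahoo or Gmail. The [a-zA-Z0-9-]+ is a character class providing all characters that can be used in the domain name. The + quantifier allows one or more of these characters.

The fourth part is the dot character. It is preceded by the escape character (\) to get a literal dot.

The final part is the top-level domain: [a-zA-Z.]{2,18}. Top-level domains can have from 2 to 18 characters, such as sk, net, info, travel, cleaning, travelersinsurance. The maximum length can be 63 characters, but most domains are shorter than 18 characters today. The class also contains a dot character. This is because some domain suffixes have two parts; for instance, co.uk.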

$ ./emails.py 
luke@gmail.com matches
andy@yahoocom does not match
34234sdfa#2345 does not match
f344@gmail.com matches
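Combining this pattern with capturing groups makes it possible to extract the local part and the domain of an address that passes the check. The sketch below wraps the two parts in parentheses; the address jane.doe@example.co.uk is a hypothetical sample added to exercise the two-part suffix.

```python
import re

# Same pattern as above, with the local part and the domain wrapped in groups.
pattern = re.compile(
    r'^([a-zA-Z0-9._-]+)@([a-zA-Z0-9-]+\.[a-zA-Z.]{2,18})$')

# The last address is a hypothetical sample with a two-part suffix.
emails = ("luke@gmail.com", "andy@yahoocom", "jane.doe@example.co.uk")

parsed = {}
for email in emails:
    m = pattern.match(email)
    # Store (local part, domain) for valid addresses, None otherwise.
    parsed[email] = (m.group(1), m.group(2)) if m else None
    print(email, parsed[email])
```

Like the original pattern, this is a pragmatic check rather than a full validator of every address form that the email standards permit.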

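Source

Python regular expressions - language reference

In this article, we have covered regular expressions in Python.

Author

My name is Jan Bodnar and I am a passionate programmer with many years of programming experience. I have been writing programming articles since 2007. So far, I have written over 1,400 articles and 8 e-books. I have over ten years of experience in teaching programming.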
