Go regular expressions

Last modified April 11, 2024

In this article we show how to parse text in Go using regular expressions.

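Regular expressions

Regular expressions are used for text searching and more advanced text manipulation. They are built into tools such as grep and sed, text editors such as vi and emacs, and programming languages such as Go, Java, and Python.

Go has a built-in API for working with regular expressions; it is located in the regexp package.

A regular expression defines a search pattern for strings. It is used to match text, replace text, or split text. A regular expression can be compiled once and reused for better performance. Go accepts the RE2 syntax, which is largely the same as the general syntax used by Perl, Python, and other languages, but leaves out features such as backreferences and lookaround.

Regex examples

The following table shows a few regular expression strings.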

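Regex	Meaning
`.`	Matches any single character (other than a newline by default).
`?`	Matches the preceding element once or not at all.
`+`	Matches the preceding element once or more times.
`*`	Matches the preceding element zero or more times.
`^`	Matches the starting position within the string.
`$`	Matches the ending position within the string.
`|`	Alternation operator.
`[abc]`	Matches a, b, or c.
`[a-c]`	Range; matches a, b, or c.
`[^abc]`	Negation; matches any character except a, b, or c.
`\s`	Matches a whitespace character.
`\w`	Matches a word character; equivalent to `[a-zA-Z_0-9]`.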
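
To make the table concrete, here is a minimal sketch that tries a few of the listed metacharacters against sample strings. The helper name matches and all patterns and inputs are illustrative assumptions, not part of the original examples.

```go
package main

import (
	"fmt"
	"regexp"
)

// matches reports whether pattern finds a match in s.
// It is a hypothetical helper used only for this illustration.
func matches(pattern, s string) bool {
	return regexp.MustCompile(pattern).MatchString(s)
}

func main() {
	fmt.Println(matches(`^b.g$`, "bug"))          // . stands for any single character
	fmt.Println(matches(`^colou?r$`, "color"))    // ? makes the u optional
	fmt.Println(matches(`^[a-c]+$`, "abcab"))     // + repeats the range one or more times
	fmt.Println(matches(`^[^abc]$`, "a"))         // the negated class rejects a
	fmt.Println(matches(`^(cat|dog)$`, "dog"))    // | chooses between alternatives
	fmt.Println(matches(`^\w+\s\w+$`, "red fox")) // word, whitespace, word
}
```

Raw string literals (backquotes) are used so that backslashes such as `\w` do not need to be doubled.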

Go regex MatchString

The MatchString function reports whether a string contains any match of the regular expression pattern.

matchstring.go

package main

import (
    "fmt"
    "log"
    "regexp"
)

func main() {

    words := [...]string{"Seven", "even", "Maven", "Amen", "eleven"}

    for _, word := range words {

        found, err := regexp.MatchString(".even", word)

        if err != nil {
            log.Fatal(err)
        }

        if found {

            fmt.Printf("%s matches\n", word)
        } else {

            fmt.Printf("%s does not match\n", word)
        }
    }
}

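In the code example, we have five words in an array. We check which words match the .even regular expression.

words := [...]string{"Seven", "even", "Maven", "Amen", "eleven"}

We have an array of words.

for _, word := range words {

We go through the array of words.

found, err := regexp.MatchString(".even", word)

We check whether the current word matches the regular expression with MatchString. We have the .even regular expression. The dot (.) metacharacter stands for any single character in the text.

if found {

    fmt.Printf("%s matches\n", word)
} else {

    fmt.Printf("%s does not match\n", word)
}

We print whether the word matches the regular expression or not.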

$ go run matchstring.go 
Seven matches
even does not match
Maven does not match
Amen does not match
eleven matches

Two words in the array match our regular expression.
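
Because MatchString reports a match anywhere in the string, "eleven" matches through its substring "leven". If the whole word should consist of one character followed by "even", the pattern can be anchored with ^ and $. The following is a minimal sketch; the helper name wholeWordMatch is an assumption made for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// wholeWordMatch reports whether word is exactly one character followed by "even".
// It is a hypothetical helper; a pattern error is treated as "no match" here.
func wholeWordMatch(word string) bool {
	found, err := regexp.MatchString(`^.even$`, word)
	if err != nil {
		return false
	}
	return found
}

func main() {
	for _, word := range []string{"Seven", "eleven"} {
		fmt.Printf("%s: %t\n", word, wholeWordMatch(word))
	}
}
```

With the anchors in place, "Seven" still matches while "eleven" no longer does.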

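Go compiled regular expressions

The Compile function parses a regular expression and, if successful, returns a Regexp object that can be used to match against text. Compiling a pattern once and reusing the resulting object is faster than calling the package-level MatchString function, which parses the pattern on every call.

The MustCompile function is a convenience function that compiles a regular expression and panics if the expression cannot be parsed.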

compiled.go

package main

import (
    "fmt"
    "log"
    "regexp"
)

func main() {

    words := [...]string{"Seven", "even", "Maven", "Amen", "eleven"}

    re, err := regexp.Compile(".even")

    if err != nil {
        log.Fatal(err)
    }

    for _, word := range words {

        found := re.MatchString(word)

        if found {

            fmt.Printf("%s matches\n", word)
        } else {

            fmt.Printf("%s does not match\n", word)
        }
    }
}

In the code example, we use a compiled regular expression.

re, err := regexp.Compile(".even")

We compile the regular expression with Compile.

found := re.MatchString(word)

The MatchString method is called on the returned Regexp object.

compiled2.go

package main

import (
    "fmt"
    "regexp"
)

func main() {

    words := [...]string{"Seven", "even", "Maven", "Amen", "eleven"}

    re := regexp.MustCompile(".even")

    for _, word := range words {

        found := re.MatchString(word)

        if found {

            fmt.Printf("%s matches\n", word)
        } else {

            fmt.Printf("%s does not match\n", word)
        }
    }
}

The example is simplified with MustCompile.
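
Compile is the better choice when the pattern is not a fixed literal, since a bad pattern then surfaces as an error rather than a panic. The sketch below checks a few patterns; the helper name compiles is an assumption, and the last pattern uses a backreference, which the regexp package does not support.

```go
package main

import (
	"fmt"
	"regexp"
)

// compiles reports whether regexp.Compile accepts the pattern.
// It is a hypothetical helper used only for this illustration.
func compiles(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

func main() {
	fmt.Println(compiles(".even"))  // a valid pattern
	fmt.Println(compiles("(even"))  // rejected: missing closing parenthesis
	fmt.Println(compiles(`(\w)\1`)) // rejected: backreferences are not supported
}
```

MustCompile remains convenient for patterns written directly in the source code, where a panic at startup points to a programming mistake.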

Go regex FindAllString

The FindAllString function returns a slice of all successive matches of the regular expression.

findall.go

package main

import (
    "fmt"
    "os"
    "regexp"
)

func main() {

    var content = `Foxes are omnivorous mammals belonging to several genera 
of the family Canidae. Foxes have a flattened skull, upright triangular ears, 
a pointed, slightly upturned snout, and a long bushy tail. Foxes live on every 
continent except Antarctica. By far the most common and widespread species of 
fox is the red fox.`

    re := regexp.MustCompile("(?i)fox(es)?")

    found := re.FindAllString(content, -1)

    fmt.Printf("%q\n", found)

    if found == nil {
        fmt.Printf("no match found\n")
        os.Exit(1)
    }

    for _, word := range found {
        fmt.Printf("%s\n", word)
    }

}

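In the code example, we find all occurrences of the word fox, including its plural form.

re := regexp.MustCompile("(?i)fox(es)?")

With the (?i) syntax, the regular expression is case insensitive. The (es)? part means that the characters "es" may appear zero times or once.

found := re.FindAllString(content, -1)

We look for all occurrences of the defined regular expression with FindAllString. The second parameter is the maximum number of matches to look for; -1 means search for all possible matches.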

$ go run findall.go 
["Foxes" "Foxes" "Foxes" "fox" "fox"]
Foxes
Foxes
Foxes
fox
fox

We have found five matches.
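
The second argument can also cap the number of results. Here is a minimal sketch with a shorter, made-up text; the helper name firstMatches is an assumption for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// firstMatches returns at most n matches of the case-insensitive fox pattern;
// a negative n returns all of them. Hypothetical helper for illustration.
func firstMatches(text string, n int) []string {
	re := regexp.MustCompile("(?i)fox(es)?")
	return re.FindAllString(text, n)
}

func main() {
	text := "Foxes live on every continent. The red fox is the most common fox."

	fmt.Printf("%q\n", firstMatches(text, 2))  // only the first two matches
	fmt.Printf("%q\n", firstMatches(text, -1)) // every match
}
```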

Go regex FindAllStringIndex

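FindAllStringIndex returns a slice of index pairs for all successive matches of the expression. Each pair holds the start and end byte offsets of one match.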

allindex.go

package main

import (
    "fmt"
    "regexp"
)

func main() {

    var content = `Foxes are omnivorous mammals belonging to several genera 
of the family Canidae. Foxes have a flattened skull, upright triangular ears, 
a pointed, slightly upturned snout, and a long bushy tail. Foxes live on every 
continent except Antarctica. By far the most common and widespread species of 
fox is the red fox.`

    re := regexp.MustCompile("(?i)fox(es)?")

    idx := re.FindAllStringIndex(content, -1)

    for _, j := range idx {
        match := content[j[0]:j[1]]
        fmt.Printf("%s at %d:%d\n", match, j[0], j[1])
    }
}

In the code example, we find all occurrences of the word fox together with their indexes in the text.

$ go run allindex.go 
Foxes at 0:5
Foxes at 81:86
Foxes at 196:201
fox at 296:299
fox at 311:314
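
The reported positions are byte offsets, not character counts, which matters once the text contains non-ASCII characters. The following sketch uses a made-up string in which the letter ï takes two bytes in UTF-8; the helper name foxOffsets is an assumption for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// foxOffsets returns the [start, end) byte offsets of every match,
// or nil when there is none. Hypothetical helper for illustration.
func foxOffsets(text string) [][]int {
	re := regexp.MustCompile("(?i)fox(es)?")
	return re.FindAllStringIndex(text, -1)
}

func main() {
	text := "na\u00efve fox and foxes" // \u00ef is the two-byte letter ï

	for _, loc := range foxOffsets(text) {
		// Slicing the string with byte offsets yields the matched text.
		fmt.Printf("%s at %d:%d\n", text[loc[0]:loc[1]], loc[0], loc[1])
	}

	fmt.Println(foxOffsets("no match here") == nil)
}
```

The first match starts at offset 7 rather than 6 because of the two-byte character before it.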

Go regex Split

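The Split function slices a string into substrings separated by the defined regular expression. It returns a slice of the substrings between the expression matches.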

splittext.go

package main

import (
    "fmt"
    "log"
    "regexp"
    "strconv"
)

func main() {

    var data = `22, 1, 3, 4, 5, 17, 4, 3, 21, 4, 5, 1, 48, 9, 42`

    sum := 0

    re := regexp.MustCompile(",\\s*")

    vals := re.Split(data, -1)

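    for _, val := range vals {

        n, err := strconv.Atoi(val)

        if err != nil {
            log.Fatal(err)
        }

        sum += n
    }

    fmt.Println(sum)
}

In the code example, we have a comma-separated list of values. We extract the values from the string and compute their sum.

re := regexp.MustCompile(",\\s*")

The regular expression consists of a comma followed by zero or more whitespace characters.

vals := re.Split(data, -1)

We get a slice of the values.

for _, val := range vals {

    n, err := strconv.Atoi(val)

    if err != nil {
        log.Fatal(err)
    }

    sum += n
}

We iterate over the slice and compute the sum. The slice holds strings; therefore, we convert each string to an integer with the strconv.Atoi function and check the error before adding the value.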

$ go run splittext.go 
189

The sum of the values is 189.
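
The second argument of Split limits the number of pieces in the same way as in FindAllString. A minimal sketch with shortened, made-up data follows; the names sep and splitN are assumptions for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// sep matches a comma followed by optional whitespace.
var sep = regexp.MustCompile(`,\s*`)

// splitN splits data into at most n pieces; a negative n means no limit.
// Hypothetical helper for illustration.
func splitN(data string, n int) []string {
	return sep.Split(data, n)
}

func main() {
	data := "22, 1, 3, 4"

	fmt.Printf("%q\n", splitN(data, -1)) // every value on its own
	fmt.Printf("%q\n", splitN(data, 2))  // first value, then the unsplit remainder
}
```

With a positive limit, the last element holds the rest of the string unsplit.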

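Go regex capturing groups

Round brackets () are used to create capturing groups. This allows us to apply a quantifier to the entire group or to restrict alternation to a part of the regular expression.

To find capturing groups (Go uses the term subexpressions), we use the FindStringSubmatch function.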

capturegroups.go

package main

import (
    "fmt"
    "regexp"
)

func main() {

    websites := [...]string{"webcode.me", "zetcode.com", "freebsd.org", "netbsd.org"}

    re := regexp.MustCompile("(\\w+)\\.(\\w+)")

    for _, website := range websites {

        parts := re.FindStringSubmatch(website)

        for _, part := range parts {
            fmt.Println(part)
        }

        fmt.Println("---------------------")
    }
}

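In the code example, we divide the domain names into two parts by using groups.

re := regexp.MustCompile("(\\w+)\\.(\\w+)")

We define two groups with parentheses.

parts := re.FindStringSubmatch(website)

FindStringSubmatch returns a slice of strings holding the whole match followed by the matches of the capturing groups.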

$ go run capturegroups.go 
webcode.me
webcode
me
---------------------
zetcode.com
zetcode
com
---------------------
freebsd.org
freebsd
org
---------------------
netbsd.org
netbsd
org
---------------------
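
Groups can also be given names with the (?P<name>...) syntax, so that the parts are looked up by name instead of by position. The sketch below is one way to do this; the group names name and tld and the helper splitDomain are assumptions for illustration, and SubexpIndex requires Go 1.15 or later.

```go
package main

import (
	"fmt"
	"regexp"
)

// domainRe names its two groups; the names are chosen for this example only.
var domainRe = regexp.MustCompile(`(?P<name>\w+)\.(?P<tld>\w+)`)

// splitDomain returns the two labels of a simple domain such as "zetcode.com".
// ok is false when the input does not match. Hypothetical helper.
func splitDomain(site string) (name, tld string, ok bool) {
	parts := domainRe.FindStringSubmatch(site)
	if parts == nil {
		return "", "", false
	}
	// SubexpIndex maps a group name to its position in parts.
	return parts[domainRe.SubexpIndex("name")], parts[domainRe.SubexpIndex("tld")], true
}

func main() {
	name, tld, ok := splitDomain("zetcode.com")
	fmt.Println(name, tld, ok)

	_, _, ok = splitDomain("localhost")
	fmt.Println(ok)
}
```

FindStringSubmatch returns nil when there is no match, so the nil check keeps the index lookups safe.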

Go regex replacing strings

Strings can be replaced with ReplaceAllString. The method returns the modified string.

replacing.go

package main

import (
    "fmt"
    "io"
    "log"
    "net/http"
    "regexp"
    "strings"
)

func main() {

    resp, err := http.Get("http://webcode.me")

    if err != nil {
        log.Fatal(err)
    }

    defer resp.Body.Close()

    // io.ReadAll replaces ioutil.ReadAll, which is deprecated since Go 1.16
    body, err := io.ReadAll(resp.Body)

    if err != nil {

        log.Fatal(err)
    }

    content := string(body)

    re := regexp.MustCompile("<[^>]*>")
    replaced := re.ReplaceAllString(content, "")

    fmt.Println(strings.TrimSpace(replaced))
}

The example reads the HTML data of a web page and strips its HTML tags using a regular expression.

resp, err := http.Get("http://webcode.me")

We create a GET request with the Get function from the http package.

body, err := io.ReadAll(resp.Body)

We read the body of the response object.

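re := regexp.MustCompile("<[^>]*>")

This pattern matches an HTML tag: a `<`, any number of characters other than `>`, and a closing `>`.

replaced := re.ReplaceAllString(content, "")

We remove all the tags with the ReplaceAllString method.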
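
The same replacement can be tried without a network connection by running it on a literal string. The sketch below also shows that the replacement text may refer to capturing groups with the ${1} notation. The helper names stripTags and swapLabels and the sample inputs are assumptions for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	tagRe   = regexp.MustCompile("<[^>]*>")
	labelRe = regexp.MustCompile(`(\w+)\.(\w+)`)
)

// stripTags removes everything that looks like an HTML tag and trims the result.
// Hypothetical helper; it is a simple text filter, not an HTML parser.
func stripTags(html string) string {
	return strings.TrimSpace(tagRe.ReplaceAllString(html, ""))
}

// swapLabels rewrites every "name.tld" pair as "tld.name" using group references.
// Hypothetical helper for illustration.
func swapLabels(s string) string {
	return labelRe.ReplaceAllString(s, "${2}.${1}")
}

func main() {
	fmt.Println(stripTags("<p>Hello <b>there</b></p>"))
	fmt.Println(swapLabels("webcode.me and zetcode.com"))
}
```

The braces in ${1} keep the group number separate from any letters or digits that follow it in the replacement text.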

Go regex ReplaceAllStringFunc

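ReplaceAllStringFunc returns a copy of a string in which all matches of the regular expression have been replaced by the return value of the specified function.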

replacing2.go

package main

import (
    "fmt"
    "regexp"
    "strings"
)

func main() {

    content := "an old eagle"

    re := regexp.MustCompile(`[^aeiou]`)

    fmt.Println(re.ReplaceAllStringFunc(content, strings.ToUpper))
}

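In the code example, we apply the strings.ToUpper function to every character that is not a lowercase vowel. Because of the negated class `[^aeiou]`, the consonants are uppercased and the vowels are left as they are.

$ go run replacing2.go 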
aN oLD eaGLe
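
For contrast, a sketch that targets the vowels themselves uses the class `[aeiou]` without the negation. The helper name upperVowels is an assumption for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// upperVowels uppercases the lowercase vowels in s and leaves all other
// characters unchanged. Hypothetical helper for illustration.
func upperVowels(s string) string {
	re := regexp.MustCompile(`[aeiou]`)
	return re.ReplaceAllStringFunc(s, strings.ToUpper)
}

func main() {
	fmt.Println(upperVowels("an old eagle"))
}
```

For the same input, this produces "An Old EAglE", the mirror image of the output above.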

Source

Go regexp package - reference

In this article we have covered regular expressions in Go.

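Author

My name is Jan Bodnar and I am a passionate programmer with many years of programming experience. I have been writing programming articles since 2007. So far, I have written over 1400 articles and 8 e-books. I have over ten years of experience in teaching programming.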
