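Kotlin regular expressions

Last modified January 29, 2024

This article shows how to use regular expressions in Kotlin.

Regular expressions are used for text searching and more advanced text manipulation. They are built into tools such as grep and sed, text editors such as vi and Emacs, and programming languages including Kotlin, JavaScript, Perl, and Python.

Kotlin regular expression

In Kotlin, we build regular expressions with Regex.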

Regex("book")
"book".toRegex()
Regex.fromLiteral("book")

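A pattern is a regular expression that defines the text we are searching for or manipulating. It consists of text literals and metacharacters. Metacharacters are special characters that control how the regular expression is evaluated. For instance, with \s we search for white space.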

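In an ordinary Kotlin string, the backslash of such a metacharacter must itself be escaped (written as \\s); alternatively, we can use Kotlin raw strings.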
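To make the difference concrete, here is a minimal sketch that writes the same pattern both ways. The pattern (a decimal number) and the sample text are assumptions chosen only for illustration.

```kotlin
// The same pattern, a decimal number such as 3.14, written in two ways.
val escaped = "\\d+\\.\\d+".toRegex()   // ordinary string: every backslash is doubled
val raw = """\d+\.\d+""".toRegex()      // raw string: backslashes are taken literally

fun main() {
    // Both spellings produce the identical regular expression.
    println(escaped.pattern == raw.pattern)       // true
    println(raw.find("pi is about 3.14")?.value)  // 3.14
}
```

Raw strings tend to be easier to read once a pattern contains more than one or two backslashes.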

After we have created a pattern, we can apply it to a text string with one of the Regex functions. These include matches, containsMatchIn, find, findAll, replace, and split.
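Of these functions, replace is the only one not demonstrated elsewhere in this article, so here is a brief sketch. The helper names replaceFox and shoutFox are hypothetical; the sample sentence is the one used in the find examples.

```kotlin
// Replace every match with a fixed string.
fun replaceFox(text: String): String = "fox".toRegex().replace(text, "cat")

// Replace every match with a value computed from the match itself.
fun shoutFox(text: String): String =
    "fox".toRegex().replace(text) { match -> match.value.uppercase() }

fun main() {
    val text = "I saw a fox in the wood. The fox had red fur."
    println(replaceFox(text)) // I saw a cat in the wood. The cat had red fur.
    println(shoutFox(text))   // I saw a FOX in the wood. The FOX had red fur.
}
```

The lambda form is handy when the replacement depends on what was matched. The sketch assumes Kotlin 1.5 or later because of uppercase.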

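The following table shows some commonly used regular expressions:

Regex	Meaning
`.`	Matches any single character (by default, except line terminators).
`?`	Matches the preceding element zero times or once.
`+`	Matches the preceding element one or more times.
`*`	Matches the preceding element zero or more times.
`^`	Matches the starting position within the string.
`$`	Matches the ending position within the string.
`|`	Alternation operator.
`[abc]`	Matches a, b, or c.
`[a-c]`	Range; matches a, b, or c.
`[^abc]`	Negation; matches any character except a, b, or c.
`\s`	Matches a white space character.
`\w`	Matches a word character; equivalent to `[a-zA-Z_0-9]`.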

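Kotlin matches and containsMatchIn methods

The matches method returns true if the regular expression matches the entire input string. The containsMatchIn method indicates whether the regular expression can find at least one match in the specified input.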

simple_regex.kt

package com.zetcode

fun main() {

    val words = listOf("book", "bookworm", "Bible",
            "bookish","cookbook", "bookstore", "pocketbook")

    val pattern = "book".toRegex()

    println("*********************")
    println("containsMatchIn function")

    words.forEach { word ->
        if (pattern.containsMatchIn(word)) {
            println("$word matches")
        }
    }

    println("*********************")
    println("matches function")

    words.forEach { word ->
        if (pattern.matches(word)) {
            println("$word matches")
        }
    }
}

In the example, we use the matches and containsMatchIn methods. We have a list of words. The pattern looks for the string "book" in each of the words using both methods.

val pattern = "book".toRegex()

A regular expression pattern is created with the toRegex method. The regular expression consists of four ordinary characters.

words.forEach { word ->
    if (pattern.containsMatchIn(word)) {
        println("$word matches")
    }
}

We go through the list and apply containsMatchIn to each of the words.

words.forEach { word ->
    if (pattern.matches(word)) {
        println("$word matches")
    }
}

We go through the list again and apply matches to each of the words.

*********************
containsMatchIn function
book matches
bookworm matches
bookish matches
cookbook matches
bookstore matches
pocketbook matches
*********************
matches function
book matches

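With containsMatchIn, the pattern matches if the string "book" occurs anywhere in the word; with matches, the input string must match the pattern entirely.

Kotlin find method

The find method returns the first match of a regular expression in the input, beginning at the specified start index. The start index is 0 by default.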

regex_find.kt

package com.zetcode

fun main() {

    val text = "I saw a fox in the wood. The fox had red fur."

    val pattern = "fox".toRegex()

    val match = pattern.find(text)

    val m = match?.value
    val idx = match?.range

    println("$m found at indexes: $idx")

    val match2 = pattern.find(text, 11)

    val m2 = match2?.value
    val idx2 = match2?.range

    println("$m2 found at indexes: $idx2")
}

In the example, we find the indexes of the matches of the "fox" term.

val match = pattern.find(text)

val m = match?.value
val idx = match?.range

We find the first match of the "fox" term. We get its value and its index range.

val match2 = pattern.find(text, 11)

val m2 = match2?.value
val idx2 = match2?.range

In the second case, we start the search from index 11 and thus find the next occurrence.

fox found at indexes: 8..10
fox found at indexes: 29..31

Kotlin findAll method

The findAll method returns a sequence of all occurrences of a regular expression within the input string.

regex_findall.kt

package com.zetcode

fun main() {

    val text = "I saw a fox in the wood. The fox had red fur."
    val pattern = "fox".toRegex()

    val matches = pattern.findAll(text)

    matches.forEach { f ->
        
        val m = f.value
        val idx = f.range
        println("$m found at indexes: $idx")
    }
}

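In the example, we find all occurrences of the "fox" term with findAll.

Kotlin regex word boundaries

The metacharacter \b is an anchor that matches at a position called a word boundary. It allows us to search for whole words.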

word_boundaries.kt

package com.zetcode

fun main() {

    val text = "This island is beautiful"
    val pattern = "\\bis\\b".toRegex()

    val matches = pattern.findAll(text)

    matches.forEach { m ->
        val v = m.value
        val idx = m.range
        println("$v found at indexes: $idx")
    }
}

In the example, we look for the word is. We do not want to include the words This and island.

val pattern = "\\bis\\b".toRegex()

With two \b metacharacters, we search for is as a whole word.

val matches = pattern.findAll(text)

With the findAll function, we find all matches.

is found at indexes: 12..13

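Kotlin regex implicit word boundaries

\w is a character class for the characters allowed in a word. For the \w+ regular expression, which denotes a word, the leading and trailing word boundary metacharacters are implicit; that is, when searching for matches, \w+ finds the same words as \b\w+\b.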

implicit_word_boundaries.kt

package com.zetcode

fun main() {
    val content = """
Foxes are omnivorous mammals belonging to several genera
of the family Canidae. Foxes have a flattened skull, upright triangular ears,
a pointed, slightly upturned snout, and a long bushy tail. Foxes live on every
continent except Antarctica. By far the most common and widespread species of
fox is the red fox."""

    val pattern = "\\w+".toRegex()

    val words = pattern.findAll(content)
    val count = words.count()

    println("There are $count words")

    words.forEach { matchResult ->
        println(matchResult.value)
    }
}

In the example, we search for all words in the text.

val pattern = "\\w+".toRegex()

We look for words.

val words = pattern.findAll(content)
val count = words.count()

We find all the words and count them.
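The claim that the word boundaries are implicit can be checked with a small sketch that runs both patterns over the same sentence and compares the results. The helper wordsOf is a hypothetical name introduced for this illustration.

```kotlin
// Collect the text of every match of the given pattern.
fun wordsOf(pattern: String, text: String): List<String> =
    pattern.toRegex().findAll(text).map { it.value }.toList()

fun main() {
    val text = "Foxes live on every continent except Antarctica."

    val plain = wordsOf("\\w+", text)          // no explicit boundaries
    val bounded = wordsOf("\\b\\w+\\b", text)  // explicit boundaries

    println(plain)            // [Foxes, live, on, every, continent, except, Antarctica]
    println(plain == bounded) // true
}
```

Because + is greedy, each \w+ match already extends to the edges of the word, so adding \b does not change the outcome here.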

Kotlin currency symbols

The \p{Sc} regular expression can be used to look for currency symbols.

currency_symbols.kt

package com.zetcode

fun main() {

    val content = """
Currency symbols: ฿ Thailand bath, ₹ Indian rupee, ₾ Georgian lari, $ Dollar,
€ Euro, ¥ Yen, £ Pound Sterling"""

    val pattern = "\\p{Sc}".toRegex(RegexOption.IGNORE_CASE)

    val matches = pattern.findAll(content)

    matches.forEach { matchResult ->

        val currency = matchResult.value
        val idx = matchResult.range

        println("$currency at $idx")
    }
}

In the example, we look for currency symbols.

    val content = """
Currency symbols: ฿ Thailand bath, ₹ Indian rupee, ₾ Georgian lari, $ Dollar,
€ Euro, ¥ Yen, £ Pound Sterling"""

The text contains several currency symbols.

val pattern = "\\p{Sc}".toRegex(RegexOption.IGNORE_CASE)

We define the regular expression for currency symbols.

val matches = pattern.findAll(content)

We find all matches.

matches.forEach { matchResult ->

    val currency = matchResult.value
    val idx = matchResult.range

    println("$currency at $idx")
}

We print all the matched values and their indexes.

฿ at 19..19
₹ at 36..36
₾ at 52..52
$ at 69..69
€ at 79..79
¥ at 87..87
£ at 94..94

Kotlin split function

The split method splits the input string around matches of the regular expression.

regex_split.kt

package com.zetcode

fun main() {

    val text = "I saw a fox in the wood. The fox had red fur."

    val pattern = "\\W+".toRegex()
    val words = pattern.split(text).filter { it.isNotBlank() }

    println(words)
}

In the example, we split a sentence into a list of words.

val pattern = "\\W+".toRegex()

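The pattern contains the \W named character class, which stands for a non-word character. In conjunction with the + quantifier, the pattern looks for runs of non-word characters such as spaces, commas, or periods, which typically separate the words in a text. Note that the backslash of the character class is escaped (\\W) because the pattern is written in an ordinary string.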

val words = pattern.split(text).filter { it.isNotBlank() }

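With the split method, we split the input string into a list of words. In addition, we filter out the blank trailing element, which results from the fact that our text ends with a non-word character.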

[I, saw, a, fox, in, the, wood, The, fox, had, red, fur]

Case insensitive match

To enable case insensitive search, we pass RegexOption.IGNORE_CASE to the toRegex method.

regex_case_insensitive.kt

package com.zetcode

fun main() {

    val words = listOf("dog", "Dog", "DOG", "Doggy")

    val pattern = "dog".toRegex(RegexOption.IGNORE_CASE)

    words.forEach { word ->

        if (pattern.matches(word)) {

            println("$word matches")
        }
    }
}

In the example, we apply the pattern to the words regardless of case.

val pattern = "dog".toRegex(RegexOption.IGNORE_CASE)

With RegexOption.IGNORE_CASE, the case of the input string is ignored.

dog matches
Dog matches
DOG matches

The dot metacharacter

The dot (.) metacharacter stands for any single character in the text.

regex_dot_meta.kt

package com.zetcode

fun main() {

    val words = listOf("seven", "even", "prevent", "revenge", "maven",
            "eleven", "amen", "event")

    val pattern = "..even".toRegex()

    words.forEach { word ->

        if (pattern.containsMatchIn(word)) {

            println("$word matches")
        }
    }
}

In the example, we have eight words in a list. We apply a pattern containing two dot metacharacters to each of the words.

prevent matches
eleven matches

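Two words match the pattern.

The question mark metacharacter

The question mark (?) metacharacter is a quantifier that matches the preceding element zero times or once.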

regex_qmark_meta.kt

package com.zetcode

fun main() {

    val words = listOf("seven", "even", "prevent", "revenge", "maven",
            "eleven", "amen", "event")

    val pattern = ".?even".toRegex()

    words.forEach { word ->

        if (pattern.matches(word)) {

            println("$word matches")
        }
    }
}

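In the example, we add a question mark after the dot character. This means that at this position in the pattern, there can be one arbitrary character or no character at all.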

seven matches
even matches

The {n,m} quantifier

The {n,m} quantifier matches at least n and at most m occurrences of the preceding expression.

regex_mn_quantifier.kt

package com.zetcode

fun main() {

    val words = listOf("pen", "book", "cool", "pencil", "forest", "car",
            "list", "rest", "ask", "point", "eyes")

    val pattern = "\\w{3,4}".toRegex()

    words.forEach { word ->

        if (pattern.matches(word)) {

            println("$word matches")
        } else {
            println("$word does not match")
        }
    }
}

In the example, we search for words that have either three or four characters.

val pattern = "\\w{3,4}".toRegex()

In the pattern, a word character is repeated three or four times. Note that there must not be a space between the numbers.

pen matches
book matches
cool matches
pencil does not match
forest does not match
car matches
list matches
rest matches
ask matches
point does not match
eyes matches

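Kotlin regex anchors

Anchors match positions within the given text rather than characters. With the ^ anchor, the match must occur at the beginning of the string; with the $ anchor, the match must occur at the end of the string.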

regex_anchors.kt

package com.zetcode

fun main() {

    val sentences = listOf("I am looking for Jane.",
        "Jane was walking along the river.",
        "Kate and Jane are close friends.")

    val pattern = "^Jane".toRegex()

    sentences.forEach { sentence ->

        if (pattern.containsMatchIn(sentence)) {

            println(sentence)
        }
    }
}

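In the example, we have three sentences. The search pattern is ^Jane. The pattern checks whether the string "Jane" is located at the beginning of the text. Conversely, Jane\.$ would look for "Jane" at the end of a sentence.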
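The end-of-string case can be sketched in the same way as the ^Jane example. The name endsWithJane is hypothetical; the sentences are the three used above.

```kotlin
// "Jane" followed by a literal period at the very end of the string.
val endsWithJane = "Jane\\.$".toRegex()

fun main() {
    val sentences = listOf(
        "I am looking for Jane.",
        "Jane was walking along the river.",
        "Kate and Jane are close friends."
    )

    // Keep only the sentences that end with "Jane."
    sentences.filter { endsWithJane.containsMatchIn(it) }
        .forEach { println(it) } // I am looking for Jane.
}
```

The period is escaped so that it stands for a literal dot and not for the dot metacharacter.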

Kotlin regex alternations

The alternation operator | creates a regular expression with several choices.

regex_alternations.kt

package com.zetcode

fun main() {

    val words = listOf("Jane", "Thomas", "Robert",
        "Lucy", "Beky", "John", "Peter", "Andy")

    val pattern = "Jane|Beky|Robert".toRegex()

    words.forEach { word ->

        if (pattern.matches(word)) {

            println(word)
        }
    }
}

We have eight names in the list.

val pattern = "Jane|Beky|Robert".toRegex()

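This regular expression looks for the strings "Jane", "Beky", or "Robert".

Kotlin regex subpatterns

A subpattern is a pattern within a pattern. Subpatterns are created with the () characters.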

regex_subpatterns.kt

package com.zetcode

fun main() {

    val words = listOf("book", "bookshelf", "bookworm",
            "bookcase", "bookish", "bookkeeper", "booklet", "bookmark")

    val pattern = "book(worm|mark|keeper)?".toRegex()

    words.forEach { word ->

        if (pattern.matches(word)) {

            println("$word matches")
        } else {

            println("$word does not match")
        }
    }
}

The example creates a subpattern.

val pattern = "book(worm|mark|keeper)?".toRegex()

The regular expression uses a subpattern. It matches the words bookworm, bookmark, bookkeeper, and book.

book matches
bookshelf does not match
bookworm matches
bookcase does not match
bookish does not match
bookkeeper matches
booklet does not match
bookmark matches

Kotlin regex character classes

A character class defines a set of characters, any one of which may occur in the input string for the match to succeed.

character_classes.kt

package com.zetcode

fun main() {

    val words = listOf("a gray bird", "grey hair", "great look")

    val pattern = "gr[ea]y".toRegex()

    words.forEach { word ->

        if (pattern.containsMatchIn(word)) {

            println(word)
        }
    }
}

In the example, we use a character class to include both the gray and grey words.

val pattern = "gr[ea]y".toRegex()

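The [ea] class allows either the 'e' or the 'a' character at that position in the pattern.

Kotlin named character classes

There are some predefined character classes. \s matches a white space character [ \t\n\r\f\v], \d matches a digit [0-9], and \w matches a word character [a-zA-Z0-9_].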

named_character_classes.kt

package com.zetcode

fun main() {

    val text = "We met in 2013. She must be now about 27 years old."

    val pattern = "\\d+".toRegex()
    val found = pattern.findAll(text)

    found.forEach { f ->
        
        val m = f.value
        println(m)
    }
}

In the example, we search for numbers in the text.

val pattern = "\\d+".toRegex()

The \d+ pattern looks for runs of one or more digits in the text.

val found = pattern.findAll(text)

With findAll, we find all matches.

2013
27

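Kotlin regex capturing groups

Round brackets are used to create capturing groups. This allows us to apply a quantifier to the entire group or to restrict alternation to a part of the regular expression. The text matched by each group can also be retrieved separately.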

capturing_groups.kt

package com.zetcode

fun main() {

    val sites = listOf(
        "webcode.me", "zetcode.com", "freebsd.org",
        "netbsd.org"
    )

    val pattern = "(\\w+)\\.(\\w+)".toRegex()

    for (site in sites) {

        val matches = pattern.findAll(site)

        matches.forEach { matchResult ->

            println(matchResult.value)
            println(matchResult.groupValues[1])
            println(matchResult.groupValues[2])
            println("*****************")
        }
    }
}

In the example, we use groups to divide the domain names into two parts.

val pattern = "(\\w+)\\.(\\w+)".toRegex()

We define two groups with parentheses.

matches.forEach { matchResult ->

    println(matchResult.value)
    println(matchResult.groupValues[1])
    println(matchResult.groupValues[2])
    println("*****************")
}

The groups are accessed through the groupValues property. groupValues[0] returns the whole matched string; it is equivalent to the value property.

webcode.me
webcode
me
*****************
zetcode.com
zetcode
com
*****************
freebsd.org
freebsd
org
*****************
netbsd.org
netbsd
org
*****************
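As an alternative to indexing into groupValues, the groups can be unpacked with the destructured property of MatchResult. The following is a sketch only; the function splitDomain is hypothetical, and it uses matchEntire instead of findAll so that a string that is not of the form name.tld yields null.

```kotlin
// Split "name.tld" into its two parts, or return null if the input has another shape.
fun splitDomain(site: String): Pair<String, String>? {
    val pattern = "(\\w+)\\.(\\w+)".toRegex()

    // matchEntire succeeds only if the whole input matches the pattern.
    val match = pattern.matchEntire(site) ?: return null

    // destructured exposes group 1, group 2, ... as components.
    val (name, tld) = match.destructured
    return name to tld
}

fun main() {
    println(splitDomain("zetcode.com")) // (zetcode, com)
    println(splitDomain("localhost"))   // null
}
```

Destructuring gives the groups names at the point of use, which can read better than numeric indexes when there are several groups.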

In the following example, we use groups to work with expressions.

regex_expressions.kt

package com.zetcode

fun main() {

    val expressions = listOf("16 + 11", "12 * 5", "27 / 3", "2 - 8")
    val pattern = "(\\d+)\\s+([-+*/])\\s+(\\d+)".toRegex()

    for (expression in expressions) {

        val matches = pattern.findAll(expression)

        matches.forEach { matchResult ->

            val value1 = matchResult.groupValues[1].toInt()
            val value2 = matchResult.groupValues[3].toInt()

            val msg = when (matchResult.groupValues[2]) {

                "+" -> "$expression = ${value1 + value2}"
                "-" -> "$expression = ${value1 - value2}"
                "*" -> "$expression = ${value1 * value2}"
                "/" -> "$expression = ${value1 / value2}"
                else -> "Unknown operator"
            }

            println(msg)
        }
    }
}

The example parses four simple mathematical expressions and computes them.

val expressions = listOf("16 + 11", "12 * 5", "27 / 3", "2 - 8")

We have a list of four expressions.

val pattern = "(\\d+)\\s+([-+*/])\\s+(\\d+)".toRegex()

In the regular expression pattern, we have three groups: two groups for the values and one for the operator.

val value1 = matchResult.groupValues[1].toInt()
val value2 = matchResult.groupValues[3].toInt()

We get the values and convert them into integers.

val msg = when (matchResult.groupValues[2]) {

    "+" -> "$expression = ${value1 + value2}"
    "-" -> "$expression = ${value1 - value2}"
    "*" -> "$expression = ${value1 * value2}"
    "/" -> "$expression = ${value1 / value2}"
    else -> "Unknown operator"
}

With the when expression, we compute the expressions and build the messages.

16 + 11 = 27
12 * 5 = 60
27 / 3 = 9
2 - 8 = -6

Kotlin regex word frequency

In the next example, we count the frequency of words in a file.

$ wget https://raw.githubusercontent.com/janbodnar/data/main/the-king-james-bible.txt

We use the King James Bible.

word_freq.kt

import java.io.File

fun main() {

    val fileName = "src/main/resources/the-king-james-bible.txt"

    val text = File(fileName).readText()

    val r = "[a-zA-Z']+".toRegex()
    val matches = r.findAll(text)

    val data = matches.map { it.value }
        .groupBy { it }
        .map { Pair(it.key, it.value.size) }
        .sortedByDescending { it.second }
        .take(10)

    for ((word, freq) in data) {

        System.out.printf("%s %d \n", word, freq)
    }
}

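We find all the words with findAll. We group the words and sort them by the number of their occurrences. We print the ten most frequent words. Because the matching is case-sensitive, "And" and "and" are counted separately.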

the 62103 
and 38848 
of 34478 
to 13400 
And 12846 
that 12576 
in 12331 
shall 9760 
he 9665 
unto 8942
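In the output, "and" and "And" appear as two different words. If that is not desired, one possible variation is to lowercase each match before counting. The sketch below is an assumption about how this might be done, not part of the original program; topWords is a hypothetical helper, and it works on an in-memory string so that it can be run without the downloaded file.

```kotlin
// Count words case-insensitively and return the n most frequent ones.
fun topWords(text: String, n: Int): List<Pair<String, Int>> =
    "[a-zA-Z']+".toRegex()
        .findAll(text)
        .map { it.value.lowercase() }   // fold "And" and "and" together
        .groupingBy { it }
        .eachCount()                    // word -> number of occurrences
        .toList()
        .sortedByDescending { it.second }
        .take(n)

fun main() {
    val sample = "The cat and the dog. And THE bird."

    for ((word, freq) in topWords(sample, 2)) {
        println("$word $freq")
    }
    // the 3
    // and 2
}
```

Using groupingBy with eachCount avoids building a list of duplicates for every word, which groupBy does. The sketch assumes Kotlin 1.5 or later because of lowercase.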

Kotlin regex email example

In the following example, we create a regex pattern for checking email addresses.

regex_emails.kt

package com.zetcode

fun main() {

    val emails = listOf("luke@gmail.com", "andy@yahoocom",
            "34234sdfa#2345", "f344@gmail.com", "dandy!@yahoo.com")

    val pattern = "[a-zA-Z0-9._-]+@[a-zA-Z0-9-]+\\.[a-zA-Z.]{2,18}".toRegex()

    emails.forEach { email ->

        if (pattern.matches(email)) {

            println("$email matches")
        } else {

            println("$email does not match")
        }
    }
}

This example provides one possible solution.

val pattern = "[a-zA-Z0-9._-]+@[a-zA-Z0-9-]+\\.[a-zA-Z.]{2,18}".toRegex()

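The email is divided into five parts. The first part is the local part. This is usually the name of a company, an individual, or a nickname. [a-zA-Z0-9._-]+ lists all the characters that we can use in the local part. They can be used one or more times.

The second part consists of the literal @ character. The third part is the domain part. It is usually the domain name of the email provider, such as yahoo or gmail. [a-zA-Z0-9-]+ is a character class providing all the characters that can be used in the domain name. The + quantifier allows one or more of these characters.

The fourth part is the dot character; it is preceded by a backslash, doubled in the Kotlin string (\\), to get a literal dot.

The final part is the top-level domain: [a-zA-Z.]{2,18}. Top-level domains can have from 2 to 18 characters, such as sk, net, info, travel, cleaning, or travelinsurance. The maximum length can be 63 characters, but most domains are shorter than 18 characters today. There is also a dot character in the class. This is because some top-level domains have two parts; for instance, co.uk.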

luke@gmail.com matches
andy@yahoocom does not match
34234sdfa#2345 does not match
f344@gmail.com matches
dandy!@yahoo.com does not match
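To see how the dot inside the last character class behaves, the same pattern can be tried on a few more inputs. The addresses below and the helper looksLikeEmail are made up for this sketch.

```kotlin
// The pattern from the example above, unchanged.
val emailPattern = "[a-zA-Z0-9._-]+@[a-zA-Z0-9-]+\\.[a-zA-Z.]{2,18}".toRegex()

fun looksLikeEmail(input: String): Boolean = emailPattern.matches(input)

fun main() {
    println(looksLikeEmail("jane@example.co.uk")) // true: "co.uk" fits [a-zA-Z.]{2,18}
    println(looksLikeEmail("jane@example.c"))     // false: the last part needs at least 2 characters
    println(looksLikeEmail("jane@example..."))    // true: the class also accepts dots alone
}
```

The last case shows that the pattern is deliberately permissive. It is a reasonable first filter for input, but it does not prove that an address is well formed or deliverable.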

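Source

Kotlin regular expressions documentation

In this article, we have covered regular expressions in Kotlin.

Author

My name is Jan Bodnar and I am a passionate programmer with many years of programming experience. I have been writing programming articles since 2007. So far, I have written over 1,400 articles and 8 e-books. I have more than ten years of experience in teaching programming.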
