Polars String Manipulation

Last modified March 1, 2025

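Polars is a fast DataFrame library written in Rust with Python bindings. It is designed for efficient data manipulation and analysis. String manipulation is essential for cleaning and transforming text data in DataFrames. This tutorial covers common string operations in Polars with practical examples.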

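String manipulation includes tasks such as substring extraction, case conversion, and pattern matching. Polars exposes a rich set of string methods for these tasks through the str namespace of its expressions, which makes it a capable tool for processing text data.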

Converting to Lowercase

This example shows how to convert a string column to lowercase.

lowercase.py

import polars as pl

df = pl.DataFrame({
    "text": ["Hello", "WORLD", "Polars", "Tutorial"]
})

# with_columns (plural) is the current method name; the older with_column was removed.
df = df.with_columns(pl.col("text").str.to_lowercase().alias("lowercase_text"))
print(df)

The str.to_lowercase method converts every string in the 'text' column to lowercase. This is useful for standardizing text data.

Extracting Substrings

This example demonstrates how to extract a substring from a column.

substring.py

import polars as pl

df = pl.DataFrame({
    "text": ["Hello World", "Polars Tutorial", "Data Science"]
})

# Take 5 characters starting at offset 0.
df = df.with_columns(pl.col("text").str.slice(0, 5).alias("substring"))
print(df)

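The str.slice(0, 5) method extracts the first 5 characters of each string in the 'text' column. The first argument is the start offset and the second is the length. This is useful for extracting fixed-length data.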

Replacing Substrings

This example shows how to replace a substring in a column.

replace.py

import polars as pl

df = pl.DataFrame({
    "text": ["Hello World", "Polars Tutorial", "Data Science"]
})

# Replace the first match of "World" in each string.
df = df.with_columns(
    pl.col("text").str.replace("World", "Universe").alias("replaced_text")
)
print(df)

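The str.replace("World", "Universe") method replaces the first occurrence of 'World' with 'Universe' in each string of the 'text' column. Strings without a match are left unchanged. This is useful for correcting or updating text.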

Splitting Strings

This example demonstrates how to split strings into lists based on a delimiter.

split.py

import polars as pl

df = pl.DataFrame({
    "text": ["Hello,World", "Polars,Tutorial", "Data,Science"]
})

# Split on commas; the result is a list column.
df = df.with_columns(pl.col("text").str.split(",").alias("split_text"))
print(df)

The str.split(",") method splits each string in the 'text' column on commas and returns a column of lists. This is useful for parsing CSV-like data.

Concatenating Strings

This example shows how to concatenate strings from multiple columns.

concat.py

import polars as pl

df = pl.DataFrame({
    "first_name": ["John", "Jane", "Alice"],
    "last_name": ["Doe", "Smith", "Johnson"]
})

# Join the two columns with a single space between them.
df = df.with_columns(
    (pl.col("first_name") + " " + pl.col("last_name")).alias("full_name")
)
print(df)

The + operator concatenates the 'first_name' and 'last_name' columns with a space in between. This is useful for building full names or combining text.

Checking for Substrings

This example demonstrates how to check whether a substring is present in a column.

contains.py

import polars as pl

df = pl.DataFrame({
    "text": ["Hello World", "Polars Tutorial", "Data Science"]
})

# Produces a Boolean column: True where the string contains "World".
df = df.with_columns(pl.col("text").str.contains("World").alias("contains_world"))
print(df)

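The str.contains("World") method checks whether each string in the 'text' column contains 'World' and returns a Boolean column. The pattern is treated as a regular expression by default. This is useful for filtering or flagging data.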

Trimming Whitespace

This example shows how to trim whitespace from the start and end of strings.

trim.py

import polars as pl

df = pl.DataFrame({
    "text": ["  Hello  ", "  Polars  ", "  Data Science  "]
})

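# str.strip_chars is the current name of the older str.strip method.
df = df.with_columns(pl.col("text").str.strip_chars().alias("trimmed_text"))
print(df)

The str.strip_chars method removes leading and trailing whitespace from each string in the 'text' column. Whitespace inside the string, such as the space in 'Data Science', is kept. This is useful for cleaning messy data.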

Regular Expression Matching

This example demonstrates how to extract a pattern with a regular expression.

regex.py

import polars as pl

df = pl.DataFrame({
    "text": ["Hello123", "Polars456", "Data789"]
})

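# The capture group marks the part to extract; group index 1 selects it.
df = df.with_columns(pl.col("text").str.extract(r"(\d+)", 1).alias("extracted_numbers"))
print(df)

The str.extract(r"(\d+)", 1) method extracts the first run of digits from each string in the 'text' column. The pattern needs a capture group because str.extract returns the group selected by its second argument, here group 1. This is useful for pattern-based extraction.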

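Best Practices for String Manipulation

Standardize text: use lowercase or uppercase consistently.
Handle missing data: check for null values before applying string operations.
Use regular expressions wisely: test regex patterns thoroughly.
Optimize performance: use vectorized expressions on large datasets instead of row-by-row Python code.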

Source

Polars documentation

In this article, we have explored how to perform string manipulation in Polars.

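Author

My name is Jan Bodnar, and I am a passionate programmer with extensive programming experience. I have been writing programming articles since 2007. To date, I have authored over 1,400 articles and 8 e-books. I have more than ten years of experience in teaching programming.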
