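Python os.walk function

Last modified: April 11, 2025

This comprehensive guide explores Python's os.walk function, which recursively traverses a directory tree. We cover directory navigation, file listing, and practical file system exploration examples.

Basic definitions

The os.walk function generates the file names in a directory tree by walking it either top-down or bottom-up. For each directory it visits, including the root itself, it yields a 3-tuple.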

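Key parameters: top (the root directory), topdown (traversal order, True by default), onerror (an error handler, None by default), and followlinks (whether to follow symbolic links, False by default). Each yielded tuple has the form (dirpath, dirnames, filenames).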
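The onerror parameter is the least obvious of the four, so here is a minimal sketch of how it might be used. The helper name walk_with_errors and the "missing" path are made up for illustration. By default os.walk silently ignores errors raised while listing a directory; the callback receives the OSError instance instead.

```python
import os
import tempfile

def walk_with_errors(top):
    """Walk `top` and return (visited_dirs, errors)."""
    errors = []

    def remember(err):
        # os.walk passes the OSError raised while listing a directory
        errors.append(err)

    visited = [dirpath for dirpath, _, _ in os.walk(top, onerror=remember)]
    return visited, errors

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical path that is guaranteed not to exist
    missing = os.path.join(tmp, "missing")
    visited, errors = walk_with_errors(missing)

print(visited)
print([type(e).__name__ for e in errors])
```

Without the callback, walking a missing or unreadable directory simply yields nothing, which can hide real problems.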

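Basic directory traversal

The simplest use of os.walk lists every file and directory under a given root. The following example shows a basic recursive traversal.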

basic_walk.py

import os

# Walk through directory tree
for root, dirs, files in os.walk("my_directory"):
    print(f"Current directory: {root}")
    print(f"Subdirectories: {dirs}")
    print(f"Files: {files}")
    print("-" * 40)

# Count total files
file_count = sum(len(files) for _, _, files in os.walk("my_directory"))
print(f"Total files found: {file_count}")

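This example shows the basic structure of an os.walk loop. For each directory it prints the path, the subdirectories, and the files. The final statement counts all files by walking the tree a second time.

At each level, the generator yields a tuple containing the current path, the names of its immediate subdirectories, and the names of its non-directory files.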
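To make the shape of the yielded tuples concrete, the following sketch builds a small throwaway tree and records what os.walk yields. The layout (a.txt and sub/b.txt) is invented for the demonstration, and paths are stored relative to the root so the result is easy to read.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as top:
    # Hypothetical layout: top/a.txt and top/sub/b.txt
    os.mkdir(os.path.join(top, "sub"))
    for rel in ("a.txt", os.path.join("sub", "b.txt")):
        with open(os.path.join(top, rel), "w") as f:
            f.write("x")

    # One (dirpath, dirnames, filenames) tuple per visited directory
    seen = [
        (os.path.relpath(dirpath, top), sorted(dirnames), sorted(filenames))
        for dirpath, dirnames, filenames in os.walk(top)
    ]

print(seen)
# [('.', ['sub'], ['a.txt']), ('sub', [], ['b.txt'])]
```

The root comes first because the default order is top-down, and each name appears exactly once, in the tuple of the directory that directly contains it.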

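Filtering files by extension

We can filter files during traversal so that only specific file types are processed. This example finds all Python files (.py) in a directory tree.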

filter_extensions.py

import os

# Find all Python files
python_files = []
for root, _, files in os.walk("src"):
    for file in files:
        if file.endswith(".py"):
            python_files.append(os.path.join(root, file))

print("Python files found:")
for file in python_files:
    print(f"- {file}")

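# Count lines of code
total_lines = 0
for file in python_files:
    # Explicit encoding; undecodable bytes are replaced instead of raising
    with open(file, encoding="utf-8", errors="replace") as f:
        total_lines += sum(1 for _ in f)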

print(f"\nTotal lines of Python code: {total_lines}")

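This code first collects all Python files and then counts their total number of lines, blank lines and comments included. os.path.join builds each path with the correct separator for the current operating system.

The underscore (_) stands in for dirnames, which this example does not need. This is a common Python convention for unused variables.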
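If several extensions are of interest, or the names may differ in case, the filter can be wrapped in a generator. This is only a sketch: iter_files_with_ext is a hypothetical helper, not part of the os module, and it relies on str.endswith accepting a tuple of suffixes.

```python
import os
import tempfile

def iter_files_with_ext(top, extensions):
    """Yield paths under `top` whose names end with one of `extensions`."""
    # Lower-case both sides so ".PY" and ".py" are treated alike
    suffixes = tuple(ext.lower() for ext in extensions)
    for root, _, files in os.walk(top):
        for name in files:
            if name.lower().endswith(suffixes):
                yield os.path.join(root, name)

with tempfile.TemporaryDirectory() as top:
    # Throwaway files used only for this demonstration
    for name in ("a.py", "B.PY", "notes.txt"):
        open(os.path.join(top, name), "w").close()
    found = sorted(os.path.basename(p) for p in iter_files_with_ext(top, [".py"]))

print(found)
```

Because the helper yields paths one at a time, the caller can start processing before the walk has finished.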

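Modifying the traversal

The dirnames list can be modified in place to control which subdirectories are visited. This example skips directories whose names start with a dot.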

modify_traversal.py

import os

# Skip hidden directories (starting with dot)
for root, dirs, files in os.walk("project"):
    # Modify dirs in-place to skip hidden directories
    dirs[:] = [d for d in dirs if not d.startswith(".")]
    
    print(f"Scanning: {root}")
    for file in files:
        if not file.startswith("."):  # Skip hidden files too
            print(f"  - {file}")

# Alternative: Skip specific directories entirely
skip_dirs = {"venv", "__pycache__", "node_modules"}
for root, dirs, files in os.walk("project"):
    dirs[:] = [d for d in dirs if d not in skip_dirs]
    # Process remaining files...

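By modifying the dirs list during traversal, we control the recursion. This is more efficient than filtering after the walk has finished, because pruned subtrees are never scanned at all.

The slice assignment (dirs[:]) changes the list object that os.walk itself holds, which is what affects the walk. A plain assignment such as dirs = [...] only rebinds the local name and has no effect. Pruning works only with the default topdown=True; with topdown=False the subdirectories have already been visited by the time dirs is yielded.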
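The difference between slice assignment and rebinding is easy to get wrong, so the following sketch compares the two on a throwaway tree. The directory names keep, skip_me, and deep are invented for the demonstration.

```python
import os
import tempfile

def visited_dirs(top, in_place):
    """Return the directory names visited while trying to skip 'skip_me'."""
    visited = []
    for root, dirs, _ in os.walk(top):
        visited.append(os.path.basename(root))
        kept = [d for d in dirs if d != "skip_me"]
        if in_place:
            dirs[:] = kept  # mutates the list os.walk will recurse into
        else:
            dirs = kept     # rebinds the local name only; the walk is unaffected
    return sorted(visited)

with tempfile.TemporaryDirectory() as top:
    os.makedirs(os.path.join(top, "keep"))
    os.makedirs(os.path.join(top, "skip_me", "deep"))
    pruned = visited_dirs(top, in_place=True)
    unpruned = visited_dirs(top, in_place=False)

print("skip_me" in pruned, "skip_me" in unpruned)
# False True
```

Only the in-place version keeps os.walk out of skip_me and everything below it.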

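Calculating directory sizes

We can use os.walk to calculate the total size of all files in a directory tree. This example computes the size of each directory as well as the grand total.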

directory_sizes.py

import os

def get_size(start_path):
    total_size = 0
    for dirpath, _, filenames in os.walk(start_path):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            try:
                total_size += os.path.getsize(fp)
            except OSError:
                continue
    return total_size

# Calculate size for each directory
for root, dirs, _ in os.walk("data"):
    dir_sizes = {}
    for d in dirs:
        path = os.path.join(root, d)
        size = get_size(path)
        dir_sizes[d] = size
    
    print(f"\nDirectory: {root}")
    for d, size in dir_sizes.items():
        print(f"  {d}: {size/1024:.2f} KB")

# Total size
total = get_size("data")
print(f"\nTotal data size: {total/1024/1024:.2f} MB")

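This example takes two approaches: computing a total size recursively, and printing a per-directory breakdown. The error handling prevents a crash when a file cannot be accessed. Note that get_size walks a subtree again for every directory, so the breakdown repeats work on deep trees.

The script converts bytes to KB and MB for readability. Real applications might add more formatting or threshold checks.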
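For deep trees, one possible alternative is a single bottom-up pass that adds each subdirectory's total to its parent. The sketch below is one way to do that; dir_sizes is a hypothetical helper, and like the example above it counts only the sizes reported by os.path.getsize.

```python
import os
import tempfile

def dir_sizes(top):
    """Return {dirpath: total bytes of all files at or below dirpath}."""
    sizes = {}
    # Bottom-up: every subdirectory is yielded before its parent
    for dirpath, dirnames, filenames in os.walk(top, topdown=False):
        total = 0
        for name in filenames:
            try:
                total += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                continue  # unreadable or vanished file
        for d in dirnames:
            # Symlinked directories are not walked by default, hence .get
            total += sizes.get(os.path.join(dirpath, d), 0)
        sizes[dirpath] = total
    return sizes

with tempfile.TemporaryDirectory() as top:
    sub = os.path.join(top, "sub")
    os.mkdir(sub)
    with open(os.path.join(top, "a.bin"), "wb") as f:
        f.write(b"123")      # 3 bytes
    with open(os.path.join(sub, "b.bin"), "wb") as f:
        f.write(b"12345")    # 5 bytes
    sizes = dir_sizes(top)
    top_size, sub_size = sizes[top], sizes[sub]

print(top_size, sub_size)
# 8 5
```

Each directory is listed once, so the cost grows with the number of entries rather than with the depth of the tree.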

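Finding duplicate files

By combining os.walk with file hashing, we can identify duplicate files in a directory structure. This example uses MD5 hashes for the comparison.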

find_duplicates.py

import os
import hashlib

def file_hash(filepath):
    hasher = hashlib.md5()
    with open(filepath, "rb") as f:
        while chunk := f.read(8192):
            hasher.update(chunk)
    return hasher.hexdigest()

# Find duplicate files
file_hashes = {}
duplicates = []

for root, _, files in os.walk("photos"):
    for file in files:
        path = os.path.join(root, file)
        try:
            file_size = os.path.getsize(path)
            if file_size > 0:  # Skip empty files
                fhash = file_hash(path)
                if fhash in file_hashes:
                    duplicates.append((path, file_hashes[fhash]))
                else:
                    file_hashes[fhash] = path
        except OSError:  # PermissionError is a subclass of OSError
            continue

print("Duplicate files found:")
for dup, original in duplicates:
    print(f"{dup} is a duplicate of {original}")

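This code builds a dictionary of file hashes and checks each new file against it. It skips empty files and handles permission and other OS errors gracefully.

For large file collections, consider comparing file sizes before hashing, or using a faster hash algorithm such as xxHash, for better performance.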
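The size check mentioned above could look roughly like this. It is a sketch under assumptions: find_duplicates and sha256_of are hypothetical helpers, and SHA-256 from the standard library is swapped in for MD5 purely for illustration. Files are grouped by size first, and only sizes shared by two or more files are hashed.

```python
import hashlib
import os
import tempfile
from collections import defaultdict

def sha256_of(path, chunk_size=65536):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(top):
    """Return sorted lists of paths that have identical content."""
    by_size = defaultdict(list)
    for root, _, files in os.walk(top):
        for name in files:
            path = os.path.join(root, name)
            try:
                by_size[os.path.getsize(path)].append(path)
            except OSError:
                continue
    groups = []
    for size, paths in by_size.items():
        if size == 0 or len(paths) < 2:
            continue  # empty, or unique size: no hashing needed
        by_hash = defaultdict(list)
        for path in paths:
            try:
                by_hash[sha256_of(path)].append(path)
            except OSError:
                continue
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups

with tempfile.TemporaryDirectory() as top:
    os.mkdir(os.path.join(top, "sub"))
    contents = {"a.txt": "hello", os.path.join("sub", "b.txt"): "hello",
                "c.txt": "world", "d.txt": "x"}
    for rel, text in contents.items():
        with open(os.path.join(top, rel), "w") as f:
            f.write(text)
    dup_names = [[os.path.basename(p) for p in g] for g in find_duplicates(top)]

print(dup_names)
```

Here c.txt has the same size as the two duplicates, so it is hashed but not reported, while d.txt is never read at all.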

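Processing files by modification time

This example finds files modified within the last 7 days and demonstrates how to combine os.walk with file metadata operations.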

recent_files.py

import os
import time

# Files modified in last 7 days
recent_files = []
current_time = time.time()
seven_days_ago = current_time - (7 * 24 * 60 * 60)

for root, _, files in os.walk("logs"):
    for file in files:
        path = os.path.join(root, file)
        try:
            mtime = os.path.getmtime(path)
            if mtime > seven_days_ago:
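                recent_files.append((path, mtime))  # keep the numeric timestamp
        except OSError:
            continue

print("Files modified in last 7 days:")
# Sort on the numeric mtime; time.ctime() strings would sort by weekday name
for path, mtime in sorted(recent_files, key=lambda x: x[1], reverse=True):
    print(f"{time.ctime(mtime)}: {path}")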

# Archive old files
for root, _, files in os.walk("logs"):
    for file in files:
        path = os.path.join(root, file)
        try:
            mtime = os.path.getmtime(path)
        except OSError:
            continue  # file vanished or is unreadable
        if mtime <= seven_days_ago:
            # Add archiving logic here
            print(f"Archiving: {path}")

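The script first identifies recent files and then shows how older files could be handled separately. The comparison uses seconds since the epoch.

This pattern is useful for log rotation, backup systems, or any time-based file processing task.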
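The two loops above walk the tree twice. As a sketch of an alternative, both groups can be collected in one pass; split_by_age is a hypothetical helper, and the demonstration uses os.utime to backdate one throwaway file.

```python
import os
import tempfile
import time

def split_by_age(top, days, now=None):
    """Return (recent, old) path lists, split at `days` before `now`."""
    now = time.time() if now is None else now
    cutoff = now - days * 24 * 60 * 60
    recent, old = [], []
    for root, _, files in os.walk(top):
        for name in files:
            path = os.path.join(root, name)
            try:
                mtime = os.path.getmtime(path)
            except OSError:
                continue  # file vanished or is unreadable
            (recent if mtime > cutoff else old).append(path)
    return recent, old

with tempfile.TemporaryDirectory() as top:
    new_path = os.path.join(top, "new.log")
    old_path = os.path.join(top, "old.log")
    for p in (new_path, old_path):
        open(p, "w").close()
    # Backdate one file by ten days (access time, modification time)
    ten_days_ago = time.time() - 10 * 24 * 60 * 60
    os.utime(old_path, (ten_days_ago, ten_days_ago))
    recent, old = split_by_age(top, days=7)
    recent_names = [os.path.basename(p) for p in recent]
    old_names = [os.path.basename(p) for p in old]

print(recent_names, old_names)
```

Passing now explicitly makes the split reproducible, which helps when testing a rotation script.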

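Bottom-up directory traversal

Setting topdown=False reverses the traversal order so that subdirectories are processed before their parent. This is useful for operations such as directory deletion.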

bottom_up.py

import os
import shutil

# Bottom-up traversal example (destructive: deletes everything under temp)
print("Bottom-up traversal order:")
for root, dirs, files in os.walk("temp", topdown=False):
    print(f"Processing: {root}")
    for file in files:
        file_path = os.path.join(root, file)
        print(f"  Deleting file: {file_path}")
        os.unlink(file_path)
    
    # Now safe to remove the directory
    print(f"  Removing directory: {root}")
    os.rmdir(root)

# Alternative using shutil for entire tree removal
if os.path.exists("temp_backup"):
    print("\nRemoving backup directory with shutil:")
    shutil.rmtree("temp_backup")

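The bottom-up approach ensures that the files and subdirectories inside a directory are handled before we try to remove the directory itself. The example shows both the manual approach and the shutil approach.

Bottom-up traversal is essential whenever operations on children must finish before their parent can be processed.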
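A gentler variation on the same idea is to remove only directories that are already empty. The sketch below assumes a hypothetical helper, remove_empty_dirs. It leans on os.rmdir failing for non-empty directories rather than on the yielded dirnames list, which is gathered before the children are processed and may therefore be stale.

```python
import os
import tempfile

def remove_empty_dirs(top):
    """Remove empty directories below `top`; return the removed paths."""
    removed = []
    for root, _, _ in os.walk(top, topdown=False):
        if root == top:
            continue  # keep the root itself
        try:
            os.rmdir(root)  # succeeds only if the directory is empty
        except OSError:
            continue        # not empty, or not removable
        removed.append(root)
    return removed

with tempfile.TemporaryDirectory() as top:
    os.makedirs(os.path.join(top, "a", "b", "c"))  # chain of empty directories
    os.makedirs(os.path.join(top, "d"))
    open(os.path.join(top, "d", "keep.txt"), "w").close()
    removed = [os.path.relpath(p, top) for p in remove_empty_dirs(top)]
    d_survived = os.path.isdir(os.path.join(top, "d"))

print(removed, d_survived)
```

Because children come first, a chain of nested empty directories collapses in a single pass.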

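Security considerations

Symbolic links: followlinks is False by default, so symlinked directories are not followed; setting it to True can cause infinite recursion if a link points to one of its own parent directories
Permission errors: errors raised while listing a directory are ignored by default; pass an onerror callback to log or handle them
Path construction: always build paths with os.path.join for cross-platform correctness
Memory usage: for large trees, process entries as they are yielded instead of collecting every path in a list
Race conditions: the file system may change during traversal, so a listed file may be gone by the time it is opened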
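When followlinks=True is really needed, one way to guard against link loops is to remember which real directories have been visited. The sketch below assumes a platform where os.stat reports meaningful st_dev and st_ino values (typical on POSIX systems) and where the process is allowed to create symbolic links. walk_following_links is a hypothetical wrapper, not part of the os module.

```python
import os
import tempfile

def walk_following_links(top):
    """Like os.walk(top, followlinks=True), but visit each real directory once."""
    seen = set()
    for root, dirs, files in os.walk(top, followlinks=True):
        st = os.stat(root)  # follows symlinks, so a link and its target match
        key = (st.st_dev, st.st_ino)
        if key in seen:
            dirs[:] = []    # already visited: do not descend again
            continue
        seen.add(key)
        yield root, dirs, files

with tempfile.TemporaryDirectory() as top:
    os.mkdir(os.path.join(top, "a"))
    # top/a/loop points back at top, which creates a cycle
    os.symlink(top, os.path.join(top, "a", "loop"), target_is_directory=True)
    roots = [os.path.relpath(r, top) for r, _, _ in walk_following_links(top)]

print(roots)
```

The link itself still shows up in the dirnames of its parent; the wrapper only refuses to yield or descend into a directory it has already seen.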

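Best practices

Use the generator: process files incrementally to save memory
Handle errors: supply onerror for a robust traversal
Modify the dirs list: control recursion by editing dirs in place
Stay cross-platform: use os.path functions for path manipulation
Document behavior: state the traversal order (top-down or bottom-up) your code relies on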


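Author

My name is Jan Bodnar, and I am a passionate programmer with extensive programming experience. I have been writing programming articles since 2007. To date, I have written over 1,400 articles and 8 e-books. I have more than ten years of experience teaching programming.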
