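Python iterators and generators

last modified October 18, 2023

In this Python tutorial we work with iterators and generators. An iterator is an object that allows a programmer to traverse all the elements of a collection, regardless of its specific implementation.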

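In Python, an iterator is an object that implements the iterator protocol. The protocol consists of two methods: __iter__, which must return the iterator object itself, and __next__, which returns the next element of the sequence.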

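Iterators have several advantages:

cleaner code
iterators can work with infinite sequences
iterators save resources

Python has several built-in objects that support iteration, for example lists, tuples, strings, dictionaries, and files. Lists, tuples, strings, and dictionaries are iterables: the iter function returns an iterator over them. File objects are themselves iterators.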
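The difference between an iterable and an iterator can be checked directly. The following is a minimal sketch that uses the abstract base classes from collections.abc; the variable names are only illustrative.

```python
from collections.abc import Iterable, Iterator

nums = [1, 2, 3]

# A list is iterable, but it is not itself an iterator.
print(isinstance(nums, Iterable))   # True
print(isinstance(nums, Iterator))   # False

# iter() asks the list for a fresh iterator object.
it = iter(nums)
print(isinstance(it, Iterator))     # True

# An iterator returns itself from __iter__.
print(iter(it) is it)               # True
```

Each call to iter on the list produces a new, independent iterator, while calling iter on an iterator returns the same object.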

iterator.py

#!/usr/bin/env python

# iterator.py

word = "formidable"

for e in word:
   print(e, end=" ")

print()

it = iter(word)

print(next(it))
print(next(it))
print(next(it))

print(list(it))

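In the code example, we show the built-in iterator of a string. In Python, a string is an immutable sequence of characters. The iter function returns an iterator for an object. We can also pass an iterator to the list or tuple functions, which consume its remaining elements.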

$ ./iterator.py 
f o r m i d a b l e
f
o
r
['m', 'i', 'd', 'a', 'b', 'l', 'e']
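The output shows that list picks up only the elements that have not been consumed yet. As a small sketch of what happens once an iterator is used up (the string "abc" is just an illustrative value):

```python
word = "abc"
it = iter(word)

print(list(it))          # ['a', 'b', 'c']
print(list(it))          # [] because the iterator is already exhausted

# next() accepts a default that is returned instead of raising StopIteration.
print(next(it, None))    # None

# A new iterator starts again from the beginning.
print(next(iter(word)))  # a
```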

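Python reading lines

By saving system resources we mean that when using an iterator, we can get the next element of a sequence without keeping the entire dataset in memory.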

read_data.py

#!/usr/bin/env python

# read_data.py

with open('data.txt', 'r') as f:

    while True:

        line = f.readline()
    
        if not line: 
            break
            
        else: 
            print(line.rstrip())

This code prints the contents of the data.txt file. Instead of using a while loop, we can apply an iterator, which simplifies our task.

read_data_iterator.py

#!/usr/bin/env python

# read_data_iterator.py

with open('data.txt', 'r') as f:

    for line in f:
        print(line.rstrip())

The open function returns a file object, which is an iterator. We can use it in a for loop. With the iterator, the code is cleaner.
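Because the file object hands out one line at a time, it can also feed functions such as max or sum without the file ever being loaded as a whole. The following is a minimal sketch; the temporary file and its three lines are made up for the illustration so that the code runs on its own.

```python
import os
import tempfile

# Create a small throwaway file so that the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("alpha\nbeta\ngamma\n")
    path = tmp.name

with open(path, "r") as f:
    # The generator expression pulls one line at a time from the file iterator.
    longest = max(len(line.rstrip()) for line in f)

os.remove(path)
print(longest)  # 5
```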

Python iterator protocol

In the following example, we create a custom object that implements the iterator protocol.

iterator_protocol.py

#!/usr/bin/env python

# iterator_protocol.py

class Seq:

   def __init__(self):
       
      self.x = 0

   def __next__(self):
   
      self.x += 1
      return self.x**self.x

   def __iter__(self):
       
      return self


s = Seq()
n = 0

for e in s:

   print(e)
   n += 1
   
   if n > 10:
      break

In the code example, we create the sequence of numbers 1, 4, 27, 256, ... . This demonstrates that with iterators, we can work with infinite sequences.

def __iter__(self):
    
    return self

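The for statement calls the __iter__ function on the container object. The function returns an iterator object that defines the __next__ method, which accesses the elements of the container one at a time.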
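What the for statement does behind the scenes can be written out by hand. The sketch below is only an approximation of the real mechanism, and the list items and the seen helper list are illustrative names.

```python
items = ["a", "b", "c"]

# for e in items: print(e)   is roughly equivalent to:
it = iter(items)           # calls items.__iter__()
seen = []
while True:
    try:
        e = next(it)       # calls it.__next__()
    except StopIteration:
        break              # the for loop ends silently here
    seen.append(e)
    print(e)
```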

def __next__(self):
    self.x += 1
    return self.x**self.x

The __next__ method returns the next element of the sequence.

if n > 10:
    break

Because we are working with an infinite sequence, we must break out of the for loop.

$ ./iterator_protocol.py
1
4
27
256
3125
46656
823543
16777216
387420489
10000000000
285311670611
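A manual counter is one way to limit an infinite iterator. Another option, shown here as a sketch, is itertools.islice from the standard library. The class is repeated so that the snippet runs on its own.

```python
from itertools import islice

class Seq:
    """Infinite sequence 1**1, 2**2, 3**3, ... (same idea as the class above)."""

    def __init__(self):
        self.x = 0

    def __iter__(self):
        return self

    def __next__(self):
        self.x += 1
        return self.x ** self.x

# islice stops after 5 items, so no manual counter or break is needed.
first_five = list(islice(Seq(), 5))
print(first_five)  # [1, 4, 27, 256, 3125]
```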

StopIteration

The loop can be ended in another way. In the class definition, we raise a StopIteration exception. In the following example, we rework the previous example.

stopiter.py

#!/usr/bin/env python

# stopiter.py

class Seq14:
    
   def __init__(self):
      self.x = 0

   def __next__(self):
       
      self.x += 1
      
      if self.x > 14:
         raise StopIteration
     
      return self.x ** self.x

   def __iter__(self):
      return self


s = Seq14()

for e in s:
   print(e)

The code example prints the first 14 numbers of the sequence.

if self.x > 14:
    raise StopIteration

The StopIteration exception ends the for loop.

$ ./stopiter.py
1
4
27
256
3125
46656
823543
16777216
387420489
10000000000
285311670611
8916100448256
302875106592253
11112006825558016
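Outside a for loop, the StopIteration exception is visible to the caller. The following sketch uses SeqN, a hypothetical variant of the class above with a configurable limit, to show what happens after the last element.

```python
class SeqN:
    """Hypothetical variant of Seq14 with a configurable limit."""

    def __init__(self, limit):
        self.x = 0
        self.limit = limit

    def __iter__(self):
        return self

    def __next__(self):
        if self.x >= self.limit:
            raise StopIteration
        self.x += 1
        return self.x ** self.x

s = SeqN(3)
print(list(s))        # [1, 4, 27]

# The iterator is now exhausted; every further call raises StopIteration.
try:
    next(s)
except StopIteration:
    print("exhausted")
```

An exhausted iterator does not restart; a new SeqN object is needed to go through the sequence again.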

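Python generators

A generator is a special routine that can be used to control the iteration behavior of a loop. A generator is similar to a function that returns an array. A generator has parameters, can be called, and produces a sequence of values. But unlike a function, which returns the whole array at once, a generator yields one value at a time. This requires less memory.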
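The memory difference can be made visible with sys.getsizeof. This is only a rough sketch: the function names are illustrative, the exact sizes depend on the Python implementation, and getsizeof measures the container object alone, not the integers it refers to.

```python
import sys

def squares_list(n):
    # Builds and returns the whole list at once.
    return [i * i for i in range(n)]

def squares_gen(n):
    # Yields one square at a time.
    for i in range(n):
        yield i * i

lst = squares_list(100_000)
gen = squares_gen(100_000)

# getsizeof reports only the container itself, yet the gap is already large.
print(sys.getsizeof(lst) > sys.getsizeof(gen))  # True on CPython
print(sum(lst) == sum(gen))                      # True
```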

Generators in Python:

are defined with the def keyword
use the yield keyword
may use several yield keywords
return an iterator

Let us look at a generator example.

simple_generator.py

#!/usr/bin/env python

# simple_generator.py

def gen():

   x, y = 1, 2
   yield x, y
   
   x += 1
   yield x, y

g = gen()

print(next(g))
print(next(g))

try:
   print(next(g))
   
except StopIteration:
   print("Iteration finished")

The program creates a very simple generator.

def gen():

   x, y = 1, 2
   yield x, y
   
   x += 1
   yield x, y

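The generator is defined with the def keyword, just like a normal function. We use two yield keywords in the body of the generator. The yield keyword suspends the generator and returns a value. The next time the next function is called on the iterator, execution resumes on the line following the yield keyword. Note that local variables are preserved throughout the iterations. When there is nothing left to yield, a StopIteration exception is raised.

$ ./simple_generator.py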
(1, 2)
(2, 2)
Iteration finished
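The way local state survives between calls is easier to see with a loop inside the generator. The following is a small sketch; countdown is a made-up example function.

```python
def countdown(n):
    # The local variable n keeps its value between next() calls.
    while n > 0:
        yield n
        n -= 1

g = countdown(3)
print(next(g))   # 3
print(next(g))   # 2
print(list(g))   # [1], only the remaining values
print(list(g))   # [], the generator is finished
```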

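In the following example, we calculate Fibonacci numbers. The first number of the sequence is 0, the second is 1, and each subsequent number is equal to the sum of the previous two numbers of the sequence.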

fibonacci_gen.py

#!/usr/bin/env python

# fibonacci_gen.py

import time

def fib():
    
   a, b = 0, 1

   while True:
      yield a
      
      a, b = b, a + b


g = fib()

try:
   for e in g:
      print(e)
      
      time.sleep(1)
            
except KeyboardInterrupt:
   print("Calculation stopped")

The script continuously prints Fibonacci numbers to the console. It is terminated with the Ctrl + C key combination.
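When only a fixed number of values is needed, the infinite generator can be combined with itertools.islice instead of being interrupted from the keyboard. The sketch below defines its own fib that starts from 0, following the definition of the sequence given in the text.

```python
from itertools import islice

def fib():
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b

# Take the first ten Fibonacci numbers and stop.
first_ten = list(islice(fib(), 10))
print(first_ten)  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```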

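Python generator expression

A generator expression is similar to a list comprehension. The difference is that a generator expression returns a generator, not a list.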

generator_expression.py

#!/usr/bin/env python

# generator_expression.py

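n = (e for e in range(50000000) if not e % 3)

i = 0

for e in n:
    print(e)

    i += 1

    if i > 100:
        break

The example calculates values that are divisible by 3 without a remainder.

n = (e for e in range(50000000) if not e % 3)

A generator expression is created with round brackets. Creating a list comprehension in this case would be very inefficient, because the example would unnecessarily occupy a lot of memory. Instead, we create a generator expression, which lazily produces values on demand.

i = 0

for e in n:
    print(e)

    i += 1

    if i > 100:
        break

In the for loop, we take the first 101 values from the generator and then leave the loop with break. We have done this without extensive use of memory.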
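A generator expression can also be passed directly to a function that consumes an iterable. As a short sketch with illustrative values:

```python
# Sum of the multiples of 3 below 1000, computed without building a list.
total = sum(e for e in range(1000) if not e % 3)
print(total)  # 166833

# any() stops at the first match, so the rest of the range is never visited.
found = any(e > 10 for e in range(50_000_000) if not e % 3)
print(found)  # True
```

When the generator expression is the only argument of a call, the extra pair of round brackets can be omitted.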

In the next example, we use a generator expression to create a grep-like utility in Python.

roman_empire.txt

The Roman Empire (Latin: Imperium Rōmānum; Classical Latin: [ɪmˈpɛ.ri.ũː roːˈmaː.nũː] 
Koine and Medieval Greek: Βασιλεία τῶν Ῥωμαίων, tr. Basileia tōn Rhōmaiōn) was the 
post-Roman Republic period of the ancient Roman civilization, characterized by government 
headed by emperors and large territorial holdings around the Mediterranean Sea in Europe, 
Africa and Asia. The city of Rome was the largest city in the world c. 100 BC – c. AD 400, 
with Constantinople (New Rome) becoming the largest around AD 500,[5][6] and the Empire's 
populace grew to an estimated 50 to 90 million inhabitants (roughly 20% of the world's 
population at the time).[n 7][7] The 500-year-old republic which preceded it was severely 
destabilized in a series of civil wars and political conflict, during which Julius Caesar 
was appointed as perpetual dictator and then assassinated in 44 BC. Civil wars and executions 
continued, culminating in the victory of Octavian, Caesar's adopted son, over Mark Antony and 
Cleopatra at the Battle of Actium in 31 BC and the annexation of Egypt. Octavian's power was 
then unassailable and in 27 BC the Roman Senate formally granted him overarching power and 
the new title Augustus, effectively marking the end of the Roman Republic.

We use this text file.

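gen_grep.py

#!/usr/bin/env python

# gen_grep.py

import sys

def grep(pattern, lines):
    # enumerate supplies the 1-based line number, which stays correct
    # even when the same line occurs more than once
    return ((line, n) for n, line in enumerate(lines, start=1) if pattern in line)

file_name = sys.argv[2]
pattern = sys.argv[1]

with open(file_name, 'r') as f:
    lines = f.readlines()

    for line, n in grep(pattern, lines):
        print(n, line.rstrip())

The example reads data from a file and prints the lines that contain the specified pattern, together with their line numbers.

def grep(pattern, lines):
    return ((line, n) for n, line in enumerate(lines, start=1) if pattern in line)

The grep-like utility uses this generator expression. The expression goes through the list of lines and picks those that contain the pattern. The enumerate function numbers the lines starting from 1, which gives each line its line number in the file.

with open(file_name, 'r') as f:
    lines = f.readlines()

    for line, n in grep(pattern, lines):
        print(n, line.rstrip())

We open the file for reading and call the grep function on the data. The function returns a generator, which is traversed with a for loop.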

$ ./gen_grep.py Roman roman_empire.txt 
1 The Roman Empire (Latin: Imperium Rōmānum; Classical Latin: [ɪmˈpɛ.ri.ũː roːˈmaː.nũː]
3 post-Roman Republic period of the ancient Roman civilization, characterized by government
13 then unassailable and in 27 BC the Roman Senate formally granted him overarching power and
14 the new title Augustus, effectively marking the end of the Roman Republic.

There are four lines in the file that contain the word "Roman".
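Since a file object is itself an iterator, a grep-like function does not strictly need the whole file in a list. The following sketch is one possible streaming variant, not the script above: this grep yields (number, line) pairs, and io.StringIO with three made-up lines stands in for a real file.

```python
import io

def grep(pattern, lines):
    # lines can be any iterable of strings, including an open file object.
    return ((n, line) for n, line in enumerate(lines, start=1) if pattern in line)

# io.StringIO stands in for a real file here.
text = io.StringIO("Roman Empire\nGreek text\nthe Roman Senate\n")

matches = [(n, line.rstrip()) for n, line in grep("Roman", text)]
print(matches)  # [(1, 'Roman Empire'), (3, 'the Roman Senate')]
```

With a real file, the same function could be called as grep(pattern, f) inside the with block, so that only one line is held in memory at a time.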