Pandas

最后修改于 2024 年 1 月 29 日

在本文中，我们将展示如何使用 Pandas 库在 Python 中进行基本数据分析。代码示例和数据可在作者的 Github repository 中找到。

Pandas

Pandas 是一个开源的、BSD 许可的库，为 Python 编程语言提供高性能、易于使用的数据结构和数据分析工具。

该库的名称来源于术语“面板数据 (panel data)”，这是一个计量经济学术语，指的是包含对相同个体在多个时间段内的观察的数据集。

它提供数据结构和操作，用于处理数值表和时间序列。主要的两种数据类型是：Series 和 DataFrame。

DataFrame 是一个二维的、大小可变的、可能包含异构类型的表格数据结构，带有标记的轴（行和列）。它是一种类似电子表格的数据结构。Series 是 DataFrame 的单列。可以将 DataFrame 视为 Series 对象的字典。

Python Pandas 安装

Pandas 使用以下命令安装

$ pip install pandas

我们使用 pip3 命令来安装 pandas 模块。

$ pip install numpy

一些例子也使用了 numpy。

Pandas 简单示例

以下是一个简单的 Pandas 示例。

simple.py

#!/usr/bin/python

import pandas as pd

data = [['Alex', 10], ['Ronald', 18], ['Jane', 33]]
df = pd.DataFrame(data, columns=['Name', 'Age'])

print(df)

在该程序中，我们创建一个简单的 DataFrame 并将其打印到控制台。

import pandas as pd

我们导入 Pandas 库。

data = [['Alex', 10], ['Ronald', 18], ['Jane', 33]]

这是要在 frame 中显示的数据。每个嵌套列表都是表中的一行。请注意，有很多方法可以初始化 Pandas DataFrame。

df = pd.DataFrame(data, columns=['Name', 'Age'])

从数据创建 DataFrame。我们使用 columns 属性为 frame 提供列名。

$ python simple.py
    Name  Age
0    Alex   10
1  Ronald   18
2    Jane   33

这是输出。第一列是行索引。

Pandas 更改索引

我们可以更新索引，使其不从 0 开始。

change_index.py

#!/usr/bin/python

import pandas as pd

data = [['Alex', 10], ['Ronald', 18], ['Jane', 33]]
df = pd.DataFrame(data, columns=['Name', 'Age'])
df.index = df.index + 1

print(df)

在该示例中，我们向索引添加 1。

$ python change_index.py
    Name  Age
1    Alex   10
2  Ronald   18
3    Jane   33

Pandas 标量序列

以下示例创建一个标量值的序列。

series_scalar.py

#!/usr/bin/python

import pandas as pd

s = pd.Series(5, index=[0, 1, 2, 3])
print(s)

我们有一列包含五个 5。

$ python series_scalar.py
0    5
1    5
2    5
3    5
dtype: int64

左列是索引。

Pandas 序列 ndarray

我们可以从 numpy ndarray 创建一个序列对象。

series_numpy.py

#!/usr/bin/python

import pandas as pd
import numpy as np

data = np.array(['a', 'b', 'c', 'd'])
s = pd.Series(data)

print(s)

该示例从 ndarray 创建一列字母。

$ python series_numpy.py
0    a
1    b
2    c
3    d
dtype: object

Pandas 序列 dict

可以从 Python 字典创建一个序列。

series_dict.py

#!/usr/bin/python

import pandas as pd
import numpy as np

data = {'coins' : 22, 'pens' : 3, 'books' : 28}
s = pd.Series(data)

print(s)

该示例从项目字典创建一个序列对象。

$ python series_dict.py
coins    22
pens      3
books    28
dtype: int64

索引由项目的名称组成。

Pandas 序列检索

以下示例从序列对象中检索值。

series_retrieve.py

#!/usr/bin/python

import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

print(s[0])
print('-----------------------')

print(s[1:4])
print('-----------------------')

print(s[['a','c','d']])

该示例从序列对象中检索值。

print(s[0])

这里我们得到一个单一值。

print(s[1:4])

我们按索引检索行。

print(s[['a','c','d']])

这里我们通过索引标签获取值。

$ python series_retrieve.py
1
-----------------------
b    2
c    3
d    4
dtype: int64
-----------------------
a    1
c    3
d    4
dtype: int64

Pandas 自定义索引

索引列不必是数字的。我们可以创建自己的自定义索引。

custom_index.py

#!/usr/bin/python

import pandas as pd

data = {"country": ["Brazil", "Russia", "India", "China", "South Africa"],
        "capital": ["Brasilia", "Moscow", "New Dehli", "Beijing", "Pretoria"],
        "area": [8.516, 17.10, 3.286, 9.597, 1.221],
        "population": [200.4, 143.5, 1252, 1357, 52.98]}

frame = pd.DataFrame(data)
print(frame)

print('------------------------------')

frame.index = ["BR", "RU", "IN", "CH", "SA"]
print(frame)

在该示例中，我们从数据字典创建一个数据帧。我们打印数据帧，然后使用 index 属性更改索引列。

$ python custom_index.py
        country    capital    area  population
0        Brazil   Brasilia   8.516      200.40
1        Russia     Moscow  17.100      143.50
2         India  New Dehli   3.286     1252.00
3         China    Beijing   9.597     1357.00
4  South Africa   Pretoria   1.221       52.98
------------------------------
         country    capital    area  population
BR        Brazil   Brasilia   8.516      200.40
RU        Russia     Moscow  17.100      143.50
IN         India  New Dehli   3.286     1252.00
CH         China    Beijing   9.597     1357.00
SA  South Africa   Pretoria   1.221       52.98

Pandas 索引、列和值

Pandas DataFrame 有三个基本部分：索引、列和值。

index_vals_cols.py

#!/usr/bin/python

import pandas as pd

data = [['Alex', 10], ['Ronald', 18], ['Jane', 33]]
df = pd.DataFrame(data, columns=['Name', 'Age'])

print(f'Index: {df.index}')
print(f'Columns: {df.columns}')
print(f'Values: {df.values}')

该示例打印数据帧的索引、列和值。

$ python index_vals_cols.py
Index: RangeIndex(start=0, stop=3, step=1)
Columns: Index(['Name', 'Age'], dtype='object')
Values: [['Alex' 10]
    ['Ronald' 18]
    ['Jane' 33]]

Pandas 求和和最大值

以下示例计算数据帧列中值的总和和最大值。它也使用 numpy 库。

sum_max.py

#!/usr/bin/python

import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(0, 1200, 2), columns=['A'])
# df.index = df.index + 1

print(sum(df['A']))
print(max(df['A']))

# print(df)

该示例计算值的最大值和总和。它使用 numpy 的 arange 函数来生成一个值数组。

print(sum(df['A']))

当我们计算总和值时，我们通过其名称引用该列。

$ sum_max.py
359400
1198

Pandas 读取 CSV

Pandas 使用 read_csv 从 CSV 文件读取数据。

military_spending.csv

Pos, Country, Amount (Bn. $), GDP
1, United States, 610.0, 3.1
2, China, 228.0, 1.9
3, Saudi Arabia, 69.4, 10.0
4, Russia, 66.3, 4.3
5, India, 63.9, 2.5
6, France, 57.8, 2.3
7, United Kingdom, 47.2, 1.8
8, Japan, 45.4, 0.9
9, Germany, 44.3, 1.2
10, South Korea, 39.2, 2.6
11, Brazil, 29.3, 1.4
12, Italy Italy, 29.2, 1.5
13, Australia Australia, 27.5, 2.0
14, Canada Canada, 20.6, 1.3
15, Turkey Turkey, 18.2, 2.2

这是一个简单的 CSV 文件，其中包含有关各国军事支出的数据。

注意： CSV 文件可能在第一行中包含可选的列名。

read_from_csv.py

#!/usr/bin/python

import pandas as pd

df = pd.read_csv("military_spending.csv")

print(df.to_string(index=False))

该示例从 military_spending.csv 文件读取所有数据，并以表格格式将其打印到控制台。它使用 read_csv 方法。

print(df.to_string(index=False))

由于我们有 positions 列，我们从输出中隐藏了索引。

$ python read_from_csv.py
Pos               Country   Amount (Bn. $)   GDP
  1         United States            610.0   3.1
  2                 China            228.0   1.9
  3          Saudi Arabia             69.4  10.0
  4                Russia             66.3   4.3
  5                 India             63.9   2.5
  6                France             57.8   2.3
  7        United Kingdom             47.2   1.8
  8                 Japan             45.4   0.9
  9               Germany             44.3   1.2
 10           South Korea             39.2   2.6
 11                Brazil             29.3   1.4
 12           Italy Italy             29.2   1.5
 13   Australia Australia             27.5   2.0
 14         Canada Canada             20.6   1.3
 15         Turkey Turkey             18.2   2.2

Pandas 写入 CSV

使用 to_csv 将 DataFrame 写入 CSV 文件。

write_csv.py

#!/usr/bin/python

import pandas as pd

data = [['Alex', 10], ['Ronald', 18], ['Jane', 33]]
df = pd.DataFrame(data, columns=['Name', 'Age'])

df.to_csv("users.csv", index=False)

该示例将数据写入 users.csv 文件。

Pandas 随机行

可以使用 sample 选择数据帧中的随机行。

random_sample.py

#!/usr/bin/python

import pandas as pd

df = pd.read_csv("military_spending.csv")

print(df.sample(3))

在该示例中，我们打印数据帧中的三个随机行。

Pandas to_dict 函数

to_dict 将数据帧转换为 Python 字典。该字典可以以不同的数据输出显示。

data_orient.py

#!/usr/bin/python

import pandas as pd

data = [['Alex', 10], ['Ronald', 18], ['Jane', 33]]
df = pd.DataFrame(data, columns=['Name', 'Age'])

print('list')
print(df.to_dict(orient='list'))

print('************************************')

print('series')
print(df.to_dict(orient='series'))

print('************************************')

print('dict')
print(df.to_dict(orient='dict'))

print('************************************')

print('split')
print(df.to_dict(orient='split'))

print('************************************')

print('records')
print(df.to_dict(orient='records'))

print('************************************')

print('index')
print(df.to_dict(orient='index'))

该示例以六种不同的格式将数据帧打印到控制台。

Pandas describe

describe 方法生成描述性统计信息，该信息总结了数据集分布的中心趋势、离散程度和形状，不包括 NaN 值。

describing.py

#!/usr/bin/python

import pandas as pd

s1 = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])
s2 = pd.Series([12, 23, 31, 14, 11, 61, 17, 18])

data = {'Vals 1': s1, 'Vals 2': s2}
df = pd.DataFrame(data)

print(df.describe())

该示例从数据帧打印描述性统计信息。

$ python describe.py
        Vals 1     Vals 2
count  8.00000   8.000000
mean   4.50000  23.375000
std    2.44949  16.535136
min    1.00000  11.000000
25%    2.75000  13.500000
50%    4.50000  17.500000
75%    6.25000  25.000000
max    8.00000  61.000000

Pandas 计数

下一个示例计算值。您可以在 Github 存储库中找到 employees.csv 文件。

counting.py

#!/usr/bin/python

import pandas as pd

df = pd.read_csv("employees.csv")

print(df.count())

print(f'Number of columns: {len(df.columns)}')

print(df.shape)

count 方法计算每列的值的数量。列的数量使用 len(df.columns) 检索。shape 返回一个表示数据帧维度的元组。

$ python counting.py
First Name            933
Gender                855
Start Date           1000
Last Login Time      1000
Salary               1000
Bonus %              1000
Senior Management     933
Team                  957
dtype: int64
Number of columns: 8
(1000, 8)

请注意，这些列具有不同数量的值，因为有些值缺失。

Pandas head 和 tail

使用 head 和 tail 方法，我们可以显示数据帧中的前 n 行和后 n 行。

head_tail.py

#!/usr/bin/python

import pandas as pd

df = pd.read_csv("military_spending.csv")

print(df.head(4))

print('*******************************************')

print(df.tail(4))

该示例显示数据帧中的前四行和后四行。

$ python head_tail.py
Pos         Country   Amount (Bn. $)   GDP
0    1   United States            610.0   3.1
1    2           China            228.0   1.9
2    3    Saudi Arabia             69.4  10.0
3    4          Russia             66.3   4.3
*******************************************
 Pos               Country   Amount (Bn. $)   GDP
11   12           Italy Italy             29.2   1.5
12   13   Australia Australia             27.5   2.0
13   14         Canada Canada             20.6   1.3
14   15         Turkey Turkey             18.2   2.2

Pandas 无标题和索引

当我们显示数据帧时，我们可以隐藏标题和索引。

no_header_index.py

#!/usr/bin/python

import pandas as pd

df = pd.read_csv("military_spending.csv")

print(df.head(4).to_string(header=False, index=False))

通过将 header 和 index 属性设置为 False，我们输出没有标题和索引的数据帧。

$ python no_header.py
1   United States  610.0   3.1
2           China  228.0   1.9
3    Saudi Arabia   69.4  10.0
4          Russia   66.3   4.3

这是输出。（值 1 到 4 来自 pos 列。）

Pandas loc

loc 方法允许通过标签或布尔数组访问一组行和列。

select_loc.py

#!/usr/bin/python

import pandas as pd

data = {'Items': ['coins', 'pens', 'books'], 'Quantity': [22, 28, 3]}

df = pd.DataFrame(data, index=['A', 'B', 'C'])

print(df.loc['A'])

print('-------------------------------')

print(df.loc[['A', 'B'], ['Items']])

该示例使用 loc 函数。

print(df.loc['A'])

在这里，我们得到第一行。我们通过其索引标签访问该行。

print(df.loc[['A', 'B'], ['Items']])

在这里，我们获得 Items 列的前两行。

$ python select_loc.py
Items       coins
Quantity       22
Name: A, dtype: object
-------------------------------
    Items
A  coins
B   pens

第二个示例显示如何通过布尔数组进行选择。

select_loc2.py

#!/usr/bin/python

import pandas as pd

data = {'Items': ['coins', 'pens', 'books'], 'Quantity': [22, 28, 3]}

df = pd.DataFrame(data, index=['A', 'B', 'C'])

print(df.loc[[True, False, True], ['Items', 'Quantity']])

该示例通过布尔数组选择行。

$ select_loc2.py
    Items  Quantity
 A  coins        22
 C  books         3

在第三个示例中，我们在选择时应用条件。

select_loc3.py

#!/usr/bin/python

import pandas as pd

df = pd.read_csv("employees.csv")

data = df.loc[(df['Salary'] > 10000) & (df['Salary'] < 50000)]
print(data.head(5))

该示例从 employees.csv 文件打印前五行，这些行符合条件：工资在 10000 到 50000 之间。

Pandas iloc

iloc 函数允许基于整数位置的索引，用于按位置进行选择。

select_iloc.py

#!/usr/bin/python

import pandas as pd

df = pd.read_csv("employees.csv")

# integer-location based indexing for selection by position.
# Multiple row and column selections using iloc and DataFrame

print(df.iloc[0:6])  # first six rows of dataframe
print('--------------------------------------')

print(df.iloc[:, 0:2])  # first two columns of data frame with all rows
print('--------------------------------------')

# 1st, 4th, 7th, 25th row + 1st 6th 8th column
print(df.iloc[[0, 3, 6, 24], [0, 5, 7]])
print('--------------------------------------')

# first 5 rows and 5th, 6th, 7th columns of data frame
print(df.iloc[:5, 5:8])
print('--------------------------------------')

该示例显示如何使用 iloc 选择各种行和列的组合。

take 函数

take 函数返回给定轴上给定位置索引中的元素。

take_fun.py

#!/usr/bin/python

import pandas as pd
import numpy as np

df = pd.read_csv('military_spending.csv')
df_c = df.take([1, 2, 3], axis=1)

print(df_c.to_string(index=False))

该示例显示索引为 1、2 和 3 的列。

$ ./take_fun.py
    Country   Amount (Bn. $)   GDP
United States          610.0   3.1
      China            228.0   1.9
Saudi Arabia            69.4  10.0
     Russia             66.3   4.3
      India             63.9   2.5
     France             57.8   2.3
United Kingdom          47.2   1.8
      Japan             45.4   0.9
    Germany             44.3   1.2
South Korea             39.2   2.6
     Brazil             29.3   1.4
Italy Italy             29.2   1.5
Australia Australia     27.5   2.0
Canada Canada           20.6   1.3
Turkey Turkey           18.2   2.2

Pandas drop 函数

drop 函数删除给定的列或行。

drop_fun.py

#!/usr/bin/python

import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(16).reshape(4, 4), columns=['A', 'B', 'C', 'D'])
df.index = df.index + 1
print(df)

print('----------------------------------------')

df2 = df.drop(['A', 'B'], axis=1)
print(df2)

print('----------------------------------------')

df3 = df.drop([2, 3], axis=0)
print(df3)

我们构建一个小数据帧，并删除两列和两行。

df2 = df.drop(['A', 'B'], axis=1)
print(df2)

我们删除列“A”和“B”。axis=1 表示我们删除列。

df3 = df.drop([2, 3], axis=0)

我们删除宽度索引 2 和 3 的行。要删除行，我们将轴设置为 0。

$./drop_fun.py
    A   B   C   D
1   0   1   2   3
2   4   5   6   7
3   8   9  10  11
4  12  13  14  15
----------------------------------------
    C   D
1   2   3
2   6   7
3  10  11
4  14  15
----------------------------------------
    A   B   C   D
1   0   1   2   3
4  12  13  14  15

Pandas 排序

sort_values 以升序或降序对序列进行排序。

sorting.py

#!/usr/bin/python

import pandas as pd

s1 = pd.Series([2, 1, 4, 5, 3, 8, 7, 6])
s2 = pd.Series([12, 23, 31, 14, 11, 61, 17, 18])

data = {'Col 1': s1, 'Col 2': s2}
df = pd.DataFrame(data)

print(df.sort_values('Col 1', ascending=True))
print('------------------------------------')
print('Sorted')

print(df.sort_values('Col 2', ascending=False))

该示例以升序或降序对列进行排序。

$ python sorting.py
    Col 1  Col 2
 1      1     23
 0      2     12
 4      3     11
 2      4     31
 3      5     14
 7      6     18
 6      7     17
 5      8     61
 ------------------------------------
 Sorted
    Col 1  Col 2
 5      8     61
 2      4     31
 1      1     23
 7      6     18
 6      7     17
 3      5     14
 0      2     12
 4      3     11

在下一个示例中，我们按多个列进行排序。

sorting2.py

#!/usr/bin/python

import pandas as pd

s1 = pd.Series([1, 2, 1, 2, 2, 1, 2, 2])
s2 = pd.Series(['A', 'A', 'B', 'A', 'C', 'C', 'C', 'B'])

data = {'Col 1': s1, 'Col 2': s2}
df = pd.DataFrame(data)

print(df.sort_values(['Col 1', 'Col 2'], ascending=[True, False]))

该示例按包含整数的第一列进行排序。然后对第二列进行排序，同时考虑第一次排序的结果。

$ python sorting2.py
    Col 1 Col 2
 5      1     C
 2      1     B
 0      1     A
 4      2     C
 6      2     C
 7      2     B
 1      2     A
 3      2     A

来源

Pandas 文档

在本文中，我们使用了 Pandas 库。

作者

我叫 Jan Bodnar，是一位充满激情的程序员，拥有丰富的编程经验。我从 2007 年开始撰写编程文章。到目前为止，我已经撰写了 1,400 多篇文章和 8 本电子书。我在编程教学方面拥有超过十年的经验。

列出所有 Python 教程。