pandas入门

pandas入门

简介

pandas包含的数据结构和操作工具能快速简单地清洗和分析数据。

pandas经常与NumPy和SciPy这样的数据计算工具,statsmodels和scikit-learn之类的分析库及数据可视化库(如matplotlib)等一起用使用。pandas基于NumPy的数组,经常可以不使用循环就能处理好大量数据。

pandas适合处理表格数据或巨量数据。NumPy则适合处理巨量的数值数组数据。

这里约定导入方式:

技术支持qq群:630011153

1
import pandas as pd

pandas数据结构介绍

  • 技术支持 (可以加钉钉pythontesting邀请加入) qq群:144081101 591302926 567351477

  • 道家技术-手相手诊看相中医等钉钉群21734177 qq群:391441566 184175668 338228106 看手相、面相、舌相、抽签、体质识别。服务费50元每人次起。请联系钉钉或者微信pythontesting

接口自动化性能测试数据分析人工智能从业专家一对一线上培训大纲

主要数据结构:Series和DataFrame。

Series

Series类似于一维数组的对象,它由一组数据(NumPy类似数据类型)以及相关的数据标签(即索引)组成。仅由一组数据即可产生最简单的Series:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
In [2]: import pandas as pd

In [3]: obj = pd.Series([4, 7, -5, 3])

In [4]: obj
Out[4]: 
0    4
1    7
2   -5
3    3
dtype: int64

In [5]: obj.values
Out[5]: array([ 4,  7, -5,  3])

In [6]: obj.index
Out[6]: Int64Index([0, 1, 2, 3], dtype='int64')

指定索引:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
In [2]: obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

In [3]: obj2
Out[3]: 
d    4
b    7
a   -5
c    3
dtype: int64

In [4]: obj2.index
Out[4]: Index(['d', 'b', 'a', 'c'], dtype='object')

In [10]: obj2['a']
Out[10]: -5

In [11]: obj2['d'] = 6

In [12]: obj2[['c', 'a', 'd']]
Out[12]: 
c    3
a   -5
d    6
dtype: int64

可见与普通NumPy数组相比,你还可以通过索引的方式选取Series中的值。

NumPy函数或类似操作,如根据布尔型数组进行过滤、标量乘法、应用数学函数等)都会保留索引和值之间的链接:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
In [13]: obj2[obj2 > 0]
Out[13]: 
d    6
b    7
c    3
dtype: int64

In [14]: obj2 * 2
Out[14]: 
d    12
b    14
a   -10
c     6
dtype: int64

In [15]: obj2
Out[15]: 
d    6
b    7
a   -5
c    3
dtype: int64

In [17]: import numpy as np

In [18]: np.exp(obj2)
Out[18]: 
d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

In [19]: 'b' in obj2
Out[19]: True

In [20]: 'e' in obj2
Out[20]: False

可见可以吧Series看成是定长的有序字典。也可由字典创建Series:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
In [21]: sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

In [22]: obj3 = pd.Series(sdata)

In [23]: obj3
Out[23]: 
Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

In [24]: states = ['California', 'Ohio', 'Oregon', 'Texas']

In [25]: obj4 = pd.Series(sdata, index=states)

In [26]: obj4
Out[26]: 
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [27]: pd.isnull(obj4)
Out[27]: 
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [28]: pd.notnull(obj4)
Out[28]: 
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

In [29]: obj4.isnull()
Out[29]: 
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [32]: obj4.notnull()
Out[32]: 
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

相加

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
In [33]: obj3
Out[33]: 
Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

In [34]: obj4
Out[34]: 
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [35]: obj3 + obj4
Out[35]: 
California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

In [36]: obj4.name = 'population'

In [37]: obj4.index.name = 'state'

In [38]: obj4
Out[38]: 
state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

In [40]: obj = pd.Series([4, 7, -5, 3])

In [41]: obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']

In [42]: obj
Out[42]: 
Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64

本文代码地址:https://github.com/china-testing/python-api-tesing/

本文最新版本地址:http://t.cn/R8tJ9JH

交流QQ群:python 测试开发 144081101

wechat: pythontesting

淘宝天猫可以把链接发给qq850766020,为你生成优惠券,降低你的购物成本!

DataFrame

DataFrame是矩状表格型的数据结构,包含有序的列,每列可以是不同的类型(数值、字符串、布尔值等)。DataFrame既有行索引也有列索引,它可以被看做由相同索引的Series组成的字典。DataFrame中的数据是以一个或多个二维块存放的。

构建DataFrame的办法有很多,最常用的是直接传入等长列表或NumPy数组组成的字典。DataFrame会自动加上索引(跟Series一样),有序排列。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: 

In [3]: data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
   ...: 'year': [2000, 2001, 2002, 2001, 2002, 2003],
   ...: 'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

In [4]: 

In [4]: frame = pd.DataFrame(data)

In [5]: frame
Out[5]: 
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002
5  3.2  Nevada  2003

In [6]: frame.head()
Out[6]: 
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002

In [7]: 

In [7]: pd.DataFrame(data, columns=['year', 'state', 'pop'])
Out[7]: 
   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
5  2003  Nevada  3.2

In [8]: frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
   ...: index=['one', 'two', 'three', 'four', 'five', 'six'])

In [9]: frame2
Out[9]: 
       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
six    2003  Nevada  3.2  NaN

In [10]: frame2['state']
Out[10]: 
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object

可见还可以通过columns指定DataFrame的列序, index指定索引名。跟Series一样,如果传入的列在数据中找不到,就会产生NaN值。通过类似字典的方式或属性的方式,可以将DataFrame的列获取为Series,返回的Series拥有DataFrame相同的索引,且其name属性也已经被相应地设置好。

行也可以用loc属性通过位置或名称的方式进行获取。列可以通过赋值的方式进行修改。 将列表或数组赋值给某个列时,其长度必须跟DataFrame的长度相匹配。如果赋值的是Series,就会精确匹配

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
In [11]: frame2.loc['three']
Out[11]: 
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

In [12]: frame2['debt'] = 16.5

In [13]: frame2
Out[13]: 
       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5
six    2003  Nevada  3.2  16.5

In [14]: frame2['debt'] = np.arange(6.)

In [15]: frame2
Out[15]: 
       year   state  pop  debt
one    2000    Ohio  1.5   0.0
two    2001    Ohio  1.7   1.0
three  2002    Ohio  3.6   2.0
four   2001  Nevada  2.4   3.0
five   2002  Nevada  2.9   4.0
six    2003  Nevada  3.2   5.0

In [16]: val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])

In [17]: frame2['debt'] = val

In [18]: frame2
Out[18]: 
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
six    2003  Nevada  3.2   NaN

为不存在的列赋值会创建出一个新列。关键字del用于删除列:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
In [19]: frame2['eastern'] = frame2['state'] == 'Ohio'

In [20]: frame2
Out[20]: 
       year   state  pop  debt  eastern
one    2000    Ohio  1.5   NaN     True
two    2001    Ohio  1.7  -1.2     True
three  2002    Ohio  3.6   NaN     True
four   2001  Nevada  2.4  -1.5    False
five   2002  Nevada  2.9  -1.7    False
six    2003  Nevada  3.2   NaN    False

In [21]: del frame2['eastern']

In [22]: frame2.columns
Out[22]: Index(['year', 'state', 'pop', 'debt'], dtype='object')

通过索引方式返回的列只是相应数据的视图而不是副本。因此,对返回的Series所做的任何就地修改 全都会反映到源DataFrame上。通过Series的copy方法即可显式地复制列。

另一种常见的数据形式是嵌套字典,外层字典的键作为列,内层键则作为行索引:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
In [23]: pop = {'Nevada': {2001: 2.4, 2002: 2.9},
   ....: 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

In [24]: frame3 = pd.DataFrame(pop)

In [25]: frame3
Out[25]: 
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6

In [26]: frame3.T
Out[26]: 
        2000  2001  2002
Nevada   NaN   2.4   2.9
Ohio     1.5   1.7   3.6

In [27]: pd.DataFrame(pop, index=[2001, 2002, 2003])
Out[27]: 
      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaN

In [28]: pdata = {'Ohio': frame3['Ohio'][:-1], 'Nevada': frame3['Nevada'][:2]}

In [29]: pdata
Out[29]: 
{'Ohio': 2000    1.5
 2001    1.7
 Name: Ohio, dtype: float64, 'Nevada': 2000    NaN
 2001    2.4
 Name: Nevada, dtype: float64}

In [30]: pd.DataFrame(pdata)
Out[30]: 
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7

In [31]: frame3.index.name = 'year'; frame3.columns.name = 'state'

In [32]: frame3
Out[32]: 
state  Nevada  Ohio
year               
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6

In [33]: frame3.values
Out[33]: 
array([[ nan,  1.5],
       [ 2.4,  1.7],
       [ 2.9,  3.6]])

In [34]: frame2.values
Out[34]: 
array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2002, 'Nevada', 2.9, -1.7],
       [2003, 'Nevada', 3.2, nan]], dtype=object)

可见可以转置,由Series组成的字典和字典类似。如果设置了DataFrame的index和columns的name属性,则这些信息也会被显示出来。跟Series一样,values属性也会以二维ndarray的形式返回DataFrame中的数据。如果DataFrame各列的数据类型不同,则值数组的数据类型就会选用能兼容所有列的数据类型。

DataFrame的constructor接受的类型为:2D ndarray、dict of arrays, lists, or tuples、NumPy structured/record、array、dict of Series、dict of dicts、List of dicts or Series、List of lists or tuples、Another DataFrame、NumPy MaskedArray。

更多参考: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html

索引对象

pandas的索引对象负责管理轴标签和其他元数据(比如轴名称等)。构建Series或DataFrame时,所用到的任何数组或其他序列的标签都会被转换成Index。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
In [35]: obj = pd.Series(range(3), index=['a', 'b', 'c'])

In [36]: index = obj.index

In [37]: index
Out[37]: Index(['a', 'b', 'c'], dtype='object')

In [38]: index[1:]
Out[38]: Index(['b', 'c'], dtype='object')

In [39]: index[1] = 'd'
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-39-676fdeb26a68> in <module>()
----> 1 index[1] = 'd'

/usr/local/lib/python3.5/dist-packages/pandas/core/indexes/base.py in __setitem__(self, key, value)
   1722 
   1723     def __setitem__(self, key, value):
-> 1724         raise TypeError("Index does not support mutable operations")
   1725 
   1726     def __getitem__(self, key):

TypeError: Index does not support mutable operations

In [40]: labels = pd.Index(np.arange(3))

In [41]: labels
Out[41]: Int64Index([0, 1, 2], dtype='int64')

In [42]: obj2 = pd.Series([1.5, -2.5, 0], index=labels)

In [43]: obj2
Out[43]: 
0    1.5
1   -2.5
2    0.0
dtype: float64

In [44]: obj2.index is labels
Out[44]: True

In [45]: frame3
Out[45]: 
state  Nevada  Ohio
year               
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6

In [46]: frame3.columns
Out[46]: Index(['Nevada', 'Ohio'], dtype='object', name='state')

In [47]: 'Ohio' in frame3.columns
Out[47]: True

In [48]: 2003 in frame3.index
Out[48]: False

In [49]: dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])

In [50]: dup_labels
Out[50]: Index(['foo', 'foo', 'bar', 'bar'], dtype='object')

Index对象是不可变的,因此用户不能对其进行修改,这样Index对象在多个数据结构之间可安全共享。除了像数组,Index类似固定大小的集合。

Index的方法和属性有:append,difference,intersection,union,isin,delete,drop,insert,is_monotonic,unique。

更多参考: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.html

基本功能

本节中,我将介绍操作Series和DataFrame中的数据的基本手段。

重新索引

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
In [51]: obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

In [52]: obj
Out[52]: 
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

# 调用reindex将会根据新索引进行重排。如果某个索引值当前不存在,就为NaN

In [53]: obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])

In [54]: obj2
Out[54]: 
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

In [55]: obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

In [56]: obj3
Out[56]: 
0      blue
2    purple
4    yellow
dtype: object

# 对于时间序列这样的有序数据,重新索引时可能需要做插值处理。method选项即可达到此目的,例如,使用ffill以实现前向值填充:

In [57]: obj3.reindex(range(6), method='ffill')
Out[57]: 
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

# DataFrame中reindex可以调整行列
In [58]: frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
   ....: index=['a', 'c', 'd'],
   ....: columns=['Ohio', 'Texas', 'California'])

In [59]: frame
Out[59]: 
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8

In [60]: frame2 = frame.reindex(['a', 'b', 'c', 'd'])

In [61]: frame2
Out[61]: 
   Ohio  Texas  California
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   3.0    4.0         5.0
d   6.0    7.0         8.0

In [62]: states = ['Texas', 'Utah', 'California']

In [63]: frame.reindex(columns=states)
Out[63]: 
   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8

In [69]:  frame2 = frame.reindex(['a', 'b', 'c', 'd'],columns=states)

In [70]: frame2
Out[70]: 
   Texas  Utah  California
a    1.0   NaN         2.0
b    NaN   NaN         NaN
c    4.0   NaN         5.0
d    7.0   NaN         8.0

reindex函数的参数有index,method,fill_value,limit,tolerance,level,copy等。

更多参考: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html

丢弃指定轴上的项

丢弃某条轴上的一项很简单,只要有索引数组或列表即可。由于需要执行一些数据整理和集合逻辑,所以drop方法返回的是在指定轴上删除了指定值的新对象:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
In [71]: obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

In [72]: obj
Out[72]: 
a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [73]: new_obj = obj.drop('c')

In [74]: new_obj
Out[74]: 
a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [75]: obj
Out[75]: 
a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [76]: obj.drop(['d', 'c'])
Out[76]: 
a    0.0
b    1.0
e    4.0
dtype: float64

In [77]: obj
Out[77]: 
a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [78]: data = pd.DataFrame(np.arange(16).reshape((4, 4)),
   ....: index=['Ohio', 'Colorado', 'Utah', 'New York'],
   ....: columns=['one', 'two', 'three', 'four'])

In [79]: data
Out[79]: 
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

In [80]: data.drop(['Colorado', 'Ohio'])
Out[80]: 
          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15

In []: data.drop('two',1)
Out[57]: 
          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15

In []: data.drop('two', axis=1)
Out[58]: 
          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15

In []: data.drop(['two', 'four'], axis='columns')
Out[59]: 
          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14

In []: obj.drop('c', inplace=True)

In []: obj
Out[61]: 
d    4.5
b    7.2
a   -5.3
dtype: float64

索引、选取和过滤

Series索引(obj[...])的工作方式类似于NumPy数组的索引,只不过Series的索引值不只是整数。下面是几个例子:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

obj
Out[63]: 
a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

obj['b']
Out[64]: 1.0

obj[1]
Out[65]: 1.0

obj[2:4]
Out[66]: 
c    2.0
d    3.0
dtype: float64

obj[['b', 'a', 'd']]
Out[67]: 
b    1.0
a    0.0
d    3.0
dtype: float64

obj[[1, 3]]
Out[68]: 
b    1.0
d    3.0
dtype: float64

obj[obj < 2]
Out[69]: 
a    0.0
b    1.0
dtype: float64

obj['b':'c']
Out[70]: 
b    1.0
c    2.0
dtype: float64

obj['b':'c'] = 5

obj
Out[72]: 
a    0.0
b    5.0
c    5.0
d    3.0
dtype: float64

注意标签的方式和python的列表不同,后面的index也是包含在里面的。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
index=['Ohio', 'Colorado', 'Utah', 'New York'],
columns=['one', 'two', 'three', 'four'])

data
Out[74]: 
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

data['two']
Out[75]: 
Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32

data[['three', 'one']]
Out[76]: 
          three  one
Ohio          2    0
Colorado      6    4
Utah         10    8
New York     14   12

data[:2]
Out[77]: 
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7

data[data['three'] > 5]
Out[78]: 
          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

data < 5
Out[79]: 
            one    two  three   four
Ohio       True   True   True   True
Colorado   True  False  False  False
Utah      False  False  False  False
New York  False  False  False  False

data[data < 5] = 0

data
Out[81]: 
          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
  • loc和iloc

对于行上的DataFrame标签索引有特殊的索引操作符loc(标签)和iloc(整数索引)能够从DataFrame中选择子集。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
index=['Ohio', 'Colorado', 'Utah', 'New York'],
columns=['one', 'two', 'three', 'four'])

data
Out[74]: 
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

data['two']
Out[75]: 
Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32

data[['three', 'one']]
Out[76]: 
          three  one
Ohio          2    0
Colorado      6    4
Utah         10    8
New York     14   12

data[:2]
Out[77]: 
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7

data[data['three'] > 5]
Out[78]: 
          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

data < 5
Out[79]: 
            one    two  three   four
Ohio       True   True   True   True
Colorado   True  False  False  False
Utah      False  False  False  False
New York  False  False  False  False

data[data < 5] = 0

data
Out[81]: 
          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

data.loc['Colorado', ['two', 'three']]
Out[82]: 
two      5
three    6
Name: Colorado, dtype: int32

data.iloc[2, [3, 0, 1]]
Out[83]: 
four    11
one      8
two      9
Name: Utah, dtype: int32

data.iloc[2]
Out[84]: 
one       8
two       9
three    10
four     11
Name: Utah, dtype: int32

data.iloc[[1, 2], [3, 0, 1]]
Out[85]: 
          four  one  two
Colorado     7    0    5
Utah        11    8    9

data.loc[:'Utah', 'two']
Out[86]: 
Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int32

data.iloc[:, :3][data.three > 5]
Out[87]: 
          one  two  three
Colorado    0    5      6
Utah        8    9     10
New York   12   13     14

注意ix现在已经不推荐使用。

整数索引(Integer Indexes)

pandas对象的整数索引与内置Python数据的索引语义存在一些差异,以下代码会生成错误:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
ser = pd.Series(np.arange(3.))

ser[-1]
Traceback (most recent call last):

  File "<ipython-input-20-3cbe0b873a9e>", line 1, in <module>
    ser[-1]

  File "C:\Users\andrew\AppData\Local\conda\conda\envs\my_root\lib\site-packages\pandas\core\series.py", line 601, in __getitem__
    result = self.index.get_value(self, key)

  File "C:\Users\andrew\AppData\Local\conda\conda\envs\my_root\lib\site-packages\pandas\core\indexes\base.py", line 2477, in get_value
    tz=getattr(series.dtype, 'tz', None))

  File "pandas\_libs\index.pyx", line 98, in pandas._libs.index.IndexEngine.get_value

  File "pandas\_libs\index.pyx", line 106, in pandas._libs.index.IndexEngine.get_value

  File "pandas\_libs\index.pyx", line 154, in pandas._libs.index.IndexEngine.get_loc

  File "pandas\_libs\hashtable_class_helper.pxi", line 759, in pandas._libs.hashtable.Int64HashTable.get_item

  File "pandas\_libs\hashtable_class_helper.pxi", line 765, in pandas._libs.hashtable.Int64HashTable.get_item

KeyError: -1

ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])

ser2[-1]
Out[22]: 2.0

ser[:1]
Out[23]: 
0    0.0
dtype: float64

ser.loc[:1]
Out[24]: 
0    0.0
1    1.0
dtype: float64

ser.iloc[:1]
Out[25]: 
0    0.0
dtype: float64

算术和数据对齐

pandas可在不同索引的对象建进行算术运算,类似数据库的连接:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])

s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

s1
Out[28]: 
a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

s2
Out[29]: 
a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

s1 + s2
Out[30]: 
a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
index=['Ohio', 'Texas', 'Colorado'])

df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
index=['Utah', 'Ohio', 'Texas', 'Oregon'])

df1
Out[33]: 
            b    c    d
Ohio      0.0  1.0  2.0
Texas     3.0  4.0  5.0
Colorado  6.0  7.0  8.0

df2
Out[34]: 
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0

df1 + df2
Out[35]: 
            b   c     d   e
Colorado  NaN NaN   NaN NaN
Ohio      3.0 NaN   6.0 NaN
Oregon    NaN NaN   NaN NaN
Texas     9.0 NaN  12.0 NaN
Utah      NaN NaN   NaN NaN

df1 = pd.DataFrame({'A': [1, 2]})

df2 = pd.DataFrame({'B': [3, 4]})

df1
Out[38]: 
   A
0  1
1  2

df2
Out[39]: 
   B
0  3
1  4

df1 - df2
Out[40]: 
    A   B
0 NaN NaN
1 NaN NaN

还可以进行值的填充

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
columns=list('abcd'))

df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
columns=list('abcde'))

df1
Out[43]: 
     a    b     c     d
0  0.0  1.0   2.0   3.0
1  4.0  5.0   6.0   7.0
2  8.0  9.0  10.0  11.0

df2
Out[44]: 
      a     b     c     d     e
0   0.0   1.0   2.0   3.0   4.0
1   5.0   6.0   7.0   8.0   9.0
2  10.0  11.0  12.0  13.0  14.0
3  15.0  16.0  17.0  18.0  19.0

df2.loc[1, 'b'] = np.nan

df2
Out[46]: 
      a     b     c     d     e
0   0.0   1.0   2.0   3.0   4.0
1   5.0   NaN   7.0   8.0   9.0
2  10.0  11.0  12.0  13.0  14.0
3  15.0  16.0  17.0  18.0  19.0

df1 + df2
Out[47]: 
      a     b     c     d   e
0   0.0   2.0   4.0   6.0 NaN
1   9.0   NaN  13.0  15.0 NaN
2  18.0  20.0  22.0  24.0 NaN
3   NaN   NaN   NaN   NaN NaN

df1.add(df2, fill_value=0)
Out[48]: 
      a     b     c     d     e
0   0.0   2.0   4.0   6.0   4.0
1   9.0   5.0  13.0  15.0   9.0
2  18.0  20.0  22.0  24.0  14.0
3  15.0  16.0  17.0  18.0  19.0

1 / df1
Out[49]: 
          a         b         c         d
0       inf  1.000000  0.500000  0.333333
1  0.250000  0.200000  0.166667  0.142857
2  0.125000  0.111111  0.100000  0.090909

df1.rdiv(1)
Out[50]: 
          a         b         c         d
0       inf  1.000000  0.500000  0.333333
1  0.250000  0.200000  0.166667  0.142857
2  0.125000  0.111111  0.100000  0.090909


df1.reindex(columns=df2.columns, fill_value=0)
Out[53]: 
     a    b     c     d  e
0  0.0  1.0   2.0   3.0  0
1  4.0  5.0   6.0   7.0  0
2  8.0  9.0  10.0  11.0  0
Method Description
add, radd for addition (+) 
sub, rsub for subtraction (-) 
div, rdiv for division (/)
floordiv, rfloordiv for floor division (//)
mul, rmul for multiplication (*)
pow, rpow for exponentiation (**)
  • DataFrame和Series间的操作

默认基于行进行广播,用( axis='index' or axis=0 )可以基于列进行广播。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
arr = np.arange(12.).reshape((3, 4))

arr
Out[55]: 
array([[  0.,   1.,   2.,   3.],
       [  4.,   5.,   6.,   7.],
       [  8.,   9.,  10.,  11.]])

arr[0]
Out[56]: array([ 0.,  1.,  2.,  3.])

arr - arr[0]
Out[57]: 
array([[ 0.,  0.,  0.,  0.],
       [ 4.,  4.,  4.,  4.],
       [ 8.,  8.,  8.,  8.]])

arr
Out[58]: 
array([[  0.,   1.,   2.,   3.],
       [  4.,   5.,   6.,   7.],
       [  8.,   9.,  10.,  11.]])

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
columns=list('bde'),index=['Utah', 'Ohio', 'Texas', 'Oregon'])

series = frame.iloc[0]

frame
Out[61]: 
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0

series
Out[62]: 
b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

frame - series
Out[63]: 
          b    d    e
Utah    0.0  0.0  0.0
Ohio    3.0  3.0  3.0
Texas   6.0  6.0  6.0
Oregon  9.0  9.0  9.0

series2 = pd.Series(range(3), index=['b', 'e', 'f'])

series2
Out[65]: 
b    0
e    1
f    2
dtype: int32

frame + series2
Out[66]: 
          b   d     e   f
Utah    0.0 NaN   3.0 NaN
Ohio    3.0 NaN   6.0 NaN
Texas   6.0 NaN   9.0 NaN
Oregon  9.0 NaN  12.0 NaN


series3 = frame['d']

frame
Out[69]: 
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0

series3
Out[70]: 
Utah       1.0
Ohio       4.0
Texas      7.0
Oregon    10.0
Name: d, dtype: float64

frame.sub(series3, axis='index')
Out[71]: 
          b    d    e
Utah   -1.0  0.0  1.0
Ohio   -1.0  0.0  1.0
Texas  -1.0  0.0  1.0
Oregon -1.0  0.0  1.0

函数应用和映射

NumPy的ufuncs(元素级数组方法)也可用于操作pandas对象:

另一个常见的操作是将函数应用到由各列或行所形成的一维数组上。DataFrame的apply方法即可实现此功能:

许多最为常见的数组统计功能都被实现成DataFrame的方法(如sum和mean),因此无需使用apply方法。除标量值外,传递给apply的函数还可以返回由多个值组成的Series:

此外,元素级的Python函数也是可以用的。假如你想得到frame中各个浮点值的格式化字符串,使用applymap即可:

之所以叫做applymap,是因为Series有一个用于应用元素级函数的map方法:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
arr = np.arange(12.).reshape((3, 4))

arr
Out[73]: 
array([[  0.,   1.,   2.,   3.],
       [  4.,   5.,   6.,   7.],
       [  8.,   9.,  10.,  11.]])

arr[0]
Out[74]: array([ 0.,  1.,  2.,  3.])

arr - arr[0]
Out[75]: 
array([[ 0.,  0.,  0.,  0.],
       [ 4.,  4.,  4.,  4.],
       [ 8.,  8.,  8.,  8.]])





frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
index=['Utah', 'Ohio', 'Texas', 'Oregon'])

frame
Out[77]: 
               b         d         e
Utah    0.255395  1.983985  0.936326
Ohio    0.319394  2.231544 -0.051256
Texas  -0.041388 -0.026032 -0.446722
Oregon  1.099475 -1.432638 -0.919189

np.abs(frame)
Out[78]: 
               b         d         e
Utah    0.255395  1.983985  0.936326
Ohio    0.319394  2.231544  0.051256
Texas   0.041388  0.026032  0.446722
Oregon  1.099475  1.432638  0.919189

f = lambda x: x.max() - x.min()

frame.apply(f)
Out[80]: 
b    1.140863
d    3.664181
e    1.855515
dtype: float64

frame.apply(f, axis='columns')
Out[81]: 
Utah      1.728590
Ohio      2.282800
Texas     0.420690
Oregon    2.532113
dtype: float64

def f(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])


frame.apply(f)
Out[83]: 
            b         d         e
min -0.041388 -1.432638 -0.919189
max  1.099475  2.231544  0.936326

format = lambda x: '%.2f' % x

frame.applymap(format)
Out[85]: 
            b      d      e
Utah     0.26   1.98   0.94
Ohio     0.32   2.23  -0.05
Texas   -0.04  -0.03  -0.45
Oregon   1.10  -1.43  -0.92

frame['e'].map(format)
Out[86]: 
Utah       0.94
Ohio      -0.05
Texas     -0.45
Oregon    -0.92
Name: e, dtype: object

排序和排名

根据条件对数据集排序(sorting)也是重要的内置运算。要对行或列索引进行排序(按字典顺序),可使用sort_index方法,它将返回一个已排序的新对象。

而对于DataFrame,则可以根据任意轴上的索引进行排序:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])

obj.sort_index()
Out[88]: 
a    1
b    2
c    3
d    0
dtype: int32

frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
index=['three', 'one'],columns=['d', 'a', 'b', 'c'])

frame
Out[90]: 
       d  a  b  c
three  0  1  2  3
one    4  5  6  7

frame.sort_index()
Out[91]: 
       d  a  b  c
one    4  5  6  7
three  0  1  2  3

frame.sort_index(axis='columns')
Out[94]: 
       a  b  c  d
three  1  2  3  0
one    5  6  7  4

数据默认是按升序排序的,但也可以降序排序,若要按值对Series进行排序,可使用其order方法。在排序时,任何缺失值默认都会被放到Series的末尾。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
frame.sort_index(axis='columns', ascending=False)
Out[95]: 
       d  c  b  a
three  0  3  2  1
one    4  7  6  5
obj = pd.Series([4, 7, -3, 2])

obj.sort_values()
Out[97]: 
2   -3
3    2
0    4
1    7
dtype: int64

obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])

obj.sort_values()
Out[99]: 
4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64

obj.sort_values(ascending=False)
Out[100]: 
2    7.0
0    4.0
5    2.0
4   -3.0
1    NaN
3    NaN
dtype: float64

在DataFrame上,你可能希望根据一个或多个列中的值进行排序。将一个或多个列的名字传递给by选项即可。要根据多个列进行排序,传入名称的列表即可:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

frame
Out[102]: 
   a  b
0  0  4
1  1  7
2  0 -3
3  1  2

frame.sort_values(by='b')
Out[103]: 
   a  b
2  0 -3
3  1  2
0  0  4
1  1  7

frame.sort_values(by=['a', 'b'])
Out[104]: 
   a  b
2  0 -3
0  0  4
3  1  2
1  1  7

排名(ranking)跟排序关系密切,且它会增设排名值(从1开始,一直到数组中有效数据的数量)。它跟numpy.argsort产生的间接排序索引差不多,只不过它可以根据某种规则破坏平级关系。接下来介绍Series和DataFrame的rank方法。默认情况下,rank是通过“为各组分配一个平均排名”的方式破坏平级关系的:

也可以根据值在原数据中出现的顺序给出排名

当然,你也可以按降序进行排名:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

obj.rank()
Out[106]: 
0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

obj.rank(method='first')
Out[107]: 
0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

obj.rank(ascending=False, method='max')
Out[108]: 
0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64

frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
'c': [-2, 5, 8, -2.5]})

frame
Out[110]: 
   a    b    c
0  0  4.3 -2.0
1  1  7.0  5.0
2  0 -3.0  8.0
3  1  2.0 -2.5

frame.rank(axis='columns')
Out[111]: 
     a    b    c
0  2.0  3.0  1.0
1  1.0  3.0  2.0
2  2.0  1.0  3.0
3  2.0  3.0  1.0
Method Description
'average' Default: assign the average rank to each entry in the equal group
'min' Use the minimum rank for the whole group
'max' Use the maximum rank for the whole group
'first' Assign ranks in the order the values appear in the data
'dense' Like method='min' , but ranks always increase by 1 in between groups rather than the number of equal
elements in a group

带有重复值的轴索引

直到目前为止,我所介绍的所有范例都有着唯一的轴标签(索引值)。虽然许多pandas函数(如reindex)都要求标签唯一,但这并不是强制性的。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
import pandas as pd

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

obj
Out[3]: 
a    0
a    1
b    2
b    3
c    4
dtype: int32

obj.index.is_unique
Out[4]: False

obj['a']
Out[5]: 
a    0
a    1
dtype: int32

obj['c']
Out[6]: 4

import numpy as np

df = pd.DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])

df
Out[10]: 
          0         1         2
a  0.835470  0.465657 -0.068212
a -1.067020  1.148283  1.722324
b  0.057184 -0.441111 -0.388286
b -0.363911 -0.599963  0.126594

df.loc['b']
Out[11]: 
          0         1         2
b  0.057184 -0.441111 -0.388286
b -0.363911 -0.599963  0.126594

汇总和计算描述统计

pandas对象拥有一组常用的数学和统计方法。它们大部分都属于reduction和summary统计,用于从Series中提取单个值(如sum或mean)或从DataFrame的行或列中提取Series。跟对应的NumPy数组方法相比,它们都是基于没有缺失数据的假设而构建的。接下来看一个简单DataFrame:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],[np.nan, np.nan], [0.75, -1.3]], 
index=['a', 'b', 'c', 'd'],columns=['one', 'two'])

df
Out[14]: 
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3

df.sum()
Out[15]: 
one    9.25
two   -5.80
dtype: float64

df.sum(axis='columns')
Out[16]: 
a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

df.mean(axis='columns', skipna=False)
Out[17]: 
a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64

df.mean(axis='columns')
Out[18]: 
a    1.400
b    1.300
c      NaN
d   -0.275
dtype: float64
Method Description
axis Axis to reduce over; 0 for DataFrame’s rows and 1 for columns
skipna Exclude missing values; True by default
level Reduce grouped by level if the axis is hierarchically indexed (MultiIndex)

有些方法(如idxmin和idxmax)返回的是间接统计(比如达到最小值或最大值的索引),cumsum则为累计求和,describe则为汇总统计。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
df
Out[19]: 
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3

df.idxmax()
Out[20]: 
one    b
two    d
dtype: object

df.cumsum()
Out[21]: 
    one  two
a  1.40  NaN
b  8.50 -4.5
c   NaN  NaN
d  9.25 -5.8

df.describe()
Out[22]: 
            one       two
count  3.000000  2.000000
mean   3.083333 -2.900000
std    3.493685  2.262742
min    0.750000 -4.500000
25%    1.075000 -3.700000
50%    1.400000 -2.900000
75%    4.250000 -2.100000
max    7.100000 -1.300000

obj = pd.Series(['a', 'a', 'b', 'c'] * 4)

obj.describe()
Out[24]: 
count     16
unique     3
top        a
freq       8
dtype: object
Method Description
count Number of non-NA values
describe Compute set of summary statistics for Series or each DataFrame column
min, max Compute minimum and maximum values
argmin, argmax Compute index locations (integers) at which minimum or maximum value obtained, respectively
idxmin, idxmax Compute index labels at which minimum or maximum value obtained, respectively
quantile Compute sample quantile ranging from 0 to 1
sum Sum of values
mean Mean of values
median Arithmetic median (50% quantile) of values
mad Mean absolute deviation from mean value
prod Product of all values
var Sample variance of values
std Sample standard deviation of values
skew Sample skewness (third moment) of values
kurt Sample kurtosis (fourth moment) of values
cumsum Cumulative sum of values
cummin, cummax Cumulative minimum or maximum of values, respectively
cumprod Cumulative product of values
diff Compute first arithmetic difference (useful for time series)
pct_change Compute percent changes

相关性和方差

一些汇总统计,如相关和方差,是从成对的参数程程。 让我们考虑一些来自Yahoo的股票价格和数量DataFrame! 使用附加的pandas-datareader包,

暂略

唯一值、值计数以及成员资格

还有一类方法可以从一维Series的值中抽取信息。以下面这个Series为例:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
import pandas as pd

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

uniques = obj.unique()

uniques
Out[9]: array(['c', 'a', 'd', 'b'], dtype=object)

obj.value_counts()
Out[10]: 
a    3
c    3
b    2
d    1
dtype: int64

pd.value_counts(obj.values, sort=False)
Out[11]: 
c    3
d    1
b    2
a    3
dtype: int64

obj
Out[12]: 
0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object

mask = obj.isin(['b', 'c'])

mask
Out[14]: 
0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool

obj[mask]
Out[15]: 
0    c
5    b
6    b
7    c
8    c
dtype: object

to_match = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])

unique_vals = pd.Series(['c', 'b', 'a'])

pd.Index(unique_vals).get_indexer(to_match)
Out[18]: array([0, 2, 1, 1, 0, 2], dtype=int64)

data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4], 'Qu2': [2, 3, 1, 2, 3], 'Qu3': [1, 5, 2, 4, 4]})

data
Out[20]: 
   Qu1  Qu2  Qu3
0    1    2    1
1    3    3    5
2    4    1    2
3    3    2    4
4    4    3    4

result = data.apply(pd.value_counts).fillna(0)

result
Out[22]: 
   Qu1  Qu2  Qu3
1  1.0  1.0  1.0
2  0.0  2.0  1.0
3  2.0  2.0  0.0
4  2.0  0.0  2.0
5  0.0  0.0  1.0
Method Description
isin Compute boolean array indicating whether each Series value is contained in the passed sequence ofvalues
match Compute integer indices for each value in an array into another array of distinct values; helpful for data
alignment and join-type operations
unique Compute array of unique values in a Series, returned in the order observed
value_counts Return a Series containing unique values as its index and frequencies as its values, ordered count in
descending order

数据清洗和准备

在进行数据分析和建模的过程中,需要花费大量的时间(80%或更多)在数据准备上:加载,清理,转换和重新排列。有时候数据存储在文件或数据库中的方式不适合特定任务的格式。

在本章中,我将讨论缺失数据,重复数据,字符串操作,和其他一些分析数据转换。在下一章中,我将重点放在组合上,并以各种方式重新排列数据集。

处理缺失数据

数值用浮点数NaN (Not a Number)表示缺失。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])

In [4]: string_data
Out[4]: 
0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

In [5]: string_data.isnull()
Out[5]: 
0    False
1    False
2     True
3    False
dtype: bool

In [6]: string_data[0] = None

In [7]: string_data.isnull()
Out[7]: 
0     True
1    False
2     True
3    False
dtype: bool

NA相关的处理方法

数据缺失用NA(not available)表示, python内置的None也为NA。

Argument Description
dropna Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate.
fillna Fill in missing data with some value or using an interpolation method such as 'ffill' or 'bfill' .
isnull Return boolean values indicating which values are missing/NA.
notnull Negation of isnull .
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
In [8]: from numpy import nan as NA

In [9]: data = pd.Series([1, NA, 3.5, NA, 7])

In [10]: data.dropna()
Out[10]: 
0    1.0
2    3.5
4    7.0
dtype: float64

In [11]: data[data.notnull()]
Out[11]: 
0    1.0
2    3.5
4    7.0
dtype: float64

In [12]: data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
   ....: [NA, NA, NA], [NA, 6.5, 3.]])

In [13]: cleaned = data.dropna()

In [14]: data
Out[14]: 
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0

In [15]: cleaned
Out[15]: 
     0    1    2
0  1.0  6.5  3.0

In [16]: data.dropna(how='all')
Out[16]: 
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0

In [17]: data[4] = NA

In [18]: data
Out[18]: 
     0    1    2   4
0  1.0  6.5  3.0 NaN
1  1.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  6.5  3.0 NaN

In [19]: data.dropna((axis='columns', how='all')
Out[19]: 
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0

how='all'要所有行都为NaN时才会删除。thresh参数可以指定NA的个数。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
In [21]: df = pd.DataFrame(np.random.randn(7, 3))

In [22]: df.iloc[:4, 1] = NA

In [23]: df.iloc[:2, 2] = NA

In [24]: df
Out[24]: 
          0         1         2
0 -0.843340       NaN       NaN
1 -1.305941       NaN       NaN
2  1.026378       NaN  2.176567
3  0.048885       NaN  0.012649
4  0.591212 -0.739625  1.017533
5  0.633873 -0.124162 -0.823495
6 -1.537827  0.802565  0.359058

In [25]: df.dropna()
Out[25]: 
          0         1         2
4  0.591212 -0.739625  1.017533
5  0.633873 -0.124162 -0.823495
6 -1.537827  0.802565  0.359058

In [26]: df.dropna(thresh=2)
Out[26]: 
          0         1         2
2  1.026378       NaN  2.176567
3  0.048885       NaN  0.012649
4  0.591212 -0.739625  1.017533
5  0.633873 -0.124162 -0.823495
6 -1.537827  0.802565  0.359058

fillna用来对缺失值进行填充。可以针对列进行填充,用上一行的值填充,用平均值填充等。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
In [27]: df.fillna(0)
Out[27]: 
          0         1         2
0 -0.843340  0.000000  0.000000
1 -1.305941  0.000000  0.000000
2  1.026378  0.000000  2.176567
3  0.048885  0.000000  0.012649
4  0.591212 -0.739625  1.017533
5  0.633873 -0.124162 -0.823495
6 -1.537827  0.802565  0.359058

In [28]: df.fillna({1: 0.5, 2: 0})
Out[28]: 
          0         1         2
0 -0.843340  0.500000  0.000000
1 -1.305941  0.500000  0.000000
2  1.026378  0.500000  2.176567
3  0.048885  0.500000  0.012649
4  0.591212 -0.739625  1.017533
5  0.633873 -0.124162 -0.823495
6 -1.537827  0.802565  0.359058

In [29]: _ = df.fillna(0, inplace=True)

In [30]: df
Out[30]: 
          0         1         2
0 -0.843340  0.000000  0.000000
1 -1.305941  0.000000  0.000000
2  1.026378  0.000000  2.176567
3  0.048885  0.000000  0.012649
4  0.591212 -0.739625  1.017533
5  0.633873 -0.124162 -0.823495
6 -1.537827  0.802565  0.359058

In [31]: df = pd.DataFrame(np.random.randn(6, 3))

In [32]: df.iloc[2:, 1] = NA

In [33]: df.iloc[4:, 2] = NA

In [34]: df
Out[34]: 
          0         1         2
0 -0.081265 -0.820770 -0.746845
1  1.150648  0.977842  0.861825
2  1.823679       NaN  1.272047
3  0.293133       NaN  0.273399
4  0.235116       NaN       NaN
5  1.365186       NaN       NaN

In [35]: df.fillna(method='ffill')
Out[35]: 
          0         1         2
0 -0.081265 -0.820770 -0.746845
1  1.150648  0.977842  0.861825
2  1.823679  0.977842  1.272047
3  0.293133  0.977842  0.273399
4  0.235116  0.977842  0.273399
5  1.365186  0.977842  0.273399

In [36]: df.fillna(method='ffill', limit=2)
Out[36]: 
          0         1         2
0 -0.081265 -0.820770 -0.746845
1  1.150648  0.977842  0.861825
2  1.823679  0.977842  1.272047
3  0.293133  0.977842  0.273399
4  0.235116       NaN  0.273399
5  1.365186       NaN  0.273399

In [37]: data = pd.Series([1., NA, 3.5, NA, 7])

In [38]: data.fillna(data.mean())
Out[38]: 
0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
Argument Description
value Scalar value or dict-like object to use to fill missing values
method Interpolation; by default 'ffill' if function called with no other arguments
axis Axis to fill on; default axis=0
inplace Modify the calling object without producing a copy
limit For forward and backward filling, maximum number of consecutive periods to fill

数据转换

去重

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
In [39]: data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
   ....: 'k2': [1, 1, 2, 3, 3, 4, 4]})

In [40]: data
Out[40]: 
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
6  two   4

In [41]: data.duplicated()
Out[41]: 
0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

In [42]: data.drop_duplicates()
Out[42]: 
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4

In [43]: data['v1'] = range(7)

In [44]: data.drop_duplicates(['k1'])
Out[44]: 
    k1  k2  v1
0  one   1   0
1  two   1   1

In [45]: data.drop_duplicates(['k1', 'k2'], keep='last')
Out[45]: 
    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
6  two   4   6

使用函数或者映射(map)转换数据

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
import pandas as np

import pandas as pd

data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
'Pastrami', 'corned beef', 'Bacon',
'pastrami', 'honey ham', 'nova lox'],
'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})

data
Out[5]: 
          food  ounces
0        bacon     4.0
1  pulled pork     3.0
2        bacon    12.0
3     Pastrami     6.0
4  corned beef     7.5
5        Bacon     8.0
6     pastrami     3.0
7    honey ham     5.0
8     nova lox     6.0

meat_to_animal = {
'bacon': 'pig',
'pulled pork': 'pig',
'pastrami': 'cow',
'corned beef': 'cow',
'honey ham': 'pig',
'nova lox': 'salmon'
}

lowercased = data['food'].str.lower()

lowercased
Out[8]: 
0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object

data['animal'] = lowercased.map(meat_to_animal)

data
Out[10]: 
          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     Pastrami     6.0     cow
4  corned beef     7.5     cow
5        Bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon

data['food'].map(lambda x: meat_to_animal[x.lower()])
Out[11]: 
0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object

替换

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
In [2]: import pandas as pd

In [3]: import numpy as np

In [4]: data = pd.Series([1., -999., 2., -999., -1000., 3.])

In [5]: data
Out[5]: 
0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

In [6]: data.replace(-999, np.nan)
Out[6]: 
0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

In [7]: data.replace([-999, -1000], np.nan)
Out[7]: 
0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

In [8]: data.replace([-999, -1000], [np.nan, 0])
Out[8]: 
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

In [9]: data.replace({-999: np.nan, -1000: 0})
Out[9]: 
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

索引和列名修改

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
In [2]: import pandas as pd

In [3]: import numpy as np



In [10]: data = pd.DataFrame(np.arange(12).reshape((3, 4)),
   ....: index=['Ohio', 'Colorado', 'New York'],
   ....: columns=['one', 'two', 'three', 'four'])

In [11]: data
Out[11]: 
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
New York    8    9     10    11

In [5]: data.replace(4, 40)
Out[5]: 
          one  two  three  four
Ohio        0    1      2     3
Colorado   40    5      6     7
New York    8    9     10    11


In [12]: transform = lambda x: x[:4].upper()

In [13]: data.index.map(transform)
Out[13]: Index(['OHIO', 'COLO', 'NEW '], dtype='object')

In [14]: data
Out[14]: 
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
New York    8    9     10    11

In [15]: data.index = data.index.map(transform)

In [16]: data
Out[16]: 
      one  two  three  four
OHIO    0    1      2     3
COLO    4    5      6     7
NEW     8    9     10    11

In [17]: data.rename(index=str.title, columns=str.upper)
Out[17]: 
      ONE  TWO  THREE  FOUR
Ohio    0    1      2     3
Colo    4    5      6     7
New     8    9     10    11

In [18]: data.rename(index={'OHIO': 'INDIANA'}, columns={'three': 'peekaboo'})
Out[18]: 
         one  two  peekaboo  four
INDIANA    0    1         2     3
COLO       4    5         6     7
NEW        8    9        10    11

In [19]: data.rename(index={'OHIO': 'INDIANA'}, inplace=True)

In [20]: data
Out[20]: 
         one  two  three  four
INDIANA    0    1      2     3
COLO       4    5      6     7
NEW        8    9     10    11

离散化和面元划分

以下暂略

字符串处理

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
In [7]: data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
   ...: 'Rob': 'rob@gmail.com', 'Wes': np.nan}

In [8]: data = pd.Series(data)

In [9]: data
Out[9]: 
Dave     dave@google.com
Rob        rob@gmail.com
Steve    steve@gmail.com
Wes                  NaN
dtype: object

In [10]: data.isnull()
Out[10]: 
Dave     False
Rob      False
Steve    False
Wes       True
dtype: bool

In [11]: data.str.contains('gmail')
Out[11]: 
Dave     False
Rob       True
Steve     True
Wes        NaN
dtype: object

In [12]: pattern = '([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'

In [13]: data.str.findall(pattern, flags=re.IGNORECASE)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-13-085c16e4dbfe> in <module>()
----> 1 data.str.findall(pattern, flags=re.IGNORECASE)

NameError: name 're' is not defined

In [14]: import re

In [15]: data.str.findall(pattern, flags=re.IGNORECASE)
Out[15]: 
Dave     [(dave, google, com)]
Rob        [(rob, gmail, com)]
Steve    [(steve, gmail, com)]
Wes                        NaN
dtype: object

In [16]: matches = data.str.match(pattern, flags=re.IGNORECASE)

In [17]: matches
Out[17]: 
Dave     True
Rob      True
Steve    True
Wes       NaN
dtype: object

In [18]: matches.str.get(1)
Out[18]: 
Dave    NaN
Rob     NaN
Steve   NaN
Wes     NaN
dtype: float64

In [19]: matches.str[0]
Out[19]: 
Dave    NaN
Rob     NaN
Steve   NaN
Wes     NaN
dtype: float64

In [20]: data.str[:5]
Out[20]: 
Dave     dave@
Rob      rob@g
Steve    steve
Wes        NaN
dtype: object
Method Description
cat Concatenate strings element-wise with optional delimiter
contains Return boolean array if each string contains pattern/regex
count Count occurrences of pattern
extract Use a regular expression with groups to extract one or more strings from a Series of strings; the result will be a DataFrame with one column per group
endswith Equivalent to x.endswith(pattern) for each element
startswith Equivalent to x.startswith(pattern) for each element
findall Compute list of all occurrences of pattern/regex for each string
get Index into each element (retrieve i-th element)
isalnum Equivalent to built-in str.alnum
isalpha Equivalent to built-in str.isalpha
isdecimal Equivalent to built-in str.isdecimal
isdigit Equivalent to built-in str.isdigit
islower Equivalent to built-in str.islower
isnumeric Equivalent to built-in str.isnumeric
isupper Equivalent to built-in str.isupper
join Join strings in each element of the Series with passed separator
len Compute length of each string
lower, upper Convert cases; equivalent to x.lower() or x.upper() for each element
match Use re.match with the passed regular expression on each element, returning matched groups as list
pad Add whitespace to left, right, or both sides of strings
center Equivalent to pad(side='both')
repeat Duplicate values (e.g., s.str.repeat(3) is equivalent to x * 3 for each string)
replace Replace occurrences of pattern/regex with some other string
slice Slice each string in the Series
split Split strings on delimiter or regular expression
strip Trim whitespace from both sides, including newlines
rstrip Trim whitespace on right side
lstrip Trim whitespace on left side

数据争夺:连接,合并,和重塑

在许多应用程序中,数据可能分布在多个文件或数据库中,或者是以不易分析的形式。 本章重点介绍连接,合并,和重塑。

分层索引

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
import pandas as pd

import numpy as np

data = pd.Series(np.random.randn(9), index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],[1, 2, 3, 1, 3, 1, 2, 2, 3]])

data
Out[5]: 
a  1   -1.111004
   2   -0.451764
   3   -0.501180
b  1    1.007739
   3    0.407470
c  1   -0.307985
   2    0.608742
d  2    1.432663
   3   -1.660043
dtype: float64

data['b']
Out[6]: 
1    1.007739
3    0.407470
dtype: float64

data['b':'c']
Out[7]: 
b  1    1.007739
   3    0.407470
c  1   -0.307985
   2    0.608742
dtype: float64

data.loc[['b', 'd']]
Out[8]: 
b  1    1.007739
   3    0.407470
d  2    1.432663
   3   -1.660043
dtype: float64

data.loc[:, 2]
Out[9]: 
a   -0.451764
c    0.608742
d    1.432663
dtype: float64

data.unstack()
Out[10]: 
          1         2         3
a -1.111004 -0.451764 -0.501180
b  1.007739       NaN  0.407470
c -0.307985  0.608742       NaN
d       NaN  1.432663 -1.660043

data.unstack().stack()
Out[11]: 
a  1   -1.111004
   2   -0.451764
   3   -0.501180
b  1    1.007739
   3    0.407470
c  1   -0.307985
   2    0.608742
d  2    1.432663
   3   -1.660043
dtype: float64

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
columns=[['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']])

frame
Out[13]: 
     Ohio     Colorado
    Green Red    Green
a 1     0   1        2
  2     3   4        5
b 1     6   7        8
  2     9  10       11

frame.index.names = ['key1', 'key2']

frame.columns.names = ['state', 'color']

frame
Out[16]: 
state      Ohio     Colorado
color     Green Red    Green
key1 key2                   
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11

frame['Ohio']
Out[17]: 
color      Green  Red
key1 key2            
a    1         0    1
     2         3    4
b    1         6    7
     2         9   10
  • 重新排序和排序级别

以下暂略

联结和合并数据集

  • pandas.merge可根据键将不同DataFrame中的行连接起来。SQL或其他关系型数据库的用户对此应该会比较 熟悉,因为它实现的就是数据库的连接操作。

  • pandas.concat可以沿轴将多个对象堆叠到一起。

  • 实例方法combine_first可以将重复数据编接在一起,用一个对象中的值填充另一个对象中的缺失值。

数据库风格的DataFrame合并

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
'data1': range(7)})

df1
Out[20]: 
   data1 key
0      0   b
1      1   b
2      2   a
3      3   c
4      4   a
5      5   a
6      6   b

df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})

df2
Out[22]: 
   data2 key
0      0   a
1      1   b
2      2   d

pd.merge(df1, df2)
Out[23]: 
   data1 key  data2
0      0   b      1
1      1   b      1
2      6   b      1
3      2   a      0
4      4   a      0
5      5   a      0

pd.merge(df1, df2, on='key')
Out[24]: 
   data1 key  data2
0      0   b      1
1      1   b      1
2      6   b      1
3      2   a      0
4      4   a      0
5      5   a      0

df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
'data1': range(7)})

df4 = pd.DataFrame({'rkey': ['a', 'b', 'd'], 'data2': range(3)})

pd.merge(df3, df4, left_on='lkey', right_on='rkey')
Out[27]: 
   data1 lkey  data2 rkey
0      0    b      1    b
1      1    b      1    b
2      6    b      1    b
3      2    a      0    a
4      4    a      0    a
5      5    a      0    a

pd.merge(df1, df2, how='outer')
Out[28]: 
   data1 key  data2
0    0.0   b    1.0
1    1.0   b    1.0
2    6.0   b    1.0
3    2.0   a    0.0
4    4.0   a    0.0
5    5.0   a    0.0
6    3.0   c    NaN
7    NaN   d    2.0
Option Behavior
'inner' Use only the key combinations observed in both tables
'left' Use all key combinations found in the left table
'right' Use all key combinations found in the right table
'output' Use all key combinations observed in both tables together

上面是多对一合并,下面看下多对多

数据聚合与分组运算

对数据集进行分组并对各组应用函数(无论是聚合还是转换)是数据分析工作中的重要环节。在将数据集准备好之后,通常的任务就是计算分组统计或生成透视表。

pandas提供了灵活高效的gruopby功能,能以自然的方式对数据集进行切片、切块、摘要等操作。关系型数据库和SQL(Structured Query Language,结构化查询语言)能够如此流行的原因之一就是其能够方便地对数据进行连接、过滤、转换和聚合。但是,像SQL这样的查询语 言所能执行的分组运算的种类很有限。

在本章中你将会看到,由于Python和pandas强大的表达能力,我们可以执行复杂得多的分组运算(利用任何可以接受pandas对象或NumPy数组的函数)。在本章中,你将会学到:

  • 根据一个或多个键(可以是函数、数组或DataFrame列名)拆分pandas对象。
  • 计算分组摘要统计,如计数、平均值、标准差,或用户自定义函数。
  • 对DataFrame的列应用各种各样的函数。
  • 应用组内转换或其他运算,如规格化、线性回归、排名或选取子集等。
  • 计算透视表或交叉表。
  • 执行分位数分析以及其他分组分析。

分组技术

Hadley Wickham(许多热门R语言包的作者)创造了用于表示分组运算的术语"split-apply-combine"(拆分-应用-合并),我觉得这个词很好地描述了整个过程。分组运算的第一个阶段,pandas对象(无论是Series、DataFrame还是其他的)中的数据会根据你所提供的一个或多个键被拆分(split)为多组。拆分操作是在对象的特定轴上执行的。例如,DataFrame可以在其行(axis=0)或列(axis=1)上进行分组。然后,将一个函数应用(apply)到各个分组并产生一个新值。最后,所有这些函数的执行结果会被合并(combine)到最终的结果对象中。结果对象的形式一般取决于数据上所执行的操作。图9-1大致说明了一个简单的分组聚合过程。

捕获.PNG

分组键可以有多种形式,且类型不必相同: * 列表或数组,其长度与待分组的轴一样。 * 表示DataFrame某个列名的值。 * 字典或Series,给出待分组轴上的值与分组名之间的对应关系。 * 函数,用于处理轴索引或索引中的各个标签。

注意,后三种都只是快捷方式而已,其最终目的仍然是产生一组用于拆分对象的值。如果觉得这些东西看起来很抽象,不用担心,我将在本章中给出大量有关于此的示例。首先来看看下面这个非常简单的表格型数据集(以DataFrame的形式)。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],'key2' : ['one', 'two', 'one', 'two', 'one'],
'data1' : np.random.randn(5),'data2' : np.random.randn(5)})

df
Out[32]: 
      data1     data2 key1 key2
0 -0.592555  0.537886    a  one
1  0.286764  1.498792    a  two
2 -0.149658  0.847675    b  one
3  0.961803 -1.218945    b  two
4  0.896790  1.461441    a  one

假设你想要按key1进行分组,并计算data1列的平均值。实现该功能的方式有很多,而我们这里要用的是:访问data1,并根据key1调用groupby。

1
2
3
4
grouped = df['data1'].groupby(df['key1'])

grouped
Out[34]: <pandas.core.groupby.SeriesGroupBy object at 0x000001937BF46E48>

变量grouped是一个GroupBy对象。它实际上还没有进行任何计算,只是含有一些有关分组键df['key1']的中间数据而已。换句话说,该对象已经有了接下来对各分组执行运算所需的一切信息。例如,我们可以调用GroupBy的mean方法来计算分组平均值:

1
2
3
4
5
6
grouped.mean()
Out[35]: 
key1
a    0.197000
b    0.406073
Name: data1, dtype: float64

数据(Series)根据分组键进行了聚合,产生了新的Series,其索引为key1列中的唯一值。

如果我们一次传入多个数组,就会得到不同的结果:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
means = df['data1'].groupby([df['key1'], df['key2']]).mean()

means
Out[38]: 
key1  key2
a     one     0.152117
      two     0.286764
b     one    -0.149658
      two     0.961803
Name: data1, dtype: float64

means.unstack()
Out[39]: 
key2       one       two
key1                    
a     0.152117  0.286764
b    -0.149658  0.961803

states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])

years = np.array([2005, 2005, 2006, 2005, 2006])

df['data1'].groupby([states, years]).mean()
Out[42]: 
California  2005    0.286764
            2006   -0.149658
Ohio        2005    0.184624
            2006    0.896790
Name: data1, dtype: float64

更常用的是列名(可以是字符串、数字或其他Python对象)用作分组键:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
df.groupby('key1').mean()
Out[44]: 
         data1     data2
key1                    
a     0.197000  1.166040
b     0.406073 -0.185635

df.groupby(['key1', 'key2']).mean()
Out[45]: 
              data1     data2
key1 key2                    
a    one   0.152117  0.999663
     two   0.286764  1.498792
b    one  -0.149658  0.847675
     two   0.961803 -1.218945

df.groupby(['key1', 'key2']).size()
Out[46]: 
key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64

你可能已经注意到在执行df.groupby('key1').mean()时,结果中没有key2列。这是因为df['key2']不是数值数据(俗称“麻烦列”),所以被从结果中排除了。默认情况下,所有数值列都会被聚合,虽然有时可能会被过滤为一个子集(稍后就会讲到)。分组键中的任何缺失值都会被排除在结果之外。

  • 分组迭代

GroupBy对象支持迭代,可以产生一组二元元组(由分组名和数据块组成)。看看下面这个简单的数据集:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
for name, group in df.groupby('key1'):
    print(name)
    print(group)

a
      data1     data2 key1 key2
0 -0.592555  0.537886    a  one
1  0.286764  1.498792    a  two
4  0.896790  1.461441    a  one
b
      data1     data2 key1 key2
2 -0.149658  0.847675    b  one
3  0.961803 -1.218945    b  two

对于多重键的情况,元组的第一个元素将会是由键值组成的元组:

links