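Learning the Pandas Library
A pandas Series is essentially a one-dimensional array with an index (a label for each element) attached; a DataFrame is a two-dimensional array with labels on both its rows and its columns. (By default, the index is the integers counting from 0.)
1. Pandas Attributes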
Series
import pandas as pd
import numpy as np
a = pd.Series([1,2,3,np.nan,4],index=['a','b','c','d','e'])
print(a)
Output:
a    1.0
b    2.0
c    3.0
d    NaN
e    4.0
dtype: float64
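To make the role of the index more concrete, the sketch below builds one Series without an explicit index, so it gets the default 0-based labels, and one with string labels that can be used for lookup. The variable names `s_default` and `s_labeled` are made up for this illustration.

```python
import numpy as np
import pandas as pd

# Without an explicit index, pandas assigns the default labels 0, 1, 2, ...
s_default = pd.Series([10, 20, 30])
default_labels = list(s_default.index)

# With an explicit index, values can be looked up by label.
s_labeled = pd.Series([1, 2, 3, np.nan, 4], index=['a', 'b', 'c', 'd', 'e'])
value_at_b = s_labeled['b']

# NaN is a floating-point value, so the whole Series is stored as float64.
print(default_labels)          # [0, 1, 2]
print(value_at_b)              # 2.0
print(s_labeled.dtype)         # float64
print(s_labeled.isna().sum())  # 1
```

This also explains why the integers in the example above are printed as 1.0, 2.0 and so on: a single NaN is enough to make the whole Series floating point.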
DataFrame
import pandas as pd
import numpy as np
np.random.seed(1)
a = pd.DataFrame(np.random.randint(0,5,(4,3)), index=['a','b','c','d'], columns=['cat', 'dog', 'car'])
print('DataFrame:\n',a)
print('Row labels (index):\n',a.index)
print('Column labels:\n',a.columns)
print('Values:\n',a.values)
print('Summary statistics per column:\n',a.describe())
print('Sorted by column label:\n',a.sort_index(axis=1)) # axis=1 sorts the columns; axis=0 would sort the rows
print('Sorted by car, descending:\n',a.sort_values(by='car',ascending=False))
print('Transposed:\n',a.T)
Output:
DataFrame:
   cat  dog  car
a    3    4    0
b    1    3    0
c    0    1    4
d    4    1    2
Row labels (index):
Index(['a', 'b', 'c', 'd'], dtype='object')
Column labels:
Index(['cat', 'dog', 'car'], dtype='object')
Values:
[[3 4 0]
 [1 3 0]
 [0 1 4]
 [4 1 2]]
Summary statistics per column:
            cat   dog       car
count  4.000000  4.00  4.000000
mean   2.000000  2.25  1.500000
std    1.825742  1.50  1.914854
min    0.000000  1.00  0.000000
25%    0.750000  1.00  0.000000
50%    2.000000  2.00  1.000000
75%    3.250000  3.25  2.500000
max    4.000000  4.00  4.000000
Sorted by column label:
   car  cat  dog
a    0    3    4
b    0    1    3
c    4    0    1
d    2    4    1
Sorted by car, descending:
   cat  dog  car
c    0    1    4
d    4    1    2
a    3    4    0
b    1    3    0
Transposed:
     a  b  c  d
cat  3  1  0  4
dog  4  3  1  1
car  0  0  4  2
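The `axis` argument of `sort_index` decides which set of labels is sorted: `axis=0` (the default) reorders rows by row label, while `axis=1` reorders columns by column label. A minimal sketch on a small hand-written frame (the data is made up for illustration and is not the random frame used above):

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]],
    index=['y', 'x'],
    columns=['cat', 'dog', 'car'],
)

by_rows = df.sort_index(axis=0)  # reorder rows by row label: x, y
by_cols = df.sort_index(axis=1)  # reorder columns by column label: car, cat, dog

print(list(by_rows.index))    # ['x', 'y']
print(list(by_cols.columns))  # ['car', 'cat', 'dog']
print(df.shape)               # (2, 3)
```

Both calls return a new DataFrame; `df` itself keeps its original order.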
2. Pandas Data Operations
Selecting Data
import pandas as pd
import numpy as np
np.random.seed(1)
a = pd.DataFrame(np.random.randint(0,5,(4,3)), index=['a','b','c','d'], columns=['cat', 'dog', 'car'])
print('DataFrame:\n',a)
print('Select columns:\n',a.car,'\n',a[['cat','dog']])
print('Select rows by slicing:\n',a[0:2],'\n',a['c':'d']) # a position slice excludes the stop; a label slice includes it
print('-------------------------')
print('Select by label with loc:')
print('Rows:\n',a.loc[['b','d'],:])
print('Rows and columns:\n',a.loc[['a','c'],['cat','car']])
print('Single cell:\n',a.loc['a','car'])
print('-------------------------')
print('Select by position with iloc:')
print('Rows and columns:\n',a.iloc[[1,3],0:2])
print('-------------------------')
print('Boolean selection:')
print(a[a.cat>2])
Output:
DataFrame:
   cat  dog  car
a    3    4    0
b    1    3    0
c    0    1    4
d    4    1    2
Select columns:
a    0
b    0
c    4
d    2
Name: car, dtype: int32
   cat  dog
a    3    4
b    1    3
c    0    1
d    4    1
Select rows by slicing:
   cat  dog  car
a    3    4    0
b    1    3    0
   cat  dog  car
c    0    1    4
d    4    1    2
-------------------------
Select by label with loc:
Rows:
   cat  dog  car
b    1    3    0
d    4    1    2
Rows and columns:
   cat  car
a    3    0
c    0    4
Single cell:
0
-------------------------
Select by position with iloc:
Rows and columns:
   cat  dog
b    1    3
d    4    1
-------------------------
Boolean selection:
   cat  dog  car
a    3    4    0
d    4    1    2
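One detail that is easy to miss in the selections above is that label slices and position slices treat their endpoints differently. The sketch below hard-codes the same values as the frame above (so it does not depend on the random seed) and also shows how two boolean conditions can be combined; the names `by_label`, `by_position` and `both` are only for this illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [[3, 4, 0], [1, 3, 0], [0, 1, 4], [4, 1, 2]],
    index=['a', 'b', 'c', 'd'],
    columns=['cat', 'dog', 'car'],
)

# Label slices include both endpoints: rows 'a', 'b' and 'c'.
by_label = df.loc['a':'c']

# Position slices exclude the stop position: rows 0 and 1 only.
by_position = df.iloc[0:2]

# Combine conditions with & (and) or | (or); each condition needs parentheses.
both = df[(df['cat'] > 2) & (df['car'] > 0)]

print(list(by_label.index))     # ['a', 'b', 'c']
print(list(by_position.index))  # ['a', 'b']
print(list(both.index))         # ['d']
```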
Combining DataFrames
import pandas as pd
import numpy as np
np.random.seed(1)
a = pd.DataFrame(np.random.randint(0,5,(2,3)), columns=['cat', 'dog', 'car'])
b = pd.DataFrame(np.random.randint(0,5,(4,3)), columns=['cat', 'dog', 'car'])
print('Before concatenation:\n',a)
print(b)
print('After concatenation:\n',pd.concat([a,b]))
print('After concatenation (index ignored):\n',pd.concat([a,b],axis=0,ignore_index=True))
Output:
Before concatenation:
   cat  dog  car
0    3    4    0
1    1    3    0
   cat  dog  car
0    0    1    4
1    4    1    2
2    4    2    4
3    3    4    2
After concatenation:
   cat  dog  car
0    3    4    0
1    1    3    0
0    0    1    4
1    4    1    2
2    4    2    4
3    3    4    2
After concatenation (index ignored):
   cat  dog  car
0    3    4    0
1    1    3    0
2    0    1    4
3    4    1    2
4    4    2    4
5    3    4    2
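Besides `ignore_index`, `pd.concat` has two other options that are often useful when stacking frames: `keys`, which records where each row came from, and `axis=1`, which places frames side by side. A minimal sketch with small hand-written frames (the values are illustrative):

```python
import pandas as pd

a = pd.DataFrame([[3, 4, 0], [1, 3, 0]], columns=['cat', 'dog', 'car'])
b = pd.DataFrame([[0, 1, 4], [4, 1, 2]], columns=['cat', 'dog', 'car'])

# Plain concat keeps each frame's own labels, so 0 and 1 appear twice.
stacked = pd.concat([a, b])
print(stacked.index.is_unique)  # False

# keys= adds an outer index level recording which frame each row came from.
tagged = pd.concat([a, b], keys=['a', 'b'])
print(tagged.loc['b'])

# axis=1 places the frames side by side, aligning rows on the index.
side_by_side = pd.concat([a, b], axis=1)
print(side_by_side.shape)  # (2, 6)
```

Duplicate row labels are legal in pandas, but label-based lookups such as `stacked.loc[0]` then return several rows, which is one reason to use `ignore_index=True` or `keys`.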
import pandas as pd
a = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], index=[1,2,3], columns=['a', 'b', 'c'])
b = pd.DataFrame([[2,3,4],[5,6,7],[8,9,10]], index=[2,3,4], columns=['b', 'c', 'd'])
print('Before combining:\n',a)
print(b)
print('Concatenated (outer join, union of columns):\n',pd.concat([a,b],ignore_index=True))
print('Concatenated (inner join, shared columns only):\n',pd.concat([a,b],join='inner',ignore_index=True))
print('Merged on columns b and c:\n',pd.merge(a,b,on=['b','c']))
Output:
Before combining:
   a  b  c
1  1  2  3
2  4  5  6
3  7  8  9
   b  c   d
2  2  3   4
3  5  6   7
4  8  9  10
Concatenated (outer join, union of columns):
     a  b  c     d
0  1.0  2  3   NaN
1  4.0  5  6   NaN
2  7.0  8  9   NaN
3  NaN  2  3   4.0
4  NaN  5  6   7.0
5  NaN  8  9  10.0
Concatenated (inner join, shared columns only):
   b  c
0  2  3
1  5  6
2  8  9
3  2  3
4  5  6
5  8  9
Merged on columns b and c:
   a  b  c   d
0  1  2  3   4
1  4  5  6   7
2  7  8  9  10
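`pd.merge` performs an inner join by default, so only rows whose key values appear in both frames are kept (in the example above every row happens to match). The sketch below uses the `how` and `indicator` parameters of `pd.merge` to show what happens to unmatched rows; the frames and the column names `key`, `left_val` and `right_val` are invented for this illustration.

```python
import pandas as pd

left = pd.DataFrame({'key': [1, 2, 3], 'left_val': ['a', 'b', 'c']})
right = pd.DataFrame({'key': [2, 3, 4], 'right_val': ['x', 'y', 'z']})

# Default how='inner': keep only keys present in both frames.
inner = pd.merge(left, right, on='key')

# how='outer' keeps every key; indicator=True adds a '_merge' column
# saying whether each row came from the left frame, the right frame, or both.
outer = pd.merge(left, right, on='key', how='outer', indicator=True)

print(list(inner['key']))  # [2, 3]
print(outer)
```

Rows that exist on only one side get NaN in the columns contributed by the other frame.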
Modifying Values
import pandas as pd
import numpy as np
np.random.seed(1)
a = pd.DataFrame(np.random.randint(0,5,(4,3)), index=['a','b','c','d'], columns=['cat', 'dog', 'car'])
print('DataFrame:\n',a)
print('-------------------------')
a.loc[a.cat>2,'dog']=10 # in rows where cat is greater than 2, set dog to 10 (a single .loc call avoids chained assignment)
a.loc[['c','d'],'car']=15
print('After modifying values:\n',a)
print('-------------------------')
a['lion']=pd.Series([6,np.nan,8,9],index=['a','b','c','d']) # add a column
print('After adding a column:\n',a)
a.loc['e']=['x','y','z',np.nan] # add a row
print('After adding a row:\n',a)
Output:
DataFrame:
   cat  dog  car
a    3    4    0
b    1    3    0
c    0    1    4
d    4    1    2
-------------------------
After modifying values:
   cat  dog  car
a    3   10    0
b    1    3    0
c    0    1   15
d    4   10   15
-------------------------
After adding a column:
   cat  dog  car  lion
a    3   10    0   6.0
b    1    3    0   NaN
c    0    1   15   8.0
d    4   10   15   9.0
After adding a row:
  cat dog car  lion
a   3  10   0   6.0
b   1   3   0   NaN
c   0   1  15   8.0
d   4  10  15   9.0
e   x   y   z   NaN
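The operations above all change the DataFrame in place. As a complementary sketch, the example below shows a conditional update written as a single `.loc` call, followed by `assign` and `drop`, which return new frames and leave the original untouched. The values are hard-coded to match the frame above, and the column name `total` is made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [[3, 4, 0], [1, 3, 0], [0, 1, 4], [4, 1, 2]],
    index=['a', 'b', 'c', 'd'],
    columns=['cat', 'dog', 'car'],
)

# Select rows with a boolean mask and the column by label in one .loc call,
# so the assignment is applied to df itself.
df.loc[df['cat'] > 2, 'dog'] = 10

# assign() returns a new frame with the extra column; df is left unchanged.
with_total = df.assign(total=df['cat'] + df['dog'] + df['car'])

# drop() also returns a new frame, here without row 'b' and column 'car'.
trimmed = df.drop(index='b', columns='car')

print(df['dog'].tolist())            # [10, 3, 1, 10]
print(with_total['total'].tolist())  # [13, 4, 5, 16]
print(trimmed.shape)                 # (3, 2)
```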
Handling Missing Data
import pandas as pd
import numpy as np
np.random.seed(1)
a = pd.DataFrame(np.random.randint(0,5,(4,3)), index=['a','b','c','d'], columns=['cat', 'dog', 'car'])
a.loc[['b','d'],'cat']=np.nan
a.loc['c','dog']=np.nan
print('DataFrame:\n',a)
print('-------------------------')
print('Which cells are NaN:\n',a.isnull())
print('Is any cell NaN:\n',a.isnull().values.any())
print('-------------------------')
print('(1) Drop')
print('Drop rows containing NaN:\n',a.dropna(axis=0, how='any')) # drop if any value is NaN
print('Drop columns containing NaN:\n',a.dropna(axis=1, how='any'))
print('Drop columns that are entirely NaN:\n',a.dropna(axis=1, how='all')) # drop only if every value is NaN
print('-------------------------')
print('(2) Fill')
print('NaN filled with 10:\n',a.fillna(value=10))
Output:
DataFrame:
   cat  dog  car
a  3.0  4.0    0
b  NaN  3.0    0
c  0.0  NaN    4
d  NaN  1.0    2
-------------------------
Which cells are NaN:
     cat    dog    car
a  False  False  False
b   True  False  False
c  False   True  False
d   True  False  False
Is any cell NaN:
True
-------------------------
(1) Drop
Drop rows containing NaN:
   cat  dog  car
a  3.0  4.0    0
Drop columns containing NaN:
   car
a    0
b    0
c    4
d    2
Drop columns that are entirely NaN:
   cat  dog  car
a  3.0  4.0    0
b  NaN  3.0    0
c  0.0  NaN    4
d  NaN  1.0    2
-------------------------
(2) Fill
NaN filled with 10:
   cat   dog  car
a   3.0   4.0    0
b  10.0   3.0    0
c   0.0  10.0    4
d  10.0   1.0    2
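Filling every gap with one constant is rarely what is wanted in practice. As a sketch of two common alternatives, the example below counts the missing values per column, fills each column with its own mean, and forward-fills from the previous row. The frame reproduces the NaN pattern shown above with hard-coded values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'cat': [3.0, np.nan, 0.0, np.nan],
     'dog': [4.0, 3.0, np.nan, 1.0],
     'car': [0, 0, 4, 2]},
    index=['a', 'b', 'c', 'd'],
)

# Count missing values per column.
missing_per_column = df.isna().sum()

# Fill each column's NaN with that column's mean (mean() skips NaN).
filled = df.fillna(df.mean())

# Forward fill: copy the last valid value above each NaN.
forward = df.ffill()

print(missing_per_column.tolist())  # [2, 1, 0]
print(filled.loc['b', 'cat'])       # 1.5
print(forward.loc['b', 'cat'])      # 3.0
```

Which strategy is appropriate depends on the data; mean filling and forward filling are shown here only as examples of the available options.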
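Importing and Exporting Data
A relative path such as 'df.pickle' is resolved against the current working directory, so the simplest setup is to keep the data file in the same directory as the .py file and run the script from there.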
import pandas as pd
import numpy as np
np.random.seed(1)
a = pd.DataFrame(np.random.randint(0,5,(4,3)), index=['a','b','c','d'], columns=['cat', 'dog', 'car'])
a.iloc[:,1:3].to_pickle('df.pickle')
b = pd.read_pickle('df.pickle')
print(b)
Output:
   dog  car
a    4    0
b    3    0
c    1    4
d    1    2
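Pickle files can only be read back from Python, so CSV is a common alternative when the data has to be shared with other tools. The sketch below is a minimal round trip with `to_csv` and `read_csv`; it writes to a temporary directory so nothing is left behind, and the file name `df.csv` is just a placeholder chosen for this illustration.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame(
    [[4, 0], [3, 0], [1, 4], [1, 2]],
    index=['a', 'b', 'c', 'd'],
    columns=['dog', 'car'],
)

# Write into a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'df.csv')  # placeholder file name
    df.to_csv(csv_path)
    # index_col=0 restores the first CSV column as the row index.
    restored = pd.read_csv(csv_path, index_col=0)

print(restored)
print(list(restored.index))  # ['a', 'b', 'c', 'd']
```

Without `index_col=0`, the row labels would come back as an ordinary column and the frame would get a fresh 0-based index.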