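Learning the Pandas Library

A Pandas Series is essentially a one-dimensional array with an index (labels) attached, and a DataFrame is a two-dimensional array with row and column labels attached. (If no index is given, the default labels are integers counting from 0.)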

1. Pandas attributes

Series

import pandas as pd
import numpy as np

a = pd.Series([1,2,3,np.nan,4],index=['a','b','c','d','e'])
print(a)

Output:
a    1.0
b    2.0
c    3.0
d    NaN
e    4.0
dtype: float64
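
As a small illustration of the default index mentioned above, here is a minimal sketch comparing a Series built without an index to one with explicit labels. The variable names `s_default` and `s_labeled` are made up for this example.

```python
import pandas as pd

# Hypothetical example data; the names are illustrative only.
# No index given: pandas assigns the default labels 0, 1, 2, ...
s_default = pd.Series([10, 20, 30])

# Explicit labels, in the same style as the example above
s_labeled = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

print(list(s_default.index))   # the default integer labels
print(s_labeled['b'])          # look up a value by label
print(s_labeled.iloc[1])       # look up the same value by position
print(s_labeled.to_numpy())    # the underlying one-dimensional NumPy array
```

Label lookup and position lookup return the same element here because 'b' is the second label.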

DataFrame

import pandas as pd
import numpy as np

np.random.seed(1)
a = pd.DataFrame(np.random.randint(0,5,(4,3)), index=['a','b','c','d'], columns=['cat', 'dog', 'car'])
print('DataFrame:\n',a)
print('Row labels (index):\n',a.index)
print('Column labels:\n',a.columns)
print('Values:\n',a.values)
print('Summary statistics per column:\n',a.describe())
print('Sorted by column labels (axis=1):\n',a.sort_index(axis=1))
print('Sorted by values of column car (descending):\n',a.sort_values(by='car',ascending=False))
print('Transposed:\n',a.T)

Output:
DataFrame:
    cat  dog  car
a    3    4    0
b    1    3    0
c    0    1    4
d    4    1    2
Row labels (index):
 Index(['a', 'b', 'c', 'd'], dtype='object')
Column labels:
 Index(['cat', 'dog', 'car'], dtype='object')
Values:
 [[3 4 0]
 [1 3 0]
 [0 1 4]
 [4 1 2]]
Summary statistics per column:
             cat   dog       car
count  4.000000  4.00  4.000000
mean   2.000000  2.25  1.500000
std    1.825742  1.50  1.914854
min    0.000000  1.00  0.000000
25%    0.750000  1.00  0.000000
50%    2.000000  2.00  1.000000
75%    3.250000  3.25  2.500000
max    4.000000  4.00  4.000000
Sorted by column labels (axis=1):
    car  cat  dog
a    0    3    4
b    0    1    3
c    4    0    1
d    2    4    1
Sorted by values of column car (descending):
    cat  dog  car
c    0    1    4
d    4    1    2
a    3    4    0
b    1    3    0
Transposed:
      a  b  c  d
cat  3  1  0  4
dog  4  3  1  1
car  0  0  4  2
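
The sort_index call above uses axis=1, which reorders the columns rather than the rows. The following minimal sketch contrasts the two axes on a tiny frame whose row and column labels are both deliberately out of order. The frame and the names `by_rows` and `by_cols` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical frame with unsorted row labels and unsorted column labels
df = pd.DataFrame([[1, 2], [3, 4]], index=['b', 'a'], columns=['y', 'x'])

by_rows = df.sort_index()         # axis=0 (the default): reorder rows by row label
by_cols = df.sort_index(axis=1)   # axis=1: reorder columns by column label

print(by_rows)
print(by_cols)
print(df.shape)                   # (number of rows, number of columns)
```

Neither call modifies df; each returns a new, reordered DataFrame.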

2. Pandas data operations

Selecting data

import pandas as pd
import numpy as np

np.random.seed(1)
a = pd.DataFrame(np.random.randint(0,5,(4,3)), index=['a','b','c','d'], columns=['cat', 'dog', 'car'])
print('DataFrame:\n',a)
print('Select columns:\n',a.car,'\n',a[['cat','dog']])
print('Select rows by slicing:\n',a[0:2],'\n',a['c':'d'])
print('-------------------------')
print('Select by label with loc:')
print('Select rows:\n',a.loc[['b','d'],:])
print('Select rows and columns:\n',a.loc[['a','c'],['cat','car']])
print('Select a single value:\n',a.loc['a','car'])
print('-------------------------')
print('Select by position with iloc:')
print('Select rows and columns by position:\n',a.iloc[[1,3],0:2])
print('-------------------------')
print('Boolean selection:')
print(a[a.cat>2])

Output:
DataFrame:
    cat  dog  car
a    3    4    0
b    1    3    0
c    0    1    4
d    4    1    2
Select columns:
 a    0
b    0
c    4
d    2
Name: car, dtype: int32 
    cat  dog
a    3    4
b    1    3
c    0    1
d    4    1
Select rows by slicing:
    cat  dog  car
a    3    4    0
b    1    3    0 
    cat  dog  car
c    0    1    4
d    4    1    2
-------------------------
Select by label with loc:
Select rows:
    cat  dog  car
b    1    3    0
d    4    1    2
Select rows and columns:
    cat  car
a    3    0
c    0    4
Select a single value:
 0
-------------------------
Select by position with iloc:
Select rows and columns by position:
    cat  dog
b    1    3
d    4    1
-------------------------
Boolean selection:
   cat  dog  car
a    3    4    0
d    4    1    2
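
One detail that is easy to miss in the selection examples above: a label slice with loc includes its end label, while a position slice with iloc excludes its end position. The sketch below shows this on a small frame and also combines two Boolean conditions. The data and the names `by_label`, `by_position`, and `both` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical data, similar in shape to the frame used above
df = pd.DataFrame({'cat': [3, 1, 0, 4], 'dog': [4, 3, 1, 1]},
                  index=['a', 'b', 'c', 'd'])

by_label = df.loc['a':'c']     # label slice: rows a, b and c (end label included)
by_position = df.iloc[0:2]     # position slice: rows 0 and 1 (end position excluded)

# Boolean masks combine with & (and) and | (or); each condition needs parentheses
both = df[(df['cat'] > 0) & (df['dog'] > 1)]

print(by_label)
print(by_position)
print(both)
```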

Combining DataFrames

import pandas as pd
import numpy as np

np.random.seed(1)
a = pd.DataFrame(np.random.randint(0,5,(2,3)), columns=['cat', 'dog', 'car'])
b = pd.DataFrame(np.random.randint(0,5,(4,3)), columns=['cat', 'dog', 'car'])
print('Before concatenation:\n',a)
print(b)
print('After concatenation:\n',pd.concat([a,b]))
print('After concatenation (index ignored):\n',pd.concat([a,b],axis=0,ignore_index=True))

Output:
Before concatenation:
    cat  dog  car
0    3    4    0
1    1    3    0
   cat  dog  car
0    0    1    4
1    4    1    2
2    4    2    4
3    3    4    2
After concatenation:
    cat  dog  car
0    3    4    0
1    1    3    0
0    0    1    4
1    4    1    2
2    4    2    4
3    3    4    2
After concatenation (index ignored):
    cat  dog  car
0    3    4    0
1    1    3    0
2    0    1    4
3    4    1    2
4    4    2    4
5    3    4    2

import pandas as pd
import numpy as np

a = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], index=[1,2,3], columns=['a', 'b', 'c'])
b = pd.DataFrame([[2,3,4],[5,6,7],[8,9,10]], index=[2,3,4], columns=['b', 'c', 'd'])
print('Before combining:\n',a)
print(b)
print('After concat (outer join, union of columns):\n',pd.concat([a,b],ignore_index=True))
print('After concat (inner join, shared columns only):\n',pd.concat([a,b],join='inner',ignore_index=True))
print('After merge (on columns b and c):\n',pd.merge(a,b,on=['b','c']))

Output:
Before combining:
    a  b  c
1  1  2  3
2  4  5  6
3  7  8  9
   b  c   d
2  2  3   4
3  5  6   7
4  8  9  10
After concat (outer join, union of columns):
      a  b  c     d
0  1.0  2  3   NaN
1  4.0  5  6   NaN
2  7.0  8  9   NaN
3  NaN  2  3   4.0
4  NaN  5  6   7.0
5  NaN  8  9  10.0
After concat (inner join, shared columns only):
    b  c
0  2  3
1  5  6
2  8  9
3  2  3
4  5  6
5  8  9
After merge (on columns b and c):
    a  b  c   d
0  1  2  3   4
1  4  5  6   7
2  7  8  9  10
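
In the merge above every (b, c) pair appears in both frames, so the effect of the join type is not visible. The sketch below uses two made-up frames with only partly overlapping keys to contrast the default inner join with an outer join. The column names `key`, `left_val`, and `right_val` are assumptions for this example.

```python
import pandas as pd

# Hypothetical frames whose keys overlap only partly (2 and 3 are shared)
left = pd.DataFrame({'key': [1, 2, 3], 'left_val': ['a', 'b', 'c']})
right = pd.DataFrame({'key': [2, 3, 4], 'right_val': ['x', 'y', 'z']})

inner = pd.merge(left, right, on='key')               # default how='inner': shared keys only
outer = pd.merge(left, right, on='key', how='outer')  # all keys; gaps are filled with NaN

print(inner)
print(outer)
```

The inner result keeps only keys 2 and 3, while the outer result keeps all four keys and leaves NaN where one side has no matching row.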

Modifying values

import pandas as pd
import numpy as np

np.random.seed(1)
a = pd.DataFrame(np.random.randint(0,5,(4,3)), index=['a','b','c','d'], columns=['cat', 'dog', 'car'])
print('DataFrame:\n',a)
print('-------------------------')
a.loc[a.cat>2,'dog']=10 # in rows where cat is greater than 2, set dog to 10 (loc avoids chained assignment)
a.loc[['c','d'],'car']=15
print('After modifying values:\n',a)
print('-------------------------')
a['lion']=pd.Series([6,np.nan,8,9],index=['a','b','c','d']) # add a column
print('After adding a column:\n',a)
a.loc['e']=['x','y','z',np.nan] # add a row
print('After adding a row:\n',a)

Output:
DataFrame:
    cat  dog  car
a    3    4    0
b    1    3    0
c    0    1    4
d    4    1    2
-------------------------
After modifying values:
    cat  dog  car
a    3   10    0
b    1    3    0
c    0    1   15
d    4   10   15
-------------------------
After adding a column:
    cat  dog  car  lion
a    3   10    0   6.0
b    1    3    0   NaN
c    0    1   15   8.0
d    4   10   15   9.0
After adding a row:
   cat dog car  lion
a   3  10   0   6.0
b   1   3   0   NaN
c   0   1  15   8.0
d   4  10  15   9.0
e   x   y   z   NaN

Handling missing data

import pandas as pd
import numpy as np

np.random.seed(1)
a = pd.DataFrame(np.random.randint(0,5,(4,3)), index=['a','b','c','d'], columns=['cat', 'dog', 'car'])
a.loc[['b','d'],'cat']=np.nan
a.loc['c','dog']=np.nan
print('DataFrame:\n',a)
print('-------------------------')
print('NaN mask (isnull):\n',a.isnull())
print('Any NaN present:\n',a.isnull().values.any())
print('-------------------------')
print('(1) Drop missing data')
print('Drop rows containing NaN:\n',a.dropna(axis=0, how='any')) # drop if any value is NaN
print('Drop columns containing NaN:\n',a.dropna(axis=1, how='any'))
print('Drop columns that are entirely NaN:\n',a.dropna(axis=1, how='all')) # drop only if every value is NaN
print('-------------------------')
print('(2) Fill NaN')
print('NaN filled with 10:\n',a.fillna(value=10))

Output:
DataFrame:
    cat  dog  car
a  3.0  4.0    0
b  NaN  3.0    0
c  0.0  NaN    4
d  NaN  1.0    2
-------------------------
NaN mask (isnull):
      cat    dog    car
a  False  False  False
b   True  False  False
c  False   True  False
d   True  False  False
Any NaN present:
 True
-------------------------
(1) Drop missing data
Drop rows containing NaN:
    cat  dog  car
a  3.0  4.0    0
Drop columns containing NaN:
    car
a    0
b    0
c    4
d    2
Drop columns that are entirely NaN:
    cat  dog  car
a  3.0  4.0    0
b  NaN  3.0    0
c  0.0  NaN    4
d  NaN  1.0    2
-------------------------
(2) Fill NaN
NaN filled with 10:
      cat   dog  car
a   3.0   4.0    0
b  10.0   3.0    0
c   0.0  10.0    4
d  10.0   1.0    2
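
Filling every gap with one constant, as above, is the simplest option. A common alternative is to fill each column with its own mean. The sketch below shows that variant together with a per-column count of missing values. The data and the names `nan_counts` and `filled` are illustrative assumptions, not part of the original tutorial.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the same NaN pattern as the cat and dog columns above
df = pd.DataFrame({'cat': [3.0, np.nan, 0.0, np.nan],
                   'dog': [4.0, 3.0, np.nan, 1.0]},
                  index=['a', 'b', 'c', 'd'])

nan_counts = df.isnull().sum()   # number of NaN values in each column
filled = df.fillna(df.mean())    # mean() skips NaN; each column is filled with its own mean

print(nan_counts)
print(filled)
```

Passing a Series to fillna matches its labels against the column names, which is why each column receives a different fill value.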

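Importing and exporting data

The file name below is a relative path, so pandas reads and writes it in the current working directory. Running the script from its own folder keeps the data file in the same directory as the .py file.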

import pandas as pd
import numpy as np

np.random.seed(1)
a = pd.DataFrame(np.random.randint(0,5,(4,3)), index=['a','b','c','d'], columns=['cat', 'dog', 'car'])
a.iloc[:,1:3].to_pickle('df.pickle')
b = pd.read_pickle('df.pickle')
print(b)

Output:
   dog  car
a    4    0
b    3    0
c    1    4
d    1    2
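
Pickle is only one of the formats pandas can read and write. As a minimal sketch of the same round trip with CSV, the example below writes a frame to a temporary directory and reads it back. The file name `df.csv` and the variable names are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data matching the dog and car columns shown above
df = pd.DataFrame({'dog': [4, 3, 1, 1], 'car': [0, 0, 4, 2]},
                  index=['a', 'b', 'c', 'd'])

# A temporary directory keeps the example from leaving files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'df.csv')
    df.to_csv(csv_path)                             # the row labels are written as the first column
    restored = pd.read_csv(csv_path, index_col=0)   # use that first column as the index again

print(restored)
```

Without index_col=0, read_csv would treat the saved row labels as an ordinary data column and add a fresh 0-based index.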

Reference

bilibili: 莫烦python, Numpy & Pandas (data processing tutorial)