crosstab을 이용하여 없는 컬럼 포함 집계테이블 만들기

Untitled1

Crosstab을 이용하여 집계 (존재하지 않는 값 포함)

  • 존재하지 않는 값을 포함하여 crosstab을 할 필요가 있다.
  • 이러한 경우에는 pd.Series를 이용하여 컬럼을 만들어준다.
In [60]:
import pandas as pd
import numpy as np
In [61]:
df = pd.DataFrame({'CODE' : ['a','b','c','d','e','e','b'], 'Q' : [2,3,3,2,1,2,3]})
In [62]:
df
Out[62]:
CODE Q
0 a 2
1 b 3
2 c 3
3 d 2
4 e 1
5 e 2
6 b 3
In [63]:
df1 = pd.crosstab(index = df.CODE, columns = pd.Series([0,1,2,3,4,5]), values = df.Q, aggfunc = 'sum')
df1
Out[63]:
col_0 0 1 2 3 4 5
CODE
a 2.0 NaN NaN NaN NaN NaN
b NaN 3.0 NaN NaN NaN NaN
c NaN NaN 3.0 NaN NaN NaN
d NaN NaN NaN 2.0 NaN NaN
e NaN NaN NaN NaN 1.0 2.0
In [64]:
df2 = df1.fillna(0)
df2
Out[64]:
col_0 0 1 2 3 4 5
CODE
a 2.0 0.0 0.0 0.0 0.0 0.0
b 0.0 3.0 0.0 0.0 0.0 0.0
c 0.0 0.0 3.0 0.0 0.0 0.0
d 0.0 0.0 0.0 2.0 0.0 0.0
e 0.0 0.0 0.0 0.0 1.0 2.0
  • 전체를 대상으로 퍼센티지를 구한다.
In [65]:
df3 = pd.crosstab(index = df.CODE, columns = pd.Series([0,1,2,3,4,5]), values = df.Q, aggfunc = 'sum', normalize='all')
df3
Out[65]:
col_0 0 1 2 3 4 5
CODE
a 0.153846 0.000000 0.000000 0.000000 0.000000 0.000000
b 0.000000 0.230769 0.000000 0.000000 0.000000 0.000000
c 0.000000 0.000000 0.230769 0.000000 0.000000 0.000000
d 0.000000 0.000000 0.000000 0.153846 0.000000 0.000000
e 0.000000 0.000000 0.000000 0.000000 0.076923 0.153846
  • 컬럼이나 행을 기준으로 퍼센티지를 구한다.
In [49]:
df4 = pd.crosstab(index = df.CODE, columns = pd.Series([0,1,2,3,4,5]), values = df.Q, aggfunc = 'sum', normalize='index')
df4
Out[49]:
col_0 0 1 2 3 4 5
CODE
a 1.0 0.0 0.0 0.0 0.000000 0.000000
b 0.0 1.0 0.0 0.0 0.000000 0.000000
c 0.0 0.0 1.0 0.0 0.000000 0.000000
d 0.0 0.0 0.0 1.0 0.000000 0.000000
e 0.0 0.0 0.0 0.0 0.333333 0.666667
In [51]:
df5 = pd.crosstab(index = df.CODE, columns = pd.Series([0,1,2,3,4,5]), values = df.Q, aggfunc = 'sum', normalize='columns')
df5
Out[51]:
col_0 0 1 2 3 4 5
CODE
a 1.0 0.0 0.0 0.0 0.0 0.0
b 0.0 1.0 0.0 0.0 0.0 0.0
c 0.0 0.0 1.0 0.0 0.0 0.0
d 0.0 0.0 0.0 1.0 0.0 0.0
e 0.0 0.0 0.0 0.0 1.0 1.0
  • TOTAL을 구할수도 있다.
In [56]:
df6 = pd.crosstab(index = df.CODE, columns = pd.Series([0,1,2,3,4,5]), values = df.Q, aggfunc = 'sum', normalize='index', margins=True, margins_name='TOTAL')
df6
Out[56]:
col_0 0 1 2 3 4 5
CODE
a 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000
b 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000
c 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000
d 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000
e 0.000000 0.000000 0.000000 0.000000 0.333333 0.666667
TOTAL 0.153846 0.230769 0.230769 0.153846 0.076923 0.153846
In [58]:
df7 = pd.crosstab(index = df.CODE, columns = pd.Series([0,1,2,3,4,5]), values = df.Q, aggfunc = 'sum', normalize='columns', margins=True, margins_name='TOTAL')
df7
Out[58]:
col_0 0 1 2 3 4 5 TOTAL
CODE
a 1.0 0.0 0.0 0.0 0.0 0.0 0.153846
b 0.0 1.0 0.0 0.0 0.0 0.0 0.230769
c 0.0 0.0 1.0 0.0 0.0 0.0 0.230769
d 0.0 0.0 0.0 1.0 0.0 0.0 0.153846
e 0.0 0.0 0.0 0.0 1.0 1.0 0.230769

댓글 남기기