Binary classification example using LDA (Linear Discriminant Analysis)
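- Almost all of the X variables (features) are approximately normally distributed
- The dataset comes from the Kaggle competition named Instant Gratification
- LDA assumes that the independent variables are normally distributed, so this dataset matches that assumption well; however, LDA can only learn a linear decision boundary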
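The linear-only limitation can be illustrated on synthetic data. The sketch below is not part of the original notebook: the two toy datasets and the names `xa`, `xb`, `acc_shifted` and `acc_same_mean` are assumptions made for illustration. In the first dataset the classes differ in their means, which suits LDA; in the second they share a mean and differ only in spread, so no straight line can separate them.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n = 2000
y = np.repeat([0, 1], n)  # first n rows are class 0, next n rows are class 1

# Case A: Gaussian classes with the same spread but different means
xa = np.vstack([rng.normal(0.0, 1.0, size=(n, 2)),
                rng.normal(3.0, 1.0, size=(n, 2))])

# Case B: Gaussian classes with the same mean but different spread
xb = np.vstack([rng.normal(0.0, 1.0, size=(n, 2)),
                rng.normal(0.0, 3.0, size=(n, 2))])

# Training accuracy is enough here to show the contrast
acc_shifted = LinearDiscriminantAnalysis().fit(xa, y).score(xa, y)
acc_same_mean = LinearDiscriminantAnalysis().fit(xb, y).score(xb, y)
print(f"different means: {acc_shifted:.3f}, same mean: {acc_same_mean:.3f}")
```

In the second case the features are still Gaussian, yet the accuracy should stay close to chance, because the class information lives in the variance rather than in the mean.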
In [5]:
import pandas_profiling
import numpy as np
import pandas as pd
import seaborn as sns
In [63]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
In [7]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
In [8]:
from matplotlib import pyplot
In [9]:
train = pd.read_csv("../inputs/train.csv")
test = pd.read_csv("../inputs/test.csv")
In [10]:
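# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement
sns.histplot(train['hazy-emerald-cuttlefish-unsorted'], kde=True, stat="density")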
Out[10]:
In [12]:
# Hold out 20% of the labelled rows; random_state makes the split reproducible
train_x, test_x, train_y, test_y = train_test_split(
    train.drop(['target', 'id'], axis=1), train['target'], test_size=0.2, random_state=0)
In [58]:
model_lda = LDA()
In [61]:
train_lda_x = model_lda.fit_transform(train_x, train_y)
test_lda_x = model_lda.transform(test_x)
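For a binary target, scikit-learn's LDA projects the data onto at most `n_classes - 1` axes, so the transformed matrix has a single column. The sketch below checks this on synthetic data from `make_classification`; the data and the names `X_demo`, `y_demo` and `X_demo_lda` are assumptions and are not taken from the competition dataset.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic binary problem with 20 features (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0)

lda_demo = LinearDiscriminantAnalysis()
X_demo_lda = lda_demo.fit_transform(X_demo, y_demo)

# Two classes leave room for only one discriminant axis
print(X_demo.shape, X_demo_lda.shape)
```

This is why the later cells index the transformed array with `[:, 0]`: there is only one discriminant component to plot or feed to a downstream classifier.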
In [52]:
# LDA yields a discriminant axis, not a principal component, so the column is named LD1
df = pd.DataFrame({'LD1': train_lda_x[:, 0], 'class': train_y.to_numpy()})
In [53]:
sns.regplot(data=df, x="LD1", y="class", fit_reg=False, scatter_kws={'s': 50})
Out[53]:
In [64]:
clf = LogisticRegression(random_state = 0)
clf.fit(train_lda_x, train_y)
Out[64]:
In [66]:
test_lda_predict = clf.predict(test_lda_x)
Checking the metrics
In [67]:
confusion_matrix(test_y, test_lda_predict)
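As a reading aid, scikit-learn's `confusion_matrix` puts true labels in rows and predicted labels in columns, so for a binary problem `ravel()` returns the counts in the order TN, FP, FN, TP. The sketch below uses small hand-made label arrays (`y_true_demo` and `y_pred_demo` are hypothetical, not results from this notebook) to derive accuracy, precision and recall from the matrix.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, chosen only to make the counts easy to verify by hand
y_true_demo = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred_demo = [0, 1, 0, 1, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found
print(tn, fp, fn, tp, accuracy, precision, recall)
```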
Out[67]:
In [69]:
test_predicted_prob = clf.predict_proba(test_lda_x)
In [70]:
roc_auc_score(test_y, test_predicted_prob[:,1])
Out[70]:
Plotting the ROC curve
In [71]:
fpr, tpr, thresholds = roc_curve(test_y, test_predicted_prob[:,1])
In [72]:
pyplot.plot([0, 1], [0, 1], linestyle='--')
pyplot.plot(fpr, tpr, marker='.')
pyplot.show()
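- The ROC curve lies close to the diagonal, so the model separates the two classes little better than random guessing; in other words, it did not fit the data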
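To make the link between the diagonal and a failed fit concrete, the sketch below compares the AUC of scores that are unrelated to the labels with the AUC of scores that carry some signal. Everything here is simulated; `y_sim`, `scores_random` and `scores_informative` are assumed names and do not come from the notebook's model.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 5000
y_sim = rng.integers(0, 2, size=n)  # simulated binary labels

# Scores drawn independently of the labels: the ROC curve hugs the diagonal
scores_random = rng.uniform(size=n)

# Scores shifted upward for the positive class: the ROC curve bows toward the top left
scores_informative = y_sim + rng.normal(0.0, 1.0, size=n)

auc_random = roc_auc_score(y_sim, scores_random)
auc_informative = roc_auc_score(y_sim, scores_informative)
print(f"uninformative scores: {auc_random:.3f}, informative scores: {auc_informative:.3f}")
```

An AUC near 0.5 corresponds to the dashed diagonal drawn by `pyplot.plot([0, 1], [0, 1], linestyle='--')`, which is the reference line for a classifier with no discriminative power.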