Logistic Regression after Dimensionality Reduction with LDA

A binary classification example using LDA (Linear Discriminant Analysis)

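  • Almost every X variable is approximately normally distributed
  • The dataset comes from the Kaggle competition named Instant Gratification
  • LDA assumes normally distributed independent variables, so its assumptions match this dataset well, but it can only produce a linear decision boundary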
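Used as a dimensionality reducer, LDA yields at most (number of classes - 1) discriminant axes, so a binary target leaves a single feature for the logistic regression. The sketch below illustrates this on synthetic data; the `make_classification` data and its parameters are assumptions for illustration only, not the competition data.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic stand-in (assumption): 1,000 rows, 20 features, binary target.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           n_classes=2, random_state=0)

lda = LinearDiscriminantAnalysis()
X_reduced = lda.fit_transform(X, y)

# With 2 classes, LDA can produce at most 2 - 1 = 1 discriminant axis.
print(X.shape, "->", X_reduced.shape)
```

Whatever the number of input columns, the downstream classifier therefore sees only one column in the binary case.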
In [5]:
import pandas_profiling
import numpy as np
import pandas as pd
import seaborn as sns
In [63]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
In [7]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
In [8]:
from matplotlib import pyplot
In [9]:
train = pd.read_csv("../inputs/train.csv")
test = pd.read_csv("../inputs/test.csv")
In [10]:
sns.distplot(train['hazy-emerald-cuttlefish-unsorted'])
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x1d90a218be0>
In [12]:
train_x, test_x, train_y, test_y = train_test_split(train.drop(['target', 'id'], axis = 1), train['target'], test_size = 0.2)
In [58]:
model_lda = LDA()
In [61]:
train_lda_x = model_lda.fit_transform(train_x, train_y)
test_lda_x = model_lda.transform(test_x)
In [52]:
df = pd.DataFrame( {'PC1' : train_lda_x[:,0], 'class' : train_y} )
In [53]:
sns.regplot(data = df[["PC1","class"]], x = "PC1",y = "class", fit_reg=False,scatter_kws = {'s':50}, )
Out[53]:
<matplotlib.axes._subplots.AxesSubplot at 0x1d9387a18d0>
In [64]:
clf = LogisticRegression(random_state = 0)
clf.fit(train_lda_x, train_y)
C:\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
Out[64]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=0, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)
In [66]:
test_lda_predict = clf.predict(test_lda_x)

Checking the metrics

In [67]:
confusion_matrix(test_y, test_lda_predict)
Out[67]:
array([[13719, 12411],
       [12781, 13518]], dtype=int64)
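To read the confusion matrix above more directly, the usual summary metrics can be derived from its four cells. This is a small worked sketch using only the numbers printed above; it relies on the scikit-learn convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix reported above: rows = true class, columns = predicted class.
cm = np.array([[13719, 12411],
               [12781, 13518]])

# For a 2x2 matrix in this layout, ravel() yields TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)  # among rows predicted as class 1
recall = tp / (tp + fn)     # among rows that truly are class 1

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

The accuracy comes out at roughly 52%, only slightly above what a coin flip would give on classes this balanced.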
In [69]:
test_predicted_prob = clf.predict_proba(test_lda_x)
In [70]:
roc_auc_score(test_y, test_predicted_prob[:,1])
Out[70]:
0.5291008810961615

Plotting the ROC curve

In [71]:
fpr, tpr, thresholds = roc_curve(test_y, test_predicted_prob[:,1])
In [72]:
pyplot.plot([0, 1], [0, 1], linestyle='--')
pyplot.plot(fpr, tpr, marker='.')
pyplot.show()
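  • The ROC curve lies close to the diagonal, which shows that the model did not fit the data (an AUC of about 0.53 is barely better than random guessing)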
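One way a linear method can end up near the diagonal even when every feature is Gaussian is when the classes share the same mean and differ only in spread. The sketch below builds such a case; the data are entirely hypothetical and are not claimed to reflect the structure of the competition dataset.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 4000

# Hypothetical data: both classes are zero-mean Gaussians,
# but class 1 has a larger standard deviation than class 0.
X0 = rng.normal(loc=0.0, scale=1.0, size=(n, 5))
X1 = rng.normal(loc=0.0, scale=3.0, size=(n, 5))
X = np.vstack([X0, X1])
y = np.array([0] * n + [1] * n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Same pipeline as in the notebook: LDA projection, then logistic regression.
lda = LinearDiscriminantAnalysis()
Z_tr = lda.fit_transform(X_tr, y_tr)
Z_te = lda.transform(X_te)

clf = LogisticRegression(solver="lbfgs").fit(Z_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(Z_te)[:, 1])
print(f"AUC on the held-out split: {auc:.3f}")
```

Because the class means coincide, no single linear projection separates the classes, and the AUC stays close to 0.5 although the normality assumption holds for each feature.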
