Binary classification example using LDA (Linear Discriminant Analysis)

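Almost all of the X variables are approximately normally distributed.
The dataset comes from the Kaggle competition Instant Gratification.
LDA assumes that the independent variables are normally distributed within each class (with a covariance matrix shared across classes), so it matches this dataset's distributional assumption, but it can only produce a linear decision boundary.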
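To make the "linear only" limitation concrete, here is a minimal sketch on synthetic data (not the competition data). It assumes two classes that share the same mean and differ only in spread, which is one way normally distributed features can still defeat a linear boundary. `QuadraticDiscriminantAnalysis` is brought in purely as a point of comparison and is not part of the original notebook.

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 2000, 5

# Synthetic data: both classes are centred at the origin and differ
# only in their spread, so no hyperplane separates them well.
X = np.vstack([
    rng.normal(0.0, 1.0, size=(n, d)),  # class 0: standard deviation 1
    rng.normal(0.0, 3.0, size=(n, d)),  # class 1: standard deviation 3
])
y = np.repeat([0, 1], n)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# LDA fits one shared covariance matrix (linear boundary);
# QDA fits one covariance matrix per class (quadratic boundary).
lda_acc = LinearDiscriminantAnalysis().fit(X_tr, y_tr).score(X_te, y_te)
qda_acc = QuadraticDiscriminantAnalysis().fit(X_tr, y_tr).score(X_te, y_te)

print(f"LDA accuracy: {lda_acc:.3f}")
print(f"QDA accuracy: {qda_acc:.3f}")
```

On data like this, LDA should land near chance while QDA separates the classes well, even though every feature is Gaussian within each class.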

In [5]:

import pandas_profiling
import numpy as np
import pandas as pd
import seaborn as sns

In [63]:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

In [7]:

from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [8]:

from matplotlib import pyplot

In [9]:

train = pd.read_csv("../inputs/train.csv")
test = pd.read_csv("../inputs/test.csv")

In [10]:

sns.distplot(train['hazy-emerald-cuttlefish-unsorted'])

Out[10]:

<matplotlib.axes._subplots.AxesSubplot at 0x1d90a218be0>
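A histogram checks one column at a time by eye. As a rough numeric complement, the sketch below computes sample skewness and excess kurtosis, both of which are close to 0 for normally distributed data. The helper name `skew_and_excess_kurtosis` is hypothetical, and the sample is synthetic rather than a column from `train`.

```python
import numpy as np

def skew_and_excess_kurtosis(x):
    """Return (skewness, excess kurtosis) of a 1-D sample."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()  # standardise the sample
    return float(np.mean(z ** 3)), float(np.mean(z ** 4) - 3.0)

rng = np.random.default_rng(0)
sample = rng.normal(size=100_000)  # stand-in for one feature column

skew, kurt = skew_and_excess_kurtosis(sample)
print(f"skewness={skew:.3f}, excess kurtosis={kurt:.3f}")
```

The same helper could be applied to each feature column of `train` to flag the few variables that deviate from normality.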

In [12]:

# Hold out 20% of the labelled data for evaluation; 'id' is an identifier, not a feature.
train_x, test_x, train_y, test_y = train_test_split(
    train.drop(['target', 'id'], axis=1), train['target'], test_size=0.2)

In [58]:

model_lda = LDA()

In [61]:

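# Fit LDA on the training split, then project both splits onto the discriminant axis.
# With two classes, LDA yields at most one component (n_classes - 1).
train_lda_x = model_lda.fit_transform(train_x, train_y)
test_lda_x = model_lda.transform(test_x)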
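For two classes, the projection that LDA learns is the Fisher discriminant direction, proportional to S_w^{-1}(mu_1 - mu_0), where S_w is the pooled within-class scatter. The following is a minimal NumPy sketch of that idea on synthetic data. The helper name `fisher_direction` is hypothetical, and scikit-learn's `transform` centres and scales its output differently, so the projected values agree only up to an affine map.

```python
import numpy as np

def fisher_direction(X, y):
    """Unit vector along S_w^{-1} (mu_1 - mu_0) for binary labels 0/1."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled within-class scatter matrix
    Sw = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
    w = np.linalg.solve(Sw, mu1 - mu0)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
# Synthetic data: class 1 is shifted along the first feature only.
X = np.vstack([
    rng.normal(size=(500, 3)),
    rng.normal(size=(500, 3)) + [2.0, 0.0, 0.0],
])
y = np.repeat([0, 1], 500)

w = fisher_direction(X, y)
projected = X @ w  # one value per sample, like the single LDA component
print("direction:", np.round(w, 3))
```

Because the classes differ only along the first feature, the recovered direction should point almost entirely along that axis.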

In [52]:

# 'PC1' holds the single LDA discriminant axis (it is not a PCA component).
df = pd.DataFrame( {'PC1' : train_lda_x[:,0], 'class' : train_y} )

In [53]:

sns.regplot(data=df[["PC1", "class"]], x="PC1", y="class", fit_reg=False, scatter_kws={'s': 50})

Out[53]:

<matplotlib.axes._subplots.AxesSubplot at 0x1d9387a18d0>

In [64]:

clf = LogisticRegression(random_state = 0)
clf.fit(train_lda_x, train_y)

C:\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)

Out[64]:

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=0, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [66]:

test_lda_predict = clf.predict(test_lda_x)
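`LinearDiscriminantAnalysis` is itself a classifier and exposes `predict_proba`. The sketch below, on synthetic data, compares scoring with LDA directly against the two-stage pipeline used here (LDA projection followed by logistic regression). Since both scores are monotone functions of the same one-dimensional projection, their ROC AUC values should be essentially identical. The data, variable names, and the `solver="lbfgs"` choice are assumptions made for illustration.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 1000, 10

# Synthetic data: class 1 is shifted by 0.3 in every feature.
X = np.vstack([rng.normal(size=(n, d)), rng.normal(size=(n, d)) + 0.3])
y = np.repeat([0, 1], n)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

lda = LinearDiscriminantAnalysis().fit(X_tr, y_tr)

# Option 1: score with LDA directly.
auc_lda = roc_auc_score(y_te, lda.predict_proba(X_te)[:, 1])

# Option 2: project with LDA, then fit logistic regression on the 1-D feature.
clf = LogisticRegression(solver="lbfgs").fit(lda.transform(X_tr), y_tr)
auc_two_stage = roc_auc_score(
    y_te, clf.predict_proba(lda.transform(X_te))[:, 1]
)

print(f"LDA only:  {auc_lda:.4f}")
print(f"LDA + LR:  {auc_two_stage:.4f}")
```

In the binary case the logistic regression stage mainly recalibrates the probabilities; it does not change how samples are ranked.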

Checking the metrics

In [67]:

confusion_matrix(test_y, test_lda_predict)

Out[67]:

array([[13719, 12411],
       [12781, 13518]], dtype=int64)
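In scikit-learn's layout, rows of the confusion matrix are true labels and columns are predicted labels, so the matrix reads `[[TN, FP], [FN, TP]]`. A short sketch of the summary metrics that follow from the values above:

```python
import numpy as np

# Confusion matrix copied from the output above: [[TN, FP], [FN, TP]]
cm = np.array([[13719, 12411],
               [12781, 13518]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)  # of the predicted positives, how many are correct
recall = tp / (tp + fn)     # of the actual positives, how many are found

print(f"accuracy={accuracy:.4f}, precision={precision:.4f}, recall={recall:.4f}")
```

All three values come out at roughly 0.51 to 0.52, only slightly above what random guessing would give on this nearly balanced split.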

In [69]:

test_predicted_prob = clf.predict_proba(test_lda_x)

In [70]:

roc_auc_score(test_y, test_predicted_prob[:,1])

Out[70]:

0.5291008810961615
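ROC AUC can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that definition. The helper name `auc_by_pairs` is hypothetical, and the all-pairs comparison is only practical for small inputs.

```python
import numpy as np

def auc_by_pairs(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]  # every positive against every negative
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

auc_example = auc_by_pairs([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc_example)  # 3 of the 4 pairs are ranked correctly -> 0.75
```

Under this reading, the score of about 0.53 above means a positive sample outranks a negative one only slightly more often than a coin flip would.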

Plotting the ROC curve

In [71]:

fpr, tpr, thresholds = roc_curve(test_y, test_predicted_prob[:,1])

In [72]:

pyplot.plot([0, 1], [0, 1], linestyle='--')
pyplot.plot(fpr, tpr, marker='.')
pyplot.show()

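The ROC curve lies close to the diagonal, so the model has barely learned anything: the AUC of about 0.53 is only slightly better than random guessing.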

Dimensionality reduction with LDA, followed by Logistic Regression
