Building a Classification Problem from Synthetic Data

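Sooner or later you may have to set up a data analysis competition.
When that happens, finding a suitable dataset quickly becomes a struggle.
In that case, you can generate one with make_classification, as shown below.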

In [10]:

from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
import matplotlib.pyplot as plt

In [21]:

from sklearn.datasets import make_classification
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix
# Seed NumPy's global random generator
np.random.seed(2019)

English documentation: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html

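n_samples : number of samples to generate
flip_y : fraction of samples whose class is assigned at random; this label noise keeps a model from reaching 100% accuracy
n_features : total number of features (independent variables)
n_informative : number of informative features. n_informative + n_redundant + n_repeated must not exceed n_features.
n_redundant : number of redundant features. They look useful, but they are random linear combinations of the informative features, so they add no new information.
n_repeated : number of duplicated features, drawn at random from the informative and redundant features
n_classes : number of classes
n_clusters_per_class : number of clusters per class
weights : proportion of samples assigned to each class. The default (None) gives balanced classes; anything else, such as [0.9, 0.1], gives an imbalanced set.
class_sep : how far apart the classes are. The default is 1.0; larger values spread the clusters out and make classification easier.
hypercube : if True, the clusters are placed on the vertices of a hypercube; if False, on the vertices of a random polytope
shift : shifts every feature by the given value. If None, features are shifted by a random value drawn from [-class_sep, class_sep].
scale : multiplies every feature by the given value. If None, features are scaled by a random value drawn from [1, 100].
shuffle : shuffles the samples and the features
random_state : seed that makes the generated data reproducible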
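To make the feature and weight parameters concrete, here is a minimal sketch. The settings (5 features, a 90/10 class split, the `X_demo` / `y_demo` names) are illustrative assumptions and do not come from the post. It checks that redundant and repeated columns add no new directions to the data and that weights controls the class balance.

```python
import numpy as np
from sklearn.datasets import make_classification

# Illustrative settings (assumed, not from the post):
# 5 features = 2 informative + 2 redundant + 1 repeated, 90/10 class split.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=5, n_informative=2, n_redundant=2,
    n_repeated=1, n_clusters_per_class=1, weights=[0.9, 0.1],
    flip_y=0, random_state=0)

# Redundant columns are linear combinations of the informative ones and
# repeated columns are copies, so the rank should equal n_informative.
print("shape:", X_demo.shape)
print("rank :", np.linalg.matrix_rank(X_demo))

# With flip_y=0 the class shares should follow the requested weights.
print("share of class 1:", y_demo.mean())
```

If the rank comes out equal to n_informative, the extra columns carry no information that a linear model could not already get from the informative ones.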

In [59]:

X, y = make_classification(n_samples = 10000, flip_y=0, n_features=2, 
                           n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)

In [60]:

plt.scatter(X[:, 0], X[:, 1], marker='o', c=y,
            s=100, edgecolor="k", linewidth=2)

plt.xlabel("$X_1$")
plt.ylabel("$X_2$")

Out[60]:

Text(0, 0.5, '$X_2$')

Classifying with a classification model

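This model is QDA (quadratic discriminant analysis), which was among the strongest approaches in Kaggle's Instant Gratification competition.
Because flip_y is 0 here, no labels are randomly reassigned, and the prediction rate is quite high at 98.6%. Note that the output recorded under the cell below is identical to the flip_y=0.15 run further down, so it was most likely captured after the data had been regenerated.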

In [64]:

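oof = np.zeros(len(y))
# shuffle=True is required for random_state to have any effect
skf = StratifiedKFold(n_splits=11, shuffle=True, random_state=42)
for train_index, val_index in skf.split(X, y):
    clf = QuadraticDiscriminantAnalysis()
    # Fit on the training fold only so the validation rows stay unseen
    clf.fit(X[train_index], y[train_index])
    oof[val_index] = clf.predict_proba(X[val_index, :])[:, 1]

print(roc_auc_score(y, oof))
print("")
# Threshold the out-of-fold probabilities at 0.5 for the confusion matrix
print(confusion_matrix(y, (oof > 0.5).astype(int)))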

0.9180249599999999

[[4321  679]
 [ 625 4375]]

Adjusting flip_y to randomly reassign the class of 15% of the data

Because some labels are now wrong, the prediction rate is expected to drop.

In [66]:

X, y = make_classification(n_samples = 10000, flip_y=0.15, n_features=2, 
                           n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)

plt.scatter(X[:, 0], X[:, 1], marker='o', c=y,
            s=100, edgecolor="k", linewidth=2)

plt.xlabel("$X_1$")
plt.ylabel("$X_2$")

Out[66]:

Text(0, 0.5, '$X_2$')
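Note that flip_y=0.15 does not mean 15% of the labels are inverted. According to the scikit-learn documentation it is the fraction of samples whose class is assigned at random, so with two balanced classes roughly half of the selected rows get their original label back. A minimal sketch to check this, assuming shuffle=False (not used in the post) so the two label vectors stay in the same row order:

```python
import numpy as np
from sklearn.datasets import make_classification

# Same settings as the post, plus shuffle=False (an assumption made here
# so that row i means the same sample in both calls).
common = dict(n_samples=10000, n_features=2, n_informative=2, n_redundant=0,
              n_clusters_per_class=1, shuffle=False, random_state=4)

_, y_clean = make_classification(flip_y=0, **common)
_, y_noisy = make_classification(flip_y=0.15, **common)

# Fraction of rows whose label differs from the noise-free label
changed = np.mean(y_clean != y_noisy)
print(f"fraction of labels that actually changed: {changed:.3f}")
```

The printed fraction should land near half of flip_y rather than near flip_y itself.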

Classifying again with the same model

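The AUROC comes out at about 0.918 (91.8%), which is comparatively low.
If you pick out suspicious rows in the training set and correct their labels,
or predict the test data and feed those predictions back into training (pseudo labeling), you can expect the score to improve.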

In [63]:

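oof = np.zeros(len(y))
# shuffle=True is required for random_state to have any effect
skf = StratifiedKFold(n_splits=11, shuffle=True, random_state=42)
for train_index, val_index in skf.split(X, y):
    clf = QuadraticDiscriminantAnalysis()
    # Fit on the training fold only so the validation rows stay unseen
    clf.fit(X[train_index], y[train_index])
    oof[val_index] = clf.predict_proba(X[val_index, :])[:, 1]

print(roc_auc_score(y, oof))
print("")
# Threshold the out-of-fold probabilities at 0.5 for the confusion matrix
print(confusion_matrix(y, (oof > 0.5).astype(int)))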

0.9180249599999999

[[4321  679]
 [ 625 4375]]
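The pseudo labeling idea mentioned above can be sketched roughly as follows. This is only an illustration under assumptions that are not in the post: the 50/50 train/test split, the 0.9 / 0.1 confidence cutoffs, and all variable names are hypothetical. Whether the score actually improves depends on the data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X_all, y_all = make_classification(n_samples=10000, flip_y=0.15, n_features=2,
                                   n_informative=2, n_redundant=0,
                                   n_clusters_per_class=1, random_state=4)
# Hypothetical split standing in for a competition's train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.5,
                                          stratify=y_all, random_state=0)

# Step 1: fit on the labeled training data and score the test rows
base = QuadraticDiscriminantAnalysis().fit(X_tr, y_tr)
p_te = base.predict_proba(X_te)[:, 1]

# Step 2: keep only confident test predictions (assumed cutoffs)
confident = (p_te > 0.9) | (p_te < 0.1)
X_aug = np.vstack([X_tr, X_te[confident]])
y_aug = np.concatenate([y_tr, (p_te[confident] > 0.5).astype(int)])

# Step 3: refit on training data plus pseudo-labeled test rows
pseudo = QuadraticDiscriminantAnalysis().fit(X_aug, y_aug)
auc_base = roc_auc_score(y_te, p_te)
auc_pseudo = roc_auc_score(y_te, pseudo.predict_proba(X_te)[:, 1])
print("baseline AUROC:", auc_base)
print("pseudo-labeled AUROC:", auc_pseudo)
```

The true test labels are used here only to compare the two scores; in a real competition they would not be available.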

Shuffling everything

Setting flip_y to 1 assigns every label at random.

In [67]:

X, y = make_classification(n_samples = 10000, flip_y=1, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)

plt.scatter(X[:, 0], X[:, 1], marker='o', c=y,
            s=100, edgecolor="k", linewidth=2)

plt.xlabel("$X_1$")
plt.ylabel("$X_2$")

Out[67]:

Text(0, 0.5, '$X_2$')

The AUROC is around 0.5, so the model is no better than guessing.
Don't cry; go back and review the model.

In [68]:

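oof = np.zeros(len(y))
# shuffle=True is required for random_state to have any effect
skf = StratifiedKFold(n_splits=11, shuffle=True, random_state=42)
for train_index, val_index in skf.split(X, y):
    clf = QuadraticDiscriminantAnalysis()
    # Fit on the training fold only so the validation rows stay unseen
    clf.fit(X[train_index], y[train_index])
    oof[val_index] = clf.predict_proba(X[val_index, :])[:, 1]

print(roc_auc_score(y, oof))
print("")
# Threshold the out-of-fold probabilities at 0.5 for the confusion matrix
print(confusion_matrix(y, (oof > 0.5).astype(int)))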

0.5095373615259778

[[2493 2509]
 [2442 2556]]
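To see the three experiments side by side, the sketch below sweeps flip_y and reports the out-of-fold AUROC for each value. The helper name `oof_auc`, the extra value 0.5, and the use of 5 folds with `cross_val_predict` are choices made here for brevity, not part of the post.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

def oof_auc(flip_y):
    """Out-of-fold AUROC of QDA on the post's 2-feature dataset."""
    X, y = make_classification(n_samples=10000, flip_y=flip_y, n_features=2,
                               n_informative=2, n_redundant=0,
                               n_clusters_per_class=1, random_state=4)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    # cross_val_predict fits on each training fold and predicts the held-out fold
    proba = cross_val_predict(QuadraticDiscriminantAnalysis(), X, y,
                              cv=cv, method="predict_proba")[:, 1]
    return roc_auc_score(y, proba)

aucs = {f: oof_auc(f) for f in (0.0, 0.15, 0.5, 1.0)}
for f, auc in aucs.items():
    print(f"flip_y={f:<4} AUROC={auc:.3f}")
```

The AUROC should fall as flip_y grows and settle near 0.5 once every label is random.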

