Binary classification on tabular data: a Python cheat sheet with boosting and neural-network baselines

EDA baseline-Neural Network

Predicting tabular data (binary classification) with a neural network and LightGBM

In [1]:
import pandas as pd
import numpy as np
import os
import lightgbm as lgb
from sklearn.model_selection import train_test_split, KFold
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn import preprocessing

from tensorflow.keras.layers import Dense, Input, Activation, Flatten
from tensorflow.keras.layers import BatchNormalization, Add, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.layers import LeakyReLU, ReLU, Conv2D, MaxPooling2D, Conv2DTranspose, UpSampling2D
from tensorflow.keras import callbacks
from tensorflow.keras import backend as K

from tensorflow import metrics
import tensorflow as tf
from sklearn.metrics import roc_auc_score
In [2]:
os.listdir('../input')
Out[2]:
['fe',
 'minification',
 'sample_submission.csv',
 'test_identity.csv',
 'test_transaction.csv',
 'train_identity.csv',
 'train_transaction.csv']
In [3]:
submission = pd.read_csv("../input/sample_submission.csv")
In [4]:
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
In [5]:
submission.head(3)
Out[5]:
TransactionID isFraud
0 3663549 0.5
1 3663550 0.5
2 3663551 0.5

LightGBM handles columns of category dtype directly, with no explicit encoding step, so this helper converts the string columns to that dtype.

In [6]:
def to_category_columns(df):
    # Convert every string (object) column to the pandas category dtype
    string_columns = df.columns[df.dtypes == 'object']
    for c in string_columns:
        df[c] = df[c].astype('category')
    return df
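
As a small illustration of what this conversion stores, the sketch below inspects the integer codes pandas keeps behind a category column. The toy frames are hypothetical, and building one shared CategoricalDtype for train and test is an assumption about how one might keep the codes aligned across both tables; the notebook itself converts each table separately.

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only)
train = pd.DataFrame({"card4": ["visa", "mastercard", "visa", None]})
test = pd.DataFrame({"card4": ["discover", "visa"]})

# Converting each frame on its own gives each column its own category list
train_cat = train["card4"].astype("category")
test_cat = test["card4"].astype("category")
print(list(train_cat.cat.categories))
print(list(test_cat.cat.categories))

# One option for consistent codes: a single dtype built from the union of values
all_values = pd.concat([train["card4"], test["card4"]]).dropna().unique()
shared = pd.CategoricalDtype(categories=sorted(all_values))
train_shared = train["card4"].astype(shared)
test_shared = test["card4"].astype(shared)

# Missing values receive the code -1
print(train_shared.cat.codes.tolist())
print(test_shared.cat.codes.tolist())
```

With the shared dtype, the same string maps to the same integer code in both tables.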

Function that applies label encoding

In [7]:
def to_label_encoding(df):
    # Replace every string or category column with integer labels;
    # the encoder is refit for each column
    le = preprocessing.LabelEncoder()
    string_columns = df.columns[(df.dtypes == 'object') | (df.dtypes == 'category')]
    for c in string_columns:
        df[c] = le.fit_transform(df[c].astype(str))
    return df
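
To make the behaviour of LabelEncoder concrete, here is a minimal sketch on hypothetical card-brand values. It shows that the integer labels follow alphabetical order, and that transforming a value the encoder has not seen raises a ValueError, which is one reason to fit on train and test values together.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical values, for illustration only
train_values = ["visa", "mastercard", "visa", "discover"]
test_values = ["visa", "american express"]

le = LabelEncoder()
train_codes = le.fit_transform(train_values)
print(list(le.classes_))     # classes are stored in sorted order
print(train_codes.tolist())  # each value replaced by its class index

# A label that was absent during fit cannot be transformed
try:
    le.transform(test_values)
    unseen_error = False
except ValueError:
    unseen_error = True
print(unseen_error)

# Fitting on the union of train and test values avoids the error
le_all = LabelEncoder().fit(train_values + test_values)
test_codes = le_all.transform(test_values)
print(test_codes.tolist())
```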
In [8]:
raw_transaction = pd.read_csv("../input/train_transaction.csv")
raw_identity = pd.read_csv("../input/train_identity.csv")

COMPETITION_raw_transaction = pd.read_csv("../input/test_transaction.csv")
COMPETITION_raw_identity = pd.read_csv("../input/test_identity.csv")
In [9]:
# Convert string columns to category dtype
raw_transaction = to_category_columns(raw_transaction)
raw_identity = to_category_columns(raw_identity)

COMPETITION_raw_transaction = to_category_columns(COMPETITION_raw_transaction)
COMPETITION_raw_identity = to_category_columns(COMPETITION_raw_identity)
In [10]:
raw_transaction.head(2)
Out[10]:
TransactionID isFraud TransactionDT TransactionAmt ProductCD card1 card2 card3 card4 card5 card6 addr1 addr2 dist1 dist2 P_emaildomain R_emaildomain ...
0 2987000 0 86400 68.5 W 13926 NaN 150.0 discover 142.0 credit 315.0 87.0 19.0 NaN NaN NaN ...
1 2987001 0 86401 29.0 W 2755 404.0 150.0 mastercard 102.0 credit 325.0 87.0 NaN NaN gmail.com NaN ...
In [11]:
raw_identity.head(2)
Out[11]:
TransactionID id_01 id_02 id_03 id_04 id_05 id_06 id_07 id_08 id_09 id_10 id_11 id_12 id_13 id_14 id_15 id_16 id_17 id_18 id_19 id_20 id_21 id_22 id_23 id_24 id_25 id_26 id_27 id_28 id_29 id_30 id_31 id_32 id_33 id_34 id_35 id_36 id_37 id_38 DeviceType DeviceInfo
0 2987004 0.0 70787.0 NaN NaN NaN NaN NaN NaN NaN NaN 100.0 NotFound NaN -480.0 New NotFound 166.0 NaN 542.0 144.0 NaN NaN NaN NaN NaN NaN NaN New NotFound Android 7.0 samsung browser 6.2 32.0 2220×1080 match_status:2 T F T T mobile SAMSUNG SM-G892A Build/NRD90M
1 2987008 -5.0 98945.0 NaN NaN 0.0 -5.0 NaN NaN NaN NaN 100.0 NotFound 49.0 -300.0 New NotFound 166.0 NaN 621.0 500.0 NaN NaN NaN NaN NaN NaN NaN New NotFound iOS 11.1.2 mobile safari 11.0 32.0 1334×750 match_status:1 T F F T mobile iOS Device
In [12]:
feature_list = ['TransactionAmt', 'ProductCD', 'card1', 'card2', 'card3', 'card4', 'card5', 'card6']
In [13]:
train_raw_X = raw_transaction[feature_list]
train_raw_y = raw_transaction[['isFraud']]

COMPETITION_X = COMPETITION_raw_transaction[feature_list]

LightGBM model for comparison

In [14]:
train_X, valid_X, train_y, valid_y = train_test_split(train_raw_X, train_raw_y, test_size=0.2, random_state=1493)

params = {'learning_rate': 0.1, 
          'max_depth': 16, 
          'boosting': 'gbdt', 
          'objective': 'binary', 
          'metric': 'auc', 
          'is_training_metric': True, 
          'num_leaves': 144, 
          'feature_fraction': 0.9, 
          'bagging_fraction': 0.7, 
          'bagging_freq': 5, 
          'seed':2018}

train_ds = lgb.Dataset(train_X, label = train_y) 
valid_ds = lgb.Dataset(valid_X, label = valid_y) 

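# Note: LightGBM 4.0 removed the verbose_eval and early_stopping_rounds arguments of lgb.train;
# on those versions pass callbacks=[lgb.log_evaluation(100), lgb.early_stopping(100)] instead.
model = lgb.train(params, train_ds, 1000, [train_ds, valid_ds], verbose_eval=100, early_stopping_rounds=100)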
In [16]:
model = lgb.train(params, train_ds, 1000, [train_ds, valid_ds], verbose_eval=100, early_stopping_rounds=100)
C:\Anaconda3\lib\site-packages\lightgbm\basic.py:762: UserWarning: categorical_feature in param dict is overridden.
  warnings.warn('categorical_feature in param dict is overridden.')
Training until validation scores don't improve for 100 rounds.
[100]	training's auc: 0.905507	valid_1's auc: 0.863402
[200]	training's auc: 0.926325	valid_1's auc: 0.869437
[300]	training's auc: 0.938162	valid_1's auc: 0.872857
[400]	training's auc: 0.946685	valid_1's auc: 0.875459
[500]	training's auc: 0.951834	valid_1's auc: 0.876158
[600]	training's auc: 0.95546	valid_1's auc: 0.87668
Early stopping, best iteration is:
[597]	training's auc: 0.955359	valid_1's auc: 0.876812
In [17]:
predict = model.predict(COMPETITION_X)
submission['isFraud'] = predict
In [19]:
submission.to_csv("./submission/submission_fraud_0829_baseline2.csv", index=False)

Building a basic neural network model

In [17]:
def create_nn_model(input_shape):
    inp = Input(shape = (input_shape, ))

    x = Dense(1024, activation = 'relu', kernel_initializer='he_normal')(inp)

    x = BatchNormalization()(x)
    x = Dense(1024, activation = 'relu')(x)

    x = BatchNormalization()(x)
    x = Dense(512, activation = 'relu')(x)

    x = BatchNormalization()(x)
    x = Dense(256, activation = 'relu')(x)

    x = BatchNormalization()(x)
    x = Dense(128, activation = 'relu')(x)

    x = BatchNormalization()(x)
    out = Dense(1, activation = 'sigmoid')(x)

    model = Model(inputs=inp, outputs=[out])
    
    return model

Converting the categorical variables to numbers

  • A label encoder gives the categories an artificial numeric order, so embeddings or one-hot encoding tend to predict better.
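
As a point of comparison with label encoding, one-hot encoding can be sketched with pandas.get_dummies. The toy frame below is hypothetical. For high-cardinality columns this approach creates one column per distinct value, so it suits columns with only a few categories.

```python
import pandas as pd

# Toy frame (hypothetical values, for illustration only)
df = pd.DataFrame({
    "TransactionAmt": [68.5, 29.0, 59.0],
    "card4": ["discover", "mastercard", "visa"],
})

# Each category becomes its own 0/1 column, so no ordering is implied
one_hot = pd.get_dummies(df, columns=["card4"], dtype=float)
print(one_hot.columns.tolist())
print(one_hot)
```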
In [88]:
# Encode train and test together so both share one label mapping
X = pd.concat([train_raw_X, COMPETITION_X], ignore_index=True)

X = to_label_encoding(X)

# Standardize every column, then replace the remaining NaNs with 0
input_data = StandardScaler().fit_transform(X)

input_data = np.nan_to_num(input_data)
C:\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py:645: DataConversionWarning: Data with input dtype int32, int64, float64 were all converted to float64 by StandardScaler.
  return self.partial_fit(X, y)
C:\Anaconda3\lib\site-packages\sklearn\base.py:464: DataConversionWarning: Data with input dtype int32, int64, float64 were all converted to float64 by StandardScaler.
  return self.fit(X, **fit_params).transform(X)
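
The cell above fits the scaler on train and test rows together. A possible alternative, sketched here on hypothetical toy arrays, fits the scaler on the training rows only and reuses those statistics for the test rows. It relies on StandardScaler ignoring NaNs while fitting and passing them through on transform, so nan_to_num then maps each missing value to 0, which is the column mean after scaling.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy matrices standing in for the encoded train/test features (hypothetical)
train_part = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0]])
test_part = np.array([[4.0, 20.0]])

# Learn mean and standard deviation from the training rows only
scaler = StandardScaler().fit(train_part)

# Apply the same statistics to both parts, then fill NaNs with 0
train_scaled = np.nan_to_num(scaler.transform(train_part))
test_scaled = np.nan_to_num(scaler.transform(test_part))

print(train_scaled)
print(test_scaled)
```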

Callback that computes AUROC

In [27]:
from sklearn.metrics import roc_auc_score
# Import Callback from tensorflow.keras, matching the rest of the notebook;
# mixing it with the standalone keras package can break the callback hooks.
from tensorflow.keras.callbacks import Callback


class roc_callback(Callback):
    """Print the train and validation ROC AUC at the end of every epoch."""

    def __init__(self, training_data, validation_data):
        super().__init__()
        self.x = training_data[0]
        self.y = training_data[1]
        self.x_val = validation_data[0]
        self.y_val = validation_data[1]

    # The other hooks (on_train_batch_begin, on_test_end, ...) are inherited
    # from Callback as no-ops, so only on_epoch_end needs overriding.
    def on_epoch_end(self, epoch, logs=None):
        y_pred = self.model.predict(self.x)
        roc = roc_auc_score(self.y, y_pred)
        y_pred_val = self.model.predict(self.x_val)
        roc_val = roc_auc_score(self.y_val, y_pred_val)
        print('\rroc-auc: %s - roc-auc_val: %s' % (str(round(roc,4)),str(round(roc_val,4))),end=100*' '+'\n')

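KFold is used here for experimentation (a stratified or time-series split would be more appropriate)

  • If the categorical variables are not handled properly, the accuracy shows a suspicious pattern, as in the output below.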
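
One reason accuracy can look suspicious on imbalanced labels is that a model with no ranking ability still scores high simply by favouring the majority class. The sketch below uses synthetic labels with an assumed 3.5% positive rate (not taken from the notebook) and a constant score for every row.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic labels: 35 positives out of 1000 (assumed rate, for illustration)
y_true = np.zeros(1000, dtype=int)
y_true[:35] = 1

# A "model" that assigns the same low score to every row
constant_scores = np.full(1000, 0.03)

# Thresholding at 0.5 predicts the negative class everywhere
acc = accuracy_score(y_true, (constant_scores >= 0.5).astype(int))
auc = roc_auc_score(y_true, constant_scores)

print(acc)  # share of negatives in the data
print(auc)  # chance level, since the scores cannot rank any row above another
```

This is why the notebook tracks AUROC alongside accuracy.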
In [71]:
N_FOLDS = 2
folds = KFold(n_splits=N_FOLDS, shuffle=True, random_state=1493)
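
The two alternatives mentioned above can be sketched with scikit-learn's StratifiedKFold and TimeSeriesSplit. The toy arrays are hypothetical, and the time-series variant assumes the rows are already sorted by time (for this dataset that would presumably mean sorting by TransactionDT, which is an assumption).

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit

# Toy data: 100 rows, 4 positives (hypothetical)
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 96 + [1] * 4)

# Stratified split: every validation fold keeps the same share of positives
skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=1493)
positives_per_fold = [int(y_toy[val_idx].sum()) for _, val_idx in skf.split(X_toy, y_toy)]
print(positives_per_fold)

# Time-series split: each validation block comes strictly after its training rows
tss = TimeSeriesSplit(n_splits=2)
ordered = [bool(trn_idx.max() < val_idx.min()) for trn_idx, val_idx in tss.split(X_toy)]
print(ordered)
```

Either splitter can replace KFold in the loop; StratifiedKFold.split needs the labels as its second argument.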
In [93]:
predictions = np.zeros((len(COMPETITION_X), 1))
In [ ]:
for fold_, (trn_idx, val_idx) in enumerate(folds.split(train_raw_X, train_raw_y)):

    tr_x, tr_y = input_data[trn_idx], train_raw_y.values[trn_idx]
    vl_x, vl_y = input_data[val_idx], train_raw_y.values[val_idx]

    test_input = input_data[len(train_raw_X):,:]      

    nn_model = create_nn_model(tr_x.shape[1])
    nn_model.compile(loss='binary_crossentropy', metrics=['accuracy'], optimizer=Adam())

    es = callbacks.EarlyStopping(monitor='val_loss', min_delta = 0.0001, patience = 40, verbose = 1, mode='auto', restore_best_weights = True)
    rlr = callbacks.ReduceLROnPlateau(monitor='val_loss', factor = 0.1, patience = 30, min_lr = 1e-6, mode = 'auto', verbose = 1)

    history = nn_model.fit(tr_x, [tr_y], validation_data=(vl_x, [vl_y]), callbacks = [es, rlr,roc_callback(training_data=(tr_x, tr_y),validation_data=(vl_x, vl_y)) ], epochs = 10, batch_size = 10000, verbose = 1)
    #history = nn_model.fit(tr_x, [tr_y], validation_data=(vl_x, [vl_y]), callbacks = [es, rlr], epochs = 5, batch_size = 10000, verbose = 1)

    cv_predict = nn_model.predict(vl_x)
    
    v = np.squeeze(vl_y)
    p = np.squeeze(cv_predict)
    
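    # Note: this is 1 - mean absolute error between labels and predicted probabilities,
    # not a classification accuracy; it is computed here but not used afterwards.
    accuracy = 1 - np.mean(np.abs(v - p))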

    test_predict = nn_model.predict(test_input)

    predictions += test_predict/N_FOLDS
Train on 295270 samples, validate on 295270 samples
Epoch 1/10
roc-auc: 0.6889 - roc-auc_val: 0.6833                                                                                                    
295270/295270 [==============================] - 72s 243us/sample - loss: 0.6598 - accuracy: 0.7669 - val_loss: 0.7320 - val_accuracy: 0.6874
Epoch 2/10
roc-auc: 0.6951 - roc-auc_val: 0.6907
295270/295270 [==============================] - 89s 302us/sample - loss: 0.4390 - accuracy: 0.9503 - val_loss: 0.4402 - val_accuracy: 0.8531
Epoch 3/10
roc-auc: 0.7133 - roc-auc_val: 0.7084                                                                                                    
295270/295270 [==============================] - 88s 299us/sample - loss: 0.3024 - accuracy: 0.9639 - val_loss: 0.2670 - val_accuracy: 0.9587
Epoch 4/10
roc-auc: 0.7391 - roc-auc_val: 0.731
295270/295270 [==============================] - 88s 298us/sample - loss: 0.2103 - accuracy: 0.9653 - val_loss: 0.1809 - val_accuracy: 0.9644
Epoch 5/10
roc-auc: 0.7615 - roc-auc_val: 0.747                                                                                                    
295270/295270 [==============================] - 91s 308us/sample - loss: 0.1642 - accuracy: 0.9656 - val_loss: 0.1523 - val_accuracy: 0.9647
Epoch 6/10
roc-auc: 0.778 - roc-auc_val: 0.7584                                                                                                    
295270/295270 [==============================] - 83s 280us/sample - loss: 0.1438 - accuracy: 0.9657 - val_loss: 0.1394 - val_accuracy: 0.9652
Epoch 7/10
roc-auc: 0.7986 - roc-auc_val: 0.7753                                                                                                    
295270/295270 [==============================] - 87s 296us/sample - loss: 0.1347 - accuracy: 0.9657 - val_loss: 0.1346 - val_accuracy: 0.9653
Epoch 8/10
roc-auc: 0.7998 - roc-auc_val: 0.7779                                                                                                    
295270/295270 [==============================] - 93s 315us/sample - loss: 0.1308 - accuracy: 0.9657 - val_loss: 0.1324 - val_accuracy: 0.9652
Epoch 9/10
roc-auc: 0.8057 - roc-auc_val: 0.78                                                                                                    
295270/295270 [==============================] - 81s 273us/sample - loss: 0.1281 - accuracy: 0.9658 - val_loss: 0.1308 - val_accuracy: 0.9653
Epoch 10/10
roc-auc: 0.8102 - roc-auc_val: 0.7866                                                                                                    
295270/295270 [==============================] - 88s 299us/sample - loss: 0.1266 - accuracy: 0.9659 - val_loss: 0.1297 - val_accuracy: 0.9653
Train on 295270 samples, validate on 295270 samples
Epoch 1/10
roc-auc: 0.5867 - roc-auc_val: 0.5861                                                                                                    
295270/295270 [==============================] - 91s 310us/sample - loss: 0.6590 - accuracy: 0.7684 - val_loss: 0.4733 - val_accuracy: 0.8440
Epoch 2/10
roc-auc: 0.6233 - roc-auc_val: 0.6202                                                                                                    
295270/295270 [==============================] - 85s 287us/sample - loss: 0.4335 - accuracy: 0.9517 - val_loss: 0.3261 - val_accuracy: 0.9514
Epoch 3/10
290000/295270 [============================>.] - ETA: 0s - loss: 0.2967 - accuracy: 0.9641
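
The loop above averages the test predictions over the folds. The validation predictions can be collected the same way into one out-of-fold vector and scored once with AUROC. The sketch below shows that pattern with a LogisticRegression as a lightweight stand-in for the Keras model and synthetic data from make_classification; both are assumptions for illustration, not the notebook's setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic, imbalanced data standing in for the real features
X_all, y_all = make_classification(n_samples=600, n_features=8, weights=[0.9], random_state=0)
X_tr, y_tr = X_all[:400], y_all[:400]
X_te = X_all[400:]

n_folds = 2
oof = np.zeros(len(X_tr))         # out-of-fold predictions for the training rows
test_pred = np.zeros(len(X_te))   # fold-averaged predictions for the test rows

skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=1493)
for trn_idx, val_idx in skf.split(X_tr, y_tr):
    clf = LogisticRegression(max_iter=1000).fit(X_tr[trn_idx], y_tr[trn_idx])
    # Each training row is predicted by the model that did not see it
    oof[val_idx] = clf.predict_proba(X_tr[val_idx])[:, 1]
    test_pred += clf.predict_proba(X_te)[:, 1] / n_folds

oof_auc = roc_auc_score(y_tr, oof)
print("OOF AUC:", round(oof_auc, 4))
```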
In [ ]:
np.sort(predictions)