Python cheat sheet for training on text with a GRU

simpleGRU

A simple GRU using the GPU and Keras

In [ ]:
import os
import time
import numpy as np 
import pandas as pd
from tqdm import tqdm
import math

from sklearn.model_selection import train_test_split
from sklearn import metrics

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation, CuDNNGRU, CuDNNLSTM, Conv1D, Bidirectional, GlobalMaxPool1D
from keras.models import Model
from keras import initializers, regularizers, constraints, optimizers, layers

from sklearn.model_selection import cross_val_score

from sklearn.metrics import accuracy_score, confusion_matrix, precision_recall_fscore_support

import matplotlib.pyplot as plt

Load the tab-delimited text file.

On Windows the file may be encoded as cp949; otherwise it is usually read as utf-8.
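A minimal fallback pattern for this (a hypothetical helper, `read_text_any`, sketched with the standard library only) tries utf-8 first and falls back to cp949:

```python
import os
import tempfile

def read_text_any(path):
    """Try utf-8 first, then cp949 (common on Korean Windows)."""
    for enc in ("utf-8", "cp949"):
        try:
            with open(path, encoding=enc) as f:
                return f.read(), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("could not decode %s" % path)

# demo: a cp949-encoded tab-separated line
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with open(path, "w", encoding="cp949") as f:
    f.write("문장\t라벨")
text, enc = read_text_any(path)
os.remove(path)
print(enc)  # cp949
```

The detected encoding can then be passed straight to `pd.read_csv(..., encoding=enc)`.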

In [2]:
train_raw = pd.read_csv("raw.txt",sep='\t', encoding='cp949')
train, test = train_test_split(train_raw, test_size = 0.1, random_state = 58)

EMBED_SIZE: dimensionality of the embedding vectors

MAX_FEATURE: maximum number of words kept as features; if set too low, less frequent words are dropped during conversion.

MAX_LEN: sentences longer than this are truncated.

In [ ]:
EMBED_SIZE = 300
MAX_FEATURE = 100000
MAX_LEN = 300

Handle missing values

In [ ]:
train_x = train['sentence'].fillna("_na_").values
test_x = test['sentence'].fillna("_na_").values

Split the text into tokens and prepare the tokenizer

In [ ]:
tokenizer = Tokenizer(num_words=MAX_FEATURE)

tokenizer.fit_on_texts(list(train_x))
tokenizer.fit_on_texts(list(test_x))
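Under the hood, `Tokenizer` builds a frequency-ranked word index, and `texts_to_sequences` keeps only words ranked below `num_words`, silently dropping the rest. A pure-Python sketch of the same idea (hypothetical helpers, not the Keras API, which additionally lowercases and strips punctuation):

```python
from collections import Counter

def build_index(texts, num_words):
    counts = Counter(w for t in texts for w in t.lower().split())
    ranked = [w for w, _ in counts.most_common()]
    # keep only the num_words-1 most frequent words; 0 is reserved for padding
    return {w: i + 1 for i, w in enumerate(ranked) if i + 1 < num_words}

def to_sequences(texts, index):
    # out-of-index words are silently dropped, like texts_to_sequences
    return [[index[w] for w in t.lower().split() if w in index] for t in texts]

texts = ["the cat sat on the mat", "the dog sat"]
idx = build_index(texts, num_words=4)
seqs = to_sequences(texts, idx)
print(seqs)  # [[1, 3, 2, 1], [1, 2]]
```

This is why a too-small MAX_FEATURE makes rare words disappear from the sequences entirely.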

Convert the texts to integer sequences

In [3]:
train_s = tokenizer.texts_to_sequences(train_x)
test_s = tokenizer.texts_to_sequences(test_x)

Pad sequences shorter than MAX_LEN with zeros (longer ones are truncated)

In [ ]:
train_p = pad_sequences(train_s, maxlen=MAX_LEN)
test_p = pad_sequences(test_s, maxlen=MAX_LEN)
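By default `pad_sequences` pads on the left with zeros and also truncates from the left when a sequence exceeds `maxlen`, which is why index 0 stays reserved for padding. A NumPy sketch of that default `'pre'` behaviour (hypothetical helper, not the Keras implementation):

```python
import numpy as np

def pad_pre(seqs, maxlen):
    out = np.zeros((len(seqs), maxlen), dtype=int)
    for i, s in enumerate(seqs):
        s = s[-maxlen:]                # pre-truncate: keep the last maxlen tokens
        out[i, maxlen - len(s):] = s   # pre-pad: zeros on the left
    return out

out = pad_pre([[5, 3], [7, 2, 9, 4]], maxlen=3)
print(out)  # [[0 5 3]
            #  [2 9 4]]
```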

Split off the label data

In [ ]:
train_y = train['label']
test_y = test['label']

One-hot encode the labels for classification

In [3]:
train_dummy_y = pd.get_dummies(train_y)
test_dummy_y = pd.get_dummies(test_y)
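`get_dummies` produces one column per class value, ordered by sorted label. One caveat worth knowing: calling it separately on train and test can yield mismatched columns if a class is missing from one split. A toy example (made-up labels):

```python
import pandas as pd

labels = pd.Series([0, 2, 1, 0])
one_hot = pd.get_dummies(labels)   # columns ordered by sorted class value: 0, 1, 2
print(one_hot.values.astype(int))
# [[1 0 0]
#  [0 0 1]
#  [0 1 0]
#  [1 0 0]]
```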

Build an Embedding layer and a Bidirectional GRU, then compile the model

In [4]:
inp = Input(shape=(MAX_LEN,))

x = Embedding(MAX_FEATURE, EMBED_SIZE)(inp)
x = Bidirectional(CuDNNGRU(64, return_sequences=True))(x)
x = GlobalMaxPool1D()(x)
x = Dense(16, activation="relu")(x)
x = Dropout(0.1)(x)
# Note: for three mutually exclusive classes, softmax with
# categorical_crossentropy is the standard pairing; sigmoid with
# binary_crossentropy treats each class as an independent binary label.
x = Dense(3, activation="sigmoid")(x)

model = Model(inputs=inp, outputs=x)

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc', 'mae'])
In [5]:
print(model.summary())
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 300)               0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 300, 300)          30000000  
_________________________________________________________________
bidirectional_1 (Bidirection (None, 300, 128)          140544    
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 128)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 16)                2064      
_________________________________________________________________
dropout_1 (Dropout)          (None, 16)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 3)                 51        
=================================================================
Total params: 30,142,659
Trainable params: 30,142,659
Non-trainable params: 0
_________________________________________________________________
None
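The parameter counts in the summary can be verified by hand: the embedding dominates, and the CuDNN GRU uses three gates, each with input weights, recurrent weights, and two bias vectors:

```python
EMBED_SIZE, MAX_FEATURE, UNITS = 300, 100000, 64

embedding = MAX_FEATURE * EMBED_SIZE             # 30,000,000
# one direction: 3 gates x (input weights + recurrent weights + 2 biases)
gru_one_dir = 3 * (EMBED_SIZE * UNITS + UNITS * UNITS + 2 * UNITS)
bi_gru = 2 * gru_one_dir                         # 140,544
dense_1 = 2 * UNITS * 16 + 16                    # pooled bidirectional output is 128-dim
dense_2 = 16 * 3 + 3                             # 51
total = embedding + bi_gru + dense_1 + dense_2

print(total)  # 30142659
```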
In [6]:
history = model.fit(train_p, train_dummy_y, batch_size=512, epochs=24, validation_data=(test_p, test_dummy_y))
Train on 3340 samples, validate on 372 samples
Epoch 1/24
3340/3340 [==============================] - 10s 3ms/step - loss: 0.6666 - acc: 0.5876 - mean_absolute_error: 0.4848 - val_loss: 0.6339 - val_acc: 0.5663 - val_mean_absolute_error: 0.4618
Epoch 2/24
...
Epoch 24/24
3340/3340 [==============================] - 1s 265us/step - loss: 0.0137 - acc: 0.9966 - mean_absolute_error: 0.0105 - val_loss: 1.1659 - val_acc: 0.7294 - val_mean_absolute_error: 0.2768

Use argmax to select the column with the highest probability

In [7]:
pred_y = model.predict([test_p], batch_size=1024, verbose=1)
answer = np.argmax(pred_y, axis=1)

plt.plot(history.history['acc'])
confusion_matrix(test_y, answer) 
372/372 [==============================] - 0s 546us/step
Out[7]:
array([[ 22,  11,  15],
       [  5,  74,  51],
       [  8,  65, 121]], dtype=int64)
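`np.argmax` with `axis=1` returns, for each row, the index of the highest-probability column, giving the class labels fed into the confusion matrix. A toy illustration (made-up probabilities):

```python
import numpy as np

pred = np.array([[0.1, 0.7, 0.2],
                 [0.8, 0.1, 0.1],
                 [0.2, 0.3, 0.5]])
answer = np.argmax(pred, axis=1)  # highest-probability column per row
print(answer)  # [1 0 2]
```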

Check accuracy

In [9]:
precision_recall_fscore_support(test_y, answer, average='micro')
Out[9]:
(0.5833333333333334, 0.5833333333333334, 0.5833333333333334, None)
In [10]:
precision_recall_fscore_support(test_y, answer, average='macro')
Out[10]:
(0.589654528478058, 0.550425147590096, 0.5646208380578933, None)
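In single-label multiclass classification, micro-averaged precision, recall, and F1 all equal accuracy (every false positive for one class is a false negative for another), while macro averaging weights each class equally regardless of support. A toy check (made-up labels, not the notebook's data):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 1, 2, 1, 2]

micro = precision_recall_fscore_support(y_true, y_pred, average='micro')
macro = precision_recall_fscore_support(y_true, y_pred, average='macro')
acc = accuracy_score(y_true, y_pred)

print(micro[2], acc)  # micro F1 equals accuracy
```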
