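• Raw data for machine learning is extracted from many kinds of sources, such as log files, dataset files, and databases
  • Learn how to load data from a variety of sources, including CSV files and SQL databases
  • Learn how to generate simulated data with the properties an experiment needs
  • pandas is used to load external data, and scikit-learn to generate simulated data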

1. Loading a Sample Dataset

  • Use the example data that ships with scikit-learn
# pip install scikit-learn
from sklearn import datasets

# Load the digits dataset
digits = datasets.load_digits()

# Feature matrix
features = digits.data

# Target vector
target = digits.target

# Inspect the first sample
features[0]

# Output
array([ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.,  0.,  0., 13., 15., 10.,
       15.,  5.,  0.,  0.,  3., 15.,  2.,  0., 11.,  8.,  0.,  0.,  4.,
       12.,  0.,  0.,  8.,  8.,  0.,  0.,  5.,  8.,  0.,  0.,  9.,  8.,
        0.,  0.,  4., 11.,  0.,  1., 12.,  7.,  0.,  0.,  2., 14.,  5.,
       10., 12.,  0.,  0.,  0.,  0.,  6., 13., 10.,  0.,  0.,  0.])

 

  • scikit-learn's example datasets are dictionary-like objects (Bunch), so they expose keys and values
digits.keys()

# Output
dict_keys(['data', 'target', 'frame', 'feature_names', 'target_names', 'images', 'DESCR'])


# The DESCR key holds a description of the dataset
print(digits['DESCR'])

# Output
.. _digits_dataset:

Optical recognition of handwritten digits dataset
--------------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 1797
    :Number of Attributes: 64
    :Attribute Information: 8x8 image of integer pixels in the range 0..16.
    :Missing Attribute Values: None
    :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
    :Date: July; 1998

This is a copy of the test set of the UCI ML hand-written digits datasets
https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

The data set contains images of hand-written digits: 10 classes where
each class refers to a digit.

Preprocessing programs made available by NIST were used to extract
normalized bitmaps of handwritten digits from a preprinted form. From a
total of 43 people, 30 contributed to the training set and different 13
to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of
4x4 and the number of on pixels are counted in each block. This generates
...
    2005.
  - Claudio Gentile. A New Approximate Maximal Margin Classification
    Algorithm. NIPS. 2000.

 

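  • Setting the return_X_y parameter of a load_* function to True returns the feature array X and the target array y instead of the dictionary-like object
import numpy as np

# load_digits additionally accepts n_class, which limits how many digit classes are loaded
# Here only the 5 digits 0 through 4 are included in y
X, y = datasets.load_digits(n_class = 5, return_X_y = True)

np.unique(y)
# Output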
array([0, 1, 2, 3, 4])

 

 

2. Creating a Simulated Dataset

  • make_regression is recommended for data meant for linear regression algorithms
  • make_regression returns a real-valued feature matrix and a real-valued target vector
from sklearn.datasets import make_regression

# n_features is the total number of features
# n_informative is the number of features used to generate the target vector
features, target, coefficients = make_regression(n_samples = 100,
                                                 n_features = 3,
                                                 n_informative = 3,
                                                 n_targets = 1,
                                                 noise = 0.0,
                                                 coef = True,
                                                 random_state = 1)
print('Feature matrix\n', features[:3])
print('Target vector\n', target[:3])

# Output
Feature matrix
 [[ 1.29322588 -0.61736206 -0.11044703]
 [-2.793085    0.36633201  1.93752881]
 [ 0.80186103 -0.18656977  0.0465673 ]]
Target vector
 [-10.37865986  25.5124503   19.67705609]

 

  • make_classification is recommended for data meant for classification algorithms
  • make_classification returns a real-valued feature matrix and an integer target vector
from sklearn.datasets import make_classification
# n_redundant is the number of redundant features, generated as random linear combinations of the informative features
# weights gives the proportions of the first and second classes, in that order
features, target = make_classification(n_samples = 100,
                                       n_features = 3,
                                       n_informative = 3,
                                       n_redundant = 0,
                                       n_classes = 2,
                                       weights = [.25, .75],
                                       random_state = 1)
print('Feature matrix\n', features[:3])
print('Target vector\n', target[:3])

# Output
Feature matrix
 [[ 1.06354768 -1.42632219  1.02163151]
 [ 0.23156977  1.49535261  0.33251578]
 [ 0.15972951  0.83533515 -0.40869554]]
Target vector
 [1 0 0]

 

  • make_blobs is recommended for data meant for clustering algorithms
  • make_blobs returns a real-valued feature matrix and an integer target vector
from sklearn.datasets import make_blobs
features, target = make_blobs(n_samples = 100,
                              n_features = 2,
                              centers = 3,
                              cluster_std = 0.5,
                              shuffle = True,
                              random_state = 1)
print('Feature matrix\n', features[:3])
print('Target vector\n', target[:3])

# Output
Feature matrix
 [[ -1.22685609   3.25572052]
 [ -9.57463218  -4.38310652]
 [-10.71976941  -4.20558148]]
Target vector
 [0 1 1]
# Visualize the generated cluster data
# pip install matplotlib
import matplotlib.pyplot as plt

plt.scatter(features[:, 0], features[:, 1], c = target)
plt.show()

 

 

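3. Loading a CSV File

  • Use read_csv() from the pandas library
  • Parameters
    • sep = ',': sets the delimiter used in the file to ','
    • skiprows = range(1, 11): skips rows 1 through 10 (row 0, the header line, is kept)
    • nrows = 1: reads only one row
import pandas as pd

dataframe = pd.read_csv('path to csv file')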
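
The parameters listed above can be tried without a file on disk. The following is a minimal sketch that assumes a small in-memory CSV (the `id` and `value` columns and their contents are made up for illustration) and shows which row survives when `skiprows` and `nrows` are combined.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV: a header line followed by 20 data rows (id 1..20)
csv_text = "id,value\n" + "\n".join(f"{i},{i * 10}" for i in range(1, 21))

# Skip file lines 1-10 (the first ten data rows), then read a single row
dataframe = pd.read_csv(io.StringIO(csv_text),
                        sep = ',',
                        skiprows = range(1, 11),
                        nrows = 1)

print(dataframe)
```

Because line 0 (the header) is not in the skipped range, the column names are preserved and the single row that is read is the eleventh data row.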

 

 

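4. Loading an Excel File

  • Use read_excel() from the pandas library
  • read_excel() needs an Excel engine package: xlrd reads legacy .xls files, while recent pandas versions read .xlsx files through openpyxl
  • Parameters
    • sheet_name: the sheet name as a string, or an integer giving the sheet position (0-based index)
    • na_filter: whether to detect missing values; for data with no missing values, setting na_filter = False can improve performance
    • skiprows: rows to skip at the start of the sheet
    • nrows: number of rows to read
    • keep_default_na: whether to keep the default set of strings that are parsed as NaN
    • na_values = '-': treats '-' as a missing value
# pip install xlrd      (for .xls files)
# pip install openpyxl  (for .xlsx files)
import pandas as pd

dataframe = pd.read_excel('path to excel file')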

 

 

5. Loading a JSON File

  • Use read_json() from the pandas library
  • Parameters
    • orient = 'columns' (one of split, records, index, columns, values): specifies how the JSON file is structured; finding the right value may take some experimentation
      • 'split': {"index" : [index, ...], "columns" : [column, ...], "data" : [values, ...]}
      • 'records': [{column : value}, ..., {column : value}]
      • 'index': {index : {column : value, ...}, ...}
      • 'values': [values, ...]
      • 'columns': {column : {index : value, ...}, ...}
    • json_normalize: a tool that converts semi-structured JSON data into a DataFrame
import pandas as pd

dataframe = pd.read_json('path to json file', orient = 'columns')
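
To make the orient layouts concrete, here is a small sketch that parses the same two-row table from a 'records' string and a 'split' string, and then flattens a nested structure with json_normalize. The JSON strings and the `id` / `info` field names are invented for illustration.

```python
import io
import pandas as pd

# The same 2x2 table written in two different layouts (illustrative data)
records_json = '[{"a": 1, "b": 2}, {"a": 3, "b": 4}]'
split_json = '{"index": [0, 1], "columns": ["a", "b"], "data": [[1, 2], [3, 4]]}'

df_records = pd.read_json(io.StringIO(records_json), orient = 'records')
df_split = pd.read_json(io.StringIO(split_json), orient = 'split')

# Nested dictionaries are flattened into dotted column names
nested = [{"id": 1, "info": {"name": "kim", "age": 30}},
          {"id": 2, "info": {"name": "lee", "age": 25}}]
df_nested = pd.json_normalize(nested)

print(df_records)
print(df_split)
print(df_nested.columns.tolist())
```

If the wrong orient is passed, pandas typically either raises an error or returns a frame with unexpected axes, which is why trying a few values on a small sample is a reasonable way to identify the layout.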

 

 

6. Loading from a SQL Database

  • Use read_sql_query() from the pandas library
  • First, use the create_engine function to connect to the SQLite database engine
  • Then, use read_sql_query to query the database with SQL and bring the result back as a DataFrame
# pip install sqlalchemy
import pandas as pd
from sqlalchemy import create_engine

# Connect to the database named sample.db
database_connection = create_engine('sqlite:///sample.db')

# Ask for all columns of the table named data in sample.db
dataframe = pd.read_sql_query('SELECT * FROM data', database_connection)

# To fetch every row, read_sql_table() can also be used without writing a query
dataframe = pd.read_sql_table('data', database_connection)
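
The example above needs an existing sample.db file. As a self-contained sketch, the following swaps in Python's built-in sqlite3 module with an in-memory database in place of the SQLAlchemy engine (a substitution made here only so the snippet runs without extra files; the table contents are made up).

```python
import sqlite3
import pandas as pd

# In-memory SQLite database used as a stand-in for sample.db
connection = sqlite3.connect(':memory:')

# Create a small table named data from an illustrative DataFrame
source = pd.DataFrame({'name': ['a', 'b', 'c'], 'score': [90, 75, 60]})
source.to_sql('data', connection, index = False)

# Query the table with SQL and get the result back as a DataFrame
dataframe = pd.read_sql_query('SELECT * FROM data WHERE score >= 70', connection)
print(dataframe)

connection.close()
```

Note that read_sql_table() requires a SQLAlchemy connection, so only read_sql_query() is shown in this sqlite3-based variant.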

1. Recurrent Neural Network (RNN)

  • A type of neural network that contains a loop
  • It iterates over the elements of a sequence and stores the information processed so far in a state

https://aditi-mittal.medium.com/understanding-rnn-and-lstm-f7cdf6dfc14e

 

  - RNN Layer

  • Input: (timesteps, input_features)
  • Output: (timesteps, output_features)
# Expressing the RNN structure with numpy
import numpy as np

timesteps = 100
input_features = 32
output_features = 64

inputs = np.random.random((timesteps, input_features))

state_t = np.zeros((output_features, ))

W = np.random.random((output_features, input_features))
U = np.random.random((output_features, output_features))
b = np.random.random((output_features, ))

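successive_outputs = []

for input_t in inputs:
    # Combine the current input with the previous state to get the current output
    output_t = np.tanh(np.dot(W, input_t) + np.dot(U, state_t) + b)
    successive_outputs.append(output_t)
    # The output becomes the state for the next timestep
    state_t = output_t

final_output_sequence = np.stack(successive_outputs, axis = 0)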

 

  - Recurrent layers in Keras

  • SimpleRNN layer
  • Input: (batch_size, timesteps, input_features)
  • Output
    • Controlled by return_sequences
    • 3D tensor
      • Returns the full sequence of outputs, one per timestep
      • (batch_size, timesteps, output_features)
    • 2D tensor
      • Returns only the last output for each input sequence
      • (batch_size, output_features)
from tensorflow.keras.layers import SimpleRNN, Embedding
from tensorflow.keras.models import Sequential

model = Sequential()
model.add(Embedding(10000, 32))
model.add(SimpleRNN(32))  # Passing return_sequences = True to SimpleRNN returns the full sequence instead
model.summary()

# Output
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding (Embedding)       (None, None, 32)          320000    
                                                                 
 simple_rnn (SimpleRNN)      (None, 32)                2080      
                                                                 
=================================================================
Total params: 322,080
Trainable params: 322,080
Non-trainable params: 0
_________________________________________________________________
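  • Stacking several recurrent layers one after another is sometimes useful for increasing the representational power of the network
    • In that setup, the intermediate layers must be set to return their full output sequences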
model = Sequential()
model.add(Embedding(10000, 32))
model.add(SimpleRNN(32, return_sequences = True))
model.add(SimpleRNN(32, return_sequences = True))
model.add(SimpleRNN(32, return_sequences = True))
model.add(SimpleRNN(32))
model.summary()

# Output
Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding_2 (Embedding)     (None, None, 32)          320000    
                                                                 
 simple_rnn_2 (SimpleRNN)    (None, None, 32)          2080      
                                                                 
 simple_rnn_3 (SimpleRNN)    (None, None, 32)          2080      
                                                                 
 simple_rnn_4 (SimpleRNN)    (None, None, 32)          2080      
                                                                 
 simple_rnn_5 (SimpleRNN)    (None, 32)                2080      
                                                                 
=================================================================
Total params: 328,320
Trainable params: 328,320
Non-trainable params: 0
_________________________________________________________________
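
The parameter counts in the summaries above can be reproduced by hand. The sketch below assumes the standard SimpleRNN formulation with one input weight matrix, one recurrent weight matrix, and one bias vector; the helper names are hypothetical and not part of Keras.

```python
def embedding_params(vocab_size, output_dim):
    # One output_dim-sized vector per vocabulary entry
    return vocab_size * output_dim

def simple_rnn_params(input_dim, units):
    # Input weights + recurrent weights + bias
    return units * input_dim + units * units + units

embedding = embedding_params(10000, 32)
rnn = simple_rnn_params(32, 32)

print(embedding)            # Embedding layer
print(rnn)                  # one SimpleRNN(32) layer fed with 32 features
print(embedding + rnn)      # single-layer model
print(embedding + 4 * rnn)  # model with four stacked SimpleRNN layers
```

Every stacked layer has the same count because each one receives 32 features from the layer before it and has 32 units itself.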

 

  - Applying it to the IMDB data

  - Loading the data

from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing import sequence

num_words = 10000
max_len = 500
batch_size = 32

(input_train, y_train), (input_test, y_test) = imdb.load_data(num_words = num_words)
print(len(input_train))  # 25000
print(len(input_test))   # 25000

input_train = sequence.pad_sequences(input_train, maxlen = max_len)
input_test = sequence.pad_sequences(input_test, maxlen = max_len)
print(input_train.shape) # (25000, 500)
print(input_test.shape)  # (25000, 500)

 

  - Building the model

from tensorflow.keras.layers import Dense

model = Sequential()

model.add(Embedding(num_words, 32))
model.add(SimpleRNN(32))
model.add(Dense(1, activation = 'sigmoid'))

model.compile(optimizer = 'rmsprop',
              loss = 'binary_crossentropy',
              metrics = ['acc'])

model.summary()

# Output
Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding_3 (Embedding)     (None, None, 32)          320000    
                                                                 
 simple_rnn_6 (SimpleRNN)    (None, 32)                2080      
                                                                 
 dense (Dense)               (None, 1)                 33        
                                                                 
=================================================================
Total params: 322,113
Trainable params: 322,113
Non-trainable params: 0
_________________________________________________________________

 

  - Training the model

history = model.fit(input_train, y_train,
                    epochs = 10,
                    batch_size = 128,
                    validation_split = 0.2)

 

 

  - Visualization

import matplotlib.pyplot as plt

loss = history.history['loss']
val_loss = history.history['val_loss']
acc = history.history['acc']
val_acc = history.history['val_acc']

epochs = range(1, len(loss) + 1)

plt.plot(epochs, loss, 'b--', label = 'train loss')
plt.plot(epochs, val_loss, 'r:', label = 'validation loss')
plt.grid()
plt.legend()

plt.figure()
plt.plot(epochs, acc, 'b--', label = 'train accuracy')
plt.plot(epochs, val_acc, 'r:', label = 'validation accuracy')
plt.grid()
plt.legend()

model.evaluate(input_test, y_test)

# Output
loss: 0.6755 - acc: 0.7756
[0.6754735112190247, 0.7755600214004517]
  • Performance is fairly low, partly because only 500 words of each review are fed in rather than the full sequence
  • SimpleRNN is not well suited to processing long sequences

 

 

2. LSTM and GRU Layers

  • SimpleRNN is too simple for practical use
  • In theory SimpleRNN can retain information from all previous timesteps at time t, but in practice it cannot learn long-term dependencies
  • The cause is the vanishing gradient problem
    • Layers such as LSTM and GRU were introduced to counter it
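
The vanishing gradient problem can be illustrated with a toy scalar recurrence. This is only a sketch under simplifying assumptions (a single unit, a fixed recurrent weight of 0.5, random inputs), not a description of how Keras computes gradients; the function name is hypothetical.

```python
import numpy as np

def gradient_through_time(w, inputs):
    """Return d h_t / d h_0 at every step of h_t = tanh(w * h_{t-1} + x_t)."""
    h = 0.0
    grad = 1.0
    grads = []
    for x_t in inputs:
        h = np.tanh(w * h + x_t)
        # Chain rule: d h_t / d h_{t-1} = w * (1 - h_t ** 2)
        grad *= w * (1.0 - h ** 2)
        grads.append(grad)
    return np.array(grads)

rng = np.random.default_rng(0)
inputs = rng.normal(size = 50)
grads = gradient_through_time(w = 0.5, inputs = inputs)

# Gradient of the state with respect to h_0 after 1, 10, and 50 steps
print(grads[0], grads[9], grads[49])
```

Each step multiplies the gradient by a factor of at most 0.5 in magnitude, so after 50 steps the influence of the initial state is effectively zero. That is the sense in which long-range dependencies cannot be learned.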

 

  - LSTM (Long Short-Term Memory)

  • The long short-term memory algorithm
  • By saving information for later, it keeps older signals from gradually vanishing

https://colah.github.io/posts/2015-08-Understanding-LSTMs/

 

  - Example 1) Reuters

  • A text dataset similar to IMDB
  • A dataset made up of 46 mutually exclusive topics
    • A multi-class classification problem

  - Loading the dataset

from tensorflow.keras.datasets import reuters

num_words = 10000
(x_train, y_train), (x_test, y_test) = reuters.load_data(num_words = num_words)

print(x_train.shape) # (8982,)
print(y_train.shape) # (8982,)
print(x_test.shape)  # (2246,)
print(y_test.shape)  # (2246,)

 

  - Preprocessing and checking the data

from tensorflow.keras.preprocessing.sequence import pad_sequences

max_len = 500

pad_x_train = pad_sequences(x_train, maxlen = max_len)
pad_x_test = pad_sequences(x_test, maxlen = max_len)

print(len(pad_x_train[0]))  # 500

pad_x_train[0]

 

  - Building the model

  • Like SimpleRNN, the LSTM layer accepts the return_sequences argument
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Embedding

model = Sequential()
model.add(Embedding(input_dim = num_words, output_dim = 64))
model.add(LSTM(64, return_sequences = True))
model.add(LSTM(32))
model.add(Dense(46, activation = 'softmax'))

model.compile(optimizer = 'adam',
              loss = 'sparse_categorical_crossentropy',
              metrics = ['acc'])
model.summary()

# Output
Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding_1 (Embedding)     (None, None, 64)          640000    
                                                                 
 lstm (LSTM)                 (None, None, 64)          33024     
                                                                 
 lstm_1 (LSTM)               (None, 32)                12416     
                                                                 
 dense (Dense)               (None, 46)                1518      
                                                                 
=================================================================
Total params: 686,958
Trainable params: 686,958
Non-trainable params: 0
_________________________________________________________________

 

  - Training the model

history = model.fit(pad_x_train, y_train,
                    epochs = 20,
                    batch_size = 32,
                    validation_split = 0.2)

 

  - Visualization

import matplotlib.pyplot as plt

loss = history.history['loss']
val_loss = history.history['val_loss']
acc = history.history['acc']
val_acc = history.history['val_acc']

epochs = range(1, len(loss) + 1)

plt.plot(epochs, loss, 'b--', label = 'train loss')
plt.plot(epochs, val_loss, 'r:', label = 'validation loss')
plt.grid()
plt.legend()

plt.figure()
plt.plot(epochs, acc, 'b--', label = 'train accuracy')
plt.plot(epochs, val_acc, 'r:', label = 'validation accuracy')
plt.grid()
plt.legend()

 

  - Evaluating the model

model.evaluate(pad_x_test, y_test)

# Output
loss: 1.6927 - acc: 0.6336
[1.692732810974121, 0.6335707902908325]

 

  - Example 2) IMDB dataset

  - Loading the data

from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences

num_words = 10000
max_len = 500
batch_size = 32

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words = num_words)

# Pad or truncate every review to max_len tokens
pad_x_train = pad_sequences(x_train, maxlen = max_len)
pad_x_test = pad_sequences(x_test, maxlen = max_len)

 

  - Building the model

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Embedding

model = Sequential()
model.add(Embedding(num_words, 32))
model.add(LSTM(32))
model.add(Dense(1, activation = 'sigmoid'))

model.compile(optimizer = 'rmsprop',
              loss = 'binary_crossentropy',
              metrics = ['acc'])
model.summary()

# Output
Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding_3 (Embedding)     (None, None, 32)          320000    
                                                                 
 lstm_3 (LSTM)               (None, 32)                8320      
                                                                 
 dense_2 (Dense)             (None, 1)                 33        
                                                                 
=================================================================
Total params: 328,353
Trainable params: 328,353
Non-trainable params: 0
_________________________________________________________________

 

  - Training the model

history = model.fit(pad_x_train, y_train,
                    epochs = 10,
                    batch_size = 128,
                    validation_split = 0.2)

 

  - Visualization

import matplotlib.pyplot as plt

loss = history.history['loss']
val_loss = history.history['val_loss']
acc = history.history['acc']
val_acc = history.history['val_acc']

epochs = range(1, len(loss) + 1)

plt.plot(epochs, loss, 'b--', label = 'train loss')
plt.plot(epochs, val_loss, 'r:', label = 'validation loss')
plt.grid()
plt.legend()

plt.figure()
plt.plot(epochs, acc, 'b--', label = 'train accuracy')
plt.plot(epochs, val_acc, 'r:', label = 'validation accuracy')
plt.grid()
plt.legend()

 

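  - Evaluating the model

model.evaluate(pad_x_test, y_test)

# Output
loss: 0.9135 - acc: 0.7898
[0.9135046601295471, 0.7898399829864502]
  • Compared with the earlier SimpleRNN model (loss 0.6755, acc 0.7756), the LSTM reaches slightly higher accuracy (0.7898), although its test loss (0.9135) is higher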

 

 

3. Recurrent Neural Networks on a Cosine Function

# Cosine time-series data
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(111)
time = np.arange(30 * 12 + 1)
month_time = (time % 30) / 30
time_series = 20 * np.where(month_time < 0.5,
                            np.cos(2 * np.pi * month_time),
                            np.cos(2 * np.pi * month_time) + np.random.random(361))
plt.figure(figsize = (15, 8))
plt.xlabel('Time')
plt.ylabel('Value')
plt.plot(np.arange(0, 30 * 11 + 1),
         time_series[:30 * 11 + 1],
         color = 'blue', alpha = 0.6, label = 'Train Data')
plt.plot(np.arange(30 * 11, 30 * 12 + 1),
         time_series[30 * 11:],
         color = 'orange', label = 'Test Data')
plt.show()

 

  - Data preprocessing

def make_data(time_series, n):
    x_train_full, y_train_full = list(), list()

    for i in range(len(time_series)):
        x = time_series[i:(i + n)]
        if (i + n) < len(time_series):
            x_train_full.append(x)
            y_train_full.append(time_series[i + n])
        else:
            break
    
    x_train_full, y_train_full = np.array(x_train_full), np.array(y_train_full)

    return x_train_full, y_train_full

n = 10
x_train_full, y_train_full = make_data(time_series, n)

print(x_train_full.shape) # (351, 10)
print(y_train_full.shape) # (351,)


# Add a trailing axis of size 1
x_train_full = x_train_full.reshape(-1, n, 1)
y_train_full = y_train_full.reshape(-1, 1)

print(x_train_full.shape) # (351, 10, 1)
print(y_train_full.shape) # (351, 1)

 

  - Creating the test dataset

# Split into train and test data
x_train = x_train_full[:30 * 11]
y_train = y_train_full[:30 * 11]

x_test = x_train_full[30 * 11:]
y_test = y_train_full[30 * 11:]

print(x_train.shape) # (330, 10, 1)
print(y_train.shape) # (330, 1)
print(x_test.shape)  # (21, 10, 1)
print(y_test.shape)  # (21, 1)

 

  - Checking the data

sample_series = np.arange(100)
a, b = make_data(sample_series, 10)

print(a[0])  # [0 1 2 3 4 5 6 7 8 9]
print(b[0])  # 10
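
The same windowing can be written without an explicit loop. The sketch below uses NumPy's sliding_window_view (available in NumPy 1.20 and later); make_data_vectorized is a hypothetical helper introduced here, intended to give the same windows and targets as make_data.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_data_vectorized(time_series, n):
    # All length-n windows: shape (len(time_series) - n + 1, n)
    windows = sliding_window_view(time_series, n)
    # Drop the last window, which has no following value to predict
    x = windows[:-1]
    # The target for each remaining window is the value right after it
    y = time_series[n:]
    return x, y

sample_series = np.arange(100)
a, b = make_data_vectorized(sample_series, 10)

print(a.shape, b.shape)
print(a[0], b[0])
print(a[-1], b[-1])
```

sliding_window_view returns a read-only view, so copy the result (for example with np.array(x)) before modifying it in place.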

 

  - Building the model

from tensorflow.keras.layers import SimpleRNN, Flatten, Dense
from tensorflow.keras.models import Sequential

def build_model(n):
    model = Sequential()

    model.add(SimpleRNN(units = 32, activation = 'tanh', input_shape = (n, 1)))
    model.add(Dense(1))

    model.compile(optimizer = 'adam',
                  loss = 'mse')
    return model

model = build_model(10)
model.summary()

# Output
Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 simple_rnn (SimpleRNN)      (None, 32)                1088      
                                                                 
 dense_3 (Dense)             (None, 1)                 33        
                                                                 
=================================================================
Total params: 1,121
Trainable params: 1,121
Non-trainable params: 0
_________________________________________________________________

 

  - Training the model

model.fit(x_train, y_train,
          epochs = 100, batch_size = 12)

 

  - Plotting the predictions

prediction = model.predict(x_test)

pred_range = np.arange(len(y_train), len(y_train) + len(prediction))

plt.figure(figsize = (12, 5))
plt.xlabel('Time')
plt.ylabel('Value')
plt.plot(pred_range, y_test.flatten(), color = 'orange', label = 'Ground Truth')
plt.plot(pred_range, prediction.flatten(), color = 'blue', label = 'Prediction')
plt.legend()
plt.show()

 

  - Rebuilding the model

  • Using LSTM
from tensorflow.keras.layers import LSTM

def build_model2(n):
    model = Sequential()

    model.add(LSTM(units = 64, return_sequences = True, input_shape = (n, 1)))
    model.add(LSTM(32))
    model.add(Dense(1))

    model.compile(optimizer = 'adam',
                  loss = 'mse')
    return model

model2 = build_model2(10)
model2.summary()

# Output
Model: "sequential_6"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 lstm_4 (LSTM)               (None, 10, 64)            16896     
                                                                 
 lstm_5 (LSTM)               (None, 32)                12416     
                                                                 
 dense_5 (Dense)             (None, 1)                 33        
                                                                 
=================================================================
Total params: 29,345
Trainable params: 29,345
Non-trainable params: 0
_________________________________________________________________

 

  - Retraining and plotting the predictions

model2.fit(x_train, y_train,
           epochs = 100, batch_size = 12)

prediction_2 = model2.predict(x_test)

pred_range = np.arange(len(y_train), len(y_train) + len(prediction_2))

plt.figure(figsize = (12, 5))
plt.xlabel('Time')
plt.ylabel('Value')
plt.plot(pred_range, y_test.flatten(), color = 'orange', label = 'Ground Truth')
# 'r:' is a format string (red dotted line), so it is passed positionally rather than as color
plt.plot(pred_range, prediction.flatten(), 'r:', label = 'Model1 Prediction')
plt.plot(pred_range, prediction_2.flatten(), color = 'blue', label = 'Model2 Prediction')
plt.legend()
plt.show()

 

  - Rebuilding the model

  • Using GRU (a simpler structure than LSTM)
from tensorflow.keras.layers import GRU

def build_model3(n):
    model = Sequential()

    model.add(GRU(units = 30, return_sequences = True, input_shape = (n, 1)))
    model.add(GRU(30))
    model.add(Dense(1))

    model.compile(optimizer = 'adam',
                  loss = 'mse')
    return model

model_3 = build_model3(10)
model_3.summary()

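# Output (as pasted in the original post; the layer names and parameter counts are those of the LSTM model above, not of this GRU model)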
Model: "sequential_7"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 lstm_6 (LSTM)               (None, 10, 64)            16896     
                                                                 
 lstm_7 (LSTM)               (None, 32)                12416     
                                                                 
 dense_6 (Dense)             (None, 1)                 33        
                                                                 
=================================================================
Total params: 29,345
Trainable params: 29,345
Non-trainable params: 0
_________________________________________________________________

 

  - Retraining and plotting the predictions

model_3.fit(x_train, y_train,
            epochs = 100, batch_size = 12)

prediction_3 = model_3.predict(x_test)

pred_range = np.arange(len(y_train), len(y_train) + len(prediction_3))

plt.figure(figsize = (12, 5))
plt.xlabel('Time')
plt.ylabel('Value')
plt.plot(pred_range, y_test.flatten(), color = 'orange', label = 'Ground Truth')
plt.plot(pred_range, prediction.flatten(), 'r:', label = 'Model1 Prediction')
plt.plot(pred_range, prediction_2.flatten(), color = 'blue', label = 'Model2 Prediction')
plt.plot(pred_range, prediction_3.flatten(), color = 'green', label = 'Model3 Prediction')
plt.legend()
plt.show()

 

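  - Conv1D

  • Performs well on relatively simple problems such as text classification and time-series forecasting, as well as on audio generation and machine translation
  • Not sensitive to the order of timesteps
  • 2D Convolution
    • Recognizes local features
  • 1D Convolution
    • Recognizes context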

 

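  - Conv1D Layer

  • Input: (batch_size, timesteps, channels)
  • Output: (batch_size, timesteps, filters); with the default 'valid' padding and stride 1, the output timesteps shrink to timesteps - kernel_size + 1
  • Because the number of parameters grows only linearly with the filter size, a variety of filter sizes can be used
  • If the data quality is good, it may not be necessary to use several filters of different sizes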

 

  - MaxPooling1D Layer

  • Has a downsampling effect
  • Simply the 1D counterpart of 2D max pooling

 

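  - GlobalMaxPooling Layer

  • A layer that reduces the 2D form (excluding the batch dimension) to a 1D form by taking the maximum over the timestep axis
  • A Flatten layer can be used in its place, though it keeps every timestep and so yields a longer vector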
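
The difference between global max pooling and flattening can be seen with plain NumPy. This is a sketch of the shape arithmetic only; the array is random illustrative data standing in for a Conv1D output, and no Keras layer is involved.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size = (2, 13, 32))   # (batch_size, timesteps, filters)

# Global max pooling: keep the largest value of each filter across all timesteps
global_max = x.max(axis = 1)

# Flatten: keep everything, laid out as one long vector per sample
flattened = x.reshape(len(x), -1)

print(global_max.shape)   # (2, 32)
print(flattened.shape)    # (2, 416)
```

Both results are one vector per sample and can feed a Dense layer, but the flattened version makes the Dense layer's weight count depend on the sequence length.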

 

  - IMDB dataset

  - Loading and preprocessing the data

from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.layers import Dense, Embedding, Conv1D, MaxPooling1D, GlobalMaxPooling1D

num_words = 10000
max_len = 500
batch_size = 32

(input_train, y_train), (input_test, y_test) = imdb.load_data(num_words = num_words)

print(len(input_train))  # 25000
print(len(input_test))   # 25000

pad_x_train = pad_sequences(input_train, maxlen = max_len)
pad_x_test = pad_sequences(input_test, maxlen = max_len)

print(pad_x_train.shape) # (25000, 500)
print(pad_x_test.shape)  # (25000, 500)

 

  - Building the model

def build_model():
    model = Sequential()

    model.add(Embedding(input_dim = num_words, output_dim = 32,
                        input_length = max_len))
    model.add(Conv1D(32, 7, activation = 'relu'))
    model.add(MaxPooling1D(7))
    model.add(Conv1D(32, 5, activation = 'relu'))
    model.add(MaxPooling1D(5))
    model.add(GlobalMaxPooling1D())
    model.add(Dense(1, activation = 'sigmoid'))

    model.compile(optimizer = RMSprop(learning_rate = 1e-4),
                  loss ='binary_crossentropy',
                  metrics = ['accuracy'])
    
    return model

model = build_model()
model.summary()

# Output
Model: "sequential_13"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding_5 (Embedding)     (None, 500, 32)           320000    
                                                                 
 conv1d_2 (Conv1D)           (None, 494, 32)           7200      
                                                                 
 max_pooling1d_2 (MaxPooling  (None, 70, 32)           0         
 1D)                                                             
                                                                 
 conv1d_3 (Conv1D)           (None, 66, 32)            5152      
                                                                 
 max_pooling1d_3 (MaxPooling  (None, 13, 32)           0         
 1D)                                                             
                                                                 
 global_max_pooling1d_1 (Glo  (None, 32)               0         
 balMaxPooling1D)                                                
                                                                 
 dense_12 (Dense)            (None, 1)                 33        
                                                                 
=================================================================
Total params: 332,385
Trainable params: 332,385
Non-trainable params: 0
_________________________________________________________________
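
The output shapes and parameter counts in the summary above follow from simple arithmetic. The sketch below assumes the Keras defaults used in build_model ('valid' padding, convolution stride 1, pooling stride equal to the pool size); the helper functions are hypothetical.

```python
def conv1d_length(length, kernel_size):
    # 'valid' padding, stride 1
    return length - kernel_size + 1

def pool1d_length(length, pool_size):
    # 'valid' padding, stride equal to pool_size
    return length // pool_size

def conv1d_params(in_channels, filters, kernel_size):
    # One kernel of shape (kernel_size, in_channels) per filter, plus one bias per filter
    return kernel_size * in_channels * filters + filters

length = 500
lengths = []
for layer, size in [('conv', 7), ('pool', 7), ('conv', 5), ('pool', 5)]:
    length = conv1d_length(length, size) if layer == 'conv' else pool1d_length(length, size)
    lengths.append(length)

total = 10000 * 32 + conv1d_params(32, 32, 7) + conv1d_params(32, 32, 5) + (32 + 1)

print(lengths)   # timesteps after each conv / pool layer
print(total)     # total trainable parameters
```

The two convolution layers together hold far fewer parameters than the Embedding layer, which shows how cheap it is to change the filter size.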

 

  - Training the model

history = model.fit(pad_x_train, y_train,
                    epochs = 30,
                    batch_size = 128,
                    validation_split = 0.2)

 

  - Visualization

import matplotlib.pyplot as plt

loss = history.history['loss']
val_loss = history.history['val_loss']
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']

epochs = range(1, len(loss) + 1)

plt.plot(epochs, loss, 'b--', label = 'train loss')
plt.plot(epochs, val_loss, 'r:', label = 'validation loss')
plt.grid()
plt.legend()

plt.figure()
plt.plot(epochs, acc, 'b--', label = 'train accuracy')
plt.plot(epochs, val_acc, 'r:', label = 'validation accuracy')
plt.grid()
plt.legend()

model.evaluate(pad_x_test, y_test)

# Output
loss: 0.3534 - accuracy: 0.8526
[0.35335206985473633, 0.8525999784469604]
  • Overfitting occurred, but there are various things to try, such as a different optimizer or regularization

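 While studying data analysis, I came to enjoy coding itself. So I studied web development with Java and several other programming languages as well. Along the way I set a goal of displaying the results of my data analysis as a dashboard on a website I built myself. And since these days almost everything can be done on a phone, I thought it would be good to be able to check the results in a mobile app too, so I wanted to study app development. I wanted to be able to develop for both web and app, and the tool I chose was Flutter, which is based on Dart. The book I chose for studying Flutter is '풀스택 개발이 쉬워지는 다트&플러터'.

 I received this book through a book review event.

 It is published by 영진닷컴, which publishes many IT books, so it is authoritative, and its chapters are ordered in a way that is well suited to learning a language. The book covers the following:

  • Understanding Dart syntax and structure
  • Developing servers and clients with Dart
  • Developing Flutter reference programs
  • Developing desktop and web services with Flutter
  • Taking the first step as a sustainable developer

 

 The book is very thick, but that means a great deal can be learned from a single volume, and I expect it can also serve as a basic reference later when I want to apply Flutter professionally.

 

 Each chapter ends with exercises, so I could practice the concepts I had studied and firm up the parts I did not understand before moving on.

 

 It also explains the errors you may run into while using the language, and the code is explained carefully and kindly, line by line, which helped me learn a lot accurately in a short time even when studying alone.

 

 The best thing about the book is that following it leads you to implement projects yourself. Even if you have studied a language's syntax and concepts accurately, they are of little use if you cannot apply them to a real project. Applying a language to a real project is both important and difficult, but with '풀스택 개발이 쉬워지는 다트&플러터' that step seems manageable.

 

 Thanks to '풀스택 개발이 쉬워지는 다트&플러터', I had the chance to study full-stack web and app development properly, and it was time well spent. The book has been a great help toward my goal of combining data analysis with web and app development.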

9. Training Word2Vec directly in Keras

  - Preparing the data

from tensorflow.keras.datasets import imdb

(x_train, y_train), (x_test, y_test) = imdb.load_data()
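  • Build a dictionary that maps word indices to words
  • Index 1 is reserved for the start of a sequence and index 2 for out-of-vocabulary (OOV) words, so the raw word indices are shifted by 3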
word_index = imdb.get_word_index()
index_word = {idx + 3 : word for word, idx in word_index.items()}

index_word[1] = '<START>'
index_word[2] = '<UNKNOWN>'

' '.join(index_word[i] for i in x_train[0])

# Output
"<START> this film was just brilliant casting location scenery story direction everyone's really
suited the part they played and you could just imagine being there robert redford's is an
amazing actor and now the same being director norman's father came from the same scottish
island as myself so i loved the fact there was a real connection with this film the witty
remarks throughout the film were great it was just brilliant so much that i bought the film
as soon as it was released for retail and would recommend it to everyone to watch and the fly
fishing was amazing really cried at the end it was so sad and you know what they say if you
cry at a film it must have been good and this definitely was also congratulations to the two
little boy's that played the part's of norman and paul they were just brilliant children are
often left out of the praising list i think because the stars that play them all grown up are
such a big profile for the whole film but these children are amazing and should be praised for
what they have done don't you think the whole story was so lovely because it was true and was
someone's life after all that was shared with us all"
num_words = max(index_word) + 1

 

  - Converting text to word indices

texts = []
for data in x_train:
    text = ' '.join(index_word[i] for i in data)
    texts.append(text)

len(texts)  # 25000
  • Use Tokenizer to convert the text into sequences of word indices
from tensorflow.keras.preprocessing.text import Tokenizer

tok = Tokenizer()
tok.fit_on_texts(texts)

new_data = tok.texts_to_sequences(texts)
new_data[0][:10]

# Output
[28, 11, 19, 13, 41, 526, 968, 1618, 1381, 63]
# Tokenize every review, then print the first 10 tokens of the first review

 

  - Building word pairs

import numpy as np
from tensorflow.keras.preprocessing.sequence import make_sampling_table, skipgrams

# Total number of tokens
VOCAB_SIZE = len(tok.word_index)
print(VOCAB_SIZE)  # 88581
  • Drawing words at random over-represents frequent words
  • To counter this, build a sampling table that balances the probability of each word being sampled
table = make_sampling_table(VOCAB_SIZE)
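To make the idea of a balanced sampling table concrete, the sketch below computes a keep-probability for each word rank in plain Python. It assumes that word indices are frequency ranks (1 = most frequent) and that frequencies roughly follow Zipf's law. The helper name `zipf_sampling_table` and its constants are illustrative assumptions in the spirit of word2vec subsampling, not a copy of the Keras implementation.

```python
import math

def zipf_sampling_table(size, sampling_factor=1e-5):
    """Keep-probability per word rank (hypothetical helper, Zipf assumption)."""
    gamma = 0.5772  # Euler-Mascheroni constant, used to approximate harmonic numbers
    table = []
    for rank in range(size):
        r = max(rank, 1)  # rank 0 is the padding index; treat it like rank 1
        # Approximate inverse relative frequency of the r-th most frequent word
        inv_freq = r * (math.log(r) + gamma) + 0.5 - 1.0 / (12.0 * r)
        f = sampling_factor * inv_freq
        # word2vec-style subsampling: keep with probability sqrt(f), capped at 1
        table.append(min(1.0, math.sqrt(f)))
    return table

table_sketch = zipf_sampling_table(100000)
print(table_sketch[1], table_sketch[100], table_sketch[99999])
```

Frequent words (low ranks) get a small probability and rare words approach 1, which is the balancing effect described above.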
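  • Generate data by taking pairs of words and checking whether they occur within two words of each other (window_size = 2)
# NOTE: `data` is the loop variable left over from the loop above, i.e. the last review in x_train (raw IMDB indices);
# a tokenized review such as new_data[0] may have been intended
couples, labels = skipgrams(data, VOCAB_SIZE, window_size = 2, sampling_table = table)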
couples[:5]

# Output
[[16876, 497], [9685, 21], [16876, 21917], [383, 5452], [2098, 13577]]
  • labels is 1 when the pair occurs within the window and 0 otherwise
labels[:5]

# Output
[1, 1, 0, 0, 0]
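The pair-and-label structure shown above can be reproduced by hand. The following is a minimal sketch in plain Python, not the Keras implementation: `make_pairs` is a hypothetical helper that emits every (target, context) pair inside the window with label 1 and, for each of them, one randomly drawn word with label 0.

```python
import random

def make_pairs(sequence, vocab_size, window_size=2, seed=0):
    """Return (couples, labels): in-window pairs get 1, random pairs get 0."""
    rng = random.Random(seed)
    couples, labels = [], []
    for i, target in enumerate(sequence):
        start = max(0, i - window_size)
        end = min(len(sequence), i + window_size + 1)
        for j in range(start, end):
            if j == i:
                continue
            # Positive example: the context word really is within the window
            couples.append([target, sequence[j]])
            labels.append(1)
            # Negative example: a random word index (index 0 is reserved for padding)
            couples.append([target, rng.randint(1, vocab_size - 1)])
            labels.append(0)
    return couples, labels

pairs, pair_labels = make_pairs([5, 8, 2, 9], vocab_size=10)
print(len(pairs), sum(pair_labels))
```

A random negative can occasionally coincide with a true context word; this sketch accepts that noise rather than filtering it out.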
  • Collect the target words in word_target and the context words in word_context
word_target, word_context = zip(*couples)
  • Convert to arrays
word_target = np.asarray(word_target, dtype = 'int32')
word_context = np.asarray(word_context, dtype = 'int32')
labels = np.asarray(labels, dtype = 'int32')

word_target.shape    # (288,)
word_context.shape   # (288,)

 

  - skip-gram model

  • The skip-gram model takes two inputs, so it must be built with the functional API
from tensorflow.keras.layers import Activation, Dot, Embedding, Flatten, Input, Reshape
from tensorflow.keras.models import Model

def build_model():
    input_target = Input(shape = (1, ))
    input_context = Input(shape = (1, ))

    emb = Embedding(input_dim = VOCAB_SIZE, output_dim = 8)
    target = emb(input_target)
    context = emb(input_context)

    dot = Dot(axes = 2)([target, context])
    flatten = Reshape((1, ))(dot)
    output = Activation('sigmoid')(flatten)
    skipgram = Model(inputs = [input_target, input_context], outputs = output)

    return skipgram

model = build_model()
model.summary()

# Output
Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
==================================================================================================
 input_3 (InputLayer)           [(None, 1)]          0           []                               
                                                                                                  
 input_4 (InputLayer)           [(None, 1)]          0           []                               
                                                                                                  
 embedding_5 (Embedding)        (None, 1, 8)         708648      ['input_3[0][0]',                
                                                                  'input_4[0][0]']                
                                                                                                  
 dot (Dot)                      (None, 1, 1)         0           ['embedding_5[0][0]',            
                                                                  'embedding_5[1][0]']            
                                                                                                  
 reshape (Reshape)              (None, 1)            0           ['dot[0][0]']                    
                                                                                                  
 activation (Activation)        (None, 1)            0           ['reshape[0][0]']                
                                                                                                  
==================================================================================================
Total params: 708,648
Trainable params: 708,648
Non-trainable params: 0
__________________________________________________________________________________________________

 

  - Compiling and training the model

from tensorflow.keras.optimizers import Adam

model.compile(optimizer = Adam(),
              loss = 'binary_crossentropy',
              metrics = ['accuracy'])

model.fit([word_target, word_context], labels, epochs = 30)

 

  - Saving and loading the embedding layer

emb = model.layers[2]
emb.get_weights()

# Output
[array([[ 0.01938832,  0.01921825, -0.0462908 , ...,  0.01147114,
         -0.04764376,  0.01121316],
        [-0.01068624, -0.04315212,  0.00839611, ..., -0.02030395,
         -0.02321514, -0.03680412],
        [ 0.00915837,  0.00973357,  0.00904005, ...,  0.01291057,
          0.04295233,  0.0488804 ],
        ...,
        [ 0.01314208,  0.02786795,  0.01130085, ...,  0.03705814,
          0.0427903 ,  0.0109529 ],
        [-0.03585767, -0.04641544, -0.02590518, ..., -0.00451361,
         -0.03019956,  0.01893195],
        [ 0.00769577, -0.02014879, -0.03623866, ..., -0.03457584,
         -0.02138668,  0.02141118]], dtype=float32)]
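# Save the embedding weights
np.save('emb.npy', emb.get_weights()[0])
  • Load the embedding weights
w = np.load('emb.npy')
  • Setting trainable to False when adding the embedding layer prevents further training of its weights
# input_dim and output_dim must match the shape of the saved weights
emb_ff = Embedding(input_dim = w.shape[0], output_dim = w.shape[1], input_length = 30,
                   weights = [w], trainable = False)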

 

 

10. Using pre-trained word embeddings: GloVe embeddings

  - Downloading the raw IMDB text

import wget
import os
import zipfile

wget.download("http://mng.bz/0tIo")

local_zip = '0tIo'
zip_ref = zipfile.ZipFile(local_zip, 'r')
zip_ref.extractall()
zip_ref.close()

imdb_dir = "aclImdb"
train_dir = os.path.join(imdb_dir, 'train')

labels = []
texts = []
for label_type in ['neg', 'pos']:
    dir_name = os.path.join(train_dir, label_type)

    for fname in os.listdir(dir_name):
        if fname[-4:] == '.txt':
            f = open(os.path.join(dir_name, fname), encoding = 'utf-8')
            texts.append(f.read())
            f.close()

            if label_type == 'neg':
                labels.append(0)
            else:
                labels.append(1)

texts[0]

# Output
"Story of a man who has unnatural feelings for a pig. Starts out with a opening scene that is
a terrific example of absurd comedy. A formal orchestra audience is turned into an insane,
violent mob by the crazy chantings of it's singers. Unfortunately it stays absurd the WHOLE
time with no general narrative eventually making it just too off putting. Even those from the
era should be turned off. The cryptic dialogue would make Shakespeare seem easy to a third grader.
On a technical level it's better than you might think with some good cinematography by future
great Vilmos Zsigmond. Future stars Sally Kirkland and Frederic Forrest can be seen briefly."

labels[0]  # 0 (negative review)

 

  - Tokenizing the data

import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_len = 100
training_samples = 200
validation_samples = 10000
max_words = 10000

tokenizer = Tokenizer(num_words = max_words)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

word_index = tokenizer.word_index
print(len(word_index))  # 88582
data = pad_sequences(sequences, maxlen = max_len)
labels = np.asarray(labels)

print(data.shape)    # (25000, 100)
print(labels.shape)  # (25000,)
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]

x_train = data[:training_samples]
y_train = labels[:training_samples]
x_val = data[training_samples : training_samples + validation_samples]
y_val = labels[training_samples : training_samples + validation_samples]

print(x_train.shape)  # (200, 100)
print(y_train.shape)  # (200,)
print(x_val.shape)    # (10000, 100)
print(y_val.shape)    # (10000,)

 

  - Downloading the GloVe word embeddings

import wget

wget.download("http://nlp.stanford.edu/data/glove.6B.zip")

# Unzip the archive
local_zip = 'glove.6B.zip'
zip_ref = zipfile.ZipFile(local_zip, 'r')
zip_ref.extractall()
zip_ref.close()

 

  - Preprocessing the embeddings

  • Parsing GloVe
# Read the file line by line
glove_dir = "glove.6B"
embeddings_index = {}
f = open(os.path.join(glove_dir, 'glove.6B.100d.txt'), encoding = 'utf8')
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype = 'float32')
    embeddings_index[word] = coefs

f.close()

print(len(embeddings_index))  # 400000
embedding_dim = 100
embedding_mat = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    if i < max_words:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_mat[i] = embedding_vector

embedding_mat

# Output
array([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [-0.038194  , -0.24487001,  0.72812003, ..., -0.1459    ,
         0.82779998,  0.27061999],
       [-0.071953  ,  0.23127   ,  0.023731  , ..., -0.71894997,
         0.86894   ,  0.19539   ],
       ...,
       [ 0.13787   , -0.17727   , -0.62436002, ...,  0.35506001,
         0.33443999,  0.14436001],
       [-0.88968998,  0.55208999, -0.50498998, ..., -0.54351002,
        -0.21874   ,  0.51186001],
       [-0.17381001, -0.037609  ,  0.068837  , ..., -0.097167  ,
         1.08840001,  0.22676   ]])
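The parsing and matrix-building steps above can be tried without downloading anything. The sketch below uses a two-line in-memory stand-in for the GloVe file and a made-up word index (all `toy_` names and values are hypothetical) to show that words missing from the embeddings, as well as the padding index 0, keep all-zero rows.

```python
import io

# Toy stand-in for a GloVe text file: one word per line followed by its vector
toy_glove = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")

toy_embeddings = {}
for line in toy_glove:
    values = line.split()
    toy_embeddings[values[0]] = [float(v) for v in values[1:]]

toy_word_index = {"the": 1, "cat": 2, "zyx": 3}  # hypothetical tokenizer output
toy_max_words, toy_dim = 4, 3

# Row i holds the vector of the word with index i; row 0 (padding) stays zero
toy_matrix = [[0.0] * toy_dim for _ in range(toy_max_words)]
for word, i in toy_word_index.items():
    vector = toy_embeddings.get(word)
    if i < toy_max_words and vector is not None:
        toy_matrix[i] = vector

for row in toy_matrix:
    print(row)
```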

 

  - Defining the model

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense

model = Sequential()

model.add(Embedding(max_words, embedding_dim, input_length = max_len))
model.add(Flatten())
model.add(Dense(32, activation = 'relu'))
model.add(Dense(1, activation = 'sigmoid'))
model.summary()

# Output
Model: "sequential_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding_7 (Embedding)     (None, 100, 100)          1000000   
                                                                 
 flatten_2 (Flatten)         (None, 10000)             0         
                                                                 
 dense_2 (Dense)             (None, 32)                320032    
                                                                 
 dense_3 (Dense)             (None, 1)                 33        
                                                                 
=================================================================
Total params: 1,320,065
Trainable params: 1,320,065
Non-trainable params: 0
_________________________________________________________________
# Set the weights
model.layers[0].set_weights([embedding_mat])

# Freeze the layer so the pre-trained weights are used as-is, without training
model.layers[0].trainable = False
model.compile(optimizer = 'rmsprop',
              loss = 'binary_crossentropy',
              metrics = ['accuracy'])

history = model.fit(x_train, y_train,
                    epochs = 10,
                    batch_size = 32,
                    validation_data = (x_val, y_val))

# Save the model weights
model.save_weights('pre_trained_glove_model.h5')

 

  - Visualization

loss = history.history['loss']
val_loss = history.history['val_loss']
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']

epochs = range(1, len(loss) + 1)

plt.plot(epochs, loss, 'b--', label = 'Training Loss')
plt.plot(epochs, val_loss, 'r:', label = 'Validation Loss')
plt.legend()
plt.grid()

plt.figure()
plt.plot(epochs, acc, 'b--', label = 'Training Accuracy')
plt.plot(epochs, val_acc, 'r:', label = 'Validation Accuracy')
plt.legend()
plt.grid()

 

11. Training the same model without pre-trained word embeddings

model2 = Sequential()

model2.add(Embedding(max_words, embedding_dim, input_length = max_len))
model2.add(Flatten())
model2.add(Dense(32, activation = 'relu'))
model2.add(Dense(1, activation = 'sigmoid'))
model2.summary()

# Output
Model: "sequential_6"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding_8 (Embedding)     (None, 100, 100)          1000000   
                                                                 
 flatten_3 (Flatten)         (None, 10000)             0         
                                                                 
 dense_4 (Dense)             (None, 32)                320032    
                                                                 
 dense_5 (Dense)             (None, 1)                 33        
                                                                 
=================================================================
Total params: 1,320,065
Trainable params: 1,320,065
Non-trainable params: 0
_________________________________________________________________
model2.compile(optimizer = 'rmsprop',
              loss = 'binary_crossentropy',
              metrics = ['accuracy'])
history2 = model2.fit(x_train, y_train,
                    epochs = 10,
                    batch_size = 32,
                    validation_data = (x_val, y_val))

loss = history2.history['loss']
val_loss = history2.history['val_loss']
acc = history2.history['accuracy']
val_acc = history2.history['val_accuracy']

epochs = range(1, len(loss) + 1)

plt.plot(epochs, loss, 'b--', label = 'Training Loss')
plt.plot(epochs, val_loss, 'r:', label = 'Validation Loss')
plt.legend()
plt.grid()

plt.figure()
plt.plot(epochs, acc, 'b--', label = 'Training Accuracy')
plt.plot(epochs, val_acc, 'r:', label = 'Validation Accuracy')
plt.legend()
plt.grid()

 

  - Tokenizing the test data

test_dir = os.path.join(imdb_dir, 'test')

labels = []
texts = []
for label_type in ['neg', 'pos']:
    dir_name = os.path.join(test_dir, label_type)

    for fname in os.listdir(dir_name):
        if fname[-4:] == '.txt':
            f = open(os.path.join(dir_name, fname), encoding = 'utf8')
            texts.append(f.read())
            f.close()

            if label_type == 'neg':
                labels.append(0)
            else:
                labels.append(1)

sequences = tokenizer.texts_to_sequences(texts)
x_test = pad_sequences(sequences, maxlen = max_len)
y_test = np.asarray(labels)

print(x_test.shape)  # (25000, 100)
print(y_test.shape)  # (25000,)
model.load_weights('pre_trained_glove_model.h5')
model.evaluate(x_test, y_test)

# Output
loss: 0.7546 - accuracy: 0.5566
[0.754594087600708, 0.5565599799156189]

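1. Terminology

  • Token
    • The unit into which text is split
    • Tokenization: the process of splitting text into tokens
  • n-gram
    • A group of N (or fewer) consecutive words extracted from a sentence
    • The same concept also applies to characters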

https://www.sqlservercentral.com/articles/nasty-fast-n-grams-part-1-character-level-unigrams
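As a small illustration of the terms above, the sketch below tokenizes a sentence on whitespace and extracts word-level and character-level n-grams. The helper name `ngrams` and the sample sentence are made up for the example.

```python
def ngrams(items, n):
    """Return all runs of n consecutive items as tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

sentence = "the cat sat on the mat"
tokens = sentence.split()          # word-level tokenization on whitespace

word_bigrams = ngrams(tokens, 2)   # word 2-grams
char_trigrams = ngrams("cat", 3) + ngrams("mat", 3)  # character 3-grams

print(word_bigrams)
print(char_trigrams)
```

The same function serves both cases because a string is just a sequence of characters.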

 

 

2. Word-level one-hot encoding

import numpy as np

samples = ['The cat sat on the mat.',
           'The dog ate my homeworks.']

token_index = {}

for sample in samples:
    for word in sample.split():
        if word not in token_index:
            token_index[word] = len(token_index) + 1

max_len = 10
results = np.zeros(shape = (len(samples), max_len,
                            max(token_index.values()) + 1))

# One-hot encoding
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_len]:
        index = token_index.get(word)
        results[i, j, index] = 1.
results

# Output
array([[[0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],  # The
        [0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],  # cat
        [0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],  # sat
        [0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],  # on
        [0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],  # the
        [0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],  # mat
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],

       [[0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],  # The
        [0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],  # dog
        [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],  # ate
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],  # my
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],  # homeworks
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]])

 

 

3. Word-level one-hot encoding with Keras

  • fit_on_texts()
  • texts_to_sequences()
  • texts_to_matrix()
from tensorflow.keras.preprocessing.text import Tokenizer

samples = ['The cat sat on the mat.',
           'The dog ate my homeworks.']

tokenizer = Tokenizer(num_words = 1000)
tokenizer.fit_on_texts(samples)

sequences = tokenizer.texts_to_sequences(samples)

ohe_results = tokenizer.texts_to_matrix(samples, mode = 'binary')

word_index = tokenizer.word_index
print(len(word_index))

# Output
9
# There are 9 tokens
# Word indices in order of appearance
sequences

# Output
[[1, 2, 3, 4, 1, 5], [1, 6, 7, 8, 9]]
# One-hot encoding result
print(ohe_results.shape)
print(ohe_results)

# Output
(2, 1000)
[[0. 1. 1. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]]
word_index

# Output
{'the': 1,
 'cat': 2,
 'sat': 3,
 'on': 4,
 'mat': 5,
 'dog': 6,
 'ate': 7,
 'my': 8,
 'homeworks': 9}
 
 # The values in sequences are determined by the word index

 

  - Tokenization example

  • OOV: Out Of Vocabulary
    • A word in a new sentence that did not appear in the text the tokenizer was fit on is replaced with the OOV token
from tensorflow.keras.preprocessing.text import Tokenizer

samples = ["I'm the smartest student.",
           "I'm the best student."]
tokenizer = Tokenizer(num_words = 10, oov_token = '<OOV>')
tokenizer.fit_on_texts(samples)

sequences = tokenizer.texts_to_sequences(samples)

binary_results = tokenizer.texts_to_matrix(samples, mode = 'binary')

print(tokenizer.word_index)

# Output
# word_index of the current tokenizer
{'<OOV>': 1, "i'm": 2, 'the': 3, 'student': 4, 'smartest': 5, 'best': 6}
binary_results

# Output
array([[0., 0., 1., 1., 1., 1., 0., 0., 0., 0.],
       [0., 0., 1., 1., 1., 0., 1., 0., 0., 0.]])
  • Test
test = ["I'm the fastest student."]
test_seq = tokenizer.texts_to_sequences(test)

print("word index:", tokenizer.word_index)
print("Test Text:", test)
print("Test Seq:", test_seq)

# Output
word index: {'<OOV>': 1, "i'm": 2, 'the': 3, 'student': 4, 'smartest': 5, 'best': 6}
Test Text: ["I'm the fastest student."]
Test Seq: [[2, 3, 1, 4]]

# fastest is not in the vocabulary, so it is an OOV (out-of-vocabulary) word and is encoded as 1
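The OOV behavior can be imitated in a few lines of plain Python. This is a simplified sketch, not the Keras Tokenizer itself: `fit_word_index` and `to_sequence` are hypothetical helpers that lowercase the text, keep letters and apostrophes, rank words by frequency, and reserve index 1 for the OOV token.

```python
import re
from collections import Counter

def fit_word_index(texts, oov_token="<OOV>"):
    """Index words by descending frequency; index 1 is reserved for OOV."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    index = {oov_token: 1}
    # most_common() keeps first-seen order among equally frequent words
    for word, _ in counts.most_common():
        index[word] = len(index) + 1
    return index

def to_sequence(text, index, oov_token="<OOV>"):
    """Map each word to its index, falling back to the OOV index."""
    words = re.findall(r"[a-z']+", text.lower())
    return [index.get(w, index[oov_token]) for w in words]

index = fit_word_index(["I'm the smartest student.", "I'm the best student."])
print(index)
print(to_sequence("I'm the fastest student.", index))
```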

 

 

4. One-hot word vectors and word embeddings

  • One-hot word vectors
    • Sparse
    • High-dimensional
  • Word embeddings
    • Dense
    • Low-dimensional

https://freecontent.manning.com/deep-learning-for-text/
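The contrast between the two representations can be seen on a five-word vocabulary. In the sketch below the 2-dimensional embedding values are invented for illustration; real embeddings are learned from data.

```python
vocab = ["the", "cat", "sat", "on", "mat"]

def one_hot(word):
    """Sparse, vocabulary-sized vector with a single 1."""
    return [1.0 if w == word else 0.0 for w in vocab]

# Hypothetical 2-dimensional embeddings; real ones are learned from data
embedding = {
    "the": [0.1, 0.0],
    "cat": [0.9, 0.3],
    "sat": [0.2, 0.8],
    "on":  [0.0, 0.1],
    "mat": [0.7, 0.4],
}

sparse = one_hot("cat")
dense = embedding["cat"]
print(len(sparse), sum(1 for x in sparse if x != 0.0))  # dimension, non-zero count
print(len(dense), sum(1 for x in dense if x != 0.0))
```

The one-hot vector grows with the vocabulary while the embedding size stays fixed, which is why embeddings are preferred for large vocabularies.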

 

 

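5. Word embeddings

  • Similar words are embedded close together, that is, the distance between their vectors is small
  • Besides distance, specific directions in the embedding space can also carry meaning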

https://towardsdatascience.com/creating-word-embeddings-coding-the-word2vec-algorithm-in-python-using-deep-learning-b337d0ba17a8
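Both ideas, distance and direction, can be checked on a toy embedding space. The four 2-D vectors below are hand-made assumptions (axis 0 loosely standing for "royalty" and axis 1 for "gender"), chosen only so that the arithmetic is easy to follow.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical 2-D embeddings: axis 0 ~ "royalty", axis 1 ~ "gender"
vec = {
    "king":  [0.9,  0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1,  0.8],
    "woman": [0.1, -0.8],
}

# Distance: words sharing a gender direction score higher than those that do not
print(cosine(vec["king"], vec["man"]), cosine(vec["king"], vec["woman"]))

# Direction: king - man + woman lands on queen in this toy space
analogy = [k - m + w for k, m, w in zip(vec["king"], vec["man"], vec["woman"])]
nearest = max(vec, key=lambda word: cosine(analogy, vec[word]))
print(nearest)
```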

 

  - Embedding Layer

  • A dictionary-like layer that maps integer indices, each representing a specific word, to dense vectors
  • Input: (samples, sequence_length)
  • Output: (samples, sequence_length, dim)
from tensorflow.keras.layers import Embedding

embedding_layer = Embedding(1000, 64)
embedding_layer

# Output
<keras.layers.core.embedding.Embedding at 0x265f5b12fa0>
# The Embedding object itself is printed
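To see how the input and output shapes relate, the lookup can be imitated with a plain list of rows. This is a conceptual sketch under the assumption that the layer behaves like a table lookup; the random weights, the small `dim`, and the `embed` helper are illustrative only.

```python
import random

rng = random.Random(0)
vocab_size, dim = 1000, 4   # dim is kept small here; the layer above uses 64

# The layer's weights: one row of `dim` floats per word index
weights = [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(vocab_size)]

def embed(batch):
    """Replace every integer index with its row of the weight table."""
    return [[weights[idx] for idx in sequence] for sequence in batch]

batch = [[1, 14, 22], [4, 173, 36]]        # shape (samples=2, sequence_length=3)
out = embed(batch)
print(len(out), len(out[0]), len(out[0][0]))  # samples, sequence_length, dim
```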

 

 

6. Example: IMDB data

  • Internet Movie Database
  • A dataset of 50,000 highly polarized reviews
    • Training data: 25,000
    • Test data: 25,000

 

  - modules import

from tensorflow.keras.datasets import imdb
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Dense, Flatten

 

  - Loading the data

num_words = 1000
max_len = 20

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words = num_words)

print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

# Output
(25000,)
(25000,)
(25000,)
(25000,)

 

  - Inspecting the data

  • Positive: 1
  • Negative: 0
print(x_train[0])
print(y_train[0])

# Output
# Prints the review's index sequence and its positive/negative label
[1, 14, 22, 16, 43, 530, 973, 2, 2, 65, 458, 2, 66, 2, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 2, 2, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2, 19, 14, 22, 4, 2, 2, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 2, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2, 2, 16, 480, 66, 2, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 2, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 2, 15, 256, 4, 2, 7, 2, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 2, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2, 56, 26, 141, 6, 194, 2, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 2, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 2, 88, 12, 16, 283, 5, 16, 2, 113, 103, 32, 15, 16, 2, 19, 178, 32]
1

 

  - Note) The most frequent words in the IMDB dataset

word_index = {}

for key, val in imdb.get_word_index().items():
    word_index[val] = key

for i in range(1, 6):
    print(word_index[i])

# Output
the
and
a
of
to

 

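  - Data preprocessing

  • Make all sequences the same length
    • pad_sequences()
      • Sequences longer than maxlen are truncated (from the front by default, so the last maxlen tokens are kept)
      • Sequences shorter than maxlen are padded according to the padding setting
        • pre: pad with 0s at the front
        • post: pad with 0s at the end
  • Every sequence (each individual review) must have the same length before it can be fed to the Embedding layer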
from tensorflow.keras.preprocessing.sequence import pad_sequences

pad_x_train = pad_sequences(x_train, maxlen = max_len, padding = 'pre')
pad_x_test = pad_sequences(x_test, maxlen = max_len, padding = 'pre')

print(len(x_train[0]))
print(len(pad_x_train[0]))

# Output
218
20
# Truncated to the maximum length
print(x_train[0])
print(pad_x_train[0])

# Output
[1, 14, 22, 16, 43, 530, 973, 2, 2, 65, 458, 2, 66, 2, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 2, 2, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2, 19, 14, 22, 4, 2, 2, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 2, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2, 2, 16, 480, 66, 2, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 2, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 2, 15, 256, 4, 2, 7, 2, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 2, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2, 56, 26, 141, 6, 194, 2, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 2, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 2, 88, 12, 16, 283, 5, 16, 2, 113, 103, 32, 15, 16, 2, 19, 178, 32]
[ 65  16  38   2  88  12  16 283   5  16   2 113 103  32  15  16   2  19  178  32]
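The padding and truncation rules can be written out for a single sequence in plain Python. `pad_sequence` below is a hypothetical helper that only mimics the behavior described in this section; it assumes front truncation as the default, which matches the output above where the last 20 tokens were kept.

```python
def pad_sequence(seq, maxlen, padding="pre", truncating="pre"):
    """Hypothetical single-sequence helper mimicking the rules described above."""
    if len(seq) > maxlen:
        # 'pre' drops tokens from the front, 'post' drops them from the end
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    pad = [0] * (maxlen - len(seq))
    return pad + list(seq) if padding == "pre" else list(seq) + pad

print(pad_sequence([1, 2, 3, 4, 5], maxlen=3))          # long sequence is cut
print(pad_sequence([1, 2], maxlen=4, padding="pre"))    # zeros in front
print(pad_sequence([1, 2], maxlen=4, padding="post"))   # zeros at the end
```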

 

  - Building the model

model = Sequential()

model.add(Embedding(input_dim = num_words, output_dim = 32, input_length = max_len))
model.add(Flatten())
model.add(Dense(1, activation = 'sigmoid'))

model.summary()

# Output
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding_1 (Embedding)     (None, 20, 32)            32000     
                                                                 
 flatten (Flatten)           (None, 640)               0         
                                                                 
 dense (Dense)               (None, 1)                 641       
                                                                 
=================================================================
Total params: 32,641
Trainable params: 32,641
Non-trainable params: 0
_________________________________________________________________
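The parameter counts in the summary can be verified by hand. The short calculation below uses the same values as the model above (1000 words, 32 embedding dimensions, sequences of length 20) and assumes one weight per input plus one bias for the single-unit Dense layer.

```python
num_words, output_dim, max_len = 1000, 32, 20

embedding_params = num_words * output_dim   # one output_dim-sized row per word index
flatten_units = max_len * output_dim        # Flatten has no weights, it only reshapes
dense_params = flatten_units * 1 + 1        # one weight per input plus a bias

print(embedding_params, flatten_units, dense_params)
print(embedding_params + dense_params)
```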

 

  - Compiling and training the model

model.compile(optimizer = 'rmsprop',
              loss = 'binary_crossentropy',
              metrics = ['accuracy'])

history = model.fit(pad_x_train, y_train,
                    epochs = 10,
                    batch_size = 32,
                    validation_split = 0.2)

 

  - Visualization

import matplotlib.pyplot as plt

hist_dict = history.history

plt.plot(hist_dict['loss'], 'b--', label = 'Train Loss')
plt.plot(hist_dict['val_loss'], 'r:', label = 'Validation Loss')
plt.legend()
plt.grid()

plt.figure()
plt.plot(hist_dict['accuracy'], 'b--', label = 'Train Accuracy')
plt.plot(hist_dict['val_accuracy'], 'r:', label = 'Validation Accuracy')
plt.legend()
plt.grid()

plt.show()

 

  - Evaluating the model

model.evaluate(pad_x_test, y_test)

# Output
loss: 0.5986 - accuracy: 0.7085
[0.5986294150352478, 0.7085199952125549]

 

  - Retraining after increasing the sequence length (max_len)

num_words = 1000
max_len = 500

pad_x_train_2 = pad_sequences(x_train, maxlen = max_len, padding = 'pre')
pad_x_test_2 = pad_sequences(x_test, maxlen = max_len, padding = 'pre')

print(x_train[0])
print(pad_x_train_2[0])

# Output
[1, 14, 22, 16, 43, 530, 973, 2, 2, 65, 458, 2, 66, 2, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 2, 2, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2, 19, 14, 22, 4, 2, 2, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 2, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2, 2, 16, 480, 66, 2, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 2, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 2, 15, 256, 4, 2, 7, 2, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 2, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2, 56, 26, 141, 6, 194, 2, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 2, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 2, 88, 12, 16, 283, 5, 16, 2, 113, 103, 32, 15, 16, 2, 19, 178, 32]
[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   1  14  22  16  43 530
 973   2   2  65 458   2  66   2   4 173  36 256   5  25 100  43 838 112
  50 670   2   9  35 480 284   5 150   4 172 112 167   2 336 385  39   4
 172   2   2  17 546  38  13 447   4 192  50  16   6 147   2  19  14  22
   4   2   2 469   4  22  71  87  12  16  43 530  38  76  15  13   2   4
  22  17 515  17  12  16 626  18   2   5  62 386  12   8 316   8 106   5
   4   2   2  16 480  66   2  33   4 130  12  16  38 619   5  25 124  51
  36 135  48  25   2  33   6  22  12 215  28  77  52   5  14 407  16  82
   2   8   4 107 117   2  15 256   4   2   7   2   5 723  36  71  43 530
 476  26 400 317  46   7   4   2   2  13 104  88   4 381  15 297  98  32
   2  56  26 141   6 194   2  18   4 226  22  21 134 476  26 480   5 144
  30   2  18  51  36  28 224  92  25 104   4 226  65  16  38   2  88  12
  16 283   5  16   2 113 103  32  15  16   2  19 178  32]

# Pads to the maximum length of 500 with 0s; because padding is 'pre', the zeros go at the front
model = Sequential()

model.add(Embedding(input_dim = num_words, output_dim = 32, input_length = max_len))
model.add(Flatten())
model.add(Dense(1, activation = 'sigmoid'))

model.compile(optimizer = 'rmsprop',
              loss = 'binary_crossentropy',
              metrics = ['accuracy'])

history2 = model.fit(pad_x_train_2, y_train,
                    epochs = 10,
                    batch_size = 32,
                    validation_split = 0.2)

hist_dict_2 = history2.history

plt.plot(hist_dict_2['loss'], 'b--', label = 'Train Loss')
plt.plot(hist_dict_2['val_loss'], 'r:', label = 'Validation Loss')
plt.legend()
plt.grid()

plt.figure()
plt.plot(hist_dict_2['accuracy'], 'b--', label = 'Train Accuracy')
plt.plot(hist_dict_2['val_accuracy'], 'r:', label = 'Validation Accuracy')
plt.legend()
plt.grid()

plt.show()

model.evaluate(pad_x_test_2, y_test)

# Output
loss: 0.5295 - accuracy: 0.8316
[0.5295160412788391, 0.8316400051116943]

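  - The accuracy above is not bad, but the model overfits

  - The reasons are

  • Semantic connections such as relationships between words and sentence structure are not taken into account
  • To learn features that take the whole sequence into account, it is better to add an RNN layer or a 1D convolution on top of the Embedding layer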

 

 

● Types of word embeddings

  • LSA
  • Word2Vec
  • GloVe
  • FastText
  • etc...

 

 

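7. Word2Vec

  • Can be trained on raw text alone, without separate labels such as those needed for classification
  • How Word2Vec works (using the relationships between neighboring words)
    • CBOW (Continuous Bag-of-Words)
      • Sums the embeddings of the surrounding words to predict the target word
    • Skip-Gram
      • Uses the embedding of the target word to predict the surrounding words
      • Generally performs better than CBOW
      • Inefficient because it has to predict several words at once
      • A technique called negative sampling is now commonly used to reduce this cost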

https://www.researchgate.net/figure/CBOW-and-Skip-Gram-neural-architectures_fig14_328160770
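The difference between the two training schemes comes down to which side is the input and which is the prediction target. The sketch below builds the training examples for both on a five-word sentence; the helper names and the sentence are made up, and no model is trained.

```python
def context_window(tokens, i, window):
    """Words within `window` positions of tokens[i], excluding tokens[i] itself."""
    return tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]

def cbow_examples(tokens, window=1):
    # CBOW: (list of surrounding words) -> target word
    return [(context_window(tokens, i, window), t) for i, t in enumerate(tokens)]

def skipgram_examples(tokens, window=1):
    # Skip-gram: target word -> each surrounding word, one pair at a time
    return [(t, c) for i, t in enumerate(tokens) for c in context_window(tokens, i, window)]

tokens = ["the", "king", "loved", "the", "queen"]
print(cbow_examples(tokens)[1])
print(skipgram_examples(tokens)[:3])
```

CBOW yields one example per position, while skip-gram yields one example per (target, neighbor) pair, which is part of why skip-gram is the more expensive of the two.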

 

 

8. Project Gutenberg example

import requests
import re

 

  - Downloading the data

res = requests.get('https://www.gutenberg.org/files/2591/2591-0.txt')
res

# Output
<Response [200]>
# 200 means the request succeeded
# 404 means the resource was not found

 

  - Data preprocessing

grimm = res.text[2801:530661]
grimm = re.sub(r'[^a-zA-Z\. ]', ' ', grimm)
sentences = grimm.split('. ')
data = [s.split() for s in sentences]

len(data)  # 3468


data[0]

# Output
['SECOND',
 'STORY',
 'THE',
 'SALAD',
 'THE',
 'STORY',
 'OF',
 'THE',
 'YOUTH',
 'WHO',
 'WENT',
 'FORTH',
 'TO',
 'LEARN',
 'WHAT',
 'FEAR',
 'WAS',
 'KING',
 'GRISLY',
 'BEARD',
 'IRON',
 'HANS',
 'CAT',
 'SKIN',
 'SNOW',
...
 'tree',
 'which',
 'bore',
 'golden',
 'apples']
# Import Word2Vec from the gensim package
from gensim.models.word2vec import Word2Vec
# sg = 0 selects CBOW, sg = 1 selects Skip-gram
# min_count = 3 keeps only words that appear at least 3 times; workers = 4 uses 4 worker threads
model = Word2Vec(data, sg = 1, vector_size = 100, window = 3, min_count = 3, workers = 4)

 

  - Saving and loading the model

# Save
model.save('word2vec.model')

# Load
pretrained_model = Word2Vec.load('word2vec.model')

 

  - Converting a word to a vector

  • wv
pretrained_model.wv['princess']

# Output
array([-0.19268924,  0.17087255, -0.13460916,  0.20450976,  0.03542079,
       -0.31665406,  0.13296   ,  0.54076153, -0.18337499, -0.21417093,
        0.02725333, -0.31845513,  0.01819889,  0.10720193,  0.16601542,
       -0.19728081,  0.05753807, -0.12273175, -0.17903367, -0.22576232,
        0.2438455 ,  0.13664703,  0.18498562, -0.1679803 ,  0.07735273,
       -0.00432668, -0.00775897, -0.08363435, -0.12566872, -0.07055762,
        0.02887373, -0.08917326,  0.17351009, -0.18784055, -0.20769958,
        0.19657052,  0.01372425, -0.074237  , -0.10052767, -0.11275681,
        0.06725535, -0.09701315,  0.02844668,  0.05958825, -0.02586031,
       -0.01711333, -0.11226629, -0.08671231,  0.1945969 ,  0.01690222,
        0.07196116, -0.08172472, -0.05373074, -0.14637838,  0.16281295,
        0.06222549,  0.10643765,  0.07477342, -0.16238536,  0.03527208,
       -0.04292673,  0.04597842,  0.13826323, -0.19217554, -0.25257504,
        0.10983958,  0.03293723,  0.4319519 , -0.21335553,  0.24770555,
       -0.00888118,  0.02231867,  0.17330043, -0.10485211,  0.35415375,
       -0.08000654,  0.01478033, -0.03938808, -0.06453493,  0.02249427,
       -0.21435274, -0.01287377, -0.2137464 ,  0.21174915, -0.1006554 ,
        0.00902446,  0.05607878,  0.16368881,  0.13859129, -0.01395336,
        0.09382439,  0.08065708, -0.056269  ,  0.09765122,  0.188912  ,
        0.1668056 , -0.01361183, -0.14287405, -0.11452819, -0.20357099],
      dtype=float32)

# The 100-dimensional vector that represents the word 'princess'

 

  - Analogy

  • Passing two words to wv.similarity() returns their cosine similarity
pretrained_model.wv.similarity('king', 'prince')

# Output
0.8212076
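
Cosine similarity is the dot product of two vectors divided by the product of their lengths. Here is a minimal NumPy sketch of that formula, using toy 2-D vectors rather than the trained embeddings.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


a = [1.0, 0.0]
b = [1.0, 1.0]

sim_ab = cosine_similarity(a, b)  # vectors 45 degrees apart, about 0.7071
sim_aa = cosine_similarity(a, a)  # same direction, 1.0
print(sim_ab, sim_aa)
```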
  • Passing a word to wv.most_similar() returns the words most similar to it
pretrained_model.wv.most_similar('king')

# Output
[('daughter', 0.9241937398910522),
 ('son', 0.9213796257972717),
 ('woman', 0.9177201390266418),
 ('man', 0.897368848323822),
 ('queen', 0.8747967481613159),
 ('miller', 0.8610494136810303),
 ('old', 0.8595746755599976),
 ('young', 0.8504902124404907),
 ('wolf', 0.8450464010238647),
 ('But', 0.8406485319137573)]
  • wv.most_similar() also accepts positive and negative arguments
# Compute the vector 'man + princess - woman' and list the words closest to the result
# i.e. a word that is related to 'man' and 'princess' but not to 'woman'
pretrained_model.wv.most_similar(positive = ['man', 'princess'], negative = ['woman'])

# Output
[('bird', 0.9595717787742615),
 ('prince', 0.9491060376167297),
 ('cook', 0.9410891532897949),
 ('bride', 0.9401964545249939),
 ('huntsman', 0.9375050067901611),
 ('mouse', 0.9356588125228882),
 ('cat', 0.9344455003738403),
 ('giant', 0.9341970682144165),
 ('gardener', 0.9327394366264343),
 ('maid', 0.9326624870300293)]
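
The positive/negative query can be sketched as plain vector arithmetic followed by a cosine ranking. The 2-D vectors and the `analogy` helper below are hypothetical and only illustrate the idea. They are not gensim's actual code, and on a small corpus such as this one the real result ('bird' ranked above 'prince') is much noisier.

```python
import numpy as np

# Hypothetical 2-D embeddings: axis 0 roughly "royalty", axis 1 roughly "gender"
toy = {
    'man':      np.array([0.0, 1.0]),
    'woman':    np.array([0.0, -1.0]),
    'prince':   np.array([1.0, 1.0]),
    'princess': np.array([1.0, -1.0]),
    'apple':    np.array([-1.0, 0.0]),
}


def unit(v):
    """Scale v to length 1 so that dot products become cosine similarities."""
    return v / np.linalg.norm(v)


def analogy(positive, negative, vectors):
    """Rank the remaining words by cosine similarity to sum(positive) - sum(negative)."""
    query = sum(unit(vectors[w]) for w in positive) - sum(unit(vectors[w]) for w in negative)
    query = unit(query)
    candidates = [w for w in vectors if w not in positive and w not in negative]
    scored = [(w, float(np.dot(query, unit(vectors[w])))) for w in candidates]
    return sorted(scored, key=lambda t: t[1], reverse=True)


ranking = analogy(['man', 'princess'], ['woman'], toy)
print(ranking[0][0])  # prince
```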

 

  - Load word embeddings trained with gensim into Keras

from keras.models import Sequential
from keras.layers import Embedding

num_words, emb_dim = pretrained_model.wv.vectors.shape

print(num_words)
print(emb_dim)

# Output
2446
100

 

  - Use the gensim-trained word embeddings as the weights of a Keras Embedding layer

emb = Embedding(input_dim = num_words, output_dim = emb_dim,
                trainable = False, weights = [pretrained_model.wv.vectors])

model = Sequential()
model.add(emb)

model.summary()

# Output
Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding_3 (Embedding)     (None, None, 100)         244600    
                                                                 
=================================================================
Total params: 244,600
Trainable params: 0
Non-trainable params: 244,600
_________________________________________________________________
# Vector returned for 'princess'
i = pretrained_model.wv.index_to_key.index('princess')

model.predict([i])

# Output
array([[-0.19268924,  0.17087255, -0.13460916,  0.20450976,  0.03542079,
        -0.31665406,  0.13296   ,  0.54076153, -0.18337499, -0.21417093,
         0.02725333, -0.31845513,  0.01819889,  0.10720193,  0.16601542,
        -0.19728081,  0.05753807, -0.12273175, -0.17903367, -0.22576232,
         0.2438455 ,  0.13664703,  0.18498562, -0.1679803 ,  0.07735273,
        -0.00432668, -0.00775897, -0.08363435, -0.12566872, -0.07055762,
         0.02887373, -0.08917326,  0.17351009, -0.18784055, -0.20769958,
         0.19657052,  0.01372425, -0.074237  , -0.10052767, -0.11275681,
         0.06725535, -0.09701315,  0.02844668,  0.05958825, -0.02586031,
        -0.01711333, -0.11226629, -0.08671231,  0.1945969 ,  0.01690222,
         0.07196116, -0.08172472, -0.05373074, -0.14637838,  0.16281295,
         0.06222549,  0.10643765,  0.07477342, -0.16238536,  0.03527208,
        -0.04292673,  0.04597842,  0.13826323, -0.19217554, -0.25257504,
         0.10983958,  0.03293723,  0.4319519 , -0.21335553,  0.24770555,
        -0.00888118,  0.02231867,  0.17330043, -0.10485211,  0.35415375,
        -0.08000654,  0.01478033, -0.03938808, -0.06453493,  0.02249427,
        -0.21435274, -0.01287377, -0.2137464 ,  0.21174915, -0.1006554 ,
         0.00902446,  0.05607878,  0.16368881,  0.13859129, -0.01395336,
         0.09382439,  0.08065708, -0.056269  ,  0.09765122,  0.188912  ,
         0.1668056 , -0.01361183, -0.14287405, -0.11452819, -0.20357099]],
      dtype=float32)
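
The output matches wv['princess'] because a frozen Embedding layer behaves like a row lookup in its weight matrix. Here is a minimal NumPy sketch of that lookup. The 5 x 3 table is a random stand-in for the real 2446 x 100 matrix, and `embedding_lookup` is a name invented for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=(5, 3)).astype('float32')  # toy table: 5 "words", 3 dimensions


def embedding_lookup(indices, table):
    """Return the table rows selected by an integer index array of any shape."""
    return table[np.asarray(indices)]


out = embedding_lookup([[2, 0]], weights)  # one sequence containing word indices 2 and 0
print(out.shape)  # (1, 2, 3): (batch, sequence length, embedding dimension)
```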

● Transfer learning with Keras

https://medium.com/the-official-integrate-ai-blog/transfer-learning-explained-7d275c1e34e2

  • When building a new model, reuse a model that has already been trained
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPool2D, Dense, Flatten, BatchNormalization, Activation
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.applications import *


# As an example, load VGG16 with weights pretrained on ImageNet
vgg16 = VGG16(weights = 'imagenet',
              input_shape = (32, 32, 3), include_top = False)

model = Sequential()
model.add(vgg16)

model.add(Flatten())
model.add(Dense(256))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dense(10, activation = 'softmax'))

model.summary()

# Output
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 vgg16 (Functional)          (None, 1, 1, 512)         14714688  
                                                                 
 flatten (Flatten)           (None, 512)               0         
                                                                 
 dense (Dense)               (None, 256)               131328    
                                                                 
 batch_normalization (BatchN  (None, 256)              1024      
 ormalization)                                                   
                                                                 
 activation (Activation)     (None, 256)               0         
                                                                 
 dense_1 (Dense)             (None, 10)                2570      
                                                                 
=================================================================
Total params: 14,849,610
Trainable params: 14,849,098
Non-trainable params: 512
_________________________________________________________________
  • Besides VGG16, models such as MobileNet, ResNet50, and Xception are also available for transfer learning
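
The parameter counts in the VGG16 summary above can be reproduced by hand. This is a small arithmetic sketch with helper names chosen for this example. The VGG16 base count is copied from the summary rather than derived.

```python
def dense_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return n_in * n_out + n_out


def batchnorm_params(n_features):
    """BatchNormalization keeps four vectors per feature."""
    trainable = 2 * n_features      # gamma and beta
    non_trainable = 2 * n_features  # moving mean and moving variance
    return trainable, non_trainable


vgg16_base = 14_714_688             # copied from the summary above
dense_256 = dense_params(512, 256)  # 131328
bn_train, bn_frozen = batchnorm_params(256)
dense_10 = dense_params(256, 10)    # 2570

total = vgg16_base + dense_256 + bn_train + bn_frozen + dense_10
print(total, total - bn_frozen, bn_frozen)  # 14849610 14849098 512
```

The 512 non-trainable parameters are the moving mean and variance of the BatchNormalization layer. The VGG16 weights themselves are still trainable at this point.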

 

1. Example: Dogs vs Cats

 

  - modules import

import tensorflow as tf
from tensorflow.keras.preprocessing.image import array_to_img, img_to_array, load_img, ImageDataGenerator
from tensorflow.keras.layers import Conv2D, Flatten, MaxPool2D, Input, Dropout, Dense
from tensorflow.keras import Model
from tensorflow.keras.models import Sequential
from tensorflow.keras.applications import VGG16
from tensorflow.keras.optimizers import Adam

import os
import zipfile
import matplotlib.image as mpimg
import matplotlib.pyplot as plt

 

  - Load the data

# Download the data from an external source
import wget

wget.download("https://storage.googleapis.com/mledu-datasets/cats_and_dogs_filtered.zip")


# Extract the archive into the current directory
local_zip = 'cats_and_dogs_filtered.zip'
with zipfile.ZipFile(local_zip, 'r') as zip_ref:
    zip_ref.extractall()


# Use the extracted folder as the base path and point to its train and validation subfolders
base_dir = 'cats_and_dogs_filtered'
train_dir = os.path.join(base_dir, 'train')
validation_dir = os.path.join(base_dir, 'validation')


# Point to the cats and dogs subfolders inside train and validation
train_cats_dir = os.path.join(train_dir, 'cats')
train_dogs_dir = os.path.join(train_dir, 'dogs')

validation_cats_dir = os.path.join(validation_dir, 'cats')
validation_dogs_dir = os.path.join(validation_dir, 'dogs')

train_cat_frames = os.listdir(train_cats_dir)
train_dog_frames = os.listdir(train_dogs_dir)

 

  - Inspect augmented images

# Define the ImageDataGenerator
datagen = ImageDataGenerator(
    rotation_range = 40,
    width_shift_range = 0.2,
    height_shift_range = 0.2,
    shear_range = 0.2,
    zoom_range = 0.2,
    horizontal_flip = True,
    fill_mode = 'nearest'
)


# Load one image and add a batch dimension
img_path = os.path.join(train_cats_dir, train_cat_frames[2])
img = load_img(img_path, target_size = (150, 150))
x = img_to_array(img)
x = x.reshape((1, ) + x.shape)

i = 0
for batch in datagen.flow(x, batch_size = 1):
    plt.figure(i)
    imgplot = plt.imshow(array_to_img(batch[0]))
    i += 1
    if i % 5 == 0:
        break

 

  - Data generators for the training and validation sets

train_datagen = ImageDataGenerator(
    rescale = 1. / 255,
    rotation_range = 40,
    width_shift_range = 0.2,
    height_shift_range = 0.2,
    shear_range = 0.2,
    zoom_range = 0.2,
    horizontal_flip = True
)

train_generator = train_datagen.flow_from_directory(
    train_dir,
    target_size = (150, 150),
    batch_size = 20,
    class_mode = 'binary'
)

val_datagen = ImageDataGenerator(rescale = 1. / 255)

validation_generator = val_datagen.flow_from_directory(
    validation_dir,
    target_size = (150, 150),
    batch_size = 20,
    class_mode = 'binary'
)


# Output
Found 2000 images belonging to 2 classes.
Found 1000 images belonging to 2 classes.
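
Each generator yields 20 images per batch, so the number of batches needed for one full pass follows directly from the image counts above. The arithmetic below uses a helper name chosen for this example. The results match the steps_per_epoch = 100 and validation_steps = 50 used when fitting.

```python
import math


def steps_for(num_images, batch_size):
    """Number of batches needed to see every image once."""
    return math.ceil(num_images / batch_size)


train_steps = steps_for(2000, 20)
val_steps = steps_for(1000, 20)
print(train_steps, val_steps)  # 100 50
```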

 

  - Build and compile the model

model = Sequential()
model.add(Conv2D(32, (3, 3), activation = 'relu', input_shape = (150, 150, 3)))
model.add(MaxPool2D(2, 2))
model.add(Conv2D(64, (3, 3), activation = 'relu'))
model.add(MaxPool2D(2, 2))
model.add(Conv2D(128, (3, 3), activation = 'relu'))
model.add(MaxPool2D(2, 2))
model.add(Conv2D(128, (3, 3), activation = 'relu'))
model.add(MaxPool2D(2, 2))
model.add(Flatten())
model.add(Dropout(0.5))
model.add(Dense(512, activation = 'relu'))
model.add(Dense(1, activation = 'sigmoid'))

model.compile(loss = 'binary_crossentropy',
              optimizer = Adam(learning_rate = 1e-4),
              metrics = ['acc'])

model.summary()

# Output
Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 conv2d (Conv2D)             (None, 148, 148, 32)      896       
                                                                 
 max_pooling2d (MaxPooling2D  (None, 74, 74, 32)       0         
 )                                                               
                                                                 
 conv2d_1 (Conv2D)           (None, 72, 72, 64)        18496     
                                                                 
 max_pooling2d_1 (MaxPooling  (None, 36, 36, 64)       0         
 2D)                                                             
                                                                 
 conv2d_2 (Conv2D)           (None, 34, 34, 128)       73856     
                                                                 
 max_pooling2d_2 (MaxPooling  (None, 17, 17, 128)      0         
 2D)                                                             
                                                                 
 conv2d_3 (Conv2D)           (None, 15, 15, 128)       147584    
                                                                 
 max_pooling2d_3 (MaxPooling  (None, 7, 7, 128)        0         
 2D)                                                             
                                                                 
 flatten_1 (Flatten)         (None, 6272)              0         
                                                                 
 dropout (Dropout)           (None, 6272)              0         
                                                                 
 dense_2 (Dense)             (None, 512)               3211776   
                                                                 
 dense_3 (Dense)             (None, 1)                 513       
                                                                 
=================================================================
Total params: 3,453,121
Trainable params: 3,453,121
Non-trainable params: 0
_________________________________________________________________
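
The output shapes and parameter counts in this summary follow from simple rules: a 3 x 3 convolution with the default 'valid' padding shrinks each side by 2, and 2 x 2 max pooling halves it, rounding down. Here is a short sketch that recomputes them, with helper names chosen for this example.

```python
def conv_valid(size, kernel=3):
    """Output side length of a stride-1 convolution with 'valid' padding."""
    return size - kernel + 1


def max_pool(size, pool=2):
    """Output side length of max pooling with 'valid' padding (remainder dropped)."""
    return size // pool


def conv_params(kernel, in_ch, out_ch):
    """kernel * kernel * in_ch weights plus 1 bias, for each of out_ch filters."""
    return (kernel * kernel * in_ch + 1) * out_ch


size, in_ch = 150, 3
shapes, params = [], []
for out_ch in (32, 64, 128, 128):
    size = conv_valid(size)
    params.append(conv_params(3, in_ch, out_ch))
    shapes.append(size)
    size = max_pool(size)
    shapes.append(size)
    in_ch = out_ch

flat = size * size * in_ch         # input size of the first Dense layer
dense_512 = flat * 512 + 512

print(shapes)  # [148, 74, 72, 36, 34, 17, 15, 7]
print(params)  # [896, 18496, 73856, 147584]
print(flat, dense_512)  # 6272 3211776
```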

 

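  - Train the model and visualize training

# batch_size is not passed to fit(): the generator already yields batches of 20 images,
# and recent TensorFlow versions reject batch_size for generator inputs
history = model.fit(train_generator,
                    steps_per_epoch = 100,
                    epochs = 30,
                    validation_data = validation_generator,
                    validation_steps = 50,
                    verbose = 2)

# Visualization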
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(len(acc))

plt.plot(epochs, loss, 'b--', label = 'Train Loss')
plt.plot(epochs, val_loss, 'r:', label = 'Validation Loss')
plt.grid()
plt.legend()

# Draw accuracy on a separate figure so that it does not share an axis with the loss
plt.figure()
plt.plot(epochs, acc, 'b--', label = 'Train Accuracy')
plt.plot(epochs, val_acc, 'r:', label = 'Validation Accuracy')
plt.grid()
plt.legend()

plt.show()

 

  - Save the model

model.save('cats_and_dogs_model.h5')

 

  - Use a pretrained model

from tensorflow.keras.optimizers import RMSprop

conv_base = VGG16(weights = 'imagenet',
                  input_shape = (150, 150, 3), include_top = False)

def build_model_with_pretrained(conv_base):
    # Stack a small classifier head on top of the pretrained convolutional base
    model = Sequential()
    model.add(conv_base)
    model.add(Flatten())
    model.add(Dense(256, activation = 'relu'))
    model.add(Dense(1, activation = 'sigmoid'))

    model.compile(loss = 'binary_crossentropy',
                  optimizer = RMSprop(learning_rate = 2e-5),
                  metrics = ['accuracy'])
    return model
  • Check the number of parameters
model = build_model_with_pretrained(conv_base)
model.summary()

# Output
Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 vgg16 (Functional)          (None, 4, 4, 512)         14714688  
                                                                 
 flatten_2 (Flatten)         (None, 8192)              0         
                                                                 
 dense_4 (Dense)             (None, 256)               2097408   
                                                                 
 dense_5 (Dense)             (None, 1)                 257       
                                                                 
=================================================================
Total params: 16,812,353
Trainable params: 16,812,353
Non-trainable params: 0
_________________________________________________________________

 

  - Freeze layers

  • Before training, freeze the convolutional base so that its weights are not updated
# Before freezing
print(len(model.trainable_weights))

# Output
30


# After freezing
conv_base.trainable = False
print(len(model.trainable_weights))

# Output
4
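
The counts 30 and 4 are numbers of weight tensors, not individual parameters. The VGG16 convolutional base has 13 convolution layers, each with a kernel and a bias, and the two Dense layers on top contribute two tensors each. The sketch below reproduces that bookkeeping with constant and function names chosen for this example.

```python
VGG16_CONV_LAYERS = 13   # convolution layers in the VGG16 base (include_top = False)
HEAD_DENSE_LAYERS = 2    # Dense(256) and Dense(1) stacked on top
TENSORS_PER_LAYER = 2    # every such layer owns a kernel and a bias


def trainable_tensor_count(base_frozen):
    """Number of entries expected in model.trainable_weights."""
    base = 0 if base_frozen else VGG16_CONV_LAYERS * TENSORS_PER_LAYER
    head = HEAD_DENSE_LAYERS * TENSORS_PER_LAYER
    return base + head


before = trainable_tensor_count(base_frozen=False)
after = trainable_tensor_count(base_frozen=True)
print(before, after)  # 30 4
```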

 

  - Compile the model

  • Because the trainable attribute was changed, the model must be compiled again
model.compile(loss = 'binary_crossentropy',
              optimizer = RMSprop(learning_rate = 2e-5),
              metrics = ['accuracy'])

 

  - Image generators

train_datagen = ImageDataGenerator(
    rescale = 1. / 255,
    rotation_range = 40,
    width_shift_range = 0.2,
    height_shift_range = 0.2,
    shear_range = 0.2,
    zoom_range = 0.2,
    horizontal_flip = True
)

train_generator = train_datagen.flow_from_directory(
    train_dir,
    target_size = (150, 150),
    batch_size = 20,
    class_mode = 'binary'
)

val_datagen = ImageDataGenerator(rescale = 1. / 255)

validation_generator = val_datagen.flow_from_directory(
    validation_dir,
    target_size = (150, 150),
    batch_size = 20,
    class_mode = 'binary'
)

# Output
Found 2000 images belonging to 2 classes.
Found 1000 images belonging to 2 classes.

 

  - Retrain the model

# batch_size is not passed to fit(): the generator already yields batches of 20 images
history2 = model.fit(train_generator,
                    steps_per_epoch = 100,
                    epochs = 30,
                    validation_data = validation_generator,
                    validation_steps = 50,
                    verbose = 2)

acc = history2.history['accuracy']
val_acc = history2.history['val_accuracy']
loss = history2.history['loss']
val_loss = history2.history['val_loss']
epochs = range(len(acc))

plt.plot(epochs, loss, 'b--', label = 'Train Loss')
plt.plot(epochs, val_loss, 'r:', label = 'Validation Loss')
plt.grid()
plt.legend()

# Draw accuracy on a separate figure so that it does not share an axis with the loss
plt.figure()
plt.plot(epochs, acc, 'b--', label = 'Train Accuracy')
plt.plot(epochs, val_acc, 'r:', label = 'Validation Accuracy')
plt.grid()
plt.legend()

plt.show()

 

  - Save the model

model.save('cats_and_dogs_with_pretrained_model.h5')

 

 

2. Feature Map 시각화

  - Build the model

import numpy as np
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing import image


# Load the saved model
model = load_model('cats_and_dogs_model.h5')
model.summary()

# Output
Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 conv2d (Conv2D)             (None, 148, 148, 32)      896       
                                                                 
 max_pooling2d (MaxPooling2D  (None, 74, 74, 32)       0         
 )                                                               
                                                                 
 conv2d_1 (Conv2D)           (None, 72, 72, 64)        18496     
                                                                 
 max_pooling2d_1 (MaxPooling  (None, 36, 36, 64)       0         
 2D)                                                             
                                                                 
 conv2d_2 (Conv2D)           (None, 34, 34, 128)       73856     
                                                                 
 max_pooling2d_2 (MaxPooling  (None, 17, 17, 128)      0         
 2D)                                                             
                                                                 
 conv2d_3 (Conv2D)           (None, 15, 15, 128)       147584    
                                                                 
 max_pooling2d_3 (MaxPooling  (None, 7, 7, 128)        0         
 2D)                                                             
                                                                 
 flatten_1 (Flatten)         (None, 6272)              0         
                                                                 
 dropout (Dropout)           (None, 6272)              0         
                                                                 
 dense_2 (Dense)             (None, 512)               3211776   
                                                                 
 dense_3 (Dense)             (None, 1)                 513       
                                                                 
=================================================================
Total params: 3,453,121
Trainable params: 3,453,121
Non-trainable params: 0
_________________________________________________________________
img_path = 'cats_and_dogs_filtered/validation/dogs/dog.2000.jpg'

img = image.load_img(img_path, target_size = (150, 150))
img_tensor = image.img_to_array(img)
img_tensor = img_tensor[np.newaxis, ...]
img_tensor /= 255.
print(img_tensor.shape)

# Output
(1, 150, 150, 3)
plt.imshow(img_tensor[0])
plt.show()

# Collect the outputs of the first 8 layers only
conv_output = [layer.output for layer in model.layers[:8]]
conv_output

# Output
[<KerasTensor: shape=(None, 148, 148, 32) dtype=float32 (created by layer 'conv2d')>,
 <KerasTensor: shape=(None, 74, 74, 32) dtype=float32 (created by layer 'max_pooling2d')>,
 <KerasTensor: shape=(None, 72, 72, 64) dtype=float32 (created by layer 'conv2d_1')>,
 <KerasTensor: shape=(None, 36, 36, 64) dtype=float32 (created by layer 'max_pooling2d_1')>,
 <KerasTensor: shape=(None, 34, 34, 128) dtype=float32 (created by layer 'conv2d_2')>,
 <KerasTensor: shape=(None, 17, 17, 128) dtype=float32 (created by layer 'max_pooling2d_2')>,
 <KerasTensor: shape=(None, 15, 15, 128) dtype=float32 (created by layer 'conv2d_3')>,
 <KerasTensor: shape=(None, 7, 7, 128) dtype=float32 (created by layer 'max_pooling2d_3')>]
activation_model = Model(inputs = [model.input], outputs = conv_output)
activations = activation_model.predict(img_tensor)
len(activations)

# Output
8

 

  - Visualization

print(activations[0].shape)
plt.matshow(activations[0][0, :, :, 7], cmap = 'viridis')
plt.show()

# Output
(1, 148, 148, 32)

print(activations[0].shape)
plt.matshow(activations[0][0, :, :, 10], cmap = 'viridis')
plt.show()

# Output
(1, 148, 148, 32)

 

  - Visualize every intermediate activation

# Visualize how the image is transformed at each layer
layer_names = []
for layer in model.layers[:8]:
    layer_names.append(layer.name)

images_per_row = 16

for layer_name, layer_activation in zip(layer_names, activations):
    num_features = layer_activation.shape[-1]

    size = layer_activation.shape[1]

    num_cols = num_features // images_per_row
    display_grid = np.zeros((size * num_cols, size * images_per_row))

    for col in range(num_cols):
        for row in range(images_per_row):
            channel_image = layer_activation[0, :, :, col * images_per_row + row]
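            # Standardize the channel, then map it into the displayable 0..255 range
            channel_image -= channel_image.mean()
            channel_image /= (channel_image.std() + 1e-5)  # epsilon avoids division by zero for all-zero channels
            channel_image *= 64
            channel_image += 128
            channel_image = np.clip(channel_image, 0, 255).astype('uint8')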
            display_grid[col * size : (col + 1) * size, row * size : (row + 1) * size] = channel_image
        
    scale = 1. / size

    plt.figure(figsize = (scale * display_grid.shape[1],
                          scale * display_grid.shape[0]))
    plt.title(layer_name)
    plt.grid(False)
    plt.imshow(display_grid, aspect = 'auto', cmap = 'viridis')

plt.show()
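
The per-channel rescaling inside the loop can be isolated as a small function. This is a sketch on a made-up 2 x 2 "activation". The name `to_displayable` and the epsilon value are choices made for this example.

```python
import numpy as np


def to_displayable(channel, eps=1e-5):
    """Standardize one feature map and map it to uint8 values in 0..255."""
    channel = channel.astype('float32')        # astype returns a copy; the input is left untouched
    channel = channel - channel.mean()
    channel = channel / (channel.std() + eps)  # eps guards against channels that are constant
    channel = channel * 64 + 128               # center at mid-gray; one std spans 64 gray levels
    return np.clip(channel, 0, 255).astype('uint8')


fake_activation = np.array([[0.0, 1.0], [2.0, 3.0]])
img = to_displayable(fake_activation)
flat_img = to_displayable(np.zeros((2, 2)))    # a "dead" channel becomes uniform gray

print(img.dtype, flat_img[0, 0])  # uint8 128
```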

● CIFAR 10

  • Consists of 50,000 training images and 10,000 test images
  • The data is considerably more complex than MNIST
    • This makes it harder for a neural network to detect features

1. modules import

import tensorflow as tf
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.layers import Conv2D, MaxPool2D, Dense, Flatten, Input, Dropout, BatchNormalization
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical

from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt
import numpy as np

 

 

2. Load and preprocess the data

(x_train_full, y_train_full), (x_test, y_test) = cifar10.load_data()
print(x_train_full.shape, y_train_full.shape)
print(x_test.shape, y_test.shape)

# Output
(50000, 32, 32, 3) (50000, 1)
(10000, 32, 32, 3) (10000, 1)


# The targets are stored as integer class labels
print(y_test[0])

# Output
[3]
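
Because the targets are integer labels rather than one-hot vectors, the model later in this section is compiled with 'sparse_categorical_crossentropy'. Here is a minimal NumPy sketch showing that the sparse form gives the same value as cross-entropy on the one-hot version. The probability vector is a made-up softmax output, and the function names are chosen for this example.

```python
import numpy as np


def sparse_categorical_crossentropy(label, probs):
    """Loss for an integer label: minus the log-probability of the true class."""
    return float(-np.log(probs[label]))


def categorical_crossentropy(one_hot, probs):
    """Loss for a one-hot label: minus the sum of one_hot * log(probs)."""
    return float(-np.sum(one_hot * np.log(probs)))


# Hypothetical softmax output over the 10 classes (sums to 1)
probs = np.array([0.05, 0.05, 0.1, 0.6, 0.05, 0.05, 0.025, 0.025, 0.025, 0.025])
label = 3                    # integer label, as in y_test[0]
one_hot = np.eye(10)[label]  # the equivalent one-hot encoding

sparse_loss = sparse_categorical_crossentropy(label, probs)
dense_loss = categorical_crossentropy(one_hot, probs)
print(sparse_loss, dense_loss)  # both equal -log(0.6)
```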


# Sample images
np.random.seed(777)

class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']

sample_size = 9
# Draw indices from the actual number of training images (50,000)
random_idx = np.random.randint(x_train_full.shape[0], size = sample_size)

plt.figure(figsize = (5, 5))
for i, idx in enumerate(random_idx):
    plt.subplot(3, 3, i + 1)
    plt.xticks([])
    plt.yticks([])
    plt.imshow(x_train_full[idx])
    plt.xlabel(class_names[int(y_train_full[idx][0])])

plt.show()

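  • The images are only 32 * 32 pixels, so the resolution is low
# Normalize x with the per-channel mean and standard deviation of the training set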
x_mean = np.mean(x_train_full, axis = (0, 1, 2))
x_std = np.std(x_train_full, axis = (0, 1, 2))
x_train_full = (x_train_full - x_mean) / x_std
x_test = (x_test - x_mean) / x_std
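
Reducing over axis = (0, 1, 2) leaves one statistic per color channel, and broadcasting applies it to every pixel. The sketch below checks this on random data. The array is a synthetic stand-in, not CIFAR-10.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for an image batch: 100 images of 32 x 32 pixels with 3 channels
fake_images = rng.integers(0, 256, size=(100, 32, 32, 3)).astype('float64')

mean = fake_images.mean(axis=(0, 1, 2))  # shape (3,): one mean per channel
std = fake_images.std(axis=(0, 1, 2))    # shape (3,): one standard deviation per channel
normalized = (fake_images - mean) / std  # broadcasting applies them channel-wise

print(mean.shape, normalized.shape)      # (3,) (100, 32, 32, 3)
```

After this step each channel has mean close to 0 and standard deviation close to 1. The test set is normalized with the training statistics so that no information from it leaks into preprocessing.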


# Split into training and validation data
x_train, x_val, y_train, y_val = train_test_split(x_train_full, y_train_full, test_size = 0.3)


# Print the shapes of the preprocessed data
print(x_train.shape)
print(y_train.shape)

print(x_val.shape)
print(y_val.shape)

print(x_test.shape)
print(y_test.shape)

# Output
(35000, 32, 32, 3)
(35000, 1)
(15000, 32, 32, 3)
(15000, 1)
(10000, 32, 32, 3)
(10000, 1)

 

 

3. Build and compile the model

def model_build():
    model = Sequential()

    input = Input(shape = (32, 32, 3))

    output = Conv2D(filters = 32, kernel_size = 3, padding = 'same', activation = 'relu')(input)
    output = MaxPool2D(pool_size = (2, 2), strides = 2, padding = 'same')(output)

    output = Conv2D(filters = 64, kernel_size = 3, padding = 'same', activation = 'relu')(output)
    output = MaxPool2D(pool_size = (2, 2), strides = 2, padding = 'same')(output)

    output = Conv2D(filters = 128, kernel_size = 3, padding = 'same', activation = 'relu')(output)
    output = MaxPool2D(pool_size = (2, 2), strides = 2, padding = 'same')(output)

    output = Flatten()(output)
    output = Dense(256, activation = 'relu')(output)
    output = Dense(128, activation = 'relu')(output)
    output = Dense(10, activation = 'softmax')(output)

    model = Model(inputs = [input], outputs = [output])

    model.compile(optimizer = Adam(learning_rate = 1e-4),
                  loss = 'sparse_categorical_crossentropy',
                  metrics = ['accuracy'])
    return model
model = model_build()
model.summary()

# Output
Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_3 (InputLayer)        [(None, 32, 32, 3)]       0         
                                                                 
 conv2d_3 (Conv2D)           (None, 32, 32, 32)        896       
                                                                 
 max_pooling2d_3 (MaxPooling  (None, 16, 16, 32)       0         
 2D)                                                             
                                                                 
 conv2d_4 (Conv2D)           (None, 16, 16, 64)        18496     
                                                                 
 max_pooling2d_4 (MaxPooling  (None, 8, 8, 64)         0         
 2D)                                                             
                                                                 
 conv2d_5 (Conv2D)           (None, 8, 8, 128)         73856     
                                                                 
 max_pooling2d_5 (MaxPooling  (None, 4, 4, 128)        0         
 2D)                                                             
                                                                 
 flatten_1 (Flatten)         (None, 2048)              0         
                                                                 
 dense_3 (Dense)             (None, 256)               524544    
                                                                 
 dense_4 (Dense)             (None, 128)               32896     
                                                                 
 dense_5 (Dense)             (None, 10)                1290      
                                                                 
=================================================================
Total params: 651,978
Trainable params: 651,978
Non-trainable params: 0
_________________________________________________________________

 

 

4. Train and evaluate the model

history = model.fit(x_train, y_train,
                    epochs = 30,
                    batch_size = 256,
                    validation_data = (x_val, y_val))

 

 

5. Visualize training

plt.figure(figsize = (12, 4))

plt.subplot(1, 2, 1)
plt.plot(history.history['loss'], 'b--', label = 'loss')
plt.plot(history.history['val_loss'], 'r:', label = 'val_loss')
plt.xlabel('Epochs')
plt.grid()
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(history.history['accuracy'], 'b--', label = 'accuracy')
plt.plot(history.history['val_accuracy'], 'r:', label = 'val_accuracy')
plt.xlabel('Epochs')
plt.grid()
plt.legend()

  - This model does not perform well

  - Techniques that prevent overfitting, such as regularization and dropout, are needed
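
As a rough illustration of what a dropout layer does, here is a NumPy sketch of "inverted" dropout: during training a random fraction of the activations is zeroed and the survivors are scaled up, while at inference the input passes through unchanged. The function below is a simplified stand-in written for this example, not Keras's implementation.

```python
import numpy as np


def dropout(x, rate, training, rng):
    """Zero each element with probability `rate` and rescale the rest by 1 / (1 - rate)."""
    if not training or rate == 0.0:
        return x
    keep_mask = rng.random(x.shape) >= rate
    return x * keep_mask / (1.0 - rate)


rng = np.random.default_rng(0)
x = np.ones((1000,))

train_out = dropout(x, rate=0.5, training=True, rng=rng)   # values are either 0.0 or 2.0
infer_out = dropout(x, rate=0.5, training=False, rng=rng)  # unchanged at inference time

print(np.unique(train_out), (train_out == 0).mean())
```

The rescaling keeps the expected activation the same in both modes, so no adjustment is needed when the layer is switched off for prediction.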

def model_build2():
    model = Sequential()

    input = Input(shape = (32, 32, 3))

    output = Conv2D(filters = 32, kernel_size = 3, padding = 'same', activation = 'relu')(input)
    output = BatchNormalization()(output)
    output = MaxPool2D(pool_size = (2, 2), strides = 2, padding = 'same')(output)

    output = Conv2D(filters = 64, kernel_size = 3, padding = 'same', activation = 'relu')(output)
    output = BatchNormalization()(output)
    output = MaxPool2D(pool_size = (2, 2), strides = 2, padding = 'same')(output)

    output = Conv2D(filters = 128, kernel_size = 3, padding = 'same', activation = 'relu')(output)
    output = BatchNormalization()(output)
    output = MaxPool2D(pool_size = (2, 2), strides = 2, padding = 'same')(output)
    output = Dropout(0.5)(output)

    output = Flatten()(output)
    output = Dense(256, activation = 'relu')(output)
    output = Dropout(0.5)(output)
    output = Dense(128, activation = 'relu')(output)
    output = Dense(10, activation = 'softmax')(output)

    model = Model(inputs = [input], outputs = [output])

    model.compile(optimizer = Adam(learning_rate = 1e-4),
                  loss = 'sparse_categorical_crossentropy',
                  metrics = ['accuracy'])
    return model
model2 = model_build2()
model2.summary()

# Output
Model: "model_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_4 (InputLayer)        [(None, 32, 32, 3)]       0         
                                                                 
 conv2d_6 (Conv2D)           (None, 32, 32, 32)        896       
                                                                 
 batch_normalization (BatchN  (None, 32, 32, 32)       128       
 ormalization)                                                   
                                                                 
 max_pooling2d_6 (MaxPooling  (None, 16, 16, 32)       0         
 2D)                                                             
                                                                 
 conv2d_7 (Conv2D)           (None, 16, 16, 64)        18496     
                                                                 
 batch_normalization_1 (Batc  (None, 16, 16, 64)       256       
 hNormalization)                                                 
                                                                 
 max_pooling2d_7 (MaxPooling  (None, 8, 8, 64)         0         
 2D)                                                             
                                                                 
 conv2d_8 (Conv2D)           (None, 8, 8, 128)         73856     
                                                                 
 batch_normalization_2 (Batc  (None, 8, 8, 128)        512       
 hNormalization)                                                 
                                                                 
 max_pooling2d_8 (MaxPooling  (None, 4, 4, 128)        0         
 2D)                                                             
                                                                 
 dropout (Dropout)           (None, 4, 4, 128)         0         
                                                                 
 flatten_2 (Flatten)         (None, 2048)              0         
                                                                 
 dense_6 (Dense)             (None, 256)               524544    
                                                                 
 dropout_1 (Dropout)         (None, 256)               0         
                                                                 
 dense_7 (Dense)             (None, 128)               32896     
                                                                 
 dense_8 (Dense)             (None, 10)                1290      
                                                                 
=================================================================
Total params: 652,874
Trainable params: 652,426
Non-trainable params: 448
_________________________________________________________________

 

 

6. Model Training and Evaluation

history2 = model2.fit(x_train, y_train,
                      epochs = 30,
                      batch_size = 256,
                      validation_data = (x_val, y_val))

 

 

7. Visualizing the Training Process

plt.figure(figsize = (12, 4))

plt.subplot(1, 2, 1)
plt.plot(history2.history['loss'], 'b--', label = 'loss')
plt.plot(history2.history['val_loss'], 'r:', label = 'val_loss')
plt.xlabel('Epochs')
plt.grid()
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(history2.history['accuracy'], 'b--', label = 'accuracy')
plt.plot(history2.history['val_accuracy'], 'r:', label = 'val_accuracy')
plt.xlabel('Epochs')
plt.grid()
plt.legend()

  • Results on the validation data improved considerably

1. Importing Modules

%load_ext tensorboard
import datetime
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

from tensorflow.keras import Model
from tensorflow.keras.models import Sequential
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.datasets.fashion_mnist import load_data
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.layers import Dense, Conv2D, MaxPool2D, Dropout, Input, Flatten

 

 

2. Loading and Preprocessing the Data

(x_train, y_train), (x_test, y_test) = load_data()

x_train = x_train[..., np.newaxis]
x_test = x_test[..., np.newaxis]

x_train = x_train / 255.
x_test = x_test / 255.

print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

# Output
(60000, 28, 28, 1)
(60000,)
(10000, 28, 28, 1)
(10000,)

# Fashion-MNIST class names, indexed by label (0-9)
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']

 

3. Building and Compiling the Model

def build_model():
    # Functional API: three stacked 3x3 convolutions with no pooling
    inputs = Input(shape = (28, 28, 1))
    # Conv2D defaults to no activation (linear) and 'valid' padding
    output = Conv2D(filters = 32, kernel_size = (3, 3))(inputs)
    output = Conv2D(filters = 64, kernel_size = (3, 3))(output)
    output = Conv2D(filters = 64, kernel_size = (3, 3))(output)
    output = Flatten()(output)
    output = Dense(units = 128, activation = 'relu')(output)
    output = Dense(units = 64, activation = 'relu')(output)
    output = Dense(units = 10, activation = 'softmax')(output)

    model = Model(inputs = [inputs], outputs = [output])

    # Labels are integers (0-9), hence the sparse categorical loss
    model.compile(optimizer = 'adam',
                  loss = 'sparse_categorical_crossentropy',
                  metrics = ['acc'])
    return model

model_1 = build_model()
model_1.summary()

# Output
Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_1 (InputLayer)        [(None, 28, 28, 1)]       0         
                                                                 
 conv2d (Conv2D)             (None, 26, 26, 32)        320       
                                                                 
 conv2d_1 (Conv2D)           (None, 24, 24, 64)        18496     
                                                                 
 conv2d_2 (Conv2D)           (None, 22, 22, 64)        36928     
                                                                 
 flatten (Flatten)           (None, 30976)             0         
                                                                 
 dense (Dense)               (None, 128)               3965056   
                                                                 
 dense_1 (Dense)             (None, 64)                8256      
                                                                 
 dense_2 (Dense)             (None, 10)                650       
                                                                 
=================================================================
Total params: 4,029,706
Trainable params: 4,029,706
Non-trainable params: 0
_________________________________________________________________
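
The output shapes in the summary above follow from standard convolution arithmetic: a 3x3 convolution with 'valid' padding and stride 1 shrinks each spatial dimension by 2. A minimal sketch of that calculation is below; the helper names `conv_out` and `pool_out` are made up for illustration and are not part of Keras.

```python
def conv_out(size, kernel, padding="valid", stride=1):
    """Spatial output size of a convolution along one dimension."""
    if padding == "same":
        return -(-size // stride)  # ceil(size / stride)
    return (size - kernel) // stride + 1


def pool_out(size, pool=2, stride=2):
    """Spatial output size of 'valid' max pooling along one dimension."""
    return (size - pool) // stride + 1


# model_1: three 3x3 'valid' convolutions applied to a 28x28 input
size = 28
for _ in range(3):
    size = conv_out(size, kernel=3)

# The last convolution has 64 filters, so Flatten yields size * size * 64 units
flat_units = size * size * 64
print(size, flat_units)
```

The 28 -> 26 -> 24 -> 22 progression and the resulting Flatten width explain why the first Dense layer dominates the parameter count.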

 

 

4. Training the Model

hist_1 = model_1.fit(x_train, y_train,
                     epochs = 25,
                     validation_split = 0.3,
                     batch_size = 128)

 

 

5. Visualizing the Training Results

plt.figure(figsize = (12, 4))
plt.subplot(1, 2, 1)
plt.plot(hist_1.history['loss'], 'b--', label = 'loss')
plt.plot(hist_1.history['val_loss'], 'r:', label = 'val_loss')
plt.xlabel('Epochs')
plt.grid()
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(hist_1.history['acc'], 'b--', label = 'accuracy')
plt.plot(hist_1.history['val_acc'], 'r:', label = 'val_accuracy')
plt.xlabel('Epochs')
plt.grid()
plt.legend()

 

 

6. Evaluating the Model

model_1.evaluate(x_test, y_test)

# Output
loss: 1.1168 - acc: 0.8566
[1.116817831993103, 0.8565999865531921]

 

 

7. Rebuilding the Model (Comparing the Number of Trainable Parameters)

def build_model_2():
    # Same convolutions as build_model, with max pooling after each one
    inputs = Input(shape = (28, 28, 1))
    output = Conv2D(filters = 32, kernel_size = (3, 3))(inputs)
    output = MaxPool2D(strides = (2, 2))(output)
    output = Conv2D(filters = 64, kernel_size = (3, 3))(output)
    output = MaxPool2D(strides = (2, 2))(output)
    output = Conv2D(filters = 64, kernel_size = (3, 3))(output)
    output = MaxPool2D(strides = (2, 2))(output)
    output = Flatten()(output)
    # Dropout after each hidden Dense layer for regularization
    output = Dense(units = 128, activation = 'relu')(output)
    output = Dropout(0.3)(output)
    output = Dense(units = 64, activation = 'relu')(output)
    output = Dropout(0.3)(output)
    output = Dense(units = 10, activation = 'softmax')(output)

    model = Model(inputs = [inputs], outputs = [output])

    model.compile(optimizer = 'adam',
                  loss = 'sparse_categorical_crossentropy',
                  metrics = ['acc'])
    return model

model_2 = build_model_2()
model_2.summary()

# Output
Model: "model_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_3 (InputLayer)        [(None, 28, 28, 1)]       0         
                                                                 
 conv2d_6 (Conv2D)           (None, 26, 26, 32)        320       
                                                                 
 max_pooling2d (MaxPooling2D  (None, 13, 13, 32)       0         
 )                                                               
                                                                 
 conv2d_7 (Conv2D)           (None, 11, 11, 64)        18496     
                                                                 
 max_pooling2d_1 (MaxPooling  (None, 5, 5, 64)         0         
 2D)                                                             
                                                                 
 conv2d_8 (Conv2D)           (None, 3, 3, 64)          36928     
                                                                 
 max_pooling2d_2 (MaxPooling  (None, 1, 1, 64)         0         
 2D)                                                             
                                                                 
 flatten_2 (Flatten)         (None, 64)                0         
                                                                 
 dense_6 (Dense)             (None, 128)               8320      
                                                                 
 dropout (Dropout)           (None, 128)               0         
                                                                 
 dense_7 (Dense)             (None, 64)                8256      
                                                                 
 dropout_1 (Dropout)         (None, 64)                0         
                                                                 
 dense_8 (Dense)             (None, 10)                650       
                                                                 
=================================================================
Total params: 72,970
Trainable params: 72,970
Non-trainable params: 0
_________________________________________________________________
  • The number of trainable parameters drops sharply (4,029,706 → 72,970)
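
The reduction can be checked by hand. A Conv2D layer has (kernel x kernel x input channels + 1) x filters parameters and a Dense layer has (inputs + 1) x units. The sketch below recomputes both totals from the layer sizes shown in the summaries; the helper names are illustrative, not a Keras API.

```python
def conv2d_params(kernel, in_ch, out_ch):
    """Weights plus one bias per filter for a square-kernel Conv2D layer."""
    return (kernel * kernel * in_ch + 1) * out_ch


def dense_params(in_units, out_units):
    """Weights plus one bias per unit for a Dense layer."""
    return (in_units + 1) * out_units


# Both models share the same three convolutions and the same last two Dense layers
conv_total = (conv2d_params(3, 1, 32)
              + conv2d_params(3, 32, 64)
              + conv2d_params(3, 64, 64))
head = dense_params(128, 64) + dense_params(64, 10)

# Only the first Dense layer differs: its input is the Flatten output
model_1_total = conv_total + dense_params(22 * 22 * 64, 128) + head  # no pooling
model_2_total = conv_total + dense_params(1 * 1 * 64, 128) + head    # with pooling

print(model_1_total, model_2_total)
```

Almost all of the difference comes from the first Dense layer, whose input shrinks from 30,976 to 64 units once pooling is added.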

 

 

8. Retraining the Model

hist_2 = model_2.fit(x_train, y_train,
                     epochs = 25,
                     validation_split = 0.3,
                     batch_size = 128)

# Visualize the retraining results
plt.figure(figsize = (12, 4))
plt.subplot(1, 2, 1)
plt.plot(hist_2.history['loss'], 'b--', label = 'loss')
plt.plot(hist_2.history['val_loss'], 'r:', label = 'val_loss')
plt.xlabel('Epochs')
plt.grid()
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(hist_2.history['acc'], 'b--', label = 'accuracy')
plt.plot(hist_2.history['val_acc'], 'r:', label = 'val_accuracy')
plt.xlabel('Epochs')
plt.grid()
plt.legend()

  • The model overfits the training data less than the first model did
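
Beyond reading the plots, the degree of overfitting can be summarized as the gap between training and validation accuracy over the last few epochs. The following is a rough sketch; `overfit_gap` is a hypothetical helper, and the numbers in `toy_history` are placeholders for illustration, not results from this post.

```python
def overfit_gap(history, train_key="acc", val_key="val_acc", last_n=5):
    """Mean (train - validation) metric over the last `last_n` epochs."""
    train = history[train_key][-last_n:]
    val = history[val_key][-last_n:]
    diffs = [t - v for t, v in zip(train, val)]
    return sum(diffs) / len(diffs)


# Placeholder curves for illustration only; NOT measured values
toy_history = {"acc": [0.80, 0.90, 0.95], "val_acc": [0.78, 0.85, 0.86]}
gap = overfit_gap(toy_history, last_n=2)
print(round(gap, 4))
```

With the real runs, comparing `overfit_gap(hist_1.history)` against `overfit_gap(hist_2.history)` would give a single number per model; a smaller gap suggests less overfitting.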

 

9. Re-evaluating the Model

model_2.evaluate(x_test, y_test)

# Output
loss: 0.4026 - acc: 0.8830
[0.4026452302932739, 0.8830000162124634]

 

 

10. Improving Model Performance (Stacking More Layers)

from tensorflow.keras.layers import BatchNormalization, ReLU

def build_model_3():
    inputs = Input(shape = (28, 28, 1))
    # Block 1: two convolutions, then pooling and dropout
    output = Conv2D(filters = 32, kernel_size = 3, activation = 'relu', padding = 'same')(inputs)
    output = Conv2D(filters = 64, kernel_size = 3, activation = 'relu', padding = 'valid')(output)
    output = MaxPool2D(strides = (2, 2))(output)
    output = Dropout(0.5)(output)

    # Block 2: wider convolutions, then pooling and dropout
    output = Conv2D(filters = 128, kernel_size = 3, activation = 'relu', padding = 'same')(output)
    output = Conv2D(filters = 256, kernel_size = 3, activation = 'relu', padding = 'valid')(output)
    output = MaxPool2D(strides = (2, 2))(output)
    output = Dropout(0.5)(output)

    # Classifier head
    output = Flatten()(output)
    output = Dense(units = 256, activation = 'relu')(output)
    output = Dropout(0.5)(output)
    output = Dense(units = 100, activation = 'relu')(output)
    output = Dropout(0.5)(output)
    output = Dense(units = 10, activation = 'softmax')(output)

    model = Model(inputs = [inputs], outputs = [output])

    model.compile(optimizer = 'adam',
                  loss = 'sparse_categorical_crossentropy',
                  metrics = ['acc'])
    return model

model_3 = build_model_3()
model_3.summary()

# Output
Model: "model_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_4 (InputLayer)        [(None, 28, 28, 1)]       0         
                                                                 
 conv2d_9 (Conv2D)           (None, 28, 28, 32)        320       
                                                                 
 conv2d_10 (Conv2D)          (None, 26, 26, 64)        18496     
                                                                 
 max_pooling2d_3 (MaxPooling  (None, 13, 13, 64)       0         
 2D)                                                             
                                                                 
 dropout_2 (Dropout)         (None, 13, 13, 64)        0         
                                                                 
 conv2d_11 (Conv2D)          (None, 13, 13, 128)       73856     
                                                                 
 conv2d_12 (Conv2D)          (None, 11, 11, 256)       295168    
                                                                 
 max_pooling2d_4 (MaxPooling  (None, 5, 5, 256)        0         
 2D)                                                             
                                                                 
 dropout_3 (Dropout)         (None, 5, 5, 256)         0         
                                                                 
 flatten_3 (Flatten)         (None, 6400)              0         
                                                                 
 dense_9 (Dense)             (None, 256)               1638656   
                                                                 
 dropout_4 (Dropout)         (None, 256)               0         
                                                                 
 dense_10 (Dense)            (None, 100)               25700     
                                                                 
 dropout_5 (Dropout)         (None, 100)               0         
                                                                 
 dense_11 (Dense)            (None, 10)                1010      
                                                                 
=================================================================
Total params: 2,053,206
Trainable params: 2,053,206
Non-trainable params: 0
_________________________________________________________________

 

  - Model training and visualization of the results

hist_3 = model_3.fit(x_train, y_train,
                     epochs = 25,
                     validation_split = 0.3,
                     batch_size = 128)

  - The model does not overfit, and the deeper stack of layers still performs well

plt.figure(figsize = (12, 4))
plt.subplot(1, 2, 1)
plt.plot(hist_3.history['loss'], 'b--', label = 'loss')
plt.plot(hist_3.history['val_loss'], 'r:', label = 'val_loss')
plt.xlabel('Epochs')
plt.grid()
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(hist_3.history['acc'], 'b--', label = 'accuracy')
plt.plot(hist_3.history['val_acc'], 'r:', label = 'val_accuracy')
plt.xlabel('Epochs')
plt.grid()
plt.legend()

model_3.evaluate(x_test, y_test)

# Output
loss: 0.2157 - acc: 0.9261
[0.21573999524116516, 0.9261000156402588]

 

 

11. Improving Model Performance (Image Augmentation)

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Random rotation, zoom, shear, shifts, and horizontal flips
image_generator = ImageDataGenerator(
    rotation_range = 10,
    zoom_range = 0.2,
    shear_range = 0.6,
    width_shift_range = 0.1,
    height_shift_range = 0.1,
    horizontal_flip = True,
    vertical_flip = False
)

augment_size = 200

print(x_train.shape)
print(x_train[0].shape)

# Output
(60000, 28, 28, 1)
(28, 28, 1)

# Repeat the first training image augment_size times, then augment each copy
x_augment = next(image_generator.flow(np.tile(x_train[0].reshape(28 * 28 * 1), augment_size).reshape(-1, 28, 28, 1),
                                      np.zeros(augment_size), batch_size = augment_size, shuffle = False))[0]

plt.figure(figsize = (10, 10))
for i in range(1, 101):
    plt.subplot(10, 10, i)
    plt.axis('off')
    plt.imshow(x_augment[i - 1].reshape(28, 28), cmap = 'gray')

  • Use the same approach to add augmented data to the training set

image_generator = ImageDataGenerator(
    rotation_range = 15,
    zoom_range = 0.1,
    shear_range = 0.6,
    width_shift_range = 0.15,
    height_shift_range = 0.1,
    horizontal_flip = True,
    vertical_flip = False
)

augment_size = 30000

random_mask = np.random.randint(x_train.shape[0], size = augment_size)
x_augmented = x_train[random_mask].copy()
y_augmented = y_train[random_mask].copy()

# Labels are not changed by augmentation, so dummy zeros are passed to flow()
x_augmented = next(image_generator.flow(x_augmented, np.zeros(augment_size),
                                        batch_size = augment_size, shuffle = False))[0]
x_train = np.concatenate((x_train, x_augmented))
y_train = np.concatenate((y_train, y_augmented))

# The 30,000 generated augmented samples have been appended
print(x_train.shape)

# Output
(90000, 28, 28, 1)
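
One caveat worth keeping in mind: Keras `validation_split` takes the validation samples from the end of the arrays, before any shuffling. Because the augmented images were concatenated at the end, a 0.3 split of the 90,000 samples would be drawn entirely from augmented images. A possible remedy, sketched here under that assumption, is to shuffle features and labels together before calling `fit`. The helper `shuffle_together` and the tiny demo arrays are hypothetical.

```python
import numpy as np


def shuffle_together(x, y, seed=0):
    """Return x and y shuffled with the same random permutation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    return x[order], y[order]


# Tiny stand-in arrays (hypothetical); with the real data the call would be
# x_train, y_train = shuffle_together(x_train, y_train)
x_demo = np.arange(10).reshape(10, 1)
y_demo = np.arange(10)
x_shuf, y_shuf = shuffle_together(x_demo, y_demo)

# Each feature row still lines up with its label after shuffling
print((x_shuf[:, 0] == y_shuf).all())
```

Passing an explicit `validation_data` set built only from original images is another option; which one fits better depends on what the validation curve is meant to measure.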

 

  - Model training and visualization of the results

model_4 = build_model_3()
model_4.summary()

# Output
Model: "model_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_5 (InputLayer)        [(None, 28, 28, 1)]       0         
                                                                 
 conv2d_13 (Conv2D)          (None, 28, 28, 32)        320       
                                                                 
 conv2d_14 (Conv2D)          (None, 26, 26, 64)        18496     
                                                                 
 max_pooling2d_5 (MaxPooling  (None, 13, 13, 64)       0         
 2D)                                                             
                                                                 
 dropout_6 (Dropout)         (None, 13, 13, 64)        0         
                                                                 
 conv2d_15 (Conv2D)          (None, 13, 13, 128)       73856     
                                                                 
 conv2d_16 (Conv2D)          (None, 11, 11, 256)       295168    
                                                                 
 max_pooling2d_6 (MaxPooling  (None, 5, 5, 256)        0         
 2D)                                                             
                                                                 
 dropout_7 (Dropout)         (None, 5, 5, 256)         0         
                                                                 
 flatten_4 (Flatten)         (None, 6400)              0         
                                                                 
 dense_12 (Dense)            (None, 256)               1638656   
                                                                 
 dropout_8 (Dropout)         (None, 256)               0         
                                                                 
 dense_13 (Dense)            (None, 100)               25700     
                                                                 
 dropout_9 (Dropout)         (None, 100)               0         
                                                                 
 dense_14 (Dense)            (None, 10)                1010      
                                                                 
=================================================================
Total params: 2,053,206
Trainable params: 2,053,206
Non-trainable params: 0
_________________________________________________________________

hist_4 = model_4.fit(x_train, y_train,
                     epochs = 25,
                     validation_split = 0.3,
                     batch_size = 128)

plt.figure(figsize = (12, 4))
plt.subplot(1, 2, 1)
plt.plot(hist_4.history['loss'], 'b--', label = 'loss')
plt.plot(hist_4.history['val_loss'], 'r:', label = 'val_loss')
plt.xlabel('Epochs')
plt.grid()
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(hist_4.history['acc'], 'b--', label = 'accuracy')
plt.plot(hist_4.history['val_acc'], 'r:', label = 'val_accuracy')
plt.xlabel('Epochs')
plt.grid()
plt.legend()

model_4.evaluate(x_test, y_test)

# Output
loss: 0.2023 - acc: 0.9313
[0.2023032009601593, 0.9312999844551086]

 

  - Training with different arguments (hyperparameters) than before would likely yield even better results
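
One simple way to vary the training arguments systematically is to enumerate a small grid of settings and train once per combination. The sketch below only generates the combinations; the search space values are hypothetical examples, not settings tested in this post.

```python
from itertools import product

# Hypothetical search space; the values are illustrative only
search_space = {
    "epochs": [25, 40],
    "batch_size": [64, 128, 256],
    "validation_split": [0.2, 0.3],
}


def iter_configs(space):
    """Yield one dict of keyword arguments per combination in the grid."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


configs = list(iter_configs(search_space))
print(len(configs))
```

Each dictionary could then be unpacked into a fresh model's training call, for example `build_model_3().fit(x_train, y_train, **config)`, keeping the test-set evaluation for the final comparison.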
