1. Error Backpropagation Algorithm

  • Run a forward pass on the training data to compute the loss value
  • Store the intermediate values of each layer for use in backpropagation
  • Differentiate the loss function with respect to the trainable parameters (weights, biases),
    applying the chain rule one layer at a time from the last layer backward, reusing the values stored in each layer
  • Propagate the error backward while updating the trainable parameters little by little (a small sketch follows this list)
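The following is a minimal sketch of these steps for a single linear unit with a squared-error loss; the variable names (w, b, lr) and values are illustrative and not part of the original post.

# One training step repeated a few times: forward pass (intermediate kept),
# backward pass via the chain rule, then a small parameter update.
w, b = 0.2, 0.0          # trainable parameters
x, t = 2.0, 1.0          # one sample and its target
lr = 0.1                 # learning rate

for step in range(3):
    # forward: compute the loss, keeping the intermediate value y
    y = w * x + b
    loss = 0.5 * (y - t) ** 2
    # backward: chain rule from the loss toward each parameter
    dloss_dy = y - t              # dL/dy
    dloss_dw = dloss_dy * x       # dL/dw = dL/dy * dy/dw
    dloss_db = dloss_dy * 1.0     # dL/db = dL/dy * dy/db
    # update: move the parameters slightly against the gradient
    w -= lr * dloss_dw
    b -= lr * dloss_db
    print(step, round(loss, 4))   # the loss decreases every step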

 

  - Characteristics of learning with backpropagation

  • The loss function is evaluated only once per update, and the derivatives are obtained with the chain rule, so training time is greatly reduced (see the comparison sketch after this list)
  • All intermediate values needed for the derivatives are stored, so memory usage is high
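As a rough, illustrative comparison (an assumed example, not from the original post): estimating the gradient numerically takes extra forward passes for every single parameter, whereas backpropagation obtains all of them from one forward pass plus one backward pass.

import numpy as np

# Toy loss: L = sum((W @ x)**2). Numerical differentiation below runs
# 2 forward passes per parameter; backpropagation needs only one
# forward pass (whose intermediate y is cached) and one backward pass.
def forward(W, x):
    return np.sum((W @ x) ** 2)

np.random.seed(0)
W = np.random.randn(20, 10)
x = np.random.randn(10)

# Numerical gradient: 2 * W.size = 400 forward passes
eps = 1e-5
num_grad = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        num_grad[i, j] = (forward(W_plus, x) - forward(W_minus, x)) / (2 * eps)

# Backpropagation: one forward pass and one backward pass
y = W @ x                            # forward (cached intermediate)
bp_grad = np.outer(2 * y, x)         # chain rule: dL/dW = (dL/dy) x^T

print(np.allclose(num_grad, bp_grad))   # True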

 

  - The importance of differentiability in neural network training

  • In gradient descent, derivatives are used to find the minimum, i.e. the optimal value, of the loss (cost) function
  • The derivatives are used to update the trainable parameters, gradually moving the model's weights toward their optimal values (a minimal example follows this list)
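A minimal gradient-descent sketch (illustrative values): the update rule w <- w - lr * f'(w) is only possible because f is differentiable.

# Minimize f(w) = (w - 3)^2 by gradient descent
w = 0.0
lr = 0.1
for step in range(30):
    grad = 2 * (w - 3)   # f'(w)
    w -= lr * grad
print(w)                 # close to 3, the minimizer of f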

https://www.pinterest.co.kr/pin/424816177350692379/

 

2. Differentiation of Composite Functions (Chain Rule)

$$ \frac{d}{dx}[f(g(x))]=f'(g(x))g'(x) $$

  • The rule can be applied repeatedly through any number of intermediate variables (a numerical check follows this list)
    \( \frac{\partial f}{\partial x}=\frac{\partial f}{\partial u} \times \frac{\partial u}{\partial m} \times \frac{\partial m}{\partial n} \times \cdots \times \frac{\partial l}{\partial k} \times \frac{\partial k}{\partial g} \times \frac{\partial g}{\partial x} \)
  • A partial derivative can be taken for each factor individually
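As a quick sanity check (an added illustration, not part of the original post), the chain-rule product can be compared against a finite-difference estimate of the full composite function:

import numpy as np

# Chain rule for f(g(x)) with f(u) = u**2 and g(x) = sin(x):
# df/dx = f'(g(x)) * g'(x) = 2*sin(x)*cos(x)
x = 0.7
analytic = 2 * np.sin(x) * np.cos(x)

eps = 1e-6
numeric = (np.sin(x + eps) ** 2 - np.sin(x - eps) ** 2) / (2 * eps)

print(np.isclose(analytic, numeric))   # True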

https://www.freecodecamp.org/news/demystifying-gradient-descent-and-backpropagation-via-logistic-regression-based-image-classification-9b5526c2ed46/

  • Intuitive understanding of backpropagation
    • During training, i.e. while searching for the minimum of the loss function, it tells us how strongly the loss is affected by a change in each weight or bias

 

  - Composite function differentiation example

https://medium.com/spidernitt/breaking-down-neural-networks-an-intuitive-approach-to-backpropagation-3b2ff958794c

\( a=-1, \,\, b=3, \,\, c=4,\)

When \( x=a+b, \,\, y=b+c, \,\, f=x*y \),

 

\( \begin{matrix} \frac{\partial f}{\partial x} &=& y+x\frac{\partial y}{\partial x} \\ &=& (b+c)+(a+b)\times 0 \\ &=& 7 \end{matrix} \)

 

\( \begin{matrix} \frac{\partial f}{\partial y} &=& x+\frac{\partial x}{\partial y}y \\ &=& (a+b)+0 \times (b+c) \\ &=& 2 \end{matrix} \)

 

\( \begin{matrix} \frac{\partial x}{\partial a} &=& 1+\frac{\partial b}{\partial a} \\ &=& 1 \end{matrix} \)

 

\( \begin{matrix} \frac{\partial y}{\partial c} &=& \frac{\partial b}{\partial c}+1 \\ &=& 1 \end{matrix} \)

 

\( \begin{matrix} \frac{\partial f}{\partial a} &=& \frac{\partial f}{\partial x} \times \frac{\partial x}{\partial a} \\ &=& y \times 1 \\ &=& 7 \times 1 = 7 \end{matrix} \)

 

\( \begin{matrix} \frac{\partial f}{\partial b} &=& \frac{\partial x}{\partial b}y+x\frac{\partial y}{\partial b} \\ &=& 1 \times 7+2 \times 1 = 9 \end{matrix} \)

 

  - Backpropagation through addition and multiplication layers

  • From the example above, we can see the following:
    1. When \( z=x+y \),
      \( \frac {\partial z}{\partial x}=1, \frac {\partial z}{\partial y}=1 \)
    2. When \( t = xy \),
      \( \frac {\partial t}{\partial x}=y, \frac {\partial t}{\partial y}=x \)
# Multiplication operation
class Mul():

    def __init__(self):
        self.x = None
        self.y = None

    def forward(self, x, y):
        self.x = x
        self.y = y
        result = x*y
        return result
    
    def backward(self, dresult):
        dx = dresult * self.y
        dy = dresult * self.x
        return dx, dy

# Addition operation
class Add():
    
    def __init__(self):
        self.x = None
        self.y = None

    def forward(self, x, y):
        self.x = x
        self.y = y
        result = x + y
        return result
    
    def backward(self, dresult):
        dx = dresult * 1
        dy = dresult * 1
        return dx, dy

a, b, c = -1, 3, 4
x = Add()
y = Add()
f = Mul()
# forward
x_result = x.forward(a, b)
y_result = y.forward(b, c)

print(x_result)
print(y_result)
print(f.forward(x_result, y_result))

# Output
2
7
14
# backward
dresult = 1
dx_mul, dy_mul = f.backward(dresult)

da_add, db_add_1 = x.backward(dx_mul)
db_add_2, dc_add = y.backward(dy_mul)

print(dx_mul, dy_mul)
print(da_add)
print(db_add_1 + db_add_2)
print(dc_add)

# Output
7 2
7
9
2

https://medium.com/spidernitt/breaking-down-neural-networks-an-intuitive-approach-to-backpropagation-3b2ff958794c

 

3. Backpropagation in Activation Functions

  - Sigmoid function

https://www.geeksforgeeks.org/implement-sigmoid-function-using-numpy/

  • Formula
    When \( y=\frac{1}{1+e^{-x}} \),
    \( \begin{matrix} y' &=& \left ( \frac{1}{1+e^{-x}} \right )' \\
    &=& \frac{-1}{(1+e^{-x})^{2}} \times (-e^{-x}) \\
    &=& \frac{1}{1+e^{-x}} \times \frac{e^{-x}}{1+e^{-x}} \\
    &=& \frac{1}{1+e^{-x}} \times \left ( 1-\frac{1}{1+e^{-x}} \right ) \\
    &=& y(1-y) \end{matrix} \)
class Sigmoid:
    def __init__(self):
        self.out = None

    def forward(self, x):
        out = 1 / (1 + np.exp(-x))
        self.out = out                           # keep the output for the backward pass
        return out
    
    def backward(self, dout):
        dx = dout * (1.0 - self.out) * self.out  # y' = y(1 - y)
        return dx
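A quick finite-difference check (an added illustration, assuming NumPy is imported as np and using the corrected Sigmoid class above) confirms that the y(1 - y) form used in backward matches the derivative of the forward pass:

import numpy as np

x = np.array([-2.0, 0.0, 1.5])
sig = Sigmoid()
y = sig.forward(x)

analytic = sig.backward(np.ones_like(x))   # dout = 1
eps = 1e-6
numeric = (1 / (1 + np.exp(-(x + eps))) - 1 / (1 + np.exp(-(x - eps)))) / (2 * eps)

print(np.allclose(analytic, numeric))      # True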

 

   - ReLU function

https://machinelearningmastery.com/rectified-linear-activation-function-for-deep-learning-neural-networks/

  • Formula
    \( y=\left \{ \begin{matrix} x & (x \geq 0) \\ 0 & (x<0) \end{matrix} \right. \)
class ReLU():
    def __init__(self):
        self.mask = None
    
    def forward(self, x):
        self.mask = (x < 0)
        out = x.copy()
        # mask the negative inputs (set them all to 0)
        out[self.mask] = 0
        return out
    
    def backward(self, dout):
        dout[self.mask] = 0
        dx = dout
        return dx
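A small usage example (added for illustration, assuming the ReLU class above and NumPy as np) shows how the stored mask zeroes the gradient wherever the input was negative:

import numpy as np

x = np.array([-1.0, 2.0, -3.0, 4.0])
relu = ReLU()

print(relu.forward(x))                   # [0. 2. 0. 4.]
print(relu.backward(np.ones_like(x)))    # [0. 1. 0. 1.]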

 

4. Backpropagation for Matrix Operations

$$ Y=X \bullet W+B $$

  - Forward pass

  • The shapes must be aligned
  • This combines the multiplication and addition layers
# When the shapes match
import numpy as np

X = np.random.rand(3)
W = np.random.rand(3, 2)
B = np.random.rand(2)

print(X.shape)
print(W.shape)
print(B.shape)

# Output
(3,)
(3, 2)
(2,)

Y = np.dot(X, W) + B
print(Y.shape)

# Output
(2,)
# When the shapes do not match
import numpy as np

X = np.random.rand(3)
W = np.random.rand(2, 2)
B = np.random.rand(2)

Y = np.dot(X, W) + B
print(Y.shape)

# Output
ValueError: shapes (3,) and (2,2) not aligned: 3 (dim 0) != 2 (dim 0)

 

  - Backward pass (1)

$$ Y=X \bullet W $$

  • \( X \): (2, )
  • \( W \): (2, 3)
  • \( X \bullet W \): (3, )
  • \( \frac{\partial L}{\partial Y} \): (3, )
  • \( \frac{\partial L}{\partial X}=\frac{\partial L}{\partial Y} \bullet W^{T}, (2, ) \)
  • \( \frac{\partial L}{\partial W}=X^{T} \bullet \frac{\partial L}{\partial Y}, (2, 3) \)
# Forward pass
X = np.random.rand(2)
W = np.random.rand(2, 3)
Y = np.dot(X, W)

print("X\n{}".format(X))
print("W\n{}".format(W))
print("Y\n{}".format(Y))

# Output
X
[0.82112094 0.52401537]
W
[[0.98913291 0.3114957  0.74020997]
 [0.0272213  0.29891712 0.30511339]]
Y
[0.82646212 0.41241281 0.76768601]
# Backward pass
dL_dY = np.random.randn(3)
dL_dX = np.dot(dL_dY, W.T)
dL_dW = np.dot(X.reshape(-1, 1), dL_dY.reshape(1, -1))

print("dL_dY\n{}".format(dL_dY))
print("dL_dX\n{}".format(dL_dX))
print("dL_dW\n{}".format(dL_dW))

# Output
dL_dY
[ 2.14017912 -1.88100173 -0.33160328]
dL_dX
[ 1.28554159 -0.60518177]
dL_dW
[[ 1.75734588 -1.5445299  -0.2722864 ]
 [ 1.12148676 -0.98567383 -0.17376522]]

 

  - Backward pass (2)

$$ Y=X \bullet W+B $$

  • X and W are the same as above
  • B: (3, )
  • \( \frac{\partial L}{\partial B}=\frac{\partial L}{\partial Y}, (3, ) \)
# Forward pass
X = np.random.randn(2)
W = np.random.randn(2, 3)
B = np.random.randn(3)
Y = np.dot(X, W) + B
print(Y)

# Output
[1.32055282 0.71833361 1.73777915]
# Backward pass
dL_dY = np.random.randn(3)
dL_dX = np.dot(dL_dY, W.T)
dL_dW = np.dot(X.reshape(-1, 1), dL_dY.reshape(1, -1))
dL_dB = dL_dY

print("dL_dY\n{}".format(dL_dY))
print("dL_dX\n{}".format(dL_dX))
print("dL_dW\n{}".format(dL_dW))
print("dL_dB\n{}".format(dL_dB))

# Output
dL_dY
[-0.00997423  0.34937897  1.55598133]
dL_dX
[ 0.9368195  -0.10629718]
dL_dW
[[ 0.00182182 -0.06381513 -0.28420473]
 [ 0.00772224 -0.2704957  -1.20466967]]
dL_dB
[-0.00997423  0.34937897  1.55598133]

 

   - Batch matrix product layer

  • For N data samples,

$$ Y=X \bullet W+B $$

    • \( X \): (N, 3)
    • \( W \): (3, 2)
    • \( B \): (2, )
X = np.random.rand(4, 3)
W = np.random.rand(3, 2)
B = np.random.rand(2)

print(X.shape)
print(W.shape)
print(B.shape)

# Output
(4, 3)
(3, 2)
(2,)

print("X\n{}".format(X))
print("W\n{}".format(W))
print("B\n{}".format(B))

# Output
X
[[0.5345643  0.82120127 0.38182761]
 [0.07479261 0.99042377 0.50473867]
 [0.47142528 0.72266964 0.44472929]
 [0.16390528 0.94442809 0.78815273]]
W
[[0.90326978 0.75695534]
 [0.24771738 0.05041714]
 [0.5838499  0.60451043]]
B
[0.90558133 0.19752999]
# Forward pass
Y = np.dot(X, W) + B
print("Y\n{}".format(Y))
print("Y.shape:", Y.shape)

# Output
Y
[[1.81479294 0.87439269]
 [1.51317603 0.60919879]
 [1.77007852 0.85965631]
 [1.74774615 0.84566088]]
Y.shape: (4, 2)
# Backward pass
dL_dY = np.random.randn(4, 2)
dL_dX = np.dot(dL_dY, W.T)
dL_dW = np.dot(X.T, dL_dY)
dL_dB = np.sum(dL_dY, axis = 0)

print("dL_dY\n{}".format(dL_dY))
print("dL_dX\n{}".format(dL_dX))
print("dL_dW\n{}".format(dL_dW))
print("dL_dB\n{}".format(dL_dB))

# Output
dL_dY
[[-0.70142117  1.06162232]
 [-0.114932    0.16975345]
 [-1.51024593  0.19728549]
 [ 1.93432977  0.58605845]]
dL_dX
[[ 0.17002814 -0.12023025  0.23223709]
 [ 0.02468118 -0.01991217  0.0355147 ]
 [-1.2148232  -0.36416759 -0.76249579]
 [ 2.1908417   0.50871449  1.48363669]]
dL_dW
[[-0.77847203  0.76926514]
 [ 0.04558715  1.73599575]
 [ 0.52706409  1.04068005]]
dL_dB
[-0.39226933  2.01471971]

 

  • Creating an example Layer
class Layer():
    def __init__(self):
        self.W = np.random.randn(3, 2)
        self.b = np.random.randn(2)
        self.x = None
        self.dW = None
        self.db = None

    def forward(self, x):
        self.x = x
        out = np.dot(x, self.W) + self.b
        return out
    
    def backward(self, dout):
        dx = np.dot(dout, self.W.T)
        self.dW = np.dot(self.x.T, dout)
        self.db = np.sum(dout, axis = 0)
        return dx

np.random.seed(111)
layer = Layer()

# Forward pass
X = np.random.rand(2, 3)
Y = layer.forward(X)

print(X)

# Output
[[0.23868214 0.33765619 0.99071246]
 [0.23772645 0.08119266 0.66960024]]
 
 
# Backward pass
dout = np.random.rand(2, 2)
dout_dx = layer.backward(dout)

print(dout_dx)

# Output
[[-0.57717814  0.8894841  -1.01146255]
 [-0.5434705   0.86783399 -1.09728643]]

 

5. MNIST Classification with Backpropagation

  • Module Import
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
from collections import OrderedDict
  • Load the data
np.random.seed(42)

mnist = tf.keras.datasets.mnist

(X_train, y_train), (X_test, y_test) = mnist.load_data()

num_classes = 10
  • Data preprocessing
X_train, X_test = X_train.reshape(-1, 28 * 28).astype(np.float32), X_test.reshape(-1, 28 * 28).astype(np.float32)

# Normalize pixel values to the [0, 1] range
X_train /= 255.
X_test /= 255.

# Convert labels to one-hot vectors
y_train = np.eye(num_classes)[y_train]

# Check the shapes
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

# Output
(60000, 784)
(60000, 10)
(10000, 784)
(10000,)
  • Hyperparameters
epochs = 1000
learning_rate = 1e-3
batch_size = 100
train_size = X_train.shape[0]
  • Util Functions
def softmax(x):
    if x.ndim == 2:
        x = x.T
        x = x - np.max(x, axis = 0)
        y = np.exp(x) / np.sum(np.exp(x), axis = 0)
        return y.T
    
    # Prevent overflow by subtracting the maximum
    x = x - np.max(x)
    return np.exp(x) / np.sum(np.exp(x))

def mean_squared_error(pred_y, true_y):
    return 0.5 * np.sum((pred_y - true_y)**2)

def cross_entropy_error(pred_y, true_y):
    if pred_y.ndim == 1:
        true_y = true_y.reshape(1, true_y.size)
        pred_y = pred_y.reshape(1, pred_y.size)

    # If the labels are one-hot vectors, convert them to class indices
    if true_y.size == pred_y.size:
        true_y = true_y.argmax(axis = 1)
    
    batch_size = pred_y.shape[0]
    return -np.sum(np.log(pred_y[np.arange(batch_size), true_y] + 1e-7)) / batch_size

def softmax_loss(X, true_y):
    # Compare the softmax output with the true one-hot labels and return the cross-entropy error
    pred_y = softmax(X)
    return cross_entropy_error(pred_y, true_y)
  • Util Classes

  - ReLU

class ReLU():
    def __init__(self):
        self.mask = None
    
    def forward(self, x):
        self.mask = (x < 0)
        out = x.copy()
        out[self.mask] = 0
        return out
    
    def backward(self, dout):
        dout[self.mask] = 0
        dx = dout
        return dx

  - Sigmoid

class Sigmoid():
    def __init__(self):
        self.out = None

    def forward(self, x):
        out = 1 / (1 + np.exp(-x))
        self.out = out
        return out
    
    def backward(self, dout):
        dx = dout * (1.0 - self.out) * self.out
        return dx

  - Layer

class Layer():
    def __init__(self, W, b):
        self.W = W
        self.b = b

        self.x = None
        self.origin_x_shape = None

        self.dL_dW = None
        self.dL_db = None
    
    def forward(self, x):
        self.origin_x_shape = x.shape

        x = x.reshape(x.shape[0], -1)
        self.x = x
        out = np.dot(self.x, self.W) + self.b

        return out
    
    def backward(self, dout):
        dx = np.dot(dout, self.W.T)
        self.dL_dW = np.dot(self.x.T, dout)
        self.dL_db = np.sum(dout, axis = 0)
        dx = dx.reshape(*self.origin_x_shape)
        return dx

  - SoftMax

class Softmax():
    def __init__(self):
        self.loss = None
        self.y = None
        self.x = None
    
    def forward(self, x, t):
        self.t = t
        self.y = softmax(x)
        self.loss = cross_entropy_error(self.y, self.t)

        return self.loss
    
    def backward(self, dout = 1):
        batch_size = self.t.shape[0]

        # When the labels are one-hot encoded
        if self.t.size == self.y.size:
            dx = (self.y - self.t) / batch_size
        else:
            dx = self.y.copy()
            dx[np.arange(batch_size), self.t] -= 1
            dx = dx / batch_size
        
        return dx
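For reference (a standard result added here, not stated in the original code), the expression \( (y - t)/\text{batch size} \) in backward comes from differentiating the combined softmax + cross-entropy loss: for a one-hot target \( t \),

$$ \frac{\partial L}{\partial x_{k}} = y_{k} - t_{k} $$

so backward simply subtracts the labels from the softmax output and divides by the batch size to average the gradient over the mini-batch; the else branch does the same when the labels are given as class indices rather than one-hot vectors.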

 

  • Model creation and training
class MyModel():
    def __init__(self, input_size, hidden_size_list, output_size, activation = 'relu'):
        self.input_size = input_size
        self.output_size = output_size
        self.hidden_size_list = hidden_size_list
        self.hidden_layer_num = len(hidden_size_list)
        self.params = {}

        self.__init_weights(activation)

        activation_layer = {'sigmoid': Sigmoid, 'relu': ReLU}
        self.layers = OrderedDict()
        for idx in range(1, self.hidden_layer_num + 1):
            self.layers['Layer' + str(idx)] = Layer(self.params['W' + str(idx)], self.params['b' + str(idx)])
            self.layers['Activation_function' + str(idx)] = activation_layer[activation]()
        
        idx = self.hidden_layer_num + 1

        self.layers['Layer' + str(idx)] = Layer(self.params['W' + str(idx)], self.params['b' + str(idx)])

        self.last_layer = Softmax()

    
    def __init_weights(self, activation):
        weight_std = None
        # Sizes of all layers: input, hidden layers, output
        all_size_list = [self.input_size] + self.hidden_size_list + [self.output_size]
        for idx in range(1, len(all_size_list)):
            if activation.lower() == 'relu':
                weight_std = np.sqrt(2.0 / self.input_size)
            elif activation.lower() =='sigmoid':
                weight_std = np.sqrt(1.0 / self.input_size)
            
            self.params['W' + str(idx)] = weight_std * np.random.randn(all_size_list[idx-1], all_size_list[idx])
            self.params['b' + str(idx)] = np.random.randn(all_size_list[idx])

    def predict(self, x):
        for layer in self.layers.values():
            x = layer.forward(x)
        return x
    
    def loss(self, x, true_y):
        pred_y = self.predict(x)

        return self.last_layer.forward(pred_y, true_y)
    
    def accuracy(self, x, true_y):
        pred_y = self.predict(x)
        pred_y = np.argmax(pred_y, axis = 1)

        if true_y.ndim != 1:
            true_y = np.argmax(true_y, axis = 1)
        
        accuracy = np.sum(pred_y == true_y) / float(x.shape[0])
        return accuracy
    
    def gradient(self, x, t):
        self.loss(x, t)

        dout = 1
        dout = self.last_layer.backward(dout)

        layers = list(self.layers.values())
        layers.reverse()
        for layer in layers:
            dout = layer.backward(dout)
        
        grads = {}
        for idx in range(1, self.hidden_layer_num + 2):
            grads['W' + str(idx)] = self.layers['Layer' + str(idx)].dL_dW
            grads['b' + str(idx)] = self.layers['Layer' + str(idx)].dL_db
        return grads

model = MyModel(28*28, [100, 64, 32], 10, activation = 'relu')
# Lists to store the loss and accuracy values
train_lost_list = []
train_acc_list = []
test_acc_list = []
for epoch in range(epochs):
    batch_mask = np.random.choice(train_size, batch_size)
    x_batch = X_train[batch_mask]
    y_batch = y_train[batch_mask]

    grad = model.gradient(x_batch, y_batch)

    for key in model.params.keys():
        model.params[key] -= learning_rate * grad[key]
    
    loss = model.loss(x_batch, y_batch)
    train_lost_list.append(loss)

    if epoch % 50 == 0:
        train_acc = model.accuracy(X_train, y_train)
        test_acc = model.accuracy(X_test, y_test)
        train_acc_list.append(train_acc)
        test_acc_list.append(test_acc)
        print("Epoch: {}, Train Accuracy: {:.4f}, Test Accuracy: {:.4f}".format(epoch, train_acc, test_acc))

# Accuracy visualization
plt.plot(np.arange(1000//50), train_acc_list, 'r--', label = 'train_acc')
plt.plot(np.arange(1000//50), test_acc_list, 'b', label = 'test_acc')

plt.title('Result')
plt.legend(loc = 5)
plt.grid()
plt.show()

 

# Loss visualization
plt.plot(np.arange(1000), train_lost_list, 'green', label = 'train_loss')
plt.title('train loss')
plt.xlabel('Epochs')
plt.legend(loc = 5)
plt.grid()
plt.show()
