Backpropagation

Overview of Steps for Neural Networks

  1. Obtain some response data $y$ and predictor data $x$
  2. Initialize parameter values $\theta$
  3. Propagate forwards to get predictions $\hat{y}$
  4. Compute the cost function $J(\theta)$ to get the error
  5. Propagate backwards to get the gradient $\frac{\partial J(\theta)}{\partial \theta}$
  6. Use gradient descent to repeatedly update the parameters $\theta$ until $J(\theta)$ is minimized

    • This refers to repeating steps 3-5, as sketched below
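
A rough way to see how these steps fit together is the minimal training-loop sketch below; the helper names (forward_fn, cost_fn, backward_fn) and the learning rate are hypothetical placeholders, not definitions from these notes.

# A minimal sketch of the training loop (repeating steps 3-5),
# parameterized by hypothetical forward, cost, and backward helpers
def train(x, y, theta, forward_fn, cost_fn, backward_fn,
          learning_rate=0.1, num_iters=1000):
  for i in range(num_iters):
    y_hat = forward_fn(x, theta)      # step 3: forward propagation
    error = cost_fn(y, y_hat)         # step 4: compute the cost J(theta)
    grads = backward_fn(x, y, theta)  # step 5: backward propagation
    for key in theta:                 # step 6: gradient descent update
      theta[key] = theta[key] - learning_rate * grads[key]
  return theta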

Defining Notation

  • The output of the activation function of the $j^{th}$ neuron in the $l^{th}$ layer is directly related to the outputs of the activation functions in the $(l-1)^{th}$ layer
  • We'll refer to the output of an activation function $f$ as:
$$a^{l} = f(w^{l}a^{l-1} + b^{l})$$
  • In other words, the output of an activation function is based on the output of the activation function of the previous layer, the weights of the current layer, and the bias of the current layer
  • We can also refer to the weighted input of a neuron in the $l^{th}$ layer as the following:
$$z^{l} \equiv w^{l}a^{l-1} + b^{l}$$
  • In other words, the $z^{l}$ term refers to the input going into the activation function of the $j^{th}$ neuron in the $l^{th}$ layer, as the short numerical example below shows
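
As a small, concrete illustration of this notation (the numbers and the sigmoid activation are made up for the example):

import numpy as np

# Made-up example of z^l = w^l a^(l-1) + b^l and a^l = f(z^l),
# using a sigmoid activation for f
a_prev = np.array([[0.5], [0.1]])  # a^(l-1): activations from the previous layer
W = np.array([[0.2, -0.4]])        # w^l: weights of the current layer
b = np.array([[0.05]])             # b^l: bias of the current layer

z = np.dot(W, a_prev) + b          # weighted input z^l
a = 1 / (1 + np.exp(-z))           # activation a^l = sigmoid(z^l)
print(z, a)                        # z^l = 0.11, a^l is roughly 0.527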

Describing Forward Propagation

  • The goal of forward propagation is to calculate layer-by-layer neuron activations $a$ until we get to the output $\hat{y}$
  • Specifically, forward propagation is a layer-level function that receives input $a^{l-1}$ and outputs $a^{l}$
  • We can repeatedly forward propagate to get the output $\hat{y}$ and compare it with the real value $y$ to get the error
  • In other words, we can determine how well our neural network is behaving by calculating the error, which is found from the prediction $\hat{y}$ that forward propagation produces (see the sketch after this list)
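
A minimal sketch of that layer-by-layer process, assuming sigmoid activations in every layer and a cross-entropy error (neither of which is fixed by the notes above); the function names are illustrative only:

import numpy as np

sigmoid = lambda x: 1 / (1 + np.exp(-x))

def forward_propagate(x, weights, biases):
  # Repeatedly apply a^l = f(W^l a^(l-1) + b^l), starting from a^0 = x
  a = x
  for W, b in zip(weights, biases):
    z = np.dot(W, a) + b
    a = sigmoid(z)
  return a  # the final activation is the prediction y_hat

def cross_entropy_error(y, y_hat):
  # Compare the prediction y_hat with the real value y to get the error
  return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))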

Describing Backward Propagation

  • The goal of backward propagation is to estimate the parameters $w$ and $b$ by computing the partial derivatives $\frac{\partial J(w,b)}{\partial w}$ and $\frac{\partial J(w,b)}{\partial b}$ of a cost function $J(w,b)$
  • Specifically, backward propagation is a layer-level function that receives input $\frac{\partial J(w,b)}{\partial a^{l}}$ and outputs $\frac{\partial J(w,b)}{\partial a^{l-1}}$, $\frac{\partial J(w,b)}{\partial W^{l}}$, and $\frac{\partial J(w,b)}{\partial b^{l}}$
  • We can repeatedly backward propagate to find the partial derivative of the cost function $J(w,b)$ with respect to each weight and bias
  • Then, we'll use gradient descent to update $w$ and $b$ by evaluating those partial derivatives from our backward propagation step (see the update sketch after this list)
  • In summary, we propagate forwards to see how well our neural network is behaving and to find the error
  • After we find the error of our network, we propagate backwards and use a form of gradient descent to compute new values of the weights and biases
  • Then, we propagate forwards again to see how well those weights are performing, and then propagate backwards and use gradient descent again to update the weights
  • This continues until we reach some minimum of the error value
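
A minimal sketch of that update step, assuming the per-layer gradients dWs and dbs have already been produced by backward propagation (the learning rate alpha and all variable names are illustrative):

def gradient_descent_step(weights, biases, dWs, dbs, alpha=0.1):
  # Update each layer's parameters using the gradients from backpropagation:
  # w := w - alpha * dJ/dw and b := b - alpha * dJ/db
  for l in range(len(weights)):
    weights[l] = weights[l] - alpha * dWs[l]
    biases[l] = biases[l] - alpha * dbs[l]
  return weights, biases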

Summarizing the Steps of a Neural Network

  • The goal of forward propagation is to understand how changes in the weights and biases lead to changes in the predictions
  • The goal of backpropagation is to understand how changes in the weights and biases lead to changes in the error (i.e. cost function)
  • The goal of gradient descent is to minimize the error by updating our weights and biases

Illustrating Forward and Backward Propagation

[Figure: illustration of forward and backward propagation, where blue boxes represent functions, yellow boxes represent the inputs of backward and forward propagation, green boxes represent the outputs of the inner activation and chain rule functions, and red boxes represent the outputs of backward and forward propagation]
  • Where dashed lines indicate cached $z^{l}$ values
  • Where $da^{i}$ represents the initialized partial derivative $\frac{\partial J(w,b)}{\partial a^{i}}$
  • Since we're using logistic regression (see the numerical check after this list):
$$da^{l} = \frac{\partial J(w,b)}{\partial a^{l}} = - \frac{y}{a} + \frac{1-y}{1-a}$$
  • Where $da^{i}$ represents the following:
$$da^{i} = \frac{\partial J(w,b)}{\partial a^{i}} = - \frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}}$$
  • Where $da^{1}$ represents the following:
$$da^{1} = \frac{\partial J(w,b)}{\partial a^{1}} = \frac{\partial J(w,b)}{\partial a^{i}} \frac{\partial a^{i}}{\partial z^{3}} \frac{\partial z^{3}}{\partial a^{2}} \frac{\partial a^{2}}{\partial z^{2}} \frac{\partial z^{2}}{\partial a^{1}}$$
  • Although $da^{0}$ is not really used, $da^{0}$ represents the following:
$$da^{0} = \frac{\partial J(w,b)}{\partial a^{0}} = \frac{\partial J(w,b)}{\partial a^{i}} \frac{\partial a^{i}}{\partial z^{3}} \frac{\partial z^{3}}{\partial a^{2}} \frac{\partial a^{2}}{\partial z^{2}} \frac{\partial z^{2}}{\partial a^{1}} \frac{\partial a^{1}}{\partial z^{1}} \frac{\partial z^{1}}{\partial a^{0}}$$
  • Where $dw^{1}$ represents the following:
$$dw^{1} = \frac{\partial J(w,b)}{\partial w^{1}} = \frac{\partial J(w,b)}{\partial a^{i}} \frac{\partial a^{i}}{\partial z^{3}} \frac{\partial z^{3}}{\partial a^{2}} \frac{\partial a^{2}}{\partial z^{2}} \frac{\partial z^{2}}{\partial a^{1}} \frac{\partial a^{1}}{\partial z^{1}} \frac{\partial z^{1}}{\partial w^{1}}$$
  • Where $db^{1}$ represents the following:
$$db^{1} = \frac{\partial J(w,b)}{\partial b^{1}} = \frac{\partial J(w,b)}{\partial a^{i}} \frac{\partial a^{i}}{\partial z^{3}} \frac{\partial z^{3}}{\partial a^{2}} \frac{\partial a^{2}}{\partial z^{2}} \frac{\partial z^{2}}{\partial a^{1}} \frac{\partial a^{1}}{\partial z^{1}} \frac{\partial z^{1}}{\partial b^{1}}$$
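
As a quick numerical check of the logistic-regression derivative above (with made-up values for $y$ and $a$), multiplying $da$ by the sigmoid derivative $a(1-a)$ collapses to the familiar $a - y$ used in the code further down:

y = 1.0  # true label
a = 0.8  # made-up output activation (y_hat)

da = -y / a + (1 - y) / (1 - a)  # dJ/da for the logistic (cross-entropy) cost
dz = da * a * (1 - a)            # chain rule through the sigmoid: da/dz = a(1 - a)
print(dz, a - y)                 # both print -0.2, i.e. dz simplifies to a - y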

Computing Components of Forward Propagation

  • In forward propagation, our input is $a^{l-1}$ and our output is:
$$a^{l}$$
  • Along the way, we cache $z^{l}$ for use during backward propagation
  • These components look like the following (a layer-level sketch follows this list):
$$z^{l} = W^{l}a^{l-1} + b^{l}$$
$$a^{l} = f(z^{l})$$
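
A minimal layer-level sketch of these forward components, assuming a sigmoid activation $f$ (the function name and the cache format are just for illustration):

import numpy as np

def layer_forward(a_prev, W, b):
  # z^l = W^l a^(l-1) + b^l
  z = np.dot(W, a_prev) + b
  # a^l = f(z^l), with f = sigmoid here
  a = 1 / (1 + np.exp(-z))
  # Cache z^l (and a^(l-1)) for use during backward propagation
  cache = {'z': z, 'a_prev': a_prev}
  return a, cache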

Computing Components of Backward Propagation

  • In backward propagation, our input is:
$$\frac{\partial J(w,b)}{\partial a^{l}}$$
  • Our output is:
$$\frac{\partial J(w,b)}{\partial a^{l-1}}, \quad \frac{\partial J(w,b)}{\partial W^{l}}, \quad \frac{\partial J(w,b)}{\partial b^{l}}$$
  • These components look like the following (a layer-level sketch follows this list):
$$\frac{\partial J(w,b)}{\partial z^{l}} = \frac{\partial J(w,b)}{\partial a^{l}} \times \frac{\partial a^{l}}{\partial z^{l}}$$
$$\frac{\partial J(w,b)}{\partial W^{l}} = \frac{\partial J(w,b)}{\partial z^{l}} \times a^{l-1}$$
$$\frac{\partial J(w,b)}{\partial b^{l}} = \frac{\partial J(w,b)}{\partial z^{l}}$$
$$\frac{\partial J(w,b)}{\partial a^{l-1}} = W^{l} \times \frac{\partial J(w,b)}{\partial z^{l}}$$
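
A matching layer-level sketch of these backward components, assuming the sigmoid activation from the forward sketch (so $\frac{\partial a^{l}}{\partial z^{l}} = a^{l}(1-a^{l})$); note that in matrix form the weight and activation terms pick up transposes, which is also how the code below handles them:

import numpy as np

def layer_backward(da, cache, W):
  # Unpack the values cached by the forward pass
  z, a_prev = cache['z'], cache['a_prev']
  a = 1 / (1 + np.exp(-z))
  # dJ/dz^l = dJ/da^l * da^l/dz^l, with da^l/dz^l = a^l (1 - a^l) for sigmoid
  dz = da * a * (1 - a)
  # dJ/dW^l = dJ/dz^l * a^(l-1)  (a^(l-1) is transposed in matrix form)
  dW = np.dot(dz, a_prev.T)
  # dJ/db^l = dJ/dz^l
  db = dz
  # dJ/da^(l-1) = W^l * dJ/dz^l  (W^l is transposed in matrix form)
  da_prev = np.dot(W.T, dz)
  return da_prev, dW, db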

Coding Backward Propagation

# Binary classification with a 2-layer
# neural network (single hidden layer)

import numpy as np
from copy import copy

sigmoid = lambda x: 1 / (1 + np.exp(-x))

def fprop(x, y, params):
  # Follows procedure given in notes
  W1, b1, W2, b2 = [params[key] for key
                    in ('W1', 'b1', 'W2', 'b2')]
  z1 = np.dot(W1, x) + b1
  a1 = sigmoid(z1)
  z2 = np.dot(W2, a1) + b2
  a2 = sigmoid(z2)
  loss = -(y * np.log(a2) + (1-y) * np.log(1-a2))
  ret = {
    'x': x, 'y': y, 'z1': z1, 'a1': a1, 
    'z2': z2, 'a2': a2, 'loss': loss
    }
  for key in params:
    ret[key] = params[key]
  return ret

def bprop(fprop_cache):
  # Follows procedure given in notes
  x, y, z1, a1, z2, a2, loss = [fprop_cache[key] for key
            in ('x', 'y', 'z1', 'a1', 'z2', 'a2', 'loss')]
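  # For a sigmoid output with cross-entropy loss,
  # dJ/dz2 simplifies to (a2 - y)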
  dz2 = (a2 - y)
  dW2 = np.dot(dz2, a1.T)
  db2 = dz2
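  # Chain rule back through the hidden layer:
  # dz1 = (W2^T dz2) * sigmoid'(z1), where sigmoid'(z1) = sigmoid(z1)(1 - sigmoid(z1))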
  dz1 = np.dot(fprop_cache['W2'].T, dz2)
  dz1 = dz1 * sigmoid(z1) * (1-sigmoid(z1))
  dW1 = np.dot(dz1, x.T)
  db1 = dz1
  return {'b1': db1, 'W1': dW1, 'b2': db2, 'W2': dW2}

# Gradient checking

if __name__ == '__main__':
  # Initialize random parameters and inputs
  W1 = np.random.rand(2,2)
  b1 = np.random.rand(2, 1)
  W2 = np.random.rand(1, 2)
  b2 = np.random.rand(1, 1)
  params = {'W1': W1, 'b1': b1, 'W2': W2, 'b2': b2}
  x = np.random.rand(2, 1)
  y = np.random.randint(0, 2)  # Returns 0/1

  fprop_cache = fprop(x, y, params)
  bprop_cache = bprop(fprop_cache)

  # Numerical gradient checking
  # Note how slow this is!
  # Thus we want to use
  # the backpropagation algorithm instead.
  eps = 1e-6
  ng_cache = {}
  # For every single parameter (W, b)
  for key in params:
    param = params[key]
    # This will be our numerical gradient
    ng = np.zeros(param.shape)
    for j in range(ng.shape[0]):
      for k in range(ng.shape[1]):
        # For every element of parameter matrix,
        # compute gradient of loss wrt that 
        # element numerically using finite differences
        add_eps = np.copy(param)
        min_eps = np.copy(param)
        add_eps[j, k] += eps
        min_eps[j, k] -= eps
        add_params = copy(params)
        min_params = copy(params)
        add_params[key] = add_eps
        min_params[key] = min_eps
        fprop_new = fprop(x, y, add_params)['loss']
        fprop_min = fprop(x, y, min_params)['loss']
        ng[j, k] = (fprop_new - fprop_min) / (2 * eps)
    ng_cache[key] = ng

  # Compare numerical gradients to those
  # computed using backpropagation algorithm
  for key in params:
    print(key)
    # These should be the same
    print(bprop_cache[key])
    print(ng_cache[key])

tldr

  • Forward propagation is a layer-level function that receives input $a^{l-1}$ and outputs $a^{l}$
  • Backward propagation is a layer-level function that receives input $\frac{\partial J(w,b)}{\partial a^{l}}$ and outputs $\frac{\partial J(w,b)}{\partial a^{l-1}}$, $\frac{\partial J(w,b)}{\partial W^{l}}$, and $\frac{\partial J(w,b)}{\partial b^{l}}$
  • The goal of forward propagation is to understand how changes in the weights and biases lead to changes in the predictions
  • The goal of backpropagation is to understand how changes in the weights and biases lead to changes in the error (i.e. cost function)
  • The goal of gradient descent is to minimize the error by updating our weights and biases based on those changes
