Motivating Effective Weight Initialization
- As stated previously, the choice of activation function can help reduce the problem with vanishing gradients
- However, the choice of activation function can only go so far (and a particular activation function is sometimes necessary for other reasons)
- Another approach for reducing the problem with vanishing gradients involves weight initialization
Importance of Effective Initialization
- The goal of model training is to learn parameters that accurately predict new data points
- The common training process for neural networks is the following (a minimal sketch appears after this list):
- Initialize the parameters
- Choose an optimization algorithm (e.g. gradient descent)
- Repeat these steps:
- Forward propagate an input
- Compute the cost function
- Compute the gradients of the cost with respect to parameters using backpropagation
- Update each parameter using the gradients, according to the optimization algorithm
- The initialization step is critical to the performance of the model
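- As a rough illustration of this loop, here is a minimal NumPy sketch; the two-layer tanh/sigmoid architecture, the toy data, and the learning rate are all arbitrary choices made for this example, not part of the notes

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (made up for illustration): 2 input features, 100 examples,
# binary label indicating whether the two features sum to a positive value.
X = rng.normal(size=(2, 100))
Y = (X[0:1] + X[1:2] > 0).astype(float)

# Step 1: initialize the parameters (small random weights, zero biases).
W1 = rng.normal(scale=0.1, size=(3, 2)); b1 = np.zeros((3, 1))
W2 = rng.normal(scale=0.1, size=(1, 3)); b2 = np.zeros((1, 1))

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.5  # Step 2: choose an optimization algorithm (plain gradient descent).

for step in range(500):
    # Forward propagate an input.
    Z1 = W1 @ X + b1; A1 = np.tanh(Z1)
    Z2 = W2 @ A1 + b2; A2 = sigmoid(Z2)

    # Compute the cost function (binary cross-entropy, epsilon for stability).
    cost = -np.mean(Y * np.log(A2 + 1e-12) + (1 - Y) * np.log(1 - A2 + 1e-12))

    # Compute the gradients of the cost w.r.t. the parameters (backpropagation).
    m = X.shape[1]
    dZ2 = A2 - Y
    dW2 = dZ2 @ A1.T / m; db2 = dZ2.mean(axis=1, keepdims=True)
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)
    dW1 = dZ1 @ X.T / m; db1 = dZ1.mean(axis=1, keepdims=True)

    # Update each parameter using the gradients (gradient descent step).
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

    if step % 100 == 0:
        print(step, float(cost))  # the cost should steadily decrease
```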
Initializing Weights with Zeroes
- Initializing all the weights with values too small leads to slow learning
- Initializing all the weights with values too large leads to divergence
- Initializing all the weights with the same value leads the neurons to learn the same features during training
- In fact, any constant initialization scheme will perform poorly
- For example, assigning all the weights to be $0$ or all the weights to be $1$
- Consider a neural network with two hidden units
- Assume we initialize all the biases to $0$ and all the weights to some constant $c$
- If we forward propagate an input $(x_1, x_2)$, then the output of both hidden units will be $g(c x_1 + c x_2)$, where $g$ is the activation function
- Therefore, both hidden units will have identical influence on the cost function
- This will lead to identical gradients
- Thus, both neurons will evolve symmetrically throughout training, effectively preventing different neurons from learning different patterns
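- A small NumPy sketch of this symmetry; the constant $c$, the ReLU activation, the toy batch, and the upstream gradient are illustrative assumptions

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5))                     # a small batch of inputs (illustrative)
relu = lambda z: np.maximum(z, 0)

c = 0.5                                         # any constant behaves the same way
W1 = np.full((2, 2), c); b1 = np.zeros((2, 1))  # constant weights, zero biases
W2 = np.full((1, 2), c); b2 = np.zeros((1, 1))

A1 = relu(W1 @ x + b1)                          # both hidden units compute relu(c*x1 + c*x2)
print(np.allclose(A1[0], A1[1]))                # True: identical activations

# Identical activations imply identical gradients for both rows of W1,
# so the two hidden units stay identical throughout training.
y_hat = W2 @ A1 + b2
dZ2 = y_hat - 1.0                               # illustrative upstream gradient (pretend the target is 1)
dA1 = W2.T @ dZ2
dZ1 = dA1 * (W1 @ x + b1 > 0)
dW1 = dZ1 @ x.T / x.shape[1]
print(np.allclose(dW1[0], dW1[1]))              # True: identical gradient rows
```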
Setup for Cases of Unstable Gradient Problem
- Let's say we have an $L$-layer neural network
- Assume all the activation functions are linear
- Then the output activation is the following: $\hat{y} = a^{[L]} = W^{[L]} W^{[L-1]} \cdots W^{[2]} W^{[1]} x$
- If we assume the weight matrices of the first $L-1$ layers are all the same: $W^{[1]} = W^{[2]} = \cdots = W^{[L-1]} = W$
- Then the output prediction is the following: $\hat{y} = W^{[L]} W^{L-1} x$
- Here, $W^{L-1}$ takes the matrix $W$ to the power of $L-1$, while $W^{[L]}$ just denotes the $L$-th weight matrix
Example: Huge Initialization Causes Exploding Gradients
- If every weight is initialized slightly larger than the identity matrix, e.g. $W^{[1]} = W^{[2]} = \cdots = W^{[L-1]} = \begin{bmatrix} 1.5 & 0 \\ 0 & 1.5 \end{bmatrix}$
- This simplifies to the following: $\hat{y} = W^{[L]} 1.5^{L-1} x$
- The values of the activations $a^{[l]}$ increase exponentially with the layer index $l$
- When these activations are used in backward propagation, this leads to the exploding gradient problem
- In other words, the gradients of the cost with respect to the parameters become too big
- This causes the cost to oscillate around its minimum value
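- A quick NumPy sketch of this growth, assuming an illustrative depth of $L = 10$ and the $1.5$-scaled identity weights from above

```python
import numpy as np

L = 10                                   # network depth (illustrative)
x = np.array([[1.0], [1.0]])
W = 1.5 * np.eye(2)                      # every weight slightly larger than the identity

a = x
for l in range(L - 1):
    a = W @ a                            # linear activations: a^[l] = W a^[l-1]
print(a.ravel())                         # ~ [38.4, 38.4]: grows like 1.5**(L-1)
```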
Example: Tiny Initialization Causes Vanishing Gradients
- If every weight is initialized slightly smaller than the identity matrix, e.g. $W^{[1]} = W^{[2]} = \cdots = W^{[L-1]} = \begin{bmatrix} 0.5 & 0 \\ 0 & 0.5 \end{bmatrix}$
- This simplifies to the following: $\hat{y} = W^{[L]} 0.5^{L-1} x$
- The values of the activations $a^{[l]}$ decrease exponentially with the layer index $l$
- When these activations are used in backward propagation, this leads to the vanishing gradient problem
- In other words, the gradients of the cost with respect to the parameters become too small
- This causes the cost to converge to some value before reaching the minimum
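- The same sketch with the $0.5$-scaled identity weights shows the activations collapsing towards zero

```python
import numpy as np

L = 10                                   # same illustrative depth as above
x = np.array([[1.0], [1.0]])
W = 0.5 * np.eye(2)                      # every weight slightly smaller than the identity

a = x
for l in range(L - 1):
    a = W @ a                            # linear activations: a^[l] = W a^[l-1]
print(a.ravel())                         # ~ [0.002, 0.002]: shrinks like 0.5**(L-1)
```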
Finding Appropriate Initialization Values
- To prevent the gradients of neurons from vanishing or exploding, we want the following to hold:
- The mean of the activations should be $0$
- The variance of the activations should remain constant across each layer
- Under these two assumptions, we should never observe the vanishing or exploding gradient problem
- In other words, the backpropagated gradient should not be multiplied by values too small or too large in any layer
- However, we can't always guarantee that these assumptions hold
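- As an illustrative check (the width, depth, and the deliberately poor weight scale are arbitrary choices), measuring the per-layer activation statistics shows how a badly scaled initialization violates these assumptions

```python
import numpy as np

rng = np.random.default_rng(0)
n, L = 100, 10                               # layer width and depth (illustrative choices)
a = rng.normal(size=(n, 1000))               # normalized inputs: mean ~0, variance ~1

for l in range(L):
    W = rng.normal(scale=0.05, size=(n, n))  # poorly scaled init (too small for n=100)
    a = np.tanh(W @ a)
    print(f"layer {l}: mean={a.mean():+.4f}  var={a.var():.2e}")
# The variance collapses layer by layer (by roughly 4x here) instead of staying
# constant, so deeper activations (and their gradients) head towards zero.
```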
Motivating Xavier Initialization
- Consider a layer $l$, where its forward propagation is the following: $z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$ and $a^{[l]} = g^{[l]}(z^{[l]})$
- We want the following two assumptions to hold: $E[a^{[l-1]}] = E[a^{[l]}] = 0$ and $\text{Var}(a^{[l-1]}) = \text{Var}(a^{[l]})$
- These assumptions are enforced for both forward and backward propagation
- Specifically, these assumptions hold true for both the activations and gradients of the cost function with respect to the activations
- The recommended initialization scheme that enforces these two assumptions is Xavier initialization, applied at every layer $l$
Defining Xavier Initialization
- Xavier initialization is defined as the following: $W^{[l]} \sim \mathcal{N}\!\left(0, \frac{1}{n^{[l-1]}}\right)$ and $b^{[l]} = 0$
- In other words, all the weights of layer $l$ are picked randomly from a normal distribution with a mean of $0$ and a variance of $\frac{1}{n^{[l-1]}}$
- Here, $n^{[l-1]}$ is the number of neurons in layer $l-1$
- Biases are initialized to be $0$
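- A minimal NumPy sketch of this scheme; the layer sizes are arbitrary, and `xavier_init` is just a name chosen here, not a library function

```python
import numpy as np

def xavier_init(layer_sizes, rng=None):
    """Xavier initialization: W^[l] ~ N(0, 1/n^[l-1]) and b^[l] = 0 for every layer l."""
    if rng is None:
        rng = np.random.default_rng(0)
    params = {}
    for l in range(1, len(layer_sizes)):
        n_prev, n_curr = layer_sizes[l - 1], layer_sizes[l]
        params[f"W{l}"] = rng.normal(0.0, np.sqrt(1.0 / n_prev), size=(n_curr, n_prev))
        params[f"b{l}"] = np.zeros((n_curr, 1))
    return params

params = xavier_init([64, 32, 16, 1])   # illustrative layer sizes
print(params["W1"].std())               # close to 1/sqrt(64) = 0.125
```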
Justification for Xavier Initialization
- We will see that Xavier initialization maintains a constant variance across each layer, provided the activations of each layer have a mean of $0$
- Let's assume we're using a tanh activation function
- Our forward propagation would look like: $z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$ and $a^{[l]} = \tanh(z^{[l]})$
- Assume we initialized our network with appropriate values and the input is normalized
- Early on in training, we are in the linear regime of the tanh function
- Meaning, inputs near $0$ pass through the tanh essentially unchanged, so each layer acts as a linear transformation of its input
- In other words, the values of $z^{[l]}$ are small enough that $\tanh(z^{[l]}) \approx z^{[l]}$
- As a result, the following holds true: $\text{Var}(a^{[l]}) = \text{Var}(z^{[l]})$
- Causing the following to hold true (assuming the weights and activations are independent with zero mean): $\text{Var}(a^{[l]}) = \text{Var}(z^{[l]}) = n^{[l-1]} \, \text{Var}(W^{[l]}) \, \text{Var}(a^{[l-1]})$
- Therefore, we must set $n^{[l-1]} \, \text{Var}(W^{[l]}) = 1$ by initializing $\text{Var}(W^{[l]}) = \frac{1}{n^{[l-1]}}$
- By doing this, we can avoid the vanishing or exploding gradient problem
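- Repeating the earlier per-layer check with the $\sqrt{1/n^{[l-1]}}$ scale (same arbitrary width and depth) shows the variance settling near a constant value

```python
import numpy as np

rng = np.random.default_rng(0)
n, L = 100, 10                                          # same illustrative width and depth
a = rng.normal(size=(n, 1000))                          # normalized inputs: mean ~0, variance ~1

for l in range(L):
    W = rng.normal(scale=np.sqrt(1.0 / n), size=(n, n))  # Xavier: Var(W^[l]) = 1/n^[l-1]
    a = np.tanh(W @ a)
    print(f"layer {l}: mean={a.mean():+.4f}  var={a.var():.3f}")
# With n^[l-1] * Var(W^[l]) = 1, the per-layer variance quickly settles near a
# constant value instead of collapsing exponentially as with the poorly scaled init.
```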
tldr
- The initialization step is critical to the performance of the model
- Initializing all the weights with values too small leads to slow learning
- Initializing all the weights with values too large leads to divergence
- Initializing weights to the same constant leads to neurons evolving symmetrically throughout training
- Meaning, each neuron will have an identical influence on the cost function, since their parameters will have identical gradients
- Initializing weights with inappropriate values will lead to divergence or a slow-down in training speed
- To prevent the gradients of neurons from vanishing or exploding, we want the following to hold:
- The mean of the activations should be $0$
- The variance of the activations should remain constant across each layer
- Under these two assumptions, we should never observe the vanishing or exploding gradient problem
- These assumptions are enforced for both forward and backward propagation
- The recommended initialization scheme that enforces these two assumptions is Xavier initialization, applied at every layer
- Xavier initialization ensures that all the weights of layer $l$ are picked randomly from a normal distribution with a mean of $0$ and a variance of $\frac{1}{n^{[l-1]}}$, where $n^{[l-1]}$ is the number of neurons in layer $l-1$