Motivating Batch Normalization
- Batch normalization is a technique for improving the speed, performance, and stability of a network
- Essentially, batch normalization involves re-centering and re-scaling the inputs of each layer
- By doing this, we can achieve the following:
- Search through hyperparameters more easily
- Have a greater range of hyperparameters that work well
- Have a more robust network in terms of the choice of hyperparameters
Introducing Batch Normalization
- We've already suggested normalizing our input data
- Normalized input data can improve our training speed
- This is because an elongated contour becomes more circular
- However, deeper networks still have an issue with elongated contours if the activations aren't normalized
- This is because the activations of one layer are the inputs used to train the weights and biases of the next layer
- Batch norm solves this issue by normalizing the hidden layers
- Specifically, batch normalization re-scales the hidden unit values so that the weights and biases can be trained faster
Using Activations in Batch Normalization
- In practice, we typically normalize $z^{[l]}$ instead of $a^{[l]}$
- Some people use $a^{[l]}$ over $z^{[l]}$, since $a^{[l]}$ is the input of our parameter training
- However, more people normalize $z^{[l]}$ instead
- This is because normalizing $z^{[l]}$ versus normalizing $a^{[l]}$ will tend to produce a similar outcome
- Therefore, we'll stick to normalizing $z^{[l]}$
Defining Batch Normalization
- Receive some intermediate values $z^{(1)}, \dots, z^{(m)}$ from a layer
- Calculate the mean of the vector $z$: $\mu = \frac{1}{m} \sum_{i} z^{(i)}$
- Calculate the variance of the vector $z$: $\sigma^{2} = \frac{1}{m} \sum_{i} (z^{(i)} - \mu)^{2}$
- Normalize the vector $z$: $z_{norm}^{(i)} = \frac{z^{(i)} - \mu}{\sqrt{\sigma^{2} + \epsilon}}$
- Ensure $\tilde{z}$ doesn't always have $\mu = 0$ and $\sigma^{2} = 1$: $\tilde{z}^{(i)} = \gamma z_{norm}^{(i)} + \beta$ (a small numerical sketch of these steps follows this list)
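The following is a quick numpy sketch of these steps for a single hidden unit; the specific values of $z$, $\gamma = 2$, and $\beta = 0.5$ are made up purely for illustration.

```python
import numpy as np

# A mini-batch of m = 5 intermediate values z(1), ..., z(5) for one hidden unit
z = np.array([1.0, 2.0, 3.0, 4.0, 10.0])
eps = 1e-8                     # small constant for numerical stability
gamma, beta = 2.0, 0.5         # learnable scale and shift

mu = z.mean()                              # mean of the mini-batch
var = ((z - mu) ** 2).mean()               # variance of the mini-batch
z_norm = (z - mu) / np.sqrt(var + eps)     # normalized: mean 0, variance 1
z_tilde = gamma * z_norm + beta            # re-scaled and re-centered

print(z_norm.mean(), z_norm.var())    # ~0.0 and ~1.0
print(z_tilde.mean(), z_tilde.var())  # ~0.5 (= beta) and ~4.0 (= gamma^2)
```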
Understanding Batch Normalization
- In a previous chapter, we've seen a similar type of normalization process applied to our input layer already
- Essentially, batch normalization takes that similar normalization process and not only applies it to our input layer, but our hidden unit values as well
- The additional re-scaling step (the final step above) of batch normalization represents the biggest difference between the two forms of normalization
- Specifically, we want to ensure that $\tilde{z}$ doesn't always have $\mu = 0$ and $\sigma^{2} = 1$
- This is because most activation functions depend on having a wider range of values that aren't centered around $0$ with a variance of $1$
- In other words, this would defeat the purpose of using most activation functions
- For example, we wouldn't want the input values of a sigmoid activation function to have these properties
- This is because the sigmoid function is roughly linear when the input values are close to $0$ (see the short expansion after this list)
- Since the output becomes roughly linear, the layer loses the benefit of its non-linearity
- In summary, this step in batch normalization ensures that our hidden unit values have a standardized mean and variance, where the mean and variance are controlled by two explicit parameters $\gamma$ and $\beta$
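To make the sigmoid claim concrete, a first-order Taylor expansion around $0$ (my own aside, not part of the original notes) shows the function is approximately affine there:

$$\sigma(x) = \frac{1}{1 + e^{-x}} \approx \sigma(0) + \sigma'(0) \, x = \frac{1}{2} + \frac{x}{4}$$

So if every input stays close to $0$, a sigmoid layer behaves like a linear map, and stacking such layers collapses toward a single linear transformation.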
Describing the Parameters and Hyperparameters
- Batch normalization uses a hyperparameter $\epsilon$ to ensure numerical stability during division
- Batch normalization uses the parameter $\beta$ as an additive (shifting) term for $z_{norm}$
- Batch normalization uses the parameter $\gamma$ as a scaling factor of $z_{norm}$
- The $\gamma$ and $\beta$ parameters are included so we can re-scale $\tilde{z}$ to be whatever value we want
- As a reminder, $\gamma$ and $\beta$ are parameters, not hyperparameters
- In our optimization algorithm (e.g. Adam), we would update these parameters just like we'd update the weights and biases parameters
- Note that we use $\tilde{z}^{[l]}$ instead of $z^{[l]}$ going forward in our optimization algorithm (the update rules are sketched below)
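As a concrete example, assuming plain gradient descent with learning rate $\alpha$ (Adam or RMSprop would work just as well), the per-layer updates for these parameters look like any other parameter update:

$$\gamma^{[l]} := \gamma^{[l]} - \alpha \, d\gamma^{[l]}, \qquad \beta^{[l]} := \beta^{[l]} - \alpha \, d\beta^{[l]}$$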
Replacing Bias with Beta
- Recall that $z^{[l]}$ is defined as the following: $z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$
- Recall that $\tilde{z}^{[l]}$ is defined as the following: $\tilde{z}^{[l]} = \gamma^{[l]} z_{norm}^{[l]} + \beta^{[l]}$
- Therefore, the bias term $b^{[l]}$ can be considered a constant that is cancelled out when batch normalization subtracts the mean
- Then, the notion of the bias term can be included in $\beta^{[l]}$
- Meaning, our formulas essentially become: $z^{[l]} = W^{[l]} a^{[l-1]}$ and $\tilde{z}^{[l]} = \gamma^{[l]} z_{norm}^{[l]} + \beta^{[l]}$
- And, we only need to train our model for the following parameters: $W^{[l]}$, $\gamma^{[l]}$, and $\beta^{[l]}$
- In other words, without batch normalization we train a model while updating only two parameters per layer: $W^{[l]}$ and $b^{[l]}$
- Using batch normalization, we would train a model while updating only one additional parameter per layer, since $\beta^{[l]}$ replaces $b^{[l]}$ and $\gamma^{[l]}$ is added (a small sketch of why the bias is redundant follows this list)
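As a minimal numpy sketch of why the bias is redundant (my own illustration, with made-up shapes): adding any constant bias to $z$ before batch normalization produces exactly the same normalized output.

```python
import numpy as np

def batch_norm(z, gamma, beta, eps=1e-8):
    # Normalize across the mini-batch (axis 1), then re-scale and shift
    mu = z.mean(axis=1, keepdims=True)
    var = z.var(axis=1, keepdims=True)
    z_norm = (z - mu) / np.sqrt(var + eps)
    return gamma * z_norm + beta

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 32))     # 4 hidden units, mini-batch of 32
b = rng.normal(size=(4, 1))      # a per-unit bias term
gamma, beta = 1.5, 0.1

# The bias is cancelled by the mean subtraction inside batch norm
print(np.allclose(batch_norm(z, gamma, beta),
                  batch_norm(z + b, gamma, beta)))   # True
```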
Applying Batch Normalization
- The following is a general iteration of a single layer $l$ using batch normalization (with the bias dropped, as above): $a^{[l-1]} \xrightarrow{W^{[l]}} z^{[l]} \xrightarrow{\beta^{[l]}, \gamma^{[l]}} \tilde{z}^{[l]} \rightarrow a^{[l]} = g^{[l]}(\tilde{z}^{[l]})$
- The following is an example of a network with one hidden layer using batch normalization: $x \xrightarrow{W^{[1]}} z^{[1]} \xrightarrow{\beta^{[1]}, \gamma^{[1]}} \tilde{z}^{[1]} \rightarrow a^{[1]} = g^{[1]}(\tilde{z}^{[1]}) \xrightarrow{W^{[2]}} z^{[2]} \xrightarrow{\beta^{[2]}, \gamma^{[2]}} \tilde{z}^{[2]} \rightarrow a^{[2]} = \hat{y}$
- In this example, we need to train the following parameters using an optimization algorithm: $W^{[1]}, \beta^{[1]}, \gamma^{[1]}, W^{[2]}, \beta^{[2]}, \gamma^{[2]}$ (see the code sketch after this list)
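The following is a rough numpy sketch of this one-hidden-layer forward pass; the layer sizes, the ReLU hidden activation, and the sigmoid output are my own assumptions for illustration.

```python
import numpy as np

def batch_norm(z, gamma, beta, eps=1e-8):
    mu = z.mean(axis=1, keepdims=True)
    var = z.var(axis=1, keepdims=True)
    return gamma * (z - mu) / np.sqrt(var + eps) + beta

def forward(X, params):
    # Hidden layer: linear (no bias) -> batch norm -> ReLU
    z1 = params["W1"] @ X
    z1_tilde = batch_norm(z1, params["gamma1"], params["beta1"])
    a1 = np.maximum(0, z1_tilde)

    # Output layer: linear (no bias) -> batch norm -> sigmoid
    z2 = params["W2"] @ a1
    z2_tilde = batch_norm(z2, params["gamma2"], params["beta2"])
    return 1 / (1 + np.exp(-z2_tilde))

# Trainable parameters: W, gamma, and beta for each layer (no biases)
n_x, n_h, n_y, m = 3, 4, 1, 8
rng = np.random.default_rng(1)
params = {
    "W1": rng.normal(size=(n_h, n_x)) * 0.01,
    "gamma1": np.ones((n_h, 1)), "beta1": np.zeros((n_h, 1)),
    "W2": rng.normal(size=(n_y, n_h)) * 0.01,
    "gamma2": np.ones((n_y, 1)), "beta2": np.zeros((n_y, 1)),
}
y_hat = forward(rng.normal(size=(n_x, m)), params)   # shape (1, 8)
```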
Applying Batch Normalization with Mini-Batches
- When applying batch normalization to mini-batches, we apply a very similar approach to the one described above, computing $\mu$ and $\sigma^{2}$ separately on each mini-batch
- The following is an example of a network with one hidden layer and two mini-batches using batch normalization: $X^{\{1\}} \xrightarrow{W^{[1]}} z^{\{1\}[1]} \xrightarrow{\beta^{[1]}, \gamma^{[1]}} \tilde{z}^{\{1\}[1]} \rightarrow a^{\{1\}[1]} \xrightarrow{W^{[2]}} \dots$, and similarly $X^{\{2\}} \xrightarrow{W^{[1]}} z^{\{2\}[1]} \xrightarrow{\beta^{[1]}, \gamma^{[1]}} \tilde{z}^{\{2\}[1]} \rightarrow a^{\{2\}[1]} \xrightarrow{W^{[2]}} \dots$
Training a Model with Batch Normalization
- Select a mini-batch $X^{\{t\}}$
- Use forward propagation to compute $\hat{y}$
- During forward propagation, use batch normalization on each hidden layer to replace $z^{[l]}$ with $\tilde{z}^{[l]}$
- Compute the loss from $\hat{y}$ and $y$
- Compute the cost from the loss
- Use backward propagation to compute $dW^{[l]}$, $d\beta^{[l]}$, and $d\gamma^{[l]}$
- Use an optimization algorithm (e.g. gradient descent, Adam, RMSprop, etc.) to update the following parameters: $W^{[l]}$, $\beta^{[l]}$, and $\gamma^{[l]}$ (a rough training-loop sketch follows this list)
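The following is a rough sketch of this training loop using PyTorch's built-in BatchNorm1d rather than a hand-rolled version; the layer sizes, synthetic data, loss, and learning rate are all made-up assumptions.

```python
import torch
from torch import nn, optim

# One-hidden-layer network; linear layers have no bias, since beta takes its place
model = nn.Sequential(
    nn.Linear(3, 4, bias=False), nn.BatchNorm1d(4), nn.ReLU(),
    nn.Linear(4, 1, bias=False), nn.BatchNorm1d(1),
)
loss_fn = nn.BCEWithLogitsLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)  # updates W, gamma, and beta

X = torch.randn(256, 3)
y = (X.sum(dim=1, keepdim=True) > 0).float()

for epoch in range(5):
    for Xb, yb in zip(X.split(32), y.split(32)):   # select a mini-batch
        logits = model(Xb)                         # forward prop with batch norm
        loss = loss_fn(logits, yb)                 # loss -> cost
        optimizer.zero_grad()
        loss.backward()                            # dW, dgamma, dbeta
        optimizer.step()                           # update W, gamma, beta
```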
Why Batch Norm Increases Training Performance
- Roughly, batch normalization re-scales each of our activations
- Recall that normalizing inputs speeds up training by transforming any elongated contours into circular contours
- We did this by normalizing our inputs such that $\mu = 0$ and $\sigma^{2} = 1$
- Batch normalization is trying to do the same thing, but in a slightly different way
- Specifically, batch normalization also speeds up training by ensuring contours are no longer elongated
- However, batch normalization does this by scaling activations without strictly enforcing $\mu = 0$ and $\sigma^{2} = 1$
- In the end, batch normalization ensures that the input layer, hidden layers, and output layer all become normalized
- Weights in later layers become more robust to changes in earlier layers of the network
- Normalizing the activations leads to those activations having a smaller range of values
- This will make training faster for later layers
- This is because the activations of the earlier layers are not shifted around as much, due to these activations having a smaller range of values
- Meaning, activations become more stable
- As a result, later layers are able to rely on more stable activation inputs
- This also means parameter updates have less of an impact on the distribution of activations
- This also leads to activations becoming more stable
- Note that, because the activations have a smaller range of values, a change of a given size in the activations has a relatively larger effect than before
- The optimization landscape becomes significantly smoother
- We've already seen how normalizing inputs leads to a smoother contour
- Therefore, normalizing activations will only lead to an even smoother contour
- This causes the training speed to improve
Motivating Batch Normalization at Test Time
- At test time, we typically want to predict only one observation at a time
- In this case, dealing with $\mu$ and $\sigma^{2}$ in batch normalization becomes difficult
- We don't just want to set $\mu$ and $\sigma^{2}$ equal to the mean and variance of that single observation
- Therefore, we track $\mu$ and $\sigma^{2}$ during training and compute an exponentially weighted average of the means and variances for each layer across mini-batches
Defining Batch Normalization at Test Time
- Track each layer's $\mu^{\{t\}[l]}$ and $\sigma^{2\{t\}[l]}$ for every mini-batch $t$ during training
- Estimate $\mu^{[l]}$ and $\sigma^{2[l]}$ using exponentially weighted averages across the mini-batches, then use these estimates at test time in $z_{norm} = \frac{z - \mu^{[l]}}{\sqrt{\sigma^{2[l]} + \epsilon}}$ and $\tilde{z} = \gamma^{[l]} z_{norm} + \beta^{[l]}$ (a small sketch follows this list)
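A minimal numpy sketch of this idea, assuming a simple exponentially weighted average with momentum $0.9$ (the exact averaging scheme and the shapes are my assumptions, not from the original notes):

```python
import numpy as np

def batch_norm_train(z, gamma, beta, running, momentum=0.9, eps=1e-8):
    # Training mode: use this mini-batch's statistics and update the
    # exponentially weighted averages used later at test time
    mu = z.mean(axis=1, keepdims=True)
    var = z.var(axis=1, keepdims=True)
    running["mu"] = momentum * running["mu"] + (1 - momentum) * mu
    running["var"] = momentum * running["var"] + (1 - momentum) * var
    return gamma * (z - mu) / np.sqrt(var + eps) + beta

def batch_norm_test(z, gamma, beta, running, eps=1e-8):
    # Test mode: normalize a single observation with the averaged statistics
    return gamma * (z - running["mu"]) / np.sqrt(running["var"] + eps) + beta

n_h = 4
running = {"mu": np.zeros((n_h, 1)), "var": np.ones((n_h, 1))}
gamma, beta = np.ones((n_h, 1)), np.zeros((n_h, 1))

rng = np.random.default_rng(2)
for _ in range(100):                              # training mini-batches
    z_batch = rng.normal(loc=2.0, size=(n_h, 32))
    _ = batch_norm_train(z_batch, gamma, beta, running)

z_single = rng.normal(loc=2.0, size=(n_h, 1))     # one observation at test time
print(batch_norm_test(z_single, gamma, beta, running))
```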
tldr
- We know we can improve training speed by normalizing inputs
- Specifically, we do this by normalizing inputs such that $\mu = 0$ and $\sigma^{2} = 1$
- Now, we can further improve training speed by normalizing the inputs and activations (using batch normalization)
- Specifically, we use batch normalization to normalize inputs and activations such that $\mu$ and $\sigma^{2}$ are fixed values for each layer, based on the parameters $\gamma$ and $\beta$
- In other words, $\mu$ doesn't need to equal $0$ and $\sigma^{2}$ doesn't need to equal $1$ (but they can be, if desired)
- This provides enough flexibility to the parameters so that activations remain normalized, but the activation functions (e.g. sigmoid, relu, etc.) remain effective
- These two forms of normalization are similar, but batch normalization includes an additional step
- This step ensures that our hidden unit values have a standardized mean and variance, where the mean and variance are controlled by two explicit parameters $\gamma$ and $\beta$