Motivating Regularization
- A good way to reduce variance in our network is to include more data in our training set
- However, we can't always just go and get more data
- Therefore, we may need to try other approaches in hopes of reducing the variance in our network
- Adding regularization to our network will reduce overfitting
Defining L2 Regularization
- Up until now, we've defined our cost function as:
$$ J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) $$
- If we add L2 regularization to our cost function, it looks like the following:
$$ J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \lVert w \rVert_{2}^{2} $$
- The additional regularization component can be written out as:
$$ \frac{\lambda}{2m} \lVert w \rVert_{2}^{2} = \frac{\lambda}{2m} \sum_{j=1}^{n_x} w_{j}^{2} = \frac{\lambda}{2m} w^{T} w $$
- This new component can be thought of as a term that shrinks all of the weights toward zero, so only the weights that matter most stay relatively large
- Consequently, this reduces the amount of overfitting as well
- We can adjust the amount of shrinkage by changing the $\lambda$ term
- This $\lambda$ term is known as the regularization parameter
- Increasing $\lambda$ leads to more shrinkage, meaning there's a better chance we see underfitting
- Decreasing $\lambda$ toward zero basically removes this term altogether, meaning the overfitting hasn't been dealt with at all
- Therefore, we typically need to tune our $\lambda$ parameter to find a good middle ground
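To make the shape of the regularized cost concrete, here is a minimal NumPy sketch. It assumes the unregularized cross-entropy cost has already been computed and that we have a list of weight matrices; the function and argument names are just illustrative.

```python
import numpy as np

def l2_regularized_cost(cross_entropy_cost, weights, lam, m):
    """Add the L2 penalty (lambda / (2m)) * sum of squared weights
    to an already-computed (unregularized) cost.

    cross_entropy_cost : the usual cost, (1/m) * sum of the losses
    weights            : list of weight matrices/vectors to penalize
    lam                : the regularization parameter lambda
    m                  : number of training examples
    """
    l2_penalty = (lam / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)
    return cross_entropy_cost + l2_penalty
```

Note that only the weight parameters are passed in; the bias terms are typically left out, as discussed in the next section.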
Possibly Adding More Regularized Terms
- Sometimes, we can also regularize the bias by adding a term such as the following: $\frac{\lambda}{2m} b^{2}$
- However, this usually doesn't affect our results very much
- This is because $b$ makes up a much smaller percentage of the parameter values compared to $w$
- For this reason, we usually exclude the bias component
Defining L1 Regularization
- L1 regularization is another form of regularization
- Specifically, the L1-regularized cost function is defined as:
$$ J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \lVert w \rVert_{1} $$
- The additional regularization component can be written out as:
$$ \frac{\lambda}{2m} \lVert w \rVert_{1} = \frac{\lambda}{2m} \sum_{j=1}^{n_x} \lvert w_{j} \rvert $$
- L1 regularization causes $w$ to be sparse
- Meaning, the $w$ vector will include lots of zeroes
- This is because the L1 penalty tends to shrink many of the weight values all the way to zero
- Some will say this can help with compressing the model
- This is because we may need less memory to store the parameters if they're zero
- However, model compression improves performance only slightly
- Therefore, L2 regularization is typically the most popular regularization technique in practice
- This is largely because L1 regularization doesn't offer much of an advantage over L2 in practice
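As a rough illustration, here is a small sketch (the helper names are made up) that computes the L1 penalty from the definition above and measures how sparse a weight vector is:

```python
import numpy as np

def l1_penalty(w, lam, m):
    # (lambda / (2m)) * sum of |w_j|, matching the definition above
    return (lam / (2 * m)) * np.sum(np.abs(w))

def sparsity(w, tol=1e-8):
    # Fraction of weights that are (numerically) zero
    return float(np.mean(np.abs(w) < tol))

w = np.array([0.0, 0.0, 0.3, 0.0, -1.2])   # a toy weight vector
print(l1_penalty(w, lam=0.7, m=5))          # ~0.105
print(sparsity(w))                          # 0.6, i.e. lots of zeroes (sparse)
```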
Using Regularization in Neural Networks
- We're probably wondering how to implement gradient descent using the new cost function
- When we update the weight parameters during gradient descent, the update with our regularization term looks like:
$$ dW^{[l]} = (\text{term from backprop}) + \frac{\lambda}{m} W^{[l]} $$
$$ W^{[l]} := W^{[l]} - \alpha \, dW^{[l]} $$
- For each layer, the $\frac{\lambda}{m} W^{[l]}$ term comes from the derivative of the L2 regularization component
- The $(\text{term from backprop})$ represents the gradient we normally get from performing backpropagation
- In other words, we've calculated the derivative of our cost function with respect to $W^{[l]}$, then added a regularization term on the end
- We can simplify the above formulas into the following:
$$ W^{[l]} := \left(1 - \frac{\alpha \lambda}{m}\right) W^{[l]} - \alpha \, (\text{term from backprop}) $$
- We can see $\left(1 - \frac{\alpha \lambda}{m}\right)$ is some constant that is less than $1$
- Essentially, we are updating the weight parameter using gradient descent as usual
- However, we are multiplying each weight parameter by some constant slightly less than $1$
- This prevents the weights from growing too large
- Again, the regularization parameter $\lambda$ determines how we trade off the original cost against the penalty on large weights
- In other words, the $\left(1 - \frac{\alpha \lambda}{m}\right)$ factor causes each weight to decay in proportion to its size
- For this reason, we sometimes refer to L2 regularization as weight decay
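Here is a minimal sketch of that update for a single layer, assuming the backprop gradient has already been computed; the function and argument names are just for illustration.

```python
import numpy as np

def update_weights_l2(W, dW_from_backprop, lam, m, alpha):
    """One gradient descent step with L2 regularization (weight decay).

    W                : weight matrix W[l] for one layer
    dW_from_backprop : gradient of the unregularized cost w.r.t. W[l]
    lam              : regularization parameter lambda
    m                : number of training examples
    alpha            : learning rate
    """
    dW = dW_from_backprop + (lam / m) * W   # add the derivative of the L2 term
    W_new = W - alpha * dW                  # same as (1 - alpha*lam/m) * W - alpha * dW_from_backprop
    return W_new
```

The equivalence noted in the comment is exactly the "constant slightly less than 1" shrinkage described above.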
Other Ways of Reducing Overfitting
- As we've already mentioned, regularization is a great way of reducing overfitting
- However, we can use other methods to reduce overfitting in our neural network
- Here are a few:
Dropout Regularization
- This is another regularization method, where units are randomly dropped from the network during training (see the sketch below)
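The notes only name dropout here, but a common way it is implemented is "inverted dropout": randomly zero out units during training and rescale the survivors. A minimal sketch, where the keep probability and the rescaling are assumed details rather than something stated above:

```python
import numpy as np

def inverted_dropout(a, keep_prob=0.8, training=True):
    """Randomly drop units from the activations `a` during training.

    Each unit is kept with probability keep_prob; the survivors are scaled
    up by 1/keep_prob so the expected activation stays the same.
    """
    if not training:
        return a                                 # no dropout at test time
    mask = np.random.rand(*a.shape) < keep_prob  # boolean keep/drop mask
    return (a * mask) / keep_prob
```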
Data Augmentation
- This involves adjusting input images to generate more data
- Meaning, we could take our input images and flip, zoom, or distort them, etc. (see the sketch below)
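For example, a horizontal flip is just a reversal along the width axis. A tiny sketch, assuming images are stored as a NumPy array of shape (batch, height, width, channels):

```python
import numpy as np

def augment_with_flips(images):
    """Return the original images plus their horizontally flipped copies,
    doubling the amount of training data without collecting anything new."""
    flipped = images[:, :, ::-1, :]           # reverse the width axis
    return np.concatenate([images, flipped])  # originals + flipped copies
```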
Early Stopping
- This involves stopping the training of a neural network early
- Specifically, we're trying to find the optimal number of iterations that provides us with the best parameters (see the sketch below)
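A minimal sketch of that idea, assuming we have some hypothetical `train_one_iteration` and `dev_set_cost` callables to hand:

```python
def early_stopping(train_one_iteration, dev_set_cost, max_iters=1000, patience=10):
    """Stop training once the dev-set cost stops improving.

    Tracks the iteration with the lowest dev-set cost and gives up after
    `patience` iterations without any improvement.
    """
    best_cost = float("inf")
    best_iteration = 0
    iters_without_improvement = 0
    for i in range(max_iters):
        train_one_iteration()
        cost = dev_set_cost()
        if cost < best_cost:
            best_cost, best_iteration = cost, i
            iters_without_improvement = 0
        else:
            iters_without_improvement += 1
            if iters_without_improvement >= patience:
                break
    return best_iteration   # the iteration count that gave the best parameters
```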
tldr
- An added regularization component can be thought of as a term that shrinks all of the weights toward zero, so only the weights that matter most stay relatively large
- Consequently, this reduces the amount of overfitting as well
- We can adjust the amount of shrinkage by changing the $\lambda$ term
- This $\lambda$ term is known as the regularization parameter
- Increasing $\lambda$ leads to more shrinkage, meaning there's a better chance we see underfitting
- Decreasing $\lambda$ toward zero basically removes this term altogether, meaning the overfitting hasn't been dealt with at all