Regularization

Motivating Regularization

  • A good way to reduce variance in our network is to include more data in our training set
  • However, we can't always just go and get more data
  • Therefore, we may need to try other approaches in hopes of reducing the variance in our network
  • Adding regularization to our network will reduce overfitting


Defining L2 Regularization

  • Up until now, we've defined our cost function as:
J(w,b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}, y)
  • If we add regularization to our cost function, our cost function would look like the following:
J(w,b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}, y) + \frac{\lambda}{2m} \Vert w \Vert_{2}^{2}
  • The additional regularization component can be simplified to:
\Vert w \Vert_{2}^{2} = \sum_{j=1}^{n} w_{j}^{2} = w^{T}w
  • This new component can be thought of as a term that shrinks all of the weights toward zero, so only the weights that matter most to the loss stay relatively large
  • Consequently, this reduces the amount of overfitting as well
  • We can adjust the amount of shrinkage by changing the λ term
  • This λ term is known as the regularization parameter
  • Increasing λ leads to more shrinkage, meaning there's a better chance we see underfitting
  • Decreasing λ toward zero basically removes this term altogether, meaning the overfitting hasn't been dealt with at all
  • Therefore, we typically need to tune λ to find a good middle ground (a small sketch of the regularized cost follows this list)
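
As a rough illustration of the regularized cost above, here is a minimal NumPy sketch, assuming a binary cross-entropy loss and hypothetical names (l2_regularized_cost, lam standing in for λ):

```python
import numpy as np

def l2_regularized_cost(y_hat, y, w, lam):
    """Cost J(w, b): average loss plus the L2 penalty (lambda / 2m) * ||w||_2^2."""
    m = y.shape[0]
    # Unregularized part: average cross-entropy loss over the m training examples
    cross_entropy = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    # L2 penalty: squared Euclidean norm of the weights, scaled by lambda / (2m)
    l2_penalty = (lam / (2 * m)) * np.sum(w ** 2)  # equivalently (lam / (2 * m)) * w @ w
    return cross_entropy + l2_penalty
```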

Possibly Adding More Regularized Terms

  • Sometimes, we can add a bias term such as the following:
J(w,b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}, y) + \frac{\lambda}{2m} \Vert w \Vert_{2}^{2} + \frac{\lambda}{2m} \Vert b \Vert_{2}^{2}
  • However, this usually doesn't affect our results very much
  • This is because b makes up a much smaller fraction of the model's parameters than w
  • For this reason, we usually exclude the bias component \frac{\lambda}{2m} \Vert b \Vert_{2}^{2}

Defining L1 Regularization

  • L1 regularization is another form of regularization
  • Specifically, L1 regularization is defined as:
J(w,b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}, y) + \frac{\lambda}{2m} \Vert w \Vert_{1}
  • The additional regularization component can be simplified to:
\Vert w \Vert_{1} = \sum_{j=1}^{n} |w_{j}|
  • L1 regularization causes w to be sparse
  • Meaning, the w vector will include lots of zeroes
  • This is because the regularization component \frac{\lambda}{2m} \Vert w \Vert_{1} tends to drive many of the weight values in w all the way to zero
  • Some will say this can help with compressing the model
  • This is because we may need less memory to store the parameters if they're zero
  • However, in practice this compression only helps a little
  • Therefore, L2 regularization is typically the most popular regularization technique in practice
  • This is largely because L1 regularization doesn't offer much of an advantage over it (a sketch comparing the two penalties follows this list)
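
As a sketch of the difference between the two penalties (hypothetical helper names, assuming the same λ/(2m) scaling used in the cost functions above):

```python
import numpy as np

def l1_penalty(w, lam, m):
    """L1 term from the cost: (lambda / 2m) * ||w||_1."""
    return (lam / (2 * m)) * np.sum(np.abs(w))

def l2_penalty(w, lam, m):
    """L2 term from the cost: (lambda / 2m) * ||w||_2^2."""
    return (lam / (2 * m)) * np.sum(w ** 2)

# Gradient contribution each penalty adds to dJ/dw during training:
#   L1 adds (lambda / 2m) * sign(w): a constant-size push toward zero,
#   which is what drives many weights exactly to zero (sparsity).
#   L2 adds (lambda / m) * w: a push proportional to each weight's size,
#   which shrinks weights but rarely makes them exactly zero.
def l1_grad_term(w, lam, m):
    return (lam / (2 * m)) * np.sign(w)

def l2_grad_term(w, lam, m):
    return (lam / m) * w
```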

Using Regularization in Neural Networks

  • We're probably wondering how to implement gradient descent using the new cost function
  • When we update the weight parameter ww during gradient descent, the update with our regularization term will look like:
w = w - \alpha \frac{\partial J}{\partial w}
\frac{\partial J}{\partial w} = (\text{from backprop}) + \frac{\lambda}{m} w
  • For each layer, the \frac{\lambda}{m}w term comes from the derivative of the L2 regularization term \frac{\lambda}{2m} \Vert w \Vert_{2}^{2}
  • The (\text{from backprop}) term represents the gradient \frac{\partial J}{\partial w} we normally get from performing backpropagation on the unregularized cost
  • In other words, we've calculated the derivative of our cost function J with respect to w, then added the regularization term on the end
  • We can simplify the above formulas into the following:
w = w - \alpha \left[ (\text{from backprop}) + \frac{\lambda}{m} w \right]
  = w - \frac{\alpha \lambda}{m} w - \alpha (\text{from backprop})
  = w \left(1 - \frac{\alpha \lambda}{m}\right) - \alpha (\text{from backprop})
  • We can see (1 - \frac{\alpha \lambda}{m}) is some constant that is less than 1
  • Essentially, we are updating the weight parameter w using gradient descent as usual
  • However, we are also multiplying each weight parameter w by some constant (1 - \frac{\alpha \lambda}{m}) slightly less than 1
  • This prevents the weights from growing too large
  • Again, the regularization parameter λ determines how we trade off the original cost J against the penalty on large weights
  • In other words, the w(1 - \frac{\alpha \lambda}{m}) term causes each weight to decay in proportion to its size
  • For this reason, we sometimes refer to L2 regularization as weight decay (a sketch of this update follows this list)
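
As a minimal sketch of this update (hypothetical names; dw_backprop stands in for the gradient we would get from backpropagation on the unregularized cost):

```python
import numpy as np

def l2_gradient_descent_step(w, dw_backprop, alpha, lam, m):
    """One L2-regularized gradient descent step, i.e. "weight decay"."""
    # Straightforward form: add the regularization gradient, then take the step
    dw = dw_backprop + (lam / m) * w
    w_new = w - alpha * dw
    # Equivalent "weight decay" form: shrink w by (1 - alpha*lam/m), then step
    w_decayed = w * (1 - alpha * lam / m) - alpha * dw_backprop
    assert np.allclose(w_new, w_decayed)  # both forms give the same update
    return w_new
```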

Other Ways of Reducing Overfitting

  • As we've already mentioned, regularization is a great way of reducing overfitting
  • However, we can use other methods to reduce overfitting in our neural network
  • Here are a few:

    • Dropout Regularization

      • This is another regularization method, where units in the network are randomly dropped during training
    • Data Augmentation

      • This involves adjusting input images to generate more data
      • Meaning, we could take our input images and flip, zoom, distort, etc.
    • Early Stopping

      • This involves stopping the training of a neural network early
      • Specifically, we're trying to find the number of iterations that provides us with the best parameters (a rough sketch follows this list)
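
As a rough sketch of early stopping (hypothetical train_one_epoch and dev_error functions, and a simple "patience" rule as one common way to decide when to stop):

```python
import copy

def train_with_early_stopping(params, train_one_epoch, dev_error,
                              max_epochs=100, patience=5):
    """Stop once the dev-set error has not improved for `patience` epochs in a row."""
    best_error = float("inf")
    best_params = copy.deepcopy(params)
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        train_one_epoch(params)       # one pass of gradient descent over the training set
        error = dev_error(params)     # evaluate on the held-out dev set

        if error < best_error:
            best_error = error
            best_params = copy.deepcopy(params)  # keep the best parameters seen so far
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                 # stop training early

    return best_params
```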

tldr

  • An added regularization component can be thought of as a term that shrinks all of the weights toward zero, so only the weights that matter most to the loss stay relatively large
  • Consequently, this reduces the amount of overfitting as well
  • We can adjust the amount of shrinkage by changing the λ term
  • This λ term is known as the regularization parameter
  • Increasing λ leads to more shrinkage, meaning there's a better chance we see underfitting
  • Decreasing λ toward zero basically removes this term altogether, meaning the overfitting hasn't been dealt with at all
