Intuition behind Regularization

Motivating Regularization for Overfitting

  • Roughly speaking, weights tend to be relatively small for a model with a large amount of bias (i.e. one that underfits)
  • Roughly speaking, weights tend to be relatively large for a model with a large amount of variance (i.e. one that overfits), as the sketch below illustrates
  • Regularization attempts to find a happy medium, so the weights don't become too large or too small
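To make this concrete, here's a minimal sketch (not from the original notes; the data, polynomial degree, and $\lambda$ value are all illustrative assumptions) comparing coefficient magnitudes of an overfit polynomial fit with and without an L2 (ridge) penalty:

```python
import numpy as np

# Minimal sketch: an overfit model tends to have very large weights.
# Fit a degree-9 polynomial to 10 noisy points, with and without an
# L2 (ridge) penalty, and compare the coefficient magnitudes.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=10)
X = np.vander(x, 10)  # degree-9 polynomial features

w_ols = np.linalg.lstsq(X, y, rcond=None)[0]  # no regularization
lam = 1e-3                                    # illustrative lambda
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(10), X.T @ y)

print(np.abs(w_ols).max())    # huge coefficients: high variance, overfit
print(np.abs(w_ridge).max())  # much smaller: the penalty shrinks them
```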

[Figure: regularization]

  • We expect our weights to converge to 0 when there is a lot of noise, and to stray from 0 when there is signal
  • By adding regularization to our model, we are adding some amount of noise that is proportional to our weight
  • Since noise makes it harder to pick up on any signal, our data becomes harder to fit as we add more noise
  • Therefore, it becomes harder to overfit
  • We can control the amount of noise added to our weights through the regularization parameter $\lambda$
  • Increasing $\lambda$ shrinks the weights toward 0 at a faster rate, as the sketch after this list demonstrates
  • A weight that is already large will take longer to shrink to zero than a weight that is small
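As a rough illustration of the shrinkage rate, consider gradient descent on the L2 penalty $\frac{\lambda}{2}\Vert w \Vert_{2}^{2}$ alone (a sketch with arbitrary step size and $\lambda$ values, just to isolate the effect):

```python
import numpy as np

# Minimal sketch: gradient descent on just the L2 penalty (lambda/2)*||w||^2.
# Its gradient is lambda*w, so each step multiplies the weights by
# (1 - lr*lambda): a larger lambda drives the weights toward 0 faster.
def shrink(w0, lam, lr=0.1, steps=50):
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        w -= lr * lam * w
    return w

w0 = [2.0, -1.5, 0.3]
print(shrink(w0, lam=0.1))  # mild shrinkage: weights still clearly nonzero
print(shrink(w0, lam=1.0))  # strong shrinkage: weights driven almost to 0
```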

Illustrating Effects of Regularization

  • Up until this point, we know how our penalty terms $\Vert w \Vert_{1}$ and $\Vert w \Vert_{2}^{2}$ lead to shrinkage
  • However, we're still probably wondering how exactly shrinkage prevents overfitting
  • As hinted at previously, if we increase our regularization parameter $\lambda$, our weights start to converge toward 0
  • Increasing the regularization parameter leads to more weights becoming (nearly) equal to 0
  • This zeroes out the impact of some of our hidden units
  • Meaning, our neural network becomes simpler and smaller

[Figure: regularization]

  • In practice, these hidden units aren't actually zeroed out
  • Instead, they are still used, but have a much smaller effect
  • Therefore, we end up with a network that behaves as if it were a smaller, simpler one, as the sketch below shows
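Here's a small sketch of that claim on a made-up 2-3-1 tanh network (for clarity it shrinks just one hidden unit's weights, the way heavy regularization would): the full network's output approaches that of a network with the unit pruned away entirely.

```python
import numpy as np

# Sketch: shrinking all weights attached to one hidden unit makes its
# contribution vanish, so the network behaves like a smaller one.
rng = np.random.default_rng(0)
x = rng.normal(size=2)
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=1)

def forward(W1, b1, W2, b2, x):
    h = np.tanh(W1 @ x + b1)   # hidden layer
    return W2 @ h + b2         # linear output layer

full = forward(W1, b1, W2, b2, x)

# Shrink everything touching hidden unit 0, as heavy regularization would
W1_s, W2_s = W1.copy(), W2.copy()
W1_s[0, :] *= 1e-3
W2_s[:, 0] *= 1e-3
shrunk = forward(W1_s, b1, W2_s, b2, x)

# Compare against the network with unit 0 deleted outright
pruned = forward(W1[1:], b1[1:], W2[:, 1:], b2, x)
print(full, shrunk, pruned)  # shrunk is nearly identical to pruned
```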

Another Illustration of Regularization Effects

  • If we increase the regularization parameter $\lambda$, then the weights decrease toward 0
  • This prevents overfitting, but could cause underfitting if we increase $\lambda$ too much
  • We can notice this issue by taking a closer look at the tanh activation function in our network
  • Let's say we're faced with the following:
$z^{l} = w^{l}a^{l-1} + b^{l}$
$a^{l} = \tanh(z^{l})$
  • We know that if $\lambda$ increases too much, then $w^{l}$ will become very small
  • Ignoring the effects of $b^{l}$, we can see that $z^{l}$ will also become very small and take on only a narrow range of values around 0
  • Then, the output $a^{l}$ of our tanh function will be roughly linear, since $\tanh(z) \approx z$ for small $z$
  • Consequently, each layer behaves almost like a linear function, and linear functions are too simple to fit the complicated, nonlinear decision boundaries we're most likely looking for
  • This intuition holds for many activation functions other than tanh (see the sketch after this list)
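A quick numeric sketch of that claim: near 0, tanh is almost exactly the identity, so tiny pre-activations make the layer behave linearly.

```python
import numpy as np

# Sketch: tanh(z) ~ z for small z (its Taylor series is z - z^3/3 + ...),
# so when heavy regularization forces w (and hence z) to be tiny, the
# activation is effectively linear; for larger z the nonlinearity appears.
z_small = np.linspace(-0.1, 0.1, 101)
z_large = np.linspace(-3.0, 3.0, 101)
print(np.max(np.abs(np.tanh(z_small) - z_small)))  # ~3e-4: nearly linear
print(np.max(np.abs(np.tanh(z_large) - z_large)))  # ~2: clearly nonlinear
```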

[Figure: tanh regularization]


TL;DR

  • Roughly speaking, weights tend to be relatively small for a model with a large amount of bias (i.e. one that underfits)
  • Roughly speaking, weights tend to be relatively large for a model with a large amount of variance (i.e. one that overfits)
  • Regularization attempts to find a happy medium, so the weights don't become too large or too small
