Adam Optimization

Describing Adam Optimization

  • Adam stands for adaptive moment estimation
  • Adam is almost always more performant than the standard gradient descent algorithm
  • Adam is an optimization algorithm that essentially combines momentum and RMSProp
  • Adam is an extension of mini-batch gradient descent

    • Mini-batch gradient descent is a configuration of stochastic gradient descent
    • Therefore, we'll sometimes see that Adam is an extension of stochastic gradient descent
  • Adam uses the hyperparameters $\beta_{v}$, $\beta_{s}$, $\alpha$, and $\epsilon$
  • Adam is generally regarded as fairly robust to the choice of hyperparameters, though the learning rate sometimes needs to be changed from the suggested default

Defining Adam Optimization

  1. Compute $dW$ and $db$ on the current mini-batch $t$

    • Where $dW = \frac{\partial J(w,b)}{\partial w}$
    • Where $db = \frac{\partial J(w,b)}{\partial b}$
  2. Update each $W$ parameter as the following:

$$
\begin{aligned}
v_{t} = \beta_{v} v_{t-1} + (1-\beta_{v})dW \qquad \rbrace &= \text{momentum} \\
s_{t} = \beta_{s} s_{t-1} + (1-\beta_{s})dW^{2} \qquad \rbrace &= \text{rmsprop} \\
v_{t}^{corrected} = \frac{v_{t}}{1-\beta_{v}^{t}} \qquad \rbrace &= \text{bias correction} \\
s_{t}^{corrected} = \frac{s_{t}}{1-\beta_{s}^{t}} \qquad \rbrace &= \text{bias correction}
\end{aligned}
$$

$$
W_{t} = W_{t-1} - \alpha \frac{v_{t}^{corrected}}{\sqrt{s_{t}^{corrected}} + \epsilon}
$$
  3. Update each $b$ parameter as the following:

$$
\begin{aligned}
v_{t} = \beta_{v} v_{t-1} + (1-\beta_{v})db \qquad \rbrace &= \text{momentum} \\
s_{t} = \beta_{s} s_{t-1} + (1-\beta_{s})db^{2} \qquad \rbrace &= \text{rmsprop} \\
v_{t}^{corrected} = \frac{v_{t}}{1-\beta_{v}^{t}} \qquad \rbrace &= \text{bias correction} \\
s_{t}^{corrected} = \frac{s_{t}}{1-\beta_{s}^{t}} \qquad \rbrace &= \text{bias correction}
\end{aligned}
$$

$$
b_{t} = b_{t-1} - \alpha \frac{v_{t}^{corrected}}{\sqrt{s_{t}^{corrected}} + \epsilon}
$$
  4. Repeat the above steps for each $t^{th}$ mini-batch iteration

    • Note that our hyperparameters here are $\beta_{v}$, $\beta_{s}$, $\alpha$, and $\epsilon$ (a NumPy sketch of one full update follows below)
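
A minimal NumPy sketch of one full update, following the steps above (the function name, the toy objective, and the choice to pass $v$, $s$, and the iteration counter $t$ around explicitly are assumptions made for illustration; the same update applies verbatim to $b$ and $db$):

```python
import numpy as np

def adam_update(W, dW, v, s, t, alpha=0.001, beta_v=0.9, beta_s=0.999, eps=1e-8):
    """One Adam step for parameters W, given the gradient dW on mini-batch t (1-indexed)."""
    v = beta_v * v + (1 - beta_v) * dW               # momentum term
    s = beta_s * s + (1 - beta_s) * dW ** 2          # rmsprop term (element-wise square)
    v_corrected = v / (1 - beta_v ** t)              # bias correction
    s_corrected = s / (1 - beta_s ** t)              # bias correction
    W = W - alpha * v_corrected / (np.sqrt(s_corrected) + eps)
    return W, v, s

# Toy usage: minimize J(W) = ||W||^2, whose gradient is dW = 2W
W = np.array([1.0, -2.0, 3.0])
v, s = np.zeros_like(W), np.zeros_like(W)
for t in range(1, 501):
    dW = 2 * W                                        # gradient on the current "mini-batch"
    W, v, s = adam_update(W, dW, v, s, t, alpha=0.1)  # larger alpha only to speed up the toy
print(W)  # ends up close to the zero vector, the minimum of J
```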

Adam Hyperparameters

  • The hyperparameter $\beta_{v}$ controls the exponentially weighted average of $dW$
  • The hyperparameter $\beta_{s}$ controls the exponentially weighted average of $dW^{2}$
  • The hyperparameter $\alpha$ typically needs to be tuned (see the framework sketch after this list for the usual starting values)
  • The hyperparameter $\epsilon$ is almost never tuned
  • The hyperparameter $\beta_{v}$ typically doesn't need to be tuned

    • The default value is typically left unchanged
    • The default value of $\beta_{v}$ is $0.9$
  • The hyperparameter $\beta_{s}$ typically doesn't need to be tuned

    • The default value is typically left unchanged
    • The default value of $\beta_{s}$ is $0.999$
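
These defaults line up with what common deep learning frameworks ship with. As a reference point, the sketch below (assuming PyTorch and a throwaway linear model purely so there are parameters to optimize) spells out PyTorch's own default values, where $\beta_{v}$ and $\beta_{s}$ correspond to the two entries of the `betas` tuple:

```python
import torch

model = torch.nn.Linear(10, 1)  # throwaway model, just so there are parameters to optimize

# The values below are PyTorch's defaults, written out explicitly.
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=0.001,            # alpha: the hyperparameter most often tuned
    betas=(0.9, 0.999),  # (beta_v, beta_s): typically left at their defaults
    eps=1e-8,            # epsilon: numerical-stability term, essentially never tuned
)
```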

Interpreting Adam Optimization

  • Note that $dW^{2}$ is an element-wise square, since $W$ represents a vector of weights
  • Also note that $\epsilon$ represents some small number that ensures numerical stability

    • The default value is typically $\epsilon = 10^{-8}$
    • This value is essentially negligible for interpretation purposes
  • Essentially, we gain a smoothing benefit by applying exponentially weighted averages to the gradients used in gradient descent (a numerical sketch of this follows the list)
  • Therefore, the steps should become smoother as we increase $\beta$
  • This is because we're averaging over the steps from previous iterations
  • As a result, we shouldn't bounce up and down around the same parameter value as much, since the average of these partial derivatives is essentially $0$
  • And we should keep moving quickly sideways toward the optimal parameter value, since in that direction the average of these partial derivatives is much larger than $0$
  • We also gain a smoothness benefit from $\frac{v_{t}^{corrected}}{\sqrt{s_{t}^{corrected}}}$
  • This is essentially the ratio of the bias-corrected momentum term to the bias-corrected RMSProp term
  • Therefore, we will get a damping effect if:

    • The $\sqrt{s_{t}^{corrected}}$ term is very large
    • The $v_{t}^{corrected}$ term is very small
  • On the other hand, we will quickly gravitate toward the optimal parameter values if either of the following is true:

    • The $\sqrt{s_{t}^{corrected}}$ term is very small
    • The $v_{t}^{corrected}$ term is very large
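
To make the smoothing argument concrete, here is a small NumPy sketch. The simulated gradient stream is an assumption for illustration: a consistent drift of $+1$ plus a large $\pm 10$ oscillation, standing in for a direction along which plain gradient descent would bounce back and forth. The momentum term lands much closer to the drift than to the raw swings, and the $\frac{v_{t}^{corrected}}{\sqrt{s_{t}^{corrected}}}$ ratio stays well below the raw gradient magnitude, which is the damping effect described above:

```python
import numpy as np

rng = np.random.default_rng(0)
beta_v, beta_s = 0.9, 0.999
v = s = 0.0

# Simulated per-iteration gradients for a single parameter: +1 drift plus a +/-10 oscillation.
for t in range(1, 501):
    grad = 1.0 + 10.0 * (-1) ** t + rng.normal(0.0, 0.1)
    v = beta_v * v + (1 - beta_v) * grad           # momentum term
    s = beta_s * s + (1 - beta_s) * grad ** 2      # rmsprop term
    v_corrected = v / (1 - beta_v ** t)            # bias correction
    s_corrected = s / (1 - beta_s ** t)            # bias correction

print(f"last raw gradient:      {grad:+.1f}")                                 # swings near +/-10
print(f"smoothed momentum term: {v_corrected:+.2f}")                          # much closer to the +1 drift
print(f"v / sqrt(s) ratio:      {v_corrected / np.sqrt(s_corrected):+.2f}")   # well below 1
```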

tldr

  • Adam optimization can be thought of as standard gradient descent that has been given a short-term memory
  • Adam optimization attempts to average out the oscillations around the same value (in a given direction) by applying exponentially weighted averages to standard gradient descent
  • It also gains a smoothness benefit from updating parameters using the $\frac{v_{t}^{corrected}}{\sqrt{s_{t}^{corrected}}}$ term
  • Adam optimization is almost always faster than the standard gradient descent algorithm
