RMSProp

Introducing Gradient Descent with RMSProp

  • RMSProp stands for Root Mean Square Propagation
  • Gradient descent with RMSProp is usually faster than the standard gradient descent algorithm
  • Essentially, gradient descent with RMSProp involves computing an exponentially weighted average of the squared gradients
  • Then, we scale each gradient by the square root of that average when updating our parameters, instead of using the raw gradients

Motivating RMSProp

  • Let's say we're performing standard gradient descent to optimize our parameters $w$ and $b$
  • After running many iterations, our contour could look like:

[Figure: contour plot of the cost surface, showing gradient descent oscillating up and down while slowly moving toward the minimum]

  • Note, the darkness of the contour represents a smaller cost $J$
  • Although we're able to find the optimal parameters by minimizing the cost function, we can imagine that gradient descent is slowed down due to the up and down oscillations
  • This slowdown happens for two reasons:

    1. We're not able to use a larger learning rate, since there's a better chance of overshooting the global minimum
    2. The small steps lead us to focus too much on moving up and down relative to the minimum, rather than sideways toward the minimum
  • Mitigating the first problem involves scaling our data properly
  • For now, let's focus on the second problem
  • We can mitigate the second problem by slowing down the up-and-down learning, while speeding up the sideways learning towards the global minimum
  • This is accomplished using gradient descent with RMSProp

Defining Gradient Descent with RMSProp

  1. Compute $dW$ and $db$ on the current mini-batch $t$

    • Where $dW = \frac{\partial J(w,b)}{\partial w}$
    • Where $db = \frac{\partial J(w,b)}{\partial b}$
  2. Update each $w$ parameter as follows:
$$v_{t} = \beta v_{t-1} + (1-\beta) dW^{2} \qquad W_{t} = W_{t-1} - \alpha \frac{dW}{\sqrt{v_{t}} + \epsilon}$$
  3. Update each $b$ parameter as follows:
$$v_{t} = \beta v_{t-1} + (1-\beta) db^{2} \qquad b_{t} = b_{t-1} - \alpha \frac{db}{\sqrt{v_{t}} + \epsilon}$$
  4. Repeat the above steps for each $t^{th}$ mini-batch iteration (a code sketch follows this list)

    • Note that our hyperparameters here are the learning rate $\alpha$ and the decay rate $\beta$
    • Note that each parameter keeps its own running average $v_{t}$ of its squared gradients
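
To make the procedure concrete, below is a minimal NumPy sketch of the update above. The function name `rmsprop_update` and the toy cost $J(w) = w^{2}$ are illustrative assumptions for this note, not part of any particular library.

```python
import numpy as np

def rmsprop_update(param, grad, v, alpha=0.01, beta=0.9, eps=1e-8):
    """One RMSProp step for a single parameter array (e.g. W or b).

    v is that parameter's running exponentially weighted average of
    squared gradients (the v_t from the definition above).
    """
    v = beta * v + (1 - beta) * grad ** 2              # v_t = beta*v_{t-1} + (1-beta)*dW^2
    param = param - alpha * grad / (np.sqrt(v) + eps)  # W_t = W_{t-1} - alpha*dW/(sqrt(v_t)+eps)
    return param, v

# Toy usage on a single weight: minimize J(w) = w^2, so dW = 2w
w = np.array([5.0])
v_w = np.zeros_like(w)
for t in range(200):
    dW = 2 * w                                     # "mini-batch" gradient for this toy cost
    w, v_w = rmsprop_update(w, dW, v_w, alpha=0.1)
print(w)  # ends up near 0 (within roughly the step size alpha)
```

In a real network, we would keep one running average per parameter tensor (one for each weight matrix and each bias vector) and feed in the mini-batch gradients computed by backpropagation.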

Interpreting Gradient Descent with RMSProp

  • Note, the $dW^{2}$ is an element-wise operation, since $W$ represents a vector of weights
  • Also note, $\epsilon$ represents some small number that ensures numerical stability

    • The default value of this is typically $\epsilon = 10^{-8}$
    • This value is essentially negligible for interpretation purposes
  • Essentially, we gain a smoothing benefit by applying exponentially weighted averaging to the squared gradients used in the gradient descent update
  • Therefore, the running average $v_{t}$ (and thus the step size) should become smoother as we increase $\beta$ (a small demonstration follows this list)
  • This is because we're averaging over more of the previous iterations' gradients
  • As a result, $v_{t}$ stays large in the direction where we keep bouncing up and down around the same parameter value, since the squared partial derivatives there are consistently large
  • And $v_{t}$ stays small in the sideways direction towards the optimal parameter value, since the squared partial derivatives there are consistently small
  • We're also gaining a smoothness benefit from the $\frac{dW}{\sqrt{v_{t}}}$ and $\frac{db}{\sqrt{v_{t}}}$ terms
  • This is because we get a damping effect (smaller steps) in directions where $\sqrt{v_{t}}$ is large, such as the up-and-down direction
  • On the other hand, we will quickly gravitate toward the optimal parameter values in directions where $\sqrt{v_{t}}$ is small, such as the sideways direction
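
As a rough, self-contained illustration of that smoothing effect, the sketch below (the noisy gradient sequence is made up purely for demonstration) computes the running average $v_{t}$ for a few values of $\beta$ and shows that it fluctuates less as $\beta$ grows.

```python
import numpy as np

np.random.seed(0)

# Simulated noisy mini-batch gradients: a constant signal plus noise
grads = 1.0 + 0.5 * np.random.randn(1000)

def running_v(grads, beta):
    """v_t = beta*v_{t-1} + (1-beta)*g_t^2 (exponentially weighted average of squared gradients)."""
    v, history = 0.0, []
    for g in grads:
        v = beta * v + (1 - beta) * g ** 2
        history.append(v)
    return np.array(history)

for beta in (0.5, 0.9, 0.99):
    v = running_v(grads, beta)
    # A larger beta averages over more past mini-batches, so v_t fluctuates less
    print(f"beta={beta}: std of v_t over the last 500 steps = {v[500:].std():.4f}")
```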

Lower-Level Intuition behind RMSProp

  • If $dW$ is large:

    • It is squared and accumulated into $v_{t}$, giving us a larger denominator $\sqrt{v_{t}} + \epsilon$
    • This makes the effective learning rate coefficient $\frac{\alpha}{\sqrt{v_{t}} + \epsilon}$ smaller
    • Meaning, our update is damped relative to the raw gradient step
  • If $dW$ is small:

    • It is squared and accumulated into $v_{t}$, giving us a smaller denominator $\sqrt{v_{t}} + \epsilon$
    • This makes the effective learning rate coefficient $\frac{\alpha}{\sqrt{v_{t}} + \epsilon}$ larger
    • Meaning, our update is boosted relative to the raw gradient step (a numeric illustration follows this list)
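
A tiny numeric check of this intuition is shown below; the gradient values 10.0 and 0.1 are arbitrary, and $v_{t}$ is assumed (for simplicity) to have settled to roughly the squared magnitude of the recent gradients in each direction.

```python
alpha, eps = 0.01, 1e-8

for dW in (10.0, 0.1):
    v = dW ** 2                       # assume v_t ~ dW^2 in this direction (illustrative)
    coeff = alpha / (v ** 0.5 + eps)  # effective learning-rate coefficient: alpha / (sqrt(v_t) + eps)
    step = coeff * dW                 # the actual parameter update
    print(f"dW={dW:>5}: coefficient = {coeff:.6f}, update = {step:.6f}")

# Large dW -> large sqrt(v_t) -> small coefficient (damped update)
# Small dW -> small sqrt(v_t) -> large coefficient (boosted update)
# Either way the update ends up on the order of alpha, regardless of the raw gradient size
```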

Intuition behind Momentum and RMSProp

  • Momentum adds momentum to parameter updates by heavily weighting the average of the previous (recent) gradients
  • RMSProp adds resistance to parameter updates by standardizing $dW$ with respect to the average magnitude of the previous (recent) gradients (the two update rules are written side by side below)

    • Where $dW$ represents the current gradient
    • Where $\sqrt{v_{t}}$ is roughly the average magnitude of the recent gradients
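
Writing the two update rules side by side makes the contrast explicit. These are common formulations (the momentum form shown is one standard exponentially weighted version, restated here only for comparison):

```latex
% Momentum: average the signed gradients, then step along that average
v_{t} = \beta v_{t-1} + (1-\beta)\, dW, \qquad W_{t} = W_{t-1} - \alpha\, v_{t}

% RMSProp: average the squared gradients, then divide the gradient by the root of that average
v_{t} = \beta v_{t-1} + (1-\beta)\, dW^{2}, \qquad W_{t} = W_{t-1} - \alpha\, \frac{dW}{\sqrt{v_{t}} + \epsilon}
```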

tldr

  • Gradient descent with RMSProp can be thought of as standard gradient descent that has been given a short-term memory
  • Gradient descent with RMSProp attempts to average out the oscillations around the same value (in a given direction) by applying exponentially weighted averages of the squared gradients to standard gradient descent
  • It also gains a smoothness benefit from updating parameters using the $\frac{dW}{\sqrt{v_{t}}}$ and $\frac{db}{\sqrt{v_{t}}}$ terms
  • Gradient descent with RMSProp is usually faster than the standard gradient descent algorithm
