Introducing Gradient Descent with RMSprop
- RMSProp stands for root mean square propagation
- Gradient descent with RMSProp typically converges faster than the standard gradient descent algorithm
- Essentially, gradient descent with RMSProp involves computing an exponentially weighted average of the squared gradients
- Then, we divide each gradient by the square root of that average when updating our parameters, instead of using the raw gradients directly
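- As a rough illustration of that idea, here is a minimal sketch of the update for a single scalar parameter (the function and variable names here are made up for this note, not taken from any particular library):

```python
def rmsprop_step(w, grad, s, alpha=0.01, beta=0.9, eps=1e-8):
    """One illustrative RMSProp update for a scalar parameter w."""
    s = beta * s + (1 - beta) * grad ** 2    # running average of squared gradients
    w = w - alpha * grad / (s ** 0.5 + eps)  # scale the step by the root of that average
    return w, s
```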
Motivating RMSProp
- Let's say we're performing standard gradient descent to optimize our parameters $W$ and $b$
- After running many iterations, our contour could look like:
- Note, darker regions of the contour plot represent a smaller cost
- Although we're able to find the optimal parameters by minimizing the cost function, we can imagine that gradient descent is slowed down due to the up and down oscillations
- This is slowed down for two reasons:
    - We're not able to use a larger learning rate, since there's a better chance of overshooting the global minimum
    - The small steps lead us to focus too much on moving up and down relative to the minimum, rather than sideways toward the minimum
- Mitigating the first problem involves scaling our data properly
- For now, let's focus on the second problem
- We can mitigate the second problem by slowing down the up-and-down learning, while speeding up the sideways learning towards the global minimum
- This is accomplished using gradient descent with rmsprop
Defining Gradient Descent with RMSProp
- Compute $dW$ and $db$ on the current mini-batch
- Where $S_{dW} = \beta S_{dW} + (1 - \beta)\, dW^{2}$
- Where $S_{db} = \beta S_{db} + (1 - \beta)\, db^{2}$
- Update each parameter as the following: $W := W - \alpha \frac{dW}{\sqrt{S_{dW}} + \epsilon}$
- Update each parameter as the following: $b := b - \alpha \frac{db}{\sqrt{S_{db}} + \epsilon}$
- Repeat the above steps for each iteration of mini-batch gradient descent
- Note that our hyperparameters are $\alpha$ and $\beta$ here
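- To make these equations concrete, here is a small NumPy sketch of the full loop on a made-up linear-regression problem (the data, model, and hyperparameter values are illustrative only):

```python
import numpy as np

# Made-up data for a linear model y = X @ w_true + b_true (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true, b_true = np.array([2.0, -1.0, 0.5]), 0.3
y = X @ w_true + b_true + 0.01 * rng.normal(size=200)

# Parameters and RMSProp running averages.
W, b = np.zeros(3), 0.0
S_dW, S_db = np.zeros(3), 0.0
alpha, beta, eps = 0.01, 0.9, 1e-8
batch_size = 32

for epoch in range(100):
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = order[start:start + batch_size]
        Xb, yb = X[batch], y[batch]

        # Compute dW and db on the current mini-batch (mean squared error loss).
        err = Xb @ W + b - yb
        dW = 2 * Xb.T @ err / len(batch)
        db = 2 * err.mean()

        # S_dW = beta * S_dW + (1 - beta) * dW^2, element-wise (likewise for b).
        S_dW = beta * S_dW + (1 - beta) * dW ** 2
        S_db = beta * S_db + (1 - beta) * db ** 2

        # W := W - alpha * dW / (sqrt(S_dW) + eps), likewise for b.
        W -= alpha * dW / (np.sqrt(S_dW) + eps)
        b -= alpha * db / (np.sqrt(S_db) + eps)

print(W, b)  # should end up close to w_true and b_true
```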
Interpreting Gradient Descent with RMSProp
- Note, the squaring $dW^{2}$ is an element-wise operation, since $dW$ represents a vector of weights
- Also note, $\epsilon$ represents some small number that ensures numerical stability
- The default value of $\epsilon$ is typically $10^{-8}$
- This value is essentially negligible for interpretation purposes (the element-wise squaring and the role of $\epsilon$ are both illustrated in the sketch after this list)
- Essentially, we gain a smoothing benefit by applying exponentially weighted averages to the squared gradients in our gradient descent updates
- Therefore, the steps should become smoother as we increase $\beta$
- This is because we're averaging over roughly the previous $\frac{1}{1-\beta}$ iterations of gradients
- As a result, we shouldn't be bouncing up and down around the same parameter value as much, since the partial derivatives in that direction are large, so their squared average $S_{db}$ is large and the update gets divided by a large $\sqrt{S_{db}}$
- And, we should be quickly moving sideways towards the optimal parameter value, since the partial derivatives in that direction are small, so their squared average $S_{dW}$ is small and the update gets divided by a small $\sqrt{S_{dW}}$
- We're also gaining a smoothness benefit from the $\sqrt{S_{dW}}$ and $\sqrt{S_{db}}$ terms
- This is because we will get a dampening-out effect if $S_{dW}$ (or $S_{db}$) is large
- On the other hand, we will quickly gravitate toward the optimal parameter values if $S_{dW}$ (or $S_{db}$) is small
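- As a rough, self-contained sketch of these points (all numbers here are made up): the squaring is element-wise, $\epsilon$ only matters when $\sqrt{S_{dW}}$ is near zero, and a larger $\beta$ produces a smoother-moving average:

```python
import numpy as np

# Element-wise squaring and the role of eps (illustrative values only).
dW = np.array([0.5, -0.02, 0.0])          # gradient vector for a layer's weights
S_dW = 0.9 * np.zeros(3) + 0.1 * dW ** 2  # ** 2 squares each component separately
eps = 1e-8
step = dW / (np.sqrt(S_dW) + eps)         # eps keeps the last component from dividing by zero

# A larger beta averages over more past squared gradients (roughly 1 / (1 - beta) of them),
# so the running average S moves more smoothly.
rng = np.random.default_rng(1)
grads = rng.normal(size=200)              # a noisy, made-up stream of gradients

def running_average(beta):
    s, history = 0.0, []
    for g in grads:
        s = beta * s + (1 - beta) * g ** 2
        history.append(s)
    return np.array(history)

print(running_average(0.5).std(), running_average(0.99).std())  # larger beta -> smaller spread
```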
Lower-Level Intuition behind RMSProp
- If $dW$ (or $db$) is large:
    - It is squared, giving us a larger $S_{dW}$ (or $S_{db}$) and therefore a larger denominator
    - This makes the learning rate coefficient $\frac{\alpha}{\sqrt{S_{dW}} + \epsilon}$ smaller
    - Meaning, our update becomes smaller
- If $dW$ (or $db$) is small:
    - It is squared, giving us a smaller $S_{dW}$ (or $S_{db}$) and therefore a smaller denominator
    - This makes the learning rate coefficient $\frac{\alpha}{\sqrt{S_{dW}} + \epsilon}$ larger
    - Meaning, our update becomes larger
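- Plugging in some made-up numbers shows this effect: compared with a plain gradient step $\alpha \cdot dW$, RMSProp shrinks the update where the gradient is large and enlarges it where the gradient is small:

```python
alpha, beta, eps = 0.01, 0.9, 1e-8

for grad in (10.0, 0.1):                   # a large vs a small partial derivative
    s = (1 - beta) * grad ** 2             # S after one update, starting from S = 0
    plain = alpha * grad                   # standard gradient descent step
    rms = alpha * grad / (s ** 0.5 + eps)  # RMSProp step
    print(f"grad={grad}: plain step={plain:.4f}, rmsprop step={rms:.4f}")
```

- In this toy example both RMSProp steps come out the same size, which is the point: the large-gradient direction is damped relative to the plain step, while the small-gradient direction is boosted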
Intuition behind Momentum and RMSProp
- Momentum adds momentum to parameter updates by heavily weighting the average of the previous (recent) gradients
- RMSProp adds resistance to parameter updates by standardizing each gradient with respect to $\sqrt{S_{dW}}$, the root of the average of the previous (recent) squared gradients
- Where $dW$ represents the current update direction (the gradient)
- Where $S_{dW}$ is roughly the average of the previous squared gradients
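- A quick way to see the difference is to feed both running averages an oscillating gradient stream (+1, -1, +1, ...); the momentum average cancels toward zero, while the RMSProp squared average approaches one and is used to divide the step (the values below are illustrative):

```python
beta = 0.9
v = s = 0.0
for t in range(20):
    grad = 1.0 if t % 2 == 0 else -1.0     # oscillating partial derivative
    v = beta * v + (1 - beta) * grad       # momentum: signed average shrinks toward 0
    s = beta * s + (1 - beta) * grad ** 2  # RMSProp: squared average approaches 1
print(round(v, 3), round(s, 3))
```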
tldr
- Gradient descent with rmsprop can be thought of as standard gradient descent that has been given a short-term memory
- Gradient descent with RMSProp attempts to average out the oscillations around the same value (in a given direction) by applying exponentially weighted averages of the squared gradients to standard gradient descent
- It also gains a smoothness benefit from updating parameters using the $\sqrt{S_{dW}}$ and $\sqrt{S_{db}}$ terms
- Gradient descent with RMSProp typically converges faster than the standard gradient descent algorithm