Introducing Momentum
- Gradient descent with momentum is almost always faster than the standard gradient descent algorithm
- Essentially, gradient descent with momentum involves computing an exponentially weighted average of the gradients
- Then, we would use those weighted gradients to update our parameters, instead of using the standard gradients
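- For reference, an exponentially weighted average of a sequence of gradients takes the standard form below (written out here for clarity; the symbol $g_t$ for the gradient at step $t$ is my notation, not something from these notes):

$$v_t = \beta v_{t-1} + (1 - \beta) g_t, \qquad v_0 = 0$$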
Motivating Momentum
- Let's say we're performing standard gradient descent to optimize our parameters $W$ and $b$
- After running many iterations, our contour could look like:
- Note that darker regions of the contour represent a smaller cost
- Although we're able to find the optimal parameters by minimizing the cost function, we can imagine that gradient descent is slowed down due to the up and down oscillations
- This is slowed down for two reasons:
- We're not able to use a larger learning rate, since that would increase the chance of overshooting the global minimum
- The small steps cause us to spend most of our movement oscillating up and down relative to the minimum, rather than moving sideways toward the minimum
- Mitigating the first problem involves scaling our data properly
- For now, let's focus on the second problem
- We can mitigate the second problem by damping the up-and-down learning, while speeding up the sideways learning toward the global minimum
- This is accomplished using gradient descent with momentum
Defining Gradient Descent with Momentum
- Compute $dW$ and $db$ on the current mini-batch
	- Where $v_{dW} = \beta v_{dW} + (1-\beta) dW$
	- Where $v_{db} = \beta v_{db} + (1-\beta) db$
- Update each parameter $W$ as the following: $W := W - \alpha v_{dW}$
- Update each parameter $b$ as the following: $b := b - \alpha v_{db}$
- Repeat the above steps for each mini-batch iteration
- Note that our hyperparameters are $\alpha$ and $\beta$ here
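- As a concrete illustration, below is a minimal NumPy sketch of a single momentum update; the function name `momentum_update`, the toy parameter shapes, and the default hyperparameter values are illustrative assumptions rather than anything from the original notes

```python
import numpy as np

def momentum_update(params, grads, velocities, alpha=0.01, beta=0.9):
    """One step of gradient descent with momentum.

    params, grads, and velocities are dicts keyed by parameter name,
    e.g. {"W": ..., "b": ...}; velocities should start as zeros.
    """
    for key in params:
        # Exponentially weighted average of the gradients: v = beta*v + (1-beta)*grad
        velocities[key] = beta * velocities[key] + (1 - beta) * grads[key]
        # Update the parameter using the smoothed gradient: param = param - alpha*v
        params[key] = params[key] - alpha * velocities[key]
    return params, velocities

# Toy usage: stand-in parameters and mini-batch gradients
params = {"W": np.random.randn(3, 2), "b": np.zeros((3, 1))}
velocities = {k: np.zeros_like(v) for k, v in params.items()}
grads = {"W": np.random.randn(3, 2), "b": np.random.randn(3, 1)}
params, velocities = momentum_update(params, grads, velocities, alpha=0.01, beta=0.9)
```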
Interpreting Gradient Descent with Momentum
- Essentially, we gain the benefit of smoothing by applying exponentially weighted averages to the gradients used in our updates
- Therefore, the steps should become smoother as we increase $\beta$
- This is because we're averaging over roughly the previous $\frac{1}{1-\beta}$ iterations of gradients
- As a result, we shouldn't be bouncing up and down around the same parameter value as much, since the average of these partial derivatives is essentially $0$
- And, we should be quickly moving sideways towards the optimal parameter value, since the average of these partial derivatives is much larger than $0$
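- To make the averaging argument concrete, here is a small hypothetical check: a gradient component that flips sign every mini-batch averages out to roughly zero, while one that keeps the same sign stays large (the specific values are made up for illustration)

```python
def ewa(values, beta=0.9):
    """Exponentially weighted average of a sequence (no bias correction)."""
    v = 0.0
    for x in values:
        v = beta * v + (1 - beta) * x
    return v

# "Vertical" gradient oscillates in sign; "horizontal" gradient is consistent
vertical_grads = [1.0, -1.0] * 5
horizontal_grads = [1.0] * 10

print(ewa(vertical_grads))    # close to 0 (about -0.03) -> the up-and-down steps mostly cancel
print(ewa(horizontal_grads))  # about 0.65 -> steady sideways progress toward the minimum
```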
tldr
- Gradient descent with momentum can be thought of as standard gradient descent that has been given a short-term memory
- Gradient descent with momentum attempts to average out the oscillations around the same value (in a direction) by applying exponentially weighted averages to standard gradient descent
- Gradient descent with momentum is almost always faster than the standard gradient descent algorithm