Describing Adam Optimization
- Adam stands for adaptive moment estimation
- Adam is almost always more performant than the standard gradient descent algorithm
- Adam is an optimization algorithm that essentially combines momentum and rmsprop
- Adam is an extension of mini-batch gradient descent
- Mini-batch gradient descent is a configuration of stochastic gradient descent
- Therefore, we'll sometimes see that Adam is an extension of stochastic gradient descent
- Adam uses the hyperparameters $\alpha$, $\beta_{1}$, $\beta_{2}$, and $\varepsilon$
- Adam is generally regarded as being fairly robust to the choice of hyperparameters, though the learning rate sometimes needs to be changed from the suggested default
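For reference, the plain mini-batch gradient descent update that Adam extends is just a step along the current gradient. A minimal NumPy sketch, with illustrative names:

```python
import numpy as np

def gradient_descent_step(W, dW, alpha=0.01):
    """One plain mini-batch gradient descent update: W := W - alpha * dW."""
    return W - alpha * dW

# Example: a single update on a small weight vector
W = np.array([0.5, -1.2, 3.0])
dW = np.array([0.1, -0.4, 0.2])   # stand-in for the current mini-batch gradient
W = gradient_descent_step(W, dW)
```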
Defining Adam Optimization
- Compute $v_{dW}$ and $s_{dW}$ on the current mini-batch
- Where $v_{dW} = \beta_{1}v_{dW} + (1-\beta_{1})dW$
- Where $s_{dW} = \beta_{2}s_{dW} + (1-\beta_{2})dW^{2}$
- Update each parameter $W$ as the following: $W := W - \alpha \frac{v_{dW}^{corrected}}{\sqrt{s_{dW}^{corrected}} + \varepsilon}$
- Update each parameter $b$ as the following: $b := b - \alpha \frac{v_{db}^{corrected}}{\sqrt{s_{db}^{corrected}} + \varepsilon}$
- Where $v_{dW}^{corrected} = \frac{v_{dW}}{1-\beta_{1}^{t}}$ and $s_{dW}^{corrected} = \frac{s_{dW}}{1-\beta_{2}^{t}}$ are the bias-corrected terms at iteration $t$ (and similarly for $b$)
- Repeat the above steps for each iteration of mini-batch gradient descent
- Note that our hyperparameters are $\alpha$, $\beta_{1}$, $\beta_{2}$, and $\varepsilon$ here
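A minimal NumPy sketch of one Adam update following the equations above; the function and variable names are illustrative, and the hyperparameter defaults assumed here are the commonly suggested ones discussed in the next section:

```python
import numpy as np

def adam_step(W, dW, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on the current mini-batch gradient dW.

    v, s -- running first and second moments of dW (carried across iterations)
    t    -- 1-based iteration count, used for bias correction
    """
    v = beta1 * v + (1 - beta1) * dW          # momentum-style mean of dW
    s = beta2 * s + (1 - beta2) * dW ** 2     # rmsprop-style mean of dW^2 (element-wise)

    v_corrected = v / (1 - beta1 ** t)        # bias-corrected first moment
    s_corrected = s / (1 - beta2 ** t)        # bias-corrected second moment

    W = W - alpha * v_corrected / (np.sqrt(s_corrected) + eps)
    return W, v, s

# Usage: carry v, s, and t across mini-batch iterations
W = np.array([0.5, -1.2, 3.0])
v, s = np.zeros_like(W), np.zeros_like(W)
for t in range(1, 6):
    dW = np.array([0.1, -0.4, 0.2])           # stand-in for the current mini-batch gradient
    W, v, s = adam_step(W, dW, v, s, t)
```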
Adam Hyperparameters
- The hyperparameter $\beta_{1}$ refers to computing the mean of $dW$ (the first moment)
- The hyperparameter $\beta_{2}$ refers to computing the mean of $dW^{2}$ (the second moment)
- The hyperparameter $\alpha$ typically needs to be tuned
- The hyperparameter $\varepsilon$ is never tuned
- The hyperparameter $\beta_{1}$ typically doesn't need to be tuned
- The default value is typically left unchanged
- The default value of $\beta_{1}$ is $0.9$
- The hyperparameter $\beta_{2}$ typically doesn't need to be tuned
- The default value is typically left unchanged
- The default value of $\beta_{2}$ is $0.999$
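As a sketch, the suggested defaults can be collected in one place; the $\alpha$ value below is an assumption (a common starting point), since the learning rate is the one hyperparameter that typically gets tuned:

```python
# Suggested Adam defaults, plus the rough "window" each beta implies via the
# 1 / (1 - beta) rule of thumb for exponentially weighted averages
adam_defaults = {
    "alpha": 0.001,  # learning rate: assumed starting point; typically tuned per problem
    "beta1": 0.9,    # first-moment decay: averages over roughly 1 / (1 - 0.9) = 10 past gradients
    "beta2": 0.999,  # second-moment decay: averages over roughly 1 / (1 - 0.999) = 1000 past squared gradients
    "eps": 1e-8,     # numerical-stability term: essentially never tuned
}

# These drop straight into the adam_step sketch from the previous section, e.g.:
# W, v, s = adam_step(W, dW, v, s, t, **adam_defaults)
```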
Interpreting Adam Optimization
- Note, the $dW^{2}$ is an element-wise squaring operation, since $W$ represents a vector of weights
- Also note, $\varepsilon$ represents some small number that ensures numerical stability
- The default value of this is typically $10^{-8}$
- This value is essentially negligible for interpretation purposes
- Essentially, we're gaining the benefit of smoothness by applying exponentially weighted averages to our gradient descent updates
- Therefore, the steps should become smoother as we increase $\beta_{1}$
- This is because we're averaging over roughly the previous $\frac{1}{1-\beta_{1}}$ iterations' steps
- As a result, we shouldn't be bouncing up and down around the same parameter value as much, since the average of these partial derivatives is essentially $0$
- And, we should be quickly moving sideways towards the optimal parameter value, since the average of these partial derivatives is much larger than $0$
- We're also gaining a smoothness benefit from the $\frac{v_{dW}^{corrected}}{\sqrt{s_{dW}^{corrected}} + \varepsilon}$ term
- This is essentially a ratio of the bias-corrected momentum term to the bias-corrected rmsprop terms
- Therefore, we will get a dampening-out effect if:
- The $\sqrt{s_{dW}^{corrected}}$ term is very large
- The $v_{dW}^{corrected}$ term is very small
- On the other hand, we will quickly gravitate toward the optimal parameter values if either of the following is true:
- The $\sqrt{s_{dW}^{corrected}}$ term is very small
- The $v_{dW}^{corrected}$ term is very large
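This dampening intuition can be checked numerically with a small sketch (illustrative names, scalar gradients): a gradient that keeps flipping sign drives the bias-corrected ratio toward $0$, while a gradient that keeps pointing the same way keeps the ratio near $1$:

```python
import numpy as np

def adam_ratio(grads, beta1=0.9, beta2=0.999, eps=1e-8):
    """Bias-corrected v / (sqrt(s) + eps) ratio after feeding in a sequence of gradients."""
    v = s = 0.0
    for t, g in enumerate(grads, start=1):
        v = beta1 * v + (1 - beta1) * g        # momentum-style running mean of the gradient
        s = beta2 * s + (1 - beta2) * g ** 2   # rmsprop-style running mean of the squared gradient
    v_corrected = v / (1 - beta1 ** t)
    s_corrected = s / (1 - beta2 ** t)
    return v_corrected / (np.sqrt(s_corrected) + eps)

oscillating = [0.5, -0.5] * 20  # gradient keeps flipping sign around the same parameter value
consistent = [0.5] * 40         # gradient keeps pointing in the same direction

print(adam_ratio(oscillating))  # small in magnitude: the oscillation gets dampened out
print(adam_ratio(consistent))   # close to 1: steady progress toward the optimum
```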
tldr
- Adam optimization can be thought of as standard gradient descent that has been given a short-term memory
- Adam optimization attempts to average out the oscillations around the same value (in a given direction) by applying exponentially weighted averages to standard gradient descent
- It also gains a smoothness benefit from updating parameters using the bias-corrected $\frac{v_{dW}^{corrected}}{\sqrt{s_{dW}^{corrected}} + \varepsilon}$ terms
- Adam optimization is almost always faster than the standard gradient descent algorithm
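As a rough, toy check of the speed claim, the sketch below runs plain gradient descent and the `adam_step` sketch from earlier (assumed to be in scope) on a badly scaled quadratic and prints the final loss of each; the quadratic, learning rates, and iteration count are arbitrary illustrative choices:

```python
import numpy as np

def loss_and_grad(W, scales=np.array([100.0, 1.0])):
    """Toy badly scaled quadratic: loss = 0.5 * sum(scales * W**2), grad = scales * W."""
    return 0.5 * np.sum(scales * W ** 2), scales * W

W_gd = np.array([1.0, 1.0])
W_adam = np.array([1.0, 1.0])
v, s = np.zeros(2), np.zeros(2)

for t in range(1, 101):
    _, dW = loss_and_grad(W_gd)
    W_gd = W_gd - 0.01 * dW                                   # plain gradient descent update
    _, dW = loss_and_grad(W_adam)
    W_adam, v, s = adam_step(W_adam, dW, v, s, t, alpha=0.1)  # Adam update from the earlier sketch

print("gradient descent loss:", loss_and_grad(W_gd)[0])
print("adam loss:            ", loss_and_grad(W_adam)[0])
```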