Describing Adam Optimization
- Adam stands for adaptive moment estimation
 - Adam is almost always more performant than the standard gradient descent algorithm
 - Adam is an optimization algorithm that essentially combines momentum and RMSprop (see the sketch after this list)
- Adam is an extension of mini-batch gradient descent
  - Mini-batch gradient descent is a configuration of stochastic gradient descent
  - Therefore, we'll sometimes see Adam described as an extension of stochastic gradient descent
 
 - Adam uses the hyperparameters $\alpha$, $\beta_1$, $\beta_2$, and $\epsilon$
 - Adam is generally regarded as being fairly robust to the choice of hyperparameters, though the learning rate sometimes needs to be changed from the suggested default
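
For reference, a minimal sketch of the two update rules Adam combines, written with the same $W$, $dW$, $\alpha$, $\beta_1$, $\beta_2$, $\epsilon$ notation used in the definition below (these are the standard momentum and RMSprop formulations, not quoted from a specific source):

$$\text{Momentum:}\quad v_{dW} = \beta_1 v_{dW} + (1 - \beta_1)\, dW, \qquad W := W - \alpha\, v_{dW}$$

$$\text{RMSprop:}\quad s_{dW} = \beta_2 s_{dW} + (1 - \beta_2)\, dW^2, \qquad W := W - \alpha\, \frac{dW}{\sqrt{s_{dW}} + \epsilon}$$

Adam maintains both averages for the same gradient, then uses the momentum average as the numerator and the RMSprop average as the denominator of its update.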
 
Defining Adam Optimization
- Compute $dW$ and $db$ on the current mini-batch
  - Where $v_{dW} = \beta_1 v_{dW} + (1 - \beta_1)\, dW$ and $v_{db} = \beta_1 v_{db} + (1 - \beta_1)\, db$
  - Where $s_{dW} = \beta_2 s_{dW} + (1 - \beta_2)\, dW^2$ and $s_{db} = \beta_2 s_{db} + (1 - \beta_2)\, db^2$
- Update each parameter as the following: $W := W - \alpha\, \frac{v_{dW}^{corrected}}{\sqrt{s_{dW}^{corrected}} + \epsilon}$
- Update each parameter as the following: $b := b - \alpha\, \frac{v_{db}^{corrected}}{\sqrt{s_{db}^{corrected}} + \epsilon}$
  - Where $v^{corrected} = \frac{v}{1 - \beta_1^t}$ and $s^{corrected} = \frac{s}{1 - \beta_2^t}$ are the bias-corrected averages at iteration $t$
- Repeat the above steps for each mini-batch iteration
- Note that our hyperparameters are $\alpha$, $\beta_1$, $\beta_2$, and $\epsilon$ here
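
As a concrete illustration of the steps above, here is a minimal NumPy sketch of a single Adam update for one parameter vector $W$ (the name adam_step and the variables v, s, and t are just illustrative, not from any particular library):

```python
import numpy as np

def adam_step(W, dW, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameter vector W, given gradient dW from the current mini-batch."""
    # Momentum-style exponentially weighted average of the gradients
    v = beta1 * v + (1 - beta1) * dW
    # RMSprop-style exponentially weighted average of the squared gradients (element-wise square)
    s = beta2 * s + (1 - beta2) * dW ** 2
    # Bias correction, since v and s are initialized at zero
    v_corrected = v / (1 - beta1 ** t)
    s_corrected = s / (1 - beta2 ** t)
    # Update: ratio of the corrected momentum term to the corrected RMSprop term
    W = W - alpha * v_corrected / (np.sqrt(s_corrected) + eps)
    return W, v, s
```

Here, v and s start as zero vectors (e.g. np.zeros_like(W)), t counts mini-batch iterations starting at 1, and the same update is applied to $b$ using $db$.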
 
 
Adam Hyperparameters
- The hyperparameter $\beta_1$ refers to computing the mean of $dW$ (the first moment)
- The hyperparameter $\beta_2$ refers to computing the mean of $dW^2$ (the second moment)
- The hyperparameter $\alpha$ typically needs to be tuned
- The hyperparameter $\epsilon$ is never tuned
- The hyperparameter $\beta_1$ typically doesn't need to be tuned
  - The default value is typically left unchanged
  - The default value of $\beta_1$ is $0.9$
- The hyperparameter $\beta_2$ typically doesn't need to be tuned
  - The default value is typically left unchanged
  - The default value of $\beta_2$ is $0.999$
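
As an aside, most deep learning frameworks expose exactly these defaults; for example, a minimal PyTorch sketch (the Linear model is only a placeholder so there are parameters to optimize):

```python
import torch

# Placeholder model, only here so we have parameters to hand to the optimizer
model = torch.nn.Linear(10, 1)

# lr (alpha) is the hyperparameter that typically needs tuning;
# betas (beta_1, beta_2) and eps are usually left at these defaults
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999), eps=1e-8)
```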
 
 
Interpreting Adam Optimization
- Note, the squaring in $dW^2$ (and the square root in the update) is an element-wise operation, since $W$ represents a vector of weights
- Also note, $\epsilon$ represents some small number that ensures numerical stability
  - The default value of this is typically $10^{-8}$
  - This value is essentially negligible for interpretation purposes
 
 - Essentially, we gain a smoothness benefit by applying exponentially weighted averages to our gradient descent updates
 - Therefore, the steps should become smoother as we increase $\beta_1$
 - This is because we're averaging over more of the previous iterations' steps
 - As a result, we shouldn't be bouncing up and down around the same parameter value as much, since the average of these partial derivatives is essentially $0$
 - And, we should be quickly moving sideways towards the optimal parameter value, since the average of these partial derivatives is much larger than $0$
 - We're also gaining a smoothness benefit from the $s_{dW}$ and $s_{db}$ terms
 - The update is essentially a ratio of the bias-corrected momentum term to the bias-corrected RMSprop term
- Therefore, we will get a dampening-out effect if:
  - The $\sqrt{s^{corrected}}$ term is very large
  - The $v^{corrected}$ term is very small
- On the other hand, we will quickly gravitate toward the optimal parameter values if either of the following is true (see the toy example after this list):
  - The $\sqrt{s^{corrected}}$ term is very small
  - The $v^{corrected}$ term is very large
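
As a toy numerical example of these two cases (the $v^{corrected}$ and $s^{corrected}$ values below are made up for illustration, not taken from the notes):

```python
import numpy as np

alpha, eps = 0.001, 1e-8

# Oscillating direction: signed gradients mostly cancel, so v is tiny,
# while the squared gradients do not cancel, so s stays large -> tiny, dampened step
v_osc, s_osc = 0.02, 1.0
print(alpha * v_osc / (np.sqrt(s_osc) + eps))  # ~2e-5

# Consistent direction: v stays large while s is comparatively small -> much larger step
v_con, s_con = 0.5, 0.25
print(alpha * v_con / (np.sqrt(s_con) + eps))  # ~1e-3
```

The consistent direction ends up with a step roughly 50 times larger, which is the dampening and acceleration behavior described above.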
 
 
tldr
- Adam optimization can be thought of as standard gradient descent that has been given a short-term memory
 - Adam optimization attempts to average out the oscillations around the same value (in a given direction) by applying exponentially weighted averages to standard gradient descent
 - It also gains a smoothness benefit from updating parameters using the $s$ (RMSprop) terms
 - Adam optimization is almost always faster than the standard gradient descent algorithm
 