Motivating Learning Rate Decay
- Sometimes, assigning a large learning rate can cause our optimization algorithm to overshoot the minimum
 - On the other hand, assigning a small learning rate can cause our optimization algorithm to take steps so small that it never reaches the minimum
 - In either situation, we may never converge to the optimal parameters that minimize our cost function
 
Illustrating Learning Rate Decay
- To illustrate this point, let's say we're implementing mini-batch gradient descent with reasonably small batch sizes and a somewhat large learning rate
 - As we iterate, we'll notice our steps heading towards the minimum in the beginning
 - However, once we approach the minimum, we may start to wander around it and never converge to the optimal parameters
 - On the other hand, if we start with a somewhat small learning rate, we may never reach the minimum at all
 - Slowly reducing the learning rate over time can help improve the performance of an optimization algorithm
 
Defining Standard Learning Rate Decay
- Compute the following learning rate for the current epoch:

  $$\alpha = \frac{1}{1 + \text{decay\_rate} \times \text{epoch\_num}} \cdot \alpha_0$$

 - Repeat this step for each epoch
 
Describing the Hyperparameters
- The hyperparameter $\alpha$ represents the learning rate for the current epoch
 - The hyperparameter $\alpha_0$ represents an initial learning rate
 - The hyperparameter $\text{decay\_rate}$ represents the rate of decay
 - The hyperparameters $\alpha_0$ and $\text{decay\_rate}$ are fixed for each epoch
 - The hyperparameter $\alpha$ is obviously different for each epoch
 - We typically tune the hyperparameters $\alpha_0$ and $\text{decay\_rate}$
 
Example of the Standard Learning Rate Decay
| epoch | $\alpha$ | $\alpha_0$ | $\text{decay\_rate}$ |
|---|---|---|---|
| 1 | 0.1 | 0.2 | 1 |
| 2 | 0.067 | 0.2 | 1 |
| 3 | 0.05 | 0.2 | 1 |
| 4 | 0.04 | 0.2 | 1 |
| ... | ... | 0.2 | 1 |
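To make the schedule concrete, here's a minimal Python sketch that reproduces the table above (the function name `standard_decay` is just an illustrative choice):

```python
def standard_decay(alpha_0, decay_rate, epoch_num):
    """Standard learning rate decay: alpha = alpha_0 / (1 + decay_rate * epoch_num)."""
    return alpha_0 / (1 + decay_rate * epoch_num)

# Reproduce the table: alpha_0 = 0.2, decay_rate = 1
for epoch in range(1, 5):
    alpha = standard_decay(alpha_0=0.2, decay_rate=1, epoch_num=epoch)
    print(f"epoch {epoch}: alpha = {alpha:.3f}")  # 0.100, 0.067, 0.050, 0.040
```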
Defining Exponential Learning Rate Decay
- Compute the following learning rate for the current epoch:

  $$\alpha = \text{decay\_rate}^{\,\text{epoch\_num}} \cdot \alpha_0$$

 - Repeat this step for each epoch
- Where the hyperparameter $\text{decay\_rate}$ is the exponential rate of decay, a value less than $1$ (see the sketch below)
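A minimal sketch of this schedule, assuming an illustrative decay rate of 0.95 and the same initial learning rate of 0.2 as in the table above:

```python
def exponential_decay(alpha_0, decay_rate, epoch_num):
    """Exponential learning rate decay: alpha shrinks geometrically with the epoch number."""
    return (decay_rate ** epoch_num) * alpha_0

# Illustrative values: decay_rate must be less than 1 for the learning rate to shrink
for epoch in range(1, 5):
    alpha = exponential_decay(alpha_0=0.2, decay_rate=0.95, epoch_num=epoch)
    print(f"epoch {epoch}: alpha = {alpha:.4f}")  # 0.1900, 0.1805, 0.1715, 0.1629
```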
 
 
Other Forms of Learning Rate Decay
- There are other forms of learning rate decay
 - These all do essentially the same thing, but use different functions to express the rate of decay applied to the learning rate
 - Another example is the discrete staircase decay function, which cuts the learning rate by some factor every fixed number of epochs
 - We could also use the following:

  $$\alpha = \frac{k}{\sqrt{\text{epoch\_num}}} \cdot \alpha_0$$

- Here, $\text{epoch\_num}$ refers to the current epoch
 - And, the hyperparameter $k$ refers to some constant
 - We can also use a similar method based on mini-batches:

  $$\alpha = \frac{k}{\sqrt{t}} \cdot \alpha_0$$

- Here, $t$ refers to the current mini-batch number (both variants are sketched below)
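A minimal sketch of a couple of these variants; the function names, the staircase interval of 10 epochs, and the halving factor are assumptions for illustration, not values from these notes:

```python
import math

def sqrt_decay(alpha_0, k, epoch_num):
    """Decay proportional to 1 / sqrt(epoch_num); swap in the mini-batch number t for the mini-batch variant."""
    return (k / math.sqrt(epoch_num)) * alpha_0

def staircase_decay(alpha_0, epoch_num, interval=10, factor=0.5):
    """Discrete staircase decay: cut the learning rate by `factor` every `interval` epochs."""
    return alpha_0 * (factor ** (epoch_num // interval))

# Illustrative usage with alpha_0 = 0.2 and k = 1
print(sqrt_decay(alpha_0=0.2, k=1, epoch_num=4))   # 0.1
print(staircase_decay(alpha_0=0.2, epoch_num=25))  # 0.05 (halved twice after 25 epochs)
```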
 
tldr
- Slowly reducing the learning rate over time can help improve the performance of an optimization algorithm
 - In other words, we want to start off with a large learning rate so that we quickly approach the minimum in the beginning
 - Then, we want to end with a small learning rate as we approach the minimum, so that our steps aren't big enough to overshoot it
 - The standard formula for learning rate decay is:

  $$\alpha = \frac{1}{1 + \text{decay\_rate} \times \text{epoch\_num}} \cdot \alpha_0$$

- The hyperparameter $\alpha$ represents the learning rate for the current epoch
 - The hyperparameter $\alpha_0$ represents an initial learning rate
 - The hyperparameter $\text{decay\_rate}$ represents the rate of decay
 