Motivating Learning Rate Decay
- Sometimes, assigning a large learning rate can cause our optimization algorithm to overshoot the minimum
- On the other hand, assigning a small learning rate can cause our optimization algorithm to never reach the minimum, since our steps become too small
- In either situation, we may never converge to the optimal parameters that minimize our cost function
Illustrating Learning Rate Decay
- To illustrate this point, let's say we're implementing mini-batch gradient descent with reasonably small batch sizes and a somewhat large learning rate
- As we iterate, we'll notice our steps drifting towards the minimum in the beginning
- However, once we approach the minimum, we may start to wander around it and never converge to the optimal parameters
- On the other hand, we may never converge to the minimum if we start with a somewhat small learning rate
- Slowly reducing the learning rate over time can help improve the performance of an optimization algorithm
Defining Standard Learning Rate Decay
- Compute $\alpha = \frac{1}{1 + \text{decay\_rate} \cdot \text{epoch\_num}} \cdot \alpha_0$ for the current epoch
- Repeat this step for each epoch
Describing the Hyperparameters
- The hyperparameter $\alpha$ represents the learning rate for the current epoch
- The hyperparameter $\alpha_0$ represents an initial learning rate
- The hyperparameter $\text{decay\_rate}$ represents the rate of decay
- The hyperparameters $\alpha_0$ and $\text{decay\_rate}$ are fixed across epochs
- The hyperparameter $\alpha$ is obviously different for each epoch
- We typically tune the hyperparameters $\alpha_0$ and $\text{decay\_rate}$
Example of the Standard Learning Rate Decay
epoch | $\alpha$ | $\alpha_0$ | $\text{decay\_rate}$
---|---|---|---
1 | 0.1 | 0.2 | 1
2 | 0.067 | 0.2 | 1
3 | 0.05 | 0.2 | 1
4 | 0.04 | 0.2 | 1
... | ... | 0.2 | 1
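- As a rough sanity check, here's a minimal Python sketch of this schedule; the function name `standard_decay` is just illustrative, and the loop reproduces the table above with $\alpha_0 = 0.2$ and $\text{decay\_rate} = 1$:

```python
def standard_decay(alpha_0, decay_rate, epoch_num):
    """Standard learning rate decay: alpha = alpha_0 / (1 + decay_rate * epoch_num)."""
    return alpha_0 / (1 + decay_rate * epoch_num)

# Reproduce the table above with alpha_0 = 0.2 and decay_rate = 1
for epoch in range(1, 5):
    alpha = standard_decay(alpha_0=0.2, decay_rate=1, epoch_num=epoch)
    print(f"epoch {epoch}: alpha = {alpha:.3f}")  # 0.100, 0.067, 0.050, 0.040
```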
Defining Exponential Learning Rate Decay
- Compute $\alpha = \text{decay\_rate}^{\text{epoch\_num}} \cdot \alpha_0$ for the current epoch
- Repeat this step for each epoch
- Where the hyperparameter $\text{decay\_rate}$ is the exponential rate of decay (a value less than $1$)
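- A minimal Python sketch of exponential decay; the values $\alpha_0 = 0.2$ and $\text{decay\_rate} = 0.95$ are only assumed for illustration:

```python
def exponential_decay(alpha_0, decay_rate, epoch_num):
    """Exponential learning rate decay: alpha = decay_rate ** epoch_num * alpha_0."""
    return (decay_rate ** epoch_num) * alpha_0

# Assumed example values: alpha_0 = 0.2, decay_rate = 0.95
for epoch in range(1, 4):
    alpha = exponential_decay(alpha_0=0.2, decay_rate=0.95, epoch_num=epoch)
    print(f"epoch {epoch}: alpha = {alpha:.4f}")  # 0.1900, 0.1805, 0.1715
```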
Other Forms of Learning Rate Decay
- There are other forms of learning rate decay
- These essentially do the same thing, but use different functions to express different rates of decay applied to the learning rate
- Another example is the discrete staircase decay function, which cuts the learning rate by a fixed factor every several epochs
- We could also use the following: $\alpha = \frac{k}{\sqrt{\text{epoch\_num}}} \cdot \alpha_0$
- Here, $\text{epoch\_num}$ refers to the current epoch
- And, the hyperparameter $k$ refers to some constant
- We can also use a similar method using mini-batches: $\alpha = \frac{k}{\sqrt{t}} \cdot \alpha_0$
- Here, $t$ refers to the current mini-batch number
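- To make these variants concrete, here's a sketch of the square-root-based schedules and one possible discrete staircase; the drop factor and interval in `staircase_decay` are assumed illustrative values, not prescribed above:

```python
import math

def sqrt_epoch_decay(alpha_0, k, epoch_num):
    """alpha = (k / sqrt(epoch_num)) * alpha_0, where k is some constant."""
    return (k / math.sqrt(epoch_num)) * alpha_0

def sqrt_batch_decay(alpha_0, k, t):
    """Same idea per mini-batch: alpha = (k / sqrt(t)) * alpha_0, where t is the mini-batch number."""
    return (k / math.sqrt(t)) * alpha_0

def staircase_decay(alpha_0, drop_factor, epochs_per_drop, epoch_num):
    """Discrete staircase decay: multiply the learning rate by drop_factor every epochs_per_drop epochs."""
    return alpha_0 * (drop_factor ** (epoch_num // epochs_per_drop))

# Assumed example: halve a 0.2 learning rate every 5 epochs
print(staircase_decay(alpha_0=0.2, drop_factor=0.5, epochs_per_drop=5, epoch_num=12))  # 0.05
```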
tldr
- Slowly reducing the learning rate over time can help improve the performance of an optimization algorithm
- In other words, we want to start off with a large learning rate in the beginning, since we want to quickly converge to the minimum
- Then, we want to end with a small learning rate as we approach the minimum, so that our steps aren't big enough to overshoot the minimum
- The standard formula for learning rate decay is: $\alpha = \frac{1}{1 + \text{decay\_rate} \cdot \text{epoch\_num}} \cdot \alpha_0$
- The hyperparameter $\alpha$ represents the learning rate for the current epoch
- The hyperparameter $\alpha_0$ represents an initial learning rate
- The hyperparameter $\text{decay\_rate}$ represents the rate of decay
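- To see where the decayed learning rate actually gets used, below is a small sketch of standard decay inside a mini-batch gradient descent loop for linear regression; the model, data, and hyperparameter values are all assumptions for illustration:

```python
import numpy as np

def train(X, y, alpha_0=0.2, decay_rate=1.0, num_epochs=10, batch_size=64):
    """Mini-batch gradient descent for linear regression with standard learning rate decay."""
    w = np.zeros(X.shape[1])
    for epoch_num in range(1, num_epochs + 1):
        # Recompute the decayed learning rate once per epoch
        alpha = alpha_0 / (1 + decay_rate * epoch_num)
        indices = np.random.permutation(len(X))
        for start in range(0, len(X), batch_size):
            batch = indices[start:start + batch_size]
            X_b, y_b = X[batch], y[batch]
            grad = 2 * X_b.T @ (X_b @ w - y_b) / len(batch)  # gradient of the MSE cost
            w -= alpha * grad
    return w

# Toy usage: recover weights close to [3, -2] from noisy synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = X @ np.array([3.0, -2.0]) + 0.1 * rng.normal(size=500)
print(train(X, y))
```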