Learning Rate Decay

Motivating Learning Rate Decay

  • Sometimes, assigning a large learning rate can cause our optimization algorithm to overshoot the minimum
  • On the other hand, assigning a small learning rate can cause our optimization algorithm to never reach the minimum, since our steps become too small
  • In either of these situations, we may never converge to the optimal parameters that minimize our cost function $J$

Illustrating Learning Rate Decay

  • To illustrate this point, let's say we're implementing mini-batch gradient descent with reasonably small batch sizes and a somewhat large learning rate
  • As we iterate, we'll notice our steps drifting towards the minimum in the beginning
  • However, once we approach the minimum, our steps may start to wander around it without ever converging to the optimal parameters
  • On the other hand, we may never converge to the minimum if we start with a somewhat small learning rate
  • Slowly reducing the learning rate over time can help improve the performance of an optimization algorithm

(Figure: learning rate decay)

Defining Standard Learning Rate Decay

  1. Compute $\alpha_i$ for the current epoch $i$:
$$\alpha_i = \alpha_0 \frac{1}{1 + \gamma \times i}$$
  2. Repeat step 1 for each $i^{th}$ epoch

Describing the Hyperparameters

  • The hyperparameter $\alpha_i$ represents the learning rate for the $i^{th}$ epoch
  • The hyperparameter $\alpha_0$ represents an initial learning rate
  • The hyperparameter $\gamma$ represents the rate of decay
  • The hyperparameters $\alpha_0$ and $\gamma$ are fixed for each epoch
  • The hyperparameter $\alpha_i$ is obviously different for each epoch
  • We typically tune the hyperparameters $\alpha_0$ and $\gamma$

Example of the Standard Learning Rate Decay

| epoch | $\alpha_{epoch}$ | $\alpha_0$ | $\gamma$ |
| --- | --- | --- | --- |
| 1 | 0.1 | 0.2 | 1 |
| 2 | 0.067 | 0.2 | 1 |
| 3 | 0.05 | 0.2 | 1 |
| 4 | 0.04 | 0.2 | 1 |
| ... | ... | 0.2 | 1 |
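A minimal Python sketch of the standard decay formula, assuming the 1-indexed epoch convention used in the table; the `standard_decay` name is just illustrative. Running it reproduces the learning rates above.

```python
def standard_decay(alpha_0, gamma, epoch):
    """Standard learning rate decay: alpha_0 * 1 / (1 + gamma * epoch)."""
    return alpha_0 * (1.0 / (1.0 + gamma * epoch))

# Reproduce the example table with alpha_0 = 0.2 and gamma = 1
for epoch in range(1, 5):
    print(epoch, round(standard_decay(alpha_0=0.2, gamma=1, epoch=epoch), 3))
# 1 0.1
# 2 0.067
# 3 0.05
# 4 0.04
```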

Defining Exponential Learning Rate Decay

  1. Compute $\alpha_i$ for the current epoch $i$:
$$\alpha_i = \alpha_0 \times \gamma^i$$
  2. Repeat step 1 for each $i^{th}$ epoch

    • Where the hyperparameter $\gamma$ is the exponential rate of decay (typically a value slightly below 1, such as 0.95)
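A matching sketch for exponential decay, again with an illustrative function name; $\alpha_0 = 0.2$ and $\gamma = 0.95$ below are just example hyperparameter values.

```python
def exponential_decay(alpha_0, gamma, epoch):
    """Exponential learning rate decay: alpha_0 * gamma ** epoch."""
    return alpha_0 * gamma ** epoch

# With alpha_0 = 0.2 and gamma = 0.95, the learning rate shrinks
# geometrically: roughly 0.19, 0.18, 0.17, 0.16, ... over the first epochs
for epoch in range(1, 5):
    print(epoch, exponential_decay(alpha_0=0.2, gamma=0.95, epoch=epoch))
```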

Other Forms of Learning Rate Decay

  • There are other forms of learning rate decay
  • These essentially do the same thing, but use different functions to express the rate of decay applied to the learning rate
  • Another example is the discrete staircase decay function
  • We could also use the following:
$$\alpha_i = \alpha_0 \frac{k}{\sqrt{i}}$$
  • Here, $i$ refers to the $i^{th}$ epoch
  • And, the hyperparameter $k$ refers to some constant
  • We can also use a similar method using mini-batches:
$$\alpha_t = \alpha_0 \frac{k}{\sqrt{t}}$$
  • Here, $t$ refers to the $t^{th}$ mini-batch
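Sketches of these schedules in the same style; the function names, and the `drop_factor` / `epochs_per_drop` parameters of the staircase variant, are assumptions for illustration.

```python
import math

def sqrt_epoch_decay(alpha_0, k, i):
    """alpha_0 * k / sqrt(i), decayed once per epoch i (1-indexed)."""
    return alpha_0 * k / math.sqrt(i)

def sqrt_minibatch_decay(alpha_0, k, t):
    """Same schedule, but decayed per mini-batch t instead of per epoch."""
    return alpha_0 * k / math.sqrt(t)

def staircase_decay(alpha_0, drop_factor, epochs_per_drop, i):
    """Discrete staircase: scale the rate by drop_factor every epochs_per_drop epochs."""
    return alpha_0 * drop_factor ** (i // epochs_per_drop)
```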

tldr

  • Slowly reducing the learning rate over time can help improve the performance of an optimization algorithm
  • In other words, we want to start with a large learning rate, since we want to make quick progress towards the minimum in the beginning
  • Then, we want to end with a small learning rate as we approach the minimum, so that our steps aren't big enough to overshoot the minimum
  • The standard formula for learning rate decay is:
$$\alpha_i = \alpha_0 \frac{1}{1 + \gamma \times i}$$
  • The hyperparameter $\alpha_i$ represents the learning rate for the $i^{th}$ epoch
  • The hyperparameter $\alpha_0$ represents an initial learning rate
  • The hyperparameter $\gamma$ represents the rate of decay
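To tie it together, here is a hedged sketch of how the standard decay formula might slot into a mini-batch gradient descent loop; the linear model and squared-error gradient are stand-ins for whatever model and cost $J$ are actually being trained.

```python
import numpy as np

def train(X, y, w, alpha_0=0.2, gamma=1.0, num_epochs=20, batch_size=64):
    m = X.shape[0]
    for epoch in range(1, num_epochs + 1):
        alpha = alpha_0 / (1.0 + gamma * epoch)        # standard decay for this epoch
        for start in range(0, m, batch_size):
            Xb, yb = X[start:start + batch_size], y[start:start + batch_size]
            grad = Xb.T @ (Xb @ w - yb) / len(yb)      # gradient of squared error (stand-in)
            w = w - alpha * grad                       # smaller steps as epochs increase
    return w
```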
