Dying ReLU

Motivating the Dying ReLU Problem

  • ReLU doesn't suffer from the vanishing gradient problem as much as other activation functions
  • However, the relu function still has a vanishing gradient problem, but only on the negative side
  • Therefore, we call it something else
  • We call it the dying relu problem instead

Describing the Dying ReLU Problem

  • Since the relu function returns 0 for all negative inputs, the gradient for negative weighted sums is also 0
  • In other words, a neuron stops learning once its weighted sum becomes negative
  • This usually happens because of a very large negative bias b
  • Since the gradient is always 0, the neuron is unlikely to recover
  • Therefore, the weights will not adjust during gradient descent (see the sketch after this list)
  • This is good if we are at a global minimum
  • However, we'll frequently get stuck at local minima and plateaus because of the dying relu problem
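
To make the mechanics concrete, here is a minimal NumPy sketch of a single hypothetical neuron (the inputs, weights, and stand-in upstream gradient are assumptions for illustration). A large negative bias drives every pre-activation negative, so the gradient flowing back to the weights is exactly 0 and no update can occur:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # Derivative of relu: 1 for positive inputs, 0 otherwise
    return (z > 0).astype(float)

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))   # 100 hypothetical samples, 3 features
w = rng.normal(size=3)
b = -50.0                       # very large negative bias "kills" the neuron

z = x @ w + b                   # every pre-activation is negative
a = relu(z)                     # every output is 0

# The gradient w.r.t. the weights passes through relu_grad(z),
# which is 0 everywhere, so the weights receive no learning signal
upstream = np.ones_like(z)      # stand-in for dL/da
grad_w = x.T @ (upstream * relu_grad(z))

print(a.max())   # 0.0 -> the neuron is "dead"
print(grad_w)    # [0. 0. 0.] -> no update in gradient descent
```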

The Vanishing Gradient and Dying ReLU Problem

  • The derivatives of many activation functions (e.g. tanh, sigmoid, etc.) are very close to 0 for large inputs
  • In other words, the smaller the gradient becomes, the slower and harder it is for a neuron to return to a good zone
  • Roughly speaking, this demonstrates a major effect of the vanishing gradient problem
  • The gradient of the relu function doesn't become smaller in the positive direction
  • Therefore, the relu function doesn't suffer from the vanishing gradient problem in the positive direction (see the sketch after this list)
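
As a quick numerical check (a minimal sketch, using a few assumed positive pre-activations), the derivatives of sigmoid and tanh shrink toward 0 as the input grows, while the relu gradient stays at 1 in the positive direction:

```python
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2

def relu_grad(z):
    return (z > 0).astype(float)

# Hypothetical positive pre-activations
z = np.array([0.5, 2.0, 5.0, 10.0])

print(sigmoid_grad(z))  # ~[0.235, 0.105, 0.0066, 4.5e-05] -> shrinks toward 0
print(tanh_grad(z))     # ~[0.787, 0.071, 1.8e-04, 8.2e-09] -> shrinks even faster
print(relu_grad(z))     # [1. 1. 1. 1.] -> stays at 1
```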

Introducing the Leaky ReLU

  • The leaky relu attempts to solve the dying relu problem
  • Specifically, the leaky relu does this by providing a very small gradient for negative values
  • This represents an attempt to allow neurons to recover
  • We can define the leaky relu function as the following:
\text{leakyrelu}(x) = \begin{cases} 0.01x & \text{if } x \le 0 \\ x & \text{if } x > 0 \end{cases}

[Figure: plot of the leaky relu function]
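
Here is a minimal NumPy sketch of the definition above; the 0.01 slope on negative inputs is exposed as a parameter (named alpha here purely for illustration):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # alpha is the small slope applied to non-positive inputs (0.01 in the definition above)
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    # Gradient is 1 for positive inputs and alpha (not 0) for non-positive inputs,
    # so a "dead" neuron still receives a small learning signal and can recover
    return np.where(x > 0, 1.0, alpha)

z = np.array([-3.0, -0.5, 0.0, 2.0])
print(leaky_relu(z))       # [-0.03  -0.005  0.     2.   ]
print(leaky_relu_grad(z))  # [0.01   0.01   0.01   1.   ]
```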

  • Unfortunately, the leaky relu doesn't perform as well as the relu
  • Also, there isn't much of an accuracy boost in most circumstances

tldr

  • ReLU doesn't suffer from the vanishing gradient problem as much as other activation functions
  • However, the relu function still has a vanishing gradient problem, but only on the negative side
  • Therefore, we call it something else
  • We call it the dying relu problem instead
  • The leaky relu attempts to solve the dying relu problem
