Dying ReLU

Motivating the Dying ReLU Problem

  • ReLU doesn't suffer from the vanishing gradient problem as much as other activation functions
  • However, the relu function still has a vanishing gradient problem, but only on the negative side
  • Therefore, we call it something else
  • We call it the dying relu problem instead

Describing the Dying ReLU Problem

  • Since the relu function returns 0 for all negative inputs, the gradient for negative weighted sums is also 0
  • In other words, a neuron stops learning once its weighted sum becomes negative
  • This usually happens because of a very large negative bias b
  • Since the gradient is always 0, the neuron is unlikely to recover
  • Therefore, the weights will not adjust during gradient descent (see the sketch after this list)
  • This is good if we are at a global minimum
  • However, we'll frequently get stuck at local minima and plateaus because of the dying relu problem
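
To make the mechanics concrete, here is a minimal NumPy sketch of a single hypothetical neuron (the inputs, weights, and stand-in upstream gradient are assumptions for illustration). A large negative bias drives every pre-activation negative, so the gradient flowing back to the weights is exactly 0 and no update can occur:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # Derivative of relu: 1 for positive inputs, 0 otherwise
    return (z > 0).astype(float)

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))   # 100 hypothetical samples, 3 features
w = rng.normal(size=3)
b = -50.0                       # very large negative bias "kills" the neuron

z = x @ w + b                   # every pre-activation is negative
a = relu(z)                     # every output is 0

# The gradient w.r.t. the weights passes through relu_grad(z),
# which is 0 everywhere, so the weights receive no learning signal
upstream = np.ones_like(z)      # stand-in for dL/da
grad_w = x.T @ (upstream * relu_grad(z))

print(a.max())   # 0.0 -> the neuron is "dead"
print(grad_w)    # [0. 0. 0.] -> no update in gradient descent
```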

The Vanishing Gradient and Dying ReLU Problem

  • The derivatives of many activation functions (e.g. tanh, sigmoid, etc.) are very close to 0 for large inputs
  • In other words, the smaller the gradient becomes, the slower and harder it is for a neuron to return to a good zone
  • Roughly speaking, this demonstrates a major effect of the vanishing gradient problem
  • The gradient of the relu function doesn't become smaller in the positive direction
  • Therefore, the relu function doesn't suffer from the vanishing gradient problem in the positive direction (see the sketch after this list)
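
As a quick numerical check (a minimal sketch, using a few assumed positive pre-activations), the derivatives of sigmoid and tanh shrink toward 0 as the input grows, while the relu gradient stays at 1 in the positive direction:

```python
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2

def relu_grad(z):
    return (z > 0).astype(float)

# Hypothetical positive pre-activations
z = np.array([0.5, 2.0, 5.0, 10.0])

print(sigmoid_grad(z))  # ~[0.235, 0.105, 0.0066, 4.5e-05] -> shrinks toward 0
print(tanh_grad(z))     # ~[0.787, 0.071, 1.8e-04, 8.2e-09] -> shrinks even faster
print(relu_grad(z))     # [1. 1. 1. 1.] -> stays at 1
```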

Introducing the Leaky ReLU

  • The leaky relu attempts to solve the dying relu problem
  • Specifically, the leaky relu does this by providing a very small gradient for negative values
  • This represents an attempt to allow neurons to recover
  • We can define the leaky relu function as the following:
\text{leakyrelu}(x) = \begin{cases} 0.01x & \text{if } x \le 0 \\ x & \text{if } x > 0 \end{cases}

[Figure: plot of the leaky relu function]
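
Here is a minimal NumPy sketch of the definition above; the 0.01 slope on negative inputs is exposed as a parameter (named alpha here purely for illustration):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # alpha is the small slope applied to non-positive inputs (0.01 in the definition above)
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    # Gradient is 1 for positive inputs and alpha (not 0) for non-positive inputs,
    # so a "dead" neuron still receives a small learning signal and can recover
    return np.where(x > 0, 1.0, alpha)

z = np.array([-3.0, -0.5, 0.0, 2.0])
print(leaky_relu(z))       # [-0.03  -0.005  0.     2.   ]
print(leaky_relu_grad(z))  # [0.01   0.01   0.01   1.   ]
```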

  • Unfortunately, the leaky relu doesn't perform as well as the relu
  • Also, there isn't much of an accuracy boost in most circumstances

tldr

  • ReLU doesn't suffer from the vanishing gradient problem as much as other activation functions
  • However, the relu function still has a vanishing gradient problem, but only on the negative side
  • Therefore, we call it something else
  • We call it the dying relu problem instead
  • The leaky relu attempts to solve the dying relu problem
