Vanishing Gradient

Motivating the Problem with Vanishing Gradients

  • We know that adding more layers to a network leads to slower training
  • In fact, a network can grow so deep that training becomes impractically slow
  • Sometimes, a deeper network can even become less accurate than a smaller network
  • In a deep network, a layer can also train more slowly than other layers depending on its position in the network
  • Specifically, neurons in earlier layers learn more slowly than neurons in later layers
  • For example, our first hidden layer typically trains more slowly than our second hidden layer

The Vanishing Gradient Problem

  • Sometimes, the gradients of later hidden layers become very small

    • As a reminder, this doesn't necessarily mean we're at a global extremum
    • We could be at a plateau in this case
  • As a result, the gradients of the earlier hidden layers sometimes become small in a deep network

    • This is because deep networks heavily use the chain rule in backward propagation
  • In this case, the neurons in earlier layers learn much more slowly than neurons in later layers
  • This phenomenon is referred to as the vanishing gradient problem
  • In other words, the vanishing gradient problem refers to the situation where neurons in earlier layers learn much more slowly than neurons in later layers as a consequence of the gradients of the later layers being small
  • Said another way, it's a phenomenon where the gradients of earlier layers tend to reach (or get stuck at) plateaus more easily than the gradients of later layers

Visual Interpretation of the Vanishing Gradient Problem

  • The cost function of a neural network is neither convex nor concave
  • As a result, the vanishing gradient problem can occur throughout the learning process
  • We can use a contour plot to visualize the behavior of our cost function:

[Figure: contour plot of the cost function]

  • Roughly speaking, if we see regions where the cost surface becomes elongated or plateaus, then we will experience the vanishing gradient problem:

[Figure: contour plot of a plateaued cost function]

  • In comparison, a more optimal shape looks like the following:

[Figure: contour plot of a better-conditioned (more optimal) cost function]

  • In both situations, the gradients become smaller near the minimum
  • In other words, learning slows down near the minimum
  • However, the second shape has larger gradients away from the minimum, and consequently gradient descent converges much faster
  • In summary, we experience the vanishing gradient problem when we're stuck at these plateaus, as the sketch below illustrates
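  • As a rough numerical illustration (a minimal sketch; the two cost curves below are invented for this comparison, not taken from these notes), we can compare the gradient magnitude on a plateaued curve against a steeper, better-conditioned curve at the same point:

```python
import numpy as np

# Two one-dimensional cost curves with a minimum at w = 0:
# a plateaued (very flat) curve and a steeper, better-conditioned curve.
flat_cost  = lambda w: 0.001 * w ** 2   # elongated / plateau-like
steep_cost = lambda w: 1.0 * w ** 2     # more optimal bowl shape

flat_grad  = lambda w: 2 * 0.001 * w
steep_grad = lambda w: 2 * 1.0 * w

w = 3.0  # same starting point for both curves
print(f"gradient on the plateau: {flat_grad(w):.4f}")   # 0.0060
print(f"gradient on the bowl:    {steep_grad(w):.4f}")  # 6.0000
# With the same learning rate, the plateaued curve moves ~1000x slower per step.
```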

Why Do Earlier Layers Learn Slower?

  • We know that a layer learns more slowly if the gradient of that layer becomes small
  • We also know that the vanishing gradient problem occurs when the learning of earlier layers slows down because the gradients of later layers become small
  • Naturally, we should wonder why a layer learns more slowly when the partial derivatives of the later layers become small
  • This happens in deep networks because backward propagation uses the chain rule
  • Specifically, the chain rule makes each layer's gradient dependent on the partial derivatives of the layers that follow it
  • The deeper the network, the more of these later-layer terms an early layer depends on
  • Specifically, the partial derivatives of the earlier layers are products of very many partial derivatives of the later layers
  • If these factors are small enough (near 0), then the partial derivatives of the earlier layers will become small as well, as the sketch after this list shows
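  • Here is a minimal sketch of that multiplication effect; the number of layers and the range of the factors are assumptions chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend each factor is one partial derivative in the chain rule,
# drawn somewhere in (0, 0.25] as with a sigmoid derivative.
def chained_gradient(num_layers):
    factors = rng.uniform(0.0, 0.25, size=num_layers)
    return np.prod(factors)

for depth in (3, 10, 30):
    print(depth, chained_gradient(depth))
# The product shrinks roughly like 0.25 ** depth, so the earliest layers
# of a deep network receive gradients that are numerically close to zero.
```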

Visual Interpretation of Slower Learning for Earlier Layers

  • We know that a layer learns more slowly if the partial derivatives of the later layers become smaller
  • Mathematically, we've seen this using the chain rule in backward propagation for earlier layers in a deep network
  • Visually, we can interpret this vanishing gradient problem as getting stuck at plateaus
  • In other words, we're less prone to escape these plateaus once we arrive at them
  • In gradient descent, getting stuck at these plateaus makes it difficult to make any progress
  • This is because the gradient of the earlier layer becomes so small for a deep network
  • We can see this in the following example:

[Figure: simulation of gradient descent stalling on a plateau]

  • In contrast, we can visually interpret an exploding gradient as approaching a gradient cliff, which launches the parameters into space
  • In summary, the vanishing gradient problem refers to getting stuck at these plateaus, where it's difficult to make any progress because learning in the earlier layers is so slow (but not theoretically impossible); a small simulation of this follows below
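  • For example, here is a minimal simulation of gradient descent on an invented one-dimensional cost with a wide plateau (the cost function, starting point, and learning rate are all assumptions for illustration):

```python
import numpy as np

# An invented 1-D cost with a wide, nearly flat plateau away from the minimum at w = 0.
cost = lambda w: np.tanh(w) ** 2
grad = lambda w: 2 * np.tanh(w) / np.cosh(w) ** 2

w, lr = 6.0, 0.5
for step in range(50):
    w -= lr * grad(w)

print(w)  # still roughly 5.999: the gradient on the plateau is so small
          # that 50 steps barely move the parameter toward the minimum
```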

Influences on the Vanishing Gradient Problem

  • Again, the vanishing gradient problem is caused by multiplying together many small partial derivatives
  • The number of multiplications is determined by the depth of our network
  • How small each partial derivative is depends on the choice of activation function
  • Therefore, both the activation function and the depth of our network influence the vanishing gradient problem
  • In other words, the vanishing gradient problem is a consequence of the depth of our network, which can be observed mathematically using the chain rule in backward propagation (see the bound below)
  • The vanishing gradient problem is also a consequence of the activation function, which can be observed by swapping to another activation function (such as ReLU)
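  • As a back-of-the-envelope bound (an added illustration, not part of the original notes), each sigmoid-derivative factor is at most 1/4, so the chained product shrinks geometrically with the number of layers L:

\left| \prod_{l=1}^{L} \frac{\partial a^{l}}{\partial z^{l}} \right| \le \left( \frac{1}{4} \right)^{L}, \qquad \left( \frac{1}{4} \right)^{3} \approx 0.016, \qquad \left( \frac{1}{4} \right)^{10} \approx 9.5 \times 10^{-7}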

Influence of Activation Functions

  • The sigmoid function is useful because it squeezes any input value into an output range of (0, 1)
  • This is perfect for representing output as probabilities
  • However, if a function squeezes its input into such a small range, then its derivatives will be relatively small as well
  • Therefore, these types of activation functions suffer from the vanishing gradient problem
  • For example, the sigmoid's derivative looks like the following:

[Figure: plot of the derivative of the sigmoid function]

  • We can see the maximum value of the derivative is 0.25, which is very small
  • In other words, the derivative of the sigmoid function always lies between 0 and 0.25
  • This is why we should avoid chaining together many sigmoid activation functions when modeling probabilities
  • In these situations, we should use activation functions with larger derivatives for our hidden layers, then use a sigmoid or tanh activation function for our output layer (a quick numerical check of the 0.25 bound follows below)
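  • A quick numerical check of that bound (a minimal sketch; the helper functions below are just illustrative definitions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Evaluate the derivative over a wide range of inputs.
z = np.linspace(-10, 10, 10001)
print(sigmoid_derivative(z).max())  # 0.25, reached at z = 0
# Every factor contributed by a sigmoid activation in the chain rule
# therefore lies in (0, 0.25].
```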

Example of Slow Sigmoid Learning

  • Recall that the derivative of the sigmoid function outputs values between 0 and 0.25
  • For a relatively shallow network with three layers, we calculate the derivative of the cost with respect to a weight in our first hidden layer:
\frac{\partial J(w,b)}{\partial w^{1}} = \frac{\partial J(w,b)}{\partial a^{3}} \overbrace{\frac{\partial a^{3}}{\partial z^{3}}}^{< \frac{1}{4}} \frac{\partial z^{3}}{\partial a^{2}} \overbrace{\frac{\partial a^{2}}{\partial z^{2}}}^{< \frac{1}{4}} \frac{\partial z^{2}}{\partial a^{1}} \overbrace{\frac{\partial a^{1}}{\partial z^{1}}}^{< \frac{1}{4}} \frac{\partial z^{1}}{\partial w^{1}} \to 0
  • Here, we are multiplying together three sigmoid derivatives, each between 0 and 0.25, along with the weight and input terms
  • This product becomes very small very quickly
  • In a deeper network, the product would become small even faster
  • The first layer is the furthest back, so its derivative is the longest chain-rule expression
  • Hence, it depends on the most sigmoid derivatives, which makes it the smallest
  • Because of this, the first layers are the slowest to train
  • Since the output and later layers depend on the earlier layers in forward propagation, the inaccurate early layers will cause the later layers to build on this inaccuracy
  • As a result, the entire neural network will become inaccurate (the sketch below works through this chain numerically)
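  • Here is a minimal sketch of that chain for a scalar three-layer network with sigmoid activations; the input, target, weights, and squared-error loss are all invented for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dsigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# One neuron per layer, three layers, squared-error loss (all values invented).
x, y = 1.0, 0.0
w1, w2, w3 = 0.5, 0.5, 0.5

# Forward pass
z1 = w1 * x;  a1 = sigmoid(z1)
z2 = w2 * a1; a2 = sigmoid(z2)
z3 = w3 * a2; a3 = sigmoid(z3)

# Backward pass: the chain rule from the equation above
dJ_da3 = 2 * (a3 - y)
dJ_dw3 = dJ_da3 * dsigmoid(z3) * a2
dJ_dw1 = dJ_da3 * dsigmoid(z3) * w3 * dsigmoid(z2) * w2 * dsigmoid(z1) * x

print(dJ_dw3, dJ_dw1)
# dJ/dw1 is roughly 40x smaller than dJ/dw3, because it picks up two extra
# sigmoid-derivative factors (each < 0.25) and two extra weight factors.
```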

Using a ReLU Activation Function

  • The ReLU activation function doesn't suffer as much from the vanishing gradient problem
  • The ReLU function is defined as the following piecewise function:
\text{ReLU}(x) = \begin{cases} 0 &\text{if } x \le 0 \cr x &\text{if } x > 0 \end{cases}
  • The function will output 0 if the input is less than or equal to 0
  • Otherwise, the function will mimic the identity function
  • The derivative of the function is the following:
\frac{\partial\, \text{ReLU}(x)}{\partial x} = \begin{cases} 0 &\text{if } x \le 0 \cr 1 &\text{if } x > 0 \end{cases}
  • We can see the ReLU function won't suffer from the vanishing gradient problem as much as other activation functions
  • This is because its derivative is either 0 or 1, instead of a small decimal the majority of the time
  • In other words, the derivative of the sigmoid function is saturated at small decimals, which doesn't happen with the ReLU function (the sketch below compares the two)
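  • As a minimal comparison (the depth and random pre-activations below are assumptions for illustration), we can chain the two kinds of derivatives across many layers:

```python
import numpy as np

rng = np.random.default_rng(1)

depth = 20
z = rng.normal(size=depth)  # invented pre-activations of 20 stacked layers

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
sigmoid_grads = sigmoid(z) * (1.0 - sigmoid(z))  # each in (0, 0.25]
relu_grads = (z > 0).astype(float)               # each exactly 0 or 1

print(np.prod(sigmoid_grads))  # astronomically small
print(np.prod(relu_grads))     # exactly 0 or 1 -- never a tiny decimal
```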

The Unstable Gradient Problem

  • The fundamental problem here isn't so much the vanishing gradient problem or the exploding gradient problem
  • It's that the gradient in early layers is the product of terms from all the later layers
  • When there are many layers, this becomes an unstable situation
  • The only way all layers can learn at close to the same speed is if all those products of terms come close to balancing out
  • Without some mechanism or underlying reason for that balancing to occur, it's highly unlikely to happen simply by chance
  • In short, the real problem here is that neural networks suffer from an unstable gradient problem
  • As a result, if we use standard gradient-based learning techniques, different layers in the network will tend to learn at wildly different speeds (the sketch below illustrates this instability)
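  • Here is a minimal sketch of that instability; the per-layer derivative values and weight scales are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
depth = 30

# Each layer contributes a factor of roughly |weight * activation_derivative|
# to the gradient of the first layer (scalar sketch, values invented).
activation_grads = rng.uniform(0.1, 0.25, size=depth)

for weight_scale in (1.0, 10.0):
    factors = weight_scale * activation_grads
    print(weight_scale, np.prod(factors))
# With small weights the product vanishes; with large weights it explodes.
# Keeping every layer learning at a similar speed would require these
# products to balance out, which rarely happens by chance.
```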

tldr

  • Sometimes, the gradients of later hidden layers become very small
  • As a result, the gradients of the earlier hidden layers sometimes become small in a deep network
  • In this case, the neurons in earlier layers learn much more slowly than neurons in later layers
  • The vanishing gradient problem refers to the situation where neurons in earlier layers learn much more slowly than neurons in later layers as a consequence of the gradients of the later layers being small
  • In other words, the vanishing gradient problem is a consequence of the following:

    • Activation function
    • Depth of the network
  • The effect of depth can be observed mathematically using the chain rule in backward propagation
  • The depth of the network determines the number of partial derivatives multiplied together in the chain rule
  • The activation function determines how small each of those partial derivatives is
  • Therefore, both the activation function and the depth of our network influence the vanishing gradient problem
  • In other words, the vanishing gradient problem is a consequence of the depth of our network, which can be observed mathematically using the chain rule in backward propagation
  • The vanishing gradient problem is also a consequence of the activation function, which can be observed by swapping to another activation function (such as ReLU)
