Motivating Dropout Regularization
- Dropout regularization is a popular layer-level regularization method
- Dropout regularization is performed to prevent overfitting in neural networks
- We can effectively remove a neuron by multiplying its activation by $0$
- In machine learning, we can use an ensemble of classifiers to combat overfitting
- However, ensemble methods become too expensive in the realm of deep learning
- Therefore, we can use dropout to approximate the effects of ensemble methods
Describing the Original Dropout Implementation
- Dropout can be implemented by randomly removing neurons from a network during its training phase
- Specifically, each neuron's activation has a probability $p$ of being removed (i.e. zeroed out) during the training phase
- Each neuron's activation is multiplied by the keep probability $1-p$ during the testing phase
- For example, a feedforward operation of a standard neural network looks like the following: $z^{(l+1)} = W^{(l+1)} a^{(l)} + b^{(l+1)}$, $a^{(l+1)} = f(z^{(l+1)})$
- With dropout, a feedforward operation becomes the following during the training step: $r^{(l)} \sim \text{Bernoulli}(1-p)$, $\tilde{a}^{(l)} = r^{(l)} \odot a^{(l)}$, $z^{(l+1)} = W^{(l+1)} \tilde{a}^{(l)} + b^{(l+1)}$, $a^{(l+1)} = f(z^{(l+1)})$, where $r^{(l)}$ is a vector of independent Bernoulli masks and $\odot$ is elementwise multiplication
- With dropout, a feedforward operation becomes the following during the testing step: $z^{(l+1)} = W^{(l+1)} \big((1-p)\, a^{(l)}\big) + b^{(l+1)}$, $a^{(l+1)} = f(z^{(l+1)})$, as sketched in code below
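- As a rough illustration, here is a minimal NumPy sketch of one layer under this scheme; the function names, ReLU activation, and shapes are my own illustrative assumptions, not from the original notes:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_train(a_prev, W, b, p=0.5):
    """Training phase: drop each incoming activation with probability p."""
    r = rng.binomial(1, 1.0 - p, size=a_prev.shape)   # r_j ~ Bernoulli(1 - p)
    a_tilde = r * a_prev                              # zero out the dropped activations
    return np.maximum(0.0, W @ a_tilde + b)           # f = ReLU, as an example

def layer_test(a_prev, W, b, p=0.5):
    """Testing phase: keep every activation, but scale it by 1 - p."""
    return np.maximum(0.0, W @ ((1.0 - p) * a_prev) + b)

# Example usage with hypothetical shapes
a_prev = rng.uniform(size=4)      # activations from the previous layer
W = rng.normal(size=(3, 4))       # weights of the next layer
b = np.zeros(3)
print(layer_train(a_prev, W, b), layer_test(a_prev, W, b))
```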
Reasoning behind the Original Dropout Method
- During the training phase, removing neurons adds a degree of noise to the architecture
- Specifically, removing neurons ensures that our network isn't overly dependent on any small handful of neurons
- This helps prevent overfitting, because the network can't fit the training data by relying on only that limited set of neurons
- During the testing phase, activations are multiplied by the keep probability $1-p$
- The purpose of doing this is to ensure that the distribution of activation values seen at test time closely resembles the distribution seen during training
- Equivalently, we can multiply the weights by $1-p$ rather than the activations
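- One way to see this equivalence for a single linear layer, using $a$, $W$, and $b$ for the incoming activations, weights, and bias (symbols introduced here for clarity):

$$W\big((1-p)\,a\big) + b \;=\; \big((1-p)\,W\big)\,a + b$$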
Example of the Original Dropout Implementation
- Let's say we use a dropout probability of $p = 0.5$ and some set of activations for a certain layer
- During the training phase, half of our neurons (i.e. a fraction $p = 0.5$ of them) would be removed
- Therefore, our training activations would look like the original activations with about half of the entries zeroed out
- During the testing phase, our activations are multiplied by the keep probability $1 - p = 0.5$
- Therefore, our testing activations would look like the original activations with every entry halved, as in the sketch below
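- A concrete sketch of both phases with $p = 0.5$; the activation values below are hypothetical, not recovered from the original example:

```python
import numpy as np

rng = np.random.default_rng(0)

p = 0.5                                        # dropout probability
a = np.array([1.0, 2.0, 3.0, 4.0])             # hypothetical layer activations

# Training phase: zero out each activation independently with probability p.
mask = rng.binomial(1, 1.0 - p, size=a.shape)
a_train = mask * a                             # roughly half of the entries become 0

# Testing phase: keep every activation, scaled by the keep probability 1 - p.
a_test = (1.0 - p) * a                         # [0.5, 1.0, 1.5, 2.0]
```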
Example of the Original Dropout for Input Layer
- If we are dealing with an input layer, our activations are represented by our input data
- Let's say we use a dropout probability of $p = 0.5$ and some set of input data for this layer
- During the training phase, half of our input neurons (i.e. a fraction $p = 0.5$ of them) would be removed
- Therefore, our training activations would just be the input data with about half of the entries zeroed out
- During the testing phase, our activations are multiplied by the keep probability $1 - p = 0.5$
- Therefore, our testing activations would just be the input data with every entry halved, as in the sketch below
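- The same sketch for an input layer, again with hypothetical feature values and $p = 0.5$:

```python
import numpy as np

rng = np.random.default_rng(1)

p = 0.5                                    # dropout probability on the input layer
x = np.array([4.0, 3.0, 2.0, 1.0])         # hypothetical input features

# Training phase: the input features are treated exactly like activations.
mask = rng.binomial(1, 1.0 - p, size=x.shape)
x_train = mask * x                         # roughly half of the features zeroed out

# Testing phase: every feature is scaled by the keep probability 1 - p.
x_test = (1.0 - p) * x                     # [2.0, 1.5, 1.0, 0.5]
```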
Describing the TensorFlow Dropout Implementation
- TensorFlow implements dropout differently from the original implementation described above
- These implementations only have slight differences
- We will see that these implementations are even equivalent sometimes
- During the testing phase of the original dropout implementation, we are essentially scaling down the activations
- During the training phase of the TensorFlow implementation, we instead scale up the activations by a factor of $\frac{1}{1-p}$
- Specifically, we do this to prevent the need for changing the activations during the testing phase
- To be clear, all of the neurons that aren't dropped (i.e. those kept, each with probability $1-p$) are scaled up by a factor of $\frac{1}{1-p}$ during the training phase
- The neurons that are dropped (each with probability $p$) are obviously set to $0$ during the training phase
- The reason for this upscaling of values is to preserve the distribution of activation values, so that the expected total sum of activations is preserved as well
- For example, upscaling the surviving activations leads to a close approximation of the original sum, as in the sketch after this list
- This makes sense because we'd hope for the sum of the original activations and the sum of the dropped-and-rescaled activations to be approximately equal
- This is because we'd like any transformation of the activations to produce a similar summation output, since our neural networks use weighted sums of these activations so often
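- A quick NumPy check of this sum-preservation argument; the activations are randomly generated placeholders, so this is only a sketch of the idea:

```python
import numpy as np

rng = np.random.default_rng(0)

p = 0.5                                        # dropout probability
a = rng.uniform(size=10_000)                   # hypothetical layer activations

# TensorFlow-style (inverted) dropout during training:
# drop with probability p, then scale the survivors by 1 / (1 - p).
mask = rng.binomial(1, 1.0 - p, size=a.shape)
a_dropped = mask * a / (1.0 - p)

# The two sums should be close, since the upscaling compensates
# in expectation for the entries that were zeroed out.
print(a.sum(), a_dropped.sum())
```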
Example of the TensorFlow Dropout Implementation
- Let's say we use a dropout probability of $p = 0.5$ and some set of activations for a certain layer
- During the training phase, half of our neurons (i.e. a fraction $p = 0.5$ of them) would be removed, and the surviving activations would be scaled up by $\frac{1}{1-p} = 2$
- Therefore, our training activations would look like the original activations with about half of the entries zeroed out and the rest doubled
- During the testing phase, our activations are left unchanged
- Therefore, our testing activations would look exactly like the original activations, as in the sketch below
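- A minimal sketch of this behavior using Keras's Dropout layer; the input tensor values are hypothetical:

```python
import tensorflow as tf

x = tf.constant([[1.0, 2.0, 3.0, 4.0]])        # hypothetical activations
dropout = tf.keras.layers.Dropout(rate=0.5)

# Training phase: entries are zeroed with probability 0.5 and the
# surviving entries are scaled by 1 / (1 - 0.5) = 2.
y_train = dropout(x, training=True)

# Testing phase: the layer acts as the identity, so activations are unchanged.
y_test = dropout(x, training=False)

print(y_train.numpy())
print(y_test.numpy())
```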
Equivalence of Implementations
- Mathematically, these two implementations do not look the same
- Specifically, the activations produced by the two implementations are off by a constant factor of $\frac{1}{1-p}$, and this factor carries through both forward and backward propagation
- However, these two implementations are effectively equivalent when optimizing parameters, since the original implementation accounts for this constant factor at test time
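- One way to summarize the relationship (my own framing, using an incoming activation $a$ and a dropout mask $r \sim \text{Bernoulli}(1-p)$):

$$
\text{TensorFlow: } \mathbb{E}\!\left[\tfrac{r}{1-p}\,a\right] = a \;\;\text{(test-time activation } a\text{)}
\qquad
\text{Original: } \mathbb{E}\!\left[r\,a\right] = (1-p)\,a \;\;\text{(test-time activation } (1-p)\,a\text{)}
$$

- In both schemes the test-time activation matches the expected training-time activation; the two differ only by the constant factor $1-p$, which is consistent with the claim above that they end up equivalent when optimizing parameters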
Footnotes for Applying Dropout Regularization
- Typically, we set the dropout probability to $p = 0$ for our input layer
- In other words, we don't remove any neurons from our input layer in practice
- We can also think of the dropout probability as being $p = 0$ for all neurons during any testing phase, as in the sketch below
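- For instance, a minimal Keras sketch of this convention (layer sizes and the 784-dimensional input are hypothetical): dropout is placed after hidden layers rather than on the input, and Keras disables it automatically at inference time:

```python
import tensorflow as tf

# No Dropout directly on the input features, only after hidden layers.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),            # dropout on the hidden activations only
    tf.keras.layers.Dense(10),
])

# During inference (e.g. model.predict), the Dropout layer runs with
# training=False, which matches treating the dropout probability as 0 at test time.
```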
tldr
- Dropout regularization is performed to prevent overfitting in neural networks
- Dropout regularization is applied at the level of an individual layer
- Typically, we don't perform dropout on an input layer
- Dropout can be implemented by randomly removing neurons from a network during its training phase
- Specifically, each neuron's activation has a probability $p$ of being removed during the training phase
- The purpose of doing this is to ensure that our network isn't dependent on any handful of nodes
- This prevents overfitting
- Each neuron's activation is multiplied by the keep probability $1-p$ during the testing phase
- The purpose of doing this is to preserve the distribution of activation values between the training and testing phases
- Then, the expected total sum of activations will be preserved as well
- This is important because our network uses these summations so often