Motivating Other Optimization Algorithms
- Gradient descent can vary in terms of the number of iterations it takes to converge to parameter values
- The number of iterations needed is typically determined by the stability of the gradient used to update the parameters
- There are typically three types of gradient descent, sketched in code after this list
- Batch gradient descent
- This is what we've been using up until this point
- Stochastic gradient descent
- This is a different type of gradient descent
- Mini-batch gradient descent
- This is a hybrid between the above two
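- To make the distinction concrete, here is a minimal sketch of all three variants on a toy linear-regression problem; the dataset, learning rate, epoch count, and batch size are illustrative assumptions, not prescribed values

```python
import numpy as np

# Toy linear-regression data (all names and hyperparameters are illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                    # 1000 instances, 3 features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=1000)

def gradient(w, X_batch, y_batch):
    """Gradient of mean squared error with respect to the weights."""
    error = X_batch @ w - y_batch
    return 2.0 * X_batch.T @ error / len(y_batch)

lr, epochs = 0.05, 20

# Batch gradient descent: one parameter update per epoch,
# computed from the entire training dataset.
w = np.zeros(3)
for _ in range(epochs):
    w -= lr * gradient(w, X, y)

# Stochastic gradient descent: one parameter update per training instance.
w = np.zeros(3)
for _ in range(epochs):
    for i in rng.permutation(len(y)):
        w -= lr * gradient(w, X[i:i + 1], y[i:i + 1])

# Mini-batch gradient descent: one parameter update per small batch of instances.
w = np.zeros(3)
batch_size = 32
for _ in range(epochs):
    order = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        batch = order[start:start + batch_size]
        w -= lr * gradient(w, X[batch], y[batch])
```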
Motivating Stochastic Gradient Descent
- Gradient descent can lead to slow training on very large datasets
- This is because one iteration requires a prediction for each instance in the training dataset
- When we have very large datasets, we can use stochastic gradient descent instead
Describing Stochastic Gradient Descent
- Stochastic gradient descent (SGD) is a variation of gradient descent
- In this variation, the gradient descent procedure updates parameters for each training instance, rather than at the end of each iteration
- Specifically, we calculate the error and update the parameters accordingly for each training observation
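- Below is a minimal sketch of this per-observation procedure for a simple model y_hat = w * x + b trained with squared error; the function name `sgd_epoch`, the toy data, and the learning rate are assumptions for illustration, not a prescribed implementation

```python
def sgd_epoch(data, w, b, lr=0.05):
    """One pass over the training set, updating after every single observation."""
    for x, y in data:
        y_hat = w * x + b      # prediction for this one instance
        error = y_hat - y      # error for this one instance
        w -= lr * error * x    # gradient of 0.5 * error**2 with respect to w
        b -= lr * error        # gradient of 0.5 * error**2 with respect to b
    return w, b

# Usage on a tiny dataset generated from y = 3x + 1 (hypothetical example data)
data = [(0.0, 1.0), (1.0, 4.0), (2.0, 7.0), (3.0, 10.0)]
w, b = 0.0, 0.0
for _ in range(200):
    w, b = sgd_epoch(data, w, b)
print(w, b)   # approaches w = 3, b = 1
```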
Upsides
- SGD can quickly give us insight about the performance and rate of improvement of our network
- This is because parameters are updated so frequently
- SGD is sometimes useful for beginners
- It is sometimes simpler to understand than other variants of gradient descent
- SGD can sometimes lead to faster learning
- The increased frequency of parameter updates can result in faster learning
- SGD can help the network avoid premature convergence
- The noisy update process can allow the network to escape poor local minima
Downsides
- SGD can be more computationally expensive than other optimization algorithms
- This is because the parameters are updated so frequently
- SGD can result in noisy gradient updates
- The gradients can have a high variance across training epochs
- SGD can sometimes lead to worse accuracy
- The noisy learning process can make it hard for the algorithm to settle on a minimum of the cost function
tldr
- Stochastic gradient descent (SGD) is a variation of gradient descent
- In this variation, the gradient descent procedure updates parameters for each training instance, rather than at the end of each iteration
- Specifically, we calculate the error and update the parameters accordingly for each training observation