Motivating Other Optimization Algorithms
- Gradient descent can vary in terms of the number of iterations it takes to converge to parameter values
- The number of iterations needed is typically determined by the stability of the gradient used to update the parameters
- There are typically three types of gradient descent, sketched in code after this list
- Batch gradient descent
- This is what we've been using up until this point
- Stochastic gradient descent
- This is a different type of gradient descent
- Mini-batch gradient descent
- This is a hybrid between the above two
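- To make the distinction concrete, here is a minimal sketch of all three variants on a toy linear-regression problem; the dataset, learning rate, epoch count, and batch size are illustrative assumptions, not prescribed values

```python
import numpy as np

# Toy linear-regression data (all names and hyperparameters are illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                    # 1000 instances, 3 features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=1000)

def gradient(w, X_batch, y_batch):
    """Gradient of mean squared error with respect to the weights."""
    error = X_batch @ w - y_batch
    return 2.0 * X_batch.T @ error / len(y_batch)

lr, epochs = 0.05, 20

# Batch gradient descent: one parameter update per epoch,
# computed from the entire training dataset.
w = np.zeros(3)
for _ in range(epochs):
    w -= lr * gradient(w, X, y)

# Stochastic gradient descent: one parameter update per training instance.
w = np.zeros(3)
for _ in range(epochs):
    for i in rng.permutation(len(y)):
        w -= lr * gradient(w, X[i:i + 1], y[i:i + 1])

# Mini-batch gradient descent: one parameter update per small batch of instances.
w = np.zeros(3)
batch_size = 32
for _ in range(epochs):
    order = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        batch = order[start:start + batch_size]
        w -= lr * gradient(w, X[batch], y[batch])
```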
Motivating Stochastic Gradient Descent
- Gradient descent can lead to slow training on very large datasets
- This is because one iteration requires a prediction for each instance in the training dataset
- When we have very large datasets, we can use stochastic gradient descent instead
Describing Stochastic Gradient Descent
- Stochastic gradient descent (SGD) is a variation of gradient descent
- In this variation, the gradient descent procedure updates parameters for each training instance, rather than at the end of each iteration
- Specifically, we calculate the error and update the parameters accordingly for each training observation
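- Below is a minimal sketch of this per-observation procedure for a simple model y_hat = w * x + b trained with squared error; the function name `sgd_epoch`, the toy data, and the learning rate are assumptions for illustration, not a prescribed implementation

```python
def sgd_epoch(data, w, b, lr=0.05):
    """One pass over the training set, updating after every single observation."""
    for x, y in data:
        y_hat = w * x + b      # prediction for this one instance
        error = y_hat - y      # error for this one instance
        w -= lr * error * x    # gradient of 0.5 * error**2 with respect to w
        b -= lr * error        # gradient of 0.5 * error**2 with respect to b
    return w, b

# Usage on a tiny dataset generated from y = 3x + 1 (hypothetical example data)
data = [(0.0, 1.0), (1.0, 4.0), (2.0, 7.0), (3.0, 10.0)]
w, b = 0.0, 0.0
for _ in range(200):
    w, b = sgd_epoch(data, w, b)
print(w, b)   # approaches w = 3, b = 1
```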
Upsides
- SGD can quickly give us insight about the performance and rate of improvement of our network
- This is because parameters are updated so frequently
- SGD is sometimes useful for beginners
- It is sometimes simpler to understand than other variants of gradient descent
- SGD can sometimes lead to faster learning
- The increased frequency of parameter updates can result in faster learning
- SGD can help the network avoid premature convergence
- The noisy update process can allow the network to escape poor local minima
Downsides
- SGD can be more computationally expensive than other optimization algorithms
- This is because the parameters are updated so frequently
- SGD can result in noisy gradient updates
- The gradients can have a high variance across training epochs
- SGD can sometimes lead to worse accuracy
- The noisy learning process can make it hard for the algorithm to settle on a minimum of the cost function
tldr
- Stochastic gradient descent (SGD) is a variation of gradient descent
- In this variation, the gradient descent procedure updates parameters for each training instance, rather than at the end of each iteration
- Specifically, we calculate the error and update the parameters accordingly for each training observation