Motivating Mini-Batch Gradient Descent
- The key advantage of mini-batch gradient descent, as opposed to batch gradient descent, goes back to the fundamental idea behind stochastic gradient descent
- In batch gradient descent, we compute the gradient over the entire dataset, averaging over a potentially vast amount of data
- The disadvantages of this are the following:
- This involves using a lot of memory
- We're more likely to get stuck at saddle points, since the averaged gradient contains little noise
- In stochastic gradient descent, we update parameters for each single instance of training data
- Since it's based on one random data point, learning becomes very noisy and we may go off in a direction far from the batch gradient
- However, we need some degree of noisiness in non-convex optimization
- This is because we're more likely to escape from local minima and saddle points
- The obvious disadvantage of SGD is its poor efficiency
- In most cases, we would need to loop over the entire dataset many times to find a good solution
- The mini-batch methodology is a compromise that injects enough noise into each gradient update while still achieving relatively speedy convergence
Describing Mini-Batch Gradient Descent
- Mini-batch gradient descent is a type of gradient descent algorithm
- It involves splitting the training dataset into small batches
- Then, those small batches are used to find gradients and update parameters
- We need to tune a new hyperparameter for mini-batch gradient descent: the mini-batch size
- This hyperparameter represents the number of observations in each sample (or miniature batch), as sketched after this list
- Mini-batch gradient descent attempts to find a balance between the robustness of stochastic gradient descent and the efficiency of batch gradient descent
- It is the most common implementation of gradient descent
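To make the splitting and the mini-batch size hyperparameter concrete, here is a minimal NumPy sketch; the function name `make_minibatches`, the column-per-observation layout, and the example shapes are assumptions made for illustration rather than anything prescribed above.

```python
import numpy as np

def make_minibatches(X, Y, batch_size, seed=0):
    """Split predictors X (n x m) and responses Y (1 x m) into a list of
    aligned (X_t, Y_t) mini-batches with batch_size observations each."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]                        # number of observations (columns)
    order = rng.permutation(m)            # shuffle so each mini-batch is a random sample
    X_shuffled, Y_shuffled = X[:, order], Y[:, order]

    minibatches = []
    for start in range(0, m, batch_size):
        end = start + batch_size          # the last mini-batch may be smaller than batch_size
        minibatches.append((X_shuffled[:, start:end], Y_shuffled[:, start:end]))
    return minibatches

# Example: 1,000 observations with 3 predictors and a mini-batch size of 100 gives 10 mini-batches
X = np.random.randn(3, 1000)
Y = np.random.randn(1, 1000)
print(len(make_minibatches(X, Y, batch_size=100)))  # 10
```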
Defining Mini-Batch Gradient Descent
- Split our predictor data $X$ into $T$ samples $X^{\{1\}}, X^{\{2\}}, \dots, X^{\{T\}}$
- Where $X$ is an $n \times m$ matrix
- Where $n$ represents the number of predictors associated with an observation
- Where $m$ represents the number of observations
- Where $T$ represents the number of minibatches
- Where $x^{(i)}$ represents a single observation in a training set
- These samples represent miniature batches
- The number of observations in each sample is represented using the mini-batch size hyperparameter
- Split our response data $Y$ into $T$ samples $Y^{\{1\}}, Y^{\{2\}}, \dots, Y^{\{T\}}$
- Where $Y$ is a $1 \times m$ matrix
- Where $m$ represents the number of observations
- Where $T$ represents the number of minibatches
- Where $y^{(i)}$ represents a single observation in a training set
- These samples represent miniature batches
- The number of observations in each sample $Y^{\{t\}}$ should match the number of observations in the corresponding sample $X^{\{t\}}$ from $X$
- Choose one minibatch $X^{\{t\}}, Y^{\{t\}}$
- In other words, focus on a single minibatch $t$ out of all $T$ minibatches for now
- For now, let's say $t = 1$ as an example
- Let's also say each minibatch contains the same number of training observations, given by the mini-batch size
- Perform regular batch gradient descent on minibatch $X^{\{t\}}, Y^{\{t\}}$
- Perform forward propagation on $X^{\{t\}}$
- Evaluate the cost function on the minibatch
- Perform backward propagation and update the parameters
- Iterate through the remaining minibatches
- Perform the previous two steps (choosing a minibatch, then running batch gradient descent on it) on each of the remaining $T-1$ minibatches (see the sketch after this list)
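As a concrete end-to-end sketch of the steps above, the loop below runs one epoch of mini-batch gradient descent on a simple linear model with a mean squared error cost. The model, the function name `run_epoch`, and every hyperparameter value are illustrative assumptions, not something these notes prescribe; a neural network would swap in its own forward propagation, cost, and backward propagation.

```python
import numpy as np

def run_epoch(X, Y, w, b, batch_size, learning_rate, rng):
    """One epoch of mini-batch gradient descent for a linear model
    yhat = wX + b with a mean squared error cost (illustrative only)."""
    m = X.shape[1]
    order = rng.permutation(m)                  # shuffle, then carve into T mini-batches

    for start in range(0, m, batch_size):
        cols = order[start:start + batch_size]
        X_t, Y_t = X[:, cols], Y[:, cols]       # the t-th mini-batch X^{t}, Y^{t}
        m_t = X_t.shape[1]

        # Forward propagation on the mini-batch
        Y_hat = w @ X_t + b

        # Evaluate the cost function on the mini-batch
        cost = np.mean((Y_hat - Y_t) ** 2)

        # Backward propagation: gradients of the cost with respect to w and b
        dY = 2.0 * (Y_hat - Y_t) / m_t
        dw = dY @ X_t.T
        db = dY.sum()

        # One gradient descent step per mini-batch
        w = w - learning_rate * dw
        b = b - learning_rate * db
    return w, b, cost

# Usage: 3 predictors, 1,000 observations, mini-batch size 100, so 10 updates per epoch
rng = np.random.default_rng(0)
X, Y = rng.standard_normal((3, 1000)), rng.standard_normal((1, 1000))
w, b = np.zeros((1, 3)), 0.0
for epoch in range(5):
    w, b, cost = run_epoch(X, Y, w, b, batch_size=100, learning_rate=0.01, rng=rng)
```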
Defining Notation
- $(i)$ used in $x^{(i)}$ represents the index of a training set observation
- $[l]$ used in $z^{[l]}$ represents the index of a neural network layer
- $\{t\}$ used in $X^{\{t\}}$ represents the index of a minibatch
- A single epoch represents a single pass through a training set
- For example, a single epoch refers to taking only one gradient descent step for batch gradient descent
- A single epoch refers to taking $T$ gradient descent steps for mini-batch gradient descent, one step per minibatch
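To put rough numbers on the epoch definitions, assume 1,000 training observations and a mini-batch size of 100 (both figures are illustrative, not from the notes above):

```python
m = 1000             # training observations
batch_size = 100     # mini-batch size hyperparameter
T = m // batch_size  # number of mini-batches

steps_per_epoch_batch_gd = 1   # batch gradient descent: one step per pass over the data
steps_per_epoch_minibatch = T  # mini-batch gradient descent: 10 steps per pass
steps_per_epoch_sgd = m        # stochastic gradient descent: 1,000 steps per pass
```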
General Upsides
- Mini-batch gradient descent involves robust parameter convergence
- Mini-batch gradient descent is more robust compared to batch gradient descent
- The parameter convergence manages to avoid local minima more often
- This happens because the parameters are updated more frequently
- This adds a degree of noisiness to our learning process
- We need some degree of noisiness in non-convex optimization to avoid local minima
- Mini-batch gradient descent is computationally efficient
- It is more efficient than stochastic and batch gradient descent
- Training speed for batch gradient descent is too slow because each iteration must process the entire dataset
- On the other hand, training speed for stochastic gradient descent is too slow because we usually need to run many more iterations, one per observation
- Mini-batch gradient descent doesn't require all training data in memory
- This is more efficient and easier on memory, since only one minibatch needs to be held in memory at a time
- This is partially because vectorizing our parameters and data allows us to efficiently compute on an entire sample $X^{\{t\}}$ at once (see the sketch after this list)
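One way to see the memory benefit is to stream mini-batches lazily, for example from a memory-mapped file, so the full training set is never loaded into RAM at once. The file name, shapes, and generator below are assumptions made up for this sketch.

```python
import numpy as np

# Write a (pretend) large training set to disk, then stream it back one mini-batch at a time.
np.save("X_train.npy", np.random.randn(3, 100_000))

def minibatch_stream(path, batch_size):
    X = np.load(path, mmap_mode="r")  # memory-mapped: columns are read from disk on demand
    m = X.shape[1]
    for start in range(0, m, batch_size):
        # Only one mini-batch worth of data is materialized in memory at a time
        yield np.asarray(X[:, start:start + batch_size])

for X_t in minibatch_stream("X_train.npy", batch_size=256):
    pass  # forward propagation, cost, and backward propagation on X_t would go here
```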
General Downsides
- Mini-batch requires the tuning of additional hyperparameters
- Specifically, it requires tuning an additional mini-batch size hyperparameter for the learning algorithm
- Mini-batch involves more steps in its algorithm
- Gradient information must be accumulated across mini-batches of training examples
Implementing Mini-Batch Gradient Descent
- The use of large mini-batches increases the available computational parallelism
- However, we should typically prefer small batch sizes
- This is because small batch sizes provide improved generalization performance
- The best performance has been consistently obtained for mini-batch sizes between 2 and 32
- This contrasts with recent work advocating the use of mini-batch sizes in the thousands
tldr
- The mini-batch methodology is a compromise that injects enough noise into each gradient update while still achieving relatively speedy convergence
- In other words, mini-batch finds a balance between batch and stochastic gradient descent
- The mini-batch algorithm refers to splitting training data into $T$ samples and performing parameter updates on each sample using the gradient descent algorithm we're all used to
- Mini-batch gradient descent is the most popular type of gradient descent method