Motivating Gradient-Based Learning
- Up until now, we've built neural networks by specifying a pre-defined optimization procedure, cost function, and model family
- When dealing with convex cost functions, we can rely on a simple closed-form equation for finding parameter estimates
- This is because convex functions guarantee convergence to a global minimum
- For example, linear regression has a convex quadratic cost, so its parameters can be found directly with the normal equations (sketched at the end of this section)
- However, we also want to find parameter estimates for nonlinear models
- The nonlinearity of a neural network causes most loss functions to become nonconvex
- This means we can't rely on those simple equations anymore
- Instead, neural networks are usually trained by using iterative, gradient-based optimizers that merely drive the cost function to a very low value
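- For contrast, here is a minimal sketch of the closed-form case using the normal equations, $\theta = (X^\top X)^{-1} X^\top y$ (the toy data and variable names are my own, not from the notes):

>>> import numpy as np
>>> # Toy data: y is roughly 2x + 1 plus noise
>>> rng = np.random.default_rng(0)
>>> x = rng.uniform(0, 10, size=50)
>>> y = 2*x + 1 + rng.normal(scale=0.5, size=50)
>>> X = np.column_stack([np.ones_like(x), x])   # design matrix with a bias column
>>> # One linear solve, no iteration -- possible because the squared-error cost is convex
>>> theta = np.linalg.solve(X.T @ X, X.T @ y)   # close to [1.0, 2.0]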
Introducing Gradient Descent
- Most deep learning algorithms involve optimization
- Optimization refers to the task of either minimizing or maximizing some function $f(x)$ by altering $x$
- We usually phrase most optimization problems in terms of minimizing $f(x)$
- Maximization may be accomplished via a minimization algorithm by minimizing $-f(x)$ (see the sketch at the end of this section)
- The function we want to minimize or maximize is called the objective function or criterion
- When we are minimizing it, we may also call it the cost function, loss function, or error function
- We often denote the value that minimizes or maximizes a function with a superscript $*$
- Specifically, we might say $x^* = \arg\min_x f(x)$
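- A minimal sketch of the minimize-to-maximize trick, using scipy.optimize (the toy function and names are my own, not part of the original notes):

>>> from scipy.optimize import minimize
>>> f = lambda x: -(x - 2)**2 + 5                 # concave; maximum at x = 2
>>> res = minimize(lambda x: -f(x[0]), x0=[0.0])  # maximize f by minimizing -f
>>> res.x[0]                                      # approximately 2.0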
Using Derivatives with Gradient Descent
- As a reminder, a derivative is useful for minimizing a function because it tells us how to change $x$ in order to make a small improvement in $y = f(x)$
- We can reduce $f(x)$ by moving $x$ in small steps with the opposite sign of the derivative (a worked example follows this list)
- This technique is called gradient descent
- When $f'(x) = 0$, the derivative provides no information about which direction to move
- These points are known as critical points or stationary points
- For functions with multiple inputs, we must use partial derivatives instead of our usual derivatives
- A partial derivative $\frac{\partial}{\partial x_i} f(x)$ measures how $f$ changes as only the variable $x_i$ changes at point $x$
- The gradient of $f$ is the vector $\nabla_x f(x)$ containing all of the partial derivatives
- Element $i$ of the gradient is the partial derivative of $f$ with respect to $x_i$
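- As a concrete sketch of this rule (the function, starting point, and step size are my own choices), consider minimizing $f(x) = x^2$, whose derivative is $f'(x) = 2x$:

>>> f_prime = lambda x: 2*x      # derivative of f(x) = x**2
>>> x = 10.0                     # arbitrary starting point
>>> eps = 0.1                    # step size
>>> for i in range(100):
...     x = x - eps*f_prime(x)   # move against the sign of the derivative
>>> round(x, 6)                  # x has been driven to the minimum at 0
0.0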
Notation used in Gradient Descent
- A derivative of a function is denoted as $f'(x)$ or $\frac{dy}{dx}$
- A partial derivative of a function is denoted as $\frac{\partial}{\partial x_i} f(x)$
- A gradient of a function is denoted as $\nabla_x f(x)$
Local Minima in Gradient Descent
- A local minimum is a point where $f(x)$ is lower than at all of its neighboring points
- Meaning, it is no longer possible to decrease $f(x)$ by making infinitesimal steps
- A local maximum is a point where $f(x)$ is greater than at all of its neighboring points
- A saddle point is a critical point that is neither a local minimum nor a local maximum
- A point that obtains the absolute lowest value of $f(x)$ is a global minimum
- There is only one global minimum value, though it may be attained at more than one point, and there can be many local minima
- In the context of deep learning, we optimize functions that may have many local minima that are not optimal and many saddle points surrounded by very flat regions
- Therefore, we usually settle for finding a value of $f$ that is very low but not necessarily minimal in any formal sense, as the sketch below illustrates
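- As a small illustration (the double-well function below is my own, not from the notes), gradient descent settles into whichever minimum lies near its starting point, and that minimum need not be the global one:

>>> f_prime = lambda x: 4*x*(x**2 - 1) + 0.2   # derivative of f(x) = (x**2 - 1)**2 + 0.2*x
>>> def descend(x, lr=0.01, steps=5000):
...     for i in range(steps):
...         x = x - lr*f_prime(x)
...     return x
>>> descend(0.5)    # settles near x = 1, a local minimum that is not global
>>> descend(-0.5)   # settles near x = -1, the global minimum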
Defining Gradient Descent for Linear Regression
- Determine our parameters: $\theta_0$ and $\theta_1$
- Create a hypothesis: $h_{\theta}(x) = \theta_0 + \theta_1 x$
- Determine our cost function: $J(\theta_0, \theta_1) = \frac{1}{n} \sum_{i=1}^{n} (h_{\theta}(x^{(i)}) - y^{(i)})^2$
- Determine our goal: $\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$
- Initialize parameters: choose starting values for $\theta_0$ and $\theta_1$
- Update parameters until our $J(\theta_0, \theta_1)$ converges to a minimum: $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$
- Where $J$ is the quadratic cost function
- Where $\alpha$ is our learning rate
- Where $j$ is either $0$ or $1$
- Where $\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{2}{n} \sum_{i=1}^{n} (h_{\theta}(x^{(i)}) - y^{(i)})$
- Where $\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{2}{n} \sum_{i=1}^{n} (h_{\theta}(x^{(i)}) - y^{(i)}) x^{(i)}$
Translating Gradient Descent
- The learning rate ensures we only take small steps
- Meaning, even if we have a very large derivative, we still take only a small step
- In other words, we avoid overreacting to any single gradient by making only small changes
- With a suitable learning rate the cost keeps decreasing until convergence, although the parameters may overshoot and swing back and forth like a pendulum around the minimum before settling (see the sketch below)
- For neural networks, the gradient of the cost function is computed using a training algorithm such as backpropagation
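- To see why the size of the learning rate matters, here is a tiny sketch (again on $f(x) = x^2$; the step sizes are my own choices):

>>> f_prime = lambda x: 2*x
>>> def run(lr, x=1.0, steps=50):
...     for i in range(steps):
...         x = x - lr*f_prime(x)
...     return x
>>> run(0.1)   # small steps: converges toward the minimum at 0
>>> run(1.1)   # steps too large: every update overshoots, and x diverges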
Gradient Descent Implementation
>>> from sklearn import datasets
>>> # Load data (predict petal width from petal length)
>>> iris = datasets.load_iris()
>>> X = iris.data[:,2]   # petal length
>>> y = iris.data[:,3]   # petal width
>>> n = X.size
>>> # Initialize parameters
>>> theta_0 = 100      # actual = -0.3631
>>> theta_1 = 0.4158   # actual = 0.4158
>>> lr = 0.001         # learning rate
>>> epochs = 100000    # iterations
>>> # Perform gradient descent
>>> for i in range(epochs):
...     y_hat = theta_0 + theta_1*X             # hypothesis
...     d_theta_0 = (-2/n) * sum(y-y_hat)       # derivative wrt theta_0
...     d_theta_1 = (-2/n) * sum(X*(y-y_hat))   # derivative wrt theta_1
...     theta_0 = theta_0 - lr*d_theta_0        # update theta_0
...     theta_1 = theta_1 - lr*d_theta_1        # update theta_1
>>> theta_0
-0.363211
>>> theta_1
0.415755
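- As a sanity check (my addition, not in the original notes), the same line can be fit in closed form with np.polyfit; the result matches the gradient descent estimates above:

>>> import numpy as np
>>> np.polyfit(X, y, deg=1).round(4)   # returns [slope, intercept]
array([ 0.4158, -0.3631])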
tldr
- Gradient Descent finds the parameters that minimize the cost function (error in prediction)
- A gradient is a vector-valued function that represents the slope of the tangent of the graph of the function, pointing in the direction of the greatest rate of increase of the function
- In one dimension, it reduces to the ordinary derivative, which indicates the incline or slope of the cost function