Gradient-Based Learning

Motivating Gradient-Based Learning

  • Up until now, we've built neural networks by specifying a pre-defined optimization procedure, cost function, and model family
  • When dealing with convex functions, we can rely on a simple closed-form equation for finding parameter estimates
  • This is because convex functions guarantee convergence to a global minimum
  • For example, we can rely on the normal equations for linear regression, since its cost function is convex
  • However, we also want to find parameter estimates for nonlinear models
  • The nonlinearity of a neural network causes most loss functions to become nonconvex
  • This means we can't rely on those simple equations anymore
  • Instead, neural networks are usually trained by using iterative, gradient-based optimizers that merely drive the cost function to a very low value

Introducing Gradient Descent

  • Most deep learning algorithms involve optimization
  • Optimization refers to the task of either minimizing or maximizing some function f(x) by altering x
  • We usually phrase most optimization problems in terms of minimizing f(x)

    • Maximization may be accomplished via a minimization algorithm by minimizing -f(x)
  • The function we want to minimize or maximize is called the objective function or criterion
  • When we are minimizing it, we may also call it the cost function, loss function, or error function
  • We often denote the value that minimizes or maximizes a function with a superscript \ast
  • Specifically, we might say x^{\ast} = \arg\min_{x} f(x)
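
For example, a minimal sketch (using scipy.optimize, which these notes don't otherwise use, on a made-up toy function) of maximizing by minimizing the negation:

from scipy.optimize import minimize_scalar

f = lambda x: -(x - 2)**2 # toy function with a maximum at x = 2
result = minimize_scalar(lambda x: -f(x)) # maximize f by minimizing -f
print(result.x) # ~2.0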

Using Derivatives with Gradient Descent

  • As a reminder, a derivative is useful for minimizing a function because it tells us how to change x in order to make a small improvement in y
  • We can reduce f(x) by moving x in small steps with the opposite sign of the derivative
  • This technique is called gradient descent
  • When f^{\prime}(x)=0, the derivative provides no information about which direction to move
  • These points are known as critical points or stationary points
  • For functions with multiple inputs, we must use partial derivatives instead of our usual derivatives
  • A partial derivative measures how f changes as only the variable x_{i} changes at point x
  • The gradient of f is the vector containing all of the partial derivatives
  • Element i of the gradient is the partial derivative of f with respect to x_{i}
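
To make the multi-input case concrete, here is a minimal sketch (on a made-up function, f(x) = x_1^2 + x_2^2, whose gradient is (2x_1, 2x_2)) of moving against the gradient:

import numpy as np

def grad_f(x):
    return 2 * x # vector of partial derivatives of f(x) = x_1^2 + x_2^2

x = np.array([3.0, -4.0]) # starting point
lr = 0.1 # step size
for _ in range(100):
    x = x - lr * grad_f(x) # step in the direction opposite the gradient
print(x) # approaches the minimum at [0, 0]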

Notation used in Gradient Descent

  • A derivative of a function f is denoted as \frac{dy}{dx} or f^{\prime}(x)
  • A partial derivative of a function f is denoted as \frac{\partial}{\partial x_{i}} f(x)
  • A gradient is denoted as \nabla_{x}f(x)
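
As a sanity check on this notation, each partial derivative in \nabla_{x}f(x) can be approximated numerically (a sketch assuming a centered finite-difference approximation and a made-up toy function):

import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (f(x + step) - f(x - step)) / (2 * eps) # approximates df/dx_i
    return grad

f = lambda x: x[0]**2 + 3*x[1] # toy function
print(numerical_gradient(f, np.array([1.0, 2.0]))) # approximately [2.0, 3.0]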

Local Minimums in Gradient Descent

  • A local minimum is a point where f(x) is lower than at all of its neighboring points
  • Meaning, it is no longer possible to decrease f(x) by making infinitesimal steps
  • A local maximum is a point where f(x) is greater than at all of its neighboring points
  • A saddle point is a critical point that is neither a local minimum nor a local maximum
  • A point that obtains the absolute lowest value of f(x) is a global minimum
  • A function has only one global minimum value, but it can have many local minima
  • In the context of deep learning, we optimize functions that may have many local minima that are not optimal and many saddle points surrounded by very flat regions
  • Therefore, we usually settle for finding a value of ff that is very low but not necessarily minimal in any formal sense
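
A minimal sketch (with a made-up one-dimensional function) of gradient descent settling into a nearby local minimum rather than the global one:

f_prime = lambda x: 4*x**3 - 6*x + 1 # derivative of f(x) = x^4 - 3x^2 + x
x = 1.5 # start near the local minimum at x ≈ 1.13
lr = 0.01
for _ in range(1000):
    x = x - lr * f_prime(x) # gradient descent update
print(x) # ≈ 1.13, a local minimum; the global minimum is near x ≈ -1.31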

Defining Gradient Descent for Linear Regression

  1. Determine our parameters:
\theta_{0}, \theta_{1}
  2. Create a hypothesis:
h_{\theta}(x) = \theta_{0} + \theta_{1}x
  3. Determine our cost function:
J(\theta_{0}, \theta_{1}) = \frac{1}{2m} \sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})^{2}
  4. Determine our goal:
\arg\min_{\theta_{0}, \theta_{1}} J(\theta_{0}, \theta_{1})
  5. Initialize parameters:
\theta_{0} = \theta_{1} = 0
  6. Update parameters until J(\theta_{0}, \theta_{1}) converges to a minimum, using the partial derivatives derived after this list:
\theta_{j} = \theta_{j} - \alpha \frac{\partial J(\theta_{0}, \theta_{1})}{\partial \theta_{j}}
  • Where J(\theta_{0}, \theta_{1}) is the quadratic cost function
  • Where \alpha is our learning rate
  • Where \theta_{j} is either \theta_{0} or \theta_{1}
  • Where \frac{\partial J(\theta_{0}, \theta_{1})}{\partial \theta_{0}} = \frac{1}{m} \sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})
  • Where \frac{\partial J(\theta_{0}, \theta_{1})}{\partial \theta_{1}} = \frac{1}{m} \sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})\,x^{(i)}
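
As a quick check on the last two bullets, the partial derivative with respect to \theta_{1} follows from the chain rule, since \frac{\partial h_{\theta}(x^{(i)})}{\partial \theta_{1}} = x^{(i)} (the \theta_{0} case is identical except that \frac{\partial h_{\theta}(x^{(i)})}{\partial \theta_{0}} = 1):

\frac{\partial J(\theta_{0}, \theta_{1})}{\partial \theta_{1}} = \frac{1}{2m} \sum_{i=1}^{m} 2\,(h_{\theta}(x^{(i)})-y^{(i)}) \cdot \frac{\partial h_{\theta}(x^{(i)})}{\partial \theta_{1}} = \frac{1}{m} \sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})\,x^{(i)}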

Translating Gradient Descent

  • The learning rate \alpha ensures we only take small steps
  • Meaning, even if we have a very large derivative, we still take only a very small step
  • In other words, we won't overshoot the minimum, because we only make small changes (see the sketch after this list)
  • Although the cost function continues to decrease until convergence, the parameter estimates may swing back and forth around the minimum, like a pendulum, before settling
  • For neural networks, \frac{\partial J(\theta_{0}, \theta_{1})}{\partial \theta_{i}} is estimated using a training algorithm, such as backpropagation
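
A minimal sketch (with a made-up function, f(x) = x^2) of why the small steps matter: the same update rule converges with a small learning rate and diverges with one that is too large:

f_prime = lambda x: 2*x # derivative of f(x) = x^2
for lr in (0.1, 1.1): # stable vs. unstable step sizes
    x = 1.0
    for _ in range(20):
        x = x - lr * f_prime(x) # gradient descent update
    print(lr, x) # lr=0.1 ends near 0; lr=1.1 oscillates with growing magnitude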

Gradient Descent Implementation

>>> # Load data: predict petal width from petal length
>>> from sklearn import datasets
>>> import numpy as np
>>> iris = datasets.load_iris()
>>> X = iris.data[:,2] # petal length
>>> y = iris.data[:,3] # petal width
>>> n = X.size

>>> # Initialize parameters
>>> theta_0 = 100 # actual = -0.3631
>>> theta_1 = 0.4158 # actual = 0.4158
>>> lr = 0.001 # learning rate
>>> epochs = 100000 # iterations

>>> # Perform gradient descent
>>> # Note: these derivatives use the mean squared error without the 1/2
>>> # factor, which is where the factor of 2 comes from
>>> for i in range(epochs):
...     y_hat = theta_0 + theta_1*X # predictions
...     d_theta_0 = (-2/n) * np.sum(y-y_hat) # derivative w.r.t. intercept
...     d_theta_1 = (-2/n) * np.sum(X*(y-y_hat)) # derivative w.r.t. slope
...     theta_0 = theta_0 - lr*d_theta_0 # update intercept
...     theta_1 = theta_1 - lr*d_theta_1 # update slope

>>> theta_0
-0.363211
>>> theta_1
0.415755
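
Since the cost function for linear regression is convex, we can sanity-check the result against a closed-form least-squares fit (a sketch using numpy's polyfit, which returns the slope and intercept in that order):

>>> np.polyfit(X, y, 1) # should be close to [0.4158, -0.3631]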

tldr

  • Gradient Descent finds the parameters that minimize the cost function (error in prediction)
  • The gradient is a vector-valued function that represents the slope of the tangent of the graph of the function, pointing in the direction of the greatest rate of increase
  • It generalizes the derivative to multiple inputs, indicating the slope of the cost function at a given point
