Normalizing Inputs

Motivating Normalization of Inputs

  • If the scale of our input data is vastly different across features, then our cost function may take much longer to converge to 0
  • In other words, gradient descent will take a long time to find optimal parameter values for w and b if our data is unnormalized
  • In certain situations, the cost function may not even converge to 0, depending on the scale of our input data and the size of our learning rate λ for gradient descent

Illustrating the Need for Normalization

  • Let's say we define our cost function as the following:
J(w,b) = \frac{1}{2m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}_{i}, y_{i})
  • And our input data looks like the following:
x_{1} : 0, ..., 1
x_{2} : 1, ..., 1000
  • The contour could look like the following after gradient descent:

[Figure: contour plot of the cost function with unnormalized input features]

  • Where the darker colors represent a smaller value of J
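  • The elongated contour comes from the gradients themselves being on very different scales. Below is a minimal Python sketch, assuming a simple linear model with a squared-error loss standing in for the unspecified loss L above, and hypothetical data matching the ranges shown; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 200
x1 = rng.uniform(0, 1, m)      # feature on a small scale: 0, ..., 1
x2 = rng.uniform(1, 1000, m)   # feature on a much larger scale: 1, ..., 1000
X = np.column_stack([x1, x2])
y = 2 * x1 + 0.003 * x2 + rng.normal(0, 0.1, m)  # hypothetical targets

# One gradient evaluation at w = 0, b = 0 for the squared-error cost
# J(w, b) = (1/2m) * sum((y_hat - y)^2)
w, b = np.zeros(2), 0.0
y_hat = X @ w + b
grad_w = (X.T @ (y_hat - y)) / m

print(grad_w)  # the x2 component is orders of magnitude larger than the x1 component
```

  • Because the gradient along the x2 direction is so much larger, any learning rate small enough to keep that weight stable makes progress along the x1 direction very slow, which is exactly the long, narrow contour pictured above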

Illustrating the Goal of Normalization

  • Using our input data from before, our normalized data could look like the following:
x_{1} : 0, ..., 3
x_{2} : 0, ..., 3.5
  • The contour could look like the following after gradient descent:

[Figure: contour plot of the cost function with normalized input features]

  • Note that there isn't a great need for normalization if our input data is on the same relative scale
  • However, normalization is especially important if the scale of our input data varies across features dramatically
  • To be safe, we should just always normalize our data

Normalization Algorithm

  • Normalizing input data includes the following steps:

    1. Center the data around the mean:
       \mu = \frac{1}{m} \sum_{i=1}^{m} x_{i}
       \tilde{x}_{i} = x_{i} - \mu
    2. Normalize the variance:
       \sigma = \sqrt{\frac{1}{m-1} \sum_{i=1}^{m} \tilde{x}_{i}^{2}}
       x^{norm}_{i} = \frac{\tilde{x}_{i}}{\sigma}
  • Centering x around μ will make it so the new mean of x is 0
  • Normalizing the variance of x will make the new variance of x equal to 1
  • Input data should be normalized for both the training and test sets
  • We should first calculate μ\mu and σ\sigma for the training set
  • Then, the test set should use those same parameters μ\mu and σ\sigma from the training set
  • We should not calculate one set of parameters for the training set, and a different set of parameters for the test set
  • In other words, our training and test sets should be scaled in the exact same way
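  • To make the two steps concrete, here is a minimal NumPy sketch that computes μ and σ from a hypothetical training set and then applies those same parameters to the test set; the array and function names are illustrative, not from any particular library.

```python
import numpy as np

def fit_normalization(X_train):
    """Compute per-feature mu and sigma from the training set only."""
    mu = X_train.mean(axis=0)            # mu = (1/m) * sum(x_i)
    sigma = X_train.std(axis=0, ddof=1)  # sigma with the 1/(m-1) factor from the formula above
    return mu, sigma

def apply_normalization(X, mu, sigma):
    """Center around mu and divide by sigma."""
    return (X - mu) / sigma

# Hypothetical data on very different scales
rng = np.random.default_rng(0)
X_train = np.column_stack([rng.uniform(0, 1, 500), rng.uniform(1, 1000, 500)])
X_test = np.column_stack([rng.uniform(0, 1, 100), rng.uniform(1, 1000, 100)])

mu, sigma = fit_normalization(X_train)
X_train_norm = apply_normalization(X_train, mu, sigma)
X_test_norm = apply_normalization(X_test, mu, sigma)  # reuse the training-set mu and sigma

print(X_train_norm.mean(axis=0))         # approximately [0, 0]
print(X_train_norm.std(axis=0, ddof=1))  # approximately [1, 1]
```

  • The training data ends up with mean 0 and variance 1 per feature; the test data lands close to that but not exactly, which is expected because it was scaled with the training-set parameters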

tldr

  • We need to normalize our input data to improve training performance
  • Normalization involves the following steps:

    • Centering x around μ will make it so the new mean of x is 0
    • Normalizing the variance of x will make the new variance of x equal to 1
  • Input data should be normalized for both the training and test sets
  • Normalization involves calculating parameters μ and σ for each of our input features
  • We should use the same μ\mu and σ\sigma parameters for both the training set and test set
  • To reiterate, this form of normalization can only be applied to our input data
  • This can't be applied to any activations in our hidden layers
