Smoothing Splines

Motivating Smoothing Splines

  • We create regression splines by specifying a set of knots, producing a sequence of basis functions, and then using least squares to estimate the spline coefficients
  • Smoothing splines are created using a different approach
  • Instead, smoothing splines are created by finding a function $g$ that minimizes the RSS plus a penalty term
  • This notion of minimizing a loss function plus a penalty term also appears in ridge regression and the lasso

Describing Smoothing Splines

  • A spline is a special function defined piecewise by polynomials
  • Again, smoothing splines are created by minimizing the RSS with some constraints
  • If we don't put any constraints on $g(x)$, then we can always overfit the data by interpolating every observation, which drives the RSS to zero (see the sketch below)
  • Therefore, we want to find the function $g$ that minimizes the RSS subject to an added constraint, which forces the curve to be smooth
  • A smoothing spline is the function $g$ that minimizes the RSS plus a penalty term
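
To see the overfitting point concretely, here is a minimal sketch (assuming NumPy and SciPy are available; the toy data are made up for illustration). SciPy's CubicSpline is an interpolating spline, so with no roughness constraint it passes through every training point and drives the training RSS to essentially zero.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Toy data: a noisy sine curve (made up for illustration)
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 20))
y = np.sin(x) + rng.normal(0, 0.3, x.size)

# With no roughness constraint, an interpolant passes through every point...
g = CubicSpline(x, y)

# ...so the training RSS is numerically zero, i.e. pure overfitting
rss = np.sum((y - g(x)) ** 2)
print(f"RSS of the interpolating spline: {rss:.2e}")
```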

Finding a Function $g$

  • We find a function $g$ by minimizing the following:
$$\sum_{i=1}^{n} \left( y_i - g(x_i) \right)^2 + \lambda \int g''(t)^2 \, dt$$
  • The $\sum_{i=1}^{n} (y_i - g(x_i))^2$ term is our loss function
  • The $\lambda \int g''(t)^2 \, dt$ term is our penalty term
  • The loss function encourages $g$ to fit the data well
  • The penalty term penalizes the variability in $g$
  • The notation $g''(t)$ indicates the second derivative of the function $g$

    • The first derivative $g'(t)$ measures the slope of a function at $t$
    • The second derivative corresponds to the amount by which the slope is changing
    • Roughly speaking, the second derivative is a measure of the function's roughness
    • Specifically, it is large in absolute value if $g(t)$ is very wiggly near $t$, and it is close to zero otherwise
    • For example, the second derivative of a straight line is zero, since a straight line is perfectly smooth
  • The $\int$ notation is an integral, which we can think of as a summation over the range of $t$
  • In other words, $\int g''(t)^2 \, dt$ is simply a measure of the total amount of roughness across the function (approximated numerically in the sketch below)
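
Here is a rough numerical sketch of the criterion itself, assuming only NumPy; the helper name penalized_criterion, the toy data, and the grid size are illustrative and not part of the text. Real smoothing spline fitters minimize this criterion in closed form over natural cubic splines rather than evaluating it numerically like this.

```python
import numpy as np

def penalized_criterion(g, x, y, lam, grid_size=1000):
    """Approximate sum_i (y_i - g(x_i))^2 + lam * integral of g''(t)^2 dt.

    g is any callable candidate function; the integral runs over the
    observed range of x and is approximated by finite differences and
    the trapezoidal rule.
    """
    rss = np.sum((y - g(x)) ** 2)                 # loss term: fit to the data

    t = np.linspace(x.min(), x.max(), grid_size)  # grid over the range of t
    g2 = np.gradient(np.gradient(g(t), t), t)     # finite-difference g''(t)
    roughness = np.trapz(g2 ** 2, t)              # total squared curvature

    return rss + lam * roughness

# A straight line has zero second derivative, so only the RSS term remains
x = np.linspace(0, 10, 25)
y = 1.5 * x + np.random.default_rng(1).normal(0, 1.0, x.size)

def line(t):
    return 1.5 * t

print(penalized_criterion(line, x, y, lam=5.0))
```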

Summarizing the Minimization Formula

  • The loss function encourages $g$ to fit the data well
  • The penalty term penalizes the variability in $g$
  • Roughly speaking, the second derivative of a function is a measure of its roughness
  • Minimizing the error and roughness is what we want, since we want a smooth curve that fits well
  • When the tuning parameter $\lambda = 0$, the penalty term has no effect
  • In this case, the function gg will perfectly fit the data, causing overfitting
  • When the tuning parameter $\lambda \to \infty$, $g$ will be perfectly smooth
  • In this case, $g$ will just be a straight line

    • Specifically, $g$ will be the linear least squares line
  • Essentially, the tuning parameter $\lambda$ controls the bias-variance trade-off of the smoothing spline (see the sketch below)
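
A small sketch of the two extremes, assuming SciPy ≥ 1.10 (which provides make_smoothing_spline, a cubic smoothing spline fit with a lam penalty parameter) and made-up toy data; the particular lam values are arbitrary.

```python
import numpy as np
from scipy.interpolate import make_smoothing_spline

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 10, 50))
y = np.sin(x) + rng.normal(0, 0.3, x.size)

# Small lam: the penalty barely matters, so the fit chases the noise
# (low bias, high variance)
wiggly = make_smoothing_spline(x, y, lam=1e-6)

# Large lam: roughness is heavily penalized, so the fit approaches the
# linear least squares line (high bias, low variance)
flat = make_smoothing_spline(x, y, lam=1e6)

for label, spl in [("lam = 1e-6", wiggly), ("lam = 1e6 ", flat)]:
    rss = np.sum((y - spl(x)) ** 2)
    print(f"{label}: training RSS = {rss:.3f}")
```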

Finding the Best Smoothing Spline

  • The function $g$ that minimizes this penalized criterion is a natural cubic spline with knots at each data point
  • Specifically, the minimizing function $g$ is a piecewise cubic polynomial with the following properties:

    1. Knots at the unique values of $x_1, \dots, x_n$
    2. A continuous first derivative $g'(x)$
    3. A continuous second derivative $g''(x)$
  • However, it is not the same natural cubic spline that one would get by applying the basis function approach used in regression splines
  • Instead, it is a shrunken version of that natural cubic spline, where the tuning parameter $\lambda$ in the penalized criterion controls the level of shrinkage (see the sketch below)
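
A small sketch of the shrinkage idea, assuming SciPy ≥ 1.10 (for make_smoothing_spline) and made-up toy data: as the tuning parameter shrinks toward zero, the smoothing spline should approach the natural cubic spline that interpolates the data, and as it grows the fit with the same knots is shrunk toward a smoother curve. The lam values here are arbitrary.

```python
import numpy as np
from scipy.interpolate import CubicSpline, make_smoothing_spline

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 10, 30))
y = np.sin(x) + rng.normal(0, 0.3, x.size)

# Natural cubic spline that interpolates every point (no shrinkage at all)
natural = CubicSpline(x, y, bc_type="natural")

# Smoothing splines: same knots (the data points), but shrunken fits;
# larger lam means more shrinkage toward a smooth curve
tiny_penalty = make_smoothing_spline(x, y, lam=1e-8)
big_penalty = make_smoothing_spline(x, y, lam=10.0)

grid = np.linspace(x.min(), x.max(), 200)
print("max gap to interpolant, lam = 1e-8:",
      np.max(np.abs(natural(grid) - tiny_penalty(grid))))
print("max gap to interpolant, lam = 10  :",
      np.max(np.abs(natural(grid) - big_penalty(grid))))
```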

Choosing the Smoothing Parameter

  • We have seen that a smoothing spline is simply a natural cubic spline with knots at every unique value of $x$
  • It might seem that a smoothing spline will have far too many degrees of freedom, since a knot at each data point allows a great deal of flexibility
  • However, the tuning parameter will control the roughness of the smoothing spline, and hence the effective degrees of freedom
  • Roughly speaking, as the tuning parameter $\lambda$ increases from $0$ to $\infty$, the effective degrees of freedom decrease from $n$ to $2$
  • Therefore, we do not need to select the number or locations of the knots, since there will be a knot at each training observation when fitting a smoothing spline
  • Instead, we need to choose the value of the tuning parameter $\lambda$ that makes the cross-validated RSS as small as possible
  • LOOCV (leave-one-out cross-validation) can be computed very efficiently for smoothing splines, at essentially the cost of a single fit (a naive, brute-force version is sketched below)
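
A brute-force sketch of tuning $\lambda$ by cross-validation, again assuming SciPy ≥ 1.10's make_smoothing_spline and made-up toy data. The efficient LOOCV computation mentioned above uses a closed-form expression based on the smoother matrix rather than refitting $n$ times, so the loop here is only for illustration; the candidate grid and the choice to keep the boundary points in every fit are arbitrary simplifications.

```python
import numpy as np
from scipy.interpolate import make_smoothing_spline

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 10, 60))
y = np.sin(x) + rng.normal(0, 0.3, x.size)

def loocv_rss(x, y, lam):
    """Naive LOOCV: refit once per held-out point and sum the squared errors.

    The boundary points are kept in every fit so the held-out point always
    lies inside the fitted range (a simplification of true LOOCV).
    """
    errors = []
    for i in range(1, x.size - 1):
        mask = np.arange(x.size) != i
        spl = make_smoothing_spline(x[mask], y[mask], lam=lam)
        errors.append((y[i] - spl(x[i])) ** 2)
    return float(np.sum(errors))

lams = np.logspace(-4, 2, 7)      # arbitrary grid of candidate tuning parameters
scores = [loocv_rss(x, y, lam) for lam in lams]
best = lams[int(np.argmin(scores))]
print("lambda chosen by (naive) LOOCV:", best)

# With lam omitted, SciPy instead picks the penalty by generalized cross-validation
auto = make_smoothing_spline(x, y)
```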
