Regression Metrics

Describing Regression Metrics

  • Suppose we want to guess the value of a random variable
  • Since we don't feel comfortable with the word guess, we use the word prediction instead
  • Our goal is to find the best guess for our random variable
  • We need some way to measure how good a guess is
  • Therefore, we use metrics to measure the accuracy of our guesses

Using Regression Metrics

  • A metric is a quantitative standard of measurement
  • There isn't any Holy Grail of regression metrics
  • Solely using one regression metric for every case is a bad idea
  • Each metric has its own use-case, and should be used carefully and accordingly

Motivating General Terminology

  • Let's denote $Y$ as our observation and $m$ as our prediction
  • When finding our prediction $m$, we typically want to use the $m$ that minimizes our regression metric (e.g. MSE, MAE, etc.)
  • In other words, we would like to find a prediction $m$ that makes our regression metric small (i.e. minimizes our errors)
  • When minimizing the MSE, we should really only focus on minimizing the bias, since the variance term is irrelevant
  • Specifically, the variance term is the same no matter what $m$ is:
$$\text{Var}[Y-m] = \text{Var}[Y]$$
  • This is because $\text{Var}[Y]$ depends only on the true distribution of $Y$, while $m$ is just our (constant) guess
  • Therefore, it shouldn't play any role in the minimization process
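To see why only the bias matters when choosing $m$, it helps to write out the standard decomposition (shown here for reference):
$$\text{E}[(Y-m)^2] = \text{Var}[Y-m] + (\text{E}[Y-m])^2 = \text{Var}[Y] + (\text{E}[Y]-m)^2$$
Only the second term depends on $m$, so minimizing the MSE amounts to minimizing $(\text{E}[Y]-m)^2$, which is achieved at $m = \text{E}[Y]$.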

Defining Basics of MSE

  • The mean squared error (or MSE) is defined as the following:
$$MSE(m) = \text{E}[(Y-m)^2] = (\text{E}[Y-m])^2 + \text{Var}[Y-m]$$
where the bias term is $(\text{E}[Y-m])^2$ and the variance term is $\text{Var}[Y-m]$
  • In other words, the mean squared error combines the squared bias and the variance associated with a prediction
  • The mean squared error represents the simplest form of the bias-variance decomposition
  • In practice, the MSE is computed from a sample as the mean of the squared errors:
$$MSE = \text{mean}((y_{i} - \hat{y}_{i})^{2}) = \text{mean}(\text{error}_{i}^{2})$$

Defining Basics of RMSE

  • The root mean squared error (or RMSE) is defined as the following:
$$RMSE(m) = \sqrt{MSE(m)} = \sqrt{\text{E}[(Y-m)^2]} = \sqrt{(\text{E}[Y-m])^2 + \text{Var}[Y-m]}$$
where the bias term is $(\text{E}[Y-m])^2$ and the variance term is $\text{Var}[Y-m]$
  • In other words, the root mean squared error is the square root of the combined bias and variance associated with a prediction
  • In practice, the RMSE is computed from a sample as the square root of the mean of the squared errors:
$$RMSE = \sqrt{\text{mean}((y_{i} - \hat{y}_{i})^{2})} = \sqrt{\text{mean}(\text{error}_{i}^{2})}$$

Defining Basics of MAE

  • The mean absolute error (or MAE) is defined as the following:
$$MAE(m) = \text{E}[|Y-m|]$$
  • In other words, the mean absolute error represents the average magnitude of the error associated with a prediction
  • Unlike the MSE, the MAE does not split as cleanly into separate bias and variance terms
  • In practice, the MAE is computed from a sample as the mean of the absolute errors:
$$MAE = \text{mean}(|y_{i} - \hat{y}_{i}|) = \text{mean}(|\text{error}_{i}|)$$
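As a quick illustration of the three sample formulas above, here is a minimal NumPy sketch (the observation and prediction values are made up for the example):

```python
import numpy as np

# Made-up observations and predictions, purely for illustration
y = np.array([3.0, 5.0, 2.5, 7.0, 4.5])
y_hat = np.array([2.5, 5.0, 4.0, 8.0, 5.0])

errors = y - y_hat

mse = np.mean(errors ** 2)       # mean of the squared errors
rmse = np.sqrt(mse)              # square root of the MSE
mae = np.mean(np.abs(errors))    # mean of the absolute errors

print(f"MSE:  {mse:.3f}")   # 0.750
print(f"RMSE: {rmse:.3f}")  # 0.866
print(f"MAE:  {mae:.3f}")   # 0.700
```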

Describing Use-Cases for MAE, MSE, and RMSE

  • If we want to penalize predictions more heavily the further they are from our actual observations, then we should use MSE or RMSE, because the errors are squared
  • In other words, we should use the MSE or RMSE metrics if we want to penalize large errors
  • For example, if our prediction being off by 10 is twice as bad as being off by 5, then we should most likely use MAE
  • On the other hand, if our prediction being off by 10 is more than twice as bad as being off by 5, then we should most likely use MSE or RMSE
  • If we want to use a regression metric that produces interpretable results, then MAE is clearly the winner, since it's difficult to interpret the square in the MSE or RMSE
  • From a mathematical standpoint, MSE is clearly the winner, since it's difficult to perform many mathematical calculations on formulas involving the absolute value
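To make the penalty difference concrete, here is a small sketch (with made-up error values) comparing two error vectors that have the same total absolute error, one spread evenly and one concentrated in a single large miss:

```python
import numpy as np

# Two made-up error vectors with the same total absolute error
errors_spread = np.array([5.0, 5.0, 5.0, 5.0])    # error spread evenly
errors_spike = np.array([0.0, 0.0, 0.0, 20.0])    # same total error in one large miss

for name, e in [("spread", errors_spread), ("spike", errors_spike)]:
    mae = np.mean(np.abs(e))
    rmse = np.sqrt(np.mean(e ** 2))
    print(f"{name}: MAE = {mae:.1f}, RMSE = {rmse:.1f}")

# Both vectors have MAE = 5.0, but the spike's RMSE doubles to 10.0,
# since squaring penalizes the single large error much more heavily
```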

Defining Notation for MAPE and MASE

  • Let $y_{t}$ denote the current observation at time $t$
  • Let $y_{t-1}$ denote the previous observation at time $t-1$
  • Let $f_{t}$ denote the forecast of $y_{t}$
  • Let $e_{t}$ denote the forecast error, where $e_{t} = y_{t} - f_{t}$
  • Let $o_{t}$ denote the one-step naive error, where $o_{t} = y_{t} - y_{t-1}$

Defining Basics of MAPE

  • The mean absolute percentage error (or MAPE) is defined as the following:
$$MAPE = 100 \times \text{mean}\left(\frac{|e_{t}|}{|y_{t}|}\right)$$
  • The MAPE favors forecasts that are smaller than their actual data values, which can be considered a drawback
  • On the other hand, we may want this property depending on our problem, in which case we would want to use MAPE
  • In other words, the MAPE puts a heavier penalty on forecasts that exceed the actual data values than those that are less than the actual values
  • Said another way, the MAPE puts a heavier penalty on negative errors than positive errors
  • Naturally, we would like to avoid this asymmetry of the MAPE
  • The MASE can be used if we want a measure that avoids this asymmetry
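Here is a minimal sketch of the MAPE calculation (all values are made up), including a small demonstration of the asymmetry described above:

```python
import numpy as np

def mape(y, f):
    """Mean absolute percentage error, following the formula above."""
    return 100 * np.mean(np.abs(y - f) / np.abs(y))

# Made-up actual values and forecasts
y = np.array([100.0, 50.0, 200.0])
f = np.array([110.0, 45.0, 190.0])
print(mape(y, f))  # roughly 8.3

# Asymmetry: for the same absolute error of 50, the over-forecast (negative
# error, forecast above the actual) is penalized more than the under-forecast
print(mape(np.array([100.0]), np.array([150.0])))  # 50.0
print(mape(np.array([150.0]), np.array([100.0])))  # about 33.3
```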

Defining Basics of MASE

  • The mean absolute scaled error (or MASE) is arguably considered the best available measure of forecast accuracy
  • Before we define the MASE formula, we should define a one-step naive error
  • The one-step naive error $o_{t}$ refers to the error associated with guessing the previous data value as our current prediction (i.e. $o_{t} = y_{t} - y_{t-1}$)
  • The MASE is defined as the following:
$$MASE = \text{mean}\left(\frac{|e_{t}|}{\frac{1}{n-1}\sum_{t=2}^{n}|y_{t} - y_{t-1}|}\right)$$
  • Where the scaled error term refers to the following:
$$\text{scaled error}_{t} = \frac{|e_{t}|}{\frac{1}{n-1}\sum_{t=2}^{n}|y_{t} - y_{t-1}|}$$
  • Therefore, the MASE formula can be simplified to the following:
$$MASE = \text{mean}(\text{scaled error}_{t})$$
  • We can go one step further, and simplify the MASE to the following, roughly:
$$MASE = \text{mean}\left(\frac{|e_{t}|}{\text{mean}(|o_{t}|)}\right)$$
  • Where the scaled error term roughly refers to the following:
$$\text{scaled error}_{t} = \frac{|e_{t}|}{\text{mean}(|o_{t}|)}$$
  • Essentially, the MASE is an average of our scaled errors
  • The MASE has the following benefits:

    • Working with scaled errors, since the scaled errors are independent of the scale of the data
    • Symmetrical measure
    • Less sensitive to outliers compared to other metrics
    • Easily interpreted metric using scaled errors (compared to other metrics like RMSE)
    • Less variable on small samples
  • We can interpret scaled errors based on the following criteria:

    • A scaled error is less than one if our forecast is better than the average one-step naive forecast (i.e. using the previous data point)
    • A scaled error is greater than one if our forecast is worse than the average one-step naive forecast (i.e. using the previous data point)
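Here is a minimal sketch of the MASE calculation (the series and forecasts are made up, and the same series is assumed to supply both the forecast errors and the one-step naive errors, as in the simplified formula above):

```python
import numpy as np

def mase(y, f):
    """Mean absolute scaled error: forecast errors scaled by the mean
    one-step naive error, following the simplified formula above."""
    e = np.abs(y - f)          # forecast errors |e_t|
    o = np.abs(np.diff(y))     # one-step naive errors |y_t - y_{t-1}|
    return np.mean(e / np.mean(o))

# Made-up series and forecasts
y = np.array([10.0, 12.0, 11.0, 14.0, 13.0])
f = np.array([9.0, 11.5, 12.0, 13.0, 13.5])

print(mase(y, f))  # about 0.46, i.e. better on average than the naive one-step forecast
```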

Defining Perils of F-tests

  1. The F-test does not measure goodness-of-fit

    • Specifically, the F-test does not measure if there is a linear fit, non-linear fit, etc.

      • For example, let's suppose that we reject the null, intercept-only hypothesis
      • This does not mean that the simple linear model is right
      • It means that the simple linear model is better at making predictions than the intercept-only model
    • Specifically, it means that the simple linear model's predictive improvement is not due to random chance

      • The simple linear regression model can be absolute garbage, with every single one of its assumptions flagrantly violated, and yet better than the model which makes all those assumptions and thinks the optimal slope is zero
  2. The F-test is a pretty useless measure of predictability

    • Not finding any significant share of variance associated with the regression could be caused by any of the following:

      1. There is no such variance, and the intercept-only model is correct
      2. There is some variance, but we were unlucky
      3. Or, the test doesn’t have enough power to detect departures from the null
    • To expand on that last point, the power to detect a non-zero slope is going to increase with the sample size $n$, decrease with the noise level $\sigma^{2}$, and increase with the magnitude of the slope $|\beta|$
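To illustrate that last point, here is a small simulation sketch (all parameter settings are made up) that estimates the F-test's power to reject the intercept-only null as the sample size, slope, and noise level change:

```python
import numpy as np
from scipy import stats

def f_test_power(n, beta, sigma, n_sims=2000, alpha=0.05, seed=0):
    """Estimate, by simulation, the power of the F-test for a simple linear
    regression against the intercept-only null."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        x = rng.uniform(0, 1, size=n)
        y = beta * x + rng.normal(0, sigma, size=n)
        # Fit the simple linear regression and compute the F statistic by hand
        slope, intercept = np.polyfit(x, y, 1)
        rss = np.sum((y - (intercept + slope * x)) ** 2)
        tss = np.sum((y - y.mean()) ** 2)
        f_stat = (tss - rss) / (rss / (n - 2))
        rejections += stats.f.sf(f_stat, 1, n - 2) < alpha
    return rejections / n_sims

# Power grows with n and |beta|, and shrinks with sigma (made-up settings)
for n, beta, sigma in [(20, 0.5, 1.0), (200, 0.5, 1.0), (20, 2.0, 1.0), (20, 0.5, 3.0)]:
    print(f"n={n}, beta={beta}, sigma={sigma}: power ~ {f_test_power(n, beta, sigma):.2f}")
```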

Defining Perils of $R^{2}$

  1. $R^{2}$ does not measure goodness-of-fit

    • Specifically, $R^{2}$ does not measure if there is a linear fit, non-linear fit, etc.

      • $R^{2}$ can be low when the model follows a completely correct form (e.g. by making the noise variance large)
      • $R^{2}$ can be close to 1 when the model follows a completely incorrect form (e.g. by making the noise variance small)
      • More specifically, $R^{2}$ can be very high when the true relationship is non-linear but our fitted model is linear (since all that matters is for the slope of the best linear approximation to be non-zero)
  2. $R^{2}$ is a pretty useless measure of predictability

    • $R^{2}$ says nothing about prediction error ($R^{2}$ can be anywhere between 0 and 1 just by changing the range of our predictor variable)
    • $R^{2}$ says nothing about interval forecasts (in particular, it doesn't give us any notion of how wide our prediction intervals are)
    • Specifically, MSE is a much better measure of prediction error
  3. $R^{2}$ cannot be compared across data sets

    • Specifically, the same model can have radically different $R^{2}$ values on different data
  4. $R^{2}$ cannot be compared between a model with an untransformed response variable and one with a transformed response variable (or between different transformations of response variables)

    • Specifically, the $R^{2}$ changes when we change the scale or range of our response variable (e.g. through a transformation)
  5. The one situation where $R^{2}$ can be compared is when different models are fit to the same data set with the same, untransformed response variable
  6. However, you might as well just compare the MSE measures in this situation
  7. Note that the adjusted $R^{2}$ does absolutely nothing to fix any of these issues
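As a rough demonstration of points 1 and 2, the sketch below (with made-up parameters) fits the same correct linear model twice, changing only the range of the predictor; the $R^{2}$ values differ drastically even though the noise level, and hence the prediction error, is the same:

```python
import numpy as np

def r_squared(x, y):
    """R^2 of a simple linear regression of y on x."""
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (intercept + slope * x)
    return 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(0)
n, beta, sigma = 1000, 2.0, 1.0   # made-up true model: y = 2x + Gaussian noise

# Same true model, same noise level; only the spread of the predictor changes
x_narrow = rng.uniform(0, 0.5, size=n)
x_wide = rng.uniform(0, 10, size=n)

for name, x in [("narrow x", x_narrow), ("wide x", x_wide)]:
    y = beta * x + rng.normal(0, sigma, size=n)
    print(f"{name}: R^2 = {r_squared(x, y):.2f}")

# The narrow-x fit reports an R^2 near 0 and the wide-x fit an R^2 near 1,
# even though both fitted models are correct and the noise variance is identical
```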
