Describing Linear Regression
- Regression is a statistical technique for estimating the relationship between a continuous response variable and a set of predictor variables
- Linear regression is a type of regression that models a linear relationship between a continuous response variable and its predictor variables
- In other words, linear regression assumes a constant rate of change between a response variable and its predictor variables
- The most common approach for estimating the population parameters involved in the linear regression model is the method of least squares
- There are other ways of estimating the population parameters, such as maximum likelihood estimation (or MLE)
- Roughly speaking, residuals are estimates (or realized values) of the unobserved random errors described by the error term
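As a concrete illustration of these ideas, here is a minimal numpy sketch that fits a line by ordinary least squares and inspects the resulting residuals; the simulated data, seed, and parameter values are illustrative assumptions, not anything specified in these notes.

```python
import numpy as np

# Illustrative simulated data (assumed): a linear trend plus Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, size=100)

# Ordinary least squares estimates of the intercept and slope.
X = np.column_stack([np.ones_like(x), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residuals: realized estimates of the unobserved error term.
residuals = y - X @ beta_hat
print("estimated intercept and slope:", beta_hat)
print("mean residual (should be ~0):", residuals.mean())
```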
Model Components
- In linear regression, our true data generating process could look like the following: $Y = \beta_0 + \beta_1 X + \epsilon$
- In this case, the response variable $Y$, the predictor variable $X$, and the error term $\epsilon$ are all random variables
- On the other hand, the $\beta_0$ and $\beta_1$ coefficients are our fixed population parameters
- Usually, we're interested in the mean of the response variable conditional on our predictor variables, which looks like this: $E[Y \mid X] = \beta_0 + \beta_1 X$
- We'll typically estimate these conditional means using MLE or OLS parameter estimates: $\hat{E}[Y \mid X] = \hat{\beta}_0 + \hat{\beta}_1 X$
- The response variable conditional on the predictor variables is denoted as the following: $Y \mid X$
- The error term conditional on the predictor variables is denoted as the following: $\epsilon \mid X$
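The following sketch makes these components concrete: it simulates the data generating process with assumed "true" values of $\beta_0$ and $\beta_1$, estimates them by OLS, and compares the true and estimated conditional means at a new point. All numbers here are illustrative assumptions.

```python
import numpy as np

# Assumed "true" population parameters for the sketch.
rng = np.random.default_rng(1)
beta0, beta1, sigma = 1.0, 2.0, 0.5

x = rng.uniform(-1, 1, size=200)      # predictor (its distribution is unspecified by the model)
eps = rng.normal(0, sigma, size=200)  # error term
y = beta0 + beta1 * x + eps           # response: Y = beta0 + beta1 * X + eps

# OLS estimates of the fixed population parameters.
X = np.column_stack([np.ones_like(x), x])
b0_hat, b1_hat = np.linalg.lstsq(X, y, rcond=None)[0]

# Estimated conditional mean E[Y | X = x0] at a new point x0.
x0 = 0.25
true_mean = beta0 + beta1 * x0
est_mean = b0_hat + b1_hat * x0
print(f"true E[Y|X=0.25] = {true_mean:.3f}, estimated = {est_mean:.3f}")
```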
Assumptions of the Gaussian-Noise Simple Linear Regression Model
- The distribution of any predictor variable is unspecified (possibly even deterministic)
- The relationship between the response variable and each value of the predictor variables is linear: $E[Y \mid X = x] = \beta_0 + \beta_1 x$
- The error term is normally distributed with a mean of $0$ and constant variance $\sigma^2$ for all values of $X$: $\epsilon \mid X \sim N(0, \sigma^2)$
- Which implies the error term is uncorrelated across observations and is uncorrelated with the predictors
- In other words, this implies $\epsilon$ is independent across observations, and is independent of the predictor variables
- Said another way, homoscedasticity is maintained (i.e. constant variance): $\mathrm{Var}(\epsilon \mid X = x) = \sigma^2$ for every $x$
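A quick simulation can show what these assumptions look like in practice: with homoscedastic Gaussian noise, the error variance is roughly the same no matter which range of $X$ we look at. The data, seed, and bin layout below are illustrative assumptions.

```python
import numpy as np

# Sketch of the Gaussian-noise assumptions: errors are N(0, sigma^2) for
# every value of X, with the same sigma^2 (homoscedasticity).
rng = np.random.default_rng(2)
sigma = 1.5

x = rng.uniform(0, 10, size=5000)
eps = rng.normal(0, sigma, size=5000)  # same variance for all x
y = 3.0 - 0.7 * x + eps

# Group the errors by coarse bins of x; under homoscedasticity the
# within-bin variances should all be close to sigma^2.
bins = np.digitize(x, bins=[2.5, 5.0, 7.5])
for b in range(4):
    print(f"bin {b}: var(eps) ~ {eps[bins == b].var():.2f} (true sigma^2 = {sigma**2:.2f})")
```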
Benefits of Assuming the Gaussian-Noise Model
- We can use the Central Limit Theorem
- The noise might be due to adding up the effects of lots of little random causes, all nearly independent of each other and of $X$, where each of the effects is of roughly similar magnitude
- Then the central limit theorem will take over, and the distribution of the sum of effects will indeed be pretty Gaussian
- Therefore, the Central Limit Theorem motivates the assumption that $\epsilon$ is normally distributed (see the sketch below)
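The sketch below illustrates this intuition: the "noise" is built as the sum of many small, independent, decidedly non-Gaussian shocks, yet the resulting sum has nearly zero skewness and excess kurtosis, as a Gaussian would. The number of effects and their distribution are illustrative assumptions.

```python
import numpy as np

# CLT intuition: a sum of many small, independent, skewed effects
# ends up approximately Gaussian.
rng = np.random.default_rng(3)

n_obs, n_effects = 10_000, 200
# Each effect is a small, skewed (exponential) shock centered at zero.
effects = rng.exponential(scale=0.1, size=(n_obs, n_effects)) - 0.1
noise = effects.sum(axis=1)

# Near-zero skewness and excess kurtosis suggest an approximately Gaussian sum.
z = (noise - noise.mean()) / noise.std()
print("skewness:", (z**3).mean())
print("excess kurtosis:", (z**4).mean() - 3)
```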
- It will be mathematically convenient
- Assuming Gaussian noise lets us work out a very complete theory of inference and prediction for the model, with lots of closed-form answers to questions like:
- What is the optimal estimate of our $\beta_0$ and $\beta_1$ terms using MLE or OLS?
- Can we assume our estimates are Gaussian?
- What is the optimal estimate of the variance?
- What is the probability that we'd see a fit this good from a line with a non-zero intercept if the true line goes through the origin?
- Answering such questions without the Gaussian-noise assumption needs somewhat more advanced techniques, and much more advanced computing
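As one example of the closed-form answers the Gaussian-noise assumption makes possible, the sketch below computes OLS estimates, their standard errors, and a t-test of the hypothesis that the intercept is zero (i.e. that the true line goes through the origin). The simulated data and parameter values are illustrative assumptions.

```python
import numpy as np
from scipy import stats

# Simulated data with a small non-zero intercept (assumed values).
rng = np.random.default_rng(4)
n = 50
x = rng.uniform(0, 5, size=n)
y = 0.3 + 1.2 * x + rng.normal(0, 1.0, size=n)

X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat

# Unbiased estimate of the noise variance and the covariance of beta_hat.
sigma2_hat = resid @ resid / (n - 2)
cov_beta = sigma2_hat * np.linalg.inv(X.T @ X)
se = np.sqrt(np.diag(cov_beta))

# Under Gaussian noise, (beta_hat - beta) / se follows a t distribution with n - 2 df.
t_stat = beta_hat[0] / se[0]
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
print(f"intercept = {beta_hat[0]:.3f}, SE = {se[0]:.3f}, t = {t_stat:.2f}, p = {p_value:.3f}")
```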
Clarifying Properties of Linear Models
- The predictor variables do not need to be normally distributed
- The response variable does not need to be normally distributed
- The response variable conditional on the predictor variables needs to be normally distributed
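The sketch below illustrates this distinction: the predictor is uniform rather than normal, the marginal distribution of the response is far from normal, and yet the response conditional on the predictor is normal because the errors are. All simulation choices are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

x = rng.uniform(0, 10, size=20_000)             # non-normal predictor
y = 1.0 + 4.0 * x + rng.normal(0, 1.0, 20_000)  # Y | X ~ N(1 + 4x, 1)

def excess_kurtosis(v):
    # Excess kurtosis is ~0 for a Gaussian.
    z = (v - v.mean()) / v.std()
    return (z**4).mean() - 3

print("marginal Y excess kurtosis (far from 0):", excess_kurtosis(y))
print("Y - E[Y|X] excess kurtosis (near 0):    ", excess_kurtosis(y - (1.0 + 4.0 * x)))
```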
Clarifying Properties of Residuals
- The residuals should have an expected value of zero: $E[\hat{\epsilon}_i] = 0$
- The residuals should show a nearly constant variance: $\mathrm{Var}(\hat{\epsilon}_i) \approx \sigma^2$
- We don't expect the residuals to ever be completely uncorrelated with each other, but the correlation should be extremely weak and grow negligible as $n \rightarrow \infty$
- The residuals should be Gaussian, since the errors should be Gaussian
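The sketch below runs simple checks matching these properties on simulated data: the residual mean is near zero, the variance is roughly the same across halves of the predictor's range, and the residuals' quantiles look roughly Gaussian. Data and thresholds are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, size=2000)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, size=2000)

X = np.column_stack([np.ones_like(x), x])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat

# For OLS with an intercept, the residuals sum to exactly zero.
print("mean residual:", resid.mean())

# Variance in the lower vs upper half of x should be similar (homoscedasticity).
lo, hi = resid[x < 5], resid[x >= 5]
print("variance (x < 5):", lo.var(), " variance (x >= 5):", hi.var())

# Crude normality check: compare empirical coverage to Gaussian coverage.
z = (resid - resid.mean()) / resid.std()
print("share within 1 sd (~0.68 if Gaussian):", np.mean(np.abs(z) < 1))
print("share within 2 sd (~0.95 if Gaussian):", np.mean(np.abs(z) < 2))
```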