Motivating the Use of Panel Data
-
Often, our outcome variable depends on several factors
- These factors may be observed or unobserved in our data
- As we know, if any unobserved variable affects the outcome and is correlated with the treatment variable, then the treatment variable is endogenous
- Meaning, the correlation between the treatment and the outcome is not a valid estimate of the causal effect
-
Panel data refers to data where we observe the same units over more than one time period
- E.g. individuals, firms, or countries
-
Panel data is very similar to time series data with one key difference
- Time series data consists of observations of one unit at multiple time points
- Panel data consists of observations of multiple units at multiple time points
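To make the difference concrete, here is a minimal sketch of how the two layouts might look in pandas. The firm names, years, and revenue figures are made up for illustration and do not come from any real dataset.

```python
import pandas as pd

# Hypothetical time series: ONE unit (a single firm) observed over several years.
time_series = pd.DataFrame(
    {"year": [2019, 2020, 2021], "revenue": [10.0, 11.5, 12.1]}
).set_index("year")

# Hypothetical panel: SEVERAL units, each observed over the same years.
panel = pd.DataFrame(
    {
        "firm": ["A", "A", "A", "B", "B", "B"],
        "year": [2019, 2020, 2021, 2019, 2020, 2021],
        "revenue": [10.0, 11.5, 12.1, 4.2, 4.0, 4.9],
    }
).set_index(["firm", "year"])  # one row per (unit, time) pair

n_units = panel.index.get_level_values("firm").nunique()
n_periods = panel.index.get_level_values("year").nunique()
print(time_series)
print(panel)
print(f"units: {n_units}, periods: {n_periods}")
```

Indexing the panel by the (unit, time) pair is one common convention; a plain "long" table with both columns kept as regular columns works just as well.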
-
Panel data can be used to estimate causal effects when there are unobserved confounders
- To do this, we must assume that these unobserved confounders are constant over time
Illustrating Panel Data with Unobserved Confounders
- Panel data allows us to control for unobserved variables by using a known variable that is fixed over time: the identity of the unit
- For example, we can’t measure attributes like beauty and intelligence
-
But, we know that a person is the same individual across time
- Meaning, their beauty and intelligence are assumed to be fixed across time
-
So, we can create a dummy variable for that person (e.g. keyed by their name), which stands in for their set of fixed, unobserved variables
- Then, we can control for their unobserved variables by adding that person's dummy to a regression model
-
This is what we mean when we say we can control for the person itself
- We are adding a variable (dummy in this case) that denotes that particular person
- By controlling for this dummy variable, we can estimate causal effects of a treatment on outcomes when there are unobserved confounders that are constant over time
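The idea of controlling for the person can be sketched with a small simulation. Everything below is an assumption made for illustration: the data are simulated, the variable names (`ability`, `treatment`, `outcome`) are hypothetical, and the true effect is set to 2.0 by construction. The sketch compares a naive regression with one that adds a dummy per person.

```python
import numpy as np

rng = np.random.default_rng(0)
n_people, n_periods = 200, 5
true_effect = 2.0  # assumed causal effect, chosen for the simulation

# Each person appears n_periods times.
person = np.repeat(np.arange(n_people), n_periods)

# Unobserved trait that is fixed over time for each person.
ability = rng.normal(size=n_people)

# The treatment is correlated with the unobserved trait (confounding).
treatment = 0.8 * ability[person] + rng.normal(size=person.size)
outcome = true_effect * treatment + 3.0 * ability[person] + rng.normal(size=person.size)

# Naive OLS: intercept + treatment, ignoring the person.
X_naive = np.column_stack([np.ones(person.size), treatment])
naive = np.linalg.lstsq(X_naive, outcome, rcond=None)[0][1]

# OLS with one dummy column per person (the dummies absorb the intercept).
dummies = (person[:, None] == np.arange(n_people)[None, :]).astype(float)
X_dummy = np.column_stack([treatment, dummies])
with_dummies = np.linalg.lstsq(X_dummy, outcome, rcond=None)[0][0]

print(f"naive estimate:        {naive:.2f}")
print(f"with person dummies:   {with_dummies:.2f}")
```

In this setup the naive estimate absorbs part of the effect of the unobserved trait, while the dummy regression should land close to the assumed effect, because the trait does not change within a person.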
Defining Types of Estimators for Panel Data
- There are several different kinds of estimators for panel data
- For now, we'll focus on fixed effects (FE)
-
Note, panel methods are usually written in traditional regression notation
- Rather than potential outcomes notation
- Keep this in mind when we define the notation
-
The notation is defined as the following:
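- Let $y$ be our observed outcome
- Let $x = (x_1, \dots, x_K)$ be a set of observed variables
- Let $c$ be an unobservable random variable
- We're interested in the partial effects of variable $x_j$ in the population regression function: $E[y \mid x_1, \dots, x_K, c]$
- Thus, our regression model looks like the following for an observation $i$ at time $t = 1, \dots, T$, where $x_{it}$ is a $1 \times K$ row vector: $y_{it} = x_{it} \beta + c_i + \varepsilon_{it}$
- And, the entire panel (sample of data) looks like the following once all $N$ observations are stacked over all $T$ periods, with each $c_i$ repeated across its $T$ rows: $y = X \beta + c + \varepsilon$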
Illustrating Fixed Effects
- Generally, a fixed effect model is defined as the following: $y_{it} = x_{it} \beta + c_i + \varepsilon_{it}$
-
Here, $y_{it}$ is the outcome of the individual $i$ at time $t$
- Let $x_{it}$ be a set of observed variables for individual $i$ at time $t$
-
Let $c_i$ capture the set of unobserved variables for individual $i$
- Notice, these unobservables are assumed to be fixed over time
- Hence, the lack of the time subscript $t$
- Finally, $\varepsilon_{it}$ is the error term
-
As an example, $y_{it}$ could represent wages
-
And $x_{it}$ are the observed variables that change over time
- E.g. marriage and experience
-
And $c_i$ are the unobserved variables that are constant over time
- E.g. beauty and intelligence
-
Defining Fixed Effects
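- The fixed effects estimator works with the average over time for every person in our panel
- Essentially, each variable is regressed on the individual dummies, and the residuals are that variable's deviations from the individual's own mean
-
This motivates the following estimation procedure:
- Create time-demeaned variables by subtracting the mean for the individual: $\ddot{y}_{it} = y_{it} - \bar{y}_i$ and $\ddot{x}_{it} = x_{it} - \bar{x}_i$, where $\bar{y}_i$ and $\bar{x}_i$ are the averages of the outcome $y_{it}$ and the observed variables $x_{it}$ over time for individual $i$
- Regress $\ddot{y}_{it}$ on $\ddot{x}_{it}$
-
Notice, the unobserved variables $c_i$ vanish after performing the above transformation, since $c_i$ is constant over time and so $c_i - \bar{c}_i = 0$
- Actually, even observed variables that are constant over time are eliminated after performing the above transformation
- For this reason, any included variables that are constant across time are dropped from the model, since they would be a linear combination of the dummy variables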
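The two-step procedure above can be sketched as follows. This is a minimal illustration on simulated data, assuming a single observed variable `x` and a true coefficient of 1.5 chosen for the example. The column `c` plays the role of the unobserved variable; it is kept in the table only to show that demeaning removes it, and it would not be available in real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n_people, n_periods = 100, 4
true_beta = 1.5  # assumed coefficient, chosen for the simulation

person = np.repeat(np.arange(n_people), n_periods)
c = rng.normal(size=n_people)[person]            # constant within person
x = 0.5 * c + rng.normal(size=person.size)       # correlated with c
y = true_beta * x + c + rng.normal(scale=0.5, size=person.size)
df = pd.DataFrame({"person": person, "x": x, "y": y, "c": c})

# Step 1: time-demean each variable by subtracting the person's own mean.
cols = ["x", "y", "c"]
demeaned = df[cols] - df.groupby("person")[cols].transform("mean")

# The time-constant variable is wiped out by the transformation.
max_abs_c = float(demeaned["c"].abs().max())

# Step 2: regress demeaned y on demeaned x (single regressor, no intercept).
xd = demeaned["x"].to_numpy()
yd = demeaned["y"].to_numpy()
beta_fe = float(np.dot(xd, yd) / np.dot(xd, xd))

print(f"largest demeaned c: {max_abs_c:.2e}")
print(f"fixed effects estimate: {beta_fe:.2f}")
```

With one regressor the second step reduces to a ratio of sums; with several regressors the same step would be a least-squares fit on the demeaned columns.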
Defining the Identifying Assumptions
-
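To identify the coefficients $\beta$ on the observed variables with a fixed effects model, we must satisfy the following assumptions:
-
Strict exogeneity: $E[\varepsilon_{it} \mid x_{i1}, \dots, x_{iT}, c_i] = 0$ for $t = 1, \dots, T$
- Here, $\varepsilon_{it}$ is the error term, $x_{it}$ are the observed variables, and $c_i$ are the time-constant unobserved variables
- In other words, there can't be any unobserved variables changing over time that are correlated with the observed variables
-
Full rank: $\operatorname{rank}\left( \sum_{t=1}^{T} E[\ddot{x}_{it}' \ddot{x}_{it}] \right) = K$
- Here, $\ddot{x}_{it} = x_{it} - \bar{x}_i$ are the $K$ time-demeaned observed variables
- Meaning, there aren't any collinear observed variables, and each observed variable varies over time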
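A small check can show how the rank requirement fails when an observed variable never changes within a person. The sketch below uses simulated data, and the variable names (`experience`, `birth_year`) and the helper `demean_by_group` are hypothetical, introduced only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n_people, n_periods = 50, 3
person = np.repeat(np.arange(n_people), n_periods)

# One regressor that varies over time, one that is constant within person.
experience = rng.normal(size=person.size)
birth_year = rng.integers(1960, 2000, size=n_people)[person].astype(float)
X = np.column_stack([experience, birth_year])


def demean_by_group(values, groups):
    """Subtract each group's column means from its rows."""
    sums = np.zeros((groups.max() + 1, values.shape[1]))
    np.add.at(sums, groups, values)           # per-group column sums
    counts = np.bincount(groups)
    return values - (sums / counts[:, None])[groups]


X_demeaned = demean_by_group(X, person)

rank_all = np.linalg.matrix_rank(X_demeaned)             # both regressors
rank_varying = np.linalg.matrix_rank(X_demeaned[:, :1])  # time-varying one only

print(f"regressors: {X.shape[1]}, rank after demeaning: {rank_all}")
print(f"rank using only the time-varying regressor: {rank_varying}")
```

The demeaned time-constant column is all zeros, so the demeaned matrix has fewer independent columns than regressors and the coefficient on that variable cannot be estimated.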
-
Describing the Problem with Panel Data
- Panel data is useful when controlling for confounding with non-random data (i.e. non-experimental data)
- However, it doesn't suit every scenario, because its identifying assumptions can fail
-
There are two common situations when panel methods fail to recover causal effects:
- When we have reverse causality (i.e. the outcome also affects the treatment)
- When the unmeasured confounders change over time
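The second situation can be illustrated with a simulation that compares a confounder fixed over time with one that changes over time. This is only a sketch under assumed values: the data are simulated, the true effect is set to 1.0, and `within_estimate` is a hypothetical helper that time-demeans both variables and regresses one on the other.

```python
import numpy as np

rng = np.random.default_rng(3)
n_people, n_periods = 300, 5
true_effect = 1.0  # assumed causal effect, chosen for the simulation
person = np.repeat(np.arange(n_people), n_periods)


def within_estimate(x, y, groups):
    """Time-demean x and y by group, then regress y on x without intercept."""
    counts = np.bincount(groups)

    def demean(v):
        return v - (np.bincount(groups, weights=v) / counts)[groups]

    xd, yd = demean(x), demean(y)
    return float(np.dot(xd, yd) / np.dot(xd, xd))


fixed_u = rng.normal(size=n_people)[person]   # confounder constant over time
moving_u = rng.normal(size=person.size)       # confounder changing over time
noise_x = rng.normal(size=person.size)
noise_y = rng.normal(size=person.size)

# Case A: the unobserved confounder is constant within each person.
x_a = fixed_u + noise_x
y_a = true_effect * x_a + 2.0 * fixed_u + noise_y
est_fixed = within_estimate(x_a, y_a, person)

# Case B: the unobserved confounder changes from period to period.
x_b = moving_u + noise_x
y_b = true_effect * x_b + 2.0 * moving_u + noise_y
est_moving = within_estimate(x_b, y_b, person)

print(f"confounder fixed over time:    {est_fixed:.2f}")
print(f"confounder changing over time: {est_moving:.2f}")
```

Demeaning removes the confounder in the first case but not in the second, so only the first estimate should come out near the assumed effect.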