Applied Econometrics · Econometrics III
Department of Economics, WU Vienna
The Potential Outcomes (PO) framework is one way to view causal questions.
The potential outcomes framework relates very clearly to the notion of a randomized experiment.
A different framework that has its strengths elsewhere: the Directed Acyclic Graphs (DAG) framework.
Why do we talk about DAGs in an Econometrics class? Because they are really useful for causal modeling.
In the following DAG, nodes represent (random) variables, and edges represent (hypothesized) causal effects.
Missing edges also convey information: the assumption of no causal effect.
The causal effect of interest, i.e., \(X \rightarrow Y\), is not always as straightforward as we would like. Instead, we may encounter:
Before and After Comparisons
There is a lot of remotely sensed data that is useful for Development Economics:
There are many ways to access this data, among them: Google Earth Engine, NASA Worldview, data download
From your classes in Econometrics I and Econometrics II, you know what an experiment is and that experiments allow us to use some very convenient methods and estimators. You may also have heard of quasi-experiments, but we are going to revisit them anyway.
We are first going to look at natural experiments.
A natural experiment is a study where an experimental setting is induced by nature or other factors outside our control.
U.S. Representative Alexander Pirnie of New York drawing the first capsule in the Vietnam war draft lottery.
In the 1800s, London (like many other places) was repeatedly hit by waves of cholera epidemics.
Running a true experiment was of course infeasible in this context. It would have required randomizing households and allocating clean water to only a subset of them, which was both logistically impractical and ethically questionable.
In 1852, the following happened:
One water company moved its pipes further upstream, to a location that incidentally was upstream of the main sewage discharge facility. Suddenly, households in the same neighborhoods had access to different qualities of water.
There were a few other factors that made this situation a natural experiment:
Photograph by Hisgett (2015).
In the end, John Snow collected very convincing evidence for his theory and went on to identify a certain contaminated water pump. The theory, however, was deemed politically unpleasant and was thus not accepted until long after Snow’s death.
The following three examples for quasi-experimental research designs are taken from Stock & Watson (2019):
Let us start by looking at probably the simplest estimator for a treatment effect we will ever encounter. The differences estimator can be used in experimental settings and can be computed like this:
\[ y_{i} = \beta_0 + D_i\beta_1 + u_i, \]
where \(D_i\) is the treatment indicator.
If \(D\) is truly randomly assigned, then \(\mathrm{E}(u_i\mid D_i)=0\), and the OLS estimator of the causal effect \(\beta_1\) is unbiased and consistent.
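As a sketch of why this works, the following simulation (all parameter values are made up for illustration) assigns treatment at random and shows that the OLS coefficient on \(D_i\) coincides with the simple difference in group means:

```python
import numpy as np

# Simulated experiment: D is randomly assigned, so E(u | D) = 0.
# beta0 and beta1 are assumed values for illustration only.
rng = np.random.default_rng(42)
n = 10_000
D = rng.integers(0, 2, n)               # random treatment assignment
u = rng.normal(0, 1, n)
beta0, beta1 = 2.0, 1.5
y = beta0 + beta1 * D + u

# OLS: regress y on an intercept and the treatment dummy
X = np.column_stack([np.ones(n), D])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# The differences estimator is just the difference in group means
diff_in_means = y[D == 1].mean() - y[D == 0].mean()
print(coef[1], diff_in_means)           # both close to beta1 = 1.5
```

With random assignment, the two computations are algebraically identical; the regression form simply makes it easy to add covariates or robust standard errors later.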
We are discussing this estimator not only because we can use it when we run an experiment, but also because what we discuss next is based on it.
In natural experiment settings, as long as we have data on both groups from both before the intervention and after the intervention, we can use the difference-in-differences (DiD) estimator. This is a very convenient and simple approach.
We start by collecting data for the following four subsets of our dataset:
|  | Before | After |
|---|---|---|
| Control | \(\dots\) | \(\dots\) |
| Treatment | \(\dots\) | \(\dots\) |
That is, we compute averages for the treatment group and the control group, both for the pre-intervention period and the post-intervention period. Alternatively, we can write this down as follows using three dummies:
\[ y_{i t} = \alpha + \text{after}\: \phi + \text{treated} \: \theta + \text{after}\times\text{treated}\: \delta + u_{it} \]
We can use the coefficients to express the averages in the table from before:
|  | Before | After | Difference |
|---|---|---|---|
| Control | \(\alpha\) | \(\alpha + \phi\) | \(\phi\) |
| Treatment | \(\alpha + \theta\) | \(\alpha + \theta + \phi + \delta\) | \(\phi + \delta\) |
| Difference | \(\theta\) | \(\theta + \delta\) | \(\delta\) |
We also added differences to the table. Looking at the difference column: \(\phi\) expresses by how much the average in the control group changed from before to after the treatment, and \(\phi+\delta\) is the corresponding change for the treatment group.
The difference between the two, \(\delta\), represents the average treatment effect (as long as group allocation is as good as random). We can obtain its estimate \(\hat{\delta}\) simply by comparing the four group averages in this way.
This illustration shows one important implicit assumption we make when comparing differences of differences.
The two groups are assumed to follow parallel trends before the treatment is applied. Without the treatment, they would have continued in parallel, but the treatment makes the difference between the two lines change.
Let us estimate an average treatment effect using the DiD estimator. It is also very simple to implement.

The loedata::Fastfood dataset is a simple example for DiD estimation. It contains data on fast food restaurants' full-time equivalent employment (fte) for two U.S. states (nj and pa) before and after a minimum wage increase in New Jersey in 1992.

Alternatively, we can write the regression equation for the difference-in-differences estimator for a given observation \(i\) like this, using first differences:
\[ \Delta y_i = \beta_0 + D_i\beta_1 + u_i, \]
where \(D_i\) is the treatment group indicator and \(\Delta y_i\) is the difference in outcomes for individual \(i\) between the pre- and the post-treatment period. The OLS estimator for \(\beta_1\) in this equation is the DiD estimator.
What if we want to include additional covariates? Easy:
\[ \Delta y_i = \beta_0 + D_i\beta_1 + \boldsymbol{x}_i'\boldsymbol{\gamma} + u_i, \]
with additional variables stored in the vector \(\boldsymbol{x}_i\).
What happens when we do not have a true panel, but only repeated cross-sections? We can still use the DiD estimator.
Instead of using the convenient first-difference notation, \(\Delta y_i = \beta_0 + D_i\beta_1 + u_i\), we have to resort to specifying the model in full, but conceptually nothing changes:
\[ y_{it} = \beta_0 + D_{it}\beta_1 + T_{it}\beta_2+P_{it}\beta_3 + \boldsymbol{x}_i'\boldsymbol{\gamma} + u_{it}, \]
where \(D_{it}\) is the treatment indicator as before. It is the interaction of the indicator for being in the (surrogate or actual) treatment group, \(T_{it}\), and the indicator for being in the post-treatment time period, \(P_{it}\); that is, \(D_{it} = T_{it}\times P_{it}\).
Consider the following case: a variable \(Z_i\) is as good as randomly assigned (think of a lottery), it affects take-up of the treatment, and it affects the outcome only through the treatment.
Then, if we observe both \(D_i\), the treatment indicator, and \(Z_i\), \(Z_i\) is a valid instrument for the treatment \(D_i\).
This means that we can use the instrument to estimate simple differences:
\[ y_{i} = \beta_0 + \hat{D}_i\beta_1 + u_i, \]
where \(\hat{D}_i\) is the estimated treatment indicator from the first stage.
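The two stages can be sketched with simulated data as follows; the instrument, the confounder, and all coefficient values here are assumptions for illustration:

```python
import numpy as np

# Z is randomly assigned, shifts take-up D, and affects y only
# through D; the unobserved confounder c makes naive OLS biased.
rng = np.random.default_rng(1)
n = 20_000
Z = rng.integers(0, 2, n)
c = rng.normal(0, 1, n)                      # unobserved confounder
# take-up is more likely when Z == 1 and when c is high
D = (0.8 * Z + 0.5 * c + rng.normal(0, 1, n) > 0.5).astype(float)
beta1 = 1.0                                  # assumed true effect
y = 2.0 + beta1 * D + c + rng.normal(0, 1, n)

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

# First stage: predict D from the instrument
Xz = np.column_stack([np.ones(n), Z])
Dhat = Xz @ ols(Xz, D)
# Second stage: regress y on the predicted treatment
iv_coef = ols(np.column_stack([np.ones(n), Dhat]), y)
naive_coef = ols(np.column_stack([np.ones(n), D]), y)
print(iv_coef[1], naive_coef[1])
```

The naive OLS coefficient is biased upward because take-up correlates with the confounder, while the two-stage estimate is close to the assumed effect.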
We have mentioned before that the most important assumption in a DiD setting is that of parallel trends. Since we can never observe the counterfactual, we cannot directly verify that trends would have remained parallel in the absence of treatment. What can we do instead?
The following is an example of an event study graph taken from Miller et al. (2021).
Their study is about the relationship between an expansion of Medicaid, a U.S. public health insurance program for low-income individuals, and mortality.
Pre-treatment coefficients are close to zero – this can be seen as an indication of parallel trends.
After the treatment, the difference between the differences becomes negative: Enrolled individuals’ mortality is lower.
Let us look at the following dataset from the bacondecomp package.
When we discussed the DiD estimator, we assumed that all units are treated at exactly the same time. That may not always be the case, however. When treatment is staggered, we often use the two-way fixed effects (TWFE) estimator, which we obtain from this equation:
\[ y_{it} = \mu + \delta D_{it} + \varphi_i+\lambda_t+u_{it}. \]
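A minimal sketch of this estimator under staggered adoption, assuming a homogeneous treatment effect (heterogeneous effects under staggered timing are precisely where TWFE becomes problematic, which is what the Bacon decomposition investigates). The fixed effects \(\varphi_i\) and \(\lambda_t\) are implemented as full sets of unit and time dummies; all numbers are made up:

```python
import numpy as np

rng = np.random.default_rng(7)
N, T = 50, 10
unit = np.repeat(np.arange(N), T)
time = np.tile(np.arange(T), N)
# staggered adoption: each unit starts treatment at a random period
# (start values of 10 or 11 mean the unit is never treated)
start = rng.integers(3, 12, N)
D = (time >= start[unit]).astype(float)
phi = rng.normal(0, 1, N)                  # unit fixed effects
lam = rng.normal(0, 1, T)                  # time fixed effects
delta = 1.0                                # assumed homogeneous effect
y = phi[unit] + lam[time] + delta * D + rng.normal(0, 0.5, N * T)

# OLS with full sets of unit and time dummies (one of each dropped)
X = np.column_stack([
    np.ones(N * T), D,
    (unit[:, None] == np.arange(1, N)).astype(float),
    (time[:, None] == np.arange(1, T)).astype(float),
])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef[1])   # close to delta = 1.0 when effects are homogeneous
```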
A Regression Discontinuity Design (RDD) is another type of quasi-experimental design.
We make use of a sharp cutoff in some running variable and compare values immediately below and immediately above the cutoff.
The size of the discontinuity in outcomes gives us the local treatment effect.
One frequently used setting for a regression discontinuity design is something where a certain test score is required for e.g. being admitted to a university.
Another frequently used design makes use of close elections. Imagine two candidates run for office and the result is close to 50-50. Then, their districts are probably similar and different only in who governs them after the election.
Think
Can you think of more examples where discontinuities arise naturally?
There are two ways in which a discontinuity can affect treatment.
Treatment status \(D_i\) is a deterministic function of the running variable \(x_i\) that jumps at the cutoff \(c\):
\[ D_i = \begin{cases} 1 & \text{if } x_i\geq c\\ 0 & \text{if } x_i<c \end{cases}. \]
Using a potential outcomes framework, we get
\[ \begin{aligned} y_i^{(0)}&=\beta_0+\beta_1x_i+u_i \\ y_i^{(1)}&=y_i^{(0)}+\delta, \end{aligned} \]
where \(\delta\) is the treatment effect.
Consider the following general switching equation:
\[ \begin{aligned} y_i &= D_iy_i^{(1)} + (1-D_i)y_i^{(0)} \\ y_i &= y_i^{(0)}+\left(y_i^{(1)}-y_i^{(0)}\right)D_i. \end{aligned} \]
Applying this to the RDD setting, we get
\[ y_i = \beta_0+\beta_1x_i + \delta D_i + u_i \]
We can find the treatment effect \(\delta=\mathrm{E}(y_i^{(1)}-y_i^{(0)}\mid x_i=c)\) like this:
\[ \begin{aligned} \delta & =\lim_{x\:\downarrow\:c} \mathrm{E}\big(y^{(1)}_i\mid x_i=x\big)-\lim_{x\:\uparrow\:{c}} \mathrm{E}\big(y^{(0)}_i\mid x_i=x\big) \\ & = \lim_{x\:\downarrow\:c} \mathrm{E}\big(y_i\mid x_i=x\big) -\lim_{x\:\uparrow\:{c}} \mathrm{E}\big(y_i\mid x_i=x\big) \end{aligned} \]
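A quick simulation of a sharp design (all values made up) recovers \(\delta\) as the coefficient on the treatment indicator in the pooled regression from above:

```python
import numpy as np

# Sharp RDD: treatment switches deterministically at the cutoff c,
# and the jump in the regression function at c is delta.
rng = np.random.default_rng(3)
n, c = 5000, 0.0
x = rng.uniform(-1, 1, n)                  # running variable
D = (x >= c).astype(float)                 # sharp treatment rule
delta = 2.0                                # assumed true local effect
y = 1.0 + 0.8 * x + delta * D + rng.normal(0, 1, n)

# OLS of y on x and the treatment indicator recovers the jump
X = np.column_stack([np.ones(n), x, D])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef[2])   # close to delta = 2.0
```

In practice one would restrict the sample to a window around the cutoff and allow separate slopes on each side, as packages like rdrobust do; the global linear fit here is only a sketch.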
For an ideal RDD, we need a few things:
In practice, these requirements are hard to check.
A common problem is “fabricating” a discontinuity by overfitting the data to both sides of the cutoff.
In the example on the left, there is obviously no discontinuity – yet we can fit something that makes one appear.
To try estimating an RDD, we can use a classic example:
The rdrobust_RDsenate dataset from the rdrobust package contains U.S. Senate election results. Here, margin is the Democratic candidate's winning margin in election \(t\), and vote is the Democratic vote share in election \(t+1\).

Image by Grandjean (2013)
Many real-world relationships can be thought of as being organized in networks.
What you see on the right is what we call a graph. Depending on which Econometrics II class you took, you may remember this from the section on DAGs.
This graph has three nodes. They are labeled \(i\), \(j\), and \(k\). Sometimes, we call the nodes “vertices,” “agents,” “points,” etc.
Some of the nodes in a graph are usually connected to each other, while others are not. We call those connections edges. Alternatively, they can be called “links,” “connections,” “lines,” etc.
Edges are pairs of nodes. In the second graph, there is one edge from \(i\) to \(j\). We call this edge \(\{i,j\}\).
This edge does not have a direction.
However, we can easily give edges a direction. We call an edge like this a directed edge. When an edge is directed, the corresponding pair of nodes is no longer an unordered pair \(\{i,j\}\), but an ordered pair: \((j,i)\neq(i,j)\).
A walk is a sequence of edges that joins a sequence of nodes. A cycle is a special case of a walk in which the initial and final nodes coincide and all other nodes are distinct. In this graph, \(\left\{\{a,b\},\{b,c\},\{c,a\}\right\}\) is a cycle.
This network graph shows refined copper trade flows in the year 2023 between a number of countries, colored by continent.
We can see that European countries cluster together since they trade a lot among each other.
We can investigate this a little further. Say we want to draw a network graph of all subway stations in Vienna.
We end up with a symmetric adjacency matrix that has 99 rows/columns.
\[ \boldsymbol{W}= \left( \begin{array}{c|ccccc c} & \text{AD} & \text{AE} & \text{AK} & \text{AL} & \text{AN} & \\ \hline \text{AD} & 0 & 0 & 0 & 0 & 0 & \cdots \\ \text{AE} & 0 & 0 & 0 & 0 & 0 & \cdots \\ \text{AK} & 0 & 0 & 0 & 0 & 0 & \cdots \\ \text{AL} & 0 & 0 & 0 & 0 & 0 & \cdots \\ \text{AN} & 0 & 0 & 0 & 0 & 0 & \cdots \\ & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots \end{array} \right) \]
This graph is constructed from the adjacency matrix from before, with no geographic information.
The resulting graph resembles the actual subway map closely. Using a standard plotting algorithm, we can recover the latent space encoded in the network.
Networks contain interesting information beyond plain links.
Some stations are depicted in the center, and when you are there, you can reach other places easily.
This relates to an important concept: the centrality of nodes in a network.
The simplest centrality measure is a node's degree, the number of edges connected to it.
More elaborate measures include eigenvector centrality, a derivative of which is used to rank pages in Google searches.
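A small sketch of both measures for a made-up undirected network; eigenvector centrality is computed by power iteration on the adjacency matrix, which converges to its leading eigenvector:

```python
import numpy as np

# A symmetric adjacency matrix for a small example network
# (the network itself is made up for illustration).
A = np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0],
    [1, 1, 0, 1, 1],   # node 2 is the best-connected hub
    [0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0],
])

# Degree centrality: number of connections, i.e., row sums
degree = A.sum(axis=1)

# Eigenvector centrality: leading eigenvector of A via power iteration
v = np.ones(A.shape[0])
for _ in range(200):
    v = A @ v
    v = v / np.linalg.norm(v)
print(degree, np.round(v, 3))   # node 2 ranks highest on both
```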
Your immediate connections in a network are your peers or neighbors. In economics, we are interested in the role of peers for multiple reasons.
Most of what we discuss next is a simplified account of Manski (1993), who codified the literature on peer effects and laid the foundation for much subsequent research.
Manski (1993) models an agent’s response as a combination of the following:
There are two ways in which an agent’s peers can influence the agent.
Assume an agent’s utility depends on their response, as well as their peers’ responses. Since both the agent’s response and their utility depend on the others’ actions, their best responses form a system of simultaneous equations. This is what Manski (1993) calls the Reflection Problem.
The image on this slide depicts the European continent by night. We can see that densely populated areas are brighter than sparsely populated areas.
Think
Why did people settle in exactly this pattern? If you were to found a settlement, where would you do that?
We have finally acquired enough building blocks to be able to discuss econometrics in space.
Let us approach this chapter by picturing a situation we know well.
We are researchers that want to investigate how education affects average income in a given areal unit. As observations, we choose NUTS-3 regions, standardized small-scale regions across Europe. Assume we have all data we need and we have dealt with all endogeneity other than that arising from spatial factors.
We start by modeling our situation using a regular linear model:
\[ \boldsymbol{y} = \boldsymbol{X\beta}+\boldsymbol{u}, \qquad \boldsymbol{u}\sim\boldsymbol{N}(\boldsymbol{0},\sigma^2\boldsymbol{I}) \]
In what way does space affect the outcome of a given observation?
What does that mean in our example?
The most straightforward extension is the spatial autoregressive model:
\[ \boldsymbol{y} = \lambda\boldsymbol{Wy}+\boldsymbol{X\beta}+\boldsymbol{u}. \]
Here, \(\lambda\) is a spatial autoregressive parameter, and \(\boldsymbol{Wy}\) is a spatially lagged version of the outcome.
The model looks simple, but poses challenges that are not at all trivial.
What happens when we do not lag the outcome, but only the characteristics? Then, we get a spatial lag of \(\boldsymbol{X}\) model:
\[ \boldsymbol{y} = \boldsymbol{X\beta}+ \boldsymbol{WX\theta}+\boldsymbol{u}. \]
\(\boldsymbol{WX}\) is a matrix of spatially lagged covariates, and \(\boldsymbol{\theta}\) is the associated coefficient.
This model does not capture spatial autoregressive properties of the outcome, but is much easier to deal with.
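A minimal sketch of estimating an SLX model (the chain-style neighborhood structure and all coefficient values are assumptions): because the model is linear in \(\boldsymbol{X}\) and \(\boldsymbol{WX}\), plain OLS applies.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
# Assumed neighborhood structure: unit i is linked to i-1 and i+1
W = np.zeros((n, n))
idx = np.arange(n - 1)
W[idx, idx + 1] = 1.0
W[idx + 1, idx] = 1.0
W = W / W.sum(axis=1, keepdims=True)   # row-standardize

x = rng.normal(0, 1, n)
beta, theta = 1.0, 0.5                 # assumed true coefficients
y = beta * x + theta * (W @ x) + rng.normal(0, 1, n)

# OLS on the covariate and its spatial lag
Xmat = np.column_stack([np.ones(n), x, W @ x])
coef, *_ = np.linalg.lstsq(Xmat, y, rcond=None)
print(coef[1], coef[2])   # close to beta = 1.0 and theta = 0.5
```

Note that this only works because the spatially lagged term involves the exogenous covariates, not the outcome; a SAR model would require additional estimation machinery.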
The Spatial Durbin Model combines the SAR and the SLX models:
\[ \boldsymbol{y} = \lambda\boldsymbol{Wy}+\boldsymbol{X\beta}+ \boldsymbol{WX\theta}+\boldsymbol{u}. \]
As before,
Instead of explicitly including spatially lagged regressors, we can also allow for spatial structure in the error term:
\[ \boldsymbol{y} = \boldsymbol{X\beta}+\boldsymbol{e}, \qquad \boldsymbol{e}=\varrho\boldsymbol{We}+\boldsymbol{u} \]
This gives us a linear model with spatially autoregressive errors. Like everything else that relies on a spatial weights matrix, this only yields meaningful results conditional on us specifying the correct \(\boldsymbol{W}\). As with the SAR model, estimation is not straightforward and requires additional assumptions.
What is often done in practice, when one suspects a spatial pattern in the errors, is to estimate a regular linear model using OLS and to use standard errors that are robust to spatial autocorrelation.
Social Networks
For the image on this slide, someone (not me) collected data on their Instagram followers, and the connections between them.
Incidentally, this particular social network comes from a social media site, the kind of platform we colloquially call a "social network." In the context of network analysis, however, a social network is any network that connects people.
Think
Colors were manually assigned, but positions were not. Why do similar people cluster together?
Image by Giraffael (2024)