Applied Econometrics · Econometrics III
Department of Economics, WU Vienna
The Potential Outcomes (PO) framework is one way to view causal questions.
The potential outcomes framework relates very clearly to the notion of a randomized experiment.
A different framework that has its strengths elsewhere: the Directed Acyclic Graphs (DAG) framework.
Why do we talk about DAGs in an Econometrics class? Because they are really useful for causal modeling.
In the following DAG, nodes represent (random) variables, and edges represent (hypothesized) causal effects.
Missing edges also convey information: the assumption of no causal effect.
The causal effect of interest, i.e., \(X \rightarrow Y\), is not always as straightforward as we would like. Instead, we may encounter:
Before and After Comparisons
There is a lot of remotely sensed data that is useful for Development Economics:
There are many ways to access this data, among them: Google Earth Engine, NASA Worldview, data download
From your classes in Econometrics I and Econometrics II, you know what an experiment is and that experiments allow us to use some very convenient methods and estimators. You may also have heard of quasi-experiments, but we are going to revisit them anyway.
We are first going to look at natural experiments.
A natural experiment is a study where an experimental setting is induced by nature or other factors outside our control.
U.S. Representative Alexander Pirnie of New York drawing the first capsule in the Vietnam war draft lottery.
In the 1800s, London (like many other places) was repeatedly hit by waves of cholera epidemics.
Running a true experiment was of course infeasible in this context. It would have required randomizing households and allocating clean water to only a subset of them, which was both logistically impractical and ethically questionable.
In 1852, the following happened:
One water company moved its pipes further upstream, to a location that incidentally was upstream of the main sewage discharge facility. Suddenly, households in the same neighborhoods had access to different qualities of water.
There were a few other factors that made this situation a natural experiment:
Photograph by Hisgett (2015).
In the end, John Snow collected very convincing evidence for his theory and went on to identify a certain contaminated water pump. The theory, however, was deemed politically unpleasant and was thus not accepted until long after Snow’s death.
The following three examples for quasi-experimental research designs are taken from Stock & Watson (2019):
Let us start by looking at probably the simplest estimator for a treatment effect we will ever encounter. The differences estimator can be used in experimental settings and can be computed like this:
\[ y_{i} = \beta_0 + D_i\beta_1 + u_i, \]
where \(D_i\) is the treatment indicator.
If \(D\) is truly randomly assigned, then \(\mathrm{E}(u_i\mid D_i)=0\), and the OLS estimator of the causal effect \(\beta_1\) is unbiased and consistent.
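As a sketch of why this works, the following simulation (all parameter values are made up for illustration) assigns treatment at random and shows that the OLS coefficient on \(D_i\) coincides with the simple difference in group means:

```python
import numpy as np

# Simulated experiment: D is randomly assigned, so E(u | D) = 0.
# beta0 and beta1 are assumed values for illustration only.
rng = np.random.default_rng(42)
n = 10_000
D = rng.integers(0, 2, n)               # random treatment assignment
u = rng.normal(0, 1, n)
beta0, beta1 = 2.0, 1.5
y = beta0 + beta1 * D + u

# OLS: regress y on an intercept and the treatment dummy
X = np.column_stack([np.ones(n), D])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# The differences estimator is just the difference in group means
diff_in_means = y[D == 1].mean() - y[D == 0].mean()
print(coef[1], diff_in_means)           # both close to beta1 = 1.5
```

With random assignment, the two computations are algebraically identical; the regression form simply makes it easy to add covariates or robust standard errors later.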
We are discussing this estimator not only because we can use it when we run an experiment, but also because what we discuss next is based on it.
In natural experiment settings, as long as we have data on both groups from both before the intervention and after the intervention, we can use the difference-in-differences (DiD) estimator. This is a very convenient and simple approach.
We start by collecting data for the following four subsets of our dataset:
|  | Before | After |
|---|---|---|
| Control | \(\dots\) | \(\dots\) |
| Treatment | \(\dots\) | \(\dots\) |
That is, we compute averages for the treatment group and the control group, both for the pre-intervention period and the post-intervention period. Alternatively, we can write this down as follows using three dummies:
\[ y_{i t} = \alpha + \text{after}\: \phi + \text{treated} \: \theta + \text{after}\times\text{treated}\: \delta + u_{it} \]
We can use the coefficients to express the averages in the table from before:
|  | Before | After | Difference |
|---|---|---|---|
| Control | \(\alpha\) | \(\alpha + \phi\) | \(\phi\) |
| Treatment | \(\alpha + \theta\) | \(\alpha + \theta + \phi + \delta\) | \(\phi + \delta\) |
| Difference | \(\theta\) | \(\theta + \delta\) | \(\delta\) |
We also added differences to the table. Looking at the difference column: \(\phi\) expresses by how much the average in the control group changed from before to after the treatment, and \(\phi+\delta\) is the corresponding change for the treatment group.
The difference between the two, \(\delta\), represents the average treatment effect (as long as group allocation is as good as random). We can obtain its estimate \(\hat{\delta}\) simply by comparing the four group averages in this way.
This illustration shows one important implicit assumption we make when comparing differences of differences.
The two groups are assumed to follow parallel trends before the treatment is applied. Without the treatment, they would have continued in parallel, but the treatment makes the difference between the two lines change.
Let us estimate an average treatment effect using the DiD estimator. It is also very simple to implement.

The loedata::Fastfood dataset is a simple example for DiD estimation. It contains data on fast food restaurants' full-time equivalent employment (fte) for two U.S. states (nj and pa) before and after a minimum wage increase in New Jersey in 1992.

Alternatively, we can write the regression equation for the difference-in-differences estimator for a given observation \(i\) like this, using first differences:
\[ \Delta y_i = \beta_0 + D_i\beta_1 + u_i, \]
where \(D_i\) is the treatment group indicator and \(\Delta y_i\) is the difference in outcomes for individual \(i\) between the pre- and the post-treatment period. The OLS estimator for \(\beta_1\) in this equation is the DiD estimator.
What if we want to include additional covariates? Easy:
\[ \Delta y_i = \beta_0 + D_i\beta_1 + \boldsymbol{x}_i'\boldsymbol{\gamma} + u_i, \]
with additional variables stored in the vector \(\boldsymbol{x}_i\).
What happens when we do not have a true panel, but only repeated cross-sections? We can still use the DiD estimator.
Instead of using the convenient first-difference notation, \(\Delta y_i = \beta_0 + D_i\beta_1 + u_i\), we have to resort to specifying the model in full, but conceptually nothing changes:
\[ y_{it} = \beta_0 + D_{it}\beta_1 + T_{it}\beta_2+P_{it}\beta_3 + \boldsymbol{x}_i'\boldsymbol{\gamma} + u_{it}, \]
where \(D_{it}\) is the treatment indicator as before. It is the interaction of the indicator for being in the (surrogate or actual) treatment group, \(T_{it}\), and the indicator for being in the post-treatment time period, \(P_{it}\); that is, \(D_{it} = T_{it}\times P_{it}\).
Consider the following case: a variable \(Z_i\) is as good as randomly assigned (think of a lottery), it affects take-up of the treatment, and it affects the outcome only through the treatment.
Then, if we observe both \(D_i\), the treatment indicator, and \(Z_i\), \(Z_i\) is a valid instrument for the treatment \(D_i\).
This means that we can use the instrument to estimate simple differences:
\[ y_{i} = \beta_0 + \hat{D}_i\beta_1 + u_i, \]
where \(\hat{D}_i\) is the estimated treatment indicator from the first stage.
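The two stages can be sketched with simulated data as follows; the instrument, the confounder, and all coefficient values here are assumptions for illustration:

```python
import numpy as np

# Z is randomly assigned, shifts take-up D, and affects y only
# through D; the unobserved confounder c makes naive OLS biased.
rng = np.random.default_rng(1)
n = 20_000
Z = rng.integers(0, 2, n)
c = rng.normal(0, 1, n)                      # unobserved confounder
# take-up is more likely when Z == 1 and when c is high
D = (0.8 * Z + 0.5 * c + rng.normal(0, 1, n) > 0.5).astype(float)
beta1 = 1.0                                  # assumed true effect
y = 2.0 + beta1 * D + c + rng.normal(0, 1, n)

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

# First stage: predict D from the instrument
Xz = np.column_stack([np.ones(n), Z])
Dhat = Xz @ ols(Xz, D)
# Second stage: regress y on the predicted treatment
iv_coef = ols(np.column_stack([np.ones(n), Dhat]), y)
naive_coef = ols(np.column_stack([np.ones(n), D]), y)
print(iv_coef[1], naive_coef[1])
```

The naive OLS coefficient is biased upward because take-up correlates with the confounder, while the two-stage estimate is close to the assumed effect.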
We have mentioned before that the most important assumption in a DiD setting is that of parallel trends. Since we can never observe the counterfactual, we cannot directly verify that trends would have remained parallel in the absence of treatment. What can we do instead?
The following is an example of an event study graph taken from Miller et al. (2021).
Their study is about the relationship between an expansion of Medicaid, a U.S. public health insurance program for low-income individuals, and mortality.
Pre-treatment coefficients are close to zero – this can be seen as an indication of parallel trends.
After the treatment, the difference between the differences becomes negative: Enrolled individuals’ mortality is lower.
Let us look at the following dataset from the bacondecomp package.
When we discussed the DiD estimator, we assumed that all units are treated at exactly the same time. That may not always be the case, however. When treatment is staggered, we often use the two-way fixed effects (TWFE) estimator, which we obtain from this equation:
\[ y_{it} = \mu + \delta D_{it} + \varphi_i+\lambda_t+u_{it}. \]
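A minimal sketch of this estimator under staggered adoption, assuming a homogeneous treatment effect (heterogeneous effects under staggered timing are precisely where TWFE becomes problematic, which is what the Bacon decomposition investigates). The fixed effects \(\varphi_i\) and \(\lambda_t\) are implemented as full sets of unit and time dummies; all numbers are made up:

```python
import numpy as np

rng = np.random.default_rng(7)
N, T = 50, 10
unit = np.repeat(np.arange(N), T)
time = np.tile(np.arange(T), N)
# staggered adoption: each unit starts treatment at a random period
# (start values of 10 or 11 mean the unit is never treated)
start = rng.integers(3, 12, N)
D = (time >= start[unit]).astype(float)
phi = rng.normal(0, 1, N)                  # unit fixed effects
lam = rng.normal(0, 1, T)                  # time fixed effects
delta = 1.0                                # assumed homogeneous effect
y = phi[unit] + lam[time] + delta * D + rng.normal(0, 0.5, N * T)

# OLS with full sets of unit and time dummies (one of each dropped)
X = np.column_stack([
    np.ones(N * T), D,
    (unit[:, None] == np.arange(1, N)).astype(float),
    (time[:, None] == np.arange(1, T)).astype(float),
])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef[1])   # close to delta = 1.0 when effects are homogeneous
```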
A Regression Discontinuity Design (RDD) is another type of quasi-experimental design.
We make use of a sharp cutoff in some running variable and compare values immediately below and immediately above the cutoff.
The size of the discontinuity in outcomes gives us the local treatment effect.
One frequently used setting for a regression discontinuity design is something where a certain test score is required for e.g. being admitted to a university.
Another frequently used design makes use of close elections. Imagine two candidates run for office and the result is close to 50-50. Then, their districts are probably similar and different only in who governs them after the election.
Think
Can you think of more examples where discontinuities arise naturally?
There are two ways in which a discontinuity can affect treatment.
Treatment status \(D_i\) is a deterministic function of the running variable \(x_i\) that jumps at the cutoff \(c\):
\[ D_i = \begin{cases} 1 & \text{if } x_i\geq c\\ 0 & \text{if } x_i<c \end{cases}. \]
Using a potential outcomes framework, we get
\[ \begin{aligned} y_i^{(0)}&=\beta_0+\beta_1x_i+u_i \\ y_i^{(1)}&=y_i^{(0)}+\delta, \end{aligned} \]
where \(\delta\) is the treatment effect.
Consider the following general switching equation:
\[ \begin{aligned} y_i &= D_iy_i^{(1)} + (1-D_i)y_i^{(0)} \\ y_i &= y_i^{(0)}+\left(y_i^{(1)}-y_i^{(0)}\right)D_i. \end{aligned} \]
Applying this to the RDD setting, we get
\[ y_i = \beta_0+\beta_1x_i + \delta D_i + u_i \]
We can find the treatment effect \(\delta=\mathrm{E}(y_i^{(1)}-y_i^{(0)}\mid x_i=c)\) like this:
\[ \begin{aligned} \delta & =\lim_{x\:\downarrow\:c} \mathrm{E}\big(y^{(1)}_i\mid x_i=x\big)-\lim_{x\:\uparrow\:{c}} \mathrm{E}\big(y^{(0)}_i\mid x_i=x\big) \\ & = \lim_{x\:\downarrow\:c} \mathrm{E}\big(y_i\mid x_i=x\big) -\lim_{x\:\uparrow\:{c}} \mathrm{E}\big(y_i\mid x_i=x\big) \end{aligned} \]
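A quick simulation of a sharp design (all values made up) recovers \(\delta\) as the coefficient on the treatment indicator in the pooled regression from above:

```python
import numpy as np

# Sharp RDD: treatment switches deterministically at the cutoff c,
# and the jump in the regression function at c is delta.
rng = np.random.default_rng(3)
n, c = 5000, 0.0
x = rng.uniform(-1, 1, n)                  # running variable
D = (x >= c).astype(float)                 # sharp treatment rule
delta = 2.0                                # assumed true local effect
y = 1.0 + 0.8 * x + delta * D + rng.normal(0, 1, n)

# OLS of y on x and the treatment indicator recovers the jump
X = np.column_stack([np.ones(n), x, D])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef[2])   # close to delta = 2.0
```

In practice one would restrict the sample to a window around the cutoff and allow separate slopes on each side, as packages like rdrobust do; the global linear fit here is only a sketch.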
For an ideal RDD, we need a few things:
In practice, these requirements are hard to check.
A common problem is “fabricating” a discontinuity by overfitting the data to both sides of the cutoff.
In the example on the left, there is obviously no discontinuity – yet we can fit something that makes one appear.
To try estimating an RDD, we can use a classic example:
The rdrobust_RDsenate dataset from the rdrobust package contains U.S. Senate election results. Here, margin is the Democratic candidate's winning margin in election \(t\), and vote is the Democratic vote share in election \(t+1\).

Image by Grandjean (2013)
Many real-world relationships can be thought of as being organized in networks.
What you see on the right is what we call a graph. Depending on which Econometrics II class you took, you may remember this from the section on DAGs.
This graph has three nodes. They are labeled \(i\), \(j\), and \(k\). Sometimes, we call the nodes “vertices,” “agents,” “points,” etc.
Some of the nodes in a graph are usually connected to each other, while others are not. We call those connections edges. Alternatively, they can be called “links,” “connections,” “lines,” etc.
Edges are pairs of nodes. In the second graph, there is one edge from \(i\) to \(j\). We call this edge \(\{i,j\}\).
This edge does not have a direction.
However, we can easily give edges a direction. We call an edge like this a directed edge. When an edge is directed, the corresponding pair of nodes is no longer an unordered pair \(\{i,j\}\), but an ordered pair: \((j,i)\neq(i,j)\).
A walk is a sequence of edges that joins a sequence of nodes. A cycle is a special case of a walk in which the initial and final nodes coincide and all other nodes are distinct. In this graph, \(\left\{\{a,b\},\{b,c\},\{c,a\}\right\}\) is a cycle.
This network graph shows refined copper trade flows in the year 2023 between a number of countries, colored by continent.
We can see that European countries cluster together since they trade a lot among each other.
We can investigate this a little further. Say we want to draw a network graph of all subway stations in Vienna.
We end up with a symmetric adjacency matrix that has 99 rows/columns.
\[ \boldsymbol{W}= \left( \begin{array}{c|ccccc c} & \text{AD} & \text{AE} & \text{AK} & \text{AL} & \text{AN} & \\ \hline \text{AD} & 0 & 0 & 0 & 0 & 0 & \cdots \\ \text{AE} & 0 & 0 & 0 & 0 & 0 & \cdots \\ \text{AK} & 0 & 0 & 0 & 0 & 0 & \cdots \\ \text{AL} & 0 & 0 & 0 & 0 & 0 & \cdots \\ \text{AN} & 0 & 0 & 0 & 0 & 0 & \cdots \\ & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots \end{array} \right) \]
This graph is constructed from the adjacency matrix from before, with no geographic information.
The resulting graph resembles the actual subway map closely. Using a standard plotting algorithm, we can recover the latent space encoded in the network.
Networks contain interesting information beyond plain links.
Some stations are depicted in the center, and when you are there, you can reach other places easily.
This relates to an important concept: the centrality of nodes in a network.
The simplest centrality measure is a node's degree, the number of edges connected to it.
More elaborate measures include eigenvector centrality, a derivative of which is used to rank pages in Google searches.
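A small sketch of both measures for a made-up undirected network; eigenvector centrality is computed by power iteration on the adjacency matrix, which converges to its leading eigenvector:

```python
import numpy as np

# A symmetric adjacency matrix for a small example network
# (the network itself is made up for illustration).
A = np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0],
    [1, 1, 0, 1, 1],   # node 2 is the best-connected hub
    [0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0],
])

# Degree centrality: number of connections, i.e., row sums
degree = A.sum(axis=1)

# Eigenvector centrality: leading eigenvector of A via power iteration
v = np.ones(A.shape[0])
for _ in range(200):
    v = A @ v
    v = v / np.linalg.norm(v)
print(degree, np.round(v, 3))   # node 2 ranks highest on both
```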
Your immediate connections in a network are your peers or neighbors. In economics, we are interested in the role of peers for multiple reasons.
Most of what we discuss next is a simplified account of Manski (1993), who codified the literature on peer effects and laid the foundation for much subsequent research.
Manski (1993) models an agent’s response as a combination of the following:
There are two ways in which an agent’s peers can influence the agent.
Assume an agent’s utility depends on their response, as well as their peers’ responses. Since both the agent’s response and their utility depend on the others’ actions, their best responses form a system of simultaneous equations. This is what Manski (1993) calls the Reflection Problem.
The image on this slide depicts the European continent by night. We can see that densely populated areas are brighter than sparsely populated areas.
Think
Why did people settle in exactly this pattern? If you were to found a settlement, where would you do that?
We have finally acquired enough building blocks to be able to discuss econometrics in space.
Let us approach this chapter by picturing a situation we know well.
We are researchers that want to investigate how education affects average income in a given areal unit. As observations, we choose NUTS-3 regions, standardized small-scale regions across Europe. Assume we have all data we need and we have dealt with all endogeneity other than that arising from spatial factors.
We start by modeling our situation using a regular linear model:
\[ \boldsymbol{y} = \boldsymbol{X\beta}+\boldsymbol{u}, \qquad \boldsymbol{u}\sim\boldsymbol{N}(\boldsymbol{0},\sigma^2\boldsymbol{I}) \]
In what way does space affect the outcome of a given observation?
What does that mean in our example?
The most straightforward extension is the spatial autoregressive model:
\[ \boldsymbol{y} = \lambda\boldsymbol{Wy}+\boldsymbol{X\beta}+\boldsymbol{u}. \]
Here, \(\lambda\) is a spatial autoregressive parameter, and \(\boldsymbol{Wy}\) is a spatially lagged version of the outcome.
The model looks simple, but poses challenges that are not at all trivial.
What happens when we do not lag the outcome, but only the characteristics? Then, we get a spatial lag of \(\boldsymbol{X}\) model:
\[ \boldsymbol{y} = \boldsymbol{X\beta}+ \boldsymbol{WX\theta}+\boldsymbol{u}. \]
\(\boldsymbol{WX}\) is a matrix of spatially lagged covariates, and \(\boldsymbol{\theta}\) is the associated coefficient.
This model does not capture spatial autoregressive properties of the outcome, but is much easier to deal with.
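A minimal sketch of estimating an SLX model (the chain-style neighborhood structure and all coefficient values are assumptions): because the model is linear in \(\boldsymbol{X}\) and \(\boldsymbol{WX}\), plain OLS applies.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
# Assumed neighborhood structure: unit i is linked to i-1 and i+1
W = np.zeros((n, n))
idx = np.arange(n - 1)
W[idx, idx + 1] = 1.0
W[idx + 1, idx] = 1.0
W = W / W.sum(axis=1, keepdims=True)   # row-standardize

x = rng.normal(0, 1, n)
beta, theta = 1.0, 0.5                 # assumed true coefficients
y = beta * x + theta * (W @ x) + rng.normal(0, 1, n)

# OLS on the covariate and its spatial lag
Xmat = np.column_stack([np.ones(n), x, W @ x])
coef, *_ = np.linalg.lstsq(Xmat, y, rcond=None)
print(coef[1], coef[2])   # close to beta = 1.0 and theta = 0.5
```

Note that this only works because the spatially lagged term involves the exogenous covariates, not the outcome; a SAR model would require additional estimation machinery.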
The Spatial Durbin Model combines the SAR and the SLX models:
\[ \boldsymbol{y} = \lambda\boldsymbol{Wy}+\boldsymbol{X\beta}+ \boldsymbol{WX\theta}+\boldsymbol{u}. \]
As before,
Instead of explicitly including spatially lagged regressors, we can also allow for spatial structure in the error term:
\[ \boldsymbol{y} = \boldsymbol{X\beta}+\boldsymbol{e}, \qquad \boldsymbol{e}=\varrho\boldsymbol{We}+\boldsymbol{u} \]
This gives us a linear model with spatially autoregressive errors. Like everything else that relies on a spatial weights matrix, this only yields meaningful results conditional on us specifying the correct \(\boldsymbol{W}\). As with the SAR model, estimation is not straightforward and requires additional assumptions.
What is often done in practice, when one suspects a spatial pattern in the errors, is to estimate a regular linear model using OLS and to use standard errors that are robust to spatial autocorrelation.
Social Networks
For the image on this slide, someone (not me) collected data on their Instagram followers, and the connections between them.
Incidentally, this particular social network comes from a social media site, the kind of platform we colloquially call a "social network." In the context of network analysis, however, a social network is any network that connects people.
Think
Colors were manually assigned, but positions were not. Why do similar people cluster together?
Image by Giraffael (2024)