Applied Econometrics · Econometrics III
Department of Economics, WU Vienna
Notation and Basic Concepts
Time series methods focus on modelling random components. But there may also be fixed variation, such as seasonality in GDP or weekday effects.
We are looking to remove fixed components (such as a trend or seasonal patterns) and focus on random components of a time series. There are many ways to achieve this:
For example, many economic time series are analyzed after computing their (change in) logarithms, as they exhibit exponential growth.
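A minimal sketch of why log differences are convenient, assuming a hypothetical series `gdp` that grows at exactly 2% per period (the name and the numbers are illustrative, not course data). The change in logarithms turns exponential growth into a constant:

```python
import math

def log_differences(series):
    """First differences of the natural log of a strictly positive series."""
    return [math.log(curr) - math.log(prev) for prev, curr in zip(series, series[1:])]

# Hypothetical series with exponential growth of 2% per period
gdp = [100 * 1.02 ** t for t in range(6)]
growth = log_differences(gdp)

# Every log difference equals ln(1.02), which is close to the growth rate of 0.02
print([round(g, 4) for g in growth])
```

The level series is trending, while the log-differenced series is flat, which is the kind of transformation we are after.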
Consider the linear model and related assumptions:
\[ y_t = {x}_t'{\beta} + u_t, \qquad u_t \sim \mathcal{N}(0,\sigma^2). \]
When \(y\) and \(x\) are time series, Assumption (3) is often violated. We need a sensible alternative.
\[ \text{$j^{\text{th}}$ autocovariance} \;=\; \operatorname{cov}(y_t, y_{t-j}) \]
\[ \text{$j^{\text{th}}$ autocorrelation} \;=\; \rho_j \;=\; \operatorname{corr}(y_t, y_{t-j}) \;=\; \frac{\operatorname{cov}(y_t, y_{t-j})}{\sqrt{\operatorname{var}(y_t)\operatorname{var}(y_{t-j})}} \]
The autocorrelation function (ACF) is a useful summary for a stationary time series. It is defined as:
\[ \mathrm{ACF}(j) \;=\; \rho_j \;=\; \operatorname{corr}(y_t, y_{t-j}) \;=\; \frac{\operatorname{cov}(y_t, y_{t-j})}{\operatorname{var}(y_t)}. \]
The ACF is a common diagnostic tool that reveals interesting patterns of autocorrelation, but it may also indicate a lack of stationarity.
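The sample counterpart of the ACF can be sketched in a few lines. This is a minimal illustration, assuming the common estimator that divides the sample autocovariance at lag \(j\) by the sample variance; the two example series are made up:

```python
def sample_acf(y, max_lag):
    """Sample autocorrelations rho_0, ..., rho_max_lag of the series y."""
    n = len(y)
    mean = sum(y) / n
    dev = [v - mean for v in y]
    var = sum(d * d for d in dev)  # n times the sample variance
    acf = []
    for j in range(max_lag + 1):
        # n times the sample autocovariance at lag j
        cov_j = sum(dev[t] * dev[t - j] for t in range(j, n))
        acf.append(cov_j / var)
    return acf

trend = list(range(1, 21))                     # trending, non-stationary series
alternating = [(-1) ** t for t in range(20)]   # series that flips sign every period

acf_trend = sample_acf(trend, 3)
acf_alt = sample_acf(alternating, 3)
```

For the trending series the autocorrelations stay high and decay slowly, which is the typical sign of a lack of stationarity. For the alternating series the first autocorrelation is strongly negative.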
Autoregressions
\[ \mathbb{E}\left(y_t \mid y_{t-1}, y_{t-2}, \ldots \right) = \alpha_0 + \alpha_1 y_{t-1}. \]
\[ y_t = \alpha_0 + \alpha_1 y_{t-1} + \varepsilon_t, \]
The \(p\)-th-order autoregressive (AR(\(p\))) model represents the conditional expectation of \(y_t\) as a linear function of \(p\) of its lagged values:
\[ y_t = \alpha_0 + \alpha_1 y_{t-1} + \alpha_2 y_{t-2} + \cdots + \alpha_p y_{t-p} + \varepsilon_t. \]
If \(y_t\) follows an AR(\(p\)), then the oracle one-step-ahead forecast of \(y_{t+1}\) based on \(y_t, y_{t-1}, \ldots\) is
\[ y_{t+1\mid t} = \alpha_0 + \alpha_1 y_t + \alpha_2 y_{t-1} + \cdots + \alpha_p y_{t-p+1}. \]
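In practice the coefficients are unknown, so the oracle forecast is replaced by a feasible one based on OLS estimates. A minimal sketch for the AR(1) case, assuming simulated data with hypothetical values \(\alpha_0 = 1\) and \(\alpha_1 = 0.6\) (all function names are illustrative):

```python
import random

def simulate_ar1(alpha0, alpha1, n, seed=0):
    """Simulate y_t = alpha0 + alpha1 * y_{t-1} + eps_t with standard normal eps_t."""
    rng = random.Random(seed)
    y = [alpha0 / (1 - alpha1)]  # start at the unconditional mean
    for _ in range(n - 1):
        y.append(alpha0 + alpha1 * y[-1] + rng.gauss(0, 1))
    return y

def fit_ar1(y):
    """OLS of y_t on a constant and y_{t-1}; returns (alpha0_hat, alpha1_hat)."""
    x, z = y[:-1], y[1:]
    xbar, zbar = sum(x) / len(x), sum(z) / len(z)
    sxx = sum((a - xbar) ** 2 for a in x)
    sxz = sum((a - xbar) * (b - zbar) for a, b in zip(x, z))
    alpha1_hat = sxz / sxx
    return zbar - alpha1_hat * xbar, alpha1_hat

def forecast_one_step(y, alpha0, alpha1):
    """One-step-ahead forecast of y_{T+1} given the last observation y_T."""
    return alpha0 + alpha1 * y[-1]

y = simulate_ar1(alpha0=1.0, alpha1=0.6, n=5000)
a0_hat, a1_hat = fit_ar1(y)
oracle = forecast_one_step(y, 1.0, 0.6)          # uses the true coefficients
feasible = forecast_one_step(y, a0_hat, a1_hat)  # uses the OLS estimates
```

With a long sample the two forecasts are close, because the OLS estimates are close to the true coefficients.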
An autoregressive distributed lag (ADL) model combines lags of the dependent variable with lags of an additional predictor.
An ADL model with \(p\) lags of \(y_t\) and \(q\) lags of \(x_t\), denoted ADL(\(p,q\)), is
\[ y_t = \alpha_0 + \alpha_1 y_{t-1} + \alpha_2 y_{t-2} + \cdots + \alpha_p y_{t-p} + \delta_1 x_{t-1} + \delta_2 x_{t-2} + \cdots + \delta_q x_{t-q} + \varepsilon_t. \]
Here, \(\alpha_0,\alpha_1,\ldots,\alpha_p,\delta_1,\ldots,\delta_q\) are unknown coefficients and \(\varepsilon_t\) is the error term, with
\[ \mathrm{E}\left(\varepsilon_t \mid y_{t-1},y_{t-2},\ldots,x_{t-1},x_{t-2},\ldots \right)=0. \]
The general time series regression model allows for \(k\) additional predictors, where \(q_1\) lags of the first predictor are included, \(q_2\) lags of the second predictor are included, and so forth:
\[ \begin{aligned} y_t &= \alpha_0 + \alpha_1 y_{t-1} + \alpha_2 y_{t-2} + \cdots + \alpha_p y_{t-p} \\ &\quad + \delta_{11} x_{1,t-1} + \delta_{12} x_{1,t-2} + \cdots + \delta_{1q_1} x_{1,t-q_1} \\ &\quad + \cdots \\ &\quad + \delta_{k1} x_{k,t-1} + \delta_{k2} x_{k,t-2} + \cdots + \delta_{kq_k} x_{k,t-q_k} + \varepsilon_t . \end{aligned} \]
Under the assumptions 1-4, inference on the regression coefficients using OLS proceeds in the same way as it usually does for cross-sectional data.
For time series regression, the i.i.d. assumption is replaced by a more appropriate assumption with two parts:
(a) Stationarity: The data are drawn from a stationary distribution, so the distribution of the time series today is the same as its distribution in the past. The joint distribution of the variables (including lags) does not change over time.
(b) Weak dependence: The random variables become approximately independent when they are separated by long periods of time. Weak dependence ensures that, in large samples, there is enough randomness for the law of large numbers and the central limit theorem to hold.
\[ \text{BIC}(p) = \ln \left( \frac{SSR(p)}{T} \right) + (p + 1) \frac{\ln(T)}{T}, \]
The trade-off involved with lag length choice in the general time series regression model with multiple predictors is similar to that in an autoregression:
The choice of lags must balance the benefit of using additional information against the cost of estimating the additional coefficients.
As in an autoregression, the BIC and the AIC can be used to estimate the number of lags and variables in a time series regression model with multiple predictors. If the regression model has \(K\) coefficients (including the intercept), the BIC is
\[ \text{BIC}(K) = \ln\!\left(\frac{SSR(K)}{T}\right) + K\frac{\ln(T)}{T}. \]
The AIC is defined in the same way, but with \(2\) replacing \(\ln(T)\) in the second term:
\[ \text{AIC}(K) = \ln\!\left(\frac{SSR(K)}{T}\right) + K\frac{2}{T}. \]
For each candidate model, the BIC (or the AIC) can be evaluated, and the model with the lowest value of the BIC (or the AIC) is the preferred model, based on the information criterion.
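The two criteria are easy to compute once the sum of squared residuals of each candidate model is known. A small sketch, assuming made-up values for \(T\) and \(SSR(K)\) (they are not from any dataset used in class):

```python
import math

def bic(ssr, T, K):
    """BIC(K) = ln(SSR(K)/T) + K * ln(T)/T."""
    return math.log(ssr / T) + K * math.log(T) / T

def aic(ssr, T, K):
    """AIC(K) = ln(SSR(K)/T) + K * 2/T."""
    return math.log(ssr / T) + K * 2 / T

T = 200
# Hypothetical candidates: number of coefficients K -> sum of squared residuals
candidates = {2: 120.0, 3: 100.0, 4: 98.5}

best_bic = min(candidates, key=lambda K: bic(candidates[K], T, K))
best_aic = min(candidates, key=lambda K: aic(candidates[K], T, K))
print(best_bic, best_aic)
```

In this example the BIC selects \(K = 3\) while the AIC selects \(K = 4\): since \(\ln(T) > 2\) for this sample size, the BIC penalizes the additional coefficient more heavily than the AIC does.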
If a regressor has a stochastic trend (has a unit root), then OLS inference can be misleading.
As a result, conventional confidence intervals and hypothesis tests are not valid in the usual way.
\[ y_t = \alpha_0 + \alpha_1 y_{t-1} + \delta_1 x_{t-1} + \gamma_0 D_t(\tau) + \gamma_1 [D_t(\tau) \times y_{t-1}] + \gamma_2 [D_t(\tau) \times x_{t-1}] + \varepsilon_t \]
One way in which everything so far differed from what we used to do in Econometrics was that we were only analyzing one variable at a time. Now, we are going to talk about how we can analyze two or more time series.
As you can imagine, there are endless applications for time series methods for multiple variables. For one thing, it is rare to find a variable that is not influenced by past or present realizations of another. We may be interested in describing or forecasting them properly, and thus we need methods to do so.
We are going to go about this the following way:
Let us start by considering two variables and writing down AR(1) processes for both of them:
\[ y_t = a_0 + a_1 y_{t-1} + \varepsilon_{1t} \]
\[ x_t = b_0 + b_1 x_{t-1} + \varepsilon_{2t} \]
This way, we model both of them on past realizations of themselves. It is very easy to extend this to also include past realizations of the opposite variable:
\[ y_t = a_{10} + a_{11} y_{t-1} + a_{12} x_{t-1} + \varepsilon_{1t} \]
\[ x_t = a_{20} + a_{21} y_{t-1} + a_{22} x_{t-1} + \varepsilon_{2t} \]
Maybe you already suspect where we are going to end up.
Of course, we know a way to consolidate this into one line by stacking the equations:
\[ \begin{pmatrix} y_t \\ x_t \end{pmatrix} = \begin{pmatrix} a_{10} \\ a_{20} \end{pmatrix} + \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \begin{pmatrix} y_{t-1} \\ x_{t-1} \end{pmatrix} + \begin{pmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \end{pmatrix} \]
This is what we call a Vector Autoregression (VAR). More specifically, this is a VAR(1) system of equations.
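To see the mechanics of the stacked system, here is a minimal sketch that iterates a bivariate VAR(1), including the intercepts \(a_{10}\) and \(a_{20}\) from the two equations. The coefficient values are hypothetical and the shocks are switched off so that the arithmetic is easy to follow:

```python
def var1_step(a0, A, y_prev, eps):
    """One step of y_t = a0 + A y_{t-1} + eps_t, with vectors stored as lists."""
    M = len(a0)
    return [a0[i] + sum(A[i][j] * y_prev[j] for j in range(M)) + eps[i]
            for i in range(M)]

def simulate_var1(a0, A, y0, shocks):
    """Iterate the VAR(1) forward from y0, using one shock vector per period."""
    path = [y0]
    for eps in shocks:
        path.append(var1_step(a0, A, path[-1], eps))
    return path

a0 = [1.0, 0.0]                 # hypothetical intercepts (a_10, a_20)
A = [[0.5, 0.1],                # hypothetical coefficients (a_11, a_12)
     [0.2, 0.3]]                #                           (a_21, a_22)
path = simulate_var1(a0, A, y0=[2.0, 4.0], shocks=[[0.0, 0.0]] * 3)
```

Each row of `A` is one equation of the system: the first row produces \(y_t\), the second row produces \(x_t\), and both use the same lagged vector.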
Let us now consider three variables we know well: real GDP growth \(\Delta y_t\), inflation \(\pi_t\), and the interest rate \(r_t\). In the spirit of what we did before, we can construct the following VAR(1):
\[ \begin{pmatrix} \Delta y_t \\ \pi_t \\ r_t \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix} \begin{pmatrix} \Delta y_{t-1} \\ \pi_{t-1} \\ r_{t-1} \end{pmatrix} + \begin{pmatrix} \varepsilon_{\Delta y_t} \\ \varepsilon_{\pi_t} \\ \varepsilon_{r_{t}} \end{pmatrix} \]
Written more compactly, this becomes
\[ \boldsymbol{y}_t = \boldsymbol{A}\boldsymbol{y}_{t-1}+\boldsymbol{\varepsilon}_{t}. \]
Practice
Write this down as three separate equations for \(\Delta y_t\), \(\pi_t\), and \(r_t\).
We have discussed a bunch of examples, but what we lack so far is a general representation of a VAR(\(p\)) model. So let us write one down:
\[ \begin{aligned} \boldsymbol{y}_t &= \boldsymbol{a}_0 + \boldsymbol{A}_1\boldsymbol{y}_{t-1} + \dots + \boldsymbol{A}_p\boldsymbol{y}_{t-p} + \boldsymbol{\varepsilon}_t, \\ \boldsymbol{\varepsilon}_t &\sim \mathcal{N}_M(\boldsymbol{0},\boldsymbol{\Sigma}), \end{aligned} \]
where \(\boldsymbol{y}_t\) is an \(M \times 1\) vector of variables, \(\boldsymbol{a}_0\) is an \(M \times 1\) vector of intercepts, \(\boldsymbol{A}_1, \dots, \boldsymbol{A}_p\) are \(M \times M\) coefficient matrices, and \(\boldsymbol{\varepsilon}_t\) is an \(M \times 1\) vector of errors with covariance matrix \(\boldsymbol{\Sigma}\).
This can be estimated using multiple techniques.
As long as errors are assumed normal, OLS is consistent, and the estimates are asymptotically multivariate normal. This means that the usual large-sample inference tools apply.
Of course, we can also estimate the VAR using ML or Bayesian estimation techniques, if we want.
We have to be careful when interpreting VAR results: Coefficients are only interpretable as predictive relationships between certain lags.
So far, we have treated VARs – and other time series methods – as tools to forecast variables. But what if we are interested in causal inference?
There is a certain “causality” concept that exists in the realm of time series econometrics: Granger Causality, named after Granger (1969). In short, we speak of Granger Causality when a realization of one variable, let us call it \(x_t\), can be used to predict a future realization of another variable, \(y_{t+1}\). In this case, we say that \(x\) Granger causes \(y\).
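One common way to make this testable uses the ADL(\(p,q\)) model from above. If none of the lags of \(x\) help to predict \(y\), all coefficients on lagged \(x\) are zero, so the hypothesis that \(x\) does not Granger cause \(y\) can be written as

\[ H_0:\ \delta_1 = \delta_2 = \cdots = \delta_q = 0 \qquad \text{against} \qquad H_1:\ \delta_j \neq 0 \text{ for at least one } j. \]

This joint hypothesis is typically tested with an \(F\)-statistic. Rejecting \(H_0\) is evidence of predictive content only; it says nothing about the direction of real causality.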
But what about real causality? It turns out that VARs were originally introduced into Economics as a tool for analyzing causal relationships between multiple time series (Sims, 1980). But using them for this purpose requires going one step further to Structural VARs. Most of how this works is out of scope for this class, but the following should serve as a brief introduction.
We can ask R whether one variable Granger causes the others.
What we plainly called a VAR before is actually a Reduced Form VAR:
\[ \boldsymbol{y}_t = \boldsymbol{a}_0 + \boldsymbol{A}_1\boldsymbol{y}_{t-1} + \dots + \boldsymbol{A}_p\boldsymbol{y}_{t-p} + \boldsymbol{\varepsilon}_t \]
The problem with this is that the errors are generally correlated across equations, and this renders us unable to draw causal conclusions. For Structural VARs, we assume that there exists an invertible matrix \(\boldsymbol{B}_0\) such that
\[ \boldsymbol{\varepsilon}_t=\boldsymbol{B}_0^{-1}\boldsymbol{e}_t, \]
giving us the uncorrelated, structural shocks \(\boldsymbol{e}_t\). Using this decomposition, we can transform the reduced form VAR into a structural VAR. \(\boldsymbol{B}_0\) governs the contemporaneous relations between the variables, and thus we need to know about it if we want to investigate causal relationships.
The real challenge is finding \(\boldsymbol{B}_0\). This requires imposing certain restrictions based on economic theory and the researcher’s assumptions.
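As an illustration of what such a restriction can look like, the sketch below uses a recursive (Cholesky) scheme. This is an assumption made here for illustration and not necessarily the approach used in class: it sets \(\boldsymbol{B}_0^{-1}\) equal to the lower-triangular Cholesky factor of \(\boldsymbol{\Sigma}\) and normalizes the structural shocks to unit variance. The covariance matrix is hypothetical.

```python
import math

def cholesky_2x2(sigma):
    """Lower-triangular L with L L' = sigma, for a 2x2 positive definite matrix."""
    (s11, s12), (_, s22) = sigma
    l11 = math.sqrt(s11)
    l21 = s12 / l11
    l22 = math.sqrt(s22 - l21 ** 2)
    return [[l11, 0.0], [l21, l22]]

# Hypothetical covariance matrix of the reduced-form errors
sigma = [[4.0, 2.0],
         [2.0, 3.0]]

# Under the recursive assumption, this plays the role of B_0^{-1}
L = cholesky_2x2(sigma)

# Check: L L' reproduces sigma, so eps_t = L e_t has covariance sigma
# whenever the structural shocks e_t are uncorrelated with unit variance
reconstructed = [[sum(L[i][k] * L[j][k] for k in range(2)) for j in range(2)]
                 for i in range(2)]
```

The zero in the upper right corner of `L` is the restriction: the second shock has no contemporaneous effect on the first variable. Reordering the variables changes the result, which is why the ordering needs an economic justification.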
Look at the two time series in the chart of a long-term interest rate and a short-term interest rate. What do you notice?
The two time series seem to move together; that is, they share a common trend, but not perfectly so.
We call this phenomenon cointegration.
Let us now more formally define what cointegration means.
Assume that we have two time series, \(y_t\) and \(x_t\), and they are both integrated of order 1 (i.e., two \(I(1)\) processes). Then, the two are called cointegrated if a \(\beta\) exists such that
\[ u_t = y_t - \beta x_t \]
is a stationary \(I(0)\) process.
In this case, we call \(\beta\) the cointegrating coefficient.
More generally, two series are cointegrated if they are \(I(d)\) and some linear combination of them is integrated of order less than \(d\).
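A simulated example can make the definition concrete. This is a minimal sketch with made-up data: `x` is a random walk, `y` equals \(\beta x\) plus stationary noise, and the value \(\beta = 1.2\) is a hypothetical choice.

```python
import random

rng = random.Random(1)
beta = 1.2  # hypothetical cointegrating coefficient

# x_t is a random walk, i.e. an I(1) process
x = [0.0]
for _ in range(499):
    x.append(x[-1] + rng.gauss(0, 1))

# y_t shares the stochastic trend of x_t, plus stationary I(0) noise
noise = [rng.gauss(0, 1) for _ in x]
y = [beta * xt + e for xt, e in zip(x, noise)]

# The linear combination u_t = y_t - beta * x_t removes the common trend
u = [yt - beta * xt for yt, xt in zip(y, x)]

def variance(v):
    """Sample variance (divides by the number of observations)."""
    m = sum(v) / len(v)
    return sum((a - m) ** 2 for a in v) / len(v)

print(round(variance(y), 2), round(variance(u), 2))
```

Both `x` and `y` wander without returning to a fixed level, while `u` is just the stationary noise and fluctuates around zero.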
Let us plot on the next slide what the process \(u_t\) would look like for our previous example.
Choosing 1.2 as cointegrating coefficient gives us the pink process in the graph on the left, which looks reasonably stationary.
Of course, we would normally estimate \(\beta\). OLS is consistent in this case, but the estimator is not normally distributed. There are extensions that enable inference based on \(t\)-statistics.
If we have two cointegrated series, we can model their short-run dynamics (more specifically, their first differences) using an error correction model. Consider the following vector error correction model (VECM):
\[ \begin{pmatrix} \Delta y_t\\ \Delta x_t \end{pmatrix} = \begin{pmatrix} \alpha_1\\ \alpha_2 \end{pmatrix} + \begin{pmatrix} \delta_1\\ \delta_2 \end{pmatrix} \hat u_{t-1} + \sum_{i=1}^{p-1} \begin{pmatrix} \gamma_{11,i} & \gamma_{12,i}\\ \gamma_{21,i} & \gamma_{22,i} \end{pmatrix} \begin{pmatrix} \Delta y_{t-i}\\ \Delta x_{t-i} \end{pmatrix} + \begin{pmatrix} \varepsilon_{1t}\\ \varepsilon_{2t} \end{pmatrix} , \]
where \(\hat{u}_{t-1} = y_{t-1} - \beta x_{t-1}\) is the error correction term.
The parameter \(\boldsymbol{\delta}\) governs the adjustment to equilibrium:
In addition to considering economic theory and inspecting time series graphs, we can test for cointegration. We can divide the available testing procedures into two scenarios: Testing for cointegration when \(\beta\) is known, and testing for cointegration when \(\beta\) is not known.
When \(\beta\) is known (e.g. because theory suggests a value), we can use a Dickey-Fuller Test to test for cointegration.
In practice, \(\beta\) is often unknown. In these cases, one option is to follow the Engle-Granger Procedure.
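The two steps of the procedure can be sketched as follows: first estimate \(\beta\) by OLS, then apply a Dickey-Fuller regression to the residuals. This is a simplified sketch on simulated data; the Dickey-Fuller regression here has no constant and no lagged differences, and all names and values are illustrative.

```python
import math
import random

def ols(x, y):
    """OLS of y on a constant and x; returns (intercept, slope)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((a - xbar) ** 2 for a in x)
    slope = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / sxx
    return ybar - slope * xbar, slope

def df_t_statistic(u):
    """t-statistic on rho in the regression  du_t = rho * u_{t-1} + error."""
    du = [b - a for a, b in zip(u, u[1:])]
    lag = u[:-1]
    sxx = sum(v * v for v in lag)
    rho = sum(a * b for a, b in zip(lag, du)) / sxx
    resid = [d - rho * l for d, l in zip(du, lag)]
    s2 = sum(r * r for r in resid) / (len(du) - 1)  # residual variance
    return rho / math.sqrt(s2 / sxx)

# Hypothetical cointegrated pair with true beta = 1.2
rng = random.Random(42)
x = [0.0]
for _ in range(499):
    x.append(x[-1] + rng.gauss(0, 1))
y = [1.2 * xt + rng.gauss(0, 1) for xt in x]

intercept, beta_hat = ols(x, y)                                    # step 1
residuals = [yt - intercept - beta_hat * xt for xt, yt in zip(x, y)]
eg_stat = df_t_statistic(residuals)                                # step 2
```

A strongly negative statistic speaks against a unit root in the residuals and therefore in favour of cointegration. Because \(\beta\) was estimated in the first step, the statistic has to be compared with the critical values tabulated for the Engle-Granger test, not with the usual Dickey-Fuller or normal ones.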
Look at this chart of daily log differences of the S&P 500 index from 2016 to now.
We can clearly see that volatility changes over time.
This is not unusual; in fact, higher moments of time series are rarely constant over time.
When we observe some periods of lower volatility and some periods of higher volatility, we say that there is volatility clustering.
This is interesting (not only) in financial contexts, because volatility is a measure of how risky an asset is, and the value of some derivatives depends directly on that volatility. Also, it helps us determine confidence intervals of forecasts.
The simplest volatility measure we have just makes use of the sample variance. This is useful when we have very frequent data. The \(h\)-period realized volatility of a (demeaned) time series \(x_t\) is given by
\[ \textstyle RV_t^h=\sqrt{\frac{1}{h}\sum^t_{s=t-h+1}x_s^2}. \]
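The formula translates directly into code. A minimal sketch, assuming the input series has already been demeaned and using a made-up series of returns:

```python
import math

def rolling_rv(x, h):
    """h-period realized volatility RV_t^h for every t with a full window.

    x is assumed to be demeaned; the window at time t covers s = t-h+1, ..., t.
    """
    return [math.sqrt(sum(v * v for v in x[t - h + 1:t + 1]) / h)
            for t in range(h - 1, len(x))]

# Hypothetical demeaned returns: a calm period followed by a turbulent one
returns = [0.01, -0.01, 0.01, -0.01, 0.04, -0.04, 0.04, -0.04]
rv = rolling_rv(returns, h=4)
```

The first entry of `rv` only covers the calm observations and the last one only the turbulent ones, so realized volatility rises across the sample.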
With lower-frequency data, we have to resort to different methods to estimate how volatility changes over time. First, let us consider the Autoregressive Conditional Heteroskedasticity (ARCH) model.
Consider some model for \(x_t\),
\[ x_t = \mu + \dots + \varepsilon_t, \]
where \(\varepsilon_t\) is normally distributed with mean zero and variance \(\sigma^2_t\). The variance \(\sigma^2_t\) is then modeled on past squared values of \(\varepsilon_t\). This gives us an ARCH model of order \(p\):
\[ \sigma^2_t = \omega + \psi_1\varepsilon_{t-1}^2 + \psi_2\varepsilon_{t-2}^2+\dots+\psi_p\varepsilon_{t-p}^2. \]
If \(\psi_1, \dots, \psi_p\) are large, large recent squared errors predict a high variance.
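A small numerical example, with hypothetical parameter values \(\omega = 0.1\) and \(\psi_1 = 0.5\) in an ARCH(1) model: after a large shock of \(\varepsilon_{t-1} = 2\), the conditional variance is

\[ \sigma^2_t = 0.1 + 0.5 \cdot 2^2 = 2.1, \]

whereas after a small shock of \(\varepsilon_{t-1} = 0.5\) it is only

\[ \sigma^2_t = 0.1 + 0.5 \cdot 0.5^2 = 0.225. \]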
The Generalized Autoregressive Conditional Heteroskedasticity (GARCH) model is an extension of the ARCH model. The difference to the ARCH model is that \(\sigma^2_t\) can now depend on its own lags, in addition to lags of the squared error.
Consider some model for \(x_t\),
\[ x_t = \mu + \dots + \varepsilon_t, \]
where \(\varepsilon_t\) is normally distributed with mean zero and variance \(\sigma^2_t\). The variance \(\sigma^2_t\) is then modeled on past squared values of \(\varepsilon_t\) and on its own past values. This gives us a GARCH(\(p,q\)) model:
\[ \sigma^2_t = \omega + \psi_1\varepsilon_{t-1}^2 +\dots+\psi_p\varepsilon_{t-p}^2 + \phi_1\sigma^2_{t-1} + \dots + \phi_q\sigma^2_{t-q}. \]
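If \(\psi_1, \dots, \psi_p\) are large, large recent squared errors predict a high variance. If \(\phi_1, \dots, \phi_q\) are large, a high variance in recent periods predicts a high variance today, so volatility is persistent.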
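The variance recursion can be traced by hand for a GARCH(1,1). The sketch below assumes hypothetical parameter values and starts the recursion at the unconditional variance \(\omega/(1-\psi_1-\phi_1)\), which is a common choice of starting value and requires \(\psi_1+\phi_1<1\). Setting \(\phi_1 = 0\) gives the ARCH(1) model as a special case.

```python
def garch11_variance(eps, omega, psi, phi):
    """Conditional variances sigma2_t of a GARCH(1,1) for a given series of shocks.

    sigma2_t = omega + psi * eps_{t-1}^2 + phi * sigma2_{t-1}
    """
    sigma2 = [omega / (1 - psi - phi)]  # starting value: unconditional variance
    for e in eps[:-1]:
        sigma2.append(omega + psi * e ** 2 + phi * sigma2[-1])
    return sigma2

# Hypothetical shocks: one large error in the second period, zero otherwise
shocks = [0.0, 2.0, 0.0, 0.0]
sigma2 = garch11_variance(shocks, omega=0.1, psi=0.2, phi=0.7)
print([round(s, 3) for s in sigma2])
```

The variance jumps in the period after the large shock and then decays only gradually, because the lagged variance term carries part of the jump forward.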
Part of our motivation to treat time series differently was that we could no longer credibly assume errors to be uncorrelated over time.
If the value of one time series is high in a given period, it has a higher probability to also be high in a subsequent period.
But doesn’t the same that applies to time also apply to space? If GDP is high in one place, it is more likely to also be high in places close to it.
This is the idea behind the concept of Spatial Autocorrelation.
The photograph on the left depicts Waldo Tobler (1930–2018), a famous Swiss-American geographer. He is known for a lot of things, among them Tobler’s First Law of Geography:
Everything is related to everything else, but near things are more related than distant things.
This quote, published in Tobler (1970), fundamentally describes spatial autocorrelation: The value of a given variable in a given place depends, among other things, on realizations of the same variable in places close by.
We are going to use this as a starting point to venture very quickly into the field of Spatial Econometrics.
One tricky thing about space is that it has multiple dimensions.
The simplest distinction we can make is that between positive and negative spatial autocorrelation.
But how do we quantify “near”? Consider the following Spatial Weights Matrix:
\[ \boldsymbol{W}= \left( \begin{array}{c|ccccccccc} & \text{AT} & \text{CH} & \text{CZ} & \text{DE} & \text{HU} & \text{IT} & \text{LI} & \text{SI} & \text{SK} \\ \hline \text{AT} & 0 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ \text{CH} & 1 & 0 & 0 & 1 & 0 & 1 & 1 & 0 & 0 \\ \text{CZ} & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 1 \\ \text{DE} & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ \text{HU} & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 \\ \text{IT} & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\ \text{LI} & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ \text{SI} & 1 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 \\ \text{SK} & 1 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 \\ \end{array} \right) \]
This simple matrix is binary. It uses contiguity as a measure of distance: If two countries share a border, they are “near” (1); if not, they are “distant” (0).
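A matrix like this can be built from a list of shared borders. The sketch below reproduces the matrix above from the border pairs it encodes. The row standardization at the end is a common extra step in applied work and is added here as an assumption; it is not part of the matrix shown.

```python
countries = ["AT", "CH", "CZ", "DE", "HU", "IT", "LI", "SI", "SK"]

# Border pairs among these nine countries, as encoded in the matrix above
borders = [("AT", "CH"), ("AT", "CZ"), ("AT", "DE"), ("AT", "HU"),
           ("AT", "IT"), ("AT", "LI"), ("AT", "SI"), ("AT", "SK"),
           ("CH", "DE"), ("CH", "IT"), ("CH", "LI"), ("CZ", "DE"),
           ("CZ", "SK"), ("HU", "SI"), ("HU", "SK"), ("IT", "SI")]

def contiguity_matrix(units, pairs):
    """Binary, symmetric spatial weights matrix with zeros on the diagonal."""
    index = {u: i for i, u in enumerate(units)}
    W = [[0] * len(units) for _ in units]
    for a, b in pairs:
        W[index[a]][index[b]] = 1
        W[index[b]][index[a]] = 1
    return W

def row_standardize(W):
    """Divide every row by its sum, so that each row of weights adds up to one."""
    return [[w / sum(row) for w in row] for row in W]

W = contiguity_matrix(countries, borders)
W_std = row_standardize(W)
```

After row standardization, multiplying `W_std` with a vector of observations gives, for each country, the average value among its neighbours.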
There are different types of contiguity. These are relevant when we use pixel-based data, which can be the case when we use remotely sensed data or other data that is aggregated for square areal units.
Queen Contiguity
Rook Contiguity
Bishop Contiguity
We do not need to restrict ourselves to contiguity. Look at this Spatial Weights Matrix:
\[ \boldsymbol{W}= \left( \begin{array}{c|ccccccccc} & \text{AT} & \text{CH} & \text{CZ} & \text{DE} & \text{HU} & \text{IT} & \text{LI} & \text{SI} & \text{SK} \\ \hline \text{AT} & 0 & 683 & 253 & 524 & 214 & 765 & 480 & 278 & 55 \\ \text{CH} & 683 & 0 & 530 & 753 & 846 & 691 & 118 & 547 & 738 \\ \text{CZ} & 253 & 530 & 0 & 280 & 444 & 923 & 434 & 449 & 290 \\ \text{DE} & 524 & 753 & 280 & 0 & 689 & 1181 & 658 & 769 & 518 \\ \text{HU} & 214 & 846 & 444 & 689 & 0 & 810 & 677 & 381 & 162 \\ \text{IT} & 765 & 691 & 923 & 1181 & 810 & 0 & 612 & 489 & 820 \\ \text{LI} & 480 & 118 & 434 & 658 & 677 & 612 & 0 & 400 & 535 \\ \text{SI} & 278 & 547 & 449 & 769 & 381 & 489 & 400 & 0 & 333 \\ \text{SK} & 55 & 738 & 290 & 518 & 162 & 820 & 535 & 333 & 0 \\ \end{array} \right) \]
Here, we are using distances between capital cities as our distance measure.
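Since near things should receive more weight than distant things, distances are usually transformed before they are used as weights. The sketch below uses inverse distances, which is one common choice and an assumption made here, not something prescribed by the matrix above. It takes the three entries for AT, HU, and SK from that matrix.

```python
units = ["AT", "HU", "SK"]

# Distances between capital cities, copied from the matrix above
distance = [[0, 214, 55],
            [214, 0, 162],
            [55, 162, 0]]

def inverse_distance_weights(d):
    """Weights w_ij = 1 / d_ij for i != j, and zero on the diagonal."""
    n = len(d)
    return [[0.0 if i == j else 1.0 / d[i][j] for j in range(n)]
            for i in range(n)]

W_inv = inverse_distance_weights(distance)
```

With these weights, SK is the "nearest" neighbour of AT because the two capital cities are the closest pair in the matrix, so SK receives the largest weight in the AT row.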
All of this is interesting, but how does it relate to our earlier concept of spatial autocorrelation? Remember Tobler’s First Law: near things are more related than distant things.
The reason we asked more explicitly how to quantify “near” is that we can use spatial weights matrices to calculate measures of spatial autocorrelation. These can broadly be grouped into two categories: global measures and local measures.
One measure of both global and local autocorrelation is Moran’s \(I\).
Global Moran’s \(I\) is a measure for how similar near observations are on average.
\[ I = \frac{N\sum^N_{i=1}\sum^N_{j=1}w_{ij}(y_i-\bar{y})(y_j-\bar{y})}{\left(\sum^N_{i=1}\sum^N_{j=1}w_{ij}\right)\sum^N_{i=1}(y_i-\bar{y})^2} \]
This is simply a measure of \(y\)’s autocovariance, weighted by the spatial weights \(w_{ij}\).
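The formula can be implemented directly. A minimal sketch, using a made-up example of four regions arranged on a line, where each region neighbours only the regions next to it:

```python
def morans_i(y, W):
    """Global Moran's I for observations y and spatial weights matrix W."""
    n = len(y)
    ybar = sum(y) / n
    z = [v - ybar for v in y]  # deviations from the mean
    cross = sum(W[i][j] * z[i] * z[j] for i in range(n) for j in range(n))
    total_weight = sum(sum(row) for row in W)
    return n * cross / (total_weight * sum(v * v for v in z))

# Hypothetical contiguity matrix: regions 1-2, 2-3, and 3-4 share a border
W_line = [[0, 1, 0, 0],
          [1, 0, 1, 0],
          [0, 1, 0, 1],
          [0, 0, 1, 0]]

smooth = morans_i([1, 2, 3, 4], W_line)          # similar values next to each other
checkerboard = morans_i([1, -1, 1, -1], W_line)  # opposite values next to each other
```

The smooth pattern yields a positive value (positive spatial autocorrelation), while the alternating pattern yields a negative one (negative spatial autocorrelation).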
Local Moran’s \(I\) is a version of the measure that is specific to a given observation. It is useful for finding spatial outliers and local clusters.
For Moran’s \(I\), as well as for most questions of spatial econometric analysis, the choice of the spatial weights matrix is therefore very important. We need to think about the spatial pattern we assume to be present and justify our assumption.
In Econometrics, we always make (implicit) assumptions about the relation between observations. Oftentimes, we assume that observations are independent and come from the same data-generating process.
Think
When data has a spatial dimension, how often can we credibly assume that there is zero spatial (auto)correlation?
How much of the data we encounter in economic analysis has a spatial dimension?
If we ignore space when it plays an important role, we run the risk of getting invalid results.
A lot of times, the spatial dimension of a sample relates to spillover effects.
Analysis of spatial autocorrelation is very similar to discussing temporal autocorrelation, as long as we only consider one period.
However, spatial autocorrelation as a pattern can persist through time. If we observe a spatial correlation pattern over multiple rounds, this can imply that an entity that was affected in one round can affect the entity it was originally affected by. (Note that this is a bit simplified.)
In situations with patterns like these, the question of who is affecting whom becomes difficult to answer.
We are now aware of a phenomenon, spatial autocorrelation, we
If time permits, we are going to talk about related issues at the end of the course.