Module 1: Time Series and Autocorrelation

Applied Econometrics · Econometrics III

Max Heinze (mheinze@wu.ac.at)

Department of Economics, WU Vienna

Sannah Tijani (stijani@wu.ac.at)

Department of Economics, WU Vienna

Introduction

Notation and Basic Concepts

Autoregressions

Determining the Lag Length

Time Series Data

Time series data: Data collected for a single entity at multiple points in time.
This data type can be used to answer quantitative questions for which cross-sectional data are inadequate.
Dynamic causal effect of variable $x$ on dependent variable $y$ over time.
Example: What is the effect of a law requiring passengers to wear seatbelts on traffic fatalities?
Forecasting the value of a variable at a later date.
Example: What will next month’s inflation rate be?
Time series data extend the range of questions we can answer. However, they pose special challenges, and overcoming them requires new techniques.

Example: Baby Names

Example: Unemployment in Austria

Example: GDP in Austria

Notation and Basic Concepts

Time Series

Time series data are observations indexed according to time.
This implies that the index has an ordering, so observations are not exchangeable.
The observation on the time series variable $y$ made at date $t$ is denoted $y_t$.
The total number of observations is denoted $T$.
The interval of time between observations $t$ and $t+1$ is some fixed unit, such as a week, a month, a quarter, or a year.

Lags

Special terminology and notation are used to indicate future and past values of $y$.
The value of $y$ in the previous period is called its first lagged value or first lag, and is denoted $y_{t-1}$.
Its $j$-th lagged value is its value $j$ periods ago, denoted $y_{t-j}$.
Similarly, $y_{t+1}$ denotes the value of $y$ one period into the future.

First Differences

The change in the value of $y$ between period $t-1$ and period $t$ is $y_{t}-y_{t-1}$.
We call it the first difference of the variable $y_t$.
We will use the following notation: $\Delta y_t = y_t - y_{t-1}$.
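As a small illustration of the lag and first-difference notation, the following sketch uses plain Python lists. The helper names `lag` and `first_difference` and the toy numbers are illustrative assumptions, not part of the course material.

```python
def lag(series, j=1):
    """Return the j-th lag: entry t holds y_{t-j}, or None where it is undefined."""
    if j < 0:
        raise ValueError("j must be non-negative")
    padded = [None] * j + list(series)
    return padded[:len(series)]


def first_difference(series):
    """Return y_t - y_{t-1}; the first entry is undefined (None)."""
    return [None] + [series[t] - series[t - 1] for t in range(1, len(series))]


y = [100, 102, 101, 105]        # hypothetical observations y_1, ..., y_T
print(lag(y, 1))                # [None, 100, 102, 101]
print(first_difference(y))      # [None, 2, -1, 4]
```

Note that taking a lag or a difference costs observations at the start of the sample, which is why the first entries are left undefined here.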

Detrending: Fixed and Random Variation

Time series methods focus on modelling random components. However, there may also be fixed variation, such as seasonality of GDP or weekday effects.

We are looking to remove fixed components (such as a trend or seasonal patterns) and focus on random components of a time series. There are many ways to achieve this:

growth rates (e.g. year-to-year, quarter-to-quarter)
linear trends (e.g. linear growth, demeaning seasons)
filters (e.g. bandpass filter, Hodrick-Prescott filter)

For example, many economic time series are analyzed after computing their (change in) logarithms, as they exhibit exponential growth.
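To make the log-difference idea concrete, here is a minimal sketch in plain Python. The function names and the artificial series (growing by exactly 2% per period) are assumptions chosen for illustration only.

```python
import math


def log_difference(series):
    """Return ln(y_t) - ln(y_{t-1}) for t = 2, ..., T (requires positive values)."""
    logs = [math.log(v) for v in series]
    return [logs[t] - logs[t - 1] for t in range(1, len(logs))]


def growth_rate(series):
    """Return the period-to-period growth rate (y_t - y_{t-1}) / y_{t-1}."""
    return [(series[t] - series[t - 1]) / series[t - 1] for t in range(1, len(series))]


# Hypothetical level series growing by exactly 2% per period.
y = [100 * 1.02 ** t for t in range(5)]
print(log_difference(y))   # every entry is ln(1.02), roughly 0.0198
print(growth_rate(y))      # every entry is roughly 0.02
```

For small growth rates, the change in logarithms is close to the growth rate itself, which is why an exponentially growing series becomes roughly constant after this transformation.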

Detrending: Examples

Autocorrelation in time series regressions

Consider the linear model and related assumptions:

\[ y_t = {x}_t'{\beta} + u_t, \qquad u_t \sim \mathcal{N}(0,\sigma^2). \]

1. $\mathrm{E}(u_t)=0$,
2. $\mathrm{Var}(u_t)=\sigma^2$, and
3. $\mathrm{Cov}(u_t,u_s)=0 \ \text{for all } t \neq s$.

When $y$ and $x$ are time series, Assumption 3 is often violated. We need a sensible alternative.

Autocorrelation

In time series, the value of $y$ in one period is typically correlated with its value in the next period.
The correlation of a series with its own lagged values is called autocorrelation.
The first autocorrelation is the correlation between $y_t$ and $y_{t-1}$.
The $j$-th autocorrelation is the correlation between $y_t$ and $y_{t-j}$.
The $j$-th autocovariance is the covariance between $y_t$ and $y_{t-j}$.

\[ \text{$j^{\text{th}}$ autocovariance} \;=\; \operatorname{cov}(y_t, y_{t-j}) \]

\[ \text{$j^{\text{th}}$ autocorrelation} \;=\; \rho_j \;=\; \frac{\operatorname{cov}(y_t, y_{t-j})}{\sqrt{\operatorname{var}(y_t)\operatorname{var}(y_{t-j})}} \]

Autocorrelation function

The autocorrelation function (ACF) is a useful summary for a stationary time series. It is defined as:

\[ \mathrm{ACF}(j) \;=\; \rho_j \;=\; \operatorname{corr}(y_t, y_{t-j}) \;=\; \frac{\operatorname{cov}(y_t, y_{t-j})}{\operatorname{var}(y_t)}. \]

The ACF is a common diagnostic tool that reveals interesting patterns of autocorrelation, but it may also indicate a lack of stationarity.
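A sample version of the ACF can be sketched in a few lines of plain Python. The function name `sample_acf` is hypothetical, and dividing by $T$ at every lag is one common convention that is assumed here rather than taken from the slides.

```python
def sample_acf(series, max_lag):
    """Sample autocorrelations rho_0, ..., rho_max_lag of a series."""
    T = len(series)
    mean = sum(series) / T
    dev = [v - mean for v in series]
    var = sum(d * d for d in dev) / T            # sample autocovariance at lag 0
    acf = []
    for j in range(max_lag + 1):
        # Sample autocovariance at lag j (divisor T is an assumed convention).
        cov_j = sum(dev[t] * dev[t - j] for t in range(j, T)) / T
        acf.append(cov_j / var)
    return acf


# Hypothetical series that flips sign every period: strong negative
# first-order autocorrelation, positive second-order autocorrelation.
y = [1, -1, 1, -1, 1, -1]
print(sample_acf(y, 2))
```

For this alternating series the first autocorrelation is $-5/6$ and the second is $4/6$. A slowly decaying ACF, by contrast, is the pattern that hints at a lack of stationarity.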

Autocorrelation of GDP

Durbin-Watson Statistic

We can test for autocorrelation using the Durbin-Watson statistic: \[ d = \frac{\sum_{t=2}^{T}\left(\hat{u}_t - \hat{u}_{t-1}\right)^2}{\sum_{t=1}^{T}\hat{u}_t^{\,2}}. \]
A value of $d = 2$ indicates no autocorrelation.
The statistic is bounded as $d \in [0,4]$.
Values smaller than two may indicate positive autocorrelation.
Values larger than two may indicate negative autocorrelation.
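The statistic is simple to compute from a vector of residuals. The sketch below uses made-up residuals (an assumption for illustration) to show both directions. Expanding the numerator also shows that $d \approx 2(1-\hat{\rho}_1)$, where $\hat{\rho}_1$ is the first sample autocorrelation of the residuals, which is why $d = 2$ corresponds to no autocorrelation.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for a sequence of regression residuals."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    den = sum(u * u for u in residuals)
    return num / den


# Hypothetical residuals that flip sign every period (negative autocorrelation).
print(durbin_watson([1, -1, 1, -1]))   # 3.0, larger than two

# Hypothetical residuals that stay on one side for a while (positive autocorrelation).
print(durbin_watson([1, 1, -1, -1]))   # 1.0, smaller than two
```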

Stationarity

Time series forecasts use data on the past to forecast the future.
Doing so presumes that the future is similar to the past, in the sense that the correlations and, more generally, the distributions of the data in the future will be like they were in the past.
In the context of regression with time series data, the idea that historical relationships can be generalized to the future is formalized by the concept of stationarity.
Definition of stationarity: The probability distribution of the time series variable does not change over time.
Under the assumption of stationarity, regression models estimated using past data can be used to forecast future values.
Stationarity can fail to hold for multiple reasons, in which case the time series is said to be nonstationary.

Autoregressions

The First-Order Autoregressive Model

An autoregression expresses the conditional mean of a time series variable $y_t$ as a linear function of its own lagged values.
A first-order autoregression uses only one lag of $y$ in this conditional expectation.

\[ \mathbb{E}\left(y_t \mid y_{t-1}, y_{t-2}, \ldots \right) = \alpha_0 + \alpha_1 y_{t-1}. \]

The first-order autoregression (AR(1)) model can be written in the familiar form of a regression model as

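\[ y_t = \alpha_0 + \alpha_1 y_{t-1} + \varepsilon_t, \]

where $\varepsilon_t$ is an error term with $\mathbb{E}\left(\varepsilon_t \mid y_{t-1}, y_{t-2}, \ldots \right) = 0$.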

The First-Order Autoregressive Model (2)

The unknown population coefficients $\alpha_0$ and $\alpha_1$ can be estimated by ordinary least squares (OLS).
How to estimate $\alpha_0$ and $\alpha_1$ might initially seem puzzling: Unlike a cross-sectional regression with $x$ on the right-hand side, we have $y$ on both sides.
The solution to this puzzle is to realize that the variable $y_{t-1}$ on the right-hand side differs from the dependent variable $y_t$ because the regressor is the first lag of $y$.
That is, $x$ is the first lag of $y$.

Forecasts and forecast errors

If the population coefficients were known, then the one-step-ahead forecast of $y_{t+1}$, made using data through date $t$, would be: \[ y_{t+1\mid t} = \alpha_0 + \alpha_1 y_t \]
Although $\alpha_0$ and $\alpha_1$ are unknown, we can use their OLS estimates instead. The forecast based on the AR(1) model is: \[ \widehat{y}_{t+1\mid t} = \widehat{\alpha}_0 + \widehat{\alpha}_1 y_t, \]
where $\widehat{\alpha}_0$ and $\widehat{\alpha}_1$ are estimated using historical data through time $t$.
The forecast error is $y_{t+1} - \widehat{y}_{t+1\mid t}$.
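The estimation and forecasting steps for the AR(1) model can be sketched in plain Python as follows. The function names `fit_ar1` and `forecast_ar1` are hypothetical, and the data are generated without noise from $y_t = 1 + 0.5\,y_{t-1}$ purely so that the result is easy to check by hand. Real data would of course include the error term.

```python
def fit_ar1(series):
    """OLS of y_t on a constant and y_{t-1}; returns (alpha0_hat, alpha1_hat)."""
    x = series[:-1]          # regressor: first lag y_{t-1}
    y = series[1:]           # dependent variable: y_t
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    alpha1 = sxy / sxx
    alpha0 = y_bar - alpha1 * x_bar
    return alpha0, alpha1


def forecast_ar1(series, alpha0, alpha1):
    """One-step-ahead forecast based on the last observed value."""
    return alpha0 + alpha1 * series[-1]


# Hypothetical noise-free data from y_t = 1 + 0.5 * y_{t-1}, starting at 0.
y = [0.0]
for _ in range(6):
    y.append(1 + 0.5 * y[-1])

a0_hat, a1_hat = fit_ar1(y)
print(a0_hat, a1_hat)                      # approximately 1.0 and 0.5
print(forecast_ar1(y, a0_hat, a1_hat))     # approximately 1.984375
```

The regressor is literally the dependent variable shifted by one period, which is all that "having $y$ on both sides" amounts to.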

The $p$-th-Order Autoregressive Model (1)

The AR(1) model uses $y_{t-1}$ to forecast $y_t$, but doing so ignores potentially useful information in the more distant past.
One way to incorporate this information is to include additional lags in the AR(1) model; this yields the $p$-th-order autoregressive model.
The model represents $y_t$ as a linear function of $p$ of its lagged values.
In the AR($p$) model, the regressors are $y_{t-1}, y_{t-2}, \ldots, y_{t-p}$, plus an intercept.
The number of lags, $p$, included in an AR($p$) model is called the order (or lag length) of the autoregression.

The $p$-th-Order Autoregressive Model (2)

The $p$-th-order autoregressive (AR($p$)) model represents the conditional expectation of $y_t$ as a linear function of $p$ of its lagged values:

\[ y_t = \alpha_0 + \alpha_1 y_{t-1} + \alpha_2 y_{t-2} + \cdots + \alpha_p y_{t-p} + \varepsilon_t. \]

If $y_t$ follows an AR($p$), then the oracle one-step-ahead forecast of $y_{t+1}$ based on $y_t, y_{t-1}, \ldots$ is

\[ y_{t+1\mid t} = \alpha_0 + \alpha_1 y_t + \alpha_2 y_{t-1} + \cdots + \alpha_p y_{t-p+1}. \]

The Autoregressive Distributed Lag (ADL) Model

An autoregressive distributed lag (ADL) model is:

Autoregressive because lagged values of the dependent variable $y_t$ are included as regressors.
Distributed lag because the regression also includes multiple lags of an additional predictor $x_t$.

An ADL model with $p$ lags of $y_t$ and $q$ lags of $x_t$, denoted ADL($p,q$), is

\[ y_t = \alpha_0 + \alpha_1 y_{t-1} + \alpha_2 y_{t-2} + \cdots + \alpha_p y_{t-p} + \delta_1 x_{t-1} + \delta_2 x_{t-2} + \cdots + \delta_q x_{t-q} + \varepsilon_t. \]

Here, $\alpha_0,\alpha_1,\ldots,\alpha_p,\delta_1,\ldots,\delta_q$ are unknown coefficients and $\varepsilon_t$ is the error term, with

\[ \mathrm{E}\left(\varepsilon_t \mid y_{t-1},y_{t-2},\ldots,x_{t-1},x_{t-2},\ldots \right)=0. \]

General Time Series Regression with Additional Predictors

The general time series regression model allows for $k$ additional predictors, where $q_1$ lags of the first predictor are included, $q_2$ lags of the second predictor are included, and so forth:

\[ \begin{aligned} y_t &= \alpha_0 + \alpha_1 y_{t-1} + \alpha_2 y_{t-2} + \cdots + \alpha_p y_{t-p} \\ &\quad + \delta_{11} x_{1,t-1} + \delta_{12} x_{1,t-2} + \cdots + \delta_{1q_1} x_{1,t-q_1} \\ &\quad + \cdots \\ &\quad + \delta_{k1} x_{k,t-1} + \delta_{k2} x_{k,t-2} + \cdots + \delta_{kq_k} x_{k,t-q_k} + \varepsilon_t . \end{aligned} \]

Assumptions

1. $\mathrm{E}\left(\varepsilon_t \mid y_{t-1},y_{t-2},\ldots, x_{1,t-1},x_{1,t-2},\ldots, x_{k,t-1},x_{k,t-2},\ldots\right)=0$;
2. (a) The random variables $(y_t, x_{1t}, \ldots, x_{kt})$ have a stationary distribution, and
   (b) $(y_t, x_{1t}, \ldots, x_{kt})$ and $(y_{t-j}, x_{1,t-j}, \ldots, x_{k,t-j})$ become independent as $j$ gets large;
3. Large outliers are unlikely: $y_t, x_{1t}, \ldots, x_{kt}$ have nonzero, finite fourth moments; and
4. There is no perfect multicollinearity.

Under Assumptions 1 to 4, inference on the regression coefficients using OLS proceeds in the same way as it usually does for cross-sectional data.

Assumptions 1, 3, and 4

The first assumption is that $\varepsilon_t$ has conditional mean zero given the history of all the regressors: \[ \mathrm{E}\left(\varepsilon_t \mid y_{t-1},y_{t-2},\ldots, x_{1,t-1},x_{1,t-2},\ldots, x_{k,t-1},x_{k,t-2},\ldots\right)=0. \]
This assumption extends the assumption used in the AR and ADL models and implies that the forecast of $y_t$, using all past values of $y$ and the $x$’s, is given by the regression.
The third assumption (large outliers are unlikely) and the fourth assumption (no perfect multicollinearity) are the same as for cross-sectional data.

Assumption 2

For time series regression, the i.i.d. assumption is replaced by a more appropriate assumption with two parts:

(a) Stationarity: The data are drawn from a stationary distribution, so the distribution of the time series today is the same as its distribution in the past. The joint distribution of the variables (including lags) does not change over time.

(b) Weak dependence: The random variables become approximately independent when they are separated by long periods of time. Weak dependence ensures that, in large samples, there is enough randomness for the law of large numbers and the central limit theorem to hold.

Determining the Lag Length

Determining the Order of an Autoregression

Choosing the order $p$ of an autoregression requires balancing the marginal benefit of including more lags against the marginal cost of additional estimation error.
If the order of an estimated autoregression is too low, you will omit potentially valuable information contained in the more distant lagged values.
If it is too high, you will be estimating more coefficients than necessary, which in turn introduces additional estimation error into your forecasts.

The F-statistic approach

One approach to choosing $p$ is to start with a model with many lags and to perform hypothesis tests on the final lag.
For example, you might start by estimating an AR(6) and test whether the coefficient on the sixth lag is significant at the 5% level; if not, drop it and estimate an AR(5), test the coefficient on the fifth lag, and so forth.
The drawback of this method is that it tends to produce models that are too large, at least some of the time.
Even if the true AR order is five, so the sixth coefficient is 0, a 5% test using the t-statistic will incorrectly reject this null hypothesis 5% of the time just by chance. Thus, if the true value of $p$ is five, this method will estimate $p$ to be six 5% of the time.

The Bayes Information Criterion (BIC) (1)

One way to estimate $p$ is by minimizing an information criterion, e.g., the BIC, which is:

\[ \text{BIC}(p) = \ln \left( \frac{SSR(p)}{T} \right) + (p + 1) \frac{\ln(T)}{T}, \]

$SSR(p)$ is the sum of squared residuals of the estimated AR($p$).
The BIC estimator of $p$, $\hat{p}$, is the value that minimizes $\text{BIC}(p)$ among the possible choices $p = 0, 1, \ldots, p_{\max}$.
$p_{\max}$ is the largest value of $p$ considered and $p = 0$ corresponds to the model that contains only an intercept.

The Bayes Information Criterion (BIC) (2)

Because the regression coefficients are estimated by OLS, the SSR necessarily decreases (or at least does not increase) when you add a lag.
In contrast, the second term is the number of estimated regression coefficients (the number of lags, $p$, plus one for the intercept) times the factor $\frac{\ln(T)}{T}$.
This second term increases when you add a lag and thus provides a penalty for including another lag.
The BIC trades off these two forces so that the number of lags that minimizes the BIC is a consistent estimator of the true lag length.
The BIC helps decide precisely how large the increase in the $R^2$ must be to justify including the additional lag.

The Akaike Information Criterion (AIC)

Another information criterion is the Akaike information criterion (AIC): \[ \text{AIC}(p) = \ln \left( \frac{SSR(p)}{T} \right) + (p + 1) \frac{2}{T}. \]
The difference between the AIC and the BIC is that the term $\mathrm{ln}(T)$ in the BIC is replaced by $2$ in the AIC, so the second term in the AIC is smaller whenever $\ln(T) > 2$, that is, for $T \geq 8$.
The second term in the AIC is not large enough to ensure that the correct lag length is chosen, even in large samples, so the AIC estimator of $p$ is not consistent.
If you are concerned that the BIC might yield a model with too few lags, the AIC provides a reasonable alternative.
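Lag selection by information criteria can be sketched in plain Python as below. Everything here is an illustrative assumption rather than the course's own implementation: the helper names, the simulated AR(2) data with coefficients 0.5 and 0.3, the choice $p_{\max} = 6$, and the decision to estimate every AR($p$) on the same sample (dropping the first $p_{\max}$ observations) so that the SSRs are comparable.

```python
import math
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][c] * x[c] for c in range(i + 1, n))) / m[i][i]
    return x


def ar_ssr(series, p, start):
    """SSR of an AR(p) with intercept, fitted by OLS on observations start, ..., end."""
    rows = [[1.0] + [series[t - j] for j in range(1, p + 1)]
            for t in range(start, len(series))]
    y = series[start:]
    k = p + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)                      # normal equations X'X b = X'y
    return sum((v - sum(b * x for b, x in zip(beta, r))) ** 2 for r, v in zip(rows, y))


def information_criteria(series, p_max):
    """Return {p: (BIC(p), AIC(p))}, using the same estimation sample for every p."""
    T = len(series) - p_max
    out = {}
    for p in range(p_max + 1):
        ssr = ar_ssr(series, p, p_max)
        bic = math.log(ssr / T) + (p + 1) * math.log(T) / T
        aic = math.log(ssr / T) + (p + 1) * 2 / T
        out[p] = (bic, aic)
    return out


# Hypothetical data: a simulated AR(2) process.
random.seed(0)
y = [0.0, 0.0]
for _ in range(500):
    y.append(0.5 * y[-1] + 0.3 * y[-2] + random.gauss(0, 1))

ic = information_criteria(y, p_max=6)
p_bic = min(ic, key=lambda p: ic[p][0])
p_aic = min(ic, key=lambda p: ic[p][1])
print("BIC choice:", p_bic, "AIC choice:", p_aic)
```

Because the BIC penalty per coefficient is larger than the AIC penalty once $\ln(T) > 2$, the lag length chosen by the BIC can never exceed the one chosen by the AIC on the same sample.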

Lag Length Selection with Multiple Predictors

The trade-off involved with lag length choice in the general time series regression model with multiple predictors is similar to that in an autoregression:

Using too few lags can decrease forecast accuracy because valuable information is lost.
Adding lags increases estimation error.

The choice of lags must balance the benefit of using additional information against the cost of estimating the additional coefficients.

The F-Statistic Approach

As in the univariate autoregression, one way to determine the number of lags is to use F-statistics to test joint hypotheses that sets of coefficients are equal to 0.
If the number of models being compared is small, then this F-statistic method is easy to use.
In general, however, the F-statistic method can produce models that are large and thus have considerable estimation error.

Information Criteria for Model Selection

As in an autoregression, the BIC and the AIC can be used to estimate the number of lags and variables in a time series regression model with multiple predictors. If the regression model has $K$ coefficients (including the intercept), the BIC is

\[ \text{BIC}(K) = \ln\!\left(\frac{SSR(K)}{T}\right) + K\frac{\ln(T)}{T}. \]

The AIC is defined in the same way, but with $2$ replacing $\ln(T)$ in the second term:

\[ \text{AIC}(K) = \ln\!\left(\frac{SSR(K)}{T}\right) + K\frac{2}{T}. \]

For each candidate model, the BIC (or the AIC) can be evaluated, and the model with the lowest value of the BIC (or the AIC) is the preferred model, based on the information criterion.

Nonstationarity

Until now we have assumed that the dependent variable and the regressors are stationary.
If this is not the case, that is, if the dependent variable and/or the regressors are nonstationary, then conventional hypothesis tests, confidence intervals, and forecasts can be unreliable.
The precise problem created by nonstationarity, and the solution to that problem, depend on the nature of that nonstationarity.
There are two types of nonstationarity frequently encountered in economic time series: trends and breaks.

What Is a Trend?

A trend is a persistent long-term movement of a variable over time. A time series variable fluctuates around its trend.
There are two types of trends in time series data: deterministic and stochastic.
A deterministic trend is a nonrandom function of time.
A stochastic trend is random and varies over time.
It is often more appropriate to model economic time series as having stochastic rather than deterministic trends.

The Random Walk Model of a Trend

The basic idea of a random walk is that the value of the series tomorrow is its value today plus an unpredictable change: \[ y_t = y_{t-1} + \varepsilon_t. \]
Because the path followed by $y_t$ consists of random “steps” $\varepsilon_t$, that path is a random walk.
The conditional mean of $y_t$ based on data through time $t-1$ is $y_{t-1}$. That is, because \[ \mathrm{E}\left(\varepsilon_t \mid y_{t-1}, y_{t-2}, \ldots \right)=0, \] it follows that \[ \mathrm{E}\left(y_t \mid y_{t-1}, y_{t-2}, \ldots \right)=y_{t-1}. \]
Thus, if $y_t$ follows a random walk, then the best forecast of tomorrow’s value is its value today.
If $y_t$ follows a random walk, its variance increases over time. Because it does not have a constant variance, a random walk is nonstationary.
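To see why the variance grows, consider a random walk $y_t = y_{t-1} + \varepsilon_t$ and assume, for illustration only, that it starts from a fixed value $y_0$ and that the $\varepsilon_t$ are uncorrelated with common variance $\sigma^2$. Repeated substitution then gives

\[ y_t = y_0 + \sum_{s=1}^{t} \varepsilon_s, \qquad \operatorname{var}(y_t) = \sum_{s=1}^{t} \operatorname{var}(\varepsilon_s) = t\,\sigma^2, \]

so under these assumptions the variance grows linearly with $t$ and cannot be constant over time.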

Random Walk with Drift

Some series, such as the logarithm of U.S. GDP, have an obvious upward tendency. In that case, the best forecast must include an adjustment for the tendency of the series to increase.
This leads to an extension of the random walk model that includes a tendency to move, or drift, in one direction or the other. This extension is referred to as a random walk with drift: \[ y_t = \alpha_0 + y_{t-1} + \varepsilon_t, \] \[ \mathrm{E}\left(\varepsilon_t \mid y_{t-1}, y_{t-2}, \ldots \right)=0, \]
$\alpha_0$ is the drift in the random walk.
If $\alpha_0>0$, then $y_t$ increases on average.

Unit Root (1)

The random walk model is a special case of the AR(1) model in which $\alpha_1 = 1$.
If $y_t$ follows an AR(1) with $\alpha_1 = 1$, then $y_t$ contains a stochastic trend and is nonstationary.
If $|\alpha_1| < 1$ and $\varepsilon_t$ is stationary, then the joint distribution of $y_t$ and its lags does not depend on $t$, so $y_t$ is stationary.
The condition for an AR($p$) to be stationary is more complicated. Its formal statement involves the roots of the polynomial \[ 1 - \alpha_1 z - \alpha_2 z^2 - \alpha_3 z^3 - \cdots - \alpha_p z^p. \]
For an AR($p$) to be stationary, all roots of this polynomial must be greater than 1 in absolute value.

Unit Root (2)

In the special case of an AR(1), the root solves \[ 1 - \alpha_1 z = 0, \]
so the root is \[ z = \frac{1}{\alpha_1}. \]
Saying that the root must be greater than 1 in absolute value is equivalent to $|\alpha_1| < 1$.
If an AR($p$) has a root equal to 1, the series is said to have a unit autoregressive root (or a unit root).
If $y_t$ has a unit root, then it contains a stochastic trend.
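The root condition can be checked by hand for an AR(2), where the polynomial $1 - \alpha_1 z - \alpha_2 z^2$ is a quadratic. The sketch below is only an illustration for that special case. The function names and coefficient values are assumptions, and the quadratic formula requires $\alpha_2 \neq 0$.

```python
import cmath


def ar2_roots(alpha1, alpha2):
    """Roots of 1 - alpha1*z - alpha2*z^2 = 0 (requires alpha2 != 0)."""
    # Rewrite as alpha2*z^2 + alpha1*z - 1 = 0 and apply the quadratic formula.
    disc = cmath.sqrt(alpha1 ** 2 + 4 * alpha2)
    return ((-alpha1 + disc) / (2 * alpha2), (-alpha1 - disc) / (2 * alpha2))


def is_stationary_ar2(alpha1, alpha2):
    """True if both roots are greater than 1 in absolute value."""
    return all(abs(z) > 1 for z in ar2_roots(alpha1, alpha2))


# Hypothetical coefficients: both roots lie outside the unit circle.
print(ar2_roots(0.5, 0.3), is_stationary_ar2(0.5, 0.3))

# Hypothetical coefficients with alpha1 + alpha2 = 1: one root equals 1 (a unit root).
print(ar2_roots(0.5, 0.5), is_stationary_ar2(0.5, 0.5))
```

In the second case the roots are $1$ and $-2$, so the series has a unit root even though neither coefficient is individually equal to 1.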

Problems Caused by Stochastic Trends

If a regressor has a stochastic trend (has a unit root), then OLS inference can be misleading.

the usual OLS $t$-statistic can have a nonnormal distribution, even in large samples
the estimate of the autoregressive coefficient is biased toward 0
spurious regression: two independent series with stochastic trends can appear to be related

As a result, conventional confidence intervals and hypothesis tests are not valid in the usual way.

Detecting Stochastic Trends: The Dickey–Fuller Test

The hypothesis of a stochastic trend can be tested using a Dickey–Fuller test. For the AR(1) model: \[ y_t = \alpha_0 + \alpha_1 y_{t-1} + \varepsilon_t, \]
the null hypothesis that $y_t$ has a stochastic trend is \[ H_0:\alpha_1=1 \qquad \text{vs.} \qquad H_1:\alpha_1<1. \]
Let $\rho = \alpha_1 - 1$. Then the model can be rewritten as \[ \Delta y_t = \alpha_0 + \rho y_{t-1} + \varepsilon_t, \]
The hypotheses become: $H_0:\rho=0 \qquad \text{vs.} \qquad H_1:\rho<0.$
The OLS $t$-statistic for testing $\rho=0$ is called the Dickey–Fuller statistic.
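The Dickey–Fuller statistic is just the $t$-statistic from a simple regression of $\Delta y_t$ on a constant and $y_{t-1}$, as the following plain-Python sketch shows. The function name, the simulated series, and the use of homoskedasticity-only standard errors are assumptions made for illustration. The statistic must be compared with Dickey–Fuller critical values (about $-2.86$ at the 5% level in large samples for this intercept-only specification, as tabulated in standard textbooks), not with normal critical values.

```python
import math
import random


def dickey_fuller_stat(series):
    """t-statistic on rho in the OLS regression dy_t = alpha0 + rho * y_{t-1} + e_t."""
    x = series[:-1]                                           # y_{t-1}
    dy = [series[t] - series[t - 1] for t in range(1, len(series))]
    n = len(x)
    x_bar = sum(x) / n
    dy_bar = sum(dy) / n
    sxx = sum((a - x_bar) ** 2 for a in x)
    rho = sum((a - x_bar) * (d - dy_bar) for a, d in zip(x, dy)) / sxx
    alpha0 = dy_bar - rho * x_bar
    ssr = sum((d - alpha0 - rho * a) ** 2 for a, d in zip(x, dy))
    se_rho = math.sqrt(ssr / (n - 2) / sxx)                   # homoskedasticity-only SE
    return rho / se_rho


random.seed(1)

# Hypothetical stationary AR(1) with alpha_1 = 0.5, i.e. rho = -0.5.
stationary = [0.0]
for _ in range(500):
    stationary.append(0.5 * stationary[-1] + random.gauss(0, 1))

# Hypothetical random walk, i.e. rho = 0 (the null hypothesis is true).
walk = [0.0]
for _ in range(500):
    walk.append(walk[-1] + random.gauss(0, 1))

print("stationary AR(1):", dickey_fuller_stat(stationary))
print("random walk:     ", dickey_fuller_stat(walk))
```

For the stationary series the statistic is strongly negative, so the unit root null is rejected. For the random walk it will usually, though not always, stay above the critical value.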

Critical values for the ADF Statistic

The augmented Dickey–Fuller (ADF) statistic is the Dickey–Fuller $t$-statistic from a regression that additionally includes $p$ lags of $\Delta y_t$ as regressors.
Under the null hypothesis of a unit root, the ADF statistic does not have a normal distribution, even in large samples, so standard critical values cannot be used.
Because the alternative hypothesis is $\rho < 0$, the ADF test is one-sided.
Studies of the ADF statistic suggest that it is better to have too many lags than too few.
It is therefore recommended to use the AIC instead of the BIC to estimate $p$ for the ADF statistic.
The most reliable way to handle a trend in a series is to transform the series so that it does not have the trend.
If the series has a stochastic trend (a unit root), then its first difference does not.

Nonstationarity II: Breaks

A second type of nonstationarity arises when the population regression function changes over the sample period.
These changes, or breaks, can make inference and forecasting misleading if they are ignored.
Breaks can occur because of changes in policy, economic structure, or industry conditions.
A break can arise from:
A discrete change in the regression coefficients at a specific date;
A gradual change in the coefficients over time.
One way to detect breaks is to test for discrete changes, or breaks, in the regression coefficients.
How this is done depends on whether the break date is known.

Testing for a Break at a Known Date (1)

If the date of the hypothesized break is known, the null hypothesis of no break can be tested using a binary variable interaction regression.
For simplicity, consider an ADL(1, 1) model with an intercept, a single lag of $y_t$, and a single lag of $x_t$.
Let $\tau$ denote the break date, and let $D_t(\tau)$ be a binary variable equal to 0 before the break and 1 after.
The regression including the binary break indicator and all interaction terms is:

\[ y_t = \alpha_0 + \alpha_1 y_{t-1} + \delta_1 x_{t-1} + \gamma_0 D_t(\tau) + \gamma_1 [D_t(\tau) \times y_{t-1}] + \gamma_2 [D_t(\tau) \times x_{t-1}] + \varepsilon_t \]

Testing for a Break at a Known Date (2)

If there is no break, then the population regression function is the same over both parts of the sample, so the terms involving the break indicator $D_t(\tau)$ do not enter.
The null hypothesis of no break implies $\gamma_0 = \gamma_1 = \gamma_2 = 0$.
The alternative hypothesis is that there is a break, and the population regression function is different before and after the break date, implying at least one of the $\gamma$’s is nonzero.
The null hypothesis of no break can be tested using the F-statistic for the joint restriction $\gamma_0 = \gamma_1 = \gamma_2 = 0$.
This is often referred to as the Chow test.
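As a worked form of this test for the ADL(1,1) example, suppose (as an additional assumption not made on the slides) that the errors are homoskedastic. Let $SSR_r$ be the sum of squared residuals from the restricted regression without the $D_t(\tau)$ terms, $SSR_u$ the one from the unrestricted regression with all six coefficients, and $n$ the number of observations used. The F-statistic for the three restrictions is then

\[ F = \frac{\left(SSR_r - SSR_u\right)/3}{SSR_u/(n-6)}. \]

A large value of $F$ indicates that allowing the coefficients to change at date $\tau$ improves the fit by more than chance alone would explain.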

Testing for a Break at an Unknown Date

Often, the date of a possible break is unknown or known only within a range.
You might suspect that a break occurred between two dates, $\tau_0$ and $\tau_1$.
The Chow test can be extended by computing the $F$-statistic for every possible break date $\tau$ between $\tau_0$ and $\tau_1$ and then using the largest of the resulting $F$-statistics to test for a break at an unknown date.
This modified Chow test is called the Quandt likelihood ratio (QLR) statistic (sometimes also called the sup-Wald statistic).
Because the QLR statistic is the largest of many $F$-statistics, its distribution is not the same as an individual $F$-statistic.
Instead, the critical values for the QLR statistic must be obtained from a special distribution.
The QLR test can detect a single discrete break, multiple discrete breaks, and/or slow evolution of the regression function.
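In symbols, writing $F(\tau)$ for the Chow $F$-statistic computed with candidate break date $\tau$ (a notation introduced here for illustration), the statistic is

\[ \mathrm{QLR} = \max\left\{ F(\tau_0),\, F(\tau_0 + 1),\, \ldots,\, F(\tau_1) \right\}. \]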

Detecting Breaks Using Pseudo Out-of-Sample Forecasts

The ultimate test of a forecasting model is its out-of-sample performance: its forecasting performance in “real time,” after the model has been estimated.
Pseudo out-of-sample forecasting simulates the real-time performance of a forecasting model and can be used to detect breaks near the end of the sample.
The most direct and often most useful way to do so is via a time series plot of the in-sample predicted values, the pseudo out-of-sample forecasts, and the actual values of the series.
A visible deterioration of the forecasts in the pseudo out-of-sample period is a red flag warning of a possible breakdown of the forecasting model.

Example: Do you see a Trend and/or a Break?

Summary

Regression models used for forecasting need not have a causal interpretation.
A time series variable generally is correlated with one or more of its lagged values; that is, it is serially correlated.
The accuracy of a forecast is measured by its mean squared forecast error.
An autoregression of order $p$ is a linear multiple regression model in which the regressors are the first $p$ lags of the dependent variable.
The coefficients of an AR($p$) model can be estimated by OLS, and the estimated regression function can be used for forecasting.
The lag order $p$ can be estimated using an information criterion such as the BIC or the AIC.
Adding other variables and their lags to an autoregression can improve forecasting performance.
Under the least squares assumptions for prediction with time series regression, the OLS estimators have normal distributions in large samples, and statistical inference proceeds the same way as for cross-sectional data.

Vector Autoregressions

Forecasting Multiple Variables

One way in which everything so far differed from what we used to do in Econometrics was that we were only analyzing one variable at a time. Now, we are going to talk about how we can analyze two or more time series.

As you can imagine, there are endless applications for time series methods for multiple variables. For one thing, it is rare to find a variable that is not influenced by past or present realizations of another. We may be interested in describing or forecasting them properly, and thus we need methods to do so.

We are going to go about this the following way:

First, we are going to introduce a model to forecast multiple variables at the same time.
Then, we are going to discuss cointegration, which means multiple variables share a common trend.
Finally, we are going to talk about situations where volatility changes over time.

One Model per Variable or One Model for All?

Let us start by considering two variables and writing down AR(1) processes for both of them:

\[ y_t = a_0 + a_1 y_{t-1} + \varepsilon_{1t} \]

\[ x_t = b_0 + b_1 x_{t-1} + \varepsilon_{2t} \]

This way, we model both of them on past realizations of themselves. It is very easy to extend this to also include past realizations of the opposite variable:

\[ y_t = a_{10} + a_{11} y_{t-1} + a_{12} x_{t-1} + \varepsilon_{1t} \]

\[ x_t = a_{20} + a_{21} y_{t-1} + a_{22} x_{t-1} + \varepsilon_{2t} \]

Maybe you already suspect where we are going to end up.

Why Do We Call It a “Vector” Autoregression?

\[ y_t = a_{10} + a_{11} y_{t-1} + a_{12} x_{t-1} + \varepsilon_{1t} \]

\[ x_t = a_{20} + a_{21} y_{t-1} + a_{22} x_{t-1} + \varepsilon_{2t} \]

Of course, we know a way to consolidate this into one line by stacking the equations:

\[ \begin{pmatrix} y_t \\ x_t \end{pmatrix} = \begin{pmatrix} a_{10} \\ a_{20} \end{pmatrix} + \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \begin{pmatrix} y_{t-1} \\ x_{t-1} \end{pmatrix} + \begin{pmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \end{pmatrix} \]

This is what we call a Vector Autoregression (VAR). More specifically, this is a VAR(1) system of equations.

A VAR is an extension of univariate autoregressive processes to vectors of multiple variables.
When the number of lags with respect to each variable is $p$, we call the system of equations a VAR($p$).
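To see the stacked form at work, the following plain-Python sketch computes a one-step-ahead forecast from the two-equation VAR(1) with intercepts $a_{10}$ and $a_{20}$. The coefficient values and the function name `var1_forecast` are made up for illustration. In practice the coefficients would be replaced by estimates.

```python
def var1_forecast(a0, A, y_last):
    """One-step-ahead forecast a0 + A * y_t for a VAR(1) with given coefficients."""
    return [a0[i] + sum(A[i][j] * y_last[j] for j in range(len(y_last)))
            for i in range(len(a0))]


a0 = [1.0, 0.5]        # hypothetical intercepts (a_10, a_20)
A = [[0.5, 0.1],       # hypothetical coefficients (a_11, a_12)
     [0.2, 0.3]]       # hypothetical coefficients (a_21, a_22)
y_t = [2.0, 4.0]       # last observed values (y_t, x_t)

print(var1_forecast(a0, A, y_t))   # approximately [2.4, 2.1]
```

Each row of the matrix product reproduces one of the two scalar equations, so the vector notation is only a more compact way of writing the same system.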

Three Time Series

Let us now consider three variables we know well: real GDP growth $\Delta y_t$, inflation $\pi_t$, and the interest rate $r_t$. In the spirit of what we did before, we can construct the following VAR(1):

\[ \begin{pmatrix} \Delta y_t \\ \pi_t \\ r_t \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix} \begin{pmatrix} \Delta y_{t-1} \\ \pi_{t-1} \\ r_{t-1} \end{pmatrix} + \begin{pmatrix} \varepsilon_{\Delta y_t} \\ \varepsilon_{\pi_t} \\ \varepsilon_{r_{t}} \end{pmatrix} \]

Written more compactly, this becomes

\[ \boldsymbol{y}_t = \boldsymbol{A}\boldsymbol{y}_{t-1}+\boldsymbol{\varepsilon}_{t}. \]

Practice

Write this down as three separate equations for $\Delta y_t$, $\pi_t$, and $r_t$.

General Reduced-Form VAR(p)

We have discussed a bunch of examples, but what we lack so far is a general representation of a VAR($p$) model. So let us write one down:

\[ \begin{aligned} \boldsymbol{y}_t &= \boldsymbol{a}_0 + \boldsymbol{A}_1\boldsymbol{y}_{t-1} + \dots + \boldsymbol{A}_p\boldsymbol{y}_{t-p} + \boldsymbol{\varepsilon}_t, \\ \boldsymbol{\varepsilon}_t &\sim \mathcal{N}_M(\boldsymbol{0},\boldsymbol{\Sigma}), \end{aligned} \]

where

$\boldsymbol{y}_t$ is an $M$-dimensional vector of endogenous variables,
$\boldsymbol{a}_0$ is an $M$-dimensional constant vector,
$\boldsymbol{A}_j$ are $M\times M$-dimensional coefficient matrices,
$\mathcal{N}_M(\cdot,\cdot)$ denotes a multivariate normal distribution of $M$ variables, and
$\boldsymbol{\Sigma}$ denotes an $M\times M$-dimensional variance-covariance matrix.

Estimation and Hypothesis Tests

This can be estimated using multiple techniques.

With normally distributed errors, as assumed above, OLS is consistent and the estimates follow a multivariate normal distribution in large samples. This means that

we can easily conduct hypothesis tests the way we know,
we can use the straightforward critical values that we know from Econometrics I, and
using an $F$-test, we can even test restrictions across multiple equations.

Of course, we can also estimate the VAR using ML or Bayesian estimation techniques, if we want.

We have to be careful when interpreting VAR results: Coefficients are only interpretable as predictive relationships between certain lags.

Estimating a VAR: Data

Estimating a VAR: Plotting

Estimating a VAR: Results

Causal Analysis

So far, we have treated VARs – and other time series methods – as tools to forecast variables. But what if we are interested in causal inference?

There is a certain “causality” concept that exists in the realm of time series econometrics: Granger Causality, named after Granger (1969). In short, we speak of Granger Causality when a realization of one variable, let us call it $x_t$, can be used to predict a future realization of another variable, $y_{t+1}$. In this case, we say that $x$ Granger causes $y$.

More precisely, $x$ provides statistically significant information about future values of $y$.
This is an instance of predictive causality, which naturally falls short of “real” causality, which we know from non-time-series contexts.

But what about real causality? It turns out that VARs were originally introduced into Economics as a tool for analyzing causal relationships between multiple time series (Sims, 1980). But using them for this purpose requires going one step further to Structural VARs. Most of how this works is out of scope for this class, but the following should serve as a brief introduction.

Granger Causality

We can ask R whether one variable Granger causes the others.
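The idea behind such a test can be sketched as a comparison of two regressions: one that predicts $y_t$ from its own lags only, and one that adds lags of $x$. The following is a minimal Python sketch using a homoskedasticity-only $F$ statistic on simulated data; the function name `granger_f_stat` and all parameter values are assumptions made for illustration.

```python
import numpy as np

def granger_f_stat(y, x, p=1):
    """F statistic for H0: the p lags of x do not help predict y."""
    T = len(y)
    lags_y = np.column_stack([y[p - j:T - j] for j in range(1, p + 1)])
    lags_x = np.column_stack([x[p - j:T - j] for j in range(1, p + 1)])
    target = y[p:]
    const = np.ones(T - p)

    def ssr(X):
        # Sum of squared residuals from an OLS regression of target on X.
        beta, *_ = np.linalg.lstsq(X, target, rcond=None)
        resid = target - X @ beta
        return resid @ resid

    X_restricted = np.column_stack([const, lags_y])
    X_unrestricted = np.column_stack([const, lags_y, lags_x])
    ssr_r, ssr_u = ssr(X_restricted), ssr(X_unrestricted)
    dof = len(target) - X_unrestricted.shape[1]
    return ((ssr_r - ssr_u) / p) / (ssr_u / dof)

# Simulated data in which x helps predict y, but not the other way around.
rng = np.random.default_rng(1)
T = 500
x = rng.normal(size=T)
y = np.zeros(T)
for t in range(1, T):
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + rng.normal()

f_x_to_y = granger_f_stat(y, x)   # large: x Granger causes y
f_y_to_x = granger_f_stat(x, y)   # small: y does not Granger cause x
```

A large statistic leads us to reject the null hypothesis that the lags of the other variable carry no predictive information.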

Structural VARs

What we plainly called a VAR before is actually a Reduced Form VAR:

\[ \boldsymbol{y}_t = \boldsymbol{a}_0 + \boldsymbol{A}_1\boldsymbol{y}_{t-1} + \dots + \boldsymbol{A}_p\boldsymbol{y}_{t-p} + \boldsymbol{\varepsilon}_t \]

The problem with this is that the errors $\boldsymbol{\varepsilon}_t$ are generally correlated across equations, which prevents us from drawing causal conclusions. For Structural VARs, we assume that there exists an invertible matrix $\boldsymbol{B}_0$ such that

\[ \boldsymbol{\varepsilon}_t=\boldsymbol{B}_0^{-1}\boldsymbol{e}_t, \]

giving us the uncorrelated, structural shocks $\boldsymbol{e}_t$. Using this decomposition, we can transform the reduced form VAR into a structural VAR. $\boldsymbol{B}_0$ governs the contemporaneous relations between the variables, and thus we need to know about it if we want to investigate causal relationships.

The real challenge is finding $\boldsymbol{B}_0$. This requires imposing certain restrictions based on economic theory and the researcher’s assumptions.
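One common restriction, and by no means the only one, is a recursive ordering: we assume that $\boldsymbol{B}_0^{-1}$ is lower triangular and that the structural shocks have unit variance. Under these assumptions, $\boldsymbol{\Sigma} = \boldsymbol{B}_0^{-1}(\boldsymbol{B}_0^{-1})'$, so $\boldsymbol{B}_0^{-1}$ is the Cholesky factor of $\boldsymbol{\Sigma}$. The sketch below uses a made-up $\boldsymbol{\Sigma}$ purely for illustration.

```python
import numpy as np

# Hypothetical reduced-form error covariance matrix (made-up numbers).
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])

# Recursive identification: B0^{-1} is the lower-triangular Cholesky factor,
# so that Sigma = B0^{-1} (B0^{-1})' when Var(e_t) is the identity matrix.
B0_inv = np.linalg.cholesky(Sigma)
B0 = np.linalg.inv(B0_inv)

# Map simulated reduced-form errors to structural shocks: e_t = B0 eps_t.
rng = np.random.default_rng(2)
eps = rng.multivariate_normal(mean=np.zeros(2), cov=Sigma, size=5000)
e = eps @ B0.T
shock_cov = np.cov(e, rowvar=False)   # should be close to the identity matrix
```

The ordering of the variables matters here: the first variable is assumed not to react to the second shock within the same period, which is exactly the kind of assumption that has to be justified by economic theory.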

Cointegration

Cointegration

Look at the two time series in the chart of a long-term interest rate and a short-term interest rate. What do you notice?

The two time series seem to move together; that is, they share a common trend, but not perfectly so.

We call this phenomenon cointegration.

Defining Cointegration

Let us now more formally define what cointegration means.

Assume that we have two time series, $y_t$ and $x_t$, and they are both integrated of order 1 (i.e., two $I(1)$ processes). Then, the two are called cointegrated if a $\beta$ exists such that

\[ u_t = y_t - \beta x_t \]

is a stationary $I(0)$ process.

In this case, we call $\beta$ the cointegrating coefficient.

More generally, two series are cointegrated if they are $I(d)$ and some linear combination of them is integrated of order less than $d$.

Let us plot on the next slide what the process $u_t$ would look like for our previous example.

Cointegration

Choosing 1.2 as cointegrating coefficient gives us the pink process in the graph on the left, which looks reasonably stationary.

Of course, we would normally estimate $\beta$. The OLS estimator is consistent in this case, but it is not normally distributed. There are, however, extensions that enable inference based on $t$-statistics.

Error Correction

If we have two cointegrated series, we can model their short-run dynamics (more specifically, their first differences) using an error correction model. Consider the following vector error correction model (VECM):

\[ \begin{pmatrix} \Delta y_t\\ \Delta x_t \end{pmatrix} = \begin{pmatrix} \alpha_1\\ \alpha_2 \end{pmatrix} + \begin{pmatrix} \delta_1\\ \delta_2 \end{pmatrix} \hat u_{t-1} + \sum_{i=1}^{p-1} \begin{pmatrix} \gamma_{11,i} & \gamma_{12,i}\\ \gamma_{21,i} & \gamma_{22,i} \end{pmatrix} \begin{pmatrix} \Delta y_{t-i}\\ \Delta x_{t-i} \end{pmatrix} + \begin{pmatrix} \varepsilon_{1t}\\ \varepsilon_{2t} \end{pmatrix} , \]

where $\hat{u}_{t-1} = y_{t-1} - \beta x_{t-1}$ is the error correction term.

The parameter $\boldsymbol{\delta}$ governs the adjustment to equilibrium:

$\delta = 0$: no cointegration
$-1 < \delta < 0$: stable error correction
$\delta \leq -1$: oscillatory / unstable adjustment
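As a rough illustration of stable error correction, the following sketch simulates a cointegrated pair and estimates the $\Delta y_t$ equation by OLS. It assumes $p = 1$ (no lagged differences), a known $\beta = 1.2$, $\delta_2 = 0$, and a made-up $\delta_1 = -0.3$; none of these values come from real data.

```python
import numpy as np

rng = np.random.default_rng(3)
T, beta, delta1 = 1000, 1.2, -0.3   # made-up parameters for illustration

x = np.zeros(T)
y = np.zeros(T)
for t in range(1, T):
    ect = y[t - 1] - beta * x[t - 1]             # error correction term u_{t-1}
    x[t] = x[t - 1] + rng.normal()               # x is a random walk (delta_2 = 0)
    y[t] = y[t - 1] + delta1 * ect + rng.normal()

# Estimate the Delta y equation by OLS, taking beta as known and p = 1.
u_lag = (y - beta * x)[:-1]
dy = np.diff(y)
X = np.column_stack([np.ones(T - 1), u_lag])
(alpha1_hat, delta1_hat), *_ = np.linalg.lstsq(X, dy, rcond=None)
print(round(delta1_hat, 2))
```

With $\delta_1 = -0.3$, roughly 30 percent of last period's deviation from equilibrium is corrected through $y$ in each period.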

Testing for Cointegration

In addition to considering economic theory and inspecting time series graphs, we can test for cointegration. We can divide the available testing procedures into two scenarios: Testing for cointegration when $\beta$ is known, and testing for cointegration when $\beta$ is not known.

When $\beta$ is known (e.g. because theory suggests a value), we can use a Dickey-Fuller Test to test for cointegration.

First, we construct the series $\hat{u}_t = y_t - \beta x_t$, and
then, we use the DF test to test for a unit autoregressive root.

In practice, $\beta$ is often unknown. In these cases, one option is to follow the Engle-Granger Procedure.

We start by obtaining an estimate for $\beta$ from regressing $y_t = \alpha_0 + \beta x_t + u_t$.
We can then run a special Dickey-Fuller test (EG-ADF test), which has different critical values than a regular one.
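The two steps can be sketched as follows. This is a minimal illustration on simulated data, using a Dickey-Fuller regression without lagged differences; the function name `engle_granger_stat` is an assumption, and the resulting statistic has to be compared with EG-ADF critical values rather than the usual ones.

```python
import numpy as np

def engle_granger_stat(y, x):
    """Two-step sketch: (1) OLS of y on x, (2) DF t-statistic on the residuals."""
    # Step 1: estimate the cointegrating regression y_t = alpha_0 + beta x_t + u_t.
    X = np.column_stack([np.ones(len(x)), x])
    (alpha0_hat, beta_hat), *_ = np.linalg.lstsq(X, y, rcond=None)
    u_hat = y - alpha0_hat - beta_hat * x

    # Step 2: Dickey-Fuller regression  Delta u_t = rho * u_{t-1} + error.
    du, u_lag = np.diff(u_hat), u_hat[:-1]
    rho_hat = (u_lag @ du) / (u_lag @ u_lag)
    resid = du - rho_hat * u_lag
    se = np.sqrt(resid @ resid / (len(du) - 1) / (u_lag @ u_lag))
    return beta_hat, rho_hat / se

# Simulated cointegrated pair with a made-up cointegrating coefficient of 1.2.
rng = np.random.default_rng(4)
x = np.cumsum(rng.normal(size=500))       # random walk, I(1)
y = 1.2 * x + rng.normal(size=500)        # y - 1.2 x is stationary
beta_hat, t_stat = engle_granger_stat(y, x)
```

A strongly negative statistic speaks against a unit root in the residuals and thus in favor of cointegration.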

Testing for Cointegration with a Known Coefficient

Testing for Cointegration with an Unknown Coefficient

Volatility Clustering, ARCH and GARCH

Variance over Time

Look at this chart of daily log differences of the S&P 500 index from 2016 to now.

We can clearly see that volatility changes over time.

This is not unusual; in fact, higher moments of time series are rarely constant over time.

Volatility Clustering

When we observe some periods of lower volatility and some periods of higher volatility, we say that there is volatility clustering.

Volatility appears in clusters.
So even though tomorrow’s price change is difficult to forecast,
we can say something about the variance of the price change.

This is interesting (not only) in financial contexts, because volatility is a measure of how risky an asset is, and the value of some derivatives depends directly on that volatility. Also, it helps us determine confidence intervals of forecasts.

The simplest volatility measure we have just makes use of the sample variance. This is useful when we have very frequent data. The $h$-period realized volatility of a (demeaned) time series $x_t$ is given by

\[ \textstyle RV_t^h=\sqrt{\frac{1}{h}\sum^t_{s=t-h+1}x_s^2}. \]
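As a small worked example of this formula, the following sketch computes the rolling realized volatility of a made-up demeaned series; the function name `realized_volatility` is an assumption for illustration.

```python
import math

def realized_volatility(x, h):
    """h-period realized volatility of a demeaned series x.

    Returns one value for every window of h consecutive observations.
    """
    return [
        math.sqrt(sum(v * v for v in x[t - h + 1:t + 1]) / h)
        for t in range(h - 1, len(x))
    ]

# Made-up demeaned returns: a calm stretch followed by a turbulent one.
returns = [0.1, -0.1, 0.1, -0.1, 2.0, -2.0, 2.0, -2.0]
rv = realized_volatility(returns, h=4)
print([round(v, 3) for v in rv])
```

The first window only contains calm observations and the last window only turbulent ones, so the realized volatility rises across the sample.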

Autoregressive Conditional Heteroskedasticity (ARCH)

With lower-frequency data, we have to resort to different methods to estimate how volatility changes over time. First, let us consider the Autoregressive Conditional Heteroskedasticity (ARCH) model.

Consider some model for $x_t$,

\[ x_t = \mu + \dots + \varepsilon_t, \]

where $\varepsilon_t$ is normally distributed with mean zero and variance $\sigma^2_t$. The variance $\sigma^2_t$ is then modeled on past squared values of $\varepsilon_t$. This gives us an ARCH model of order $p$:

\[ \sigma^2_t = \omega + \psi_1\varepsilon_{t-1}^2 + \psi_2\varepsilon_{t-2}^2+\dots+\psi_p\varepsilon_{t-p}^2. \]

If $\psi_1, \dots, \psi_p$ are large, large recent squared errors predict a high variance.
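To see how the variance equation works, the following sketch evaluates it for given errors and given parameters. All numbers are made up, and in practice $\omega$ and $\psi_1, \dots, \psi_p$ would have to be estimated; the function name `arch_variance` is an assumption.

```python
def arch_variance(eps, omega, psi):
    """Conditional variances sigma^2_t implied by an ARCH(p) model.

    eps: observed errors; psi: list [psi_1, ..., psi_p].
    Returns sigma^2_t for t = p, ..., len(eps), so the last entry is the
    one-step-ahead variance forecast.
    """
    p = len(psi)
    return [
        omega + sum(psi[j] * eps[t - 1 - j] ** 2 for j in range(p))
        for t in range(p, len(eps) + 1)
    ]

# ARCH(1) with made-up parameters: a large error raises next period's variance.
sigma2 = arch_variance([0.0, 1.0, -3.0], omega=0.5, psi=[0.4])
print(sigma2)
```

The sign of the error does not matter, since only its square enters the variance equation.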

Generalized ARCH (GARCH)

The Generalized Autoregressive Conditional Heteroskedasticity (GARCH) model is an extension of the ARCH model. The difference from the ARCH model is that $\sigma^2_t$ can now depend on its own lags, in addition to lags of the squared error.

Consider some model for $x_t$,

\[ x_t = \mu + \dots + \varepsilon_t, \]

\[ \sigma^2_t = \omega + \psi_1\varepsilon_{t-1}^2 +\dots+\psi_p\varepsilon_{t-p}^2 + \phi_1\sigma^2_{t-1} + \dots + \phi_q\sigma^2_{t-q}. \]

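As before, if $\psi_1, \dots, \psi_p$ are large, large recent squared errors predict a high variance. If $\phi_1, \dots, \phi_q$ are large, a high variance in recent periods predicts a high variance today.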
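The role of the lagged variance can be illustrated with a GARCH(1,1) recursion. The sketch below uses made-up parameters and starts the recursion at $\omega / (1 - \psi_1 - \phi_1)$, the unconditional variance of a GARCH(1,1) process; the function name `garch11_variance` is an assumption.

```python
def garch11_variance(eps, omega, psi1, phi1, sigma2_init):
    """Recursively compute sigma^2_t = omega + psi1 * eps_{t-1}^2 + phi1 * sigma^2_{t-1}."""
    sigma2 = [sigma2_init]
    for e in eps:
        sigma2.append(omega + psi1 * e ** 2 + phi1 * sigma2[-1])
    return sigma2

# One large shock followed by zero errors (made-up numbers).
# With omega = 0.1, psi1 = 0.1, phi1 = 0.8, the unconditional variance is 1.
sigma2 = garch11_variance([3.0, 0.0, 0.0, 0.0],
                          omega=0.1, psi1=0.1, phi1=0.8, sigma2_init=1.0)
print([round(s, 4) for s in sigma2])
```

After the shock, the variance jumps and then decays only gradually, because each period's variance feeds into the next one through $\phi_1$.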

Spatial Autocorrelation

Time and Space

Part of our motivation to treat time series differently was that we could no longer credibly assume errors to be uncorrelated over time.

If the value of a time series is high in a given period, it is more likely to be high in the subsequent period as well.

But does what applies to time not also apply to space? If GDP is high in one place, it is more likely to also be high in places close to it.

This is the idea behind the concept of Spatial Autocorrelation.

Tobler’s First Law of Geography

The photograph on the left depicts Waldo Tobler (1930–2018), a famous Swiss-American geographer. He is known for a lot of things, among them Tobler’s First Law of Geography:

Everything is related to everything else, but near things are more related than distant things.

This quote, published in Tobler (1970), fundamentally describes spatial autocorrelation: The value of a given variable in a given place depends, among other things, on realizations of the same variable in places close by.

We are going to use this as a starting point to venture very quickly into the field of Spatial Econometrics.

Spatial Weights

But how do we quantify “near”? Consider the following Spatial Weights Matrix:

\[ \boldsymbol{W}= \left( \begin{array}{c|ccccccccc} & \text{AT} & \text{CH} & \text{CZ} & \text{DE} & \text{HU} & \text{IT} & \text{LI} & \text{SI} & \text{SK} \\ \hline \text{AT} & 0 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ \text{CH} & 1 & 0 & 0 & 1 & 0 & 1 & 1 & 0 & 0 \\ \text{CZ} & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 1 \\ \text{DE} & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ \text{HU} & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 \\ \text{IT} & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\ \text{LI} & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ \text{SI} & 1 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 \\ \text{SK} & 1 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 \\ \end{array} \right) \]

This simple matrix is binary. It uses contiguity as a measure of proximity: if two countries share a border, they are “near” (1); if not, they are “distant” (0).

Contiguity

There are different types of contiguity. These are relevant when we use pixel-based data, which can be the case when we use remotely sensed data or other data that is aggregated for square areal units.

Queen Contiguity

Rook Contiguity

Bishop Contiguity
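On a regular grid, the three types can be sketched in a few lines: rook contiguity means sharing an edge, bishop contiguity means sharing only a corner, and queen contiguity means sharing either. The function name `neighbors` is an assumption for illustration.

```python
def neighbors(row, col, n_rows, n_cols, kind="queen"):
    """Cells contiguous to (row, col) on a regular n_rows x n_cols grid."""
    edges = [(-1, 0), (1, 0), (0, -1), (0, 1)]       # cells sharing an edge
    corners = [(-1, -1), (-1, 1), (1, -1), (1, 1)]   # cells sharing only a corner
    offsets = {"rook": edges, "bishop": corners, "queen": edges + corners}[kind]
    return [
        (row + dr, col + dc)
        for dr, dc in offsets
        if 0 <= row + dr < n_rows and 0 <= col + dc < n_cols
    ]

# The center cell of a 3 x 3 grid under the three definitions.
for kind in ("rook", "bishop", "queen"):
    print(kind, len(neighbors(1, 1, 3, 3, kind)))
```

Cells at the border of the grid have fewer neighbors, which is one reason why spatial weights matrices are often row-standardized.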

Distance

We do not need to restrict ourselves to contiguity. Look at this Spatial Weights Matrix:

\[ \boldsymbol{W}= \left( \begin{array}{c|ccccccccc} & \text{AT} & \text{CH} & \text{CZ} & \text{DE} & \text{HU} & \text{IT} & \text{LI} & \text{SI} & \text{SK} \\ \hline \text{AT} & 0 & 683 & 253 & 524 & 214 & 765 & 480 & 278 & 55 \\ \text{CH} & 683 & 0 & 530 & 753 & 846 & 691 & 118 & 547 & 738 \\ \text{CZ} & 253 & 530 & 0 & 280 & 444 & 923 & 434 & 449 & 290 \\ \text{DE} & 524 & 753 & 280 & 0 & 689 & 1181 & 658 & 769 & 518 \\ \text{HU} & 214 & 846 & 444 & 689 & 0 & 810 & 677 & 381 & 162 \\ \text{IT} & 765 & 691 & 923 & 1181 & 810 & 0 & 612 & 489 & 820 \\ \text{LI} & 480 & 118 & 434 & 658 & 677 & 612 & 0 & 400 & 535 \\ \text{SI} & 278 & 547 & 449 & 769 & 381 & 489 & 400 & 0 & 333 \\ \text{SK} & 55 & 738 & 290 & 518 & 162 & 820 & 535 & 333 & 0 \\ \end{array} \right) \]

Here, we are using distances between capital cities as our distance measure.

This allows for different-strength links, and
we have a lot of information where we previously only had zeroes.
Usually, we will transform distances in such a way that more means closer.
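One simple transformation of this kind is the inverse distance. The sketch below applies it to the AT, CZ and SK entries of the distance matrix; the additional row-standardization is a common convention that we assume here, not something the matrix itself requires.

```python
import numpy as np

# Capital-city distances for AT, CZ and SK, taken from the distance matrix.
D = np.array([[0.0, 253.0, 55.0],
              [253.0, 0.0, 290.0],
              [55.0, 290.0, 0.0]])

# Inverse-distance weights: a larger weight means closer; the diagonal stays zero.
W = np.zeros_like(D)
off_diag = ~np.eye(len(D), dtype=bool)
W[off_diag] = 1.0 / D[off_diag]

# Row-standardization: each row sums to one.
W_std = W / W.sum(axis=1, keepdims=True)
print(np.round(W_std, 3))
```

In the first row, SK receives a much larger weight than CZ, because Bratislava is far closer to Vienna than Prague is.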

Back to Spatial Autocorrelation

All of this is interesting, but how does it relate to our earlier concept of spatial autocorrelation? Remember,

Everything is related to everything else, but near things are more related than distant things.

In reality, of course, this is not purely a feature of geography. Agents of nearby places interact more with each other.

The reason we asked more explicitly how to quantify “near” is that we can use spatial weights matrices to calculate measures of spatial autocorrelation. These can broadly be grouped into two categories:

Measures of global spatial autocorrelation give us one value that describes the spatial pattern present in an entire dataset.
Measures of local spatial autocorrelation give us an idea of how interconnected a given observation is.

Moran’s I

One measure of both global and local spatial autocorrelation is Moran’s $I$.

Global Moran’s $I$ is a measure of how similar nearby observations are on average.

\[ I = \frac{N\sum^N_{i=1}\sum^N_{j=1}w_{ij}(y_i-\bar{y})(y_j-\bar{y})}{\left(\sum^N_{i=1}\sum^N_{j=1}w_{ij}\right)\sum^N_{i=1}(y_i-\bar{y})^2} \]

This is simply a measure of $y$’s autocovariance, weighted by the spatial weights $w_{ij}$.
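The formula for global Moran’s $I$ translates almost directly into code. The sketch below uses a made-up example of four regions on a line with binary contiguity weights; the function name `morans_i` is an assumption.

```python
import numpy as np

def morans_i(y, W):
    """Global Moran's I for values y and spatial weights matrix W."""
    z = np.asarray(y, dtype=float) - np.mean(y)   # deviations from the mean
    return len(z) * (z @ W @ z) / (W.sum() * (z @ z))

# Four regions on a line; direct neighbors share a border (binary contiguity).
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])

i_clustered = morans_i([1, 1, 5, 5], W)      # similar values sit next to each other
i_alternating = morans_i([1, 5, 1, 5], W)    # neighbors are dissimilar
print(i_clustered, i_alternating)
```

Positive values indicate that neighbors tend to be similar, and negative values indicate that neighbors tend to be dissimilar.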

Local Moran’s $I$ is a version of the measure that is specific to a given observation. It is useful for finding spatial outliers and local clusters.

For Moran’s $I$, as well as for most questions of spatial econometric analysis, the choice of the spatial weights matrix is therefore very important. We need to think about the spatial pattern we assume to be present and justify our assumption.

Space Is Everywhere

In Econometrics, we always make (implicit) assumptions about the relation between observations. Often, we assume that observations are independent and come from the same data-generating process.

Think

When data has a spatial dimension, how often can we credibly assume that there is zero spatial (auto)correlation?

How much of the data we encounter in economic analysis has a spatial dimension?

If we ignore space when it plays an important role, we run the risk of getting invalid results.

Spillovers

Often, the spatial dimension of a sample relates to spillover effects.

You may know this figure from Econometrics I.
The setting we consider here is an experiment.
We have a square field, which we divide into 100 plots.
Then, we randomize fertilizer use.
Finally, we measure yields and compare whether the fields where fertilizer was applied perform better.
One way this design can be invalidated is if untreated plots near treated plots are affected by fertilizer from the treated plots, e.g. through groundwater.
If we explicitly allow for spatial spillovers in our analysis, we can still get meaningful results.

A Vicious Cycle

Analysis of spatial autocorrelation is very similar to discussing temporal autocorrelation, as long as we only consider one period.

However, spatial autocorrelation as a pattern can persist through time. If we observe a spatial correlation pattern over multiple rounds, this can imply that an entity that was affected in one round can affect the entity it was originally affected by. (Note that this is a bit simplified.)

In situations with patterns like these, the question of who is affecting whom becomes difficult to answer.

Outlook

We are now aware of a phenomenon, spatial autocorrelation; we can quantify and describe it, and we know when it can lead to problems.

If time permits, we are going to talk about related issues at the end of the course.

Networks are a very stylized form of space, and allow us to study peer effects. This relates to the idea we had before about spatial autocorrelation going both ways: I affect my neighbor, but my neighbor also affects me.
Spatial Econometric Models are a way to explicitly consider space in econometric settings where ignoring it would invalidate results. There are a number of spatial models that can be used in different settings, and they are modular and relate clearly to the cross-sectional models we know and the panel models we will learn about.

References

Granger, C. W. J. (1969). Investigating causal relations by econometric models and cross-spectral methods. Econometrica, 37(3), 424. https://doi.org/10.2307/1912791

Sims, C. A. (1980). Macroeconomics and reality. Econometrica, 48(1), 1. https://doi.org/10.2307/1912017

Stock, J., & Watson, M. W. (2019). Introduction to econometrics, global edition (4th ed.). Pearson Education.

Tobler, W. R. (1970). A computer movie simulating urban growth in the Detroit region. Economic Geography, 46, 234. https://doi.org/10.2307/143141
