Econometrics I
Department of Economics, WU Vienna
May 22, 2025
We’ve already talked several times about unbiasedness. Another important property is consistency.
We can sketch how we would prove that the OLS estimator is consistent. We proceed as follows, where \(\mathrm{plim}\:X_N=X\) means that \(X_N\) converges in probability to \(X\) as \(N\rightarrow\infty\):
\[ \begin{aligned} \mathrm{plim}\:\hat{\boldsymbol{\beta}} &= \mathrm{plim}\:((\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{y}) \\ &= \mathrm{plim}\:((\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'(\boldsymbol{X\beta}+\boldsymbol{u})) \\ &= \mathrm{plim}\:((\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{X\beta}+(\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{u}) \\ &= \mathrm{plim}\:\boldsymbol{\beta}+\mathrm{plim}\:((\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{u}) \\ &= \boldsymbol{\beta}+\mathrm{plim}\:(\boldsymbol{X}'\boldsymbol{X})^{-1}\mathrm{plim}\:\boldsymbol{X}'\boldsymbol{u} \\ &= \boldsymbol{\beta}+\mathrm{plim}\:(N^{-1}\boldsymbol{X}'\boldsymbol{X})^{-1}\mathrm{plim}\:(N^{-1}\boldsymbol{X}'\boldsymbol{u}) \\ \end{aligned} \]
In the last step, we rewrite the second term as \((N^{-1}\boldsymbol{X}'\boldsymbol{X})^{-1}(N^{-1}\boldsymbol{X}'\boldsymbol{u})\), i.e., we multiply and divide by \(N\), so that each factor becomes a sample average and we can directly apply the law of large numbers.
Since \(\mathrm{plim}\:(N^{-1}\boldsymbol{X}'\boldsymbol{X})\) exists and is invertible (so that its inverse is a finite matrix), we only need to show that \(\mathrm{plim}\:(N^{-1}\boldsymbol{X}'\boldsymbol{u})=\boldsymbol{0}.\) This is the case because, as \(N\rightarrow\infty\), the sample covariance converges to the population covariance, and we have assumed that all \(x_k\) are uncorrelated with the error term (MLR.4).
\[ \mathrm{plim}(N^{-1}\boldsymbol{X}'\boldsymbol{u}) = \mathrm{plim}\:N^{-1}\sum^N_{i=1}\boldsymbol{x}_i'u_i=\boldsymbol{0}. \]
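As a sanity check on this sketch, a small simulation (a hypothetical data-generating process in Python/numpy, not part of the course material) shows the OLS estimates approaching the true coefficients as \(N\) grows:

```python
import numpy as np

# Consistency sketch: under a made-up DGP satisfying the assumptions,
# the OLS estimate gets closer to the true beta as N grows.
rng = np.random.default_rng(42)
beta = np.array([1.0, 2.0, -0.5])              # true coefficients (intercept, x1, x2)

for n in (100, 10_000, 1_000_000):
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    u = rng.normal(size=n)                     # error uncorrelated with the regressors (MLR.4')
    y = X @ beta + u
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    print(n, np.round(beta_hat, 3))
```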
Although we used MLR.4 in the outline of the consistency proof of the OLS estimator, we actually only needed a weaker assumption. We can explicitly state this weaker assumption as MLR.4’:
The error term has expected value 0 and is uncorrelated with any explanatory variable:
\[ \mathrm{E}\left(u\right) = 0, \qquad\qquad \mathrm{Cov}\left(x_k,u\right)=0\quad\text{ for } k=1,\dots,K. \]
We discussed that
\[ \mathrm{plim}\:\hat{\boldsymbol{\beta}} = \boldsymbol{\beta}+\mathrm{plim}\:(N^{-1}\boldsymbol{X}'\boldsymbol{X})^{-1}\mathrm{plim}\:(N^{-1}\boldsymbol{X}'\boldsymbol{u}) \]
Alternatively, in the simple regression case (a single regressor \(x_1\)), we can write for \(\hat{\beta}_1\):
\[ \mathrm{plim}\:\hat{\beta}_1 = \beta_1 + \frac{\mathrm{Cov}(x_1,u)}{\mathrm{Var}(x_1)}. \]
So we can state: the OLS estimator is consistent if the error term is uncorrelated with the explanatory variables (MLR.4'). If instead \(\mathrm{Cov}(x_1,u)\neq0\), \(\hat{\beta}_1\) is inconsistent, and this asymptotic bias does not vanish as the sample size grows.
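A companion simulation sketch (again with a made-up data-generating process, Python/numpy assumed) illustrates the formula for \(\mathrm{plim}\:\hat{\beta}_1\): with \(\mathrm{Cov}(x_1,u)=0.5\) and \(\mathrm{Var}(x_1)=1\), the estimate settles near \(\beta_1+0.5\) instead of \(\beta_1\):

```python
import numpy as np

# Inconsistency sketch: Cov(x1, u) = 0.5 and Var(x1) = 1 by construction,
# so plim beta_1_hat = beta_1 + 0.5/1 = 2.5 rather than the true 2.0.
rng = np.random.default_rng(7)
beta0, beta1 = 1.0, 2.0

for n in (1_000, 100_000, 1_000_000):
    x1 = rng.normal(size=n)
    u = 0.5 * x1 + rng.normal(size=n)
    y = beta0 + beta1 * x1 + u
    b1_hat = np.cov(x1, y)[0, 1] / np.var(x1, ddof=1)   # OLS slope in simple regression
    print(n, round(b1_hat, 3))
```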
Scaling, Transforming, Interacting
When we scale a variable, the scale of certain coefficients changes:
\[ \begin{aligned} y^{*}&=\textcolor{var(--primary-color)}{10}\beta_{0} +\textcolor{var(--primary-color)}{10}\beta_{1}x_{1} +\textcolor{var(--primary-color)}{10}\beta_{2}x_{2} +\textcolor{var(--primary-color)}{10}u, & y^*=\textcolor{var(--primary-color)}{10}\times y\\ y&=\beta_{0} +\frac{\beta_{1}}{\textcolor{var(--secondary-color)}{10}}\,x_{1}^{*} +\beta_{2}x_{2}+u, & x_1^*=\textcolor{var(--secondary-color)}{10}\times x_1 \end{aligned} \]
Fortunately, not much else changes: the standard errors scale by the same factor as the coefficients, so t-statistics, \(R^2\), and the substantive conclusions are unaffected.
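A minimal sketch of this, with made-up data and statsmodels as an assumed library choice: rescaling \(y\) by 10 multiplies every coefficient by 10, while t-statistics and \(R^2\) stay the same.

```python
import numpy as np
import statsmodels.api as sm

# Rescaling demo: compare a regression of y with a regression of 10*y.
rng = np.random.default_rng(3)
X = sm.add_constant(rng.normal(size=(200, 2)))
y = X @ np.array([1.0, 0.5, -1.0]) + rng.normal(size=200)

fit_y = sm.OLS(y, X).fit()
fit_10y = sm.OLS(10 * y, X).fit()

print(fit_10y.params / fit_y.params)      # every ratio equals 10
print(fit_10y.tvalues - fit_y.tvalues)    # zero (up to floating-point noise)
print(fit_y.rsquared, fit_10y.rsquared)   # identical
```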
We’ve already talked about logarithmic transformations:
| Model | Dep. Variable | Indep. Variable | Interpretation |
|---|---|---|---|
| Level-Level | \(y\) | \(x\) | \(+1\) in \(x\) \(\Leftrightarrow\) \(+\beta_1\) in \(y\) |
| Level-Log | \(y\) | \(\log(x)\) | \(+1\%\) in \(x\) \(\Leftrightarrow\) \(+\beta_1/100\) in \(y\) |
| Log-Level | \(\log(y)\) | \(x\) | \(+1\) in \(x\) \(\Leftrightarrow\) \(+\beta_1\times 100\%\) in \(y\) |
| Log-Log | \(\log(y)\) | \(\log(x)\) | \(+1\%\) in \(x\) \(\Leftrightarrow\) \(+\beta_1\%\) in \(y\) |
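As a quick illustration with a hypothetical estimate (not from the course data): suppose a log-level model yields \(\hat{\beta}_1 = 0.08\). Then for a one-unit increase in \(x\),

\[ \Delta\log(y) \approx \hat{\beta}_1\,\Delta x = 0.08 \quad\Rightarrow\quad \%\Delta y \approx 100\times\hat{\beta}_1 = 8\%. \]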
There are various reasons to log-transform variables:
But there are also good reasons not to log-transform variables:
Quadratic functions allow us to model nonlinear relationships. We then estimate a model of the form:
\[ y = \beta_0 + \beta_1 x + \textcolor{var(--quarternary-color)}{\beta_2 x^2} + u. \]
Mechanically, estimating a model with a quadratic term is no different from estimating one without it: the model is still linear in the parameters, and since \(x^2\) is not a linear function of \(x\), MLR.3 (no perfect collinearity) is not violated. However, there is a difference in interpretation:
If we estimate the following equation:
\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x + \hat{\beta}_2x^2, \]
then we can approximate:
\[ \frac{\Delta \hat{y}}{\Delta x}\approx\hat{\beta}_1 +2\hat{\beta}_2x. \]
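For example, with hypothetical estimates \(\hat{\beta}_1 = 0.30\) and \(\hat{\beta}_2 = -0.006\) (made-up numbers, purely for illustration), the estimated effect of a one-unit increase in \(x\) is

\[ \frac{\Delta \hat{y}}{\Delta x}\approx 0.30 - 0.012\,x, \]

so the effect is positive for small \(x\), shrinks as \(x\) grows, and changes sign at the turning point \(x^{*}=-\hat{\beta}_1/(2\hat{\beta}_2)=25\).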
We can also model situations in which the effect of one variable depends on the value of another variable. We use an interaction term for this:
\[ y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \textcolor{var(--secondary-color)}{\beta_3\:\:x_1\times x_2} + u. \]
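Taking the partial derivative with respect to \(x_1\) makes explicit that the effect of \(x_1\) now depends on the level of \(x_2\):

\[ \frac{\partial y}{\partial x_1} = \beta_1 + \beta_3\,x_2, \]

so \(\beta_1\) by itself is the effect of \(x_1\) only at \(x_2=0\).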
The average partial effect (APE) is a helpful metric for interpretation when the coefficients themselves do not represent partial effects (due to logs, quadratic terms, or interactions).
Assume we have the following model:
\[ y = \beta_0 + \beta_1\textcolor{var(--primary-color)}{x_1} + \beta_2\textcolor{var(--secondary-color)}{x_2} + \beta_3\textcolor{var(--secondary-color)}{x_2^2} + \beta_4\textcolor{var(--primary-color)}{x_1}\textcolor{var(--secondary-color)}{x_2} + u. \]
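Here the partial effects are \(\partial y/\partial x_1=\beta_1+\beta_4 x_2\) and \(\partial y/\partial x_2=\beta_2+2\beta_3 x_2+\beta_4 x_1\), which vary across observations. A minimal sketch of the APE computation, with made-up data and assumed (not estimated) coefficient values:

```python
import numpy as np

# APE sketch: evaluate the partial effect at every observation, then average.
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=500), rng.normal(size=500)   # made-up data
b1, b2, b3, b4 = 0.5, 1.0, -0.2, 0.3                  # assumed coefficient values

pe_x1 = b1 + b4 * x2                  # dy/dx1 = beta_1 + beta_4 * x2
pe_x2 = b2 + 2 * b3 * x2 + b4 * x1    # dy/dx2 = beta_2 + 2*beta_3*x2 + beta_4*x1

print("APE of x1:", pe_x1.mean())
print("APE of x2:", pe_x2.mean())
```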
We know that \(\textcolor{var(--tertiary-color)}{R^2}\) is only of limited use for evaluating and comparing models. Certain problems specifically affect the multivariate case:
\[ \textcolor{var(--tertiary-color)}{R^2} = \frac{\textcolor{var(--secondary-color)}{\mathrm{SSE}}}{\textcolor{var(--primary-color)}{\mathrm{SST}}} = 1- \frac{\textcolor{var(--quarternary-color)}{\mathrm{SSR}}}{\textcolor{var(--primary-color)}{\mathrm{SST}}}. \]
Adjusted R² is one way to get around the issue that \(\textcolor{var(--tertiary-color)}{R^2}\) always increases as \(K\) increases:
\[ \textcolor{var(--secondary-color)}{R^2_{\mathrm{adj.}}} = 1- \frac{\textcolor{var(--quarternary-color)}{\mathrm{SSR}}/(N-K-1)}{\textcolor{var(--primary-color)}{\mathrm{SST}}/(N-1)} = 1-\left(1-\textcolor{var(--tertiary-color)}{R^2}\right)\times\frac{N-1}{N-K-1}. \]
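A one-line sketch of this formula in Python (the numbers below are hypothetical):

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 from R^2, N observations, and K regressors (excluding the intercept)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical example: R^2 = 0.30, N = 100, K = 5
print(adjusted_r2(0.30, 100, 5))   # about 0.263
```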
So far, we’ve only seen one method for choosing between models: the F-test. The F-test only allows us to compare nested models, i.e., situations where one model is a special case of another.
Adjusted R² gives us a (first and simple) way to compare models that are nonnested.
With dummy variables, we can incorporate qualitative information into our model.
\[ y = \beta_0 + \beta_1x_1+\dots + u,\qquad x_1\in\{0,1\} \]
We interpreted the coefficients in such a case as follows:
\[ \mathrm{E}(y\mid x_1=1) = \beta_0+ \beta_1+\cdots, \qquad\mathrm{E}(y\mid x_1=0) = \beta_0+\cdots. \]
With the methods of multiple linear regression, we can also include variables with more than two categories.
Suppose we want to use a car’s color as a regressor. In our population, there are black, red, and blue cars. In principle, we can create three dummy variables:
\[ \mathrm{black}_i = \begin{cases} 1&\text{if }i\text{ is black},\\ 0&\text{otherwise} \end{cases}, \qquad \mathrm{red}_i = \begin{cases} 1&\text{if }i\text{ is red},\\ 0&\text{otherwise} \end{cases}, \qquad \mathrm{blue}_i = \begin{cases} 1&\text{if }i\text{ is blue},\\ 0&\text{otherwise} \end{cases}. \]
Suppose we estimate the model
\[ y = \beta_0 + \beta_1\:\mathrm{black}+\beta_2\:\mathrm{red}+\beta_3\:\mathrm{blue}+u. \]
The regressor matrix \(\boldsymbol{X}\) would then look like this:
\[ \boldsymbol{X}= \begin{pmatrix} 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 0 & 1 \\ \vdots & \vdots & \vdots & \vdots \end{pmatrix} \]
What’s the problem with this matrix? The fourth column (the blue dummy, whose coefficient is \(\beta_3\)) is an exact linear combination of the other columns: \(\mathrm{blue}=1-\mathrm{black}-\mathrm{red}\). This perfect collinearity violates MLR.3, so we have to drop either one of the dummies or the intercept.
So let’s define \(\mathrm{blue}_i\) as the benchmark and estimate:
\[ y = \beta_0 + \beta_1\:\mathrm{black}+\beta_2\:\mathrm{red}+u. \]
How do we interpret the parameters? \(\beta_0\) is the expected \(y\) for the benchmark group (blue cars), \(\beta_1\) is the expected difference between black and blue cars, and \(\beta_2\) is the expected difference between red and blue cars.
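A sketch of how this can be set up in practice, with made-up data and pandas/statsmodels as assumed library choices (not prescribed by the course):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Made-up data: car colors and some outcome y.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "color": rng.choice(["black", "red", "blue"], size=200),
    "y": rng.normal(size=200),
})

# Create the dummies and keep only black and red, so blue is the benchmark.
dummies = pd.get_dummies(df["color"])[["black", "red"]].astype(float)
X = sm.add_constant(dummies)       # intercept + black + red
res = sm.OLS(df["y"], X).fit()
print(res.params)                  # beta_0, beta_1 (black vs. blue), beta_2 (red vs. blue)
```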
Suppose there are other explanatory variables, e.g., a numerical variable \(x_3\). Then we interpret the parameters analogously:
\(\beta_0+\beta_1\) is then the expected \(y\) value for a black car with \(x_3=0\). In a way, \(\beta_0+\beta_1\) is then a group-specific intercept for black cars.
So we can model different intercepts for each group. Can we also model different slopes for each group? Yes, with interactions. Let’s consider the following model:
\[ y = \beta_0 + \beta_1\:\mathrm{black}+\beta_2\:\mathrm{red}+\beta_3x_3+\textcolor{var(--secondary-color)}{\beta_4\:\mathrm{black}\times x_3}+\textcolor{var(--secondary-color)}{\beta_5\:\mathrm{red}\times x_3}+u. \]
We interpret the parameters as follows: \(\beta_1\) and \(\beta_2\) shift the intercept for black and red cars relative to blue, while \(\beta_4\) and \(\beta_5\) shift the slope on \(x_3\).
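Writing out the regression function for each color group (this follows directly from the model above) makes the group-specific intercepts and slopes visible:

\[ \begin{aligned} \mathrm{E}(y\mid \mathrm{black},x_3) &= (\beta_0+\beta_1) + (\beta_3+\beta_4)\,x_3,\\ \mathrm{E}(y\mid \mathrm{red},x_3) &= (\beta_0+\beta_2) + (\beta_3+\beta_5)\,x_3,\\ \mathrm{E}(y\mid \mathrm{blue},x_3) &= \beta_0 + \beta_3\,x_3. \end{aligned} \]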
We can also use a dummy variable as a dependent variable. For example, we can examine what influences whether someone passes econometrics (study time, motivation, …):
\[ y = \beta_0 + \beta_1x_1+\dots+u,\qquad\qquad y_i = \begin{cases} 1&\text{if }i\text{ passes the econometrics course},\\ 0&\text{otherwise} \end{cases}. \]
Such a model is called a linear probability model.
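A minimal simulation sketch of a linear probability model, with a made-up pass/fail data-generating process and statsmodels as an assumed library choice. It also shows two well-known quirks: fitted "probabilities" can fall outside \([0,1]\), and the errors are heteroskedastic by construction, so robust standard errors are used:

```python
import numpy as np
import statsmodels.api as sm

# Made-up data: pass/fail (0/1) explained by weekly study hours.
rng = np.random.default_rng(2)
hours = rng.uniform(0, 20, size=300)
passed = (0.2 + 0.04 * hours + rng.normal(scale=0.3, size=300) > 0.5).astype(float)

X = sm.add_constant(hours)
lpm = sm.OLS(passed, X).fit(cov_type="HC1")    # heteroskedasticity-robust standard errors
print(lpm.params)                              # slope = change in P(pass) per extra study hour

fitted = lpm.fittedvalues
print((fitted < 0).sum(), (fitted > 1).sum())  # fitted values outside [0, 1] are possible
```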