Econometrics I
April 10, 2025
Let’s start with the simplest possible model with more than two variables:
\[ \textcolor{var(--primary-color)}{y_i}=\beta_0+\beta_1\textcolor{var(--secondary-color)}{x_{i1}}+\beta_2\textcolor{var(--secondary-color)}{x_{i2}}+u_i \]
How do we interpret the parameters in such a model?
The parameter
\[ \beta_1=\frac{\partial\mathrm{E}(y_i\mid x_{i1},x_{i2})}{\partial x_{i1}} \]
measures the expected change in \(y_i\) when \(x_{i1}\) increases by one unit, holding \(x_{i2}\) fixed. This interpretation is often referred to as the ceteris paribus interpretation; it is important to note, however, that only the variables observed and included in the model are actually held fixed.
What does that look like in an example?
\[ \textcolor{var(--primary-color)}{\mathrm{Wage}_i}=\beta_0+\beta_1\textcolor{var(--secondary-color)}{\mathrm{Education}_{i}}+\beta_2\textcolor{var(--secondary-color)}{\mathrm{Experience}_{i}}+u_i \]
In this model, the parameter
\[ \beta_1=\frac{\partial\mathrm{E}(\mathrm{Wage}_i\mid \mathrm{Education}_i,\mathrm{Experience}_i)}{\partial \mathrm{Education}_i} \]
measures the expected change in wage when education increases by one unit, holding experience constant.
We can add as many variables as we want:
\[ \textcolor{var(--primary-color)}{\mathrm{Wage}_i}=\beta_0+\beta_1\textcolor{var(--secondary-color)}{\mathrm{Education}_{i}}+\beta_2\textcolor{var(--secondary-color)}{\mathrm{Experience}_{i}}+\beta_3\textcolor{var(--secondary-color)}{\mathrm{Age}_{i}}+\beta_4\textcolor{var(--secondary-color)}{\mathrm{CareerYears}_{i}}+\beta_5\textcolor{var(--secondary-color)}{\mathrm{Union}_{i}}+u_i \]
Let’s try to derive the OLS estimators as we did in the bivariate case. We begin by setting up a loss function:
\[ \left(\hat{\beta}_0,\hat{\beta}_1,\dots,\hat{\beta}_K\right) = \underset{\left(\tilde{\beta}_0,\tilde{\beta}_1,\dots,\tilde{\beta}_K\right)}{\mathrm{arg\:min}}\sum^N_{i=1}\left(y_i-\tilde{\beta}_0-\tilde{\beta}_1x_{i1}-\dots-\tilde{\beta}_Kx_{iK}\right)^2. \]
We can differentiate and set it to zero and obtain a system of first-order conditions:
\[ \begin{aligned} \textstyle-2\sum^N_{i=1}\left(y_i-\tilde{\beta}_0-\tilde{\beta}_1x_{i1}-\dots-\tilde{\beta}_Kx_{iK}\right)&=0 \\ \textstyle-2\sum^N_{i=1}x_{i1}\left(y_i-\tilde{\beta}_0-\tilde{\beta}_1x_{i1}-\dots-\tilde{\beta}_Kx_{iK}\right)&=0 \\ \textstyle&\vdots\\ \textstyle-2\sum^N_{i=1}x_{iK}\left(y_i-\tilde{\beta}_0-\tilde{\beta}_1x_{i1}-\dots-\tilde{\beta}_Kx_{iK}\right)&=0 \\ \end{aligned} \]
This system of equations is solvable: it is linear, with \(K+1\) equations in \(K+1\) unknowns. However, without matrix algebra we cannot write down a compact solution for the \(\hat{\beta}_k\).
As in the bivariate case, we can interpret these first-order conditions as moment conditions.
When we have multiple variables, the summation notation used in the last chapter becomes increasingly tedious. We therefore use vectors and matrices to write models with more than two variables.
Let’s take another look at this model (slightly less extensive than before):
\[ \textcolor{var(--primary-color)}{\mathrm{Wage}_i}=\beta_0\cdot\textcolor{var(--secondary-color)}{1}+\beta_1\textcolor{var(--secondary-color)}{\mathrm{Education}_{i}}+\beta_2\textcolor{var(--secondary-color)}{\mathrm{Experience}_{i}}+\beta_3\textcolor{var(--secondary-color)}{\mathrm{Age}_{i}}+u_i \]
Note that we have added a \(1\) for the constant parameter. If we treat this \(1\) like a variable, we can treat \(\beta_0\) like the other parameters and construct a vector of variables and a vector of parameters:
\[ \textcolor{var(--secondary-color)}{\boldsymbol{x}_i}= \begin{pmatrix} \textcolor{var(--secondary-color)}{1}\\ \textcolor{var(--secondary-color)}{\mathrm{Education}_{i}}\\ \textcolor{var(--secondary-color)}{\mathrm{Experience}_{i}}\\ \textcolor{var(--secondary-color)}{\mathrm{Age}_{i}} \end{pmatrix} ,\qquad \boldsymbol{\beta}= \begin{pmatrix} \beta_0\\ \beta_1\\ \beta_2\\ \beta_3 \end{pmatrix} \]
We can transpose the variable vector, i.e., switch rows and columns. We mark this with a prime:
\[ \textcolor{var(--secondary-color)}{\boldsymbol{x}_i'}=\left(\textcolor{var(--secondary-color)}{1},\textcolor{var(--secondary-color)}{\mathrm{Education}_{i}},\textcolor{var(--secondary-color)}{\mathrm{Experience}_{i}},\textcolor{var(--secondary-color)}{\mathrm{Age}_{i}}\right) \]
Now we can make use of the rules of matrix multiplication:
\[ \textcolor{var(--secondary-color)}{\boldsymbol{x}_i'}\boldsymbol{\beta}=\textcolor{var(--secondary-color)}{1}\cdot\beta_0+\textcolor{var(--secondary-color)}{\mathrm{Education}_{i}}\beta_1+\textcolor{var(--secondary-color)}{\mathrm{Experience}_{i}}\beta_2+\textcolor{var(--secondary-color)}{\mathrm{Age}_{i}}\beta_3 \]
We can now write our regression model very compactly, no matter how many variables we have:
\[ \textcolor{var(--primary-color)}{y_i}=\textcolor{var(--secondary-color)}{\boldsymbol{x}_i'}\boldsymbol{\beta}+u_i. \]
We can also write the OLS optimization problem as:
\[ \textstyle\hat{\boldsymbol{\beta}} = \underset{\tilde{\boldsymbol{\beta}}}{\mathrm{arg\:min}}\sum^N_{i=1}\left(y_i-\boldsymbol{x}_i'\tilde{\boldsymbol{\beta}}\right)^2. \]
The solution to this problem is:
\[ \textstyle\hat{\boldsymbol{\beta}} = \left(\sum^N_{i=1}\boldsymbol{x}_i\boldsymbol{x}_i'\right)^{-1}\left(\sum^N_{i=1}\boldsymbol{x}_iy_i\right) \]
Exercise
Solve this optimization problem!
This equation describes the regression model for a single observation \(i\).
\[ \textcolor{var(--primary-color)}{y_i}=\textcolor{var(--secondary-color)}{\boldsymbol{x}_i'}\boldsymbol{\beta}+u_i. \]
But we can make the model even more compact by using just one equation for all observations. For this, we define:
\[ \boldsymbol{y}= \begin{pmatrix} y_1\\ y_2\\ \vdots\\ y_N \end{pmatrix} ,\qquad \boldsymbol{X}= \begin{pmatrix} 1 & x_{11} & \dots & x_{1K} \\ 1 & x_{21} & \dots & x_{2K} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{N1} & \dots & x_{NK} \end{pmatrix} ,\qquad \boldsymbol{u}= \begin{pmatrix} u_1\\ u_2\\ \vdots\\ u_N \end{pmatrix}. \]
Our model now looks like this:
\[ \textcolor{var(--primary-color)}{\boldsymbol{y}}=\textcolor{var(--secondary-color)}{\boldsymbol{X}}\boldsymbol{\beta}+\boldsymbol{u}. \]
Exercise
What are the dimensions of \(\boldsymbol{y}\), \(\boldsymbol{\beta}\), \(\boldsymbol{X}\), and \(\boldsymbol{u}\)?
\(\boldsymbol{u}\) is the vector of error terms, i.e., a vector of random variables, each with mean \(0\) and variance \(\sigma^2\). But what is the variance of the vector \(\boldsymbol{u}\) itself?
The variance of a vector is a matrix. The diagonal elements of this matrix are the variances of the individual elements of the vector. The off-diagonal elements are the covariances between the individual elements.
We also call such a matrix a variance-covariance matrix (VCM):
\[ \mathrm{Var}(\boldsymbol{u}) = \begin{pmatrix} \mathrm{Cov}(u_1,u_1) & \dots & \mathrm{Cov}(u_1,u_N) \\ \vdots & \ddots & \vdots \\ \mathrm{Cov}(u_N,u_1) & \dots & \mathrm{Cov}(u_N,u_N) \\ \end{pmatrix} = \begin{pmatrix} \mathrm{Var}(u_1) & \dots & \mathrm{Cov}(u_1,u_N) \\ \vdots & \ddots & \vdots \\ \mathrm{Cov}(u_N,u_1) & \dots & \mathrm{Var}(u_N) \\ \end{pmatrix} \]
In matrix notation, the sum of squared residuals is:
\[ \hat{\boldsymbol{u}}'\hat{\boldsymbol{u}} = (\boldsymbol{y}-\boldsymbol{X}\tilde{\boldsymbol{\beta}})'(\boldsymbol{y}-\boldsymbol{X}\tilde{\boldsymbol{\beta}}), \]
and the OLS optimization problem is:
\[ \hat{\boldsymbol{\beta}} = \underset{\tilde{\boldsymbol{\beta}}}{\mathrm{arg\:min}}\:\hat{\boldsymbol{u}}'\hat{\boldsymbol{u}}. \]
When we solve this problem, we get the estimator
\[ \hat{\boldsymbol{\beta}}=(\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{y}, \]
but to get there we have to differentiate with respect to a vector. To avoid this for now, we instead use the method of moments.
In matrix notation, the moment conditions for our model can be written as a single vector equation:
\[ \mathrm{E}(\boldsymbol{X}'\boldsymbol{u})=\boldsymbol{0}. \]
If our \(\boldsymbol{X}\), as defined earlier, has a column of 1s, this condition also implies that \(\mathrm{E}(\boldsymbol{u})=0\).
We again begin by replacing the population moments with sample moments. From \(\mathrm{E}(\boldsymbol{X}'\boldsymbol{u})=\boldsymbol{0}\) we thus get:
\[ \boldsymbol{X}'\hat{\boldsymbol{u}}=\boldsymbol{0}. \]
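A quick numerical check of this condition (a minimal R sketch using the built-in mtcars data, which is not part of the example in the text; the choice of model is arbitrary):

```r
# Fit an arbitrary OLS model on the built-in mtcars data
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Regressor matrix X (including the column of 1s) and residual vector u-hat
X <- model.matrix(fit)
u_hat <- resid(fit)

# X'u-hat is a vector of zeros up to floating-point error
crossprod(X, u_hat)
```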
We can now derive our OLS estimator very easily:
\[ \begin{aligned} \boldsymbol{X}'\hat{\boldsymbol{u}}=\boldsymbol{X}'(\boldsymbol{y}-\boldsymbol{X}\hat{\boldsymbol{\beta}})&=0 \\ \boldsymbol{X}'\boldsymbol{y}-\boldsymbol{X}'\boldsymbol{X}\hat{\boldsymbol{\beta}}&=0 \\ \boldsymbol{X}'\boldsymbol{y}&=\boldsymbol{X}'\boldsymbol{X}\hat{\boldsymbol{\beta}} \\ (\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{y}&=(\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{X}\hat{\boldsymbol{\beta}} \\ (\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{y}&=\hat{\boldsymbol{\beta}} \end{aligned} \]
We obtain the same estimator as in the derivation via optimization problem,
\[ \colorbox{var(--primary-color-lightened)}{$\hat{\boldsymbol{\beta}}=(\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{y}.$} \]
We have learned about our OLS estimator in three different notations:
In summation notation (for the bivariate case):
\[ \hat{\beta}_1=\frac{\sum^N_{i=1}(x_i-\bar{x})(y_i-\bar{y})}{\sum^N_{i=1}(x_i-\bar{x})^2}, \]
in vector notation:
\[ \hat{\boldsymbol{\beta}} = \left(\sum^N_{i=1}\boldsymbol{x}_i\boldsymbol{x}_i'\right)^{-1}\left(\sum^N_{i=1}\boldsymbol{x}_iy_i\right), \]
and in matrix notation:
\[ \hat{\boldsymbol{\beta}} =(\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{y}. \]
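As a sanity check, here is a small R sketch (again with the built-in mtcars data and an arbitrary choice of regressors) that evaluates the matrix formula directly and compares it with lm():

```r
# Build y and X (with an intercept column) by hand from mtcars
y <- mtcars$mpg
X <- cbind(1, mtcars$wt, mtcars$hp)

# OLS via the matrix formula; solve(A, b) computes A^{-1} b
beta_hat <- solve(t(X) %*% X, t(X) %*% y)
beta_hat

# The same numbers as reported by lm()
coef(lm(mpg ~ wt + hp, data = mtcars))
```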
Multivariate vs. bivariate models
Just like in simple linear regression, in multiple linear regression we split each observed \(y_i\) into a fitted value \(\hat{y}_i\) and a residual \(\hat{u}_i\):
\[ y_i=\underbrace{\hat{\beta}_0+\hat{\beta}_1x_{i1}+\dots+\hat{\beta}_Kx_{iK}}_{\hat{y}_i}+\hat{u}_i. \]
If we estimate the simple linear regression model
\[ y_i = \beta_0^*+\beta_1^*x_{i1}+u_i \] and the multiple linear regression model
\[ y_i = \beta_0+\beta_1x_{i1}+\dots+\beta_Kx_{iK}+u_i \]
then the estimates \(\hat{\beta}_1^*\) and \(\hat{\beta}_1\) will generally not coincide. Only in two special cases are they the same: when the estimated coefficients on the additional regressors are all zero, or when \(x_{i1}\) is uncorrelated with the additional regressors in the sample.
We can look at this property more closely for the simple case of one or two regressors. We consider the simple linear regression model
\[ y_i = \beta_0^*+\beta_1^*x_{i1}+u_i \]
and the multiple linear regression model with two regressors
\[ y_i = \beta_0+\beta_1x_{i1}+\beta_2x_{i2}+u_i. \]
Here, one can show that
\[ \hat{\beta}_1^*=\hat{\beta}_1+\hat{\beta}_2\hat{\delta}\qquad\Rightarrow\qquad\mathrm{E}\left(\hat{\beta}_1^*\right)=\beta_1^*=\beta_1+\beta_2\delta, \]
where \(\delta\) denotes the slope parameter from a regression of \(x_2\) on \(x_1\). We see: \(\hat{\beta}_1^*\) and \(\hat{\beta}_1\) are only equal if either \(\hat{\beta}_2\) or \(\hat{\delta}\) is equal to 0.
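This identity is easy to verify numerically. A minimal R sketch, using the built-in mtcars data purely for illustration:

```r
# Short regression: mpg on wt only
b1_star <- coef(lm(mpg ~ wt, data = mtcars))["wt"]

# Long regression: mpg on wt and disp
b_long <- coef(lm(mpg ~ wt + disp, data = mtcars))

# Auxiliary regression of disp on wt gives delta-hat
delta_hat <- coef(lm(disp ~ wt, data = mtcars))["wt"]

# The identity holds exactly (up to floating-point error)
b1_star
b_long["wt"] + b_long["disp"] * delta_hat
```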
We do not know whether one model or the other is “correct”. In particular, the larger model is not necessarily automatically better than the smaller model. We need to consider which model better fits our assumptions.
Just like in the bivariate case, we can divide the variation in \(y\) into an explained part, i.e., variation originating from variation in \(x\), and an unexplained part, i.e., a part resulting from unobserved factors:
\[ \begin{aligned} \textcolor{var(--primary-color)}{\sum^N_{i=1}\left(y_i-\bar{y}\right)^2} &= \textcolor{var(--secondary-color)}{\sum^N_{i=1}\left(\hat{y}_i-\bar{y}\right)^2} + \textcolor{var(--quarternary-color)}{\sum^N_{i=1}\hat{u}_i^2}\\ \textcolor{var(--primary-color)}{\mathrm{SST}} &= \textcolor{var(--secondary-color)}{\mathrm{SSE}} + \textcolor{var(--quarternary-color)}{\mathrm{SSR}} \end{aligned} \]
And just like in the bivariate case, the coefficient of determination \(R^2\) is a measure of goodness of fit: it indicates what share of the variation is explained by our model and has exactly the same issues we discussed in the bivariate case. In addition, \(R^2\) never decreases when we add variables.
\[ R^2 = \frac{\textcolor{var(--secondary-color)}{\mathrm{SSE}}}{\textcolor{var(--primary-color)}{\mathrm{SST}}} = 1- \frac{\textcolor{var(--quarternary-color)}{\mathrm{SSR}}}{\textcolor{var(--primary-color)}{\mathrm{SST}}}. \]
Let’s take a look at what we have discussed theoretically in an applied example. We start by loading the dataset and selecting the variables we need:
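The exact loading code is not reproduced here; the following is a minimal sketch, assuming the CPSSW8 data ships with the AER package (consistent with the lm() call shown further below) and that we keep earnings, education, gender, and age:

```r
# Load the CPS extract and keep only the variables used below
library(AER)          # assumed source of the CPSSW8 data
data("CPSSW8")
cps <- CPSSW8[, c("earnings", "education", "gender", "age")]
head(cps)
```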
Practice task
Select different variables and run regressions like on the next slides.
Next, we can take a look at a summary of our variables.
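For instance (a sketch, continuing with the cps data frame defined above):

```r
# Descriptive statistics for the selected variables
summary(cps)
```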
In a bivariate regression of log earnings on education, the slope coefficient is 0.094 and the intercept is 1.485.
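A sketch of the corresponding call (the object name fit_biv is made up here and reused in later sketches):

```r
# Bivariate regression: log earnings on years of education
fit_biv <- lm(log(earnings) ~ education, data = CPSSW8)
coef(fit_biv)   # intercept approx. 1.485, slope approx. 0.094
```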
What happens if we add more variables?
The slope coefficient for education is 0.094. The coefficient for gender==female is –0.234. The coefficient for age is 0.0089.
The intercept is 1.22.
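A sketch of the corresponding call (the object name fit_multi is again made up and reused below):

```r
# Multiple regression: log earnings on education, gender, and age
fit_multi <- lm(log(earnings) ~ education + gender + age, data = CPSSW8)
coef(fit_multi)
```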
When we run a regression in R and use the summary() function, we get the following output:
```
Call:
lm(formula = log(earnings) ~ education + gender + age, data = CPSSW8)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.79472 -0.28807  0.02562  0.32439  1.63195 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.2206890  0.0130145   93.80   <2e-16 ***
education     0.0941281  0.0007922  118.82   <2e-16 ***
genderfemale -0.2338747  0.0039207  -59.65   <2e-16 ***
age           0.0088690  0.0001839   48.22   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4816 on 61391 degrees of freedom
Multiple R-squared:  0.2452,    Adjusted R-squared:  0.2452 
F-statistic:  6649 on 3 and 61391 DF,  p-value: < 2.2e-16
```
In a paper or bachelor thesis, results are usually presented in a table like this:
|              | Dependent variable: log(earnings) |                   |
|--------------|-----------------------------------|-------------------|
|              | (1)                               | (2)               |
| education    | 0.094*** (0.001)                  | 0.094*** (0.001)  |
| genderfemale |                                   | -0.234*** (0.003) |
| age          |                                   | 0.009*** (0.0002) |
| Constant     | 1.485*** (0.011)                  | 1.221*** (0.013)  |
| Observations | 61,395                            | 61,395            |
| R²           | 0.174                             | 0.245             |

Note: *p<0.1; **p<0.05; ***p<0.01. Numbers in parentheses are standard errors.
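Such tables are usually generated directly from the fitted model objects. A sketch using the stargazer package (one option among several, e.g. modelsummary or texreg would also work; fit_biv and fit_multi are the objects from the sketches above):

```r
# Side-by-side regression table from the fitted models above
library(stargazer)
stargazer(fit_biv, fit_multi,
          type = "text",                       # "latex" or "html" for papers
          dep.var.labels = "log(earnings)")
```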
OLS Assumptions 1 to 4

To make statements about expectation and variance, we again need a set of assumptions. These assumptions, MLR.1 to MLR.4, are generalized versions of the assumptions SLR.1 to SLR.4 from the previous chapter.
Gauss-Markov Theorem: Assumptions for Multiple Linear Regression (MLR)
MLR.1 (Linearity in Parameters)

The population regression function (PRF) must be linear in its parameters:
\[ y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_Kx_{iK} + u_i \]
MLR.2 (Random Sampling)

Our sample of \(N\) observations, \(\left\{\left(y_i,x_{i1},\dots,x_{iK}\right), i = 1, 2, \dots, N\right\}\), must be randomly drawn from the population: the probability of being included must be the same for every observation and must not depend on whom we sampled first.
MLR.3 (No Perfect Collinearity)

In the bivariate case, we assumed variation in the \(x\) values at this point. Here, in the multivariate case, we require a broader assumption: no regressor may be a linear combination of the other regressors. Formally, we can say:

The regressor matrix \(\boldsymbol{X}\) contains no column that is a linear combination of the other columns; equivalently, \(\boldsymbol{X}\) has full column rank.
We will discuss squared regressors and interaction terms in more detail in a later module. At this point, it’s only important to know that they exist and that they do not violate MLR.3.
MLR.4 (Zero Conditional Mean)

The expected value of the error term \(u_i\), conditional on the regressors, is 0:
\[ \mathrm{E}\left(u_i\mid x_{i1},\dots,x_{iK}\right) = 0 \]
In matrix notation (this version is slightly stronger, as it conditions jointly on the regressors of all observations):
\[ \mathrm{E}\left(\boldsymbol{u}\mid\boldsymbol{X}\right) = \boldsymbol{0} \]
We assume that regressors and unobserved factors are independent. This is easy to achieve in experiments, but much harder with observational data. We call the case in which MLR.4 is violated endogeneity.
When \(\mathrm{E}(x_{ik}u_i)\neq 0\), we call \(x_{ik}\) an endogenous regressor. This can happen, for example, when a relevant variable that is correlated with \(x_{ik}\) is omitted from the model; we return to this case at the end of the chapter.
If the four assumptions MLR.1 to MLR.4 are met, we can prove that the OLS estimator is unbiased. Formally:
Under assumptions MLR.1 to MLR.4, we have: \(\mathrm{E}\left(\hat{\beta}_k\right) = \beta_k,\qquad\qquad k=0,1,\dots,K,\) for every value of the parameters \(\beta_j\). In matrix notation:
\[ \mathrm{E}\left(\hat{\boldsymbol{\beta}}\right)=\boldsymbol{\beta}, \]
where \(\boldsymbol{\beta}\) has dimension \((K+1)\times 1\).
The OLS estimator is therefore an unbiased estimator of the intercept and all slope parameters. We can again prove this by splitting the estimator into the true coefficient and a sample error component.
We start by decomposing \(\hat{\boldsymbol{\beta}}\):
\[ \begin{aligned} \hat{\boldsymbol{\beta}} &= (\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{y}\\ &= (\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'(\boldsymbol{X}\boldsymbol{\beta}+\boldsymbol{u}) \\ &= (\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{X}\boldsymbol{\beta}+(\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{u} \\ &= \underbrace{\boldsymbol{\beta}}_{\text{true parameter}}+\underbrace{(\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{u}}_{\text{sampling error}}. \end{aligned} \]
We can compare this step to the step in the proof for the SLR case, in which we decomposed \(\hat{\beta}_1\) as follows:
\[ \hat{\beta}_1 = \beta_1+\frac{\sum^N_{i=1}(x_i-\bar{x})u_i}{\sum^N_{i=1}(x_i-\bar{x})x_i}. \]
With this decomposition, we can proceed in the proof:
\[ \begin{aligned} \mathrm{E}\left(\hat{\boldsymbol{\beta}}\middle|\boldsymbol{X}\right) &= \mathrm{E}\left(\boldsymbol{\beta}+(\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{u}\middle|\boldsymbol{X}\right) \\ &= \boldsymbol{\beta}+\mathrm{E}\left((\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{u}\middle|\boldsymbol{X}\right) \\ &= \boldsymbol{\beta}+(\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\underbrace{\mathrm{E}\left(\boldsymbol{u}\middle|\boldsymbol{X}\right)}_{=0\text{ (MLR.4)}} \\ &= \boldsymbol{\beta}. \end{aligned} \]
Since \(\mathrm{E}\left(\hat{\boldsymbol{\beta}}\middle|\boldsymbol{X}\right)=\boldsymbol{\beta}\quad\Rightarrow\quad\mathrm{E}\left(\hat{\boldsymbol{\beta}}\right)=\boldsymbol{\beta}\) (law of iterated expectations), the OLS estimator is unbiased.
\(\square\)
MLR.5 (Homoskedasticity)

The variance of the error term \(u_i\) is the same for all values of the regressors:
\[ \mathrm{Var}(u_i\mid x_{i1},\dots,x_{iK}) = \mathrm{Var}(u_i) = \sigma^2, \]
or in matrix notation:
\[ \mathrm{Var}(\boldsymbol{u}\mid\boldsymbol{X}) = \sigma^2\boldsymbol{I}_N, \]
where \(\boldsymbol{I}_N\) is the \(N\times N\) identity matrix.
Under assumptions MLR.1 to MLR.5, the variance of the OLS estimator is
\[ \mathrm{Var}\left(\hat{\boldsymbol{\beta}}\right)=\sigma^2(\boldsymbol{X}'\boldsymbol{X})^{-1}, \]
where \(\mathrm{Var}(\cdot)\) refers here to a variance-covariance matrix.
Exercise
Show how to derive this expression for the variance. At what step is each of the assumptions MLR.1 to MLR.5 needed? A derivation is given at the end of this chapter.
Let’s look at the bivariate model in matrix notation:
\[ \hat{\boldsymbol{\beta}} = \begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{pmatrix} ,\qquad\qquad \mathrm{Var}\left(\hat{\boldsymbol{\beta}}\right)= \begin{pmatrix} \mathrm{Var}(\hat{\beta}_0) & \mathrm{Cov}(\hat{\beta}_0,\hat{\beta}_1) \\ \mathrm{Cov}(\hat{\beta}_1,\hat{\beta}_0) & \mathrm{Var}(\hat{\beta}_1) \\ \end{pmatrix} \]
Until now, we’ve only discussed the sample variance of an estimator, not the sample covariance. Statistical software usually reports only the variances of the parameters (as standard errors), not the covariances, that is, only the diagonal of the variance-covariance matrix. We will need the covariances later for certain statistical tests.
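In R, however, the full estimated variance-covariance matrix is available via vcov(). A sketch, reusing fit_multi from the example above, that also rebuilds it by hand as \(\hat{\sigma}^2(\boldsymbol{X}'\boldsymbol{X})^{-1}\):

```r
# Full estimated variance-covariance matrix of the coefficients
V <- vcov(fit_multi)

# The same matrix built by hand: sigma-hat^2 * (X'X)^{-1}
X <- model.matrix(fit_multi)
sigma2_hat <- sum(resid(fit_multi)^2) / df.residual(fit_multi)
V_by_hand <- sigma2_hat * solve(t(X) %*% X)
all.equal(V, V_by_hand)

# The reported standard errors are the square roots of the diagonal
sqrt(diag(V))
```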
Analogous to the explicit formula in the bivariate case, we can derive from the matrix expression above a formula for the variance of an individual coefficient:
\[ \mathrm{Var}\left(\hat{\beta}_k\middle|\boldsymbol{X}\right)=\frac{\sigma^2}{\sum^N_{i=1}(x_{ik}-\bar{x}_k)^2}\times\frac{1}{1-R^2_k}, \]
where \(R^2_k\) is the \(R^2\) from a regression of \(x_k\) on all other regressors \(x_j, j\neq k\).
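A sketch that checks this formula for the education coefficient in the example regression, reusing the objects from the previous sketch and the estimated error variance \(\hat{\sigma}^2\) introduced just below in place of \(\sigma^2\):

```r
# Auxiliary regression of education on all other regressors
aux  <- lm(education ~ gender + age, data = CPSSW8)
R2_k <- summary(aux)$r.squared

# Variance of beta-hat_education via the formula ...
x_k <- CPSSW8$education
var_formula <- sigma2_hat / sum((x_k - mean(x_k))^2) / (1 - R2_k)

# ... matches the corresponding diagonal element of vcov()
c(formula = var_formula, vcov = vcov(fit_multi)["education", "education"])
```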
Just like in the bivariate case, we do not know the variance \(\sigma^2\), and need an estimator.
It can be shown (we omit the proof) that the estimator
\[ \hat{\sigma}^2 = \frac{\sum^N_{i=1}\hat{u}_i^2}{N-K-1} = \frac{\hat{\boldsymbol{u}}'\hat{\boldsymbol{u}}}{N-K-1} \]
is an unbiased estimator of the error variance under assumptions MLR.1 to MLR.5, i.e., \(\mathrm{E}\left(\hat{\sigma}^2\right) = \sigma^2\).
We divide by \(N-K-1\) (not by \(N\)) to correct for degrees of freedom: our estimation uses up \(K+1\) coefficients estimated from \(N\) observations, so we are left with \(N-K-1\) degrees of freedom. We made the same correction in the bivariate case.
Now we can formulate the Gauss-Markov Theorem for the multivariate case, analogous to the bivariate case:
Under assumptions MLR.1 to MLR.5, the OLS estimator
\[ \hat{\boldsymbol{\beta}}= \begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \vdots \\ \hat{\beta}_K \end{pmatrix} \]
is the best linear unbiased estimator (BLUE) of the parameters \(\boldsymbol{\beta}=(\beta_0,\beta_1,\dots,\beta_K)'\).
It is not intuitively easy to understand what the coefficients in a multivariate model actually measure. The Frisch-Waugh-Lovell Theorem gives us an additional approach.
We consider the following model:
\[ y_i=x_{i1}\beta_1+\boldsymbol{x}'_{i2}\boldsymbol{\beta}_2+u_i,\qquad\mathrm{E}\left(\binom{x_{i1}}{\boldsymbol{x}_{i2}}u_i\right)=0. \]
We can assume that \(y_i\) is wage, \(x_{i1}\) is gender, and \(\boldsymbol{x}_{i2}\) is a vector consisting of a column of 1s, education, and age. We assume that we are primarily interested in \(\beta_1\), and therefore highlight \(x_{i1}\) and group the rest of the model in vector notation.
The variables we are not primarily interested in are typically called control variables. We include them so that the model is complete.
We start by regressing \(y_i\) only on the vector \(\boldsymbol{x}_{i2}\) (and not on \(x_{i1}\)). From this regression, we “keep” the prediction errors, which we denote as \(y_i^{(R)}\).
\[ y_{i}=\boldsymbol{x}'_{i2}\boldsymbol{\alpha}+\textcolor{var(--primary-color)}{\underbrace{y_{i}^{(R)}}_{\text{error}}} \]
Next, we regress our variable of interest, \(x_{i1}\), on the vector \(\boldsymbol{x}_{i2}\), again “keeping” the prediction errors, which we denote as \(x_{i1}^{(R)}\).
\[ x_{i1}=\boldsymbol{x}'_{i2}\boldsymbol{\gamma}+\textcolor{var(--secondary-color)}{\underbrace{x_{i1}^{(R)}}_{\text{error}}} \]
Put simply, we now have a “version” of \(y_i\) with the influence of \(\boldsymbol{x}_{i2}\) filtered out, and a “version” of \(x_{i1}\) with the influence of \(\boldsymbol{x}_{i2}\) filtered out.
\[ y_{i}=\boldsymbol{x}'_{i2}\boldsymbol{\alpha}+\textcolor{var(--primary-color)}{\underbrace{y_{i}^{(R)}}_{\text{error}}} \]
\[ x_{i1}=\boldsymbol{x}'_{i2}\boldsymbol{\gamma}+\textcolor{var(--secondary-color)}{\underbrace{x_{i1}^{(R)}}_{\text{error}}} \]
Interestingly, we can obtain the same parameter \(\beta_1\) in two different ways: directly, from the full regression of \(y_i\) on \(x_{i1}\) and \(\boldsymbol{x}_{i2}\), or indirectly, from a simple regression of \(y_i^{(R)}\) on \(x_{i1}^{(R)}\).
If we have a sample of data, we can proceed as follows to obtain our estimator \(\hat{\beta}_1\) in this fashion: first, regress \(y_i\) on \(\boldsymbol{x}_{i2}\) and keep the residuals \(\hat{y}_i^{(R)}\); then, regress \(x_{i1}\) on \(\boldsymbol{x}_{i2}\) and keep the residuals \(\hat{x}_{i1}^{(R)}\); finally, regress \(\hat{y}_i^{(R)}\) on \(\hat{x}_{i1}^{(R)}\). The slope coefficient from this last regression is identical to \(\hat{\beta}_1\) from the full regression, as the sketch below illustrates.
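A sketch of these three steps in R for the wage example (the gender coefficient), reusing the CPSSW8 data; the dummy variable female and the object names are made up here:

```r
# Frisch-Waugh-Lovell step by step for the gender coefficient
# ('female' is a hand-made dummy: 1 for women, 0 for men)
female <- as.numeric(CPSSW8$gender == "female")

# Step 1: regress the outcome on the controls, keep the residuals
y_R <- resid(lm(log(earnings) ~ education + age, data = CPSSW8))

# Step 2: regress the regressor of interest on the controls, keep the residuals
x_R <- resid(lm(female ~ education + age, data = CPSSW8))

# Step 3: regress the residuals on each other -- the slope equals the
# coefficient on the gender dummy in the full regression (about -0.234)
coef(lm(y_R ~ x_R))["x_R"]
coef(lm(log(earnings) ~ female + education + age, data = CPSSW8))["female"]
```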
This result is known as the Frisch-Waugh-Lovell Theorem, after Frisch and Waugh (1933) and Lovell (1963). It can help us understand the parameters of the multivariate model intuitively.
We can illustrate the previous example with a causal graph: we assume that gender has an influence on wage, but the variables in \(\boldsymbol{x}_{i2}\) are also correlated with both gender and wage.
Now that we have the ability to include as many variables as we want in our regression, the question arises: which variables, and how many, should we include in our model?
Of course, there is no “rule of thumb” or universally valid answer to this question. Instead, we have to decide individually for each model and each variable whether it makes sense to include it.
If we omit relevant variables, we run into a problem of omitted variable bias. In this case, the effect that actually belongs to the omitted variable is incorrectly attributed to the variables included in the model.
What happens if we omit relevant variables from our model? Our estimator will no longer be unbiased, and we can prove this.
Assume this is the “true” model. We’ve split the regressors into two matrices, but in principle, it’s the same model we’ve been discussing in this chapter:
\[ \boldsymbol{y}=\boldsymbol{X\beta}+\textcolor{var(--secondary-color)}{\boldsymbol{Z\gamma}}+\boldsymbol{u} \]
What happens if we instead estimate this model?
\[ \boldsymbol{y}=\boldsymbol{X\beta}+\boldsymbol{u} \]
We again decompose \(\hat{\boldsymbol{\beta}}\), but use the true model for \(\boldsymbol{y}\).
\[ \begin{aligned} \hat{\boldsymbol{\beta}} &= (\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{y} \\ &= (\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'(\boldsymbol{X}\boldsymbol{\beta}+\boldsymbol{Z}\boldsymbol{\gamma}+\boldsymbol{u}) \\ &= (\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{X}\boldsymbol{\beta}+(\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{Z}\boldsymbol{\gamma}+(\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{u} \\ &= \boldsymbol{\beta}+(\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{Z}\boldsymbol{\gamma}+(\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{u} \end{aligned} \]
Now if we take the expectation of this expression, we see that the estimator is no longer unbiased.
\[ \begin{aligned} \mathrm{E}\left(\hat{\boldsymbol{\beta}}\middle|\boldsymbol{X}\right) &= \mathrm{E}\left( \boldsymbol{\beta}+(\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{Z}\boldsymbol{\gamma}+(\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{u}\middle|\boldsymbol{X}\right) \\ &= \boldsymbol{\beta} + \mathrm{E}\left( (\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{Z}\boldsymbol{\gamma}\middle|\boldsymbol{X}\right)+\mathrm{E}\left((\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{u}\middle|\boldsymbol{X}\right) \\ &= \boldsymbol{\beta} + \textcolor{var(--secondary-color)}{\underbrace{(\boldsymbol{X}'\boldsymbol{X})^{-1}\mathrm{E}\left(\boldsymbol{X}'\boldsymbol{Z}\middle|\boldsymbol{X}\right)\boldsymbol{\gamma}}_{\text{Bias}}}+\boldsymbol{0} \end{aligned} \]
We can very easily see what the “direction” of the bias depends on:
\[ \mathrm{E}\left(\hat{\boldsymbol{\beta}}\middle|\boldsymbol{X}\right) = \boldsymbol{\beta} + \textcolor{var(--secondary-color)}{\underbrace{(\boldsymbol{X}'\boldsymbol{X})^{-1}\mathrm{E}\left(\boldsymbol{X}'\boldsymbol{Z}\middle|\boldsymbol{X}\right)\boldsymbol{\gamma}}_{\text{Bias}}}+\boldsymbol{0} \]
|                                  | \(\mathrm{E}(\boldsymbol{X}'\boldsymbol{Z}\mid\boldsymbol{X})\) positive | \(\mathrm{E}(\boldsymbol{X}'\boldsymbol{Z}\mid\boldsymbol{X})\) negative |
|----------------------------------|---------------|---------------|
| \(\boldsymbol{\gamma}\) positive | Positive bias | Negative bias |
| \(\boldsymbol{\gamma}\) negative | Negative bias | Positive bias |
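A small simulation sketch illustrating the first cell of this table (all numbers and names are invented for illustration): z is positively correlated with x and has a positive coefficient, so omitting it biases the estimate of \(\beta_1\) upwards.

```r
set.seed(1)
n <- 10000

# Omitted variable z: positively correlated with x, positive coefficient
z <- rnorm(n)
x <- 0.8 * z + rnorm(n)
y <- 1 + 0.5 * x + 1.0 * z + rnorm(n)   # true beta_1 = 0.5, gamma = 1

# Short regression (z omitted): beta_1 is estimated with a positive bias
coef(lm(y ~ x))["x"]

# Long regression (z included): the estimate is close to the true 0.5
coef(lm(y ~ x + z))["x"]
```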
But too many variables can also be a problem, for example when an added variable leads to a violation of MLR.4 (as in the following example), or when it is strongly correlated with the other regressors and thereby inflates the variance of our estimates (recall the factor \(1/(1-R^2_k)\) above).
In the following example, adding an additional variable violates MLR.4:
The mtcars dataset contains 32 car models (1973–74) and their fuel consumption (mpg), weight (wt), displacement (disp), etc. Let’s begin with a simple linear regression of mpg on wt and then extend it to the following model:
\[ \textrm{mpg}_i=\beta_0 + \beta_1\textrm{wt}_i+\beta_2\textrm{wt}^2_i+\beta_3\textrm{disp}_i+u_i \]
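A sketch of the corresponding call (I() protects the squared term inside the formula; the object name is arbitrary):

```r
# Fuel consumption on weight, weight squared, and displacement
fit_cars <- lm(mpg ~ wt + I(wt^2) + disp, data = mtcars)
summary(fit_cars)
```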
Let’s return to the CPS data; first again as a bivariate regression (earnings on education).
\[ y_i=\beta_0+\beta_1\textrm{education}_i+\beta_2\textrm{age}_i+u_i \]

\[ y_i=\beta_0+\beta_1\textrm{education}_i+\beta_2\textrm{age}_i+\beta_3\textrm{age}_i^2+u_i \]
\[ \begin{aligned} y_i=&\beta_0+\beta_1\textrm{education}_i+\beta_2\textrm{age}_i+\beta_3\textrm{age}_i^2\\ &+\beta_4\textrm{education}_i\times\textrm{age}_i +\beta_5\textrm{education}_i\times\textrm{age}_i^2+u_i \end{aligned} \]
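Sketches of the corresponding calls (object names are arbitrary; : denotes an interaction in R's formula syntax):

```r
# Education and age
m1 <- lm(log(earnings) ~ education + age, data = CPSSW8)

# ... adding a quadratic term in age
m2 <- lm(log(earnings) ~ education + age + I(age^2), data = CPSSW8)

# ... adding interactions of education with age and age squared
m3 <- lm(log(earnings) ~ education + age + I(age^2) +
           education:age + education:I(age^2), data = CPSSW8)
```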
We begin by expanding the sum of squared residuals:
\[ \begin{aligned} \boldsymbol{u}'\boldsymbol{u}&=(\boldsymbol{y}-\boldsymbol{X\beta})'(\boldsymbol{y}-\boldsymbol{X\beta}) \\ &= \boldsymbol{y}'\boldsymbol{y}-\boldsymbol{\beta}'\boldsymbol{X}'\boldsymbol{y}-\boldsymbol{y}'\boldsymbol{X\beta}+\boldsymbol{\beta}'\boldsymbol{X}'\boldsymbol{X\beta} \\ &= \boldsymbol{y}'\boldsymbol{y}-2\boldsymbol{\beta}'\boldsymbol{X}'\boldsymbol{y}+\boldsymbol{\beta}'\boldsymbol{X}'\boldsymbol{X\beta} \end{aligned} \]
In the third step, we use the fact that \(\boldsymbol{\beta}'\boldsymbol{X}'\boldsymbol{y}=\boldsymbol{y}'\boldsymbol{X\beta}\) since it is a scalar. Now we need to differentiate:
\[ \textstyle\frac{\partial\boldsymbol{u}'\boldsymbol{u}}{\partial\boldsymbol{\beta}}=-2\boldsymbol{X}'\boldsymbol{y}+2\boldsymbol{X}'\boldsymbol{X\beta}\overset{!}{=}0, \]
from which we obtain the estimator:
\[ \hat{\boldsymbol{\beta}}=(\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{y}. \]
For the variance, we substitute \(\boldsymbol{y}=\boldsymbol{X}\boldsymbol{\beta}+\boldsymbol{u}\) (MLR.1) into the estimator and use MLR.5, \(\mathrm{Var}(\boldsymbol{u}\mid\boldsymbol{X})=\sigma^2\boldsymbol{I}_N\), along the way:
\[ \begin{aligned} \mathrm{Var}(\hat{\boldsymbol{\beta}}\mid \boldsymbol{X}) &= \mathrm{Var}\Bigl(\bigl(\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{y}\Big|\boldsymbol{X}\Bigr) \\ &= \mathrm{Var}\Bigl(\bigl(\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'(\boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{u})\Big|\boldsymbol{X}\Bigr) \\ &= \mathrm{Var}\Bigl(\boldsymbol{\beta} + \bigl(\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{u}\Big|\boldsymbol{X}\Bigr) \\ &= \mathrm{Var}\Bigl(\bigl(\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{u}\Big|\boldsymbol{X}\Bigr) \\ &= (\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\mathrm{Var}(\boldsymbol{u}\mid \boldsymbol{X})\boldsymbol{X}(\boldsymbol{X}'\boldsymbol{X})^{-1} \\ &=(\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{I}\sigma^2\boldsymbol{X}(\boldsymbol{X}'\boldsymbol{X})^{-1}\\ &= \sigma^2(\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{X}(\boldsymbol{X}'\boldsymbol{X})^{-1}\\ &=\sigma^2(\boldsymbol{X}'\boldsymbol{X})^{-1} \end{aligned} \]