Econometrics I
March 6, 2025
The statements on the previous slide all concern the conditional expectation of a dependent variable \(y\), given an explanatory variable \(x\).
Conditional expectations are an important measure that relates a dependent variable \(y\) to an explanatory variable \(x\), for example like this:
\[ \mathrm{E}\left(\textcolor{var(--primary-color)}{y}\mid\textcolor{var(--secondary-color)}{x}\right) = 0.4 + 0.5\textcolor{var(--secondary-color)}{x} \]
In this way, we can divide variation in the dependent variable \(y\) into two components:
When we evaluate certain measures, we are often interested in understanding differences between different groups.
Two examples:
In both cases we are examining the average treatment effect (ATE): the average effect of a “treatment” relative to no “treatment”.
We might also be interested in predicting an outcome for a specific initial situation.
Suppose we know the distribution of class size and test scores. For a new district, we only know the class size. What is the best prediction for the test scores in the new district?
If we minimize a quadratic loss function, our best prediction will be the conditional mean.
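A minimal numerical sketch of this claim, for the simplest case where we predict \(y\) with a single constant (i.e., without conditioning on \(x\)); the normal distribution and its parameters below are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=5.0, scale=2.0, size=100_000)  # hypothetical draws of y

# Average squared loss of predicting the constant c, for a grid of candidates
candidates = np.linspace(3.0, 7.0, 81)
losses = [np.mean((y - c) ** 2) for c in candidates]

best = candidates[int(np.argmin(losses))]
print(best, y.mean())  # the loss-minimizing constant is (approximately) the sample mean
```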
We now want to model the Conditional Expectation Function of a given random variable \(y\) depending on another random variable \(x\).
The simplest way to do that: we assume a linear function.
\[ \mathrm{E}(\textcolor{var(--primary-color)}{y_i}\mid\textcolor{var(--secondary-color)}{x_i}) = \beta_0 + \beta_1 \textcolor{var(--secondary-color)}{x_i}, \]
where \(\beta_0\) is the intercept and \(\beta_1\) is the slope of this conditional expectation function.
This function gives us information about the expected value of \(y_i\) for a given value \(x_i\), and only that.
Suppose the conditional expectation function for test scores given a certain class size is
\[ \mathrm{E}(\textcolor{var(--primary-color)}{\text{TestScores}_i}\mid\textcolor{var(--secondary-color)}{\text{ClassSize}_i}) = 720 - 0.6 \times \textcolor{var(--secondary-color)}{\text{ClassSize}_i}, \]
what can we then say about test scores in a new district with a class size of 20?
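Plugging the class size into the conditional expectation function gives the expected score; individual districts with this class size will scatter around it:
\[ \mathrm{E}\left(\text{TestScores}_i \mid \text{ClassSize}_i = 20\right) = 720 - 0.6 \times 20 = 708. \]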
In blue we see our conditional expectation function. For a class size of 18 we expect a certain value. The actual values are distributed around this value. This applies to every point along the function.
We can combine our thoughts on the conditional expectation function and the prediction error to obtain a linear regression model:
\[ \textcolor{var(--primary-color)}{y_i} = \beta_0 + \beta_1 \textcolor{var(--secondary-color)}{x_i} + \textcolor{var(--tertiary-color-semidark)}{u_i}, \]
where \(u_i\) is the prediction error, i.e., the deviation of \(y_i\) from its conditional expectation.
In our previous example:
\[ \textcolor{var(--primary-color)}{\text{TestScores}_i} = \beta_0 + \beta_1 \times \textcolor{var(--secondary-color)}{\text{ClassSize}_i}+ \textcolor{var(--tertiary-color-semidark)}{u_i}. \]
In this case:
\[ \beta_1 = \frac{\mathrm{d}\:\mathrm{E}(\textcolor{var(--primary-color)}{\text{TestScores}_i}\mid\textcolor{var(--secondary-color)}{\text{ClassSize}_i})}{\mathrm{d}\:\textcolor{var(--secondary-color)}{\text{ClassSize}_i}} \]
is the expected difference in test scores when we change the average class size by one unit.
\[ \beta_0 = \mathrm{E}(\textcolor{var(--primary-color)}{\text{TestScores}_i}\mid\textcolor{var(--secondary-color)}{\text{ClassSize}_i}=0) \]
is the expected value for the test score when there are on average 0 students per class in a district.
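In the numerical example above, \(\beta_0 = 720\) and \(\beta_1 = -0.6\): a district with one additional student per class is expected to score 0.6 points lower, and a (hypothetical) district with an average class size of 0 would be expected to score 720 points.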
How do these two parameters change when we change the scaling of the variables? For example, if we measure class size in tens:
\[ \textcolor{var(--primary-color)}{\text{TestScores}_i} = \beta_0^{\bullet} + \beta_1^\bullet \times \frac{\textcolor{var(--secondary-color)}{\text{ClassSize}_i}}{10}+ \textcolor{var(--tertiary-color-semidark)}{u_i}. \]
We see:
\(\beta_0^{\bullet} = \beta_0\qquad\) and \(\qquad\beta_1^{\bullet} = \textcolor{var(--secondary-color)}{10\times}\beta_1\).
The regression constant remains unchanged, but the slope parameter gets scaled.
Exercise
What happens if we scale the dependent variable (instead of the independent variable)?
On this slide, we scale the \(x_i\) values in several steps from factor 1 to 2. We observe that the intercept remains unchanged, but the slope changes.
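A quick numerical sketch of the rescaling result; the simulated class sizes and test scores below (intercept 720, slope \(-0.6\)) are illustrative assumptions only:

```python
import numpy as np

rng = np.random.default_rng(1)
class_size = rng.uniform(15, 30, size=500)                     # hypothetical class sizes
scores = 720 - 0.6 * class_size + rng.normal(0, 10, size=500)  # simulated test scores

# OLS with the original regressor and with the regressor measured in tens
b1, b0 = np.polyfit(class_size, scores, deg=1)
b1_tens, b0_tens = np.polyfit(class_size / 10, scores, deg=1)

print(b0, b0_tens)       # intercepts agree
print(10 * b1, b1_tens)  # slope for the rescaled regressor is 10 times the original slope
```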
Nothing we’ve discussed so far had to do with actual data.
We previously discussed how class size and test scores are related in the population. However, we cannot observe \(\beta_0\) and \(\beta_1\) in practice. Therefore, we need a sample to estimate them.
So we collect data:
\(\left.\begin{array}{c}\{y_1, x_1\} \\\{y_2, x_2\} \\\{y_3, x_3\} \\\vdots \\\{y_N, x_N\}\end{array}\right\}\quad\{y_i, x_i\}_{i=1}^{N}\quad\) randomly drawn from a population \(\quad F_{y,x}(\cdot,\cdot)\),
for which we want to approximate \(\mathrm{E}(y\mid x)\) using a linear conditional expectation function.
What does a random sample look like in our earlier example?
We first prepare the dataset again.
We see fixed numbers here. However, these numbers are realizations of random variables, and every time we draw a new random sample, we will get different values.
To illustrate, let’s draw a sample from a standard normal distribution and compute the mean.
If we repeat this calculation multiple times, we will always get a mean close to 0, but we get a different value every time. The more observations we collect (e.g., \(n=10^6\)), the closer most of these values will be to 0.
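A sketch of this experiment in Python; the specific sample sizes and number of repetitions are chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# One sample of size 100 from a standard normal and its mean
print(rng.standard_normal(100).mean())

# Repeating the exercise: the means scatter around 0,
# and the scatter shrinks as the sample size n grows
for n in (100, 10_000, 1_000_000):
    means = [rng.standard_normal(n).mean() for _ in range(5)]
    print(n, np.round(means, 4))
```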
We want to fit a regression line with intercept \(\tilde{\beta}_0\) and slope \(\tilde{\beta}_1\), giving fitted values
\[ \hat{y}_i = \textcolor{var(--quarternary-color)}{\tilde{\beta}_0} + \textcolor{var(--quarternary-color)}{\tilde{\beta}_1}x_i, \]
and we want the associated prediction errors
\[ \textcolor{var(--quarternary-color)}{\hat{u}_i} = y_i - \textcolor{var(--quarternary-color)}{\tilde{\beta}_0} - \textcolor{var(--quarternary-color)}{\tilde{\beta}_1}x_i \]
to be as small as possible.
How do we find among all \(\tilde{\beta}_0\) and \(\tilde{\beta}_1\) the parameters \(\hat{\beta}_0\) and \(\hat{\beta}_1\) that minimize prediction error?
Suggestion: Take the sum of all residuals. Problem: positive and negative residuals cancel each other out.
Better suggestion: Take the sum of all squared residuals. That way, we penalize positive and negative residuals equally. So we look for the minimum of:
\[ S(\tilde{\beta}_0,\tilde{\beta}_1)=\sum_{i=1}^N \left(y_i-\tilde{\beta}_0-\tilde{\beta}_1x_i\right)^2. \]
We call the resulting estimator the least squares estimator, or ordinary least squares (OLS).
We begin by taking the derivative with respect to \(\tilde{\beta}_0\) and setting it to zero:
\[ \frac{\partial S}{\partial \tilde{\beta}_0}=-2\sum_{i=1}^N\left(y_i-\tilde{\beta}_0-\tilde{\beta}_1x_i\right)=0, \]
That gives us
\[ \colorbox{var(--primary-color-lightened)}{$\sum_{i=1}^N y_i=N\tilde{\beta}_0+\tilde{\beta}_1\sum_{i=1}^N x_i.$} \]
Next, we differentiate with respect to \(\tilde{\beta}_1\):
\[ \frac{\partial S}{\partial \tilde{\beta}_1}=-2\sum_{i=1}^N x_i\left(y_i-\tilde{\beta}_0-\tilde{\beta}_1x_i\right)=0, \]
We obtain
\[ \colorbox{var(--secondary-color-lightened)}{$\sum_{i=1}^N x_i y_i=\tilde{\beta}_0\sum_{i=1}^N x_i+\tilde{\beta}_1\sum_{i=1}^N x_i^2.$} \]
From now on, let’s write \(\bar{x}=\frac{1}{N}\sum_{i=1}^N x_i\) and \(\bar{y}=\frac{1}{N}\sum_{i=1}^N y_i\). Then from the first first-order condition we get:
\[ \tilde{\beta}_0=\bar{y}-\tilde{\beta}_1\bar{x}. \]
If we plug that into the second first-order condition, we obtain:
\[ \sum^N_{i=1}x_i\left(y_i-\bar{y}\right)=\tilde{\beta}_1\sum^N_{i=1}x_i\left(x_i-\bar{x}\right). \]
Since \(\sum^N_{i=1}x_i\left(x_i-\bar{x}\right)=\sum^N_{i=1}\left(x_i-\bar{x}\right)^2\) and \(\sum^N_{i=1}x_i\left(y_i-\bar{y}\right)=\sum^N_{i=1}\left(x_i-\bar{x}\right)\left(y_i-\bar{y}\right)\) (see Appendix A-1 in Wooldridge):
\[ \colorbox{#e0e0e0}{$\hat{\beta}_1=\frac{\sum_{i=1}^N (x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^N (x_i-\bar{x})^2}$} = \textcolor{#999999}{\frac{\widehat{\mathrm{Cov}}(x_i,y_i)}{\widehat{\mathrm{Var}}(x_i)}}, \]
as long as \(\sum_{i=1}^N (x_i-\bar{x})^2>0\).
And from earlier:
\[ \colorbox{#e0e0e0}{$\hat{\beta}_0=\bar{y}-\hat{\beta}_1\bar{x}.$} \]
These estimators minimize the sum of squared residuals.
Alternatively, we can derive the estimators using the method of moments. We can use the following (previously discussed) assumptions as moment conditions: \(\mathrm{E}(u_i)=0\) and \(\mathrm{E}(x_iu_i)=0\).
As a first step, we replace the population moments with sample moments:
\[ \frac{1}{N} \sum_{i=1}^{N} \left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right) = 0 \]
\[ \frac{1}{N} \sum_{i=1}^{N} x_i \left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right) = 0 \]
These expressions are equivalent to those we obtained by differentiating the loss function. So we can proceed just as before and obtain:
\[ \colorbox{#e0e0e0}{$\hat{\beta}_1=\frac{\sum_{i=1}^N (x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^N (x_i-\bar{x})^2}$}\qquad\qquad\colorbox{#e0e0e0}{$\hat{\beta}_0=\bar{y}-\hat{\beta}_1\bar{x}.$} \]
We’ve obtained the same estimator using two different methods.
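As a sketch, we can compute these closed-form expressions directly on simulated data and cross-check them against a library fit; the data-generating process and parameter values below are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=200)
y = 1.0 + 0.5 * x + rng.normal(0, 1, size=200)  # assumed true parameters: 1.0 and 0.5

# Closed-form OLS estimates from the formulas above
b1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0_hat = y.mean() - b1_hat * x.mean()

# Cross-check against a library fit
slope, intercept = np.polyfit(x, y, deg=1)
print(b0_hat, intercept)
print(b1_hat, slope)
```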
We can only compute our estimator for the slope if the variance in \(x_i\) is not 0 (otherwise we would divide by 0):
\[ \hat\beta_1=\frac{\widehat{\mathrm{Cov}}(x_i,y_i)}{\widehat{\mathrm{Var}}(x_i)} \]
The residuals are the difference between the actually observed value and the fitted value \(\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i\):
\[ \hat{u}_i = y_i - \hat{y}_i \]
When we previously took the derivative with respect to \(\tilde{\beta}_0\), we had:
\[ \frac{\partial S}{\partial \tilde{\beta}_0}=-2\sum_{i=1}^N \left(y_i-\tilde{\beta}_0-\tilde{\beta}_1x_i\right)=0, \]
which implies that the sum (and thus the mean) of the residuals is 0.
Intuition: If the residuals had a positive or negative mean, we could shift the line down or up to achieve a better fit.
When we previously took the derivative with respect to \(\tilde{\beta}_1\), we had:
\[ \frac{\partial S}{\partial \tilde{\beta}_1}=-2\sum_{i=1}^N x_i\left(y_i-\tilde{\beta}_0-\tilde{\beta}_1x_i\right)=0. \]
This implies:
\[ \sum^N_{i=1}\left(x_i-\bar{x}\right)\hat{u}_i=0 \]
This in turn implies that the sample covariance between the \(x_i\) and the residuals is 0 (and hence so is their correlation).
Intuition: If the residuals were correlated with the \(x_i\), we could achieve a better fit by tilting the line.
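A quick numerical check of these two properties on simulated data (the data-generating process is again an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=200)
y = 1.0 + 0.5 * x + rng.normal(0, 1, size=200)  # illustrative data

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
u_hat = y - (b0 + b1 * x)  # residuals

print(u_hat.sum())                     # ≈ 0: the residuals sum to zero
print(np.sum((x - x.mean()) * u_hat))  # ≈ 0: the residuals are uncorrelated with x
```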
We can decompose the variation in \(y\) into an explained part, i.e., variation due to variation in \(x\), and an unexplained part, i.e., the part due to unobserved factors:
\[ \textcolor{var(--primary-color)}{\sum^N_{i=1}\left(y_i-\bar{y}\right)^2} = \textcolor{var(--secondary-color)}{\sum^N_{i=1}\left(\hat{y}_i-\bar{y}\right)^2} + \textcolor{var(--quarternary-color)}{\sum^N_{i=1}\hat{u}_i^2} \]
or in other words
Total Sum of Squares \(=\) Explained Sum of Squares \(+\) Residual Sum of Squares
\[ \textcolor{var(--primary-color)}{\mathrm{SST}} = \textcolor{var(--secondary-color)}{\mathrm{SSE}} + \textcolor{var(--quarternary-color)}{\mathrm{SSR}} \]
The coefficient of determination \(R^2\) is a measure of goodness of fit and indicates what proportion of the variation is explained by our model:
\[ R^2 = \frac{\textcolor{var(--secondary-color)}{\mathrm{SSE}}}{\textcolor{var(--primary-color)}{\mathrm{SST}}} = 1- \frac{\textcolor{var(--quarternary-color)}{\mathrm{SSR}}}{\textcolor{var(--primary-color)}{\mathrm{SST}}}. \]
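A short sketch verifying the decomposition and the two equivalent expressions for \(R^2\) on simulated data (illustrative assumptions only):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, size=200)
y = 1.0 + 0.5 * x + rng.normal(0, 1, size=200)  # illustrative data

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x
u_hat = y - y_hat

sst = np.sum((y - y.mean()) ** 2)
sse = np.sum((y_hat - y.mean()) ** 2)
ssr = np.sum(u_hat ** 2)

print(sst, sse + ssr)            # SST = SSE + SSR
print(sse / sst, 1 - ssr / sst)  # the two expressions for R² coincide
```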
In all four examples, \(R^2=0.67\).
Properties of the OLS Estimator
Logarithmic Transformations
Let’s start with an example. Suppose the wage a person earns depends on their education:
\[ \mathrm{Wage}_i = f\left(\mathrm{Education}_i\right) \]
Is it more plausible that an additional year of education increases wages by the same amount, or by the same factor?
Same amount: the 5th year of education increases the wage by 1 euro, and so does the 12th year.
Same factor: the 5th year of education increases the wage by 8 percent, and so does the 12th year.
We can approximate such a relationship using logarithms:
\[ \mathrm{log}\left(\mathrm{Wage}_i\right) = \beta_0 + \beta_1\mathrm{Education}_i+u_i. \]
This is equivalent to:
\[ \mathrm{Wage}_i = \mathrm{exp}\left(\beta_0 + \beta_1\mathrm{Education}_i+u_i\right). \]
The relationship is non-linear in \(y\) (wage) and \(x\) (education), but it is linear in \(\mathrm{log}(y)\) and \(x\).
We can estimate the regression using OLS just as before, by defining \(y_i^\ast=\mathrm{log}\left(y_i\right)\) and estimating the following model:
\[ y_i^\ast=\beta_0+\beta_1x_i+u_i \]
Analogously to before, we can also take the logarithm of the independent variable (\(x\)). The interpretation in the previous example would be:
An increase in education by 1 percent (regardless of level) increases wage by a certain number of euros.
We define \(x_i^\ast = \mathrm{log}\left(x_i\right)\) and estimate the model:
\[ y_i = \beta_0 + \beta_1x_i^* +u_i. \]
If we use the natural logarithm for our transformation, the interpretation of the coefficients is very simple:
Model | Dep. Variable | Indep. Variable | Interpretation |
---|---|---|---|
Level-Level | \(y\) | \(x\) | \(+1\) in \(x\) \(\Leftrightarrow\) \(+\beta_1\) in \(y\) |
Level-Log | \(y\) | \(\log(x)\) | \(+1\%\) in \(x\) \(\Leftrightarrow\) \(+\beta_1 / 100\) in \(y\) |
Log-Level | \(\log(y)\) | \(x\) | \(+1\) in \(x\) \(\Leftrightarrow\) \(+\beta_1 \times 100\%\) in \(y\) |
Log-Log | \(\log(y)\) | \(\log(x)\) | \(+1\%\) in \(x\) \(\Leftrightarrow\) \(+\beta_1\%\) in \(y\) |
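As a sketch of the log-level case, the simulation below assumes a hypothetical wage equation in which each year of education raises the wage by roughly 8 percent, and checks that the estimated coefficient carries the stated interpretation:

```python
import numpy as np

rng = np.random.default_rng(5)
education = rng.integers(8, 19, size=1_000).astype(float)
# Assumed data-generating process: each extra year of education raises the wage by about 8%
wage = np.exp(1.5 + 0.08 * education + rng.normal(0, 0.3, size=1_000))

b1, b0 = np.polyfit(education, np.log(wage), deg=1)
print(b1)        # ≈ 0.08
print(b1 * 100)  # log-level reading: +1 year of education ⇔ ≈ +8% wage
```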
The Gauss-Markov Theorem
If we assume that our linear model is correct, we can make some statements about the expected value and variance of the OLS estimator.
The Gauss-Markov Theorem states that the OLS estimator is the
Best Linear Unbiased Estimator
(BLUE)
To show that the OLS estimator is BLUE, we need to make assumptions about our model. We start with four assumptions for the simple linear regression; a fifth assumption will be added later, when we turn to the variance of the estimator:
Gauss-Markov Theorem: Assumptions for Simple Linear Regression (SLR)
The population regression function (PRF) must be linear in its parameters:
\[ y_i = \beta_0 + \beta_1 x_i + u_i \]
Our sample of \(N\) observations, \(\left\{\left(y_i,x_i\right), i = 1, 2, \dots, N\right\}\) must be a random sample from the population. The probability of including an observation must be equal for all, and must not depend on who we sampled first.
To estimate our model, we need variation in \(x\). The \(x\) values must not all be exactly the same.
By the way: If we have no variation in \(y\), our results won’t be very interesting (the regression line is flat), but we can still compute them without issue.
The expected value of the error term \(u\) must be 0 for every value of \(x\):
\[ \mathrm{E}\left(u_i\mid x_i\right) = 0 \]
This assumption also implies the two moment conditions \(\mathrm{E}\left(u_i\right) = 0\) and \(\mathrm{E}\left(u_i x_i\right) = 0\) (see the proofs at the end of this section).
\(\mathrm{E}\left(u_i\right)=0\) is not a very restrictive assumption (we can always shift the line if needed). But \(\mathrm{E}\left(u_i\mid x_i\right)=0\) is far less trivial.
Experiment: We randomly select a number of fields. Then, we randomly choose half of them to apply fertilizer. We then record the crop yields.
Observational study: We randomly select a number of fields. Then, we ask farmers whether they applied fertilizer. We record fertilizer use and yields.
In the experiment, the intervention (fertilizer use, \(x_i\)) is guaranteed to be independent of unobserved factors. The assumption that \(\mathrm{E}\left(u_i\mid x_i\right)=0\) is thus plausible.
In the observational study, the intervention may not be independent of unobserved factors. Maybe fertilizer is used on less fertile fields to compensate? Or on better fields to boost yield even more? If we believe \(\mathrm{E}\left(u_i\mid x_i\right)=0\) is plausible, we must justify it.
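The following simulation sketches both scenarios; all numbers, including the unobserved "soil quality" confounder and the assumed true effect of 2.0, are hypothetical. In the observational setting, \(\mathrm{E}\left(u_i\mid x_i\right)=0\) fails and OLS is biased, while random assignment recovers the true effect:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 5_000
true_effect = 2.0  # assumed true effect of fertilizer on yield

# Observational setting: soil quality is unobserved (part of u) and
# fertilizer is applied mainly to low-quality fields
soil = rng.normal(0, 1, size=n)
fertilizer = (soil + rng.normal(0, 1, size=n) < 0).astype(float)
yield_obs = 10 + true_effect * fertilizer + 3 * soil + rng.normal(0, 1, size=n)
b1_obs, _ = np.polyfit(fertilizer, yield_obs, deg=1)
print(b1_obs)  # noticeably below 2.0: E(u | x) = 0 fails, OLS is biased

# Experiment: fertilizer is assigned at random, so E(u | x) = 0 holds
fert_rct = rng.integers(0, 2, size=n).astype(float)
yield_rct = 10 + true_effect * fert_rct + 3 * soil + rng.normal(0, 1, size=n)
b1_rct, _ = np.polyfit(fert_rct, yield_rct, deg=1)
print(b1_rct)  # ≈ 2.0
```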
Expected Value of the OLS Estimator
If assumptions SLR.1 through SLR.4 hold, we can prove that the OLS estimator is unbiased.
An estimator is unbiased if its expected value equals the true value of the parameter in the population model. So we want to prove:
\[ \mathrm{E}\left(\hat{\beta}_j\right) = \beta_j\qquad j = 0,1 \]
We start with the expression for the OLS estimator:
\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x}) (y_i-\bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x}) x_i} = \frac{\sum_{i=1}^{n} (x_i - \bar{x}) y_i}{\sum_{i=1}^{n} (x_i - \bar{x}) x_i} \]
As a first step, we write \(y_i\) as the sum of its components:
\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x}) (\textcolor{var(--primary-color)}{\beta_0} + \textcolor{var(--secondary-color)}{\beta_1 x_i} + \textcolor{var(--quarternary-color)}{u_i})}{\sum_{i=1}^{n} (x_i - \bar{x}) x_i} \]
We split this up:
\[ \hat{\beta}_1 = \frac{\textcolor{var(--primary-color)}{\beta_0 \sum_{i=1}^{n} (x_i - \bar{x})} + \textcolor{var(--secondary-color)}{\beta_1 \sum_{i=1}^{n} (x_i - \bar{x}) x_i} + \textcolor{var(--quarternary-color)}{\sum_{i=1}^{n} (x_i - \bar{x}) u_i}}{\sum_{i=1}^{n} (x_i - \bar{x}) x_i} \]
Because \(\textcolor{var(--primary-color)}{\sum_{i=1}^{n} (x_i - \bar{x})} = 0\) and \(\frac{\textcolor{var(--secondary-color)}{\sum_{i=1}^{n} (x_i - \bar{x}) x_i}}{\textcolor{var(--secondary-color)}{\sum_{i=1}^{n} (x_i - \bar{x}) x_i}} = 1\):
\[ \hat{\beta}_1 = \beta_1 + \textcolor{var(--quarternary-color)}{\frac{\sum_{i=1}^{n} (x_i - \bar{x}) u_i}{\sum_{i=1}^{n} (x_i - \bar{x}) x_i}} \]
We call \(\textcolor{var(--quarternary-color)}{\frac{\sum_{i=1}^{n} (x_i - \bar{x}) u_i}{\sum_{i=1}^{n} (x_i - \bar{x}) x_i}}\) the sampling error. The equation shows us that \(\hat{\beta}_1\) in a finite sample equals the sum of the true parameter \(\beta_1\) and a certain linear combination of the error terms—the sampling error.
If we can show that this sampling error has an expected value of 0, we have proven that the OLS estimator is unbiased.
So what is the expected value of \(\hat{\beta}_1\)?
\[ \mathrm{E}(\hat{\beta}_1 | x_1, \dots, x_N) = \mathrm{E} \left( \beta_1 + \frac{\sum_{i=1}^{n} (x_i - \bar{x}) u_i}{\sum_{i=1}^{n} (x_i - \bar{x}) x_i} \Bigg| x_1, \dots, x_N \right) \]
Since the true parameter \(\beta_1\) is not a random variable, we can take it outside:
\[ \mathrm{E}(\hat{\beta}_1 | x_1, \dots, x_N) = \beta_1 + \mathrm{E} \left( \frac{\sum_{i=1}^{n} (x_i - \bar{x}) u_i}{\sum_{i=1}^{n} (x_i - \bar{x}) x_i} \Bigg| x_1, \dots, x_N \right) \]
Conditional on \(x_1,\dots,x_N\), the terms \((x_i-\bar{x})\) and the denominator are fixed numbers, so we can pull them out of the expectation:
\[ \mathrm{E}(\hat{\beta}_1 | x_1, \dots, x_N) = \beta_1 + \frac{\sum_{i=1}^{n} (x_i - \bar{x}) \mathrm{E} \left( u_i | x_1, \dots, x_N \right)}{\sum_{i=1}^{n} (x_i - \bar{x}) x_i} \]
Assumption SLR.2 (random sampling, so the observations are independent) allows us to simplify:
\[ \mathrm{E}(\hat{\beta}_1 | x_1, \dots, x_N) = \beta_1 + \frac{\sum_{i=1}^{n} (x_i - \bar{x}) \mathrm{E} \left( u_i | x_i \right)}{\sum_{i=1}^{n} (x_i - \bar{x}) x_i} \]
Assumption SLR.4 says that \(\mathrm{E} \left( u_i | x_i \right)=0\), so:
\[ \mathrm{E}(\hat{\beta}_1 | x_1, \dots, x_N) = \beta_1 \]
By the law of iterated expectations, \(\mathrm{E}(\hat{\beta}_1)=\mathrm{E}(\mathrm{E}(\hat{\beta}_1\mid x_1,\dots,x_N))\) and thus:
\[ \mathrm{E}(\hat{\beta}_1) = \beta_1, \]
The expected value of the estimator equals the true parameter from the population model, so it is unbiased.
\(\square\)
The proof that \(\hat{\beta}_0\) is also unbiased is very simple. First, write \(\hat{\beta}_0\) as
\[ \hat{\beta}_0 = \bar{y}-\hat{\beta}_1\bar{x}. \]
Because \(\mathrm{E}(\hat{\beta}_1\mid x_1,\dots,x_N)=\beta_1\):
\[ \begin{aligned} \mathrm{E}(\hat{\beta}_0\mid x_1,\dots,x_N) &= \mathrm{E}(\bar{y}\mid x_1,\dots,x_N)-\mathrm{E}(\hat{\beta}_1\bar{x}\mid x_1,\dots,x_N) \\ &= \mathrm{E}(\bar{y}\mid x_1,\dots,x_N)-\mathrm{E}(\hat{\beta}_1\mid x_1,\dots,x_N)\bar{x} \\ &= \beta_0+\beta_1\bar{x}-\beta_1\bar{x} \\ &= \beta_0. \end{aligned} \]
So the estimator \(\hat{\beta}_0\) is also unbiased.
\(\square\)
Variance of the OLS Estimator
To analyze the variance of the OLS estimator, we need one additional assumption (SLR.5, homoskedasticity): the variance of the error term \(u_i\) must be the same for all values of \(x_i\):
\[ \mathrm{Var}(u_i\mid x_i) = \mathrm{Var}(u_i) = \sigma^2 \]
If assumptions SLR.1 through SLR.5 are fulfilled, we can prove that the OLS estimator has the lowest possible variance among all linear unbiased estimators.
We then say that it is the best linear unbiased estimator (BLUE). This property is also called efficiency.
We can prove this by first showing that the variance of the OLS estimator is
\[ \colorbox{var(--primary-color-lightened)}{$\mathrm{Var}(\hat{\beta}_1\mid x_i) = \frac{\sigma^2}{\sum^N_{i=1}(x_i-\bar{x})^2}, \qquad \mathrm{Var}(\hat{\beta}_0\mid x_i) = \frac{\sigma^2 N^{-1}\sum^N_{i=1}x_i^2}{\sum^N_{i=1}(x_i-\bar{x})^2}$} \]
and then showing that there cannot be any other linear unbiased estimator with a smaller variance.
We prove this for \(\beta_1\). We begin with the decomposition of the estimator from earlier:
\[ \mathrm{Var}(\hat{\beta}_1\mid x_i) = \mathrm{Var}\left(\beta_1+\frac{\sum_{i=1}^{n} (x_i - \bar{x}) u_i}{\sum_{i=1}^{n} (x_i - \bar{x}) x_i}\middle| x_i\right) \]
To simplify notation, we now define \(w_i:=\frac{x_i - \bar{x}}{\sum_{i=1}^{n} (x_i - \bar{x})x_i}\):
\[ \textstyle\mathrm{Var}(\hat{\beta}_1\mid x_i) = \mathrm{Var}\left(\beta_1+\sum_{i=1}^{n}w_iu_i \middle| x_i\right) \]
Now we can apply SLR.5. Because the observations are independent (SLR.2), the variance of the sum is the sum of the individual variances; and the weights \(w_i\) depend only on the \(x_i\), so they are fixed once we condition on them:
\[ \mathrm{Var}(\hat{\beta}_1\mid x_i) = \sigma^2\sum_{i=1}^{n}w_i^2 \]
Now we expand \(w_i\): Since \(w_i=\frac{x_i - \bar{x}}{\sum_{i=1}^{n} (x_i - \bar{x})x_i}\), we also have: \(\sum_{i=1}^{n}w_i^2=\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{\left(\sum_{i=1}^{n} (x_i - \bar{x})x_i\right)^2}=\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{\left(\sum_{i=1}^{n} (x_i - \bar{x})^2\right)^2}=\frac{1}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\). Thus:
\[ \colorbox{var(--secondary-color-lightened)}{$\mathrm{Var}(\hat{\beta}_1\mid x_i) = \frac{\sigma^2}{\sum^N_{i=1}(x_i-\bar{x})^2}$} \]
Exercise
How can we derive \(\mathrm{Var}(\hat{\beta}_0\mid x_i) = \frac{\sigma^2 N^{-1}\sum^N_{i=1}x_i^2}{\sum^N_{i=1}(x_i-\bar{x})^2}\)?
Now we address the second part: Is this variance the smallest possible for a linear unbiased estimator? Let \(\tilde{\beta}_1\) be any other linear estimator with arbitrary weights \(a_i\) (instead of the OLS weights \(w_i\)):
\[ \tilde{\beta}_1 = \sum^N_{i=1}a_iy_i = \sum^N_{i=1} a_i\left(\beta_0+\beta_1x_i+u_i\right) \]
Since these weights \(a_i\) are based on the \(x\) values, we can apply SLR.4 to write the expectation as:
\[ \mathrm{E}\left(\tilde{\beta}_1\middle| x_i\right) = \beta_0\sum^N_{i=1}a_i+\beta_1\sum^N_{i=1}a_ix_i \]
Because we assume this estimator is also unbiased, we get two conditions: \(\sum^N_{i=1}a_i = 0\) and \(\sum^N_{i=1}a_ix_i = 1\).
We can express the weights of \(\tilde{\beta}_1\) as the OLS weights plus a difference:
\[ a_i = w_i + d_i \]
That allows us to rewrite the estimator (using the same decomposition as earlier for the OLS estimator):
\[ \tilde{\beta}_1 = \beta_1 + \sum^N_{i=1}(w_i+d_i)u_i. \]
The variance of \(\tilde{\beta}_1\) is thus:
\[ \mathrm{Var}\left(\tilde{\beta}_1\middle|x_i\right) \quad = \quad \sigma^2\sum^N_{i=1}\left(w_i+d_i\right)^2 \quad = \quad \sigma^2\sum^N_{i=1}\left(w_i^2+2w_id_i+d_i^2\right) \]
Because \(\sum^N_{i=1}a_i = \sum^N_{i=1}(w_i+d_i)=0\) and \(\sum^N_{i=1}w_i=0\), we also have:
\[ \sum^N_{i=1}d_i=0 \]
Also:
\[ \sum^N_{i=1}(w_i+d_i)x_i=\sum^N_{i=1}w_ix_i+\sum^N_{i=1}d_ix_i=1\quad\Rightarrow\quad \sum^N_{i=1}d_ix_i=0 \]
Because \(\sum^N_{i=1}d_i=0\) and \(\sum^N_{i=1}d_ix_i=0\), the middle term becomes:
\[ \textstyle\sum^N_{i=1}w_id_i = \frac{\sum^N_{i=1}\left(x_i-\bar{x}\right)d_i}{\sum^N_{i=1}(x_i-\bar{x})^2}=\frac{1}{\sum^N_{i=1}(x_i-\bar{x})^2}\sum^N_{i=1}x_id_i-\frac{\bar{x}}{\sum^N_{i=1}(x_i-\bar{x})^2}\sum^N_{i=1}d_i=0 \]
So the expression for the variance simplifies to:
\[ \mathrm{Var}\left(\tilde{\beta}_1\middle|x_i\right) \quad = \quad \sigma^2\sum^N_{i=1}w_i^2+\textcolor{var(--secondary-color)}{\sigma^2\sum^N_{i=1}d_i^2} \]
The difference from the variance of the OLS estimator is the right-hand term. Since this term can never be negative, the variance of \(\tilde{\beta}_1\) must always be greater than or equal to that of \(\hat{\beta}_1\).
\(\square\)
Back to the variance of the OLS estimator:
\[ \mathrm{Var}(\hat{\beta}_1\mid x_i) = \frac{\sigma^2}{\sum^N_{i=1}(x_i-\bar{x})^2} \]
If we want to compute this variance from data, we have a problem: We don’t know \(\sigma^2\).
Under assumptions SLR.1 through SLR.5, we can find an unbiased estimator for the variance:
\[ \colorbox{var(--secondary-color-lightened)}{$\hat{\sigma}^2=\frac{\sum^N_{i=1}\hat{u}_i^2}{N-2}$}, \]
i.e., the residual sum of squares divided by \(N-2\).
If we take the square root of the estimator for the variance of the error term, we get:
\[ \hat{\sigma}=\sqrt{\hat{\sigma}^2}. \]
We call this value the standard error of the regression. While not unbiased, it is a consistent estimator for \(\sigma\). With it, we can compute the standard error of \(\hat{\beta}_1\), an estimator for the standard deviation of \(\hat{\beta}_1\):
\[ \textstyle\mathrm{se}\left(\hat{\beta}_1\right)=\frac{\hat{\sigma}}{\sqrt{\sum^N_{i=1}\left(x_i-\bar{x}\right)^2}} \]
Similarly, we can compute the standard error of \(\hat{\beta}_0\). This allows us to measure how “precisely” the coefficients are estimated.
We simulate 4000 samples from a population and estimate the \(\beta_1\) coefficient 4000 times.
In this example, the standard deviation of the 4000 estimated \(\hat{\beta}_1\) coefficients is 0.161, while the estimated standard error is 0.1637897.
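A sketch of such a simulation; the population parameters, sample size, and regressor distribution below are assumptions for illustration, so the numbers will not match those above exactly:

```python
import numpy as np

rng = np.random.default_rng(7)
n, beta0, beta1, sigma = 100, 1.0, 0.5, 2.0  # assumed population parameters

def ols_slope(x, y):
    return np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# Sampling distribution of the slope estimate across 4000 simulated samples
slopes = []
for _ in range(4000):
    x = rng.uniform(0, 10, size=n)
    y = beta0 + beta1 * x + rng.normal(0, sigma, size=n)
    slopes.append(ols_slope(x, y))
print(np.std(slopes))  # standard deviation of the 4000 estimates

# Estimated standard error from one single sample
x = rng.uniform(0, 10, size=n)
y = beta0 + beta1 * x + rng.normal(0, sigma, size=n)
b1 = ols_slope(x, y)
b0 = y.mean() - b1 * x.mean()
u_hat = y - b0 - b1 * x
sigma2_hat = np.sum(u_hat ** 2) / (n - 2)
se_b1 = np.sqrt(sigma2_hat / np.sum((x - x.mean()) ** 2))
print(se_b1)  # comparable in magnitude to the standard deviation above
```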
Regressions with Only One Parameter
What happens if, instead of estimating the model \(y = \beta_0 + \beta_1x + u\), we estimate the following model?
\[ y = \beta_1x + u \]
This simply means that we impose the restriction \(\beta_0=0\), so the regression line goes through the origin.
The OLS estimator in this case is
\[ \hat{\beta_1}=\frac{\sum^N_{i=1}x_iy_i}{\sum^N_{i=1}x_i^2}. \]
Exercise
How can we derive this estimator?
If the true model of the population has no intercept, then this estimator is unbiased.
If the true model of the population has an intercept, then this estimator is biased.
The OLS estimator in a regression without an intercept is only unbiased if the intercept in the true model is actually 0.
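A quick numerical illustration with simulated data; the true intercept of 3.0 and slope of 0.5 are assumed for the example:

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.uniform(1, 10, size=10_000)
y = 3.0 + 0.5 * x + rng.normal(0, 1, size=10_000)  # assumed true intercept 3.0, slope 0.5

b1_origin = np.sum(x * y) / np.sum(x ** 2)  # regression through the origin
b1_full, b0_full = np.polyfit(x, y, deg=1)

print(b1_origin)  # clearly above 0.5: biased because the true intercept is not 0
print(b1_full)    # ≈ 0.5
```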
Exercise
How can we prove that the estimator is biased in the case mentioned above?
What happens if, instead of estimating the model \(y = \beta_0 + \beta_1x + u\), we estimate the following model?
\[ y = \beta_0 + u \]
This simply means we impose the restriction \(\beta_1=0\), and the regression line is horizontal.
The OLS estimator in this case is
\[ \hat{\beta_0} = \bar{y}, \]
the mean of the \(y\) values.
Exercise
How can we derive this estimator?
Binary Explanatory Variables
So far, we have only used explanatory variables with a quantitative interpretation (years of education, class size, …). How can we include qualitative information in the model?
Suppose we want to analyze the gender pay gap and are therefore interested in whether an individual is a woman or not. We can define a variable as follows:
\[ \mathrm{Woman}_i = \begin{cases} 1&\text{if }i\text{ is a woman},\\ 0&\text{otherwise} \end{cases} \]
We call such a variable a binary variable or dummy variable.
Another example would be a job training program. The variable \(\text{ProgramParticipation}_i\) is then 1 for all people who participated in the program and 0 for all others.
So we have a model of the form
\[ y = \beta_0 + \beta_1x + u, \]
where \(x\) is a dummy variable. Our assumptions SLR.1 to SLR.5 still hold. This means:
\[ \begin{aligned} \mathrm{E}(y\mid x=1) &= \beta_0 + \beta_1, \\ \mathrm{E}(y\mid x=0) &= \beta_0. \end{aligned} \]
We can thus interpret \(\beta_1\) as the expected difference in \(y\) between the two groups, and \(\beta_0\) as the mean value in the group \(x=0\). It follows that the mean value in group \(x=1\) is then \(\beta_0 + \beta_1\).
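A sketch with simulated data confirming these relationships; the wage levels and the gap of 3 in the data-generating process are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(9)
n = 2_000
woman = rng.integers(0, 2, size=n).astype(float)  # dummy variable
wage = 20 - 3 * woman + rng.normal(0, 5, size=n)  # purely illustrative wage equation

b1, b0 = np.polyfit(woman, wage, deg=1)
print(b0, wage[woman == 0].mean())                            # intercept = mean wage in the x = 0 group
print(b0 + b1, wage[woman == 1].mean())                       # intercept + slope = mean in the x = 1 group
print(b1, wage[woman == 1].mean() - wage[woman == 0].mean())  # slope = difference in group means
```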
We can also encode more complex qualitative information than just “yes/no” with dummy variables. But for that, we need the techniques of multiple linear regression from the next module.
We have talked several times about wanting to evaluate treatments or interventions.
So for each individual there are two potential outcomes, \(y_i(1)\) with treatment and \(y_i(0)\) without, of which we can observe only one.
An effect we can estimate is the average treatment effect (ATE):
\[ \mathrm{ATE}=\mathrm{E}\left(\text{Causal Effect}_i\right) = \mathrm{E}\left(y_i(1)-y_i(0)\right) = \mathrm{E}\left(y_i(1)\right)-\mathrm{E}\left(y_i(0)\right). \]
If assumptions SLR.1 through SLR.4 hold, the OLS estimator \(\hat{\beta}_1\) is an unbiased estimator of the average treatment effect.
This brings us back to something we’ve discussed before: Assumption SLR.4 (in this context: the errors are independent of treatment group assignment \(x\)) is only guaranteed to hold if the assignment to the treatment group is random, for example in a randomized controlled trial.
In contexts where random assignment to treatment groups is not possible, the methods we have learned so far cannot yield valid statements about treatment effects. In Module 3, we will discuss how to address this problem using methods of multiple linear regression.
Why do we use the linear conditional expectation function for prediction?
With a quadratic loss function, we choose the intercept and slope of the linear prediction function so as to minimize the expected squared prediction error \(\mathrm{E}\left[\left(y-\beta_0-\beta_1x\right)^2\right]\).
So, if we know the joint distribution of \(x\) and \(y\), want to predict \(y\) with a linear model, and minimize the expected squared errors, the linear conditional expectation function is the best function to use.
Two Remarks:
Explicit solutions for slope and intercept based on the (unobserved) moments of the population are: \[\beta_1=\frac{\mathrm{Cov}(x,y)}{\mathrm{Var}(x)} \quad\quad\text{and} \quad\quad\beta_0=\mathrm{E}(y)-\beta_1\mathrm{E}(x).\]
A similar result as in the previous slide holds in more general form: When using a quadratic loss function, the best prediction function for unknown \(y\) is always a conditional expectation function—even when working with nonlinear functions.
Why a quadratic loss function?
By the definition of covariance,
\[ \mathrm{Cov}(u_i,x_i) = \mathrm{E}(u_ix_i)-\mathrm{E}(u_i)\mathrm{E}(x_i) \]
Since we assume that \(\mathrm{E}(u_i)=0\),
\[ \mathrm{Cov}(u_i,x_i)=\mathrm{E}(x_iu_i) \]
\(\square\)
\[ \begin{aligned} \text{SST} &= \sum (y_i - \bar{y})^2\\ &= \sum \bigl(y_i - \bar{y} + \underbrace{\hat{y}_i - \hat{y}_i}_{=0}\bigr)^2\\ &= \sum \Bigl((y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})\Bigr)^2\\ &= \sum \Bigl(\hat{u}_i + (\hat{y}_i - \bar{y})\Bigr)^2\\ &= \sum \Bigl(\hat{u}_i^2 + 2\,\hat{u}_i(\hat{y}_i - \bar{y}) + (\hat{y}_i - \bar{y})^2\Bigr)\\ &= \sum \hat{u}_i^2 + 2 \sum \hat{u}_i(\hat{y}_i - \bar{y}) + \sum (\hat{y}_i - \bar{y})^2\\ &= \text{SSR} + 2 \underbrace{\sum \hat{u}_i(\hat{y}_i - \bar{y})}_{=0\text{, see below}} + \text{SSE}\\ &= \text{SSR} + \text{SSE}\qquad\qquad\qquad\qquad\qquad\qquad\square \end{aligned} \]
\[ \begin{aligned} \sum \hat{u}_i(\hat{y}_i - \bar{y}) &= \sum \hat{u}_i \,\hat{y}_i -\bar{y}\,\sum \hat{u}_i\\ &= \sum \hat{u}_i \bigl(\hat{\beta}_0 + \hat{\beta}_1 x_i\bigr)-\bar{y}\,\sum \hat{u}_i\\ &= \hat{\beta}_0\underbrace{\sum \hat{u}_i}_{=0} +\hat{\beta}_1 \underbrace{\sum \hat{u}_i x_i}_{=0} -\bar{y}\underbrace{\sum \hat{u}_i}_{=0}\\ &= 0 \end{aligned} \]
First, apply the law of iterated expectations: \(\mathrm{E}\left(u_i\right) = \mathrm{E}\left(\mathrm{E}\left(u_i\mid x_i\right)\right)\). Then use the assumption that \(\mathrm{E}\left(u_i\mid x_i\right)=0\): \(\mathrm{E}\left(\mathrm{E}\left(u_i\mid x_i\right)\right) = \mathrm{E}(0) = 0\). \(\square\)
Again, apply the law of iterated expectations: \(\mathrm{E}\left(u_ix_i\right) = \mathrm{E}\left(\mathrm{E}\left(u_ix_i\mid x_i\right)\right)\). Since \(\mathrm{E}(x_i\mid x_i) = x_i\), we have \(\mathrm{E}\left(\mathrm{E}\left(u_ix_i\mid x_i\right)\right) = \mathrm{E}\left(\mathrm{E}\left(u_i\mid x_i\right)x_i\right)\). Then, using the assumption that \(\mathrm{E}\left(u_i\mid x_i\right)=0\): \(\mathrm{E}\left(\mathrm{E}\left(u_i\mid x_i\right)x_i\right) = \mathrm{E}(0\cdot x_i) = 0\). \(\square\)