Module 1: Introduction

Econometrics I

Max Heinze (mheinze@wu.ac.at)

Department of Economics, WU Vienna

Partly based on a slide set by Simon Heß, with additional thanks to Gustav Pirich, Lucas Unterweger, and Fynn Lohre for their input

March 6, 2025

 

 

 

What is Econometrics?

Causality

Structure of Econometric Data

 

What is Econometrics?

Econometrics is a subfield of economics.

We deal with economic questions.

Econometrics is a kind of applied statistics.

We use statistical methods to test hypotheses.

Econometrics differs from mathematical statistics mainly in its focus on the problems associated with the use of non-experimental data.

Our Research Question

What does such an economic question look like?

  • Suppose the government is interested in evaluating the effectiveness of government-funded educational leave.
  • How do we as economists and econometricians approach this question?

We need to think carefully about what specific question we want to investigate.

  • In order to test a hypothesis using data, we need both data and a hypothesis.
    • We can derive the hypothesis from a formal economic model, for example.
    • We can collect data, or we can use data that someone has already collected. This data typically needs to be processed.

Our Model

We could formulate our question like this:

If a worker takes advantage of educational leave, does their wage increase over the course of their career?

We assume the following model:

\[ \mathrm{Wage} = f\left(\mathrm{Education},\mathrm{Experience},\mathrm{Talent},\mathrm{EducationalLeave},\dots\right) \]

Wages depend on our variable of interest, the use of educational leave, but also on a number of other factors.

How does the variable education differ from the variable talent?

  • We can easily observe how much education a person has, but not how much talent.
  • Our outcome (the wage) is therefore influenced by both observed and unobserved variables.

Why Do We Do This?

Test and falsify economic theories

  • Do households save more when interest rates rise?
  • Do countries converge to a common equilibrium?

Quantify relationships between economic variables

  • What is the causal effect of education on wages?
  • How large is the average gender pay gap?

Evaluate policy measures

  • Does a minimum wage reduce unemployment?
  • Does a reduction in class size have different effects on male and female students?

Predictions and forecasts

  • How much will GDP grow next year?
  • How volatile will stock markets be next week?

A Practical Example

Suppose we are tasked with investigating:

Does the average class size in a district influence test performance?

… and if so, by how much?

As before, we can assume there are both observed and unobserved influencing factors.

Observed Influencing Factors

  • Average household income in the district
  • Average reading competence
  • Share of students who do not speak the language of instruction at home

Unobserved Influencing Factors

  • Average student motivation
  • Average teacher motivation

What a Coincidence, There’s a Dataset for That

CASchools is a dataset on math and reading test scores from 420 California school districts in 1999. So let’s create a plot.

First, we prepare our data.
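The preparation code itself is not reproduced in this text. A minimal sketch of what such a step might look like, assuming district records with the fields `students`, `teachers`, `read` and `math` (the names used for CASchools in the R package AER). The two sample records and the derived names `str_ratio` and `score` are hypothetical and only serve as illustration:

```python
# Hypothetical district records; the values are made up for illustration
# and are NOT taken from CASchools.
districts = [
    {"students": 200, "teachers": 10.0, "read": 690.0, "math": 690.0},
    {"students": 460, "teachers": 20.0, "read": 650.0, "math": 640.0},
]


def prepare(rows):
    """Add a student-teacher ratio and an average test score to each record."""
    prepared_rows = []
    for row in rows:
        new_row = dict(row)  # copy, so the input records stay untouched
        new_row["str_ratio"] = row["students"] / row["teachers"]
        new_row["score"] = (row["read"] + row["math"]) / 2
        prepared_rows.append(new_row)
    return prepared_rows


prepared = prepare(districts)
for row in prepared:
    print(row["str_ratio"], row["score"])
```

The two derived variables are the ones a scatter plot of test performance against class size would need: one value on each axis per district.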

A Plot

Interpretation

What does this plot tell us? Not much.

  • The data is quite noisy.
  • If we draw a line through it, it slopes slightly downward.
  • What does that tell us?
    • Not much. 🙃
  • Do average scores differ in districts with a student-teacher ratio above 22?
  • In districts with a student-teacher ratio over 22, the average test score is slightly lower.
  • What does that tell us?
    • Again, not much.
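Both descriptive summaries mentioned above, the slope of a fitted line and the comparison of group means around a ratio of 22, can be sketched in a few lines. The data points and the function names `ols_slope` and `group_means` are hypothetical; the numbers are invented and do not come from CASchools:

```python
from statistics import mean

# Hypothetical (student-teacher ratio, average score) pairs, made up for illustration.
data = [(18.0, 670.0), (20.0, 660.0), (21.0, 655.0), (23.0, 650.0), (24.0, 640.0)]


def ols_slope(pairs):
    """Slope of the least-squares line through (x, y) pairs."""
    x_bar = mean(x for x, _ in pairs)
    y_bar = mean(y for _, y in pairs)
    num = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
    den = sum((x - x_bar) ** 2 for x, _ in pairs)
    return num / den


def group_means(pairs, cutoff=22.0):
    """Mean score for ratios at or below the cutoff, and for ratios above it."""
    low = [y for x, y in pairs if x <= cutoff]
    high = [y for x, y in pairs if x > cutoff]
    return mean(low), mean(high)


slope = ols_slope(data)
low_mean, high_mean = group_means(data)
print(slope, low_mean, high_mean)
```

Neither number says anything about uncertainty or about causality; they only describe the sample.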

Why “not much”? There’s a line, right?

We face two major problems with this analysis:

  • Our estimate comes with uncertainty.
    • 383 districts have a student-teacher ratio of ≤ 22, only 37 have > 22.
    • We can be much more confident in the mean of the larger subsample than in that of the smaller one.
    • Whenever we analyze samples, we deal with uncertainty.
  • We can’t say anything about causality.
    • Why are results worse in high-ratio districts?
    • Would they improve if we changed the ratio?
    • Or are the students in those districts different?
    • If they are different, then they likely differ in both observed and unobserved characteristics.
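The uncertainty point can be made concrete with the standard error of a sample mean, which shrinks with the square root of the sample size. The sketch below uses the two group sizes from the text; the common standard deviation `assumed_sd` is a hypothetical value, and the formula assumes independent observations within each group:

```python
from math import sqrt


def standard_error(sd, n):
    """Standard error of a sample mean: sd / sqrt(n)."""
    return sd / sqrt(n)


# Group sizes from the text; the common standard deviation is an assumption.
n_low, n_high = 383, 37
assumed_sd = 19.0  # hypothetical spread of test scores in both groups

se_low = standard_error(assumed_sd, n_low)
se_high = standard_error(assumed_sd, n_high)

# With equal spread, the ratio depends only on the sample sizes: sqrt(383 / 37).
ratio = se_high / se_low
print(round(ratio, 2))
```

Under these assumptions, the mean of the small group is estimated roughly three times less precisely than the mean of the large group.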

So What Do We Do in Econometrics?

  • It’s not enough to simply analyze economic data using statistical methods.
  • We must also think carefully about how we analyze and interpret the data.
    • That starts with data collection,
    • involves deciding which methods we use and how we apply them,
    • and includes the interpretation of our results.

In Econometrics I, Econometrics II, and Applied Econometrics, we learn step by step how to address these issues. By the end of these three courses, we will be able to answer econometric research questions independently.

 

 

Causality

 

 

Just Semantics?


One additional year of education leads to an average 20% increase in wages.

People who have one more year of education earn on average 20% more.

Do these two statements mean the same thing? No. 🙃
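A small simulation can make the difference concrete. The sketch below is an illustration under invented assumptions, not an estimate from real data: unobserved talent drives both education and wages, and education has no causal effect at all. In this simulated world, people with more education still earn more on average, so the second statement holds while the first is false.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the sketch is reproducible

# Simulated world (an assumption, not real data): unobserved talent raises both
# education and wages, while education itself has NO causal effect on wages.
n = 5000
talent = [random.gauss(0, 1) for _ in range(n)]
education = [12 + 2 * a + random.gauss(0, 1) for a in talent]
wage = [10 + 5 * a + random.gauss(0, 1) for a in talent]


def slope(x, y):
    """Least-squares slope of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    num = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    den = sum((xi - x_bar) ** 2 for xi in x)
    return num / den


naive_slope = slope(education, wage)
print(naive_slope)  # clearly positive, although the causal effect is zero here
```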

Causal Effects

As economists, we are often interested in causal effects, where one variable affects another variable.

  • How does price affect demand for a product?
  • How does a particular policy measure affect unemployment?
  • How does the use of fertilizer affect agricultural yields?

Informal Definition: Causality

We speak of a causal effect when the isolated change of a variable has a direct, measurable effect on another variable.

Let’s take the example of fertilizer and agricultural yields. How could we isolate a causal effect here?

Experiments (1)

  • Let’s do an experiment!
  • We have a square field divided into 100 subplots.
  • We randomly choose 50 plots to fertilize.
  • At the end, we measure yields and compare the groups.
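The four steps above can be sketched as a simulation. Everything numerical here is an assumption: the baseline yield, the noise, and the true fertilizer effect `TRUE_EFFECT` are invented, and the hypothetical function `simulated_yield` stands in for actually harvesting the plots.

```python
import random
from statistics import mean

random.seed(42)  # fixed seed so the sketch is reproducible

plots = list(range(100))                 # 100 subplots of the field
treated = set(random.sample(plots, 50))  # randomly choose 50 plots to fertilize

TRUE_EFFECT = 2.0  # hypothetical yield gain from fertilizer


def simulated_yield(plot):
    """Made-up yield: baseline plus noise, plus the effect if fertilized."""
    soil_quality = random.gauss(0, 1)  # unobserved by the researcher
    return 10.0 + soil_quality + (TRUE_EFFECT if plot in treated else 0.0)


yields = {plot: simulated_yield(plot) for plot in plots}

# Compare the two groups at the end of the season.
mean_treated = mean(yields[p] for p in plots if p in treated)
mean_control = mean(yields[p] for p in plots if p not in treated)
estimate = mean_treated - mean_control
print(estimate)
```

Because assignment is random, soil quality is on average the same in both groups, so the difference in means lands close to the effect built into the simulation.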

Experiments (2)

Randomized Controlled Trials (RCTs)

We assign an intervention to a randomly selected study group. A control group does not receive the intervention. Such a study approximates a natural science experiment.

Under certain assumptions, our results are valid:

  • Yields are also influenced by other variables. We assume that the expectation of those variables does not differ between groups.
    • That’s why we randomize the groups.
  • We also assume that using fertilizer has no effect on neighboring subplots.
    • Is that assumption realistic in the setting we thought about?
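The role of the first assumption can be written out in a short sketch. The notation is an assumption and does not appear in the slides: \(Y\) is the yield of a plot, \(D \in \{0, 1\}\) indicates whether it was fertilized, \(\tau\) is a constant fertilizer effect, and \(U\) collects all other influences on the yield.

\[ Y = \tau D + U \]

\[ \mathrm{E}\left[Y \mid D = 1\right] - \mathrm{E}\left[Y \mid D = 0\right] = \tau + \left( \mathrm{E}\left[U \mid D = 1\right] - \mathrm{E}\left[U \mid D = 0\right] \right) \]

If the expectation of the other influences does not differ between groups, the term in parentheses is zero and the difference in group means equals \(\tau\). Writing each plot's yield as a function of its own \(D\) only is where the second assumption, no effect on neighboring subplots, enters.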

Sounds good, let’s just do more experiments

We can’t always run an experiment (RCT or lab experiment). There are

  • practical reasons,
  • financial reasons,
  • legal reasons, and
  • ethical reasons.

Coville et al. (2020) want to find out if people who haven’t paid their water bills pay faster when their water is shut off.

  • Among households in Nairobi with payment issues, they randomly select some whose water access is then shut off.
  • Customers signed a contract that includes this measure as a last resort. The authors interpret this as informed consent.

Cohen & Dupas (2008) examine whether co-payments for malaria nets reduce “wasteful” usage.

  • They randomize the price (from 0 to 40 Kenyan shillings) at which malaria nets are distributed to pregnant women.
  • They find no evidence that free distribution leads to wasteful use.

In many cases, conducting an experiment is unrealistic. In other cases, it is ethically questionable.

So we often rely on observational data.

Observational Data

Experiments are becoming more common in economic research, but, as in other social sciences, we usually work with observational data.

  • Observational data is non-experimental, i.e. not generated through a lab experiment or an RCT.
  • We can obtain it from many sources: surveys, administrative data, satellite data, …

Advantages

  • Often large scale, sometimes covering the entire population of a country.
  • Reflects real behavior.

Disadvantages

  • Not collected specifically for the study, so isolating the effect of interest is harder.
  • Under certain assumptions and in specific situations, we can still investigate causal effects.

 

Structure of Econometric Data

 

 

 

Observations

Back to our model for educational leave:

\[ \mathrm{Wage} = f\left(\mathrm{Education},\mathrm{Experience},\mathrm{Talent},\mathrm{EducationalLeave},\dots\right) \]

What would a dataset look like for studying such a question?

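| ? | Wage | Education | Experience | Educational Leave |
|---|------|-----------|------------|-------------------|
| 1 | 15 | 12 | 9 | Yes |
| 2 | 21 | 14 | 2 | No |
| 3 | 14 | 11 | 7 | No |
| 4 | 18 | 9 | 22 | No |
| 5 | | | | |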

In this dataset, columns are variables and rows are observations.
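The rows-and-columns reading can be sketched directly. The values are the four complete rows of the table above; the variable names `columns`, `rows` and `observations` are just one hypothetical way of holding them:

```python
# The four complete rows of the table above; row 5 is left out because
# no values are shown for it.
columns = ["Wage", "Education", "Experience", "Educational Leave"]
rows = [
    [15, 12, 9, "Yes"],
    [21, 14, 2, "No"],
    [14, 11, 7, "No"],
    [18, 9, 22, "No"],
]

# One observation = one row, read as a mapping from variable name to value.
observations = [dict(zip(columns, row)) for row in rows]

# One variable = one column, read across all observations.
wages = [obs["Wage"] for obs in observations]

print(len(observations), "observations,", len(columns), "variables")
print(wages)
```

Note that talent, although part of the model, has no column: it is unobserved and therefore cannot appear in the dataset.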

Cross-Sectional Data

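| Individuals | Wage | Education | Experience | Educational Leave |
|-------------|------|-----------|------------|-------------------|
| i = 1 | 15 | 12 | 9 | Yes |
| i = 2 | 21 | 14 | 2 | No |
| i = 3 | 14 | 11 | 7 | No |
| i = 4 | 18 | 9 | 22 | No |
| i = 5 | | | | |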

Cross-sectional data consists of a sample of individuals, households, firms, cities, countries, etc., for which data is collected at one point in time. We use index \(i\) for each observation. The number of observations is denoted \(N\).

  • As a rule, we assume the sample is randomly drawn from a population.

Time Series Data

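| Time Points | Wage | Education | Experience | Educational Leave |
|-------------|------|-----------|------------|-------------------|
| t = 2021 | 0 | 8 | 0 | No |
| t = 2022 | 0 | 9 | 0 | No |
| t = 2023 | 12 | 10 | 1 | No |
| t = 2024 | 14 | 10 | 2 | Yes |
| t = 2025 | | | | |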

Time series data consists of a sequence of time points at which data is collected on the same individual or unit. We use index \(t\) for each observation. The number of observations is denoted \(T\).

  • We cannot assume a random sample here. Later observations depend on earlier ones.
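Because the order of the observations carries information, a time series is typically handled by relating each period to the one before it. A minimal sketch using the complete rows of the time series table above; the tuple layout and the name `wage_changes` are hypothetical choices:

```python
# Complete rows of the time series table above: (year, wage, experience).
series = [(2021, 0, 0), (2022, 0, 0), (2023, 12, 1), (2024, 14, 2)]

# Pair each observation with the one before it (its first lag).
for prev, cur in zip(series, series[1:]):
    year, wage, experience = cur
    print(year, "wage change:", wage - prev[1],
          "experience at least as high as before:", experience >= prev[2])

# Period-to-period wage changes, one fewer than the number of observations.
wage_changes = [cur[1] - prev[1] for prev, cur in zip(series, series[1:])]
print(wage_changes)
```

Experience in one year builds on experience in the year before, which is one simple way to see that the observations are not independent draws.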

Panel Data

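| Individuals | Time Points | Wage | Education | Experience | Educational Leave |
|-------------|-------------|------|-----------|------------|-------------------|
| i = 1 | t = 2023 | 20 | 14 | 1 | No |
| i = 2 | t = 2023 | 12 | 10 | 1 | No |
| i = 1 | t = 2024 | 21 | 14 | 2 | No |
| i = 2 | t = 2024 | 14 | 10 | 2 | No |
| i = 1 | t = 2025 | | | | |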

Panel data includes both a cross-sectional and a time component. Each observation is indexed by \(i\) and \(t\). We observe \(N\) units over \(T\) periods, for a total of \(NT\) observations.

  • A major advantage of panel data is that we can account for certain kinds of unobserved variation.
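The double index and one way of using it can be sketched with the complete rows of the panel table above. Taking within-person differences is only one illustrative technique, and the idea that talent is constant over time is an assumption made for this example:

```python
# Complete rows of the panel table above, keyed by (i, t); values are wages.
wage = {
    (1, 2023): 20, (2, 2023): 12,
    (1, 2024): 21, (2, 2024): 14,
}

units = sorted({i for i, _ in wage})
periods = sorted({t for _, t in wage})
N, T = len(units), len(periods)
balanced = len(wage) == N * T  # every unit is observed in every period

# Within-unit change: anything that is constant for a person over time
# (such as talent, if we assume it is fixed) cancels out of this difference.
changes = {i: wage[(i, 2024)] - wage[(i, 2023)] for i in units}
print(N, T, balanced, changes)
```

Comparing each person with their own past removes fixed differences between people, which a single cross-section cannot do.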

Literature


Cohen, J., & Dupas, P. (2008). Free distribution or cost-sharing? Evidence from a malaria prevention experiment. National Bureau of Economic Research. https://doi.org/10.3386/w14406
Coville, A., Galiani, S., Gertler, P., & Yoshida, S. (2020). Financing municipal water and sanitation services in Nairobi’s informal settlements. National Bureau of Economic Research. https://doi.org/10.3386/w27569
Wooldridge, J. M. (2020). Introductory econometrics: A modern approach (7th ed.). Cengage. https://permalink.obvsg.at/wuw/AC15200792