An interactive on multiple linear regression · n = 300, simulated

The sign of β₁ is lying to you

You have data on 300 adults: hours of exercise per week, and systolic blood pressure (mmHg). Because this world is simulated, the truth is knowable — and by the end you will see it. Your job along the way: state what you expect, watch a regression contradict you, diagnose why, and fix it.

STEP 1

Setting: BP_i = β₀ + β₁·exercise_i + u_i

A health economist wants the effect of exercise on blood pressure and proposes the simple model above. Before touching the data, commit to an expectation.

Q1 · Expectation, before any data

Based on what you know about exercise and health, what sign should β₁ have?

STEP 2

What is hiding in u?

u absorbs every determinant of BP that the model leaves out. Omitting things is unavoidable; the OLS estimate stays trustworthy only if exogeneity holds: E[u | exercise] = 0. So the question is never “did I omit something?” but “did I omit something that moves with my regressor?”

Q2 · The error term

Which of these omissions breaks the simple regression?

STEP 3

Draw the problem: age points at both

Age is dangerous precisely because it has two arrows: one into the outcome and one into the regressor. Fill in the direction of each arrow for this world.

As age increases, blood pressure typically…

And in this dataset, as age increases, weekly exercise…
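
To see the two arrows at work, here is a minimal sketch in a made-up world. Every number in it is an assumption chosen for illustration (ages uniform on 25 to 75, older people exercising more, a true exercise effect of −1.0, an age effect of 0.6), so it is not the page's dataset. It checks that age correlates with both BP and exercise, and that the simple model's error term u, which contains age, therefore moves with exercise.

```python
import random

def corr(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 300

# Assumed data-generating process (illustrative values only).
age = [rng.uniform(25, 75) for _ in range(n)]
exercise = [3 + 0.1 * a + rng.gauss(0, 2) for a in age]  # arrow: age -> exercise
noise = [rng.gauss(0, 5) for _ in range(n)]              # unrelated to exercise
bp = [100 - 1.0 * x + 0.6 * a + e                        # arrow: age -> BP
      for x, a, e in zip(exercise, age, noise)]

# The simple model leaves age out, so age ends up inside u.
u = [0.6 * a + e for a, e in zip(age, noise)]

corr_age_bp = corr(age, bp)
corr_age_exercise = corr(age, exercise)
corr_u_exercise = corr(u, exercise)          # far from 0: exogeneity fails
corr_noise_exercise = corr(noise, exercise)  # near 0: harmless omission

print(f"corr(age, BP)         = {corr_age_bp:+.2f}")
print(f"corr(age, exercise)   = {corr_age_exercise:+.2f}")
print(f"corr(u, exercise)     = {corr_u_exercise:+.2f}")
print(f"corr(noise, exercise) = {corr_noise_exercise:+.2f}")
```

The pure noise term is also omitted from the model, but it does not move with exercise, so it does no harm. Only the age component of u breaks E[u | exercise] = 0.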

STEP 4

What comparison do we actually want?

“The effect of exercise on BP” is a statement about changing one thing. Pick the comparison that isolates it.

Q3 · Choose the clean comparison

To learn the effect of exercise on BP, we should compare…

STEP 5

Controlling for age: BP_i = β₀ + β₁·exercise_i + β₂·age_i + u_i

Multiple regression makes the “within the same age” comparison for you, at every age at once: it fits a plane over (exercise, age), and β̂₁ is the slope along the exercise axis holding age fixed.
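
One way to make “holding age fixed” concrete is the partialling-out (Frisch-Waugh-Lovell) result, a standard fact about OLS that this page does not itself invoke: the multiple-regression β̂₁ equals the slope of BP on the part of exercise that age cannot explain. The sketch below uses an assumed data-generating process (true β₁ = −1.0, age effect 0.6, older people exercising more), so its numbers are illustrative and will not match the table on this page.

```python
import random

def slope_intercept(x, y):
    """OLS slope and intercept of y on a single regressor x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

rng = random.Random(1)  # fixed seed so the run is repeatable
n = 300

# Assumed data-generating process (illustrative values only).
age = [rng.uniform(25, 75) for _ in range(n)]
exercise = [3 + 0.1 * a + rng.gauss(0, 2) for a in age]
bp = [100 - 1.0 * x + 0.6 * a + rng.gauss(0, 5)
      for x, a in zip(exercise, age)]

# (1) Simple regression: BP on exercise, age ignored.
b1_simple, _ = slope_intercept(exercise, bp)

# (2) With age: first strip from exercise everything age predicts...
d, c = slope_intercept(age, exercise)
exercise_resid = [x - (c + d * a) for x, a in zip(exercise, age)]
# ...then regress BP on what is left. This slope is the multiple-regression
# coefficient on exercise.
b1_with_age, _ = slope_intercept(exercise_resid, bp)

print(f"(1) simple   b1 = {b1_simple:+.2f}")
print(f"(2) with age b1 = {b1_with_age:+.2f}   (assumed truth: -1.00)")
```

The residualized exercise variable is, by construction, uncorrelated with age, so the second slope compares people who differ in exercise but not in what age predicts about their exercise.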

Estimates vs. truth · dependent variable: systolic BP · n = 300
regressor	(1) simple	(2) with age	true value (DGP)

STEP 6

How wrong does the simple regression get — and when?

The sign flip you just saw is the dramatic special case, not the rule. What governs the damage is how strongly the omitted variable moves with the regressor. Below, the same 300 people are re-simulated in worlds that differ only in corr(exercise, age); in each world we run the simple regression and record its β̂₁. The true β₁ is −1.0 everywhere.
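
The experiment can be sketched as follows. The construction is an assumption, not the page's simulator: each person keeps the same age, the same idiosyncratic exercise component z, and the same BP noise in every world, and only the weight rho that ties exercise to age changes. The age effect of 0.6 and the spread of exercise are likewise hypothetical values.

```python
import math
import random

def simple_slope(x, y):
    """OLS slope of y on a single regressor x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

TRUE_B1 = -1.0  # matches the text
TRUE_B2 = 0.6   # assumed effect of age on BP

rng = random.Random(2)  # fixed seed so the run is repeatable
n = 300

# The same 300 people in every world.
age = [rng.uniform(25, 75) for _ in range(n)]
z = [rng.gauss(0, 1) for _ in range(n)]      # exercise component unrelated to age
noise = [rng.gauss(0, 5) for _ in range(n)]  # BP noise
age_sd = 50 / math.sqrt(12)                  # sd of Uniform(25, 75)

errors = {}
print("corr   b1_hat   error")
for rho in (0.0, 0.2, 0.4, 0.6, 0.8):
    # Mix standardized age and z so that corr(exercise, age) is about rho
    # while the spread of exercise stays the same across worlds.
    w = math.sqrt(1 - rho ** 2)
    exercise = [8 + 2.5 * (rho * (a - 50) / age_sd + w * zi)
                for a, zi in zip(age, z)]
    bp = [100 + TRUE_B1 * x + TRUE_B2 * a + e
          for x, a, e in zip(exercise, age, noise)]
    b1_hat = simple_slope(exercise, bp)
    errors[rho] = b1_hat - TRUE_B1
    print(f"{rho:4.1f}   {b1_hat:+6.2f}   {errors[rho]:+6.2f}")
```

Holding everything but rho fixed keeps the comparison across rows clean: differences in the error column come from the changed correlation, not from a fresh draw of people.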

Simple-regression β̂₁ across worlds · true β₁ = −1.0 in every row
corr(exercise, age)	simple β̂₁	error (β̂₁ − β₁)	verdict

Q4 · Read the pattern

As corr(exercise, age) increases, the magnitude of the error in the simple regression’s β̂₁…

Take these with you

Omission is only a problem when the omitted variable moves with your regressor. The bias in the simple regression equals β₂ × δ, where β₂ is the effect of age on BP and δ is the slope from regressing age on exercise, that is, how strongly age tracks exercise. If either factor is zero, the bias vanishes: at corr = 0 the “wrong” model was nearly right.
“Controlling for age” means comparing like with like. It is the same idea as the two 45-year-olds: vary exercise, hold age fixed. Multiple regression just does that comparison at every age simultaneously — a plane instead of a pooled line.
A regression slope is not a causal effect until you defend exogeneity. The simple model’s β̂₁ = +3.2 was estimated precisely, yet it answered a different question: “how does BP differ across people who exercise more and are also older?”
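
The β₂ × δ formula has an exact in-sample counterpart: the simple slope equals the multiple-regression slope plus β̂₂ × δ̂, where δ̂ is the slope from regressing age on exercise. The sketch below checks that identity on an assumed, illustrative world (age effect 0.6, older people exercising more); the identity holds for any dataset, but the particular numbers printed are not the page's.

```python
import random

def cross(x, y):
    """Sum of cross-products of deviations from the means."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y))

rng = random.Random(3)  # fixed seed so the run is repeatable
n = 300

# Assumed data-generating process (illustrative values only).
age = [rng.uniform(25, 75) for _ in range(n)]
exercise = [3 + 0.1 * a + rng.gauss(0, 2) for a in age]
bp = [100 - 1.0 * x + 0.6 * a + rng.gauss(0, 5)
      for x, a in zip(exercise, age)]

s11 = cross(exercise, exercise)
s22 = cross(age, age)
s12 = cross(exercise, age)
s1y = cross(exercise, bp)
s2y = cross(age, bp)

# Multiple regression of BP on exercise and age: solve the 2x2 normal
# equations on demeaned data.
det = s11 * s22 - s12 ** 2
b1_multi = (s22 * s1y - s12 * s2y) / det
b2_multi = (s11 * s2y - s12 * s1y) / det

b1_simple = s1y / s11  # simple regression of BP on exercise
delta = s12 / s11      # slope from regressing age on exercise
bias = b2_multi * delta

print(f"simple b1          = {b1_simple:+.4f}")
print(f"multi b1 + b2*delta = {b1_multi + bias:+.4f}")
```

The two printed numbers agree up to floating-point rounding, which is the algebra behind “either factor being zero kills it”.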