You have data on 300 adults: hours of exercise per week, and systolic blood pressure (mmHg). Because this world is simulated, the truth is knowable — and by the end you will see it. Your job along the way: state what you expect, watch a regression contradict you, diagnose why, and fix it.
A health economist wants the effect of exercise on blood pressure and proposes the simple model above. Before touching the data, commit to an expectation.
| regressor | coef. | std. err. | t | p>\|t\| | 95% CI |
|---|---|---|---|---|---|
u absorbs every determinant of BP that the model leaves out. Omitting things is unavoidable — the OLS estimate only stays trustworthy if exogeneity holds: E[u | exercise] = 0. So the question is never “did I omit something?” but “did I omit something that moves with my regressor?”
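One way to make "moves with my regressor" precise is the standard omitted-variable formula. The sketch below assumes the only omitted determinant that matters is age, and that the remaining error ε is unrelated to both exercise and age. The symbols β₂ (the direct effect of age on BP) and ε are notation introduced here for illustration, not taken from the lesson.

$$
\begin{aligned}
\text{BP}_i &= \beta_0 + \beta_1\,\text{exercise}_i + \underbrace{\beta_2\,\text{age}_i + \varepsilon_i}_{u_i} \\
\operatorname{plim}\,\hat\beta_1^{\text{simple}} &= \beta_1 + \beta_2\,\frac{\operatorname{Cov}(\text{exercise},\,\text{age})}{\operatorname{Var}(\text{exercise})}
\end{aligned}
$$

The second term vanishes if either factor is zero: an omitted variable biases β̂₁ only when it both affects BP (β₂ ≠ 0) and co-moves with exercise (Cov ≠ 0).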
Age is dangerous precisely because it has two arrows: one into the outcome and one into the regressor. Fill in the direction of each arrow for this world.
“The effect of exercise on BP” is a statement about changing one thing. Pick the comparison that isolates it.
Same data, compared correctly. Color the 300 people by age group and fit a separate line within each group — every apples-to-apples slope is negative. The dashed rose line is the Step-1 regression: it tilts up only because it compares young low-exercisers with old high-exercisers, mixing the two arrows of the diagram.
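The pooled-versus-within-group comparison can be reproduced with a small simulation. This is a minimal sketch under an assumed data-generating process: only n = 300 and the true slope of −1.0 come from the lesson, while the age range, the three equal-width age groups, and every other coefficient are hypothetical choices made for illustration.

```python
import numpy as np

TRUE_BETA1 = -1.0  # true exercise effect stated in the lesson


def simulate(n=300, seed=0):
    """Hypothetical DGP: every number except n and TRUE_BETA1 is an assumption."""
    rng = np.random.default_rng(seed)
    age = rng.uniform(25, 75, n)
    # Arrow into the regressor: in this world older people exercise more.
    exercise = 6 + 0.1 * (age - 25) + rng.normal(0, 2, n)
    # Arrow into the outcome: age raises blood pressure directly.
    bp = 100 + TRUE_BETA1 * exercise + 0.5 * age + rng.normal(0, 2, n)
    return exercise, age, bp


def slope(x, y):
    """OLS slope of y on x, with an intercept."""
    return np.polyfit(x, y, 1)[0]


exercise, age, bp = simulate()

# Pooled (simple) regression: compares people of different ages.
pooled = slope(exercise, bp)

# Three equal-width age groups (assumed cut points): 0 = young, 1 = middle, 2 = old.
group = np.digitize(age, [25 + 50 / 3, 25 + 100 / 3])
within = [slope(exercise[group == g], bp[group == g]) for g in range(3)]

print(f"pooled slope: {pooled:+.2f}")
for g, b in enumerate(within):
    print(f"age group {g}: slope {b:+.2f}")
```

Under these assumptions the within-group slopes tend to fall between the pooled slope and −1.0, because age still varies a little inside each group; narrower groups move them closer to the true value.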
Multiple regression does “within the same age” for you, at every age at once: it fits a plane over (exercise, age), and β̂₁ is the slope along the exercise axis holding age fixed.
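To see what "holding age fixed" means mechanically, the sketch below fits both regressions with NumPy and then recovers the same exercise coefficient by first removing from exercise everything that age predicts (the Frisch-Waugh-Lovell result). The data-generating process is an assumption made for illustration: only n = 300 and the true slope of −1.0 come from the lesson, and `simulate` and `ols` are hypothetical helper names.

```python
import numpy as np

TRUE_BETA1 = -1.0  # true exercise effect stated in the lesson


def simulate(n=300, seed=0):
    """Hypothetical DGP: every number except n and TRUE_BETA1 is an assumption."""
    rng = np.random.default_rng(seed)
    age = rng.uniform(25, 75, n)
    exercise = 6 + 0.1 * (age - 25) + rng.normal(0, 2, n)
    bp = 100 + TRUE_BETA1 * exercise + 0.5 * age + rng.normal(0, 2, n)
    return exercise, age, bp


def ols(y, *regressors):
    """Least-squares coefficients [intercept, b1, b2, ...]."""
    X = np.column_stack([np.ones_like(y), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


exercise, age, bp = simulate()

b_simple = ols(bp, exercise)      # (1) simple: BP on exercise
b_multi = ols(bp, exercise, age)  # (2) multiple: BP on exercise and age

# Partialling out: keep only the part of exercise that age cannot predict,
# then regress BP on that leftover variation.
a0, a1 = ols(exercise, age)
exercise_resid = exercise - (a0 + a1 * age)
b_partial = ols(bp, exercise_resid)[1]

print(f"simple   beta1_hat: {b_simple[1]:+.3f}")
print(f"multiple beta1_hat: {b_multi[1]:+.3f}")
print(f"partial  beta1_hat: {b_partial:+.3f}")
```

The last two numbers agree up to floating-point error: the multiple-regression slope uses only the variation in exercise that is left after age has been accounted for.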
| regressor | (1) simple | (2) with age | true value (DGP) |
|---|---|---|---|
The sign flip you just saw is the dramatic special case, not the rule. What governs the damage is how strongly the omitted variable moves with the regressor. Below, the same 300 people are re-simulated in worlds that differ only in corr(exercise, age); in each world we run the simple regression and record its β̂₁. The true β₁ is −1.0 everywhere.
| corr(exercise, age) | simple β̂₁ | error (β̂₁ − β₁) | verdict |
|---|---|---|---|
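
The experiment can be sketched in a few lines. The code below is an illustration under assumed numbers, not the lesson's actual simulation: only n = 300 and β₁ = −1.0 come from the text, while the age effect of 0.5, the standard deviations, and the grid of correlations are hypothetical. Each world reuses the same random draws, so the worlds differ only in the correlation `rho`.

```python
import numpy as np

TRUE_BETA1 = -1.0  # true exercise effect stated in the lesson


def simple_slope(rho, n=300, seed=0):
    """Simple-regression slope of BP on exercise in a world with corr(exercise, age) = rho.

    Hypothetical DGP: every number except n and TRUE_BETA1 is an assumption.
    """
    rng = np.random.default_rng(seed)  # same seed: same underlying draws in every world
    z_age, z_other, noise = rng.standard_normal((3, n))
    age = 50 + 12 * z_age
    # Exercise keeps the same mean and spread; only its correlation with age changes.
    exercise = 8 + 2 * (rho * z_age + np.sqrt(1 - rho**2) * z_other)
    bp = 100 + TRUE_BETA1 * exercise + 0.5 * age + 2 * noise
    return np.polyfit(exercise, bp, 1)[0]


rhos = [-0.6, -0.3, 0.0, 0.3, 0.6, 0.9]
slopes = {rho: simple_slope(rho) for rho in rhos}

print("corr   simple b1   error")
for rho, b in slopes.items():
    print(f"{rho:+.1f}   {b:+.2f}       {b - TRUE_BETA1:+.2f}")
```

With these assumed numbers the large-sample simple slope is −1 + 0.5 · ρ · (12 / 2) = −1 + 3ρ, so the estimate is roughly unbiased at ρ = 0, too negative when ρ < 0, and changes sign once ρ exceeds about 1/3.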