The Reasoning Gap: Frontier LLMs Fail at Interventional Causal Inference from Probability Tables
Headline Results
All three evaluated models (GPT-5.4, GPT-4o mini, and Gemini 2.0 Flash) perform at or near the 25% random-chance baseline on 840 four-choice causal inference questions. None achieves statistically significant above-chance performance. Scale does not help: the frontier model (GPT-5.4) performs no better than the budget model (GPT-4o mini).
Abstract
We introduce a causal reasoning benchmark that cleanly separates observational from interventional queries over the same causal graphs and probability tables. Across 840 four-choice questions spanning seven canonical graph templates, we evaluate three frontier large language models. All three perform at or near random chance (GPT-5.4: 25.0% [22.2, 28.0]; GPT-4o mini: 25.7% [22.9, 28.8]; Gemini 2.0 Flash: 29.2% [14.9, 49.2]; brackets are 95% Wilson confidence intervals), while an exact inference baseline achieves 100%. Human experts achieve 97.8% on comparable formal causal reasoning tasks.
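As a quick check on the intervals quoted above, the Wilson score interval can be computed in a few lines. This is a minimal sketch, not the project's analysis code: the function name `wilson_interval` is ours, and the correct-answer counts (210 and 216 out of 840) are taken from the per-type counts in the results table.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Counts summed from the results table (assumed to cover all 840 questions).
for model, correct in [("GPT-5.4", 210), ("GPT-4o mini", 216)]:
    lo, hi = wilson_interval(correct, 840)
    print(f"{model}: {correct / 840:.1%} [{lo * 100:.1f}, {hi * 100:.1f}]")
```

Both intervals contain the 25% chance baseline, which is what "at or near random chance" means here.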
Notably, models also fail at observational queries, suggesting a broader inability to compute probabilities from conditional probability tables rather than a deficit specific to interventional reasoning. This finding has direct implications for deploying LLMs in settings that require probabilistic reasoning: scientific discovery, medical diagnosis, policy analysis, and experimental design.
Read the full paper for methodology, benchmark design, example questions, result breakdowns by graph type, and discussion of limitations.
What We Measured
The benchmark uses seven canonical causal graph templates (chain, fork, collider, M-bias, instrumental variable, front-door, and back-door), each instantiated with 20 randomly seeded sets of conditional probability tables (CPTs). For each instantiation, models answer six four-choice questions spanning observational, interventional, and counterfactual queries, for 7 × 20 × 6 = 840 questions in total.
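The benchmark size and the idea of seeded instantiation can be sketched as follows. This is an illustration only: the helper names, the binary variables, the chain node names `A`, `B`, `C`, and the sampling range are assumptions made for the example, not the benchmark's actual generator.

```python
import random

TEMPLATES = [
    "chain", "fork", "collider", "M-bias",
    "instrumental variable", "front-door", "back-door",
]
SEEDS_PER_TEMPLATE = 20
QUESTIONS_PER_INSTANCE = 6


def random_binary_cpt(parents: list[str], rng: random.Random) -> dict:
    """Map each assignment of (binary) parents to P(node = 1 | parents)."""
    cpt = {}
    for idx in range(2 ** len(parents)):
        assignment = tuple((idx >> bit) & 1 for bit in range(len(parents)))
        cpt[assignment] = round(rng.uniform(0.05, 0.95), 2)
    return cpt


def instantiate_chain(seed: int) -> dict:
    """Hypothetical chain A -> B -> C with CPTs drawn from a seeded RNG."""
    rng = random.Random(seed)
    parents = {"A": [], "B": ["A"], "C": ["B"]}
    return {node: random_binary_cpt(ps, rng) for node, ps in parents.items()}


total_questions = len(TEMPLATES) * SEEDS_PER_TEMPLATE * QUESTIONS_PER_INSTANCE
print(total_questions)  # 7 * 20 * 6
```

Seeding a dedicated `random.Random` instance per instantiation keeps each set of tables reproducible, so the same seed always yields the same questions.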
Every question provides complete information: the full causal graph (variables and edges) and all CPTs. Within an instantiation, the only thing that varies is the query type. This design eliminates confounds such as missing information, ambiguous language, or the need for commonsense knowledge retrieval.
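To make the observational versus interventional distinction concrete, here is a small worked example on a confounded graph (Z → X, Z → Y, X → Y). All numbers and variable names are made up for illustration and are not taken from the benchmark. The observational query conditions on X, which also shifts belief about Z; the interventional query uses the standard back-door adjustment, which averages over the prior on Z.

```python
# Hypothetical CPTs for binary variables Z, X, Y (illustrative values only).
P_Z1 = 0.3
P_X1_GIVEN_Z = {0: 0.2, 1: 0.8}
P_Y1_GIVEN_XZ = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9}


def p_z(z: int) -> float:
    return P_Z1 if z == 1 else 1 - P_Z1


def p_x_given_z(x: int, z: int) -> float:
    p1 = P_X1_GIVEN_Z[z]
    return p1 if x == 1 else 1 - p1


def observational(x: int) -> float:
    """P(Y=1 | X=x): weight each z by P(z | x), obtained via Bayes' rule."""
    p_x = sum(p_z(z) * p_x_given_z(x, z) for z in (0, 1))
    joint = sum(P_Y1_GIVEN_XZ[(x, z)] * p_z(z) * p_x_given_z(x, z) for z in (0, 1))
    return joint / p_x


def interventional(x: int) -> float:
    """P(Y=1 | do(X=x)): back-door adjustment, weight each z by the prior P(z)."""
    return sum(P_Y1_GIVEN_XZ[(x, z)] * p_z(z) for z in (0, 1))


print(f"P(Y=1 | X=1)     = {observational(1):.3f}")
print(f"P(Y=1 | do(X=1)) = {interventional(1):.3f}")
```

With these tables the two answers differ (roughly 0.716 versus 0.55), because observing X=1 makes Z=1 more likely while setting X=1 does not.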
Results by Question Type
| Model | Observational | Interventional | Counterfactual |
|---|---|---|---|
| GPT-4o mini | 24.2% (110/455) | 27.0% (85/315) | 30.0% (21/70) |
| GPT-5.4 | 24.2% (110/455) | 24.8% (78/315) | 31.4% (22/70) |
Both models shown perform at or near chance across all three query types. This suggests that the primary difficulty is not specific to interventions but reflects a broader inability to compute probabilities from CPTs presented in text.
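The per-type counts in the table sum to 210/840 for GPT-5.4 and 216/840 for GPT-4o mini. One simple way to check whether such totals exceed the 25% baseline is an exact one-sided binomial test. The sketch below is an assumption about a reasonable test, not necessarily the one used in the paper, and the function name is ours.

```python
import math


def binom_tail_at_least(k: int, n: int, p: float) -> float:
    """Exact P(X >= k) for X ~ Binomial(n, p), with each term computed in log space."""
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log(1 - p)
        )
        total += math.exp(log_pmf)
    return total


# Totals summed across the three columns of the table above.
p_gpt54 = binom_tail_at_least(210, 840, 0.25)
p_gpt4o_mini = binom_tail_at_least(216, 840, 0.25)
print(f"GPT-5.4:     p = {p_gpt54:.3f}")
print(f"GPT-4o mini: p = {p_gpt4o_mini:.3f}")
```

Both p-values come out far above the conventional 0.05 threshold, so neither total is distinguishable from guessing. Working in log space avoids overflow in the binomial coefficients for n = 840.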
Why This Matters
If LLMs are to serve as autonomous agents that act on the world, they must be able to predict the effects of their actions — which is precisely the interventional reasoning task we evaluate. Our results suggest that this capability may not be present in current frontier architectures.
The failure is striking because the task is straightforward for humans: given a causal graph and complete probability tables, compute a probability. An undergraduate statistics student can solve these problems. Human experts score 97.8% on comparable tasks. The exact inference engine achieves 100%. Yet every evaluated model, including the most capable one available, performs at or near random chance.
Access
- Paper: Preprint available on arXiv (endorsement pending).
- Code: Full benchmark, evaluation code, and analysis at github.com/Badtheorylabs/reasoning-gap.
- Live test: Try the benchmark yourself at badtheorylabs.com/reasoning-test.
- Data: Raw responses stored via Supabase; results dashboard at badtheorylabs.com/reasoning-test/results.
“The field has been teaching systems to predict the world. We are trying to build systems that understand it. The difference is not scale. It is the objective.”