Research Paper · June 2026

The Reasoning Gap

Frontier LLMs Fail at Interventional Causal Inference from Probability Tables

Published June 2026 · Bad Theory Labs

Olajide Al-ameen

Headline Results

25.0% · GPT-5.4 accuracy · 95% CI [22.2, 28.0]
25.7% · GPT-4o mini accuracy · 95% CI [22.9, 28.8]
100% · Exact solver baseline · Verifies benchmark correctness
97.8% · Human experts (CounterBench) · PhD-level annotators on comparable tasks

All three evaluated models (GPT-5.4, GPT-4o mini, and Gemini 2.0 Flash) perform at or near the 25% random-chance baseline on 840 four-choice causal inference questions. None of them scores significantly above chance: every reported 95% confidence interval contains 25%. Scale does not help either, since the frontier model (GPT-5.4) performs no better than the budget model (GPT-4o mini).

Abstract

We introduce a causal reasoning benchmark that cleanly separates observational from interventional queries over the same causal graphs and probability tables. Across 840 four-choice questions spanning seven canonical graph templates, we evaluate three frontier large language models. All three perform at or near random chance: GPT-5.4 scores 25.0% [22.2, 28.0], GPT-4o mini 25.7% [22.9, 28.8], and Gemini 2.0 Flash 29.2% [14.9, 49.2] (95% Wilson confidence intervals), while an exact inference baseline achieves 100%. Human experts achieve 97.8% on comparable formal causal reasoning tasks (CounterBench).

Notably, models also fail at observational queries, suggesting a broader inability to compute probabilities from conditional probability tables rather than a deficit specific to interventional reasoning. This finding has direct implications for deploying LLMs in settings that require probabilistic reasoning: scientific discovery, medical diagnosis, policy analysis, and experimental design.
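The Wilson intervals quoted above can be checked with a few lines of arithmetic. The sketch below is not the authors' evaluation code. It assumes the standard Wilson score formula with z ≈ 1.96, and it takes the correct-answer counts (210 and 216 of 840) from the row totals of the per-question-type table.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Two-sided Wilson score interval for a binomial proportion."""
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Counts are sums over question types, e.g. 110 + 78 + 22 = 210 for GPT-5.4.
for name, correct, total in [("GPT-5.4", 210, 840), ("GPT-4o mini", 216, 840)]:
    lo, hi = wilson_interval(correct, total)
    print(f"{name}: {100 * correct / total:.1f}% [{100 * lo:.1f}, {100 * hi:.1f}]")
# GPT-5.4: 25.0% [22.2, 28.0]
# GPT-4o mini: 25.7% [22.9, 28.8]
```

The Gemini 2.0 Flash interval is far wider, which implies a much smaller sample. This page does not state that sample size. As a consistency check only, `wilson_interval(7, 24)` gives [14.9, 49.2], but 7 of 24 is an inference from the interval, not a reported figure.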

Read the full paper for methodology, benchmark design, example questions, result breakdowns by graph type, and discussion of limitations.

What We Measured

The benchmark uses 7 canonical causal graph templates (chain, fork, collider, M-bias, instrumental variable, front-door, and back-door), each instantiated with 20 randomly seeded sets of conditional probability tables (CPTs). For each instantiation, models answer 6 four-choice questions spanning observational, interventional, and counterfactual queries, for 7 × 20 × 6 = 840 questions in total.

Every question provides complete information: the full causal graph (variables and edges) and all CPTs. The only thing that varies is the query type. This design eliminates confounds such as missing information, ambiguous language, or the need for commonsense knowledge retrieval.
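To make the distinction between query types concrete, here is a minimal sketch of exact inference by enumeration on a small confounded graph (Z → X, Z → Y, X → Y). The graph, the CPT values, and all function names are illustrative assumptions, not items from the benchmark. The same tables give different answers for P(Y=1 | X=1) and P(Y=1 | do(X=1)), because the intervention removes the dependence of X on Z.

```python
from itertools import product

# Hypothetical CPTs for binary variables on the graph Z -> X, Z -> Y, X -> Y.
P_Z1 = 0.5                                   # P(Z=1)
P_X1_GIVEN_Z = {0: 0.2, 1: 0.8}              # P(X=1 | Z=z)
P_Y1_GIVEN_XZ = {(0, 0): 0.1, (0, 1): 0.5,   # P(Y=1 | X=x, Z=z), keyed by (x, z)
                 (1, 0): 0.3, (1, 1): 0.9}

def bern(p_one: float, value: int) -> float:
    """Probability of `value` for a binary variable with P(value=1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one

def joint(z: int, x: int, y: int, do_x=None) -> float:
    """Joint probability; under do(X=do_x) the factor P(x | z) becomes an indicator."""
    p_x = bern(P_X1_GIVEN_Z[z], x) if do_x is None else float(x == do_x)
    return bern(P_Z1, z) * p_x * bern(P_Y1_GIVEN_XZ[(x, z)], y)

def prob_y1_given_x1(do: bool = False) -> float:
    """P(Y=1 | X=1) if do is False, else P(Y=1 | do(X=1)), by full enumeration."""
    do_x = 1 if do else None
    numerator = sum(joint(z, 1, 1, do_x) for z in (0, 1))
    denominator = sum(joint(z, 1, y, do_x) for z, y in product((0, 1), repeat=2))
    return numerator / denominator

print(f"observational  P(Y=1 | X=1)     = {prob_y1_given_x1():.2f}")         # 0.78
print(f"interventional P(Y=1 | do(X=1)) = {prob_y1_given_x1(do=True):.2f}")  # 0.60
```

Observing X=1 makes Z=1 more likely, which raises the probability of Y. Setting X=1 by intervention leaves Z at its prior, so the answer is the back-door adjusted sum over Z.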

Results by Question Type

Model         Observational     Interventional    Counterfactual
GPT-4o mini   24.2% (110/455)   27.0% (85/315)    30.0% (21/70)
GPT-5.4       24.2% (110/455)   24.8% (78/315)    31.4% (22/70)

Both models fail uniformly across all three query types. The primary difficulty is therefore not specific to interventions; it reflects a broader inability to compute probabilities from CPTs presented in text.
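One way to check whether any cell in the table is distinguishable from guessing is a one-sided exact binomial test against the 25% chance rate. The sketch below is an assumption about how such a check could be run, not the paper's stated procedure. The function name is hypothetical, and it uses exact integer arithmetic to avoid floating-point underflow at n = 840.

```python
from fractions import Fraction
from math import comb

def p_value_above_chance(correct: int, total: int, options: int = 4) -> float:
    """One-sided exact binomial P(K >= correct) if every answer is a uniform guess."""
    # With success probability 1/options:
    # P(K = k) = C(total, k) * (options - 1)^(total - k) / options^total
    tail = sum(comb(total, k) * (options - 1) ** (total - k)
               for k in range(correct, total + 1))
    return float(Fraction(tail, options ** total))

# (correct, total) pairs taken from the table above and its row totals.
cells = {
    "GPT-4o mini, all questions": (216, 840),
    "GPT-4o mini, interventional": (85, 315),
    "GPT-5.4, counterfactual": (22, 70),
}
for label, (correct, total) in cells.items():
    print(f"{label}: p = {p_value_above_chance(correct, total):.3f}")
```

A p-value above the conventional 0.05 threshold means the observed count is compatible with random guessing. The counterfactual cells rest on only 70 questions each, so their higher point estimates carry wide uncertainty.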

Why This Matters

If LLMs are to serve as autonomous agents that act on the world, they must be able to predict the effects of their actions — which is precisely the interventional reasoning task we evaluate. Our results suggest that this capability may not be present in current frontier architectures.

The failure is striking because the task is straightforward for humans: given a causal graph and complete probability tables, compute a probability. An undergraduate statistics student can solve these problems. Human experts score 97.8% on comparable tasks. The exact inference engine achieves 100%. Yet every evaluated model — including the most capable available — performs at random chance.

“The field has been teaching systems to predict the world. We are trying to build systems that understand it. The difference is not scale. It is the objective.”

Bad Theory Labs, Lagos · June 2026