The Compression Program

Bad Theory Labs - Internal Research Program

Active - Program opened April 2026 · Lead: Al-ameen · Lab: Bad Theory Labs, Lagos

Thesis

Intelligence is compression.

Not metaphorically - mechanically. To perceive is to compress sensory input into representations that preserve what is useful and discard what is not. To reason is to compress those representations further, into structure that supports prediction, transfer, and counterfactual inference. To act is to compress a goal and a world-model into a decision under uncertainty.

This is not a new observation. Shannon formalized it. Kolmogorov sharpened it. Hinton built careers adjacent to it. What the field has not done is take it seriously as a design principle - as the thing you actually optimize for, rather than a property that sometimes emerges from optimizing for something else.

What we call reasoning in current systems is better described as pattern completion at scale: shallow compression, fast and brittle, that fails precisely when the test distribution requires a step the training distribution never took. We are investigating whether there exists a structurally different regime - deep compression - in which representations are not just statistically compact but causally structured: stable under intervention, not just under observation.

The Compression Program is Bad Theory Labs' attempt to establish rigorous empirical evidence for this distinction, and to build the training methodology that follows from it.

The Core Bet

The field has been optimizing the wrong objective for a decade. Next-token prediction induces compression as a side effect of generalization pressure - incidental, uncontrolled, and shallow by default. Our bet is that treating compression as a primary objective produces qualitatively different internal representations: ones that encode generating structure rather than surface statistics.

Our further hypothesis - which we intend to test, not assume - is that representations built under a genuine compression objective will tend toward causal structure. Not because causality is explicitly supervised, but because the shortest sufficient explanation of a distribution generated by a causal process is the causal process.

At sufficient compression depth, we expect behavior that looks like reasoning: counterfactual stability, systematic generalization, structured transfer. This is falsifiable by design.

Key Definitions

Compression Depth: the minimum number of sequential abstraction steps required to reduce a problem instance to its minimal sufficient description. A depth-1 task is lookup/memorization. Depth-2 is learning the rule that generates the lookup. Depth-3 is learning the meta-rule that generates the rule. A model's effective compression depth is the maximum task depth at which it still generalizes OOD.
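
As a concrete illustration, here is a toy task family at each of the first three depths. This is a hypothetical sketch; the affine-rule form, the modulus, and the function names are illustrative, not the program's actual curriculum.

```python
# Illustrative only: toy task families at increasing compression depth.
# depth-1: memorize an arbitrary lookup table (no rule to recover).
# depth-2: learn the single rule that generates the table (y = (a*x + b) mod P).
# depth-3: learn the meta-rule that generates new rules, so a fresh rule must be
#          inferred from a few examples per episode.
import random

P = 97  # small prime modulus, chosen arbitrarily for the toy example

def depth1_task(n=32):
    # Arbitrary pairs: the shortest sufficient description is the table itself.
    return {x: random.randrange(P) for x in random.sample(range(P), n)}

def depth2_task():
    # Pairs generated by one fixed affine rule; the rule compresses the table.
    a, b = random.randrange(1, P), random.randrange(P)
    return {x: (a * x + b) % P for x in range(P)}

def depth3_episode(k=4):
    # A fresh affine rule per episode; compressing across episodes requires the
    # meta-rule "each episode is some affine map mod P".
    a, b = random.randrange(1, P), random.randrange(P)
    support = {x: (a * x + b) % P for x in random.sample(range(P), k)}
    query = [x for x in range(P) if x not in support]
    return support, query
```

The depth-1 table admits no description shorter than itself; the depth-2 table compresses to the pair (a, b); the depth-3 family compresses further, to the statement that every episode is some affine map mod P.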

Causal Minimality: a representation is causally minimal if it encodes generating causal structure rather than a compact summary of surface correlations. Formally, a representation R of distribution P is causally minimal if R supports inference under interventional queries P(Y | do(X)), not just observational queries P(Y | X).
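
A minimal numerical sketch of why the two queries differ, using a hypothetical structural causal model with illustrative coefficients: a binary confounder U drives both X and Y, so conditioning on X picks up U's influence while intervening on X cuts the U → X edge.

```python
# Toy SCM (hypothetical): U -> X, U -> Y, X -> Y. Observing X=1 and forcing X=1
# give different answers because observation leaks information about U.
import numpy as np

rng = np.random.default_rng(0)
N = 200_000

def observational_sample(n):
    u = rng.binomial(1, 0.5, n)                    # confounder
    x = rng.binomial(1, 0.2 + 0.6 * u)             # U -> X
    y = rng.binomial(1, 0.1 + 0.3 * x + 0.5 * u)   # X -> Y and U -> Y
    return x, y

def interventional_sample(n, x_val):
    u = rng.binomial(1, 0.5, n)
    x = np.full(n, x_val)                          # do(X = x_val): U -> X edge removed
    y = rng.binomial(1, 0.1 + 0.3 * x + 0.5 * u)
    return y

x, y = observational_sample(N)
p_obs = y[x == 1].mean()                   # P(Y=1 | X=1), ~0.80 (inflated by U)
p_do = interventional_sample(N, 1).mean()  # P(Y=1 | do(X=1)), ~0.65
print(f"P(Y=1 | X=1)     ~ {p_obs:.3f}")
print(f"P(Y=1 | do(X=1)) ~ {p_do:.3f}")
```

A representation can answer the first query while being unable to answer the second; the second is what Workstream 02's benchmark is designed to force.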

Workstreams

01 - Compression as Objective

Current systems are trained to predict. Compression is incidental. This workstream asks: what if compression were the loss?

Grounding: a two-part MDL objective, minimize L(model) + L(data | model). Key departure from prior MDL work: not MDL as a regularizer on prediction, but a replacement for the prediction objective itself.
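
One way this could be operationalized is sketched below, assuming PyTorch, a mean-field Gaussian weight posterior as the variational proxy for L(model), and KL to a fixed Gaussian prior. The class and function names are ours for illustration, not an existing API.

```python
# Minimal sketch of a two-part MDL loss: L(model) + L(data | model), in bits.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

LOG2 = math.log(2.0)

class MeanFieldLinear(nn.Module):
    """Linear layer with a diagonal-Gaussian posterior over weights."""
    def __init__(self, d_in, d_out, prior_std=1.0):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(d_out, d_in))
        self.log_sigma = nn.Parameter(torch.full((d_out, d_in), -3.0))
        self.bias = nn.Parameter(torch.zeros(d_out))
        self.prior_std = prior_std

    def kl_bits(self):
        # KL( q(w) || N(0, prior_std^2) ), converted from nats to bits.
        sigma = self.log_sigma.exp()
        kl = (torch.log(self.prior_std / sigma)
              + (sigma ** 2 + self.mu ** 2) / (2 * self.prior_std ** 2) - 0.5)
        return kl.sum() / LOG2

    def forward(self, x):
        # Reparameterized weight sample.
        w = self.mu + self.log_sigma.exp() * torch.randn_like(self.mu)
        return F.linear(x, w, self.bias)

def two_part_mdl_loss(model, x, y):
    """L(model) + L(data | model), both in bits."""
    logits = model(x)
    # L(data | model): code the labels under the model's predictive distribution.
    data_bits = F.cross_entropy(logits, y, reduction="sum") / LOG2
    # L(model): sum of per-layer KL terms (variational proxy for model codelength).
    model_bits = sum(m.kl_bits() for m in model.modules()
                     if isinstance(m, MeanFieldLinear))
    return model_bits + data_bits
```

Here `model` would be a stack of MeanFieldLinear layers; the point of the sketch is that L(model) is part of the minimized quantity itself, not a tunable regularizer bolted onto a prediction loss.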

Key questions:

  • How should L(model) be operationalized in neural networks (bits-back, PAC-Bayes, variational proxies)?
  • Does compression-as-objective produce more causally structured representations than prediction-as-objective?
  • Is compression depth monotonically related to OOD generalization, or does it show phase transition behavior?

02 - The Reasoning Gap

This workstream treats the question of whether current AI reasons or merely completes patterns as an empirical one, settled by a benchmark designed to separate the two.

Hypothesis: genuine reasoning requires representations supporting interventional inference, not only observational inference. P(Y | X = x) and P(Y | do(X = x)) are different quantities.

Key questions:

  • Can we build a benchmark that provably separates observational and interventional reasoning?
  • What is the quantitative reasoning gap for frontier models?
  • Does the gap close with scale, or is it objective/architecture dependent?
  • Is there a compression-depth threshold where the gap closes?

How They Relate

01 and 02 are two ends of the same proof.

01 asks: if we compress correctly, what structure emerges?

02 asks: what structure must exist for genuine reasoning?

Central hypothesis: the answer to both is causal structure.

If 01 shows compression-as-objective produces causally structured representations, and 02 shows causal structure is what reasoning requires, then compression is the mechanism and reasoning is what emerges.

What Success Looks Like

  • Benchmark from 02 adopted as a standard for measuring genuine reasoning.
  • Training objective from 01 that outperforms next-token prediction on that benchmark at matched compute.
  • At least one surprising result predicted by compression framing and validated empirically.
  • Internal eval/training/probing infrastructure for all future Bad Theory Labs models.

What This Is Not

This is not benchmark chasing, prompting GPT-4 and calling it research, or capabilities theater. This is a mechanistic research program that can fail, and if it fails, that failure is also a real result.

The Compression Program - Experiment Designs

Bad Theory Labs

Workstream 01 - Compression as Objective

Core claim: if compression is the primary objective, the resulting representations will be structurally different from, and better than, those produced by prediction-as-objective.

Experiment 1.1 - MDL as Loss

Hypothesis: MDL-trained models recover generating structure; cross-entropy models learn brittle correlations.

Setup: synthetic dataset from known generating program; matched transformer pair (cross-entropy baseline vs MDL objective).

Measurement: OOD generalization, representation alignment via probes, and held-out description length.

Expected: MDL condition generalizes better and aligns with generating variables.

Falsification: equal OOD performance implies no structural benefit beyond implicit compression in prediction.

Experiment 1.2 - Compression Depth vs Generalization

Hypothesis: monotonic relation between compression depth and OOD generalization.

Setup: depth-1 through depth-5 curriculum with shared surface format and increasing abstraction demands.

Measurement: OOD accuracy vs depth, representation probing, and transfer from deep to shallow tasks.

Falsification: non-monotonic behavior implies compression depth is not the sole explanatory variable.

Experiment 1.3 - Representation Causality

Hypothesis: compression objective induces causal representations without explicit causal supervision.

Setup: synthetic SCM with known graph; compare cross-entropy and MDL training.

Measurement: probe accuracy for parents/directions and counterfactual prediction under forced interventions.

Falsification: parity between conditions indicates compression is not the causal-representation mechanism.
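
One simple form the probing measurement could take is sketched below: a linear readout from frozen activations to the value of each generating variable of the known SCM, with held-out probe accuracy read as how explicitly that variable is represented. The function name and the assumption that activations have already been extracted elsewhere are ours.

```python
# Hypothetical probing sketch: H is an (n, d) array of frozen activations,
# Z is an (n, k) array of discrete generating-variable values from the SCM.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_generating_variables(H, Z, seed=0):
    """Held-out linear-probe accuracy for each generating variable."""
    scores = {}
    for j in range(Z.shape[1]):
        H_tr, H_te, z_tr, z_te = train_test_split(
            H, Z[:, j], test_size=0.25, random_state=seed)
        probe = LogisticRegression(max_iter=1000).fit(H_tr, z_tr)
        scores[f"z{j}"] = probe.score(H_te, z_te)
    return scores
```

Direction probes would follow the same recipe with (parent, child) edge labels as targets; the comparison of interest is the per-variable gap between the MDL-trained and cross-entropy-trained models.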

Workstream 02 - The Reasoning Gap

Core claim: current AI is primarily pattern completion, not reasoning. These are structurally different and empirically separable.

Experiment 2.1 - The Separation Benchmark

Hypothesis: class A pattern tasks and class B causal tasks can share format while requiring different structure.

Setup: evaluate frontier models (GPT-4o, Claude, Gemini) on both classes.

Measurement: class A vs class B accuracy, model-size correlation, and causal failure taxonomy.

Expected: near-ceiling class A, substantial class B failures, weak scale-to-causal-performance relation.
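
A hypothetical sketch of how matched class-A and class-B items might share a surface format: both are rendered from the same SCM instance and the same template, differing only in whether the query is observational or interventional. The wording and the cover story are illustrative, not benchmark content.

```python
# Illustrative item-pair generator: identical template, different query type.
TEMPLATE = (
    "In this system, {u} influences both {x} and {y}, and {x} influences {y}. "
    "Across many observations, {y} is high in {p_obs}% of cases where {x} is high. "
    "{question}"
)

def item_pair(u="rainfall", x="umbrella sales", y="traffic delays", p_obs=80):
    class_a = TEMPLATE.format(  # observational / pattern query
        u=u, x=x, y=y, p_obs=p_obs,
        question=f"If you observe that {x} is high, how likely is {y} to be high?")
    class_b = TEMPLATE.format(  # interventional / causal query
        u=u, x=x, y=y, p_obs=p_obs,
        question=f"If {x} is forced to be high by decree, how likely is {y} to be high?")
    return class_a, class_b
```

The answer key for each item would be computed from the underlying SCM, as in the observational vs interventional sketch in the definitions above.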

Experiment 2.2 - Observation vs Intervention

Hypothesis: models that pass observational tasks fail interventional queries at rates that task difficulty alone does not explain.

Setup: known causal graph; query both P(Z | X=x) and P(Z | do(X=x)).

Measurement: observational accuracy, interventional accuracy, and their delta as operationalized reasoning gap.

Falsification: if scale alone closes the gap, the strong version of the thesis must be revised.
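
A minimal sketch of how the delta in the measurement above might be operationalized, assuming each benchmark item yields a paired observational query and interventional query scored 0/1; the function name and the bootstrap choice are ours.

```python
# Reasoning gap = observational accuracy - interventional accuracy,
# with a paired bootstrap interval over items.
import numpy as np

def reasoning_gap(obs_correct, int_correct, n_boot=10_000, seed=0):
    obs = np.asarray(obs_correct, dtype=float)
    itv = np.asarray(int_correct, dtype=float)
    gap = obs.mean() - itv.mean()
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(obs), size=(n_boot, len(obs)))  # resample item indices
    boot = obs[idx].mean(axis=1) - itv[idx].mean(axis=1)
    lo, hi = np.percentile(boot, [2.5, 97.5])
    return gap, (lo, hi)
```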

Experiment 2.3 - The Phase Transition

Hypothesis: reasoning emerges at a threshold compression depth d*, as a phase transition rather than a smooth trend.

Setup: evaluate depth-trained models from 1.2 on class-B and intervention tasks from 2.1/2.2.

Measurement: reasoning-gap score vs depth; detect non-linearity and representation change at d*.

Falsification: linear improvements imply gradual enablement, not discrete emergence.
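
A hedged sketch of one way the non-linearity at d* could be detected: compare a plain linear fit of gap-vs-depth against a single-breakpoint piecewise-linear fit, and look at the relative reduction in squared error. The model class and the scoring are illustrative choices, not the program's committed analysis.

```python
# Depths assumed sorted; gaps are the reasoning-gap scores at each depth.
import numpy as np

def sse_linear(x, y):
    coeffs = np.polyfit(x, y, deg=1)
    return float(np.sum((np.polyval(coeffs, x) - y) ** 2))

def sse_segmented(x, y):
    """Best single-breakpoint piecewise-linear fit (>= 2 points per side)."""
    best = np.inf
    for i in range(2, len(x) - 1):
        best = min(best, sse_linear(x[:i], y[:i]) + sse_linear(x[i:], y[i:]))
    return best

def phase_transition_score(depths, gaps):
    """Relative SSE reduction from allowing a breakpoint: near 0 suggests a
    smooth trend, large values suggest a discrete change at some depth."""
    x, y = np.asarray(depths, float), np.asarray(gaps, float)
    lin, seg = sse_linear(x, y), sse_segmented(x, y)
    return (lin - seg) / lin if lin > 0 else 0.0
```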

The Connection

  • 1.1 establishes that compression objective changes representations.
  • 1.2 establishes that compression depth predicts generalization.
  • 1.3 establishes that compression produces causal representations.
  • 2.1 establishes reasoning gap is real and measurable.
  • 2.2 operationalizes reasoning gap formally (observation vs intervention).
  • 2.3 connects both workstreams (compression depth predicts reasoning gap).

Together: compression is the mechanism, reasoning emerges at depth.

Order of Execution

  1. Start with 2.1 (benchmark first).
  2. Run 1.1 in parallel (small compute, quick signal).
  3. Run 1.2 and 1.3 after early 1.1 results.
  4. Run 2.2 once 2.1 benchmark is validated.
  5. Run 2.3 last as synthesis.

What a Paper Looks Like

Paper 1: The Reasoning Gap Benchmark (2.1 + 2.2). Claim: rigorous empirical distinction between pattern completion and reasoning.

Paper 2: Compression as Mechanism (1.1 + 1.2 + 1.3 + 2.3). Claim: compression-as-objective produces causal representations, and reasoning emerges at sufficient depth.

“The field has been teaching systems to predict the world. We are trying to build systems that understand it. The difference is not scale. It is the objective.”

Bad Theory Labs, Lagos - April 2026