Research Paper · July 2026

Context Integrity

A benchmark for long-running AI agent memory and action.

Bad Theory Labs · CIB v0 · Paper v0.1

The agent stack has a context integrity problem.

Long context is capacity. Memory is continuity. Retrieval is access. Reasoning is use. Current systems often conflate these. Context Integrity Benchmark separates them.

250 CIB v0 tasks: deterministic synthetic workflows across seven context-integrity task families
100.0% upper-bound sufficiency: structured scoped memory retrieves enough evidence with no stale-fact errors
76.0% full-history ceiling: retrieving every event still fails update and causal-action tasks
38.40 utility / 1k tokens: supported retrieval outcomes per estimated thousand context tokens

Abstract

AI agents are increasingly expected to operate across long-running workflows: reading documents, remembering user preferences, updating stale facts, retrieving evidence, and choosing actions. Existing evaluations usually isolate one piece of this system. Long-context benchmarks test whether a model can attend over a fixed prompt. Retrieval benchmarks test whether a relevant passage can be found. Agent benchmarks test whether a model can call tools. None of these alone measures whether an agent preserves context integrity across time.

We define context integrity as the property that every answer or action can be traced to the right stored evidence, updated against newer evidence, bounded by uncertainty, and executed only when the evidence supports it. CIB v0 now includes 250 deterministic tasks and five retrieval/memory baselines. These are context-pipeline results, not frontier LLM agent results.

Core claim: AI agents fail when their context pipeline fails. The bottleneck is not only model intelligence, but the integrity of memory, retrieval, evidence, and action over time.

Why This Exists

Real work does not arrive as a single prompt with all relevant facts attached. A user changes their mind. A document supersedes an older document. A preference applies in one situation but not another. An instruction is remembered, then contradicted. The agent must decide what to store, what to ignore, what to retrieve, what to update, when to ask for clarification, and whether an action is justified.

Formal Model

A CIB task is an ordered event stream plus a later query, allowed actions, gold evidence, disallowed stale evidence, and a gold answer or action. Retrieval is sufficient only when the returned sources include all required evidence and exclude superseded evidence that would authorize the wrong current decision.

This separates model quality from context-pipeline quality: if the retrieved context is insufficient or stale, even a perfect evidence-gated actor is already bounded away from the correct action.
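The sufficiency rule above reduces to a small set check. The following is a minimal sketch, assuming evidence is identified by string source IDs; the function name and the example IDs are illustrative and not taken from the CIB repo.

```python
def is_sufficient(retrieved: set[str], gold: set[str], stale: set[str]) -> bool:
    """Return True when retrieval covers all gold evidence and leaks no disallowed stale evidence."""
    covers_gold = gold <= retrieved          # every required source was returned
    leaks_stale = bool(retrieved & stale)    # some disallowed stale source was returned
    return covers_gold and not leaks_stale


# Hypothetical source IDs for a knowledge-update task.
gold = {"s2"}    # the newer fact
stale = {"s1"}   # the fact it supersedes

print(is_sufficient({"s2"}, gold, stale))        # scoped retrieval
print(is_sufficient({"s1", "s2"}, gold, stale))  # full history: full recall, but the stale source leaks
print(is_sufficient({"s1"}, gold, stale))        # gold evidence is missing
```

Under this reading, full recall is not enough: the second call returns every gold source and still fails because a superseded source came along with it.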

Task Families

01

Selective Write

The system receives noisy sessions and must store durable facts without turning every sentence into memory.

02

Evidence Retrieval

The system must retrieve the minimum sufficient evidence set, not a vague pile of semantically similar context.

03

Knowledge Update

New evidence may supersede old evidence. The agent must use current facts while preserving history when asked.

04

Abstention

When evidence is missing, the correct behavior is to say so. Unsupported confidence counts as failure.

05

Multi-Session Reasoning

The answer requires combining evidence from separate sessions without importing unrelated project context.

06

Action Grounding

The chosen action must follow from retrieved evidence, not from a plausible guess about what the user probably wants.

07

Causal Action

Agents that act must distinguish observed correlation from evidence that an intervention will work.

Evaluation Pipeline

Ingest events → Write memory → Retrieve evidence → Answer or act → Audit sources

A benchmark item is a multi-session workflow with timestamped events, evidence sources, distractors, contradictions, user preferences, documents, possible tool actions, and final questions. The system must process events over time, maintain memory, retrieve evidence, and answer or act later.
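The five stages can be sketched as plain functions over an event list. This is a toy illustration under stated assumptions, not the benchmark's implementation: the `Event` fields, the `durable` flag, and the word-overlap retriever are all hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source_id: str
    timestamp: int   # session order; a real item would carry a full timestamp
    text: str
    durable: bool    # hypothetical flag: should this event become memory?


def write_memory(events: list[Event]) -> list[Event]:
    """Write step: keep durable facts, drop chatter."""
    return [e for e in events if e.durable]


def retrieve(memory: list[Event], query: str, k: int = 3) -> list[Event]:
    """Retrieve step: rank stored events by word overlap with the query, newest first on ties."""
    query_words = set(query.lower().split())
    scored = [(len(query_words & set(e.text.lower().split())), e) for e in memory]
    ranked = sorted((p for p in scored if p[0] > 0), key=lambda p: (-p[0], -p[1].timestamp))
    return [e for _, e in ranked[:k]]


def answer_or_act(evidence: list[Event]) -> dict:
    """Act step: abstain when nothing was retrieved, otherwise cite what was used."""
    if not evidence:
        return {"action": "abstain", "sources": []}
    return {"action": "use_evidence", "sources": [e.source_id for e in evidence]}


def audit(decision: dict, memory: list[Event]) -> bool:
    """Audit step: every cited source must exist in stored memory."""
    stored = {e.source_id for e in memory}
    return all(s in stored for s in decision["sources"])


events = [
    Event("e1", 1, "Thanks, talk soon", durable=False),
    Event("e2", 2, "Send invoices to the billing alias", durable=True),
]
memory = write_memory(sorted(events, key=lambda e: e.timestamp))  # ingest in time order
decision = answer_or_act(retrieve(memory, "where do invoices go"))
print(decision, audit(decision, memory))
```

Keeping each stage a separate function mirrors the point of the pipeline: a failure can be attributed to the write, retrieve, act, or audit step rather than to the system as a whole.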

Metrics

  • answer accuracy
  • action accuracy
  • evidence recall
  • evidence precision
  • retrieval sufficiency
  • unsupported claim rate
  • stale fact error rate
  • abstention precision
  • abstention recall
  • write precision
  • write recall
  • latency
  • token cost

The production metric is grounded utility per token: supported correct outcomes divided by total billable tokens. CIB v0 reports it as supported retrieval outcomes per estimated thousand context tokens. This ties memory quality to actual deployment pressure.
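Three of these metrics are simple ratios. The sketch below assumes evidence is tracked as sets of source IDs and that token counts are already available; the function names are illustrative, and the conventions for empty sets are choices made here, not documented CIB behavior.

```python
def evidence_precision(retrieved: set[str], gold: set[str]) -> float:
    """Fraction of retrieved sources that are gold evidence (0.0 when nothing is retrieved)."""
    return len(retrieved & gold) / len(retrieved) if retrieved else 0.0


def evidence_recall(retrieved: set[str], gold: set[str]) -> float:
    """Fraction of gold evidence that was retrieved (1.0 when no evidence is required)."""
    return len(retrieved & gold) / len(gold) if gold else 1.0


def utility_per_1k_tokens(supported_outcomes: float, total_tokens: float) -> float:
    """Supported correct outcomes per thousand tokens spent."""
    return 1000.0 * supported_outcomes / total_tokens if total_tokens else 0.0


retrieved = {"s1", "s2", "s3", "s4"}
gold = {"s1", "s2"}
print(evidence_precision(retrieved, gold))   # half of what was retrieved is gold
print(evidence_recall(retrieved, gold))      # all gold evidence was retrieved
print(utility_per_1k_tokens(40, 250 * 40.0))
```

If the Tokens column in the results is read as mean estimated context tokens per task, the last call reproduces the recent3 row: 40 sufficient tasks out of 250 at 40.0 tokens each gives a utility of 4.00.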

CIB v0 Results

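| System | Precision | Recall | Sufficiency | Action | Stale | Unsupported | Tokens | Utility |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| recent3 | 18.0% | 39.0% | 16.0% | 16.0% | 8.0% | 46.0% | 40.0 | 4.00 |
| fullHistory | 29.5% | 100.0% | 76.0% | 76.0% | 24.0% | 0.0% | 55.5 | 13.68 |
| lexical3 | 43.3% | 100.0% | 76.0% | 76.0% | 24.0% | 0.0% | 42.1 | 18.03 |
| writeLexical3 | 88.0% | 100.0% | 76.0% | 76.0% | 24.0% | 0.0% | 28.0 | 27.10 |
| scopedHybrid3 | 100.0% | 100.0% | 100.0% | 100.0% | 0.0% | 0.0% | 26.0 | 38.40 |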

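The first run is intentionally modest: it evaluates retrieval and memory policy before answer generation. Recency-only retrieval (recent3) fails badly, reaching only 16.0% sufficiency. Lexical retrieval (lexical3) finds all required evidence but also retrieves stale facts, with a 24.0% stale rate. Write filtering (writeLexical3) reduces flooding, raising precision from 43.3% to 88.0%, and scoped update semantics (scopedHybrid3) remove stale errors in this synthetic setting.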

Split Results

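| Family | recent3 | fullHistory | lexical3 | writeLexical3 | scopedHybrid3 |
| --- | --- | --- | --- | --- | --- |
| selective_write | 0.0% | 100.0% | 100.0% | 100.0% | 100.0% |
| evidence_retrieval | 0.0% | 100.0% | 100.0% | 100.0% | 100.0% |
| knowledge_update | 100.0% | 0.0% | 0.0% | 0.0% | 100.0% |
| abstention | 0.0% | 100.0% | 100.0% | 100.0% | 100.0% |
| multi_session | 0.0% | 100.0% | 100.0% | 100.0% | 100.0% |
| action_grounding | 0.0% | 100.0% | 100.0% | 100.0% | 100.0% |
| causal_action | 0.0% | 0.0% | 0.0% | 0.0% | 100.0% |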

Full history and lexical retrieval fail exactly where update semantics and causal-action discipline matter. More context can preserve the old evidence too well.

Paired Tests

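| Baseline | Both sufficient | Scoped only | Baseline only | Both fail | Delta | p |
| --- | --- | --- | --- | --- | --- | --- |
| recent3 | 40 | 210 | 0 | 0 | 84.0% | <0.0001 |
| fullHistory | 190 | 60 | 0 | 0 | 24.0% | <0.0001 |
| lexical3 | 190 | 60 | 0 | 0 | 24.0% | <0.0001 |
| writeLexical3 | 190 | 60 | 0 | 0 | 24.0% | <0.0001 |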

The paired result is the sharpest CIB v0 signal: scopedHybrid3 matches full-history context on every task where full history is sufficient and fixes the 60 update or causal-action tasks where stale evidence breaks full-history context.
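The paper does not name the paired test it uses. One standard choice for paired binary outcomes is an exact McNemar test on the discordant pairs, sketched here as an assumption; the function names are illustrative.

```python
from math import comb


def exact_mcnemar_p(scoped_only: int, baseline_only: int) -> float:
    """Two-sided exact McNemar p-value computed from the discordant pair counts."""
    n = scoped_only + baseline_only
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(scoped_only, baseline_only)
    # Binomial(n, 0.5) probability of a split at least as lopsided as the one observed.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def sufficiency_delta(scoped_only: int, baseline_only: int, total_tasks: int) -> float:
    """Difference in sufficiency rate between the scoped system and the baseline."""
    return (scoped_only - baseline_only) / total_tasks


# 60 tasks where only the scoped system is sufficient, none the other way, 250 tasks total.
print(sufficiency_delta(60, 0, 250))
print(exact_mcnemar_p(60, 0) < 0.0001)
```

With 60 discordant tasks all favoring the scoped system, the delta is 24 percentage points and the exact p-value is far below 0.0001, which is consistent with the reported row whichever paired test was actually used.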

Example Task

Session 1

For finance exports, group invoices by client, not by month.

Session 2

Actually, for audit exports only, group invoices by month.

Question

How should the agent format a normal finance invoice export?

Gold action

Group by client. The audit exception does not globally replace the original preference.
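One way to get this behavior is to store a scope alongside each preference, so that a narrower rule adds an exception instead of overwriting the default. The sketch below is an assumption about how such a memory could work; the `Preference` fields, topic string, and action labels are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Preference:
    source_id: str
    session: int
    topic: str             # what the preference is about
    scope: Optional[str]   # None is the default; otherwise a narrower case such as "audit"
    value: str


def resolve(memory: list[Preference], topic: str, scope: Optional[str]) -> Optional[Preference]:
    """Pick the newest preference whose scope matches exactly, falling back to the default."""
    candidates = [p for p in memory if p.topic == topic]
    exact = [p for p in candidates if p.scope == scope]
    default = [p for p in candidates if p.scope is None]
    pool = exact or default
    return max(pool, key=lambda p: p.session) if pool else None


memory = [
    Preference("s1", 1, "finance_export_grouping", None, "group_by_client"),
    Preference("s2", 2, "finance_export_grouping", "audit", "group_by_month"),
]

normal = resolve(memory, "finance_export_grouping", scope=None)
audit_case = resolve(memory, "finance_export_grouping", scope="audit")
print(normal.value, normal.source_id)
print(audit_case.value, audit_case.source_id)
```

A recency-only rule would pick the Session 2 instruction for both queries. Scoping keeps Session 1 as the evidence for the normal export and returns its source ID for audit.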

Baselines

  • Full-history prompting: pass the entire available history when it fits.
  • Long-context truncation: pass the most recent history up to the model limit.
  • Naive vector RAG: chunk, embed, and retrieve top-k similar chunks.
  • Hybrid retrieval: combine lexical BM25 and vector retrieval.
  • Memory system: use explicit write, update, and retrieve operations.
  • Memory plus critic: verify retrieval and answer support before final output.

Falsifiable Claims

Claim 1: Long context alone is insufficient for durable agent memory.

Claim 2: Hybrid retrieval improves evidence precision over naive vector retrieval.

Claim 3: A critic reduces unsupported claims and stale-fact errors.

Claim 4: Causal-action tasks expose failures not visible in recall tasks.

Model Eval Harness

The repo includes an OpenAI-compatible model harness for the next phase. It prompts models to return strict JSON with an action, source IDs, and abstention flag, then scores action accuracy, evidence sufficiency, stale evidence, and abstention behavior. No frontier model scores are reported until real credentials are available.
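Scoring one such response could look like the sketch below. The JSON field names (`action`, `source_ids`, `abstain`) and the score keys are assumptions made for illustration; the actual harness schema is not shown in this paper.

```python
import json


def score_response(raw: str, gold_action: str, gold_sources: set[str],
                   stale_sources: set[str], should_abstain: bool) -> dict:
    """Parse one strict-JSON model response and score it against a task's gold labels."""
    try:
        reply = json.loads(raw)
        action = str(reply["action"])
        cited = set(reply["source_ids"])
        abstained = bool(reply["abstain"])
    except (json.JSONDecodeError, KeyError, TypeError):
        # Anything other than the expected JSON object counts as a failed response.
        return {"valid": False, "action_correct": False, "evidence_sufficient": False,
                "used_stale": False, "abstention_correct": False}
    used_stale = bool(cited & stale_sources)
    return {
        "valid": True,
        "action_correct": action == gold_action,
        "evidence_sufficient": gold_sources <= cited and not used_stale,
        "used_stale": used_stale,
        "abstention_correct": abstained == should_abstain,
    }


raw = '{"action": "group_by_client", "source_ids": ["s1"], "abstain": false}'
print(score_response(raw, "group_by_client", {"s1"}, {"s2"}, should_abstain=False))
print(score_response("not json", "group_by_client", {"s1"}, {"s2"}, should_abstain=False))
```

Treating malformed output as a scored failure, rather than raising, keeps a full benchmark run from stopping on one bad response.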

Dataset Card

CIB v0 is a deterministic synthetic JSONL benchmark with no personal or customer data. Each item includes timestamped events, gold evidence source IDs, stale-evidence markers, an abstention flag, and a discrete gold action. The release includes the generator, dataset, summary JSON, benchmark report, PDF, and an OpenAI-compatible model evaluation harness.
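A single item with these fields might be loaded and sanity-checked as follows. Every field name and value in the sample is a hypothetical guess at a plausible schema, not a record from the released dataset.

```python
import io
import json

# One hypothetical item, serialized as a single JSONL line.
SAMPLE_JSONL = json.dumps({
    "task_id": "example-001",
    "family": "knowledge_update",
    "events": [
        {"source_id": "s1", "timestamp": "2026-01-05T09:00:00Z", "text": "Ship to warehouse A."},
        {"source_id": "s2", "timestamp": "2026-02-10T09:00:00Z", "text": "Ship to warehouse B from now on."},
    ],
    "query": "Where should the next shipment go?",
    "gold_evidence": ["s2"],
    "stale_evidence": ["s1"],
    "should_abstain": False,
    "gold_action": "ship_to_warehouse_b",
}) + "\n"


def load_items(stream) -> list[dict]:
    """Read JSONL: one JSON object per non-empty line."""
    return [json.loads(line) for line in stream if line.strip()]


def check_item(item: dict) -> None:
    """Evidence markers must point at real events and must not overlap."""
    event_ids = {e["source_id"] for e in item["events"]}
    gold, stale = set(item["gold_evidence"]), set(item["stale_evidence"])
    assert gold <= event_ids, "gold evidence must reference existing events"
    assert stale <= event_ids, "stale markers must reference existing events"
    assert not (gold & stale), "a source cannot be both gold and stale"


items = load_items(io.StringIO(SAMPLE_JSONL))
for item in items:
    check_item(item)
print(len(items), items[0]["gold_action"])
```

For a real file, the same `load_items` call would take an open file handle in place of the in-memory stream.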

How It Connects to BTL

This is the bridge between the lab's research and products. RetainDB can be evaluated as the memory and retrieval layer. BTL Runtime can measure model, latency, and token-cost effects. Marrow can be evaluated as the action layer that must decide when evidence is strong enough to intervene.

Related Work

CIB builds on RAG, long-context evaluation, MemGPT, LongMemEval, MemoryAgentBench, Evo-Memory, Mem0, StructMemEval, and recent agentic-memory work. Its narrower contribution is evidence-level grading for whether memory is current, scoped, auditable, sufficient, and safe to act on.

Memory is not the ability to recall a sentence. It is the ability to maintain an auditable state that supports correct decisions over time.

Bad Theory Labs, Lagos · July 2026