The agent stack has a context integrity problem.
Long context is capacity. Memory is continuity. Retrieval is access. Reasoning is use. Current systems often conflate these. The Context Integrity Benchmark (CIB) separates them.
Abstract
AI agents are increasingly expected to operate across long-running workflows: reading documents, remembering user preferences, updating stale facts, retrieving evidence, and choosing actions. Existing evaluations usually isolate one piece of this system. Long-context benchmarks test whether a model can attend over a fixed prompt. Retrieval benchmarks test whether a relevant passage can be found. Agent benchmarks test whether a model can call tools. None of these alone measures whether an agent preserves context integrity across time.
We define context integrity as the property that every answer or action can be traced to the right stored evidence, updated against newer evidence, bounded by uncertainty, and executed only when the evidence supports it. CIB v0 includes 250 deterministic tasks and five retrieval/memory baselines. The results reported here are context-pipeline results, not frontier LLM agent results.
Core claim: AI agents fail when their context pipeline fails. The bottleneck is not only model intelligence, but the integrity of memory, retrieval, evidence, and action over time.
Why This Exists
Real work does not arrive as a single prompt with all relevant facts attached. A user changes their mind. A document supersedes an older document. A preference applies in one situation but not another. An instruction is remembered, then contradicted. The agent must decide what to store, what to ignore, what to retrieve, what to update, when to ask for clarification, and whether an action is justified.
Formal Model
A CIB task is an ordered event stream plus a later query, allowed actions, gold evidence, disallowed stale evidence, and a gold answer or action. Retrieval is sufficient only when the returned sources include all required evidence and exclude superseded evidence that would authorize the wrong current decision.
This separates model quality from context-pipeline quality: if the retrieved context is insufficient or stale, even a perfect evidence-gated actor cannot justify the correct action, so retrieval sufficiency bounds what the actor can achieve.
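The sufficiency condition above can be sketched as a small predicate. This is a minimal illustration, not the benchmark's actual schema: the `Task` class, its field names, and the sample source IDs are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical in-memory form of one CIB task (field names are assumptions)."""
    events: tuple[str, ...]          # ordered source IDs of the event stream
    query: str
    allowed_actions: frozenset[str]
    gold_evidence: frozenset[str]    # source IDs that must be retrieved
    stale_evidence: frozenset[str]   # superseded source IDs that must not be retrieved
    gold_action: str


def is_sufficient(task: Task, retrieved: set[str]) -> bool:
    """True when retrieval covers all gold evidence and returns no stale evidence."""
    covers_gold = task.gold_evidence <= retrieved
    avoids_stale = not (task.stale_evidence & retrieved)
    return covers_gold and avoids_stale


# Illustrative task: source "s2" supersedes source "s1".
task = Task(
    events=("s1", "s2"),
    query="Which export format is current?",
    allowed_actions=frozenset({"use_csv", "use_parquet", "abstain"}),
    gold_evidence=frozenset({"s2"}),
    stale_evidence=frozenset({"s1"}),
    gold_action="use_parquet",
)
print(is_sufficient(task, {"s2"}))        # True
print(is_sufficient(task, {"s1", "s2"}))  # False: the stale source is included
```

Under this reading, a policy that returns the whole history achieves perfect recall on the toy task yet still fails sufficiency, because the superseded source comes back with the current one.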
Task Families
Selective Write
The system receives noisy sessions and must store durable facts without turning every sentence into memory.
Evidence Retrieval
The system must retrieve the minimum sufficient evidence set, not a vague pile of semantically similar context.
Knowledge Update
New evidence may supersede old evidence. The agent must use current facts while preserving history when asked.
Abstention
When evidence is missing, the correct behavior is to say so. Unsupported confidence counts as failure.
Multi-Session Reasoning
The answer requires combining evidence from separate sessions without importing unrelated project context.
Action Grounding
The chosen action must follow from retrieved evidence, not from a plausible guess about what the user probably wants.
Causal Action
Agents that act must distinguish observed correlation from evidence that an intervention will work.
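The abstention and action-grounding families share one rule: act only when the retrieved evidence supports the action. A minimal sketch of such an evidence gate follows, assuming each candidate action lists the source IDs it requires. The function name, the `requirements` mapping, and the `"abstain"` label are hypothetical, not part of the CIB release.

```python
from __future__ import annotations

ABSTAIN = "abstain"  # hypothetical label for the abstention action


def gated_action(requirements: dict[str, frozenset[str]], retrieved: set[str]) -> str:
    """Return the one action whose required evidence was fully retrieved, else abstain.

    An action with an empty requirement set is never treated as supported,
    so a plausible guess without evidence cannot pass the gate.
    """
    supported = [
        action
        for action, needed in requirements.items()
        if needed and needed <= retrieved
    ]
    # Abstain when nothing is supported or when the evidence is ambiguous.
    return supported[0] if len(supported) == 1 else ABSTAIN


requirements = {
    "issue_refund": frozenset({"e1", "e2"}),
    "escalate": frozenset({"e3"}),
}
print(gated_action(requirements, {"e1", "e2"}))        # issue_refund
print(gated_action(requirements, {"e1"}))              # abstain: e2 is missing
print(gated_action(requirements, {"e1", "e2", "e3"}))  # abstain: two actions supported
```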
Evaluation Pipeline
A benchmark item is a multi-session workflow with timestamped events, evidence sources, distractors, contradictions, user preferences, documents, possible tool actions, and final questions. The system must process events over time, maintain memory, retrieve evidence, and answer or act later.
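A rough sketch of that loop is shown below: events are replayed in timestamp order, written to memory, and a retrieval policy is applied once the question arrives. The `Event` type, `run_item`, and the two policies are hypothetical names, and reading `recent3` as "the three most recent events" is an assumption rather than something the report states.

```python
from __future__ import annotations

from typing import Callable, NamedTuple


class Event(NamedTuple):
    timestamp: int
    source_id: str
    text: str


# A policy maps (memory, query) to the events handed to the answering stage.
Policy = Callable[[list[Event], str], list[Event]]


def recent_k(k: int) -> Policy:
    """Policy that keeps only the k most recent events (assumed meaning of 'recent3')."""
    def policy(memory: list[Event], query: str) -> list[Event]:
        return memory[-k:]
    return policy


def full_history(memory: list[Event], query: str) -> list[Event]:
    """Policy that returns everything stored so far."""
    return list(memory)


def run_item(events: list[Event], query: str, policy: Policy) -> list[str]:
    """Replay events in time order, then return the source IDs the policy retrieves."""
    memory: list[Event] = []
    for event in sorted(events, key=lambda e: e.timestamp):
        memory.append(event)  # write step; a selective-write system would filter here
    return [e.source_id for e in policy(memory, query)]


events = [
    Event(3, "s3", "third note"),
    Event(1, "s1", "first note"),
    Event(4, "s4", "fourth note"),
    Event(2, "s2", "second note"),
]
print(run_item(events, "what changed?", recent_k(3)))    # ['s2', 's3', 's4']
print(run_item(events, "what changed?", full_history))   # ['s1', 's2', 's3', 's4']
```

Keeping the policy as a plain function makes it straightforward to run several baselines over the same event stream and compare what each one retrieves.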
Metrics
The production metric is grounded utility per token: supported correct outcomes divided by total billable tokens. This ties memory quality to actual deployment pressure.
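As a minimal sketch, the metric is a single ratio. The function name and the sample numbers are illustrative only, and any scaling applied to the reported Utility column is not assumed here.

```python
def grounded_utility_per_token(supported_correct: int, billable_tokens: int) -> float:
    """Supported correct outcomes divided by total billable tokens."""
    if billable_tokens <= 0:
        raise ValueError("billable_tokens must be positive")
    return supported_correct / billable_tokens


# Illustrative numbers, not benchmark data: 8 supported correct outcomes over 400 tokens.
print(grounded_utility_per_token(8, 400))  # 0.02
```

Because the denominator counts every billable token, a policy that floods the prompt with context lowers the score even when its answers stay correct.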
CIB v0 Results
| System | Precision | Recall | Sufficiency | Action | Stale | Unsupported | Tokens | Utility |
|---|---|---|---|---|---|---|---|---|
| recent3 | 18.0% | 39.0% | 16.0% | 16.0% | 8.0% | 46.0% | 40.0 | 4.00 |
| fullHistory | 29.5% | 100.0% | 76.0% | 76.0% | 24.0% | 0.0% | 55.5 | 13.68 |
| lexical3 | 43.3% | 100.0% | 76.0% | 76.0% | 24.0% | 0.0% | 42.1 | 18.03 |
| writeLexical3 | 88.0% | 100.0% | 76.0% | 76.0% | 24.0% | 0.0% | 28.0 | 27.10 |
| scopedHybrid3 | 100.0% | 100.0% | 100.0% | 100.0% | 0.0% | 0.0% | 26.0 | 38.40 |
The first run is intentionally modest. It evaluates retrieval and memory policy before answer generation. Recency fails badly. Lexical retrieval finds evidence but retrieves stale facts. Write filtering reduces flooding, and scoped update semantics remove stale errors in this synthetic setting.
Split Results
| Family | recent3 | fullHistory | lexical3 | writeLexical3 | scopedHybrid3 |
|---|---|---|---|---|---|
| selective_write | 0.0% | 100.0% | 100.0% | 100.0% | 100.0% |
| evidence_retrieval | 0.0% | 100.0% | 100.0% | 100.0% | 100.0% |
| knowledge_update | 100.0% | 0.0% | 0.0% | 0.0% | 100.0% |
| abstention | 0.0% | 100.0% | 100.0% | 100.0% | 100.0% |
| multi_session | 0.0% | 100.0% | 100.0% | 100.0% | 100.0% |
| action_grounding | 0.0% | 100.0% | 100.0% | 100.0% | 100.0% |
| causal_action | 0.0% | 0.0% | 0.0% | 0.0% | 100.0% |
Full history and lexical retrieval fail exactly where update semantics and causal-action discipline matter. More context can preserve the old evidence too well.
Paired Tests
| Baseline | Both sufficient | Scoped only | Baseline only | Both fail | Delta | p |
|---|---|---|---|---|---|---|
| recent3 | 40 | 210 | 0 | 0 | 84.0% | <0.0001 |
| fullHistory | 190 | 60 | 0 | 0 | 24.0% | <0.0001 |
| lexical3 | 190 | 60 | 0 | 0 | 24.0% | <0.0001 |
| writeLexical3 | 190 | 60 | 0 | 0 | 24.0% | <0.0001 |
The paired result is the sharpest CIB v0 signal: scopedHybrid3 matches full-history context on every task where full history is sufficient and fixes the 60 update or causal-action tasks where stale evidence breaks full-history context.
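The report does not name the test behind the p column. One common choice for paired binary outcomes is an exact McNemar (sign) test on the discordant counts, sketched here as an assumption; the function name is hypothetical.

```python
from math import comb


def mcnemar_exact_p(scoped_only: int, baseline_only: int) -> float:
    """Two-sided exact McNemar p-value computed from the two discordant counts."""
    n = scoped_only + baseline_only
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(scoped_only, baseline_only)
    # Probability of a split at least this lopsided under a fair coin, one tail.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Discordant counts from the fullHistory row: 60 scoped-only, 0 baseline-only.
print(mcnemar_exact_p(60, 0))  # about 1.7e-18, consistent with p < 0.0001
print(60 / 250)                # 0.24, the reported delta
```

With zero baseline-only wins, any reasonable paired test gives a very small p-value. The delta column follows directly from the scoped-only count divided by the 250 tasks.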
Example Task
Session 1
For finance exports, group invoices by client, not by month.
Session 2
Actually, for audit exports only, group invoices by month.
Question
How should the agent format a normal finance invoice export?
Gold action
Group by client. The audit exception does not globally replace the original preference.
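One way to model this example is to attach a scope to each stored preference and let the most specific applicable scope win. The sketch below is illustrative: the `Preference` type, the slash-separated scope strings, and `resolve` are assumptions, not the scoping mechanism the benchmark baselines actually use.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Preference:
    session: int
    scope: str   # hypothetical hierarchical scope, e.g. "finance_export/audit"
    value: str


def resolve(prefs: list[Preference], request_scope: str) -> Optional[str]:
    """Pick the preference whose scope covers the request; most specific wins."""
    applicable = [
        p for p in prefs
        if request_scope == p.scope or request_scope.startswith(p.scope + "/")
    ]
    if not applicable:
        return None  # no stored evidence applies, so the caller should abstain
    # Deeper scopes are more specific; a later session breaks ties.
    best = max(applicable, key=lambda p: (p.scope.count("/"), p.session))
    return best.value


prefs = [
    Preference(1, "finance_export", "group_by_client"),
    Preference(2, "finance_export/audit", "group_by_month"),
]
print(resolve(prefs, "finance_export"))        # group_by_client
print(resolve(prefs, "finance_export/audit"))  # group_by_month
```

A pure recency rule would return `group_by_month` for both requests, which is the failure this task is designed to catch.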
Baselines
- Full-history prompting: pass the entire available history when it fits.
- Long-context truncation: pass the most recent history up to the model limit.
- Naive vector RAG: chunk, embed, and retrieve top-k similar chunks.
- Hybrid retrieval: combine lexical BM25 and vector retrieval.
- Memory system: use explicit write, update, and retrieve operations.
- Memory plus critic: verify retrieval and answer support before final output.
Falsifiable Claims
Claim 1: Long context alone is insufficient for durable agent memory.
Claim 2: Hybrid retrieval improves evidence precision over naive vector retrieval.
Claim 3: A critic reduces unsupported claims and stale-fact errors.
Claim 4: Causal-action tasks expose failures not visible in recall tasks.
Model Eval Harness
The repo includes an OpenAI-compatible model harness for the next phase. It prompts models to return strict JSON with an action, source IDs, and an abstention flag, then scores action accuracy, evidence sufficiency, stale evidence, and abstention behavior. No frontier-model scores will be reported until the harness has been run with real credentials.
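The scoring step might look roughly like the following. The JSON keys (`action`, `source_ids`, `abstain`) and the function name are assumptions for illustration; the harness in the repo may use different field names and checks.

```python
from __future__ import annotations

import json


def score_response(
    raw: str,
    gold_action: str,
    gold_sources: set[str],
    stale_sources: set[str],
) -> dict[str, bool]:
    """Parse a strict-JSON model reply and grade it against the gold labels."""
    try:
        reply = json.loads(raw)
        action = reply["action"]
        sources = set(reply["source_ids"])
        abstained = bool(reply["abstain"])
    except (json.JSONDecodeError, KeyError, TypeError):
        # Malformed replies fail every check instead of being repaired.
        return {
            "valid": False,
            "action_correct": False,
            "evidence_sufficient": False,
            "used_stale": False,
            "abstained": False,
        }
    used_stale = bool(stale_sources & sources)
    return {
        "valid": True,
        "action_correct": action == gold_action,
        "evidence_sufficient": gold_sources <= sources and not used_stale,
        "used_stale": used_stale,
        "abstained": abstained,
    }


raw = '{"action": "group_by_client", "source_ids": ["s1"], "abstain": false}'
result = score_response(raw, "group_by_client", {"s1"}, {"s0"})
print(result["action_correct"], result["evidence_sufficient"])  # True True
```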
Dataset Card
CIB v0 is a deterministic synthetic JSONL benchmark with no personal or customer data. Each item includes timestamped events, gold evidence source IDs, stale-evidence markers, an abstention flag, and a discrete gold action. The release includes the generator, dataset, summary JSON, benchmark report, PDF, and an OpenAI-compatible model evaluation harness.
How It Connects to BTL
This is the bridge between the lab's research and products. RetainDB can be evaluated as the memory and retrieval layer. BTL Runtime can measure model, latency, and token-cost effects. Marrow can be evaluated as the action layer that must decide when evidence is strong enough to intervene.
Related Work
CIB builds on RAG, long-context evaluation, MemGPT, LongMemEval, MemoryAgentBench, Evo-Memory, Mem0, StructMemEval, and recent agentic-memory work. Its narrower contribution is evidence-level grading for whether memory is current, scoped, auditable, sufficient, and safe to act on.
Memory is not the ability to recall a sentence. It is the ability to maintain an auditable state that supports correct decisions over time.