Test agent memory across sessions and over time — measure context completeness first, then answer correctness, latency, and token use. Methods and benchmarks.
You test agent memory by checking whether the agent retrieves the right facts, at the right time, across multiple sessions, as those facts change — measuring retrieval completeness first, then answer correctness. Single-turn accuracy tells you almost nothing; memory is a temporal, multi-session capability, so the test has to seed facts, change them, and verify the agent answers “what's true now?” and “what was true then?” correctly. This guide covers what to measure, how to build a test set, the industry benchmarks, and how to run an end-to-end evaluation on your own data.
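The seed, change, and verify pattern described above can be sketched with a toy store. This is a minimal illustration, not any real memory system's API: `ToyTemporalMemory`, `record`, `lookup`, the `user.city` key, and the dates are all hypothetical names and values chosen for the example.

```python
from bisect import bisect_right
from datetime import date


class ToyTemporalMemory:
    """Hypothetical stand-in for a memory system: keeps every dated value of a fact."""

    def __init__(self):
        # key -> list of (valid_from, value), kept sorted by date
        self._history = {}

    def record(self, key, value, valid_from):
        entries = self._history.setdefault(key, [])
        entries.append((valid_from, value))
        entries.sort(key=lambda entry: entry[0])

    def lookup(self, key, as_of=None):
        """Return the value valid on `as_of`, or the latest value if `as_of` is None."""
        entries = self._history.get(key, [])
        if not entries:
            return None
        if as_of is None:
            return entries[-1][1]
        dates = [valid_from for valid_from, _ in entries]
        index = bisect_right(dates, as_of)
        return entries[index - 1][1] if index else None


def run_temporal_test():
    memory = ToyTemporalMemory()
    # Session 1: seed the fact.
    memory.record("user.city", "Berlin", date(2024, 1, 10))
    # A later session: the fact changes.
    memory.record("user.city", "Lisbon", date(2024, 6, 3))
    return {
        "now": memory.lookup("user.city"),  # what's true now?
        "then": memory.lookup("user.city", as_of=date(2024, 3, 1)),  # what was true then?
        "before_seed": memory.lookup("user.city", as_of=date(2023, 12, 31)),
    }


results = run_temporal_test()
```

A real test would replace the toy store with calls into the memory system under test and compare its answers against the same three expectations: the current value, the historical value, and no value before the fact was seeded.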
Separate the two failure modes — retrieval and generation — because they need different fixes.
| Metric | Question it answers | Priority and rationale |
|---|---|---|
| Context completeness | Did memory retrieve all the facts needed to answer? (COMPLETE / PARTIAL / INSUFFICIENT) | Primary — measures the memory system itself, isolated from the LLM |
| Answer correctness | Did the final answer match the expected answer? (CORRECT / WRONG) | Secondary — depends on both retrieval and the model's generation |
| Retrieval latency | How fast did memory return context? | Production gate — accuracy at high latency doesn't ship |
| Token efficiency | How many tokens did the retrieved context consume? | Cost gate — memory systems often trade accuracy for token bloat |
Measure completeness first: if the right facts weren't retrieved, no prompt engineering will save the answer. Then measure correctness, latency, and tokens together — a good memory system wins on all of them at once rather than trading one for another.
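As a rough sketch of how the completeness labels and the two failure modes could be computed, the snippet below grades retrieval before looking at the answer. It assumes a fact counts as retrieved when its text appears in the retrieved context; that substring check, and the names `grade_completeness` and `diagnose`, are illustrative assumptions. A production harness would more likely use an LLM judge for the match.

```python
def grade_completeness(required_facts, retrieved_context):
    """Label retrieval as COMPLETE, PARTIAL, or INSUFFICIENT.

    Assumption: a fact is "retrieved" if its text appears, case-insensitively,
    in the retrieved context string.
    """
    context = retrieved_context.lower()
    found = [fact for fact in required_facts if fact.lower() in context]
    if len(found) == len(required_facts):
        return "COMPLETE"
    return "PARTIAL" if found else "INSUFFICIENT"


def diagnose(completeness, answer_correct):
    """Attribute a wrong answer to retrieval or generation."""
    if answer_correct:
        return "ok"
    # With complete context, a wrong answer points at the model, not memory.
    return "generation" if completeness == "COMPLETE" else "retrieval"


context = "User moved to Lisbon in June 2024. User works at Acme."
complete = grade_completeness(["moved to Lisbon", "works at Acme"], context)
partial = grade_completeness(["moved to Lisbon", "has a dog"], context)
insufficient = grade_completeness(["has a dog"], context)
```

Grading retrieval separately means a wrong answer with `COMPLETE` context is routed to prompt or model fixes, while `PARTIAL` or `INSUFFICIENT` context is routed to the memory system.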
For comparable numbers across systems, use the two industry benchmarks for long-running memory: LoCoMo and LongMemEval.
Always report accuracy with latency and token use. As a reference point, Zep's temporal context graph reports 94.7% accuracy on LoCoMo (155ms retrieval, 5,760 tokens) and 90.2% on LongMemEval (162ms, 4,408 tokens), which Zep presents as leading on all three measures at once.
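One way to keep the three numbers together is to summarize them from the same per-test records, as in the sketch below. The `CaseResult` fields and the sample values are made up for illustration and are unrelated to any published benchmark result.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class CaseResult:
    correct: bool          # did the final answer match the expected answer?
    latency_ms: float      # time for memory to return context
    context_tokens: int    # size of the retrieved context


def summarize(results):
    """Report accuracy, latency, and token use from one set of test cases."""
    total = len(results)
    return {
        "accuracy_pct": round(100 * sum(r.correct for r in results) / total, 1),
        "median_latency_ms": median([r.latency_ms for r in results]),
        "mean_context_tokens": round(sum(r.context_tokens for r in results) / total),
    }


# Illustrative values only.
sample = [
    CaseResult(True, 100.0, 3000),
    CaseResult(True, 200.0, 4000),
    CaseResult(False, 140.0, 5000),
    CaseResult(True, 160.0, 4000),
]
report = summarize(sample)
```

Deriving all three figures from the same records makes it harder to quote an accuracy number from one configuration next to a latency number from another.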
Benchmarks are directional; the real test is your use case. Zep ships an open evaluation harness that runs this exact loop on your data: write your example interactions, generate test cases and multi-session conversations, ingest them, and run the evaluation to get context-completeness and answer-correctness scores with per-test breakdowns. See Evaluate Zep for Your Use Case for the step-by-step harness, and the LLM evaluation framework for the broader evaluation methodology.
The most important metric is context completeness: whether the system retrieved all the facts needed to answer. It isolates memory quality from the LLM's generation. Answer correctness is the secondary, end-to-end metric.
Single-turn tests are not enough because memory is about recall across many sessions and over time. They miss the actual failure modes: forgetting across sessions and mixing stale facts with current ones.
To test temporal correctness, seed a fact, change it later in the conversation timeline, then ask the question. A correct system returns the current value and can return the historical value when asked “as of” a date.
For benchmarks, use LoCoMo and LongMemEval, which target long, multi-session memory. Report accuracy alongside retrieval latency and token cost.