
How Do You Test Agent Memory?

Test agent memory across sessions and over time — measure context completeness first, then answer correctness, latency, and token use. Methods and benchmarks.

LoCoMo: 94.7% accuracy · retrieval latency 155 ms · context size 5,760 tokens
LongMemEval: 90.2% accuracy · retrieval latency 162 ms · context size 4,408 tokens

Key takeaways

You test agent memory by checking whether the agent retrieves the right facts, at the right time, across multiple sessions, as those facts change — measuring retrieval completeness first, then answer correctness. Single-turn accuracy tells you almost nothing; memory is a temporal, multi-session capability, so the test has to seed facts, change them, and verify the agent answers “what's true now?” and “what was true then?” correctly. This guide covers what to measure, how to build a test set, the industry benchmarks, and how to run an end-to-end evaluation on your own data.

What to measure (and in what order)

Separate the two failure modes — retrieval and generation — because they need different fixes.

| Metric | Question it answers | Role and rationale |
| --- | --- | --- |
| Context completeness | Did memory retrieve all the facts needed to answer? (COMPLETE / PARTIAL / INSUFFICIENT) | Primary: measures the memory system itself, isolated from the LLM |
| Answer correctness | Did the final answer match the expected answer? (CORRECT / WRONG) | Secondary: depends on both retrieval and the model's generation |
| Retrieval latency | How fast did memory return context? | Production gate: accuracy at high latency doesn't ship |
| Token efficiency | How many tokens did the retrieved context consume? | Cost gate: memory systems often trade accuracy for token bloat |

Measure completeness first: if the right facts weren't retrieved, no prompt engineering will save the answer. Then measure correctness, latency, and tokens together — a good memory system wins on all of them at once rather than trading one for another.
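One way to keep the two failure modes apart is to record all four metrics per test case and classify each miss before fixing anything. The sketch below is illustrative only: `EvalResult`, `failure_mode`, and `summarize` are hypothetical names, not part of any Zep API, and the rule "anything short of COMPLETE is a retrieval problem" is an assumption based on the ordering described above.

```python
from dataclasses import dataclass
from enum import Enum


class Completeness(Enum):
    COMPLETE = "COMPLETE"
    PARTIAL = "PARTIAL"
    INSUFFICIENT = "INSUFFICIENT"


@dataclass
class EvalResult:
    """One graded test case (hypothetical record layout)."""
    test_id: str
    completeness: Completeness
    correct: bool
    latency_ms: float
    context_tokens: int


def failure_mode(result: EvalResult) -> str:
    """Classify a result so the fix targets the right component."""
    if result.completeness is not Completeness.COMPLETE:
        # Memory did not return every needed fact; look at retrieval first,
        # even if the model happened to produce a correct answer.
        return "retrieval"
    if not result.correct:
        # The facts were retrieved, so the miss came from generation.
        return "generation"
    return "pass"


def summarize(results: list[EvalResult]) -> dict[str, float]:
    """Aggregate the four metrics over a test run."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to summarize")
    complete = sum(r.completeness is Completeness.COMPLETE for r in results)
    return {
        "completeness_rate": complete / n,
        "correctness_rate": sum(r.correct for r in results) / n,
        "mean_latency_ms": sum(r.latency_ms for r in results) / n,
        "mean_context_tokens": sum(r.context_tokens for r in results) / n,
    }
```

Reporting the four numbers side by side makes a trade-off visible, for example a completeness gain that was bought with a much larger context.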

A method for testing agent memory

  1. Write 3–5 target interactions. Note what the user asks and what a correct agent should answer. These define “good” for your domain.
  2. Expand into 10+ test cases with variations and related questions, each with a clear golden answer describing what must be present in a correct response.
  3. Author multi-session conversations that contain the answers, spread naturally across several sessions — so the test reflects real, fragmented context rather than one tidy prompt.
  4. Include temporal cases. Seed a fact, then change it later in the timeline, and test that the agent returns the current value (and, ideally, the historical one when asked “as of” a date). This is the test most setups skip and most agents fail.
  5. Add background data and noise. Bury the relevant facts in a larger graph and add JSON/business data, so you measure retrieval under realistic conditions, not a toy dataset.
  6. Run search → evaluate context → generate → grade. Retrieve context, score completeness, generate an answer with the retrieved context, then grade it against the golden answer (an LLM-as-judge works well here when calibrated).
  7. Iterate. For each miss, check whether the data contained the fact, whether the golden answer was clear, and what retrieval actually returned — then adjust data, search parameters, or graph configuration.
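The search, evaluate, generate, grade loop from step 6 can be sketched as a small driver that takes each stage as a function. This is a minimal sketch under stated assumptions: `MemoryCase`, `Outcome`, and `run_eval` are hypothetical names, and the four callables stand in for your own memory search, context judge, LLM call, and answer grader. None of them is a real library API.

```python
from dataclasses import dataclass
from time import perf_counter
from typing import Callable


@dataclass
class MemoryCase:
    """A test question plus the golden answer a correct response must contain."""
    question: str
    golden_answer: str


@dataclass
class Outcome:
    question: str
    completeness: str   # e.g. "COMPLETE", "PARTIAL", "INSUFFICIENT"
    correct: bool
    latency_ms: float   # time spent in retrieval only


def run_eval(
    cases: list[MemoryCase],
    search: Callable[[str], str],              # question -> retrieved context
    judge_context: Callable[[str, str], str],  # (context, golden) -> completeness label
    generate: Callable[[str, str], str],       # (question, context) -> answer
    grade: Callable[[str, str], bool],         # (answer, golden) -> correct?
) -> list[Outcome]:
    outcomes = []
    for case in cases:
        # 1. Search: time only the retrieval call.
        start = perf_counter()
        context = search(case.question)
        latency_ms = (perf_counter() - start) * 1000.0
        # 2. Evaluate context before any generation happens.
        completeness = judge_context(context, case.golden_answer)
        # 3. Generate an answer from the retrieved context.
        answer = generate(case.question, context)
        # 4. Grade the answer against the golden answer.
        correct = grade(answer, case.golden_answer)
        outcomes.append(Outcome(case.question, completeness, correct, latency_ms))
    return outcomes
```

Passing the stages in as functions keeps the driver independent of any particular memory system or judge, so the same test cases can be rerun after each change to data, search parameters, or graph configuration.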

Use the standard benchmarks

For comparable numbers across systems, use the two industry benchmarks for long-running memory:

  • LoCoMo — long, multi-session conversational memory.
  • LongMemEval — long-term memory recall and reasoning over extended histories.

Always report accuracy together with latency and token use. As a reference point, Zep reports that its temporal context graph reaches 94.7% accuracy on LoCoMo (155 ms retrieval, 5,760 tokens of context) and 90.2% on LongMemEval (162 ms retrieval, 4,408 tokens of context), which Zep presents as leading on all three metrics at once.

Retrieval latency (p95) by graph size:
10K: 148 ms · 100K: 152 ms · 1M: 156 ms · 10M: 161 ms · 100M: 168 ms
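The latency figures above are p95 values, meaning 95% of retrievals finished at or under that time. If you collect your own latency samples, a nearest-rank percentile is one simple way to compute a comparable number. The `percentile` helper below is a hypothetical illustration, and the source does not say which percentile method was used for the published figures.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]
```

A tail percentile is usually more informative than a mean for a latency gate, because a few slow retrievals can hide behind a good average.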

Test it on your own data

Benchmarks are directional; the real test is your use case. Zep ships an open evaluation harness that runs this exact loop on your data: write your example interactions, generate test cases and multi-session conversations, ingest them, and run the evaluation to get context-completeness and answer-correctness scores with per-test breakdowns. See Evaluate Zep for Your Use Case for the step-by-step harness, and the LLM evaluation framework for the broader evaluation methodology.


Related: LLM evaluation framework · What is agent memory? · Evaluate Zep for your use case (docs) · Research & benchmarks

Frequently asked questions

What's the most important metric for agent memory?

Context completeness — whether the system retrieved all the facts needed to answer. It isolates memory quality from the LLM's generation. Answer correctness is the secondary, end-to-end metric.

Why can't I just use single-turn accuracy?

Because memory is about recall across many sessions and over time. Single-turn tests miss the actual failure modes: forgetting across sessions and mixing stale with current facts.

How do I test that an agent handles changing facts?

Seed a fact, change it later in the conversation timeline, then ask the question — a correct system returns the current value and can return the historical value when asked “as of” a date.
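A temporal test case can be sketched against a toy in-memory store. Everything here is a stand-in for illustration: `Fact`, `ToyTemporalMemory`, and `lookup` are hypothetical names, not a real memory system's API, and the example facts are invented. The point is the shape of the test: seed a value, supersede it, then ask both "now" and "as of" questions.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Fact:
    subject: str
    attribute: str
    value: str
    valid_from: date  # date the fact became true


class ToyTemporalMemory:
    """Toy store that keeps every version of a fact instead of overwriting."""

    def __init__(self) -> None:
        self._facts: list[Fact] = []

    def add(self, fact: Fact) -> None:
        self._facts.append(fact)

    def lookup(self, subject: str, attribute: str,
               as_of: Optional[date] = None) -> Optional[str]:
        """Return the value valid at `as_of`, or the current value if omitted."""
        candidates = [
            f for f in self._facts
            if f.subject == subject
            and f.attribute == attribute
            and (as_of is None or f.valid_from <= as_of)
        ]
        if not candidates:
            return None
        # The most recently valid fact wins.
        return max(candidates, key=lambda f: f.valid_from).value


memory = ToyTemporalMemory()
memory.add(Fact("ana", "city", "Berlin", date(2024, 1, 10)))  # seed the fact
memory.add(Fact("ana", "city", "Lisbon", date(2024, 6, 1)))   # change it later

current = memory.lookup("ana", "city")                         # "what's true now?"
historical = memory.lookup("ana", "city", as_of=date(2024, 3, 1))  # "what was true then?"
```

In a real evaluation the same two questions would go through the agent, with the golden answers set to the current and historical values respectively.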

Which benchmarks should I use?

LoCoMo and LongMemEval for long, multi-session memory. Report accuracy alongside retrieval latency and token cost.