How Do You Test Agent Memory? A Practical Guide

Key takeaways

Test agent memory across multiple sessions and over time — measure context completeness first, then answer correctness, latency, and token cost.
Use the standard multi-session benchmarks: LoCoMo (arXiv:2402.17753) and LongMemEval (arXiv:2410.10813).
Test on your own data with Zep's evaluation harness. Zep reports 94.7% on LoCoMo and 90.2% on LongMemEval (see Zep's published results and the Zep paper).

You test agent memory by checking whether the agent retrieves the right facts, at the right time, across multiple sessions, as those facts change — measuring retrieval completeness first, then answer correctness. Single-turn accuracy tells you almost nothing; memory is a temporal, multi-session capability, so the test has to seed facts, change them, and verify the agent answers “what's true now?” and “what was true then?” correctly. This guide covers what to measure, how to build a test set, the industry benchmarks, and how to run an end-to-end evaluation on your own data.

What to measure (and in what order)

Separate the two failure modes — retrieval and generation — because they need different fixes.

Metric	Question it answers	Role and rationale
Context completeness	Did memory retrieve all the facts needed to answer? (COMPLETE / PARTIAL / INSUFFICIENT)	Primary — measures the memory system itself, isolated from the LLM
Answer correctness	Did the final answer match the expected answer? (CORRECT / WRONG)	Secondary — depends on both retrieval and the model's generation
Retrieval latency	How fast did memory return context?	Production gate — accuracy at high latency doesn't ship
Token efficiency	How many tokens did the retrieved context consume?	Cost gate — memory systems often trade accuracy for token bloat

Measure completeness first: if the right facts weren't retrieved, no prompt engineering will save the answer. Then measure correctness, latency, and tokens together — a good memory system wins on all of them at once rather than trading one for another.
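One way to keep the two failure modes apart is to record both grades for every test case and attribute each miss to retrieval or generation. The following is a minimal sketch under that assumption. The names `MemoryTestResult` and `classify_failure` are hypothetical and not part of any library, and the sample results are invented for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class MemoryTestResult:
    """One graded test case (hypothetical structure for illustration)."""
    test_id: str
    completeness: str    # "COMPLETE", "PARTIAL", or "INSUFFICIENT"
    correctness: str     # "CORRECT" or "WRONG"
    latency_ms: float    # time memory took to return context
    context_tokens: int  # size of the retrieved context


def classify_failure(result: MemoryTestResult) -> str:
    """Attribute a wrong answer to retrieval or generation."""
    if result.correctness == "CORRECT":
        return "pass"
    if result.completeness == "COMPLETE":
        # All needed facts were retrieved, so the model misused them.
        return "generation"
    # Facts were missing from the context, so memory is the place to fix.
    return "retrieval"


# Invented sample results, for illustration only.
results = [
    MemoryTestResult("t1", "COMPLETE", "CORRECT", 140.0, 4200),
    MemoryTestResult("t2", "PARTIAL", "WRONG", 155.0, 3900),
    MemoryTestResult("t3", "COMPLETE", "WRONG", 150.0, 5100),
]

summary = Counter(classify_failure(r) for r in results)
print(dict(summary))
```

A correct answer produced from incomplete context still counts as a pass here, so it is worth reviewing those cases separately: the model may have guessed.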

A method for testing agent memory

Write 3–5 target interactions. Note what the user asks and what a correct agent should answer. These define “good” for your domain.
Expand into 10+ test cases with variations and related questions, each with a clear golden answer describing what must be present in a correct response.
Author multi-session conversations that contain the answers, spread naturally across several sessions — so the test reflects real, fragmented context rather than one tidy prompt.
Include temporal cases. Seed a fact, then change it later in the timeline, and test that the agent returns the current value (and, ideally, the historical one when asked “as of” a date). This is the test most setups skip and most agents fail.
Add background data and noise. Bury the relevant facts in a larger graph and add JSON/business data, so you measure retrieval under realistic conditions, not a toy dataset.
Run search → evaluate context → generate → grade. Retrieve context, score completeness, generate an answer with the retrieved context, then grade it against the golden answer (an LLM-as-judge works well here when calibrated).
Iterate. For each miss, check whether the data contained the fact, whether the golden answer was clear, and what retrieval actually returned — then adjust data, search parameters, or graph configuration.
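The search, evaluate context, generate, grade loop from the steps above can be sketched as a single function that takes the three moving parts as callables. This is an illustration, not the API of any harness: `MemoryCase`, `score_completeness`, `run_case`, and the `toy_*` stand-ins are hypothetical names, and the substring check stands in for a calibrated LLM judge.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class MemoryCase:
    """A test case with a golden answer (hypothetical structure)."""
    question: str
    required_facts: List[str]  # facts that must appear in retrieved context
    golden_answer: str


def score_completeness(context: str, required_facts: List[str]) -> str:
    """Substring matching as a simple stand-in for an LLM judge."""
    lowered = context.lower()
    found = sum(1 for fact in required_facts if fact.lower() in lowered)
    if found == len(required_facts):
        return "COMPLETE"
    return "PARTIAL" if found else "INSUFFICIENT"


def run_case(
    case: MemoryCase,
    search: Callable[[str], str],
    generate: Callable[[str, str], str],
    judge: Callable[[str, str], bool],
) -> Dict[str, str]:
    context = search(case.question)                                   # 1. search
    completeness = score_completeness(context, case.required_facts)   # 2. evaluate context
    answer = generate(case.question, context)                         # 3. generate
    correctness = "CORRECT" if judge(answer, case.golden_answer) else "WRONG"  # 4. grade
    return {"completeness": completeness, "answer": answer, "correctness": correctness}


# Toy stand-ins so the sketch runs end to end; replace with real calls.
SESSIONS = [
    "Session 1: The user said their plan is Basic.",
    "Session 2: The user upgraded; their plan is Pro.",
]


def toy_search(question: str) -> str:
    return "\n".join(SESSIONS)


def toy_generate(question: str, context: str) -> str:
    return "Pro" if "plan is Pro" in context else "Basic"


def toy_judge(answer: str, golden: str) -> bool:
    return golden.lower() in answer.lower()


case = MemoryCase("What plan is the user on now?", ["plan is Pro"], "Pro")
result = run_case(case, toy_search, toy_generate, toy_judge)
print(result)
```

Passing the search, generation, and judging steps in as callables keeps the loop unchanged when you swap in a different memory system or model, which makes misses easier to compare across configurations.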

Use the standard benchmarks

For comparable numbers across systems, use the two industry benchmarks for long-running memory:

LoCoMo — long, multi-session conversational memory.
LongMemEval — long-term memory recall and reasoning over extended histories.

Always report accuracy together with latency and token use. As a reference point, Zep reports that its temporal context graph reaches 94.7% accuracy on LoCoMo (155ms retrieval, 5,760 tokens) and 90.2% on LongMemEval (162ms retrieval, 4,408 tokens), a combination Zep presents as leading on all three measures at once.

Retrieval latency (p95) by graph size

Graph size	Retrieval latency (p95)
10K	148ms
100K	152ms
1M	156ms
10M	161ms
100M	168ms
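To report accuracy alongside latency and token use for your own runs, a small summary function is enough. The sketch below assumes each run is recorded as a tuple of (is_correct, latency_ms, context_tokens) and uses the nearest-rank definition of a percentile. The function names and sample numbers are invented for illustration and are not benchmark results.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(runs):
    """runs: list of (is_correct, latency_ms, context_tokens) tuples."""
    return {
        "accuracy": sum(1 for ok, _, _ in runs if ok) / len(runs),
        "p95_latency_ms": percentile([lat for _, lat, _ in runs], 95),
        "mean_context_tokens": mean(tok for _, _, tok in runs),
    }


# Invented sample runs, for illustration only.
runs = [
    (True, 120.0, 4000),
    (True, 150.0, 5000),
    (False, 300.0, 6000),
    (True, 140.0, 5000),
]

report = summarize(runs)
print(report)
```

With only a handful of runs, p95 is effectively the slowest run, so collect enough samples before treating the number as a production gate.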

Test it on your own data

Benchmarks are directional; the real test is your use case. Zep ships an open evaluation harness that runs this same loop on your data: write your example interactions, generate test cases and multi-session conversations, ingest them, and run the evaluation to get context-completeness and answer-correctness scores with per-test breakdowns. See "Evaluate Zep for Your Use Case" for the step-by-step harness, and the LLM evaluation framework for the broader evaluation methodology.

Frequently asked questions

What's the most important metric for agent memory?

Context completeness — whether the system retrieved all the facts needed to answer. It isolates memory quality from the LLM's generation. Answer correctness is the secondary, end-to-end metric.

Why can't I just use single-turn accuracy?

Because memory is about recall across many sessions and over time. Single-turn tests miss the actual failure modes: forgetting across sessions and mixing stale with current facts.

How do I test that an agent handles changing facts?

Seed a fact, change it later in the conversation timeline, then ask the question — a correct system returns the current value and can return the historical value when asked “as of” a date.
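To write golden answers for this kind of test, it helps to keep a small reference timeline of each fact, so the expected "now" and "as of" values come from one place. The sketch below is a hypothetical helper for test authoring (`FactTimeline` is an invented name). It says nothing about how any memory system stores facts internally.

```python
from bisect import bisect_right
from datetime import date


class FactTimeline:
    """Keeps every value a fact has taken, ordered by the date it became true."""

    def __init__(self):
        self._dates = []
        self._values = []

    def set(self, valid_from: date, value: str) -> None:
        # Insert in date order so lookups can use binary search.
        idx = bisect_right(self._dates, valid_from)
        self._dates.insert(idx, valid_from)
        self._values.insert(idx, value)

    def as_of(self, when: date):
        """Value in effect on the given date, or None if nothing was known yet."""
        idx = bisect_right(self._dates, when)
        return self._values[idx - 1] if idx else None

    def current(self):
        """Most recent value, or None if the fact was never set."""
        return self._values[-1] if self._values else None


# Seed a fact, then change it later in the timeline (invented example data).
employer = FactTimeline()
employer.set(date(2024, 1, 10), "Acme")
employer.set(date(2024, 6, 1), "Globex")

golden_now = employer.current()                  # expected answer to "what's true now?"
golden_then = employer.as_of(date(2024, 3, 1))   # expected answer to "as of March 1, 2024?"
print(golden_now, golden_then)
```

The agent's answers to the "now" and "as of" questions can then be graded against `golden_now` and `golden_then` respectively.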

Which benchmarks should I use?

LoCoMo and LongMemEval for long, multi-session memory. Report accuracy alongside retrieval latency and token cost.