Benchmarks for agent memory.
Results on LoCoMo and LongMemEval, two industry benchmarks for long-running agent memory.
We designed Zep for three constraints together: accuracy, retrieval latency, and token efficiency. Production agents need all three.
Every decision in the Zep architecture follows from that. How we build the graph. How we retrieve. How we assemble context.
The benchmarks below measure all three: how accurate Zep is, how fast it retrieves, and how many tokens it returns.
LongMemEval
LongMemEval evaluates long-running memory across six question types, including temporal reasoning and multi-session recall.
[Chart: accuracy by question type, 0 to 100%]
LoCoMo
LoCoMo evaluates memory over multi-session conversations across four question categories: multi-hop, temporal, open-domain, and single-hop.
[Chart: accuracy by question category, 0 to 100%]
Auto search: half the tokens, no tuning.
The results above use multi-scope retrieval — five parallel searches across facts, entities, episodes, observations, and thread summaries, composed at the client. Zep also offers auto search: a single API call that retrieves across every scope, applies a cross-scope rerank, and packs the result into a character-bounded context block. No scope selection, no client-side composition.
Run on LoCoMo, auto search needs a single call and no scope tuning, and returns a context block roughly half the size of the multi-scope one.
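To make the multi-scope pattern concrete, here is a minimal sketch of a client fanning out one search per scope and merging the results into a single ranking. The scope names come from the text above. Everything else (`Hit`, `search_scope`, `multi_scope_search`, the toy corpus and its scores) is hypothetical and is not the Zep SDK. The cross-scope rerank is reduced to a sort on a precomputed score, where a real system would likely use a reranking model.

```python
import asyncio
from dataclasses import dataclass

# The five scopes named in the text; everything else is a hypothetical stand-in.
SCOPES = ("facts", "entities", "episodes", "observations", "thread_summaries")


@dataclass(frozen=True)
class Hit:
    scope: str
    text: str
    score: float  # assumed comparable across scopes once reranked


async def search_scope(scope: str, query: str) -> list[Hit]:
    """Hypothetical per-scope search; a real client would call the memory service."""
    await asyncio.sleep(0)  # yields control, standing in for network I/O
    # Toy data, invented for illustration only.
    corpus = {
        "facts": [("Alice moved to Berlin in 2023", 0.91)],
        "entities": [("Alice (person)", 0.74)],
        "episodes": [("Chat on 2023-05-02 about the move", 0.66)],
        "observations": [("Alice often asks about housing", 0.58)],
        "thread_summaries": [("Thread: relocation planning", 0.49)],
    }
    return [Hit(scope, text, score) for text, score in corpus[scope]]


async def multi_scope_search(query: str, limit: int = 3) -> list[Hit]:
    # Fan out one search per scope and await them concurrently.
    per_scope = await asyncio.gather(*(search_scope(s, query) for s in SCOPES))
    merged = [hit for hits in per_scope for hit in hits]
    # Cross-scope rerank, simplified to a sort on the precomputed score.
    merged.sort(key=lambda h: h.score, reverse=True)
    return merged[:limit]


top_hits = asyncio.run(multi_scope_search("Where does Alice live?"))
```

Auto search, as described above, moves this fan-out, rerank, and composition behind one API call, so the client no longer carries any of this code.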
How the results follow from the architecture.
Latency
Hot graphs are held in memory as adjacency lists and CSR matrices, with vector and BM25 indexes alongside. One query returns one ranked answer across every retrieval signal.
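For readers unfamiliar with the two layouts, the sketch below converts an adjacency list into compressed sparse row (CSR) form and reads neighbors back out. It illustrates the general data structure only. The names `CSRGraph` and `adjacency_to_csr` are hypothetical, and nothing here reflects how Zep stores graphs internally.

```python
from dataclasses import dataclass


@dataclass
class CSRGraph:
    indptr: list[int]   # indptr[v]..indptr[v + 1] delimits node v's slice of `indices`
    indices: list[int]  # neighbor ids for every node, concatenated in node order

    def neighbors(self, node: int) -> list[int]:
        # One contiguous slice per node: no pointer chasing, cache friendly.
        return self.indices[self.indptr[node]:self.indptr[node + 1]]


def adjacency_to_csr(adjacency: list[list[int]]) -> CSRGraph:
    """Flatten an adjacency list (node id = list position) into CSR arrays."""
    indptr = [0]
    indices: list[int] = []
    for nbrs in adjacency:
        indices.extend(nbrs)
        indptr.append(len(indices))
    return CSRGraph(indptr, indices)


# Toy directed graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0, and node 3 has no out-edges.
adjacency = [[1, 2], [2], [0], []]
graph = adjacency_to_csr(adjacency)
```

The adjacency list is easy to mutate as new edges arrive, while the CSR arrays are compact and fast to scan, which is one plausible reason to keep both.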
Token efficiency
The Context Block is shaped at retrieval, not generated by an LLM. Entities, relationships, and observations are ranked at query time and packed to fit the token budget.
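Packing ranked items into a fixed budget can be sketched as a greedy loop. This is an illustration under stated assumptions, not Zep's packing logic: `estimate_tokens` is a crude word-count stand-in for a real tokenizer, `pack_context` is a hypothetical helper, and the sample items are invented.

```python
def estimate_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def pack_context(ranked_items: list[str], token_budget: int) -> str:
    """Add items in rank order, skipping any item that would exceed the budget."""
    lines: list[str] = []
    used = 0
    for item in ranked_items:
        cost = estimate_tokens(item)
        if used + cost > token_budget:
            continue  # too large; a smaller, lower-ranked item may still fit
        lines.append(item)
        used += cost
    return "\n".join(lines)


# Items are assumed to arrive already ranked, best first.
ranked = [
    "Alice moved to Berlin in 2023",
    "Alice previously lived in Lisbon for four years",
    "Alice works remotely",
]
block = pack_context(ranked, token_budget=10)
```

With a budget of 10, the first item fits, the second is skipped, and the third fills the remaining space. No LLM call is involved, which is what keeps this step fast and its output size predictable.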
Accuracy
Temporal invalidation keeps facts current. Pattern matching surfaces Observations. Retrieval then ranks the right things on the first call.
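One way to picture temporal invalidation: when a newer fact contradicts an older one, the older fact is marked invalid as of the newer fact's start date instead of being deleted. The sketch below assumes a deliberately simple rule, that each subject and predicate pair holds one current value. `Fact`, `add_fact`, and `current_facts` are hypothetical names, and the page does not describe how Zep detects contradictions.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Fact:
    subject: str
    predicate: str
    value: str
    valid_from: datetime
    invalid_at: Optional[datetime] = None  # None means the fact is still current


def add_fact(facts: list[Fact], new: Fact) -> None:
    """Close any current fact in the same (subject, predicate) slot, then append."""
    for fact in facts:
        same_slot = (fact.subject, fact.predicate) == (new.subject, new.predicate)
        if same_slot and fact.invalid_at is None:
            fact.invalid_at = new.valid_from  # superseded, but kept as history
    facts.append(new)


def current_facts(facts: list[Fact]) -> list[Fact]:
    return [f for f in facts if f.invalid_at is None]


facts: list[Fact] = []
add_fact(facts, Fact("Alice", "lives_in", "Lisbon", datetime(2019, 1, 1)))
add_fact(facts, Fact("Alice", "lives_in", "Berlin", datetime(2023, 5, 1)))
```

Retrieval over `current_facts(facts)` sees only Berlin, while the Lisbon fact stays available for questions about the past, the kind of temporal reasoning both benchmarks test.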
Reproducibility notes.
Reader: gpt-5.4 (reasoning = medium). Judge: gpt-5.4 with chain-of-thought grading. Multi-scope retrieval depth: 20 edges, 10 nodes, 10 episodes, 5 thread summaries, 5 observations, cross-encoder reranking. Auto search at max_characters=10000. Run on 1,540 LoCoMo questions and 500 LongMemEval questions. 0 failed tests on either benchmark.
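For anyone scripting a rerun, the settings above can be captured as a small config object. The values are taken from the notes. The class and field names are hypothetical and do not correspond to any published Zep configuration schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MultiScopeDepth:
    # Retrieval depth per scope, as listed in the notes above.
    edges: int = 20
    nodes: int = 10
    episodes: int = 10
    thread_summaries: int = 5
    observations: int = 5
    rerank: str = "cross-encoder"


@dataclass(frozen=True)
class BenchmarkRun:
    reader: str = "gpt-5.4"
    reader_reasoning: str = "medium"
    judge: str = "gpt-5.4"  # graded with chain-of-thought, per the notes
    auto_search_max_characters: int = 10_000
    locomo_questions: int = 1_540
    longmemeval_questions: int = 500


depth = MultiScopeDepth()
run = BenchmarkRun()

# Upper bound on items a single multi-scope query can return.
items_per_query = (
    depth.edges + depth.nodes + depth.episodes
    + depth.thread_summaries + depth.observations
)
```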