
Benchmarks for agent memory.

Results on LoCoMo and LongMemEval, two industry benchmarks for long-running agent memory.

We designed Zep for three constraints together: accuracy, retrieval latency, and token efficiency. Production agents need all three.

Every decision in the Zep architecture follows from that: how we build the graph, how we retrieve, and how we assemble context.

The benchmarks below measure all three: how accurate Zep is, how fast it retrieves, and how many tokens it returns.

LoCoMo
Accuracy: 94.7%
Retrieval latency (p95): 155 ms
Median context size: 5,760 tokens

LongMemEval
Accuracy: 90.2%
Retrieval latency (p95): 162 ms
Median context size: 4,408 tokens

LongMemEval

LongMemEval evaluates long-running memory across six question types, including temporal reasoning and multi-session recall.

Accuracy: 90.2% (451 / 500 correct)
Retrieval latency: 104 ms p50 / 162 ms p95
Median context: 4,408 tokens per question, end-to-end

Accuracy by question type

Single-session assistant: 96.4%
Single-session user: 94.3%
Knowledge update: 93.6%
Temporal reasoning: 90.2%
Single-session preference: 90.0%
Multi-session: 83.5%

LoCoMo

LoCoMo evaluates memory over multi-session conversations across four question categories: multi-hop, temporal, open-domain, and single-hop.

Accuracy: 94.7% (1,459 / 1,540 correct)
Retrieval latency: 87 ms p50 / 155 ms p95
Median context: 5,760 tokens per question, end-to-end

Accuracy by question category

Single-hop: 96.4% (646 / 670)
Temporal: 95.6% (311 / 325)
Multi-hop: 94.0% (304 / 323)
Open-domain: 79.2% (175 / 221)

Auto search: half the tokens, no tuning.

The results above use multi-scope retrieval: five parallel searches across facts, entities, episodes, observations, and thread summaries, composed at the client. Zep also offers auto search: a single API call that retrieves across every scope, applies a cross-scope rerank, and packs the result into a character-bounded context block. It requires no scope selection and no client-side composition.
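To make the multi-scope pattern concrete, here is a minimal sketch of five concurrent scoped searches composed on the client. It does not use the Zep SDK: `search_scope`, `Hit`, and `compose_context` are hypothetical stand-ins, and the stub returns canned results instead of calling a service. The per-scope depths are borrowed from the methodology notes, assuming "edges" correspond to facts and "nodes" to entities.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Hit:
    scope: str
    text: str
    score: float


# Per-scope depth (assumption: edges map to facts, nodes map to entities).
SCOPE_LIMITS = {
    "facts": 20,
    "entities": 10,
    "episodes": 10,
    "thread_summaries": 5,
    "observations": 5,
}


async def search_scope(scope: str, query: str, limit: int) -> list[Hit]:
    """Hypothetical stand-in for one scoped search call; returns canned hits."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    count = min(limit, 2)
    return [Hit(scope, f"{scope} result {i} for {query!r}", 1.0 / (i + 1)) for i in range(count)]


async def multi_scope_search(query: str) -> dict[str, list[Hit]]:
    """Run one search per scope concurrently and group the results by scope."""
    scopes = list(SCOPE_LIMITS)
    results = await asyncio.gather(
        *(search_scope(scope, query, SCOPE_LIMITS[scope]) for scope in scopes)
    )
    return dict(zip(scopes, results))


def compose_context(grouped: dict[str, list[Hit]]) -> str:
    """Client-side composition: one labelled section per scope."""
    sections = []
    for scope, hits in grouped.items():
        lines = "\n".join(f"- {hit.text}" for hit in hits)
        sections.append(f"## {scope}\n{lines}")
    return "\n\n".join(sections)


grouped = asyncio.run(multi_scope_search("Where does Alice work?"))
context = compose_context(grouped)
```

Because the searches run concurrently, the wall-clock cost is roughly that of the slowest scope, but the client still has to choose depths and assemble the sections itself. Auto search, as described above, moves both steps behind one call.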

Run on LoCoMo, auto search delivers:

Accuracy: 86.5% (single API call, no tuning)
Retrieval latency: 115 ms p50 / 173 ms p95
Median context: 2,680 tokens (53% smaller than multi-scope retrieval)

That is a single call with no scope tuning and a context block roughly half the size, at 86.5% accuracy versus 94.7% for multi-scope retrieval.

How the results follow from the architecture.

01. Latency

Hot graphs are held in memory as adjacency lists and CSR matrices, with vector and BM25 indexes alongside. One query returns one ranked answer across every retrieval signal.
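For readers unfamiliar with CSR (compressed sparse row), the sketch below shows the general idea in plain Python: neighbors of every node sit in one contiguous array, and an offsets array says where each node's slice begins. This is a generic illustration of the data structure, not Zep's engine code; `build_csr` and `neighbors` are names chosen for the example.

```python
def build_csr(num_nodes: int, edges: list[tuple[int, int]]) -> tuple[list[int], list[int]]:
    """Build CSR arrays (indptr, indices) from directed (src, dst) edges."""
    # Count outgoing edges per node.
    counts = [0] * num_nodes
    for src, _ in edges:
        counts[src] += 1

    # indptr[n] is the offset where node n's neighbors start in `indices`.
    indptr = [0] * (num_nodes + 1)
    for node in range(num_nodes):
        indptr[node + 1] = indptr[node] + counts[node]

    # Fill `indices`, tracking the next free slot for each source node.
    indices = [0] * len(edges)
    cursor = indptr[:-1].copy()
    for src, dst in edges:
        indices[cursor[src]] = dst
        cursor[src] += 1
    return indptr, indices


def neighbors(indptr: list[int], indices: list[int], node: int) -> list[int]:
    """Return a node's out-neighbors as one contiguous slice."""
    return indices[indptr[node]:indptr[node + 1]]


edges = [(0, 1), (0, 2), (1, 2), (3, 0)]
indptr, indices = build_csr(4, edges)
print(neighbors(indptr, indices, 0))  # [1, 2]
```

A neighbor lookup is two offset reads plus a slice, with no pointer chasing, which is one reason the layout suits latency-sensitive traversal.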

02. Token efficiency

The Context Block is shaped at retrieval, not generated by an LLM. Entities, relationships, and observations are ranked at query time and packed to fit the token budget.
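Packing ranked items into a fixed budget can be sketched as a simple greedy loop. The version below is an assumption about the general technique, not Zep's implementation: it bounds the block by characters (in the spirit of the `max_characters` setting in the methodology notes), and `pack_context` is a hypothetical name. A token budget would work the same way with a tokenizer count in place of `len`.

```python
def pack_context(ranked_items: list[tuple[float, str]], max_characters: int) -> str:
    """Greedily add items in descending score order, skipping any that would overflow."""
    packed: list[str] = []
    used = 0
    for _, text in sorted(ranked_items, key=lambda item: item[0], reverse=True):
        # Each item after the first also costs one newline separator.
        cost = len(text) + (1 if packed else 0)
        if used + cost > max_characters:
            continue  # too large for the remaining budget; try lower-ranked items
        packed.append(text)
        used += cost
    return "\n".join(packed)


items = [
    (0.9, "Alice works at Acme."),
    (0.5, "Alice likes tea."),
    (0.7, "Alice moved to Berlin in 2023."),
]
block = pack_context(items, max_characters=40)
```

Here the second-ranked item does not fit in the remaining budget, so the packer skips it and admits the shorter third-ranked item. Stopping at the first overflow instead would be stricter about rank order but would leave budget unused.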

03. Accuracy

Temporal invalidation keeps facts current. Pattern matching surfaces Observations. Retrieval ranks the right results on the first call.
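One way to picture temporal invalidation is a fact record with a validity window that a newer, contradicting fact closes. The sketch below is illustrative only. The `Fact` fields, the `add_fact` helper, and the rule that a predicate holds one value at a time are assumptions made for the example, not a description of Zep's internals.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Fact:
    subject: str
    predicate: str
    obj: str
    valid_at: datetime
    invalid_at: Optional[datetime] = None  # None means still valid


def add_fact(facts: list[Fact], new: Fact) -> None:
    """Append a fact, closing any still-open fact it supersedes.

    Assumption: each (subject, predicate) pair holds a single value at a time.
    """
    for old in facts:
        same_slot = (old.subject, old.predicate) == (new.subject, new.predicate)
        if same_slot and old.invalid_at is None and old.obj != new.obj and old.valid_at <= new.valid_at:
            old.invalid_at = new.valid_at
    facts.append(new)


def current_facts(facts: list[Fact], at: datetime) -> list[Fact]:
    """Return the facts whose validity window contains `at`."""
    return [
        fact for fact in facts
        if fact.valid_at <= at and (fact.invalid_at is None or at < fact.invalid_at)
    ]


facts: list[Fact] = []
add_fact(facts, Fact("Alice", "works_at", "Acme", datetime(2022, 1, 1)))
add_fact(facts, Fact("Alice", "works_at", "Globex", datetime(2024, 1, 1)))
```

The older fact is kept rather than deleted, so a query about 2023 still resolves to Acme while a query about today resolves to Globex.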

Methodology

Reproducibility notes.

Reader: gpt-5.4 (reasoning = medium). Judge: gpt-5.4 with chain-of-thought grading. Multi-scope retrieval depth: 20 edges, 10 nodes, 10 episodes, 5 thread summaries, and 5 observations, with cross-encoder reranking. Auto search: max_characters=10000. Evaluated on 1,540 LoCoMo questions and 500 LongMemEval questions, with no failed tests on either benchmark.
