Research

Benchmarks for agent memory.

Results on LoCoMo and LongMemEval, two industry benchmarks for long-running agent memory.

We designed Zep for three constraints together: accuracy, retrieval latency, and token efficiency. Production agents need all three.

Every decision in the Zep architecture follows from that. How we build the graph. How we retrieve. How we assemble context.

The benchmarks below measure all three: how accurate Zep is, how fast it retrieves, and how many tokens it returns.

LoCoMo

94.7% accuracy

Retrieval latency: 155 ms (p95)

Context size: 5,760 tokens (median)

LongMemEval

90.2% accuracy

Retrieval latency: 162 ms (p95)

Context size: 4,408 tokens (median)

LongMemEval

LongMemEval evaluates long-running memory across six question types, including temporal reasoning and multi-session recall.

Accuracy

90.2%

451 / 500 correct

Retrieval latency

104 / 162 ms (p50 / p95)

Median context

4,408 tokens

per question, end-to-end

Accuracy by question type

Single-session assistant: 96.4%
Single-session user: 94.3%
Knowledge update: 93.6%
Temporal reasoning: 90.2%
Single-session preference: 90.0%
Multi-session: 83.5%

LoCoMo

LoCoMo evaluates memory over multi-session conversations across four question categories: multi-hop, temporal, open-domain, and single-hop.

Accuracy

94.7%

1,459 / 1,540 correct

Retrieval latency

87 / 155 ms (p50 / p95)

Median context

5,760 tokens

per question, end-to-end

Accuracy by question category

Single-hop (646 / 670): 96.4%
Temporal (311 / 325): 95.6%
Multi-hop (304 / 323): 94.0%
Open-domain (175 / 221): 79.2%

Auto search: half the tokens, no tuning.

The results above use multi-scope retrieval: five parallel searches across facts, entities, episodes, observations, and thread summaries, composed at the client. Zep also offers auto search, a single API call that retrieves across every scope, applies a cross-scope rerank, and packs the result into a character-bounded context block. It requires no scope selection and no client-side composition.
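To make the multi-scope pattern concrete, here is a minimal sketch of five scoped searches run in parallel and composed on the client. It does not use the Zep SDK: `search_scope`, `multi_scope_search`, `compose_context`, and the `Hit` type are hypothetical stand-ins, and the mapping of facts to edges and entities to nodes in the depth table is an assumption. The per-scope depths come from the reproducibility notes on this page.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Hit:
    scope: str
    text: str
    score: float


# Per-scope depths from the reproducibility notes.
# Assumption: "facts" correspond to edges and "entities" to nodes.
SCOPE_LIMITS = {
    "facts": 20,
    "entities": 10,
    "episodes": 10,
    "thread_summaries": 5,
    "observations": 5,
}


async def search_scope(scope: str, query: str, limit: int) -> list[Hit]:
    """Hypothetical stand-in for one scoped search call; returns fake hits."""
    await asyncio.sleep(0)  # placeholder for the network round trip
    return [
        Hit(scope, f"{scope} result {i} for {query!r}", 1.0 / (i + 1))
        for i in range(limit)
    ]


async def multi_scope_search(query: str) -> dict[str, list[Hit]]:
    """Issue one search per scope concurrently and collect results by scope."""
    scopes = list(SCOPE_LIMITS)
    results = await asyncio.gather(
        *(search_scope(scope, query, SCOPE_LIMITS[scope]) for scope in scopes)
    )
    return dict(zip(scopes, results))


def compose_context(results: dict[str, list[Hit]]) -> str:
    """Client-side composition: one labeled section per scope."""
    sections = []
    for scope, hits in results.items():
        lines = "\n".join(f"- {hit.text}" for hit in hits)
        sections.append(f"## {scope}\n{lines}")
    return "\n\n".join(sections)


if __name__ == "__main__":
    found = asyncio.run(multi_scope_search("Where does Alice work?"))
    print(compose_context(found)[:200])
```

Because the five calls run concurrently, wall-clock latency in this sketch is bounded by the slowest scope rather than the sum of all five. Auto search moves the fan-out, rerank, and packing behind a single call.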

Run on LoCoMo, auto search delivers:

Accuracy

86.5%

single API call, no tuning

Retrieval latency

115 / 173 ms (p50 / p95)

Median context

2,680 tokens

53% smaller than multi-scope retrieval (5,760 tokens)

Auto search needs a single call and no scope tuning, and it returns a context block roughly half the size of the multi-scope one.
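The trade-off can be checked directly from the LoCoMo figures reported on this page. The snippet below is only worked arithmetic; the variable names are illustrative.

```python
# LoCoMo figures reported on this page.
multi_scope_tokens = 5760   # median context, multi-scope retrieval
auto_search_tokens = 2680   # median context, auto search
multi_scope_accuracy = 94.7  # percent
auto_search_accuracy = 86.5  # percent

# Relative reduction in context size.
reduction = 1 - auto_search_tokens / multi_scope_tokens

# Accuracy difference in percentage points.
accuracy_delta = multi_scope_accuracy - auto_search_accuracy

print(f"context reduction: {reduction:.0%}")          # context reduction: 53%
print(f"accuracy difference: {accuracy_delta:.1f} points")  # accuracy difference: 8.2 points
```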

How the results follow from the architecture.

Latency

Hot graphs are held in memory as adjacency lists and CSR matrices, with vector and BM25 indexes alongside. One query returns one ranked answer across every retrieval signal.
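For readers unfamiliar with CSR (compressed sparse row), the sketch below shows how an adjacency list maps onto the two CSR arrays and why neighbor lookup becomes a contiguous slice. This is a generic illustration of the data structure, not Zep's implementation; `to_csr` and `neighbors` are hypothetical helper names.

```python
def to_csr(adjacency: dict[int, list[int]], num_nodes: int) -> tuple[list[int], list[int]]:
    """Convert an adjacency list into CSR (row_ptr, col_idx) arrays.

    The neighbors of node i are stored in col_idx[row_ptr[i]:row_ptr[i + 1]].
    """
    row_ptr = [0]
    col_idx: list[int] = []
    for node in range(num_nodes):
        col_idx.extend(sorted(adjacency.get(node, [])))
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx


def neighbors(row_ptr: list[int], col_idx: list[int], node: int) -> list[int]:
    """Neighbor lookup is a single contiguous slice."""
    return col_idx[row_ptr[node]:row_ptr[node + 1]]


graph = {0: [1, 2], 1: [2], 2: [0], 3: []}
row_ptr, col_idx = to_csr(graph, num_nodes=4)
print(row_ptr)                          # [0, 2, 3, 4, 4]
print(col_idx)                          # [1, 2, 2, 0]
print(neighbors(row_ptr, col_idx, 0))   # [1, 2]
```

The contiguous layout is what makes traversal cache-friendly: expanding a node reads one slice of memory instead of chasing pointers.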

Token efficiency

The Context Block is shaped at retrieval, not generated by an LLM. Entities, relationships, and observations are ranked at query time and packed to fit the token budget.
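One way to picture budgeted packing is a greedy pass over ranked candidates. The sketch below is an illustration under stated assumptions, not Zep's algorithm: `Candidate`, `estimate_tokens`, and `pack_context` are hypothetical, and the four-characters-per-token estimate is a crude stand-in for a real tokenizer.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    score: float  # relevance score assigned at query time


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: roughly 4 characters per token."""
    return max(1, len(text) // 4)


def pack_context(candidates: list[Candidate], token_budget: int) -> tuple[str, int]:
    """Greedily add the highest-scoring candidates that still fit the budget."""
    packed: list[str] = []
    used = 0
    for cand in sorted(candidates, key=lambda c: c.score, reverse=True):
        cost = estimate_tokens(cand.text)
        if used + cost > token_budget:
            continue  # too large for the remaining budget; a smaller item may still fit
        packed.append(cand.text)
        used += cost
    return "\n".join(packed), used


candidates = [
    Candidate("Alice works at Acme.", 0.9),
    Candidate("Alice moved to Berlin in 2023.", 0.8),
    Candidate("Alice likes hiking and long walks on weekends.", 0.4),
]
block, used = pack_context(candidates, token_budget=12)
print(block)
print(used)
```

No model call is involved: the block is assembled by ranking and selection alone, which keeps its size predictable.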

Accuracy

Temporal invalidation keeps facts current. Pattern matching surfaces Observations. Retrieval ranks the most relevant results on the first call.
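As a rough illustration of temporal invalidation, the sketch below closes the validity window of an older fact when a conflicting newer one arrives, rather than deleting it. Everything here is hypothetical: the `Fact` fields, `add_fact`, `current_facts`, and the simplifying assumption that two facts conflict when they share a subject and predicate but differ in object. Facts are also assumed to arrive in chronological order.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Fact:
    subject: str
    predicate: str
    obj: str
    valid_at: datetime
    invalid_at: Optional[datetime] = None  # None means still current


def add_fact(facts: list[Fact], new: Fact) -> None:
    """Append a fact, invalidating any current fact it contradicts."""
    for fact in facts:
        same_slot = (fact.subject, fact.predicate) == (new.subject, new.predicate)
        if same_slot and fact.invalid_at is None and fact.obj != new.obj:
            fact.invalid_at = new.valid_at  # close the old fact's validity window
    facts.append(new)


def current_facts(facts: list[Fact], as_of: datetime) -> list[Fact]:
    """Return the facts that were valid at the given point in time."""
    return [
        fact for fact in facts
        if fact.valid_at <= as_of and (fact.invalid_at is None or as_of < fact.invalid_at)
    ]


facts: list[Fact] = []
add_fact(facts, Fact("Alice", "works_at", "Acme", datetime(2022, 1, 1)))
add_fact(facts, Fact("Alice", "works_at", "Globex", datetime(2024, 6, 1)))

print([f.obj for f in current_facts(facts, datetime(2025, 1, 1))])  # ['Globex']
print([f.obj for f in current_facts(facts, datetime(2023, 1, 1))])  # ['Acme']
```

Keeping the superseded fact with an end date is what lets a question like "where did Alice work in 2023?" still be answered correctly.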

Methodology

Reproducibility notes.

Reader: gpt-5.4 (reasoning = medium).
Judge: gpt-5.4 with chain-of-thought grading.
Multi-scope retrieval depth: 20 edges, 10 nodes, 10 episodes, 5 thread summaries, 5 observations, with cross-encoder reranking.
Auto search: max_characters=10000.
Test sets: 1,540 LoCoMo questions and 500 LongMemEval questions, with 0 failed tests on either benchmark.
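When reproducing the p50 / p95 latency figures on your own runs, per-query timings need to be reduced to percentiles. The sketch below uses the nearest-rank method, which is an assumption: this page does not state which percentile definition was used. The sample timings are made up for illustration and are not Zep measurements.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("percentile() requires at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up retrieval timings in milliseconds, for illustration only.
latencies_ms = [80, 85, 87, 90, 95, 110, 120, 140, 155, 300]

print(percentile(latencies_ms, 50))  # 95
print(percentile(latencies_ms, 95))  # 300
```

With small samples the p95 is dominated by the single slowest query, so a run over the full question set gives a far more stable figure than a handful of requests.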
