Agent Runtime
Any agent framework — or none. The Context Lake is invoked through a single SDK.
Vectorize Hindsight and Zep are both dedicated agent-memory systems that score near the top of LongMemEval. The decision isn't which is more accurate — it's which token-efficiency and operational profile fit your use case.
What Vectorize Hindsight is. Hindsight (vectorize.io, GitHub) is an open-source (MIT) agent-memory system from Vectorize. It organizes memory with biomimetic structures: World (facts), Experiences (the agent's own history), and Mental Models (learned understanding formed by reflecting on raw memories). It integrates in about two lines of code and reports 91.4% on LongMemEval. It can be self-hosted (Docker or embedded), and Vectorize is building a hosted cloud version with managed production features.
What Zep is. Zep is a dedicated, managed agent-memory platform — the Context Lake. It builds bi-temporal context graphs (via the open-source Graphiti) in which every fact carries a validity window and provenance, so the agent reasons over what's true now vs then and can audit any answer to its source. It serves millions of graphs at sub-200ms p95, governs memory in the substrate (ABAC, retention, audit), and deploys managed, BYOK, or BYOC. Zep reports 90.2% on LongMemEval and 94.7% on LoCoMo (results); architecture in the Zep paper.
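To make the validity-window idea concrete, here is a minimal sketch of a fact that carries a validity window and a provenance pointer, plus a point-in-time lookup. The `Fact` class, its field names, and `facts_true_at` are hypothetical illustrations, not Zep's or Graphiti's actual schema or API. The sketch models only the validity window (when a fact was true), not the second time axis a bi-temporal store also tracks (when the system recorded it).

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Fact:
    """Hypothetical fact record; field names are illustrative only."""
    statement: str
    source_episode: str                  # provenance: where the fact came from
    valid_from: datetime                 # when the fact became true
    valid_to: Optional[datetime] = None  # None means the fact is still valid


def facts_true_at(facts, when):
    """Return the facts whose validity window contains `when`."""
    return [
        f for f in facts
        if f.valid_from <= when and (f.valid_to is None or when < f.valid_to)
    ]


# A newer fact invalidates the older one by closing its validity window.
facts = [
    Fact("Alice works at Acme", "episode-1",
         datetime(2023, 1, 1), datetime(2024, 6, 1)),
    Fact("Alice works at Globex", "episode-7", datetime(2024, 6, 1)),
]

then = facts_true_at(facts, datetime(2024, 1, 1))  # what was true then
now = facts_true_at(facts, datetime(2025, 1, 1))   # what is true now
```

Because each record keeps its `source_episode`, an answer built from `now` can still be traced back to the message that produced it.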
Raw signal arrives from any source the agent touches.
Relevant context is assembled on demand into token-efficient blocks.
Signal becomes a temporal context graph as new facts arrive and stale ones are invalidated.
Selects what's relevant and what adds the most information within the token budget.
Native to the substrate, not a layer bolted on. Every read and write is policy-gated for access and provenance; retention runs across the data lifecycle.
Temporal context graph with provenance — sub-200ms retrieval at scale.
| | Vectorize Hindsight | Zep |
|---|---|---|
| Model | Biomimetic (World / Experiences / Mental Models) | Bi-temporal context graph (facts + provenance + validity) |
| LongMemEval (self-reported) | 91.4% | 90.2% (also 94.7% LoCoMo) |
| Context per query at that score | ~8,192 tokens (Budget.HIGH in their runner) | ~4,408 tokens — roughly half |
| Temporal reasoning | Yes — temporal retrieval arm + temporal indexes on fact lifespans; facts carry temporal links (capped at 20/fact) | Bi-temporal edges: “what's true now / what was true then,” automatic fact invalidation, point-in-time queries |
| Provenance | Yes — facts trace to the originating message; observations record their source facts | Yes — every fact traces to its source episode |
| Open source | Yes (MIT), self-hosted | Graphiti (the graph library) is open source |
| Managed / hosted | Cloud version in development | Managed cloud available today |
| Access control | No built-in RBAC or ABAC (no users/roles/attribute policies). Static API key, off by default; multi-tenant isolation only via a custom-coded extension. MCP tool allowlisting limits surface area, not per-user access | ABAC in the substrate — attribute-based policies govern what each agent/user can read |
| Audit & retention | audit_log table + /audit-logs endpoint, disabled by default; configurable audit retention. No legal hold | Audit, retention policies + legal hold in the substrate |
| Compliance / operations | Self-hosted OSS — you certify and operate it. No SOC 2 / HIPAA from the vendor | Managed service: SOC 2 Type II, HIPAA |
| Deployment | Self-host (Docker/embedded) + forthcoming cloud | Managed, BYOK, or BYOC (AWS/GCP/Azure) |
| Scale (public) | Single Postgres (pgvector/HNSW + BM25 + graph + temporal indexes); stateless API + worker processes scale horizontally; vector search ~10–50ms on 100K+ facts. No published multi-tenant scale figures | Millions of graphs per deployment, sub-200ms p95 |
The published comparison isn't a controlled, matched-backbone head-to-head, by Vectorize's own account. Their repo states that only Hindsight's score was independently reproduced (Virginia Tech, The Washington Post) and that “other scores are self-reported by software vendors.” The Zep figure they cite (71.2%) is from Zep's 2025 paper, not Zep's current 90.2%, so the comparison pairs Hindsight's reproduced Gemini-3 Pro result against Zep's older self-reported figure.
Mechanically the two systems are similar: Hindsight extracts facts with an LLM on ingest (“retain”), and on recall runs vector + BM25 + graph + temporal retrieval merged with reciprocal-rank fusion and a cross-encoder reranker, trimmed to a token limit — the same shape as Graphiti/Zep. (Its “LLM-free recall” claim applies to recall, not ingest.) Hindsight publishes a separate speed/cost benchmark but no per-query latency/token figure in the accuracy table.
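The fusion step named above can be sketched in a few lines. This is a generic reciprocal-rank fusion, not code from Hindsight or Graphiti: the function name, the fact ids, and the constant `k=60` (a commonly used default) are assumptions for illustration, and the cross-encoder reranking and token trimming that follow fusion are omitted.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into one ranking.

    Each id scores the sum of 1 / (k + rank) over every list it appears
    in, so ids that rank well in several retrievers rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical result lists from three retrieval arms.
vector_hits = ["f1", "f2", "f3"]
bm25_hits = ["f2", "f4", "f1"]
graph_hits = ["f2", "f3"]

fused = reciprocal_rank_fusion([vector_hits, bm25_hits, graph_hits])
```

Here `f2` ranks first because all three arms return it near the top, even though only two of them rank it first.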
Context size (from Hindsight's own benchmark code). Their LongMemEval runner defaults to an 8,192-token retrieval budget at Budget.HIGH (thinking_budget=500), with the answer model at high reasoning effort. That's roughly 1.9× the ~4,408-token context Zep reports on the same benchmark — so the 91.4% is achieved with about double the retrieved context (and a top-tier backbone). On a token-matched basis the gap narrows or reverses. (Hindsight's leaner LoCoMo quality benchmark uses a 4,096-token “low” budget.) Read accuracy alongside latency and tokens, on a matched backbone, against Zep's current numbers. (Hindsight paper: arXiv 2512.12818.)
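The context-size arithmetic is easy to check. In the sketch below the two token counts come from the text above; the query volume and the price per million input tokens are hypothetical placeholders, so substitute your own figures.

```python
HINDSIGHT_TOKENS = 8192  # retrieval budget in Hindsight's LongMemEval runner
ZEP_TOKENS = 4408        # context per query Zep reports on the same benchmark

ratio = HINDSIGHT_TOKENS / ZEP_TOKENS  # about 1.86, i.e. "roughly 1.9x"


def monthly_context_cost(tokens_per_query, queries_per_month,
                         usd_per_million_tokens):
    """Cost of the retrieved context alone, excluding prompt and output."""
    total_tokens = tokens_per_query * queries_per_month
    return total_tokens * usd_per_million_tokens / 1_000_000


# Hypothetical workload: 1M queries per month at $1.00 per million tokens.
hindsight_cost = monthly_context_cost(HINDSIGHT_TOKENS, 1_000_000, 1.0)
zep_cost = monthly_context_cost(ZEP_TOKENS, 1_000_000, 1.0)
```

The ratio between the two costs stays the same at any price or volume, because both scale linearly with tokens per query.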
You want an open-source system you can self-host or embed, you like the biomimetic World / Experiences / Mental-Models model, and your priority is LongMemEval-style recall.
Memory has to be governed and operated at enterprise scale today, at lower per-query context cost.
Both score near the top of LongMemEval (Hindsight 91.4%, Zep 90.2%; Zep also reports 94.7% on LoCoMo) — effectively a tie. The more useful question is at what cost: Hindsight's number is measured at an ~8,192-token retrieval budget vs Zep's ~4,408 — about double the context your answer model processes on every query. For the same accuracy, that's roughly 2× the memory-token cost and added latency at scale.
At the published accuracy levels, Zep feeds your answer LLM about half the memory tokens Hindsight does (~4,408 vs ~8,192), so per-query token cost and latency are lower. Hindsight's recall path itself is LLM-free and fast (100–600ms); the cost difference is in how much retrieved context each system hands to the answer model.
Hindsight is open source under the MIT license. Zep's graph library, Graphiti, is also open source; Zep's managed platform and Context Graph Engine are commercial.
Zep is purpose-built for governed memory at scale (ABAC, retention, audit, SOC 2 Type II, HIPAA, BYOK/BYOC, millions of graphs). Evaluate both against your governance, deployment, and scale requirements.
Hindsight has no built-in RBAC or ABAC — no users, roles, or attribute-based access policies. Its built-in auth is a single static API key (off by default); multi-tenant isolation requires coding a custom extension, and the only finer-grained controls are MCP tool allowlisting (which tools are exposed) and a config-field permission hook. Audit logging exists but is disabled by default, and there's no legal hold or vendor compliance certification. Zep provides ABAC, retention with legal hold, and audit in the substrate, as a managed SOC 2 Type II / HIPAA service. If access control and auditability are requirements, that's a meaningful gap to weigh.
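For readers unfamiliar with the term, attribute-based access control decides each read by comparing attributes of the requester and of the resource against policies. The sketch below is a generic illustration of that idea only: the policy format, the attribute names, and `can_read` are hypothetical and do not represent Zep's policy language or API.

```python
def can_read(subject, resource, policies):
    """Allow a read only if some policy matches both attribute sets."""
    def matches(required, actual):
        return all(actual.get(key) == value for key, value in required.items())

    return any(
        matches(policy["subject"], subject)
        and matches(policy["resource"], resource)
        for policy in policies
    )


# Hypothetical policy: support staff may read low-sensitivity Acme facts.
policies = [
    {
        "subject": {"department": "support"},
        "resource": {"tenant": "acme", "sensitivity": "low"},
    },
]

agent = {"department": "support", "role": "agent"}
fact_low = {"tenant": "acme", "sensitivity": "low"}
fact_high = {"tenant": "acme", "sensitivity": "high"}

allowed = can_read(agent, fact_low, policies)   # policy matches
denied = can_read(agent, fact_high, policies)   # sensitivity does not match
```

A single static API key, by contrast, answers only whether a caller may use the service at all, not which facts a given user or agent may see.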
Hindsight supports self-hosting today. Zep offers managed cloud, BYOK, and BYOC (in your VPC); Graphiti can also be self-hosted standalone.