We're hiring! Come build with us
Zep
AI Agents Guide

How to Give Your AI Agent Memory That Persists

A step-by-step tutorial for giving an AI agent memory that persists across sessions — with code, and an explanation of what happens inside the memory layer.

8 candidates · ranked for task
ObsJane upgrades within 2 weeks of each launch.
FactJoined Aug 2024.
FactCurrently on Pro v4.
FactAccount billing monthly.
SumRecent chats: power-user features.
SumPast tickets: rate limits.
ObsTickets pair with plan changes.
FactLast login 12h ago.
Context block1,847 / 2,000
ObsJane upgrades within 2 weeks of each launch.
FactCurrently on Pro v4.
SumRecent chats: power-user features.
ObsTickets pair with plan changes.

Key takeaways

  • Persistent memory means the agent's knowledge lives in a durable, per-user store — not the context window — so it survives across sessions, threads, and restarts.
  • The loop is simple: writeevery message and business event to the user's graph, read an assembled Context Block before each reply, and write the reply back so memory compounds.
  • The store is a temporal context graph: inputs become episodes, from which entities, facts (with validity windows + provenance), and Observations are derived.
  • With Zep this is a handful of SDK calls, framework-agnostic, with sub-200ms p95 retrieval (Quick Start, benchmark results).

To give an AI agent memory that persists, you store what it learns in a durable, per-user memory layer — not in the prompt — and retrieve the relevant slice of that memory into the context window on every turn. The context window is temporary and finite; it resets each session and overflows if you keep appending history. Persistent memory lives outside it, in a store keyed to the user, and is assembled into each prompt on demand. This tutorial builds that loop step by step with Zep, and explains what's happening underneath at each step — the part a quickstart usually skips.

The mental model: a write path and a read path

Before any code, hold this picture. Persistent memory is two flows around a durable store:

  • Write path — every signal the agent sees (user messages, and business data like events, tickets, or transcripts) is sent to the store and added to that user's graph.
  • Read path — before the agent answers, you ask the store for the relevant context for the current moment and drop it into the prompt.

The store in the middle is a temporal context graph, one per user. It doesn't keep raw transcripts to replay; it extracts entities, the facts and relationships between them (each stamped with when it was true and where it came from), and derives Observations — patterns across many sessions. “Persists” comes from two properties: the graph is keyed to the user (so it outlives any single thread), and it's temporal (so it stays current as facts change instead of accumulating contradictions).

Prerequisites

Create a Zep account, get an API key, set it as ZEP_API_KEY, and install the SDK:

pip install zep-cloud   # Python  (TypeScript and Go SDKs also available)

Initialize the client once at startup and reuse it:

import os
from zep_cloud.client import Zep

client = Zep(api_key=os.environ["ZEP_API_KEY"])

Step 1 — Model users and threads (and why they're separate)

# One Zep user per real user — use your internal user ID
client.user.add(
    user_id="your_internal_user_id",
    email="jane@example.com",
    first_name="Jane",
    last_name="Smith",
)

# One thread per conversation
import uuid
thread_id = uuid.uuid4().hex
client.thread.create(thread_id=thread_id, user_id="your_internal_user_id")

What's happening, and why it matters. This two-level model is the whole reason memory persists. The user owns the durable knowledge graph; the threadis a single conversation that writes into it. Memory accrues at the user level, so a fact Jane mentioned in a chat last month is available in a brand-new thread today — that's persistence across sessions, by construction. Give the user a real first/last name: Zep uses it to correctly attribute references in messages and business data to the right person as the graph is built. Set the Zep user_id to your own user ID so you never have to map between systems.

Step 2 — Write user messages (memory is built from episodes)

from zep_cloud.types import Message
from datetime import datetime, timezone

client.thread.add_messages(
    thread_id,
    messages=[Message(
        created_at=datetime.now(timezone.utc).isoformat(),  # RFC3339
        name="Jane Smith",
        role="user",
        content="Who was Octavia Butler?",
    )],
)

What's happening. Each message becomes an episode— the raw, lossless record that everything else traces back to. In the background, Zep extracts entities and facts from the episode and writes them into Jane's graph. Two fields do real work: name helps the extractor attribute statements to the right entity, and created_atanchors the fact in time. That timestamp is not bookkeeping — it's what lets the graph reason about when something was true, so that if Jane changes her mind next week, the old fact is closed rather than left to contradict the new one.

Step 3 — Write business data (this is what makes memory more than chat history)

import json

client.graph.add(
    user_id="your_internal_user_id",
    type="json",  # also "text" or "message"
    data=json.dumps({
        "event_type": "song_played",
        "song_title": "Bohemian Rhapsody",
        "artist": "Queen",
        "duration_seconds": 354,
    }),
)

What's happening, and why it's the key step. A common misconception is that “agent memory” means “remembering the conversation.” Real persistent memory unifies everything the agent should know about the user — transactions, support tickets, app events, emails, transcripts. graph.add ingests any text (structured JSON, semi-structured logs, or a plain sentence like "Jane upgraded to the Pro plan") as an episode and folds it into the same user graph as the chat. This is why a support agent can know that Jane's last payment failed without her ever mentioning it: the event was written to her graph from your billing system.

RRobbie2024-09-07 · 14:27
I only wear Adidas shoes. I love them!
Facts
  • Robbie only wears Adidas shoes.
  • Robbie strongly favors Adidas shoes.
soleworks.com/account/returns/SO-48219
SoleworksReturn · Order #SO-48219 · Adidas Ultraboost 22

Reason for return

Product fell apart

Additional comments

These Adidas fell apartafter three weeks and I'm furious. I'll be buying Nike from now on.
Facts
  • Robbie only wears Adidas shoes.
  • Robbie strongly favors Adidas shoes.
  • Robbie’s Adidas shoes fell apart.
  • Robbie is returning their Adidas shoes.
  • Robbie is angry about their Adidas shoes.
  • Robbie intends to wear Nike shoes.

What Zep does in the background

You don't orchestrate any of this — it happens on write:

  • Extraction turns episodes into entities (people, accounts, products) and facts/edges between them.
  • Each fact gets a validity window (valid-from / valid-to) and provenance back to its source episode — the bi-temporal model behind point-in-time answers and auditability.
  • When a new fact contradicts an old one, the old one is invalidated (closed), not deleted, so the agent reasons over current truth while history stays queryable.
  • Observations — patterns across many episodes (recurring behaviors, decisions, preferences) — are derived and stored as single retrievable claims.

Step 4 — Read: retrieve the Context Block (the read path)

Before generating a reply, ask for the relevant context:

user_context = client.thread.get_user_context(thread_id=thread_id)
context_block = user_context.context
print(context_block)

What's happening. This is the heart of the read path, and the step people most often get wrong by trying to do it themselves. You are notdumping the user's history into the prompt. Zep assembles a Context Block: it takes the current conversation slice (the last couple of messages), runs semantic search, full-text search, and breadth-first graph traversal over Jane's graph, and returns a compact, relevant string — a user summary plus the most relevant facts (with their date ranges), entities, and episodes. It targets high recall (better to include a slightly-relevant fact than miss a needed one) at sub-200ms p95, and it's token-efficient by design, so you get the right context without blowing the window.

The returned block looks like a labeled summary + dated facts — for example a <USER_SUMMARY> followed by <FACTS> such as - User account is suspended due to payment failure (2024-11-14 - present). That structure is deliberate: it's legible to the model and carries the temporal validity ranges.

Step 5 — Control what's in the block with a context template (optional)

The default block is good; for production you often want to decide exactly what goes in. Create a template once and reference it by ID:

client.context.create_context_template(
    template_id="customer-support",
    template="""# CUSTOMER PROFILE
%{user_summary}

# OBSERVATIONS
%{observations limit=10}

# FACTS
%{edges limit=10}

# KEY ENTITIES
%{entities limit=5}""",
)

user_context = client.thread.get_user_context(
    thread_id=thread_id, template_id="customer-support",
)
context_block = user_context.context

What's happening. A template makes the context reproducible and versioned instead of implicit in your code: you declare which context types (user summary, observations, edges/facts, entities, episodes) appear and at what limits, and Zep fills it on each call. This is also how you pull in derived Observations — the cross-session patterns — alongside raw facts.

Step 6 — Put the Context Block in the agent's prompt

Two placements, with a real tradeoff:

  • Option A — system prompt. Append the block to your system prompt. Simplest, but the system prompt now changes every turn, which defeats prompt caching with your LLM provider.
  • Option B — context message (recommended at scale). Keep a static, cacheable system prompt, and insert the block as a separate “context message” (a tool message) right after the latest user message. Each turn, remove the previous context message and add the new one. Everything before it stays cacheable.
System     → your static system prompt        (cacheable)
…history…
User       → latest user message
Tool       → {Zep context block}              (replaced each turn)

Why this matters.It's the difference between a tutorial and production: Option B preserves prompt caching (lower cost and latency) while still giving the model fresh memory every turn.

Step 7 — Write the assistant's reply back (close the loop)

client.thread.add_messages(
    thread_id,
    messages=[Message(
        created_at=datetime.now(timezone.utc).isoformat(),
        name="AI Assistant",
        role="assistant",
        content="Octavia Butler was an influential American science-fiction writer...",
    )],
)

What's happening.The reply is itself signal — it records what the agent told the user, which may matter next session (“you said you'd refund me”). Writing it back keeps the graph complete and lets memory compound: every turn makes the next one better informed. This closing write is what turns a one-shot retrieval into a memory that grows.

What “persists” actually buys you

Run the loop for a while and the agent gains properties a context window can't give it:

  • Cross-session continuity — it remembers Jane between conversations, because the knowledge is in her user graph, not the thread.
  • Consistency over time— when a fact changes, the old one is invalidated, so the agent doesn't contradict itself.
  • Whole-user awareness — it knows the business context (payments, tickets, events), not just what was typed.
  • Bounded prompts — you pass a small, relevant Context Block instead of an ever-growing transcript, so cost and latency stay flat as history grows.

Common pitfalls

  • Stuffing history into the prompt instead of retrieving. This is the anti-pattern persistent memory replaces — it overflows the window and re-introduces stale facts.
  • Skipping created_at / name. Without accurate timestamps and names, temporal reasoning and entity attribution degrade.
  • Writing only chat, not business data. The biggest wins come from graph.add — unify the events that actually describe the user.
  • Forgetting to write the assistant reply back. Memory stops compounding.

What this gets you

A loop like this turns a stateless agent into one that remembers across sessions, stays consistent when facts change, and grounds answers in what's actually known about the user and business. Zep is the Context Lake for AI agents — it manages, governs, and serves agent memory on temporal context graphs (built on the open-source Graphiti and served by the Context Graph Engine), with sub-200ms p95 retrieval and benchmark-leading accuracy.


Related: How to give an AI agent long-term memory · What is agent memory? · What is a temporal knowledge graph? · How to test agent memory · Quick Start (docs) · Benchmark results · AI agent memory guides

Frequently asked questions

What does "persistent memory" mean for an AI agent?

Memory that survives outside the context window — stored in a durable, per-user layer so the agent remembers across sessions, threads, and restarts, rather than forgetting when the conversation ends.

Why can't I just keep the whole conversation in the context window?

The window is finite and resets each session. Appending full history overflows it, raises cost and latency, and mixes stale facts with current ones. Persistent memory stores everything durably and retrieves only the relevant slice per turn.

Where is the memory actually stored?

In a temporal context graph keyed to the user: episodes (raw inputs), entities and facts (with validity windows and provenance), and derived Observations. With Zep this is managed for you and served in sub-200ms p95.

How much code does it take?

The core loop is a few SDK calls — create the user/thread, add_messages, graph.add, get_user_context, and write the reply back — and it works with any agent framework, or none.

Does it work with my existing framework and LLM?

Yes. The memory layer is framework- and model-agnostic; you assemble the Context Block into whatever prompt your stack already builds.

How is this different from RAG?

RAG retrieves static documents by similarity. Persistent agent memory tracks evolving, user-scoped facts over time, with provenance. Most production agents use both — see agent memory vs RAG.