A step-by-step tutorial for giving an AI agent memory that persists across sessions — with code, and an explanation of what happens inside the memory layer.
To give an AI agent memory that persists, you store what it learns in a durable, per-user memory layer — not in the prompt — and retrieve the relevant slice of that memory into the context window on every turn. The context window is temporary and finite; it resets each session and overflows if you keep appending history. Persistent memory lives outside it, in a store keyed to the user, and is assembled into each prompt on demand. This tutorial builds that loop step by step with Zep, and explains what's happening underneath at each step — the part a quickstart usually skips.
Before any code, hold this picture. Persistent memory is two flows around a durable store:
The store in the middle is a temporal context graph, one per user. It doesn't keep raw transcripts to replay; it extracts entities, the facts and relationships between them (each stamped with when it was true and where it came from), and derives Observations — patterns across many sessions. “Persists” comes from two properties: the graph is keyed to the user (so it outlives any single thread), and it's temporal (so it stays current as facts change instead of accumulating contradictions).
Create a Zep account, get an API key, set it as ZEP_API_KEY, and install the SDK:
pip install zep-cloud # Python (TypeScript and Go SDKs also available)Initialize the client once at startup and reuse it:
import os
from zep_cloud.client import Zep
client = Zep(api_key=os.environ["ZEP_API_KEY"])# One Zep user per real user — use your internal user ID
client.user.add(
user_id="your_internal_user_id",
email="jane@example.com",
first_name="Jane",
last_name="Smith",
)
# One thread per conversation
import uuid
thread_id = uuid.uuid4().hex
client.thread.create(thread_id=thread_id, user_id="your_internal_user_id")What's happening, and why it matters. This two-level model is the whole reason memory persists. The user owns the durable knowledge graph; the threadis a single conversation that writes into it. Memory accrues at the user level, so a fact Jane mentioned in a chat last month is available in a brand-new thread today — that's persistence across sessions, by construction. Give the user a real first/last name: Zep uses it to correctly attribute references in messages and business data to the right person as the graph is built. Set the Zep user_id to your own user ID so you never have to map between systems.
from zep_cloud.types import Message
from datetime import datetime, timezone
client.thread.add_messages(
thread_id,
messages=[Message(
created_at=datetime.now(timezone.utc).isoformat(), # RFC3339
name="Jane Smith",
role="user",
content="Who was Octavia Butler?",
)],
)What's happening. Each message becomes an episode— the raw, lossless record that everything else traces back to. In the background, Zep extracts entities and facts from the episode and writes them into Jane's graph. Two fields do real work: name helps the extractor attribute statements to the right entity, and created_atanchors the fact in time. That timestamp is not bookkeeping — it's what lets the graph reason about when something was true, so that if Jane changes her mind next week, the old fact is closed rather than left to contradict the new one.
import json
client.graph.add(
user_id="your_internal_user_id",
type="json", # also "text" or "message"
data=json.dumps({
"event_type": "song_played",
"song_title": "Bohemian Rhapsody",
"artist": "Queen",
"duration_seconds": 354,
}),
)What's happening, and why it's the key step. A common misconception is that “agent memory” means “remembering the conversation.” Real persistent memory unifies everything the agent should know about the user — transactions, support tickets, app events, emails, transcripts. graph.add ingests any text (structured JSON, semi-structured logs, or a plain sentence like "Jane upgraded to the Pro plan") as an episode and folds it into the same user graph as the chat. This is why a support agent can know that Jane's last payment failed without her ever mentioning it: the event was written to her graph from your billing system.
Reason for return
Additional comments
You don't orchestrate any of this — it happens on write:
Before generating a reply, ask for the relevant context:
user_context = client.thread.get_user_context(thread_id=thread_id)
context_block = user_context.context
print(context_block)What's happening. This is the heart of the read path, and the step people most often get wrong by trying to do it themselves. You are notdumping the user's history into the prompt. Zep assembles a Context Block: it takes the current conversation slice (the last couple of messages), runs semantic search, full-text search, and breadth-first graph traversal over Jane's graph, and returns a compact, relevant string — a user summary plus the most relevant facts (with their date ranges), entities, and episodes. It targets high recall (better to include a slightly-relevant fact than miss a needed one) at sub-200ms p95, and it's token-efficient by design, so you get the right context without blowing the window.
The returned block looks like a labeled summary + dated facts — for example a <USER_SUMMARY> followed by <FACTS> such as - User account is suspended due to payment failure (2024-11-14 - present). That structure is deliberate: it's legible to the model and carries the temporal validity ranges.
The default block is good; for production you often want to decide exactly what goes in. Create a template once and reference it by ID:
client.context.create_context_template(
template_id="customer-support",
template="""# CUSTOMER PROFILE
%{user_summary}
# OBSERVATIONS
%{observations limit=10}
# FACTS
%{edges limit=10}
# KEY ENTITIES
%{entities limit=5}""",
)
user_context = client.thread.get_user_context(
thread_id=thread_id, template_id="customer-support",
)
context_block = user_context.contextWhat's happening. A template makes the context reproducible and versioned instead of implicit in your code: you declare which context types (user summary, observations, edges/facts, entities, episodes) appear and at what limits, and Zep fills it on each call. This is also how you pull in derived Observations — the cross-session patterns — alongside raw facts.
Two placements, with a real tradeoff:
System → your static system prompt (cacheable)
…history…
User → latest user message
Tool → {Zep context block} (replaced each turn)Why this matters.It's the difference between a tutorial and production: Option B preserves prompt caching (lower cost and latency) while still giving the model fresh memory every turn.
client.thread.add_messages(
thread_id,
messages=[Message(
created_at=datetime.now(timezone.utc).isoformat(),
name="AI Assistant",
role="assistant",
content="Octavia Butler was an influential American science-fiction writer...",
)],
)What's happening.The reply is itself signal — it records what the agent told the user, which may matter next session (“you said you'd refund me”). Writing it back keeps the graph complete and lets memory compound: every turn makes the next one better informed. This closing write is what turns a one-shot retrieval into a memory that grows.
Run the loop for a while and the agent gains properties a context window can't give it:
created_at / name. Without accurate timestamps and names, temporal reasoning and entity attribution degrade.graph.add — unify the events that actually describe the user.A loop like this turns a stateless agent into one that remembers across sessions, stays consistent when facts change, and grounds answers in what's actually known about the user and business. Zep is the Context Lake for AI agents — it manages, governs, and serves agent memory on temporal context graphs (built on the open-source Graphiti and served by the Context Graph Engine), with sub-200ms p95 retrieval and benchmark-leading accuracy.
Related: How to give an AI agent long-term memory · What is agent memory? · What is a temporal knowledge graph? · How to test agent memory · Quick Start (docs) · Benchmark results · AI agent memory guides
Memory that survives outside the context window — stored in a durable, per-user layer so the agent remembers across sessions, threads, and restarts, rather than forgetting when the conversation ends.
The window is finite and resets each session. Appending full history overflows it, raises cost and latency, and mixes stale facts with current ones. Persistent memory stores everything durably and retrieves only the relevant slice per turn.
In a temporal context graph keyed to the user: episodes (raw inputs), entities and facts (with validity windows and provenance), and derived Observations. With Zep this is managed for you and served in sub-200ms p95.
The core loop is a few SDK calls — create the user/thread, add_messages, graph.add, get_user_context, and write the reply back — and it works with any agent framework, or none.
Yes. The memory layer is framework- and model-agnostic; you assemble the Context Block into whatever prompt your stack already builds.
RAG retrieves static documents by similarity. Persistent agent memory tracks evolving, user-scoped facts over time, with provenance. Most production agents use both — see agent memory vs RAG.