Why LLMs hallucinate and how to reduce it — ground answers with RAG and agent memory, constrain decoding, and verify outputs. A practical developer guide.
Hallucination is when an LLM produces output that sounds confident and plausible but is factually wrong or unsupported. It happens because models predict the next token from statistical patterns — they don't verify facts. You can't eliminate it, but you can reduce it sharply by grounding the model in real data, constraining how it generates, and giving it reliable memory of what's already known. This guide covers why hallucinations happen and the techniques that work in production.
| Technique | What it does | When to reach for it |
|---|---|---|
| Retrieval-augmented generation (RAG) | Grounds answers in retrieved source data | Static documents, FAQs, knowledge bases |
| Agent memory (temporal context graph) | Grounds answers in what's true about the user/business over time | Multi-session agents, personalization, evolving facts |
| Decoding controls (temperature, top-p) | Limits randomness | Factual Q&A, structured output |
| Prompt constraints + few-shot + chain-of-thought | Tells the model not to guess; shows the pattern | Every application; reasoning tasks |
| Post-hoc verification / self-consistency | Catches errors after generation | High-stakes outputs |
An LLM generates text by predicting the most likely next token from patterns in its training data. It has no built-in fact-checking and no awareness of its own uncertainty, so when a query falls outside what it reliably knows, it fills the gap with the most statistically plausible completion — which can be wrong. The main drivers: training-data limits and staleness, purely probabilistic generation, overconfidence with no self-assessment, and the absence of any runtime check against external truth.
The practical implication: reducing hallucination is mostly about supplying the model with the right facts at the right moment, and constraining it when it has none.
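To make the "purely probabilistic generation" point and the decoding controls from the table concrete, here is a minimal sketch of temperature scaling and top-p (nucleus) filtering over a next-token distribution. The token scores are hypothetical values chosen for illustration; a real model produces scores over its entire vocabulary, and hosted APIs apply these controls internally.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw scores to probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [score / temperature for score in logits.values()]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return {tok: e / total for tok, e in zip(logits, exps)}

def top_p_filter(probs, top_p=0.9):
    """Keep the most likely tokens until their cumulative probability reaches top_p, then renormalize."""
    kept = {}
    cumulative = 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}

# Hypothetical next-token scores for "The capital of Australia is ..."
logits = {"Canberra": 2.0, "Sydney": 1.5, "Melbourne": 0.5}

default = softmax_with_temperature(logits, temperature=1.0)
sharp = softmax_with_temperature(logits, temperature=0.2)
nucleus = top_p_filter(default, top_p=0.8)  # drops the low-probability tail
```

Lowering the temperature concentrates probability on the top-scoring token, and top-p removes the unlikely tail. Neither control adds knowledge: if the model's top-scoring token is wrong, low-temperature decoding returns the wrong answer more consistently, which is why grounding comes first.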
Retrieval-augmented generation fetches relevant source text at query time and puts it in the prompt, so the answer is anchored to real data rather than the model's parametric memory. RAG is the right tool for static knowledge (documentation, policies, product manuals) that you can chunk, embed, and retrieve. It addresses staleness (you can supply current data), gives the model a factual basis it can cite, and fills gaps the model “doesn't know it doesn't know.”
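The retrieve-then-prompt flow described above can be sketched as follows. This is a toy illustration, not a production retriever: it scores chunks with bag-of-words cosine similarity as a stand-in for embedding similarity, and the function names (`retrieve`, `build_prompt`) and sample chunks are assumptions made for the example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(chunks, key=lambda c: cosine(q, Counter(tokenize(c))), reverse=True)
    return ranked[:k]

def build_prompt(query, chunks, k=2):
    """Put the retrieved chunks in the prompt and instruct the model to stay within them."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks, k))
    return (
        "Use ONLY the context below. If the answer isn't there, say you don't know.\n"
        f"Context:\n{context}\n"
        f"User: {query}"
    )

# Hypothetical knowledge-base chunks
chunks = [
    "A refund is available within 30 days of purchase.",
    "Standard shipping takes 5 to 7 business days.",
    "Support is open Monday through Friday.",
]
query = "What is the refund window after purchase?"
prompt = build_prompt(query, chunks, k=1)
```

A real pipeline would replace `cosine` over word counts with an embedding model and a vector index, but the shape stays the same: rank chunks against the query, keep the top few, and place them in the prompt ahead of the question.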
RAG's limit: it retrieves by semantic similarity over documents. It doesn't track how facts change over time, it doesn't unify a user's history across sessions and sources, and “closest chunk” isn't always “the fact you need.” For anything stateful — an agent that must remember a user, a customer, or a decision across conversations — document RAG alone produces contradictions and forgetting. That's where agent memory comes in.
For agents, the highest-impact hallucination fix is reliable long-term memory — grounding the agent in what's actually true about the user and the business, tracked over time. Many hallucinations in chat and voice agents aren't knowledge errors; they're context errors: the agent forgets a detail the user gave earlier, or contradicts a fact that has since changed.
A temporal context graph addresses this directly. Instead of dumping raw history into the prompt (which blows the context window and adds noise), it extracts entities, relationships, and facts from every source — chat, business data, documents — and stores them in a graph that is bi-temporal: every fact carries provenance back to the episode that produced it and a validity window. When information changes, the old fact is invalidated and the new one recorded. The agent can ask “what's true now?” or “what was true on this date?” and get the right answer to either — so it stops mixing stale and current facts, the single most common source of agent hallucination.
This is what Zep provides. Zep is the Context Lake for AI agents — the platform that manages, governs, and serves agent memory at scale on temporal context graphs (built on the open-source library Graphiti). Retrieval stays under 200ms p95 and returns token-efficient, relevant context rather than the whole transcript. On the LoCoMo and LongMemEval long-memory benchmarks, this approach leads on accuracy, latency, and token use at the same time (94.7% LoCoMo accuracy at 155ms; 90.2% LongMemEval at 162ms).
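The validity-window idea can be illustrated with a small in-memory sketch. This is not how Zep or Graphiti is implemented; the `Fact` and `FactStore` names are hypothetical, and the sketch tracks only provenance and a single validity window per fact, omitting the second time axis (when the system learned the fact) that a fully bi-temporal store would also keep.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Fact:
    subject: str
    predicate: str
    value: str
    source_episode: str                  # provenance: which conversation or record produced the fact
    valid_from: datetime
    valid_to: Optional[datetime] = None  # None means "still true"

class FactStore:
    """Toy store that keeps one current value per (subject, predicate)."""

    def __init__(self):
        self.facts = []

    def record(self, subject, predicate, value, source_episode, at):
        # Invalidate the currently valid fact instead of deleting it, so history stays queryable.
        for fact in self.facts:
            if (fact.subject, fact.predicate) == (subject, predicate) and fact.valid_to is None:
                fact.valid_to = at
        self.facts.append(Fact(subject, predicate, value, source_episode, at))

    def as_of(self, subject, predicate, when):
        """Return the fact that was valid at `when`, or None if nothing was known yet."""
        for fact in self.facts:
            if (fact.subject, fact.predicate) != (subject, predicate):
                continue
            if fact.valid_from <= when and (fact.valid_to is None or when < fact.valid_to):
                return fact
        return None

# Hypothetical customer whose plan changes between sessions
store = FactStore()
store.record("acme", "plan", "Starter", "chat-001", datetime(2024, 1, 10))
store.record("acme", "plan", "Pro", "chat-007", datetime(2024, 6, 2))

now = store.as_of("acme", "plan", datetime(2024, 9, 1))      # value: "Pro"
earlier = store.as_of("acme", "plan", datetime(2024, 3, 1))  # value: "Starter"
```

Because the old fact is closed rather than overwritten, the same store answers both “what's true now?” and “what was true on this date?”, and each answer can be traced back to the episode that produced it.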
RAG and agent memory are complementary: use RAG to ground answers in documents, and agent memory to ground the agent in the user and the business over time. Production systems that minimize hallucination usually use both.
In code, grounding is a retrieval step before generation. With agent memory you fetch the relevant, current facts for this user and put them in the prompt — so the model answers from what's true, not what's plausible:
# Fetch relevant, current context for this user (Zep)
memory = client.thread.get_user_context(thread_id=thread_id).context
prompt = f"""Use ONLY the context below. If the answer isn't there, say you don't know.
Context (what's currently true about this user):
{memory}
User: {user_question}"""
Two things in that prompt cut hallucinations: the grounding context (so the model has the facts) and the explicit “say you don't know” instruction (so it abstains instead of inventing). The temporal context graph is designed to keep that context current, so the model is not handed a stale fact from three sessions ago.
For high-stakes outputs, add a safety net: rule-based filters for impossible answers, cross-checks against an authoritative source or API, and self-consistency (sample several answers and take the consensus — hallucinated details tend to diverge).
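Self-consistency can be sketched as a majority vote with an abstention threshold. The sampled answers below are hard-coded, hypothetical strings; in practice they would come from several model calls at a non-zero temperature. The `min_agreement` threshold and the normalization rules are assumptions you would tune for your task.

```python
from collections import Counter

def normalize(answer):
    """Light normalization so trivially different strings count as the same answer."""
    return " ".join(answer.lower().strip().rstrip(".").split())

def self_consistent_answer(samples, min_agreement=0.6):
    """Return (answer, agreement); answer is None when no answer reaches min_agreement."""
    if not samples:
        return None, 0.0
    counts = Counter(normalize(s) for s in samples)
    answer, votes = counts.most_common(1)[0]
    agreement = votes / len(samples)
    return (answer if agreement >= min_agreement else None), agreement

# Hypothetical samples for the same question
agreeing = ["30 days", "30 days.", "30 Days", "14 days", "30 days"]
diverging = ["30 days", "14 days", "60 days", "7 days", "90 days"]

answer, agreement = self_consistent_answer(agreeing)
fallback, low_agreement = self_consistent_answer(diverging)
```

When the samples diverge, the function returns `None` instead of a guess, which lets the caller escalate, re-retrieve, or respond with "I don't know". Exact-match voting suits short factual answers; longer free-text answers need a semantic comparison step instead.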
You can't reduce what you don't measure. Build an evaluation set of questions with known correct answers (include “should abstain” cases), run your agent, and score two things: did it retrieve the right context (completeness), and was the answer faithful to that context (correctness)? Track both scores before and after each change so you know which intervention actually helped. For the full method, including metrics, benchmarks, and an evaluation harness, see how to test agent memory and the LLM evaluation framework.
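A minimal harness for the two scores might look like the sketch below. Everything here is illustrative: the `Case` structure, the `agent(question)` interface returning `(retrieved_facts, answer)`, and the exact-match scoring are assumptions, and `stub_agent` is a stand-in for a real retrieval and generation pipeline.

```python
from dataclasses import dataclass

ABSTAIN = "I don't know"

@dataclass
class Case:
    question: str
    required_facts: list   # facts the retriever must return; empty for "should abstain" cases
    expected_answer: str   # ABSTAIN when the agent should decline

def completeness(case, retrieved):
    """Fraction of required facts that were retrieved (1.0 when none are required)."""
    if not case.required_facts:
        return 1.0
    found = sum(1 for fact in case.required_facts if fact in retrieved)
    return found / len(case.required_facts)

def correct(case, answer):
    """Exact match after light normalization; a grader model is a common alternative."""
    return answer.strip().lower() == case.expected_answer.strip().lower()

def evaluate(cases, agent):
    """Run every case through agent(question) -> (retrieved_facts, answer) and average the scores."""
    results = [(case, *agent(case.question)) for case in cases]
    return {
        "completeness": sum(completeness(c, r) for c, r, _ in results) / len(results),
        "correctness": sum(correct(c, a) for c, _, a in results) / len(results),
    }

def stub_agent(question):
    # Hypothetical pipeline: answers the refund question, invents an answer otherwise.
    if "refund" in question:
        return ["refund window is 30 days"], "30 days"
    return [], "Probably 2 weeks"

cases = [
    Case("What is the refund window?", ["refund window is 30 days"], "30 days"),
    Case("What is the warranty period?", [], ABSTAIN),
]
scores = evaluate(cases, stub_agent)
```

In this run the stub retrieves everything it needs but fails the abstention case by inventing an answer, so completeness is perfect while correctness is not. Separating the two scores shows whether a regression comes from retrieval or from generation.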
There's no single fix. In practice, a production setup combines: RAG for document grounding, a temporal context graph for agent memory and personalization, low-temperature decoding for factual tasks, prompt constraints against guessing, and verification on critical paths. Start with grounding (RAG + memory) — it removes the largest class of errors — then layer constraints and verification.
Related: What is agent memory? · Agent memory vs RAG · LLM evaluation framework · AI agent memory guides
RAG grounds answers in retrieved documents; agent memory grounds the agent in facts about the user and business that evolve over time. RAG handles static knowledge; agent memory handles stateful, multi-session context. Use both.
No. Larger windows let you pass more text, but stuffing full history adds noise and can increase hallucination. Selecting the relevant context (via retrieval or a context graph) beats dumping everything in.
Not today. You can reduce them substantially by grounding, constraining, and verifying — enough to move from demo to production-grade reliability.
In agents, the largest class isn't knowledge gaps — it's context errors: the agent forgot a fact from an earlier session, or acted on a fact that has since changed. Reliable agent memory (a temporal context graph) removes that class directly.
Run a fixed evaluation set with known answers and score retrieval completeness and answer faithfulness before and after each change. See how to test agent memory.