Reducing LLM Hallucinations: A Developer's Guide

Why LLMs hallucinate and how to reduce it — ground answers with RAG and agent memory, constrain decoding, and verify outputs. A practical developer guide.

Key takeaways

  • Hallucinations stem from how LLMs generate text — probabilistic next-token prediction with no built-in fact-checking — so the fix is to ground, constrain, and verify.
  • For agents, the highest-impact fix is reliable agent memory: grounding the agent in what's true about the user and business over time, via a temporal context graph.
  • RAG and agent memory are complementary. On long-memory benchmarks, Zep's approach reports 94.7% LoCoMo accuracy at 155ms retrieval (benchmark results).

Hallucination is when an LLM produces output that sounds confident and plausible but is factually wrong or unsupported. It happens because models predict the next token from statistical patterns — they don't verify facts. You can't eliminate it, but you can reduce it sharply by grounding the model in real data, constraining how it generates, and giving it reliable memory of what's already known. This guide covers why hallucinations happen and the techniques that work in production.

| Technique | What it does | When to reach for it |
| --- | --- | --- |
| Retrieval-augmented generation (RAG) | Grounds answers in retrieved source data | Static documents, FAQs, knowledge bases |
| Agent memory (temporal context graph) | Grounds answers in what's true about the user/business over time | Multi-session agents, personalization, evolving facts |
| Decoding controls (temperature, top-p) | Limits randomness | Factual Q&A, structured output |
| Prompt constraints + few-shot + chain-of-thought | Tells the model to not guess; shows the pattern | Every application; reasoning tasks |
| Post-hoc verification / self-consistency | Catches errors after generation | High-stakes outputs |

Why LLMs hallucinate

An LLM generates text by predicting the most likely next token from patterns in its training data. It has no built-in fact-checking and no reliable sense of its own uncertainty, so when a query falls outside what it reliably knows, it fills the gap with the most statistically plausible completion, which can be wrong. The main drivers are training-data limits and staleness, purely probabilistic generation, overconfidence without self-assessment, and the absence of any runtime check against external truth.

The practical implication: reducing hallucination is mostly about supplying the model with the right facts at the right moment, and constraining it when it has none.

Ground the model in data: RAG

Retrieval-augmented generation fetches relevant source text at query time and puts it in the prompt, so the answer is anchored to real data rather than the model's parametric memory. RAG is the right tool for static knowledge (documentation, policies, product manuals) that you can chunk, embed, and retrieve. It addresses staleness (you can supply current data), gives the model a factual basis it can cite, and fills gaps the model “doesn't know it doesn't know.”

RAG's limit: it retrieves by semantic similarity over documents. It doesn't track how facts change over time, it doesn't unify a user's history across sessions and sources, and “closest chunk” isn't always “the fact you need.” For anything stateful — an agent that must remember a user, a customer, or a decision across conversations — document RAG alone produces contradictions and forgetting. That's where agent memory comes in.
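To make the retrieve-then-prompt step concrete, here is a minimal sketch. It assumes a toy in-memory knowledge base and uses word-count cosine similarity as a stand-in for real embeddings; `CHUNKS`, `retrieve`, and `build_prompt` are illustrative names, not part of any library.

```python
import math
import re
from collections import Counter

# Hypothetical mini knowledge base; a real system would chunk and embed documents.
CHUNKS = [
    "Returns are accepted within 30 days of delivery.",
    "Standard shipping takes 3 to 5 business days.",
    "Gift cards cannot be exchanged for cash.",
]

def bag_of_words(text: str) -> Counter:
    """Lowercased word counts; a toy stand-in for an embedding vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = bag_of_words(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, bag_of_words(c)), reverse=True)
    return ranked[:k]

def build_prompt(question: str, chunks: list[str]) -> str:
    """Put the retrieved text in the prompt so the answer is anchored to it."""
    context = "\n".join(f"- {c}" for c in retrieve(question, chunks))
    return (
        "Use ONLY the context below. If the answer isn't there, say you don't know.\n\n"
        f"Context:\n{context}\n\nUser: {question}"
    )

print(build_prompt("How many days do I have for returns?", CHUNKS))
```

The sketch also shows the limit described above: ranking is purely by similarity, so nothing in it knows whether the top chunk is still current or belongs to this particular user.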

Ground the agent in memory: temporal context graphs

For agents, the highest-impact hallucination fix is reliable long-term memory — grounding the agent in what's actually true about the user and the business, tracked over time. Many hallucinations in chat and voice agents aren't knowledge errors; they're context errors: the agent forgets a detail the user gave earlier, or contradicts a fact that has since changed.

Episode 1: chat message from Robbie, 2024-09-07 · 14:27
“I only wear Adidas shoes. I love them!”
Facts
  • Robbie only wears Adidas shoes.
  • Robbie strongly favors Adidas shoes.

Episode 2: Soleworks return form (soleworks.com/account/returns/SO-48219), Order #SO-48219, Adidas Ultraboost 22
Reason for return: Product fell apart
Additional comments: “These Adidas fell apart after three weeks and I'm furious. I'll be buying Nike from now on.”
Facts
  • Robbie only wears Adidas shoes.
  • Robbie strongly favors Adidas shoes.
  • Robbie’s Adidas shoes fell apart.
  • Robbie is returning their Adidas shoes.
  • Robbie is angry about their Adidas shoes.
  • Robbie intends to wear Nike shoes.

A temporal context graph addresses this directly. Instead of dumping raw history into the prompt (which blows the context window and adds noise), it extracts entities, relationships, and facts from every source — chat, business data, documents — and stores them in a graph that is bi-temporal: every fact carries provenance back to the episode that produced it and a validity window. When information changes, the old fact is invalidated and the new one recorded. The agent can ask “what's true now?” or “what was true on this date?” and get the right answer to either — so it stops mixing stale and current facts, the single most common source of agent hallucination.
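The invalidate-and-record behavior can be sketched in a few lines. This is a simplified illustration, not Zep's or Graphiti's implementation: it models only the validity window and provenance (no graph, no entity extraction, no separate record-time axis), and `Fact`, `FactStore`, and the `prefers_brand` predicate are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Fact:
    subject: str
    predicate: str
    obj: str
    source_episode: str                    # provenance: the episode that produced the fact
    valid_at: datetime                     # when the fact became true
    invalid_at: Optional[datetime] = None  # when it stopped being true, if ever

class FactStore:
    def __init__(self) -> None:
        self.facts: list[Fact] = []

    def record(self, fact: Fact) -> None:
        """Close any still-valid fact with the same subject and predicate, then add the new one."""
        for old in self.facts:
            same_slot = (old.subject, old.predicate) == (fact.subject, fact.predicate)
            if same_slot and old.invalid_at is None:
                old.invalid_at = fact.valid_at
        self.facts.append(fact)

    def true_at(self, when: datetime) -> list[Fact]:
        """Facts whose validity window contains `when`."""
        return [
            f for f in self.facts
            if f.valid_at <= when and (f.invalid_at is None or when < f.invalid_at)
        ]

store = FactStore()
store.record(Fact("Robbie", "prefers_brand", "Adidas", "chat message", datetime(2024, 9, 7, 14, 27)))
# The return date is not given in the example above; 2024-09-28 is an assumed value.
store.record(Fact("Robbie", "prefers_brand", "Nike", "return SO-48219", datetime(2024, 9, 28)))

print([f.obj for f in store.true_at(datetime(2024, 9, 10))])  # ['Adidas']
print([f.obj for f in store.true_at(datetime(2024, 10, 1))])  # ['Nike']
```

The old fact is kept rather than deleted, which is what lets the store answer both "what's true now?" and "what was true on this date?".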

This is what Zep provides. Zep is the Context Lake for AI agents — the platform that manages, governs, and serves agent memory at scale on temporal context graphs (built on the open-source library Graphiti). Retrieval stays under 200ms p95 and returns token-efficient, relevant context rather than the whole transcript. On the LoCoMo and LongMemEval long-memory benchmarks, this approach leads on accuracy, latency, and token use at the same time (94.7% LoCoMo accuracy at 155ms; 90.2% LongMemEval at 162ms).

RAG and agent memory are complementary: use RAG to ground answers in documents, and agent memory to ground the agent in the user and the business over time. Production systems that minimize hallucination usually use both.

Grounding an agent with memory: example

In code, grounding is a retrieval step before generation. With agent memory you fetch the relevant, current facts for this user and put them in the prompt — so the model answers from what's true, not what's plausible:

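import os
from zep_cloud.client import Zep  # assumed setup: Zep Cloud Python SDK

client = Zep(api_key=os.environ["ZEP_API_KEY"])  # assumed env var name
thread_id = "thread-123"  # placeholder: an existing thread for this user
user_question = "Which shoe brand should we recommend?"  # placeholder

# Fetch relevant, current context for this user (Zep)
memory = client.thread.get_user_context(thread_id=thread_id).context

# Ground the model in that context and tell it to abstain when the answer is missing
prompt = f"""Use ONLY the context below. If the answer isn't there, say you don't know.

Context (what's currently true about this user):
{memory}

User: {user_question}"""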

Two things in that prompt cut hallucinations: the grounding context (so the model has the facts) and the explicit “say you don't know” instruction (so it abstains instead of inventing). Because the temporal context graph invalidates superseded facts, the retrieved context is meant to reflect the current truth, not a stale fact from three sessions ago.

Constrain generation

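  • Temperature / top-p / top-k: lower values make output more deterministic by concentrating sampling on the model's most likely tokens. That cuts random drift, but it doesn't correct a wrong top choice, so pair it with grounding. Use low temperature for factual Q&A.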
  • Length and stop conditions: models ramble into hallucination; cap answers to what was asked.
  • Confidence signals: inspect token probabilities or ask the model to flag low confidence, and fall back to retrieval or a human when it does.
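To see what the decoding controls above do to a next-token distribution, here is a small self-contained sketch. The logits are made up for illustration, and the two helper functions are a plain-Python rendering of temperature scaling and top-p (nucleus) filtering, not any provider's API.

```python
import math

def softmax_with_temperature(logits: dict[str, float], temperature: float) -> dict[str, float]:
    """Convert logits to probabilities; lower temperature (must be > 0) sharpens the distribution."""
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def top_p_filter(probs: dict[str, float], p: float) -> dict[str, float]:
    """Keep the smallest set of most-likely tokens whose mass reaches p, then renormalize."""
    kept: dict[str, float] = {}
    mass = 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = prob
        mass += prob
        if mass >= p:
            break
    return {tok: prob / mass for tok, prob in kept.items()}

# Hypothetical logits for the next token after "The capital of Australia is"
logits = {"Canberra": 3.0, "Sydney": 2.5, "Melbourne": 1.0}

for t in (1.0, 0.2):
    probs = softmax_with_temperature(logits, t)
    print(t, {tok: round(pr, 3) for tok, pr in probs.items()})

# Nucleus filtering at p=0.9 drops the least likely candidate entirely
print(top_p_filter(softmax_with_temperature(logits, 1.0), 0.9))
```

Lowering the temperature moves probability mass toward the top-ranked token. If that token were the wrong one, low temperature would simply make the wrong answer more consistent, which is why these controls complement grounding rather than replace it.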

Prompt to discourage guessing

  • Tell it not to fabricate: an explicit “if you don't know, say so” instruction gives the model a sanctioned alternative to guessing and typically reduces invented answers.
  • Few-shot examples that include an “I don't know” response teach the model that abstaining is acceptable.
  • Chain-of-thought for reasoning tasks surfaces the steps so errors are catchable — but verify the final answer for critical outputs, since long reasoning chains can hallucinate too.
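The first two points can be combined in a single prompt builder. The sketch below is one possible layout under assumed example data; the order records and the `build_few_shot_prompt` helper are hypothetical.

```python
# Hypothetical few-shot examples; the second one demonstrates abstaining.
FEW_SHOT = [
    {
        "context": "Order #1001 shipped on 2024-05-02.",
        "question": "When did order #1001 ship?",
        "answer": "It shipped on 2024-05-02.",
    },
    {
        "context": "Order #1001 shipped on 2024-05-02.",
        "question": "Which carrier delivered order #1001?",
        "answer": "I don't know. The context doesn't mention the carrier.",
    },
]

def build_few_shot_prompt(context: str, question: str) -> str:
    """Assemble the instruction, worked examples, and the real question into one prompt."""
    parts = ["Answer using ONLY the context. If the answer isn't there, say you don't know."]
    for ex in FEW_SHOT:
        parts.append(
            f"Context: {ex['context']}\nQuestion: {ex['question']}\nAnswer: {ex['answer']}"
        )
    # Leave the final answer open for the model to complete
    parts.append(f"Context: {context}\nQuestion: {question}\nAnswer:")
    return "\n\n".join(parts)

print(build_few_shot_prompt("Robbie intends to wear Nike shoes.", "What is Robbie's shoe size?"))
```

Pairing an answerable and an unanswerable question over the same context shows the model that the right response depends on what the context contains, not on how the question is phrased.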

Verify after generation

For high-stakes outputs, add a safety net: rule-based filters for impossible answers, cross-checks against an authoritative source or API, and self-consistency (sample several answers and take the consensus — hallucinated details tend to diverge).
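Self-consistency is straightforward to sketch: sample several answers, normalize them, and accept the majority only if enough samples agree. The model call here is a canned stand-in, and the `min_agreement` threshold of 0.6 is an arbitrary illustrative value.

```python
from collections import Counter
from itertools import cycle
from typing import Callable, Optional

def self_consistent_answer(
    sample: Callable[[], str], n: int = 5, min_agreement: float = 0.6
) -> Optional[str]:
    """Sample n answers; return the majority answer, or None if agreement is too low."""
    answers = [sample().strip().lower() for _ in range(n)]
    answer, count = Counter(answers).most_common(1)[0]
    return answer if count / n >= min_agreement else None

# Stand-in for an LLM sampled at non-zero temperature (hypothetical canned outputs).
_canned = cycle(["Canberra", "canberra", "Sydney", "Canberra", "Canberra"])

def fake_model() -> str:
    return next(_canned)

print(self_consistent_answer(fake_model))  # canberra
```

A `None` result is the signal to fall back to retrieval, an authoritative API, or a human. Exact string matching only works for short answers; free-form text would need a looser comparison.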

Measure your hallucination rate

You can't reduce what you don't measure. Build an evaluation set of questions with known correct answers (include “should abstain” cases), run your agent, and score two things: did it retrieve the right context (completeness), and was the answer faithful to that context (correctness)? Track the rate before and after each change so you know which intervention actually helped. For the full method — metrics, benchmarks, and an evaluation harness — see how to test agent memory and the LLM evaluation framework.
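A bare-bones version of such a harness might look like the following. It is a sketch under simplifying assumptions: the agent is a stub that returns both its retrieved context and its answer, scoring uses case-insensitive substring matching as a crude proxy, and the shoe-size question is an invented "should abstain" case.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Case:
    question: str
    expected: Optional[str]  # None means the agent should abstain

@dataclass
class Result:
    context: str  # what the agent retrieved
    answer: str   # what the agent said

def is_abstention(answer: str) -> bool:
    return "don't know" in answer.lower()

def evaluate(agent: Callable[[str], Result], cases: list[Case]) -> dict[str, float]:
    """Score retrieval completeness and answer correctness over a fixed set."""
    answerable = [c for c in cases if c.expected is not None]
    retrieved = correct = 0
    for case in cases:
        result = agent(case.question)
        if case.expected is None:
            correct += is_abstention(result.answer)
        else:
            retrieved += case.expected.lower() in result.context.lower()
            correct += case.expected.lower() in result.answer.lower()
    return {
        "retrieval_completeness": retrieved / len(answerable) if answerable else 0.0,
        "answer_correctness": correct / len(cases) if cases else 0.0,
    }

def stub_agent(question: str) -> Result:
    """Hypothetical agent: answers the brand question, invents a shoe size."""
    if "brand" in question:
        return Result("Robbie intends to wear Nike shoes.", "Robbie now prefers Nike.")
    return Result("", "Robbie wears size 10.")  # should have abstained

cases = [
    Case("Which brand does Robbie prefer?", "Nike"),
    Case("What is Robbie's shoe size?", None),
]
print(evaluate(stub_agent, cases))
# {'retrieval_completeness': 1.0, 'answer_correctness': 0.5}
```

Keeping the two scores separate tells you where to intervene: low completeness points at retrieval or memory, while good completeness with low correctness points at prompting or decoding.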

The practical stack

There's no single fix. In practice, a production setup combines: RAG for document grounding, a temporal context graph for agent memory and personalization, low-temperature decoding for factual tasks, prompt constraints against guessing, and verification on critical paths. Start with grounding (RAG + memory) — it removes the largest class of errors — then layer constraints and verification.


Related: What is agent memory? · Agent memory vs RAG · LLM evaluation framework · AI agent memory guides

Frequently asked questions

What's the difference between RAG and agent memory for reducing hallucinations?

RAG grounds answers in retrieved documents; agent memory grounds the agent in facts about the user and business that evolve over time. RAG handles static knowledge; agent memory handles stateful, multi-session context. Use both.

Does a bigger context window fix hallucinations?

No. Larger windows let you pass more text, but stuffing full history adds noise and can increase hallucination. Selecting the relevant context (via retrieval or a context graph) beats dumping everything in.

Can you eliminate hallucinations entirely?

Not today. You can reduce them substantially by grounding, constraining, and verifying — enough to move from demo to production-grade reliability.

What causes most hallucinations in production agents?

In agents, the largest class isn't knowledge gaps — it's context errors: the agent forgot a fact from an earlier session, or acted on a fact that has since changed. Reliable agent memory (a temporal context graph) removes that class directly.

How do I measure whether my changes reduced hallucinations?

Run a fixed evaluation set with known answers and score retrieval completeness and answer faithfulness before and after each change. See how to test agent memory.