How to Reduce LLM Hallucinations

Key takeaways

Hallucination has two distinct forms: factuality (the output contradicts the world) and faithfulness (the output contradicts its own source or input). They have different fixes.
Models hallucinate partly by design. Training and evaluation reward a confident guess over an “I don't know,” so models learn to guess. On the AA-Omniscience benchmark (current June 2026), even the best frontier models score around 40 out of a possible 100 on knowledge reliability, and most of the field still gets hard questions wrong at least as often as right.
No single technique removes hallucination. The reliable pattern is layered: retrieve relevant facts, let the model abstain when unsure, verify before answering, and constrain the output.
Most production agent errors are context problems, not model-quality problems. The right fact was not in front of the model at the right moment. Persistent, sourced agent memory is how you close that gap across sessions.

An LLM hallucinates when it generates plausible text that is not grounded in fact or in its provided context. The model is not lying and it is not broken. It is doing what it was trained to do — predict likely next tokens — in a situation where the likely continuation is not the true one. Understanding that difference is the starting point for reducing it, because it tells you that the fix is rarely a better model and usually better context, better incentives, and better checks around the model.

Why LLMs hallucinate

Two kinds of hallucination

The research literature splits hallucination into two types, and the distinction matters because each has a different cause and a different remedy. A 2025 comprehensive survey of causes, detection, and mitigation (2510.06265) traces this split through the full model pipeline, from data and pretraining to fine-tuning and inference.

Type one: Factuality

Output conflicts with real-world facts.
Example: the model says “Founded in 2009”; the world says “Founded in 2011”.
Trigger: answers from parametric memory it does not actually hold.
Primary fix: ground in retrieved sources; let the model abstain.

Type two: Faithfulness

Output conflicts with the provided context or input.
Example: the source given says “Revenue rose 12%”; the output says “Revenue fell”.
Trigger: ignores, misreads, or contradicts the documents and history it was given.
Primary fix: tighter retrieval; verification; smaller, cleaner context.

Figure 2 — Two failures, two fixes. Factuality errors contradict the real world; faithfulness errors contradict the source the model was handed — each needs a different remedy.

A retrieval system can fix factuality and still leave faithfulness untouched: the right document is in the context window and the model summarizes it wrong. Knowing which failure you have tells you which lever to pull.

Models are trained to guess

The most useful recent explanation of the root cause comes from Kalai et al. at OpenAI, Why Language Models Hallucinate (2509.04664). Their argument is that hallucination is a predictable result of how models are trained and graded, not a mysterious glitch.

Pretraining gives the model a statistical view of language, so even a perfectly calibrated model will produce some errors on facts that are rare or unseen. The deeper problem is the incentive. Most benchmarks score a correct answer as a point and an abstention as zero, the same as a wrong answer. Under that rule guessing never scores lower than abstaining, and it scores higher whenever the guess has any chance of being right. Models optimized to be good test-takers learn exactly that lesson: when uncertain, produce a confident answer rather than admit the gap. The behavior we call hallucination is, in part, rational test-taking.
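The incentive is easy to see with a little arithmetic. The sketch below is an illustration, not code from the paper: the function names and the penalty values are assumptions chosen for the example. It compares answering against abstaining (which always scores 0) when the model's answer is right with probability `p`.

```python
def expected_score_if_guessing(p: float, wrong_penalty: float) -> float:
    """Expected score of answering when the answer is right with probability p.

    A correct answer earns 1 point, a wrong answer costs `wrong_penalty`
    points, and abstaining always scores 0.
    """
    return p * 1.0 - (1.0 - p) * wrong_penalty


def should_guess(p: float, wrong_penalty: float) -> bool:
    """Guess only when the expected score beats the abstention score of 0."""
    return expected_score_if_guessing(p, wrong_penalty) > 0.0


# Binary grading (wrong answers cost nothing): even a 10% hunch is worth a guess.
print(should_guess(0.10, wrong_penalty=0.0))  # True

# Penalized grading (a wrong answer costs 1 point): guessing pays only above 50%.
print(should_guess(0.10, wrong_penalty=1.0))  # False
print(should_guess(0.70, wrong_penalty=1.0))  # True
```

Under this scoring the break-even confidence is `wrong_penalty / (1 + wrong_penalty)`, so raising the penalty raises the bar for answering.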

This reframes the goal. You are not only trying to make the model know more. You are trying to make it willing to say “I don't know” when it does not — and building systems that reward that.

Confabulation and uncertainty

A large share of hallucinations are what Farquhar et al. call confabulations in their 2024 Nature paper, Detecting hallucinations in large language models using semantic entropy: arbitrary, incorrect generations that change from run to run. Their finding is practical. When a model is uncertain in a way that produces confabulation, its sampled answers disagree in meaning, not just wording. Measuring that semantic spread — high entropy across meaning-clustered samples — flags likely hallucinations without any external knowledge source. Uncertainty is detectable, which means it is actionable.

The context is part of the cause

Even when the right information is supplied, models use it unevenly. Liu et al., Lost in the Middle (2307.03172), showed that performance is highest when relevant facts sit at the very start or end of the context window and degrades when they are buried in the middle — even in models built for long contexts. The implication for agents is direct. Dumping an entire chat history or a pile of retrieved chunks into the prompt does not help; it dilutes the signal and raises the chance the model leans on its priors instead of your data. Less, better-placed context beats more context.
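One way to act on this, sketched here as a heuristic rather than a method from the paper, is to order retrieved passages so the strongest ones sit at the edges of the context. The function name `reorder_for_edges` is hypothetical, and the sketch assumes the input is already sorted from most to least relevant.

```python
from typing import List, TypeVar

T = TypeVar("T")


def reorder_for_edges(ranked: List[T]) -> List[T]:
    """Place the most relevant items at the start and end of the context.

    `ranked` must be sorted from most to least relevant. Items are dealt
    alternately to the front and the back, so the least relevant ones
    end up in the middle.
    """
    front = ranked[0::2]        # ranks 1, 3, 5, ...
    back = ranked[1::2][::-1]   # ranks ..., 6, 4, 2
    return front + back


passages = ["rank1", "rank2", "rank3", "rank4", "rank5"]
print(reorder_for_edges(passages))
# ['rank1', 'rank3', 'rank5', 'rank4', 'rank2']
```

Reordering helps only after the list has been trimmed; it does not replace cutting marginal passages.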

How often do frontier models hallucinate?

Hallucination is now measured directly. The clearest current benchmark is AA-Omniscience from Artificial Analysis (leaderboard, paper 2511.13029). It asks 6,000 questions across 42 topics in six domains, derived from authoritative academic and industry sources, and it is built specifically to remove the guessing incentive. The leaderboard updates as new models ship; the figures below are current as of June 2026.

AA-Omniscience reports three numbers:

Accuracy — share of all questions answered correctly, whether or not the model chose to answer.
Hallucination rate — of the questions the model got wrong or skipped, how often it gave a wrong answer instead of abstaining. Defined as incorrect / (incorrect + partial + not attempted).
AA-Omniscience Index — a bounded score from −100 to +100 that rewards a correct answer, penalizes a wrong one, and gives zero for an abstention. A model that answers as many questions wrong as right scores 0; the ceiling of 100 means never wrong.
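The three numbers can be computed from a tally of graded answers. The sketch below follows the definitions as stated above; the exact Index formula (plus one per correct answer, minus one per incorrect answer, zero otherwise, scaled to ±100) is an assumption consistent with that description, not the benchmark's published code, and the two tallies are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Tally:
    correct: int
    partial: int
    incorrect: int
    not_attempted: int

    @property
    def total(self) -> int:
        return self.correct + self.partial + self.incorrect + self.not_attempted


def accuracy(t: Tally) -> float:
    """Share of all questions answered correctly."""
    return t.correct / t.total


def hallucination_rate(t: Tally) -> float:
    """incorrect / (incorrect + partial + not attempted)."""
    not_correct = t.incorrect + t.partial + t.not_attempted
    return t.incorrect / not_correct if not_correct else 0.0


def omniscience_index(t: Tally) -> float:
    """ASSUMED formula: +1 per correct, -1 per incorrect, 0 otherwise, scaled to [-100, 100]."""
    return 100.0 * (t.correct - t.incorrect) / t.total


# Two hypothetical models on 100 questions.
guesser = Tally(correct=50, partial=0, incorrect=45, not_attempted=5)
abstainer = Tally(correct=40, partial=0, incorrect=10, not_attempted=50)

for name, t in [("guesser", guesser), ("abstainer", abstainer)]:
    print(name, accuracy(t), round(hallucination_rate(t), 2), omniscience_index(t))
# guesser 0.5 0.9 5.0
# abstainer 0.4 0.17 30.0
```

The invented guesser has the higher accuracy and the lower Index, which is the pattern the leaderboard findings describe.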

Two findings are worth carrying into design decisions.

First, hallucination remains the default on hard questions, even at the frontier. The top AA-Omniscience Index score is around 40 (a Claude Opus 4.8 reasoning configuration), with Gemini 3.1 Pro Preview near 33 and Claude Opus 4.8 around 27. Those are the leaders — far below the 100 ceiling, and most of the field still clusters near or below zero, meaning they get difficult questions wrong at least as often as right. No current model is close to factually reliable on this set.

Second, and more useful: higher accuracy does not mean lower hallucination. GPT-5.5 posts among the highest raw accuracy on the benchmark (roughly 56–57%), yet it trails Claude Opus 4.8 and Gemini 3.1 Pro on the reliability Index, because it answers more of the questions it does not know instead of abstaining. A model can know more and still be less reliable, because reliability is about what it does when it doesn't know.

A caution on the hallucination-rate metric on its own: a model that almost always abstains scores a near-perfect hallucination rate while being useless. A small open model on the same leaderboard records a ~1% hallucination rate precisely because it rarely attempts an answer. Read hallucination rate together with accuracy and the Index, never alone.


Source — Artificial Analysis, AA-Omniscience leaderboard (figures current June 2026). Index scale runs −100 to +100.

Figure 3 — Accuracy isn't reliability. On AA-Omniscience (June 2026), GPT-5.5 leads on accuracy yet trails on the reliability Index because it guesses instead of abstaining.

The takeaway for builders is that picking a “smarter” model does not buy you reliability on its own. You choose models for the calibration behavior your use case needs, and you build the controls below around whatever model you pick. For measuring this on your own application rather than a public benchmark, see the LLM evaluation framework.

Techniques to reduce hallucinations

No single method solves this. The systems that hold up in production stack several controls, each targeting a different cause from the section above. The order here is roughly the order of impact for most agents.

The first and the last of these — retrieval (§1) and agent memory (§5) — are the same idea in two forms: grounding, putting true, sourced facts in front of the model instead of trusting what is in its weights. They differ in what they ground against. RAG grounds an answer in a static document corpus, retrieved by similarity. Agent memory grounds the agent in evolving, sourced facts about the user and the business, tracked over time. Different data, different access pattern, same job. A real agent uses both, which is why they appear as separate controls rather than one. (For a deeper comparison, see agent memory vs RAG.)

Ground in retrieved documents (Grounding · RAG)

Put the facts in front of the model instead of trusting its weights. Attacks factuality — the model no longer has to know the answer, only to read it.

Let the model abstain

Make “I don't know” an acceptable, instructed outcome, and grade evals so a wrong answer costs more than declining. Sample and measure agreement of meaning to abstain quantitatively.

Verify before answering

Draft, generate independent verification questions, answer them in isolation, then revise. For agents, run verification gates at each step, not just at the end.

Constrain the output

When the answer has a known shape — a schema, an enum, a set of valid IDs — constrained decoding removes a whole class of hallucination by construction.

Give the agent memory (Grounding · over time)

Per-turn checks miss contradictions across sessions. Persistent, sourced memory grounds the agent in the evolving state of the user and the business — served as the relevant slice, with provenance and validity windows.

Start with grounding and abstention — they remove the largest share of errors for the least effort. Add verification and constrained outputs where a wrong answer is expensive.

Figure 4 — Defense in depth. Five stacked controls in rough order of impact; the two highlighted layers, RAG and agent memory, are grounding aimed at different targets.

1. Ground the model in retrieved documents

The highest-leverage move is to stop asking the model to recall facts from its weights and instead put the facts in front of it. Retrieval-augmented generation (RAG) retrieves relevant documents at query time and conditions the answer on them. It directly attacks factuality hallucination: the model no longer has to know the answer, only to read it. This is grounding against a static corpus (documents, manuals, a knowledge base) that does not change per user. A 2025 application-oriented survey of RAG, reasoning, and agentic mitigation (2510.24476) maps where each grounding method helps and where it does not.

python

# Retrieve first, then answer only from what was retrieved.
# `retriever` and `llm` stand in for your own retrieval and model clients.
docs = retriever.search(query, k=5)
context = "\n\n".join(d.text for d in docs)

prompt = f"""Answer the question using ONLY the context below.
If the context does not contain the answer, reply exactly: "Not in the provided sources."

Context:
{context}

Question: {query}"""

answer = llm.generate(prompt)

Two cautions. RAG fixes factuality but can introduce faithfulness errors if the model strays from the retrieved text, so the instruction to answer only from context does real work here. And RAG quality is retrieval quality: when retrieval surfaces weak or off-topic passages, a meaningful share of answers stay partly ungrounded even with documents in context. Recent work on domain-grounded, tiered retrieval (2603.17872) shows the gains come from retrieving precisely and ranking hard, not from retrieving more. Keep the context tight, and prefer fewer high-relevance passages over many marginal ones (see Lost in the Middle above).
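A minimal sketch of "fewer, high-relevance passages": keep only candidates above a relevance floor, then cap the count. The `Passage` type, the 0.5 floor, the cap of 3, and the sample scores are all assumptions for illustration; real thresholds depend on the retriever or reranker in use.

```python
from typing import List, NamedTuple


class Passage(NamedTuple):
    text: str
    score: float  # relevance from a retriever or reranker; higher is better


def select_passages(
    candidates: List[Passage],
    min_score: float = 0.5,
    max_passages: int = 3,
) -> List[Passage]:
    """Drop weak passages first, then keep at most `max_passages` of the rest."""
    strong = [p for p in candidates if p.score >= min_score]
    strong.sort(key=lambda p: p.score, reverse=True)
    return strong[:max_passages]


candidates = [
    Passage("Refund policy: 30 days from delivery.", 0.91),
    Passage("Shipping times vary by region.", 0.42),
    Passage("Refunds are issued to the original payment method.", 0.78),
    Passage("Company history and founding story.", 0.12),
]

for p in select_passages(candidates):
    print(p.score, p.text)
# 0.91 Refund policy: 30 days from delivery.
# 0.78 Refunds are issued to the original payment method.
```

An empty result is useful information in its own right: it is a signal to reply "Not in the provided sources." rather than to answer from the model's weights.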

2. Let the model abstain

If models hallucinate because guessing scores well, the fix is to make abstention an acceptable, instructed outcome. This is the practical consequence of the Kalai et al. argument. Give the model explicit permission to decline, and grade your own evals so that a wrong answer costs more than an “I don't know.” Recent work frames the underlying decision as separating evidence-backed generation (output supported by retrieved passages, computation, or citations) from prior-only generation drawn from the model's weights, and gates the latter (2604.06195). Conformal-abstention methods (2405.01563) go further, giving statistical guarantees on the resulting hallucination rate.

python

prompt = f"""Answer the question. You will be penalized for a wrong answer
more than for declining. If you are not confident the answer is correct,
respond exactly: "I'm not certain." Do not guess.

Question: {query}"""

You can make abstention quantitative rather than relying on the model's self-report. Following Farquhar et al., sample the answer several times and measure whether the samples agree in meaning; high disagreement is a signal to abstain or escalate.

python

# Cheap semantic-entropy proxy: sample, then check agreement of meaning.
samples = [llm.generate(prompt, temperature=0.8) for _ in range(5)]
if not answers_agree_in_meaning(samples):   # e.g. NLI clustering of the samples
    answer = "I'm not certain. Escalating or asking a clarifying question."
else:
    answer = majority_meaning(samples)
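The agreement check itself can be sketched as clustering by meaning and taking the entropy of the cluster sizes. This is a simplified, count-based illustration of the idea, not the paper's implementation: `normalized_match` is a stand-in for a real equivalence test (in practice a bidirectional-entailment check with an NLI model), and the 0.5 threshold is an assumption.

```python
import math
from typing import Callable, List


def cluster_by_meaning(
    samples: List[str], same_meaning: Callable[[str, str], bool]
) -> List[List[str]]:
    """Greedily group samples that `same_meaning` judges equivalent."""
    clusters: List[List[str]] = []
    for sample in samples:
        for cluster in clusters:
            if same_meaning(sample, cluster[0]):
                cluster.append(sample)
                break
        else:
            clusters.append([sample])
    return clusters


def semantic_entropy(
    samples: List[str], same_meaning: Callable[[str, str], bool]
) -> float:
    """Entropy (in nats) of the distribution of samples over meaning clusters."""
    clusters = cluster_by_meaning(samples, same_meaning)
    shares = [len(c) / len(samples) for c in clusters]
    return sum(-s * math.log(s) for s in shares)


def normalized_match(a: str, b: str) -> bool:
    """STAND-IN equivalence test: compares text after dropping case and punctuation."""
    def norm(s: str) -> str:
        return "".join(ch for ch in s.lower() if ch.isalnum())
    return norm(a) == norm(b)


THRESHOLD = 0.5  # assumed cutoff; tune on your own data

agreeing = ["Paris", "paris.", "Paris"]
disagreeing = ["2009", "2011", "2013"]
print(semantic_entropy(agreeing, normalized_match) > THRESHOLD)     # False
print(semantic_entropy(disagreeing, normalized_match) > THRESHOLD)  # True
```

Samples that all land in one cluster give zero entropy; three samples in three clusters give the maximum for that sample count, which is the case to abstain on or escalate.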

3. Verify before answering

A model can check its own work if you make it a separate step. The pattern that holds up: draft an answer, generate independent verification questions about that draft, answer those questions on their own so they are not biased by the draft, then revise. For agents, 2026 work argues this should run as explicit verification gates at each step — planning, retrieval, reasoning, execution — rather than a single end-of-turn check, because an agent can act on a wrong intermediate conclusion long before it produces a final answer (2604.04269).

python

draft = llm.generate(f"Answer: {query}")

checks = llm.generate(f"List factual claims in this answer that should be "
                      f"independently verified:\n{draft}")

# Answer each verification question in isolation: no draft in context.
verifications = [llm.generate(f"Verify this claim, citing evidence: {c}")
                 for c in parse_list(checks)]

final = llm.generate(f"Revise the answer so it is consistent with these "
                     f"verifications. Remove unsupported claims.\n"
                     f"Draft: {draft}\nVerifications: {verifications}")

A lighter-weight relative is self-consistency: sample several reasoning paths and take the answer the majority converge on. It costs extra tokens but catches the arbitrary, run-to-run confabulations that semantic entropy also targets.
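A minimal sketch of the voting step, assuming the final answers have already been extracted from each sampled reasoning path. The `min_share` cutoff of 0.6 is an assumption, and the simple lowercase-and-strip normalization only suits short answers; free-text answers need a meaning-level comparison instead.

```python
from collections import Counter
from typing import List, Optional, Tuple


def majority_answer(
    answers: List[str], min_share: float = 0.6
) -> Tuple[Optional[str], float]:
    """Return (winning answer, its share), or (None, share) if agreement is too weak."""
    if not answers:
        return None, 0.0
    normalized = [a.strip().lower() for a in answers]
    winner, count = Counter(normalized).most_common(1)[0]
    share = count / len(normalized)
    return (winner if share >= min_share else None), share


# Final answers extracted from five sampled reasoning paths.
print(majority_answer(["42", "42", "41", "42 ", "42"]))  # ('42', 0.8)

# No clear majority: treat as a signal to abstain or escalate.
answer, share = majority_answer(["2009", "2011", "2013"])
print(answer)  # None
```

Returning `None` below the cutoff keeps a weak plurality from being presented as a confident answer.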

4. Constrain the output

When the answer has a known shape — a schema, an enum, a set of valid IDs — constrain the model so it cannot invent values that do not exist. Structured outputs and constrained decoding restrict generation to the allowed grammar, which removes a whole class of hallucination by construction. The NAACL industry study Reducing hallucination in structured outputs via RAG (2024) shows the two techniques compound: ground the values, then constrain the format.

python

schema = {
    "type": "object",
    "properties": {
        "status": {"type": "string", "enum": ["open", "closed", "pending"]},
        "order_id": {"type": "string"},
    },
    "required": ["status", "order_id"],
}
# The model cannot return a status outside the enum or omit a required field.
result = llm.generate(prompt, response_format={"type": "json_schema", "schema": schema})
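A schema like this constrains the shape, but `order_id` is still a free string, so the model can return a well-formed ID that does not exist. One way to close that gap, sketched here with a hypothetical `known_order_ids` set standing in for a database lookup, is to validate the parsed reply against the values that actually exist.

```python
import json
from typing import Set

ALLOWED_STATUS = {"open", "closed", "pending"}


def validate_order_reply(raw: str, known_order_ids: Set[str]) -> dict:
    """Parse a model reply and reject values that are not known to exist.

    Raises ValueError on malformed JSON, an unknown status, or an unknown ID.
    """
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    if data.get("status") not in ALLOWED_STATUS:
        raise ValueError(f"unknown status: {data.get('status')!r}")
    if data.get("order_id") not in known_order_ids:
        raise ValueError(f"unknown order_id: {data.get('order_id')!r}")
    return data


known_order_ids = {"A-100", "A-101"}  # hypothetical IDs for illustration

print(validate_order_reply('{"status": "open", "order_id": "A-100"}', known_order_ids))
# {'status': 'open', 'order_id': 'A-100'}

try:
    validate_order_reply('{"status": "open", "order_id": "A-999"}', known_order_ids)
except ValueError as err:
    print("rejected:", err)
# rejected: unknown order_id: 'A-999'
```

Where the set of valid IDs is small and known up front, it can go into the schema as an enum instead, which moves the check into decoding.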

5. Give the agent memory

The four techniques above fix a single turn. Agents run for many turns and many sessions, and that is where a different failure shows up: the agent forgets what the user told it, contradicts a decision from two sessions ago, or treats a fact that changed yesterday as still true. These are faithfulness hallucinations across time, and no per-turn check catches them, because the contradicting information is not in the prompt at all.

This is the context problem at the heart of most production agent failures, and recent surveys of hallucination in agentic systems (2510.24476) put cross-step state and memory among the hardest cases. The right fact existed; it just was not in front of the model at the right moment. Agent memory is the system that fixes it — persistent, retrievable knowledge of the user, the business, and the work, served into each turn as the relevant slice rather than the whole history.

This is grounding (§1), pointed at a different target. RAG grounds the model in a static document corpus; agent memory grounds it in the evolving, sourced state of the user and the business. The two are complementary, not alternatives: use RAG to answer from documents, and agent memory to keep the agent consistent with what it has learned over time. Where RAG removes a factuality gap on a single question, memory removes the faithfulness gap that opens up between sessions.

Three properties of good agent memory map directly onto the hallucination causes:

Provenance. Every fact traces back to the source it came from, so a grounded answer can cite where it learned something and a wrong fact can be audited rather than guessed at.
Temporality. Facts carry validity windows. When a user's preference changes, the old fact is marked invalid and the new one recorded, so the agent answers “what is true now” without contradicting itself or resurfacing stale facts, a common driver of faithfulness errors.
Relevant retrieval. Memory returns the token-efficient slice that matters for the current task, not the full transcript. That keeps the context window clean and well-placed, which is exactly what Lost in the Middle says the model needs.

A worked example: rather than replaying the conversation into the prompt, the agent writes new signals to memory and reads back only the relevant context.

python

import json

# `client` is an initialized memory client and `Message` its message type (setup not shown).
# WRITE: persist signals as they arrive (a chat turn and a business event)
client.thread.add_messages(thread_id=thread_id, messages=[
    Message(role="user", content="Cancel my Pro plan, switching to a competitor."),
])
client.graph.add(user_id=user_id, type="json",
                 data=json.dumps({"event": "plan_cancel", "from": "pro"}))

# READ: assemble only the relevant, sourced context for this turn
context = client.thread.get_user_context(thread_id=thread_id).context
prompt = f"Use this verified context about the user:\n{context}\n\nQuestion: {query}"

The agent never sees the entire history. It sees the facts, with their sources and their current validity, that bear on the question in front of it. That is grounding applied to the agent's own state, and it removes the temporal contradictions that single-turn techniques cannot reach.

Session 1 (Jan 2026). User: “I only use Adidas.” Fact recorded: prefers: Adidas (valid from Jan 2026).

Session 2 (Mar 2026). User: “Returning these, switching to Nike.” Invalidated: prefers: Adidas (valid Jan to Mar 2026). Fact recorded: prefers: Nike (valid from Mar 2026).

Session 3 (now). Query: “What shoes does this user prefer?” Answer: Nike (cite: Session 2 · Mar 2026).

Without temporal memory: a flat chat buffer or vector store surfaces both facts (“Adidas” and “Nike”) with no notion of which is current, and lets the model contradict itself.

Figure 5 — Faithfulness across time. When a fact changes, agent memory invalidates the old one and records the new, so a later query is answered with what's true now.
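The invalidation behavior in Figure 5 can be sketched with a small in-memory store. This is a toy illustration of validity windows and provenance, not how any particular memory product implements them; the class names, field names, and the specific days in the dates are assumptions.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Fact:
    subject: str
    predicate: str
    value: str
    source: str                       # provenance: where the fact was learned
    valid_from: date
    valid_to: Optional[date] = None   # None means the fact is still valid


class TemporalMemory:
    def __init__(self) -> None:
        self.facts: List[Fact] = []

    def record(self, subject: str, predicate: str, value: str,
               source: str, when: date) -> None:
        """Close the window on any current fact for this key, then add the new one."""
        for fact in self.facts:
            if (fact.subject, fact.predicate) == (subject, predicate) and fact.valid_to is None:
                fact.valid_to = when
        self.facts.append(Fact(subject, predicate, value, source, when))

    def current(self, subject: str, predicate: str) -> Optional[Fact]:
        """Return the fact that is valid now, if any."""
        for fact in self.facts:
            if (fact.subject, fact.predicate) == (subject, predicate) and fact.valid_to is None:
                return fact
        return None


memory = TemporalMemory()
memory.record("user", "prefers", "Adidas", "Session 1", date(2026, 1, 15))
memory.record("user", "prefers", "Nike", "Session 2", date(2026, 3, 10))

fact = memory.current("user", "prefers")
print(fact.value, "| cite:", fact.source)  # Nike | cite: Session 2
```

The superseded fact is kept with a closed window rather than deleted, so a question about what was true in February can still be answered and audited.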

The strongest implementations build this on a temporal context graph: inputs are ingested, entities and facts are extracted with validity windows and provenance, and retrieval returns the relevant slice. At enterprise scale, agent memory is implemented as a Context Lake — a governed system of context graphs. This is what Zep provides: it manages, governs, and serves agent memory on temporal context graphs, so the context an agent reasons over is sourced, current, and scoped to the task. For a hands-on build, see how to give an AI agent long-term memory.

Putting it together

Match the control to the cause. A reliable agent uses several at once.

Both grounding controls — RAG and agent memory — sit at the top, because the largest share of hallucinations comes from the model answering with knowledge it does not have. They ground against different things.

Cause	Symptom	Control
Recalling facts from model weights it doesn't reliably hold	Confident wrong facts	Grounding via RAG — static documents (§1)
Acting without the user's evolving, sourced state	Contradicts earlier turns; acts on outdated facts	Grounding via agent memory — provenance + temporality (§5)
Guessing incentive	Wrong answer instead of “I don't know”	Abstention + uncertainty checks (§2)
Arbitrary confabulation	Answer changes run to run	Self-consistency, semantic entropy (§2, §3)
Misreading provided context	Output contradicts its sources	Chain-of-Verification, tighter context (§3)
Invented field values	Out-of-schema or fake IDs	Structured outputs / constrained decoding (§4)

Start with grounding and abstention — they remove the largest share of errors for the least effort. Add verification and constrained outputs where a wrong answer is expensive. For anything that runs across sessions, agent memory is the layer that keeps the grounding true over time. To know whether a change actually helped, measure it: see how to test agent memory and the LLM evaluation framework.

Sources

Current state (2025–2026):

Artificial Analysis — AA-Omniscience: Knowledge and Hallucination Benchmark (live leaderboard, figures current June 2026): leaderboard · arxiv.org/abs/2511.13029
Kalai, Nachum, Vempala, Zhang — Why Language Models Hallucinate (OpenAI, 2025): arxiv.org/abs/2509.04664
A Comprehensive Survey of Hallucination in LLMs: Causes, Detection, and Mitigation (2025): arxiv.org/abs/2510.06265
Mitigating Hallucination in LLMs: An Application-Oriented Survey on RAG, Reasoning, and Agentic Systems (2025): arxiv.org/abs/2510.24476
Mitigating LLM Hallucinations through Domain-Grounded Tiered Retrieval (2026): arxiv.org/abs/2603.17872
Hallucination as Output-Boundary Misclassification: A Composite Abstention Architecture (2026): arxiv.org/abs/2604.06195
Beyond Fluency: Toward Reliable Trajectories in Agentic IR (2026): arxiv.org/abs/2604.04269
Mitigating LLM Hallucinations via Conformal Abstention: arxiv.org/abs/2405.01563
Reducing hallucination in structured outputs via RAG (NAACL 2024, Industry): aclanthology.org/2024.naacl-industry.19

Foundational:

Farquhar, Kossen, Kuhn, Gal — Detecting hallucinations in large language models using semantic entropy (Nature, 2024): nature.com/articles/s41586-024-07421-0
Liu et al. — Lost in the Middle: How Language Models Use Long Contexts (2023): arxiv.org/abs/2307.03172

This article references hallucination as a research topic. The techniques here reduce, but do not guarantee the elimination of, incorrect model outputs; validate critical outputs before relying on them.

Frequently asked questions

Why do LLMs hallucinate?

Because they predict likely text, and the likely continuation is not always the true one. Two structural reasons make it persistent: pretraining leaves gaps on rare facts, and most training and evaluation reward a confident guess over an admission of uncertainty, so models learn to guess. See Kalai et al., Why Language Models Hallucinate.

Can hallucinations be eliminated completely?

No. They can be reduced substantially. Grounding, abstention, verification, and constrained outputs each cut a different class of error, and combining them is far more effective than any one alone. The goal is a system reliable enough for its use case, with abstention or escalation when confidence is low.

What is the difference between factuality and faithfulness hallucinations?

Factuality hallucinations contradict the real world; faithfulness hallucinations contradict the context or input the model was given. Retrieval fixes factuality; verification and tighter context fix faithfulness. See the 2025 survey of causes, detection, and mitigation (2510.06265).

Does RAG stop hallucinations?

It reduces factuality hallucinations by putting sources in the context, but it does not stop a model from misreading or straying from those sources. Instruct the model to answer only from the retrieved text, keep the context tight, and add verification for high-stakes answers.

How do you measure hallucination rates?

Use a benchmark that penalizes guessing, not just one that rewards correct answers. AA-Omniscience (2511.13029) reports accuracy, a hallucination rate, and an index that subtracts points for wrong answers and gives zero for abstention, which is why high-accuracy models can still rank poorly. See the LLM evaluation framework for evaluating this on your own application.

How does agent memory reduce hallucinations?

It grounds the agent in sourced, current facts about the user and the business and serves only the relevant slice into each turn. Provenance gives every fact a source, temporality keeps changed facts from contradicting each other across sessions, and selective retrieval keeps the context window clean. See what is agent memory.

Is RAG the same as agent memory?

No, but they are both grounding. RAG retrieves static documents by similarity to answer a question; agent memory tracks evolving, sourced facts about the user and the business over time, with provenance and validity windows. RAG closes a factuality gap on a single query; memory closes the faithfulness gap that opens across sessions. Most production agents use both — RAG for documents, memory for state.