Why LLMs hallucinate, how often frontier models do it on benchmarks like AA-Omniscience, and the techniques that reduce hallucinations when building agents — grounding, abstention, verification, and agent memory.
An LLM hallucinates when it generates plausible text that is not grounded in fact or in its provided context. The model is not lying and it is not broken. It is doing what it was trained to do (predict likely next tokens) in a situation where the likely continuation is not the true one. Seeing hallucination as expected behavior rather than a malfunction is the starting point for reducing it, because it tells you that the fix is rarely a better model and usually better context, better incentives, and better checks around the model.
The research literature splits hallucination into two types, and the distinction matters because each has a different cause and a different remedy. Factuality hallucinations contradict the real world; faithfulness hallucinations contradict the context or input the model was given. A 2025 comprehensive survey of causes, detection, and mitigation (2510.06265) traces this split through the full model pipeline, from data and pretraining to fine-tuning and inference.
A retrieval system can fix factuality and still leave faithfulness untouched: the right document is in the context window and the model summarizes it wrong. Knowing which failure you have tells you which lever to pull.
The most useful recent explanation of the root cause comes from Kalai et al. at OpenAI, Why Language Models Hallucinate (2509.04664). Their argument is that hallucination is a predictable result of how models are trained and graded, not a mysterious glitch.
Pretraining gives the model a statistical view of language, so even a perfectly calibrated model will produce some errors on facts that are rare or unseen. The deeper problem is the incentive. Most benchmarks score a correct answer as a point and an abstention as zero, the same as a wrong answer. Under that rule the expected score is always higher if you guess. Models optimized to be good test-takers learn exactly that lesson: when uncertain, produce a confident answer rather than admit the gap. The behavior we call hallucination is, in part, rational test-taking.
This reframes the goal. You are not only trying to make the model know more. You are trying to make it willing to say “I don't know” when it does not — and building systems that reward that.
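The incentive argument can be checked with a little arithmetic. The sketch below is illustrative only: `expected_score` and `should_answer` are hypothetical helpers, and the penalty of 3 points is an assumed value, not a figure from any benchmark. If the model is right with probability p, a guess earns p on average under binary grading and abstaining earns 0, so guessing never loses. Once a wrong answer costs `wrong_penalty` points, guessing only pays when p exceeds `wrong_penalty / (1 + wrong_penalty)`.

```python
def expected_score(p_correct: float, wrong_penalty: float = 0.0) -> float:
    """Expected score of answering: +1 if right, -wrong_penalty if wrong."""
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


def should_answer(p_correct: float, wrong_penalty: float = 0.0) -> bool:
    """Answer only when guessing beats the abstention score of 0."""
    return expected_score(p_correct, wrong_penalty) > 0.0


# Binary grading (no penalty): even a 10% hunch is worth guessing.
binary_choice = should_answer(0.10, wrong_penalty=0.0)    # True

# Penalized grading (assumed penalty of 3 points per wrong answer):
# the break-even confidence is 3 / (1 + 3) = 0.75.
penalized_low = should_answer(0.60, wrong_penalty=3.0)    # False
penalized_high = should_answer(0.90, wrong_penalty=3.0)   # True
```

The same function can grade an internal eval: score abstentions as 0 and choose the penalty so that the break-even confidence matches how costly a wrong answer is in the application.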
A large share of hallucinations are what Farquhar et al. call confabulations in their 2024 Nature paper, Detecting hallucinations in large language models using semantic entropy: arbitrary, incorrect generations that change from run to run. Their finding is practical. When a model is uncertain in a way that produces confabulation, its sampled answers disagree in meaning, not just wording. Measuring that semantic spread — high entropy across meaning-clustered samples — flags likely hallucinations without any external knowledge source. Uncertainty is detectable, which means it is actionable.
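A minimal sketch of the idea, under stated assumptions: it is a discrete approximation that counts how many samples fall into each meaning cluster, and `toy_same_meaning` is a hypothetical stand-in that only compares normalized strings. A real system would replace it with an entailment-based check, as the paper does, and `semantic_entropy` here is not the authors' implementation.

```python
import math
from typing import Callable, List


def semantic_entropy(samples: List[str],
                     same_meaning: Callable[[str, str], bool]) -> float:
    """Entropy (in nats) over clusters of samples that share a meaning."""
    clusters: List[List[str]] = []
    for sample in samples:
        for cluster in clusters:
            # Compare against the cluster's first member as its representative.
            if same_meaning(cluster[0], sample):
                cluster.append(sample)
                break
        else:
            clusters.append([sample])
    total = len(samples)
    return sum(-(len(c) / total) * math.log(len(c) / total) for c in clusters)


def toy_same_meaning(a: str, b: str) -> bool:
    """Hypothetical stand-in for an entailment check: normalized text match."""
    def normalize(s: str) -> str:
        return s.lower().strip(" .")
    return normalize(a) == normalize(b)


consistent = ["Paris", "paris.", "Paris", "PARIS", "Paris."]
scattered = ["Paris", "Lyon", "Marseille", "Nice", "Toulouse"]

low = semantic_entropy(consistent, toy_same_meaning)   # one cluster
high = semantic_entropy(scattered, toy_same_meaning)   # five clusters
```

With one cluster the entropy is zero; with five singleton clusters it reaches its maximum for five samples, ln 5. A threshold between those extremes is a tuning decision for each application.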
Even when the right information is supplied, models use it unevenly. Liu et al., Lost in the Middle (2307.03172), showed that performance is highest when relevant facts sit at the very start or end of the context window and degrades when they are buried in the middle — even in models built for long contexts. The implication for agents is direct. Dumping an entire chat history or a pile of retrieved chunks into the prompt does not help; it dilutes the signal and raises the chance the model leans on its priors instead of your data. Less, better-placed context beats more context.
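One way to act on this finding is to control where passages land in the prompt. The sketch below is a heuristic, not something the paper prescribes: `order_for_edges` is a hypothetical helper that assumes the input is already sorted from most to least relevant, and deals passages alternately to the front and back so the weakest end up in the middle.

```python
from typing import List, Sequence


def order_for_edges(ranked: Sequence[str]) -> List[str]:
    """Place the best-ranked passages at the start and end of the context.

    `ranked` is assumed sorted from most to least relevant. Passages are dealt
    alternately to the front and the back, so the weakest land in the middle.
    """
    front: List[str] = []
    back: List[str] = []
    for i, passage in enumerate(ranked):
        (front if i % 2 == 0 else back).append(passage)
    return front + back[::-1]


ranked = ["best", "second", "third", "fourth", "fifth"]
ordered = order_for_edges(ranked)
# ordered == ['best', 'third', 'fifth', 'fourth', 'second']
```

Reordering only helps once the passage count is already small; it does not replace trimming the context.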
Hallucination is now measured directly. The clearest current benchmark is AA-Omniscience from Artificial Analysis (leaderboard, paper 2511.13029). It asks 6,000 questions across 42 topics in six domains, derived from authoritative academic and industry sources, and it is built specifically to remove the guessing incentive. The leaderboard updates as new models ship; the figures below are current as of June 2026.
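AA-Omniscience reports three numbers: accuracy, a hallucination rate, and an Index that subtracts points for wrong answers and gives zero for abstention.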
Two findings are worth carrying into design decisions.
First, hallucination remains the default on hard questions, even at the frontier. The top AA-Omniscience Index score is around 40 (a Claude Opus 4.8 reasoning configuration), with Gemini 3.1 Pro Preview near 33 and Claude Opus 4.8 around 27. Those are the leaders, and they sit far below the 100 ceiling; most of the field still clusters near or below zero, meaning those models get difficult questions wrong at least as often as right. No current model is close to factually reliable on this set.
Second, and more useful: higher accuracy does not mean lower hallucination. GPT-5.5 posts among the highest raw accuracy on the benchmark (roughly 56–57%), yet it trails Claude Opus 4.8 and Gemini 3.1 Pro on the reliability Index, because it answers more of the questions it does not know instead of abstaining. A model can know more and still be less reliable, because reliability is about what it does when it doesn't know.
A caution on the hallucination-rate metric on its own: a model that almost always abstains scores a near-perfect hallucination rate while being useless. A small open model on the same leaderboard records a ~1% hallucination rate precisely because it rarely attempts an answer. Read hallucination rate together with accuracy and the Index, never alone.
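The interplay of the three numbers is easier to see with made-up tallies. This sketch assumes the scoring described above (wrong answers subtract, abstentions score zero) and assumes the hallucination rate is the share of not-correct responses that were wrong answers rather than abstentions. The benchmark's exact formulas, including any partial-credit handling, are in its paper; the `Tally` class, the helper functions, and all counts below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Tally:
    correct: int
    incorrect: int
    abstained: int

    @property
    def total(self) -> int:
        return self.correct + self.incorrect + self.abstained


def accuracy(t: Tally) -> float:
    """Percentage of all questions answered correctly."""
    return 100.0 * t.correct / t.total


def hallucination_rate(t: Tally) -> float:
    """Share of not-correct responses that were wrong answers, not abstentions."""
    missed = t.incorrect + t.abstained
    return 100.0 * t.incorrect / missed if missed else 0.0


def index(t: Tally) -> float:
    """+1 per correct, -1 per incorrect, 0 per abstention, scaled to -100..100."""
    return 100.0 * (t.correct - t.incorrect) / t.total


# Hypothetical tallies per 100 questions, not real leaderboard data.
guesser = Tally(correct=56, incorrect=40, abstained=4)    # accuracy 56, index 16
cautious = Tally(correct=45, incorrect=15, abstained=40)  # accuracy 45, index 30
refuser = Tally(correct=2, incorrect=1, abstained=97)     # accuracy 2, index 1
```

The guesser has the highest accuracy but a lower index than the cautious model, and the refuser has the best hallucination rate (about 1%) while answering almost nothing, which is why the three numbers have to be read together.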
The takeaway for builders is that picking a “smarter” model does not buy you reliability on its own. You choose models for the calibration behavior your use case needs, and you build the controls below around whatever model you pick. For measuring this on your own application rather than a public benchmark, see the LLM evaluation framework.
No single method solves this. The systems that hold up in production stack several controls, each targeting a different cause from the section above. The order here is roughly the order of impact for most agents.
The first and the last of these — retrieval (§1) and agent memory (§5) — are the same idea in two forms: grounding, putting true, sourced facts in front of the model instead of trusting what is in its weights. They differ in what they ground against. RAG grounds an answer in a static document corpus, retrieved by similarity. Agent memory grounds the agent in evolving, sourced facts about the user and the business, tracked over time. Different data, different access pattern, same job. A real agent uses both, which is why they appear as separate controls rather than one. (For a deeper comparison, see agent memory vs RAG.)
The highest-leverage move is to stop asking the model to recall facts from its weights and instead put the facts in front of it. Retrieval-augmented generation (RAG) retrieves relevant documents at query time and conditions the answer on them. It directly attacks factuality hallucination: the model no longer has to know the answer, only to read it. This is grounding against a static corpus (documents, manuals, a knowledge base) that does not change per user. A recent application-oriented survey of RAG, reasoning, and agentic mitigation (2510.24476) maps where each grounding method helps and where it does not.
```python
# Retrieve first, then answer only from what was retrieved.
docs = retriever.search(query, k=5)
context = "\n\n".join(d.text for d in docs)

prompt = f"""Answer the question using ONLY the context below.
If the context does not contain the answer, reply exactly: "Not in the provided sources."

Context:
{context}

Question: {query}"""

answer = llm.generate(prompt)
```

Two cautions. RAG fixes factuality but can introduce faithfulness errors if the model strays from the retrieved text, so the instruction to answer only from context does real work here. And RAG quality is retrieval quality: when retrieval surfaces weak or off-topic passages, a meaningful share of answers stay partly ungrounded even with documents in context. Recent work on domain-grounded, tiered retrieval (2603.17872) shows the gains come from retrieving precisely and ranking hard, not from retrieving more. Keep the context tight, and prefer fewer high-relevance passages over many marginal ones (see Lost in the Middle above).
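Preferring fewer, stronger passages can be enforced in code before the prompt is built. This is a minimal sketch under assumptions: the `Passage` type, the `select_passages` helper, the 0.5 relevance floor, and the cap of three passages are all hypothetical, and the scores stand in for whatever a retriever or reranker returns.

```python
from typing import List, NamedTuple


class Passage(NamedTuple):
    text: str
    score: float  # relevance from the retriever or reranker, higher is better


def select_passages(candidates: List[Passage], min_score: float = 0.5,
                    max_passages: int = 3) -> List[Passage]:
    """Keep only passages above a relevance floor, best first, capped in number."""
    strong = [p for p in candidates if p.score >= min_score]
    strong.sort(key=lambda p: p.score, reverse=True)
    return strong[:max_passages]


# Made-up candidates and scores for illustration.
candidates = [
    Passage("Refunds are issued within 14 days.", 0.91),
    Passage("Our office is closed on public holidays.", 0.22),
    Passage("Refund requests need an order number.", 0.74),
    Passage("We were founded in 2012.", 0.18),
]
selected = select_passages(candidates)

# An empty selection is itself a signal: decline instead of answering ungrounded.
can_answer = len(selected) > 0
```

The threshold has to be calibrated per retriever, since similarity scores are not comparable across embedding models or rerankers.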
If models hallucinate because guessing scores well, the fix is to make abstention an acceptable, instructed outcome. This is the practical consequence of the Kalai et al. argument. Give the model explicit permission to decline, and grade your own evals so that a wrong answer costs more than an “I don't know.” Recent work frames the underlying decision as separating evidence-backed generation (output supported by retrieved passages, computation, or citations) from prior-only generation drawn from the model's weights, and gates the latter (2604.06195). Conformal-abstention methods (2405.01563) go further, giving statistical guarantees on the resulting hallucination rate.
```python
prompt = f"""Answer the question. You will be penalized for a wrong answer
more than for declining. If you are not confident the answer is correct,
respond exactly: "I'm not certain." Do not guess.

Question: {query}"""
```

You can make abstention quantitative rather than relying on the model's self-report. Following Farquhar et al., sample the answer several times and measure whether the samples agree in meaning; high disagreement is a signal to abstain or escalate.
```python
# Cheap semantic-entropy proxy: sample, then check agreement of meaning.
samples = [llm.generate(prompt, temperature=0.8) for _ in range(5)]
if not answers_agree_in_meaning(samples):  # e.g. NLI clustering of the samples
    answer = "I'm not certain; escalating or asking a clarifying question."
else:
    answer = majority_meaning(samples)
```

A model can check its own work if you make it a separate step. The pattern that holds up: draft an answer, generate independent verification questions about that draft, answer those questions on their own so they are not biased by the draft, then revise. For agents, 2026 work argues this should run as explicit verification gates at each step (planning, retrieval, reasoning, execution) rather than a single end-of-turn check, because an agent can act on a wrong intermediate conclusion long before it produces a final answer (2604.04269).
```python
draft = llm.generate(f"Answer: {query}")

checks = llm.generate(f"List factual claims in this answer that should be "
                      f"independently verified:\n{draft}")

# Answer each verification question in isolation, with no draft in context.
verifications = [llm.generate(f"Verify this claim, citing evidence: {c}")
                 for c in parse_list(checks)]

final = llm.generate(f"Revise the answer so it is consistent with these "
                     f"verifications. Remove unsupported claims.\n"
                     f"Draft: {draft}\nVerifications: {verifications}")
```

A lighter-weight relative is self-consistency: sample several reasoning paths and take the answer the majority converge on. It costs extra tokens but catches the arbitrary, run-to-run confabulations that semantic entropy also targets.
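Self-consistency reduces to a vote over the final answers. The sketch below assumes the final answer has already been extracted from each sampled reasoning path; `self_consistent_answer` is a hypothetical helper, and the `min_share` floor is an addition beyond plain majority voting, so that a weak plurality returns `None` and can be routed to abstention.

```python
from collections import Counter
from typing import List, Optional


def self_consistent_answer(final_answers: List[str],
                           min_share: float = 0.6) -> Optional[str]:
    """Return the majority answer, or None when no answer is dominant enough."""
    if not final_answers:
        return None
    normalized = [a.strip().lower() for a in final_answers]
    answer, votes = Counter(normalized).most_common(1)[0]
    return answer if votes / len(normalized) >= min_share else None


# Final answers extracted from five sampled reasoning paths (made-up data).
agreed = self_consistent_answer(["42", "42", " 42", "41", "42"])  # "42"
split = self_consistent_answer(["42", "41", "40", "42", "39"])    # None
```

Exact string matching only works for short, canonical answers such as numbers or labels; free-text answers need a meaning-level comparison instead.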
When the answer has a known shape — a schema, an enum, a set of valid IDs — constrain the model so it cannot invent values that do not exist. Structured outputs and constrained decoding restrict generation to the allowed grammar, which removes a whole class of hallucination by construction. The NAACL industry study Reducing hallucination in structured outputs via RAG (2024) shows the two techniques compound: ground the values, then constrain the format.
```python
schema = {
    "type": "object",
    "properties": {
        "status": {"type": "string", "enum": ["open", "closed", "pending"]},
        "order_id": {"type": "string"},
    },
    "required": ["status", "order_id"],
}
# The model cannot return a status outside the enum or omit a required field.
result = llm.generate(prompt, response_format={"type": "json_schema", "schema": schema})
```

The four techniques above fix a single turn. Agents run for many turns and many sessions, and that is where a different failure shows up: the agent forgets what the user told it, contradicts a decision from two sessions ago, or treats a fact that changed yesterday as still true. These are faithfulness hallucinations across time, and no per-turn check catches them, because the contradicting information is not in the prompt at all.
This is the context problem at the heart of most production agent failures, and recent surveys of hallucination in agentic systems (2510.24476) put cross-step state and memory among the hardest cases. The right fact existed; it just was not in front of the model at the right moment. Agent memory is the system that fixes it — persistent, retrievable knowledge of the user, the business, and the work, served into each turn as the relevant slice rather than the whole history.
This is grounding (§1), pointed at a different target. RAG grounds the model in a static document corpus; agent memory grounds it in the evolving, sourced state of the user and the business. The two are complementary, not alternatives: use RAG to answer from documents, and agent memory to keep the agent consistent with what it has learned over time. Where RAG removes a factuality gap on a single question, memory removes the faithfulness gap that opens up between sessions.
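Three properties of good agent memory map directly onto the hallucination causes: provenance gives every fact a source, temporality keeps changed facts from contradicting each other across sessions, and selective retrieval keeps the context window clean.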
A worked example: rather than replaying the conversation into the prompt, the agent writes new signals to memory and reads back only the relevant context.
```python
import json

# WRITE: persist signals as they arrive (a chat turn and a business event)
client.thread.add_messages(thread_id=thread_id, messages=[
    Message(role="user", content="Cancel my Pro plan, switching to a competitor."),
])
client.graph.add(user_id=user_id, type="json",
                 data=json.dumps({"event": "plan_cancel", "from": "pro"}))

# READ: assemble only the relevant, sourced context for this turn
context = client.thread.get_user_context(thread_id=thread_id).context
prompt = f"Use this verified context about the user:\n{context}\n\nQuestion: {query}"
```

The agent never sees the entire history. It sees the facts, with their sources and their current validity, that bear on the question in front of it. That is grounding applied to the agent's own state, and it removes the temporal contradictions that single-turn techniques cannot reach.
The strongest implementations build this on a temporal context graph: inputs are ingested, entities and facts are extracted with validity windows and provenance, and retrieval returns the relevant slice. At enterprise scale, agent memory is implemented as a Context Lake — a governed system of context graphs. This is what Zep provides: it manages, governs, and serves agent memory on temporal context graphs, so the context an agent reasons over is sourced, current, and scoped to the task. For a hands-on build, see how to give an AI agent long-term memory.
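To make validity windows and provenance concrete, here is a toy sketch. It is not Zep's data model or API: the `Fact` dataclass, the `facts_valid_at` helper, and the sample data are hypothetical, and a real temporal graph would also handle entity resolution, extraction, and relevance ranking.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class Fact:
    subject: str
    predicate: str
    value: str
    source: str                          # provenance: where the fact came from
    valid_from: datetime
    valid_to: Optional[datetime] = None  # None means the fact is still valid


def facts_valid_at(facts: List[Fact], subject: str, at: datetime) -> List[Fact]:
    """Return the subject's facts whose validity window contains `at`."""
    return [
        f for f in facts
        if f.subject == subject
        and f.valid_from <= at
        and (f.valid_to is None or at < f.valid_to)
    ]


# Made-up history: the Pro plan fact is closed out when the cancellation arrives.
cancelled_on = datetime(2026, 3, 10)
facts = [
    Fact("user_1", "plan", "pro", "billing event", datetime(2025, 6, 1), cancelled_on),
    Fact("user_1", "plan", "cancelled", "chat message", cancelled_on),
]

before = facts_valid_at(facts, "user_1", datetime(2026, 1, 15))  # the "pro" fact
after = facts_valid_at(facts, "user_1", datetime(2026, 4, 1))    # the "cancelled" fact
```

Closing the old fact's window instead of deleting it keeps the history queryable while ensuring the agent only reads the value that is true now.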
Match the control to the cause. A reliable agent uses several at once.
Both grounding controls — RAG and agent memory — sit at the top, because the largest share of hallucinations comes from the model answering with knowledge it does not have. They ground against different things.
| Cause | Symptom | Control |
|---|---|---|
| Recalling facts from model weights it doesn't reliably hold | Confident wrong facts | Grounding via RAG — static documents (§1) |
| Acting without the user's evolving, sourced state | Contradicts earlier turns; acts on outdated facts | Grounding via agent memory — provenance + temporality (§5) |
| Guessing incentive | Wrong answer instead of “I don't know” | Abstention + uncertainty checks (§2) |
| Arbitrary confabulation | Answer changes run to run | Self-consistency, semantic entropy (§2, §3) |
| Misreading provided context | Output contradicts its sources | Chain-of-Verification, tighter context (§3) |
| Invented field values | Out-of-schema or fake IDs | Structured outputs / constrained decoding (§4) |
Start with grounding and abstention — they remove the largest share of errors for the least effort. Add verification and constrained outputs where a wrong answer is expensive. For anything that runs across sessions, agent memory is the layer that keeps the grounding true over time. To know whether a change actually helped, measure it: see how to test agent memory and the LLM evaluation framework.
Related: What is agent memory? · Agent memory vs RAG · What is a temporal knowledge graph? · How to give an AI agent long-term memory · AI agent memory guides
This article references hallucination as a research topic. The techniques here reduce, but do not guarantee the elimination of, incorrect model outputs; validate critical outputs before relying on them.
LLMs hallucinate because they predict likely text, and the likely continuation is not always the true one. Two structural reasons make it persistent: pretraining leaves gaps on rare facts, and most training and evaluation reward a confident guess over an admission of uncertainty, so models learn to guess. See Kalai et al., Why Language Models Hallucinate.
Hallucinations cannot be eliminated, but they can be reduced substantially. Grounding, abstention, verification, and constrained outputs each cut a different class of error, and combining them is far more effective than any one alone. The goal is a system reliable enough for its use case, with abstention or escalation when confidence is low.
Factuality hallucinations contradict the real world; faithfulness hallucinations contradict the context or input the model was given. Retrieval fixes factuality; verification and tighter context fix faithfulness. See the 2025 survey of causes, detection, and mitigation (2510.06265).
RAG reduces factuality hallucinations by putting sources in the context, but it does not stop a model from misreading or straying from those sources. Instruct the model to answer only from the retrieved text, keep the context tight, and add verification for high-stakes answers.
To measure hallucination, use a benchmark that penalizes guessing, not just one that rewards correct answers. AA-Omniscience (paper) reports accuracy, a hallucination rate, and an index that subtracts points for wrong answers and gives zero for abstention, which is why high-accuracy models can still rank poorly. See the LLM evaluation framework for evaluating this on your own application.
Agent memory grounds the agent in sourced, current facts about the user and the business and serves only the relevant slice into each turn. Provenance gives every fact a source, temporality keeps changed facts from contradicting each other across sessions, and selective retrieval keeps the context window clean. See what is agent memory.
Agent memory and RAG are not the same thing, but both are grounding. RAG retrieves static documents by similarity to answer a question; agent memory tracks evolving, sourced facts about the user and the business over time, with provenance and validity windows. RAG closes a factuality gap on a single query; memory closes the faithfulness gap that opens across sessions. Most production agents use both: RAG for documents, memory for state.