hub Canonical reference

Why Language Models Hallucinate

Adam Tauman Kalai, Ofir Nachum, Santosh S. Vempala, Edwin Zhang · 2025 · cs.CL · arXiv 2509.04664

Canonical reference. 75% of citing Pith papers cite this work as background.

44 Pith papers citing it

Background 75% of classified citations

open full Pith review browse 44 citing papers arXiv PDF

abstract

Like students facing hard exam questions, large language models sometimes guess when uncertain, producing plausible yet incorrect statements instead of admitting uncertainty. Such "hallucinations" persist even in state-of-the-art systems and undermine trust. We argue that language models hallucinate because the training and evaluation procedures reward guessing over acknowledging uncertainty, and we analyze the statistical causes of hallucinations in the modern training pipeline. Hallucinations need not be mysterious -- they originate simply as errors in binary classification. If incorrect statements cannot be distinguished from facts, then hallucinations in pretrained language models will arise through natural statistical pressures. We then argue that hallucinations persist due to the way most evaluations are graded -- language models are optimized to be good test-takers, and guessing when uncertain improves test performance. This "epidemic" of penalizing uncertain responses can only be addressed through a socio-technical mitigation: modifying the scoring of existing benchmarks that are misaligned but dominate leaderboards, rather than introducing additional hallucination evaluations. This change may steer the field toward more trustworthy AI systems.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 11 dataset 1

citation-polarity summary

background 9 support 2 use dataset 1

representative citing papers

InfiniteScienceGym: An Unbounded, Procedurally-Generated Benchmark for Scientific Analysis

cs.CL · 2026-04-14 · unverdicted · novelty 8.0

InfiniteScienceGym procedurally generates unbounded scientific repositories with exact ground-truth QA pairs to benchmark LLMs on data reasoning, abstention, and tool use without static datasets.

Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark

cs.AI · 2025-09-30 · unverdicted · novelty 8.0

CritPt benchmark shows state-of-the-art LLMs reach only 5.7% average accuracy on full-scale unpublished physics research tasks, rising to about 10% with coding tools.

ReLay: Personalized LLM-Generated Plain-Language Summaries for Better Understanding, but at What Cost?

cs.CL · 2026-05-01 · unverdicted · novelty 7.0

Personalized LLM-generated plain language summaries improve lay readers' comprehension and quality ratings but increase risks of reinforcing biases and introducing hallucinations compared to static expert summaries.

Uncertainty Propagation in LLM-Based Systems

cs.SE · 2026-04-26 · unverdicted · novelty 7.0

This paper introduces a systems-level conceptual framing and a three-level taxonomy (intra-model, system-level, socio-technical) for uncertainty propagation in compound LLM applications, along with engineering insights and open challenges.

BAS: A Decision-Theoretic Approach to Evaluating Large Language Model Confidence

cs.CL · 2026-04-03 · unverdicted · novelty 7.0

BAS aggregates utility from an answer-or-abstain model across risk thresholds and is uniquely maximized by truthful confidence estimates.

SVR-MAD: A Bayesian-Inspired Framework for Posterior-Guided Multi-Agent Debate

cs.MA · 2026-05-21 · unverdicted · novelty 6.0

SVR-MAD treats pre-debate signals as priors and debate results as evidence to build a sparser communication graph, cutting token use by up to 61% while preserving or raising accuracy over prior MAD methods.

Predictable Confabulations: Factual Recall by LLMs Scales with Model Size and Topic Frequency

cs.CL · 2026-05-18 · unverdicted · novelty 6.0

Factual recall quality in LLMs follows a sigmoid scaling law in the log-linear combination of model parameter count and topic frequency in training data, explaining 60% of variance across models and up to 94% within families.

Measuring Google AI Overviews: Activation, Source Quality, Claim Fidelity, and Publisher Impact

cs.CY · 2026-05-13 · unverdicted · novelty 6.0

Google AI Overviews activate on 13.7% of queries overall and 64.7% of questions, cite more credible sources than standard results but omit key information in 11% of claims, and suppress clicks on over half of cited pages that carry ads.

Scalable Token-Level Hallucination Detection in Large Language Models

cs.CL · 2026-05-12 · unverdicted · novelty 6.0

TokenHD uses a scalable data synthesis engine and importance-weighted training to create token-level hallucination detectors that work on free-form text and scale from 0.6B to 8B parameters, outperforming larger reasoning models.

Low-Cost Black-Box Detection of LLM Hallucinations via Dynamical System Prediction

cs.LG · 2026-05-06 · unverdicted · novelty 6.0

A single-pass black-box method models LLM outputs as dynamical systems via Koopman operators to detect hallucinations with claimed state-of-the-art accuracy and lower cost.

Agentic Repository Mining: A Multi-Task Evaluation

cs.SE · 2026-05-06 · unverdicted · novelty 6.0

LLM agents dynamically exploring repositories via bash commands achieve competitive accuracy to context-provided LLMs across four classification tasks, with superior robustness to artifact size.

LLM Ghostbusters: Surgical Hallucination Suppression via Adaptive Unlearning

cs.CR · 2026-05-01 · unverdicted · novelty 6.0

Adaptive Unlearning suppresses package hallucinations in code-generating LLMs by 81% while preserving benchmark performance, using model-generated data and no human labels.

CLEAR: Revealing How Noise and Ambiguity Degrade Reliability in LLMs for Medicine

cs.CL · 2026-05-01 · unverdicted · novelty 6.0 · 2 refs

CLEAR reveals that LLMs' accuracy on medical questions drops and their 'humility deficit' grows as the number of plausible answers increases and abstention options shift from assertive to uncertain phrasing.

From Unstructured Recall to Schema-Grounded Memory: Reliable AI Memory via Iterative, Schema-Aware Extraction

cs.AI · 2026-04-30 · unverdicted · novelty 6.0

Schema-aware iterative extraction turns AI memory into a verified system of record, reaching 90-97% accuracy on extraction and end-to-end memory benchmarks where retrieval baselines score 80-87%.

SIEVES: Selective Prediction Generalizes through Visual Evidence Scoring

cs.CV · 2026-04-28 · conditional · novelty 6.0 · 2 refs

SIEVES improves selective prediction coverage by up to 3x on OOD VQA benchmarks by training a selector to score the quality of visual evidence produced by reasoner models, generalizing across benchmarks and proprietary models without internal access or per-task retraining.

Process Supervision of Confidence Margin for Calibrated LLM Reasoning

cs.LG · 2026-04-25 · unverdicted · novelty 6.0

RLCM trains LLMs with a margin-enhanced process reward that widens the gap between correct and incorrect reasoning steps, improving calibration on math, code, logic, and science tasks without hurting accuracy.

Adaptive Test-Time Compute Allocation with Evolving In-Context Demonstrations

cs.AI · 2026-04-22 · unverdicted · novelty 6.0

An adaptive test-time framework uses a warm-up phase on the test set to build evolving in-context examples, then concentrates compute on unresolved queries to outperform static baselines on math, coding, and reasoning tasks with lower total inference cost.

From Subsumption to Satisfiability: LLM-Assisted Active Learning for OWL Ontologies

cs.AI · 2026-04-17 · unverdicted · novelty 6.0

LLM-assisted active learning reformulates OWL subsumption checks as satisfiability queries, queries models for counter-concept examples, and ensures errors are only Type II delays rather than inconsistencies.

Calibration-Aware Policy Optimization for Reasoning LLMs

cs.LG · 2026-04-14 · unverdicted · novelty 6.0

CAPO improves LLM calibration by up to 15% while matching or exceeding GRPO accuracy through logistic AUC loss and noise masking, enabling better abstention and scaling performance.

A Two-Stage LLM Framework for Accessible and Verified XAI Explanations

cs.AI · 2026-04-14 · unverdicted · novelty 6.0

A two-stage LLM explainer-verifier framework with iterative refeed improves faithfulness and accessibility of XAI explanations, as shown in experiments across five techniques and three LLM families, with EPR analysis indicating progressive stabilization.

STEAR: Layer-Aware Spatiotemporal Evidence Intervention for Hallucination Mitigation in Video Large Language Models

cs.CV · 2026-04-03 · unverdicted · novelty 6.0

STEAR reduces spatial and temporal hallucinations in Video-LLMs via layer-aware evidence intervention from middle decoder layers in a single-encode pass.

Ensemble-Based Uncertainty Estimation for Code Correctness Estimation

cs.SE · 2026-03-28 · unverdicted · novelty 6.0

Ensemble Semantic Entropy improves correlation with code correctness over single-model methods and powers a cascading scaling system that cuts FLOPs by 64.9% while preserving performance on LiveCodeBench.

Causal Evidence that Language Models use Confidence to Drive Behavior

cs.LG · 2026-03-23 · unverdicted · novelty 6.0

Language models deploy multidimensional internal confidence representations and threshold-based policies to control abstention behavior, with causal support from activation steering experiments.

ERA: Evidence-based Reliability Alignment for Honest Retrieval-Augmented Generation

cs.IR · 2026-02-24 · unverdicted · novelty 6.0

ERA models internal and external knowledge as independent Dirichlet belief masses and uses Dempster-Shafer Theory to quantify conflicts, enabling better abstention decisions in RAG systems.

citing papers explorer

Showing 9 of 9 citing papers after filters.

Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark cs.AI · 2025-09-30 · unverdicted · none · ref 67 · internal anchor
CritPt benchmark shows state-of-the-art LLMs reach only 5.7% average accuracy on full-scale unpublished physics research tasks, rising to about 10% with coding tools.
From Unstructured Recall to Schema-Grounded Memory: Reliable AI Memory via Iterative, Schema-Aware Extraction cs.AI · 2026-04-30 · unverdicted · none · ref 17 · internal anchor
Schema-aware iterative extraction turns AI memory into a verified system of record, reaching 90-97% accuracy on extraction and end-to-end memory benchmarks where retrieval baselines score 80-87%.
Adaptive Test-Time Compute Allocation with Evolving In-Context Demonstrations cs.AI · 2026-04-22 · unverdicted · none · ref 15 · internal anchor
An adaptive test-time framework uses a warm-up phase on the test set to build evolving in-context examples, then concentrates compute on unresolved queries to outperform static baselines on math, coding, and reasoning tasks with lower total inference cost.
From Subsumption to Satisfiability: LLM-Assisted Active Learning for OWL Ontologies cs.AI · 2026-04-17 · unverdicted · none · ref 4 · internal anchor
LLM-assisted active learning reformulates OWL subsumption checks as satisfiability queries, queries models for counter-concept examples, and ensures errors are only Type II delays rather than inconsistencies.
A Two-Stage LLM Framework for Accessible and Verified XAI Explanations cs.AI · 2026-04-14 · unverdicted · none · ref 11 · internal anchor
A two-stage LLM explainer-verifier framework with iterative refeed improves faithfulness and accessibility of XAI explanations, as shown in experiments across five techniques and three LLM families, with EPR analysis indicating progressive stabilization.
Revis: Sparse Latent Steering to Mitigate Object Hallucination in Large Vision-Language Models cs.AI · 2026-02-12 · unverdicted · none · ref 2 · internal anchor
REVIS reduces object hallucination in large vision-language models by about 19% via sparse orthogonal projection in latent space at suppression depths while keeping reasoning intact.
HalluClear: Diagnosing, Evaluating and Mitigating Hallucinations in GUI Agents cs.AI · 2026-04-19 · unverdicted · none · ref 25 · internal anchor
HalluClear supplies a taxonomy, calibrated evaluation, and lightweight post-training mitigation that reduces hallucinations in GUI agents using only 9K samples.
Beyond Literal Summarization: Redefining Hallucination for Medical SOAP Note Evaluation cs.AI · 2026-04-16 · unverdicted · none · ref 2 · internal anchor
Redefining hallucination evaluation for medical SOAP notes to credit clinical reasoning reduces reported hallucination rates from 35% to 9%.
Robust AI Security and Alignment: A Sisyphean Endeavor? cs.AI · 2025-12-10 · unverdicted · none · ref 5 · internal anchor
AI security and alignment cannot achieve full robustness because any sufficiently powerful AI inherits incompleteness-style limitations from formal systems.

Why Language Models Hallucinate

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer