hub

Why Language Models Hallucinate

Adam Tauman Kalai, Ofir Nachum, Santosh S. Vempala, Edwin Zhang · 2025 · cs.CL · arXiv 2509.04664

24 Pith papers cite this work. Polarity classification is still indexing.

24 Pith papers citing it

open full Pith review browse 24 citing papers arXiv PDF

abstract

Like students facing hard exam questions, large language models sometimes guess when uncertain, producing plausible yet incorrect statements instead of admitting uncertainty. Such "hallucinations" persist even in state-of-the-art systems and undermine trust. We argue that language models hallucinate because the training and evaluation procedures reward guessing over acknowledging uncertainty, and we analyze the statistical causes of hallucinations in the modern training pipeline. Hallucinations need not be mysterious -- they originate simply as errors in binary classification. If incorrect statements cannot be distinguished from facts, then hallucinations in pretrained language models will arise through natural statistical pressures. We then argue that hallucinations persist due to the way most evaluations are graded -- language models are optimized to be good test-takers, and guessing when uncertain improves test performance. This "epidemic" of penalizing uncertain responses can only be addressed through a socio-technical mitigation: modifying the scoring of existing benchmarks that are misaligned but dominate leaderboards, rather than introducing additional hallucination evaluations. This change may steer the field toward more trustworthy AI systems.

hub tools

JSON dossier citing papers JSON arXiv source

representative citing papers

InfiniteScienceGym: An Unbounded, Procedurally-Generated Benchmark for Scientific Analysis

cs.CL · 2026-04-14 · unverdicted · novelty 8.0

InfiniteScienceGym procedurally generates unbounded scientific repositories with exact ground-truth QA pairs to benchmark LLMs on data reasoning, abstention, and tool use without static datasets.

ReLay: Personalized LLM-Generated Plain-Language Summaries for Better Understanding, but at What Cost?

cs.CL · 2026-05-01 · unverdicted · novelty 7.0

Personalized LLM-generated plain language summaries improve lay readers' comprehension and quality ratings but increase risks of reinforcing biases and introducing hallucinations compared to static expert summaries.

Uncertainty Propagation in LLM-Based Systems

cs.SE · 2026-04-26 · unverdicted · novelty 7.0

This paper introduces a systems-level conceptual framing and a three-level taxonomy (intra-model, system-level, socio-technical) for uncertainty propagation in compound LLM applications, along with engineering insights and open challenges.

BAS: A Decision-Theoretic Approach to Evaluating Large Language Model Confidence

cs.CL · 2026-04-03 · unverdicted · novelty 7.0

BAS aggregates utility from an answer-or-abstain model across risk thresholds and is uniquely maximized by truthful confidence estimates.

Scalable Token-Level Hallucination Detection in Large Language Models

cs.CL · 2026-05-12 · unverdicted · novelty 6.0

TokenHD uses a scalable data synthesis engine and importance-weighted training to create token-level hallucination detectors that work on free-form text and scale from 0.6B to 8B parameters, outperforming larger reasoning models.

Low-Cost Black-Box Detection of LLM Hallucinations via Dynamical System Prediction

cs.LG · 2026-05-06 · unverdicted · novelty 6.0

A single-pass black-box method models LLM outputs as dynamical systems via Koopman operators to detect hallucinations with claimed state-of-the-art accuracy and lower cost.

Agentic Repository Mining: A Multi-Task Evaluation

cs.SE · 2026-05-06 · unverdicted · novelty 6.0

LLM agents dynamically exploring repositories via bash commands achieve competitive accuracy to context-provided LLMs across four classification tasks, with superior robustness to artifact size.

LLM Ghostbusters: Surgical Hallucination Suppression via Adaptive Unlearning

cs.CR · 2026-05-01 · unverdicted · novelty 6.0

Adaptive Unlearning suppresses package hallucinations in code-generating LLMs by 81% while preserving benchmark performance, using model-generated data and no human labels.

CLEAR: Revealing How Noise and Ambiguity Degrade Reliability in LLMs for Medicine

cs.CL · 2026-05-01 · unverdicted · novelty 6.0 · 2 refs

CLEAR reveals that LLMs' accuracy on medical questions drops and their 'humility deficit' grows as the number of plausible answers increases and abstention options shift from assertive to uncertain phrasing.

From Unstructured Recall to Schema-Grounded Memory: Reliable AI Memory via Iterative, Schema-Aware Extraction

cs.AI · 2026-04-30 · unverdicted · novelty 6.0

Schema-aware iterative extraction turns AI memory into a verified system of record, reaching 90-97% accuracy on extraction and end-to-end memory benchmarks where retrieval baselines score 80-87%.

SIEVES: Selective Prediction Generalizes through Visual Evidence Scoring

cs.CV · 2026-04-28 · unverdicted · novelty 6.0

SIEVES improves selective prediction coverage up to 3x on OOD VQA benchmarks by training a selector on visual localization quality, generalizing across datasets and proprietary reasoners without specific adaptation.

Process Supervision of Confidence Margin for Calibrated LLM Reasoning

cs.LG · 2026-04-25 · unverdicted · novelty 6.0

RLCM trains LLMs with a margin-enhanced process reward that widens the gap between correct and incorrect reasoning steps, improving calibration on math, code, logic, and science tasks without hurting accuracy.

Adaptive Test-Time Compute Allocation with Evolving In-Context Demonstrations

cs.AI · 2026-04-22 · unverdicted · novelty 6.0

An adaptive test-time framework uses a warm-up phase on the test set to build evolving in-context examples, then concentrates compute on unresolved queries to outperform static baselines on math, coding, and reasoning tasks with lower total inference cost.

From Subsumption to Satisfiability: LLM-Assisted Active Learning for OWL Ontologies

cs.AI · 2026-04-17 · unverdicted · novelty 6.0

LLM-assisted active learning reformulates OWL subsumption checks as satisfiability queries, queries models for counter-concept examples, and ensures errors are only Type II delays rather than inconsistencies.

Calibration-Aware Policy Optimization for Reasoning LLMs

cs.LG · 2026-04-14 · unverdicted · novelty 6.0

CAPO improves LLM calibration by up to 15% while matching or exceeding GRPO accuracy through logistic AUC loss and noise masking, enabling better abstention and scaling performance.

A Two-Stage LLM Framework for Accessible and Verified XAI Explanations

cs.AI · 2026-04-14 · unverdicted · novelty 6.0

A two-stage LLM explainer-verifier framework with iterative refeed improves faithfulness and accessibility of XAI explanations, as shown in experiments across five techniques and three LLM families, with EPR analysis indicating progressive stabilization.

STEAR: Layer-Aware Spatiotemporal Evidence Intervention for Hallucination Mitigation in Video Large Language Models

cs.CV · 2026-04-03 · unverdicted · novelty 6.0

STEAR reduces spatial and temporal hallucinations in Video-LLMs via layer-aware evidence intervention from middle decoder layers in a single-encode pass.

Benchmarking LLM-Based Static Analysis for Secure Smart Contract Development: Reliability, Limitations, and Potential Hybrid Solutions

cs.CR · 2026-05-11 · unverdicted · novelty 5.0

LLMs for smart contract security analysis show lexical bias from identifier names causing high false positives, with prompting creating precision-recall trade-offs, positioning them as complements rather than replacements for static analysis tools.

HalluClear: Diagnosing, Evaluating and Mitigating Hallucinations in GUI Agents

cs.AI · 2026-04-19 · unverdicted · novelty 5.0

HalluClear supplies a taxonomy, calibrated evaluation, and lightweight post-training mitigation that reduces hallucinations in GUI agents using only 9K samples.

Beyond Literal Summarization: Redefining Hallucination for Medical SOAP Note Evaluation

cs.AI · 2026-04-16 · unverdicted · novelty 5.0

Redefining hallucination evaluation for medical SOAP notes to credit clinical reasoning reduces reported hallucination rates from 35% to 9%.

Ceci n'est pas une explication: Evaluating Explanation Failures as Explainability Pitfalls in Language Learning Systems

cs.HC · 2026-04-28 · unverdicted · novelty 4.0

AI explanations in language learning often fail across six dimensions like diagnostic accuracy and self-regulation support, creating hidden risks that demand better evaluation frameworks such as L2-Bench.

The Crutch or the Ceiling? How Different Generations of LLMs Shape EFL Student Writings

cs.HC · 2026-04-16 · unverdicted · novelty 4.0

Advanced LLMs improve EFL writing scores and diversity for lower-proficiency students but correlate with lower expert ratings on deep coherence, acting more as crutches than scaffolds.

When Should a Language Model Trust Itself? Same-Model Self-Verification as a Conditional Confidence Signal

cs.CL · 2026-04-08 · unverdicted · novelty 4.0

Self-verification acts as a conditional confidence signal for language models rather than a reliable general-purpose uncertainty estimator.

EnsemHalDet: Robust VLM Hallucination Detection via Ensemble of Internal State Detectors

cs.CV · 2026-04-03 · unverdicted · novelty 4.0

EnsemHalDet improves hallucination detection in VLMs by ensembling independent detectors on diverse internal states, yielding higher AUC than single-detector baselines on VQA datasets.

citing papers explorer

Showing 24 of 24 citing papers.

InfiniteScienceGym: An Unbounded, Procedurally-Generated Benchmark for Scientific Analysis cs.CL · 2026-04-14 · unverdicted · none · ref 13 · internal anchor
InfiniteScienceGym procedurally generates unbounded scientific repositories with exact ground-truth QA pairs to benchmark LLMs on data reasoning, abstention, and tool use without static datasets.
ReLay: Personalized LLM-Generated Plain-Language Summaries for Better Understanding, but at What Cost? cs.CL · 2026-05-01 · unverdicted · none · ref 13 · internal anchor
Personalized LLM-generated plain language summaries improve lay readers' comprehension and quality ratings but increase risks of reinforcing biases and introducing hallucinations compared to static expert summaries.
Uncertainty Propagation in LLM-Based Systems cs.SE · 2026-04-26 · unverdicted · none · ref 16 · internal anchor
This paper introduces a systems-level conceptual framing and a three-level taxonomy (intra-model, system-level, socio-technical) for uncertainty propagation in compound LLM applications, along with engineering insights and open challenges.
BAS: A Decision-Theoretic Approach to Evaluating Large Language Model Confidence cs.CL · 2026-04-03 · unverdicted · none · ref 27 · internal anchor
BAS aggregates utility from an answer-or-abstain model across risk thresholds and is uniquely maximized by truthful confidence estimates.
Scalable Token-Level Hallucination Detection in Large Language Models cs.CL · 2026-05-12 · unverdicted · none · ref 8 · internal anchor
TokenHD uses a scalable data synthesis engine and importance-weighted training to create token-level hallucination detectors that work on free-form text and scale from 0.6B to 8B parameters, outperforming larger reasoning models.
Low-Cost Black-Box Detection of LLM Hallucinations via Dynamical System Prediction cs.LG · 2026-05-06 · unverdicted · none · ref 9 · internal anchor
A single-pass black-box method models LLM outputs as dynamical systems via Koopman operators to detect hallucinations with claimed state-of-the-art accuracy and lower cost.
Agentic Repository Mining: A Multi-Task Evaluation cs.SE · 2026-05-06 · unverdicted · none · ref 20 · internal anchor
LLM agents dynamically exploring repositories via bash commands achieve competitive accuracy to context-provided LLMs across four classification tasks, with superior robustness to artifact size.
LLM Ghostbusters: Surgical Hallucination Suppression via Adaptive Unlearning cs.CR · 2026-05-01 · unverdicted · none · ref 22 · internal anchor
Adaptive Unlearning suppresses package hallucinations in code-generating LLMs by 81% while preserving benchmark performance, using model-generated data and no human labels.
CLEAR: Revealing How Noise and Ambiguity Degrade Reliability in LLMs for Medicine cs.CL · 2026-05-01 · unverdicted · none · ref 30 · 2 links · internal anchor
CLEAR reveals that LLMs' accuracy on medical questions drops and their 'humility deficit' grows as the number of plausible answers increases and abstention options shift from assertive to uncertain phrasing.
From Unstructured Recall to Schema-Grounded Memory: Reliable AI Memory via Iterative, Schema-Aware Extraction cs.AI · 2026-04-30 · unverdicted · none · ref 17 · internal anchor
Schema-aware iterative extraction turns AI memory into a verified system of record, reaching 90-97% accuracy on extraction and end-to-end memory benchmarks where retrieval baselines score 80-87%.
SIEVES: Selective Prediction Generalizes through Visual Evidence Scoring cs.CV · 2026-04-28 · unverdicted · none · ref 25 · internal anchor
SIEVES improves selective prediction coverage up to 3x on OOD VQA benchmarks by training a selector on visual localization quality, generalizing across datasets and proprietary reasoners without specific adaptation.
Process Supervision of Confidence Margin for Calibrated LLM Reasoning cs.LG · 2026-04-25 · unverdicted · none · ref 32 · internal anchor
RLCM trains LLMs with a margin-enhanced process reward that widens the gap between correct and incorrect reasoning steps, improving calibration on math, code, logic, and science tasks without hurting accuracy.
Adaptive Test-Time Compute Allocation with Evolving In-Context Demonstrations cs.AI · 2026-04-22 · unverdicted · none · ref 15 · internal anchor
An adaptive test-time framework uses a warm-up phase on the test set to build evolving in-context examples, then concentrates compute on unresolved queries to outperform static baselines on math, coding, and reasoning tasks with lower total inference cost.
From Subsumption to Satisfiability: LLM-Assisted Active Learning for OWL Ontologies cs.AI · 2026-04-17 · unverdicted · none · ref 4 · internal anchor
LLM-assisted active learning reformulates OWL subsumption checks as satisfiability queries, queries models for counter-concept examples, and ensures errors are only Type II delays rather than inconsistencies.
Calibration-Aware Policy Optimization for Reasoning LLMs cs.LG · 2026-04-14 · unverdicted · none · ref 14 · internal anchor
CAPO improves LLM calibration by up to 15% while matching or exceeding GRPO accuracy through logistic AUC loss and noise masking, enabling better abstention and scaling performance.
A Two-Stage LLM Framework for Accessible and Verified XAI Explanations cs.AI · 2026-04-14 · unverdicted · none · ref 11 · internal anchor
A two-stage LLM explainer-verifier framework with iterative refeed improves faithfulness and accessibility of XAI explanations, as shown in experiments across five techniques and three LLM families, with EPR analysis indicating progressive stabilization.
STEAR: Layer-Aware Spatiotemporal Evidence Intervention for Hallucination Mitigation in Video Large Language Models cs.CV · 2026-04-03 · unverdicted · none · ref 19 · internal anchor
STEAR reduces spatial and temporal hallucinations in Video-LLMs via layer-aware evidence intervention from middle decoder layers in a single-encode pass.
Benchmarking LLM-Based Static Analysis for Secure Smart Contract Development: Reliability, Limitations, and Potential Hybrid Solutions cs.CR · 2026-05-11 · unverdicted · none · ref 38 · internal anchor
LLMs for smart contract security analysis show lexical bias from identifier names causing high false positives, with prompting creating precision-recall trade-offs, positioning them as complements rather than replacements for static analysis tools.
HalluClear: Diagnosing, Evaluating and Mitigating Hallucinations in GUI Agents cs.AI · 2026-04-19 · unverdicted · none · ref 25 · internal anchor
HalluClear supplies a taxonomy, calibrated evaluation, and lightweight post-training mitigation that reduces hallucinations in GUI agents using only 9K samples.
Beyond Literal Summarization: Redefining Hallucination for Medical SOAP Note Evaluation cs.AI · 2026-04-16 · unverdicted · none · ref 2 · internal anchor
Redefining hallucination evaluation for medical SOAP notes to credit clinical reasoning reduces reported hallucination rates from 35% to 9%.
Ceci n'est pas une explication: Evaluating Explanation Failures as Explainability Pitfalls in Language Learning Systems cs.HC · 2026-04-28 · unverdicted · none · ref 20 · internal anchor
AI explanations in language learning often fail across six dimensions like diagnostic accuracy and self-regulation support, creating hidden risks that demand better evaluation frameworks such as L2-Bench.
The Crutch or the Ceiling? How Different Generations of LLMs Shape EFL Student Writings cs.HC · 2026-04-16 · unverdicted · none · ref 34 · internal anchor
Advanced LLMs improve EFL writing scores and diversity for lower-proficiency students but correlate with lower expert ratings on deep coherence, acting more as crutches than scaffolds.
When Should a Language Model Trust Itself? Same-Model Self-Verification as a Conditional Confidence Signal cs.CL · 2026-04-08 · unverdicted · none · ref 4 · internal anchor
Self-verification acts as a conditional confidence signal for language models rather than a reliable general-purpose uncertainty estimator.
EnsemHalDet: Robust VLM Hallucination Detection via Ensemble of Internal State Detectors cs.CV · 2026-04-03 · unverdicted · none · ref 1 · internal anchor
EnsemHalDet improves hallucination detection in VLMs by ensembling independent detectors on diverse internal states, yielding higher AUC than single-detector baselines on VQA datasets.

Why Language Models Hallucinate

hub tools

fields

years

verdicts

representative citing papers

citing papers explorer