Why Language Models Hallucinate
Canonical reference. 77% of citing Pith papers cite this work as background.
abstract
Like students facing hard exam questions, large language models sometimes guess when uncertain, producing plausible yet incorrect statements instead of admitting uncertainty. Such "hallucinations" persist even in state-of-the-art systems and undermine trust. We argue that language models hallucinate because the training and evaluation procedures reward guessing over acknowledging uncertainty, and we analyze the statistical causes of hallucinations in the modern training pipeline. Hallucinations need not be mysterious -- they originate simply as errors in binary classification. If incorrect statements cannot be distinguished from facts, then hallucinations in pretrained language models will arise through natural statistical pressures. We then argue that hallucinations persist due to the way most evaluations are graded -- language models are optimized to be good test-takers, and guessing when uncertain improves test performance. This "epidemic" of penalizing uncertain responses can only be addressed through a socio-technical mitigation: modifying the scoring of existing benchmarks that are misaligned but dominate leaderboards, rather than introducing additional hallucination evaluations. This change may steer the field toward more trustworthy AI systems.
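The grading argument in the abstract can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from the paper: the function names and the `wrong_penalty` parameter are hypothetical, and it assumes a simple scheme where a correct answer earns 1 point, abstaining earns 0, and a wrong answer loses `wrong_penalty` points. With `wrong_penalty = 0` (plain binary grading) guessing never scores worse than abstaining in expectation, which is the incentive the abstract describes.

```python
def expected_score(p_correct: float, wrong_penalty: float = 0.0) -> float:
    """Expected score for answering when the answer is right with probability p_correct.

    Assumed scheme (hypothetical): +1 for a correct answer,
    -wrong_penalty for a wrong one.
    """
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


def should_guess(p_correct: float, wrong_penalty: float = 0.0,
                 abstain_score: float = 0.0) -> bool:
    """True when answering beats abstaining in expectation."""
    return expected_score(p_correct, wrong_penalty) > abstain_score


def break_even_confidence(wrong_penalty: float) -> float:
    """Confidence at which guessing and abstaining (score 0) tie.

    Solves p - (1 - p) * wrong_penalty = 0 for p.
    """
    return wrong_penalty / (1.0 + wrong_penalty)


if __name__ == "__main__":
    # Compare binary grading with two penalized schemes at several confidence levels.
    for penalty in (0.0, 1.0, 3.0):
        threshold = break_even_confidence(penalty)
        print(f"wrong_penalty={penalty}: break-even confidence = {threshold:.2f}")
        for p in (0.1, 0.5, 0.9):
            print(f"  p={p}: expected={expected_score(p, penalty):+.2f}, "
                  f"guess={should_guess(p, penalty)}")
```

Under these assumptions, any nonzero penalty creates a confidence threshold below which abstaining is the better choice, whereas binary grading sets that threshold to zero.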
citing papers explorer
-
InfiniteScienceGym: An Unbounded, Procedurally-Generated Benchmark for Scientific Analysis
InfiniteScienceGym procedurally generates unbounded scientific repositories with exact ground-truth QA pairs to benchmark LLMs on data reasoning, abstention, and tool use without static datasets.
-
Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark
CritPt benchmark shows state-of-the-art LLMs reach only 5.7% average accuracy on full-scale unpublished physics research tasks, rising to about 10% with coding tools.
-
LibEvoBench: Probing Temporal Knowledge Stratification in Code Generation Models
The LibEvoBench benchmark shows that LLMs are version-oblivious on evolving APIs: supplying documentation helps, but specifying the version does not.
-
What Makes LVLMs Hallucinate Less? Unveiling the Architectural Factors Behind Hallucination Robustness
The study links three LVLM architectural dimensions to three hallucination types via a new benchmark, finding that language foundation quality reduces co-occurrence errors, visual encoder strength reduces similarity errors, alignment reduces uncertainty errors, and joint visual-alignment improvement
-
ReLay: Personalized LLM-Generated Plain-Language Summaries for Better Understanding, but at What Cost?
Personalized LLM-generated plain language summaries improve lay readers' comprehension and quality ratings but increase risks of reinforcing biases and introducing hallucinations compared to static expert summaries.
-
Uncertainty Propagation in LLM-Based Systems
This paper introduces a systems-level conceptual framing and a three-level taxonomy (intra-model, system-level, socio-technical) for uncertainty propagation in compound LLM applications, along with engineering insights and open challenges.
-
BAS: A Decision-Theoretic Approach to Evaluating Large Language Model Confidence
BAS aggregates utility from an answer-or-abstain model across risk thresholds and is uniquely maximized by truthful confidence estimates.
-
Humanizing Automatically Generated Unit Test Suites with LLM-Based Refactoring
TestHumanizer uses LLMs as controlled refactoring layers on EvoSuite suites to boost readability and maintainability, achieving 88-98% compilation rates and developer preference gains on 350 classes from Defects4J and SF110.
-
Grad Detect: Gradient-Based Hallucination Detection in LLMs
Grad Detect uses internal gradient patterns from one inference pass to predict LLM hallucinations and abstention, outperforming confidence and sampling baselines on Q&A benchmarks with most signal in the final five layers.
-
SVR-MAD: A Bayesian-Inspired Framework for Posterior-Guided Multi-Agent Debate
SVR-MAD treats pre-debate signals as priors and debate results as evidence to build a sparser communication graph, cutting token use by up to 61% while preserving or raising accuracy over prior MAD methods.
-
Predictable Confabulations: Factual Recall by LLMs Scales with Model Size and Topic Frequency
Factual recall quality in LLMs follows a sigmoid scaling law in the log-linear combination of model parameter count and topic frequency in training data, explaining 60% of variance across models and up to 94% within families.
-
Measuring Google AI Overviews: Activation, Source Quality, Claim Fidelity, and Publisher Impact
Google AI Overviews activate on 13.7% of queries overall and 64.7% of questions, cite more credible sources than standard results but omit key information in 11% of claims, and suppress clicks on over half of cited pages that carry ads.
-
Scalable Token-Level Hallucination Detection in Large Language Models
TokenHD uses a scalable data synthesis engine and importance-weighted training to create token-level hallucination detectors that work on free-form text and scale from 0.6B to 8B parameters, outperforming larger reasoning models.
-
Low-Cost Black-Box Detection of LLM Hallucinations via Dynamical System Prediction
A single-pass black-box method models LLM outputs as dynamical systems via Koopman operators to detect hallucinations with claimed state-of-the-art accuracy and lower cost.
-
Agentic Repository Mining: A Multi-Task Evaluation
LLM agents that dynamically explore repositories via bash commands achieve accuracy competitive with context-provided LLMs across four classification tasks, with superior robustness to artifact size.
-
LLM Ghostbusters: Surgical Hallucination Suppression via Adaptive Unlearning
Adaptive Unlearning suppresses package hallucinations in code-generating LLMs by 81% while preserving benchmark performance, using model-generated data and no human labels.
-
CLEAR: Revealing How Noise and Ambiguity Degrade Reliability in LLMs for Medicine
CLEAR reveals that LLMs' accuracy on medical questions drops and their 'humility deficit' grows as the number of plausible answers increases and abstention options shift from assertive to uncertain phrasing.
-
From Unstructured Recall to Schema-Grounded Memory: Reliable AI Memory via Iterative, Schema-Aware Extraction
Schema-aware iterative extraction turns AI memory into a verified system of record, reaching 90-97% accuracy on extraction and end-to-end memory benchmarks where retrieval baselines score 80-87%.
-
SIEVES: Selective Prediction Generalizes through Visual Evidence Scoring
SIEVES improves selective prediction coverage by up to 3x on OOD VQA benchmarks by training a selector to score the quality of visual evidence produced by reasoner models, generalizing across benchmarks and proprietary models without internal access or per-task retraining.
-
Process Supervision of Confidence Margin for Calibrated LLM Reasoning
RLCM trains LLMs with a margin-enhanced process reward that widens the gap between correct and incorrect reasoning steps, improving calibration on math, code, logic, and science tasks without hurting accuracy.
-
Adaptive Test-Time Compute Allocation with Evolving In-Context Demonstrations
An adaptive test-time framework uses a warm-up phase on the test set to build evolving in-context examples, then concentrates compute on unresolved queries to outperform static baselines on math, coding, and reasoning tasks with lower total inference cost.
-
From Subsumption to Satisfiability: LLM-Assisted Active Learning for OWL Ontologies
LLM-assisted active learning reformulates OWL subsumption checks as satisfiability queries, queries models for counter-concept examples, and ensures errors are only Type II delays rather than inconsistencies.
-
Calibration-Aware Policy Optimization for Reasoning LLMs
CAPO improves LLM calibration by up to 15% while matching or exceeding GRPO accuracy through logistic AUC loss and noise masking, enabling better abstention and scaling performance.
-
A Two-Stage LLM Framework for Accessible and Verified XAI Explanations
A two-stage LLM explainer-verifier framework with iterative refeed improves faithfulness and accessibility of XAI explanations, as shown in experiments across five techniques and three LLM families, with EPR analysis indicating progressive stabilization.
-
STEAR: Layer-Aware Spatiotemporal Evidence Intervention for Hallucination Mitigation in Video Large Language Models
STEAR reduces spatial and temporal hallucinations in Video-LLMs via layer-aware evidence intervention from middle decoder layers in a single-encode pass.
-
Ensemble-Based Uncertainty Estimation for Code Correctness Estimation
Ensemble Semantic Entropy improves correlation with code correctness over single-model methods and powers a cascading scaling system that cuts FLOPs by 64.9% while preserving performance on LiveCodeBench.
-
Causal Evidence that Language Models use Confidence to Drive Behavior
Language models deploy multidimensional internal confidence representations and threshold-based policies to control abstention behavior, with causal support from activation steering experiments.
-
ERA: Evidence-based Reliability Alignment for Honest Retrieval-Augmented Generation
ERA models internal and external knowledge as independent Dirichlet belief masses and uses Dempster-Shafer Theory to quantify conflicts, enabling better abstention decisions in RAG systems.
-
Revis: Sparse Latent Steering to Mitigate Object Hallucination in Large Vision-Language Models
REVIS reduces object hallucination in large vision-language models by about 19% via sparse orthogonal projection in latent space at suppression depths while keeping reasoning intact.
-
Variational Visual Question Answering for Uncertainty-Aware Selective Prediction
Variational VQA applies variational Bayes to improve calibration and selective prediction on VQA and visual reasoning tasks, with gains at low error tolerance via a risk-averse selector that uses prediction variance.
-
Towards Lightweight Reliability: Using Soft Prompts for Hallucination Mitigation in Large Language Models
RCSP trains soft prompts with contrastive loss, curriculum learning, and KL regularization to balance hallucination suppression, abstention, and factual recall, yielding higher F-scores than baselines on five QA datasets using Gemma 3 (12B) and Llama 3.1 (8B) backbones while updating only a small fr
-
Towards FairRAG: Preventing Representational Harm in Retrieval-Augmented Generation by Enforcing Fair Exposure at Retrieval Time
Representative Stochastic ranking achieves near-parity average exposure on the TREC 2022 Fair Ranking Dataset, with generation demographic parity closely tracking retrieval exposure.
-
Benchmarking LLM-Based Static Analysis for Secure Smart Contract Development: Reliability, Limitations, and Potential Hybrid Solutions
LLMs for smart contract security analysis show lexical bias from identifier names causing high false positives, with prompting creating precision-recall trade-offs, positioning them as complements rather than replacements for static analysis tools.
-
HalluClear: Diagnosing, Evaluating and Mitigating Hallucinations in GUI Agents
HalluClear supplies a taxonomy, calibrated evaluation, and lightweight post-training mitigation that reduces hallucinations in GUI agents using only 9K samples.
-
Beyond Literal Summarization: Redefining Hallucination for Medical SOAP Note Evaluation
Redefining hallucination evaluation for medical SOAP notes to credit clinical reasoning reduces reported hallucination rates from 35% to 9%.
-
EnsemHalDet: Robust VLM Hallucination Detection via Ensemble of Internal State Detectors
EnsemHalDet improves VLM hallucination detection by ensembling independent detectors trained on diverse internal states, yielding higher AUC than single-detector baselines across VQA datasets.
-
Hierarchical Memorization in Large Language Models: Evidence from Citation Generation
LLMs memorize citations hierarchically: titles and first authors are recalled at lower redundancy levels than venues or years, with accuracy scaling log-linearly and saturating near verbatim reproduction above roughly 1200 citations.
-
RECAP: Transparent Inference-Time Emotion Alignment for Medical Dialogue Systems
RECAP is an inference-time framework using cognitive appraisal theory to enhance emotional alignment and transparency in medical dialogue systems across model scales.
-
Vibe Coding in Product Teams: Reconfiguring AI-Assisted Workflows, Prototyping, and Collaboration
Interviews reveal a four-stage vibe coding workflow that accelerates prototyping while introducing tensions between quick efficiency and reflective design intention, plus asymmetries in trust and ownership.
-
Semantic Concurrency Limits in Large Language Models
High-dimensional geometry imposes concurrency limits on semantic directions in LLM embeddings via residual interference, with N < exp(c d_eff ε²) for coexistence and σ_int ~ √(k/d_eff) for readout noise.
-
At the Edge of Understanding: Sparse Autoencoders Trace The Limits of Transformer Generalization
Sparse autoencoders show OOD prompts increase fallacious concept activation in transformers, offering a mechanistic measure of shift and a path to robust fine-tuning.
-
Opportunities and Risks of Generative AI through the Health Information Journey
Authors propose a four-stage framework to analyze opportunities and risks of generative AI across the health information journey from public sources to clinical care.
-
Ceci n'est pas une explication: Evaluating Explanation Failures as Explainability Pitfalls in Language Learning Systems
Introduces the L2-Bench benchmark for AI feedback in language education across six dimensions and identifies explainability pitfalls: AI-generated explanations that appear helpful but are flawed.
-
The Crutch or the Ceiling? How Different Generations of LLMs Shape EFL Student Writings
Advanced LLMs improve EFL writing scores and diversity for lower-proficiency students but correlate with lower expert ratings on deep coherence, acting more as crutches than scaffolds.
-
When Should a Language Model Trust Itself? Same-Model Self-Verification as a Conditional Confidence Signal
Self-verification acts as a conditional confidence signal for language models rather than a reliable general-purpose uncertainty estimator.
-
Toward Epistemic Stability: Engineering Consistent Procedures for Industrial LLM Hallucination Reduction
Five prompt strategies were evaluated for stabilizing LLM outputs, with Enhanced Data Registry judged better than baseline in all 100 trials while others ranged from 34% to 80% success.
-
Robust AI Security and Alignment: A Sisyphean Endeavor?
AI security and alignment cannot achieve full robustness because any sufficiently powerful AI inherits incompleteness-style limitations from formal systems.
-
The Prompt Engineering Report Distilled: Quick Start Guide for Life Sciences
The paper reduces a broad set of prompt engineering techniques to six core approaches and applies them to life sciences use cases while addressing common LLM pitfalls.
- Beyond Relevance: On the Relationship Between Retrieval and RAG Information Coverage
- Beyond Explainable AI (XAI): An Overdue Paradigm Shift and Post-XAI Research Directions