InfiniteScienceGym procedurally generates unbounded scientific repositories with exact ground-truth QA pairs to benchmark LLMs on data reasoning, abstention, and tool use without static datasets.
Why Language Models Hallucinate
Canonical reference. 75% of citing Pith papers cite this work as background.
abstract
Like students facing hard exam questions, large language models sometimes guess when uncertain, producing plausible yet incorrect statements instead of admitting uncertainty. Such "hallucinations" persist even in state-of-the-art systems and undermine trust. We argue that language models hallucinate because the training and evaluation procedures reward guessing over acknowledging uncertainty, and we analyze the statistical causes of hallucinations in the modern training pipeline. Hallucinations need not be mysterious -- they originate simply as errors in binary classification. If incorrect statements cannot be distinguished from facts, then hallucinations in pretrained language models will arise through natural statistical pressures. We then argue that hallucinations persist due to the way most evaluations are graded -- language models are optimized to be good test-takers, and guessing when uncertain improves test performance. This "epidemic" of penalizing uncertain responses can only be addressed through a socio-technical mitigation: modifying the scoring of existing benchmarks that are misaligned but dominate leaderboards, rather than introducing additional hallucination evaluations. This change may steer the field toward more trustworthy AI systems.
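The abstract's claim that guessing when uncertain improves test performance can be checked with a little arithmetic. The sketch below is only an illustration under stated assumptions: the function names `expected_score`, `should_answer`, and `penalty_for_threshold` are hypothetical, and the penalized scoring rule (1 point for a correct answer, 0 for abstaining, a deduction of t/(1-t) for a wrong answer) is one possible way to modify benchmark scoring, not something the text above specifies.

```python
def expected_score(p_correct: float, wrong_penalty: float, answer: bool) -> float:
    """Expected score on one question.

    Assumed scoring rule: +1 if correct, -wrong_penalty if wrong,
    0 if the model abstains ("I don't know").
    """
    if not answer:
        return 0.0
    return p_correct - (1.0 - p_correct) * wrong_penalty


def should_answer(p_correct: float, wrong_penalty: float) -> bool:
    """Answer only when answering beats the abstention score of 0."""
    return expected_score(p_correct, wrong_penalty, answer=True) > 0.0


def penalty_for_threshold(t: float) -> float:
    """Penalty that makes answering worthwhile only when p_correct > t."""
    return t / (1.0 - t)


# Binary grading (no penalty): even a 10%-confident guess has positive
# expected score, so a score-maximizing model never abstains.
binary_guess = should_answer(0.10, wrong_penalty=0.0)

# Penalized grading with threshold t = 0.75: the same guess now loses
# points in expectation, while a 90%-confident answer still pays off.
penalty = penalty_for_threshold(0.75)
penalized_guess = should_answer(0.10, wrong_penalty=penalty)
penalized_confident = should_answer(0.90, wrong_penalty=penalty)
```

Under the assumed rule, answering has positive expected score exactly when p_correct > t, so the threshold t acts as an explicit confidence target rather than an incentive to guess.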
representative citing papers
CritPt benchmark shows state-of-the-art LLMs reach only 5.7% average accuracy on full-scale unpublished physics research tasks, rising to about 10% with coding tools.
Personalized LLM-generated plain language summaries improve lay readers' comprehension and quality ratings but increase risks of reinforcing biases and introducing hallucinations compared to static expert summaries.
This paper introduces a systems-level conceptual framing and a three-level taxonomy (intra-model, system-level, socio-technical) for uncertainty propagation in compound LLM applications, along with engineering insights and open challenges.
BAS aggregates utility from an answer-or-abstain model across risk thresholds and is uniquely maximized by truthful confidence estimates.
SVR-MAD treats pre-debate signals as priors and debate results as evidence to build a sparser communication graph, cutting token use by up to 61% while preserving or raising accuracy over prior MAD methods.
Factual recall quality in LLMs follows a sigmoid scaling law in the log-linear combination of model parameter count and topic frequency in training data, explaining 60% of variance across models and up to 94% within families.
Google AI Overviews activate on 13.7% of queries overall and 64.7% of questions, cite more credible sources than standard results but omit key information in 11% of claims, and suppress clicks on over half of cited pages that carry ads.
TokenHD uses a scalable data synthesis engine and importance-weighted training to create token-level hallucination detectors that work on free-form text and scale from 0.6B to 8B parameters, outperforming larger reasoning models.
A single-pass black-box method models LLM outputs as dynamical systems via Koopman operators to detect hallucinations with claimed state-of-the-art accuracy and lower cost.
LLM agents that dynamically explore repositories via bash commands achieve accuracy competitive with context-provided LLMs across four classification tasks and are more robust to artifact size.
Adaptive Unlearning suppresses package hallucinations in code-generating LLMs by 81% while preserving benchmark performance, using model-generated data and no human labels.
CLEAR reveals that LLMs' accuracy on medical questions drops and their 'humility deficit' grows as the number of plausible answers increases and abstention options shift from assertive to uncertain phrasing.
Schema-aware iterative extraction turns AI memory into a verified system of record, reaching 90-97% accuracy on extraction and end-to-end memory benchmarks where retrieval baselines score 80-87%.
SIEVES improves selective prediction coverage by up to 3x on OOD VQA benchmarks by training a selector to score the quality of visual evidence produced by reasoner models, generalizing across benchmarks and proprietary models without internal access or per-task retraining.
RLCM trains LLMs with a margin-enhanced process reward that widens the gap between correct and incorrect reasoning steps, improving calibration on math, code, logic, and science tasks without hurting accuracy.
An adaptive test-time framework uses a warm-up phase on the test set to build evolving in-context examples, then concentrates compute on unresolved queries to outperform static baselines on math, coding, and reasoning tasks with lower total inference cost.
LLM-assisted active learning reformulates OWL subsumption checks as satisfiability queries, queries models for counter-concept examples, and ensures errors are only Type II delays rather than inconsistencies.
CAPO improves LLM calibration by up to 15% while matching or exceeding GRPO accuracy through logistic AUC loss and noise masking, enabling better abstention and scaling performance.
A two-stage LLM explainer-verifier framework with iterative refeed improves faithfulness and accessibility of XAI explanations, as shown in experiments across five techniques and three LLM families, with EPR analysis indicating progressive stabilization.
STEAR reduces spatial and temporal hallucinations in Video-LLMs via layer-aware evidence intervention from middle decoder layers in a single-encode pass.
Ensemble Semantic Entropy improves correlation with code correctness over single-model methods and powers a cascading scaling system that cuts FLOPs by 64.9% while preserving performance on LiveCodeBench.
Language models deploy multidimensional internal confidence representations and threshold-based policies to control abstention behavior, with causal support from activation steering experiments.
ERA models internal and external knowledge as independent Dirichlet belief masses and uses Dempster-Shafer Theory to quantify conflicts, enabling better abstention decisions in RAG systems.
citing papers explorer
- Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark
CritPt benchmark shows state-of-the-art LLMs reach only 5.7% average accuracy on full-scale unpublished physics research tasks, rising to about 10% with coding tools.
- From Unstructured Recall to Schema-Grounded Memory: Reliable AI Memory via Iterative, Schema-Aware Extraction
Schema-aware iterative extraction turns AI memory into a verified system of record, reaching 90-97% accuracy on extraction and end-to-end memory benchmarks where retrieval baselines score 80-87%.
- Adaptive Test-Time Compute Allocation with Evolving In-Context Demonstrations
An adaptive test-time framework uses a warm-up phase on the test set to build evolving in-context examples, then concentrates compute on unresolved queries to outperform static baselines on math, coding, and reasoning tasks with lower total inference cost.
- From Subsumption to Satisfiability: LLM-Assisted Active Learning for OWL Ontologies
LLM-assisted active learning reformulates OWL subsumption checks as satisfiability queries, queries models for counter-concept examples, and ensures errors are only Type II delays rather than inconsistencies.
- A Two-Stage LLM Framework for Accessible and Verified XAI Explanations
A two-stage LLM explainer-verifier framework with iterative refeed improves faithfulness and accessibility of XAI explanations, as shown in experiments across five techniques and three LLM families, with EPR analysis indicating progressive stabilization.
- Revis: Sparse Latent Steering to Mitigate Object Hallucination in Large Vision-Language Models
REVIS reduces object hallucination in large vision-language models by about 19% via sparse orthogonal projection in latent space at suppression depths while keeping reasoning intact.
- HalluClear: Diagnosing, Evaluating and Mitigating Hallucinations in GUI Agents
HalluClear supplies a taxonomy, calibrated evaluation, and lightweight post-training mitigation that reduces hallucinations in GUI agents using only 9K samples.
- Beyond Literal Summarization: Redefining Hallucination for Medical SOAP Note Evaluation
Redefining hallucination evaluation for medical SOAP notes to credit clinical reasoning reduces reported hallucination rates from 35% to 9%.
- Robust AI Security and Alignment: A Sisyphean Endeavor?
AI security and alignment cannot achieve full robustness because any sufficiently powerful AI inherits incompleteness-style limitations from formal systems.