Survey of Hallucination in Natural Language Generation

Andrea Madotto, Dan Su, Etsuko Ishii, Nayeon Lee, Pascale Fung, Rita Frieske + 2 more · 2023 · ACM Computing Surveys · DOI 10.1145/3571730

Canonical reference. 88% of citing Pith papers cite this work as background.

100 Pith papers citing it

2,906 external citations · Crossref

citation-role summary

background 17

citation-polarity summary

background 15 unclear 2

claims ledger

background [315, 361]. Furthermore, Liu et al. [185], Zong et al. [395] and Liu et al. [184] show that LVLMs can be easily fooled and experience a severe performance drop due to their over-reliance on the strong language prior, as well as its inferior ability to defend against inappropriate user inputs [112, 134]. Jiang et al. [138], Wang et al. [315] and Jing et al. [141] took a step forward to holistically evaluate multi-modal hallucination. What's more, when presented with multiple images, LVLMs sometim

authors

Andrea Madotto Dan Su Etsuko Ishii Nayeon Lee Pascale Fung Rita Frieske Tiezheng Yu Yan Xu Ye Jin Bang Ziwei Ji

representative citing papers

HalluWorld: A Controlled Benchmark for Hallucination via Reference World Models

cs.CL · 2026-05-19 · conditional · novelty 8.0

HalluWorld is a controlled benchmark using explicit reference world models to automatically label and disentangle hallucinations in LLMs across synthetic environments with varying complexity and observability.

Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps

cs.AI · 2026-05-17 · unverdicted · novelty 8.0

A new benchmark with cognitive traps shows frontier deep research agents achieve only 13-16% acceptance on expert consulting tasks under combined verifier and rubric criteria.

LibEvoBench: Probing Temporal Knowledge Stratification in Code Generation Models

cs.SE · 2026-06-24 · unverdicted · novelty 7.0

The LibEvoBench benchmark shows that LLMs are version-oblivious on evolving APIs: documentation helps, but specifying the version does not.

MedHal-Loc: Are "Explainable-by-Architecture" Medical Hallucination Detectors Faithful Localizers? A Localization Benchmark

cs.CL · 2026-06-19 · unverdicted · novelty 7.0

MedHal-Loc benchmark shows KG-triple hallucination detectors localize errors no better than chance on controlled medical statements due to entity extraction limits, while NLI and consistency methods succeed above chance, and real hallucinations are mostly diffuse conclusion changes.

Description-Code Inconsistency in Real-world MCP Servers: Measurement, Detection, and Security Implications

cs.CR · 2026-06-03 · unverdicted · novelty 7.0

Empirical study of 2,214 MCP servers finds 9.93% of 19,200 description-code pairs inconsistent via a new static-analysis-plus-LLM-prompting framework, with security implications.

Knowledge Editing in Masked Diffusion Language Models

cs.CL · 2026-06-02 · unverdicted · novelty 7.0

Locate-then-edit succeeds at the same early-to-mid MLP locations in masked diffusion models as in autoregressive models, but requires optimization over intermediate partial-mask states to handle multi-token targets.

AI Assistance for Discretionary Work: Increasing Feedback Provision in Higher Education

cs.HC · 2026-06-02 · accept · novelty 7.0

Randomized experiment finds AI draft assistance raises feedback provision by teaching assistants 10.8 percentage points without harming quality.

Honest Lying: Understanding Memory Confabulation in Reflexive Agents

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

Reflexive agents confabulate incorrect task interpretations in memory, detected via Reflection Repetition Rate metric, with a programmatic mitigation raising correct object mentions from 0% to 86% in frozen ALFWorld cases.

When Answers Stray from Questions: Hallucination Detection via Question-Answer Orthogonal Decomposition

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

QAOD projects away question-aligned directions from answer representations to isolate domain-agnostic factuality signals, enabling efficient hallucination detection with top in-domain AUROC and up to 21% better OOD transfer.

Trust Me, Import This: Dependency Steering Attacks via Malicious Agent Skills

cs.CR · 2026-05-10 · unverdicted · novelty 7.0

Malicious Skills induce coding agents to hallucinate and import attacker-controlled packages at high rates while evading detection.

Causal Stories from Sensor Traces: Auditing Epistemic Overreach in LLM-Generated Personal Sensing Explanations

cs.HC · 2026-05-09 · accept · novelty 7.0

LLMs routinely produce unsupported causal stories for personal sensing anomalies, and richer evidence or constrained prompts do not reliably eliminate this epistemic overreach.

Eliciting associations between clinical variables from LLMs via comparison questions across populations

cs.LG · 2026-05-07 · unverdicted · novelty 7.0

Indirect elicitation via triplet comparisons recovers meaningful association structures from LLMs and supports conservative causal candidate links across prompted subpopulations.

Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost

cs.AI · 2026-05-07 · conditional · novelty 7.0

Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.

BioGraphletQA: Knowledge-Anchored Generation of Complex QA Datasets

cs.CL · 2026-04-28 · conditional · novelty 7.0

A graphlet-anchored framework generates 119,856 factually grounded biomedical QA pairs that improve accuracy on PubMedQA and MedQA benchmarks.

CyberCertBench: Evaluating LLMs in Cybersecurity Certification Knowledge

cs.CR · 2026-04-22 · unverdicted · novelty 7.0

CyberCertBench shows frontier LLMs reach human-expert performance on general IT and networking security but drop on vendor-specific and formal standards questions such as IEC 62443, with a new framework for producing interpretable explanations.

BibTeX Citation Hallucinations in Scientific Publishing Agents: Evaluation and Mitigation

cs.DL · 2026-04-03 · conditional · novelty 7.0

Frontier LLMs generate BibTeX entries at 83.6% field accuracy but only 50.9% fully correct; two-stage clibib revision raises accuracy to 91.5% and fully correct entries to 78.3% with 0.8% regression.

Library Hallucinations in LLM-Generated Code: A Risk Analysis Grounded in Developer Queries

cs.SE · 2025-09-26 · unverdicted · novelty 7.0

A study of seven LLMs finds that realistic prompt variations such as one-character misspellings trigger library hallucinations in up to 26% of cases, fabricated names in up to 99%, and time-based prompts in up to 85%, and introduces LibHalluBench for evaluation.

Auditing AI Investment Recommendations as Executable Actions

cs.LO · 2026-06-25 · unverdicted · novelty 6.0

Introduces a protocol scoring AI investment advisors on validity under constraints, stability, and agreement with a deterministic baseline, showing agreement often masks invalid actions.

Hallucination in World Models is Predictable and Preventable

cs.LG · 2026-06-25 · unverdicted · novelty 6.0

Hallucination in world models is a data coverage issue predictable by three signals and preventable through targeted training sampling and online data collection.

Exposing the Illusion of Erasure in Knowledge Editing for LLMs

cs.LG · 2026-06-22 · unverdicted · novelty 6.0

Knowledge editing methods redistribute and suppress rather than overwrite facts in LLMs, creating narrow vulnerable regions in representation space that adversarial prompts can exploit.

Vaani Benchmark V1.0: An Inclusive Multimodal Benchmark Dataset for Hindi

eess.AS · 2026-06-19 · unverdicted · novelty 6.0

Vaani Benchmark V1.0 is a multimodal Hindi ASR dataset from 104 districts featuring spontaneous speech recordings in real-world conditions and three independent transcriptions per segment for robust multi-reference evaluation.

CAPRA: Scaling Feedback on Software Architecture Deliverables with a Multi-Agent LLM System

cs.SE · 2026-06-17 · unverdicted · novelty 6.0

CAPRA is a multi-agent LLM system with evidence anchoring and consistency checking that analyzes software architecture deliverables and meets 88.8% of an eight-criterion evaluation on 10 student reports.

A Risk Decomposition Framework for Pre-Hoc Fine-Tuning Prediction

cs.LG · 2026-06-16 · unverdicted · novelty 6.0

Formulates pre-hoc fine-tuning prediction as stochastic estimation, proves lower bound on optimization variance decay rate, and introduces a three-regime predictability phase diagram.

IVIE: A Neuro-symbolic Approach to Incremental and Validated Generation of Interactive Fiction Worlds

cs.CL · 2026-06-11 · unverdicted · novelty 6.0

IVIE generates complete playable interactive fiction worlds via a four-stage incremental pipeline that combines LLM creativity with symbolic validation for coherence.

citing papers explorer

Showing 50 of 100 citing papers.

SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models cs.CL · 2023-03-15 · unverdicted · none · ref 58
SelfCheckGPT detects hallucinations by checking consistency across multiple sampled responses from black-box LLMs on WikiBio biography generation tasks.
Beyond Her: Safety Dynamics in Role-play AI Companions cs.CR · 2026-06-27 · unverdicted · none · ref 40 · 2 links
Mixed-methods study of role-play AI companions finds short-term emotional relief that can mask longer-term deterioration, especially among users with internalizing problems who show unstable risk patterns.
Topic-to-Timestamp Alignment by Constrained Evidence Selection cs.CL · 2026-06-18 · unverdicted · none · ref 7
Constrained candidate selection from retrieved chunks raises Recall@5 from 31.9% to 50.0% and parseable outputs on 420 queries from 200 municipal meeting transcripts.
TuneAhead: Predicting Fine-tuning Performance Before Full Training Begins cs.LG · 2026-06-16 · unverdicted · none · ref 80
TUNEAHEAD predicts fine-tuning performance from meta-features and short probes, reporting RMSE 1.47 and 95.1% of predictions within 3 points on 370 held-out runs of Qwen2.5-7B.
Short paper: Models in the dark -- Rectification and erasure under GDPR in ML supply chains cs.LG · 2026-06-04 · unverdicted · none · ref 36
Survey identifying technical and supply-chain barriers to GDPR data subject rights in ML, with new framing of 'models in the dark' for downstream opacity.
When AI Says It Feels cs.AI · 2026-06-04 · unverdicted · none · ref 120
Training LLMs via rubric-based self-rewarding RL with GRPO enhanced feeling expression and sycophancy robustness but degraded truthful QA performance.
Can Crowdsourcing Survive the LLM Era? A Community Survey on Human Data Collection cs.CL · 2026-06-03 · unverdicted · none · ref 23
Survey of 155 researchers finds 44% observed LLM usage in crowdsourced data, with high awareness but insufficient mitigation efforts.
A cross-domain tropical species dataset with Chinese vernacular names and CITES source links cs.CL · 2026-06-02 · unverdicted · none · ref 15
A cross-domain dataset of 410,499 tropical species adds Chinese vernacular names at 99.5% coverage and CITES source links to existing taxonomic identifiers.
Think Fast, Talk Smart: Partitioning Deterministic and Neural Computation for Structured Health Text Generation cs.AI · 2026-05-28 · unverdicted · none · ref 1
A sleep-health text generation pipeline using deterministic code for analysis followed by one LLM call achieves lower numeric error, instruction-compliance error, and cost than pure LLM baselines across 280 user-nights and six models.
Neuro-Symbolic Verification of LLM Outputs for Data-Sensitive Domains (extended preprint) cs.AI · 2026-05-26 · unverdicted · none · ref 21
Neuro-symbolic pipeline using formal logic and semantic embeddings detects hallucinations in LLM medical reports at 83%+ for entities and 72% for fabrications while cutting creation time 30%.
From Norms to Indicators (N2I-RAG): An Agentic Retrieval-Augmented Generation Framework for Legal Indicator Computation cs.AI · 2026-05-26 · unverdicted · none · ref 12
N2I-RAG is an agentic RAG pipeline that automates binary legal indicator computation from complex normative texts with explicit traceability to provisions.
Empirical Analysis and Detection of Hallucinations in LLM-Generated Bug Report Summaries cs.SE · 2026-05-22 · unverdicted · none · ref 13
Develops a section-aware hallucination detection method for LLM bug report summaries using synthetic injection on the BugsRepo dataset from Mozilla projects, reporting up to 0.89 Macro-F1 at report level.
ReacTOD: Bounded Neuro-Symbolic Agentic NLU for Zero-Shot Dialogue State Tracking cs.CL · 2026-05-18 · unverdicted · none · ref 12
ReacTOD introduces a bounded neuro-symbolic ReAct architecture with symbolic validation that delivers new zero-shot SOTA joint goal accuracy on MultiWOZ 2.1 and strong results on SGD.
Fairness-Aware Retrieval Optimization for Retrieval-Augmented Generation cs.DB · 2026-05-15 · unverdicted · none · ref 3
Introduces FARO, a scalable quadratic optimization approach for fairness-aware top-k retrieval in RAG that mitigates generation bias via controlled reranking and position-aware propagation modeling.
IFPV: An Integrated Multi-Agent Framework for Generative Operational Planning and High-Fidelity Plan Verification cs.MA · 2026-05-14 · unverdicted · none · ref 18
IFPV integrates multi-perspective hierarchical agents for generative planning with an adversarial cognitive simulation engine for verification, reporting 19.4% higher mission success, 41.7% lower cost versus LLM baseline, and 31.8% higher suppression versus rule-based validation in combat simulation.
Beliefs and Misconceptions around Integrated Conversational AI cs.HC · 2026-05-14 · unverdicted · none · ref 18
Qualitative study of 20 users of integrated browser conversational AI found that citations raise trustworthiness without verification and that users apply existing LLM and search perceptions to prompting strategies.
The Semantic Training Gap: Ontology-Grounded Tool Architectures for Industrial AI Agent Systems cs.AI · 2026-05-11 · unverdicted · none · ref 24
Ontology-grounded tool architectures eliminate hallucination of domain identifiers in industrial AI agents by enforcing semantic constraints through a typed relational configuration and three-operation interface.
Evaluating the False Trust Engendered by LLM Explanations cs.HC · 2026-05-11 · unverdicted · none · ref 27 · 2 links
LLM reasoning traces and post-hoc explanations increase false trust in incorrect predictions, whereas contrastive dual explanations enhance users' ability to distinguish correct from incorrect AI outputs.
From Flat Facts to Sharp Hallucinations: Detecting Stubborn Errors via Gradient Sensitivity cs.LG · 2026-05-01 · unverdicted · none · ref 21 · 2 links
EPGS detects high-confidence factual errors in LLMs by using embedding perturbations to measure gradient sensitivity as a proxy for sharp versus flat minima.
An Information-Geometric Framework for Stability Analysis of Large Language Models under Entropic Stress cs.AI · 2026-04-27 · unverdicted · none · ref 6
A thermodynamic-inspired information-geometric framework defines a composite LLM stability score that outperforms a utility-entropy baseline by 0.0299 on average across 80 observations, with gains increasing at higher entropy.
Self-Awareness before Action: Mitigating Logical Inertia via Proactive Cognitive Awareness cs.AI · 2026-04-22 · unverdicted · none · ref 6
SABA improves LLM performance on detective puzzle benchmarks by recursively fusing information into a base state and using queries to resolve missing premises before concluding.
Towards Scalable Lifelong Knowledge Editing with Selective Knowledge Suppression cs.AI · 2026-04-21 · unverdicted · none · ref 70
LightEdit enables scalable lifelong knowledge editing in LLMs via selective knowledge retrieval and probability suppression during decoding, outperforming prior methods on ZSRE, Counterfact, and RIPE while reducing training costs.
Align Documents to Questions: Question-Oriented Document Rewriting for Retrieval-Augmented Generation cs.CL · 2026-04-19 · unverdicted · none · ref 103
QREAM rewrites documents to question-focused style using iterative ICL and distilled FT models, boosting RAG performance by up to 8% relative improvement.
A Graph-Enhanced Defense Framework for Explainable Fake News Detection with LLM cs.CL · 2026-04-08 · unverdicted · none · ref 33
G-Defense builds claim-centered graphs from sub-claims, applies RAG for evidence and competing explanations, then uses graph inference to detect fake news veracity and generate intuitive explanation graphs, claiming SOTA results.
The Personalization Paradox: Semantic Loss vs. Reasoning Gains in Agentic AI Q&A cs.IR · 2025-12-04 · unverdicted · none · ref 13
Personalization in an agentic RAG advising system boosts reasoning quality and grounding while reducing semantic metric scores due to the inability of current metrics to accommodate user-specific responses.
A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions cs.CL · 2023-11-09 · unverdicted · none · ref 141
The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.
PaLM 2 Technical Report cs.CL · 2023-05-17 · unverdicted · none · ref 72
PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.
StarCoder: may the source be with you! cs.CL · 2023-05-09 · accept · none · ref 276
StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
What the LLM Should Not Say: Boundary-Aware Context Grounding for A Seven-Channel EEG Agent cs.AI · 2026-06-25 · unverdicted · none · ref 6 · 2 links
NeuraDock Agent is an open-source architecture that pairs a local EEG engine with a restricted LLM context pack to enforce hardware and implementation boundaries for seven-channel recordings.
At the Edge of Understanding: Sparse Autoencoders Trace The Limits of Transformer Generalization cs.LG · 2026-06-24 · unverdicted · none · ref 13
Sparse autoencoders show OOD prompts increase fallacious concept activation in transformers, offering a mechanistic measure of shift and a path to robust fine-tuning.
Hallucinations in Organization-backed AI advisors: Evidence about Skepticism, Verification, and Reliance in Goal-Directed Use cs.HC · 2026-06-22 · unverdicted · none · ref 6
Literature review synthesizing evidence on user skepticism, verification, and reliance with hallucinating AI advisors, noting that output-related cues like warnings show weak effects and that content category has not been experimentally varied.
Integrating Large Language Model Agents with Digital Twins for Industrial Autonomous Systems cs.SE · 2026-06-18 · unverdicted · none · ref 13
A TPSR-based framework with four LLM roles integrates language model reasoning into industrial automation via digital twins, achieving high task executability in case studies.
Hallucination in Medical Imaging AI: A Cross-Modality Analytical Framework for Taxonomy, Detection, and Mitigation under Regulatory Constraints cs.AI · 2026-06-11 · unverdicted · none · ref 5
A literature synthesis that unifies hallucination taxonomies across medical imaging modalities, finds general-purpose foundation models hallucinate less than specialized ones, and maps mitigation to FDA lifecycle frameworks.
Explicit Evidence Grounding via Structured Inline Citation Generation cs.CL · 2026-06-05 · unverdicted · none · ref 30
FullCite introduces three strategies for structured inline citation generation in QA and finds LLMs identify relevant documents well but struggle with precise evidence spans on ASQA, BioASQ, and ExpertQA.
Food Noise & False Safety: A Systematic Evaluation of How LLMs Fail to Adapt to Eating Disorder Queries with Clinician Feedback cs.AI · 2026-06-01 · unverdicted · none · ref 72
Systematic evaluation shows LLMs frequently give unsafe responses to eating disorder prompts when linguistic cues signal risk, as measured by varying prompt danger levels with clinician feedback.
JT-SAFE-V2: Safety-by-Design Foundation Model with World-Context Data cs.AI · 2026-05-23 · unverdicted · none · ref 6
JT-Safe-V2 is a safety-by-design LLM that reports SOTA scores on both capability and safety benchmarks while Safe-MoMA cuts inference cost over 30 percent.
Using Large Language Models in Physics Education physics.ed-ph · 2026-05-22 · unverdicted · none · ref 16 · 2 links
Frontier LLMs from mid-2024 to late-2025 reach near-perfect scores on text-based physics problems and show improved human alignment in grading, but assigning partial credit for flawed reasoning remains difficult.
Do Benchmarks Underestimate LLM Performance? Evaluating Hallucination Detection With LLM-First Human-Adjudicated Assessment cs.CL · 2026-05-08 · unverdicted · none · ref 1
Human adjudication of conflicts between original benchmark labels and LLM predictions on QAGS-C and SummEval increases triple agreement by 6-8% and LLM accuracy by 2-9%, with adjudicators often siding with models that provide explicit reasoning.
Text-Graph Synergy: A Bidirectional Verification and Completion Framework for RAG cs.AI · 2026-05-07 · unverdicted · none · ref 16
TGS-RAG adds graph-to-text re-ranking with global voting and text-to-graph orphan path bridging to improve precision and efficiency in multi-hop RAG over prior baselines.
Reducing Hallucination in Enterprise AI Workflows via Hybrid Utility Minimum Bayes Risk (HUMBR) cs.LG · 2026-04-13 · unverdicted · none · ref 12
HUMBR reduces LLM hallucinations in enterprise workflows by using a hybrid semantic-lexical utility within minimum Bayes risk decoding to identify consensus outputs, with derived error bounds and reported outperformance over self-consistency on benchmarks and production data.
Council Mode: A Heterogeneous Multi-Agent Consensus Framework for Reducing LLM Hallucination and Bias cs.CL · 2026-04-03 · unverdicted · none · ref 2
Council Mode reduces LLM hallucinations by 35.9% and improves TruthfulQA scores by 7.8 points through parallel heterogeneous model generation followed by structured consensus synthesis.
Mitigating Hallucination on Hallucination in RAG via Ensemble Voting cs.CL · 2026-03-28 · unverdicted · none · ref 4
VOTE-RAG applies retrieval voting across diverse queries and response voting across independent generations to mitigate hallucination-on-hallucination in RAG, matching or exceeding complex baselines on six benchmarks with a parallelizable design.
A Conversational Framework for Human-Robot Collaborative Manipulation with Distributed Generative AI models cs.RO · 2026-06-04 · unverdicted · none · ref 9
Presents a distributed ROS 2 framework integrating local LLMs and VLMs for conversational human-robot manipulation tasks with operator confirmation and experimental evaluation on a Franka FR3 arm.
Mathematical Reasoning in Large Language Models: Benchmarks, Architectures, Evaluation, and Open Challenges cs.CL · 2026-05-19 · unverdicted · none · ref 46
A literature survey synthesizing benchmarks, architectures, training strategies, and evaluation methods for mathematical reasoning in LLMs, based on roughly 120 papers.
AgriIR: A Scalable Framework for Domain-Specific Knowledge Retrieval cs.IR · 2026-03-17 · unverdicted · none · ref 17
AgriIR is a configurable RAG framework using modular stages and 1B-parameter models to deliver grounded, citable answers for Indian agricultural information access.
Large Language Models: A Survey cs.CL · 2024-02-09 · accept · none · ref 145
The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.
LLMs in Qualitative Research: Opportunities, Limitations, and Practical Considerations cs.HC · 2026-05-15 · unverdicted · none · ref 58
The paper outlines opportunities, limitations, and practical parameters for integrating LLMs into qualitative research while aligning with epistemological commitments like reflexivity and interpretive judgment.
A Hybrid Retrieval and Reranking Framework for Evidence-Grounded Retrieval-Augmented Generation cs.IR · 2026-05-03 · unverdicted · none · ref 32
A hybrid RAG system with retrieval, Cohere reranking, and claim-level LLM judgment achieves 100% grounding accuracy on 200 claims from 25 biomedical queries in a pilot study.
From Dispersion to Attraction: Spectral Dynamics of Hallucination Across Whisper Model Scales cs.LG · 2026-03-31 · unreviewed · ref 17
Conversational AI increases political knowledge as effectively as self-directed internet search cs.HC · 2025-09-05 · unreviewed · ref 7
