{"total":24,"items":[{"citing_arxiv_id":"2606.27806","ref_index":70,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Grounded Iterative Language Planning: How Parameterized World Models Reduce Hallucination Propagation in LLM Agents","primary_cat":"cs.AI","submitted_at":"2026-06-26T07:45:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GILP combines a small parameterized world model with LLM agent reasoning via a consistency gate, reducing hallucinated-state rate from 0.176 to 0.035 and raising success from 0.668 to 0.838 on graph planning benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28123","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Risk-aware Selective Prompting for Hallucination Mitigation in Large Vision-Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-27T08:14:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Verification prompting in LVLMs is input-dependent and risk-bearing; RSP selectively triggers it via pre-generation uncertainty to avoid performance degradation on easy cases.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17998","ref_index":8,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Verify-Gated Completion as Admission Control in a Governed Multi-Agent Runtime: A Bounded Architecture Case Study","primary_cat":"cs.SE","submitted_at":"2026-05-18T07:52:13+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"In a bounded multi-agent runtime case study, verify-gated completion produced 99.5% success on invoked verification events with packetized records, supporting 
only a narrow claim of inspectable and fail-closed decisions under observed conditions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12813","ref_index":116,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations","primary_cat":"cs.CL","submitted_at":"2026-05-12T23:13:50+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02443","ref_index":36,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"HalluScan: A Systematic Benchmark for Detecting and Mitigating Hallucinations in Instruction-Following LLMs","primary_cat":"cs.CL","submitted_at":"2026-05-04T10:43:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"HalluScan benchmark evaluates hallucination detection in LLMs, reporting NLI Verification at AUROC 0.88 and introducing HalluScore (r=0.41 with humans) plus Adaptive Detection Routing for 2x cost savings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19457","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents","primary_cat":"cs.AI","submitted_at":"2026-04-21T13:37:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Long-horizon enterprise AI agents' decisions decompose into four measurable axes, with benchmark experiments on six memory architectures revealing distinct weaknesses and reversing a pre-registered prediction on 
summarization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17487","ref_index":3,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Answer Only as Precisely as Justified: Calibrated Claim-Level Specificity Control for Agentic Systems","primary_cat":"cs.CL","submitted_at":"2026-04-19T15:20:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Compositional selective specificity (CSS) decomposes generated answers into claims and emits each at the most specific level supported by evidence, raising overcommitment-aware utility from 0.846 to 0.913 on LongFact while retaining 0.938 specificity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14401","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Credo: Declarative Control of LLM Pipelines via Beliefs and Policies","primary_cat":"cs.AI","submitted_at":"2026-04-15T20:31:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Credo proposes representing LLM agent state as beliefs and regulating pipeline behavior with declarative policies stored in a database for adaptive, auditable control.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11141","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Reducing Hallucination in Enterprise AI Workflows via Hybrid Utility Minimum Bayes Risk (HUMBR)","primary_cat":"cs.LG","submitted_at":"2026-04-13T07:57:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"HUMBR reduces LLM hallucinations in enterprise workflows by using a hybrid semantic-lexical utility within 
minimum Bayes risk decoding to identify consensus outputs, with derived error bounds and reported outperformance over self-consistency on benchmarks and production data.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"regulatory interpretation. The challenge is that single-model generations are inherently unstable; even State-Of-The-Art models can confidently fabricate details when operating on long-tail knowledge or complex logical predicates [27]. Standard industrial approaches to mitigate hallucination typically rely on Retrieval-Augmented Generation [21] or iterative refinement [4]; 'Ask the model to critique itself'. Unfortunately, these methods have distinct failure modes in production. Relying on a single decoding path even with temperature 𝑇 = 0 leaves the system vulnerable to the specific biases and blind spots of that model instance. Another common ensemble technique is to generate multiple responses and ask an LLM to summarize them."},{"citing_arxiv_id":"2604.04131","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Profile-Then-Reason: Bounded Semantic Complexity for Tool-Augmented Language Agents","primary_cat":"cs.AI","submitted_at":"2026-04-05T14:27:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PTR framework profiles a workflow upfront then executes it deterministically with bounded verification and repair, limiting LM calls to 2-3 while outperforming ReAct in 16 of 24 tested configurations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06211","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Illocutionary Explanation Planning for Source-Faithful Explanations in Retrieval-Augmented Language 
Models","primary_cat":"cs.CL","submitted_at":"2026-03-16T11:10:29+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Chain-of-illocution prompting improves source adherence in RAG explanations for programming education by up to 63% over baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.22416","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Hallucination Detection and Evaluation of Large Language Model","primary_cat":"cs.CL","submitted_at":"2025-12-27T00:17:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"HHEM delivers fast hallucination detection in LLMs via classification, cutting evaluation time from 8 hours to 10 minutes with up to 82.2% accuracy while adding segment retrieval for summarization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.12634","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MobiBench: Multi-Branch, Modular Benchmark for Mobile GUI Agents","primary_cat":"cs.AI","submitted_at":"2025-12-14T10:41:39+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.24943","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Perceive, Verify and Understand Long Video: Multi-Granular Perception and Active Verification via Interactive Agents","primary_cat":"cs.CV","submitted_at":"2025-09-29T15:42:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CogniGPT uses an interactive loop between a 
Multi-Granular Perception Agent and an Active Verification Agent to identify reliable clues in long videos with high accuracy and low frame usage.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.18864","ref_index":221,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Towards an AI co-scientist","primary_cat":"cs.AI","submitted_at":"2025-02-26T06:17:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.12187","ref_index":44,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Hallucinations are inevitable but can be made statistically negligible","primary_cat":"cs.CL","submitted_at":"2025-02-15T07:28:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Hallucinations are inevitable on an infinite set of inputs but can be made statistically negligible with sufficient training data quality and quantity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2407.20240","ref_index":38,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Social and Ethical Risks Posed by General-Purpose LLMs for Settling Newcomers in Canada","primary_cat":"cs.CY","submitted_at":"2024-07-15T19:23:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"The paper identifies social and ethical risks from unguarded use of general-purpose LLMs in Canadian 
newcomer settlement and advocates for AI literacy programs plus customized models with human oversight.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2406.15927","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs","primary_cat":"cs.CL","submitted_at":"2024-06-22T19:46:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SEPs approximate semantic entropy from single-generation hidden states to enable cheap and robust hallucination detection in LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2402.07927","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications","primary_cat":"cs.AI","submitted_at":"2024-02-05T19:49:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A systematic survey categorizes prompt engineering methods for LLMs and VLMs by application area, summarizing methodologies, applications, models, datasets, strengths, and limitations for each technique along with a taxonomy and summary table.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2401.11817","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Hallucination is Inevitable: An Innate Limitation of Large Language Models","primary_cat":"cs.CL","submitted_at":"2024-01-22T10:26:14+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Hallucinations are inevitable in LLMs because they cannot learn all computable 
functions according to learning theory.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2312.10997","ref_index":93,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Retrieval-Augmented Generation for Large Language Models: A Survey","primary_cat":"cs.CL","submitted_at":"2023-12-18T07:47:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A survey of RAG paradigms, components, benchmarks, and challenges for improving LLMs on knowledge-intensive tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"process of adding relevant context is, in principle, similar to query expansion. Specifically, a complex question can be decomposed into a series of simpler sub-questions using the least-to-most prompting method [92]. Chain-of-Verification (CoVe). The expanded queries undergo validation by LLM to achieve the effect of reducing hallucinations. Validated expanded queries typically exhibit higher reliability [93]. 2) Query Transformation: The core concept is to retrieve chunks based on a transformed query instead of the user's original query. Query Rewrite. The original queries are not always optimal for LLM retrieval, especially in real-world scenarios. Therefore, we can prompt LLM to rewrite the queries. 
In addition to using LLM for query rewriting, specialized smaller language"},{"citing_arxiv_id":"2311.05232","ref_index":77,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions","primary_cat":"cs.CL","submitted_at":"2023-11-09T09:25:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"The scarcity of annotated data further constrains their applicability. In response to this challenge, a surge of research explores leveraging data-augmentation methods to construct synthetical data for fine-tuning the classifier, either by rule-based perturbation [79, 152, 266] or generation [389]. QA-based Metrics. In contrast to classifier-based metrics, QA-based metrics [77, 119, 271, 310] have recently garnered attention for their enhanced ability to capture information overlap between the model's generation and its source. These metrics operate by initially selecting target answers from the information units within the LLM's output, and then questions are generated by the question-generation module. 
The questions are subsequently used to generate source answers based"},{"citing_arxiv_id":"2310.11511","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection","primary_cat":"cs.CL","submitted_at":"2023-10-17T18:18:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Self-RAG trains LLMs to adaptively retrieve passages on demand and self-critique using reflection tokens, outperforming ChatGPT and retrieval-augmented Llama2 on QA, reasoning, and fact verification.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2309.01219","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models","primary_cat":"cs.CL","submitted_at":"2023-09-03T16:56:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A literature survey that taxonomizes hallucination phenomena in LLMs, reviews evaluation benchmarks, and analyzes approaches for their detection, explanation, and mitigation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"2020. How much knowledge can you pack into the parameters of a language model? In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5418-5426. Stephen Robertson, Hugo Zaragoza, et al. 2009. The probabilistic relevance framework: Bm25 and beyond. Foundations and Trends® in Information Retrieval, 3(4):333-389. Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. 2022. 
Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100. John Schulman. 2023. Reinforcement learning from human feedback: Progress and challenges."}],"limit":50,"offset":0}