{"total":17,"items":[
{"citing_arxiv_id":"2605.10930","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Evaluating the False Trust engendered by LLM Explanations","primary_cat":"cs.HC","submitted_at":"2026-05-11T17:58:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A user study finds that LLM reasoning traces and post-hoc explanations create false trust by increasing acceptance of incorrect answers, whereas contrastive dual explanations improve users' ability to detect errors.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2605.09023","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Using Semantic Distance to Estimate Uncertainty in LLM-Based Code Generation","primary_cat":"cs.SE","submitted_at":"2026-05-09T16:02:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Semantic distance on program execution behaviors improves uncertainty estimation for LLM code generation and outperforms prior sample-based methods across benchmarks and models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2604.25634","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The Surprising Universality of LLM Outputs: A Real-Time Verification Primitive","primary_cat":"cs.CR","submitted_at":"2026-04-28T13:35:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLM token rank-frequency distributions converge to a shared Mandelbrot distribution across models and domains, enabling a microsecond-scale statistical primitive for provenance verification and black-box anomaly triage.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2604.20131","ref_index":144,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Whose Story Gets Told? Positionality and Bias in LLM Summaries of Life Narratives","primary_cat":"cs.CL","submitted_at":"2026-04-22T02:58:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A proposed pipeline shows LLMs introduce detectable race and gender biases when summarizing life narratives, creating potential for representational harm in research.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2604.17487","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Answer Only as Precisely as Justified: Calibrated Claim-Level Specificity Control for Agentic Systems","primary_cat":"cs.CL","submitted_at":"2026-04-19T15:20:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Compositional selective specificity (CSS) improves overcommitment-aware utility from 0.846 to 0.913 on LongFact while retaining 0.938 specificity by calibrating claim-level backoffs in agentic AI responses.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2604.17200","ref_index":75,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Calibrating Model-Based Evaluation Metrics for Summarization","primary_cat":"cs.CL","submitted_at":"2026-04-19T02:04:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A reference-free proxy scoring framework combined with GIRB calibration produces better-aligned evaluation metrics for summarization and outperforms baselines across seven datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2604.17112","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Complementing Self-Consistency with Cross-Model Disagreement for Uncertainty Quantification","primary_cat":"cs.AI","submitted_at":"2026-04-18T19:00:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Cross-model semantic disagreement adds an epistemic uncertainty term that improves total uncertainty estimation over self-consistency alone, helping flag confident errors in LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2604.16706","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Evaluating Tool-Using Language Agents: Judge Reliability, Propagation Cascades, and Runtime Mitigation in AgentProp-Bench","primary_cat":"cs.AI","submitted_at":"2026-04-17T21:15:35+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AgentProp-Bench shows substring judging agrees with humans at kappa=0.049, LLM ensemble at 0.432, bad-parameter injection propagates with ~0.62 probability, rejection and recovery are independent, and a runtime fix cuts hallucinations 23pp on GPT-4o-mini but not Gemini-2.0-Flash.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2604.15945","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RAGognizer: Hallucination-Aware Fine-Tuning via Detection Head Integration","primary_cat":"cs.CL","submitted_at":"2026-04-17T11:07:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RAGognizer adds a detection head to LLMs for joint training on generation and token-level hallucination detection, yielding SOTA detection and fewer hallucinations in RAG while preserving output quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2604.15558","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Preregistered Belief Revision Contracts","primary_cat":"cs.AI","submitted_at":"2026-04-16T22:22:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PBRC is a contract protocol that enforces evidential belief updates in deliberative multi-agent systems and proves it prevents conformity-driven false cascades under conservative fallbacks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2604.13258","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Hessian-Enhanced Token Attribution (HETA): Interpreting Autoregressive LLMs","primary_cat":"cs.CL","submitted_at":"2026-04-14T19:43:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"HETA is a new attribution framework for decoder-only LLMs that combines semantic transition vectors, Hessian-based sensitivity scores, and KL divergence to produce more faithful and human-aligned token attributions than prior methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2604.11141","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Reducing Hallucination in Enterprise AI Workflows via Hybrid Utility Minimum Bayes Risk (HUMBR)","primary_cat":"cs.LG","submitted_at":"2026-04-13T07:57:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"HUMBR reduces LLM hallucinations in enterprise workflows by using a hybrid semantic-lexical utility within minimum Bayes risk decoding to identify consensus outputs, with derived error bounds and reported outperformance over self-consistency on benchmarks and production data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2604.09056","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Conversations Risk Detection LLMs in Financial Agents via Multi-Stage Generative Rollout","primary_cat":"cs.CR","submitted_at":"2026-04-10T07:29:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"FinSec is a multi-stage detection system for financial LLM dialogues that reaches 90.13% F1 score, cuts attack success rate to 9.09%, and raises AUPRC to 0.9189.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2404.16130","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Local to Global: A Graph RAG Approach to Query-Focused Summarization","primary_cat":"cs.CL","submitted_at":"2024-04-24T18:38:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GraphRAG improves comprehensiveness and diversity of answers to global questions over million-token document sets by constructing entity graphs and hierarchical community summaries before combining partial responses.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2311.05232","ref_index":214,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions","primary_cat":"cs.CL","submitted_at":"2023-11-09T09:25:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Further compounding this issue is the capability of LLMs to generate highly convincing and human-like responses [265], which makes detecting these hallucinations particularly challenging, thereby complicating the practical deployment of LLMs, especially real-world information retrieval (IR) systems that have integrated into our daily lives like chatbots [8, 231], search engines [4, 214], and recommender systems [97, 171]. Given that the information provided by these systems can directly influence decision-making, any misleading information has the potential to spread false beliefs, or even cause harm. Notably, hallucinations in conventional natural language generation (NLG) tasks have been extensively studied [125, 136], with hallucinations defined as generated content that is either"},
{"citing_arxiv_id":"2309.07864","ref_index":161,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The Rise and Potential of Large Language Model Based Agents: A Survey","primary_cat":"cs.AI","submitted_at":"2023-09-14T17:12:03+00:00","verdict":"ACCEPT","verdict_confidence":"HIGH","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Xu et al. [151], Cobbe et al. [152], Thirunavukarasu et al. [153], Lai et al. [154], Madaan et al. [150], etc. Potential issues of knowledge Edit wrong and outdated knowledge AlKhamissi et al. [155], Kemker et al. [156], Cao et al. [157], Yao et al. [158], Mitchell et al. [159], etc. Mitigate hallucination Manakul et al. [160], Qin et al. [94], Li et al. [161], Gou et al. [162], etc. Memory §3.1.3 Memory capability Raising the length limit of Transformers BART [163], Park et al. [164], LongT5 [165], CoLT5 [166], Ruoss et al. [167], etc. Summarizing memory Generative Agents [22], SCM [168], Reflexion [169], MemoryBank [170], ChatEval [171], etc. Compressing memories with vectors or data structures ChatDev [109], GITM [172], RET-LLM"},
{"citing_arxiv_id":"2304.12244","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"WizardLM: Empowering large pre-trained language models to follow complex instructions","primary_cat":"cs.CL","submitted_at":"2023-04-24T16:31:06+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"WizardLM uses LLM-driven iterative rewriting to generate complex instruction data and fine-tunes LLaMA to reach over 90% of ChatGPT capacity on 17 of 29 evaluated skills.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}
],"limit":50,"offset":0}