pith. sign in

arxiv: 2606.22728 · v1 · pith:KIISQNIQnew · submitted 2026-06-22 · 💻 cs.CL · cs.AI

When Confidence Takes the Wrong Path: Diagnosing Retrieval-State Lock-In in RAG

Pith reviewed 2026-06-26 09:08 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords retrieval-augmented generationretrieval-state lock-inuncertainty estimationknowledge graph RAGanswer dispersionconfidence calibrationRAG trustworthiness
0
0 comments X

The pith

RAG answer agreement can signal a locked-in wrong retrieval state rather than correctness, and a three-check rule reaches 91.9% pooled precision.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that repeated samples in retrieval-augmented generation can agree because they all condition on the same defective retrieval state, either empty or filled with coherent but incorrect information. This retrieval-state lock-in makes agreement-based confidence unreliable since the error is stable across samples. The authors separate the answer surface, retrieved evidence, and retrieval state to measure the problem directly in an ontology-guided knowledge-graph RAG system across six question-answering snapshots. At five samples per question, 42% of KG-RAG errors and 59% of dense-retrieval errors show zero answer dispersion. A decision rule that accepts an answer only when all three checks agree it is low-risk achieves 91.9% pooled precision against a 69.7% accept-all baseline, though it certifies only 7.7% of answers as low-risk.

Core claim

Retrieval-state lock-in occurs when sampled answers agree because they share the same defective retrieval state rather than because the answer is correct. The paper diagnoses this by decomposing confidence into three objects: the answer surface, the retrieved evidence, and the retrieval state itself. In the tested KG-RAG system, 42% of errors at five samples carry zero answer dispersion, so agreement supplies no ranking signal, while evidence and retrieval-state checks still flag most of them. The resulting auditable decision rule accepts an answer only when answer, evidence, and retrieval checks all indicate low risk, reaching 91.9% pooled precision against a 69.7% accept-all rate and 7.7%

What carries the argument

retrieval-state lock-in, the condition in which sampled answers agree because they condition on the same defective retrieval state.

If this is right

  • 42% of KG-RAG errors and 59% of dense-retrieval errors carry zero answer dispersion at five samples per question.
  • Evidence and retrieval-state checks flag most zero-dispersion errors that answer agreement alone cannot rank.
  • The three-check rule reaches 91.9% pooled precision while certifying 7.7% of answers as low-risk.
  • On the clinical calibration domain the rule reaches 100% precision under an automated judge.
  • Confidence in RAG must be treated as object-specific rather than a single black-box score.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar lock-in rates may occur in RAG architectures that were not tested in the six snapshots.
  • The three-check approach could be combined with other uncertainty methods to raise coverage while preserving precision.
  • Real-world deployment would still require human validation beyond the automated judge used in the clinical domain.

Load-bearing premise

The six question-answering snapshots and the particular ontology-guided KG-RAG implementation are representative enough that the measured lock-in rates and three-check rule will generalize to other RAG architectures and domains.

What would settle it

Applying the three-check rule to a RAG system or domain outside the six tested snapshots and checking whether precision stays near 91.9 percent.

Figures

Figures reproduced from arXiv: 2606.22728 by Sahib Julka.

Figure 1
Figure 1. Figure 1: Worked presence-lock-in trace (the HotpotQA Osireion case, also in Appendix [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: OntoGraphRAG pipeline with the three measurement taps. [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Answer-state scores rank errors most consistently; GPS discriminates only in its calibration domain (RealMedQA). [PITH_FULL_IMAGE:figures/full_fig_p014_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Under strict retrieval the answer surface collapses while evidence support stays separated (RealMedQA: dense, adaptive, [PITH_FULL_IMAGE:figures/full_fig_p015_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: The useful diagnostic changes with the retrieval regime. [PITH_FULL_IMAGE:figures/full_fig_p017_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Full per-dataset, per-system view of the archived concentration–dispersion traces [PITH_FULL_IMAGE:figures/full_fig_p032_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Lock-in as migration into the low-dispersion corner (2WikiMultiHopQA). [PITH_FULL_IMAGE:figures/full_fig_p037_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Family disagreement in the pooled adaptive KG runs. [PITH_FULL_IMAGE:figures/full_fig_p037_8.png] view at source ↗
read the original abstract

The trustworthiness of a retrieval-augmented generation (RAG) system depends on more than the answer it returns, yet many black-box uncertainty methods still read agreement among sampled answers as confidence. That inference fails when repeated samples condition on the same defective retrieval state. The state may be empty, with the model falling back on parametric memory, or populated by a coherent but wrong neighbourhood. In either case, the answers agree because the error is stable. The problem is recognised in deployed RAG, but it has lacked a name, a measurable signature, and a prevalence bound. We supply all three. We name the failure retrieval-state lock-in and diagnose it by separating the three objects a single confidence score conflates: the answer surface, the retrieved evidence, and the retrieval state itself. In an inspectable, ontology-guided knowledge-graph RAG (KG-RAG) system across six question-answering snapshots, we measure the agreement blind spot directly: at five samples per question, 42% of KG-RAG errors and 59% of dense-retrieval errors carry zero answer dispersion, so agreement has nothing to rank, while evidence- and retrieval-state checks still flag most of them. The decomposition supports an auditable decision rule: accepting an answer only when answer, evidence, and retrieval checks all agree that it is low-risk reaches 91.9% pooled precision against a 69.7% accept-all rate. The cost is coverage: it certifies only 7.7% of answers as low-risk. On the clinical calibration domain it reaches 100% precision under an automated judge; this is an in-domain automated-label upper bound, not a clinical safety claim, and still needs human validation. Confidence in RAG is object-specific: when answers agree, the useful question is which part of the pipeline to distrust.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that standard answer-agreement based confidence in RAG can fail due to 'retrieval-state lock-in', where defective retrieval states lead to consistent but wrong answers across samples. In experiments on an ontology-guided KG-RAG system with six QA snapshots, they find 42% of KG-RAG errors and 59% of dense-retrieval errors have zero answer dispersion at 5 samples. They introduce a three-check rule (answer, evidence, retrieval-state) that achieves 91.9% precision at 7.7% coverage, compared to 69.7% for accepting all, with 100% on clinical domain under automated judge.

Significance. This work offers a concrete diagnosis and measurable signature for a recognized issue in RAG trustworthiness, along with an auditable decision rule that trades coverage for higher precision. The explicit decomposition of answer, evidence, and retrieval state, and the reporting of specific percentages on a real system, provide useful empirical grounding. If the lock-in phenomenon and the rule's performance hold more broadly, it could inform better uncertainty estimation practices in retrieval-augmented systems.

major comments (2)
  1. [Experimental results on six snapshots] The reported lock-in rates of 42% for KG-RAG and 59% for dense-retrieval errors with zero dispersion lack error bars, statistical tests, or confidence intervals, which weakens the strength of the claim that agreement has nothing to rank in these cases.
  2. [Abstract and experimental setup] The prevalence bounds and the 91.9% precision of the three-check rule are derived from a single ontology-guided KG-RAG implementation plus dense baseline; no replication or ablation on standard vector RAG, LLM rerankers, or other domains is provided, making generalization to deployed RAG systems a load-bearing assumption that requires further support.
minor comments (1)
  1. [Clinical calibration domain] The 100% precision result is correctly labeled as an automated upper bound requiring human validation, but the presentation could more explicitly discuss the limitations of the automated judge.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the scope and presentation of our results. We address each major comment below.

read point-by-point responses
  1. Referee: [Experimental results on six snapshots] The reported lock-in rates of 42% for KG-RAG and 59% for dense-retrieval errors with zero dispersion lack error bars, statistical tests, or confidence intervals, which weakens the strength of the claim that agreement has nothing to rank in these cases.

    Authors: We agree this is a valid point. The six snapshots provide the basis for the pooled rates, but we did not report uncertainty. In the revised manuscript we will add bootstrap confidence intervals computed by resampling across snapshots and will report per-snapshot variation where feasible. This will make the descriptive claim about zero-dispersion subsets more robust while preserving the core observation that agreement supplies no ranking information inside those subsets. revision: yes

  2. Referee: [Abstract and experimental setup] The prevalence bounds and the 91.9% precision of the three-check rule are derived from a single ontology-guided KG-RAG implementation plus dense baseline; no replication or ablation on standard vector RAG, LLM rerankers, or other domains is provided, making generalization to deployed RAG systems a load-bearing assumption that requires further support.

    Authors: The choice of an ontology-guided, inspectable KG-RAG system was deliberate: only in such a setting can the retrieval state be directly audited to diagnose lock-in. The dense baseline serves as a controlled contrast rather than a comprehensive ablation. We do not assert that the exact 42 % / 59 % or 91.9 % figures generalize; the contribution is the identification of the failure mode and the three-object decomposition. In revision we will expand the limitations and discussion sections to state the scope explicitly and to call for replication on black-box vector RAG and other domains. revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical measurements are direct and self-contained

full rationale

The paper reports direct empirical counts of zero-dispersion errors (42% KG-RAG, 59% dense) and the measured precision (91.9%) of a rule that requires agreement across three distinct pipeline objects on held-out snapshots. No equations or derivations reduce a claimed result to a fitted parameter or self-citation by construction; the decision rule is not optimized against the target metric but evaluated as a conservative conjunction, and no load-bearing premise depends on prior author work. The analysis remains within its stated experimental scope without renaming known results or smuggling ansatzes.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The central measurements rest on the assumption that the chosen KG-RAG system exposes separable retrieval states and that the six snapshots are representative; no free parameters are explicitly fitted beyond the fixed sample count of five.

free parameters (1)
  • number of samples
    Fixed at five per question; affects the zero-dispersion statistic.
axioms (1)
  • domain assumption The KG-RAG implementation provides inspectable retrieval states that can be checked independently of the generated answer.
    Invoked to separate the three objects and to run the retrieval-state check.
invented entities (1)
  • retrieval-state lock-in no independent evidence
    purpose: Names the stable error condition in which repeated samples condition on the same defective retrieval state.
    Newly introduced term; no independent evidence outside the paper's measurements is supplied.

pith-pipeline@v0.9.1-grok · 5866 in / 1529 out tokens · 28087 ms · 2026-06-26T09:08:58.429339+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

82 extracted references · 18 canonical work pages

  1. [1]

    Amugongo, Paola Mascheroni, Sarah Brooks, Susanne Doering, and Jan Seidel

    Lameck Mbangula Amugongo, Pietro Mascheroni, Steven Geoffrey Brooks, Stefan Doering, and Jan Seidel. Retrieval augmented generation for large language models in healthcare: A systematic review. PLOS Digital Health, 4 0 (6): 0 e0000877, 2025. doi:10.1371/journal.pdig.0000877. URL https://journals.plos.org/digitalhealth/article?id=10.1371/journal.pdig.0000877

  2. [2]

    Self-RAG : Learning to retrieve, generate, and critique through self-reflection

    Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-RAG : Learning to retrieve, generate, and critique through self-reflection. In International Conference on Learning Representations (ICLR), 2024

  3. [3]

    INTRYGUE : Induction-aware entropy gating for reliable RAG uncertainty estimation

    Alexandra Bazarova, Andrei Volodichev, Daria Kotova, and Alexey Zaytsev. INTRYGUE : Induction-aware entropy gating for reliable RAG uncertainty estimation. arXiv preprint arXiv:2603.21607, 2026

  4. [4]

    Campos, Ant \'o nio Farinhas, Chrysoula Zerva, M \'a rio A

    Margarida M. Campos, Ant \'o nio Farinhas, Chrysoula Zerva, M \'a rio A. T. Figueiredo, and Andr \'e F. T. Martins. Conformal prediction for natural language processing: A survey. Transactions of the Association for Computational Linguistics, 12: 0 1619--1638, 2024. doi:10.1162/tacl_a_00715

  5. [5]

    Koedinger

    Eason Chen, Chuangji Li, Shizhuo Li, Zimo Xiao, Jionghao Lin, and Kenneth R. Koedinger. Comparing RAG and GraphRAG for page-level retrieval question answering on math textbook. arXiv preprint arXiv:2509.16780, 2025

  6. [6]

    From local to global: A graph RAG approach to query-focused summarization

    Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson. From local to global: A graph RAG approach to query-focused summarization. arXiv preprint arXiv:2404.16130, 2024

  7. [7]

    RAGA s: Automated Evaluation of Retrieval Augmented Generation

    Shahul Es, Jithin James, Luis Espinosa-Anke, and Steven Schockaert. RAGAs : Automated evaluation of retrieval augmented generation. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 150--158, 2024. doi:10.18653/v1/2024.eacl-demo.16. URL https://aclanthology.org/2024...

  8. [8]

    Faithfulness-aware uncertainty quantification for fact-checking the output of retrieval augmented generation

    Ekaterina Fadeeva, Aleksandr Rubashevskii, Dzianis Piatrashyn, Roman Vashurin, Shehzaad Dhuliawala, Artem Shelmanov, Timothy Baldwin, Preslav Nakov, Mrinmaya Sachan, and Maxim Panov. Faithfulness-aware uncertainty quantification for fact-checking the output of retrieval augmented generation. arXiv preprint arXiv:2505.21072, 2025

  9. [9]

    Detecting hallucinations in large language models using semantic entropy

    Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. Detecting hallucinations in large language models using semantic entropy. Nature, 630: 0 625--630, 2024

  10. [10]

    Selective classification for deep neural networks

    Yonatan Geifman and Ran El-Yaniv. Selective classification for deep neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2017

  11. [11]

    Weinberger

    Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), 2017

  12. [12]

    HippoRAG : Neurobiologically inspired long-term memory for large language models

    Bernal Jim \'e nez Guti \'e rrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su. HippoRAG : Neurobiologically inspired long-term memory for large language models. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  13. [13]

    Aggarwal, and Jiliang Tang

    Haoyu Han, Li Ma, Yu Wang, Harry Shomer, Yongjia Lei, Zhisheng Qi, Kai Guo, Zhigang Hua, Bo Long, Hui Liu, Charu C. Aggarwal, and Jiliang Tang. RAG vs.\ GraphRAG : A systematic evaluation and key insights. arXiv preprint arXiv:2502.11371, 2025

  14. [14]

    DeBERTa : Decoding-enhanced BERT with disentangled attention

    Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. DeBERTa : Decoding-enhanced BERT with disentangled attention. In International Conference on Learning Representations (ICLR), 2021. URL https://openreview.net/forum?id=XPZIaotutsD

  15. [15]

    Chawla, Thomas Laurent, Yann LeCun, Xavier Bresson, and Bryan Hooi

    Xiaoxin He, Yijun Tian, Yifei Sun, Nitesh V. Chawla, Thomas Laurent, Yann LeCun, Xavier Bresson, and Bryan Hooi. G-Retriever : Retrieval-augmented generation for textual graph understanding and question answering. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  16. [16]

    Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps

    Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In Proceedings of the 28th International Conference on Computational Linguistics (COLING), 2020

  17. [17]

    GRAG : Graph Retrieval-Augmented Generation

    Yuntong Hu, Zhihan Lei, Zheng Zhang, Bo Pan, Chen Ling, and Liang Zhao. GRAG : Graph retrieval-augmented generation. In Findings of the Association for Computational Linguistics: NAACL 2025 , pages 4145--4157, 2025. doi:10.18653/v1/2025.findings-naacl.232. URL https://aclanthology.org/2025.findings-naacl.232/

  18. [18]

    TrustLLM : Trustworthiness in Large Language Models

    Yue Huang, Lichao Sun, Haoran Wang, Siyuan Wu, Qihui Zhang, Yuan Li, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan Zhang, Xiner Li, et al. TrustLLM : Trustworthiness in Large Language Models . arXiv preprint arXiv:2401.05561, 2024

  19. [19]

    StructGPT : A general framework for large language model to reason over structured data

    Jinhao Jiang, Kun Zhou, Zican Dong, Keming Ye, Xin Zhao, and Ji-Rong Wen. StructGPT : A general framework for large language model to reason over structured data. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023 a

  20. [20]

    Active Retrieval Augmented Generation

    Zhengbao Jiang, Frank Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. Active retrieval augmented generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7969--7992, Singapore, 2023 b . Association for Computational Linguistics. doi:10.18653/v1/2023.emn...

  21. [21]

    Language models (mostly) know what they know

    Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston, Sheer El-Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, Deep Ganguli, Danny Hernandez, Josh Jacobson, Jackson Kernion, Shauna Kravec,...

  22. [22]

    Semantic entropy probes: Robust and cheap hallucination detection in LLMs

    Jannik Kossen, Jiatong Han, Muhammed Razzak, Lisa Schut, Shreshth Malik, and Yarin Gal. Semantic entropy probes: Robust and cheap hallucination detection in LLMs . In International Conference on Learning Representations (ICLR), 2025

  23. [23]

    Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation

    Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In International Conference on Learning Representations (ICLR), 2023

  24. [24]

    Decomposing uncertainty in probabilistic knowledge graph embeddings: Why entity variance is not enough

    Chorok Lee. Decomposing uncertainty in probabilistic knowledge graph embeddings: Why entity variance is not enough. arXiv preprint arXiv:2512.22318, 2025

  25. [25]

    u ttler, Mike Lewis, Wen-tau Yih, Tim Rockt \

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich K \"u ttler, Mike Lewis, Wen-tau Yih, Tim Rockt \"a schel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems (NeurIPS), 2020

  26. [26]

    Simple is effective: The roles of graphs and large language models in knowledge-graph-based retrieval-augmented generation

    Mufei Li, Siqi Miao, and Pan Li. Simple is effective: The roles of graphs and large language models in knowledge-graph-based retrieval-augmented generation. In International Conference on Learning Representations (ICLR), 2025 a

  27. [27]

    Citation-enhanced generation for LLM -based chatbots

    Weitao Li, Junkai Li, Weizhi Ma, and Yang Liu. Citation-enhanced generation for LLM -based chatbots. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1451--1466, 2024 a . doi:10.18653/v1/2024.acl-long.79. URL https://aclanthology.org/2024.acl-long.79/

  28. [28]

    Semantic volume: Quantifying and detecting both external and internal uncertainty in LLMs

    Xiaomin Li, Zhou Yu, Ziji Zhang, Yingying Zhuang, Swair Shah, Narayanan Sadagopan, and Anurag Beniwal. Semantic volume: Quantifying and detecting both external and internal uncertainty in LLMs . arXiv preprint arXiv:2502.21239, 2025 b

  29. [29]

    UncertaintyRAG : Span-level uncertainty enhanced long-context modeling for retrieval-augmented generation

    Zixuan Li, Jing Xiong, Fanghua Ye, Chuanyang Zheng, Xun Wu, Jianqiao Lu, Zhongwei Wan, Xiaodan Liang, Chengming Li, Zhenan Sun, Lingpeng Kong, and Ngai Wong. UncertaintyRAG : Span-level uncertainty enhanced long-context modeling for retrieval-augmented generation. arXiv preprint arXiv:2410.02719, 2024 b

  30. [30]

    Teaching models to express their uncertainty in words

    Stephanie Lin, Jacob Hilton, and Owain Evans. Teaching models to express their uncertainty in words. In Transactions of the Association for Computational Linguistics (TACL), 2022

  31. [31]

    CtrlA : Adaptive retrieval-augmented generation via inherent control

    Huanshuo Liu, Hao Zhang, Zhijiang Guo, Jing Wang, Kuicai Dong, Xiangyang Li, Yi Quan Lee, Cong Zhang, and Yong Liu. CtrlA : Adaptive retrieval-augmented generation via inherent control. In Findings of the Association for Computational Linguistics: ACL 2025, pages 12592--12618, Vienna, Austria, 2025. Association for Computational Linguistics. doi:10.18653/...

  32. [32]

    NAACL : Noise- A w A re verbal confidence calibration for robust large language models in RAG systems

    Jiayu Liu, Rui Wang, Qing Zong, Yumeng Wang, Cheng Qian, Qingcheng Zeng, Tianshi Zheng, Haochen Shi, Dadi Guo, Baixuan Xu, Chunyang Li, and Yangqiu Song. NAACL : Noise- A w A re verbal confidence calibration for robust large language models in RAG systems. arXiv preprint arXiv:2601.11004, 2026 a

  33. [33]

    TruthfulRAG : Resolving factual-level conflicts in retrieval-augmented generation with knowledge graphs

    Shuyi Liu, Yuming Shang, and Xi Zhang. TruthfulRAG : Resolving factual-level conflicts in retrieval-augmented generation with knowledge graphs. In Proceedings of the AAAI Conference on Artificial Intelligence, 2026 b

  34. [34]

    Reasoning on graphs: Faithful and interpretable large language model reasoning

    Linhao Luo, Yuan-Fang Li, Gholamreza Haffari, and Shirui Pan. Reasoning on graphs: Faithful and interpretable large language model reasoning. In International Conference on Learning Representations (ICLR), 2024

  35. [35]

    Think-on-graph 2.0: Deep and faithful large language model reasoning with knowledge-guided retrieval augmented generation

    Shengjie Ma, Chengjin Xu, Xuhui Jiang, Muzhi Li, Huaren Qu, Cehao Yang, Jiaxin Mao, and Jian Guo. Think-on-graph 2.0: Deep and faithful large language model reasoning with knowledge-guided retrieval augmented generation. In International Conference on Learning Representations (ICLR), 2025

  36. [36]

    Potsawee Manakul, Adian Liusie, and Mark J. F. Gales. SelfCheckGPT : Zero-resource black-box hallucination detection for generative large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

  37. [37]

    GNN-RAG : Graph neural retrieval for large language model reasoning

    Costas Mavromatis and George Karypis. GNN-RAG : Graph neural retrieval for large language model reasoning. In Findings of the Association for Computational Linguistics (ACL), 2025

  38. [38]

    Adaptive retrieval without self-knowledge? bringing uncertainty back home

    Viktor Moskvoretskii, Maria Marina, Mikhail Salnikov, Nikolay Ivanov, Sergey Pletenev, Daria Galimzianova, Nikita Krayko, Vasily Konovalov, Irina Nikishina, and Alexander Panchenko. Adaptive retrieval without self-knowledge? bringing uncertainty back home. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1...

  39. [39]

    Towards trustworthy retrieval augmented generation for large language models: A survey

    Bo Ni, Zheyuan Liu, Leyao Wang, Yongjia Lei, Yuying Zhao, Xueqi Cheng, Qingkai Zeng, Luna Dong, Yinglong Xia, Krishnaram Kenthapadi, et al. Towards trustworthy retrieval augmented generation for large language models: A survey. arXiv preprint arXiv:2502.06872, 2025

  40. [40]

    Kernel language entropy: Fine-grained uncertainty quantification for LLMs from semantic similarities

    Alexander Nikitin, Jannik Kossen, Yarin Gal, and Pekka Marttinen. Kernel language entropy: Fine-grained uncertainty quantification for LLMs from semantic similarities. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  41. [41]

    GPT-4o mini : Advancing cost-efficient intelligence

    OpenAI . GPT-4o mini : Advancing cost-efficient intelligence. Technical report, OpenAI, 2024. URL https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/

  42. [42]

    Bowman, and Shi Feng

    Arjun Panickssery, Samuel R. Bowman, and Shi Feng. LLM evaluators recognize and favor their own generations. In Advances in Neural Information Processing Systems 37 (NeurIPS), 2024

  43. [43]

    Graph retrieval-augmented generation: A survey

    Boci Peng, Yun Zhu, Yongchao Liu, Xiaohe Bo, Haizhou Shi, Chuntao Hong, Yan Zhang, and Siliang Tang. Graph retrieval-augmented generation: A survey. ACM Transactions on Information Systems , 2024

  44. [44]

    Uncertainty quantification in retrieval augmented question answering

    Laura Perez-Beltrachini and Mirella Lapata. Uncertainty quantification in retrieval augmented question answering. Transactions on Machine Learning Research (TMLR), 2025

  45. [45]

    SURE-RAG : Sufficiency and uncertainty-aware evidence verification for selective retrieval-augmented generation

    Jingxi Qiu, Zeyu Han, and Cheng Huang. SURE-RAG : Sufficiency and uncertainty-aware evidence verification for selective retrieval-augmented generation. arXiv preprint arXiv:2605.03534, 2026

  46. [46]

    Semantic density: Uncertainty quantification for large language models through confidence measurement in semantic space

    Xin Qiu and Risto Miikkulainen. Semantic density: Uncertainty quantification for large language models through confidence measurement in semantic space. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  47. [47]

    Sentence- BERT : Sentence embeddings using siamese BERT -networks

    Nils Reimers and Iryna Gurevych. Sentence- BERT : Sentence embeddings using siamese BERT -networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019

  48. [48]

    When to trust: A causality-aware calibration framework for accurate knowledge graph retrieval-augmented generation

    Jing Ren, Bowen Li, Ziqi Xu, Xikun Zhang, Haytham Fayek, and Xiaodong Li. When to trust: A causality-aware calibration framework for accurate knowledge graph retrieval-augmented generation. arXiv preprint arXiv:2601.09241, 2026

  49. [49]

    ARES : An Automated Evaluation Framework for Retrieval-Augmented Generation Systems

    Jon Saad-Falcon, Omar Khattab, Christopher Potts, and Matei Zaharia. ARES : An automated evaluation framework for retrieval-augmented generation systems. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 338--354, Mexico City, Mexico, 2024. Association f...

  50. [50]

    Zhili Shen, Chenxin Diao, Pavlos Vougiouklis, Pascual Merita, Shriram Piramanayagam, Enting Chen, Damien Graux, Andre Melo, Ruofei Lai, Zeren Jiang, Zhongyang Li, Ye Qi, Yang Ren, Dandan Tu, and Jeff Z. Pan. GeAR : Graph-enhanced agent for retrieval-augmented generation. In Findings of the Association for Computational Linguistics: ACL 2025 , pages 12049-...

  51. [51]

    Why uncertainty estimation methods fall short in RAG : An axiomatic analysis

    Heydar Soudani, Evangelos Kanoulas, and Faegheh Hasibi. Why uncertainty estimation methods fall short in RAG : An axiomatic analysis. In Findings of the Association for Computational Linguistics: ACL 2025, pages 16596--16616, Vienna, Austria, 2025 a . Association for Computational Linguistics. URL https://aclanthology.org/2025.findings-acl.852/

  52. [52]

    Uncertainty quantification for retrieval-augmented reasoning

    Heydar Soudani, Hamed Zamani, and Faegheh Hasibi. Uncertainty quantification for retrieval-augmented reasoning. arXiv preprint arXiv:2510.11483, 2025 b

  53. [53]

    doi:10.18653/V1/2024.ACL-LONG.702 , url =

    Weihang Su, Yichen Tang, Qingyao Ai, Zhijing Wu, and Yiqun Liu. DRAGIN : Dynamic retrieval augmented generation based on the real-time information needs of large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12991--13013, Bangkok, Thailand, 2024. Association for C...

  54. [54]

    Think-on-graph: Deep and responsible reasoning of large language model on knowledge graph

    Jiashuo Sun, Chengjin Xu, Lumingyuan Tang, Shengjie Wang, Chen Lin, Yeyun Gong, Heung-Yeung Shum, and Jian Guo. Think-on-graph: Deep and responsible reasoning of large language model on knowledge graph. In International Conference on Learning Representations (ICLR), 2024

  55. [55]

    Uncertainty-aware dynamic knowledge graphs for reliable question answering

    Yu Takahashi, Shun Takeuchi, Kexuan Xin, Guillaume Pelat, Yoshiaki Ikai, Junya Saito, Jonathan Vitale, Shlomo Berkovsky, and Amin Beheshti. Uncertainty-aware dynamic knowledge graphs for reliable question answering. arXiv preprint arXiv:2601.09720, 2026

  56. [56]

    Manning, and Chelsea Finn

    Katherine Tian, Eric Mitchell, Huaxiu Yao, Christopher D. Manning, and Chelsea Finn. Does my LLM need a better evaluator? Just ask for calibration. In arXiv preprint arXiv:2310.02415, 2023

  57. [57]

    Uncertainty-based abstention in LLMs improves safety and reduces hallucinations

    Christian Tomani, Kamalika Chaudhuri, Ivan Evtimov, Daniel Cremers, and Mark Ibrahim. Uncertainty-based abstention in LLMs improves safety and reduces hallucinations. arXiv preprint arXiv:2404.10960, 2024

  58. [58]

    MuSiQue : Multihop questions via single-hop question composition

    Harsh Trivedi, Niranjan Bauer, Tushar Khot, and Ashish Sabharwal. MuSiQue : Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10: 0 539--554, 2022

  59. [59]

    Semantic uncertainty quantification of hallucinations in LLMs : A quantum tensor network based method

    Pragatheeswaran Vipulanandan, Kamal Premaratne, and Dilip Sarkar. Semantic uncertainty quantification of hallucinations in LLMs : A quantum tensor network based method. arXiv preprint arXiv:2601.20026, 2026

  60. [60]

    L-RAG : Balancing context and retrieval with entropy-based lazy loading

    Sergii Voloshyn. L-RAG : Balancing context and retrieval with entropy-based lazy loading. arXiv preprint arXiv:2601.06551, 2026

  61. [61]

    Gruber, Thomas Decker, Yinchong Yang, Alireza Javanmardi, Eyke H \"u llermeier, and Florian Buettner

    Nassim Walha, Sebastian G. Gruber, Thomas Decker, Yinchong Yang, Alireza Javanmardi, Eyke H \"u llermeier, and Florian Buettner. Fine-grained uncertainty decomposition in large language models: A spectral approach. arXiv preprint arXiv:2509.22272, 2025

  62. [62]

    Correctness is not faithfulness in retrieval augmented generation attributions

    Jonas Wallat, Maria Heuss, Maarten de Rijke, and Avishek Anand. Correctness is not faithfulness in retrieval augmented generation attributions. In Proceedings of the 2025 International ACM SIGIR Conference on Innovative Concepts and Theories in Information Retrieval, pages 22--32. Association for Computing Machinery, 2025. doi:10.1145/3731120.3744592

  63. [63]

    Fei Wang, Xingchen Wan, Ruoxi Sun, Jiefeng Chen, and Sercan O. Arik. Astute RAG : Overcoming imperfect retrieval augmentation and knowledge conflicts for large language models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, 2025

  64. [64]

    Self-consistency improves chain of thought reasoning in language models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In Proceedings of the Eleventh International Conference on Learning Representations (ICLR), 2023

  65. [65]

    Augmenting textual generation via topology aware retrieval

    Yu Wang, Nedim Lipka, Ruiyi Zhang, Alexa Siu, Yuying Zhao, Bo Ni, Xin Wang, Ryan Rossi, and Tyler Derr. Augmenting textual generation via topology aware retrieval. arXiv preprint arXiv:2405.17602, 2024. doi:10.48550/arXiv.2405.17602

  66. [66]

    Medical graph RAG : Evidence-based medical large language model via graph retrieval-augmented generation

    Junde Wu, Jiayuan Zhu, Yunli Qi, Jingkun Chen, Min Xu, Filippo Menolascina, Yueming Jin, and Vicente Grau. Medical graph RAG : Evidence-based medical large language model via graph retrieval-augmented generation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 28443--28467, 2025. do...

  67. [67]

    Ho, and James Zou

    Kevin Wu, Eric Wu, Ally Cassasola, Angela Zhang, Kevin Wei, Teresa Nguyen, Sith Riantawan, Patricia Shi Riantawan, Daniel E. Ho, and James Zou. How well do LLMs cite relevant medical references? A n evaluation framework and analyses. arXiv preprint arXiv:2402.02008, 2024 a

  68. [68]

    ClashEval : Quantifying the tug-of-war between an LLM 's internal prior and external evidence

    Kevin Wu, Eric Wu, and James Zou. ClashEval : Quantifying the tug-of-war between an LLM 's internal prior and external evidence. arXiv preprint arXiv:2404.10198, 2024 b

  69. [69]

    When to use graphs in RAG : A comprehensive analysis for graph retrieval-augmented generation

    Zhishang Xiang, Chuanjie Wu, Qinggang Zhang, Shengyuan Chen, Zijin Hong, Xiao Huang, and Jinsong Su. When to use graphs in RAG : A comprehensive analysis for graph retrieval-augmented generation. arXiv preprint arXiv:2506.05690, 2025

  70. [70]

    Corrective retrieval augmented generation

    Shi-Qi Yan, Jia-Chen Gu, Yun Zhu, and Zhen-Hua Ling. Corrective retrieval augmented generation. arXiv preprint arXiv:2401.15884, 2024

  71. [71]

    Cohen, Ruslan Salakhutdinov, and Christopher D

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA : A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018

  72. [72]

    SeaKR : Self-aware knowledge retrieval for adaptive retrieval augmented generation

    Zijun Yao, Weijian Qi, Liangming Pan, Shulin Cao, Linmei Hu, Liu Weichuan, Lei Hou, and Juanzi Li. SeaKR : Self-aware knowledge retrieval for adaptive retrieval augmented generation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 27022--27043, Vienna, Austria, 2025. Association for...

  73. [73]

    FaithfulRAG : Fact-level conflict modeling for context-faithful retrieval-augmented generation

    Qinggang Zhang, Zhishang Xiang, Yilin Xiao, Le Wang, Junhui Li, Xinrun Wang, and Jinsong Su. FaithfulRAG : Fact-level conflict modeling for context-faithful retrieval-augmented generation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, pages 21863--21882. Association for Computational Linguistics, 2025. URL htt...

  74. [74]

    Xing, Hao Zhang, Joseph E

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM -as-a-judge with MT -bench and chatbot arena. In Advances in Neural Information Processing Systems 36 (NeurIPS), Datasets and Benchmarks Track, 2023

  75. [75]

    Bayan Bruss

    Wenqing Zheng, Dmitri Kalaev, Noah Fatsi, Daniel Barcklow, Owen Reinert, Igor Melnyk, Senthil Kumar, and C. Bayan Bruss. Revisiting RAG retrievers: An information theoretic benchmark. arXiv preprint arXiv:2602.21553, 2026

  76. [76]

    Poisoning retrieval corpora by injecting adversarial passages

    Zexuan Zhong, Ziqing Huang, Alexander Wettig, and Danqi Chen. Poisoning retrieval corpora by injecting adversarial passages. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

  77. [77]

    What breaks knowledge graph based RAG ? benchmarking and empirical insights into reasoning under incomplete knowledge

    Dongzhuoran Zhou, Yuqicheng Zhu, Xiaxia Wang, Hongkuan Zhou, Yuan He, Jiaoyan Chen, Steffen Staab, and Evgeny Kharlamov. What breaks knowledge graph based RAG ? benchmarking and empirical insights into reasoning under incomplete knowledge. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2026

  78. [78]

    Knowledge graph-guided retrieval augmented generation

    Xiangrong Zhu, Yuexiang Xie, Yi Liu, Yaliang Li, and Wei Hu. Knowledge graph-guided retrieval augmented generation. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 8912--8924, 2025 a . doi:10.18653/v1/2025.naacl-long.44...

  79. [79]

    Certainty in uncertainty: Reasoning over uncertain knowledge graphs with statistical guarantees

    Yuqicheng Zhu, Jingcheng Wu, Yizhen Wang, Hongkuan Zhou, Jiaoyan Chen, Evgeny Kharlamov, and Steffen Staab. Certainty in uncertainty: Reasoning over uncertain knowledge graphs with statistical guarantees. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 8730--8752, 2025 b . doi:10.18653/v1/2025.emnlp-main.44...

  80. [80]

    doi:10.18653/V1/2024.EMNLP-INDUSTRY.2 , url =

    Ilana Zimmerman, Jadin Tredup, Ethan Selfridge, and Joseph Bradley. Two-tiered encoder-based hallucination detection for retrieval-augmented generation in the wild. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 8--22, Miami, Florida, US, 2024. Association for Computational Linguistics. doi...

Showing first 80 references.