When Confidence Takes the Wrong Path: Diagnosing Retrieval-State Lock-In in RAG

Sahib Julka

arXiv: 2606.22728 · v1 · pith:KIISQNIQnew · submitted 2026-06-22 · cs.CL · cs.AI

Pith reviewed 2026-06-26 09:08 UTC · model grok-4.3

keywords: retrieval-augmented generation · retrieval-state lock-in · uncertainty estimation · knowledge graph RAG · answer dispersion · confidence calibration · RAG trustworthiness

The pith

RAG answer agreement can signal a locked-in wrong retrieval state rather than correctness, and a three-check rule reaches 91.9% pooled precision.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that repeated samples in retrieval-augmented generation can agree because they all condition on the same defective retrieval state, either empty or filled with coherent but incorrect information. This retrieval-state lock-in makes agreement-based confidence unreliable since the error is stable across samples. The authors separate the answer surface, retrieved evidence, and retrieval state to measure the problem directly in an ontology-guided knowledge-graph RAG system across six question-answering snapshots. At five samples per question, 42% of KG-RAG errors and 59% of dense-retrieval errors show zero answer dispersion. A decision rule that accepts an answer only when all three checks agree it is low-risk achieves 91.9% pooled precision against a 69.7% accept-all baseline, though it certifies only 7.7% of answers as low-risk.

Core claim

Retrieval-state lock-in occurs when sampled answers agree because they share the same defective retrieval state rather than because the answer is correct. The paper diagnoses this by decomposing confidence into three objects: the answer surface, the retrieved evidence, and the retrieval state itself. In the tested KG-RAG system, 42% of errors at five samples carry zero answer dispersion, so agreement supplies no ranking signal, while evidence and retrieval-state checks still flag most of them. The resulting auditable decision rule accepts an answer only when answer, evidence, and retrieval checks all indicate low risk, reaching 91.9% pooled precision against a 69.7% accept-all rate while certifying only 7.7% of answers as low-risk.
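The zero-dispersion statistic in the core claim can be illustrated with a minimal sketch. The paper's exact dispersion measure is not given here, so the code assumes a simple stand-in: one minus the share of the most common normalized answer. The `normalize` helper, the record layout, and the toy data are all hypothetical.

```python
from collections import Counter


def normalize(answer: str) -> str:
    # Hypothetical normalization: lowercase and collapse whitespace.
    return " ".join(answer.lower().split())


def answer_dispersion(samples: list[str]) -> float:
    """Assumed measure: 1 minus the modal answer's share (0.0 = full agreement)."""
    counts = Counter(normalize(s) for s in samples)
    return 1.0 - counts.most_common(1)[0][1] / len(samples)


def zero_dispersion_error_share(records: list[dict]) -> float:
    """Among incorrect records, the fraction whose samples all agree."""
    errors = [r for r in records if not r["correct"]]
    if not errors:
        return 0.0
    locked = sum(1 for r in errors if answer_dispersion(r["samples"]) == 0.0)
    return locked / len(errors)


# Toy data (invented): five samples per question, as in the paper's setting.
records = [
    {"samples": ["Paris"] * 5, "correct": True},
    {"samples": ["Lyon", "lyon", "Lyon ", "LYON", "Lyon"], "correct": False},
    {"samples": ["Nice", "Lyon", "Nice", "Paris", "Nice"], "correct": False},
]
share = zero_dispersion_error_share(records)
```

In this toy set, one of the two errors has fully agreeing samples, so any agreement-based score would rank it alongside the correct answer.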

What carries the argument

Retrieval-state lock-in: the condition in which sampled answers agree because they all condition on the same defective retrieval state.

If this is right

42% of KG-RAG errors and 59% of dense-retrieval errors carry zero answer dispersion at five samples per question.
Evidence and retrieval-state checks flag most zero-dispersion errors that answer agreement alone cannot rank.
The three-check rule reaches 91.9% pooled precision while certifying 7.7% of answers as low-risk.
On the clinical calibration domain the rule reaches 100% precision under an automated judge.
Confidence in RAG must be treated as object-specific rather than a single black-box score.
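The three-check rule listed above is a conjunction, which a short sketch can make concrete. How the paper computes each individual check is not described here, so the three boolean fields below are placeholders for whatever those checks return, and the `Audit` type and example rows are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Audit:
    answer_low_risk: bool      # placeholder for the answer-surface check
    evidence_low_risk: bool    # placeholder for the evidence check
    retrieval_low_risk: bool   # placeholder for the retrieval-state check
    correct: bool              # ground-truth label, used only for evaluation


def accept(a: Audit) -> bool:
    # Accept only when all three checks agree the answer is low-risk.
    return a.answer_low_risk and a.evidence_low_risk and a.retrieval_low_risk


def precision_and_coverage(audits: list[Audit]) -> tuple[Optional[float], float]:
    accepted = [a for a in audits if accept(a)]
    coverage = len(accepted) / len(audits)
    precision = sum(a.correct for a in accepted) / len(accepted) if accepted else None
    return precision, coverage


# Invented rows; the second is a locked-in error that passes the answer check alone.
audits = [
    Audit(True, True, True, True),
    Audit(True, False, False, False),
    Audit(False, True, True, True),
    Audit(True, True, False, False),
]
precision, coverage = precision_and_coverage(audits)
accept_all_precision = sum(a.correct for a in audits) / len(audits)
```

The conjunction trades coverage for precision by design: every added check can only shrink the accepted set.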

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar lock-in rates may occur in RAG architectures that were not tested in the six snapshots.
The three-check approach could be combined with other uncertainty methods to raise coverage while preserving precision.
Real-world deployment would still require human validation beyond the automated judge used in the clinical domain.

Load-bearing premise

The six question-answering snapshots and the particular ontology-guided KG-RAG implementation are representative enough that the measured lock-in rates and three-check rule will generalize to other RAG architectures and domains.

What would settle it

Applying the three-check rule to a RAG system or domain outside the six tested snapshots and checking whether precision stays near 91.9 percent.
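One way to judge whether precision "stays near" 91.9% on a new system is to put an interval around the observed precision. The sketch below uses a Wilson score interval for a binomial proportion; this is a standard technique chosen here for illustration, not a method from the paper, and the counts in the example are invented.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical replication: 46 of 50 accepted answers judged correct.
low, high = wilson_interval(46, 50)
consistent_with_paper = low <= 0.919 <= high
```

Because the rule certifies only a small share of answers, the accepted set in a replication may be small, and the interval will be correspondingly wide.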

Figures

Figures reproduced from arXiv: 2606.22728 by Sahib Julka.

**Figure 2.** OntoGraphRAG pipeline with the three measurement taps.

**Figure 3.** Answer-state scores rank errors most consistently; GPS discriminates only in its calibration domain (RealMedQA).

**Figure 4.** Under strict retrieval the answer surface collapses while evidence support stays separated (RealMedQA: dense, adaptive, …).

**Figure 5.** The useful diagnostic changes with the retrieval regime.

**Figure 6.** Full per-dataset, per-system view of the archived concentration–dispersion traces.

**Figure 7.** Lock-in as migration into the low-dispersion corner (2WikiMultiHopQA).

**Figure 8.** Family disagreement in the pooled adaptive KG runs.

Original abstract

The trustworthiness of a retrieval-augmented generation (RAG) system depends on more than the answer it returns, yet many black-box uncertainty methods still read agreement among sampled answers as confidence. That inference fails when repeated samples condition on the same defective retrieval state. The state may be empty, with the model falling back on parametric memory, or populated by a coherent but wrong neighbourhood. In either case, the answers agree because the error is stable. The problem is recognised in deployed RAG, but it has lacked a name, a measurable signature, and a prevalence bound. We supply all three. We name the failure retrieval-state lock-in and diagnose it by separating the three objects a single confidence score conflates: the answer surface, the retrieved evidence, and the retrieval state itself. In an inspectable, ontology-guided knowledge-graph RAG (KG-RAG) system across six question-answering snapshots, we measure the agreement blind spot directly: at five samples per question, 42% of KG-RAG errors and 59% of dense-retrieval errors carry zero answer dispersion, so agreement has nothing to rank, while evidence- and retrieval-state checks still flag most of them. The decomposition supports an auditable decision rule: accepting an answer only when answer, evidence, and retrieval checks all agree that it is low-risk reaches 91.9% pooled precision against a 69.7% accept-all rate. The cost is coverage: it certifies only 7.7% of answers as low-risk. On the clinical calibration domain it reaches 100% precision under an automated judge; this is an in-domain automated-label upper bound, not a clinical safety claim, and still needs human validation. Confidence in RAG is object-specific: when answers agree, the useful question is which part of the pipeline to distrust.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Names a real blind spot in RAG uncertainty but measures it only on one ontology-guided KG setup with no ablations or error bars.

The paper's useful move is naming retrieval-state lock-in and showing that answer dispersion alone misses a chunk of errors because the retrieval state itself is stable and wrong. In their six snapshots on the KG-RAG system, 42% of KG-RAG errors and 59% of dense-retrieval errors show zero dispersion at five samples, so standard sampling methods have nothing to work with. They separate the three objects—answer surface, evidence, and retrieval state—and build a simple three-check rule that lifts pooled precision to 91.9% from a 69.7% accept-all baseline, though coverage drops to 7.7%. The clinical 100% figure is labeled an automated upper bound.

What the work does cleanly is make the object-specific point explicit and give concrete prevalence numbers on an inspectable system. That distinction had not been isolated this way in the cited RAG uncertainty papers.

The soft spot is that every number comes from one ontology-guided KG-RAG plus a dense baseline. No ablation of the three-check components, no statistical tests or error bars on the 42/59 figures, and no runs on ordinary vector-store RAG or non-ontology domains. The stress-test note is right: transfer is untested, so the prevalence and the precision-coverage tradeoff could be tied to the inspectable setup.

This is for teams that already run RAG in production and need better ways to decide when to abstain. The diagnostic idea is worth a serious referee even if the current evidence stays narrow; the authors would need to address generalizability and add basic statistical detail.
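The component ablation the letter finds missing could look roughly like the sketch below: evaluate every non-empty subset of the three checks as its own conjunction and compare precision and coverage. The check names, row layout, and data are hypothetical, and nothing here reproduces the paper's results.

```python
from itertools import combinations

CHECKS = ("answer", "evidence", "retrieval")


def ablate(rows: list[dict]) -> dict:
    """Map each non-empty subset of checks to (precision, coverage).

    Each row holds one bool per check name (True = low risk) and a 'correct' bool.
    """
    results = {}
    for k in range(1, len(CHECKS) + 1):
        for subset in combinations(CHECKS, k):
            accepted = [r for r in rows if all(r[c] for c in subset)]
            coverage = len(accepted) / len(rows)
            precision = (
                sum(r["correct"] for r in accepted) / len(accepted) if accepted else None
            )
            results[subset] = (precision, coverage)
    return results


# Invented rows for illustration only.
rows = [
    {"answer": True, "evidence": True, "retrieval": True, "correct": True},
    {"answer": True, "evidence": False, "retrieval": False, "correct": False},
    {"answer": True, "evidence": True, "retrieval": False, "correct": False},
    {"answer": False, "evidence": True, "retrieval": True, "correct": True},
]
results = ablate(rows)
```

A table built from `results` would show which check contributes most of the precision gain and which mostly costs coverage.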

Referee Report

2 major / 1 minor

Summary. The paper claims that standard answer-agreement-based confidence in RAG can fail because of 'retrieval-state lock-in', where a defective retrieval state produces consistent but wrong answers across samples. In experiments on an ontology-guided KG-RAG system with six QA snapshots, the authors find that 42% of KG-RAG errors and 59% of dense-retrieval errors have zero answer dispersion at five samples. They introduce a three-check rule (answer, evidence, retrieval state) that achieves 91.9% pooled precision at 7.7% coverage, compared with 69.7% for accepting all answers, and 100% precision on the clinical calibration domain under an automated judge.

Significance. This work offers a concrete diagnosis and measurable signature for a recognized issue in RAG trustworthiness, along with an auditable decision rule that trades coverage for higher precision. The explicit decomposition of answer, evidence, and retrieval state, and the reporting of specific percentages on a real system, provide useful empirical grounding. If the lock-in phenomenon and the rule's performance hold more broadly, it could inform better uncertainty estimation practices in retrieval-augmented systems.

major comments (2)

[Experimental results on six snapshots] The reported lock-in rates of 42% for KG-RAG and 59% for dense-retrieval errors with zero dispersion lack error bars, statistical tests, or confidence intervals, which weakens the strength of the claim that agreement has nothing to rank in these cases.
[Abstract and experimental setup] The prevalence bounds and the 91.9% precision of the three-check rule are derived from a single ontology-guided KG-RAG implementation plus dense baseline; no replication or ablation on standard vector RAG, LLM rerankers, or other domains is provided, making generalization to deployed RAG systems a load-bearing assumption that requires further support.

minor comments (1)

[Clinical calibration domain] The 100% precision result is correctly labeled as an automated upper bound requiring human validation, but the presentation could more explicitly discuss the limitations of the automated judge.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the scope and presentation of our results. We address each major comment below.

Referee: [Experimental results on six snapshots] The reported lock-in rates of 42% for KG-RAG and 59% for dense-retrieval errors with zero dispersion lack error bars, statistical tests, or confidence intervals, which weakens the strength of the claim that agreement has nothing to rank in these cases.

Authors: We agree this is a valid point. The six snapshots provide the basis for the pooled rates, but we did not report uncertainty. In the revised manuscript we will add bootstrap confidence intervals computed by resampling across snapshots and will report per-snapshot variation where feasible. This will make the descriptive claim about zero-dispersion subsets more robust while preserving the core observation that agreement supplies no ranking information inside those subsets.

revision: yes

Referee: [Abstract and experimental setup] The prevalence bounds and the 91.9% precision of the three-check rule are derived from a single ontology-guided KG-RAG implementation plus dense baseline; no replication or ablation on standard vector RAG, LLM rerankers, or other domains is provided, making generalization to deployed RAG systems a load-bearing assumption that requires further support.

Authors: The choice of an ontology-guided, inspectable KG-RAG system was deliberate: only in such a setting can the retrieval state be directly audited to diagnose lock-in. The dense baseline serves as a controlled contrast rather than a comprehensive ablation. We do not assert that the exact 42% / 59% or 91.9% figures generalize; the contribution is the identification of the failure mode and the three-object decomposition. In revision we will expand the limitations and discussion sections to state the scope explicitly and to call for replication on black-box vector RAG and other domains.

revision: partial
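The bootstrap promised in the first response, resampling across snapshots, might be sketched as follows. This is a plain percentile bootstrap over whole snapshots, written under the assumption that each snapshot contributes a pair of counts (zero-dispersion errors, total errors); the counts shown are invented and the function name is hypothetical.

```python
import random


def bootstrap_pooled_rate(snapshots, n_boot=2000, alpha=0.05, seed=0):
    """Percentile interval for the pooled rate, resampling snapshots with replacement.

    snapshots: list of (zero_dispersion_errors, total_errors) pairs.
    """
    rng = random.Random(seed)
    rates = []
    for _ in range(n_boot):
        draw = [rng.choice(snapshots) for _ in snapshots]
        total = sum(t for _, t in draw)
        if total == 0:
            continue  # skip degenerate resamples with no errors at all
        rates.append(sum(z for z, _ in draw) / total)
    if not rates:
        raise ValueError("no usable bootstrap resamples")
    rates.sort()
    low = rates[int((alpha / 2) * len(rates))]
    high = rates[min(len(rates) - 1, int((1 - alpha / 2) * len(rates)))]
    return low, high


# Invented per-snapshot counts for six snapshots.
snapshots = [(40, 100), (35, 90), (50, 110), (45, 95), (38, 105), (42, 100)]
pooled = sum(z for z, _ in snapshots) / sum(t for _, t in snapshots)
low, high = bootstrap_pooled_rate(snapshots)
```

With only six snapshots the resampling unit is coarse, so the interval mostly reflects between-snapshot variation; a question-level bootstrap within snapshots would answer a different question.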

Circularity Check

0 steps flagged

No significant circularity; empirical measurements are direct and self-contained

The paper reports direct empirical counts of zero-dispersion errors (42% KG-RAG, 59% dense) and the measured precision (91.9%) of a rule that requires agreement across three distinct pipeline objects on held-out snapshots. No equations or derivations reduce a claimed result to a fitted parameter or self-citation by construction; the decision rule is not optimized against the target metric but evaluated as a conservative conjunction, and no load-bearing premise depends on prior author work. The analysis remains within its stated experimental scope without renaming known results or smuggling ansatzes.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central measurements rest on the assumption that the chosen KG-RAG system exposes separable retrieval states and that the six snapshots are representative; no free parameters are explicitly fitted beyond the fixed sample count of five.

free parameters (1)

number of samples
Fixed at five per question; affects the zero-dispersion statistic.
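Why the sample count matters can be shown with a small calculation. Under the simplifying assumption (not taken from the paper) that samples are drawn independently from a fixed answer distribution for a given retrieval state, the probability that all n samples agree is the sum of each answer's probability raised to the n-th power.

```python
def prob_all_agree(answer_probs: list[float], n_samples: int) -> float:
    """P(all n i.i.d. samples give the same answer) = sum of p_i ** n."""
    if abs(sum(answer_probs) - 1.0) > 1e-9:
        raise ValueError("answer_probs must sum to 1")
    return sum(p ** n_samples for p in answer_probs)


# Hypothetical answer distributions.
near_locked = [0.9, 0.1]   # one dominant answer, some residual variation
fully_locked = [1.0]       # a single answer, as in a fully locked-in state

at_five = prob_all_agree(near_locked, 5)
at_twenty = prob_all_agree(near_locked, 20)
locked_at_twenty = prob_all_agree(fully_locked, 20)
```

Under this assumption a fully locked-in state shows zero dispersion at any sample count, while a nearly locked one looks locked less often as n grows, so the reported zero-dispersion shares are tied to the choice of five samples.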

axioms (1)

domain assumption: The KG-RAG implementation provides inspectable retrieval states that can be checked independently of the generated answer.
Invoked to separate the three objects and to run the retrieval-state check.

invented entities (1)

retrieval-state lock-in (no independent evidence)
purpose: Names the stable error condition in which repeated samples condition on the same defective retrieval state.
Newly introduced term; no independent evidence outside the paper's measurements is supplied.

Reference graph

Works this paper leans on

82 extracted references · 18 canonical work pages

[1]

Amugongo, Paola Mascheroni, Sarah Brooks, Susanne Doering, and Jan Seidel

Lameck Mbangula Amugongo, Pietro Mascheroni, Steven Geoffrey Brooks, Stefan Doering, and Jan Seidel. Retrieval augmented generation for large language models in healthcare: A systematic review. PLOS Digital Health, 4 0 (6): 0 e0000877, 2025. doi:10.1371/journal.pdig.0000877. URL https://journals.plos.org/digitalhealth/article?id=10.1371/journal.pdig.0000877

work page doi:10.1371/journal.pdig.0000877 2025
[2]

Self-RAG : Learning to retrieve, generate, and critique through self-reflection

Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-RAG : Learning to retrieve, generate, and critique through self-reflection. In International Conference on Learning Representations (ICLR), 2024

2024
[3]

INTRYGUE : Induction-aware entropy gating for reliable RAG uncertainty estimation

Alexandra Bazarova, Andrei Volodichev, Daria Kotova, and Alexey Zaytsev. INTRYGUE : Induction-aware entropy gating for reliable RAG uncertainty estimation. arXiv preprint arXiv:2603.21607, 2026

arXiv 2026
[4]

Campos, Ant \'o nio Farinhas, Chrysoula Zerva, M \'a rio A

Margarida M. Campos, Ant \'o nio Farinhas, Chrysoula Zerva, M \'a rio A. T. Figueiredo, and Andr \'e F. T. Martins. Conformal prediction for natural language processing: A survey. Transactions of the Association for Computational Linguistics, 12: 0 1619--1638, 2024. doi:10.1162/tacl_a_00715

work page doi:10.1162/tacl_a_00715 2024
[5]

Koedinger

Eason Chen, Chuangji Li, Shizhuo Li, Zimo Xiao, Jionghao Lin, and Kenneth R. Koedinger. Comparing RAG and GraphRAG for page-level retrieval question answering on math textbook. arXiv preprint arXiv:2509.16780, 2025

arXiv 2025
[6]

From local to global: A graph RAG approach to query-focused summarization

Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson. From local to global: A graph RAG approach to query-focused summarization. arXiv preprint arXiv:2404.16130, 2024

Pith/arXiv arXiv 2024
[7]

RAGA s: Automated Evaluation of Retrieval Augmented Generation

Shahul Es, Jithin James, Luis Espinosa-Anke, and Steven Schockaert. RAGAs : Automated evaluation of retrieval augmented generation. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 150--158, 2024. doi:10.18653/v1/2024.eacl-demo.16. URL https://aclanthology.org/2024...

work page doi:10.18653/v1/2024.eacl-demo.16 2024
[8]

Faithfulness-aware uncertainty quantification for fact-checking the output of retrieval augmented generation

Ekaterina Fadeeva, Aleksandr Rubashevskii, Dzianis Piatrashyn, Roman Vashurin, Shehzaad Dhuliawala, Artem Shelmanov, Timothy Baldwin, Preslav Nakov, Mrinmaya Sachan, and Maxim Panov. Faithfulness-aware uncertainty quantification for fact-checking the output of retrieval augmented generation. arXiv preprint arXiv:2505.21072, 2025

Pith/arXiv arXiv 2025
[9]

Detecting hallucinations in large language models using semantic entropy

Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. Detecting hallucinations in large language models using semantic entropy. Nature, 630: 0 625--630, 2024

2024
[10]

Selective classification for deep neural networks

Yonatan Geifman and Ran El-Yaniv. Selective classification for deep neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2017

2017
[11]

Weinberger

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), 2017

2017
[12]

HippoRAG : Neurobiologically inspired long-term memory for large language models

Bernal Jim \'e nez Guti \'e rrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su. HippoRAG : Neurobiologically inspired long-term memory for large language models. In Advances in Neural Information Processing Systems (NeurIPS), 2024

2024
[13]

Aggarwal, and Jiliang Tang

Haoyu Han, Li Ma, Yu Wang, Harry Shomer, Yongjia Lei, Zhisheng Qi, Kai Guo, Zhigang Hua, Bo Long, Hui Liu, Charu C. Aggarwal, and Jiliang Tang. RAG vs.\ GraphRAG : A systematic evaluation and key insights. arXiv preprint arXiv:2502.11371, 2025

arXiv 2025
[14]

DeBERTa : Decoding-enhanced BERT with disentangled attention

Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. DeBERTa : Decoding-enhanced BERT with disentangled attention. In International Conference on Learning Representations (ICLR), 2021. URL https://openreview.net/forum?id=XPZIaotutsD

2021
[15]

Chawla, Thomas Laurent, Yann LeCun, Xavier Bresson, and Bryan Hooi

Xiaoxin He, Yijun Tian, Yifei Sun, Nitesh V. Chawla, Thomas Laurent, Yann LeCun, Xavier Bresson, and Bryan Hooi. G-Retriever : Retrieval-augmented generation for textual graph understanding and question answering. In Advances in Neural Information Processing Systems (NeurIPS), 2024

2024
[16]

Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps

Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In Proceedings of the 28th International Conference on Computational Linguistics (COLING), 2020

2020
[17]

GRAG : Graph Retrieval-Augmented Generation

Yuntong Hu, Zhihan Lei, Zheng Zhang, Bo Pan, Chen Ling, and Liang Zhao. GRAG : Graph retrieval-augmented generation. In Findings of the Association for Computational Linguistics: NAACL 2025 , pages 4145--4157, 2025. doi:10.18653/v1/2025.findings-naacl.232. URL https://aclanthology.org/2025.findings-naacl.232/

work page doi:10.18653/v1/2025.findings-naacl.232 2025
[18]

TrustLLM : Trustworthiness in Large Language Models

Yue Huang, Lichao Sun, Haoran Wang, Siyuan Wu, Qihui Zhang, Yuan Li, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan Zhang, Xiner Li, et al. TrustLLM : Trustworthiness in Large Language Models . arXiv preprint arXiv:2401.05561, 2024

Pith/arXiv arXiv 2024
[19]

StructGPT : A general framework for large language model to reason over structured data

Jinhao Jiang, Kun Zhou, Zican Dong, Keming Ye, Xin Zhao, and Ji-Rong Wen. StructGPT : A general framework for large language model to reason over structured data. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023 a

2023
[20]

Active Retrieval Augmented Generation

Zhengbao Jiang, Frank Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. Active retrieval augmented generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7969--7992, Singapore, 2023 b . Association for Computational Linguistics. doi:10.18653/v1/2023.emn...

work page doi:10.18653/v1/2023.emnlp-main.495 2023
[21]

Language models (mostly) know what they know

Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston, Sheer El-Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, Deep Ganguli, Danny Hernandez, Josh Jacobson, Jackson Kernion, Shauna Kravec,...

Pith/arXiv arXiv 2022
[22]

Semantic entropy probes: Robust and cheap hallucination detection in LLMs

Jannik Kossen, Jiatong Han, Muhammed Razzak, Lisa Schut, Shreshth Malik, and Yarin Gal. Semantic entropy probes: Robust and cheap hallucination detection in LLMs . In International Conference on Learning Representations (ICLR), 2025

2025
[23]

Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation

Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In International Conference on Learning Representations (ICLR), 2023

2023
[24]

Decomposing uncertainty in probabilistic knowledge graph embeddings: Why entity variance is not enough

Chorok Lee. Decomposing uncertainty in probabilistic knowledge graph embeddings: Why entity variance is not enough. arXiv preprint arXiv:2512.22318, 2025

arXiv 2025
[25]

u ttler, Mike Lewis, Wen-tau Yih, Tim Rockt \

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich K \"u ttler, Mike Lewis, Wen-tau Yih, Tim Rockt \"a schel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems (NeurIPS), 2020

2020
[26]

Simple is effective: The roles of graphs and large language models in knowledge-graph-based retrieval-augmented generation

Mufei Li, Siqi Miao, and Pan Li. Simple is effective: The roles of graphs and large language models in knowledge-graph-based retrieval-augmented generation. In International Conference on Learning Representations (ICLR), 2025 a

2025
[27]

Citation-enhanced generation for LLM -based chatbots

Weitao Li, Junkai Li, Weizhi Ma, and Yang Liu. Citation-enhanced generation for LLM -based chatbots. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1451--1466, 2024 a . doi:10.18653/v1/2024.acl-long.79. URL https://aclanthology.org/2024.acl-long.79/

work page doi:10.18653/v1/2024.acl-long.79 2024
[28]

Semantic volume: Quantifying and detecting both external and internal uncertainty in LLMs

Xiaomin Li, Zhou Yu, Ziji Zhang, Yingying Zhuang, Swair Shah, Narayanan Sadagopan, and Anurag Beniwal. Semantic volume: Quantifying and detecting both external and internal uncertainty in LLMs . arXiv preprint arXiv:2502.21239, 2025 b

arXiv 2025
[29]

UncertaintyRAG : Span-level uncertainty enhanced long-context modeling for retrieval-augmented generation

Zixuan Li, Jing Xiong, Fanghua Ye, Chuanyang Zheng, Xun Wu, Jianqiao Lu, Zhongwei Wan, Xiaodan Liang, Chengming Li, Zhenan Sun, Lingpeng Kong, and Ngai Wong. UncertaintyRAG : Span-level uncertainty enhanced long-context modeling for retrieval-augmented generation. arXiv preprint arXiv:2410.02719, 2024 b

arXiv 2024
[30]

Teaching models to express their uncertainty in words

Stephanie Lin, Jacob Hilton, and Owain Evans. Teaching models to express their uncertainty in words. In Transactions of the Association for Computational Linguistics (TACL), 2022

2022
[31]

CtrlA : Adaptive retrieval-augmented generation via inherent control

Huanshuo Liu, Hao Zhang, Zhijiang Guo, Jing Wang, Kuicai Dong, Xiangyang Li, Yi Quan Lee, Cong Zhang, and Yong Liu. CtrlA : Adaptive retrieval-augmented generation via inherent control. In Findings of the Association for Computational Linguistics: ACL 2025, pages 12592--12618, Vienna, Austria, 2025. Association for Computational Linguistics. doi:10.18653/...

work page doi:10.18653/v1/2025.findings-acl.652 2025
[32]

NAACL : Noise- A w A re verbal confidence calibration for robust large language models in RAG systems

Jiayu Liu, Rui Wang, Qing Zong, Yumeng Wang, Cheng Qian, Qingcheng Zeng, Tianshi Zheng, Haochen Shi, Dadi Guo, Baixuan Xu, Chunyang Li, and Yangqiu Song. NAACL : Noise- A w A re verbal confidence calibration for robust large language models in RAG systems. arXiv preprint arXiv:2601.11004, 2026 a

Pith/arXiv arXiv 2026
[33]

TruthfulRAG : Resolving factual-level conflicts in retrieval-augmented generation with knowledge graphs

Shuyi Liu, Yuming Shang, and Xi Zhang. TruthfulRAG : Resolving factual-level conflicts in retrieval-augmented generation with knowledge graphs. In Proceedings of the AAAI Conference on Artificial Intelligence, 2026 b

2026
[34]

Reasoning on graphs: Faithful and interpretable large language model reasoning

Linhao Luo, Yuan-Fang Li, Gholamreza Haffari, and Shirui Pan. Reasoning on graphs: Faithful and interpretable large language model reasoning. In International Conference on Learning Representations (ICLR), 2024

2024
[35]

Think-on-graph 2.0: Deep and faithful large language model reasoning with knowledge-guided retrieval augmented generation

Shengjie Ma, Chengjin Xu, Xuhui Jiang, Muzhi Li, Huaren Qu, Cehao Yang, Jiaxin Mao, and Jian Guo. Think-on-graph 2.0: Deep and faithful large language model reasoning with knowledge-guided retrieval augmented generation. In International Conference on Learning Representations (ICLR), 2025

2025
[36]

Potsawee Manakul, Adian Liusie, and Mark J. F. Gales. SelfCheckGPT : Zero-resource black-box hallucination detection for generative large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

2023
[37]

GNN-RAG : Graph neural retrieval for large language model reasoning

Costas Mavromatis and George Karypis. GNN-RAG : Graph neural retrieval for large language model reasoning. In Findings of the Association for Computational Linguistics (ACL), 2025

2025
[38]

Adaptive retrieval without self-knowledge? bringing uncertainty back home

Viktor Moskvoretskii, Maria Marina, Mikhail Salnikov, Nikolay Ivanov, Sergey Pletenev, Daria Galimzianova, Nikita Krayko, Vasily Konovalov, Irina Nikishina, and Alexander Panchenko. Adaptive retrieval without self-knowledge? bringing uncertainty back home. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1...

work page doi:10.18653/v1/2025.acl-long.319 2025
[39]

Towards trustworthy retrieval augmented generation for large language models: A survey

Bo Ni, Zheyuan Liu, Leyao Wang, Yongjia Lei, Yuying Zhao, Xueqi Cheng, Qingkai Zeng, Luna Dong, Yinglong Xia, Krishnaram Kenthapadi, et al. Towards trustworthy retrieval augmented generation for large language models: A survey. arXiv preprint arXiv:2502.06872, 2025

arXiv 2025
[40]

Kernel language entropy: Fine-grained uncertainty quantification for LLMs from semantic similarities

Alexander Nikitin, Jannik Kossen, Yarin Gal, and Pekka Marttinen. Kernel language entropy: Fine-grained uncertainty quantification for LLMs from semantic similarities. In Advances in Neural Information Processing Systems (NeurIPS), 2024

2024
[41]

GPT-4o mini : Advancing cost-efficient intelligence

OpenAI . GPT-4o mini : Advancing cost-efficient intelligence. Technical report, OpenAI, 2024. URL https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/

2024
[42]

Bowman, and Shi Feng

Arjun Panickssery, Samuel R. Bowman, and Shi Feng. LLM evaluators recognize and favor their own generations. In Advances in Neural Information Processing Systems 37 (NeurIPS), 2024

2024
[43]

Graph retrieval-augmented generation: A survey

Boci Peng, Yun Zhu, Yongchao Liu, Xiaohe Bo, Haizhou Shi, Chuntao Hong, Yan Zhang, and Siliang Tang. Graph retrieval-augmented generation: A survey. ACM Transactions on Information Systems , 2024

2024
[44]

Uncertainty quantification in retrieval augmented question answering

Laura Perez-Beltrachini and Mirella Lapata. Uncertainty quantification in retrieval augmented question answering. Transactions on Machine Learning Research (TMLR), 2025

2025
[45]

SURE-RAG : Sufficiency and uncertainty-aware evidence verification for selective retrieval-augmented generation

Jingxi Qiu, Zeyu Han, and Cheng Huang. SURE-RAG : Sufficiency and uncertainty-aware evidence verification for selective retrieval-augmented generation. arXiv preprint arXiv:2605.03534, 2026

Pith/arXiv arXiv 2026
[46]

Semantic density: Uncertainty quantification for large language models through confidence measurement in semantic space

Xin Qiu and Risto Miikkulainen. Semantic density: Uncertainty quantification for large language models through confidence measurement in semantic space. In Advances in Neural Information Processing Systems (NeurIPS), 2024

2024
[47]

Sentence- BERT : Sentence embeddings using siamese BERT -networks

Nils Reimers and Iryna Gurevych. Sentence- BERT : Sentence embeddings using siamese BERT -networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019

2019
[48]

When to trust: A causality-aware calibration framework for accurate knowledge graph retrieval-augmented generation

Jing Ren, Bowen Li, Ziqi Xu, Xikun Zhang, Haytham Fayek, and Xiaodong Li. When to trust: A causality-aware calibration framework for accurate knowledge graph retrieval-augmented generation. arXiv preprint arXiv:2601.09241, 2026

arXiv 2026
[49]

ARES : An Automated Evaluation Framework for Retrieval-Augmented Generation Systems

Jon Saad-Falcon, Omar Khattab, Christopher Potts, and Matei Zaharia. ARES : An automated evaluation framework for retrieval-augmented generation systems. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 338--354, Mexico City, Mexico, 2024. Association f...

work page doi:10.18653/v1/2024.naacl-long.20 2024
[50]

Zhili Shen, Chenxin Diao, Pavlos Vougiouklis, Pascual Merita, Shriram Piramanayagam, Enting Chen, Damien Graux, Andre Melo, Ruofei Lai, Zeren Jiang, Zhongyang Li, Ye Qi, Yang Ren, Dandan Tu, and Jeff Z. Pan. GeAR: Graph-enhanced agent for retrieval-augmented generation. In Findings of the Association for Computational Linguistics: ACL 2025, pages 12049-...

work page doi:10.18653/v1/2025.findings-acl.624 2025
[51]

Why uncertainty estimation methods fall short in RAG: An axiomatic analysis

Heydar Soudani, Evangelos Kanoulas, and Faegheh Hasibi. Why uncertainty estimation methods fall short in RAG: An axiomatic analysis. In Findings of the Association for Computational Linguistics: ACL 2025, pages 16596--16616, Vienna, Austria, 2025a. Association for Computational Linguistics. URL https://aclanthology.org/2025.findings-acl.852/

2025
[52]

Uncertainty quantification for retrieval-augmented reasoning

Heydar Soudani, Hamed Zamani, and Faegheh Hasibi. Uncertainty quantification for retrieval-augmented reasoning. arXiv preprint arXiv:2510.11483, 2025b

Pith/arXiv arXiv 2025
[53]

DRAGIN: Dynamic retrieval augmented generation based on the real-time information needs of large language models

Weihang Su, Yichen Tang, Qingyao Ai, Zhijing Wu, and Yiqun Liu. DRAGIN: Dynamic retrieval augmented generation based on the real-time information needs of large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12991--13013, Bangkok, Thailand, 2024. Association for C...

work page doi:10.18653/v1/2024.acl-long.702 2024
[54]

Think-on-graph: Deep and responsible reasoning of large language model on knowledge graph

Jiashuo Sun, Chengjin Xu, Lumingyuan Tang, Shengjie Wang, Chen Lin, Yeyun Gong, Heung-Yeung Shum, and Jian Guo. Think-on-graph: Deep and responsible reasoning of large language model on knowledge graph. In International Conference on Learning Representations (ICLR), 2024

2024
[55]

Uncertainty-aware dynamic knowledge graphs for reliable question answering

Yu Takahashi, Shun Takeuchi, Kexuan Xin, Guillaume Pelat, Yoshiaki Ikai, Junya Saito, Jonathan Vitale, Shlomo Berkovsky, and Amin Beheshti. Uncertainty-aware dynamic knowledge graphs for reliable question answering. arXiv preprint arXiv:2601.09720, 2026

arXiv 2026
[56]

Does my LLM need a better evaluator? Just ask for calibration

Katherine Tian, Eric Mitchell, Huaxiu Yao, Christopher D. Manning, and Chelsea Finn. Does my LLM need a better evaluator? Just ask for calibration. arXiv preprint arXiv:2310.02415, 2023

arXiv 2023
[57]

Uncertainty-based abstention in LLMs improves safety and reduces hallucinations

Christian Tomani, Kamalika Chaudhuri, Ivan Evtimov, Daniel Cremers, and Mark Ibrahim. Uncertainty-based abstention in LLMs improves safety and reduces hallucinations. arXiv preprint arXiv:2404.10960, 2024

arXiv 2024
[58]

MuSiQue: Multihop questions via single-hop question composition

Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. MuSiQue: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539--554, 2022

2022
[59]

Semantic uncertainty quantification of hallucinations in LLMs: A quantum tensor network based method

Pragatheeswaran Vipulanandan, Kamal Premaratne, and Dilip Sarkar. Semantic uncertainty quantification of hallucinations in LLMs: A quantum tensor network based method. arXiv preprint arXiv:2601.20026, 2026

arXiv 2026
[60]

L-RAG: Balancing context and retrieval with entropy-based lazy loading

Sergii Voloshyn. L-RAG: Balancing context and retrieval with entropy-based lazy loading. arXiv preprint arXiv:2601.06551, 2026

arXiv 2026
[61]

Fine-grained uncertainty decomposition in large language models: A spectral approach

Nassim Walha, Sebastian G. Gruber, Thomas Decker, Yinchong Yang, Alireza Javanmardi, Eyke Hüllermeier, and Florian Buettner. Fine-grained uncertainty decomposition in large language models: A spectral approach. arXiv preprint arXiv:2509.22272, 2025

arXiv 2025
[62]

Correctness is not faithfulness in retrieval augmented generation attributions

Jonas Wallat, Maria Heuss, Maarten de Rijke, and Avishek Anand. Correctness is not faithfulness in retrieval augmented generation attributions. In Proceedings of the 2025 International ACM SIGIR Conference on Innovative Concepts and Theories in Information Retrieval, pages 22--32. Association for Computing Machinery, 2025. doi:10.1145/3731120.3744592

work page doi:10.1145/3731120.3744592 2025
[63]

Fei Wang, Xingchen Wan, Ruoxi Sun, Jiefeng Chen, and Sercan O. Arik. Astute RAG: Overcoming imperfect retrieval augmentation and knowledge conflicts for large language models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, 2025

2025
[64]

Self-consistency improves chain of thought reasoning in language models

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In Proceedings of the Eleventh International Conference on Learning Representations (ICLR), 2023

2023
[65]

Augmenting textual generation via topology aware retrieval

Yu Wang, Nedim Lipka, Ruiyi Zhang, Alexa Siu, Yuying Zhao, Bo Ni, Xin Wang, Ryan Rossi, and Tyler Derr. Augmenting textual generation via topology aware retrieval. arXiv preprint arXiv:2405.17602, 2024. doi:10.48550/arXiv.2405.17602

work page doi:10.48550/arxiv.2405.17602 2024
[66]

Medical graph RAG: Evidence-based medical large language model via graph retrieval-augmented generation

Junde Wu, Jiayuan Zhu, Yunli Qi, Jingkun Chen, Min Xu, Filippo Menolascina, Yueming Jin, and Vicente Grau. Medical graph RAG: Evidence-based medical large language model via graph retrieval-augmented generation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 28443--28467, 2025. do...

work page doi:10.18653/v1/2025.acl-long.1381 2025
[67]

How well do LLMs cite relevant medical references? An evaluation framework and analyses

Kevin Wu, Eric Wu, Ally Cassasola, Angela Zhang, Kevin Wei, Teresa Nguyen, Sith Riantawan, Patricia Shi Riantawan, Daniel E. Ho, and James Zou. How well do LLMs cite relevant medical references? An evaluation framework and analyses. arXiv preprint arXiv:2402.02008, 2024a

arXiv 2024
[68]

ClashEval: Quantifying the tug-of-war between an LLM's internal prior and external evidence

Kevin Wu, Eric Wu, and James Zou. ClashEval: Quantifying the tug-of-war between an LLM's internal prior and external evidence. arXiv preprint arXiv:2404.10198, 2024b

arXiv 2024
[69]

When to use graphs in RAG: A comprehensive analysis for graph retrieval-augmented generation

Zhishang Xiang, Chuanjie Wu, Qinggang Zhang, Shengyuan Chen, Zijin Hong, Xiao Huang, and Jinsong Su. When to use graphs in RAG: A comprehensive analysis for graph retrieval-augmented generation. arXiv preprint arXiv:2506.05690, 2025

arXiv 2025
[70]

Corrective retrieval augmented generation

Shi-Qi Yan, Jia-Chen Gu, Yun Zhu, and Zhen-Hua Ling. Corrective retrieval augmented generation. arXiv preprint arXiv:2401.15884, 2024

Pith/arXiv arXiv 2024
[71]

HotpotQA: A dataset for diverse, explainable multi-hop question answering

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018

2018
[72]

SeaKR: Self-aware knowledge retrieval for adaptive retrieval augmented generation

Zijun Yao, Weijian Qi, Liangming Pan, Shulin Cao, Linmei Hu, Liu Weichuan, Lei Hou, and Juanzi Li. SeaKR: Self-aware knowledge retrieval for adaptive retrieval augmented generation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 27022--27043, Vienna, Austria, 2025. Association for...

work page doi:10.18653/v1/2025.acl-long.1312 2025
[73]

FaithfulRAG: Fact-level conflict modeling for context-faithful retrieval-augmented generation

Qinggang Zhang, Zhishang Xiang, Yilin Xiao, Le Wang, Junhui Li, Xinrun Wang, and Jinsong Su. FaithfulRAG: Fact-level conflict modeling for context-faithful retrieval-augmented generation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, pages 21863--21882. Association for Computational Linguistics, 2025. URL htt...

2025
[74]

Judging LLM-as-a-judge with MT-bench and chatbot arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-bench and chatbot arena. In Advances in Neural Information Processing Systems 36 (NeurIPS), Datasets and Benchmarks Track, 2023

2023
[75]

Revisiting RAG retrievers: An information theoretic benchmark

Wenqing Zheng, Dmitri Kalaev, Noah Fatsi, Daniel Barcklow, Owen Reinert, Igor Melnyk, Senthil Kumar, and C. Bayan Bruss. Revisiting RAG retrievers: An information theoretic benchmark. arXiv preprint arXiv:2602.21553, 2026

arXiv 2026
[76]

Poisoning retrieval corpora by injecting adversarial passages

Zexuan Zhong, Ziqing Huang, Alexander Wettig, and Danqi Chen. Poisoning retrieval corpora by injecting adversarial passages. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

2023
[77]

What breaks knowledge graph based RAG? Benchmarking and empirical insights into reasoning under incomplete knowledge

Dongzhuoran Zhou, Yuqicheng Zhu, Xiaxia Wang, Hongkuan Zhou, Yuan He, Jiaoyan Chen, Steffen Staab, and Evgeny Kharlamov. What breaks knowledge graph based RAG? Benchmarking and empirical insights into reasoning under incomplete knowledge. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2026

2026
[78]

Knowledge graph-guided retrieval augmented generation

Xiangrong Zhu, Yuexiang Xie, Yi Liu, Yaliang Li, and Wei Hu. Knowledge graph-guided retrieval augmented generation. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 8912--8924, 2025a. doi:10.18653/v1/2025.naacl-long.44...

work page doi:10.18653/v1/2025.naacl-long.449 2025
[79]

Certainty in uncertainty: Reasoning over uncertain knowledge graphs with statistical guarantees

Yuqicheng Zhu, Jingcheng Wu, Yizhen Wang, Hongkuan Zhou, Jiaoyan Chen, Evgeny Kharlamov, and Steffen Staab. Certainty in uncertainty: Reasoning over uncertain knowledge graphs with statistical guarantees. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 8730--8752, 2025b. doi:10.18653/v1/2025.emnlp-main.44...

work page doi:10.18653/v1/2025.emnlp-main.441 2025
[80]

Two-tiered encoder-based hallucination detection for retrieval-augmented generation in the wild

Ilana Zimmerman, Jadin Tredup, Ethan Selfridge, and Joseph Bradley. Two-tiered encoder-based hallucination detection for retrieval-augmented generation in the wild. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 8--22, Miami, Florida, US, 2024. Association for Computational Linguistics. doi...

work page doi:10.18653/v1/2024.emnlp-industry.2 2024

[44] [44]

Uncertainty quantification in retrieval augmented question answering

Laura Perez-Beltrachini and Mirella Lapata. Uncertainty quantification in retrieval augmented question answering. Transactions on Machine Learning Research (TMLR), 2025

2025

[45] [45]

SURE-RAG : Sufficiency and uncertainty-aware evidence verification for selective retrieval-augmented generation

Jingxi Qiu, Zeyu Han, and Cheng Huang. SURE-RAG : Sufficiency and uncertainty-aware evidence verification for selective retrieval-augmented generation. arXiv preprint arXiv:2605.03534, 2026

Pith/arXiv arXiv 2026

[46] [46]

Semantic density: Uncertainty quantification for large language models through confidence measurement in semantic space

Xin Qiu and Risto Miikkulainen. Semantic density: Uncertainty quantification for large language models through confidence measurement in semantic space. In Advances in Neural Information Processing Systems (NeurIPS), 2024

2024

[47] [47]

Sentence- BERT : Sentence embeddings using siamese BERT -networks

Nils Reimers and Iryna Gurevych. Sentence- BERT : Sentence embeddings using siamese BERT -networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019

2019

[48] [48]

When to trust: A causality-aware calibration framework for accurate knowledge graph retrieval-augmented generation

Jing Ren, Bowen Li, Ziqi Xu, Xikun Zhang, Haytham Fayek, and Xiaodong Li. When to trust: A causality-aware calibration framework for accurate knowledge graph retrieval-augmented generation. arXiv preprint arXiv:2601.09241, 2026

arXiv 2026

[49] [49]

ARES : An Automated Evaluation Framework for Retrieval-Augmented Generation Systems

Jon Saad-Falcon, Omar Khattab, Christopher Potts, and Matei Zaharia. ARES : An automated evaluation framework for retrieval-augmented generation systems. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 338--354, Mexico City, Mexico, 2024. Association f...

work page doi:10.18653/v1/2024.naacl-long.20 2024

[50] [50]

Zhili Shen, Chenxin Diao, Pavlos Vougiouklis, Pascual Merita, Shriram Piramanayagam, Enting Chen, Damien Graux, Andre Melo, Ruofei Lai, Zeren Jiang, Zhongyang Li, Ye Qi, Yang Ren, Dandan Tu, and Jeff Z. Pan. GeAR : Graph-enhanced agent for retrieval-augmented generation. In Findings of the Association for Computational Linguistics: ACL 2025 , pages 12049-...

work page doi:10.18653/v1/2025.findings-acl.624 2025

[51] [51]

Why uncertainty estimation methods fall short in RAG : An axiomatic analysis

Heydar Soudani, Evangelos Kanoulas, and Faegheh Hasibi. Why uncertainty estimation methods fall short in RAG : An axiomatic analysis. In Findings of the Association for Computational Linguistics: ACL 2025, pages 16596--16616, Vienna, Austria, 2025 a . Association for Computational Linguistics. URL https://aclanthology.org/2025.findings-acl.852/

2025

[52] [52]

Uncertainty quantification for retrieval-augmented reasoning

Heydar Soudani, Hamed Zamani, and Faegheh Hasibi. Uncertainty quantification for retrieval-augmented reasoning. arXiv preprint arXiv:2510.11483, 2025 b

Pith/arXiv arXiv 2025

[53] [53]

doi:10.18653/V1/2024.ACL-LONG.702 , url =

Weihang Su, Yichen Tang, Qingyao Ai, Zhijing Wu, and Yiqun Liu. DRAGIN : Dynamic retrieval augmented generation based on the real-time information needs of large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12991--13013, Bangkok, Thailand, 2024. Association for C...

work page doi:10.18653/v1/2024.acl-long.702 2024

[54] [54]

Think-on-graph: Deep and responsible reasoning of large language model on knowledge graph

Jiashuo Sun, Chengjin Xu, Lumingyuan Tang, Shengjie Wang, Chen Lin, Yeyun Gong, Heung-Yeung Shum, and Jian Guo. Think-on-graph: Deep and responsible reasoning of large language model on knowledge graph. In International Conference on Learning Representations (ICLR), 2024

2024

[55] [55]

Uncertainty-aware dynamic knowledge graphs for reliable question answering

Yu Takahashi, Shun Takeuchi, Kexuan Xin, Guillaume Pelat, Yoshiaki Ikai, Junya Saito, Jonathan Vitale, Shlomo Berkovsky, and Amin Beheshti. Uncertainty-aware dynamic knowledge graphs for reliable question answering. arXiv preprint arXiv:2601.09720, 2026

arXiv 2026

[56] [56]

Manning, and Chelsea Finn

Katherine Tian, Eric Mitchell, Huaxiu Yao, Christopher D. Manning, and Chelsea Finn. Does my LLM need a better evaluator? Just ask for calibration. In arXiv preprint arXiv:2310.02415, 2023

arXiv 2023

[57] [57]

Uncertainty-based abstention in LLMs improves safety and reduces hallucinations

Christian Tomani, Kamalika Chaudhuri, Ivan Evtimov, Daniel Cremers, and Mark Ibrahim. Uncertainty-based abstention in LLMs improves safety and reduces hallucinations. arXiv preprint arXiv:2404.10960, 2024

arXiv 2024

[58] [58]

MuSiQue : Multihop questions via single-hop question composition

Harsh Trivedi, Niranjan Bauer, Tushar Khot, and Ashish Sabharwal. MuSiQue : Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10: 0 539--554, 2022

2022

[59]

Pragatheeswaran Vipulanandan, Kamal Premaratne, and Dilip Sarkar. Semantic uncertainty quantification of hallucinations in LLMs: A quantum tensor network based method. arXiv preprint arXiv:2601.20026, 2026

[60]

Sergii Voloshyn. L-RAG: Balancing context and retrieval with entropy-based lazy loading. arXiv preprint arXiv:2601.06551, 2026

[61]

Nassim Walha, Sebastian G. Gruber, Thomas Decker, Yinchong Yang, Alireza Javanmardi, Eyke Hüllermeier, and Florian Buettner. Fine-grained uncertainty decomposition in large language models: A spectral approach. arXiv preprint arXiv:2509.22272, 2025

[62]

Jonas Wallat, Maria Heuss, Maarten de Rijke, and Avishek Anand. Correctness is not faithfulness in retrieval augmented generation attributions. In Proceedings of the 2025 International ACM SIGIR Conference on Innovative Concepts and Theories in Information Retrieval, pages 22--32. Association for Computing Machinery, 2025. doi:10.1145/3731120.3744592

[63]

Fei Wang, Xingchen Wan, Ruoxi Sun, Jiefeng Chen, and Sercan O. Arik. Astute RAG: Overcoming imperfect retrieval augmentation and knowledge conflicts for large language models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, 2025

[64]

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In Proceedings of the Eleventh International Conference on Learning Representations (ICLR), 2023

[65]

Yu Wang, Nedim Lipka, Ruiyi Zhang, Alexa Siu, Yuying Zhao, Bo Ni, Xin Wang, Ryan Rossi, and Tyler Derr. Augmenting textual generation via topology aware retrieval. arXiv preprint arXiv:2405.17602, 2024. doi:10.48550/arXiv.2405.17602

[66]

Junde Wu, Jiayuan Zhu, Yunli Qi, Jingkun Chen, Min Xu, Filippo Menolascina, Yueming Jin, and Vicente Grau. Medical graph RAG: Evidence-based medical large language model via graph retrieval-augmented generation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 28443--28467, 2025. do...

doi:10.18653/v1/2025.acl-long.1381

[67]

Kevin Wu, Eric Wu, Ally Cassasola, Angela Zhang, Kevin Wei, Teresa Nguyen, Sith Riantawan, Patricia Shi Riantawan, Daniel E. Ho, and James Zou. How well do LLMs cite relevant medical references? An evaluation framework and analyses. arXiv preprint arXiv:2402.02008, 2024a

[68]

Kevin Wu, Eric Wu, and James Zou. ClashEval: Quantifying the tug-of-war between an LLM's internal prior and external evidence. arXiv preprint arXiv:2404.10198, 2024b

[69]

Zhishang Xiang, Chuanjie Wu, Qinggang Zhang, Shengyuan Chen, Zijin Hong, Xiao Huang, and Jinsong Su. When to use graphs in RAG: A comprehensive analysis for graph retrieval-augmented generation. arXiv preprint arXiv:2506.05690, 2025

[70]

Shi-Qi Yan, Jia-Chen Gu, Yun Zhu, and Zhen-Hua Ling. Corrective retrieval augmented generation. arXiv preprint arXiv:2401.15884, 2024

[71]

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018

[72]

Zijun Yao, Weijian Qi, Liangming Pan, Shulin Cao, Linmei Hu, Liu Weichuan, Lei Hou, and Juanzi Li. SeaKR: Self-aware knowledge retrieval for adaptive retrieval augmented generation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 27022--27043, Vienna, Austria, 2025. Association for...

doi:10.18653/v1/2025.acl-long.1312

[73]

Qinggang Zhang, Zhishang Xiang, Yilin Xiao, Le Wang, Junhui Li, Xinrun Wang, and Jinsong Su. FaithfulRAG: Fact-level conflict modeling for context-faithful retrieval-augmented generation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, pages 21863--21882. Association for Computational Linguistics, 2025. URL htt...

[74]

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-bench and chatbot arena. In Advances in Neural Information Processing Systems 36 (NeurIPS), Datasets and Benchmarks Track, 2023

[75]

Wenqing Zheng, Dmitri Kalaev, Noah Fatsi, Daniel Barcklow, Owen Reinert, Igor Melnyk, Senthil Kumar, and C. Bayan Bruss. Revisiting RAG retrievers: An information theoretic benchmark. arXiv preprint arXiv:2602.21553, 2026

[76]

Zexuan Zhong, Ziqing Huang, Alexander Wettig, and Danqi Chen. Poisoning retrieval corpora by injecting adversarial passages. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

[77]

Dongzhuoran Zhou, Yuqicheng Zhu, Xiaxia Wang, Hongkuan Zhou, Yuan He, Jiaoyan Chen, Steffen Staab, and Evgeny Kharlamov. What breaks knowledge graph based RAG? Benchmarking and empirical insights into reasoning under incomplete knowledge. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2026

[78]

Xiangrong Zhu, Yuexiang Xie, Yi Liu, Yaliang Li, and Wei Hu. Knowledge graph-guided retrieval augmented generation. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 8912--8924, 2025a. doi:10.18653/v1/2025.naacl-long.44...

doi:10.18653/v1/2025.naacl-long.449

[79]

Yuqicheng Zhu, Jingcheng Wu, Yizhen Wang, Hongkuan Zhou, Jiaoyan Chen, Evgeny Kharlamov, and Steffen Staab. Certainty in uncertainty: Reasoning over uncertain knowledge graphs with statistical guarantees. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 8730--8752, 2025b. doi:10.18653/v1/2025.emnlp-main.44...

doi:10.18653/v1/2025.emnlp-main.441

[80]

Ilana Zimmerman, Jadin Tredup, Ethan Selfridge, and Joseph Bradley. Two-tiered encoder-based hallucination detection for retrieval-augmented generation in the wild. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 8--22, Miami, Florida, US, 2024. Association for Computational Linguistics. doi...

doi:10.18653/v1/2024.emnlp-industry.2