pith. machine review for the scientific record.

arxiv: 2605.14192 · v1 · submitted 2026-05-13 · 💻 cs.CL · cs.AI

Recognition: no theorem link

Why Retrieval-Augmented Generation Fails: A Graph Perspective

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 04:44 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords retrieval-augmented generation · attribution graphs · circuit tracing · information flow · error detection · question answering · transformer models

The pith

Successful RAG answers route evidence through deeper and more distributed paths in attribution graphs than failures do.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines why retrieval-augmented generation still produces errors despite access to external evidence by tracing how that evidence moves inside the model. It constructs attribution graphs via circuit tracing to track interactions between retrieved context, intermediate activations, and output tokens across transformer layers. Correct predictions consistently show deeper reasoning paths, more distributed evidence flow, and structured local connectivity, while incorrect ones display shallower, fragmented, and overly concentrated patterns. These differences support a graph-based method to detect errors and allow interventions that reinforce question-constrained grounding to reshape routing and reduce mistakes.

Core claim

Using circuit tracing to construct attribution graphs that model information flow from retrieved context through transformer layers to generated tokens, the paper shows that correct predictions exhibit deeper reasoning paths, more distributed evidence flow, and a more structured pattern of local connectivity, while failed predictions show shallower, fragmented, and overly concentrated evidence flow. These structural differences hold across multiple question-answering benchmarks. The authors build a graph-based error detection framework from attribution-graph topology features and demonstrate that reinforcing question-constrained evidence grounding reshapes internal routing, keeping answer generation guided by the question and reducing errors.

What carries the argument

Attribution graphs from circuit tracing, which represent interactions among retrieved context, intermediate model activations, and generated tokens to track evidence integration during decoding.
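
To make the machinery concrete: a minimal sketch of how such a graph could be assembled and scored, assuming per-edge attribution scores from circuit tracing are already available. The node naming, the toy scores, and the three metric choices are illustrative assumptions, not the authors' definitions.

```python
# Minimal sketch (not the paper's code): build an attribution graph from
# hypothetical circuit-tracing edge scores and compute the three structural
# signals contrasted across correct and wrong predictions.
import math
import networkx as nx

# Hypothetical edges: (source, target, attribution score), where a node is
# a (layer index, token-or-feature id) pair; layer 0 holds input tokens.
edges = [
    ((0, "doc_tok_3"), (4, "feat_17"), 0.42),
    ((0, "q_tok_1"), (4, "feat_17"), 0.31),
    ((4, "feat_17"), (9, "feat_88"), 0.55),
    ((9, "feat_88"), (12, "ans_tok_0"), 0.61),
]

G = nx.DiGraph()
for src, dst, w in edges:
    G.add_edge(src, dst, weight=w)

# 1) Reasoning depth: hop count of the longest path. Edges only run from
#    earlier to later layers, so the graph is a DAG.
depth = len(nx.dag_longest_path(G, weight=None)) - 1

# 2) Distributed vs. concentrated flow: entropy of attribution mass over
#    edges leaving the input layer; low entropy = concentrated evidence.
ctx = [d["weight"] for u, _, d in G.edges(data=True) if u[0] == 0]
probs = [w / sum(ctx) for w in ctx]
flow_entropy = -sum(p * math.log(p) for p in probs)

# 3) Local connectivity: average clustering of the undirected skeleton.
local_connectivity = nx.average_clustering(G.to_undirected())

print(depth, flow_entropy, local_connectivity)
```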

If this is right

  • Graph topology features can be used to detect errors in RAG outputs before final generation (a minimal detector sketch follows this list).
  • Reinforcing question-constrained evidence grounding reshapes internal routing and improves integration of retrieved information.
  • Targeted changes to evidence flow patterns reduce errors without altering the underlying model or retriever.
  • Consistent structural differences in attribution graphs separate successful from failed predictions across benchmarks.
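
A minimal sketch of such a detector, assuming the three topology features above have been computed per generated answer; the toy feature values and the choice of a logistic-regression probe are illustrative assumptions rather than the paper's framework.

```python
# Minimal sketch (assumed pipeline, not the authors'): flag likely RAG
# errors from attribution-graph topology features with a linear probe.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-answer features: [path depth, flow entropy, clustering].
X_train = np.array([
    [7, 1.9, 0.31],  # deep, distributed flow      -> answer was correct
    [8, 2.1, 0.28],
    [3, 0.6, 0.05],  # shallow, concentrated flow  -> answer was wrong
    [2, 0.4, 0.02],
])
y_train = np.array([1, 1, 0, 0])  # 1 = correct, 0 = error

probe = LogisticRegression().fit(X_train, y_train)

# Score a new generation: low P(correct) marks it as a likely error.
x_new = np.array([[3, 0.5, 0.04]])
p_correct = probe.predict_proba(x_new)[0, 1]
print("likely error" if p_correct < 0.5 else "likely correct")
```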

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The observed flow patterns suggest that retriever design could prioritize documents likely to produce distributed paths rather than single strong matches.
  • Extending the same graph analysis to non-QA tasks could identify analogous predictors of generation success.
  • Training objectives that reward deeper evidence routing might reduce RAG failures more directly than post-hoc fixes.

Load-bearing premise

That attribution graphs constructed from circuit tracing faithfully capture the causal contribution of retrieved evidence to the generated answer rather than merely correlating with surface statistics.

What would settle it

An intervention that alters graph depth or distribution of evidence flow through targeted activation edits while holding retrieved documents and inputs fixed, then measures whether answer accuracy changes.
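
The paper's own intervention mechanism, as described alongside Figure 7, is a lightweight hook inserted into the forward pass that intercepts the attention pattern just before it aggregates value vectors and rescales specific groups of attention weights at chosen layers, without changing model parameters. A minimal sketch of that style of targeted edit, assuming the model exposes attention probabilities through a hookable submodule; the module path, tensor shape, and scaling scheme are illustrative assumptions.

```python
# Minimal sketch (assumed interfaces): boost attention mass on the question
# span at chosen layers while holding the prompt fixed, then compare answer
# accuracy with and without the hook on identical inputs.
import torch

def make_routing_hook(question_slice, scale=1.5):
    """Forward hook that upweights attention to question-token keys.

    Assumes the hooked module outputs attention probabilities of shape
    (batch, heads, query_len, key_len); real models may expose this
    tensor elsewhere, or not at all without patching.
    """
    def hook(module, inputs, output):
        attn = output.clone()
        attn[..., question_slice] *= scale            # rescale question keys
        return attn / attn.sum(dim=-1, keepdim=True)  # renormalize each row
    return hook

# Hypothetical usage with a decoder whose attention probabilities live at
# model.layers[i].attn_probs (an assumed module path):
# handles = [
#     model.layers[i].attn_probs.register_forward_hook(
#         make_routing_hook(slice(0, num_question_tokens)))
#     for i in chosen_layers
# ]
# ... generate, measure accuracy, then: [h.remove() for h in handles]
```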

Figures

Figures reproduced from arXiv: 2605.14192 by Haoyu Han, Jie Ren, Jiliang Tang, Kai Guo, Nuohan Lin, Shenglai Zeng, Xinnan Dai, Zhibo Zhang.

Figure 1. Radar comparison of attribution-graph structural … view at source ↗
Figure 2. Layer-wise attribution mass for correct and wrong … view at source ↗
Figure 3. Region-level attribution comparison between cor… view at source ↗
Figure 4. Layer-wise attribution comparison for Q → Ans_EXT and Q → Q. Left: mean routing strength per layer for correct and wrong predictions. Right: relative differences (green = correct higher, red = wrong higher). view at source ↗
Figure 6. Performance comparison across QA benchmarks. view at source ↗
Figure 7. Performance comparison on Mix-MusiQue. view at source ↗
Figure 8. Radar comparison of attribution-graph structural … view at source ↗
Figure 9. Layer-wise attribution mass for correct and wrong … view at source ↗
Figure 10. Layer-wise attribution mass for correct and wrong … view at source ↗
Figure 11. Examples where routing control improves multi-hop reasoning. In both cases, the baseline model either stops at an … view at source ↗
read the original abstract

Retrieval-Augmented Generation (RAG) has become a powerful and widely used approach for improving large language models by grounding generation in retrieved evidence. However, RAG systems still produce incorrect answers in many cases. Why RAG fails despite having access to external information remains poorly understood. We present a model-internal study of retrieval-augmented generation that examines how retrieved evidence influences answer generation. Using circuit tracing, we construct attribution graphs that model the flow of information through transformer layers during decoding. These graphs represent interactions among retrieved context, intermediate model activations, and generated tokens, providing a circuit-level view of how external evidence is integrated into the model's reasoning process. Across multiple question answering benchmarks, we observe consistent structural differences: correct predictions exhibit deeper reasoning paths, more distributed evidence flow, and a more structured pattern of local connectivity, while failed predictions show shallower, fragmented, and overly concentrated evidence flow. Building on these findings, we develop a graph-based error detection framework that uses attribution-graph topology features. Furthermore, we show that attribution graphs enable targeted interventions. By reinforcing question-constrained evidence grounding, we reshape internal routing so that answer generation remains guided by the question, leading to more effective integration of retrieved information and fewer errors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that circuit tracing can be used to build attribution graphs capturing information flow from retrieved context through transformer layers during RAG decoding. Across multiple QA benchmarks, correct predictions consistently exhibit deeper reasoning paths, more distributed evidence flow, and structured local connectivity, whereas failures show shallower, fragmented, and overly concentrated flows. These topological differences are used to build a graph-based error detection framework and to perform targeted interventions that reinforce question-constrained evidence grounding, thereby improving integration of retrieved information and reducing errors.

Significance. If the attribution graphs are shown to isolate causal contributions of retrieved evidence, the work would supply a mechanistic account of why RAG fails despite access to relevant context and would furnish concrete, graph-derived tools for error detection and mitigation. The reported consistency across benchmarks and the intervention results would constitute a useful advance in interpretability for augmented generation if the causal status of the graphs is established.

major comments (3)
  1. [Methods (attribution graph construction)] The central claim that attribution graphs reveal genuine causal differences in evidence routing rests on the unverified assumption that the circuit-tracing scores isolate the causal contribution of specific retrieved tokens rather than downstream correlates such as answer length, token entropy, or layer-wise activation magnitude. No exhaustive counterfactual validation or comparison against full ablation baselines is described.
  2. [Results (structural differences and interventions)] The abstract asserts 'consistent structural differences' across benchmarks, yet supplies no quantitative effect sizes, statistical tests, controls for confounding variables, or details on how the interventions were implemented. These omissions leave the load-bearing empirical claim under-supported.
  3. [Error detection and intervention sections] The error-detection framework and intervention results are presented as direct consequences of the graph topology findings, but without reported performance deltas, baseline comparisons, or ablation of the topology features themselves, it is unclear whether the claimed improvements are attributable to the identified structural properties.
minor comments (2)
  1. [Methods] The notation and precise definition of 'attribution graph' nodes and edges should be formalized early in the methods to avoid ambiguity when discussing path depth and evidence flow.
  2. [Figures] Figure captions and axis labels for the reported graph visualizations could be expanded to include the exact metrics used for 'depth,' 'distributed flow,' and 'local connectivity.'

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies key areas where the causal and empirical support for our attribution-graph approach can be strengthened. We have revised the manuscript to incorporate the requested validations, quantitative analyses, and controls. Point-by-point responses follow.

read point-by-point responses
  1. Referee: [Methods (attribution graph construction)] The central claim that attribution graphs reveal genuine causal differences in evidence routing rests on the unverified assumption that the circuit-tracing scores isolate the causal contribution of specific retrieved tokens rather than downstream correlates such as answer length, token entropy, or layer-wise activation magnitude. No exhaustive counterfactual validation or comparison against full ablation baselines is described.

    Authors: We agree that stronger causal validation is necessary. In the revised manuscript we have added a dedicated subsection describing exhaustive counterfactual token ablations and direct comparisons against full ablation baselines. These experiments control for answer length, token entropy, and layer-wise activation magnitude and confirm that the circuit-tracing scores primarily isolate causal contributions from retrieved tokens (a schematic of such a validation follows these responses). revision: yes

  2. Referee: [Results (structural differences and interventions)] The abstract asserts 'consistent structural differences' across benchmarks, yet supplies no quantitative effect sizes, statistical tests, controls for confounding variables, or details on how the interventions were implemented. These omissions leave the load-bearing empirical claim under-supported.

    Authors: We have expanded the Results section to include quantitative effect sizes, statistical tests (paired t-tests with p-values), and explicit controls for confounding variables. Expanded Methods text now details the precise implementation of the interventions, including the reinforcement mechanism. Updated tables and figures present these metrics across all benchmarks. revision: yes

  3. Referee: [Error detection and intervention sections] The error-detection framework and intervention results are presented as direct consequences of the graph topology findings, but without reported performance deltas, baseline comparisons, or ablation of the topology features themselves, it is unclear whether the claimed improvements are attributable to the identified structural properties.

    Authors: We have augmented these sections with reported performance deltas, comparisons against standard RAG baselines and random-intervention controls, and ablations that isolate the contribution of individual topology features (path depth, evidence distribution, local connectivity). The new results demonstrate that the observed gains are attributable to the identified structural properties. revision: yes
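
What the counterfactual validation in response 1 could look like, in outline: correlate each retrieved token's attribution mass with the causal effect of actually ablating that token. Everything below (attribution_mass, answer_loglik, ablate, the example fields) is a hypothetical placeholder interface, not the authors' API.

```python
# Minimal sketch (hypothetical harness): do circuit-tracing scores track
# causal effects? Compare each retrieved token's attribution mass against
# the answer log-likelihood drop when that token is ablated.
from scipy.stats import spearmanr

def validate_attributions(example, model, attribution_mass, answer_loglik, ablate):
    """All callables and example fields are assumed interfaces."""
    scores, effects = [], []
    base = answer_loglik(model, example.prompt, example.answer)
    for pos in example.retrieved_token_positions:
        scores.append(attribution_mass(model, example, pos))
        prompt_ablated = ablate(example.prompt, pos)  # e.g. mask one token
        effects.append(base - answer_loglik(model, prompt_ablated, example.answer))
    # High rank correlation supports the causal reading of the graphs;
    # low correlation suggests they track surface statistics instead.
    rho, p_value = spearmanr(scores, effects)
    return rho, p_value
```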

Circularity Check

0 steps flagged

No circularity: empirical graph measurements on held-out data

full rationale

The paper's core claims rest on constructing attribution graphs via circuit tracing and then measuring their topological properties (path depth, evidence flow distribution, local connectivity) directly on held-out QA benchmarks. These measurements are reported as observations rather than quantities defined by or fitted to the same data in a self-referential loop. The subsequent error-detection framework and intervention experiments are downstream applications of those independent measurements, with no equations or steps shown to reduce by construction to the inputs. No self-citation is load-bearing for the central result, and no ansatz or uniqueness theorem is smuggled in to force the outcome.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach assumes standard mechanistic interpretability tools can isolate evidence flow; no new free parameters or invented physical entities are introduced beyond the attribution-graph representation itself.

axioms (1)
  • domain assumption · Circuit tracing identifies paths that carry information from input tokens to output logits in transformer models
    Invoked when constructing attribution graphs from model activations during decoding.
invented entities (1)
  • attribution graph · no independent evidence
    purpose: Represent interactions among retrieved context, intermediate activations, and generated tokens as a graph
    New representational object introduced to visualize and quantify evidence flow.

pith-pipeline@v0.9.0 · 5533 in / 1275 out tokens · 41308 ms · 2026-05-15T04:44:55.047623+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · 7 internal anchors

  [1] Emmanuel Ameisen, Jack Lindsey, Adam Pearce, Wes Gurnee, Nicholas L. Turner, Brian Chen, Craig Citro, David Abrahams, Shan Carter, Basil Hosmer, Jonathan Marcus, Michael Sklar, Adly Templeton, Trenton Bricken, Callum McDougall, Hoagy Cunningham, Thomas Henighan, Adam Jermyn, Andy Jones, Andrew Persic, Zhenyi Qi, T. Ben Thompson, Sam Zimmerman, Kelley Rivo...

  [2] Nicholas Ampazis. 2024. Improving RAG quality for large language models with topic-enhanced reranking. In IFIP International Conference on Artificial Intelligence Applications and Innovations. Springer, 74–87

  [3] Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2023. Self-RAG: Self-reflective retrieval augmented generation. In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following

  [4] Orlando Ayala and Patrice Bechard. 2024. Reducing hallucination in structured outputs via Retrieval-Augmented Generation. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track). 228–238

  [5] Chi-Min Chan, Chunpu Xu, Ruibin Yuan, Hongyin Luo, Wei Xue, Yike Guo, and Jie Fu. 2024. RQ-RAG: Learning to refine queries for retrieval augmented generation. arXiv preprint arXiv:2404.00610 (2024)

  [6] Jiawei Chen, Hongyu Lin, Xianpei Han, and Le Sun. 2024. Benchmarking large language models in retrieval-augmented generation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 17754–17762

  [7] Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D Manning. 2019. What does BERT look at? An analysis of BERT's attention. arXiv preprint arXiv:1906.04341 (2019)

  [9] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. 2025. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)

  [10] Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. 2023. Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600 (2023)

  [12] Xinnan Dai, Kai Guo, Chung-Hsiang Lo, Shenglai Zeng, Jiayuan Ding, Dongsheng Luo, Subhabrata Mukherjee, and Jiliang Tang. 2025. GraphGhost: Tracing structures behind large language models. arXiv preprint arXiv:2510.08613 (2025)

  [13] Jialin Dong, Bahare Fatemi, Bryan Perozzi, Lin F Yang, and Anton Tsitsulin. 2024. Don't forget to connect! Improving RAG with graph-based reranking. arXiv preprint arXiv:2405.18414 (2024)

  [15] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The Llama 3 herd of models. arXiv e-prints (2024), arXiv–2407

  [16] Jacob Dunefsky, Philippe Chlenski, and Neel Nanda. 2024. Transcoders find interpretable LLM feature circuits. Advances in Neural Information Processing Systems 37 (2024), 24375–24410

  [17] Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson. 2024. From local to global: A graph RAG approach to query-focused summarization. arXiv preprint arXiv:2404.16130 (2024)

  [18] Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah...

  [19] Wenqi Fan, Yujuan Ding, Liangbo Ning, Shijie Wang, Hengyun Li, Dawei Yin, Tat-Seng Chua, and Qing Li. 2024. A survey on RAG meeting LLMs: Towards retrieval-augmented large language models. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 6491–6501

  [20] Javier Ferrando, Oscar Obeso, Senthooran Rajamanoharan, and Neel Nanda. 2024. Do I know this entity? Knowledge awareness and hallucinations in language models. arXiv preprint arXiv:2411.14257 (2024)

  [21] Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yixin Dai, Jiawei Sun, Haofen Wang, and Haofen Wang. 2023. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997 2, 1 (2023)

  [22] Kai Guo, Harry Shomer, Shenglai Zeng, Haoyu Han, Yu Wang, and Jiliang Tang. 2025. Empowering GraphRAG with knowledge filtering and integration. arXiv preprint arXiv:2503.13804 (2025)

  [24] Shailja Gupta, Rajesh Ranjan, and Surya Narayan Singh. 2024. A comprehensive survey of retrieval-augmented generation (RAG): Evolution, current landscape and future directions. arXiv preprint arXiv:2410.12837 (2024)

  [25] Haoyu Han, Yu Wang, Harry Shomer, Kai Guo, Jiayuan Ding, Yongjia Lei, Mahantesh Halappanavar, Ryan A Rossi, Subhabrata Mukherjee, Xianfeng Tang, et al. 2024. Retrieval-augmented generation with graphs (GraphRAG). arXiv preprint arXiv:2501.00309 (2024)

  [27] Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. 2020. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In Proceedings of the 28th International Conference on Computational Linguistics. 6609–6625

  [28] Wentao Hu, Wengyu Zhang, Yiyang Jiang, Chen Jason Zhang, Xiaoyong Wei, and Li Qing. 2025. Removal of hallucination on hallucination: Debate-augmented RAG. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 15839–15853

  [29] Dahyun Lee, Yongrae Jo, Haeju Park, and Moontae Lee. 2025. Shifting from ranking to set selection for retrieval augmented generation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 17606–17619

  [30] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems 33 (2020), 9459–9474

  [31] Zhining Liu, Rana Ali Amjad, Ravinarayana Adkathimar, Tianxin Wei, and Hanghang Tong. 2025. SelfElicit: Your language model secretly knows where is the relevant evidence. arXiv preprint arXiv:2502.08767 (2025)

  [32] Samuel Marks, Can Rager, Eric J Michaud, Yonatan Belinkov, David Bau, and Aaron Mueller. 2024. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models. arXiv preprint arXiv:2403.19647 (2024)

  [33] Jinming Nian, Zhiyuan Peng, Qifan Wang, and Yi Fang. 2025. W-RAG: Weakly supervised dense retrieval in RAG for open-domain question answering. In Proceedings of the 2025 International ACM SIGIR Conference on Innovative Concepts and Theories in Information Retrieval (ICTIR). 136–146

  [34] Cheng Niu, Yuanhao Wu, Juno Zhu, Siliang Xu, Kashun Shum, Randy Zhong, Juntong Song, and Tong Zhang. 2024. RAGTruth: A hallucination corpus for developing trustworthy retrieval-augmented language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 10862–10878

  [35] Gonçalo Paulo, Alex Mallen, Caden Juang, and Nora Belrose. 2024. Automatically interpreting millions of features in large language models. arXiv preprint arXiv:2410.13928 (2024)

  [36] Boci Peng, Yun Zhu, Yongchao Liu, Xiaohe Bo, Haizhou Shi, Chuntao Hong, Yan Zhang, and Siliang Tang. 2025. Graph retrieval-augmented generation: A survey. ACM Transactions on Information Systems 44, 2 (2025), 1–52

  [37] Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. 2023. Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy. arXiv preprint arXiv:2305.15294 (2023)

  [38] Weihang Su, Yichen Tang, Qingyao Ai, Junxi Yan, Changyue Wang, Hongning Wang, Ziyi Ye, Yujia Zhou, and Yiqun Liu. 2025. Parametric retrieval augmented generation. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1240–1250

  [39] Yixuan Tang and Yi Yang. 2024. MultiHop-RAG: Benchmarking retrieval-augmented generation for multi-hop queries. arXiv preprint arXiv:2401.15391 (2024)

  [41] Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2022. MuSiQue: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics 10 (2022), 539–554

  [42] Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2023. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 10014–10037

  [44] Jesse Vig and Yonatan Belinkov. 2019. Analyzing the structure of attention in a transformer language model. arXiv preprint arXiv:1906.04284 (2019)

  [45] Zhiruo Wang, Jun Araki, Zhengbao Jiang, Md Rizwan Parvez, and Graham Neubig. 2023. Learning to filter context for retrieval-augmented generation. arXiv preprint arXiv:2311.08377 (2023)

  [47] Wenlong Wu, Haofen Wang, Bohan Li, Peixuan Huang, Xinzhe Zhao, and Lei Liang. 2025. MultiRAG: A knowledge-guided framework for mitigating hallucination in multi-source retrieval augmented generation. In 2025 IEEE 41st International Conference on Data Engineering (ICDE). IEEE, 3070–3083

  [48] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2369–2380

  [49] Yue Yu, Wei Ping, Zihan Liu, Boxin Wang, Jiaxuan You, Chao Zhang, Mohammad Shoeybi, and Bryan Catanzaro. 2024. RankRAG: Unifying context ranking with retrieval-augmented generation in LLMs. Advances in Neural Information Processing Systems 37 (2024), 121156–121184

  [50] Shenglai Zeng, Jiankun Zhang, Bingheng Li, Yuping Lin, Tianqi Zheng, Dante Everaert, Hanqing Lu, Hui Liu, Yue Xing, Monica Xiao Cheng, et al. 2025. Towards knowledge checking in retrieval-augmented generation: A representation perspective. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational ...

  [51] Penghao Zhao, Hailin Zhang, Qinhan Yu, Zhengren Wang, Yunteng Geng, Fangcheng Fu, Ling Yang, Wentao Zhang, Jie Jiang, and Bin Cui. 2026. Retrieval-augmented generation for AI-generated content: A survey. Data Science and Engineering (2026), 1–29

  [52] Zheng Zhao, Yeskendir Koishekenov, Xianjun Yang, Naila Murray, and Nicola Cancedda. 2025. Verifying chain-of-thought reasoning via its computational graph. arXiv preprint arXiv:2510.09312 (2025)

  [53] Xu Zheng, Ziqiao Weng, Yuanhuiyi Lyu, Lutao Jiang, Haiwei Xue, Bin Ren, Danda Paudel, Nicu Sebe, Luc Van Gool, and Xuming Hu. 2025. Retrieval augmented generation and understanding in vision: A survey and new outlook. arXiv preprint arXiv:2503.18016 (2025)

  [54] Yujia Zhou, Yan Liu, Xiaoxi Li, Jiajie Jin, Hongjin Qian, Zheng Liu, Chaozhuo Li, Zhicheng Dou, Tsung-Yi Ho, and Philip S Yu. 2024. Trustworthiness in retrieval-augmented generation systems: A survey. arXiv preprint arXiv:2409.10102 (2024)