Why Neighborhoods Matter: Traversal Context and Provenance in Agentic GraphRAG
Pith reviewed 2026-05-15 03:13 UTC · model grok-4.3
The pith
In Agentic GraphRAG, accurate answers depend on both cited evidence and the uncited traversal context from the agent's graph exploration.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that citation faithfulness in Agentic GraphRAG is a trajectory-level problem: final citations should account for the graph traversal, structure, and visited-but-uncited entities that influence the answer. Controlled ablations demonstrate that cited evidence is necessary since its removal changes answers and reduces accuracy, but not sufficient because accurate answers also depend on uncited traversal context and surrounding graph structure.
What carries the argument
Trajectory-level citation faithfulness evaluated through controlled ablation experiments that isolate, remove, and mask cited versus uncited graph entities.
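The three ablation conditions can be sketched as simple operations over the context the generator sees. This is a minimal illustrative sketch, not the paper's implementation; the function name, the `{entity: passage}` representation, and the `[MASKED]` placeholder are assumptions.

```python
def ablate(context, cited, mode):
    """Build the generator's context under one ablation condition.

    context: {entity_id: passage_text} for every entity visited during traversal
    cited:   set of entity ids the agent cited in its final answer
    mode:    which condition to apply (names are illustrative, not the paper's)
    """
    if mode == "isolate_cited":
        # Keep only the cited passages: tests whether citations alone suffice.
        return {e: t for e, t in context.items() if e in cited}
    if mode == "remove_cited":
        # Drop cited passages, keep the rest: tests whether citations are necessary.
        return {e: t for e, t in context.items() if e not in cited}
    if mode == "mask_uncited":
        # Keep every visited slot but blank uncited text, preserving structure.
        return {e: (t if e in cited else "[MASKED]") for e, t in context.items()}
    raise ValueError(f"unknown mode: {mode}")
```

Comparing answer accuracy across these conditions is what separates "necessary" from "sufficient" in the paper's framing.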
If this is right
- Citation evaluation should shift from checking source support to tracking provenance over the entire retrieval trajectory.
- Removing cited evidence substantially changes generated answers and lowers accuracy.
- Accurate answers can depend on uncited graph neighbors and visited context even when citations appear supportive.
- Agentic GraphRAG systems need mechanisms to expose or account for the full exploration path in their outputs.
Where Pith is reading between the lines
- Developers of graph-based retrieval systems might need to log and report full traversal paths rather than just final citations to improve transparency.
- Future benchmarks could include tests for sensitivity to neighborhood structure around cited nodes.
- Similar trajectory issues may arise in other agentic retrieval setups beyond graphs, such as web navigation agents.
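The logging the first point calls for can be sketched as a small provenance record kept alongside the agent's answer. The class and method names below are hypothetical, a minimal sketch of what trajectory-level provenance might capture.

```python
from dataclasses import dataclass, field

@dataclass
class TrajectoryRecord:
    """Provenance for one query: every visited entity in order, plus final citations."""
    query: str
    visits: list = field(default_factory=list)     # (entity_id, edge_taken), in visit order
    citations: list = field(default_factory=list)  # subset of visited entities

    def visit(self, entity_id, via_edge=None):
        self.visits.append((entity_id, via_edge))

    def cite(self, entity_id):
        # A citation must point at something the agent actually visited.
        if entity_id not in [e for e, _ in self.visits]:
            raise ValueError(f"cannot cite unvisited entity {entity_id!r}")
        self.citations.append(entity_id)

    def uncited_context(self):
        # The visited-but-uncited entities the paper argues still shape the answer.
        cited = set(self.citations)
        return [e for e, _ in self.visits if e not in cited]
```

Reporting `uncited_context()` next to the citations is one concrete way a system could expose the full exploration path in its outputs.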
Load-bearing premise
The controlled ablation experiments accurately isolate the causal influence of traversal context without artifacts from the specific graphs, agents, or masking procedures used.
What would settle it
An experiment showing no substantial change in answers or accuracy when uncited traversal entities are removed or masked, while keeping citations intact.
Original abstract
Retrieval-Augmented Generation can improve factuality by grounding answers in external evidence, but Agentic GraphRAG complicates what it means for citations to be faithful. In these systems, an agent explores a knowledge graph before producing an answer and a small set of citations. We frame citation faithfulness as a trajectory-level problem: final citations should not only support the answer, but also account for the graph traversal, structure, and visited-but-uncited entities that may influence it. Through controlled ablation experiments, we compare the effects of isolating, removing, and masking cited and uncited graph entities. Our results show that cited evidence is often necessary, as removing it substantially changes answers and reduces accuracy. However, citations are not sufficient, because accurate answers can also depend on uncited traversal context and surrounding graph structure. These findings suggest that citation evaluation in Agentic GraphRAG should move beyond source support toward provenance over the broader retrieval trajectory.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that citation faithfulness in Agentic GraphRAG should be assessed at the trajectory level, accounting for the agent's graph traversal, structure, and visited-but-uncited entities, rather than solely the final cited sources. Through ablation experiments involving isolation, removal, and masking of cited and uncited graph entities, the results indicate that cited evidence is necessary for accurate answers but not sufficient, as uncited traversal context and surrounding graph structure also play a critical role.
Significance. If the central claims hold under rigorous verification, this work provides valuable empirical insights into the limitations of standard citation evaluation in agentic retrieval-augmented generation systems. It emphasizes the importance of provenance over the full retrieval trajectory, which could guide the development of more reliable evaluation metrics and improve the factuality of answers in GraphRAG applications.
major comments (2)
- [Ablation Experiments] The controlled ablation experiments do not report whether the agent's traversal paths were logged and held constant across different masking and removal conditions. Since masking graph entities could alter connectivity or visibility, potentially changing the agent's exploration policy and path selection, the observed differences in answer accuracy may partly reflect modified retrieval behavior rather than the isolated effect of uncited traversal context.
- [Results and Discussion] The ablation results are presented at a high level without error bars, exact graph statistics (such as number of nodes, edges, or traversal lengths), or statistical tests for the changes in accuracy. This omission makes it difficult to assess the robustness and generalizability of the findings that removing cited evidence substantially changes answers while uncited context is also required.
minor comments (1)
- [Abstract] The abstract provides a clear overview but could benefit from a brief mention of the specific datasets or graph types used in the experiments to give readers immediate context.
Simulated Author's Rebuttal
We thank the referee for their positive assessment and constructive feedback on our work framing citation faithfulness as a trajectory-level problem in Agentic GraphRAG. The comments have prompted us to strengthen the methodological details and statistical reporting in the manuscript. We respond point-by-point below.
Point-by-point responses
Referee: [Ablation Experiments] The controlled ablation experiments do not report whether the agent's traversal paths were logged and held constant across different masking and removal conditions. Since masking graph entities could alter connectivity or visibility, potentially changing the agent's exploration policy and path selection, the observed differences in answer accuracy may partly reflect modified retrieval behavior rather than the isolated effect of uncited traversal context.
Authors: We appreciate this methodological concern. Our original experiments re-ran the fixed agent policy on each modified graph to measure downstream effects, but we did not log or constrain paths to be identical across conditions. This leaves open the possibility that connectivity changes influenced path selection. In the revision we have added a new subsection detailing the experimental protocol, including traversal logging, and we now report results from additional controlled ablations that replay the original node-visit sequence on the modified graph (where feasible, by forcing visits to the same entities in the same order). These updates isolate the contribution of uncited context more cleanly while preserving the core findings. revision: yes
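The replay control the authors describe can be sketched as follows; the function name and the `{entity: passage}` graph representation are illustrative assumptions, not the paper's code.

```python
def replay_visits(original_order, modified_graph):
    """Replay the agent's original node-visit sequence on an ablated graph.

    original_order: entity ids in the order the unmodified agent visited them
    modified_graph: {entity_id: passage_text} after removal or masking
    Returns the context retrievable on replay. Entities deleted by the ablation
    are skipped, so path selection is held fixed across conditions and only the
    available context varies.
    """
    replayed = {}
    for entity in original_order:
        if entity in modified_graph:  # entity survived the ablation
            replayed[entity] = modified_graph[entity]
    return replayed
```

Holding the visit order constant is what separates "the context changed" from "the agent explored differently", which is exactly the confound the referee raises.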
Referee: [Results and Discussion] The ablation results are presented at a high level without error bars, exact graph statistics (such as number of nodes, edges, or traversal lengths), or statistical tests for the changes in accuracy. This omission makes it difficult to assess the robustness and generalizability of the findings that removing cited evidence substantially changes answers while uncited context is also required.
Authors: We agree that greater statistical detail is needed. The revised manuscript now includes error bars (standard deviation across 5 independent runs) on all accuracy plots, a new table reporting exact graph statistics (mean nodes visited, edges in the induced subgraph, and traversal lengths per query), and paired statistical tests (Wilcoxon signed-rank) confirming that accuracy drops after cited-entity removal are significant while the effects of uncited-context masking remain meaningful. These additions appear in the Results section and improve the ability to evaluate robustness. revision: yes
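The paired test the authors add can be illustrated with a stdlib-only stand-in. This exact sign-flip permutation test is not the Wilcoxon signed-rank test they cite, only a simpler paired test in the same spirit; the data below are invented for illustration.

```python
from itertools import product

def sign_flip_test(baseline, ablated):
    """Exact paired sign-flip test on per-query accuracy differences.

    For small n, enumerate all 2^n sign assignments of the paired differences
    and return the two-sided p-value for the observed mean shift. A stdlib
    stand-in for scipy.stats.wilcoxon; feasible only for small samples.
    """
    diffs = [b - a for b, a in zip(baseline, ablated)]
    observed = abs(sum(diffs))
    n = len(diffs)
    extreme = sum(
        1
        for signs in product((1, -1), repeat=n)
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed
    )
    return extreme / 2 ** n
```

With four queries whose accuracy consistently drops after cited-entity removal, only the two all-same-sign assignments are as extreme as the observed shift, giving p = 2/16 = 0.125; larger samples are needed for significance, which is why per-query counts matter in the revised tables.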
Circularity Check
No circularity: claims rest on independent ablation experiments
full rationale
The paper's central claims derive from controlled ablation experiments that isolate, remove, and mask cited versus uncited graph entities and measure effects on answer accuracy and changes. No equations, fitted parameters, self-definitional constructs, or load-bearing self-citations appear in the provided text. The necessity of cited evidence and insufficiency of citations alone are established by direct comparison of experimental conditions rather than by construction or renaming of inputs. This is a standard empirical setup with no reduction of outputs to the inputs by definition, consistent with the reader's assessment of low circularity risk.