Why Neighborhoods Matter: Traversal Context and Provenance in Agentic GraphRAG
Pith reviewed 2026-05-15 03:13 UTC · model grok-4.3
The pith
In Agentic GraphRAG, accurate answers depend on both cited evidence and the uncited traversal context from the agent's graph exploration.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that citation faithfulness in Agentic GraphRAG is a trajectory-level problem: final citations should account for the graph traversal, structure, and visited-but-uncited entities that influence the answer. Controlled ablations demonstrate that cited evidence is necessary since its removal changes answers and reduces accuracy, but not sufficient because accurate answers also depend on uncited traversal context and surrounding graph structure.
What carries the argument
Trajectory-level citation faithfulness evaluated through controlled ablation experiments that isolate, remove, and mask cited versus uncited graph entities.
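The three ablation conditions can be sketched as simple operations over the context the generator sees. This is a minimal illustrative sketch, not the paper's implementation; the function name, the `{entity: passage}` representation, and the `[MASKED]` placeholder are assumptions.

```python
def ablate(context, cited, mode):
    """Build the generator's context under one ablation condition.

    context: {entity_id: passage_text} for every entity visited during traversal
    cited:   set of entity ids the agent cited in its final answer
    mode:    which condition to apply (names are illustrative, not the paper's)
    """
    if mode == "isolate_cited":
        # Keep only the cited passages: tests whether citations alone suffice.
        return {e: t for e, t in context.items() if e in cited}
    if mode == "remove_cited":
        # Drop cited passages, keep the rest: tests whether citations are necessary.
        return {e: t for e, t in context.items() if e not in cited}
    if mode == "mask_uncited":
        # Keep every visited slot but blank uncited text, preserving structure.
        return {e: (t if e in cited else "[MASKED]") for e, t in context.items()}
    raise ValueError(f"unknown mode: {mode}")
```

Comparing answer accuracy across these conditions is what separates "necessary" from "sufficient" in the paper's framing.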
If this is right
- Citation evaluation should shift from checking source support to tracking provenance over the entire retrieval trajectory.
- Removing cited evidence substantially changes generated answers and lowers accuracy.
- Accurate answers can depend on uncited graph neighbors and visited context even when citations appear supportive.
- Agentic GraphRAG systems need mechanisms to expose or account for the full exploration path in their outputs.
Where Pith is reading between the lines
- Developers of graph-based retrieval systems might need to log and report full traversal paths rather than just final citations to improve transparency.
- Future benchmarks could include tests for sensitivity to neighborhood structure around cited nodes.
- Similar trajectory issues may arise in other agentic retrieval setups beyond graphs, such as web navigation agents.
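The logging the first point calls for can be sketched as a small provenance record kept alongside the agent's answer. The class and method names below are hypothetical, a minimal sketch of what trajectory-level provenance might capture.

```python
from dataclasses import dataclass, field

@dataclass
class TrajectoryRecord:
    """Provenance for one query: every visited entity in order, plus final citations."""
    query: str
    visits: list = field(default_factory=list)     # (entity_id, edge_taken), in visit order
    citations: list = field(default_factory=list)  # subset of visited entities

    def visit(self, entity_id, via_edge=None):
        self.visits.append((entity_id, via_edge))

    def cite(self, entity_id):
        # A citation must point at something the agent actually visited.
        if entity_id not in [e for e, _ in self.visits]:
            raise ValueError(f"cannot cite unvisited entity {entity_id!r}")
        self.citations.append(entity_id)

    def uncited_context(self):
        # The visited-but-uncited entities the paper argues still shape the answer.
        cited = set(self.citations)
        return [e for e, _ in self.visits if e not in cited]
```

Reporting `uncited_context()` next to the citations is one concrete way a system could expose the full exploration path in its outputs.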
Load-bearing premise
The controlled ablation experiments accurately isolate the causal influence of traversal context without artifacts from the specific graphs, agents, or masking procedures used.
What would settle it
An experiment showing no substantial change in answers or accuracy when uncited traversal entities are removed or masked, while keeping citations intact.
Original abstract
Retrieval-Augmented Generation can improve factuality by grounding answers in external evidence, but Agentic GraphRAG complicates what it means for citations to be faithful. In these systems, an agent explores a knowledge graph before producing an answer and a small set of citations. We frame citation faithfulness as a trajectory-level problem: final citations should not only support the answer, but also account for the graph traversal, structure, and visited-but-uncited entities that may influence it. Through controlled ablation experiments, we compare the effects of isolating, removing, and masking cited and uncited graph entities. Our results show that cited evidence is often necessary, as removing it substantially changes answers and reduces accuracy. However, citations are not sufficient, because accurate answers can also depend on uncited traversal context and surrounding graph structure. These findings suggest that citation evaluation in Agentic GraphRAG should move beyond source support toward provenance over the broader retrieval trajectory.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that citation faithfulness in Agentic GraphRAG should be assessed at the trajectory level, accounting for the agent's graph traversal, structure, and visited-but-uncited entities, rather than solely the final cited sources. Through ablation experiments involving isolation, removal, and masking of cited and uncited graph entities, the results indicate that cited evidence is necessary for accurate answers but not sufficient, as uncited traversal context and surrounding graph structure also play a critical role.
Significance. If the central claims hold under rigorous verification, this work provides valuable empirical insights into the limitations of standard citation evaluation in agentic retrieval-augmented generation systems. It emphasizes the importance of provenance over the full retrieval trajectory, which could guide the development of more reliable evaluation metrics and improve the factuality of answers in GraphRAG applications.
major comments (2)
- [Ablation Experiments] The controlled ablation experiments do not report whether the agent's traversal paths were logged and held constant across different masking and removal conditions. Since masking graph entities could alter connectivity or visibility, potentially changing the agent's exploration policy and path selection, the observed differences in answer accuracy may partly reflect modified retrieval behavior rather than the isolated effect of uncited traversal context.
- [Results and Discussion] The ablation results are presented at a high level without error bars, exact graph statistics (such as number of nodes, edges, or traversal lengths), or statistical tests for the changes in accuracy. This omission makes it difficult to assess the robustness and generalizability of the findings that removing cited evidence substantially changes answers while uncited context is also required.
minor comments (1)
- [Abstract] The abstract provides a clear overview but could benefit from a brief mention of the specific datasets or graph types used in the experiments to give readers immediate context.
Simulated Author's Rebuttal
We thank the referee for their positive assessment and constructive feedback on our work framing citation faithfulness as a trajectory-level problem in Agentic GraphRAG. The comments have prompted us to strengthen the methodological details and statistical reporting in the manuscript. We respond point-by-point below.
Point-by-point responses
Referee: [Ablation Experiments] The controlled ablation experiments do not report whether the agent's traversal paths were logged and held constant across different masking and removal conditions. Since masking graph entities could alter connectivity or visibility, potentially changing the agent's exploration policy and path selection, the observed differences in answer accuracy may partly reflect modified retrieval behavior rather than the isolated effect of uncited traversal context.
Authors: We appreciate this methodological concern. Our original experiments re-ran the fixed agent policy on each modified graph to measure downstream effects, but we did not log or constrain paths to be identical across conditions. This leaves open the possibility that connectivity changes influenced path selection. In the revision we have added a new subsection detailing the experimental protocol, including traversal logging, and we now report results from additional controlled ablations that replay the original node-visit sequence on the modified graph (where feasible, by forcing visits to the same entities in the same order). These updates isolate the contribution of uncited context more cleanly while preserving the core findings. revision: yes
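The replay control the authors describe can be sketched as follows; the function name and the `{entity: passage}` graph representation are illustrative assumptions, not the paper's code.

```python
def replay_visits(original_order, modified_graph):
    """Replay the agent's original node-visit sequence on an ablated graph.

    original_order: entity ids in the order the unmodified agent visited them
    modified_graph: {entity_id: passage_text} after removal or masking
    Returns the context retrievable on replay. Entities deleted by the ablation
    are skipped, so path selection is held fixed across conditions and only the
    available context varies.
    """
    replayed = {}
    for entity in original_order:
        if entity in modified_graph:  # entity survived the ablation
            replayed[entity] = modified_graph[entity]
    return replayed
```

Holding the visit order constant is what separates "the context changed" from "the agent explored differently", which is exactly the confound the referee raises.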
Referee: [Results and Discussion] The ablation results are presented at a high level without error bars, exact graph statistics (such as number of nodes, edges, or traversal lengths), or statistical tests for the changes in accuracy. This omission makes it difficult to assess the robustness and generalizability of the findings that removing cited evidence substantially changes answers while uncited context is also required.
Authors: We agree that greater statistical detail is needed. The revised manuscript now includes error bars (standard deviation across 5 independent runs) on all accuracy plots, a new table reporting exact graph statistics (mean nodes visited, edges in the induced subgraph, and traversal lengths per query), and paired statistical tests (Wilcoxon signed-rank) confirming that accuracy drops after cited-entity removal are significant while the effects of uncited-context masking remain meaningful. These additions appear in the Results section and improve the ability to evaluate robustness. revision: yes
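The paired test the authors add can be illustrated with a stdlib-only stand-in. This exact sign-flip permutation test is not the Wilcoxon signed-rank test they cite, only a simpler paired test in the same spirit; the data below are invented for illustration.

```python
from itertools import product

def sign_flip_test(baseline, ablated):
    """Exact paired sign-flip test on per-query accuracy differences.

    For small n, enumerate all 2^n sign assignments of the paired differences
    and return the two-sided p-value for the observed mean shift. A stdlib
    stand-in for scipy.stats.wilcoxon; feasible only for small samples.
    """
    diffs = [b - a for b, a in zip(baseline, ablated)]
    observed = abs(sum(diffs))
    n = len(diffs)
    extreme = sum(
        1
        for signs in product((1, -1), repeat=n)
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed
    )
    return extreme / 2 ** n
```

With four queries whose accuracy consistently drops after cited-entity removal, only the two all-same-sign assignments are as extreme as the observed shift, giving p = 2/16 = 0.125; larger samples are needed for significance, which is why per-query counts matter in the revised tables.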
Circularity Check
No circularity: claims rest on independent ablation experiments
full rationale
The paper's central claims derive from controlled ablation experiments that isolate, remove, and mask cited versus uncited graph entities and measure effects on answer accuracy and changes. No equations, fitted parameters, self-definitional constructs, or load-bearing self-citations appear in the provided text. The necessity of cited evidence and insufficiency of citations alone are established by direct comparison of experimental conditions rather than by construction or renaming of inputs. This is a standard empirical setup with no reduction of outputs to the inputs by definition, consistent with the reader's assessment of low circularity risk.