Recognition: no theorem link
Do We Still Need GraphRAG? Benchmarking RAG and GraphRAG for Agentic Search Systems
Pith reviewed 2026-05-13 22:32 UTC · model grok-4.3
The pith
Agentic search boosts dense RAG enough to close most of the performance gap with GraphRAG.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Agentic search substantially improves dense RAG and narrows the performance gap to GraphRAG, particularly in RL-based settings. Nevertheless, GraphRAG remains advantageous for complex multi-hop reasoning, exhibiting more stable agentic search behavior when its offline cost is amortized.
What carries the argument
The RAGSearch benchmark, which standardizes the LLM backbone, retrieval budgets, and inference protocols so that dense RAG and GraphRAG can be compared under both training-free and training-based agentic search.
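To make that standardization concrete, here is a minimal, hypothetical sketch of the kind of frozen configuration such a benchmark would pin across methods; `EvalConfig`, the backbone string, and the budget values are illustrative assumptions, not RAGSearch's actual API.

```python
# Hypothetical sketch of the standardization RAGSearch enforces.
# EvalConfig, the backbone string, and the budget values are
# illustrative assumptions, not the benchmark's actual API.
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalConfig:
    llm_backbone: str       # same model for every retrieval method
    retrieval_budget: int   # max retrieved chunks per query
    max_search_rounds: int  # cap on agentic retrieval rounds
    protocol: str           # "training_free" or "rl_trained"

SHARED = EvalConfig(
    llm_backbone="qwen2.5-7b-instruct",  # placeholder backbone name
    retrieval_budget=5,
    max_search_rounds=4,
    protocol="training_free",
)

# Every retrieval infrastructure is paired with the *same* frozen config,
# so accuracy differences can be attributed to the retriever, not the setup.
for retriever in ("dense_rag", "graphrag", "lightrag", "hipporag"):
    print(retriever, "->", SHARED)
```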
If this is right
- Agentic wrappers can make simpler dense retrieval competitive without the expense of building graphs.
- Graph structures still provide measurable stability gains on the hardest multi-hop questions.
- Offline preprocessing cost of GraphRAG becomes acceptable only when the same index serves many queries.
- Reinforcement-learning agents close the gap more effectively than training-free agent loops.
Where Pith is reading between the lines
- Hybrid designs that add minimal graph edges to an agentic loop may combine the strengths of both approaches.
- Cost-sensitive applications should default to agentic dense RAG and reserve full GraphRAG for domains with repeated complex queries.
- The stability difference suggests measuring not only final accuracy but also variance across repeated agent runs when choosing a retrieval backbone.
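A minimal sketch of that stability measurement, assuming a hypothetical `run_agent` hook that performs one seeded agentic QA pass; the accuracy numbers inside are synthetic stand-ins, not results from the paper.

```python
# Minimal sketch: estimate run-to-run stability of an agentic retrieval
# backbone, not just its mean accuracy. `run_agent` is a hypothetical
# hook standing in for one full agentic QA pass over a dataset.
import random
import statistics

def run_agent(backbone: str, seed: int) -> float:
    """Placeholder: return exact-match accuracy for one seeded run."""
    random.seed(seed)
    base = {"dense_rag": 0.48, "graphrag": 0.52}[backbone]  # made-up numbers
    return base + random.uniform(-0.03, 0.03)               # fake seed noise

def stability_report(backbone: str, seeds=range(5)) -> tuple[float, float]:
    accs = [run_agent(backbone, s) for s in seeds]
    return statistics.mean(accs), statistics.stdev(accs)

for backbone in ("dense_rag", "graphrag"):
    mean, std = stability_report(backbone)
    print(f"{backbone}: {mean:.3f} ± {std:.3f} over 5 seeds")
```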
Load-bearing premise
That holding the language model, retrieval budget, and inference protocol fixed across methods yields a comparison that generalizes to other benchmarks and other agentic implementations.
What would settle it
Repeating the full comparison on a fresh set of multi-hop question-answering datasets using a different language-model backbone and observing whether the narrowed gap reappears or widens.
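A sketch of what that replication harness could look like; `evaluate`, the dataset names, and the backbone identifiers are all placeholders, and the returned numbers here are fake.

```python
# Sketch of the replication test described above: re-run the comparison on
# fresh multi-hop datasets with different backbones and check whether the
# dense-vs-graph gap narrows again. All names and numbers are placeholders.
def evaluate(method: str, dataset: str, backbone: str) -> float:
    """Placeholder accuracy; replace with a real agentic evaluation run."""
    fake = {"graphrag_agentic": 0.55, "dense_rag_agentic": 0.51}
    return fake[method]

def gap(dataset: str, backbone: str) -> float:
    """GraphRAG minus dense-RAG accuracy under agentic search."""
    return (evaluate("graphrag_agentic", dataset, backbone)
            - evaluate("dense_rag_agentic", dataset, backbone))

for ds in ("fresh_multihop_a", "fresh_multihop_b"):      # hypothetical datasets
    for bb in ("llama-3.1-8b-instruct", "mistral-7b"):   # different backbones
        print(f"{ds} / {bb}: gap = {gap(ds, bb):+.3f}")
```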
Figures
Original abstract
Retrieval-augmented generation (RAG) and its graph-based extensions (GraphRAG) are effective paradigms for improving large language model (LLM) reasoning by grounding generation in external knowledge. However, most existing RAG and GraphRAG systems operate under static or one-shot retrieval, where a fixed set of documents is provided to the LLM in a single pass. In contrast, recent agentic search systems enable dynamic, multi-round retrieval and sequential decision-making during inference, and have shown strong gains when combined with vanilla RAG by introducing implicit structure through interaction. This progress raises a fundamental question: can agentic search compensate for the absence of explicit graph structure, reducing the need for costly GraphRAG pipelines? To answer this question, we introduce RAGSearch, a unified benchmark that evaluates dense RAG and representative GraphRAG methods as retrieval infrastructures under agentic search. RAGSearch covers both training-free and training-based agentic inference across multiple question answering benchmarks. To ensure fair and reproducible comparison, we standardize the LLM backbone, retrieval budgets, and inference protocols, and report results on full test sets. Beyond answer accuracy, we report offline preprocessing cost, online inference efficiency, and stability. Our results show that agentic search substantially improves dense RAG and narrows the performance gap to GraphRAG, particularly in RL-based settings. Nevertheless, GraphRAG remains advantageous for complex multi-hop reasoning, exhibiting more stable agentic search behavior when its offline cost is amortized. Together, these findings clarify the complementary roles of explicit graph structure and agentic search, and provide practical guidance on retrieval design for modern agentic RAG systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the RAGSearch benchmark to evaluate dense RAG and representative GraphRAG methods as retrieval infrastructures under both training-free and training-based agentic search systems. It standardizes the LLM backbone, retrieval budgets, and inference protocols across multiple QA benchmarks, reporting answer accuracy along with offline preprocessing cost, online inference efficiency, and stability. The central claim is that agentic search substantially improves dense RAG performance and narrows the gap to GraphRAG (especially in RL-based settings), while GraphRAG retains advantages for complex multi-hop reasoning and exhibits more stable agentic behavior once offline costs are amortized.
Significance. If the empirical findings hold under the reported standardization, the work clarifies the complementary roles of explicit graph structure versus dynamic agentic retrieval, providing practical guidance for retrieval design in agentic RAG systems. The introduction of a unified benchmark with full test-set reporting and multi-metric evaluation (accuracy, cost, stability) strengthens the contribution for the IR and LLM communities.
major comments (2)
- [§4, §5] Methodology and Results: The claim that identical LLM backbones, retrieval budgets, and inference protocols produce a fair comparison is load-bearing for the headline result that agentic search narrows the gap to GraphRAG. However, GraphRAG supplies explicit entity-relation links that an agent could in principle exploit for planning or traversal; dense RAG supplies only flat passages. The paper reports GraphRAG's remaining advantage precisely on multi-hop items, raising the possibility that the protocols still permit differential use of structure. A concrete test (e.g., ablating whether the agent policy receives graph-derived metadata or only passage text) is needed to isolate the effect.
- [§5.2] RL-based agentic results: The reported narrowing of the performance gap in RL settings lacks accompanying error bars, statistical significance tests, or per-benchmark breakdowns on the full test sets. Without these, it is difficult to determine whether the observed improvements are robust or driven by a subset of easier instances.
minor comments (2)
- [Abstract] The directional findings are stated without numerical values, confidence intervals, or effect sizes; adding at least one representative accuracy delta (e.g., “+X% on multi-hop tasks”) would improve readability.
- [Tables/Figures] Table captions and figure legends should explicitly state the number of runs or seeds used for stability metrics.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below and have revised the manuscript accordingly to strengthen the clarity and robustness of our claims.
Point-by-point responses
- Referee: [§4, §5] Methodology and Results: The claim that identical LLM backbones, retrieval budgets, and inference protocols produce a fair comparison is load-bearing for the headline result that agentic search narrows the gap to GraphRAG. However, GraphRAG supplies explicit entity-relation links that an agent could in principle exploit for planning or traversal; dense RAG supplies only flat passages. The paper reports GraphRAG's remaining advantage precisely on multi-hop items, raising the possibility that the protocols still permit differential use of structure. A concrete test (e.g., ablating whether the agent policy receives graph-derived metadata or only passage text) is needed to isolate the effect.
Authors: We agree that isolating the contribution of explicit graph structure versus the agent's ability to exploit it is important for interpreting the results. In our setup, the agent receives retrieval outputs in a standardized format (passages for dense RAG; graph-augmented context for GraphRAG) under identical decision protocols and retrieval budgets. The remaining GraphRAG advantage on multi-hop questions is consistent with the value of explicit structure. To address the concern directly, we have added an ablation in the revised §4 and §5 where the GraphRAG agent is restricted to passage text only (masking entity-relation metadata), allowing a cleaner comparison of structure exploitation; a minimal sketch of such masking appears after these responses. revision: yes
- Referee: [§5.2] RL-based agentic results: The reported narrowing of the performance gap in RL settings lacks accompanying error bars, statistical significance tests, or per-benchmark breakdowns on the full test sets. Without these, it is difficult to determine whether the observed improvements are robust or driven by a subset of easier instances.
Authors: We acknowledge that the original presentation of RL results would benefit from additional statistical support. In the revised manuscript, we now report standard error bars computed over multiple random seeds for all RL-based experiments, include paired statistical significance tests (t-tests) comparing dense RAG and GraphRAG under agentic search, and provide full per-benchmark breakdowns on the complete test sets to demonstrate that the gap narrowing holds consistently rather than being driven by easier subsets; a sketch of the paired test also appears below. revision: yes
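A hedged sketch of the masking ablation described in the first response; the record layout, in which GraphRAG retrieval output carries entity and relation fields alongside passage text, is an assumption for illustration, not the paper's actual data format.

```python
# Hedged sketch of the masking ablation described in the first response:
# strip entity-relation metadata from GraphRAG output so the agent sees
# passage text only. The record layout is an assumption, not the paper's
# actual data format.
def mask_graph_metadata(retrieved: list[dict]) -> list[dict]:
    """Keep only the passage text; drop all graph-derived fields."""
    return [{"text": r["text"]} for r in retrieved]

retrieved = [
    {"text": "Marie Curie won Nobel Prizes in physics and chemistry.",
     "entities": ["Marie Curie", "Nobel Prize"],
     "relations": [("Marie Curie", "won", "Nobel Prize")]},
]
print(mask_graph_metadata(retrieved))  # agent input in the ablated condition
```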
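And a sketch of the paired significance test promised in the second response, run here on synthetic per-question exact-match outcomes; in the actual revision these scores would come from the same test items for both methods.

```python
# Sketch of the paired significance test promised in the second response,
# on synthetic per-question exact-match outcomes; in the revision these
# would come from the same test items for both methods.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 500                                  # synthetic test-set size
dense = rng.binomial(1, 0.48, size=n)    # per-question EM, dense RAG
graph = rng.binomial(1, 0.52, size=n)    # per-question EM, GraphRAG

res = stats.ttest_rel(graph, dense)      # paired over identical questions
print(f"mean gap = {graph.mean() - dense.mean():+.3f}, "
      f"t = {res.statistic:.2f}, p = {res.pvalue:.3f}")
```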
Circularity Check
Empirical benchmark with no derivations or self-referential predictions
Full rationale
This is a purely empirical benchmarking study comparing dense RAG and GraphRAG under standardized agentic search protocols on external QA datasets. All reported results (accuracy, efficiency, stability) are direct measurements from experiments with fixed LLM backbones and retrieval budgets. No mathematical derivations, fitted parameters renamed as predictions, self-citation load-bearing premises, or ansatz smuggling appear in the paper. The central claims rest on observable performance differences across methods rather than any reduction to the paper's own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Standardized LLM backbone and retrieval budgets produce comparable conditions across RAG and GraphRAG methods.
invented entities (1)
- RAGSearch benchmark: no independent evidence