Recognition: no theorem link
Do We Still Need GraphRAG? Benchmarking RAG and GraphRAG for Agentic Search Systems
Pith reviewed 2026-05-13 22:32 UTC · model grok-4.3
The pith
Agentic search boosts dense RAG enough to close most of the performance gap with GraphRAG.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Agentic search substantially improves dense RAG and narrows the performance gap to GraphRAG, particularly in RL-based settings. Nevertheless, GraphRAG remains advantageous for complex multi-hop reasoning, exhibiting more stable agentic search behavior when its offline cost is amortized.
What carries the argument
The RAGSearch benchmark, which standardizes the LLM backbone, retrieval budgets, and inference protocols so that dense RAG and GraphRAG can be compared under both training-free and training-based agentic search.
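To make that standardization concrete, here is a minimal, hypothetical sketch of the kind of frozen configuration such a benchmark would pin across methods; `EvalConfig`, the backbone string, and the budget values are illustrative assumptions, not RAGSearch's actual API.

```python
# Hypothetical sketch of the standardization RAGSearch enforces.
# EvalConfig, the backbone string, and the budget values are
# illustrative assumptions, not the benchmark's actual API.
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalConfig:
    llm_backbone: str       # same model for every retrieval method
    retrieval_budget: int   # max retrieved chunks per query
    max_search_rounds: int  # cap on agentic retrieval rounds
    protocol: str           # "training_free" or "rl_trained"

SHARED = EvalConfig(
    llm_backbone="qwen2.5-7b-instruct",  # placeholder backbone name
    retrieval_budget=5,
    max_search_rounds=4,
    protocol="training_free",
)

# Every retrieval infrastructure is paired with the *same* frozen config,
# so accuracy differences can be attributed to the retriever, not the setup.
for retriever in ("dense_rag", "graphrag", "lightrag", "hipporag"):
    print(retriever, "->", SHARED)
```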
If this is right
- Agentic wrappers can make simpler dense retrieval competitive without the expense of building graphs.
- Graph structures still provide measurable stability gains on the hardest multi-hop questions.
- Offline preprocessing cost of GraphRAG becomes acceptable only when the same index serves many queries.
- Reinforcement-learning agents close the gap more effectively than training-free agent loops.
Where Pith is reading between the lines
- Hybrid designs that add minimal graph edges to an agentic loop may combine the strengths of both approaches.
- Cost-sensitive applications should default to agentic dense RAG and reserve full GraphRAG for domains with repeated complex queries.
- The stability difference suggests measuring not only final accuracy but also variance across repeated agent runs when choosing a retrieval backbone.
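A minimal sketch of that stability measurement, assuming a hypothetical `run_agent` hook that performs one seeded agentic QA pass; the accuracy numbers inside are synthetic stand-ins, not results from the paper.

```python
# Minimal sketch: estimate run-to-run stability of an agentic retrieval
# backbone, not just its mean accuracy. `run_agent` is a hypothetical
# hook standing in for one full agentic QA pass over a dataset.
import random
import statistics

def run_agent(backbone: str, seed: int) -> float:
    """Placeholder: return exact-match accuracy for one seeded run."""
    random.seed(seed)
    base = {"dense_rag": 0.48, "graphrag": 0.52}[backbone]  # made-up numbers
    return base + random.uniform(-0.03, 0.03)               # fake seed noise

def stability_report(backbone: str, seeds=range(5)) -> tuple[float, float]:
    accs = [run_agent(backbone, s) for s in seeds]
    return statistics.mean(accs), statistics.stdev(accs)

for backbone in ("dense_rag", "graphrag"):
    mean, std = stability_report(backbone)
    print(f"{backbone}: {mean:.3f} ± {std:.3f} over 5 seeds")
```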
Load-bearing premise
That holding the language model, retrieval budget, and inference protocol fixed across methods yields a comparison that generalizes to other benchmarks and other agentic implementations.
What would settle it
Repeating the full comparison on a fresh set of multi-hop question-answering datasets using a different language-model backbone and observing whether the narrowed gap reappears or widens.
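A sketch of what that replication harness could look like; `evaluate`, the dataset names, and the backbone identifiers are all placeholders, and the returned numbers here are fake.

```python
# Sketch of the replication test described above: re-run the comparison on
# fresh multi-hop datasets with different backbones and check whether the
# dense-vs-graph gap narrows again. All names and numbers are placeholders.
def evaluate(method: str, dataset: str, backbone: str) -> float:
    """Placeholder accuracy; replace with a real agentic evaluation run."""
    fake = {"graphrag_agentic": 0.55, "dense_rag_agentic": 0.51}
    return fake[method]

def gap(dataset: str, backbone: str) -> float:
    """GraphRAG minus dense-RAG accuracy under agentic search."""
    return (evaluate("graphrag_agentic", dataset, backbone)
            - evaluate("dense_rag_agentic", dataset, backbone))

for ds in ("fresh_multihop_a", "fresh_multihop_b"):      # hypothetical datasets
    for bb in ("llama-3.1-8b-instruct", "mistral-7b"):   # different backbones
        print(f"{ds} / {bb}: gap = {gap(ds, bb):+.3f}")
```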
Figures
Original abstract
Retrieval-augmented generation (RAG) and its graph-based extensions (GraphRAG) are effective paradigms for improving large language model (LLM) reasoning by grounding generation in external knowledge. However, most existing RAG and GraphRAG systems operate under static or one-shot retrieval, where a fixed set of documents is provided to the LLM in a single pass. In contrast, recent agentic search systems enable dynamic, multi-round retrieval and sequential decision-making during inference, and have shown strong gains when combined with vanilla RAG by introducing implicit structure through interaction. This progress raises a fundamental question: can agentic search compensate for the absence of explicit graph structure, reducing the need for costly GraphRAG pipelines? To answer this question, we introduce RAGSearch, a unified benchmark that evaluates dense RAG and representative GraphRAG methods as retrieval infrastructures under agentic search. RAGSearch covers both training-free and training-based agentic inference across multiple question answering benchmarks. To ensure fair and reproducible comparison, we standardize the LLM backbone, retrieval budgets, and inference protocols, and report results on full test sets. Beyond answer accuracy, we report offline preprocessing cost, online inference efficiency, and stability. Our results show that agentic search substantially improves dense RAG and narrows the performance gap to GraphRAG, particularly in RL-based settings. Nevertheless, GraphRAG remains advantageous for complex multi-hop reasoning, exhibiting more stable agentic search behavior when its offline cost is amortized. Together, these findings clarify the complementary roles of explicit graph structure and agentic search, and provide practical guidance on retrieval design for modern agentic RAG systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the RAGSearch benchmark to evaluate dense RAG and representative GraphRAG methods as retrieval infrastructures under both training-free and training-based agentic search systems. It standardizes the LLM backbone, retrieval budgets, and inference protocols across multiple QA benchmarks, reporting answer accuracy along with offline preprocessing cost, online inference efficiency, and stability. The central claim is that agentic search substantially improves dense RAG performance and narrows the gap to GraphRAG (especially in RL-based settings), while GraphRAG retains advantages for complex multi-hop reasoning and exhibits more stable agentic behavior once offline costs are amortized.
Significance. If the empirical findings hold under the reported standardization, the work clarifies the complementary roles of explicit graph structure versus dynamic agentic retrieval, providing practical guidance for retrieval design in agentic RAG systems. The introduction of a unified benchmark with full test-set reporting and multi-metric evaluation (accuracy, cost, stability) strengthens the contribution for the IR and LLM communities.
major comments (2)
- [§4, §5] Methodology and Results: The claim that identical LLM backbones, retrieval budgets, and inference protocols produce a fair comparison is load-bearing for the headline result that agentic search narrows the gap to GraphRAG. However, GraphRAG supplies explicit entity-relation links that an agent could in principle exploit for planning or traversal; dense RAG supplies only flat passages. The paper reports GraphRAG's remaining advantage precisely on multi-hop items, raising the possibility that the protocols still permit differential use of structure. A concrete test (e.g., ablating whether the agent policy receives graph-derived metadata or only passage text) is needed to isolate the effect.
- [§5.2] RL-based agentic results: The reported narrowing of the performance gap in RL settings lacks accompanying error bars, statistical significance tests, or per-benchmark breakdowns on the full test sets. Without these, it is difficult to determine whether the observed improvements are robust or driven by a subset of easier instances.
minor comments (2)
- [Abstract] The directional findings are stated without numerical values, confidence intervals, or effect sizes; adding at least one representative accuracy delta (e.g., “+X% on multi-hop tasks”) would improve readability.
- [Tables/Figures] Table captions and figure legends should explicitly state the number of runs or seeds used for stability metrics.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below and have revised the manuscript accordingly to strengthen the clarity and robustness of our claims.
Point-by-point responses
- Referee: [§4, §5] Methodology and Results: The claim that identical LLM backbones, retrieval budgets, and inference protocols produce a fair comparison is load-bearing for the headline result that agentic search narrows the gap to GraphRAG. However, GraphRAG supplies explicit entity-relation links that an agent could in principle exploit for planning or traversal; dense RAG supplies only flat passages. The paper reports GraphRAG's remaining advantage precisely on multi-hop items, raising the possibility that the protocols still permit differential use of structure. A concrete test (e.g., ablating whether the agent policy receives graph-derived metadata or only passage text) is needed to isolate the effect.
Authors: We agree that isolating the contribution of explicit graph structure versus the agent's ability to exploit it is important for interpreting the results. In our setup, the agent receives retrieval outputs in a standardized format (passages for dense RAG; graph-augmented context for GraphRAG) under identical decision protocols and retrieval budgets. The remaining GraphRAG advantage on multi-hop questions is consistent with the value of explicit structure. To address the concern directly, we have added an ablation in the revised §4 and §5 where the GraphRAG agent is restricted to passage text only (masking entity-relation metadata), allowing a cleaner comparison of structure exploitation; a minimal sketch of such masking appears after these responses. revision: yes
- Referee: [§5.2] RL-based agentic results: The reported narrowing of the performance gap in RL settings lacks accompanying error bars, statistical significance tests, or per-benchmark breakdowns on the full test sets. Without these, it is difficult to determine whether the observed improvements are robust or driven by a subset of easier instances.
Authors: We acknowledge that the original presentation of RL results would benefit from additional statistical support. In the revised manuscript, we now report standard error bars computed over multiple random seeds for all RL-based experiments, include paired statistical significance tests (t-tests) comparing dense RAG and GraphRAG under agentic search, and provide full per-benchmark breakdowns on the complete test sets to demonstrate that the gap narrowing holds consistently rather than being driven by easier subsets; a sketch of the paired test also appears below. revision: yes
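A hedged sketch of the masking ablation described in the first response; the record layout, in which GraphRAG retrieval output carries entity and relation fields alongside passage text, is an assumption for illustration, not the paper's actual data format.

```python
# Hedged sketch of the masking ablation described in the first response:
# strip entity-relation metadata from GraphRAG output so the agent sees
# passage text only. The record layout is an assumption, not the paper's
# actual data format.
def mask_graph_metadata(retrieved: list[dict]) -> list[dict]:
    """Keep only the passage text; drop all graph-derived fields."""
    return [{"text": r["text"]} for r in retrieved]

retrieved = [
    {"text": "Marie Curie won Nobel Prizes in physics and chemistry.",
     "entities": ["Marie Curie", "Nobel Prize"],
     "relations": [("Marie Curie", "won", "Nobel Prize")]},
]
print(mask_graph_metadata(retrieved))  # agent input in the ablated condition
```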
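And a sketch of the paired significance test promised in the second response, run here on synthetic per-question exact-match outcomes; in the actual revision these scores would come from the same test items for both methods.

```python
# Sketch of the paired significance test promised in the second response,
# on synthetic per-question exact-match outcomes; in the revision these
# would come from the same test items for both methods.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 500                                  # synthetic test-set size
dense = rng.binomial(1, 0.48, size=n)    # per-question EM, dense RAG
graph = rng.binomial(1, 0.52, size=n)    # per-question EM, GraphRAG

res = stats.ttest_rel(graph, dense)      # paired over identical questions
print(f"mean gap = {graph.mean() - dense.mean():+.3f}, "
      f"t = {res.statistic:.2f}, p = {res.pvalue:.3f}")
```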
Circularity Check
Empirical benchmark with no derivations or self-referential predictions
Full rationale
This is a purely empirical benchmarking study comparing dense RAG and GraphRAG under standardized agentic search protocols on external QA datasets. All reported results (accuracy, efficiency, stability) are direct measurements from experiments with fixed LLM backbones and retrieval budgets. No mathematical derivations, fitted parameters renamed as predictions, self-citation load-bearing premises, or ansatz smuggling appear in the paper. The central claims rest on observable performance differences across methods rather than any reduction to the paper's own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Standardized LLM backbone and retrieval budgets produce comparable conditions across RAG and GraphRAG methods.
invented entities (1)
- RAGSearch benchmark: no independent evidence