Pith · machine review for the scientific record

arxiv: 2503.19470 · v3 · submitted 2025-03-25 · 💻 cs.AI · cs.CL

Recognition: no theorem link

ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning

Authors on Pith · no claims yet

Pith reviewed 2026-05-16 15:43 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords reinforcement learning · large language models · reasoning with search · tool integration · outcome rewards · multi-hop reasoning · self-correction · generalization

The pith

ReSearch trains LLMs to interleave search operations with text reasoning using only outcome-based reinforcement learning rewards.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ReSearch, a framework where language models learn to treat external search as part of their ongoing reasoning chain rather than a separate step. Training relies solely on reinforcement learning signals from final answer correctness, with no supervised examples of reasoning traces or search decisions. Models trained this way on a single dataset generalize to multiple benchmarks and begin to exhibit reflection and self-correction during generation. This matters because it offers a path to capable multi-hop reasoning without the cost of creating detailed human supervision data for every task.

Core claim

ReSearch frames search as an integral action within the reasoning trajectory: the model generates text-based thinking to decide when and how to search, receives the retrieved result in context, and continues reasoning, all optimized end-to-end by reinforcement learning on outcome rewards alone. Training this way yields strong cross-benchmark generalization and the spontaneous emergence of reflection and self-correction.

What carries the argument

An outcome-optimized RL policy that generates interleaved sequences of natural-language reasoning steps and explicit search tool calls, feeding retrieved results back into the context to shape the next reasoning segment.
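To make that loop concrete, here is a minimal Python sketch of one rollout. The <search>/<result>/<answer> tag convention and the generate_until and retrieve helpers are illustrative assumptions, not the paper's exact interface.

```python
# Minimal sketch of an interleaved reason-search rollout, assuming a tag-based
# convention. `policy.generate_until` and `retrieve` are hypothetical helpers;
# the paper's exact tag format and truncation rules may differ.

SEARCH_OPEN, SEARCH_CLOSE = "<search>", "</search>"
RESULT_OPEN, RESULT_CLOSE = "<result>", "</result>"
ANSWER_CLOSE = "</answer>"

def rollout(policy, retrieve, question, max_turns=8):
    """Generate one trajectory that interleaves text reasoning with search calls."""
    context = f"Question: {question}\n<think>"
    for _ in range(max_turns):
        # The policy writes free-form reasoning until it either closes a search
        # query or commits to a final answer.
        segment = policy.generate_until(context, stop=[SEARCH_CLOSE, ANSWER_CLOSE])
        context += segment
        if segment.endswith(SEARCH_CLOSE):
            # Extract the query the model just wrote and call the external tool.
            query = segment.rsplit(SEARCH_OPEN, 1)[-1].removesuffix(SEARCH_CLOSE)
            docs = retrieve(query, top_k=3)
            # Retrieved text is appended to the context so that the next
            # reasoning segment conditions on it.
            context += f"\n{RESULT_OPEN}{docs}{RESULT_CLOSE}\n"
        else:
            break  # the model emitted </answer>; the trajectory is complete
    return context
```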

If this is right

  • Reflection and self-correction emerge as byproducts of the RL process without explicit rewards for those behaviors.
  • Performance generalizes across benchmarks after training on only one dataset.
  • Complex questions needing multiple retrieval steps become solvable through learned interleaving rather than fixed pipelines.
  • Search integration is driven by the model's internal text reasoning rather than external rules or prompts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same outcome-reward approach could extend to other tools such as code interpreters or databases to create agents that decide when to use each tool.
  • Reducing dependence on supervised reasoning traces might lower the data cost of building more capable models for open-ended tasks.
  • Scaling model size or adding more diverse outcome signals could strengthen the observed self-correction without changing the core training setup.

Load-bearing premise

Final-answer correctness rewards alone can shape precise decisions about when to search and how to incorporate results without any intermediate supervision on reasoning steps.
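A sketch of what such an outcome-only reward can look like, assuming exact match after light normalization; the paper's correctness check may differ (e.g., F1 or an LLM judge), but the defining property is that the reward reads only the final answer.

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return re.sub(r"\s+", " ", text).strip()

def outcome_reward(trajectory: str, gold_answers: list[str]) -> float:
    """Reward depends only on the final answer, never on intermediate
    reasoning text or on when and how searches were issued."""
    match = re.search(r"<answer>(.*?)</answer>", trajectory, flags=re.S)
    if match is None:
        return 0.0  # malformed trajectory: no parsable final answer
    pred = normalize(match.group(1))
    return 1.0 if any(pred == normalize(g) for g in gold_answers) else 0.0
```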

What would settle it

Train the model and evaluate it against a non-search baseline on multi-hop questions that require two or more sequential retrievals: the claim fails if the trained model shows no gain over that baseline, or if its search timing remains effectively random rather than guided by its own text-based reasoning.
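A hypothetical harness for that test might look like the following; the dataset fields, the answer method, and the reward_fn hook are placeholders for whatever evaluation stack is used.

```python
from statistics import mean

def settling_test(search_model, baseline_model, multihop_questions, reward_fn):
    """Compare a trained search policy against a no-search baseline and check
    whether search frequency tracks question complexity (number of hops)."""
    gains, searches_by_hops = [], {}
    for q in multihop_questions:  # each q has .text, .gold, .num_hops (>= 2)
        with_search = search_model.answer(q.text)   # interleaved reason + search
        without = baseline_model.answer(q.text)     # reasoning only, no tool
        gains.append(reward_fn(with_search, q.gold) - reward_fn(without, q.gold))
        searches_by_hops.setdefault(q.num_hops, []).append(with_search.count("<search>"))
    print("mean gain over no-search baseline:", mean(gains))
    for hops, counts in sorted(searches_by_hops.items()):
        print(f"{hops}-hop questions: avg searches = {mean(counts):.2f}")
    # The claim fails if the gain is near zero, or if search counts do not
    # grow with the number of hops (i.e., search timing is effectively random).
```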

Original abstract

Large Language Models (LLMs) have shown remarkable capabilities in reasoning, exemplified by the success of OpenAI-o1 and DeepSeek-R1. However, integrating reasoning with external search processes remains challenging, especially for complex multi-hop questions requiring multiple retrieval steps. We propose ReSearch, a novel framework that trains LLMs to Reason with Search via reinforcement learning without using any supervised data on reasoning steps. Our approach treats search operations as integral components of the reasoning chain, where when and how to perform searches is guided by text-based thinking, and search results subsequently influence further reasoning. We train ReSearch on Qwen2.5-7B(-Instruct) and Qwen2.5-32B(-Instruct) models and conduct extensive experiments. Despite being trained on only one dataset, our models demonstrate strong generalizability across various benchmarks. Analysis reveals that ReSearch naturally elicits advanced reasoning capabilities such as reflection and self-correction during the reinforcement learning process.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces ReSearch, a reinforcement learning framework that trains LLMs (Qwen2.5-7B and 32B variants) to interleave text-based reasoning with external search operations without any supervised reasoning traces or explicit search supervision. Search is treated as an integral part of the reasoning chain, with the policy learning when and how to invoke searches based on outcome rewards alone; the resulting models are reported to generalize strongly across benchmarks despite single-dataset training and to exhibit emergent reflection and self-correction.

Significance. If the central claims hold, the work would demonstrate that pure outcome-based RL can shape adaptive tool use and elicit advanced reasoning behaviors in LLMs, reducing dependence on supervised fine-tuning for multi-hop tasks. This would be a notable contribution to tool-augmented reasoning, with the single-dataset generalization and emergent capabilities providing a concrete path toward more scalable training of search-integrated reasoners.

major comments (2)
  1. [Experimental results] Experimental results section: the claims of strong generalization and emergent reflection rest on outcome-based RL shaping search timing, yet no ablations are reported that isolate the contribution of learned search integration (e.g., search-disabled baselines, search-frequency statistics stratified by question complexity, or comparison to fixed-search policies). Without these, it remains possible that performance gains derive from the base model or the search tool itself rather than the RL-induced policy.
  2. [Method] Method section (trajectory generation and reward): the description of how search results are incorporated into subsequent reasoning steps and how the policy learns contextual search decisions solely from final-answer rewards lacks sufficient detail on state representation and credit assignment; this is load-bearing for the central assertion that no supervised reasoning traces are required.
minor comments (1)
  1. [Abstract] Abstract: the statement that models 'demonstrate strong generalizability' would be strengthened by including at least one or two key quantitative metrics (e.g., average accuracy lift on the held-out benchmarks) rather than qualitative descriptors alone.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments. We address each major comment point by point below, providing clarifications and indicating revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Experimental results] Experimental results section: the claims of strong generalization and emergent reflection rest on outcome-based RL shaping search timing, yet no ablations are reported that isolate the contribution of learned search integration (e.g., search-disabled baselines, search-frequency statistics stratified by question complexity, or comparison to fixed-search policies). Without these, it remains possible that performance gains derive from the base model or the search tool itself rather than the RL-induced policy.

    Authors: We agree that additional ablations would more rigorously isolate the contribution of the learned search policy. In the revised manuscript we will add: (1) a search-disabled baseline in which the model is trained under the same RL setup but without access to the search tool; (2) search-frequency statistics stratified by question complexity (e.g., single-hop vs. multi-hop); and (3) comparisons against fixed-search policies that invoke search at predetermined intervals or always. These experiments will be reported alongside the existing results to demonstrate that performance gains arise from the RL-induced contextual decisions rather than the base model or tool alone. revision: yes

  2. Referee: [Method] Method section (trajectory generation and reward): the description of how search results are incorporated into subsequent reasoning steps and how the policy learns contextual search decisions solely from final-answer rewards lacks sufficient detail on state representation and credit assignment; this is load-bearing for the central assertion that no supervised reasoning traces are required.

    Authors: We acknowledge that the current description is insufficiently detailed. In the revised method section we will expand the exposition as follows: the state at each step is the complete history of prior reasoning text plus any previously retrieved search results appended verbatim to the context; search results are incorporated by direct concatenation after the current thought, allowing the next token prediction to condition on them. Credit assignment occurs entirely through the terminal outcome reward (binary correctness of the final answer) propagated by the RL algorithm (PPO) with no intermediate supervision or reasoning traces. We will include pseudocode for trajectory sampling, the precise reward function, and an explicit statement that search timing and content decisions are learned solely from final-answer signals. revision: yes
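As a rough illustration of the credit-assignment scheme described in this response, the sketch below places the entire learning signal on the terminal token and masks retrieved text out of the loss; the field names are illustrative and not tied to any particular PPO implementation.

```python
def build_training_example(token_ids, is_model_token, final_reward):
    """token_ids: full trajectory (prompt + reasoning + retrieved results + answer).
    is_model_token: True where the policy generated the token, False for the
    prompt and for retrieved results appended verbatim (conditioned on, not trained).
    final_reward: binary correctness of the final answer."""
    per_token_rewards = [0.0] * len(token_ids)
    per_token_rewards[-1] = float(final_reward)  # all credit sits on the terminal step
    loss_mask = [1 if m else 0 for m in is_model_token]
    return {
        "input_ids": token_ids,
        "rewards": per_token_rewards,  # PPO's advantage estimation propagates this back
        "loss_mask": loss_mask,        # retrieved text shapes context but is not optimized
    }
```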

Circularity Check

0 steps flagged

No significant circularity in ReSearch RL training framework

Full rationale

The paper presents an empirical RL method that applies standard outcome-based rewards to search-augmented reasoning trajectories without supervised traces. No equations, derivations, or predictions are shown that reduce claimed capabilities (e.g., reflection, generalizability) to fitted inputs or self-citations by construction. Training on one dataset and evaluation across benchmarks is a direct experimental result, not a self-referential derivation. The framework is self-contained as a training procedure validated externally.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that pure outcome RL can shape search behavior and elicit reflection; no free parameters or invented entities are specified in the abstract.

axioms (1)
  • domain assumption: Reinforcement learning with final-answer rewards suffices to train when and how to search during reasoning.
    Invoked implicitly in the description of training without supervised reasoning data.

pith-pipeline@v0.9.0 · 5508 in / 1175 out tokens · 38803 ms · 2026-05-16T15:43:22.448661+00:00 · methodology

discussion (0)


Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation

    cs.AI 2026-05 unverdicted novelty 7.0

    PyRAG turns multi-hop reasoning into executable Python code over retrieval tools for explicit, verifiable step-by-step RAG.

  2. Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization

    cs.AI 2026-05 unverdicted novelty 7.0

    An exploration-aware RL framework lets LLM agents adaptively explore only under high uncertainty via variational rewards and action grouping, yielding consistent gains on text and GUI agent benchmarks.

  3. Self-Induced Outcome Potential: Turn-Level Credit Assignment for Agents without Verifiers

    cs.LG 2026-05 unverdicted novelty 7.0

    SIOP enables turn-level credit assignment in LLM agents via semantic clustering of final answers as latent outcomes, improving performance on reasoning benchmarks without verifiers.

  4. Latent Abstraction for Retrieval-Augmented Generation

    cs.CL 2026-04 unverdicted novelty 7.0

    LAnR unifies retrieval-augmented generation inside a single LLM by deriving dense retrieval vectors from a [PRED] token's hidden states and using entropy to adaptively stop retrieval, outperforming prior RAG on six QA...

  5. IG-Search: Step-Level Information Gain Rewards for Search-Augmented Reasoning

    cs.AI 2026-04 unverdicted novelty 7.0

    IG-Search computes step-level information gain rewards from policy probabilities to improve credit assignment in RL training for search-augmented QA, yielding 1.6-point gains over trajectory-level baselines on multi-h...

  6. Fine-tuning is Not Enough: A Parallel Framework for Collaborative Imitation and Reinforcement Learning in End-to-end Autonomous Driving

    cs.RO 2026-03 unverdicted novelty 7.0

    PaIR-Drive runs IL and RL in parallel branches with a tree-structured sampler to reach 91.2 PDMS and 87.9 EPDMS on NAVSIM benchmarks while outperforming sequential RL fine-tuning and correcting some human errors.

  7. MMSearch-R1: Incentivizing LMMs to Search

    cs.CV 2025-06 unverdicted novelty 7.0

    MMSearch-R1 uses reinforcement learning to train multimodal models for on-demand multi-turn internet search with image and text tools, outperforming same-size RAG baselines and matching larger ones while cutting searc...

  8. PiCA: Pivot-Based Credit Assignment for Search Agentic Reinforcement Learning

    cs.AI 2026-05 unverdicted novelty 6.0

    PiCA improves RL for LLM search agents by defining process rewards around pivot steps that act as information peaks boosting final answer success probability via potential-based shaping.

  9. PiCA: Pivot-Based Credit Assignment for Search Agentic Reinforcement Learning

    cs.AI 2026-05 unverdicted novelty 6.0

    PiCA uses pivot-based potential rewards derived from historical sub-queries to supply trajectory-aware step guidance in agentic RL, delivering 15% gains on QA benchmarks for 3B/7B models.

  10. DR-MMSearchAgent: Deepening Reasoning in Multimodal Search Agents

    cs.CV 2026-04 unverdicted novelty 6.0

    DR-MMSearchAgent derives batch-wide trajectory advantages and uses differentiated Gaussian rewards to prevent premature collapse in multimodal agents, outperforming MMSearch-R1 by 8.4% on FVQA-test.

  11. Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence

    cs.AI 2026-04 unverdicted novelty 6.0

    Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.

  12. MICA: Multi-granularity Intertemporal Credit Assignment for Long-Horizon Emotional Support Dialogue

    cs.CL 2026-03 unverdicted novelty 6.0

    MICA combines incremental per-turn distance rewards and Monte Carlo returns from a shared potential function over user support states to create a mixed advantage signal that enables stable multi-turn RL optimization f...

  13. WebThinker: Empowering Large Reasoning Models with Deep Research Capability

    cs.CL 2025-04 unverdicted novelty 6.0

    WebThinker equips large reasoning models with autonomous web exploration and interleaved reasoning-drafting via a Deep Web Explorer and RL-based DPO training, yielding gains on GPQA, GAIA, and report-generation benchmarks.

  14. Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization

    cs.AI 2026-05 unverdicted novelty 5.0

    An exploration-aware policy optimization method lets LLM agents explore selectively via a variational-inference reward and action grouping, yielding consistent gains on text and GUI agent benchmarks.

  15. ProMMSearchAgent: A Generalizable Multimodal Search Agent Trained with Process-Oriented Rewards

    cs.CV 2026-04 unverdicted novelty 5.0

    A sandbox-trained multimodal search agent with process-oriented rewards transfers zero-shot to real Google Search and outperforms prior methods on FVQA, InfoSeek, and MMSearch.

  16. SAKE: Self-aware Knowledge Exploitation-Exploration for Grounded Multimodal Named Entity Recognition

    cs.IR 2026-04 unverdicted novelty 5.0

    SAKE is an agentic framework for GMNER that uses uncertainty-based self-awareness and reinforcement learning to balance internal knowledge exploitation with adaptive external exploration.

  17. KG-Reasoner: A Reinforced Model for End-to-End Multi-Hop Knowledge Graph Reasoning

    cs.CL 2026-04 unverdicted novelty 5.0

    KG-Reasoner uses reinforcement learning to train LLMs for end-to-end multi-hop knowledge graph reasoning, achieving competitive or better results on eight benchmarks.

  18. E3-TIR: Enhanced Experience Exploitation for Tool-Integrated Reasoning

    cs.AI 2026-04 unverdicted novelty 5.0

    E3-TIR integrates expert prefixes, guided branches, and self-exploration via mix policy optimization to deliver 6% better tool-use performance with under 10% of the usual synthetic data and 1.46x ROI.

  19. An End-to-End Framework for Building Large Language Models for Software Operations

    cs.LG 2026-04 unverdicted novelty 5.0

    OpsLLM outperforms general LLMs on software operations QA and RCA tasks through human-in-the-loop data curation, supervised fine-tuning, and domain-specific reinforcement learning.

  20. Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

    cs.AI 2025-03 unverdicted novelty 5.0

    The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.

  21. An End-to-End Framework for Building Large Language Models for Software Operations

    cs.LG 2026-04 unverdicted novelty 4.0

    OpsLLM is a domain-specific LLM for software ops QA and RCA built with human-curated data, SFT, and RL using a domain process reward model, showing accuracy gains of 0.2-5.7% on QA and 2.7-70.3% on RCA over general LLMs.

  22. From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review

    cs.AI 2025-04 accept novelty 4.0

    A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · cited by 19 Pith papers · 14 internal anchors
