pith. machine review for the scientific record.

arXiv: 2503.05592 · v2 · submitted 2025-03-07 · 💻 cs.AI · cs.CL · cs.IR

R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning

Huatong Song, Jie Chen, Jinhao Jiang, Ji-Rong Wen, Lei Fang, Wayne Xin Zhao, Yingqian Min, Zhipeng Chen

Pith reviewed 2026-05-13 18:32 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.IR
keywords reinforcement learning · large language models · search capability · retrieval-augmented generation · tool use · outcome-based RL

The pith

R1-Searcher trains LLMs with outcome-based RL to call external search tools during reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a two-stage outcome-based reinforcement learning method that teaches large language models to decide when to invoke external search systems while solving problems. Existing models often produce errors on questions needing facts beyond their training data because they cannot fetch fresh information. By rewarding only final-answer correctness, the approach builds search behavior without step-by-step process signals or a supervised warm-up phase. If the method works as claimed, it yields higher accuracy on knowledge-heavy tasks than standard retrieval-augmented systems, and than the closed model GPT-4o-mini, while applying to both base and instruction-tuned models.

Core claim

R1-Searcher is a two-stage outcome-based RL framework that enables LLMs to autonomously generate calls to external search systems inside their reasoning process, producing stronger results on knowledge-intensive benchmarks than prior RAG approaches and GPT-4o-mini without any process rewards or distillation for initialization.
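
To make "calls to external search systems inside their reasoning process" concrete, here is a minimal sketch of an interleaved generate-and-retrieve loop. It is an illustration only, not the paper's implementation: the `<query>` and `<documents>` tags, the function name `rollout_with_search`, the `generate` and `search` callables, and the `max_calls` budget are all assumptions introduced here.

```python
import re
from typing import Callable, List

# Hypothetical tag names; the summary above does not specify the paper's format.
QUERY_RE = re.compile(r"<query>(.*?)</query>", re.DOTALL)


def rollout_with_search(
    prompt: str,
    generate: Callable[[str], str],
    search: Callable[[str], List[str]],
    max_calls: int = 4,
) -> str:
    """Alternate between model generation and retrieval until no query is issued.

    `generate` is assumed to return the next text segment and to stop right
    after a closing </query> tag or at the end of the answer.
    """
    context = prompt
    for _ in range(max_calls):
        segment = generate(context)
        context += segment
        match = QUERY_RE.search(segment)
        if match is None:
            # The model finished without asking for more evidence.
            return context
        docs = search(match.group(1).strip())
        # Retrieved passages are appended so the next segment can use them.
        context += "\n<documents>\n" + "\n".join(docs) + "\n</documents>\n"
    # Call budget exhausted: let the model finish with what it has.
    return context + generate(context)
```

The point of the sketch is that the decision to search is made by the model's own output, not by a fixed retrieve-then-read pipeline.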

What carries the argument

Two-stage outcome-based reinforcement learning that rewards final answer correctness and thereby incentivizes the model to insert search tool calls into its reasoning trajectory.
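
A reward that looks only at the final answer can be sketched as follows. This is a simplified stand-in under stated assumptions, not the paper's reward: the `<answer>` tag, the normalization rules, and the binary exact-match scoring are hypothetical choices, and the two-stage schedule is not modeled.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def outcome_reward(trajectory: str, gold: str) -> float:
    """Return 1.0 if the last <answer> span matches gold, else 0.0.

    Nothing else in the trajectory (searches, reasoning steps) is scored,
    which is what makes the signal outcome-based rather than process-based.
    """
    answers = re.findall(r"<answer>(.*?)</answer>", trajectory, re.DOTALL)
    if not answers:
        return 0.0
    return 1.0 if normalize(answers[-1]) == normalize(gold) else 0.0
```

Because search calls earn no reward of their own here, any increase in search use during training would have to come from its effect on final-answer correctness.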

If this is right

  • The same outcome-based RL pipeline produces usable search behavior in both base and instruction-tuned models.
  • Search use generalizes to datasets outside the training distribution.
  • Accuracy on time-sensitive and fact-heavy questions rises above conventional RAG pipelines.
  • No auxiliary process reward model or supervised warm-up phase is required for the capability to emerge.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach may transfer to training other tool-use skills such as code execution or database queries using only final-outcome signals.
  • Reducing dependence on ever-larger internal knowledge stores becomes feasible if external search can be reliably triggered on demand.
  • Training pipelines that avoid process supervision could scale more easily to larger models or longer reasoning traces.

Load-bearing premise

Outcome rewards alone can reliably produce and generalize search behavior without process supervision or a distillation cold start.

What would settle it

A controlled test set of knowledge questions where internal model knowledge is provably insufficient; if the trained model answers correctly without ever calling search, or calls search but still fails at rates comparable to the untrained base model, the claim is falsified.
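
The two falsifying patterns described above can be tallied with a short script. This is a sketch under assumptions: the `Record` fields, the function name `settle`, and the `margin` used to decide whether failure rates are "comparable" are hypothetical and would need to be fixed in advance for a real test.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    correct: bool        # final answer judged correct
    search_calls: int    # number of search invocations in the trajectory


def settle(trained: List[Record], base: List[Record], margin: float = 0.05) -> dict:
    """Summarise both falsifying patterns on a set the model cannot answer from memory."""
    n = len(trained)
    correct_without_search = sum(r.correct and r.search_calls == 0 for r in trained) / n
    searched = [r for r in trained if r.search_calls > 0]
    fail_with_search = (
        sum(not r.correct for r in searched) / len(searched) if searched else float("nan")
    )
    base_fail = sum(not r.correct for r in base) / len(base)
    return {
        "correct_without_search": correct_without_search,
        "fail_with_search": fail_with_search,
        "base_fail": base_fail,
        # Hypothetical decision rule: failure rate not clearly below the base model's.
        "search_no_better_than_base": bool(searched) and fail_with_search >= base_fail - margin,
    }
```

A high `correct_without_search` would suggest the test set leaks into internal knowledge, while `search_no_better_than_base` flags the case where searching happens but does not help.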

Original abstract

Existing Large Reasoning Models (LRMs) have shown the potential of reinforcement learning (RL) to enhance the complex reasoning capabilities of Large Language Models (LLMs). While they achieve remarkable performance on challenging tasks such as mathematics and coding, they often rely on their internal knowledge to solve problems, which can be inadequate for time-sensitive or knowledge-intensive questions, leading to inaccuracies and hallucinations. To address this, we propose R1-Searcher, a novel two-stage outcome-based RL approach designed to enhance the search capabilities of LLMs. This method allows LLMs to autonomously invoke external search systems to access additional knowledge during the reasoning process. Our framework relies exclusively on RL, without requiring process rewards or distillation for a cold start, effectively generalizing to out-of-domain datasets and supporting both Base and Instruct models. Our experiments demonstrate that our method significantly outperforms previous strong RAG methods, even when compared to the closed-source GPT-4o-mini.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces R1-Searcher, a two-stage outcome-based reinforcement learning framework that trains LLMs (both base and instruct variants) to autonomously invoke external search tools during reasoning. It claims this approach, relying exclusively on final-answer correctness rewards without process supervision or distillation for cold-start, enables effective search behavior that generalizes out-of-domain and yields significant outperformance over strong RAG baselines, including closed-source GPT-4o-mini.

Significance. If the empirical results hold with proper controls, the work would demonstrate that sparse outcome-only RL can reliably induce tool-use policies for external knowledge access, offering a simpler alternative to process-reward or imitation-based methods for reducing hallucinations on knowledge-intensive tasks.

major comments (2)
  1. [Abstract] Abstract and Experiments section: the central claim of significant outperformance over prior RAG methods and GPT-4o-mini is asserted without any reported datasets, baselines, metrics, statistical tests, ablations, or controls in the provided text, leaving the empirical support for the two-stage RL procedure unevaluable.
  2. [Method] Method and Training sections: the premise that outcome-based rewards alone suffice to increase search-tool invocation frequency and quality (rather than producing ignored or redundant queries) is load-bearing for the contribution, yet no training dynamics, search-rate curves, or qualitative policy analysis are referenced to validate this against the known sparsity issues of terminal rewards.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and have made revisions to strengthen the empirical presentation and analysis of the two-stage RL procedure.

Point-by-point responses
  1. Referee: [Abstract] Abstract and Experiments section: the central claim of significant outperformance over prior RAG methods and GPT-4o-mini is asserted without any reported datasets, baselines, metrics, statistical tests, ablations, or controls in the provided text, leaving the empirical support for the two-stage RL procedure unevaluable.

    Authors: We agree that the abstract should provide clearer pointers to the empirical support. The full Experiments section reports results on HotpotQA, 2WikiMultihopQA, and out-of-domain sets, with baselines including standard RAG pipelines and GPT-4o-mini, using exact-match and F1 metrics, plus ablations on the two-stage design and statistical significance via paired t-tests. To make this immediately visible, we have expanded the abstract to name the primary datasets, metrics, and key controls, and added explicit cross-references to the Experiments section and appendix tables. revision: yes

  2. Referee: [Method] Method and Training sections: the premise that outcome-based rewards alone suffice to increase search-tool invocation frequency and quality (rather than producing ignored or redundant queries) is load-bearing for the contribution, yet no training dynamics, search-rate curves, or qualitative policy analysis are referenced to validate this against the known sparsity issues of terminal rewards.

    Authors: We acknowledge that explicit validation of the learned search policy is important given the sparsity of terminal rewards. In the revised manuscript we have added (i) training curves tracking search-tool invocation rate over RL steps for both base and instruct models, (ii) search-rate curves comparing the two-stage procedure against a single-stage baseline, and (iii) qualitative policy traces showing that the model learns to issue relevant, non-redundant queries rather than ignoring the tool. These additions directly address the sparsity concern and are placed in the Training and Analysis sections with accompanying discussion. revision: yes
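
The search-rate curves and redundancy analysis mentioned in the second response could be computed per checkpoint along these lines. This is a sketch, not the authors' analysis code: the `<query>` tag and the function name `search_stats` are hypothetical, and "redundant" is approximated here as an exact repeat of an earlier query after case and whitespace normalization.

```python
import re
from typing import Dict, List

QUERY_RE = re.compile(r"<query>(.*?)</query>", re.DOTALL)  # hypothetical tag


def search_stats(trajectories: List[str]) -> Dict[str, float]:
    """Invocation rate, mean calls per trajectory, and share of repeated queries."""
    total_calls = 0
    repeated = 0
    with_search = 0
    for text in trajectories:
        # Normalize case and whitespace so trivially different repeats are caught.
        queries = [" ".join(q.lower().split()) for q in QUERY_RE.findall(text)]
        total_calls += len(queries)
        repeated += len(queries) - len(set(queries))
        with_search += bool(queries)
    n = len(trajectories)
    return {
        "invocation_rate": with_search / n if n else 0.0,
        "mean_calls": total_calls / n if n else 0.0,
        "redundant_share": repeated / total_calls if total_calls else 0.0,
    }
```

Plotting these three numbers against RL step for sampled rollouts would show whether search use rises during training and whether the added queries are new or repeated.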

Circularity Check

0 steps flagged

No significant circularity in empirical RL training procedure

Full rationale

The paper presents an empirical two-stage outcome-based RL framework that trains LLMs to invoke external search tools using only terminal rewards from final-answer correctness. No equations, parameter fits, or derivations are shown that would reduce any claimed prediction or search behavior to a self-referential quantity or fitted input by construction. Claims rest on experimental comparisons against external RAG baselines and GPT-4o-mini rather than internal self-citations, uniqueness theorems, or ansatzes. The method is explicitly described as relying on external search outcomes without process rewards or distillation, making the central result an observed training outcome rather than a definitional or fitted tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard reinforcement learning assumptions applied to tool-use behavior in LLMs; no free parameters, invented entities, or ad-hoc axioms are explicitly introduced in the abstract.

axioms (1)
  • domain assumption: Outcome-based rewards suffice to train LLMs to decide when and how to use external search tools
    Core premise of the two-stage RL framework described in the abstract.

pith-pipeline@v0.9.0 · 5491 in / 1064 out tokens · 41828 ms · 2026-05-13T18:32:38.361697+00:00 · methodology

Forward citations

Cited by 29 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation

    cs.AI 2026-05 unverdicted novelty 7.0

    PyRAG turns multi-hop reasoning into executable Python code over retrieval tools for explicit, verifiable step-by-step RAG.

  2. LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG

    cs.CL 2026-05 unverdicted novelty 7.0

    LatentRAG performs agentic RAG by generating latent tokens for thoughts and subqueries in one forward pass, matching explicit methods' accuracy on seven benchmarks while reducing latency by ~90%.

  3. IG-Search: Step-Level Information Gain Rewards for Search-Augmented Reasoning

    cs.AI 2026-04 unverdicted novelty 7.0

    IG-Search computes step-level information gain rewards from policy probabilities to improve credit assignment in RL training for search-augmented QA, yielding 1.6-point gains over trajectory-level baselines on multi-h...

  4. ActFER: Agentic Facial Expression Recognition via Active Tool-Augmented Visual Reasoning

    cs.CV 2026-04 unverdicted novelty 7.0

    ActFER reformulates facial expression recognition as active tool-augmented visual reasoning with a custom reinforcement learning algorithm UC-GRPO that outperforms passive MLLM baselines on AU prediction.

  5. GeoBrowse: A Geolocation Benchmark for Agentic Tool Use with Expert-Annotated Reasoning Traces

    cs.CL 2026-04 unverdicted novelty 7.0

    GeoBrowse is a two-level geolocation benchmark combining visual cue composition with knowledge-intensive multi-hop queries, paired with the GATE agent workflow that outperforms no-tool, search-only, and image-only baselines.

  6. PruneTIR: Inference-Time Tool Call Pruning for Effective yet Efficient Tool-Integrated Reasoning

    cs.CL 2026-05 unverdicted novelty 6.0

    PruneTIR prunes erroneous tool-call trajectories during LLM inference via three trigger-based components to raise Pass@1 accuracy and efficiency while shortening context.

  7. SearchSkill: Teaching LLMs to Use Search Tools with Evolving Skill Banks

    cs.AI 2026-05 unverdicted novelty 6.0

    SearchSkill introduces an evolving SkillBank and two-stage SFT to make LLM search query planning explicit via skill selection, improving exact match on QA benchmarks and retrieval behavior.

  8. AIPO: Learning to Reason from Active Interaction

    cs.CL 2026-05 unverdicted novelty 6.0

    AIPO trains LLMs to expand their reasoning capability boundary via active multi-agent interaction with Verify, Knowledge, and Reasoning agents during RLVR, using importance sampling and clipping to handle feedback, th...

  9. SOD: Step-wise On-policy Distillation for Small Language Model Agents

    cs.CL 2026-05 unverdicted novelty 6.0

    SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.

  10. $S^3$-R1: Learning to Retrieve and Answer Step-by-Step with Synthetic Data

    cs.LG 2026-05 unverdicted novelty 6.0

    S^3-R1 generates synthetic intermediate-difficulty multi-hop questions and applies dense rewards for search quality plus answer correctness, yielding up to 10% better out-of-domain generalization than baselines.

  11. GRPO-VPS: Enhancing Group Relative Policy Optimization with Verifiable Process Supervision for Effective Reasoning

    cs.LG 2026-04 unverdicted novelty 6.0

    GRPO-VPS improves GRPO by using segment-wise conditional probabilities of the correct answer to supply process-level feedback, yielding up to 2.6-point accuracy gains and 13.7% shorter reasoning on math tasks.

  12. DR-MMSearchAgent: Deepening Reasoning in Multimodal Search Agents

    cs.CV 2026-04 unverdicted novelty 6.0

    DR-MMSearchAgent derives batch-wide trajectory advantages and uses differentiated Gaussian rewards to prevent premature collapse in multimodal agents, outperforming MMSearch-R1 by 8.4% on FVQA-test.

  13. Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence

    cs.AI 2026-04 unverdicted novelty 6.0

    Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.

  14. Negative Advantage Is a Double-Edged Sword: Calibrating Advantage in GRPO for Deep Search

    cs.CL 2026-04 unverdicted novelty 6.0

    CalibAdv calibrates advantages in GRPO by downscaling negative signals from incorrect final answers using intermediate step correctness and rebalancing answer-level advantages, yielding better performance and training...

  15. AutoSearch: Adaptive Search Depth for Efficient Agentic RAG via Reinforcement Learning

    cs.AI 2026-04 unverdicted novelty 6.0

    AutoSearch applies RL with a self-answering reward to adaptively determine minimal sufficient search depth in agentic RAG, reducing over-searching while maintaining answer quality on complex questions.

  16. $\pi$-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data

    cs.LG 2026-04 unverdicted novelty 6.0

    π-Play uses self-generated question construction paths as privileged information in multi-agent self-distillation to convert sparse-reward self-play into a dense-feedback loop, surpassing supervised search agents and ...

  17. Towards Long-horizon Agentic Multimodal Search

    cs.CV 2026-04 unverdicted novelty 6.0

    LMM-Searcher uses file-based visual UIDs and a fetch tool plus 12K synthesized trajectories to fine-tune a multimodal agent that scales to 100-turn horizons and reaches SOTA among open-source models on MM-BrowseComp a...

  18. OASES: Outcome-Aligned Search-Evaluation Co-Training for Agentic Search

    cs.AI 2026-04 unverdicted novelty 6.0

    OASES co-trains search policies and evaluators to generate outcome-aligned process rewards, outperforming standard RL baselines on five multi-hop QA benchmarks.

  19. CharTool: Tool-Integrated Visual Reasoning for Chart Understanding

    cs.AI 2026-04 unverdicted novelty 6.0

    CharTool equips MLLMs with cropping and code tools plus agentic RL on DuoChart data to raise chart-reasoning accuracy by up to 9.78 percent on benchmarks.

  20. From SWE-ZERO to SWE-HERO: Execution-free to Execution-based Fine-tuning for Software Engineering Agents

    cs.SE 2026-04 unverdicted novelty 6.0

    A two-stage SFT pipeline distills execution-free then execution-based trajectories from a 480B model into smaller Qwen2.5-Coder agents, yielding 62.2% resolution on SWE-bench Verified and 44.1% zero-shot on the multil...

  21. Procedural Knowledge at Scale Improves Reasoning

    cs.CL 2026-04 unverdicted novelty 6.0

    Reasoning Memory decomposes reasoning trajectories into 32 million subquestion-subroutine pairs and retrieves them via in-thought prompts to improve language model performance on math, science, and coding benchmarks b...

  22. Learning to Retrieve from Agent Trajectories

    cs.IR 2026-03 conditional novelty 6.0

    Retrievers trained on agent trajectories via the LRAT framework improve evidence recall, task success, and efficiency in agentic search benchmarks.

  23. ToolRL: Reward is All Tool Learning Needs

    cs.LG 2025-04 conditional novelty 6.0

    A principled reward design for tool selection and application in RL-trained LLMs delivers 17% gains over base models and 15% over SFT across benchmarks.

  24. ReTool: Reinforcement Learning for Strategic Tool Use in LLMs

    cs.CL 2025-04 unverdicted novelty 6.0

    ReTool uses outcome-driven RL to train 32B LLMs to dynamically use code tools during reasoning, reaching 72.5% accuracy on AIME and surpassing o1-preview.

  25. Scaling Retrieval-Augmented Reasoning with Parallel Search and Explicit Merging

    cs.AI 2026-05 unverdicted novelty 5.0

    MultiSearch uses parallel multi-query retrieval plus explicit merging inside a reinforcement-learning loop to improve retrieval-augmented reasoning, outperforming baselines on seven QA benchmarks.

  26. CuSearch: Curriculum Rollout Sampling via Search Depth for Agentic RAG

    cs.AI 2026-05 unverdicted novelty 5.0

    CuSearch reallocates fixed training budget toward deeper-search rollouts in RLVR for agentic RAG, treating search depth as an annotation-free proxy for supervision density and reporting up to 11.8 exact-match gains ov...

  27. E3-TIR: Enhanced Experience Exploitation for Tool-Integrated Reasoning

    cs.AI 2026-04 unverdicted novelty 5.0

    E3-TIR integrates expert prefixes, guided branches, and self-exploration via mix policy optimization to deliver 6% better tool-use performance with under 10% of the usual synthetic data and 1.46x ROI.

  28. EigentSearch-Q+: Enhancing Deep Research Agents with Structured Reasoning Tools

    cs.AI 2026-04 unverdicted novelty 4.0

    Structured query and evidence tools added to an AI research agent improve benchmark accuracy by 0.6 to 3.8 percentage points.

  29. XekRung Technical Report

    cs.CR 2026-04 unverdicted novelty 3.0

    XekRung achieves state-of-the-art performance on cybersecurity benchmarks among same-scale models via tailored data synthesis and multi-stage training while retaining strong general capabilities.

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · cited by 29 Pith papers · 3 internal anchors

  1. [1]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, Alex Iftimie, Alex Karpenko, Alex Tachard Passos, Alexander Neitz, Alexander Prokofiev, Alexander Wei, Allison Tam, Ally Bennett, Ananya Kumar, Andre Saraiva, Andrea Vallone, Andrew Duberstein, Andrew Kondrich, Andrey ...

  2. [2]

    DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai D...

  3. [3]

    Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, Chuning Tang, Congcong Wang, Dehao Zhang, Enming Yuan, Enzhe Lu, Fengxiang Tang, Flood Sung, Guangda Wei, Guokun Lai, Haiqing Guo, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haotian Yao, Haotian Zhao, Haoyu Lu, Haoze Li, Haoz...

  4. [4]

    An empirical study on eliciting and improving r1-like reasoning models, 2025

    Zhipeng Chen, Yingqian Min, Beichen Zhang, Jie Chen, Jinhao Jiang, Daixuan Cheng, Wayne Xin Zhao, Zheng Liu, Xu Miao, Yang Lu, Lei Fang, Zhongyuan Wang, and Ji-Rong Wen. An empirical study on eliciting and improving r1-like reasoning models, 2025

  5. [5]

    Hotpotqa: A dataset for diverse, explainable multi-hop question answering

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, 2018

  6. [6]

    Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps

    Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6609–6625, 2020

  7. [7]

    Omnieval: An omnidirectional and automatic rag evaluation benchmark in financial domain, 2025

    Shuting Wang, Jiejun Tan, Zhicheng Dou, and Ji-Rong Wen. Omnieval: An omnidirectional and automatic rag evaluation benchmark in financial domain, 2025

  8. [8]

    Multi-reranker: Maximizing performance of retrieval-augmented generation in the financerag challenge, 2024

    Joohyun Lee and Minji Roh. Multi-reranker: Maximizing performance of retrieval-augmented generation in the financerag challenge, 2024

  9. [9]

    Measuring and narrowing the compositionality gap in language models

    Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 5687–5711, 2023

  10. [10]

    Jie He, Nan Hu, Wanqiu Long, Jiaoyan Chen, and Jeff Z. Pan. Mintqa: A multi-hop question answering benchmark for evaluating llms on new and tail knowledge, 2025

  11. [11]

    Rag-star: Enhancing deliberative reasoning with retrieval augmented verification and refinement

    Jinhao Jiang, Jiayi Chen, Junyi Li, Ruiyang Ren, Shijie Wang, Wayne Xin Zhao, Yang Song, and Tao Zhang. Rag-star: Enhancing deliberative reasoning with retrieval augmented verification and refinement. CoRR, abs/2412.12881, 2024

  12. [12]

    Retrieval-augmented generation for large language models: A survey, 2024

    Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, and Haofen Wang. Retrieval-augmented generation for large language models: A survey, 2024

  13. [13]

    A survey on rag meeting llms: Towards retrieval-augmented large language models

    Wenqi Fan, Yujuan Ding, Liangbo Ning, Shijie Wang, Hengyun Li, Dawei Yin, Tat-Seng Chua, and Qing Li. A survey on rag meeting llms: Towards retrieval-augmented large language models. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD ’24, page 6491–6501, New York, NY, USA, 2024. Association for Computing Machinery

  14. [14]

    Search-o1: Agentic search-enhanced large reasoning models, 2025

    Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou. Search-o1: Agentic search-enhanced large reasoning models, 2025

  15. [15]

    Self-rag: Learning to retrieve, generate, and critique through self-reflection

    Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-rag: Learning to retrieve, generate, and critique through self-reflection. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024

  16. [16]

    Atom of thoughts for markov llm test-time scaling, 2025

    Fengwei Teng, Zhaoyang Yu, Quan Shi, Jiayi Zhang, Chenglin Wu, and Yuyu Luo. Atom of thoughts for markov llm test-time scaling, 2025

  17. [17]

    Chain-of-retrieval augmented generation

    Liang Wang, Haonan Chen, Nan Yang, Xiaolong Huang, Zhicheng Dou, and Furu Wei. Chain-of-retrieval augmented generation. CoRR, abs/2501.14342, 2025

  18. [18]

    Sft memorizes, rl generalizes: A comparative study of foundation model post-training, 2025

    Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V. Le, Sergey Levine, and Yi Ma. Sft memorizes, rl generalizes: A comparative study of foundation model post-training, 2025

  19. [19]

    Can we further elicit reasoning in llms? critic-guided planning with retrieval-augmentation for solving challenging tasks

    Xingxuan Li, Weiwen Xu, Ruochen Zhao, Fangkai Jiao, Shafiq Joty, and Lidong Bing. Can we further elicit reasoning in llms? critic-guided planning with retrieval-augmentation for solving challenging tasks. arXiv preprint arXiv:2410.01428, 2024

  20. [20]

    Reinforce++: A simple and efficient approach for aligning large language models, 2025

    Jian Hu. Reinforce++: A simple and efficient approach for aligning large language models, 2025

  21. [21]

    Qwen2.5 technical report, 2025

    Qwen, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

  22. [22]

    Musique: Multihop questions via single-hop question composition

    Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Musique: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539–554, 2022

  23. [23]

    Rearter: Retrieval-augmented reasoning with trustworthy process rewarding, 2025

    Zhongxiang Sun, Qipeng Wang, Weijie Yu, Xiaoxue Zang, Kai Zheng, Jun Xu, Xiao Zhang, Song Yang, and Han Li. Rearter: Retrieval-augmented reasoning with trustworthy process rewarding, 2025

  24. [24]

    Sure: Summarizing retrievals using answer candidates for open-domain QA of LLMs

    Jaehyung Kim, Jaehyun Nam, Sangwoo Mo, Jongjin Park, Sang-Woo Lee, Minjoon Seo, Jung-Woo Ha, and Jinwoo Shin. Sure: Summarizing retrievals using answer candidates for open-domain QA of LLMs. In The Twelfth International Conference on Learning Representations, 2024

  25. [25]

    Replug: Retrieval-augmented black-box language models

    Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Rich James, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. Replug: Retrieval-augmented black-box language models. arXiv preprint arXiv:2301.12652, 2023

  26. [26]

    Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression

    Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression. arXiv preprint arXiv:2310.06839, 2023

  27. [27]

    RECOMP: Improving retrieval-augmented LMs with context compression and selective augmentation

    Fangyuan Xu, Weijia Shi, and Eunsol Choi. RECOMP: Improving retrieval-augmented LMs with context compression and selective augmentation. In The Twelfth International Conference on Learning Representations, 2024

  28. [28]

    Compressing context to enhance inference efficiency of large language models

    Yucheng Li, Bo Dong, Frank Guerin, and Chenghua Lin. Compressing context to enhance inference efficiency of large language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6342–6353, Singapore, December 2023. Association for Computational Linguistics

  29. [29]

    Self-knowledge guided retrieval augmentation for large language models

    Yile Wang, Peng Li, Maosong Sun, and Yang Liu. Self-knowledge guided retrieval augmentation for large language models. arXiv preprint arXiv:2310.05002, 2023

  30. [30]

    Measuring and narrowing the compositionality gap in language models

    Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models. arXiv preprint arXiv:2210.03350, 2022

  31. [31]

    Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy

    Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy. arXiv preprint arXiv:2305.15294, 2023

  32. [32]

    Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions

    Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10014–10037, 2023

  33. [33]

    Marco-o1: Towards open reasoning models for open-ended solutions

    Yu Zhao, Huifeng Yin, Bo Zeng, Hao Wang, Tianqi Shi, Chenyang Lyu, Longyue Wang, Weihua Luo, and Kaifu Zhang. Marco-o1: Towards open reasoning models for open-ended solutions. arXiv preprint arXiv:2411.14405, 2024

  34. [34]

    Skywork-o1 open series

    Skywork o1 Team. Skywork-o1 open series. https://huggingface.co/Skywork, November 2024

  35. [35]

    Flashrag: A modular toolkit for efficient retrieval-augmented generation research

    Jiajie Jin, Yutao Zhu, Xinyu Yang, Chenghao Zhang, and Zhicheng Dou. Flashrag: A modular toolkit for efficient retrieval-augmented generation research. arXiv preprint arXiv:2405.13576, 2024

  36. [36]

    Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick S. H. Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vladimir Karpukhin, Jean Maillard, Vassilis Plachouras, Tim Rocktäschel, and Sebastian Riedel. KILT: a benchmark for knowledge intensive language tasks. In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tür, I...

  37. [37]

    Zero: Memory optimizations toward training trillion parameter models, 2020

    Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models, 2020

  38. [38]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024.