arXiv: 2503.05592 · v2 · submitted 2025-03-07 · cs.AI · cs.CL · cs.IR

Recognition: 2 theorem links

R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning

Huatong Song, Jie Chen, Jinhao Jiang, Ji-Rong Wen, Lei Fang, Wayne Xin Zhao, Yingqian Min, Zhipeng Chen

Pith reviewed 2026-05-13 18:32 UTC · model grok-4.3

classification cs.AI · cs.CL · cs.IR

keywords reinforcement learning · large language models · search capability · retrieval-augmented generation · tool use · outcome-based RL

The pith

R1-Searcher trains LLMs with outcome-based RL to call external search tools during reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a two-stage outcome-based reinforcement learning method that teaches large language models to decide when to invoke external search systems while solving problems. Existing models often produce errors on questions needing facts beyond their training data because they cannot fetch fresh information. By rewarding only final-answer correctness, the approach builds search behavior without step-by-step process signals or special warm-up training. If the method works as claimed, it yields higher accuracy on knowledge-heavy tasks than standard retrieval systems and even closed models such as GPT-4o-mini, and it applies to both base and instruction-tuned models.

Core claim

R1-Searcher is a two-stage outcome-based RL framework that enables LLMs to autonomously generate calls to external search systems inside their reasoning process, producing stronger results on knowledge-intensive benchmarks than prior RAG approaches and GPT-4o-mini without any process rewards or distillation for initialization.

What carries the argument

Two-stage outcome-based reinforcement learning that rewards final answer correctness and thereby incentivizes the model to insert search tool calls into its reasoning trajectory.
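A minimal sketch of what an outcome-only reward looks like, assuming a binary exact-match check on the final answer. The page does not give the paper's actual reward function, so the normalization rules, the trajectory dictionaries, and the names `normalize` and `outcome_reward` are all hypothetical illustrations rather than the authors' implementation.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def outcome_reward(final_answer: str, gold_answer: str) -> float:
    """Terminal reward: 1.0 for a correct final answer, else 0.0.

    Intermediate steps, including any search calls, are never scored.
    """
    return 1.0 if normalize(final_answer) == normalize(gold_answer) else 0.0


# Hypothetical trajectories: the step format is an assumption for illustration.
with_search = {
    "steps": ["think", "search: capital of Australia", "read results"],
    "final_answer": "Canberra.",
}
without_search = {
    "steps": ["think"],
    "final_answer": "Sydney",
}

gold = "canberra"
rewards = [
    outcome_reward(t["final_answer"], gold)
    for t in (with_search, without_search)
]
print(rewards)
```

The reward function never inspects `steps`. Under this kind of signal, search calls are reinforced only indirectly, through the trajectories that end in correct answers.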

If this is right

The same outcome-based RL pipeline produces usable search behavior in both base and instruction-tuned models.
Search use generalizes to datasets outside the training distribution.
Accuracy on time-sensitive and fact-heavy questions rises above that of conventional RAG pipelines.
No auxiliary process reward model or supervised warm-up phase is required for the capability to emerge.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach may transfer to training other tool-use skills such as code execution or database queries using only final-outcome signals.
Reducing dependence on ever-larger internal knowledge stores becomes feasible if external search can be reliably triggered on demand.
Training pipelines that avoid process supervision could scale more easily to larger models or longer reasoning traces.

Load-bearing premise

Outcome rewards alone can reliably produce and generalize search behavior without process supervision or a distillation cold start.

What would settle it

A controlled test set of knowledge questions where internal model knowledge is provably insufficient; if the trained model answers correctly without ever calling search, or calls search but still fails at rates comparable to the untrained base model, the claim is falsified.
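The test described above could be scored roughly as follows. This is a sketch under stated assumptions: the `Episode` record, the tolerance values `min_gain` and `max_no_search_correct`, and the toy data are hypothetical, and the second condition is approximated by comparing overall accuracy against the base model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    called_search: bool  # did the model issue at least one search call?
    correct: bool        # was the final answer correct?


def summarize(episodes):
    """Return (search_rate, accuracy, correct_without_search_rate)."""
    n = len(episodes)
    if n == 0:
        raise ValueError("need at least one episode")
    search_rate = sum(e.called_search for e in episodes) / n
    accuracy = sum(e.correct for e in episodes) / n
    no_search_correct = sum(e.correct and not e.called_search for e in episodes) / n
    return search_rate, accuracy, no_search_correct


def claim_survives(trained, base, min_gain=0.05, max_no_search_correct=0.05):
    """Apply the two falsification conditions with hypothetical tolerances."""
    _, trained_acc, trained_no_search = summarize(trained)
    _, base_acc, _ = summarize(base)
    # Condition 1: correct answers without search on questions that need it.
    answers_without_search = trained_no_search > max_no_search_correct
    # Condition 2: accuracy no better than the untrained base model.
    no_better_than_base = (trained_acc - base_acc) < min_gain
    return not (answers_without_search or no_better_than_base)


# Toy data, invented for illustration only.
trained = [Episode(True, True)] * 8 + [Episode(True, False)] * 2
base = [Episode(False, True)] + [Episode(False, False)] * 9

print(summarize(trained))
print(claim_survives(trained, base))
```

In practice the tolerances would need to be set from the noise level of the test set, which is why they are exposed as parameters rather than fixed.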

Original abstract

Existing Large Reasoning Models (LRMs) have shown the potential of reinforcement learning (RL) to enhance the complex reasoning capabilities of Large Language Models (LLMs). While they achieve remarkable performance on challenging tasks such as mathematics and coding, they often rely on their internal knowledge to solve problems, which can be inadequate for time-sensitive or knowledge-intensive questions, leading to inaccuracies and hallucinations. To address this, we propose R1-Searcher, a novel two-stage outcome-based RL approach designed to enhance the search capabilities of LLMs. This method allows LLMs to autonomously invoke external search systems to access additional knowledge during the reasoning process. Our framework relies exclusively on RL, without requiring process rewards or distillation for a cold start, effectively generalizing to out-of-domain datasets and supporting both Base and Instruct models. Our experiments demonstrate that our method significantly outperforms previous strong RAG methods, even when compared to the closed-source GPT-4o-mini.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

R1-Searcher offers a clean two-stage outcome-only RL recipe for getting LLMs to call external search, but the abstract supplies no numbers or training details to show it actually works.

The main idea is a two-stage RL procedure that trains LLMs to decide when to invoke an external search tool using only final-answer correctness as the signal. It skips process rewards and any distillation step for cold start, and the abstract says the same setup works on both base and instruct models while beating standard RAG baselines and even GPT-4o-mini on knowledge tasks. That framing is the clearest new piece relative to earlier RL-for-reasoning papers: it treats search invocation as something that can be shaped purely by outcome feedback.

If the training curves actually show rising search frequency and the gains are real, the method is simple enough that groups working on tool-augmented agents could try it quickly. The paper also claims some out-of-domain generalization, which would be useful if shown with proper controls.

The soft spot is the missing evidence. The abstract asserts significant outperformance but gives no dataset sizes, no exact baselines, no statistical tests, and no data on whether the model actually increases its search rate or just learns to answer better without the tool. Sparse terminal rewards often produce policies that ignore the action or emit low-value queries, exactly the risk the stress-test note flags. Without training dynamics or ablations on the two stages, it is hard to tell whether the signal was dense enough.

The approach itself is not circular and engages the literature on RAG and RL post-training in a straightforward way. This is the kind of paper that belongs in a reading group focused on practical RL for tool use, because the recipe is easy to reproduce if the results check out. I would send it to peer review so referees can ask for the missing experimental details and checks on actual search behavior; the core question is important enough to justify the time even if revisions are needed.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces R1-Searcher, a two-stage outcome-based reinforcement learning framework that trains LLMs (both base and instruct variants) to autonomously invoke external search tools during reasoning. It claims this approach, relying exclusively on final-answer correctness rewards without process supervision or distillation for cold-start, enables effective search behavior that generalizes out-of-domain and yields significant outperformance over strong RAG baselines, including closed-source GPT-4o-mini.

Significance. If the empirical results hold with proper controls, the work would demonstrate that sparse outcome-only RL can reliably induce tool-use policies for external knowledge access, offering a simpler alternative to process-reward or imitation-based methods for reducing hallucinations on knowledge-intensive tasks.

major comments (2)

[Abstract] Abstract and Experiments section: the central claim of significant outperformance over prior RAG methods and GPT-4o-mini is asserted without any reported datasets, baselines, metrics, statistical tests, ablations, or controls in the provided text, leaving the empirical support for the two-stage RL procedure unevaluable.
[Method] Method and Training sections: the premise that outcome-based rewards alone suffice to increase search-tool invocation frequency and quality (rather than producing ignored or redundant queries) is load-bearing for the contribution, yet no training dynamics, search-rate curves, or qualitative policy analysis are referenced to validate this against the known sparsity issues of terminal rewards.
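The search-rate curves requested in the second comment can be computed from rollout logs with very little machinery. The sketch below assumes a hypothetical log format of `(training_step, num_search_calls)` pairs; nothing on this page specifies how the paper's training logs are stored, and the numbers are invented.

```python
from collections import defaultdict


def search_rate_by_step(rollouts):
    """Aggregate search usage per training step.

    rollouts: iterable of (training_step, num_search_calls) pairs.
    Returns {step: (fraction_of_rollouts_with_search, mean_calls)},
    ordered by step.
    """
    buckets = defaultdict(list)
    for step, calls in rollouts:
        buckets[step].append(calls)
    return {
        step: (
            sum(c > 0 for c in calls) / len(calls),
            sum(calls) / len(calls),
        )
        for step, calls in sorted(buckets.items())
    }


# Invented log: four rollouts at step 0 and four at step 100.
log = [(0, 0), (0, 0), (0, 1), (0, 0), (100, 1), (100, 2), (100, 0), (100, 1)]
curve = search_rate_by_step(log)
for step, (fraction, mean_calls) in curve.items():
    print(step, fraction, mean_calls)
```

Reporting both the fraction of rollouts that search and the mean number of calls helps separate a policy that ignores the tool from one that issues many redundant queries.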

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and have made revisions to strengthen the empirical presentation and analysis of the two-stage RL procedure.

Referee: [Abstract] Abstract and Experiments section: the central claim of significant outperformance over prior RAG methods and GPT-4o-mini is asserted without any reported datasets, baselines, metrics, statistical tests, ablations, or controls in the provided text, leaving the empirical support for the two-stage RL procedure unevaluable.

Authors: We agree that the abstract should provide clearer pointers to the empirical support. The full Experiments section reports results on HotpotQA, 2WikiMultihopQA, and out-of-domain sets, with baselines including standard RAG pipelines and GPT-4o-mini, using exact-match and F1 metrics, plus ablations on the two-stage design and statistical significance via paired t-tests. To make this immediately visible, we have expanded the abstract to name the primary datasets, metrics, and key controls, and added explicit cross-references to the Experiments section and appendix tables.

revision: yes

Referee: [Method] Method and Training sections: the premise that outcome-based rewards alone suffice to increase search-tool invocation frequency and quality (rather than producing ignored or redundant queries) is load-bearing for the contribution, yet no training dynamics, search-rate curves, or qualitative policy analysis are referenced to validate this against the known sparsity issues of terminal rewards.

Authors: We acknowledge that explicit validation of the learned search policy is important given the sparsity of terminal rewards. In the revised manuscript we have added (i) training curves tracking search-tool invocation rate over RL steps for both base and instruct models, (ii) search-rate curves comparing the two-stage procedure against a single-stage baseline, and (iii) qualitative policy traces showing that the model learns to issue relevant, non-redundant queries rather than ignoring the tool. These additions directly address the sparsity concern and are placed in the Training and Analysis sections with accompanying discussion.

revision: yes
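For readers unfamiliar with the exact-match and F1 metrics named in the first response, the sketch below follows the token-overlap definition commonly used for extractive QA. It is an assumption that the paper scores answers this way; the page does not show its evaluation script, and the function names here are illustrative.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The city of Canberra", "Canberra"))
print(token_f1("The city of Canberra", "Canberra"))
```

Exact match gives no credit for the verbose prediction, while F1 gives partial credit, which is why the two metrics are usually reported together.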

Circularity Check

0 steps flagged

No significant circularity in empirical RL training procedure

The paper presents an empirical two-stage outcome-based RL framework that trains LLMs to invoke external search tools using only terminal rewards from final-answer correctness. No equations, parameter fits, or derivations are shown that would reduce any claimed prediction or search behavior to a self-referential quantity or fitted input by construction. Claims rest on experimental comparisons against external RAG baselines and GPT-4o-mini rather than internal self-citations, uniqueness theorems, or ansatzes. The method is explicitly described as relying on external search outcomes without process rewards or distillation, making the central result an observed training outcome rather than a definitional or fitted tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard reinforcement learning assumptions applied to tool-use behavior in LLMs; no free parameters, invented entities, or ad-hoc axioms are explicitly introduced in the abstract.

axioms (1)

domain assumption: Outcome-based rewards suffice to train LLMs to decide when and how to use external search tools.
Core premise of the two-stage RL framework described in the abstract.

pith-pipeline@v0.9.0 · 5491 in / 1064 out tokens · 41828 ms · 2026-05-13T18:32:38.361697+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Cost.FunctionalEquation · washburn_uniqueness_aczel · unclear
"Our framework relies exclusively on RL, without requiring process rewards or distillation for a cold start. Our experiments demonstrate that our method significantly outperforms previous strong RAG methods, even when compared to the closed-source GPT-4o-mini."

IndisputableMonolith.Foundation.DAlembert.Inevitability · bilinear_family_forced · unclear
"we propose R1-Searcher, a novel two-stage outcome-based RL approach designed to enhance the search capabilities of LLMs"

Forward citations

Cited by 29 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation
cs.AI · 2026-05 · unverdicted · novelty 7.0
PyRAG turns multi-hop reasoning into executable Python code over retrieval tools for explicit, verifiable step-by-step RAG.

LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG
cs.CL · 2026-05 · unverdicted · novelty 7.0
LatentRAG performs agentic RAG by generating latent tokens for thoughts and subqueries in one forward pass, matching explicit methods' accuracy on seven benchmarks while reducing latency by ~90%.

IG-Search: Step-Level Information Gain Rewards for Search-Augmented Reasoning
cs.AI · 2026-04 · unverdicted · novelty 7.0
IG-Search computes step-level information gain rewards from policy probabilities to improve credit assignment in RL training for search-augmented QA, yielding 1.6-point gains over trajectory-level baselines on multi-h...

ActFER: Agentic Facial Expression Recognition via Active Tool-Augmented Visual Reasoning
cs.CV · 2026-04 · unverdicted · novelty 7.0
ActFER reformulates facial expression recognition as active tool-augmented visual reasoning with a custom reinforcement learning algorithm UC-GRPO that outperforms passive MLLM baselines on AU prediction.

GeoBrowse: A Geolocation Benchmark for Agentic Tool Use with Expert-Annotated Reasoning Traces
cs.CL · 2026-04 · unverdicted · novelty 7.0
GeoBrowse is a two-level geolocation benchmark combining visual cue composition with knowledge-intensive multi-hop queries, paired with the GATE agent workflow that outperforms no-tool, search-only, and image-only baselines.

PruneTIR: Inference-Time Tool Call Pruning for Effective yet Efficient Tool-Integrated Reasoning
cs.CL · 2026-05 · unverdicted · novelty 6.0
PruneTIR prunes erroneous tool-call trajectories during LLM inference via three trigger-based components to raise Pass@1 accuracy and efficiency while shortening context.

SearchSkill: Teaching LLMs to Use Search Tools with Evolving Skill Banks
cs.AI · 2026-05 · unverdicted · novelty 6.0
SearchSkill introduces an evolving SkillBank and two-stage SFT to make LLM search query planning explicit via skill selection, improving exact match on QA benchmarks and retrieval behavior.

AIPO: Learning to Reason from Active Interaction
cs.CL · 2026-05 · unverdicted · novelty 6.0
AIPO trains LLMs to expand their reasoning capability boundary via active multi-agent interaction with Verify, Knowledge, and Reasoning agents during RLVR, using importance sampling and clipping to handle feedback, th...

SOD: Step-wise On-policy Distillation for Small Language Model Agents
cs.CL · 2026-05 · unverdicted · novelty 6.0
SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.

$S^3$-R1: Learning to Retrieve and Answer Step-by-Step with Synthetic Data
cs.LG · 2026-05 · unverdicted · novelty 6.0
S^3-R1 generates synthetic intermediate-difficulty multi-hop questions and applies dense rewards for search quality plus answer correctness, yielding up to 10% better out-of-domain generalization than baselines.

GRPO-VPS: Enhancing Group Relative Policy Optimization with Verifiable Process Supervision for Effective Reasoning
cs.LG · 2026-04 · unverdicted · novelty 6.0
GRPO-VPS improves GRPO by using segment-wise conditional probabilities of the correct answer to supply process-level feedback, yielding up to 2.6-point accuracy gains and 13.7% shorter reasoning on math tasks.

DR-MMSearchAgent: Deepening Reasoning in Multimodal Search Agents
cs.CV · 2026-04 · unverdicted · novelty 6.0
DR-MMSearchAgent derives batch-wide trajectory advantages and uses differentiated Gaussian rewards to prevent premature collapse in multimodal agents, outperforming MMSearch-R1 by 8.4% on FVQA-test.

Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
cs.AI · 2026-04 · unverdicted · novelty 6.0
Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.

Negative Advantage Is a Double-Edged Sword: Calibrating Advantage in GRPO for Deep Search
cs.CL · 2026-04 · unverdicted · novelty 6.0
CalibAdv calibrates advantages in GRPO by downscaling negative signals from incorrect final answers using intermediate step correctness and rebalancing answer-level advantages, yielding better performance and training...

AutoSearch: Adaptive Search Depth for Efficient Agentic RAG via Reinforcement Learning
cs.AI · 2026-04 · unverdicted · novelty 6.0
AutoSearch applies RL with a self-answering reward to adaptively determine minimal sufficient search depth in agentic RAG, reducing over-searching while maintaining answer quality on complex questions.

$\pi$-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data
cs.LG · 2026-04 · unverdicted · novelty 6.0
π-Play uses self-generated question construction paths as privileged information in multi-agent self-distillation to convert sparse-reward self-play into a dense-feedback loop, surpassing supervised search agents and ...

Towards Long-horizon Agentic Multimodal Search
cs.CV · 2026-04 · unverdicted · novelty 6.0
LMM-Searcher uses file-based visual UIDs and a fetch tool plus 12K synthesized trajectories to fine-tune a multimodal agent that scales to 100-turn horizons and reaches SOTA among open-source models on MM-BrowseComp a...

OASES: Outcome-Aligned Search-Evaluation Co-Training for Agentic Search
cs.AI · 2026-04 · unverdicted · novelty 6.0
OASES co-trains search policies and evaluators to generate outcome-aligned process rewards, outperforming standard RL baselines on five multi-hop QA benchmarks.

CharTool: Tool-Integrated Visual Reasoning for Chart Understanding
cs.AI · 2026-04 · unverdicted · novelty 6.0
CharTool equips MLLMs with cropping and code tools plus agentic RL on DuoChart data to raise chart-reasoning accuracy by up to 9.78 percent on benchmarks.

From SWE-ZERO to SWE-HERO: Execution-free to Execution-based Fine-tuning for Software Engineering Agents
cs.SE · 2026-04 · unverdicted · novelty 6.0
A two-stage SFT pipeline distills execution-free then execution-based trajectories from a 480B model into smaller Qwen2.5-Coder agents, yielding 62.2% resolution on SWE-bench Verified and 44.1% zero-shot on the multil...

Procedural Knowledge at Scale Improves Reasoning
cs.CL · 2026-04 · unverdicted · novelty 6.0
Reasoning Memory decomposes reasoning trajectories into 32 million subquestion-subroutine pairs and retrieves them via in-thought prompts to improve language model performance on math, science, and coding benchmarks b...

Learning to Retrieve from Agent Trajectories
cs.IR · 2026-03 · conditional · novelty 6.0
Retrievers trained on agent trajectories via the LRAT framework improve evidence recall, task success, and efficiency in agentic search benchmarks.

ToolRL: Reward is All Tool Learning Needs
cs.LG · 2025-04 · conditional · novelty 6.0
A principled reward design for tool selection and application in RL-trained LLMs delivers 17% gains over base models and 15% over SFT across benchmarks.

ReTool: Reinforcement Learning for Strategic Tool Use in LLMs
cs.CL · 2025-04 · unverdicted · novelty 6.0
ReTool uses outcome-driven RL to train 32B LLMs to dynamically use code tools during reasoning, reaching 72.5% accuracy on AIME and surpassing o1-preview.

Scaling Retrieval-Augmented Reasoning with Parallel Search and Explicit Merging
cs.AI · 2026-05 · unverdicted · novelty 5.0
MultiSearch uses parallel multi-query retrieval plus explicit merging inside a reinforcement-learning loop to improve retrieval-augmented reasoning, outperforming baselines on seven QA benchmarks.

CuSearch: Curriculum Rollout Sampling via Search Depth for Agentic RAG
cs.AI · 2026-05 · unverdicted · novelty 5.0
CuSearch reallocates fixed training budget toward deeper-search rollouts in RLVR for agentic RAG, treating search depth as an annotation-free proxy for supervision density and reporting up to 11.8 exact-match gains ov...

E3-TIR: Enhanced Experience Exploitation for Tool-Integrated Reasoning
cs.AI · 2026-04 · unverdicted · novelty 5.0
E3-TIR integrates expert prefixes, guided branches, and self-exploration via mix policy optimization to deliver 6% better tool-use performance with under 10% of the usual synthetic data and 1.46x ROI.

EigentSearch-Q+: Enhancing Deep Research Agents with Structured Reasoning Tools
cs.AI · 2026-04 · unverdicted · novelty 4.0
Structured query and evidence tools added to an AI research agent improve benchmark accuracy by 0.6 to 3.8 percentage points.

XekRung Technical Report
cs.CR · 2026-04 · unverdicted · novelty 3.0
XekRung achieves state-of-the-art performance on cybersecurity benchmarks among same-scale models via tailored data synthesis and multi-stage training while retaining strong general capabilities.

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · cited by 29 Pith papers · 3 internal anchors
