R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning

Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao · 2025 · cs.AI · arXiv 2503.05592

Canonical reference. 73% of citing Pith papers cite this work as background.

79 Pith papers citing it

abstract

Existing Large Reasoning Models (LRMs) have shown the potential of reinforcement learning (RL) to enhance the complex reasoning capabilities of Large Language Models (LLMs). While they achieve remarkable performance on challenging tasks such as mathematics and coding, they often rely on their internal knowledge to solve problems, which can be inadequate for time-sensitive or knowledge-intensive questions, leading to inaccuracies and hallucinations. To address this, we propose R1-Searcher, a novel two-stage outcome-based RL approach designed to enhance the search capabilities of LLMs. This method allows LLMs to autonomously invoke external search systems to access additional knowledge during the reasoning process. Our framework relies exclusively on RL, without requiring process rewards or distillation for a cold start, effectively generalizing to out-of-domain datasets and supporting both Base and Instruct models. Our experiments demonstrate that our method significantly outperforms previous strong RAG methods, even when compared to the closed-source GPT-4o-mini.

citation-role summary

background 11 · baseline 1 · dataset 1 · method 1 · other 1
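
The 73% background figure quoted for this hub appears to follow from these role counts, assuming the percentage is taken over the 15 classified citations rather than over all 79 citing papers. A minimal sketch of that arithmetic (the name `role_shares` is illustrative and not part of any Pith tooling):

```python
# Role counts copied from the citation-role summary above.
role_counts = {"background": 11, "baseline": 1, "dataset": 1, "method": 1, "other": 1}


def role_shares(counts):
    """Return each role's share of classified citations as a whole-number percent."""
    total = sum(counts.values())
    return {role: round(100 * n / total) for role, n in counts.items()}


total_classified = sum(role_counts.values())
shares = role_shares(role_counts)

print(total_classified)      # 15
print(shares["background"])  # 73
```

Under that assumption, 11 of 15 classified citations is about 73.3%, which rounds to the reported 73%.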

citation-polarity summary

background 11 · baseline 1 · unclear 1 · use dataset 1 · use method 1

representative citing papers

FORT-Searcher: Synthesizing Shortcut-Resistant Search Tasks for Training Deep Search Agents

cs.CL · 2026-06-10 · unverdicted · novelty 7.0

FORT synthesizes shortcut-resistant search tasks by controlling four identified shortcut risks across entity selection, graph construction, question formulation, and refinement, producing training data that yields agents with longer search trajectories and top performance among open-source models on

Plan Before Search: Search Agents Need Plan

cs.AI · 2026-05-27 · unverdicted · novelty 7.0

A self-bootstrapping paradigm uses trajectories from a small seed model to activate pre-planned sub-question decomposition in target models, enabling consistent outperformance on multi-hop QA without external distillation.

Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents

cs.AI · 2026-05-22 · unverdicted · novelty 7.0

Co-ReAct adds step-level rubric guidance to ReAct agents via a GRPO-trained generator using list-wise ranking rewards, yielding consistent gains on DeepResearchBench and SQA-CS-V2.

SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain

cs.AI · 2026-05-18 · unverdicted · novelty 7.0 · 2 refs

SVFSearch is the first open benchmark for short-video frame search in the Chinese gaming domain, providing a frozen retrieval environment and showing performance gaps of 13-29 points between direct QA models, practical agents, and oracle knowledge.

Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation

cs.AI · 2026-05-13 · unverdicted · novelty 7.0

PyRAG turns multi-hop reasoning into executable Python code over retrieval tools for explicit, verifiable step-by-step RAG.

CuSearch: Curriculum Rollout Sampling via Search Depth for Agentic RAG

cs.AI · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

CuSearch reallocates rollout budget in RLVR toward deeper-search trajectories as a proxy for retrieval supervision density, yielding up to 11.8 exact-match gains over uniform GRPO sampling on ZeroSearch.

LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

LatentRAG performs agentic RAG by generating latent tokens for thoughts and subqueries in one forward pass, matching explicit methods' accuracy on seven benchmarks while reducing latency by ~90%.

IG-Search: Step-Level Information Gain Rewards for Search-Augmented Reasoning

cs.AI · 2026-04-16 · unverdicted · novelty 7.0

IG-Search computes step-level information gain rewards from policy probabilities to improve credit assignment in RL training for search-augmented QA, yielding 1.6-point gains over trajectory-level baselines on multi-hop tasks.

ActFER: Agentic Facial Expression Recognition via Active Tool-Augmented Visual Reasoning

cs.CV · 2026-04-10 · unverdicted · novelty 7.0

ActFER reformulates facial expression recognition as active tool-augmented visual reasoning with a custom reinforcement learning algorithm UC-GRPO that outperforms passive MLLM baselines on AU prediction.

GeoBrowse: A Geolocation Benchmark for Agentic Tool Use with Expert-Annotated Reasoning Traces

cs.CL · 2026-04-05 · unverdicted · novelty 7.0

GeoBrowse is a two-level geolocation benchmark combining visual cue composition with knowledge-intensive multi-hop queries, paired with the GATE agent workflow that outperforms no-tool, search-only, and image-only baselines.

MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning

cs.CL · 2025-11-04 · unverdicted · novelty 7.0

MemSearcher trains LLMs to manage compact memory in multi-turn searches via multi-context GRPO for end-to-end RL, outperforming ReAct-style baselines with stable token counts.

ECHO: Prune to act, trace to learn with selective turn memory in agentic RL

cs.LG · 2026-06-30 · unverdicted · novelty 6.0

ECHO is a selective turn-memory framework for agentic RL that compresses turns into indexed records, selects them for bounded contexts, and uses source indices to assign outcome credit to supporting evidence, reaching 43.4% accuracy on BrowseComp-Plus versus 28.9% for GRPO and 36.1% for SUPO.

To Reason or to Fabricate: Reasoning Without Shortcuts via Hint-Anchored Pairwise Aggregation

cs.CL · 2026-06-28 · unverdicted · novelty 6.0

HIPPO is a new RL framework that uses hint-anchored pairwise aggregation to distinguish and promote authentic reasoning deduction in LLMs instead of shortcut memorization from data overlap.

Only Ask What You Don't Know: Grounded Delta Planning for Efficient Multi-step RAG

cs.CL · 2026-06-21 · unverdicted · novelty 6.0

GDP-RAG targets only information deltas in multi-hop RAG through preliminary grounding, gap-conditioned prompts, and skeletal trajectories, reaching 60.63% accuracy at 0.51 cost-of-pass on HotpotQA, 2WikiMultiHopQA, and MuSiQue.

ARCO: Adaptive Rubric with Co-Evolution for Multi-Step LLM-Based Agents

cs.AI · 2026-06-19 · unverdicted · novelty 6.0

ARCO introduces a co-evolving rubric model with generation and scoring heads plus a trajectory decomposition constraint that improves exact-match scores on multi-hop QA tasks over outcome, rubric, and process reward baselines.

BALTO: Balanced Token-Level Policy Optimization for Hallucination Mitigation

cs.CL · 2026-06-14 · unverdicted · novelty 6.0

BALTO projects claim-level verification into balanced token-level rewards for RL-based hallucination mitigation in LLMs.

Toward Generalist Autonomous Research via Hypothesis-Tree Refinement

cs.CL · 2026-06-10 · unverdicted · novelty 6.0

Arbor combines a coordinator, executors, and a hypothesis tree to enable cumulative autonomous research, outperforming Codex and Claude Code by over 2.5x on six real tasks and reaching 86.36% Any Medal on MLE-Bench Lite.

Divide and Cooperate: Role-Decomposed Multi-Agent LLM Training with Cross-Agent Learning Signals

cs.LG · 2026-06-09 · unverdicted · novelty 6.0

DAC decomposes agentic search into cooperative searcher and generator agents with cross-agent signals (abstention reward and hard-positive augmentation), achieving strong QA benchmark performance via LoRA on a shared backbone.

Customer-Agent: Overcoming Context Limitations in Ultra-Long Shopping Trajectories via Tool-Augmented Agents and RLVR

cs.CL · 2026-06-06 · unverdicted · novelty 6.0

Introduces ShopTrajQA long-context benchmark and an RLVR-trained tool-augmented agent that bypasses LLM context limits by external file storage and code-based retrieval for shopping trajectories.

Self-Evolving Deep Research via Joint Generation and Evaluation

cs.CL · 2026-06-03 · unverdicted · novelty 6.0

SCORE is a shared-parameter co-evolutionary framework coupling generation and evaluation of deep research reports with a meta-harness to adapt evaluation standards as performance improves.

Synthesize and Reward -- Reinforcement Learning for Multi-Step Tool Use in Live Environments

cs.CL · 2026-06-02 · unverdicted · novelty 6.0

PROVE trains LLMs on multi-step tool calls using 20 live MCP servers with 343 tools, state-grounded synthesis, and adaptive efficiency rewards, delivering gains of up to 10.2 points on BFCL Multi-Turn and similar on other benchmarks.

Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses

cs.AI · 2026-06-01 · unverdicted · novelty 6.0

Harness-1 uses a state-externalizing harness for RL-trained search agents and reports 0.730 average curated recall, outperforming the next open subagent by 11.4 points.

Planner-Centric Reinforcement Learning for Deep Research with Structure-Aware Reward

cs.AI · 2026-05-29 · unverdicted · novelty 6.0

DecomposeR represents research plans as typed DAGs and uses two-stage planner-then-answerer RL to improve long-form research performance by 5.1-8.0 points over baselines.

MEMENTO: Leveraging Web as a Learning Signal for Low-Data Domains

cs.AI · 2026-05-28 · unverdicted · novelty 6.0

MEMENTO framework uses adaptive web exploration via AET and dual-channel memory to acquire domain expertise from interaction trajectories, yielding +25.6% and +36.5% gains over ReAct baselines in sales automation and legal research.

citing papers explorer

Showing 32 of 32 citing papers after filters.

Plan Before Search: Search Agents Need Plan
cs.AI · 2026-05-27 · unverdicted · none · ref 3 · internal anchor
A self-bootstrapping paradigm uses trajectories from a small seed model to activate pre-planned sub-question decomposition in target models, enabling consistent outperformance on multi-hop QA without external distillation.

Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents
cs.AI · 2026-05-22 · unverdicted · none · ref 9 · internal anchor
Co-ReAct adds step-level rubric guidance to ReAct agents via a GRPO-trained generator using list-wise ranking rewards, yielding consistent gains on DeepResearchBench and SQA-CS-V2.

SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain
cs.AI · 2026-05-18 · unverdicted · none · ref 62 · 2 links · internal anchor
SVFSearch is the first open benchmark for short-video frame search in the Chinese gaming domain, providing a frozen retrieval environment and showing performance gaps of 13-29 points between direct QA models, practical agents, and oracle knowledge.

Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation
cs.AI · 2026-05-13 · unverdicted · none · ref 36 · internal anchor
PyRAG turns multi-hop reasoning into executable Python code over retrieval tools for explicit, verifiable step-by-step RAG.

CuSearch: Curriculum Rollout Sampling via Search Depth for Agentic RAG
cs.AI · 2026-05-12 · unverdicted · none · ref 14 · 2 links · internal anchor
CuSearch reallocates rollout budget in RLVR toward deeper-search trajectories as a proxy for retrieval supervision density, yielding up to 11.8 exact-match gains over uniform GRPO sampling on ZeroSearch.

IG-Search: Step-Level Information Gain Rewards for Search-Augmented Reasoning
cs.AI · 2026-04-16 · unverdicted · none · ref 30 · internal anchor
IG-Search computes step-level information gain rewards from policy probabilities to improve credit assignment in RL training for search-augmented QA, yielding 1.6-point gains over trajectory-level baselines on multi-hop tasks.

ARCO: Adaptive Rubric with Co-Evolution for Multi-Step LLM-Based Agents
cs.AI · 2026-06-19 · unverdicted · none · ref 13 · internal anchor
ARCO introduces a co-evolving rubric model with generation and scoring heads plus a trajectory decomposition constraint that improves exact-match scores on multi-hop QA tasks over outcome, rubric, and process reward baselines.

Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses
cs.AI · 2026-06-01 · unverdicted · none · ref 101 · internal anchor
Harness-1 uses a state-externalizing harness for RL-trained search agents and reports 0.730 average curated recall, outperforming the next open subagent by 11.4 points.

Planner-Centric Reinforcement Learning for Deep Research with Structure-Aware Reward
cs.AI · 2026-05-29 · unverdicted · none · ref 24 · internal anchor
DecomposeR represents research plans as typed DAGs and uses two-stage planner-then-answerer RL to improve long-form research performance by 5.1-8.0 points over baselines.

MEMENTO: Leveraging Web as a Learning Signal for Low-Data Domains
cs.AI · 2026-05-28 · unverdicted · none · ref 13 · internal anchor
MEMENTO framework uses adaptive web exploration via AET and dual-channel memory to acquire domain expertise from interaction trajectories, yielding +25.6% and +36.5% gains over ReAct baselines in sales automation and legal research.

Beyond Trajectory Rewards: Step-level Credit Assignment for Agentic Search via Graph Modeling
cs.AI · 2026-05-28 · unverdicted · none · ref 23 · internal anchor
GDCR assigns step-level rewards via distance to the answer node in a training-time ER graph and SAPO combines these with trajectory advantages for credit assignment in agentic search.

Test-Time Deep Thinking to Explore Implicit Rules
cs.AI · 2026-05-24 · unverdicted · none · ref 27 · internal anchor
TTExplore trains a 7B thinker via task-score RL to infer implicit rules at test time, raising agent success by 14-19 points on five embodied tasks.

EVE-Agent: Evidence-Verifiable Self-Evolving Agents
cs.AI · 2026-05-21 · unverdicted · none · ref 10 · internal anchor
EVE-Agent adds an evidence verifier to the proposer-solver loop that rewards spans by marginal accuracy gain, producing self-generated but inspectable training examples for search agents.

Search-E1: Self-Distillation Drives Self-Evolution in Search-Augmented Reasoning
cs.AI · 2026-05-21 · unverdicted · none · ref 15 · 2 links · internal anchor
Search-E1 uses GRPO interleaved with on-policy self-distillation to reach 0.440 average EM on seven QA benchmarks with Qwen2.5-3B, outperforming open-source baselines.

SearchSkill: Teaching LLMs to Use Search Tools with Evolving Skill Banks
cs.AI · 2026-05-09 · unverdicted · none · ref 27 · 3 links · internal anchor
SearchSkill improves LLM query planning on knowledge QA by using explicit skill selection from an evolving SkillBank and a two-stage SFT process that aligns training with inference-time skill-grounded execution.

Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
cs.AI · 2026-04-20 · unverdicted · none · ref 87 · 2 links · internal anchor
Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.

AutoSearch: Adaptive Search Depth for Efficient Agentic RAG via Reinforcement Learning
cs.AI · 2026-04-19 · unverdicted · none · ref 18 · internal anchor
AutoSearch applies RL with a self-answering reward to adaptively determine minimal sufficient search depth in agentic RAG, reducing over-searching while maintaining answer quality on complex questions.

CharTool: Tool-Integrated Visual Reasoning for Chart Understanding
cs.AI · 2026-04-03 · unverdicted · none · ref 48 · internal anchor
CharTool equips MLLMs with cropping and code tools plus agentic RL on DuoChart data to raise chart-reasoning accuracy by up to 9.78 percent on benchmarks.

The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
cs.AI · 2025-09-02 · accept · none · ref 278 · internal anchor
Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.

DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments
cs.AI · 2025-04-04 · conditional · none · ref 3 · internal anchor
End-to-end RL in authentic web environments produces LLM research agents that outperform prompt-engineering and RAG-based baselines by up to 28.9 and 7.2 points respectively while exhibiting emergent planning, cross-validation, and self-reflection.

ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning
cs.AI · 2025-03-25 · unverdicted · none · ref 29 · internal anchor
ReSearch trains LLMs via RL to integrate search operations into reasoning steps, achieving strong generalization across benchmarks and eliciting reflection and self-correction without supervised reasoning data.

DuMate-DeepResearch: An Auditable Multi-Agent System with Recursive Search and Rubric-Grounded Reasoning
cs.AI · 2026-06-05 · unverdicted · none · ref 17 · internal anchor
DuMate-DeepResearch introduces a multi-agent deep research system with graph-based planning, recursive execution, and rubric optimization that reports new state-of-the-art scores of 58.03% and 61.95% on two benchmarks.

C-MIG: Multi-view Information Gain-based Retrieval-Augmented Generation for Clinical Diagnosis Reasoning
cs.AI · 2026-05-27 · unverdicted · none · ref 14 · internal anchor
C-MIG uses multi-view information gain from retrieved documents and refinements to supervise RAG-RL for clinical diagnosis, claiming top performance on four medical benchmarks.

Learning to Act under Noise: Enhancing Agent Robustness via Noisy Environments
cs.AI · 2026-05-26 · unverdicted · none · ref 53 · internal anchor
NoisyAgent trains LLM agents with controlled user and tool noise to improve robustness in stochastic environments while also boosting clean-benchmark performance.

Scaling Retrieval-Augmented Reasoning with Parallel Search and Explicit Merging
cs.AI · 2026-05-13 · unverdicted · none · ref 12 · internal anchor
MultiSearch uses parallel multi-query retrieval plus explicit merging inside a reinforcement-learning loop to improve retrieval-augmented reasoning, outperforming baselines on seven QA benchmarks.

E3-TIR: Enhanced Experience Exploitation for Tool-Integrated Reasoning
cs.AI · 2026-04-10 · unverdicted · none · ref 25 · internal anchor
E3-TIR integrates expert prefixes, guided branches, and self-exploration via mix policy optimization to deliver 6% better tool-use performance with under 10% of the usual synthetic data and 1.46x ROI.

MemOCR: Layout-Aware Visual Memory for Efficient Long-Horizon Reasoning
cs.AI · 2026-01-29 · unverdicted · none · ref 23 · internal anchor
MemOCR renders structured memory as images with adaptive visual density to improve long-horizon reasoning under tight context budgets.

Cognitive Kernel-Pro: A Framework for Deep Research Agents and Agent Foundation Models Training
cs.AI · 2025-08-01 · unverdicted · none · ref 11 · internal anchor
Cognitive Kernel-Pro provides an open-source agent framework with curated training data across web, file, code, and reasoning domains plus test-time reflection and voting, achieving SOTA results on GAIA among free agents.

Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning
cs.AI · 2025-04-28 · unverdicted · none · ref 3 · internal anchor
ARTIST couples agentic reasoning with outcome-based reinforcement learning to let LLMs autonomously invoke tools in multi-turn chains, reporting up to 22% gains on math and function-calling benchmarks.

EigentSearch-Q+: Enhancing Deep Research Agents with Structured Reasoning Tools
cs.AI · 2026-04-09 · unverdicted · none · ref 24 · internal anchor
Structured query and evidence tools added to an AI research agent improve benchmark accuracy by 0.6 to 3.8 percentage points.

Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems
cs.AI · 2025-03-31 · unverdicted · none · ref 74 · internal anchor
This survey frames foundation agents using brain-inspired modular architectures and reviews challenges in evolution, collaboration, and safety.

OASES: Outcome-Aligned Search-Evaluation Co-Training for Agentic Search
cs.AI · 2026-04-04 · unreviewed · ref 25 · internal anchor
