R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning

Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao · 2025 · cs.AI · arXiv 2503.05592

Canonical reference. 73% of citing Pith papers cite this work as background.

68 Pith papers citing it

abstract

Existing Large Reasoning Models (LRMs) have shown the potential of reinforcement learning (RL) to enhance the complex reasoning capabilities of Large Language Models (LLMs). While they achieve remarkable performance on challenging tasks such as mathematics and coding, they often rely on their internal knowledge to solve problems, which can be inadequate for time-sensitive or knowledge-intensive questions, leading to inaccuracies and hallucinations. To address this, we propose R1-Searcher, a novel two-stage outcome-based RL approach designed to enhance the search capabilities of LLMs. This method allows LLMs to autonomously invoke external search systems to access additional knowledge during the reasoning process. Our framework relies exclusively on RL, without requiring process rewards or distillation for a cold start. Our experiments demonstrate that our method significantly outperforms previous strong RAG methods, even when compared to the closed-source GPT-4o-mini.
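
The abstract's core mechanism, a model that calls an external search system in the middle of its reasoning and is rewarded only on the final outcome, can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the `<search>`, `<documents>`, and `<answer>` tags, the `generate` and `retrieve` callables, the toy stubs, and the exact-match reward, which stands in for whatever outcome reward the authors actually use.

```python
import re
from typing import Callable, List

# Hypothetical tag formats; the paper's real special tokens may differ.
QUERY_RE = re.compile(r"<search>(.*?)</search>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def run_with_search(generate: Callable[[str], str],
                    retrieve: Callable[[str], List[str]],
                    question: str,
                    max_searches: int = 3) -> str:
    """Interleave generation and retrieval until a step contains no search query."""
    context = question
    for _ in range(max_searches + 1):
        step = generate(context)          # model continues from the current context
        context += "\n" + step
        match = QUERY_RE.search(step)
        if match is None:                 # no query emitted: reasoning is finished
            return context
        docs = retrieve(match.group(1).strip())
        # Feed retrieved passages back so the next step can condition on them.
        context += "\n<documents>" + " | ".join(docs) + "</documents>"
    return context                        # search budget exhausted


def outcome_reward(transcript: str, gold: str) -> float:
    """Assumed outcome-only reward: 1.0 if the final answer matches gold, else 0.0."""
    m = ANSWER_RE.search(transcript)
    if m is None:
        return 0.0
    return 1.0 if m.group(1).strip().lower() == gold.strip().lower() else 0.0


# Toy stand-ins for a policy model and a retriever (purely illustrative).
def toy_generate(context: str) -> str:
    if "<documents>" not in context:
        return "I need to look this up. <search>capital of France</search>"
    return "The documents say Paris. <answer>Paris</answer>"


def toy_retrieve(query: str) -> List[str]:
    return ["Paris is the capital of France."]


transcript = run_with_search(toy_generate, toy_retrieve,
                             "What is the capital of France?")
reward = outcome_reward(transcript, "Paris")
```

In an RL setup of the kind the abstract describes, a scalar like `reward` would be the only training signal for the whole trajectory; no per-step (process) reward is computed in this sketch.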

citation-role summary

background 11 · baseline 1 · dataset 1 · method 1 · other 1

citation-polarity summary

background 11 · baseline 1 · unclear 1 · use dataset 1 · use method 1

representative citing papers

Plan Before Search: Search Agents Need Plan

cs.AI · 2026-05-27 · unverdicted · novelty 7.0

A self-bootstrapping paradigm uses trajectories from a small seed model to activate pre-planned sub-question decomposition in target models, enabling consistent outperformance on multi-hop QA without external distillation.

Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents

cs.AI · 2026-05-22 · unverdicted · novelty 7.0

Co-ReAct adds step-level rubric guidance to ReAct agents via a GRPO-trained generator using list-wise ranking rewards, yielding consistent gains on DeepResearchBench and SQA-CS-V2.

SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain

cs.AI · 2026-05-18 · unverdicted · novelty 7.0 · 2 refs

SVFSearch is the first open benchmark for short-video frame search in the Chinese gaming domain, providing a frozen retrieval environment and showing performance gaps of 13-29 points between direct QA models, practical agents, and oracle knowledge.

Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation

cs.AI · 2026-05-13 · unverdicted · novelty 7.0

PyRAG turns multi-hop reasoning into executable Python code over retrieval tools for explicit, verifiable step-by-step RAG.

CuSearch: Curriculum Rollout Sampling via Search Depth for Agentic RAG

cs.AI · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

CuSearch reallocates rollout budget in RLVR toward deeper-search trajectories as a proxy for retrieval supervision density, yielding up to 11.8 exact-match gains over uniform GRPO sampling on ZeroSearch.

LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

LatentRAG performs agentic RAG by generating latent tokens for thoughts and subqueries in one forward pass, matching explicit methods' accuracy on seven benchmarks while reducing latency by ~90%.

IG-Search: Step-Level Information Gain Rewards for Search-Augmented Reasoning

cs.AI · 2026-04-16 · unverdicted · novelty 7.0

IG-Search computes step-level information gain rewards from policy probabilities to improve credit assignment in RL training for search-augmented QA, yielding 1.6-point gains over trajectory-level baselines on multi-hop tasks.

ActFER: Agentic Facial Expression Recognition via Active Tool-Augmented Visual Reasoning

cs.CV · 2026-04-10 · unverdicted · novelty 7.0

ActFER reformulates facial expression recognition as active tool-augmented visual reasoning with a custom reinforcement learning algorithm UC-GRPO that outperforms passive MLLM baselines on AU prediction.

GeoBrowse: A Geolocation Benchmark for Agentic Tool Use with Expert-Annotated Reasoning Traces

cs.CL · 2026-04-05 · unverdicted · novelty 7.0

GeoBrowse is a two-level geolocation benchmark combining visual cue composition with knowledge-intensive multi-hop queries, paired with the GATE agent workflow that outperforms no-tool, search-only, and image-only baselines.

MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning

cs.CL · 2025-11-04 · unverdicted · novelty 7.0

MemSearcher trains LLMs to manage compact memory in multi-turn searches via multi-context GRPO for end-to-end RL, outperforming ReAct-style baselines with stable token counts.

To Reason or to Fabricate: Reasoning Without Shortcuts via Hint-Anchored Pairwise Aggregation

cs.CL · 2026-06-28 · unverdicted · novelty 6.0

HIPPO is a new RL framework that uses hint-anchored pairwise aggregation to distinguish and promote authentic reasoning deduction in LLMs instead of shortcut memorization from data overlap.

Toward Generalist Autonomous Research via Hypothesis-Tree Refinement

cs.CL · 2026-06-10 · unverdicted · novelty 6.0

Arbor combines a coordinator, executors, and a hypothesis tree to enable cumulative autonomous research, outperforming Codex and Claude Code by over 2.5x on six real tasks and reaching 86.36% Any Medal on MLE-Bench Lite.

SPADER: Step-wise Peer Advantage with Diversity-Aware Exploration Rewards for Multi-Answer Question Answering

cs.CL · 2026-05-30 · unverdicted · novelty 6.0

SPADER proposes step-wise peer advantage and diversity-aware exploration rewards in RL for multi-answer QA, reporting improved recall and F1 on QAMPARI, Mintaka, WebQSP, and QUEST.

Planner-Centric Reinforcement Learning for Deep Research with Structure-Aware Reward

cs.AI · 2026-05-29 · unverdicted · novelty 6.0

DecomposeR represents research plans as typed DAGs and uses two-stage planner-then-answerer RL to improve long-form research performance by 5.1-8.0 points over baselines.

MEMENTO: Leveraging Web as a Learning Signal for Low-Data Domains

cs.AI · 2026-05-28 · unverdicted · novelty 6.0

MEMENTO framework uses adaptive web exploration via AET and dual-channel memory to acquire domain expertise from interaction trajectories, yielding +25.6% and +36.5% gains over ReAct baselines in sales automation and legal research.

Beyond Trajectory Rewards: Step-level Credit Assignment for Agentic Search via Graph Modeling

cs.AI · 2026-05-28 · unverdicted · novelty 6.0

GDCR assigns step-level rewards via distance to the answer node in a training-time ER graph and SAPO combines these with trajectory advantages for credit assignment in agentic search.

Test-Time Deep Thinking to Explore Implicit Rules

cs.AI · 2026-05-24 · unverdicted · novelty 6.0

TTExplore trains a 7B thinker via task-score RL to infer implicit rules at test time, raising agent success by 14-19 points on five embodied tasks.

EVE-Agent: Evidence-Verifiable Self-Evolving Agents

cs.AI · 2026-05-21 · unverdicted · novelty 6.0

EVE-Agent adds an evidence verifier to the proposer-solver loop that rewards spans by marginal accuracy gain, producing self-generated but inspectable training examples for search agents.

Search-E1: Self-Distillation Drives Self-Evolution in Search-Augmented Reasoning

cs.AI · 2026-05-21 · unverdicted · novelty 6.0 · 2 refs

Search-E1 uses GRPO interleaved with on-policy self-distillation to reach 0.440 average EM on seven QA benchmarks with Qwen2.5-3B, outperforming open-source baselines.

PruneTIR: Inference-Time Tool Call Pruning for Effective yet Efficient Tool-Integrated Reasoning

cs.CL · 2026-05-11 · unverdicted · novelty 6.0

PruneTIR prunes erroneous tool-call trajectories during LLM inference via three trigger-based components to raise Pass@1 accuracy and efficiency while shortening context.

AIPO: Learning to Reason from Active Interaction

cs.CL · 2026-05-08 · unverdicted · novelty 6.0 · 2 refs

AIPO adds active multi-agent consultation (Verify, Knowledge, Reasoning agents) plus custom importance sampling to RLVR training so LLMs expand their reasoning boundary and then operate without the agents.

SOD: Step-wise On-policy Distillation for Small Language Model Agents

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.

GRPO-VPS: Enhancing Group Relative Policy Optimization with Verifiable Process Supervision for Effective Reasoning

cs.LG · 2026-04-22 · unverdicted · novelty 6.0

GRPO-VPS improves GRPO by using segment-wise conditional probabilities of the correct answer to supply process-level feedback, yielding up to 2.6-point accuracy gains and 13.7% shorter reasoning on math tasks.

DR-MMSearchAgent: Deepening Reasoning in Multimodal Search Agents

cs.CV · 2026-04-21 · unverdicted · novelty 6.0

DR-MMSearchAgent derives batch-wide trajectory advantages and uses differentiated Gaussian rewards to prevent premature collapse in multimodal agents, outperforming MMSearch-R1 by 8.4% on FVQA-test.
