super hub Canonical reference

Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

Bowen Jin, Dong Wang, Hansi Zeng, Jinsung Yoon, Sercan Arik, Zhenrui Yue · 2025 · cs.CL · arXiv 2503.09516

Canonical reference. 78% of citing Pith papers cite this work as background.

203 Pith papers citing it

Background 78% of classified citations

open full Pith review browse 203 citing papers more from Bowen Jin arXiv PDF

abstract

Efficiently acquiring external knowledge and up-to-date information is essential for effective reasoning and text generation in large language models (LLMs). Prompting advanced LLMs with reasoning capabilities to use search engines during inference is often suboptimal, as the LLM might not fully possess the capability on how to interact optimally with the search engine. This paper introduces Search-R1, an extension of reinforcement learning (RL) for reasoning frameworks where the LLM learns to autonomously generate (multiple) search queries during step-by-step reasoning with real-time retrieval. Search-R1 optimizes LLM reasoning trajectories with multi-turn search interactions, leveraging retrieved token masking for stable RL training and a simple outcome-based reward function. Experiments on seven question-answering datasets show that Search-R1 improves performance by 41% (Qwen2.5-7B) and 20% (Qwen2.5-3B) over various RAG baselines under the same setting. This paper further provides empirical insights into RL optimization methods, LLM choices, and response length dynamics in retrieval-augmented reasoning. The code and model checkpoints are available at https://github.com/PeterGriffinJin/Search-R1.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 43 method 3 baseline 2 dataset 1 extension 1 other 1

citation-polarity summary

background 40 unclear 4 use method 3 baseline 2 extend 1 use dataset 1

claims ledger

abstract Efficiently acquiring external knowledge and up-to-date information is essential for effective reasoning and text generation in large language models (LLMs). Prompting advanced LLMs with reasoning capabilities to use search engines during inference is often suboptimal, as the LLM might not fully possess the capability on how to interact optimally with the search engine. This paper introduces Search-R1, an extension of reinforcement learning (RL) for reasoning frameworks where the LLM learns to autonomously generate (multiple) search queries during step-by-step reasoning with real-time retrieva

authors

Bowen Jin Dong Wang Hansi Zeng Jinsung Yoon Sercan Arik Zhenrui Yue

co-cited works

representative citing papers

Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio

cs.CL · 2026-06-15 · unverdicted · novelty 8.0 · 3 refs

MetaSyn benchmark shows LLM pipelines recover at most 52.7% of ground-truth included studies due to screening failures on PI/ECO eligibility, despite 90.9% retrieval recall at K=200.

AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery

cs.AI · 2026-04-28 · accept · novelty 8.0

AutoResearchBench is a new benchmark showing top AI agents achieve under 10% success on complex scientific literature discovery tasks that demand deep comprehension and open-ended search.

Evidence-State Rewards for Long-Context Reasoning

cs.AI · 2026-07-02 · unverdicted · novelty 7.0

Maven is an RL method using answer-conditioned evidence-state values to assign rewards to add, link, and drop actions on evidence memory, outperforming outcome-only baselines on LongBench v2, LongReason, and RULER.

CRAFT: Counterfactual Credit Assignment from Free Sibling Rollouts for Self-Distilled Agentic Reinforcement Learning

cs.LG · 2026-06-28 · unverdicted · novelty 7.0

CRAFT is a three-pillar credit assignment scheme that uses counterfactual token importance from GRPO sibling rollouts to provide signed per-token distillation signals in self-distilled agentic RL.

SAGE-OPD: Selective Agent-Guided Intervention for Multi-Turn On-Policy Distillation

cs.CL · 2026-06-17 · unverdicted · novelty 7.0

SAGE-OPD improves multi-turn OPD via turn-level selective intervention, teacher-confidence weighting, and loss normalization, reporting up to 13.3% relative gain in ALFWorld unseen success rate over standard OPD.

GraphPO: Graph-based Policy Optimization for Reasoning Models

cs.CL · 2026-06-17 · unverdicted · novelty 7.0

GraphPO represents reasoning rollouts as a DAG to merge semantically equivalent paths, share suffixes, and assign separate efficiency and correctness advantages for lower variance and better performance than chain or tree baselines.

ReSum: Synergizing LLM Reasoning and Summarization with Reinforcement Learning

cs.AI · 2026-06-11 · unverdicted · novelty 7.0

ReSum trains LLMs via RLVR to self-summarize reasoning trajectories, yielding 4% average performance gains and 18.6% shorter rollouts through contrastive rollout branches.

FORT-Searcher: Synthesizing Shortcut-Resistant Search Tasks for Training Deep Search Agents

cs.CL · 2026-06-10 · unverdicted · novelty 7.0

FORT synthesizes shortcut-resistant search tasks by controlling four identified shortcut risks across entity selection, graph construction, question formulation, and refinement, producing training data that yields agents with longer search trajectories and top performance among open-source models on

Co-Evolving Skill Generation and Policy Optimization

cs.CL · 2026-06-07 · unverdicted · novelty 7.0

Framework estimates context-dependent marginal utility of candidate skills via reward gaps in matched base vs. skill-augmented rollouts to filter skills and co-train policy as generator.

PIXELRAG: Web Screenshots Beat Text for Retrieval-Augmented Generation

cs.IR · 2026-06-01 · unverdicted · novelty 7.0

PixelRAG shows that operating RAG entirely over web screenshots outperforms text-based retrieval on NQ, SimpleQA, MMSearch, LiveVQA, and MoNaCo, with up to 18.1% accuracy gains and 3x token savings via image compression.

RWGBench: Evaluating Scholarly Positioning in Related Work Generation

cs.DL · 2026-05-30 · unverdicted · novelty 7.0

RWGBench is a citation-centric benchmark for related work generation built from 40k CS papers and a 100-paper test set, with multi-dimensional metrics that better match human expert judgment than standard similarity scores.

Thinking as Compression: Your Reasoning Model is Secretly a Context Compressor

cs.AI · 2026-05-27 · unverdicted · novelty 7.0

Reasoning models naturally compress context via thinking traces, with reward-constrained optimization yielding 17-23% gains over baselines on long-context QA at high compression ratios.

Plan Before Search: Search Agents Need Plan

cs.AI · 2026-05-27 · unverdicted · novelty 7.0

A self-bootstrapping paradigm uses trajectories from a small seed model to activate pre-planned sub-question decomposition in target models, enabling consistent outperformance on multi-hop QA without external distillation.

An Interactive Paradigm for Deep Research

cs.CL · 2026-05-22 · unverdicted · novelty 7.0

SteER introduces an interactive framework for deep research with LLMs that uses cost-benefit analysis for user control at decision points and shows improved alignment over baselines.

Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents

cs.AI · 2026-05-22 · unverdicted · novelty 7.0

Co-ReAct adds step-level rubric guidance to ReAct agents via a GRPO-trained generator using list-wise ranking rewards, yielding consistent gains on DeepResearchBench and SQA-CS-V2.

SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain

cs.AI · 2026-05-18 · unverdicted · novelty 7.0 · 2 refs

SVFSearch is the first open benchmark for short-video frame search in the Chinese gaming domain, providing a frozen retrieval environment and showing performance gaps of 13-29 points between direct QA models, practical agents, and oracle knowledge.

AstraFlow: Dataflow-Oriented Reinforcement Learning for Agentic LLMs

cs.LG · 2026-05-15 · unverdicted · novelty 7.0

AstraFlow decouples RL components into autonomous dataflow services to natively support multi-policy agentic LLM training, elastic scaling, and cross-region execution with 2.7x speedup on math, code, search, and AgentBench workloads.

Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation

cs.AI · 2026-05-13 · unverdicted · novelty 7.0

PyRAG turns multi-hop reasoning into executable Python code over retrieval tools for explicit, verifiable step-by-step RAG.

Learning Agentic Policy from Action Guidance

cs.CL · 2026-05-12 · unverdicted · novelty 7.0

ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.

CuSearch: Curriculum Rollout Sampling via Search Depth for Agentic RAG

cs.AI · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

CuSearch reallocates rollout budget in RLVR toward deeper-search trajectories as a proxy for retrieval supervision density, yielding up to 11.8 exact-match gains over uniform GRPO sampling on ZeroSearch.

Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

Evolving-RL jointly optimizes experience extraction and utilization in LLM agents via RL with separate evaluation signals, delivering up to 98.7% relative gains on out-of-distribution tasks in ALFWorld and Mind2Web.

DeepRefine: Agent-Compiled Knowledge Refinement via Reinforcement Learning

cs.CL · 2026-05-11 · unverdicted · novelty 7.0

DeepRefine refines agent-compiled knowledge bases via multi-turn abductive diagnosis and RL training with a GBD reward, yielding consistent downstream task gains.

LLM Agents Already Know When to Call Tools -- Even Without Reasoning

cs.CL · 2026-05-10 · accept · novelty 7.0 · 2 refs

LLM agents encode tool necessity in pre-generation hidden states with high linear decodability (AUROC 0.89-0.96); Probe&Prefill uses this to reduce tool calls 48% with 1.7% accuracy loss.

AHD Agent: Agentic Reinforcement Learning for Automatic Heuristic Design

cs.AI · 2026-05-09 · unverdicted · novelty 7.0

AHD Agent trains a 4B-parameter LLM via agentic RL to actively use tools for automatic heuristic design, matching or exceeding larger baselines across eight domains with fewer evaluations.

citing papers explorer

Showing 19 of 19 citing papers after filters.

AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery cs.AI · 2026-04-28 · accept · none · ref 22 · internal anchor
AutoResearchBench is a new benchmark showing top AI agents achieve under 10% success on complex scientific literature discovery tasks that demand deep comprehension and open-ended search.
CuSearch: Curriculum Rollout Sampling via Search Depth for Agentic RAG cs.AI · 2026-05-12 · unverdicted · none · ref 4 · 2 links · internal anchor
CuSearch reallocates rollout budget in RLVR toward deeper-search trajectories as a proxy for retrieval supervision density, yielding up to 11.8 exact-match gains over uniform GRPO sampling on ZeroSearch.
Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents cs.AI · 2026-05-11 · unverdicted · none · ref 10 · internal anchor
Evolving-RL jointly optimizes experience extraction and utilization in LLM agents via RL with separate evaluation signals, delivering up to 98.7% relative gains on out-of-distribution tasks in ALFWorld and Mind2Web.
AHD Agent: Agentic Reinforcement Learning for Automatic Heuristic Design cs.AI · 2026-05-09 · unverdicted · none · ref 40 · internal anchor
AHD Agent trains a 4B-parameter LLM via agentic RL to actively use tools for automatic heuristic design, matching or exceeding larger baselines across eight domains with fewer evaluations.
Inference-Time Budget Control for LLM Search Agents cs.AI · 2026-05-07 · unverdicted · none · ref 16 · internal anchor
A VOI-based controller for dual inference budgets improves multi-hop QA performance by prioritizing search actions and selectively finalizing answers.
Efficient Agentic Reasoning Through Self-Regulated Simulative Planning cs.AI · 2026-05-21 · unverdicted · none · ref 41 · internal anchor
SR²AM achieves competitive Pass@1 accuracy on diverse tasks with 25.8-95.3% fewer reasoning tokens than much larger models by using self-regulated simulative planning trained via supervised learning and RL.
ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox cs.AI · 2026-05-11 · unverdicted · none · ref 3 · 2 links · internal anchor
ComplexMCP benchmark shows top LLM agents achieve under 60% success on dynamic interdependent tool tasks versus 90% for humans, due to tool retrieval saturation, over-confidence, and strategic defeatism.
PiCA: Pivot-Based Credit Assignment for Search Agentic Reinforcement Learning cs.AI · 2026-05-10 · unverdicted · none · ref 11 · 2 links · internal anchor
PiCA uses pivot-based potential rewards derived from historical sub-queries to supply trajectory-aware step guidance in agentic RL, delivering 15% gains on QA benchmarks for 3B/7B models.
SearchSkill: Teaching LLMs to Use Search Tools with Evolving Skill Banks cs.AI · 2026-05-09 · unverdicted · none · ref 11 · 3 links · internal anchor
SearchSkill improves LLM query planning on knowledge QA by using explicit skill selection from an evolving SkillBank and a two-stage SFT process that aligns training with inference-time skill-grounded execution.
SkillMaster: Toward Autonomous Skill Mastery in LLM Agents cs.AI · 2026-05-09 · unverdicted · none · ref 3 · 2 links · internal anchor
SkillMaster enables LLM agents to autonomously develop skills via trajectory review, counterfactual evaluation, and DualAdv-GRPO training, boosting success rates by 8.8% on ALFWorld and 9.3% on WebShop.
Iterative Critique-and-Routing Controller for Multi-Agent Systems with Heterogeneous LLMs cs.AI · 2026-05-09 · unverdicted · none · ref 12 · internal anchor
A critique-and-routing controller cast as a finite-horizon MDP with policy-gradient optimization outperforms one-shot routing baselines on reasoning benchmarks while using the strongest agent for under 25% of calls.
The Landscape of Agentic Reinforcement Learning for LLMs: A Survey cs.AI · 2025-09-02 · accept · none · ref 276 · internal anchor
Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.
Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence cs.AI · 2026-04-20 · unverdicted · none · ref 42
Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.
Web2BigTable: A Bi-Level Multi-Agent LLM System for Internet-Scale Information Search and Extraction cs.AI · 2026-04-29 · unverdicted · none · ref 10 · internal anchor
Web2BigTable introduces a bi-level multi-agent system that achieves new state-of-the-art results on wide-coverage and deep web-to-table search benchmarks through orchestration, coordination, and closed-loop reflection.
Mind DeepResearch Technical Report cs.AI · 2026-04-16 · unverdicted · none · ref 14
MindDR combines a Planning Agent, DeepSearch Agent, and Report Agent with SFT cold-start, Search-RL, Report-RL, and preference alignment to reach competitive scores on research benchmarks using 30B-scale models.
E3-TIR: Enhanced Experience Exploitation for Tool-Integrated Reasoning cs.AI · 2026-04-10 · unverdicted · none · ref 17
E3-TIR integrates expert prefixes, guided branches, and self-exploration via mix policy optimization to deliver 6% better tool-use performance with under 10% of the usual synthetic data and 1.46x ROI.
Agentic Reasoning for Large Language Models cs.AI · 2026-01-18 · unverdicted · none · ref 27 · internal anchor
The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applications across domains.
Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems cs.AI · 2025-03-31 · unverdicted · none · ref 84 · internal anchor
This survey frames foundation agents using brain-inspired modular architectures and reviews challenges in evolution, collaboration, and safety.
OASES: Outcome-Aligned Search-Evaluation Co-Training for Agentic Search cs.AI · 2026-04-04 · unreviewed · ref 10 · internal anchor

Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer