hub

DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments

Yuxiang Zheng, Dayuan Fu, Xiangkun Hu, Xiaojie Cai, Lyumanshan Ye, Pengrui Lu · 2025 · cs.AI · arXiv 2504.03160

28 Pith papers cite this work. Polarity classification is still indexing.

28 Pith papers citing it

open full Pith review browse 28 citing papers arXiv PDF

abstract

Large Language Models (LLMs) equipped with web search capabilities have demonstrated impressive potential for deep research tasks. However, current approaches predominantly rely on either manually engineered prompts (prompt engineering-based) with brittle performance or reinforcement learning within controlled Retrieval-Augmented Generation (RAG) environments (RAG-based) that fail to capture the complexities of real-world interaction. In this paper, we introduce DeepResearcher, the first comprehensive framework for end-to-end training of LLM-based deep research agents through scaling reinforcement learning (RL) in real-world environments with authentic web search interactions. Unlike RAG-based approaches that assume all necessary information exists within a fixed corpus, our method trains agents to navigate the noisy, unstructured, and dynamic nature of the open web. We implement a specialized multi-agent architecture where browsing agents extract relevant information from various webpage structures and overcoming significant technical challenges. Extensive experiments on open-domain research tasks demonstrate that DeepResearcher achieves substantial improvements of up to 28.9 points over prompt engineering-based baselines and up to 7.2 points over RAG-based RL agents. Our qualitative analysis reveals emergent cognitive behaviors from end-to-end RL training, including the ability to formulate plans, cross-validate information from multiple sources, engage in self-reflection to redirect research, and maintain honesty when unable to find definitive answers. Our results highlight that end-to-end training in real-world web environments is not merely an implementation detail but a fundamental requirement for developing robust research capabilities aligned with real-world applications. We release DeepResearcher at https://github.com/GAIR-NLP/DeepResearcher.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 3 method 1

citation-polarity summary

background 3 use method 1

representative citing papers

Towards Autonomous Business Intelligence via Data-to-Insight Discovery Agent

cs.AI · 2026-05-08 · unverdicted · novelty 7.0 · 2 refs

AIDA is the first end-to-end autonomous agent that combines a domain-specific language with Pareto-guided reinforcement learning to discover insights from complex business data.

Watching, Reasoning, and Searching: A Video Deep Research Benchmark on Open Web for Agentic Video Reasoning

cs.CV · 2026-01-11 · unverdicted · novelty 7.0

VideoDR is a new benchmark for open-web video deep research that tests multimodal models on cross-frame visual anchor extraction, interactive retrieval, and multi-hop reasoning over joint video-web evidence.

MMSearch-R1: Incentivizing LMMs to Search

cs.CV · 2025-06-25 · unverdicted · novelty 7.0

MMSearch-R1 uses reinforcement learning to train multimodal models for on-demand multi-turn internet search with image and text tools, outperforming same-size RAG baselines and matching larger ones while cutting search calls by over 30%.

MixRea: Benchmarking Explicit-Implicit Reasoning in Large Language Models

cs.CL · 2026-05-19 · unverdicted · novelty 6.0

MixRea benchmark reveals LLMs achieve at most 42.8% consistency on explicit-implicit reasoning tasks, with PRCP prompting proposed to recover overlooked relations.

DR-MMSearchAgent: Deepening Reasoning in Multimodal Search Agents

cs.CV · 2026-04-21 · unverdicted · novelty 6.0

DR-MMSearchAgent derives batch-wide trajectory advantages and uses differentiated Gaussian rewards to prevent premature collapse in multimodal agents, outperforming MMSearch-R1 by 8.4% on FVQA-test.

CogGen: A Cognitively Inspired Recursive Framework for Deep Research Report Generation

cs.MA · 2026-04-18 · unverdicted · novelty 6.0

CogGen uses a cognitively inspired recursive architecture with AVR for multimodal content to generate deep research reports that achieve SOTA among open-source systems and surpass Gemini Deep Research on a new OWID benchmark.

Towards Knowledgeable Deep Research: Framework and Benchmark

cs.AI · 2026-04-09 · unverdicted · novelty 6.0

The paper introduces the KDR task, HKA multi-agent framework, and KDR-Bench to enable LLM agents to integrate structured knowledge into deep research reports, with experiments showing outperformance over prior agents.

MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling

cs.CL · 2025-11-14 · unverdicted · novelty 6.0

MiroThinker shows that scaling agent-environment interactions via reinforcement learning lets a 72B open-source model reach up to 81.9% on GAIA and approach commercial performance on research benchmarks.

DeepEyesV2: Toward Agentic Multimodal Model

cs.CV · 2025-11-07 · unverdicted · novelty 6.0

DeepEyesV2 uses a two-stage cold-start plus reinforcement learning pipeline to produce an agentic multimodal model that adaptively invokes tools and outperforms direct RL on real-world reasoning benchmarks.

RLBoost: Harvesting Preemptible Resources for Cost-Efficient Reinforcement Learning on LLMs

cs.DC · 2025-10-22 · unverdicted · novelty 6.0

RLBoost harvests preemptible GPUs for RL rollout via a hybrid architecture with adaptive offload, pull-based transfer, and token-level migration, delivering 1.51x-1.97x throughput and 28-49% better cost efficiency than on-demand-only setups.

ReSeek: A Self-Correcting Framework for Search Agents with Instructive Rewards

cs.CL · 2025-10-01 · unverdicted · novelty 6.0

ReSeek adds self-correction via a JUDGE action and a dense instructive reward (correctness plus utility) to RL training of search agents, yielding higher success and faithfulness on a new contamination-resistant benchmark.

The Landscape of Agentic Reinforcement Learning for LLMs: A Survey

cs.AI · 2025-09-02 · accept · novelty 6.0

Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.

WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent

cs.IR · 2025-08-07 · unverdicted · novelty 6.0

WebWatcher introduces a vision-language deep research agent trained on synthetic multimodal trajectories and RL that outperforms baselines on VQA benchmarks, along with a new BrowseComp-VL evaluation.

WebSailor: Navigating Super-human Reasoning for Web Agent

cs.CL · 2025-07-03 · conditional · novelty 6.0

WebSailor trains open-source web agents to match proprietary performance on complex information-seeking tasks by generating high-uncertainty scenarios and using a new RL method called DUPO.

MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents

cs.CL · 2025-06-18 · unverdicted · novelty 6.0

MEM1 uses end-to-end RL to learn constant-memory agents that update a shared state for memory and reasoning, delivering 3.5x better performance and 3.7x lower memory use than larger baselines on long-horizon QA and shopping tasks.

DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents

cs.CL · 2025-06-13 · conditional · novelty 6.0

DeepResearch Bench supplies 100 expert-crafted PhD-level tasks and two human-aligned evaluation frameworks to measure deep research agents on report quality and citation accuracy.

ZeroSearch: Incentivize the Search Capability of LLMs without Searching

cs.CL · 2025-05-07 · unverdicted · novelty 6.0 · 2 refs

ZeroSearch uses supervised fine-tuning to create a simulated retrieval module and curriculum-based RL rollouts that degrade document quality to train LLMs on search capabilities without real search API calls.

WebThinker: Empowering Large Reasoning Models with Deep Research Capability

cs.CL · 2025-04-30 · unverdicted · novelty 6.0

WebThinker equips large reasoning models with autonomous web exploration and interleaved reasoning-drafting via a Deep Web Explorer and RL-based DPO training, yielding gains on GPQA, GAIA, and report-generation benchmarks.

ToolRL: Reward is All Tool Learning Needs

cs.LG · 2025-04-16 · conditional · novelty 6.0

A principled reward design for tool selection and application in RL-trained LLMs delivers 17% gains over base models and 15% over SFT across benchmarks.

Personalized Deep Research: A User-Centric Framework, Dataset, and Hybrid Evaluation for Knowledge Discovery

cs.IR · 2026-05-11 · conditional · novelty 5.0

PDR is a user-context-aware framework for LLM research agents that improves report relevance over static baselines, supported by a new dataset and hybrid evaluation.

ProMMSearchAgent: A Generalizable Multimodal Search Agent Trained with Process-Oriented Rewards

cs.CV · 2026-04-22 · unverdicted · novelty 5.0

A sandbox-trained multimodal search agent with process-oriented rewards transfers zero-shot to real Google Search and outperforms prior methods on FVQA, InfoSeek, and MMSearch.

SAKE: Self-aware Knowledge Exploitation-Exploration for Grounded Multimodal Named Entity Recognition

cs.IR · 2026-04-22 · unverdicted · novelty 5.0

SAKE is an agentic framework for GMNER that uses uncertainty-based self-awareness and reinforcement learning to balance internal knowledge exploitation with adaptive external exploration.

Erase to Improve: Erasable Reinforcement Learning for Search-Augmented LLMs

cs.CL · 2025-10-01 · unverdicted · novelty 5.0

ERL trains LLMs to erase faulty reasoning steps and regenerate them in place, yielding gains of up to 8.48% EM on multi-hop QA benchmarks like HotpotQA.

LLM-Oriented Information Retrieval: A Denoising-First Perspective

cs.IR · 2026-05-01 · unverdicted · novelty 4.0 · 2 refs

Argues for a denoising-first paradigm in LLM-oriented information retrieval, framing challenges via a four-stage progression and providing a taxonomy of signal-to-noise optimization techniques across the pipeline.

citing papers explorer

Showing 28 of 28 citing papers.

Towards Autonomous Business Intelligence via Data-to-Insight Discovery Agent cs.AI · 2026-05-08 · unverdicted · none · ref 5 · 2 links · internal anchor
AIDA is the first end-to-end autonomous agent that combines a domain-specific language with Pareto-guided reinforcement learning to discover insights from complex business data.
Watching, Reasoning, and Searching: A Video Deep Research Benchmark on Open Web for Agentic Video Reasoning cs.CV · 2026-01-11 · unverdicted · none · ref 18 · internal anchor
VideoDR is a new benchmark for open-web video deep research that tests multimodal models on cross-frame visual anchor extraction, interactive retrieval, and multi-hop reasoning over joint video-web evidence.
MMSearch-R1: Incentivizing LMMs to Search cs.CV · 2025-06-25 · unverdicted · none · ref 72 · internal anchor
MMSearch-R1 uses reinforcement learning to train multimodal models for on-demand multi-turn internet search with image and text tools, outperforming same-size RAG baselines and matching larger ones while cutting search calls by over 30%.
MixRea: Benchmarking Explicit-Implicit Reasoning in Large Language Models cs.CL · 2026-05-19 · unverdicted · none · ref 18 · internal anchor
MixRea benchmark reveals LLMs achieve at most 42.8% consistency on explicit-implicit reasoning tasks, with PRCP prompting proposed to recover overlooked relations.
DR-MMSearchAgent: Deepening Reasoning in Multimodal Search Agents cs.CV · 2026-04-21 · unverdicted · none · ref 39 · internal anchor
DR-MMSearchAgent derives batch-wide trajectory advantages and uses differentiated Gaussian rewards to prevent premature collapse in multimodal agents, outperforming MMSearch-R1 by 8.4% on FVQA-test.
CogGen: A Cognitively Inspired Recursive Framework for Deep Research Report Generation cs.MA · 2026-04-18 · unverdicted · none · ref 7 · internal anchor
CogGen uses a cognitively inspired recursive architecture with AVR for multimodal content to generate deep research reports that achieve SOTA among open-source systems and surpass Gemini Deep Research on a new OWID benchmark.
Towards Knowledgeable Deep Research: Framework and Benchmark cs.AI · 2026-04-09 · unverdicted · none · ref 44 · internal anchor
The paper introduces the KDR task, HKA multi-agent framework, and KDR-Bench to enable LLM agents to integrate structured knowledge into deep research reports, with experiments showing outperformance over prior agents.
MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling cs.CL · 2025-11-14 · unverdicted · none · ref 19 · internal anchor
MiroThinker shows that scaling agent-environment interactions via reinforcement learning lets a 72B open-source model reach up to 81.9% on GAIA and approach commercial performance on research benchmarks.
DeepEyesV2: Toward Agentic Multimodal Model cs.CV · 2025-11-07 · unverdicted · none · ref 63 · internal anchor
DeepEyesV2 uses a two-stage cold-start plus reinforcement learning pipeline to produce an agentic multimodal model that adaptively invokes tools and outperforms direct RL on real-world reasoning benchmarks.
RLBoost: Harvesting Preemptible Resources for Cost-Efficient Reinforcement Learning on LLMs cs.DC · 2025-10-22 · unverdicted · none · ref 50 · internal anchor
RLBoost harvests preemptible GPUs for RL rollout via a hybrid architecture with adaptive offload, pull-based transfer, and token-level migration, delivering 1.51x-1.97x throughput and 28-49% better cost efficiency than on-demand-only setups.
ReSeek: A Self-Correcting Framework for Search Agents with Instructive Rewards cs.CL · 2025-10-01 · unverdicted · none · ref 24 · internal anchor
ReSeek adds self-correction via a JUDGE action and a dense instructive reward (correctness plus utility) to RL training of search agents, yielding higher success and faithfulness on a new contamination-resistant benchmark.
The Landscape of Agentic Reinforcement Learning for LLMs: A Survey cs.AI · 2025-09-02 · accept · none · ref 281 · internal anchor
Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.
WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent cs.IR · 2025-08-07 · unverdicted · none · ref 30 · internal anchor
WebWatcher introduces a vision-language deep research agent trained on synthetic multimodal trajectories and RL that outperforms baselines on VQA benchmarks, along with a new BrowseComp-VL evaluation.
WebSailor: Navigating Super-human Reasoning for Web Agent cs.CL · 2025-07-03 · conditional · none · ref 34 · internal anchor
WebSailor trains open-source web agents to match proprietary performance on complex information-seeking tasks by generating high-uncertainty scenarios and using a new RL method called DUPO.
MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents cs.CL · 2025-06-18 · unverdicted · none · ref 69 · internal anchor
MEM1 uses end-to-end RL to learn constant-memory agents that update a shared state for memory and reasoning, delivering 3.5x better performance and 3.7x lower memory use than larger baselines on long-horizon QA and shopping tasks.
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents cs.CL · 2025-06-13 · conditional · none · ref 37 · internal anchor
DeepResearch Bench supplies 100 expert-crafted PhD-level tasks and two human-aligned evaluation frameworks to measure deep research agents on report quality and citation accuracy.
ZeroSearch: Incentivize the Search Capability of LLMs without Searching cs.CL · 2025-05-07 · unverdicted · none · ref 48 · 2 links · internal anchor
ZeroSearch uses supervised fine-tuning to create a simulated retrieval module and curriculum-based RL rollouts that degrade document quality to train LLMs on search capabilities without real search API calls.
WebThinker: Empowering Large Reasoning Models with Deep Research Capability cs.CL · 2025-04-30 · unverdicted · none · ref 69 · internal anchor
WebThinker equips large reasoning models with autonomous web exploration and interleaved reasoning-drafting via a Deep Web Explorer and RL-based DPO training, yielding gains on GPQA, GAIA, and report-generation benchmarks.
ToolRL: Reward is All Tool Learning Needs cs.LG · 2025-04-16 · conditional · none · ref 45 · internal anchor
A principled reward design for tool selection and application in RL-trained LLMs delivers 17% gains over base models and 15% over SFT across benchmarks.
Personalized Deep Research: A User-Centric Framework, Dataset, and Hybrid Evaluation for Knowledge Discovery cs.IR · 2026-05-11 · conditional · none · ref 53 · internal anchor
PDR is a user-context-aware framework for LLM research agents that improves report relevance over static baselines, supported by a new dataset and hybrid evaluation.
ProMMSearchAgent: A Generalizable Multimodal Search Agent Trained with Process-Oriented Rewards cs.CV · 2026-04-22 · unverdicted · none · ref 33 · internal anchor
A sandbox-trained multimodal search agent with process-oriented rewards transfers zero-shot to real Google Search and outperforms prior methods on FVQA, InfoSeek, and MMSearch.
SAKE: Self-aware Knowledge Exploitation-Exploration for Grounded Multimodal Named Entity Recognition cs.IR · 2026-04-22 · unverdicted · none · ref 51 · internal anchor
SAKE is an agentic framework for GMNER that uses uncertainty-based self-awareness and reinforcement learning to balance internal knowledge exploitation with adaptive external exploration.
Erase to Improve: Erasable Reinforcement Learning for Search-Augmented LLMs cs.CL · 2025-10-01 · unverdicted · none · ref 43 · internal anchor
ERL trains LLMs to erase faulty reasoning steps and regenerate them in place, yielding gains of up to 8.48% EM on multi-hop QA benchmarks like HotpotQA.
LLM-Oriented Information Retrieval: A Denoising-First Perspective cs.IR · 2026-05-01 · unverdicted · none · ref 232 · 2 links · internal anchor
Argues for a denoising-first paradigm in LLM-oriented information retrieval, framing challenges via a four-stage progression and providing a taxonomy of signal-to-noise optimization techniques across the pipeline.
Beyond Relevance: Utility-Centric Retrieval in the LLM Era cs.IR · 2026-04-10 · unverdicted · none · ref 47 · internal anchor
Retrieval systems must prioritize utility for LLM generation quality over traditional relevance metrics, supported by a unified framework distinguishing LLM-agnostic vs specific and context-independent vs dependent utility.
EigentSearch-Q+: Enhancing Deep Research Agents with Structured Reasoning Tools cs.AI · 2026-04-09 · unverdicted · none · ref 33 · internal anchor
Structured query and evidence tools added to an AI research agent improve benchmark accuracy by 0.6 to 3.8 percentage points.
Agentic Reasoning for Large Language Models cs.AI · 2026-01-18 · unverdicted · none · ref 261 · internal anchor
The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applications across domains.
Differentiable Mixture-of-Agents Incentivizes Swarm Intelligence of Large Language Models cs.LG · 2026-05-15 · unreviewed · ref 72 · internal anchor

DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer