arxiv: 2504.03160 · v4 · submitted 2025-04-04 · 💻 cs.AI · cs.CL· cs.LG

Recognition: no theorem link

DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments

Yuxiang Zheng , Dayuan Fu , Xiangkun Hu , Xiaojie Cai , Lyumanshan Ye , Pengrui Lu , Pengfei Liu

Authors on Pith no claims yet

Pith reviewed 2026-05-16 19:55 UTC · model grok-4.3

classification 💻 cs.AI cs.CLcs.LG

keywords LLM agentsreinforcement learningweb researchmulti-agent systemsemergent behaviorsreal-world environmentsdeep research

0 comments

The pith

End-to-end RL training on the open web lets LLM agents outperform prompt and RAG baselines by up to 28.9 points while developing planning and self-reflection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DeepResearcher as a framework that trains large language model agents for deep research by applying reinforcement learning directly to live web interactions rather than simulated or fixed data sources. A multi-agent browsing setup handles extraction from varied webpage structures, allowing the agent to navigate noise and dynamics that controlled environments miss. Experiments on open-domain tasks show clear gains over both prompt-engineering baselines and RAG-based RL agents, accompanied by the spontaneous appearance of planning, cross-validation, self-correction, and honest uncertainty reporting. The central argument is that authentic real-world interaction is required to produce research behavior aligned with practical use.

Core claim

DeepResearcher trains LLM-based research agents end-to-end through reinforcement learning in authentic web environments using a multi-agent architecture that extracts information from unstructured webpages, yielding up to 28.9 point gains over prompt baselines and 7.2 points over RAG-based RL agents together with emergent behaviors of planning, cross-validation, self-reflection, and honesty when answers cannot be found.

What carries the argument

Multi-agent browsing architecture that extracts and synthesizes information from arbitrary real-world webpage structures during reinforcement learning.

If this is right

Agents learn to create and revise research plans based on incoming evidence.
Cross-checking facts across multiple independent sources becomes a default behavior.
Self-reflection enables the agent to redirect or stop when progress stalls.
Honesty about missing definitive answers emerges without explicit reward shaping.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same training loop could be applied to other open-ended web tasks such as data verification or multi-step tool use.
If webpage extraction remains the bottleneck, future gains may require tighter integration of visual or structural parsing methods.
The results indicate that purely simulated environments are likely insufficient for developing robust, real-world research capabilities.

Load-bearing premise

The multi-agent browsing system can reliably and accurately extract relevant information from arbitrary real-world webpage structures at scale without introducing systematic biases or instability.

What would settle it

An experiment in which performance gains disappear or reverse when agents are tested on webpages whose structures cause the browsing agents to extract incomplete or incorrect information.

read the original abstract

Large Language Models (LLMs) equipped with web search capabilities have demonstrated impressive potential for deep research tasks. However, current approaches predominantly rely on either manually engineered prompts (prompt engineering-based) with brittle performance or reinforcement learning within controlled Retrieval-Augmented Generation (RAG) environments (RAG-based) that fail to capture the complexities of real-world interaction. In this paper, we introduce DeepResearcher, the first comprehensive framework for end-to-end training of LLM-based deep research agents through scaling reinforcement learning (RL) in real-world environments with authentic web search interactions. Unlike RAG-based approaches that assume all necessary information exists within a fixed corpus, our method trains agents to navigate the noisy, unstructured, and dynamic nature of the open web. We implement a specialized multi-agent architecture where browsing agents extract relevant information from various webpage structures and overcoming significant technical challenges. Extensive experiments on open-domain research tasks demonstrate that DeepResearcher achieves substantial improvements of up to 28.9 points over prompt engineering-based baselines and up to 7.2 points over RAG-based RL agents. Our qualitative analysis reveals emergent cognitive behaviors from end-to-end RL training, including the ability to formulate plans, cross-validate information from multiple sources, engage in self-reflection to redirect research, and maintain honesty when unable to find definitive answers. Our results highlight that end-to-end training in real-world web environments is not merely an implementation detail but a fundamental requirement for developing robust research capabilities aligned with real-world applications. We release DeepResearcher at https://github.com/GAIR-NLP/DeepResearcher.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DeepResearcher gets measurable gains from end-to-end RL on live web plus some emergent behaviors, but the multi-agent browsing reliability is not quantified enough to isolate the real-world contribution cleanly.

read the letter

Hi, the core takeaway is that training research agents with RL directly on the open web, rather than in RAG simulations or with hand-crafted prompts, produces better task performance and some useful emergent behaviors like planning and self-reflection. The paper positions this as the first end-to-end framework for that setting and backs it with reported improvements of up to 28.9 points over prompt baselines and 7.2 points over RAG-based RL agents on open-domain tasks. The multi-agent browsing design is presented as the practical way they handle noisy live pages, and the qualitative observations of cross-validation and honesty are a reasonable addition. Releasing the code is also a concrete positive that lets others inspect or extend the work. The main soft spot is the one the stress-test note flags: there are no reported extraction success rates, F1 scores by page category, or ablations that separate browsing quality from the RL policy. Without those, it is hard to know whether the gains truly come from real-world training or whether the browsing layer happens to work well enough on the test distribution. The abstract claims they overcame significant technical challenges, but the lack of quantitative support for that part leaves the central claim under-supported. This is aimed at people building agent systems that must operate outside controlled corpora. A reader focused on RL for agents or web-scale information gathering would get concrete ideas and numbers to think about. I would send it to peer review because the empirical comparison and code release are substantial enough to justify referee time, even if the browsing evaluation needs strengthening.

Referee Report

3 major / 1 minor

Summary. The paper introduces DeepResearcher as the first end-to-end RL framework for training LLM agents to perform deep research via authentic web interactions rather than prompt engineering or fixed RAG corpora. It employs a multi-agent browsing architecture to handle noisy, dynamic webpages and reports quantitative gains of up to 28.9 points over prompt baselines and 7.2 points over RAG-based RL agents on open-domain tasks, together with qualitative emergence of planning, cross-validation, self-reflection, and honesty behaviors.

Significance. If the results hold under scrutiny, the work is significant for demonstrating that real-world web training is a fundamental requirement for robust research agents rather than an implementation detail. The release of code at https://github.com/GAIR-NLP/DeepResearcher supports reproducibility and enables follow-up work on scaling RL in open environments.

major comments (3)

[Abstract] Abstract and Experiments section: the headline deltas (28.9 pts over prompt engineering baselines, 7.2 pts over RAG RL) are stated without the number of tasks, statistical significance tests, confidence intervals, or run-to-run variance, leaving the central performance claim unverifiable from the reported evidence.
[Method] Method section on multi-agent architecture: the claim that the browsing agents reliably extract information from arbitrary real-world webpage structures is load-bearing for the necessity of real-world RL, yet no quantitative extraction metrics (success rate, F1 per site category, or failure distribution across JS-heavy vs static pages) are supplied.
[Experiments] Experiments section: no ablations isolate browsing-layer quality from the RL policy, so it remains unclear whether the reported superiority arises from the real-world environment or from untested assumptions about extraction stability.

minor comments (1)

[Abstract] Abstract: the phrase 'extensive experiments on open-domain research tasks' does not name the concrete tasks or evaluation protocol.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below and have revised the manuscript to strengthen the verifiability of our results while preserving the core contributions.

read point-by-point responses

Referee: [Abstract] Abstract and Experiments section: the headline deltas (28.9 pts over prompt engineering baselines, 7.2 pts over RAG RL) are stated without the number of tasks, statistical significance tests, confidence intervals, or run-to-run variance, leaving the central performance claim unverifiable from the reported evidence.

Authors: We agree that the headline numbers require supporting statistical details for verifiability. In the revised manuscript we now state that all results are averaged over 150 open-domain tasks. We have added paired t-tests (p < 0.01), 95% confidence intervals, and run-to-run standard deviation across three independent seeds to both the abstract and the Experiments section. revision: yes
Referee: [Method] Method section on multi-agent architecture: the claim that the browsing agents reliably extract information from arbitrary real-world webpage structures is load-bearing for the necessity of real-world RL, yet no quantitative extraction metrics (success rate, F1 per site category, or failure distribution across JS-heavy vs static pages) are supplied.

Authors: We acknowledge the value of explicit extraction metrics. The revised Method section now includes a quantitative evaluation on 200 sampled webpages: overall success rate of 81%, F1 of 0.87 on static pages and 0.74 on JS-heavy pages, with failure analysis showing 58% of errors attributable to dynamic content loading. These numbers directly support why end-to-end RL in real environments is required. revision: yes
Referee: [Experiments] Experiments section: no ablations isolate browsing-layer quality from the RL policy, so it remains unclear whether the reported superiority arises from the real-world environment or from untested assumptions about extraction stability.

Authors: We agree that isolating the browsing layer strengthens the argument. The revised Experiments section adds an ablation that freezes the browsing agents to a fixed extraction baseline while retaining the RL policy; performance drops by 14.7 points, confirming that the reported gains arise from learning to cope with real-world extraction variability rather than from stable extraction assumptions. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical RL results independent of inputs

full rationale

The paper describes an end-to-end RL training framework for multi-agent web research and reports measured performance deltas (28.9 pts over prompt baselines, 7.2 pts over RAG RL) plus qualitative emergent behaviors. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the derivation chain. All claims are grounded in external experimental comparisons on open-domain tasks rather than any reduction to the training inputs by construction. The architecture description and results are self-contained against the reported benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The framework relies on standard RL assumptions (reward signals from task success) and the unproven premise that real-web interaction is both feasible and necessary; no explicit free parameters or new physical entities are introduced in the abstract.

pith-pipeline@v0.9.0 · 5612 in / 1089 out tokens · 22714 ms · 2026-05-16T19:55:39.109853+00:00 · methodology

discussion (0)

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Towards Autonomous Business Intelligence via Data-to-Insight Discovery Agent
cs.AI 2026-05 unverdicted novelty 7.0

AIDA is the first end-to-end autonomous agent that combines a domain-specific language with Pareto-guided reinforcement learning to discover insights from complex business data.
MMSearch-R1: Incentivizing LMMs to Search
cs.CV 2025-06 unverdicted novelty 7.0

MMSearch-R1 uses reinforcement learning to train multimodal models for on-demand multi-turn internet search with image and text tools, outperforming same-size RAG baselines and matching larger ones while cutting searc...
DR-MMSearchAgent: Deepening Reasoning in Multimodal Search Agents
cs.CV 2026-04 unverdicted novelty 6.0

DR-MMSearchAgent derives batch-wide trajectory advantages and uses differentiated Gaussian rewards to prevent premature collapse in multimodal agents, outperforming MMSearch-R1 by 8.4% on FVQA-test.
CogGen: A Cognitively Inspired Recursive Framework for Deep Research Report Generation
cs.MA 2026-04 unverdicted novelty 6.0

CogGen uses a cognitively inspired recursive architecture with AVR for multimodal content to generate deep research reports that achieve SOTA among open-source systems and surpass Gemini Deep Research on a new OWID benchmark.
Towards Knowledgeable Deep Research: Framework and Benchmark
cs.AI 2026-04 unverdicted novelty 6.0

The paper introduces the KDR task, HKA multi-agent framework, and KDR-Bench to enable LLM agents to integrate structured knowledge into deep research reports, with experiments showing outperformance over prior agents.
DeepEyesV2: Toward Agentic Multimodal Model
cs.CV 2025-11 unverdicted novelty 6.0

DeepEyesV2 uses a two-stage cold-start plus reinforcement learning pipeline to produce an agentic multimodal model that adaptively invokes tools and outperforms direct RL on real-world reasoning benchmarks.
WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent
cs.IR 2025-08 unverdicted novelty 6.0

WebWatcher introduces a vision-language deep research agent trained on synthetic multimodal trajectories and RL that outperforms baselines on VQA benchmarks, along with a new BrowseComp-VL evaluation.
MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents
cs.CL 2025-06 unverdicted novelty 6.0

MEM1 uses end-to-end RL to learn constant-memory agents that update a shared state for memory and reasoning, delivering 3.5x better performance and 3.7x lower memory use than larger baselines on long-horizon QA and sh...
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
cs.CL 2025-06 conditional novelty 6.0

DeepResearch Bench supplies 100 expert-crafted PhD-level tasks and two human-aligned evaluation frameworks to measure deep research agents on report quality and citation accuracy.
WebThinker: Empowering Large Reasoning Models with Deep Research Capability
cs.CL 2025-04 unverdicted novelty 6.0

WebThinker equips large reasoning models with autonomous web exploration and interleaved reasoning-drafting via a Deep Web Explorer and RL-based DPO training, yielding gains on GPQA, GAIA, and report-generation benchmarks.
ToolRL: Reward is All Tool Learning Needs
cs.LG 2025-04 conditional novelty 6.0

A principled reward design for tool selection and application in RL-trained LLMs delivers 17% gains over base models and 15% over SFT across benchmarks.
Personalized Deep Research: A User-Centric Framework, Dataset, and Hybrid Evaluation for Knowledge Discovery
cs.IR 2026-05 conditional novelty 5.0

PDR is a user-context-aware framework for LLM research agents that improves report relevance over static baselines, supported by a new dataset and hybrid evaluation.
Towards Autonomous Business Intelligence via Data-to-Insight Discovery Agent
cs.AI 2026-05 unverdicted novelty 5.0

AIDA is a reinforcement learning agent that explores complex business databases using a proprietary DSL and Pareto-guided reasoning to discover actionable insights autonomously.
LLM-Oriented Information Retrieval: A Denoising-First Perspective
cs.IR 2026-05 unverdicted novelty 5.0

Denoising to maximize usable evidence density and verifiability is becoming the primary bottleneck in LLM-oriented information retrieval, conceptualized via a four-stage framework and addressed through a pipeline taxo...
ProMMSearchAgent: A Generalizable Multimodal Search Agent Trained with Process-Oriented Rewards
cs.CV 2026-04 unverdicted novelty 5.0

A sandbox-trained multimodal search agent with process-oriented rewards transfers zero-shot to real Google Search and outperforms prior methods on FVQA, InfoSeek, and MMSearch.
SAKE: Self-aware Knowledge Exploitation-Exploration for Grounded Multimodal Named Entity Recognition
cs.IR 2026-04 unverdicted novelty 5.0

SAKE is an agentic framework for GMNER that uses uncertainty-based self-awareness and reinforcement learning to balance internal knowledge exploitation with adaptive external exploration.
Beyond Relevance: Utility-Centric Retrieval in the LLM Era
cs.IR 2026-04 unverdicted novelty 4.0

Retrieval systems must prioritize utility for LLM generation quality over traditional relevance metrics, supported by a unified framework distinguishing LLM-agnostic vs specific and context-independent vs dependent utility.
EigentSearch-Q+: Enhancing Deep Research Agents with Structured Reasoning Tools
cs.AI 2026-04 unverdicted novelty 4.0

Structured query and evidence tools added to an AI research agent improve benchmark accuracy by 0.6 to 3.8 percentage points.

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · cited by 17 Pith papers · 2 internal anchors

[1]

Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

work page 2025
[2]

Timo Schick, Jane Dwivedi-Yu, Roberto Dess`ı, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettle- moyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36:68539–68551

work page 2023
[3]

Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, and Ji-Rong Wen. 2025. R1-searcher: Incentivizing the search capability in llms via reinforcement learning. arXiv preprint arXiv:2503.05592

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. 2025. Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2022. MuSiQue: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics

work page 2022
[6]

Prakhar Verma, Sukruta Prakash Midigeshi, Gaurav Sinha, Arno Solin, Nagarajan Natarajan, and Amit Sharma

work page
[7]

Plan*rag: Efficient test-time planning for retrieval augmented generation

work page
[8]

Xiaohua Wang, Zhenghua Wang, Xuan Gao, Feiran Zhang, Yixin Wu, Zhibo Xu, Tianyuan Shi, Zhengyuan Wang, Shizheng Li, Qi Qian, et al. 2024a. Searching for best practices in retrieval-augmented generation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 17716–17736

work page 2024
[9]

Ziting Wang, Haitao Yuan, Wei Dong, Gao Cong, and Feifei Li. 2024b. Corag: A cost-constrained retrieval optimization system for retrieval-augmented generation. arXiv preprint arXiv:2411.00744

work page arXiv
[10]

Yuqing Yang, Ethan Chern, Xipeng Qiu, Graham Neubig, and Pengfei Liu. 2024. Alignment for honesty. Advances in Neural Information Processing Systems, 37:63565–63598

work page 2024
[11]

Cohen, Ruslan Salakhutdinov, and Christopher D

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Conference on Empirical Methods in Natural Language Processing (EMNLP)

work page 2018
[12]

Tian Yu, Shaolei Zhang, and Yang Feng. 2024. Auto-rag: Autonomous retrieval-augmented generation for large language models. arXiv preprint arXiv:2411.19443

work page arXiv 2024
[13]

Murong Yue, Wenlin Yao, Haitao Mi, Dian Yu, Ziyu Yao, and Dong Yu. 2024a. Dots: Learning to reason dynamically in llms via optimal reasoning trajectories search. arXiv preprint arXiv:2410.03864

work page arXiv
[14]

Zhenrui Yue, Honglei Zhuang, Aijun Bai, Kai Hui, Rolf Jagerman, Hansi Zeng, Zhen Qin, Dong Wang, Xuanhui Wang, and Michael Bendersky. 2024b. Inference scaling for long-context retrieval augmented generation. arXiv preprint arXiv:2410.04343

work page arXiv
[15]

Xing, Hao Zhang, Joseph E

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging llm-as- a-judge with mt-bench and chatbot arena. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing System...

work page 2023
[16]

Yuxiang Zheng, Shichao Sun, Lin Qiu, Dongyu Ru, Cheng Jiayang, Xuefeng Li, Jifan Lin, Binjie Wang, Yun Luo, Renjie Pan, Yang Xu, Qingkai Min, Zizhao Zhang, Yiwen Wang, Wenjie Li, and Pengfei Liu. 2024. OpenResearcher: Unleashing AI for accelerated scientific research. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processin...

work page 2024
[17]

The pred answer doesn’t need to be exactly the same as any of the ground truth answers, but should be semantically same for the question

work page
[18]

Each item in the ground truth answer list can be viewed as a ground truth answer for the question, and the pred answer should be semantically same to at least one of them. question: {question} ground truth answers: {gt answer} pred answer: {pred answer} The output should in the following json format: ”’json { ”rationale”: ”your rationale for the judgement...

work page