pith. machine review for the scientific record. sign in

arxiv: 2504.21776 · v2 · submitted 2025-04-30 · 💻 cs.CL · cs.AI· cs.IR

Recognition: no theorem link

WebThinker: Empowering Large Reasoning Models with Deep Research Capability

Authors on Pith no claims yet

Pith reviewed 2026-05-16 19:09 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.IR
keywords large reasoning modelsweb searchdeep research agentautonomous report draftingtool usereasoning benchmarksDPO training
0
0 comments X

The pith

WebThinker lets large reasoning models search the web and draft reports autonomously during reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large reasoning models perform well on long-horizon tasks but hit limits when they need fresh or detailed external knowledge. WebThinker adds a Deep Web Explorer so the model can search, navigate pages, and pull information exactly when a gap appears. An Autonomous Think-Search-and-Draft loop lets it switch between thinking, gathering facts, and writing the report in one continuous process. Iterative online DPO training improves how effectively the model uses these tools. On benchmarks such as GPQA, GAIA, WebWalkerQA, HLE, and Glaive report generation, the system outperforms prior open methods and strong proprietary baselines.

Core claim

By integrating a Deep Web Explorer module and an Autonomous Think-Search-and-Draft strategy, large reasoning models can dynamically search the web, navigate pages, extract information, and interleave these steps with reasoning and report drafting, producing more accurate and comprehensive outputs on knowledge-intensive tasks.

What carries the argument

Deep Web Explorer module, which lets the model dynamically search, navigate, and extract information from web pages when knowledge gaps arise during reasoning.

If this is right

  • Large reasoning models can handle knowledge-intensive tasks that require current or diverse external information.
  • Report generation quality improves on scientific and complex topics as measured on Glaive and similar benchmarks.
  • Tool-use reliability increases through the RL-based iterative DPO training loop.
  • The overall system becomes more applicable to real-world deep research scenarios that mix reasoning with external data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Grounding outputs in live web content could lower the incidence of outdated or hallucinated facts compared with purely internal knowledge.
  • The same interleaving pattern might transfer to other tool sets such as code execution or database queries for different domains.
  • Scaling the approach to multi-turn interactive web sessions could support longer research projects that evolve over many steps.

Load-bearing premise

The Deep Web Explorer can reliably locate, navigate, and extract accurate information from arbitrary web pages without navigation errors or factual hallucinations that affect the final output.

What would settle it

Running the system on a set of queries where web pages contain subtle contradictions or require precise navigation, then measuring the rate of factual errors in the generated reports compared to a no-search baseline.

read the original abstract

Large reasoning models (LRMs), such as OpenAI-o1 and DeepSeek-R1, demonstrate impressive long-horizon reasoning capabilities. However, their reliance on static internal knowledge limits their performance on complex, knowledge-intensive tasks and hinders their ability to produce comprehensive research reports requiring synthesis of diverse web information. To address this, we propose WebThinker, a deep research agent that empowers LRMs to autonomously search the web, navigate among web pages, and draft reports during the reasoning process. WebThinker integrates a Deep Web Explorer module, enabling LRMs to dynamically search, navigate, and extract information from the web when encountering knowledge gaps. It also employs an Autonomous Think-Search-and-Draft strategy, allowing the model to seamlessly interleave reasoning, information gathering, and report writing in real time. To further enhance research tool utilization, we introduce an RL-based training strategy via iterative online Direct Preference Optimization (DPO). Extensive experiments on complex reasoning benchmarks (GPQA, GAIA, WebWalkerQA, HLE) and scientific report generation tasks (Glaive) demonstrate that WebThinker significantly outperforms existing methods and strong proprietary systems. Our approach enhances LRM reliability and applicability in complex scenarios, paving the way for more capable and versatile deep research systems. The code is available at https://github.com/RUC-NLPIR/WebThinker.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes WebThinker, an agent that augments large reasoning models (LRMs) with a Deep Web Explorer module for dynamic web search, navigation, and information extraction. It introduces an interleaved Autonomous Think-Search-and-Draft strategy and an RL-based iterative online DPO training procedure to improve tool use. Experiments claim substantial gains over baselines and proprietary systems on GPQA, GAIA, WebWalkerQA, HLE, and Glaive scientific report generation, with code released at the provided GitHub link.

Significance. If the reported gains prove robust to error analysis, the work would meaningfully advance LRM-based research agents by demonstrating practical integration of web-scale retrieval into long-horizon reasoning. The code release is a clear positive for reproducibility.

major comments (2)
  1. [Experiments / §4] The central performance claims on GPQA, GAIA, WebWalkerQA, HLE, and Glaive rest on the assumption that the Deep Web Explorer (and the interleaved Think-Search-and-Draft loop) can locate, traverse, and extract accurate information without injecting navigation failures or factual hallucinations that propagate into final outputs. No quantitative breakdown of navigation success rate, extraction precision, or error-propagation analysis appears in the experimental results or ablations, leaving open the possibility that measured gains are artifacts of the particular web snapshot or base LRM rather than evidence of improved research capability.
  2. [Training / §3.3] The RL-based DPO training is described as improving tool utilization, yet the manuscript supplies no ablation isolating the contribution of the online DPO stage versus the base LRM or the Deep Web Explorer alone. Without such controls, it is unclear whether the reported outperformance is load-bearing on the training procedure or on the underlying model scale.
minor comments (2)
  1. [Abstract] The abstract and introduction would benefit from a concise statement of the exact number of model calls or tokens used per benchmark instance to allow direct comparison with prior agent baselines.
  2. [Figures] Figure captions for the system overview and example trajectories should explicitly label the Think/Search/Draft states to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address the major comments point by point below and will revise the manuscript to incorporate the requested analyses, which we agree will strengthen the paper.

read point-by-point responses
  1. Referee: [Experiments / §4] The central performance claims on GPQA, GAIA, WebWalkerQA, HLE, and Glaive rest on the assumption that the Deep Web Explorer (and the interleaved Think-Search-and-Draft loop) can locate, traverse, and extract accurate information without injecting navigation failures or factual hallucinations that propagate into final outputs. No quantitative breakdown of navigation success rate, extraction precision, or error-propagation analysis appears in the experimental results or ablations, leaving open the possibility that measured gains are artifacts of the particular web snapshot or base LRM rather than evidence of improved research capability.

    Authors: We agree that a quantitative error analysis is important for validating the robustness of the reported gains. In the revised manuscript we will add a dedicated subsection in §4 that reports navigation success rates (percentage of successful page retrievals and traversals), extraction precision (fact-level accuracy of extracted content against reference sources), and an error-propagation study tracing how navigation or extraction failures affect final benchmark scores. We will also include representative failure cases and their frequency across the evaluated benchmarks. revision: yes

  2. Referee: [Training / §3.3] The RL-based DPO training is described as improving tool utilization, yet the manuscript supplies no ablation isolating the contribution of the online DPO stage versus the base LRM or the Deep Web Explorer alone. Without such controls, it is unclear whether the reported outperformance is load-bearing on the training procedure or on the underlying model scale.

    Authors: We acknowledge that the current ablations do not fully isolate the online DPO stage. In the revised manuscript we will add controlled experiments in §4 that compare three settings: (1) the base LRM without any web tools, (2) the LRM equipped with the Deep Web Explorer but trained only with supervised fine-tuning (no DPO), and (3) the full WebThinker pipeline with iterative online DPO. These results will quantify the incremental benefit attributable to the RL-based DPO procedure. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical system paper with no derivations or self-referential reductions

full rationale

The manuscript describes an agent architecture (Deep Web Explorer + interleaved Think-Search-and-Draft loop + RL-based DPO) and reports benchmark results on GPQA, GAIA, WebWalkerQA, HLE, and Glaive. No equations, first-principles derivations, or parameter-fitting steps are present that could reduce a claimed prediction to its own inputs by construction. Performance claims rest on external, publicly referenced benchmarks and released code rather than any self-citation chain or ansatz smuggled through prior work. The training procedure is presented as a standard application of online DPO to tool-use data; nothing in the provided text indicates that measured gains are forced by the definition of the method itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review identifies no explicit free parameters, axioms, or invented entities beyond the named modules; full paper may introduce hyperparameters for DPO or navigation heuristics.

pith-pipeline@v0.9.0 · 5568 in / 981 out tokens · 32328 ms · 2026-05-16T19:09:52.921915+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery

    cs.AI 2026-04 accept novelty 8.0

    AutoResearchBench is a new benchmark showing top AI agents achieve under 10% success on complex scientific literature discovery tasks that demand deep comprehension and open-ended search.

  2. Learning Agentic Policy from Action Guidance

    cs.CL 2026-05 unverdicted novelty 7.0

    ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.

  3. Controllable and Verifiable Tool-Use Data Synthesis for Agentic Reinforcement Learning

    cs.AI 2026-04 unverdicted novelty 7.0

    COVERT generates verifiable synthetic tool-use environments for RL by validated trajectory synthesis and oracle-preserving augmentations, improving tool-use accuracy on BFCL v3 and ACEBench while remaining complementa...

  4. GeoBrowse: A Geolocation Benchmark for Agentic Tool Use with Expert-Annotated Reasoning Traces

    cs.CL 2026-04 unverdicted novelty 7.0

    GeoBrowse is a two-level geolocation benchmark combining visual cue composition with knowledge-intensive multi-hop queries, paired with the GATE agent workflow that outperforms no-tool, search-only, and image-only baselines.

  5. Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence

    cs.AI 2026-04 unverdicted novelty 6.0

    Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.

  6. Towards Knowledgeable Deep Research: Framework and Benchmark

    cs.AI 2026-04 unverdicted novelty 6.0

    The paper introduces the KDR task, HKA multi-agent framework, and KDR-Bench to enable LLM agents to integrate structured knowledge into deep research reports, with experiments showing outperformance over prior agents.

  7. TimelineReasoner: Advancing Timeline Summarization with Large Reasoning Models

    cs.CL 2026-04 unverdicted novelty 6.0

    TimelineReasoner applies large reasoning models in a Global Cognition plus Detail Exploration loop to produce more accurate, complete, and coherent timelines from news than prior LLM-based methods.

  8. Procedural Knowledge at Scale Improves Reasoning

    cs.CL 2026-04 unverdicted novelty 6.0

    Reasoning Memory decomposes reasoning trajectories into 32 million subquestion-subroutine pairs and retrieves them via in-thought prompts to improve language model performance on math, science, and coding benchmarks b...

  9. KG-Hopper: Empowering Compact Open LLMs with Knowledge Graph Reasoning via Reinforcement Learning

    cs.CL 2026-03 unverdicted novelty 6.0

    KG-Hopper uses RL to embed full multi-hop KG traversal and backtracking into a single LLM inference round, enabling a 7B model to outperform larger multi-step systems and compete with GPT-3.5/GPT-4o-mini on eight benchmarks.

  10. DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents

    cs.CL 2025-06 conditional novelty 6.0

    DeepResearch Bench supplies 100 expert-crafted PhD-level tasks and two human-aligned evaluation frameworks to measure deep research agents on report quality and citation accuracy.

  11. ViDR: Grounding Multimodal Deep Research Reports in Source Visual Evidence

    cs.CV 2026-05 unverdicted novelty 5.0

    ViDR treats source figures as retrievable and verifiable evidence objects in multimodal deep research reports and introduces MMR Bench+ to measure improvements in visual integration and verifiability.

  12. Personalized Deep Research: A User-Centric Framework, Dataset, and Hybrid Evaluation for Knowledge Discovery

    cs.IR 2026-05 conditional novelty 5.0

    PDR is a user-context-aware framework for LLM research agents that improves report relevance over static baselines, supported by a new dataset and hybrid evaluation.

  13. Web2BigTable: A Bi-Level Multi-Agent LLM System for Internet-Scale Information Search and Extraction

    cs.AI 2026-04 unverdicted novelty 5.0

    Web2BigTable introduces a bi-level multi-agent system that achieves new state-of-the-art results on wide-coverage and deep web-to-table search benchmarks through orchestration, coordination, and closed-loop reflection.

  14. ProMMSearchAgent: A Generalizable Multimodal Search Agent Trained with Process-Oriented Rewards

    cs.CV 2026-04 unverdicted novelty 5.0

    A sandbox-trained multimodal search agent with process-oriented rewards transfers zero-shot to real Google Search and outperforms prior methods on FVQA, InfoSeek, and MMSearch.

  15. Bias in the Loop: Auditing LLM-as-a-Judge for Software Engineering

    cs.SE 2026-04 unverdicted novelty 5.0

    LLM judges for code tasks show high sensitivity to prompt biases that systematically favor certain options, changing accuracy and model rankings even when code is unchanged.

  16. A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems

    cs.AI 2025-08 unverdicted novelty 5.0

    A comprehensive review of self-evolving AI agents that improve themselves over time, organized via a framework of inputs, agent system, environment, and optimizers, with domain-specific and safety discussions.

  17. A Brief Overview: Agentic Reinforcement Learning In Large Language Models

    cs.AI 2026-04 unverdicted novelty 2.0

    This review synthesizes conceptual foundations, methods, challenges, and future directions for agentic reinforcement learning in large language models.

  18. A Brief Overview: Agentic Reinforcement Learning In Large Language Models

    cs.AI 2026-04 unverdicted novelty 2.0

    The paper surveys the conceptual foundations, methodological innovations, challenges, and future directions of agentic reinforcement learning frameworks that embed cognitive capabilities like meta-reasoning and self-r...

Reference graph

Works this paper leans on

91 extracted references · 91 canonical work pages · cited by 17 Pith papers · 23 internal anchors

  1. [1]

    Self-rag: Learn- ing to retrieve, generate, and critique through self-reflection

    Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-rag: Learn- ing to retrieve, generate, and critique through self-reflection. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. Open- Review.net, 2024

  2. [2]

    ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning

    Mingyang Chen, Tianpeng Li, Haoze Sun, Yijie Zhou, Chenzheng Zhu, Haofen Wang, Jeff Z. Pan, Wen Zhang, Huajun Chen, Fan Yang, Zenan Zhou, and Weipeng Chen. Research: Learning to reason with search for llms via reinforcement learning.CoRR, abs/2503.19470, 2025

  3. [3]

    Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

    Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, and Wanxiang Che. Towards reasoning era: A survey of long chain-of-thought for reasoning large language models.CoRR, abs/2503.09567, 2025

  4. [4]

    An empirical study on eliciting and improving r1-like reasoning models.CoRR, abs/2503.04548, 2025

    Zhipeng Chen, Yingqian Min, Beichen Zhang, Jie Chen, Jinhao Jiang, Daixuan Cheng, Wayne Xin Zhao, Zheng Liu, Xu Miao, Yang Lu, Lei Fang, Zhongyuan Wang, and Ji-Rong Wen. An empirical study on eliciting and improving r1-like reasoning models.CoRR, abs/2503.04548, 2025

  5. [5]

    Self-play with execution feedback: Improving instruction-following capabilities of large language models

    Guanting Dong, Keming Lu, Chengpeng Li, Tingyu Xia, Bowen Yu, Chang Zhou, and Jingren Zhou. Self-play with execution feedback: Improving instruction-following capabilities of large language models. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025

  6. [6]

    Toward verifiable instruction-following alignment for retrieval augmented generation

    Guanting Dong, Xiaoshuai Song, Yutao Zhu, Runqi Qiao, Zhicheng Dou, and Ji-Rong Wen. Toward verifiable instruction-following alignment for retrieval augmented generation. In Toby Walsh, Julie Shah, and Zico Kolter, editors,AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25 - March 4, 2025, Philadelphia, PA, U...

  7. [7]

    The Llama 3 Herd of Models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

  8. [8]

    ReTool: Reinforcement Learning for Strategic Tool Use in LLMs

    Jiazhan Feng, Shijue Huang, Xingwei Qu, Ge Zhang, Yujia Qin, Baoquan Zhong, Chengquan Jiang, Jinxin Chi, and Wanjun Zhong. Retool: Reinforcement learning for strategic tool use in llms.CoRR, abs/2504.11536, 2025

  9. [9]

    Reasoning beyond limits: Advances and open problems for llms.CoRR, abs/2503.22732, 2025

    Mohamed Amine Ferrag, Norbert Tihanyi, and Mérouane Debbah. Reasoning beyond limits: Advances and open problems for llms.CoRR, abs/2503.22732, 2025

  10. [10]

    Gemini deep research

    Gemini. Gemini deep research. https://gemini.google/overview/ deep-research, 2025

  11. [11]

    reasoning-v1-20m

    Glaive. reasoning-v1-20m. https://huggingface.co/datasets/glaiveai/ reasoning-v1-20m, 2025

  12. [12]

    Grok 3 beta — the age of reasoning agents.https://x.ai/news/grok-3, 2025

    Grok. Grok 3 beta — the age of reasoning agents.https://x.ai/news/grok-3, 2025

  13. [13]

    Deeprag: Thinking to retrieval step by step for large language models

    Xinyan Guan, Jiali Zeng, Fandong Meng, Chunlei Xin, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun, and Jie Zhou. Deeprag: Thinking to retrieval step by step for large language models. CoRR, abs/2502.01142, 2025. 10

  14. [14]

    Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025

  15. [15]

    Scaling Laws for Autoregressive Generative Modeling

    Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B Brown, Prafulla Dhariwal, Scott Gray, et al. Scaling laws for autoregressive generative modeling.arXiv preprint arXiv:2010.14701, 2020

  16. [16]

    Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

    Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model.CoRR, abs/2503.24290, 2025

  17. [17]

    MCTS-RAG: enhancing retrieval- augmented generation with monte carlo tree search.CoRR, abs/2503.20757, 2025

    Yunhai Hu, Yilun Zhao, Chen Zhao, and Arman Cohan. MCTS-RAG: enhancing retrieval- augmented generation with monte carlo tree search.CoRR, abs/2503.20757, 2025

  18. [18]

    Qwen2.5-Coder Technical Report

    Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Kai Dang, An Yang, Rui Men, Fei Huang, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, and Junyang Lin. Qwen2.5-coder technical report.CoRR, abs/2409.12186, 2024

  19. [19]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

  20. [20]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024

  21. [21]

    Rag-star: Enhancing deliberative reasoning with retrieval augmented verification and refinement

    Jinhao Jiang, Jiayi Chen, Junyi Li, Ruiyang Ren, Shijie Wang, Xin Zhao, Yang Song, and Tao Zhang. Rag-star: Enhancing deliberative reasoning with retrieval augmented verification and refinement. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors,Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational...

  22. [22]

    Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

    Bowen Jin, Hansi Zeng, Zhenrui Yue, Dong Wang, Hamed Zamani, and Jiawei Han. Search- r1: Training llms to reason and leverage search engines with reinforcement learning.CoRR, abs/2503.09516, 2025

  23. [23]

    BIDER: bridging knowledge inconsistency for efficient retrieval-augmented llms via key supporting evidence

    Jiajie Jin, Yutao Zhu, Yujia Zhou, and Zhicheng Dou. BIDER: bridging knowledge inconsistency for efficient retrieval-augmented llms via key supporting evidence. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024, pages 750–...

  24. [24]

    Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in Neural Information Processing Systems, 33:9459–9474, 2020

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in Neural Information Processing Systems, 33:9459–9474, 2020

  25. [25]

    Numi- namath

    Jia LI, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Costa Huang, Kashif Rasul, Longhui Yu, Albert Jiang, Ziju Shen, Zihan Qin, Bin Dong, Li Zhou, Yann Fleureau, Guillaume Lample, and Stanislas Polu. Numi- namath. [https://huggingface.co/AI-MO/NuminaMath-1.5](https: //github.com/project-numina/aimo-progress-prize/blob/main/ report...

  26. [26]

    Search-o1: Agentic Search-Enhanced Large Reasoning Models

    Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou. Search-o1: Agentic search-enhanced large reasoning models.CoRR, abs/2501.05366, 2025

  27. [27]

    Torl: Scaling tool-integrated RL.CoRR, abs/2503.23383, 2025

    Xuefeng Li, Haoyang Zou, and Pengfei Liu. Torl: Scaling tool-integrated RL.CoRR, abs/2503.23383, 2025. 11

  28. [28]

    From System 1 to System 2: A Survey of Reasoning Large Language Models

    Zhong-Zhi Li, Duzhen Zhang, Ming-Liang Zhang, Jiaxin Zhang, Zengyan Liu, Yuxuan Yao, Haotian Xu, Junhao Zheng, Pei-Jie Wang, Xiuyi Chen, Yingying Zhang, Fei Yin, Jiahua Dong, Zhijiang Guo, Le Song, and Cheng-Lin Liu. From system 1 to system 2: A survey of reasoning large language models.CoRR, abs/2502.17419, 2025

  29. [29]

    Deepsolution: Boosting complex engineering solution design via tree-based exploration and bi-point thinking

    Zhuoqun Li, Haiyang Yu, Xuanang Chen, Hongyu Lin, Yaojie Lu, Fei Huang, Xianpei Han, Yongbin Li, and Le Sun. Deepsolution: Boosting complex engineering solution design via tree-based exploration and bi-point thinking. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the Assoc...

  30. [30]

    How much can RAG help the reasoning of llm?CoRR, abs/2410.02338, 2024

    Jingyu Liu, Jiaen Lin, and Yong Liu. How much can RAG help the reasoning of llm?CoRR, abs/2410.02338, 2024

  31. [31]

    Query rewriting for retrieval-augmented large language models.CoRR, abs/2305.14283, 2023

    Xinbei Ma, Yeyun Gong, Pengcheng He, Hai Zhao, and Nan Duan. Query rewriting for retrieval-augmented large language models.CoRR, abs/2305.14283, 2023

  32. [32]

    GAIA: a benchmark for general AI assistants

    Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: a benchmark for general AI assistants. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024

  33. [33]

    Imitate, explore, and self-improve: A reproduction report on slow-thinking reasoning systems.CoRR, abs/2412.09413, 2024

    Yingqian Min, Zhipeng Chen, Jinhao Jiang, Jie Chen, Jia Deng, Yiwen Hu, Yiru Tang, Jiapeng Wang, Xiaoxue Cheng, Huatong Song, Wayne Xin Zhao, Zheng Liu, Zhongyuan Wang, and Ji-Rong Wen. Imitate, explore, and self-improve: A reproduction report on slow-thinking reasoning systems.CoRR, abs/2412.09413, 2024

  34. [34]

    Learning to reason with llms

    OpenAI. Learning to reason with llms. https://openai.com/index/ learning-to-reason-with-llms, September 2024

  35. [35]

    Introducing deep research

    OpenAI. Introducing deep research. https://openai.com/index/ introducing-deep-research, 2025

  36. [36]

    Openai o3-mini

    OpenAI. Openai o3-mini. https://openai.com/index/openai-o3-mini, January 2025

  37. [37]

    Humanity's Last Exam

    Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Sean Shi, Michael Choi, Anish Agrawal, Arnav Chopra, Adam Khoja, Ryan Kim, Jason Hausenloy, Oliver Zhang, Mantas Mazeika, Daron Anderson, Tung Nguyen, Mobeen Mahmood, Fiona Feng, Steven Y . Feng, Haoran Zhao, Michael Yu, Varun Gangal, Chelsea Zou, Zihan Wang, Jessica P. Wang, Pawan...

  38. [38]

    ToolRL: Reward is All Tool Learning Needs

    Cheng Qian, Emre Can Acikgoz, Qi He, Hongru Wang, Xiusi Chen, Dilek Hakkani-Tür, Gokhan Tur, and Heng Ji. Toolrl: Reward is all tool learning needs.CoRR, abs/2504.13958, 2025

  39. [39]

    O1 replication journey: A strategic progress report–part 1

    Yiwei Qin, Xuefeng Li, Haoyang Zou, Yixiu Liu, Shijie Xia, Zhen Huang, Yixin Ye, Weizhe Yuan, Hector Liu, Yuanzhi Li, et al. O1 replication journey: A strategic progress report–part 1. arXiv preprint arXiv:2410.18982, 2024. 12

  40. [40]

    Manning, Stefano Ermon, and Chelsea Finn

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors,Advances in Neural Information Processing Systems 36: Annual Conference o...

  41. [41]

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google-proof q&a benchmark.CoRR, abs/2311.12022, 2023

  42. [42]

    Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy

    Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023, pages 9248–9274. Associ...

  43. [43]

    R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning

    Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, and Ji-Rong Wen. R1-searcher: Incentivizing the search capability in llms via reinforcement learning.CoRR, abs/2503.05592, 2025

  44. [44]

    Small models, big insights: Leveraging slim proxy models to decide when and what to retrieve for llms

    Jiejun Tan, Zhicheng Dou, Yutao Zhu, Peidong Guo, Kun Fang, and Ji-Rong Wen. Small models, big insights: Leveraging slim proxy models to decide when and what to retrieve for llms. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 20...

  45. [45]

    Unlocking general long chain-of-thought reasoning capabilities of large language models via representation engineering

    Xinyu Tang, Xiaolei Wang, Zhihao Lv, Yingqian Min, Xin Zhao, Binbin Hu, Ziqi Liu, and Zhiqiang Zhang. Unlocking general long chain-of-thought reasoning capabilities of large language models via representation engineering. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the A...

  46. [46]

    M.-A-P. Team, Xinrun Du, Yifan Yao, Kaijing Ma, Bingli Wang, Tianyu Zheng, Kang Zhu, Minghao Liu, Yiming Liang, Xiaolong Jin, Zhenlin Wei, Chujie Zheng, Kaixin Deng, Shian Jia, Sichao Jiang, Yiyan Liao, Rui Li, Qinrui Li, Sirun Li, Yizhi Li, Yunwen Li, Dehua Ma, Yuansheng Ni, Haoran Que, Qiyao Wang, Zhoufutu Wen, Siwei Wu, Tianshun Xing, Ming Xu, Zhenzhu ...

  47. [47]

    Open Thoughts.https://open-thoughts.ai, January 2025

    OpenThoughts Team. Open Thoughts.https://open-thoughts.ai, January 2025

  48. [48]

    Qwq: Reflect deeply on the boundaries of the unknown.Hugging Face, 2024

    Qwen Team. Qwq: Reflect deeply on the boundaries of the unknown.Hugging Face, 2024

  49. [49]

    Crawl4ai: Open-source llm friendly web crawler & scraper

    UncleCode. Crawl4ai: Open-source llm friendly web crawler & scraper. https://github. com/unclecode/crawl4ai, 2024

  50. [50]

    OTC: optimal tool calls via reinforcement learning.CoRR, abs/2504.14870, 2025

    Hongru Wang, Cheng Qian, Wanjun Zhong, Xiusi Chen, Jiahao Qiu, Shijue Huang, Bowen Jin, Mengdi Wang, Kam-Fai Wong, and Heng Ji. OTC: optimal tool calls via reinforcement learning.CoRR, abs/2504.14870, 2025. 13

  51. [51]

    Chain- of-retrieval augmented generation.CoRR, abs/2501.14342, 2025

    Liang Wang, Haonan Chen, Nan Yang, Xiaolong Huang, Zhicheng Dou, and Furu Wei. Chain- of-retrieval augmented generation.CoRR, abs/2501.14342, 2025

  52. [52]

    Query2doc: Query expansion with large language models

    Liang Wang, Nan Yang, and Furu Wei. Query2doc: Query expansion with large language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 9414–9423. Association for Computational Linguistics, 2023

  53. [53]

    RARE: retrieval-augmented reasoning modeling.CoRR, abs/2503.23513, 2025

    Zhengren Wang, Jiayang Yu, Dongsheng Ma, Zhe Chen, Yu Wang, Zhiyu Li, Feiyu Xiong, Yanfeng Wang, Weinan E, Linpeng Tang, and Wentao Zhang. RARE: retrieval-augmented reasoning modeling.CoRR, abs/2503.23513, 2025

  54. [54]

    RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning

    Zihan Wang, Kangrui Wang, Qineng Wang, Pingyue Zhang, Linjie Li, Zhengyuan Yang, Xing Jin, Kefan Yu, Minh Nhat Nguyen, Licheng Liu, Eli Gottlieb, Yiping Lu, Kyunghyun Cho, Jiajun Wu, Li Fei-Fei, Lijuan Wang, Yejin Choi, and Manling Li. RAGEN: understanding self-evolution in LLM agents via multi-turn reinforcement learning.CoRR, abs/2504.20073, 2025

  55. [55]

    Chi, Quoc V

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V . Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors,Advances in Neural Information Processing Systems 35: Annual Conference on Neura...

  56. [56]

    Webwalker: Benchmarking llms in web traversal

    Jialong Wu, Wenbiao Yin, Yong Jiang, Zhenglin Wang, Zekun Xi, Runnan Fang, Linhai Zhang, Yulan He, Deyu Zhou, Pengjun Xie, and Fei Huang. Webwalker: Benchmarking llms in web traversal. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the Association for Computational Linguist...

  57. [57]

    Self-play preference optimization for language model alignment

    Yue Wu, Zhiqing Sun, Huizhuo Yuan, Kaixuan Ji, Yiming Yang, and Quanquan Gu. Self-play preference optimization for language model alignment. InThe Thirteenth International Confer- ence on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025

  58. [58]

    Omnithink: Expanding knowledge boundaries in machine writing through thinking.CoRR, abs/2501.09751, 2025

    Zekun Xi, Wenbiao Yin, Jizhan Fang, Jialong Wu, Runnan Fang, Ningyu Zhang, Yong Jiang, Pengjun Xie, Fei Huang, and Huajun Chen. Omnithink: Expanding knowledge boundaries in machine writing through thinking.CoRR, abs/2501.09751, 2025

  59. [59]

    Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning

    Tian Xie, Zitian Gao, Qingnan Ren, Haoming Luo, Yuqian Hong, Bryan Dai, Joey Zhou, Kai Qiu, Zhirong Wu, and Chong Luo. Logic-rl: Unleashing LLM reasoning with rule-based reinforcement learning.CoRR, abs/2502.14768, 2025

  60. [60]

    RECOMP: improving retrieval-augmented lms with context compression and selective augmentation

    Fangyuan Xu, Weijia Shi, and Eunsol Choi. RECOMP: improving retrieval-augmented lms with context compression and selective augmentation. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024

  61. [61]

    Qwen2 Technical Report

    An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jianxin Yang, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei...

  62. [62]

    Qwen2.5 Technical Report

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu X...

  63. [63]

    Narasimhan, and Yuan Cao

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InThe Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5,

  64. [64]

    OpenReview.net, 2023

  65. [65]

    Physics of language models: Part 2.2, how to learn from mistakes on grade-school math problems

    Tian Ye, Zicheng Xu, Yuanzhi Li, and Zeyuan Allen-Zhu. Physics of language models: Part 2.2, how to learn from mistakes on grade-school math problems. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28,

  66. [66]

    OpenReview.net, 2025

  67. [67]

    Weston, and Xian Li

    Weizhe Yuan, Jane Yu, Song Jiang, Karthik Padthe, Yang Li, Dong Wang, Ilia Kulikov, Kyunghyun Cho, Yuandong Tian, Jason E. Weston, and Xian Li. Naturalreasoning: Rea- soning in the wild with 2.8m challenging questions.CoRR, abs/2502.13124, 2025

  68. [68]

    Retrieval-Augmented Generation for AI-Generated Content: A Survey

    Penghao Zhao, Hailin Zhang, Qinhan Yu, Zhengren Wang, Yunteng Geng, Fangcheng Fu, Ling Yang, Wentao Zhang, and Bin Cui. Retrieval-augmented generation for ai-generated content: A survey.CoRR, abs/2402.19473, 2024

  69. [69]

    DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments

    Yuxiang Zheng, Dayuan Fu, Xiangkun Hu, Xiaojie Cai, Lyumanshan Ye, Pengrui Lu, and Pengfei Liu. Deepresearcher: Scaling deep research via reinforcement learning in real-world environments.CoRR, abs/2504.03160, 2025

  70. [70]

    Metacognitive retrieval- augmented large language models

    Yujia Zhou, Zheng Liu, Jiajie Jin, Jian-Yun Nie, and Zhicheng Dou. Metacognitive retrieval- augmented large language models. In Tat-Seng Chua, Chong-Wah Ngo, Ravi Kumar, Hady W. Lauw, and Roy Ka-Wei Lee, editors,Proceedings of the ACM on Web Conference 2024, WWW 2024, Singapore, May 13-17, 2024, pages 1453–1463. ACM, 2024

  71. [71]

    click intent

    Yutao Zhu, Jiajie Jin, Hongjin Qian, Zheng Liu, Zhicheng Dou, and Ji-Rong Wen. Single llm, multiple roles: A unified retrieval-augmented generation framework using role-specific token optimization.CoRR, abs/2505.15444, 2025. 15 Appendix A Inference Process 17 A.1 Research-Related Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 A.2...

  72. [72]

    - Identify factual information that is relevant to the **Current Search Query** and can aid in the reasoning process for the original question

    **Analyze the Searched Web Pages:** - Carefully review the content of each searched web page. - Identify factual information that is relevant to the **Current Search Query** and can aid in the reasoning process for the original question

  73. [73]

    **More Information Seeking:** - If the information is not relevant to the query, you could:

  74. [74]

    Search again: <|begin_search_query|>another search query<|end_search_query|>

  75. [75]

    Access webpage content using: <|begin_click_link|>your URL<|end_click_link|>

  76. [76]

    **Extract Relevant Information:** - Return the relevant information from the **Searched Web Pages** that is relevant to the **Current Search Query**

  77. [77]

    {search_query}

    **Output Format:** - Present the information beginning with **Final Information** as shown below. **Final Information** [Relevant information] **Inputs:** - **Current Search Query:** {search_query} - **Detailed Search Intent:** {search_intent} - **Searched Web Pages:** {search_result} Now please analyze the web pages and extract relevant information for t...

  78. [78]

    Use web searches to gather detailed information for each point

  79. [79]

    After each search, analyze the results and determine what additional information is needed

  80. [80]

    When you have sufficient information for a section, request to write that section

Showing first 80 references.