WebThinker: Empowering Large Reasoning Models with Deep Research Capability
Pith reviewed 2026-05-16 19:09 UTC · model grok-4.3
The pith
WebThinker lets large reasoning models search the web and draft reports autonomously during reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By integrating a Deep Web Explorer module and an Autonomous Think-Search-and-Draft strategy, large reasoning models can dynamically search the web, navigate pages, extract information, and interleave these steps with reasoning and report drafting, producing more accurate and comprehensive outputs on knowledge-intensive tasks.
What carries the argument
Deep Web Explorer module, which lets the model dynamically search, navigate, and extract information from web pages when knowledge gaps arise during reasoning.
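The interleaving this module enables can be sketched in miniature. A minimal sketch, assuming a tag-based control protocol; the tag names, tool interface, and stopping rule below are our illustration, not WebThinker's actual implementation:

```python
# Hypothetical Think-Search-and-Draft loop: the model emits reasoning
# steps; <search> tags invoke the explorer, <draft>/<answer> tags
# accumulate the report. All names here are illustrative assumptions.

def think_search_draft(model, explorer, question, max_steps=20):
    """Run one research episode, interleaving reasoning, web
    exploration, and report drafting until an answer is emitted."""
    trace, report = [f"Question: {question}"], []
    for _ in range(max_steps):
        step = model.generate("\n".join(trace))  # next reasoning step
        if step.startswith("<search>"):
            query = step.removeprefix("<search>").removesuffix("</search>")
            # Deep Web Explorer: search, navigate, extract, summarize.
            trace.append(f"<result>{explorer.explore(query)}</result>")
        elif step.startswith("<draft>"):
            report.append(step.removeprefix("<draft>").removesuffix("</draft>"))
        elif step.startswith("<answer>"):
            report.append(step.removeprefix("<answer>").removesuffix("</answer>"))
            break
        else:
            trace.append(step)  # plain chain-of-thought, kept in context
    return "\n\n".join(report)
```

The point of the sketch is the control flow: search results are appended to the reasoning context, while draft fragments are routed to the report, so gathering and writing proceed in the same pass.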
If this is right
- Large reasoning models can handle knowledge-intensive tasks that require current or diverse external information.
- Report generation quality improves on scientific and complex topics as measured on Glaive and similar benchmarks.
- Tool-use reliability increases through the RL-based iterative DPO training loop.
- The overall system becomes more applicable to real-world deep research scenarios that mix reasoning with external data.
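The iterative online DPO loop mentioned above optimizes a standard preference loss over (chosen, rejected) trajectory pairs. A minimal sketch of that per-pair loss, with placeholder log-probabilities and an assumed beta of 0.1 (the paper's actual hyperparameters are not given here):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * implicit reward margin),
    where each implicit reward is the policy log-prob minus the
    reference-model log-prob for the same trajectory."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

In the iterative online variant, each round would sample tool-use trajectories with the current policy, rank them into preference pairs, minimize this loss, and then resample with the updated policy, so later rounds train on the model's own improved behavior.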
Where Pith is reading between the lines
- Grounding outputs in live web content could lower the incidence of outdated or hallucinated facts compared with purely internal knowledge.
- The same interleaving pattern might transfer to other tool sets such as code execution or database queries for different domains.
- Scaling the approach to multi-turn interactive web sessions could support longer research projects that evolve over many steps.
Load-bearing premise
The Deep Web Explorer can reliably locate, navigate, and extract accurate information from arbitrary web pages without navigation errors or factual hallucinations that affect the final output.
What would settle it
Running the system on a set of queries where web pages contain subtle contradictions or require precise navigation, then measuring the rate of factual errors in the generated reports compared to a no-search baseline.
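The proposed experiment reduces to a small comparison harness. A sketch under stated assumptions: reports are decomposed into claims, and `judge` is an assumed stand-in (a human rater or an LLM judge scoring each claim against the source pages), not something specified by the paper:

```python
# Hypothetical harness for the settling experiment: compare claim-level
# factual-error rates between the search-enabled system and a
# no-search baseline on the same adversarial query set.

def factual_error_rate(reports, judge):
    """Fraction of claims, pooled across reports, the judge marks false."""
    claims = [claim for report in reports for claim in report]
    errors = sum(1 for claim in claims if not judge(claim))
    return errors / len(claims) if claims else 0.0

def compare_error_rates(web_reports, baseline_reports, judge):
    return {
        "with_search": factual_error_rate(web_reports, judge),
        "no_search": factual_error_rate(baseline_reports, judge),
    }
```

The load-bearing premise survives only if "with_search" comes out meaningfully lower than "no_search" on queries whose pages contain subtle contradictions.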
Original abstract
Large reasoning models (LRMs), such as OpenAI-o1 and DeepSeek-R1, demonstrate impressive long-horizon reasoning capabilities. However, their reliance on static internal knowledge limits their performance on complex, knowledge-intensive tasks and hinders their ability to produce comprehensive research reports requiring synthesis of diverse web information. To address this, we propose WebThinker, a deep research agent that empowers LRMs to autonomously search the web, navigate among web pages, and draft reports during the reasoning process. WebThinker integrates a Deep Web Explorer module, enabling LRMs to dynamically search, navigate, and extract information from the web when encountering knowledge gaps. It also employs an Autonomous Think-Search-and-Draft strategy, allowing the model to seamlessly interleave reasoning, information gathering, and report writing in real time. To further enhance research tool utilization, we introduce an RL-based training strategy via iterative online Direct Preference Optimization (DPO). Extensive experiments on complex reasoning benchmarks (GPQA, GAIA, WebWalkerQA, HLE) and scientific report generation tasks (Glaive) demonstrate that WebThinker significantly outperforms existing methods and strong proprietary systems. Our approach enhances LRM reliability and applicability in complex scenarios, paving the way for more capable and versatile deep research systems. The code is available at https://github.com/RUC-NLPIR/WebThinker.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes WebThinker, an agent that augments large reasoning models (LRMs) with a Deep Web Explorer module for dynamic web search, navigation, and information extraction. It introduces an interleaved Autonomous Think-Search-and-Draft strategy and an RL-based iterative online DPO training procedure to improve tool use. Experiments claim substantial gains over baselines and proprietary systems on GPQA, GAIA, WebWalkerQA, HLE, and Glaive scientific report generation, with code released at the provided GitHub link.
Significance. If the reported gains prove robust to error analysis, the work would meaningfully advance LRM-based research agents by demonstrating practical integration of web-scale retrieval into long-horizon reasoning. The code release is a clear positive for reproducibility.
Major comments (2)
- [Experiments / §4] The central performance claims on GPQA, GAIA, WebWalkerQA, HLE, and Glaive rest on the assumption that the Deep Web Explorer (and the interleaved Think-Search-and-Draft loop) can locate, traverse, and extract accurate information without injecting navigation failures or factual hallucinations that propagate into final outputs. No quantitative breakdown of navigation success rate, extraction precision, or error-propagation analysis appears in the experimental results or ablations, leaving open the possibility that measured gains are artifacts of the particular web snapshot or base LRM rather than evidence of improved research capability.
- [Training / §3.3] The RL-based DPO training is described as improving tool utilization, yet the manuscript supplies no ablation isolating the contribution of the online DPO stage versus the base LRM or the Deep Web Explorer alone. Without such controls, it is unclear whether the reported outperformance is load-bearing on the training procedure or on the underlying model scale.
Minor comments (2)
- [Abstract] The abstract and introduction would benefit from a concise statement of the exact number of model calls or tokens used per benchmark instance to allow direct comparison with prior agent baselines.
- [Figures] Figure captions for the system overview and example trajectories should explicitly label the Think/Search/Draft states to improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address the major comments point by point below and will revise the manuscript to incorporate the requested analyses, which we agree will strengthen the paper.
Point-by-point responses
Referee: [Experiments / §4] The central performance claims on GPQA, GAIA, WebWalkerQA, HLE, and Glaive rest on the assumption that the Deep Web Explorer (and the interleaved Think-Search-and-Draft loop) can locate, traverse, and extract accurate information without injecting navigation failures or factual hallucinations that propagate into final outputs. No quantitative breakdown of navigation success rate, extraction precision, or error-propagation analysis appears in the experimental results or ablations, leaving open the possibility that measured gains are artifacts of the particular web snapshot or base LRM rather than evidence of improved research capability.
Authors: We agree that a quantitative error analysis is important for validating the robustness of the reported gains. In the revised manuscript we will add a dedicated subsection in §4 that reports navigation success rates (percentage of successful page retrievals and traversals), extraction precision (fact-level accuracy of extracted content against reference sources), and an error-propagation study tracing how navigation or extraction failures affect final benchmark scores. We will also include representative failure cases and their frequency across the evaluated benchmarks.
Revision: yes
Referee: [Training / §3.3] The RL-based DPO training is described as improving tool utilization, yet the manuscript supplies no ablation isolating the contribution of the online DPO stage versus the base LRM or the Deep Web Explorer alone. Without such controls, it is unclear whether the reported outperformance is load-bearing on the training procedure or on the underlying model scale.
Authors: We acknowledge that the current ablations do not fully isolate the online DPO stage. In the revised manuscript we will add controlled experiments in §4 that compare three settings: (1) the base LRM without any web tools, (2) the LRM equipped with the Deep Web Explorer but trained only with supervised fine-tuning (no DPO), and (3) the full WebThinker pipeline with iterative online DPO. These results will quantify the incremental benefit attributable to the RL-based DPO procedure.
Revision: yes
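The three settings promised in this response amount to a small ablation grid. A sketch under stated assumptions: the configuration names are ours, and `evaluate` is a placeholder for the benchmark harness, not the authors' code:

```python
# Hypothetical ablation grid isolating the Deep Web Explorer and the
# online DPO stage. Configuration names and the evaluate() hook are
# illustrative assumptions.

ABLATIONS = [
    {"name": "base_lrm",        "web_explorer": False, "dpo": False},
    {"name": "explorer_sft",    "web_explorer": True,  "dpo": False},
    {"name": "full_webthinker", "web_explorer": True,  "dpo": True},
]

def run_ablations(evaluate, benchmarks=("GPQA", "GAIA", "WebWalkerQA", "HLE")):
    """Return {config_name: {benchmark: score}}; the DPO increment is
    then full_webthinker minus explorer_sft on each benchmark."""
    return {cfg["name"]: {b: evaluate(cfg, b) for b in benchmarks}
            for cfg in ABLATIONS}
```

Holding the base model fixed across all three rows is what lets the table attribute gains to the training procedure rather than to model scale.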
Circularity Check
No circularity: empirical system paper with no derivations or self-referential reductions
Full rationale
The manuscript describes an agent architecture (Deep Web Explorer + interleaved Think-Search-and-Draft loop + RL-based DPO) and reports benchmark results on GPQA, GAIA, WebWalkerQA, HLE, and Glaive. No equations, first-principles derivations, or parameter-fitting steps are present that could reduce a claimed prediction to its own inputs by construction. Performance claims rest on external, publicly referenced benchmarks and released code rather than any self-citation chain or ansatz smuggled through prior work. The training procedure is presented as a standard application of online DPO to tool-use data; nothing in the provided text indicates that measured gains are forced by the definition of the method itself.
Forward citations
Cited by 18 Pith papers
- AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery. AutoResearchBench is a new benchmark showing top AI agents achieve under 10% success on complex scientific literature discovery tasks that demand deep comprehension and open-ended search.
- Learning Agentic Policy from Action Guidance. ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.
- Controllable and Verifiable Tool-Use Data Synthesis for Agentic Reinforcement Learning. COVERT generates verifiable synthetic tool-use environments for RL by validated trajectory synthesis and oracle-preserving augmentations, improving tool-use accuracy on BFCL v3 and ACEBench while remaining complementa...
- GeoBrowse: A Geolocation Benchmark for Agentic Tool Use with Expert-Annotated Reasoning Traces. GeoBrowse is a two-level geolocation benchmark combining visual cue composition with knowledge-intensive multi-hop queries, paired with the GATE agent workflow that outperforms no-tool, search-only, and image-only baselines.
- Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence. Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.
- Towards Knowledgeable Deep Research: Framework and Benchmark. The paper introduces the KDR task, HKA multi-agent framework, and KDR-Bench to enable LLM agents to integrate structured knowledge into deep research reports, with experiments showing outperformance over prior agents.
- TimelineReasoner: Advancing Timeline Summarization with Large Reasoning Models. TimelineReasoner applies large reasoning models in a Global Cognition plus Detail Exploration loop to produce more accurate, complete, and coherent timelines from news than prior LLM-based methods.
- Procedural Knowledge at Scale Improves Reasoning. Reasoning Memory decomposes reasoning trajectories into 32 million subquestion-subroutine pairs and retrieves them via in-thought prompts to improve language model performance on math, science, and coding benchmarks b...
- KG-Hopper: Empowering Compact Open LLMs with Knowledge Graph Reasoning via Reinforcement Learning. KG-Hopper uses RL to embed full multi-hop KG traversal and backtracking into a single LLM inference round, enabling a 7B model to outperform larger multi-step systems and compete with GPT-3.5/GPT-4o-mini on eight benchmarks.
- DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents. DeepResearch Bench supplies 100 expert-crafted PhD-level tasks and two human-aligned evaluation frameworks to measure deep research agents on report quality and citation accuracy.
- ViDR: Grounding Multimodal Deep Research Reports in Source Visual Evidence. ViDR treats source figures as retrievable and verifiable evidence objects in multimodal deep research reports and introduces MMR Bench+ to measure improvements in visual integration and verifiability.
- Personalized Deep Research: A User-Centric Framework, Dataset, and Hybrid Evaluation for Knowledge Discovery. PDR is a user-context-aware framework for LLM research agents that improves report relevance over static baselines, supported by a new dataset and hybrid evaluation.
- Web2BigTable: A Bi-Level Multi-Agent LLM System for Internet-Scale Information Search and Extraction. Web2BigTable introduces a bi-level multi-agent system that achieves new state-of-the-art results on wide-coverage and deep web-to-table search benchmarks through orchestration, coordination, and closed-loop reflection.
- ProMMSearchAgent: A Generalizable Multimodal Search Agent Trained with Process-Oriented Rewards. A sandbox-trained multimodal search agent with process-oriented rewards transfers zero-shot to real Google Search and outperforms prior methods on FVQA, InfoSeek, and MMSearch.
- Bias in the Loop: Auditing LLM-as-a-Judge for Software Engineering. LLM judges for code tasks show high sensitivity to prompt biases that systematically favor certain options, changing accuracy and model rankings even when code is unchanged.
- A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems. A comprehensive review of self-evolving AI agents that improve themselves over time, organized via a framework of inputs, agent system, environment, and optimizers, with domain-specific and safety discussions.
- A Brief Overview: Agentic Reinforcement Learning In Large Language Models. This review synthesizes conceptual foundations, methods, challenges, and future directions for agentic reinforcement learning in large language models.
- A Brief Overview: Agentic Reinforcement Learning In Large Language Models. The paper surveys the conceptual foundations, methodological innovations, challenges, and future directions of agentic reinforcement learning frameworks that embed cognitive capabilities like meta-reasoning and self-r...
Reference graph
Works this paper leans on
-
[1]
Self-rag: Learn- ing to retrieve, generate, and critique through self-reflection
Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-rag: Learn- ing to retrieve, generate, and critique through self-reflection. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. Open- Review.net, 2024
work page 2024
-
[2]
ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning
Mingyang Chen, Tianpeng Li, Haoze Sun, Yijie Zhou, Chenzheng Zhu, Haofen Wang, Jeff Z. Pan, Wen Zhang, Huajun Chen, Fan Yang, Zenan Zhou, and Weipeng Chen. Research: Learning to reason with search for llms via reinforcement learning.CoRR, abs/2503.19470, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[3]
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, and Wanxiang Che. Towards reasoning era: A survey of long chain-of-thought for reasoning large language models.CoRR, abs/2503.09567, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[4]
An empirical study on eliciting and improving r1-like reasoning models.CoRR, abs/2503.04548, 2025
Zhipeng Chen, Yingqian Min, Beichen Zhang, Jie Chen, Jinhao Jiang, Daixuan Cheng, Wayne Xin Zhao, Zheng Liu, Xu Miao, Yang Lu, Lei Fang, Zhongyuan Wang, and Ji-Rong Wen. An empirical study on eliciting and improving r1-like reasoning models.CoRR, abs/2503.04548, 2025
-
[5]
Guanting Dong, Keming Lu, Chengpeng Li, Tingyu Xia, Bowen Yu, Chang Zhou, and Jingren Zhou. Self-play with execution feedback: Improving instruction-following capabilities of large language models. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025
work page 2025
-
[6]
Toward verifiable instruction-following alignment for retrieval augmented generation
Guanting Dong, Xiaoshuai Song, Yutao Zhu, Runqi Qiao, Zhicheng Dou, and Ji-Rong Wen. Toward verifiable instruction-following alignment for retrieval augmented generation. In Toby Walsh, Julie Shah, and Zico Kolter, editors,AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25 - March 4, 2025, Philadelphia, PA, U...
work page 2025
-
[7]
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[8]
ReTool: Reinforcement Learning for Strategic Tool Use in LLMs
Jiazhan Feng, Shijue Huang, Xingwei Qu, Ge Zhang, Yujia Qin, Baoquan Zhong, Chengquan Jiang, Jinxin Chi, and Wanjun Zhong. Retool: Reinforcement learning for strategic tool use in llms.CoRR, abs/2504.11536, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[9]
Reasoning beyond limits: Advances and open problems for llms.CoRR, abs/2503.22732, 2025
Mohamed Amine Ferrag, Norbert Tihanyi, and Mérouane Debbah. Reasoning beyond limits: Advances and open problems for llms.CoRR, abs/2503.22732, 2025
-
[10]
Gemini. Gemini deep research. https://gemini.google/overview/ deep-research, 2025
work page 2025
-
[11]
Glaive. reasoning-v1-20m. https://huggingface.co/datasets/glaiveai/ reasoning-v1-20m, 2025
work page 2025
-
[12]
Grok 3 beta — the age of reasoning agents.https://x.ai/news/grok-3, 2025
Grok. Grok 3 beta — the age of reasoning agents.https://x.ai/news/grok-3, 2025
work page 2025
-
[13]
Deeprag: Thinking to retrieval step by step for large language models
Xinyan Guan, Jiali Zeng, Fandong Meng, Chunlei Xin, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun, and Jie Zhou. Deeprag: Thinking to retrieval step by step for large language models. CoRR, abs/2502.01142, 2025. 10
-
[14]
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025
work page 2025
-
[15]
Scaling Laws for Autoregressive Generative Modeling
Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B Brown, Prafulla Dhariwal, Scott Gray, et al. Scaling laws for autoregressive generative modeling.arXiv preprint arXiv:2010.14701, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[16]
Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model
Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model.CoRR, abs/2503.24290, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[17]
Yunhai Hu, Yilun Zhao, Chen Zhao, and Arman Cohan. MCTS-RAG: enhancing retrieval- augmented generation with monte carlo tree search.CoRR, abs/2503.20757, 2025
-
[18]
Qwen2.5-Coder Technical Report
Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Kai Dang, An Yang, Rui Men, Fei Huang, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, and Junyang Lin. Qwen2.5-coder technical report.CoRR, abs/2409.12186, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[19]
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[20]
Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[21]
Rag-star: Enhancing deliberative reasoning with retrieval augmented verification and refinement
Jinhao Jiang, Jiayi Chen, Junyi Li, Ruiyang Ren, Shijie Wang, Xin Zhao, Yang Song, and Tao Zhang. Rag-star: Enhancing deliberative reasoning with retrieval augmented verification and refinement. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors,Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational...
work page 2025
-
[22]
Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning
Bowen Jin, Hansi Zeng, Zhenrui Yue, Dong Wang, Hamed Zamani, and Jiawei Han. Search- r1: Training llms to reason and leverage search engines with reinforcement learning.CoRR, abs/2503.09516, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[23]
Jiajie Jin, Yutao Zhu, Yujia Zhou, and Zhicheng Dou. BIDER: bridging knowledge inconsistency for efficient retrieval-augmented llms via key supporting evidence. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024, pages 750–...
work page 2024
-
[24]
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in Neural Information Processing Systems, 33:9459–9474, 2020
work page 2020
-
[25]
Jia LI, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Costa Huang, Kashif Rasul, Longhui Yu, Albert Jiang, Ziju Shen, Zihan Qin, Bin Dong, Li Zhou, Yann Fleureau, Guillaume Lample, and Stanislas Polu. Numi- namath. [https://huggingface.co/AI-MO/NuminaMath-1.5](https: //github.com/project-numina/aimo-progress-prize/blob/main/ report...
work page 2024
-
[26]
Search-o1: Agentic Search-Enhanced Large Reasoning Models
Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou. Search-o1: Agentic search-enhanced large reasoning models.CoRR, abs/2501.05366, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[27]
Torl: Scaling tool-integrated RL.CoRR, abs/2503.23383, 2025
Xuefeng Li, Haoyang Zou, and Pengfei Liu. Torl: Scaling tool-integrated RL.CoRR, abs/2503.23383, 2025. 11
-
[28]
From System 1 to System 2: A Survey of Reasoning Large Language Models
Zhong-Zhi Li, Duzhen Zhang, Ming-Liang Zhang, Jiaxin Zhang, Zengyan Liu, Yuxuan Yao, Haotian Xu, Junhao Zheng, Pei-Jie Wang, Xiuyi Chen, Yingying Zhang, Fei Yin, Jiahua Dong, Zhijiang Guo, Le Song, and Cheng-Lin Liu. From system 1 to system 2: A survey of reasoning large language models.CoRR, abs/2502.17419, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[29]
Zhuoqun Li, Haiyang Yu, Xuanang Chen, Hongyu Lin, Yaojie Lu, Fei Huang, Xianpei Han, Yongbin Li, and Le Sun. Deepsolution: Boosting complex engineering solution design via tree-based exploration and bi-point thinking. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the Assoc...
work page 2025
-
[30]
How much can RAG help the reasoning of llm?CoRR, abs/2410.02338, 2024
Jingyu Liu, Jiaen Lin, and Yong Liu. How much can RAG help the reasoning of llm?CoRR, abs/2410.02338, 2024
-
[31]
Query rewriting for retrieval-augmented large language models.CoRR, abs/2305.14283, 2023
Xinbei Ma, Yeyun Gong, Pengcheng He, Hai Zhao, and Nan Duan. Query rewriting for retrieval-augmented large language models.CoRR, abs/2305.14283, 2023
-
[32]
GAIA: a benchmark for general AI assistants
Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: a benchmark for general AI assistants. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024
work page 2024
-
[33]
Yingqian Min, Zhipeng Chen, Jinhao Jiang, Jie Chen, Jia Deng, Yiwen Hu, Yiru Tang, Jiapeng Wang, Xiaoxue Cheng, Huatong Song, Wayne Xin Zhao, Zheng Liu, Zhongyuan Wang, and Ji-Rong Wen. Imitate, explore, and self-improve: A reproduction report on slow-thinking reasoning systems.CoRR, abs/2412.09413, 2024
-
[34]
OpenAI. Learning to reason with llms. https://openai.com/index/ learning-to-reason-with-llms, September 2024
work page 2024
-
[35]
OpenAI. Introducing deep research. https://openai.com/index/ introducing-deep-research, 2025
work page 2025
-
[36]
OpenAI. Openai o3-mini. https://openai.com/index/openai-o3-mini, January 2025
work page 2025
-
[37]
Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Sean Shi, Michael Choi, Anish Agrawal, Arnav Chopra, Adam Khoja, Ryan Kim, Jason Hausenloy, Oliver Zhang, Mantas Mazeika, Daron Anderson, Tung Nguyen, Mobeen Mahmood, Fiona Feng, Steven Y . Feng, Haoran Zhao, Michael Yu, Varun Gangal, Chelsea Zou, Zihan Wang, Jessica P. Wang, Pawan...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[38]
ToolRL: Reward is All Tool Learning Needs
Cheng Qian, Emre Can Acikgoz, Qi He, Hongru Wang, Xiusi Chen, Dilek Hakkani-Tür, Gokhan Tur, and Heng Ji. Toolrl: Reward is all tool learning needs.CoRR, abs/2504.13958, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[39]
O1 replication journey: A strategic progress report–part 1
Yiwei Qin, Xuefeng Li, Haoyang Zou, Yixiu Liu, Shijie Xia, Zhen Huang, Yixin Ye, Weizhe Yuan, Hector Liu, Yuanzhi Li, et al. O1 replication journey: A strategic progress report–part 1. arXiv preprint arXiv:2410.18982, 2024. 12
-
[40]
Manning, Stefano Ermon, and Chelsea Finn
Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors,Advances in Neural Information Processing Systems 36: Annual Conference o...
work page 2023
-
[41]
David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google-proof q&a benchmark.CoRR, abs/2311.12022, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[42]
Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy
Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023, pages 9248–9274. Associ...
work page 2023
-
[43]
R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning
Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, and Ji-Rong Wen. R1-searcher: Incentivizing the search capability in llms via reinforcement learning.CoRR, abs/2503.05592, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[44]
Jiejun Tan, Zhicheng Dou, Yutao Zhu, Peidong Guo, Kun Fang, and Ji-Rong Wen. Small models, big insights: Leveraging slim proxy models to decide when and what to retrieve for llms. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 20...
work page 2024
-
[45]
Xinyu Tang, Xiaolei Wang, Zhihao Lv, Yingqian Min, Xin Zhao, Binbin Hu, Ziqi Liu, and Zhiqiang Zhang. Unlocking general long chain-of-thought reasoning capabilities of large language models via representation engineering. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the A...
work page 2025
-
[46]
M.-A-P. Team, Xinrun Du, Yifan Yao, Kaijing Ma, Bingli Wang, Tianyu Zheng, Kang Zhu, Minghao Liu, Yiming Liang, Xiaolong Jin, Zhenlin Wei, Chujie Zheng, Kaixin Deng, Shian Jia, Sichao Jiang, Yiyan Liao, Rui Li, Qinrui Li, Sirun Li, Yizhi Li, Yunwen Li, Dehua Ma, Yuansheng Ni, Haoran Que, Qiyao Wang, Zhoufutu Wen, Siwei Wu, Tianshun Xing, Ming Xu, Zhenzhu ...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[47]
Open Thoughts.https://open-thoughts.ai, January 2025
OpenThoughts Team. Open Thoughts.https://open-thoughts.ai, January 2025
work page 2025
-
[48]
Qwq: Reflect deeply on the boundaries of the unknown.Hugging Face, 2024
Qwen Team. Qwq: Reflect deeply on the boundaries of the unknown.Hugging Face, 2024
work page 2024
-
[49]
Crawl4ai: Open-source llm friendly web crawler & scraper
UncleCode. Crawl4ai: Open-source llm friendly web crawler & scraper. https://github. com/unclecode/crawl4ai, 2024
work page 2024
-
[50]
OTC: optimal tool calls via reinforcement learning.CoRR, abs/2504.14870, 2025
Hongru Wang, Cheng Qian, Wanjun Zhong, Xiusi Chen, Jiahao Qiu, Shijue Huang, Bowen Jin, Mengdi Wang, Kam-Fai Wong, and Heng Ji. OTC: optimal tool calls via reinforcement learning.CoRR, abs/2504.14870, 2025. 13
-
[51]
Chain- of-retrieval augmented generation.CoRR, abs/2501.14342, 2025
Liang Wang, Haonan Chen, Nan Yang, Xiaolong Huang, Zhicheng Dou, and Furu Wei. Chain- of-retrieval augmented generation.CoRR, abs/2501.14342, 2025
-
[52]
Query2doc: Query expansion with large language models
Liang Wang, Nan Yang, and Furu Wei. Query2doc: Query expansion with large language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 9414–9423. Association for Computational Linguistics, 2023
work page 2023
-
[53]
RARE: retrieval-augmented reasoning modeling.CoRR, abs/2503.23513, 2025
Zhengren Wang, Jiayang Yu, Dongsheng Ma, Zhe Chen, Yu Wang, Zhiyu Li, Feiyu Xiong, Yanfeng Wang, Weinan E, Linpeng Tang, and Wentao Zhang. RARE: retrieval-augmented reasoning modeling.CoRR, abs/2503.23513, 2025
-
[54]
RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning
Zihan Wang, Kangrui Wang, Qineng Wang, Pingyue Zhang, Linjie Li, Zhengyuan Yang, Xing Jin, Kefan Yu, Minh Nhat Nguyen, Licheng Liu, Eli Gottlieb, Yiping Lu, Kyunghyun Cho, Jiajun Wu, Li Fei-Fei, Lijuan Wang, Yejin Choi, and Manling Li. RAGEN: understanding self-evolution in LLM agents via multi-turn reinforcement learning.CoRR, abs/2504.20073, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[55]
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V . Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors,Advances in Neural Information Processing Systems 35: Annual Conference on Neura...
work page 2022
-
[56]
WebWalker: Benchmarking LLMs in web traversal
Jialong Wu, Wenbiao Yin, Yong Jiang, Zhenglin Wang, Zekun Xi, Runnan Fang, Linhai Zhang, Yulan He, Deyu Zhou, Pengjun Xie, and Fei Huang. Webwalker: Benchmarking LLMs in web traversal. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, ACL 2025. Association for Computational Linguistics, 2025
-
[57]
Self-play preference optimization for language model alignment
Yue Wu, Zhiqing Sun, Huizhuo Yuan, Kaixuan Ji, Yiming Yang, and Quanquan Gu. Self-play preference optimization for language model alignment. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025
-
[58]
Zekun Xi, Wenbiao Yin, Jizhan Fang, Jialong Wu, Runnan Fang, Ningyu Zhang, Yong Jiang, Pengjun Xie, Fei Huang, and Huajun Chen. Omnithink: Expanding knowledge boundaries in machine writing through thinking. CoRR, abs/2501.09751, 2025
-
[59]
Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning
Tian Xie, Zitian Gao, Qingnan Ren, Haoming Luo, Yuqian Hong, Bryan Dai, Joey Zhou, Kai Qiu, Zhirong Wu, and Chong Luo. Logic-RL: Unleashing LLM reasoning with rule-based reinforcement learning. CoRR, abs/2502.14768, 2025
-
[60]
RECOMP: improving retrieval-augmented lms with context compression and selective augmentation
Fangyuan Xu, Weijia Shi, and Eunsol Choi. RECOMP: Improving retrieval-augmented LMs with context compression and selective augmentation. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024
-
[61]
An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report. CoRR, abs/2407.10671, 2024
-
[62]
An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. CoRR, abs/2412.15115, 2024
-
[63]
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023
-
[65]
Physics of language models: Part 2.2, how to learn from mistakes on grade-school math problems
Tian Ye, Zicheng Xu, Yuanzhi Li, and Zeyuan Allen-Zhu. Physics of language models: Part 2.2, how to learn from mistakes on grade-school math problems. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025
-
[67]
Weizhe Yuan, Jane Yu, Song Jiang, Karthik Padthe, Yang Li, Dong Wang, Ilia Kulikov, Kyunghyun Cho, Yuandong Tian, Jason E. Weston, and Xian Li. NaturalReasoning: Reasoning in the wild with 2.8M challenging questions. CoRR, abs/2502.13124, 2025
-
[68]
Retrieval-Augmented Generation for AI-Generated Content: A Survey
Penghao Zhao, Hailin Zhang, Qinhan Yu, Zhengren Wang, Yunteng Geng, Fangcheng Fu, Ling Yang, Wentao Zhang, and Bin Cui. Retrieval-augmented generation for AI-generated content: A survey. CoRR, abs/2402.19473, 2024
-
[69]
DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments
Yuxiang Zheng, Dayuan Fu, Xiangkun Hu, Xiaojie Cai, Lyumanshan Ye, Pengrui Lu, and Pengfei Liu. DeepResearcher: Scaling deep research via reinforcement learning in real-world environments. CoRR, abs/2504.03160, 2025
-
[70]
Metacognitive retrieval-augmented large language models
Yujia Zhou, Zheng Liu, Jiajie Jin, Jian-Yun Nie, and Zhicheng Dou. Metacognitive retrieval-augmented large language models. In Tat-Seng Chua, Chong-Wah Ngo, Ravi Kumar, Hady W. Lauw, and Roy Ka-Wei Lee, editors, Proceedings of the ACM on Web Conference 2024, WWW 2024, Singapore, May 13-17, 2024, pages 1453–1463. ACM, 2024
-
[71]
Yutao Zhu, Jiajie Jin, Hongjin Qian, Zheng Liu, Zhicheng Dou, and Ji-Rong Wen. Single LLM, multiple roles: A unified retrieval-augmented generation framework using role-specific token optimization. CoRR, abs/2505.15444, 2025
-
Appendix A Inference Process (prompt excerpt)

1. **Analyze the Searched Web Pages:**
- Carefully review the content of each searched web page.
- Identify factual information that is relevant to the **Current Search Query** and can aid in the reasoning process for the original question.
2. **More Information Seeking:**
- If the information is not relevant to the query, you could:
  - Search again: <|begin_search_query|>another search query<|end_search_query|>
  - Access webpage content using: <|begin_click_link|>your URL<|end_click_link|>
3. **Extract Relevant Information:**
- Return the information from the **Searched Web Pages** that is relevant to the **Current Search Query**.
4. **Output Format:**
- Present the information beginning with **Final Information** as shown below.

**Final Information**
[Relevant information]

**Inputs:**
- **Current Search Query:** {search_query}
- **Detailed Search Intent:** {search_intent}
- **Searched Web Pages:** {search_result}

Now please analyze the web pages and extract relevant information for the current search query.

Report-drafting instructions (excerpt):
- Use web searches to gather detailed information for each point.
- After each search, analyze the results and determine what additional information is needed.
- When you have sufficient information for a section, request to write that section.
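The search and click markers in the prompt excerpt above define an inline tool-call protocol: the model emits special tokens mid-reasoning, and a controller intercepts them, executes the search or page visit, and feeds results back. A minimal sketch of how a controller might parse these tool calls from a generation; the regexes, function name, and return shape here are illustrative assumptions, not the paper's implementation:

```python
import re

# Special token pairs from the Deep Web Explorer prompt (assumed verbatim).
SEARCH_RE = re.compile(r"<\|begin_search_query\|>(.*?)<\|end_search_query\|>", re.S)
CLICK_RE = re.compile(r"<\|begin_click_link\|>(.*?)<\|end_click_link\|>", re.S)

def parse_tool_calls(generation: str):
    """Extract search queries and clicked URLs emitted during reasoning.

    Returns (action, argument) tuples in the order they appear in the text,
    so the controller can execute them and splice results back into context.
    """
    calls = []
    for m in SEARCH_RE.finditer(generation):
        calls.append(("search", m.group(1).strip(), m.start()))
    for m in CLICK_RE.finditer(generation):
        calls.append(("click", m.group(1).strip(), m.start()))
    # Sort by position so execution order follows the reasoning order.
    calls.sort(key=lambda c: c[2])
    return [(action, arg) for action, arg, _ in calls]
```

In a full loop the controller would run each `search` against a search API and each `click` through a page fetcher, then resume generation with the **Final Information** block appended; that outer loop is omitted here.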