arxiv: 2504.21776 · v2 · submitted 2025-04-30 · 💻 cs.CL · cs.AI· cs.IR

Recognition: no theorem link

WebThinker: Empowering Large Reasoning Models with Deep Research Capability

Xiaoxi Li , Jiajie Jin , Guanting Dong , Hongjin Qian , Yongkang Wu , Ji-Rong Wen , Yutao Zhu , Zhicheng Dou

Authors on Pith no claims yet

Pith reviewed 2026-05-16 19:09 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.IR

keywords large reasoning modelsweb searchdeep research agentautonomous report draftingtool usereasoning benchmarksDPO training

0 comments

The pith

WebThinker lets large reasoning models search the web and draft reports autonomously during reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large reasoning models perform well on long-horizon tasks but hit limits when they need fresh or detailed external knowledge. WebThinker adds a Deep Web Explorer so the model can search, navigate pages, and pull information exactly when a gap appears. An Autonomous Think-Search-and-Draft loop lets it switch between thinking, gathering facts, and writing the report in one continuous process. Iterative online DPO training improves how effectively the model uses these tools. On benchmarks such as GPQA, GAIA, WebWalkerQA, HLE, and Glaive report generation, the system outperforms prior open methods and strong proprietary baselines.

Core claim

By integrating a Deep Web Explorer module and an Autonomous Think-Search-and-Draft strategy, large reasoning models can dynamically search the web, navigate pages, extract information, and interleave these steps with reasoning and report drafting, producing more accurate and comprehensive outputs on knowledge-intensive tasks.

What carries the argument

Deep Web Explorer module, which lets the model dynamically search, navigate, and extract information from web pages when knowledge gaps arise during reasoning.

If this is right

Large reasoning models can handle knowledge-intensive tasks that require current or diverse external information.
Report generation quality improves on scientific and complex topics as measured on Glaive and similar benchmarks.
Tool-use reliability increases through the RL-based iterative DPO training loop.
The overall system becomes more applicable to real-world deep research scenarios that mix reasoning with external data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Grounding outputs in live web content could lower the incidence of outdated or hallucinated facts compared with purely internal knowledge.
The same interleaving pattern might transfer to other tool sets such as code execution or database queries for different domains.
Scaling the approach to multi-turn interactive web sessions could support longer research projects that evolve over many steps.

Load-bearing premise

The Deep Web Explorer can reliably locate, navigate, and extract accurate information from arbitrary web pages without navigation errors or factual hallucinations that affect the final output.

What would settle it

Running the system on a set of queries where web pages contain subtle contradictions or require precise navigation, then measuring the rate of factual errors in the generated reports compared to a no-search baseline.

read the original abstract

Large reasoning models (LRMs), such as OpenAI-o1 and DeepSeek-R1, demonstrate impressive long-horizon reasoning capabilities. However, their reliance on static internal knowledge limits their performance on complex, knowledge-intensive tasks and hinders their ability to produce comprehensive research reports requiring synthesis of diverse web information. To address this, we propose WebThinker, a deep research agent that empowers LRMs to autonomously search the web, navigate among web pages, and draft reports during the reasoning process. WebThinker integrates a Deep Web Explorer module, enabling LRMs to dynamically search, navigate, and extract information from the web when encountering knowledge gaps. It also employs an Autonomous Think-Search-and-Draft strategy, allowing the model to seamlessly interleave reasoning, information gathering, and report writing in real time. To further enhance research tool utilization, we introduce an RL-based training strategy via iterative online Direct Preference Optimization (DPO). Extensive experiments on complex reasoning benchmarks (GPQA, GAIA, WebWalkerQA, HLE) and scientific report generation tasks (Glaive) demonstrate that WebThinker significantly outperforms existing methods and strong proprietary systems. Our approach enhances LRM reliability and applicability in complex scenarios, paving the way for more capable and versatile deep research systems. The code is available at https://github.com/RUC-NLPIR/WebThinker.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

WebThinker adds a live web explorer and interleaved think-search-draft loop to LRMs with code release and benchmark gains, but navigation error rates stay unmeasured.

read the letter

The paper's main move is to give large reasoning models real-time web access inside their reasoning trace. WebThinker runs a Deep Web Explorer that searches, clicks through pages, and pulls text when the model spots a knowledge gap, all inside an Autonomous Think-Search-and-Draft loop that keeps reasoning, gathering, and report drafting in one continuous process. They add iterative online DPO to push the model toward better tool choices over time. The abstract reports clear lifts on GPQA, GAIA, WebWalkerQA, HLE, and the Glaive report task, and the code is out on GitHub, which is the most concrete thing they ship. That combination of architecture and training recipe looks new relative to the web-agent papers they cite. The evaluation setup is straightforward enough that a reader can at least try to reproduce the headline numbers. The soft spot is the one the stress-test note flags: there is no reported breakdown of how often the explorer lands on the right pages, extracts clean facts, or injects navigation mistakes that later show up in the final report. Without those numbers or an error-propagation check, it is hard to separate real capability gains from test-set artifacts or from the base LRM's own strengths. The central claim still holds up on its own terms because the benchmarks are external and the code is public, but the reliability story is thinner than the performance numbers suggest. This is for groups already working on tool-augmented agents and anyone who wants to test whether live search can be folded into long-horizon reasoning without breaking the trace. It is worth sending to peer review because the idea is practical, the code is available, and the experiments are on standard suites even if they need more ablation detail.

Referee Report

2 major / 2 minor

Summary. The paper proposes WebThinker, an agent that augments large reasoning models (LRMs) with a Deep Web Explorer module for dynamic web search, navigation, and information extraction. It introduces an interleaved Autonomous Think-Search-and-Draft strategy and an RL-based iterative online DPO training procedure to improve tool use. Experiments claim substantial gains over baselines and proprietary systems on GPQA, GAIA, WebWalkerQA, HLE, and Glaive scientific report generation, with code released at the provided GitHub link.

Significance. If the reported gains prove robust to error analysis, the work would meaningfully advance LRM-based research agents by demonstrating practical integration of web-scale retrieval into long-horizon reasoning. The code release is a clear positive for reproducibility.

major comments (2)

[Experiments / §4] The central performance claims on GPQA, GAIA, WebWalkerQA, HLE, and Glaive rest on the assumption that the Deep Web Explorer (and the interleaved Think-Search-and-Draft loop) can locate, traverse, and extract accurate information without injecting navigation failures or factual hallucinations that propagate into final outputs. No quantitative breakdown of navigation success rate, extraction precision, or error-propagation analysis appears in the experimental results or ablations, leaving open the possibility that measured gains are artifacts of the particular web snapshot or base LRM rather than evidence of improved research capability.
[Training / §3.3] The RL-based DPO training is described as improving tool utilization, yet the manuscript supplies no ablation isolating the contribution of the online DPO stage versus the base LRM or the Deep Web Explorer alone. Without such controls, it is unclear whether the reported outperformance is load-bearing on the training procedure or on the underlying model scale.

minor comments (2)

[Abstract] The abstract and introduction would benefit from a concise statement of the exact number of model calls or tokens used per benchmark instance to allow direct comparison with prior agent baselines.
[Figures] Figure captions for the system overview and example trajectories should explicitly label the Think/Search/Draft states to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address the major comments point by point below and will revise the manuscript to incorporate the requested analyses, which we agree will strengthen the paper.

read point-by-point responses

Referee: [Experiments / §4] The central performance claims on GPQA, GAIA, WebWalkerQA, HLE, and Glaive rest on the assumption that the Deep Web Explorer (and the interleaved Think-Search-and-Draft loop) can locate, traverse, and extract accurate information without injecting navigation failures or factual hallucinations that propagate into final outputs. No quantitative breakdown of navigation success rate, extraction precision, or error-propagation analysis appears in the experimental results or ablations, leaving open the possibility that measured gains are artifacts of the particular web snapshot or base LRM rather than evidence of improved research capability.

Authors: We agree that a quantitative error analysis is important for validating the robustness of the reported gains. In the revised manuscript we will add a dedicated subsection in §4 that reports navigation success rates (percentage of successful page retrievals and traversals), extraction precision (fact-level accuracy of extracted content against reference sources), and an error-propagation study tracing how navigation or extraction failures affect final benchmark scores. We will also include representative failure cases and their frequency across the evaluated benchmarks. revision: yes
Referee: [Training / §3.3] The RL-based DPO training is described as improving tool utilization, yet the manuscript supplies no ablation isolating the contribution of the online DPO stage versus the base LRM or the Deep Web Explorer alone. Without such controls, it is unclear whether the reported outperformance is load-bearing on the training procedure or on the underlying model scale.

Authors: We acknowledge that the current ablations do not fully isolate the online DPO stage. In the revised manuscript we will add controlled experiments in §4 that compare three settings: (1) the base LRM without any web tools, (2) the LRM equipped with the Deep Web Explorer but trained only with supervised fine-tuning (no DPO), and (3) the full WebThinker pipeline with iterative online DPO. These results will quantify the incremental benefit attributable to the RL-based DPO procedure. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical system paper with no derivations or self-referential reductions

full rationale

The manuscript describes an agent architecture (Deep Web Explorer + interleaved Think-Search-and-Draft loop + RL-based DPO) and reports benchmark results on GPQA, GAIA, WebWalkerQA, HLE, and Glaive. No equations, first-principles derivations, or parameter-fitting steps are present that could reduce a claimed prediction to its own inputs by construction. Performance claims rest on external, publicly referenced benchmarks and released code rather than any self-citation chain or ansatz smuggled through prior work. The training procedure is presented as a standard application of online DPO to tool-use data; nothing in the provided text indicates that measured gains are forced by the definition of the method itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review identifies no explicit free parameters, axioms, or invented entities beyond the named modules; full paper may introduce hyperparameters for DPO or navigation heuristics.

pith-pipeline@v0.9.0 · 5568 in / 981 out tokens · 32328 ms · 2026-05-16T19:09:52.921915+00:00 · methodology

discussion (0)

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery
cs.AI 2026-04 accept novelty 8.0

AutoResearchBench is a new benchmark showing top AI agents achieve under 10% success on complex scientific literature discovery tasks that demand deep comprehension and open-ended search.
Learning Agentic Policy from Action Guidance
cs.CL 2026-05 unverdicted novelty 7.0

ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.
Controllable and Verifiable Tool-Use Data Synthesis for Agentic Reinforcement Learning
cs.AI 2026-04 unverdicted novelty 7.0

COVERT generates verifiable synthetic tool-use environments for RL by validated trajectory synthesis and oracle-preserving augmentations, improving tool-use accuracy on BFCL v3 and ACEBench while remaining complementa...
GeoBrowse: A Geolocation Benchmark for Agentic Tool Use with Expert-Annotated Reasoning Traces
cs.CL 2026-04 unverdicted novelty 7.0

GeoBrowse is a two-level geolocation benchmark combining visual cue composition with knowledge-intensive multi-hop queries, paired with the GATE agent workflow that outperforms no-tool, search-only, and image-only baselines.
Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
cs.AI 2026-04 unverdicted novelty 6.0

Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.
Towards Knowledgeable Deep Research: Framework and Benchmark
cs.AI 2026-04 unverdicted novelty 6.0

The paper introduces the KDR task, HKA multi-agent framework, and KDR-Bench to enable LLM agents to integrate structured knowledge into deep research reports, with experiments showing outperformance over prior agents.
TimelineReasoner: Advancing Timeline Summarization with Large Reasoning Models
cs.CL 2026-04 unverdicted novelty 6.0

TimelineReasoner applies large reasoning models in a Global Cognition plus Detail Exploration loop to produce more accurate, complete, and coherent timelines from news than prior LLM-based methods.
Procedural Knowledge at Scale Improves Reasoning
cs.CL 2026-04 unverdicted novelty 6.0

Reasoning Memory decomposes reasoning trajectories into 32 million subquestion-subroutine pairs and retrieves them via in-thought prompts to improve language model performance on math, science, and coding benchmarks b...
KG-Hopper: Empowering Compact Open LLMs with Knowledge Graph Reasoning via Reinforcement Learning
cs.CL 2026-03 unverdicted novelty 6.0

KG-Hopper uses RL to embed full multi-hop KG traversal and backtracking into a single LLM inference round, enabling a 7B model to outperform larger multi-step systems and compete with GPT-3.5/GPT-4o-mini on eight benchmarks.
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
cs.CL 2025-06 conditional novelty 6.0

DeepResearch Bench supplies 100 expert-crafted PhD-level tasks and two human-aligned evaluation frameworks to measure deep research agents on report quality and citation accuracy.
ViDR: Grounding Multimodal Deep Research Reports in Source Visual Evidence
cs.CV 2026-05 unverdicted novelty 5.0

ViDR treats source figures as retrievable and verifiable evidence objects in multimodal deep research reports and introduces MMR Bench+ to measure improvements in visual integration and verifiability.
Personalized Deep Research: A User-Centric Framework, Dataset, and Hybrid Evaluation for Knowledge Discovery
cs.IR 2026-05 conditional novelty 5.0

PDR is a user-context-aware framework for LLM research agents that improves report relevance over static baselines, supported by a new dataset and hybrid evaluation.
Web2BigTable: A Bi-Level Multi-Agent LLM System for Internet-Scale Information Search and Extraction
cs.AI 2026-04 unverdicted novelty 5.0

Web2BigTable introduces a bi-level multi-agent system that achieves new state-of-the-art results on wide-coverage and deep web-to-table search benchmarks through orchestration, coordination, and closed-loop reflection.
ProMMSearchAgent: A Generalizable Multimodal Search Agent Trained with Process-Oriented Rewards
cs.CV 2026-04 unverdicted novelty 5.0

A sandbox-trained multimodal search agent with process-oriented rewards transfers zero-shot to real Google Search and outperforms prior methods on FVQA, InfoSeek, and MMSearch.
Bias in the Loop: Auditing LLM-as-a-Judge for Software Engineering
cs.SE 2026-04 unverdicted novelty 5.0

LLM judges for code tasks show high sensitivity to prompt biases that systematically favor certain options, changing accuracy and model rankings even when code is unchanged.
A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems
cs.AI 2025-08 unverdicted novelty 5.0

A comprehensive review of self-evolving AI agents that improve themselves over time, organized via a framework of inputs, agent system, environment, and optimizers, with domain-specific and safety discussions.
A Brief Overview: Agentic Reinforcement Learning In Large Language Models
cs.AI 2026-04 unverdicted novelty 2.0

This review synthesizes conceptual foundations, methods, challenges, and future directions for agentic reinforcement learning in large language models.
A Brief Overview: Agentic Reinforcement Learning In Large Language Models
cs.AI 2026-04 unverdicted novelty 2.0

The paper surveys the conceptual foundations, methodological innovations, challenges, and future directions of agentic reinforcement learning frameworks that embed cognitive capabilities like meta-reasoning and self-r...

Reference graph

Works this paper leans on

91 extracted references · 91 canonical work pages · cited by 17 Pith papers · 23 internal anchors

[1]

Self-rag: Learn- ing to retrieve, generate, and critique through self-reflection

Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-rag: Learn- ing to retrieve, generate, and critique through self-reflection. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. Open- Review.net, 2024

work page 2024
[2]

ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning

Mingyang Chen, Tianpeng Li, Haoze Sun, Yijie Zhou, Chenzheng Zhu, Haofen Wang, Jeff Z. Pan, Wen Zhang, Huajun Chen, Fan Yang, Zenan Zhou, and Weipeng Chen. Research: Learning to reason with search for llms via reinforcement learning.CoRR, abs/2503.19470, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, and Wanxiang Che. Towards reasoning era: A survey of long chain-of-thought for reasoning large language models.CoRR, abs/2503.09567, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

An empirical study on eliciting and improving r1-like reasoning models.CoRR, abs/2503.04548, 2025

Zhipeng Chen, Yingqian Min, Beichen Zhang, Jie Chen, Jinhao Jiang, Daixuan Cheng, Wayne Xin Zhao, Zheng Liu, Xu Miao, Yang Lu, Lei Fang, Zhongyuan Wang, and Ji-Rong Wen. An empirical study on eliciting and improving r1-like reasoning models.CoRR, abs/2503.04548, 2025

work page arXiv 2025
[5]

Self-play with execution feedback: Improving instruction-following capabilities of large language models

Guanting Dong, Keming Lu, Chengpeng Li, Tingyu Xia, Bowen Yu, Chang Zhou, and Jingren Zhou. Self-play with execution feedback: Improving instruction-following capabilities of large language models. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025

work page 2025
[6]

Toward verifiable instruction-following alignment for retrieval augmented generation

Guanting Dong, Xiaoshuai Song, Yutao Zhu, Runqi Qiao, Zhicheng Dou, and Ji-Rong Wen. Toward verifiable instruction-following alignment for retrieval augmented generation. In Toby Walsh, Julie Shah, and Zico Kolter, editors,AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25 - March 4, 2025, Philadelphia, PA, U...

work page 2025
[7]

The Llama 3 Herd of Models

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[8]

ReTool: Reinforcement Learning for Strategic Tool Use in LLMs

Jiazhan Feng, Shijue Huang, Xingwei Qu, Ge Zhang, Yujia Qin, Baoquan Zhong, Chengquan Jiang, Jinxin Chi, and Wanjun Zhong. Retool: Reinforcement learning for strategic tool use in llms.CoRR, abs/2504.11536, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

Reasoning beyond limits: Advances and open problems for llms.CoRR, abs/2503.22732, 2025

Mohamed Amine Ferrag, Norbert Tihanyi, and Mérouane Debbah. Reasoning beyond limits: Advances and open problems for llms.CoRR, abs/2503.22732, 2025

work page arXiv 2025
[10]

Gemini deep research

Gemini. Gemini deep research. https://gemini.google/overview/ deep-research, 2025

work page 2025
[11]

reasoning-v1-20m

Glaive. reasoning-v1-20m. https://huggingface.co/datasets/glaiveai/ reasoning-v1-20m, 2025

work page 2025
[12]

Grok 3 beta — the age of reasoning agents.https://x.ai/news/grok-3, 2025

Grok. Grok 3 beta — the age of reasoning agents.https://x.ai/news/grok-3, 2025

work page 2025
[13]

Deeprag: Thinking to retrieval step by step for large language models

Xinyan Guan, Jiali Zeng, Fandong Meng, Chunlei Xin, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun, and Jie Zhou. Deeprag: Thinking to retrieval step by step for large language models. CoRR, abs/2502.01142, 2025. 10

work page arXiv 2025
[14]

Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025

work page 2025
[15]

Scaling Laws for Autoregressive Generative Modeling

Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B Brown, Prafulla Dhariwal, Scott Gray, et al. Scaling laws for autoregressive generative modeling.arXiv preprint arXiv:2010.14701, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010
[16]

Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model.CoRR, abs/2503.24290, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

MCTS-RAG: enhancing retrieval- augmented generation with monte carlo tree search.CoRR, abs/2503.20757, 2025

Yunhai Hu, Yilun Zhao, Chen Zhao, and Arman Cohan. MCTS-RAG: enhancing retrieval- augmented generation with monte carlo tree search.CoRR, abs/2503.20757, 2025

work page arXiv 2025
[18]

Qwen2.5-Coder Technical Report

Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Kai Dang, An Yang, Rui Men, Fei Huang, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, and Junyang Lin. Qwen2.5-coder technical report.CoRR, abs/2409.12186, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[19]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[20]

OpenAI o1 System Card

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[21]

Rag-star: Enhancing deliberative reasoning with retrieval augmented verification and refinement

Jinhao Jiang, Jiayi Chen, Junyi Li, Ruiyang Ren, Shijie Wang, Xin Zhao, Yang Song, and Tao Zhang. Rag-star: Enhancing deliberative reasoning with retrieval augmented verification and refinement. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors,Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational...

work page 2025
[22]

Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

Bowen Jin, Hansi Zeng, Zhenrui Yue, Dong Wang, Hamed Zamani, and Jiawei Han. Search- r1: Training llms to reason and leverage search engines with reinforcement learning.CoRR, abs/2503.09516, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[23]

BIDER: bridging knowledge inconsistency for efficient retrieval-augmented llms via key supporting evidence

Jiajie Jin, Yutao Zhu, Yujia Zhou, and Zhicheng Dou. BIDER: bridging knowledge inconsistency for efficient retrieval-augmented llms via key supporting evidence. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024, pages 750–...

work page 2024
[24]

Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in Neural Information Processing Systems, 33:9459–9474, 2020

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in Neural Information Processing Systems, 33:9459–9474, 2020

work page 2020
[25]

Numi- namath

Jia LI, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Costa Huang, Kashif Rasul, Longhui Yu, Albert Jiang, Ziju Shen, Zihan Qin, Bin Dong, Li Zhou, Yann Fleureau, Guillaume Lample, and Stanislas Polu. Numi- namath. [https://huggingface.co/AI-MO/NuminaMath-1.5](https: //github.com/project-numina/aimo-progress-prize/blob/main/ report...

work page 2024
[26]

Search-o1: Agentic Search-Enhanced Large Reasoning Models

Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou. Search-o1: Agentic search-enhanced large reasoning models.CoRR, abs/2501.05366, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[27]

Torl: Scaling tool-integrated RL.CoRR, abs/2503.23383, 2025

Xuefeng Li, Haoyang Zou, and Pengfei Liu. Torl: Scaling tool-integrated RL.CoRR, abs/2503.23383, 2025. 11

work page arXiv 2025
[28]

From System 1 to System 2: A Survey of Reasoning Large Language Models

Zhong-Zhi Li, Duzhen Zhang, Ming-Liang Zhang, Jiaxin Zhang, Zengyan Liu, Yuxuan Yao, Haotian Xu, Junhao Zheng, Pei-Jie Wang, Xiuyi Chen, Yingying Zhang, Fei Yin, Jiahua Dong, Zhijiang Guo, Le Song, and Cheng-Lin Liu. From system 1 to system 2: A survey of reasoning large language models.CoRR, abs/2502.17419, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[29]

Deepsolution: Boosting complex engineering solution design via tree-based exploration and bi-point thinking

Zhuoqun Li, Haiyang Yu, Xuanang Chen, Hongyu Lin, Yaojie Lu, Fei Huang, Xianpei Han, Yongbin Li, and Le Sun. Deepsolution: Boosting complex engineering solution design via tree-based exploration and bi-point thinking. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the Assoc...

work page 2025
[30]

How much can RAG help the reasoning of llm?CoRR, abs/2410.02338, 2024

Jingyu Liu, Jiaen Lin, and Yong Liu. How much can RAG help the reasoning of llm?CoRR, abs/2410.02338, 2024

work page arXiv 2024
[31]

Query rewriting for retrieval-augmented large language models.CoRR, abs/2305.14283, 2023

Xinbei Ma, Yeyun Gong, Pengcheng He, Hai Zhao, and Nan Duan. Query rewriting for retrieval-augmented large language models.CoRR, abs/2305.14283, 2023

work page arXiv 2023
[32]

GAIA: a benchmark for general AI assistants

Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: a benchmark for general AI assistants. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024

work page 2024
[33]

Imitate, explore, and self-improve: A reproduction report on slow-thinking reasoning systems.CoRR, abs/2412.09413, 2024

Yingqian Min, Zhipeng Chen, Jinhao Jiang, Jie Chen, Jia Deng, Yiwen Hu, Yiru Tang, Jiapeng Wang, Xiaoxue Cheng, Huatong Song, Wayne Xin Zhao, Zheng Liu, Zhongyuan Wang, and Ji-Rong Wen. Imitate, explore, and self-improve: A reproduction report on slow-thinking reasoning systems.CoRR, abs/2412.09413, 2024

work page arXiv 2024
[34]

Learning to reason with llms

OpenAI. Learning to reason with llms. https://openai.com/index/ learning-to-reason-with-llms, September 2024

work page 2024
[35]

Introducing deep research

OpenAI. Introducing deep research. https://openai.com/index/ introducing-deep-research, 2025

work page 2025
[36]

Openai o3-mini

OpenAI. Openai o3-mini. https://openai.com/index/openai-o3-mini, January 2025

work page 2025
[37]

Humanity's Last Exam

Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Sean Shi, Michael Choi, Anish Agrawal, Arnav Chopra, Adam Khoja, Ryan Kim, Jason Hausenloy, Oliver Zhang, Mantas Mazeika, Daron Anderson, Tung Nguyen, Mobeen Mahmood, Fiona Feng, Steven Y . Feng, Haoran Zhao, Michael Yu, Varun Gangal, Chelsea Zou, Zihan Wang, Jessica P. Wang, Pawan...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[38]

ToolRL: Reward is All Tool Learning Needs

Cheng Qian, Emre Can Acikgoz, Qi He, Hongru Wang, Xiusi Chen, Dilek Hakkani-Tür, Gokhan Tur, and Heng Ji. Toolrl: Reward is all tool learning needs.CoRR, abs/2504.13958, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[39]

O1 replication journey: A strategic progress report–part 1

Yiwei Qin, Xuefeng Li, Haoyang Zou, Yixiu Liu, Shijie Xia, Zhen Huang, Yixin Ye, Weizhe Yuan, Hector Liu, Yuanzhi Li, et al. O1 replication journey: A strategic progress report–part 1. arXiv preprint arXiv:2410.18982, 2024. 12

work page arXiv 2024
[40]

Manning, Stefano Ermon, and Chelsea Finn

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors,Advances in Neural Information Processing Systems 36: Annual Conference o...

work page 2023
[41]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google-proof q&a benchmark.CoRR, abs/2311.12022, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[42]

Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy

Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023, pages 9248–9274. Associ...

work page 2023
[43]

R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning

Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, and Ji-Rong Wen. R1-searcher: Incentivizing the search capability in llms via reinforcement learning.CoRR, abs/2503.05592, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[44]

Small models, big insights: Leveraging slim proxy models to decide when and what to retrieve for llms

Jiejun Tan, Zhicheng Dou, Yutao Zhu, Peidong Guo, Kun Fang, and Ji-Rong Wen. Small models, big insights: Leveraging slim proxy models to decide when and what to retrieve for llms. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 20...

work page 2024
[45]

Unlocking general long chain-of-thought reasoning capabilities of large language models via representation engineering

Xinyu Tang, Xiaolei Wang, Zhihao Lv, Yingqian Min, Xin Zhao, Binbin Hu, Ziqi Liu, and Zhiqiang Zhang. Unlocking general long chain-of-thought reasoning capabilities of large language models via representation engineering. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the A...

work page 2025
[46]

M.-A-P. Team, Xinrun Du, Yifan Yao, Kaijing Ma, Bingli Wang, Tianyu Zheng, Kang Zhu, Minghao Liu, Yiming Liang, Xiaolong Jin, Zhenlin Wei, Chujie Zheng, Kaixin Deng, Shian Jia, Sichao Jiang, Yiyan Liao, Rui Li, Qinrui Li, Sirun Li, Yizhi Li, Yunwen Li, Dehua Ma, Yuansheng Ni, Haoran Que, Qiyao Wang, Zhoufutu Wen, Siwei Wu, Tianshun Xing, Ming Xu, Zhenzhu ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[47]

Open Thoughts.https://open-thoughts.ai, January 2025

OpenThoughts Team. Open Thoughts.https://open-thoughts.ai, January 2025

work page 2025
[48]

Qwq: Reflect deeply on the boundaries of the unknown.Hugging Face, 2024

Qwen Team. Qwq: Reflect deeply on the boundaries of the unknown.Hugging Face, 2024

work page 2024
[49]

Crawl4ai: Open-source llm friendly web crawler & scraper

UncleCode. Crawl4ai: Open-source llm friendly web crawler & scraper. https://github. com/unclecode/crawl4ai, 2024

work page 2024
[50]

OTC: optimal tool calls via reinforcement learning.CoRR, abs/2504.14870, 2025

Hongru Wang, Cheng Qian, Wanjun Zhong, Xiusi Chen, Jiahao Qiu, Shijue Huang, Bowen Jin, Mengdi Wang, Kam-Fai Wong, and Heng Ji. OTC: optimal tool calls via reinforcement learning.CoRR, abs/2504.14870, 2025. 13

work page arXiv 2025
[51]

Chain- of-retrieval augmented generation.CoRR, abs/2501.14342, 2025

Liang Wang, Haonan Chen, Nan Yang, Xiaolong Huang, Zhicheng Dou, and Furu Wei. Chain- of-retrieval augmented generation.CoRR, abs/2501.14342, 2025

work page arXiv 2025
[52]

Query2doc: Query expansion with large language models

Liang Wang, Nan Yang, and Furu Wei. Query2doc: Query expansion with large language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 9414–9423. Association for Computational Linguistics, 2023

work page 2023
[53]

RARE: retrieval-augmented reasoning modeling.CoRR, abs/2503.23513, 2025

Zhengren Wang, Jiayang Yu, Dongsheng Ma, Zhe Chen, Yu Wang, Zhiyu Li, Feiyu Xiong, Yanfeng Wang, Weinan E, Linpeng Tang, and Wentao Zhang. RARE: retrieval-augmented reasoning modeling.CoRR, abs/2503.23513, 2025

work page arXiv 2025
[54]

RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning

Zihan Wang, Kangrui Wang, Qineng Wang, Pingyue Zhang, Linjie Li, Zhengyuan Yang, Xing Jin, Kefan Yu, Minh Nhat Nguyen, Licheng Liu, Eli Gottlieb, Yiping Lu, Kyunghyun Cho, Jiajun Wu, Li Fei-Fei, Lijuan Wang, Yejin Choi, and Manling Li. RAGEN: understanding self-evolution in LLM agents via multi-turn reinforcement learning.CoRR, abs/2504.20073, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[55]

Chi, Quoc V

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V . Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors,Advances in Neural Information Processing Systems 35: Annual Conference on Neura...

work page 2022
[56]

Webwalker: Benchmarking llms in web traversal

Jialong Wu, Wenbiao Yin, Yong Jiang, Zhenglin Wang, Zekun Xi, Runnan Fang, Linhai Zhang, Yulan He, Deyu Zhou, Pengjun Xie, and Fei Huang. Webwalker: Benchmarking llms in web traversal. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the Association for Computational Linguist...

work page 2025
[57]

Self-play preference optimization for language model alignment

Yue Wu, Zhiqing Sun, Huizhuo Yuan, Kaixuan Ji, Yiming Yang, and Quanquan Gu. Self-play preference optimization for language model alignment. InThe Thirteenth International Confer- ence on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025

work page 2025
[58]

Omnithink: Expanding knowledge boundaries in machine writing through thinking.CoRR, abs/2501.09751, 2025

Zekun Xi, Wenbiao Yin, Jizhan Fang, Jialong Wu, Runnan Fang, Ningyu Zhang, Yong Jiang, Pengjun Xie, Fei Huang, and Huajun Chen. Omnithink: Expanding knowledge boundaries in machine writing through thinking.CoRR, abs/2501.09751, 2025

work page arXiv 2025
[59]

Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning

Tian Xie, Zitian Gao, Qingnan Ren, Haoming Luo, Yuqian Hong, Bryan Dai, Joey Zhou, Kai Qiu, Zhirong Wu, and Chong Luo. Logic-rl: Unleashing LLM reasoning with rule-based reinforcement learning.CoRR, abs/2502.14768, 2025

work page internal anchor Pith review arXiv 2025
[60]

RECOMP: improving retrieval-augmented lms with context compression and selective augmentation

Fangyuan Xu, Weijia Shi, and Eunsol Choi. RECOMP: improving retrieval-augmented lms with context compression and selective augmentation. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024

work page 2024
[61]

Qwen2 Technical Report

An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jianxin Yang, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[62]

Qwen2.5 Technical Report

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu X...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[63]

Narasimhan, and Yuan Cao

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InThe Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5,

work page 2023
[64]

OpenReview.net, 2023

work page 2023
[65]

Physics of language models: Part 2.2, how to learn from mistakes on grade-school math problems

Tian Ye, Zicheng Xu, Yuanzhi Li, and Zeyuan Allen-Zhu. Physics of language models: Part 2.2, how to learn from mistakes on grade-school math problems. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28,

work page 2025
[66]

OpenReview.net, 2025

work page 2025
[67]

Weston, and Xian Li

Weizhe Yuan, Jane Yu, Song Jiang, Karthik Padthe, Yang Li, Dong Wang, Ilia Kulikov, Kyunghyun Cho, Yuandong Tian, Jason E. Weston, and Xian Li. Naturalreasoning: Rea- soning in the wild with 2.8m challenging questions.CoRR, abs/2502.13124, 2025

work page arXiv 2025
[68]

Retrieval-Augmented Generation for AI-Generated Content: A Survey

Penghao Zhao, Hailin Zhang, Qinhan Yu, Zhengren Wang, Yunteng Geng, Fangcheng Fu, Ling Yang, Wentao Zhang, and Bin Cui. Retrieval-augmented generation for ai-generated content: A survey.CoRR, abs/2402.19473, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[69]

DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments

Yuxiang Zheng, Dayuan Fu, Xiangkun Hu, Xiaojie Cai, Lyumanshan Ye, Pengrui Lu, and Pengfei Liu. Deepresearcher: Scaling deep research via reinforcement learning in real-world environments.CoRR, abs/2504.03160, 2025

work page internal anchor Pith review arXiv 2025
[70]

Metacognitive retrieval- augmented large language models

Yujia Zhou, Zheng Liu, Jiajie Jin, Jian-Yun Nie, and Zhicheng Dou. Metacognitive retrieval- augmented large language models. In Tat-Seng Chua, Chong-Wah Ngo, Ravi Kumar, Hady W. Lauw, and Roy Ka-Wei Lee, editors,Proceedings of the ACM on Web Conference 2024, WWW 2024, Singapore, May 13-17, 2024, pages 1453–1463. ACM, 2024

work page 2024
[71]

click intent

Yutao Zhu, Jiajie Jin, Hongjin Qian, Zheng Liu, Zhicheng Dou, and Ji-Rong Wen. Single llm, multiple roles: A unified retrieval-augmented generation framework using role-specific token optimization.CoRR, abs/2505.15444, 2025. 15 Appendix A Inference Process 17 A.1 Research-Related Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 A.2...

work page arXiv 2025
[72]

- Identify factual information that is relevant to the **Current Search Query** and can aid in the reasoning process for the original question

**Analyze the Searched Web Pages:** - Carefully review the content of each searched web page. - Identify factual information that is relevant to the **Current Search Query** and can aid in the reasoning process for the original question

work page
[73]

**More Information Seeking:** - If the information is not relevant to the query, you could:

work page
[74]

Search again: <|begin_search_query|>another search query<|end_search_query|>

work page
[75]

Access webpage content using: <|begin_click_link|>your URL<|end_click_link|>

work page
[76]

**Extract Relevant Information:** - Return the relevant information from the **Searched Web Pages** that is relevant to the **Current Search Query**

work page
[77]

{search_query}

**Output Format:** - Present the information beginning with **Final Information** as shown below. **Final Information** [Relevant information] **Inputs:** - **Current Search Query:** {search_query} - **Detailed Search Intent:** {search_intent} - **Searched Web Pages:** {search_result} Now please analyze the web pages and extract relevant information for t...

work page
[78]

Use web searches to gather detailed information for each point

work page
[79]

After each search, analyze the results and determine what additional information is needed

work page
[80]

When you have sufficient information for a section, request to write that section

work page

Showing first 80 references.