arxiv: 2510.00861 · v2 · submitted 2025-10-01 · cs.CL · cs.AI · cs.IR

Erase to Improve: Erasable Reinforcement Learning for Search-Augmented LLMs

Ziliang Wang, Kang An, Xuhui Zheng, Faqiang Qian, Weikun Zhang, Cijun Ouyang, Jialu Cai, Yuhang Wang, Yichao Wu

Pith reviewed 2026-05-18 11:02 UTC · model grok-4.3

classification cs.CL · cs.AI · cs.IR

keywords Erasable Reinforcement Learning · multi-hop reasoning · search-augmented LLMs · error correction · reinforcement learning · question answering

The pith

Erasable reinforcement learning lets search-augmented LLMs detect, erase, and regenerate faulty reasoning steps to prevent error propagation in multi-hop tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Erasable Reinforcement Learning (ERL) as a way to make search-augmented LLMs handle complex multi-hop reasoning more reliably. It targets three common failure points: breaking tasks into wrong sub-questions, failing to retrieve key facts, and letting flawed logic continue through later steps. ERL adds a training loop that spots these bad steps, removes them, and immediately regenerates better reasoning in the same spot. The resulting ESearch models are tested on four multi-hop question-answering benchmarks. A reader would care because the method improves smaller models without requiring bigger parameter counts or more search calls.

Core claim

Erasable Reinforcement Learning (ERL) transforms fragile multi-step reasoning into a robust process by explicitly identifying faulty steps, erasing them, and regenerating correct reasoning in place so that defective logic does not propagate through the chain.

What carries the argument

The ERL loop, which detects faulty reasoning steps during reinforcement learning and erases them for targeted regeneration.
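The detect, erase, and regenerate cycle described above can be sketched as a small control loop. This is only an illustration under stated assumptions, not the authors' training code: `erase_and_regenerate`, the `is_faulty` detector, the `regenerate` callable, and `max_retries` are all hypothetical names, and the sketch leaves later steps untouched, which the actual method may not do.

```python
from typing import Callable, List

# Hypothetical types: a step is a plain string; the detector and generator are
# caller-supplied callables, not anything defined by the paper.
Step = str
Detector = Callable[[List[Step], int], bool]  # True if the step at that index looks faulty
Generator = Callable[[List[Step]], Step]      # produce a replacement step from a prefix


def erase_and_regenerate(
    chain: List[Step],
    is_faulty: Detector,
    regenerate: Generator,
    max_retries: int = 1,  # assumed retry budget per step
) -> List[Step]:
    """Walk the chain; when a step is flagged, erase it and regenerate it in place."""
    repaired: List[Step] = []
    for step in chain:
        candidate = step
        retries = 0
        # Check the candidate in the context of the already repaired prefix.
        while retries < max_retries and is_faulty(repaired + [candidate], len(repaired)):
            candidate = regenerate(repaired)  # the flagged step is discarded
            retries += 1
        repaired.append(candidate)
    return repaired


if __name__ == "__main__":
    chain = ["plan: find birthplace", "search: BAD QUERY", "answer: Warsaw"]
    fixed = erase_and_regenerate(
        chain,
        is_faulty=lambda steps, i: "BAD" in steps[i],
        regenerate=lambda prefix: f"search: retry after {len(prefix)} steps",
    )
    print(fixed)
```

The point of the design is that only the flagged position changes; the prefix before it is reused as context for the replacement.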

If this is right

The 3B model improves exact match (EM) by 8.48% and F1 by 11.56% over prior best results on HotpotQA, MuSiQue, 2Wiki, and Bamboogle.
The 7B model improves EM by 5.38% and F1 by 7.22% over prior best results on the same benchmarks.
Reasoning chains become more resilient because errors are corrected locally instead of derailing the entire answer.
The same training approach can be applied to other search-augmented LLM setups that rely on multi-step decomposition and retrieval.
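For readers unfamiliar with the two metrics behind these numbers, the following is a minimal sketch of exact match and token-level F1 as they are commonly computed for extractive QA. It assumes SQuAD-style answer normalization (lowercasing, stripping punctuation and English articles); the page does not state which normalization the paper uses, so the helper names and rules here are assumptions.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("The Warsaw.", "warsaw"))
    print(token_f1("city of Warsaw", "Warsaw"))
```

F1 gives partial credit for overlapping tokens, which is why it typically reads higher than EM on the same predictions.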

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The method might let smaller models close performance gaps with larger ones by improving error recovery rather than adding parameters.
If extended to inference time, the same erase-and-regenerate step could support real-time correction in interactive applications.
Similar local correction could be tested on chain-of-thought traces in mathematics or code generation where step-level errors also accumulate.

Load-bearing premise

Faulty reasoning steps can be detected accurately enough that erasing and regenerating them produces a net gain without creating new errors or using too much extra computation.

What would settle it

An experiment in which the detection step often marks correct reasoning as faulty or in which regenerated steps produce lower final accuracy than the original chain would show the method does not work as claimed.
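The two quantities this test turns on can be written down directly. The sketch below assumes step-level labels are available from some oracle or human annotation, which the paper is not stated to provide; `StepRecord`, `false_positive_rate`, and `net_accuracy_gain` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StepRecord:
    truly_faulty: bool  # hypothetical oracle or human label for the step
    flagged: bool       # whether the detector chose to erase the step


def false_positive_rate(records: List[StepRecord]) -> float:
    """Share of correct steps that the detector flagged for erasure."""
    correct_steps = [r for r in records if not r.truly_faulty]
    if not correct_steps:
        return 0.0
    return sum(r.flagged for r in correct_steps) / len(correct_steps)


def net_accuracy_gain(original_correct: List[bool], regenerated_correct: List[bool]) -> float:
    """Final-answer accuracy after regeneration minus accuracy before it."""
    if len(original_correct) != len(regenerated_correct):
        raise ValueError("both lists must cover the same questions")
    n = len(original_correct)
    if n == 0:
        return 0.0
    return (sum(regenerated_correct) - sum(original_correct)) / n


if __name__ == "__main__":
    records = [
        StepRecord(truly_faulty=False, flagged=True),
        StepRecord(truly_faulty=False, flagged=False),
        StepRecord(truly_faulty=True, flagged=True),
    ]
    print(false_positive_rate(records))
    print(net_accuracy_gain([False, False], [True, False]))
```

A high false-positive rate or a negative net gain on held-out questions would be the kind of evidence described above.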

Figures

Figures reproduced from arXiv: 2510.00861 by Cijun Ouyang, Faqiang Qian, Jialu Cai, Kang An, Weikun Zhang, Xuhui Zheng, Yichao Wu, Yuhang Wang, Ziliang Wang.

**Figure 1.** Overview of ESearch. Different colors and symbols are used to represent the interactive behaviors S (Search), I (Information), O (Observation), and A (Sub Answer). In the answering process, there are three types of erasure and retry behaviors: (1) incorrect initial search results trigger initialization plan erasure; (2) incorrect subsequent search results trigger search design erasure; (3) incorrect sub-an…

**Figure 2.** Training dynamics of different RL strategies. Compared to PPO, GRPO demonstrates…

**Figure 3.** Training dynamics in ablation experiments.

**Figure 4.** Overview of ESearch. In the figure, 'o/w' (only with) indicates that only the current mechanism is added to the base method.

Section 7, Limitation & Future discussion: The strength of the ERL framework lies in its structured cycle of identification, erasure, and regeneration, which enables targeted correction of reasoning errors and significantly improves reliability. This sequential design inherently increases com…

**Figure 5.** LLM interacts with external search engines and provides answers to prompt templates. The…

**Figure 6.** Case study demonstrating error recovery: the initial plan focused on Chopin's birthplace rather than where he grew up, but observations corrected this and identified Warsaw as his hometown.

**Figure 7.** Step-by-step reasoning for a 2-hop question identifying the spouse of the winner of the…

**Figure 8.** ESearch makes an incorrect observation: although the first hop (KFAB…

**Figure 9.** ESearch makes an incorrect entity alignment: although the retrieval step surfaced the…

**Figure 10.** ESearch also produces an erroneous observation reasoning chain: Banir was incorrectly…

**Figure 11.** Case study showing step-by-step reasoning with sub-questions leading to the identification…

**Figure 12.** Complex multi-step reasoning requiring identification of author, educational background,…

**Figure 13.** ESearch can efficiently handle a 4-hop reasoning question: after gathering relevant…

Original abstract

While search-augmented large language models (LLMs) exhibit impressive capabilities, their reliability in complex multi-hop reasoning remains limited. This limitation arises from three fundamental challenges: decomposition errors, where tasks are incorrectly broken down; retrieval missing, where key evidence fails to be retrieved; and reasoning errors, where flawed logic propagates through the reasoning chain. A single failure in any of these stages can derail the final answer. We propose Erasable Reinforcement Learning (ERL), a novel framework that transforms fragile reasoning into a robust process. ERL explicitly identifies faulty steps, erases them, and regenerates reasoning in place, preventing defective logic from propagating through the reasoning chain. This targeted correction mechanism turns brittle reasoning into a more resilient process. Models trained with ERL, termed ESearch, achieve substantial improvements on HotpotQA, MuSiQue, 2Wiki, and Bamboogle, with the 3B model achieving +8.48% EM and +11.56% F1, and the 7B model achieving +5.38% EM and +7.22% F1 over previous state-of-the-art(SOTA) results. These findings suggest that erasable reinforcement learning provides a powerful paradigm shift for robust multi-step reasoning in LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

ERL adds an explicit erase-and-regenerate step inside RL for search-augmented reasoning and reports clear gains on multi-hop QA, but those gains depend on whether the unsupervised fault detector actually works.

The core contribution is turning error correction into a first-class part of the RL training loop rather than a post-hoc fix. The model learns to identify a bad intermediate step, remove it, and regenerate from that point while the rest of the chain stays intact. That is a concrete difference from standard self-correction and from standard RL on a final-answer reward. The experiments run on HotpotQA, MuSiQue, 2Wiki, and Bamboogle with both 3B and 7B models, and the reported lifts (roughly 8 EM / 11 F1 for the smaller model) are large enough to notice against the prior SOTA numbers the authors cite.

Referee Report

2 major / 2 minor

Summary. The paper introduces Erasable Reinforcement Learning (ERL) for search-augmented LLMs to mitigate decomposition, retrieval, and reasoning errors in multi-hop tasks by detecting faulty intermediate steps, erasing them, and regenerating the reasoning in place. Models trained under this framework (ESearch) are reported to outperform prior SOTA on HotpotQA, MuSiQue, 2Wiki, and Bamboogle, with absolute gains of +8.48% EM / +11.56% F1 for the 3B variant and +5.38% EM / +7.22% F1 for the 7B variant.

Significance. If the empirical gains prove robust and the erasure mechanism is shown to be the causal driver rather than ancillary compute, the work would provide a concrete, targeted correction strategy that strengthens the reliability of search-augmented reasoning chains. It extends RL-based training for LLMs by adding an explicit erase-and-regenerate loop, which could be broadly applicable to other chain-of-thought and retrieval-augmented settings.

major comments (2)

[§3] ERL framework description: the unsupervised detection of faulty reasoning steps is load-bearing for the central claim yet remains underspecified. Without ground-truth labels on intermediates, the method must rely on a learned critic, reward threshold, or self-consistency signal; the manuscript does not detail how false-positive erasures (removing correct steps) or false-negative retentions (leaving errors) are controlled, leaving open the possibility that observed gains arise from extra gradient updates or search budget rather than the erasable mechanism itself.
[Experimental section / results tables] No ablation isolates the contribution of the erase-regenerate step from standard RL training or increased inference-time search. The headline +8.48% EM lift on the 3B model cannot be confidently attributed to ERL until such controls are shown; otherwise the result risks being an artifact of the training regime rather than the proposed correction process.

minor comments (2)

[Abstract] The phrase 'previous state-of-the-art(SOTA)' should be accompanied by explicit citations to the prior works being surpassed.
[Introduction] Notation: introduce the distinction between ERL (the training framework) and ESearch (the resulting model) at first use to avoid reader confusion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the presentation of the ERL framework and strengthen the experimental claims. We address each major point below and will incorporate revisions to improve the manuscript.

Referee: [§3] ERL framework description: the unsupervised detection of faulty reasoning steps is load-bearing for the central claim yet remains underspecified. Without ground-truth labels on intermediates, the method must rely on a learned critic, reward threshold, or self-consistency signal; the manuscript does not detail how false-positive erasures (removing correct steps) or false-negative retentions (leaving errors) are controlled, leaving open the possibility that observed gains arise from extra gradient updates or search budget rather than the erasable mechanism itself.

Authors: We agree that Section 3 provides a high-level description of faulty-step detection and would benefit from greater specificity. The ERL approach identifies faulty steps via a reward-based critic that flags low-reward intermediate outputs, augmented by a self-consistency check over multiple sampled continuations of the same prefix. Erasure is applied only when both signals agree and is limited to a single step per chain to reduce over-erasure risk. We will revise Section 3 to include the precise critic formulation, threshold selection procedure, and a short discussion of how these choices limit false positives and negatives. We will also add a brief analysis showing that the total number of gradient updates and search budget are matched to the baselines, so that gains cannot be attributed solely to extra compute.

Revision: yes

Referee: [Experimental section / results tables] No ablation isolates the contribution of the erase-regenerate step from standard RL training or increased inference-time search. The headline +8.48% EM lift on the 3B model cannot be confidently attributed to ERL until such controls are shown; otherwise the result risks being an artifact of the training regime rather than the proposed correction process.

Authors: This observation is correct and points to a genuine gap in the current experimental design. While the reported results compare ESearch against prior SOTA methods, they do not contain an explicit ablation that removes only the erase-and-regenerate loop while holding the underlying RL objective, training steps, and inference-time search budget fixed. We will add these controls in the revised version, including (i) a standard RL baseline without erasure, (ii) a variant that performs additional search steps without erasure, and (iii) a table reporting the isolated contribution of the erasure component. These additions will allow readers to attribute performance differences more directly to the proposed mechanism.

Revision: yes
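The detection rule in the first simulated response (a low-reward flag from a critic, a self-consistency check over sampled continuations, erasure only when both agree, and at most one erased step per chain) can be illustrated as follows. This is a sketch of that description only: the rebuttal is simulated and not confirmed by the paper, and the function names, both threshold values, and the reading of "self-consistency" as agreement among continuation answers are assumptions.

```python
from collections import Counter
from typing import Optional, Sequence


def agreement_rate(continuation_answers: Sequence[str]) -> float:
    """Fraction of sampled continuations that share the most common answer."""
    if not continuation_answers:
        return 0.0
    most_common_count = Counter(continuation_answers).most_common(1)[0][1]
    return most_common_count / len(continuation_answers)


def pick_step_to_erase(
    step_rewards: Sequence[float],
    step_continuations: Sequence[Sequence[str]],
    reward_threshold: float = 0.3,     # assumed value, not from the paper
    agreement_threshold: float = 0.5,  # assumed value, not from the paper
) -> Optional[int]:
    """Return the index of at most one step to erase, or None if no step qualifies."""
    if len(step_rewards) != len(step_continuations):
        raise ValueError("need one reward and one continuation set per step")
    candidates = [
        i
        for i, (reward, answers) in enumerate(zip(step_rewards, step_continuations))
        # Both signals must agree: low critic reward AND low self-consistency.
        if reward < reward_threshold and agreement_rate(answers) < agreement_threshold
    ]
    if not candidates:
        return None
    # Single erasure per chain: lowest reward wins, earliest step on ties.
    return min(candidates, key=lambda i: (step_rewards[i], i))


if __name__ == "__main__":
    rewards = [0.9, 0.1, 0.2]
    continuations = [["A", "A", "A"], ["A", "B", "C"], ["A", "A", "B"]]
    print(pick_step_to_erase(rewards, continuations))
```

Requiring both signals trades recall for precision: a step with low reward but consistent continuations is kept, which is one way to limit the false-positive erasures the referee asks about.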

Circularity Check

0 steps flagged

No circularity: empirical RL framework evaluated on external benchmarks

The paper introduces Erasable Reinforcement Learning (ERL) as a training procedure that detects faulty intermediate reasoning steps, erases them, and regenerates replacements within a search-augmented LLM pipeline. Reported gains (+8.48% EM / +11.56% F1 for the 3B model on HotpotQA/MuSiQue/etc.) are presented as measured outcomes of this training on standard held-out QA datasets. No equations, uniqueness theorems, fitted-parameter predictions, or self-citation chains appear in the provided text that would reduce any claimed result to its own inputs by construction. The central contribution is therefore an empirical method whose validity rests on external benchmark performance rather than internal definitional closure.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claim rests on the unverified ability of the RL process to identify and correct specific faulty steps; no free parameters or axioms are detailed in the abstract, and the only invented entity is the ERL framework itself.

invented entities (1)

Erasable Reinforcement Learning (ERL): no independent evidence
purpose: To identify faulty steps, erase them, and regenerate reasoning to prevent error propagation
note: New framework introduced to address the three listed challenges in search-augmented LLMs.

Reference graph

Works this paper leans on

74 extracted references · 74 canonical work pages · 19 internal anchors

[1]

Gpt-5 system card

OpenAI. Gpt-5 system card. Technical report, OpenAI, aug 2025. Accessed: 2025-09-11

work page 2025
[2]

The llama 4 herd: The beginning of a new era of natively multimodal intelli- gence

Meta AI. The llama 4 herd: The beginning of a new era of natively multimodal intelli- gence. https://ai.meta.com/blog/lllama-4-multimodal-intelligence/ , apr 2025. Accessed: 2025-09-11

work page 2025
[3]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions

Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qiang- long Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2):1–55, 2025

work page 2025
[5]

TrustLLM: Trustworthiness in Large Language Models

Yue Huang, Lichao Sun, Haoran Wang, Siyuan Wu, Qihui Zhang, Yuan Li, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan Zhang, et al. Trustllm: Trustworthiness in large language models. arXiv preprint arXiv:2401.05561, 2024

work page internal anchor Pith review arXiv 2024
[6]

OpenAI o1 System Card

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[7]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025. 10

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

Kimi K2: Open Agentic Intelligence

Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, et al. Kimi k2: Open agentic intelligence. arXiv preprint arXiv:2507.20534, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

A survey of llm-based deep search agents: Paradigm, optimization, evaluation, and challenges

Yunjia Xi, Jianghao Lin, Yongzhao Xiao, Zheli Zhou, Rong Shan, Te Gao, Jiachen Zhu, Weiwen Liu, Yong Yu, and Weinan Zhang. A survey of llm-based deep search agents: Paradigm, optimization, evaluation, and challenges. arXiv preprint arXiv:2508.05668, 2025

work page arXiv 2025
[10]

Retrieval-augmented generation for knowledge-intensive nlp tasks

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems, 33:9459–9474, 2020

work page 2020
[11]

Retrieval-Augmented Generation for Large Language Models: A Survey

Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yixin Dai, Jiawei Sun, Haofen Wang, and Haofen Wang. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997, 2(1), 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[12]

Introducing deep research, 2025

OpenAI. Introducing deep research, 2025. Accessed: 2025-09-11

work page 2025
[13]

Gemini deep research – your personal research assistant, 2025

Google DeepMind. Gemini deep research – your personal research assistant, 2025. Accessed: 2025-09-11

work page 2025
[14]

Introducing perplexity deep research, 2025

Perplexity AI. Introducing perplexity deep research, 2025. Accessed: 2025-09-11

work page 2025
[15]

Deep Reinforcement Learning: An Overview

Yuxi Li. Deep reinforcement learning: An overview. arXiv preprint arXiv:1701.07274, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[16]

Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Za- mani, and Jiawei Han. Search-r1: Training llms to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning

Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, and Ji-Rong Wen. R1-searcher: Incentivizing the search capability in llms via reinforcement learning. arXiv preprint arXiv:2503.05592, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[18]

Par- allelsearch: Train your llms to decompose query and search sub-queries in parallel with reinforcement learning

Shu Zhao, Tan Yu, Anbang Xu, Japinder Singh, Aaditya Shukla, and Rama Akkiraju. Par- allelsearch: Train your llms to decompose query and search sub-queries in parallel with reinforcement learning. arXiv preprint arXiv:2508.09303, 2025

work page arXiv 2025
[19]

HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhut- dinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[20]

Musique: Multihop questions via single-hop question composition

Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Musique: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539–554, 2022

work page 2022
[21]

Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps

Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps. arXiv preprint arXiv:2011.01060, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2011
[22]

Deep research agents: A systematic examination and roadmap.arXiv preprint arXiv:2506.18096, 2025

Yuxuan Huang, Yihang Chen, Haozheng Zhang, Kang Li, Meng Fang, Linyi Yang, Xiaoguang Li, Lifeng Shang, Songcen Xu, Jianye Hao, et al. Deep research agents: A systematic examina- tion and roadmap. arXiv preprint arXiv:2506.18096, 2025

work page arXiv 2025
[23]

Reinforcement learning foundations for deep research systems: A survey

Wenjun Li, Zhi Chen, Jingru Lin, Hannan Cao, Wei Han, Sheng Liang, Zhi Zhang, Kuicai Dong, Dexun Li, Chen Zhang, et al. Reinforcement learning foundations for deep research systems: A survey. arXiv preprint arXiv:2509.06733, 2025

work page arXiv 2025
[24]

Reinforcement learning: An introduction, volume 1

Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998

work page 1998
[25]

Reinforcement learning: A survey

Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996

work page 1996
[26]

Agent models: Inter- nalizing chain-of-action generation into reasoning models

Yuxiang Zhang, Yuqi Yang, Jiangming Shu, Xinyan Wen, and Jitao Sang. Agent models: Inter- nalizing chain-of-action generation into reasoning models. arXiv preprint arXiv:2503.06580, 2025

work page arXiv 2025
[27]

An empirical study on reinforcement learning for reasoning-search interleaved llm agents

Bowen Jin, Jinsung Yoon, Priyanka Kargupta, Sercan O Arik, and Jiawei Han. An empirical study on reinforcement learning for reasoning-search interleaved llm agents. arXiv preprint arXiv:2505.15117, 2025. 11

work page arXiv 2025
[28]

Beyond ten turns: Unlocking long-horizon agentic search with large-scale asynchronous rl

Jiaxuan Gao, Wei Fu, Minyang Xie, Shusheng Xu, Chuyi He, Zhiyu Mei, Banghua Zhu, and Yi Wu. Beyond ten turns: Unlocking long-horizon agentic search with large-scale asynchronous rl. arXiv preprint arXiv:2508.07976, 2025

work page arXiv 2025
[29]

Measuring and Narrowing the Compositionality Gap in Language Models

Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models. arXiv preprint arXiv:2210.03350, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[30]

All language models large and small

Zhixun Chen, Yali Du, and David Mguni. All language models large and small. arXiv preprint arXiv:2402.12061, 2024

work page arXiv 2024
[31]

Reinforcement learning as heuristic for action-rule preferences

Joost Broekens, Koen Hindriks, and Pascal Wiggers. Reinforcement learning as heuristic for action-rule preferences. In International Workshop on Programming Multi-Agent Systems, pages 25–40. Springer, 2010

work page 2010
[32]

Reinforcement learning framework for window hardware installation

Tzu-Hao Huang. Reinforcement learning framework for window hardware installation. 2022

work page 2022
[33]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[34]

High-Dimensional Continuous Control Using Generalized Advantage Estimation

John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High- dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[35]

ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning

Mingyang Chen, Tianpeng Li, Haoze Sun, Yijie Zhou, Chenzheng Zhu, Haofen Wang, Jeff Z Pan, Wen Zhang, Huajun Chen, Fan Yang, et al. Learning to reason with search for llms via reinforcement learning. arXiv preprint arXiv:2503.19470, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[36]

ZeroSearch: Incentivize the Search Capability of LLMs without Searching

Hao Sun, Zile Qiao, Jiayan Guo, Xuanbo Fan, Yingyan Hou, Yong Jiang, Pengjun Xie, Yan Zhang, Fei Huang, and Jingren Zhou. Zerosearch: Incentivize the search capability of llms without searching. arXiv preprint arXiv:2505.04588, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[37]

R-search: Em- powering llm reasoning with search via multi-reward reinforcement learning

Qingfei Zhao, Ruobing Wang, Dingling Xu, Daren Zha, and Limin Liu. R-search: Em- powering llm reasoning with search via multi-reward reinforcement learning. arXiv preprint arXiv:2506.04185, 2025

work page arXiv 2025
[38]

Ssrl: Self-search reinforcement learning

Yuchen Fan, Kaiyan Zhang, Heng Zhou, Yuxin Zuo, Yanxu Chen, Yu Fu, Xinwei Long, Xuekai Zhu, Che Jiang, Yuchen Zhang, et al. Ssrl: Self-search reinforcement learning. arXiv preprint arXiv:2508.10874, 2025

work page arXiv 2025
[39]

Stepsearch: Igniting llms search ability via step-wise proximal policy optimization

Ziliang Wang, Xuhui Zheng, Kang An, Cijun Ouyang, Jialu Cai, Yuhang Wang, and Yichao Wu. Stepsearch: Igniting llms search ability via step-wise proximal policy optimization. arXiv preprint arXiv:2505.15107, 2025

work page arXiv 2025
[40]

Qwen2.5 technical report, 2025

Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

work page 2025
[41]

Text Embeddings by Weakly-Supervised Contrastive Pre-training

Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[42]

Dense passage retrieval for open-domain question answering

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick SH Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In EMNLP (1), pages 6769–6781, 2020

work page 2020
[43]

DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments

Yuxiang Zheng, Dayuan Fu, Xiangkun Hu, Xiaojie Cai, Lyumanshan Ye, Pengrui Lu, and Pengfei Liu. Deepresearcher: Scaling deep research via reinforcement learning in real-world environments. arXiv preprint arXiv:2504.03160, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[44]

O2-searcher: A searching-based agent model for open-domain open-ended question answering

Jianbiao Mei, Tao Hu, Daocheng Fu, Licheng Wen, Xuemeng Yang, Rong Wu, Pinlong Cai, Xinyu Cai, Xing Gao, Yu Yang, et al. O2-searcher: A searching-based agent model for open-domain open-ended question answering. arXiv preprint arXiv:2505.16582, 2025

work page arXiv 2025
[45]

Masksearch: A universal pre-training framework to enhance agentic search capability

Weiqi Wu, Xin Guan, Shen Huang, Yong Jiang, Pengjun Xie, Fei Huang, Jiuxin Cao, Hai Zhao, and Jingren Zhou. Masksearch: A universal pre-training framework to enhance agentic search capability. arXiv preprint arXiv:2505.20285, 2025. 12

work page arXiv 2025
[46]

Evolvesearch: An iterative self-evolving search agent

Dingchu Zhang, Yida Zhao, Jialong Wu, Baixuan Li, Wenbiao Yin, Liwen Zhang, Yong Jiang, Yufeng Li, Kewei Tu, Pengjun Xie, et al. Evolvesearch: An iterative self-evolving search agent. arXiv preprint arXiv:2505.22501, 2025

work page arXiv 2025
[47]

Dynasearcher: Dynamic knowledge graph augmented search agent via multi-reward reinforcement learning

Chuzhan Hao, Wenfeng Feng, Yuewei Zhang, and Hao Wang. Dynasearcher: Dynamic knowledge graph augmented search agent via multi-reward reinforcement learning. arXiv preprint arXiv:2507.17365, 2025

work page arXiv 2025
[48]

Hybrid deep searcher: Integrating parallel and sequential search reasoning

Dayoon Ko, Jihyuk Kim, Haeju Park, Sohyeon Kim, Dahyun Lee, Yongrae Jo, Gunhee Kim, Moontae Lee, and Kyungjae Lee. Hybrid deep searcher: Integrating parallel and sequential search reasoning. arXiv preprint arXiv:2508.19113, 2025

work page arXiv 2025
[49]

Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps

Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6609–6625, Barcelona, Spain (Online), December 2020. International Committee on Computational Linguistics

work page 2020
[50]

Musique: Multihop questions via single-hop question composition, 2022

Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Musique: Multihop questions via single-hop question composition, 2022

work page 2022
[51]

HybridFlow: A Flexible and Efficient RLHF Framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[52]

cheatable

Vladimir Karpukhin, Barlas O ˘guz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen tau Yih. Dense passage retrieval for open-domain question answering, 2020. 13 A Related work Recent research has increasingly explored reinforcement learning (RL) as a means to improve the retrieval and reasoning capabilities of large language models...

1. Identify the city where Chopin was born.
2. Determine which organization has legislative power in that city.

Let’s address these sub-questions one by one. <search> Chopin birthplace city </search> <information> Doc 1<## Title: "Frédéric Chopin" ##> Frédéric François Chopin was born on 1 March 1810 in Żelazowa Wola, a village in the Duchy of Warsaw ... moved to Warsaw when he was seven months old ....

1. Identify the person who won the electoral vote in the 2000 U.S. presidential election.
2. Then find out who that person’s spouse is.

<search> Winner of electoral vote in the 2000 election </search> <information> No useful information related are retrieved! </information>

Plan is incorrect! It will trigger an erasure and regeneration of plan!

Erasure and Regeneration: To answer this question, we need to break it down into sub-questions:

1. Identify the person who won the popular vote in the 2000 election.
2. Determine the spouse of that person.

Let’s address these sub-questions one by one. <search> popular vote in the 2000 election (Prussia)? </search> <information> Doc 1<## Title: "2000 United States presidential election" ##>The 2000 United States presidential election was the 54th quadrennial presidential election... Doc 2<## Title: "2000 United States pre...

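The erase-and-regenerate step in the 2000-election trace above can be illustrated with a small sketch. This is not the authors' implementation: how ERL actually decides that a plan is faulty is not specified in this excerpt, so the check below simply looks for the "No useful information" message that appears in the trace. The names `NO_INFO`, `last_information`, and `erase_plan_if_failed` are hypothetical.

```python
import re
from typing import Optional, Tuple

# Assumption: an empty retrieval is signalled by this message, as in the trace.
NO_INFO = "No useful information"


def last_information(trace: str) -> Optional[str]:
    """Return the text of the last <information> block, or None if there is none."""
    blocks = re.findall(r"<information>(.*?)</information>", trace, flags=re.DOTALL)
    return blocks[-1].strip() if blocks else None


def erase_plan_if_failed(prefix: str, plan_and_search: str) -> Tuple[str, bool]:
    """Drop the plan segment when its search returned nothing useful.

    Returns the context to continue generating from and a flag saying
    whether the segment was erased.
    """
    info = last_information(plan_and_search)
    if info is not None and NO_INFO in info:
        # Erase: fall back to the context before the faulty plan,
        # so a new plan can be generated in the same position.
        return prefix, True
    # Keep: the plan and its retrieved evidence stay in the context.
    return prefix + plan_and_search, False
```

When the flag is true, a caller would ask the model for a new plan starting from the returned context, which matches the trace: the electoral-vote plan is discarded and the popular-vote plan takes its place.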
1. Identify where City A is, where KFAB broadcasts.
2. Determine who wrote “The Road to City A” with City A being the location identified above.

Let’s address these step by step: <search> KFAB broadcasts </search> <information> Doc 1<## Title: KFAB ##> KFAB (1110 AM) is a 50,000 watt clear channel news and talk radio station licensed to Omaha, Nebraska... </information> <observation> KFAB broadcasts in Omaha,...

1. Identify the city where Alchip is headquartered.
2. Determine which performer released the Fantasy Land Tour 2004 album and performed it live in that city.

Let’s address each sub-question: <search> Alchip headquarters city </search> <information> Doc 1<## Title: Alchip ##> Alchip is a fabless semiconductor company founded in 2003 and headquartered in Taipei, Taiwan... </information> <observation> Alchip is...

1. Identify the country Banir is located in.
2. Determine the year of the country’s unification.

Let’s address each sub-question: <search> Banir location </search> <information> Doc 1<## Title: Banior ##> Banior is a village and gram panchayat in Nalhati I Block in Rampurhat subdivision of Birbhum District in the Indian state of West Bengal... Doc 2<## Title: Banagher ##> Banagher is a town in Republic...

1. Identify the state where Wellesley College is located.
2. Determine the capital of that state.
3. Find out who established the first committee of correspondence in 1772 in that capital.
4. Understand the reason for establishing the committee.

Let’s address each sub-question: <search> Wellesley College location </search> <information> Doc 1<## Title: "Wellesley, Massachusetts" ##> Wellesley is a town in Norfolk County, Massachusetts... It is best known as the home of Wellesley College... Doc 2<## Title: "Wellesley College" ##> Wellesley Coll...

1. Identify the author of the book "America-Lite: How Imperial Academia Dismantled Our Culture".
2. Determine the school attended by this author.
3. Find the weekly publication in Jeanne Flanagan’s birthplace that is issued by the school identified in step 2.

Let’s address each sub-question: <search> author of the book "America-Lite: How Imperial Academia Dismantled Our Culture" </search> <information> Doc 1<## Title: America-Lite ##> America-Lite America-Lite: How Imperial Academia Dismantled Our Cul...

1. Identify the director of The Mystic Masseur.
2. Determine the country where the director was born.
3. Find out in which state Kattalai, a city, is located in that country.

Let’s address each sub-question: <search> director of The Mystic Masseur </search> <information> Doc 1<## Title: "The Mystic Masseur" ##> The Mystic Masseur The Mystic Masseur is a 2001 Merchant Ivory film based on the novel of the same name by V. S. Naipaul. It is one of relatively fe...