pith. machine review for the scientific record.

arxiv: 2510.00861 · v2 · submitted 2025-10-01 · 💻 cs.CL · cs.AI · cs.IR

Erase to Improve: Erasable Reinforcement Learning for Search-Augmented LLMs

Pith reviewed 2026-05-18 11:02 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.IR
keywords Erasable Reinforcement Learning · multi-hop reasoning · search-augmented LLMs · error correction · reinforcement learning · question answering

The pith

Erasable reinforcement learning lets search-augmented LLMs detect, erase, and regenerate faulty reasoning steps to prevent error propagation in multi-hop tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Erasable Reinforcement Learning (ERL) as a way to make search-augmented LLMs handle complex multi-hop reasoning more reliably. It targets three common failure points: breaking tasks into wrong sub-questions, failing to retrieve key facts, and letting flawed logic continue through later steps. ERL adds a training loop that spots these bad steps, removes them, and immediately regenerates better reasoning in the same spot. The resulting ESearch models are tested on four multi-hop question-answering benchmarks. A reader would care because the method improves smaller models without requiring bigger parameter counts or more search calls.

Core claim

Erasable Reinforcement Learning (ERL) transforms fragile multi-step reasoning into a robust process by explicitly identifying faulty steps, erasing them, and regenerating correct reasoning in place so that defective logic does not propagate through the chain.

What carries the argument

The ERL loop, which detects faulty reasoning steps during reinforcement learning and erases them for targeted regeneration.
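
A minimal sketch of the erase-and-regenerate control flow, written as an inference-style loop for readability. The paper applies the idea during reinforcement learning, and its detector and rollout code are not shown on this page, so `propose`, `is_faulty`, `is_final`, and the retry cap are hypothetical names and assumptions chosen only for illustration.

```python
from typing import Callable, List, Tuple

Chain = List[str]


def run_chain(
    propose: Callable[[Chain, int], str],     # hypothetical: drafts the next step
    is_faulty: Callable[[Chain, str], bool],  # hypothetical: step-level detector
    is_final: Callable[[str], bool],          # hypothetical: stop condition
    max_steps: int = 8,
    max_retries: int = 2,
) -> Tuple[Chain, int]:
    """Build a reasoning chain, erasing and redrafting flagged steps in place."""
    chain: Chain = []
    erasures = 0
    while len(chain) < max_steps:
        attempt = 0
        step = propose(chain, attempt)
        # A flagged step is redrafted before it is appended (up to max_retries),
        # so later steps only build on it if every retry is flagged as well.
        while is_faulty(chain, step) and attempt < max_retries:
            attempt += 1
            erasures += 1
            step = propose(chain, attempt)  # regenerate from the same prefix
        chain.append(step)
        if is_final(step):
            break
    return chain, erasures


# Toy stand-ins: the first draft of step 0 is bad, its redraft is fine.
SCRIPT = {
    (0, 0): "BAD plan",
    (0, 1): "plan: find the person, then the spouse",
    (1, 0): "answer: done",
}


def toy_propose(prefix: Chain, attempt: int) -> str:
    return SCRIPT[(len(prefix), attempt)]


chain, erasures = run_chain(
    toy_propose,
    is_faulty=lambda prefix, step: step.startswith("BAD"),
    is_final=lambda step: step.startswith("answer"),
)
print(chain, erasures)
```

In the toy run the bad plan never enters the chain: it is erased once and replaced by the redraft before the answer step is generated.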

If this is right

  • The 3B model improves by 8.48% exact match and 11.56% F1 over prior best results on HotpotQA, MuSiQue, 2Wiki, and Bamboogle.
  • The 7B model improves by 5.38% exact match and 7.22% F1 over prior best results on the same benchmarks.
  • Reasoning chains become more resilient because errors are corrected locally instead of derailing the entire answer.
  • The same training approach can be applied to other search-augmented LLM setups that rely on multi-step decomposition and retrieval.
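
For readers unfamiliar with the two metrics in the bullets above, here is a small sketch of exact match (EM) and token-level F1 as they are commonly computed for question answering. The normalization steps (lowercasing, dropping punctuation and English articles) follow the widespread SQuAD-style convention and are an assumption here; this page does not state the paper's exact evaluation script.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace (assumed convention)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Warsaw", "warsaw"))   # 1.0
print(token_f1("city of Warsaw", "Warsaw"))  # 0.5
```

F1 gives partial credit where EM gives none, which is why the two numbers can move by different amounts for the same model.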

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method might let smaller models close performance gaps with larger ones by improving error recovery rather than adding parameters.
  • If extended to inference time, the same erase-and-regenerate step could support real-time correction in interactive applications.
  • Similar local correction could be tested on chain-of-thought traces in mathematics or code generation where step-level errors also accumulate.

Load-bearing premise

Faulty reasoning steps can be detected accurately enough that erasing and regenerating them produces a net gain without creating new errors or using too much extra computation.

What would settle it

An experiment in which the detection step often marks correct reasoning as faulty or in which regenerated steps produce lower final accuracy than the original chain would show the method does not work as claimed.
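
The two quantities such an experiment would need can be sketched as follows. This assumes step-level correctness labels and per-question outcomes with and without regeneration are available, which the page does not say the paper provides; the function names and the toy data are hypothetical.

```python
from typing import Sequence


def false_positive_rate(flagged: Sequence[bool], step_correct: Sequence[bool]) -> float:
    """Share of correct steps that the detector nevertheless marked as faulty."""
    correct_total = sum(step_correct)
    if correct_total == 0:
        return 0.0
    false_positives = sum(f and c for f, c in zip(flagged, step_correct))
    return false_positives / correct_total


def net_accuracy_gain(before: Sequence[bool], after: Sequence[bool]) -> float:
    """Change in final-answer accuracy when erase-and-regenerate is switched on."""
    if len(before) != len(after) or not before:
        raise ValueError("need two non-empty outcome lists of equal length")
    return (sum(after) - sum(before)) / len(before)


# Toy data, invented purely to exercise the functions.
fpr = false_positive_rate(
    flagged=[True, False, True, False],
    step_correct=[True, True, False, True],
)
gain = net_accuracy_gain(
    before=[True, False, False, True],
    after=[True, True, True, False],
)
print(fpr, gain)
```

A high false-positive rate together with a zero or negative gain would be the pattern that counts against the method.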

Figures

Figures reproduced from arXiv: 2510.00861 by Cijun Ouyang, Faqiang Qian, Jialu Cai, Kang An, Weikun Zhang, Xuhui Zheng, Yichao Wu, Yuhang Wang, Ziliang Wang.

Figure 1: Overview of ESEARCH. Different colors and symbols are used to represent the interactive behaviors S (Search), I (Information), O (Observation), and A (Sub Answer). In the answering process, there are three types of erasure and retry behaviors: (1) incorrect initial search results trigger initialization plan erasure; (2) incorrect subsequent search results trigger search design erasure; (3) incorrect sub-an…
Figure 2: Training dynamics of different RL strategies. Compared to PPO, GRPO demonstrates…
Figure 3: Training dynamics in ablation experiments.
Figure 4: Overview of ESearch. In the figure, 'o/w' (only with) indicates that only the current mechanism is added to the base method.
7 Limitation & Future discussion: The strength of the ERL framework lies in its structured cycle of identification, erasure, and regeneration, which enables targeted correction of reasoning errors and significantly improves reliability. This sequential design inherently increases com…
Figure 5: LLM interacts with external search engines and provides answers to prompt templates. The…
Figure 6: Case study demonstrating error recovery where initial planning focused on birthplace rather than where Chopin grew up, but observations corrected the understanding to identify Warsaw as his hometown.
Figure 7: Step-by-step reasoning for a 2-hop question identifying the spouse of the winner of the…
Figure 8: ESearch makes an incorrect observation: although the first hop (KFAB…
Figure 9: ESearch makes an incorrect entity alignment: although the retrieval step surfaced the…
Figure 10: ESearch also produces an erroneous observation reasoning chain: Banir was incorrectly…
Figure 11: Case study showing step-by-step reasoning with sub-questions leading to the identification…
Figure 12: Complex multi-step reasoning requiring identification of author, educational background,…
Figure 13: ESearch can efficiently handle a 4-hops reasoning question: after gathering relevant…
Original abstract

While search-augmented large language models (LLMs) exhibit impressive capabilities, their reliability in complex multi-hop reasoning remains limited. This limitation arises from three fundamental challenges: decomposition errors, where tasks are incorrectly broken down; retrieval missing, where key evidence fails to be retrieved; and reasoning errors, where flawed logic propagates through the reasoning chain. A single failure in any of these stages can derail the final answer. We propose Erasable Reinforcement Learning (ERL), a novel framework that transforms fragile reasoning into a robust process. ERL explicitly identifies faulty steps, erases them, and regenerates reasoning in place, preventing defective logic from propagating through the reasoning chain. This targeted correction mechanism turns brittle reasoning into a more resilient process. Models trained with ERL, termed ESearch, achieve substantial improvements on HotpotQA, MuSiQue, 2Wiki, and Bamboogle, with the 3B model achieving +8.48% EM and +11.56% F1, and the 7B model achieving +5.38% EM and +7.22% F1 over previous state-of-the-art(SOTA) results. These findings suggest that erasable reinforcement learning provides a powerful paradigm shift for robust multi-step reasoning in LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Erasable Reinforcement Learning (ERL) for search-augmented LLMs to mitigate decomposition, retrieval, and reasoning errors in multi-hop tasks by detecting faulty intermediate steps, erasing them, and regenerating the reasoning in place. Models trained under this framework (ESearch) are reported to outperform prior SOTA on HotpotQA, MuSiQue, 2Wiki, and Bamboogle, with absolute gains of +8.48% EM / +11.56% F1 for the 3B variant and +5.38% EM / +7.22% F1 for the 7B variant.

Significance. If the empirical gains prove robust and the erasure mechanism is shown to be the causal driver rather than ancillary compute, the work would provide a concrete, targeted correction strategy that strengthens the reliability of search-augmented reasoning chains. It extends RL-based training for LLMs by adding an explicit erase-and-regenerate loop, which could be broadly applicable to other chain-of-thought and retrieval-augmented settings.

major comments (2)
  1. [§3] §3 (ERL framework description): the unsupervised detection of faulty reasoning steps is load-bearing for the central claim yet remains underspecified. Without ground-truth labels on intermediates, the method must rely on a learned critic, reward threshold, or self-consistency signal; the manuscript does not detail how false-positive erasures (removing correct steps) or false-negative retentions (leaving errors) are controlled, leaving open the possibility that observed gains arise from extra gradient updates or search budget rather than the erasable mechanism itself.
  2. [Experimental section] Experimental section / results tables: no ablation isolates the contribution of the erase-regenerate step from standard RL training or increased inference-time search. The headline +8.48% EM lift on the 3B model cannot be confidently attributed to ERL until such controls are shown; otherwise the result risks being an artifact of the training regime rather than the proposed correction process.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'previous state-of-the-art(SOTA)' should be accompanied by explicit citations to the prior works being surpassed.
  2. [Introduction] Notation: introduce the distinction between ERL (the training framework) and ESearch (the resulting model) at first use to avoid reader confusion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the presentation of the ERL framework and strengthen the experimental claims. We address each major point below and will incorporate revisions to improve the manuscript.

point-by-point responses
  1. Referee: [§3] §3 (ERL framework description): the unsupervised detection of faulty reasoning steps is load-bearing for the central claim yet remains underspecified. Without ground-truth labels on intermediates, the method must rely on a learned critic, reward threshold, or self-consistency signal; the manuscript does not detail how false-positive erasures (removing correct steps) or false-negative retentions (leaving errors) are controlled, leaving open the possibility that observed gains arise from extra gradient updates or search budget rather than the erasable mechanism itself.

    Authors: We agree that Section 3 provides a high-level description of faulty-step detection and would benefit from greater specificity. The ERL approach identifies faulty steps via a reward-based critic that flags low-reward intermediate outputs, augmented by a self-consistency check over multiple sampled continuations of the same prefix. Erasure is applied only when both signals agree and is limited to a single step per chain to reduce over-erasure risk. We will revise Section 3 to include the precise critic formulation, threshold selection procedure, and a short discussion of how these choices limit false positives and negatives. We will also add a brief analysis showing that the total number of gradient updates and search budget are matched to the baselines, so that gains cannot be attributed solely to extra compute. revision: yes

  2. Referee: [Experimental section] Experimental section / results tables: no ablation isolates the contribution of the erase-regenerate step from standard RL training or increased inference-time search. The headline +8.48% EM lift on the 3B model cannot be confidently attributed to ERL until such controls are shown; otherwise the result risks being an artifact of the training regime rather than the proposed correction process.

    Authors: This observation is correct and points to a genuine gap in the current experimental design. While the reported results compare ESearch against prior SOTA methods, they do not contain an explicit ablation that removes only the erase-and-regenerate loop while holding the underlying RL objective, training steps, and inference-time search budget fixed. We will add these controls in the revised version, including (i) a standard RL baseline without erasure, (ii) a variant that performs additional search steps without erasure, and (iii) a table reporting the isolated contribution of the erasure component. These additions will allow readers to attribute performance differences more directly to the proposed mechanism. revision: yes
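
The gating rule described in the first response (erase only when the reward critic and the self-consistency check agree, and erase at most one step per chain) can be sketched as below. The rebuttal is simulated and gives no formulas, so the thresholds, the agreement measure, and the choice to erase the earliest flagged step are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence


def pick_step_to_erase(
    steps: Sequence[str],
    rewards: Sequence[float],        # hypothetical per-step critic scores
    samples: Sequence[List[str]],    # hypothetical resampled drafts per step
    reward_threshold: float = 0.3,   # assumed value
    min_agreement: float = 0.5,      # assumed value
) -> Optional[int]:
    """Return the index of the single step to erase, or None if nothing qualifies."""
    for index, (step, reward, drafts) in enumerate(zip(steps, rewards, samples)):
        low_reward = reward < reward_threshold
        # Self-consistency: fraction of resampled drafts that reproduce this step.
        agreement = sum(d == step for d in drafts) / len(drafts) if drafts else 1.0
        inconsistent = agreement < min_agreement
        if low_reward and inconsistent:
            return index  # earliest step flagged by both signals (assumed tie-break)
    return None


steps = ["plan", "search A", "observe B"]
target = pick_step_to_erase(
    steps,
    rewards=[0.9, 0.1, 0.1],
    samples=[["plan", "plan"], ["search A", "search A"], ["observe X", "observe Y", "observe B"]],
)
print(target)
```

Step 1 has a low reward but is reproduced by every resample, so it is kept; step 2 fails both checks and is the only one erased. Requiring both signals trades some missed errors for fewer erasures of correct steps.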

Circularity Check

0 steps flagged

No circularity: empirical RL framework evaluated on external benchmarks

full rationale

The paper introduces Erasable Reinforcement Learning (ERL) as a training procedure that detects faulty intermediate reasoning steps, erases them, and regenerates replacements within a search-augmented LLM pipeline. Reported gains (+8.48% EM / +11.56% F1 for the 3B model on HotpotQA/MuSiQue/etc.) are presented as measured outcomes of this training on standard held-out QA datasets. No equations, uniqueness theorems, fitted-parameter predictions, or self-citation chains appear in the provided text that would reduce any claimed result to its own inputs by construction. The central contribution is therefore an empirical method whose validity rests on external benchmark performance rather than internal definitional closure.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claim rests on the unverified ability of the RL process to identify and correct specific faulty steps. The abstract details no free parameters or axioms; the one invented entity is the ERL framework itself.

invented entities (1)
  • Erasable Reinforcement Learning (ERL) · no independent evidence
    purpose: To identify faulty steps, erase them, and regenerate reasoning to prevent error propagation
    New framework introduced to address the three listed challenges in search-augmented LLMs.

pith-pipeline@v0.9.0 · 5789 in / 1182 out tokens · 43061 ms · 2026-05-18T11:02:51.265738+00:00 · methodology

