Erase to Improve: Erasable Reinforcement Learning for Search-Augmented LLMs
Pith reviewed 2026-05-18 11:02 UTC · model grok-4.3
The pith
Erasable reinforcement learning lets search-augmented LLMs detect, erase, and regenerate faulty reasoning steps to prevent error propagation in multi-hop tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Erasable Reinforcement Learning (ERL) transforms fragile multi-step reasoning into a robust process by explicitly identifying faulty steps, erasing them, and regenerating correct reasoning in place so that defective logic does not propagate through the chain.
What carries the argument
The ERL loop, which detects faulty reasoning steps during reinforcement learning and erases them for targeted regeneration.
If this is right
- The 3B model improves by 8.48% exact match and 11.56% F1 over prior best results on HotpotQA, MuSiQue, 2Wiki, and Bamboogle.
- The 7B model improves by 5.38% exact match and 7.22% F1 over prior best results on the same benchmarks.
- Reasoning chains become more resilient because errors are corrected locally instead of derailing the entire answer.
- The same training approach can be applied to other search-augmented LLM setups that rely on multi-step decomposition and retrieval.
Where Pith is reading between the lines
- The method might let smaller models close performance gaps with larger ones by improving error recovery rather than adding parameters.
- If extended to inference time, the same erase-and-regenerate step could support real-time correction in interactive applications.
- Similar local correction could be tested on chain-of-thought traces in mathematics or code generation where step-level errors also accumulate.
Load-bearing premise
Faulty reasoning steps can be detected accurately enough that erasing and regenerating them produces a net gain without creating new errors or using too much extra computation.
What would settle it
An experiment in which the detection step often marks correct reasoning as faulty or in which regenerated steps produce lower final accuracy than the original chain would show the method does not work as claimed.
Figures
read the original abstract
While search-augmented large language models (LLMs) exhibit impressive capabilities, their reliability in complex multi-hop reasoning remains limited. This limitation arises from three fundamental challenges: decomposition errors, where tasks are incorrectly broken down; retrieval missing, where key evidence fails to be retrieved; and reasoning errors, where flawed logic propagates through the reasoning chain. A single failure in any of these stages can derail the final answer. We propose Erasable Reinforcement Learning (ERL), a novel framework that transforms fragile reasoning into a robust process. ERL explicitly identifies faulty steps, erases them, and regenerates reasoning in place, preventing defective logic from propagating through the reasoning chain. This targeted correction mechanism turns brittle reasoning into a more resilient process. Models trained with ERL, termed ESearch, achieve substantial improvements on HotpotQA, MuSiQue, 2Wiki, and Bamboogle, with the 3B model achieving +8.48% EM and +11.56% F1, and the 7B model achieving +5.38% EM and +7.22% F1 over previous state-of-the-art(SOTA) results. These findings suggest that erasable reinforcement learning provides a powerful paradigm shift for robust multi-step reasoning in LLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Erasable Reinforcement Learning (ERL) for search-augmented LLMs to mitigate decomposition, retrieval, and reasoning errors in multi-hop tasks by detecting faulty intermediate steps, erasing them, and regenerating the reasoning in place. Models trained under this framework (ESearch) are reported to outperform prior SOTA on HotpotQA, MuSiQue, 2Wiki, and Bamboogle, with absolute gains of +8.48% EM / +11.56% F1 for the 3B variant and +5.38% EM / +7.22% F1 for the 7B variant.
Significance. If the empirical gains prove robust and the erasure mechanism is shown to be the causal driver rather than ancillary compute, the work would provide a concrete, targeted correction strategy that strengthens the reliability of search-augmented reasoning chains. It extends RL-based training for LLMs by adding an explicit erase-and-regenerate loop, which could be broadly applicable to other chain-of-thought and retrieval-augmented settings.
major comments (2)
- [§3] §3 (ERL framework description): the unsupervised detection of faulty reasoning steps is load-bearing for the central claim yet remains underspecified. Without ground-truth labels on intermediates, the method must rely on a learned critic, reward threshold, or self-consistency signal; the manuscript does not detail how false-positive erasures (removing correct steps) or false-negative retentions (leaving errors) are controlled, leaving open the possibility that observed gains arise from extra gradient updates or search budget rather than the erasable mechanism itself.
- [Experimental section] Experimental section / results tables: no ablation isolates the contribution of the erase-regenerate step from standard RL training or increased inference-time search. The headline +8.48% EM lift on the 3B model cannot be confidently attributed to ERL until such controls are shown; otherwise the result risks being an artifact of the training regime rather than the proposed correction process.
minor comments (2)
- [Abstract] Abstract: the phrase 'previous state-of-the-art(SOTA)' should be accompanied by explicit citations to the prior works being surpassed.
- [Introduction] Notation: introduce the distinction between ERL (the training framework) and ESearch (the resulting model) at first use to avoid reader confusion.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify the presentation of the ERL framework and strengthen the experimental claims. We address each major point below and will incorporate revisions to improve the manuscript.
read point-by-point responses
-
Referee: [§3] §3 (ERL framework description): the unsupervised detection of faulty reasoning steps is load-bearing for the central claim yet remains underspecified. Without ground-truth labels on intermediates, the method must rely on a learned critic, reward threshold, or self-consistency signal; the manuscript does not detail how false-positive erasures (removing correct steps) or false-negative retentions (leaving errors) are controlled, leaving open the possibility that observed gains arise from extra gradient updates or search budget rather than the erasable mechanism itself.
Authors: We agree that Section 3 provides a high-level description of faulty-step detection and would benefit from greater specificity. The ERL approach identifies faulty steps via a reward-based critic that flags low-reward intermediate outputs, augmented by a self-consistency check over multiple sampled continuations of the same prefix. Erasure is applied only when both signals agree and is limited to a single step per chain to reduce over-erasure risk. We will revise Section 3 to include the precise critic formulation, threshold selection procedure, and a short discussion of how these choices limit false positives and negatives. We will also add a brief analysis showing that the total number of gradient updates and search budget are matched to the baselines, so that gains cannot be attributed solely to extra compute. revision: yes
-
Referee: [Experimental section] Experimental section / results tables: no ablation isolates the contribution of the erase-regenerate step from standard RL training or increased inference-time search. The headline +8.48% EM lift on the 3B model cannot be confidently attributed to ERL until such controls are shown; otherwise the result risks being an artifact of the training regime rather than the proposed correction process.
Authors: This observation is correct and points to a genuine gap in the current experimental design. While the reported results compare ESearch against prior SOTA methods, they do not contain an explicit ablation that removes only the erase-and-regenerate loop while holding the underlying RL objective, training steps, and inference-time search budget fixed. We will add these controls in the revised version, including (i) a standard RL baseline without erasure, (ii) a variant that performs additional search steps without erasure, and (iii) a table reporting the isolated contribution of the erasure component. These additions will allow readers to attribute performance differences more directly to the proposed mechanism. revision: yes
Circularity Check
No circularity: empirical RL framework evaluated on external benchmarks
full rationale
The paper introduces Erasable Reinforcement Learning (ERL) as a training procedure that detects faulty intermediate reasoning steps, erases them, and regenerates replacements within a search-augmented LLM pipeline. Reported gains (+8.48% EM / +11.56% F1 for the 3B model on HotpotQA/MuSiQue/etc.) are presented as measured outcomes of this training on standard held-out QA datasets. No equations, uniqueness theorems, fitted-parameter predictions, or self-citation chains appear in the provided text that would reduce any claimed result to its own inputs by construction. The central contribution is therefore an empirical method whose validity rests on external benchmark performance rather than internal definitional closure.
Axiom & Free-Parameter Ledger
invented entities (1)
-
Erasable Reinforcement Learning (ERL)
no independent evidence
Reference graph
Works this paper leans on
-
[1]
OpenAI. Gpt-5 system card. Technical report, OpenAI, aug 2025. Accessed: 2025-09-11
work page 2025
-
[2]
The llama 4 herd: The beginning of a new era of natively multimodal intelli- gence
Meta AI. The llama 4 herd: The beginning of a new era of natively multimodal intelli- gence. https://ai.meta.com/blog/lllama-4-multimodal-intelligence/ , apr 2025. Accessed: 2025-09-11
work page 2025
-
[3]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[4]
Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qiang- long Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2):1–55, 2025
work page 2025
-
[5]
TrustLLM: Trustworthiness in Large Language Models
Yue Huang, Lichao Sun, Haoran Wang, Siyuan Wu, Qihui Zhang, Yuan Li, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan Zhang, et al. Trustllm: Trustworthiness in large language models. arXiv preprint arXiv:2401.05561, 2024
work page internal anchor Pith review arXiv 2024
-
[6]
Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[7]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025. 10
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[8]
Kimi K2: Open Agentic Intelligence
Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, et al. Kimi k2: Open agentic intelligence. arXiv preprint arXiv:2507.20534, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[9]
A survey of llm-based deep search agents: Paradigm, optimization, evaluation, and challenges
Yunjia Xi, Jianghao Lin, Yongzhao Xiao, Zheli Zhou, Rong Shan, Te Gao, Jiachen Zhu, Weiwen Liu, Yong Yu, and Weinan Zhang. A survey of llm-based deep search agents: Paradigm, optimization, evaluation, and challenges. arXiv preprint arXiv:2508.05668, 2025
-
[10]
Retrieval-augmented generation for knowledge-intensive nlp tasks
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems, 33:9459–9474, 2020
work page 2020
-
[11]
Retrieval-Augmented Generation for Large Language Models: A Survey
Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yixin Dai, Jiawei Sun, Haofen Wang, and Haofen Wang. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997, 2(1), 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[12]
Introducing deep research, 2025
OpenAI. Introducing deep research, 2025. Accessed: 2025-09-11
work page 2025
-
[13]
Gemini deep research – your personal research assistant, 2025
Google DeepMind. Gemini deep research – your personal research assistant, 2025. Accessed: 2025-09-11
work page 2025
-
[14]
Introducing perplexity deep research, 2025
Perplexity AI. Introducing perplexity deep research, 2025. Accessed: 2025-09-11
work page 2025
-
[15]
Deep Reinforcement Learning: An Overview
Yuxi Li. Deep reinforcement learning: An overview. arXiv preprint arXiv:1701.07274, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[16]
Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning
Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Za- mani, and Jiawei Han. Search-r1: Training llms to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[17]
R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning
Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, and Ji-Rong Wen. R1-searcher: Incentivizing the search capability in llms via reinforcement learning. arXiv preprint arXiv:2503.05592, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[18]
Shu Zhao, Tan Yu, Anbang Xu, Japinder Singh, Aaditya Shukla, and Rama Akkiraju. Par- allelsearch: Train your llms to decompose query and search sub-queries in parallel with reinforcement learning. arXiv preprint arXiv:2508.09303, 2025
-
[19]
HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering
Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhut- dinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[20]
Musique: Multihop questions via single-hop question composition
Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Musique: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539–554, 2022
work page 2022
-
[21]
Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps
Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps. arXiv preprint arXiv:2011.01060, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2011
-
[22]
Deep research agents: A systematic examination and roadmap.arXiv preprint arXiv:2506.18096, 2025
Yuxuan Huang, Yihang Chen, Haozheng Zhang, Kang Li, Meng Fang, Linyi Yang, Xiaoguang Li, Lifeng Shang, Songcen Xu, Jianye Hao, et al. Deep research agents: A systematic examina- tion and roadmap. arXiv preprint arXiv:2506.18096, 2025
-
[23]
Reinforcement learning foundations for deep research systems: A survey
Wenjun Li, Zhi Chen, Jingru Lin, Hannan Cao, Wei Han, Sheng Liang, Zhi Zhang, Kuicai Dong, Dexun Li, Chen Zhang, et al. Reinforcement learning foundations for deep research systems: A survey. arXiv preprint arXiv:2509.06733, 2025
-
[24]
Reinforcement learning: An introduction, volume 1
Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998
work page 1998
-
[25]
Reinforcement learning: A survey
Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996
work page 1996
-
[26]
Agent models: Inter- nalizing chain-of-action generation into reasoning models
Yuxiang Zhang, Yuqi Yang, Jiangming Shu, Xinyan Wen, and Jitao Sang. Agent models: Inter- nalizing chain-of-action generation into reasoning models. arXiv preprint arXiv:2503.06580, 2025
-
[27]
An empirical study on reinforcement learning for reasoning-search interleaved llm agents
Bowen Jin, Jinsung Yoon, Priyanka Kargupta, Sercan O Arik, and Jiawei Han. An empirical study on reinforcement learning for reasoning-search interleaved llm agents. arXiv preprint arXiv:2505.15117, 2025. 11
-
[28]
Beyond ten turns: Unlocking long-horizon agentic search with large-scale asynchronous rl
Jiaxuan Gao, Wei Fu, Minyang Xie, Shusheng Xu, Chuyi He, Zhiyu Mei, Banghua Zhu, and Yi Wu. Beyond ten turns: Unlocking long-horizon agentic search with large-scale asynchronous rl. arXiv preprint arXiv:2508.07976, 2025
-
[29]
Measuring and Narrowing the Compositionality Gap in Language Models
Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models. arXiv preprint arXiv:2210.03350, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[30]
All language models large and small
Zhixun Chen, Yali Du, and David Mguni. All language models large and small. arXiv preprint arXiv:2402.12061, 2024
-
[31]
Reinforcement learning as heuristic for action-rule preferences
Joost Broekens, Koen Hindriks, and Pascal Wiggers. Reinforcement learning as heuristic for action-rule preferences. In International Workshop on Programming Multi-Agent Systems, pages 25–40. Springer, 2010
work page 2010
-
[32]
Reinforcement learning framework for window hardware installation
Tzu-Hao Huang. Reinforcement learning framework for window hardware installation. 2022
work page 2022
-
[33]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[34]
High-Dimensional Continuous Control Using Generalized Advantage Estimation
John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High- dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[35]
ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning
Mingyang Chen, Tianpeng Li, Haoze Sun, Yijie Zhou, Chenzheng Zhu, Haofen Wang, Jeff Z Pan, Wen Zhang, Huajun Chen, Fan Yang, et al. Learning to reason with search for llms via reinforcement learning. arXiv preprint arXiv:2503.19470, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[36]
ZeroSearch: Incentivize the Search Capability of LLMs without Searching
Hao Sun, Zile Qiao, Jiayan Guo, Xuanbo Fan, Yingyan Hou, Yong Jiang, Pengjun Xie, Yan Zhang, Fei Huang, and Jingren Zhou. Zerosearch: Incentivize the search capability of llms without searching. arXiv preprint arXiv:2505.04588, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[37]
R-search: Em- powering llm reasoning with search via multi-reward reinforcement learning
Qingfei Zhao, Ruobing Wang, Dingling Xu, Daren Zha, and Limin Liu. R-search: Em- powering llm reasoning with search via multi-reward reinforcement learning. arXiv preprint arXiv:2506.04185, 2025
-
[38]
Ssrl: Self-search reinforcement learning
Yuchen Fan, Kaiyan Zhang, Heng Zhou, Yuxin Zuo, Yanxu Chen, Yu Fu, Xinwei Long, Xuekai Zhu, Che Jiang, Yuchen Zhang, et al. Ssrl: Self-search reinforcement learning. arXiv preprint arXiv:2508.10874, 2025
-
[39]
Stepsearch: Igniting llms search ability via step-wise proximal policy optimization
Ziliang Wang, Xuhui Zheng, Kang An, Cijun Ouyang, Jialu Cai, Yuhang Wang, and Yichao Wu. Stepsearch: Igniting llms search ability via step-wise proximal policy optimization. arXiv preprint arXiv:2505.15107, 2025
-
[40]
Qwen2.5 technical report, 2025
Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...
work page 2025
-
[41]
Text Embeddings by Weakly-Supervised Contrastive Pre-training
Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[42]
Dense passage retrieval for open-domain question answering
Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick SH Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In EMNLP (1), pages 6769–6781, 2020
work page 2020
-
[43]
DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments
Yuxiang Zheng, Dayuan Fu, Xiangkun Hu, Xiaojie Cai, Lyumanshan Ye, Pengrui Lu, and Pengfei Liu. Deepresearcher: Scaling deep research via reinforcement learning in real-world environments. arXiv preprint arXiv:2504.03160, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[44]
O2-searcher: A searching-based agent model for open-domain open-ended question answering
Jianbiao Mei, Tao Hu, Daocheng Fu, Licheng Wen, Xuemeng Yang, Rong Wu, Pinlong Cai, Xinyu Cai, Xing Gao, Yu Yang, et al. O2-searcher: A searching-based agent model for open-domain open-ended question answering. arXiv preprint arXiv:2505.16582, 2025
-
[45]
Masksearch: A universal pre-training framework to enhance agentic search capability
Weiqi Wu, Xin Guan, Shen Huang, Yong Jiang, Pengjun Xie, Fei Huang, Jiuxin Cao, Hai Zhao, and Jingren Zhou. Masksearch: A universal pre-training framework to enhance agentic search capability. arXiv preprint arXiv:2505.20285, 2025. 12
-
[46]
Evolvesearch: An iterative self-evolving search agent
Dingchu Zhang, Yida Zhao, Jialong Wu, Baixuan Li, Wenbiao Yin, Liwen Zhang, Yong Jiang, Yufeng Li, Kewei Tu, Pengjun Xie, et al. Evolvesearch: An iterative self-evolving search agent. arXiv preprint arXiv:2505.22501, 2025
-
[47]
Dynasearcher: Dynamic knowledge graph augmented search agent via multi-reward reinforcement learning
Chuzhan Hao, Wenfeng Feng, Yuewei Zhang, and Hao Wang. Dynasearcher: Dynamic knowledge graph augmented search agent via multi-reward reinforcement learning. arXiv preprint arXiv:2507.17365, 2025
-
[48]
Hybrid deep searcher: Integrating parallel and sequential search reasoning
Dayoon Ko, Jihyuk Kim, Haeju Park, Sohyeon Kim, Dahyun Lee, Yongrae Jo, Gunhee Kim, Moontae Lee, and Kyungjae Lee. Hybrid deep searcher: Integrating parallel and sequential search reasoning. arXiv preprint arXiv:2508.19113, 2025
-
[49]
Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps
Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6609–6625, Barcelona, Spain (Online), December 2020. International Committee on Computational Linguistics
work page 2020
-
[50]
Musique: Multihop questions via single-hop question composition, 2022
Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Musique: Multihop questions via single-hop question composition, 2022
work page 2022
-
[51]
HybridFlow: A Flexible and Efficient RLHF Framework
Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[52]
Vladimir Karpukhin, Barlas O ˘guz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen tau Yih. Dense passage retrieval for open-domain question answering, 2020. 13 A Related work Recent research has increasingly explored reinforcement learning (RL) as a means to improve the retrieval and reasoning capabilities of large language models...
-
[53]
Identify the city where Chopin was born
-
[54]
Determine which organization has legislative power in that city. Let’s address these sub-questions one by one. <search> Chopin birthplace city </search> <information> Doc 1<## Title: "Frédéric Chopin" ##> Frédéric François Chopin was born on 1 March 1810 in ˙Zelazowa Wola, a village in the Duchy of Warsaw ... moved to Warsaw when he was seven months old ....
work page 2000
-
[55]
Identify the person who won the electoral vote in the 2000 U.S. presidential election
work page 2000
-
[56]
Then find out who that person’s spouse is. <search> Winner of electoral vote in the 2000 election </search> <information> No useful information related are retrieved! </information> Plan is incorrect! It will trigger an erasure and regeneration of plan! Erasure and Regeneration: To answer this question, we need to break it down into sub-questions:
work page 2000
-
[57]
Identify the person who won the popular vote in the 2000 election
work page 2000
-
[58]
2000 United States presidential election
Determine the spouse of that person. Let’s address these sub-questions one by one. <search> popular vote in the 2000 election (Prussia)? </search> <information> Doc 1<## Title: "2000 United States presidential election" ##>The 2000 United States presidential election was the 54th quadrennial presidential election... Doc 2<## Title: "2000 United States pre...
work page 2000
-
[59]
Identify where City A is, where KFAB broadcasts
-
[60]
Determine who wrote “The Road to City A” with City A being the location identified above. Let’s address these step by step: <search> KFAB broadcasts </search> <information> Doc 1<## Title: KFAB ##> KFAB (1110 AM) is a 50,000 watt clear channel news and talk radio station licensed to Omaha, Nebraska... </information> <observation> KFAB broadcasts in Omaha,...
work page 2004
-
[61]
Identify the city where Alchip is headquartered
-
[62]
Determine which performer released the Fantasy Land Tour 2004 album and performed it live in that city. Let’s address each sub-question: <search> Alchip headquarters city </search> <information> Doc 1<## Title: Alchip ##> Alchip is a fabless semiconductor company founded in 2003 and headquartered in Taipei, Taiwan... </information> <observation> Alchip is...
work page 2004
-
[63]
Identify the country Banir is located in
-
[64]
Determine the year of the country’s unification. Let’s address each sub-question: <search> Banir location </search> <information> Doc 1<## Title: Banior ##> Banior is a village and gram panchayat in Nalhati I Block in Rampurhat subdivision of Birbhum District in the Indian state of West Bengal... Doc 2<## Title: Banagher ##> Banagher is a town in Republic...
work page 1921
-
[65]
Identify the state where Wellesley College is located
-
[66]
Determine the capital of that state
-
[67]
Find out who established the first committee of correspondence in 1772 in that capital
-
[68]
Understand the reason for establishing the committee. Let’s address each sub-question: <search> Wellesley College location </search> <information> Doc 1<## Title: "Wellesley, Massachusetts" ##> Wellesley is a town in Norfolk County, Massachusetts... It is best known as the home of Wellesley College... Doc 2<## Title: "Wellesley College" ##> Wellesley Coll...
-
[69]
America-Lite: How Imperial Academia Dismantled Our Culture
Identify the author of the book "America-Lite: How Imperial Academia Dismantled Our Culture"
-
[70]
Determine the school attended by this author
-
[71]
America-Lite: How Imperial Academia Dismantled Our Culture
Find the weekly publication in Jeanne Flanagan’s birthplace that is issued by the school identified in step 2. Let’s address each sub-question: <search> author of the book "America-Lite: How Imperial Academia Dismantled Our Culture" </search> <information> Doc 1<## Title: America-Lite ##> America-Lite America-Lite: How Imperial Academia Dismantled Our Cul...
work page 2012
-
[72]
Identify the director of The Mystic Masseur
-
[73]
Determine the country where the director was born
-
[74]
Find out in which state Kattalai, a city, is located in that country. Let’s address each sub-question: <search> director of The Mystic Masseur </search> <information> Doc 1<## Title: "The Mystic Masseur" ##> The Mystic Masseur The Mystic Masseur is a 2001 Merchant Ivory film based on the novel of the same name by V . S. Naipaul. It is one of relatively fe...
work page 2001
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.