DenoiseRL: Bootstrapping Reasoning Models to Recover from Noisy Prefixes

Caijun Xu; Changyi Xiao; Yixin Cao; Zhongyuan Peng

arxiv: 2605.28421 · v1 · pith:MDSXF4AXnew · submitted 2026-05-27 · 💻 cs.AI

Pith reviewed 2026-06-29 11:48 UTC · model grok-4.3

classification 💻 cs.AI

keywords reinforcement learning · reasoning models · self-correction · noisy prefixes · large language models · bootstrapping · recovery optimization

The pith

DenoiseRL improves reasoning models by turning their own incorrect traces into recovery training signals without external teachers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DenoiseRL, a reinforcement learning method that converts failures from weak models into opportunities for learning better reasoning. It replaces dependence on stronger supervisors or hand-picked hard examples with direct optimization over noisy prefixes to create a richer training signal. This approach increases exploration efficiency from imperfect behavior and leads to stronger self-correction as task difficulty grows. A reader would care because it offers a pathway to scale reasoning improvements using only the model's internal errors rather than scarce external resources.

Core claim

DenoiseRL replaces external supervision with recovery-oriented optimization over the failures of weak models. It learns directly from incorrect reasoning traces by treating them as opportunities for improvement, which the authors say yields a richer, more diverse learning signal and more efficient exploration from imperfect model behavior.

What carries the argument

Recovery-oriented optimization that converts incorrect reasoning traces from weak models into training signals for self-correction.
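The abstract does not say how noisy prefixes are built or rewarded, so the following is only a minimal sketch of one way the idea could be set up, not the authors' method. It assumes that failed traces from a weak model are truncated to a prefix, that the policy continues from that prefix, and that the reward is binary correctness of the final answer. The names `RecoveryExample`, `truncate_trace`, `build_recovery_examples`, `recovery_reward`, and `final_answer_matches`, as well as the `keep_fraction` parameter, are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RecoveryExample:
    prompt: str     # question followed by the noisy prefix the policy continues from
    reference: str  # ground-truth final answer, used only to compute the reward


def truncate_trace(trace: str, keep_fraction: float) -> str:
    """Keep the first `keep_fraction` of the non-empty steps (assumed one per line)."""
    steps = [s for s in trace.splitlines() if s.strip()]
    n_keep = max(1, int(len(steps) * keep_fraction))
    return "\n".join(steps[:n_keep])


def build_recovery_examples(
    question: str,
    reference: str,
    weak_traces: List[str],
    is_correct: Callable[[str, str], bool],
    keep_fraction: float = 0.5,  # assumed hyperparameter, not taken from the paper
) -> List[RecoveryExample]:
    """Turn the weak model's failed traces into prompts that start from a noisy prefix."""
    examples = []
    for trace in weak_traces:
        if is_correct(trace, reference):
            continue  # only incorrect traces are used as noisy prefixes
        prefix = truncate_trace(trace, keep_fraction)
        examples.append(RecoveryExample(f"{question}\n{prefix}\n", reference))
    return examples


def recovery_reward(
    completion: str, reference: str, is_correct: Callable[[str, str], bool]
) -> float:
    """Assumed binary outcome reward: 1.0 if the continuation recovers the right answer."""
    return 1.0 if is_correct(completion, reference) else 0.0


def final_answer_matches(text: str, reference: str) -> bool:
    """Toy checker: the last non-empty line must end with the reference answer."""
    lines = text.strip().splitlines()
    return bool(lines) and lines[-1].strip().endswith(reference)


if __name__ == "__main__":
    traces = ["2 + 2 = 5\nso the answer is 5", "2 + 2 = 4\nso the answer is 4"]
    for ex in build_recovery_examples("What is 2 + 2?", "4", traces, final_answer_matches):
        print(repr(ex.prompt))
```

In an actual RL loop the prompts would be fed to the policy and the resulting rewards passed to whatever policy-gradient update is in use; that part is left out here because the supplied text gives no details about it.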

If this is right

DenoiseRL outperforms strong on-policy RL baselines across competitive mathematical and general reasoning benchmarks.
DenoiseRL promotes stronger self-corrective behavior as training difficulty increases.
DenoiseRL reduces the need for expensive data curation or stronger teacher models.
DenoiseRL improves reasoning performance and overall training efficiency.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same recovery mechanism might transfer to sequential tasks outside language reasoning such as code synthesis or multi-step planning.
Training with recovery signals could produce models that remain effective even when initial generations contain noise.
Hybrid systems that combine recovery optimization with other RL objectives might further reduce dependence on curated data.
Larger models might amplify the diversity of the learning signal obtained from their own failures.

Load-bearing premise

Converting failures from weak models into recovery-oriented optimization yields a richer and more diverse learning signal than existing methods that rely on external supervision.

What would settle it

An experiment on the same mathematical and general reasoning benchmarks in which DenoiseRL fails to outperform strong on-policy RL baselines would falsify the performance claim.
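One way such a comparison could be made concrete is a paired bootstrap over per-problem correctness on a shared benchmark. The sketch below is an illustration under assumptions, not the paper's evaluation protocol: the function name `paired_bootstrap_gap`, the 95% interval, and the resample count are all hypothetical choices.

```python
import random
from typing import List, Tuple


def paired_bootstrap_gap(
    method_correct: List[bool],
    baseline_correct: List[bool],
    n_resamples: int = 2000,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (observed accuracy gap, lower, upper) for a 95% paired bootstrap interval.

    Both lists hold per-problem correctness on the same problems, in the same order.
    """
    if not method_correct or len(method_correct) != len(baseline_correct):
        raise ValueError("inputs must be non-empty and have equal length")
    rng = random.Random(seed)
    n = len(method_correct)
    # Per-problem difference: +1 if only the method is right, -1 if only the baseline is.
    diffs = [int(m) - int(b) for m, b in zip(method_correct, baseline_correct)]
    observed = sum(diffs) / n
    gaps = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        gaps.append(sum(sample) / n)
    gaps.sort()
    lower = gaps[int(0.025 * n_resamples)]
    upper = gaps[min(n_resamples - 1, int(0.975 * n_resamples))]
    return observed, lower, upper


if __name__ == "__main__":
    method = [True, True, False, True, True, False, True, True]
    baseline = [True, False, False, True, False, False, True, True]
    print(paired_bootstrap_gap(method, baseline))
```

Under this reading, an interval whose lower bound is at or below zero on the benchmarks in question would count as a failure to outperform the baseline.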

Original abstract

Reinforcement learning has become a central paradigm for advancing reasoning in large language models, yet most existing methods still depend on stronger teacher models or heavily curated difficult datasets, limiting scalable capability improvement. In this paper, we introduce DenoiseRL, a reinforcement learning framework that substitutes external supervision with recovery-oriented optimization over failures from weak models. Instead of relying on stronger supervision or carefully engineered data, DenoiseRL learns directly from incorrect reasoning traces by converting them into opportunities for improvement, making training more scalable and less dependent on external resources. This yields a richer and more diverse learning signal, improving exploration efficiency from imperfect model behavior. As a result, DenoiseRL improves reasoning performance and overall training efficiency while reducing the need for expensive data curation or stronger teacher models. Empirically, DenoiseRL consistently outperforms strong on-policy RL baselines across competitive mathematical and general reasoning benchmarks and promotes stronger self-corrective behavior as training difficulty increases, highlighting an effective and scalable alternative pathway for improving reasoning in large language models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

A private letter to a colleague.

DenoiseRL claims to turn weak-model failures into a scalable RL signal for reasoning, but the abstract supplies no methods or results to check whether that actually works.

The main takeaway is that this paper wants to replace external teachers and curated hard data with recovery training on noisy prefixes from weaker models. It frames the task as learning to fix incorrect reasoning traces directly, which in principle could reduce dependence on stronger supervision and make scaling easier.

What looks new is the explicit focus on converting on-policy failures into a recovery objective rather than just maximizing reward on correct traces. The abstract says this produces a richer learning signal and better self-correction as difficulty rises, and it reports outperformance over strong on-policy baselines on math and general reasoning tasks. That direction addresses a real practical bottleneck in current RL-for-reasoning setups.

The problem is that none of the central claims can be checked from the given text. There are no reward equations, no description of how noisy prefixes are generated or filtered, no baseline implementations, and no numbers or ablation results. The soundness score is low for exactly this reason: the empirical assertions sit there unsupported. Without seeing the actual optimization or evaluation setup, it is impossible to tell whether the claimed gains come from the recovery framing or from some other difference in training.

This is the kind of paper that would interest people already running RL loops on reasoning models and looking for ways to bootstrap without bigger teachers. A reader in that group might pick up the high-level idea, but the work is too thin on details to be useful on its own. I would send it to peer review only if the full manuscript contains reproducible methods and properly controlled experiments; on the abstract alone it does not yet deserve referee time.

Referee Report

1 major / 0 minor

Summary. The paper introduces DenoiseRL, a reinforcement learning framework for advancing reasoning in large language models. It replaces external supervision or curated difficult datasets with recovery-oriented optimization over incorrect reasoning traces generated by weak models, converting failures into learning signals to improve exploration efficiency, self-correction, and overall performance. The central empirical claim is that DenoiseRL consistently outperforms strong on-policy RL baselines on mathematical and general reasoning benchmarks while promoting stronger self-corrective behavior as training difficulty increases.

Significance. If the method and results hold, the work would offer a scalable pathway for RL-based reasoning improvement that reduces dependence on stronger teacher models and expensive data curation. The approach of bootstrapping from noisy prefixes/failures could address a key bottleneck in current on-policy methods. However, the provided manuscript text consists only of the abstract with no methods, equations, experimental details, baselines, or results, so the significance cannot be evaluated.

major comments (1)

[Abstract] The central claims of consistent outperformance over on-policy RL baselines and improved self-correction are asserted without any supporting methods, data, reward formulation, benchmark details, or quantitative results. This prevents any assessment of whether the recovery-oriented optimization actually yields a richer learning signal or avoids the evaluation artifacts common in RL-for-reasoning setups.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for reviewing our work and for highlighting the need for supporting details to evaluate the claims. We address the major comment below.

Referee: [Abstract] The central claims of consistent outperformance over on-policy RL baselines and improved self-correction are asserted without any supporting methods, data, reward formulation, benchmark details, or quantitative results. This prevents any assessment of whether the recovery-oriented optimization actually yields a richer learning signal or avoids the evaluation artifacts common in RL-for-reasoning setups.

Authors: We agree that an abstract alone cannot support evaluation of the empirical claims. The full manuscript provides the complete DenoiseRL method (including recovery-oriented optimization over noisy prefixes from weak models), the RL formulation and reward design, experimental protocols, baseline implementations, benchmark details (mathematical and general reasoning tasks), and quantitative results showing outperformance and improved self-correction. If the version provided to the referee contained only the abstract, we will ensure the complete manuscript is submitted for the next round.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; no derivation chain present

The abstract and supplied text describe an empirical RL framework (DenoiseRL) that converts weak-model failures into recovery-oriented optimization signals, with claims of benchmark outperformance. No equations, parameters, fitted quantities, uniqueness theorems, or derivation steps appear in the provided content. The central claims are empirical and methodological rather than mathematical reductions, so no load-bearing step can be shown to equal its inputs by construction. Self-citations or ansatzes are not visible, and the method is presented as a scalable alternative without internal self-reference that would force the result.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; the supplied text gives no technical details on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5707 in / 1016 out tokens · 25789 ms · 2026-06-29T11:48:45.180010+00:00 · methodology

