pith. machine review for the scientific record.

arxiv: 2605.08401 · v1 · submitted 2026-05-08 · 💻 cs.CL · cs.AI

Recognition: 2 Lean theorem links

AIPO: : Learning to Reason from Active Interaction

Gholamreza Haffari, Junnan Liu, Linhao Luo, Thuy-Trang Vu

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 01:12 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords LLM reasoning · Reinforcement learning · Multi-agent interaction · RLVR · Capability boundary · Off-policy learning · Active exploration

The pith

AIPO enables LLMs to actively consult Verify, Knowledge, and Reasoning agents during RL training to expand their reasoning capability boundary beyond standard RLVR limits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that standard reinforcement learning with verifiable rewards leaves LLMs stuck inside their initial capability limits because exploration cannot go beyond what the policy already knows. AIPO changes this by letting the policy model pause at reasoning bottlenecks and request targeted help from three separate agents instead of receiving full expert trajectories. The agents supply fine-grained corrections on verification, knowledge, and reasoning steps, which the policy then absorbs through a modified importance-sampling update that includes clipping to control off-policy drift. After training ends, the policy runs without any agents yet retains the expanded abilities, as shown by gains on AIME, MATH500, GPQA-Diamond, and LiveCodeBench across multiple base models and RL algorithms.

Core claim

By allowing the policy model to proactively query three functional collaborative agents when it encounters reasoning bottlenecks, AIPO supplies fine-grained, targeted guidance that actively expands the policy's capability boundary during training; a tailored importance-sampling coefficient together with a clipping strategy mitigates the resulting off-policy bias and gradient issues, so that after training the policy reasons independently and outperforms prior RLVR baselines on diverse benchmarks.
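
For orientation, the objective this family of methods modifies is the clipped importance-sampling loss of PPO-style RLVR. The paper's tailored coefficient is not reproduced on this page, so the standard form below is context, not the authors' formula:

$\mathcal{J}(\theta) = \mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{old}}(a_t \mid s_t)}.$

For agent-provided tokens the denominator is no longer $\pi_{\text{old}}$ but the consulted agent's generation distribution, which is exactly the off-policy mismatch the tailored coefficient and clipping strategy are said to control.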

What carries the argument

The AIPO framework, in which the policy model initiates active queries to the Verify Agent, Knowledge Agent, and Reasoning Agent at detected bottlenecks, combined with an importance-sampling coefficient and clipping rule that stabilize learning from the resulting off-policy feedback.
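
A minimal schematic of that interaction pattern, in Python. Everything here is hypothetical scaffolding (stub classes, a hard-coded bottleneck trigger), not the paper's implementation; the point is the data flow the loss depends on: each token is tagged as policy-generated (on-policy) or agent-provided (off-policy) so the importance-sampling correction can treat the two groups differently.

from dataclasses import dataclass

@dataclass
class Token:
    text: str
    on_policy: bool  # False => agent-provided (off-policy) token

class StubAgent:
    """Stands in for the Verify / Knowledge / Reasoning agents."""
    def __init__(self, name):
        self.name = name
    def respond(self, context, query):
        # Short, targeted feedback rather than a full expert trajectory.
        return [f"<{self.name}-hint>"]

class StubPolicy:
    """Stands in for the policy LLM; emits a help request at a fixed step."""
    def generate_step(self, context, step):
        if step == 3:  # pretend a reasoning bottleneck was detected here
            return {"help": "verify", "query": "check the last step"}
        return {"token": f"tok{step}", "eos": step == 6}

def rollout(policy, agents, prompt, max_steps=16):
    context, traj = list(prompt), []
    for step in range(max_steps):
        out = policy.generate_step(context, step)
        if "help" in out:  # active consultation at a detected bottleneck
            hint = agents[out["help"]].respond(context, out["query"])
            traj += [Token(t, on_policy=False) for t in hint]
            context += hint
        else:
            traj.append(Token(out["token"], on_policy=True))
            context.append(out["token"])
            if out["eos"]:
                break
    return traj  # reward and the IS-corrected loss are computed downstream

agents = {n: StubAgent(n) for n in ("verify", "knowledge", "reasoning")}
for tok in rollout(StubPolicy(), agents, ["Q:"]):
    print(tok)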

If this is right

  • Reasoning accuracy rises on AIME, MATH500, GPQA-Diamond, and LiveCodeBench relative to plain RLVR.
  • The gains hold across different base policy models and different underlying RLVR algorithms.
  • After training the policy model solves new problems without calling any collaborative agents.
  • The method replaces trajectory-level expert demonstrations with shorter, on-demand agent exchanges.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same active-consultation pattern could be tested on non-reasoning tasks such as long-horizon planning or tool use.
  • If the three agents themselves are lightweight, the overall training cost may remain comparable to standard RLVR while still widening the reachable solution space.
  • The approach suggests that future RLVR work could replace static expert buffers with dynamic, queryable helper models.

Load-bearing premise

The guidance supplied by the three agents is sufficiently fine-grained and targeted that it genuinely expands the policy model's capability boundary rather than merely providing temporary scaffolding.

What would settle it

Train the same policy model with standard RLVR versus AIPO on the same data and compute budget, then measure whether AIPO yields higher accuracy on harder held-out problems that lie outside the original policy's capability boundary (initial success rate near zero), while confirming that the trained model still solves them without any agents present at test time.
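
A compressed sketch of that protocol, again with hypothetical stubs; the checkpoints, the held-out set, and the stub success probabilities are placeholders, not reported results:

import random
from dataclasses import dataclass

@dataclass
class Problem:
    prompt: str
    answer: str
    base_success: float  # measured pass rate of the untrained base policy
    def check(self, output):
        return output == self.answer

class StubCheckpoint:
    """Stands in for a trained checkpoint, evaluated with all agents disabled."""
    def __init__(self, solve_prob):
        self.solve_prob = solve_prob
    def generate(self, prompt):
        return "42" if random.random() < self.solve_prob else "?"

def pass_at_1(model, problems, n_samples=16):
    """Fraction of problems solved at least once in n_samples independent tries."""
    solved = sum(
        any(p.check(model.generate(p.prompt)) for _ in range(n_samples))
        for p in problems
    )
    return solved / len(problems)

held_out = [Problem(f"q{i}", "42", base_success=0.0) for i in range(50)]
hard = [p for p in held_out if p.base_success == 0.0]  # outside the initial boundary

rlvr, aipo = StubCheckpoint(0.05), StubCheckpoint(0.15)  # placeholder probabilities
print("RLVR pass@1:", pass_at_1(rlvr, hard))
print("AIPO pass@1:", pass_at_1(aipo, hard))  # boundary expansion iff this is higher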

Figures

Figures reproduced from arXiv: 2605.08401 by Gholamreza Haffari, Junnan Liu, Linhao Luo, Thuy-Trang Vu.

Figure 1
Figure 1: Comparison between existing methods and the proposed … view at source ↗
Figure 2
Figure 2: Illustration of AIPO. In the AIPO framework, during each rollout, the policy model engages in active interactions with collaborators. We then compute the reward and optimize the policy model using losses derived from both internal (on-policy) and external (off-policy) tokens. Additionally, we propose an amended importance sampling coefficient and clipping strategy to mitigate off-policy errors and the vani… view at source ↗
Figure 3
Figure 3: Ablation study of the collaborators in AIPO. Each bar indicates the average performance of all benchmarks in this domain. [Adjacent plot: Pass@n over training steps (0-100) for curves labeled Our, GRPO, and LUFFY.] view at source ↗
Figure 5
Figure 5: Training dynamics of AIPO and baselines on Qwen2.5-7B-Instruct with the same model as collaborators, including the number of interactions initiated by the policy model per batch (Batch Interactions). Under AIPO, the interaction frequency initially rises, then declines, and eventually stabilizes. This pattern suggests that the policy model queries external collaborators frequently in the early stages of training because of its limited initial … view at source ↗
read the original abstract

Recent advances in large language models (LLMs) have demonstrated remarkable reasoning capabilities, largely stimulated by Reinforcement Learning with Verifiable Rewards (RLVR). However, existing RL algorithms face a fundamental limitation: their exploration remains largely constrained by the inherent capability boundary of the policy model. Although recent methods introduce external expert demonstrations to extend this boundary, they typically rely on complete trajectory-level guidance, which is sample-inefficient, information-sparse, and may confine exploration to a static guidance space. Inspired by the potential of multi-agent systems, we propose $\textbf{AIPO}$, an enhanced reinforcement learning framework that improves LLM reasoning through active multi-agent interaction during exploration. Specifically, AIPO enables the policy model to proactively consult three functional collaborative agents, $\textit{Verify Agent}$, $\textit{Knowledge Agent}$, and $\textit{Reasoning Agent}$, when encountering reasoning bottlenecks, thereby receiving fine-grained and targeted guidance to actively expand its capability boundary during training. We further introduce a tailored importance sampling coefficient together with a clipping strategy to mitigate the off-policy bias and gradient vanishing issues that arise when learning from agent-provided feedback. After training, the policy model performs reasoning independently without relying on collaborative agents. Extensive experiments on diverse reasoning benchmarks, including AIME, MATH500, GPQA-Diamond, and LiveCodeBench, show that AIPO consistently improves reasoning performance, generalizes robustly across different policy models and RLVR algorithms, and effectively expands the reasoning capability boundary of the policy model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes AIPO, an enhanced RLVR framework for LLM reasoning in which the policy model proactively consults three functional agents (Verify Agent, Knowledge Agent, Reasoning Agent) upon detecting reasoning bottlenecks during exploration. These agents supply fine-grained, targeted feedback that is intended to expand the policy's standalone capability boundary. A tailored importance sampling coefficient combined with a clipping strategy is introduced to correct for off-policy bias and gradient vanishing induced by the agent feedback. After training the policy reasons independently without agents. Experiments on AIME, MATH500, GPQA-Diamond and LiveCodeBench are claimed to show consistent gains that generalize across policy models and base RLVR algorithms.

Significance. If the empirical gains and the off-policy correction hold, the work would be a meaningful incremental advance over static expert-demonstration methods in RLVR by replacing trajectory-level guidance with dynamic, bottleneck-triggered multi-agent interaction. The post-training independence of the policy is a practical strength, and the approach could stimulate further research on active multi-agent exploration for reasoning.

major comments (2)
  1. [Method (description of importance sampling and clipping)] The explicit mathematical definition of the tailored importance sampling coefficient (and its interaction with the three-agent feedback structure) is not supplied in the method description. Without the formula it is impossible to verify whether the ratio reduces to standard PPO IS when no agents are consulted or whether it correctly accounts for state-dependent consultation triggers that may introduce unmodeled distribution shift. (An illustrative form of such a ratio is sketched after the minor comments below.)
  2. [Method and Experiments] The central claim that agent feedback expands the policy's capability boundary rests on the assumption that the guidance is sufficiently fine-grained and that the IS+clipping correction fully mitigates bias. No derivation or ablation is shown that isolates the contribution of the three-agent interaction versus simpler forms of external feedback.
minor comments (2)
  1. [Title] The title contains a typographical double colon (AIPO: : Learning...).
  2. [Throughout] Agent names are inconsistently formatted between the abstract and later text; uniform italicization or bolding would improve readability.
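
To make major comment 1 concrete: one plausible, purely illustrative form of a mixed-policy token-level ratio (not taken from the paper) is

$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\mu(a_t \mid s_t)}, \qquad \mu = \begin{cases} \pi_{\text{old}} & \text{if token } t \text{ was generated by the policy,} \\ \pi_{\text{agent}} & \text{if token } t \text{ was provided by a consulted agent.} \end{cases}$

On rollouts with no consultations this reduces to the standard PPO ratio, as the rebuttal promises; what cannot be checked without the published formula is how $\mu$ accounts for the state-dependent trigger that decides when a consultation happens.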

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and positive assessment of the paper's significance. We will revise the manuscript to address the concerns raised in the major comments.

read point-by-point responses
  1. Referee: [Method (description of importance sampling and clipping)] The explicit mathematical definition of the tailored importance sampling coefficient (and its interaction with the three-agent feedback structure) is not supplied in the method description. Without the formula it is impossible to verify whether the ratio reduces to standard PPO IS when no agents are consulted or whether it correctly accounts for state-dependent consultation triggers that may introduce unmodeled distribution shift.

    Authors: We acknowledge this omission in the original manuscript. The revised version will include the explicit formula for the tailored importance sampling coefficient. It is defined to reduce to the standard PPO importance sampling ratio in cases where no agents are consulted. The coefficient incorporates an adjustment for the state-dependent consultation probability to account for potential distribution shifts. We will also add a short derivation in the method section to clarify its interaction with the three-agent feedback. revision: yes

  2. Referee: [Method and Experiments] The central claim that agent feedback expands the policy's capability boundary rests on the assumption that the guidance is sufficiently fine-grained and that the IS+clipping correction fully mitigates bias. No derivation or ablation is shown that isolates the contribution of the three-agent interaction versus simpler forms of external feedback.

    Authors: We agree that additional ablations would better isolate the effects. In the revision, we will include new experiments comparing the full three-agent AIPO to variants with simplified feedback mechanisms (e.g., single agent or non-interactive external signals). This will help validate the fine-grained nature of the guidance. The current results on multiple benchmarks and the independence at inference time provide supporting evidence for the capability boundary expansion, though we will expand the discussion to address potential limitations in the bias correction. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical RL extension with independent experimental validation

full rationale

The paper frames AIPO as an empirical enhancement to existing RLVR methods via proactive multi-agent consultation and a tailored importance sampling coefficient with clipping. No equations, derivations, or load-bearing self-citations are present that reduce the claimed capability expansion or performance gains to fitted inputs, self-definitions, or prior author results by construction. The central claims rest on benchmark experiments (AIME, MATH500, etc.) that are externally falsifiable and do not invoke uniqueness theorems or ansatzes from the authors' own prior work. This is the standard case of a self-contained empirical proposal.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 1 invented entity

Based solely on the abstract, the central claim rests on the introduction of three new agent entities and one new coefficient; no independent evidence for these components is provided.

free parameters (1)
  • tailored importance sampling coefficient
    Introduced along with a clipping strategy to address off-policy bias and gradient vanishing when learning from agent-provided feedback.
invented entities (1)
  • Verify Agent, Knowledge Agent, and Reasoning Agent · no independent evidence
    purpose: Provide fine-grained and targeted guidance to the policy model at reasoning bottlenecks during training.
    These three functional collaborative agents are introduced as core components of the AIPO framework.

pith-pipeline@v0.9.0 · 5570 in / 1282 out tokens · 85853 ms · 2026-05-12T01:12:52.321007+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

81 extracted references · 81 canonical work pages · 29 internal anchors

  1. [1]

    Back to basics: Revisiting reinforce-style optimization for learning from human feedback in llms

    Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting reinforce-style optimization for learning from human feedback in llms. InACL (1), pp. 12248–12267. Association for Computational Linguistics, 2024. 2

  2. [2]

    Program Synthesis with Large Language Models

    Jacob Austin, Augustus Odena, Maxwell I. Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie J. Cai, Michael Terry, Quoc V. Le, and Charles Sutton. Program synthesis with large language models.CoRR, abs/2108.07732, 2021. 1, 4.1, B.6

  3. [3]

    Reflect, retry, reward: Self-improving LLMs via reinforcement learning

    Shelly Bensal, Umar Jamil, Christopher Bryant, Melisa Russak, Kiran Kamble, Dmytro Mozolevskyi, Muayad Ali, and Waseem AlShikh. Reflect, retry, reward: Self-improving llms via reinforcement learning.CoRR, abs/2505.24726, 2025. D.2

  4. [4]

    Introduction to techniques used in seed1.6

    ByteDance Seed. Introduction to techniques used in seed1.6. https://seed.bytedance.com/en/seed1_6, 2025. 5

  5. [5]

    Nudging the boundaries of LLM reasoning

    Justin Chih-Yao Chen, Becky Xiangyu Peng, Prafulla Kumar Choubey, Kung-Hsiang Huang, Jiaxin Zhang, Mohit Bansal, and Chien-Sheng Wu. Nudging the boundaries of LLM reasoning. CoRR, abs/2509.25666, 2025. 1

  6. [6]

    Beyond two-stage training: Cooperative SFT and RL for LLM reasoning

    Liang Chen, Xueting Han, Li Shen, Jing Bai, and Kam-Fai Wong. Beyond two-stage training: Cooperative SFT and RL for LLM reasoning.CoRR, abs/2509.06948, 2025. 1, 2, 5, D.1

  7. [7]

    Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

    Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, and Wanxiang Che. Towards reasoning era: A survey of long chain-of-thought for reasoning large language models.CoRR, abs/2503.09567, 2025. 1

  8. [8]

    Multi-Agent Evolve: LLM self-improve through co-evolution

    Yixing Chen, Yiding Wang, Siqi Zhu, Haofei Yu, Tao Feng, Muhan Zhang, Mostofa Patwary, and Jiaxuan You. Multi-agent evolve: LLM self-improve through co-evolution.CoRR, abs/2510.23595, 2025. 1

  9. [9]

    Reasoning with exploration: An entropy perspective

    Daixuan Cheng, Shaohan Huang, Xuekai Zhu, Bo Dai, Wayne Xin Zhao, Zhenliang Zhang, and Furu Wei. Reasoning with exploration: An entropy perspective.CoRR, abs/2506.14758, 2025. A

  10. [10]

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit S. Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, Luke Marris, Sam Petulla, Colin Gaffney, Asaf Aharoni, Nathan Lintz, Tiago Cardal Pais, Henrik Jacobsson, Idan Szpektor, Nan- Jiang Jiang, Krishna Haridasan, Ahmed Omran, Nikunj Saunshi, Dara Bahri, Gaurav ...

  11. [11]

    Process Reinforcement through Implicit Rewards

    Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, and Ning Ding. Process reinforcement through implicit rewards.CoRR, abs/2502.01456, 2025. 4.1

  12. [12]

    Weight ensembling improves reasoning in language models

    Xingyu Dang, Christina Baek, Kaiyue Wen, Zico Kolter, and Aditi Raghunathan. Weight ensembling improves reasoning in language models.CoRR, abs/2504.10478, 2025. 1

  13. [13]

    DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai D...

  14. [14]

    Supervised reinforcement learning: From expert trajectories to step-wise reasoning

    Yihe Deng, I-Hung Hsu, Jun Yan, Zifeng Wang, Rujun Han, Gufeng Zhang, Yanfei Chen, Wei Wang, Tomas Pfister, and Chen-Yu Lee. Supervised reinforcement learning: From expert trajectories to step-wise reasoning.CoRR, abs/2510.25992, 2025. 1

  15. [15]

    Re-rest: Reflection-reinforced self-training for language agents

    Zi-Yi Dou, Cheng-Fu Yang, Xueqing Wu, Kai-Wei Chang, and Nanyun Peng. Re-rest: Reflection-reinforced self-training for language agents. InEMNLP, pp. 15394–15411. Association for Computational Linguistics, 2024. D.1

  16. [16]

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurélien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Rozière, Bethany...

  17. [17]

    SRFT: A single-stage method with supervised and reinforcement fine-tuning for reasoning

    Yuqian Fu, Tinghong Chen, Jiajun Chai, Xihuai Wang, Songjun Tu, Guojun Yin, Wei Lin, Qichao Zhang, Yuanheng Zhu, and Dongbin Zhao. SRFT: A single-stage method with supervised and reinforcement fine-tuning for reasoning.CoRR, abs/2506.19767, 2025. 1, 1, 2, 5, D.1

  18. [18]

    FlowReasoner: Reinforcing query-level meta-agents

    Hongcheng Gao, Yue Liu, Yufei He, Longxu Dou, Chao Du, Zhijie Deng, Bryan Hooi, Min Lin, and Tianyu Pang. Flowreasoner: Reinforcing query-level meta-agents.CoRR, abs/2504.15257,

  19. [19]

    Etash Kumar Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, Ashima Suvarna, Benjamin Feuer, Liangyu Chen, Zaid Khan, Eric Frankel, Sachin Grover, Caroline Choi, Niklas Muennighoff, Shiye Su, Wanjia Zhao, John Yang, Shreyas Pimpalgaonkar, Kartik Sharma, Charlie Cheng-Ji...

  20. [20]

    Rewarding the unlikely: Lifting GRPO beyond distribution sharpening

    Andre Wang He, Daniel Fried, and Sean Welleck. Rewarding the unlikely: Lifting GRPO beyond distribution sharpening. InEMNLP, pp. 25548–25560. Association for Computational Linguistics, 2025. 1

  21. [21]

    DeepMath-103K: A large-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning

    Zhiwei He, Tian Liang, Jiahao Xu, Qiuzhi Liu, Xingyu Chen, Yue Wang, Linfeng Song, Dian Yu, Zhenwen Liang, Wenxuan Wang, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, and Dong Yu. Deepmath-103k: A large-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning.CoRR, abs/2504.11456, 2025. 4.4

  22. [22]

    Measuring mathematical problem solving with the MATH dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. InNeurIPS Datasets and Benchmarks, 2021. 1, 4.1, B.6

  23. [23]

    Scaling Laws for Autoregressive Generative Modeling

    Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B. Brown, Prafulla Dhariwal, Scott Gray, Chris Hallacy, Benjamin Mann, Alec Radford, Aditya Ramesh, Nick Ryder, Daniel M. Ziegler, John Schulman, Dario Amodei, and Sam McCandlish. Scaling laws for autoregressive generative modeling.CoRR, abs/2010.14701,

  24. [24]

    Deep Learning Scaling is Predictable, Empirically

    Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory F. Diamos, Heewoo Jun, Hassan Kianinejad, Md. Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically.CoRR, abs/1712.00409, 2017. E

  25. [25]

    REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization

    Jian Hu. REINFORCE++: A simple and efficient approach for aligning large language models. CoRR, abs/2501.03262, 2025. 2

  26. [26]

    Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

    Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model.CoRR, abs/2503.24290, 2025. 2

  27. [27]

    A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions

    Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions.ACM Trans. Inf. Syst., 43(2):42:1–42:55, 2025. D.2

  28. [28]

    Gemini 2.5 pro capable of winning gold at IMO 2025

    Yichen Huang and Lin F. Yang. Gemini 2.5 pro capable of winning gold at IMO 2025.CoRR, abs/2507.15855, 2025. 1

  29. [29]

    O1 replication journey - part 2: Surpassing o1-preview through simple distillation, big progress or bitter lesson?

    Zhen Huang, Haoyang Zou, Xuefeng Li, Yixiu Liu, Yuxiang Zheng, Ethan Chern, Shijie Xia, Yiwei Qin, Weizhe Yuan, and Pengfei Liu. O1 replication journey - part 2: Surpassing o1-preview through simple distillation, big progress or bitter lesson?CoRR, abs/2411.16489, 2024. 3.1

  30. [30]

    Livecodebench: Holistic and contamination free evaluation of large language models for code

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. InICLR. OpenReview.net, 2025. 1, 4.1, B.6

  31. [31]

    Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

    Bowen Jin, Hansi Zeng, Zhenrui Yue, Dong Wang, Hamed Zamani, and Jiawei Han. Search-r1: Training llms to reason and leverage search engines with reinforcement learning.CoRR, abs/2503.09516, 2025. 3.2

  32. [32]

    Kimi-Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, Chuning Tang, Congcong Wang, Dehao Zhang, Enming Yuan, Enzhe Lu, Fengxiang Tang, Flood Sung, Guangda Wei, Guokun Lai, Haiqing Guo, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haotian Yao, Haotian Zhao, Haoyu Lu, Haoze Li, Haoz...

  33. [33]

    Training language models to self-correct via reinforcement learning

    Aviral Kumar, Vincent Zhuang, Rishabh Agarwal, Yi Su, John D. Co-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, Lei M. Zhang, Kay McKinney, Disha Shrivastava, Cosmin Paduraru, George Tucker, Doina Precup, Feryal M. P. Behbahani, and Aleksandra Faust. Training language models to self-correct via reinforcement learning. InICLR....

  34. [34]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InSOSP, pp. 611–626. ACM, 2023. 4.1, B.1

  35. [35]

    Remax: A simple, effective, and efficient reinforcement learning method for aligning large language models

    Ziniu Li, Tian Xu, Yushun Zhang, Zhihang Lin, Yang Yu, Ruoyu Sun, and Zhi-Quan Luo. Remax: A simple, effective, and efficient reinforcement learning method for aligning large language models. InICML. OpenReview.net, 2024. 2

  36. [36]

    MARFT: Multi-agent reinforcement fine-tuning

    Junwei Liao, Muning Wen, Jun Wang, and Weinan Zhang. MARFT: multi-agent reinforcement fine-tuning.CoRR, abs/2504.16129, 2025. 3.1

  37. [37]

    Enhancing efficiency and exploration in reinforcement learning for llms

    Mengqi Liao, Xiangyu Xi, Ruinian Chen, Jia Leng, Yangen Hu, Ke Zeng, Shuai Liu, and Huaiyu Wan. Enhancing efficiency and exploration in reinforcement learning for llms. In EMNLP, pp. 1451–1463. Association for Computational Linguistics, 2025. 1

  38. [38]

    Let’s verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. InICLR. OpenReview.net, 2024. B.6

  39. [39]

    Interactive Learning for LLM Reasoning

    Hehai Lin, Shilei Cao, Sudong Wang, Haotian Wu, Minzhi Li, Linyi Yang, Juepeng Zheng, and Chengwei Qin. Interactive learning for LLM reasoning.CoRR, abs/2509.26306, 2025. 1, 5, D.1

  40. [40]

    Are your LLMs capable of stable reasoning?

    Junnan Liu, Hongwei Liu, Linchen Xiao, Ziyi Wang, Kuikun Liu, Songyang Gao, Wenwei Zhang, Songyang Zhang, and Kai Chen. Are your llms capable of stable reasoning?CoRR, abs/2412.13147, 2024. 1, 4.1, B.6

  41. [41]

    Situatedthinker: Grounding LLM reasoning with real-world through situated thinking

    Junnan Liu, Linhao Luo, Thuy-Trang Vu, and Gholamreza Haffari. Situatedthinker: Grounding LLM reasoning with real-world through situated thinking.CoRR, abs/2505.19300, 2025. 3.2

  42. [42]

    Learn to reason efficiently with adaptive length-based reward shaping

    Wei Liu, Ruochen Zhou, Yiyun Deng, Yuzhen Huang, Junteng Liu, Yuntian Deng, Yizhe Zhang, and Junxian He. Learn to reason efficiently with adaptive length-based reward shaping.CoRR, abs/2505.15612, 2025. 4.1

  43. [43]

    Exploratory memory-augmented LLM agent via hybrid on- and off-policy optimization

    Zeyuan Liu, Jeonghye Kim, Xufang Luo, Dongsheng Li, and Yuqing Yang. Exploratory memory-augmented llm agent via hybrid on- and off-policy optimization.CoRR, abs/2602.23008, 2026. 1

  44. [44]

    Understanding R1-Zero-Like Training: A Critical Perspective

    Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective.CoRR, abs/2503.20783,

  45. [45]

    Towards a unified view of large language model post-training

    Xingtai Lv, Yuxin Zuo, Youbang Sun, Hongyi Liu, Yuntian Wei, Zhekai Chen, Lixuan He, Xuekai Zhu, Kaiyan Zhang, Bingning Wang, Ning Ding, and Bowen Zhou. Towards a unified view of large language model post-training.CoRR, abs/2509.04419, 2025. 1, 2, 5, D.1

  46. [46]

    Learning what reinforcement learning can’t: Interleaved online fine-tuning for hardest questions

    Lu Ma, Hao Liang, Meiyi Qiang, Lexiang Tang, Xiaochen Ma, Zhen Hao Wong, Junbo Niu, Chengyu Shen, Runming He, Bin Cui, and Wentao Zhang. Learning what reinforcement learning can’t: Interleaved online fine-tuning for hardest questions.CoRR, abs/2506.07527,

  47. [47]

    s1: Simple test-time scaling

    Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel J. Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling.CoRR, abs/2501.19393, 2025. 3.1

  48. [48]

    Learning to reason with llms

    OpenAI. Learning to reason with llms. https://openai.com/index/learning-to-reason-with-llms/, 2024. Accessed: 2024-09. 1, 2, 5

  49. [49]

    Introducing openai o3 and o4-mini

    OpenAI. Introducing openai o3 and o4-mini. https://openai.com/index/introducing-o3-and-o4-mini/, 2024. Accessed: 2024-12. 2

  50. [50]

    Gpt-5 and the new era of work

    OpenAI. Gpt-5 and the new era of work. https://openai.com/index/gpt-5-new-era-of-work/, 2025. Accessed: 2025-08. 1, 5

  51. [51]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human fee...

  52. [52]

    O1 replication journey: A strategic progress report - part 1

    Yiwei Qin, Xuefeng Li, Haoyang Zou, Yixiu Liu, Shijie Xia, Zhen Huang, Yixin Ye, Weizhe Yuan, Hector Liu, Yuanzhi Li, and Pengfei Liu. O1 replication journey: A strategic progress report - part 1.CoRR, abs/2410.18982, 2024. 3.1

  53. [53]

    GPQA: A graduate-level google-proof q&a benchmark

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google-proof q&a benchmark.CoRR, abs/2311.12022, 2023. 1, 4.1, B.6

  54. [54]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.CoRR, abs/1707.06347, 2017. 1, 2, 3.2

  55. [55]

    Deepseekmath: Pushing the limits of mathematical reasoning in open language models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.CoRR, abs/2402.03300, 2024. 1, 1, 2, 4.1

  56. [56]

    Hybridflow: A flexible and efficient RLHF framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient RLHF framework. In EuroSys, pp. 1279–1297. ACM, 2025. 4.1, B.1

  57. [57]

    Reflexion: language agents with verbal reinforcement learning

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: language agents with verbal reinforcement learning. InNeurIPS, 2023. 5, D.1

  58. [58]

    Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

    Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time compute optimally can be more effective than scaling model parameters.CoRR, abs/2408.03314, 2024. 1

  59. [59]

    R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning

    Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, and Ji-Rong Wen. R1-searcher: Incentivizing the search capability in llms via reinforcement learning.CoRR, abs/2503.05592, 2025. 3.2

  60. [60]

    Reasoning Gym: Reasoning environments for reinforcement learning with verifiable rewards

    Zafir Stojanovski, Oliver Stanley, Joe Sharratt, Richard Jones, Abdulhakeem Adefioye, Jean Kaddour, and Andreas Köpf. REASONING GYM: reasoning environments for reinforcement learning with verifiable rewards.CoRR, abs/2505.24760, 2025. 1, 4.1, B.6

  61. [61]

    Zerosearch: Incentivize the search capability of LLMs without searching

    Hao Sun, Zile Qiao, Jiayan Guo, Xuanbo Fan, Yingyan Hou, Yong Jiang, Pengjun Xie, Yan Zhang, Fei Huang, and Jingren Zhou. Zerosearch: Incentivize the search capability of llms without searching.CoRR, abs/2505.04588, 2025. 3.1

  62. [62]

    Improving data efficiency for LLM reinforcement fine-tuning through difficulty-targeted online data selection and rollout replay

    Yifan Sun, Jingyan Shen, Yibin Wang, Tianyu Chen, Zhendong Wang, Mingyuan Zhou, and Huan Zhang. Improving data efficiency for LLM reinforcement fine-tuning through difficulty-targeted online data selection and rollout replay.CoRR, abs/2506.05316, 2025. 5, D.1

  63. [63]

    QwQ-32B: Embracing the power of reinforcement learning

    Qwen Team. Qwq-32b: Embracing the power of reinforcement learning, March 2025. URL https://qwenlm.github.io/blog/qwq-32b/. 5

  64. [64]

    ReMA: Learning to meta-think for LLMs with multi-agent reinforcement learning

    Ziyu Wan, Yunxiang Li, Yan Song, Hanjing Wang, Linyi Yang, Mark Schmidt, Jun Wang, Weinan Zhang, Shuyue Hu, and Ying Wen. Rema: Learning to meta-think for llms with multi-agent reinforcement learning.CoRR, abs/2503.09501, 2025. 3.1

  65. [65]

    Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

    Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen, Jianxin Yang, Zhenru Zhang, Yuqiong Liu, An Yang, Andrew Zhao, Yang Yue, Shiji Song, Bowen Yu, Gao Huang, and Junyang Lin. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for LLM reasoning.CoRR, abs/2506.01939, 2025. A

  66. [66]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. InNeurIPS, 2022. 3.1

  67. [67]

    Truthrl: Incentivizing truthful llms via reinforcement learning

    Zhepei Wei, Xiao Yang, Kai Sun, Jiaqi Wang, Rulin Shao, Sean Chen, Mohammad Kachuee, Teja Gollapudi, Tony Liao, Nicolas Scheffer, Rakesh Wanga, Anuj Kumar, Yu Meng, Wen-tau Yih, and Xin Luna Dong. Truthrl: Incentivizing truthful llms via reinforcement learning.CoRR, abs/2509.25760, 2025. D.2

  68. [68]

    Grok 4

    xAI. Grok 4. https://x.ai/news/grok-4/, 2025. Accessed: 2025-07. 1

  69. [69]

    KDRL: Post-training reasoning LLMs via unified knowledge distillation and reinforcement learning

    Hongling Xu, Qi Zhu, Heyuan Deng, Jinpeng Li, Lu Hou, Yasheng Wang, Lifeng Shang, Ruifeng Xu, and Fei Mi. KDRL: post-training reasoning llms via unified knowledge distillation and reinforcement learning.CoRR, abs/2506.02208, 2025. 5

  70. [70]

    Comas: Co-evolving multi-agent systems via interaction rewards

    Xiangyuan Xue, Yifan Zhou, Guibin Zhang, Zaibin Zhang, Yijiang Li, Chen Zhang, Zhenfei Yin, Philip Torr, Wanli Ouyang, and Lei Bai. Comas: Co-evolving multi-agent systems via interaction rewards.CoRR, abs/2510.08529, 2025. 1

  71. [71]

    Qwen2.5 Technical Report

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu X...

  72. [72]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jian Yang, Jiaxi Yang, Jingren Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  73. [73]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Weinan Dai, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W...

  74. [74]

    Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

    Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?CoRR, abs/2504.13837, 2025. 1, 5

  75. [75]

    VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks

    Yu Yue, Yufeng Yuan, Qiying Yu, Xiaochen Zuo, Ruofei Zhu, Wenyuan Xu, Jiaze Chen, Cheng-Xiang Wang, Tiantian Fan, Zhengyin Du, Xiangpeng Wei, Xiangyu Yu, Gaohong Liu, Juncai Liu, Lingjun Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Ru Zhang, Xin Liu, Mingxuan Wang, Yonghui Wu, and Lin Yan. VAPO: efficient and reliab...

  76. [76]

    Incentivizing llms to self-verify their answers

    Fuxiang Zhang, Jiacheng Xu, Chaojie Wang, Ce Cui, Yang Liu, and Bo An. Incentivizing llms to self-verify their answers.CoRR, abs/2506.01369, 2025. D.2

  77. [77]

    On-policy RL meets off-policy experts: Harmonizing supervised fine-tuning and reinforcement learning via dynamic weighting

    Wenhao Zhang, Yuexiang Xie, Yuchang Sun, Yanxi Chen, Guoyin Wang, Yaliang Li, Bolin Ding, and Jingren Zhou. On-policy RL meets off-policy experts: Harmonizing supervised fine-tuning and reinforcement learning via dynamic weighting.CoRR, abs/2508.11408, 2025. 1, 2, 2, 5, D.1

  78. [78]

    Learning to reason under off-policy guidance

    Yue Zhang, Yafu Li, Ganqu Cui, Yu Cheng, Zhi Wang, Xiaoye Qu, Jianhao Yan, and Zican Hu. Learning to reason under off-policy guidance. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. 1, 2, 2, 3.2, 5, D.1

  79. [79]

    Echo chamber: RL post-training amplifies behaviors learned in pretraining

    Rosie Zhao, Alexandru Meterez, Sham M. Kakade, Cengiz Pehlevan, Samy Jelassi, and Eran Malach. Echo chamber: RL post-training amplifies behaviors learned in pretraining.CoRR, abs/2504.07912, 2025. 1, 5

  80. [80]

    Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

    Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models.CoRR, abs/2601.18734, 2026. 1, 4.1, 5, D.1

Showing first 80 references.