Learning from Language Feedback via Variational Policy Distillation
Pith reviewed 2026-05-20 20:30 UTC · model grok-4.3
The pith
Variational Policy Distillation co-evolves a teacher policy to extract better signals from language feedback as the student improves.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Variational Policy Distillation formalizes learning from language feedback as a Variational Expectation-Maximization problem. In the E-step the teacher is actively refined on trajectory outcomes via an adaptive trust-region update, translating textual feedback into a dynamically improved target token distribution. In the M-step the student internalizes this dense distributional guidance on its own on-policy rollouts. By continuously improving the teacher's ability to extract actionable signals from textual critique, VPD overcomes the limitations of passive distillation and outperforms both standard RLVR and existing self-distillation baselines on scientific reasoning and code generation.
What carries the argument
Variational Expectation-Maximization with an adaptive trust-region update on the teacher, which refines the target token distributions derived from textual feedback for the student to follow.
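The alternation described above can be illustrated with a toy sketch. It is not the paper's implementation: it assumes a single categorical "token" distribution, a known per-token reward vector, a fixed tilt temperature `beta` in place of the adaptive trust region, and a closed-form geometric interpolation in place of gradient-based distillation. All function names are hypothetical.

```python
import math

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def kl(p, q):
    """KL(p || q) for categorical distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def e_step(teacher, rewards, beta):
    """Toy teacher refinement (assumption): tilt toward higher-reward tokens."""
    return normalize([q * math.exp(r / beta) for q, r in zip(teacher, rewards)])

def m_step(student, teacher, lr):
    """Toy student update (assumption): geometric interpolation toward the teacher."""
    return normalize([p ** (1 - lr) * q ** lr for p, q in zip(student, teacher)])

def toy_vpd(rewards, steps=20, beta=2.0, lr=0.5):
    """Alternate teacher and student updates, starting from uniform policies."""
    n = len(rewards)
    teacher = [1.0 / n] * n
    student = [1.0 / n] * n
    history = []
    for _ in range(steps):
        teacher = e_step(teacher, rewards, beta)   # teacher improves first
        student = m_step(student, teacher, lr)     # student follows the new target
        history.append((student[:], kl(student, teacher)))
    return teacher, student, history

if __name__ == "__main__":
    teacher, student, history = toy_vpd([0.0, 0.0, 1.0, 0.0])
    print("teacher:", [round(x, 4) for x in teacher])
    print("student:", [round(x, 4) for x in student])
```

In this toy the student always lags the teacher slightly, which is the qualitative picture of a moving target that the student keeps chasing; whether the real method behaves this way is exactly what the referee report questions.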
If this is right
- VPD outperforms standard RLVR and passive self-distillation baselines across diverse diagnostic feedback sources on scientific reasoning and code generation tasks.
- The method supports learning in cold-start regimes where initial policies have limited capabilities.
- Results on rigid mathematical reasoning tasks highlight the limits of feedback-driven self-distillation relative to pure environment-driven RL.
- Co-evolution prevents the teacher's assessment quality from plateauing as the student policy advances.
Where Pith is reading between the lines
- The variational framing may extend naturally to settings where feedback comes from multiple sources or human evaluators rather than fixed models.
- Similar co-evolution mechanisms could address exploration bottlenecks in other sparse-reward domains beyond reasoning and coding.
- The approach suggests that one-way distillation methods may underperform when both teacher and student can improve jointly over time.
Load-bearing premise
An adaptive trust-region update on the teacher will reliably turn textual feedback into a stable and useful target token distribution for the student without adding instability or bias.
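One way to make this premise concrete is a KL-bounded update on a categorical distribution. The sketch below is an assumption-laden illustration, not the paper's update rule: it tilts the teacher exponentially by a reward vector and uses bisection to find the largest step size `eta` whose KL divergence from the previous teacher stays within a budget `delta`. The names `tilt`, `trust_region_step`, `delta` and `eta_max` are hypothetical.

```python
import math

def tilt(q, rewards, eta):
    """Return q_eta(y) proportional to q(y) * exp(eta * r(y))."""
    weights = [qi * math.exp(eta * r) for qi, r in zip(q, rewards)]
    z = sum(weights)
    return [w / z for w in weights]

def kl(p, q):
    """KL(p || q) for categorical distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def trust_region_step(q, rewards, delta, eta_max=10.0, iters=60):
    """Largest eta in [0, eta_max] with KL(q_eta || q) <= delta.

    For an exponential tilt the KL to the starting point is non-decreasing
    in eta >= 0, so bisection on eta is valid.
    """
    if kl(tilt(q, rewards, eta_max), q) <= delta:
        return eta_max, tilt(q, rewards, eta_max)
    lo, hi = 0.0, eta_max
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl(tilt(q, rewards, mid), q) <= delta:
            lo = mid      # still inside the trust region
        else:
            hi = mid      # overshoots the KL budget
    return lo, tilt(q, rewards, lo)

if __name__ == "__main__":
    q = [0.25, 0.25, 0.25, 0.25]
    eta, q_new = trust_region_step(q, [0.0, 0.0, 1.0, 0.0], delta=0.05)
    print(eta, q_new, kl(q_new, q))
```

A bound of this kind limits how far the target can move per iteration, but it does not by itself rule out slow drift or bias accumulated over many iterations.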
What would settle it
An experiment showing that the teacher's feedback interpretation stops improving or that VPD performs no better than fixed-teacher baselines on the same reasoning and code tasks would falsify the central claim.
Original abstract
Reinforcement learning from verifiable rewards (RLVR) suffers from sparse outcome signals, creating severe exploration bottlenecks on complex reasoning tasks. Recent on-policy self-distillation methods attempt to address this by utilizing language feedback to generate dense, token-level supervision. However, these approaches rely on a fixed, passive teacher to interpret the feedback. As the student policy improves, the teacher's zero-shot assessment capabilities plateau, ultimately halting further learning. To overcome this, we propose Variational Policy Distillation (VPD), a framework that formalizes learning from language feedback as a Variational Expectation-Maximization (EM) problem. VPD co-evolves both policies: in the E-step, the teacher is actively refined on trajectory outcomes via an adaptive trust-region update, translating textual feedback into a dynamically improved target token distribution. In the M-step, the student internalizes this dense distributional guidance on its own on-policy rollouts. By continuously improving the teacher's ability to extract actionable signals from textual critique, VPD overcomes the limitations of passive distillation. Evaluated across diverse sources of diagnostic feedback on scientific reasoning and code generation tasks, VPD consistently outperforms both standard RLVR and existing self-distillation baselines. Finally, by stress-testing our framework on rigid mathematical reasoning and cold-start regimes, we illuminate the fundamental bounds of feedback-driven self-distillation compared to pure environment-driven RL.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Variational Policy Distillation (VPD), which casts learning from language feedback in RLVR as a variational EM procedure. The E-step refines the teacher via an adaptive trust-region update on trajectory outcomes to produce an improved target token distribution from textual critique; the M-step updates the student on its own on-policy rollouts to internalize that distribution. The method is evaluated on scientific reasoning and code generation tasks with diverse diagnostic feedback sources and is stress-tested on rigid mathematical reasoning and cold-start regimes, claiming consistent gains over standard RLVR and passive self-distillation baselines.
Significance. If the co-evolution mechanism is stable, VPD would offer a concrete way to overcome the plateau of fixed teachers in feedback-driven distillation, potentially improving sample efficiency on complex reasoning tasks where outcome signals are sparse. The explicit stress-testing on mathematical reasoning and cold-start regimes is a positive feature that helps delineate the practical limits of language-feedback approaches relative to pure environment-driven RL.
major comments (2)
- [§3.2] §3.2 (E-step and trust-region update): The central claim that the adaptive trust-region update reliably converts textual feedback into a dynamically improved, non-degenerate target token distribution rests on an unproven assumption of stability and lack of bias as the student policy shifts. No explicit KL bounds, contraction arguments, or ablation results are supplied showing that the refined teacher distribution remains useful and does not collapse or introduce systematic bias across iterations; this directly underpins the asserted advantage over passive distillation.
- [§5] §5 (experimental results): The reported outperformance is presented without per-task variance, statistical significance tests, or controls that isolate the contribution of the teacher update versus the student update. Without these, it is difficult to attribute gains specifically to the co-evolution mechanism rather than to increased compute or different hyper-parameters.
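The first major comment asks for evidence that the refined teacher distribution neither collapses nor drifts. A minimal diagnostic, assuming teacher token distributions are logged as plain probability lists once per iteration, could track entropy and KL divergence over time. The snapshot values in the usage lines are invented placeholders, and the helper names are hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl(p, q, eps=1e-12):
    """KL(p || q); q is clipped at eps to avoid division by zero."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def teacher_diagnostics(snapshots):
    """Per-iteration entropy, KL to the previous snapshot, KL to the first one."""
    rows = []
    first = snapshots[0]
    for i, dist in enumerate(snapshots):
        prev = snapshots[i - 1] if i > 0 else dist
        rows.append({
            "iter": i,
            "entropy": entropy(dist),
            "kl_to_prev": kl(dist, prev),
            "kl_to_init": kl(dist, first),
        })
    return rows

def collapsed(rows, min_entropy):
    """Flag a run if any snapshot falls below an entropy floor."""
    return any(row["entropy"] < min_entropy for row in rows)

if __name__ == "__main__":
    # Placeholder snapshots, not data from the paper.
    snapshots = [[0.25, 0.25, 0.25, 0.25],
                 [0.40, 0.30, 0.20, 0.10],
                 [0.97, 0.01, 0.01, 0.01]]
    for row in teacher_diagnostics(snapshots):
        print(row)
```

A steadily shrinking entropy together with a growing `kl_to_init` would be one warning sign of collapse; the threshold passed to `collapsed` is a free choice, not a value taken from the paper.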
minor comments (2)
- [§3] Notation for the variational objective and the trust-region constraint should be introduced with a single consistent symbol table or appendix equation list to avoid repeated re-definition across sections.
- [Figure 2] Figure 2 (training curves) would benefit from shaded standard-error bands and explicit labeling of which curves correspond to the teacher versus student policy.
Simulated Author's Rebuttal
We are grateful to the referee for their detailed and insightful comments, which have helped us identify areas for improvement in the manuscript. We address each of the major comments below.
- Referee: [§3.2] §3.2 (E-step and trust-region update): The central claim that the adaptive trust-region update reliably converts textual feedback into a dynamically improved, non-degenerate target token distribution rests on an unproven assumption of stability and lack of bias as the student policy shifts. No explicit KL bounds, contraction arguments, or ablation results are supplied showing that the refined teacher distribution remains useful and does not collapse or introduce systematic bias across iterations; this directly underpins the asserted advantage over passive distillation.
  Authors: We acknowledge that the manuscript would benefit from stronger empirical validation of the teacher update's stability. While we do not provide formal contraction arguments or KL bounds in the current version, the adaptive trust-region mechanism is intended to maintain stability by limiting updates based on verifiable outcomes. In the revised manuscript, we will add ablation experiments that measure the entropy and KL divergence of the teacher distribution over iterations to demonstrate that it does not collapse or become biased. These results will be included in an expanded §3.2 and the appendix.
  Revision: yes
- Referee: [§5] §5 (experimental results): The reported outperformance is presented without per-task variance, statistical significance tests, or controls that isolate the contribution of the teacher update versus the student update. Without these, it is difficult to attribute gains specifically to the co-evolution mechanism rather than to increased compute or different hyper-parameters.
  Authors: We agree that reporting variance and statistical tests is important for rigorous evaluation. The current experiments were run with multiple seeds, but the variance was not reported in the main text. In the revision, we will include per-task means and standard deviations, along with p-values from statistical tests. Furthermore, we will add a control experiment where the teacher is held fixed (disabling the E-step) while matching the total compute, to isolate the effect of the co-evolution. This will be presented in §5 and the appendix.
  Revision: yes
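The reporting promised in the second response can be sketched with the standard library alone. The example assumes a handful of per-seed scores for two methods and uses an exact two-sided permutation test on the difference of means; the choice of test is an assumption, since the rebuttal does not name one. The scores in the usage lines are invented placeholders, not results from the paper.

```python
import itertools
import statistics

def summarize(scores):
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)

def permutation_p_value(a, b):
    """Exact two-sided permutation test on the difference of means.

    Enumerates every way to split the pooled scores into groups of the
    original sizes, so it is only practical for a few seeds per method.
    """
    pooled = list(a) + list(b)
    n_a, n_b = len(a), len(b)
    observed = abs(statistics.mean(a) - statistics.mean(b))
    total = sum(pooled)
    hits = 0
    n_splits = 0
    for idx in itertools.combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        if diff >= observed - 1e-12:   # tolerance for float rounding
            hits += 1
        n_splits += 1
    return hits / n_splits

if __name__ == "__main__":
    # Placeholder per-seed accuracies, not taken from the paper.
    co_evolved = [0.62, 0.64, 0.63, 0.65, 0.61]
    fixed_teacher = [0.58, 0.57, 0.59, 0.56, 0.60]
    print(summarize(co_evolved), summarize(fixed_teacher))
    print(permutation_p_value(co_evolved, fixed_teacher))
```

With five seeds per arm there are only 252 distinct splits, so the smallest attainable two-sided p-value is 2/252; more seeds would be needed for finer resolution.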
Circularity Check
No circularity: VPD derivation introduces independent co-evolution via standard variational EM without reducing claims to fitted parameters or self-referential inputs
The paper formalizes learning from language feedback as a variational EM problem with an explicit E-step (adaptive trust-region refinement of the teacher on trajectory outcomes to produce an improved target distribution) and M-step (student internalization on on-policy rollouts). These steps are defined procedurally from the problem setup and do not reduce by construction to any fitted quantity, renamed prediction, or load-bearing self-citation. The abstract and framework description present the co-evolution as a novel mechanism to overcome fixed-teacher plateaus, with no equations or claims shown to be equivalent to their inputs via definition or prior author work. The derivation remains self-contained against external benchmarks such as standard RLVR and self-distillation baselines.
Axiom & Free-Parameter Ledger
free parameters (1)
- trust-region size
axioms (1)
- domain assumption: On-policy rollouts yield unbiased samples for policy improvement
invented entities (1)
- Dynamically improved target token distribution (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "VPD co-evolves both policies: in the E-step, the teacher is actively refined on trajectory outcomes via an adaptive trust-region update, translating textual feedback into a dynamically improved target token distribution. In the M-step, the student internalizes this dense distributional guidance on its own on-policy rollouts."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We formalize on-policy learning from language feedback as a Variational EM procedure."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In The Twelfth International Conference on Learning Representations, 2024.
- [2] Wen-Tse Chen, Jiayu Chen, Fahim Tajwar, Hao Zhu, Xintong Duan, Ruslan Salakhutdinov, and Jeff Schneider. Retrospective in-context learning for temporal credit assignment with large language models. arXiv preprint arXiv:2602.17497, 2026.
- [3] Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning converts weak language models to strong language models. arXiv preprint arXiv:2401.01335, 2024.
- [4] Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. KTO: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306, 2024.
- [5] Kehua Feng, Xinyi Shen, Weijie Wang, Xiang Zhuang, Yuqi Tang, Qiang Zhang, and Keyan Ding. SciKnowEval: Evaluating multi-level scientific knowledge of large language models. arXiv preprint arXiv:2406.09098, 2024.
- [6] Xidong Feng, Bo Liu, Yan Song, Haotian Fu, Ziyu Wan, Girish A Koushik, Zhiyuan Hu, Mengyue Yang, Ying Wen, and Jun Wang. Natural language reinforcement learning. arXiv preprint arXiv:2411.14251, 2024.
- [7] Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, pages 10835–10866. PMLR, 2023.
- [8] Dongyoung Go, Tomasz Korbak, Germán Kruszewski, Jos Rozen, Nahyeon Ryu, and Marc Dymetman. Aligning language models with preferences through f-divergence minimization. arXiv preprint arXiv:2302.08215, 2023.
- [9] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- [10] Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, et al. Reinforcement learning via self-distillation. arXiv preprint arXiv:2601.20802, 2026.
- [11] Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024.
- [12] Seungjae Jung, Gunsoo Han, Daniel Wontae Nam, and Kyoung-Woon On. Binary classifier optimization for large language model alignment. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1858–1872, 2025.
- [13] Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance, Alessandro Sordoni, Siva Reddy, Aaron Courville, and Nicolas Le Roux. VinePPO: Refining credit assignment in RL training of LLMs. arXiv preprint arXiv:2410.01679, 2024.
- [14] Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dohyung Kim, Jiwon Jeon, Dongsheng Li, and Yuqing Yang. Why does self-distillation (sometimes) degrade the reasoning capability of LLMs? arXiv preprint arXiv:2603.24472, 2026.
- [15] Yoonho Lee, Joseph Boen, and Chelsea Finn. Feedback descent: Open-ended text optimization via pairwise comparison. arXiv preprint arXiv:2511.07919, 2025.
- [16] Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. arXiv preprint arXiv:2305.20050, 2023.
- [17] Hao Liu, Carmelo Sferrazza, and Pieter Abbeel. Chain of hindsight aligns language models with feedback. arXiv preprint arXiv:2302.02676, 2023.
- [18] Zijun Liu, Peiyi Wang, Runxin Xu, Shirong Ma, Chong Ruan, Peng Li, Yang Liu, and Yu Wu. Inference-time scaling for generalist reward modeling. arXiv preprint arXiv:2504.02495, 2025.
- [19] Renjie Luo, Zichen Liu, Xiangyan Liu, Chao Du, Min Lin, Wenhu Chen, Wei Lu, and Tianyu Pang. Language models can learn from verbal feedback without scalar rewards. arXiv preprint arXiv:2509.22638, 2025.
- [20] Radford M Neal and Geoffrey E Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models, pages 355–368. Springer, 1998.
- [21] Team Olmo, Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heineman, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, et al. Olmo 3. arXiv preprint arXiv:2512.13961, 2025.
- [22] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- [23] Richard Y Pang, Weizhe Yuan, Kyunghyun Cho, He He, Sainbayar Sukhbaatar, and Jason Weston. Iterative reasoning preference optimization. Advances in Neural Information Processing Systems, 37:116617–116637, 2024.
- [24] Emiliano Penaloza, Dheeraj Vattikonda, Nicolas Gontier, Alexandre Lacoste, Laurent Charlin, and Massimo Caccia. Privileged information distillation for language models. arXiv preprint arXiv:2602.04942, 2026.
- [25] Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019.
- [26] Yuxiao Qu, Amrith Setlur, Virginia Smith, Ruslan Salakhutdinov, and Aviral Kumar. Pope: Learning to reason on hard problems via privileged on-policy exploration. arXiv preprint arXiv:2601.18779, 2026.
- [27] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023.
- [28] Corby Rosset, Ching-An Cheng, Arindam Mitra, Michael Santacroce, Ahmed Awadallah, and Tengyang Xie. Direct Nash optimization: Teaching language models to self-improve with general preferences. arXiv preprint arXiv:2404.03715, 2024.
- [29] Jérémy Scheurer, Jon Ander Campos, Tomasz Korbak, Jun Shern Chan, Angelica Chen, Kyunghyun Cho, and Ethan Perez. Training language models with language feedback at scale. arXiv preprint arXiv:2303.16755, 2023.
- [30] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897. PMLR, 2015.
- [31] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- [32] Amrith Setlur, Zijian Wang, Andrew Cohen, Paria Rashidinejad, and Sang Michael Xie. Reuse your FLOPs: Scaling RL on hard problems by conditioning on very off-policy prefixes. arXiv preprint arXiv:2601.18795, 2026.
- [33] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- [34] Idan Shenfeld, Mehul Damani, Jonas Hübotter, and Pulkit Agrawal. Self-distillation enables continual learning. arXiv preprint arXiv:2601.19897, 2026.
- [35] Yuda Song, Lili Chen, Fahim Tajwar, Remi Munos, Deepak Pathak, J Andrew Bagnell, Aarti Singh, and Andrea Zanette. Expanding the capabilities of reinforcement learning via text feedback. arXiv preprint arXiv:2602.02482, 2026.
- [36] Yue Wu, Zhiqing Sun, Huizhuo Yuan, Kaixuan Ji, Yiming Yang, and Quanquan Gu. Self-play preference optimization for language model alignment. arXiv preprint arXiv:2405.00675, 2024.
- [37] Wanqiao Xu, Allen Nie, Ruijie Zheng, Aditya Modi, Adith Swaminathan, and Ching-An Cheng. Provably learning from language feedback. arXiv preprint arXiv:2506.10341, 2025.
- [38] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
- [39] Chenxu Yang, Chuanyu Qin, Qingyi Si, Minghui Chen, Naibin Gu, Dingyu Yao, Zheng Lin, Weiping Wang, Jiaqi Wang, and Nan Duan. Self-distilled RLVR. arXiv preprint arXiv:2604.03128, 2026.
- [40] Wenkai Yang, Jingwen Chen, Yankai Lin, and Ji-Rong Wen. DeepCritic: Deliberate critique with large language models. arXiv preprint arXiv:2505.00662, 2025.
- [41] Tianzhu Ye, Li Dong, Qingxiu Dong, Xun Wu, Shaohan Huang, and Furu Wei. Online experiential learning for language models. arXiv preprint arXiv:2603.16856, 2026.
- [42] Tianzhu Ye, Li Dong, Xun Wu, Shaohan Huang, and Furu Wei. On-policy context distillation for language models. arXiv preprint arXiv:2602.12275, 2026.
- [43] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025.
- [44] Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, and Jason E Weston. Self-rewarding language models. In Forty-first International Conference on Machine Learning, 2024.
- [45] Lunjun Zhang, Arian Hosseini, Hritik Bansal, Mehran Kazemi, Aviral Kumar, and Rishabh Agarwal. Generative verifiers: Reward modeling as next-token prediction. arXiv preprint arXiv:2408.15240, 2024.
- [46] Yifan Zhang and Team Math-AI. American Invitational Mathematics Examination (AIME) 2024, 2024.
- [47] Yifan Zhang and Team Math-AI. American Invitational Mathematics Examination (AIME) 2025, 2025.
- [48] Yuheng Zhang, Wenlin Yao, Changlong Yu, Yao Liu, Qingyu Yin, Bing Yin, Hyokun Yun, and Lihong Li. Improving sampling efficiency in RLVR through adaptive rollout and response reuse. arXiv preprint arXiv:2509.25808, 2025.
- [49] Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models. arXiv preprint arXiv:2601.18734, 2026.
- [50] Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071, 2025.
- [51] Haizhong Zheng, Yang Zhou, Brian R Bartoldson, Bhavya Kailkhura, Fan Lai, Jiawei Zhao, and Beidi Chen. Act only when it pays: Efficient reinforcement learning for LLM reasoning via selective rollouts. arXiv preprint arXiv:2506.02177, 2025.
- [52] Victor Zhong, Dipendra Misra, Xingdi Yuan, and Marc-Alexandre Côté. Policy improvement using language feedback models. Advances in Neural Information Processing Systems, 37:43730–43758, 2024.
- [53] Xiangxin Zhou, Zichen Liu, Haonan Wang, Chao Du, Min Lin, Chongxuan Li, Liang Wang, and Tianyu Pang. Variational reasoning for language models. arXiv preprint arXiv:2509.22637, 2025.
  A Theoretical Derivations
  This appendix provides the formal derivations for the variational framework introduced in Section 3. We first derive the closed-form optimal polic...
- [54] This reveals that the term $\exp(-1 - \lambda/\beta)$ acts as a normalization constant. We define the partition function $Z(x)$ as:
  $$Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x) \exp\left(\frac{1}{\beta} r(x, y)\right). \tag{A.5}$$
  Thus, the optimal target distribution is the exponentially reward-tilted policy:
  $$\pi^{*}(y \mid x) = \frac{1}{Z(x)} \pi_{\mathrm{ref}}(y \mid x) \exp\left(\frac{1}{\beta} r(x, y)\right). \tag{A.6}$$
  A.2 Equivalence of Reverse KL and the RLVR Objective
  We now demonst...
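- [55] Joint Loss Optimization. The most straightforward baseline computes the objective losses independently and optimizes their weighted sum. We calculate the standard GRPO surrogate loss $\mathcal{L}_{\mathrm{GRPO}}$ using the sequence-level advantages, and combine it with the SDPO KL distillation loss:
  $$\mathcal{L}_{\mathrm{Hybrid}}(\theta) = \omega_{\mathrm{opd}} \cdot \mathcal{L}_{\mathrm{SDPO}}(\theta) + \omega_{\mathrm{rl}} \cdot \mathcal{L}_{\mathrm{GRPO}}(\theta), \tag{B.23}$$
  where $\omega_{\mathrm{opd}}$ and $\omega_{\mathrm{rl}}$ are hyper...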
- [56] Advantage Reshaping. Instead of summing the final losses, a second class of baselines fuses the signals at the advantage level. Following the methodology of Self-Distillation Policy Optimization (SDPO) [10], the teacher's dense distillation signal can be translated into a per-token advantage, $A^{\mathrm{SDPO}}_{t} = \mathrm{sg}\left(\log q_{\phi}(y_t \mid x, C, y_{<t}) - \log \pi_{\theta}(y_t \mid x, y_{<t})\right)$. This is ...
- [57] Distillation-Guided Advantage Reweighting. A fundamental limitation of the standard GRPO advantage $A^{\mathrm{GRPO}}$ is its uniform application to all tokens in a sequence, failing to differentiate between critical reasoning steps and generic filler. To construct a baseline that addresses this without fully decoupling the steps, we can explicitly weight the sequence-l...
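The appendix fragments quoted in the last entries can be made concrete with a small numerical sketch. It assumes a finite set of candidate responses with known reference probabilities and rewards, and plain floats in place of tensors, so the stop-gradient `sg` has no counterpart here. The function names are hypothetical; only the formulas for (A.5), (A.6), (B.23) and the SDPO per-token advantage come from the quoted text.

```python
import math

def reward_tilted_policy(pi_ref, rewards, beta):
    """Eqs. (A.5)-(A.6): pi*(y|x) = pi_ref(y|x) * exp(r(x, y) / beta) / Z(x)."""
    weights = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
    z = sum(weights)  # partition function Z(x), Eq. (A.5)
    return [w / z for w in weights], z

def hybrid_loss(loss_sdpo, loss_grpo, w_opd, w_rl):
    """Eq. (B.23): weighted sum of the distillation and GRPO losses."""
    return w_opd * loss_sdpo + w_rl * loss_grpo

def sdpo_token_advantages(teacher_logprobs, student_logprobs):
    """Per-token advantage: log q_phi(y_t | x, C, y_<t) - log pi_theta(y_t | x, y_<t).

    In an autodiff framework this difference would be wrapped in a
    stop-gradient; with plain floats there is nothing to detach.
    """
    return [lq - lp for lq, lp in zip(teacher_logprobs, student_logprobs)]

if __name__ == "__main__":
    # Two candidate responses with equal reference probability (placeholders).
    pi_star, z = reward_tilted_policy([0.5, 0.5], [1.0, 0.0], beta=1.0)
    print(pi_star, z)
    print(hybrid_loss(2.0, 3.0, w_opd=0.5, w_rl=1.0))
    print(sdpo_token_advantages([math.log(0.5)], [math.log(0.25)]))
```

A smaller `beta` sharpens the tilted policy toward the highest-reward response, while a larger `beta` keeps it closer to the reference policy.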