
arxiv: 2605.15113 · v2 · pith:GZ3TWWSPnew · submitted 2026-05-14 · 💻 cs.LG

Learning from Language Feedback via Variational Policy Distillation

Pith reviewed 2026-05-20 20:30 UTC · model grok-4.3

classification 💻 cs.LG
keywords variational policy distillation, language feedback, reinforcement learning, self-distillation, scientific reasoning, code generation, expectation maximization

The pith

Variational Policy Distillation co-evolves a teacher policy to extract better signals from language feedback as the student improves.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Variational Policy Distillation to address sparse rewards and exploration issues in reinforcement learning from verifiable rewards by using language feedback more dynamically. It frames the interaction as a variational expectation-maximization process in which the teacher policy is actively updated in the E-step with an adaptive trust-region method on trajectory outcomes to create improved target token distributions. The student then internalizes these distributions during its own on-policy rollouts in the M-step. A sympathetic reader would care because this co-evolution prevents the teacher from plateauing as the student advances, leading to consistent gains over standard RLVR and passive self-distillation on scientific reasoning and code generation tasks with various diagnostic feedback sources.

Core claim

Variational Policy Distillation formalizes learning from language feedback as a Variational Expectation-Maximization problem. In the E-step the teacher is actively refined on trajectory outcomes via an adaptive trust-region update, translating textual feedback into a dynamically improved target token distribution. In the M-step the student internalizes this dense distributional guidance on its own on-policy rollouts. By continuously improving the teacher's ability to extract actionable signals from textual critique, VPD overcomes the limitations of passive distillation and outperforms both standard RLVR and existing self-distillation baselines on scientific reasoning and code generation.
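The alternation described above can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the paper's implementation: the response space is a small discrete set, the E-step uses the closed-form reward-tilted distribution anchored on the current student, and the M-step is one gradient step on the cross-entropy between teacher and student. The function names, `beta`, `lr`, and the reward vector are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]


def e_step(student_probs, rewards, beta):
    """Toy E-step: tilt the current student distribution toward high reward.

    Assumption: the teacher target is q(y) proportional to p(y) * exp(r(y) / beta),
    with the current student as the reference prior.
    """
    weights = [p * math.exp(r / beta) for p, r in zip(student_probs, rewards)]
    z = sum(weights)
    return [w / z for w in weights]


def m_step(student_logits, teacher_probs, lr):
    """Toy M-step: one gradient step on cross-entropy(teacher, student).

    The gradient of -sum_j q_j log p_j with respect to logit i is p_i - q_i.
    """
    student_probs = softmax(student_logits)
    return [
        l - lr * (p - q)
        for l, p, q in zip(student_logits, student_probs, teacher_probs)
    ]


def run_toy_em(rewards, steps=200, beta=0.5, lr=1.0):
    """Alternate E- and M-steps; the teacher is re-anchored on the student each round."""
    logits = [0.0] * len(rewards)
    for _ in range(steps):
        probs = softmax(logits)
        target = e_step(probs, rewards, beta)
        logits = m_step(logits, target, lr)
    return softmax(logits)


final = run_toy_em(rewards=[0.0, 0.0, 1.0])
```

Because the target is recomputed from the current student at every round, it keeps moving as the student improves, which is the co-evolution idea in miniature. A fixed teacher would correspond to computing `target` once, outside the loop.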

What carries the argument

Variational Expectation-Maximization with adaptive trust-region update on the teacher, which refines target token distributions from textual feedback for the student to follow.

If this is right

  • VPD outperforms standard RLVR and passive self-distillation baselines across diverse diagnostic feedback sources on scientific reasoning and code generation tasks.
  • The method supports learning in cold-start regimes where initial policies have limited capabilities.
  • Results on rigid mathematical reasoning tasks highlight the limits of feedback-driven self-distillation relative to pure environment-driven RL.
  • Co-evolution prevents the teacher's assessment quality from plateauing as the student policy advances.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The variational framing may extend naturally to settings where feedback comes from multiple sources or human evaluators rather than fixed models.
  • Similar co-evolution mechanisms could address exploration bottlenecks in other sparse-reward domains beyond reasoning and coding.
  • The approach suggests that one-way distillation methods may underperform when both teacher and student can improve jointly over time.

Load-bearing premise

An adaptive trust-region update on the teacher will reliably turn textual feedback into a stable and useful target token distribution for the student without adding instability or bias.

What would settle it

An experiment showing that the teacher's feedback interpretation stops improving or that VPD performs no better than fixed-teacher baselines on the same reasoning and code tasks would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.15113 by Erik Nijkamp, Semih Yavuz, Shafiq Joty, Yang Li.

Figure 1
Figure 1: Reward margin between correct and incorrect responses during LCB training. 1. Environment Feedback (LiveCodeBench). For code generation tasks, the environment acts as a natural, deterministic verifier, providing rich feedback such as runtime errors and failed unit test assertions. We evaluate Qwen3-8B (with reasoning/thinking mode disabled) on the LiveCodeBench (LCB) v6 subset, following the public and …
Figure 2
Figure 2: Training progression on SciKnowEval. 2. Contrastive Sibling Rollouts (SciKnowEval). For many scientific reasoning tasks, ground-truth textual feedback is unavailable; the environment only provides a sparse, binary correctness signal. In these scenarios, we can synthesize the diagnostic feedback C using the model’s own generations. Following the methodology of SDPO [10], we provide the student with a …
Figure 3
Figure 3: Training progression on Qwen3-4B-Base. The "Cold Start" Problem on Base Models. Recent literature demonstrates that GRPO can elicit advanced reasoning capabilities from a base foundation model. However, when we apply SDPO to base models, performance rapidly collapses to near zero. We hypothesize that self-distillation intrinsically requires the policy to possess a rudimentary level of instruction-followin…
Figure 4
Figure 4: Performance on the Math500 benchmark for models trained on DAPO-Math. Mathematical Reasoning. Similarly, on challenging mathematical benchmarks (e.g., training on DAPO-Math), SDPO suffers from severe training collapse. This vulnerability to mathematical reasoning domains has been observed in concurrent works [14]. While VPD again successfully delays this collapse, pure GRPO remains the dominant approach …
Figure 5
Figure 5: Training progression on Qwen3-1.7B with different reference models for the E-step. Ablation: Dynamic Reference Prior. As established in Eq. 7, VPD dynamically anchors the reference prior to the current student policy (πθ). This sliding trust region restricts the teacher’s target distribution, ensuring its guidance remains safely reachable for the student. To validate this design, we conduct an ablation study c…
read the original abstract

Reinforcement learning from verifiable rewards (RLVR) suffers from sparse outcome signals, creating severe exploration bottlenecks on complex reasoning tasks. Recent on-policy self-distillation methods attempt to address this by utilizing language feedback to generate dense, token-level supervision. However, these approaches rely on a fixed, passive teacher to interpret the feedback. As the student policy improves, the teacher's zero-shot assessment capabilities plateau, ultimately halting further learning. To overcome this, we propose Variational Policy Distillation (VPD), a framework that formalizes learning from language feedback as a Variational Expectation-Maximization (EM) problem. VPD co-evolves both policies: in the E-step, the teacher is actively refined on trajectory outcomes via an adaptive trust-region update, translating textual feedback into a dynamically improved target token distribution. In the M-step, the student internalizes this dense distributional guidance on its own on-policy rollouts. By continuously improving the teacher's ability to extract actionable signals from textual critique, VPD overcomes the limitations of passive distillation. Evaluated across diverse sources of diagnostic feedback on scientific reasoning and code generation tasks, VPD consistently outperforms both standard RLVR and existing self-distillation baselines. Finally, by stress-testing our framework on rigid mathematical reasoning and cold-start regimes, we illuminate the fundamental bounds of feedback-driven self-distillation compared to pure environment-driven RL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Variational Policy Distillation (VPD), which casts learning from language feedback in RLVR as a variational EM procedure. The E-step refines the teacher via an adaptive trust-region update on trajectory outcomes to produce an improved target token distribution from textual critique; the M-step updates the student on its own on-policy rollouts to internalize that distribution. The method is evaluated on scientific reasoning and code generation tasks with diverse diagnostic feedback sources and is stress-tested on rigid mathematical reasoning and cold-start regimes, claiming consistent gains over standard RLVR and passive self-distillation baselines.

Significance. If the co-evolution mechanism is stable, VPD would offer a concrete way to overcome the plateau of fixed teachers in feedback-driven distillation, potentially improving sample efficiency on complex reasoning tasks where outcome signals are sparse. The explicit stress-testing on mathematical reasoning and cold-start regimes is a positive feature that helps delineate the practical limits of language-feedback approaches relative to pure environment-driven RL.

major comments (2)
  1. [§3.2] §3.2 (E-step and trust-region update): The central claim that the adaptive trust-region update reliably converts textual feedback into a dynamically improved, non-degenerate target token distribution rests on an unproven assumption of stability and lack of bias as the student policy shifts. No explicit KL bounds, contraction arguments, or ablation results are supplied showing that the refined teacher distribution remains useful and does not collapse or introduce systematic bias across iterations; this directly underpins the asserted advantage over passive distillation.
  2. [§5] §5 (experimental results): The reported outperformance is presented without per-task variance, statistical significance tests, or controls that isolate the contribution of the teacher update versus the student update. Without these, it is difficult to attribute gains specifically to the co-evolution mechanism rather than to increased compute or different hyper-parameters.
minor comments (2)
  1. [§3] Notation for the variational objective and the trust-region constraint should be introduced with a single consistent symbol table or appendix equation list to avoid repeated re-definition across sections.
  2. [Figure 2] Figure 2 (training curves) would benefit from shaded standard-error bands and explicit labeling of which curves correspond to the teacher versus student policy.
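The per-task variance and significance reporting requested in major comment 2 could be computed along these lines. This is a sketch only: the two score lists are made-up placeholder numbers, not results from the paper, and `method_a` / `method_b` are hypothetical labels for two training recipes evaluated over five seeds each.

```python
import math
from statistics import mean, stdev


def summarize(scores):
    """Return (mean, sample standard deviation) across seeds."""
    return mean(scores), stdev(scores)


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Does not assume equal variances between the two groups.
    """
    va = stdev(a) ** 2 / len(a)
    vb = stdev(b) ** 2 / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Placeholder accuracies for five seeds per method (illustrative only).
method_a = [0.62, 0.65, 0.61, 0.64, 0.63]
method_b = [0.58, 0.60, 0.57, 0.61, 0.59]

mean_a, std_a = summarize(method_a)
mean_b, std_b = summarize(method_b)
t_stat, dof = welch_t(method_a, method_b)
```

Turning the statistic into a p-value requires the t-distribution CDF, which the standard library does not provide; `scipy.stats.ttest_ind(a, b, equal_var=False)` is one common way to obtain it.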

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their detailed and insightful comments, which have helped us identify areas for improvement in the manuscript. We address each of the major comments below.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (E-step and trust-region update): The central claim that the adaptive trust-region update reliably converts textual feedback into a dynamically improved, non-degenerate target token distribution rests on an unproven assumption of stability and lack of bias as the student policy shifts. No explicit KL bounds, contraction arguments, or ablation results are supplied showing that the refined teacher distribution remains useful and does not collapse or introduce systematic bias across iterations; this directly underpins the asserted advantage over passive distillation.

    Authors: We acknowledge that the manuscript would benefit from stronger empirical validation of the teacher update's stability. While we do not provide formal contraction arguments or KL bounds in the current version, the adaptive trust-region mechanism is intended to maintain stability by limiting updates based on verifiable outcomes. In the revised manuscript, we will add ablation experiments that measure the entropy and KL divergence of the teacher distribution over iterations to demonstrate that it does not collapse or become biased. These results will be included in an expanded §3.2 and the appendix. revision: yes

  2. Referee: [§5] §5 (experimental results): The reported outperformance is presented without per-task variance, statistical significance tests, or controls that isolate the contribution of the teacher update versus the student update. Without these, it is difficult to attribute gains specifically to the co-evolution mechanism rather than to increased compute or different hyper-parameters.

    Authors: We agree that reporting variance and statistical tests is important for rigorous evaluation. The current experiments were run with multiple seeds, but the variance was not reported in the main text. In the revision, we will include per-task means and standard deviations, along with p-values from statistical tests. Furthermore, we will add a control experiment where the teacher is held fixed (disabling the E-step) while matching the total compute, to isolate the effect of the co-evolution. This will be presented in §5 and the appendix. revision: yes
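The entropy and KL diagnostics promised in the first response could look roughly like the following. It is a hedged sketch: distributions are represented as plain probability lists over a small vocabulary, and `collapse_report`, `min_entropy`, and the history format are hypothetical choices rather than anything specified by the authors.

```python
import math


def entropy(p):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats, clamping q away from zero to avoid division by zero."""
    return sum(
        pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0
    )


def collapse_report(teacher_history, student_history, min_entropy=0.1):
    """Track teacher entropy and KL(teacher || student) per training iteration.

    A teacher whose entropy falls below `min_entropy` is flagged as collapsed.
    """
    report = []
    for step, (q, p) in enumerate(zip(teacher_history, student_history)):
        h = entropy(q)
        report.append(
            {
                "step": step,
                "entropy": h,
                "kl_teacher_student": kl_divergence(q, p),
                "collapsed": h < min_entropy,
            }
        )
    return report


rows = collapse_report(
    teacher_history=[[0.5, 0.5], [1.0, 0.0]],
    student_history=[[0.5, 0.5], [0.5, 0.5]],
)
```

Plotting `entropy` and `kl_teacher_student` against `step` would show whether the teacher stays diverse and within reach of the student, which is the stability question the referee raises.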

Circularity Check

0 steps flagged

No circularity: VPD derivation introduces independent co-evolution via standard variational EM without reducing claims to fitted parameters or self-referential inputs

full rationale

The paper formalizes learning from language feedback as a variational EM problem with an explicit E-step (adaptive trust-region refinement of the teacher on trajectory outcomes to produce an improved target distribution) and M-step (student internalization on on-policy rollouts). These steps are defined procedurally from the problem setup and do not reduce by construction to any fitted quantity, renamed prediction, or load-bearing self-citation. The abstract and framework description present the co-evolution as a novel mechanism to overcome fixed-teacher plateaus, with no equations or claims shown to be equivalent to their inputs via definition or prior author work. The derivation remains self-contained against external benchmarks such as standard RLVR and self-distillation baselines.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on standard RL assumptions about on-policy sampling and trust-region stability plus the novel claim that language feedback can be turned into improved token distributions through teacher updates.

free parameters (1)
  • trust-region size
    Adaptive trust-region update is invoked in the E-step but no specific value or schedule is given in the abstract.
axioms (1)
  • domain assumption: On-policy rollouts yield unbiased samples for policy improvement
    Invoked when the student internalizes guidance on its own rollouts in the M-step.
invented entities (1)
  • Dynamically improved target token distribution (no independent evidence)
    purpose: To convert textual feedback into dense supervision that evolves with the student
    Introduced as the output of the E-step refinement; no independent evidence outside the proposed method is provided.
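Since the ledger notes that no trust-region schedule is given, one way such a schedule might look is the adaptive KL penalty heuristic from the PPO paper [31], swapped in here purely for illustration. It is not the paper's update rule; `adapt_kl_coefficient`, `target_kl`, and the starting value of `beta` are hypothetical.

```python
def adapt_kl_coefficient(beta, observed_kl, target_kl):
    """PPO-style adaptive KL penalty (illustrative stand-in, not VPD's rule).

    Tighten the penalty when the policy moved too far from its reference,
    loosen it when the policy barely moved, otherwise leave it unchanged.
    """
    if observed_kl > 1.5 * target_kl:
        return beta * 2.0
    if observed_kl < target_kl / 1.5:
        return beta / 2.0
    return beta


beta = 1.0
for observed in [0.30, 0.25, 0.09, 0.01]:
    beta = adapt_kl_coefficient(beta, observed, target_kl=0.1)
```

The appeal of this kind of rule is that the single free parameter becomes a target divergence rather than a raw penalty weight, which is usually easier to reason about across tasks.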

pith-pipeline@v0.9.0 · 5778 in / 1382 out tokens · 52504 ms · 2026-05-20T20:30:03.301932+00:00 · methodology



Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · 19 internal anchors

  1. [1]

    On-policy distillation of language models: Learning from self-generated mistakes

    Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In The Twelfth International Conference on Learning Representations, 2024

  2. [2]

    Retrospective in-context learning for temporal credit assignment with large language models

    Wen-Tse Chen, Jiayu Chen, Fahim Tajwar, Hao Zhu, Xintong Duan, Ruslan Salakhutdinov, and Jeff Schneider. Retrospective in-context learning for temporal credit assignment with large language models. arXiv preprint arXiv:2602.17497, 2026

  3. [3]

    Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models

    Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning converts weak language models to strong language models. arXiv preprint arXiv:2401.01335, 2024

  4. [4]

    KTO: Model Alignment as Prospect Theoretic Optimization

    Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Kto: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306, 2024

  5. [5]

    SciKnowEval: Evaluating multi-level scientific knowledge of large language models

    Kehua Feng, Xinyi Shen, Weijie Wang, Xiang Zhuang, Yuqi Tang, Qiang Zhang, and Keyan Ding. Sciknoweval: Evaluating multi-level scientific knowledge of large language models. arXiv preprint arXiv:2406.09098, 2024

  6. [6]

    Natural language reinforcement learning

    Xidong Feng, Bo Liu, Yan Song, Haotian Fu, Ziyu Wan, Girish A Koushik, Zhiyuan Hu, Mengyue Yang, Ying Wen, and Jun Wang. Natural language reinforcement learning. arXiv preprint arXiv:2411.14251, 2024

  7. [7]

    Scaling laws for reward model overoptimization

    Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, pages 10835–10866. PMLR, 2023

  8. [8]

    Aligning language models with preferences through f-divergence minimization

    Dongyoung Go, Tomasz Korbak, Germán Kruszewski, Jos Rozen, Nahyeon Ryu, and Marc Dymetman. Aligning language models with preferences through f-divergence minimization. arXiv preprint arXiv:2302.08215, 2023

  9. [9]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  10. [10]

    Reinforcement Learning via Self-Distillation

    Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, et al. Reinforcement learning via self-distillation. arXiv preprint arXiv:2601.20802, 2026

  11. [11]

    LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024

  12. [12]

    Binary classifier optimization for large language model alignment

    Seungjae Jung, Gunsoo Han, Daniel Wontae Nam, and Kyoung-Woon On. Binary classifier optimization for large language model alignment. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1858–1872, 2025

  13. [13]

    VinePPO: Unlocking RL potential for LLM reasoning through refined credit assignment

    Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance, Alessandro Sordoni, Siva Reddy, Aaron Courville, and Nicolas Le Roux. Vineppo: Refining credit assignment in rl training of llms. arXiv preprint arXiv:2410.01679, 2024

  14. [14]

    Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?

    Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dohyung Kim, Jiwon Jeon, Dongsheng Li, and Yuqing Yang. Why does self-distillation (sometimes) degrade the reasoning capability of llms? arXiv preprint arXiv:2603.24472, 2026

  15. [15]

    Feedback descent: Open-ended text optimization via pairwise comparison

    Yoonho Lee, Joseph Boen, and Chelsea Finn. Feedback descent: Open-ended text optimization via pairwise comparison. arXiv preprint arXiv:2511.07919, 2025

  16. [16]

    Let's Verify Step by Step

    Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. arXiv preprint arXiv:2305.20050, 2023

  17. [17]

    Chain of hindsight aligns language models with feedback

    Hao Liu, Carmelo Sferrazza, and Pieter Abbeel. Chain of hindsight aligns language models with feedback. arXiv preprint arXiv:2302.02676, 2023

  18. [18]

    Inference-time scaling for generalist reward modeling

    Zijun Liu, Peiyi Wang, Runxin Xu, Shirong Ma, Chong Ruan, Peng Li, Yang Liu, and Yu Wu. Inference-time scaling for generalist reward modeling. arXiv preprint arXiv:2504.02495, 2025

  19. [19]

    Language models can learn from verbal feedback without scalar rewards

    Renjie Luo, Zichen Liu, Xiangyan Liu, Chao Du, Min Lin, Wenhu Chen, Wei Lu, and Tianyu Pang. Language models can learn from verbal feedback without scalar rewards. arXiv preprint arXiv:2509.22638, 2025

  20. [20]

    A view of the em algorithm that justifies incremental, sparse, and other variants

    Radford M Neal and Geoffrey E Hinton. A view of the em algorithm that justifies incremental, sparse, and other variants. In Learning in graphical models, pages 355–368. Springer, 1998

  21. [21]

    Olmo 3

    Team Olmo, Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heineman, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, et al. Olmo 3. arXiv preprint arXiv:2512.13961, 2025

  22. [22]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022

  23. [23]

    Iterative reasoning preference optimization

    Richard Y Pang, Weizhe Yuan, Kyunghyun Cho, He He, Sainbayar Sukhbaatar, and Jason Weston. Iterative reasoning preference optimization. Advances in Neural Information Processing Systems, 37:116617–116637, 2024

  24. [24]

    Privileged Information Distillation for Language Models

    Emiliano Penaloza, Dheeraj Vattikonda, Nicolas Gontier, Alexandre Lacoste, Laurent Charlin, and Massimo Caccia. Privileged information distillation for language models. arXiv preprint arXiv:2602.04942, 2026

  25. [25]

    Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning

    Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019

  26. [26]

    Pope: Learning to reason on hard problems via privileged on-policy exploration

    Yuxiao Qu, Amrith Setlur, Virginia Smith, Ruslan Salakhutdinov, and Aviral Kumar. Pope: Learning to reason on hard problems via privileged on-policy exploration. arXiv preprint arXiv:2601.18779, 2026

  27. [27]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023

  28. [28]

    Direct nash optimization: Teaching language models to self-improve with general preferences

    Corby Rosset, Ching-An Cheng, Arindam Mitra, Michael Santacroce, Ahmed Awadallah, and Tengyang Xie. Direct nash optimization: Teaching language models to self-improve with general preferences. arXiv preprint arXiv:2404.03715, 2024

  29. [29]

    Training language models with language feedback at scale

    Jérémy Scheurer, Jon Ander Campos, Tomasz Korbak, Jun Shern Chan, Angelica Chen, Kyunghyun Cho, and Ethan Perez. Training language models with language feedback at scale. arXiv preprint arXiv:2303.16755, 2023

  30. [30]

    Trust region policy optimization

    John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897. PMLR, 2015

  31. [31]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  32. [32]

    Reuse your flops: Scaling rl on hard problems by conditioning on very off-policy prefixes

    Amrith Setlur, Zijian Wang, Andrew Cohen, Paria Rashidinejad, and Sang Michael Xie. Reuse your flops: Scaling rl on hard problems by conditioning on very off-policy prefixes. arXiv preprint arXiv:2601.18795, 2026

  33. [33]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  34. [34]

    Self-Distillation Enables Continual Learning

    Idan Shenfeld, Mehul Damani, Jonas Hübotter, and Pulkit Agrawal. Self-distillation enables continual learning. arXiv preprint arXiv:2601.19897, 2026

  35. [35]

    Expanding the capabilities of reinforcement learning via text feedback

    Yuda Song, Lili Chen, Fahim Tajwar, Remi Munos, Deepak Pathak, J Andrew Bagnell, Aarti Singh, and Andrea Zanette. Expanding the capabilities of reinforcement learning via text feedback. arXiv preprint arXiv:2602.02482, 2026

  36. [36]

    Self-play preference optimization for language model alignment

    Yue Wu, Zhiqing Sun, Huizhuo Yuan, Kaixuan Ji, Yiming Yang, and Quanquan Gu. Self-play preference optimization for language model alignment. arXiv preprint arXiv:2405.00675, 2024

  37. [37]

    Provably learning from language feedback

    Wanqiao Xu, Allen Nie, Ruijie Zheng, Aditya Modi, Adith Swaminathan, and Ching-An Cheng. Provably learning from language feedback. arXiv preprint arXiv:2506.10341, 2025

  38. [38]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  39. [39]

    Self-Distilled RLVR

    Chenxu Yang, Chuanyu Qin, Qingyi Si, Minghui Chen, Naibin Gu, Dingyu Yao, Zheng Lin, Weiping Wang, Jiaqi Wang, and Nan Duan. Self-distilled rlvr. arXiv preprint arXiv:2604.03128, 2026

  40. [40]

    Deepcritic: Deliberate critique with large language models

    Wenkai Yang, Jingwen Chen, Yankai Lin, and Ji-Rong Wen. Deepcritic: Deliberate critique with large language models. arXiv preprint arXiv:2505.00662, 2025

  41. [41]

    Online experiential learning for language models

    Tianzhu Ye, Li Dong, Qingxiu Dong, Xun Wu, Shaohan Huang, and Furu Wei. Online experiential learning for language models. arXiv preprint arXiv:2603.16856, 2026

  42. [42]

    On-Policy Context Distillation for Language Models

    Tianzhu Ye, Li Dong, Xun Wu, Shaohan Huang, and Furu Wei. On-policy context distillation for language models. arXiv preprint arXiv:2602.12275, 2026

  43. [43]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025

  44. [44]

    Self-rewarding language models

    Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, and Jason E Weston. Self-rewarding language models. In Forty-first International Conference on Machine Learning, 2024

  45. [45]

    Generative verifiers: Reward modeling as next-token prediction

    Lunjun Zhang, Arian Hosseini, Hritik Bansal, Mehran Kazemi, Aviral Kumar, and Rishabh Agarwal. Generative verifiers: Reward modeling as next-token prediction. arXiv preprint arXiv:2408.15240, 2024

  46. [46]

    American invitational mathematics examination (aime) 2024, 2024

    Yifan Zhang and Team Math-AI. American invitational mathematics examination (aime) 2024, 2024

  47. [47]

    American invitational mathematics examination (aime) 2025, 2025

    Yifan Zhang and Team Math-AI. American invitational mathematics examination (aime) 2025, 2025

  48. [48]

    Improving sampling efficiency in rlvr through adaptive rollout and response reuse

    Yuheng Zhang, Wenlin Yao, Changlong Yu, Yao Liu, Qingyu Yin, Bing Yin, Hyokun Yun, and Lihong Li. Improving sampling efficiency in rlvr through adaptive rollout and response reuse. arXiv preprint arXiv:2509.25808, 2025

  49. [49]

    Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

    Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models. arXiv preprint arXiv:2601.18734, 2026

  50. [50]

    Group Sequence Policy Optimization

    Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071, 2025

  51. [51]

    Act only when it pays: Efficient reinforcement learning for llm reasoning via selective rollouts

    Haizhong Zheng, Yang Zhou, Brian R Bartoldson, Bhavya Kailkhura, Fan Lai, Jiawei Zhao, and Beidi Chen. Act only when it pays: Efficient reinforcement learning for llm reasoning via selective rollouts. arXiv preprint arXiv:2506.02177, 2025

  52. [52]

    Policy improvement using language feedback models

    Victor Zhong, Dipendra Misra, Xingdi Yuan, and Marc-Alexandre Côté. Policy improvement using language feedback models. Advances in Neural Information Processing Systems, 37:43730–43758, 2024

  53. [53]

    Variational reasoning for language models

    Xiangxin Zhou, Zichen Liu, Haonan Wang, Chao Du, Min Lin, Chongxuan Li, Liang Wang, and Tianyu Pang. Variational reasoning for language models. arXiv preprint arXiv:2509.22637, 2025

    A Theoretical Derivations. This appendix provides the formal derivations for the variational framework introduced in Section 3. We first derive the closed-form optimal polic...

  54. [54]

    next step,

    This reveals that the term $\exp(-1 - \lambda/\beta)$ acts as a normalization constant. We define the partition function $Z(x)$ as: $Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x) \exp\left(\frac{1}{\beta} r(x, y)\right)$ (A.5). Thus, the optimal target distribution is the exponentially reward-tilted policy: $\pi^{*}(y \mid x) = \frac{1}{Z(x)} \pi_{\mathrm{ref}}(y \mid x) \exp\left(\frac{1}{\beta} r(x, y)\right)$ (A.6). A.2 Equivalence of Reverse KL and the RLVR Objective. We now demonst...

  55. [55]

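    Joint Loss Optimization. The most straightforward baseline computes the objective losses independently and optimizes their weighted sum. We calculate the standard GRPO surrogate loss $\mathcal{L}_{\mathrm{GRPO}}$ using the sequence-level advantages, and combine it with the SDPO KL distillation loss: $\mathcal{L}_{\mathrm{Hybrid}}(\theta) = \omega_{\mathrm{opd}} \cdot \mathcal{L}_{\mathrm{SDPO}}(\theta) + \omega_{\mathrm{rl}} \cdot \mathcal{L}_{\mathrm{GRPO}}(\theta)$ (B.23), where $\omega_{\mathrm{opd}}$ and $\omega_{\mathrm{rl}}$ are hyper...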

  56. [56]

    Advantage Reshaping. Instead of summing the final losses, a second class of baselines fuses the signals at the advantage level. Following the methodology of Self-Distillation Policy Optimization (SDPO) [10], the teacher’s dense distillation signal can be translated into a per-token advantage, $A^{\mathrm{SDPO}}_{t} = \mathrm{sg}\left(\log q_{\phi}(y_t \mid x, C, y_{<t}) - \log \pi_{\theta}(y_t \mid x, y_{<t})\right)$. This is ...

  57. [57]

    thinking mode

    Distillation-Guided Advantage Reweighting. A fundamental limitation of the standard GRPO advantage $A^{\mathrm{GRPO}}$ is its uniform application to all tokens in a sequence, failing to differentiate between critical reasoning steps and generic filler. To construct a baseline that addresses this without fully decoupling the steps, we can explicitly weight the sequence-l...
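The two hybrid baselines quoted in the appendix fragments above, the weighted loss sum of Eq. B.23 and the per-token SDPO-style advantage, can be illustrated with plain floats. This is a sketch under assumptions: log-probabilities are passed in as precomputed lists, the stop-gradient is implicit because nothing here is differentiated, and the function names and example values are hypothetical.

```python
def sdpo_token_advantages(teacher_logprobs, student_logprobs):
    """Per-token advantage: log q(y_t | x, C, y_<t) - log pi(y_t | x, y_<t).

    In an autodiff framework the result would be wrapped in a stop-gradient;
    with plain floats it is already a constant.
    """
    if len(teacher_logprobs) != len(student_logprobs):
        raise ValueError("teacher and student must score the same tokens")
    return [q - p for q, p in zip(teacher_logprobs, student_logprobs)]


def hybrid_loss(loss_sdpo, loss_grpo, w_opd, w_rl):
    """Weighted sum of the distillation and RL losses, as in Eq. B.23."""
    return w_opd * loss_sdpo + w_rl * loss_grpo


# Illustrative values only: the teacher, which sees the feedback C, prefers
# the first token more than the student does and the second token less.
advantages = sdpo_token_advantages([-0.5, -1.0], [-1.0, -0.5])
total = hybrid_loss(loss_sdpo=2.0, loss_grpo=4.0, w_opd=0.5, w_rl=0.25)
```

A positive entry in `advantages` marks a token the feedback-conditioned teacher finds more likely than the student does, so the signal is dense across the sequence rather than a single scalar per rollout.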