Learning from Language Feedback via Variational Policy Distillation

Erik Nijkamp; Semih Yavuz; Shafiq Joty; Yang Li

arxiv: 2605.15113 · v2 · pith:GZ3TWWSPnew · submitted 2026-05-14 · 💻 cs.LG


Pith reviewed 2026-05-20 20:30 UTC · model grok-4.3

classification 💻 cs.LG

keywords: variational policy distillation · language feedback · reinforcement learning · self-distillation · scientific reasoning · code generation · expectation maximization


The pith

Variational Policy Distillation co-evolves a teacher policy to extract better signals from language feedback as the student improves.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Variational Policy Distillation (VPD) to address the sparse rewards and exploration bottlenecks of reinforcement learning from verifiable rewards (RLVR) by letting the interpreter of language feedback improve during training instead of staying fixed. It frames the interaction as a variational expectation-maximization process: in the E-step the teacher policy is updated on trajectory outcomes with an adaptive trust-region method to produce improved target token distributions, and in the M-step the student internalizes these distributions on its own on-policy rollouts. A sympathetic reader would care because this co-evolution is meant to keep the teacher from plateauing as the student advances; the paper reports consistent gains over standard RLVR and passive self-distillation on scientific reasoning and code generation tasks across several diagnostic feedback sources.

Core claim

Variational Policy Distillation formalizes learning from language feedback as a Variational Expectation-Maximization problem. In the E-step the teacher is actively refined on trajectory outcomes via an adaptive trust-region update, translating textual feedback into a dynamically improved target token distribution. In the M-step the student internalizes this dense distributional guidance on its own on-policy rollouts. By continuously improving the teacher's ability to extract actionable signals from textual critique, VPD overcomes the limitations of passive distillation and outperforms both standard RLVR and existing self-distillation baselines on scientific reasoning and code generation.
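
For orientation, the generic variational EM lower bound that this kind of framing usually rests on can be sketched as below. This is a textbook form under stated assumptions, not the paper's own objective: the optimality variable $O$, the prompt $x$, the response $y$, the feedback $c$, the reward $r$, and the temperature $\beta$ are illustrative symbols that do not appear in the text above.

```latex
\begin{aligned}
\log p_\theta(O = 1 \mid x)
  &\ge \mathbb{E}_{q(y \mid x, c)}\!\left[\log p(O = 1 \mid x, y)\right]
     - \mathrm{KL}\!\left(q(y \mid x, c) \,\|\, \pi_\theta(y \mid x)\right), \\
\text{E-step:}\quad q^{*}(y \mid x, c)
  &\propto \pi_{\theta_{\text{old}}}(y \mid x)\,\exp\!\left(r(x, y) / \beta\right)
  \quad \text{if } \log p(O = 1 \mid x, y) = r(x, y)/\beta + \text{const}, \\
\text{M-step:}\quad \theta
  &\leftarrow \arg\min_{\theta}\ \mathrm{KL}\!\left(q^{*}(y \mid x, c) \,\|\, \pi_\theta(y \mid x)\right).
\end{aligned}
```

In this reading the KL term plays the role of the trust region: a larger $\beta$ keeps the teacher target $q^{*}$ closer to the current student, while a smaller $\beta$ lets the outcome signal dominate. A parametric teacher conditioned on $c$ can only approximate $q^{*}$, which is where the stability question raised later on this page enters.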

What carries the argument

Variational Expectation-Maximization with adaptive trust-region update on the teacher, which refines target token distributions from textual feedback for the student to follow.

If this is right

VPD outperforms standard RLVR and passive self-distillation baselines across diverse diagnostic feedback sources on scientific reasoning and code generation tasks.
The method supports learning in cold-start regimes where initial policies have limited capabilities.
Results on rigid mathematical reasoning tasks highlight the limits of feedback-driven self-distillation relative to pure environment-driven RL.
Co-evolution prevents the teacher's assessment quality from plateauing as the student policy advances.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The variational framing may extend naturally to settings where feedback comes from multiple sources or human evaluators rather than fixed models.
Similar co-evolution mechanisms could address exploration bottlenecks in other sparse-reward domains beyond reasoning and coding.
The approach suggests that one-way distillation methods may underperform when both teacher and student can improve jointly over time.

Load-bearing premise

An adaptive trust-region update on the teacher will reliably turn textual feedback into a stable and useful target token distribution for the student without adding instability or bias.
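
To make the premise concrete, one common way to implement an "adaptive" trust region is the adaptive KL penalty rule from PPO. The sketch below assumes that rule purely for illustration; the page does not state which schedule VPD uses, and the function name, `factor`, and `tolerance` are hypothetical.

```python
def adapt_kl_coef(beta, observed_kl, target_kl, factor=2.0, tolerance=1.5):
    """PPO-style adaptive KL coefficient (illustrative, not the paper's rule).

    beta        : current penalty weight on KL(teacher || reference)
    observed_kl : KL measured after the latest teacher update
    target_kl   : desired size of the trust region
    """
    if observed_kl > tolerance * target_kl:
        # The teacher moved too far from the reference: tighten the region.
        return beta * factor
    if observed_kl < target_kl / tolerance:
        # The teacher barely moved: loosen the region.
        return beta / factor
    # Within the tolerance band: leave the coefficient unchanged.
    return beta
```

Under this rule the penalty weight only changes when the measured divergence leaves a band around the target, which is one simple way to keep teacher updates from drifting without hand-tuning a fixed coefficient.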

What would settle it

An experiment showing that the teacher's feedback interpretation stops improving or that VPD performs no better than fixed-teacher baselines on the same reasoning and code tasks would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.15113 by Erik Nijkamp, Semih Yavuz, Shafiq Joty, Yang Li.

**Figure 1.** Reward margin between correct and incorrect responses during LCB training. 1. Environment Feedback (LiveCodeBench). For code generation tasks, the environment acts as a natural, deterministic verifier, providing rich feedback such as runtime errors and failed unit test assertions. We evaluate Qwen3-8B (with reasoning/thinking mode disabled) on the LiveCodeBench (LCB) v6 subset, following the public and …

**Figure 2.** Training progression on SciKnowEval. 2. Contrastive Sibling Rollouts (SciKnowEval). For many scientific reasoning tasks, ground-truth textual feedback is unavailable; the environment only provides a sparse, binary correctness signal. In these scenarios, we can synthesize the diagnostic feedback C using the model’s own generations. Following the methodology of SDPO [10], we provide the student with a …

**Figure 3.** Training progression on Qwen3-4B-Base. The "Cold Start" Problem on Base Models. Recent literature demonstrates that GRPO can elicit advanced reasoning capabilities from a base foundation model. However, when we apply SDPO to base models, performance rapidly collapses to near zero. We hypothesize that self-distillation intrinsically requires the policy to possess a rudimentary level of instruction-followin…

**Figure 4.** Performance on the Math500 benchmark for models trained on DAPO-Math. Mathematical Reasoning. Similarly, on challenging mathematical benchmarks (e.g., training on DAPO-Math), SDPO suffers from severe training collapse. This vulnerability to mathematical reasoning domains has been observed in concurrent works [14]. While VPD again successfully delays this collapse, pure GRPO remains the dominant approach …

**Figure 5.** Training progression on Qwen3-1.7B with different reference models for the E-step. Ablation: Dynamic Reference Prior. As established in Eq. 7, VPD dynamically anchors the reference prior to the current student policy (πθ). This sliding trust region restricts the teacher’s target distribution, ensuring its guidance remains safely reachable for the student. To validate this design, we conduct an ablation study c…
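
The idea behind anchoring the reference prior to the current student can be illustrated on a single categorical "token" distribution. Eq. 7 is not reproduced on this page, so the sketch below uses the standard closed-form solution of a KL-regularized objective, q(a) ∝ prior(a)·exp(A(a)/β), as a stand-in. The function names, the advantage values, and the two β settings are hypothetical.

```python
import math

def kl_regularized_target(prior, advantages, beta):
    """Closed-form maximizer of E_q[A] - beta * KL(q || prior) over categorical q."""
    weights = [p * math.exp(a / beta) for p, a in zip(prior, advantages)]
    z = sum(weights)
    return [w / z for w in weights]

def kl(p, q):
    """KL(p || q) for two categorical distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical student distribution over three tokens and per-token advantages.
student = [0.7, 0.2, 0.1]
advantages = [-1.0, 0.5, 1.0]

# Large beta: tight trust region, target stays near the student.
tight = kl_regularized_target(student, advantages, beta=5.0)
# Small beta: loose trust region, target follows the advantages aggressively.
loose = kl_regularized_target(student, advantages, beta=0.5)

print("KL(tight || student):", kl(tight, student))
print("KL(loose || student):", kl(loose, student))
```

Because the prior is the student itself, the target can only reweight tokens the student already assigns mass to, which is one way to read the caption's phrase "safely reachable".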

Abstract

Reinforcement learning from verifiable rewards (RLVR) suffers from sparse outcome signals, creating severe exploration bottlenecks on complex reasoning tasks. Recent on-policy self-distillation methods attempt to address this by utilizing language feedback to generate dense, token-level supervision. However, these approaches rely on a fixed, passive teacher to interpret the feedback. As the student policy improves, the teacher's zero-shot assessment capabilities plateau, ultimately halting further learning. To overcome this, we propose Variational Policy Distillation (VPD), a framework that formalizes learning from language feedback as a Variational Expectation-Maximization (EM) problem. VPD co-evolves both policies: in the E-step, the teacher is actively refined on trajectory outcomes via an adaptive trust-region update, translating textual feedback into a dynamically improved target token distribution. In the M-step, the student internalizes this dense distributional guidance on its own on-policy rollouts. By continuously improving the teacher's ability to extract actionable signals from textual critique, VPD overcomes the limitations of passive distillation. Evaluated across diverse sources of diagnostic feedback on scientific reasoning and code generation tasks, VPD consistently outperforms both standard RLVR and existing self-distillation baselines. Finally, by stress-testing our framework on rigid mathematical reasoning and cold-start regimes, we illuminate the fundamental bounds of feedback-driven self-distillation compared to pure environment-driven RL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

VPD's active teacher refinement via adaptive trust-region in the E-step is the real addition over fixed-teacher self-distillation, but the paper still needs to show that step stays stable and unbiased.


The main thing to know is that this work turns the teacher from a fixed interpreter into something that gets updated on the fly inside a variational EM loop. In the E-step the teacher policy is refined with an adaptive trust-region update on trajectory outcomes so that the language feedback produces a better target distribution for the student; the M-step then lets the student learn from that on its own rollouts. That co-evolution framing is distinct from the passive teacher setups in the self-distillation papers they cite, and the experiments on scientific reasoning and code generation with multiple feedback sources show consistent gains over both plain RLVR and the earlier baselines. The cold-start and rigid math stress tests are also useful for mapping where the method helps and where it does not.
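
The co-evolution loop described in the letter can be sketched on a toy problem. This is a tabular illustration under strong assumptions, not the authors' training procedure: a single decision over three actions stands in for a token distribution, a fixed reward vector stands in for the outcome signal behind the language feedback, and the interpolation step in `m_step` replaces gradient-based distillation. All names and numbers are hypothetical.

```python
import math

def e_step(student, rewards, beta):
    """Teacher target anchored on the student: q(a) ∝ student(a) * exp(r(a) / beta)."""
    weights = [p * math.exp(r / beta) for p, r in zip(student, rewards)]
    z = sum(weights)
    return [w / z for w in weights]

def m_step(student, teacher, step):
    """Move the student part of the way toward the teacher target."""
    return [(1 - step) * p + step * q for p, q in zip(student, teacher)]

def expected_reward(policy, rewards):
    return sum(p * r for p, r in zip(policy, rewards))

rewards = [0.0, 0.0, 1.0]   # only the third action is correct
student = [0.6, 0.3, 0.1]   # the student starts out mostly wrong
history = [expected_reward(student, rewards)]

for _ in range(20):
    teacher = e_step(student, rewards, beta=1.0)    # teacher re-derived each round
    student = m_step(student, teacher, step=0.5)    # student distills the target
    history.append(expected_reward(student, rewards))

print([round(v, 3) for v in history])
```

Because the teacher is recomputed from the latest student every round, its target keeps moving ahead of the student; a fixed teacher computed once from the initial student would stop providing new signal after the student reached it.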

Referee Report

2 major / 2 minor

Summary. The paper proposes Variational Policy Distillation (VPD), which casts learning from language feedback in RLVR as a variational EM procedure. The E-step refines the teacher via an adaptive trust-region update on trajectory outcomes to produce an improved target token distribution from textual critique; the M-step updates the student on its own on-policy rollouts to internalize that distribution. The method is evaluated on scientific reasoning and code generation tasks with diverse diagnostic feedback sources and is stress-tested on rigid mathematical reasoning and cold-start regimes, claiming consistent gains over standard RLVR and passive self-distillation baselines.

Significance. If the co-evolution mechanism is stable, VPD would offer a concrete way to overcome the plateau of fixed teachers in feedback-driven distillation, potentially improving sample efficiency on complex reasoning tasks where outcome signals are sparse. The explicit stress-testing on mathematical reasoning and cold-start regimes is a positive feature that helps delineate the practical limits of language-feedback approaches relative to pure environment-driven RL.

major comments (2)

§3.2 (E-step and trust-region update): The central claim that the adaptive trust-region update reliably converts textual feedback into a dynamically improved, non-degenerate target token distribution rests on an unproven assumption of stability and lack of bias as the student policy shifts. No explicit KL bounds, contraction arguments, or ablation results are supplied showing that the refined teacher distribution remains useful and does not collapse or introduce systematic bias across iterations; this directly underpins the asserted advantage over passive distillation.
§5 (experimental results): The reported outperformance is presented without per-task variance, statistical significance tests, or controls that isolate the contribution of the teacher update versus the student update. Without these, it is difficult to attribute gains specifically to the co-evolution mechanism rather than to increased compute or different hyper-parameters.
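
The collapse concern in the first comment could be monitored with two simple per-iteration diagnostics: the entropy of the teacher's token distribution and its KL divergence from the student. The sketch below is a minimal illustration; the function names, the `min_entropy` threshold, and the example distributions are hypothetical and not taken from the paper.

```python
import math

def entropy(p):
    """Shannon entropy (nats) of a categorical distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q); q is floored at eps to avoid division by zero."""
    return sum(x * math.log(x / max(y, eps)) for x, y in zip(p, q) if x > 0)

def collapse_report(teacher_by_iter, student_by_iter, min_entropy=0.05):
    """One row per iteration: teacher entropy, teacher-student KL, collapse flag."""
    rows = []
    for t, (teacher, student) in enumerate(zip(teacher_by_iter, student_by_iter)):
        h = entropy(teacher)
        rows.append({
            "iter": t,
            "entropy": h,
            "kl_teacher_student": kl_divergence(teacher, student),
            "collapsed": h < min_entropy,
        })
    return rows

# Hypothetical two-token distributions at two training iterations.
report = collapse_report(
    teacher_by_iter=[[0.5, 0.5], [0.999, 0.001]],
    student_by_iter=[[0.5, 0.5], [0.5, 0.5]],
)
for row in report:
    print(row)
```

A teacher entropy that falls toward zero while the teacher-student KL grows would be the signature of the degenerate targets the referee is worried about.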

minor comments (2)

§3: Notation for the variational objective and the trust-region constraint should be introduced with a single consistent symbol table or appendix equation list to avoid repeated re-definition across sections.
Figure 2 (training curves) would benefit from shaded standard-error bands and explicit labeling of which curves correspond to the teacher versus student policy.
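
The standard-error bands requested for the training curves are straightforward to compute from per-seed runs. The sketch below uses only the Python standard library; the three curves are made-up placeholder values, not results from the paper.

```python
import math
import statistics

def mean_and_sem(curves):
    """Per-step mean and standard error across seeds.

    curves: list of per-seed metric lists, all of the same length.
    """
    n = len(curves)
    means, sems = [], []
    for step_values in zip(*curves):
        means.append(statistics.fmean(step_values))
        # Sample standard deviation divided by sqrt(n); zero if only one seed.
        sems.append(statistics.stdev(step_values) / math.sqrt(n) if n > 1 else 0.0)
    return means, sems

# Placeholder accuracy curves for three seeds at three checkpoints.
curves = [
    [0.10, 0.30, 0.50],
    [0.20, 0.40, 0.60],
    [0.30, 0.50, 0.70],
]
means, sems = mean_and_sem(curves)
lower = [m - s for m, s in zip(means, sems)]
upper = [m + s for m, s in zip(means, sems)]
print(list(zip(lower, means, upper)))
```

The `lower` and `upper` lists are what a plotting library would shade around the mean curve.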

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their detailed and insightful comments, which have helped us identify areas for improvement in the manuscript. We address each of the major comments below.


Referee: §3.2 (E-step and trust-region update): The central claim that the adaptive trust-region update reliably converts textual feedback into a dynamically improved, non-degenerate target token distribution rests on an unproven assumption of stability and lack of bias as the student policy shifts. No explicit KL bounds, contraction arguments, or ablation results are supplied showing that the refined teacher distribution remains useful and does not collapse or introduce systematic bias across iterations; this directly underpins the asserted advantage over passive distillation.

Authors: We acknowledge that the manuscript would benefit from stronger empirical validation of the teacher update's stability. While we do not provide formal contraction arguments or KL bounds in the current version, the adaptive trust-region mechanism is intended to maintain stability by limiting updates based on verifiable outcomes. In the revised manuscript, we will add ablation experiments that measure the entropy and KL divergence of the teacher distribution over iterations to demonstrate that it does not collapse or become biased. These results will be included in an expanded §3.2 and the appendix. Revision: yes.

Referee: §5 (experimental results): The reported outperformance is presented without per-task variance, statistical significance tests, or controls that isolate the contribution of the teacher update versus the student update. Without these, it is difficult to attribute gains specifically to the co-evolution mechanism rather than to increased compute or different hyper-parameters.

Authors: We agree that reporting variance and statistical tests is important for rigorous evaluation. The current experiments were run with multiple seeds, but the variance was not reported in the main text. In the revision, we will include per-task means and standard deviations, along with p-values from statistical tests. Furthermore, we will add a control experiment where the teacher is held fixed (disabling the E-step) while matching the total compute, to isolate the effect of the co-evolution. This will be presented in §5 and the appendix. Revision: yes.
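
One way the promised comparison against a fixed-teacher control could be tested is an exact paired sign-flip permutation test over seeds. The rebuttal does not say which statistical test will be used, so this choice is an assumption, and the scores below are placeholder numbers, not results from the paper.

```python
import itertools
import statistics

def paired_sign_flip_pvalue(a, b):
    """Exact two-sided sign-flip permutation test on paired differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(statistics.fmean(diffs))
    count = 0
    total = 0
    # Enumerate every assignment of signs to the paired differences.
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        stat = abs(statistics.fmean([s * d for s, d in zip(signs, diffs)]))
        if stat >= observed - 1e-12:
            count += 1
        total += 1
    return count / total

# Placeholder per-seed accuracies (same seeds, same compute budget assumed).
co_evolved = [0.62, 0.64, 0.61, 0.65, 0.63]
fixed_teacher = [0.58, 0.60, 0.59, 0.61, 0.60]

print("mean difference:", statistics.fmean(x - y for x, y in zip(co_evolved, fixed_teacher)))
print("std (co-evolved):", statistics.stdev(co_evolved))
print("p-value:", paired_sign_flip_pvalue(co_evolved, fixed_teacher))
```

With only five seeds the smallest achievable two-sided p-value is 2/32, which is a reminder that a handful of seeds limits how strong a significance claim can be, regardless of the effect size.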

Circularity Check

0 steps flagged

No circularity: VPD derivation introduces independent co-evolution via standard variational EM without reducing claims to fitted parameters or self-referential inputs


The paper formalizes learning from language feedback as a variational EM problem with an explicit E-step (adaptive trust-region refinement of the teacher on trajectory outcomes to produce an improved target distribution) and M-step (student internalization on on-policy rollouts). These steps are defined procedurally from the problem setup and do not reduce by construction to any fitted quantity, renamed prediction, or load-bearing self-citation. The abstract and framework description present the co-evolution as a novel mechanism to overcome fixed-teacher plateaus, with no equations or claims shown to be equivalent to their inputs via definition or prior author work. The derivation remains self-contained against external benchmarks such as standard RLVR and self-distillation baselines.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on standard RL assumptions about on-policy sampling and trust-region stability plus the novel claim that language feedback can be turned into improved token distributions through teacher updates.

free parameters (1)

trust-region size
Adaptive trust-region update is invoked in the E-step but no specific value or schedule is given in the abstract.

axioms (1)

domain assumption: On-policy rollouts yield unbiased samples for policy improvement
Invoked when the student internalizes guidance on its own rollouts in the M-step.

invented entities (1)

Dynamically improved target token distribution (no independent evidence)
purpose: To convert textual feedback into dense supervision that evolves with the student
Introduced as the output of the E-step refinement; no independent evidence outside the proposed method is provided.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear

Paper passage: VPD co-evolves both policies: in the E-step, the teacher is actively refined on trajectory outcomes via an adaptive trust-region update, translating textual feedback into a dynamically improved target token distribution. In the M-step, the student internalizes this dense distributional guidance on its own on-policy rollouts.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relation: unclear

Paper passage: We formalize on-policy learning from language feedback as a Variational EM procedure.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · 19 internal anchors

[1]

On-policy distillation of language models: Learning from self-generated mistakes

Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. InThe twelfth international conference on learning representations, 2024

work page 2024
[2]

Retrospective in-context learning for temporal credit assignment with large language models.arXiv preprint arXiv:2602.17497, 2026

Wen-Tse Chen, Jiayu Chen, Fahim Tajwar, Hao Zhu, Xintong Duan, Ruslan Salakhutdinov, and Jeff Schneider. Retrospective in-context learning for temporal credit assignment with large language models.arXiv preprint arXiv:2602.17497, 2026

work page arXiv 2026
[3]

Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models

Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning converts weak language models to strong language models.arXiv preprint arXiv:2401.01335, 2024. 10

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

KTO: Model Alignment as Prospect Theoretic Optimization

Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Kto: Model alignment as prospect theoretic optimization.arXiv preprint arXiv:2402.01306, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[5]

2024 , url =

Kehua Feng, Xinyi Shen, Weijie Wang, Xiang Zhuang, Yuqi Tang, Qiang Zhang, and Keyan Ding. Sciknoweval: Evaluating multi-level scientific knowledge of large language models. arXiv preprint arXiv:2406.09098, 2024

work page arXiv 2024
[6]

Natural language reinforcement learning.arXiv preprint arXiv:2411.14251, 2024

Xidong Feng, Bo Liu, Yan Song, Haotian Fu, Ziyu Wan, Girish A Koushik, Zhiyuan Hu, Mengyue Yang, Ying Wen, and Jun Wang. Natural language reinforcement learning.arXiv preprint arXiv:2411.14251, 2024

work page arXiv 2024
[7]

Scaling laws for reward model overoptimization

Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. InInternational Conference on Machine Learning, pages 10835–10866. PMLR, 2023

work page 2023
[8]

Aligning language models with preferences through f-divergence minimization

Dongyoung Go, Tomasz Korbak, Germán Kruszewski, Jos Rozen, Nahyeon Ryu, and Marc Dymetman. Aligning language models with preferences through f-divergence minimization. arXiv preprint arXiv:2302.08215, 2023

work page arXiv 2023
[9]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

Reinforcement Learning via Self-Distillation

Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, et al. Reinforcement learning via self-distillation.arXiv preprint arXiv:2601.20802, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[11]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Ar- mando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code.arXiv preprint arXiv:2403.07974, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[12]

Binary classifier optimization for large language model alignment

Seungjae Jung, Gunsoo Han, Daniel Wontae Nam, and Kyoung-Woon On. Binary classifier optimization for large language model alignment. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1858–1872, 2025

work page 2025
[13]

VinePPO: Unlocking RL potential for LLM reasoning through refined credit assignment

Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance, Alessandro Sordoni, Siva Reddy, Aaron Courville, and Nicolas Le Roux. Vineppo: Refining credit assignment in rl training of llms.arXiv preprint arXiv:2410.01679, 2024

work page arXiv 2024
[14]

Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?

Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dohyung Kim, Jiwon Jeon, Dongsheng Li, and Yuqing Yang. Why does self-distillation (sometimes) degrade the reasoning capability of llms?arXiv preprint arXiv:2603.24472, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[15]

arXiv preprint arXiv:2511.07919 , year=

Yoonho Lee, Joseph Boen, and Chelsea Finn. Feedback descent: Open-ended text optimization via pairwise comparison.arXiv preprint arXiv:2511.07919, 2025

work page arXiv 2025
[16]

Let's Verify Step by Step

Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step.arXiv preprint arXiv:2305.20050, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[17]

Chain of hindsight aligns language models with feedback

Hao Liu, Carmelo Sferrazza, and Pieter Abbeel. Chain of hindsight aligns language models with feedback.arXiv preprint arXiv:2302.02676, 2023

work page arXiv 2023
[18]

Inference-time scaling for generalist reward modeling,

Zijun Liu, Peiyi Wang, Runxin Xu, Shirong Ma, Chong Ruan, Peng Li, Yang Liu, and Yu Wu. Inference-time scaling for generalist reward modeling.arXiv preprint arXiv:2504.02495, 2025

work page arXiv 2025
[19]

Language models can learn from verbal feedback without scalar rewards.arXiv preprint arXiv:2509.22638, 2025

Renjie Luo, Zichen Liu, Xiangyan Liu, Chao Du, Min Lin, Wenhu Chen, Wei Lu, and Tianyu Pang. Language models can learn from verbal feedback without scalar rewards.arXiv preprint arXiv:2509.22638, 2025

work page arXiv 2025
[20]

A view of the em algorithm that justifies incremental, sparse, and other variants

Radford M Neal and Geoffrey E Hinton. A view of the em algorithm that justifies incremental, sparse, and other variants. InLearning in graphical models, pages 355–368. Springer, 1998. 11

work page 1998
[21]

Olmo 3

Team Olmo, Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heineman, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, et al. Olmo 3.arXiv preprint arXiv:2512.13961, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[22]

Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

work page 2022
[23]

Iterative reasoning preference optimization.Advances in Neural Information Processing Systems, 37:116617–116637, 2024

Richard Y Pang, Weizhe Yuan, Kyunghyun Cho, He He, Sainbayar Sukhbaatar, and Jason Weston. Iterative reasoning preference optimization.Advances in Neural Information Processing Systems, 37:116617–116637, 2024

work page 2024
[24]

Privileged Information Distillation for Language Models

Emiliano Penaloza, Dheeraj Vattikonda, Nicolas Gontier, Alexandre Lacoste, Laurent Charlin, and Massimo Caccia. Privileged information distillation for language models.arXiv preprint arXiv:2602.04942, 2026

work page internal anchor Pith review arXiv 2026
[25]

Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning

Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning.arXiv preprint arXiv:1910.00177, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1910
[26]

Pope: Learning to reason on hard problems via privileged on-policy exploration.arXiv preprint arXiv:2601.18779, 2026

Yuxiao Qu, Amrith Setlur, Virginia Smith, Ruslan Salakhutdinov, and Aviral Kumar. Pope: Learning to reason on hard problems via privileged on-policy exploration.arXiv preprint arXiv:2601.18779, 2026

work page arXiv 2026
[27]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023

work page 2023
[28]

Direct nash optimization: Teaching language models to self-improve with general preferences,

Corby Rosset, Ching-An Cheng, Arindam Mitra, Michael Santacroce, Ahmed Awadallah, and Tengyang Xie. Direct nash optimization: Teaching language models to self-improve with general preferences.arXiv preprint arXiv:2404.03715, 2024

work page arXiv 2024
[29]

Training language models with language feedback at scale

Jérémy Scheurer, Jon Ander Campos, Tomasz Korbak, Jun Shern Chan, Angelica Chen, Kyunghyun Cho, and Ethan Perez. Training language models with language feedback at scale.arXiv preprint arXiv:2303.16755, 2023

work page arXiv 2023
[30]

Trust region policy optimization

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. InInternational conference on machine learning, pages 1889–1897. PMLR, 2015

work page 2015
[31]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[32]

Reuse your flops: Scaling rl on hard problems by conditioning on very off-policy prefixes.arXiv preprint arXiv:2601.18795, 2026

Amrith Setlur, Zijian Wang, Andrew Cohen, Paria Rashidinejad, and Sang Michael Xie. Reuse your flops: Scaling rl on hard problems by conditioning on very off-policy prefixes.arXiv preprint arXiv:2601.18795, 2026

work page arXiv 2026
[33]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[34]

Self-Distillation Enables Continual Learning

Idan Shenfeld, Mehul Damani, Jonas Hübotter, and Pulkit Agrawal. Self-distillation enables continual learning.arXiv preprint arXiv:2601.19897, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[35]

arXiv preprint arXiv:2602.02482 , year=

Yuda Song, Lili Chen, Fahim Tajwar, Remi Munos, Deepak Pathak, J Andrew Bagnell, Aarti Singh, and Andrea Zanette. Expanding the capabilities of reinforcement learning via text feedback.arXiv preprint arXiv:2602.02482, 2026

work page arXiv 2026
[36]

Self-play preference optimization for language model alignment.arXiv preprint arXiv:2405.00675, 2024

Yue Wu, Zhiqing Sun, Huizhuo Yuan, Kaixuan Ji, Yiming Yang, and Quanquan Gu. Self-play preference optimization for language model alignment.arXiv preprint arXiv:2405.00675, 2024

work page arXiv 2024
[37]

Provably learning from language feedback.arXiv preprint arXiv:2506.10341, 2025

Wanqiao Xu, Allen Nie, Ruijie Zheng, Aditya Modi, Adith Swaminathan, and Ching-An Cheng. Provably learning from language feedback.arXiv preprint arXiv:2506.10341, 2025. 12

work page arXiv 2025
[38]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[39]

Self-Distilled RLVR

Chenxu Yang, Chuanyu Qin, Qingyi Si, Minghui Chen, Naibin Gu, Dingyu Yao, Zheng Lin, Weiping Wang, Jiaqi Wang, and Nan Duan. Self-distilled rlvr.arXiv preprint arXiv:2604.03128, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[40]

Deepcritic: Deliberate critique with large language models.arXiv preprint arXiv:2505.00662, 2025

Wenkai Yang, Jingwen Chen, Yankai Lin, and Ji-Rong Wen. Deepcritic: Deliberate critique with large language models.arXiv preprint arXiv:2505.00662, 2025

work page arXiv 2025
[41]

Online experiential learning for language models.arXiv preprint arXiv:2603.16856, 2026

Tianzhu Ye, Li Dong, Qingxiu Dong, Xun Wu, Shaohan Huang, and Furu Wei. Online experiential learning for language models.arXiv preprint arXiv:2603.16856, 2026

work page arXiv 2026
[42]

On-Policy Context Distillation for Language Models

Tianzhu Ye, Li Dong, Xun Wu, Shaohan Huang, and Furu Wei. On-policy context distillation for language models.arXiv preprint arXiv:2602.12275, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[43]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[44] Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, and Jason E Weston. Self-rewarding language models. In Forty-first International Conference on Machine Learning, 2024.
[45] Lunjun Zhang, Arian Hosseini, Hritik Bansal, Mehran Kazemi, Aviral Kumar, and Rishabh Agarwal. Generative verifiers: Reward modeling as next-token prediction. arXiv preprint arXiv:2408.15240, 2024.
[46] Yifan Zhang and Team Math-AI. American invitational mathematics examination (AIME) 2024, 2024.
[47] Yifan Zhang and Team Math-AI. American invitational mathematics examination (AIME) 2025, 2025.
[48] Yuheng Zhang, Wenlin Yao, Changlong Yu, Yao Liu, Qingyu Yin, Bing Yin, Hyokun Yun, and Lihong Li. Improving sampling efficiency in RLVR through adaptive rollout and response reuse. arXiv preprint arXiv:2509.25808, 2025.
[49] Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models. arXiv preprint arXiv:2601.18734, 2026.
[50] Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071, 2025.
[51] Haizhong Zheng, Yang Zhou, Brian R Bartoldson, Bhavya Kailkhura, Fan Lai, Jiawei Zhao, and Beidi Chen. Act only when it pays: Efficient reinforcement learning for LLM reasoning via selective rollouts. arXiv preprint arXiv:2506.02177, 2025.
[52] Victor Zhong, Dipendra Misra, Xingdi Yuan, and Marc-Alexandre Côté. Policy improvement using language feedback models. Advances in Neural Information Processing Systems, 37:43730–43758, 2024.
[53] Xiangxin Zhou, Zichen Liu, Haonan Wang, Chao Du, Min Lin, Chongxuan Li, Liang Wang, and Tianyu Pang. Variational reasoning for language models. arXiv preprint arXiv:2509.22637, 2025.

A Theoretical Derivations

This appendix provides the formal derivations for the variational framework introduced in Section 3. We first derive the closed-form optimal polic...
This reveals that the term $\exp\left(-1 - \frac{\lambda}{\beta}\right)$ acts as a normalization constant. We define the partition function $Z(x)$ as:

$$Z(x) = \sum_{y} \pi_{\text{ref}}(y \mid x) \exp\left(\frac{1}{\beta} r(x, y)\right). \tag{A.5}$$

Thus, the optimal target distribution is the exponentially reward-tilted policy:

$$\pi^{*}(y \mid x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y \mid x) \exp\left(\frac{1}{\beta} r(x, y)\right). \tag{A.6}$$

A.2 Equivalence of Reverse KL and the RLVR Objective

We now demonst...
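As a small numerical illustration of the partition function in (A.5) and the reward-tilted target in (A.6), the sketch below evaluates both over a finite candidate set. It assumes the response space can be enumerated, which does not hold for real language models; the function name `reward_tilted_policy` and the toy probabilities and rewards are hypothetical and chosen only for illustration.

```python
import math

def reward_tilted_policy(ref_probs, rewards, beta):
    """Return (pi_star, Z) for a finite set of candidate responses y.

    ref_probs[i] plays the role of pi_ref(y_i | x) and rewards[i] the role
    of r(x, y_i). Both lists are illustrative inputs, not from the paper.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    # Unnormalized weights pi_ref(y|x) * exp(r(x, y) / beta).
    weights = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
    # Partition function Z(x), as in (A.5).
    z = sum(weights)
    # Normalized target distribution pi*(y|x), as in (A.6).
    return [w / z for w in weights], z

# Hypothetical toy values: three candidate responses with a binary reward.
ref_probs = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 1.0]
pi_star, z = reward_tilted_policy(ref_probs, rewards, beta=1.0)
```

In this toy setting the tilted distribution shifts mass toward the rewarded responses while keeping their relative ordering under the reference policy. A smaller `beta` sharpens the tilt, and for very small values a log-space computation would likely be needed to avoid overflow in `math.exp`.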

Joint Loss Optimization. The most straightforward baseline computes the objective losses independently and optimizes their weighted sum. We calculate the standard GRPO surrogate loss $\mathcal{L}_{\text{GRPO}}$ using the sequence-level advantages, and combine it with the SDPO KL distillation loss:

$$\mathcal{L}_{\text{Hybrid}}(\theta) = \omega_{\text{opd}} \cdot \mathcal{L}_{\text{SDPO}}(\theta) + \omega_{\text{rl}} \cdot \mathcal{L}_{\text{GRPO}}(\theta), \tag{B.23}$$

where $\omega_{\text{opd}}$ and $\omega_{\text{rl}}$ are hyper...
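The weighted sum in (B.23) can be written as a one-line helper. This is only a sketch on scalar loss values: the function name `hybrid_loss` and the numbers are hypothetical, and how the SDPO and GRPO terms themselves are computed is not shown here.

```python
def hybrid_loss(sdpo_loss, grpo_loss, w_opd, w_rl):
    """Weighted sum in the form of (B.23); inputs are scalar loss values."""
    return w_opd * sdpo_loss + w_rl * grpo_loss

# Hypothetical values, chosen only for illustration.
loss = hybrid_loss(sdpo_loss=0.8, grpo_loss=0.2, w_opd=0.5, w_rl=1.0)
```

Because the combination is linear, applying the same expression to differentiable losses in an autodiff framework yields a gradient that is the same weighted sum of the two individual gradients.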

Advantage Reshaping. Instead of summing the final losses, a second class of baselines fuses the signals at the advantage level. Following the methodology of Self-Distillation Policy Optimization (SDPO) [10], the teacher’s dense distillation signal can be translated into a per-token advantage, $A^{\text{SDPO}}_t = \mathrm{sg}\left(\log q_\phi(y_t \mid x, C, y_{<t}) - \log \pi_\theta(y_t \mid x, y_{<t})\right)$. This is ...
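The per-token advantage defined above is a log-ratio between the feedback-conditioned teacher and the student. A minimal sketch, assuming the per-token log-probabilities of the sampled tokens have already been gathered; the function name `sdpo_token_advantages` and the toy values are hypothetical.

```python
def sdpo_token_advantages(teacher_logps, student_logps):
    """Per-token log-ratio log q_phi(y_t | x, C, y_<t) - log pi_theta(y_t | x, y_<t).

    Inputs are plain floats, so the result is already a constant; in an
    autodiff framework the difference would be wrapped in a stop-gradient,
    which is the role of sg(.) in the formula.
    """
    if len(teacher_logps) != len(student_logps):
        raise ValueError("log-prob sequences must have equal length")
    return [q - p for q, p in zip(teacher_logps, student_logps)]

# Hypothetical log-probabilities for a three-token response.
teacher = [-0.1, -2.0, -0.5]
student = [-0.3, -1.0, -0.5]
adv = sdpo_token_advantages(teacher, student)
```

The advantage is positive where the teacher assigns the sampled token more probability than the student, negative where it assigns less, and zero where the two agree.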

Distillation-Guided Advantage Reweighting. A fundamental limitation of the standard GRPO advantage $A^{\text{GRPO}}$ is its uniform application to all tokens in a sequence, failing to differentiate between critical reasoning steps and generic filler. To construct a baseline that addresses this without fully decoupling the steps, we can explicitly weight the sequence-l...

