arxiv: 2601.20802 · v2 · submitted 2026-01-28 · 💻 cs.LG · cs.AI

Recognition: 3 theorem links


Reinforcement Learning via Self-Distillation

Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, Andreas Krause


Pith reviewed 2026-05-12 04:24 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords: reinforcement learning, self-distillation, policy optimization, large language models, verifiable rewards, credit assignment, code generation, scientific reasoning


The pith

Self-Distillation Policy Optimization turns a model's own feedback on failed attempts into a dense training signal for reinforcement learning in code and math.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models trained via reinforcement learning on verifiable tasks receive only scalar success signals per attempt, which creates a severe credit-assignment problem when learning from errors. The paper introduces Self-Distillation Policy Optimization, which conditions the current model on rich textual feedback such as runtime errors and then distills its resulting next-token predictions back into the policy. This process treats the model as its own teacher, allowing it to retrospectively correct mistakes in context without any external reward model or human teacher. Experiments across scientific reasoning, tool use, and competitive programming benchmarks show gains in sample efficiency and final accuracy over standard RLVR baselines, with similar benefits even when only scalar rewards are available by repurposing successful rollouts as implicit feedback. The method can also be applied at test time to individual questions, reaching the same discovery rate as best-of-k sampling with roughly three times fewer attempts.

Core claim

SDPO formalizes reinforcement learning with rich feedback and shows that conditioning the current policy on tokenized feedback from the environment produces next-token predictions that can be distilled back into the policy as an effective dense learning signal, bypassing the scalar-reward bottleneck and improving both training efficiency and test-time discovery on verifiable tasks.

What carries the argument

Self-Distillation Policy Optimization (SDPO), which treats the model conditioned on feedback as a self-teacher and distills its feedback-informed next-token predictions directly into the policy update.
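The self-teacher idea can be made concrete with a small sketch. It assumes the distillation target is a per-token KL divergence from the feedback-conditioned distribution (teacher) to the unconditioned policy distribution (student); the divergence direction, the averaging over tokens, and all function names are illustrative assumptions, not the paper's stated loss.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def self_distillation_loss(student_logits, teacher_logits):
    """Mean per-token KL(teacher || student) over one sampled attempt.

    student_logits[t]: policy logits given (question, attempt[:t]).
    teacher_logits[t]: logits of the same model given (question, feedback, attempt[:t]),
                       treated here as a fixed target (assumption: no gradient through it).
    """
    per_token = [
        kl_divergence(softmax(t), softmax(s))
        for s, t in zip(student_logits, teacher_logits)
    ]
    return sum(per_token) / len(per_token), per_token

# Toy attempt of two tokens over a three-word vocabulary (made-up numbers).
# At position 0 the feedback changes nothing; at position 1 it shifts mass to token 2.
student = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
teacher = [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0]]
loss, per_token = self_distillation_loss(student, teacher)
```

In this toy case the signal is zero where the feedback leaves the prediction unchanged and positive where it does not, which is the sense in which the signal is dense: every token position gets its own value rather than sharing one scalar reward.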

If this is right

SDPO raises sample efficiency and final accuracy over strong RLVR baselines on scientific reasoning, tool use, and competitive programming tasks.
The same method improves performance in purely scalar-reward environments by treating successful rollouts as implicit feedback for failed attempts.
Test-time application of SDPO to single questions achieves equivalent discovery probability to best-of-k sampling or multi-turn dialogues while using three times fewer attempts.
The approach exploits the model's existing in-context ability to identify its own mistakes when given explanatory feedback.
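To give the "three times fewer attempts" claim a scale, the sketch below computes how many independent attempts best-of-k needs to reach a target discovery probability, using the standard identity 1 - (1 - p)^k for independent attempts with per-attempt success rate p. The values of p and the target are placeholders chosen for illustration, not figures from the paper, and the identity applies only to the best-of-k baseline, since attempts under test-time training are not independent.

```python
def discovery_probability(p, k):
    """Probability that at least one of k independent attempts succeeds."""
    return 1.0 - (1.0 - p) ** k

def attempts_needed(p, target):
    """Smallest k such that best-of-k reaches the target discovery probability."""
    if not (0.0 < p < 1.0 and 0.0 < target < 1.0):
        raise ValueError("p and target must lie strictly between 0 and 1")
    k = 1
    while discovery_probability(p, k) < target:
        k += 1
    return k

def equivalent_success_rate(target, k):
    """Per-attempt success rate at which best-of-k would hit the target with exactly k attempts."""
    return 1.0 - (1.0 - target) ** (1.0 / k)

# Placeholder numbers: a hard question solved in 1% of attempts, 50% discovery target.
baseline_k = attempts_needed(p=0.01, target=0.5)
reduced_k = baseline_k // 3  # what "three times fewer attempts" would mean here
p_equiv = equivalent_success_rate(target=0.5, k=reduced_k)
```

Under these placeholder values, matching the same discovery probability with a third of the attempts corresponds to roughly tripling the effective per-attempt success rate.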

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The self-teacher mechanism may extend naturally to any domain where models can generate explanatory text about their own outputs.
If the distilled signal remains stable across iterations, SDPO could reduce reliance on large external reward models during post-training.
Test-time self-distillation suggests a path toward lightweight online adaptation without retraining the full model.

Load-bearing premise

That the model's next-token predictions when conditioned on feedback form a valid, unbiased, and useful dense learning signal for policy improvement without external validation.

What would settle it

A side-by-side run on LiveCodeBench or a similar verifiable benchmark in which SDPO produces equal or lower final accuracy and sample efficiency than a standard scalar RLVR baseline trained on the identical data and compute budget.

read the original abstract

Large language models are increasingly post-trained with reinforcement learning in verifiable domains such as code and math. Yet, current methods for reinforcement learning with verifiable rewards (RLVR) learn only from a scalar outcome reward per attempt, creating a severe credit-assignment bottleneck. Many verifiable environments actually provide rich textual feedback, such as runtime errors or judge evaluations, that explain why an attempt failed. We formalize this setting as reinforcement learning with rich feedback and introduce Self-Distillation Policy Optimization (SDPO), which converts tokenized feedback into a dense learning signal without any external teacher or explicit reward model. SDPO treats the current model conditioned on feedback as a self-teacher and distills its feedback-informed next-token predictions back into the policy. In this way, SDPO leverages the model's ability to retrospectively identify its own mistakes in-context. Across scientific reasoning, tool use, and competitive programming on LiveCodeBench v6, SDPO improves sample efficiency and final accuracy over strong RLVR baselines. Notably, SDPO also outperforms baselines in standard RLVR environments that only return scalar feedback by using successful rollouts as implicit feedback for failed attempts. Finally, applying SDPO to individual questions at test time accelerates discovery on difficult binary-reward tasks, achieving the same discovery probability as best-of-k sampling or multi-turn conversations with 3x fewer attempts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SDPO turns rich feedback into a dense signal via self-distillation of the model's own next-token predictions, with reported efficiency gains on coding and reasoning tasks, though the self-teacher setup leaves open the risk of reinforcing errors.


The main point is that this paper introduces Self-Distillation Policy Optimization to address the sparse credit assignment problem in RLVR for LLMs. Instead of learning only from scalar success/failure, SDPO conditions the current model on tokenized feedback and distills its next-token predictions back into the policy. This uses the model's in-context ability to identify mistakes without needing an external teacher or reward model. They also show it can repurpose successful rollouts as implicit feedback in pure scalar settings and apply the same idea at test time to reach solutions faster than best-of-k sampling.
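A minimal sketch of how the feedback-conditioned context might be assembled, covering both the rich-feedback case and the scalar-only case where a successful rollout stands in as implicit feedback. The `Rollout` type, the prompt template, and the rule of taking the first successful sibling are hypothetical choices made for illustration; the summary above does not specify them.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Rollout:
    attempt: str
    reward: float                   # 1.0 = verified success, 0.0 = failure
    feedback: Optional[str] = None  # e.g. a runtime error; None in scalar-only environments

def select_feedback(failed: Rollout, group: List[Rollout]) -> Optional[str]:
    """Use environment feedback if present, else a successful rollout for the same question."""
    if failed.feedback:
        return failed.feedback
    for rollout in group:
        if rollout.reward > 0:
            return "A correct solution to the same question:\n" + rollout.attempt
    return None  # no feedback available: nothing to distill from

def teacher_context(question: str, failed: Rollout, group: List[Rollout]) -> Optional[str]:
    """Build the prefix under which the same model re-scores the failed attempt's tokens."""
    feedback = select_feedback(failed, group)
    if feedback is None:
        return None
    # Hypothetical template; the failed attempt's tokens would be scored after this prefix.
    return f"Question:\n{question}\n\nFeedback on a previous attempt:\n{feedback}\n\nRevised answer:\n"

# Scalar-only environment: one failure and one success for the same question.
group = [Rollout("return xs[len(xs)]", 0.0), Rollout("return xs[-1]", 1.0)]
context = teacher_context("Return the last element of xs.", group[0], group)
```

When every rollout in the group fails and the environment returns no text, this sketch yields no teacher context at all, so such a group would contribute no distillation signal.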

Referee Report

3 major / 3 minor

Summary. The paper introduces Self-Distillation Policy Optimization (SDPO) as a method for reinforcement learning with rich feedback (RLRF) in LLMs. SDPO treats the current policy conditioned on tokenized feedback (or successful rollouts as implicit feedback) as a self-teacher, distilling its next-token predictions into a dense learning signal for policy updates without external teachers or reward models. It reports improved sample efficiency and final accuracy over strong RLVR baselines across scientific reasoning, tool use, and competitive programming on LiveCodeBench v6; extensions to scalar-reward RLVR environments; and test-time acceleration on binary-reward tasks achieving equivalent discovery probability to best-of-k or multi-turn methods with 3x fewer attempts.

Significance. If the central claims hold under rigorous validation, SDPO would provide a practical way to convert rich textual feedback (e.g., runtime errors) into usable dense signals in verifiable domains, addressing the credit-assignment bottleneck of scalar RLVR. The repurposing of successful rollouts for failed attempts and the test-time application are particularly interesting extensions that could reduce reliance on external supervision.

major comments (3)

[Abstract] Abstract and experimental claims: the reported gains in sample efficiency and accuracy over RLVR baselines are presented without details on statistical significance tests, exact baseline implementations, hyperparameter sweeps, or controls for post-hoc selection, which is load-bearing for the central empirical claim that SDPO consistently outperforms strong baselines.
[Method] Method description (self-distillation step): the procedure conditions the current model on tokenized feedback and distills its next-token predictions back into the policy, but provides no external validation, oracle comparison, or bias-correction term to break potential reinforcement of systematic in-context reasoning failures; this assumption is load-bearing for both the rich-feedback gains and the scalar-feedback repurposing result.
[Experiments] Scalar-feedback extension: the claim that SDPO outperforms baselines in standard RLVR environments by using successful rollouts as implicit feedback for failed attempts lacks a concrete derivation or ablation showing that this does not simply reduce to standard RLVR with additional data filtering.

minor comments (3)

[Method] Notation for the self-teacher distribution and the distillation loss should be defined more explicitly with equations to allow reproduction.
[Experiments] LiveCodeBench v6 results would benefit from per-task breakdowns or error analysis to clarify where the sample-efficiency gains occur.
[Experiments] The test-time application section should clarify whether the 3x fewer attempts comparison controls for total compute or token budget.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which have helped us strengthen the manuscript. We have revised the paper to include additional experimental details, clarifications on the method, and new ablations for the scalar-feedback extension. Our point-by-point responses follow.


Referee: [Abstract] Abstract and experimental claims: the reported gains in sample efficiency and accuracy over RLVR baselines are presented without details on statistical significance tests, exact baseline implementations, hyperparameter sweeps, or controls for post-hoc selection, which is load-bearing for the central empirical claim that SDPO consistently outperforms strong baselines.

Authors: We agree that these details are essential for validating the empirical claims. In the revised manuscript, we have added statistical significance testing (bootstrap confidence intervals and paired tests across seeds) for all reported improvements. We now provide exact baseline code references, full hyperparameter sweep ranges and selection criteria (chosen via validation performance prior to test evaluation), and explicit controls against post-hoc selection by reporting all runs from a fixed hyperparameter set. These additions directly address the load-bearing concerns.
Revision: yes

Referee: [Method] Method description (self-distillation step): the procedure conditions the current model on tokenized feedback and distills its next-token predictions back into the policy, but provides no external validation, oracle comparison, or bias-correction term to break potential reinforcement of systematic in-context reasoning failures; this assumption is load-bearing for both the rich-feedback gains and the scalar-feedback repurposing result.

Authors: The core assumption is that the model can use in-context feedback to identify its own mistakes, which we validate empirically through performance gains. We have expanded the method section with an oracle comparison (where available in synthetic settings) and an ablation measuring distillation quality against ground-truth next tokens. A general bias-correction term is not introduced because it would require strong assumptions on error distributions not present in the verifiable domains; we instead discuss this as a limitation and show via ablations that systematic failures are not reinforced in practice. The description is now more precise.
Revision: partial

Referee: [Experiments] Scalar-feedback extension: the claim that SDPO outperforms baselines in standard RLVR environments by using successful rollouts as implicit feedback for failed attempts lacks a concrete derivation or ablation showing that this does not simply reduce to standard RLVR with additional data filtering.

Authors: We have added a formal derivation in the revised paper showing that SDPO's distillation transfers next-token distributions from successful contexts to failed attempts, creating a dense alignment signal distinct from filtering (which only augments data volume without cross-context transfer). We include a new ablation comparing SDPO against RLVR trained on filtered successful rollouts plus original data; SDPO retains statistically significant gains, confirming the benefit is not reducible to filtering alone.
Revision: yes
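
The first response mentions bootstrap confidence intervals and paired comparisons across seeds. The sketch below shows one common way to do this, a percentile bootstrap over per-seed differences; it is an assumption about what such a test could look like, not the authors' procedure, and the accuracy values are placeholders rather than results from the paper.

```python
import random

def paired_bootstrap_ci(method_scores, baseline_scores, n_resamples=10000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-seed difference (method minus baseline)."""
    if len(method_scores) != len(baseline_scores):
        raise ValueError("scores must be paired by seed")
    rng = random.Random(seed)
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample seeds with replacement, keeping each seed's pairing intact.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return sum(diffs) / n, means[lo_idx], means[hi_idx]

# Placeholder accuracies for five seeds (illustrative only).
method = [0.52, 0.55, 0.50, 0.57, 0.53]
baseline = [0.47, 0.49, 0.48, 0.50, 0.46]
mean_diff, ci_low, ci_high = paired_bootstrap_ci(method, baseline)
```

An interval that excludes zero indicates that the improvement is consistent across the resampled seeds; with only a handful of seeds the interval is coarse, so it complements per-seed reporting rather than replacing it.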

Circularity Check

0 steps flagged

No significant circularity: SDPO is an algorithmic definition, not a tautological derivation


The paper defines SDPO explicitly as distilling the current model's own feedback-conditioned next-token predictions back into the policy. This is a design choice for the optimization procedure rather than a claimed first-principles derivation or prediction that reduces to its inputs by construction. No equations are presented whose outputs are forced to match inputs (e.g., no fitted parameter renamed as a prediction, no uniqueness theorem imported from self-citation, no ansatz smuggled via prior work). Empirical gains are asserted over RLVR baselines on LiveCodeBench and other tasks, but these rest on experimental comparison rather than algebraic identity. The self-teacher aspect is load-bearing for the method's motivation but does not create a self-definitional loop in any reported result or proof. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests primarily on the domain assumption that the model can usefully identify and correct its own mistakes when given feedback; no free parameters or new invented entities are introduced in the abstract description.

axioms (1)

Domain assumption: The current model, when conditioned on feedback, produces next-token predictions that form a useful dense learning signal for the policy.
This is the core premise enabling self-distillation without an external teacher.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.LawOfExistence · defect_zero_iff_one; nothing_cannot_exist · echoes

echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Paper passage: "SDPO treats the current model conditioned on feedback as a self-teacher and distills its feedback-informed next-token predictions back into the policy... leverages the model’s ability to retrospectively identify its own mistakes in-context."

IndisputableMonolith.Foundation.DiscretenessForcing · continuous_no_isolated_zero_defect · echoes

echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Paper passage: "SDPO also outperforms baselines in standard RLVR environments that only return scalar feedback by using successful rollouts as implicit feedback for failed attempts."

IndisputableMonolith.Foundation.InevitabilityStructure · inevitability · unclear

unclear: Relation between the paper passage and the cited Recognition theorem.

Paper passage: "The SDPO gradient is a (negated) logit-level policy gradient where the advantages are estimated using the self-teacher."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 55 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
cs.LG 2026-04 unverdicted novelty 8.0

Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.
From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing
cs.CV 2026-05 unverdicted novelty 7.0

A planner-orchestrator system learns long-horizon image editing by maximizing outcome-based rewards from a vision-language judge and refining plans from successful trajectories.
Multi-Rollout On-Policy Distillation via Peer Successes and Failures
cs.LG 2026-05 unverdicted novelty 7.0

MOPD improves on-policy distillation for LLMs by using peer successes for positive patterns and failures for negative examples to create more informative teacher signals.
Learning, Fast and Slow: Towards LLMs That Adapt Continually
cs.LG 2026-05 unverdicted novelty 7.0

Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard...
Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models
cs.CL 2026-05 unverdicted novelty 7.0

dGRPO merges outcome-based policy optimization with dense teacher guidance from on-policy distillation, yielding more stable long-context reasoning on the new LongBlocks synthetic dataset.
Learning Agentic Policy from Action Guidance
cs.CL 2026-05 unverdicted novelty 7.0

ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.
From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation
cs.LG 2026-05 conditional novelty 7.0

Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.
Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why
cs.LG 2026-05 unverdicted novelty 7.0

Distillation signals align better with ideal updates on incorrect student rollouts than correct ones, with optimal teacher context depending on student capacity and task.
Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR
cs.LG 2026-05 unverdicted novelty 7.0

RLRT augments GRPO by reinforcing tokens on correct student rollouts that the teacher would not have predicted, outperforming standard self-distillation and exploration baselines on Qwen3 models.
TRACE: Distilling Where It Matters via Token-Routed Self On-Policy Alignment
cs.AI 2026-05 unverdicted novelty 7.0

TRACE improves math reasoning by distilling only on annotator-marked critical spans with forward KL on correct key spans, optional reverse KL on errors, and GRPO elsewhere, gaining 2.76 points over GRPO while preservi...
TAD: Temporal-Aware Trajectory Self-Distillation for Fast and Accurate Diffusion LLM
cs.CL 2026-05 unverdicted novelty 7.0

TAD improves the accuracy-parallelism trade-off in diffusion LLMs via temporal-aware self-distillation that applies hard labels to soon-to-be-decoded tokens and soft supervision to future tokens.
KL for a KL: On-Policy Distillation with Control Variate Baseline
cs.LG 2026-05 unverdicted novelty 7.0

vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensiv...
VISD: Enhancing Video Reasoning via Structured Self-Distillation
cs.CV 2026-05 unverdicted novelty 7.0

VISD improves VideoLLM reasoning performance and training efficiency by combining structured multi-dimensional self-distillation feedback with RL via direction-magnitude decoupling, curriculum scheduling, and EMA stab...
Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization
cs.LG 2026-05 unverdicted novelty 7.0

PBSD derives a reward-reweighted teacher distribution as the analytic optimum of a reward-regularized objective, yielding better stability and performance than KL-based self-distillation on math reasoning and tool-use tasks.
MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate
cs.CL 2026-05 unverdicted novelty 7.0

MAD-OPD recasts on-policy distillation teachers as a debating collective to supply better supervision, lifting agentic and code performance over single-teacher OPD across multiple model sizes.
Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding
cs.AI 2026-05 accept novelty 7.0

GUI-SD introduces on-policy self-distillation with visually enriched privileged context and entropy-guided weighting, outperforming GRPO and naive OPSD on six GUI grounding benchmarks while improving training efficiency.
Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding
cs.AI 2026-05 unverdicted novelty 7.0

GUI-SD is the first on-policy self-distillation framework for GUI grounding that adds privileged bounding-box context and entropy-guided weighting to outperform GRPO methods on six benchmarks in accuracy and efficiency.
Near-Future Policy Optimization
cs.LG 2026-04 unverdicted novelty 7.0

NPO uses a policy's own near-future checkpoint as auxiliary trajectories to maximize effective learning signal S = Q/V, improving performance from 57.88 to 63.15 on Qwen3-VL-8B-Instruct with GRPO while accelerating co...
Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models
cs.CL 2026-04 unverdicted novelty 7.0

OPD for LLMs suffers length inflation and repetition collapse; StableOPD uses reference divergence and rollout mixing to prevent it and improve math reasoning performance by 7.2% on average.
Retrieval Augmented Conversational Recommendation with Reinforcement Learning
cs.IR 2026-04 unverdicted novelty 7.0

RAR retrieves candidate items from a 300k-movie corpus then uses LLM generation with RL feedback to produce context-aware recommendations that outperform baselines on benchmarks.
Self-Distilled RLVR
cs.LG 2026-04 unverdicted novelty 7.0

RLSD mixes self-distillation for token-level policy difference magnitudes with RLVR for reliable update directions from response correctness to reach higher convergence and better training stability.
Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models
cs.LG 2026-01 unverdicted novelty 7.0

A single LLM improves its own reasoning by self-distilling from privileged verified traces as teacher to its question-only student policy, outperforming off-policy distillation and RL on math benchmarks with better to...
Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards
cs.CL 2026-05 unverdicted novelty 6.0

CIPO jointly optimizes standard RLVR rewards with correction samples derived from the model's own failed attempts, yielding better reasoning and self-correction on math and code benchmarks.
Revisiting DAgger in the Era of LLM-Agents
cs.LG 2026-05 conditional novelty 6.0

DAgger-style training with turn-level policy interpolation raises 4B and 8B LLM agents to 27.3% and 29.8% on SWE-bench Verified, beating several larger published systems.
Emergent and Subliminal Misalignment Through the Lens of Data-Mediated Transfer
cs.LG 2026-05 unverdicted novelty 6.0

Emergent and subliminal misalignment in LLMs arise from data structure interactions and transfer via benign distillation data, with stronger effects under shared functional structure and on-policy settings.
Learning with Rare Success but Rich Feedback via Reflection-Enhanced Self-Distillation
cs.LG 2026-05 unverdicted novelty 6.0

RESD turns failure trajectories into token-level supervision via retrospective reflections and a persistent global playbook, enabling faster improvement than standard self-distillation or GRPO with only one rollout pe...
Learning, Fast and Slow: Towards LLMs That Adapt Continually
cs.LG 2026-05 unverdicted novelty 6.0

Fast-Slow Training combines slow parameter updates with fast context optimization to achieve up to 3x better sample efficiency, higher performance, less forgetting, and preserved plasticity in continual LLM learning.
Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information
cs.LG 2026-05 unverdicted novelty 6.0

Anti-Self-Distillation reverses self-distillation signals via PMI to fix overconfidence on structural tokens, matching GRPO baseline accuracy 2-10x faster with up to 11.5 point gains across 4B-30B models.
Selective Off-Policy Reference Tuning with Plan Guidance
cs.AI 2026-05 unverdicted novelty 6.0

SORT turns all-wrong prompts into selective learning signals by weighting tokens more predictable under plan guidance from reference solutions, improving over GRPO on reasoning benchmarks especially for weaker models.
SOD: Step-wise On-policy Distillation for Small Language Model Agents
cs.CL 2026-05 unverdicted novelty 6.0

SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.
Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning
cs.AI 2026-05 unverdicted novelty 6.0

Skill1 trains one policy to jointly evolve skill query generation, re-ranking, task solving, and distillation from a single task-success signal, with low-frequency trends crediting selection and high-frequency variati...
Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe
cs.LG 2026-05 unverdicted novelty 6.0

Uni-OPD unifies on-policy distillation across LLMs and MLLMs with dual-perspective strategies that promote student exploration and enforce order-consistent teacher supervision based on outcome rewards.
Multilingual Safety Alignment via Self-Distillation
cs.LG 2026-05 unverdicted novelty 6.0

MSD enables cross-lingual safety transfer in LLMs via self-distillation with Dual-Perspective Safety Weighting, improving safety in low-resource languages without target response data.
Co-Evolving Policy Distillation
cs.LG 2026-04 unverdicted novelty 6.0

CoPD integrates multiple expert capabilities by running parallel RLVR training with bidirectional online policy distillation among experts, outperforming mixed RLVR and sequential OPD while surpassing domain-specific ...
Process Supervision of Confidence Margin for Calibrated LLM Reasoning
cs.LG 2026-04 unverdicted novelty 6.0

RLCM trains LLMs with a margin-enhanced process reward that widens the gap between correct and incorrect reasoning steps, improving calibration on math, code, logic, and science tasks without hurting accuracy.
OPSDL: On-Policy Self-Distillation for Long-Context Language Models
cs.CL 2026-04 unverdicted novelty 6.0

OPSDL improves long-context LLM performance by having the model self-distill from its short-context capability using point-wise reverse KL divergence on generated tokens, outperforming SFT and DPO on benchmarks withou...
AtManRL: Towards Faithful Reasoning via Differentiable Attention Saliency
cs.CL 2026-04 unverdicted novelty 6.0

AtManRL learns an additive attention mask on CoT traces to produce a saliency reward that, when combined with outcome rewards in GRPO, trains LLMs to generate reasoning that genuinely influences final predictions.
π-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data
cs.LG 2026-04 unverdicted novelty 6.0

π-Play uses self-generated question construction paths as privileged information in multi-agent self-distillation to convert sparse-reward self-play into a dense-feedback loop, surpassing supervised search agents and ...
Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe
cs.LG 2026-04 unverdicted novelty 6.0

On-policy distillation works when student and teacher models share thinking patterns and the teacher adds new capabilities, with success tied to alignment on a small set of high-probability tokens.
Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
cs.LG 2026-04 unverdicted novelty 6.0

Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between SFT and distillation phases.
Skill-SD: Skill-Conditioned Self-Distillation for Multi-turn LLM Agents
cs.LG 2026-04 unverdicted novelty 6.0

Skill-SD turns an agent's completed trajectories into dynamic natural-language skills that condition only the teacher in self-distillation, yielding 14-42% gains over RL and OPSD baselines on multi-turn agent benchmarks.
Self-Distilled Reinforcement Learning for Co-Evolving Agentic Recommender Systems
cs.IR 2026-04 unverdicted novelty 6.0

CoARS enables co-evolving recommender and user agents by using interaction-derived rewards and self-distilled credit assignment to internalize multi-turn feedback into model parameters, outperforming prior agentic baselines.
PolicyLong: Towards On-Policy Context Extension
cs.LG 2026-04 unverdicted novelty 6.0

PolicyLong shifts long-context data synthesis to an on-policy loop that re-screens contexts using the evolving model's entropy landscape, producing a self-curriculum that outperforms static offline baselines with larg...
The Master Key Hypothesis: Unlocking Cross-Model Capability Transfer via Linear Subspace Alignment
cs.LG 2026-04 unverdicted novelty 6.0

The Master Key Hypothesis states that capabilities are low-dimensional directions transferable across models through linear subspace alignment, with UNLOCK demonstrating gains such as 12.1% accuracy improvement on MAT...
Selective Off-Policy Reference Tuning with Plan Guidance
cs.AI 2026-05 unverdicted novelty 5.0

SORT converts all-failed reasoning prompts into selective, structure-aware training signals by weighting tokens according to how much a reference-derived plan increases their probability.
On-Policy Distillation with Best-of-N Teacher Rollout Selection
cs.CV 2026-05 unverdicted novelty 5.0

BRTS improves on-policy distillation by sampling multiple teacher rollouts and selecting the best one via a correctness-first then alignment priority rule, yielding gains on AIME and AMC math benchmarks.
On-Policy Distillation with Best-of-N Teacher Rollout Selection
cs.CV 2026-05 unverdicted novelty 5.0

BRTS improves on-policy distillation by selecting the highest-quality teacher trajectory from a small pool of samples based on correctness and alignment with the student, yielding gains on AIME and AMC math benchmarks.
UniSD: Towards a Unified Self-Distillation Framework for Large Language Models
cs.CL 2026-05 unverdicted novelty 5.0

UniSD unifies complementary self-distillation mechanisms for autoregressive LLMs and achieves up to +5.4 point gains over base models and +2.8 over baselines across six benchmarks and six models.
Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning
cs.AI 2026-05 unverdicted novelty 5.0

Skill1 co-evolves skill selection, utilization, and distillation inside a single policy using only task-outcome reward, with low-frequency trends crediting selection and high-frequency variation crediting distillation...
Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning
cs.AI 2026-05 unverdicted novelty 5.0

Skill1 trains a single RL policy to co-evolve skill selection, utilization, and distillation in language model agents from one task-outcome reward, using low-frequency trends to credit selection and high-frequency var...
VISD: Enhancing Video Reasoning via Structured Self-Distillation
cs.CV 2026-05 unverdicted novelty 5.0

VISD improves VideoLLM reasoning by adding multi-dimensional diagnostic self-distillation and RL decoupling, yielding higher accuracy, better grounding, and nearly 2x faster training convergence.
VISD: Enhancing Video Reasoning via Structured Self-Distillation
cs.CV 2026-05 unverdicted novelty 5.0

VISD adds structured privileged feedback from a judge model and a direction-magnitude decoupling trick to let VideoLLMs learn token-level credit assignment while keeping RL stable, yielding higher accuracy and roughly...
Multilingual Safety Alignment via Self-Distillation
cs.LG 2026-05 unverdicted novelty 5.0

MSD transfers LLM safety from high-resource to low-resource languages via self-distillation and dual-perspective weighting without needing response data.
OneSearch-V2: The Latent Reasoning Enhanced Self-distillation Generative Search Framework
cs.IR 2026-03 unverdicted novelty 4.0

OneSearch-V2 improves generative retrieval via latent reasoning and self-distillation, achieving +3.98% item CTR, +2.07% buyer volume, and +2.11% order volume in online A/B tests.
A Survey of On-Policy Distillation for Large Language Models
cs.LG 2026-04 unverdicted novelty 2.0

On-policy distillation reframes LLM knowledge transfer as iterative correction on student trajectories rather than single-pass imitation, with the survey organizing the field along divergence design, feedback sources,...