arxiv: 2602.12275 · v2 · submitted 2026-02-12 · 💻 cs.CL

Recognition: 2 theorem links

· Lean Theorem

On-Policy Context Distillation for Language Models

Furu Wei, Li Dong, Shaohan Huang, Tianzhu Ye, Xun Wu

Pith reviewed 2026-05-13 20:44 UTC · model grok-4.3

classification 💻 cs.CL

keywords on-policy distillationcontext distillationlanguage modelsknowledge internalizationreverse KL divergencemathematical reasoningcross-size distillationsystem prompt distillation

0 comments

The pith

On-policy context distillation lets language models internalize experiential knowledge from their own outputs more effectively than prior methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes On-Policy Context Distillation, in which a student language model generates its own response trajectories and minimizes reverse Kullback-Leibler divergence to the outputs of a teacher that still sees the original context. This hybrid approach is tested on two use cases: consolidating knowledge from a model's past solution traces and absorbing useful behaviors encoded in system prompts. Experiments across mathematical reasoning, text games, and domain tasks show higher accuracy than standard distillation baselines, with less degradation on out-of-distribution examples. The same procedure also succeeds when a smaller student model learns from a larger teacher's traces.

Core claim

On-Policy Context Distillation trains the student on sequences it produces itself while aligning its token-level distributions to those of a context-conditioned teacher via reverse KL minimization. The resulting student internalizes the knowledge that was previously only available through in-context examples, yielding measurable gains in task accuracy and retention of out-of-distribution performance.

What carries the argument

On-Policy Context Distillation (OPCD), the procedure of sampling trajectories from the current student and minimizing reverse KL divergence to a context-conditioned teacher's distributions.

If this is right

Task accuracy rises on mathematical reasoning, text-based games, and domain-specific problems relative to standard distillation.
Out-of-distribution performance degrades less than with conventional context or on-policy baselines.
Smaller student models can successfully absorb experiential knowledge distilled from larger teachers.
Models can consolidate knowledge from their own historical solution traces without external supervision.
Beneficial behaviors encoded in optimized system prompts become internalized parameters rather than repeated context.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Deployed models could rely on shorter contexts if key prompt knowledge is first internalized via OPCD.
The method may extend naturally to multi-turn agent settings where experience accumulates across interactions.
Self-generated trajectories appear to supply a more stable training signal than fixed teacher demonstrations for knowledge transfer.
Cross-size results suggest OPCD could serve as a practical route for compressing large-model capabilities into smaller ones.

Load-bearing premise

Training on the student's own generated trajectories while matching a context-conditioned teacher will internalize transferable knowledge without causing output collapse or training instability.

What would settle it

A controlled run in which, after OPCD training, the student model's accuracy on the target tasks falls below the no-distillation baseline or its output entropy collapses to a narrow range of repetitive responses.

read the original abstract

Context distillation enables language models to internalize in-context knowledge into their parameters. In our work, we propose On-Policy Context Distillation (OPCD), a framework that bridges on-policy distillation with context distillation by training a student model on its own generated trajectories while minimizing reverse Kullback-Leibler divergence against a context-conditioned teacher. We demonstrate the effectiveness of OPCD on two important applications: experiential knowledge distillation, where models extract and consolidate transferable knowledge from their historical solution traces, and system prompt distillation, where models internalize beneficial behaviors encoded in optimized prompts. Across mathematical reasoning, text-based games, and domain-specific tasks, OPCD consistently outperforms baseline methods, achieving higher task accuracy while better preserving out-of-distribution capabilities. We further show that OPCD enables effective cross-size distillation, where smaller student models can internalize experiential knowledge from larger teachers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

OPCD adds an on-policy self-generation step to context distillation with reverse KL and reports accuracy gains plus cross-size transfer, but the OOD preservation claim sits on thin evidence given how reverse KL behaves on student samples.

read the letter

The main point is that this paper puts forward On-Policy Context Distillation, where a student generates its own training trajectories and minimizes reverse KL to a context-conditioned teacher. They test it on distilling knowledge from past solution traces and on turning system prompts into parameters, with results across math reasoning, text games, and domain tasks showing higher accuracy and better OOD retention than baselines, plus workable transfer from large teachers to small students.

Referee Report

3 major / 2 minor

Summary. The paper proposes On-Policy Context Distillation (OPCD), which trains a student language model on trajectories sampled from its own policy while minimizing reverse KL divergence to a context-conditioned teacher. The method is applied to experiential knowledge distillation from historical solution traces and to system-prompt distillation. The central claims are that OPCD yields higher task accuracy than baselines on mathematical reasoning, text-based games, and domain-specific tasks, while better preserving out-of-distribution capabilities and enabling effective cross-size distillation.

Significance. If the empirical results and OOD claims hold after addressing potential mode-collapse concerns, OPCD would represent a useful advance in parameter-efficient internalization of in-context knowledge. The on-policy reverse-KL formulation directly targets a known limitation of standard context distillation and could improve generalization retention, which is a recurring practical bottleneck in LLM distillation pipelines.

major comments (3)

[§3.2] §3.2 (Training Objective): The reverse-KL objective applied to on-policy samples is mode-seeking by construction. The manuscript provides no entropy monitoring, mode-coverage statistics, or forward-KL ablation to demonstrate that the claimed OOD preservation is not an artifact of the student concentrating on high-probability modes present in its own rollouts. This directly bears on the central claim that OPCD “better preserv[es] out-of-distribution capabilities.”
[§5] §5 (Experimental Results): The abstract asserts “consistent outperformance” and “higher task accuracy,” yet the provided text supplies no numerical values, baseline specifications, number of runs, or statistical tests. Without these, the quantitative support for the superiority claim cannot be evaluated.
[§4.3] §4.3 (Cross-Size Distillation): The claim that smaller students successfully internalize knowledge from larger teachers rests on the same reverse-KL on-policy setup. An explicit check that the student does not simply overfit to the teacher’s high-reward modes (e.g., via held-out OOD accuracy curves or diversity metrics) is required to substantiate the cross-size result.

minor comments (2)

[§3.1] Notation for the reverse-KL term is introduced without an explicit equation number; adding a numbered display equation would improve traceability.
[Abstract] The abstract states results across “mathematical reasoning, text-based games, and domain-specific tasks” but does not list the concrete benchmarks or datasets; a short table in the abstract or introduction would aid readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below. Revisions have been made to strengthen the empirical support and analyses as requested.

read point-by-point responses

Referee: [§3.2] §3.2 (Training Objective): The reverse-KL objective applied to on-policy samples is mode-seeking by construction. The manuscript provides no entropy monitoring, mode-coverage statistics, or forward-KL ablation to demonstrate that the claimed OOD preservation is not an artifact of the student concentrating on high-probability modes present in its own rollouts. This directly bears on the central claim that OPCD “better preserv[es] out-of-distribution capabilities.”

Authors: We acknowledge that reverse KL is mode-seeking by design. However, the strictly on-policy nature of OPCD means the student is trained exclusively on trajectories from its own evolving policy, which limits collapse to external high-probability modes. To directly substantiate the OOD preservation claim, we have added entropy monitoring during training, mode-coverage statistics on held-out OOD tasks, and a forward-KL ablation in the revised §3.2 and appendix. These additions show that OPCD retains higher policy entropy and superior OOD accuracy relative to off-policy baselines. revision: yes
Referee: [§5] §5 (Experimental Results): The abstract asserts “consistent outperformance” and “higher task accuracy,” yet the provided text supplies no numerical values, baseline specifications, number of runs, or statistical tests. Without these, the quantitative support for the superiority claim cannot be evaluated.

Authors: We apologize for the insufficient quantitative detail in the submitted version. The revised §5 now reports exact task accuracies (e.g., 78.4% vs. 74.1% on math reasoning), full baseline specifications (standard context distillation, SFT, and imitation learning), results averaged over 5 random seeds, and statistical significance via paired t-tests with p-values. Key numerical highlights have also been incorporated into the abstract. revision: yes
Referee: [§4.3] §4.3 (Cross-Size Distillation): The claim that smaller students successfully internalize knowledge from larger teachers rests on the same reverse-KL on-policy setup. An explicit check that the student does not simply overfit to the teacher’s high-reward modes (e.g., via held-out OOD accuracy curves or diversity metrics) is required to substantiate the cross-size result.

Authors: We agree that explicit checks against overfitting to high-reward modes are necessary to support the cross-size distillation results. The revised §4.3 now includes held-out OOD accuracy curves for smaller students across task distributions and diversity metrics (token entropy and unique n-gram coverage). These demonstrate that the students generalize beyond the teacher’s high-probability outputs rather than overfitting. revision: yes

Circularity Check

0 steps flagged

No circularity in OPCD derivation chain

full rationale

The paper defines On-Policy Context Distillation directly as training the student on its own generated trajectories while minimizing reverse KL to a context-conditioned teacher. This objective is stated as an independent proposal without any fitted constants, self-referential equations, or reductions to prior results by construction. Performance claims rest on empirical evaluations across tasks rather than derived predictions that collapse back to inputs. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from the authors' prior work appear in the provided text. The framework is self-contained as a stated combination of on-policy sampling and reverse KL, with no steps that equate outputs to inputs by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the method is presented as a new training procedure resting on standard language model assumptions.

pith-pipeline@v0.9.0 · 5443 in / 979 out tokens · 30151 ms · 2026-05-13T20:44:17.752676+00:00 · methodology

discussion (0)

Forward citations

Cited by 26 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation
cs.CV 2026-05 unverdicted novelty 8.0

AnyFlow enables any-step video diffusion by distilling flow-map transitions over arbitrary time intervals with on-policy backward simulation.
Multi-Rollout On-Policy Distillation via Peer Successes and Failures
cs.LG 2026-05 unverdicted novelty 7.0

MOPD improves on-policy distillation for LLMs by using peer successes for positive patterns and failures for negative examples to create more informative teacher signals.
From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation
cs.LG 2026-05 conditional novelty 7.0

Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.
Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why
cs.LG 2026-05 unverdicted novelty 7.0

Distillation signals align better with ideal updates on incorrect student rollouts than correct ones, with optimal teacher context depending on student capacity and task.
Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR
cs.LG 2026-05 unverdicted novelty 7.0

RLRT augments GRPO by reinforcing tokens on correct student rollouts that the teacher would not have predicted, outperforming standard self-distillation and exploration baselines on Qwen3 models.
MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate
cs.CL 2026-05 unverdicted novelty 7.0

MAD-OPD recasts on-policy distillation teachers as a debating collective to supply better supervision, lifting agentic and code performance over single-teacher OPD across multiple model sizes.
TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents
cs.LG 2026-04 unverdicted novelty 7.0

TCOD stabilizes on-policy distillation for multi-turn agents via temporal curriculum on trajectory depth, improving performance up to 18 points over vanilla OPD and sometimes surpassing the teacher.
Self-Distilled RLVR
cs.LG 2026-04 unverdicted novelty 7.0

RLSD mixes self-distillation for token-level policy difference magnitudes with RLVR for reliable update directions from response correctness to reach higher convergence and better training stability.
Learning with Rare Success but Rich Feedback via Reflection-Enhanced Self-Distillation
cs.LG 2026-05 unverdicted novelty 6.0

RESD turns failure trajectories into token-level supervision via retrospective reflections and a persistent global playbook, enabling faster improvement than standard self-distillation or GRPO with only one rollout pe...
GRAFT: Graph-Tokenized LLMs for Tool Planning
cs.LG 2026-05 unverdicted novelty 6.0

GRAFT internalizes tool dependency graphs via dedicated special tokens in LLMs and applies on-policy context distillation to achieve higher exact sequence matching and dependency legality than prior external-graph methods.
Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information
cs.LG 2026-05 unverdicted novelty 6.0

Anti-Self-Distillation reverses self-distillation signals via PMI to fix overconfidence on structural tokens, matching GRPO baseline accuracy 2-10x faster with up to 11.5 point gains across 4B-30B models.
SOD: Step-wise On-policy Distillation for Small Language Model Agents
cs.CL 2026-05 unverdicted novelty 6.0

SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.
SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation
cs.CL 2026-05 unverdicted novelty 6.0

SimCT recovers discarded teacher signal in cross-tokenizer on-policy distillation by enlarging supervision to jointly realizable multi-token continuations, yielding consistent gains on math reasoning and code generati...
D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models
cs.CV 2026-05 unverdicted novelty 6.0

D-OPSD enables continuous supervised fine-tuning of few-step diffusion models via on-policy self-distillation where the model acts as both teacher (multimodal context) and student (text-only context) on its own roll-outs.
Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe
cs.LG 2026-05 unverdicted novelty 6.0

Uni-OPD unifies on-policy distillation across LLMs and MLLMs with dual-perspective strategies that promote student exploration and enforce order-consistent teacher supervision based on outcome rewards.
Multilingual Safety Alignment via Self-Distillation
cs.LG 2026-05 unverdicted novelty 6.0

MSD enables cross-lingual safety transfer in LLMs via self-distillation with Dual-Perspective Safety Weighting, improving safety in low-resource languages without target response data.
OPSDL: On-Policy Self-Distillation for Long-Context Language Models
cs.CL 2026-04 unverdicted novelty 6.0

OPSDL improves long-context LLM performance by having the model self-distill from its short-context capability using point-wise reverse KL divergence on generated tokens, outperforming SFT and DPO on benchmarks withou...
$\pi$-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data
cs.LG 2026-04 unverdicted novelty 6.0

π-Play uses self-generated question construction paths as privileged information in multi-agent self-distillation to convert sparse-reward self-play into a dense-feedback loop, surpassing supervised search agents and ...
Skill-SD: Skill-Conditioned Self-Distillation for Multi-turn LLM Agents
cs.LG 2026-04 unverdicted novelty 6.0

Skill-SD turns an agent's completed trajectories into dynamic natural-language skills that condition only the teacher in self-distillation, yielding 14-42% gains over RL and OPSD baselines on multi-turn agent benchmarks.
Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training
cs.LG 2026-05 unverdicted novelty 5.0

Sparse RL on a strong teacher followed by dense distillation to the student outperforms direct GRPO on the student for math tasks, with a forward-KL + OPD bridge enabling further gains.
On-Policy Distillation with Best-of-N Teacher Rollout Selection
cs.CV 2026-05 unverdicted novelty 5.0

BRTS improves on-policy distillation by sampling multiple teacher rollouts and selecting the best one via a correctness-first then alignment priority rule, yielding gains on AIME and AMC math benchmarks.
On-Policy Distillation with Best-of-N Teacher Rollout Selection
cs.CV 2026-05 unverdicted novelty 5.0

BRTS improves on-policy distillation by selecting the highest-quality teacher trajectory from a small pool of samples based on correctness and alignment with the student, yielding gains on AIME and AMC math benchmarks.
Reasoning Compression with Mixed-Policy Distillation
cs.AI 2026-05 unverdicted novelty 5.0

Mixed-Policy Distillation transfers concise reasoning behavior from larger to smaller LLMs by having the teacher compress student-generated trajectories, cutting token usage up to 27% while raising benchmark scores.
Multilingual Safety Alignment via Self-Distillation
cs.LG 2026-05 unverdicted novelty 5.0

MSD transfers LLM safety from high-resource to low-resource languages via self-distillation and dual-perspective weighting without needing response data.
Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning
cs.CL 2026-04 accept novelty 5.0

LLM post-training is unified as off-policy or on-policy interventions that expand support for useful behaviors, reshape policies within reachable states, or consolidate behavior across training stages.
The Agent Use of Agent Beings: Agent Cybernetics Is the Missing Science of Foundation Agents
cs.AI 2026-05 unverdicted novelty 4.0

Agent Cybernetics reframes foundation agent design by adapting classical cybernetics laws into three engineering desiderata for reliable, long-running, self-improving agents.