Rubric-based on-policy distillation allows training student models using only teacher responses by generating scoring rubrics from contrasts and using them for on-policy optimization, achieving superior performance and up to 10x better sample efficiency than logit-based approaches.
Black-box on-policy distillation of large language models. arXiv preprint arXiv:2511.10643.
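A minimal sketch of the loop the summary above describes: rubric items are scored against student rollouts and the scores drive a REINFORCE-style on-policy update. The RubricItem, rubric_score, and advantage_weights names, the hand-written checks, and the mean-baseline weighting are illustrative assumptions, not the paper's interface; rubric generation from contrasts is stubbed out as fixed criteria.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RubricItem:
    criterion: str                  # e.g. "uses the quadratic formula"
    check: Callable[[str], bool]    # in practice derived from contrasting responses
    weight: float = 1.0

def rubric_score(rubric: list[RubricItem], response: str) -> float:
    """Weighted fraction of rubric criteria the response satisfies."""
    total = sum(item.weight for item in rubric)
    hit = sum(item.weight for item in rubric if item.check(response))
    return hit / total if total else 0.0

def advantage_weights(rewards: list[float]) -> list[float]:
    """Mean-baseline advantages used to weight the on-policy (REINFORCE-style) update."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

# Toy usage: hand-written checks stand in for rubric criteria generated from contrasts.
rubric = [
    RubricItem("mentions the quadratic formula", lambda t: "quadratic formula" in t),
    RubricItem("reports a numeric final answer", lambda t: any(c.isdigit() for c in t)),
]
rollouts = ["By the quadratic formula, x = 2.", "I am not sure how to solve this."]
rewards = [rubric_score(rubric, r) for r in rollouts]
print(advantage_weights(rewards))   # the rubric-satisfying rollout gets the positive weight
```

In a full run these advantages would weight the student's log-probability gradients for each rollout; nothing here depends on teacher logits, which is what makes the setup black-box.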
8 Pith papers cite this work; the 8 representative citing papers listed below are from 2026.
Citing papers explorer
- Rubric-based On-policy Distillation
  Rubric-based on-policy distillation allows training student models using only teacher responses by generating scoring rubrics from contrasts and using them for on-policy optimization, achieving superior performance and up to 10x better sample efficiency than logit-based approaches.
- Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models
  OPD for LLMs suffers length inflation and repetition collapse; StableOPD uses reference divergence and rollout mixing to prevent it and improve math reasoning performance by 7.2% on average (sketch after this list).
- Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation
  Local teachability collapse in trajectory suffixes makes uniform dense supervision suboptimal in strong-to-weak OPD; truncating supervision at BIC-style change points in the teacher margin improves performance (sketch after this list).
- SOD: Step-wise On-policy Distillation for Small Language Model Agents
  SOD reweights on-policy distillation strength step by step using divergence to stabilize tool use in small language model agents, yielding gains of up to 20.86%, with a 0.6B model reaching 26.13% on AIME 2025 (sketch after this list).
- SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation
  SimCT recovers discarded teacher signal in cross-tokenizer on-policy distillation by enlarging supervision to jointly realizable multi-token continuations, yielding consistent gains on math reasoning and code generation tasks (sketch after this list).
- Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
  Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between the SFT and distillation phases (sketch after this list).
- Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training
  Sparse RL on a strong teacher followed by dense distillation to the student outperforms direct GRPO on the student for math tasks, with a forward-KL + OPD bridge enabling further gains (sketch after this list).
- Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning
  LLM post-training is unified as off-policy or on-policy interventions that expand support for useful behaviors, reshape policies within reachable states, or consolidate behavior across training stages.
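Method sketches (illustrative; assumptions, not taken from the papers)

For the StableOPD entry, a minimal sketch of the two stabilizers its summary names: mixing a fraction of reference-policy rollouts into the on-policy batch, and penalizing divergence from a frozen reference. The mixing fraction, KL directions, and beta weight are assumptions.

```python
import math
import random

def kl(p, q, eps=1e-8):
    """KL(p || q) between two discrete next-token distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def mix_rollouts(student_rollouts, reference_rollouts, ref_fraction=0.25):
    """Replace a fraction of the on-policy batch with reference-policy rollouts."""
    n_ref = int(len(student_rollouts) * ref_fraction)
    return student_rollouts[n_ref:] + random.sample(reference_rollouts, n_ref)

def stabilized_token_loss(p_teacher, p_student, p_reference, beta=0.1):
    """Distillation term plus a penalty on drifting away from the frozen reference."""
    return kl(p_teacher, p_student) + beta * kl(p_student, p_reference)

# Toy usage with 3-way next-token distributions for a single position.
print(stabilized_token_loss([0.7, 0.2, 0.1], [0.5, 0.3, 0.2], [0.6, 0.3, 0.1]))
print(mix_rollouts(["s1", "s2", "s3", "s4"], ["r1", "r2"]))
```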
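For the local-teachability-collapse entry, a sketch of cutting dense supervision at a BIC-style change point in the per-token teacher margin. The single mean-shift Gaussian segment model, the parameter counts in the BIC penalty, and the toy margins are assumptions.

```python
import math

def _segment_nll(xs):
    """Gaussian negative log-likelihood of one segment with fitted mean and variance."""
    n = len(xs)
    mean = sum(xs) / n
    var = max(sum((x - mean) ** 2 for x in xs) / n, 1e-8)
    return 0.5 * n * (math.log(2 * math.pi * var) + 1)

def bic_change_point(margins):
    """Index where the teacher margin collapses; len(margins) if one segment fits best."""
    n = len(margins)
    best_idx = n
    best_bic = 2 * _segment_nll(margins) + 2 * math.log(n)        # one segment, 2 params
    for k in range(2, n - 1):                                     # both segments get >= 2 points
        bic = 2 * (_segment_nll(margins[:k]) + _segment_nll(margins[k:])) + 5 * math.log(n)
        if bic < best_bic:
            best_idx, best_bic = k, bic
    return best_idx

# Toy per-token teacher margins: teachable prefix, collapsed suffix.
margins = [2.1, 1.9, 2.3, 2.0, 0.2, 0.1, 0.15, 0.05]
cut = bic_change_point(margins)
print(f"apply dense supervision to tokens [0, {cut}) only")       # cut == 4 here
```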
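For the SOD entry, a sketch of step-wise reweighting of the distillation signal by per-step student-teacher divergence. The exponential down-weighting schedule and the temperature tau are assumptions; the paper's actual reweighting rule may differ in form or direction.

```python
import math

def step_weights(divergences, tau=1.0):
    """Softly down-weight steps whose student-teacher divergence is large."""
    return [math.exp(-d / tau) for d in divergences]

def reweighted_distill_loss(divergences, tau=1.0):
    """Combine per-step divergences (e.g. per-step KL(student || teacher),
    computed elsewhere) using divergence-dependent weights."""
    w = step_weights(divergences, tau)
    return sum(wi * di for wi, di in zip(w, divergences)) / sum(w)

# Toy agent trajectory: a few tool-use steps, one of them far off-teacher.
per_step_kl = [0.10, 0.25, 3.00, 0.15]
print(step_weights(per_step_kl))
print(reweighted_distill_loss(per_step_kl))
```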
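For the SimCT entry, a sketch of one way to enlarge supervision to jointly realizable multi-token continuations: greedily align teacher and student token streams at shared string boundaries and supervise each aligned chunk as a unit. The greedy alignment and the toy tokenizations are assumptions.

```python
def aligned_chunks(teacher_tokens, student_tokens):
    """Pair up teacher and student token runs that spell the same substring."""
    chunks, ti, si = [], 0, 0
    t_buf = s_buf = ""
    t_chunk, s_chunk = [], []
    while ti < len(teacher_tokens) or si < len(student_tokens):
        if len(t_buf) <= len(s_buf) and ti < len(teacher_tokens):
            t_chunk.append(teacher_tokens[ti])
            t_buf += teacher_tokens[ti]
            ti += 1
        elif si < len(student_tokens):
            s_chunk.append(student_tokens[si])
            s_buf += student_tokens[si]
            si += 1
        else:
            break  # the two streams disagree on the underlying string; stop
        if t_buf == s_buf and t_buf:
            chunks.append((t_chunk, s_chunk))   # a jointly realizable continuation
            t_buf = s_buf = ""
            t_chunk, s_chunk = [], []
    return chunks

# Toy tokenizations of the same string under two different tokenizers.
teacher = ["un", "believ", "able", " results"]
student = ["unbe", "lievable", " result", "s"]
print(aligned_chunks(teacher, student))
# [(['un', 'believ', 'able'], ['unbe', 'lievable']), ([' results'], [' result', 's'])]
```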
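For the Lightning OPD entry, a sketch of an offline pipeline in which teacher responses and per-token teacher distributions are cached once and reused by both the SFT phase and the distillation phase, keeping the teacher signal consistent across phases. The cache format, the two-phase split, and the caller-supplied query_teacher, sft_step, and distill_step callables are assumptions about how teacher consistency might be enforced.

```python
from dataclasses import dataclass

@dataclass
class CachedExample:
    prompt: str
    teacher_text: str                # teacher response, used as the SFT target
    teacher_token_dists: list[dict]  # per-token distributions, used as the distillation target

def build_cache(prompts, query_teacher):
    """One offline pass over the teacher; both training phases read from this cache."""
    return [CachedExample(p, *query_teacher(p)) for p in prompts]

def train(cache, sft_step, distill_step, sft_epochs=1, distill_epochs=1):
    """Phase 1: SFT on cached teacher text. Phase 2: distill on the same cached distributions."""
    for _ in range(sft_epochs):
        for ex in cache:
            sft_step(ex.prompt, ex.teacher_text)
    for _ in range(distill_epochs):
        for ex in cache:
            distill_step(ex.prompt, ex.teacher_token_dists)
```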
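For the sparse-to-dense entry, a sketch of the forward-KL + OPD bridge its summary mentions: a forward-KL term on teacher-generated tokens combined with a reverse-KL (on-policy distillation) term on student-generated tokens. The equal default weighting alpha is an assumption.

```python
import math

def kl(p, q, eps=1e-8):
    """KL(p || q) between discrete next-token distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def bridge_loss(teacher_on_teacher, student_on_teacher,
                teacher_on_student, student_on_student, alpha=0.5):
    """alpha * forward KL on teacher-generated tokens
       + (1 - alpha) * reverse KL (on-policy distillation) on student-generated tokens."""
    forward = sum(kl(t, s) for t, s in zip(teacher_on_teacher, student_on_teacher))
    reverse = sum(kl(s, t) for s, t in zip(student_on_student, teacher_on_student))
    return alpha * forward + (1 - alpha) * reverse

# Toy usage: one token position from a teacher rollout and one from a student rollout.
print(bridge_loss([[0.8, 0.1, 0.1]], [[0.5, 0.3, 0.2]],
                  [[0.6, 0.3, 0.1]], [[0.4, 0.4, 0.2]]))
```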