pith. machine review for the scientific record.

arxiv: 2605.11458 · v1 · submitted 2026-05-12 · 💻 cs.AI · cs.CL · cs.LO

Recognition: no theorem link

Adaptive Teacher Exposure for Self-Distillation in LLM Reasoning

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 01:58 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.LO

keywords self-distillation · LLM reasoning · teacher exposure · adaptive policy · on-policy distillation · Beta distribution · math benchmarks

The pith

Adaptive control of how much reference reasoning the teacher sees during self-distillation improves LLM performance on math tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard on-policy self-distillation for LLM reasoning always gives the teacher the complete reference solution, yet experiments show this fixed full exposure is not reliably optimal and increases mismatch as the teacher sees more privileged steps. The paper treats teacher exposure instead as a learnable variable. A lightweight Beta-policy controller, conditioned on compact training statistics, samples a reveal ratio that stays fixed for a short window of student updates. The controller is optimized with a discounted reward that credits each choice by its measured effect on the student's future progress rather than immediate loss. Across Qwen3 models from 1.7B to 8B parameters, this adaptive schedule outperforms fixed-exposure self-distillation and RL baselines on AIME 24, AIME 25, and HMMT 25.

Core claim

Treating teacher exposure as a learnable control variable via a Beta-policy controller conditioned on training-state statistics and optimized by a discounted learning-progress reward produces higher student reasoning accuracy than the conventional choice of always revealing the full reference.

What carries the argument

A Beta-policy controller that samples the fraction of reference reasoning to expose to the teacher for a fixed hold window of student updates and receives a reward based on the student's subsequent improvement.
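The sampling-and-hold loop described above can be sketched in a few lines. This is an illustration of the mechanism, not the paper's implementation: the Beta parameters, the window length, and the helper `reveal_prefix` are invented for the example (in ATESD the parameters come from a learned controller conditioned on training-state statistics).

```python
import random

random.seed(0)

def reveal_prefix(reference_cot, ratio):
    """Expose only the leading `ratio` fraction of the reference reasoning steps."""
    k = max(1, round(ratio * len(reference_cot)))
    return reference_cot[:k]

# The controller maps training-state statistics to Beta parameters; fixed
# constants stand in for that mapping here, purely for illustration.
alpha, beta = 2.0, 5.0
ratio = random.betavariate(alpha, beta)  # sampled reveal ratio in (0, 1)

HOLD_WINDOW = 8  # hypothetical number of student updates per held exposure
reference = [f"step {i}" for i in range(10)]
for _ in range(HOLD_WINDOW):
    teacher_context = reveal_prefix(reference, ratio)  # same exposure all window
```

Holding one sampled exposure fixed for a window, rather than resampling per step, is what makes the downstream reward attributable to a single decision.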

If this is right

  • Full exposure of the reference reasoning is not reliably the best choice for student learning.
  • Mismatch between teacher targets and student competence grows monotonically with the amount of privileged reasoning shown.
  • Optimizing exposure with a future-progress reward addresses the delayed credit assignment problem in on-policy distillation.
  • The adaptive method delivers consistent gains over OPSD and other baselines on AIME 24, AIME 25, and HMMT 25 for models ranging from 1.7B to 8B parameters.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same controller design could be applied to other training-time decisions such as rollout length or data filtering in reasoning pipelines.
  • If the compact training statistics omit key signals, the learned policy may fail to generalize beyond the training distribution of math problems.
  • The delayed-reward formulation might transfer to other credit-assignment settings in LLM post-training where immediate loss is a poor signal.

Load-bearing premise

A lightweight Beta-policy controller optimized via a discounted learning-progress reward on compact training-state statistics will reliably produce exposure decisions that improve long-term student performance without introducing training instability or benchmark-specific overfitting.
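A minimal sketch of this premise's two ingredients, under our own assumptions: a discounted reward that credits a held decision by subsequent student improvement, and a REINFORCE-style nudge of the Beta parameters toward exposures that earned positive progress. The function names, learning rate, and the numerical gradient are illustrative, not the paper's formulation.

```python
import math

def learning_progress_reward(future_scores, baseline, gamma=0.9):
    """Discounted credit for one held exposure decision: the sum of future
    student improvements over the pre-decision baseline, discounted by gamma."""
    return sum(gamma ** t * (s - baseline) for t, s in enumerate(future_scores))

def beta_logpdf(x, a, b):
    """Log-density of Beta(a, b) at x in (0, 1)."""
    return ((a - 1) * math.log(x) + (b - 1) * math.log(1 - x)
            + math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b))

def reinforce_step(a, b, x, reward, lr=0.01, eps=1e-5):
    """Raise the log-probability of an exposure x in proportion to its reward
    (central-difference gradients, for brevity)."""
    ga = (beta_logpdf(x, a + eps, b) - beta_logpdf(x, a - eps, b)) / (2 * eps)
    gb = (beta_logpdf(x, a, b + eps) - beta_logpdf(x, a, b - eps)) / (2 * eps)
    return a + lr * reward * ga, b + lr * reward * gb
```

The discounting is what distinguishes this from scoring the immediate loss change: an exposure that looks neutral now but speeds later improvement still earns credit.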

What would settle it

An ablation in which the adaptive controller is replaced by fixed full exposure or random sampling, with the same models retrained on the identical benchmarks; equal or higher scores under those conditions would refute the claim.

Figures

Figures reproduced from arXiv: 2605.11458 by Huaibin Wang, Tiangang Zhang, Yilun Sun, Zihao Han.

Figure 1
Figure 1. Overview of ATESD. (A) Teacher-side exposure mismatch: on an easy problem (e.g. 2+3) the teacher’s privileged CoT stays within the student’s capability and distillation succeeds; on a hard problem (e.g. a quadratic equation) the full CoT far exceeds the student’s level, producing targets the student cannot absorb. (B) ATESD limits the privileged CoT via a learned exposure α: a Beta-policy controller πϕ sel…
Figure 2
Figure 2. Empirical analysis of teacher exposure on AIME 2024 with Qwen3-1.7B (3 seeds, …
Figure 3
Figure 3. Overview of ATESD. The OPSD backbone samples student continuations from the …
Figure 4
Figure 4. Mechanism ablations for exposure control.
Original abstract

On-policy self-distillation has become a strong recipe for LLM reasoning, where a privileged teacher supervises the student's own rollouts while conditioning on the reference solution. A design choice shared by nearly all such methods, however, has gone unquestioned: the teacher always sees the full reference reasoning. We argue that this default itself is part of the problem and identify a teacher-side exposure mismatch: when the teacher conditions on reasoning far beyond the student's current competence, the resulting token targets become too strong to absorb. A controlled fixed-exposure sweep makes this concrete on two fronts: 1) full exposure is not reliably the best choice, and 2) student-teacher mismatch grows monotonically as the teacher sees more privileged reasoning. This motivates treating teacher exposure not as a fixed hyperparameter but as a learnable training-time control variable. We therefore propose Adaptive Teacher Exposure for Self-Distillation (ATESD). ATESD models the reveal ratio with a lightweight Beta-policy controller conditioned on compact training-state statistics, and uses one sampled exposure for a short hold window of student updates. To make this exposure controller learnable, we optimize it with a discounted learning-progress reward that scores each held decision by its effect on the student's future improvement rather than its immediate loss change, addressing the delayed credit assignment induced by on-policy distillation. Experiments on AIME 24, AIME 25, and HMMT 25 across Qwen3-{1.7B, 4B, 8B} show that ATESD consistently outperforms competitive self-distillation and RL baselines, improving over OPSD by +0.95, +2.05, and +2.33 Average@12 points respectively, and establishing adaptive teacher exposure as an effective new axis for reasoning self-distillation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper argues that full teacher exposure to reference reasoning in on-policy self-distillation for LLMs creates an exposure mismatch that hinders student learning. It proposes ATESD, which replaces fixed exposure with a lightweight Beta-policy controller conditioned on compact training-state statistics; the controller is optimized via a discounted learning-progress reward that evaluates each exposure decision by its effect on future student improvement over a short hold window. Experiments on AIME 24, AIME 25, and HMMT 25 across Qwen3-1.7B/4B/8B models report consistent gains over OPSD and other self-distillation/RL baselines (+0.95 to +2.33 Average@12).

Significance. If the performance gains prove robust, the work identifies a previously unexamined axis—adaptive teacher exposure—in reasoning self-distillation and demonstrates that a simple learnable controller can outperform fixed-exposure and standard RL baselines. The delayed-credit formulation and use of compact state statistics are practical contributions that could generalize beyond the reported math benchmarks.

major comments (2)
  1. [Experiments] Experimental section: the central performance claim (consistent outperformance of OPSD by +0.95/+2.05/+2.33 Average@12 on AIME 24/25 and HMMT 25) is presented as summarized deltas without reported standard deviations, number of independent runs, statistical significance tests, or exact baseline re-implementations and data splits. This leaves the robustness of the gains difficult to evaluate.
  2. [Method] Method (controller optimization): the discounted learning-progress reward directly scores future improvement on the same AIME/HMMT evaluation distribution used for final reporting. No cross-benchmark transfer experiments, hold-out validation of the controller, or ablation isolating the reward from benchmark-specific difficulty curves are described, raising the possibility that reported gains reflect reduced mismatch on these particular rollouts rather than a general exposure principle.
minor comments (2)
  1. [Abstract] The abstract and introduction refer to “competitive self-distillation and RL baselines” without naming the full set of comparators (e.g., specific RL variants or prior self-distillation methods); an explicit list would improve clarity.
  2. [Method] Notation for the Beta-policy parameters and the exact form of the compact training-state statistics could be formalized in a single equation or table for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on experimental reporting and the generality of the proposed controller. We address each major comment below and have updated the manuscript to incorporate additional details, statistical reporting, and new experiments where feasible.

Point-by-point responses
  1. Referee: [Experiments] Experimental section: the central performance claim (consistent outperformance of OPSD by +0.95/+2.05/+2.33 Average@12 on AIME 24/25 and HMMT 25) is presented as summarized deltas without reported standard deviations, number of independent runs, statistical significance tests, or exact baseline re-implementations and data splits. This leaves the robustness of the gains difficult to evaluate.

    Authors: We agree that these details are necessary to properly evaluate robustness. The original manuscript omitted them for space reasons, but the experiments were run with multiple seeds. In the revised version we now report means and standard deviations over three independent runs for all main results, include paired t-test p-values demonstrating statistical significance of the reported gains, and provide full details on baseline re-implementations together with exact data splits and random seeds in a new appendix section. revision: yes

  2. Referee: [Method] Method (controller optimization): the discounted learning-progress reward directly scores future improvement on the same AIME/HMMT evaluation distribution used for final reporting. No cross-benchmark transfer experiments, hold-out validation of the controller, or ablation isolating the reward from benchmark-specific difficulty curves are described, raising the possibility that reported gains reflect reduced mismatch on these particular rollouts rather than a general exposure principle.

    Authors: The concern is well-taken: the learning-progress reward is computed from student accuracy on the target benchmark distributions during the short hold window. While the controller itself receives only compact, task-agnostic state features (recent loss, gradient statistics, rollout entropy), this still leaves open the question of whether the gains are benchmark-specific. In the revision we have added (i) cross-benchmark transfer results in which a controller trained on AIME rollouts is deployed on HMMT and vice versa, (ii) an internal hold-out split of the benchmark problems used solely for reward computation during controller updates, and (iii) an ablation replacing the delayed learning-progress reward with an immediate-loss baseline. These new results are presented in an expanded experimental section and support that the benefit arises from adaptive exposure rather than overfitting to particular benchmark difficulty curves. revision: yes
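The rebuttal names recent loss, gradient statistics, and rollout entropy as the controller's task-agnostic inputs. A hedged sketch of such a compact state vector; the feature names and aggregation choices are ours, not the paper's:

```python
import statistics

def compact_state(recent_losses, grad_norms, rollout_entropies):
    """Illustrative compact training-state vector: loss level and trend,
    gradient scale, and rollout entropy, each reduced to one scalar."""
    return [
        statistics.fmean(recent_losses),       # current loss level
        recent_losses[-1] - recent_losses[0],  # loss trend over the window
        statistics.fmean(grad_norms),          # gradient magnitude
        statistics.fmean(rollout_entropies),   # student uncertainty on rollouts
    ]
```

Keeping the state this small is what lets the controller stay lightweight, but it is also the referee's worry: any exposure-relevant signal not summarized here is invisible to the policy.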

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper's derivation begins with an empirical fixed-exposure sweep establishing that full teacher reference is suboptimal and mismatch increases with exposure; this is an independent observation, not a fitted input. It then defines a new lightweight Beta-policy controller and a discounted learning-progress reward whose target (future student improvement over hold windows) is specified externally to any model parameters or prior results. The reported gains over OPSD and RL baselines are measured on held-out evaluation rollouts rather than quantities forced by construction or by a self-citation chain. No equations, uniqueness theorems, or ansatzes are shown to reduce to self-referential definitions, and the method introduces an independent optimization axis whose validity is tested rather than presupposed.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The central claim depends on the empirical effectiveness of a newly introduced controller whose parameters are fitted during training and on background assumptions about on-policy distillation dynamics.

free parameters (2)
  • Beta-policy parameters
    Weights of the lightweight controller that outputs the reveal-ratio distribution; learned end-to-end.
  • Reward discount factor
    Hyperparameter controlling how far into the future the learning-progress signal is discounted.
axioms (2)
  • domain assumption On-policy self-distillation with teacher conditioning on reference solutions is a viable base recipe for improving LLM reasoning.
    The paper takes this established approach as given and modifies only the exposure variable.
  • ad hoc to paper Compact training-state statistics are sufficient to condition an effective exposure policy.
    Introduced without further justification in the method description.
invented entities (1)
  • Beta-policy controller no independent evidence
    purpose: Dynamically samples the fraction of reference reasoning revealed to the teacher.
    New component proposed to replace the fixed full-exposure default.

pith-pipeline@v0.9.0 · 5630 in / 1518 out tokens · 51073 ms · 2026-05-13T01:58:33.413358+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 11 internal anchors

  1. [1]

    On-policy distillation of language models: Learning from self-generated mistakes

    Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In ICLR, 2024

  2. [2]

    On-policy distillation of language models: Learning from self-generated mistakes

    Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. 2024

  3. [3]

    Scheduled sampling for sequence prediction with recurrent neural networks

    Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171–1179, 2015

  4. [4]

    Curriculum learning

    Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In ICML, 2009

  5. [5]

    Soda: Semi on-policy black-box distillation for large language models

    David Chen, Omar Khattab, and Matei Zaharia. Soda: Semi on-policy black-box distillation for large language models. arXiv preprint, 2026

  6. [6]

    Hdpo: Hybrid distillation policy optimization via privileged self-distillation

    Ken Ding. Hdpo: Hybrid distillation policy optimization via privileged self-distillation. arXiv preprint arXiv:2603.23871, 2026

  7. [7]

    MiniLLM: Knowledge distillation of large language models

    Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. MiniLLM: Knowledge distillation of large language models. In ICLR, 2024

  8. [8]

    OpenThoughts: Data Recipes for Reasoning Models

    Etash Guha et al. OpenThoughts: Data recipes for reasoning models. arXiv preprint arXiv:2506.04178, 2025

  9. [9]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, DeJian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  10. [10]

    Distilling the Knowledge in a Neural Network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

  11. [11]

    Dist+: Knowledge distillation from a stronger adaptive teacher

    Xiaoqi Huang, Jie Zhao, Bingchen Han, Jian Li, Xiao Wang, Aohan Zeng, Wendi Zhao, Yuxiao Dong, and Jie Tang. Dist+: Knowledge distillation from a stronger adaptive teacher. arXiv preprint, 2025

  12. [12]

    SDPO: Self-distillation with privileged observations

    Jonas Hübotter et al. SDPO: Self-distillation with privileged observations. arXiv preprint, 2026

  13. [13]

    Dynamic temperature scheduler for knowledge distillation

    Kazi Rakibul Islam, Md Sumon Islam, Syed Ahmed, and Mohammad Hasan. Dynamic temperature scheduler for knowledge distillation. arXiv preprint, 2025

  14. [14]

    Adversarially adaptive temperatures for decoupled knowledge distillation with application to classification and regression

    Jian Jin, Liujun Chen, Ge Luo, Yitong Chen, Shuanglong Liang, and Linjun Qian. Adversarially adaptive temperatures for decoupled knowledge distillation with application to classification and regression. arXiv preprint, 2025

  15. [15]

    Distillm: Towards streamlined distillation for large language models

    Jongwoo Ko, Sungnyun Kim, Tianyi Chen, and Se-Young Yun. Distillm: Towards streamlined distillation for large language models. 2024

  16. [16]

    Rethinking on-policy distillation of large language models: Phenomenology, mechanisms, and optimal practices

    Xiaotian Li, Zheng Wang, Man Luo, Shuzhan Chen, Jianxun Li, Kai Zhang, Yuxuan Dong, and Jie Liu. Rethinking on-policy distillation of large language models: Phenomenology, mechanisms, and optimal practices. arXiv preprint, 2026

  17. [17]

    Curriculum temperature for knowledge distillation

    Yuxuan Li, Xu Shen, et al. Curriculum temperature for knowledge distillation. arXiv preprint, 2023

  18. [18]

    Adaptive temperature based on logits correlation in knowledge distillation

    Takuya Matsuyama, Tomoki Shibata, Jumpei Tanaka, and Yoshiaki Uchida. Adaptive temperature based on logits correlation in knowledge distillation. arXiv preprint, 2025

  19. [19]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022

  20. [20]

    Qwen3 Technical Report

    Qwen Team. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  21. [21]

    Direct Preference Optimization: Your Language Model is Secretly a Reward Model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023

  22. [22]

    A reduction of imitation learning and structured prediction to no-regret online learning

    Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. AISTATS, 2011

  23. [23]

    CRISP: Compressed Reasoning via Iterative Self-Policy Distillation

    Hejian Sang, Yuanda Xu, Zhengze Zhou, Ran He, Zhipeng Wang, and Jiachen Sun. On-policy self-distillation for reasoning compression. arXiv preprint arXiv:2603.05433, 2026

  24. [24]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  25. [25]

    Rl on incorrect synthetic data scales the efficiency of llm math reasoning by eight-fold

    Amrith Setlur, Chirag Nagpal, Adam Fisch, Xinyang Geng, Jacob Eisenstein, Rishabh Agarwal, Hany Awadalla, David Dohan, and Aviral Kumar. Rl on incorrect synthetic data scales the efficiency of llm math reasoning by eight-fold. arXiv preprint arXiv:2406.14532, 2024

  26. [26]

    Deepseekmath: Pushing the limits of mathematical reasoning in open language models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  27. [27]

    Self-Distillation Enables Continual Learning

    Irina Shenfeld, Mehul Damani, Jonas Hübotter, and Pulkit Agrawal. Self-distillation enables continual learning. arXiv preprint arXiv:2601.19897, 2026

  28. [28]

    Gates: Self-distillation under privileged context with consensus gating

    Alex Stein, Furong Huang, and Tom Goldstein. Gates: Self-distillation under privileged context with consensus gating. arXiv preprint arXiv:2602.20574, 2026

  29. [29]

    Hindsight credit assignment for long-horizon llm agents

    Jiachen Tan, Zheng Wang, Yiran Chen, and Ziniu Liu. Hindsight credit assignment for long-horizon llm agents. arXiv preprint, 2026

  30. [30]

    Simple statistical gradient-following algorithms for connectionist reinforcement learning

    Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4):229–256, 1992

  31. [31]

    On-policy distillation of language models

    Canwen Xu et al. On-policy distillation of language models. arXiv preprint, 2024

  32. [32]

    Direct reasoning optimization: Constrained rl with token-level dense reward and monotonic improvement for reasoning in llms

    Yuzhe Xu, Yiran Chen, and Ziniu Liu. Direct reasoning optimization: Constrained rl with token-level dense reward and monotonic improvement for reasoning in llms. arXiv preprint, 2025

  33. [33]

    Distribution-aligned sequence distillation for superior long-cot reasoning

    Yiran Yan, Yiran Chen, and Ziniu Liu. Distribution-aligned sequence distillation for superior long-cot reasoning. arXiv preprint, 2026

  34. [34]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu et al. DAPO: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025

  35. [35]

    Self-distilled reasoner: On-policy self-distillation for large language models

    Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models. Preprint,

  36. [36]

    https://arxiv.org/abs/2601.18734