pith. machine review for the scientific record.

arxiv: 2605.11458 · v1 · submitted 2026-05-12 · 💻 cs.AI · cs.CL · cs.LO

Recognition: no theorem link

Adaptive Teacher Exposure for Self-Distillation in LLM Reasoning

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 01:58 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.LO

keywords self-distillation · LLM reasoning · teacher exposure · adaptive policy · on-policy distillation · Beta distribution · math benchmarks

The pith

Adaptive control of how much reference reasoning the teacher sees during self-distillation improves LLM performance on math tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard on-policy self-distillation for LLM reasoning always gives the teacher the complete reference solution, yet experiments show this fixed full exposure is not reliably optimal and increases mismatch as the teacher sees more privileged steps. The paper treats teacher exposure instead as a learnable variable. A lightweight Beta-policy controller, conditioned on compact training statistics, samples a reveal ratio that stays fixed for a short window of student updates. The controller is optimized with a discounted reward that credits each choice by its measured effect on the student's future progress rather than immediate loss. Across Qwen3 models from 1.7B to 8B parameters, this adaptive schedule outperforms fixed-exposure self-distillation and RL baselines on AIME 24, AIME 25, and HMMT 25.

Core claim

Treating teacher exposure as a learnable control variable via a Beta-policy controller conditioned on training-state statistics and optimized by a discounted learning-progress reward produces higher student reasoning accuracy than the conventional choice of always revealing the full reference.

What carries the argument

A Beta-policy controller that samples the fraction of reference reasoning to expose to the teacher for a fixed hold window of student updates and receives a reward based on the student's subsequent improvement.
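The sampling-and-hold loop described above can be sketched in a few lines. This is an illustration of the mechanism, not the paper's implementation: the Beta parameters, the window length, and the helper `reveal_prefix` are invented for the example (in ATESD the parameters come from a learned controller conditioned on training-state statistics).

```python
import random

random.seed(0)

def reveal_prefix(reference_cot, ratio):
    """Expose only the leading `ratio` fraction of the reference reasoning steps."""
    k = max(1, round(ratio * len(reference_cot)))
    return reference_cot[:k]

# The controller maps training-state statistics to Beta parameters; fixed
# constants stand in for that mapping here, purely for illustration.
alpha, beta = 2.0, 5.0
ratio = random.betavariate(alpha, beta)  # sampled reveal ratio in (0, 1)

HOLD_WINDOW = 8  # hypothetical number of student updates per held exposure
reference = [f"step {i}" for i in range(10)]
for _ in range(HOLD_WINDOW):
    teacher_context = reveal_prefix(reference, ratio)  # same exposure all window
```

Holding one sampled exposure fixed for a window, rather than resampling per step, is what makes the downstream reward attributable to a single decision.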

If this is right

  • Full exposure of the reference reasoning is not reliably the best choice for student learning.
  • Mismatch between teacher targets and student competence grows monotonically with the amount of privileged reasoning shown.
  • Optimizing exposure with a future-progress reward addresses the delayed credit assignment problem in on-policy distillation.
  • The adaptive method delivers consistent gains over OPSD and other baselines on AIME 24, AIME 25, and HMMT 25 for models ranging from 1.7B to 8B parameters.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same controller design could be applied to other training-time decisions such as rollout length or data filtering in reasoning pipelines.
  • If the compact training statistics omit key signals, the learned policy may fail to generalize beyond the training distribution of math problems.
  • The delayed-reward formulation might transfer to other credit-assignment settings in LLM post-training where immediate loss is a poor signal.

Load-bearing premise

A lightweight Beta-policy controller optimized via a discounted learning-progress reward on compact training-state statistics will reliably produce exposure decisions that improve long-term student performance without introducing training instability or benchmark-specific overfitting.
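A minimal sketch of this premise's two ingredients, under our own assumptions: a discounted reward that credits a held decision by subsequent student improvement, and a REINFORCE-style nudge of the Beta parameters toward exposures that earned positive progress. The function names, learning rate, and the numerical gradient are illustrative, not the paper's formulation.

```python
import math

def learning_progress_reward(future_scores, baseline, gamma=0.9):
    """Discounted credit for one held exposure decision: the sum of future
    student improvements over the pre-decision baseline, discounted by gamma."""
    return sum(gamma ** t * (s - baseline) for t, s in enumerate(future_scores))

def beta_logpdf(x, a, b):
    """Log-density of Beta(a, b) at x in (0, 1)."""
    return ((a - 1) * math.log(x) + (b - 1) * math.log(1 - x)
            + math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b))

def reinforce_step(a, b, x, reward, lr=0.01, eps=1e-5):
    """Raise the log-probability of an exposure x in proportion to its reward
    (central-difference gradients, for brevity)."""
    ga = (beta_logpdf(x, a + eps, b) - beta_logpdf(x, a - eps, b)) / (2 * eps)
    gb = (beta_logpdf(x, a, b + eps) - beta_logpdf(x, a, b - eps)) / (2 * eps)
    return a + lr * reward * ga, b + lr * reward * gb
```

The discounting is what distinguishes this from scoring the immediate loss change: an exposure that looks neutral now but speeds later improvement still earns credit.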

What would settle it

An ablation in which the adaptive controller is replaced by fixed full exposure or random sampling, with the same models retrained on the identical benchmarks; equal or higher scores under those conditions would refute the claim.

Figures

Figures reproduced from arXiv: 2605.11458 by Huaibin Wang, Tiangang Zhang, Yilun Sun, Zihao Han.

Figure 1
Figure 1. Overview of ATESD. (A) Teacher-side exposure mismatch: on an easy problem (e.g. 2+3) the teacher’s privileged CoT stays within the student’s capability and distillation succeeds; on a hard problem (e.g. a quadratic equation) the full CoT far exceeds the student’s level, producing targets the student cannot absorb. (B) ATESD limits the privileged CoT via a learned exposure α: a Beta-policy controller πϕ sel…
Figure 2
Figure 2. Empirical analysis of teacher exposure on AIME 2024 with Qwen3-1.7B (3 seeds, …
Figure 3
Figure 3. Overview of ATESD. The OPSD backbone samples student continuations from the …
Figure 4
Figure 4. Mechanism ablations for exposure control.
Original abstract

On-policy self-distillation has become a strong recipe for LLM reasoning, where a privileged teacher supervises the student's own rollouts while conditioning on the reference solution. A design choice shared by nearly all such methods, however, has gone unquestioned: the teacher always sees the full reference reasoning. We argue that this default itself is part of the problem and identify a teacher-side exposure mismatch: when the teacher conditions on reasoning far beyond the student's current competence, the resulting token targets become too strong to absorb. A controlled fixed-exposure sweep makes this concrete on two fronts: 1) full exposure is not reliably the best choice, and 2) student-teacher mismatch grows monotonically as the teacher sees more privileged reasoning. This motivates treating teacher exposure not as a fixed hyperparameter but as a learnable training-time control variable. We therefore propose Adaptive Teacher Exposure for Self-Distillation (ATESD). ATESD models the reveal ratio with a lightweight Beta-policy controller conditioned on compact training-state statistics, and uses one sampled exposure for a short hold window of student updates. To make this exposure controller learnable, we optimize it with a discounted learning-progress reward that scores each held decision by its effect on the student's future improvement rather than its immediate loss change, addressing the delayed credit assignment induced by on-policy distillation. Experiments on AIME 24, AIME 25, and HMMT 25 across Qwen3-{1.7B, 4B, 8B} show that ATESD consistently outperforms competitive self-distillation and RL baselines, improving over OPSD by +0.95, +2.05, and +2.33 Average@12 points respectively, and establishing adaptive teacher exposure as an effective new axis for reasoning self-distillation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper argues that full teacher exposure to reference reasoning in on-policy self-distillation for LLMs creates an exposure mismatch that hinders student learning. It proposes ATESD, which replaces fixed exposure with a lightweight Beta-policy controller conditioned on compact training-state statistics; the controller is optimized via a discounted learning-progress reward that evaluates each exposure decision by its effect on future student improvement over a short hold window. Experiments on AIME 24, AIME 25, and HMMT 25 across Qwen3-1.7B/4B/8B models report consistent gains over OPSD and other self-distillation/RL baselines (+0.95 to +2.33 Average@12).

Significance. If the performance gains prove robust, the work identifies a previously unexamined axis—adaptive teacher exposure—in reasoning self-distillation and demonstrates that a simple learnable controller can outperform fixed-exposure and standard RL baselines. The delayed-credit formulation and use of compact state statistics are practical contributions that could generalize beyond the reported math benchmarks.

major comments (2)
  1. [Experiments] Experimental section: the central performance claim (consistent outperformance of OPSD by +0.95/+2.05/+2.33 Average@12 on AIME 24/25 and HMMT 25) is presented as summarized deltas without reported standard deviations, number of independent runs, statistical significance tests, or exact baseline re-implementations and data splits. This leaves the robustness of the gains difficult to evaluate.
  2. [Method] Method (controller optimization): the discounted learning-progress reward directly scores future improvement on the same AIME/HMMT evaluation distribution used for final reporting. No cross-benchmark transfer experiments, hold-out validation of the controller, or ablation isolating the reward from benchmark-specific difficulty curves are described, raising the possibility that reported gains reflect reduced mismatch on these particular rollouts rather than a general exposure principle.
minor comments (2)
  1. [Abstract] The abstract and introduction refer to “competitive self-distillation and RL baselines” without naming the full set of comparators (e.g., specific RL variants or prior self-distillation methods); an explicit list would improve clarity.
  2. [Method] Notation for the Beta-policy parameters and the exact form of the compact training-state statistics could be formalized in a single equation or table for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on experimental reporting and the generality of the proposed controller. We address each major comment below and have updated the manuscript to incorporate additional details, statistical reporting, and new experiments where feasible.

Point-by-point responses
  1. Referee: [Experiments] Experimental section: the central performance claim (consistent outperformance of OPSD by +0.95/+2.05/+2.33 Average@12 on AIME 24/25 and HMMT 25) is presented as summarized deltas without reported standard deviations, number of independent runs, statistical significance tests, or exact baseline re-implementations and data splits. This leaves the robustness of the gains difficult to evaluate.

    Authors: We agree that these details are necessary to properly evaluate robustness. The original manuscript omitted them for space reasons, but the experiments were run with multiple seeds. In the revised version we now report means and standard deviations over three independent runs for all main results, include paired t-test p-values demonstrating statistical significance of the reported gains, and provide full details on baseline re-implementations together with exact data splits and random seeds in a new appendix section. revision: yes

  2. Referee: [Method] Method (controller optimization): the discounted learning-progress reward directly scores future improvement on the same AIME/HMMT evaluation distribution used for final reporting. No cross-benchmark transfer experiments, hold-out validation of the controller, or ablation isolating the reward from benchmark-specific difficulty curves are described, raising the possibility that reported gains reflect reduced mismatch on these particular rollouts rather than a general exposure principle.

    Authors: The concern is well-taken: the learning-progress reward is computed from student accuracy on the target benchmark distributions during the short hold window. While the controller itself receives only compact, task-agnostic state features (recent loss, gradient statistics, rollout entropy), this still leaves open the question of whether the gains are benchmark-specific. In the revision we have added (i) cross-benchmark transfer results in which a controller trained on AIME rollouts is deployed on HMMT and vice versa, (ii) an internal hold-out split of the benchmark problems used solely for reward computation during controller updates, and (iii) an ablation replacing the delayed learning-progress reward with an immediate-loss baseline. These new results are presented in an expanded experimental section and support that the benefit arises from adaptive exposure rather than overfitting to particular benchmark difficulty curves. revision: yes
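The rebuttal names recent loss, gradient statistics, and rollout entropy as the controller's task-agnostic inputs. A hedged sketch of such a compact state vector; the feature names and aggregation choices are ours, not the paper's:

```python
import statistics

def compact_state(recent_losses, grad_norms, rollout_entropies):
    """Illustrative compact training-state vector: loss level and trend,
    gradient scale, and rollout entropy, each reduced to one scalar."""
    return [
        statistics.fmean(recent_losses),       # current loss level
        recent_losses[-1] - recent_losses[0],  # loss trend over the window
        statistics.fmean(grad_norms),          # gradient magnitude
        statistics.fmean(rollout_entropies),   # student uncertainty on rollouts
    ]
```

Keeping the state this small is what lets the controller stay lightweight, but it is also the referee's worry: any exposure-relevant signal not summarized here is invisible to the policy.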

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper's derivation begins with an empirical fixed-exposure sweep establishing that full teacher reference is suboptimal and mismatch increases with exposure; this is an independent observation, not a fitted input. It then defines a new lightweight Beta-policy controller and a discounted learning-progress reward whose target (future student improvement over hold windows) is specified externally to any model parameters or prior results. The reported gains over OPSD and RL baselines are measured on held-out evaluation rollouts rather than quantities forced by construction or by a self-citation chain. No equations, uniqueness theorems, or ansatzes are shown to reduce to self-referential definitions, and the method introduces an independent optimization axis whose validity is tested rather than presupposed.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The central claim depends on the empirical effectiveness of a newly introduced controller whose parameters are fitted during training and on background assumptions about on-policy distillation dynamics.

free parameters (2)
  • Beta-policy parameters
    Weights of the lightweight controller that outputs the reveal-ratio distribution; learned end-to-end.
  • Reward discount factor
    Hyperparameter controlling how far into the future the learning-progress signal is discounted.
axioms (2)
  • domain assumption On-policy self-distillation with teacher conditioning on reference solutions is a viable base recipe for improving LLM reasoning.
    The paper takes this established approach as given and modifies only the exposure variable.
  • ad hoc to paper Compact training-state statistics are sufficient to condition an effective exposure policy.
    Introduced without further justification in the method description.
invented entities (1)
  • Beta-policy controller no independent evidence
    purpose: Dynamically samples the fraction of reference reasoning revealed to the teacher.
    New component proposed to replace the fixed full-exposure default.

pith-pipeline@v0.9.0 · 5630 in / 1518 out tokens · 51073 ms · 2026-05-13T01:58:33.413358+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 11 internal anchors

  1. [1]

    On-policy distillation of language models: Learning from self-generated mistakes

    Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In ICLR, 2024

  2. [2]

    On-policy distillation of language models: Learning from self-generated mistakes

    Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. 2024

  3. [3]

    Scheduled sampling for sequence prediction with recurrent neural networks

    Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171–1179, 2015

  4. [4]

    Curriculum learning

    Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In ICML, 2009

  5. [5]

    Soda: Semi on-policy black-box distillation for large language models

    David Chen, Omar Khattab, and Matei Zaharia. Soda: Semi on-policy black-box distillation for large language models. arXiv preprint, 2026

  6. [6]

    Hdpo: Hybrid distillation policy optimization via privileged self-distillation

    Ken Ding. Hdpo: Hybrid distillation policy optimization via privileged self-distillation. arXiv preprint arXiv:2603.23871, 2026

  7. [7]

    MiniLLM: Knowledge distillation of large language models

    Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. MiniLLM: Knowledge distillation of large language models. In ICLR, 2024

  8. [8]

    OpenThoughts: Data Recipes for Reasoning Models

    Etash Guha et al. OpenThoughts: Data recipes for reasoning models. arXiv preprint arXiv:2506.04178, 2025

  9. [9]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, DeJian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  10. [10]

    Distilling the Knowledge in a Neural Network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

  11. [11]

    Dist+: Knowledge distillation from a stronger adaptive teacher

    Xiaoqi Huang, Jie Zhao, Bingchen Han, Jian Li, Xiao Wang, Aohan Zeng, Wendi Zhao, Yuxiao Dong, and Jie Tang. Dist+: Knowledge distillation from a stronger adaptive teacher. arXiv preprint, 2025

  12. [12]

    SDPO: Self-distillation with privileged observations

    Jonas Hübotter et al. SDPO: Self-distillation with privileged observations. arXiv preprint, 2026

  13. [13]

    Dynamic temperature scheduler for knowledge distillation

    Kazi Rakibul Islam, Md Sumon Islam, Syed Ahmed, and Mohammad Hasan. Dynamic temperature scheduler for knowledge distillation. arXiv preprint, 2025

  14. [14]

    Adversarially adaptive temperatures for decoupled knowledge distillation with application to classification and regression

    Jian Jin, Liujun Chen, Ge Luo, Yitong Chen, Shuanglong Liang, and Linjun Qian. Adversarially adaptive temperatures for decoupled knowledge distillation with application to classification and regression. arXiv preprint, 2025

  15. [15]

    Distillm: Towards streamlined distillation for large language models

    Jongwoo Ko, Sungnyun Kim, Tianyi Chen, and Se-Young Yun. Distillm: Towards streamlined distillation for large language models. 2024

  16. [16]

    Rethinking on-policy distillation of large language models: Phenomenology, mechanisms, and optimal practices

    Xiaotian Li, Zheng Wang, Man Luo, Shuzhan Chen, Jianxun Li, Kai Zhang, Yuxuan Dong, and Jie Liu. Rethinking on-policy distillation of large language models: Phenomenology, mechanisms, and optimal practices. arXiv preprint, 2026

  17. [17]

    Curriculum temperature for knowledge distillation

    Yuxuan Li, Xu Shen, et al. Curriculum temperature for knowledge distillation. arXiv preprint, 2023

  18. [18]

    Adaptive temperature based on logits correlation in knowledge distillation

    Takuya Matsuyama, Tomoki Shibata, Jumpei Tanaka, and Yoshiaki Uchida. Adaptive temperature based on logits correlation in knowledge distillation. arXiv preprint, 2025

  19. [19]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022

  20. [20]

    Qwen3 Technical Report

    Qwen Team. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  21. [21]

    Direct Preference Optimization: Your Language Model is Secretly a Reward Model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023

  22. [22]

    A reduction of imitation learning and structured prediction to no-regret online learning

    Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. AISTATS, 2011

  23. [23]

    CRISP: Compressed Reasoning via Iterative Self-Policy Distillation

    Hejian Sang, Yuanda Xu, Zhengze Zhou, Ran He, Zhipeng Wang, and Jiachen Sun. On-policy self-distillation for reasoning compression. arXiv preprint arXiv:2603.05433, 2026

  24. [24]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  25. [25]

    Rl on incorrect synthetic data scales the efficiency of llm math reasoning by eight-fold

    Amrith Setlur, Chirag Nagpal, Adam Fisch, Xinyang Geng, Jacob Eisenstein, Rishabh Agarwal, Hany Awadalla, David Dohan, and Aviral Kumar. Rl on incorrect synthetic data scales the efficiency of llm math reasoning by eight-fold. arXiv preprint arXiv:2406.14532, 2024

  26. [26]

    Deepseekmath: Pushing the limits of mathematical reasoning in open language models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  27. [27]

    Self-Distillation Enables Continual Learning

    Irina Shenfeld, Mehul Damani, Jonas Hübotter, and Pulkit Agrawal. Self-distillation enables continual learning. arXiv preprint arXiv:2601.19897, 2026

  28. [28]

    Gates: Self-distillation under privileged context with consensus gating

    Alex Stein, Furong Huang, and Tom Goldstein. Gates: Self-distillation under privileged context with consensus gating. arXiv preprint arXiv:2602.20574, 2026

  29. [29]

    Hindsight credit assignment for long-horizon llm agents

    Jiachen Tan, Zheng Wang, Yiran Chen, and Ziniu Liu. Hindsight credit assignment for long-horizon llm agents. arXiv preprint, 2026

  30. [30]

    Simple statistical gradient-following algorithms for connectionist reinforcement learning

    Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4):229–256, 1992

  31. [31]

    On-policy distillation of language models

    Canwen Xu et al. On-policy distillation of language models. arXiv preprint, 2024

  32. [32]

    Direct reasoning optimization: Constrained rl with token-level dense reward and monotonic improvement for reasoning in llms

    Yuzhe Xu, Yiran Chen, and Ziniu Liu. Direct reasoning optimization: Constrained rl with token-level dense reward and monotonic improvement for reasoning in llms. arXiv preprint, 2025

  33. [33]

    Distribution-aligned sequence distillation for superior long-cot reasoning

    Yiran Yan, Yiran Chen, and Ziniu Liu. Distribution-aligned sequence distillation for superior long-cot reasoning. arXiv preprint, 2026

  34. [34]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu et al. DAPO: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025

  35. [35]

    Self-distilled reasoner: On-policy self-distillation for large language models

    Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models. Preprint,

  36. [36]

    https://arxiv.org/abs/2601.18734