pith. sign in

arxiv: 2606.00755 · v1 · pith:JP5WSH53new · submitted 2026-05-30 · 💻 cs.CL · cs.LG

Internalize the Temperature: On-Policy Self-Distillation as Policy Reheater for Reinforcement Learning

Pith reviewed 2026-06-28 18:47 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords reinforcement learningself-distillationentropy collapselarge language modelspolicy reheatingon-policy distillationtemperature scaling
0
0 comments X

The pith

Temperature-scaled on-policy self-distillation restores entropy in collapsed RL policies for language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a lightweight self-distillation process can internalize the exploratory benefits of high sampling temperature directly into model parameters after entropy collapse occurs during reinforcement learning. This matters to a sympathetic reader because current fixes keep temperature adjustments outside the model, requiring ongoing external management, whereas this approach bakes the effect into the weights themselves. Experiments indicate that the resulting checkpoint serves as a stronger base for further RL training on reasoning tasks compared to simply continuing RL or reheating only at rollout time.

Core claim

Starting from an entropy-collapsed RL checkpoint, TS-OPSD constructs a self-teacher by applying high-temperature scaling to the model's own logits and distills the smoother distribution back into the student model. This internalizes the temperature effect, requiring no external teacher or additional inference cost during training.

What carries the argument

Temperature-Scaled On-Policy Self-Distillation (TS-OPSD), the process of using high-temperature scaled self-logits as a teacher signal for distillation to reheat the policy parameters.

If this is right

  • Policy reheating via TS-OPSD yields stronger initialization for continued RL than standard continued RL.
  • TS-OPSD reduces output sharpness while preserving intermediate representations, top candidate sets, and reasoning capability.
  • Entropy restoration serves as a simple post-collapse intervention for extending reasoning-oriented RL.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could be applied periodically during RL training to prevent collapse rather than only as a post-hoc fix.
  • Similar self-distillation might help in other domains where policy entropy needs restoration without changing the objective.
  • Since it preserves reasoning capability, it may allow scaling RL to longer horizons or more complex verifiable reward tasks.

Load-bearing premise

Distilling from high-temperature scaled versions of the model's own logits produces a useful teacher signal that restores entropy without degrading the underlying reasoning capability or intermediate representations.

What would settle it

A direct comparison where a model reheated with TS-OPSD shows no improvement or worse performance in subsequent RL training on the same verifiable reward tasks compared to a baseline without reheating.

Figures

Figures reproduced from arXiv: 2606.00755 by Jiachen Yu, Jie Wu, Junjie Wang, Shaoning Sun, Xuewei Yang, Yujiu Yang.

Figure 1
Figure 1. Figure 1: (a) Rollout reheating increases sampling tem￾perature without modifying the policy itself, whereas policy reheating (TS-OPSD) internalizes temperature directly into the model parameters. (b) After applying policy reheating to an entropy-collapsed checkpoint and resuming RL, DAPO w/ PR-FKL consistently outper￾forms both the DAPO baseline and rollout reheating (DAPO w/ RR) on Qwen3-4B-Base and Qwen3-8B￾Base … view at source ↗
Figure 2
Figure 2. Figure 2: Reward and mean token entropy during RL training on Qwen3-8B-Base. The dotted vertical line marks the point at which policy reheating (PR-FKL) is applied to the collapsed checkpoint. After reheating, DAPO w/ PR-FKL (red) resumes from a higher-entropy initialization and finally achieves a higher reward than standard DAPO (blue) in the continued training phase. collapse, serving as the primary baseline. DAPO… view at source ↗
Figure 4
Figure 4. Figure 4: Task accuracy and mean token entropy across [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Output distribution analysis before and after high-temperature OPSD on MATH-500 probe data. Left: [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Stochastic log-probability trace averaged [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Training dynamics of FKL and RKL objec￾tives during high-temperature OPSD (T = 1.2, steps 1–15). (a) Mean student entropy: both objectives pro￾duce comparable entropy recovery, with a jump occur￾ring around step 9–10 under both settings. (b) Gradi￾ent norm: FKL consistently exhibits a higher gradient norm than RKL throughout training, consistent with the theoretical prediction that RKL reduces to a more co… view at source ↗
Figure 9
Figure 9. Figure 9: Student entropy and gradient norm during [PITH_FULL_IMAGE:figures/full_fig_p013_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Student entropy during long-horizon FKL OPSD on an entropy-collapsed Qwen3-8B-Base check￾point. Left: full trajectory over 31 steps, showing a clear regime change around step 16 where entropy transitions from slow, controlled growth to rapid, accelerating in￾crease. Right: zoomed view of steps 1–16, showing the characteristic two-phase dynamic with a visible entropy jump, albeit with higher step-to-step v… view at source ↗
Figure 12
Figure 12. Figure 12: Layer-wise reversibility metrics compar [PITH_FULL_IMAGE:figures/full_fig_p014_12.png] view at source ↗
Figure 11
Figure 11. Figure 11: Student entropy during low-temperature (T = 0.8) RKL OPSD applied to the reheated Qwen3- 8B-Base checkpoint. Entropy decreases in a multi-phase pattern symmetric to the jump-and-stabilize dynamics observed during high-temperature OPSD, with a sharp discontinuous drop around step 10–11 after an initial slow decline. High-temperature OPSD increases policy en￾tropy by moving model parameters toward a smoothe… view at source ↗
Figure 13
Figure 13. Figure 13: Training dynamics on a data-scarce RL setting using a 2048-prompt subset of DAPO-Math-17K (Qwen3- [PITH_FULL_IMAGE:figures/full_fig_p015_13.png] view at source ↗
read the original abstract

Reinforcement learning from verifiable rewards improves the reasoning ability of large language models, but often suffers from entropy collapse, in which increasingly concentrated policies reduce rollout diversity and useful learning signals. Existing remedies either constrain the RL objective (e.g., entropy regularization) or adjust sampling temperature during rollout collection, but these interventions remain external to the model parameters. We propose Temperature-Scaled On-Policy Self-Distillation (TS-OPSD), a lightweight policy reheating method that internalizes the exploratory effect of temperature into model parameters. Starting from an entropy-collapsed RL checkpoint, TS-OPSD constructs a self-teacher by applying high-temperature scaling to the model's own logits, then distills the resulting smoother distribution back into the student. This policy reheating requires no external teacher, privileged data, or additional inference cost. Experiments on Qwen3-4B-Base and Qwen3-8B-Base show that policy reheating yields a stronger initialization for continued RL than both standard continued RL and rollout-level temperature reheating. Further analyses show that TS-OPSD mainly reduces output sharpness while preserving intermediate representations, top candidate sets, and reasoning capability. These results suggest that entropy restoration can serve as a simple post-collapse intervention for extending reasoning-oriented RL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Temperature-Scaled On-Policy Self-Distillation (TS-OPSD) to address entropy collapse in RL for LLM reasoning. Starting from a collapsed checkpoint, it constructs a self-teacher by high-temperature scaling of the model's own logits and distills the smoother distribution back into the parameters. This internalizes the exploratory effect of temperature without external teachers or extra inference cost. Experiments on Qwen3-4B-Base and Qwen3-8B-Base claim that the resulting checkpoint is a stronger initialization for continued RL than either standard continued RL or rollout-level temperature reheating. Analyses indicate that TS-OPSD reduces output sharpness while preserving intermediate representations, top candidate sets, and reasoning capability.

Significance. If the central experimental claim holds, the method offers a lightweight, parameter-internal post-collapse intervention that could extend the usable horizon of reasoning-oriented RL without modifying the RL objective or incurring rollout overhead. The self-distillation approach is notable for requiring no privileged data or additional models.

major comments (2)
  1. [Analyses] The headline claim that TS-OPSD yields a stronger initialization rests on the assumption that high-T self-distillation from collapsed logits produces a net-positive teacher signal. The analyses section must provide quantitative evidence (e.g., token-distribution divergence metrics or entropy curves post-distillation) showing that the softened distribution restores exploitable diversity rather than merely rescaling the same narrow token set; without this, the superiority over rollout-temperature baselines cannot be secured.
  2. [Experiments] Experiments section: the reported superiority on Qwen3-4B-Base and Qwen3-8B-Base requires explicit controls for the RL training budget, reward model, and evaluation metrics used to declare a 'stronger initialization.' The current description lacks statistical significance tests or variance across seeds, which is load-bearing for the cross-method comparison.
minor comments (2)
  1. [Introduction] The abstract and introduction should clarify whether the high-temperature scaling is applied only at the final layer or throughout the network, as this affects the interpretation of 'internalizing' the temperature.
  2. [Method] Notation for the temperature-scaled logits and the distillation loss should be introduced with explicit equations to allow reproduction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. We address each of the major comments below and indicate the revisions we plan to make to the manuscript.

read point-by-point responses
  1. Referee: [Analyses] The headline claim that TS-OPSD yields a stronger initialization rests on the assumption that high-T self-distillation from collapsed logits produces a net-positive teacher signal. The analyses section must provide quantitative evidence (e.g., token-distribution divergence metrics or entropy curves post-distillation) showing that the softened distribution restores exploitable diversity rather than merely rescaling the same narrow token set; without this, the superiority over rollout-temperature baselines cannot be secured.

    Authors: We agree that providing explicit quantitative evidence on the effect of the self-distillation would strengthen the paper. Our existing analyses already show that TS-OPSD reduces output sharpness while preserving intermediate representations, top candidate sets, and reasoning capability, which suggests the softened distribution is not merely rescaling but maintaining useful structure. However, to directly address the referee's concern, we will add token-distribution divergence metrics and entropy curves in the revised Analyses section to quantify the restoration of diversity. revision: yes

  2. Referee: [Experiments] Experiments section: the reported superiority on Qwen3-4B-Base and Qwen3-8B-Base requires explicit controls for the RL training budget, reward model, and evaluation metrics used to declare a 'stronger initialization.' The current description lacks statistical significance tests or variance across seeds, which is load-bearing for the cross-method comparison.

    Authors: We will revise the Experiments section to provide explicit details on the RL training budget, reward model, and evaluation metrics. We acknowledge that statistical significance tests and variance across seeds are not currently reported. In the revision, we will include multi-seed results and appropriate statistical tests to support the comparisons, thereby making the claims more robust. revision: yes

Circularity Check

0 steps flagged

No circularity; method is procedural with no derivations or self-referential reductions

full rationale

The paper describes TS-OPSD as a post-hoc empirical intervention (high-T scaling of own logits followed by distillation) with no equations, first-principles derivations, or predictions that reduce to fitted inputs. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify the central claim. Experimental comparisons to baselines are presented directly without statistical forcing or renaming of known results. The derivation chain is empty of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities can be identified from the provided text.

pith-pipeline@v0.9.1-grok · 5771 in / 994 out tokens · 17706 ms · 2026-06-28T18:47:14.808551+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

26 extracted references · 22 canonical work pages · 14 internal anchors

  1. [1]

    InInternational Conference on Learning Representations, volume 2024, pages 21246–21263

    On-policy distillation of language models: Learning from self-generated mistakes. InInternational Conference on Learning Representations, volume 2024, pages 21246–21263. Daixuan Cheng, Shaohan Huang, Xuekai Zhu, Bo Dai, Xin Zhao, Zhenliang Zhang, and Furu Wei

  2. [2]

    The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

    The entropy mechanism of rein- forcement learning for reasoning language models. Preprint, arXiv:2505.22617. Haoran Dang, Cuiling Lan, Hai Wan, Xibin Zhao, and Yan Lu

  3. [3]

    Yuqian Fu, Haohuan Huang, Kaiwen Jiang, Jiacai Liu, Zhuo Jiang, Yuanheng Zhu, and Dongbin Zhao

    A unified revisit of temperature in classification-based knowledge distil- lation.Preprint, arXiv:2603.02430. Yuqian Fu, Haohuan Huang, Kaiwen Jiang, Jiacai Liu, Zhuo Jiang, Yuanheng Zhu, and Dongbin Zhao

  4. [4]

    Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes

    Revisiting on-policy distillation: Empirical failure modes and simple fixes.Preprint, arXiv:2603.25562. Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang

  5. [5]

    MiniLLM: On-Policy Distillation of Large Language Models

    Minillm: On-policy distillation of large language models.Preprint, arXiv:2306.08543. Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhu- oshu Li, Ziyi Gao, Aixin Liu, and 175 others

  6. [6]

    Distilling the Knowledge in a Neural Network

    Distilling the knowledge in a neural network. Preprint, arXiv:1503.02531. Jonas Hübotter, Frederike Lübeck, Lejs Behric, An- ton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, and Andreas Krause

  7. [7]

    Reinforcement Learning via Self-Distillation

    Re- inforcement learning via self-distillation.Preprint, arXiv:2601.20802. Aref Jafari, Mehdi Rezagholizadeh, Pranav Sharma, and Ali Ghodsi

  8. [8]

    Entropy-aware on-policy distillation of language models.Preprint, arXiv:2603.07079. S. Kullback and R. A. Leibler

  9. [9]

    Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

    Rethinking on-policy distillation of large lan- guage models: Phenomenology, mechanism, and recipe.Preprint, arXiv:2604.13016. Zheng Li, Xiang Li, Lingfeng Yang, Borui Zhao, Renjie Song, Lei Luo, Jun Li, and Jian Yang

  10. [10]

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Di- rani, Julian Michael, and Samuel R

    Entropy-preserving rein- forcement learning.Preprint, arXiv:2603.11682. David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Di- rani, Julian Michael, and Samuel R. Bowman

  11. [11]

    GPQA: A Graduate-Level Google-Proof Q&A Benchmark

    Gpqa: A graduate-level google-proof q&a bench- mark.Preprint, arXiv:2311.12022. Hejian Sang, Yuanda Xu, Zhengze Zhou, Ran He, Zhipeng Wang, and Jiachen Sun

  12. [12]

    CRISP: Compressed Reasoning via Iterative Self-Policy Distillation

    Crisp: Com- pressed reasoning via iterative self-policy distillation. Preprint, arXiv:2603.05433. Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y . K. Li, Y . Wu, and Daya Guo

  13. [13]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Deepseekmath: Pushing the limits of mathemati- cal reasoning in open language models.Preprint, arXiv:2402.03300. Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu

  14. [14]

    Challeng- ing the boundaries of reasoning: An olympiad-level math benchmark for large language models.Preprint, arXiv:2503.21380. Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shix- uan Liu, Rui Lu, Kai Dang, Xiong-Hui Chen, Jianxin Yang, Zhenru Zhang, Yuqiong Liu, An Yang, Andrew Zhao, Yang Yue, Shiji Song, Bowen Yu, Gao Huang, and Junyang Lin

  15. [15]

    Dynamic Temperature Knowledge Distillation , url =

    Dynamic temperature knowledge distillation.Preprint, arXiv:2404.12711. Yuqiao Wen, Zichao Li, Wenyu Du, and Lili Mou

  16. [16]

    InProceedings of the 61st An- nual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10817– 10834, Toronto, Canada

    f-divergence minimization for sequence-level knowl- edge distillation. InProceedings of the 61st An- nual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10817– 10834, Toronto, Canada. Association for Computa- tional Linguistics. Taiqiang Wu, Chaofan Tao, Jiahao Wang, Runming Yang, Zhe Zhao, and Ngai Wong. 2025a. Re...

  17. [17]

    Chenxu Yang, Chuanyu Qin, Qingyi Si, Minghui Chen, Naibin Gu, Dingyu Yao, Zheng Lin, Weiping Wang, Jiaqi Wang, and Nan Duan

    Bapo: Stabilizing off-policy reinforcement learning for llms via balanced policy optimization with adap- tive clipping.Preprint, arXiv:2510.18927. Chenxu Yang, Chuanyu Qin, Qingyi Si, Minghui Chen, Naibin Gu, Dingyu Yao, Zheng Lin, Weiping Wang, Jiaqi Wang, and Nan Duan

  18. [18]

    Self-Distilled RLVR

    Self-distilled rlvr. Preprint, arXiv:2604.03128. Tianzhu Ye, Li Dong, Xun Wu, Shaohan Huang, and Furu Wei

  19. [19]

    On-policy context distillation for language models.Preprint, arXiv:2602.12275. Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xi- aochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gao- hong Liu, juncai liu, LingJun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, and 17 others

  20. [20]

    EDT: Improving large language models’ generation by entropy-based dynamic temperature sampling.arXiv preprint arXiv:2403.14541,

    Edt: Improving large language models’ generation by entropy-based dynamic temperature sampling. Preprint, arXiv:2403.14541. Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover

  21. [21]

    Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

    Self-distilled reasoner: On-policy self-distillation for large language models.Preprint, arXiv:2601.18734. Yuze Zhao, Jintao Huang, Jinghan Hu, Xingjun Wang, Yunlin Mao, Daoze Zhang, Hong Zhang, Zeyinzi Jiang, Zhikai Wu, Baole Ai, Ang Wang, Wenmeng Zhou, and Yingda Chen

  22. [22]

    Swift: a scalable lightweight infras- tructure for fine-tuning

    Swift:a scalable lightweight infrastructure for fine-tuning.Preprint, arXiv:2408.05517. Yixiao Zhou, Yang Li, Dongzhou Cheng, Hehe Fan, and Yu Cheng

  23. [23]

    10 A Hyperparameters A.1 RL Training (DAPO) We use DAPO (Yu et al.,

    Look inward to explore outward: Learning temperature policy from llm internal states via hierarchical rl.Preprint, arXiv:2602.13035. 10 A Hyperparameters A.1 RL Training (DAPO) We use DAPO (Yu et al.,

  24. [24]

    Table 3 lists the shared hyperparameters for Qwen3-4B-Base and Qwen3- 8B-Base; the two models use identical settings un- less otherwise noted

    as the base RL pro- cedure for all experiments, implemented on top of verl (Sheng et al., 2025). Table 3 lists the shared hyperparameters for Qwen3-4B-Base and Qwen3- 8B-Base; the two models use identical settings un- less otherwise noted. Hyperparameter Value Framework verl Training data DAPO-Math-17K Optimizer AdamW Learning rate 2e-6 LR warmup steps 10...

  25. [25]

    The global cosine similarity between ∆low and −∆high is −0.878, indicating strong directional anti-alignment that cannot arise by chance in an 8B-parameter space. The reverse projection co- efficient of 0.891 shows that the low-temperature update recovers approximately 89% of the high- temperature displacement in the anti-aligned direc- tion, and the norm...

  26. [26]

    to enable a fair per-sample comparison. ity and quantity of training prompts: data scarcity is a genuine bottleneck in RL-based post-training, par- ticularly for larger models that require more diverse and challenging prompts to sustain a meaningful gradient signal. To probe the robustness of policy reheating in this regime, we construct a data-scarce set...