pith. machine review for the scientific record.

arxiv: 2605.13255 · v1 · submitted 2026-05-13 · 💻 cs.AI

Recognition: no theorem link

Respecting Self-Uncertainty in On-Policy Self-Distillation for Efficient LLM Reasoning

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 19:47 UTC · model grok-4.3

classification 💻 cs.AI
keywords self-distillation · on-policy training · LLM reasoning · entropy guidance · efficient reasoning · token weighting

The pith

An entropy confidence gate that down-weights uncertain tokens improves the accuracy-length trade-off in on-policy self-distillation for LLM reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes EGRSD (Entropy-Guided Reinforced Self-Distillation) for training reasoning models on their own rollouts, where a teacher supplies token-level supervision. It replaces uniform weighting with three combined signals: a reward-based direction, a likelihood-ratio magnitude, and a teacher-entropy confidence gate that reduces influence from high-entropy positions while keeping a nonzero minimum weight on every token. A causal-lookahead variant, CL-EGRSD, further separates sustained uncertain spans from transient ones by examining future context. Experiments on Qwen3-4B and Qwen3-8B models show the resulting training advances the frontier of reasoning accuracy versus output length relative to prior trainable baselines.
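
How the three signals might compose: below is a minimal sketch of the per-token weight, assuming a sign-of-reward direction, an exponentiated log-likelihood ratio, and a linear entropy gate with a floor. The functional forms and constants (gamma, w_min, the entropy normalizer) are illustrative assumptions; the paper specifies only that high-entropy positions are down-weighted with a nonzero lower bound on every weight.

    import torch

    def egrsd_token_weights(
        reward: float,                  # sequence-level reward, e.g. +1/-1 from a verifier (assumed form)
        teacher_logp: torch.Tensor,     # [T] teacher log-prob of each sampled token
        student_logp: torch.Tensor,     # [T] student log-prob of each sampled token
        teacher_entropy: torch.Tensor,  # [T] entropy of the teacher's predictive distribution
        gamma: float = 0.5,             # gate strength; the appendix sweeps values such as 0.3, 0.5, 1
        w_min: float = 0.1,             # nonzero floor so no token is fully discarded (assumed value)
    ) -> torch.Tensor:
        # Reward-grounded direction: move toward the teacher on rewarded
        # rollouts, away on penalized ones (one plausible reading).
        direction = 1.0 if reward > 0 else -1.0
        # Teacher-student likelihood-ratio magnitude.
        ratio = (teacher_logp - student_logp).exp()
        # Entropy confidence gate: linear in normalized entropy, clamped at w_min.
        h = teacher_entropy / teacher_entropy.max().clamp(min=1e-6)
        gate = (1.0 - gamma * h).clamp(min=w_min)
        return direction * ratio * gate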

Core claim

On-policy self-distillation for reasoning improves when token-level supervision respects the teacher's predictive entropy: high-entropy positions receive lower weight through an entropy confidence gate, while reward direction and likelihood ratios still guide updates, and a lookahead mechanism distinguishes persistent uncertainty from temporary spikes.

What carries the argument

The teacher-entropy confidence gate, which modulates each token's distillation weight inversely with the teacher's entropy to focus learning on positions where the teacher is confident.

If this is right

  • Reasoning outputs become shorter for the same accuracy level after training with the modulated weights.
  • Supervision focuses on low-entropy positions where the teacher provides reliable guidance.
  • The causal-lookahead version handles sequences with extended uncertainty spans differently from brief spikes (see the sketch after this list).
  • The combined reward, ratio, and entropy signals operate without external models beyond the privileged-context teacher.
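
For the causal-lookahead variant, an appendix fragment describes the minimum over a short future window as the extremal causal smoothing filter. Below is a sketch of that filter under the same assumptions as the sketch above; the five-token window matches one of the paper's swept values, and treating the plain minimum as the exact rule is an inference from the appendix excerpt.

    import torch

    def lookahead_entropy(entropy: torch.Tensor, window: int = 5) -> torch.Tensor:
        # Replace each position's entropy with the minimum over itself and up
        # to `window` future positions. A transient spike whose following
        # context quickly turns low-entropy is rescued (its minimum stays low,
        # so the gate barely fires), while a sustained high-entropy span keeps
        # a high value and remains down-weighted.
        T = entropy.shape[0]
        smoothed = torch.empty_like(entropy)
        for t in range(T):
            smoothed[t] = entropy[t : min(t + window + 1, T)].min()
        return smoothed

CL-EGRSD would then gate on lookahead_entropy(teacher_entropy) in place of the raw per-token entropy.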

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The approach could lower inference cost by encouraging models to avoid prolonged uncertain reasoning chains.
  • Similar entropy modulation might apply to other token-level training objectives where uncertainty varies across sequences.
  • Testing whether the gate still helps when the teacher and student differ more substantially in size would clarify its scope.

Load-bearing premise

Reducing weight on high-entropy tokens improves overall reasoning quality without discarding critical information that appears only in uncertain positions.

What would settle it

Training the same Qwen3 models with uniform token weights instead of the entropy gate produces equal or better accuracy-length results on the evaluation tasks.
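
In code terms, this ablation amounts to forcing the gate to one while holding everything else fixed. A hypothetical harness on top of the egrsd_token_weights sketch above:

    import torch

    T = 8  # toy sequence length, for illustration only
    t_logp, s_logp = torch.randn(T), torch.randn(T)
    t_entropy = torch.rand(T) * 3.0

    gated = egrsd_token_weights(1.0, t_logp, s_logp, t_entropy)
    # Zero entropy drives the gate to 1 at every position, i.e. uniform weighting.
    uniform = egrsd_token_weights(1.0, t_logp, s_logp, torch.zeros_like(t_entropy))
    # If training with `uniform` matches or beats `gated` on the
    # accuracy-length frontier, the gate is not doing the claimed work.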

Figures

Figures reproduced from arXiv: 2605.13255 by Conghui He, Junlong Ke, Linfeng Zhang, Weijia Li, Zichen Wen.

Figure 1
Figure 1: Accuracy–length trade-off on Qwen3-8B: EGRSD and CL-EGRSD (ours) extend the Pareto frontier. All trainable baselines are dominated.
Figure 2
Figure 2: Overview of the proposed method. The token-level update multiplies three signals.
Figure 3
Figure 3: Per-token predictive entropy on a representative reasoning trace (Qwen3-4B).
Figure 4
Figure 4: Mechanism diagnostics on 5.5M held-out tokens (1,688 completions from …).
original abstract

On-policy self-distillation trains a reasoning model on its own rollouts while a teacher, often the same model conditioned on privileged context, provides dense token-level supervision. Existing objectives typically weight the teacher's token-level signal uniformly across a chain-of-thought sequence, despite substantial variation in the entropy of the teacher's predictive distribution. We propose EGRSD (Entropy-Guided Reinforced Self-Distillation), which unifies token-level updates through three signals: a reward-grounded direction, a teacher-student likelihood-ratio magnitude, and the proposed teacher-entropy confidence gate that down-weights high-entropy token positions while maintaining a nonzero lower bound on every token weight. We further introduce CL-EGRSD, a causal-lookahead variant that distinguishes sustained high-entropy spans from transient high-entropy positions whose following context rapidly becomes low entropy. Experiments with Qwen3-4B and Qwen3-8B in thinking mode show that EGRSD and CL-EGRSD advance the accuracy-length frontier among the compared trainable methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes EGRSD (Entropy-Guided Reinforced Self-Distillation) for on-policy self-distillation of LLM reasoning models. It combines a reward-grounded update direction, a teacher-student likelihood-ratio term, and a teacher-entropy confidence gate that down-weights high-entropy token positions while enforcing a nonzero lower bound on all weights. A causal-lookahead variant (CL-EGRSD) is introduced to differentiate sustained high-entropy spans from transient ones. Experiments on Qwen3-4B and Qwen3-8B in thinking mode are claimed to advance the accuracy-length frontier relative to other trainable baselines.

Significance. If the empirical results are substantiated, the work would offer a principled mechanism for handling predictive uncertainty during self-distillation, potentially improving the efficiency of reasoning chains without uniform token weighting. The explicit unification of reward, likelihood, and entropy signals plus the causal lookahead distinction constitute concrete technical advances over prior uniform-distillation objectives.

major comments (2)
  1. Experiments section: the claim that EGRSD and CL-EGRSD advance the accuracy-length frontier is stated without any quantitative metrics, baseline tables, error bars, or statistical tests, leaving the central empirical assertion without verifiable support.
  2. Section introducing the teacher-entropy confidence gate: the design rests on the assumption that selectively down-weighting high-entropy tokens improves net reasoning quality, yet no ablation, token-level analysis, or discussion addresses whether these positions frequently encode critical inference steps or self-corrections in on-policy CoT rollouts.
minor comments (1)
  1. Abstract: the frontier-advancement claim would be more persuasive if at least one concrete accuracy or length number were reported.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the thoughtful comments and the recommendation for major revision. We believe the suggested additions will strengthen the manuscript and address the concerns raised. Below we provide point-by-point responses to the major comments.

point-by-point responses
  1. Referee: Experiments section: the claim that EGRSD and CL-EGRSD advance the accuracy-length frontier is stated without any quantitative metrics, baseline tables, error bars, or statistical tests, leaving the central empirical assertion without verifiable support.

    Authors: We thank the referee for highlighting this. The experiments section presents results through accuracy-length frontier plots comparing EGRSD and CL-EGRSD against other trainable baselines on the Qwen3 models. To make the quantitative support more explicit and verifiable, we will include a dedicated table with numerical values for accuracy and average response length, along with standard deviations computed over multiple random seeds and appropriate statistical tests for significance. revision: yes

  2. Referee: Section introducing the teacher-entropy confidence gate: the design rests on the assumption that selectively down-weighting high-entropy tokens improves net reasoning quality, yet no ablation, token-level analysis, or discussion addresses whether these positions frequently encode critical inference steps or self-corrections in on-policy CoT rollouts.

    Authors: This comment correctly identifies a gap in the empirical validation of the entropy gate's design choice. While the method is motivated by the principle that high-entropy predictions are less trustworthy for distillation, we did not provide supporting analysis on the nature of high-entropy tokens in the rollouts. In the revised manuscript, we will add an ablation study comparing performance with and without the entropy gate, as well as a token-level examination of several examples to determine the prevalence of critical reasoning steps or self-corrections in high-entropy positions. revision: yes

Circularity Check

0 steps flagged

No circularity; objective defined from external signals and evaluated on held-out performance

full rationale

The paper defines EGRSD and CL-EGRSD from three external signals (reward-grounded direction, teacher-student likelihood ratio, and entropy-based gate) without any self-referential fitting or renaming that would reduce the claimed advance to an input by construction. No equation or self-citation forces the accuracy-length improvement; the method is presented as a weighting scheme evaluated on Qwen3 models against baselines on held-out metrics, so the claim is checked against external benchmarks rather than against the method's own definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities. The method appears to rest on standard RL and distillation assumptions plus the new entropy gate whose scaling details are unspecified.

pith-pipeline@v0.9.0 · 5489 in / 1107 out tokens · 95461 ms · 2026-05-14T19:47:16.821590+00:00 · methodology

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · 10 internal anchors

  1. [1]

    AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning

    Yang Chen, Zhuolin Yang, Zihan Liu, Chankyu Lee, Peng Xu, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. AceReason-Nemotron: Advancing math and code reasoning through reinforcement learning. arXiv preprint arXiv:2505.16400.

  2. [2]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.

  3. [3]

    OpenThoughts: Data Recipes for Reasoning Models

    Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, Ashima Suvarna, Benjamin Feuer, Liangyu Chen, Zaid Khan, Eric Frankel, Sachin Grover, Caroline Choi, Niklas Muennighoff, et al. OpenThoughts: Data recipes for reasoning models. arXiv preprint arXiv:2506.04178.

  4. [4]

    Entropy-Aware On-Policy Distillation of Language Models

    Woogyeol Jin, Taywon Min, Yongjin Yang, Swanand Ravindra Kadhe, Yi Zhou, Dennis Wei, Nathalie Baracaldo, and Kimin Lee. Entropy-aware on-policy distillation of language models. arXiv preprint arXiv:2603.07079.

  5. [5]

    Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?

    Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dohyung Kim, Jiwon Jeon, Dongsheng Li, and Yuqing Yang. Why does self-distillation (sometimes) degrade the reasoning capability of LLMs? arXiv preprint arXiv:2603.24472.

  6. [6]

    Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning

    Ming Li, Lichang Chen, Jiuhai Chen, Shwai He, Jiuxiang Gu, and Tianyi Zhou. Selective reflection-tuning: Student-selected data recycling for LLM instruction-tuning. In Findings of the Association for Computational Linguistics: ACL 2024, pages 16189–16211, Bangkok, Thailand. doi: 10.18653/v1/2024.findings-acl.958. URL https://aclanthology.org/2024.findings-acl.958/.

  7. [7]

    Let's Verify Step by Step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. In The Twelfth International Conference on Learning Representations.

  8. [8]

    Qwen3 Technical Report

    Qwen Team. Qwen3 technical report. arXiv preprint arXiv:2505.09388.

  9. [9]

    CRISP: Compressed Reasoning via Iterative Self-Policy Distillation

    Hejian Sang, Yuanda Xu, Zhengze Zhou, Ran He, Zhipeng Wang, and Jiachen Sun. CRISP: Compressed reasoning via iterative self-policy distillation. arXiv preprint arXiv:2603.05433.

  10. [10]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Yu Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.

  11. [11]

    A Survey of On-Policy Distillation for Large Language Models

    Mingyang Song and Mao Zheng. A survey of on-policy distillation for large language models. arXiv preprint arXiv:2604.00626.

  12. [12]

    GATES: Self-Distillation under Privileged Context with Consensus Gating

    Alex Stein, Furong Huang, and Tom Goldstein. GATES: Self-distillation under privileged context with consensus gating. arXiv preprint arXiv:2602.20574, 2026.

  13. [13]

    Chain of Draft: Thinking Faster by Writing Less

    Silei Xu, Wenhao Xie, Lingxiao Zhao, and Pengcheng He. Chain of draft: Thinking faster by writing less. arXiv preprint arXiv:2502.18600.

  14. [14]

    PACED: Distillation and On-Policy Self-Distillation at the Frontier of Student Competence

    Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, and Zhipeng Wang. PACED: Distillation and on-policy self-distillation at the frontier of student competence. arXiv preprint arXiv:2603.11178.

  15. [15]

    Self-Distilled RLVR

    Chenxu Yang, Chuanyu Qin, Qingyi Si, Minghui Chen, Naibin Gu, Dingyu Yao, Zheng Lin, Weiping Wang, Jiaqi Wang, and Nan Duan. Self-distilled RLVR. arXiv preprint arXiv:2604.03128.

  16. [16]

    Embarrassingly Simple Self-Distillation Improves Code Generation

    Ruixiang Zhang, Richard He Bai, Huangjie Zheng, Navdeep Jaitly, Ronan Collobert, and Yizhe Zhang. Embarrassingly simple self-distillation improves code generation. arXiv preprint arXiv:2604.01193, 2026a.

  17. [17]

    Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

    Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models. arXiv preprint arXiv:2601.18734.

  18. [18]

    Appendix A Extended related work 13 B Background details 14 C Derivations for the main-text remarks 15 C.1 Geometric interpretation of the linear gate

    12 Preprint. Appendix A Extended related work 13 B Background details 14 C Derivations for the main-text remarks 15 C.1 Geometric interpretation of the linear gate . . . . . . . . . . . . . . . . . . . . 15 C.2 Minimum as the extremal causal smoothing filter . . . . . . . . . . . . . . . 15 D Full experimental details 16 E Hyperparameters 17 F Training dy...

  19. [19]

    is the closest compression-oriented baseline to our setting and uses iterative self-policy distillation to encourage concise reasoning. Our method is complementary in goal but different in mechanism: rather than imposing a prompt-based conciseness instruction or treating all teacher positions uniformly, EGRSD changes the token-level confidence weighting o...

  20. [20]

    , hW ) and every c≥ 0:(a)per-argument monotonicity, with ϕ non-decreasing in each coordinate separately; (b)conservativity, with ϕ(h0,

    C.2 Minimum as the extremal causal smoothing filter Definition 1(Causal smoothing filter family).For a window W≥ 1, let FW denote the class of functions ϕ:R W+1 ≥0 →R ≥0 satisfying, for every input (h0, . . ., hW ) and every c≥ 0:(a)per-argument monotonicity, with ϕ non-decreasing in each coordinate separately; (b)conservativity, with ϕ(h0, . . ., hW )≤h ...

  21. [21]

    We therefore use our direction-aware baseline (γ=0) as its reference point in the ablations

    is conceptually highly relevant, it lacks public training code at the time of writing. We therefore use our direction-aware baseline (γ=0) as its reference point in the ablations. Training data.Our training data configuration matches that of OPSD (Zhao et al., 2026): the subset of OpenThoughts-114k (Guha et al.,

  22. [22]

    Each sample provides a problem x and a concise reference solution s⋆

    reasoning traces and filtered to answer-verified samples (Zhao, 2026). Each sample provides a problem x and a concise reference solution s⋆. Our data collator uses only the problem and solution columns, so the teacher context is (x, s⋆, y<t) and the student context is (x, y<t). All compared methods share the same training data and preprocessing, so accura...

  23. [23]

    is a held- out 500-problem subset of the MATH competition dataset (Hendrycks et al., 2021).Minerva Math(Lewkowycz et al.,

  24. [24]

    MATH500 and GSM8K are excluded 18 Preprint. W=0 W=3 W=5 W=7 =0.3 =0.5 =1 76.67 75.00 74.72 75.28 75.00 73.61 76.11 76.11 76.39 74.17 77.22 74.44 AIME 2024 W=0 W=3 W=5 W=7 =0.3 =0.5 =1 67.50 66.67 69.17 70.83 67.50 65.83 65.00 68.33 66.67 65.83 70.00 70.83 AIME 2025 W=0 W=3 W=5 W=7 =0.3 =0.5 =1 41.67 50.00 45.00 45.83 43.33 42.50 47.50 40.00 47.50 37.50 52...

  25. [25]

    charged? Wait, in

    Figure A4 visualizes pivot rescue directly, showing per-token current entropy, five-token lookahead entropy, and the corresponding EGRSD and CL-EGRSD weights on representative reasoning windows. Figure A5 further overlays the top-K local entropy peaks on two Minerva completions together with their 4-token left context, verifying that most annotated peaks ...

  26. [26]

    Absolute performance is much lower than on Qwen3 or on reasoning-tuned external bases, so this experiment is not intended as a headline accuracy comparison

    67.78 49.17 35.83 86.7033.0991.94 60.75 7,908 CL-EGRSD(γ=0.5) 67.22 52.50 35.83 86.90 32.35 93.91 61.45 7,943 J Weak-base cross-architecture diagnostic on Olmo-3-7B Base We also run a weak-base cross-architecture diagnostic onOlmo-3-7B Base(Groeneveld et al., 2024), a non-reasoning-tuned external base model. Absolute performance is much lower than on Qwen...

  27. [27]

    13.0613.335.0053.9010.57 55.67 25.25 7,311 CRISP (Sang et al.,