Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?
Pith reviewed 2026-05-15 00:39 UTC · model grok-4.3
The pith
Self-distillation suppresses uncertainty expression in LLMs, degrading performance on out-of-domain reasoning tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Conditioning the teacher on rich information during self-distillation suppresses uncertainty expression, enabling rapid in-domain optimization with limited task coverage but harming OOD performance, where unseen problems benefit from expressing uncertainty and adjusting accordingly.
What carries the argument
Suppression of epistemic verbalization, the model's expression of uncertainty while reasoning.
If this is right
- Self-distilled models optimize faster on problems matching the training distribution.
- Out-of-domain accuracy falls when uncertainty signals are removed from reasoning traces.
- Robust generalization requires uncertainty verbalization in addition to correct final answers.
- Post-training should optimize reasoning behavior rather than only reinforcing answer traces.
Where Pith is reading between the lines
- Other post-training methods that supply rich conditioning may produce similar suppression of uncertainty.
- Distillation variants that explicitly retain uncertainty phrases could be tested to protect out-of-domain performance.
- Reasoning benchmarks with higher out-of-distribution coverage would make such degradations easier to detect.
Load-bearing premise
The observed out-of-domain performance drop is caused primarily by reduced uncertainty expression rather than other unmeasured effects of self-distillation.
What would settle it
A controlled run of self-distillation that preserves the original frequency of uncertainty phrases and then measures whether out-of-domain accuracy remains unchanged.
Original abstract
Self-distillation has emerged as an effective post-training paradigm for LLMs, often improving performance while shortening reasoning traces. However, in mathematical reasoning, we find that it can reduce response length while degrading performance. We trace this degradation to the suppression of epistemic verbalization - the model's expression of uncertainty during reasoning. Through controlled experiments varying conditioning context richness and task coverage, we show that conditioning the teacher on rich information suppresses uncertainty expression, enabling rapid in-domain optimization with limited task coverage but harming OOD performance, where unseen problems benefit from expressing uncertainty and adjusting accordingly. Across Qwen3-8B, DeepSeek-Distill-Qwen-7B, and Olmo3-7B-Instruct, we observe performance drops of up to 40%. Our findings highlight that exposing appropriate levels of uncertainty is crucial for robust reasoning and underscore the importance of optimizing reasoning behavior beyond merely reinforcing correct answer traces.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that self-distillation in LLMs for mathematical reasoning can degrade OOD performance by suppressing epistemic verbalization (the expression of uncertainty during reasoning). Through controlled experiments that vary teacher conditioning context richness and task coverage across Qwen3-8B, DeepSeek-Distill-Qwen-7B, and Olmo3-7B-Instruct, it reports that rich conditioning enables rapid in-domain gains with shorter traces but causes performance drops of up to 40% on unseen problems, where uncertainty expression aids adjustment; the work concludes that optimizing reasoning behavior must preserve appropriate uncertainty rather than only reinforcing correct traces.
Significance. If the causal mechanism is confirmed, the result would be significant for post-training of LLMs, as it identifies a concrete drawback of standard self-distillation (suppression of epistemic signals) that can produce brittle models despite in-domain gains. The multi-model empirical scope and focus on OOD robustness provide a useful counterpoint to the common assumption that shorter, correct reasoning traces are always beneficial.
Major comments (2)
- [Controlled experiments] Controlled experiments section: the design varies teacher conditioning richness and task coverage but lacks an ablation that holds trace length, correctness rate, and overall style fixed while independently manipulating verbalized uncertainty (e.g., via targeted insertion or suppression prompts). Without this isolation, the observed OOD drops cannot be attributed specifically to suppressed epistemic verbalization rather than generic compression or other unmeasured distillation effects.
- [Results] Results section: performance drops of up to 40% are reported across models, yet the manuscript provides no details on the exact evaluation metrics, statistical controls (e.g., variance across seeds or significance tests), or problem exclusion rules. This weakens the ability to verify that the OOD degradation is robust and directly tied to the proposed mechanism.
Minor comments (2)
- [Abstract] Abstract: the phrase 'up to 40%' is used without specifying the baseline model, exact benchmark, or in-domain vs. OOD split; adding one sentence of clarification would improve precision.
- [Introduction] The operational definition of 'epistemic verbalization' is introduced but would benefit from explicit examples of counted uncertainty phrases and how they are measured in traces.
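One way to make the operational definition requested above concrete is to count uncertainty phrases per trace. The sketch below is illustrative only: the marker list and the function names `count_markers` and `markers_per_100_words` are assumptions, not the paper's lexicon or measurement procedure.

```python
import re

# Hypothetical marker list for illustration; the paper's actual set of
# counted uncertainty phrases is not given in this review.
UNCERTAINTY_MARKERS = ("wait", "hmm", "maybe", "perhaps", "not sure", "let me check")

# One case-insensitive pattern that matches any marker on word boundaries.
_MARKER_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(m) for m in UNCERTAINTY_MARKERS) + r")\b",
    re.IGNORECASE,
)


def count_markers(trace: str) -> int:
    """Return the number of uncertainty-marker occurrences in a reasoning trace."""
    return len(_MARKER_RE.findall(trace))


def markers_per_100_words(trace: str) -> float:
    """Return marker occurrences normalized by whitespace-delimited word count."""
    words = trace.split()
    if not words:
        return 0.0
    return 100.0 * count_markers(trace) / len(words)
```

Normalizing by word count matters here because self-distillation is reported to shorten traces; a raw count would fall even if the density of uncertainty expression stayed constant.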
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive comments. We agree that additional controls and details will strengthen the manuscript and address both major points with new experiments and expanded reporting in the revision.
- Referee: Controlled experiments section: the design varies teacher conditioning richness and task coverage but lacks an ablation that holds trace length, correctness rate, and overall style fixed while independently manipulating verbalized uncertainty (e.g., via targeted insertion or suppression prompts). Without this isolation, the observed OOD drops cannot be attributed specifically to suppressed epistemic verbalization rather than generic compression or other unmeasured distillation effects.
  Authors: We acknowledge the value of a more isolated ablation. Our existing variations in conditioning richness do modulate uncertainty expression while producing measurable changes in trace length and OOD performance, but they do not fully decouple uncertainty from other factors. In the revision we will add a targeted ablation that holds trace length, correctness rate, and stylistic features fixed (via controlled prompting or post-processing) while independently inserting or suppressing epistemic verbalization. This will be reported in an expanded Controlled Experiments section.
  Revision: yes
- Referee: Results section: performance drops of up to 40% are reported across models, yet the manuscript provides no details on the exact evaluation metrics, statistical controls (e.g., variance across seeds or significance tests), or problem exclusion rules. This weakens the ability to verify that the OOD degradation is robust and directly tied to the proposed mechanism.
  Authors: We agree that these details are necessary for verification. The revised manuscript will include a new subsection (with supporting appendix tables) specifying the exact metrics (accuracy via exact-match on MATH, GSM8K, and OOD splits), statistical controls (standard deviation across five random seeds, paired t-tests with p-values), and explicit problem exclusion/selection criteria for the OOD sets. This will make the reported drops fully reproducible and allow direct assessment of robustness.
  Revision: yes
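The statistical controls named in the rebuttal (standard deviation across seeds, paired t-tests) can be sketched as follows. This is a minimal standard-library illustration, not the authors' evaluation code: the function names are hypothetical, the inputs are assumed to be per-seed accuracies with matching seed order, and converting the statistic to a p-value would additionally require a Student's t distribution with n - 1 degrees of freedom, which is omitted here.

```python
import math
import statistics


def mean_and_std(per_seed_scores):
    """Return (mean, sample standard deviation) of per-seed scores."""
    return statistics.mean(per_seed_scores), statistics.stdev(per_seed_scores)


def paired_t_statistic(before, after):
    """Paired t statistic for matched runs (same seeds, same order).

    `before` and `after` are assumed to be per-seed accuracies of the base
    and self-distilled model respectively. A negative value means the
    scores in `after` are lower on average.
    """
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need two equally sized samples with at least 2 runs")
    diffs = [a - b for a, b in zip(after, before)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    # t = mean difference divided by its standard error.
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
```

Pairing by seed removes seed-to-seed variance that both models share, which is why a paired test is the natural choice when the same five seeds are used for each condition.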
Circularity Check
No circularity: empirical observations rest on experimental comparisons, not self-referential definitions or fitted predictions
The paper reports controlled experiments varying teacher conditioning richness and task coverage, then measures performance drops (up to 40%) and links them to reduced epistemic verbalization. No equations, ansatzes, or derivations are present that reduce to their own inputs by construction. No parameters are fitted to a subset and then relabeled as predictions. Self-citations, if any, are not load-bearing for the core empirical claim, which is falsifiable via the described ablations and cross-model observations. The analysis is self-contained against external benchmarks (multiple LLMs, in-domain vs OOD splits) and does not invoke uniqueness theorems or rename known results.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Epistemic verbalization (expression of uncertainty) improves out-of-domain reasoning performance.
Invented entities (1)
- Epistemic verbalization: no independent evidence.
Forward citations
Cited by 18 Pith papers
- Respecting Self-Uncertainty in On-Policy Self-Distillation for Efficient LLM Reasoning
  EGRSD and CL-EGRSD advance the accuracy-length frontier in LLM reasoning by entropy-guided weighting of token-level distillation signals from the teacher.
- Multi-Rollout On-Policy Distillation via Peer Successes and Failures
  MOPD improves on-policy distillation for LLMs by using peer successes for positive patterns and failures for negative examples to create more informative teacher signals.
- Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR
  RLRT augments GRPO by reinforcing tokens on correct student rollouts that the teacher would not have predicted, outperforming standard self-distillation and exploration baselines on Qwen3 models.
- TRACE: Distilling Where It Matters via Token-Routed Self On-Policy Alignment
  TRACE improves math reasoning by distilling only on annotator-marked critical spans with forward KL on correct key spans, optional reverse KL on errors, and GRPO elsewhere, gaining 2.76 points over GRPO while preservi...
- KL for a KL: On-Policy Distillation with Control Variate Baseline
  vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensiv...
- Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization
  PBSD derives a reward-reweighted teacher distribution as the analytic optimum of a reward-regularized objective, yielding better stability and performance than KL-based self-distillation on math reasoning and tool-use tasks.
- Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models
  OPD for LLMs suffers length inflation and repetition collapse; StableOPD uses reference divergence and rollout mixing to prevent it and improve math reasoning performance by 7.2% on average.
- How to Fine-Tune a Reasoning Model? A Teacher-Student Cooperation Framework to Synthesize Student-Consistent SFT Data
  TESSY creates stylistically consistent synthetic data via teacher-student token interleaving, yielding 11.25% and 6.68% gains on code benchmarks where pure teacher data causes 3.25% and 10.02% drops.
- Learning with Rare Success but Rich Feedback via Reflection-Enhanced Self-Distillation
  RESD turns failure trajectories into token-level supervision via retrospective reflections and a persistent global playbook, enabling faster improvement than standard self-distillation or GRPO with only one rollout pe...
- Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information
  Anti-Self-Distillation reverses self-distillation signals via PMI to fix overconfidence on structural tokens, matching GRPO baseline accuracy 2-10x faster with up to 11.5 point gains across 4B-30B models.
- Selective Off-Policy Reference Tuning with Plan Guidance
  SORT turns all-wrong prompts into selective learning signals by weighting tokens more predictable under plan guidance from reference solutions, improving over GRPO on reasoning benchmarks especially for weaker models.
- Multilingual Safety Alignment via Self-Distillation
  MSD enables cross-lingual safety transfer in LLMs via self-distillation with Dual-Perspective Safety Weighting, improving safety in low-resource languages without target response data.
- AtManRL: Towards Faithful Reasoning via Differentiable Attention Saliency
  AtManRL learns an additive attention mask on CoT traces to produce a saliency reward that, when combined with outcome rewards in GRPO, trains LLMs to generate reasoning that genuinely influences final predictions.
- Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe
  On-policy distillation works when student and teacher models share thinking patterns and the teacher adds new capabilities, with success tied to alignment on a small set of high-probability tokens.
- Selective Off-Policy Reference Tuning with Plan Guidance
  SORT converts all-failed reasoning prompts into selective, structure-aware training signals by weighting tokens according to how much a reference-derived plan increases their probability.
- On-Policy Distillation with Best-of-N Teacher Rollout Selection
  BRTS improves on-policy distillation by sampling multiple teacher rollouts and selecting the best one via a correctness-first then alignment priority rule, yielding gains on AIME and AMC math benchmarks.
- On-Policy Distillation with Best-of-N Teacher Rollout Selection
  BRTS improves on-policy distillation by selecting the highest-quality teacher trajectory from a small pool of samples based on correctness and alignment with the student, yielding gains on AIME and AMC math benchmarks.
- Multilingual Safety Alignment via Self-Distillation
  MSD transfers LLM safety from high-resource to low-resource languages via self-distillation and dual-perspective weighting without needing response data.