arxiv: 2603.24472 · v2 · submitted 2026-03-25 · 💻 cs.CL · cs.LG


Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?

Dohyung Kim, Dongsheng Li, Jeonghye Kim, Jiwon Jeon, Minbeom Kim, Sangmook Lee, Xufang Luo, Yuqing Yang


Pith reviewed 2026-05-15 00:39 UTC · model grok-4.3


keywords: self-distillation, epistemic verbalization, LLM reasoning, out-of-domain performance, uncertainty expression, mathematical reasoning, post-training methods


The pith

Self-distillation suppresses uncertainty expression in LLMs, degrading performance on out-of-domain reasoning tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Self-distillation trains models on their own generated outputs and often shortens reasoning traces while boosting in-domain results. In mathematical reasoning, however, the same process reduces the model's expression of uncertainty during step-by-step thinking. When the teacher is given rich context, the student quickly fits the limited training distribution but loses the ability to flag doubt and revise on novel problems. Experiments across Qwen3-8B, DeepSeek-Distill-Qwen-7B, and Olmo3-7B-Instruct document out-of-domain drops reaching 40 percent. The central finding is that appropriate uncertainty verbalization is required for reasoning that generalizes beyond the training set.

Core claim

Conditioning the teacher on rich information during self-distillation suppresses uncertainty expression, enabling rapid in-domain optimization with limited task coverage but harming OOD performance, where unseen problems benefit from expressing uncertainty and adjusting accordingly.

What carries the argument

Suppression of epistemic verbalization, the model's expression of uncertainty while reasoning.

If this is right

Self-distilled models optimize faster on problems matching the training distribution.
Out-of-domain accuracy falls when uncertainty signals are removed from reasoning traces.
Robust generalization requires uncertainty verbalization in addition to correct final answers.
Post-training should optimize reasoning behavior rather than only reinforcing answer traces.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Other post-training methods that supply rich conditioning may produce similar suppression of uncertainty.
Distillation variants that explicitly retain uncertainty phrases could be tested to protect out-of-domain performance.
Reasoning benchmarks with higher out-of-distribution coverage would make such degradations easier to detect.

Load-bearing premise

The observed out-of-domain performance drop is caused primarily by reduced uncertainty expression rather than other unmeasured effects of self-distillation.

What would settle it

A controlled run of self-distillation that preserves the original frequency of uncertainty phrases and then measures whether out-of-domain accuracy remains unchanged.

Original abstract

Self-distillation has emerged as an effective post-training paradigm for LLMs, often improving performance while shortening reasoning traces. However, in mathematical reasoning, we find that it can reduce response length while degrading performance. We trace this degradation to the suppression of epistemic verbalization - the model's expression of uncertainty during reasoning. Through controlled experiments varying conditioning context richness and task coverage, we show that conditioning the teacher on rich information suppresses uncertainty expression, enabling rapid in-domain optimization with limited task coverage but harming OOD performance, where unseen problems benefit from expressing uncertainty and adjusting accordingly. Across Qwen3-8B, DeepSeek-Distill-Qwen-7B, and Olmo3-7B-Instruct, we observe performance drops of up to 40%. Our findings highlight that exposing appropriate levels of uncertainty is crucial for robust reasoning and underscore the importance of optimizing reasoning behavior beyond merely reinforcing correct answer traces.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Self-distillation can suppress uncertainty expression in reasoning traces and hurt OOD math performance, but the experiments do not fully isolate that mechanism from other changes like trace compression.


The core finding is that self-distillation, when the teacher sees rich context, reduces how often the model verbalizes uncertainty during step-by-step math reasoning. This speeds up in-domain gains but leads to bigger drops on out-of-distribution problems, where expressing doubt helps the model backtrack or adjust. The authors show this pattern on Qwen3-8B, DeepSeek-Distill-Qwen-7B, and Olmo3-7B-Instruct, with drops reaching 40 percent in some cases, and they link it to varying teacher conditioning richness and task coverage. That is a practical observation worth noting for anyone tuning reasoning models after pretraining.

The controlled comparisons across models and the focus on epistemic verbalization as a specific behavior change are the parts that stand out as useful. The paper does not just report shorter traces; it tries to connect the suppression of uncertainty talk to the robustness loss.

The soft spot is that the causal claim still rests on the observed correlation rather than a tight ablation. Changing context richness affects multiple things at once (length, style, step distribution), so it is hard to say uncertainty expression is the primary driver without an experiment that holds those other factors fixed while restoring or removing the verbalized uncertainty. The abstract also leaves the exact measurement of epistemic verbalization and the statistical controls a bit thin, though the full text may fill that in.

This paper is for people working on post-training pipelines for reasoning LLMs who want to understand why some distillation runs lose robustness. It is not a complete story on the mechanism, but the direction is clear enough that a serious editor should send it to referees rather than desk-reject it. The work shows honest engagement with a real training issue and deserves the chance for reviewers to press on the isolation of the effect.

Referee Report

2 major / 2 minor

Summary. The paper claims that self-distillation in LLMs for mathematical reasoning can degrade OOD performance by suppressing epistemic verbalization (the expression of uncertainty during reasoning). Through controlled experiments that vary teacher conditioning context richness and task coverage across Qwen3-8B, DeepSeek-Distill-Qwen-7B, and Olmo3-7B-Instruct, it reports that rich conditioning enables rapid in-domain gains with shorter traces but causes performance drops of up to 40% on unseen problems, where uncertainty expression aids adjustment; the work concludes that optimizing reasoning behavior must preserve appropriate uncertainty rather than only reinforcing correct traces.

Significance. If the causal mechanism is confirmed, the result would be significant for post-training of LLMs, as it identifies a concrete drawback of standard self-distillation (suppression of epistemic signals) that can produce brittle models despite in-domain gains. The multi-model empirical scope and focus on OOD robustness provide a useful counterpoint to the common assumption that shorter, correct reasoning traces are always beneficial.

major comments (2)

[Controlled experiments] Controlled experiments section: the design varies teacher conditioning richness and task coverage but lacks an ablation that holds trace length, correctness rate, and overall style fixed while independently manipulating verbalized uncertainty (e.g., via targeted insertion or suppression prompts). Without this isolation, the observed OOD drops cannot be attributed specifically to suppressed epistemic verbalization rather than generic compression or other unmeasured distillation effects.
[Results] Results section: performance drops of up to 40% are reported across models, yet the manuscript provides no details on the exact evaluation metrics, statistical controls (e.g., variance across seeds or significance tests), or problem exclusion rules. This weakens the ability to verify that the OOD degradation is robust and directly tied to the proposed mechanism.
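
The ambiguity behind a figure like "up to 40%" can be made concrete with a minimal sketch. The function names and the accuracy values below are hypothetical and are not taken from the paper; the point is only that the same pair of accuracies yields very different numbers depending on whether the drop is read as percentage points or as a fraction of the baseline.

```python
def absolute_drop(before: float, after: float) -> float:
    """Drop in percentage points; both inputs are accuracies in percent."""
    return before - after


def relative_drop(before: float, after: float) -> float:
    """Drop expressed as a percentage of the starting accuracy."""
    if before <= 0:
        raise ValueError("baseline accuracy must be positive")
    return 100.0 * (before - after) / before


# Hypothetical accuracies for illustration only (not from the paper).
base_acc, distilled_acc = 50.0, 30.0
print(absolute_drop(base_acc, distilled_acc))  # 20.0 percentage points
print(relative_drop(base_acc, distilled_acc))  # 40.0 percent of the baseline
```

Under these made-up numbers, a "40%" drop would be a relative figure, while the absolute loss is 20 points. Reporting both, together with the baseline, would remove the ambiguity.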

minor comments (2)

[Abstract] Abstract: the phrase 'up to 40%' is used without specifying the baseline model, exact benchmark, or in-domain vs. OOD split; adding one sentence of clarification would improve precision.
[Introduction] The operational definition of 'epistemic verbalization' is introduced but would benefit from explicit examples of counted uncertainty phrases and how they are measured in traces.
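
One way such a measurement could be operationalized is a simple marker count over reasoning traces. The sketch below is an assumption made for illustration, not the paper's method: the marker list, the function names, and the two example traces are all hypothetical.

```python
import re

# Hypothetical marker list for illustration; the paper's actual list is not given here.
UNCERTAINTY_MARKERS = ["wait", "hmm", "maybe", "perhaps", "not sure", "let me check"]

# Case-insensitive, whole-word match on any marker.
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(m) for m in UNCERTAINTY_MARKERS) + r")\b",
    re.IGNORECASE,
)


def count_markers(trace: str) -> int:
    """Number of uncertainty markers found in one reasoning trace."""
    return len(_PATTERN.findall(trace))


def markers_per_100_words(trace: str) -> float:
    """Marker count normalized by trace length, so short traces are comparable."""
    words = trace.split()
    if not words:
        return 0.0
    return 100.0 * count_markers(trace) / len(words)


# Made-up traces, not model outputs from the paper.
base_trace = "Wait, maybe the sign is wrong. Let me check the second step again."
distilled_trace = "The sign is negative. The answer is 4."
print(count_markers(base_trace), count_markers(distilled_trace))  # 3 0
```

Normalizing by length matters here because self-distillation also shortens traces; a raw count alone would conflate fewer markers with fewer words.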

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments. We agree that additional controls and details will strengthen the manuscript and address both major points with new experiments and expanded reporting in the revision.


Referee: Controlled experiments section: the design varies teacher conditioning richness and task coverage but lacks an ablation that holds trace length, correctness rate, and overall style fixed while independently manipulating verbalized uncertainty (e.g., via targeted insertion or suppression prompts). Without this isolation, the observed OOD drops cannot be attributed specifically to suppressed epistemic verbalization rather than generic compression or other unmeasured distillation effects.

Authors: We acknowledge the value of a more isolated ablation. Our existing variations in conditioning richness do modulate uncertainty expression while producing measurable changes in trace length and OOD performance, but they do not fully decouple uncertainty from other factors. In the revision we will add a targeted ablation that holds trace length, correctness rate, and stylistic features fixed (via controlled prompting or post-processing) while independently inserting or suppressing epistemic verbalization. This will be reported in an expanded Controlled Experiments section.

revision: yes

Referee: Results section: performance drops of up to 40% are reported across models, yet the manuscript provides no details on the exact evaluation metrics, statistical controls (e.g., variance across seeds or significance tests), or problem exclusion rules. This weakens the ability to verify that the OOD degradation is robust and directly tied to the proposed mechanism.

Authors: We agree that these details are necessary for verification. The revised manuscript will include a new subsection (with supporting appendix tables) specifying the exact metrics (accuracy via exact-match on MATH, GSM8K, and OOD splits), statistical controls (standard deviation across five random seeds, paired t-tests with p-values), and explicit problem exclusion/selection criteria for the OOD sets. This will make the reported drops fully reproducible and allow direct assessment of robustness.

revision: yes
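
The statistical controls named in this response (standard deviation across five seeds, a paired t-test) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the function name and the per-seed accuracies are hypothetical, and only the t statistic is computed with the standard library.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """t statistic for the mean of the paired differences a[i] - b[i]."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("differences have zero variance; t is undefined")
    return mean_diff / (sd_diff / math.sqrt(len(diffs)))


# Hypothetical OOD accuracies (percent) for five seeds; not from the paper.
base = [52.0, 50.0, 51.0, 49.0, 53.0]
distilled = [41.0, 42.0, 39.0, 40.0, 43.0]

print(statistics.mean(base), statistics.stdev(base))
print(statistics.mean(distilled), statistics.stdev(distilled))
print(paired_t_statistic(base, distilled))  # compare against t with n - 1 = 4 d.o.f.
```

Pairing by seed is the relevant design choice: it compares the base and distilled runs that share a seed, so seed-to-seed variation cancels out of the differences. Turning the statistic into a p-value requires the t distribution with n - 1 degrees of freedom; a library routine such as `scipy.stats.ttest_rel` would do this, if SciPy is available.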

Circularity Check

0 steps flagged

No circularity: empirical observations rest on experimental comparisons, not self-referential definitions or fitted predictions


The paper reports controlled experiments varying teacher conditioning richness and task coverage, then measures performance drops (up to 40%) and links them to reduced epistemic verbalization. No equations, ansatzes, or derivations are present that reduce to their own inputs by construction. No parameters are fitted to a subset and then relabeled as predictions. Self-citations, if any, are not load-bearing for the core empirical claim, which is falsifiable via the described ablations and cross-model observations. The analysis is self-contained against external benchmarks (multiple LLMs, in-domain vs OOD splits) and does not invoke uniqueness theorems or rename known results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that epistemic verbalization supports OOD generalization and that distillation selectively suppresses it. No free parameters or new physical entities are introduced.

axioms (1)

domain assumption: Epistemic verbalization (expression of uncertainty) improves out-of-domain reasoning performance.
Invoked to explain why suppression harms OOD results while aiding in-domain speed.

invented entities (1)

epistemic verbalization (no independent evidence)
purpose: Term for the model's expression of uncertainty during reasoning steps
Introduced as the key suppressed behavior; no independent falsifiable evidence provided beyond the experiments described.



Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Respecting Self-Uncertainty in On-Policy Self-Distillation for Efficient LLM Reasoning
cs.AI 2026-05 unverdicted novelty 7.0

EGRSD and CL-EGRSD advance the accuracy-length frontier in LLM reasoning by entropy-guided weighting of token-level distillation signals from the teacher.
Multi-Rollout On-Policy Distillation via Peer Successes and Failures
cs.LG 2026-05 unverdicted novelty 7.0

MOPD improves on-policy distillation for LLMs by using peer successes for positive patterns and failures for negative examples to create more informative teacher signals.
Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR
cs.LG 2026-05 unverdicted novelty 7.0

RLRT augments GRPO by reinforcing tokens on correct student rollouts that the teacher would not have predicted, outperforming standard self-distillation and exploration baselines on Qwen3 models.
TRACE: Distilling Where It Matters via Token-Routed Self On-Policy Alignment
cs.AI 2026-05 unverdicted novelty 7.0

TRACE improves math reasoning by distilling only on annotator-marked critical spans with forward KL on correct key spans, optional reverse KL on errors, and GRPO elsewhere, gaining 2.76 points over GRPO while preservi...
KL for a KL: On-Policy Distillation with Control Variate Baseline
cs.LG 2026-05 unverdicted novelty 7.0

vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensiv...
Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization
cs.LG 2026-05 unverdicted novelty 7.0

PBSD derives a reward-reweighted teacher distribution as the analytic optimum of a reward-regularized objective, yielding better stability and performance than KL-based self-distillation on math reasoning and tool-use tasks.
Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models
cs.CL 2026-04 unverdicted novelty 7.0

OPD for LLMs suffers length inflation and repetition collapse; StableOPD uses reference divergence and rollout mixing to prevent it and improve math reasoning performance by 7.2% on average.
How to Fine-Tune a Reasoning Model? A Teacher-Student Cooperation Framework to Synthesize Student-Consistent SFT Data
cs.CL 2026-03 conditional novelty 7.0

TESSY creates stylistically consistent synthetic data via teacher-student token interleaving, yielding 11.25% and 6.68% gains on code benchmarks where pure teacher data causes 3.25% and 10.02% drops.
Learning with Rare Success but Rich Feedback via Reflection-Enhanced Self-Distillation
cs.LG 2026-05 unverdicted novelty 6.0

RESD turns failure trajectories into token-level supervision via retrospective reflections and a persistent global playbook, enabling faster improvement than standard self-distillation or GRPO with only one rollout pe...
Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information
cs.LG 2026-05 unverdicted novelty 6.0

Anti-Self-Distillation reverses self-distillation signals via PMI to fix overconfidence on structural tokens, matching GRPO baseline accuracy 2-10x faster with up to 11.5 point gains across 4B-30B models.
Selective Off-Policy Reference Tuning with Plan Guidance
cs.AI 2026-05 unverdicted novelty 6.0

SORT turns all-wrong prompts into selective learning signals by weighting tokens more predictable under plan guidance from reference solutions, improving over GRPO on reasoning benchmarks especially for weaker models.
Multilingual Safety Alignment via Self-Distillation
cs.LG 2026-05 unverdicted novelty 6.0

MSD enables cross-lingual safety transfer in LLMs via self-distillation with Dual-Perspective Safety Weighting, improving safety in low-resource languages without target response data.
AtManRL: Towards Faithful Reasoning via Differentiable Attention Saliency
cs.CL 2026-04 unverdicted novelty 6.0

AtManRL learns an additive attention mask on CoT traces to produce a saliency reward that, when combined with outcome rewards in GRPO, trains LLMs to generate reasoning that genuinely influences final predictions.
Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe
cs.LG 2026-04 unverdicted novelty 6.0

On-policy distillation works when student and teacher models share thinking patterns and the teacher adds new capabilities, with success tied to alignment on a small set of high-probability tokens.
Selective Off-Policy Reference Tuning with Plan Guidance
cs.AI 2026-05 unverdicted novelty 5.0

SORT converts all-failed reasoning prompts into selective, structure-aware training signals by weighting tokens according to how much a reference-derived plan increases their probability.
On-Policy Distillation with Best-of-N Teacher Rollout Selection
cs.CV 2026-05 unverdicted novelty 5.0

BRTS improves on-policy distillation by sampling multiple teacher rollouts and selecting the best one via a correctness-first then alignment priority rule, yielding gains on AIME and AMC math benchmarks.
On-Policy Distillation with Best-of-N Teacher Rollout Selection
cs.CV 2026-05 unverdicted novelty 5.0

BRTS improves on-policy distillation by selecting the highest-quality teacher trajectory from a small pool of samples based on correctness and alignment with the student, yielding gains on AIME and AMC math benchmarks.
Multilingual Safety Alignment via Self-Distillation
cs.LG 2026-05 unverdicted novelty 5.0

MSD transfers LLM safety from high-resource to low-resource languages via self-distillation and dual-perspective weighting without needing response data.