CRISP: Compressed Reasoning via Iterative Self-Policy Distillation
11 Pith papers cite this work.
abstract
Reasoning models think out loud, but much of what they say is noise. We introduce CRISP (Compressed Reasoning via Iterative Self-Policy Distillation), a method that teaches models to reason more concisely by distilling their own concise behavior back into themselves. The entire approach reduces to one idea: condition the same model on a "be concise" instruction to obtain teacher logits, and minimize per-token reverse KL on the student's own rollouts. No ground-truth answers, no token budgets, no difficulty estimators. Just self-distillation. Yet this simplicity belies surprising sophistication: CRISP automatically compresses easy problems aggressively while preserving the deliberation needed for hard ones. On Qwen3-8B and Qwen3-14B, we achieve 57--59% token reduction on MATH-500 while improving accuracy by 9--16 points absolute. On AIME 2024, the 14B model gains 10 points with 41% compression. Ablations show that qualitative conciseness instructions outperform explicit token targets, and periodic teacher refreshes yield a broad stable regime. The method generalizes across model families -- DeepSeek-R1-Distill-Llama-8B improves accuracy by up to 5 points with 17--32% compression -- and transfers beyond math to multi-step agentic planning (DeepPlanning), reducing token usage by 42--51% while preserving planning quality. Code is available at https://github.com/HJSang/OPSD_Reasoning_Compression.
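To make the core objective concrete, the following is a minimal sketch of per-token reverse-KL self-distillation as the abstract describes it: the same model, prompted with a conciseness instruction, supplies teacher logits over the student's own rollout. The function name, tensor layout, and the assumption of a Hugging Face-style causal LM whose forward pass returns .logits are illustrative, not the authors' implementation; in particular, the paper uses a periodically refreshed teacher, which this sketch abbreviates by reusing the current weights under torch.no_grad().

```python
import torch
import torch.nn.functional as F

def crisp_style_loss(model, prompt_ids, concise_prompt_ids, response_ids):
    """Per-token reverse KL, KL(student || teacher), on the student's own rollout.

    prompt_ids:         (1, P) plain task prompt
    concise_prompt_ids: (1, C) same prompt prefixed with a "be concise" instruction
    response_ids:       (1, R) response sampled from the student on the plain prompt
    """
    # Student pass: plain prompt + the student's own rollout.
    student_in = torch.cat([prompt_ids, response_ids], dim=-1)
    # Logits at positions P-1 .. P+R-2 predict the R response tokens.
    s_logits = model(student_in).logits[:, prompt_ids.size(-1) - 1 : -1]  # (1, R, V)

    # Teacher pass: the SAME weights, but the prompt carries the conciseness
    # instruction; no gradients flow through the teacher.
    with torch.no_grad():
        teacher_in = torch.cat([concise_prompt_ids, response_ids], dim=-1)
        t_logits = model(teacher_in).logits[:, concise_prompt_ids.size(-1) - 1 : -1]

    log_p_s = F.log_softmax(s_logits, dim=-1)
    log_p_t = F.log_softmax(t_logits, dim=-1)

    # Reverse KL per response token, averaged over the rollout.
    rkl = (log_p_s.exp() * (log_p_s - log_p_t)).sum(dim=-1)  # (1, R)
    return rkl.mean()
```

In this reading, no reward model, token budget, or difficulty estimator enters the loss; the only supervision is the gap between the model's default behavior and its instruction-conditioned concise behavior on the same rollout.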
citing papers explorer
- From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation
  Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.
- Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization
  PBSD derives a reward-reweighted teacher distribution as the analytic optimum of a reward-regularized objective, yielding better stability and performance than KL-based self-distillation on math reasoning and tool-use tasks (a generic sketch of this kind of closed-form optimum appears after this list).
- Self-Distilled RLVR
  RLSD mixes self-distillation, which supplies token-level policy-difference magnitudes, with RLVR, which supplies reliable update directions from response correctness, to achieve better convergence and training stability.
- Adaptive Teacher Exposure for Self-Distillation in LLM Reasoning
  ATESD makes teacher exposure to reference reasoning a learnable control variable via a Beta-policy optimized on future student improvement, yielding gains of up to +2.33 points over fixed-exposure self-distillation on AIME and HMMT math benchmarks.
- D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models
  D-OPSD enables continuous supervised fine-tuning of few-step diffusion models via on-policy self-distillation, where the model acts as both teacher (multimodal context) and student (text-only context) on its own rollouts.
- Multilingual Safety Alignment via Self-Distillation
  MSD enables cross-lingual safety transfer in LLMs via self-distillation with Dual-Perspective Safety Weighting, improving safety in low-resource languages without target response data.
- TIP: Token Importance in On-Policy Distillation
  In on-policy distillation, tokens with high student entropy, or with low entropy plus high teacher divergence, provide dense corrective signal, allowing effective training on under 20% of tokens across math and planning tasks.
- π-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data
  π-Play uses self-generated question construction paths as privileged information in multi-agent self-distillation to convert sparse-reward self-play into a dense-feedback loop, surpassing supervised search agents and improving efficiency 2-3× over standard self-play.
- Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training
  Sparse RL on a strong teacher followed by dense distillation to the student outperforms direct GRPO on the student for math tasks, with a forward-KL + OPD bridge enabling further gains.
- Reasoning Compression with Mixed-Policy Distillation
  Mixed-Policy Distillation transfers concise reasoning behavior from larger to smaller LLMs by having the teacher compress student-generated trajectories, cutting token usage by up to 27% while raising benchmark scores.
- Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning
  LLM post-training is unified as off-policy or on-policy interventions that expand support for useful behaviors, reshape policies within reachable states, or consolidate behavior across training stages.
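The PBSD entry above refers to a reward-reweighted teacher distribution arising as the analytic optimum of a reward-regularized objective. As a generic illustration only (PBSD's exact objective and notation may differ), the standard KL-regularized reward objective with the teacher as reference, and its well-known closed-form solution, are:

```latex
% Illustrative KL-regularized reward objective with teacher policy \pi_T as
% reference; the reward-reweighted teacher distribution is its exact optimum.
\[
q^{\star} \;=\; \arg\max_{q}\;
  \mathbb{E}_{y \sim q(\cdot \mid x)}\!\bigl[r(x,y)\bigr]
  \;-\; \beta\,\mathrm{KL}\!\bigl(q(\cdot \mid x)\,\Vert\,\pi_T(\cdot \mid x)\bigr),
\qquad
q^{\star}(y \mid x) \;=\; \frac{1}{Z(x)}\;\pi_T(y \mid x)\,
  \exp\!\Bigl(\tfrac{r(x,y)}{\beta}\Bigr).
\]
```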