Recognition: unknown
CRISP: Compressed Reasoning via Iterative Self-Policy Distillation
read the original abstract
Reasoning models think out loud, but much of what they say is noise. We introduce CRISP (Compressed Reasoning via Iterative Self-Policy Distillation), a method that teaches models to reason more concisely by distilling their own concise behavior back into themselves. The entire approach reduces to one idea: condition the same model on a ''be concise'' instruction to obtain teacher logits, and minimize per-token reverse KL on the student's own rollouts. No ground-truth answers, no token budgets, no difficulty estimators. Just self-distillation. Yet this simplicity belies surprising sophistication: CRISP automatically compresses easy problems aggressively while preserving the deliberation needed for hard ones. On Qwen3-8B and Qwen3-14B, we achieve 57--59% token reduction on MATH-500 while improving accuracy by 9--16 points absolute. On AIME 2024, the 14B model gains 10 points with 41% compression. Ablations show that qualitative conciseness instructions outperform explicit token targets, and periodic teacher refreshes yield a broad stable regime. The method generalizes across model families -- DeepSeek-R1-Distill-Llama-8B improves accuracy by up to 5 points with 17--32% compression -- and transfers beyond math to multi-step agentic planning (DeepPlanning), reducing token usage by 42--51% while preserving planning quality. Code is available at https://github.com/HJSang/OPSD_Reasoning_Compression.
This paper has not been read by Pith yet.
Forward citations
Cited by 16 Pith papers
-
Respecting Self-Uncertainty in On-Policy Self-Distillation for Efficient LLM Reasoning
EGRSD and CL-EGRSD advance the accuracy-length frontier in LLM reasoning by entropy-guided weighting of token-level distillation signals from the teacher.
-
Multi-Rollout On-Policy Distillation via Peer Successes and Failures
MOPD improves on-policy distillation for LLMs by using peer successes for positive patterns and failures for negative examples to create more informative teacher signals.
-
From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation
Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.
-
Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization
PBSD derives a reward-reweighted teacher distribution as the analytic optimum of a reward-regularized objective, yielding better stability and performance than KL-based self-distillation on math reasoning and tool-use tasks.
-
Self-Distilled RLVR
RLSD mixes self-distillation for token-level policy difference magnitudes with RLVR for reliable update directions from response correctness to reach higher convergence and better training stability.
-
Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation
Local teachability collapse in trajectory suffixes makes uniform dense supervision suboptimal in strong-to-weak OPD; truncating at BIC-style change points on teacher margin improves performance.
-
Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training
Sparse RL on capable teachers followed by dense distillation to students beats direct GRPO on students for verifiable math reasoning.
-
Adaptive Teacher Exposure for Self-Distillation in LLM Reasoning
ATESD makes teacher exposure to reference reasoning a learnable control variable via a Beta-policy optimized on future student improvement, yielding gains of up to +2.33 points over fixed-exposure self-distillation on...
-
D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models
D-OPSD enables continuous supervised fine-tuning of few-step diffusion models via on-policy self-distillation where the model acts as both teacher (multimodal context) and student (text-only context) on its own roll-outs.
-
Multilingual Safety Alignment via Self-Distillation
MSD enables cross-lingual safety transfer in LLMs via self-distillation with Dual-Perspective Safety Weighting, improving safety in low-resource languages without target response data.
-
TIP: Token Importance in On-Policy Distillation
In on-policy distillation, tokens with high student entropy or low entropy plus high teacher divergence provide dense corrective signal, allowing effective training on under 20% of tokens across math and planning tasks.
-
$\pi$-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data
π-Play uses self-generated question construction paths as privileged information in multi-agent self-distillation to convert sparse-reward self-play into a dense-feedback loop, surpassing supervised search agents and ...
-
Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training
Sparse RL on a strong teacher followed by dense distillation to the student outperforms direct GRPO on the student for math tasks, with a forward-KL + OPD bridge enabling further gains.
-
Reasoning Compression with Mixed-Policy Distillation
Mixed-Policy Distillation transfers concise reasoning behavior from larger to smaller LLMs by having the teacher compress student-generated trajectories, cutting token usage up to 27% while raising benchmark scores.
-
Multilingual Safety Alignment via Self-Distillation
MSD transfers LLM safety from high-resource to low-resource languages via self-distillation and dual-perspective weighting without needing response data.
-
Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning
LLM post-training is unified as off-policy or on-policy interventions that expand support for useful behaviors, reshape policies within reachable states, or consolidate behavior across training stages.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.