
arxiv: 2607.02234 · v1 · pith:PCXXHQZVnew · submitted 2026-07-02 · 💻 cs.AI · cs.LG

Purified OPSD: On-Policy Self-Distillation Without Losing How to Think

Pith reviewed 2026-07-03 13:53 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords on-policy self-distillation · long chain-of-thought reasoning · LLM distillation · pointwise mutual information · reference-induced shortcuts · epistemic behavior preservation

The pith

Purifying the self-distillation signal by subtracting reference-only outputs lets long-CoT models improve without losing reflective reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

On-policy self-distillation fails on long chain-of-thought models because the teacher's supervision is dominated by reference-induced shortcuts that promote rote memorization. The paper decomposes the signal to isolate the non-transferable reference component using a reference-only teacher and subtracts it to reveal the question-conditioned residual. This residual is then shaped into a distillation target via pointwise mutual information to filter out shortcuts. Experiments show this purified method delivers consistent gains over base models and standard OPSD on four long-CoT models and two datasets while maintaining natural epistemic behavior.

Core claim

The teacher's supervision in OPSD contains a dominant reference-induced component that drives memorization of shortcuts and a weaker question-conditioned component that carries transferable inference corrections; isolating the former via a reference-only teacher and converting the residual with pointwise mutual information produces a clean target that supports effective distillation without destabilizing reflection.
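
The dominance claim can be made concrete with the two per-position diagnostics named in the Figure 3 caption: cosine similarity and norm fraction. The sketch below is a minimal illustration in plain Python, not the paper's code. It assumes the reference-induced part is ∆ref = log π_ref − log πθ, so that ∆it = ∆total − ∆ref. That split, the function names, and the toy logits are all assumptions made for illustration.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary slice."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def cosine(a, b):
    na, nb = norm(a), norm(b)
    if na == 0.0 or nb == 0.0:
        return 0.0  # direction undefined; report 0 by convention
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def decompose(teacher_logp, ref_only_logp, student_logp):
    # Delta_total = log pi_T - log pi_theta, as written in the Figure 3 caption.
    delta_total = [t - s for t, s in zip(teacher_logp, student_logp)]
    # ASSUMPTION: the reference-induced part is the reference-only teacher's
    # update relative to the student; the paper's exact definition may differ.
    delta_ref = [r - s for r, s in zip(ref_only_logp, student_logp)]
    # The inference-transferable part is whatever remains after subtraction.
    delta_it = [t - r for t, r in zip(delta_total, delta_ref)]
    total = norm(delta_total)
    frac = lambda v: norm(v) / total if total > 0.0 else 0.0
    return {
        "delta_total": delta_total, "delta_ref": delta_ref, "delta_it": delta_it,
        "cos_ref": cosine(delta_ref, delta_total),  # directional alignment
        "cos_it": cosine(delta_it, delta_total),
        "frac_ref": frac(delta_ref),                # magnitude dominance
        "frac_it": frac(delta_it),
    }

# Toy 3-token vocabulary (made-up logits, for illustration only).
student = log_softmax([1.0, 1.0, 1.0])
teacher = log_softmax([3.0, 1.0, 0.5])
ref_only = log_softmax([2.8, 0.9, 0.9])
stats = decompose(teacher, ref_only, student)
print({k: v for k, v in stats.items() if not k.startswith("delta")})
```

Under this reading, a high `cos_ref` together with a large `frac_ref` would correspond to the reference component dominating in both direction and magnitude.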

What carries the argument

Reference-only teacher subtraction to isolate the residual correction signal, followed by pointwise mutual information to form the PMI target distribution for distillation.
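
One way such a target could be formed is sketched below. For each candidate token, the pointwise mutual information with the question given the reference is the full teacher's log-probability minus the reference-only teacher's log-probability. The sketch softly clips that score, scales it, and renormalizes with a softmax. The figure captions mention a soft clipping threshold c and a correction strength β, but this page does not say how they enter the objective. The tanh clip, the placement of β, the function names, and the toy logits are therefore assumptions, not the paper's formulation.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary slice."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def soft_clip(x, c):
    # ASSUMPTION: a tanh-based soft clip that keeps scores inside (-c, c).
    return c * math.tanh(x / c)

def pmi_target(teacher_logp, ref_only_logp, beta=1.0, c=10.0):
    """Turn per-token PMI scores into a normalized target distribution."""
    # PMI(token; question | reference) = log p(token | q, r) - log p(token | r)
    pmi = [t - r for t, r in zip(teacher_logp, ref_only_logp)]
    # ASSUMPTION: beta scales the clipped PMI before normalization.
    scores = [beta * soft_clip(s, c) for s in pmi]
    return [math.exp(v) for v in log_softmax(scores)]

# Toy example: token 0 is a "shortcut" token that both teachers favor,
# token 1 is favored only when the question is also visible.
teacher_logp = log_softmax([4.0, 3.0, 0.0])    # sees question + reference
ref_only_logp = log_softmax([4.0, 0.0, 0.0])   # sees reference only
target = pmi_target(teacher_logp, ref_only_logp)
print([round(p, 3) for p in target])
```

In this toy case the raw teacher still prefers token 0, while the PMI-shaped target puts most of its mass on token 1, because token 0 is explained equally well by the reference alone.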

If this is right

  • Consistent performance improvements over base models and standard OPSD on long-CoT reasoning tasks.
  • Preservation of the models' natural epistemic behavior throughout training.
  • Effective filtering of reference-induced shortcuts that cause rote memorization.
  • Applicability across multiple long-CoT models and datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same decomposition might apply to other forms of privileged supervision in LLM training.
  • PMI could be compared to alternative measures for shaping the residual signal.
  • Scaling the method to larger models or more complex reasoning tasks remains to be tested.

Load-bearing premise

The residual after subtracting the reference-only teacher's output captures the question-conditioned, inference-transferable correction that can be turned into a usable distillation target by PMI.

What would settle it

If training with the purified targets produces no improvement over standard OPSD or causes the same destabilization of reflective reasoning as the unpurified version, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2607.02234 by Chen Shen, Gang Chen, Haobo Wang, Hao Chen, Jieping Ye, Jintao Tong, Junbo Zhao, Rui Miao, Shaotian Yan, Wentao Ye, Xiaomeng Hu, Zhanming Shen.

Figure 1: AIME 2025 accuracy across OPSD training checkpoints on Math-CoT-20K. OPSD provides …
Figure 2: Epistemic marker analysis during OPSD training on Math-CoT-20K. Qwen3-8B collapses …
Figure 3: Decomposition of the teacher’s update ∆total = log πT − log πθ into reference-induced (∆ref, red) and inference-transferable (∆it, green) components across OPSD checkpoints. Left columns: directional alignment (cosine similarity). Right columns: magnitude dominance (norm fraction). The reference component dominates in both direction and magnitude. ∥∆it∥/∥∆total∥, capturing the magnitude dominance. Both are…
Figure 4: Training dynamics on Math-CoT-20K (AIME 2025). OPSD-Standard (red) uses the raw …
Figure 1: Two key observations emerge. First, OPSD-PMI improves and remains stable, whereas OPSD-Standard peaks briefly (if at all) before steadily declining. Second, OPSD-PMI is robust to checkpoint selection: the variance across checkpoints is small, meaning practitioners do not need careful early stopping to avoid catastrophic degradation. This is a practical advantage over OPSD-Standard, where later checkpoints …
Figure 5: Epistemic marker analysis on Math-CoT-20K. OPSD-Standard uses the raw teacher …
Figure 6: Ablation on the soft clipping threshold c (β = 1 fixed). All settings consistently improve over the baseline and show similar trajectories, confirming robustness to this hyperparameter.
[Recovered plot labels: x-axis Training Step; y-axis AIME 2024 Accuracy (%); legend β = 0.5, β = 1, β = 2; panel (a) Qwen3-8B (AIME24); second panel label truncated: AIME 20…]
Figure 7: Ablation on the correction strength β (c = 10 fixed). β = 0.5 produces volatile trajectories; β = 2 yields smoother curves with occasionally higher peaks; β = 1 balances stability and performance.
Original abstract

On-policy self-distillation (OPSD) has emerged as a promising paradigm for improving LLM reasoning, where a privileged teacher with access to reference solutions provides token-level supervision on the student's own generated trajectories. However, we find that OPSD consistently fails on long chain-of-thought (long-CoT) reasoning models, yielding at best marginal gains while destabilizing the reflective reasoning capability these models depend on. Through a novel decomposition of the teacher's supervision signal, we identify the root cause: the teacher's supervision is dominated by a reference-induced component that drives rote memorization of reference-specific shortcuts, while the question-conditioned, inference-transferable component is ignored or actively opposed. Based on this diagnosis, we propose a two-step solution. First, we construct a reference-only teacher (the same model conditioned on the reference without the question) to isolate the non-transferable component of the supervision signal; the residual after subtracting this component captures the question-conditioned, inference-transferable correction. Second, we use pointwise mutual information (PMI) as the mechanism to transform this residual into a well-formed PMI target distribution that the student can directly distill from, filtering out the reference-induced shortcut. Experiments on four long-CoT models across two datasets demonstrate consistent improvements over both the base model and standard OPSD, while preserving the models' natural epistemic behavior throughout training.
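
The token-level supervision described in the abstract can be pictured as a per-position divergence between a target distribution and the student's own distribution, averaged over the positions of a student-generated trajectory. The sketch below is a hypothetical illustration in plain Python. The forward KL direction, the averaging, and the function names are assumptions, since this page does not state the paper's exact loss.

```python
import math

def kl_divergence(target_probs, student_logp):
    """KL(target || student) at one position; zero-mass target tokens are skipped."""
    return sum(
        p * (math.log(p) - lq)
        for p, lq in zip(target_probs, student_logp)
        if p > 0.0
    )

def trajectory_loss(targets, student_logps):
    """Average per-token KL over the positions of one student-generated rollout.

    targets:        one target distribution (probabilities) per position
    student_logps:  the student's log-probabilities at the same positions
    ASSUMPTION: forward KL with a plain mean; the paper may use another form.
    """
    per_token = [kl_divergence(t, s) for t, s in zip(targets, student_logps)]
    return sum(per_token) / len(per_token)

# Toy rollout with two positions over a 2-token vocabulary.
uniform_logp = [math.log(0.5), math.log(0.5)]
targets = [[0.5, 0.5], [0.9, 0.1]]
loss = trajectory_loss(targets, [uniform_logp, uniform_logp])
print(round(loss, 4))
```

The first position contributes nothing because target and student already agree there. Only the second position, where the target is sharper than the student, produces a training signal.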

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that standard on-policy self-distillation (OPSD) fails on long chain-of-thought reasoning models because the teacher's token-level supervision is dominated by reference-induced shortcuts that promote rote memorization, while the question-conditioned transferable component is ignored. It proposes Purified OPSD: (1) construct a reference-only teacher (same model conditioned only on the reference) to isolate the non-transferable component, (2) subtract to obtain the residual as the transferable correction, and (3) apply pointwise mutual information (PMI) to convert the residual into a valid distillation target. Experiments on four long-CoT models across two datasets are reported to yield consistent gains over both the base model and standard OPSD while preserving natural epistemic behavior.

Significance. If the decomposition via subtraction is valid and the PMI target reliably filters shortcuts without introducing new artifacts, the method could offer a practical improvement to self-distillation pipelines for reasoning models, addressing instability in reflective capabilities that current OPSD approaches exhibit.

major comments (2)
  1. [Abstract, two-step solution paragraph] Abstract, paragraph describing the two-step solution: the central construction assumes the teacher's supervision decomposes additively into a reference-induced component (isolated by the reference-only teacher) and a residual that is precisely the question-conditioned, inference-transferable correction. No derivation is given showing why subtraction (in logits, probabilities, or other space) isolates this component rather than mixing artifacts, nor why the residual is guaranteed to be non-negative or normalizable before PMI is applied. If this does not hold, the PMI target is not guaranteed to filter shortcuts while preserving reasoning.
  2. [Experiments section] Experiments (as summarized in abstract): the claim of 'consistent improvements' and 'preservation of the models' natural epistemic behavior throughout training' is presented without quantitative details on effect sizes, variance across runs, or ablation controls that isolate the contribution of the reference-only subtraction versus PMI. This makes it difficult to verify that the gains are robust and attributable to the proposed purification rather than other factors.
minor comments (1)
  1. [Methods] Notation for the residual and PMI target distribution should be defined explicitly with equations early in the methods, as the abstract description leaves the precise mathematical form of the subtraction and normalization ambiguous.
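
The normalizability concern in the first major comment can be illustrated numerically. If two distributions are subtracted in probability space, the residual sums to zero and contains negative entries, so it cannot serve as a target directly. A log-space difference passed through a softmax is always a valid distribution. The numbers below are made up for illustration, and the sketch does not show which space the paper actually uses.

```python
import math

def softmax(scores):
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical next-token distributions over a 3-token vocabulary.
teacher = [0.6, 0.3, 0.1]    # full teacher (question + reference)
ref_only = [0.7, 0.1, 0.2]   # reference-only teacher

# Probability-space subtraction: sums to zero, has negative entries.
prob_residual = [t - r for t, r in zip(teacher, ref_only)]

# Log-space subtraction followed by softmax: always positive, sums to one.
log_residual = [math.log(t) - math.log(r) for t, r in zip(teacher, ref_only)]
target = softmax(log_residual)

print("probability-space residual:", [round(x, 3) for x in prob_residual])
print("log-space target:", [round(x, 3) for x in target])
```

Validity of the target does not by itself answer the referee's other question, namely whether the residual isolates the transferable component rather than mixing artifacts.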

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point-by-point below, indicating planned revisions where appropriate.

Point-by-point responses
  1. Referee: [Abstract, two-step solution paragraph] Abstract, paragraph describing the two-step solution: the central construction assumes the teacher's supervision decomposes additively into a reference-induced component (isolated by the reference-only teacher) and a residual that is precisely the question-conditioned, inference-transferable correction. No derivation is given showing why subtraction (in logits, probabilities, or other space) isolates this component rather than mixing artifacts, nor why the residual is guaranteed to be non-negative or normalizable before PMI is applied. If this does not hold, the PMI target is not guaranteed to filter shortcuts while preserving reasoning.

    Authors: We agree that a more explicit justification of the additive decomposition assumption would strengthen the paper. The subtraction is performed in logit space prior to the PMI transformation, motivated by the goal of isolating the question-conditioned residual; however, the manuscript presents this primarily through empirical diagnosis rather than a formal derivation. We will revise the method section to include a dedicated paragraph discussing the assumptions underlying the logit-space subtraction, potential mixing of artifacts, conditions for non-negativity after adjustment, and the role of PMI in producing a valid target distribution, supported by additional empirical checks. revision: yes

  2. Referee: [Experiments section] Experiments (as summarized in abstract): the claim of 'consistent improvements' and 'preservation of the models' natural epistemic behavior throughout training' is presented without quantitative details on effect sizes, variance across runs, or ablation controls that isolate the contribution of the reference-only subtraction versus PMI. This makes it difficult to verify that the gains are robust and attributable to the proposed purification rather than other factors.

    Authors: The experiments section reports results across four models and two datasets with tables showing performance deltas relative to the base model and standard OPSD. We acknowledge that the current presentation would be improved by explicit reporting of effect sizes, run-to-run variance, and ablations that separately disable the reference-only subtraction and the PMI step. We will expand the experiments section with these quantitative details and ablation results to better isolate the contribution of each component. revision: yes
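
The first response says the subtraction is performed in logit space. One property worth noting is that, when the difference is normalized with a softmax and nothing nonlinear is applied in between, subtracting raw logits and subtracting log-probabilities give the same distribution, because log-softmax only shifts each vector by a per-position constant. The sketch below checks this on made-up logits. It is a general property of softmax, not a description of the authors' pipeline, and it would no longer hold exactly if a clipping step were applied before normalization.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def softmax(scores):
    return [math.exp(v) for v in log_softmax(scores)]

# Hypothetical raw logits from the two teachers at one position.
z_teacher = [2.0, -1.0, 0.5]
z_ref_only = [1.5, 0.0, -0.5]

# Subtract raw logits, then normalize.
from_logits = softmax([a - b for a, b in zip(z_teacher, z_ref_only)])

# Subtract log-probabilities, then normalize.
from_logps = softmax(
    [a - b for a, b in zip(log_softmax(z_teacher), log_softmax(z_ref_only))]
)

max_gap = max(abs(a - b) for a, b in zip(from_logits, from_logps))
print(max_gap)  # numerically zero
```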

Circularity Check

0 steps flagged

No significant circularity; explicit heuristic construction validated by experiments

Full rationale

The paper's core contribution is an explicit two-step construction: (1) a reference-only teacher to isolate a non-transferable component, followed by subtraction to obtain a residual claimed to be the question-conditioned correction, and (2) PMI applied to that residual to produce the distillation target. This is presented as a diagnostic decomposition rather than a mathematical derivation from first principles. No equations or self-citations are shown that reduce the claimed performance gains or preservation of epistemic behavior to a tautology, fitted parameter, or self-referential definition. The improvements are asserted via experiments on four models and two datasets, making the argument self-contained against external benchmarks rather than internally forced.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that the subtracted residual isolates transferable reasoning corrections; no free parameters or invented entities are introduced in the abstract description.

axioms (1)
  • domain assumption: The residual supervision signal after subtracting the reference-only teacher output represents the question-conditioned, inference-transferable component.
    This premise is required for the PMI step to produce a useful target; stated in the abstract's diagnosis paragraph.

pith-pipeline@v0.9.1-grok · 5807 in / 1179 out tokens · 36247 ms · 2026-07-03T13:53:53.814369+00:00 · methodology

