pith. machine review for the scientific record. sign in

arxiv: 2605.11609 · v1 · submitted 2026-05-12 · 💻 cs.LG · cs.AI· cs.CL

Recognition: 2 theorem links

· Lean Theorem

Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information

Authors on Pith no claims yet

Pith reviewed 2026-05-13 01:25 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CL
keywords anti-self-distillationreasoning RLpointwise mutual informationmath reasoninglanguage modelson-policy distillationself-improvement
0
0 comments X

The pith

Reversing the self-distillation objective counters per-token biases from privileged context and accelerates math reasoning training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard on-policy self-distillation produces inconsistent gains in math reasoning because the privileged context inflates teacher confidence on solution-implied tokens and deflates it on deliberation tokens that drive search. The paper diagnoses this pattern with pointwise mutual information and introduces Anti-Self-Distillation, which ascends rather than descends the divergence from the teacher to flip the sign of the advantage on each token. An entropy-triggered gate then disables the term once the teacher becomes overconfident. This produces a naturally bounded training signal that serves as a drop-in replacement. A sympathetic reader would care because the change lets a model bootstrap its own reasoning improvements with far less compute and no stronger external teacher.

Core claim

On-policy self-distillation fails to deliver reliable gains in math reasoning because the privileged context inflates the teacher's confidence on tokens already implied by the solution and deflates it on deliberation tokens. Anti-Self-Distillation ascends the divergence between student and teacher instead of descending it, reversing the per-token sign and yielding a naturally bounded advantage in one step. An entropy-triggered gate disables the term once teacher entropy collapses. The resulting method reaches the GRPO baseline accuracy in 2 to 10 times fewer training steps and improves final accuracy by up to 11.5 points across models from 4B to 30B parameters.

What carries the argument

The Anti-Self-Distillation objective that ascends the divergence from the teacher policy, diagnosed by pointwise mutual information on per-token effects and controlled by an entropy-triggered gate.

If this is right

  • Models reach the GRPO baseline accuracy in 2 to 10 times fewer training steps.
  • Final accuracy improves by up to 11.5 points over the baseline.
  • The method works as a drop-in replacement inside existing reinforcement learning pipelines for reasoning.
  • The gains hold across model sizes from 4B to 30B parameters on math reasoning benchmarks.
  • Self-improvement becomes more reliable because the training signal no longer requires a stronger external teacher.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The PMI diagnosis of token-level bias could be applied to other distillation or knowledge-transfer settings beyond reasoning RL.
  • If the same confidence-inflation pattern appears on non-math tasks such as coding, the reversal approach may transfer directly.
  • Combining the bounded advantage with other RL estimators could further stabilize training on longer reasoning traces.
  • A direct test would measure whether AntiSD reduces the volume of verified solutions needed to reach a given accuracy level.

Load-bearing premise

The privileged context inflates the teacher's confidence on tokens already implied by the solution and deflates it on deliberation tokens that drive multi-step search.

What would settle it

An experiment that keeps the same models, benchmarks, and training setup but removes the sign reversal or the entropy gate and still observes the reported speed-up and accuracy gains would show the claimed mechanism is not responsible.

Figures

Figures reproduced from arXiv: 2605.11609 by Chenxiao Zhao, Dongcheng Zhao, Guobin Shen, Jindong Li, Lei Huang, Xiang Cheng, Xing Yu.

Figure 1
Figure 1. Figure 1: (a) An oracle-conditioned teacher biases the student toward a single root; reversing the signal preserves deliberation and recovers the rest. (b) Qwen3-8B on HMMT 2025: AntiSD reaches GRPO’s peak in ∼1/5 the steps and ends +15 pp higher; default self-distillation underperforms GRPO. only modest or inconsistent gains on more challenging mathematical problems [8]. We observe the same pattern across model fam… view at source ↗
Figure 2
Figure 2. Figure 2: Per-token signal ut = tt − st on Qwen3-4B-IT-2507 at AIME-25. (a) Single-rollout trace. (b) (πS, πT ) heatmap. Blue marks deliberation tokens (ut ≪ 0); red marks shortcut tokens (ut ≫ 0). plays two roles: the student πS(· | x, y<t) := πθ(· | x, y<t) generates the rollout, while the teacher πT (· | x, y<t) := πθ(· | x, c, y<t) scores it under richer conditioning (we suppress c from the teacher’s left-hand-s… view at source ↗
Figure 3
Figure 3. Figure 3: HMMT25 pass@k at each row’s peak-mean snapshot. AntiSD’s lead over GRPO is sustained across k – the gain reflects expanded coverage, not just variance reduction. A natural concern is whether AntiSD’s gain comes from better single-rollout accuracy or from concentrating probability mass on already-correct rollouts at the cost of generation diversity [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Training dynamics on three models (rows) along six axes (columns). Faded traces are raw values; bold traces are 20-step rolling means. here. Its sharpest expression is the Qwen3-4B-IT-2507 collapse around step 80: train reward to zero, response length pinned at the 32K cap, and both entropies spiking, all within a single step window before the run terminates. 4.3 Ablations AntiSD adds three components on t… view at source ↗
Figure 5
Figure 5. Figure 5: Mechanism ablations: failure modes. Top: reward over non-truncated rollouts. Bottom: mean response length. Line truncation indicates run termination after collapse. No-teacher: self-reinforcement collapse. Removing the teacher entirely – so the per-token signal becomes a function of the student’s log-probability alone, with no teacher–student differential – collapses on all three models within ∼ 70 trainin… view at source ↗
read the original abstract

On-policy self-distillation, where a student is pulled toward a copy of itself conditioned on privileged context (e.g., a verified solution or feedback), offers a promising direction for advancing reasoning capability without a stronger external teacher. Yet in math reasoning the gains are inconsistent, even when the same approach succeeds elsewhere. A pointwise mutual information analysis traces the failure to the privileged context itself: it inflates the teacher's confidence on tokens already implied by the solution (structural connectives, verifiable claims) and deflates it on deliberation tokens ("Wait", "Let", "Maybe") that drive multi-step search. We propose Anti-Self-Distillation (AntiSD), which ascends a divergence between student and teacher rather than descending it: this reverses the per-token sign and yields a naturally bounded advantage in one step. An entropy-triggered gate disables the term once the teacher entropy collapses, completing a drop-in replacement for default self-distillation. Across five models from 4B to 30B parameters on math reasoning benchmarks, AntiSD reaches the GRPO baseline's accuracy in 2 to 10x fewer training steps and improves final accuracy by up to 11.5 points. AntiSD opens a path to scalable self-improvement, where a language model bootstraps its own reasoning through its training signal.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper diagnoses failures of on-policy self-distillation in math reasoning via pointwise mutual information (PMI) analysis, which shows that privileged context inflates teacher confidence on structural/implied tokens while deflating it on deliberation tokens. It proposes Anti-Self-Distillation (AntiSD) that ascends rather than descends the divergence, yielding a bounded advantage, combined with an entropy-triggered gate to disable the term upon teacher entropy collapse. The method is presented as a drop-in replacement for standard self-distillation, with empirical claims of reaching GRPO baseline accuracy in 2-10x fewer steps and final accuracy gains up to 11.5 points across 4B-30B models on math benchmarks.

Significance. If the empirical results hold and the mechanism is mechanistically validated, AntiSD offers a lightweight, self-contained way to improve reasoning RL without external teachers, potentially aiding scalable self-improvement. The PMI diagnostic provides a concrete, falsifiable lens on token-level confidence mismatches that could generalize beyond the reported setting. The one-step bounded advantage and entropy gate are practical strengths that reduce hyperparameter sensitivity.

major comments (3)
  1. [§3] §3 (PMI analysis): The diagnosis that privileged context deflates probability on deliberation tokens (e.g., 'Wait', 'Let', 'Maybe') is plausible, but the manuscript provides no per-token probability measurements or checkpoint comparisons showing that ascending the divergence under AntiSD specifically raises these probabilities relative to standard self-distillation while the baseline does the opposite. Without this link, the reported accuracy and speed gains could stem from the entropy gate or other implementation details rather than the diagnosed cause.
  2. [§5] §5 (experiments): The central claims of 2-10x fewer training steps and up to 11.5-point gains are reported only via aggregate accuracy and step counts across five models; no ablation isolating the divergence-ascent term, no per-token analysis, and no statistical tests or variance across runs are supplied to confirm robustness or attribute gains to the PMI-motivated mechanism.
  3. [§4.2] §4.2 (entropy gate): The precise condition for disabling the AntiSD term (e.g., entropy threshold, measurement over which tokens) is not formalized, making it difficult to reproduce the claimed speedups or determine whether the gate is load-bearing for the final accuracy improvements.
minor comments (2)
  1. [Abstract] The abstract states specific numerical gains but supplies no baseline details, number of runs, or benchmark names; these should be added for immediate clarity.
  2. [§4] Notation for the divergence ascent and the one-step bounded advantage should be explicitly contrasted with the standard KL or reverse-KL objectives in an equation.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. The comments highlight opportunities to strengthen the mechanistic evidence, experimental rigor, and reproducibility of Anti-Self-Distillation. We address each point below and will incorporate the suggested additions and clarifications in the revised manuscript.

read point-by-point responses
  1. Referee: [§3] §3 (PMI analysis): The diagnosis that privileged context deflates probability on deliberation tokens (e.g., 'Wait', 'Let', 'Maybe') is plausible, but the manuscript provides no per-token probability measurements or checkpoint comparisons showing that ascending the divergence under AntiSD specifically raises these probabilities relative to standard self-distillation while the baseline does the opposite. Without this link, the reported accuracy and speed gains could stem from the entropy gate or other implementation details rather than the diagnosed cause.

    Authors: We agree that an explicit link between the PMI diagnosis and the effect of divergence ascent would strengthen the mechanistic claim. In the revised version we will add per-token probability trajectories extracted from training checkpoints for representative deliberation tokens (e.g., 'Wait', 'Let', 'Maybe') under GRPO, standard self-distillation, and AntiSD. These plots will show that AntiSD increases probability mass on deliberation tokens while standard self-distillation decreases it, directly connecting the PMI analysis to the observed training dynamics and final gains. revision: yes

  2. Referee: [§5] §5 (experiments): The central claims of 2-10x fewer training steps and up to 11.5-point gains are reported only via aggregate accuracy and step counts across five models; no ablation isolating the divergence-ascent term, no per-token analysis, and no statistical tests or variance across runs are supplied to confirm robustness or attribute gains to the PMI-motivated mechanism.

    Authors: We acknowledge that the current experimental section relies on aggregate metrics. The revised manuscript will include: (i) an ablation that isolates the divergence-ascent term by comparing full AntiSD against a variant that retains only the entropy gate; (ii) the per-token probability analysis described above; and (iii) results from three independent runs per model size with mean, standard deviation, and paired statistical tests (e.g., Wilcoxon) to quantify robustness. These additions will allow readers to attribute performance differences specifically to the PMI-motivated component. revision: yes

  3. Referee: [§4.2] §4.2 (entropy gate): The precise condition for disabling the AntiSD term (e.g., entropy threshold, measurement over which tokens) is not formalized, making it difficult to reproduce the claimed speedups or determine whether the gate is load-bearing for the final accuracy improvements.

    Authors: We will formalize the entropy gate in Section 4.2 of the revision. The condition will be stated as: the AntiSD term is disabled for a given sequence when the average teacher entropy (computed over all tokens in the generated response) falls below a threshold τ = 0.8 nats; the threshold is applied uniformly across the sequence rather than per-token. We will also report the fraction of steps in which the gate activates across runs, clarifying its contribution to the observed speedups. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation or claims.

full rationale

The paper motivates AntiSD via a PMI diagnosis of how privileged context affects per-token teacher probabilities, then defines the method as ascending (rather than descending) a divergence with an entropy gate. No equations, derivations, or self-referential steps appear in the abstract or described text that reduce the reported speedups or accuracy gains to a fitted quantity or input by construction. The empirical results on math benchmarks are independent measurements, not tautological outputs of the method definition itself. No load-bearing self-citations, uniqueness theorems, or renamed known results are invoked to force the central claims. This is a standard empirical proposal whose validity rests on external benchmark outcomes rather than internal redefinition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are detailed beyond the core assumption about privileged context effects.

axioms (1)
  • domain assumption Privileged context inflates teacher confidence on implied tokens and deflates it on deliberation tokens
    This is the load-bearing premise used to motivate the PMI analysis and the sign reversal.

pith-pipeline@v0.9.0 · 5550 in / 1187 out tokens · 47295 ms · 2026-05-13T01:25:42.965508+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 18 internal anchors

  1. [1]

    On-policy distillation of language models: Learning from self-generated mistakes

    Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. InThe twelfth international conference on learning representations, 2024

  2. [2]

    Process Reinforcement through Implicit Rewards

    Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Yuchen Zhang, Jiacheng Chen, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, et al. Process reinforcement through implicit rewards. arXiv preprint arXiv:2502.01456, 2025

  3. [3]

    Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs

    Jasper Dekoninck, Nikola Jovanovi´c, Tim Gehrunger, Kári Rögnvalddson, Ivo Petrov, Chenhao Sun, and Martin Vechev. Beyond benchmarks: MathArena as an evaluation platform for mathematics with LLMs.arXiv preprint arXiv:2605.00674, 2026

  4. [4]

    Revisiting on-policy distillation: Empirical failure modes and simple fixes, 2026

    Yuqian Fu, Haohuan Huang, Kaiwen Jiang, Yuanheng Zhu, and Dongbin Zhao. Revisiting on-policy distillation: Empirical failure modes and simple fixes, 2026

  5. [5]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  6. [6]

    arXiv preprint arXiv:2603.08660 , year=

    Bingxiang He, Yuxin Zuo, Zeyuan Liu, Shangziqi Zhao, Zixuan Fu, Junlin Yang, Cheng Qian, Kaiyan Zhang, Yuchen Fan, Ganqu Cui, et al. How far can unsupervised rlvr scale llm training? arXiv preprint arXiv:2603.08660, 2026

  7. [7]

    Reinforcement Learning via Self-Distillation

    Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, et al. Reinforcement learning via self-distillation.arXiv preprint arXiv:2601.20802, 2026

  8. [8]

    Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?

    Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dohyung Kim, Jiwon Jeon, Dongsheng Li, and Yuqing Yang. Why does self-distillation (sometimes) degrade the reasoning capability of llms?arXiv preprint arXiv:2603.24472, 2026

  9. [9]

    Kimi k1.5: Scaling Reinforcement Learning with LLMs

    Kimi Team, Angang Du, Bofei Gao, Bowen Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with LLMs.arXiv preprint arXiv:2501.12599, 2025

  10. [10]

    Efficient Process Reward Modeling via Contrastive Mutual Information

    Nakyung Lee, Sangwoo Hong, and Jungwoo Lee. Efficient process reward modeling via contrastive mutual information.arXiv preprint arXiv:2604.10660, 2026

  11. [11]

    Solving quan- titative reasoning problems with language models.Advances in neural information processing systems, 35:3843–3857, 2022

    Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quan- titative reasoning problems with language models.Advances in neural information processing systems, 35:3843–3857, 2022

  12. [12]

    Unifying group-relative and self-distillation policy optimization via sample routing.arXiv preprint arXiv:2604.02288, 2026

    Gengsheng Li, Tianyu Yang, Junfeng Fang, Mingyang Song, Mao Zheng, Haiyun Guo, Dan Zhang, Jinqiao Wang, and Tat-Seng Chua. Unifying group-relative and self-distillation policy optimization via sample routing.arXiv preprint arXiv:2604.02288, 2026

  13. [13]

    Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

    Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan-ang Gao, Wenkai Yang, Zhiyuan Liu, et al. Rethinking on-policy distillation of large language models: Phenomenology, mechanism, and recipe.arXiv preprint arXiv:2604.13016, 2026

  14. [14]

    Let’s verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. InThe twelfth international conference on learning representations, 2023

  15. [15]

    Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation

    Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. Advances in neural information processing systems, 36:21558–21572, 2023. 10

  16. [16]

    Unifying distillation and privileged information.arXiv preprint arXiv:1511.03643, 2015

    David Lopez-Paz, Léon Bottou, Bernhard Schölkopf, and Vladimir Vapnik. Unifying distillation and privileged information.arXiv preprint arXiv:1511.03643, 2015

  17. [17]

    https://thinkingmachines.ai/blog/ on-policy-distillation/

    Kevin Lu and Thinking Machines Lab. On-policy distillation.Thinking Machines Lab: Con- nectionism, 2025. doi: 10.64434/tml.20251026. https://thinkingmachines.ai/blog/on-policy- distillation

  18. [18]

    Improve Mathematical Reasoning in Language Models by Automated Process Supervision

    Liangchen Luo, Yinxiao Liu, Rosanne Liu, Samrat Phatale, Meiqi Guo, Harsh Lara, Yunxuan Li, Lei Shu, Yun Zhu, Lei Meng, et al. Improve mathematical reasoning in language models by automated process supervision.arXiv preprint arXiv:2406.06592, 2024

  19. [19]

    Policy invariance under reward transforma- tions: Theory and application to reward shaping

    Andrew Y Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transforma- tions: Theory and application to reward shaping. InProceedings of the International Conference on Machine Learning (ICML), volume 99, pages 278–287. Citeseer, 1999

  20. [20]

    Team Olmo, Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heineman, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, Jacob Morrison, Jake Poznanski, Kyle Lo, Luca Soldaini, Matt Jordan, Mayee Chen, Michael Noukhovitch, Nathan Lambert, Pete Walsh, Pradeep Dasigi, Robert Berry, Saumya Malik, Saurabh Shah, Scott Geng, Shan...

  21. [21]

    On-policy self-distillation for reasoning compression.arXiv e-prints, pages arXiv–2603, 2026

    Hejian Sang, Yuanda Xu, Zhengze Zhou, Ran He, Zhipeng Wang, and Jiachen Sun. On-policy self-distillation for reasoning compression.arXiv e-prints, pages arXiv–2603, 2026

  22. [22]

    Rewarding progress: Scaling automated process verifiers for LLM reasoning

    Amrith Setlur, Chirag Nagpal, Adam Fisch, Xinyang Geng, Jacob Eisenstein, Rishabh Agarwal, Alekh Agarwal, Jonathan Berant, and Aviral Kumar. Rewarding progress: Scaling automated process verifiers for LLM reasoning. InThe Thirteenth International Conference on Learning Representations, 2025. URLhttps://openreview.net/forum?id=A6Y7AqlzLW

  23. [23]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  24. [24]

    Hybridflow: A flexible and efficient rlhf framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, pages 1279–1297, 2025

  25. [25]

    A new learning paradigm: Learning using privileged information.Neural networks, 22(5-6):544–557, 2009

    Vladimir Vapnik and Akshay Vashist. A new learning paradigm: Learning using privileged information.Neural networks, 22(5-6):544–557, 2009

  26. [26]

    Math-shepherd: Verify and reinforce llms step-by-step without human annotations

    Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9426–9439, 2024

  27. [27]

    MiMo-V2-Flash Technical Report

    Bangjun Xiao, Bingquan Xia, Bo Yang, Bofei Gao, Bowen Shen, Chen Zhang, Chenhong He, Chiheng Lou, Fuli Luo, Gang Wang, et al. Mimo-v2-flash technical report.arXiv preprint arXiv:2601.02780, 2026

  28. [28]

    PACED: Distillation and On-Policy Self-Distillation at the Frontier of Student Competence

    Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, and Zhipeng Wang. Paced: Distillation and on- policy self-distillation at the frontier of student competence.arXiv preprint arXiv:2603.11178, 2026. 11

  29. [29]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  30. [30]

    Self-Distilled RLVR

    Chenxu Yang, Chuanyu Qin, Qingyi Si, Minghui Chen, Naibin Gu, Dingyu Yao, Zheng Lin, Weiping Wang, Jiaqi Wang, and Nan Duan. Self-distilled rlvr.arXiv preprint arXiv:2604.03128, 2026

  31. [31]

    On-Policy Context Distillation for Language Models

    Tianzhu Ye, Li Dong, Xun Wu, Shaohan Huang, and Furu Wei. On-policy context distillation for language models.arXiv preprint arXiv:2602.12275, 2026

  32. [32]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

  33. [33]

    American invitational mathematics examination (aime) 2024, 2024

    Yifan Zhang and Team Math-AI. American invitational mathematics examination (aime) 2024, 2024

  34. [34]

    American invitational mathematics examination (aime) 2025, 2025

    Yifan Zhang and Team Math-AI. American invitational mathematics examination (aime) 2025, 2025

  35. [35]

    American invitational mathematics examination (aime) 2026, 2026

    Yifan Zhang and Team Math-AI. American invitational mathematics examination (aime) 2026, 2026

  36. [36]

    Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

    Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models.arXiv preprint arXiv:2601.18734, 2026

  37. [37]

    Learning to reason without external rewards.arXiv preprint arXiv:2505.19590, 2025

    Xuandong Zhao, Zhewei Kang, Aosong Feng, Sergey Levine, and Dawn Song. Learning to reason without external rewards.arXiv preprint arXiv:2505.19590, 2025. 12 Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information Supplementary Material Table of Contents A Proofs and Deferred Statements . . . . . . . . . . . . . . . . . . . . . . . . . . ....