TRACE: Distilling Where It Matters via Token-Routed Self On-Policy Alignment
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-12 02:52 UTC · model grok-4.3
The pith
Routing self-distillation to annotator-marked critical spans improves math reasoning while preserving out-of-distribution accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TRACE shows that token-routed self on-policy alignment limited to critical spans outperforms GRPO by an average of 2.76 percentage points on four held-out math benchmarks plus GPQA-Diamond. It achieves this while preserving the base Qwen3-8B out-of-distribution score on GPQA-Diamond, where both GRPO and all-token self-OPD degrade. The method applies forward KL to key spans of correct rollouts, reverse KL to localized error spans when beneficial, and GRPO elsewhere, with the KL channel annealed away after a short warm-up so that cumulative privileged-gradient exposure stays finite. Gains remain when critical spans are obtained via online self-annotation rather than strong external APIs.
What carries the argument
Token-routed loss assignment that directs forward or reverse KL divergence only to annotator-marked critical spans while applying GRPO to all other tokens, combined with annealing of the distillation channel.
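For concreteness, here is a minimal PyTorch-style sketch of such a routed per-token objective. All names (trace_token_loss, key_mask, error_mask, kl_weight) are illustrative assumptions, a plain advantage-weighted log-likelihood term stands in for the full GRPO surrogate, and the annealing of kl_weight is assumed to happen outside the function; this is a sketch of the idea, not the authors' implementation.

```python
import torch.nn.functional as F

def trace_token_loss(student_logits, teacher_logits, target_ids, advantages,
                     key_mask, error_mask, kl_weight):
    """Sketch of a token-routed objective in the spirit of TRACE (hypothetical names).

    student_logits, teacher_logits: (batch, seq, vocab); the teacher is the same
        policy conditioned on privileged context.
    target_ids: (batch, seq) sampled tokens; advantages: (batch, seq) per-token
        advantages (equal to the rollout-level advantage under GRPO).
    key_mask, error_mask: 0/1 floats (batch, seq) marking annotator-identified key
        spans of correct rollouts and localized error spans.
    kl_weight: scalar annealed toward zero after a short warm-up.
    """
    log_p_s = F.log_softmax(student_logits, dim=-1)
    log_p_t = F.log_softmax(teacher_logits, dim=-1)

    # Forward KL(teacher || student) per position: lifts teacher-supported tokens
    # that the student under-allocates.
    fkl = (log_p_t.exp() * (log_p_t - log_p_s)).sum(dim=-1)
    # Reverse KL(student || teacher) per position: mode-seeking, applied only on
    # localized error spans when that routing is used.
    rkl = (log_p_s.exp() * (log_p_s - log_p_t)).sum(dim=-1)

    # Advantage-weighted log-likelihood stands in for the GRPO surrogate on the
    # remaining tokens (no importance ratios or clipping in this sketch).
    token_logp = log_p_s.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
    pg_loss = -(advantages * token_logp)

    other_mask = (1.0 - key_mask) * (1.0 - error_mask)
    per_token = kl_weight * (key_mask * fkl + error_mask * rkl) + other_mask * pg_loss
    return per_token.mean()
```

In this reading, the warm-up-then-anneal schedule on kl_weight is what eventually removes the distillation channel, leaving only the GRPO term on all tokens.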
If this is right
- Delivers a 2.76 percentage point average improvement over GRPO on held-out math benchmarks and GPQA-Diamond.
- Maintains base-model out-of-distribution accuracy on GPQA-Diamond where full-response methods degrade.
- Retains roughly 69 percent of the strong-API gain (+1.90 of +2.76 percentage points) when critical spans are supplied by online self-annotation.
- Optimal routing varies by scale: forward KL on key spans is preferred for Qwen3-8B, while reverse KL on error spans works better for Qwen3-1.7B.
Where Pith is reading between the lines
- The finite-exposure principle may generalize to other privileged-information alignment settings where uniform distillation risks collapse.
- Self-annotation viability suggests the approach could scale without constant external labeling in resource-limited environments.
Load-bearing premise
The reported gains arise from the token-routing and span-masking mechanism rather than from the quality or selection process of the annotator-marked critical spans.
What would settle it
An ablation that applies the KL distillation uniformly across all tokens while retaining the same annotation pipeline, which should recover the entropy rise and OOD degradation of the all-token baseline if routing and masking are the operative factors.
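A hedged sketch of how such span-mask controls could be set up (a hypothetical helper, not taken from the paper; it assumes span annotations arrive as (start, end) index pairs no longer than the sequence). The "uniform" mode is the control described above, and a length-matched "random" mode corresponds to the related ablation raised in the referee report below.

```python
import torch

def build_kl_mask(key_spans, seq_len, mode="routed", generator=None):
    """Hypothetical helper for the control conditions discussed here.

    mode="routed":  KL only on the annotator-marked spans (the TRACE setting).
    mode="uniform": same annotation pipeline, but KL applied to every token
                    (the all-token control expected to reproduce the entropy
                    rise and OOD degradation if routing/masking are operative).
    mode="random":  length-matched spans at random positions, separating the
                    routing mechanism from annotator span quality.
    """
    if mode == "uniform":
        return torch.ones(seq_len)
    mask = torch.zeros(seq_len)
    for start, end in key_spans:
        length = end - start  # assumes 0 < length <= seq_len
        if mode == "random":
            start = int(torch.randint(0, seq_len - length + 1, (1,), generator=generator))
            end = start + length
        mask[start:end] = 1.0
    return mask
```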
Original abstract
On-policy self-distillation (self-OPD) densifies reinforcement learning with verifiable rewards (RLVR) by letting a policy teach itself under privileged context. We find that when this guidance spans the full response, all-token KL spends gradients on mostly redundant positions and amplifies privileged-information leakage, causing entropy rise, shortened reasoning, and out-of-distribution degradation in long-horizon math training. We propose Token-Routed Alignment for Critical rEasoning (TRACE), which distills only on annotator-marked critical spans: forward KL on key spans of correct rollouts, optional reverse KL on localized error spans, and GRPO on all remaining tokens, with the KL channel annealed away after a short warm-up. Our analysis explains TRACE through two effects: forward KL provides non-vanishing lift to teacher-supported tokens that the student under-allocates, while span masking and decay keep cumulative privileged-gradient exposure finite. On four held-out math benchmarks plus GPQA-Diamond, TRACE improves over GRPO by 2.76 percentage points on average and preserves the Qwen3-8B base OOD score on GPQA-Diamond, where GRPO and all-token self-OPD baselines degrade. Gains persist under online self-annotation (+1.90 percentage points, about 69% of the strong-API gain), reducing the concern that TRACE merely imports external annotator capability. Across scales, the best routed action is base-dependent: on Qwen3-8B it is forward KL on key spans, while on Qwen3-1.7B it shifts to reverse KL on error spans.
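One way the "non-vanishing lift" claim can be made concrete (a sketch under the usual softmax parameterization, not taken from the paper): the per-position forward-KL gradient with respect to a student logit is the probability gap between student and teacher, so it stays bounded away from zero wherever the privileged teacher supports a token that the student under-allocates.

```latex
% Per-position forward KL between the privileged-context teacher \pi_T and the
% student \pi_\theta, with student logits z at position t and privileged context c.
\begin{aligned}
D_{\mathrm{FKL}}(t) &= \sum_{v} \pi_T(v \mid x, y_{<t}, c)\,
  \log \frac{\pi_T(v \mid x, y_{<t}, c)}{\pi_\theta(v \mid x, y_{<t})},\\[2pt]
\frac{\partial D_{\mathrm{FKL}}(t)}{\partial z_v}
  &= \pi_\theta(v \mid x, y_{<t}) - \pi_T(v \mid x, y_{<t}, c).
\end{aligned}
```

For a teacher-supported token $v$ with $\pi_T(v) \gg \pi_\theta(v)$ the gradient magnitude is roughly $\pi_T(v)$ and does not vanish as the student's probability shrinks, whereas the reverse-KL gradient is weighted by $\pi_\theta(v)$ and does.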
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes TRACE (Token-Routed Alignment for Critical rEasoning), a token-routed variant of on-policy self-distillation (self-OPD) for RLVR on math reasoning. It restricts forward KL to annotator-marked key spans of correct rollouts and optional reverse KL to localized error spans, applies GRPO to all other tokens, and anneals the KL channel after a short warm-up. The central claim is that this avoids redundant gradient allocation and privileged-information leakage that cause entropy rise, shortened reasoning, and OOD degradation in all-token self-OPD. Empirically, TRACE yields a 2.76 pp average gain over GRPO across four held-out math benchmarks plus GPQA-Diamond while preserving the Qwen3-8B base OOD score (where GRPO and all-token baselines degrade); gains persist at +1.90 pp under online self-annotation.
Significance. If the reported gains are attributable to the routing and span-masking mechanics rather than annotator span quality, the work supplies a concrete mechanism for limiting privileged-gradient exposure while retaining non-vanishing lift on under-allocated tokens. The two-effect analysis and the online self-annotation result would be useful for scaling self-distillation without external APIs. The base-dependent choice of forward vs. reverse KL also highlights that optimal routing is model-scale dependent.
major comments (3)
- [Abstract and Experiments] The headline 2.76 pp average improvement and OOD preservation on GPQA-Diamond are reported without variance, standard errors, statistical significance tests, or the exact span-marking protocol (criteria, inter-annotator agreement, or length statistics). This leaves the central empirical claim under-supported for verification.
- [Experiments] No ablation replaces the annotator-marked critical spans with random or length-matched tokens while preserving the KL schedule, GRPO on remaining tokens, and annealing. Without this control, it is impossible to isolate whether the token-routed mechanism (rather than the quality or selection process of the marked spans) produces the claimed non-vanishing lift and finite privileged-gradient exposure.
- [Analysis] The two-effect explanation (forward KL supplying non-vanishing lift to teacher-supported tokens; span masking plus decay keeping cumulative privileged exposure finite) is invoked to explain the results, yet the manuscript supplies no equations, quantitative derivation, or direct linkage showing these effects are load-bearing for the 2.76 pp gain versus the baselines.
minor comments (2)
- [Abstract] The abstract treats the KL annealing schedule and warm-up length as free parameters but provides neither concrete values nor sensitivity results.
- Hyperparameter details for routing thresholds, span selection, and the exact GRPO implementation on non-critical tokens are missing, hindering reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and commit to revisions that strengthen the empirical support and analysis without altering the core claims of the work.
Point-by-point responses
- Referee: [Abstract and Experiments] The headline 2.76 pp average improvement and OOD preservation on GPQA-Diamond are reported without variance, standard errors, statistical significance tests, or the exact span-marking protocol (criteria, inter-annotator agreement, or length statistics). This leaves the central empirical claim under-supported for verification.
Authors: We agree that variance estimates, standard errors, and significance testing would improve verifiability of the reported gains. In the revised manuscript we will add standard deviations computed over multiple random seeds for all main results, include paired statistical significance tests against the GRPO and all-token baselines, and expand the Experiments section (plus appendix) with the precise span-marking criteria, inter-annotator agreement metrics, and length-distribution statistics for the marked spans. revision: yes
- Referee: [Experiments] No ablation replaces the annotator-marked critical spans with random or length-matched tokens while preserving the KL schedule, GRPO on remaining tokens, and annealing. Without this control, it is impossible to isolate whether the token-routed mechanism (rather than the quality or selection process of the marked spans) produces the claimed non-vanishing lift and finite privileged-gradient exposure.
Authors: We concur that a random or length-matched span ablation is the cleanest way to isolate the routing mechanism from span quality. We will add this control experiment in the revised version, keeping the KL schedule, GRPO application on non-routed tokens, and annealing schedule identical, so that any performance difference can be attributed to the choice of routed positions rather than the annotators' selection process. revision: yes
- Referee: [Analysis] The two-effect explanation (forward KL supplying non-vanishing lift to teacher-supported tokens; span masking plus decay keeping cumulative privileged exposure finite) is invoked to explain the results, yet the manuscript supplies no equations, quantitative derivation, or direct linkage showing these effects are load-bearing for the 2.76 pp gain versus the baselines.
Authors: We accept that the current analysis is largely qualitative. In the revision we will augment the Analysis section with explicit equations formalizing the non-vanishing lift under forward KL on under-allocated tokens and the finite cumulative privileged exposure under span masking plus annealing. Where possible we will provide a quantitative sketch linking these terms to the observed gap versus all-token self-OPD and GRPO. revision: yes
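As a hedged illustration of what the finite-exposure half of such an analysis could look like (assumed exponential annealing with a fixed warm-up and a fixed span fraction; not the authors' derivation):

```latex
% Let \rho \in (0,1] be the fraction of positions inside marked spans and \lambda_t
% the KL weight at step t: \lambda_t = \lambda_0 during a warm-up of T_w steps,
% then \lambda_t = \lambda_0 e^{-(t - T_w)/\tau}. Cumulative privileged-gradient
% exposure is then bounded:
\sum_{t=0}^{\infty} \rho\,\lambda_t \;\le\; \rho\,\lambda_0\!\left(T_w + \frac{1}{1 - e^{-1/\tau}}\right) \;<\; \infty,
% whereas all-token distillation with a constant weight (\rho = 1, \lambda_t = \lambda_0)
% gives a divergent sum, i.e. unbounded exposure over long-horizon training.
```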
Circularity Check
No circularity: empirical gains on held-out benchmarks with mechanistic description
Full rationale
The paper reports experimental improvements (2.76 pp average lift over GRPO, OOD preservation on GPQA-Diamond) from implementing token-routed forward/reverse KL on annotator-marked spans plus GRPO elsewhere, with annealing. No equations, derivations, or self-citations are shown that reduce the claimed performance deltas to quantities defined by the method's own fitted parameters, span selections, or prior author results. The two-effect analysis is a post-hoc interpretation of observed training dynamics rather than a closed loop that forces the result by construction. Results are framed as direct comparisons on external benchmarks, satisfying the self-contained empirical standard.
Axiom & Free-Parameter Ledger
free parameters (2)
- KL annealing schedule and warm-up length
- Critical span marking criteria
axioms (1)
- Domain assumption: Annotators can reliably identify critical reasoning spans that correspond to positions where the student under-allocates probability mass.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · connection unclear. Matched text: "TRACE routes teacher signal only to annotator-marked critical spans... FKL on key spans of correct rollouts, optional reverse KL on localized error spans, GRPO on all remaining tokens, with the KL channel annealed away after a short warm-up."
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · connection unclear. Matched text: "Our analysis explains TRACE through two effects: forward KL provides non-vanishing lift... while span masking and decay keep cumulative privileged-gradient exposure finite."