Cornerstones or Stumbling Blocks? Deciphering the Rock Tokens in On-Policy Distillation

Dawei Li; Runchao Li; Shubhashis Roy Dipta; Yuxuan Jiang; Zhao Yang

arxiv: 2605.09253 · v2 · pith:2QE6M6KH · submitted 2026-05-10 · 💻 cs.CL · cs.AI


Pith reviewed 2026-05-12 02:28 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords: on-policy distillation, rock tokens, high-loss tokens, model alignment, token weighting, distillation efficiency, reasoning performance


The pith

High-loss Rock Tokens in on-policy distillation resist training yet add almost nothing to reasoning performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines on-policy distillation, the process where a student model learns to match a teacher's token-by-token probabilities while generating its own outputs. It discovers that a persistent group of high-loss tokens, labeled Rock Tokens, continue to show large mismatches long after training appears to stabilize, and these tokens consume a large share of the total training gradients. Causal tests that alter or remove these tokens produce almost no change in the student model's ability to reason or solve problems. If this holds, then standard distillation wastes substantial effort on tokens the student model neither can nor needs to copy, opening a route to simpler and faster alignment at scale by treating tokens unequally.

Core claim

Even after on-policy distillation reaches apparent saturation, a substantial subset of tokens continues to exhibit persistently high loss. These Rock Tokens can account for up to 18% of the tokens in generated outputs. Because they occur so often, they account for a disproportionately large share of total gradient norms, yet they remain stagnant throughout training and resist teacher-driven corrections. Causal intervention shows that these tokens contribute negligibly to the model's actual reasoning performance, indicating that optimization bandwidth is spent on structural and discourse residuals that the student cannot or need not internalize.

What carries the argument

Rock Tokens: the persistently high-loss tokens under the per-token KL objective that resist correction while showing negligible downstream effect on reasoning.
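As a rough illustration of the per-token KL objective named above, the sketch below computes one KL value per generated position from student and teacher logits. The KL direction (student relative to teacher), the function names, and the toy logits are assumptions made for illustration; this page does not state the paper's exact formulation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def per_token_kl(student_logits, teacher_logits):
    """KL(student || teacher) at a single position (direction is an assumption)."""
    p = softmax(student_logits)
    q = softmax(teacher_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def sequence_kl(student_seq, teacher_seq):
    """One KL value per position of a student-generated sequence."""
    return [per_token_kl(s, t) for s, t in zip(student_seq, teacher_seq)]

# Toy logits over a 3-token vocabulary at two positions (made-up values).
student = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
teacher = [[2.0, 0.5, -1.0], [3.0, -1.0, -1.0]]
losses = sequence_kl(student, teacher)
print(losses)  # position 0 agrees with the teacher, position 1 does not
```

In this framing, a Rock Token would be a vocabulary item whose positions keep producing values like the second one across training checkpoints.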

Load-bearing premise

The tests that change or remove these high-loss tokens accurately capture whether they affect the model's final reasoning outputs.

What would settle it

Performing the causal intervention on Rock Tokens and observing clear changes in the model's reasoning accuracy or outputs would show the contribution is not negligible.

Figures

Figures reproduced from arXiv: 2605.09253 by Dawei Li, Runchao Li, Shubhashis Roy Dipta, Yuxuan Jiang, Zhao Yang.

**Figure 1.** The lifecycle and functional impact of Rock Tokens in OPD. (a) Phenomenon: Identification of optimization-resistant tokens. (b) Mechanism: Causal evidence of structural redundancy via token knock-out. (c) Utility: Performance parity achieved through strategic gradient sparsification.

… is inferred from policy entropy or reward signals, OPD provides a direct measure of student-teacher mismatch through per-to…

**Figure 2.** Empirical identification and stability of Rock Tokens. (a) Per-token KL ℓ̂_v vs. frequency on N=500 MATH-500 trajectories: rare tokens are noise-dominated, while the Rock Score R(v) isolates true Rock Tokens (red) at the upper edge of stable frequency bands. (b) Per-sequence Rock-Token density (median 18.5%). (c) Cumulative loss coverage (blue) and selection stability (red) vs. cutoff K; K=100 balances repr…

**Figure 3.** Per-token gradient geometry and persistence under training. (a) Per-token logit-gradient magnitude ∥ḡ_t∥ by group: rocks are an order of magnitude smaller than rare high-KL tokens. (b) Cosine alignment with the frequency-balanced descent direction G_balanced: rocks are positively aligned, with a tail reaching cos > 0.3. (c) Per-token mean KL paired across two training checkpoints (log-log). Points below the…

**Figure 4.** Knockout effect on the |R̃| = 200 screened Rock-Token candidates, sorted by ∆. Each bar is a single candidate; height is the accuracy change when its logit is masked at decode time. The shaded grey band marks the categorization threshold |∆| < ε = 0.01. Bars outside the band that pass the paired-bootstrap test (α = 0.05, 10,000 resamples) are coloured by category (Strong Pillar in red; Strong Stumbling in…

**Figure 5.** Average accuracy across AIME24, AIME25, and HMMT25 during OPD training. Each 200 training steps correspond to 8,000 prompts, with 4 rollouts per prompt.

4.1 RQ3: What is the genuine functional contribution of Rock Tokens to model training?

The functional redundancy of Rock Tokens at inference prompts a critical question: do their persistent high-loss signals provide essential constraints, or are they mer…

**Figure 6.** Pillarhood is not predicted by entropy, frequency, or loss. MATH-500 knockout ∆ for each of the |R̃| = 200 screened candidates (Strong Pillars in red) plotted against six candidate predictors: post- and pre-OPD student entropy, teacher entropy, log-frequency, rock rate, and mean post-OPD KL. Annotated r, p are Pearson correlations over all 200 candidates. None reach |r| > 0.07. Multiple-testing considerat…
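The Figure 4 caption names a paired-bootstrap test (α = 0.05, 10,000 resamples) and a threshold ε = 0.01 but does not spell out the procedure. A minimal sketch of one common way to run such a test follows. The percentile interval, the category labels, the sign convention (a negative ∆ means masking the token hurt accuracy), and the toy outcomes are all assumptions, not the authors' implementation.

```python
import random

def paired_bootstrap_ci(base, knock, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile CI for mean(knock - base) over paired per-problem outcomes."""
    assert len(base) == len(knock) and base
    rng = random.Random(seed)
    diffs = [k - b for b, k in zip(base, knock)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        sample = rng.choices(diffs, k=n)  # resample problems with replacement
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lo, hi

def categorize(delta, lo, hi, eps=0.01):
    """Hypothetical labels: inside the band is neutral; otherwise use the CI."""
    if abs(delta) < eps:
        return "neutral"
    if hi < 0:
        return "pillar"      # masking the token lowers accuracy
    if lo > 0:
        return "stumbling"   # masking the token raises accuracy
    return "inconclusive"

# Made-up per-problem correctness (1 = solved) for 100 problems.
base = [1] * 80 + [0] * 20
knock = [1] * 60 + [0] * 40
delta, lo, hi = paired_bootstrap_ci(base, knock)
print(delta, (lo, hi), categorize(delta, lo, hi))
```

Resampling whole problems keeps each baseline outcome paired with its knockout outcome, which is what makes the test paired rather than a comparison of two independent accuracies.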

Original abstract

While recent work in Reinforcement Learning with Verifiable Rewards (RLVR) has shown that a small subset of critical tokens disproportionately drives reasoning gains, an analogous token-level understanding of On-Policy Distillation (OPD) remains largely unexplored. In this work, we investigate high-loss tokens, a token type that, as the most direct signal of student-teacher mismatch under OPD's per-token KL objective, should progressively diminish as training converges according to existing studies; however, our empirical analysis shows otherwise. Even after OPD training reaches apparent saturation, a substantial subset of tokens continues to exhibit persistently high loss; these tokens, which we term Rock Tokens, can account for up to 18% of the tokens in generated outputs. Our investigation reveals two startling paradoxes. First, despite their high occurrence frequency providing a disproportionately large share of total gradient norms, Rock Tokens themselves remain stagnant throughout training, resisting teacher-driven corrections. Second, through causal intervention, we find that these tokens provide negligible functional contribution to the model's actual reasoning performance. These findings suggest that a vast amount of optimization bandwidth is spent on structural and discourse residuals that the student model cannot or need not internalize. By deconstructing these dynamics, we demonstrate that strategically bypassing these "stumbling blocks" can significantly streamline the alignment process, challenging the necessity of uniform token weighting and offering a more efficient paradigm for large-scale model distillation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Rock Tokens look like a real inefficiency in on-policy distillation, but the causal claim that they contribute nothing rests on interventions that may not isolate their effect cleanly.


The paper's core observation is that in on-policy distillation, a subset of high-loss tokens keeps showing up even after training appears to saturate. These Rock Tokens can reach 18% of generated output, pull a large share of the gradient norm, yet stay resistant to the teacher's signal. The authors then run causal interventions and report that removing or altering them barely moves final reasoning performance. That combination of persistence, gradient dominance, and negligible downstream effect is the new piece; prior RLVR work flagged critical tokens but did not frame this specific mismatch inside the KL objective of distillation.

Referee Report

1 major / 2 minor

Summary. The paper investigates persistently high-loss tokens in on-policy distillation (OPD) for language models, termed 'Rock Tokens.' These tokens persist after apparent training saturation, comprise up to 18% of generated outputs, and account for a disproportionate share of gradient norms while resisting teacher corrections. Through causal interventions, the authors claim these tokens contribute negligibly to reasoning performance, suggesting that strategically bypassing them can streamline distillation by challenging uniform token weighting.

Significance. If the causal interventions are shown to isolate token contributions without confounding effects, the work would provide a valuable empirical lens on token-level dynamics in OPD, highlighting optimization inefficiencies and motivating targeted weighting schemes for more efficient large-scale distillation. The manuscript is credited for its observational analysis of training saturation and the application of causal interventions to probe functional contributions.

major comments (1)

Abstract: the central claim that causal interventions demonstrate negligible functional contribution of Rock Tokens to reasoning performance is load-bearing for the proposal to bypass them. However, in autoregressive generation, masking or altering specific tokens necessarily alters the conditioning context for all subsequent tokens. A null effect on final outputs could therefore reflect compensatory adjustments by later tokens rather than true lack of causal weight from the Rock Tokens. Without explicit controls for sequence position, length, or matched comparisons to non-Rock high-loss tokens, the intervention does not cleanly isolate the claimed negligible contribution.
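The referee's point about conditioning context can be seen on a toy example. The sketch below masks one token's logit at decode time, which is the intervention described in the Figure 4 caption, and shows that every later step then conditions on a different prefix. The bigram table, vocabulary, and greedy decoding are hypothetical stand-ins and not the paper's setup.

```python
import math

# Hypothetical toy bigram "model": next-token logits over a 4-token
# vocabulary, looked up by the previous token id.
TOY_LOGITS = {
    0: [0.0, 2.0, 1.0, -1.0],
    1: [0.5, 0.0, 3.0, 1.0],
    2: [1.0, 0.0, 0.0, 2.5],
    3: [2.0, 1.0, 0.0, 0.0],
}

def greedy_decode(start, steps, banned=None):
    """Greedy decoding; if `banned` is set, that token's logit is masked."""
    seq = [start]
    for _ in range(steps):
        logits = list(TOY_LOGITS[seq[-1]])
        if banned is not None:
            logits[banned] = -math.inf  # knock the token out at decode time
        seq.append(max(range(len(logits)), key=logits.__getitem__))
    return seq

baseline = greedy_decode(start=0, steps=4)
knocked = greedy_decode(start=0, steps=4, banned=2)
print(baseline)  # [0, 1, 2, 3, 0]
print(knocked)   # [0, 1, 3, 0, 1]
```

The two sequences agree up to the position where token 2 would have been emitted and diverge afterwards, so an unchanged final answer cannot by itself distinguish "the token did not matter" from "later tokens compensated".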

minor comments (2)

Abstract and methods: details on dataset sizes, exact intervention methods (e.g., masking vs. replacement), statistical controls, and the operational definition of 'saturation' are missing, hindering verification of the empirical observations and gradient dominance claims.
The paper should include a dedicated section contrasting the OPD findings with prior token-level analyses in RLVR to clarify novelty.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback, which identifies a substantive methodological consideration in our causal analysis. We address the concern directly below and describe the revisions we will make.


Referee: Abstract: the central claim that causal interventions demonstrate negligible functional contribution of Rock Tokens to reasoning performance is load-bearing for the proposal to bypass them. However, in autoregressive generation, masking or altering specific tokens necessarily alters the conditioning context for all subsequent tokens. A null effect on final outputs could therefore reflect compensatory adjustments by later tokens rather than true lack of causal weight from the Rock Tokens. Without explicit controls for sequence position, length, or matched comparisons to non-Rock high-loss tokens, the intervention does not cleanly isolate the claimed negligible contribution.

Authors: We agree that autoregressive dependencies represent a potential confounder and that stronger isolation of token-level effects requires additional controls. Our original interventions replaced Rock Tokens with the teacher's token at the same position while continuing generation, yielding no measurable change in final reasoning accuracy. To address the referee's point, the revised manuscript will add: (i) position-stratified results (early/mid/late sequence interventions), (ii) length-matched sequence cohorts, and (iii) parallel interventions on non-Rock high-loss tokens at matched positions and loss magnitudes. These controls will be reported alongside the original findings to demonstrate that compensatory effects do not explain the null result for Rock Tokens specifically.

revision: partial
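Control (i) in the rebuttal, position-stratified results, could be tabulated along the following lines. This is a sketch only: the record layout, the tercile cut-offs for early/mid/late, and the ∆ values are invented for illustration and do not come from the paper.

```python
from collections import defaultdict

def stratify_by_position(records):
    """Mean accuracy change per position bin.

    records: iterable of (position, sequence_length, delta) tuples, where
    position is the 0-based index of the intervened token (assumed layout).
    """
    bins = defaultdict(list)
    for pos, length, delta in records:
        rel = pos / length  # relative position in [0, 1)
        label = "early" if rel < 1 / 3 else "mid" if rel < 2 / 3 else "late"
        bins[label].append(delta)
    return {label: sum(vals) / len(vals) for label, vals in bins.items()}

# Made-up intervention records: (token position, sequence length, delta).
records = [
    (10, 100, -0.02),
    (20, 100, 0.00),
    (50, 100, 0.01),
    (90, 100, -0.04),
    (95, 100, 0.00),
]
by_bin = stratify_by_position(records)
print(by_bin)
```

If compensation by later tokens explained the null result, one would expect early interventions, which leave more room to recover, to show smaller effects than late ones; a flat profile across bins would weaken that explanation.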

Circularity Check

0 steps flagged

No significant circularity: purely observational and interventional analysis


The paper reports empirical measurements of token losses during on-policy distillation, identifies persistently high-loss 'Rock Tokens' via direct observation, and assesses their functional contribution through causal interventions on generated sequences. No equations, closed-form derivations, or parameter-fitting steps are present that would reduce any reported quantity (e.g., gradient norms, loss values, or performance deltas) to a fitted input defined by the same data. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing premises. The central claims rest on external experimental benchmarks rather than self-referential reductions, satisfying the criteria for a self-contained observational study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that high per-token KL loss directly signals mismatch worth correcting and that causal interventions can isolate functional contribution; no free parameters or invented physical entities are introduced.

axioms (1)

domain assumption: High-loss tokens are the most direct signal of student-teacher mismatch under the per-token KL objective
Stated in the abstract as the basis for expecting these tokens to diminish with training.

invented entities (1)

Rock Tokens (no independent evidence)
purpose: Label for the subset of persistently high-loss tokens that resist correction
New descriptive term coined from empirical observation; no independent evidence outside the paper is provided.

pith-pipeline@v0.9.0 · 5565 in / 1234 out tokens · 39359 ms · 2026-05-12T02:28:31.072089+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Even after OPD training reaches apparent saturation, a substantial subset of tokens continues to exhibit persistently high loss; these tokens, which we term Rock Tokens... through causal intervention, we find that these tokens provide negligible functional contribution to the model's actual reasoning performance.

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

We define the initial Rock Score as R(v) = ℓ̄_v · Freq(v)... context-consistent rock rate CCR(v) = |R(v)| / Freq(v).
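The initial Rock Score quoted above, the mean per-token KL of a vocabulary item multiplied by its frequency, can be sketched as follows. Treating Freq(v) as a relative frequency, the helper names, and the toy observations are assumptions; the steps elided in the quoted passage and the CCR(v) refinement are not reproduced here.

```python
from collections import defaultdict

def rock_scores(observations):
    """Initial Rock Score per token from (token, per_token_kl) pairs."""
    total = defaultdict(float)
    count = defaultdict(int)
    n = 0
    for token, kl in observations:
        total[token] += kl
        count[token] += 1
        n += 1
    scores = {}
    for v in total:
        mean_kl = total[v] / count[v]  # mean KL of token v
        freq = count[v] / n            # Freq(v) as relative frequency (assumption)
        scores[v] = mean_kl * freq
    return scores

def top_k(scores, k):
    """The k tokens with the highest score, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Made-up observations: frequent high-KL tokens, one rare high-KL token,
# and a frequent low-KL token.
observations = [
    ("so", 0.9), ("so", 1.1), ("so", 1.0),
    ("wait", 0.8), ("wait", 1.2),
    ("zeta", 1.5),
    ("the", 0.01), ("the", 0.02), ("the", 0.03),
]
scores = rock_scores(observations)
top = top_k(scores, k=2)
print(top)  # ['so', 'wait']
```

Under this reading the score equals a token's contribution to the average per-token loss, so a rare token with a large but isolated KL ranks below a frequent token with a moderately high one. The Figure 2 caption reports a cutoff of K=100; the toy call uses k=2 only because the example vocabulary is tiny.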

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

When Safe Skills Collide: Measuring Compositional Risk in Agent Skill Ecosystems
cs.SE · 2026-05 · unverdicted · novelty 6.0

About 18.2% of structurally flagged skill pairs represent genuine compositional safety risks in agent skill registries, with exploitation gated by host model behavior.

Bridging Reasoning Trajectories in On-Policy Distillation via Near-Future Guidance
cs.CL · 2026-05 · unverdicted · novelty 6.0

TOPD augments on-policy distillation by using near-future trajectory signals to suppress non-divergent high-loss tokens and distribute guidance, raising average accuracy from 47.8% to 52.2% on reasoning benchmarks.

A Formula-Driven Survey and Research Agenda for On-Policy Distillation
cs.AI · 2026-06 · unverdicted · novelty 4.0

A survey creates a taxonomy for on-policy distillation in LLMs that separates temporal credit assignment from vocabulary-level probability routing.

RankVR: Low-Rank Structure Perception and Value Recalibration for Robust Composed Image Retrieval
cs.CV · 2026-06 · unverdicted · novelty 4.0

RankVR introduces GSCP and ASVC modules to improve CIR robustness by decoupling clean samples via low-rank structure and dynamically scoring triplet value in noisy datasets.

IMAGINE: Adaptive Schema-Imagery Enhanced Composition for Composed Video Retrieval
cs.CV · 2026-06 · unverdicted · novelty 4.0

IMAGINE uses adaptive schema-imagery via dynamic multimodal prototypes to incorporate implicit semantics into composed video retrieval, claiming SOTA results on CVR and CIR benchmarks.