arxiv: 2606.22830 · v1 · pith:JHWUX7UInew · submitted 2026-06-22 · 💻 cs.AI

Finding the Evidence: Discovering Decision-Supporting Tokens for On-Policy Reasoning Distillation

Pith reviewed 2026-06-26 08:48 UTC · model grok-4.3

classification 💻 cs.AI
keywords on-policy distillation · reasoning distillation · decision tokens · evidence tokens · token selection · student entropy · math reasoning · code generation

The pith

On-policy reasoning distillation transfers more knowledge when it also supervises the evidence tokens that justify decisions, which are found by hidden-state similarity to entropy-selected decision anchors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that reasoning chains split into decision points, which surface where the student is uncertain, and evidence steps, which hide where the student is confident but wrong. According to the authors, current methods capture only the decisions and leave the justifying evidence untransferred. DEAR first flags decisions with student entropy, then finds their supporting evidence by measuring hidden-state cosine similarity to those decision anchors, boosted by teacher-student divergence. Experiments across three student-teacher configurations show consistent gains over standard OPD on math and code benchmarks, with up to +2.5pp on competition math and +5.7pp on code generation. A reader would care because the approach targets a specific knowledge gap that entropy-based token selection misses.

Core claim

Reasoning chains contain two distinct kinds of knowledge: decisions, which surface through student entropy, and evidence, which appears at positions where the student is confident yet incorrect. DEAR locates the evidence tokens by computing hidden-state cosine similarity to the entropy-selected decision anchors and weighting that similarity by teacher-student divergence, so that the largest knowledge gaps are emphasized. Both types of tokens then receive targeted supervision during on-policy distillation.

What carries the argument

DEAR's evidence discovery mechanism, which ranks tokens by hidden-state cosine similarity to entropy-identified decision anchors boosted by teacher-student divergence.
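
As a rough illustration of this two-stage selection, the Python sketch below picks decision tokens by student entropy and then scores the remaining tokens by cosine similarity to those anchors, boosted by divergence. The multiplicative form a × (1 + d) follows the scoring function quoted in the Figure 11 text on this page. Everything else is an assumption made for illustration and may differ from the paper: the function name `dear_select`, min-max normalization of both terms, taking the maximum similarity over anchors, and sizing both sets as a fraction of the rollout length.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0.0 and nv > 0.0 else 0.0


def min_max(xs):
    """Rescale values to [0, 1]; an assumed stand-in for the paper's normalization."""
    lo, hi = min(xs), max(xs)
    return [0.0 if hi == lo else (x - lo) / (hi - lo) for x in xs]


def dear_select(entropy, divergence, hidden, p=0.2, q=0.2):
    """Return (decisions, evidence) as sets of token positions.

    entropy[t]    : student entropy at position t
    divergence[t] : teacher-student logprob divergence at position t
    hidden[t]     : student hidden-state vector at position t
    """
    n = len(entropy)
    positions = list(range(n))

    # Stage 1: the highest-entropy positions become decision anchors.
    n_dec = max(1, round(p * n))
    ranked = sorted(positions, key=lambda t: entropy[t], reverse=True)
    decisions = set(ranked[:n_dec])

    rest = [t for t in positions if t not in decisions]
    if not rest:
        return decisions, set()

    # Stage 2: relevance is the best cosine match to any anchor (assumption).
    relevance = [max(cosine(hidden[t], hidden[d]) for d in decisions) for t in rest]
    a_hat = min_max(relevance)
    d_hat = min_max([divergence[t] for t in rest])
    score = {t: a * (1.0 + d) for t, a, d in zip(rest, a_hat, d_hat)}

    n_ev = min(len(rest), max(1, round(q * n)))
    evidence = set(sorted(rest, key=lambda t: score[t], reverse=True)[:n_ev])
    return decisions, evidence
```

Because relevance multiplies the divergence boost in this sketch, a token with very large divergence but no similarity to any anchor still scores zero. That property is what separates this score from a plain divergence ranking.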

If this is right

  • Students distilled with DEAR reach higher accuracy on competition math than with standard OPD (up to +2.5pp reported).
  • Code generation improves on standard benchmarks (up to +5.7pp reported).
  • The gains appear consistently across different student-teacher model size pairs.
  • Evidence tokens identified this way contain the intermediate steps missed by decision-only supervision.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same selection logic could be tested on non-math reasoning tasks such as logical deduction or scientific explanation.
  • Overconfident student errors may systematically mark the locations where human-like justification is needed.
  • Human inspection of the selected tokens could check whether they align with the actual logical steps used in correct solutions.
  • Replacing cosine similarity with other representation distances might change which evidence is chosen and affect final gains.

Load-bearing premise

The tokens selected by entropy followed by hidden-state cosine similarity to decision anchors are the actual evidence that justifies the decisions and carries the transferable knowledge.

What would settle it

An ablation that removes or randomizes the evidence-selection step and shows that the performance gains on math and code benchmarks disappear.
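
One way to set up such a control is to swap the evidence set for a size-matched alternative while leaving the decision set untouched. The sketch below builds two controls: a random evidence set, and a divergence-only set that ignores hidden-state similarity. The function names and the budget-matching convention are hypothetical and are not taken from the paper.

```python
import random


def random_evidence(n_tokens, decisions, n_evidence, seed=0):
    """Sample a random evidence set of the same size, excluding decision tokens."""
    rng = random.Random(seed)  # fixed seed so the control is reproducible
    pool = [t for t in range(n_tokens) if t not in decisions]
    return set(rng.sample(pool, min(n_evidence, len(pool))))


def divergence_only_evidence(divergence, decisions, n_evidence):
    """Pick the highest-divergence non-decision tokens, with no similarity term."""
    pool = [t for t in range(len(divergence)) if t not in decisions]
    ranked = sorted(pool, key=lambda t: divergence[t], reverse=True)
    return set(ranked[:n_evidence])
```

If training on either control set matched the full method at the same token budget, the gains could not be credited to the similarity-based evidence step.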

Figures

Figures reproduced from arXiv: 2606.22830 by Jinwei Xiao, Qi Gu, Wentao Chen, Xunliang Cai, Yueqing Sun, Yuxin Liu, Zhengxi Lu, Zhiyuan Yao, Zhuowen Han.

Figure 1
Figure 1: Overview of DEAR. ❶ The student generates a rollout; both student entropy $H_t^S$ and teacher–student logprob divergence $\delta_t$ are computed per token. ❷ Decision Identification: the top-p% entropy tokens form the decision set D. ❸ Evidence Discovery: non-decision tokens are scored by cosine similarity to decisions, boosted by divergence; the top-q% form the evidence set E. ❹ The OPD loss is computed only on S = …
Figure 2
Figure 2: The distillation signal is extremely sparse. The top-20% of tokens carry ∼80% of total gradient mass (Gini = 0.776).
… not how many tokens to select, but what kind. Finding 2. Entropy selects the reasoning skeleton, not the reasoning knowledge. High-entropy positions are connectives and branching markers; substantive intermediate steps are uniformly low-entropy
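
For readers who want to compute the two sparsity statistics quoted in the Figure 2 caption on their own data, here is a minimal sketch. It assumes a list of non-negative per-token contributions (for example, per-token gradient magnitudes). How the paper measures gradient mass is not stated on this page, and the function names are illustrative.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = even, near 1 = concentrated)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard rank-weighted formula over values sorted in ascending order.
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2.0 * weighted) / (n * total) - (n + 1) / n


def top_share(values, frac=0.2):
    """Fraction of the total mass held by the largest `frac` of the values."""
    xs = sorted(values, reverse=True)
    total = sum(xs)
    if not xs or total == 0:
        return 0.0
    k = max(1, round(frac * len(xs)))
    return sum(xs[:k]) / total
```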
Figure 3
Figure 3: Entropy concentrates on the reasoning skeleton. Bright positions align with logical connectives and branching markers; intermediate-step tokens are uniformly dark regardless of correctness.
[Plot axes: Student Entropy, Logprob Divergence. Legend: Evidential Salience Expansion, Entropy Anchor.]
Figure 4
Figure 4: The token landscape. “Entropy Anchor” tokens (blue) are captured by entropy selection. “Evidential Salience Expansion” tokens (pink) concentrate in the low-entropy, high-divergence region.
Finding 3. Missed reasoning knowledge concentrates in “evidence tokens” where the student is confident yet wrong. These positions are structurally invisible to entropy selection. Where does the missed knowledge reside…
Figure 6
Figure 6: Evidence discovery targets high-value tokens. (a) Gradient mass captured by each method. (b) Recovery count stratified by semantic category.
4.4 Analysis: Closing the Loop. We analyze DEAR to verify that the discovered tokens are indeed evidence tokens that support decisions, and to understand why evidence-enriched training helps. Evidence tokens carry disproportionate learning signal. Figure 6a quantifi…
Figure 9
Figure 9: Training reward over steps. DEAR achieves higher reward throughout training, with the gap widening over time.
… consistently higher reward, with the gap widening as training progresses. As the student learns evidence early, its subsequent trajectories improve and expose further evidence tokens, creating a compounding effect. Standard OPD exhibits elevated clip ratios (Appendix C), symptomatic of noisy up…
Figure 8
Figure 8: Evidence recovery targets the correct region. Each cell shows recall difference (DEAR − decision-only) on the entropy–divergence quantile grid. Gains concentrate in the low-entropy, high-divergence region.
… carry 0.94× individually, yet dominate total gradient mass by volume. This confirms the complementary roles: entropy captures decisions, while DEAR recovers the numerous evidence positions where the k…
Figure 10
Figure 10: Sensitivity to selection ratios. Left: decision ratio p (Stage 1 only, no evidence). Right: evidence ratio q with decisions fixed at p=0.2. Both stages benefit from sparsity; performance degrades gracefully at higher ratios.
… broad range; the widely-adopted p=0.2 (Wang et al., 2026) is near-optimal. For the evidence ratio q (right, full DEAR with p=0.2), the optimum is at q=0.2 with gradual degradation …
Figure 11
Figure 11: PPO clip ratio over training. DEAR maintains a lower clip ratio than standard OPD, indicating more stable policy updates.
D Knowledge-Gap Scoring Ablation. The full scoring function ($\hat{a} \times (1 + \hat{\delta})$) outperforms both ablated variants. Relevance-only selection …
Figure 12
Figure 12: Training curves on math (Settings A, B, C). Rows: actor entropy, reward (mean), score (mean). Columns: Setting C (Qwen3-1.7B ← Qwen3-4B), Setting A (Qwen2.5-1.5B ← Qwen2.5-14B), Setting B (Qwen2.5-1.5B ← Qwen3-4B).
Original abstract

On-policy distillation transfers reasoning ability through dense token-level supervision, yet the nature of the transferable signal remains unclear. We discover that reasoning chains contain two types of knowledge that require different discovery mechanisms: decisions (where to branch), which surface through student uncertainty, and evidence (intermediate steps that justify decisions), which hides in positions where the student is confident yet wrong. Current methods capture only decisions; the substantive knowledge in evidence tokens remains untransferred. We propose DEAR (Decision-Evidence Aware Reasoning Distillation), which first identifies decisions via student entropy, then discovers their supporting evidence through hidden-state cosine similarity to decision anchors, boosted by teacher-student divergence to prioritize the largest knowledge gaps. Across three student-teacher configurations on math and code benchmarks, DEAR consistently outperforms standard OPD, with up to +2.5pp on competition math and +5.7pp on code generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that on-policy reasoning distillation transfers two distinct types of knowledge—decisions (identified via student entropy) and evidence (intermediate justifying steps hidden in confident-but-wrong positions)—and proposes DEAR to discover the latter via hidden-state cosine similarity to decision anchors, modulated by teacher-student divergence. It reports that DEAR outperforms standard OPD across three student-teacher pairs on math and code benchmarks, with gains up to +2.5pp on competition math and +5.7pp on code generation.

Significance. If the cosine-similarity procedure is shown to isolate causally relevant evidence tokens rather than correlated high-uncertainty positions, the work would provide a concrete mechanism for more targeted knowledge transfer in reasoning distillation and a useful decomposition of what is learned during on-policy training. The empirical gains are modest but consistent; the primary value would lie in establishing the mechanistic interpretation rather than the raw numbers alone.

major comments (2)
  1. [Experiments] Experiments section: the attribution of the reported gains (+2.5pp math, +5.7pp code) to the evidence-discovery step is not supported by any ablation that removes or perturbs only the cosine-similarity component while retaining entropy-based decision selection and divergence boosting; without such controls it is impossible to rule out that the benefit arises from simply surfacing high-divergence tokens irrespective of the 'evidence' framing.
  2. [Method] Method section (DEAR description): the claim that hidden-state cosine similarity to decision anchors isolates 'substantive evidence that justifies decisions' is presented as the core innovation, yet no direct validation (counterfactual token masking, human relevance ratings, or comparison against random high-entropy tokens) is provided to establish that these tokens carry the transferable justifying knowledge rather than a non-causal correlate.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'three student-teacher configurations' is used without naming the models or datasets, which would aid immediate assessment of scope.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight the need for stronger controls to attribute gains specifically to the evidence-discovery mechanism. We address each point below and will incorporate revisions to improve the manuscript.

  1. Referee: [Experiments] Experiments section: the attribution of the reported gains (+2.5pp math, +5.7pp code) to the evidence-discovery step is not supported by any ablation that removes or perturbs only the cosine-similarity component while retaining entropy-based decision selection and divergence boosting; without such controls it is impossible to rule out that the benefit arises from simply surfacing high-divergence tokens irrespective of the 'evidence' framing.

    Authors: We agree that the current experiments compare full DEAR against standard on-policy distillation but lack a targeted ablation that disables only the cosine-similarity step while retaining entropy-based decision selection and divergence modulation. This limits causal attribution of the gains. We will add this ablation (and the corresponding results) in the revised manuscript. revision: yes

  2. Referee: [Method] Method section (DEAR description): the claim that hidden-state cosine similarity to decision anchors isolates 'substantive evidence that justifies decisions' is presented as the core innovation, yet no direct validation (counterfactual token masking, human relevance ratings, or comparison against random high-entropy tokens) is provided to establish that these tokens carry the transferable justifying knowledge rather than a non-causal correlate.

    Authors: The manuscript motivates the cosine-similarity procedure from the observed pattern that evidence tokens appear in confident-but-wrong positions and reports consistent empirical gains across three student-teacher pairs. However, we acknowledge that direct causal validations such as token-masking experiments, human ratings, or explicit comparison to random high-entropy tokens are absent. We will include additional analysis addressing this in the revision. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical method evaluated on external benchmarks

The paper defines DEAR procedurally (entropy for decisions, cosine similarity plus divergence for evidence) and reports benchmark gains on math/code tasks. No equations, fitted parameters, or self-citations are shown that reduce any claimed result to its inputs by construction. The derivation chain consists of an algorithmic procedure whose outputs are measured against independent test sets, making the reported improvements non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are stated.

pith-pipeline@v0.9.1-grok · 5708 in / 1008 out tokens · 18802 ms · 2026-06-26T08:48:08.382011+00:00 · methodology

Reference graph

Works this paper leans on

24 extracted references · 13 linked inside Pith

  1. [1]

    On-policy distillation of language models: Learning from self-generated mistakes. In International Conference on Learning Representations, volume 2024, pages 21246–21263. Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, and 1 others

  2. [2]

    Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374. Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, and 4 others

  3. [3]

    Process Reinforcement through Implicit Rewards. CoRR, abs/2502.01456. Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, and 1 others

  4. [4]

    Keypoint-based progressive chain-of-thought distillation for llms. arXiv preprint arXiv:2405.16064. Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Chenghao Ma, Liang Chen, Runxin Xu, Zhengyang Tang, Benyou Wang, Daoguang Zan, Shanghaoran Quan, Ge Zhang, Lei Sha, Yichang Zhang, Xuancheng Ren, Tianyu Liu, and Baobao Chang

  5. [5]

    Omni-MATH: A Universal Olympiad Level Mathematic Benchmark for Large Language Models. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28,

  6. [6]

    Minillm: Knowledge distillation of large language models. In International Conference on Learning Representations, volume 2024, pages 32694–32717. Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun

  7. [7]

    OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, pages 3828–3850. Association for Computational Linguistics. Zhiwei H...

  8. [8]

    DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning. CoRR, abs/2504.11456. Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt

  9. [9]

    Measuring Coding Challenge Competence With APPS. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual. Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean

  10. [10]

    Distilling the knowledge in a neural network. CoRR, abs/1503.02531. Woogyeol Jin, Taywon Min, Yongjin Yang, Swanand Ravindra Kadhe, Yi Zhou, Dennis Wei, Nathalie Baracaldo, and Kimin Lee

  11. [11]

    Entropy-aware on-policy distillation of language models. arXiv preprint arXiv:2603.07079. Jongwoo Ko, Sara Abdali, Young Jin Kim, Tianyi Chen, and Pashmina Cameron

  12. [12]

    Scaling reasoning efficiently via relaxed on-policy distillation. arXiv preprint arXiv:2603.11137. Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, and 1 others

  13. [13]

    Rethinking on-policy distillation of large language models: Phenomenology, mechanism, and recipe. arXiv preprint arXiv:2604.13016. Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang

  14. [14]

    Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16,

  15. [15]

    Demystifying opd: Length inflation and stabilization strategies for large language models. arXiv preprint arXiv:2604.08527. MAA

  16. [16]

    The linear representation hypothesis and the geometry of large language models. arXiv preprint arXiv:2311.03658. Changshuo Shen, Leheng Sheng, Yuxin Chen, An Zhang, and Xiang Wang

  17. [17]

    Reasoning can be restored by correcting a few decision tokens. arXiv preprint arXiv:2605.16874. Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu

  18. [18]

    HybridFlow: A Flexible and Efficient RLHF Framework. In Proceedings of the Twentieth European Conference on Computer Systems, EuroSys 2025, Rotterdam, The Netherlands, 30 March 2025 - 3 April 2025, pages 1279–1297. ACM. Qwen Team

  19. [19]

    Qwen3 Technical Report. CoRR, abs/2505.09388. Ian Tenney, Dipanjan Das, and Ellie Pavlick

  20. [20]

    Tip: Token importance in on-policy distillation. arXiv preprint arXiv:2604.14084. An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, and 22 others

  21. [21]

    Qwen2.5 Technical Report. CoRR, abs/2412.15115. Wenkai Yang, Weijie Liu, Ruobing Xie, Kai Yang, Saiyong Yang, and Yankai Lin

  22. [22]

    Learning beyond teacher: Generalized on-policy distillation with reward extrapolation. arXiv preprint arXiv:2602.12125. Dongxu Zhang, Zhichao Yang, Sepehr Janghorbani, Jun Han, Andrew Ressler II, Qian Qian, Gregory D Lyng, Sanjit Singh Batra, and Robert E Tillman

  23. [23]

    predictive state

    Fast and effective on-policy distillation from reasoning prefixes. arXiv preprint arXiv:2602.15260.

    A Why Cosine Similarity Detects Evidence Membership. In causal transformers, the hidden state at position t after L layers, $h_t^L$, is not merely a representation of token $y_t$ but the model’s compressed working memory encoding all causally accessible con...

  24. [24]

    Relevance

    confirm that deep-layer representations are dominated by semantic and functional information, with syntactic features attenuating. While these findings originate from encoder models, subsequent work on autoregressive LLMs has confirmed that deep layers similarly encode task-relevant semantics over surface form. This means cosine similarity at deep laye...