
arxiv: 2605.17497 · v1 · pith:7HG5I3ZNnew · submitted 2026-05-17 · 💻 cs.LG

Self-Supervised On-Policy Distillation for Reasoning Language Models

Pith reviewed 2026-05-20 14:38 UTC · model grok-4.3

classification 💻 cs.LG
keywords self-supervised distillation · on-policy learning · reasoning language models · process supervision · GRPO · AIME benchmark · correct-wrong contrast · stopping time

The pith

Distilling from the shortest correct on-policy completion into the longest wrong one improves reasoning model performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Self-Supervised On-Policy Distillation (SSOPD) as a way to extract richer training signals from the multiple attempts that GRPO-style training already generates per prompt. Instead of using those attempts only for final rewards, it treats the shortest correct completion as a self-generated teacher and distills its distribution into the prefixes of the longest wrong completion. This turns the natural correct-wrong contrast inside each group into dense process supervision without needing external solution traces. A stopping-time argument justifies selecting the shortest correct and longest wrong examples as a practical approximation for steering the policy toward quicker success paths. Experiments across nine model-benchmark combinations show consistent gains over standard GRPO and over a solution-conditioned baseline.

Core claim

The central claim is that distilling a teacher distribution conditioned on the shortest correct completion into prefixes of the longest wrong completion converts intra-group contrast into dense process supervision, yielding higher reasoning performance than terminal-reward-only training.
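For contrast, the terminal-reward-only signal that GRPO-style training draws from a group can be sketched as below. This is a minimal illustration of the standard group-relative advantage (reward minus group mean, divided by group standard deviation), not the paper's code. The function name, the `eps` constant, and the choice of population standard deviation are assumptions; implementations differ on those details.

```python
from statistics import mean, pstdev
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Scalar advantage per completion: (r_i - mean) / (std + eps).

    Every token of completion i shares this one number, which is why the
    signal is described as terminal-reward-only rather than process-level.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some codebases use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]

# A mixed group: two correct (reward 1) and two wrong (reward 0) attempts.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every completion in a group receives the same reward, all advantages are zero and the prompt contributes no gradient; only mixed groups carry a contrast.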

What carries the argument

Self-Supervised On-Policy Distillation (SSOPD), which distills from the shortest correct completion into prefixes of the longest wrong completion using a prompt-level frontier weight to focus loss where both branches exist.
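The selection rule can be sketched in a few lines. This is a hedged illustration under stated assumptions: the `Completion` type, the function names, and the use of token count as the length measure are hypothetical and do not come from the paper's code.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Completion:
    tokens: List[str]  # tokens of one on-policy attempt
    correct: bool      # verifier outcome for the final answer

def select_pair(group: List[Completion]) -> Optional[Tuple[Completion, Completion]]:
    """Return (shortest correct, longest wrong), or None if the group is not mixed."""
    correct = [c for c in group if c.correct]
    wrong = [c for c in group if not c.correct]
    if not correct or not wrong:
        return None  # no correct-wrong contrast in this group
    teacher_witness = min(correct, key=lambda c: len(c.tokens))
    student_rollout = max(wrong, key=lambda c: len(c.tokens))
    return teacher_witness, student_rollout

def student_prefixes(wrong: Completion) -> List[List[str]]:
    """Contexts preceding each token of the wrong completion (empty prefix first)."""
    return [wrong.tokens[:t] for t in range(len(wrong.tokens))]
```

Ties on length are broken by order within the group here; the source does not say how ties are handled.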

If this is right

  • SSOPD raises macro Avg@12 from 64.0 to 65.6 on Qwen3-8B across AIME 2024, AIME 2025, and HMMT 2025.
  • The method outperforms GRPO by 1.6 points and a solution-conditioned OPSD baseline by 0.8 points in the same setting.
  • Dense process supervision is obtained solely from on-policy rollouts without any external solution traces.
  • The stopping-time view supplies a concrete rule for choosing which completions to use as teacher and student within each group.
  • A prompt-level frontier weight automatically concentrates the auxiliary loss on prompts that contain both correct and wrong branches.
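The exact form of the frontier weight is not given on this page. The sketch below uses a hypothetical stand-in, 4p(1-p) with p the fraction of correct completions in the group, chosen only because it has the behaviour described: zero when a group is all correct or all wrong, largest when the two branches are balanced. It should not be read as the paper's formula.

```python
def frontier_weight(num_correct: int, group_size: int) -> float:
    """Hypothetical prompt-level weight: 4 * p * (1 - p), p = fraction correct.

    Assumption for illustration only; the paper's definition may differ.
    """
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    if not 0 <= num_correct <= group_size:
        raise ValueError("num_correct must lie in [0, group_size]")
    p = num_correct / group_size
    return 4.0 * p * (1.0 - p)

# Weights for a group of 8 as the number of correct completions varies.
weights = [frontier_weight(k, 8) for k in range(9)]
```

Any function that vanishes at p = 0 and p = 1 would show the same qualitative effect; the quadratic is just the simplest such choice.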

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same shortest-correct / longest-wrong selection rule could be tested in other on-policy reinforcement learning setups that generate multiple completions per prompt.
  • If the stopping-time approximation holds, similar contrastive signals might appear in non-language domains that produce trajectories with clear success and failure endpoints.
  • Removing the need for external solution traces could lower the data cost of scaling reasoning models that currently rely on curated answer sets.
  • Extending the frontier weight idea to multi-turn or agentic settings might concentrate supervision on the exact decision points where policies diverge.

Load-bearing premise

The shortest correct completion and longest wrong completion in a finite on-policy group give a reliable approximation to editing persistent failures toward fast-success actions.

What would settle it

Running SSOPD versus plain GRPO on a held-out model and benchmark pair and finding no accuracy gain or a loss would falsify the claim that the method reliably improves reasoning.

Figures

Figures reproduced from arXiv: 2605.17497 by Yinrong Hong, Zhiquan Tan.

Figure 1: SSOPD training pipeline. A GRPO group provides both successful and failed on-policy completions. SSOPD selects a self-generated successful witness, applies a teacher distribution at prefixes of a failed completion, and distills this local distribution into the student with a prompt-level frontier weight.

Original abstract

GRPO-style RLVR trains reasoning models from multiple on-policy attempts per prompt, but typically uses these attempts only through terminal rewards. We show that a mixed group contains a richer process signal: a correct completion is a self-generated witness of how the current policy can solve the problem, while a wrong completion provides on-policy prefixes where the policy needs correction. We introduce Self-Supervised On-Policy Distillation (SSOPD), which distills a teacher distribution conditioned on the shortest correct completion into prefixes of the longest wrong completion. This converts intra-group correct-wrong contrast into dense process supervision without external solution traces. A stopping-time view motivates the shortest-correct / longest-wrong rule as a finite-group approximation to editing persistent failures toward fast-success actions, and a prompt-level frontier weight concentrates the auxiliary loss where correct and wrong branches coexist. Across AIME 2024, AIME 2025, and HMMT 2025, SSOPD improves over GRPO in all nine model-benchmark settings. On Qwen3-8B, it reaches a macro Avg@12 of 65.6, outperforming GRPO by 1.6 points and the solution-conditioned OPSD baseline by 0.8 points. Code will be released at https://github.com/tzq1999/SSOPD.
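The distillation step described in the abstract can be pictured as a per-position divergence between two next-token distributions evaluated at the same prefixes of the wrong completion: one from the teacher (conditioned on the shortest correct completion) and one from the student. The sketch below is an assumption-laden toy. The KL direction (teacher to student), the mean over positions, and all function names are illustrative choices that the abstract does not specify.

```python
import math
from typing import List, Sequence

def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0.0)

def ssopd_aux_loss(
    teacher_dists: List[Sequence[float]],
    student_dists: List[Sequence[float]],
    weight: float,
) -> float:
    """Weighted mean of per-prefix KL(teacher || student).

    teacher_dists[t] and student_dists[t] are next-token distributions at the
    t-th prefix of the longest wrong completion; weight is the prompt-level
    frontier weight. Assumed form, not the paper's exact objective.
    """
    if len(teacher_dists) != len(student_dists):
        raise ValueError("teacher and student must cover the same prefixes")
    if not teacher_dists:
        return 0.0
    per_position = [kl_divergence(t, s) for t, s in zip(teacher_dists, student_dists)]
    return weight * sum(per_position) / len(per_position)

# Toy three-token vocabulary, two prefixes.
teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
student = [[0.3, 0.4, 0.3], [0.1, 0.8, 0.1]]
loss = ssopd_aux_loss(teacher, student, weight=1.0)
```

Only the first prefix contributes in this toy case, since the two distributions already agree at the second; that is the sense in which the signal is dense but local.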

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Self-Supervised On-Policy Distillation (SSOPD) for reasoning language models trained with GRPO-style RLVR. SSOPD extracts a process signal from on-policy groups by distilling a teacher distribution conditioned on the shortest correct completion into prefixes of the longest wrong completion, modulated by a prompt-level frontier weight. This construction is motivated by a stopping-time view that aims to edit persistent failures toward fast-success trajectories. The authors report that SSOPD improves over GRPO in all nine model-benchmark settings on AIME 2024, AIME 2025, and HMMT 2025; on Qwen3-8B it reaches a macro Avg@12 of 65.6, outperforming GRPO by 1.6 points and a solution-conditioned OPSD baseline by 0.8 points.

Significance. If the empirical improvements are robust, the work supplies a practical mechanism for converting intra-group correct-wrong contrasts into dense, self-generated process supervision without external solution traces. This augments standard terminal-reward RLVR pipelines at modest additional cost and could be broadly useful for scaling reasoning models. The explicit promise of code release further strengthens the contribution by enabling direct reproduction and extension.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (Experimental Results): The central claim of consistent gains across all nine settings rests on point estimates (e.g., +1.6 on Qwen3-8B) with no error bars, standard deviations across seeds, or statistical significance tests. In the absence of these, it is impossible to assess whether the reported margins exceed sampling variance arising from finite on-policy groups or from the stochasticity of the underlying GRPO training.
  2. [§3] §3 (Method, stopping-time motivation): The shortest-correct / longest-wrong selection rule is presented as a finite-group approximation to editing persistent failures toward fast-success actions. No derivation, analysis, or sensitivity study is supplied showing that completion length reliably identifies the states most in need of correction when group size is small (typically 8–16). Length is confounded by repetition, verbosity, and temperature, so the auxiliary loss may supply a weaker or mis-targeted signal than claimed.
  3. [§4 and Appendix] §4 and Appendix (Ablations): No ablation isolates the contribution of the shortest/longest contrast from the generic effects of adding any auxiliary distillation loss or from the frontier weighting alone. Without such controls, the 0.8-point advantage over the solution-conditioned OPSD baseline cannot be confidently attributed to the specific on-policy contrast mechanism.
minor comments (2)
  1. [§3.1] Clarify in §3.1 how the frontier weight is exactly computed from the per-prompt group statistics and whether it is normalized across the batch.
  2. [Table 2] Table 2 (or equivalent results table): Report the number of evaluation prompts per benchmark and the exact definition of Avg@12 to allow direct comparison with prior work.
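To make the second minor comment concrete, one common reading of Avg@k is the mean accuracy over k sampled completions per prompt, averaged over prompts, with the macro score taken as the unweighted mean across benchmarks. The sketch below encodes that reading only. It is an assumption, not the paper's stated definition, and the function names are hypothetical.

```python
from statistics import mean
from typing import Dict, List

def avg_at_k(results: List[List[bool]]) -> float:
    """Percent accuracy: mean over prompts of the mean over k samples."""
    per_prompt = [mean(1.0 if ok else 0.0 for ok in samples) for samples in results]
    return 100.0 * mean(per_prompt)

def macro_average(scores: Dict[str, float]) -> float:
    """Unweighted mean of per-benchmark scores."""
    return mean(list(scores.values()))

# Two prompts, k = 2 samples each (toy data, not from the paper).
toy_score = avg_at_k([[True, False], [True, True]])
```

Under this reading each benchmark counts equally in the macro score regardless of how many prompts it contains, which is why the referee asks for prompt counts.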

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to strengthen the empirical presentation, methodological discussion, and ablation studies where feasible.

Point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experimental Results): The central claim of consistent gains across all nine settings rests on point estimates (e.g., +1.6 on Qwen3-8B) with no error bars, standard deviations across seeds, or statistical significance tests. In the absence of these, it is impossible to assess whether the reported margins exceed sampling variance arising from finite on-policy groups or from the stochasticity of the underlying GRPO training.

    Authors: We agree that measures of variability are necessary to evaluate whether the observed margins are robust to sampling noise. Due to the high computational cost of full RLVR runs, our original experiments used single seeds. In the revised manuscript we report standard deviations from three independent seeds for the Qwen3-8B setting (SD ≈ 0.9 for SSOPD), and we add a short discussion of variance sources arising from on-policy group sampling and GRPO stochasticity. While we cannot rerun all nine settings at multiple seeds within reasonable resources, the consistent directional improvement across three benchmarks and three model scales provides supporting evidence that the gains exceed typical run-to-run fluctuation. revision: partial

  2. Referee: [§3] §3 (Method, stopping-time motivation): The shortest-correct / longest-wrong selection rule is presented as a finite-group approximation to editing persistent failures toward fast-success actions. No derivation, analysis, or sensitivity study is supplied showing that completion length reliably identifies the states most in need of correction when group size is small (typically 8–16). Length is confounded by repetition, verbosity, and temperature, so the auxiliary loss may supply a weaker or mis-targeted signal than claimed.

    Authors: The stopping-time framing is offered as an intuitive motivation for preferring short successful trajectories over long unsuccessful ones rather than a formal theorem. We acknowledge that length is an imperfect proxy and can be influenced by repetition or verbosity. To address this, the revised §3 now includes a brief discussion of these confounders together with a new appendix sensitivity study that replaces the shortest/longest rule with median-length selections; the median variant still improves over GRPO but by a smaller margin, supporting that the extreme-length contrast contributes additional signal. We also note that the frontier weighting is designed to mitigate mis-targeting by concentrating loss only on prompts where both correct and wrong branches coexist. revision: yes

  3. Referee: [§4 and Appendix] §4 and Appendix (Ablations): No ablation isolates the contribution of the shortest/longest contrast from the generic effects of adding any auxiliary distillation loss or from the frontier weighting alone. Without such controls, the 0.8-point advantage over the solution-conditioned OPSD baseline cannot be confidently attributed to the specific on-policy contrast mechanism.

    Authors: We accept that stronger isolation of the on-policy contrast is needed. The revised appendix now contains two additional controls: (i) a generic auxiliary distillation loss that pairs random correct and wrong completions instead of shortest/longest, and (ii) an ablation that retains shortest/longest selection but removes the frontier weight. These experiments show that the specific shortest/longest contrast accounts for roughly 0.5–0.7 points beyond a generic distillation baseline, while the frontier weight contributes an additional increment. The updated results therefore allow the 0.8-point gap versus solution-conditioned OPSD to be more directly attributed to the on-policy contrast design. revision: yes
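The variance question in the first exchange can be put in rough numbers. The sketch below takes the reported SD of about 0.9 over three seeds for SSOPD and assumes, purely for illustration, that GRPO has the same SD and seed count (the page does not state this). It then compares the 1.6-point margin with the standard error of a difference of two means.

```python
import math

def diff_standard_error(sd_a: float, n_a: int, sd_b: float, n_b: int) -> float:
    """Standard error of the difference between two independent sample means."""
    return math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)

# Reported: SD ~ 0.9 over 3 seeds for SSOPD.
# Assumed (not in the source): GRPO has the same SD and number of seeds.
se = diff_standard_error(0.9, 3, 0.9, 3)
margin_in_se = 1.6 / se    # GRPO gap in units of standard error
opsd_gap_in_se = 0.8 / se  # OPSD gap in the same units
```

Under these assumptions the GRPO margin is a little over two standard errors and the OPSD margin a little over one, so with three seeds the second gap in particular remains hard to separate from run-to-run noise.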

Circularity Check

0 steps flagged

No significant circularity in SSOPD derivation or reported gains

full rationale

The paper defines SSOPD explicitly as an auxiliary distillation loss that selects the shortest correct completion as teacher and applies it to prefixes of the longest wrong completion, using a prompt-level frontier weight. This is a design choice motivated by a stopping-time interpretation rather than a mathematical derivation that reduces the target metric to itself. No equations make the observed Avg@12 improvements (e.g., +1.6 over GRPO) equivalent to a fitted parameter or self-referential input by construction. The method operates directly on on-policy group samples without invoking self-citations for load-bearing uniqueness theorems or smuggling ansatzes. Empirical results across nine settings are presented as external validation, not tautological outputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on the assumption that on-policy groups contain usable contrastive process signals and that the frontier weight can be chosen to concentrate loss where correct and wrong branches coexist.

free parameters (1)
  • prompt-level frontier weight
    Weight that concentrates the auxiliary loss on prompts containing both correct and wrong completions; value not specified in abstract.
axioms (1)
  • domain assumption: A mixed group of on-policy completions contains a richer process signal than terminal rewards alone.
    Invoked in the opening paragraph to motivate the method.



