pith. machine review for the scientific record.

arxiv: 2605.08665 · v1 · submitted 2026-05-09 · 💻 cs.CL

Recognition: 2 theorem links · Lean Theorem

Hint Tuning: Less Data Makes Better Reasoners

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 00:50 UTC · model grok-4.3

classification 💻 cs.CL
keywords hint tuning · chain of thought · reasoning models · token efficiency · difficulty calibration · self-annotation · instruct models · large language models

The pith

Hint Tuning lets reasoning models cut tokens by a third on average by learning difficulty from their instruct counterparts on 1K self-labeled examples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large reasoning models waste tokens by applying long chain-of-thought uniformly even to easy problems. The paper establishes that an instruct model can serve as a reliable difficulty probe by checking how much guidance it needs to solve each problem correctly. This check produces three categories of training data: problems the model can answer with no hint, with a sparse hint, or only with a full hint. Fine-tuning the reasoning model on these 1K self-annotated examples teaches it to match reasoning depth to actual difficulty. The outcome is 24 to 66 percent fewer tokens generated while accuracy stays competitive across five benchmarks and multiple model scales.
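The probe loop is easy to sketch. In the minimal sketch below, `instruct_generate` and `is_correct` are hypothetical stand-ins for an instruct-model call and an answer grader, and the sparse-prefix fraction is an assumed setting rather than the paper's; Figure 4 suggests the real procedure samples over multiple episodes, which this sketch collapses into single calls.

```python
def instruct_generate(prompt: str) -> str:
    raise NotImplementedError("call the instruct model here")  # hypothetical

def is_correct(candidate: str, reference: str) -> bool:
    raise NotImplementedError("grade candidate vs. reference")  # hypothetical

def label_difficulty(problem: str, reference: str, full_trace: str) -> str:
    """Assign one of the three hint states by checking how much guidance
    the instruct model needs before it answers correctly."""
    if is_correct(instruct_generate(problem), reference):
        return "no_hint"        # solvable with a direct answer
    sparse = full_trace[: len(full_trace) // 4]   # assumed prefix fraction
    if is_correct(instruct_generate(f"{problem}\nHint: {sparse}"), reference):
        return "sparse_hint"    # a minimal prefix suffices
    return "full_hint"          # only the complete reasoning trace works
```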

Core claim

By converting difficulty labeling into a consistency check between the instruct model and the reasoning model under varying levels of guidance, Hint Tuning automatically generates No-Hint, Sparse-Hint, and Full-Hint training states. Fine-tuning on only 1K such examples calibrates the reasoning model to produce shorter outputs on easier problems, yielding 24–66 percent token reduction (31.5 percent average) across 4B–32B models such as Qwen3-Thinking and DeepSeek-R1-Distill while preserving competitive accuracy on five benchmarks.

What carries the argument

Hint Tuning, which uses consistency between the instruct model and reasoning model under no, sparse, and full guidance to label problem difficulty and create three-state training data for calibrating reasoning depth.
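A sketch of how the three labels could become supervised fine-tuning targets. The `<think>` delimiters follow the token-counting convention noted under Figure 6, but the exact target templates and prefix fraction are assumptions, not the paper's released format:

```python
def to_training_example(problem: str, answer: str, trace: str, state: str) -> dict:
    """Map a labeled problem to a (prompt, completion) pair whose reasoning
    length matches the assigned difficulty state."""
    if state == "no_hint":
        target = answer                               # answer directly
    elif state == "sparse_hint":
        prefix = trace[: len(trace) // 4]             # assumed prefix fraction
        target = f"<think>{prefix}</think>{answer}"   # brief reasoning
    else:  # "full_hint"
        target = f"<think>{trace}</think>{answer}"    # full reasoning trace
    return {"prompt": problem, "completion": target}
```

Fine-tuning on 1K such pairs is then ordinary SFT; the calibration signal lives entirely in how the targets vary in length with the probe's label.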

If this is right

  • Token usage falls 24–66 percent (31.5 percent on average) across 4B to 32B parameter reasoning models.
  • Accuracy remains competitive with the original model on five standard benchmarks.
  • Only 1K self-annotated samples suffice, avoiding the need for large distillation datasets or reinforcement learning.
  • The method applies directly to existing models such as Qwen3-Thinking and DeepSeek-R1-Distill.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same consistency-probe approach could be adapted to other generation tasks where output length should vary with input complexity.
  • Instruct models appear to encode useful implicit signals about problem solvability that can be extracted without external labels.
  • Adding more than three hint levels might allow even finer control over reasoning granularity in future extensions.

Load-bearing premise

That consistency between the instruct model and the reasoning model under varying guidance levels accurately identifies problem difficulty, and that fine-tuning on these labels will reliably calibrate reasoning depth without introducing new errors.

What would settle it

A held-out test set where the Hint-Tuned model produces noticeably more tokens than necessary on problems the instruct model solves with no hint, or drops accuracy below the untuned baseline on those same problems.
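Run mechanically, that test is a few lines. A sketch under stated assumptions: `probe_label`, `generate`, `count_tokens`, and `grade` are hypothetical caller-supplied callables (a token counter in the Figure 6 convention is sketched under Figures below):

```python
def settling_test(problems, tuned, baseline, probe_label, generate, count_tokens, grade):
    """Compare <think> spend and accuracy on problems the instruct model
    already solves with no hint. All callables are caller-supplied stand-ins."""
    easy = [p for p in problems if probe_label(p) == "no_hint"]
    tuned_out = [generate(tuned, p) for p in easy]
    base_out = [generate(baseline, p) for p in easy]
    spend_ratio = (sum(map(count_tokens, tuned_out))
                   / max(1, sum(map(count_tokens, base_out))))
    acc_drop = (sum(grade(o, p) for o, p in zip(base_out, easy))
                - sum(grade(o, p) for o, p in zip(tuned_out, easy))) / max(1, len(easy))
    # The claim is in trouble if spend_ratio stays near 1.0 on these easy
    # problems, or if acc_drop is materially positive.
    return spend_ratio, acc_drop
```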

Figures

Figures reproduced from arXiv: 2605.08665 by Bowen Qin, Liujie Zhang, Minghao Li, Shuo Shang, Siqi Fan, Weihang Chen, Xiaoqian Ma, Xiusheng Huang, Zhuo Chen.

Figure 1. Hint Tuning efficiency on DeepSeek-R1-Distill-Qwen-7B.

Figure 2. Reasoning models waste tokens through over-elaboration. (a) Example: same answer, …

Figure 3. Efficient reasoning training paradigms.

Figure 4. Distribution of K∗ reveals non-monotonic patterns. (a) Success rate fluctuates with K rather than increasing monotonically. (b) Most required hints concentrate in early episodes (within the first 25). (c) Hints enable 23–33% of previously unsolvable problems to succeed with sparse reasoning, and 33–60% with full traces. (d) Only 25–32% show continuous success; most exhibit non-monotonic patterns.

Figure 5. Training data variance and adaptive CoT compression. (a) Comparison of token length …

Figure 6. Grading prompt for answer validation. Reasoning overhead is measured by counting tokens within the <think>...</think> tags; tokens outside these tags (the problem statement and final answer) are excluded from compression metrics but included in total inference cost.

Figure 7. Prompt template for Budget Forcing baseline.
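The token-counting convention noted under Figure 6 is simple enough to reproduce. A minimal sketch, with whitespace splitting standing in for the paper's (unspecified) tokenizer:

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def reasoning_tokens(output: str) -> int:
    """Count tokens inside <think>...</think>; text outside the tags is
    excluded from compression metrics but counted toward total cost."""
    return sum(len(span.split()) for span in THINK_RE.findall(output))

assert reasoning_tokens("<think>add 2 and 3</think>The answer is 5") == 4
```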
read the original abstract

Large reasoning models achieve high accuracy through extended chain-of-thought but generate 5–8× more tokens than necessary, applying verbose reasoning uniformly regardless of problem difficulty. We propose Hint Tuning, a data-efficient approach that teaches models to calibrate reasoning depth. Our key insight: the corresponding instruct model serves as an ideal difficulty probe. By testing what the instruct model can solve with varying guidance, we automatically construct training data across three states: No-Hint (direct answer), Sparse-Hint (minimal prefix), and Full-Hint (complete reasoning). This converts the abstract challenge of difficulty labeling into a measurable consistency check between the instruct and reasoning models. With only 1K self-annotated samples, Hint Tuning achieves 24–66% token reduction (31.5% average) across mainstream reasoning models (Qwen3-Thinking, DeepSeek-R1-Distill) at multiple scales (4B–32B) while maintaining competitive accuracy on five benchmarks. Unlike methods requiring massive distillation datasets or expensive RL, we achieve superior efficiency through simple alignment with the instruct model's capabilities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper proposes Hint Tuning, a data-efficient fine-tuning method for large reasoning models. It uses the corresponding instruct model as a difficulty probe to automatically label problems into three categories—No-Hint (direct answer), Sparse-Hint (minimal prefix), and Full-Hint (complete reasoning)—via consistency checks under varying guidance levels. With only 1K self-annotated samples, the method claims to achieve 24–66% token reduction (31.5% average) across models such as Qwen3-Thinking and DeepSeek-R1-Distill at 4B–32B scales, while maintaining competitive accuracy on five benchmarks, without requiring massive distillation datasets or RL.

Significance. If the central empirical claims hold, Hint Tuning offers a practical, low-resource alternative for calibrating reasoning depth in chain-of-thought models, potentially reducing inference costs substantially. The self-annotation strategy leveraging instruct-model consistency is a notable insight that avoids expensive external labeling, and the reported gains across multiple model families and scales indicate broad applicability. This could meaningfully advance efficient deployment of reasoning LLMs.

major comments (3)
  1. [§3] §3 (Method, Data Construction): The labeling procedure relies on consistency between the instruct model and reasoning model under varying guidance, but provides no quantitative validation (e.g., human difficulty ratings, agreement with ground-truth labels, or correlation with problem complexity metrics) that these labels faithfully indicate minimal required reasoning depth. This is load-bearing for the token-reduction claim, as mislabeling due to prompt sensitivity or format issues could lead to under- or over-reasoning.
  2. [§4] §4 (Experiments): The reported results (24–66% token reduction, competitive accuracy) lack error bars, multiple random seeds, or statistical significance tests, and provide limited ablation details on the 1K sample size, guidance phrasing, or alternative labeling strategies. This weakens confidence in the reliability and generality of the efficiency gains.
  3. [§4.3] §4.3 (Results tables): Without explicit comparison to baselines that use random or heuristic difficulty labeling, it is unclear whether the observed token savings stem specifically from the consistency-based calibration or from general fine-tuning effects.
minor comments (3)
  1. [Abstract, §1] The abstract and introduction could more clearly distinguish Hint Tuning from prior self-consistency and difficulty-aware prompting methods with explicit citations.
  2. [Figure 2] Figure 2 (token reduction visualization) would benefit from per-model breakdowns and confidence intervals for clarity.
  3. [§4.1] The experimental protocol section should specify the exact prompt templates used for the three guidance levels to enable reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments raise important points regarding validation of the labeling procedure and the robustness of the experimental results. We address each major comment below and will revise the manuscript to incorporate additional analyses and baselines where feasible.

read point-by-point responses
  1. Referee: [§3] §3 (Method, Data Construction): The labeling procedure relies on consistency between the instruct model and reasoning model under varying guidance, but provides no quantitative validation (e.g., human difficulty ratings, agreement with ground-truth labels, or correlation with problem complexity metrics) that these labels faithfully indicate minimal required reasoning depth. This is load-bearing for the token-reduction claim, as mislabeling due to prompt sensitivity or format issues could lead to under- or over-reasoning.

    Authors: We appreciate this observation. The core of our method is that the labels are produced via an internal consistency check between the instruct model (used as probe) and the reasoning model under graduated guidance levels; this directly operationalizes minimal required depth without external supervision. We agree, however, that further quantitative support would strengthen the claim. In the revision we will add (i) Pearson correlations between assigned hint categories and problem-complexity proxies (ground-truth solution length and lexical difficulty scores) and (ii) a sensitivity study across alternative guidance phrasings. These results will be reported in an expanded Section 3. revision: partial

  2. Referee: [§4] §4 (Experiments): The reported results (24–66% token reduction, competitive accuracy) lack error bars, multiple random seeds, or statistical significance tests, and provide limited ablation details on the 1K sample size, guidance phrasing, or alternative labeling strategies. This weakens confidence in the reliability and generality of the efficiency gains.

    Authors: We concur that statistical reporting and ablations are necessary for confidence. In the revised manuscript we will (a) rerun all main experiments with at least three random seeds and report means with standard deviations and error bars, (b) include paired t-tests or Wilcoxon tests for the token-reduction figures (a minimal sketch follows these responses), and (c) expand the ablation section with curves for training-set sizes around 1K and with two additional guidance-phrasing variants. These changes will appear in Section 4. revision: yes

  3. Referee: [§4.3] §4.3 (Results tables): Without explicit comparison to baselines that use random or heuristic difficulty labeling, it is unclear whether the observed token savings stem specifically from the consistency-based calibration or from general fine-tuning effects.

    Authors: This is a valid concern. To isolate the benefit of consistency-based labeling, the revision will include two new control baselines: (1) random assignment of the three hint categories to the same 1K problems, and (2) heuristic labeling based on input-length quartiles. We will report token reduction and accuracy for these controls alongside our method, demonstrating that only the consistency-derived labels produce the observed efficiency gains without accuracy degradation. The new results will be added to Section 4.3 and the corresponding tables. revision: yes
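The significance testing promised in response 2 is standard machinery. A minimal sketch using SciPy's Wilcoxon signed-rank test on paired per-benchmark token counts; the numbers are illustrative placeholders, not the paper's data:

```python
from scipy.stats import wilcoxon

# Hypothetical paired token counts on five benchmarks (illustrative only).
baseline_tokens = [5120, 4310, 6875, 3900, 5544]
tuned_tokens    = [3480, 2995, 4120, 2750, 3610]

stat, p = wilcoxon(baseline_tokens, tuned_tokens)
print(f"Wilcoxon signed-rank: statistic={stat}, p={p:.4f}")
```

With only five paired benchmarks the exact two-sided p-value bottoms out at 0.0625, which is itself an argument for the per-seed reruns promised in (a).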

Circularity Check

0 steps flagged

No significant circularity; empirical method evaluated on external benchmarks

full rationale

The paper constructs training labels by probing an instruct model with varying hint levels to categorize problems into No-Hint, Sparse-Hint, and Full-Hint states, then fine-tunes the reasoning model on 1K such samples. Token reduction and accuracy are measured outcomes on five independent external benchmarks rather than derived quantities. No equations, fitted parameters presented as predictions, self-referential definitions, or load-bearing self-citations appear in the derivation chain. The central claims rest on observed performance improvements, so the approach stands or falls on external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

No new mathematical constants or entities are introduced. The work rests on standard assumptions about fine-tuning and model behavior.

axioms (2)
  • domain assumption An instruct model can serve as a reliable proxy for estimating the reasoning depth required by a more capable reasoning model.
    This is the central premise used to label training examples.
  • domain assumption Fine-tuning on self-generated hint-level data will transfer to improved calibration of chain-of-thought length on unseen problems.
    Required for the claimed generalization from 1K samples.

pith-pipeline@v0.9.0 · 5508 in / 1353 out tokens · 25312 ms · 2026-05-12T00:50:04.052800+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · 6 internal anchors

  1. [1]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 2...

  2. [2]

    Large language models are zero-shot reasoners

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. InAdvances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022

  3. [3]

    Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

    Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024

  4. [4]

    Learning to reason with language models, 2024

    OpenAI. Learning to reason with language models, 2024. Accessed: 2025-05-19

  5. [5]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Hel- yar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024

  6. [6]

    Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025

    DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025

  7. [7]

    Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs

    Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, et al. Do not think that much for 2+3=? On the overthinking of o1-like llms. arXiv preprint arXiv:2412.21187, 2024

  8. [8]

    Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models

    Yang Sui, Yu-Neng Chuang, Guanchu Wang, Jiamu Zhang, Tianyi Zhang, Jiayi Yuan, Hongyi Liu, Andrew Wen, Shaochen Zhong, Hanjie Chen, et al. Stop overthinking: A survey on efficient reasoning for large language models. arXiv preprint arXiv:2503.16419, 2025

  9. [9]

    C3ot: Generating shorter chain-of-thought without compromising effectiveness

    Yu Kang, Xianghui Sun, Liangyu Chen, and Wei Zou. C3ot: Generating shorter chain-of-thought without compromising effectiveness. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 24312–24320, 2025

  10. [10]

    Tokenskip: Controllable chain-of-thought compression in llms

    Heming Xia, Yongqi Li, Chak Tou Leong, Wenjie Wang, and Wenjie Li. Tokenskip: Controllable chain-of-thought compression in llms. CoRR, abs/2502.12067, 2025

  11. [11]

    Verithinker: Learning to verify makes reasoning model efficient

    Zigeng Chen, Xinyin Ma, Gongfan Fang, Ruonan Yu, and Xinchao Wang. Verithinker: Learning to verify makes reasoning model efficient. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  12. [12]

    Token-budget-aware LLM reasoning

    Tingxu Han, Zhenting Wang, Chunrong Fang, Shiyu Zhao, Shiqing Ma, and Zhenyu Chen. Token-budget-aware LLM reasoning. In Findings of the Association for Computational Linguistics: ACL 2025, pages 24842–24855, 2025

  13. [13]

    Can language models learn to skip steps?

    Tengxiao Liu, Qipeng Guo, Xiangkun Hu, Cheng Jiayang, Yue Zhang, Xipeng Qiu, and Zheng Zhang. Can language models learn to skip steps? Advances in Neural Information Processing Systems, 37:45359–45385, 2024

  14. [14]

    Cot-valve: Length- compressible chain-of-thought tuning

    Xinyin Ma, Guangnian Wan, Runpeng Yu, Gongfan Fang, and Xinchao Wang. Cot-valve: Length-compressible chain-of-thought tuning. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025, Vienna, Austria, July 27 - August 1, 2025, pages 6025–6035

  15. [15]

    O1-pruner: Length-harmonizing fine-tuning for o1-like reasoning pruning

    Haotian Luo, Li Shen, Haiying He, Yibo Wang, Shiwei Liu, Wei Li, Naiqiang Tan, Xiaochun Cao, and Dacheng Tao. O1-pruner: Length-harmonizing fine-tuning for o1-like reasoning pruning. CoRR, abs/2501.12570, 2025

  16. [16]

    Training language models to reason efficiently

    Daman Arora and Andrea Zanette. Training language models to reason efficiently. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  17. [17]

    L1: Controlling how long a reasoning model thinks with reinforcement learning

    Pranjal Aggarwal and Sean Welleck. L1: Controlling how long a reasoning model thinks with reinforcement learning. In Second Conference on Language Modeling, 2025

  18. [18]

    AdaptThink: Reasoning models can learn when to think

    Jiajie Zhang, Nianyi Lin, Lei Hou, Ling Feng, and Juanzi Li. AdaptThink: Reasoning models can learn when to think. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 3716–3730, Suzhou, China, November 2025. Association for Computational Linguistics

  19. [19]

    Proximal policy optimization algorithms, 2017

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017

  20. [20]

    Optimizing test-time compute via meta reinforcement fine-tuning

    Yuxiao Qu, Matthew YR Yang, Amrith Setlur, Lewis Tunstall, Edward Emanuel Beeching, Ruslan Salakhutdinov, and Aviral Kumar. Optimizing test-time compute via meta reinforcement fine-tuning. arXiv preprint arXiv:2503.07572, 2025

  21. [21]

    Self-consistency improves chain of thought reasoning in language models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023

  22. [22]

    s1: Simple test-time scaling

    Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori B Hashimoto. s1: Simple test-time scaling. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 20286–20332, 2025

  23. [23]

    Self-training elicits concise reasoning in large language models

    Tergel Munkhbat, Namgyu Ho, Seo Hyun Kim, Yongjin Yang, Yujin Kim, and Se-Young Yun. Self-training elicits concise reasoning in large language models. In Findings of the Association for Computational Linguistics, ACL 2025, Vienna, Austria, July 27 - August 1, 2025, volume ACL 2025 of Findings of ACL, pages 25127–25152

  24. [24]

    LoRA: Low-rank adaptation of large language models

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022

  25. [25]

    Token-budget-aware LLM reasoning

    Tingxu Han, Zhenting Wang, Chunrong Fang, Shiyu Zhao, Shiqing Ma, and Zhenyu Chen. Token-budget-aware LLM reasoning. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Findings of the Association for Computational Linguistics, ACL 2025, Vienna, Austria, July 27 - August 1, 2025, volume ACL 2025 of Findings of ACL, page...

  26. [26]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference o...

  27. [27]

    Can language models learn to skip steps?

    Tengxiao Liu, Qipeng Guo, Xiangkun Hu, Cheng Jiayang, Yue Zhang, Xipeng Qiu, and Zheng Zhang. Can language models learn to skip steps? In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang, editors, Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Proc...

  28. [28]

    Kimi k1.5: Scaling Reinforcement Learning with LLMs

    Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599, 2025

  29. [29]

    Demystifying long chain-of-thought reasoning in llms, 2025

    Edward Y. Chang, Yuxuan Tong, Morry Niu, Graham Neubig, and Xiang Yue. Demystifying long chain-of-thought reasoning in llms. CoRR, abs/2502.03373, 2025

  30. [30]

    Pace: Prefix-protected and difficulty-aware compression for efficient reasoning

    Ruixiang Feng, Yuntao Wen, Silin Zhou, Ke Shi, Yifan Wang, Ran Le, Zhenwei An, Zongchao Chen, Chen Yang, Guangyue Peng, et al. Pace: Prefix-protected and difficulty-aware compression for efficient reasoning. arXiv preprint arXiv:2602.11639, 2026

  31. [31]

    DAST: Difficulty-adaptive slow-thinking for large reasoning models

    Yi Shen, Jian Zhang, Jieyun Huang, Shuming Shi, Wenjing Zhang, Jiangze Yan, Ning Wang, Kai Wang, Zhaoxiang Liu, and Shiguo Lian. DAST: Difficulty-adaptive slow-thinking for large reasoning models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 2322–2331

  32. [32]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human fee...

  33. [33]

    Secrets of RLHF in large language models part I: PPO

    Rui Zheng, Shihan Dou, Songyang Gao, Yuan Hua, Wei Shen, Binghai Wang, Yan Liu, Senjie Jin, Qin Liu, Yuhao Zhou, Limao Xiong, Lu Chen, Zhiheng Xi, Nuo Xu, Wenbin Lai, Minghao Zhu, Cheng Chang, Zhangyue Yin, Rongxiang Weng, Wensen Cheng, Haoran Huang, Tianxiang Sun, Hang Yan, Tao Gui, Qi Zhang, Xipeng Qiu, and Xuanjing Huang. Secrets of RLHF in large langu...

  34. [34]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. CoRR, abs/2110.14168, 2021

  35. [35]

    Let’s verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024

  36. [36]

    Implementation matters in deep RL: A case study on PPO and TRPO

    Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Firdaus Janoos, Larry Rudolph, and Aleksander Madry. Implementation matters in deep RL: A case study on PPO and TRPO. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020

  38. [38]

    What matters in on-policy reinforcement learning? A large-scale empirical study

    Marcin Andrychowicz, Anton Raichuk, Piotr Stanczyk, Manu Orsini, Sertan Girgin, Raphaël Marinier, Léonard Hussenot, Matthieu Geist, Olivier Pietquin, Marcin Michalski, Sylvain Gelly, and Olivier Bachem. What matters in on-policy reinforcement learning? A large-scale empirical study. CoRR, abs/2006.05990, 2020

  39. [39]

    Qwen3 technical report, 2025

    Qwen Team. Qwen3 technical report, 2025

  40. [40]

    Token-budget-aware LLM reasoning

    Tingxu Han, Zhenting Wang, Chunrong Fang, Shiyu Zhao, Shiqing Ma, and Zhenyu Chen. Token-budget-aware LLM reasoning. In Findings of the Association for Computational Linguistics, ACL 2025, Vienna, Austria, July 27 - August 1, 2025, pages 24842–24855

  41. [41]

    Measuring mathematical problem solving with the MATH dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual

  42. [42]

    Toward generalizable evaluation in the LLM era: A survey beyond benchmarks

    Yixin Cao, Shibo Hong, Xinze Li, Jiahao Ying, Yubo Ma, Haiyuan Liang, Yantao Liu, Zijun Yao, Xiaozhi Wang, Dan Huang, et al. Toward generalizable evaluation in the llm era: A survey beyond benchmarks. arXiv preprint arXiv:2504.18838, 2025

  43. [43]

    Amo-bench: Large language models still struggle in high school math competitions, 2025

    Shengnan An, Xunliang Cai, Xuezhi Cao, Xiaoyu Li, Yehao Lin, Junlin Liu, Xinxuan Lv, Dan Ma, Xuanlin Wang, Ziwen Wang, and Shuang Zhou. Amo-bench: Large language models still struggle in high school math competitions, 2025

  44. [44]

    Livecodebench: Holistic and contamination free evaluation of large language models for code

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025

  45. [45]

    GPQA: A graduate-level Google-proof Q&A benchmark

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling, 2024