pith. machine review for the scientific record.

arxiv: 2605.08665 · v1 · submitted 2026-05-09 · 💻 cs.CL

Recognition: 2 theorem links · Lean Theorem

Hint Tuning: Less Data Makes Better Reasoners

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 00:50 UTC · model grok-4.3

classification 💻 cs.CL
keywords hint tuning · chain of thought · reasoning models · token efficiency · difficulty calibration · self-annotation · instruct models · large language models

The pith

Hint Tuning lets reasoning models cut tokens by a third on average by learning difficulty from their instruct counterparts on 1K self-labeled examples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large reasoning models waste tokens by applying long chain-of-thought uniformly even to easy problems. The paper establishes that an instruct model can serve as a reliable difficulty probe by checking how much guidance it needs to solve each problem correctly. This check produces three categories of training data: problems the model can answer with no hint, with a sparse hint, or only with a full hint. Fine-tuning the reasoning model on these 1K self-annotated examples teaches it to match reasoning depth to actual difficulty. The outcome is 24 to 66 percent fewer tokens generated while accuracy stays competitive across five benchmarks and multiple model scales.
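The probe loop is easy to sketch. In the minimal sketch below, `instruct_generate` and `is_correct` are hypothetical stand-ins for an instruct-model call and an answer grader, and the sparse-prefix fraction is an assumed setting rather than the paper's; Figure 4 suggests the real procedure samples over multiple episodes, which this sketch collapses into single calls.

```python
def instruct_generate(prompt: str) -> str:
    raise NotImplementedError("call the instruct model here")  # hypothetical

def is_correct(candidate: str, reference: str) -> bool:
    raise NotImplementedError("grade candidate vs. reference")  # hypothetical

def label_difficulty(problem: str, reference: str, full_trace: str) -> str:
    """Assign one of the three hint states by checking how much guidance
    the instruct model needs before it answers correctly."""
    if is_correct(instruct_generate(problem), reference):
        return "no_hint"        # solvable with a direct answer
    sparse = full_trace[: len(full_trace) // 4]   # assumed prefix fraction
    if is_correct(instruct_generate(f"{problem}\nHint: {sparse}"), reference):
        return "sparse_hint"    # a minimal prefix suffices
    return "full_hint"          # only the complete reasoning trace works
```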

Core claim

By converting difficulty labeling into a consistency check between the instruct model and the reasoning model under varying levels of guidance, Hint Tuning automatically generates No-Hint, Sparse-Hint, and Full-Hint training states. Fine-tuning on only 1K such examples calibrates the reasoning model to produce shorter outputs on easier problems, yielding 24–66 percent token reduction (31.5 percent average) across 4B–32B models such as Qwen3-Thinking and DeepSeek-R1-Distill while preserving competitive accuracy on five benchmarks.

What carries the argument

Hint Tuning, which uses consistency between the instruct model and reasoning model under no, sparse, and full guidance to label problem difficulty and create three-state training data for calibrating reasoning depth.
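A sketch of how the three labels could become supervised fine-tuning targets. The `<think>` delimiters follow the token-counting convention noted under Figure 6, but the exact target templates and prefix fraction are assumptions, not the paper's released format:

```python
def to_training_example(problem: str, answer: str, trace: str, state: str) -> dict:
    """Map a labeled problem to a (prompt, completion) pair whose reasoning
    length matches the assigned difficulty state."""
    if state == "no_hint":
        target = answer                               # answer directly
    elif state == "sparse_hint":
        prefix = trace[: len(trace) // 4]             # assumed prefix fraction
        target = f"<think>{prefix}</think>{answer}"   # brief reasoning
    else:  # "full_hint"
        target = f"<think>{trace}</think>{answer}"    # full reasoning trace
    return {"prompt": problem, "completion": target}
```

Fine-tuning on 1K such pairs is then ordinary SFT; the calibration signal lives entirely in how the targets vary in length with the probe's label.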

If this is right

  • Token usage falls 24–66 percent (31.5 percent on average) across 4B to 32B parameter reasoning models.
  • Accuracy remains competitive with the original model on five standard benchmarks.
  • Only 1K self-annotated samples suffice, avoiding the need for large distillation datasets or reinforcement learning.
  • The method applies directly to existing models such as Qwen3-Thinking and DeepSeek-R1-Distill.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same consistency-probe approach could be adapted to other generation tasks where output length should vary with input complexity.
  • Instruct models appear to encode useful implicit signals about problem solvability that can be extracted without external labels.
  • Adding more than three hint levels might allow even finer control over reasoning granularity in future extensions.

Load-bearing premise

That consistency between the instruct model and the reasoning model under varying guidance levels accurately identifies problem difficulty, and that fine-tuning on these labels will reliably calibrate reasoning depth without introducing new errors.

What would settle it

A held-out test set where the Hint-Tuned model produces noticeably more tokens than necessary on problems the instruct model solves with no hint, or drops accuracy below the untuned baseline on those same problems.
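Run mechanically, that test is a few lines. A sketch under stated assumptions: `probe_label`, `generate`, `count_tokens`, and `grade` are hypothetical caller-supplied callables (a token counter in the Figure 6 convention is sketched under Figures below):

```python
def settling_test(problems, tuned, baseline, probe_label, generate, count_tokens, grade):
    """Compare <think> spend and accuracy on problems the instruct model
    already solves with no hint. All callables are caller-supplied stand-ins."""
    easy = [p for p in problems if probe_label(p) == "no_hint"]
    tuned_out = [generate(tuned, p) for p in easy]
    base_out = [generate(baseline, p) for p in easy]
    spend_ratio = (sum(map(count_tokens, tuned_out))
                   / max(1, sum(map(count_tokens, base_out))))
    acc_drop = (sum(grade(o, p) for o, p in zip(base_out, easy))
                - sum(grade(o, p) for o, p in zip(tuned_out, easy))) / max(1, len(easy))
    # The claim is in trouble if spend_ratio stays near 1.0 on these easy
    # problems, or if acc_drop is materially positive.
    return spend_ratio, acc_drop
```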

Figures

Figures reproduced from arXiv: 2605.08665 by Bowen Qin, Liujie Zhang, Minghao Li, Shuo Shang, Siqi Fan, Weihang Chen, Xiaoqian Ma, Xiusheng Huang, Zhuo Chen.

Figure 1. Hint Tuning efficiency on DeepSeek-R1-Distill-Qwen-7B.

Figure 2. Reasoning models waste tokens through over-elaboration. (a) Example: same answer, …

Figure 3. Efficient reasoning training paradigms.

Figure 4. Distribution of K∗ reveals non-monotonic patterns. (a) Success rate fluctuates with K rather than increasing monotonically. (b) Most required hints concentrate in early episodes (within the first 25). (c) Hints enable 23–33% of previously unsolvable problems to succeed with sparse reasoning, and 33–60% with full traces. (d) Only 25–32% show continuous success; most exhibit non-monotonic patterns.

Figure 5. Training data variance and adaptive CoT compression. (a) Comparison of token length …

Figure 6. Grading prompt for answer validation. Reasoning overhead is measured by counting tokens within the <think>...</think> tags; tokens outside these tags (the problem statement and final answer) are excluded from compression metrics but included in total inference cost.

Figure 7. Prompt template for Budget Forcing baseline.
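The token-counting convention noted under Figure 6 is simple enough to reproduce. A minimal sketch, with whitespace splitting standing in for the paper's (unspecified) tokenizer:

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def reasoning_tokens(output: str) -> int:
    """Count tokens inside <think>...</think>; text outside the tags is
    excluded from compression metrics but counted toward total cost."""
    return sum(len(span.split()) for span in THINK_RE.findall(output))

assert reasoning_tokens("<think>add 2 and 3</think>The answer is 5") == 4
```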
read the original abstract

Large reasoning models achieve high accuracy through extended chain-of-thought but generate 5–8× more tokens than necessary, applying verbose reasoning uniformly regardless of problem difficulty. We propose Hint Tuning, a data-efficient approach that teaches models to calibrate reasoning depth. Our key insight: the corresponding instruct model serves as an ideal difficulty probe. By testing what the instruct model can solve with varying guidance, we automatically construct training data across three states: No-Hint (direct answer), Sparse-Hint (minimal prefix), and Full-Hint (complete reasoning). This converts the abstract challenge of difficulty labeling into a measurable consistency check between the instruct and reasoning models. With only 1K self-annotated samples, Hint Tuning achieves 24–66% token reduction (31.5% average) across mainstream reasoning models (Qwen3-Thinking, DeepSeek-R1-Distill) at multiple scales (4B–32B) while maintaining competitive accuracy on five benchmarks. Unlike methods requiring massive distillation datasets or expensive RL, we achieve superior efficiency through simple alignment with the instruct model's capabilities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper proposes Hint Tuning, a data-efficient fine-tuning method for large reasoning models. It uses the corresponding instruct model as a difficulty probe to automatically label problems into three categories—No-Hint (direct answer), Sparse-Hint (minimal prefix), and Full-Hint (complete reasoning)—via consistency checks under varying guidance levels. With only 1K self-annotated samples, the method claims to achieve 24–66% token reduction (31.5% average) across models such as Qwen3-Thinking and DeepSeek-R1-Distill at 4B–32B scales, while maintaining competitive accuracy on five benchmarks, without requiring massive distillation datasets or RL.

Significance. If the central empirical claims hold, Hint Tuning offers a practical, low-resource alternative for calibrating reasoning depth in chain-of-thought models, potentially reducing inference costs substantially. The self-annotation strategy leveraging instruct-model consistency is a notable insight that avoids expensive external labeling, and the reported gains across multiple model families and scales indicate broad applicability. This could meaningfully advance efficient deployment of reasoning LLMs.

major comments (3)
  1. [§3] §3 (Method, Data Construction): The labeling procedure relies on consistency between the instruct model and reasoning model under varying guidance, but provides no quantitative validation (e.g., human difficulty ratings, agreement with ground-truth labels, or correlation with problem complexity metrics) that these labels faithfully indicate minimal required reasoning depth. This is load-bearing for the token-reduction claim, as mislabeling due to prompt sensitivity or format issues could lead to under- or over-reasoning.
  2. [§4] §4 (Experiments): The reported results (24–66% token reduction, competitive accuracy) lack error bars, multiple random seeds, or statistical significance tests, and provide limited ablation details on the 1K sample size, guidance phrasing, or alternative labeling strategies. This weakens confidence in the reliability and generality of the efficiency gains.
  3. [§4.3] §4.3 (Results tables): Without explicit comparison to baselines that use random or heuristic difficulty labeling, it is unclear whether the observed token savings stem specifically from the consistency-based calibration or from general fine-tuning effects.
minor comments (3)
  1. [Abstract, §1] The abstract and introduction could more clearly distinguish Hint Tuning from prior self-consistency and difficulty-aware prompting methods with explicit citations.
  2. [Figure 2] Figure 2 (token reduction visualization) would benefit from per-model breakdowns and confidence intervals for clarity.
  3. [§4.1] The experimental protocol section should specify the exact prompt templates used for the three guidance levels to enable reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments raise important points regarding validation of the labeling procedure and the robustness of the experimental results. We address each major comment below and will revise the manuscript to incorporate additional analyses and baselines where feasible.

read point-by-point responses
  1. Referee: [§3] §3 (Method, Data Construction): The labeling procedure relies on consistency between the instruct model and reasoning model under varying guidance, but provides no quantitative validation (e.g., human difficulty ratings, agreement with ground-truth labels, or correlation with problem complexity metrics) that these labels faithfully indicate minimal required reasoning depth. This is load-bearing for the token-reduction claim, as mislabeling due to prompt sensitivity or format issues could lead to under- or over-reasoning.

    Authors: We appreciate this observation. The core of our method is that the labels are produced via an internal consistency check between the instruct model (used as probe) and the reasoning model under graduated guidance levels; this directly operationalizes minimal required depth without external supervision. We agree, however, that further quantitative support would strengthen the claim. In the revision we will add (i) Pearson correlations between assigned hint categories and problem-complexity proxies (ground-truth solution length and lexical difficulty scores) and (ii) a sensitivity study across alternative guidance phrasings. These results will be reported in an expanded Section 3. revision: partial

  2. Referee: [§4] §4 (Experiments): The reported results (24–66% token reduction, competitive accuracy) lack error bars, multiple random seeds, or statistical significance tests, and provide limited ablation details on the 1K sample size, guidance phrasing, or alternative labeling strategies. This weakens confidence in the reliability and generality of the efficiency gains.

    Authors: We concur that statistical reporting and ablations are necessary for confidence. In the revised manuscript we will (a) rerun all main experiments with at least three random seeds and report means with standard deviations and error bars, (b) include paired t-tests or Wilcoxon tests for the token-reduction figures (a minimal sketch follows these responses), and (c) expand the ablation section with curves for training-set sizes around 1K and with two additional guidance-phrasing variants. These changes will appear in Section 4. revision: yes

  3. Referee: [§4.3] §4.3 (Results tables): Without explicit comparison to baselines that use random or heuristic difficulty labeling, it is unclear whether the observed token savings stem specifically from the consistency-based calibration or from general fine-tuning effects.

    Authors: This is a valid concern. To isolate the benefit of consistency-based labeling, the revision will include two new control baselines: (1) random assignment of the three hint categories to the same 1K problems, and (2) heuristic labeling based on input-length quartiles. We will report token reduction and accuracy for these controls alongside our method, demonstrating that only the consistency-derived labels produce the observed efficiency gains without accuracy degradation. The new results will be added to Section 4.3 and the corresponding tables. revision: yes
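The significance testing promised in response 2 is standard machinery. A minimal sketch using SciPy's Wilcoxon signed-rank test on paired per-benchmark token counts; the numbers are illustrative placeholders, not the paper's data:

```python
from scipy.stats import wilcoxon

# Hypothetical paired token counts on five benchmarks (illustrative only).
baseline_tokens = [5120, 4310, 6875, 3900, 5544]
tuned_tokens    = [3480, 2995, 4120, 2750, 3610]

stat, p = wilcoxon(baseline_tokens, tuned_tokens)
print(f"Wilcoxon signed-rank: statistic={stat}, p={p:.4f}")
```

With only five paired benchmarks the exact two-sided p-value bottoms out at 0.0625, which is itself an argument for the per-seed reruns promised in (a).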

Circularity Check

0 steps flagged

No significant circularity; empirical method evaluated on external benchmarks

full rationale

The paper constructs training labels by probing an instruct model with varying hint levels to categorize problems into No-Hint, Sparse-Hint, and Full-Hint states, then fine-tunes the reasoning model on 1K such samples. Token reduction and accuracy are measured outcomes on five independent external benchmarks rather than derived quantities. No equations, fitted parameters presented as predictions, self-referential definitions, or load-bearing self-citations appear in the derivation chain. The central claims rest on observed performance improvements, so the approach stands or falls on external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

No new mathematical constants or entities are introduced. The work rests on standard assumptions about fine-tuning and model behavior.

axioms (2)
  • domain assumption An instruct model can serve as a reliable proxy for estimating the reasoning depth required by a more capable reasoning model.
    This is the central premise used to label training examples.
  • domain assumption Fine-tuning on self-generated hint-level data will transfer to improved calibration of chain-of-thought length on unseen problems.
    Required for the claimed generalization from 1K samples.

pith-pipeline@v0.9.0 · 5508 in / 1353 out tokens · 25312 ms · 2026-05-12T00:50:04.052800+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · 6 internal anchors

  1. [1]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 2...

  2. [2]

    Large language models are zero-shot reasoners

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. InAdvances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022

  3. [3]

    Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

    Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024

  4. [4]

    Learning to reason with language models, 2024

    OpenAI. Learning to reason with language models, 2024. Accessed: 2025-05-19

  5. [5]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Hel- yar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024

  6. [6]

    Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025

    DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025

  7. [7]

    Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs

    Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, et al. Do not think that much for 2+3=? On the overthinking of o1-like llms. arXiv preprint arXiv:2412.21187, 2024

  8. [8]

    Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models

    Yang Sui, Yu-Neng Chuang, Guanchu Wang, Jiamu Zhang, Tianyi Zhang, Jiayi Yuan, Hongyi Liu, Andrew Wen, Shaochen Zhong, Hanjie Chen, et al. Stop overthinking: A survey on efficient reasoning for large language models. arXiv preprint arXiv:2503.16419, 2025

  9. [9]

    C3ot: Generating shorter chain-of-thought without compromising effectiveness

    Yu Kang, Xianghui Sun, Liangyu Chen, and Wei Zou. C3ot: Generating shorter chain-of-thought without compromising effectiveness. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 24312–24320, 2025

  10. [10]

    Tokenskip: Controllable chain-of-thought compression in llms

    Heming Xia, Yongqi Li, Chak Tou Leong, Wenjie Wang, and Wenjie Li. Tokenskip: Controllable chain-of-thought compression in llms. CoRR, abs/2502.12067, 2025

  11. [11]

    Verithinker: Learning to verify makes reasoning model efficient

    Zigeng Chen, Xinyin Ma, Gongfan Fang, Ruonan Yu, and Xinchao Wang. Verithinker: Learning to verify makes reasoning model efficient. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  12. [12]

    Token-budget-aware LLM reasoning

    Tingxu Han, Zhenting Wang, Chunrong Fang, Shiyu Zhao, Shiqing Ma, and Zhenyu Chen. Token-budget-aware LLM reasoning. In Findings of the Association for Computational Linguistics: ACL 2025, pages 24842–24855, 2025

  13. [13]

    Can language models learn to skip steps?

    Tengxiao Liu, Qipeng Guo, Xiangkun Hu, Cheng Jiayang, Yue Zhang, Xipeng Qiu, and Zheng Zhang. Can language models learn to skip steps? Advances in Neural Information Processing Systems, 37:45359–45385, 2024

  14. [14]

    Cot-valve: Length- compressible chain-of-thought tuning

    Xinyin Ma, Guangnian Wan, Runpeng Yu, Gongfan Fang, and Xinchao Wang. Cot-valve: Length-compressible chain-of-thought tuning. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025, Vienna, Austria, July 27 - August 1, 2025, pages 6025–6035

  15. [15]

    O1-pruner: Length-harmonizing fine-tuning for o1-like reasoning pruning

    Haotian Luo, Li Shen, Haiying He, Yibo Wang, Shiwei Liu, Wei Li, Naiqiang Tan, Xiaochun Cao, and Dacheng Tao. O1-pruner: Length-harmonizing fine-tuning for o1-like reasoning pruning. CoRR, abs/2501.12570, 2025

  16. [16]

    Training language models to reason efficiently

    Daman Arora and Andrea Zanette. Training language models to reason efficiently. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  17. [17]

    L1: Controlling how long a reasoning model thinks with reinforcement learning

    Pranjal Aggarwal and Sean Welleck. L1: Controlling how long a reasoning model thinks with reinforcement learning. In Second Conference on Language Modeling, 2025

  18. [18]

    AdaptThink: Reasoning models can learn when to think

    Jiajie Zhang, Nianyi Lin, Lei Hou, Ling Feng, and Juanzi Li. AdaptThink: Reasoning models can learn when to think. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 3716–3730, Suzhou, China, November 2025. Association for Computational Linguistics

  19. [19]

    Proximal policy optimization algorithms, 2017

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017

  20. [20]

    Optimizing test-time compute via meta reinforcement fine-tuning

    Yuxiao Qu, Matthew YR Yang, Amrith Setlur, Lewis Tunstall, Edward Emanuel Beeching, Ruslan Salakhutdinov, and Aviral Kumar. Optimizing test-time compute via meta reinforcement fine-tuning. arXiv preprint arXiv:2503.07572, 2025

  21. [21]

    Self-consistency improves chain of thought reasoning in language models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023

  22. [22]

    s1: Simple test-time scaling

    Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori B Hashimoto. s1: Simple test-time scaling. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 20286–20332, 2025

  23. [23]

    Self-training elicits concise reasoning in large language models

    Tergel Munkhbat, Namgyu Ho, Seo Hyun Kim, Yongjin Yang, Yujin Kim, and Se-Young Yun. Self-training elicits concise reasoning in large language models. In Findings of the Association for Computational Linguistics, ACL 2025, Vienna, Austria, July 27 - August 1, 2025, volume ACL 2025 of Findings of ACL, pages 25127–25152

  24. [24]

    LoRA: Low-rank adaptation of large language models

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022

  25. [25]

    Token-budget-aware LLM reasoning

    Tingxu Han, Zhenting Wang, Chunrong Fang, Shiyu Zhao, Shiqing Ma, and Zhenyu Chen. Token-budget-aware LLM reasoning. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Findings of the Association for Computational Linguistics, ACL 2025, Vienna, Austria, July 27 - August 1, 2025, volume ACL 2025 of Findings of ACL, page...

  26. [26]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference o...

  27. [27]

    Can language models learn to skip steps?

    Tengxiao Liu, Qipeng Guo, Xiangkun Hu, Cheng Jiayang, Yue Zhang, Xipeng Qiu, and Zheng Zhang. Can language models learn to skip steps? In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang, editors, Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Proc...

  28. [28]

    Kimi k1.5: Scaling Reinforcement Learning with LLMs

    Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599, 2025

  29. [29]

    Demystifying long chain-of-thought reasoning in llms, 2025

    Edward Y. Chang, Yuxuan Tong, Morry Niu, Graham Neubig, and Xiang Yue. Demystifying long chain-of-thought reasoning in llms. CoRR, abs/2502.03373, 2025

  30. [30]

    Pace: Prefix-protected and difficulty-aware compression for efficient reasoning

    Ruixiang Feng, Yuntao Wen, Silin Zhou, Ke Shi, Yifan Wang, Ran Le, Zhenwei An, Zongchao Chen, Chen Yang, Guangyue Peng, et al. Pace: Prefix-protected and difficulty-aware compression for efficient reasoning. arXiv preprint arXiv:2602.11639, 2026

  31. [31]

    DAST: Difficulty-adaptive slow-thinking for large reasoning models

    Yi Shen, Jian Zhang, Jieyun Huang, Shuming Shi, Wenjing Zhang, Jiangze Yan, Ning Wang, Kai Wang, Zhaoxiang Liu, and Shiguo Lian. DAST: Difficulty-adaptive slow-thinking for large reasoning models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 2322–2331

  32. [32]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human fee...

  33. [33]

    Secrets of RLHF in large language models part I: PPO

    Rui Zheng, Shihan Dou, Songyang Gao, Yuan Hua, Wei Shen, Binghai Wang, Yan Liu, Senjie Jin, Qin Liu, Yuhao Zhou, Limao Xiong, Lu Chen, Zhiheng Xi, Nuo Xu, Wenbin Lai, Minghao Zhu, Cheng Chang, Zhangyue Yin, Rongxiang Weng, Wensen Cheng, Haoran Huang, Tianxiang Sun, Hang Yan, Tao Gui, Qi Zhang, Xipeng Qiu, and Xuanjing Huang. Secrets of RLHF in large langu...

  34. [34]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. CoRR, abs/2110.14168, 2021

  35. [35]

    Let’s verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024

  36. [36]

    Implementation matters in deep RL: A case study on PPO and TRPO

    Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Firdaus Janoos, Larry Rudolph, and Aleksander Madry. Implementation matters in deep RL: A case study on PPO and TRPO. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020

  38. [38]

    What matters in on-policy reinforcement learning? A large-scale empirical study

    Marcin Andrychowicz, Anton Raichuk, Piotr Stanczyk, Manu Orsini, Sertan Girgin, Raphaël Marinier, Léonard Hussenot, Matthieu Geist, Olivier Pietquin, Marcin Michalski, Sylvain Gelly, and Olivier Bachem. What matters in on-policy reinforcement learning? A large-scale empirical study. CoRR, abs/2006.05990, 2020

  39. [39]

    Qwen3 technical report, 2025

    Qwen Team. Qwen3 technical report, 2025

  40. [40]

    Token-budget-aware LLM reasoning

    Tingxu Han, Zhenting Wang, Chunrong Fang, Shiyu Zhao, Shiqing Ma, and Zhenyu Chen. Token-budget-aware LLM reasoning. In Findings of the Association for Computational Linguistics, ACL 2025, Vienna, Austria, July 27 - August 1, 2025, pages 24842–24855

  41. [41]

    Measuring mathematical problem solving with the MATH dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual

  42. [42]

    Toward generalizable evaluation in the LLM era: A survey beyond benchmarks

    Yixin Cao, Shibo Hong, Xinze Li, Jiahao Ying, Yubo Ma, Haiyuan Liang, Yantao Liu, Zijun Yao, Xiaozhi Wang, Dan Huang, et al. Toward generalizable evaluation in the llm era: A survey beyond benchmarks. arXiv preprint arXiv:2504.18838, 2025

  43. [43]

    Amo-bench: Large language models still struggle in high school math competitions, 2025

    Shengnan An, Xunliang Cai, Xuezhi Cao, Xiaoyu Li, Yehao Lin, Junlin Liu, Xinxuan Lv, Dan Ma, Xuanlin Wang, Ziwen Wang, and Shuang Zhou. Amo-bench: Large language models still struggle in high school math competitions, 2025

  44. [44]

    Livecodebench: Holistic and contamination free evaluation of large language models for code

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025

  45. [45]

    GPQA: A graduate-level Google-proof Q&A benchmark

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling, 2024