Hint Tuning: Less Data Makes Better Reasoners
Pith reviewed 2026-05-12 00:50 UTC · model grok-4.3
The pith
Hint Tuning lets reasoning models cut output tokens by roughly a third on average by learning problem difficulty from their instruct counterparts on 1K self-labeled examples.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By converting difficulty labeling into a consistency check between the instruct model and the reasoning model under varying levels of guidance, Hint Tuning automatically generates No-Hint, Sparse-Hint, and Full-Hint training states. Fine-tuning on only 1K such examples calibrates the reasoning model to produce shorter outputs on easier problems, yielding 24--66 percent token reduction (31.5 percent average) across 4B--32B models such as Qwen3-Thinking and DeepSeek-R1-Distill while preserving competitive accuracy on five benchmarks.
What carries the argument
Hint Tuning, which uses consistency between the instruct model and reasoning model under no, sparse, and full guidance to label problem difficulty and create three-state training data for calibrating reasoning depth.
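As a rough illustration, the three-state labeling could look like the following sketch, where `probe` stands in for querying the instruct model. All names here are hypothetical; the paper's actual prompts and answer-matching rules are not reproduced.

```python
def label_difficulty(problem, reference, probe, hint_prefix):
    """Assign a training state by checking what the instruct model can solve.

    probe(problem, hint) -> the instruct model's answer given optional guidance.
    hint_prefix          -> a minimal prefix of the reasoning model's trace.
    """
    if probe(problem, None) == reference:
        return "No-Hint"        # easy: train the reasoning model toward a direct answer
    if probe(problem, hint_prefix) == reference:
        return "Sparse-Hint"    # medium: a short hint suffices
    return "Full-Hint"          # hard: keep the complete reasoning

# Stub probe that only answers correctly when given some hint.
stub = lambda problem, hint: "4" if hint else "unsure"
state = label_difficulty("What is 2+2?", "4", stub, "Add the two operands.")
# state == "Sparse-Hint"
```

The point of the construction is that difficulty never needs an external label: it is read off from which level of guidance first makes the instruct model consistent with the reference answer.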
If this is right
- Token usage falls 24--66 percent (31.5 percent on average) across 4B to 32B parameter reasoning models.
- Accuracy remains competitive with the original model on five standard benchmarks.
- Only 1K self-annotated samples suffice, avoiding the need for large distillation datasets or reinforcement learning.
- The method applies directly to existing models such as Qwen3-Thinking and DeepSeek-R1-Distill.
Where Pith is reading between the lines
- The same consistency-probe approach could be adapted to other generation tasks where output length should vary with input complexity.
- Instruct models appear to encode useful implicit signals about problem solvability that can be extracted without external labels.
- Adding more than three hint levels might allow even finer control over reasoning granularity in future extensions.
Load-bearing premise
That consistency between the instruct model and the reasoning model under varying guidance levels accurately identifies problem difficulty and that fine-tuning on these labels will reliably calibrate reasoning depth without introducing new errors.
What would settle it
A held-out test set where the Hint-Tuned model produces noticeably more tokens than necessary on problems the instruct model solves with no hint, or drops accuracy below the untuned baseline on those same problems.
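That failure mode could be measured with a simple over-thinking rate on the held-out set. This is a sketch only; the record fields and the token budget are assumptions, not quantities from the paper.

```python
def overthinking_rate(records, token_budget=200):
    """Fraction of instruct-solvable (No-Hint) problems on which the tuned
    model still emits more than `token_budget` reasoning tokens."""
    easy = [r for r in records if r["label"] == "No-Hint"]
    if not easy:
        return 0.0
    return sum(r["tuned_tokens"] > token_budget for r in easy) / len(easy)

sample = [
    {"label": "No-Hint",   "tuned_tokens": 80},
    {"label": "No-Hint",   "tuned_tokens": 950},   # over-thinks an easy problem
    {"label": "Full-Hint", "tuned_tokens": 2400},  # hard problems are excluded
]
rate = overthinking_rate(sample)  # 0.5
```

A rate well above zero on instruct-solvable problems, or an accuracy drop below the untuned baseline on the same subset, would count against the calibration claim.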
Original abstract
Large reasoning models achieve high accuracy through extended chain-of-thought but generate 5--8× more tokens than necessary, applying verbose reasoning uniformly regardless of problem difficulty. We propose Hint Tuning, a data-efficient approach that teaches models to calibrate reasoning depth. Our key insight: the corresponding instruct model serves as an ideal difficulty probe. By testing what the instruct model can solve with varying guidance, we automatically construct training data across three states: No-Hint (direct answer), Sparse-Hint (minimal prefix), and Full-Hint (complete reasoning). This converts the abstract challenge of difficulty labeling into a measurable consistency check between the instruct and reasoning models. With only 1K self-annotated samples, Hint Tuning achieves 24--66% token reduction (31.5% average) across mainstream reasoning models (Qwen3-Thinking, DeepSeek-R1-Distill) at multiple scales (4B--32B) while maintaining competitive accuracy on five benchmarks. Unlike methods requiring massive distillation datasets or expensive RL, we achieve superior efficiency through simple alignment with the instruct model's capabilities.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Hint Tuning, a data-efficient fine-tuning method for large reasoning models. It uses the corresponding instruct model as a difficulty probe to automatically label problems into three categories—No-Hint (direct answer), Sparse-Hint (minimal prefix), and Full-Hint (complete reasoning)—via consistency checks under varying guidance levels. With only 1K self-annotated samples, the method claims to achieve 24–66% token reduction (31.5% average) across models such as Qwen3-Thinking and DeepSeek-R1-Distill at 4B–32B scales, while maintaining competitive accuracy on five benchmarks, without requiring massive distillation datasets or RL.
Significance. If the central empirical claims hold, Hint Tuning offers a practical, low-resource alternative for calibrating reasoning depth in chain-of-thought models, potentially reducing inference costs substantially. The self-annotation strategy leveraging instruct-model consistency is a notable insight that avoids expensive external labeling, and the reported gains across multiple model families and scales indicate broad applicability. This could meaningfully advance efficient deployment of reasoning LLMs.
major comments (3)
- [§3] §3 (Method, Data Construction): The labeling procedure relies on consistency between the instruct model and reasoning model under varying guidance, but provides no quantitative validation (e.g., human difficulty ratings, agreement with ground-truth labels, or correlation with problem complexity metrics) that these labels faithfully indicate minimal required reasoning depth. This is load-bearing for the token-reduction claim, as mislabeling due to prompt sensitivity or format issues could lead to under- or over-reasoning.
- [§4] §4 (Experiments): The reported results (24–66% token reduction, competitive accuracy) lack error bars, multiple random seeds, or statistical significance tests, and provide limited ablation details on the 1K sample size, guidance phrasing, or alternative labeling strategies. This weakens confidence in the reliability and generality of the efficiency gains.
- [§4.3] §4.3 (Results tables): Without explicit comparison to baselines that use random or heuristic difficulty labeling, it is unclear whether the observed token savings stem specifically from the consistency-based calibration or from general fine-tuning effects.
minor comments (3)
- [Abstract, §1] The abstract and introduction could more clearly distinguish Hint Tuning from prior self-consistency and difficulty-aware prompting methods with explicit citations.
- [Figure 2] Figure 2 (token reduction visualization) would benefit from per-model breakdowns and confidence intervals for clarity.
- [§4.1] The experimental protocol section should specify the exact prompt templates used for the three guidance levels to enable reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments raise important points regarding validation of the labeling procedure and the robustness of the experimental results. We address each major comment below and will revise the manuscript to incorporate additional analyses and baselines where feasible.
Point-by-point responses
-
Referee: [§3] §3 (Method, Data Construction): The labeling procedure relies on consistency between the instruct model and reasoning model under varying guidance, but provides no quantitative validation (e.g., human difficulty ratings, agreement with ground-truth labels, or correlation with problem complexity metrics) that these labels faithfully indicate minimal required reasoning depth. This is load-bearing for the token-reduction claim, as mislabeling due to prompt sensitivity or format issues could lead to under- or over-reasoning.
Authors: We appreciate this observation. The core of our method is that the labels are produced via an internal consistency check between the instruct model (used as probe) and the reasoning model under graduated guidance levels; this directly operationalizes minimal required depth without external supervision. We agree, however, that further quantitative support would strengthen the claim. In the revision we will add (i) Pearson correlations between assigned hint categories and problem-complexity proxies (ground-truth solution length and lexical difficulty scores) and (ii) a sensitivity study across alternative guidance phrasings. These results will be reported in an expanded Section 3. revision: partial
-
Referee: [§4] §4 (Experiments): The reported results (24–66% token reduction, competitive accuracy) lack error bars, multiple random seeds, or statistical significance tests, and provide limited ablation details on the 1K sample size, guidance phrasing, or alternative labeling strategies. This weakens confidence in the reliability and generality of the efficiency gains.
Authors: We concur that statistical reporting and ablations are necessary for confidence. In the revised manuscript we will (a) rerun all main experiments with at least three random seeds and report means with standard deviations and error bars, (b) include paired t-tests or Wilcoxon tests for the token-reduction figures, and (c) expand the ablation section with curves for training-set sizes around 1 K and with two additional guidance-phrasing variants. These changes will appear in Section 4. revision: yes
-
Referee: [§4.3] §4.3 (Results tables): Without explicit comparison to baselines that use random or heuristic difficulty labeling, it is unclear whether the observed token savings stem specifically from the consistency-based calibration or from general fine-tuning effects.
Authors: This is a valid concern. To isolate the benefit of consistency-based labeling, the revision will include two new control baselines: (1) random assignment of the three hint categories to the same 1 K problems, and (2) heuristic labeling based on input-length quartiles. We will report token reduction and accuracy for these controls alongside our method, demonstrating that only the consistency-derived labels produce the observed efficiency gains without accuracy degradation. The new results will be added to Section 4.3 and the corresponding tables. revision: yes
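The first proposed control could be generated with a few lines; this is a sketch under assumed names, with only the three category labels taken from the paper.

```python
import random

def random_label_control(problems, seed=0):
    """Assign the three hint states uniformly at random, ignoring difficulty.

    If fine-tuning on these labels matched the consistency-derived labels,
    the instruct-model probe would be doing no real work."""
    rng = random.Random(seed)  # seeded for a reproducible control condition
    states = ["No-Hint", "Sparse-Hint", "Full-Hint"]
    return {p: rng.choice(states) for p in problems}

control = random_label_control([f"problem-{i}" for i in range(1000)])
```

Comparing token reduction and accuracy under this control against the consistency-based labels isolates the contribution of the probe from generic fine-tuning effects.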
Circularity Check
No significant circularity; empirical method evaluated on external benchmarks
Full rationale
The paper constructs training labels by probing an instruct model with varying hint levels to categorize problems into No-Hint, Sparse-Hint, and Full-Hint states, then fine-tunes the reasoning model on 1K such samples. Token reduction and accuracy are measured outcomes on five independent external benchmarks rather than derived quantities. No equations, fitted parameters presented as predictions, self-referential definitions, or load-bearing self-citations appear in the derivation chain. The central claims rest on observed performance improvements and stand or fall with external evaluation.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption An instruct model can serve as a reliable proxy for estimating the reasoning depth required by a more capable reasoning model.
- domain assumption Fine-tuning on self-generated hint-level data will transfer to improved calibration of chain-of-thought length on unseen problems.
Reference graph
Works this paper leans on
- [1] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022.
- [2] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022.
- [3] Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024.
- [4] OpenAI. Learning to reason with language models, 2024. Accessed: 2025-05-19.
- [5] Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720, 2024.
- [6] DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning, 2025.
- [7] Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, et al. Do NOT think that much for 2+3=? On the overthinking of o1-like LLMs. arXiv preprint arXiv:2412.21187, 2024.
- [8] Yang Sui, Yu-Neng Chuang, Guanchu Wang, Jiamu Zhang, Tianyi Zhang, Jiayi Yuan, Hongyi Liu, Andrew Wen, Shaochen Zhong, Hanjie Chen, et al. Stop overthinking: A survey on efficient reasoning for large language models. arXiv preprint arXiv:2503.16419, 2025.
- [9] Yu Kang, Xianghui Sun, Liangyu Chen, and Wei Zou. C3oT: Generating shorter chain-of-thought without compromising effectiveness. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 24312–24320, 2025.
- [10] Heming Xia, Yongqi Li, Chak Tou Leong, Wenjie Wang, and Wenjie Li. TokenSkip: Controllable chain-of-thought compression in LLMs. CoRR, abs/2502.12067, 2025.
- [11] Zigeng Chen, Xinyin Ma, Gongfan Fang, Ruonan Yu, and Xinchao Wang. VeriThinker: Learning to verify makes reasoning model efficient. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
- [12] Tingxu Han, Zhenting Wang, Chunrong Fang, Shiyu Zhao, Shiqing Ma, and Zhenyu Chen. Token-budget-aware LLM reasoning. In Findings of the Association for Computational Linguistics: ACL 2025, pages 24842–24855, 2025.
- [13] Tengxiao Liu, Qipeng Guo, Xiangkun Hu, Cheng Jiayang, Yue Zhang, Xipeng Qiu, and Zheng Zhang. Can language models learn to skip steps? Advances in Neural Information Processing Systems, 37:45359–45385, 2024.
- [14] Xinyin Ma, Guangnian Wan, Runpeng Yu, Gongfan Fang, and Xinchao Wang. CoT-Valve: Length-compressible chain-of-thought tuning. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025, pages 6025–6035, 2025.
- [15] Haotian Luo, Li Shen, Haiying He, Yibo Wang, Shiwei Liu, Wei Li, Naiqiang Tan, Xiaochun Cao, and Dacheng Tao. O1-Pruner: Length-harmonizing fine-tuning for o1-like reasoning pruning. CoRR, abs/2501.12570, 2025.
- [16] Daman Arora and Andrea Zanette. Training language models to reason efficiently. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
- [17] Pranjal Aggarwal and Sean Welleck. L1: Controlling how long a reasoning model thinks with reinforcement learning. In Second Conference on Language Modeling, 2025.
- [18] Jiajie Zhang, Nianyi Lin, Lei Hou, Ling Feng, and Juanzi Li. AdaptThink: Reasoning models can learn when to think. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 3716–3730, 2025.
- [19] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017.
- [20] Yuxiao Qu, Matthew YR Yang, Amrith Setlur, Lewis Tunstall, Edward Emanuel Beeching, Ruslan Salakhutdinov, and Aviral Kumar. Optimizing test-time compute via meta reinforcement fine-tuning. arXiv preprint arXiv:2503.07572, 2025.
- [21] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, ICLR 2023.
- [22] Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori B Hashimoto. s1: Simple test-time scaling. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 20286–20332, 2025.
- [23] Tergel Munkhbat, Namgyu Ho, Seo Hyun Kim, Yongjin Yang, Yujin Kim, and Se-Young Yun. Self-training elicits concise reasoning in large language models. In Findings of the Association for Computational Linguistics: ACL 2025, pages 25127–25152, 2025.
- [24] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022.
- [25] Tingxu Han, Zhenting Wang, Chunrong Fang, Shiyu Zhao, Shiqing Ma, and Zhenyu Chen. Token-budget-aware LLM reasoning. In Findings of the Association for Computational Linguistics: ACL 2025, 2025.
- [26] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023), 2023.
- [27] Tengxiao Liu, Qipeng Guo, Xiangkun Hu, Cheng Jiayang, Yue Zhang, Xipeng Qiu, and Zheng Zhang. Can language models learn to skip steps? In Advances in Neural Information Processing Systems 38 (NeurIPS 2024), 2024.
- [28] Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with LLMs. arXiv preprint arXiv:2501.12599, 2025.
- [29] Edward Y. Chang, Yuxuan Tong, Morry Niu, Graham Neubig, and Xiang Yue. Demystifying long chain-of-thought reasoning in LLMs. CoRR, abs/2502.03373, 2025.
- [30] Ruixiang Feng, Yuntao Wen, Silin Zhou, Ke Shi, Yifan Wang, Ran Le, Zhenwei An, Zongchao Chen, Chen Yang, Guangyue Peng, et al. PACE: Prefix-protected and difficulty-aware compression for efficient reasoning. arXiv preprint arXiv:2602.11639, 2026.
- [31] Yi Shen, Jian Zhang, Jieyun Huang, Shuming Shi, Wenjing Zhang, Jiangze Yan, Ning Wang, Kai Wang, Zhaoxiang Liu, and Shiguo Lian. DAST: Difficulty-adaptive slow-thinking for large reasoning models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 2322–2331, 2025.
- [32] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback, 2022.
- [33] Rui Zheng, Shihan Dou, Songyang Gao, Yuan Hua, Wei Shen, Binghai Wang, Yan Liu, Senjie Jin, Qin Liu, Yuhao Zhou, Limao Xiong, Lu Chen, Zhiheng Xi, Nuo Xu, Wenbin Lai, Minghao Zhu, Cheng Chang, Zhangyue Yin, Rongxiang Weng, Wensen Cheng, Haoran Huang, Tianxiang Sun, Hang Yan, Tao Gui, Qi Zhang, Xipeng Qiu, and Xuanjing Huang. Secrets of RLHF in large language models part I: PPO. arXiv preprint arXiv:2307.04964, 2023.
- [34] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. CoRR, abs/2110.14168, 2021.
- [35] Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. In The Twelfth International Conference on Learning Representations, ICLR 2024. OpenReview.net, 2024.
- [36] Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Firdaus Janoos, Larry Rudolph, and Aleksander Madry. Implementation matters in deep RL: A case study on PPO and TRPO. In 8th International Conference on Learning Representations, ICLR 2020. OpenReview.net, 2020.
- [37] OpenReview.net, 2020.
- [38] Marcin Andrychowicz, Anton Raichuk, Piotr Stanczyk, Manu Orsini, Sertan Girgin, Raphaël Marinier, Léonard Hussenot, Matthieu Geist, Olivier Pietquin, Marcin Michalski, Sylvain Gelly, and Olivier Bachem. What matters in on-policy reinforcement learning? A large-scale empirical study. CoRR, abs/2006.05990, 2020.
- [39]
- [40] Tingxu Han, Zhenting Wang, Chunrong Fang, Shiyu Zhao, Shiqing Ma, and Zhenyu Chen. Token-budget-aware LLM reasoning. In Findings of the Association for Computational Linguistics: ACL 2025, pages 24842–24855, 2025.
- [41] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1 (NeurIPS Datasets and Benchmarks 2021), 2021.
- [42] Yixin Cao, Shibo Hong, Xinze Li, Jiahao Ying, Yubo Ma, Haiyuan Liang, Yantao Liu, Zijun Yao, Xiaozhi Wang, Dan Huang, et al. Toward generalizable evaluation in the LLM era: A survey beyond benchmarks. arXiv preprint arXiv:2504.18838, 2025.
- [43] Shengnan An, Xunliang Cai, Xuezhi Cao, Xiaoyu Li, Yehao Lin, Junlin Liu, Xinxuan Lv, Dan Ma, Xuanlin Wang, Ziwen Wang, and Shuang Zhou. AMO-Bench: Large language models still struggle in high school math competitions, 2025.
- [44] Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. In The Thirteenth International Conference on Learning Representations, ICLR 2025.
- [45] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. In First Conference on Language Modeling, 2024.