pith. sign in

arxiv: 2605.21127 · v1 · pith:ZXPMSMOInew · submitted 2026-05-20 · 💻 cs.LG

Reasoning-Trace Collapse: Evaluating the Loss of Explicit Reasoning During Fine-Tuning

Pith reviewed 2026-05-21 06:01 UTC · model grok-4.3

classification 💻 cs.LG
keywords reasoning trace collapsesupervised fine-tuningexplicit reasoningstructural evaluationloss maskingreasoning modelsLLM adaptationtrace validity
0
0 comments X

The pith

Fine-tuning on answer-only data causes reasoning models to lose their explicit reasoning traces while final answers stay correct.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that reasoning models trained to output step-by-step traces before answers lose those traces when fine-tuned on ordinary instruction data that lacks any traces. This creates reasoning-trace collapse, where models still give plausible answers but stop producing structurally valid reasoning. The authors introduce a framework that measures trace validity separately from answer correctness, revealing that standard answer-only metrics hide the loss because performance stays high on the remaining valid traces. They also demonstrate that masking the loss on certain tokens during fine-tuning can reduce the collapse without needing extra reasoning data from a teacher model.

Core claim

Explicit reasoning models lose their intermediate reasoning traces during standard supervised fine-tuning on data that contains no such traces, resulting in reasoning-trace collapse. A structural evaluation framework tracks four categories of traces (valid, empty, missing, truncated) and computes task performance only on cases with valid traces. Across four open-weight models, the rate of valid reasoning drops sharply after fine-tuning while conditional performance on valid traces remains high, and simple loss-masking during training mitigates the effect without requiring teacher-generated traces.

What carries the argument

The structural evaluation framework that classifies each reasoning trace as valid, empty, missing, or truncated and reports both overall accuracy and accuracy conditioned on valid traces.

If this is right

  • Answer-only metrics can substantially overestimate the reasoning reliability of fine-tuned models.
  • The rate of valid reasoning traces can fall sharply even when performance conditional on those traces stays high.
  • Loss-masking during fine-tuning can preserve valid reasoning traces without the need for teacher-generated reasoning data.
  • Evaluations of adapted reasoning models should include structural reasoning reliability metrics alongside final-answer scores.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same collapse could appear in other structured generation tasks such as code or mathematical derivations when fine-tuning data lacks explicit steps.
  • Developers adapting reasoning models may need to include reasoning traces in all fine-tuning datasets to avoid silent degradation.
  • The divergence between trace validity and answer correctness could affect how gains from domain adaptation are interpreted in practice.

Load-bearing premise

The framework's rules for labeling traces as valid or invalid accurately reflect real loss of reasoning ability instead of depending on arbitrary classification choices.

What would settle it

Manually reviewing a random sample of model outputs before and after fine-tuning and finding that the fraction of outputs containing valid reasoning steps does not decrease.

Figures

Figures reproduced from arXiv: 2605.21127 by Helen Yannakoudakis, Jie M. Zhang, Lukas Twist.

Figure 1
Figure 1. Figure 1: Format-Induced Reasoning-Trace Collapse. We compare two ways of representing missing reasoning during supervised fine-tuning: including an empty <think> block, or omitting rea￾soning tags entirely. Metrics are measured every 100 training steps across three datasets: Chemistry, GSM8K (math), and EvalPlus (code). Solid lines show pass@1, dashed lines show valid reasoning rate (VR), and dotted lines show reas… view at source ↗
Figure 2
Figure 2. Figure 2: Mitigating Reasoning-Trace Collapse. We compare three ways to mitigate reasoning-trace collapse: masking the empty <think> block, updating weights using the response only, or using distillation from a teacher model. Metrics are measured every 100 training steps across three datasets: Chemistry, GSM8K (math), and EvalPlus (code). Solid lines show pass@1, dashed lines show valid reasoning rate (VR), and dott… view at source ↗
Figure 3
Figure 3. Figure 3: Learning-Rate Sensitivity. We compare three different initial learning rates for standard supervised fine-tuning at 5e−6, 1e−5, and 2e−5. Metrics are measured every 100 training steps across three datasets: Chemistry, GSM8K (math), and EvalPlus (code). Solid lines show pass@1, dashed lines show valid reasoning rate (VR), and dotted lines show reasoning-conditioned pass@1 (Rpass@1). C Learning-Rate Sensitiv… view at source ↗
read the original abstract

Explicit reasoning models are trained to produce intermediate reasoning traces before final answers, but downstream fine-tuning is often performed on ordinary instruction-response data that contains no such traces. We show that this mismatch can induce reasoning-trace collapse: a fine-tuned model continues to produce plausible final answers while losing the structurally valid explicit reasoning traces that made it a reasoning model in the first place. We introduce a structural evaluation framework that separates answer correctness from reasoning-trace validity, measuring valid, empty, missing, and truncated reasoning alongside reasoning-conditioned task performance. Using this framework, we study four open-weight reasoning models and find that standard supervised fine-tuning can rapidly suppress valid reasoning traces, and that answer-only metrics can substantially obscure this failure: in several settings, performance conditional on valid reasoning remains high while the rate of valid reasoning falls sharply. We further show that simple loss-masking strategies can substantially mitigate collapse without requiring teacher-generated reasoning traces. These results suggest that evaluations of fine-tuned reasoning models should report structural reasoning reliability metrics in addition to final-answer performance, especially when adaptation data does not contain explicit reasoning traces.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that standard supervised fine-tuning on answer-only data induces 'reasoning-trace collapse' in explicit reasoning models: valid intermediate reasoning traces are rapidly suppressed while final-answer accuracy remains high, and that answer-only metrics obscure this. It introduces a structural evaluation framework classifying traces as valid/empty/missing/truncated and measuring reasoning-conditioned performance, demonstrates the phenomenon across four open-weight models, and shows mitigation via loss-masking without needing teacher traces. The work recommends reporting structural reliability metrics when adaptation data lacks explicit reasoning.

Significance. If the central empirical findings hold after addressing framework validation, the paper would be significant for highlighting a practical failure mode in fine-tuning reasoning models and for advocating structural metrics beyond final-answer accuracy. The loss-masking mitigation is a low-cost, practical contribution that does not require additional trace generation. The empirical focus on observable rates of valid traces provides falsifiable observations that could influence evaluation standards in LLM reasoning research.

major comments (2)
  1. [§3] §3 (Structural Evaluation Framework): The framework's classification into valid/empty/missing/truncated categories is load-bearing for the central claim, yet the manuscript provides no validation against human judgments of reasoning quality or controlled format-ablation experiments on the base models. Because adaptation data contains no traces, post-SFT models are incentivized to drop surface markers (e.g., <think> blocks or step delimiters); if the classifier keys primarily on these, the observed drop in valid rates may reflect output-format adaptation rather than genuine loss of reasoning capability. This distinction must be demonstrated for the claim that answer-only metrics 'substantially obscure' reasoning failure to be supported.
  2. [§4.2] §4.2 (Experimental Results): The reported sharp fall in valid reasoning rates alongside stable reasoning-conditioned accuracy is the key quantitative finding, but the manuscript does not report inter-annotator agreement or error analysis for the automatic classifier on a held-out sample of traces. Without this, it is unclear whether the categories reliably isolate reasoning loss or introduce systematic bias that could inflate the collapse effect across the four models studied.
minor comments (2)
  1. [Abstract / §1] The abstract and §1 should explicitly state the four models used (e.g., by name and size) rather than referring only to 'open-weight reasoning models' to improve reproducibility.
  2. [Figure 3] Figure 3 (loss-masking ablation): The y-axis scale and legend for reasoning-conditioned accuracy are difficult to read; enlarge labels and clarify whether error bars represent standard deviation across seeds or runs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which identifies key areas where additional validation can strengthen the manuscript's claims about the structural evaluation framework and classifier reliability. We address each major comment below and will incorporate the suggested analyses in the revision.

read point-by-point responses
  1. Referee: [§3] §3 (Structural Evaluation Framework): The framework's classification into valid/empty/missing/truncated categories is load-bearing for the central claim, yet the manuscript provides no validation against human judgments of reasoning quality or controlled format-ablation experiments on the base models. Because adaptation data contains no traces, post-SFT models are incentivized to drop surface markers (e.g., <think> blocks or step delimiters); if the classifier keys primarily on these, the observed drop in valid rates may reflect output-format adaptation rather than genuine loss of reasoning capability. This distinction must be demonstrated for the claim that answer-only metrics 'substantially obscure' reasoning failure to be supported.

    Authors: We agree that explicitly demonstrating the distinction between format adaptation and genuine reasoning loss is essential to support the central claim. The classification rules in §3 require not only structural delimiters but also coherent, step-wise logical progression that connects to the final answer; empty or missing traces are flagged even when delimiters are present if content is absent or incoherent. Nevertheless, to directly address the concern, we will add to the revised manuscript a human validation study (agreement rates on 100 sampled traces from base and post-SFT models) and a format-ablation experiment on base-model outputs (stripping delimiters and re-classifying to show valid rates remain high). These additions will confirm that the observed collapse reflects loss of reasoning content rather than surface-format changes alone. revision: yes

  2. Referee: [§4.2] §4.2 (Experimental Results): The reported sharp fall in valid reasoning rates alongside stable reasoning-conditioned accuracy is the key quantitative finding, but the manuscript does not report inter-annotator agreement or error analysis for the automatic classifier on a held-out sample of traces. Without this, it is unclear whether the categories reliably isolate reasoning loss or introduce systematic bias that could inflate the collapse effect across the four models studied.

    Authors: We acknowledge that reporting inter-annotator agreement and error analysis would improve confidence in the classifier's reliability. The classifier is intentionally rule-based and deterministic to ensure reproducibility across the four models, but we will add a dedicated validation subsection in the revision. This will include manual review and error analysis on a held-out sample of 200 traces (50 per model), Cohen's kappa agreement between two independent human annotators and the automatic classifier, and a breakdown of error categories (e.g., over- or under-flagging of valid traces). These results will be reported to show that systematic bias does not inflate the collapse effect. revision: yes

Circularity Check

0 steps flagged

No derivation chain present; empirical reporting only

full rationale

This is an empirical study that defines a structural classification scheme for reasoning traces (valid/empty/missing/truncated) and measures observed rates plus conditional accuracy after fine-tuning. No equations, parameter fits, or derivations are claimed or present in the provided text. The central results are direct experimental observations rather than reductions of outputs to inputs by construction. Self-citations, if any, are not load-bearing for the reported rates. The framework is introduced as a measurement tool, not derived from prior self-referential results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only the abstract is available, so the ledger is limited to the core assumption underlying the new measurement framework. No free parameters or invented entities are described.

axioms (1)
  • domain assumption The structural evaluation framework can reliably distinguish valid reasoning traces from empty, missing, or truncated ones.
    This assumption is required for the reported rates of reasoning collapse and the claim that answer-only metrics obscure the failure.

pith-pipeline@v0.9.0 · 5721 in / 1334 out tokens · 58722 ms · 2026-05-21T06:01:11.070775+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages

  1. [1]

    Allenai/Olmo-3-7B-Think · Hugging Face

    allenai. Allenai/Olmo-3-7B-Think · Hugging Face. https://huggingface.co/allenai/Olmo-3-7B- Think, January 2026

  2. [2]

    Reasoning AI Models: An overview

    Amit Bahree. Reasoning AI Models: An overview. https://blog.desigeek.com/post/2025/09/reasoning-ai-models-a-deep-dive/, September 2025

  3. [3]

    LoRA learns less and forgets less.Transactions on Machine Learning Research, 2024

    Dan Biderman, Jacob Portes, Jose Javier Gonzalez Ortiz, Mansheej Paul, Philip Greengard, Connor Jennings, Daniel King, Sam Havens, Vitaliy Chiley, Jonathan Frankle, Cody Blakeney, and John Patrick Cunningham. LoRA learns less and forgets less.Transactions on Machine Learning Research, 2024. ISSN 2835-8856

  4. [4]

    Training Verifiers to Solve Math Word Problems, November 2021

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training Verifiers to Solve Math Word Problems, November 2021

  5. [5]

    DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai D...

  6. [6]

    Transformers

    Hugging Face. Transformers. https://huggingface.co/docs/transformers/en/index, 2026

  7. [7]

    Keypoint- based progressive chain-of-thought distillation for LLMs

    Kaituo Feng, Changsheng Li, Xiaolu Zhang, Jun Zhou, Ye Yuan, and Guoren Wang. Keypoint- based progressive chain-of-thought distillation for LLMs. InProceedings of the 41st Interna- tional Conference on Machine Learning, volume 235 ofICML’24, pages 13241–13255, Vienna, Austria, July 2024. JMLR.org

  8. [8]

    SciKnowEval: A Comprehensive Dataset for Evaluating Scientific Knowledge of Large Language Models

    Kehua Feng, Xinyi Shen, Weijie Wang, Xiang Zhuang, Yuqi Tang, Qiang Zhang, and Keyan Ding. SciKnowEval: A Comprehensive Dataset for Evaluating Scientific Knowledge of Large Language Models. InNeurIPS 2025 AI for Science Workshop, October 2025

  9. [9]

    Etash Kumar Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Rea Sprague, Ashima Suvarna, Benjamin Feuer, Leon Liangyu Chen, Zaid Khan, Eric Frankel, Sachin Grover, Caroline Choi, Niklas Muennighoff, Shiye Su, Wanjia Zhao, John Yang, Shreyas Pimpalgaonkar, Kartik Sharma, Charlie...

  10. [10]

    LLM reasoners: New evaluation, library, and analysis of step-by-step reasoning with large language models

    Shibo Hao, Yi Gu, Haotian Luo, Tianyang Liu, Xiyan Shao, Xinyuan Wang, Shuhua Xie, Haodi Ma, Adithya Samavedhi, Qiyue Gao, Zhen Wang, and Zhiting Hu. LLM reasoners: New evaluation, library, and analysis of step-by-step reasoning with large language models. InICLR 2024 Workshop on Large Language Model (LLM) Agents, 2024

  11. [11]

    doi: 10.18653/v1/2023.findings-acl.507

    Cheng-Yu Hsieh, Chun-Liang Li, Chih-kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alex Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors,Findings of the Association for Computa- ti...

  12. [12]

    Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations, October 2021

  13. [13]

    Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts

    Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan A, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. DSPy: Compiling Declarative Language Model Calls into State-of-the-Art Pipelines. InThe Twelfth International Conference on Learning Represent...

  14. [14]

    Evaluating Step-by-step Reasoning Traces: A Survey

    Jinu Lee and Julia Hockenmaier. Evaluating Step-by-step Reasoning Traces: A Survey. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Findings of the Association for Computational Linguistics: EMNLP 2025, pages 1789–1814, Suzhou, China, November 2025. Association for Computational Linguistics. ISBN 979-8- 89176-33...

  15. [15]

    Let’s Verify Step by Step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s Verify Step by Step. InThe Twelfth International Conference on Learning Representations, October 2023

  16. [16]

    Is your code generated by ChatGPT really correct? rigorous evaluation of large language models for code generation

    Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by ChatGPT really correct? rigorous evaluation of large language models for code generation. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, pages 21558–21572, Red Hook, NY , USA, December 2023. Curran Associates Inc

  17. [17]

    On the Impact of Fine-Tuning on Chain- of-Thought Reasoning

    Elita Lobo, Chirag Agarwal, and Himabindu Lakkaraju. On the Impact of Fine-Tuning on Chain- of-Thought Reasoning. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors,Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (V olume 1: Long Papers), pages 11679–1...

  18. [18]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. InInternational Conference on Learning Representations, September 2018

  19. [19]

    Qwen3 Best Practices

    Alibaba ModelScope. Qwen3 Best Practices. https://swift.readthedocs.io/en/latest/BestPractices/Qwen3- Best-Practice.html, 2026

  20. [20]

    Nvidia/OpenReasoning-Nemotron-7B

    nvidia. Nvidia/OpenReasoning-Nemotron-7B. https://huggingface.co/nvidia/OpenReasoning- Nemotron-7B, May 2026

  21. [21]

    GPT-5 mini - API

    OpenAI. GPT-5 mini - API. https://platform.openai.com, 2025

  22. [22]

    Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! InThe Twelfth International Conference on Learning Representations, October 2023

    Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! InThe Twelfth International Conference on Learning Representations, October 2023

  23. [23]

    Qwen/Qwen3-8B · Hugging Face

    Qwen. Qwen/Qwen3-8B · Hugging Face. https://huggingface.co/Qwen/Qwen3-8B, December 2025

  24. [24]

    Self-Distillation Enables Continual Learning, January 2026

    Idan Shenfeld, Mehul Damani, Jonas Hübotter, and Pulkit Agrawal. Self-Distillation Enables Continual Learning, January 2026

  25. [25]

    Yu, and Jianfeng Gao

    Guiyao Tie, Zeli Zhao, Dingjie Song, Fuyang Wei, Rong Zhou, Yurou Dai, Wen Yin, Zhejian Yang, Jiangyue Yan, Yao Su, Zhenhan Dai, Yifeng Xie, Yihan Cao, Lichao Sun, Pan Zhou, Lifang He, Hechang Chen, Yu Zhang, Qingsong Wen, Tianming Liu, Neil Zhenqiang Gong, Jiliang Tang, Caiming Xiong, Heng Ji, Philip S. Yu, and Jianfeng Gao. A Survey on Post-training of ...

  26. [26]

    A Short Survey on Small Reasoning Models: Training, Inference, Applications and Research Directions, April 2025

    Chengyu Wang, Taolin Zhang, Richang Hong, and Jun Huang. A Short Survey on Small Reasoning Models: Training, Inference, Applications and Research Directions, April 2025

  27. [27]

    A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness.ACM Trans

    Fali Wang, Zhiwei Zhang, Xianren Zhang, Zongyu Wu, TzuHao Mo, Qiuhao Lu, Wanjing Wang, Rui Li, Junjie Xu, Xianfeng Tang, Qi He, Yao Ma, Ming Huang, and Suhang Wang. A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness.ACM Trans. Intell. Sys...

  28. [28]

    SCOTT: Self-Consistent Chain-of-Thought Distillation

    Peifeng Wang, Zhengyang Wang, Zheng Li, Yifan Gao, Bing Yin, and Xiang Ren. SCOTT: Self-Consistent Chain-of-Thought Distillation. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors,Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), pages 5546–5558, Toronto, Canada, July 2023. Asso...

  29. [29]

    Emergent Abilities of Large Language Models.Transactions on Machine Learning Research, 2022

    Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent Abilities of Large Language Models.Transactions on Machine Learning Research, 2022. 12

  30. [30]

    Chi, Quoc V

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V . Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. InProceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, pages 24824–24837, Red Hook, NY , USA, November 2022. Curran Ass...

  31. [31]

    Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models, January 2025

    Fengli Xu, Qianyue Hao, Zefang Zong, Jingwei Wang, Yunke Zhang, Jingyi Wang, Xiaochong Lan, Jiahui Gong, Tianjian Ouyang, Fanjin Meng, Chenyang Shao, Yuwei Yan, Qinglong Yang, Yiwen Song, Sijian Ren, Xinyuan Hu, Yu Li, Jie Feng, Chen Gao, and Yong Li. Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models, January 2025

  32. [32]

    Zhang, Z

    Biao Zhang, Zhongtao Liu, Colin Cherry, and Orhan Firat. When Scaling Meets LLM Finetun- ing: The Effect of Data, Model and Finetuning Method. In12th International Conference on Learning Representations (ICLR24). arXiv, February 2024. doi: 10.48550/arXiv.2402.17193

  33. [33]

    Instruction Tuning for Large Language Models: A Survey.ACM Computing Surveys, page 3777411, November 2025

    Shengyu Zhang, Linfeng Dong, Xiaoya Li, Sen Zhang, Xiaofei Sun, Shuhe Wang, Jiwei Li, Runyi Hu, Tianwei Zhang, Guoyin Wang, and Fei Wu. Instruction Tuning for Large Language Models: A Survey.ACM Comput. Surv., 58(7):169:1–169:36, January 2026. ISSN 0360-0300. doi: 10.1145/3777411

  34. [34]

    Hedderich

    Raoyuan Zhao, Yihong Liu, Hinrich Schuetze, and Michael A. Hedderich. A Comprehensive Evaluation of Multilingual Chain-of-Thought Reasoning: Performance, Consistency, and Faithfulness Across Languages. In Vera Demberg, Kentaro Inui, and Lluís Marquez, editors, Findings of the Association for Computational Linguistics: EACL 2026, pages 5223–5247, Rabat, Mo...

  35. [35]

    Let me break this down step by step

    Yujun Zhou, Jiayi Ye, Zipeng Ling, Yufei Han, Yue Huang, Haomin Zhuang, Zhenwen Liang, Kehan Guo, Taicheng Guo, Xiangqi Wang, and Xiangliang Zhang. Dissecting Logical Reason- ing in LLMs: A Fine-Grained Evaluation and Supervision Study. In Christos Christodoulopou- los, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Findings of the Associ- ati...