Reasoning-Trace Collapse: Evaluating the Loss of Explicit Reasoning During Fine-Tuning
Pith reviewed 2026-05-21 06:01 UTC · model grok-4.3
The pith
Fine-tuning on answer-only data causes reasoning models to lose their explicit reasoning traces while final answers stay correct.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Explicit reasoning models lose their intermediate reasoning traces during standard supervised fine-tuning on data that contains no such traces, resulting in reasoning-trace collapse. A structural evaluation framework tracks four categories of traces (valid, empty, missing, truncated) and computes task performance only on cases with valid traces. Across four open-weight models, the rate of valid reasoning drops sharply after fine-tuning while conditional performance on valid traces remains high, and simple loss-masking during training mitigates the effect without requiring teacher-generated traces.
What carries the argument
The structural evaluation framework that classifies each reasoning trace as valid, empty, missing, or truncated and reports both overall accuracy and accuracy conditioned on valid traces.
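A minimal sketch of what such a framework could look like, under stated assumptions. The paper's actual rules are not given here, so `classify_trace` and `summarize` are hypothetical names, and the `<think>` / `</think>` delimiters are assumed only because the referee report mentions them as an example. The sketch checks structure alone; it does not judge whether the reasoning is coherent.

```python
from collections import Counter

# Assumed delimiters; the paper's real markers may differ per model.
THINK_OPEN, THINK_CLOSE = "<think>", "</think>"
CATEGORIES = ("valid", "empty", "missing", "truncated")


def classify_trace(output: str) -> str:
    """Label one model output as valid, empty, missing, or truncated."""
    start = output.find(THINK_OPEN)
    if start == -1:
        return "missing"  # no reasoning block was opened
    end = output.find(THINK_CLOSE, start)
    if end == -1:
        return "truncated"  # block opened but never closed
    body = output[start + len(THINK_OPEN):end]
    return "valid" if body.strip() else "empty"


def summarize(outputs, correct):
    """Report category rates, overall accuracy, and accuracy on valid traces."""
    if not outputs or len(outputs) != len(correct):
        raise ValueError("outputs and correct must be non-empty and equal length")
    labels = [classify_trace(o) for o in outputs]
    counts = Counter(labels)
    n = len(outputs)
    valid_hits = [c for label, c in zip(labels, correct) if label == "valid"]
    return {
        "rates": {k: counts.get(k, 0) / n for k in CATEGORIES},
        "overall_accuracy": sum(correct) / n,
        # None when no valid trace exists, so the ratio is never faked.
        "accuracy_given_valid": (
            sum(valid_hits) / len(valid_hits) if valid_hits else None
        ),
    }


# Toy outputs, invented for illustration only.
outputs = [
    "<think>2 + 2 is 4</think>4",  # valid
    "<think></think>4",            # empty
    "4",                           # missing
    "<think>2 + 2 is",             # truncated
]
correct = [True, True, True, False]
report = summarize(outputs, correct)
print(report)
```

On this toy input, overall accuracy is 0.75 and accuracy given a valid trace is 1.0, yet only a quarter of the outputs carry a valid trace. That gap is the divergence the framework is meant to expose.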
If this is right
- Answer-only metrics can substantially overestimate the reasoning reliability of fine-tuned models.
- The rate of valid reasoning traces can fall sharply even when performance conditional on those traces stays high.
- Loss-masking during fine-tuning can preserve valid reasoning traces without the need for teacher-generated reasoning data.
- Evaluations of adapted reasoning models should include structural reasoning reliability metrics alongside final-answer scores.
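The summary does not say which tokens the paper's loss-masking excludes, so the following is only one plausible reading, sketched under assumptions: prompt tokens and any span covering an empty reasoning block are excluded from the loss, so that answer-only targets do not teach the model to emit empty reasoning. `build_labels` and the token ids are hypothetical; the value -100 is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # default ignore_index in PyTorch's cross-entropy loss


def build_labels(token_ids, prompt_len, masked_spans=()):
    """Copy token ids into labels, hiding masked positions from the loss.

    prompt_len   -- number of leading prompt tokens to mask
    masked_spans -- (start, end) half-open index ranges to mask, for example
                    the positions of an empty reasoning block (assumption)
    """
    labels = list(token_ids)
    for i in range(min(prompt_len, len(labels))):
        labels[i] = IGNORE_INDEX
    for start, end in masked_spans:
        for i in range(max(start, 0), min(end, len(labels))):
            labels[i] = IGNORE_INDEX
    return labels


# Made-up ids: 3 prompt tokens, 2 reasoning-delimiter tokens, 2 answer tokens.
token_ids = [11, 12, 13, 901, 902, 21, 22]
labels = build_labels(token_ids, prompt_len=3, masked_spans=[(3, 5)])
print(labels)
```

With these inputs only the two answer tokens keep their labels, so gradient updates come from the answer alone and the reasoning positions are left untouched.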
Where Pith is reading between the lines
- The same collapse could appear in other structured generation tasks such as code or mathematical derivations when fine-tuning data lacks explicit steps.
- Developers adapting reasoning models may need to include reasoning traces in all fine-tuning datasets to avoid silent degradation.
- The divergence between trace validity and answer correctness could affect how gains from domain adaptation are interpreted in practice.
Load-bearing premise
The framework's rules for labeling traces as valid or invalid track a real loss of reasoning ability rather than an artifact of arbitrary classification choices.
What would settle it
Manually reviewing a random sample of model outputs before and after fine-tuning and finding that the fraction of outputs containing valid reasoning steps does not decrease.
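A rough sketch of how such a manual audit could be set up and read, assuming a reviewer labels each sampled output as containing valid reasoning or not. `draw_audit_sample` and `wilson_interval` are hypothetical helper names, and the counts at the bottom are invented for illustration, not taken from the paper. The Wilson score interval is used here as one common way to put uncertainty on a proportion.

```python
import math
import random


def draw_audit_sample(outputs, k, seed=0):
    """Draw a reproducible random sample of outputs for manual review."""
    rng = random.Random(seed)
    pool = list(outputs)
    return rng.sample(pool, min(k, len(pool)))


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Invented counts: reviewer-judged valid traces out of 100 sampled outputs.
before = wilson_interval(92, 100)
after = wilson_interval(50, 100)
print("before fine-tuning:", before)
print("after fine-tuning: ", after)
```

If the two intervals overlapped heavily on real data, the audit would not support a drop in valid reasoning; clearly separated intervals would.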
Original abstract
Explicit reasoning models are trained to produce intermediate reasoning traces before final answers, but downstream fine-tuning is often performed on ordinary instruction-response data that contains no such traces. We show that this mismatch can induce reasoning-trace collapse: a fine-tuned model continues to produce plausible final answers while losing the structurally valid explicit reasoning traces that made it a reasoning model in the first place. We introduce a structural evaluation framework that separates answer correctness from reasoning-trace validity, measuring valid, empty, missing, and truncated reasoning alongside reasoning-conditioned task performance. Using this framework, we study four open-weight reasoning models and find that standard supervised fine-tuning can rapidly suppress valid reasoning traces, and that answer-only metrics can substantially obscure this failure: in several settings, performance conditional on valid reasoning remains high while the rate of valid reasoning falls sharply. We further show that simple loss-masking strategies can substantially mitigate collapse without requiring teacher-generated reasoning traces. These results suggest that evaluations of fine-tuned reasoning models should report structural reasoning reliability metrics in addition to final-answer performance, especially when adaptation data does not contain explicit reasoning traces.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that standard supervised fine-tuning on answer-only data induces 'reasoning-trace collapse' in explicit reasoning models: valid intermediate reasoning traces are rapidly suppressed while final-answer accuracy remains high, and that answer-only metrics obscure this. It introduces a structural evaluation framework classifying traces as valid/empty/missing/truncated and measuring reasoning-conditioned performance, demonstrates the phenomenon across four open-weight models, and shows mitigation via loss-masking without needing teacher traces. The work recommends reporting structural reliability metrics when adaptation data lacks explicit reasoning.
Significance. If the central empirical findings hold after addressing framework validation, the paper would be significant for highlighting a practical failure mode in fine-tuning reasoning models and for advocating structural metrics beyond final-answer accuracy. The loss-masking mitigation is a low-cost, practical contribution that does not require additional trace generation. The empirical focus on observable rates of valid traces provides falsifiable observations that could influence evaluation standards in LLM reasoning research.
major comments (2)
- [§3] §3 (Structural Evaluation Framework): The framework's classification into valid/empty/missing/truncated categories is load-bearing for the central claim, yet the manuscript provides no validation against human judgments of reasoning quality or controlled format-ablation experiments on the base models. Because adaptation data contains no traces, post-SFT models are incentivized to drop surface markers (e.g., <think> blocks or step delimiters); if the classifier keys primarily on these, the observed drop in valid rates may reflect output-format adaptation rather than genuine loss of reasoning capability. This distinction must be demonstrated for the claim that answer-only metrics 'substantially obscure' reasoning failure to be supported.
- [§4.2] §4.2 (Experimental Results): The reported sharp fall in valid reasoning rates alongside stable reasoning-conditioned accuracy is the key quantitative finding, but the manuscript does not report inter-annotator agreement or error analysis for the automatic classifier on a held-out sample of traces. Without this, it is unclear whether the categories reliably isolate reasoning loss or introduce systematic bias that could inflate the collapse effect across the four models studied.
minor comments (2)
- [Abstract / §1] The abstract and §1 should explicitly state the four models used (e.g., by name and size) rather than referring only to 'open-weight reasoning models' to improve reproducibility.
- [Figure 3] Figure 3 (loss-masking ablation): The y-axis scale and legend for reasoning-conditioned accuracy are difficult to read; enlarge labels and clarify whether error bars represent standard deviation across seeds or runs.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which identifies key areas where additional validation can strengthen the manuscript's claims about the structural evaluation framework and classifier reliability. We address each major comment below and will incorporate the suggested analyses in the revision.
Point-by-point responses
Referee: [§3] §3 (Structural Evaluation Framework): The framework's classification into valid/empty/missing/truncated categories is load-bearing for the central claim, yet the manuscript provides no validation against human judgments of reasoning quality or controlled format-ablation experiments on the base models. Because adaptation data contains no traces, post-SFT models are incentivized to drop surface markers (e.g., <think> blocks or step delimiters); if the classifier keys primarily on these, the observed drop in valid rates may reflect output-format adaptation rather than genuine loss of reasoning capability. This distinction must be demonstrated for the claim that answer-only metrics 'substantially obscure' reasoning failure to be supported.
Authors: We agree that explicitly demonstrating the distinction between format adaptation and genuine reasoning loss is essential to support the central claim. The classification rules in §3 require not only structural delimiters but also coherent, step-wise logical progression that connects to the final answer; empty or missing traces are flagged even when delimiters are present if content is absent or incoherent. Nevertheless, to directly address the concern, we will add to the revised manuscript a human validation study (agreement rates on 100 sampled traces from base and post-SFT models) and a format-ablation experiment on base-model outputs (stripping delimiters and re-classifying to show valid rates remain high). These additions will confirm that the observed collapse reflects loss of reasoning content rather than surface-format changes alone.
Revision: yes
Referee: [§4.2] §4.2 (Experimental Results): The reported sharp fall in valid reasoning rates alongside stable reasoning-conditioned accuracy is the key quantitative finding, but the manuscript does not report inter-annotator agreement or error analysis for the automatic classifier on a held-out sample of traces. Without this, it is unclear whether the categories reliably isolate reasoning loss or introduce systematic bias that could inflate the collapse effect across the four models studied.
Authors: We acknowledge that reporting inter-annotator agreement and error analysis would improve confidence in the classifier's reliability. The classifier is intentionally rule-based and deterministic to ensure reproducibility across the four models, but we will add a dedicated validation subsection in the revision. This will include manual review and error analysis on a held-out sample of 200 traces (50 per model), Cohen's kappa agreement between two independent human annotators and the automatic classifier, and a breakdown of error categories (e.g., over- or under-flagging of valid traces). These results will be reported to show that systematic bias does not inflate the collapse effect.
Revision: yes
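For readers unfamiliar with the agreement statistic named in this response, here is a small self-contained sketch of Cohen's kappa, computed as (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected by chance from each rater's label frequencies. The function name `cohens_kappa` and the label lists are illustrative, not taken from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented labels: one human annotator versus the automatic classifier.
human = ["valid", "valid", "empty", "missing", "valid", "empty"]
auto = ["valid", "valid", "empty", "missing", "empty", "empty"]
kappa = cohens_kappa(human, auto)
print(round(kappa, 4))  # 17/23, about 0.7391
```

Cohen's kappa is defined for exactly two raters, so with two human annotators plus the classifier it would be computed per pair (human versus human, and each human versus the classifier).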
Circularity Check
No derivation chain present; empirical reporting only
This is an empirical study that defines a structural classification scheme for reasoning traces (valid/empty/missing/truncated) and measures observed rates plus conditional accuracy after fine-tuning. No equations, parameter fits, or derivations are claimed or present in the provided text. The central results are direct experimental observations rather than reductions of outputs to inputs by construction. Self-citations, if any, are not load-bearing for the reported rates. The framework is introduced as a measurement tool, not derived from prior self-referential results.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The structural evaluation framework can reliably distinguish valid reasoning traces from empty, missing, or truncated ones.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We introduce a structural evaluation framework that separates answer correctness from reasoning-trace validity, measuring valid, empty, missing, and truncated reasoning alongside reasoning-conditioned task performance."
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "reasoning-trace collapse: a fine-tuned model continues to produce plausible final answers while losing the structurally valid explicit reasoning traces"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.