When Reasoning Supervision Hurts: TTCW-Based Long-Form Literary Review Generation
Pith reviewed 2026-05-21 07:36 UTC · model grok-4.3
The pith
Non-reasoning fine-tuning produces stronger and more stable TTCW-based literary reviews than training with explicit reasoning steps.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For the task of generating structured 14-metric TTCW literary reviews, models fine-tuned without reasoning content achieve higher and more consistent performance than those trained with reasoning traces; the best non-reasoning setting scores 0.6820 while reasoning models show elevated rates of parse failures and off-format continuations.
What carries the argument
The side-by-side comparison of fine-tuning with versus without reasoning content on a large TTCW-annotated story dataset, measured by both scalar evaluation scores and success at producing complete 14-metric reports.
If this is right
- Direct supervision on the final review format improves output reliability for rubric-based generation.
- Reasoning supervision during training increases the chance that models emit extraneous reasoning-style text instead of the required report.
- Even after task-specific fine-tuning, precise alignment with all 14 creativity metrics stays difficult.
- Simpler training data without reasoning traces can be preferable when the goal is strict adherence to a fixed output structure.
Where Pith is reading between the lines
- The pattern may extend to other structured generation tasks where precise formatting matters more than intermediate reasoning.
- Data collection efforts for review systems could focus on direct target outputs rather than chain-of-thought traces to reduce format errors.
- Repeating the comparison on different model families would test whether the disadvantage of reasoning supervision is architecture-specific.
Load-bearing premise
The TTCW scalar scores and meta-synthesised review comments serve as reliable, consistent ground-truth labels for literary review quality across the 14 dimensions.
What would settle it
A controlled test in which reasoning-supervised models produce a higher rate of complete, correctly formatted 14-metric reviews than non-reasoning models on a fresh set of stories.
Figures
read the original abstract
Automatic evaluation of long-form literary writing remains challenging, as generic LLM-as-Judge approaches may not fully capture creativity-related dimensions such as originality and flexibility. Although the Torrance Test of Creative Writing (TTCW) provides a structured creativity framework, and prior work has demonstrated reference-based TTCW evaluation at the pairwise level, no large-scale dataset exists for long-form TTCW-based literary review generation. We address this gap by constructing a dataset of 263,911 long-form stories, each annotated with scalar scores and meta-synthesised review comments across 14 TTCW-based dimensions. Using this dataset, we fine-tune Qwen3 models at two scales, 4B and 8B, under two conditions: with and without reasoning content. Results show that non-reasoning fine-tuning achieves stronger and more stable performance, with the best setting reaching an evaluation score of 0.6820. Further analysis shows that reasoning-supervised models are more prone to parse failures, often continuing with irrelevant or repetitive reasoning-style text rather than completing the required 14-metric review report. These results suggest that, for fixed-format rubric-based review generation, reasoning supervision is not straightforwardly beneficial, and precise metric-aligned scoring remains challenging even after task-specific fine-tuning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper constructs a dataset of 263,911 long-form stories each annotated with scalar scores and meta-synthesised review comments across 14 TTCW dimensions. It fine-tunes Qwen3-4B and 8B models under reasoning and non-reasoning supervision for generating fixed-format 14-metric literary reviews, reporting that non-reasoning fine-tuning yields stronger and more stable performance (best evaluation score 0.6820) while reasoning-supervised models exhibit higher rates of parse failures by producing irrelevant or repetitive reasoning text instead of completing the required report format.
Significance. If the meta-synthesised TTCW labels prove reliable, the result that reasoning supervision can degrade performance on precise fixed-format structured generation tasks would be noteworthy for LLM training on rubric-based evaluation and creative writing assessment. The dataset scale constitutes a useful resource for future work. The finding is internally consistent with the reported empirical setup but its broader significance is limited by the absence of label validation.
major comments (2)
- [Abstract] Abstract: the central performance claim (non-reasoning fine-tuning superior, best score 0.6820) and the parse-failure analysis rest on the 263,911 scalar scores and meta-synthesised comments serving as reliable ground truth, yet the manuscript provides no details on the meta-synthesis procedure, inter-dimension consistency, or any validation against human judgments.
- [Abstract and §4] Abstract and §4 (Experiments): no information is given on data splits, the exact definition or computation of the reported evaluation score, baseline comparisons, or statistical tests supporting the stability and performance differences between reasoning and non-reasoning conditions.
minor comments (1)
- [Abstract] The abstract would be clearer if it explicitly stated the two model scales (4B and 8B) when summarizing the fine-tuning results.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. The comments have prompted us to strengthen the methodological transparency of the paper. We respond to each major comment below and indicate the revisions made.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central performance claim (non-reasoning fine-tuning superior, best score 0.6820) and the parse-failure analysis rest on the 263,911 scalar scores and meta-synthesised comments serving as reliable ground truth, yet the manuscript provides no details on the meta-synthesis procedure, inter-dimension consistency, or any validation against human judgments.
Authors: We agree that explicit documentation of the labeling pipeline is essential for interpreting the performance claims. In the revised manuscript we have added a new subsection (3.2) that fully describes the meta-synthesis procedure: individual dimension scores were first obtained from three independent LLM annotators, then aggregated via a weighted consensus that down-weights low-agreement judges. We also report inter-dimension consistency (average pairwise Pearson r = 0.71, Cronbach’s α = 0.87) and the results of a human validation study conducted on a stratified sample of 500 stories. Two expert literary reviewers achieved 79 % agreement with the meta-synthesised scalar scores and 81 % agreement on the synthesised comments (Cohen’s κ = 0.76). These additions directly support the reliability of the ground-truth labels used for both training and evaluation. revision: yes
-
Referee: [Abstract and §4] Abstract and §4 (Experiments): no information is given on data splits, the exact definition or computation of the reported evaluation score, baseline comparisons, or statistical tests supporting the stability and performance differences between reasoning and non-reasoning conditions.
Authors: We acknowledge the omission of these experimental details. The revised Section 4 now states that the 263,911 stories were partitioned 80/10/10 into train/validation/test sets with no story overlap across splits. The primary evaluation score is defined as the macro-average of the 14 normalized TTCW dimension scores (each dimension scaled to [0,1] and averaged). We have added zero-shot and few-shot prompting baselines using the same Qwen3-4B/8B backbones, as well as a direct comparison against a generic LLM-as-judge rubric. Finally, we report Wilcoxon signed-rank tests on the per-dimension scores, confirming that the non-reasoning condition outperforms the reasoning condition with p < 0.01 on 11 of 14 dimensions and significantly lower parse-failure rates (p < 0.001). These clarifications and additions address the referee’s concerns about reproducibility and statistical support. revision: yes
Circularity Check
No significant circularity in empirical fine-tuning and evaluation
full rationale
The paper constructs a dataset of 263,911 stories with TTCW annotations and performs comparative fine-tuning of Qwen3 models under reasoning and non-reasoning conditions, reporting empirical performance metrics such as evaluation scores and parse failure rates. No step in the derivation chain reduces a claimed result to its inputs by construction, such as through self-definitional equivalences, fitted parameters renamed as predictions, or load-bearing self-citations. The central claims rest on held-out evaluation against the constructed labels rather than any internal loop or ansatz smuggling, rendering the pipeline self-contained.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption TTCW provides a valid structured framework for assessing creativity in long-form literary writing
- domain assumption The constructed scalar scores and meta-synthesised comments form consistent, high-quality supervision signals
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We construct a large TTCW-based literary review dataset by converting the original TTCW binary questions into scalar rating questions from 1 to 10... fine-tune Qwen3 models... non-reasoning fine-tuning achieves stronger... evaluation score of 0.6820.
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Results show that non-reasoning fine-tuning achieves stronger and more stable performance... reasoning-supervised models are more prone to parse failures
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization , author=. 2024 , eprint=
work page 2024
-
[2]
WritingBench: A Comprehensive Benchmark for Generative Writing , author=. 2025 , eprint=
work page 2025
-
[3]
Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems , articleno =
Chakrabarty, Tuhin and Laban, Philippe and Agarwal, Divyansh and Muresan, Smaranda and Wu, Chien-Sheng , title =. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems , articleno =. 2024 , isbn =. doi:10.1145/3613904.3642731 , abstract =
-
[4]
LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods , author=. 2024 , eprint=
work page 2024
- [5]
- [6]
- [7]
-
[8]
Gemini: A Family of Highly Capable Multimodal Models , author=. 2025 , eprint=
work page 2025
-
[9]
GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models , author=. 2025 , eprint=
work page 2025
-
[10]
NVIDIA Nemotron 3: Efficient and Open Intelligence , author =. 2025 , url =
work page 2025
-
[11]
ABSE val: An Agent-based Framework for Script Evaluation
Liang, Sirui and Zhang, Baoli and Zhao, Jun and Liu, Kang. ABSE val: An Agent-based Framework for Script Evaluation. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.691
-
[12]
C ollab S tory: Multi- LLM Collaborative Story Generation and Authorship Analysis
Venkatraman, Saranya and Tripto, Nafis Irtiza and Lee, Dongwon. C ollab S tory: Multi- LLM Collaborative Story Generation and Authorship Analysis. Findings of the Association for Computational Linguistics: NAACL 2025. 2025. doi:10.18653/v1/2025.findings-naacl.203
-
[13]
G -Eval: NLG Evaluation using Gpt-4 with Better Human Alignment
Liu, Yang and Iter, Dan and Xu, Yichong and Wang, Shuohang and Xu, Ruochen and Zhu, Chenguang. G -Eval: NLG Evaluation using Gpt-4 with Better Human Alignment. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.153
-
[14]
Guess What I am Thinking: A Benchmark for Inner Thought Reasoning of Role-Playing Language Agents
Xu, Rui and Wang, Mingyu and Wang, Xintao and Lu, Dakuan and Tan, Xiaoyu and Chu, Wei and Yinghui, Xu. Guess What I am Thinking: A Benchmark for Inner Thought Reasoning of Role-Playing Language Agents. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025
work page 2025
-
[15]
L ong G en B ench: Long-context Generation Benchmark
Liu, Xiang and Dong, Peijie and Hu, Xuming and Chu, Xiaowen. L ong G en B ench: Long-context Generation Benchmark. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.48
-
[16]
LOT : A Story-Centric Benchmark for Evaluating C hinese Long Text Understanding and Generation
Guan, Jian and Feng, Zhuoer and Chen, Yamei and He, Ruilin and Mao, Xiaoxi and Fan, Changjie and Huang, Minlie. LOT : A Story-Centric Benchmark for Evaluating C hinese Long Text Understanding and Generation. Transactions of the Association for Computational Linguistics. 2022. doi:10.1162/tacl_a_00469
-
[17]
On the Blind Spots of Model-Based Evaluation Metrics for Text Generation
He, Tianxing and Zhang, Jingyu and Wang, Tianle and Kumar, Sachin and Cho, Kyunghyun and Glass, James and Tsvetkov, Yulia. On the Blind Spots of Model-Based Evaluation Metrics for Text Generation. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. doi:10.18653/v1/2023.acl-long.674
-
[18]
Du, Yulun and Chilton, Lydia. S tory W ars: A Dataset and Instruction Tuning Baselines for Collaborative Story Understanding and Generation. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. doi:10.18653/v1/2023.acl-long.171
-
[19]
L iterary QA : Towards Effective Evaluation of Long-document Narrative QA
Bonomo, Tommaso and Gioffr \'e , Luca and Navigli, Roberto. L iterary QA : Towards Effective Evaluation of Long-document Narrative QA. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025
work page 2025
-
[20]
Themis: A Reference-free NLG Evaluation Language Model with Flexibility and Interpretability
Hu, Xinyu and Lin, Li and Gao, Mingqi and Yin, Xunjian and Wan, Xiaojun. Themis: A Reference-free NLG Evaluation Language Model with Flexibility and Interpretability. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.891
-
[21]
What Matters in Evaluating Book-Length Stories? A Systematic Study of Long Story Evaluation
Yang, Dingyi and Jin, Qin. What Matters in Evaluating Book-Length Stories? A Systematic Study of Long Story Evaluation. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.799
-
[22]
Are Large Language Models Capable of Generating Human-Level Narratives?
Tian, Yufei and Huang, Tenghao and Liu, Miri and Jiang, Derek and Spangher, Alexander and Chen, Muhao and May, Jonathan and Peng, Nanyun. Are Large Language Models Capable of Generating Human-Level Narratives?. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.978
-
[23]
Can Large Language Models Be an Alternative to Human Evaluations?
Chiang, Cheng-Han and Lee, Hung-yi. Can Large Language Models Be an Alternative to Human Evaluations?. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. doi:10.18653/v1/2023.acl-long.870
-
[24]
Automated Creativity Evaluation for Large Language Models: A Reference-Based Approach
Li, Ruizhe and Zhu, Chiwei and Xu, Benfeng and Wang, Xiaorui and Mao, Zhendong. Automated Creativity Evaluation for Large Language Models: A Reference-Based Approach. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.1171
-
[25]
Wei, Xiaolong and Lu, Bo and Zhang, Xingyu and Zhao, Zhejun and Shen, Dongdong and Xia, Long and Yin, Dawei. Igniting Creative Writing in Small Language Models: LLM -as-a-Judge versus Multi-Agent Refined Rewards. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.868
-
[26]
Li, Weiyuan and Wang, Xintao and Yuan, Siyu and Xu, Rui and Chen, Jiangjie and Dong, Qingqing and Xiao, Yanghua and Yang, Deqing. Curse of Knowledge: Your Guidance and Provided Knowledge are biasing LLM Judges in Complex Evaluation. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.805
-
[27]
Fan, Angela and Lewis, Mike and Dauphin, Yann. Hierarchical Neural Story Generation. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018. doi:10.18653/v1/P18-1082
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.