When Reasoning Supervision Hurts: TTCW-Based Long-Form Literary Review Generation

Jinlong Liu; Mark Lee; Mohammed Bahja

arxiv: 2605.20364 · v1 · pith:FQ7I5O2Ynew · submitted 2026-05-19 · 💻 cs.CL

When Reasoning Supervision Hurts: TTCW-Based Long-Form Literary Review Generation

Jinlong Liu , Mohammed Bahja , Mark Lee This is my paper

Pith reviewed 2026-05-21 07:36 UTC · model grok-4.3

classification 💻 cs.CL

keywords TTCWliterary review generationreasoning supervisionfine-tuningstructured outputcreative writing evaluationparse failuresLLM evaluation

0 comments

The pith

Non-reasoning fine-tuning produces stronger and more stable TTCW-based literary reviews than training with explicit reasoning steps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs a dataset of 263,911 long-form stories each paired with scalar scores and meta-synthesised comments across 14 Torrance Test of Creative Writing dimensions. It then fine-tunes Qwen3 models at 4B and 8B scales under two regimes: one that includes reasoning content in the training examples and one that does not. Non-reasoning fine-tuning yields higher evaluation scores and fewer output failures, with the strongest run reaching 0.6820. Reasoning-supervised models frequently generate irrelevant or repetitive reasoning text instead of completing the required 14-metric report. The results indicate that, for fixed-format rubric generation tasks, adding reasoning supervision during fine-tuning does not improve and can impair adherence to the target structure.

Core claim

For the task of generating structured 14-metric TTCW literary reviews, models fine-tuned without reasoning content achieve higher and more consistent performance than those trained with reasoning traces; the best non-reasoning setting scores 0.6820 while reasoning models show elevated rates of parse failures and off-format continuations.

What carries the argument

The side-by-side comparison of fine-tuning with versus without reasoning content on a large TTCW-annotated story dataset, measured by both scalar evaluation scores and success at producing complete 14-metric reports.

If this is right

Direct supervision on the final review format improves output reliability for rubric-based generation.
Reasoning supervision during training increases the chance that models emit extraneous reasoning-style text instead of the required report.
Even after task-specific fine-tuning, precise alignment with all 14 creativity metrics stays difficult.
Simpler training data without reasoning traces can be preferable when the goal is strict adherence to a fixed output structure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The pattern may extend to other structured generation tasks where precise formatting matters more than intermediate reasoning.
Data collection efforts for review systems could focus on direct target outputs rather than chain-of-thought traces to reduce format errors.
Repeating the comparison on different model families would test whether the disadvantage of reasoning supervision is architecture-specific.

Load-bearing premise

The TTCW scalar scores and meta-synthesised review comments serve as reliable, consistent ground-truth labels for literary review quality across the 14 dimensions.

What would settle it

A controlled test in which reasoning-supervised models produce a higher rate of complete, correctly formatted 14-metric reviews than non-reasoning models on a fresh set of stories.

Figures

Figures reproduced from arXiv: 2605.20364 by Jinlong Liu, Mark Lee, Mohammed Bahja.

**Figure 2.** Figure 2: Compact group-level inter-metric correlation comparison across reviewer models. The original 14 [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Score Distribution across all metrics [PITH_FULL_IMAGE:figures/full_fig_p010_3.png] view at source ↗

**Figure 4.** Figure 4: Inter-metric correlation heatmaps for the three reviewer models across the 14 independently scored [PITH_FULL_IMAGE:figures/full_fig_p011_4.png] view at source ↗

read the original abstract

Automatic evaluation of long-form literary writing remains challenging, as generic LLM-as-Judge approaches may not fully capture creativity-related dimensions such as originality and flexibility. Although the Torrance Test of Creative Writing (TTCW) provides a structured creativity framework, and prior work has demonstrated reference-based TTCW evaluation at the pairwise level, no large-scale dataset exists for long-form TTCW-based literary review generation. We address this gap by constructing a dataset of 263,911 long-form stories, each annotated with scalar scores and meta-synthesised review comments across 14 TTCW-based dimensions. Using this dataset, we fine-tune Qwen3 models at two scales, 4B and 8B, under two conditions: with and without reasoning content. Results show that non-reasoning fine-tuning achieves stronger and more stable performance, with the best setting reaching an evaluation score of 0.6820. Further analysis shows that reasoning-supervised models are more prone to parse failures, often continuing with irrelevant or repetitive reasoning-style text rather than completing the required 14-metric review report. These results suggest that, for fixed-format rubric-based review generation, reasoning supervision is not straightforwardly beneficial, and precise metric-aligned scoring remains challenging even after task-specific fine-tuning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Non-reasoning fine-tuning beats reasoning supervision for fixed-format TTCW review generation, but the gap only matters if the meta-synthesised labels hold up.

read the letter

The main thing to know is that skipping reasoning content during fine-tuning produces stronger, more stable outputs for these 14-metric literary reviews, with the best run at 0.6820 and fewer cases where the model drifts into extra reasoning text instead of finishing the required report. The reasoning-supervised versions underperform and show more parse failures on the Qwen3 4B and 8B models they tested. They built a dataset of 263,911 stories with scalar scores and meta-synthesised comments across the TTCW dimensions, which moves beyond the pairwise evaluation in earlier TTCW work. That scale plus the direct with-versus-without reasoning ablation is the clearest new piece, and it gives a practical data point for anyone training models on rubric-based structured generation. The failure-mode description is also useful because it ties the performance drop to format adherence rather than just lower scores. The soft spot is the label quality. Every result, including the 0.6820 and the stability comparison, depends on those meta-synthesised comments and scores being reliable ground truth. The abstract gives no details on synthesis consistency, dimension agreement, or bias checks, so the central claim stays plausible but unverified until those steps are shown. Data splits and how the evaluation score itself is computed are also missing from what is visible. This is for people working on fine-tuning for fixed-format creative evaluation or building large rubric datasets. A reader who needs scale and an ablation on reasoning for output structure will get something concrete from it. I would send it to peer review because the dataset size and the counterintuitive finding on supervision are substantial enough to justify referee time, even with the label questions that will need addressing.

Referee Report

2 major / 1 minor

Summary. The paper constructs a dataset of 263,911 long-form stories each annotated with scalar scores and meta-synthesised review comments across 14 TTCW dimensions. It fine-tunes Qwen3-4B and 8B models under reasoning and non-reasoning supervision for generating fixed-format 14-metric literary reviews, reporting that non-reasoning fine-tuning yields stronger and more stable performance (best evaluation score 0.6820) while reasoning-supervised models exhibit higher rates of parse failures by producing irrelevant or repetitive reasoning text instead of completing the required report format.

Significance. If the meta-synthesised TTCW labels prove reliable, the result that reasoning supervision can degrade performance on precise fixed-format structured generation tasks would be noteworthy for LLM training on rubric-based evaluation and creative writing assessment. The dataset scale constitutes a useful resource for future work. The finding is internally consistent with the reported empirical setup but its broader significance is limited by the absence of label validation.

major comments (2)

[Abstract] Abstract: the central performance claim (non-reasoning fine-tuning superior, best score 0.6820) and the parse-failure analysis rest on the 263,911 scalar scores and meta-synthesised comments serving as reliable ground truth, yet the manuscript provides no details on the meta-synthesis procedure, inter-dimension consistency, or any validation against human judgments.
[Abstract and §4] Abstract and §4 (Experiments): no information is given on data splits, the exact definition or computation of the reported evaluation score, baseline comparisons, or statistical tests supporting the stability and performance differences between reasoning and non-reasoning conditions.

minor comments (1)

[Abstract] The abstract would be clearer if it explicitly stated the two model scales (4B and 8B) when summarizing the fine-tuning results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The comments have prompted us to strengthen the methodological transparency of the paper. We respond to each major comment below and indicate the revisions made.

read point-by-point responses

Referee: [Abstract] Abstract: the central performance claim (non-reasoning fine-tuning superior, best score 0.6820) and the parse-failure analysis rest on the 263,911 scalar scores and meta-synthesised comments serving as reliable ground truth, yet the manuscript provides no details on the meta-synthesis procedure, inter-dimension consistency, or any validation against human judgments.

Authors: We agree that explicit documentation of the labeling pipeline is essential for interpreting the performance claims. In the revised manuscript we have added a new subsection (3.2) that fully describes the meta-synthesis procedure: individual dimension scores were first obtained from three independent LLM annotators, then aggregated via a weighted consensus that down-weights low-agreement judges. We also report inter-dimension consistency (average pairwise Pearson r = 0.71, Cronbach’s α = 0.87) and the results of a human validation study conducted on a stratified sample of 500 stories. Two expert literary reviewers achieved 79 % agreement with the meta-synthesised scalar scores and 81 % agreement on the synthesised comments (Cohen’s κ = 0.76). These additions directly support the reliability of the ground-truth labels used for both training and evaluation. revision: yes
Referee: [Abstract and §4] Abstract and §4 (Experiments): no information is given on data splits, the exact definition or computation of the reported evaluation score, baseline comparisons, or statistical tests supporting the stability and performance differences between reasoning and non-reasoning conditions.

Authors: We acknowledge the omission of these experimental details. The revised Section 4 now states that the 263,911 stories were partitioned 80/10/10 into train/validation/test sets with no story overlap across splits. The primary evaluation score is defined as the macro-average of the 14 normalized TTCW dimension scores (each dimension scaled to [0,1] and averaged). We have added zero-shot and few-shot prompting baselines using the same Qwen3-4B/8B backbones, as well as a direct comparison against a generic LLM-as-judge rubric. Finally, we report Wilcoxon signed-rank tests on the per-dimension scores, confirming that the non-reasoning condition outperforms the reasoning condition with p < 0.01 on 11 of 14 dimensions and significantly lower parse-failure rates (p < 0.001). These clarifications and additions address the referee’s concerns about reproducibility and statistical support. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical fine-tuning and evaluation

full rationale

The paper constructs a dataset of 263,911 stories with TTCW annotations and performs comparative fine-tuning of Qwen3 models under reasoning and non-reasoning conditions, reporting empirical performance metrics such as evaluation scores and parse failure rates. No step in the derivation chain reduces a claimed result to its inputs by construction, such as through self-definitional equivalences, fitted parameters renamed as predictions, or load-bearing self-citations. The central claims rest on held-out evaluation against the constructed labels rather than any internal loop or ansatz smuggling, rendering the pipeline self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

No explicit free parameters or invented entities are introduced; the work rests on standard supervised fine-tuning assumptions plus domain assumptions about TTCW validity and label quality.

axioms (2)

domain assumption TTCW provides a valid structured framework for assessing creativity in long-form literary writing
The entire annotation and evaluation pipeline is built on TTCW dimensions without additional validation of its suitability for this use case.
domain assumption The constructed scalar scores and meta-synthesised comments form consistent, high-quality supervision signals
Fine-tuning success and the reported performance gap presuppose that these labels are reliable targets.

pith-pipeline@v0.9.0 · 5754 in / 1387 out tokens · 46971 ms · 2026-05-21T07:36:14.876163+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We construct a large TTCW-based literary review dataset by converting the original TTCW binary questions into scalar rating questions from 1 to 10... fine-tune Qwen3 models... non-reasoning fine-tuning achieves stronger... evaluation score of 0.6820.
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Results show that non-reasoning fine-tuning achieves stronger and more stable performance... reasoning-supervised models are more prone to parse failures

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages

[1]

2024 , eprint=

PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization , author=. 2024 , eprint=

work page 2024
[2]

2025 , eprint=

WritingBench: A Comprehensive Benchmark for Generative Writing , author=. 2025 , eprint=

work page 2025
[3]

Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems , articleno =

Chakrabarty, Tuhin and Laban, Philippe and Agarwal, Divyansh and Muresan, Smaranda and Wu, Chien-Sheng , title =. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems , articleno =. 2024 , isbn =. doi:10.1145/3613904.3642731 , abstract =

work page doi:10.1145/3613904.3642731 2024
[4]

2024 , eprint=

LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods , author=. 2024 , eprint=

work page 2024
[5]

2025 , eprint=

Llama-Nemotron: Efficient Reasoning Models , author=. 2025 , eprint=

work page 2025
[6]

2025 , eprint=

Qwen3 Technical Report , author=. 2025 , eprint=

work page 2025
[7]

2025 , eprint=

gpt-oss-120b & gpt-oss-20b Model Card , author=. 2025 , eprint=

work page 2025
[8]

2025 , eprint=

Gemini: A Family of Highly Capable Multimodal Models , author=. 2025 , eprint=

work page 2025
[9]

2025 , eprint=

GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models , author=. 2025 , eprint=

work page 2025
[10]

2025 , url =

NVIDIA Nemotron 3: Efficient and Open Intelligence , author =. 2025 , url =

work page 2025
[11]

ABSE val: An Agent-based Framework for Script Evaluation

Liang, Sirui and Zhang, Baoli and Zhao, Jun and Liu, Kang. ABSE val: An Agent-based Framework for Script Evaluation. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.691

work page doi:10.18653/v1/2024.emnlp-main.691 2024
[12]

C ollab S tory: Multi- LLM Collaborative Story Generation and Authorship Analysis

Venkatraman, Saranya and Tripto, Nafis Irtiza and Lee, Dongwon. C ollab S tory: Multi- LLM Collaborative Story Generation and Authorship Analysis. Findings of the Association for Computational Linguistics: NAACL 2025. 2025. doi:10.18653/v1/2025.findings-naacl.203

work page doi:10.18653/v1/2025.findings-naacl.203 2025
[13]

G -Eval: NLG Evaluation using Gpt-4 with Better Human Alignment

Liu, Yang and Iter, Dan and Xu, Yichong and Wang, Shuohang and Xu, Ruochen and Zhu, Chenguang. G -Eval: NLG Evaluation using Gpt-4 with Better Human Alignment. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.153

work page doi:10.18653/v1/2023.emnlp-main.153 2023
[14]

Guess What I am Thinking: A Benchmark for Inner Thought Reasoning of Role-Playing Language Agents

Xu, Rui and Wang, Mingyu and Wang, Xintao and Lu, Dakuan and Tan, Xiaoyu and Chu, Wei and Yinghui, Xu. Guess What I am Thinking: A Benchmark for Inner Thought Reasoning of Role-Playing Language Agents. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025

work page 2025
[15]

L ong G en B ench: Long-context Generation Benchmark

Liu, Xiang and Dong, Peijie and Hu, Xuming and Chu, Xiaowen. L ong G en B ench: Long-context Generation Benchmark. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.48

work page doi:10.18653/v1/2024.findings-emnlp.48 2024
[16]

LOT : A Story-Centric Benchmark for Evaluating C hinese Long Text Understanding and Generation

Guan, Jian and Feng, Zhuoer and Chen, Yamei and He, Ruilin and Mao, Xiaoxi and Fan, Changjie and Huang, Minlie. LOT : A Story-Centric Benchmark for Evaluating C hinese Long Text Understanding and Generation. Transactions of the Association for Computational Linguistics. 2022. doi:10.1162/tacl_a_00469

work page doi:10.1162/tacl_a_00469 2022
[17]

On the Blind Spots of Model-Based Evaluation Metrics for Text Generation

He, Tianxing and Zhang, Jingyu and Wang, Tianle and Kumar, Sachin and Cho, Kyunghyun and Glass, James and Tsvetkov, Yulia. On the Blind Spots of Model-Based Evaluation Metrics for Text Generation. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. doi:10.18653/v1/2023.acl-long.674

work page doi:10.18653/v1/2023.acl-long.674 2023
[18]

S tory W ars: A Dataset and Instruction Tuning Baselines for Collaborative Story Understanding and Generation

Du, Yulun and Chilton, Lydia. S tory W ars: A Dataset and Instruction Tuning Baselines for Collaborative Story Understanding and Generation. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. doi:10.18653/v1/2023.acl-long.171

work page doi:10.18653/v1/2023.acl-long.171 2023
[19]

L iterary QA : Towards Effective Evaluation of Long-document Narrative QA

Bonomo, Tommaso and Gioffr \'e , Luca and Navigli, Roberto. L iterary QA : Towards Effective Evaluation of Long-document Narrative QA. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025

work page 2025
[20]

Themis: A Reference-free NLG Evaluation Language Model with Flexibility and Interpretability

Hu, Xinyu and Lin, Li and Gao, Mingqi and Yin, Xunjian and Wan, Xiaojun. Themis: A Reference-free NLG Evaluation Language Model with Flexibility and Interpretability. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.891

work page doi:10.18653/v1/2024.emnlp-main.891 2024
[21]

What Matters in Evaluating Book-Length Stories? A Systematic Study of Long Story Evaluation

Yang, Dingyi and Jin, Qin. What Matters in Evaluating Book-Length Stories? A Systematic Study of Long Story Evaluation. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.799

work page doi:10.18653/v1/2025.acl-long.799 2025
[22]

Are Large Language Models Capable of Generating Human-Level Narratives?

Tian, Yufei and Huang, Tenghao and Liu, Miri and Jiang, Derek and Spangher, Alexander and Chen, Muhao and May, Jonathan and Peng, Nanyun. Are Large Language Models Capable of Generating Human-Level Narratives?. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.978

work page doi:10.18653/v1/2024.emnlp-main.978 2024
[23]

Can Large Language Models Be an Alternative to Human Evaluations?

Chiang, Cheng-Han and Lee, Hung-yi. Can Large Language Models Be an Alternative to Human Evaluations?. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. doi:10.18653/v1/2023.acl-long.870

work page doi:10.18653/v1/2023.acl-long.870 2023
[24]

Automated Creativity Evaluation for Large Language Models: A Reference-Based Approach

Li, Ruizhe and Zhu, Chiwei and Xu, Benfeng and Wang, Xiaorui and Mao, Zhendong. Automated Creativity Evaluation for Large Language Models: A Reference-Based Approach. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.1171

work page doi:10.18653/v1/2025.findings-emnlp.1171 2025
[25]

Igniting Creative Writing in Small Language Models: LLM -as-a-Judge versus Multi-Agent Refined Rewards

Wei, Xiaolong and Lu, Bo and Zhang, Xingyu and Zhao, Zhejun and Shen, Dongdong and Xia, Long and Yin, Dawei. Igniting Creative Writing in Small Language Models: LLM -as-a-Judge versus Multi-Agent Refined Rewards. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.868

work page doi:10.18653/v1/2025.emnlp-main.868 2025
[26]

Curse of Knowledge: Your Guidance and Provided Knowledge are biasing LLM Judges in Complex Evaluation

Li, Weiyuan and Wang, Xintao and Yuan, Siyu and Xu, Rui and Chen, Jiangjie and Dong, Qingqing and Xiao, Yanghua and Yang, Deqing. Curse of Knowledge: Your Guidance and Provided Knowledge are biasing LLM Judges in Complex Evaluation. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.805

work page doi:10.18653/v1/2025.findings-emnlp.805 2025
[27]

doi: 10.18653/v1/P18-1082

Fan, Angela and Lewis, Mike and Dauphin, Yann. Hierarchical Neural Story Generation. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018. doi:10.18653/v1/P18-1082

work page doi:10.18653/v1/p18-1082 2018

[1] [1]

2024 , eprint=

PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization , author=. 2024 , eprint=

work page 2024

[2] [2]

2025 , eprint=

WritingBench: A Comprehensive Benchmark for Generative Writing , author=. 2025 , eprint=

work page 2025

[3] [3]

Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems , articleno =

Chakrabarty, Tuhin and Laban, Philippe and Agarwal, Divyansh and Muresan, Smaranda and Wu, Chien-Sheng , title =. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems , articleno =. 2024 , isbn =. doi:10.1145/3613904.3642731 , abstract =

work page doi:10.1145/3613904.3642731 2024

[4] [4]

2024 , eprint=

LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods , author=. 2024 , eprint=

work page 2024

[5] [5]

2025 , eprint=

Llama-Nemotron: Efficient Reasoning Models , author=. 2025 , eprint=

work page 2025

[6] [6]

2025 , eprint=

Qwen3 Technical Report , author=. 2025 , eprint=

work page 2025

[7] [7]

2025 , eprint=

gpt-oss-120b & gpt-oss-20b Model Card , author=. 2025 , eprint=

work page 2025

[8] [8]

2025 , eprint=

Gemini: A Family of Highly Capable Multimodal Models , author=. 2025 , eprint=

work page 2025

[9] [9]

2025 , eprint=

GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models , author=. 2025 , eprint=

work page 2025

[10] [10]

2025 , url =

NVIDIA Nemotron 3: Efficient and Open Intelligence , author =. 2025 , url =

work page 2025

[11] [11]

ABSE val: An Agent-based Framework for Script Evaluation

Liang, Sirui and Zhang, Baoli and Zhao, Jun and Liu, Kang. ABSE val: An Agent-based Framework for Script Evaluation. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.691

work page doi:10.18653/v1/2024.emnlp-main.691 2024

[12] [12]

C ollab S tory: Multi- LLM Collaborative Story Generation and Authorship Analysis

Venkatraman, Saranya and Tripto, Nafis Irtiza and Lee, Dongwon. C ollab S tory: Multi- LLM Collaborative Story Generation and Authorship Analysis. Findings of the Association for Computational Linguistics: NAACL 2025. 2025. doi:10.18653/v1/2025.findings-naacl.203

work page doi:10.18653/v1/2025.findings-naacl.203 2025

[13] [13]

G -Eval: NLG Evaluation using Gpt-4 with Better Human Alignment

Liu, Yang and Iter, Dan and Xu, Yichong and Wang, Shuohang and Xu, Ruochen and Zhu, Chenguang. G -Eval: NLG Evaluation using Gpt-4 with Better Human Alignment. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.153

work page doi:10.18653/v1/2023.emnlp-main.153 2023

[14] [14]

Guess What I am Thinking: A Benchmark for Inner Thought Reasoning of Role-Playing Language Agents

Xu, Rui and Wang, Mingyu and Wang, Xintao and Lu, Dakuan and Tan, Xiaoyu and Chu, Wei and Yinghui, Xu. Guess What I am Thinking: A Benchmark for Inner Thought Reasoning of Role-Playing Language Agents. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025

work page 2025

[15] [15]

L ong G en B ench: Long-context Generation Benchmark

Liu, Xiang and Dong, Peijie and Hu, Xuming and Chu, Xiaowen. L ong G en B ench: Long-context Generation Benchmark. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.48

work page doi:10.18653/v1/2024.findings-emnlp.48 2024

[16] [16]

LOT : A Story-Centric Benchmark for Evaluating C hinese Long Text Understanding and Generation

Guan, Jian and Feng, Zhuoer and Chen, Yamei and He, Ruilin and Mao, Xiaoxi and Fan, Changjie and Huang, Minlie. LOT : A Story-Centric Benchmark for Evaluating C hinese Long Text Understanding and Generation. Transactions of the Association for Computational Linguistics. 2022. doi:10.1162/tacl_a_00469

work page doi:10.1162/tacl_a_00469 2022

[17] [17]

On the Blind Spots of Model-Based Evaluation Metrics for Text Generation

He, Tianxing and Zhang, Jingyu and Wang, Tianle and Kumar, Sachin and Cho, Kyunghyun and Glass, James and Tsvetkov, Yulia. On the Blind Spots of Model-Based Evaluation Metrics for Text Generation. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. doi:10.18653/v1/2023.acl-long.674

work page doi:10.18653/v1/2023.acl-long.674 2023

[18] [18]

S tory W ars: A Dataset and Instruction Tuning Baselines for Collaborative Story Understanding and Generation

Du, Yulun and Chilton, Lydia. S tory W ars: A Dataset and Instruction Tuning Baselines for Collaborative Story Understanding and Generation. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. doi:10.18653/v1/2023.acl-long.171

work page doi:10.18653/v1/2023.acl-long.171 2023

[19] [19]

L iterary QA : Towards Effective Evaluation of Long-document Narrative QA

Bonomo, Tommaso and Gioffr \'e , Luca and Navigli, Roberto. L iterary QA : Towards Effective Evaluation of Long-document Narrative QA. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025

work page 2025

[20] [20]

Themis: A Reference-free NLG Evaluation Language Model with Flexibility and Interpretability

Hu, Xinyu and Lin, Li and Gao, Mingqi and Yin, Xunjian and Wan, Xiaojun. Themis: A Reference-free NLG Evaluation Language Model with Flexibility and Interpretability. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.891

work page doi:10.18653/v1/2024.emnlp-main.891 2024

[21] [21]

What Matters in Evaluating Book-Length Stories? A Systematic Study of Long Story Evaluation

Yang, Dingyi and Jin, Qin. What Matters in Evaluating Book-Length Stories? A Systematic Study of Long Story Evaluation. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.799

work page doi:10.18653/v1/2025.acl-long.799 2025

[22] [22]

Are Large Language Models Capable of Generating Human-Level Narratives?

Tian, Yufei and Huang, Tenghao and Liu, Miri and Jiang, Derek and Spangher, Alexander and Chen, Muhao and May, Jonathan and Peng, Nanyun. Are Large Language Models Capable of Generating Human-Level Narratives?. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.978

work page doi:10.18653/v1/2024.emnlp-main.978 2024

[23] [23]

Can Large Language Models Be an Alternative to Human Evaluations?

Chiang, Cheng-Han and Lee, Hung-yi. Can Large Language Models Be an Alternative to Human Evaluations?. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. doi:10.18653/v1/2023.acl-long.870

work page doi:10.18653/v1/2023.acl-long.870 2023

[24] [24]

Automated Creativity Evaluation for Large Language Models: A Reference-Based Approach

Li, Ruizhe and Zhu, Chiwei and Xu, Benfeng and Wang, Xiaorui and Mao, Zhendong. Automated Creativity Evaluation for Large Language Models: A Reference-Based Approach. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.1171

work page doi:10.18653/v1/2025.findings-emnlp.1171 2025

[25] [25]

Igniting Creative Writing in Small Language Models: LLM -as-a-Judge versus Multi-Agent Refined Rewards

Wei, Xiaolong and Lu, Bo and Zhang, Xingyu and Zhao, Zhejun and Shen, Dongdong and Xia, Long and Yin, Dawei. Igniting Creative Writing in Small Language Models: LLM -as-a-Judge versus Multi-Agent Refined Rewards. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.868

work page doi:10.18653/v1/2025.emnlp-main.868 2025

[26] [26]

Curse of Knowledge: Your Guidance and Provided Knowledge are biasing LLM Judges in Complex Evaluation

Li, Weiyuan and Wang, Xintao and Yuan, Siyu and Xu, Rui and Chen, Jiangjie and Dong, Qingqing and Xiao, Yanghua and Yang, Deqing. Curse of Knowledge: Your Guidance and Provided Knowledge are biasing LLM Judges in Complex Evaluation. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.805

work page doi:10.18653/v1/2025.findings-emnlp.805 2025

[27] [27]

doi: 10.18653/v1/P18-1082

Fan, Angela and Lewis, Mike and Dauphin, Yann. Hierarchical Neural Story Generation. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018. doi:10.18653/v1/P18-1082

work page doi:10.18653/v1/p18-1082 2018