NARRA-Gym for Evaluating Interactive Narrative Agents
Recognition: 2 theorem links
Pith reviewed 2026-05-12 02:14 UTC · model grok-4.3
The pith
NARRA-Gym logs full model trajectories to test language models on interactive story adaptation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
NARRA-Gym is an executable evaluation environment that turns a sparse emotional seed into a complete interactive story episode and logs the full model-in-the-loop trajectory, including story construction, memory updates, planning, pacing interventions, and optional artifact synthesis. Applied to nine frontier LLMs through a controlled LLM-as-judge sweep over eight benchmark personas, plus human ratings of customized outputs, it reveals substantial variation across models, personas, and dimensions such as robustness, user experience, and resistance-sensitive personalization.
What carries the argument
NARRA-Gym, the executable evaluation environment that converts a sparse emotional seed into a full interactive story episode while logging the complete model trajectory of story construction, memory updates, planning, pacing interventions, and artifact synthesis.
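As a minimal sketch of what such an environment could look like: the class names, phase labels, and the `model`/`user` callables below are hypothetical illustrations under our own assumptions, not NARRA-Gym's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class TrajectoryEvent:
    turn: int
    phase: str      # e.g. "story", "memory_update", "plan", "pacing", "artifact"
    payload: str

@dataclass
class EpisodeLog:
    seed: str                                        # the sparse emotional seed
    events: list[TrajectoryEvent] = field(default_factory=list)

    def record(self, turn: int, phase: str, payload: str) -> None:
        self.events.append(TrajectoryEvent(turn, phase, payload))

def run_episode(seed: str,
                model: Callable[[str], str],
                user: Callable[[str], str],
                n_turns: int = 8) -> EpisodeLog:
    """Expand a seed into a multi-turn episode, logging every phase."""
    log = EpisodeLog(seed=seed)
    story = model(f"Open a story from this emotional seed: {seed}")
    log.record(0, "story", story)
    for turn in range(1, n_turns + 1):
        reply = user(story)                          # simulated persona turn
        log.record(turn, "user", reply)
        memory = model(f"Update story memory given the user turn: {reply}")
        log.record(turn, "memory_update", memory)
        plan = model(f"Plan the next story beat given memory: {memory}")
        log.record(turn, "plan", plan)
        story = model(f"Continue the story following the plan: {plan}")
        log.record(turn, "story", story)
    return log
```

The design point is that every component the abstract names becomes a phase-tagged event in one ordered log, so downstream analysis can slice the trajectory by component rather than re-parsing transcripts.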
If this is right
- Models that produce fluent single-turn stories can still fail when required to sustain coherence and adapt across multiple user turns.
- Full trajectory logging makes it possible to isolate specific failures in memory handling, pacing control, or personalization (a toy illustration follows this list).
- Performance differences appear across personas, indicating that interactive narrative tests must account for varied user types.
- Interactive narrative provides a stronger test of long-horizon adaptive behavior than isolated story generation benchmarks.
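A toy illustration of the failure-isolation point above: with phase-tagged events (shapes and checks invented for this sketch, mirroring the earlier one), a harness can pinpoint the turn where a story contradicts committed memory.

```python
events = [  # hypothetical phase-tagged trajectory (dicts for brevity)
    {"turn": 3, "phase": "memory_update", "payload": "protagonist_name=Mara"},
    {"turn": 4, "phase": "story", "payload": "Mara paused at the door."},
    {"turn": 5, "phase": "story", "payload": "Maria walked out into the rain."},
]

def phase_events(log, phase):
    """Slice the trajectory down to one component, e.g. memory updates."""
    return [e for e in log if e["phase"] == phase]

# Localize a memory failure: does every later story turn honor the last
# name committed to memory?
name = phase_events(events, "memory_update")[-1]["payload"].split("=")[1]
for e in phase_events(events, "story"):
    if name not in e["payload"]:
        print(f"turn {e['turn']}: story drifts from memory ({name!r} missing)")
# -> turn 5: story drifts from memory ('Mara' missing)
```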
Where Pith is reading between the lines
- The logged trajectories could serve as diagnostic data to identify and fix model weaknesses in sustained interactions.
- Similar executable environments could be built for other long-horizon tasks that require ongoing adaptation, such as collaborative problem solving.
- The setup allows future tests to measure how well automated judges match live user outcomes in interactive settings.
Load-bearing premise
That LLM-as-judge ratings and human participant scores on the logged trajectories reliably capture robustness, user experience, and resistance-sensitive personalization without significant bias or misalignment with actual user perception.
What would settle it
A direct comparison study in which real users interact live with the same models in the benchmark scenarios and their satisfaction ratings plus perceived coherence are checked against the existing LLM-judge and participant scores for alignment.
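Operationally, that settling study reduces to an agreement check between paired per-episode scores. A minimal sketch with placeholder numbers; only the statistics are standard, the arrays are invented.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

judge = np.array([4.1, 3.2, 4.8, 2.9, 3.7, 4.4])  # LLM-as-judge, per episode
users = np.array([4.0, 3.5, 4.6, 2.4, 3.9, 4.5])  # live-user satisfaction

r, p_r = pearsonr(judge, users)        # linear agreement
rho, p_rho = spearmanr(judge, users)   # rank agreement (robust to scale drift)
mae = np.abs(judge - users).mean()     # absolute miscalibration

print(f"Pearson r = {r:.2f} (p = {p_r:.3f})")
print(f"Spearman rho = {rho:.2f} (p = {p_rho:.3f})")
print(f"MAE = {mae:.2f} rating points")
```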
Original abstract
Interactive narrative tasks require LLMs to sustain a coherent, evolving story while adapting to a user over multiple turns. However, suitable benchmarks for this setting are limited: existing evaluations often focus on static prompts, isolated story generations, or post-hoc ratings, and therefore miss whether models can jointly manage story generation, long-context state and pacing, character simulation, empathic personalization, and story-grounded artifacts. We introduce NARRA-Gym, an executable evaluation environment that turns a sparse emotional seed into a complete interactive story episode and logs the full model-in-the-loop trajectory, including story construction, memory updates, planning, pacing interventions, and optional artifact synthesis. We evaluate nine frontier LLMs using a controlled LLM-as-judge sweep over eight benchmark personas and a human evaluation in which participants rate customized model outputs. Our results show substantial variation across models, personas, and evaluation dimensions: models that produce fluent stories can still fail on robustness, user experience, or resistance-sensitive personalization. These findings suggest that interactive narrative offers a useful benchmark for evaluating long-horizon, user-adaptive LLM behavior beyond isolated story quality.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces NARRA-Gym, an executable evaluation environment that converts sparse emotional seeds into complete interactive story episodes and logs full model-in-the-loop trajectories covering story construction, memory updates, planning, pacing interventions, and optional artifact synthesis. It evaluates nine frontier LLMs via a controlled LLM-as-judge sweep across eight benchmark personas plus a human evaluation in which participants rate customized outputs, reporting substantial variation across models, personas, and dimensions such as robustness, user experience, and resistance-sensitive personalization.
Significance. If the environment faithfully implements the claimed components and the evaluations prove reliable, NARRA-Gym would provide a useful benchmark for long-horizon, user-adaptive LLM behavior in interactive narrative settings. The explicit logging of trajectories and the contrast with static or post-hoc story evaluations are strengths that could help identify gaps in current models beyond isolated generation quality.
major comments (2)
- [Evaluation] The manuscript reports 'substantial variation' across models and personas but supplies no details on the precise metrics used for each dimension, any statistical tests or significance thresholds, inter-rater agreement for the human ratings, or the exact prompting template and validation procedure applied to the LLM-as-judge. These omissions make it impossible to determine whether the observed differences are robust or could be artifacts of the evaluation design.
- [Results] Without reported quantitative values, confidence intervals, or ablation checks on the LLM judge, the central claim that models fluent in story generation still fail on robustness or personalization cannot be fully assessed for reliability.
minor comments (2)
- [Abstract] The abstract and introduction would benefit from a short statement of typical episode length or number of interaction turns to give readers a concrete sense of the horizon being evaluated.
- [Figures] Figure captions describing the logged trajectory components could be expanded to clarify which elements are automatically recorded versus optionally synthesized.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive assessment of NARRA-Gym's potential as a benchmark. We address each major comment below and have revised the manuscript to provide the requested details on evaluation methodology and results.
Point-by-point responses
- Referee: [Evaluation] The manuscript reports 'substantial variation' across models and personas but supplies no details on the precise metrics used for each dimension, any statistical tests or significance thresholds, inter-rater agreement for the human ratings, or the exact prompting template and validation procedure applied to the LLM-as-judge. These omissions make it impossible to determine whether the observed differences are robust or could be artifacts of the evaluation design.
  Authors: We agree that additional methodological detail is required for reproducibility and to confirm robustness. In the revised manuscript we have expanded the Evaluation section with: explicit scoring rubrics and scales for each dimension (e.g., robustness defined via consistency under controlled user perturbations); the statistical tests performed (ANOVA with Tukey post-hoc tests at a p < 0.05 threshold); inter-rater agreement for human ratings (Fleiss' kappa); and the complete LLM-as-judge prompt template plus its validation procedure against a human-annotated subset. These additions allow readers to assess whether the reported variations are reliable. Revision: yes.
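The statistical machinery named in this response is standard. As a minimal sketch, assuming toy score arrays rather than the paper's data: one-way ANOVA across models, Tukey HSD post-hoc comparisons, and Fleiss' kappa over the human raters.

```python
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.stats.inter_rater import fleiss_kappa, aggregate_raters

# Robustness scores for three models over the same episodes (toy numbers).
model_a = [4.2, 4.0, 3.8, 4.1]
model_b = [3.1, 3.4, 2.9, 3.2]
model_c = [3.9, 4.1, 4.0, 3.7]

f_stat, p_value = f_oneway(model_a, model_b, model_c)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.4f}")   # test at p < 0.05

scores = np.concatenate([model_a, model_b, model_c])
labels = ["A"] * 4 + ["B"] * 4 + ["C"] * 4
print(pairwise_tukeyhsd(scores, labels, alpha=0.05))   # which pairs differ

# Fleiss' kappa: rows are items, columns are raters, cells are rating labels.
ratings = np.array([[3, 3, 4], [2, 2, 2], [4, 4, 3], [1, 2, 1]])
table, _ = aggregate_raters(ratings)                   # item x category counts
print(f"Fleiss' kappa = {fleiss_kappa(table):.2f}")
```

Tukey HSD identifies which model pairs drive a significant ANOVA result, and Fleiss' kappa is the natural agreement statistic when more than two human raters score the same items.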
- Referee: [Results] Without reported quantitative values, confidence intervals, or ablation checks on the LLM judge, the central claim that models fluent in story generation still fail on robustness or personalization cannot be fully assessed for reliability.
  Authors: We accept that the original results section was insufficiently quantitative. The revised manuscript now includes tables with mean scores and 95% confidence intervals across models, personas, and dimensions. We have also added an ablation comparing the LLM judge to human ratings on a 100-episode subset, reporting correlation and error metrics that support the reliability of the judge and the central claim that fluency does not guarantee robustness or personalization. Revision: yes.
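A hedged sketch of the reporting described here, with synthetic arrays standing in for the paper's tables: a t-based 95% confidence interval per model-by-dimension cell, plus a judge-versus-human agreement check on a validation subset.

```python
import numpy as np
from scipy import stats

def mean_ci(scores, level=0.95):
    """(mean, lower, upper) via the t distribution, for one table cell."""
    m, sem = np.mean(scores), stats.sem(scores)
    lo, hi = stats.t.interval(level, len(scores) - 1, loc=m, scale=sem)
    return m, lo, hi

cell = np.array([3.8, 4.1, 3.6, 4.0, 3.9])   # one model x dimension cell
print("mean %.2f, 95%% CI [%.2f, %.2f]" % mean_ci(cell))

# Judge-vs-human ablation on a 100-episode subset (synthetic stand-in data).
rng = np.random.default_rng(0)
human = rng.normal(3.5, 0.6, size=100)
judge = human + rng.normal(0.0, 0.3, size=100)        # a well-aligned judge
r, _ = stats.pearsonr(human, judge)
print(f"judge-human r = {r:.2f}, MAE = {np.abs(human - judge).mean():.2f}")
```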
Circularity Check
No circularity: empirical benchmark introduction with no derivations
Full rationale
The paper presents NARRA-Gym as a new executable evaluation environment for interactive narrative tasks, along with empirical results from LLM-as-judge sweeps and human evaluations across models and personas. No mathematical derivations, equations, fitted parameters, predictions, or first-principles results are claimed. The central contribution is the environment construction and logged trajectories themselves, which do not reduce to any self-referential inputs or prior self-citations in a load-bearing way. All evaluation dimensions are explicitly defined and measured independently.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: an LLM-as-judge can produce reliable scores for narrative coherence, adaptation, and personalization when given the logged trajectory (an illustrative judge call follows this list)
- domain assumption: human participants can meaningfully rate customized model outputs on the same dimensions
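To make the first assumption concrete, here is an illustrative shape for a judge scoring call. The rubric wording, dimension names, and the `complete` callable are hypothetical stand-ins; the paper's actual judge template is not reproduced here.

```python
import json

# Hypothetical rubric, not NARRA-Gym's real prompt template.
JUDGE_PROMPT = """You are rating one logged episode trajectory.
Score each dimension from 1 (poor) to 5 (excellent):
- coherence: does the story stay consistent across turns?
- adaptation: does the model respond to the user's signals?
- personalization: does it respect stated resistance and preferences?
Trajectory:
{trajectory}
Answer as JSON: {{"coherence": n, "adaptation": n, "personalization": n}}"""

def judge_episode(trajectory: str, complete) -> dict:
    """`complete` is any text-in/text-out LLM call (hypothetical stub)."""
    return json.loads(complete(JUDGE_PROMPT.format(trajectory=trajectory)))
```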
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear · "We introduce NARRA-Gym, an executable evaluation environment that turns a sparse emotional seed into a complete interactive story episode and logs the full model-in-the-loop trajectory, including story construction, memory updates, planning, pacing interventions, and optional artifact synthesis."
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear · "models that produce fluent stories can still fail on robustness, user experience, or resistance-sensitive personalization"