From World-Gen to Quest-Line: A Dependency-Driven Prompt Pipeline for Coherent RPG Generation
Pith reviewed 2026-05-07 16:16 UTC · model grok-4.3
The pith
A dependency-driven prompt pipeline with structured JSON intermediates enables LLMs to generate coherent, scalable RPG content without narrative drift.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a dependency-aware, multi-stage prompt pipeline (running from world building through non-player character creation, player character creation, and campaign-level quest planning to quest expansion) conditions each stage on structured JSON outputs from the previous stages. By enforcing schemas and explicit data flow, the pipeline reduces narrative drift, limits hallucinations, and supports scalable creation of interconnected narrative elements. Qualitative human evaluation across multiple independent runs finds that outputs remain structurally complete, internally consistent, narratively coherent, diverse, and actionable, with no quality degradation as complexity increases. Separating high-level campaign planning from detailed quest expansion improves both global structure and local storytelling.
What carries the argument
The multi-stage prompt pipeline that models narrative dependencies through structured intermediate JSON representations between stages.
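The staged handoff can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation: `call_llm` is a stub (so the sketch runs offline), and the stage names merely echo the stages the paper lists.

```python
import json

def call_llm(prompt: str) -> str:
    """Stand-in for a chat-completion call that returns a JSON string.
    Stubbed here so the sketch runs without an API key."""
    return json.dumps({"summary": prompt[:40]})

# Stages in dependency order, as listed in the paper.
STAGES = [
    "world_building",
    "npc_creation",
    "pc_creation",
    "campaign_planning",
    "quest_expansion",
]

def run_pipeline() -> dict:
    """Each stage's prompt embeds the structured JSON produced by all
    earlier stages, so later stages condition on explicit state rather
    than on implicit conversational memory."""
    state: dict = {}
    for stage in STAGES:
        prompt = (
            f"Stage: {stage}\n"
            f"Context (JSON from earlier stages): {json.dumps(state)}\n"
            "Respond with a single JSON object."
        )
        state[stage] = json.loads(call_llm(prompt))
    return state

artifacts = run_pipeline()
print(list(artifacts))  # one JSON artifact per stage, in dependency order
```

The design point the sketch isolates: downstream stages see only what upstream stages serialized, so anything a later stage asserts about the world can be checked against an explicit artifact.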
If this is right
- The pipeline generates logically sound and structurally valid RPG content consistently across independent runs.
- Output quality does not degrade as the number of interconnected quests and characters increases.
- Separating high-level campaign planning from detailed quest expansion improves both global structure and local storytelling quality.
- The same dependency pattern supports scalable creation of any set of interconnected narrative elements.
Where Pith is reading between the lines
- The same staged JSON handoff approach could be applied to long-form fiction or branching interactive stories where state must be preserved across many scenes.
- Smaller or less capable models might handle complex narrative tasks more reliably when given explicit structured state rather than relying on implicit memory.
- The design could extend to procedural generation in other sequential domains such as automated world simulation or multi-step planning systems.
Load-bearing premise
Explicit schema enforcement and sequential JSON data flow are sufficient to prevent narrative drift and hallucinations in current LLMs across varied prompts and model versions.
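A minimal sketch of what schema enforcement at a stage boundary could look like, using only the standard library. `NPC_SCHEMA` and its field names are hypothetical, not the paper's actual schema; a real system might use a JSON Schema validator instead.

```python
import json

# Hypothetical schema for an NPC-creation stage: required keys and types.
NPC_SCHEMA = {"name": str, "faction": str, "motivation": str}

def validate_stage_output(raw: str, schema: dict) -> dict:
    """Parse a stage's JSON reply and enforce the schema before it is
    passed downstream; a failure here would trigger a retry prompt."""
    obj = json.loads(raw)
    for key, typ in schema.items():
        if key not in obj:
            raise ValueError(f"missing required field: {key}")
        if not isinstance(obj[key], typ):
            raise TypeError(f"field {key} should be {typ.__name__}")
    return obj

good = validate_stage_output(
    '{"name": "Mara", "faction": "Ashen Pact", "motivation": "revenge"}',
    NPC_SCHEMA,
)
print(good["name"])  # Mara
```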
What would settle it
Running the pipeline on several high-complexity multi-quest campaigns and finding repeated contradictions between character motivations or world facts that violate the JSON dependencies supplied to later stages.
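Part of such a test could be automated. A minimal sketch of a dependency check over hypothetical artifacts (the NPC and quest data below are invented for illustration), flagging a quest giver that no upstream stage ever defined:

```python
# Hypothetical artifacts: NPCs defined upstream, quests generated downstream.
npcs = [{"name": "Mara"}, {"name": "Toren"}]
quests = [
    {"title": "The Ashen Debt", "giver": "Mara"},
    {"title": "River of Glass", "giver": "Ilyra"},  # never defined upstream
]

def dependency_violations(npcs: list, quests: list) -> list:
    """Return titles of quests whose giver is absent from the upstream
    NPC artifact, i.e. a hallucinated entity breaking a JSON dependency."""
    known = {n["name"] for n in npcs}
    return [q["title"] for q in quests if q["giver"] not in known]

print(dependency_violations(npcs, quests))  # ['River of Glass']
```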
Original abstract
Large Language Models (LLMs) have shown strong potential for narrative generation, but their use in complex, multi-layered role-playing game (RPG) worlds is still limited by issues of coherence, controllability, and structural consistency. This paper explores a dependency-aware, multi-stage prompt pipeline for procedural RPG content generation that models narrative dependencies through structured intermediate representations. The approach decomposes generation into sequential stages: world building, non-player character creation, player character creation, campaign-level quest planning, and quest expansion. Each stage conditions on structured JSON outputs from previous stages. By enforcing schemas and explicit data flow, the pipeline reduces narrative drift, limits hallucinations, and supports scalable creation of interconnected narrative elements. The system is evaluated qualitatively through human-centered analysis across multiple independent runs. Outputs are assessed using criteria such as structural completeness, internal consistency, narrative coherence, diversity, and actionability. Results show that the pipeline consistently generates logically sound and structurally valid RPG content, without quality degradation as complexity increases. Separating high-level campaign planning from detailed quest expansion improves both global structure and local storytelling. These findings suggest that dependency-aware prompt pipelines with structured intermediate representations are an effective design pattern for LLM-based procedural content generation. This approach may also generalize to other domains requiring sequential reasoning over evolving contextual states.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a multi-stage, dependency-aware prompt pipeline for LLM-based RPG content generation. It decomposes the task into sequential stages (world building, NPC/PC creation, campaign-level quest planning, and quest expansion) where each stage conditions on structured JSON outputs from prior stages to enforce narrative dependencies, reduce drift, and improve coherence. Qualitative human evaluation across runs is used to assess structural completeness, internal consistency, coherence, diversity, and actionability, with the central claim that the pipeline produces logically sound content without quality degradation as complexity increases and that separating high-level planning from detailed expansion improves global and local quality.
Significance. If the results hold under more rigorous testing, the work offers a practical, reusable design pattern for structured LLM prompting in procedural narrative generation. The explicit use of intermediate JSON representations and staged decomposition addresses a known weakness in direct prompting for long-horizon tasks; this could generalize beyond RPGs to other sequential reasoning domains. The absence of quantitative metrics, baselines, or scaling experiments currently limits the strength of the evidence for the 'no degradation' and 'consistently' claims.
major comments (2)
- [Evaluation] Evaluation section (and abstract): The central claim that the pipeline 'consistently generates logically sound and structurally valid RPG content, without quality degradation as complexity increases' rests entirely on qualitative human review, yet no details are provided on the number of runs, number of raters, inter-rater agreement, specific complexity dimensions varied (e.g., number of quests, world size), or any baseline comparison to flat prompting. This leaves the claims of consistency and improvement unsupported by verifiable evidence.
- [Pipeline Design] Pipeline Design and Results: The assertion that explicit schema enforcement and sequential JSON data flow prevent narrative drift and hallucinations is presented as a key advantage, but no control condition or ablation (e.g., multi-stage without schemas vs. with schemas) is reported to isolate this factor from the multi-stage decomposition itself or from current LLM capabilities.
minor comments (2)
- [Abstract] The abstract lists five evaluation criteria but the manuscript does not clarify how each was operationalized during human review or whether any quantitative proxies (e.g., count of consistency violations) were collected.
- [Pipeline Design] Notation for the JSON schemas and data-flow dependencies could be made more precise (e.g., by including an explicit dependency graph or table of fields passed between stages) to aid reproducibility.
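One way to make the dependency-graph suggestion concrete is a stage-level input table plus a topological ordering check. The `STAGE_INPUTS` mapping below is an assumed reconstruction of the paper's data flow, not a table taken from the manuscript.

```python
# Hypothetical table of which upstream artifacts each stage consumes.
STAGE_INPUTS = {
    "world_building": [],
    "npc_creation": ["world_building"],
    "pc_creation": ["world_building", "npc_creation"],
    "campaign_planning": ["world_building", "npc_creation", "pc_creation"],
    "quest_expansion": ["campaign_planning", "npc_creation"],
}

def topological_order(deps: dict) -> list:
    """Return a stage order in which every stage follows all its inputs."""
    order, seen = [], set()
    def visit(node: str) -> None:
        if node in seen:
            return
        for dep in deps[node]:
            visit(dep)
        seen.add(node)
        order.append(node)
    for node in deps:
        visit(node)
    return order

order = topological_order(STAGE_INPUTS)
print(order)  # world_building first, quest_expansion last
```

Publishing such a table would also let readers verify that no stage depends on an artifact produced later, which is the structural property the pipeline's coherence argument rests on.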
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. The comments identify important opportunities to strengthen the presentation of our evaluation and the justification for specific design choices. We respond to each major comment below and describe the revisions we will undertake.
Point-by-point responses
Referee: [Evaluation] Evaluation section (and abstract): The central claim that the pipeline 'consistently generates logically sound and structurally valid RPG content, without quality degradation as complexity increases' rests entirely on qualitative human review, yet no details are provided on the number of runs, number of raters, inter-rater agreement, specific complexity dimensions varied (e.g., number of quests, world size), or any baseline comparison to flat prompting. This leaves the claims of consistency and improvement unsupported by verifiable evidence.
Authors: We agree that the evaluation section would be strengthened by providing the specific details the referee requests. The current manuscript describes the process at a high level as 'qualitative human-centered analysis across multiple independent runs' without enumerating the exact counts, rater configuration, agreement statistics, or complexity parameters tested. In the revised version we will expand the Evaluation section to report the number of independent runs performed, the number of raters, inter-rater agreement, the concrete complexity dimensions varied (world size and quest count), and a direct qualitative comparison against a flat-prompting baseline. We will also revise the abstract and results language to more precisely characterize the scope of the evidence while preserving the observed patterns that motivated the pipeline design.
Revision: yes
Referee: [Pipeline Design] Pipeline Design and Results: The assertion that explicit schema enforcement and sequential JSON data flow prevent narrative drift and hallucinations is presented as a key advantage, but no control condition or ablation (e.g., multi-stage without schemas vs. with schemas) is reported to isolate this factor from the multi-stage decomposition itself or from current LLM capabilities.
Authors: We acknowledge that the manuscript does not contain an ablation isolating the contribution of schema enforcement from the multi-stage structure alone. The pipeline is presented as an integrated design pattern whose benefits are illustrated through the generated outputs. In the revised manuscript we will add an explicit Limitations subsection discussing this point and will include a small-scale ablation comparing the full schema-enforced pipeline against a multi-stage variant that omits schema constraints. This addition will help readers assess the specific role of the structured JSON intermediates.
Revision: yes
Circularity Check
No circularity: empirical design pattern evaluated externally
full rationale
The paper describes a multi-stage prompt pipeline for RPG generation using structured JSON intermediates and schema enforcement. Its central claims rest on qualitative human assessment of outputs for coherence and lack of degradation with complexity, not on any derivation, equations, fitted parameters, or predictions that reduce to inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The approach is presented as an engineering pattern whose validity is judged against external criteria (human raters), making the evaluation independent of the pipeline's internal data flow. This matches the default case of a non-circular empirical contribution.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: LLMs can reliably follow and condition on structured JSON schemas provided in prompts.
Reference graph
Works this paper leans on
- [1] R. van der Linden, R. Lopes, and R. Bidarra, “Procedural generation of dungeons,” IEEE Transactions on Computational Intelligence and AI in Games, vol. 6, no. 1, pp. 78–89, 2014.
- [2] E. S. de Lima, B. Feijó, and A. L. Furtado, “Procedural generation of branching quests for games,” Entertainment Computing, vol. 43, p. 100491, 2022.
- [3] V. Breault, S. Ouellet, and J. Davies, “Let CONAN tell you a story: Procedural quest generation,” 2018.
- [4] S. Buongiorno, L. Klinkert, Z. Zhuang, T. Chawla, and C. Clark, “PANGeA: Procedural artificial narrative using generative AI for turn-based, role-playing video games,” Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, vol. 20, pp. 156–166, 2024.
- [5] M. J. C. Hendrikx, S. A. Meijer, J. van der Velden, and A. Iosup, “Procedural content generation for games: A survey,” ACM Transactions on Multimedia Computing, Communications, and Applications, vol. 9, no. 1, 2013.
- [6] M. U. Nasir, S. James, and J. Togelius, “Word2World: Generating stories and worlds through large language models,” arXiv preprint arXiv:2405.06686, 2024.
- [7] S. A. Alharthi, “Generative AI in game design: Enhancing creativity or constraining innovation?” Journal of Intelligence, vol. 13, no. 6, p. 60, 2025.
- [8] J. Bergdahl and S. Dahlskog, “GPT for games: A scoping review (2020–2023),” IEEE Transactions on Games, 2023.
- [9] X. Peng, J. Quaye, S. Rao, W. Xu, P. Botchway, and C. Brockett, “Generating role-playing game quests with GPT language models,” in Proceedings of the IEEE Conference on Games, 2023.
- [10] M. F. Maleki and R. Zhao, “Procedural content generation in games: A survey with insights on emerging LLM integration,” Proceedings of the AAAI Conference on Artificial Intelligence, 2024, arXiv:2410.15644.
- [11] E. S. de Lima, M. M. E. Neggers, B. Feijó, M. A. Casanova, and A. L. Furtado, “QuestVille: Procedural quest generation using NLP models,” Entertainment Computing, vol. 47, 2024.
- [12] M. U. Nasir and J. Togelius, “Practical PCG through large language models,” 2023.
- [13] L. Gisslén, A. Eakins, C. Gordillo, J. Bergdahl, and K. Tollmar, “Adversarial reinforcement learning for procedural content generation,” in 2021 IEEE Conference on Games (CoG), 2021, pp. 1–8.
- [14] J. Togelius, G. N. Yannakakis, K. O. Stanley, and C. Browne, “Search-based procedural content generation: A taxonomy and survey,” IEEE Transactions on Computational Intelligence and AI in Games, vol. 3, no. 3, pp. 172–186, 2011.
- [15] G. Yannakakis and J. Togelius, “Experience-driven procedural content generation,” IEEE Transactions on Affective Computing, vol. 2, pp. 147–161, 2011.
- [16] A. Summerville, S. Snodgrass, M. Guzdial, C. Holmgård, A. K. Hoover, A. Isaksen, A. Nealen, and J. Togelius, “Procedural content generation via machine learning (PCGML),” IEEE Transactions on Games, vol. 10, no. 3, pp. 257–270, 2018.
- [17] E. Soares de Lima, B. Feijó, and A. L. Furtado, “Procedural generation of quests for games using genetic algorithms and automated planning,” in 2019 18th Brazilian Symposium on Computer Games and Digital Entertainment (SBGames), 2019, pp. 144–153.
- [18] C. Hu, Y. Zhao, and J. Liu, “Game generation via large language models,” in 2024 IEEE Conference on Games (CoG), 2024, pp. 1–4.
- [19] J. Yuan, J. Zhang, A. Wen, and X. Hu, “The science of evaluating foundation models,” arXiv preprint, 2025.
- [20] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “BLEU: A method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311–318.
- [21] C.-Y. Lin, “ROUGE: A package for automatic evaluation of summaries,” in Text Summarization Branches Out, 2004, pp. 74–81.
- [22] J. Novikova, O. Dušek, A. C. Curry, and V. Rieser, “Why we need new evaluation metrics for NLG,” in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 2241–2252.
- [23] A. Celikyilmaz, E. Clark, and J. Gao, “Evaluation of text generation: A survey,” arXiv preprint arXiv:2006.14799, 2020.
- [24] C. Van Der Lee, A. Gatt, E. Van Miltenburg, S. Wubben, and E. Krahmer, “Best practices for the human evaluation of automatically generated text,” in Proceedings of the 12th International Conference on Natural Language Generation, 2019, pp. 355–368.
- [25] R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill et al., “On the opportunities and risks of foundation models,” arXiv preprint arXiv:2108.07258, 2021.
- [26] P. Liang, R. Bommasani, T. Lee, D. Tsipras, D. Soylu, M. Yasunaga, Y. Zhang, D. Narayanan, Y. Wu, A. Kumar et al., “Holistic evaluation of language models,” arXiv preprint arXiv:2211.09110, 2022.