S2ED: From Story to Executable Descriptions for Consistency-Aware Story Illustration

Jiamou Liu; Qian Liu; Sijing Yin; Xiao Tang; Yaser Shakib

arxiv: 2605.22448 · v1 · pith:USLKGGAYnew · submitted 2026-05-21 · 💻 cs.AI

Pith reviewed 2026-05-22 04:52 UTC · model grok-4.3

keywords story illustration · multi-frame consistency · character fidelity · prompt engineering · multi-agent framework · text-to-image generation · narrative decomposition · training-free method

The pith

S2ED turns full stories into sequences of executable descriptions that carry character identity and state across illustrated frames without retraining the generator.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents S2ED as a training-free framework that decomposes a story into explicit prompts using three agents. One agent segments the narrative into frames, another fixes canonical character attributes, and the third adds spatial and affective details. These descriptions then propagate state through prompts while allowing local edits to fix any drift in identity, layout, or emotion. A sympathetic reader would care because standard text-to-image models often produce inconsistent characters and scenes when generating sequences for stories, limiting their use in coherent illustrated books or videos. If the approach works as described, it offers an interpretable way to achieve long-horizon consistency by editing prompts rather than retraining models.

Core claim

S2ED coordinates three agents to segment the narrative, ground canonical character attributes, and enrich spatial and affective cues, producing a sequence of explicit, editable executable descriptions that support prompt-carried state propagation and local drift repair for consistent multi-frame story illustration.

What carries the argument

Story-to-Executable Descriptions (S2ED), a prompt-layer framework with three agents that segment stories, ground character traits, and enrich cues to enable state propagation and local prompt edits across generated frames.
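To make the idea of prompt-carried state concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the class `ExecutableDescription`, the functions `segment`, `ground`, `enrich`, and `build`, the characters, and the default layout and affect strings are all hypothetical. Each agent is replaced by a trivial rule where the paper uses an LLM agent.

```python
from __future__ import annotations

from dataclasses import dataclass, replace


@dataclass
class ExecutableDescription:
    """One frame's explicit, editable state (hypothetical schema)."""
    frame: int
    caption: str
    characters: dict[str, str]  # character name -> canonical attributes
    layout: str
    affect: str

    def to_prompt(self) -> str:
        # Flatten the state into a single text prompt for a T2I model.
        cast = "; ".join(f"{n}: {a}" for n, a in sorted(self.characters.items()))
        return " | ".join(p for p in (self.caption, cast, self.layout, self.affect) if p)


def segment(story: str) -> list[str]:
    # Stand-in for the segmentation agent: one frame per sentence.
    return [s.strip() for s in story.split(".") if s.strip()]


def ground(caption: str, canon: dict[str, str]) -> dict[str, str]:
    # Stand-in for the grounding agent: attach canonical attributes
    # for every known character named in the caption.
    return {name: attrs for name, attrs in canon.items() if name in caption}


def enrich(prev: ExecutableDescription | None) -> tuple[str, str]:
    # Stand-in for the enrichment agent: carry layout and affect forward
    # from the previous description, or start from assumed defaults.
    if prev is None:
        return "wide shot", "neutral mood"
    return prev.layout, prev.affect


def build(story: str, canon: dict[str, str]) -> list[ExecutableDescription]:
    descriptions: list[ExecutableDescription] = []
    prev = None
    for i, caption in enumerate(segment(story), start=1):
        layout, affect = enrich(prev)
        prev = ExecutableDescription(i, caption, ground(caption, canon), layout, affect)
        descriptions.append(prev)
    return descriptions


# Made-up characters and story, used only for illustration.
canon = {"Ava": "red scarf, short black hair", "Bo": "green overalls"}
frames = build("Ava greets Bo. Bo laughs. Ava waves goodbye", canon)

# A local edit: change one field of frame 2 and leave the other frames untouched.
frames[1] = replace(frames[1], affect="joyful mood")
```

In this sketch the state lives entirely in plain data, so repairing drift in one frame is a one-field change followed by re-rendering that frame alone.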

If this is right

Sequence-level consistency and character fidelity improve over strong prompting, large-model planning, and training-based methods on Flintstones and Shakoo Maku under automatic metrics and human judgments.
Prompt-carried state propagation combined with local edits allows repair of inconsistencies without retraining the underlying image generator.
The resulting executable descriptions support deployment in end-to-end story-to-storybook pipelines for children's illustrated stories.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The modular agent approach could extend to video generation where temporal consistency across many frames is required.
Local editability might lower the cost of adapting story illustration systems to new styles or domains without full retraining.
Similar decomposition into executable state descriptions could apply to other multi-step creative tasks such as script-to-animation pipelines.

Load-bearing premise

The three agents accurately segment narratives, ground character attributes, and enrich cues so that prompt-carried state maintains persistent identity, layout, and affect without significant unrepairable drift.

What would settle it

Generate images from S2ED descriptions on a new story with complex interactions and measure whether character appearance or spatial layout drifts in ways that local prompt edits cannot restore to match the canonical attributes.
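A test of this kind could be scored with a simple per-frame drift report. The sketch below is a hypothetical illustration, not a procedure from the paper: it assumes some external tagger has already extracted a set of visible attributes per character from each rendered frame, and the function name `drift_report` and the attribute strings are invented for the example.

```python
def drift_report(
    canon: dict[str, set[str]],
    observed: list[dict[str, set[str]]],
) -> dict[int, dict[str, set[str]]]:
    """Map each 1-based frame index to the canonical attributes that are
    missing for each character appearing in that frame."""
    report: dict[int, dict[str, set[str]]] = {}
    for index, frame in enumerate(observed, start=1):
        missing = {
            name: canon[name] - attrs
            for name, attrs in frame.items()
            if name in canon and canon[name] - attrs
        }
        if missing:
            report[index] = missing
    return report


# Made-up example: frame 2 loses the scarf.
canon = {"Ava": {"red scarf", "short hair"}}
observed = [
    {"Ava": {"red scarf", "short hair"}},
    {"Ava": {"short hair", "blue hat"}},
]
print(drift_report(canon, observed))
```

Under these assumptions, drift would count as unrepairable when a frame still appears in the report after its prompt has been edited and the frame re-rendered.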

Figures

Figures reproduced from arXiv: 2605.22448 by Jiamou Liu, Qian Liu, Sijing Yin, Xiao Tang, Yaser Shakib.

**Figure 2.** S2ED workflow. Story sentences are segmented into captions, converted into structured states Z_i, and combined recursively with prior descriptions to generate Executable Descriptions p_i.

**Figure 3.** Overview of S2ED. The Narrative Segmenter produces frame-level captions, the Consistency Grounder extracts character and appearance attributes from the caption and global knowledge bases, and the Visual Enricher integrates layout and affect cues to produce Executable Descriptions for T2I rendering.

**Figure 4.** Qualitative overview of S2ED. Top: Results on the Flintstones dataset comparing S2ED with prompting baselines (Plain, TokenInject, Layout) and StoryDiffusion across multiple frames (1, 5, 10, 11). Bottom: Results on the Shakoo Maku dataset showing end-to-end story-to-image generation (Frames 1–4), comparing GPT-5, Gemini-2.5 Pro, and S2ED.

**Figure 5.** Representative failure cases in S2ED. (a) Multi-entity…

Abstract

Multi-frame story illustration requires long-horizon coherence beyond single-image text-to-image generation, including narrative decomposition and persistent character identity, layout, and affect across frames. We propose Story-to-Executable Descriptions (S2ED), a training-free, model-agnostic, prompt-layer framework that converts a full story into a sequence of explicit, editable executable descriptions for more consistent rendering. S2ED coordinates three agents to segment the narrative, ground canonical character attributes, and enrich spatial and affective cues, enabling interpretable prompt-carried state propagation and local edits to repair drift without retraining the generator. Experiments on Flintstones and Shakoo Maku show that S2ED improves sequence-level consistency and character fidelity over strong prompting, large-model planning, and a reference training-based method, under both automatic metrics and human judgments. We also deploy S2ED in an end-to-end story-to-storybook system for children's illustrated stories, with a supplementary video.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

S2ED gives a clean three-agent prompt pipeline for keeping story illustrations consistent across frames without training, but the experimental backing stays thin on agent reliability and ablations.

S2ED turns a full story into a sequence of editable executable descriptions by running three agents in sequence: one segments the narrative, one grounds canonical character traits, and one adds spatial and affective details. The state then travels in the prompts so the image generator can hold identity, layout, and affect with only local repairs when drift appears. That setup is the concrete new piece relative to plain long-context prompting or single large-model planners. It is training-free and model-agnostic, which matters for people who want to plug it into existing generators. The authors also built and demonstrated an end-to-end story-to-storybook system for children’s books, which shows the method is meant to be used rather than just described.

Referee Report

3 major / 2 minor

Summary. The paper proposes Story-to-Executable Descriptions (S2ED), a training-free and model-agnostic prompt-layer framework that decomposes a story into a sequence of explicit executable descriptions via three coordinated LLM agents: one for narrative segmentation, one for grounding canonical character attributes, and one for enriching spatial and affective cues. These descriptions enable prompt-carried state propagation with local edits to maintain consistency in character identity, layout, and affect across multiple frames. Experiments on the Flintstones and Shakoo Maku datasets report improvements in sequence-level consistency and character fidelity over strong prompting baselines, large-model planning approaches, and a reference training-based method, as measured by automatic metrics and human judgments; the framework is also deployed in an end-to-end story-to-storybook system.

Significance. If the empirical claims hold under rigorous verification, S2ED would offer a practical, interpretable alternative to retraining-based methods for long-horizon story illustration, emphasizing editable state propagation rather than end-to-end fine-tuning. The training-free and model-agnostic design could broaden applicability to various generators, and the explicit agent decomposition provides a clear mechanism for drift repair. However, the absence of agent-level quantitative metrics and ablations in the current presentation limits the ability to assess whether gains derive specifically from the proposed decomposition or from more structured prompting in general.

major comments (3)

[Abstract / Experiments] Abstract and Experiments section: the headline claim of improvements 'under both automatic metrics and human judgments' is load-bearing for the central contribution, yet the manuscript supplies no definitions of the automatic metrics, no error bars or statistical significance tests, no exclusion criteria for test stories, and no details on how human judgments were collected or scored. This prevents verification that the reported gains are robust rather than artifacts of metric choice or evaluation protocol.
[Method / §3] Method description (three-agent pipeline): the framework's success is predicated on the accuracy of the narrative segmentation, canonical attribute grounding, and cue enrichment agents in producing drift-resistant state. No quantitative agent-level metrics (e.g., segmentation F1, attribute grounding error rate, inter-agent consistency) or ablations isolating each agent's contribution are reported. Without these, it remains possible that observed gains stem primarily from longer, more structured prompts rather than the specific executable-description mechanism.
[Experiments] §4 (or equivalent experimental results): the comparison to 'a reference training-based method' lacks details on the training data, model size, fine-tuning procedure, and whether the baseline was given equivalent access to the same story-level information. This makes it difficult to determine whether S2ED's advantages are due to its training-free nature or to differences in information access and prompt engineering.
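As an illustration of the agent-level diagnostics requested above, segmentation quality could be scored as boundary F1 against a hand-annotated reference. This is a sketch of one common definition, not a metric reported in the paper; the function name `boundary_f1` and the convention of representing a segmentation as the set of sentence indices where a new frame starts are assumptions.

```python
def boundary_f1(predicted: set[int], gold: set[int]) -> tuple[float, float, float]:
    """Precision, recall, and F1 over frame-boundary positions.

    Each set holds the sentence indices at which a new frame begins.
    """
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Made-up example: two of three predicted boundaries match the annotation.
precision, recall, f1 = boundary_f1({2, 5, 7}, {2, 5, 9})
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Exact-match boundaries are strict; a tolerance window around each gold boundary is a common relaxation when annotators disagree by a sentence.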

minor comments (2)

[Abstract / Conclusion] The supplementary video is referenced but not described in the main text; a brief summary of its content and how it illustrates the local-edit repair process would improve accessibility.
[Method] Notation for the executable descriptions (e.g., how state is encoded and propagated between frames) should be formalized with a small example in the method section to aid reproducibility.
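One way such a formalization might look is sketched below. Only the symbols Z_i (structured state) and p_i (executable description) come from the paper's Figure 2 caption; the story S, captions c_i, knowledge base K, the operator names Seg, Ground, and Enrich, the generator G, and the images x_i are notation assumed here for illustration.

```latex
\begin{align}
  (c_1, \dots, c_T) &= \mathrm{Seg}(S), \\
  Z_i &= \mathrm{Ground}(c_i, \mathcal{K}), \\
  p_i &= \mathrm{Enrich}(Z_i, p_{i-1}), \qquad p_0 = \varnothing, \\
  x_i &= G(p_i), \qquad i = 1, \dots, T.
\end{align}
```

In this reading, a local edit replaces a single p_i with an edited p_i' and re-renders only x_i = G(p_i'), leaving the generator G unchanged.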

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful and constructive review. The points raised highlight important areas for improving the clarity and rigor of our experimental reporting. We address each major comment below and will incorporate the suggested changes in the revised manuscript.

Referee: [Abstract / Experiments] Abstract and Experiments section: the headline claim of improvements 'under both automatic metrics and human judgments' is load-bearing for the central contribution, yet the manuscript supplies no definitions of the automatic metrics, no error bars or statistical significance tests, no exclusion criteria for test stories, and no details on how human judgments were collected or scored. This prevents verification that the reported gains are robust rather than artifacts of metric choice or evaluation protocol.

Authors: We agree that the current presentation lacks sufficient detail on the evaluation protocol, which is necessary for independent verification. In the revised manuscript we will add explicit definitions and formulas for all automatic metrics, report standard deviations or error bars along with statistical significance tests (e.g., paired t-tests or Wilcoxon tests), state any exclusion criteria applied to the test stories, and provide a complete description of the human evaluation setup including participant count, scoring rubric, interface, and inter-rater agreement statistics.

Revision: yes

Referee: [Method / §3] Method description (three-agent pipeline): the framework's success is predicated on the accuracy of the narrative segmentation, canonical attribute grounding, and cue enrichment agents in producing drift-resistant state. No quantitative agent-level metrics (e.g., segmentation F1, attribute grounding error rate, inter-agent consistency) or ablations isolating each agent's contribution are reported. Without these, it remains possible that observed gains stem primarily from longer, more structured prompts rather than the specific executable-description mechanism.

Authors: The referee is correct that agent-level diagnostics would strengthen the causal link between the three-agent decomposition and the observed gains. We will add quantitative agent-level metrics obtained via manual annotation of a held-out subset (segmentation F1, attribute grounding precision/recall, and inter-agent consistency scores) together with ablation experiments that successively disable or simplify each agent while keeping prompt length comparable. These additions will appear in a new subsection of the experiments.

Revision: yes

Referee: [Experiments] §4 (or equivalent experimental results): the comparison to 'a reference training-based method' lacks details on the training data, model size, fine-tuning procedure, and whether the baseline was given equivalent access to the same story-level information. This makes it difficult to determine whether S2ED's advantages are due to its training-free nature or to differences in information access and prompt engineering.

Authors: We will expand the baseline description to include the exact training corpus, model architecture and parameter count, fine-tuning schedule and hyperparameters, and explicit confirmation that the training-based method received the full story text (identical to the input given to S2ED). This will allow readers to assess whether the performance difference is attributable to the training-free design or to unequal information access.

Revision: yes
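The paired significance testing promised in the first response can be illustrated with a small standard-library sketch. Note that this uses an exact paired sign-flip permutation test, swapped in here because it needs no third-party packages; it is not the paired t-test or Wilcoxon test the rebuttal names, and the function name and scores are made up for the example.

```python
from itertools import product
from statistics import mean


def paired_sign_flip_pvalue(scores_a: list[float], scores_b: list[float]) -> float:
    """Two-sided exact p-value for the mean paired difference.

    Under the null hypothesis the sign of each per-story difference is
    arbitrary, so every sign assignment is enumerated (2**n of them,
    which is practical only for small n).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(mean(diffs))
    at_least_as_extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(mean([s * d for s, d in zip(signs, diffs)]))
        # Small tolerance guards against floating-point ties.
        if flipped >= observed - 1e-12:
            at_least_as_extreme += 1
    return at_least_as_extreme / total


# Made-up per-story scores for two systems on the same four stories.
print(paired_sign_flip_pvalue([9, 8, 7, 9], [6, 5, 4, 6]))
```

With only four paired stories the smallest achievable two-sided p-value is 2/16, which shows why the number of test stories needs to be reported alongside any significance claim.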

Circularity Check

0 steps flagged

No significant circularity detected

The paper proposes a training-free framework S2ED that uses three external LLM agents to convert stories into executable descriptions for consistent illustration. Claims of improved sequence-level consistency and character fidelity rest on experimental comparisons against prompting baselines, large-model planning, and a training-based reference method on the Flintstones and Shakoo Maku datasets, using both automatic metrics and human judgments. No equations, fitted parameters, self-referential definitions, or load-bearing self-citations appear in the provided abstract or method outline. The derivation chain is self-contained because results are obtained via external evaluation rather than by construction from inputs defined inside the work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; the method description does not introduce new physical or mathematical constructs.

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages

[1] Yupeng Zhou, Daquan Zhou, Ming-Ming Cheng, Jiashi Feng, and Qibin Hou, “Storydiffusion: Consistent self-attention for long-range image and video generation,” in NeurIPS, 2024.

[2] Zhaohui Liang, Xiaoyu Zhang, Kevin Ma, Zhao Liu, Xipei Ren, Kosa Goucher-Lambert, and Can Liu, “Storydiffusion: How to support ux storyboarding with generative ai,” in ICMI, 2025.

[3] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman, “Dreambooth: Fine-tuning text-to-image diffusion models for subject-driven generation,” in CVPR, 2023.

[4] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen, “Lora: Low-rank adaptation of large language models,” in ICLR, 2022.

[5] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or, “An image is worth one word: Personalizing text-to-image generation using textual inversion,” in ICLR, 2023.

[6] Tim Brooks, Aleksander Holynski, and Alexei A. Efros, “Instructpix2pix: Learning to follow image editing instructions,” in CVPR, 2023.

[7] Huiguo He, Huan Yang, Zixi Tuo, Yuan Zhou, Qiuyue Wang, Yuhang Zhang, Zeyu Liu, Wenhao Huang, Hongyang Chao, and Jian Yin, “Dreamstory: Open-domain story visualization by llm-guided multi-subject consistent diffusion,” IEEE TPAMI, 2025.

[8] Jean-Baptiste Alayrac et al., “Flamingo: A visual language model for few-shot learning,” in NeurIPS, 2022.

[9] Qinghe Wang, Baolu Li, Xiaomin Li, Bing Cao, Liqian Ma, Huchuan Lu, and Xu Jia, “Characterfactory: Sampling consistent characters with gans for diffusion models,” IEEE TIP, 2025.

[10] Jihun Park, Kyoungmin Lee, Jongmin Gim, Hyeonseo Jo, Minseok Oh, Wonhyeok Choi, Kyumin Hwang, Jaeyeul Kim, Minwoo Choi, and Sunghoon Im, “Infinite-story: A training-free consistent text-to-image generation,” arXiv, 2025.

[11] Tao Liu, Kai Wang, Senmao Li, Joost van de Weijer, Fahad Shahbaz Khan, Shiqi Yang, Yaxing Wang, Jian Yang, and Ming-Ming Cheng, “One-prompt-one-story: Free-lunch consistent text-to-image generation using a single prompt,” arXiv, 2025.

[12] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao, “React: Synergizing reasoning and acting in language models,” in ICLR, 2023.

[13] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi, “Clipscore: A reference-free evaluation metric for image captioning,” in EMNLP, 2021.

[14] Tanzila Rahman, Hsin-Ying Lee, Jian Ren, Sergey Tulyakov, Shweta Mahajan, and Leonid Sigal, “Make-a-story: Visual memory conditioned consistent story generation,” in CVPR, 2023.

[15] Adyasha Maharana, Darryl Hannan, and Mohit Bansal, “Storydall-e: Adapting pretrained text-to-image transformers for story continuation,” in ECCV, 2022.

[16] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick, “Microsoft coco: Common objects in context,” in ECCV, 2014.

[17] Liunian Harold Li et al., “Grounded language-image pretraining,” in CVPR, 2022.

[18] Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee, “Gligen: Open-set grounded text-to-image generation,” in CVPR, 2023.
