FreeStory: Training-Free Character Consistency for Free-Form Visual Storytelling

Ismail Shaheen; Sarah Adel Bargal; Sibo Dong

arxiv: 2606.25079 · v1 · pith:TLZ4SHNFnew · submitted 2026-06-23 · 💻 cs.CV

Sibo Dong, Ismail Shaheen, Sarah Adel Bargal

Pith reviewed 2026-06-26 00:08 UTC · model grok-4.3

classification 💻 cs.CV

keywords visual storytelling · character consistency · training-free methods · diffusion models · free-form prompts · entity grounding · attention feature reuse

The pith

FreeStory maintains character consistency in visual storytelling under free-form prompts by entity-grounded feature reuse without training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Visual storytelling requires generating image sequences that follow a narrative while keeping the same characters looking identical across frames. Prior training-free methods achieve this only by repeating full character descriptions in every prompt, an unnatural constraint that does not match how stories are usually told. FreeStory instead treats later references such as pronouns or type names as entities that must be linked back to the original description. It does so through dynamic masks, correspondence-aware feature matching, key-value injection, and query blending inside the diffusion process. The result is tested on both existing structured benchmarks and a new FreeStoryBench dataset covering single- and multi-character free-form stories.

Core claim

Character consistency under free-form prompts can be achieved by reformulating the task as entity-grounded feature reuse: reference mentions are associated with their initial character descriptions, after which dynamic character masks, correspondence-aware feature matching, key-value injection, and query blending are combined to preserve identity while retaining generation diversity.
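As a rough illustration of two of the named components, the sketch below applies key-value injection and query blending to a single attention query in plain Python. Everything in it is an assumption made for illustration: the function names, the linear blending rule with weight `alpha`, and the choice to append mask-selected reference tokens to the target keys and values are not taken from the paper, whose exact formulation is not given on this page.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def inject_kv(tgt_keys, tgt_values, ref_keys, ref_values, ref_mask):
    """Append reference keys/values, keeping only tokens inside the character mask."""
    keep = [i for i, inside in enumerate(ref_mask) if inside]
    keys = tgt_keys + [ref_keys[i] for i in keep]
    values = tgt_values + [ref_values[i] for i in keep]
    return keys, values


def blend_query(tgt_query, ref_query, alpha):
    """Hypothetical linear blend of a target query with its matched reference query."""
    return [(1 - alpha) * t + alpha * r for t, r in zip(tgt_query, ref_query)]


# Toy tensors: two target tokens, two reference tokens, one of them on the character.
tgt_keys = [[1.0, 0.0], [0.0, 1.0]]
tgt_values = [[1.0, 0.0], [0.0, 1.0]]
ref_keys = [[1.0, 1.0], [5.0, 5.0]]
ref_values = [[9.0, 9.0], [7.0, 7.0]]
ref_mask = [True, False]  # only the first reference token belongs to the character

keys, values = inject_kv(tgt_keys, tgt_values, ref_keys, ref_values, ref_mask)
query = blend_query([1.0, 0.0], [0.0, 1.0], alpha=0.5)
output = attend(query, keys, values)
```

With `alpha=0` the blend leaves the target query untouched, so the blending weight acts as a dial between free generation and copying the reference layout; how the paper sets or schedules that trade-off is not stated here.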

What carries the argument

Entity-grounded feature reuse, which links prompt references to character descriptions and selectively reuses attention features through masks, matching, injection, and blending.

If this is right

Character appearance remains consistent even when prompts introduce a character once and later refer to it indirectly.
Generation diversity is retained while identity preservation improves over prior training-free baselines.
A new benchmark enables direct measurement of consistency on both single- and multi-character free-form stories.
State-of-the-art consistency among training-free methods is reached on both structured and free-form prompt sets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The association step could be extended to handle longer narratives with multiple ambiguous references.
Similar selective feature reuse might reduce repetition needs in other text-to-image or text-to-video pipelines.
If the linking step scales, prompting interfaces for story generation could shift away from exhaustive repeated descriptions.

Load-bearing premise

The method assumes that reference mentions in free-form prompts can be reliably associated with their corresponding character descriptions without training or external supervision.

What would settle it

Running the method on free-form prompts that use only pronouns or short references and observing visibly inconsistent character appearances across the generated image sequence would falsify the consistency claim.
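To make the association step concrete, here is a deliberately simple, rule-based linker in Python. It is not the paper's grounding module (Figure 8 indicates the authors rely on Stanza); the `ground_mentions` function, the `type` and `gender` fields, and the pronoun table are all hypothetical. The point is only to show what a mention-to-character link looks like and where ambiguity enters.

```python
import re

# Hypothetical pronoun table; a real system would use a coreference model.
PRONOUNS = {"she": "female", "her": "female", "he": "male", "him": "male", "his": "male"}


def ground_mentions(characters, prompt):
    """Return (mention, char_id) pairs found in a referring prompt.

    characters: list of dicts with hypothetical keys 'id', 'type', 'gender'.
    A pronoun is linked only when exactly one character fits; otherwise it
    is reported with char_id None, i.e. left unresolved.
    """
    links = []
    for token in re.findall(r"[A-Za-z]+", prompt.lower()):
        by_type = [c for c in characters if c["type"] == token]
        if by_type:
            links.append((token, by_type[0]["id"]))
            continue
        gender = PRONOUNS.get(token)
        if gender is not None:
            candidates = [c for c in characters if c["gender"] == gender]
            char_id = candidates[0]["id"] if len(candidates) == 1 else None
            links.append((token, char_id))
    return links


characters = [
    {"id": "c1", "type": "archer", "gender": "female"},
    {"id": "c2", "type": "wolf", "gender": "male"},
]
links = ground_mentions(
    characters,
    "She begins to walk down the path, and the wolf follows closely behind her",
)

# Two female characters make "she" ambiguous, so the link stays unresolved.
ambiguous = ground_mentions(
    [
        {"id": "c1", "type": "archer", "gender": "female"},
        {"id": "c2", "type": "sprite", "gender": "female"},
    ],
    "She waves",
)
```

The second call shows the failure surface the premise above worries about: once two characters fit the same pronoun, a rule like this has nothing to decide with, and any downstream feature reuse would inherit that uncertainty.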

Figures

Figures reproduced from arXiv: 2606.25079 by Ismail Shaheen, Sarah Adel Bargal, Sibo Dong.

**Figure 1.** Multi-character free-form story generated by FreeStory. Characters are introduced once with full descriptions and later referred to using shorter mentions (e.g., boy, golden retriever). … the character description appearing at the beginning of every prompt. This strict format simplifies character grounding and allows feature reuse methods to directly utilize character description. However, this assumption d…

**Figure 2.** Overview of our proposed FreeStory framework. Given a character-defining prompt $P_1$, the model generates the reference image $I_1$ and uses entity grounding to associate the character description $\tau^{(1,j)}$ with reference mentions $\tau^{(k,j)}$ in referring prompt $P_k$ for character $c^{(j)}$. During generation of $I_1$, we extract cross-attention weights to compute the dynamic mask $\tilde{M}^{(1,j)}_t$ and store the corresponding key, …

**Figure 3.** Qualitative results on ConsiStory+ dataset. Independent generation using SDXL and FLUX.1 fails to preserve character identity. Prior storytelling methods partially improve consistency but still suffer from identity drift. FLUX.1-Kontext achieves strong appearance similarity but exhibits copy-paste artifacts, with limited pose and background diversity. Our method preserves character identity while maintain…

**Figure 4.** Mean IoU between attention-derived masks and Grounded SAM segmentation across diffusion timesteps. To better understand mask quality, we analyze the localization accuracy of attention-derived masks across diffusion timesteps. Specifically, we generate images and obtain ground-truth character masks using Grounded SAM [21]. We then compute the Intersection-over-Union (IoU) between the attention-derived ma…

**Figure 5.** Comparison of background removal using CarveKit and Grounded SAM. The examples illustrate three typical failure modes of CarveKit. Previous works commonly adopt CarveKit for background removal during evaluation. However, we find that it is not sufficiently robust for our setting, particularly in images containing multiple characters or object interactions. Therefore, we instead employ Grounded SAM to ob…

**Figure 6.** Qualitative ablation results. Removing query blending reduces character consistency, …

**Figure 7.** Qualitative results on the FreeStoryBench dataset under the Type setting. Blue text …

**Figure 8.** Grounding and complex-interaction examples. (a) Entity-grounding failure: Stanza incorrectly identifies her wings as the second character rather than the forest sprite; consequently, only the fox remains consistent. (b) A successful case involving overlapping or occluded character interactions. (c) A failure case in which inaccurate attention masks lead to localization or consistency errors. We further sho…

**Figure 9.** Qualitative comparison results on ConsiStory+ dataset.

**Figure 10.** Qualitative comparison results on ConsiStory+ dataset.

**Figure 11.** Qualitative comparison results on ConsiStory+ dataset.

**Figure 12.** Qualitative comparison results on ConsiStory+ dataset.
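The captions for Figures 2 and 4 above describe turning cross-attention weights into a character mask and scoring that mask against a segmentation by IoU, averaged per diffusion timestep. A minimal Python sketch of that evaluation loop follows. The thresholding rule is an assumption: Otsu's method appears in the paper's reference list, but this page does not say how the masks are binarized. The attention maps, the timestep labels, and the function names are invented toy values, not data from the paper.

```python
def otsu_threshold(values):
    """Pick the threshold that maximizes between-class variance (Otsu's criterion)."""
    candidates = sorted(set(values))
    best_t, best_var = candidates[0], -1.0
    n = len(values)
    for t in candidates[:-1]:
        low = [v for v in values if v <= t]
        high = [v for v in values if v > t]
        w0, w1 = len(low) / n, len(high) / n
        mu0, mu1 = sum(low) / len(low), sum(high) / len(high)
        between = w0 * w1 * (mu0 - mu1) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t


def attention_to_mask(attn_map):
    """Min-max normalize a 2D attention map, then binarize it with Otsu's threshold."""
    flat = [v for row in attn_map for v in row]
    lo, hi = min(flat), max(flat)
    norm = [[(v - lo) / (hi - lo) if hi > lo else 0.0 for v in row] for row in attn_map]
    t = otsu_threshold([v for row in norm for v in row])
    return [[v > t for v in row] for row in norm]


def iou(mask_a, mask_b):
    """Intersection-over-Union of two boolean masks with the same shape."""
    inter = union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            inter += int(a and b)
            union += int(a or b)
    return inter / union if union else 1.0


def mean_iou_by_timestep(samples):
    """samples: {timestep: [(attn_map, gt_mask), ...]} -> {timestep: mean IoU}."""
    return {
        t: sum(iou(attention_to_mask(a), gt) for a, gt in pairs) / len(pairs)
        for t, pairs in samples.items()
    }


# Toy ground-truth mask (stand-in for a segmentation output) and two toy attention maps.
gt = [[False, False, False], [False, True, True], [False, True, True]]
diffuse = [[0.30, 0.32, 0.31], [0.29, 0.60, 0.33], [0.30, 0.58, 0.61]]
sharp = [[0.05, 0.10, 0.08], [0.07, 0.90, 0.85], [0.06, 0.88, 0.92]]

scores = mean_iou_by_timestep({"t_a": [(diffuse, gt)], "t_b": [(sharp, gt)]})
```

On these toy inputs the diffuse map misses one of the four character cells and the sharp map recovers all of them, which is the kind of difference a per-timestep mean IoU curve is meant to expose.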

Original abstract

Visual storytelling aims to generate image sequences that are both aligned with narrative prompts and consistent in character appearance across images. Recent training-free methods improve character consistency by reusing attention features, but rely on structured prompts where full character descriptions are repeated in every prompt. This assumption simplifies the task but deviates from natural storytelling, where characters are typically introduced once and later referred to using pronouns or type-based expressions. We propose FreeStory, a training-free framework that reformulates character consistency under free-form prompts as entity-grounded feature reuse. Our method associates reference mentions with their corresponding character descriptions and combines dynamic character masks, correspondence-aware feature matching, key-value injection, and query blending to preserve identity while retaining generation diversity. We also introduce FreeStoryBench, a benchmark for this setting that includes both single- and multi-character stories. Experiments show that FreeStory achieves state-of-the-art performance among training-free methods on structured benchmarks and stronger overall consistency over baselines under free-form prompts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

FreeStory extends training-free consistency methods to free-form prompts with pronouns and type references, but the unmeasured association step between mentions and characters is a central untested piece.

The paper's actual advance is dropping the repeated full-description requirement that earlier attention-reuse methods needed. It introduces entity-grounded feature reuse plus dynamic masks, correspondence-aware matching, KV injection, and query blending to handle natural storytelling language, and it ships FreeStoryBench covering single- and multi-character free-form cases.

That matches a real usage gap. Most prior training-free work stayed inside structured prompts, so the benchmark and the reformulation are the parts that feel new and directly useful.

The soft spot is the association step itself. Everything downstream depends on correctly linking later mentions back to the initial character descriptions without any training or external help. The abstract and claims give no accuracy figures, no error analysis on ambiguous pronouns or multi-character scenes, and no ablation showing what happens when the matching is wrong. If that module is noisy, the reported consistency gains on the new benchmark become hard to interpret.

The math and method description look internally consistent on paper, with no obvious circularity or invented quantities. The experiments are presented as showing gains over baselines, but without the association numbers the central claim rests on an assumption that is easy to violate in practice.

This is for groups already working on training-free diffusion consistency or visual storytelling pipelines. A reader who needs to move beyond repetitive prompts will find the benchmark and the high-level approach worth looking at. It is coherent enough on its own terms to deserve a serious referee who can check the missing measurements and see whether the association holds up under realistic prompt variation.

Referee Report

2 major / 2 minor

Summary. The paper proposes FreeStory, a training-free framework for maintaining character consistency in visual storytelling under free-form prompts (where characters are introduced once and later referenced via pronouns or type expressions). It reformulates consistency as entity-grounded feature reuse via mention-to-description association, dynamic character masks, correspondence-aware feature matching, key-value injection, and query blending. The work also introduces FreeStoryBench (covering single- and multi-character stories) and claims state-of-the-art performance among training-free methods on structured benchmarks plus stronger overall consistency under free-form prompts.

Significance. If the core association step proves reliable, the approach would meaningfully extend training-free consistency techniques beyond the restrictive structured-prompt regime used by prior work, supporting more natural narrative generation. The new benchmark is a constructive addition for evaluating free-form settings.

major comments (2)

[Method (association module)] The reference-to-character association step (described in the method) is load-bearing for all downstream components (dynamic masks, correspondence-aware matching, KV injection, query blending) yet no accuracy, precision/recall, or error analysis is reported for it. Association errors on multi-character or pronoun-heavy stories would directly falsify the consistency gains claimed on FreeStoryBench.
[Experiments] Experiments section reports SOTA claims and stronger consistency under free-form prompts but supplies no error bars, statistical significance tests, or ablations isolating the association module versus the other proposed components, leaving the support for the central claim unassessable from the provided details.
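The first major comment asks for accuracy or precision/recall on the association step. One simple way to score it, sketched below in Python, is to treat each predicted mention-to-character link as a tuple and compare the predicted set against hand-annotated links. The tuple layout `(prompt_index, mention, char_id)` and the example links are hypothetical; the spurious "wings" link only echoes the kind of failure described in Figure 8 and is not a reported result.

```python
def association_scores(predicted, gold):
    """Precision, recall, and F1 over sets of (prompt_index, mention, char_id) links."""
    pred, ref = set(predicted), set(gold)
    true_pos = len(pred & ref)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(ref) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical annotation for a two-character story.
gold_links = [
    (2, "she", "c1"),
    (2, "wolf", "c2"),
    (3, "they", "c1"),
    (3, "they", "c2"),
]
# Hypothetical system output: one link lands on the wrong noun phrase.
predicted_links = [
    (2, "she", "c1"),
    (2, "wolf", "c2"),
    (3, "they", "c1"),
    (3, "wings", "c2"),
]

precision, recall, f1 = association_scores(predicted_links, gold_links)
```

Reporting these numbers separately for pronoun, type, and name mentions would show which reference style the grounding step handles least reliably.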

minor comments (2)

[Method] Notation for the association and matching steps could be clarified with a short pseudocode or diagram to make the entity-grounded reuse pipeline easier to follow.
[Abstract] The abstract would benefit from one sentence summarizing how the unsupervised association is implemented.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important aspects of the association module and experimental reporting that we will address in revision.

Referee: [Method (association module)] The reference-to-character association step (described in the method) is load-bearing for all downstream components (dynamic masks, correspondence-aware matching, KV injection, query blending) yet no accuracy, precision/recall, or error analysis is reported for it. Association errors on multi-character or pronoun-heavy stories would directly falsify the consistency gains claimed on FreeStoryBench.

Authors: We agree that a direct quantitative evaluation of the association step would strengthen the claims. The module relies on mention-to-description matching within the free-form prompt setting, and while end-to-end consistency results on FreeStoryBench (including multi-character cases) provide supporting evidence, we did not report isolated precision/recall or error rates. In the revision we will add a dedicated subsection with accuracy metrics computed on a set of annotated free-form prompts, plus qualitative error cases. revision: yes
Referee: [Experiments] Experiments section reports SOTA claims and stronger consistency under free-form prompts but supplies no error bars, statistical significance tests, or ablations isolating the association module versus the other proposed components, leaving the support for the central claim unassessable from the provided details.

Authors: The original experiments focused on comparative consistency scores across methods and benchmarks. We acknowledge the value of statistical reporting and component ablations. In the revised manuscript we will report standard deviations over multiple random seeds, include paired significance tests where appropriate, and add an ablation study that measures the incremental contribution of the association module when combined with the remaining components. revision: yes
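The promised reporting (standard deviations over seeds plus a paired significance test) can be done with the Python standard library alone. The sketch below uses an exact sign-flip permutation test on per-seed differences, which is one reasonable choice for a handful of seeds; the rebuttal does not say which test the authors will use. The score lists are placeholder numbers invented for this example, not results from the paper.

```python
import itertools
import statistics


def paired_sign_flip_pvalue(scores_a, scores_b):
    """Exact two-sided sign-flip permutation test on paired differences.

    Under the null hypothesis each paired difference is equally likely to be
    positive or negative, so every sign assignment is enumerated.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    count = total = 0
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # tolerance for float rounding
            count += 1
    return count / total


# Placeholder consistency scores for five seeds (NOT taken from the paper).
method_scores = [0.71, 0.69, 0.73, 0.70, 0.72]
baseline_scores = [0.66, 0.65, 0.70, 0.64, 0.69]

method_mean = statistics.mean(method_scores)
method_std = statistics.stdev(method_scores)  # sample standard deviation
p_value = paired_sign_flip_pvalue(method_scores, baseline_scores)
```

With five seeds the smallest attainable two-sided p-value is 2/32 = 0.0625, reached here because every seed favors the same side. Clearing a 0.05 threshold with this test therefore needs at least six paired runs.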

Circularity Check

0 steps flagged

No circularity: method is procedural description without equations or self-referential derivations

The paper describes a training-free framework (FreeStory) that associates mentions with character descriptions and applies dynamic masks, feature matching, KV injection, and query blending. No equations, fitted parameters, or first-principles derivations are present in the abstract or described claims. Performance is evaluated empirically on benchmarks rather than derived from inputs by construction. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify core steps. The association step is a design choice whose accuracy is not quantified here, but that is an empirical limitation, not a circular reduction of any claimed result to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no equations, parameters, or background assumptions; ledger left empty.

Reference graph

Works this paper leans on

67 extracted references · 8 canonical work pages

[1]

The chosen one: Consistent characters in text-to-image diffusion models

Omri Avrahami, Amir Hertz, Yael Vinker, Moab Arar, Shlomi Fruchter, Ohad Fried, Daniel Cohen-Or, and Dani Lischinski. The chosen one: Consistent characters in text-to-image diffusion models. In ACM SIGGRAPH 2024 Conference Papers, SIGGRAPH ’24, New York, NY, USA, 2024. Association for Computing Machinery. ISBN 9798400705250. URL https://doi.org/10.1145/364...

work page doi:10.1145/3641519.3657430 2024
[2]

Vista: Visual storytelling using multi-modal adapters for text-to-image diffusion models

Sibo Dong, Ismail Shaheen, Maggie Shen, Rupayan Mallick, and Sarah Adel Bargal. Vista: Visual storytelling using multi-modal adapters for text-to-image diffusion models. In Proceedings of the Winter Conference on Applications of Computer Vision (WACV), March 2026.

2026
[3]

Scaling rectified flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis. In Proceedings of the 41st International Conference on Machine Learning, ICML...

2024
[4]

Improved visual story generation with adaptive context modeling

Zhangyin Feng, Yuchen Ren, Xinmiao Yu, Xiaocheng Feng, Duyu Tang, Shuming Shi, and Bing Qin. Improved visual story generation with adaptive context modeling. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 2023, pages 4939–4955, Toronto, Canada, July 2023. Association for Com...

2023
[5]

Dreamsim: learning new dimensions of human visual similarity using synthetic data

Stephanie Fu, Netanel Y. Tamir, Shobhita Sundaram, Lucy Chai, Richard Zhang, Tali Dekel, and Phillip Isola. Dreamsim: learning new dimensions of human visual similarity using synthetic data. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY, USA, 2023. Curran Associates Inc.

2023
[6]

Unleashing diffusion transformers for visual correspondence by modulating massive activations

Chaofan Gan, Yuanpeng Tu, Xi Chen, Tieyuan Chen, Yuxi Li, Mehrtash Harandi, and Weiyao Lin. Unleashing diffusion transformers for visual correspondence by modulating massive activations. In The Thirty-ninth Annual Conference on Neural Information Processing Systems. URL https://openreview.net/forum?id=s3MwCBuqav.
[8]

Interactive story visualization with multiple characters

Yuan Gong, Youxin Pang, Xiaodong Cun, Menghan Xia, Yingqing He, Haoxin Chen, Longyue Wang, Yong Zhang, Xintao Wang, Ying Shan, and Yujiu Yang. Interactive story visualization with multiple characters. In SIGGRAPH Asia 2023 Conference Papers, SA ’23, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9798400703157. doi: 10.1145/3610548.3618184.

arXiv 2023
[9]

Dreamstory: Open-domain story visualization by llm-guided multi-subject consistent diffusion

Huiguo He, Huan Yang, Zixi Tuo, Yuan Zhou, Qiuyue Wang, Yuhang Zhang, Zeyu Liu, Wenhao Huang, Hongyang Chao, and Jian Yin. Dreamstory: Open-domain story visualization by llm-guided multi-subject consistent diffusion. arXiv preprint arXiv:2407.12899, 2024.

arXiv 2024
[10]

CLIPScore: A reference-free evaluation metric for image captioning

Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A reference-free evaluation metric for image captioning. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7514–7528, Online and Punta Cana, D...

2021
[11]

Flux

Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024.

2024
[12]

Flux.1 kontext: Flow matching for in-context image generation and editing in latent space

Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, Kyle Lacey, Yam Levi, Cheng Li, Dominik Lorenz, Jonas Müller, Dustin Podell, Robin Rombach, Harry Saini, Axel Sauer, and Luke Smith. Flux.1 kontext: Flow matching for in-context image ...

Pith/arXiv arXiv 2025
[13]

Intelligent grimm - open-ended visual storytelling via latent diffusion models

Chang Liu, Haoning Wu, Yujie Zhong, Xiaoyun Zhang, Yanfeng Wang, and Weidi Xie. Intelligent grimm - open-ended visual storytelling via latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6190–6200, June 2024.

2024
[14]

One-prompt-one-story: Free-lunch consistent text-to-image generation using a single prompt

Tao Liu, Kai Wang, Senmao Li, Joost van de Weijer, Fahad Shahbaz Khan, Shiqi Yang, Yaxing Wang, Jian Yang, and Ming-Ming Cheng. One-prompt-one-story: Free-lunch consistent text-to-image generation using a single prompt. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=cD1kl2QKv1.

2025
[15]

Storynizor: Consistent story generation via inter-frame synchronized and shuffled id injection

Yuhang Ma, Wenting Xu, Chaoyi Zhao, Keqiang Sun, Qinfeng Jin, Zeng Zhao, Changjie Fan, and Zhipeng Hu. Storynizor: Consistent story generation via inter-frame synchronized and shuffled id injection. arXiv preprint arXiv:2409.19624, 2024.

arXiv 2024
[16]

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel HAZIZA, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Patrick L...

2024
[17]

A threshold selection method from gray-level histograms

Nobuyuki Otsu. A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics, 9(1):62–66, 1979. doi: 10.1109/TSMC.1979.4310076.

work page doi:10.1109/tsmc.1979.4310076 1979
[18]

Synthesizing coherent story with auto-regressive latent diffusion models

Xichen Pan, Pengda Qin, Yuhong Li, Hui Xue, and Wenhu Chen. Synthesizing coherent story with auto-regressive latent diffusion models. In 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 2908–2918, 2024. doi: 10.1109/WACV57701.2024.00290.

work page doi:10.1109/wacv57701 2024
[19]

SDXL: Improving latent diffusion models for high-resolution image synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=di52zR8xgf.

2024
[20]

Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. Stanza: A Python natural language processing toolkit for many human languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations. URL https://nlp.stanford.edu/pubs/qi2020stanza.pdf.
[22]

Make-a-story: Visual memory conditioned consistent story generation

Tanzila Rahman, Hsin-Ying Lee, Jian Ren, Sergey Tulyakov, Shweta Mahajan, and Leonid Sigal. Make-a-story: Visual memory conditioned consistent story generation. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2493–2502, 2023. doi: 10.1109/CVPR52729.2023.00246.

work page doi:10.1109/cvpr52729.2023.00246 2023
[23]

Grounded sam: Assembling open-world models for diverse visual tasks

Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, et al. Grounded sam: Assembling open-world models for diverse visual tasks. arXiv preprint arXiv:2401.14159, 2024.

Pith/arXiv arXiv 2024
[24]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10674–10685, 2022. doi: 10.1109/CVPR52688.2022.01042.

arXiv 2022
[25]

Storygpt-v: Large language models as consistent story visualizers

Xiaoqian Shen and Mohamed Elhoseiny. Storygpt-v: Large language models as consistent story visualizers. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13273–13283, 2025. doi: 10.1109/CVPR52734.2025.01239.

work page doi:10.1109/cvpr52734.2025.01239 2025
[26]

Causal-story: Local causal attention utilizing parameter-efficient tuning for visual story synthesis

Tianyi Song, Jiuxin Cao, Kun Wang, Bo Liu, and Xiaofeng Zhang. Causal-story: Local causal attention utilizing parameter-efficient tuning for visual story synthesis. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3350–3354, 2024. doi: 10.1109/ICASSP48485.2024.10446420.

work page doi:10.1109/icassp48485.2024.10446420 2024
[27]

Emergent correspondence from image diffusion

Luming Tang, Menglin Jia, Qianqian Wang, Cheng Perng Phoo, and Bharath Hariharan. Emergent correspondence from image diffusion. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=ypOiXjdfnU.

2023
[28]

Storyimager: A unified and efficient framework for coherent story visualization and completion

Ming Tao, Bing-Kun Bao, Hao Tang, Yaowei Wang, and Changsheng Xu. Storyimager: A unified and efficient framework for coherent story visualization and completion. In Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part LVI, page 479–495, Berlin, Heidelberg, 2024. Springer-Verlag. ISBN 978-3- 03...

work page doi:10.1007/978-3-031-72992-8_27 2024
[29]

Training-free consistent text-to-image generation

Yoad Tewel, Omri Kaduri, Rinon Gal, Yoni Kasten, Lior Wolf, Gal Chechik, and Yuval Atzmon. Training-free consistent text-to-image generation. ACM Trans. Graph., 43(4), July 2024. ISSN 0730-0301. doi: 10.1145/3658157.

work page doi:10.1145/3658157 2024
[30]

Oneactor: Consistent character generation via cluster-conditioned guidance

Jiahao Wang, Caixia Yan, Haonan Lin, and Weizhan Zhang. Oneactor: Consistent character generation via cluster-conditioned guidance. arXiv preprint arXiv:2404.10267, 2024.

arXiv 2024
[31]

Characonsist: Fine-grained consistent character generation

Mengyu Wang, Henghui Ding, Jianing Peng, Yao Zhao, Yunpeng Chen, and Yunchao Wei. Characonsist: Fine-grained consistent character generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 16058–16067, October 2025.

2025
[32]

Seed-story: Multimodal long story generation with large language model

Shuai Yang, Yuying Ge, Yang Li, Yukang Chen, Yixiao Ge, Ying Shan, and Yingcong Chen. Seed-story: Multimodal long story generation with large language model. arXiv preprint arXiv:2407.08683, 2024. URL https://arxiv.org/abs/2407.08683.

arXiv 2024
[33]

Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models

Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721, 2023.

Pith/arXiv arXiv 2023
[34]

Temporalstory: Enhancing consistency in story visualization using spatial-temporal attention

Sixiao Zheng and Yanwei Fu. Temporalstory: Enhancing consistency in story visualization using spatial-temporal attention. arXiv preprint arXiv:2407.09774, 2024.

arXiv 2024
[35]

Storydiffusion: Consistent self-attention for long-range image and video generation

Yupeng Zhou, Daquan Zhou, Ming-Ming Cheng, Jiashi Feng, and Qibin Hou. Storydiffusion: Consistent self-attention for long-range image and video generation. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=VFqzxhINFU.

2024
[36]

Storymaker: Towards holistic consistent characters in text-to-image generation

Zhengguang Zhou, Jing Li, Huaxia Li, Nemo Chen, and Xu Tang. Storymaker: Towards holistic consistent characters in text-to-image generation. arXiv preprint arXiv:2409.12576, 2024.

arXiv 2024
[37]

Cogcartoon: Towards practical story visualization

Zhongyang Zhu and Jie Tang. Cogcartoon: Towards practical story visualization. Int. J. Comput. Vision, 133(4):1808–1833, October 2024. ISSN 0920-5691. doi: 10.1007/s11263-024-02267-5.

A Method

A.1 Dynamic Mask Extraction

Figure 4: Mean IoU between attention-derived masks and Grounded SAM segmentation across diffusion timesteps. To better understan...

2024
[38]

story_id

JSON Schema. Your output must validate against this schema: ```json [ { "story_id": 1, // Integer: Unique ID for the story. You will be given this in the task section at the end. "metadata": { "title": "Story Title", // String: A human-readable title. "genre": ["Genre"], // List[String]: e.g., "Sci-Fi", "Fantasy", "Realism". "num_scenes": 3, // Integer:...
[39]

story_id

High-Quality Example. Here is a perfect example of a single story entry. ```json [ { "story_id": 1, "metadata": { "title": "The Curious Robot", "genre": ["Sci-Fi"], "num_scenes": 3, "num_characters": 1, "keywords": ["robot", "park", "butterfly", "curiosity"] }, "characters": [ { "char_id": "c1", "category": "object", "name": "Beeper", "type": "Robot", "...
[40]

they" or

Core Generation Rules. For story_action: - story_action should be a linguistically sound and coherent description of the story scene, while semantic_action should be a templated version of the story_action that replaces character mentions with character placeholders. - The first story_action in the first scene should introduce the character(s) using the c...
[41]

story_id

TASK. Generate a new, unique story entry as a single JSON object. Constraints for this story that you should follow are: - story_id: <story_id> - num_characters: <character_count> - Each scene must include <character_presence> from the `characters` list. - category: <character_category> - num_scenes: 6 Respond ONLY with the single JSON object. A.2.3 Sto...
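The generation rules above describe a templated `semantic_action` with character placeholders, and the story listings that follow show the same scenes rendered under different mention settings. A small Python sketch of that rendering step is given below. The `{c1}`-style placeholder syntax, the `description` field, and the `render_scene` function are assumptions (the schema is truncated on this page), and only the `type` and `name` fallbacks are covered; the pronoun-based `mix` fallback is left out.

```python
def render_scene(template, characters, mode, fallback, first_scene):
    """Fill character placeholders in a templated scene description.

    mode='single':  full description in the first scene, `fallback` afterwards.
    mode='no_desc': `fallback` in every scene.
    fallback:       'type' or 'name' (other settings are not sketched here).
    """
    text = template
    for char in characters:
        if mode == "single" and first_scene:
            mention = "a " + char["description"]
        elif fallback == "type":
            mention = "the " + char["type"]
        elif fallback == "name":
            mention = char["name"]
        else:
            raise ValueError(f"unsupported fallback: {fallback!r}")
        text = text.replace("{" + char["char_id"] + "}", mention)
    return text[0].upper() + text[1:]


# Characters taken from the archer/wolf example story; field names are hypothetical.
characters = [
    {"char_id": "c1", "name": "Eara", "type": "archer",
     "description": "nimble female archer wearing a green tunic and leather bracers"},
    {"char_id": "c2", "name": "Silver", "type": "wolf",
     "description": "massive grey wolf with a distinctive white patch on his chest"},
]
scene_1 = "{c1} stands at the forest edge while {c2} sniffs the ground nearby"
scene_2 = "{c1} begins to walk down the path, and {c2} follows closely behind her"

first = render_scene(scene_1, characters, mode="single", fallback="type", first_scene=True)
later_type = render_scene(scene_2, characters, mode="single", fallback="type", first_scene=False)
later_name = render_scene(scene_2, characters, mode="single", fallback="name", first_scene=False)
```

Keeping one template per scene and swapping only the mention strategy means every variant describes the same action, so differences in measured consistency can be attributed to how characters are referred to.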
[43]

The nimble female archer wearing a green tunic and leather bracers begins to walk down the path, and The massive grey wolf with a distinctive white patch on his chest follows closely behind her

The nimble female archer wearing a green tunic and leather bracers leaps across the wet stones as The massive grey wolf with a distinctive white patch on his chest splashes through the water to stay by her side

The nimble female archer wearing a green tunic and leather bracers and The massive grey wolf with a distinctive white patch on his chest reach the ruins and scan the area for any signs of movement

The nimble female archer wearing a green tunic and leather bracers notches an arrow, while The massive grey wolf with a distinctive white patch on his chest lets out a low growl to protect her

Finally, The nimble female archer wearing a green tunic and leather bracers sits on the cliff’s edge with The massive grey wolf with a distinctive white patch on his chest as they watch the sun disappear.

V2: Single→Mix ({'mode': 'single', 'fallback': 'mix'})

She begins to walk down the path, and the wolf follows closely behind her

The archer leaps across the wet stones as the wolf splashes through the water to stay by her side

They both reach the ruins and scan the area for any signs of movement

The archer notches an arrow, while the wolf lets out a low growl to protect her

Finally, she sits on the cliff’s edge with the wolf as they watch the sun disappear.

V3: Single→Type ({'mode': 'single', 'fallback': 'type'})

A nimble female archer wearing a green tunic and leather bracers stands at the forest edge while a massive grey wolf with a distinctive white patch on his chest sniffs the ground nearby

Finally, The Archer sits on the cliff’s edge with The Wolf as they watch the sun disappear.

V4: Single→Name ({'mode': 'single', 'fallback': 'name'})

Eara, The nimble female archer wearing a green tunic and leather bracers, stands at the forest edge while Silver, The massive grey wolf with a distinctive white patch on his chest, sniffs the ground nearby

Finally, Eara sits on the cliff’s edge with Silver as they watch the sun disappear.

V5: No Description→Type ({'mode': 'no_desc', 'fallback': 'type'})

The Archer stands at the forest edge while The Wolf sniffs the ground nearby

The Archer begins to walk down the path, and The Wolf follows closely behind her

The Archer leaps across the wet stones as The Wolf splashes through the water to stay by her side

The Archer and The Wolf reach the ruins and scan the area for any signs of movement

The Archer notches an arrow, while The Wolf lets out a low growl to protect her

Finally, The Archer sits on the cliff’s edge with The Wolf as they watch the sun disappear.

V6: No Description→Name ({'mode': 'no_desc', 'fallback': 'name'})

Eara stands at the forest edge while Silver sniffs the ground nearby

Eara begins to walk down the path, and Silver follows closely behind her

Eara leaps across the wet stones as Silver splashes through the water to stay by her side

Eara and Silver reach the ruins and scan the area for any signs of movement

Eara notches an arrow, while Silver lets out a low growl to protect her

Finally, Eara sits on the cliff’s edge with Silver as they watch the sun disappear. B Implementation Details and Additional Results B.1 Background Removal for Evaluation Figure 5: Comparison of background re- moval using CarveKit and Grounded- SAM.Theexamplesillustratethreetyp- ical failure modes of CarveKit. Previous works commonly adopt CarveKit for bac...
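As a reading aid for the story variants listed in A.2.3, the following is a minimal sketch of how the `mode` and `fallback` settings could map a templated scene to its surface text. It is an illustration under stated assumptions, not the benchmark's actual generation code: the `{c1}`-style placeholder format, the `Character` class, and the functions `mention` and `render_story` are all hypothetical, and the `mix` fallback (which alternates pronouns and type words) is not modelled.

```python
from dataclasses import dataclass


@dataclass
class Character:
    char_id: str      # placeholder key used in templates, e.g. "c1" (assumed format)
    name: str         # e.g. "Eara"
    type: str         # e.g. "Archer"
    description: str  # full appearance description, including its article


def mention(ch, mode, fallback, first):
    """Return the surface form for one character mention (hypothetical helper)."""
    if mode not in ("single", "no_desc"):
        raise ValueError(f"unsupported mode: {mode}")
    if fallback not in ("type", "name"):
        # 'mix' alternates pronouns and type words; not modelled in this sketch.
        raise ValueError(f"unsupported fallback: {fallback}")
    if mode == "single" and first:
        # 'single': the full description appears only at the first mention.
        if fallback == "name":
            return f"{ch.name}, {ch.description},"
        return ch.description
    # Later mentions, and every mention under 'no_desc', use the short form.
    return ch.name if fallback == "name" else f"The {ch.type}"


def render_story(templates, characters, mode, fallback):
    """Fill '{char_id}' placeholders scene by scene, tracking first mentions."""
    seen = set()
    scenes = []
    for template in templates:
        text = template
        for ch in characters:
            key = "{" + ch.char_id + "}"
            if key in text:
                first = ch.char_id not in seen
                text = text.replace(key, mention(ch, mode, fallback, first))
                seen.add(ch.char_id)
        scenes.append(text)
    return scenes


CHARACTERS = [
    Character("c1", "Eara", "Archer",
              "The nimble female archer wearing a green tunic and leather bracers"),
    Character("c2", "Silver", "Wolf",
              "The massive grey wolf with a distinctive white patch on his chest"),
]
TEMPLATES = [
    "{c1} stands at the forest edge while {c2} sniffs the ground nearby",
    "{c1} notches an arrow, while {c2} lets out a low growl to protect her",
]

for scene in render_story(TEMPLATES, CHARACTERS, "single", "name"):
    print(scene)
```

Under these assumptions, `("no_desc", "name")` and `("no_desc", "type")` produce the short forms seen in V6 and V5, while `("single", "name")` introduces each character with its full description once and falls back to the name afterwards, as in V4.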

[1]

Omri Avrahami, Amir Hertz, Yael Vinker, Moab Arar, Shlomi Fruchter, Ohad Fried, Daniel Cohen-Or, and Dani Lischinski. The chosen one: Consistent characters in text-to-image diffusion models. InACM SIGGRAPH 2024 Conference Papers, SIGGRAPH ’24, New York, NY, USA, 2024. Association for Computing Machinery. ISBN 9798400705250. URL https://doi.org/10.1145/364...

[2]

Sibo Dong, Ismail Shaheen, Maggie Shen, Rupayan Mallick, and Sarah Adel Bargal. Vista: Visual storytelling using multi-modal adapters for text-to-image diffusion models. In Proceedings of the Winter Conference on Applications of Computer Vision (WACV), March 2026.

[3]

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis. InProceedings of the 41st International Conference on Machine Learning, ICML...

[4]

Zhangyin Feng, Yuchen Ren, Xinmiao Yu, Xiaocheng Feng, Duyu Tang, Shuming Shi, and Bing Qin. Improved visual story generation with adaptive context modeling. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors,Findings of the Association for Computa- tional Linguistics: ACL 2023, pages 4939–4955, Toronto, Canada, July 2023. Association for Com...

[5]

Stephanie Fu, Netanel Y. Tamir, Shobhita Sundaram, Lucy Chai, Richard Zhang, Tali Dekel, and Phillip Isola. Dreamsim: learning new dimensions of human visual similarity using synthetic data. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY, USA, 2023. Curran Associates Inc.

[6]

Chaofan Gan, Yuanpeng Tu, Xi Chen, Tieyuan Chen, Yuxi Li, Mehrtash Harandi, and Weiyao Lin. Unleashing diffusion transformers for visual correspondence by modulating massive activations. In The Thirty-ninth Annual Conference on Neural Information Processing Systems. URL https://openreview.net/forum?id=s3MwCBuqav.

[8]

Yuan Gong, Youxin Pang, Xiaodong Cun, Menghan Xia, Yingqing He, Haoxin Chen, Longyue Wang, Yong Zhang, Xintao Wang, Ying Shan, and Yujiu Yang. Interactive story visualization with multiple characters. In SIGGRAPH Asia 2023 Conference Papers, SA ’23, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9798400703157. doi: 10.1145/3610548.3618184.

[9]

Huiguo He, Huan Yang, Zixi Tuo, Yuan Zhou, Qiuyue Wang, Yuhang Zhang, Zeyu Liu, Wenhao Huang, Hongyang Chao, and Jian Yin. Dreamstory: Open-domain story visualization by llm-guided multi-subject consistent diffusion. arXiv preprint arXiv:2407.12899, 2024.

[10]

Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A reference-free evaluation metric for image captioning. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors,Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7514–7528, Online and Punta Cana, D...

[11]

Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024.

[12]

Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, Kyle Lacey, Yam Levi, Cheng Li, Dominik Lorenz, Jonas Müller, Dustin Podell, Robin Rombach, Harry Saini, Axel Sauer, and Luke Smith. Flux.1 kontext: Flow matching for in-context image ...

[13]

Chang Liu, Haoning Wu, Yujie Zhong, Xiaoyun Zhang, Yanfeng Wang, and Weidi Xie. Intelligent grimm - open-ended visual storytelling via latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6190–6200, June 2024.

[14]

Tao Liu, Kai Wang, Senmao Li, Joost van de Weijer, Fahad Shahbaz Khan, Shiqi Yang, Yaxing Wang, Jian Yang, and Ming-Ming Cheng. One-prompt-one-story: Free-lunch consistent text-to-image generation using a single prompt. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=cD1kl2QKv1.

[15]

Yuhang Ma, Wenting Xu, Chaoyi Zhao, Keqiang Sun, Qinfeng Jin, Zeng Zhao, Changjie Fan, and Zhipeng Hu. Storynizor: Consistent story generation via inter-frame synchronized and shuffled id injection. arXiv preprint arXiv:2409.19624, 2024.

[16]

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khali- dov, Pierre Fernandez, Daniel HAZIZA, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Patrick L...

[17]

Nobuyuki Otsu. A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics, 9(1):62–66, 1979. doi: 10.1109/TSMC.1979.4310076.

[18]

Xichen Pan, Pengda Qin, Yuhong Li, Hui Xue, and Wenhu Chen. Synthesizing coherent story with auto-regressive latent diffusion models. In 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 2908–2918, 2024. doi: 10.1109/WACV57701.2024.00290.

[19]

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=di52zR8xgf.

[20]

Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. Stanza: A Python natural language processing toolkit for many human languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations. URL https://nlp.stanford.edu/pubs/qi2020stanza.pdf.

[22]

Tanzila Rahman, Hsin-Ying Lee, Jian Ren, Sergey Tulyakov, Shweta Mahajan, and Leonid Sigal. Make-a-story: Visual memory conditioned consistent story generation. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2493–2502, 2023. doi: 10.1109/CVPR52729.2023.00246.

[23]

Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, et al. Grounded sam: Assembling open-world models for diverse visual tasks. arXiv preprint arXiv:2401.14159, 2024.

[24]

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10674–10685, 2022. doi: 10.1109/CVPR52688.2022.01042.

[25]

Xiaoqian Shen and Mohamed Elhoseiny. Storygpt-v: Large language models as consistent story visualizers. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13273–13283, 2025. doi: 10.1109/CVPR52734.2025.01239.

[26]

Tianyi Song, Jiuxin Cao, Kun Wang, Bo Liu, and Xiaofeng Zhang. Causal-story: Local causal attention utilizing parameter-efficient tuning for visual story synthesis. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3350–3354, 2024. doi: 10.1109/ICASSP48485.2024.10446420.

[27]

Luming Tang, Menglin Jia, Qianqian Wang, Cheng Perng Phoo, and Bharath Hariharan. Emergent correspondence from image diffusion. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=ypOiXjdfnU.

[28]

Ming Tao, Bing-Kun Bao, Hao Tang, Yaowei Wang, and Changsheng Xu. Storyimager: A unified and efficient framework for coherent story visualization and completion. InComputer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part LVI, page 479–495, Berlin, Heidelberg, 2024. Springer-Verlag. ISBN 978-3- 03...

[29]

Yoad Tewel, Omri Kaduri, Rinon Gal, Yoni Kasten, Lior Wolf, Gal Chechik, and Yuval Atzmon. Training-free consistent text-to-image generation. ACM Trans. Graph., 43(4), July 2024. ISSN 0730-0301. doi: 10.1145/3658157.

[30]

Jiahao Wang, Caixia Yan, Haonan Lin, and Weizhan Zhang. Oneactor: Consistent character generation via cluster-conditioned guidance. arXiv preprint arXiv:2404.10267, 2024.

[31]

Mengyu Wang, Henghui Ding, Jianing Peng, Yao Zhao, Yunpeng Chen, and Yunchao Wei. Characonsist: Fine-grained consistent character generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 16058–16067, October 2025.

[32]

Shuai Yang, Yuying Ge, Yang Li, Yukang Chen, Yixiao Ge, Ying Shan, and Yingcong Chen. Seed-story: Multimodal long story generation with large language model. arXiv preprint arXiv:2407.08683, 2024. URL https://arxiv.org/abs/2407.08683.

[33]

Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721, 2023.

[34]

Sixiao Zheng and Yanwei Fu. Temporalstory: Enhancing consistency in story visualization using spatial-temporal attention. arXiv preprint arXiv:2407.09774, 2024.

[35]

Yupeng Zhou, Daquan Zhou, Ming-Ming Cheng, Jiashi Feng, and Qibin Hou. Storydiffusion: Consistent self-attention for long-range image and video generation. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=VFqzxhINFU.

[36]

Zhengguang Zhou, Jing Li, Huaxia Li, Nemo Chen, and Xu Tang. Storymaker: Towards holistic consistent characters in text-to-image generation. arXiv preprint arXiv:2409.12576, 2024.

[37]

Zhongyang Zhu and Jie Tang. Cogcartoon: Towards practical story visualization.Int. J. Comput. Vision, 133(4):1808–1833, October 2024. ISSN 0920-5691. doi: 10.1007/ s11263-024-02267-5. 4 17 A Method A.1 Dynamic Mask Extraction Figure 4: Mean IoU between attention- derived masks and Grounded SAM seg- mentation across diffusion timesteps. To better understan...

