pith. sign in

arxiv: 2509.04123 · v2 · submitted 2025-09-04 · 💻 cs.CV

TaleDiffusion: Multi-Character Story Generation with Dialogue Rendering

Pith reviewed 2026-05-18 18:57 UTC · model grok-4.3

classification 💻 cs.CV
keywords multi-character story generationtext-to-image diffusioncharacter consistencydialogue renderingLLM-guided image generationattention mechanismsstory visualization
0
0 comments X

The pith

TaleDiffusion generates consistent multi-character stories by planning frames with an LLM and controlling diffusion attention to keep identities stable while rendering assigned dialogues.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TaleDiffusion to address the problem of generating visual stories with multiple characters that must interact across frames without changing appearance or having dialogues assigned incorrectly. It begins with a pre-trained language model that breaks a given story into per-frame scene descriptions, detailed character appearances, and specific dialogue lines assigned to each speaker. The framework then applies bounded attention masks around character boxes to limit unwanted interactions, identity-consistent self-attention layers to preserve appearance across frames, and region-aware cross-attention to place objects accurately. Dialogues are turned into speech bubbles and matched to the right characters through a post-processing segmentation step. Experiments indicate that these steps together reduce visual noise and improve both character consistency and dialogue accuracy over prior text-to-image story methods.

Core claim

TaleDiffusion introduces an iterative framework that maintains character consistency and accurate dialogue assignment in multi-character story generation. It leverages a pre-trained LLM via in-context learning to generate per-frame descriptions, character details, and dialogues. A bounded attention-based per-box mask technique controls character interactions, while identity-consistent self-attention ensures consistency across frames and region-aware cross-attention handles object placement. Dialogues are rendered as bubbles and assigned using CLIPSeg, leading to better performance in consistency, noise reduction, and dialogue rendering.

What carries the argument

The combination of bounded attention-based per-box masks, identity-consistent self-attention, and region-aware cross-attention inside the diffusion process, which together enforce stable character appearances and precise dialogue placement across story frames.

If this is right

  • Multi-character stories can be produced with characters that retain the same visual identity from one frame to the next without manual correction.
  • Dialogue bubbles are placed and attributed automatically to the intended speaker rather than appearing randomly or mislabeled.
  • Generated images exhibit fewer artifacts around character boundaries and interaction zones compared with standard diffusion story generators.
  • The same pipeline can be applied to new stories simply by supplying a fresh text prompt to the language model stage.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extending the per-frame planning step to include temporal ordering constraints could support longer coherent sequences beyond short stories.
  • The attention control techniques might transfer to video diffusion models to reduce identity drift across many frames in animated sequences.
  • Replacing the fixed CLIPSeg assignment step with a learned module trained on dialogue-to-character pairs could further reduce assignment errors.

Load-bearing premise

The framework assumes that a pre-trained LLM can reliably produce accurate per-frame descriptions, character details, and dialogue assignments via in-context learning, and that post-processing with CLIPSeg will correctly assign rendered dialogues without introducing new errors.

What would settle it

A direct visual comparison of story sequences in which the same character changes facial features or clothing between frames, or in which speech bubbles are attached to the wrong speaker, would demonstrate that the consistency and assignment mechanisms have failed.

Figures

Figures reproduced from arXiv: 2509.04123 by Anjan Dutta, Ayan Banerjee, Josep Llados, Umapada Pal.

Figure 1
Figure 1. Figure 1: TaleDiffusion enhances interactivity through dynamic character and environment handling using Identity Consistent Self [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: TaleDiffusion framework: Given a story S and character descriptions C, it uses a pretrained LLM to generate frame descriptions B, dialogues D, and layouts Λi with mask guidance. It builds an image character database from C via low-rank adaptation during latent guidance, which is processed by an image adapter and diffusion U-Net to denoise and add coherent backgrounds. Finally, it renders dialogue RT , assi… view at source ↗
Figure 3
Figure 3. Figure 3: CoT (left) vs ICL (right): In CoT, the whole task is divided into multiple subtasks where each subtask is dependent on the previous one. So it can never have the full context. In contrast, ICL provides the whole context to the LLM in a single step and helps to provide more creativity and completeness. A girl, playing A brown cat, playing A black cat, playing cross attention cross attention cross attention … view at source ↗
Figure 4
Figure 4. Figure 4: Bounded attention based per-box mask (left) and ICSA (right): Former takes the bounding box of a single character at a time and create binarize mask of the cross attention of CLIPtext where the later extends the self-attention by storing the character features in a long vector and update the key and values during frame generation instead of using the same queries like traditional self-attention [82]. (see … view at source ↗
Figure 5
Figure 5. Figure 5: SDSA [67] vs ICSA-RACA: Former mostly focused on eyes, which leads to artifact generation, and sometimes main characters are missing. In contrast, later puts their primary focus is on the eyes, but also helps to maintain the pose and other attributes. θ(M˜Ri , Cdb, t) = X L l=1 M fi∈Cdb (M˜Ri ⊙ Zot ⊙ ICSAl fi ) (8) Here, L denotes concatenation and L is the no. of self￾attention layers, and Zot is the Gaus… view at source ↗
Figure 6
Figure 6. Figure 6: Gradient Fusion (GF): The character’s attributes can be inconsistent (white hair in the second frame) while GF resolves those inconsistencies in the final frame generation. Background Latent Denoising: After generating the fore￾ground latent Zfg, we further encode it via an image adapter, and further denoise it with The background latent Zbgt , pre￾serving the character attributes, to generate a coherent b… view at source ↗
Figure 7
Figure 7. Figure 7: TaleDiffusion improves interactivity by dynamically ad [PITH_FULL_IMAGE:figures/full_fig_p006_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: TaleDiffusion outperforms the prior works by maintaining the consistency and spatial relationship between multiple objects. [PITH_FULL_IMAGE:figures/full_fig_p007_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: CLIPSeg in dialogue rendering: FLUX.1-schnell and RetroComicFlux fail to place text in bubbles and assign multiple bubbles to one character. In contrast, our method precisely places text and uses CLIPSeg to correctly assign bubbles to characters. background denoising ( [PITH_FULL_IMAGE:figures/full_fig_p008_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Ablation Study: While bounded attention mitigates the artifacts generation, ICSA and RACA improve character con￾sistency by maintaining their spatial relationship. 5. Conclusion We proposed TaleDiffusion, a framework for generat￾ing realistic stories with multiple characters, addressing key 8 [PITH_FULL_IMAGE:figures/full_fig_p008_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Video2Comics [77] vs TaleDiffusion: The former struggles with character consistency and dialogue accuracy, while the latter ensures both, maintaining story coherence [500% zoom]. 6. Implementation Details This section offers in-depth insights into the implemen￾tation of TaleDiffusion as a training-free framework that is compatible with most existing LLM architectures and dif￾fusion models. We implemented … view at source ↗
Figure 12
Figure 12. Figure 12: Cross attention computation during mask generation [PITH_FULL_IMAGE:figures/full_fig_p015_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Self Attention maps from U-Net’s middle and first up block across all the timesteps from t [PITH_FULL_IMAGE:figures/full_fig_p015_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Attention map to mask generation it tried to put more focus on human faces than non-human ones, as it is difficult to maintain human facial characteris￾tics. Not only that, we extract the mean ICSA, which helps to maintain consistency during latent denoising of the back￾ground generation. We have also tried to generate a story without ICSA in [PITH_FULL_IMAGE:figures/full_fig_p015_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Effectiveness of ICSA: At the initial layers, ICSA focused on individual characters that maintain character consistency. At a later stage, it attends all the objects together and helps in multi-character customization. cross attention + sigmoid + normalization Layer1 Layer2 Layer3 Layer4 Layer5 [PITH_FULL_IMAGE:figures/full_fig_p016_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: RACA: cross-attention, aided by sigmoid normaliza￾tion, captures object shape and structure for precise positioning. erated by an image synthesis technique. To measure the artifacts score, we first identify the types of artifacts gen￾erated by diffusion models and rank them based on their frequency of occurrence. We design a comprehensive arti￾fact taxonomy for synthetic images including 26 kinds of artif… view at source ↗
Figure 17
Figure 17. Figure 17: Types of artifacts: Out of 26 artifacts, these 8 are the most frequent that occur during comic generation. Similarly, in Figs. 19 and 20 Textual Inversion [17] nei￾ther follows the text prompt nor the object count. Elite [74] started generating images without any characters. Al￾though BLIP-Diffusion [36] follows the text prompt unable to maintain consistency. Also, IP-Adapter [78] and DB￾LoRA [59] mostly … view at source ↗
Figure 18
Figure 18. Figure 18: Consistent image generation with baseline models for [PITH_FULL_IMAGE:figures/full_fig_p017_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: Consistent image generation with baseline models for [PITH_FULL_IMAGE:figures/full_fig_p018_19.png] view at source ↗
Figure 21
Figure 21. Figure 21: Robustness of dialogue rendering against different styles: It has been observed that the panels generated by TaleDiffusion are not only consistent across different styles also the bubbles are correctly assigned to the characters against the noise caused by styles. Bob is getting out of the car in front of a research institute. Bob is meeting a group of scientists. Bob is shaking hands with one scientist. … view at source ↗
Figure 22
Figure 22. Figure 22: Stylization: We can generate the same story with different styles, maintaining character consistency and frame coherency. 7 [PITH_FULL_IMAGE:figures/full_fig_p019_22.png] view at source ↗
Figure 23
Figure 23. Figure 23: Effectiveness of Dialogue Rendering: It has been ob￾served that dialogues improved the story generation over the im￾ages in all perspectives. 50.2% 21.4% 19.7% 1.1% 4.6% 4% Which methods are following the text prompt? 40.7% 24.6% 21.3% 3.3% 5.0% 5.1% Which methods contains most of the artifacts? 38.8% 31.2% 15.7% 6.2% 3.3% 4.8% Which methods are character consistent? 37.8% 28.6% 1.4% 18.2% 9.5% 5.5% Which… view at source ↗
Figure 24
Figure 24. Figure 24: Story Generation: TaleDiffusion gets the maximum vote from the user in all the categories, whereas StoryGen [44] gets the least. on the other hand, StoryDiffusion [82] give some competition to TaleDiffusion in character consistency. generate multiple characters and the inference time of a sin￾gle panel with TaleDiffusione depends on how many char￾acters we are trying to generate in a single panel [PITH_F… view at source ↗
Figure 26
Figure 26. Figure 26: Qualitative Evaluation: StoryGen neither has consistency nor follows the text prompt. Although StoryDiffusion follows the text prompt, it fails to maintain character as well as background consistency and generates artifacts. In contrast, TaleDiffusion follows the text prompt and maintains character as well as background consistency. A girl is playing with a brown and black cat. The cats are met with a blu… view at source ↗
Figure 27
Figure 27. Figure 27: Experiments on Multilingualism: We can generate the same story in different languages, maintaining character consistency and frame coherence. 17. Ablation of CLIPSeg CLIPSeg, as proposed in [47], is a powerful segmen￾tation model that accepts a text prompt and an image as inputs, enabling it to identify and segment the image re￾gion corresponding to the given textual description. This capability has been … view at source ↗
Figure 28
Figure 28. Figure 28: A journey from hatred to friendship: Three scientists hate each other, met at a conference. They compete against each other for new inventions. Ultimately they become friends and start enjoying life. Please zoom in for better visualization. • wrap(T, Wmax) wraps text T within a width Wmax. • fsize(T, font) returns the size (wt, ht) of T. Out of these cases, the bubble assignment to the ”hair” was observed… view at source ↗
Figure 29
Figure 29. Figure 29: A daily life of two friends: Paul and Victor live together. Every day they wake up, do their work, enjoy life, and go to sleep again. Please zoom in for better visualization. structure laid out by the text prompts. Its superior handling of dialogue rendering further elevates the readability and overall quality of the generated comic story. 11 [PITH_FULL_IMAGE:figures/full_fig_p023_29.png] view at source ↗
Figure 30
Figure 30. Figure 30: The last few days of a girl A girl is about to go to space. Before that, she is enjoying her life at its best. Please zoom in for better visualization. A cat, a dog, and a rabbit jumping in a park A cat, a dog, and a rabbit walking in the street Panel 1 - 5 A cat, a dog, and a rabbit sitting in the living room The dog and the rabbit are reading newspaper A cat, a dog, and a rabbit enjoying in the beach A … view at source ↗
Figure 31
Figure 31. Figure 31: Sharing is Caring: A cat, a dog, and a rabbit live together, play together, eat, travel, and are always happy together. Please zoom in for better visualization. 12 [PITH_FULL_IMAGE:figures/full_fig_p024_31.png] view at source ↗
read the original abstract

Text-to-story visualization is challenging due to the need for consistent interaction among multiple characters across frames. Existing methods struggle with character consistency, leading to artifact generation and inaccurate dialogue rendering, which results in disjointed storytelling. In response, we introduce TaleDiffusion, a novel framework for generating multi-character stories with an iterative process, maintaining character consistency, and accurate dialogue assignment via postprocessing. Given a story, we use a pre-trained LLM to generate per-frame descriptions, character details, and dialogues via in-context learning, followed by a bounded attention-based per-box mask technique to control character interactions and minimize artifacts. We then apply an identity-consistent self-attention mechanism to ensure character consistency across frames and region-aware cross-attention for precise object placement. Dialogues are also rendered as bubbles and assigned to characters via CLIPSeg. Experimental results demonstrate that TaleDiffusion outperforms existing methods in consistency, noise reduction, and dialogue rendering.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces TaleDiffusion, a framework for multi-character story visualization from text. It uses a pre-trained LLM to generate per-frame descriptions, character details, and dialogue assignments via in-context learning, followed by a diffusion model with bounded attention-based per-box masks to control interactions, identity-consistent self-attention for cross-frame consistency, region-aware cross-attention for object placement, and CLIPSeg post-processing to assign rendered dialogue bubbles. The central claim is that this pipeline outperforms prior methods in character consistency, noise reduction, and accurate dialogue rendering.

Significance. If the experimental claims hold with rigorous validation, the work could advance controllable story generation by addressing multi-character consistency and dialogue placement, which are persistent challenges in text-to-image/video pipelines. The modular use of pre-trained components (LLM + diffusion + segmentation) is a practical strength, but the absence of quantitative metrics, baselines, or ablations in the abstract limits assessment of whether the proposed attention mechanisms deliver the claimed gains beyond the LLM stage.

major comments (2)
  1. [Abstract] Abstract: The claim that 'Experimental results demonstrate that TaleDiffusion outperforms existing methods in consistency, noise reduction, and dialogue rendering' is unsupported by any reported metrics, baseline comparisons, dataset details, or statistical tests. This is load-bearing for the central claim of superiority and must be addressed with quantitative evidence before the contribution can be evaluated.
  2. [Abstract / Method overview] The pipeline's performance in consistency and dialogue rendering is predicated on the reliability of the LLM's in-context learning outputs for per-frame descriptions, character details, and dialogue assignments (as described in the abstract). No error rates, human validation, or ablation studies on this stage are mentioned; if LLM errors are common in complex multi-character scenes, they would propagate and undermine attribution of gains to the bounded attention masks, identity-consistent self-attention, or region-aware cross-attention.
minor comments (1)
  1. [Abstract] The abstract refers to 'postprocessing' for dialogue assignment without specifying the exact CLIPSeg integration or failure modes (e.g., misassignment under occlusion).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for highlighting areas where the empirical validation of TaleDiffusion can be strengthened. We address the concerns regarding the abstract claims and the LLM component below, and have made revisions to incorporate additional quantitative evidence and analysis.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The claim that 'Experimental results demonstrate that TaleDiffusion outperforms existing methods in consistency, noise reduction, and dialogue rendering' is unsupported by any reported metrics, baseline comparisons, dataset details, or statistical tests. This is load-bearing for the central claim of superiority and must be addressed with quantitative evidence before the contribution can be evaluated.

    Authors: We agree that the abstract's claim requires better support within the abstract itself. We will revise the abstract to summarize the key quantitative results from Section 4, including baseline comparisons and the specific metrics used for consistency, noise reduction, and dialogue rendering. Dataset details are already provided in the experiments section, and we will ensure statistical tests are highlighted. revision: yes

  2. Referee: [Abstract / Method overview] The pipeline's performance in consistency and dialogue rendering is predicated on the reliability of the LLM's in-context learning outputs for per-frame descriptions, character details, and dialogue assignments (as described in the abstract). No error rates, human validation, or ablation studies on this stage are mentioned; if LLM errors are common in complex multi-character scenes, they would propagate and undermine attribution of gains to the bounded attention masks, identity-consistent self-attention, or region-aware cross-attention.

    Authors: We acknowledge the need to validate the LLM stage to properly attribute contributions. In the revised manuscript, we will include an analysis of the LLM outputs, such as error rates from a human study on a sample of generated descriptions and dialogues. We will also present an ablation study that isolates the impact of the bounded attention, identity-consistent self-attention, and region-aware cross-attention mechanisms to show their added value beyond the LLM-generated inputs. revision: yes

Circularity Check

0 steps flagged

No circularity: pipeline relies on external pre-trained models and independent mechanisms

full rationale

The paper presents a multi-stage pipeline that invokes a pre-trained LLM for per-frame descriptions and dialogue assignment via in-context learning, applies custom bounded attention masks plus identity-consistent self-attention and region-aware cross-attention inside the diffusion process, and uses CLIPSeg for bubble assignment. No equations, fitted parameters, or self-citations are shown that would make any claimed output (consistency, noise reduction, dialogue accuracy) equivalent to the inputs by construction. The experimental outperformance claims rest on comparisons against external baselines rather than tautological re-derivations of the method's own definitions or prior self-work. This is the normal case of a self-contained engineering contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests primarily on the effectiveness of standard pre-trained components and the authors' proposed attention modifications; no free parameters or new invented entities are introduced in the abstract description.

axioms (1)
  • domain assumption A pre-trained LLM can generate accurate per-frame descriptions, character details, and dialogues via in-context learning.
    This premise is invoked as the first step of the framework in the abstract.

pith-pipeline@v0.9.0 · 5689 in / 1236 out tokens · 39020 ms · 2026-05-18T18:57:48.741341+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. DocRevive: A Unified Pipeline for Document Text Restoration

    cs.CV 2026-04 unverdicted novelty 5.0

    DocRevive builds a unified pipeline using OCR, image analysis, language models, and diffusion to reconstruct degraded document text, backed by a 30k-image synthetic dataset and the UCSM metric.

  2. DocRevive: A Unified Pipeline for Document Text Restoration

    cs.CV 2026-04 unverdicted novelty 5.0

    A unified pipeline using OCR, inpainting, and diffusion models restores text in degraded documents on a new synthetic benchmark dataset, evaluated with the proposed UCSM metric.

Reference graph

Works this paper leans on

104 extracted references · 104 canonical work pages · cited by 1 Pith paper · 6 internal anchors

  1. [1]

    https://www

    Claude 3.5 Sonnet — anthropic.com. https://www. anthropic.com/claude/sonnet. [Accessed 12-11- 2024]. 8

  2. [2]

    https : / / huggingface

    renderartist/retrocomicflux · Hugging Face — hug- gingface.co. https : / / huggingface . co / renderartist/retrocomicflux, 2024. [Accessed 12-11-2024]. 8

  3. [3]

    https://huggingface.co/Xenova/gpt- 3.5- turbo, 2024

    Xenova/gpt-3.5-turbo · Hugging Face — huggingface.co. https://huggingface.co/Xenova/gpt- 3.5- turbo, 2024. [Accessed 12-11-2024]. 8

  4. [4]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ah- mad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 ,

  5. [5]

    Spice: Semantic propositional image cap- tion evaluation

    Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. Spice: Semantic propositional image cap- tion evaluation. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, Octo- ber 11-14, 2016, Proceedings, Part V 14 , pages 382–398. Springer, 2016. 2, 3

  6. [6]

    The chosen one: Consistent characters in text- to-image diffusion models

    Omri Avrahami, Amir Hertz, Yael Vinker, Moab Arar, Shlomi Fruchter, Ohad Fried, Daniel Cohen-Or, and Dani Lischinski. The chosen one: Consistent characters in text- to-image diffusion models. In ACM SIGGRAPH 2024 Con- ference Papers, pages 1–12, 2024. 6, 7, 8

  7. [7]

    Meteor: An automatic metric for mt evaluation with improved correlation with hu- man judgments

    Satanjeev Banerjee and Alon Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with hu- man judgments. In Proceedings of the acl workshop on in- trinsic and extrinsic evaluation measures for machine trans- lation and/or summarization, pages 65–72, 2005. 2, 3

  8. [8]

    Synartifact: Classifying and alleviat- ing artifacts in synthetic images via vision-language model

    Bin Cao, Jianhao Yuan, Yexin Liu, Jian Li, Shuyang Sun, Jing Liu, and Bo Zhao. Synartifact: Classifying and alleviat- ing artifacts in synthetic images via vision-language model. arXiv preprint arXiv:2402.18068, 2024. 4

  9. [9]

    Auto- matic stylistic manga layout.ACM Transactions on Graphics (TOG), 31(6):1–10, 2012

    Ying Cao, Antoni B Chan, and Rynson WH Lau. Auto- matic stylistic manga layout.ACM Transactions on Graphics (TOG), 31(6):1–10, 2012. 2

  10. [10]

    Claude2-alpaca: Instruction tuning datasets distilled from claude

    Lichang Chen, Khalid Saifullah, Ming Li, Tianyi Zhou, and Heng Huang. Claude2-alpaca: Instruction tuning datasets distilled from claude. https://github.com/ Lichang-Chen/claude2-alpaca, 2023. 8

  11. [11]

    Manga genera- tion via layout-controllable diffusion

    Siyu Chen, Dengjie Li, Zenghao Bao, Yao Zhou, Lingfeng Tan, Yujie Zhong, and Zheng Zhao. Manga genera- tion via layout-controllable diffusion. In arXiv preprint arxiv:2412.19303, 2024. 2, 3

  12. [12]

    arXiv preprint arXiv:2406.01388 , year=

    Junhao Cheng, Xi Lu, Hanhui Li, Khun Loun Zai, Baiqiao Yin, Yuhao Cheng, Yiqiang Yan, and Xiaodan Liang. Au- tostudio: Crafting consistent subjects in multi-turn interac- tive image generation. arXiv preprint arXiv:2406.01388 ,

  13. [13]

    Theatergen: Character management with llm for consistent multi-turn image generation

    Junhao Cheng, Baiqiao Yin, Kaixin Cai, Minbin Huang, Hanhui Li, Yuxin He, Xi Lu, Yue Li, Yifei Li, Yuhao Cheng, et al. Theatergen: Character management with llm for consistent multi-turn image generation. arXiv preprint arXiv:2404.18919, 2024. 2

  14. [14]

    Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna. lmsys. org (accessed 14 April 2023) , 2(3):6,

  15. [15]

    Be yourself: Bounded attention for multi-subject text-to-image generation

    Omer Dahary, Or Patashnik, Kfir Aberman, and Daniel Cohen-Or. Be yourself: Bounded attention for multi-subject text-to-image generation. arXiv preprint arXiv:2403.16990, 2(5), 2024. 2

  16. [16]

    A survey on in-context learning

    Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Tianyu Liu, Baobao Chang, Xu Sun, Lei Li, and Zhifang Sui. A survey on in-context learning. arXiv, 2024. 3

  17. [17]

    An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

    Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patash- nik, Amit H Bermano, Gal Chechik, and Daniel Cohen- Or. An image is worth one word: Personalizing text-to- image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022. 4, 5, 6, 8

  18. [18]

    Controlling perceptual fac- tors in neural style transfer

    Leon A Gatys, Alexander S Ecker, Matthias Bethge, Aaron Hertzmann, and Eli Shechtman. Controlling perceptual fac- tors in neural style transfer. In Proceedings of the IEEE con- ference on computer vision and pattern recognition , pages 3985–3993, 2017. 5 9

  19. [19]

    TokenFlow: Consistent Diffusion Features for Consistent Video Editing

    Michal Geyer, Omer Bar-Tal, Shai Bagon, and Tali Dekel. Tokenflow: Consistent diffusion features for consistent video editing. arXiv preprint arXiv:2307.10373, 2023. 3

  20. [20]

    Interactive story visualization with multiple characters

    Yuan Gong, Youxin Pang, Xiaodong Cun, Menghan Xia, Yingqing He, Haoxin Chen, Longyue Wang, Yong Zhang, Xintao Wang, Ying Shan, and Yujiu Yang. Interactive story visualization with multiple characters. In SIGGRAPH Asia 2023 Conference Papers , SA ’23, New York, NY , USA,

  21. [21]

    Association for Computing Machinery. 6, 7

  22. [22]

    Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio

    Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. In NeurIPS, 2014. 2

  23. [23]

    Mix-of-show: Decentralized low- rank adaptation for multi-concept customization of diffusion models

    Yuchao Gu, Xintao Wang, Jay Zhangjie Wu, Yujun Shi, Yun- peng Chen, Zihan Fan, Wuyou Xiao, Rui Zhao, Shuning Chang, Weijia Wu, et al. Mix-of-show: Decentralized low- rank adaptation for multi-concept customization of diffusion models. Advances in Neural Information Processing Sys- tems, 36, 2024. 5

  24. [24]

    ebdtheque: a representative database of comics

    Cl ´ement Gu ´erin, Christophe Rigaud, Antoine Mercier, Farid Ammar-Boudjelal, Karell Bertet, Alain Bouju, Jean- Christophe Burie, Georges Louis, Jean-Marc Ogier, and Ar- naud Revel. ebdtheque: a representative database of comics. In 2013 12th International Conference on Document Analy- sis and Recognition, pages 1145–1149. IEEE, 2013. 8

  25. [25]

    Imagine this! scripts to composi- tions to videos

    Tanmay Gupta, Dustin Schwenk, Ali Farhadi, Derek Hoiem, and Aniruddha Kembhavi. Imagine this! scripts to composi- tions to videos. In Proceedings of the European conference on computer vision (ECCV) , pages 598–613, 2018. 2

  26. [26]

    Textdescriptives: A python package for calculating a large variety of metrics from text

    Lasse Hansen, Ludvig Renbo Olsen, and Kenneth Enevold- sen. Textdescriptives: A python package for calculating a large variety of metrics from text. Journal of Open Source Software, 8(84):5153, Apr. 2023. 8

  27. [27]

    Improving multi-subject consistency in open-domain image genera- tion with isolation and reposition attention

    Huiguo He, Qiuyue Wang, Yuan Zhou, Yuxuan Cai, Hongyang Chao, Jian Yin, and Huan Yang. Improving multi-subject consistency in open-domain image genera- tion with isolation and reposition attention. arXiv preprint arXiv:2411.19261, 2024. 2

  28. [28]

    Dreamstory: Open-domain story visualiza- tion by llm-guided multi-subject consistent diffusion.arXiv preprint arXiv:2407.12899, 2024

    Huiguo He, Huan Yang, Zixi Tuo, Yuan Zhou, Qiuyue Wang, Yuhang Zhang, Zeyu Liu, Wenhao Huang, Hongyang Chao, and Jian Yin. Dreamstory: Open-domain story visualiza- tion by llm-guided multi-subject consistent diffusion. arXiv preprint arXiv:2407.12899, 2024. 2, 3, 6, 1

  29. [29]

    Gans trained by a two time-scale update rule converge to a local nash equilib- rium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilib- rium. Advances in neural information processing systems , 30, 2017. 6

  30. [30]

    Inferring semantic layout for hierarchical text- to-image synthesis

    Seunghoon Hong, Dingdong Yang, Jongwook Choi, and Honglak Lee. Inferring semantic layout for hierarchical text- to-image synthesis. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 7986– 7994, 2018. 2

  31. [31]

    Image quality metrics: Psnr vs

    Alain Hore and Djemel Ziou. Image quality metrics: Psnr vs. ssim. In 2010 20th international conference on pattern recognition, pages 2366–2369. IEEE, 2010. 6

  32. [32]

    Jay Hosler and K. B. Boomer. Are comic books an effec- tive way to engage nonmajors in learning and appreciating science? CBE—Life Sciences Education , 2011. 1

  33. [33]

    Animate anyone: Consistent and controllable image- to-video synthesis for character animation

    Li Hu. Animate anyone: Consistent and controllable image- to-video synthesis for character animation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8153–8163, 2024. 3

  34. [34]

    Identity decoupling for multi-subject per- sonalization of text-to-image models

    Sangwon Jang, Jaehyeong Jo, Kimin Lee, and Sung Ju Hwang. Identity decoupling for multi-subject per- sonalization of text-to-image models. arXiv preprint arXiv:2404.04243, 2024. 3

  35. [35]

    Content-aware video2comics with manga-style layout

    Guangmei Jing, Yongtao Hu, Yanwen Guo, Yizhou Yu, and Wenping Wang. Content-aware video2comics with manga-style layout. IEEE Transactions on Multimedia , 17(12):2122–2133, 2015. 2

  36. [36]

    Black Forest Labs. Flux. https://github.com/ black-forest-labs/flux, 2024. 8

  37. [37]

    Blip-diffusion: Pre- trained subject representation for controllable text-to-image generation and editing

    Dongxu Li, Junnan Li, and Steven Hoi. Blip-diffusion: Pre- trained subject representation for controllable text-to-image generation and editing. Advances in Neural Information Pro- cessing Systems, 36, 2024. 4, 5, 6, 8

  38. [38]

    Blip: Bootstrapping language-image pre-training for uni- fied vision-language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for uni- fied vision-language understanding and generation. In In- ternational Conference on Machine Learning, pages 12888– 12900. PMLR, 2022. 6

  39. [39]

    Unbounded: A generative infinite game of character life simulation

    Jialu Li, Yuanzhen Li, Neal Wadhwa, Yael Pritch, David E Jacobs, Michael Rubinstein, Mohit Bansal, and Nataniel Ruiz. Unbounded: A generative infinite game of character life simulation. arXiv preprint arXiv:2410.18975, 2024. 2

  40. [40]

    Storygan: A sequential conditional gan for story vi- sualization

    Yitong Li, Zhe Gan, Yelong Shen, Jingjing Liu, Yu Cheng, Yuexin Wu, Lawrence Carin, David Carlson, and Jianfeng Gao. Storygan: A sequential conditional gan for story vi- sualization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages 6329–6338,

  41. [41]

    Gligen: Open-set grounded text-to-image generation

    Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jian- wei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22511–22521, 2023. 4, 1

  42. [42]

    Photomaker: Customizing re- alistic human photos via stacked id embedding

    Zhen Li, Mingdeng Cao, Xintao Wang, Zhongang Qi, Ming- Ming Cheng, and Ying Shan. Photomaker: Customizing re- alistic human photos via stacked id embedding. In Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8640–8650, 2024. 4, 6

  43. [43]

    Rouge: A package for automatic evaluation of summaries

    Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out , pages 74–81, 2004. 2, 3

  44. [44]

    Evaluating text-to-visual generation with image-to-text gen- eration

    Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, and Deva Ramanan. Evaluating text-to-visual generation with image-to-text gen- eration. In European Conference on Computer Vision, pages 366–384. Springer, 2025. 7, 8, 6

  45. [45]

    Intelligent grimm-open-ended visual storytelling via latent diffusion models

    Chang Liu, Haoning Wu, Yujie Zhong, Xiaoyun Zhang, Yan- feng Wang, and Weidi Xie. Intelligent grimm-open-ended visual storytelling via latent diffusion models. In Proceed- 10 ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6190–6200, 2024. 2, 6, 7, 8, 10

  46. [46]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024. 4

  47. [47]

    One-prompt-one-story: Free-lunch consistent text-to-image generation using a single prompt

    Tao Liu, Kai Wang, Senmao Li, Joost van de Weijer, Fa- had Shahbaz Khan, Shiqi Yang, Yaxing Wang, Jian Yang, and Ming-Ming Cheng. One-prompt-one-story: Free-lunch consistent text-to-image generation using a single prompt. In The Thirteenth International Conference on Learning Repre- sentations, 2025. 6, 7

  48. [48]

    Image segmenta- tion using text and image prompts

    Timo L ¨uddecke and Alexander Ecker. Image segmenta- tion using text and image prompts. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7086–7096, 2022. 2, 3, 6, 5, 9

  49. [49]

    Integrating visuospa- tial, linguistic and commonsense structure into story visual- ization

    Adyasha Maharana and Mohit Bansal. Integrating visuospa- tial, linguistic and commonsense structure into story visual- ization. arXiv preprint arXiv:2110.10834, 2021. 2

  50. [50]

    Storydall-e: Adapting pretrained text-to-image transformers for story continuation

    Adyasha Maharana, Darryl Hannan, and Mohit Bansal. Storydall-e: Adapting pretrained text-to-image transformers for story continuation. InEuropean Conference on Computer Vision, pages 70–87. Springer, 2022. 2

  51. [51]

    Story-adapter: A training-free iterative framework for long story visualization

    Jiawei Mao, Xiaoke Huang, Yunfei Xie, Yuanqi Chang, Mude Hui, Bingjie Xu, and Yuyin Zhou. Story-adapter: A training-free iterative framework for long story visualization. In arXiv, 2024. 7

  52. [52]

    Digital comics image indexing based on deep learn- ing

    Nhu-Van Nguyen, Christophe Rigaud, and Jean-Christophe Burie. Digital comics image indexing based on deep learn- ing. Journal of Imaging, 4(7):89, 2018. 3

  53. [53]

    Beyond semantic distance: Automated scoring of divergent thinking greatly improves with large language models

    Peter Organisciak, Selcuk Acar, Denis Dumas, and Kelly Berthiaume. Beyond semantic distance: Automated scoring of divergent thinking greatly improves with large language models. Thinking Skills and Creativity, 49:101356, 2023. 8

  54. [54]

    Synthesizing coherent story with auto-regressive la- tent diffusion models

    Xichen Pan, Pengda Qin, Yuhong Li, Hui Xue, and Wenhu Chen. Synthesizing coherent story with auto-regressive la- tent diffusion models. In Proceedings of the IEEE/CVF Win- ter Conference on Applications of Computer Vision , pages 2920–2930, 2024. 2

  55. [55]

    Pytorch: An im- perative style, high-performance deep learning library

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An im- perative style, high-performance deep learning library. Ad- vances in neural information processing systems , 32, 2019. 1

  56. [56]

    Comic- gan: Text-to-comic generative adversarial network

    Ben Proven-Bessel, Zilong Zhao, and Lydia Chen. Comic- gan: Text-to-comic generative adversarial network. arXiv preprint arXiv:2109.09120, 2021. 2

  57. [57]

    Make-a-story: Visual memory conditioned consistent story generation

    Tanzila Rahman, Hsin-Ying Lee, Jian Ren, Sergey Tulyakov, Shweta Mahajan, and Leonid Sigal. Make-a-story: Visual memory conditioned consistent story generation. In Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2493–2502, 2023. 2

  58. [58]

    Grounded sam: Assembling open-world models for diverse visual tasks

    Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kun- chang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, Zhaoyang Zeng, Hao Zhang, Feng Li, Jie Yang, Hongyang Li, Qing Jiang, and Lei Zhang. Grounded sam: Assembling open-world models for diverse visual tasks. arXiv, 2024. 4

  59. [59]

    High-resolution image syn- thesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj¨orn Ommer. High-resolution image syn- thesis with latent diffusion models. In CVPR, 2022. 2

  60. [60]

    Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation

    Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages 22500– 22510, 2023. 2, 4, 5, 6, 8

  61. [61]

    Improved aesthetic predictor, 2022

    Christoph Schuhmann. Improved aesthetic predictor, 2022. 7, 8, 6

  62. [62]

    Storygpt-v: Large language models as consistent story visualizers.arXiv, 2023

    Xiaoqian Shen and Mohamed Elhoseiny. Storygpt-v: Large language models as consistent story visualizers.arXiv, 2023. 2

  63. [63]

    Storybooth: Training-free multi-subject consistency for improved visual storytelling

    Jaskirat Singh, Junshen Kevin Chen, Jonas Kohler, and Michael Cohen. Storybooth: Training-free multi-subject consistency for improved visual storytelling. arXiv preprint arXiv:2504.05800, 2025. 2, 3

  64. [64]

    Text2scene: Generating compositional scenes from textual descriptions

    Fuwen Tan, Song Feng, and Vicente Ordonez. Text2scene: Generating compositional scenes from textual descriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6710–6719, 2019. 2, 3

  65. [65]

    arXiv preprint arXiv:2210.04885 , year=

    Raphael Tang, Linqing Liu, Akshat Pandey, Zhiying Jiang, Gefei Yang, Karun Kumar, Pontus Stenetorp, Jimmy Lin, and Ferhan Ture. What the daam: Interpreting stable diffu- sion using cross attention. arXiv preprint arXiv:2210.04885,

  66. [66]

    Storyimager: A unified and efficient frame- work for coherent story visualization and completion

    Ming Tao, Bing-Kun Bao, Hao Tang, Yaowei Wang, and Changsheng Xu. Storyimager: A unified and efficient frame- work for coherent story visualization and completion. arXiv preprint arXiv:2404.05979, 2024. 2, 6

  67. [67]

    Science comics as tools for science educa- tion and communication: a brief, exploratory study

    Mi ´co Tatalovi´c. Science comics as tools for science educa- tion and communication: a brief, exploratory study. JCOM, 8(4), 2009. 1

  68. [68]

    Training-free consis- tent text-to-image generation

    Yoad Tewel, Omri Kaduri, Rinon Gal, Yoni Kasten, Lior Wolf, Gal Chechik, and Yuval Atzmon. Training-free consis- tent text-to-image generation. ACM Transactions on Graph- ics (TOG), 43(4):1–18, 2024. 2, 5, 6, 7, 3, 8

  69. [69]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023. 8, 9, 2, 3

  70. [70]

    Cider: Consensus-based image description evalua- tion

    Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evalua- tion. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575, 2015. 2, 3

  71. [71]

    Comix: A comprehensive benchmark for multi-task comic understanding

    Emanuele Vivoli, Marco Bertini, and Dimosthenis Karatzas. Comix: A comprehensive benchmark for multi-task comic understanding. arXiv preprint arXiv:2407.03550, 2024. 8

  72. [72]

    Cdac: Cross-domain attention consistency in trans- former for domain adaptive semantic segmentation

    Kaihong Wang, Donghyun Kim, Rogerio Feris, and Margrit Betke. Cdac: Cross-domain attention consistency in trans- former for domain adaptive semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11519–11529, 2023. 8 11

  73. [73]

    Autostory: Generating diverse storytelling images with minimal human efforts

    Wen Wang, Canyu Zhao, Hao Chen, Zhekai Chen, Kecheng Zheng, and Chunhua Shen. Autostory: Generating diverse storytelling images with minimal human efforts. Interna- tional Journal of Computer Vision , pages 1–22, 2024. 2, 3, 4

  74. [74]

    Large language models are latent variable models: Explaining and finding good demonstra- tions for in-context learning

    Xinyi Wang, Wanrong Zhu, Michael Saxon, Mark Steyvers, and William Yang Wang. Large language models are latent variable models: Explaining and finding good demonstra- tions for in-context learning. Advances in Neural Informa- tion Processing Systems, 36, 2024. 3, 9

  75. [75]

    Elite: Encoding visual con- cepts into textual embeddings for customized text-to-image generation

    Yuxiang Wei, Yabo Zhang, Zhilong Ji, Jinfeng Bai, Lei Zhang, and Wangmeng Zuo. Elite: Encoding visual con- cepts into textual embeddings for customized text-to-image generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15943–15953, 2023. 2, 4, 5, 6

  76. [76]

    Diffsensei: Bridging multi- modal llms and diffusion models for customized manga gen- eration

    Jianzong Wu, Chao Tang, Jingbo Wang, Yanhong Zeng, Xi- angtai Li, and Yunhai Tong. Diffsensei: Bridging multi- modal llms and diffusion models for customized manga gen- eration. arXiv preprint arXiv:2412.07589 , 2024. 2, 3, 6, 7, 8

  77. [77]

    Human preference score: Better aligning text- to-image models with human preference

    Xiaoshi Wu, Keqiang Sun, Feng Zhu, Rui Zhao, and Hong- sheng Li. Human preference score: Better aligning text- to-image models with human preference. In Proceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 2096–2105, 2023. 7, 8, 6

  78. [78]

    Auto- matic comic generation with stylistic multi-page layouts and emotion-driven text balloon generation

    Xin Yang, Zongliang Ma, Letian Yu, Ying Cao, Baocai Yin, Xiaopeng Wei, Qiang Zhang, and Rynson WH Lau. Auto- matic comic generation with stylistic multi-page layouts and emotion-driven text balloon generation. ACM Transactions on Multimedia Computing, Communications, and Applica- tions (TOMM), 17(2):1–19, 2021. 2, 1

  79. [79]

    IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

    Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip- adapter: Text compatible image prompt adapter for text-to- image diffusion models. arXiv preprint arXiv:2308.06721 ,

  80. [80]

    Obj2text: Generating visually descriptive language from object layouts

    Xuwang Yin and Vicente Ordonez. Obj2text: Generating visually descriptive language from object layouts. In Pro- ceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 177–187, 2017. 3, 2

Showing first 80 references.