TaleDiffusion: Multi-Character Story Generation with Dialogue Rendering

Anjan Dutta; Ayan Banerjee; Josep Llados; Umapada Pal

arxiv: 2509.04123 · v2 · submitted 2025-09-04 · 💻 cs.CV

Pith reviewed 2026-05-18 18:57 UTC · model grok-4.3

keywords multi-character story generation · text-to-image diffusion · character consistency · dialogue rendering · LLM-guided image generation · attention mechanisms · story visualization

The pith

TaleDiffusion generates consistent multi-character stories by planning frames with an LLM and controlling diffusion attention to keep identities stable while rendering assigned dialogues.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TaleDiffusion to address the problem of generating visual stories with multiple characters that must interact across frames without changing appearance or having dialogues assigned incorrectly. It begins with a pre-trained language model that breaks a given story into per-frame scene descriptions, detailed character appearances, and specific dialogue lines assigned to each speaker. The framework then applies bounded attention masks around character boxes to limit unwanted interactions, identity-consistent self-attention layers to preserve appearance across frames, and region-aware cross-attention to place objects accurately. Dialogues are turned into speech bubbles and matched to the right characters through a post-processing segmentation step. Experiments indicate that these steps together reduce visual noise and improve both character consistency and dialogue accuracy over prior text-to-image story methods.
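The planning stage described above can be pictured as a small data structure plus a prompt that carries worked examples. The following is a minimal sketch, not the paper's implementation: the `FramePlan` schema, the JSON reply format, and the helper names are all assumptions made for illustration, and no real LLM is called.

```python
import json
from dataclasses import dataclass, field


@dataclass
class FramePlan:
    """One planned story frame (hypothetical schema, not taken from the paper)."""
    description: str
    characters: dict[str, str]  # character name -> appearance text
    dialogues: list[tuple[str, str]] = field(default_factory=list)  # (speaker, line)


def plan_to_dict(plan: FramePlan) -> dict:
    """Convert a plan into the JSON-friendly form used inside the prompt."""
    return {
        "description": plan.description,
        "characters": plan.characters,
        "dialogues": [{"speaker": s, "line": l} for s, l in plan.dialogues],
    }


def build_icl_prompt(story: str, examples: list[tuple[str, list[FramePlan]]]) -> str:
    """Place worked examples and the new story into one prompt (in-context learning)."""
    parts = ["Split each story into frames. Answer with JSON."]
    for example_story, plans in examples:
        parts.append(f"Story: {example_story}")
        parts.append("Frames: " + json.dumps([plan_to_dict(p) for p in plans]))
    parts.append(f"Story: {story}")
    parts.append("Frames:")
    return "\n".join(parts)


def parse_frames(llm_reply: str) -> list[FramePlan]:
    """Parse a JSON reply (assumed format) back into FramePlan objects."""
    return [
        FramePlan(
            description=item["description"],
            characters=item["characters"],
            dialogues=[(d["speaker"], d["line"]) for d in item["dialogues"]],
        )
        for item in json.loads(llm_reply)
    ]
```

Keeping the plan as structured data rather than free text means each later stage (masking, attention control, bubble rendering) can read exactly the fields it needs.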

Core claim

TaleDiffusion introduces an iterative framework that maintains character consistency and accurate dialogue assignment in multi-character story generation. It leverages a pre-trained LLM via in-context learning to generate per-frame descriptions, character details, and dialogues. A bounded attention-based per-box mask technique controls character interactions, while identity-consistent self-attention ensures consistency across frames and region-aware cross-attention handles object placement. Dialogues are rendered as bubbles and assigned using CLIPSeg, leading to better performance in consistency, noise reduction, and dialogue rendering.
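To make the per-box mask idea concrete, here is a minimal sketch in plain Python. It assumes a box is given as fractional `(x0, y0, x1, y1)` coordinates and that an attention map is a small grid of non-negative weights; both conventions, and the function names, are illustrative assumptions rather than details confirmed by the paper.

```python
def box_mask(box, height, width):
    """Binary mask (rows of 0/1) for a box given as (x0, y0, x1, y1) fractions."""
    x0, y0, x1, y1 = box
    cols = range(round(x0 * width), round(x1 * width))
    rows = range(round(y0 * height), round(y1 * height))
    return [[1 if (r in rows and c in cols) else 0 for c in range(width)]
            for r in range(height)]


def bound_attention(attn_map, mask):
    """Zero the attention outside the box, then renormalize what is left."""
    kept = [[a * m for a, m in zip(arow, mrow)]
            for arow, mrow in zip(attn_map, mask)]
    total = sum(sum(row) for row in kept)
    if total == 0:
        return kept  # nothing inside the box: return the all-zero map
    return [[v / total for v in row] for row in kept]
```

Applied one character at a time, a mask like this keeps each character's attention inside its own box, which is the intuition behind limiting unwanted interactions between characters.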

What carries the argument

The combination of bounded attention-based per-box masks, identity-consistent self-attention, and region-aware cross-attention inside the diffusion process, which together enforce stable character appearances and precise dialogue placement across story frames.
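One way to picture identity-consistent self-attention is as ordinary self-attention whose keys and values are extended with stored character features, so every frame can "look at" the same reference tokens. The sketch below is a simplification under stated assumptions: it omits the learned query, key, and value projections, and the `character_bank` name is hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Plain scaled dot-product attention over lists of vectors."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out


def identity_consistent_attention(frame_tokens, character_bank):
    """Self-attention whose keys and values also include stored character tokens."""
    keys = frame_tokens + character_bank  # assumption: bank tokens share the frame's feature size
    return attend(frame_tokens, keys, keys)
```

With an empty bank this reduces to plain self-attention; with a non-empty bank, each frame token's output is pulled toward the stored character features, which is the mechanism the text credits for stable appearance across frames.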

If this is right

Multi-character stories can be produced with characters that retain the same visual identity from one frame to the next without manual correction.
Dialogue bubbles are placed and attributed automatically to the intended speaker rather than appearing randomly or mislabeled.
Generated images exhibit fewer artifacts around character boundaries and interaction zones compared with standard diffusion story generators.
The same pipeline can be applied to new stories simply by supplying a fresh text prompt to the language model stage.
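The automatic bubble attribution in the list above can be sketched as a lookup from speaker to segmented region. In this minimal example the per-character masks are assumed to come from a text-prompted segmenter such as CLIPSeg; here they are plain 0/1 grids, and anchoring a bubble at the mask centroid is an illustrative choice, not the paper's stated rule.

```python
def mask_centroid(mask):
    """Centroid (row, col) of the nonzero cells of a 2D mask."""
    cells = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not cells:
        raise ValueError("empty mask")
    return (sum(r for r, _ in cells) / len(cells),
            sum(c for _, c in cells) / len(cells))


def assign_bubbles(dialogues, character_masks):
    """Anchor each (speaker, line) bubble at the speaker's mask centroid."""
    placed = []
    for speaker, line in dialogues:
        if speaker not in character_masks:
            continue  # no region for this speaker: skip rather than guess
        placed.append({
            "speaker": speaker,
            "line": line,
            "anchor": mask_centroid(character_masks[speaker]),
        })
    return placed
```

Skipping speakers without a region, instead of attaching the bubble to the nearest character, keeps a segmentation miss visible as a missing bubble rather than hiding it as a wrong attribution.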

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Extending the per-frame planning step to include temporal ordering constraints could support longer coherent sequences beyond short stories.
The attention control techniques might transfer to video diffusion models to reduce identity drift across many frames in animated sequences.
Replacing the fixed CLIPSeg assignment step with a learned module trained on dialogue-to-character pairs could further reduce assignment errors.

Load-bearing premise

The framework assumes that a pre-trained LLM can reliably produce accurate per-frame descriptions, character details, and dialogue assignments via in-context learning, and that post-processing with CLIPSeg will correctly assign rendered dialogues without introducing new errors.
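Because this premise is load-bearing, a cheap structural check on the LLM output can catch some failures before they reach the diffusion stage. The sketch below assumes each frame is a dict with `description`, `characters`, and `dialogues` keys; that schema and the function name are hypothetical, and the check covers only internal consistency, not whether the plan is faithful to the story.

```python
def find_plan_errors(frames):
    """Return readable problems found in a list of planned frames.

    Each frame is assumed to be a dict with 'description', 'characters'
    (name -> appearance) and 'dialogues' (list of {'speaker', 'line'}).
    """
    errors = []
    first_seen = {}  # character name -> appearance text from its first frame
    for index, frame in enumerate(frames):
        for name, appearance in frame["characters"].items():
            if first_seen.setdefault(name, appearance) != appearance:
                errors.append(f"frame {index}: appearance of {name!r} changed")
        for dialogue in frame["dialogues"]:
            if dialogue["speaker"] not in frame["characters"]:
                errors.append(f"frame {index}: unknown speaker {dialogue['speaker']!r}")
    return errors
```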

What would settle it

A direct visual comparison of story sequences in which the same character changes facial features or clothing between frames, or in which speech bubbles are attached to the wrong speaker, would demonstrate that the consistency and assignment mechanisms have failed.
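The visual comparison described above could be supplemented by a rough automated screen. The sketch below assumes that some image encoder has already produced one embedding vector per frame for a given character crop; the encoder, the 0.8 threshold, and the function names are all assumptions, and a low score only flags a frame for human inspection.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def drifting_frames(embeddings, threshold=0.8):
    """Indices of frames whose similarity to frame 0 falls below `threshold`."""
    reference = embeddings[0]
    return [i for i, emb in enumerate(embeddings[1:], start=1)
            if cosine(reference, emb) < threshold]
```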

Figures

Figures reproduced from arXiv: 2509.04123 by Anjan Dutta, Ayan Banerjee, Josep Llados, Umapada Pal.

**Figure 1.** TaleDiffusion enhances interactivity through dynamic character and environment handling using Identity Consistent Self…

**Figure 2.** TaleDiffusion framework: Given a story S and character descriptions C, it uses a pretrained LLM to generate frame descriptions B, dialogues D, and layouts Λ_i with mask guidance. It builds an image character database from C via low-rank adaptation during latent guidance, which is processed by an image adapter and diffusion U-Net to denoise and add coherent backgrounds. Finally, it renders dialogue RT, assi…

**Figure 3.** CoT (left) vs ICL (right): In CoT, the whole task is divided into multiple subtasks, each dependent on the previous one, so no step ever has the full context. In contrast, ICL provides the whole context to the LLM in a single step, which helps creativity and completeness.

**Figure 4.** Bounded attention-based per-box mask (left) and ICSA (right): The former takes the bounding box of a single character at a time and creates a binarized mask from the cross-attention of CLIP_text, whereas the latter extends self-attention by storing the character features in a long vector and updating the keys and values during frame generation instead of using the same queries as in traditional self-attention [82]. (see …

**Figure 5.** SDSA [67] vs ICSA-RACA: The former focuses mostly on the eyes, which leads to artifact generation, and sometimes the main characters are missing. In contrast, the latter also puts its primary focus on the eyes but additionally helps maintain the pose and other attributes.

$$\theta(\tilde{M}_{R_i}, C_{db}, t) = \sum_{l=1}^{L} \bigoplus_{f_i \in C_{db}} \left( \tilde{M}_{R_i} \odot Z^{o}_{t} \odot \mathrm{ICSA}^{l}_{f_i} \right) \tag{8}$$

Here, $\bigoplus$ denotes concatenation, $L$ is the number of self-attention layers, and $Z^{o}_{t}$ is the Gaus…

**Figure 6.** Gradient Fusion (GF): The character's attributes can be inconsistent (white hair in the second frame), while GF resolves those inconsistencies in the final frame generation.

Background Latent Denoising: After generating the foreground latent $Z^{fg}$, we further encode it via an image adapter and denoise it with the background latent $Z^{bg}_{t}$, preserving the character attributes, to generate a coherent b…

**Figure 7.** TaleDiffusion improves interactivity by dynamically ad…

**Figure 8.** TaleDiffusion outperforms the prior works by maintaining the consistency and spatial relationship between multiple objects.

**Figure 9.** CLIPSeg in dialogue rendering: FLUX.1-schnell and RetroComicFlux fail to place text in bubbles and assign multiple bubbles to one character. In contrast, our method precisely places text and uses CLIPSeg to correctly assign bubbles to characters.

**Figure 10.** Ablation study: While bounded attention mitigates artifact generation, ICSA and RACA improve character consistency by maintaining their spatial relationship.

5. Conclusion. We proposed TaleDiffusion, a framework for generating realistic stories with multiple characters, addressing key…

**Figure 11.** Video2Comics [77] vs TaleDiffusion: The former struggles with character consistency and dialogue accuracy, while the latter ensures both, maintaining story coherence [500% zoom].

6. Implementation Details. This section offers in-depth insights into the implementation of TaleDiffusion as a training-free framework that is compatible with most existing LLM architectures and diffusion models. We implemented …

**Figure 12.** Cross-attention computation during mask generation.

**Figure 13.** Self-attention maps from the U-Net's middle and first up block across all the timesteps from t…

**Figure 14.** Attention map to mask generation.

…it tried to put more focus on human faces than non-human ones, as it is difficult to maintain human facial characteristics. Not only that, we extract the mean ICSA, which helps to maintain consistency during latent denoising of the background generation. We have also tried to generate a story without ICSA in…

**Figure 15.** Effectiveness of ICSA: In the initial layers, ICSA focuses on individual characters, which maintains character consistency. In later layers, it attends to all the objects together, which helps multi-character customization.

**Figure 16.** RACA: cross-attention, aided by sigmoid normalization, captures object shape and structure for precise positioning.

…erated by an image synthesis technique. To measure the artifacts score, we first identify the types of artifacts generated by diffusion models and rank them based on their frequency of occurrence. We design a comprehensive artifact taxonomy for synthetic images including 26 kinds of artif…

**Figure 17.** Types of artifacts: Out of 26 artifacts, these 8 occur most frequently during comic generation.

Similarly, in Figs. 19 and 20, Textual Inversion [17] follows neither the text prompt nor the object count. Elite [74] started generating images without any characters. Although BLIP-Diffusion [36] follows the text prompt, it is unable to maintain consistency. Also, IP-Adapter [78] and DBLoRA [59] mostly …

**Figure 18.** Consistent image generation with baseline models for…

**Figure 19.** Consistent image generation with baseline models for…

**Figure 21.** Robustness of dialogue rendering against different styles: The panels generated by TaleDiffusion are not only consistent across different styles, but the bubbles are also correctly assigned to the characters despite the noise caused by the styles.

**Figure 22.** Stylization: We can generate the same story with different styles, maintaining character consistency and frame coherency.

**Figure 23.** Effectiveness of Dialogue Rendering: It has been observed that dialogues improved the story generation over the images in all perspectives.

**Figure 24.** Story generation: TaleDiffusion receives the most user votes in all categories, whereas StoryGen [44] receives the fewest. On the other hand, StoryDiffusion [82] gives TaleDiffusion some competition in character consistency.

…generate multiple characters, and the inference time of a single panel with TaleDiffusion depends on how many characters we are trying to generate in a single panel…

**Figure 26.** Qualitative Evaluation: StoryGen neither has consistency nor follows the text prompt. Although StoryDiffusion follows the text prompt, it fails to maintain character as well as background consistency and generates artifacts. In contrast, TaleDiffusion follows the text prompt and maintains character as well as background consistency.

**Figure 27.** Experiments on Multilingualism: We can generate the same story in different languages, maintaining character consistency and frame coherence.

17. Ablation of CLIPSeg. CLIPSeg, as proposed in [47], is a powerful segmentation model that accepts a text prompt and an image as inputs, enabling it to identify and segment the image region corresponding to the given textual description. This capability has been …

**Figure 28.** A journey from hatred to friendship: Three scientists who hate each other meet at a conference. They compete against each other for new inventions. Ultimately they become friends and start enjoying life. Please zoom in for better visualization.

• wrap(T, Wmax) wraps text T within a width Wmax.
• fsize(T, font) returns the size (wt, ht) of T.

Out of these cases, the bubble assignment to the "hair" was observed…

**Figure 29.** A daily life of two friends: Paul and Victor live together. Every day they wake up, do their work, enjoy life, and go to sleep again. Please zoom in for better visualization.

…structure laid out by the text prompts. Its superior handling of dialogue rendering further elevates the readability and overall quality of the generated comic story.

**Figure 30.** The last few days of a girl: A girl is about to go to space. Before that, she is enjoying her life at its best. Please zoom in for better visualization.

**Figure 31.** Sharing is Caring: A cat, a dog, and a rabbit live together, play together, eat, travel, and are always happy together. Please zoom in for better visualization.

Original abstract

Text-to-story visualization is challenging due to the need for consistent interaction among multiple characters across frames. Existing methods struggle with character consistency, leading to artifact generation and inaccurate dialogue rendering, which results in disjointed storytelling. In response, we introduce TaleDiffusion, a novel framework for generating multi-character stories with an iterative process, maintaining character consistency, and accurate dialogue assignment via postprocessing. Given a story, we use a pre-trained LLM to generate per-frame descriptions, character details, and dialogues via in-context learning, followed by a bounded attention-based per-box mask technique to control character interactions and minimize artifacts. We then apply an identity-consistent self-attention mechanism to ensure character consistency across frames and region-aware cross-attention for precise object placement. Dialogues are also rendered as bubbles and assigned to characters via CLIPSeg. Experimental results demonstrate that TaleDiffusion outperforms existing methods in consistency, noise reduction, and dialogue rendering.
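The region-aware cross-attention step is described elsewhere on this page as cross-attention "aided by sigmoid normalization". A minimal sketch of that idea, assuming raw cross-attention scores for one object token arrive as a small 2D grid, might look as follows. The min-max rescaling, the 0.5 threshold, and the function names are illustrative assumptions, not details confirmed by the paper.

```python
import math


def sigmoid_normalize(scores):
    """Apply a sigmoid to each score, then min-max rescale the map to [0, 1]."""
    squashed = [[1.0 / (1.0 + math.exp(-s)) for s in row] for row in scores]
    flat = [v for row in squashed for v in row]
    low, high = min(flat), max(flat)
    if high == low:
        return [[0.0 for _ in row] for row in squashed]  # flat map: no region stands out
    return [[(v - low) / (high - low) for v in row] for row in squashed]


def region_mask(scores, threshold=0.5):
    """Binary region obtained by thresholding the normalized attention map."""
    return [[1 if v >= threshold else 0 for v in row]
            for row in sigmoid_normalize(scores)]
```

The sigmoid compresses extreme scores before rescaling, so a single very large activation is less likely to dominate the resulting region.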

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

TaleDiffusion adds targeted attention controls to diffusion models for multi-character story consistency and dialogue bubbles, but its claims rest on unverified LLM outputs, and the abstract reports no metrics or baselines.

The main thing to know is that TaleDiffusion lays out an iterative pipeline that starts with a pre-trained LLM breaking a story into per-frame descriptions, character details, and dialogues via in-context learning, then uses bounded attention per-box masks, identity-consistent self-attention, region-aware cross-attention, and CLIPSeg to assign speech bubbles while trying to keep characters consistent and cut down on artifacts. The specific mix of those attention mechanisms is the concrete new piece that extends earlier diffusion work for this task without obvious circularity in the setup. It does a clean job naming the practical problems like disjointed storytelling from inconsistent characters and inaccurate dialogue placement, and the post-processing step with CLIPSeg is a straightforward addition that could help with bubble assignment. The description of how the pieces fit together is direct and builds on standard pre-trained models in a way that feels usable for story visualization.

The soft spots are clear and worth noting in proportion. The abstract claims better consistency, noise reduction, and dialogue rendering than existing methods, yet it gives no quantitative metrics, baseline comparisons, or experimental details at all, so there is no way to judge whether the attention changes actually drive the gains or if other factors are at play. The pipeline also assumes the LLM stage will reliably produce accurate descriptions and dialogue assignments for complex multi-character interactions; if that part slips, errors will reach the diffusion stage and the later controls cannot fully compensate. That assumption is the least secure part, and the absence of any error rates, ablations, or validation for the LLM outputs makes the attribution to the new mechanisms shaky.

This is aimed at researchers working on generative vision for narratives, such as tools for media production or education. A reader who wants concrete ideas for controlling attention in diffusion models to handle multiple characters might find the technical choices worth looking at. It deserves peer review so the full experiments, numbers, and any checks on the LLM stage can be examined properly.

Referee Report

2 major / 1 minor

Summary. The paper introduces TaleDiffusion, a framework for multi-character story visualization from text. It uses a pre-trained LLM to generate per-frame descriptions, character details, and dialogue assignments via in-context learning, followed by a diffusion model with bounded attention-based per-box masks to control interactions, identity-consistent self-attention for cross-frame consistency, region-aware cross-attention for object placement, and CLIPSeg post-processing to assign rendered dialogue bubbles. The central claim is that this pipeline outperforms prior methods in character consistency, noise reduction, and accurate dialogue rendering.

Significance. If the experimental claims hold with rigorous validation, the work could advance controllable story generation by addressing multi-character consistency and dialogue placement, which are persistent challenges in text-to-image/video pipelines. The modular use of pre-trained components (LLM + diffusion + segmentation) is a practical strength, but the absence of quantitative metrics, baselines, or ablations in the abstract limits assessment of whether the proposed attention mechanisms deliver the claimed gains beyond the LLM stage.

major comments (2)

[Abstract] The claim that 'Experimental results demonstrate that TaleDiffusion outperforms existing methods in consistency, noise reduction, and dialogue rendering' is unsupported by any reported metrics, baseline comparisons, dataset details, or statistical tests. This is load-bearing for the central claim of superiority and must be addressed with quantitative evidence before the contribution can be evaluated.
[Abstract / Method overview] The pipeline's performance in consistency and dialogue rendering is predicated on the reliability of the LLM's in-context learning outputs for per-frame descriptions, character details, and dialogue assignments (as described in the abstract). No error rates, human validation, or ablation studies on this stage are mentioned; if LLM errors are common in complex multi-character scenes, they would propagate and undermine attribution of gains to the bounded attention masks, identity-consistent self-attention, or region-aware cross-attention.

minor comments (1)

[Abstract] The abstract refers to 'postprocessing' for dialogue assignment without specifying the exact CLIPSeg integration or failure modes (e.g., misassignment under occlusion).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for highlighting areas where the empirical validation of TaleDiffusion can be strengthened. We address the concerns regarding the abstract claims and the LLM component below, and have made revisions to incorporate additional quantitative evidence and analysis.

Referee: [Abstract] The claim that 'Experimental results demonstrate that TaleDiffusion outperforms existing methods in consistency, noise reduction, and dialogue rendering' is unsupported by any reported metrics, baseline comparisons, dataset details, or statistical tests. This is load-bearing for the central claim of superiority and must be addressed with quantitative evidence before the contribution can be evaluated.

Authors: We agree that the abstract's claim requires better support within the abstract itself. We will revise the abstract to summarize the key quantitative results from Section 4, including baseline comparisons and the specific metrics used for consistency, noise reduction, and dialogue rendering. Dataset details are already provided in the experiments section, and we will ensure statistical tests are highlighted.
Revision: yes

Referee: [Abstract / Method overview] The pipeline's performance in consistency and dialogue rendering is predicated on the reliability of the LLM's in-context learning outputs for per-frame descriptions, character details, and dialogue assignments (as described in the abstract). No error rates, human validation, or ablation studies on this stage are mentioned; if LLM errors are common in complex multi-character scenes, they would propagate and undermine attribution of gains to the bounded attention masks, identity-consistent self-attention, or region-aware cross-attention.

Authors: We acknowledge the need to validate the LLM stage to properly attribute contributions. In the revised manuscript, we will include an analysis of the LLM outputs, such as error rates from a human study on a sample of generated descriptions and dialogues. We will also present an ablation study that isolates the impact of the bounded attention, identity-consistent self-attention, and region-aware cross-attention mechanisms to show their added value beyond the LLM-generated inputs.
Revision: yes

Circularity Check

0 steps flagged

No circularity: pipeline relies on external pre-trained models and independent mechanisms

The paper presents a multi-stage pipeline that invokes a pre-trained LLM for per-frame descriptions and dialogue assignment via in-context learning, applies custom bounded attention masks plus identity-consistent self-attention and region-aware cross-attention inside the diffusion process, and uses CLIPSeg for bubble assignment. No equations, fitted parameters, or self-citations are shown that would make any claimed output (consistency, noise reduction, dialogue accuracy) equivalent to the inputs by construction. The experimental outperformance claims rest on comparisons against external baselines rather than tautological re-derivations of the method's own definitions or prior self-work. This is the normal case of a self-contained engineering contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests primarily on the effectiveness of standard pre-trained components and the authors' proposed attention modifications; no free parameters or new invented entities are introduced in the abstract description.

axioms (1)

Domain assumption: A pre-trained LLM can generate accurate per-frame descriptions, character details, and dialogues via in-context learning.
This premise is invoked as the first step of the framework in the abstract.

pith-pipeline@v0.9.0 · 5689 in / 1236 out tokens · 39020 ms · 2026-05-18T18:57:48.741341+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: We introduce a per-box bounded attention-based mask latent generation technique... identity-consistent self-attention (ICSA) mechanism... region-aware cross-attention (RACA)... CLIPSeg

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: Experimental results demonstrate that TaleDiffusion outperforms existing methods in consistency, noise reduction, and dialogue rendering.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

DocRevive: A Unified Pipeline for Document Text Restoration
cs.CV · 2026-04 · unverdicted · novelty 5.0

DocRevive builds a unified pipeline using OCR, image analysis, language models, and diffusion to reconstruct degraded document text, backed by a 30k-image synthetic dataset and the UCSM metric.

DocRevive: A Unified Pipeline for Document Text Restoration
cs.CV · 2026-04 · unverdicted · novelty 5.0

A unified pipeline using OCR, inpainting, and diffusion models restores text in degraded documents on a new synthetic benchmark dataset, evaluated with the proposed UCSM metric.

Reference graph

Works this paper leans on

104 extracted references · 104 canonical work pages · cited by 1 Pith paper · 6 internal anchors

[1] Claude 3.5 Sonnet — anthropic.com. https://www.anthropic.com/claude/sonnet. [Accessed 12-11-2024].

[2] renderartist/retrocomicflux · Hugging Face — huggingface.co. https://huggingface.co/renderartist/retrocomicflux, 2024. [Accessed 12-11-2024].

[3] Xenova/gpt-3.5-turbo · Hugging Face — huggingface.co. https://huggingface.co/Xenova/gpt-3.5-turbo, 2024. [Accessed 12-11-2024].

[4] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774,

[5] Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. Spice: Semantic propositional image caption evaluation. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part V 14, pages 382–398. Springer, 2016.

[6] Omri Avrahami, Amir Hertz, Yael Vinker, Moab Arar, Shlomi Fruchter, Ohad Fried, Daniel Cohen-Or, and Dani Lischinski. The chosen one: Consistent characters in text-to-image diffusion models. In ACM SIGGRAPH 2024 Conference Papers, pages 1–12, 2024.

[7] Satanjeev Banerjee and Alon Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pages 65–72, 2005.

[8] Bin Cao, Jianhao Yuan, Yexin Liu, Jian Li, Shuyang Sun, Jing Liu, and Bo Zhao. Synartifact: Classifying and alleviating artifacts in synthetic images via vision-language model. arXiv preprint arXiv:2402.18068, 2024.

[9] Ying Cao, Antoni B Chan, and Rynson WH Lau. Automatic stylistic manga layout. ACM Transactions on Graphics (TOG), 31(6):1–10, 2012.

[10] Lichang Chen, Khalid Saifullah, Ming Li, Tianyi Zhou, and Heng Huang. Claude2-alpaca: Instruction tuning datasets distilled from claude. https://github.com/Lichang-Chen/claude2-alpaca, 2023.

[11] Siyu Chen, Dengjie Li, Zenghao Bao, Yao Zhou, Lingfeng Tan, Yujie Zhong, and Zheng Zhao. Manga generation via layout-controllable diffusion. arXiv preprint arXiv:2412.19303, 2024.

[12] Junhao Cheng, Xi Lu, Hanhui Li, Khun Loun Zai, Baiqiao Yin, Yuhao Cheng, Yiqiang Yan, and Xiaodan Liang. Autostudio: Crafting consistent subjects in multi-turn interactive image generation. arXiv preprint arXiv:2406.01388,

[13] Junhao Cheng, Baiqiao Yin, Kaixin Cai, Minbin Huang, Hanhui Li, Yuxin He, Xi Lu, Yue Li, Yifei Li, Yuhao Cheng, et al. Theatergen: Character management with llm for consistent multi-turn image generation. arXiv preprint arXiv:2404.18919, 2024.

[14] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2(3):6,

[15] Omer Dahary, Or Patashnik, Kfir Aberman, and Daniel Cohen-Or. Be yourself: Bounded attention for multi-subject text-to-image generation. arXiv preprint arXiv:2403.16990, 2(5), 2024.

[16] Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Tianyu Liu, Baobao Chang, Xu Sun, Lei Li, and Zhifang Sui. A survey on in-context learning. arXiv, 2024.

[17] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.

[18] Leon A Gatys, Alexander S Ecker, Matthias Bethge, Aaron Hertzmann, and Eli Shechtman. Controlling perceptual factors in neural style transfer. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3985–3993, 2017.

[19] Michal Geyer, Omer Bar-Tal, Shai Bagon, and Tali Dekel. Tokenflow: Consistent diffusion features for consistent video editing. arXiv preprint arXiv:2307.10373, 2023.

[20] Yuan Gong, Youxin Pang, Xiaodong Cun, Menghan Xia, Yingqing He, Haoxin Chen, Longyue Wang, Yong Zhang, Xintao Wang, Ying Shan, and Yujiu Yang. Interactive story visualization with multiple characters. In SIGGRAPH Asia 2023 Conference Papers, SA ’23, New York, NY, USA,

[21] Association for Computing Machinery.

[22] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. In NeurIPS, 2014.

work page 2014
[23]

Mix-of-show: Decentralized low- rank adaptation for multi-concept customization of diffusion models

Yuchao Gu, Xintao Wang, Jay Zhangjie Wu, Yujun Shi, Yun- peng Chen, Zihan Fan, Wuyou Xiao, Rui Zhao, Shuning Chang, Weijia Wu, et al. Mix-of-show: Decentralized low- rank adaptation for multi-concept customization of diffusion models. Advances in Neural Information Processing Sys- tems, 36, 2024. 5

work page 2024
[24]

ebdtheque: a representative database of comics

Cl ´ement Gu ´erin, Christophe Rigaud, Antoine Mercier, Farid Ammar-Boudjelal, Karell Bertet, Alain Bouju, Jean- Christophe Burie, Georges Louis, Jean-Marc Ogier, and Ar- naud Revel. ebdtheque: a representative database of comics. In 2013 12th International Conference on Document Analy- sis and Recognition, pages 1145–1149. IEEE, 2013. 8

work page 2013
[25]

Imagine this! scripts to composi- tions to videos

Tanmay Gupta, Dustin Schwenk, Ali Farhadi, Derek Hoiem, and Aniruddha Kembhavi. Imagine this! scripts to composi- tions to videos. In Proceedings of the European conference on computer vision (ECCV) , pages 598–613, 2018. 2

work page 2018
[26]

Textdescriptives: A python package for calculating a large variety of metrics from text

Lasse Hansen, Ludvig Renbo Olsen, and Kenneth Enevold- sen. Textdescriptives: A python package for calculating a large variety of metrics from text. Journal of Open Source Software, 8(84):5153, Apr. 2023. 8

work page 2023
[27]

Improving multi-subject consistency in open-domain image genera- tion with isolation and reposition attention

Huiguo He, Qiuyue Wang, Yuan Zhou, Yuxuan Cai, Hongyang Chao, Jian Yin, and Huan Yang. Improving multi-subject consistency in open-domain image genera- tion with isolation and reposition attention. arXiv preprint arXiv:2411.19261, 2024. 2

work page arXiv 2024
[28]

Dreamstory: Open-domain story visualiza- tion by llm-guided multi-subject consistent diffusion.arXiv preprint arXiv:2407.12899, 2024

Huiguo He, Huan Yang, Zixi Tuo, Yuan Zhou, Qiuyue Wang, Yuhang Zhang, Zeyu Liu, Wenhao Huang, Hongyang Chao, and Jian Yin. Dreamstory: Open-domain story visualiza- tion by llm-guided multi-subject consistent diffusion. arXiv preprint arXiv:2407.12899, 2024. 2, 3, 6, 1

work page arXiv 2024
[29]

Gans trained by a two time-scale update rule converge to a local nash equilib- rium

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilib- rium. Advances in neural information processing systems , 30, 2017. 6

work page 2017
[30]

Inferring semantic layout for hierarchical text- to-image synthesis

Seunghoon Hong, Dingdong Yang, Jongwook Choi, and Honglak Lee. Inferring semantic layout for hierarchical text- to-image synthesis. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 7986– 7994, 2018. 2

work page 2018
[31]

Image quality metrics: Psnr vs

Alain Hore and Djemel Ziou. Image quality metrics: Psnr vs. ssim. In 2010 20th international conference on pattern recognition, pages 2366–2369. IEEE, 2010. 6

work page 2010
[32]

Jay Hosler and K. B. Boomer. Are comic books an effec- tive way to engage nonmajors in learning and appreciating science? CBE—Life Sciences Education , 2011. 1

work page 2011
[33]

Animate anyone: Consistent and controllable image- to-video synthesis for character animation

Li Hu. Animate anyone: Consistent and controllable image- to-video synthesis for character animation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8153–8163, 2024. 3

work page 2024
[34]

Identity decoupling for multi-subject per- sonalization of text-to-image models

Sangwon Jang, Jaehyeong Jo, Kimin Lee, and Sung Ju Hwang. Identity decoupling for multi-subject per- sonalization of text-to-image models. arXiv preprint arXiv:2404.04243, 2024. 3

work page arXiv 2024
[35]

Content-aware video2comics with manga-style layout

Guangmei Jing, Yongtao Hu, Yanwen Guo, Yizhou Yu, and Wenping Wang. Content-aware video2comics with manga-style layout. IEEE Transactions on Multimedia , 17(12):2122–2133, 2015. 2

work page 2015
[36]

Black Forest Labs. Flux. https://github.com/ black-forest-labs/flux, 2024. 8

work page 2024
[37]

Blip-diffusion: Pre- trained subject representation for controllable text-to-image generation and editing

Dongxu Li, Junnan Li, and Steven Hoi. Blip-diffusion: Pre- trained subject representation for controllable text-to-image generation and editing. Advances in Neural Information Pro- cessing Systems, 36, 2024. 4, 5, 6, 8

work page 2024
[38]

Blip: Bootstrapping language-image pre-training for uni- fied vision-language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for uni- fied vision-language understanding and generation. In In- ternational Conference on Machine Learning, pages 12888– 12900. PMLR, 2022. 6

work page 2022
[39]

Unbounded: A generative infinite game of character life simulation

Jialu Li, Yuanzhen Li, Neal Wadhwa, Yael Pritch, David E Jacobs, Michael Rubinstein, Mohit Bansal, and Nataniel Ruiz. Unbounded: A generative infinite game of character life simulation. arXiv preprint arXiv:2410.18975, 2024. 2

work page arXiv 2024
[40]

Storygan: A sequential conditional gan for story vi- sualization

Yitong Li, Zhe Gan, Yelong Shen, Jingjing Liu, Yu Cheng, Yuexin Wu, Lawrence Carin, David Carlson, and Jianfeng Gao. Storygan: A sequential conditional gan for story vi- sualization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages 6329–6338,

work page
[41]

Gligen: Open-set grounded text-to-image generation

Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jian- wei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22511–22521, 2023. 4, 1

work page 2023
[42]

Photomaker: Customizing re- alistic human photos via stacked id embedding

Zhen Li, Mingdeng Cao, Xintao Wang, Zhongang Qi, Ming- Ming Cheng, and Ying Shan. Photomaker: Customizing re- alistic human photos via stacked id embedding. In Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8640–8650, 2024. 4, 6

work page 2024
[43]

Rouge: A package for automatic evaluation of summaries

Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out , pages 74–81, 2004. 2, 3

work page 2004
[44]

Evaluating text-to-visual generation with image-to-text gen- eration

Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, and Deva Ramanan. Evaluating text-to-visual generation with image-to-text gen- eration. In European Conference on Computer Vision, pages 366–384. Springer, 2025. 7, 8, 6

work page 2025
[45]

Intelligent grimm-open-ended visual storytelling via latent diffusion models

Chang Liu, Haoning Wu, Yujie Zhong, Xiaoyun Zhang, Yan- feng Wang, and Weidi Xie. Intelligent grimm-open-ended visual storytelling via latent diffusion models. In Proceed- 10 ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6190–6200, 2024. 2, 6, 7, 8, 10

work page 2024
[46]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024. 4

work page 2024
[47]

One-prompt-one-story: Free-lunch consistent text-to-image generation using a single prompt

Tao Liu, Kai Wang, Senmao Li, Joost van de Weijer, Fa- had Shahbaz Khan, Shiqi Yang, Yaxing Wang, Jian Yang, and Ming-Ming Cheng. One-prompt-one-story: Free-lunch consistent text-to-image generation using a single prompt. In The Thirteenth International Conference on Learning Repre- sentations, 2025. 6, 7

work page 2025
[48]

Image segmenta- tion using text and image prompts

Timo L ¨uddecke and Alexander Ecker. Image segmenta- tion using text and image prompts. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7086–7096, 2022. 2, 3, 6, 5, 9

work page 2022
[49]

Integrating visuospa- tial, linguistic and commonsense structure into story visual- ization

Adyasha Maharana and Mohit Bansal. Integrating visuospa- tial, linguistic and commonsense structure into story visual- ization. arXiv preprint arXiv:2110.10834, 2021. 2

work page arXiv 2021
[50]

Storydall-e: Adapting pretrained text-to-image transformers for story continuation

Adyasha Maharana, Darryl Hannan, and Mohit Bansal. Storydall-e: Adapting pretrained text-to-image transformers for story continuation. InEuropean Conference on Computer Vision, pages 70–87. Springer, 2022. 2

work page 2022
[51]

Story-adapter: A training-free iterative framework for long story visualization

Jiawei Mao, Xiaoke Huang, Yunfei Xie, Yuanqi Chang, Mude Hui, Bingjie Xu, and Yuyin Zhou. Story-adapter: A training-free iterative framework for long story visualization. In arXiv, 2024. 7

work page 2024
[52]

Digital comics image indexing based on deep learn- ing

Nhu-Van Nguyen, Christophe Rigaud, and Jean-Christophe Burie. Digital comics image indexing based on deep learn- ing. Journal of Imaging, 4(7):89, 2018. 3

work page 2018
[53]

Beyond semantic distance: Automated scoring of divergent thinking greatly improves with large language models

Peter Organisciak, Selcuk Acar, Denis Dumas, and Kelly Berthiaume. Beyond semantic distance: Automated scoring of divergent thinking greatly improves with large language models. Thinking Skills and Creativity, 49:101356, 2023. 8

work page 2023
[54]

Synthesizing coherent story with auto-regressive la- tent diffusion models

Xichen Pan, Pengda Qin, Yuhong Li, Hui Xue, and Wenhu Chen. Synthesizing coherent story with auto-regressive la- tent diffusion models. In Proceedings of the IEEE/CVF Win- ter Conference on Applications of Computer Vision , pages 2920–2930, 2024. 2

work page 2024
[55]

Pytorch: An im- perative style, high-performance deep learning library

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An im- perative style, high-performance deep learning library. Ad- vances in neural information processing systems , 32, 2019. 1

work page 2019
[56]

Comic- gan: Text-to-comic generative adversarial network

Ben Proven-Bessel, Zilong Zhao, and Lydia Chen. Comic- gan: Text-to-comic generative adversarial network. arXiv preprint arXiv:2109.09120, 2021. 2

work page arXiv 2021
[57]

Make-a-story: Visual memory conditioned consistent story generation

Tanzila Rahman, Hsin-Ying Lee, Jian Ren, Sergey Tulyakov, Shweta Mahajan, and Leonid Sigal. Make-a-story: Visual memory conditioned consistent story generation. In Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2493–2502, 2023. 2

work page 2023
[58]

Grounded sam: Assembling open-world models for diverse visual tasks

Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kun- chang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, Zhaoyang Zeng, Hao Zhang, Feng Li, Jie Yang, Hongyang Li, Qing Jiang, and Lei Zhang. Grounded sam: Assembling open-world models for diverse visual tasks. arXiv, 2024. 4

work page 2024
[59]

High-resolution image syn- thesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj¨orn Ommer. High-resolution image syn- thesis with latent diffusion models. In CVPR, 2022. 2

work page 2022
[60]

Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation

Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages 22500– 22510, 2023. 2, 4, 5, 6, 8

work page 2023
[61]

Improved aesthetic predictor, 2022

Christoph Schuhmann. Improved aesthetic predictor, 2022. 7, 8, 6

work page 2022
[62]

Storygpt-v: Large language models as consistent story visualizers.arXiv, 2023

Xiaoqian Shen and Mohamed Elhoseiny. Storygpt-v: Large language models as consistent story visualizers.arXiv, 2023. 2

work page 2023
[63]

Storybooth: Training-free multi-subject consistency for improved visual storytelling

Jaskirat Singh, Junshen Kevin Chen, Jonas Kohler, and Michael Cohen. Storybooth: Training-free multi-subject consistency for improved visual storytelling. arXiv preprint arXiv:2504.05800, 2025. 2, 3

work page arXiv 2025
[64]

Text2scene: Generating compositional scenes from textual descriptions

Fuwen Tan, Song Feng, and Vicente Ordonez. Text2scene: Generating compositional scenes from textual descriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6710–6719, 2019. 2, 3

work page 2019
[65]

arXiv preprint arXiv:2210.04885 , year=

Raphael Tang, Linqing Liu, Akshat Pandey, Zhiying Jiang, Gefei Yang, Karun Kumar, Pontus Stenetorp, Jimmy Lin, and Ferhan Ture. What the daam: Interpreting stable diffu- sion using cross attention. arXiv preprint arXiv:2210.04885,

work page arXiv
[66]

Storyimager: A unified and efficient frame- work for coherent story visualization and completion

Ming Tao, Bing-Kun Bao, Hao Tang, Yaowei Wang, and Changsheng Xu. Storyimager: A unified and efficient frame- work for coherent story visualization and completion. arXiv preprint arXiv:2404.05979, 2024. 2, 6

work page arXiv 2024
[67]

Science comics as tools for science educa- tion and communication: a brief, exploratory study

Mi ´co Tatalovi´c. Science comics as tools for science educa- tion and communication: a brief, exploratory study. JCOM, 8(4), 2009. 1

work page 2009
[68]

Training-free consis- tent text-to-image generation

Yoad Tewel, Omri Kaduri, Rinon Gal, Yoni Kasten, Lior Wolf, Gal Chechik, and Yuval Atzmon. Training-free consis- tent text-to-image generation. ACM Transactions on Graph- ics (TOG), 43(4):1–18, 2024. 2, 5, 6, 7, 3, 8

work page 2024
[69]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023. 8, 9, 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2023
[70]

Cider: Consensus-based image description evalua- tion

Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evalua- tion. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575, 2015. 2, 3

work page 2015
[71]

Comix: A comprehensive benchmark for multi-task comic understanding

Emanuele Vivoli, Marco Bertini, and Dimosthenis Karatzas. Comix: A comprehensive benchmark for multi-task comic understanding. arXiv preprint arXiv:2407.03550, 2024. 8

work page arXiv 2024
[72]

Cdac: Cross-domain attention consistency in trans- former for domain adaptive semantic segmentation

Kaihong Wang, Donghyun Kim, Rogerio Feris, and Margrit Betke. Cdac: Cross-domain attention consistency in trans- former for domain adaptive semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11519–11529, 2023. 8 11

work page 2023
[73]

Autostory: Generating diverse storytelling images with minimal human efforts

Wen Wang, Canyu Zhao, Hao Chen, Zhekai Chen, Kecheng Zheng, and Chunhua Shen. Autostory: Generating diverse storytelling images with minimal human efforts. Interna- tional Journal of Computer Vision , pages 1–22, 2024. 2, 3, 4

work page 2024
[74]

Large language models are latent variable models: Explaining and finding good demonstra- tions for in-context learning

Xinyi Wang, Wanrong Zhu, Michael Saxon, Mark Steyvers, and William Yang Wang. Large language models are latent variable models: Explaining and finding good demonstra- tions for in-context learning. Advances in Neural Informa- tion Processing Systems, 36, 2024. 3, 9

work page 2024
[75]

Elite: Encoding visual con- cepts into textual embeddings for customized text-to-image generation

Yuxiang Wei, Yabo Zhang, Zhilong Ji, Jinfeng Bai, Lei Zhang, and Wangmeng Zuo. Elite: Encoding visual con- cepts into textual embeddings for customized text-to-image generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15943–15953, 2023. 2, 4, 5, 6

work page 2023
[76]

Diffsensei: Bridging multi- modal llms and diffusion models for customized manga gen- eration

Jianzong Wu, Chao Tang, Jingbo Wang, Yanhong Zeng, Xi- angtai Li, and Yunhai Tong. Diffsensei: Bridging multi- modal llms and diffusion models for customized manga gen- eration. arXiv preprint arXiv:2412.07589 , 2024. 2, 3, 6, 7, 8

work page arXiv 2024
[77]

Human preference score: Better aligning text- to-image models with human preference

Xiaoshi Wu, Keqiang Sun, Feng Zhu, Rui Zhao, and Hong- sheng Li. Human preference score: Better aligning text- to-image models with human preference. In Proceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 2096–2105, 2023. 7, 8, 6

work page 2096
[78]

Auto- matic comic generation with stylistic multi-page layouts and emotion-driven text balloon generation

Xin Yang, Zongliang Ma, Letian Yu, Ying Cao, Baocai Yin, Xiaopeng Wei, Qiang Zhang, and Rynson WH Lau. Auto- matic comic generation with stylistic multi-page layouts and emotion-driven text balloon generation. ACM Transactions on Multimedia Computing, Communications, and Applica- tions (TOMM), 17(2):1–19, 2021. 2, 1

work page 2021
[79]

IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip- adapter: Text compatible image prompt adapter for text-to- image diffusion models. arXiv preprint arXiv:2308.06721 ,

work page internal anchor Pith review Pith/arXiv arXiv
[80]

Obj2text: Generating visually descriptive language from object layouts

Xuwang Yin and Vicente Ordonez. Obj2text: Generating visually descriptive language from object layouts. In Pro- ceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 177–187, 2017. 3, 2

work page 2017

[9] [9]

Auto- matic stylistic manga layout.ACM Transactions on Graphics (TOG), 31(6):1–10, 2012

Ying Cao, Antoni B Chan, and Rynson WH Lau. Auto- matic stylistic manga layout.ACM Transactions on Graphics (TOG), 31(6):1–10, 2012. 2

work page 2012

[10] [10]

Claude2-alpaca: Instruction tuning datasets distilled from claude

Lichang Chen, Khalid Saifullah, Ming Li, Tianyi Zhou, and Heng Huang. Claude2-alpaca: Instruction tuning datasets distilled from claude. https://github.com/ Lichang-Chen/claude2-alpaca, 2023. 8

work page 2023

[11] [11]

Manga genera- tion via layout-controllable diffusion

Siyu Chen, Dengjie Li, Zenghao Bao, Yao Zhou, Lingfeng Tan, Yujie Zhong, and Zheng Zhao. Manga genera- tion via layout-controllable diffusion. In arXiv preprint arxiv:2412.19303, 2024. 2, 3

work page arXiv 2024

[12] [12]

arXiv preprint arXiv:2406.01388 , year=

Junhao Cheng, Xi Lu, Hanhui Li, Khun Loun Zai, Baiqiao Yin, Yuhao Cheng, Yiqiang Yan, and Xiaodan Liang. Au- tostudio: Crafting consistent subjects in multi-turn interac- tive image generation. arXiv preprint arXiv:2406.01388 ,

work page arXiv

[13] [13]

Theatergen: Character management with llm for consistent multi-turn image generation

Junhao Cheng, Baiqiao Yin, Kaixin Cai, Minbin Huang, Hanhui Li, Yuxin He, Xi Lu, Yue Li, Yifei Li, Yuhao Cheng, et al. Theatergen: Character management with llm for consistent multi-turn image generation. arXiv preprint arXiv:2404.18919, 2024. 2

work page arXiv 2024

[14] [14]

Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna. lmsys. org (accessed 14 April 2023) , 2(3):6,

work page 2023

[15] [15]

Be yourself: Bounded attention for multi-subject text-to-image generation

Omer Dahary, Or Patashnik, Kfir Aberman, and Daniel Cohen-Or. Be yourself: Bounded attention for multi-subject text-to-image generation. arXiv preprint arXiv:2403.16990, 2(5), 2024. 2

work page arXiv 2024

[16] [16]

A survey on in-context learning

Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Tianyu Liu, Baobao Chang, Xu Sun, Lei Li, and Zhifang Sui. A survey on in-context learning. arXiv, 2024. 3

work page 2024

[17] [17]

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patash- nik, Amit H Bermano, Gal Chechik, and Daniel Cohen- Or. An image is worth one word: Personalizing text-to- image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022. 4, 5, 6, 8

work page internal anchor Pith review Pith/arXiv arXiv 2022

[18] [18]

Controlling perceptual fac- tors in neural style transfer

Leon A Gatys, Alexander S Ecker, Matthias Bethge, Aaron Hertzmann, and Eli Shechtman. Controlling perceptual fac- tors in neural style transfer. In Proceedings of the IEEE con- ference on computer vision and pattern recognition , pages 3985–3993, 2017. 5 9

work page 2017

[19] [19]

TokenFlow: Consistent Diffusion Features for Consistent Video Editing

Michal Geyer, Omer Bar-Tal, Shai Bagon, and Tali Dekel. Tokenflow: Consistent diffusion features for consistent video editing. arXiv preprint arXiv:2307.10373, 2023. 3

work page internal anchor Pith review Pith/arXiv arXiv 2023

[20] [20]

Interactive story visualization with multiple characters

Yuan Gong, Youxin Pang, Xiaodong Cun, Menghan Xia, Yingqing He, Haoxin Chen, Longyue Wang, Yong Zhang, Xintao Wang, Ying Shan, and Yujiu Yang. Interactive story visualization with multiple characters. In SIGGRAPH Asia 2023 Conference Papers , SA ’23, New York, NY , USA,

work page 2023

[21] [21]

Association for Computing Machinery. 6, 7

work page

[22] [22]

Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio

Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. In NeurIPS, 2014. 2

work page 2014

[23] [23]

Mix-of-show: Decentralized low- rank adaptation for multi-concept customization of diffusion models

Yuchao Gu, Xintao Wang, Jay Zhangjie Wu, Yujun Shi, Yun- peng Chen, Zihan Fan, Wuyou Xiao, Rui Zhao, Shuning Chang, Weijia Wu, et al. Mix-of-show: Decentralized low- rank adaptation for multi-concept customization of diffusion models. Advances in Neural Information Processing Sys- tems, 36, 2024. 5

work page 2024

[24] [24]

ebdtheque: a representative database of comics

Cl ´ement Gu ´erin, Christophe Rigaud, Antoine Mercier, Farid Ammar-Boudjelal, Karell Bertet, Alain Bouju, Jean- Christophe Burie, Georges Louis, Jean-Marc Ogier, and Ar- naud Revel. ebdtheque: a representative database of comics. In 2013 12th International Conference on Document Analy- sis and Recognition, pages 1145–1149. IEEE, 2013. 8

work page 2013

[25] [25]

Imagine this! scripts to composi- tions to videos

Tanmay Gupta, Dustin Schwenk, Ali Farhadi, Derek Hoiem, and Aniruddha Kembhavi. Imagine this! scripts to composi- tions to videos. In Proceedings of the European conference on computer vision (ECCV) , pages 598–613, 2018. 2

work page 2018

[26] [26]

Textdescriptives: A python package for calculating a large variety of metrics from text

Lasse Hansen, Ludvig Renbo Olsen, and Kenneth Enevold- sen. Textdescriptives: A python package for calculating a large variety of metrics from text. Journal of Open Source Software, 8(84):5153, Apr. 2023. 8

work page 2023

[27] [27]

Improving multi-subject consistency in open-domain image genera- tion with isolation and reposition attention

Huiguo He, Qiuyue Wang, Yuan Zhou, Yuxuan Cai, Hongyang Chao, Jian Yin, and Huan Yang. Improving multi-subject consistency in open-domain image genera- tion with isolation and reposition attention. arXiv preprint arXiv:2411.19261, 2024. 2

work page arXiv 2024

[28] [28]

Dreamstory: Open-domain story visualiza- tion by llm-guided multi-subject consistent diffusion.arXiv preprint arXiv:2407.12899, 2024

Huiguo He, Huan Yang, Zixi Tuo, Yuan Zhou, Qiuyue Wang, Yuhang Zhang, Zeyu Liu, Wenhao Huang, Hongyang Chao, and Jian Yin. Dreamstory: Open-domain story visualiza- tion by llm-guided multi-subject consistent diffusion. arXiv preprint arXiv:2407.12899, 2024. 2, 3, 6, 1

work page arXiv 2024

[29] [29]

Gans trained by a two time-scale update rule converge to a local nash equilib- rium

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilib- rium. Advances in neural information processing systems , 30, 2017. 6

work page 2017

[30] [30]

Inferring semantic layout for hierarchical text- to-image synthesis

Seunghoon Hong, Dingdong Yang, Jongwook Choi, and Honglak Lee. Inferring semantic layout for hierarchical text- to-image synthesis. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 7986– 7994, 2018. 2

work page 2018

[31] [31]

Image quality metrics: Psnr vs

Alain Hore and Djemel Ziou. Image quality metrics: Psnr vs. ssim. In 2010 20th international conference on pattern recognition, pages 2366–2369. IEEE, 2010. 6

work page 2010

[32] [32]

Jay Hosler and K. B. Boomer. Are comic books an effec- tive way to engage nonmajors in learning and appreciating science? CBE—Life Sciences Education , 2011. 1

work page 2011

[33] [33]

Animate anyone: Consistent and controllable image- to-video synthesis for character animation

Li Hu. Animate anyone: Consistent and controllable image- to-video synthesis for character animation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8153–8163, 2024. 3

work page 2024

[34] [34]

Identity decoupling for multi-subject per- sonalization of text-to-image models

Sangwon Jang, Jaehyeong Jo, Kimin Lee, and Sung Ju Hwang. Identity decoupling for multi-subject per- sonalization of text-to-image models. arXiv preprint arXiv:2404.04243, 2024. 3

work page arXiv 2024

[35] [35]

Content-aware video2comics with manga-style layout

Guangmei Jing, Yongtao Hu, Yanwen Guo, Yizhou Yu, and Wenping Wang. Content-aware video2comics with manga-style layout. IEEE Transactions on Multimedia , 17(12):2122–2133, 2015. 2

work page 2015

[36] [36]

Black Forest Labs. Flux. https://github.com/ black-forest-labs/flux, 2024. 8

work page 2024

[37] [37]

Blip-diffusion: Pre- trained subject representation for controllable text-to-image generation and editing

Dongxu Li, Junnan Li, and Steven Hoi. Blip-diffusion: Pre- trained subject representation for controllable text-to-image generation and editing. Advances in Neural Information Pro- cessing Systems, 36, 2024. 4, 5, 6, 8

work page 2024

[38] [38]

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022. 6

work page 2022

[39] [39]

Unbounded: A generative infinite game of character life simulation

Jialu Li, Yuanzhen Li, Neal Wadhwa, Yael Pritch, David E Jacobs, Michael Rubinstein, Mohit Bansal, and Nataniel Ruiz. Unbounded: A generative infinite game of character life simulation. arXiv preprint arXiv:2410.18975, 2024. 2

work page arXiv 2024

[40] [40]

Storygan: A sequential conditional gan for story visualization

Yitong Li, Zhe Gan, Yelong Shen, Jingjing Liu, Yu Cheng, Yuexin Wu, Lawrence Carin, David Carlson, and Jianfeng Gao. Storygan: A sequential conditional gan for story visualization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6329–6338,

work page

[41] [41]

Gligen: Open-set grounded text-to-image generation

Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22511–22521, 2023. 4, 1

work page 2023

[42] [42]

Photomaker: Customizing realistic human photos via stacked id embedding

Zhen Li, Mingdeng Cao, Xintao Wang, Zhongang Qi, Ming-Ming Cheng, and Ying Shan. Photomaker: Customizing realistic human photos via stacked id embedding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8640–8650, 2024. 4, 6

work page 2024

[43] [43]

Rouge: A package for automatic evaluation of summaries

Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81, 2004. 2, 3

work page 2004

[44] [44]

Evaluating text-to-visual generation with image-to-text generation

Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, and Deva Ramanan. Evaluating text-to-visual generation with image-to-text generation. In European Conference on Computer Vision, pages 366–384. Springer, 2025. 7, 8, 6

work page 2025

[45] [45]

Intelligent grimm-open-ended visual storytelling via latent diffusion models

Chang Liu, Haoning Wu, Yujie Zhong, Xiaoyun Zhang, Yanfeng Wang, and Weidi Xie. Intelligent grimm-open-ended visual storytelling via latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6190–6200, 2024. 2, 6, 7, 8, 10

work page 2024

[46] [46]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024. 4

work page 2024

[47] [47]

One-prompt-one-story: Free-lunch consistent text-to-image generation using a single prompt

Tao Liu, Kai Wang, Senmao Li, Joost van de Weijer, Fahad Shahbaz Khan, Shiqi Yang, Yaxing Wang, Jian Yang, and Ming-Ming Cheng. One-prompt-one-story: Free-lunch consistent text-to-image generation using a single prompt. In The Thirteenth International Conference on Learning Representations, 2025. 6, 7

work page 2025

[48] [48]

Image segmentation using text and image prompts

Timo Lüddecke and Alexander Ecker. Image segmentation using text and image prompts. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7086–7096, 2022. 2, 3, 6, 5, 9

work page 2022

[49] [49]

Integrating visuospatial, linguistic and commonsense structure into story visualization

Adyasha Maharana and Mohit Bansal. Integrating visuospatial, linguistic and commonsense structure into story visualization. arXiv preprint arXiv:2110.10834, 2021. 2

work page arXiv 2021

[50] [50]

Storydall-e: Adapting pretrained text-to-image transformers for story continuation

Adyasha Maharana, Darryl Hannan, and Mohit Bansal. Storydall-e: Adapting pretrained text-to-image transformers for story continuation. In European Conference on Computer Vision, pages 70–87. Springer, 2022. 2

work page 2022

[51] [51]

Story-adapter: A training-free iterative framework for long story visualization

Jiawei Mao, Xiaoke Huang, Yunfei Xie, Yuanqi Chang, Mude Hui, Bingjie Xu, and Yuyin Zhou. Story-adapter: A training-free iterative framework for long story visualization. In arXiv, 2024. 7

work page 2024

[52] [52]

Digital comics image indexing based on deep learning

Nhu-Van Nguyen, Christophe Rigaud, and Jean-Christophe Burie. Digital comics image indexing based on deep learning. Journal of Imaging, 4(7):89, 2018. 3

work page 2018

[53] [53]

Beyond semantic distance: Automated scoring of divergent thinking greatly improves with large language models

Peter Organisciak, Selcuk Acar, Denis Dumas, and Kelly Berthiaume. Beyond semantic distance: Automated scoring of divergent thinking greatly improves with large language models. Thinking Skills and Creativity, 49:101356, 2023. 8

work page 2023

[54] [54]

Synthesizing coherent story with auto-regressive latent diffusion models

Xichen Pan, Pengda Qin, Yuhong Li, Hui Xue, and Wenhu Chen. Synthesizing coherent story with auto-regressive latent diffusion models. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2920–2930, 2024. 2

work page 2024

[55] [55]

Pytorch: An imperative style, high-performance deep learning library

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019. 1

work page 2019

[56] [56]

Comicgan: Text-to-comic generative adversarial network

Ben Proven-Bessel, Zilong Zhao, and Lydia Chen. Comicgan: Text-to-comic generative adversarial network. arXiv preprint arXiv:2109.09120, 2021. 2

work page arXiv 2021

[57] [57]

Make-a-story: Visual memory conditioned consistent story generation

Tanzila Rahman, Hsin-Ying Lee, Jian Ren, Sergey Tulyakov, Shweta Mahajan, and Leonid Sigal. Make-a-story: Visual memory conditioned consistent story generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2493–2502, 2023. 2

work page 2023

[58] [58]

Grounded sam: Assembling open-world models for diverse visual tasks

Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, Zhaoyang Zeng, Hao Zhang, Feng Li, Jie Yang, Hongyang Li, Qing Jiang, and Lei Zhang. Grounded sam: Assembling open-world models for diverse visual tasks. arXiv, 2024. 4

work page 2024

[59] [59]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022. 2

work page 2022

[60] [60]

Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation

Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22500–22510, 2023. 2, 4, 5, 6, 8

work page 2023

[61] [61]

Improved aesthetic predictor, 2022

Christoph Schuhmann. Improved aesthetic predictor, 2022. 7, 8, 6

work page 2022

[62] [62]

Storygpt-v: Large language models as consistent story visualizers

Xiaoqian Shen and Mohamed Elhoseiny. Storygpt-v: Large language models as consistent story visualizers. arXiv, 2023. 2

work page 2023

[63] [63]

Storybooth: Training-free multi-subject consistency for improved visual storytelling

Jaskirat Singh, Junshen Kevin Chen, Jonas Kohler, and Michael Cohen. Storybooth: Training-free multi-subject consistency for improved visual storytelling. arXiv preprint arXiv:2504.05800, 2025. 2, 3

work page arXiv 2025

[64] [64]

Text2scene: Generating compositional scenes from textual descriptions

Fuwen Tan, Song Feng, and Vicente Ordonez. Text2scene: Generating compositional scenes from textual descriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6710–6719, 2019. 2, 3

work page 2019

[65] [65]

What the daam: Interpreting stable diffusion using cross attention

Raphael Tang, Linqing Liu, Akshat Pandey, Zhiying Jiang, Gefei Yang, Karun Kumar, Pontus Stenetorp, Jimmy Lin, and Ferhan Ture. What the daam: Interpreting stable diffusion using cross attention. arXiv preprint arXiv:2210.04885,

work page arXiv

[66] [66]

Storyimager: A unified and efficient framework for coherent story visualization and completion

Ming Tao, Bing-Kun Bao, Hao Tang, Yaowei Wang, and Changsheng Xu. Storyimager: A unified and efficient framework for coherent story visualization and completion. arXiv preprint arXiv:2404.05979, 2024. 2, 6

work page arXiv 2024

[67] [67]

Science comics as tools for science education and communication: a brief, exploratory study

Mićo Tatalović. Science comics as tools for science education and communication: a brief, exploratory study. JCOM, 8(4), 2009. 1

work page 2009

[68] [68]

Training-free consistent text-to-image generation

Yoad Tewel, Omri Kaduri, Rinon Gal, Yoni Kasten, Lior Wolf, Gal Chechik, and Yuval Atzmon. Training-free consistent text-to-image generation. ACM Transactions on Graphics (TOG), 43(4):1–18, 2024. 2, 5, 6, 7, 3, 8

work page 2024

[69] [69]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023. 8, 9, 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2023

[70] [70]

Cider: Consensus-based image description evaluation

Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575, 2015. 2, 3

work page 2015

[71] [71]

Comix: A comprehensive benchmark for multi-task comic understanding

Emanuele Vivoli, Marco Bertini, and Dimosthenis Karatzas. Comix: A comprehensive benchmark for multi-task comic understanding. arXiv preprint arXiv:2407.03550, 2024. 8

work page arXiv 2024

[72] [72]

Cdac: Cross-domain attention consistency in transformer for domain adaptive semantic segmentation

Kaihong Wang, Donghyun Kim, Rogerio Feris, and Margrit Betke. Cdac: Cross-domain attention consistency in transformer for domain adaptive semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11519–11529, 2023. 8

work page 2023

[73] [73]

Autostory: Generating diverse storytelling images with minimal human efforts

Wen Wang, Canyu Zhao, Hao Chen, Zhekai Chen, Kecheng Zheng, and Chunhua Shen. Autostory: Generating diverse storytelling images with minimal human efforts. International Journal of Computer Vision, pages 1–22, 2024. 2, 3, 4

work page 2024

[74] [74]

Large language models are latent variable models: Explaining and finding good demonstrations for in-context learning

Xinyi Wang, Wanrong Zhu, Michael Saxon, Mark Steyvers, and William Yang Wang. Large language models are latent variable models: Explaining and finding good demonstrations for in-context learning. Advances in Neural Information Processing Systems, 36, 2024. 3, 9

work page 2024

[75] [75]

Elite: Encoding visual concepts into textual embeddings for customized text-to-image generation

Yuxiang Wei, Yabo Zhang, Zhilong Ji, Jinfeng Bai, Lei Zhang, and Wangmeng Zuo. Elite: Encoding visual concepts into textual embeddings for customized text-to-image generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15943–15953, 2023. 2, 4, 5, 6

work page 2023

[76] [76]

Diffsensei: Bridging multi-modal llms and diffusion models for customized manga generation

Jianzong Wu, Chao Tang, Jingbo Wang, Yanhong Zeng, Xiangtai Li, and Yunhai Tong. Diffsensei: Bridging multi-modal llms and diffusion models for customized manga generation. arXiv preprint arXiv:2412.07589, 2024. 2, 3, 6, 7, 8

work page arXiv 2024

[77] [77]

Human preference score: Better aligning text-to-image models with human preference

Xiaoshi Wu, Keqiang Sun, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score: Better aligning text-to-image models with human preference. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2096–2105, 2023. 7, 8, 6

work page 2023

[78] [78]

Automatic comic generation with stylistic multi-page layouts and emotion-driven text balloon generation

Xin Yang, Zongliang Ma, Letian Yu, Ying Cao, Baocai Yin, Xiaopeng Wei, Qiang Zhang, and Rynson WH Lau. Automatic comic generation with stylistic multi-page layouts and emotion-driven text balloon generation. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 17(2):1–19, 2021. 2, 1

work page 2021

[79] [79]

IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721,

work page internal anchor Pith review Pith/arXiv arXiv

[80] [80]

Obj2text: Generating visually descriptive language from object layouts

Xuwang Yin and Vicente Ordonez. Obj2text: Generating visually descriptive language from object layouts. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 177–187, 2017. 3, 2

work page 2017