TaleDiffusion: Multi-Character Story Generation with Dialogue Rendering
Pith reviewed 2026-05-18 18:57 UTC · model grok-4.3
The pith
TaleDiffusion generates consistent multi-character stories by planning frames with an LLM and controlling diffusion attention to keep identities stable while rendering assigned dialogues.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TaleDiffusion introduces an iterative framework that maintains character consistency and accurate dialogue assignment in multi-character story generation. It leverages a pre-trained LLM via in-context learning to generate per-frame descriptions, character details, and dialogues. A bounded attention-based per-box mask technique controls character interactions, while identity-consistent self-attention ensures consistency across frames and region-aware cross-attention handles object placement. Dialogues are rendered as bubbles and assigned using CLIPSeg, leading to better performance in consistency, noise reduction, and dialogue rendering.
What carries the argument
The combination of bounded attention-based per-box masks, identity-consistent self-attention, and region-aware cross-attention inside the diffusion process, which together enforce stable character appearances and precise dialogue placement across story frames.
If this is right
- Multi-character stories can be produced with characters that retain the same visual identity from one frame to the next without manual correction.
- Dialogue bubbles are placed and attributed automatically to the intended speaker rather than appearing randomly or mislabeled.
- Generated images exhibit fewer artifacts around character boundaries and interaction zones compared with standard diffusion story generators.
- The same pipeline can be applied to new stories simply by supplying a fresh text prompt to the language model stage.
Where Pith is reading between the lines
- Extending the per-frame planning step to include temporal ordering constraints could support longer coherent sequences beyond short stories.
- The attention control techniques might transfer to video diffusion models to reduce identity drift across many frames in animated sequences.
- Replacing the fixed CLIPSeg assignment step with a learned module trained on dialogue-to-character pairs could further reduce assignment errors.
Load-bearing premise
The framework assumes that a pre-trained LLM can reliably produce accurate per-frame descriptions, character details, and dialogue assignments via in-context learning, and that post-processing with CLIPSeg will correctly assign rendered dialogues without introducing new errors.
What would settle it
A direct visual comparison of story sequences in which the same character changes facial features or clothing between frames, or in which speech bubbles are attached to the wrong speaker, would demonstrate that the consistency and assignment mechanisms have failed.
Figures
read the original abstract
Text-to-story visualization is challenging due to the need for consistent interaction among multiple characters across frames. Existing methods struggle with character consistency, leading to artifact generation and inaccurate dialogue rendering, which results in disjointed storytelling. In response, we introduce TaleDiffusion, a novel framework for generating multi-character stories with an iterative process, maintaining character consistency, and accurate dialogue assignment via postprocessing. Given a story, we use a pre-trained LLM to generate per-frame descriptions, character details, and dialogues via in-context learning, followed by a bounded attention-based per-box mask technique to control character interactions and minimize artifacts. We then apply an identity-consistent self-attention mechanism to ensure character consistency across frames and region-aware cross-attention for precise object placement. Dialogues are also rendered as bubbles and assigned to characters via CLIPSeg. Experimental results demonstrate that TaleDiffusion outperforms existing methods in consistency, noise reduction, and dialogue rendering.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces TaleDiffusion, a framework for multi-character story visualization from text. It uses a pre-trained LLM to generate per-frame descriptions, character details, and dialogue assignments via in-context learning, followed by a diffusion model with bounded attention-based per-box masks to control interactions, identity-consistent self-attention for cross-frame consistency, region-aware cross-attention for object placement, and CLIPSeg post-processing to assign rendered dialogue bubbles. The central claim is that this pipeline outperforms prior methods in character consistency, noise reduction, and accurate dialogue rendering.
Significance. If the experimental claims hold with rigorous validation, the work could advance controllable story generation by addressing multi-character consistency and dialogue placement, which are persistent challenges in text-to-image/video pipelines. The modular use of pre-trained components (LLM + diffusion + segmentation) is a practical strength, but the absence of quantitative metrics, baselines, or ablations in the abstract limits assessment of whether the proposed attention mechanisms deliver the claimed gains beyond the LLM stage.
major comments (2)
- [Abstract] Abstract: The claim that 'Experimental results demonstrate that TaleDiffusion outperforms existing methods in consistency, noise reduction, and dialogue rendering' is unsupported by any reported metrics, baseline comparisons, dataset details, or statistical tests. This is load-bearing for the central claim of superiority and must be addressed with quantitative evidence before the contribution can be evaluated.
- [Abstract / Method overview] The pipeline's performance in consistency and dialogue rendering is predicated on the reliability of the LLM's in-context learning outputs for per-frame descriptions, character details, and dialogue assignments (as described in the abstract). No error rates, human validation, or ablation studies on this stage are mentioned; if LLM errors are common in complex multi-character scenes, they would propagate and undermine attribution of gains to the bounded attention masks, identity-consistent self-attention, or region-aware cross-attention.
minor comments (1)
- [Abstract] The abstract refers to 'postprocessing' for dialogue assignment without specifying the exact CLIPSeg integration or failure modes (e.g., misassignment under occlusion).
Simulated Author's Rebuttal
We are grateful to the referee for highlighting areas where the empirical validation of TaleDiffusion can be strengthened. We address the concerns regarding the abstract claims and the LLM component below, and have made revisions to incorporate additional quantitative evidence and analysis.
read point-by-point responses
-
Referee: [Abstract] Abstract: The claim that 'Experimental results demonstrate that TaleDiffusion outperforms existing methods in consistency, noise reduction, and dialogue rendering' is unsupported by any reported metrics, baseline comparisons, dataset details, or statistical tests. This is load-bearing for the central claim of superiority and must be addressed with quantitative evidence before the contribution can be evaluated.
Authors: We agree that the abstract's claim requires better support within the abstract itself. We will revise the abstract to summarize the key quantitative results from Section 4, including baseline comparisons and the specific metrics used for consistency, noise reduction, and dialogue rendering. Dataset details are already provided in the experiments section, and we will ensure statistical tests are highlighted. revision: yes
-
Referee: [Abstract / Method overview] The pipeline's performance in consistency and dialogue rendering is predicated on the reliability of the LLM's in-context learning outputs for per-frame descriptions, character details, and dialogue assignments (as described in the abstract). No error rates, human validation, or ablation studies on this stage are mentioned; if LLM errors are common in complex multi-character scenes, they would propagate and undermine attribution of gains to the bounded attention masks, identity-consistent self-attention, or region-aware cross-attention.
Authors: We acknowledge the need to validate the LLM stage to properly attribute contributions. In the revised manuscript, we will include an analysis of the LLM outputs, such as error rates from a human study on a sample of generated descriptions and dialogues. We will also present an ablation study that isolates the impact of the bounded attention, identity-consistent self-attention, and region-aware cross-attention mechanisms to show their added value beyond the LLM-generated inputs. revision: yes
Circularity Check
No circularity: pipeline relies on external pre-trained models and independent mechanisms
full rationale
The paper presents a multi-stage pipeline that invokes a pre-trained LLM for per-frame descriptions and dialogue assignment via in-context learning, applies custom bounded attention masks plus identity-consistent self-attention and region-aware cross-attention inside the diffusion process, and uses CLIPSeg for bubble assignment. No equations, fitted parameters, or self-citations are shown that would make any claimed output (consistency, noise reduction, dialogue accuracy) equivalent to the inputs by construction. The experimental outperformance claims rest on comparisons against external baselines rather than tautological re-derivations of the method's own definitions or prior self-work. This is the normal case of a self-contained engineering contribution.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption A pre-trained LLM can generate accurate per-frame descriptions, character details, and dialogues via in-context learning.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We introduce a per-box bounded attention-based mask latent generation technique... identity-consistent self-attention (ICSA) mechanism... region-aware cross-attention (RACA)... CLIPSeg
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Experimental results demonstrate that TaleDiffusion outperforms existing methods in consistency, noise reduction, and dialogue rendering.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
-
DocRevive: A Unified Pipeline for Document Text Restoration
DocRevive builds a unified pipeline using OCR, image analysis, language models, and diffusion to reconstruct degraded document text, backed by a 30k-image synthetic dataset and the UCSM metric.
-
DocRevive: A Unified Pipeline for Document Text Restoration
A unified pipeline using OCR, inpainting, and diffusion models restores text in degraded documents on a new synthetic benchmark dataset, evaluated with the proposed UCSM metric.
Reference graph
Works this paper leans on
-
[1]
Claude 3.5 Sonnet — anthropic.com. https://www. anthropic.com/claude/sonnet. [Accessed 12-11- 2024]. 8
work page 2024
-
[2]
renderartist/retrocomicflux · Hugging Face — hug- gingface.co. https : / / huggingface . co / renderartist/retrocomicflux, 2024. [Accessed 12-11-2024]. 8
work page 2024
-
[3]
https://huggingface.co/Xenova/gpt- 3.5- turbo, 2024
Xenova/gpt-3.5-turbo · Hugging Face — huggingface.co. https://huggingface.co/Xenova/gpt- 3.5- turbo, 2024. [Accessed 12-11-2024]. 8
work page 2024
-
[4]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ah- mad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 ,
work page internal anchor Pith review Pith/arXiv arXiv
-
[5]
Spice: Semantic propositional image cap- tion evaluation
Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. Spice: Semantic propositional image cap- tion evaluation. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, Octo- ber 11-14, 2016, Proceedings, Part V 14 , pages 382–398. Springer, 2016. 2, 3
work page 2016
-
[6]
The chosen one: Consistent characters in text- to-image diffusion models
Omri Avrahami, Amir Hertz, Yael Vinker, Moab Arar, Shlomi Fruchter, Ohad Fried, Daniel Cohen-Or, and Dani Lischinski. The chosen one: Consistent characters in text- to-image diffusion models. In ACM SIGGRAPH 2024 Con- ference Papers, pages 1–12, 2024. 6, 7, 8
work page 2024
-
[7]
Meteor: An automatic metric for mt evaluation with improved correlation with hu- man judgments
Satanjeev Banerjee and Alon Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with hu- man judgments. In Proceedings of the acl workshop on in- trinsic and extrinsic evaluation measures for machine trans- lation and/or summarization, pages 65–72, 2005. 2, 3
work page 2005
-
[8]
Synartifact: Classifying and alleviat- ing artifacts in synthetic images via vision-language model
Bin Cao, Jianhao Yuan, Yexin Liu, Jian Li, Shuyang Sun, Jing Liu, and Bo Zhao. Synartifact: Classifying and alleviat- ing artifacts in synthetic images via vision-language model. arXiv preprint arXiv:2402.18068, 2024. 4
-
[9]
Auto- matic stylistic manga layout.ACM Transactions on Graphics (TOG), 31(6):1–10, 2012
Ying Cao, Antoni B Chan, and Rynson WH Lau. Auto- matic stylistic manga layout.ACM Transactions on Graphics (TOG), 31(6):1–10, 2012. 2
work page 2012
-
[10]
Claude2-alpaca: Instruction tuning datasets distilled from claude
Lichang Chen, Khalid Saifullah, Ming Li, Tianyi Zhou, and Heng Huang. Claude2-alpaca: Instruction tuning datasets distilled from claude. https://github.com/ Lichang-Chen/claude2-alpaca, 2023. 8
work page 2023
-
[11]
Manga genera- tion via layout-controllable diffusion
Siyu Chen, Dengjie Li, Zenghao Bao, Yao Zhou, Lingfeng Tan, Yujie Zhong, and Zheng Zhao. Manga genera- tion via layout-controllable diffusion. In arXiv preprint arxiv:2412.19303, 2024. 2, 3
-
[12]
arXiv preprint arXiv:2406.01388 , year=
Junhao Cheng, Xi Lu, Hanhui Li, Khun Loun Zai, Baiqiao Yin, Yuhao Cheng, Yiqiang Yan, and Xiaodan Liang. Au- tostudio: Crafting consistent subjects in multi-turn interac- tive image generation. arXiv preprint arXiv:2406.01388 ,
-
[13]
Theatergen: Character management with llm for consistent multi-turn image generation
Junhao Cheng, Baiqiao Yin, Kaixin Cai, Minbin Huang, Hanhui Li, Yuxin He, Xi Lu, Yue Li, Yifei Li, Yuhao Cheng, et al. Theatergen: Character management with llm for consistent multi-turn image generation. arXiv preprint arXiv:2404.18919, 2024. 2
-
[14]
Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality
Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna. lmsys. org (accessed 14 April 2023) , 2(3):6,
work page 2023
-
[15]
Be yourself: Bounded attention for multi-subject text-to-image generation
Omer Dahary, Or Patashnik, Kfir Aberman, and Daniel Cohen-Or. Be yourself: Bounded attention for multi-subject text-to-image generation. arXiv preprint arXiv:2403.16990, 2(5), 2024. 2
-
[16]
A survey on in-context learning
Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Tianyu Liu, Baobao Chang, Xu Sun, Lei Li, and Zhifang Sui. A survey on in-context learning. arXiv, 2024. 3
work page 2024
-
[17]
An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion
Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patash- nik, Amit H Bermano, Gal Chechik, and Daniel Cohen- Or. An image is worth one word: Personalizing text-to- image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022. 4, 5, 6, 8
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[18]
Controlling perceptual fac- tors in neural style transfer
Leon A Gatys, Alexander S Ecker, Matthias Bethge, Aaron Hertzmann, and Eli Shechtman. Controlling perceptual fac- tors in neural style transfer. In Proceedings of the IEEE con- ference on computer vision and pattern recognition , pages 3985–3993, 2017. 5 9
work page 2017
-
[19]
TokenFlow: Consistent Diffusion Features for Consistent Video Editing
Michal Geyer, Omer Bar-Tal, Shai Bagon, and Tali Dekel. Tokenflow: Consistent diffusion features for consistent video editing. arXiv preprint arXiv:2307.10373, 2023. 3
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[20]
Interactive story visualization with multiple characters
Yuan Gong, Youxin Pang, Xiaodong Cun, Menghan Xia, Yingqing He, Haoxin Chen, Longyue Wang, Yong Zhang, Xintao Wang, Ying Shan, and Yujiu Yang. Interactive story visualization with multiple characters. In SIGGRAPH Asia 2023 Conference Papers , SA ’23, New York, NY , USA,
work page 2023
-
[21]
Association for Computing Machinery. 6, 7
-
[22]
Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. In NeurIPS, 2014. 2
work page 2014
-
[23]
Mix-of-show: Decentralized low- rank adaptation for multi-concept customization of diffusion models
Yuchao Gu, Xintao Wang, Jay Zhangjie Wu, Yujun Shi, Yun- peng Chen, Zihan Fan, Wuyou Xiao, Rui Zhao, Shuning Chang, Weijia Wu, et al. Mix-of-show: Decentralized low- rank adaptation for multi-concept customization of diffusion models. Advances in Neural Information Processing Sys- tems, 36, 2024. 5
work page 2024
-
[24]
ebdtheque: a representative database of comics
Cl ´ement Gu ´erin, Christophe Rigaud, Antoine Mercier, Farid Ammar-Boudjelal, Karell Bertet, Alain Bouju, Jean- Christophe Burie, Georges Louis, Jean-Marc Ogier, and Ar- naud Revel. ebdtheque: a representative database of comics. In 2013 12th International Conference on Document Analy- sis and Recognition, pages 1145–1149. IEEE, 2013. 8
work page 2013
-
[25]
Imagine this! scripts to composi- tions to videos
Tanmay Gupta, Dustin Schwenk, Ali Farhadi, Derek Hoiem, and Aniruddha Kembhavi. Imagine this! scripts to composi- tions to videos. In Proceedings of the European conference on computer vision (ECCV) , pages 598–613, 2018. 2
work page 2018
-
[26]
Textdescriptives: A python package for calculating a large variety of metrics from text
Lasse Hansen, Ludvig Renbo Olsen, and Kenneth Enevold- sen. Textdescriptives: A python package for calculating a large variety of metrics from text. Journal of Open Source Software, 8(84):5153, Apr. 2023. 8
work page 2023
-
[27]
Huiguo He, Qiuyue Wang, Yuan Zhou, Yuxuan Cai, Hongyang Chao, Jian Yin, and Huan Yang. Improving multi-subject consistency in open-domain image genera- tion with isolation and reposition attention. arXiv preprint arXiv:2411.19261, 2024. 2
-
[28]
Huiguo He, Huan Yang, Zixi Tuo, Yuan Zhou, Qiuyue Wang, Yuhang Zhang, Zeyu Liu, Wenhao Huang, Hongyang Chao, and Jian Yin. Dreamstory: Open-domain story visualiza- tion by llm-guided multi-subject consistent diffusion. arXiv preprint arXiv:2407.12899, 2024. 2, 3, 6, 1
-
[29]
Gans trained by a two time-scale update rule converge to a local nash equilib- rium
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilib- rium. Advances in neural information processing systems , 30, 2017. 6
work page 2017
-
[30]
Inferring semantic layout for hierarchical text- to-image synthesis
Seunghoon Hong, Dingdong Yang, Jongwook Choi, and Honglak Lee. Inferring semantic layout for hierarchical text- to-image synthesis. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 7986– 7994, 2018. 2
work page 2018
-
[31]
Image quality metrics: Psnr vs
Alain Hore and Djemel Ziou. Image quality metrics: Psnr vs. ssim. In 2010 20th international conference on pattern recognition, pages 2366–2369. IEEE, 2010. 6
work page 2010
-
[32]
Jay Hosler and K. B. Boomer. Are comic books an effec- tive way to engage nonmajors in learning and appreciating science? CBE—Life Sciences Education , 2011. 1
work page 2011
-
[33]
Animate anyone: Consistent and controllable image- to-video synthesis for character animation
Li Hu. Animate anyone: Consistent and controllable image- to-video synthesis for character animation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8153–8163, 2024. 3
work page 2024
-
[34]
Identity decoupling for multi-subject per- sonalization of text-to-image models
Sangwon Jang, Jaehyeong Jo, Kimin Lee, and Sung Ju Hwang. Identity decoupling for multi-subject per- sonalization of text-to-image models. arXiv preprint arXiv:2404.04243, 2024. 3
-
[35]
Content-aware video2comics with manga-style layout
Guangmei Jing, Yongtao Hu, Yanwen Guo, Yizhou Yu, and Wenping Wang. Content-aware video2comics with manga-style layout. IEEE Transactions on Multimedia , 17(12):2122–2133, 2015. 2
work page 2015
-
[36]
Black Forest Labs. Flux. https://github.com/ black-forest-labs/flux, 2024. 8
work page 2024
-
[37]
Dongxu Li, Junnan Li, and Steven Hoi. Blip-diffusion: Pre- trained subject representation for controllable text-to-image generation and editing. Advances in Neural Information Pro- cessing Systems, 36, 2024. 4, 5, 6, 8
work page 2024
-
[38]
Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for uni- fied vision-language understanding and generation. In In- ternational Conference on Machine Learning, pages 12888– 12900. PMLR, 2022. 6
work page 2022
-
[39]
Unbounded: A generative infinite game of character life simulation
Jialu Li, Yuanzhen Li, Neal Wadhwa, Yael Pritch, David E Jacobs, Michael Rubinstein, Mohit Bansal, and Nataniel Ruiz. Unbounded: A generative infinite game of character life simulation. arXiv preprint arXiv:2410.18975, 2024. 2
-
[40]
Storygan: A sequential conditional gan for story vi- sualization
Yitong Li, Zhe Gan, Yelong Shen, Jingjing Liu, Yu Cheng, Yuexin Wu, Lawrence Carin, David Carlson, and Jianfeng Gao. Storygan: A sequential conditional gan for story vi- sualization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages 6329–6338,
-
[41]
Gligen: Open-set grounded text-to-image generation
Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jian- wei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22511–22521, 2023. 4, 1
work page 2023
-
[42]
Photomaker: Customizing re- alistic human photos via stacked id embedding
Zhen Li, Mingdeng Cao, Xintao Wang, Zhongang Qi, Ming- Ming Cheng, and Ying Shan. Photomaker: Customizing re- alistic human photos via stacked id embedding. In Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8640–8650, 2024. 4, 6
work page 2024
-
[43]
Rouge: A package for automatic evaluation of summaries
Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out , pages 74–81, 2004. 2, 3
work page 2004
-
[44]
Evaluating text-to-visual generation with image-to-text gen- eration
Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, and Deva Ramanan. Evaluating text-to-visual generation with image-to-text gen- eration. In European Conference on Computer Vision, pages 366–384. Springer, 2025. 7, 8, 6
work page 2025
-
[45]
Intelligent grimm-open-ended visual storytelling via latent diffusion models
Chang Liu, Haoning Wu, Yujie Zhong, Xiaoyun Zhang, Yan- feng Wang, and Weidi Xie. Intelligent grimm-open-ended visual storytelling via latent diffusion models. In Proceed- 10 ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6190–6200, 2024. 2, 6, 7, 8, 10
work page 2024
-
[46]
Improved baselines with visual instruction tuning
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024. 4
work page 2024
-
[47]
One-prompt-one-story: Free-lunch consistent text-to-image generation using a single prompt
Tao Liu, Kai Wang, Senmao Li, Joost van de Weijer, Fa- had Shahbaz Khan, Shiqi Yang, Yaxing Wang, Jian Yang, and Ming-Ming Cheng. One-prompt-one-story: Free-lunch consistent text-to-image generation using a single prompt. In The Thirteenth International Conference on Learning Repre- sentations, 2025. 6, 7
work page 2025
-
[48]
Image segmenta- tion using text and image prompts
Timo L ¨uddecke and Alexander Ecker. Image segmenta- tion using text and image prompts. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7086–7096, 2022. 2, 3, 6, 5, 9
work page 2022
-
[49]
Integrating visuospa- tial, linguistic and commonsense structure into story visual- ization
Adyasha Maharana and Mohit Bansal. Integrating visuospa- tial, linguistic and commonsense structure into story visual- ization. arXiv preprint arXiv:2110.10834, 2021. 2
-
[50]
Storydall-e: Adapting pretrained text-to-image transformers for story continuation
Adyasha Maharana, Darryl Hannan, and Mohit Bansal. Storydall-e: Adapting pretrained text-to-image transformers for story continuation. InEuropean Conference on Computer Vision, pages 70–87. Springer, 2022. 2
work page 2022
-
[51]
Story-adapter: A training-free iterative framework for long story visualization
Jiawei Mao, Xiaoke Huang, Yunfei Xie, Yuanqi Chang, Mude Hui, Bingjie Xu, and Yuyin Zhou. Story-adapter: A training-free iterative framework for long story visualization. In arXiv, 2024. 7
work page 2024
-
[52]
Digital comics image indexing based on deep learn- ing
Nhu-Van Nguyen, Christophe Rigaud, and Jean-Christophe Burie. Digital comics image indexing based on deep learn- ing. Journal of Imaging, 4(7):89, 2018. 3
work page 2018
-
[53]
Peter Organisciak, Selcuk Acar, Denis Dumas, and Kelly Berthiaume. Beyond semantic distance: Automated scoring of divergent thinking greatly improves with large language models. Thinking Skills and Creativity, 49:101356, 2023. 8
work page 2023
-
[54]
Synthesizing coherent story with auto-regressive la- tent diffusion models
Xichen Pan, Pengda Qin, Yuhong Li, Hui Xue, and Wenhu Chen. Synthesizing coherent story with auto-regressive la- tent diffusion models. In Proceedings of the IEEE/CVF Win- ter Conference on Applications of Computer Vision , pages 2920–2930, 2024. 2
work page 2024
-
[55]
Pytorch: An im- perative style, high-performance deep learning library
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An im- perative style, high-performance deep learning library. Ad- vances in neural information processing systems , 32, 2019. 1
work page 2019
-
[56]
Comic- gan: Text-to-comic generative adversarial network
Ben Proven-Bessel, Zilong Zhao, and Lydia Chen. Comic- gan: Text-to-comic generative adversarial network. arXiv preprint arXiv:2109.09120, 2021. 2
-
[57]
Make-a-story: Visual memory conditioned consistent story generation
Tanzila Rahman, Hsin-Ying Lee, Jian Ren, Sergey Tulyakov, Shweta Mahajan, and Leonid Sigal. Make-a-story: Visual memory conditioned consistent story generation. In Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2493–2502, 2023. 2
work page 2023
-
[58]
Grounded sam: Assembling open-world models for diverse visual tasks
Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kun- chang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, Zhaoyang Zeng, Hao Zhang, Feng Li, Jie Yang, Hongyang Li, Qing Jiang, and Lei Zhang. Grounded sam: Assembling open-world models for diverse visual tasks. arXiv, 2024. 4
work page 2024
-
[59]
High-resolution image syn- thesis with latent diffusion models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj¨orn Ommer. High-resolution image syn- thesis with latent diffusion models. In CVPR, 2022. 2
work page 2022
-
[60]
Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation
Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages 22500– 22510, 2023. 2, 4, 5, 6, 8
work page 2023
-
[61]
Improved aesthetic predictor, 2022
Christoph Schuhmann. Improved aesthetic predictor, 2022. 7, 8, 6
work page 2022
-
[62]
Storygpt-v: Large language models as consistent story visualizers.arXiv, 2023
Xiaoqian Shen and Mohamed Elhoseiny. Storygpt-v: Large language models as consistent story visualizers.arXiv, 2023. 2
work page 2023
-
[63]
Storybooth: Training-free multi-subject consistency for improved visual storytelling
Jaskirat Singh, Junshen Kevin Chen, Jonas Kohler, and Michael Cohen. Storybooth: Training-free multi-subject consistency for improved visual storytelling. arXiv preprint arXiv:2504.05800, 2025. 2, 3
-
[64]
Text2scene: Generating compositional scenes from textual descriptions
Fuwen Tan, Song Feng, and Vicente Ordonez. Text2scene: Generating compositional scenes from textual descriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6710–6719, 2019. 2, 3
work page 2019
-
[65]
arXiv preprint arXiv:2210.04885 , year=
Raphael Tang, Linqing Liu, Akshat Pandey, Zhiying Jiang, Gefei Yang, Karun Kumar, Pontus Stenetorp, Jimmy Lin, and Ferhan Ture. What the daam: Interpreting stable diffu- sion using cross attention. arXiv preprint arXiv:2210.04885,
-
[66]
Storyimager: A unified and efficient frame- work for coherent story visualization and completion
Ming Tao, Bing-Kun Bao, Hao Tang, Yaowei Wang, and Changsheng Xu. Storyimager: A unified and efficient frame- work for coherent story visualization and completion. arXiv preprint arXiv:2404.05979, 2024. 2, 6
-
[67]
Science comics as tools for science educa- tion and communication: a brief, exploratory study
Mi ´co Tatalovi´c. Science comics as tools for science educa- tion and communication: a brief, exploratory study. JCOM, 8(4), 2009. 1
work page 2009
-
[68]
Training-free consis- tent text-to-image generation
Yoad Tewel, Omri Kaduri, Rinon Gal, Yoni Kasten, Lior Wolf, Gal Chechik, and Yuval Atzmon. Training-free consis- tent text-to-image generation. ACM Transactions on Graph- ics (TOG), 43(4):1–18, 2024. 2, 5, 6, 7, 3, 8
work page 2024
-
[69]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023. 8, 9, 2, 3
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[70]
Cider: Consensus-based image description evalua- tion
Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evalua- tion. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575, 2015. 2, 3
work page 2015
-
[71]
Comix: A comprehensive benchmark for multi-task comic understanding
Emanuele Vivoli, Marco Bertini, and Dimosthenis Karatzas. Comix: A comprehensive benchmark for multi-task comic understanding. arXiv preprint arXiv:2407.03550, 2024. 8
-
[72]
Cdac: Cross-domain attention consistency in trans- former for domain adaptive semantic segmentation
Kaihong Wang, Donghyun Kim, Rogerio Feris, and Margrit Betke. Cdac: Cross-domain attention consistency in trans- former for domain adaptive semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11519–11529, 2023. 8 11
work page 2023
-
[73]
Autostory: Generating diverse storytelling images with minimal human efforts
Wen Wang, Canyu Zhao, Hao Chen, Zhekai Chen, Kecheng Zheng, and Chunhua Shen. Autostory: Generating diverse storytelling images with minimal human efforts. Interna- tional Journal of Computer Vision , pages 1–22, 2024. 2, 3, 4
work page 2024
-
[74]
Xinyi Wang, Wanrong Zhu, Michael Saxon, Mark Steyvers, and William Yang Wang. Large language models are latent variable models: Explaining and finding good demonstra- tions for in-context learning. Advances in Neural Informa- tion Processing Systems, 36, 2024. 3, 9
work page 2024
-
[75]
Elite: Encoding visual con- cepts into textual embeddings for customized text-to-image generation
Yuxiang Wei, Yabo Zhang, Zhilong Ji, Jinfeng Bai, Lei Zhang, and Wangmeng Zuo. Elite: Encoding visual con- cepts into textual embeddings for customized text-to-image generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15943–15953, 2023. 2, 4, 5, 6
work page 2023
-
[76]
Diffsensei: Bridging multi- modal llms and diffusion models for customized manga gen- eration
Jianzong Wu, Chao Tang, Jingbo Wang, Yanhong Zeng, Xi- angtai Li, and Yunhai Tong. Diffsensei: Bridging multi- modal llms and diffusion models for customized manga gen- eration. arXiv preprint arXiv:2412.07589 , 2024. 2, 3, 6, 7, 8
-
[77]
Human preference score: Better aligning text- to-image models with human preference
Xiaoshi Wu, Keqiang Sun, Feng Zhu, Rui Zhao, and Hong- sheng Li. Human preference score: Better aligning text- to-image models with human preference. In Proceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 2096–2105, 2023. 7, 8, 6
work page 2096
-
[78]
Xin Yang, Zongliang Ma, Letian Yu, Ying Cao, Baocai Yin, Xiaopeng Wei, Qiang Zhang, and Rynson WH Lau. Auto- matic comic generation with stylistic multi-page layouts and emotion-driven text balloon generation. ACM Transactions on Multimedia Computing, Communications, and Applica- tions (TOMM), 17(2):1–19, 2021. 2, 1
work page 2021
-
[79]
IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models
Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip- adapter: Text compatible image prompt adapter for text-to- image diffusion models. arXiv preprint arXiv:2308.06721 ,
work page internal anchor Pith review Pith/arXiv arXiv
-
[80]
Obj2text: Generating visually descriptive language from object layouts
Xuwang Yin and Vicente Ordonez. Obj2text: Generating visually descriptive language from object layouts. In Pro- ceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 177–187, 2017. 3, 2
work page 2017
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.