
arxiv: 2605.20777 · v1 · pith:2NGEI3XFnew · submitted 2026-05-20 · 💻 cs.CV

AttriStory: Fine-grained Attribute Realization for Visual Storytelling with Diffusion Models

Pith reviewed 2026-05-21 05:43 UTC · model grok-4.3

classification 💻 cs.CV
keywords visual storytelling · diffusion models · attribute realization · cross-attention maps · fine-grained attributes · latent optimization · story generation

The pith

Optimizing cross-attention maps during early denoising enables faithful rendering of fine-grained attributes in visual storytelling with diffusion models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AttriStory, a benchmark of 200 multi-scene stories across 10 artistic styles, each with explicit attribute specifications for characters and objects. It proposes a plug-and-play latent optimization module that applies AttriLoss during early denoising steps to align cross-attention maps with desired attribute-object pairs and reduce incorrect associations. This addition integrates with existing consistency methods and yields consistent gains in attribute accuracy across baselines. A sympathetic reader would care because it addresses the missing step between keeping characters consistent and making specific details like clothing color or texture match the story text.
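To make the benchmark structure concrete, here is a minimal sketch of how one story with per-scene attribute specifications might be represented. The class and field names (`Scene`, `Story`, `positive_pairs`, `negative_pairs`) and the example narrative are illustrative assumptions, not the benchmark's actual schema. The split into positive and negative attribute-object pairs follows the Figure 2 caption.

```python
from dataclasses import dataclass, field

# Hypothetical schema: every name here is illustrative, not the benchmark's real format.
AttributeObjectPair = tuple[str, str]  # (attribute, object), e.g. ("pink", "dress")


@dataclass
class Scene:
    narrative: str
    # P+: pairs that should be bound together in the image.
    positive_pairs: list[AttributeObjectPair] = field(default_factory=list)
    # P-: pairs that must not be bound together.
    negative_pairs: list[AttributeObjectPair] = field(default_factory=list)


@dataclass
class Story:
    style: str
    character: str
    scenes: list[Scene]

    def all_positive_pairs(self) -> list[AttributeObjectPair]:
        """Collect every attribute-object pair that should be realized."""
        return [pair for scene in self.scenes for pair in scene.positive_pairs]


# Made-up example scene, loosely based on the pink dress / lilies case in Figure 4.
example = Story(
    style="watercolor",
    character="a girl in a pink dress",
    scenes=[
        Scene(
            narrative="She carries lilies through the garden.",
            positive_pairs=[("pink", "dress")],
            negative_pairs=[("pink", "lilies")],
        )
    ],
)
```

Keeping the pairs per scene rather than per story would let an evaluator score each generated image against only the attributes its own scene text specifies.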

Core claim

AttriStory provides a benchmark enabling attribute realization in visual storytelling. The AttriLoss objective maximizes alignment between the cross-attention maps for desired attribute-object pairs while suppressing spurious associations, guiding models to localize attributes correctly when applied during early denoising steps. The approach operates orthogonally to existing consistency mechanisms and integrates seamlessly with current story generation pipelines without architectural modifications.

What carries the argument

The AttriLoss objective, which maximizes alignment between the cross-attention maps of desired attribute-object pairs while suppressing spurious associations, so that attributes are localized on the correct objects.
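The Figure 4 caption describes the objective as an IoU loss over attention maps: overlap is maximized for positive pairs and minimized for negative pairs. The sketch below illustrates that idea on toy arrays. The soft IoU (sum of element-wise minima over sum of maxima), the max-normalization, and the equal weighting of the two terms are assumptions made for illustration, and `attri_loss_sketch` is a hypothetical name, not the paper's implementation.

```python
import numpy as np


def normalize(attn_map: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Scale a non-negative attention map so its maximum is about 1."""
    return attn_map / (attn_map.max() + eps)


def soft_iou(a: np.ndarray, b: np.ndarray, eps: float = 1e-8) -> float:
    """Soft IoU of two non-negative maps: sum of minima over sum of maxima."""
    return float(np.minimum(a, b).sum() / (np.maximum(a, b).sum() + eps))


def attri_loss_sketch(maps, positive_pairs, negative_pairs) -> float:
    """Toy loss: low when positive pairs overlap and negative pairs do not.

    maps: dict mapping a token string to its 2D attention map.
    """
    pos = [1.0 - soft_iou(normalize(maps[a]), normalize(maps[o]))
           for a, o in positive_pairs]
    neg = [soft_iou(normalize(maps[a]), normalize(maps[o]))
           for a, o in negative_pairs]
    pos_term = float(np.mean(pos)) if pos else 0.0
    neg_term = float(np.mean(neg)) if neg else 0.0
    return pos_term + neg_term  # equal weighting is an assumption


# Toy 4x4 maps: "dress" occupies the left half, "lilies" the right half.
left = np.zeros((4, 4))
left[:, :2] = 1.0
right = np.zeros((4, 4))
right[:, 2:] = 1.0

good = {"pink": left, "dress": left, "lilies": right}   # pink sits on the dress
bad = {"pink": right, "dress": left, "lilies": right}   # pink leaks onto the lilies

loss_good = attri_loss_sketch(good, [("pink", "dress")], [("pink", "lilies")])
loss_bad = attri_loss_sketch(bad, [("pink", "dress")], [("pink", "lilies")])
```

In this toy setting the loss is near zero when the attribute map coincides with its object and near its maximum when it coincides with the wrong object. In a real pipeline the maps would come from the diffusion model's cross-attention layers and the loss would need to be differentiable with respect to the latent.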

If this is right

  • Consistent improvements appear when incorporating AttriLoss across all tested baselines.
  • Attribute realization emerges as a distinct and complementary dimension of visual storytelling alongside character consistency.
  • The method advances the field toward fine-grained attribute-controlled story generation.
  • No architectural modifications are required to integrate with existing pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The attention-alignment idea could extend to controlling object relationships or background elements in generated scenes.
  • Applying the same optimization at later denoising stages might further refine details without harming early structure.
  • Interactive tools could let users adjust specific attributes mid-generation by modifying the loss targets.

Load-bearing premise

Optimizing cross-attention maps only during early denoising steps produces faithful attribute rendering in the final image without introducing new artifacts or degrading overall story coherence.
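The premise concerns when the optimization runs, so a control-flow sketch may help. The loop below applies a gradient update to the latent only during the first `early_steps` iterations and leaves every later sampler step untouched. All names (`optimize_early_latents`, `loss_grad`, `denoise_step`) and the toy stand-ins are assumptions for illustration. In an actual diffusion pipeline the gradient would come from backpropagating an attention-based loss through the denoiser.

```python
import numpy as np


def optimize_early_latents(z, num_steps, early_steps, loss_grad, denoise_step, lr=0.1):
    """Update the latent with a loss gradient only in the first `early_steps` steps."""
    updated_at = []
    for t in range(num_steps):
        if t < early_steps:
            z = z - lr * loss_grad(z)  # latent update driven by the loss
            updated_at.append(t)
        z = denoise_step(z, t)         # regular sampler step, unchanged
    return z, updated_at


# Toy stand-ins (assumptions): a quadratic loss pulling z toward a target,
# and a "denoiser" that only shrinks the latent slightly.
target = np.ones(4)


def toy_grad(z):
    return 2.0 * (z - target)


def toy_denoise(z, t):
    return 0.99 * z


z_final, steps = optimize_early_latents(
    np.zeros(4), num_steps=10, early_steps=3,
    loss_grad=toy_grad, denoise_step=toy_denoise,
)
```

Changing `early_steps`, or replacing the `t < early_steps` test with a different window, is the kind of knob a timing ablation would vary.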

What would settle it

Side-by-side evaluation on the AttriStory benchmark showing no measurable increase in correct attribute depiction when using AttriLoss versus standard generation, as judged by attribute-specific accuracy metrics or human raters.
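One way such a side-by-side comparison could be scored is sketched below: per-story attribute accuracy with and without the loss, followed by a paired bootstrap interval on the gain. The function names and the choice of a bootstrap are assumptions, not the paper's protocol, and the numbers in the usage lines are synthetic placeholders rather than results.

```python
import random


def attribute_accuracy(judgments):
    """Fraction of attribute-object pairs judged correctly rendered (list of bools)."""
    return sum(judgments) / len(judgments)


def paired_bootstrap_gain(baseline, treated, n_resamples=2000, seed=0):
    """Mean per-story accuracy gain and an approximate 95% bootstrap interval."""
    if len(baseline) != len(treated):
        raise ValueError("baseline and treated must cover the same stories")
    diffs = [t - b for b, t in zip(baseline, treated)]
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(diffs, k=len(diffs))) / len(diffs)
        for _ in range(n_resamples)
    )
    low = means[int(0.025 * n_resamples)]
    high = means[int(0.975 * n_resamples) - 1]
    return sum(diffs) / len(diffs), (low, high)


# Synthetic placeholder values, for illustration only.
story_acc = attribute_accuracy([True, False, True, True])
gain, interval = paired_bootstrap_gain([0.5, 0.5], [0.75, 0.75])
```

An interval that contains zero would correspond to the "no measurable increase" outcome described above.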

Figures

Figures reproduced from arXiv: 2605.20777 by Manogna Sreenivas, Rohit Kumar, Soma Biswas.

Figure 1: Visualization of a story generated from the AttriStory benchmark. This story of Ben illustrates the dual challenge in visual storytelling: maintaining character consistency across scenes while realizing fine-grained attributes such as clothing and accessories.
Figure 2: Comparison of story narratives proposed in prior benchmarks vs. ours. Existing approaches like ConsiStory (top) provide minimal visual specifications, capturing only basic character identity and actions. AttriStory (bottom) enriches narratives with explicit positive and negative attribute-object pairs (P+ and P−) for each scene, enabling systematic evaluation of fine-grained attribute realization.
Figure 3: LLM-driven benchmark generation. The pipeline inputs artistic styles and structured instructions that emphasize explicit, fine-grained attribute specifications. For each story, the LLM chooses an artistic style and generates character descriptions, scene narratives, and positive (P+) and negative (P−) attribute-object pairs, producing structured stories enabling attribute realization.
Figure 4: AttriLoss: Targeted IoU loss on cross-attention maps. Our method optimizes spatial overlap between attention maps of attribute-object token pairs during early denoising steps. By maximizing IoU for positive pairs (e.g., pink and dress should co-occur) and minimizing IoU for negative pairs (e.g., pink and lilies should not overlap), we guide the model to correctly localize fine-grained attributes.
Figure 5: Attention maps of ConsiStory and with AttriLoss. The attention maps of the baseline method ConsiStory show ambiguous spatial overlaps where the attribute tokens pink and lilies attend to the same regions, resulting in the image with pink roses as well. Using the AttriLoss objective with ConsiStory, the attention maps for attribute-object pairs sharpen into distinct regions (pink and lilies don't overlap), achieving co…
Figure 6: Qualitative results of the ConsiStory baseline with and without AttriLoss. Using ConsiStory, character consistency is maintained but the method fails to correctly bind fine-grained attributes (e.g., pink roses are rendered with white lilies (1), the umbrella is partially colored blue instead of red (2)). With AttriLoss, attribute specifications are faithfully realized while preserving character consistency.
Figure 7: Qualitative results of the StoryDiffusion baseline with and without AttriLoss. Using StoryDiffusion (top), character consistency is maintained but the method fails to correctly bind fine-grained attributes (e.g., grey coat (2), yellow coat (3) and beige jacket (3) are not rendered using StoryDiffusion). With AttriLoss (bottom), attribute specifications are faithfully realized while character consistency is preserved.
Figure 8: Attribute realization across diverse stories using ConsiStory as the baseline (top) and with AttriLoss (bottom). Each column shows a scene in varied artistic styles (Pixar, cartoon, oil painting, photo, watercolor). AttriLoss corrects attribute-object binding failures: peacock's red velvet capelet (1), Dr. Barkley's glasses (2), yellow flag on the raft (3), Luke's green hoodie (4), Oliver's green bike (5) mech…
Abstract

Visual storytelling with diffusion models has made impressive strides in maintaining character consistency across narrative scenes. However, a critical gap remains: while these methods ensure a character remains consistent across scenes, they provide no systematic way to ensure that fine-grained attributes, such as the color and texture of clothing and accessories, are faithfully rendered in the generated images. Towards this goal, we introduce AttriStory, a benchmark enabling attribute realization in visual storytelling. We curate 200 multi-scene stories across 10 distinct artistic styles using a Large Language Model. Each scene is constructed with detailed attribute specifications to enable rich visual narratives. Further, to address attribute realization, we propose a plug-and-play latent optimization module that operates during early denoising steps, when the model establishes structural and semantic content. We achieve this through the AttriLoss objective, designed to maximize alignment between the cross-attention maps for desired attribute-object pairs while suppressing spurious associations, guiding models to localize attributes correctly. This approach operates orthogonally to existing consistency mechanisms, integrating seamlessly with current story generation pipelines without requiring architectural modifications. Our experiments demonstrate consistent improvements on incorporating AttriLoss across all baselines. This work positions attribute realization as a distinct, complementary dimension of visual storytelling, alongside character consistency, advancing the field toward fine-grained attribute-controlled story generation. Project page: https://manogna-s.github.io/attristory/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces AttriStory, a benchmark of 200 multi-scene stories across 10 artistic styles curated via LLMs, each equipped with detailed attribute specifications for clothing, accessories, colors, and textures. It proposes a plug-and-play latent optimization module that applies an AttriLoss objective during early denoising steps to maximize cross-attention alignment for desired attribute-object pairs while suppressing spurious associations, thereby guiding correct attribute localization in diffusion-based visual storytelling. The method is presented as orthogonal to existing consistency mechanisms and is reported to yield consistent improvements across baselines.

Significance. If substantiated, the work usefully separates attribute realization from character consistency as a distinct control axis in story generation. The benchmark could support future controlled experiments, and the plug-and-play formulation avoids architectural changes. However, the absence of quantitative results, measurement protocols, or ablations in the abstract limits evaluation of practical impact and leaves the core proxy assumption (early attention alignment implies final pixel-level fidelity) untested.

major comments (2)
  1. [Abstract] Abstract: the claim of 'consistent improvements on incorporating AttriLoss across all baselines' is unsupported by any quantitative results, error bars, or description of how attribute success (e.g., color/texture fidelity) was measured. This information is load-bearing for the central empirical claim.
  2. [Abstract] Abstract (AttriLoss description): the method optimizes cross-attention maps only in early denoising steps under the assumption that this suffices for faithful final-image attribute rendering. For fine-grained attributes such as clothing textures or accessory colors, attention maps can be diffuse and later denoising steps can still alter appearance; no ablation on loss timing or direct evidence linking early alignment to final pixel output is provided.
minor comments (2)
  1. The manuscript would benefit from explicit statements on reproducibility (hyperparameters of the latent optimization, exact weighting of AttriLoss, and whether code or prompts will be released).
  2. Notation for cross-attention maps and the precise formulation of the suppression term in AttriLoss should be clarified with an equation reference to avoid ambiguity in implementation.
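As an indication of the kind of formulation the second minor comment asks for, one possible way to write the objective is sketched below. It is consistent with the Figure 4 caption (IoU maximized for positive pairs, minimized for negative pairs) but is an assumption, not the paper's equation: the symbols A_a and A_o (cross-attention maps of the attribute and object tokens), the soft IoU definition, the weight λ, the step size α, and the early-step cutoff T_early are all introduced here for illustration.

```latex
% Hypothetical formulation; all symbols are assumptions, not the paper's notation.
\[
\mathrm{IoU}(A_a, A_o) =
  \frac{\sum_{x} \min\bigl(A_a(x), A_o(x)\bigr)}
       {\sum_{x} \max\bigl(A_a(x), A_o(x)\bigr)}
\]
\[
\mathcal{L}_{\mathrm{Attri}} =
  \frac{1}{|P^{+}|} \sum_{(a,o) \in P^{+}} \bigl(1 - \mathrm{IoU}(A_a, A_o)\bigr)
  + \lambda \, \frac{1}{|P^{-}|} \sum_{(a,o) \in P^{-}} \mathrm{IoU}(A_a, A_o)
\]
\[
z_t \leftarrow z_t - \alpha \, \nabla_{z_t} \mathcal{L}_{\mathrm{Attri}}
\quad \text{for the first } T_{\mathrm{early}} \text{ denoising steps}
\]
```

Here the second sum is the suppression term: it penalizes overlap between an attribute and an object it should not be bound to.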

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments that help clarify the presentation of our empirical claims and methodological choices. We address each major comment below and describe the corresponding revisions.

  1. Referee: [Abstract] Abstract: the claim of 'consistent improvements on incorporating AttriLoss across all baselines' is unsupported by any quantitative results, error bars, or description of how attribute success (e.g., color/texture fidelity) was measured. This information is load-bearing for the central empirical claim.

    Authors: We agree that the abstract lacks sufficient detail on the evaluation protocol and results. The full manuscript reports quantitative metrics for attribute fidelity (color and texture accuracy via automated matching and human evaluation) with standard deviations across runs, showing consistent gains over baselines. We will revise the abstract to briefly describe the measurement approach and key quantitative outcomes while referencing the experiments section. revision: yes

  2. Referee: [Abstract] Abstract (AttriLoss description): the method optimizes cross-attention maps only in early denoising steps under the assumption that this suffices for faithful final-image attribute rendering. For fine-grained attributes such as clothing textures or accessory colors, attention maps can be diffuse and later denoising steps can still alter appearance; no ablation on loss timing or direct evidence linking early alignment to final pixel output is provided.

    Authors: The focus on early steps follows from the established role of initial denoising in determining semantic structure and layout. The manuscript includes attention-map visualizations that link improved early alignment to correct final attributes. We acknowledge the benefit of explicit timing ablations and will add experiments comparing AttriLoss application across early, middle, and late stages, plus quantitative analysis correlating attention alignment with pixel-level attribute accuracy. revision: yes
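The promised analysis correlating attention alignment with attribute accuracy could take a form like the sketch below, which computes a Pearson correlation between per-image alignment scores and per-image accuracy. The function name and the choice of Pearson correlation are assumptions, and the usage values are synthetic placeholders, not data from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if std_x == 0 or std_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (std_x * std_y)


# Synthetic placeholder values: alignment score vs. attribute accuracy per image.
alignment = [1.0, 2.0, 3.0]
accuracy = [2.0, 4.0, 6.0]
r = pearson(alignment, accuracy)
```

A strong positive correlation would support using early attention alignment as a proxy for final attribute fidelity. A weak one would support the referee's concern.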

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained


The paper introduces a new benchmark (AttriStory with 200 stories) and a plug-and-play latent optimization module using the AttriLoss on cross-attention maps during early denoising steps. The central claim—that this loss maximizes alignment for attribute-object pairs and thereby improves fine-grained attribute realization—is presented as an empirical outcome from integrating the module with baselines, without any equations or steps that reduce the reported improvement to a fitted parameter, self-defined metric, or self-citation chain. The text explicitly positions the method as orthogonal to consistency mechanisms and requiring no architectural changes, indicating the derivation chain adds independent content rather than renaming or reconstructing its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on standard diffusion model assumptions plus the new AttriLoss formulation and the curated benchmark; no explicit free parameters or invented physical entities are stated in the abstract.

axioms (1)
  • [domain assumption] Cross-attention maps in diffusion models can be directly optimized to control attribute localization without side effects on image quality.
    Invoked when describing the AttriLoss operating during early denoising steps.
invented entities (2)
  • AttriLoss (no independent evidence)
    purpose: Objective to maximize alignment between cross-attention maps for desired attribute-object pairs.
    New loss term introduced in the paper.
  • AttriStory benchmark (no independent evidence)
    purpose: Dataset of 200 multi-scene stories with attribute specifications across 10 styles.
    New curated dataset for evaluating attribute realization.

pith-pipeline@v0.9.0 · 5770 in / 1361 out tokens · 24381 ms · 2026-05-21T05:43:50.710691+00:00 · methodology

