pith. sign in

arxiv: 2412.12242 · v2 · submitted 2024-12-16 · 💻 cs.CV · cs.AI· cs.LG

OmniPrism: Learning Disentangled Visual Concept for Image Generation

Pith reviewed 2026-05-23 06:40 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.LG
keywords visual concept disentanglementimage generationdiffusion modelsPCD-200K datasetCOD training pipelineconcept injectioncreative image generationmultimodal concept extraction
0
0 comments X

The pith

OmniPrism separates multiple visual concepts from one reference image so diffusion models can apply chosen ones without mixing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to fix the problem that current image generators either handle only one aspect from a reference or mix in unwanted elements when several aspects are present. It does this by creating a large paired dataset where each pair shares exactly one concept such as style or composition, then training representations to isolate those concepts through contrastive orthogonal learning guided by language. The isolated concepts are fed into extra layers of a diffusion model along with block embeddings that adapt to each concept type. A sympathetic reader would care because this would let users draw specific elements from example images while still following a text prompt, producing creative outputs that stay faithful to both without confusion.

Core claim

OmniPrism learns disentangled concept representations from reference images by leveraging a multimodal extractor and natural language guidance. It builds the PCD-200K dataset consisting of image pairs that share the same single concept in areas like content, style, or composition. Through the contrastive orthogonal disentangled training pipeline these representations are isolated and then injected into additional cross-attention layers of a diffusion model, with block embeddings adapting each layer to the appropriate concept domain, resulting in generated images that maintain high fidelity to the text prompt and the selected concepts.

What carries the argument

The contrastive orthogonal disentangled (COD) training pipeline that operates on the PCD-200K paired dataset to produce isolatable concept representations for injection into diffusion cross-attention layers.

If this is right

  • Diffusion models gain the ability to incorporate only the desired concept from a reference while ignoring others.
  • Generated images show improved fidelity to both the input text prompt and the explicitly chosen visual concepts.
  • Multi-aspect creative generation becomes feasible without the concept confusion seen in prior single-aspect or entangled approaches.
  • Block embeddings allow each diffusion layer to specialize in a particular concept domain during generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar paired-data construction and contrastive isolation could be tested on other generative architectures beyond diffusion.
  • The method implies that disentanglement might reduce the need for heavy prompt engineering when transferring styles or contents.
  • Extending the pairing idea to video frames or 3D assets could address temporal or geometric concept leakage in those domains.

Load-bearing premise

The PCD-200K paired dataset and multimodal extractor together supply clean unbiased signals that let the COD pipeline isolate concepts without residual entanglement or dataset artifacts.

What would settle it

Run the model on reference images containing overlapping or ambiguous concepts and check whether generated outputs still exhibit unintended mixing of non-selected concepts from the reference.

Figures

Figures reproduced from arXiv: 2412.12242 by Allen He, Daqing Liu, Guoqing Jin, Wu Liu, Xinchen Liu, Yangyang Li, Yongdong Zhang.

Figure 1
Figure 1. Figure 1: We propose OmniPrism, which arbitrarily disentangles and combines visual concepts. (a) Disentangled visual concept gener￾ation. Given a reference image with multiple concepts, our method can disentangle the desired concept guided by natural language such as content names (red color words in prompts), “style” or “composition” (e.g., relation or structural features like pose) while remaining faithful to prom… view at source ↗
Figure 2
Figure 2. Figure 2: (a)). Other works [2, 3] use subject masks to generate a single subject concept from images with multiple subjects, achieving relatively diverse subject disentanglement. How￾ever, they do not address abstract concepts that cannot be selected with a mask, such as style or relationships. Addi￾tionally, these methods often require fine-tuning during in￾ference or complex additional conditions for each sample,… view at source ↗
Figure 3
Figure 3. Figure 3: Framework of OmniPrism. (a) Given the reference image Iref , target prompt Ttar and concept guidance Tcg, the concept extractor disentangles concept representations fcpt by concatenating CLIP features fcg of Tcg with a learnable query q, and feeds fcpt into additional cross-attention layers in U-Net to generate target image Itar. A learnable block embedding ei is added to q to align the concept domain of i… view at source ↗
Figure 4
Figure 4. Figure 4: Diverse capabilities of our method. Our method supports the single concept disentangled generation from a same reference image, including different content, style, and composition. In addition, we can combine these disentangled concepts to generate results that incorporate multiple desired concepts. 4.2. Main Results We demonstrate the capabilities of our method from mul￾tiple aspects, as shown in [PITH_F… view at source ↗
Figure 5
Figure 5. Figure 5: Comparison with the state-of-the-art works. Our method achieves superior disentangled generation performance. It not only avoids introducing irrelevant concepts but also ensures the highest concept and prompt fidelity and image quality. Method Mask CLIP-I ↑ CLIP-T ↑ Style Similarity ↑ Aesthetic Score ↑ IP-Adapter [46] 0.7839 0.2430 0.8042 6.1854 BLIP-Diffusion [19] 0.7551 0.2489 0.5117 6.1742 DEADiff [26] … view at source ↗
Figure 6
Figure 6. Figure 6: Visualization of attention map. The results illustrate how concept guidance interacts with image representations in con￾cept extractor. Our method achieves the highest Mask CLIP-I and CLIP￾T scores, which indicates our superior concept fidelity and prompt fidelity. IP-Adapter achieves the highest style simi￾larity, but their method relies heavily on the reference image and neglects the text prompt, which e… view at source ↗
Figure 7
Figure 7. Figure 7: The t-SNE projection visualization of concept repre￾sentations with other methods. Our method effectively separates different types of concepts and obtains a disentangled visual con￾cept representation space. tractor, as shown in [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
Figure 9
Figure 9. Figure 9: Construction Pipeline of our PCD-200K. We design three data construction pipelines for the three concepts of “content”, “style”, and “composition”, each pipeline uses GPT-4o to obtain reference prompts Tref , target prompts Ttar, and concept guidance Tcg, and use different models to generate corresponding reference images Iref and target images Itar. 3 [PITH_FULL_IMAGE:figures/full_fig_p013_9.png] view at source ↗
Figure 13
Figure 13. Figure 13: Additional Controls with ControlNet. potentials on more creative applications in this section. I.1. Multi-Content Combinations In our paper, we demonstrate the creative generation results achieved by combining various concepts, such as content and style, to achieve subject stylization. The same concept, such as multiple style or composition concepts, is difficult to combine due to they may conflict with e… view at source ↗
Figure 11
Figure 11. Figure 11: Ablations of Concept Scale µ Reference ControlNet-Canny ControlNet-OpenPose ControlNet-Depth Ours A man and a woman A man cleaning the room [PITH_FULL_IMAGE:figures/full_fig_p014_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Discussion with ControlNet. ControlNet with all con￾ditions is prone to conflicts between prompts and structural fea￾tures, while our method extracts abstract “composition” concepts (e.g. relationships, poses) and generates creative results. Control Condition A woman in forest Concept: woman A woman in autumn A woman in room A woman in forest Concept: woman A woman in autumn A woman in room A tiger in wil… view at source ↗
Figure 16
Figure 16. Figure 16: Limitations of our OmniPrism. Our method may fail when the concept name is unknown. L. Limitations Our OmniPrism can disentangle and generate various con￾cepts in an image and allowing for any combination in a sin￾gle result. However, when the concepts in the reference im￾age are difficult to describe in natural language, such as un￾known categories (Unknown Concept Name), our method struggles to generate… view at source ↗
Figure 14
Figure 14. Figure 14: Combination of multiple content concepts. We use latent masks to assign layouts to different concepts to prevent them from conflicting. Concept: cat A girl in the school A boy in the snow A lion in the forest A tiger in the wild Concept: dog Concept: man Concept: girl A girl in the wild A man in the hospital A owl in the sky A girl in the room A woman in the snow A eagle in the sky A tiger in the rain A d… view at source ↗
Figure 15
Figure 15. Figure 15: Concept Blending. We modify the concept in prompts to some other subjects to generate creative results. risks such as the creation of realistic but false content that can spread misinformation and deepfakes, potentially un￾dermining public trust and political discourse. The unau￾thorized use of copyrighted material raises legal and ethical concerns, while biases in training datasets can perpetuate harmful… view at source ↗
Figure 17
Figure 17. Figure 17: Disentangled Generation of Content. 6 [PITH_FULL_IMAGE:figures/full_fig_p016_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Disentangled Generation of Style. 7 [PITH_FULL_IMAGE:figures/full_fig_p017_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: Disentangled Generation of Composition. 8 [PITH_FULL_IMAGE:figures/full_fig_p018_19.png] view at source ↗
read the original abstract

Creative visual concept generation often draws inspiration from specific concepts in a reference image to produce relevant outcomes. However, existing methods are typically constrained to single-aspect concept generation or are easily disrupted by irrelevant concepts in multi-aspect concept scenarios, leading to concept confusion and hindering creative generation. To address this, we propose OmniPrism, a visual concept disentangling approach for creative image generation. Our method learns disentangled concept representations guided by natural language and trains a diffusion model to incorporate these concepts. We utilize the rich semantic space of a multimodal extractor to achieve concept disentanglement from given images and concept guidance. To disentangle concepts with different semantics, we construct a paired concept disentangled dataset (PCD-200K), where each pair shares the same concept such as content, style, and composition. We learn disentangled concept representations through our contrastive orthogonal disentangled (COD) training pipeline, which are then injected into additional diffusion cross-attention layers for generation. A set of block embeddings is designed to adapt each block's concept domain in the diffusion models. Extensive experiments demonstrate that our method can generate high-quality, concept-disentangled results with high fidelity to text prompts and desired concepts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes OmniPrism for disentangling visual concepts (content, style, composition) from reference images to enable controlled creative generation in diffusion models. It constructs a paired dataset PCD-200K where image pairs share exactly one semantic concept, employs a multimodal extractor with a contrastive orthogonal disentangled (COD) training pipeline to learn representations, and injects these via additional cross-attention layers and block embeddings into the diffusion model.

Significance. If the disentanglement holds without residual correlations, the method could advance multi-aspect controllable generation by reducing concept confusion, offering a structured alternative to single-aspect or entangled approaches through large-scale paired data and orthogonal losses.

major comments (3)
  1. [PCD-200K construction] PCD-200K construction (described in the method section): the claim that each pair shares the same concept such as content, style, and composition and differs in exactly one axis lacks any quantitative validation (cross-axis correlation statistics, human verification rates, or extractor ablation). This assumption is load-bearing for the COD pipeline, as residual correlations would be encoded as 'disentangled' factors.
  2. [Experiments] Experiments and results: the abstract states that 'extensive experiments demonstrate' high-quality, concept-disentangled results with high fidelity, yet no quantitative metrics (FID, CLIP-based fidelity, disentanglement scores), baseline comparisons, or ablation details on COD components are referenced, leaving the central performance claim unsupported.
  3. [COD training pipeline] COD training pipeline: the contrastive orthogonal loss is presented as isolating concepts via the multimodal extractor, but without reported evidence that the extractor yields unbiased vectors or that the orthogonal term removes entanglement beyond what contrastive alone achieves, the isolation claim cannot be verified.
minor comments (1)
  1. [Abstract] The role of 'block embeddings' in adapting each block's concept domain is mentioned in the abstract but would benefit from an earlier definition or diagram reference for readers unfamiliar with the diffusion architecture modifications.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on the PCD-200K dataset, experimental reporting, and COD pipeline. The comments identify areas where additional evidence would strengthen the manuscript, and we address each point below with plans for revision.

read point-by-point responses
  1. Referee: [PCD-200K construction] PCD-200K construction (described in the method section): the claim that each pair shares the same concept such as content, style, and composition and differs in exactly one axis lacks any quantitative validation (cross-axis correlation statistics, human verification rates, or extractor ablation). This assumption is load-bearing for the COD pipeline, as residual correlations would be encoded as 'disentangled' factors.

    Authors: We agree that the manuscript does not currently include quantitative validation (e.g., cross-axis correlations or human verification rates) for the PCD-200K pairs. This is a substantive concern given the central role of the dataset. In the revised manuscript we will add a validation subsection reporting correlation statistics across concept axes and human evaluation results on pair quality. revision: yes

  2. Referee: [Experiments] Experiments and results: the abstract states that 'extensive experiments demonstrate' high-quality, concept-disentangled results with high fidelity, yet no quantitative metrics (FID, CLIP-based fidelity, disentanglement scores), baseline comparisons, or ablation details on COD components are referenced, leaving the central performance claim unsupported.

    Authors: The referee is correct that the current manuscript references extensive experiments in the abstract but does not report quantitative metrics, baseline comparisons, or COD ablations. We will expand the experiments section in the revision to include FID, CLIP-based fidelity, disentanglement scores, baseline comparisons, and component ablations. revision: yes

  3. Referee: [COD training pipeline] COD training pipeline: the contrastive orthogonal loss is presented as isolating concepts via the multimodal extractor, but without reported evidence that the extractor yields unbiased vectors or that the orthogonal term removes entanglement beyond what contrastive alone achieves, the isolation claim cannot be verified.

    Authors: We acknowledge that the manuscript describes the COD loss but does not provide ablations or analyses demonstrating the orthogonal term's incremental benefit or the unbiased character of the extracted vectors. The revised version will include targeted ablations (contrastive-only vs. full COD) and supporting metrics or visualizations. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation relies on external dataset construction and new loss, not self-referential fitting

full rationale

The paper constructs PCD-200K externally and defines a COD pipeline with contrastive-orthogonal loss to produce concept vectors that are then injected into diffusion cross-attention. No equation or claim equates a reported performance metric to a quantity defined by the model's own fitted parameters. The central claims rest on the dataset pairs and multimodal extractor as independent inputs rather than on any self-definition, fitted-input-as-prediction, or self-citation chain. This matches the default expectation of a non-circular empirical method.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The approach rests on the assumption that a multimodal extractor yields separable semantics and that paired data can be constructed without introducing systematic biases; block embeddings are learned parameters whose count and initialization are not specified in the abstract.

free parameters (1)
  • block embeddings
    Learned per-block adaptation vectors whose dimensionality and training schedule are unspecified in the abstract.
axioms (1)
  • domain assumption Multimodal extractor supplies rich semantic space sufficient for concept disentanglement
    Invoked when the method states it uses the extractor to achieve disentanglement from given images.

pith-pipeline@v0.9.0 · 5758 in / 1255 out tokens · 36401 ms · 2026-05-23T06:40:47.145209+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CLAY: Conditional Visual Similarity Modulation in Vision-Language Embedding Space

    cs.CV 2026-04 unverdicted novelty 7.0

    CLAY reframes pretrained VLM embedding spaces as text-conditional similarity spaces for adaptive, multi-conditioned image retrieval without additional training.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · cited by 1 Pith paper · 9 internal anchors

  1. [1]

    An image is worth multiple words: Multi-attribute inversion for constrained text-to-image syn- thesis

    Aishwarya Agarwal, Srikrishna Karanam, Tripti Shukla, and Balaji Vasan Srinivasan. An image is worth multiple words: Multi-attribute inversion for constrained text-to-image syn- thesis. arXiv:2311.11919, 2023. 3, 5

  2. [2]

    Break-a-scene: Extracting multiple concepts from a single image

    Omri Avrahami, Kfir Aberman, Ohad Fried, Daniel Cohen- Or, and Dani Lischinski. Break-a-scene: Extracting multiple concepts from a single image. In SIGGRAPH Asia, pages 1–12, 2023. 2, 3

  3. [3]

    Disenbooth: Disentangled parameter-efficient tuning for subject-driven text-to-image generation,

    Hong Chen, Yipeng Zhang, Simin Wu, Xin Wang, Xuguang Duan, Yuwei Zhou, and Wenwu Zhu. Disenbooth: Identity- preserving disentangled tuning for subject-driven text-to- image generation. arXiv:2305.03374, 2023. 2

  4. [4]

    Anydoor: Zero-shot object-level im- age customization

    Xi Chen, Lianghua Huang, Yu Liu, Yujun Shen, Deli Zhao, and Hengshuang Zhao. Anydoor: Zero-shot object-level im- age customization. arXiv:2307.09481, 2023. 3, 5

  5. [5]

    Anydoor: Zero-shot object-level im- age customization

    Xi Chen, Lianghua Huang, Yu Liu, Yujun Shen, Deli Zhao, and Hengshuang Zhao. Anydoor: Zero-shot object-level im- age customization. In CVPR, pages 6593–6602, 2024. 2

  6. [6]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy. An image is worth 16x16 words: Trans- formers for image recognition at scale. arXiv:2010.11929,

  7. [7]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021. 3

  8. [8]

    An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

    Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image gen- eration using textual inversion. arXiv:2208.01618, 2022. 3, 5

  9. [9]

    An image is worth one word: Personalizing text-to-image generation using textual inversion

    Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit Haim Bermano, Gal Chechik, and Daniel Cohen-or. An image is worth one word: Personalizing text-to-image generation using textual inversion. In ICLR, 2023. 2

  10. [10]

    Style aligned image generation via shared atten- tion

    Amir Hertz, Andrey V oynov, Shlomi Fruchter, and Daniel Cohen-Or. Style aligned image generation via shared atten- tion. In CVPR, pages 4775–4785, 2024. 2

  11. [11]

    Clipscore: A reference-free evaluation met- ric for image captioning

    Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation met- ric for image captioning. In EMNLP, 2021. 5

  12. [12]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv:2207.12598, 2022. 4

  13. [13]

    Learning disentangled iden- tifiers for action-customized text-to-image generation

    Siteng Huang, Biao Gong, Yutong Feng, Xi Chen, Yuqian Fu, Yu Liu, and Donglin Wang. Learning disentangled iden- tifiers for action-customized text-to-image generation. In CVPR, pages 7797–7806, 2024. 2

  14. [14]

    Reversion: Diffusion-based relation inversion from images

    Ziqi Huang, Tianxing Wu, Yuming Jiang, Kelvin CK Chan, and Ziwei Liu. Reversion: Diffusion-based relation inversion from images. arXiv:2303.13495, 2023. 2

  15. [15]

    Visual style prompting with swapping self- attention

    Jaeseok Jeong, Junho Kim, Yunjey Choi, Gayoung Lee, and Youngjung Uh. Visual style prompting with swapping self- attention. arXiv:2402.12974, 2024. 2

  16. [16]

    An image is worth mul- tiple words: Discovering object level concepts using multi- concept prompt learning

    Chen Jin, Ryutaro Tanno, Amrutha Saseendran, Tom Di- ethe, and Philip Alexander Teare. An image is worth mul- tiple words: Discovering object level concepts using multi- concept prompt learning. In ICML, 2024. 3

  17. [17]

    Segment Anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C. Berg, Wan-Yen Lo, Piotr Doll ´ar, and Ross Girshick. Segment anything. arXiv:2304.02643, 2023. 2

  18. [18]

    Multi-concept customiza- tion of text-to-image diffusion

    Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customiza- tion of text-to-image diffusion. In CVPR, pages 1931–1941,

  19. [19]

    Blip-diffusion: Pre- trained subject representation for controllable text-to-image generation and editing

    Dongxu Li, Junnan Li, and Steven Hoi. Blip-diffusion: Pre- trained subject representation for controllable text-to-image generation and editing. NeurIPS, 2024. 2, 3, 4, 5, 6, 7, 1

  20. [20]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, pages 19730–19742. PMLR, 2023. 2, 3, 4, 5, 1

  21. [21]

    Pseudo numerical methods for diffusion models on manifolds

    Luping Liu, Yi Ren, Zhijie Lin, and Zhou Zhao. Pseudo numerical methods for diffusion models on manifolds. arXiv:2202.09778, 2022. 4

  22. [22]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019. 5

  23. [23]

    Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps

    Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. NeurIPS, pages 5775–5787, 2022. 4

  24. [24]

    GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

    Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv:2112.10741, 2021. 2

  25. [25]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M ¨uller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion mod- els for high-resolution image synthesis. arXiv:2307.01952,

  26. [26]

    Dead- iff: An efficient stylization diffusion model with disentan- gled representations

    Tianhao Qi, Shancheng Fang, Yanze Wu, Hongtao Xie, Ji- awei Liu, Lang Chen, Qian He, and Yongdong Zhang. Dead- iff: An efficient stylization diffusion model with disentan- gled representations. arXiv:2403.06951, 2024. 2, 3, 4, 5, 6, 7, 1

  27. [27]

    Learn- ing transferable visual models from natural language super- vision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. In ICML, 2021. 2, 3, 4, 5, 1

  28. [28]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image gener- ation with clip latents. arXiv:2204.06125, 2022. 2 9

  29. [29]

    High-resolution image syn- thesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj¨orn Ommer. High-resolution image syn- thesis with latent diffusion models. In CVPR, pages 10684– 10695, 2022. 2, 3, 6

  30. [30]

    Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation

    Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In CVPR, pages 22500–22510, 2023. 2, 3, 5

  31. [31]

    Photorealistic text-to-image diffusion models with deep language understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. NeurIPS, 35:36479–36494, 2022. 2

  32. [32]

    LAION-5B: an open large-scale dataset for training next generation image-text models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Worts- man, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5B: an open large-scale dataset for training next generation image-text model...

  33. [33]

    Instant- booth: Personalized text-to-image generation without test- time finetuning

    Jing Shi, Wei Xiong, Zhe Lin, and Hyun Joon Jung. Instant- booth: Personalized text-to-image generation without test- time finetuning. In CVPR, pages 8543–8552, 2024. 2

  34. [34]

    Styledrop: Text-to-image synthesis of any style

    Kihyuk Sohn, Lu Jiang, Jarred Barber, Kimin Lee, Nataniel Ruiz, Dilip Krishnan, Huiwen Chang, Yuanzhen Li, Irfan Essa, Michael Rubinstein, et al. Styledrop: Text-to-image synthesis of any style. NeurIPS, 2024. 2

  35. [35]

    Measuring style similarity in diffusion models.arXiv preprint arXiv:2404.01292, 2024

    Gowthami Somepalli, Anubhav Gupta, Kamal Gupta, Shra- may Palta, Micah Goldblum, Jonas Geiping, Abhinav Shri- vastava, and Tom Goldstein. Measuring style similarity in diffusion models. arXiv:2404.01292, 2024. 5

  36. [36]

    Denoising Diffusion Implicit Models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denois- ing diffusion implicit models. arXiv:2010.02502, 2020. 4, 5

  37. [37]

    Kolors: Effective training of diffusion model for photorealistic text-to-image synthesis

    Kolors Team. Kolors: Effective training of diffusion model for photorealistic text-to-image synthesis. arXiv preprint,

  38. [38]

    Visualizing data using t-sne

    Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. JMLR, 9(11), 2008. 8

  39. [39]

    Concept decomposition for visual exploration and inspiration

    Yael Vinker, Andrey V oynov, Daniel Cohen-Or, and Ariel Shamir. Concept decomposition for visual exploration and inspiration. TOG, 2023. 2, 3

  40. [40]

    p+: Ex- tended textual conditioning in text-to-image generation,

    Andrey V oynov, Qinghao Chu, Daniel Cohen-Or, and Kfir Aberman. P+: Extended textual conditioning in text-to- image generation. arXiv:2303.09522, 2023. 3, 4

  41. [41]

    Instantstyle: Free lunch towards style- preserving in text-to-image generation

    Haofan Wang, Qixun Wang, Xu Bai, Zekui Qin, and Anthony Chen. Instantstyle: Free lunch towards style- preserving in text-to-image generation. arXiv:2404.02733,

  42. [42]

    Styleadapter: A single-pass lora-free model for stylized image generation

    Zhouxia Wang, Xintao Wang, Liangbin Xie, Zhongang Qi, Ying Shan, Wenping Wang, and Ping Luo. Styleadapter: A single-pass lora-free model for stylized image generation. arXiv:2309.01770, 2023. 2

  43. [43]

    Elite: Encoding visual con- cepts into textual embeddings for customized text-to-image generation

    Yuxiang Wei, Yabo Zhang, Zhilong Ji, Jinfeng Bai, Lei Zhang, and Wangmeng Zuo. Elite: Encoding visual con- cepts into textual embeddings for customized text-to-image generation. In ICCV, pages 15943–15953, 2023. 2, 3, 5

  44. [44]

    Cusconcept: Cus- tomized visual concept decomposition with diffusion mod- els

    Zhi Xu, Shaozhe Hao, and Kai Han. Cusconcept: Cus- tomized visual concept decomposition with diffusion mod- els. arXiv:2410.00398, 2024. 2, 3

  45. [45]

    Paint by example: Exemplar-based image editing with diffusion mod- els

    Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, and Fang Wen. Paint by example: Exemplar-based image editing with diffusion mod- els. In CVPR, pages 18381–18391, 2023. 3, 5

  46. [46]

    IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

    Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip- adapter: Text compatible image prompt adapter for text-to- image diffusion models. arXiv:2308.06721, 2023. 3, 5, 6, 7, 1

  47. [47]

    DINO: DETR with improved denoising anchor boxes for end-to-end object de- tection

    Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel Ni, and Heung-Yeung Shum. DINO: DETR with improved denoising anchor boxes for end-to-end object de- tection. In ICLR, 2023. 3

  48. [48]

    Adding conditional control to text-to-image diffusion models

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In ICCV, pages 3836–3847, 2023. 2, 3, 5

  49. [49]

    Prospect: Prompt spec- trum for attribute-aware personalization of diffusion models

    Yuxin Zhang, Weiming Dong, Fan Tang, Nisha Huang, Haibin Huang, Chongyang Ma, Tong-Yee Lee, Oliver Deussen, and Changsheng Xu. Prospect: Prompt spec- trum for attribute-aware personalization of diffusion models. TOG, 42(6):1–14, 2023. 3

  50. [50]

    Ssr-encoder: Encoding selective subject representation for subject-driven generation

    Yuxuan Zhang, Yiren Song, Jiaming Liu, Rui Wang, Jinpeng Yu, Hao Tang, Huaxia Li, Xu Tang, Yao Hu, Han Pan, et al. Ssr-encoder: Encoding selective subject representation for subject-driven generation. In CVPR, 2024. 2, 3, 4, 5, 6, 7, 1

  51. [51]

    composition

    Yanbing Zhang, Mengping Yang, Qin Zhou, and Zhe Wang. Attention calibration for disentangled text-to-image person- alization. In CVPR, pages 4764–4774, 2024. 3 10 OmniPrism: Learning Disentangled Visual Concept for Image Generation Supplementary Material In the supplementary materials, we introduce more detailed analysis and additional results: • Sec. F p...