pith. machine review for the scientific record.

arxiv: 2302.08453 · v2 · submitted 2023-02-16 · 💻 cs.CV · cs.AI · cs.LG · cs.MM

Recognition: unknown

T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models

Authors on Pith no claims yet

Pith reviewed 2026-05-16 22:44 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG · cs.MM
keywords text-to-image · diffusion models · adapters · controllable generation · image synthesis · structure control · generative AI

The pith

Lightweight adapters align external signals with the internal knowledge of frozen text-to-image diffusion models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large text-to-image diffusion models already encode complex structures and semantics, yet text prompts alone limit precise control over color or layout. The paper proposes training small T2I-Adapters that map external conditions such as edges or depth maps into the model's existing representations. Only the adapters are trained while the base model stays frozen, preserving its generative quality. This yields composable controls that support structure-aware editing and generalization across inputs. The result is granular manipulation without the cost of retraining billions of parameters.

Core claim

By learning simple and lightweight T2I-Adapters, internal knowledge implicitly learned by large T2I models can be aligned with external control signals while the original large T2I models remain frozen. Different adapters can then be trained for separate conditions to produce rich control and editing effects on color and structure, with the adapters showing composability and generalization ability.

What carries the argument

T2I-Adapter: a small trainable network that receives an external condition signal and injects aligned features into the frozen diffusion model's intermediate layers.
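
A minimal sketch of this mechanism, assuming a PyTorch-style setup: a small convolutional adapter maps a spatial condition (here a stand-in edge map) to one feature map per scale, which would be added to the frozen denoiser's encoder activations during sampling. The channel widths echo Stable Diffusion's encoder but, like the rest of the block, are illustrative assumptions rather than the paper's exact architecture.

    import torch
    import torch.nn as nn

    class TinyAdapter(nn.Module):
        """Toy stand-in for a T2I-Adapter: condition map in, one feature map per scale out."""
        def __init__(self, cond_channels=1, widths=(320, 640, 1280, 1280)):
            super().__init__()
            self.stem = nn.Conv2d(cond_channels, widths[0], 3, padding=1)
            downs, blocks = [], []
            in_ch = widths[0]
            for w in widths:
                downs.append(nn.Conv2d(in_ch, w, 3, stride=2, padding=1))  # halve resolution per scale
                blocks.append(nn.Sequential(
                    nn.Conv2d(w, w, 3, padding=1), nn.SiLU(),
                    nn.Conv2d(w, w, 3, padding=1)))
                in_ch = w
            self.downs, self.blocks = nn.ModuleList(downs), nn.ModuleList(blocks)

        def forward(self, cond):
            x = self.stem(cond)
            feats = []
            for down, block in zip(self.downs, self.blocks):
                x = block(down(x))
                feats.append(x)  # finest scale first
            return feats

    adapter = TinyAdapter()
    edge_map = torch.randn(1, 1, 256, 256)     # stand-in for an edge or sketch condition
    adapter_feats = adapter(edge_map)          # four maps at 128, 64, 32 and 16 pixels
    # During denoising, each map would be added elementwise to the matching frozen
    # encoder activation: unet_feats[i] = unet_feats[i] + adapter_feats[i]
    print([tuple(f.shape) for f in adapter_feats])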

If this is right

  • Separate adapters can be trained for distinct controls such as color palettes or edge structures and applied independently.
  • Multiple adapters can be combined at inference time to enforce several conditions simultaneously; a minimal composition sketch follows this list.
  • The frozen base model retains its original sample quality and diversity while the adapters add targeted guidance.
  • New adapters can be trained for additional conditions without touching the underlying diffusion weights.
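
A sketch of the composability point above, assuming each adapter exposes per-scale features like the toy module earlier: combining adapters reduces to a weighted sum of their features before injection. The weights are hypothetical user-facing knobs, not values from the paper.

    import torch

    def compose_adapter_features(feature_lists, weights):
        """Weighted sum of the per-scale feature lists produced by several adapters."""
        n_scales = len(feature_lists[0])
        return [sum(w * feats[s] for w, feats in zip(weights, feature_lists))
                for s in range(n_scales)]

    # Dummy per-scale features standing in for, e.g., a sketch adapter and a depth adapter.
    sketch_feats = [torch.randn(1, c, r, r) for c, r in [(320, 32), (640, 16)]]
    depth_feats = [torch.randn(1, c, r, r) for c, r in [(320, 32), (640, 16)]]
    fused = compose_adapter_features([sketch_feats, depth_feats], weights=[0.6, 0.8])
    # The fused maps would then be added to the frozen UNet's encoder features as before.
    print([tuple(f.shape) for f in fused])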

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The approach suggests that future models could ship with a library of plug-in adapters for common creative tasks.
  • Composability may allow users to build custom editing pipelines by stacking adapters trained on different signals.
  • Because only small modules are updated, the method could support on-device fine-tuning for domain-specific control.
  • The same alignment idea might extend to other generative modalities such as video or 3D synthesis.

Load-bearing premise

The knowledge already captured inside a pre-trained text-to-image model contains enough structure that a small adapter can redirect it toward new control signals without breaking coherence.

What would settle it

Generate images with the adapter using a clear control signal such as a depth map, then measure whether the output depth deviates substantially from the input map or whether FID scores rise sharply compared with the unadapted model.
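
One way to operationalize that test, assuming a depth-conditioned generator and a monocular depth estimator are available elsewhere; only the deviation metric is sketched here, alongside the FID comparison it would sit next to.

    import torch

    def depth_deviation(cond_depth, est_depth, eps=1e-6):
        """Mean absolute difference between the conditioning depth map and the depth
        re-estimated from the generated image, after normalizing both to [0, 1]."""
        cond = cond_depth.flatten().float()
        est = est_depth.flatten().float()
        cond = (cond - cond.min()) / (cond.max() - cond.min() + eps)
        est = (est - est.min()) / (est.max() - est.min() + eps)
        return (cond - est).abs().mean().item()

    # Toy example with random maps; in practice cond_depth is the adapter's input and
    # est_depth comes from a monocular depth estimator run on the generated image.
    score = depth_deviation(torch.rand(1, 256, 256), torch.rand(1, 256, 256))
    print(f"depth deviation: {score:.3f}")
    # A sharp rise of this value, or of FID relative to the unadapted model, would count
    # against the claim; low values on held-out condition maps would support it.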

read the original abstract

The incredible generative ability of large-scale text-to-image (T2I) models has demonstrated strong power of learning complex structures and meaningful semantics. However, relying solely on text prompts cannot fully take advantage of the knowledge learned by the model, especially when flexible and accurate controlling (e.g., color and structure) is needed. In this paper, we aim to "dig out" the capabilities that T2I models have implicitly learned, and then explicitly use them to control the generation more granularly. Specifically, we propose to learn simple and lightweight T2I-Adapters to align internal knowledge in T2I models with external control signals, while freezing the original large T2I models. In this way, we can train various adapters according to different conditions, achieving rich control and editing effects in the color and structure of the generation results. Further, the proposed T2I-Adapters have attractive properties of practical value, such as composability and generalization ability. Extensive experiments demonstrate that our T2I-Adapter has promising generation quality and a wide range of applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes T2I-Adapter, lightweight modules inserted into frozen text-to-image diffusion models (e.g., Stable Diffusion) to align external control signals such as sketches, depth maps, and color palettes with the model's internal representations. Adapters are trained via standard conditional diffusion loss on paired data while the base UNet remains frozen; the paper reports qualitative and quantitative results on controllability, adapter composability at inference time, and generalization to new conditions or editing tasks.
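
A schematic of the training setup described in the summary, with toy stand-ins for the UNet and the adapter: only the adapter's parameters receive gradients, the base denoiser stays frozen, and the loss is the usual DDPM-style noise-prediction MSE. Module sizes and the noise-schedule value are illustrative assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ToyUNet(nn.Module):          # stand-in for the frozen denoiser
        def __init__(self):
            super().__init__()
            self.net = nn.Conv2d(4, 4, 3, padding=1)
        def forward(self, z_t, t, cond_feat):   # the timestep is ignored in this toy version
            return self.net(z_t + cond_feat)

    class ToyAdapter(nn.Module):       # stand-in for the trainable adapter
        def __init__(self):
            super().__init__()
            self.net = nn.Conv2d(1, 4, 3, padding=1)
        def forward(self, cond):
            return self.net(cond)

    unet, adapter = ToyUNet(), ToyAdapter()
    unet.requires_grad_(False)                         # base model stays frozen
    optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)

    z0 = torch.randn(2, 4, 32, 32)                     # clean latents
    cond = torch.randn(2, 1, 32, 32)                   # paired condition maps (e.g., edges)
    t = torch.randint(0, 1000, (2,))
    noise = torch.randn_like(z0)
    alpha_bar = 0.9                                    # stand-in for the noise-schedule value at t
    z_t = alpha_bar ** 0.5 * z0 + (1 - alpha_bar) ** 0.5 * noise

    pred = unet(z_t, t, adapter(cond))                 # adapter features injected into the denoiser
    loss = F.mse_loss(pred, noise)                     # standard conditional diffusion loss
    loss.backward()
    optimizer.step()
    print(f"adapter-only training step, loss = {loss.item():.4f}")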

Significance. If the empirical results hold under rigorous evaluation, the contribution is significant for enabling parameter-efficient, modular control of large-scale T2I models without full retraining. The emphasis on composability and low training cost addresses practical needs in deployment and editing workflows, and the approach could generalize as a template for adapter-based conditioning in other generative architectures.

major comments (3)
  1. [§3.2] §3.2, adapter insertion points: the multi-scale feature alignment is presented as leveraging pre-existing internal knowledge, yet the training objective is the standard diffusion loss with no auxiliary term to encourage reuse of frozen UNet features versus learning a new mapping; an ablation measuring feature similarity (e.g., cosine distance between pre- and post-adapter activations) is needed to support the central 'dig out' claim.
  2. [§4.3] §4.3, composability experiments: independently trained adapters are summed at inference, but no quantitative metrics (FID, control accuracy, or artifact rate) are reported for combined use versus single-adapter baselines; this leaves the practical composability claim without load-bearing evidence.
  3. [Table 2] Table 2, quantitative results: reported FID and user-study scores show competitive performance, but the table lacks error bars, number of runs, or statistical tests; marginal gains over baselines cannot be confidently attributed to the adapter design without these.
minor comments (2)
  1. [Figure 3] Figure 3 captions are terse; they should explicitly state the control signal type and strength for each row to aid reproducibility.
  2. [§2] The related-work section omits discussion of concurrent adapter methods in diffusion models; a brief comparison paragraph would clarify novelty.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed review and the recommendation for minor revision. We address the major comments point by point below, and have incorporated revisions to strengthen the manuscript where appropriate.

read point-by-point responses
  1. Referee: [§3.2] §3.2, adapter insertion points: the multi-scale feature alignment is presented as leveraging pre-existing internal knowledge, yet the training objective is the standard diffusion loss with no auxiliary term to encourage reuse of frozen UNet features versus learning a new mapping; an ablation measuring feature similarity (e.g., cosine distance between pre- and post-adapter activations) is needed to support the central 'dig out' claim.

    Authors: We appreciate this comment, which highlights an important aspect of our design. The use of a frozen UNet with standard conditional diffusion loss is intentional, as it forces the lightweight adapter to align external conditions with the pre-trained features rather than learning a new mapping from scratch. To directly address the request for supporting evidence, we have added an ablation in the revised manuscript that computes cosine similarities between activations in the frozen UNet with and without the adapter. The results show high similarity scores, indicating that the adapter primarily modulates rather than overwrites internal representations, thereby supporting the 'dig out' claim (a minimal sketch of this measurement appears after these responses). revision: yes

  2. Referee: [§4.3] §4.3, composability experiments: independently trained adapters are summed at inference, but no quantitative metrics (FID, control accuracy, or artifact rate) are reported for combined use versus single-adapter baselines; this leaves the practical composability claim without load-bearing evidence.

    Authors: We agree that quantitative support for composability would enhance the claims. In the original manuscript, we focused on qualitative demonstrations due to the challenges in defining precise metrics for multi-condition control. However, following this suggestion, we have included additional quantitative results in the revision, reporting FID scores and control accuracy metrics for compositions of adapters (e.g., sketch + depth). These show that composable use achieves performance close to individual adapters without significant degradation, providing the requested load-bearing evidence. revision: yes

  3. Referee: [Table 2] Table 2, quantitative results: reported FID and user-study scores show competitive performance, but the table lacks error bars, number of runs, or statistical tests; marginal gains over baselines cannot be confidently attributed to the adapter design without these.

    Authors: We acknowledge the importance of statistical rigor in quantitative evaluations. The results in Table 2 are based on single runs following common practice in the field for large-scale generative models due to computational constraints. In the revised version, we have added a note clarifying the number of runs (one) and included error bars where feasible from multiple seeds on smaller subsets. While full statistical tests across all baselines would require substantial additional compute, we believe the consistent trends across metrics support the conclusions. revision: partial
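
A minimal version of the activation-similarity measurement discussed in response 1, assuming feature maps have already been hooked out of the frozen encoder during passes with and without the adapter; the tensors below are random stand-ins.

    import torch
    import torch.nn.functional as F

    def activation_similarity(feat_base, feat_adapted):
        """Mean cosine similarity between per-position feature vectors of two activation maps."""
        b, c, h, w = feat_base.shape
        base = feat_base.permute(0, 2, 3, 1).reshape(-1, c)      # one vector per spatial location
        adapted = feat_adapted.permute(0, 2, 3, 1).reshape(-1, c)
        return F.cosine_similarity(base, adapted, dim=1).mean().item()

    feat = torch.randn(1, 320, 32, 32)            # stand-in frozen encoder activation
    delta = 0.1 * torch.randn_like(feat)          # small adapter contribution
    print(f"cosine similarity: {activation_similarity(feat, feat + delta):.3f}")
    # Values near 1.0 would back the rebuttal's claim that the adapter modulates,
    # rather than overwrites, the frozen representations.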

Circularity Check

0 steps flagged

Empirical adapter training exhibits no circularity

full rationale

The paper describes a standard empirical procedure: lightweight adapters are trained from scratch on external paired (image, condition) datasets while the base T2I diffusion model remains frozen. No mathematical derivation chain exists; there are no equations that reduce a claimed prediction to a fitted parameter by construction, no self-definitional loops, and no load-bearing self-citations that import uniqueness theorems. All performance claims rest on experimental results rather than tautological reuse of inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The claim rests on the domain assumption that T2I models have already learned rich implicit knowledge about structure and semantics that can be aligned via simple adapters; the adapter weights themselves are learned parameters.

free parameters (1)
  • T2I-Adapter weights
    Learned during training to align external signals with the frozen model.
axioms (1)
  • domain assumption Large T2I diffusion models have implicitly learned complex structures and meaningful semantics from training data
    Invoked to justify that external controls can be aligned without retraining the base model.
invented entities (1)
  • T2I-Adapter no independent evidence
    purpose: Lightweight module to provide external control to frozen T2I model
    New module introduced by the paper; no independent evidence outside the work itself.

pith-pipeline@v0.9.0 · 5522 in / 1312 out tokens · 40876 ms · 2026-05-16T22:44:07.595923+00:00 · methodology

discussion (0)


Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Delta-Adapter: Scalable Exemplar-Based Image Editing with Single-Pair Supervision

    cs.CV 2026-05 unverdicted novelty 7.0

    Delta-Adapter extracts a semantic delta from a single image pair via a pre-trained vision encoder and injects it through a Perceiver adapter to enable scalable single-pair supervised editing.

  2. LooseRoPE: Content-aware Attention Manipulation for Semantic Harmonization

    cs.GR 2026-01 unverdicted novelty 7.0

    LooseRoPE modulates RoPE in diffusion attention maps to continuously trade off between preserving a pasted object's identity and harmonizing it with its new surroundings.

  3. AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

    cs.CV 2023-07 unverdicted novelty 7.0

    A single motion module trained on videos adds temporally coherent animation to any personalized text-to-image model derived from the same base without additional tuning.

  4. Adding Conditional Control to Text-to-Image Diffusion Models

    cs.CV 2023-02 conditional novelty 7.0

    ControlNet adds spatial conditioning controls to pretrained text-to-image diffusion models via zero convolutions for stable fine-tuning on small or large datasets.

  5. The two clocks and the innovation window: When and how generative models learn rules

    cs.LG 2026-05 unverdicted novelty 6.0

    Generative models learn rules before memorizing data, creating an innovation window whose width depends on dataset size and rule complexity, observed in both diffusion and autoregressive architectures.

  6. Stylistic Attribute Control in Latent Diffusion Models

    cs.CV 2026-05 unverdicted novelty 6.0

    A technique for parametric stylistic control in latent diffusion models learns disentangled directions from synthetic datasets and applies them via guidance composition while preserving semantics.

  7. Map2World: Segment Map Conditioned Text to 3D World Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    Map2World produces scale-consistent 3D worlds from text and arbitrary segment maps via a detail enhancer that incorporates global structure information.

  8. PhysEdit: Physically-Consistent Region-Aware Image Editing via Adaptive Spatio-Temporal Reasoning

    cs.CV 2026-05 unverdicted novelty 6.0

    PhysEdit introduces adaptive reasoning depth and spatial masking to make image editing faster and more instruction-aligned without retraining the base model.

  9. UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors

    cs.CV 2026-05 unverdicted novelty 6.0

    UniVidX unifies diverse video generation tasks into one conditional diffusion model using stochastic condition masking, decoupled gated LoRAs, and cross-modal self-attention.

  10. MetaSR: Content-Adaptive Metadata Orchestration for Generative Super-Resolution

    cs.CV 2026-04 unverdicted novelty 6.0

    MetaSR adaptively orchestrates metadata in a DiT-based generative SR model to deliver up to 1 dB PSNR gains and 50% bitrate savings across diverse content and degradations.

  11. Camera Control for Text-to-Image Generation via Learning Viewpoint Tokens

    cs.CV 2026-04 unverdicted novelty 6.0

    Viewpoint tokens learned on a mixed 3D-rendered and photorealistic dataset enable precise camera control in text-to-image generation while factorizing geometry from appearance and transferring to unseen object categories.

  12. VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness

    cs.CV 2025-03 accept novelty 6.0

    VBench-2.0 is a benchmark suite that automatically evaluates video generative models on five dimensions of intrinsic faithfulness: Human Fidelity, Controllability, Creativity, Physics, and Commonsense using VLMs, LLMs...

  13. CameraCtrl: Enabling Camera Control for Text-to-Video Generation

    cs.CV 2024-04 unverdicted novelty 6.0

    CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.

  14. VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

    cs.CV 2023-10 unverdicted novelty 6.0

    Open-source text-to-video and image-to-video diffusion models generate high-quality 1024x576 videos, with the I2V variant claimed as the first to strictly preserve reference image content.

  15. IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

    cs.CV 2023-08 unverdicted novelty 6.0

    IP-Adapter adds effective image prompting to text-to-image diffusion models using a lightweight decoupled cross-attention adapter that works alongside text prompts and other controls.

  16. Any2Any 3D Diffusion Models with Knowledge Transfer: A Radiotherapy Planning Study

    cs.CV 2026-05 unverdicted novelty 5.0

    DiffKT3D transfers priors from video diffusion models to 3D radiotherapy dose prediction via modality-specific embeddings and clinically guided RL, reducing voxel MAE from 2.07 to 1.93 and claiming SOTA over the GDP-H...

  17. Step1X-Edit: A Practical Framework for General Image Editing

    cs.CV 2025-04 unverdicted novelty 4.0

    Step1X-Edit integrates a multimodal LLM with a diffusion decoder, trained on a custom high-quality dataset, to deliver image editing performance that surpasses open-source baselines and approaches proprietary models o...

  18. Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey

    cs.LG 2024-03 accept novelty 4.0

    A comprehensive survey of PEFT algorithms for large models, covering their performance, overhead, applications, and real-world system implementations.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · cited by 18 Pith papers · 7 internal anchors

  1. [1]

    eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

    Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, et al. eDiff-I: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022. 3

  2. [2]

    Coco-stuff: Thing and stuff classes in context

    Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. Coco-stuff: Thing and stuff classes in context. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1209–1218, 2018. 6

  3. [3]

    Vision transformer adapter for dense predictions

    Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, and Yu Qiao. Vision transformer adapter for dense predictions. arXiv preprint arXiv:2205.08534, 2022. 3

  4. [4]

    Openmmlab pose estimation toolbox and benchmark

    MMPose Contributors. Openmmlab pose estimation toolbox and benchmark. https://github.com/open-mmlab/mmpose, 2020. 6

  5. [5]

    Generative adversarial networks: An overview

    Antonia Creswell, Tom White, Vincent Dumoulin, Kai Arulkumaran, Biswa Sengupta, and Anil A Bharath. Generative adversarial networks: An overview. IEEE signal processing magazine, 35(1):53–65, 2018. 2

  6. [6]

    Cogview: Mastering text-to-image generation via transformers

    Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, et al. Cogview: Mastering text-to-image generation via transformers. Advances in Neural Information Processing Systems, 34:19822–19835, 2021. 2, 3

  7. [7]

    NICE: Non-linear Independent Components Estimation

    Laurent Dinh, David Krueger, and Yoshua Bengio. Nice: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014. 2

  8. [8]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021. 3

  9. [9]

    Training-free structured diffusion guidance for compositional text-to-image synthesis

    Weixi Feng, Xuehai He, Tsu-Jui Fu, Varun Jampani, Arjun Akula, Pradyumna Narayana, Sugato Basu, Xin Eric Wang, and William Yang Wang. Training-free structured diffusion guidance for compositional text-to-image synthesis. arXiv preprint arXiv:2212.05032, 2022. 3

  10. [10]

    Make-a-scene: Scene-based text-to-image generation with human priors

    Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-a-scene: Scene-based text-to-image generation with human priors. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XV, pages 89–106. Springer, 2022. 3

  11. [11]

    Prompt-to-Prompt Image Editing with Cross Attention Control

    Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022. 3

  12. [12]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020. 3

  13. [13]

    Parameter-efficient transfer learning for NLP

    Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, pages 2790–2799. PMLR, 2019. 3

  14. [14]

    Composer: Creative and controllable image synthesis with composable conditions

    Lianghua Huang, Di Chen, Yu Liu, Yujun Shen, Deli Zhao, and Jingren Zhou. Composer: Creative and controllable image synthesis with composable conditions. 2023. 3

  15. [15]

    Multimodal conditional image synthesis with product-of-experts GANs

    Xun Huang, Arun Mallya, Ting-Chun Wang, and Ming-Yu Liu. Multimodal conditional image synthesis with product-of-experts GANs. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XVI, pages 91–109. Springer, 2022. 3

  16. [16]

    Image-to-image translation with conditional adversarial networks

    Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134, 2017.

  17. [17]

    Bert: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186, 2019. 3

  18. [18]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

  19. [19]

    Auto-Encoding Variational Bayes

    Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013. 2

  20. [20]

    Exploring plain vision transformer backbones for object detection

    Yanghao Li, Hanzi Mao, Ross Girshick, and Kaiming He. Exploring plain vision transformer backbones for object detection. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part IX, pages 280–296. Springer, 2022. 3

  21. [21]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014. 6

  22. [22]

    Design guidelines for prompt engineering text-to-image generative models

    Vivian Liu and Lydia B Chilton. Design guidelines for prompt engineering text-to-image generative models. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, pages 1–23, 2022. 2

  23. [23]

    Glide: Towards photorealistic image generation and editing with text-guided diffusion models

    Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. In International Conference on Machine Learning, pages 16784–16804. PMLR, 2022. 2, 3

  24. [24]

    Semantic image synthesis with spatially-adaptive normalization

    Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2337–2346, 2019.

  25. [25]

    Semantic image synthesis with spatially-adaptive normalization

    Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019. 6

  26. [26]

    Best prompts for text-to-image models and how to find them

    Nikita Pavlichenko and Dmitry Ustalov. Best prompts for text-to-image models and how to find them. arXiv preprint arXiv:2209.11711, 2022. 2

  27. [27]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763, 2021. 3

  28. [28]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 6

  29. [29]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.

  30. [30]

    Zero-shot text-to-image generation

    Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021. 2, 3

  31. [31]

    Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer

    René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3), 2022. 6

  32. [32]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022. 2, 3, 5, 6, 7

  33. [33]

    U-net: Convolutional networks for biomedical image segmentation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer, 2015. 3

  34. [34]

    Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022. 2, 3

  35. [35]

    You only need adversarial supervision for semantic image synthesis

    Edgar Schönfeld, Vadim Sushko, Dan Zhang, Juergen Gall, Bernt Schiele, and Anna Khoreva. You only need adversarial supervision for semantic image synthesis. In International Conference on Learning Representations, 2021. 6

  36. [36]

    pytorch-fid: FID Score for PyTorch

    Maximilian Seitzer. pytorch-fid: FID Score for PyTorch. https://github.com/mseitzer/pytorch-fid, August 2020. Version 0.3.0. 6

  37. [37]

    Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network

    Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1874–1883, 2016. 4

  38. [38]

    Deep unsupervised learning using nonequilibrium thermodynamics

    Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015. 3

  39. [39]

    Generative modeling by estimating gradients of the data distribution

    Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems, 32, 2019. 3

  40. [40]

    Bert and pals: Projected attention layers for efficient adaptation in multi-task learning

    Asa Cooper Stickland and Iain Murray. Bert and pals: Projected attention layers for efficient adaptation in multi-task learning. In International Conference on Machine Learning, pages 5986–5995. PMLR, 2019. 3

  41. [41]

    Pixel difference networks for efficient edge detection

    Zhuo Su, Wenzhe Liu, Zitong Yu, Dewen Hu, Qing Liao, Qi Tian, Matti Pietikäinen, and Li Liu. Pixel difference networks for efficient edge detection. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5117–5127, 2021. 6

  42. [42]

    Sketch-guided text-to-image diffusion models

    Andrey Voynov, Kfir Aberman, and Daniel Cohen-Or. Sketch-guided text-to-image diffusion models. arXiv preprint arXiv:2211.13752, 2022. 3

  43. [43]

    Pretraining is all you need for image-to-image translation

    Tengfei Wang, Ting Zhang, Bo Zhang, Hao Ouyang, Dong Chen, Qifeng Chen, and Fang Wen. Pretraining is all you need for image-to-image translation. arXiv preprint arXiv:2205.12952, 2022. 3, 6

  44. [44]

    High-resolution image synthesis and semantic manipulation with conditional gans

    Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8798–8807, 2018. 3

  45. [45]

    Adding conditional control to text-to-image diffusion models, 2023

    Lvmin Zhang and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models, 2023. 3

  46. [46]

    Lafite: Towards language-free training for text-to-image generation

    Yufan Zhou, Ruiyi Zhang, Changyou Chen, Chunyuan Li, Chris Tensmeyer, Tong Yu, Jiuxiang Gu, Jinhui Xu, and Tong Sun. Lafite: Towards language-free training for text-to-image generation. arXiv preprint arXiv:2111.13792, 2021. 2, 3

  47. [47]

    Unpaired image-to-image translation using cycle-consistent adversarial networks

    Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 2223–2232, 2017. 3