pith. machine review for the scientific record.

arxiv: 2302.08453 · v2 · submitted 2023-02-16 · 💻 cs.CV · cs.AI · cs.LG · cs.MM

Recognition: unknown

T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models

Authors on Pith no claims yet

Pith reviewed 2026-05-16 22:44 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG · cs.MM
keywords text-to-image · diffusion models · adapters · controllable generation · image synthesis · structure control · generative AI

The pith

Lightweight adapters align external signals with the internal knowledge of frozen text-to-image diffusion models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large text-to-image diffusion models already encode complex structures and semantics, yet text prompts alone limit precise control over color or layout. The paper proposes training small T2I-Adapters that map external conditions such as edges or depth maps into the model's existing representations. Only the adapters are trained while the base model stays frozen, preserving its generative quality. This yields composable controls that support structure-aware editing and generalization across inputs. The result is granular manipulation without the cost of retraining billions of parameters.

Core claim

By learning simple and lightweight T2I-Adapters, internal knowledge implicitly learned by large T2I models can be aligned with external control signals while the original large T2I models remain frozen. Different adapters can then be trained for separate conditions to produce rich control and editing effects on color and structure, with the adapters showing composability and generalization ability.

What carries the argument

T2I-Adapter: a small trainable network that receives an external condition signal and injects aligned features into the frozen diffusion model's intermediate layers.
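
A minimal sketch of this mechanism, assuming a PyTorch-style setup: a small convolutional adapter maps a spatial condition (here a stand-in edge map) to one feature map per scale, which would be added to the frozen denoiser's encoder activations during sampling. The channel widths echo Stable Diffusion's encoder but, like the rest of the block, are illustrative assumptions rather than the paper's exact architecture.

    import torch
    import torch.nn as nn

    class TinyAdapter(nn.Module):
        """Toy stand-in for a T2I-Adapter: condition map in, one feature map per scale out."""
        def __init__(self, cond_channels=1, widths=(320, 640, 1280, 1280)):
            super().__init__()
            self.stem = nn.Conv2d(cond_channels, widths[0], 3, padding=1)
            downs, blocks = [], []
            in_ch = widths[0]
            for w in widths:
                downs.append(nn.Conv2d(in_ch, w, 3, stride=2, padding=1))  # halve resolution per scale
                blocks.append(nn.Sequential(
                    nn.Conv2d(w, w, 3, padding=1), nn.SiLU(),
                    nn.Conv2d(w, w, 3, padding=1)))
                in_ch = w
            self.downs, self.blocks = nn.ModuleList(downs), nn.ModuleList(blocks)

        def forward(self, cond):
            x = self.stem(cond)
            feats = []
            for down, block in zip(self.downs, self.blocks):
                x = block(down(x))
                feats.append(x)  # finest scale first
            return feats

    adapter = TinyAdapter()
    edge_map = torch.randn(1, 1, 256, 256)     # stand-in for an edge or sketch condition
    adapter_feats = adapter(edge_map)          # four maps at 128, 64, 32 and 16 pixels
    # During denoising, each map would be added elementwise to the matching frozen
    # encoder activation: unet_feats[i] = unet_feats[i] + adapter_feats[i]
    print([tuple(f.shape) for f in adapter_feats])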

If this is right

  • Separate adapters can be trained for distinct controls such as color palettes or edge structures and applied independently.
  • Multiple adapters can be combined at inference time to enforce several conditions simultaneously; a minimal composition sketch follows this list.
  • The frozen base model retains its original sample quality and diversity while the adapters add targeted guidance.
  • New adapters can be trained for additional conditions without touching the underlying diffusion weights.
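
A sketch of the composability point above, assuming each adapter exposes per-scale features like the toy module earlier: combining adapters reduces to a weighted sum of their features before injection. The weights are hypothetical user-facing knobs, not values from the paper.

    import torch

    def compose_adapter_features(feature_lists, weights):
        """Weighted sum of the per-scale feature lists produced by several adapters."""
        n_scales = len(feature_lists[0])
        return [sum(w * feats[s] for w, feats in zip(weights, feature_lists))
                for s in range(n_scales)]

    # Dummy per-scale features standing in for, e.g., a sketch adapter and a depth adapter.
    sketch_feats = [torch.randn(1, c, r, r) for c, r in [(320, 32), (640, 16)]]
    depth_feats = [torch.randn(1, c, r, r) for c, r in [(320, 32), (640, 16)]]
    fused = compose_adapter_features([sketch_feats, depth_feats], weights=[0.6, 0.8])
    # The fused maps would then be added to the frozen UNet's encoder features as before.
    print([tuple(f.shape) for f in fused])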

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The approach suggests that future models could ship with a library of plug-in adapters for common creative tasks.
  • Composability may allow users to build custom editing pipelines by stacking adapters trained on different signals.
  • Because only small modules are updated, the method could support on-device fine-tuning for domain-specific control.
  • The same alignment idea might extend to other generative modalities such as video or 3D synthesis.

Load-bearing premise

The knowledge already captured inside a pre-trained text-to-image model contains enough structure that a small adapter can redirect it toward new control signals without breaking coherence.

What would settle it

Generate images with the adapter using a clear control signal such as a depth map, then measure whether the output depth deviates substantially from the input map or whether FID scores rise sharply compared with the unadapted model.
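
One way to operationalize that test, assuming a depth-conditioned generator and a monocular depth estimator are available elsewhere; only the deviation metric is sketched here, alongside the FID comparison it would sit next to.

    import torch

    def depth_deviation(cond_depth, est_depth, eps=1e-6):
        """Mean absolute difference between the conditioning depth map and the depth
        re-estimated from the generated image, after normalizing both to [0, 1]."""
        cond = cond_depth.flatten().float()
        est = est_depth.flatten().float()
        cond = (cond - cond.min()) / (cond.max() - cond.min() + eps)
        est = (est - est.min()) / (est.max() - est.min() + eps)
        return (cond - est).abs().mean().item()

    # Toy example with random maps; in practice cond_depth is the adapter's input and
    # est_depth comes from a monocular depth estimator run on the generated image.
    score = depth_deviation(torch.rand(1, 256, 256), torch.rand(1, 256, 256))
    print(f"depth deviation: {score:.3f}")
    # A sharp rise of this value, or of FID relative to the unadapted model, would count
    # against the claim; low values on held-out condition maps would support it.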

read the original abstract

The incredible generative ability of large-scale text-to-image (T2I) models has demonstrated strong power of learning complex structures and meaningful semantics. However, relying solely on text prompts cannot fully take advantage of the knowledge learned by the model, especially when flexible and accurate controlling (e.g., color and structure) is needed. In this paper, we aim to "dig out" the capabilities that T2I models have implicitly learned, and then explicitly use them to control the generation more granularly. Specifically, we propose to learn simple and lightweight T2I-Adapters to align internal knowledge in T2I models with external control signals, while freezing the original large T2I models. In this way, we can train various adapters according to different conditions, achieving rich control and editing effects in the color and structure of the generation results. Further, the proposed T2I-Adapters have attractive properties of practical value, such as composability and generalization ability. Extensive experiments demonstrate that our T2I-Adapter has promising generation quality and a wide range of applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes T2I-Adapter, lightweight modules inserted into frozen text-to-image diffusion models (e.g., Stable Diffusion) to align external control signals such as sketches, depth maps, and color palettes with the model's internal representations. Adapters are trained via standard conditional diffusion loss on paired data while the base UNet remains frozen; the paper reports qualitative and quantitative results on controllability, adapter composability at inference time, and generalization to new conditions or editing tasks.
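
A schematic of the training setup described in the summary, with toy stand-ins for the UNet and the adapter: only the adapter's parameters receive gradients, the base denoiser stays frozen, and the loss is the usual DDPM-style noise-prediction MSE. Module sizes and the noise-schedule value are illustrative assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ToyUNet(nn.Module):          # stand-in for the frozen denoiser
        def __init__(self):
            super().__init__()
            self.net = nn.Conv2d(4, 4, 3, padding=1)
        def forward(self, z_t, t, cond_feat):   # the timestep is ignored in this toy version
            return self.net(z_t + cond_feat)

    class ToyAdapter(nn.Module):       # stand-in for the trainable adapter
        def __init__(self):
            super().__init__()
            self.net = nn.Conv2d(1, 4, 3, padding=1)
        def forward(self, cond):
            return self.net(cond)

    unet, adapter = ToyUNet(), ToyAdapter()
    unet.requires_grad_(False)                         # base model stays frozen
    optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)

    z0 = torch.randn(2, 4, 32, 32)                     # clean latents
    cond = torch.randn(2, 1, 32, 32)                   # paired condition maps (e.g., edges)
    t = torch.randint(0, 1000, (2,))
    noise = torch.randn_like(z0)
    alpha_bar = 0.9                                    # stand-in for the noise-schedule value at t
    z_t = alpha_bar ** 0.5 * z0 + (1 - alpha_bar) ** 0.5 * noise

    pred = unet(z_t, t, adapter(cond))                 # adapter features injected into the denoiser
    loss = F.mse_loss(pred, noise)                     # standard conditional diffusion loss
    loss.backward()
    optimizer.step()
    print(f"adapter-only training step, loss = {loss.item():.4f}")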

Significance. If the empirical results hold under rigorous evaluation, the contribution is significant for enabling parameter-efficient, modular control of large-scale T2I models without full retraining. The emphasis on composability and low training cost addresses practical needs in deployment and editing workflows, and the approach could generalize as a template for adapter-based conditioning in other generative architectures.

major comments (3)
  1. [§3.2] §3.2, adapter insertion points: the multi-scale feature alignment is presented as leveraging pre-existing internal knowledge, yet the training objective is the standard diffusion loss with no auxiliary term to encourage reuse of frozen UNet features versus learning a new mapping; an ablation measuring feature similarity (e.g., cosine distance between pre- and post-adapter activations) is needed to support the central 'dig out' claim.
  2. [§4.3] §4.3, composability experiments: independently trained adapters are summed at inference, but no quantitative metrics (FID, control accuracy, or artifact rate) are reported for combined use versus single-adapter baselines; this leaves the practical composability claim without load-bearing evidence.
  3. [Table 2] Table 2, quantitative results: reported FID and user-study scores show competitive performance, but the table lacks error bars, number of runs, or statistical tests; marginal gains over baselines cannot be confidently attributed to the adapter design without these.
minor comments (2)
  1. [Figure 3] Figure 3 captions are terse; they should explicitly state the control signal type and strength for each row to aid reproducibility.
  2. [§2] The related-work section omits discussion of concurrent adapter methods in diffusion models; a brief comparison paragraph would clarify novelty.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed review and the recommendation for minor revision. We address the major comments point by point below, and have incorporated revisions to strengthen the manuscript where appropriate.

read point-by-point responses
  1. Referee: [§3.2] §3.2, adapter insertion points: the multi-scale feature alignment is presented as leveraging pre-existing internal knowledge, yet the training objective is the standard diffusion loss with no auxiliary term to encourage reuse of frozen UNet features versus learning a new mapping; an ablation measuring feature similarity (e.g., cosine distance between pre- and post-adapter activations) is needed to support the central 'dig out' claim.

    Authors: We appreciate this comment, which highlights an important aspect of our design. The use of a frozen UNet with standard conditional diffusion loss is intentional, as it forces the lightweight adapter to align external conditions with the pre-trained features rather than learning a new mapping from scratch. To directly address the request for supporting evidence, we have added an ablation in the revised manuscript that computes cosine similarities between activations in the frozen UNet with and without the adapter. The results show high similarity scores, indicating that the adapter primarily modulates rather than overwrites internal representations, thereby supporting the 'dig out' claim (a minimal sketch of this measurement appears after these responses). revision: yes

  2. Referee: [§4.3] §4.3, composability experiments: independently trained adapters are summed at inference, but no quantitative metrics (FID, control accuracy, or artifact rate) are reported for combined use versus single-adapter baselines; this leaves the practical composability claim without load-bearing evidence.

    Authors: We agree that quantitative support for composability would enhance the claims. In the original manuscript, we focused on qualitative demonstrations due to the challenges in defining precise metrics for multi-condition control. However, following this suggestion, we have included additional quantitative results in the revision, reporting FID scores and control accuracy metrics for compositions of adapters (e.g., sketch + depth). These show that composable use achieves performance close to individual adapters without significant degradation, providing the requested load-bearing evidence. revision: yes

  3. Referee: [Table 2] Table 2, quantitative results: reported FID and user-study scores show competitive performance, but the table lacks error bars, number of runs, or statistical tests; marginal gains over baselines cannot be confidently attributed to the adapter design without these.

    Authors: We acknowledge the importance of statistical rigor in quantitative evaluations. The results in Table 2 are based on single runs following common practice in the field for large-scale generative models due to computational constraints. In the revised version, we have added a note clarifying the number of runs (one) and included error bars where feasible from multiple seeds on smaller subsets. While full statistical tests across all baselines would require substantial additional compute, we believe the consistent trends across metrics support the conclusions. revision: partial
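
A minimal version of the activation-similarity measurement discussed in response 1, assuming feature maps have already been hooked out of the frozen encoder during passes with and without the adapter; the tensors below are random stand-ins.

    import torch
    import torch.nn.functional as F

    def activation_similarity(feat_base, feat_adapted):
        """Mean cosine similarity between per-position feature vectors of two activation maps."""
        b, c, h, w = feat_base.shape
        base = feat_base.permute(0, 2, 3, 1).reshape(-1, c)      # one vector per spatial location
        adapted = feat_adapted.permute(0, 2, 3, 1).reshape(-1, c)
        return F.cosine_similarity(base, adapted, dim=1).mean().item()

    feat = torch.randn(1, 320, 32, 32)            # stand-in frozen encoder activation
    delta = 0.1 * torch.randn_like(feat)          # small adapter contribution
    print(f"cosine similarity: {activation_similarity(feat, feat + delta):.3f}")
    # Values near 1.0 would back the rebuttal's claim that the adapter modulates,
    # rather than overwrites, the frozen representations.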

Circularity Check

0 steps flagged

Empirical adapter training exhibits no circularity

full rationale

The paper describes a standard empirical procedure: lightweight adapters are trained from scratch on external paired (image, condition) datasets while the base T2I diffusion model remains frozen. No mathematical derivation chain exists; there are no equations that reduce a claimed prediction to a fitted parameter by construction, no self-definitional loops, and no load-bearing self-citations that import uniqueness theorems. All performance claims rest on experimental results rather than tautological reuse of inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The claim rests on the domain assumption that T2I models have already learned rich implicit knowledge about structure and semantics that can be aligned via simple adapters; the adapter weights themselves are learned parameters.

free parameters (1)
  • T2I-Adapter weights
    Learned during training to align external signals with the frozen model.
axioms (1)
  • domain assumption Large T2I diffusion models have implicitly learned complex structures and meaningful semantics from training data
    Invoked to justify that external controls can be aligned without retraining the base model.
invented entities (1)
  • T2I-Adapter no independent evidence
    purpose: Lightweight module to provide external control to frozen T2I model
    New module introduced by the paper; no independent evidence outside the work itself.

pith-pipeline@v0.9.0 · 5522 in / 1312 out tokens · 40876 ms · 2026-05-16T22:44:07.595923+00:00 · methodology

discussion (0)


Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Delta-Adapter: Scalable Exemplar-Based Image Editing with Single-Pair Supervision

    cs.CV 2026-05 unverdicted novelty 7.0

    Delta-Adapter extracts a semantic delta from a single image pair via a pre-trained vision encoder and injects it through a Perceiver adapter to enable scalable single-pair supervised editing.

  2. LooseRoPE: Content-aware Attention Manipulation for Semantic Harmonization

    cs.GR 2026-01 unverdicted novelty 7.0

    LooseRoPE modulates RoPE in diffusion attention maps to continuously trade off between preserving a pasted object's identity and harmonizing it with its new surroundings.

  3. AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

    cs.CV 2023-07 unverdicted novelty 7.0

    A single motion module trained on videos adds temporally coherent animation to any personalized text-to-image model derived from the same base without additional tuning.

  4. Adding Conditional Control to Text-to-Image Diffusion Models

    cs.CV 2023-02 conditional novelty 7.0

    ControlNet adds spatial conditioning controls to pretrained text-to-image diffusion models via zero convolutions for stable fine-tuning on small or large datasets.

  5. The two clocks and the innovation window: When and how generative models learn rules

    cs.LG 2026-05 unverdicted novelty 6.0

    Generative models learn rules before memorizing data, creating an innovation window whose width depends on dataset size and rule complexity, observed in both diffusion and autoregressive architectures.

  6. Stylistic Attribute Control in Latent Diffusion Models

    cs.CV 2026-05 unverdicted novelty 6.0

    A technique for parametric stylistic control in latent diffusion models learns disentangled directions from synthetic datasets and applies them via guidance composition while preserving semantics.

  7. Map2World: Segment Map Conditioned Text to 3D World Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    Map2World produces scale-consistent 3D worlds from text and arbitrary segment maps via a detail enhancer that incorporates global structure information.

  8. PhysEdit: Physically-Consistent Region-Aware Image Editing via Adaptive Spatio-Temporal Reasoning

    cs.CV 2026-05 unverdicted novelty 6.0

    PhysEdit introduces adaptive reasoning depth and spatial masking to make image editing faster and more instruction-aligned without retraining the base model.

  9. UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors

    cs.CV 2026-05 unverdicted novelty 6.0

    UniVidX unifies diverse video generation tasks into one conditional diffusion model using stochastic condition masking, decoupled gated LoRAs, and cross-modal self-attention.

  10. MetaSR: Content-Adaptive Metadata Orchestration for Generative Super-Resolution

    cs.CV 2026-04 unverdicted novelty 6.0

    MetaSR adaptively orchestrates metadata in a DiT-based generative SR model to deliver up to 1 dB PSNR gains and 50% bitrate savings across diverse content and degradations.

  11. Camera Control for Text-to-Image Generation via Learning Viewpoint Tokens

    cs.CV 2026-04 unverdicted novelty 6.0

    Viewpoint tokens learned on a mixed 3D-rendered and photorealistic dataset enable precise camera control in text-to-image generation while factorizing geometry from appearance and transferring to unseen object categories.

  12. VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness

    cs.CV 2025-03 accept novelty 6.0

    VBench-2.0 is a benchmark suite that automatically evaluates video generative models on five dimensions of intrinsic faithfulness: Human Fidelity, Controllability, Creativity, Physics, and Commonsense using VLMs, LLMs...

  13. CameraCtrl: Enabling Camera Control for Text-to-Video Generation

    cs.CV 2024-04 unverdicted novelty 6.0

    CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.

  14. VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

    cs.CV 2023-10 unverdicted novelty 6.0

    Open-source text-to-video and image-to-video diffusion models generate high-quality 1024x576 videos, with the I2V variant claimed as the first to strictly preserve reference image content.

  15. IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

    cs.CV 2023-08 unverdicted novelty 6.0

    IP-Adapter adds effective image prompting to text-to-image diffusion models using a lightweight decoupled cross-attention adapter that works alongside text prompts and other controls.

  16. Any2Any 3D Diffusion Models with Knowledge Transfer: A Radiotherapy Planning Study

    cs.CV 2026-05 unverdicted novelty 5.0

    DiffKT3D transfers priors from video diffusion models to 3D radiotherapy dose prediction via modality-specific embeddings and clinically guided RL, reducing voxel MAE from 2.07 to 1.93 and claiming SOTA over the GDP-H...

  17. Step1X-Edit: A Practical Framework for General Image Editing

    cs.CV 2025-04 unverdicted novelty 4.0

    Step1X-Edit integrates a multimodal LLM with a diffusion decoder, trained on a custom high-quality dataset, to deliver image editing performance that surpasses open-source baselines and approaches proprietary models o...

  18. Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey

    cs.LG 2024-03 accept novelty 4.0

    A comprehensive survey of PEFT algorithms for large models, covering their performance, overhead, applications, and real-world system implementations.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · cited by 18 Pith papers · 7 internal anchors

  1. [1]

    eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

    Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, et al. eDiff-I: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022. 3

  2. [2]

    Coco-stuff: Thing and stuff classes in context

    Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. Coco-stuff: Thing and stuff classes in context. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1209–1218, 2018. 6

  3. [3]

    Vision transformer adapter for dense predictions

    Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, and Yu Qiao. Vision transformer adapter for dense predictions. arXiv preprint arXiv:2205.08534, 2022. 3

  4. [4]

    Openmmlab pose estimation toolbox and benchmark

    MMPose Contributors. Openmmlab pose estimation toolbox and benchmark. https://github.com/open-mmlab/mmpose, 2020. 6

  5. [5]

    Generative adversarial networks: An overview

    Antonia Creswell, Tom White, Vincent Dumoulin, Kai Arulkumaran, Biswa Sengupta, and Anil A Bharath. Generative adversarial networks: An overview. IEEE signal processing magazine, 35(1):53–65, 2018. 2

  6. [6]

    Cogview: Mastering text-to-image generation via transformers

    Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, et al. Cogview: Mastering text-to-image generation via transformers. Advances in Neural Information Processing Systems, 34:19822–19835, 2021. 2, 3

  7. [7]

    NICE: Non-linear Independent Components Estimation

    Laurent Dinh, David Krueger, and Yoshua Bengio. Nice: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014. 2

  8. [8]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021. 3

  9. [9]

    Training-free structured diffusion guidance for compositional text-to-image synthesis

    Weixi Feng, Xuehai He, Tsu-Jui Fu, Varun Jampani, Arjun Akula, Pradyumna Narayana, Sugato Basu, Xin Eric Wang, and William Yang Wang. Training-free structured diffusion guidance for compositional text-to-image synthesis. arXiv preprint arXiv:2212.05032, 2022. 3

  10. [10]

    Make-a-scene: Scene-based text-to-image generation with human priors

    Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-a-scene: Scene-based text-to-image generation with human priors. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XV, pages 89–106. Springer, 2022. 3

  11. [11]

    Prompt-to-Prompt Image Editing with Cross Attention Control

    Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022. 3

  12. [12]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020. 3

  13. [13]

    Parameter-efficient transfer learning for NLP

    Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, pages 2790–2799. PMLR, 2019. 3

  14. [14]

    Composer: Creative and controllable image synthesis with composable conditions

    Lianghua Huang, Di Chen, Yu Liu, Yujun Shen, Deli Zhao, and Jingren Zhou. Composer: Creative and controllable image synthesis with composable conditions. 2023. 3

  15. [15]

    Multimodal conditional image synthesis with product-of-experts GANs

    Xun Huang, Arun Mallya, Ting-Chun Wang, and Ming-Yu Liu. Multimodal conditional image synthesis with product-of-experts GANs. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XVI, pages 91–109. Springer, 2022. 3

  16. [16]

    Image-to-image translation with conditional adversarial networks

    Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134, 2017.

  17. [17]

    Bert: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186, 2019. 3

  18. [18]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

  19. [19]

    Auto-Encoding Variational Bayes

    Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013. 2

  20. [20]

    Exploring plain vision transformer backbones for object detection

    Yanghao Li, Hanzi Mao, Ross Girshick, and Kaiming He. Exploring plain vision transformer backbones for object detection. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part IX, pages 280–296. Springer, 2022. 3

  21. [21]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014. 6

  22. [22]

    Design guidelines for prompt engineering text-to-image generative models

    Vivian Liu and Lydia B Chilton. Design guidelines for prompt engineering text-to-image generative models. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, pages 1–23, 2022. 2

  23. [23]

    Glide: Towards photorealistic image generation and editing with text-guided diffusion models

    Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. In International Conference on Machine Learning, pages 16784–16804. PMLR, 2022. 2, 3

  24. [24]

    Semantic image synthesis with spatially-adaptive normalization

    Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2337–2346, 2019.

  25. [25]

    Semantic image synthesis with spatially-adaptive normalization

    Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019. 6

  26. [26]

    Best prompts for text-to-image models and how to find them

    Nikita Pavlichenko and Dmitry Ustalov. Best prompts for text-to-image models and how to find them. arXiv preprint arXiv:2209.11711, 2022. 2

  27. [27]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763, 2021. 3

  28. [28]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 6

  29. [29]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.

  30. [30]

    Zero-shot text-to-image generation

    Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021. 2, 3

  31. [31]

    Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer

    René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3), 2022. 6

  32. [32]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022. 2, 3, 5, 6, 7

  33. [33]

    U-net: Convolutional networks for biomedical image segmentation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer, 2015. 3

  34. [34]

    Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022. 2, 3

  35. [35]

    You only need adversarial supervision for semantic image synthesis

    Edgar Schönfeld, Vadim Sushko, Dan Zhang, Juergen Gall, Bernt Schiele, and Anna Khoreva. You only need adversarial supervision for semantic image synthesis. In International Conference on Learning Representations, 2021. 6

  36. [36]

    pytorch-fid: FID Score for PyTorch

    Maximilian Seitzer. pytorch-fid: FID Score for PyTorch. https://github.com/mseitzer/pytorch-fid, August 2020. Version 0.3.0. 6

  37. [37]

    Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network

    Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1874–1883, 2016. 4

  38. [38]

    Deep unsupervised learning using nonequilibrium thermodynamics

    Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015. 3

  39. [39]

    Generative modeling by estimating gradients of the data distribution

    Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems, 32, 2019. 3

  40. [40]

    Bert and pals: Projected attention layers for efficient adaptation in multi-task learning

    Asa Cooper Stickland and Iain Murray. Bert and pals: Projected attention layers for efficient adaptation in multi-task learning. In International Conference on Machine Learning, pages 5986–5995. PMLR, 2019. 3

  41. [41]

    Pixel difference networks for efficient edge detection

    Zhuo Su, Wenzhe Liu, Zitong Yu, Dewen Hu, Qing Liao, Qi Tian, Matti Pietikäinen, and Li Liu. Pixel difference networks for efficient edge detection. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5117–5127, 2021. 6

  42. [42]

    Sketch-guided text-to-image diffusion models

    Andrey Voynov, Kfir Aberman, and Daniel Cohen-Or. Sketch-guided text-to-image diffusion models. arXiv preprint arXiv:2211.13752, 2022. 3

  43. [43]

    Pretraining is all you need for image-to-image translation

    Tengfei Wang, Ting Zhang, Bo Zhang, Hao Ouyang, Dong Chen, Qifeng Chen, and Fang Wen. Pretraining is all you need for image-to-image translation. arXiv preprint arXiv:2205.12952, 2022. 3, 6

  44. [44]

    High-resolution image synthesis and semantic manipulation with conditional gans

    Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8798–8807, 2018. 3

  45. [45]

    Adding conditional control to text-to-image diffusion models, 2023

    Lvmin Zhang and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models, 2023. 3

  46. [46]

    Lafite: Towards language-free training for text-to-image generation

    Yufan Zhou, Ruiyi Zhang, Changyou Chen, Chunyuan Li, Chris Tensmeyer, Tong Yu, Jiuxiang Gu, Jinhui Xu, and Tong Sun. Lafite: Towards language-free training for text-to-image generation. arXiv preprint arXiv:2111.13792, 2021. 2, 3

  47. [47]

    Unpaired image-to-image translation using cycle-consistent adversarial networks

    Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 2223–2232, 2017. 3