T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models
Pith reviewed 2026-05-16 22:44 UTC · model grok-4.3
The pith
Lightweight adapters align external signals with the internal knowledge of frozen text-to-image diffusion models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By learning simple and lightweight T2I-Adapters, internal knowledge implicitly learned by large T2I models can be aligned with external control signals while the original large T2I models remain frozen. Different adapters can then be trained for separate conditions to produce rich control and editing effects on color and structure, with the adapters showing composability and generalization ability.
What carries the argument
T2I-Adapter: a small trainable network that receives an external condition signal and injects aligned features into the frozen diffusion model's intermediate layers.
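A minimal sketch of that shape, assuming a Stable-Diffusion-like UNet; the class name, channel widths, and layer choices below are illustrative stand-ins, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class TinyAdapter(nn.Module):
    """Hypothetical lightweight adapter: condition image -> multi-scale features."""
    def __init__(self, cond_channels=3, widths=(320, 640, 1280, 1280)):
        super().__init__()
        self.stem = nn.Conv2d(cond_channels, widths[0], 3, padding=1)
        self.stages = nn.ModuleList()
        in_ch = widths[0]
        for out_ch in widths:
            self.stages.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),  # downsample
                nn.SiLU(),
                nn.Conv2d(out_ch, out_ch, 3, padding=1),
            ))
            in_ch = out_ch

    def forward(self, cond):
        x = self.stem(cond)
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)  # one feature map per UNet resolution level
        return feats

# During denoising, each adapter feature is added to the frozen UNet encoder
# feature at the matching resolution; the base weights never change.
```

The only trainable parameters are the adapter's; the frozen model sees the control signal solely through these additive features.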
If this is right
- Separate adapters can be trained for distinct controls such as color palettes or edge structures and applied independently.
- Multiple adapters can be combined at inference time to enforce several conditions simultaneously (see the composition sketch after this list).
- The frozen base model retains its original sample quality and diversity while the adapters add targeted guidance.
- New adapters can be trained for additional conditions without touching the underlying diffusion weights.
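Composability falls out of the additive interface. A hedged sketch, reusing the hypothetical TinyAdapter above; the per-adapter weights are user-chosen knobs, and the 0.8/0.6 values are purely illustrative:

```python
sketch_adapter = TinyAdapter()   # e.g., trained on edge maps
depth_adapter = TinyAdapter()    # e.g., trained on depth maps

def composed_features(sketch_cond, depth_cond, w_sketch=0.8, w_depth=0.6):
    feats_s = sketch_adapter(sketch_cond)
    feats_d = depth_adapter(depth_cond)
    # Weighted sum per resolution level; the weights trade off the two controls.
    return [w_sketch * fs + w_depth * fd for fs, fd in zip(feats_s, feats_d)]
```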
Where Pith is reading between the lines
- The approach suggests that future models could ship with a library of plug-in adapters for common creative tasks.
- Composability may allow users to build custom editing pipelines by stacking adapters trained on different signals.
- Because only small modules are updated, the method could support on-device fine-tuning for domain-specific control.
- The same alignment idea might extend to other generative modalities such as video or 3D synthesis.
Load-bearing premise
The knowledge already captured inside a pre-trained text-to-image model contains enough structure that a small adapter can redirect it toward new control signals without breaking coherence.
What would settle it
Generate images with the adapter using a clear control signal such as a depth map, then measure whether the output depth deviates substantially from the input map or whether FID scores rise sharply compared with the unadapted model.
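A sketch of that test under assumed helpers; `estimate_depth` (e.g., a MiDaS wrapper) is a placeholder, not a published evaluation harness:

```python
import numpy as np

def depth_consistency(input_depth: np.ndarray, output_image: np.ndarray) -> float:
    pred = estimate_depth(output_image)  # hypothetical monocular depth estimator
    # Monocular depth is scale-ambiguous, so normalize both maps before comparing.
    pred = (pred - pred.mean()) / (pred.std() + 1e-8)
    ref = (input_depth - input_depth.mean()) / (input_depth.std() + 1e-8)
    return float(np.abs(pred - ref).mean())  # lower = better control fidelity

# Pair this with FID between adapted and unadapted samples (e.g., pytorch-fid,
# cited as [36] below); a sharp FID rise would signal that control costs quality.
```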
Original abstract
The incredible generative ability of large-scale text-to-image (T2I) models has demonstrated strong power of learning complex structures and meaningful semantics. However, relying solely on text prompts cannot fully take advantage of the knowledge learned by the model, especially when flexible and accurate controlling (e.g., color and structure) is needed. In this paper, we aim to "dig out" the capabilities that T2I models have implicitly learned, and then explicitly use them to control the generation more granularly. Specifically, we propose to learn simple and lightweight T2I-Adapters to align internal knowledge in T2I models with external control signals, while freezing the original large T2I models. In this way, we can train various adapters according to different conditions, achieving rich control and editing effects in the color and structure of the generation results. Further, the proposed T2I-Adapters have attractive properties of practical value, such as composability and generalization ability. Extensive experiments demonstrate that our T2I-Adapter has promising generation quality and a wide range of applications.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes T2I-Adapter, lightweight modules inserted into frozen text-to-image diffusion models (e.g., Stable Diffusion) to align external control signals such as sketches, depth maps, and color palettes with the model's internal representations. Adapters are trained via standard conditional diffusion loss on paired data while the base UNet remains frozen; the paper reports qualitative and quantitative results on controllability, adapter composability at inference time, and generalization to new conditions or editing tasks.
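The training recipe the summary describes reduces to a short loop. A hedged sketch, reusing the hypothetical TinyAdapter from above; `unet`, `encode_latents`, `noise_sched`, and `loader` are stand-ins for a real Stable Diffusion stack, not the authors' code:

```python
import torch
import torch.nn.functional as F

adapter = TinyAdapter()
for p in unet.parameters():  # base model stays frozen
    p.requires_grad_(False)
opt = torch.optim.AdamW(adapter.parameters(), lr=1e-4)

for image, cond, text_emb in loader:  # paired (image, condition) data
    z0 = encode_latents(image)  # VAE latents
    t = torch.randint(0, noise_sched.num_steps, (z0.shape[0],), device=z0.device)
    noise = torch.randn_like(z0)
    zt = noise_sched.add_noise(z0, noise, t)  # forward diffusion
    # Adapter features are injected into the frozen UNet (assumed interface).
    eps = unet(zt, t, text_emb, adapter_feats=adapter(cond))
    loss = F.mse_loss(eps, noise)  # standard noise-prediction objective
    opt.zero_grad()
    loss.backward()
    opt.step()  # only adapter weights update
```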
Significance. If the empirical results hold under rigorous evaluation, the contribution is significant for enabling parameter-efficient, modular control of large-scale T2I models without full retraining. The emphasis on composability and low training cost addresses practical needs in deployment and editing workflows, and the approach could generalize as a template for adapter-based conditioning in other generative architectures.
major comments (3)
- [§3.2] Adapter insertion points: the multi-scale feature alignment is presented as leveraging pre-existing internal knowledge, yet the training objective is the standard diffusion loss, with no auxiliary term that encourages reuse of frozen UNet features over learning a new mapping; an ablation measuring feature similarity (e.g., cosine distance between pre- and post-adapter activations) is needed to support the central 'dig out' claim.
- [§4.3] Composability experiments: independently trained adapters are summed at inference, but no quantitative metrics (FID, control accuracy, or artifact rate) are reported for combined use versus single-adapter baselines; this leaves the practical composability claim without load-bearing evidence.
- [Table 2] Quantitative results: reported FID and user-study scores show competitive performance, but the table lacks error bars, the number of runs, and statistical tests; without these, marginal gains over baselines cannot be confidently attributed to the adapter design.
minor comments (2)
- [Figure 3] The captions are terse; each row should explicitly state the control signal type and conditioning strength to aid reproducibility.
- [§2] The related-work section omits discussion of concurrent adapter methods in diffusion models; a brief comparison paragraph would clarify novelty.
Simulated Author's Rebuttal
We thank the referee for the detailed review and the recommendation for minor revision. We address the major comments point by point below, and have incorporated revisions to strengthen the manuscript where appropriate.
Point-by-point responses
Referee: [§3.2] Adapter insertion points: the multi-scale feature alignment is presented as leveraging pre-existing internal knowledge, yet the training objective is the standard diffusion loss, with no auxiliary term that encourages reuse of frozen UNet features over learning a new mapping; an ablation measuring feature similarity (e.g., cosine distance between pre- and post-adapter activations) is needed to support the central 'dig out' claim.
Authors: We appreciate this comment, which highlights an important aspect of our design. The use of a frozen UNet with standard conditional diffusion loss is intentional, as it forces the lightweight adapter to align external conditions with the pre-trained features rather than learning a new mapping from scratch. To directly address the request for supporting evidence, we have added an ablation in the revised manuscript that computes cosine similarities between activations in the frozen UNet with and without the adapter. The results show high similarity scores, indicating that the adapter primarily modulates rather than overwrites internal representations, thereby supporting the 'dig out' claim. revision: yes
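A sketch of what such an ablation could look like, assuming hooks that collect per-level encoder activations with and without the adapter on the same latent and timestep; this illustrates the metric, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def feature_similarity(h_base, h_adapted):
    # h_base / h_adapted: lists of (B, C, H, W) activations per resolution
    # level, captured with and without adapter features injected.
    sims = [F.cosine_similarity(a.flatten(1), b.flatten(1), dim=1).mean()
            for a, b in zip(h_base, h_adapted)]
    return torch.stack(sims)  # values near 1 suggest modulation, not overwrite

# High per-level similarity would support the 'dig out' reading: the adapter
# steers existing representations rather than replacing them.
```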
Referee: [§4.3] Composability experiments: independently trained adapters are summed at inference, but no quantitative metrics (FID, control accuracy, or artifact rate) are reported for combined use versus single-adapter baselines; this leaves the practical composability claim without load-bearing evidence.
Authors: We agree that quantitative support for composability would enhance the claims. In the original manuscript, we focused on qualitative demonstrations due to the challenges in defining precise metrics for multi-condition control. However, following this suggestion, we have included additional quantitative results in the revision, reporting FID scores and control accuracy metrics for compositions of adapters (e.g., sketch + depth). These show that composable use achieves performance close to individual adapters without significant degradation, providing the requested load-bearing evidence. revision: yes
Referee: [Table 2] Quantitative results: reported FID and user-study scores show competitive performance, but the table lacks error bars, the number of runs, and statistical tests; without these, marginal gains over baselines cannot be confidently attributed to the adapter design.
Authors: We acknowledge the importance of statistical rigor in quantitative evaluations. The results in Table 2 are based on single runs, following common practice for large-scale generative models given computational constraints. In the revised version, we have added a note stating the number of runs (one) and included error bars from multiple seeds where feasible on smaller subsets. While full statistical tests across all baselines would require substantial additional compute, the consistent trends across metrics support our conclusions. revision: partial
Circularity Check
Empirical adapter training exhibits no circularity
Full rationale
The paper describes a standard empirical procedure: lightweight adapters are trained from scratch on external paired (image, condition) datasets while the base T2I diffusion model remains frozen. No mathematical derivation chain exists; there are no equations that reduce a claimed prediction to a fitted parameter by construction, no self-definitional loops, and no load-bearing self-citations that import uniqueness theorems. All performance claims rest on experimental results rather than tautological reuse of inputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- T2I-Adapter weights
axioms (1)
- Domain assumption: large T2I diffusion models have implicitly learned complex structures and meaningful semantics from their training data.
invented entities (1)
- T2I-Adapter (no independent evidence)
Forward citations
Cited by 18 Pith papers
- Delta-Adapter: Scalable Exemplar-Based Image Editing with Single-Pair Supervision
  Delta-Adapter extracts a semantic delta from a single image pair via a pre-trained vision encoder and injects it through a Perceiver adapter to enable scalable single-pair supervised editing.
- LooseRoPE: Content-aware Attention Manipulation for Semantic Harmonization
  LooseRoPE modulates RoPE in diffusion attention maps to continuously trade off between preserving a pasted object's identity and harmonizing it with its new surroundings.
- AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
  A single motion module trained on videos adds temporally coherent animation to any personalized text-to-image model derived from the same base without additional tuning.
- Adding Conditional Control to Text-to-Image Diffusion Models
  ControlNet adds spatial conditioning controls to pretrained text-to-image diffusion models via zero convolutions for stable fine-tuning on small or large datasets.
- The two clocks and the innovation window: When and how generative models learn rules
  Generative models learn rules before memorizing data, creating an innovation window whose width depends on dataset size and rule complexity, observed in both diffusion and autoregressive architectures.
- Stylistic Attribute Control in Latent Diffusion Models
  A technique for parametric stylistic control in latent diffusion models learns disentangled directions from synthetic datasets and applies them via guidance composition while preserving semantics.
- Map2World: Segment Map Conditioned Text to 3D World Generation
  Map2World produces scale-consistent 3D worlds from text and arbitrary segment maps via a detail enhancer that incorporates global structure information.
- PhysEdit: Physically-Consistent Region-Aware Image Editing via Adaptive Spatio-Temporal Reasoning
  PhysEdit introduces adaptive reasoning depth and spatial masking to make image editing faster and more instruction-aligned without retraining the base model.
- UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors
  UniVidX unifies diverse video generation tasks into one conditional diffusion model using stochastic condition masking, decoupled gated LoRAs, and cross-modal self-attention.
- MetaSR: Content-Adaptive Metadata Orchestration for Generative Super-Resolution
  MetaSR adaptively orchestrates metadata in a DiT-based generative SR model to deliver up to 1 dB PSNR gains and 50% bitrate savings across diverse content and degradations.
- Camera Control for Text-to-Image Generation via Learning Viewpoint Tokens
  Viewpoint tokens learned on a mixed 3D-rendered and photorealistic dataset enable precise camera control in text-to-image generation while factorizing geometry from appearance and transferring to unseen object categories.
- VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness
  VBench-2.0 is a benchmark suite that automatically evaluates video generative models on five dimensions of intrinsic faithfulness: Human Fidelity, Controllability, Creativity, Physics, and Commonsense using VLMs, LLMs...
- CameraCtrl: Enabling Camera Control for Text-to-Video Generation
  CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.
- VideoCrafter1: Open Diffusion Models for High-Quality Video Generation
  Open-source text-to-video and image-to-video diffusion models generate high-quality 1024x576 videos, with the I2V variant claimed as the first to strictly preserve reference image content.
- IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models
  IP-Adapter adds effective image prompting to text-to-image diffusion models using a lightweight decoupled cross-attention adapter that works alongside text prompts and other controls.
- Any2Any 3D Diffusion Models with Knowledge Transfer: A Radiotherapy Planning Study
  DiffKT3D transfers priors from video diffusion models to 3D radiotherapy dose prediction via modality-specific embeddings and clinically guided RL, reducing voxel MAE from 2.07 to 1.93 and claiming SOTA over the GDP-H...
- Step1X-Edit: A Practical Framework for General Image Editing
  Step1X-Edit integrates a multimodal LLM with a diffusion decoder, trained on a custom high-quality dataset, to deliver image editing performance that surpasses open-source baselines and approaches proprietary models o...
- Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey
  A comprehensive survey of PEFT algorithms for large models, covering their performance, overhead, applications, and real-world system implementations.
Reference graph
Works this paper leans on
- [1] Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, et al. eDiff-I: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022.
- [2] Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. COCO-Stuff: Thing and stuff classes in context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1209–1218, 2018.
- [3] Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, and Yu Qiao. Vision transformer adapter for dense predictions. arXiv preprint arXiv:2205.08534, 2022.
- [4] MMPose Contributors. OpenMMLab pose estimation toolbox and benchmark. https://github.com/open-mmlab/mmpose, 2020.
- [5] Antonia Creswell, Tom White, Vincent Dumoulin, Kai Arulkumaran, Biswa Sengupta, and Anil A Bharath. Generative adversarial networks: An overview. IEEE Signal Processing Magazine, 35(1):53–65, 2018.
- [6] Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, et al. CogView: Mastering text-to-image generation via transformers. Advances in Neural Information Processing Systems, 34:19822–19835, 2021.
- [7] Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
- [8] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
- [9] Weixi Feng, Xuehai He, Tsu-Jui Fu, Varun Jampani, Arjun Akula, Pradyumna Narayana, Sugato Basu, Xin Eric Wang, and William Yang Wang. Training-free structured diffusion guidance for compositional text-to-image synthesis. arXiv preprint arXiv:2212.05032, 2022.
- [10] Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-A-Scene: Scene-based text-to-image generation with human priors. In Computer Vision–ECCV 2022, Part XV, pages 89–106. Springer, 2022.
- [11] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
- [12] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
- [13] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, pages 2790–2799. PMLR, 2019.
- [14] Lianghua Huang, Di Chen, Yu Liu, Yujun Shen, Deli Zhao, and Jingren Zhou. Composer: Creative and controllable image synthesis with composable conditions. 2023.
- [15] Xun Huang, Arun Mallya, Ting-Chun Wang, and Ming-Yu Liu. Multimodal conditional image synthesis with product-of-experts GANs. In Computer Vision–ECCV 2022, Part XVI, pages 91–109. Springer, 2022.
- [16] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1125–1134, 2017.
- [17] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186, 2019.
- [18] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [19] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
- [20] Yanghao Li, Hanzi Mao, Ross Girshick, and Kaiming He. Exploring plain vision transformer backbones for object detection. In Computer Vision–ECCV 2022, Part IX, pages 280–296. Springer, 2022.
- [21] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In Computer Vision–ECCV 2014, Part V, pages 740–755. Springer, 2014.
- [22] Vivian Liu and Lydia B Chilton. Design guidelines for prompt engineering text-to-image generative models. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, pages 1–23, 2022.
- [23] Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In International Conference on Machine Learning, pages 16784–16804. PMLR, 2022.
- [24] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2337–2346, 2019.
- [25] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
- [26] Nikita Pavlichenko and Dmitry Ustalov. Best prompts for text-to-image models and how to find them. arXiv preprint arXiv:2209.11711, 2022.
- [27] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763, 2021.
- [28] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- [29] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
- [30] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021.
- [31] René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3), 2022.
- [32] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
- [33] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015, Part III, pages 234–241. Springer, 2015.
- [34] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022.
- [35] Edgar Schönfeld, Vadim Sushko, Dan Zhang, Juergen Gall, Bernt Schiele, and Anna Khoreva. You only need adversarial supervision for semantic image synthesis. In International Conference on Learning Representations, 2021.
- [36] Maximilian Seitzer. pytorch-fid: FID Score for PyTorch. https://github.com/mseitzer/pytorch-fid, August 2020. Version 0.3.0.
- [37] Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1874–1883, 2016.
- [38] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
- [39] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019.
- [40] Asa Cooper Stickland and Iain Murray. BERT and PALs: Projected attention layers for efficient adaptation in multi-task learning. In International Conference on Machine Learning, pages 5986–5995. PMLR, 2019.
- [41] Zhuo Su, Wenzhe Liu, Zitong Yu, Dewen Hu, Qing Liao, Qi Tian, Matti Pietikäinen, and Li Liu. Pixel difference networks for efficient edge detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5117–5127, 2021.
- [42] Andrey Voynov, Kfir Aberman, and Daniel Cohen-Or. Sketch-guided text-to-image diffusion models. arXiv preprint arXiv:2211.13752, 2022.
- [43] Tengfei Wang, Ting Zhang, Bo Zhang, Hao Ouyang, Dong Chen, Qifeng Chen, and Fang Wen. Pretraining is all you need for image-to-image translation. arXiv preprint arXiv:2205.12952, 2022.
- [44] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8798–8807, 2018.
- [45] Lvmin Zhang and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. 2023.
- [46] Yufan Zhou, Ruiyi Zhang, Changyou Chen, Chunyuan Li, Chris Tensmeyer, Tong Yu, Jiuxiang Gu, Jinhui Xu, and Tong Sun. LAFITE: Towards language-free training for text-to-image generation. arXiv preprint arXiv:2111.13792, 2021.
- [47] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2223–2232, 2017.