Recognition: 2 theorem links
· Lean TheoremAdding Conditional Control to Text-to-Image Diffusion Models
Pith reviewed 2026-05-16 22:38 UTC · model grok-4.3
The pith
ControlNet adds spatial controls like edges, depth, and human poses to pretrained text-to-image diffusion models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ControlNet locks the production-ready large diffusion models and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls, connected with zero convolutions that progressively grow parameters from zero to ensure no harmful noise affects the finetuning.
What carries the argument
Zero convolutions (zero-initialized convolution layers) that connect the new control network to the locked pretrained backbone and grow parameters gradually to preserve original model behavior.
Load-bearing premise
Zero convolutions progressively grow parameters from zero and ensure that no harmful noise could affect the finetuning, allowing the pretrained backbone to remain intact while learning new controls.
What would settle it
Generate images from the original diffusion model and from the same model plus a trained ControlNet with all control inputs set to zero or absent; the outputs should match in quality and distribution if the backbone stayed intact.
read the original abstract
We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, eg, edges, depth, segmentation, human pose, etc, with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ControlNet, a neural architecture that adds spatial conditioning controls (edges, depth, segmentation, human pose, etc.) to large pretrained text-to-image diffusion models such as Stable Diffusion. The original UNet is locked and control signals are injected exclusively through zero-initialized convolution layers that grow from zero during training, with the goal of preserving the pretrained backbone while enabling robust learning from small or large datasets under single or multiple conditions, with or without text prompts.
Significance. If the central claims are substantiated, the work would be significant for enabling modular, controllable extensions to production-grade diffusion models without retraining or degrading the base model, thereby supporting a range of downstream applications in image editing, design, and synthesis.
major comments (3)
- [Abstract] Abstract: the claim that 'the training of ControlNets is robust with small (<50k) and large (>1m) datasets' is presented without any quantitative metrics, error bars, ablation tables, or statistical comparisons; only qualitative results are referenced.
- [Method] Method (zero-convolution description): the assertion that zero convolutions 'ensure that no harmful noise could affect the finetuning' and that the pretrained backbone 'remains intact' is not accompanied by before/after measurements of generation quality on fixed text-only prompts; forward-pass summation of control features into locked layers could still alter activations once weights become nonzero.
- [Experiments] Experiments: no quantitative evaluation (FID, CLIP score, user studies, or ablation on control strength) is reported for any of the tested conditions, making it impossible to assess the strength of the robustness or multi-condition claims.
minor comments (2)
- [Method] The handling of multiple simultaneous conditions is mentioned but lacks a clear diagram or equation showing how feature maps from separate ControlNets are combined inside the locked UNet.
- [Figures] Figure captions should explicitly state the conditioning input type and whether text prompts were used for each example.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and have revised the manuscript to incorporate quantitative evaluations and additional measurements where the original submission was lacking.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claim that 'the training of ControlNets is robust with small (<50k) and large (>1m) datasets' is presented without any quantitative metrics, error bars, ablation tables, or statistical comparisons; only qualitative results are referenced.
Authors: We agree that the abstract claim would be stronger with supporting quantitative evidence. In the revised manuscript we have added FID scores, training loss curves, and statistical comparisons between small (<50k) and large (>1m) datasets, together with new ablation tables in the Experiments section and supplementary material. revision: yes
-
Referee: [Method] Method (zero-convolution description): the assertion that zero convolutions 'ensure that no harmful noise could affect the finetuning' and that the pretrained backbone 'remains intact' is not accompanied by before/after measurements of generation quality on fixed text-only prompts; forward-pass summation of control features into locked layers could still alter activations once weights become nonzero.
Authors: We accept that before/after measurements are needed to substantiate the claim. We have added new experiments evaluating generation quality on fixed text-only prompts before and after ControlNet training; the results show negligible degradation. While feature summation could in principle affect activations, the zero-initialization combined with locked weights keeps the backbone effectively unchanged, as confirmed by the added measurements. revision: yes
-
Referee: [Experiments] Experiments: no quantitative evaluation (FID, CLIP score, user studies, or ablation on control strength) is reported for any of the tested conditions, making it impossible to assess the strength of the robustness or multi-condition claims.
Authors: We acknowledge that the original manuscript relied primarily on qualitative demonstrations. The revised version now includes FID and CLIP scores for each conditioning type, results from a 100-participant user study on control accuracy and image quality, and ablations varying control strength. These additions appear in the updated Experiments section and supplementary material. revision: yes
Circularity Check
No significant circularity; ControlNet is an independent architectural addition
full rationale
The paper introduces ControlNet as a new trainable module attached to a frozen pretrained UNet via zero-initialized convolutions. The zero-convolution design is a direct architectural choice that starts with no effect and learns additive control signals; no equation or claim reduces the final output distribution to a redefinition of the training inputs, a fitted parameter, or a self-citation chain. The claim that the backbone remains intact follows from the explicit freezing plus zero-init initialization rather than from any tautological re-use of the target result. No load-bearing step matches the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Pretrained diffusion model layers encode robust features from billions of images that can be reused as a backbone without degradation.
Forward citations
Cited by 17 Pith papers
-
WildDet3D: Scaling Promptable 3D Detection in the Wild
WildDet3D is a promptable 3D detector paired with a new 1M-image dataset across 13.5K categories that sets SOTA on open-world and zero-shot 3D detection benchmarks.
-
MoZoo:Unleashing Video Diffusion power in animal fur and muscle simulation
MoZoo generates high-fidelity animal videos with fur and muscle dynamics from coarse meshes by extending video diffusion with role-aware RoPE and asymmetric decoupled attention, trained on a new synthetic-to-real dataset.
-
Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference
Latent Consistency Models enable high-fidelity text-to-image generation in 2-4 steps by directly predicting solutions to the probability flow ODE in latent space, distilled from pre-trained LDMs.
-
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
Visual ChatGPT integrates visual foundation models with ChatGPT via prompts to enable multi-step image understanding, generation, and editing in conversational interactions.
-
Stylistic Attribute Control in Latent Diffusion Models
A technique for parametric stylistic control in latent diffusion models learns disentangled directions from synthetic datasets and applies them via guidance composition while preserving semantics.
-
Compared to What? Baselines and Metrics for Counterfactual Prompting
Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistica...
-
PhysEdit: Physically-Consistent Region-Aware Image Editing via Adaptive Spatio-Temporal Reasoning
PhysEdit introduces adaptive reasoning depth and spatial masking to make image editing faster and more instruction-aligned without retraining the base model.
-
GOLD-BEV: GrOund and aeriaL Data for Dense Semantic BEV Mapping of Dynamic Scenes
GOLD-BEV learns dense BEV semantic maps including dynamic agents from ego-centric sensors by using synchronized aerial imagery for training supervision and pseudo-label generation.
-
PostureObjectstitch: Anomaly Image Generation Considering Assembly Relationships in Industrial Scenarios
PostureObjectStitch generates assembly-aware anomaly images by decoupling multi-view features into high-frequency, texture and RGB components, modulating them temporally in a diffusion model, and applying conditional ...
-
VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness
VBench-2.0 is a benchmark suite that automatically evaluates video generative models on five dimensions of intrinsic faithfulness: Human Fidelity, Controllability, Creativity, Physics, and Commonsense using VLMs, LLMs...
-
CamCo: Camera-Controllable 3D-Consistent Image-to-Video Generation
CamCo equips image-to-video generators with Plücker-coordinate camera inputs and epipolar attention to improve 3D consistency and camera controllability.
-
IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models
IP-Adapter adds effective image prompting to text-to-image diffusion models using a lightweight decoupled cross-attention adapter that works alongside text prompts and other controls.
-
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
SDXL improves upon prior Stable Diffusion versions through a larger UNet backbone, dual text encoders, novel conditioning, and a refinement model, producing higher-fidelity images competitive with black-box state-of-t...
-
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face
HuggingGPT is an agent system where ChatGPT plans and orchestrates calls to Hugging Face models to solve complex multi-modal AI tasks.
-
Any2Any 3D Diffusion Models with Knowledge Transfer: A Radiotherapy Planning Study
DiffKT3D transfers priors from video diffusion models to 3D radiotherapy dose prediction via modality-specific embeddings and clinically guided RL, reducing voxel MAE from 2.07 to 1.93 and claiming SOTA over the GDP-H...
-
A Real-Calibrated Synthetic-First Data Engine
A data curation pipeline using diffusion-generated synthetic images improves pose estimation when added to real data but underperforms when used without real anchors.
-
Seedream 4.0: Toward Next-generation Multimodal Image Generation
Seedream 4.0 unifies text-to-image synthesis, image editing, and multi-image composition in an efficient diffusion transformer pretrained on billions of pairs and accelerated to 1.8 seconds for 2K output.
Reference graph
Works this paper leans on
-
[1]
Sadia Afrin. Weight initialization in neural network, inspired by andrew ng, https://medium.com/@safrin1128/weight- initialization-in-neural-network-inspired-by-andrew-ng- e0066dc4a566, 2020. 3
work page 2020
-
[2]
In- trinsic dimensionality explains the effectiveness of language model fine-tuning
Armen Aghajanyan, Sonal Gupta, and Luke Zettlemoyer. In- trinsic dimensionality explains the effectiveness of language model fine-tuning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Process- ing, pages 7319–7328, Online, Aug. 2021. Association fo...
work page 2021
-
[3]
Only a matter of style: Age transformation using a style-based regression model
Yuval Alaluf, Or Patashnik, and Daniel Cohen-Or. Only a matter of style: Age transformation using a style-based regression model. ACM Transactions on Graphics (TOG), 40(4), 2021. 3
work page 2021
-
[4]
Hyperstyle: Stylegan inversion with hypernetworks for real image editing
Yuval Alaluf, Omer Tov, Ron Mokady, Rinon Gal, and Amit Bermano. Hyperstyle: Stylegan inversion with hypernetworks for real image editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 18511–18521, 2022. 2
work page 2022
-
[5]
Disco diffusion, https://github.com/alembics/disco- diffusion, 2022
Alembics. Disco diffusion, https://github.com/alembics/disco- diffusion, 2022. 3
work page 2022
-
[6]
Spatext: Spatio-textual representation for con- trollable image generation
Omri Avrahami, Thomas Hayes, Oran Gafni, Sonal Gupta, Yaniv Taigman, Devi Parikh, Dani Lischinski, Ohad Fried, and Xi Yin. Spatext: Spatio-textual representation for con- trollable image generation. arXiv preprint arXiv:2211.14305,
-
[7]
Blended diffusion for text-driven editing of natural images
Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18208–18218, 2022. 3
work page 2022
-
[8]
Multidiffusion: Fusing diffusion paths for controlled image generation
Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. Multidiffusion: Fusing diffusion paths for controlled image generation. arXiv preprint arXiv:2302.08113, 2023. 3
-
[9]
Masksketch: Unpaired structure-guided masked image generation
Dina Bashkirova, Jose Lezama, Kihyuk Sohn, Kate Saenko, and Irfan Essa. Masksketch: Unpaired structure-guided masked image generation. arXiv preprint arXiv:2302.05496,
-
[10]
In- structpix2pix: Learning to follow image editing instructions
Tim Brooks, Aleksander Holynski, and Alexei A Efros. In- structpix2pix: Learning to follow image editing instructions. arXiv preprint arXiv:2211.09800, 2022. 2, 3
-
[11]
A computational approach to edge detection
John Canny. A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelli- gence, (6):679–698, 1986. 6
work page 1986
-
[12]
Z. Cao, G. Hidalgo Martinez, T. Simon, S. Wei, and Y . A. Sheikh. Openpose: Realtime multi-person 2d pose estima- tion using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019. 6
work page 2019
-
[13]
Pre-trained image processing transformer
Hanting Chen, Yunhe Wang, Tianyu Guo, Chang Xu, Yiping Deng, Zhenhua Liu, Siwei Ma, Chunjing Xu, Chao Xu, and Wen Gao. Pre-trained image processing transformer. In Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12299–12310, 2021. 3
work page 2021
-
[14]
Vision transformer adapter for dense predictions
Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, and Yu Qiao. Vision transformer adapter for dense predictions. International Conference on Learning Representations, 2023. 2
work page 2023
-
[15]
Stargan: Unified genera- tive adversarial networks for multi-domain image-to-image translation
Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. Stargan: Unified genera- tive adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8789–8797,
-
[16]
darkstorm2150. Protogen x3.4 (photorealism) offi- cial release, https://civitai.com/models/3666/protogen-x34- photorealism-official-release, 2022. 8
work page 2022
-
[17]
Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis.Advances in Neural Information Processing Systems, 34:8780–8794, 2021. 3
work page 2021
-
[18]
Dinh, Anh Tuan Tran, Rang Nguyen, and Binh-Son Hua
Tan M. Dinh, Anh Tuan Tran, Rang Nguyen, and Binh-Son Hua. Hyperinverter: Improving stylegan inversion via hy- pernetwork. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11389– 11398, 2022. 2
work page 2022
-
[19]
Taming transformers for high-resolution image synthesis
Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12873–12883, 2021. 3, 5, 7, 8
work page 2021
-
[20]
Make-a-scene: Scene- based text-to-image generation with human priors
Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-a-scene: Scene- based text-to-image generation with human priors. In Euro- pean Conference on Computer Vision (ECCV), pages 89–106. Springer, 2022. 2, 3
work page 2022
-
[21]
An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion
Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image genera- tion using textual inversion.arXiv preprint arXiv:2208.01618,
work page internal anchor Pith review Pith/arXiv arXiv
-
[22]
Stylegan-nada: Clip- guided domain adaptation of image generators
Rinon Gal, Or Patashnik, Haggai Maron, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. Stylegan-nada: Clip- guided domain adaptation of image generators. ACM Trans- actions on Graphics (TOG), 41(4):1–13, 2022. 3
work page 2022
-
[23]
Clip- adapter: Better vision-language models with feature adapters
Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. Clip- adapter: Better vision-language models with feature adapters. arXiv preprint arXiv:2110.04544, 2021. 2
-
[24]
Towards light-weight and real-time line segment detection
Geonmo Gu, Byungsoo Ko, SeoungHyun Go, Sung-Hyun Lee, Jingeun Lee, and Minchul Shin. Towards light-weight and real-time line segment detection. In Proceedings of the AAAI Conference on Artificial Intelligence, 2022. 6
work page 2022
-
[25]
David Ha, Andrew M. Dai, and Quoc V . Le. Hypernetworks. In International Conference on Learning Representations ,
-
[26]
Heathen. Hypernetwork style training, a tiny guide, stable- diffusion-webui, https://github.com/automatic1111/stable- diffusion-webui/discussions/2670, 2022. 2
work page 2022
-
[27]
Prompt-to-Prompt Image Editing with Cross Attention Control
Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt im- age editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022. 3
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[28]
Gans trained by a two time-scale update rule converge to a local nash equilibrium
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bern- hard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In I. Guyon, U. V on Luxburg, S. Bengio, H. Wallach, R. Fer- gus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Asso...
work page 2017
-
[29]
Classifier-free diffusion guidance, 2022
Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance, 2022. 5
work page 2022
-
[30]
Parameter-efficient transfer learning for nlp
Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pages 2790–2799, 2019. 2
work page 2019
-
[31]
LoRA: Low-Rank Adaptation of Large Language Models
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models.arXiv preprint arXiv:2106.09685, 2021. 2
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[32]
Composer: Creative and controllable image synthesis with composable conditions
Lianghua Huang, Di Chen, Yu Liu, Shen Yujun, Deli Zhao, and Zhou Jingren. Composer: Creative and controllable image synthesis with composable conditions. 2023. 3
work page 2023
-
[33]
Region-aware diffusion for zero-shot text- driven image editing
Nisha Huang, Fan Tang, Weiming Dong, Tong-Yee Lee, and Changsheng Xu. Region-aware diffusion for zero-shot text- driven image editing. arXiv preprint arXiv:2302.11797, 2023. 3
-
[34]
Image-to-image translation with conditional adversarial net- works
Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial net- works. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1125–1134, 2017. 1, 3
work page 2017
-
[35]
OneFormer: One Transformer to Rule Universal Image Segmentation
Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, and Humphrey Shi. OneFormer: One Transformer to Rule Universal Image Segmentation. 2023. 7
work page 2023
-
[36]
Progressive growing of gans for improved quality, stability, and variation
Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. International Conference on Learning Repre- sentations, 2018. 3
work page 2018
-
[37]
A style-based generator architecture for generative adversarial networks
Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019. 3
work page 2019
-
[38]
A style-based generator architecture for generative adversarial networks
Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. IEEE Transactions on Pattern Analysis, 2021. 3
work page 2021
-
[39]
Multi-level latent space structuring for generative control
Oren Katzir, Vicky Perepelook, Dani Lischinski, and Daniel Cohen-Or. Multi-level latent space structuring for generative control. arXiv preprint arXiv:2202.05910, 2022. 3
-
[40]
Imagic: Text-based real image editing with diffusion models
Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. arXiv preprint arXiv:2210.09276, 2022. 3
-
[41]
Dif- fusionclip: Text-guided diffusion models for robust image manipulation
Gwanghyun Kim, Taesung Kwon, and Jong Chul Ye. Dif- fusionclip: Text-guided diffusion models for robust image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2426– 2435, 2022. 3
work page 2022
-
[42]
Variational diffusion models.Advances in Neural Information Processing Systems, 34:21696–21707, 2021
Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models.Advances in Neural Information Processing Systems, 34:21696–21707, 2021. 3
work page 2021
-
[43]
Kurumuz. Novelai improvements on stable diffusion, https://blog.novelai.net/novelai-improvements-on-stable- diffusion-e10d38db82ac, 2022. 2
work page 2022
-
[44]
Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, May 2015. 3
work page 2015
- [45]
-
[46]
Noise2noise: Learning image restoration without clean data
Jaakko Lehtinen, Jacob Munkberg, Jon Hasselgren, Samuli Laine, Tero Karras, Miika Aittala, and Timo Aila. Noise2noise: Learning image restoration without clean data. Proceedings of the 35th International Conference on Machine Learning, 2018. 3
work page 2018
-
[47]
Measuring the intrinsic dimension of objective landscapes
Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. Measuring the intrinsic dimension of objective landscapes. International Conference on Learning Represen- tations, 2018. 3
work page 2018
-
[48]
Gligen: Open-set grounded text-to-image generation
Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jian- wei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. 2023. 3
work page 2023
-
[49]
Exploring plain vision transformer backbones for object de- tection
Yanghao Li, Hanzi Mao, Ross Girshick, and Kaiming He. Exploring plain vision transformer backbones for object de- tection. arXiv preprint arXiv:2203.16527, 2022. 2
-
[50]
Benchmarking detection transfer learning with vision transformers
Yanghao Li, Saining Xie, Xinlei Chen, Piotr Dollar, Kaim- ing He, and Ross Girshick. Benchmarking detection transfer learning with vision transformers. arXiv preprint arXiv:2111.11429, 2021. 2
-
[51]
Piggy- back: Adapting a single network to multiple tasks by learning to mask weights
Arun Mallya, Dillon Davis, and Svetlana Lazebnik. Piggy- back: Adapting a single network to multiple tasks by learning to mask weights. In European Conference on Computer Vi- sion (ECCV), pages 67–82, 2018. 2
work page 2018
-
[52]
Packnet: Adding multi- ple tasks to a single network by iterative pruning
Arun Mallya and Svetlana Lazebnik. Packnet: Adding multi- ple tasks to a single network by iterative pruning. In Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7765–7773, 2018. 2
work page 2018
-
[53]
Sdedit: Guided image synthesis and editing with stochastic differential equations
Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2021. 3
work page 2021
-
[54]
https://www.midjourney.com/, 2023
Midjourney. https://www.midjourney.com/, 2023. 1, 3
work page 2023
-
[55]
Self- distilled stylegan: Towards generation from internet photos
Ron Mokady, Omer Tov, Michal Yarom, Oran Lang, Inbar Mosseri, Tali Dekel, Daniel Cohen-Or, and Michal Irani. Self- distilled stylegan: Towards generation from internet photos. In ACM SIGGRAPH 2022 Conference Proceedings, pages 1–9, 2022. 3
work page 2022
-
[56]
Chong Mou, Xintao Wang, Liangbin Xie, Jian Zhang, Zhon- gang Qi, Ying Shan, and Xiaohu Qie. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453, 2023. 2, 3
work page internal anchor Pith review arXiv 2023
-
[57]
GLIDE: towards photorealistic image generation and editing with text-guided diffusion models
Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: towards photorealistic image generation and editing with text-guided diffusion models. CoRR, 2021. 8
work page 2021
-
[58]
Glide: Towards photorealistic image generation and editing with text-guided diffusion models
Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. 2022. 3
work page 2022
-
[59]
Improved denoising diffusion probabilistic models
Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162–8171. PMLR,
-
[60]
Mystyle: A personalized generative prior
Yotam Nitzan, Kfir Aberman, Qiurui He, Orly Liba, Michal Yarom, Yossi Gandelsman, Inbar Mosseri, Yael Pritch, and Daniel Cohen-Or. Mystyle: A personalized generative prior. arXiv preprint arXiv:2203.17272, 2022. 3
-
[61]
Comic-diffusion v2, trained on 6 styles at once, https://huggingface.co/ogkalu/comic-diffusion, 2022
ogkalu. Comic-diffusion v2, trained on 6 styles at once, https://huggingface.co/ogkalu/comic-diffusion, 2022. 8
work page 2022
-
[62]
Dall-e-2, https://openai.com/product/dall-e-2, 2023
OpenAI. Dall-e-2, https://openai.com/product/dall-e-2, 2023. 1, 3
work page 2023
-
[63]
Semantic image synthesis with spatially-adaptive nor- malization
Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive nor- malization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2337–2346,
-
[64]
Zero-shot image-to-image translation
Gaurav Parmar, Krishna Kumar Singh, Richard Zhang, Yijun Li, Jingwan Lu, and Jun-Yan Zhu. Zero-shot image-to-image translation. arXiv preprint arXiv:2302.03027, 2023. 3
-
[65]
Styleclip: Text-driven manipulation of stylegan imagery
Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. Styleclip: Text-driven manipulation of stylegan imagery. In Proceedings of the IEEE/CVF In- ternational Conference on Computer Vision (ICCV), pages 2085–2094, October 2021. 3
work page 2085
-
[66]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning , pages 8748–8763. PMLR, 2021. 2, 3, 4, 8
work page 2021
-
[67]
Hierarchical Text-Conditional Image Generation with CLIP Latents
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image genera- tion with clip latents. arXiv preprint arXiv:2204.06125, 2022. 3
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[68]
Zero-shot text-to-image generation
Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea V oss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Confer- ence on Machine Learning, pages 8821–8831. PMLR, 2021. 3
work page 2021
-
[69]
Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer
Ren´e Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3):1623–1637, 2020. 6
work page 2020
-
[70]
Efficient parametrization of multi-domain deep neural net- works
Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. Efficient parametrization of multi-domain deep neural net- works. In Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 8119–8127, 2018. 2
work page 2018
-
[71]
Encoding in style: a stylegan encoder for image-to-image translation
Elad Richardson, Yuval Alaluf, Or Patashnik, Yotam Nitzan, Yaniv Azar, Stav Shapiro, and Daniel Cohen-Or. Encoding in style: a stylegan encoder for image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021. 3
work page 2021
-
[72]
High-resolution image synthesis with latent diffusion models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022. 1, 2, 3, 4, 5, 7
work page 2022
-
[73]
U-net: Convolutional networks for biomedical image segmentation
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Inter- vention MICCAI International Conference, pages 234–241,
-
[74]
Incremental learning through deep adaptation
Amir Rosenfeld and John K Tsotsos. Incremental learning through deep adaptation. IEEE Transactions on Pattern Anal- ysis and Machine Intelligence, 42(3):651–663, 2018. 2
work page 2018
-
[75]
Dreambooth: Fine tuning text-to-image diffusion models for subject-driven gen- eration
Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven gen- eration. arXiv preprint arXiv:2208.12242, 2022. 2, 3
-
[76]
David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating er- rors. Nature, 323(6088):533–536, Oct. 1986. 3
work page 1986
-
[77]
Palette: Image-to-image diffusion models
Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee, Jonathan Ho, Tim Salimans, David Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings, SIGGRAPH ’22, New York, NY , USA, 2022. Association for Computing Ma- chinery. 3
work page 2022
-
[78]
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487,
work page internal anchor Pith review Pith/arXiv arXiv
-
[79]
LAION-5b: An open large-scale dataset for training next generation image-text models
Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade W Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Worts- man, Patrick Schramowski, Srivatsa R Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5b: An open large-scale dataset for training next generation image-text m...
work page 2022
-
[80]
Overcoming catastrophic forgetting with hard atten- tion to the task
Joan Serra, Didac Suris, Marius Miron, and Alexandros Karat- zoglou. Overcoming catastrophic forgetting with hard atten- tion to the task. In International Conference on Machine Learning, pages 4548–4557. PMLR, 2018. 2
work page 2018
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.