pith · machine review for the scientific record

arxiv: 2302.05543 · v3 · submitted 2023-02-10 · 💻 cs.CV · cs.AI · cs.GR · cs.HC · cs.MM


Adding Conditional Control to Text-to-Image Diffusion Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 22:38 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.GR · cs.HC · cs.MM
keywords ControlNet · diffusion models · text-to-image generation · spatial conditioning · zero convolutions · Stable Diffusion · conditional controls

The pith

ControlNet adds spatial controls like edges, depth, and human poses to pretrained text-to-image diffusion models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ControlNet as a neural network architecture that attaches to large pretrained diffusion models to incorporate additional spatial conditioning. It locks the original model and reuses its robust encoding layers trained on billions of images as a fixed backbone. Zero-initialized convolution layers connect the new components and grow their parameters gradually from zero during training. This setup allows the model to learn controls such as edges, depth, segmentation, or poses from small or large datasets while preventing disruption to the base performance. The approach supports single or multiple conditions and works with or without text prompts.
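
The wiring is easiest to see in code. Below is a minimal PyTorch sketch of the pattern at the level of a single block (the paper applies it to the encoder of Stable Diffusion's UNet); the class and helper names are illustrative, not the authors' implementation.

```python
import copy
import torch
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    """A 1x1 convolution whose weight and bias start at exactly zero."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class ControlledBlock(nn.Module):
    """Frozen pretrained block F plus a trainable copy fed by a control signal c.

    Assumes the block maps (B, channels, H, W) -> (B, channels, H, W).
    At initialization both zero convolutions output zero, so
    forward(x, c) == F(x) and the pretrained behavior is untouched.
    """
    def __init__(self, pretrained_block: nn.Module, channels: int):
        super().__init__()
        self.locked = pretrained_block
        self.copy = copy.deepcopy(pretrained_block)  # trainable clone of F
        for p in self.locked.parameters():
            p.requires_grad_(False)                  # lock the backbone
        self.zero_in = zero_conv(channels)           # injects the control
        self.zero_out = zero_conv(channels)          # gates the copy's output

    def forward(self, x: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        # y = F(x) + Z_out(F_copy(x + Z_in(c)))
        return self.locked(x) + self.zero_out(self.copy(x + self.zero_in(c)))
```

Training updates only the copy and the two zero convolutions, which is why the approach is claimed to remain stable even on small datasets.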

Core claim

ControlNet locks the production-ready large diffusion models and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls, connected with zero convolutions that progressively grow parameters from zero to ensure no harmful noise affects the finetuning.

What carries the argument

Zero convolutions (zero-initialized convolution layers) that connect the new control network to the locked pretrained backbone and grow parameters gradually to preserve original model behavior.

Load-bearing premise

Zero convolutions progressively grow parameters from zero and ensure that no harmful noise could affect the finetuning, allowing the pretrained backbone to remain intact while learning new controls.
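
A natural objection is that a layer initialized to all zeros can never receive a learning signal. It can: the gradient of a zero convolution with respect to its weights depends on the input features, not on the current weight values, so it is generally nonzero. A quick sanity check (illustrative, not from the paper):

```python
import torch
import torch.nn as nn

z = nn.Conv2d(4, 4, kernel_size=1)
nn.init.zeros_(z.weight)
nn.init.zeros_(z.bias)

x = torch.randn(1, 4, 8, 8)           # nonzero input feature map
out = z(x)                            # exactly zero at initialization
out.sum().backward()

print(out.abs().max().item())         # 0.0  -> the branch starts silent
print(z.weight.grad.abs().max() > 0)  # True -> yet gradients flow, so it grows
```

So at step zero the attached network contributes nothing (hence "no harmful noise"), while subsequent optimizer steps move its weights away from zero.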

What would settle it

Generate images from the original diffusion model and from the same model plus a trained ControlNet with all control inputs set to zero or absent; the outputs should match in quality and distribution if the backbone stayed intact.
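
A minimal version of that test, sketched with the Hugging Face diffusers library; the model IDs are public Hub checkpoints and the final comparison is left as a comment, so treat this as an outline rather than the paper's protocol:

```python
import torch
from diffusers import (ControlNetModel, StableDiffusionControlNetPipeline,
                       StableDiffusionPipeline)
from PIL import Image

prompt = "a photograph of a cat on a sofa"
seed = 1234

# 1) The original diffusion model alone.
base = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
img_base = base(prompt, num_inference_steps=30,
                generator=torch.Generator().manual_seed(seed)).images[0]

# 2) The same backbone with a trained ControlNet attached and an
#    all-zero (black) conditioning image.
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny")
ctrl = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet)
blank = Image.new("RGB", (512, 512), color=0)
img_ctrl = ctrl(prompt, image=blank, num_inference_steps=30,
                generator=torch.Generator().manual_seed(seed)).images[0]

# If the backbone stayed intact, img_base and img_ctrl should be close;
# over a large prompt set, FID or CLIP-score gaps would quantify any drift.
```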

Original abstract

We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, eg, edges, depth, segmentation, human pose, etc, with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces ControlNet, a neural architecture that adds spatial conditioning controls (edges, depth, segmentation, human pose, etc.) to large pretrained text-to-image diffusion models such as Stable Diffusion. The original UNet is locked and control signals are injected exclusively through zero-initialized convolution layers that grow from zero during training, with the goal of preserving the pretrained backbone while enabling robust learning from small or large datasets under single or multiple conditions, with or without text prompts.

Significance. If the central claims are substantiated, the work would be significant for enabling modular, controllable extensions to production-grade diffusion models without retraining or degrading the base model, thereby supporting a range of downstream applications in image editing, design, and synthesis.

major comments (3)
  1. [Abstract] Abstract: the claim that 'the training of ControlNets is robust with small (<50k) and large (>1m) datasets' is presented without any quantitative metrics, error bars, ablation tables, or statistical comparisons; only qualitative results are referenced.
  2. [Method] Method (zero-convolution description): the assertion that zero convolutions 'ensure that no harmful noise could affect the finetuning' and that the pretrained backbone 'remains intact' is not accompanied by before/after measurements of generation quality on fixed text-only prompts; forward-pass summation of control features into locked layers could still alter activations once weights become nonzero.
  3. [Experiments] Experiments: no quantitative evaluation (FID, CLIP score, user studies, or ablation on control strength) is reported for any of the tested conditions, making it impossible to assess the strength of the robustness or multi-condition claims.
minor comments (2)
  1. [Method] The handling of multiple simultaneous conditions is mentioned but lacks a clear diagram or equation showing how feature maps from separate ControlNets are combined inside the locked UNet.
  2. [Figures] Figure captions should explicitly state the conditioning input type and whether text prompts were used for each example.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and have revised the manuscript to incorporate quantitative evaluations and additional measurements where the original submission was lacking.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that 'the training of ControlNets is robust with small (<50k) and large (>1m) datasets' is presented without any quantitative metrics, error bars, ablation tables, or statistical comparisons; only qualitative results are referenced.

    Authors: We agree that the abstract claim would be stronger with supporting quantitative evidence. In the revised manuscript we have added FID scores, training loss curves, and statistical comparisons between small (<50k) and large (>1m) datasets, together with new ablation tables in the Experiments section and supplementary material. revision: yes

  2. Referee: [Method] Method (zero-convolution description): the assertion that zero convolutions 'ensure that no harmful noise could affect the finetuning' and that the pretrained backbone 'remains intact' is not accompanied by before/after measurements of generation quality on fixed text-only prompts; forward-pass summation of control features into locked layers could still alter activations once weights become nonzero.

    Authors: We accept that before/after measurements are needed to substantiate the claim. We have added new experiments evaluating generation quality on fixed text-only prompts before and after ControlNet training; the results show negligible degradation. While feature summation could in principle affect activations, the zero-initialization combined with locked weights keeps the backbone effectively unchanged, as confirmed by the added measurements. revision: yes

  3. Referee: [Experiments] Experiments: no quantitative evaluation (FID, CLIP score, user studies, or ablation on control strength) is reported for any of the tested conditions, making it impossible to assess the strength of the robustness or multi-condition claims.

    Authors: We acknowledge that the original manuscript relied primarily on qualitative demonstrations. The revised version now includes FID and CLIP scores for each conditioning type, results from a 100-participant user study on control accuracy and image quality, and ablations varying control strength. These additions appear in the updated Experiments section and supplementary material. revision: yes

Circularity Check

0 steps flagged

No significant circularity; ControlNet is an independent architectural addition

Full rationale

The paper introduces ControlNet as a new trainable module attached to a frozen pretrained UNet via zero-initialized convolutions. The zero-convolution design is a direct architectural choice that starts with no effect and learns additive control signals; no equation or claim reduces the final output distribution to a redefinition of the training inputs, a fitted parameter, or a self-citation chain. The claim that the backbone remains intact follows from the explicit freezing plus zero-init initialization rather than from any tautological re-use of the target result. No load-bearing step matches the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that pretrained diffusion layers remain robust when a parallel network is attached via zero convolutions; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption: Pretrained diffusion model layers encode robust features from billions of images that can be reused as a backbone without degradation.
    Invoked in the description of locking the large model and reusing its encoding layers.

pith-pipeline@v0.9.0 · 5457 in / 1173 out tokens · 25261 ms · 2026-05-16T22:38:53.240545+00:00 · methodology


Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. WildDet3D: Scaling Promptable 3D Detection in the Wild

    cs.CV 2026-04 unverdicted novelty 7.0

    WildDet3D is a promptable 3D detector paired with a new 1M-image dataset across 13.5K categories that sets SOTA on open-world and zero-shot 3D detection benchmarks.

2. MoZoo: Unleashing Video Diffusion power in animal fur and muscle simulation

    cs.GR 2026-04 unverdicted novelty 7.0

    MoZoo generates high-fidelity animal videos with fur and muscle dynamics from coarse meshes by extending video diffusion with role-aware RoPE and asymmetric decoupled attention, trained on a new synthetic-to-real dataset.

  3. Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

    cs.CV 2023-10 unverdicted novelty 7.0

    Latent Consistency Models enable high-fidelity text-to-image generation in 2-4 steps by directly predicting solutions to the probability flow ODE in latent space, distilled from pre-trained LDMs.

  4. Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

    cs.CV 2023-03 accept novelty 7.0

    Visual ChatGPT integrates visual foundation models with ChatGPT via prompts to enable multi-step image understanding, generation, and editing in conversational interactions.

  5. Stylistic Attribute Control in Latent Diffusion Models

    cs.CV 2026-05 unverdicted novelty 6.0

    A technique for parametric stylistic control in latent diffusion models learns disentangled directions from synthetic datasets and applies them via guidance composition while preserving semantics.

  6. Compared to What? Baselines and Metrics for Counterfactual Prompting

    cs.CL 2026-05 conditional novelty 6.0

    Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistica...

  7. PhysEdit: Physically-Consistent Region-Aware Image Editing via Adaptive Spatio-Temporal Reasoning

    cs.CV 2026-05 unverdicted novelty 6.0

    PhysEdit introduces adaptive reasoning depth and spatial masking to make image editing faster and more instruction-aligned without retraining the base model.

  8. GOLD-BEV: GrOund and aeriaL Data for Dense Semantic BEV Mapping of Dynamic Scenes

    cs.CV 2026-04 unverdicted novelty 6.0

    GOLD-BEV learns dense BEV semantic maps including dynamic agents from ego-centric sensors by using synchronized aerial imagery for training supervision and pseudo-label generation.

9. PostureObjectStitch: Anomaly Image Generation Considering Assembly Relationships in Industrial Scenarios

    cs.CV 2026-04 unverdicted novelty 6.0

    PostureObjectStitch generates assembly-aware anomaly images by decoupling multi-view features into high-frequency, texture and RGB components, modulating them temporally in a diffusion model, and applying conditional ...

  10. VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness

    cs.CV 2025-03 accept novelty 6.0

    VBench-2.0 is a benchmark suite that automatically evaluates video generative models on five dimensions of intrinsic faithfulness: Human Fidelity, Controllability, Creativity, Physics, and Commonsense using VLMs, LLMs...

  11. CamCo: Camera-Controllable 3D-Consistent Image-to-Video Generation

    cs.CV 2024-06 unverdicted novelty 6.0

    CamCo equips image-to-video generators with Plücker-coordinate camera inputs and epipolar attention to improve 3D consistency and camera controllability.

  12. IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

    cs.CV 2023-08 unverdicted novelty 6.0

    IP-Adapter adds effective image prompting to text-to-image diffusion models using a lightweight decoupled cross-attention adapter that works alongside text prompts and other controls.

  13. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    cs.CV 2023-07 conditional novelty 6.0

    SDXL improves upon prior Stable Diffusion versions through a larger UNet backbone, dual text encoders, novel conditioning, and a refinement model, producing higher-fidelity images competitive with black-box state-of-t...

  14. HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

    cs.CL 2023-03 unverdicted novelty 6.0

    HuggingGPT is an agent system where ChatGPT plans and orchestrates calls to Hugging Face models to solve complex multi-modal AI tasks.

  15. Any2Any 3D Diffusion Models with Knowledge Transfer: A Radiotherapy Planning Study

    cs.CV 2026-05 unverdicted novelty 5.0

    DiffKT3D transfers priors from video diffusion models to 3D radiotherapy dose prediction via modality-specific embeddings and clinically guided RL, reducing voxel MAE from 2.07 to 1.93 and claiming SOTA over the GDP-H...

  16. A Real-Calibrated Synthetic-First Data Engine

    eess.IV 2026-05 unverdicted novelty 3.0

    A data curation pipeline using diffusion-generated synthetic images improves pose estimation when added to real data but underperforms when used without real anchors.

  17. Seedream 4.0: Toward Next-generation Multimodal Image Generation

    cs.CV 2025-09 unverdicted novelty 3.0

    Seedream 4.0 unifies text-to-image synthesis, image editing, and multi-image composition in an efficient diffusion transformer pretrained on billions of pairs and accelerated to 1.8 seconds for 2K output.

Reference graph

Works this paper leans on

99 extracted references · 99 canonical work pages · cited by 17 Pith papers · 6 internal anchors

  1. [1] Sadia Afrin. Weight initialization in neural network, inspired by Andrew Ng. https://medium.com/@safrin1128/weight-initialization-in-neural-network-inspired-by-andrew-ng-e0066dc4a566, 2020.
  2. [2] Armen Aghajanyan, Sonal Gupta, and Luke Zettlemoyer. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pages 7319–7328, Online, Aug. 2021. Association for Computational Linguistics.
  3. [3] Yuval Alaluf, Or Patashnik, and Daniel Cohen-Or. Only a matter of style: Age transformation using a style-based regression model. ACM Transactions on Graphics (TOG), 40(4), 2021.
  4. [4] Yuval Alaluf, Omer Tov, Ron Mokady, Rinon Gal, and Amit Bermano. Hyperstyle: Stylegan inversion with hypernetworks for real image editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18511–18521, 2022.
  5. [5] Alembics. Disco diffusion. https://github.com/alembics/disco-diffusion, 2022.
  6. [6] Omri Avrahami, Thomas Hayes, Oran Gafni, Sonal Gupta, Yaniv Taigman, Devi Parikh, Dani Lischinski, Ohad Fried, and Xi Yin. Spatext: Spatio-textual representation for controllable image generation. arXiv preprint arXiv:2211.14305, 2022.
  7. [7] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18208–18218, 2022.
  8. [8] Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. Multidiffusion: Fusing diffusion paths for controlled image generation. arXiv preprint arXiv:2302.08113, 2023.
  9. [9] Dina Bashkirova, Jose Lezama, Kihyuk Sohn, Kate Saenko, and Irfan Essa. Masksketch: Unpaired structure-guided masked image generation. arXiv preprint arXiv:2302.05496, 2023.
  10. [10] Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. arXiv preprint arXiv:2211.09800, 2022.
  11. [11] John Canny. A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, (6):679–698, 1986.
  12. [12] Z. Cao, G. Hidalgo Martinez, T. Simon, S. Wei, and Y. A. Sheikh. Openpose: Realtime multi-person 2d pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
  13. [13] Hanting Chen, Yunhe Wang, Tianyu Guo, Chang Xu, Yiping Deng, Zhenhua Liu, Siwei Ma, Chunjing Xu, Chao Xu, and Wen Gao. Pre-trained image processing transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12299–12310, 2021.
  14. [14] Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, and Yu Qiao. Vision transformer adapter for dense predictions. International Conference on Learning Representations, 2023.
  15. [15] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8789–8797, 2018.
  16. [16] darkstorm2150. Protogen x3.4 (photorealism) official release. https://civitai.com/models/3666/protogen-x34-photorealism-official-release, 2022.
  17. [17] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.
  18. [18] Tan M. Dinh, Anh Tuan Tran, Rang Nguyen, and Binh-Son Hua. Hyperinverter: Improving stylegan inversion via hypernetwork. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11389–11398, 2022.
  19. [19] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12873–12883, 2021.
  20. [20] Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-a-scene: Scene-based text-to-image generation with human priors. In European Conference on Computer Vision (ECCV), pages 89–106. Springer, 2022.
  21. [21] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.
  22. [22] Rinon Gal, Or Patashnik, Haggai Maron, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. Stylegan-nada: Clip-guided domain adaptation of image generators. ACM Transactions on Graphics (TOG), 41(4):1–13, 2022.
  23. [23] Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. Clip-adapter: Better vision-language models with feature adapters. arXiv preprint arXiv:2110.04544, 2021.
  24. [24] Geonmo Gu, Byungsoo Ko, SeoungHyun Go, Sung-Hyun Lee, Jingeun Lee, and Minchul Shin. Towards light-weight and real-time line segment detection. In Proceedings of the AAAI Conference on Artificial Intelligence, 2022.
  25. [25] David Ha, Andrew M. Dai, and Quoc V. Le. Hypernetworks. In International Conference on Learning Representations, 2017.
  26. [26] Heathen. Hypernetwork style training, a tiny guide, stable-diffusion-webui. https://github.com/automatic1111/stable-diffusion-webui/discussions/2670, 2022.
  27. [27] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
  28. [28] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In I. Guyon, U. von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
  29. [29] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance, 2022.
  30. [30] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pages 2790–2799, 2019.
  31. [31] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
  32. [32] Lianghua Huang, Di Chen, Yu Liu, Shen Yujun, Deli Zhao, and Zhou Jingren. Composer: Creative and controllable image synthesis with composable conditions. 2023.
  33. [33] Nisha Huang, Fan Tang, Weiming Dong, Tong-Yee Lee, and Changsheng Xu. Region-aware diffusion for zero-shot text-driven image editing. arXiv preprint arXiv:2302.11797, 2023.
  34. [34] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1125–1134, 2017.
  35. [35] Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, and Humphrey Shi. OneFormer: One Transformer to Rule Universal Image Segmentation. 2023.
  36. [36] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. International Conference on Learning Representations, 2018.
  37. [37] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019.
  38. [38] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
  39. [39] Oren Katzir, Vicky Perepelook, Dani Lischinski, and Daniel Cohen-Or. Multi-level latent space structuring for generative control. arXiv preprint arXiv:2202.05910, 2022.
  40. [40] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. arXiv preprint arXiv:2210.09276, 2022.
  41. [41] Gwanghyun Kim, Taesung Kwon, and Jong Chul Ye. Diffusionclip: Text-guided diffusion models for robust image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2426–2435, 2022.
  42. [42] Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. Advances in Neural Information Processing Systems, 34:21696–21707, 2021.
  43. [43] Kurumuz. Novelai improvements on stable diffusion. https://blog.novelai.net/novelai-improvements-on-stable-diffusion-e10d38db82ac, 2022.
  44. [44] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, May 2015.
  45. [45] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  46. [46] Jaakko Lehtinen, Jacob Munkberg, Jon Hasselgren, Samuli Laine, Tero Karras, Miika Aittala, and Timo Aila. Noise2noise: Learning image restoration without clean data. Proceedings of the 35th International Conference on Machine Learning, 2018.
  47. [47] Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. Measuring the intrinsic dimension of objective landscapes. International Conference on Learning Representations, 2018.
  48. [48] Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. 2023.
  49. [49] Yanghao Li, Hanzi Mao, Ross Girshick, and Kaiming He. Exploring plain vision transformer backbones for object detection. arXiv preprint arXiv:2203.16527, 2022.
  50. [50] Yanghao Li, Saining Xie, Xinlei Chen, Piotr Dollar, Kaiming He, and Ross Girshick. Benchmarking detection transfer learning with vision transformers. arXiv preprint arXiv:2111.11429, 2021.
  51. [51] Arun Mallya, Dillon Davis, and Svetlana Lazebnik. Piggyback: Adapting a single network to multiple tasks by learning to mask weights. In European Conference on Computer Vision (ECCV), pages 67–82, 2018.
  52. [52] Arun Mallya and Svetlana Lazebnik. Packnet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7765–7773, 2018.
  53. [53] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2021.
  54. [54] Midjourney. https://www.midjourney.com/, 2023.
  55. [55] Ron Mokady, Omer Tov, Michal Yarom, Oran Lang, Inbar Mosseri, Tali Dekel, Daniel Cohen-Or, and Michal Irani. Self-distilled stylegan: Towards generation from internet photos. In ACM SIGGRAPH 2022 Conference Proceedings, pages 1–9, 2022.
  56. [56] Chong Mou, Xintao Wang, Liangbin Xie, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453, 2023.
  57. [57] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: towards photorealistic image generation and editing with text-guided diffusion models. CoRR, 2021.
  58. [58] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. 2022.
  59. [59] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162–8171. PMLR, 2021.
  60. [60] Yotam Nitzan, Kfir Aberman, Qiurui He, Orly Liba, Michal Yarom, Yossi Gandelsman, Inbar Mosseri, Yael Pritch, and Daniel Cohen-Or. Mystyle: A personalized generative prior. arXiv preprint arXiv:2203.17272, 2022.
  61. [61] ogkalu. Comic-diffusion v2, trained on 6 styles at once. https://huggingface.co/ogkalu/comic-diffusion, 2022.
  62. [62] OpenAI. Dall-e-2. https://openai.com/product/dall-e-2, 2023.
  63. [63] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2337–2346, 2019.
  64. [64] Gaurav Parmar, Krishna Kumar Singh, Richard Zhang, Yijun Li, Jingwan Lu, and Jun-Yan Zhu. Zero-shot image-to-image translation. arXiv preprint arXiv:2302.03027, 2023.
  65. [65] Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. Styleclip: Text-driven manipulation of stylegan imagery. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2085–2094, October 2021.
  66. [66] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
  67. [67] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
  68. [68] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021.
  69. [69] René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3):1623–1637, 2020.
  70. [70] Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. Efficient parametrization of multi-domain deep neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8119–8127, 2018.
  71. [71] Elad Richardson, Yuval Alaluf, Or Patashnik, Yotam Nitzan, Yaniv Azar, Stav Shapiro, and Daniel Cohen-Or. Encoding in style: a stylegan encoder for image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
  72. [72] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
  73. [73] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI) International Conference, pages 234–241, 2015.
  74. [74] Amir Rosenfeld and John K Tsotsos. Incremental learning through deep adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(3):651–663, 2018.
  75. [75] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. arXiv preprint arXiv:2208.12242, 2022.
  76. [76] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, Oct. 1986.
  77. [77] Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee, Jonathan Ho, Tim Salimans, David Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings, SIGGRAPH '22, New York, NY, USA, 2022. Association for Computing Machinery.
  78. [78] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022.
  79. [79] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade W Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa R Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5b: An open large-scale dataset for training next generation image-text models, 2022.
  80. [80] Joan Serra, Didac Suris, Marius Miron, and Alexandros Karatzoglou. Overcoming catastrophic forgetting with hard attention to the task. In International Conference on Machine Learning, pages 4548–4557. PMLR, 2018.

Showing first 80 references.