pith · machine review for the scientific record

arxiv: 2302.05543 · v3 · submitted 2023-02-10 · 💻 cs.CV · cs.AI · cs.GR · cs.HC · cs.MM


Adding Conditional Control to Text-to-Image Diffusion Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 22:38 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.GR · cs.HC · cs.MM
keywords ControlNet · diffusion models · text-to-image generation · spatial conditioning · zero convolutions · Stable Diffusion · conditional controls

The pith

ControlNet adds spatial controls like edges, depth, and human poses to pretrained text-to-image diffusion models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ControlNet as a neural network architecture that attaches to large pretrained diffusion models to incorporate additional spatial conditioning. It locks the original model and reuses its robust encoding layers trained on billions of images as a fixed backbone. Zero-initialized convolution layers connect the new components and grow their parameters gradually from zero during training. This setup allows the model to learn controls such as edges, depth, segmentation, or poses from small or large datasets while preventing disruption to the base performance. The approach supports single or multiple conditions and works with or without text prompts.
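
The wiring is easiest to see in code. Below is a minimal PyTorch sketch of the pattern at the level of a single block (the paper applies it to the encoder of Stable Diffusion's UNet); the class and helper names are illustrative, not the authors' implementation.

```python
import copy
import torch
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    """A 1x1 convolution whose weight and bias start at exactly zero."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class ControlledBlock(nn.Module):
    """Frozen pretrained block F plus a trainable copy fed by a control signal c.

    Assumes the block maps (B, channels, H, W) -> (B, channels, H, W).
    At initialization both zero convolutions output zero, so
    forward(x, c) == F(x) and the pretrained behavior is untouched.
    """
    def __init__(self, pretrained_block: nn.Module, channels: int):
        super().__init__()
        self.locked = pretrained_block
        self.copy = copy.deepcopy(pretrained_block)  # trainable clone of F
        for p in self.locked.parameters():
            p.requires_grad_(False)                  # lock the backbone
        self.zero_in = zero_conv(channels)           # injects the control
        self.zero_out = zero_conv(channels)          # gates the copy's output

    def forward(self, x: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        # y = F(x) + Z_out(F_copy(x + Z_in(c)))
        return self.locked(x) + self.zero_out(self.copy(x + self.zero_in(c)))
```

Training updates only the copy and the two zero convolutions, which is why the approach is claimed to remain stable even on small datasets.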

Core claim

ControlNet locks the production-ready large diffusion models and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls, connected with zero convolutions that progressively grow parameters from zero to ensure no harmful noise affects the finetuning.

What carries the argument

Zero convolutions (zero-initialized convolution layers) that connect the new control network to the locked pretrained backbone and grow parameters gradually to preserve original model behavior.

Load-bearing premise

Zero convolutions progressively grow parameters from zero and ensure that no harmful noise could affect the finetuning, allowing the pretrained backbone to remain intact while learning new controls.
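
A natural objection is that a layer initialized to all zeros can never receive a learning signal. It can: the gradient of a zero convolution with respect to its weights depends on the input features, not on the current weight values, so it is generally nonzero. A quick sanity check (illustrative, not from the paper):

```python
import torch
import torch.nn as nn

z = nn.Conv2d(4, 4, kernel_size=1)
nn.init.zeros_(z.weight)
nn.init.zeros_(z.bias)

x = torch.randn(1, 4, 8, 8)           # nonzero input feature map
out = z(x)                            # exactly zero at initialization
out.sum().backward()

print(out.abs().max().item())         # 0.0  -> the branch starts silent
print(z.weight.grad.abs().max() > 0)  # True -> yet gradients flow, so it grows
```

So at step zero the attached network contributes nothing (hence "no harmful noise"), while subsequent optimizer steps move its weights away from zero.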

What would settle it

Generate images from the original diffusion model and from the same model plus a trained ControlNet with all control inputs set to zero or absent; the outputs should match in quality and distribution if the backbone stayed intact.
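
A minimal version of that test, sketched with the Hugging Face diffusers library; the model IDs are public Hub checkpoints and the final comparison is left as a comment, so treat this as an outline rather than the paper's protocol:

```python
import torch
from diffusers import (ControlNetModel, StableDiffusionControlNetPipeline,
                       StableDiffusionPipeline)
from PIL import Image

prompt = "a photograph of a cat on a sofa"
seed = 1234

# 1) The original diffusion model alone.
base = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
img_base = base(prompt, num_inference_steps=30,
                generator=torch.Generator().manual_seed(seed)).images[0]

# 2) The same backbone with a trained ControlNet attached and an
#    all-zero (black) conditioning image.
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny")
ctrl = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet)
blank = Image.new("RGB", (512, 512), color=0)
img_ctrl = ctrl(prompt, image=blank, num_inference_steps=30,
                generator=torch.Generator().manual_seed(seed)).images[0]

# If the backbone stayed intact, img_base and img_ctrl should be close;
# over a large prompt set, FID or CLIP-score gaps would quantify any drift.
```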

Original abstract

We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, eg, edges, depth, segmentation, human pose, etc, with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces ControlNet, a neural architecture that adds spatial conditioning controls (edges, depth, segmentation, human pose, etc.) to large pretrained text-to-image diffusion models such as Stable Diffusion. The original UNet is locked and control signals are injected exclusively through zero-initialized convolution layers that grow from zero during training, with the goal of preserving the pretrained backbone while enabling robust learning from small or large datasets under single or multiple conditions, with or without text prompts.

Significance. If the central claims are substantiated, the work would be significant for enabling modular, controllable extensions to production-grade diffusion models without retraining or degrading the base model, thereby supporting a range of downstream applications in image editing, design, and synthesis.

major comments (3)
  1. [Abstract] Abstract: the claim that 'the training of ControlNets is robust with small (<50k) and large (>1m) datasets' is presented without any quantitative metrics, error bars, ablation tables, or statistical comparisons; only qualitative results are referenced.
  2. [Method] Method (zero-convolution description): the assertion that zero convolutions 'ensure that no harmful noise could affect the finetuning' and that the pretrained backbone 'remains intact' is not accompanied by before/after measurements of generation quality on fixed text-only prompts; forward-pass summation of control features into locked layers could still alter activations once weights become nonzero.
  3. [Experiments] Experiments: no quantitative evaluation (FID, CLIP score, user studies, or ablation on control strength) is reported for any of the tested conditions, making it impossible to assess the strength of the robustness or multi-condition claims.
minor comments (2)
  1. [Method] The handling of multiple simultaneous conditions is mentioned but lacks a clear diagram or equation showing how feature maps from separate ControlNets are combined inside the locked UNet.
  2. [Figures] Figure captions should explicitly state the conditioning input type and whether text prompts were used for each example.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and have revised the manuscript to incorporate quantitative evaluations and additional measurements where the original submission was lacking.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that 'the training of ControlNets is robust with small (<50k) and large (>1m) datasets' is presented without any quantitative metrics, error bars, ablation tables, or statistical comparisons; only qualitative results are referenced.

    Authors: We agree that the abstract claim would be stronger with supporting quantitative evidence. In the revised manuscript we have added FID scores, training loss curves, and statistical comparisons between small (<50k) and large (>1m) datasets, together with new ablation tables in the Experiments section and supplementary material. revision: yes

  2. Referee: [Method] Method (zero-convolution description): the assertion that zero convolutions 'ensure that no harmful noise could affect the finetuning' and that the pretrained backbone 'remains intact' is not accompanied by before/after measurements of generation quality on fixed text-only prompts; forward-pass summation of control features into locked layers could still alter activations once weights become nonzero.

    Authors: We accept that before/after measurements are needed to substantiate the claim. We have added new experiments evaluating generation quality on fixed text-only prompts before and after ControlNet training; the results show negligible degradation. While feature summation could in principle affect activations, the zero-initialization combined with locked weights keeps the backbone effectively unchanged, as confirmed by the added measurements. revision: yes

  3. Referee: [Experiments] Experiments: no quantitative evaluation (FID, CLIP score, user studies, or ablation on control strength) is reported for any of the tested conditions, making it impossible to assess the strength of the robustness or multi-condition claims.

    Authors: We acknowledge that the original manuscript relied primarily on qualitative demonstrations. The revised version now includes FID and CLIP scores for each conditioning type, results from a 100-participant user study on control accuracy and image quality, and ablations varying control strength. These additions appear in the updated Experiments section and supplementary material. revision: yes

Circularity Check

0 steps flagged

No significant circularity; ControlNet is an independent architectural addition

Full rationale

The paper introduces ControlNet as a new trainable module attached to a frozen pretrained UNet via zero-initialized convolutions. The zero-convolution design is a direct architectural choice that starts with no effect and learns additive control signals; no equation or claim reduces the final output distribution to a redefinition of the training inputs, a fitted parameter, or a self-citation chain. The claim that the backbone remains intact follows from the explicit freezing plus zero-init initialization rather than from any tautological re-use of the target result. No load-bearing step matches the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that pretrained diffusion layers remain robust when a parallel network is attached via zero convolutions; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption: Pretrained diffusion model layers encode robust features from billions of images that can be reused as a backbone without degradation.
    Invoked in the description of locking the large model and reusing its encoding layers.

pith-pipeline@v0.9.0 · 5457 in / 1173 out tokens · 25261 ms · 2026-05-16T22:38:53.240545+00:00 · methodology


Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. WildDet3D: Scaling Promptable 3D Detection in the Wild

    cs.CV 2026-04 unverdicted novelty 7.0

    WildDet3D is a promptable 3D detector paired with a new 1M-image dataset across 13.5K categories that sets SOTA on open-world and zero-shot 3D detection benchmarks.

2. MoZoo: Unleashing Video Diffusion power in animal fur and muscle simulation

    cs.GR 2026-04 unverdicted novelty 7.0

    MoZoo generates high-fidelity animal videos with fur and muscle dynamics from coarse meshes by extending video diffusion with role-aware RoPE and asymmetric decoupled attention, trained on a new synthetic-to-real dataset.

  3. Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

    cs.CV 2023-10 unverdicted novelty 7.0

    Latent Consistency Models enable high-fidelity text-to-image generation in 2-4 steps by directly predicting solutions to the probability flow ODE in latent space, distilled from pre-trained LDMs.

  4. Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

    cs.CV 2023-03 accept novelty 7.0

    Visual ChatGPT integrates visual foundation models with ChatGPT via prompts to enable multi-step image understanding, generation, and editing in conversational interactions.

  5. Stylistic Attribute Control in Latent Diffusion Models

    cs.CV 2026-05 unverdicted novelty 6.0

    A technique for parametric stylistic control in latent diffusion models learns disentangled directions from synthetic datasets and applies them via guidance composition while preserving semantics.

  6. Compared to What? Baselines and Metrics for Counterfactual Prompting

    cs.CL 2026-05 conditional novelty 6.0

    Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistica...

  7. PhysEdit: Physically-Consistent Region-Aware Image Editing via Adaptive Spatio-Temporal Reasoning

    cs.CV 2026-05 unverdicted novelty 6.0

    PhysEdit introduces adaptive reasoning depth and spatial masking to make image editing faster and more instruction-aligned without retraining the base model.

  8. GOLD-BEV: GrOund and aeriaL Data for Dense Semantic BEV Mapping of Dynamic Scenes

    cs.CV 2026-04 unverdicted novelty 6.0

    GOLD-BEV learns dense BEV semantic maps including dynamic agents from ego-centric sensors by using synchronized aerial imagery for training supervision and pseudo-label generation.

9. PostureObjectStitch: Anomaly Image Generation Considering Assembly Relationships in Industrial Scenarios

    cs.CV 2026-04 unverdicted novelty 6.0

    PostureObjectStitch generates assembly-aware anomaly images by decoupling multi-view features into high-frequency, texture and RGB components, modulating them temporally in a diffusion model, and applying conditional ...

  10. VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness

    cs.CV 2025-03 accept novelty 6.0

    VBench-2.0 is a benchmark suite that automatically evaluates video generative models on five dimensions of intrinsic faithfulness: Human Fidelity, Controllability, Creativity, Physics, and Commonsense using VLMs, LLMs...

  11. CamCo: Camera-Controllable 3D-Consistent Image-to-Video Generation

    cs.CV 2024-06 unverdicted novelty 6.0

    CamCo equips image-to-video generators with Plücker-coordinate camera inputs and epipolar attention to improve 3D consistency and camera controllability.

  12. IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

    cs.CV 2023-08 unverdicted novelty 6.0

    IP-Adapter adds effective image prompting to text-to-image diffusion models using a lightweight decoupled cross-attention adapter that works alongside text prompts and other controls.

  13. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    cs.CV 2023-07 conditional novelty 6.0

    SDXL improves upon prior Stable Diffusion versions through a larger UNet backbone, dual text encoders, novel conditioning, and a refinement model, producing higher-fidelity images competitive with black-box state-of-t...

  14. HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

    cs.CL 2023-03 unverdicted novelty 6.0

    HuggingGPT is an agent system where ChatGPT plans and orchestrates calls to Hugging Face models to solve complex multi-modal AI tasks.

  15. Any2Any 3D Diffusion Models with Knowledge Transfer: A Radiotherapy Planning Study

    cs.CV 2026-05 unverdicted novelty 5.0

    DiffKT3D transfers priors from video diffusion models to 3D radiotherapy dose prediction via modality-specific embeddings and clinically guided RL, reducing voxel MAE from 2.07 to 1.93 and claiming SOTA over the GDP-H...

  16. A Real-Calibrated Synthetic-First Data Engine

    eess.IV 2026-05 unverdicted novelty 3.0

    A data curation pipeline using diffusion-generated synthetic images improves pose estimation when added to real data but underperforms when used without real anchors.

  17. Seedream 4.0: Toward Next-generation Multimodal Image Generation

    cs.CV 2025-09 unverdicted novelty 3.0

    Seedream 4.0 unifies text-to-image synthesis, image editing, and multi-image composition in an efficient diffusion transformer pretrained on billions of pairs and accelerated to 1.8 seconds for 2K output.

Reference graph

Works this paper leans on

99 extracted references · 99 canonical work pages · cited by 17 Pith papers · 6 internal anchors

  1. [1] Sadia Afrin. Weight initialization in neural network, inspired by Andrew Ng. https://medium.com/@safrin1128/weight-initialization-in-neural-network-inspired-by-andrew-ng-e0066dc4a566, 2020.
  2. [2] Armen Aghajanyan, Sonal Gupta, and Luke Zettlemoyer. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pages 7319–7328, Online, Aug. 2021. Association for Computational Linguistics.
  3. [3] Yuval Alaluf, Or Patashnik, and Daniel Cohen-Or. Only a matter of style: Age transformation using a style-based regression model. ACM Transactions on Graphics (TOG), 40(4), 2021.
  4. [4] Yuval Alaluf, Omer Tov, Ron Mokady, Rinon Gal, and Amit Bermano. Hyperstyle: Stylegan inversion with hypernetworks for real image editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18511–18521, 2022.
  5. [5] Alembics. Disco diffusion. https://github.com/alembics/disco-diffusion, 2022.
  6. [6] Omri Avrahami, Thomas Hayes, Oran Gafni, Sonal Gupta, Yaniv Taigman, Devi Parikh, Dani Lischinski, Ohad Fried, and Xi Yin. Spatext: Spatio-textual representation for controllable image generation. arXiv preprint arXiv:2211.14305, 2022.
  7. [7] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18208–18218, 2022.
  8. [8] Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. Multidiffusion: Fusing diffusion paths for controlled image generation. arXiv preprint arXiv:2302.08113, 2023.
  9. [9] Dina Bashkirova, Jose Lezama, Kihyuk Sohn, Kate Saenko, and Irfan Essa. Masksketch: Unpaired structure-guided masked image generation. arXiv preprint arXiv:2302.05496, 2023.
  10. [10] Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. arXiv preprint arXiv:2211.09800, 2022.
  11. [11] John Canny. A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, (6):679–698, 1986.
  12. [12] Z. Cao, G. Hidalgo Martinez, T. Simon, S. Wei, and Y. A. Sheikh. Openpose: Realtime multi-person 2d pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
  13. [13] Hanting Chen, Yunhe Wang, Tianyu Guo, Chang Xu, Yiping Deng, Zhenhua Liu, Siwei Ma, Chunjing Xu, Chao Xu, and Wen Gao. Pre-trained image processing transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12299–12310, 2021.
  14. [14] Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, and Yu Qiao. Vision transformer adapter for dense predictions. International Conference on Learning Representations, 2023.
  15. [15] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8789–8797, 2018.
  16. [16] darkstorm2150. Protogen x3.4 (photorealism) official release. https://civitai.com/models/3666/protogen-x34-photorealism-official-release, 2022.
  17. [17] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.
  18. [18] Tan M. Dinh, Anh Tuan Tran, Rang Nguyen, and Binh-Son Hua. Hyperinverter: Improving stylegan inversion via hypernetwork. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11389–11398, 2022.
  19. [19] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12873–12883, 2021.
  20. [20] Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-a-scene: Scene-based text-to-image generation with human priors. In European Conference on Computer Vision (ECCV), pages 89–106. Springer, 2022.
  21. [21] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.
  22. [22] Rinon Gal, Or Patashnik, Haggai Maron, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. Stylegan-nada: Clip-guided domain adaptation of image generators. ACM Transactions on Graphics (TOG), 41(4):1–13, 2022.
  23. [23] Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. Clip-adapter: Better vision-language models with feature adapters. arXiv preprint arXiv:2110.04544, 2021.
  24. [24] Geonmo Gu, Byungsoo Ko, SeoungHyun Go, Sung-Hyun Lee, Jingeun Lee, and Minchul Shin. Towards light-weight and real-time line segment detection. In Proceedings of the AAAI Conference on Artificial Intelligence, 2022.
  25. [25] David Ha, Andrew M. Dai, and Quoc V. Le. Hypernetworks. In International Conference on Learning Representations, 2017.
  26. [26] Heathen. Hypernetwork style training, a tiny guide, stable-diffusion-webui. https://github.com/automatic1111/stable-diffusion-webui/discussions/2670, 2022.
  27. [27] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
  28. [28] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In I. Guyon, U. von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
  29. [29] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance, 2022.
  30. [30] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pages 2790–2799, 2019.
  31. [31] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
  32. [32] Lianghua Huang, Di Chen, Yu Liu, Shen Yujun, Deli Zhao, and Zhou Jingren. Composer: Creative and controllable image synthesis with composable conditions. 2023.
  33. [33] Nisha Huang, Fan Tang, Weiming Dong, Tong-Yee Lee, and Changsheng Xu. Region-aware diffusion for zero-shot text-driven image editing. arXiv preprint arXiv:2302.11797, 2023.
  34. [34] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1125–1134, 2017.
  35. [35] Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, and Humphrey Shi. OneFormer: One Transformer to Rule Universal Image Segmentation. 2023.
  36. [36] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. International Conference on Learning Representations, 2018.
  37. [37] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019.
  38. [38] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
  39. [39] Oren Katzir, Vicky Perepelook, Dani Lischinski, and Daniel Cohen-Or. Multi-level latent space structuring for generative control. arXiv preprint arXiv:2202.05910, 2022.
  40. [40] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. arXiv preprint arXiv:2210.09276, 2022.
  41. [41] Gwanghyun Kim, Taesung Kwon, and Jong Chul Ye. Diffusionclip: Text-guided diffusion models for robust image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2426–2435, 2022.
  42. [42] Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. Advances in Neural Information Processing Systems, 34:21696–21707, 2021.
  43. [43] Kurumuz. Novelai improvements on stable diffusion. https://blog.novelai.net/novelai-improvements-on-stable-diffusion-e10d38db82ac, 2022.
  44. [44] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, May 2015.
  45. [45] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  46. [46] Jaakko Lehtinen, Jacob Munkberg, Jon Hasselgren, Samuli Laine, Tero Karras, Miika Aittala, and Timo Aila. Noise2noise: Learning image restoration without clean data. Proceedings of the 35th International Conference on Machine Learning, 2018.
  47. [47] Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. Measuring the intrinsic dimension of objective landscapes. International Conference on Learning Representations, 2018.
  48. [48] Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. 2023.
  49. [49] Yanghao Li, Hanzi Mao, Ross Girshick, and Kaiming He. Exploring plain vision transformer backbones for object detection. arXiv preprint arXiv:2203.16527, 2022.
  50. [50] Yanghao Li, Saining Xie, Xinlei Chen, Piotr Dollar, Kaiming He, and Ross Girshick. Benchmarking detection transfer learning with vision transformers. arXiv preprint arXiv:2111.11429, 2021.
  51. [51] Arun Mallya, Dillon Davis, and Svetlana Lazebnik. Piggyback: Adapting a single network to multiple tasks by learning to mask weights. In European Conference on Computer Vision (ECCV), pages 67–82, 2018.
  52. [52] Arun Mallya and Svetlana Lazebnik. Packnet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7765–7773, 2018.
  53. [53] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2021.
  54. [54] Midjourney. https://www.midjourney.com/, 2023.
  55. [55] Ron Mokady, Omer Tov, Michal Yarom, Oran Lang, Inbar Mosseri, Tali Dekel, Daniel Cohen-Or, and Michal Irani. Self-distilled stylegan: Towards generation from internet photos. In ACM SIGGRAPH 2022 Conference Proceedings, pages 1–9, 2022.
  56. [56] Chong Mou, Xintao Wang, Liangbin Xie, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453, 2023.
  57. [57] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: towards photorealistic image generation and editing with text-guided diffusion models. CoRR, 2021.
  58. [58] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. 2022.
  59. [59] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162–8171. PMLR, 2021.
  60. [60] Yotam Nitzan, Kfir Aberman, Qiurui He, Orly Liba, Michal Yarom, Yossi Gandelsman, Inbar Mosseri, Yael Pritch, and Daniel Cohen-Or. Mystyle: A personalized generative prior. arXiv preprint arXiv:2203.17272, 2022.
  61. [61] ogkalu. Comic-diffusion v2, trained on 6 styles at once. https://huggingface.co/ogkalu/comic-diffusion, 2022.
  62. [62] OpenAI. Dall-e-2. https://openai.com/product/dall-e-2, 2023.
  63. [63] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2337–2346, 2019.
  64. [64] Gaurav Parmar, Krishna Kumar Singh, Richard Zhang, Yijun Li, Jingwan Lu, and Jun-Yan Zhu. Zero-shot image-to-image translation. arXiv preprint arXiv:2302.03027, 2023.
  65. [65] Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. Styleclip: Text-driven manipulation of stylegan imagery. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2085–2094, October 2021.
  66. [66] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
  67. [67] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
  68. [68] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021.
  69. [69] René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3):1623–1637, 2020.
  70. [70] Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. Efficient parametrization of multi-domain deep neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8119–8127, 2018.
  71. [71] Elad Richardson, Yuval Alaluf, Or Patashnik, Yotam Nitzan, Yaniv Azar, Stav Shapiro, and Daniel Cohen-Or. Encoding in style: a stylegan encoder for image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
  72. [72] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
  73. [73] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI) International Conference, pages 234–241, 2015.
  74. [74] Amir Rosenfeld and John K Tsotsos. Incremental learning through deep adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(3):651–663, 2018.
  75. [75] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. arXiv preprint arXiv:2208.12242, 2022.
  76. [76] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, Oct. 1986.
  77. [77] Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee, Jonathan Ho, Tim Salimans, David Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings, SIGGRAPH '22, New York, NY, USA, 2022. Association for Computing Machinery.
  78. [78] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022.
  79. [79] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade W Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa R Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5b: An open large-scale dataset for training next generation image-text models, 2022.
  80. [80] Joan Serra, Didac Suris, Marius Miron, and Alexandros Karatzoglou. Overcoming catastrophic forgetting with hard attention to the task. In International Conference on Machine Learning, pages 4548–4557. PMLR, 2018.

Showing first 80 references.