pith. machine review for the scientific record.

arxiv: 2211.01324 · v5 · submitted 2022-11-02 · 💻 cs.CV · cs.LG

Recognition: 2 theorem links

· Lean Theorem

eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 01:40 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords text-to-image generation · diffusion models · ensemble of experts · denoisers · prompt alignment · image synthesis · conditional generation

The pith

An ensemble of stage-specialized diffusion models improves text alignment in image synthesis at the same inference cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that text-to-image diffusion models change their reliance on conditioning during sampling: early iterations depend strongly on the text prompt to build aligned content, while later iterations largely disregard it. A single model with shared parameters across all steps is therefore suboptimal. The authors address this by first training one base model and then splitting it into an ensemble of expert denoisers, each fine-tuned for a narrow range of sampling stages. The resulting system, eDiff-I, delivers stronger prompt adherence, keeps visual quality high, and runs at the original computational budget. Additional conditioning options, including CLIP image embeddings for style transfer and a paint-with-words interface, are shown to work naturally within the same framework.

Core claim

We propose to train an ensemble of text-to-image diffusion models specialized for different synthesis stages. To maintain training efficiency, we initially train a single model, which is then split into specialized models that are trained for the specific stages of the iterative generation process. Our ensemble of diffusion models, called eDiff-I, results in improved text alignment while maintaining the same inference computation cost and preserving high visual quality, outperforming previous large-scale text-to-image diffusion models on the standard benchmark.

What carries the argument

An ensemble of expert denoisers, each fine-tuned on a narrow window of the iterative sampling trajectory after an initial shared pre-training stage.
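The split-then-specialize recipe implies a simple dispatch rule at sampling time: look up which expert owns the current noise level and run only that one. A minimal sketch, assuming experts are plain callables and noise levels are normalized to [0, 1]; the interval boundaries and signatures here are illustrative, not the paper's actual configuration:

```python
def make_ensemble(experts, boundaries):
    """Stage-routed denoiser in the spirit of eDiff-I (hypothetical sketch).

    `experts` are denoising callables (x_t, t, cond) -> prediction.
    `boundaries` are sorted edges of the noise-level range, e.g.
    [0.0, 0.3, 0.7, 1.0] assigns three experts to [0, 0.3), [0.3, 0.7),
    and [0.7, 1.0]. Routing is a pure lookup: exactly one expert runs
    per sampling step, so per-step inference cost matches a single
    model of the same size.
    """
    assert len(boundaries) == len(experts) + 1

    def denoise(x_t, t, cond):
        for i, expert in enumerate(experts):
            if boundaries[i] <= t < boundaries[i + 1]:
                return expert(x_t, t, cond)
        return experts[-1](x_t, t, cond)  # t exactly at the upper edge

    return denoise
```

After the shared pre-training stage, each expert starts from the same weights and receives continued training only on samples whose noise level falls in its interval.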

If this is right

  • Better text-to-image alignment on standard benchmarks than prior large-scale diffusion models.
  • No increase in inference compute or sampling steps relative to a single model.
  • Retention of high visual fidelity while adding controllable behaviors via multiple conditioning embeddings.
  • Support for intuitive style transfer from reference images using CLIP image embeddings.
  • User-level control via a paint-with-words mechanism that lets selected prompt words directly influence output regions.
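The paint-with-words item can be pictured as biasing cross-attention toward painted words. A minimal per-pixel sketch under assumed names: `scores_row` holds one image query's attention logits over the text tokens, `painted` marks tokens whose user-drawn mask covers this pixel, and `weight` is a hypothetical boost scale; the paper's actual formulation may differ in detail.

```python
import math

def paint_with_words_attention(scores_row, painted, weight=1.0):
    """Additive attention bias for one image pixel (hypothetical sketch).

    Tokens the user painted over this pixel get their attention logit
    raised by `weight` before the softmax, steering the pixel's content
    toward the selected word without changing the sampler itself.
    """
    biased = [s + weight * p for s, p in zip(scores_row, painted)]
    m = max(biased)                              # numerically stable softmax
    exps = [math.exp(b - m) for b in biased]
    z = sum(exps)
    return [e / z for e in exps]                 # attention weights, sum to 1
```

Because the bias acts inside the diffusion loop at every step, the control is soft: unpainted regions still attend normally to all tokens.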

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same staged-specialization pattern could be tested on other iterative generative tasks such as video or 3D synthesis where conditioning importance may also vary across steps.
  • Focusing capacity on the phases where conditioning actually matters might reduce the parameter count needed for high performance compared with scaling a monolithic model.
  • The paint-with-words interface suggests a path toward more interactive, region-specific editing tools that operate inside the diffusion loop rather than post hoc.
  • Because the ensemble is created by splitting a shared base, the method may offer a practical route for adapting large pre-trained diffusion models to new domains without retraining from scratch.

Load-bearing premise

The synthesis process changes qualitatively so that text conditioning drives early steps but is largely ignored later, rendering a single shared-parameter model suboptimal.
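This premise is directly measurable: at each noise level, compare the model's prediction under the real prompt against a null prompt and record the gap. A sketch with a toy denoiser standing in for a real model; the linear text dependence is an illustrative assumption, not the paper's actual measurement.

```python
import math

def conditioning_reliance(denoise_fn, x_t, t, text_emb, null_emb):
    """How much does the prediction change when the prompt is nulled out?

    A large gap means this step leans on the text; a gap near zero means
    the conditioning is effectively ignored at this noise level.
    """
    eps_text = denoise_fn(x_t, t, text_emb)
    eps_null = denoise_fn(x_t, t, null_emb)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(eps_text, eps_null)))

def toy_denoiser(x_t, t, cond):
    # Toy stand-in: text influence scales with the noise level t
    # (t = 1 is the start of sampling), mimicking the claimed behavior.
    return [x + t * c for x, c in zip(x_t, cond)]
```

Sweeping t from 1 down to 0 with a probe like this is one way to reproduce the stage-dependence observation on any trained model.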

What would settle it

If a single diffusion model trained with the same total compute budget produces equal or higher text-alignment scores on the standard benchmark, the premise that stage specialization is required would be refuted.

Original abstract

Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis. Starting from random noise, such text-to-image diffusion models gradually synthesize images in an iterative fashion while conditioning on text prompts. We find that their synthesis behavior qualitatively changes throughout this process: Early in sampling, generation strongly relies on the text prompt to generate text-aligned content, while later, the text conditioning is almost entirely ignored. This suggests that sharing model parameters throughout the entire generation process may not be ideal. Therefore, in contrast to existing works, we propose to train an ensemble of text-to-image diffusion models specialized for different synthesis stages. To maintain training efficiency, we initially train a single model, which is then split into specialized models that are trained for the specific stages of the iterative generation process. Our ensemble of diffusion models, called eDiff-I, results in improved text alignment while maintaining the same inference computation cost and preserving high visual quality, outperforming previous large-scale text-to-image diffusion models on the standard benchmark. In addition, we train our model to exploit a variety of embeddings for conditioning, including the T5 text, CLIP text, and CLIP image embeddings. We show that these different embeddings lead to different behaviors. Notably, the CLIP image embedding allows an intuitive way of transferring the style of a reference image to the target text-to-image output. Lastly, we show a technique that enables eDiff-I's "paint-with-words" capability. A user can select the word in the input text and paint it in a canvas to control the output, which is very handy for crafting the desired image in mind. The project page is available at https://deepimagination.cc/eDiff-I/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes eDiff-I, an ensemble of expert denoisers for text-to-image diffusion models. A single base model is first trained across all timesteps and then split into stage-specific specialists that receive continued training on restricted timestep intervals. The central claim is that this yields improved text alignment at unchanged inference cost and visual quality, outperforming prior large-scale models on a standard benchmark; additional results cover conditioning with T5/CLIP embeddings and a paint-with-words interface.

Significance. If the performance gains are shown to arise from the ensemble structure rather than extra optimization steps, the work would demonstrate that parameter sharing across the full diffusion trajectory is suboptimal and that stage-specialized experts can improve conditioning adherence without raising inference cost. This would be a useful empirical finding for diffusion-based generative modeling.

major comments (1)
  1. [Training procedure] Training procedure (described in abstract and §3): each specialist receives additional gradient steps on its assigned timestep interval after the base model is split. The manuscript compares against single-model baselines that appear to have received fewer total optimization steps. This leaves open the possibility that measured gains in text alignment are driven by extra training compute rather than removal of parameter sharing, directly undermining the claim that a shared-parameter model is suboptimal.
minor comments (1)
  1. [Abstract] Abstract: the claim of outperformance on 'the standard benchmark' supplies no quantitative metrics, error bars, ablation tables, or benchmark name, preventing verification of the result.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and for identifying the key question of whether gains arise from specialization or extra optimization steps. We address this concern directly below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: Training procedure (described in abstract and §3): each specialist receives additional gradient steps on its assigned timestep interval after the base model is split. The manuscript compares against single-model baselines that appear to have received fewer total optimization steps. This leaves open the possibility that measured gains in text alignment are driven by extra training compute rather than removal of parameter sharing, directly undermining the claim that a shared-parameter model is suboptimal.

    Authors: We agree that the current experimental setup does not fully isolate the effect of parameter sharing from total training compute. After the initial base model is trained, each specialist receives continued gradient steps on its restricted timestep interval, resulting in higher aggregate optimization steps for the ensemble than for the reported single-model baselines. To address this, we will add a controlled ablation in the revised manuscript: a single shared-parameter model trained for a total number of gradient steps matching the sum of steps used across all specialists. We will report text-alignment metrics (e.g., CLIP score) and visual quality for this equal-compute baseline alongside eDiff-I. If the ensemble still outperforms, this will strengthen the claim that stage-specific specialization is beneficial beyond extra training. We will also explicitly document the step counts for the base model and each specialist in §3 and the appendix. revision: yes
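The proposed equal-compute ablation reduces to comparing mean alignment scores between the ensemble and a step-matched single model. A sketch with a cosine-similarity stand-in for CLIP score; the embeddings and score values are placeholders, not results from the paper (real evaluations use a pretrained CLIP model).

```python
import math

def clip_score(image_emb, text_emb):
    """Cosine similarity between an image embedding and its prompt
    embedding, standing in for CLIP score (embeddings are placeholders)."""
    dot = sum(a * b for a, b in zip(image_emb, text_emb))
    na = math.sqrt(sum(a * a for a in image_emb))
    nb = math.sqrt(sum(b * b for b in text_emb))
    return dot / (na * nb)

def equal_compute_gap(ensemble_scores, baseline_scores):
    """Mean alignment gap: ensemble vs. a single shared-parameter model
    trained for the same total number of gradient steps. A positive gap
    under matched compute would support the specialization claim."""
    mean = lambda xs: sum(xs) / len(xs)
    return mean(ensemble_scores) - mean(baseline_scores)
```

If the gap vanishes at matched compute, the referee's alternative explanation (extra optimization, not specialization) survives.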

Circularity Check

0 steps flagged

No circularity: purely empirical training procedure with no derivations

full rationale

The paper describes an empirical procedure: train a base diffusion model, split parameters into stage-specific experts, and continue training each on its timestep interval. No equations, predictions, or first-principles derivations are presented that could reduce to fitted inputs by construction. The central claim (ensemble improves text alignment) rests on benchmark comparisons rather than any self-definitional or self-citation load-bearing step. External benchmarks and qualitative observations are independent of the training split itself, satisfying the criteria for a self-contained empirical result.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; the central claim rests on the empirical observation of stage-dependent text reliance.

pith-pipeline@v0.9.0 · 5666 in / 1019 out tokens · 44630 ms · 2026-05-15T01:40:27.202143+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation.EightTick eight_tick_period · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    We find that their synthesis behavior qualitatively changes throughout this process: Early in sampling, generation strongly relies on the text prompt to generate text-aligned content, while later, the text conditioning is almost entirely ignored.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Consistency Models

    cs.LG 2023-03 conditional novelty 8.0

    Consistency models achieve fast one-step generation with SOTA FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 by directly mapping noise to data, outperforming prior distillation techniques.

  2. A Flow Matching Algorithm for Many-Shot Adaptation to Unseen Distributions

    cs.LG 2026-05 unverdicted novelty 7.0

    FP-FM adapts flow matching models to unseen distributions via least-squares projection onto basis functions spanning training velocity fields, yielding improved precision and recall without inference-time training.

  3. Benchmarking Layout-Guided Diffusion Models through Unified Semantic-Spatial Evaluation in Closed and Open Settings

    cs.CV 2026-04 conditional novelty 7.0

    Introduces closed-set C-Bench and open-set O-Bench for layout-guided diffusion models, a unified semantic-spatial scoring protocol, and ranks six models after generating and evaluating 319,086 images.

  4. ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

    cs.CV 2024-03 unverdicted novelty 7.0

    ELLA introduces a timestep-aware semantic connector to link LLMs with diffusion models for improved dense prompt following, validated on a new 1K-prompt benchmark.

  5. AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

    cs.CV 2023-07 unverdicted novelty 7.0

    A single motion module trained on videos adds temporally coherent animation to any personalized text-to-image model derived from the same base without additional tuning.

  6. Leveraging Verifier-Based Reinforcement Learning in Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    Edit-R1 trains a CoT-based reasoning reward model with GCPO and uses it to boost image editing performance over VLMs and models like FLUX.1-kontext via GRPO.

  7. Temporally Extended Mixture-of-Experts Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Temporally extended MoE layers using the option-critic framework with deliberation costs cut switching rates below 5% while retaining most capability on MATH, MMLU, and MMMLU.

  8. OFA-Diffusion Compression: Compressing Diffusion Model in One-Shot Manner

    cs.CV 2026-04 conditional novelty 6.0

    OFA-Diffusion Compression trains diffusion models once to yield multiple size-specific compressed subnetworks via restricted candidate spaces, importance-based channel allocation, and reweighting.

  9. PoM: A Linear-Time Replacement for Attention with the Polynomial Mixer

    cs.CV 2026-04 unverdicted novelty 6.0

    PoM is a new linear-complexity token mixer using learned polynomials that matches attention performance in transformers while enabling efficient long-sequence processing.

  10. SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers

    cs.CV 2024-10 unverdicted novelty 6.0

    Sana-0.6B produces high-resolution images with strong text alignment at 20x smaller size and 100x higher throughput than Flux-12B by combining 32x image compression, linear DiT blocks, and a decoder-only LLM text encoder.

  11. CameraCtrl: Enabling Camera Control for Text-to-Video Generation

    cs.CV 2024-04 unverdicted novelty 6.0

    CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.

  12. VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

    cs.CV 2023-10 unverdicted novelty 6.0

    Open-source text-to-video and image-to-video diffusion models generate high-quality 1024x576 videos, with the I2V variant claimed as the first to strictly preserve reference image content.

  13. IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

    cs.CV 2023-08 unverdicted novelty 6.0

    IP-Adapter adds effective image prompting to text-to-image diffusion models using a lightweight decoupled cross-attention adapter that works alongside text prompts and other controls.

  14. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    cs.CV 2023-07 conditional novelty 6.0

    SDXL improves upon prior Stable Diffusion versions through a larger UNet backbone, dual text encoders, novel conditioning, and a refinement model, producing higher-fidelity images competitive with black-box state-of-the-art image generators.

  15. Embody4D: A Generalist 4D World Model for Embodied AI

    cs.CV 2026-05 unverdicted novelty 5.0

    Embody4D generates high-fidelity, view-consistent novel views from monocular videos for embodied scenarios via 3D-aware data synthesis, adaptive noise injection, and interaction-aware attention.

  16. DiffMagicFace: Identity Consistent Facial Editing of Real Videos

    cs.CV 2026-04 unverdicted novelty 5.0

    DiffMagicFace uses concurrent fine-tuned text and image diffusion models plus a rendered multi-view dataset to achieve identity-consistent text-conditioned editing of real facial videos.

  17. ADP-DiT: Text-Guided Diffusion Transformer for Brain Image Generation in Alzheimer's Disease Progression

    cs.CV 2026-04 unverdicted novelty 5.0

    ADP-DiT is a text-conditioned diffusion transformer for synthesizing longitudinal Alzheimer's MRI scans, reporting SSIM 0.8739 and PSNR 29.32 dB with improvements over a DiT baseline.

  18. 3D Smoke Scene Reconstruction Guided by Vision Priors from Multimodal Large Language Models

    cs.CV 2026-04 unverdicted novelty 5.0

    A framework that combines MLLM-based image enhancement with a medium-aware 3D Gaussian Splatting model to reconstruct and render smoke scenes.

  19. Movie Gen: A Cast of Media Foundation Models

    cs.CV 2024-10 unverdicted novelty 5.0

    A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.

  20. Cosmos World Foundation Model Platform for Physical AI

    cs.CV 2025-01 unverdicted novelty 3.0

    The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.

Reference graph

Works this paper leans on

95 extracted references · 95 canonical work pages · cited by 20 Pith papers · 11 internal anchors

  1. [1]

    Efficient large scale language modeling with mixtures of experts

    Mikel Artetxe, Shruti Bhosale, Naman Goyal, Todor Mihaylov, Myle Ott, Sam Shleifer, Xi Victoria Lin, Jingfei Du, Srinivasan Iyer, Ramakanth Pasunuru, et al. Efficient large scale language modeling with mixtures of experts. arXiv preprint arXiv:2112.10684, 2021. 5

  2. [2]

    Blended latent diffusion

    Omri Avrahami, Ohad Fried, and Dani Lischinski. Blended latent diffusion. arXiv preprint arXiv:2206.02779, 2022. 4

  3. [3]

    Blended diffusion for text-driven editing of natural images

    Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In Proc. CVPR, 2022. 4, 14

  4. [4]

    Estimating the optimal covariance with imperfect mean in diffusion probabilistic models

    Fan Bao, Chongxuan Li, Jiacheng Sun, Jun Zhu, and Bo Zhang. Estimating the optimal covariance with imperfect mean in diffusion probabilistic models. In Proc. ICML, 2022. 4

  5. [5]

    Analytic- DPM: An analytic estimate of the optimal reverse variance in diffusion probabilistic models

    Fan Bao, Chongxuan Li, Jun Zhu, and Bo Zhang. Analytic- DPM: An analytic estimate of the optimal reverse variance in diffusion probabilistic models. In Proc. ICLR, 2022. 4

  6. [6]

    Paint by word

    David Bau, Alex Andonian, Audrey Cui, YeonHwan Park, Ali Jahanian, Aude Oliva, and Antonio Torralba. Paint by word. arXiv preprint arXiv:2103.10951, 2021. 14

  7. [7]

    Semi-Parametric Neural Image Synthesis

    Andreas Blattmann, Robin Rombach, Kaan Oktay, Jonas Müller, and Björn Ommer. Semi-Parametric Neural Image Synthesis. In Proc. NeurIPS, 2022. 4

  8. [8]

    Language models are few-shot learners

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In Proc. NeurIPS, 2020. 5

  9. [9]

    Re-Imagen: Retrieval-Augmented Text-to-Image Generator

    Wenhu Chen, Hexiang Hu, Chitwan Saharia, and William W. Cohen. Re-Imagen: Retrieval-Augmented Text-to-Image Generator. arXiv preprint arXiv:2209.14491, 2022. 4

  10. [10]

    PaLM: Scaling Language Modeling with Pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022. 5

  11. [11]

    Improving diffusion models for inverse problems using manifold constraints

    Hyungjin Chung, Byeongsu Sim, Dohoon Ryu, and Jong Chul Ye. Improving diffusion models for inverse problems using manifold constraints. In Proc. NeurIPS, 2022. 4

  12. [12]

    DiffEdit: Diffusion-based semantic image editing with mask guidance

    Guillaume Couairon, Jakob Verbeek, Holger Schwenk, and Matthieu Cord. DiffEdit: Diffusion-based semantic image editing with mask guidance. arXiv preprint arXiv:2210.11427, 2022. 4

  13. [13]

    Diffusion models beat GANs on image synthesis

    Prafulla Dhariwal and Alexander Quinn Nichol. Diffusion models beat GANs on image synthesis. In Proc. NeurIPS, 2021.

  14. [14]

    Differentially private diffusion models

    Tim Dockhorn, Tianshi Cao, Arash Vahdat, and Karsten Kreis. Differentially private diffusion models. arXiv preprint arXiv:2210.09929, 2022.

  15. [15]

    GENIE: Higher-order denoising diffusion solvers

    Tim Dockhorn, Arash Vahdat, and Karsten Kreis. GENIE: Higher-order denoising diffusion solvers. In Proc. NeurIPS, 2022.

  16. [16]

    Score-based generative modeling with critically-damped Langevin diffusion

    Tim Dockhorn, Arash Vahdat, and Karsten Kreis. Score-based generative modeling with critically-damped Langevin diffusion. In Proc. ICLR, 2022. 4

  17. [17]

    Make-a-scene: Scene-based text-to-image generation with human priors

    Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-a-scene: Scene-based text-to-image generation with human priors. arXiv preprint arXiv:2203.13131, 2022. 9, 10

  18. [18]

    An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

    Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.

  19. [19]

    Vector quantized diffusion model for text-to-image synthesis

    Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector quantized diffusion model for text-to-image synthesis. In Proc. CVPR, 2022. 4

  20. [20]

    Flexible diffusion modeling of long videos

    William Harvey, Saeid Naderiparizi, Vaden Masrani, Chris- tian Weilbach, and Frank Wood. Flexible diffusion modeling of long videos. arXiv preprint arXiv:2205.11495, 2022. 4

  21. [21]

    Scaling Laws for Autoregressive Generative Modeling

    Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B. Brown, Prafulla Dhariwal, Scott Gray, et al. Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010.14701, 2020.

  22. [22]

    Prompt-to-Prompt Image Editing with Cross Attention Control

    Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022. 4

  23. [23]

    Deep Learning Scaling is Predictable, Empirically

    Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md. Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409, 2017.

  24. [24]

    GANs trained by a two time-scale update rule converge to a local Nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in neural information processing systems, 30, 2017. 9

  25. [25]

    Imagen Video: High Definition Video Generation with Diffusion Models

    Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, and Tim Salimans. Imagen Video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022. 4

  26. [26]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Proc. NeurIPS, 2020. 2, 4

  27. [27]

    Cascaded diffusion models for high fidelity image generation

    Jonathan Ho, Chitwan Saharia, William Chan, David J. Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. JMLR, 23(47):1–33, 2022. 4, 7

  28. [28]

    Classifier-free diffusion guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021. 4, 7

  29. [29]

    Video diffusion models

    Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J. Fleet. Video diffusion models. In ICLR Workshop on Deep Generative Models for Highly Structured Data, 2022. 4

  30. [30]

    Multimodal conditional image synthesis with product- of-experts GANs

    Xun Huang, Arun Mallya, Ting-Chun Wang, and Ming-Yu Liu. Multimodal conditional image synthesis with product- of-experts GANs. In Proc. ECCV, 2022. 14

  31. [31]

    Estimation of non-normalized statistical models by score matching

    Aapo Hyvärinen. Estimation of non-normalized statistical models by score matching. JMLR, 6(24):695–709, 2005. 4, 5

  32. [32]

    Image-to-image translation with conditional adversarial networks

    Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In Proc. CVPR, 2017. 14

  33. [33]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.

  34. [34]

    Elucidating the design space of diffusion-based generative models

    Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In Proc. NeurIPS, 2022. 4, 5

  35. [35]

    Denoising diffusion restoration models

    Bahjat Kawar, Michael Elad, Stefano Ermon, and Jiaming Song. Denoising diffusion restoration models. In Proc. NeurIPS, 2022. 4

  36. [36]

    JPEG artifact correction using denoising diffusion restoration models

    Bahjat Kawar, Jiaming Song, Stefano Ermon, and Michael Elad. JPEG artifact correction using denoising diffusion restoration models. In NeurIPS 2022 Workshop on Score- Based Methods, 2022. 4

  37. [37]

    Imagic: Text-based real image editing with diffusion models

    Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. arXiv preprint arXiv:2210.09276, 2022. 4

  38. [38]

    Scaling laws for deep learning based image reconstruction

    Tobit Klug and Reinhard Heckel. Scaling laws for deep learning based image reconstruction. arXiv preprint arXiv:2209.13435, 2022. 2

  39. [39]

    DiffWave: A versatile diffusion model for audio synthesis

    Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. DiffWave: A versatile diffusion model for audio synthesis. In Proc. ICLR, 2021. 4

  40. [40]

    Visual Genome: Connecting language and vision using crowdsourced dense image annotations

    Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael Bernstein, and Li Fei-Fei. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123:32–73, 2017. 9

  41. [41]

    Diffusion-LM improves controllable text generation

    Xiang Lisa Li, John Thickstun, Ishaan Gulrajani, Percy Liang, and Tatsunori B. Hashimoto. Diffusion-LM improves controllable text generation. arXiv preprint arXiv:2205.14217, 2022. 4

  42. [42]

    Microsoft COCO: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In Proc. ECCV, 2014. 9

  43. [43]

    Pseudo numerical methods for diffusion models on manifolds

    Luping Liu, Yi Ren, Zhijie Lin, and Zhou Zhao. Pseudo numerical methods for diffusion models on manifolds. In Proc. ICLR, 2022. 4

  44. [44]

    Imaginaire

    Ming-Yu Liu, Ting-Chun Wang, Xun Huang, and Arun Mallya. Imaginaire. https://github.com/NVlabs/ imaginaire, 2020. 8

  45. [45]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In Proc. ICLR, 2019. 8

  46. [46]

    DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps

    Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. In Proc. NeurIPS, 2022. 4

  47. [47]

    RePaint: Inpainting using denoising diffusion probabilistic models

    Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. RePaint: Inpainting using denoising diffusion probabilistic models. In Proc. CVPR, 2022.

  48. [48]

    Improving diffusion model efficiency through patching

    Troy Luhman and Eric Luhman. Improving diffusion model efficiency through patching. arXiv preprint arXiv:2207.04316, 2022.

  49. [49]

    Diffusion probabilistic models for 3D point cloud generation

    Shitong Luo and Wei Hu. Diffusion probabilistic models for 3D point cloud generation. In Proc. CVPR, 2021. 4

  50. [50]

    SDEdit: Guided image synthesis and editing with stochastic differential equations

    Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. In Proc. ICLR, 2022. 4

  51. [51]

    GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models

    Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In Proc. ICML, 2022. 3, 4, 9, 10, 14

  52. [52]

    Diffusion models for adversarial purification

    Weili Nie, Brandon Guo, Yujia Huang, Chaowei Xiao, Arash Vahdat, and Anima Anandkumar. Diffusion models for adversarial purification. In Proc. ICML, 2022. 4

  53. [53]

    Semantic image synthesis with spatially-adaptive normalization

    Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In Proc. CVPR, 2019. 14

  54. [54]

    PyTorch: An imperative style, high-performance deep learning library

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. In Proc. NeurIPS, 2019. 8

  55. [55]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Proc. ICML, 2021.

  56. [56]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 21(140):1–67, 2020. 3, 5, 7

  57. [57]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022. 2, 3, 4, 5, 9, 10, 23

  58. [58]

  59. [59]

    Scaling vision with sparse mixture of experts

    Carlos Riquelme, Joan Puigcerver, Basil Mustafa, Maxim Neumann, Rodolphe Jenatton, André Susano Pinto, Daniel Keysers, and Neil Houlsby. Scaling vision with sparse mixture of experts. In Proc. NeurIPS, 2021. 5

  60. [60]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proc. CVPR, 2022. 2, 3, 4, 7, 9, 10

  61. [61]

    Stable diffusion v1-4

    Robin Rombach and Patrick Esser. Stable diffusion v1-4. https://huggingface.co/CompVis/stable-diffusion-v1-4, July 2022. 3, 4

  62. [62]

    DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation

    Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. arXiv preprint arXiv:2208.12242, 2022. 4

  63. [63]

    Palette: Image-to-image diffusion models

    Chitwan Saharia, William Chan, Huiwen Chang, Chris A. Lee, Jonathan Ho, Tim Salimans, David J. Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models. In Proc. SIGGRAPH, 2022. 4

  64. [64]

    Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Raphael Gontijo Lopes, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022. 2,...

  65. [65]

    Image super-resolution via iterative refinement

    Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J. Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. IEEE Trans. Pattern Analysis and Machine Intelligence, 2022. 4

  66. [66]

    Outrageously large neural networks: The sparsely-gated mixture-of-experts layer

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In Proc. ICLR, 2017. 5

  67. [67]

    KNN-Diffusion: Image Generation via Large-Scale Retrieval

    Shelly Sheynin, Oron Ashual, Adam Polyak, Uriel Singer, Oran Gafni, Eliya Nachmani, and Yaniv Taigman. KNN-Diffusion: Image Generation via Large-Scale Retrieval. arXiv preprint arXiv:2204.02849, 2022. 4

  68. [68]

    Make-A-Video: Text-to-Video Generation without Text-Video Data

    Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, and Yaniv Taigman. Make-A-Video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022. 4

  69. [69]

    D2C: Diffusion-decoding models for few-shot conditional generation

    Abhishek Sinha, Jiaming Song, Chenlin Meng, and Stefano Ermon. D2C: Diffusion-decoding models for few-shot conditional generation. In Proc. NeurIPS, 2021. 4

  70. [70]

    Deep unsupervised learning using nonequilibrium thermodynamics

    Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In Proc. ICML, 2015. 2, 4

  71. [71]

    Denoising diffusion implicit models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In Proc. ICLR, 2021. 4, 5

  72. [72]

    Generative modeling by estimating gradients of the data distribution

    Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In Proc. NeurIPS, 2019.

  73. [73]

    Improved techniques for training score-based generative models

    Yang Song and Stefano Ermon. Improved techniques for training score-based generative models. In Proc. NeurIPS, 2020.

  74. [74]

    Score-based generative modeling through stochastic differential equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In Proc. ICLR, 2021. 2, 4, 5

  75. [75]

    Dropout: a simple way to prevent neural networks from overfitting

    Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014. 7

  76. [76]

    The bitter lesson

    Rich Sutton. The bitter lesson. http://www.incompleteideas.net/IncIdeas/BitterLesson.html, March 2019. 2

  77. [77]

    CSDI: Conditional score-based diffusion models for probabilistic time series imputation

    Yusuke Tashiro, Jiaming Song, Yang Song, and Stefano Ermon. CSDI: Conditional score-based diffusion models for probabilistic time series imputation. In Proc. NeurIPS, 2021. 4

  78. [78]

    Score-based generative modeling in latent space

    Arash Vahdat, Karsten Kreis, and Jan Kautz. Score-based generative modeling in latent space. In Proc. NeurIPS, 2021. 4

  79. [79]

    UniTune: Text-driven image editing by fine tuning an image generation model on a single image

    Dani Valevski, Matan Kalman, Yossi Matias, and Yaniv Leviathan. UniTune: Text-driven image editing by fine tuning an image generation model on a single image. arXiv preprint arXiv:2210.09477, 2022. 4

  80. [80]

    A connection between score matching and denoising autoencoders

    Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661– 1674, 2011. 4, 5

Showing first 80 references.