pith. machine review for the scientific record.

arxiv: 2604.05761 · v1 · submitted 2026-04-07 · 💻 cs.CV

Recognition: no theorem link

Improving Controllable Generation: Faster Training and Better Performance via x₀-Supervision

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 18:33 UTC · model grok-4.3

classification 💻 cs.CV
keywords controllable generation · diffusion models · x0-supervision · training objective · text-to-image generation · convergence acceleration · image control · denoising dynamics

The pith

Direct supervision on clean target images accelerates training of controllable diffusion models by up to 2x while improving quality and control accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Controllable text-to-image diffusion models add conditions such as layouts on top of a standard model, but they often train slowly under the usual noise-prediction loss. Through an analysis of the denoising dynamics, the paper shows that supervising directly on the clean image x0, or equivalently re-weighting the loss, makes the model converge faster. The change yields up to a 2x speedup under a new convergence metric and produces images that better match both the text and the control signals. Readers care because it cuts the time and compute needed to train precise image generators without changing the model architecture.

Core claim

Analysis of the denoising process in models with extra control conditions reveals that the standard training objective creates a mismatch in how the model learns to predict under controls. Switching to direct x0-supervision on the clean target image aligns the training signal with the control-augmented inputs, yielding convergence that is up to 2x faster according to the mean Area Under the Convergence Curve (mAUCC) metric, along with higher visual quality and better conditioning accuracy across multiple control settings.

What carries the argument

x0-supervision, a direct loss on the predicted clean image rather than noise, which reweights the diffusion training objective to better suit additional control signals.
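To make the shift concrete, here is a minimal Python (PyTorch-style) sketch of the two objectives, assuming a standard forward process x_t = α_t·x0 + σ_t·ε; the function and argument names (model, control, alphas, sigmas) are illustrative placeholders, not taken from the paper's released code.

    import torch

    def eps_loss(model, x0, control, alphas, sigmas):
        # Standard noise-prediction objective used by prior controllable methods.
        t = torch.randint(0, len(alphas), (x0.shape[0],), device=x0.device)
        eps = torch.randn_like(x0)
        a = alphas[t].view(-1, 1, 1, 1)   # assumes 4D image batches
        s = sigmas[t].view(-1, 1, 1, 1)
        x_t = a * x0 + s * eps
        eps_hat = model(x_t, t, control)  # network predicts the added noise
        return ((eps - eps_hat) ** 2).mean()

    def x0_loss(model, x0, control, alphas, sigmas):
        # x0-supervision: convert the noise prediction into an implied
        # clean-image estimate and penalise its distance to the true x0.
        t = torch.randint(0, len(alphas), (x0.shape[0],), device=x0.device)
        eps = torch.randn_like(x0)
        a = alphas[t].view(-1, 1, 1, 1)
        s = sigmas[t].view(-1, 1, 1, 1)
        x_t = a * x0 + s * eps
        eps_hat = model(x_t, t, control)
        x0_hat = (x_t - s * eps_hat) / a  # implied clean-image prediction
        return ((x0 - x0_hat) ** 2).mean()

The only difference between the two is the space in which the error is measured, which is why the same change can also be phrased as a re-weighting of the original loss.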

If this is right

  • Convergence measured by mAUCC improves by up to 2x in controllable settings.
  • Generated images show better visual quality under the same control inputs.
  • Accuracy in satisfying additional conditions such as layouts increases.
  • The approach works across various control types without needing model-specific changes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same supervision shift might apply to other generative frameworks, such as the flow models mentioned in the abstract.
  • Lower training times could enable more rapid iteration when developing new control mechanisms for image synthesis.
  • Framing the change as an equivalent re-weighting of the loss suggests the benefit comes from emphasizing the clean-image prediction in the objective (see the identity sketched below).
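A short worked identity makes that re-weighting explicit, assuming the standard forward process x_t = α_t x0 + σ_t ε and an ε-predicting network (a reading consistent with the figure captions, not a quotation of the paper's derivation):

    x_t = \alpha_t x_0 + \sigma_t \epsilon, \qquad
    \hat{x}_0 = \frac{x_t - \sigma_t\,\hat{\epsilon}_\theta(x_t, t, c)}{\alpha_t}
    \;\;\Longrightarrow\;\;
    \lVert x_0 - \hat{x}_0 \rVert_2^2
      = \frac{\sigma_t^2}{\alpha_t^2}\,\lVert \epsilon - \hat{\epsilon}_\theta(x_t, t, c) \rVert_2^2 .

Under this reading, x0-supervision is the usual noise-prediction loss weighted by σ_t²/α_t², i.e. the inverse signal-to-noise ratio, which up-weights the low-SNR timesteps that set the overall layout; this is consistent with Figure 7's observation that x0-supervision and (σ_t²/α_t²)·ϵ-supervision converge at the same speed.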

Load-bearing premise

The dynamics of denoising with added control conditions are similar enough across signals and architectures that a single change in supervision works universally.

What would settle it

An experiment comparing standard loss training to x0-supervision on identical controllable model setups, tracking mAUCC scores and qualitative control adherence until convergence, to check for the reported speedup and quality gains.
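For the convergence bookkeeping, a hypothetical Python sketch of an area-under-the-convergence-curve score in the spirit of mAUCC is given below; the exact definition and normalization used in the paper may differ, so treat this as an assumption rather than the paper's metric.

    import numpy as np

    def aucc(steps, metric_values):
        # Area under a convergence curve: `metric_values` holds e.g. mAP
        # measured at each checkpoint in `steps`. Trapezoidal area,
        # normalised by the training budget, so a method that reaches
        # high scores earlier receives a larger value.
        steps = np.asarray(steps, dtype=float)
        vals = np.asarray(metric_values, dtype=float)
        return np.trapz(vals, steps) / (steps[-1] - steps[0])

    def maucc(per_task_curves):
        # Mean over tasks/control settings of the per-task AUCC,
        # e.g. per_task_curves = [(steps, map_scores), ...].
        return float(np.mean([aucc(s, v) for s, v in per_task_curves]))

Such a score, tracked under both objectives on identical model setups, would test the reported up-to-2x gap directly.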

Figures

Figures reproduced from arXiv: 2604.05761 by Adrien Maglo, Amadou S. Sangare, Bertrand Luvison, Mohamed Chaouch.

Figure 1. ControlNet converges faster with the clean image x0 as the supervision signal compared to the baseline ϵ. The red, orange, and green squares respectively indicate that the generated sample does not follow, partially follows, and correctly follows the input control.
Figure 2. Evolution of the final image during a DDIM [32] sampling with 200 steps. The top row is the result for Stable Diffusion 1.5 and the bottom row is the result for a segmentation ControlNet [38] based on Stable Diffusion 1.5. Images are generated in a coarse-to-fine fashion: the early steps determine the overall layout of the scene (red arrow), while the next steps add fine-grained details (blue…
Figure 3. Our approach. For a more efficient controllable generation training, we propose to convert any predictor to an x0-predictor and supervise with the clean image. This simple trick significantly improves the convergence speed and the final performance. γt ≡ 0 corresponds to DDIM, and γt = (σt−1/σt)·√(1 − αt²/αt−1²) corresponds to DDPM. Sampling works by predicting x0 based …
Figure 4. Convergence curves for ControlNet and OminiControl on different tasks. We use an EMA weight of 0.9 to smooth the curves. Convergence is faster with x0-supervision. (Panels plot mAP against training steps for ϵ-, x0-, and v-prediction; panel (a) is GLIGEN Box+Text.)
Figure 5. Convergence curves for GLIGEN on different tasks. We use an EMA weight of 0.9 to smooth the curves. Convergence is faster with x0-supervision.
Figure 6. The noise schedule and the evolution of the signal-to-noise ratio in Stable Diffusion. The SNR decreases very quickly; using it as the loss-weighting function significantly down-weights the learning signal for low SNRs.
Figure 7. Convergence comparison between x0-supervision and (σt²/αt²)·ϵ-supervision. They have the same convergence speed, validating the insights in Sec. 3.3.
Figure 1. OT noise schedule used in flow matching and the corresponding weighting incurred by x0-supervision, in log scale.
Figure 2. Convergence curves for T2I-Adapter on different tasks. We use an EMA weight of 0.9 to smooth the curves. Convergence is faster with x0-supervision.
Figure 3. Log-weighting functions. Fig. 3a plots the log-weightings incurred by the x0-supervision loss when using ϵ, v, and x0 as supervision signals for SD-based control methods; Fig. 3b plots the corresponding weightings when using u, ϵ, and x0 as supervision signals for OminiControl. In both cases, lower convergence speed and performance are related to suboptimal weighting of the init…
Figure 4. Qualitative results on depth and segmentation ControlNet with the three supervision signals after …
Figure 5. Qualitative results on depth and segmentation ControlNet with the three supervision signals after …
Figure 6. Qualitative results on depth and segmentation ControlNet with the three supervision signals after …
Figure 7. Qualitative results on Canny and pose ControlNet with the three supervision signals after …
Figure 8. Qualitative results on Canny and pose ControlNet with the three supervision signals after …
Figure 9. Qualitative results on Canny edge and pose ControlNet with the three supervision signals after …
Figure 10. Qualitative results on depth and segmentation T2I-Adapter with the three supervision signals after …
Figure 11. Qualitative results on depth and segmentation T2I-Adapter with the three supervision signals after …
Figure 12. Qualitative results on Canny edge and pose T2I-Adapter with the three supervision signals after …
Figure 13. Qualitative results on Canny edge and pose T2I-Adapter with the three supervision signals after …
Figure 14. Qualitative results on depth and segmentation OminiControl with the three supervision signals after …
Figure 15. Qualitative results on depth and segmentation OminiControl with the three supervision signals after …
Figure 16. Qualitative results on Canny edge and pose OminiControl with the three supervision signals after …
Figure 17. Qualitative results on Canny edge and pose OminiControl with the three supervision signals after …
read the original abstract

Text-to-Image (T2I) diffusion/flow models have recently achieved remarkable progress in visual fidelity and text alignment. However, they remain limited when users need to precisely control image layouts, something that natural language alone cannot reliably express. Controllable generation methods augment the initial T2I model with additional conditions that more easily describe the scene. Prior works straightforwardly train the augmented network with the same loss as the initial network. Although natural at first glance, this can lead to very long training times in some cases before convergence. In this work, we revisit the training objective of controllable diffusion models through a detailed analysis of their denoising dynamics. We show that direct supervision on the clean target image, dubbed $x_0$-supervision, or an equivalent re-weighting of the diffusion loss, yields faster convergence. Experiments on multiple control settings demonstrate that our formulation accelerates convergence by up to 2$\times$ according to our novel metric (mean Area Under the Convergence Curve - mAUCC), while also improving both visual quality and conditioning accuracy. Our code is available at https://github.com/CEA-LIST/x0-supervision

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that standard training of controllable text-to-image diffusion models is suboptimal due to the denoising dynamics under additional control signals. It proposes x0-supervision (direct supervision on the clean target image x0) or an equivalent re-weighting of the diffusion loss, which accelerates convergence by up to 2x (measured by the new mAUCC metric), while also improving visual quality and conditioning accuracy. This is supported by analysis of the reverse process and experiments across multiple control settings, with code released.

Significance. If the central result holds, the work offers a simple, architecture-agnostic change to the training objective that addresses a practical bottleneck in controllable generation. The denoising-dynamics analysis provides explanatory insight, the mAUCC metric is a useful addition for convergence evaluation, and the empirical gains (faster training plus better metrics) are directly actionable. Releasing code supports reproducibility and adoption.

major comments (1)
  1. [§3] §3 (Denoising Dynamics Analysis): The claimed equivalence between x0-supervision and the re-weighted loss is derived under the assumption that the control signal modulates the clean-image prediction with timestep-independent scaling. This assumption does not obviously extend to cross-attention injection or cases where control features are themselves diffused; the paper's experiments test only a narrow subset of injection styles, so the generality of the 2× mAUCC gain remains unproven and load-bearing for the central claim.
minor comments (2)
  1. The abstract and introduction should explicitly list the control-injection mechanisms used in the experiments (e.g., concatenation or cross-attention) so readers can immediately assess the scope.
  2. [Experiments] Table or figure captions for the convergence curves should include the exact definition or pseudocode for mAUCC to make the metric self-contained.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive overall assessment of our work and for the constructive comment regarding the scope of our analysis. We address the concern point by point below.

read point-by-point responses
  1. Referee: [§3] §3 (Denoising Dynamics Analysis): The claimed equivalence between x0-supervision and the re-weighted loss is derived under the assumption that the control signal modulates the clean-image prediction with timestep-independent scaling. This assumption does not obviously extend to cross-attention injection or cases where control features are themselves diffused; the paper's experiments test only a narrow subset of injection styles, so the generality of the 2× mAUCC gain remains unproven and load-bearing for the central claim.

    Authors: We appreciate the referee's identification of the central assumption underlying the equivalence in §3. The derivation does rely on the control signal providing a timestep-independent modulation to the clean-image prediction, which enables showing that x0-supervision is equivalent to a re-weighted diffusion loss. We agree that this assumption may not hold exactly for all possible injection mechanisms, such as certain forms of cross-attention or when control features are themselves noised. Our experiments do cover multiple control settings that include cross-attention-based conditioning (in addition to other injection styles), and we observe consistent gains in convergence speed according to mAUCC as well as improved quality and accuracy. While the strict equivalence may be architecture-dependent, the practical advantage of direct x0-supervision appears to hold more broadly in the tested cases. To address the concern, we will revise §3 to explicitly state the assumptions and discuss their applicability to different injection styles, thereby clarifying the scope of the claimed gains. revision: partial

Circularity Check

0 steps flagged

No circularity: x0-supervision derived from independent denoising-dynamics analysis and validated empirically

full rationale

The paper's central derivation analyzes the denoising dynamics of controllable diffusion models to motivate direct x0-supervision or an equivalent loss re-weighting. This analysis is presented as first-principles reasoning on how control signals affect clean-image versus noise prediction, followed by empirical demonstration of faster convergence (via mAUCC) and improved quality across multiple control settings. No step reduces by construction to a fitted parameter, self-citation chain, or tautological renaming; the claimed acceleration is not forced by the objective definition itself but shown through experiments. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on standard diffusion model assumptions and the equivalence of x0-supervision to a re-weighted loss; no new free parameters, axioms beyond domain standards, or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption The denoising dynamics of diffusion models under additional control conditions can be analyzed to derive an improved training objective.
    Invoked in the abstract when stating that analysis of denoising dynamics leads to the x0-supervision formulation.

pith-pipeline@v0.9.0 · 5513 in / 1244 out tokens · 21031 ms · 2026-05-10T18:33:53.378785+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

20 extracted references · 1 canonical work page · 1 internal anchor

  1. [1] Stochastic Interpolants: A Unifying Framework for Flows and Diffusions
     Michael S. Albergo, Nicholas M. Boffi, and Eric Vanden-Eijnden. CoRR, abs/2303.08797, 2023.

  2. [2] Flux: Official inference repository for flux.1 models
     Black Forest Labs, 2024. Accessed 2024-11-12.

  3. [3] Grounded Language-Image Pre-training
     Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, Kai-Wei Chang, and Jianfeng Gao. In CVPR 2022, pages 10955–10965. IEEE, 2022.

  4. [4] ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback
     Ming Li, Taojiannan Yang, Huafeng Kuang, Jie Wu, Zhaoning Wang, Xuefeng Xiao, and Chen Chen. In ECCV 2024, Proceedings, Part VII, pages 129–147. Springer, 2024.

  5. [5] GLIGEN: Open-Set Grounded Text-to-Image Generation
     Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. In CVPR 2023, pages 22511–22521. IEEE, 2023.

  6. [6] Microsoft COCO: Common Objects in Context
     Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. In ECCV 2014, Proceedings, Part V, pages 740–755. Springer, 2014.

  7. [7] Flow Matching Guide and Code
     Yaron Lipman, Marton Havasi, Peter Holderrieth, Neta Shaul, Matt Le, Brian Karrer, Ricky T. Q. Chen, David Lopez-Paz, Heli Ben-Hamu, and Itai Gat, 2024.

  8. [8] Pseudo Numerical Methods for Diffusion Models on Manifolds
     Luping Liu, Yi Ren, Zhijie Lin, and Zhou Zhao. In ICLR 2022. OpenReview.net, 2022.

  9. [9] T2I-Adapter: Learning Adapters to Dig Out More Controllable Ability for Text-to-Image Diffusion Models
     Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, and Ying Shan. In AAAI 2024.

  10. [10] Im2Text: Describing Images Using 1 Million Captioned Photographs
      Vicente Ordonez, Girish Kulkarni, and Tamara L. Berg. In Advances in Neural Information Processing Systems 24 (NIPS 2011), pages 1143–1151.

  11. [11] High-Resolution Image Synthesis with Latent Diffusion Models
      Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. In CVPR 2022, pages 10674–10685. IEEE, 2022.

  12. [12] Progressive Distillation for Fast Sampling of Diffusion Models
      Tim Salimans and Jonathan Ho. In ICLR 2022. OpenReview.net, 2022.

  13. [13] Objects365: A Large-Scale, High-Quality Dataset for Object Detection
      Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang Yu, Xiangyu Zhang, Jing Li, and Jian Sun. In ICCV 2019, pages 8429–8438. IEEE, 2019.

  14. [14] Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset for Automatic Image Captioning
      Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. In ACL 2018, Volume 1: Long Papers, pages 2556–2565. Association for Computational Linguistics, 2018.

  15. [15] Score-Based Generative Modeling through Stochastic Differential Equations
      Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. In ICLR 2021.

  16. [16] OminiControl: Minimal and Universal Control for Diffusion Transformer
      Zhenxiong Tan, Songhua Liu, Xingyi Yang, Qiaochu Xue, and Xinchao Wang. In ICCV 2025, pages 14940–14950.

  17. [17] InstanceDiffusion: Instance-Level Control for Image Generation
      Xudong Wang, Trevor Darrell, Sai Saketh Rambhatla, Rohit Girdhar, and Ishan Misra. In CVPR 2024, pages 6232–…

  18. [18] Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
      Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, and Lu Yuan. In CVPR 2024, pages 4818–4829. IEEE, 2024.

  19. [19] Adding Conditional Control to Text-to-Image Diffusion Models
      Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. In ICCV 2023, pages 3813–…

  20. [20] Scene Parsing through ADE20K Dataset
      Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.