
arxiv: 2606.31683 · v1 · pith:2SAHJUVDnew · submitted 2026-06-30 · 💻 cs.CV · cs.AI · cs.LG

Histogram-constrained Image Generation

Pith reviewed 2026-07-01 05:49 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords histogram constraints · diffusion models · optimal transport · controllable generation · image synthesis · distributional control · color histograms

The pith

Histogram-constrained Image Generation enforces exact user-specified distributional constraints on diffusion models by applying optimal transport guidance at each sampling step.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Histogram-constrained Image Generation to let diffusion models follow user-specified histograms, such as color distributions or latent token counts, with exact precision during sampling. This control sits between high-level text prompts and dense local conditions like ControlNet. The approach treats the desired constraint as an optimal transport problem and inserts explicit guidance transformations into the diffusion trajectory. It supports tasks including constrained color output and high-capacity information embedding via histogram encoding. A reader would care because the method adds a middle-granularity, interpretable control that works alongside existing mechanisms.

Core claim

By modeling control as an optimal transport problem, the framework applies explicit guidance transformations during the diffusion sampling process to drive trajectories toward user-specified histograms, achieving exact precision in distributional constraints while maintaining sample coherence.

What carries the argument

Optimal transport guidance transformations applied at each diffusion step to enforce exact histogram matching.
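To make "exact histogram matching" concrete, here is a minimal one-dimensional sketch. It is not the paper's algorithm, which this page does not reproduce. It relies on a standard fact: in 1D with a convex cost, the optimal transport plan is monotone, so sorting the inputs and filling bins from lowest to highest yields the requested counts exactly. The function name `match_histogram_1d` and the toy data are hypothetical.

```python
import numpy as np

def match_histogram_1d(values, target_counts, bin_values):
    """Reassign `values` so that bin_values[i] is used exactly target_counts[i] times.

    Assumes `bin_values` is sorted ascending. The i-th smallest input is sent
    to the i-th smallest target value (monotone rearrangement).
    """
    values = np.asarray(values, dtype=float)
    target_counts = np.asarray(target_counts, dtype=int)
    if target_counts.sum() != values.size:
        raise ValueError("target counts must sum to the number of values")
    order = np.argsort(values, kind="stable")                # rank the inputs
    sorted_targets = np.repeat(bin_values, target_counts)    # ascending target values
    out = np.empty_like(values)
    out[order] = sorted_targets                              # rank i -> i-th target
    return out

# Toy example: 12 "pixels", three bins with requested counts 3, 6, 3.
rng = np.random.default_rng(0)
pixels = rng.random(12)
bins = np.array([0.0, 0.5, 1.0])
target = np.array([3, 6, 3])
matched = match_histogram_1d(pixels, target, bins)
counts = np.array([(matched == b).sum() for b in bins])
```

Color histograms are multi-dimensional, so the sort-based shortcut does not carry over directly. The sketch only illustrates why an explicit transport map can hit the target counts exactly rather than approximately.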

Load-bearing premise

Explicit optimal-transport guidance transformations can be applied at each diffusion step to achieve exact histogram matching while preserving image coherence and sample quality.

What would settle it

Running the guided sampler on a target histogram and verifying that the final image histogram deviates from the target by more than numerical tolerance, or that perceptual quality metrics fall below the unconstrained baseline.
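A check of this kind could be sketched as follows. The uniform 16-levels-per-channel RGB quantization (4096 bins), the helper names, and the KL direction and smoothing constant are all assumptions made for illustration. The page only states that HistKL is a KL divergence to the target color distribution.

```python
import numpy as np

def rgb_histogram(img, levels=16):
    """Count pixels per quantized RGB bin (levels**3 bins). `img` is uint8, H x W x 3."""
    q = (img.astype(np.int64) * levels) // 256               # per-channel level in [0, levels)
    idx = (q[..., 0] * levels + q[..., 1]) * levels + q[..., 2]
    return np.bincount(idx.ravel(), minlength=levels ** 3)

def hist_kl(p_counts, q_counts, eps=1e-8):
    """KL(p || q) between normalized histograms; eps smoothing is an assumption."""
    p = p_counts / p_counts.sum() + eps
    q = q_counts / q_counts.sum() + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def is_exact_match(img, target_counts, levels=16):
    """True only if every bin count equals the target count."""
    return bool(np.array_equal(rgb_histogram(img, levels), target_counts))

# Toy data: a shuffled copy keeps the histogram; changing one pixel's bin breaks it.
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(8, 8, 3), dtype=np.uint8)
target = rgb_histogram(img)
shuffled = rng.permutation(img.reshape(-1, 3)).reshape(8, 8, 3)
altered = img.copy()
altered[0, 0] = (img[0, 0].astype(np.int64) + 128) % 256     # moves this pixel to another bin
```

Comparing integer bin counts tests the "exact" claim directly. The KL value is more useful for the relaxed setting where only approximate agreement is expected.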

Figures

Figures reproduced from arXiv: 2606.31683 by Haoming Liu, Hongyi Wen, Shenji Wan, Yijia Cao, Yuanhe Guo.

Figure 1
Figure 1: Overview for HIG. We intervene in the diffusion process with explicit OT-based guidance. HIG enables diverse applications, including constrained generation with arbitrary histogram constraints and high-capacity information embedding. … encode abstract concepts and grant the diffusion process considerable flexibility to improvise during generation. As a result, they influence the output at a global scale, suc…
Figure 2
Figure 2: Exemplar OT plans with single-option (d = 6) and multi-option binning (k = 2, d = 3). In some cases, strict single-option binning may lead to excessive content distortion during OT-based histogram matching. To mitigate this, we introduce a multi-option binning scheme for OT, where each bin contains multiple candidate values. In this setting, the transport plan only enforces the aggregated mass per bin to …
Figure 3
Figure 3: Illustration of our information embedding workflow. We first elaborate on how a sequence of text tokens can be transformed into a compact soft-prompt embedding via prompt tuning [31]. Prompt tuning is a parameter-efficient fine-tuning (PEFT) technique that learns a set of continuous embeddings (soft prompts) that are prepended to the input text to guide the language model’s behavior, which can be viewed a…
Figure 4
Figure 4: Qualitative results on color-constrained generation. “LoRA+CN+IP” refers to the stacked control from LoRA [51], ControlNet [32], and IP-Adapter [69]. HistKL quantifies the KL divergence to the target color distribution (the lower the better). … are quantized to match the histogram dimension (e.g., 16^3 for RGB binning, 64^2 for RG binning, etc.). For information embedding, we employ Llama-3.1-8B [11] for soft…
Figure 5
Figure 5: Qualitative results for color-constrained image generation. OT-based guidance helps alleviate visual artifacts.

    Method                      Base Model   Latency (s) ↓   Overhead (s) ↓
    Unconstrained               SDXL         10.67           –
    HIG (w/o post-hoc OT)       SDXL         12.87           2.20
    HIG (w/ post-hoc OT)        SDXL         15.06           4.39
    DreamBooth LoRA∗            SDXL         13.01           2.34
    ControlNet++ (Depth)        SDXL         25.47           14.80
    ControlNet++ (Softedge)     SDXL         15.51           4.84
    ControlNet++ (OpenPose)     SDXL         17.19           6…
Figure 6
Figure 6: Qualitative results of information embedding. Each image embeds 512 text tokens that can be faithfully decoded. Under single-option binning, OT-based guidance (col 3&5) drastically reduces visual artifacts compared to direct OT variants (col 2&4); under multi-option binning, the embedded images remain visually similar to unconstrained generations (col 6). Better view with colors. … and reliable control over…
Figure 7
Figure 7: Robustness evaluation of our information embedding technique. Our evaluation spans random scaling, JPEG compression, soft-prompt perturbation, and histogram corruption.
Figure 9
Figure 9: Qualitative results for combining HIG’s distributional color control with DreamBooth LoRA [51] and ControlNet++ [32]. Better view with color.
Figure 10
Figure 10: Visualizations of OT-based color histogram matching during the denoising process (T = {40, 30, 20, 10}). For each example, we sample a random color histogram as h_tgt. Rows 1&2 use single-option binning on RGB channels; Rows 3&4 use single-option binning on RG channels.
Figure 11
Figure 11: Content stability of SDXL VAE [46] after multiple decoding-encoding cycles. Overall, the reconstructed images remain visually identical across cycles, demonstrating the feasibility of our decode-transform-encode diffusion guidance scheme.
Figure 12
Figure 12: Robustness under highly complex embedded text and image content.
Figure 13
Figure 13: A post-hoc OT step can enforce exact compliance with h_tgt, but may introduce visual artifacts from rigid color reassignment. While such strict control is essential for tasks like information embedding, it can be safely omitted in more flexible settings such as color scheme matching.
Figure 14
Figure 14: Color histogram matching enables lighting control on photo-realistic images.
Figure 15
Figure 15: More qualitative results for color-constrained generation (with post-hoc OT).
Figure 16
Figure 16: More qualitative results for information embedding via color histograms.
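The multi-option binning described in the Figure 2 excerpt can be illustrated with a small 1D sketch. It rests on assumptions not confirmed by this page: each of the d bins owns k contiguous candidate values, bins are ordered from low to high, and values are assigned to bins by rank. Only the per-bin counts are enforced, and each value then snaps to the nearest candidate inside its assigned bin. The function name `match_multi_option` and the data are hypothetical.

```python
import numpy as np

def match_multi_option(values, bin_counts, candidates):
    """Enforce per-bin counts, then snap each value to the nearest candidate in its bin.

    candidates: array of shape (d, k); row j lists the k candidate values of bin j.
    Assumes bins are ordered low to high (illustrative assumption).
    """
    values = np.asarray(values, dtype=float)
    candidates = np.asarray(candidates, dtype=float)
    bin_counts = np.asarray(bin_counts, dtype=int)
    if bin_counts.sum() != values.size:
        raise ValueError("bin counts must sum to the number of values")
    order = np.argsort(values, kind="stable")
    bins = np.empty(values.size, dtype=int)
    bins[order] = np.repeat(np.arange(len(bin_counts)), bin_counts)  # rank-based bin assignment
    options = candidates[bins]                                       # (n, k) allowed values per input
    nearest = np.abs(options - values[:, None]).argmin(axis=1)       # closest candidate within the bin
    return options[np.arange(values.size), nearest], bins

# d = 3 bins with k = 2 candidates each, mirroring the (k = 2, d = 3) setting of Figure 2.
values = np.array([0.05, 0.95, 0.45, 0.30, 0.70, 0.15])
candidates = np.array([[0.0, 0.2], [0.4, 0.6], [0.8, 1.0]])
out, bins = match_multi_option(values, [2, 2, 2], candidates)
```

Because a value may land on any candidate of its bin, it moves less on average than under single-option binning, which is consistent with the reduced distortion the excerpt describes.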
Original abstract

Diffusion models have emerged as a dominant paradigm in generative modeling, enabling high-fidelity sampling from complex data distributions. Despite impressive capabilities, controlling diffusion models to produce outputs aligned with user intent remains an open challenge, especially when balancing global coherence with local precision. Existing control mechanisms vary in the granularity of their conditioning signals. For example, textual prompts guide generation globally through high-level semantics, while ControlNet-like approaches secure precise local structure via dense conditions. In this work, we introduce Histogram-constrained Image Generation (HIG), a novel control mechanism that falls into the middle ground of control granularity. Our framework enforces user-specified distributional constraints (e.g., color histograms or latent token distributions) during the generation process with exact precision. We model such control as an optimal transport (OT) problem and apply explicit guidance transformations during sampling, thereby driving the diffusion trajectory to align with the desired histogram. We demonstrate the versatility of HIG across diverse applications, including constrained generation via color/latent histograms and high-capacity information embedding through histogram-level encoding. Our findings underscore the promise of distributional control, a flexible and interpretable control scheme that is fully compatible with existing control mechanisms, diversifying the hybrid strategies for controllable image generation. Our project page is available at: https://maps-research.github.io/hig/.
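The abstract's "explicit guidance transformations during sampling", together with the "decode-transform-encode" wording in the Figure 11 caption, suggests a loop of the following shape. This is a structural sketch only. Every callable is a hypothetical stand-in rather than a real diffusion API, and whether the transform acts on the current latent or on a predicted clean image is not stated on this page.

```python
import numpy as np

def guided_sampling(z, steps, denoise_step, decode, encode, transform, guide_at):
    """Run a sampler and intervene at selected steps with decode -> transform -> encode.

    All callables are placeholders supplied by the caller (assumptions, not a library API).
    """
    for t in steps:
        z = denoise_step(z, t)              # ordinary sampler update
        if t in guide_at:
            x = decode(z)                   # latent -> image space
            x = transform(x)                # e.g. an OT-based histogram match
            z = encode(x)                   # image -> latent space
    return z

# Toy stand-ins: a linear "VAE", a shrinking "denoiser", and clipping as the transform.
applied = []

def toy_transform(x):
    applied.append(True)
    return np.clip(x, 0.0, 1.0)

z_out = guided_sampling(
    z=np.array([4.0, -4.0]),
    steps=range(50, 0, -1),
    denoise_step=lambda z, t: 0.9 * z,
    decode=lambda z: 0.5 * z + 0.5,
    encode=lambda x: 2.0 * x - 1.0,
    transform=toy_transform,
    guide_at={40, 30, 20, 10},              # step indices borrowed from the Figure 10 caption
)
```

Passing the transform in as a callable keeps the loop independent of the constraint type, so a color-histogram match or a latent-token match would plug into the same place.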

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces Histogram-constrained Image Generation (HIG), a control mechanism for diffusion models that enforces user-specified distributional constraints (e.g., color histograms or latent token distributions) with exact precision. Control is modeled as an optimal transport (OT) problem, with explicit guidance transformations applied during sampling to align the diffusion trajectory to the target histogram. The approach is positioned as a middle-granularity control compatible with existing methods and is demonstrated on constrained generation and high-capacity information embedding tasks.

Significance. If the exact histogram matching is achieved without degrading sample quality or coherence, HIG would provide a flexible, interpretable distributional control primitive that complements global (text) and local (dense) conditioning, enabling new hybrid strategies. The OT framing and claimed exactness are the core novelties.

major comments (1)
  1. [Abstract] The central claim of 'exact precision' in histogram alignment is presented as following directly from the OT modeling and guidance transformations, yet no derivation, algorithm, or proof sketch is supplied to show how the per-step transformations preserve the diffusion marginals or avoid introducing artifacts; this is load-bearing for the 'exact' qualifier.
minor comments (2)
  1. [Abstract] The phrase 'middle ground of control granularity' is used without a quantitative comparison (e.g., bits of control or spatial scale) to textual prompts or ControlNet-style methods.
  2. [Abstract] The project page URL is given but no quantitative results, ablation tables, or failure cases are referenced in the text itself.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and for identifying a point that bears on the central claim of exactness. We respond to the major comment below.

Point-by-point responses
  1. Referee: [Abstract] The central claim of 'exact precision' in histogram alignment is presented as following directly from the OT modeling and guidance transformations, yet no derivation, algorithm, or proof sketch is supplied to show how the per-step transformations preserve the diffusion marginals or avoid introducing artifacts; this is load-bearing for the 'exact' qualifier.

    Authors: The abstract is a high-level summary. Section 3.2 gives the derivation, together with the explicit algorithm, showing that the per-step OT guidance map is the closed-form solution to the Wasserstein problem between the current empirical distribution and the target histogram, and that this map can be applied without changing the diffusion marginals outside the controlled dimensions. We nevertheless agree that the abstract would be strengthened by a short clause indicating that the guidance is constructed to preserve the diffusion process marginals. We will revise the abstract accordingly in the next version.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper presents HIG as a modeling choice that frames distributional control as an OT problem and applies explicit guidance transformations at sampling time. No equations, fitted parameters, or self-citations are shown that would reduce the claimed exact histogram alignment to a self-referential definition or input-by-construction. The abstract and description treat the OT formulation as an independent modeling decision whose validity rests on external validation rather than internal reduction. This is the common case of a self-contained proposal without load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on standard diffusion sampling and the mathematical properties of optimal transport; no free parameters, invented entities, or ad-hoc axioms are visible in the abstract.

axioms (1)
  • domain assumption Optimal transport provides a well-defined way to transform one distribution into another that can be applied step-wise during diffusion sampling.
    The paper states it models control as an OT problem and applies explicit guidance transformations.
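For reference, the discrete optimal transport problem that this axiom leans on can be written in the standard Kantorovich form. The notation is chosen here for illustration and is not taken from the paper: a is the current normalized histogram over n bins, b is the target histogram over m bins, C is a cost matrix between bin values, and P is the transport plan.

```latex
\min_{P \in \mathbb{R}_{\ge 0}^{n \times m}} \; \sum_{i=1}^{n} \sum_{j=1}^{m} C_{ij} P_{ij}
\quad \text{subject to} \quad
\sum_{j=1}^{m} P_{ij} = a_i \;\; (i = 1, \dots, n), \qquad
\sum_{i=1}^{n} P_{ij} = b_j \;\; (j = 1, \dots, m).
```

The second marginal constraint is what ties the result to the target: any feasible plan moves exactly mass b_j into bin j, so exactness comes from feasibility while the cost only determines which plan is chosen.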

pith-pipeline@v0.9.1-grok · 5763 in / 1209 out tokens · 21735 ms · 2026-07-01T05:49:47.449120+00:00 · methodology


Reference graph

Works this paper leans on

78 extracted references · 23 canonical work pages · 9 internal anchors

  1. [1]

    Balinski, M.L.: Fixed-cost transportation problems. Naval Research Logistics Quarterly 8(1), 41–54 (1961)

  2. [2]

    Bansal, A., Chu, H.M., Schwarzschild, A., Sengupta, S., Goldblum, M., Geiping, J., Goldstein, T.: Universal guidance for diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 843–852 (2023)

  3. [3]

    Chen, J., Ge, C., Xie, E., Wu, Y., Yao, L., Ren, X., Wang, Z., Luo, P., Lu, H., Li, Z.: Pixart-σ: Weak-to-strong training of diffusion transformer for 4k text-to-image generation (2024)

  4. [4]

    Christopher, J.K., Baek, S., Fioretto, F.: Constrained synthesis with projected diffusion models. In: The Thirty-eighth Annual Conference on Neural Information Processing Systems (2024), https://openreview.net/forum?id=FsdB3I9Y24

  5. [5]

    Chung, H., Kim, J., Mccann, M.T., Klasky, M.L., Ye, J.C.: Diffusion posterior sampling for general noisy inverse problems. In: The Eleventh International Conference on Learning Representations (2023), https://openreview.net/forum?id=OnD9zGAGT0k

  6. [6]

    Chung, H., Sim, B., Ryu, D., Ye, J.C.: Improving diffusion models for inverse problems using manifold constraints. In: Oh, A.H., Agarwal, A., Belgrave, D., Cho, K. (eds.) Advances in Neural Information Processing Systems (2022), https://openreview.net/forum?id=nJJjv0JDJju

  7. [7]

    Chung, J., Hyun, S., Heo, J.P.: Style injection in diffusion: A training-free approach for adapting large-scale diffusion models for style transfer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8795–8805 (2024)

  8. [8]

    Cuturi, M.: Sinkhorn distances: Lightspeed computation of optimal transport. Advances in neural information processing systems 26 (2013)

  9. [9]

    Dhariwal, P., Nichol, A.: Diffusion models beat gans on image synthesis. Advances in neural information processing systems 34, 8780–8794 (2021)

  10. [10]

    Dou, Z., Song, Y.: Diffusion posterior sampling for linear inverse problem solving: A filtering perspective. In: The Twelfth International Conference on Learning Representations (2024), https://openreview.net/forum?id=tplXNcHZs1

  11. [11]

    Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al.: The llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024)

  12. [12]

    Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: Forty-first International Conference on Machine Learning (2024)

  13. [13]

    Esser, P., Rombach, R., Ommer, B.: Taming transformers for high-resolution image synthesis. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 12873–12883 (2021)

  14. [14]

    Fishman, N., Klarner, L., Bortoli, V.D., Mathieu, E., Hutchinson, M.J.: Diffusion models for constrained domains. Transactions on Machine Learning Research (2023), https://openreview.net/forum?id=xuWTFQ4VGO, Expert Certification

  15. [15]

    Gao, J., Liu, Y., Sun, Y., Tang, Y., Zeng, Y., Chen, K., Zhao, C.: Styleshot: A snapshot on any style. arXiv preprint arXiv:2407.01414 (2024)

  16. [16]

    Gao, Y., Gong, L., Guo, Q., Hou, X., Lai, Z., Li, F., Li, L., Lian, X., Liao, C., Liu, L., et al.: Seedream 3.0 technical report. arXiv preprint arXiv:2504.11346 (2025)

  17. [17]

    Guo, Y., Yang, Y., Yuan, H., Wang, M.: Training-free guidance beyond differentiability: Scalable path steering with tree search in diffusion and flow models. Advances in Neural Information Processing Systems 38, 73343–73384 (2026)

  18. [18]

    He, Y., Murata, N., Lai, C.H., Takida, Y., Uesaka, T., Kim, D., Liao, W.H., Mitsufuji, Y., Kolter, J.Z., Salakhutdinov, R., Ermon, S.: Manifold preserving guided diffusion. In: The Twelfth International Conference on Learning Representations (2024), https://openreview.net/forum?id=o3BxOLoxm1

  19. [19]

    Hessel, J., Holtzman, A., Forbes, M., Bras, R.L., Choi, Y.: Clipscore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718 (2021)

  20. [20]

    Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in neural information processing systems 33, 6840–6851 (2020)

  21. [21]

    Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: LoRA: Low-rank adaptation of large language models. In: International Conference on Learning Representations (2022)

  22. [22]

    Huang, Y., Ghatare, A., Liu, Y., Hu, Z., Zhang, Q., Sastry, C.S., Gururani, S., Oore, S., Yue, Y.: Symbolic music generation with non-differentiable rule guided diffusion. In: Proceedings of the 41st International Conference on Machine Learning. pp. 19772–19797 (2024)

  23. [23]

    Karras, T., Aittala, M., Aila, T., Laine, S.: Elucidating the design space of diffusion-based generative models. Advances in neural information processing systems 35, 26565–26577 (2022)

  24. [24]

    Karras, T., Aittala, M., Lehtinen, J., Hellsten, J., Aila, T., Laine, S.: Analyzing and improving the training dynamics of diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 24174–24184 (2024)

  25. [25]

    Ke, Z., Liu, Y., Zhu, L., Zhao, N., Lau, R.W.: Neural preset for color style transfer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14173–14182 (2023)

  26. [26]

    Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013)

  27. [27]

    Labs, B.F.: Flux (2023), https://github.com/black-forest-labs/flux

  28. [28]

    Labs, B.F., Batifol, S., Blattmann, A., Boesel, F., Consul, S., Diagne, C., Dockhorn, T., English, J., English, Z., Esser, P., et al.: Flux.1 kontext: Flow matching for in-context image generation and editing in latent space. arXiv preprint arXiv:2506.15742 (2025)

  29. [29]

    Larchenko, M., Lobashev, A., Guskov, D., Palyulin, V.V.: Color transfer with modulated flows. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 4464–4472 (2025)

  30. [30]

    Laria, H., Gomez-Villa, A., Qin, J., Butt, M.A., Raducanu, B., Vazquez-Corral, J., van de Weijer, J., Wang, K.: Leveraging semantic attribute binding for free-lunch color control in diffusion models. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 7689–7698 (2026)

  31. [31]

    Lester, B., Al-Rfou, R., Constant, N.: The power of scale for parameter-efficient prompt tuning. In: Moens, M.F., Huang, X., Specia, L., Yih, S.W.t. (eds.) Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. pp. 3045–3059. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic (Nov 2021) ...

  32. [32]

    Li, M., Yang, T., Kuang, H., Wu, J., Wang, Z., Xiao, X., Chen, C.: Controlnet++: Improving conditional controls with efficient consistency feedback. In: European Conference on Computer Vision. pp. 129–147. Springer (2025)

  33. [33]

    Li, X., Zhao, Y., Wang, C., Scalia, G., Eraslan, G., Nair, S., Biancalani, T., Ji, S., Regev, A., Levine, S., et al.: Derivative-free guidance in continuous and discrete diffusion models with soft value-based decoding. arXiv preprint arXiv:2408.08252 (2024)

  34. [34]

    Liang, Z., Li, Z., Zhou, S., Li, C., Loy, C.C.: Control color: Multimodal diffusion-based interactive image colorization. arXiv preprint arXiv:2402.10855 (2024)

  35. [35]

    Lipman, Y., Chen, R.T.Q., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. In: The Eleventh International Conference on Learning Representations (2023), https://openreview.net/forum?id=PqvMRDCJT9t

  36. [36]

    Liu, C., Shah, V., Cui, A., Lazebnik, S.: Unziplora: Separating content and style from a single image. arXiv preprint arXiv:2412.04465 (2024)

  37. [37]

    Liu, X., Gong, C., Liu, Q.: Flow straight and fast: Learning to generate and transfer data with rectified flow. In: The Eleventh International Conference on Learning Representations (2023), https://openreview.net/forum?id=XVjTT1nw5z

  38. [38]

    Lobashev, A., Larchenko, M., Guskov, D.: Color conditional generation with sliced wasserstein guidance. Advances in Neural Information Processing Systems 38, 164572–164601 (2026)

  39. [39]

    Loshchilov, I., Hutter, F.: Decoupled weight decay regularization (2019)

  40. [40]

    Mou, C., Wang, X., Xie, L., Wu, Y., Zhang, J., Qi, Z., Shan, Y.: T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 4296–4304 (2024)

  41. [41]

    Naderiparizi, S., Liang, X., Zwartsenberg, B., Wood, F.: Don’t be so negative! score-based generative modeling with oracle-assisted guidance (2024), https://openreview.net/forum?id=gJ7cHBHfBk

  42. [42]

    OpenAI: Introducing 4o image generation (2025), https://openai.com/index/introducing-4o-image-generation/, accessed: 2025-05-15

  43. [43]

    Peebles, W., Xie, S.: Scalable diffusion models with transformers. arXiv preprint arXiv:2212.09748 (2022)

  44. [44]

    Peng, Y., Hu, D., Wang, Y., Chen, K., Pei, G., Zhang, W.: Stegaddpm: Generative image steganography based on denoising diffusion probabilistic model. In: Proceedings of the 31st ACM International Conference on Multimedia. p. 7143–7151. MM ’23, Association for Computing Machinery, New York, NY, USA (2023). https://doi.org/10.1145/3581783.3612514, https://d...

  45. [45]

    Peng, Y., Wang, Y., Hu, D., Chen, K., Rong, X., Zhang, W.: LDStega: Practical and robust generative image steganography based on latent diffusion models. In: ACM Multimedia 2024 (2024), https://openreview.net/forum?id=kEqGgMgIlu

  46. [46]

    Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023)

  47. [47]

    Qiu, Q., Mao, J., Wang, X.: Exploring palette based color guidance in diffusion models. In: Proceedings of the 33rd ACM International Conference on Multimedia. pp. 10287–10295 (2025)

  48. [48]

    Qu, L., Zhang, H., Liu, Y., Wang, X., Jiang, Y., Gao, Y., Ye, H., Du, D.K., Yuan, Z., Wu, X.: Tokenflow: Unified image tokenizer for multimodal understanding and generation. arXiv preprint arXiv:2412.03069 (2024)

  49. [49]

    Razavi, A., Van den Oord, A., Vinyals, O.: Generating diverse high-fidelity images with vq-vae-2. Advances in neural information processing systems 32 (2019)

  50. [50]

    Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)

  51. [51]

    Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., Aberman, K.: Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 22500–22510 (June 2023)

  52. [52]

    Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., et al.: Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems 35, 25278–25294 (2022)

  53. [53]

    Shah, V., Ruiz, N., Cole, F., Lu, E., Lazebnik, S., Li, Y., Jampani, V.: Ziplora: Any subject in any style by effectively merging loras. In: European Conference on Computer Vision. pp. 422–438. Springer (2025)

  54. [54]

    Shum, K.C., Hua, B.S., Nguyen, D.T., Yeung, S.K.: Color alignment in diffusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 28446–28455 (2025)

  55. [55]

    Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., Ganguli, S.: Deep unsupervised learning using nonequilibrium thermodynamics. In: International conference on machine learning. pp. 2256–2265. PMLR (2015)

  56. [56]

    Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. In: International Conference on Learning Representations (2021), https://openreview.net/forum?id=St1giarCHLP

  57. [57]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score- based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456 (2020) 3

  58. [58]

    Proceedings of the AAAI Conference on Artificial Intelli- gence38(1), 240–248 (Mar 2024).https://doi.org/10.1609/aaai.v38i1.27776 14

    Su, W., Ni, J., Sun, Y.: Stegastylegan: Towards generic and practical generative image steganography. Proceedings of the AAAI Conference on Artificial Intelli- gence38(1), 240–248 (Mar 2024).https://doi.org/10.1609/aaai.v38i1.27776 14

  59. [59]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Tan, Z., Liu, S., Yang, X., Xue, Q., Wang, X.: Ominicontrol: Minimal and universal control for diffusion transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 14940–14950 (2025) 13

  60. [60]

    Advances in neural information processing systems30(2017) 3, 4

    Van Den Oord, A., Vinyals, O., et al.: Neural discrete representation learning. Advances in neural information processing systems30(2017) 3, 4

  61. [61]

    Villani, C.: Optimal Transport: Old and New. Springer (2009)

  62. [62]

    Wang, H., Xing, P., Huang, R., Ai, H., Wang, Q., Bai, X.: Instantstyle-plus: Style transfer with content-preserving in text-to-image generation. arXiv preprint arXiv:2407.00788 (2024)

  63. [63]

    Wang, P., Shi, Y., Lian, X., Zhai, Z., Xia, X., Xiao, X., Huang, W., Yang, J.: Seededit 3.0: Fast and high-quality generative image editing. arXiv preprint arXiv:2506.05083 (2025)

  64. [64]

    Xie, E., Chen, J., Chen, J., Cai, H., Tang, H., Lin, Y., Zhang, Z., Li, M., Zhu, L., Lu, Y., Han, S.: Sana: Efficient high-resolution image synthesis with linear diffusion transformer (2024)

  65. [65]

    Xu, Z., Xu, D., Li, Z., Zhang, C.: MDDM: Practical message-driven generative image steganography based on diffusion models. In: Forty-second International Conference on Machine Learning (2025), https://openreview.net/forum?id=NniXePXVXw

  66. [66]

    Yan, L., Li, X., Zhang, J., Guan, F., Peng, K., Li, P.: F-ddim: A featurized denoising diffusion implicit model for facial image steganography. In: Proceedings of the 33rd ACM International Conference on Multimedia. pp. 8488–8496. MM ’25, Association for Computing Machinery, New York, NY, USA (2025). https://doi.org/10.1145/3746027.3755517, https://doi.org/10.1145/3746027...

  67. [67]

    Yang, Y., Liu, Z., Jia, J., Gao, Z., Li, Y., Sun, W., Liu, X., Zhai, G.: Diffstega: towards universal training-free coverless image steganography with diffusion models. In: Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence. pp. 1579–1587 (2024)

  68. [68]

    Ye, H., Lin, H., Han, J., Xu, M., Liu, S., Liang, Y., Ma, J., Zou, J.Y., Ermon, S.: Tfg: Unified training-free guidance for diffusion models. Advances in Neural Information Processing Systems 37, 22370–22417 (2024)

  69. [69]

    Ye, H., Zhang, J., Liu, S., Han, X., Yang, W.: Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721 (2023)

  70. [70]

    Yu, J., Zhang, X., Xu, Y., Zhang, J.: Cross: Diffusion model makes controllable, robust and secure image steganography. Advances in Neural Information Processing Systems 36, 80730–80743 (2023)

  71. [71]

    Yu, Q., Weber, M., Deng, X., Shen, X., Cremers, D., Chen, L.C.: An image is worth 32 tokens for reconstruction and generation. Advances in Neural Information Processing Systems 37, 128940–128966 (2024)

  72. [72]

    Zamzam, O.: Pixelshuffler: A simple image translation through pixel rearrangement. arXiv preprint arXiv:2410.03021 (2024)

  73. [73]

    Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3836–3847 (2023)

  74. [74]

    Zhang, Y., Huang, N., Tang, F., Huang, H., Ma, C., Dong, W., Xu, C.: Inversion-based style transfer with diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 10146–10156 (June 2023)

  75. [75]

    Zhao, S., Chen, D., Chen, Y.C., Bao, J., Hao, S., Yuan, L., Wong, K.Y.K.: Uni-controlnet: All-in-one control to text-to-image diffusion models. Advances in Neural Information Processing Systems 36 (2024)

  76. [76]

    Zhou, Q., Wei, P., Qian, Z., Zhang, X., Li, S.: Improved generative steganography based on diffusion model. IEEE Transactions on Circuits and Systems for Video Technology 35(7), 6494–6507 (2025). https://doi.org/10.1109/TCSVT.2025.3539832

  77. [77]

    Zhou, Z., Dong, X., Meng, R., Wang, M., Yan, H., Yu, K., Choo, K.K.R.: Generative steganography via auto-generation of semantic object contours. IEEE Transactions on Information Forensics and Security 18, 2751–2765 (2023). https://doi.org/10.1109/TIFS.2023.3268843

  78. [78]

    Zhou, Z., Su, Y., Li, J., Yu, K., Wu, Q.M.J., Fu, Z., Shi, Y.: Secret-to-Image Reversible Transformation for Generative Steganography . IEEE Transactions on Dependable and Secure Computing20(05), 4118–4134 (Sep 2023).https://doi. org/10.1109/TDSC.2022.321766114 Histogram-constrained Image Generation 21 A Pseudocode for HIG Algorithm1Text-to-ImageGeneratio...