Histogram-constrained Image Generation

Haoming Liu; Hongyi Wen; Shenji Wan; Yijia Cao; Yuanhe Guo

arXiv: 2606.31683 · v1 · pith:2SAHJUVD · submitted 2026-06-30 · 💻 cs.CV · cs.AI · cs.LG


Haoming Liu, Yuanhe Guo, Yijia Cao, Shenji Wan, Hongyi Wen

Pith reviewed 2026-07-01 05:49 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG

keywords histogram constraints · diffusion models · optimal transport · controllable generation · image synthesis · distributional control · color histograms


The pith

Histogram-constrained Image Generation enforces exact user-specified distributional constraints on diffusion models by applying optimal transport guidance at each sampling step.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Histogram-constrained Image Generation to let diffusion models follow user-specified histograms, such as color distributions or latent token counts, with exact precision during sampling. This control sits between high-level text prompts and dense local conditions like ControlNet. The approach treats the desired constraint as an optimal transport problem and inserts explicit guidance transformations into the diffusion trajectory. It supports tasks including constrained color output and high-capacity information embedding via histogram encoding. A reader would care because the method adds a middle-granularity, interpretable control that works alongside existing mechanisms.
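To make "high-capacity" concrete, one can count how many distinct histograms an image admits: placing N pixels into B bins gives C(N + B - 1, B - 1) possible histograms (stars and bars), so a histogram can carry at most about log2 of that many bits. The sketch below is a hypothetical illustration, not the paper's encoder (which, per Figure 3, first compresses text into a soft-prompt embedding). The names `capacity`, `histogram_to_int`, and `int_to_histogram` are invented here; they rank and unrank histograms with the combinatorial number system so that an integer payload maps to exactly one histogram and back.

```python
from math import comb
from typing import List


def capacity(num_items: int, num_bins: int) -> int:
    """Number of distinct histograms that place num_items items into num_bins bins."""
    return comb(num_items + num_bins - 1, num_bins - 1)


def histogram_to_int(counts: List[int]) -> int:
    """Rank a histogram: stars-and-bars bar positions -> combinatorial number system."""
    positions = []
    pos = -1
    for c in counts[:-1]:
        pos += c + 1  # the bar sits right after this bin's c stars
        positions.append(pos)
    # Bar positions are strictly increasing, so sum of C(p_i, i) is a unique rank.
    return sum(comb(p, i + 1) for i, p in enumerate(positions))


def int_to_histogram(m: int, num_items: int, num_bins: int) -> List[int]:
    """Inverse of histogram_to_int for histograms of num_items items in num_bins bins."""
    n, k = num_items + num_bins - 1, num_bins - 1
    if not 0 <= m < comb(n, k):
        raise ValueError("payload does not fit in this histogram shape")
    positions = []
    c = n - 1
    for i in range(k, 0, -1):
        # Greedy step: largest c with C(c, i) <= m (C(c, i) == 0 whenever c < i).
        while comb(c, i) > m:
            c -= 1
        positions.append(c)
        m -= comb(c, i)
        c -= 1
    positions.reverse()  # ascending bar positions
    counts, prev = [], -1
    for p in positions:
        counts.append(p - prev - 1)  # stars between consecutive bars
        prev = p
    counts.append(n - 1 - prev)  # stars after the last bar
    return counts
```

For example, `capacity(num_pixels, 4096).bit_length() - 1` gives the number of whole bits a 4096-bin histogram over `num_pixels` pixels could hold under this idealized scheme; an embedding that must survive VAE round trips and the perturbations in Figure 7 is likely to carry less.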

Core claim

By modeling control as an optimal transport problem, the framework applies explicit guidance transformations during the diffusion sampling process to drive trajectories toward user-specified histograms, achieving exact precision in distributional constraints while maintaining sample coherence.

What carries the argument

Optimal transport guidance transformations applied at each diffusion step to enforce exact histogram matching.
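In the simplest case, a single scalar channel with a convex ground cost such as squared distance, the optimal transport map from the current pixel values to a target histogram is the monotone one: the k-th smallest pixel goes to the k-th smallest target slot, and the resulting histogram equals the target exactly. The sketch below covers only that textbook 1D case (classic exact histogram specification); `match_histogram_1d` is a hypothetical name, and HIG itself operates on multi-channel color or latent-token histograms inside the sampler.

```python
from typing import List, Sequence


def match_histogram_1d(values: Sequence[float], bin_centers: Sequence[float],
                       target_counts: Sequence[int]) -> List[float]:
    """Exact histogram specification for one channel via the monotone (sorted) OT map."""
    if sum(target_counts) != len(values):
        raise ValueError("target histogram must account for every pixel")
    if list(bin_centers) != sorted(bin_centers):
        raise ValueError("bin centers must be in ascending order")
    order = sorted(range(len(values)), key=lambda i: values[i])  # pixels ranked by value
    out = [0.0] * len(values)
    start = 0
    for center, count in zip(bin_centers, target_counts):
        # The next `count` pixels in rank order fill this bin, so ranks are preserved.
        for i in order[start:start + count]:
            out[i] = center
        start += count
    return out
```

Applied once at the end of sampling, a rigid reassignment like this resembles the post-hoc OT step described in Figure 13: exact compliance, at the risk of visible artifacts.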

Load-bearing premise

Explicit optimal-transport guidance transformations can be applied at each diffusion step to achieve exact histogram matching while preserving image coherence and sample quality.

What would settle it

Running the guided sampler on a target histogram and finding that the final image histogram deviates from the target by more than numerical tolerance, or that perceptual quality metrics fall below the unconstrained baseline.
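The first test can be operationalized by comparing realized and target bin counts. The sketch below assumes pixels have already been quantized to integer bin indices, and the helper names are hypothetical: a total-variation distance of exactly 0.0 means the constraint holds exactly, and `tol` plays the role of the numerical tolerance.

```python
from collections import Counter
from typing import Sequence


def total_variation(bin_indices: Sequence[int], target_counts: Sequence[int]) -> float:
    """Total-variation distance between realized and target histograms; 0.0 means exact."""
    n = len(bin_indices)
    if n == 0 or sum(target_counts) != n:
        raise ValueError("target counts must sum to the number of pixels")
    if any(not 0 <= b < len(target_counts) for b in bin_indices):
        raise ValueError("bin index out of range")
    realized = Counter(bin_indices)  # missing bins count as 0
    diff = sum(abs(realized[b] - t) for b, t in enumerate(target_counts))
    return 0.5 * diff / n


def satisfies_histogram(bin_indices: Sequence[int], target_counts: Sequence[int],
                        tol: float = 0.0) -> bool:
    """True when the realized histogram is within tol of the target (tol=0.0 demands exactness)."""
    return total_variation(bin_indices, target_counts) <= tol
```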

Figures

Figures reproduced from arXiv: 2606.31683 by Haoming Liu, Hongyi Wen, Shenji Wan, Yijia Cao, Yuanhe Guo.

**Figure 1.** Overview of HIG. We intervene in the diffusion process with explicit OT-based guidance. HIG enables diverse applications, including constrained generation with arbitrary histogram constraints and high-capacity information embedding. encode abstract concepts and grant the diffusion process considerable flexibility to improvise during generation. As a result, they influence the output at a global scale, suc…

**Figure 2.** Exemplar OT plans with single-option (d = 6) and multi-option binning (k = 2, d = 3). In some cases, strict single-option binning may lead to excessive content distortion during OT-based histogram matching. To mitigate this, we introduce a multi-option binning scheme for OT, where each bin contains multiple candidate values. In this setting, the transport plan only enforces the aggregated mass per bin to …
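A toy version of the two binning schemes, under stated assumptions: pixel values are integer levels 0 to d·k - 1, bin b groups the k consecutive levels b·k to b·k + k - 1, pixels are assigned to bins by the sorted (monotone) rule, and each pixel then moves to the nearest candidate level inside its assigned bin. `match_bins` and the example data are hypothetical; the point is only that enforcing aggregated mass per bin (k > 1) can leave pixels untouched where single-option binning (k = 1) must move them.

```python
from typing import List, Sequence, Tuple


def match_bins(levels: Sequence[int], target_counts: Sequence[int],
               k: int) -> Tuple[List[int], int]:
    """Enforce per-bin pixel counts where bin b holds the k levels b*k .. b*k + k - 1.

    Returns the new levels and the total absolute level shift (a simple distortion proxy).
    """
    if sum(target_counts) != len(levels):
        raise ValueError("target histogram must account for every pixel")
    order = sorted(range(len(levels)), key=lambda i: levels[i])  # monotone assignment
    out = list(levels)
    start = 0
    for b, count in enumerate(target_counts):
        lo, hi = b * k, b * k + k - 1  # candidate levels of bin b
        for i in order[start:start + count]:
            # Nearest candidate in the bin: unchanged if the pixel already lies inside it.
            out[i] = min(max(levels[i], lo), hi)
        start += count
    shift = sum(abs(old - new) for old, new in zip(levels, out))
    return out, shift
```

With levels `[0, 0, 2, 3, 5, 5]`, single-option binning (k = 1, d = 6) with one pixel per level shifts two pixels by one level each, while multi-option binning (k = 2, d = 3) with two pixels per bin already satisfies the target and moves nothing.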

**Figure 3.** Illustration of our information embedding workflow. We first elaborate on how a sequence of text tokens can be transformed into a compact soft-prompt embedding via prompt tuning [31]. Prompt tuning is a parameter-efficient fine-tuning (PEFT) technique that learns a set of continuous embeddings (soft prompts) that are prepended to the input text to guide the language model’s behavior, which can be viewed a…

**Figure 4.** Qualitative results on color-constrained generation. “LoRA+CN+IP” refers to the stacked control from LoRA [51], ControlNet [32], and IP-Adapter [69]. HistKL quantifies the KL divergence to the target color distribution (lower is better). are quantized to match the histogram dimension (e.g., 16³ for RGB binning, 64² for RG binning, etc.). For information embedding, we employ Llama-3.1-8B [11] for soft…
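The 16³ dimension corresponds to quantizing each 8-bit RGB channel to 16 levels, giving 4096 bins. A sketch of that binning and of a KL-based score in the spirit of HistKL follows. The caption does not specify the KL direction or any smoothing, so `hist_kl` computes KL(target || generated) with a small `eps` as an assumption, and all function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # 8-bit RGB


def rgb_bin_index(pixel: Pixel, levels: int = 16) -> int:
    """Quantize each 8-bit channel to `levels` values and flatten to one of levels**3 bins."""
    r, g, b = (c * levels // 256 for c in pixel)
    return (r * levels + g) * levels + b


def color_histogram(pixels: Sequence[Pixel], levels: int = 16) -> List[float]:
    """Normalized color histogram with levels**3 bins (16**3 = 4096 by default)."""
    counts = [0] * levels ** 3
    for p in pixels:
        counts[rgb_bin_index(p, levels)] += 1
    return [c / len(pixels) for c in counts]


def hist_kl(target: Sequence[float], generated: Sequence[float], eps: float = 1e-8) -> float:
    """KL(target || generated); the direction and the eps smoothing are assumptions."""
    return sum(t * math.log((t + eps) / (g + eps))
               for t, g in zip(target, generated) if t > 0)
```

An RG variant would quantize two channels to 64 levels each, which also yields 64² = 4096 bins.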

**Figure 5.** Qualitative results for color-constrained image generation. OT-based guidance helps alleviate visual artifacts.

| Method | Base Model | Latency (s) ↓ | Overhead (s) ↓ |
|---|---|---|---|
| Unconstrained | SDXL | 10.67 | – |
| HIG (w/o post-hoc OT) | SDXL | 12.87 | 2.20 |
| HIG (w/ post-hoc OT) | SDXL | 15.06 | 4.39 |
| DreamBooth LoRA∗ | SDXL | 13.01 | 2.34 |
| ControlNet++ (Depth) | SDXL | 25.47 | 14.80 |
| ControlNet++ (Softedge) | SDXL | 15.51 | 4.84 |
| ControlNet++ (OpenPose) | SDXL | 17.19 | 6… |
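In the latency table of the Figure 5 entry, the overhead is the latency minus the unconstrained SDXL latency (10.67 s). The snippet below re-derives it for the complete rows; the OpenPose row is truncated in the source and left out, and `overhead_s` and `overhead_pct` are hypothetical helper names.

```python
# Latencies in seconds, copied from the Figure 5 table (OpenPose row omitted: truncated).
LATENCY_S = {
    "Unconstrained": 10.67,
    "HIG (w/o post-hoc OT)": 12.87,
    "HIG (w/ post-hoc OT)": 15.06,
    "DreamBooth LoRA": 13.01,
    "ControlNet++ (Depth)": 25.47,
    "ControlNet++ (Softedge)": 15.51,
}


def overhead_s(method: str) -> float:
    """Extra seconds relative to unconstrained SDXL sampling, rounded like the table."""
    return round(LATENCY_S[method] - LATENCY_S["Unconstrained"], 2)


def overhead_pct(method: str) -> float:
    """The same overhead as a percentage of the unconstrained latency."""
    return round(100 * overhead_s(method) / LATENCY_S["Unconstrained"], 1)
```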

**Figure 6.** Qualitative results of information embedding. Each image embeds 512 text tokens that can be faithfully decoded. Under single-option binning, OT-based guidance (cols. 3 & 5) drastically reduces visual artifacts compared to direct OT variants (cols. 2 & 4); under multi-option binning, the embedded images remain visually similar to unconstrained generations (col. 6). Best viewed in color. and reliable control over…

**Figure 7.** Robustness evaluation of our information embedding technique. Our evaluation spans random scaling, JPEG compression, soft-prompt perturbation, and histogram corruption.

**Figure 9.** Qualitative results for combining HIG’s distributional color control with DreamBooth LoRA [51] and ControlNet++ [32]. Best viewed in color.

**Figure 10.** Visualizations of OT-based color histogram matching during the denoising process (T = {40, 30, 20, 10}). For each example, we sample a random color histogram as h_tgt. Rows 1 and 2 use single-option binning on RGB channels; rows 3 and 4 use single-option binning on RG channels. D Content Stability over Decoding-Encoding Cycles

**Figure 11.** Content stability of SDXL VAE [46] after multiple decoding-encoding cycles. Overall, the reconstructed images remain visually identical across cycles, demonstrating the feasibility of our decode-transform-encode diffusion guidance scheme.
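The Figure 11 caption refers to a decode-transform-encode guidance scheme. One plausible shape for a single guided step, assuming a deterministic DDIM update (eta = 0) with noise prediction, is sketched below: predict the clean latent, decode it, apply a transform such as OT histogram matching, re-encode, and re-noise to the previous timestep. The excerpt does not say which sampler HIG uses, whether the noise estimate is recomputed after guidance, or at which steps guidance is applied; `decode`, `transform`, and `encode` are hypothetical callables standing in for the VAE and the OT step.

```python
import math
from typing import Callable, List

Vec = List[float]


def guided_ddim_step(x_t: Vec, eps: Vec, alpha_bar_t: float, alpha_bar_prev: float,
                     decode: Callable[[Vec], Vec], transform: Callable[[Vec], Vec],
                     encode: Callable[[Vec], Vec]) -> Vec:
    """One DDIM step (eta = 0) whose clean estimate passes through decode -> transform -> encode."""
    a_t, s_t = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    # Predicted clean latent from the model's noise estimate eps.
    x0_hat = [(x - s_t * e) / a_t for x, e in zip(x_t, eps)]
    # Hypothetical guidance: e.g. decode with a VAE, OT-match the pixel histogram, re-encode.
    x0_guided = encode(transform(decode(x0_hat)))
    a_prev, s_prev = math.sqrt(alpha_bar_prev), math.sqrt(1.0 - alpha_bar_prev)
    # Re-noise the guided estimate to the previous timestep, reusing eps.
    return [a_prev * x0 + s_prev * e for x0, e in zip(x0_guided, eps)]
```

With identity callables the function reduces to a plain DDIM step, which is a cheap sanity check; the observation in Figure 11 that repeated VAE round trips preserve content is what makes a decode and re-encode at every guided step tolerable.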

**Figure 12.** Robustness under highly complex embedded text and image content.

**Figure 13.** A post-hoc OT step can enforce exact compliance with h_tgt, but may introduce visual artifacts from rigid color reassignment. While such strict control is essential for tasks like information embedding, it can be safely omitted in more flexible settings such as color scheme matching. M Extended Usage: Lighting Control

**Figure 14.** Color histogram matching enables lighting control on photo-realistic images.

**Figure 15.** More qualitative results for color-constrained generation (with post-hoc OT).

**Figure 16.** More qualitative results for information embedding via color histograms.

Original abstract

Diffusion models have emerged as a dominant paradigm in generative modeling, enabling high-fidelity sampling from complex data distributions. Despite impressive capabilities, controlling diffusion models to produce outputs aligned with user intent remains an open challenge, especially when balancing global coherence with local precision. Existing control mechanisms vary in the granularity of their conditioning signals. For example, textual prompts guide generation globally through high-level semantics, while ControlNet-like approaches secure precise local structure via dense conditions. In this work, we introduce Histogram-constrained Image Generation (HIG), a novel control mechanism that falls into the middle ground of control granularity. Our framework enforces user-specified distributional constraints (e.g., color histograms or latent token distributions) during the generation process with exact precision. We model such control as an optimal transport (OT) problem and apply explicit guidance transformations during sampling, thereby driving the diffusion trajectory to align with the desired histogram. We demonstrate the versatility of HIG across diverse applications, including constrained generation via color/latent histograms and high-capacity information embedding through histogram-level encoding. Our findings underscore the promise of distributional control, a flexible and interpretable control scheme that is fully compatible with existing control mechanisms, diversifying the hybrid strategies for controllable image generation. Our project page is available at: https://maps-research.github.io/hig/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Desk Editor's Note private letter to a colleague

HIG frames histogram constraints as OT guidance for middle-granularity diffusion control, but the exact-precision claim rests on unseen experiments.


The core idea is a control layer that sits between text prompts and dense maps like ControlNet: users supply a target histogram (color or latent tokens) and the sampler is steered to match it exactly via optimal transport transformations at each diffusion step. That framing is new enough to stand out from the baselines mentioned.

What works is the positioning. Distributional constraints are a natural middle ground—more specific than language, less rigid than pixel-level masks—and the claim that the method stays compatible with existing controls is plausible on paper. Treating the problem as an OT matching task gives a clean mathematical handle for the guidance step.

The soft spot is the strength of the central promise. The abstract says the guidance achieves “exact precision” while preserving coherence and quality, yet no derivation, update rule, or ablation is visible here. Without the full equations or the quantitative results on how much the OT step perturbs the diffusion trajectory, it is impossible to judge whether the exact match comes at an acceptable cost in FID or perceptual quality. That is the load-bearing part of the contribution.

This is aimed at people already working on controllable diffusion pipelines who want another knob between global and local conditioning. A reader who cares about new primitives for editing or data embedding could get something out of the experiments once they are shown.

I would send it to review. The idea is distinct and the compatibility angle is useful; the referee can check whether the OT guidance actually delivers on the exactness claim without hidden side effects.

Referee Report

1 major / 2 minor

Summary. The paper introduces Histogram-constrained Image Generation (HIG), a control mechanism for diffusion models that enforces user-specified distributional constraints (e.g., color histograms or latent token distributions) with exact precision. Control is modeled as an optimal transport (OT) problem, with explicit guidance transformations applied during sampling to align the diffusion trajectory to the target histogram. The approach is positioned as a middle-granularity control compatible with existing methods and is demonstrated on constrained generation and high-capacity information embedding tasks.

Significance. If the exact histogram matching is achieved without degrading sample quality or coherence, HIG would provide a flexible, interpretable distributional control primitive that complements global (text) and local (dense) conditioning, enabling new hybrid strategies. The OT framing and claimed exactness are the core novelties.

major comments (1)

[Abstract] Abstract: the central claim of 'exact precision' in histogram alignment is presented as following directly from the OT modeling and guidance transformations, yet no derivation, algorithm, or proof sketch is supplied to show how the per-step transformations preserve the diffusion marginals or avoid introducing artifacts; this is load-bearing for the 'exact' qualifier.

minor comments (2)

[Abstract] Abstract: the phrase 'middle ground of control granularity' is used without a quantitative comparison (e.g., bits of control or spatial scale) to textual prompts or ControlNet-style methods.
[Abstract] Abstract: the project page URL is given but no quantitative results, ablation tables, or failure cases are referenced in the text itself.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and for identifying a point that bears on the central claim of exactness. We respond to the major comment below.


Referee: [Abstract] Abstract: the central claim of 'exact precision' in histogram alignment is presented as following directly from the OT modeling and guidance transformations, yet no derivation, algorithm, or proof sketch is supplied to show how the per-step transformations preserve the diffusion marginals or avoid introducing artifacts; this is load-bearing for the 'exact' qualifier.

Authors: The abstract is a high-level summary; the derivation that the per-step OT guidance map is the closed-form solution to the Wasserstein problem between the current empirical distribution and the target histogram, and that this map can be applied without changing the diffusion marginals outside the controlled dimensions, appears in Section 3.2 together with the explicit algorithm. We nevertheless agree that the abstract would be strengthened by a short clause indicating that the guidance is constructed to preserve the diffusion process marginals. We will revise the abstract accordingly in the next version. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The paper presents HIG as a modeling choice that frames distributional control as an OT problem and applies explicit guidance transformations at sampling time. No equations, fitted parameters, or self-citations are shown that would reduce the claimed exact histogram alignment to a self-referential definition or input-by-construction. The abstract and description treat the OT formulation as an independent modeling decision whose validity rests on external validation rather than internal reduction. This is the common case of a self-contained proposal without load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on standard diffusion sampling and the mathematical properties of optimal transport; no free parameters, invented entities, or ad-hoc axioms are visible in the abstract.

axioms (1)

Domain assumption: optimal transport provides a well-defined way to transform one distribution into another that can be applied step-wise during diffusion sampling.
The paper states it models control as an OT problem and applies explicit guidance transformations.
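For histograms over many bins (for example, 3D color bins), a standard way to obtain an OT plan is entropy-regularized Sinkhorn scaling, from Cuturi [8] in the reference list. The sketch below is a plain-Python illustration of that technique only; the excerpt does not say whether HIG uses an entropic solver, an exact one, or the fixed-cost formulation of [1], and `sinkhorn_plan` is a hypothetical name.

```python
import math
from typing import List, Sequence


def sinkhorn_plan(a: Sequence[float], b: Sequence[float], cost: Sequence[Sequence[float]],
                  reg: float = 0.1, iters: int = 500) -> List[List[float]]:
    """Entropy-regularized OT plan between histograms a and b via Sinkhorn scaling."""
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns so the plan's marginals approach a and b.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Plan P_ij = u_i * K_ij * v_j.
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
```

A smaller `reg` brings the plan closer to the unregularized one but can underflow `math.exp` for large costs; practical implementations usually iterate in the log domain.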

pith-pipeline@v0.9.1-grok · 5763 in / 1209 out tokens · 21735 ms · 2026-07-01T05:49:47.449120+00:00 · methodology


Reference graph

Works this paper leans on

78 extracted references · 23 canonical work pages · 9 internal anchors

[1] Balinski, M.L.: Fixed-cost transportation problems. Naval Research Logistics Quarterly 8(1), 41–54 (1961)

[2] Bansal, A., Chu, H.M., Schwarzschild, A., Sengupta, S., Goldblum, M., Geiping, J., Goldstein, T.: Universal guidance for diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 843–852 (2023)

[3] Chen, J., Ge, C., Xie, E., Wu, Y., Yao, L., Ren, X., Wang, Z., Luo, P., Lu, H., Li, Z.: Pixart-σ: Weak-to-strong training of diffusion transformer for 4k text-to-image generation (2024)

[4] Christopher, J.K., Baek, S., Fioretto, F.: Constrained synthesis with projected diffusion models. In: The Thirty-eighth Annual Conference on Neural Information Processing Systems (2024), https://openreview.net/forum?id=FsdB3I9Y24

[5] Chung, H., Kim, J., Mccann, M.T., Klasky, M.L., Ye, J.C.: Diffusion posterior sampling for general noisy inverse problems. In: The Eleventh International Conference on Learning Representations (2023), https://openreview.net/forum?id=OnD9zGAGT0k

[6] Chung, H., Sim, B., Ryu, D., Ye, J.C.: Improving diffusion models for inverse problems using manifold constraints. In: Oh, A.H., Agarwal, A., Belgrave, D., Cho, K. (eds.) Advances in Neural Information Processing Systems (2022), https://openreview.net/forum?id=nJJjv0JDJju

[7] Chung, J., Hyun, S., Heo, J.P.: Style injection in diffusion: A training-free approach for adapting large-scale diffusion models for style transfer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8795–8805 (2024)

[8] Cuturi, M.: Sinkhorn distances: Lightspeed computation of optimal transport. Advances in Neural Information Processing Systems 26 (2013)

[9] Dhariwal, P., Nichol, A.: Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems 34, 8780–8794 (2021)

[10] Dou, Z., Song, Y.: Diffusion posterior sampling for linear inverse problem solving: A filtering perspective. In: The Twelfth International Conference on Learning Representations (2024), https://openreview.net/forum?id=tplXNcHZs1

[11] Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al.: The llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024)

[12] Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: Forty-first International Conference on Machine Learning (2024)

[13] Esser, P., Rombach, R., Ommer, B.: Taming transformers for high-resolution image synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12873–12883 (2021)

[14] Fishman, N., Klarner, L., Bortoli, V.D., Mathieu, E., Hutchinson, M.J.: Diffusion models for constrained domains. Transactions on Machine Learning Research (2023), https://openreview.net/forum?id=xuWTFQ4VGO, Expert Certification

[15] Gao, J., Liu, Y., Sun, Y., Tang, Y., Zeng, Y., Chen, K., Zhao, C.: Styleshot: A snapshot on any style. arXiv preprint arXiv:2407.01414 (2024)

[16] Gao, Y., Gong, L., Guo, Q., Hou, X., Lai, Z., Li, F., Li, L., Lian, X., Liao, C., Liu, L., et al.: Seedream 3.0 technical report. arXiv preprint arXiv:2504.11346 (2025)

[17] Guo, Y., Yang, Y., Yuan, H., Wang, M.: Training-free guidance beyond differentiability: Scalable path steering with tree search in diffusion and flow models. Advances in Neural Information Processing Systems 38, 73343–73384 (2026)

[18] He, Y., Murata, N., Lai, C.H., Takida, Y., Uesaka, T., Kim, D., Liao, W.H., Mitsufuji, Y., Kolter, J.Z., Salakhutdinov, R., Ermon, S.: Manifold preserving guided diffusion. In: The Twelfth International Conference on Learning Representations (2024), https://openreview.net/forum?id=o3BxOLoxm1

[19] Hessel, J., Holtzman, A., Forbes, M., Bras, R.L., Choi, Y.: Clipscore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718 (2021)

[20] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33, 6840–6851 (2020)

[21] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: LoRA: Low-rank adaptation of large language models. In: International Conference on Learning Representations (2022)

[22] Huang, Y., Ghatare, A., Liu, Y., Hu, Z., Zhang, Q., Sastry, C.S., Gururani, S., Oore, S., Yue, Y.: Symbolic music generation with non-differentiable rule guided diffusion. In: Proceedings of the 41st International Conference on Machine Learning. pp. 19772–19797 (2024)

[23] Karras, T., Aittala, M., Aila, T., Laine, S.: Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems 35, 26565–26577 (2022)

[24] Karras, T., Aittala, M., Lehtinen, J., Hellsten, J., Aila, T., Laine, S.: Analyzing and improving the training dynamics of diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 24174–24184 (2024)

[25] Ke, Z., Liu, Y., Zhu, L., Zhao, N., Lau, R.W.: Neural preset for color style transfer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14173–14182 (2023)

[26] Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013)

[27] Labs, B.F.: Flux (2023), https://github.com/black-forest-labs/flux

[28] Labs, B.F., Batifol, S., Blattmann, A., Boesel, F., Consul, S., Diagne, C., Dockhorn, T., English, J., English, Z., Esser, P., et al.: Flux.1 Kontext: Flow matching for in-context image generation and editing in latent space. arXiv preprint arXiv:2506.15742 (2025)

[29] Larchenko, M., Lobashev, A., Guskov, D., Palyulin, V.V.: Color transfer with modulated flows. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 4464–4472 (2025)

[30] Laria, H., Gomez-Villa, A., Qin, J., Butt, M.A., Raducanu, B., Vazquez-Corral, J., van de Weijer, J., Wang, K.: Leveraging semantic attribute binding for free-lunch color control in diffusion models. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 7689–7698 (2026)

[31] Lester, B., Al-Rfou, R., Constant, N.: The power of scale for parameter-efficient prompt tuning. In: Moens, M.F., Huang, X., Specia, L., Yih, S.W.t. (eds.) Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. pp. 3045–3059. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic (Nov 2021) ...

[32] Li, M., Yang, T., Kuang, H., Wu, J., Wang, Z., Xiao, X., Chen, C.: Controlnet++: Improving conditional controls with efficient consistency feedback. In: European Conference on Computer Vision. pp. 129–147. Springer (2025)

[33] Li, X., Zhao, Y., Wang, C., Scalia, G., Eraslan, G., Nair, S., Biancalani, T., Ji, S., Regev, A., Levine, S., et al.: Derivative-free guidance in continuous and discrete diffusion models with soft value-based decoding. arXiv preprint arXiv:2408.08252 (2024)

[34] Liang, Z., Li, Z., Zhou, S., Li, C., Loy, C.C.: Control color: Multimodal diffusion-based interactive image colorization. arXiv preprint arXiv:2402.10855 (2024)

[35] Lipman, Y., Chen, R.T.Q., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. In: The Eleventh International Conference on Learning Representations (2023), https://openreview.net/forum?id=PqvMRDCJT9t

[36] Liu, C., Shah, V., Cui, A., Lazebnik, S.: Unziplora: Separating content and style from a single image. arXiv preprint arXiv:2412.04465 (2024)

[37] Liu, X., Gong, C., qiang liu: Flow straight and fast: Learning to generate and transfer data with rectified flow. In: The Eleventh International Conference on Learning Representations (2023), https://openreview.net/forum?id=XVjTT1nw5z

[38] Lobashev, A., Larchenko, M., Guskov, D.: Color conditional generation with sliced wasserstein guidance. Advances in Neural Information Processing Systems 38, 164572–164601 (2026)

[39] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization (2019)

[40] Mou, C., Wang, X., Xie, L., Wu, Y., Zhang, J., Qi, Z., Shan, Y.: T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 4296–4304 (2024)

[41] Naderiparizi, S., Liang, X., Zwartsenberg, B., Wood, F.: Don’t be so negative! score-based generative modeling with oracle-assisted guidance (2024), https://openreview.net/forum?id=gJ7cHBHfBk

[42] OpenAI: Introducing 4o image generation (2025), https://openai.com/index/introducing-4o-image-generation/, accessed: 2025-05-15

[43] Peebles, W., Xie, S.: Scalable diffusion models with transformers. arXiv preprint arXiv:2212.09748 (2022)

[44] Peng, Y., Hu, D., Wang, Y., Chen, K., Pei, G., Zhang, W.: Stegaddpm: Generative image steganography based on denoising diffusion probabilistic model. In: Proceedings of the 31st ACM International Conference on Multimedia. pp. 7143–7151. MM ’23, Association for Computing Machinery, New York, NY, USA (2023). https://doi.org/10.1145/3581783.3612514, https://d...

[45] Peng, Y., Wang, Y., Hu, D., Chen, K., Rong, X., Zhang, W.: LDStega: Practical and robust generative image steganography based on latent diffusion models. In: ACM Multimedia 2024 (2024), https://openreview.net/forum?id=kEqGgMgIlu

[46] Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023)

[47] Qiu, Q., Mao, J., Wang, X.: Exploring palette based color guidance in diffusion models. In: Proceedings of the 33rd ACM International Conference on Multimedia. pp. 10287–10295 (2025)

[48] Qu, L., Zhang, H., Liu, Y., Wang, X., Jiang, Y., Gao, Y., Ye, H., Du, D.K., Yuan, Z., Wu, X.: Tokenflow: Unified image tokenizer for multimodal understanding and generation. arXiv preprint arXiv:2412.03069 (2024)

[49] Razavi, A., Van den Oord, A., Vinyals, O.: Generating diverse high-fidelity images with vq-vae-2. Advances in Neural Information Processing Systems 32 (2019)

[50] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10684–10695 (2022)

[51] Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., Aberman, K.: Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 22500–22510 (June 2023)

[52] Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., et al.: Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems 35, 25278–25294 (2022)

[53] Shah, V., Ruiz, N., Cole, F., Lu, E., Lazebnik, S., Li, Y., Jampani, V.: Ziplora: Any subject in any style by effectively merging loras. In: European Conference on Computer Vision. pp. 422–438. Springer (2025)

[54] Shum, K.C., Hua, B.S., Nguyen, D.T., Yeung, S.K.: Color alignment in diffusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 28446–28455 (2025)

[55] Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., Ganguli, S.: Deep unsupervised learning using nonequilibrium thermodynamics. In: International Conference on Machine Learning. pp. 2256–2265. PMLR (2015)

[56] Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. In: International Conference on Learning Representations (2021), https://openreview.net/forum?id=St1giarCHLP

[57] Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456 (2020)

[58] Su, W., Ni, J., Sun, Y.: StegaStyleGAN: Towards generic and practical generative image steganography. Proceedings of the AAAI Conference on Artificial Intelligence 38(1), 240–248 (Mar 2024). https://doi.org/10.1609/aaai.v38i1.27776

[59] Tan, Z., Liu, S., Yang, X., Xue, Q., Wang, X.: OminiControl: Minimal and universal control for diffusion transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 14940–14950 (2025)

[60] Van Den Oord, A., Vinyals, O., et al.: Neural discrete representation learning. Advances in Neural Information Processing Systems 30 (2017)

[61] Villani, C.: Optimal Transport: Old and New. Springer (2009)

[62] Wang, H., Xing, P., Huang, R., Ai, H., Wang, Q., Bai, X.: InstantStyle-Plus: Style transfer with content-preserving in text-to-image generation. arXiv preprint arXiv:2407.00788 (2024)

[63] Wang, P., Shi, Y., Lian, X., Zhai, Z., Xia, X., Xiao, X., Huang, W., Yang, J.: SeedEdit 3.0: Fast and high-quality generative image editing. arXiv preprint arXiv:2506.05083 (2025)

[64] Xie, E., Chen, J., Chen, J., Cai, H., Tang, H., Lin, Y., Zhang, Z., Li, M., Zhu, L., Lu, Y., Han, S.: SANA: Efficient high-resolution image synthesis with linear diffusion transformer (2024)

[65] Xu, Z., Xu, D., Li, Z., Zhang, C.: MDDM: Practical message-driven generative image steganography based on diffusion models. In: Forty-second International Conference on Machine Learning (2025), https://openreview.net/forum?id=NniXePXVXw

[66] Yan, L., Li, X., Zhang, J., Guan, F., Peng, K., Li, P.: F-DDIM: A featurized denoising diffusion implicit model for facial image steganography. In: Proceedings of the 33rd ACM International Conference on Multimedia. pp. 8488–8496. MM ’25, Association for Computing Machinery, New York, NY, USA (2025). https://doi.org/10.1145/3746027.3755517

[67] Yang, Y., Liu, Z., Jia, J., Gao, Z., Li, Y., Sun, W., Liu, X., Zhai, G.: DiffStega: Towards universal training-free coverless image steganography with diffusion models. In: Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence. pp. 1579–1587 (2024)

[68] Ye, H., Lin, H., Han, J., Xu, M., Liu, S., Liang, Y., Ma, J., Zou, J.Y., Ermon, S.: TFG: Unified training-free guidance for diffusion models. Advances in Neural Information Processing Systems 37, 22370–22417 (2024)

[69] Ye, H., Zhang, J., Liu, S., Han, X., Yang, W.: IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721 (2023)

[70] Yu, J., Zhang, X., Xu, Y., Zhang, J.: CRoSS: Diffusion model makes controllable, robust and secure image steganography. Advances in Neural Information Processing Systems 36, 80730–80743 (2023)

[71] Yu, Q., Weber, M., Deng, X., Shen, X., Cremers, D., Chen, L.C.: An image is worth 32 tokens for reconstruction and generation. Advances in Neural Information Processing Systems 37, 128940–128966 (2024)

[72] Zamzam, O.: PixelShuffler: A simple image translation through pixel rearrangement. arXiv preprint arXiv:2410.03021 (2024)

[73] Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3836–3847 (2023)

[74] Zhang, Y., Huang, N., Tang, F., Huang, H., Ma, C., Dong, W., Xu, C.: Inversion-based style transfer with diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 10146–10156 (June 2023)

[75] Zhao, S., Chen, D., Chen, Y.C., Bao, J., Hao, S., Yuan, L., Wong, K.Y.K.: Uni-ControlNet: All-in-one control to text-to-image diffusion models. Advances in Neural Information Processing Systems 36 (2024)

[76] Zhou, Q., Wei, P., Qian, Z., Zhang, X., Li, S.: Improved generative steganography based on diffusion model. IEEE Transactions on Circuits and Systems for Video Technology 35(7), 6494–6507 (2025). https://doi.org/10.1109/TCSVT.2025.3539832

[77] Zhou, Z., Dong, X., Meng, R., Wang, M., Yan, H., Yu, K., Choo, K.K.R.: Generative steganography via auto-generation of semantic object contours. IEEE Transactions on Information Forensics and Security 18, 2751–2765 (2023). https://doi.org/10.1109/TIFS.2023.3268843

[78] Zhou, Z., Su, Y., Li, J., Yu, K., Wu, Q.M.J., Fu, Z., Shi, Y.: Secret-to-Image Reversible Transformation for Generative Steganography. IEEE Transactions on Dependable and Secure Computing 20(05), 4118–4134 (Sep 2023). https://doi.org/10.1109/TDSC.2022.3217661

A Pseudocode for HIG

Algorithm 1 Text-to-Image Generatio...
