pith. machine review for the scientific record

arxiv: 2604.05730 · v1 · submitted 2026-04-07 · 💻 cs.LG

Recognition: 2 Lean theorem links

Controllable Image Generation with Composed Parallel Token Prediction

Chris G. Willcocks, Hubert P. H. Shum, Jamie Stirling, Noura Al-Moubayed

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 19:28 UTC · model grok-4.3

classification 💻 cs.LG
keywords controllable image generation · discrete generative models · token prediction · VQ-VAE · VQ-GAN · concept composition · masked generation · text-to-image control

The pith

A composition rule for discrete generative processes allows precise control over novel, unseen combinations of image conditions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Conditional discrete generative models struggle to combine multiple input conditions without errors or inconsistencies. The paper derives a theoretically grounded formulation for composing these probabilistic processes, of which masked generation is one special case. The formulation supports exact specification of new combinations and counts of conditions that never appeared together in training, plus per-concept weighting to strengthen or cancel individual conditions. Paired with the compositional vocabularies learned by VQ-VAE and VQ-GAN, the method yields substantially more accurate controllable image generation at higher speed.

Core claim

We derive a theoretically-grounded formulation for composing discrete probabilistic generative processes, with masked generation (absorbing diffusion) as a special case. Our formulation enables precise specification of novel combinations and numbers of input conditions that lie outside the training data, with concept weighting enabling emphasis or negation of individual conditions.

What carries the argument

The composed parallel token prediction formulation that merges multiple discrete probabilistic generative processes into a single coherent prediction step.
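The merge step can be illustrated with a minimal NumPy sketch: a product-of-experts-style composition of per-condition token distributions in log space, with per-concept weights and renormalization at each token position. This is an editorial illustration of the general idea, not the paper's exact operator; `compose_token_logits` and the toy distributions are invented for the example.

```python
import numpy as np

def compose_token_logits(uncond_logits, cond_logits_list, weights):
    """Compose an unconditional token distribution with several conditional
    ones in log space, product-of-experts style.

    uncond_logits: (positions, vocab) log-probabilities with no conditions
    cond_logits_list: list of (positions, vocab) log-probs, one per condition
    weights: per-condition scalars; w > 0 emphasises, w < 0 negates
    """
    composed = uncond_logits.copy()
    for w, cond in zip(weights, cond_logits_list):
        composed += w * (cond - uncond_logits)
    # renormalize each position so every row is a valid distribution
    composed -= np.logaddexp.reduce(composed, axis=-1, keepdims=True)
    return composed

# toy example: 2 token positions, vocabulary of 4 codes
rng = np.random.default_rng(0)
uncond = np.log(rng.dirichlet(np.ones(4), size=2))
cond_a = np.log(rng.dirichlet(np.ones(4), size=2))
cond_b = np.log(rng.dirichlet(np.ones(4), size=2))

# emphasise condition a, partially negate condition b
logp = compose_token_logits(uncond, [cond_a, cond_b], weights=[1.0, -0.5])
probs = np.exp(logp)
print(probs.sum(axis=-1))  # each position sums to 1
```

The renormalization step matters: the weighted product of distributions is generally unnormalized, so each position's row must be rescaled before tokens can be sampled from it.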

If this is right

  • Precise specification of novel combinations and numbers of input conditions outside the training data becomes possible without retraining.
  • Concept weighting allows emphasis or negation of individual conditions during generation.
  • A 63.4 percent relative reduction in error rate is achieved across positional CLEVR, relational CLEVR, and FFHQ when used with VQ-VAE and VQ-GAN.
  • An average absolute FID improvement of 9.58 is obtained alongside the error reduction.
  • Generation runs 2.3 to 12 times faster than comparable methods while remaining applicable to open pre-trained discrete text-to-image models.
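Masked generation (absorbing diffusion), the special case the paper names, can be sketched as an iterative parallel-unmasking loop: start from an all-masked state and, at each step, resample a growing fraction of masked positions from the (possibly composed) token distribution. This is a hedged sketch; `model`, the linear unmask schedule, and the sentinel value are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

MASK = -1  # sentinel id for masked positions

def parallel_masked_decode(model, num_tokens, vocab, steps, rng):
    """Absorbing-diffusion-style decoding sketch. `model(tokens)` is a
    hypothetical callable returning (num_tokens, vocab) probabilities
    for the next state given the current partially masked state."""
    tokens = np.full(num_tokens, MASK)
    for step in range(1, steps + 1):
        probs = model(tokens)                         # (num_tokens, vocab)
        masked = np.flatnonzero(tokens == MASK)
        if len(masked) == 0:
            break
        # linear schedule: unmask a growing share of remaining positions
        k = min(int(np.ceil(len(masked) * step / steps)), len(masked))
        chosen = rng.choice(masked, size=k, replace=False)
        for i in chosen:
            tokens[i] = rng.choice(vocab, p=probs[i])  # sample in parallel
    return tokens

rng = np.random.default_rng(1)
uniform_model = lambda t: np.full((8, 4), 0.25)  # stand-in for a real model
out = parallel_masked_decode(uniform_model, num_tokens=8, vocab=4,
                             steps=4, rng=rng)
print(out)  # all positions filled with token ids in [0, 4)
```

Because every masked position is predicted from the same composed distribution at each step, the composition rule plugs into this loop without changing the decoding schedule, which is where the speed advantage over sequential methods comes from.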

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same composition rule could be tested on discrete generative tasks outside images, such as audio or video, where exhaustive training on every possible condition mix is impractical.
  • Negation of conditions via weighting offers a route to targeted editing of outputs after initial generation without model retraining.
  • Because the method separates the composition logic from the underlying token vocabulary, it may reduce data-collection costs when new control signals are added to existing models.

Load-bearing premise

Discrete probabilistic generative processes can be composed in a theoretically grounded manner that supports extrapolation to unseen combinations and counts of conditions without introducing inconsistencies or requiring additional training data.

What would settle it

Apply the method to a combination of conditions whose count or joint appearance never occurred in training, such as three visual attributes never seen together, and check whether the generated images faithfully reflect all conditions without artifacts or contradictions. Consistent failure on such tests would falsify the central claim.
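The test above can be sketched as a small evaluation harness, in the spirit of the CLEVR error-rate metric. `generate` and `verify` are hypothetical stand-ins (the generative model and an independent attribute classifier); the fakes below exist only so the sketch runs.

```python
import numpy as np

def compositional_error_rate(generate, verify, condition_sets, n=16, seed=0):
    """For each set of conditions (including counts or combinations absent
    from training), generate n samples and count those where any condition
    fails independent verification. Returns the fraction of failures."""
    rng = np.random.default_rng(seed)
    failures, total = 0, 0
    for conds in condition_sets:
        for _ in range(n):
            img = generate(conds, rng)
            if not all(verify(img, c) for c in conds):
                failures += 1
            total += 1
    return failures / total

# stand-ins: an "image" is just a dict of which conditions it satisfies
fake_generate = lambda conds, rng: {c: rng.random() > 0.1 for c in conds}
fake_verify = lambda img, c: img[c]

# second set has a condition count never used above: three joint attributes
err = compositional_error_rate(fake_generate, fake_verify,
                               [["red", "left"], ["blue", "cube", "front"]])
print(0.0 <= err <= 1.0)  # True
```

A real run would substitute the composed generator and a classifier trained independently of it; a persistently high error rate on never-co-occurring condition sets is the falsifying outcome.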

Figures

Figures reproduced from arXiv: 2604.05730 by Chris G. Willcocks, Hubert P. H. Shum, Jamie Stirling, Noura Al-Moubayed.

Figure 1: Scatter plots of compositional generation error vs FID on 3 … view at source ↗
Figure 2: Compositional text-to-image results with captions (zooming …) view at source ↗
Figure 3: Concept negation with text-to-image (left: baseline, right: ours). Our method allows more precise control over the outputs of an existing pre-trained … view at source ↗
Figure 4: Conceptual product space. Example of composing two concept spaces using our framework: {"a cat", "a dog", "an apple", "a cherry"} … view at source ↗
Figure 5: Overview of our approach. Left: we start with a generic "empty" state s0 or the preceding state st and a set of input concepts c1, c2, …. These are used to compute the unconditional distribution over st+1 and the conditional distributions given each of c1, c2, …. These are composed, optionally incorporating concept weights, to produce an accurate estimate of the distribution conditioned on all inputs. … view at source ↗
Figure 6: Effect of varying the w_smile concept weight from −3.0 to 3.0 while keeping w_male = w_no_glasses = 3.0. view at source ↗
Figure 7: Compositional out-of-distribution generation. Positional CLEVR training images contain no more than 5 objects per image, but our compositional … view at source ↗
read the original abstract

Conditional discrete generative models struggle to faithfully compose multiple input conditions. To address this, we derive a theoretically-grounded formulation for composing discrete probabilistic generative processes, with masked generation (absorbing diffusion) as a special case. Our formulation enables precise specification of novel combinations and numbers of input conditions that lie outside the training data, with concept weighting enabling emphasis or negation of individual conditions. In synergy with the richly compositional learned vocabulary of VQ-VAE and VQ-GAN, our method attains a $63.4\%$ relative reduction in error rate compared to the previous state-of-the-art, averaged across 3 datasets (positional CLEVR, relational CLEVR and FFHQ), simultaneously obtaining an average absolute FID improvement of $-9.58$. Meanwhile, our method offers a $2.3\times$ to $12\times$ real-time speed-up over comparable methods, and is readily applied to an open pre-trained discrete text-to-image model for fine-grained control of text-to-image generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims to derive a theoretically-grounded formulation for composing discrete probabilistic generative processes (with masked generation/absorbing diffusion as a special case), enabling precise specification of novel combinations and counts of input conditions outside the training data, along with concept weighting for emphasis or negation. In combination with VQ-VAE and VQ-GAN, it reports a 63.4% relative error-rate reduction and average absolute FID improvement of -9.58 across positional CLEVR, relational CLEVR, and FFHQ, plus 2.3x-12x speed-ups and applicability to open pre-trained text-to-image models.

Significance. If the composition operator is probabilistically valid for unseen condition counts and the empirical gains hold under controlled comparisons, the work would offer a principled route to flexible controllability in discrete latent generative models without retraining, which could be impactful for conditional image synthesis and related tasks.

major comments (2)
  1. [§3] §3 (Composition Derivation): the central claim that the derived operator supports extrapolation to arbitrary unseen numbers and combinations of conditions requires an explicit proof that the resulting distribution remains normalized (integrates to 1) and assigns non-negative probability mass. The formulation via products or conditionals must be shown to avoid inconsistencies or the need for ad-hoc re-normalization that could break the probabilistic interpretation for k different from training counts.
  2. [Experimental Results] Experimental Results section: the 63.4% relative error-rate reduction and FID gains are load-bearing for the practical contribution, yet the precise baselines (including whether they share identical VQ-VAE/GAN backbones and training regimes), the definition of 'error rate' on CLEVR tasks, and controls for confounding factors are not sufficiently detailed to attribute improvements unambiguously to the proposed composition method.
minor comments (1)
  1. [Abstract] Abstract: the term 'error rate' in the 63.4% reduction claim is not defined; add a brief parenthetical or forward reference to the metric used in the CLEVR evaluations.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive review. We address each major comment below with clarifications and proposed revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§3] §3 (Composition Derivation): the central claim that the derived operator supports extrapolation to arbitrary unseen numbers and combinations of conditions requires an explicit proof that the resulting distribution remains normalized (integrates to 1) and assigns non-negative probability mass. The formulation via products or conditionals must be shown to avoid inconsistencies or the need for ad-hoc re-normalization that could break the probabilistic interpretation for k different from training counts.

    Authors: We agree that an explicit proof of normalization and non-negativity is valuable for rigorously establishing the probabilistic validity of the composition operator across arbitrary k. The derivation in §3 constructs the operator via products of conditional distributions that are designed to preserve normalization by construction, without requiring ad-hoc renormalization. To directly address this point, we will add a new subsection in the revised §3 that provides a formal proof demonstrating that the composed distribution integrates to 1 and remains non-negative for any number of conditions, including unseen counts and combinations. This will also explicitly rule out inconsistencies in the product/conditional formulation. revision: yes

  2. Referee: [Experimental Results] Experimental Results section: the 63.4% relative error-rate reduction and FID gains are load-bearing for the practical contribution, yet the precise baselines (including whether they share identical VQ-VAE/GAN backbones and training regimes), the definition of 'error rate' on CLEVR tasks, and controls for confounding factors are not sufficiently detailed to attribute improvements unambiguously to the proposed composition method.

    Authors: We acknowledge that greater experimental detail is needed to allow unambiguous attribution of the gains. In the revised manuscript, we will expand the Experimental Results section (and the associated appendix) to: (i) confirm that all baselines employ identical VQ-VAE/VQ-GAN backbones, tokenizers, and training regimes as our method; (ii) provide a precise definition of the error-rate metric on CLEVR (percentage of images failing to satisfy the full set of specified positional/relational conditions, evaluated via an independent classifier); and (iii) include additional controlled ablations that isolate the composition operator while holding all other factors fixed. These additions will strengthen the link between the reported 63.4% error-rate reduction, FID improvements, and the proposed method. revision: yes

Circularity Check

0 steps flagged

Derivation of composition operator presented as independent theoretical construction with no reduction to fitted inputs or self-citations.

full rationale

The paper claims a derivation of a composition operator for discrete probabilistic generative processes that supports extrapolation to unseen condition counts and combinations. No equations or sections in the provided abstract or reader summary reduce this operator by construction to a fitted parameter, a self-citation chain, or a renaming of an existing result. The formulation is described as theoretically grounded and then evaluated empirically on FID and error rates separately from the derivation itself. The central claim therefore remains self-contained against external benchmarks rather than circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The central claim rests on an unshown derivation whose assumptions are not enumerated here.

pith-pipeline@v0.9.0 · 5479 in / 1146 out tokens · 49504 ms · 2026-05-10T19:28:49.393359+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

46 extracted references · 13 canonical work pages · 7 internal anchors

  [1] Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. T2I-CompBench: A comprehensive benchmark for open-world compositional text-to-image generation. In Advances in Neural Information Processing Systems, volume 36, pages 78723–78747, 2023.
  [2] Baihan Lin, Djallel Bouneffouf, and Irina Rish. A survey on compositional generalization in applications. arXiv preprint arXiv:2302.01067, 2023.
  [3] Suraj Patil, William Berman, Robin Rombach, and Patrick von Platen. aMUSEd: An open MUSE reproduction. arXiv preprint arXiv:2401.01808, 2024.
  [4] Weili Nie, Arash Vahdat, and Anima Anandkumar. Controllable and compositional generation with latent-space energy-based models. Advances in Neural Information Processing Systems, 34:13497–13510, 2021.
  [5] Yilun Du, Shuang Li, and Igor Mordatch. Compositional visual generation with energy based models. Advances in Neural Information Processing Systems, 33:6637–6647, 2020.
  [6] Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, and Joshua B. Tenenbaum. Compositional visual generation with composable diffusion models. In Computer Vision – ECCV 2022, Part XVII, pages 423–439. Springer, 2022.
  [7] Aaron van den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in Neural Information Processing Systems, 30, 2017.
  [8] Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis, 2020.
  [9] Sam Bond-Taylor, Peter Hessey, Hiroshi Sasaki, Toby P. Breckon, and Chris G. Willcocks. Unleashing transformers: Parallel token prediction with discrete absorbing diffusion for fast high-resolution image generation from vector-quantized codes. In European Conference on Computer Vision (ECCV), 2022.
  [10] Huiwen Chang, Han Zhang, Jarred Barber, AJ Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Murphy, William T. Freeman, Michael Rubinstein, et al. Muse: Text-to-image generation via masked generative transformers. arXiv preprint arXiv:2301.00704, 2023.
  [11] Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T. Freeman. MaskGIT: Masked generative image transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11315–11325, 2022.
  [12] Yilun Du and Igor Mordatch. Implicit generation and modeling with energy based models. Advances in Neural Information Processing Systems, 32, 2019.
  [13] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
  [14] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.
  [15] Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2901–2910, 2017.
  [16] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019.
  [17–18] G. E. Hinton. Products of experts. In 9th International Conference on Artificial Neural Networks (ICANN '99), pages 1–6, 1999. doi:10.1049/CP:19991075.
  [19] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
  [20] Tailin Wu, Takashi Maruyama, Long Wei, Tao Zhang, Yilun Du, Gianluca Iaccarino, and Jure Leskovec. Compositional generative inverse design. arXiv preprint arXiv:2401.13171, 2024.
  [21] Jianrong Zhang, Hehe Fan, and Yi Yang. EnergyMoGen: Compositional human motion generation with energy-based diffusion model in latent space. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 17592–17602, 2025.
  [22] Xue Bin Peng, Michael Chang, Grace Zhang, Pieter Abbeel, and Sergey Levine. MCP: Learning composable hierarchical control with multiplicative compositional policies. In Advances in Neural Information Processing Systems, volume 32, 2019.
  [23] Saeed Masoudnia and Reza Ebrahimpour. Mixture of experts: a literature survey. Artificial Intelligence Review, 42:275–293, 2014.
  [24] Ali Razavi, Aaron van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with VQ-VAE-2. Advances in Neural Information Processing Systems, 32, 2019.
  [25] Robert Gray. Vector quantization. IEEE ASSP Magazine, 1(2):4–29, 1984.
  [26] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  [27] Abril Corona-Figueroa, Sam Bond-Taylor, Neelanjan Bhowmik, Yona Falinie A. Gaus, Toby P. Breckon, Hubert P. H. Shum, and Chris G. Willcocks. Unaligned 2D to 3D translation with conditional vector-quantized code diffusion using transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14585–14594, 2023.
  [28] Itai Gat, Tal Remez, Neta Shaul, Felix Kreuk, Ricky T. Q. Chen, Gabriel Synnaeve, Yossi Adi, and Yaron Lipman. Discrete flow matching. Advances in Neural Information Processing Systems, 37:133345–133385, 2024.
  [29] Jian Han, Jinlai Liu, Yi Jiang, Bin Yan, Yuqi Zhang, Zehuan Yuan, Bingyue Peng, and Xiaobing Liu. Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15733–15744, 2025.
  [30] Zhenlin Xu, Marc Niethammer, and Colin A. Raffel. Compositional generalization in unsupervised compositional representation learning: A study on disentanglement and emergent language. Advances in Neural Information Processing Systems, 35:25074–25087, 2022.
  [31] James Joyce. Bayes' theorem. The Stanford Encyclopedia of Philosophy, 2003.
  [32] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz L…
  [33] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30:5999–6009, 2017. URL https://arxiv.org/abs/1706.03762v7.
  [34] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT 2019, Proceedings of the Conference, 1:4171–4186, 2018. URL https…
  [35] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.
  [36] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
  [37] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.
  [38] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.
  [39] Sadeep Jayasumana, Srikumar Ramalingam, Andreas Veit, Daniel Glasner, Ayan Chakrabarti, and Sanjiv Kumar. Rethinking FID: Towards a better evaluation metric for image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9307–9315, 2024.
  [40] Nick Huang, Aaron Gokaslan, Volodymyr Kuleshov, and James Tompkin. The GAN is dead; long live the GAN! A modern GAN baseline. Advances in Neural Information Processing Systems, 37:44177–44215, 2024.
  [41] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024.
  [42] Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Training generative adversarial networks with limited data. Advances in Neural Information Processing Systems, 33:12104–12114, 2020.
  [43] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8110–8119, 2020.
  [44] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
  [45] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
  [46] Yixin Wan, Arjun Subramonian, Anaelia Ovalle, Zongyu Lin, Ashima Suvarna, Christina Chance, Hritik Bansal, Rebecca Pattichis, and Kai-Wei Chang. Survey of bias in text-to-image generation: Definition, evaluation, and mitigation. arXiv preprint arXiv:2404.01030, 2024.