Don't Settle at the Mode! Mitigating Diversity Collapse in Pretrained Flow Models via Feature Self-Guidance

Abhijnya Bhat; Pradhaan S Bhat; Rishubh Parihar; R. Venkatesh Babu

arxiv: 2606.27371 · v1 · pith:VNVEBCQ7new · submitted 2026-06-25 · 💻 cs.CV

Don't Settle at the Mode! Mitigating Diversity Collapse in Pretrained Flow Models via Feature Self-Guidance

Pradhaan S Bhat , Rishubh Parihar , Abhijnya Bhat , R. Venkatesh Babu This is my paper

Pith reviewed 2026-06-26 05:05 UTC · model grok-4.3

classification 💻 cs.CV

keywords diversity collapseflow modelsself-guidancemanifold regularizationconditional image generationtraining-free methodfeature dispersionplug-and-play generation

0 comments

The pith

Flow models recover sample diversity by dispersing internal features during batch generation and projecting them back onto the data manifold.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Pretrained flow models for conditional image generation often produce nearly identical outputs when sampling multiple times from the same prompt, limiting their usefulness. The paper establishes that this diversity collapse can be addressed without retraining or external reward models by applying a self-guidance step that spreads out the model's internal feature representations across a batch. A subsequent manifold regularization projects the dispersed features back onto the learned data manifold to maintain realism and prompt alignment. The method adds only a small inference cost and works as a plug-in for existing pretrained models. If correct, it means flow-based generators can produce varied yet faithful samples from identical conditions at negligible extra expense.

Core claim

The paper claims that dispersing the internal features of a flow model during batch generation via feature self-guidance, followed by a manifold regularization step that projects the dispersed features back onto the data manifold, mitigates diversity collapse. This produces more varied outputs while preserving fidelity to the input conditions and without requiring additional reward models or training. The approach integrates as a plug-and-play module into pretrained flow models with only marginal inference overhead and shows improvements across multi-step and few-step models for text-to-image, depth-to-image, and reference image tasks.

What carries the argument

Feature self-guidance mechanism that disperses internal features in a batch, paired with manifold regularization that projects the result back onto the learned data manifold.

If this is right

Diversity metrics improve across several conditional flow models without loss of fidelity.
The method requires no extra reward models or retraining and adds only marginal inference cost.
It works for both multi-step and few-step generation pipelines.
It applies to text-to-image, depth-to-image, and reference-image conditional tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same dispersion-plus-projection idea might extend to other generative architectures that exhibit mode collapse.
It could reduce dependence on post-hoc sample selection techniques that rely on external scorers.
Varying the strength of feature dispersion per task or model layer might yield further gains in specific domains.

Load-bearing premise

Dispersing internal features and then projecting them back onto the manifold will increase output diversity while keeping fidelity and condition alignment intact.

What would settle it

Apply the method to a standard pretrained flow model and compare diversity, fidelity, and alignment metrics against ordinary sampling; if diversity metrics show no gain or fidelity drops substantially, the central claim does not hold.

Figures

Figures reproduced from arXiv: 2606.27371 by Abhijnya Bhat, Pradhaan S Bhat, Rishubh Parihar, R. Venkatesh Babu.

**Figure 1.** Figure 1: We propose an efficient inference-time approach to enhance diversity in conditional flow models. Our plug-and-play module integrates seamlessly into various conditional flow models and consistently enhances diversity for text-to-image, depth-toimage and image-to-image models. Compared to recent Group Inference [44], which relies on reward-based sample rejection, our method achieves competitive diversity… view at source ↗

**Figure 2.** Figure 2: Feature Distance vs. Output Diversity. We examine the impact of perturbing FLUX model’s internal DiT features ht with Gaussian noise. As shown in the histograms, increasing the pairwise distance between internal features increases sample diversity. In contrast, perturbing latents xt with noise causes image corruption. This identifies feature dispersion as a mechanism for recovering flow model diversity… view at source ↗

**Figure 3.** Figure 3: Overview of Feature Self-Guidance. Our method integrates a "Disperseand-Refine" module into the flow model backbone. (a) Dispersion Stage: Internal features are pushed apart using iterative self-guidance to expand sample variety. (b) Refinement Stage: These features are re-processed through the same MMDiT block to project them back onto the valid image manifold. Finally, we linearly interpolate the dispe… view at source ↗

**Figure 5.** Figure 5: Ablation over DiT Block for dispersion & mixing parameter β choices. the output feature space, giving projected features hˆ t. In our context, we pass the dispersed features again through the same block to make corrections to ht with ∆h˜ t. This allows the block, utilizing its pretrained internal priors, to Block i Block i disperse h t 𝚫h t h t + + ~ h t ^ 𝚫h t ~ Correcting dispersed features Write Read Fi… view at source ↗

**Figure 4.** Figure 4: Read-write mechanism of transformers: Our regularization method leverages the read-write mechanisms of transformer blocks to add corrections to dispersed features to project them to the natural image manifold. project the perturbed features back onto the natural image manifold while preserving diversity. Next, we use the projected features and the dispersed features to obtain a final update: h'_t := \beta … view at source ↗

**Figure 6.** Figure 6: Qualitative results for text-to-image generation [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗

**Figure 7.** Figure 7: Qualitative Comparison for text-to-image generation [PITH_FULL_IMAGE:figures/full_fig_p011_7.png] view at source ↗

**Figure 6.** Figure 6: Fig.6. The samples generated by our method are significantly more diverse in [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗

**Figure 8.** Figure 8: Additional Generation Tasks: Our method demonstrates the ability to generate diverse samples while maintaining adherence to the input conditioning, such as depth maps or reference images. For depth-conditioned generation, our method generates diverse backgrounds and textures. For reference conditioned generation, our method preserves the house/dog identity and generates diverse styles, whereas baseline met… view at source ↗

**Figure 9.** Figure 9: Performance on scaling Batch Size Scaling Batch Size: To evaluate the robustness of our approach for different batch sizes, we ablate the generation with batch size upto 128 samples. As illustrated in Fig.9, our method maintains high diversity and prompt alignment even at elevated scales, demonstrating superior scalability. In contrast, Group Inference [44] exhibits diminishing diversity as the batch … view at source ↗

**Figure 10.** Figure 10: a) Faithfulness to the Prompt and Diversity Pareto front for our [PITH_FULL_IMAGE:figures/full_fig_p014_10.png] view at source ↗

**Figure 11.** Figure 11: User Study: User preferences show that our method substantially outperforms baselines in sample diversity while delivering competitive image quality and prompt following scores. We conduct a user study to subjectively evaluate the samples generated from our method against three baselines a) Interval Guidance, b) CFG and c) Group Inference across three dimensions: i) Image Quality, ii) Prompt Alignment… view at source ↗

read the original abstract

State-of-the-art flow models generate stunning images from text or image prompts. However, they suffer from diversity collapse when generating multiple samples under the same conditioning. Existing methods address this issue via either latent guidance, which has limited effectiveness, or sample selection, which relies on external reward models that incur significant inference-time overhead. In this work, we introduce an efficient, training-free self-guidance mechanism to mitigate diversity collapse without requiring additional reward models. Specifically, we disperse the internal features of the flow model during batch generation with feature self-guidance. Further, to keep the features close to the manifold, we introduce a manifold regularization step that projects these dispersed features back onto the data manifold, ensuring diverse generation without sacrificing alignment with the input conditions. Our method integrates seamlessly as a plug-and-play module into pretrained flow models, adding only a marginal inference cost. Experiments demonstrate significant improvements in diversity while preserving fidelity across several conditional flow models, including multi-step and few-step text-to-image, depth-to-image, and reference image generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper offers a training-free plug-in for flow models that disperses internal features then projects them back to the manifold to fight diversity collapse, but the abstract supplies zero metrics or checks on whether the projection preserves alignment.

read the letter

The one thing to know is that this work claims an inference-time fix for diversity collapse in conditional flow models by spreading out internal features during batch generation and then applying a manifold regularization step to snap them back. It positions itself as simpler than latent guidance or reward-model selection because it needs no extra training or external models and adds only marginal cost.

What is actually new is the specific pairing of feature dispersion with that projection step, framed as a self-guidance mechanism that has not been reported before for flow models. The paper does a reasonable job laying out the practical problem and showing how the method could slot into existing pipelines for text-to-image, depth-to-image, and reference-image tasks.

The soft spots are clear from the abstract. It asserts significant improvements in diversity while keeping fidelity and condition alignment, yet provides no numbers, baselines, ablations, or protocol. The manifold regularization is described at a high level with no explicit form for the projection operator and no analysis or measurements showing that dispersed features stay on the learned manifold or retain condition match after projection. The stress-test concern lands: without those checks the central guarantee is unverified. If the full paper contains solid quantitative results with conditional metrics before and after the step, that would change the picture; on the supplied text it does not.

This is aimed at people already working with flow-based generative models in computer vision who need a lightweight diversity boost at sampling time. A reader in that niche could extract the method description and try it, but only if the experiments hold up.

I would send it to peer review so referees can examine the actual results and the projection details.

Referee Report

1 major / 1 minor

Summary. The paper claims that pretrained flow models suffer from diversity collapse under repeated conditioning, and introduces a training-free 'feature self-guidance' mechanism that disperses internal features during batch generation together with a manifold-regularization projection step that returns the dispersed features to the learned data manifold. This is asserted to increase sample diversity while preserving both fidelity and condition alignment, at only marginal extra inference cost, and to integrate as a plug-and-play module into existing multi-step and few-step conditional flow models (text-to-image, depth-to-image, reference-image). Experiments are said to demonstrate significant improvements across these settings without external reward models.

Significance. If the method and its empirical support hold, the contribution would be practically useful: an efficient, training-free, reward-model-free way to mitigate a known failure mode of flow-based generators. The plug-and-play character and low overhead distinguish it from latent-guidance or post-hoc selection baselines.

major comments (1)

[Abstract and Method] Abstract / Method description: the central claim that 'the manifold regularization step ... projects these dispersed features back onto the data manifold, ensuring diverse generation without sacrificing alignment with the input conditions' is load-bearing yet unsupported. No explicit form of the projection operator is supplied, no analysis shows that the projected features remain on the learned manifold or preserve conditional alignment, and no quantitative conditional metrics (e.g., CLIP-score, FID conditioned on prompt, or before/after ablation) are referenced to verify the 'without sacrificing alignment' guarantee.

minor comments (1)

[Abstract] The abstract states 'significant improvements in diversity while preserving fidelity' but supplies neither the concrete metrics, baselines, nor experimental protocol; these details should appear in the abstract or a summary table for immediate verifiability.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment point by point below.

read point-by-point responses

Referee: [Abstract and Method] Abstract / Method description: the central claim that 'the manifold regularization step ... projects these dispersed features back onto the data manifold, ensuring diverse generation without sacrificing alignment with the input conditions' is load-bearing yet unsupported. No explicit form of the projection operator is supplied, no analysis shows that the projected features remain on the learned manifold or preserve conditional alignment, and no quantitative conditional metrics (e.g., CLIP-score, FID conditioned on prompt, or before/after ablation) are referenced to verify the 'without sacrificing alignment' guarantee.

Authors: We agree that the claim regarding the manifold regularization step is central to the contribution and that the current manuscript provides insufficient support for it. The abstract and method description state the intended effect but do not supply an explicit mathematical form for the projection operator, nor do they include a dedicated analysis or before/after quantitative metrics (such as CLIP-score or conditioned FID) to verify preservation of conditional alignment. We will revise the method section to include the explicit definition of the operator, add a short analysis of its effect on the learned manifold, and incorporate the requested quantitative conditional metrics and ablations in the experiments. revision: yes

Circularity Check

0 steps flagged

No circularity: method is introduced as new plug-and-play components without reduction to fitted inputs or self-citations.

full rationale

The paper presents feature self-guidance and manifold regularization as newly introduced mechanisms for dispersing features and projecting back to the manifold. The abstract and description contain no equations, no fitted parameters renamed as predictions, and no load-bearing self-citations or uniqueness theorems that reduce the central claim to prior inputs by construction. The approach is described as training-free and plug-and-play, with claimed benefits supported by experiments rather than any self-referential derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, background axioms, or new postulated entities.

pith-pipeline@v0.9.1-grok · 5730 in / 1037 out tokens · 32919 ms · 2026-06-26T05:05:47.547354+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

69 extracted references · 16 linked inside Pith

[1]

Alaluf, Y., Richardson, E., Metzer, G., Cohen-Or, D.: A neural space-time repre- sentationfortext-to-imagepersonalization.ACMTransactionsonGraphics(TOG) 42(6), 1–10 (2023)

2023
[2]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Avrahami, O., Patashnik, O., Fried, O., Nemchinov, E., Aberman, K., Lischin- ski, D., Cohen-Or, D.: Stable flow: Vital layers for training-free image editing. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 7877–7888 (2025)

2025
[3]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Bansal, A., Chu, H.M., Schwarzschild, A., Sengupta, S., Goldblum, M., Geiping, J., Goldstein, T.: Universal guidance for diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 843–852 (2023)

2023
[4]

In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025),https: //openreview.net/forum?id=RjN1LYymST

Berrada, T., Romero-Soriano, A., Drozdzal, M., Verbeek, J., Alahari, K.: En- tropy rectifying guidance for diffusion and flow models. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025),https: //openreview.net/forum?id=RjN1LYymST

2025
[5]

In: Proceedings of the IEEE/CVF international conference on computer vision

Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 9650–9660 (2021)

2021
[6]

arXiv preprint arXiv:2310.13102 (2023)

Corso, G., Xu, Y., De Bortoli, V., Barzilay, R., Jaakkola, T.: Particle guidance: non-iid diverse sampling with diffusion models. arXiv preprint arXiv:2310.13102 (2023)

arXiv 2023
[7]

Advances in neural information processing systems34, 8780–8794 (2021)

Dhariwal, P., Nichol, A.: Diffusion models beat gans on image synthesis. Advances in neural information processing systems34, 8780–8794 (2021)

2021
[8]

Transformer Circuits Thread (2021), https://transformer-circuits.pub/2021/framework/index.html

Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., DasSarma, N., Drain, D., Ganguli, D., Hatfield- Dodds, Z., Hernandez, D., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., Olah, C.: A mathemat- ical framework for transformer circ...

2021
[9]

Advances in Neural Information Processing Systems 36, 16222–16239 (2023)

Epstein, D., Jabri, A., Poole, B., Efros, A., Holynski, A.: Diffusion self-guidance for controllable image generation. Advances in Neural Information Processing Systems 36, 16222–16239 (2023)

2023
[10]

In: Forty-first international conference on machine learning (2024) Don’t Settle at the Mode! 17

Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: Forty-first international conference on machine learning (2024) Don’t Settle at the Mode! 17

2024
[11]

Eyring, L., Karthik, S., Dosovitskiy, A., Ruiz, N., Akata, Z.: Noise hypernetworks: Amortizingtest-timecomputeindiffusionmodels.arXivpreprintarXiv:2508.09968 (2025)

arXiv 2025
[12]

Advances in Neural Information Processing Systems37, 125487–125519 (2024)

Eyring, L., Karthik, S., Roth, K., Dosovitskiy, A., Akata, Z.: Reno: Enhancing one-step text-to-image models through reward-based noise optimization. Advances in Neural Information Processing Systems37, 125487–125519 (2024)

2024
[13]

arXiv preprint arXiv:2210.02410 (2022)

Friedman, D., Dieng, A.B.: The vendi score: A diversity evaluation metric for machine learning. arXiv preprint arXiv:2210.02410 (2022)

arXiv 2022
[14]

arXiv preprint arXiv:2306.09344 (2023)

Fu,S.,Tamir,N.,Sundaram,S.,Chai,L.,Zhang,R.,Dekel,T.,Isola,P.:Dreamsim: Learning new dimensions of human visual similarity using synthetic data. arXiv preprint arXiv:2306.09344 (2023)

Pith/arXiv arXiv 2023
[15]

Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A.H., Chechik, G., Cohen-Or, D.: An image is worth one word: Personalizing text-to-image genera- tionusingtextualinversion.In:TheEleventhInternationalConferenceonLearning Representations (2023),https://openreview.net/forum?id=NAQvF08TcyG

2023
[16]

ACM Transactions on Graphics (TOG)42(4), 1–13 (2023)

Gal, R., Arar, M., Atzmon, Y., Bermano, A.H., Chechik, G., Cohen-Or, D.: Encoder-based domain tuning for fast personalization of text-to-image models. ACM Transactions on Graphics (TOG)42(4), 1–13 (2023)

2023
[17]

Advances in Neural Information Processing Systems36, 52132–52152 (2023)

Ghosh, D., Hajishirzi, H., Schmidt, L.: Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems36, 52132–52152 (2023)

2023
[18]

arXiv preprint arXiv:2601.00090 (2025)

Harrington, A., Koepke, A., Karthik, S., Darrell, T., Efros, A.A.: It’s never too late: Noise optimization for collapse recovery in trained diffusion models. arXiv preprint arXiv:2601.00090 (2025)

Pith/arXiv arXiv 2025
[19]

In: Proceedings of the 2021 conference on empirical methods in natural language processing

Hessel, J., Holtzman, A., Forbes, M., Le Bras, R., Choi, Y.: Clipscore: A reference- free evaluation metric for image captioning. In: Proceedings of the 2021 conference on empirical methods in natural language processing. pp. 7514–7528 (2021)

2021
[20]

Advances in neural information processing systems30(2017)

Heusel,M.,Ramsauer,H.,Unterthiner,T.,Nessler,B.,Hochreiter,S.:Ganstrained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems30(2017)

2017
[21]

Advances in neural information processing systems33, 6840–6851 (2020)

Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in neural information processing systems33, 6840–6851 (2020)

2020
[22]

arXiv preprint arXiv:2207.12598 (2022)

Ho, J., Salimans, T.: Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598 (2022)

Pith/arXiv arXiv 2022
[23]

arXiv preprint arXiv:2403.05135 (2024)

Hu, X., Wang, R., Fang, Y., Fu, B., Cheng, P., Yu, G.: Ella: Equip diffusion models with llm for enhanced semantic alignment. arXiv preprint arXiv:2403.05135 (2024)

Pith/arXiv arXiv 2024
[24]

arXiv preprint arXiv:2506.10173 (2025), https://arxiv.org/abs/2506.10173

Jalali, M., Lei, H., Gohari, A., Farnia, F.: Sparke: Scalable prompt-aware diversity guidance in diffusion models via rke score. arXiv preprint arXiv:2506.10173 (2025), https://arxiv.org/abs/2506.10173

arXiv 2025
[25]

arXiv preprint arXiv:1710.10196 (2017)

Karras, T., Aila, T., Laine, S., Lehtinen, J.: Progressive growing of gans for im- proved quality, stability, and variation. arXiv preprint arXiv:1710.10196 (2017)

Pith/arXiv arXiv 2017
[26]

arXiv preprint arXiv:2511.20647 (2025)

Kazimi, T., Dunlop, C., Yanardag, P.: Diverse video generation with determinantal point process-guided policy optimization. arXiv preprint arXiv:2511.20647 (2025)

arXiv 2025
[27]

In: Proceedings of the IEEE/CVF International ConferenceonComputerVision(ICCV)Workshops.pp.6506–6516(October2025)

Khan, M.A.H., Jain, Y., Bhattacharyya, S., Vineet, V.: Test-time prompt refine- ment for text-to-image models. In: Proceedings of the IEEE/CVF International ConferenceonComputerVision(ICCV)Workshops.pp.6506–6516(October2025)
[28]

arXiv preprint arXiv:2510.03813 (2025)

Kim, B., Um, S., Ye, J.C.: Diverse text-to-image generation via contrastive noise optimization. arXiv preprint arXiv:2510.03813 (2025)

arXiv 2025
[29]

arXiv preprint arXiv:2503.19385 (2025) 18 P.Bhat et al

Kim, J., Yoon, T., Hwang, J., Sung, M.: Inference-time scaling for flow models via stochastic generation and rollover budget forcing. arXiv preprint arXiv:2503.19385 (2025) 18 P.Bhat et al

arXiv 2025
[30]

arXiv preprint arXiv:2410.06025 (2024)

Kirchhof, M., Thornton, J., Béthune, L., Ablin, P., Ndiaye, E., Cuturi, M.: Shielded diffusion: Generating novel and diverse images using sparse repellency. arXiv preprint arXiv:2410.06025 (2024)

arXiv 2024
[31]

Advances in neural information processing systems36, 36652–36663 (2023)

Kirstain,Y.,Polyak,A.,Singer,U.,Matiana,S.,Penna,J.,Levy,O.:Pick-a-pic:An open dataset of user preferences for text-to-image generation. Advances in neural information processing systems36, 36652–36663 (2023)

2023
[32]

arXiv preprint arXiv:2210.10960 (2022)

Kwon, M., Jeong, J., Uh, Y.: Diffusion models already have a semantic latent space. arXiv preprint arXiv:2210.10960 (2022)

arXiv 2022
[33]

Advances in Neural Information Processing Systems37, 122458– 122483 (2024)

Kynkäänniemi, T., Aittala, M., Karras, T., Laine, S., Aila, T., Lehtinen, J.: Ap- plying guidance in a limited interval improves sample and distribution quality in diffusion models. Advances in Neural Information Processing Systems37, 122458– 122483 (2024)

2024
[34]

Labs, B.F.: Flux.https://github.com/black-forest-labs/flux(2024)

2024
[35]

Labs, B.F.: FLUX.2: Frontier Visual Intelligence.https://bfl.ai/blog/flux-2 (2025)

2025
[36]

In: European conference on computer vision

Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: European conference on computer vision. pp. 740–755. Springer (2014)

2014
[37]

arXiv preprint arXiv:2210.02747 (2022)

Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. arXiv preprint arXiv:2210.02747 (2022)

Pith/arXiv arXiv 2022
[38]

arXiv preprint arXiv:2209.03003 (2022)

Liu, X., Gong, C., Liu, Q.: Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003 (2022)

Pith/arXiv arXiv 2022
[39]

In: Proceedings of International Conference on Computer Vision (ICCV) (December 2015)

Liu, Z., Luo, P., Wang, X., Tang, X.: Deep learning face attributes in the wild. In: Proceedings of International Conference on Computer Vision (ICCV) (December 2015)

2015
[40]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Luo, G., Granskog, J., Holynski, A., Darrell, T.: Dual-process image generation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 17972–17983 (2025)

2025
[41]

arXiv preprint arXiv:2501.09732 (2025)

Ma, N., Tong, S., Jia, H., Hu, H., Su, Y.C., Zhang, M., Yang, X., Li, Y., Jaakkola, T., Jia, X., et al.: Inference-time scaling for diffusion models beyond scaling de- noising steps. arXiv preprint arXiv:2501.09732 (2025)

Pith/arXiv arXiv 2025
[42]

arXiv preprint arXiv:2304.07193 (2023)

Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)

Pith/arXiv arXiv 2023
[43]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Parihar, R., Bhat, A., Basu, A., Mallick, S., Kundu, J.N., Babu, R.V.: Balanc- ing act: Distribution-guided debiasing in diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 6668–6678 (2024)

2024
[44]

arXiv preprint arXiv:2508.15773 (2025)

Parmar, G., Patashnik, O., Ostashev, D., Wang, K.C., Aberman, K., Narasimhan, S., Zhu, J.Y.: Scaling group inference for diverse and high-quality generation. arXiv preprint arXiv:2508.15773 (2025)

arXiv 2025
[45]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Parmar, G., Zhang, R., Zhu, J.Y.: On aliased resizing and surprising subtleties in gan evaluation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 11410–11420 (2022)

2022
[46]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Pizzi, E., Roy, S.D., Ravindra, S.N., Goyal, P., Douze, M.: A self-supervised de- scriptor for image copy detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14532–14542 (2022)

2022
[47]

In: International conference on machine learning

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PmLR (2021) Don’t Settle at the Mode! 19

2021
[48]

Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., Aberman, K.: Dream- booth:Finetuningtext-to-imagediffusionmodelsforsubject-drivengeneration.In: Proceedings of the IEEE/CVF conference on computer vision and pattern recog- nition. pp. 22500–22510 (2023)

2023
[49]

arXiv preprint arXiv:2310.17347 (2023)

Sadat, S., Buhmann, J., Bradley, D., Hilliges, O., Weber, R.M.: Cads: Unleash- ing the diversity of diffusion models through condition-annealed sampling. arXiv preprint arXiv:2310.17347 (2023)

arXiv 2023
[50]

Singh, J., Leng, X., Wu, Z., Zheng, L., Zhang, R., Shechtman, E., Xie, S.: What matters for representation alignment: Global information or spatial structure? arXiv preprint arXiv:2512.10794 (2025)

arXiv 2025
[51]

arXiv preprint arXiv:2408.03314 (2024)

Snell, C., Lee, J., Xu, K., Kumar, A.: Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314 (2024)

Pith/arXiv arXiv 2024
[52]

arXiv preprint arXiv:2010.02502 (2020)

Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020)

Pith/arXiv arXiv 2010
[53]

arXiv preprint arXiv:2011.13456 (2020)

Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score- based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456 (2020)

Pith/arXiv arXiv 2011
[54]

arXiv preprint arXiv:2405.18881 (2024)

Tang, Z., Peng, J., Tang, J., Hong, M., Wang, F., Chang, T.H.: Inference- time alignment of diffusion models with direct noise optimization. arXiv preprint arXiv:2405.18881 (2024)

arXiv 2024
[55]

arXiv preprint arXiv:2601.16208 (2026)

Tong, S., Zheng, B., Wang, Z., Tang, B., Ma, N., Brown, E., Yang, J., Fergus, R., LeCun,Y.,Xie,S.:Scalingtext-to-imagediffusiontransformerswithrepresentation autoencoders. arXiv preprint arXiv:2601.16208 (2026)

arXiv 2026
[56]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Um, S., Ye, J.C.: Minority-focused text-to-image generation via prompt optimiza- tion. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 20926–20936 (2025)

2025
[57]

In: ACM SIGGRAPH 2023 conference proceedings

Voynov, A., Aberman, K., Cohen-Or, D.: Sketch-guided text-to-image diffusion models. In: ACM SIGGRAPH 2023 conference proceedings. pp. 1–11 (2023)

2023
[58]

arXiv preprint arXiv:2506.09027 (2025)

Wang, R., He, K.: Diffuse and disperse: Image generation with representation reg- ularization. arXiv preprint arXiv:2506.09027 (2025)

arXiv 2025
[59]

Advances in neural information processing systems35, 24824–24837 (2022)

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems35, 24824–24837 (2022)

2022
[60]

arXiv preprint arXiv:2508.02324 (2025)

Wu, C., Li, J., Zhou, J., Lin, J., Gao, K., Yan, K., Yin, S.m., Bai, S., Xu, X., Chen, Y., et al.: Qwen-image technical report. arXiv preprint arXiv:2508.02324 (2025)

Pith/arXiv arXiv 2025
[61]

arXiv preprint arXiv:2306.09341 (2023)

Wu, X., Hao, Y., Sun, K., Chen, Y., Zhu, F., Zhao, R., Li, H.: Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341 (2023)

Pith/arXiv arXiv 2023
[62]

arXiv preprint arXiv:2501.18427 (2025)

Xie,E.,Chen,J.,Zhao,Y.,Yu,J.,Zhu,L.,Wu,C.,Lin,Y.,Zhang,Z.,Li,M.,Chen, J., et al.: Sana 1.5: Efficient scaling of training-time and inference-time compute in linear diffusion transformer. arXiv preprint arXiv:2501.18427 (2025)

arXiv 2025
[63]

Advances in Neural Information Processing Systems37, 21875–21911 (2024)

Yang, L., Kang, B., Huang, Z., Zhao, Z., Xu, X., Feng, J., Zhao, H.: Depth anything v2. Advances in Neural Information Processing Systems37, 21875–21911 (2024)

2024
[64]

arXiv preprint arXiv:2410.06940 (2024)

Yu, S., Kwak, S., Jang, H., Jeong, J., Huang, J., Shin, J., Xie, S.: Representation alignment for generation: Training diffusion transformers is easier than you think. arXiv preprint arXiv:2410.06940 (2024)

Pith/arXiv arXiv 2024
[65]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Yun, T., Zhang, D., Park, J., Pan, L.: Learning to sample effective and diverse prompts for text-to-image generation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 23625–23635 (2025) 20 P.Bhat et al

2025
[66]

In: The Thirteenth International Conference on Learning Representations (2025)

Zhao, B.N., Xiao, Y., Xu, J., Jiang, X., Yang, Y., Li, D., Itti, L., Vineet, V., Ge, Y.: Dreamdistribution: Learning prompt distribution for diverse in-distribution gen- eration. In: The Thirteenth International Conference on Learning Representations (2025)

2025
[67]

arXiv preprint arXiv:2510.11690 (2025)

Zheng, B., Ma, N., Tong, S., Xie, S.: Diffusion transformers with representation autoencoders. arXiv preprint arXiv:2510.11690 (2025)

Pith/arXiv arXiv 2025
[68]

arXiv preprint arXiv:2512.24176 (2025)

Zhou, X., Li, Q., Hu, X., Chen, H., Gu, S.: Guiding a diffusion transformer with the internal dynamics of itself. arXiv preprint arXiv:2512.24176 (2025)

arXiv 2025
[69]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

Zhuo, L., Zhao, L., Paul, S., Liao, Y., Zhang, R., Xin, Y., Gao, P., Elhoseiny, M., Li, H.: From reflection to perfection: Scaling inference-time optimization for text- to-image diffusion models via reflection tuning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 15329–15339 (October 2025) Don’t Settle at the Mode...

2025

[1] [1]

Alaluf, Y., Richardson, E., Metzer, G., Cohen-Or, D.: A neural space-time repre- sentationfortext-to-imagepersonalization.ACMTransactionsonGraphics(TOG) 42(6), 1–10 (2023)

2023

[2] [2]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Avrahami, O., Patashnik, O., Fried, O., Nemchinov, E., Aberman, K., Lischin- ski, D., Cohen-Or, D.: Stable flow: Vital layers for training-free image editing. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 7877–7888 (2025)

2025

[3] [3]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Bansal, A., Chu, H.M., Schwarzschild, A., Sengupta, S., Goldblum, M., Geiping, J., Goldstein, T.: Universal guidance for diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 843–852 (2023)

2023

[4] [4]

In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025),https: //openreview.net/forum?id=RjN1LYymST

Berrada, T., Romero-Soriano, A., Drozdzal, M., Verbeek, J., Alahari, K.: En- tropy rectifying guidance for diffusion and flow models. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025),https: //openreview.net/forum?id=RjN1LYymST

2025

[5] [5]

In: Proceedings of the IEEE/CVF international conference on computer vision

Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 9650–9660 (2021)

2021

[6] [6]

arXiv preprint arXiv:2310.13102 (2023)

Corso, G., Xu, Y., De Bortoli, V., Barzilay, R., Jaakkola, T.: Particle guidance: non-iid diverse sampling with diffusion models. arXiv preprint arXiv:2310.13102 (2023)

arXiv 2023

[7] [7]

Advances in neural information processing systems34, 8780–8794 (2021)

Dhariwal, P., Nichol, A.: Diffusion models beat gans on image synthesis. Advances in neural information processing systems34, 8780–8794 (2021)

2021

[8] [8]

Transformer Circuits Thread (2021), https://transformer-circuits.pub/2021/framework/index.html

Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., DasSarma, N., Drain, D., Ganguli, D., Hatfield- Dodds, Z., Hernandez, D., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., Olah, C.: A mathemat- ical framework for transformer circ...

2021

[9] [9]

Advances in Neural Information Processing Systems 36, 16222–16239 (2023)

Epstein, D., Jabri, A., Poole, B., Efros, A., Holynski, A.: Diffusion self-guidance for controllable image generation. Advances in Neural Information Processing Systems 36, 16222–16239 (2023)

2023

[10] [10]

In: Forty-first international conference on machine learning (2024) Don’t Settle at the Mode! 17

Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: Forty-first international conference on machine learning (2024) Don’t Settle at the Mode! 17

2024

[11] [11]

Eyring, L., Karthik, S., Dosovitskiy, A., Ruiz, N., Akata, Z.: Noise hypernetworks: Amortizingtest-timecomputeindiffusionmodels.arXivpreprintarXiv:2508.09968 (2025)

arXiv 2025

[12] [12]

Advances in Neural Information Processing Systems37, 125487–125519 (2024)

Eyring, L., Karthik, S., Roth, K., Dosovitskiy, A., Akata, Z.: Reno: Enhancing one-step text-to-image models through reward-based noise optimization. Advances in Neural Information Processing Systems37, 125487–125519 (2024)

2024

[13] [13]

arXiv preprint arXiv:2210.02410 (2022)

Friedman, D., Dieng, A.B.: The vendi score: A diversity evaluation metric for machine learning. arXiv preprint arXiv:2210.02410 (2022)

arXiv 2022

[14] [14]

arXiv preprint arXiv:2306.09344 (2023)

Fu,S.,Tamir,N.,Sundaram,S.,Chai,L.,Zhang,R.,Dekel,T.,Isola,P.:Dreamsim: Learning new dimensions of human visual similarity using synthetic data. arXiv preprint arXiv:2306.09344 (2023)

Pith/arXiv arXiv 2023

[15] [15]

Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A.H., Chechik, G., Cohen-Or, D.: An image is worth one word: Personalizing text-to-image genera- tionusingtextualinversion.In:TheEleventhInternationalConferenceonLearning Representations (2023),https://openreview.net/forum?id=NAQvF08TcyG

2023

[16] [16]

ACM Transactions on Graphics (TOG)42(4), 1–13 (2023)

Gal, R., Arar, M., Atzmon, Y., Bermano, A.H., Chechik, G., Cohen-Or, D.: Encoder-based domain tuning for fast personalization of text-to-image models. ACM Transactions on Graphics (TOG)42(4), 1–13 (2023)

2023

[17] [17]

Advances in Neural Information Processing Systems36, 52132–52152 (2023)

Ghosh, D., Hajishirzi, H., Schmidt, L.: Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems36, 52132–52152 (2023)

2023

[18] [18]

arXiv preprint arXiv:2601.00090 (2025)

Harrington, A., Koepke, A., Karthik, S., Darrell, T., Efros, A.A.: It’s never too late: Noise optimization for collapse recovery in trained diffusion models. arXiv preprint arXiv:2601.00090 (2025)

Pith/arXiv arXiv 2025

[19] [19]

In: Proceedings of the 2021 conference on empirical methods in natural language processing

Hessel, J., Holtzman, A., Forbes, M., Le Bras, R., Choi, Y.: Clipscore: A reference- free evaluation metric for image captioning. In: Proceedings of the 2021 conference on empirical methods in natural language processing. pp. 7514–7528 (2021)

2021

[20] [20]

Advances in neural information processing systems30(2017)

Heusel,M.,Ramsauer,H.,Unterthiner,T.,Nessler,B.,Hochreiter,S.:Ganstrained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems30(2017)

2017

[21] [21]

Advances in neural information processing systems33, 6840–6851 (2020)

Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in neural information processing systems33, 6840–6851 (2020)

2020

[22] [22]

arXiv preprint arXiv:2207.12598 (2022)

Ho, J., Salimans, T.: Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598 (2022)

Pith/arXiv arXiv 2022

[23] [23]

arXiv preprint arXiv:2403.05135 (2024)

Hu, X., Wang, R., Fang, Y., Fu, B., Cheng, P., Yu, G.: Ella: Equip diffusion models with llm for enhanced semantic alignment. arXiv preprint arXiv:2403.05135 (2024)

Pith/arXiv arXiv 2024

[24] [24]

arXiv preprint arXiv:2506.10173 (2025), https://arxiv.org/abs/2506.10173

Jalali, M., Lei, H., Gohari, A., Farnia, F.: Sparke: Scalable prompt-aware diversity guidance in diffusion models via rke score. arXiv preprint arXiv:2506.10173 (2025), https://arxiv.org/abs/2506.10173

arXiv 2025

[25] [25]

arXiv preprint arXiv:1710.10196 (2017)

Karras, T., Aila, T., Laine, S., Lehtinen, J.: Progressive growing of gans for im- proved quality, stability, and variation. arXiv preprint arXiv:1710.10196 (2017)

Pith/arXiv arXiv 2017

[26] [26]

arXiv preprint arXiv:2511.20647 (2025)

Kazimi, T., Dunlop, C., Yanardag, P.: Diverse video generation with determinantal point process-guided policy optimization. arXiv preprint arXiv:2511.20647 (2025)

arXiv 2025

[27] [27]

In: Proceedings of the IEEE/CVF International ConferenceonComputerVision(ICCV)Workshops.pp.6506–6516(October2025)

Khan, M.A.H., Jain, Y., Bhattacharyya, S., Vineet, V.: Test-time prompt refine- ment for text-to-image models. In: Proceedings of the IEEE/CVF International ConferenceonComputerVision(ICCV)Workshops.pp.6506–6516(October2025)

[28] [28]

arXiv preprint arXiv:2510.03813 (2025)

Kim, B., Um, S., Ye, J.C.: Diverse text-to-image generation via contrastive noise optimization. arXiv preprint arXiv:2510.03813 (2025)

arXiv 2025

[29] [29]

arXiv preprint arXiv:2503.19385 (2025) 18 P.Bhat et al

Kim, J., Yoon, T., Hwang, J., Sung, M.: Inference-time scaling for flow models via stochastic generation and rollover budget forcing. arXiv preprint arXiv:2503.19385 (2025) 18 P.Bhat et al

arXiv 2025

[30] [30]

arXiv preprint arXiv:2410.06025 (2024)

Kirchhof, M., Thornton, J., Béthune, L., Ablin, P., Ndiaye, E., Cuturi, M.: Shielded diffusion: Generating novel and diverse images using sparse repellency. arXiv preprint arXiv:2410.06025 (2024)

arXiv 2024

[31] [31]

Advances in neural information processing systems36, 36652–36663 (2023)

Kirstain,Y.,Polyak,A.,Singer,U.,Matiana,S.,Penna,J.,Levy,O.:Pick-a-pic:An open dataset of user preferences for text-to-image generation. Advances in neural information processing systems36, 36652–36663 (2023)

2023

[32] [32]

arXiv preprint arXiv:2210.10960 (2022)

Kwon, M., Jeong, J., Uh, Y.: Diffusion models already have a semantic latent space. arXiv preprint arXiv:2210.10960 (2022)

arXiv 2022

[33] [33]

Advances in Neural Information Processing Systems37, 122458– 122483 (2024)

Kynkäänniemi, T., Aittala, M., Karras, T., Laine, S., Aila, T., Lehtinen, J.: Ap- plying guidance in a limited interval improves sample and distribution quality in diffusion models. Advances in Neural Information Processing Systems37, 122458– 122483 (2024)

2024

[34] [34]

Labs, B.F.: Flux.https://github.com/black-forest-labs/flux(2024)

2024

[35] [35]

Labs, B.F.: FLUX.2: Frontier Visual Intelligence.https://bfl.ai/blog/flux-2 (2025)

2025

[36] [36]

In: European conference on computer vision

Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: European conference on computer vision. pp. 740–755. Springer (2014)

2014

[37] [37]

arXiv preprint arXiv:2210.02747 (2022)

Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. arXiv preprint arXiv:2210.02747 (2022)

Pith/arXiv arXiv 2022

[38] [38]

arXiv preprint arXiv:2209.03003 (2022)

Liu, X., Gong, C., Liu, Q.: Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003 (2022)

Pith/arXiv arXiv 2022

[39] [39]

In: Proceedings of International Conference on Computer Vision (ICCV) (December 2015)

Liu, Z., Luo, P., Wang, X., Tang, X.: Deep learning face attributes in the wild. In: Proceedings of International Conference on Computer Vision (ICCV) (December 2015)

2015

[40] [40]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Luo, G., Granskog, J., Holynski, A., Darrell, T.: Dual-process image generation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 17972–17983 (2025)

2025

[41] [41]

arXiv preprint arXiv:2501.09732 (2025)

Ma, N., Tong, S., Jia, H., Hu, H., Su, Y.C., Zhang, M., Yang, X., Li, Y., Jaakkola, T., Jia, X., et al.: Inference-time scaling for diffusion models beyond scaling de- noising steps. arXiv preprint arXiv:2501.09732 (2025)

Pith/arXiv arXiv 2025

[42] [42]

arXiv preprint arXiv:2304.07193 (2023)

Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)

Pith/arXiv arXiv 2023

[43] [43]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Parihar, R., Bhat, A., Basu, A., Mallick, S., Kundu, J.N., Babu, R.V.: Balanc- ing act: Distribution-guided debiasing in diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 6668–6678 (2024)

2024

[44] [44]

arXiv preprint arXiv:2508.15773 (2025)

Parmar, G., Patashnik, O., Ostashev, D., Wang, K.C., Aberman, K., Narasimhan, S., Zhu, J.Y.: Scaling group inference for diverse and high-quality generation. arXiv preprint arXiv:2508.15773 (2025)

arXiv 2025

[45] [45]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Parmar, G., Zhang, R., Zhu, J.Y.: On aliased resizing and surprising subtleties in gan evaluation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 11410–11420 (2022)

2022

[46] [46]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Pizzi, E., Roy, S.D., Ravindra, S.N., Goyal, P., Douze, M.: A self-supervised de- scriptor for image copy detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14532–14542 (2022)

2022

[47] [47]

In: International conference on machine learning

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PmLR (2021) Don’t Settle at the Mode! 19

2021

[48] [48]

Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., Aberman, K.: Dream- booth:Finetuningtext-to-imagediffusionmodelsforsubject-drivengeneration.In: Proceedings of the IEEE/CVF conference on computer vision and pattern recog- nition. pp. 22500–22510 (2023)

2023

[49] [49]

arXiv preprint arXiv:2310.17347 (2023)

Sadat, S., Buhmann, J., Bradley, D., Hilliges, O., Weber, R.M.: Cads: Unleash- ing the diversity of diffusion models through condition-annealed sampling. arXiv preprint arXiv:2310.17347 (2023)

arXiv 2023

[50] [50]

Singh, J., Leng, X., Wu, Z., Zheng, L., Zhang, R., Shechtman, E., Xie, S.: What matters for representation alignment: Global information or spatial structure? arXiv preprint arXiv:2512.10794 (2025)

arXiv 2025

[51] [51]

arXiv preprint arXiv:2408.03314 (2024)

Snell, C., Lee, J., Xu, K., Kumar, A.: Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314 (2024)

Pith/arXiv arXiv 2024

[52] [52]

arXiv preprint arXiv:2010.02502 (2020)

Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020)

Pith/arXiv arXiv 2010

[53] [53]

arXiv preprint arXiv:2011.13456 (2020)

Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score- based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456 (2020)

Pith/arXiv arXiv 2011

[54] [54]

arXiv preprint arXiv:2405.18881 (2024)

Tang, Z., Peng, J., Tang, J., Hong, M., Wang, F., Chang, T.H.: Inference- time alignment of diffusion models with direct noise optimization. arXiv preprint arXiv:2405.18881 (2024)

arXiv 2024

[55] [55]

arXiv preprint arXiv:2601.16208 (2026)

Tong, S., Zheng, B., Wang, Z., Tang, B., Ma, N., Brown, E., Yang, J., Fergus, R., LeCun,Y.,Xie,S.:Scalingtext-to-imagediffusiontransformerswithrepresentation autoencoders. arXiv preprint arXiv:2601.16208 (2026)

arXiv 2026

[56] [56]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Um, S., Ye, J.C.: Minority-focused text-to-image generation via prompt optimiza- tion. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 20926–20936 (2025)

2025

[57] [57]

In: ACM SIGGRAPH 2023 conference proceedings

Voynov, A., Aberman, K., Cohen-Or, D.: Sketch-guided text-to-image diffusion models. In: ACM SIGGRAPH 2023 conference proceedings. pp. 1–11 (2023)

2023

[58] [58]

arXiv preprint arXiv:2506.09027 (2025)

Wang, R., He, K.: Diffuse and disperse: Image generation with representation reg- ularization. arXiv preprint arXiv:2506.09027 (2025)

arXiv 2025

[59] [59]

Advances in neural information processing systems35, 24824–24837 (2022)

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems35, 24824–24837 (2022)

2022

[60] [60]

arXiv preprint arXiv:2508.02324 (2025)

Wu, C., Li, J., Zhou, J., Lin, J., Gao, K., Yan, K., Yin, S.m., Bai, S., Xu, X., Chen, Y., et al.: Qwen-image technical report. arXiv preprint arXiv:2508.02324 (2025)

Pith/arXiv arXiv 2025

[61] [61]

arXiv preprint arXiv:2306.09341 (2023)

Wu, X., Hao, Y., Sun, K., Chen, Y., Zhu, F., Zhao, R., Li, H.: Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341 (2023)

Pith/arXiv arXiv 2023

[62] [62]

arXiv preprint arXiv:2501.18427 (2025)

Xie,E.,Chen,J.,Zhao,Y.,Yu,J.,Zhu,L.,Wu,C.,Lin,Y.,Zhang,Z.,Li,M.,Chen, J., et al.: Sana 1.5: Efficient scaling of training-time and inference-time compute in linear diffusion transformer. arXiv preprint arXiv:2501.18427 (2025)

arXiv 2025

[63] [63]

Advances in Neural Information Processing Systems37, 21875–21911 (2024)

Yang, L., Kang, B., Huang, Z., Zhao, Z., Xu, X., Feng, J., Zhao, H.: Depth anything v2. Advances in Neural Information Processing Systems37, 21875–21911 (2024)

2024

[64] [64]

arXiv preprint arXiv:2410.06940 (2024)

Yu, S., Kwak, S., Jang, H., Jeong, J., Huang, J., Shin, J., Xie, S.: Representation alignment for generation: Training diffusion transformers is easier than you think. arXiv preprint arXiv:2410.06940 (2024)

Pith/arXiv arXiv 2024

[65] [65]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Yun, T., Zhang, D., Park, J., Pan, L.: Learning to sample effective and diverse prompts for text-to-image generation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 23625–23635 (2025) 20 P.Bhat et al

2025

[66] [66]

In: The Thirteenth International Conference on Learning Representations (2025)

Zhao, B.N., Xiao, Y., Xu, J., Jiang, X., Yang, Y., Li, D., Itti, L., Vineet, V., Ge, Y.: Dreamdistribution: Learning prompt distribution for diverse in-distribution gen- eration. In: The Thirteenth International Conference on Learning Representations (2025)

2025

[67] [67]

arXiv preprint arXiv:2510.11690 (2025)

Zheng, B., Ma, N., Tong, S., Xie, S.: Diffusion transformers with representation autoencoders. arXiv preprint arXiv:2510.11690 (2025)

Pith/arXiv arXiv 2025

[68] [68]

arXiv preprint arXiv:2512.24176 (2025)

Zhou, X., Li, Q., Hu, X., Chen, H., Gu, S.: Guiding a diffusion transformer with the internal dynamics of itself. arXiv preprint arXiv:2512.24176 (2025)

arXiv 2025

[69] [69]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

Zhuo, L., Zhao, L., Paul, S., Liao, Y., Zhang, R., Xin, Y., Gao, P., Elhoseiny, M., Li, H.: From reflection to perfection: Scaling inference-time optimization for text- to-image diffusion models via reflection tuning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 15329–15339 (October 2025) Don’t Settle at the Mode...

2025