Evolutionary Token-Level Prompt Optimization for Diffusion Models
Pith reviewed 2026-05-10 16:58 UTC · model grok-4.3
The pith
A genetic algorithm optimizes prompts for diffusion models by evolving token vectors to improve aesthetic quality and text alignment.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a genetic algorithm can be used for prompt optimization in text-to-image diffusion models by directly evolving the token vectors. This is done by optimizing a fitness function that combines measures of aesthetic quality with prompt-image alignment. On 36 prompts from a standard dataset, this approach outperforms baseline methods such as text rewriting and random search, with fitness improvements reaching up to 23.93%. The method is adaptable to image generation models with tokenized text encoders and provides a modular framework.
What carries the argument
A genetic algorithm that directly evolves the token vectors used by the text encoder in diffusion models, guided by a fitness function that combines aesthetic quality and prompt-image alignment measures.
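As a rough sketch of this machinery (not the authors' implementation): the fitness, mutation, and selection steps can be mocked in a few lines, with toy quadratic scorers standing in for the LAION Aesthetic Predictor V2 and CLIPScore, and a tiny vector dimensionality in place of real CLIP token embeddings.

```python
import random

DIM = 8            # illustrative; real CLIP token embeddings are far larger
POP_SIZE = 12
GENERATIONS = 20
MUTATION_STD = 0.1
W = 0.5            # fitness weight: a free parameter of the method

def aesthetic_score(vec):
    # Hypothetical stand-in for LAION Aesthetic Predictor V2 on the rendered image.
    return -sum((x - 0.3) ** 2 for x in vec)

def clip_score(vec):
    # Hypothetical stand-in for CLIPScore (prompt-image alignment).
    return -sum((x + 0.2) ** 2 for x in vec)

def fitness(vec):
    # Linear combination of the two proxy scores, as in the paper's fitness function.
    return W * aesthetic_score(vec) + (1 - W) * clip_score(vec)

def mutate(vec):
    # Gaussian perturbation applied directly in token-embedding space.
    return [x + random.gauss(0.0, MUTATION_STD) for x in vec]

def crossover(a, b):
    # Uniform crossover over vector components.
    return [x if random.random() < 0.5 else y for x, y in zip(a, b)]

def evolve(seed_vec):
    pop = [mutate(seed_vec) for _ in range(POP_SIZE)]
    for _ in range(GENERATIONS):
        pop.sort(key=fitness, reverse=True)
        elite = pop[: POP_SIZE // 2]          # truncation selection with elitism
        children = [mutate(crossover(random.choice(elite), random.choice(elite)))
                    for _ in range(POP_SIZE - len(elite))]
        pop = elite + children
    return max(pop, key=fitness)

random.seed(0)
seed = [1.0] * DIM                 # token vector for one prompt position
best = evolve(seed)
```

In the real method each fitness evaluation requires a full diffusion run plus two scorer forward passes, so the per-generation cost, not the GA bookkeeping, dominates.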
Load-bearing premise
That an automated combination of aesthetic quality and prompt alignment scores reliably indicates desirable image results, and that directly changing the token vectors provides a meaningful way to explore the input space.
What would settle it
A side-by-side comparison by human judges of images from the evolved prompts versus those from the baseline methods, where the absence of consistent preference for the evolved versions would challenge the claim.
Original abstract
Text-to-image diffusion models exhibit strong generative performance but remain highly sensitive to prompt formulation, often requiring extensive manual trial and error to obtain satisfactory results. This motivates the development of automated, model-agnostic prompt optimization methods that can systematically explore the conditioning space beyond conventional text rewriting. This work investigates the use of a Genetic Algorithm (GA) for prompt optimization by directly evolving the token vectors employed by CLIP-based diffusion models. The GA optimizes a fitness function that combines aesthetic quality, measured by the LAION Aesthetic Predictor V2, with prompt-image alignment, assessed via CLIPScore. Experiments on 36 prompts from the Parti Prompts (P2) dataset show that the proposed approach outperforms the baseline methods, including Promptist and random search, achieving up to a 23.93% improvement in fitness. Overall, the method is adaptable to image generation models with tokenized text encoders and provides a modular framework for future extensions, the limitations and prospects of which are discussed.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a genetic algorithm (GA) to optimize text prompts for CLIP-based diffusion models by directly evolving token vectors in embedding space rather than rewriting text. A composite fitness function is defined using the LAION Aesthetic Predictor V2 and CLIPScore; the GA is evaluated on 36 prompts from the Parti Prompts (P2) dataset and reported to outperform Promptist and random search by up to 23.93% in fitness.
Significance. If the empirical superiority holds under proper validation, the work supplies a model-agnostic, evolutionary framework for automated prompt optimization that could reduce manual trial-and-error for any tokenized text encoder. The modular design and discussion of limitations are positive; however, the absence of human-preference correlation or embedding-validity checks limits the practical significance of the reported fitness gains.
major comments (3)
- [Experiments] Experiments section: the 23.93% fitness improvement is presented without reported variance across runs, number of independent trials, or statistical significance tests; this makes it impossible to assess whether the gains over Promptist and random search are reliable or could be explained by stochasticity in the GA or the diffusion sampler.
- [Methods] Methods, fitness definition: the linear combination of LAION Aesthetic V2 and CLIPScore is introduced without ablation on the weighting coefficients or any correlation analysis against human ratings; consequently the claim that higher fitness corresponds to subjectively better images remains unanchored.
- [Method] Token-vector evolution: no mechanism or post-hoc check is described to ensure that mutated or crossed-over vectors remain inside the support of the CLIP text encoder’s training distribution; out-of-distribution embeddings could silently degrade diffusion conditioning even while proxy scores rise.
minor comments (2)
- [Introduction] The abstract and introduction cite only a handful of prompt-optimization baselines; a more complete comparison table including recent token-level or gradient-based methods would strengthen the positioning.
- [Results] Figure captions and axis labels in the results plots should explicitly state the number of GA generations, population size, and mutation rate used for each curve.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We address each major point below and will revise the manuscript accordingly to improve its rigor and clarity.
Point-by-point responses
-
Referee: [Experiments] Experiments section: the 23.93% fitness improvement is presented without reported variance across runs, number of independent trials, or statistical significance tests; this makes it impossible to assess whether the gains over Promptist and random search are reliable or could be explained by stochasticity in the GA or the diffusion sampler.
Authors: We agree that reporting variability and statistical significance is necessary to substantiate the claims. In the revised manuscript, we will rerun all experiments over at least five independent trials with different random seeds, report mean fitness scores with standard deviations, and include paired t-tests (or Wilcoxon tests) against Promptist and random search to establish statistical significance of the observed improvements. revision: yes
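The kind of paired test the authors commit to could be sketched as follows; a permutation test on per-prompt fitness differences is one nonparametric stand-in for a Wilcoxon test, and the per-prompt scores below are invented for illustration.

```python
import random
from statistics import mean

def paired_permutation_test(a, b, n_resamples=10_000, seed=0):
    # Two-sided p-value for the mean paired difference between two methods,
    # estimated by randomly flipping the sign of each per-prompt difference.
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(mean(diffs))
    hits = 0
    for _ in range(n_resamples):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            hits += 1
    return hits / n_resamples

# Made-up per-prompt fitness scores for the GA and one baseline (not from the paper).
ga       = [0.71, 0.65, 0.80, 0.74, 0.69, 0.77, 0.72, 0.68]
baseline = [0.60, 0.62, 0.66, 0.63, 0.61, 0.65, 0.64, 0.59]

p = paired_permutation_test(ga, baseline)
```

Because the same 36 prompts are scored under every method, a paired test is the appropriate design; an unpaired comparison would waste the per-prompt structure.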
-
Referee: [Methods] Methods, fitness definition: the linear combination of LAION Aesthetic V2 and CLIPScore is introduced without ablation on the weighting coefficients or any correlation analysis against human ratings; consequently the claim that higher fitness corresponds to subjectively better images remains unanchored.
Authors: We accept that the weighting coefficients require justification. We will add an ablation study in the revised paper that varies the relative weights (e.g., 0.3/0.7, 0.5/0.5, 0.7/0.3) and reports the resulting fitness and qualitative image quality. While a comprehensive human preference study lies outside the scope of this work, we will cite prior validation of both predictors against human judgments and include a small-scale qualitative review of selected outputs by the authors in the appendix. We will also revise the text to present the fitness function explicitly as a proxy metric rather than claiming direct subjective superiority. revision: partial
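The proposed weight ablation amounts to sweeping the mixing coefficient of the composite fitness. A minimal sketch, with made-up proxy scores; note the two predictors report on different raw scales (LAION Aesthetic V2 roughly 1-10, CLIPScore roughly 0-1), which is itself a reason the weights need tuning:

```python
def combined_fitness(aesthetic, clip, w):
    # w trades off aesthetic quality against prompt-image alignment.
    return w * aesthetic + (1 - w) * clip

# Made-up scores for one image, on the predictors' rough native scales.
aesthetic, clip = 6.1, 0.28

for w in (0.3, 0.5, 0.7):   # the grid proposed in the rebuttal
    print(f"w={w}: fitness={combined_fitness(aesthetic, clip, w):.3f}")
```

With unnormalized scores the aesthetic term dominates at every weight in the grid, so normalizing each predictor before mixing may matter as much as the weight itself.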
-
Referee: [Method] Token-vector evolution: no mechanism or post-hoc check is described to ensure that mutated or crossed-over vectors remain inside the support of the CLIP text encoder’s training distribution; out-of-distribution embeddings could silently degrade diffusion conditioning even while proxy scores rise.
Authors: This is a legitimate concern. Although the fitness function is evaluated directly through the diffusion model, we will add a post-hoc analysis in the revised version that computes the average Euclidean distance of evolved token vectors to the nearest original CLIP vocabulary embeddings. We will also report any observed degradation in image coherence and discuss the risk of out-of-distribution embeddings in the limitations section, along with a simple projection heuristic that could be applied in future extensions. revision: yes
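The proposed post-hoc check reduces to a nearest-neighbor distance in embedding space. A toy sketch with hypothetical 3-d vectors standing in for the CLIP token-embedding table (real CLIP vocabularies are far larger):

```python
import math

def nearest_vocab_distance(vec, vocab):
    # Euclidean distance from an evolved token vector to its closest
    # entry in the (toy) vocabulary embedding table.
    return min(math.dist(vec, v) for v in vocab)

# Toy "vocabulary" standing in for the CLIP token-embedding table.
vocab = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]

evolved = [0.9, 0.1, 0.0]    # near a real embedding: plausibly in-distribution
drifted = [5.0, -4.0, 3.0]   # far from every embedding: flag as out-of-distribution

for name, vec in (("evolved", evolved), ("drifted", drifted)):
    print(f"{name}: nearest-vocab distance = {nearest_vocab_distance(vec, vocab):.2f}")
```

The projection heuristic mentioned in the rebuttal would follow directly: snap (or interpolate toward) the nearest vocabulary embedding whenever this distance exceeds a threshold.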
Circularity Check
No significant circularity detected.
full rationale
The paper applies a standard genetic algorithm to evolve CLIP token vectors, with fitness explicitly defined as a linear combination of two external, pre-trained predictors (LAION Aesthetic V2 and CLIPScore). Experimental claims consist of comparative fitness scores on a fixed 36-prompt subset of Parti Prompts against Promptist and random search; these are direct measurements of the stated objective rather than self-referential derivations. No equations reduce to their own inputs by construction, no load-bearing self-citations justify core premises, and no ansatz or uniqueness result is imported from prior author work. The method is therefore self-contained as an empirical optimization procedure whose outputs are evaluated against independently trained proxies.
Axiom & Free-Parameter Ledger
free parameters (2)
- Fitness combination weights
- Genetic algorithm hyperparameters
axioms (2)
- Domain assumption: directly evolving token vectors produces valid conditioning signals for the diffusion model
- Domain assumption: LAION Aesthetic Predictor V2 plus CLIPScore is a sufficient proxy for prompt quality
Reference graph
Works this paper leans on
-
[1]
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2022-June:10674–10685, 2022. ISSN 10636919. doi: 10.1109/CVPR52688.2022.01042
-
[2]
Optimizing prompts for text-to-image generation
Yaru Hao, Zewen Chi, Li Dong, and Furu Wei. Optimizing prompts for text-to-image generation. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 66923–66939. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/d346d91...
2023
-
[3]
Evolving prompts for synthetic image generation with genetic algorithm
Khoi Dinh Tran, Dat Viet Bui, and Ngoc Hoang Luong. Evolving prompts for synthetic image generation with genetic algorithm. In 2023 International Conference on Multimedia Analysis and Pattern Recognition (MAPR), pages 1–6, 2023. doi: 10.1109/MAPR59823.2023.10288925
-
[4]
Generating adversarial examples through latent space exploration of generative adversarial networks
Luana Clare and João Correia. Generating adversarial examples through latent space exploration of generative adversarial networks. In Proceedings of the Companion Conference on Genetic and Evolutionary Computation, GECCO '23 Companion, pages 1760–1767, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9798400701207. doi: 10.1145/3583133....
-
[5]
Promptcharm: Text-to-image generation through multi-modal prompting and refinement
Zhijie Wang, Yuheng Huang, Da Song, Lei Ma, and Tianyi Zhang. Promptcharm: Text-to-image generation through multi-modal prompting and refinement. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, 2024
2024
-
[6]
Evolving the embedding space of diffusion models in the field of visual arts
Marcel Salvenmoser and Michael Affenzeller. Evolving the embedding space of diffusion models in the field of visual arts. In Artificial Intelligence in Music, Sound, Art and Design: 14th International Conference, EvoMUSART 2025, Held as Part of EvoStar 2025, Trieste, Italy, April 23–25, 2025, Proceedings, pages 402–416, Berlin, Heidelberg, 2025. Springer-Ve...
-
[7]
What's in a text-to-image prompt? The potential of stable diffusion in visual arts education
Nassim Dehouche and Kullathida Dehouche. What's in a text-to-image prompt? The potential of stable diffusion in visual arts education. Heliyon, 9(6):e16757, 2023. ISSN 2405-8440. doi: https://doi.org/10.1016/j.heliyon.2023.e16757
-
[8]
Test-time prompt refinement for text-to-image models
Mohammad Abdul Hafeez Khan, Yash Jain, Siddhartha Bhattacharyya, and Vibhav Vineet. Test-time prompt refinement for text-to-image models. ArXiv, abs/2507.22076, 2025
-
[9]
Prompt stealing attacks against text-to-image generation models
Xinyue Shen, Yiting Qu, Michael Backes, and Yang Zhang. Prompt stealing attacks against text-to-image generation models. arXiv, 2024. URL https://arxiv.org/abs/2302.09923
-
[10]
Promptify: Text-to-image generation through interactive prompt exploration with large language models
Stephen Brade, Bryan Wang, Mauricio Sousa, Sageev Oore, and Tovi Grossman. Promptify: Text-to-image generation through interactive prompt exploration with large language models. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, UIST '23, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 97984007013...
-
[11]
Evolutionary algorithms
Thomas Bartz-Beielstein, Jürgen Branke, Jörn Mehnen, and Olaf Mersmann. Evolutionary algorithms. Wiley Int. Rev. Data Min. and Knowl. Disc., 4(3):178–195, May 2014. doi: 10.1002/widm.1124. URL https://doi.org/10.1002/widm.1124
-
[12]
EvoGen-Prompt-Evolution
Magnus Petersen. EvoGen-Prompt-Evolution. https://github.com/MagnusPetersen/EvoGen-Prompt-Evolution, 2022. Accessed: 2023-07-16
2022
-
[13]
IF by DeepFloyd
DeepFloyd Team. IF by DeepFloyd. https://github.com/deep-floyd/IF, 2023. Accessed: 2025-10-08
2023
-
[14]
PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis
Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. PixArt-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. ArXiv, abs/2310.00426, 2023. URL https://arxiv.org/abs/2310.00426
-
[16]
URL https://arxiv.org/abs/2307.01952
-
[17]
Adversarial diffusion distillation
Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation. In Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, and Gül Varol, editors, Computer Vision – ECCV 2024, pages 87–103, Cham, 2025. Springer Nature Switzerland. ISBN 978-3-031-73016-0
2024
-
[18]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine ...
2021
-
[19]
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(1), January 2020. ISSN 1532-4435
2020
-
[20]
Laion-aesthetics, 8 2022
Christoph Schuhmann. Laion-aesthetics, 8 2022. URL https://laion.ai/blog/laion-aesthetics/
2022
-
[21]
CLIPScore: A Reference-free Evaluation Metric for Image Captioning
Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A reference-free evaluation metric for image captioning. ArXiv, abs/2104.08718, 2021. URL https://api.semanticscholar.org/CorpusID:233296711