pith. machine review for the scientific record. sign in

arxiv: 2604.18168 · v1 · submitted 2026-04-20 · 💻 cs.CV

Recognition: unknown

Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation

Aiming Hao, Chenxi Zhao, Chen Zhu, Jiachen Lei, Jiahong Wu, Jiashu Zhu, Jufeng Yang, Xiangxiang Chu, Xiaokun Feng

Pith reviewed 2026-05-10 05:11 UTC · model grok-4.3

classification 💻 cs.CV
keywords one-step image generationMeanFlowtext-to-image synthesisdiscriminative text featuresLLM text encoderdiffusion modelsfew-step generation
0
0 comments X

The pith

MeanFlow enables one-step image generation from text by requiring and using highly discriminative LLM text encoders.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that MeanFlow's one-step generation, which works well for class labels, can be extended to text conditions. The central requirement is that text features must exhibit high discriminability because the process allows only a single refinement step, unlike the discrete separation of class labels. By selecting an LLM-based encoder with the needed semantic properties and adapting the MeanFlow process, the authors achieve the first efficient text-to-image synthesis under this constraint. This extension matters because it moves one-step generation from fixed categories toward open-ended descriptive inputs. The same insight also improves performance when applied to standard diffusion models.

Core claim

Due to the extremely limited refinement steps in MeanFlow, such as only one step, text feature representations must possess sufficiently high discriminability; by leveraging a powerful LLM-based text encoder validated to have the required semantic properties and adapting the MeanFlow generation process, efficient text-conditioned one-step image synthesis is achieved for the first time.

What carries the argument

Adaptation of the MeanFlow single-refinement process to text features that meet a high-discriminability requirement, supplied by a validated LLM encoder.

If this is right

  • Text inputs can replace fixed class labels to enable richer, more flexible image content in one-step generation.
  • The same discriminability requirement yields measurable generation improvements on standard diffusion models.
  • The approach supplies a practical template for extending other one-step or few-step generators to text conditions.
  • Future work on text-conditioned MeanFlow can focus on encoder selection rather than multi-step architectural changes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The discriminability constraint likely applies to other low-step generative regimes, suggesting encoder design should prioritize separation metrics over general language modeling.
  • Real-time applications could adopt this for on-device text-to-image synthesis where multi-step sampling is too slow.
  • Optimizing encoders specifically for generation discriminability, rather than downstream NLP tasks, may become a distinct research direction.
  • The insight could transfer to text-conditioned generation of video or 3D content under tight step budgets.

Load-bearing premise

Text feature representations must possess sufficiently high discriminability to function with the extremely limited number of refinement steps in MeanFlow.

What would settle it

A controlled test showing that a text encoder lacking high discriminability still produces high-quality one-step text-to-image results in the adapted MeanFlow framework.

Figures

Figures reproduced from arXiv: 2604.18168 by Aiming Hao, Chenxi Zhao, Chen Zhu, Jiachen Lei, Jiahong Wu, Jiashu Zhu, Jufeng Yang, Xiangxiang Chu, Xiaokun Feng.

Figure 1
Figure 1. Figure 1: Visual comparison of Our Model and SANA-Sprint in 4-step inference on challenging text. Our model achieves superior image quality and instruction following. The blue text denotes examples where SANA-Sprint fails. Abstract Few-step generation has been a long-standing goal, with recent one-step generation methods exemplified by Mean￾Flow achieving remarkable results. Existing research on MeanFlow primarily f… view at source ↗
Figure 2
Figure 2. Figure 2: Left: Performance gap in few-step generation — For "Ducks float leisurely on vibrant, clear blue water.", SANA-1.5 misses the subject "ducks" in early steps, while BLIP3o-NEXT preserves it, yielding greater robustness and a more accurate velocity-field direction. Right: Under few-step sampling, BLIP3o-NEXT consistently outperforms SANA-1.5 on GenEval. BLIP3o-NEXT shows stronger subject preservation and dow… view at source ↗
Figure 3
Figure 3. Figure 3: On the COCO2017 train set [55], we encode query prompts using different text encoders, retrieve the Top-2 similar texts, and visualize their corresponding images. Among them, the images retrieved using the BLIP3o-NEXT text encoder are the most similar. This indicates that the distributions of its text and image representations are closely aligned, exhibiting strong discriminability. Disentanglement. Anothe… view at source ↗
Figure 4
Figure 4. Figure 4: Ablation study of sampling steps in T2I MeanFlow. [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: 4-step sampling comparison of our method with existing distilled models. Our method achieves superior semantic fidelity and [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: 4-step GenEval performance of different text encoders [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Denoising Trajectory Comparison. Simple class-label [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Representative visual results on DPG-Bench. Compared to the blurred outputs of few-step Flow Matching (FM) inference, our [PITH_FULL_IMAGE:figures/full_fig_p015_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Additional comparisons under 4-step sampling between our method and existing distilled models. Our approach achieves higher [PITH_FULL_IMAGE:figures/full_fig_p015_9.png] view at source ↗
read the original abstract

Few-step generation has been a long-standing goal, with recent one-step generation methods exemplified by MeanFlow achieving remarkable results. Existing research on MeanFlow primarily focuses on class-to-image generation. However, an intuitive yet unexplored direction is to extend the condition from fixed class labels to flexible text inputs, enabling richer content creation. Compared to the limited class labels, text conditions pose greater challenges to the model's understanding capability, necessitating the effective integration of powerful text encoders into the MeanFlow framework. Surprisingly, although incorporating text conditions appears straightforward, we find that integrating powerful LLM-based text encoders using conventional training strategies results in unsatisfactory performance. To uncover the underlying cause, we conduct detailed analyses and reveal that, due to the extremely limited number of refinement steps in the MeanFlow generation, such as only one step, the text feature representations are required to possess sufficiently high discriminability. This also explains why discrete and easily distinguishable class features perform well within the MeanFlow framework. Guided by these insights, we leverage a powerful LLM-based text encoder validated to possess the required semantic properties and adapt the MeanFlow generation process to this framework, resulting in efficient text-conditioned synthesis for the first time. Furthermore, we validate our approach on the widely used diffusion model, demonstrating significant generation performance improvements. We hope this work provides a general and practical reference for future research on text-conditioned MeanFlow generation. The code is available at https://github.com/AMAP-ML/EMF.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper extends the MeanFlow one-step image generation framework from class-conditional to text-conditional synthesis. It identifies that standard integration of LLM-based text encoders yields poor results because the one-step (or few-step) refinement process requires text features with high discriminability, in contrast to discrete class labels. Through targeted analyses, the authors select and validate an LLM encoder possessing the necessary semantic properties, adapt the MeanFlow generation process, and report successful text-conditioned one-step synthesis. The same adaptation is shown to improve a standard diffusion model, with code released for reproducibility.

Significance. If the empirical findings hold, the work is significant for enabling flexible, text-driven one-step image generation, a practical advance over fixed class-label conditioning in few-step methods. The identification of the discriminability requirement, the adaptation strategy, and the cross-model validation (MeanFlow and diffusion) provide a concrete reference for future research. The public code release is a notable strength, directly supporting verification of the encoder selection, process adaptation, and reported performance gains.

minor comments (3)
  1. [Abstract] Abstract: the statement that the approach results in 'efficient text-conditioned synthesis for the first time' would be strengthened by a brief qualification or citation to any closely related prior attempts at text-conditioned few-step generation.
  2. The description of the adaptation to the MeanFlow generation process would benefit from an explicit equation or pseudocode block showing the modified conditioning and refinement steps, to aid direct implementation.
  3. Tables reporting quantitative results (e.g., FID or CLIP scores) should include standard deviations across multiple runs or seeds to allow assessment of statistical significance of the claimed improvements.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary of our work, the recognition of its significance for enabling flexible text-driven one-step image generation, and the recommendation for minor revision. We appreciate the acknowledgment of our analyses on text feature discriminability, the adaptation of MeanFlow, the cross-validation on diffusion models, and the code release.

Circularity Check

0 steps flagged

No significant circularity; empirical adaptation with independent validation

full rationale

The paper's chain consists of empirical analysis showing that standard LLM text encoders lack sufficient discriminability for one-step MeanFlow (due to limited refinement steps), followed by selection of a validated encoder with the required properties, adaptation of the generation process, and reporting of improved synthesis results (also shown to benefit standard diffusion models). No equations or derivations are presented that reduce by construction to fitted parameters, self-definitions, or prior self-citations as load-bearing premises. The central claim is an empirical demonstration supported by analyses and external benchmarks, with code released for verification. This is self-contained against external checks and receives score 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review based solely on abstract; no explicit free parameters, axioms, or invented entities are described in the provided text.

pith-pipeline@v0.9.0 · 5589 in / 883 out tokens · 27930 ms · 2026-05-10T05:11:42.616903+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

85 extracted references · 52 canonical work pages · 19 internal anchors

  1. [1]

    Denoising diffu- sion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu- sion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020. 1, 3

  2. [2]

    Denoising Diffusion Implicit Models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models.arXiv preprint arXiv:2010.02502,

  3. [3]

    Flow matching for generative modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. InICLR, 2023. 1, 3

  4. [4]

    Scaling rectified flowtransformersforhigh-resolutionimagesynthesis.InForty- first international conference on machine learning, 2024

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flowtransformersforhigh-resolutionimagesynthesis.InForty- first international conference on machine learning, 2024. 1, 3, 7, 2

  5. [5]

    Consistency models

    YangSong,PrafullaDhariwal,MarkChen,andIlyaSutskever. Consistency models. 2023. 1, 3

  6. [6]

    Multistep Consistency Models, November 2024

    Jonathan Heek, Emiel Hoogeboom, and Tim Salimans. Mul- tistep consistency models.arXiv preprint arXiv:2403.06807,

  7. [7]

    Align your flow: Scaling continuous-time flow map distillation.arXiv preprint arXiv:2506.14603, 2025

    Amirmojtaba Sabour, Sanja Fidler, and Karsten Kreis. Align your flow: Scaling continuous-time flow map distillation. arXiv preprint arXiv:2506.14603, 2025. 2, 3, 6

  8. [8]

    Flow map matching with stochastic interpolants: A mathematical framework for consistency models.arXiv preprint arXiv:2406.07507, 2024

    Nicholas M Boffi, Michael S Albergo, and Eric Vanden-Eijnden. Flow map matching.arXiv preprint arXiv:2406.07507, 2(3):9, 2024. 1, 2, 3

  9. [9]

    Mean Flows for One-step Generative Modeling

    Zhengyang Geng, Mingyang Deng, Xingjian Bai, J Zico Kolter, and Kaiming He. Mean flows for one-step generative modeling.arXiv preprint arXiv:2505.13447, 2025. 2, 3

  10. [10]

    Alphaflow: Understanding and improvi ng meanflow models

    Huijie Zhang, Aliaksandr Siarohin, Willi Menapace, Michael Vasilkovsky,SergeyTulyakov,QingQu,andIvanSkorokhodov. Alphaflow: Understanding and improving meanflow models. arXiv preprint arXiv:2510.20771, 2025. 2, 3

  11. [11]

    Splitmeanflow: Interval splitting consistency in few-step generative modeling.arXiv preprint arXiv:2507.16884, 2025

    Yi Guo, Wei Wang, Zhihang Yuan, Rong Cao, Kuan Chen, ZhengyangChen,YuanyuanHuo,YangZhang,YupingWang, Shouda Liu, et al. Splitmeanflow: Interval splitting con- sistency in few-step generative modeling.arXiv preprint arXiv:2507.16884, 2025. 3

  12. [12]

    Yaron Lipman, Ricky T

    Kyungmin Lee, Sihyun Yu, and Jinwoo Shin. Decoupled meanflow: Turningflowmodelsintoflowmapsforaccelerated sampling.arXiv preprint arXiv:2510.24474, 2025. 2, 3, 6

  13. [13]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. InCVPR, 2009. 2

  14. [14]

    Sana: Efficient high-resolution image synthesis with linear diffusion transformers.arXiv preprint arXiv:2410.10629, 2024

    EnzeXie,JunsongChen,JunyuChen,HanCai,HaotianTang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, et al. Sana: Efficient high-resolution image synthesis with lin- ear diffusion transformers.arXiv preprint arXiv:2410.10629,

  15. [15]

    Learning trans- ferable visual models from natural language supervision

    AlecRadford,JongWookKim,ChrisHallacy,AdityaRamesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning trans- ferable visual models from natural language supervision. In International conference on machine learning, pages 8748–

  16. [16]

    Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21(140):1–67, 2020

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21(140):1–67, 2020. 2, 3, 4

  17. [17]

    Qwen2 Technical Report

    Qwen Team et al. Qwen2 technical report.arXiv preprint arXiv:2407.10671, 2(3), 2024. 2, 3

  18. [18]

    Gemma 2: Improving Open Language Models at a Practical Size

    GemmaTeam,MorganeRiviere,ShreyaPathak,PierGiuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size.arXiv preprint arXiv:2408.00118, 2024. 2, 3

  19. [19]

    Sana-sprint: One-step diffusion with continuous-time consistency distillation

    Junsong Chen, Shuchen Xue, Yuyang Zhao, Jincheng Yu, Sayak Paul, Junyu Chen, Han Cai, Song Han, and Enze Xie. Sana-sprint: One-step diffusion with continuous-time consistency distillation.arXiv preprint arXiv:2503.09641,

  20. [20]

    Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models

    Cheng Lu and Yang Song. Simplifying, stabilizing and scaling continuous-time consistency models.arXiv preprint arXiv:2410.11081, 2024. 2, 3

  21. [21]

    Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency

    Kaiwen Zheng, Yuji Wang, Qianli Ma, Huayu Chen, Jintao Zhang, Yogesh Balaji, Jianfei Chen, Ming-Yu Liu, Jun Zhu, and Qinsheng Zhang. Large scale diffusion distillation via score-regularizedcontinuous-timeconsistency.arXivpreprint arXiv:2510.08431, 2025. 2, 7

  22. [22]

    Consistency models made easy

    Zhengyang Geng, Ashwini Pokle, William Luo, Justin Lin, and J Zico Kolter. Consistency models made easy.arXiv preprint arXiv:2406.14548, 2024. 2, 3

  23. [23]

    Improved Tech- niques for Training Consistency Models

    Yang Song and Prafulla Dhariwal. Improved techniques for trainingconsistencymodels.arXivpreprintarXiv:2310.14189,

  24. [24]

    Advancing end- to-end pixel space generative modeling via self-supervised pre-training.arXiv:2510.12586, 2025

    Jiachen Lei, Keli Liu, Julius Berner, Haiming Yu, Hongkai Zheng, Jiahong Wu, and Xiangxiang Chu. Advancing end- to-end pixel space generative modeling via self-supervised pre-training.arXiv preprint arXiv:2510.12586, 2025. 2

  25. [25]

    Diffusion Transformers with Representation Autoencoders

    Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders. arXiv preprint arXiv:2510.11690, 2025

  26. [26]

    Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

    Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representa- tionalignmentforgeneration: Trainingdiffusiontransformers is easier than you think.arXiv preprint arXiv:2410.06940, 2024

  27. [27]

    arXiv preprint arXiv:2510.15301 (2025)

    Minglei Shi, Haolin Wang, Wenzhao Zheng, Ziyang Yuan, Xiaoshi Wu, Xintao Wang, Pengfei Wan, Jie Zhou, and Jiwen Lu. Latent diffusion model without variational autoencoder. arXiv preprint arXiv:2510.15301, 2025

  28. [28]

    Representation entanglement for generation: Training diffusion transformers is much easier than you think

    Ge Wu, Shen Zhang, Ruijing Shi, Shanghua Gao, Zhenyuan Chen, Lei Wang, Zhaowei Chen, Hongcheng Gao, Yao Tang, Jian Yang, et al. Representation entanglement for generation: Training diffusion transformers is much easier than you think. InNeurIPS, 2025

  29. [29]

    InICCV, 2025

    XiangxiangChu,RendaLi,andYongWang.Usp: Unifiedself- supervisedpretrainingforimagegenerationandunderstanding. InICCV, 2025

  30. [30]

    Reconstruction alignment improves unified multimodal models.arXiv preprint arXiv:2509.07295, 2025a

    Ji Xie, Trevor Darrell, Luke Zettlemoyer, and XuDong Wang. Reconstruction alignment improves unified multimodal mod- els.arXiv preprint arXiv:2509.07295, 2025. 2

  31. [31]

    Sana 1.5: Efficient scaling of training-time and inference-time compute in linear diffusion transformer, 2025

    Enze Xie, Junsong Chen, Yuyang Zhao, Jincheng Yu, Ligeng Zhu, Yujun Lin, Zhekai Zhang, Muyang Li, Junyu Chen, Han Cai, et al. Sana 1.5: Efficient scaling of training-time and inference-time compute in linear diffusion transformer, 2025. 2, 3, 4, 7

  32. [32]

    U-net: Convolutional networks for biomedical image segmentation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. InInternational Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer,

  33. [33]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF inter- national conference on computer vision, pages 4195–4205,

  34. [34]

    S2-guidance: Stochastic self guidance for training-free enhancement of diffusion models.arXiv preprint arXiv:2508.12880, 2025

    Chubin Chen, Jiashu Zhu, Xiaokun Feng, et al. S2-guidance: Stochastic self guidance for training-free enhancement of diffusion models.arXiv preprint arXiv:2508.12880, 2025. 3

  35. [35]

    Taming preference mode collapse via directional decoupling alignment in diffusion reinforcement learning.arXiv preprint arXiv:2512.24146, 2025

    Chubin Chen, Sujie Hu, Jiashu Zhu, Meiqi Wu, Jintao Chen, Yanxun Li, Nisha Huang, Chengyu Fang, Jiahong Wu, Xi- angxiang Chu, et al. Taming preference mode collapse via directional decoupling alignment in diffusion reinforcement learning.arXiv preprint arXiv:2512.24146, 2025. 3

  36. [36]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, etal. Qwentechnicalreport.arXivpreprintarXiv:2309.16609,

  37. [37]

    Qwen3 Technical Report

    AnYang,AnfengLi,BaosongYang,BeichenZhang,Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  38. [38]

    Gemma: Open Models Based on Gemini Research and Technology

    Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and technol- ogy.arXiv preprint arXiv:2403.08295, 2024. 5

  39. [39]

    Gemma 3 Technical Report

    Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Ta- tiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. Gemma 3 technical report.arXiv preprint arXiv:2503.19786,

  40. [40]

    High-resolution image synthesis withlatentdiffusionmodels

    RobinRombach,AndreasBlattmann,DominikLorenz,Patrick Esser, and Björn Ommer. High-resolution image synthesis withlatentdiffusionmodels. InProceedingsoftheIEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 3

  41. [41]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, TimDockhorn, JonasMüller, JoePenna, andRobinRombach. Sdxl: Improving latent diffusion models for high-resolution imagesynthesis.arXivpreprintarXiv:2307.01952, 2023. 3, 7

  42. [42]

    Pixart-𝛼: Fast training of diffusion transformer for photorealistic text-to-image synthesis

    Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Zhongdao Wang, James T Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-𝛼: Fast training of diffusion transformer for photorealistic text-to-image synthesis. In ICLR, 2024. 3, 7, 2

  43. [43]

    Pixart-{\delta}: Fast and controllable image generation with latent consistency models.arXiv preprint arXiv:2401.05252, 2024

    Junsong Chen, Yue Wu, Simian Luo, Enze Xie, Sayak Paul, Ping Luo, Hang Zhao, and Zhenguo Li. Pixart-{\delta}: Fast and controllable image generation with latent consistency models.arXiv preprint arXiv:2401.05252, 2024

  44. [44]

    Pixart-𝜎: Weak-to-strong training of diffusion transformer for 4k text-to-image generation

    Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-𝜎: Weak-to-strong training of diffusion transformer for 4k text-to-image generation. InEuropean ConferenceonComputerVision,pages74–91.Springer,2024. 3, 2

  45. [45]

    Black Forest Labs. Flux. https://github.com/ black-forest-labs/flux, 2024. 3, 7, 2

  46. [46]

    Flux-text: A simple and advanced diffusion transformer baseline for scene text editing.arXiv preprint arXiv:2505.03329, 2025

    Rui Lan, Yancheng Bai, Xu Duan, Mingxing Li, Dongyang Jin, Ryan Xu, Dong Nie, Lei Sun, and Xiangxiang Chu. Flux- text: A simple and advanced diffusion transformer baseline forscenetextediting.arXivpreprintarXiv:2505.03329, 2025. 3

  47. [47]

    Nano banana

    Google. Nano banana. https : / / developers . googleblog.com/en/introducing- gemini- 2- 5-flash-image, 2025. 3

  48. [48]

    Qwen-Image Technical Report

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025. 3, 7, 2

  49. [49]

    Hunyuanimage 3.0 technical report.arXiv preprint arXiv:2509.23951, 2025

    Siyu Cao, Hangting Chen, Peng Chen, Yiji Cheng, Yutao Cui, XinchiDeng,YingDong,KipperGong,TianpengGu,Xiusen Gu, et al. Hunyuanimage 3.0 technical report.arXiv preprint arXiv:2509.23951, 2025. 3

  50. [50]

    Playground v3: Im- proving text-to-image alignment with deep-fusion large language models.arXiv preprint arXiv:2409.10695, 2024

    Bingchen Liu, Ehsan Akhgari, Alexander Visheratin, Aleks Kamko, Linmiao Xu, Shivam Shrirao, Chase Lambert, Joao Souza,SuhailDoshi,andDaiqingLi. Playgroundv3: Improv- ing text-to-image alignment with deep-fusion large language models.arXiv preprint arXiv:2409.10695, 2024. 3

  51. [51]

    Blip3o-next: Next frontier of native image generation.arXiv preprint arXiv:2510.15857, 2025

    Jiuhai Chen, Le Xue, Zhiyang Xu, Xichen Pan, Shusheng Yang, Can Qin, An Yan, Honglu Zhou, Zeyuan Chen, Lifu Huang, et al. Blip3o-next: Next frontier of native image generation.arXiv preprint arXiv:2510.15857, 2025. 3, 4, 7

  52. [52]

    Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

    Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high- resolution images with few-step inference.arXiv preprint arXiv:2310.04378, 2023. 3, 7

  53. [53]

    arXiv preprint arXiv:2407.02398 , year=

    Ling Yang, Zixiang Zhang, Zhilong Zhang, Xingchao Liu, Minkai Xu, Wentao Zhang, Chenlin Meng, Stefano Er- mon, and Bin Cui. Consistency flow matching: Defining straight flows with velocity consistency.arXiv preprint arXiv:2407.02398, 2024. 3

  54. [54]

    Transition models: Rethinking the generative learning objective.arXiv preprint arXiv:2509.04394, 2025

    Zidong Wang, Yiyuan Zhang, Xiaoyu Yue, Xiangyu Yue, YangguangLi,WanliOuyang,andLeiBai. Transitionmodels: Rethinking the generative learning objective.arXiv preprint arXiv:2509.04394, 2025. 3, 6

  55. [55]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014. 4, 5

  56. [56]

    Saco loss: Sample-wise affinity consistency for vision-language pre-training

    Sitong Wu, Haoru Tan, Zhuotao Tian, Yukang Chen, Xi- aojuan Qi, and Jiaya Jia. Saco loss: Sample-wise affinity consistency for vision-language pre-training. InProceedings oftheIEEE/CVFConferenceonComputerVisionandPattern Recognition, pages 27358–27369, 2024. 4

  57. [57]

    DINOv3

    Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Di- nov3.arXiv preprint arXiv:2508.10104, 2025. 4

  58. [58]

    ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

    Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and GangYu. Ella: Equipdiffusionmodelswithllmforenhanced semantic alignment.arXiv preprint arXiv:2403.05135, 2024. 5, 6

  59. [59]

    BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

    Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, et al. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset.arXiv preprint arXiv:2505.09568, 2025. 6, 7, 2

  60. [60]

    ShareGPT-4o-Image: Aligning multimodal models with GPT-4o-level image generation

    Junying Chen, Zhenyang Cai, Pengcheng Chen, Shunian Chen, Ke Ji, Xidong Wang, Yunjin Yang, and Benyou Wang. Sharegpt-4o-image: Aligning multimodal models with gpt- 4o-level image generation.arXiv preprint arXiv:2506.18095,

  61. [61]

    arXiv preprint arXiv:2508.09987 (2025)

    Junyan Ye, Dongzhi Jiang, Zihao Wang, Leqi Zhu, Zheng- hao Hu, Zilong Huang, Jun He, Zhiyuan Yan, Jinghua Yu, Hongsheng Li, et al. Echo-4o: Harnessing the power of gpt-4osyntheticimagesforimprovedimagegeneration.arXiv preprint arXiv:2508.09987, 2025. 6

  62. [62]

    Geneval: An object-focused framework for evaluating text-to- imagealignment.AdvancesinNeuralInformationProcessing Systems, 36:52132–52152, 2023

    Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to- imagealignment.AdvancesinNeuralInformationProcessing Systems, 36:52132–52152, 2023. 6

  63. [63]

    Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

    Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis.arXiv preprint arXiv:2306.09341,

  64. [64]

    Cosmos world foundation model platform for physical ai, 2025

    NVIDIA. Cosmos world foundation model platform for physical ai, 2025. 7

  65. [65]

    Lumina- image 2.0: A unified and efficient image generative frame- work.arXiv preprint arXiv:2503.21758, 2025

    Qi Qin, Le Zhuo, Yi Xin, Ruoyi Du, Zhen Li, Bin Fu, Yiting Lu, Jiakang Yuan, Xinyue Li, Dongyang Liu, et al. Lumina- image2.0: Aunifiedandefficientimagegenerativeframework. arXiv preprint arXiv:2503.21758, 2025. 7, 2

  66. [66]

    Hidream-i1: A high-efficient image generative foundation model with sparse diffusion transformer.arXiv preprint arXiv:2505.22705, 2025

    Qi Cai, Jingwen Chen, Yang Chen, Yehao Li, Fuchen Long, Yingwei Pan, Zhaofan Qiu, Yiheng Zhang, Fengbin Gao, Pei- han Xu, et al. Hidream-i1: A high-efficient image generative foundation model with sparse diffusion transformer.arXiv preprint arXiv:2505.22705, 2025. 7, 2

  67. [67]

    Seedream 3.0 Technical Report

    Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, et al. Seedream 3.0 technical report.arXiv preprint arXiv:2504.11346, 2025. 7, 2

  68. [68]

    Gpt-image-1, 2025

    OpenAI. Gpt-image-1, 2025. 7, 2

  69. [69]

    Transfer between modalities with metaqueries.arXiv preprint arXiv:2504.06256, 2025

    Xichen Pan, Satya Narayan Shukla, Aashu Singh, Zhuokai Zhao, Shlok Kumar Mishra, Jialiang Wang, Zhiyang Xu, Jiuhai Chen, Kunpeng Li, Felix Juefei-Xu, et al. Trans- fer between modalities with metaqueries.arXiv preprint arXiv:2504.06256, 2025. 7, 2

  70. [70]

    Openuni: A simple baseline for unified multimodal understanding and generation.arXiv preprint arXiv:2505.23661, 2025c

    Size Wu, Zhonghua Wu, Zerui Gong, Qingyi Tao, Sheng Jin, QinyueLi,WeiLi,andChenChangeLoy. Openuni: Asimple baseline for unified multimodal understanding and generation. arXiv preprint arXiv:2505.23661, 2025. 7, 2

  71. [71]

    Vision as a dialect: Unifying visual understanding and generation via text-aligned representations.arXiv preprint arXiv:2506.18898, 2025.https://arxiv.org/abs/2506.18898

    Jiaming Han, Hao Chen, Yang Zhao, Hanyu Wang, Qi Zhao, Ziyan Yang, Hao He, Xiangyu Yue, and Lu Jiang. Vision as a dialect: Unifyingvisualunderstandingandgenerationviatext- aligned representations.arXiv preprint arXiv:2506.18898,

  72. [72]

    arXiv preprint arXiv:2508.08098, 2025

    JunzheXu,YuyangYin,andXiChen.Tbac-uniimage: Unified understanding and generation by ladder-side diffusion tuning. arXiv preprint arXiv:2508.08098, 2025. 7, 2

  73. [73]

    Sdxl-lightning: Progressive adversarial diffusion distillation.arXiv preprint arXiv:2402.13929, 2024

    Shanchuan Lin, Anran Wang, and Xiao Yang. Sdxl-lightning: Progressive adversarial diffusion distillation.arXiv preprint arXiv:2402.13929, 2024. 7

  74. [74]

    Hyper-sd: Trajectory segmented consistency model for efficient image synthesis

    Yuxi Ren, Xin Xia, Yanzuo Lu, Jiacheng Zhang, Jie Wu, Pan Xie, Xing Wang, and Xuefeng Xiao. Hyper-sd: Trajectory segmented consistency model for efficient image synthesis. InNeurIPS, 2024. 7

  75. [75]

    Improved distribution matching distillation for fast image synthesis

    Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and Bill Freeman. Improved distribution matching distillation for fast image synthesis. Advancesinneuralinformationprocessingsystems,37:47455– 47487, 2024. 7, 2

  76. [76]

    Lumina-next: Making lumina-t2x stronger and faster with next-dit.Advances in Neural Information Processing Systems, 37:131278–131315, 2024

    Le Zhuo, Ruoyi Du, Han Xiao, Yangguang Li, Dongyang Liu, Rongjie Huang, Wenze Liu, Xiangyang Zhu, Fu-Yun Wang, Zhanyu Ma, et al. Lumina-next: Making lumina-t2x stronger and faster with next-dit.Advances in Neural Information Processing Systems, 37:131278–131315, 2024. 2

  77. [77]

    Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation.ArXiv, abs/2402.17245, 2024

    DaiqingLi,AleksKamko,EhsanAkhgari,AliSabet,Linmiao Xu, and Suhail Doshi. Playground v2. 5: Three insights to- wards enhancing aesthetic quality in text-to-image generation. arXiv preprint arXiv:2402.17245, 2024. 2

  78. [78]

    Hunyuan- dit: A powerful multi-resolution diffusion transformer with fine-grained chinese understanding, 2024

    Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, Dayou Chen, Jiajun He, Jiahao Li, WenyueLi,ChenZhang,RongweiQuan,JianxiangLu,Jiabin Huang, Xiaoyan Yuan, Xiaoxiao Zheng, Yixuan Li, Jihong Zhang,ChaoZhang,MengChen,JieLiu,ZhengFang,Weiyan Wang,JinbaoXue,YangyuTao,JianchenZ...

  79. [79]

    DALL·E 3

    OpenAI. DALL·E 3. https://openai.com/research/dall-e-3, September 2023. 2

  80. [80]

    Fast high- resolution image synthesis with latent adversarial diffusion distillation

    Axel Sauer, Frederic Boesel, Tim Dockhorn, Andreas Blattmann, Patrick Esser, and Robin Rombach. Fast high- resolution image synthesis with latent adversarial diffusion distillation. InSIGGRAPH Asia 2024 Conference Papers, pages 1–11, 2024. 2 Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation Supplementary Material

Showing first 80 references.