arxiv: 2604.18168 · v1 · submitted 2026-04-20 · 💻 cs.CV

Recognition: unknown

Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation

Aiming Hao, Chenxi Zhao, Chen Zhu, Jiachen Lei, Jiahong Wu, Jiashu Zhu, Jufeng Yang, Xiangxiang Chu, Xiaokun Feng

Pith reviewed 2026-05-10 05:11 UTC · model grok-4.3

classification 💻 cs.CV

keywords one-step image generationMeanFlowtext-to-image synthesisdiscriminative text featuresLLM text encoderdiffusion modelsfew-step generation

0 comments

The pith

MeanFlow enables one-step image generation from text by requiring and using highly discriminative LLM text encoders.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that MeanFlow's one-step generation, which works well for class labels, can be extended to text conditions. The central requirement is that text features must exhibit high discriminability because the process allows only a single refinement step, unlike the discrete separation of class labels. By selecting an LLM-based encoder with the needed semantic properties and adapting the MeanFlow process, the authors achieve the first efficient text-to-image synthesis under this constraint. This extension matters because it moves one-step generation from fixed categories toward open-ended descriptive inputs. The same insight also improves performance when applied to standard diffusion models.

Core claim

Due to the extremely limited refinement steps in MeanFlow, such as only one step, text feature representations must possess sufficiently high discriminability; by leveraging a powerful LLM-based text encoder validated to have the required semantic properties and adapting the MeanFlow generation process, efficient text-conditioned one-step image synthesis is achieved for the first time.

What carries the argument

Adaptation of the MeanFlow single-refinement process to text features that meet a high-discriminability requirement, supplied by a validated LLM encoder.

If this is right

Text inputs can replace fixed class labels to enable richer, more flexible image content in one-step generation.
The same discriminability requirement yields measurable generation improvements on standard diffusion models.
The approach supplies a practical template for extending other one-step or few-step generators to text conditions.
Future work on text-conditioned MeanFlow can focus on encoder selection rather than multi-step architectural changes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The discriminability constraint likely applies to other low-step generative regimes, suggesting encoder design should prioritize separation metrics over general language modeling.
Real-time applications could adopt this for on-device text-to-image synthesis where multi-step sampling is too slow.
Optimizing encoders specifically for generation discriminability, rather than downstream NLP tasks, may become a distinct research direction.
The insight could transfer to text-conditioned generation of video or 3D content under tight step budgets.

Load-bearing premise

Text feature representations must possess sufficiently high discriminability to function with the extremely limited number of refinement steps in MeanFlow.

What would settle it

A controlled test showing that a text encoder lacking high discriminability still produces high-quality one-step text-to-image results in the adapted MeanFlow framework.

Figures

Figures reproduced from arXiv: 2604.18168 by Aiming Hao, Chenxi Zhao, Chen Zhu, Jiachen Lei, Jiahong Wu, Jiashu Zhu, Jufeng Yang, Xiangxiang Chu, Xiaokun Feng.

**Figure 1.** Figure 1: Visual comparison of Our Model and SANA-Sprint in 4-step inference on challenging text. Our model achieves superior image quality and instruction following. The blue text denotes examples where SANA-Sprint fails. Abstract Few-step generation has been a long-standing goal, with recent one-step generation methods exemplified by MeanFlow achieving remarkable results. Existing research on MeanFlow primarily f… view at source ↗

**Figure 2.** Figure 2: Left: Performance gap in few-step generation — For "Ducks float leisurely on vibrant, clear blue water.", SANA-1.5 misses the subject "ducks" in early steps, while BLIP3o-NEXT preserves it, yielding greater robustness and a more accurate velocity-field direction. Right: Under few-step sampling, BLIP3o-NEXT consistently outperforms SANA-1.5 on GenEval. BLIP3o-NEXT shows stronger subject preservation and dow… view at source ↗

**Figure 3.** Figure 3: On the COCO2017 train set [55], we encode query prompts using different text encoders, retrieve the Top-2 similar texts, and visualize their corresponding images. Among them, the images retrieved using the BLIP3o-NEXT text encoder are the most similar. This indicates that the distributions of its text and image representations are closely aligned, exhibiting strong discriminability. Disentanglement. Anothe… view at source ↗

**Figure 4.** Figure 4: Ablation study of sampling steps in T2I MeanFlow. [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: 4-step sampling comparison of our method with existing distilled models. Our method achieves superior semantic fidelity and [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: 4-step GenEval performance of different text encoders [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: Denoising Trajectory Comparison. Simple class-label [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗

**Figure 8.** Figure 8: Representative visual results on DPG-Bench. Compared to the blurred outputs of few-step Flow Matching (FM) inference, our [PITH_FULL_IMAGE:figures/full_fig_p015_8.png] view at source ↗

**Figure 9.** Figure 9: Additional comparisons under 4-step sampling between our method and existing distilled models. Our approach achieves higher [PITH_FULL_IMAGE:figures/full_fig_p015_9.png] view at source ↗

read the original abstract

Few-step generation has been a long-standing goal, with recent one-step generation methods exemplified by MeanFlow achieving remarkable results. Existing research on MeanFlow primarily focuses on class-to-image generation. However, an intuitive yet unexplored direction is to extend the condition from fixed class labels to flexible text inputs, enabling richer content creation. Compared to the limited class labels, text conditions pose greater challenges to the model's understanding capability, necessitating the effective integration of powerful text encoders into the MeanFlow framework. Surprisingly, although incorporating text conditions appears straightforward, we find that integrating powerful LLM-based text encoders using conventional training strategies results in unsatisfactory performance. To uncover the underlying cause, we conduct detailed analyses and reveal that, due to the extremely limited number of refinement steps in the MeanFlow generation, such as only one step, the text feature representations are required to possess sufficiently high discriminability. This also explains why discrete and easily distinguishable class features perform well within the MeanFlow framework. Guided by these insights, we leverage a powerful LLM-based text encoder validated to possess the required semantic properties and adapt the MeanFlow generation process to this framework, resulting in efficient text-conditioned synthesis for the first time. Furthermore, we validate our approach on the widely used diffusion model, demonstrating significant generation performance improvements. We hope this work provides a general and practical reference for future research on text-conditioned MeanFlow generation. The code is available at https://github.com/AMAP-ML/EMF.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MeanFlow can do one-step text-to-image if you swap in a text encoder with high enough discriminability and adjust the pipeline to match.

read the letter

The main point is that this paper gets MeanFlow working with flexible text prompts instead of just class labels. They trace the usual failure of LLM encoders to the one-step limit: with almost no refinement steps left, the conditioning features have to be sharply separable on their own, which class labels already are but raw text embeddings often are not. They test encoders, pick one that passes the discriminability bar, adapt the MeanFlow training and sampling, and show it produces usable text-conditioned images in one step. The same encoder swap also lifts performance on a standard diffusion baseline, which is a useful side check.

Referee Report

0 major / 3 minor

Summary. The paper extends the MeanFlow one-step image generation framework from class-conditional to text-conditional synthesis. It identifies that standard integration of LLM-based text encoders yields poor results because the one-step (or few-step) refinement process requires text features with high discriminability, in contrast to discrete class labels. Through targeted analyses, the authors select and validate an LLM encoder possessing the necessary semantic properties, adapt the MeanFlow generation process, and report successful text-conditioned one-step synthesis. The same adaptation is shown to improve a standard diffusion model, with code released for reproducibility.

Significance. If the empirical findings hold, the work is significant for enabling flexible, text-driven one-step image generation, a practical advance over fixed class-label conditioning in few-step methods. The identification of the discriminability requirement, the adaptation strategy, and the cross-model validation (MeanFlow and diffusion) provide a concrete reference for future research. The public code release is a notable strength, directly supporting verification of the encoder selection, process adaptation, and reported performance gains.

minor comments (3)

[Abstract] Abstract: the statement that the approach results in 'efficient text-conditioned synthesis for the first time' would be strengthened by a brief qualification or citation to any closely related prior attempts at text-conditioned few-step generation.
The description of the adaptation to the MeanFlow generation process would benefit from an explicit equation or pseudocode block showing the modified conditioning and refinement steps, to aid direct implementation.
Tables reporting quantitative results (e.g., FID or CLIP scores) should include standard deviations across multiple runs or seeds to allow assessment of statistical significance of the claimed improvements.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary of our work, the recognition of its significance for enabling flexible text-driven one-step image generation, and the recommendation for minor revision. We appreciate the acknowledgment of our analyses on text feature discriminability, the adaptation of MeanFlow, the cross-validation on diffusion models, and the code release.

Circularity Check

0 steps flagged

No significant circularity; empirical adaptation with independent validation

full rationale

The paper's chain consists of empirical analysis showing that standard LLM text encoders lack sufficient discriminability for one-step MeanFlow (due to limited refinement steps), followed by selection of a validated encoder with the required properties, adaptation of the generation process, and reporting of improved synthesis results (also shown to benefit standard diffusion models). No equations or derivations are presented that reduce by construction to fitted parameters, self-definitions, or prior self-citations as load-bearing premises. The central claim is an empirical demonstration supported by analyses and external benchmarks, with code released for verification. This is self-contained against external checks and receives score 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review based solely on abstract; no explicit free parameters, axioms, or invented entities are described in the provided text.

pith-pipeline@v0.9.0 · 5589 in / 883 out tokens · 27930 ms · 2026-05-10T05:11:42.616903+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

85 extracted references · 52 canonical work pages · 19 internal anchors

[1]

Denoising diffu- sion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu- sion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020. 1, 3

2020
[2]

Denoising Diffusion Implicit Models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models.arXiv preprint arXiv:2010.02502,

work page internal anchor Pith review Pith/arXiv arXiv 2010
[3]

Flow matching for generative modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. InICLR, 2023. 1, 3

2023
[4]

Scaling rectified flowtransformersforhigh-resolutionimagesynthesis.InForty- first international conference on machine learning, 2024

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flowtransformersforhigh-resolutionimagesynthesis.InForty- first international conference on machine learning, 2024. 1, 3, 7, 2

2024
[5]

Consistency models

YangSong,PrafullaDhariwal,MarkChen,andIlyaSutskever. Consistency models. 2023. 1, 3

2023
[6]

Multistep Consistency Models, November 2024

Jonathan Heek, Emiel Hoogeboom, and Tim Salimans. Mul- tistep consistency models.arXiv preprint arXiv:2403.06807,

work page arXiv
[7]

Align your flow: Scaling continuous-time flow map distillation.arXiv preprint arXiv:2506.14603, 2025

Amirmojtaba Sabour, Sanja Fidler, and Karsten Kreis. Align your flow: Scaling continuous-time flow map distillation. arXiv preprint arXiv:2506.14603, 2025. 2, 3, 6

work page arXiv 2025
[8]

Flow map matching with stochastic interpolants: A mathematical framework for consistency models.arXiv preprint arXiv:2406.07507, 2024

Nicholas M Boffi, Michael S Albergo, and Eric Vanden-Eijnden. Flow map matching.arXiv preprint arXiv:2406.07507, 2(3):9, 2024. 1, 2, 3

work page arXiv 2024
[9]

Mean Flows for One-step Generative Modeling

Zhengyang Geng, Mingyang Deng, Xingjian Bai, J Zico Kolter, and Kaiming He. Mean flows for one-step generative modeling.arXiv preprint arXiv:2505.13447, 2025. 2, 3

work page internal anchor Pith review arXiv 2025
[10]

Alphaﬂow: Understanding and improvi ng meanﬂow models

Huijie Zhang, Aliaksandr Siarohin, Willi Menapace, Michael Vasilkovsky,SergeyTulyakov,QingQu,andIvanSkorokhodov. Alphaflow: Understanding and improving meanflow models. arXiv preprint arXiv:2510.20771, 2025. 2, 3

work page arXiv 2025
[11]

Splitmeanflow: Interval splitting consistency in few-step generative modeling.arXiv preprint arXiv:2507.16884, 2025

Yi Guo, Wei Wang, Zhihang Yuan, Rong Cao, Kuan Chen, ZhengyangChen,YuanyuanHuo,YangZhang,YupingWang, Shouda Liu, et al. Splitmeanflow: Interval splitting con- sistency in few-step generative modeling.arXiv preprint arXiv:2507.16884, 2025. 3

work page arXiv 2025
[12]

Yaron Lipman, Ricky T

Kyungmin Lee, Sihyun Yu, and Jinwoo Shin. Decoupled meanflow: Turningflowmodelsintoflowmapsforaccelerated sampling.arXiv preprint arXiv:2510.24474, 2025. 2, 3, 6

work page arXiv 2025
[13]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. InCVPR, 2009. 2

2009
[14]

Sana: Efficient high-resolution image synthesis with linear diffusion transformers.arXiv preprint arXiv:2410.10629, 2024

EnzeXie,JunsongChen,JunyuChen,HanCai,HaotianTang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, et al. Sana: Efficient high-resolution image synthesis with lin- ear diffusion transformers.arXiv preprint arXiv:2410.10629,

work page arXiv
[15]

Learning trans- ferable visual models from natural language supervision

AlecRadford,JongWookKim,ChrisHallacy,AdityaRamesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning trans- ferable visual models from natural language supervision. In International conference on machine learning, pages 8748–
[16]

Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21(140):1–67, 2020

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21(140):1–67, 2020. 2, 3, 4

2020
[17]

Qwen2 Technical Report

Qwen Team et al. Qwen2 technical report.arXiv preprint arXiv:2407.10671, 2(3), 2024. 2, 3

work page internal anchor Pith review arXiv 2024
[18]

Gemma 2: Improving Open Language Models at a Practical Size

GemmaTeam,MorganeRiviere,ShreyaPathak,PierGiuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size.arXiv preprint arXiv:2408.00118, 2024. 2, 3

work page internal anchor Pith review arXiv 2024
[19]

Sana-sprint: One-step diffusion with continuous-time consistency distillation

Junsong Chen, Shuchen Xue, Yuyang Zhao, Jincheng Yu, Sayak Paul, Junyu Chen, Han Cai, Song Han, and Enze Xie. Sana-sprint: One-step diffusion with continuous-time consistency distillation.arXiv preprint arXiv:2503.09641,

work page arXiv
[20]

Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models

Cheng Lu and Yang Song. Simplifying, stabilizing and scaling continuous-time consistency models.arXiv preprint arXiv:2410.11081, 2024. 2, 3

work page internal anchor Pith review arXiv 2024
[21]

Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency

Kaiwen Zheng, Yuji Wang, Qianli Ma, Huayu Chen, Jintao Zhang, Yogesh Balaji, Jianfei Chen, Ming-Yu Liu, Jun Zhu, and Qinsheng Zhang. Large scale diffusion distillation via score-regularizedcontinuous-timeconsistency.arXivpreprint arXiv:2510.08431, 2025. 2, 7

work page internal anchor Pith review Pith/arXiv arXiv 2025
[22]

Consistency models made easy

Zhengyang Geng, Ashwini Pokle, William Luo, Justin Lin, and J Zico Kolter. Consistency models made easy.arXiv preprint arXiv:2406.14548, 2024. 2, 3

work page arXiv 2024
[23]

Improved Tech- niques for Training Consistency Models

Yang Song and Prafulla Dhariwal. Improved techniques for trainingconsistencymodels.arXivpreprintarXiv:2310.14189,

work page arXiv
[24]

Advancing end- to-end pixel space generative modeling via self-supervised pre-training.arXiv:2510.12586, 2025

Jiachen Lei, Keli Liu, Julius Berner, Haiming Yu, Hongkai Zheng, Jiahong Wu, and Xiangxiang Chu. Advancing end- to-end pixel space generative modeling via self-supervised pre-training.arXiv preprint arXiv:2510.12586, 2025. 2

work page arXiv 2025
[25]

Diffusion Transformers with Representation Autoencoders

Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders. arXiv preprint arXiv:2510.11690, 2025

work page internal anchor Pith review arXiv 2025
[26]

Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representa- tionalignmentforgeneration: Trainingdiffusiontransformers is easier than you think.arXiv preprint arXiv:2410.06940, 2024

work page internal anchor Pith review arXiv 2024
[27]

arXiv preprint arXiv:2510.15301 (2025)

Minglei Shi, Haolin Wang, Wenzhao Zheng, Ziyang Yuan, Xiaoshi Wu, Xintao Wang, Pengfei Wan, Jie Zhou, and Jiwen Lu. Latent diffusion model without variational autoencoder. arXiv preprint arXiv:2510.15301, 2025

work page arXiv 2025
[28]

Representation entanglement for generation: Training diffusion transformers is much easier than you think

Ge Wu, Shen Zhang, Ruijing Shi, Shanghua Gao, Zhenyuan Chen, Lei Wang, Zhaowei Chen, Hongcheng Gao, Yao Tang, Jian Yang, et al. Representation entanglement for generation: Training diffusion transformers is much easier than you think. InNeurIPS, 2025

2025
[29]

InICCV, 2025

XiangxiangChu,RendaLi,andYongWang.Usp: Unifiedself- supervisedpretrainingforimagegenerationandunderstanding. InICCV, 2025

2025
[30]

Reconstruction alignment improves unified multimodal models.arXiv preprint arXiv:2509.07295, 2025a

Ji Xie, Trevor Darrell, Luke Zettlemoyer, and XuDong Wang. Reconstruction alignment improves unified multimodal mod- els.arXiv preprint arXiv:2509.07295, 2025. 2

work page arXiv 2025
[31]

Sana 1.5: Efficient scaling of training-time and inference-time compute in linear diffusion transformer, 2025

Enze Xie, Junsong Chen, Yuyang Zhao, Jincheng Yu, Ligeng Zhu, Yujun Lin, Zhekai Zhang, Muyang Li, Junyu Chen, Han Cai, et al. Sana 1.5: Efficient scaling of training-time and inference-time compute in linear diffusion transformer, 2025. 2, 3, 4, 7

2025
[32]

U-net: Convolutional networks for biomedical image segmentation

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. InInternational Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer,
[33]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF inter- national conference on computer vision, pages 4195–4205,
[34]

S2-guidance: Stochastic self guidance for training-free enhancement of diffusion models.arXiv preprint arXiv:2508.12880, 2025

Chubin Chen, Jiashu Zhu, Xiaokun Feng, et al. S2-guidance: Stochastic self guidance for training-free enhancement of diffusion models.arXiv preprint arXiv:2508.12880, 2025. 3

work page arXiv 2025
[35]

Taming preference mode collapse via directional decoupling alignment in diffusion reinforcement learning.arXiv preprint arXiv:2512.24146, 2025

Chubin Chen, Sujie Hu, Jiashu Zhu, Meiqi Wu, Jintao Chen, Yanxun Li, Nisha Huang, Chengyu Fang, Jiahong Wu, Xi- angxiang Chu, et al. Taming preference mode collapse via directional decoupling alignment in diffusion reinforcement learning.arXiv preprint arXiv:2512.24146, 2025. 3

work page arXiv 2025
[36]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, etal. Qwentechnicalreport.arXivpreprintarXiv:2309.16609,

work page internal anchor Pith review Pith/arXiv arXiv
[37]

Qwen3 Technical Report

AnYang,AnfengLi,BaosongYang,BeichenZhang,Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[38]

Gemma: Open Models Based on Gemini Research and Technology

Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and technol- ogy.arXiv preprint arXiv:2403.08295, 2024. 5

work page internal anchor Pith review arXiv 2024
[39]

Gemma 3 Technical Report

Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Ta- tiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. Gemma 3 technical report.arXiv preprint arXiv:2503.19786,

work page internal anchor Pith review Pith/arXiv arXiv
[40]

High-resolution image synthesis withlatentdiffusionmodels

RobinRombach,AndreasBlattmann,DominikLorenz,Patrick Esser, and Björn Ommer. High-resolution image synthesis withlatentdiffusionmodels. InProceedingsoftheIEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 3

2022
[41]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, TimDockhorn, JonasMüller, JoePenna, andRobinRombach. Sdxl: Improving latent diffusion models for high-resolution imagesynthesis.arXivpreprintarXiv:2307.01952, 2023. 3, 7

work page internal anchor Pith review arXiv 2023
[42]

Pixart-𝛼: Fast training of diffusion transformer for photorealistic text-to-image synthesis

Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Zhongdao Wang, James T Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-𝛼: Fast training of diffusion transformer for photorealistic text-to-image synthesis. In ICLR, 2024. 3, 7, 2

2024
[43]

Pixart-{\delta}: Fast and controllable image generation with latent consistency models.arXiv preprint arXiv:2401.05252, 2024

Junsong Chen, Yue Wu, Simian Luo, Enze Xie, Sayak Paul, Ping Luo, Hang Zhao, and Zhenguo Li. Pixart-{\delta}: Fast and controllable image generation with latent consistency models.arXiv preprint arXiv:2401.05252, 2024

work page arXiv 2024
[44]

Pixart-𝜎: Weak-to-strong training of diffusion transformer for 4k text-to-image generation

Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-𝜎: Weak-to-strong training of diffusion transformer for 4k text-to-image generation. InEuropean ConferenceonComputerVision,pages74–91.Springer,2024. 3, 2

2024
[45]

Black Forest Labs. Flux. https://github.com/ black-forest-labs/flux, 2024. 3, 7, 2

2024
[46]

Flux-text: A simple and advanced diffusion transformer baseline for scene text editing.arXiv preprint arXiv:2505.03329, 2025

Rui Lan, Yancheng Bai, Xu Duan, Mingxing Li, Dongyang Jin, Ryan Xu, Dong Nie, Lei Sun, and Xiangxiang Chu. Flux- text: A simple and advanced diffusion transformer baseline forscenetextediting.arXivpreprintarXiv:2505.03329, 2025. 3

work page arXiv 2025
[47]

Nano banana

Google. Nano banana. https : / / developers . googleblog.com/en/introducing- gemini- 2- 5-flash-image, 2025. 3

2025
[48]

Qwen-Image Technical Report

Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025. 3, 7, 2

work page internal anchor Pith review arXiv 2025
[49]

Hunyuanimage 3.0 technical report.arXiv preprint arXiv:2509.23951, 2025

Siyu Cao, Hangting Chen, Peng Chen, Yiji Cheng, Yutao Cui, XinchiDeng,YingDong,KipperGong,TianpengGu,Xiusen Gu, et al. Hunyuanimage 3.0 technical report.arXiv preprint arXiv:2509.23951, 2025. 3

work page arXiv 2025
[50]

Playground v3: Im- proving text-to-image alignment with deep-fusion large language models.arXiv preprint arXiv:2409.10695, 2024

Bingchen Liu, Ehsan Akhgari, Alexander Visheratin, Aleks Kamko, Linmiao Xu, Shivam Shrirao, Chase Lambert, Joao Souza,SuhailDoshi,andDaiqingLi. Playgroundv3: Improv- ing text-to-image alignment with deep-fusion large language models.arXiv preprint arXiv:2409.10695, 2024. 3

work page arXiv 2024
[51]

Blip3o-next: Next frontier of native image generation.arXiv preprint arXiv:2510.15857, 2025

Jiuhai Chen, Le Xue, Zhiyang Xu, Xichen Pan, Shusheng Yang, Can Qin, An Yan, Honglu Zhou, Zeyuan Chen, Lifu Huang, et al. Blip3o-next: Next frontier of native image generation.arXiv preprint arXiv:2510.15857, 2025. 3, 4, 7

work page arXiv 2025
[52]

Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high- resolution images with few-step inference.arXiv preprint arXiv:2310.04378, 2023. 3, 7

work page internal anchor Pith review arXiv 2023
[53]

arXiv preprint arXiv:2407.02398 , year=

Ling Yang, Zixiang Zhang, Zhilong Zhang, Xingchao Liu, Minkai Xu, Wentao Zhang, Chenlin Meng, Stefano Er- mon, and Bin Cui. Consistency flow matching: Defining straight flows with velocity consistency.arXiv preprint arXiv:2407.02398, 2024. 3

work page arXiv 2024
[54]

Transition models: Rethinking the generative learning objective.arXiv preprint arXiv:2509.04394, 2025

Zidong Wang, Yiyuan Zhang, Xiaoyu Yue, Xiangyu Yue, YangguangLi,WanliOuyang,andLeiBai. Transitionmodels: Rethinking the generative learning objective.arXiv preprint arXiv:2509.04394, 2025. 3, 6

work page arXiv 2025
[55]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014. 4, 5

2014
[56]

Saco loss: Sample-wise affinity consistency for vision-language pre-training

Sitong Wu, Haoru Tan, Zhuotao Tian, Yukang Chen, Xi- aojuan Qi, and Jiaya Jia. Saco loss: Sample-wise affinity consistency for vision-language pre-training. InProceedings oftheIEEE/CVFConferenceonComputerVisionandPattern Recognition, pages 27358–27369, 2024. 4

2024
[57]

DINOv3

Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Di- nov3.arXiv preprint arXiv:2508.10104, 2025. 4

work page internal anchor Pith review Pith/arXiv arXiv 2025
[58]

ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and GangYu. Ella: Equipdiffusionmodelswithllmforenhanced semantic alignment.arXiv preprint arXiv:2403.05135, 2024. 5, 6

work page internal anchor Pith review arXiv 2024
[59]

BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, et al. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset.arXiv preprint arXiv:2505.09568, 2025. 6, 7, 2

work page Pith review arXiv 2025
[60]

ShareGPT-4o-Image: Aligning multimodal models with GPT-4o-level image generation

Junying Chen, Zhenyang Cai, Pengcheng Chen, Shunian Chen, Ke Ji, Xidong Wang, Yunjin Yang, and Benyou Wang. Sharegpt-4o-image: Aligning multimodal models with gpt- 4o-level image generation.arXiv preprint arXiv:2506.18095,

work page arXiv
[61]

arXiv preprint arXiv:2508.09987 (2025)

Junyan Ye, Dongzhi Jiang, Zihao Wang, Leqi Zhu, Zheng- hao Hu, Zilong Huang, Jun He, Zhiyuan Yan, Jinghua Yu, Hongsheng Li, et al. Echo-4o: Harnessing the power of gpt-4osyntheticimagesforimprovedimagegeneration.arXiv preprint arXiv:2508.09987, 2025. 6

work page arXiv 2025
[62]

Geneval: An object-focused framework for evaluating text-to- imagealignment.AdvancesinNeuralInformationProcessing Systems, 36:52132–52152, 2023

Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to- imagealignment.AdvancesinNeuralInformationProcessing Systems, 36:52132–52152, 2023. 6

2023
[63]

Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis.arXiv preprint arXiv:2306.09341,

work page internal anchor Pith review arXiv
[64]

Cosmos world foundation model platform for physical ai, 2025

NVIDIA. Cosmos world foundation model platform for physical ai, 2025. 7

2025
[65]

Lumina- image 2.0: A unified and efficient image generative frame- work.arXiv preprint arXiv:2503.21758, 2025

Qi Qin, Le Zhuo, Yi Xin, Ruoyi Du, Zhen Li, Bin Fu, Yiting Lu, Jiakang Yuan, Xinyue Li, Dongyang Liu, et al. Lumina- image2.0: Aunifiedandefficientimagegenerativeframework. arXiv preprint arXiv:2503.21758, 2025. 7, 2

work page arXiv 2025
[66]

Hidream-i1: A high-efficient image generative foundation model with sparse diffusion transformer.arXiv preprint arXiv:2505.22705, 2025

Qi Cai, Jingwen Chen, Yang Chen, Yehao Li, Fuchen Long, Yingwei Pan, Zhaofan Qiu, Yiheng Zhang, Fengbin Gao, Pei- han Xu, et al. Hidream-i1: A high-efficient image generative foundation model with sparse diffusion transformer.arXiv preprint arXiv:2505.22705, 2025. 7, 2

work page arXiv 2025
[67]

Seedream 3.0 Technical Report

Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, et al. Seedream 3.0 technical report.arXiv preprint arXiv:2504.11346, 2025. 7, 2

work page internal anchor Pith review arXiv 2025
[68]

Gpt-image-1, 2025

OpenAI. Gpt-image-1, 2025. 7, 2

2025
[69]

Transfer between modalities with metaqueries.arXiv preprint arXiv:2504.06256, 2025

Xichen Pan, Satya Narayan Shukla, Aashu Singh, Zhuokai Zhao, Shlok Kumar Mishra, Jialiang Wang, Zhiyang Xu, Jiuhai Chen, Kunpeng Li, Felix Juefei-Xu, et al. Trans- fer between modalities with metaqueries.arXiv preprint arXiv:2504.06256, 2025. 7, 2

work page arXiv 2025
[70]

Openuni: A simple baseline for unified multimodal understanding and generation.arXiv preprint arXiv:2505.23661, 2025c

Size Wu, Zhonghua Wu, Zerui Gong, Qingyi Tao, Sheng Jin, QinyueLi,WeiLi,andChenChangeLoy. Openuni: Asimple baseline for unified multimodal understanding and generation. arXiv preprint arXiv:2505.23661, 2025. 7, 2

work page arXiv 2025
[71]

Vision as a dialect: Unifying visual understanding and generation via text-aligned representations.arXiv preprint arXiv:2506.18898, 2025.https://arxiv.org/abs/2506.18898

Jiaming Han, Hao Chen, Yang Zhao, Hanyu Wang, Qi Zhao, Ziyan Yang, Hao He, Xiangyu Yue, and Lu Jiang. Vision as a dialect: Unifyingvisualunderstandingandgenerationviatext- aligned representations.arXiv preprint arXiv:2506.18898,

work page arXiv
[72]

arXiv preprint arXiv:2508.08098, 2025

JunzheXu,YuyangYin,andXiChen.Tbac-uniimage: Unified understanding and generation by ladder-side diffusion tuning. arXiv preprint arXiv:2508.08098, 2025. 7, 2

work page arXiv 2025
[73]

Sdxl-lightning: Progressive adversarial diffusion distillation.arXiv preprint arXiv:2402.13929, 2024

Shanchuan Lin, Anran Wang, and Xiao Yang. Sdxl-lightning: Progressive adversarial diffusion distillation.arXiv preprint arXiv:2402.13929, 2024. 7

work page arXiv 2024
[74]

Hyper-sd: Trajectory segmented consistency model for efficient image synthesis

Yuxi Ren, Xin Xia, Yanzuo Lu, Jiacheng Zhang, Jie Wu, Pan Xie, Xing Wang, and Xuefeng Xiao. Hyper-sd: Trajectory segmented consistency model for efficient image synthesis. InNeurIPS, 2024. 7

2024
[75]

Improved distribution matching distillation for fast image synthesis

Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and Bill Freeman. Improved distribution matching distillation for fast image synthesis. Advancesinneuralinformationprocessingsystems,37:47455– 47487, 2024. 7, 2

2024
[76]

Lumina-next: Making lumina-t2x stronger and faster with next-dit.Advances in Neural Information Processing Systems, 37:131278–131315, 2024

Le Zhuo, Ruoyi Du, Han Xiao, Yangguang Li, Dongyang Liu, Rongjie Huang, Wenze Liu, Xiangyang Zhu, Fu-Yun Wang, Zhanyu Ma, et al. Lumina-next: Making lumina-t2x stronger and faster with next-dit.Advances in Neural Information Processing Systems, 37:131278–131315, 2024. 2

2024
[77]

Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation.ArXiv, abs/2402.17245, 2024

DaiqingLi,AleksKamko,EhsanAkhgari,AliSabet,Linmiao Xu, and Suhail Doshi. Playground v2. 5: Three insights to- wards enhancing aesthetic quality in text-to-image generation. arXiv preprint arXiv:2402.17245, 2024. 2

work page arXiv 2024
[78]

Hunyuan- dit: A powerful multi-resolution diffusion transformer with fine-grained chinese understanding, 2024

Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, Dayou Chen, Jiajun He, Jiahao Li, WenyueLi,ChenZhang,RongweiQuan,JianxiangLu,Jiabin Huang, Xiaoyan Yuan, Xiaoxiao Zheng, Yixuan Li, Jihong Zhang,ChaoZhang,MengChen,JieLiu,ZhengFang,Weiyan Wang,JinbaoXue,YangyuTao,JianchenZ...

2024
[79]

DALL·E 3

OpenAI. DALL·E 3. https://openai.com/research/dall-e-3, September 2023. 2

2023
[80]

Fast high- resolution image synthesis with latent adversarial diffusion distillation

Axel Sauer, Frederic Boesel, Tim Dockhorn, Andreas Blattmann, Patrick Esser, and Robin Rombach. Fast high- resolution image synthesis with latent adversarial diffusion distillation. InSIGGRAPH Asia 2024 Conference Papers, pages 1–11, 2024. 2 Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation Supplementary Material

2024

Showing first 80 references.