Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation
Pith reviewed 2026-05-10 05:11 UTC · model grok-4.3
The pith
MeanFlow enables one-step image generation from text by requiring and using highly discriminative LLM text encoders.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Because MeanFlow uses extremely few refinement steps, often only one, text feature representations must be sufficiently discriminable; by leveraging a powerful LLM-based text encoder validated to have the required semantic properties and adapting the MeanFlow generation process, the paper achieves efficient text-conditioned one-step image synthesis for the first time.
What carries the argument
Adaptation of the MeanFlow single-refinement process to text features that meet a high-discriminability requirement, supplied by a validated LLM encoder.
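The discriminability requirement can be probed directly. The sketch below is a toy illustration, not the paper's actual analysis: it scores a set of prompt embeddings by mean pairwise cosine distance, and contrasts one-hot "class label" features (maximally separated) with near-collinear features from a hypothetical weak encoder. The function name and toy features are assumptions for illustration.

```python
import numpy as np

def discriminability(embs: np.ndarray) -> float:
    """Mean pairwise cosine distance between condition embeddings.

    Higher values mean conditions are better separated in feature
    space -- the property the review says one-step MeanFlow needs.
    """
    normed = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sims = normed @ normed.T                    # cosine similarity matrix
    n = len(embs)
    off_diag = sims[~np.eye(n, dtype=bool)]     # drop self-similarity
    return float(1.0 - off_diag.mean())         # mean cosine distance

# One-hot class-label features: orthogonal, hence maximally discriminable.
class_feats = np.eye(4)
# Near-collinear features, a stand-in for a weak text encoder.
weak_feats = np.ones((4, 4)) + 0.01 * np.random.default_rng(0).normal(size=(4, 4))

assert discriminability(class_feats) > discriminability(weak_feats)
```

On this metric, discrete class embeddings score near the maximum, which matches the review's observation that "discrete and easily distinguishable class features perform well" in the one-step regime.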
If this is right
- Text inputs can replace fixed class labels to enable richer, more flexible image content in one-step generation.
- The same discriminability requirement yields measurable generation improvements on standard diffusion models.
- The approach supplies a practical template for extending other one-step or few-step generators to text conditions.
- Future work on text-conditioned MeanFlow can focus on encoder selection rather than multi-step architectural changes.
Where Pith is reading between the lines
- The discriminability constraint likely applies to other low-step generative regimes, suggesting encoder design should prioritize separation metrics over general language modeling.
- Real-time applications could adopt this for on-device text-to-image synthesis where multi-step sampling is too slow.
- Optimizing encoders specifically for generation discriminability, rather than downstream NLP tasks, may become a distinct research direction.
- The insight could transfer to text-conditioned generation of video or 3D content under tight step budgets.
Load-bearing premise
Text feature representations must possess sufficiently high discriminability to function with the extremely limited number of refinement steps in MeanFlow.
What would settle it
A controlled test showing that a text encoder lacking high discriminability still produces high-quality one-step text-to-image results in the adapted MeanFlow framework.
Original abstract
Few-step generation has been a long-standing goal, with recent one-step generation methods exemplified by MeanFlow achieving remarkable results. Existing research on MeanFlow primarily focuses on class-to-image generation. However, an intuitive yet unexplored direction is to extend the condition from fixed class labels to flexible text inputs, enabling richer content creation. Compared to the limited class labels, text conditions pose greater challenges to the model's understanding capability, necessitating the effective integration of powerful text encoders into the MeanFlow framework. Surprisingly, although incorporating text conditions appears straightforward, we find that integrating powerful LLM-based text encoders using conventional training strategies results in unsatisfactory performance. To uncover the underlying cause, we conduct detailed analyses and reveal that, due to the extremely limited number of refinement steps in the MeanFlow generation, such as only one step, the text feature representations are required to possess sufficiently high discriminability. This also explains why discrete and easily distinguishable class features perform well within the MeanFlow framework. Guided by these insights, we leverage a powerful LLM-based text encoder validated to possess the required semantic properties and adapt the MeanFlow generation process to this framework, resulting in efficient text-conditioned synthesis for the first time. Furthermore, we validate our approach on the widely used diffusion model, demonstrating significant generation performance improvements. We hope this work provides a general and practical reference for future research on text-conditioned MeanFlow generation. The code is available at https://github.com/AMAP-ML/EMF.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper extends the MeanFlow one-step image generation framework from class-conditional to text-conditional synthesis. It identifies that standard integration of LLM-based text encoders yields poor results because the one-step (or few-step) refinement process requires text features with high discriminability, in contrast to discrete class labels. Through targeted analyses, the authors select and validate an LLM encoder possessing the necessary semantic properties, adapt the MeanFlow generation process, and report successful text-conditioned one-step synthesis. The same adaptation is shown to improve a standard diffusion model, with code released for reproducibility.
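Under the MeanFlow formulation summarized above, a network models an average velocity u(z, r, t, c) over an interval [r, t] rather than an instantaneous one, so the displacement from noise to image collapses into a single evaluation: x = z₁ − (t − r) · u(z₁, 0, 1, c). A minimal sketch of that one-step sampler, with `toy_u` standing in for a trained network (an assumption, not the paper's model):

```python
import numpy as np

def one_step_sample(u, noise, cond):
    """One-step MeanFlow-style sampling.

    `u(z, r, t, cond)` is assumed to approximate the average velocity
    over [r, t]; the full-interval displacement is then one network
    call: x0 = z1 - (t - r) * u(z1, 0, 1, cond).
    """
    t, r = 1.0, 0.0
    return noise - (t - r) * u(noise, r, t, cond)

# Toy average-velocity field: drifts every point toward a
# condition-dependent target in one step.
def toy_u(z, r, t, cond):
    return z - cond

rng = np.random.default_rng(0)
z1 = rng.normal(size=(2, 8))          # initial noise
target = np.full((2, 8), 3.0)         # condition-dependent target
x0 = one_step_sample(toy_u, z1, target)
assert np.allclose(x0, target)        # z1 - (z1 - target) == target
```

The single evaluation is what makes the conditioning signal so load-bearing: with no later refinement steps to correct course, poorly separated text features cannot be disambiguated downstream.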
Significance. If the empirical findings hold, the work is significant for enabling flexible, text-driven one-step image generation, a practical advance over fixed class-label conditioning in few-step methods. The identification of the discriminability requirement, the adaptation strategy, and the cross-model validation (MeanFlow and diffusion) provide a concrete reference for future research. The public code release is a notable strength, directly supporting verification of the encoder selection, process adaptation, and reported performance gains.
Minor comments (3)
- [Abstract] The statement that the approach results in 'efficient text-conditioned synthesis for the first time' would be strengthened by a brief qualification or citation of closely related prior attempts at text-conditioned few-step generation.
- The description of the adaptation to the MeanFlow generation process would benefit from an explicit equation or pseudocode block showing the modified conditioning and refinement steps, to aid direct implementation.
- Tables reporting quantitative results (e.g., FID or CLIP scores) should include standard deviations across multiple runs or seeds to allow assessment of statistical significance of the claimed improvements.
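The last comment can be acted on mechanically: report each metric as mean ± sample standard deviation over seeds. A small sketch with hypothetical per-seed FID values (the numbers are placeholders, not results from the paper):

```python
import numpy as np

# Hypothetical per-seed FID scores; real values would come from
# repeated training/evaluation runs with different random seeds.
fid_per_seed = {
    "baseline": [12.4, 12.9, 12.6],
    "ours":     [9.8, 10.1, 9.7],
}

for name, scores in fid_per_seed.items():
    arr = np.asarray(scores)
    mean = arr.mean()
    std = arr.std(ddof=1)   # ddof=1: sample std across seeds
    print(f"{name}: FID {mean:.2f} ± {std:.2f} (n={len(arr)})")
```

Reporting the spread this way lets readers judge whether the claimed improvement exceeds seed-to-seed noise.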
Simulated Author's Rebuttal
We thank the referee for the positive summary of our work, the recognition of its significance for enabling flexible text-driven one-step image generation, and the recommendation for minor revision. We appreciate the acknowledgment of our analyses on text feature discriminability, the adaptation of MeanFlow, the cross-validation on diffusion models, and the code release.
Circularity Check
No significant circularity; empirical adaptation with independent validation
Full rationale
The paper's chain consists of empirical analysis showing that standard LLM text encoders lack sufficient discriminability for one-step MeanFlow (due to limited refinement steps), followed by selection of a validated encoder with the required properties, adaptation of the generation process, and reporting of improved synthesis results (also shown to benefit standard diffusion models). No equations or derivations are presented that reduce by construction to fitted parameters, self-definitions, or prior self-citations as load-bearing premises. The central claim is an empirical demonstration supported by analyses and external benchmarks, with code released for verification. The argument is open to external checks and receives score 0.