Recognition: no theorem link
Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer
Pith reviewed 2026-05-11 14:01 UTC · model grok-4.3
The pith
A 6-billion-parameter image model reaches commercial-level photorealism and text rendering with far less training compute than larger rivals.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Z-Image, built on the S3-DiT architecture, matches or surpasses leading competitors across various dimensions through systematic optimization of the whole model lifecycle, from curated data through the training curriculum to few-step distillation with reward post-training. It delivers exceptional photorealistic image generation and bilingual text rendering that rival top-tier commercial models, while requiring only 314K H800 GPU hours of training compute.
What carries the argument
The Scalable Single-Stream Diffusion Transformer (S3-DiT) architecture, which runs the denoising computation through a single efficient stream rather than through multiple separate processing paths.
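The paper's exact block design is not reproduced on this page; as a rough illustration of the single-stream idea under common DiT conventions, the minimal sketch below concatenates text and image tokens into one sequence sharing a single attention and MLP path (names such as SingleStreamBlock, txt, and img are illustrative, not taken from the paper).

```python
# Minimal sketch of a "single-stream" DiT block: text and image tokens are
# concatenated into one sequence and share one attention + MLP path, in
# contrast to dual-stream designs that keep separate per-modality blocks.
# Sizes and names are illustrative, not taken from the Z-Image paper.
import torch
import torch.nn as nn


class SingleStreamBlock(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, txt: torch.Tensor, img: torch.Tensor):
        # One joint sequence: every image token attends to every text token
        # (and vice versa) through the same weights -- the "single stream".
        x = torch.cat([txt, img], dim=1)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x[:, : txt.shape[1]], x[:, txt.shape[1]:]


if __name__ == "__main__":
    block = SingleStreamBlock()
    txt = torch.randn(2, 77, 256)    # prompt tokens
    img = torch.randn(2, 1024, 256)  # latent image patches
    t_out, i_out = block(txt, img)
    print(t_out.shape, i_out.shape)
```

Whether Z-Image adds timestep modulation, positional encodings, or modality-specific projections around this joint attention is a design detail of the paper that the sketch does not attempt to capture.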
If this is right
- Z-Image-Turbo enables sub-second inference on enterprise GPUs while remaining compatible with consumer hardware under 16GB VRAM.
- Z-Image-Edit provides strong instruction-following for image editing through the same omni-pre-training approach.
- Full training completes in 314K H800 GPU hours, roughly $630K in compute (see the arithmetic sketch after this list), lowering the barrier to high-performance model development.
- Open release of code, weights, and demo supports community extension of efficient generative models beyond current proprietary systems.
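The cost figure can be sanity-checked with back-of-envelope arithmetic; the per-hour H800 rate below is simply whatever makes the paper's two reported numbers consistent, not a rate quoted in the paper.

```python
# Back-of-envelope consistency check of the reported training budget.
# The ~$2 per H800 GPU-hour rate is implied by the paper's own totals,
# not a figure quoted in it.
gpu_hours = 314_000        # reported H800 GPU hours
total_cost_usd = 630_000   # reported approximate cost in USD
implied_rate = total_cost_usd / gpu_hours
print(f"implied rate: ${implied_rate:.2f} per GPU-hour")  # ~$2.01
```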
Where Pith is reading between the lines
- Efficiency-focused designs like this could allow more frequent retraining or domain-specific fine-tuning without massive infrastructure.
- The bilingual text strength suggests similar single-stream methods may improve multilingual handling in other vision-language tasks.
- Reduced overall compute opens possibilities for on-device or edge deployment of high-quality image generation.
- The pattern of combining architecture changes with data and curriculum optimizations may apply to related generative domains such as video.
Load-bearing premise
The claimed performance levels arise directly from the described data curation, training curriculum, S3-DiT design, and distillation methods rather than from undisclosed larger data scales or selective evaluation practices.
What would settle it
Independent side-by-side evaluation of the released Z-Image weights against the cited competitors on fixed public benchmarks for photorealism and text accuracy, using identical prompts and metrics with fully disclosed training data volume.
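One minimal form such a head-to-head comparison could take is sketched below; the generate_* and ocr_accuracy callables are hypothetical placeholders the evaluator would supply (e.g. wrappers around the released weights and an OCR engine), not APIs described in the paper.

```python
# Sketch of a fixed-prompt, side-by-side evaluation harness for the
# text-rendering claim. All callables are placeholders the evaluator supplies.
from statistics import mean
from typing import Callable


def side_by_side(prompts: list[str],
                 generate_z_image: Callable[[str], object],
                 generate_baseline: Callable[[str], object],
                 ocr_accuracy: Callable[[object, str], float]) -> dict:
    """Score two models on identical prompts with an identical metric."""
    z_scores, base_scores = [], []
    for prompt in prompts:
        z_scores.append(ocr_accuracy(generate_z_image(prompt), prompt))
        base_scores.append(ocr_accuracy(generate_baseline(prompt), prompt))
    wins = sum(z > b for z, b in zip(z_scores, base_scores))
    return {
        "z_image_mean": mean(z_scores),
        "baseline_mean": mean(base_scores),
        "z_image_win_rate": wins / len(prompts),
    }
```

Photorealism would need a complementary protocol, such as FID against a fixed reference set or blinded human preference, since an OCR-style metric only covers the text-rendering claim.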
read the original abstract
The landscape of high-performance image generation models is currently dominated by proprietary systems, such as Nano Banana Pro and Seedream 4.0. Leading open-source alternatives, including Qwen-Image, Hunyuan-Image-3.0 and FLUX.2, are characterized by massive parameter counts (20B to 80B), making them impractical for inference, and fine-tuning on consumer-grade hardware. To address this gap, we propose Z-Image, an efficient 6B-parameter foundation generative model built upon a Scalable Single-Stream Diffusion Transformer (S3-DiT) architecture that challenges the "scale-at-all-costs" paradigm. By systematically optimizing the entire model lifecycle -- from a curated data infrastructure to a streamlined training curriculum -- we complete the full training workflow in just 314K H800 GPU hours (approx. $630K). Our few-step distillation scheme with reward post-training further yields Z-Image-Turbo, offering both sub-second inference latency on an enterprise-grade H800 GPU and compatibility with consumer-grade hardware (<16GB VRAM). Additionally, our omni-pre-training paradigm also enables efficient training of Z-Image-Edit, an editing model with impressive instruction-following capabilities. Both qualitative and quantitative experiments demonstrate that our model achieves performance comparable to or surpassing that of leading competitors across various dimensions. Most notably, Z-Image exhibits exceptional capabilities in photorealistic image generation and bilingual text rendering, delivering results that rival top-tier commercial models, thereby demonstrating that state-of-the-art results are achievable with significantly reduced computational overhead. We publicly release our code, weights, and online demo to foster the development of accessible, budget-friendly, yet state-of-the-art generative models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Z-Image, a 6B-parameter image generation foundation model using a Scalable Single-Stream Diffusion Transformer (S3-DiT) architecture. It optimizes the full lifecycle via curated data infrastructure, streamlined training curriculum, few-step distillation with reward post-training (yielding Z-Image-Turbo for sub-second inference), and omni-pre-training for an editing variant (Z-Image-Edit). The central claim is that qualitative and quantitative experiments show performance comparable to or surpassing larger models (FLUX.2, Qwen-Image, commercial systems) in photorealism and bilingual text rendering, achieved at low cost (314K H800 GPU hours, ~$630K).
Significance. If the performance claims hold with rigorous evidence, the work would be significant for demonstrating that state-of-the-art image generation results are attainable with substantially smaller models and lower training compute than the current 20B–80B scale paradigm, potentially improving accessibility for fine-tuning and inference on consumer hardware.
major comments (2)
- [Abstract] Abstract: The assertion that 'qualitative and quantitative experiments demonstrate' comparable or superior performance is unsupported by any reported metrics (FID, CLIP score, OCR accuracy for bilingual text, human preference rates), baselines, ablation studies, or error analysis, leaving the central empirical claim without visible evidence.
- [Results and Experiments] Experimental claims: No dataset cardinality, image-text pair counts, filtering criteria, or evaluation protocol details (e.g., inference steps matched to baselines, statistical tests, blinded raters) are supplied, so it is impossible to determine whether the S3-DiT architecture, curriculum, or distillation—not undisclosed data advantages—produce the headline results rivaling 20B–80B models.
minor comments (2)
- The manuscript would benefit from a dedicated table or figure summarizing quantitative comparisons against named baselines with exact scores and standard deviations.
- Clarify the precise definition and implementation details of the 'Scalable Single-Stream Diffusion Transformer (S3-DiT)' early in the architecture section to aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address the major comments point by point below and will revise the manuscript to strengthen the empirical evidence and transparency of our claims.
read point-by-point responses
-
Referee: [Abstract] Abstract: The assertion that 'qualitative and quantitative experiments demonstrate' comparable or superior performance is unsupported by any reported metrics (FID, CLIP score, OCR accuracy for bilingual text, human preference rates), baselines, ablation studies, or error analysis, leaving the central empirical claim without visible evidence.
Authors: We agree that the abstract makes a broad claim without embedding specific quantitative metrics, which leaves the central assertion insufficiently supported at first reading. The full manuscript does contain quantitative results and comparisons in the Experiments section, but these are not summarized in the abstract. In the revised version we will update the abstract to explicitly report key metrics (FID, CLIP score, OCR accuracy for bilingual text, and human preference rates) together with the main baselines. We will also expand the Experiments section with additional ablations, error analysis, and clearer tabular presentation of all quantitative results so that the performance claims are directly evidenced. revision: yes
-
Referee: [Results and Experiments] Experimental claims: No dataset cardinality, image-text pair counts, filtering criteria, or evaluation protocol details (e.g., inference steps matched to baselines, statistical tests, blinded raters) are supplied, so it is impossible to determine whether the S3-DiT architecture, curriculum, or distillation—not undisclosed data advantages—produce the headline results rivaling 20B–80B models.
Authors: We acknowledge that the current manuscript provides only high-level descriptions of the data pipeline and evaluation setup, making it difficult to isolate the contributions of the S3-DiT architecture and training curriculum from potential data advantages. In the revision we will add a dedicated subsection detailing the total number of image-text pairs, filtering criteria, deduplication steps, and the full evaluation protocol (including inference steps used for all baselines, statistical tests, and human-study design with blinded raters). These additions will allow readers to assess the source of the reported performance gains. revision: yes
Circularity Check
No circularity: empirical performance claims rest on external benchmarks and experiments.
full rationale
The paper describes an empirical architecture (S3-DiT) and training pipeline whose headline results are asserted via qualitative/quantitative experiments on standard image-generation tasks. No equations, first-principles derivations, or fitted parameters are presented that reduce by construction to the inputs; the central claim is a comparative performance statement against external models (FLUX.2, Qwen-Image, etc.) rather than a self-referential prediction. Self-citations are absent from the provided text, and the training-cost figure (314K H800 hours) is an input cost, not a derived output. This is the normal non-circular case for a systems paper.
Axiom & Free-Parameter Ledger
invented entities (1)
-
S3-DiT
no independent evidence
Forward citations
Cited by 44 Pith papers
-
MiVE: Multiscale Vision-language features for reference-guided video Editing
MiVE repurposes VLMs as multiscale feature extractors integrated into a unified self-attention Diffusion Transformer, achieving top human preference in reference-guided video editing.
-
ImageAttributionBench: How Far Are We from Generalizable Attribution?
ImageAttributionBench is a benchmark dataset demonstrating that state-of-the-art image attribution methods lack robustness to image degradation and fail to generalize to semantically disjoint domains.
-
Asymmetric Flow Models
Asymmetric Flow Modeling restricts noise prediction to a low-rank subspace for high-dimensional flow generation, reaching 1.57 FID on ImageNet 256x256 and new state-of-the-art pixel text-to-image performance via finet...
-
UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation
UniCustom fuses ViT and VAE features before VLM encoding and uses two-stage training plus slot-wise regularization to improve subject consistency in multi-reference diffusion-based image generation.
-
RevealLayer: Disentangling Hidden and Visible Layers via Occlusion-Aware Image Decomposition
RevealLayer decomposes natural images into multiple RGBA layers using diffusion models with region-aware attention, occlusion-guided adaptation, and a composite loss, outperforming prior methods on a new benchmark dataset.
-
What Concepts Lie Within? Detecting and Suppressing Risky Content in Diffusion Transformers
A method using attention head vectors detects and suppresses risky content generation in Diffusion Transformers at inference time.
-
Weblica: Scalable and Reproducible Training Environments for Visual Web Agents
Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar op...
-
DirectEdit: Step-Level Accurate Inversion for Flow-Based Image Editing
DirectEdit achieves step-level accurate inversion for flow-based image editing by directly aligning forward paths, using attention feature injection and mask-guided noise blending to balance fidelity and editability w...
-
Evaluating Remote Sensing Image Captions Beyond Metric Biases
Unfine-tuned MLLMs outperform fine-tuned models on remote sensing image captioning when captions are scored by their ability to reconstruct the source image, and a training-free self-correction method achieves SOTA pe...
-
Generative Texture Filtering
A two-stage fine-tuning strategy on pre-trained generative models enables effective texture filtering that outperforms prior methods on challenging cases.
-
Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning
GDMD replaces raw-sample rewards with distillation-gradient rewards in RL-guided diffusion distillation, yielding 4-step models that surpass their multi-step teachers on GenEval and human preference metrics.
-
CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-Controllable Video Generation
CT-1 transfers spatial reasoning from vision-language models to estimate camera trajectories, which are then used in a video diffusion model with wavelet regularization to produce controllable videos, claiming 25.7% b...
-
Large-Scale Universal Defect Generation: Foundation Models and Datasets
A 300K quadruplet dataset and UniDG foundation model enable reference- or text-driven defect generation across categories, outperforming few-shot baselines on anomaly detection tasks.
-
FlowGuard: Towards Lightweight In-Generation Safety Detection for Diffusion Models via Linear Latent Decoding
FlowGuard detects unsafe content during diffusion image generation via linear latent decoding and curriculum learning, outperforming prior methods by over 30% F1 while reducing GPU memory by 97% and projection time to...
-
SurFITR: A Dataset for Surveillance Image Forgery Detection and Localisation
SurFITR is a new collection of 137k+ surveillance-style forged images that causes existing detectors to degrade while enabling substantial gains when used for training in both in-domain and cross-domain settings.
-
Beyond Few-Step Inference: Accelerating Video Diffusion Transformer Model Serving with Inter-Request Caching Reuse
Chorus accelerates video DiT serving up to 45% via inter-request caching reuse in a three-stage denoising strategy with token-guided attention amplification.
-
Skill-Aligned Annotation for Reliable Evaluation in Text-to-Image Generation
Skill-aligned annotation improves inter-annotator agreement and evaluation stability in text-to-image generation compared to uniform annotation baselines.
-
Early Semantic Grounding in Image Editing Models for Zero-Shot Referring Image Segmentation
Pretrained instruction-based image editing models exhibit early foreground-background separability that enables a training-free framework for zero-shot referring image segmentation using a single denoising step.
-
UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation
A unified visual conditioning approach fuses semantic and appearance features before VLM processing, with two-stage training and slot-wise regularization, to improve consistency in multi-reference image generation.
-
L2P: Unlocking Latent Potential for Pixel Generation
L2P repurposes pre-trained LDMs for direct pixel generation via large-patch tokenization and shallow-layer training on synthetic data, matching source performance with 8-GPU training and enabling native 4K output.
-
HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer
A pixel-space Diffusion Transformer with Unified Transformer architecture unifies image generation, editing, and personalization in an end-to-end model that maps all inputs to a shared token space and scales from 8B t...
-
SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation
SCOPE maintains semantic commitments via structured specifications and conditional skill orchestration, achieving 0.60 EGIP on the new Gen-Arena benchmark while outperforming baselines on WISE-V and MindBench.
-
DynT2I-Eval: A Dynamic Evaluation Framework for Text-to-Image Models
DynT2I-Eval creates fresh prompts via dimension decomposition and dynamic sampling to evaluate text-to-image models on text alignment, quality, and aesthetics while maintaining a stable leaderboard.
-
D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models
D-OPSD enables continuous supervised fine-tuning of few-step diffusion models via on-policy self-distillation where the model acts as both teacher (multimodal context) and student (text-only context) on its own roll-outs.
-
LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model
LLaDA2.0-Uni unifies multimodal understanding and generation inside one discrete diffusion large language model with a semantic tokenizer, MoE backbone, and diffusion decoder.
-
CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation
CoInteract adds a human-aware mixture-of-experts and spatially-structured co-generation to a diffusion transformer to synthesize videos with stable structures and physically plausible human-object contacts.
-
Any3DAvatar: Fast and High-Quality Full-Head 3D Avatar Reconstruction from Single Portrait Image
Any3DAvatar reconstructs full-head 3D Gaussian avatars from one image via one-step denoising on a Plücker-aware scaffold plus auxiliary view supervision, beating prior single-image methods on fidelity while running su...
-
Generative Refinement Networks for Visual Synthesis
GRN uses hierarchical binary quantization and entropy-guided refinement to set new ImageNet records of 0.56 rFID for reconstruction and 1.81 gFID for class-conditional generation while releasing code and models.
-
Continuous Adversarial Flow Models
Continuous adversarial flow models replace MSE in flow matching with adversarial training via a discriminator, improving guidance-free FID on ImageNet from 8.26 to 3.63 for SiT and similar gains for JiT and text-to-im...
-
Gen-Searcher: Reinforcing Agentic Search for Image Generation
Gen-Searcher is the first search-augmented image generation agent trained with SFT followed by agentic RL using dual text and image rewards on custom datasets and the KnowGen benchmark.
-
SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.
-
IdentiFace: Multi-Modal Iterative Diffusion Framework for Identifiable Suspect Face Generation in Crime Investigations
IdentiFace is a multi-modal iterative diffusion framework that generates identifiable suspect faces with improved identity retrieval for law enforcement applications.
-
Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling
Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...
-
Exploring Time Conditioning in Diffusion Generative Models from Disjoint Noisy Data Manifolds
Aligning the DDIM forward diffusion process with flow-matching manifold evolution enables high-quality generation without time conditioning, and class-conditional synthesis is possible with an unconditional denoiser b...
-
Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation
Tuna-2 shows pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive or superior results on understanding and generation benchmarks.
-
Context Unrolling in Omni Models
Omni is a multimodal model whose native training on diverse data types enables context unrolling, allowing explicit reasoning across modalities to better approximate shared knowledge and improve downstream performance.
-
CreatiParser: Generative Image Parsing of Raster Graphic Designs into Editable Layers
CreatiParser decomposes raster graphic designs into editable text, background, and sticker layers via a hybrid VLM-diffusion model with ParserReward and GRPO optimization, reporting 23.7% average metric gains on Parse...
-
On Semiotic-Grounded Interpretive Evaluation of Generative Art
SemJudge uses a Hierarchical Semiosis Graph based on Peircean theory to evaluate deeper artistic meaning in generative art and aligns better with human judgments than prior metrics.
-
Qwen-Image-2.0 Technical Report
Qwen-Image-2.0 unifies high-fidelity image generation and precise editing by coupling Qwen3-VL with a Multimodal Diffusion Transformer, improving text rendering, photorealism, and complex prompt following over prior versions.
-
Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation
JoyAI-Image unifies visual understanding, generation, and editing in one model and claims stronger spatial intelligence through bidirectional perception-generation loops.
-
Beyond Standard Benchmarks: A Systematic Audit of Vision-Language Model's Robustness to Natural Semantic Variation Across Diverse Tasks
Robust CLIP models amplify vulnerabilities to natural adversarial scenarios while standard CLIP shows large performance drops on natural language-induced adversarial examples in zero-shot classification, segmentation,...
-
Can Nano Banana 2 Replace Traditional Image Restoration Models? An Evaluation of Its Performance on Image Restoration Tasks
Nano Banana 2 delivers competitive perceptual quality on image restoration but produces over-enhanced results that diverge from input fidelity in ways standard metrics miss.
-
The Fourth Challenge on Image Super-Resolution ($\times$4) at NTIRE 2026: Benchmark Results and Method Overview
The NTIRE 2026 ×4 super-resolution challenge benchmarks 31 teams on bicubic-downsampled images using PSNR for the restoration track and perceptual scores for the realism track.
-
The Second Challenge on Real-World Face Restoration at NTIRE 2026: Methods and Results
The NTIRE 2026 real-world face restoration challenge report details outcomes from 9 valid team submissions advancing perceptual quality and identity consistency in degraded face images.
Reference graph
Works this paper leans on
-
[1]
Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael Voznesensky, Bin Bao, Peter Bell, David Berard, Evgeni Burovski, et al. Pytorch 2: Faster machine learning through dynamic python bytecode transformation and graph compilation. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Langu...
work page 2024
-
[2]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[3]
Imagen 3. arXiv preprint arXiv:2408.07009, 2024
Jason Baldridge, Jakob Bauer, Mukul Bhutani, Nicole Brichtova, Andrew Bunner, Lluis Castrejon, Kelvin Chan, Yichang Chen, Sander Dieleman, Yuqing Du, et al. Imagen 3. arXiv preprint arXiv:2408.07009, 2024
-
[4]
Improving image generation with better captions. Computer Science
James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, 2(3):8, 2023
work page 2023
-
[5]
Instructpix2pix: Learning to follow image editing instructions
Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18392–18402, 2023
work page 2023
-
[6]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020
work page 2020
-
[7]
Hidream-i1: An open-source high-efficient image generative foundation model
Qi Cai, Yehao Li, Yingwei Pan, Ting Yao, and Tao Mei. Hidream-i1: An open-source high-efficient image generative foundation model. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 13636–13639, 2025
work page 2025
-
[8]
Hunyuanimage 3.0 technical report. arXiv preprint arXiv:2509.23951, 2025
Siyu Cao, Hangting Chen, Peng Chen, Yiji Cheng, Yutao Cui, Xinchi Deng, Ying Dong, Kipper Gong, Tianpeng Gu, Xiusen Gu, et al. Hunyuanimage 3.0 technical report. arXiv preprint arXiv:2509.23951, 2025
-
[9]
Oneig-bench: Omni-dimensional nuanced evaluation for image generation
Jingjing Chang, Yixiao Fang, Peng Xing, Shuhan Wu, Wei Cheng, Rui Wang, Xianfang Zeng, Gang Yu, and Hai-Bao Chen. Oneig-bench: Omni-dimensional nuanced evaluation for image generation. arXiv preprint arXiv:2506.07977, 2025
-
[10]
Textdiffuser-2: Unleashing the power of language models for text rendering
Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, and Furu Wei. Textdiffuser-2: Unleashing the power of language models for text rendering. In European Conference on Computer Vision, pages 386–402. Springer, 2024
work page 2024
-
[11]
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset
Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, et al. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset. arXiv preprint arXiv:2505.09568, 2025
work page Pith review arXiv 2025
-
[12]
Pixart-𝜎: Weak-to-strong training of diffusion transformer for 4k text-to-image generation
Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-𝜎: Weak-to-strong training of diffusion transformer for 4k text-to-image generation. In European Conference on Computer Vision, pages 74–91. Springer, 2024
work page 2024
-
[13]
Pixart-𝛼: Fast training of diffusion transformer for photorealistic text-to-image synthesis
Junsong Chen, YU Jincheng, GE Chongjian, Lewei Yao, Enze Xie, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-𝛼: Fast training of diffusion transformer for photorealistic text-to-image synthesis. In The Twelfth International Conference on Learning Representations
-
[14]
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[15]
Emerging Properties in Unified Multimodal Pretraining
Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining, 2025. URL https://arxiv.org/abs/2505.14683
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[16]
Cogview: Mastering text-to-image generation via transformers
Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, et al. Cogview: Mastering text-to-image generation via transformers. Advances in neural information processing systems, 34:19822–19835, 2021
work page 2021
-
[17]
Textcrafter: Accurately rendering multiple texts in complex visual scenes
Nikai Du, Zhennan Chen, Zhizhou Chen, Shan Gao, Xi Chen, Zhengkai Jiang, Jian Yang, and Ying Tai. Textcrafter: Accurately rendering multiple texts in complex visual scenes. arXiv preprint arXiv:2503.23461, 2025
-
[18]
Scaling rectified flow transformers for high-resolution image synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Proceedings of the International Conference on Machine Learning (ICML), 2024
work page 2024
-
[19]
Rongyao Fang, Aldrich Yu, Chengqi Duan, Linjiang Huang, Shuai Bai, Yuxuan Cai, Kun Wang, Si Liu, Xihui Liu, and Hongsheng Li. Flux-reason-6m and prism-bench: A million-scale text-to-image reasoning dataset and comprehensive benchmark. arXiv preprint arXiv:2509.09680, 2025
-
[20]
Peng Gao, Le Zhuo, Chris Liu, Ruoyi Du, Xu Luo, Longtian Qiu, Yuhang Zhang, et al. Lumina-t2x: Transforming text into any modality, resolution, and duration via flow-based large diffusion transformers. arXiv preprint arXiv:2405.05945, 2024
-
[21]
Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, et al. Seedream 3.0 technical report. arXiv preprint arXiv:2504.11346, 2025
work page internal anchor Pith review arXiv 2025
-
[22]
X-omni: Reinforcement learning makes discrete autoregressive image generative models great again
Zigang Geng, Yibing Wang, Yeyao Ma, Chen Li, Yongming Rao, Shuyang Gu, Zhao Zhong, Qinglin Lu, Han Hu, Xiaosong Zhang, et al. X-omni: Reinforcement learning makes discrete autoregressive image generative models great again. arXiv preprint arXiv:2507.22058, 2025
-
[23]
Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems, 36:52132–52152, 2023
work page 2023
-
[24]
Dynamic few-shot visual learning without forgetting
Spyros Gidaris and Nikos Komodakis. Dynamic few-shot visual learning without forgetting. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4367–4375, 2018
work page 2018
-
[25]
Gemini 2.5 flash & 2.5 flash image model card
Google. Gemini 2.5 flash & 2.5 flash image model card. https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-2-5-Flash-Model-Card.pdf, 2025
work page 2025
-
[26]
Google. Imagen 4 model card. https://storage.googleapis.com/deepmind-media/Model-Cards/Imagen-4-Model-Card.pdf, 2025
work page 2025
-
[27]
Google. Nano banana pro. https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Pro-Image-Model-Card.pdf, 2025
work page 2025
-
[28]
Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis
Jian Han, Jinlai Liu, Yi Jiang, Bin Yan, Yuqi Zhang, Zehuan Yuan, Bingyue Peng, and Xiaobing Liu. Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15733–15744, 2025
work page 2025
-
[29]
Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. Advances in Neural Information Processing Systems Workshops (NeurIPS Workshops), 2021
work page 2021
-
[30]
ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment
Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment. arXiv preprint arXiv:2403.05135, 2024
work page internal anchor Pith review arXiv 2024
-
[31]
Dengyang Jiang, Dongyang Liu, Zanyi Wang, Qilong Wu, Xin Jin, David Liu, Zhen Li, Mengmeng Wang, Peng Gao, and Harry Yang. Distribution matching distillation meets reinforcement learning. arXiv preprint arXiv:2511.13649, 2025
-
[32]
Analyzing and improving the training dynamics of diffusion models
Tero Karras, Miika Aittala, Jaakko Lehtinen, Janne Hellsten, Timo Aila, and Samuli Laine. Analyzing and improving the training dynamics of diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24174–24184, 2024
work page 2024
-
[33]
Kolors 2.0. https://app.klingai.com/cn/, 2025
Kuaishou Kolors Team. Kolors 2.0. https://app.klingai.com/cn/, 2025
work page 2025
-
[34]
Flux. https://github.com/black-forest-labs/flux, 2023
Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2023
work page 2023
-
[35]
FLUX.2: State-of-the-Art Visual Intelligence
Black Forest Labs. FLUX.2: State-of-the-Art Visual Intelligence. https://bfl.ai/blog/flux-2, 2025
work page 2025
-
[36]
Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, et al. Flux.1 kontext: Flow matching for in-context image generation and editing in latent space. arXiv preprint arXiv:2506.15742, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[37]
LeaderGPU. Gpu server rental pricing. https://www.leadergpu.com/, 2025. Accessed: November 2025
work page 2025
-
[38]
Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation
Daiqing Li, Aleks Kamko, Ehsan Akhgari, Ali Sabet, Linmiao Xu, and Suhail Doshi. Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation. arXiv preprint arXiv:2402.17245, 2024
-
[39]
Ragdiffusion: Faithful cloth generation via external knowledge assimilation
Yuhan Li, Xianfeng Tan, Wenxiang Shang, Yubo Wu, Jian Wang, Xuanhong Chen, Yi Zhang, Hangcheng Zhu, and Bingbing Ni. Ragdiffusion: Faithful cloth generation via external knowledge assimilation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17485–17495, 2025
work page 2025
-
[40]
Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, et al. Hunyuan-dit: A powerful multi-resolution diffusion transformer with fine-grained chinese understanding. arXiv preprint arXiv:2405.08748, 2024
-
[41]
Visualcloze: A universal image generation framework via visual in-context learning
Zhong-Yu Li, Ruoyi Du, Juncheng Yan, Le Zhuo, Zhen Li, Peng Gao, Zhanyu Ma, and Ming-Ming Cheng. Visualcloze: A universal image generation framework via visual in-context learning. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2025
work page 2025
-
[42]
Zongjian Li, Zheyuan Liu, Qihui Zhang, Bin Lin, Shenghai Yuan, Zhiyuan Yan, Yang Ye, Wangbo Yu, Yuwei Niu, and Li Yuan. Uniworld-v2: Reinforce image editing with diffusion negative-aware finetuning and mllm implicit feedback. arXiv preprint arXiv:2510.16888, 2025
-
[43]
UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
Bin Lin, Zongjian Li, Xinhua Cheng, Yuwei Niu, Yang Ye, Xianyi He, Shenghai Yuan, Wangbo Yu, Shaodong Wang, Yunyang Ge, et al. Uniworld: High-resolution semantic encoders for unified visual understanding and generation. arXiv preprint arXiv:2506.03147, 2025
work page internal anchor Pith review arXiv 2025
-
[44]
Flow Matching for Generative Modeling
Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[45]
Dongyang Liu, David Liu, Peng Gao, Ruoyi Du, Zhen Li, Qilong Wu, Xin Jin, Sihan Cao, Shifeng Zhang, Hongsheng Li, and Steven Hoi. Decoupled dmd: Cfg augmentation as the spear, distribution matching as the shield. arXiv preprint, 2025
work page 2025
-
[46]
Flow-GRPO: Training Flow Matching Models via Online RL
Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl. arXiv preprint arXiv:2505.05470, 2025
work page internal anchor Pith review arXiv 2025
-
[47]
Step1X-Edit: A Practical Framework for General Image Editing
Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, et al. Step1x-edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761, 2025
work page internal anchor Pith review arXiv 2025
-
[48]
Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[49]
Omnicaptioner: One captioner to rule them all. arXiv preprint arXiv:2504.07089, 2025
Yiting Lu, Jiakang Yuan, Zhen Li, Shitian Zhao, Qi Qin, Xinyue Li, Le Zhuo, Licheng Wen, Dongyang Liu, Yuewen Cao, et al. Omnicaptioner: One captioner to rule them all. arXiv preprint arXiv:2504.07089, 2025
-
[50]
Cosine normalization: Using cosine similarity instead of dot product in neural networks
Chunjie Luo, Jianfeng Zhan, Xiaohe Xue, Lei Wang, Rui Ren, and Qiang Yang. Cosine normalization: Using cosine similarity instead of dot product in neural networks. In International conference on artificial neural networks, pages 382–391. Springer, 2018
work page 2018
-
[51]
Yiyang Ma, Xingchao Liu, Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zizheng Pan, Zhenda Xie, Haowei Zhang, Liang Zhao, et al. Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation. arXiv preprint arXiv:2411.07975, 2024
-
[52]
Midjourney v7. https://www.midjourney.com/home, 2025
Midjourney. Midjourney v7. https://www.midjourney.com/home, 2025
work page 2025
-
[53]
Enhancing few-shot image classification with cosine transformer. IEEE Access, 11:79659–79672, 2023
Quang-Huy Nguyen, Cuong Q Nguyen, Dung D Le, and Hieu H Pham. Enhancing few-shot image classification with cosine transformer. IEEE Access, 11:79659–79672, 2023
work page 2023
-
[54]
Cagra: Highly parallel graph construction and approximate nearest neighbor search for gpus
Hiroyuki Ootomo, Akira Naruse, Corey Nolet, Ray Wang, Tamas Feher, and Yong Wang. Cagra: Highly parallel graph construction and approximate nearest neighbor search for gpus. In 2024 IEEE 40th International Conference on Data Engineering (ICDE), pages 4236–4247. IEEE, 2024
work page 2024
-
[55]
OpenAI. Gpt-image-1. https://openai.com/zh-Hans-CN/index/introducing-4o-image-generation/, 2025
work page 2025
-
[56]
The pagerank citation ranking: Bringing order to the web
Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford infolab, 1999
work page 1999
-
[57]
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[58]
Qi Qin, Le Zhuo, Yi Xin, Ruoyi Du, Zhen Li, Bin Fu, Yiting Lu, Jiakang Yuan, Xinyue Li, Dongyang Liu, et al. Lumina-image 2.0: A unified and efficient image generative framework. arXiv preprint arXiv:2503.21758, 2025
-
[59]
Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023
work page 2023
-
[60]
cuGraph - RAPIDS Graph Analytics Library
rapidsai. cuGraph - RAPIDS Graph Analytics Library. https://github.com/rapidsai/cugraph, 2018. Accessed: 2025-11-12
work page 2018
-
[61]
Recraft v3. https://www.recraft.ai/docs/recraft-models/recraft-V3, 2024
Recraft. Recraft v3. https://www.recraft.ai/docs/recraft-models/recraft-V3, 2024
work page 2024
-
[62]
Stephen Robertson, Hugo Zaragoza, et al. The probabilistic relevance framework: Bm25 and beyond. Foundations and Trends® in Information Retrieval, 3(4):333–389, 2009
work page 2009
-
[63]
High-resolution image synthesis with latent diffusion models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022
work page 2022
-
[64]
Seedream 4.0: Toward Next-generation Multimodal Image Generation
Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, et al. Seedream 4.0: Toward next-generation multimodal image generation. arXiv preprint arXiv:2509.20427, 2025
work page internal anchor Pith review arXiv 2025
-
[65]
Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. Flashattention-3: Fast and accurate attention with asynchrony and low-precision. Advances in Neural Information Processing Systems, 37:68658–68685, 2024
work page 2024
-
[66]
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024
work page 2024
-
[67]
Flux.1 krea [dev]. https://github.com/krea-ai/flux-krea, 2025
FLUX-Krea Team. Flux.1 krea [dev]. https://github.com/krea-ai/flux-krea, 2025
work page 2025
-
[68]
From louvain to leiden: guaranteeing well-connected communities. Scientific reports, 9(1):1–12, 2019
Vincent A Traag, Ludo Waltman, and Nees Jan Van Eck. From louvain to leiden: guaranteeing well-connected communities. Scientific reports, 9(1):1–12, 2019
work page 2019
-
[69]
Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[70]
Anytext: Multilingual visual text generation and editing
Yuxiang Tuo, Wangmeng Xiang, Jun-Yan He, Yifeng Geng, and Xuansong Xie. Anytext: Multilingual visual text generation and editing. 2023
work page 2023
-
[71]
Automatic data curation for self-supervised learning: A clustering-based approach
Huy V Vo, Vasil Khalidov, Timothée Darcet, Théo Moutakanni, Nikita Smetanin, Marc Szafraniec, Hugo Touvron, Camille Couprie, Maxime Oquab, Armand Joulin, et al. Automatic data curation for self-supervised learning: A clustering-based approach. arXiv preprint arXiv:2405.15613, 2024
-
[72]
Emu3: Next-Token Prediction is All You Need
Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[73]
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022
work page 2022
-
[74]
Tiif-bench: How does your t2i model follow your instructions? arXiv preprint arXiv:2506.02161, 2025
Xinyu Wei, Jinrui Zhang, Zeqing Wang, Hongyang Wei, Zhen Guo, and Lei Zhang. Tiif-bench: How does your t2i model follow your instructions? arXiv preprint arXiv:2506.02161, 2025
-
[75]
Mitchell Wortsman, Gabriel Ilharco, Samir Ya Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, et al. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In International conference on machine learning, pages 23965–23998. P...
work page 2022
-
[76]
Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingkun...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[77]
Janus: Decoupling visual encoding for unified multimodal understanding and generation
Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12966–12977, 2025
work page 2025
-
[78]
OmniGen2: Towards Instruction-Aligned Multimodal Generation
Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, et al. Omnigen2: Exploration to advanced multimodal generation. arXiv preprint arXiv:2506.18871, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[79]
Xianfeng Wu, Yajing Bai, Haoze Zheng, Harold Haodong Chen, Yexin Liu, Zihao Wang, Xuran Ma, Wen-Jie Shu, Xianzu Wu, Harry Yang, et al. Lightgen: Efficient image generation through knowledge distillation and direct preference optimization. arXiv preprint arXiv:2503.08619, 2025
-
[80]
Omnigen: Unified image generation
Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Chaofan Li, Shuting Wang, Tiejun Huang, and Zheng Liu. Omnigen: Unified image generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13294–13304, 2025
work page 2025