Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer
abstract
The landscape of high-performance image generation models is currently dominated by proprietary systems, such as Nano Banana Pro and Seedream 4.0. Leading open-source alternatives, including Qwen-Image, Hunyuan-Image-3.0 and FLUX.2, are characterized by massive parameter counts (20B to 80B), making them impractical for inference and fine-tuning on consumer-grade hardware. To address this gap, we propose Z-Image, an efficient 6B-parameter foundation generative model built upon a Scalable Single-Stream Diffusion Transformer (S3-DiT) architecture that challenges the "scale-at-all-costs" paradigm. By systematically optimizing the entire model lifecycle -- from a curated data infrastructure to a streamlined training curriculum -- we complete the full training workflow in just 314K H800 GPU hours (approx. $630K). Our few-step distillation scheme with reward post-training further yields Z-Image-Turbo, offering both sub-second inference latency on an enterprise-grade H800 GPU and compatibility with consumer-grade hardware (<16GB VRAM). Additionally, our omni-pre-training paradigm enables efficient training of Z-Image-Edit, an editing model with impressive instruction-following capabilities. Both qualitative and quantitative experiments demonstrate that our model achieves performance comparable to or surpassing that of leading competitors across various dimensions. Most notably, Z-Image exhibits exceptional capabilities in photorealistic image generation and bilingual text rendering, delivering results that rival top-tier commercial models, thereby demonstrating that state-of-the-art results are achievable with significantly reduced computational overhead. We publicly release our code, weights, and online demo to foster the development of accessible, budget-friendly, yet state-of-the-art generative models.
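The abstract names a Scalable Single-Stream Diffusion Transformer (S3-DiT) but gives no architectural details here. The minimal sketch below only illustrates what a "single-stream" block generally means in this family of models: text and image tokens are concatenated into one joint sequence and processed by a single shared attention/MLP stack, rather than by separate per-modality streams. The module names, dimensions, and the adaLN-style timestep modulation are illustrative assumptions, not the paper's actual specification.

```python
# Illustrative single-stream DiT block (assumption-based sketch, not the S3-DiT implementation).
import torch
import torch.nn as nn


class SingleStreamBlock(nn.Module):
    def __init__(self, dim: int = 1024, heads: int = 16, mlp_ratio: float = 4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )
        # adaLN-style modulation: the timestep embedding produces per-block scale/shift/gate terms.
        self.modulation = nn.Linear(dim, 6 * dim)

    def forward(self, tokens: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N_text + N_image, dim) -- one joint sequence, one set of shared weights.
        shift1, scale1, gate1, shift2, scale2, gate2 = self.modulation(t_emb).chunk(6, dim=-1)
        h = self.norm1(tokens) * (1 + scale1.unsqueeze(1)) + shift1.unsqueeze(1)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        tokens = tokens + gate1.unsqueeze(1) * attn_out
        h = self.norm2(tokens) * (1 + scale2.unsqueeze(1)) + shift2.unsqueeze(1)
        tokens = tokens + gate2.unsqueeze(1) * self.mlp(h)
        return tokens


if __name__ == "__main__":
    block = SingleStreamBlock()
    text_tokens = torch.randn(2, 77, 1024)    # e.g. prompt embeddings projected to the model dim
    image_tokens = torch.randn(2, 256, 1024)  # e.g. patchified noisy latents or pixels
    t_emb = torch.randn(2, 1024)              # timestep embedding
    joint = torch.cat([text_tokens, image_tokens], dim=1)  # single stream: one concatenated sequence
    out = block(joint, t_emb)
    print(out.shape)  # torch.Size([2, 333, 1024])
```

The design point this sketch highlights is parameter sharing: because both modalities pass through the same attention and MLP weights, a single-stream layout avoids the duplicated per-modality parameters of dual-stream designs, which is consistent with the paper's efficiency framing.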
38 representative citing papers (2026)
-
Asymmetric Flow Models
Asymmetric Flow Modeling restricts noise prediction to a low-rank subspace for high-dimensional flow generation, reaching 1.57 FID on ImageNet 256x256 and new state-of-the-art pixel-space text-to-image performance via fine-tuning from latent models.
-
RevealLayer: Disentangling Hidden and Visible Layers via Occlusion-Aware Image Decomposition
RevealLayer decomposes natural images into multiple RGBA layers using diffusion models with region-aware attention, occlusion-guided adaptation, and a composite loss, outperforming prior methods on a new benchmark dataset.
-
What Concepts Lie Within? Detecting and Suppressing Risky Content in Diffusion Transformers
A method using attention head vectors detects and suppresses risky content generation in Diffusion Transformers at inference time.
-
Weblica: Scalable and Reproducible Training Environments for Visual Web Agents
Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar open baselines on navigation benchmarks.
-
DirectEdit: Step-Level Accurate Inversion for Flow-Based Image Editing
DirectEdit achieves step-level accurate inversion for flow-based image editing by directly aligning forward paths, using attention feature injection and mask-guided noise blending to balance fidelity and editability without additional NFEs.
-
Evaluating Remote Sensing Image Captions Beyond Metric Biases
MLLMs without fine-tuning outperform fine-tuned models on remote sensing image captioning when captions are scored by their ability to reconstruct the source image, and a training-free self-correction method achieves SOTA performance.
-
Generative Texture Filtering
A two-stage fine-tuning strategy on pre-trained generative models enables effective texture filtering that outperforms prior methods on challenging cases.
-
Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning
GDMD replaces raw-sample rewards with distillation-gradient rewards in RL-guided diffusion distillation, yielding 4-step models that surpass their multi-step teachers on GenEval and human preference metrics.
-
CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-Controllable Video Generation
CT-1 transfers spatial reasoning from vision-language models to estimate camera trajectories, which are then used in a video diffusion model with wavelet regularization to produce controllable videos, claiming 25.7% better accuracy than prior methods.
-
Large-Scale Universal Defect Generation: Foundation Models and Datasets
A 300K quadruplet dataset and UniDG foundation model enable reference- or text-driven defect generation across categories, outperforming few-shot baselines on anomaly detection tasks.
-
FlowGuard: Towards Lightweight In-Generation Safety Detection for Diffusion Models via Linear Latent Decoding
FlowGuard detects unsafe content during diffusion image generation via linear latent decoding and curriculum learning, outperforming prior methods by over 30% F1 while reducing GPU memory by 97% and projection time to 0.2 seconds.
-
SurFITR: A Dataset for Surveillance Image Forgery Detection and Localisation
SurFITR is a new collection of 137k+ surveillance-style forged images that causes existing detectors to degrade while enabling substantial gains when used for training in both in-domain and cross-domain settings.
-
Beyond Few-Step Inference: Accelerating Video Diffusion Transformer Model Serving with Inter-Request Caching Reuse
Chorus accelerates video DiT serving by up to 45% via inter-request caching reuse in a three-stage denoising strategy with token-guided attention amplification.
-
L2P: Unlocking Latent Potential for Pixel Generation
L2P repurposes pre-trained LDMs for direct pixel generation via large-patch tokenization and shallow-layer training on synthetic data, matching source performance with 8-GPU training and enabling native 4K output.
-
HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer
A pixel-space Diffusion Transformer with Unified Transformer architecture unifies image generation, editing, and personalization in an end-to-end model that maps all inputs to a shared token space and scales from 8B to over 200B parameters.
-
SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation
SCOPE maintains semantic commitments via structured specifications and conditional skill orchestration, achieving 0.60 EGIP on the new Gen-Arena benchmark while outperforming baselines on WISE-V and MindBench.
-
DynT2I-Eval: A Dynamic Evaluation Framework for Text-to-Image Models
DynT2I-Eval creates fresh prompts via dimension decomposition and dynamic sampling to evaluate text-to-image models on text alignment, quality, and aesthetics while maintaining a stable leaderboard.
-
D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models
D-OPSD enables continuous supervised fine-tuning of few-step diffusion models via on-policy self-distillation where the model acts as both teacher (multimodal context) and student (text-only context) on its own roll-outs.
-
LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model
LLaDA2.0-Uni unifies multimodal understanding and generation inside one discrete diffusion large language model with a semantic tokenizer, MoE backbone, and diffusion decoder.
-
CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation
CoInteract adds a human-aware mixture-of-experts and spatially-structured co-generation to a diffusion transformer to synthesize videos with stable structures and physically plausible human-object contacts.
-
Any3DAvatar: Fast and High-Quality Full-Head 3D Avatar Reconstruction from Single Portrait Image
Any3DAvatar reconstructs full-head 3D Gaussian avatars from one image via one-step denoising on a Plücker-aware scaffold plus auxiliary view supervision, beating prior single-image methods on fidelity while running substantially faster.
-
Generative Refinement Networks for Visual Synthesis
GRN uses hierarchical binary quantization and entropy-guided refinement to set new ImageNet records of 0.56 rFID for reconstruction and 1.81 gFID for class-conditional generation while releasing code and models.
-
Continuous Adversarial Flow Models
Continuous adversarial flow models replace MSE in flow matching with adversarial training via a discriminator, improving guidance-free FID on ImageNet from 8.26 to 3.63 for SiT and similar gains for JiT and text-to-image benchmarks.
-
SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.
-
IdentiFace: Multi-Modal Iterative Diffusion Framework for Identifiable Suspect Face Generation in Crime Investigations
IdentiFace is a multi-modal iterative diffusion framework that generates identifiable suspect faces with improved identity retrieval for law enforcement applications.
-
Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling
Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemphasizing perceptual quality.
-
Exploring Time Conditioning in Diffusion Generative Models from Disjoint Noisy Data Manifolds
Aligning the DDIM forward diffusion process with flow-matching manifold evolution enables high-quality generation without time conditioning, and class-conditional synthesis is possible with an unconditional denoiser by using separate time spaces per class.
-
Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation
Tuna-2 shows pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive or superior results on understanding and generation benchmarks.
-
Context Unrolling in Omni Models
Omni is a multimodal model whose native training on diverse data types enables context unrolling, allowing explicit reasoning across modalities to better approximate shared knowledge and improve downstream performance.
-
CreatiParser: Generative Image Parsing of Raster Graphic Designs into Editable Layers
CreatiParser decomposes raster graphic designs into editable text, background, and sticker layers via a hybrid VLM-diffusion model with ParserReward and GRPO optimization, reporting 23.7% average metric gains on Parser-40K and Crello datasets.
-
On Semiotic-Grounded Interpretive Evaluation of Generative Art
SemJudge uses a Hierarchical Semiosis Graph based on Peircean theory to evaluate deeper artistic meaning in generative art and aligns better with human judgments than prior metrics.
-
Qwen-Image-2.0 Technical Report
Qwen-Image-2.0 unifies high-fidelity image generation and precise editing by coupling Qwen3-VL with a Multimodal Diffusion Transformer, improving text rendering, photorealism, and complex prompt following over prior versions.
-
Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation
JoyAI-Image unifies visual understanding, generation, and editing in one model and claims stronger spatial intelligence through bidirectional perception-generation loops.
-
Beyond Standard Benchmarks: A Systematic Audit of Vision-Language Model's Robustness to Natural Semantic Variation Across Diverse Tasks
Robust CLIP models amplify vulnerabilities to natural adversarial scenarios while standard CLIP shows large performance drops on natural language-induced adversarial examples in zero-shot classification, segmentation, and VQA.
-
Can Nano Banana 2 Replace Traditional Image Restoration Models? An Evaluation of Its Performance on Image Restoration Tasks
Nano Banana 2 delivers competitive perceptual quality on image restoration but produces over-enhanced results that diverge from input fidelity in ways standard metrics miss.
-
The Fourth Challenge on Image Super-Resolution (×4) at NTIRE 2026: Benchmark Results and Method Overview
The NTIRE 2026 ×4 super-resolution challenge benchmarks 31 teams on bicubic-downsampled images using PSNR for the restoration track and perceptual scores for the realism track.
-
The Second Challenge on Real-World Face Restoration at NTIRE 2026: Methods and Results
The NTIRE 2026 real-world face restoration challenge report details outcomes from 9 valid team submissions advancing perceptual quality and identity consistency in degraded face images.
-
UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation