Pith · machine review for the scientific record

arXiv: 2511.22699 · v3 · submitted 2025-11-27 · 💻 cs.CV

Recognition: no theorem link

Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 14:01 UTC · model grok-4.3

classification 💻 cs.CV
keywords image generation · diffusion transformer · efficient foundation model · photorealistic generation · bilingual text rendering · few-step distillation · image editing

The pith

A 6-billion-parameter image model reaches commercial-level photorealism and text rendering with far less training compute than larger rivals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Z-Image as a 6B-parameter foundation model for image generation that matches or exceeds leading competitors in photorealistic quality and bilingual text rendering. It achieves this through a Scalable Single-Stream Diffusion Transformer architecture combined with curated data infrastructure and a streamlined training process that finishes in 314K H800 GPU hours. A distilled few-step version called Z-Image-Turbo delivers fast inference while running on consumer hardware under 16GB VRAM, and an editing variant supports instruction following. If the approach holds, it shows that extreme parameter counts and compute budgets are not required for top-tier generative performance. The public release of code and weights aims to make such models more accessible for further work.
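The few-step distillation behind Z-Image-Turbo can be pictured with a toy objective: a one-step student is trained to land where a multi-step teacher sampler ends up. The dynamics, function names, and the omission of reward post-training below are illustrative assumptions, not the paper's actual scheme.

```python
import numpy as np

def teacher_sample(x, steps, drift):
    """Stand-in for a multi-step diffusion sampler (toy dynamics)."""
    for _ in range(steps):
        x = x + drift(x) / steps
    return x

def distill_loss(student_step, x0, teacher_out):
    """Few-step distillation objective: the student's single step should
    reproduce the endpoint of the teacher's full trajectory."""
    return float(np.mean((student_step(x0) - teacher_out) ** 2))

drift = lambda x: -0.5 * x                      # toy probability-flow drift
x0 = np.ones(4)
target = teacher_sample(x0, steps=32, drift=drift)
ideal_student = lambda x: target.copy()         # a student that matches perfectly
print(distill_loss(ideal_student, x0, target))  # 0.0
```

In practice the student keeps a small number of steps (not one) and is further tuned against a reward model, but the teacher-matching term above is the core of any such scheme.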

Core claim

Z-Image, built on the S3-DiT architecture, achieves performance comparable to or surpassing leading competitors across various dimensions through systematic optimization of the model lifecycle, from curated data through the training curriculum to few-step distillation with reward post-training. It delivers exceptional photorealistic image generation and bilingual text rendering that rival top-tier commercial models while requiring only 314K H800 GPU hours of training compute.

What carries the argument

The Scalable Single-Stream Diffusion Transformer (S3-DiT) architecture, which runs the denoising computation in a single efficient stream rather than splitting it across multiple separate pathways.
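The abstract does not spell out the block design, but "single-stream" in recent diffusion transformers typically means text and image tokens share one sequence and one set of attention weights. A minimal numpy sketch of that idea, with all shapes and names invented for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def single_stream_block(text_tokens, image_tokens, Wq, Wk, Wv):
    """One joint-attention block: both modalities are concatenated into a
    single stream, so one set of projections serves text and image alike."""
    x = np.concatenate([text_tokens, image_tokens], axis=0)  # (T+N, d)
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return x + attn @ v  # residual; text and image tokens attend freely

rng = np.random.default_rng(0)
d = 8
text = rng.normal(size=(4, d))     # 4 text tokens
image = rng.normal(size=(16, d))   # 16 latent image patches
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
out = single_stream_block(text, image, Wq, Wk, Wv)
print(out.shape)  # (20, 8): one fused sequence, no per-modality branch
```

A dual-stream design would instead keep separate projection weights per modality and exchange information only through cross-attention; the single-stream variant shares every parameter across the fused sequence.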

If this is right

  • Z-Image-Turbo enables sub-second inference on enterprise GPUs while remaining compatible with consumer hardware under 16GB VRAM.
  • Z-Image-Edit provides strong instruction-following for image editing through the same omni-pre-training approach.
  • Full training completes in 314K H800 GPU hours at roughly $630K cost, lowering the barrier for high-performance model development.
  • Open release of code, weights, and demo supports community extension of efficient generative models beyond current proprietary systems.
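The cost and VRAM bullets are easy to sanity-check with back-of-envelope arithmetic (assuming fp16/bf16 weights; the rental rate is merely what the stated figures imply, not a quoted price):

```python
# Implied GPU rental rate from the stated budget.
gpu_hours = 314_000          # H800 GPU hours, as reported
total_cost_usd = 630_000     # approximate training cost, as reported
rate = total_cost_usd / gpu_hours
print(f"implied rate: ${rate:.2f}/GPU-hour")  # ≈ $2.01/GPU-hour

# Weight memory for a 6B-parameter model in 16-bit precision.
params = 6e9
bytes_per_param = 2          # fp16/bf16
weights_gb = params * bytes_per_param / 2**30
print(f"weight memory: {weights_gb:.1f} GB")  # ≈ 11.2 GB, under the 16GB bound
```

Weights alone leave a few gigabytes of headroom on a 16GB card; activations and the text encoder consume the rest, which is consistent with the "<16GB VRAM" compatibility claim.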

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Efficiency-focused designs like this could allow more frequent retraining or domain-specific fine-tuning without massive infrastructure.
  • The bilingual text strength suggests similar single-stream methods may improve multilingual handling in other vision-language tasks.
  • Reduced overall compute opens possibilities for on-device or edge deployment of high-quality image generation.
  • The pattern of combining architecture changes with data and curriculum optimizations may apply to related generative domains such as video.

Load-bearing premise

The claimed performance levels arise directly from the described data curation, training curriculum, S3-DiT design, and distillation methods rather than from undisclosed larger data scales or selective evaluation practices.

What would settle it

Independent side-by-side evaluation of the released Z-Image weights against the cited competitors on fixed public benchmarks for photorealism and text accuracy, using identical prompts and metrics with fully disclosed training data volume.
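A minimal sketch of what such a settling protocol looks like in code: same prompts, same metric, paired per-prompt comparison. The prompt list and scorers below are toy stand-ins; a real run would plug in OCR accuracy or blinded human preference.

```python
import statistics

def side_by_side(prompts, score_a, score_b):
    """Paired evaluation on a fixed prompt set: identical prompts and an
    identical per-prompt metric for both models, reporting means and the
    fraction of prompts where model A scores higher."""
    a = [score_a(p) for p in prompts]
    b = [score_b(p) for p in prompts]
    wins_a = sum(x > y for x, y in zip(a, b))
    return {"mean_a": statistics.mean(a),
            "mean_b": statistics.mean(b),
            "win_rate_a": wins_a / len(prompts)}

# Hypothetical prompts and toy scorers standing in for real metrics.
prompts = ["a street sign reading 'open'", "bilingual poster text", "portrait"]
report = side_by_side(prompts, lambda p: len(p) % 5, lambda p: len(p) % 7)
print(report)
```

The point of the pairing is that every difference in the report traces back to the models, not to prompt selection or metric drift between runs.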

Original abstract

The landscape of high-performance image generation models is currently dominated by proprietary systems, such as Nano Banana Pro and Seedream 4.0. Leading open-source alternatives, including Qwen-Image, Hunyuan-Image-3.0 and FLUX.2, are characterized by massive parameter counts (20B to 80B), making them impractical for inference, and fine-tuning on consumer-grade hardware. To address this gap, we propose Z-Image, an efficient 6B-parameter foundation generative model built upon a Scalable Single-Stream Diffusion Transformer (S3-DiT) architecture that challenges the "scale-at-all-costs" paradigm. By systematically optimizing the entire model lifecycle -- from a curated data infrastructure to a streamlined training curriculum -- we complete the full training workflow in just 314K H800 GPU hours (approx. $630K). Our few-step distillation scheme with reward post-training further yields Z-Image-Turbo, offering both sub-second inference latency on an enterprise-grade H800 GPU and compatibility with consumer-grade hardware (<16GB VRAM). Additionally, our omni-pre-training paradigm also enables efficient training of Z-Image-Edit, an editing model with impressive instruction-following capabilities. Both qualitative and quantitative experiments demonstrate that our model achieves performance comparable to or surpassing that of leading competitors across various dimensions. Most notably, Z-Image exhibits exceptional capabilities in photorealistic image generation and bilingual text rendering, delivering results that rival top-tier commercial models, thereby demonstrating that state-of-the-art results are achievable with significantly reduced computational overhead. We publicly release our code, weights, and online demo to foster the development of accessible, budget-friendly, yet state-of-the-art generative models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Z-Image, a 6B-parameter image generation foundation model using a Scalable Single-Stream Diffusion Transformer (S3-DiT) architecture. It optimizes the full lifecycle via curated data infrastructure, streamlined training curriculum, few-step distillation with reward post-training (yielding Z-Image-Turbo for sub-second inference), and omni-pre-training for an editing variant (Z-Image-Edit). The central claim is that qualitative and quantitative experiments show performance comparable to or surpassing larger models (FLUX.2, Qwen-Image, commercial systems) in photorealism and bilingual text rendering, achieved at low cost (314K H800 GPU hours, ~$630K).

Significance. If the performance claims hold with rigorous evidence, the work would be significant for demonstrating that state-of-the-art image generation results are attainable with substantially smaller models and lower training compute than the current 20B–80B scale paradigm, potentially improving accessibility for fine-tuning and inference on consumer hardware.

major comments (2)
  1. [Abstract] The assertion that 'qualitative and quantitative experiments demonstrate' comparable or superior performance is unsupported by any reported metrics (FID, CLIP score, OCR accuracy for bilingual text, human preference rates), baselines, ablation studies, or error analysis, leaving the central empirical claim without visible evidence.
  2. [Results and Experiments] Experimental claims: No dataset cardinality, image-text pair counts, filtering criteria, or evaluation protocol details (e.g., inference steps matched to baselines, statistical tests, blinded raters) are supplied, so it is impossible to determine whether the S3-DiT architecture, curriculum, or distillation—not undisclosed data advantages—produce the headline results rivaling 20B–80B models.
minor comments (2)
  1. The manuscript would benefit from a dedicated table or figure summarizing quantitative comparisons against named baselines with exact scores and standard deviations.
  2. Clarify the precise definition and implementation details of the 'Scalable Single-Stream Diffusion Transformer (S3-DiT)' early in the architecture section to aid reproducibility.
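For concreteness, the FID the referee asks for compares Gaussian fits to feature distributions; under a diagonal-covariance simplification (real FID uses full Inception-feature covariances) it reduces to a few lines:

```python
import numpy as np

def fid_diagonal(feats_a, feats_b):
    """Fréchet distance between two feature sets, assuming diagonal Gaussian
    statistics: ||mu_a - mu_b||^2 + sum(var_a + var_b - 2*sqrt(var_a*var_b))."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    var_a, var_b = feats_a.var(axis=0), feats_b.var(axis=0)
    return float(((mu_a - mu_b) ** 2).sum()
                 + (var_a + var_b - 2 * np.sqrt(var_a * var_b)).sum())

rng = np.random.default_rng(1)
feats = rng.normal(size=(1000, 8))         # stand-in feature vectors
print(fid_diagonal(feats, feats))          # ~0 for identical sets
shifted = feats + 1.0
print(fid_diagonal(feats, shifted) > 1.0)  # True: distance grows with shift
```

Even this simplified form makes the referee's point: without such numbers reported against named baselines, the comparison claims cannot be checked.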

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address the major comments point by point below and will revise the manuscript to strengthen the empirical evidence and transparency of our claims.

Point-by-point responses
  1. Referee: [Abstract] The assertion that 'qualitative and quantitative experiments demonstrate' comparable or superior performance is unsupported by any reported metrics (FID, CLIP score, OCR accuracy for bilingual text, human preference rates), baselines, ablation studies, or error analysis, leaving the central empirical claim without visible evidence.

    Authors: We agree that the abstract makes a broad claim without embedding specific quantitative metrics, which leaves the central assertion insufficiently supported at first reading. The full manuscript does contain quantitative results and comparisons in the Experiments section, but these are not summarized in the abstract. In the revised version we will update the abstract to explicitly report key metrics (FID, CLIP score, OCR accuracy for bilingual text, and human preference rates) together with the main baselines. We will also expand the Experiments section with additional ablations, error analysis, and clearer tabular presentation of all quantitative results so that the performance claims are directly evidenced. revision: yes

  2. Referee: [Results and Experiments] Experimental claims: No dataset cardinality, image-text pair counts, filtering criteria, or evaluation protocol details (e.g., inference steps matched to baselines, statistical tests, blinded raters) are supplied, so it is impossible to determine whether the S3-DiT architecture, curriculum, or distillation—not undisclosed data advantages—produce the headline results rivaling 20B–80B models.

    Authors: We acknowledge that the current manuscript provides only high-level descriptions of the data pipeline and evaluation setup, making it difficult to isolate the contributions of the S3-DiT architecture and training curriculum from potential data advantages. In the revision we will add a dedicated subsection detailing the total number of image-text pairs, filtering criteria, deduplication steps, and the full evaluation protocol (including inference steps used for all baselines, statistical tests, and human-study design with blinded raters). These additions will allow readers to assess the source of the reported performance gains. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical performance claims rest on external benchmarks and experiments.

Full rationale

The paper describes an empirical architecture (S3-DiT) and training pipeline whose headline results are asserted via qualitative/quantitative experiments on standard image-generation tasks. No equations, first-principles derivations, or fitted parameters are presented that reduce by construction to the inputs; the central claim is a comparative performance statement against external models (FLUX.2, Qwen-Image, etc.) rather than a self-referential prediction. Self-citations are absent from the provided text, and the training-cost figure (314K H800 hours) is an input cost, not a derived output. This is the normal non-circular case for a systems paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central performance claims rest on the unverified effectiveness of the newly introduced S3-DiT architecture and the training optimizations; no independent evidence for these components is supplied in the abstract.

invented entities (1)
  • S3-DiT (no independent evidence)
    purpose: Scalable single-stream diffusion transformer architecture intended to enable efficient high-quality image generation
    Core new component introduced to challenge scale-at-all-costs paradigm

pith-pipeline@v0.9.0 · 5688 in / 1200 out tokens · 89198 ms · 2026-05-11T14:01:29.639989+00:00 · methodology

discussion (0)


Forward citations

Cited by 44 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MiVE: Multiscale Vision-language features for reference-guided video Editing

    cs.CV 2026-05 unverdicted novelty 7.0

    MiVE repurposes VLMs as multiscale feature extractors integrated into a unified self-attention Diffusion Transformer, achieving top human preference in reference-guided video editing.

  2. ImageAttributionBench: How Far Are We from Generalizable Attribution?

    cs.CV 2026-05 unverdicted novelty 7.0

    ImageAttributionBench is a benchmark dataset demonstrating that state-of-the-art image attribution methods lack robustness to image degradation and fail to generalize to semantically disjoint domains.

  3. Asymmetric Flow Models

    cs.CV 2026-05 unverdicted novelty 7.0

    Asymmetric Flow Modeling restricts noise prediction to a low-rank subspace for high-dimensional flow generation, reaching 1.57 FID on ImageNet 256x256 and new state-of-the-art pixel text-to-image performance via finet...

  4. UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    UniCustom fuses ViT and VAE features before VLM encoding and uses two-stage training plus slot-wise regularization to improve subject consistency in multi-reference diffusion-based image generation.

  5. RevealLayer: Disentangling Hidden and Visible Layers via Occlusion-Aware Image Decomposition

    cs.CV 2026-05 unverdicted novelty 7.0

    RevealLayer decomposes natural images into multiple RGBA layers using diffusion models with region-aware attention, occlusion-guided adaptation, and a composite loss, outperforming prior methods on a new benchmark dataset.

  6. What Concepts Lie Within? Detecting and Suppressing Risky Content in Diffusion Transformers

    cs.CV 2026-05 unverdicted novelty 7.0

    A method using attention head vectors detects and suppresses risky content generation in Diffusion Transformers at inference time.

  7. Weblica: Scalable and Reproducible Training Environments for Visual Web Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar op...

  8. DirectEdit: Step-Level Accurate Inversion for Flow-Based Image Editing

    cs.CV 2026-05 unverdicted novelty 7.0

    DirectEdit achieves step-level accurate inversion for flow-based image editing by directly aligning forward paths, using attention feature injection and mask-guided noise blending to balance fidelity and editability w...

  9. Evaluating Remote Sensing Image Captions Beyond Metric Biases

    cs.CV 2026-04 unverdicted novelty 7.0

    Unfine-tuned MLLMs outperform fine-tuned models on remote sensing image captioning when captions are scored by their ability to reconstruct the source image, and a training-free self-correction method achieves SOTA pe...

  10. Generative Texture Filtering

    cs.CV 2026-04 unverdicted novelty 7.0

    A two-stage fine-tuning strategy on pre-trained generative models enables effective texture filtering that outperforms prior methods on challenging cases.

  11. Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    GDMD replaces raw-sample rewards with distillation-gradient rewards in RL-guided diffusion distillation, yielding 4-step models that surpass their multi-step teachers on GenEval and human preference metrics.

  12. CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-Controllable Video Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    CT-1 transfers spatial reasoning from vision-language models to estimate camera trajectories, which are then used in a video diffusion model with wavelet regularization to produce controllable videos, claiming 25.7% b...

  13. Large-Scale Universal Defect Generation: Foundation Models and Datasets

    cs.CV 2026-04 unverdicted novelty 7.0

    A 300K quadruplet dataset and UniDG foundation model enable reference- or text-driven defect generation across categories, outperforming few-shot baselines on anomaly detection tasks.

  14. FlowGuard: Towards Lightweight In-Generation Safety Detection for Diffusion Models via Linear Latent Decoding

    cs.CV 2026-04 unverdicted novelty 7.0

    FlowGuard detects unsafe content during diffusion image generation via linear latent decoding and curriculum learning, outperforming prior methods by over 30% F1 while reducing GPU memory by 97% and projection time to...

  15. SurFITR: A Dataset for Surveillance Image Forgery Detection and Localisation

    cs.CV 2026-04 conditional novelty 7.0

    SurFITR is a new collection of 137k+ surveillance-style forged images that causes existing detectors to degrade while enabling substantial gains when used for training in both in-domain and cross-domain settings.

  16. Beyond Few-Step Inference: Accelerating Video Diffusion Transformer Model Serving with Inter-Request Caching Reuse

    cs.CV 2026-04 unverdicted novelty 7.0

    Chorus accelerates video DiT serving up to 45% via inter-request caching reuse in a three-stage denoising strategy with token-guided attention amplification.

  17. Skill-Aligned Annotation for Reliable Evaluation in Text-to-Image Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    Skill-aligned annotation improves inter-annotator agreement and evaluation stability in text-to-image generation compared to uniform annotation baselines.

  18. Early Semantic Grounding in Image Editing Models for Zero-Shot Referring Image Segmentation

    cs.CV 2026-05 unverdicted novelty 6.0

    Pretrained instruction-based image editing models exhibit early foreground-background separability that enables a training-free framework for zero-shot referring image segmentation using a single denoising step.

  19. UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    A unified visual conditioning approach fuses semantic and appearance features before VLM processing, with two-stage training and slot-wise regularization, to improve consistency in multi-reference image generation.

  20. L2P: Unlocking Latent Potential for Pixel Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    L2P repurposes pre-trained LDMs for direct pixel generation via large-patch tokenization and shallow-layer training on synthetic data, matching source performance with 8-GPU training and enabling native 4K output.

  21. HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer

    cs.CV 2026-05 unverdicted novelty 6.0

    A pixel-space Diffusion Transformer with Unified Transformer architecture unifies image generation, editing, and personalization in an end-to-end model that maps all inputs to a shared token space and scales from 8B t...

  22. SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    SCOPE maintains semantic commitments via structured specifications and conditional skill orchestration, achieving 0.60 EGIP on the new Gen-Arena benchmark while outperforming baselines on WISE-V and MindBench.

  23. DynT2I-Eval: A Dynamic Evaluation Framework for Text-to-Image Models

    cs.CV 2026-05 unverdicted novelty 6.0

    DynT2I-Eval creates fresh prompts via dimension decomposition and dynamic sampling to evaluate text-to-image models on text alignment, quality, and aesthetics while maintaining a stable leaderboard.

  24. D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

    cs.CV 2026-05 unverdicted novelty 6.0

    D-OPSD enables continuous supervised fine-tuning of few-step diffusion models via on-policy self-distillation where the model acts as both teacher (multimodal context) and student (text-only context) on its own roll-outs.

  25. LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model

    cs.CV 2026-04 unverdicted novelty 6.0

    LLaDA2.0-Uni unifies multimodal understanding and generation inside one discrete diffusion large language model with a semantic tokenizer, MoE backbone, and diffusion decoder.

  26. CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    CoInteract adds a human-aware mixture-of-experts and spatially-structured co-generation to a diffusion transformer to synthesize videos with stable structures and physically plausible human-object contacts.

  27. Any3DAvatar: Fast and High-Quality Full-Head 3D Avatar Reconstruction from Single Portrait Image

    cs.CV 2026-04 unverdicted novelty 6.0

    Any3DAvatar reconstructs full-head 3D Gaussian avatars from one image via one-step denoising on a Plücker-aware scaffold plus auxiliary view supervision, beating prior single-image methods on fidelity while running su...

  28. Generative Refinement Networks for Visual Synthesis

    cs.CV 2026-04 unverdicted novelty 6.0

    GRN uses hierarchical binary quantization and entropy-guided refinement to set new ImageNet records of 0.56 rFID for reconstruction and 1.81 gFID for class-conditional generation while releasing code and models.

  29. Continuous Adversarial Flow Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Continuous adversarial flow models replace MSE in flow matching with adversarial training via a discriminator, improving guidance-free FID on ImageNet from 8.26 to 3.63 for SiT and similar gains for JiT and text-to-im...

  30. Gen-Searcher: Reinforcing Agentic Search for Image Generation

    cs.CV 2026-03 unverdicted novelty 6.0

    Gen-Searcher is the first search-augmented image generation agent trained with SFT followed by agentic RL using dual text and image rewards on custom datasets and the KnowGen benchmark.

  31. SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

    cs.CV 2026-05 unverdicted novelty 5.0

    SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.

  32. IdentiFace: Multi-Modal Iterative Diffusion Framework for Identifiable Suspect Face Generation in Crime Investigations

    cs.CV 2026-05 unverdicted novelty 5.0

    IdentiFace is a multi-modal iterative diffusion framework that generates identifiable suspect faces with improved identity retrieval for law enforcement applications.

  33. Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

    cs.CV 2026-04 unverdicted novelty 5.0

    Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...

  34. Exploring Time Conditioning in Diffusion Generative Models from Disjoint Noisy Data Manifolds

    cs.LG 2026-04 unverdicted novelty 5.0

    Aligning the DDIM forward diffusion process with flow-matching manifold evolution enables high-quality generation without time conditioning, and class-conditional synthesis is possible with an unconditional denoiser b...

  35. Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    Tuna-2 shows pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive or superior results on understanding and generation benchmarks.

  36. Context Unrolling in Omni Models

    cs.CV 2026-04 unverdicted novelty 5.0

    Omni is a multimodal model whose native training on diverse data types enables context unrolling, allowing explicit reasoning across modalities to better approximate shared knowledge and improve downstream performance.

  37. CreatiParser: Generative Image Parsing of Raster Graphic Designs into Editable Layers

    cs.CV 2026-04 unverdicted novelty 5.0

    CreatiParser decomposes raster graphic designs into editable text, background, and sticker layers via a hybrid VLM-diffusion model with ParserReward and GRPO optimization, reporting 23.7% average metric gains on Parse...

  38. On Semiotic-Grounded Interpretive Evaluation of Generative Art

    cs.CV 2026-04 unverdicted novelty 5.0

    SemJudge uses a Hierarchical Semiosis Graph based on Peircean theory to evaluate deeper artistic meaning in generative art and aligns better with human judgments than prior metrics.

  39. Qwen-Image-2.0 Technical Report

    cs.CV 2026-05 unverdicted novelty 4.0

    Qwen-Image-2.0 unifies high-fidelity image generation and precise editing by coupling Qwen3-VL with a Multimodal Diffusion Transformer, improving text rendering, photorealism, and complex prompt following over prior versions.

  40. Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

    cs.GR 2026-05 unverdicted novelty 4.0

    JoyAI-Image unifies visual understanding, generation, and editing in one model and claims stronger spatial intelligence through bidirectional perception-generation loops.

  41. Beyond Standard Benchmarks: A Systematic Audit of Vision-Language Model's Robustness to Natural Semantic Variation Across Diverse Tasks

    cs.CV 2026-04 unverdicted novelty 4.0

    Robust CLIP models amplify vulnerabilities to natural adversarial scenarios while standard CLIP shows large performance drops on natural language-induced adversarial examples in zero-shot classification, segmentation,...

  42. Can Nano Banana 2 Replace Traditional Image Restoration Models? An Evaluation of Its Performance on Image Restoration Tasks

    cs.CV 2026-04 unverdicted novelty 4.0

    Nano Banana 2 delivers competitive perceptual quality on image restoration but produces over-enhanced results that diverge from input fidelity in ways standard metrics miss.

  43. The Fourth Challenge on Image Super-Resolution ($\times$4) at NTIRE 2026: Benchmark Results and Method Overview

    cs.CV 2026-04 unverdicted novelty 2.0

    The NTIRE 2026 ×4 super-resolution challenge benchmarks 31 teams on bicubic-downsampled images using PSNR for the restoration track and perceptual scores for the realism track.

  44. The Second Challenge on Real-World Face Restoration at NTIRE 2026: Methods and Results

    cs.CV 2026-04 unverdicted novelty 2.0

    The NTIRE 2026 real-world face restoration challenge report details outcomes from 9 valid team submissions advancing perceptual quality and identity consistency in degraded face images.

Reference graph

Works this paper leans on

100 extracted references · 100 canonical work pages · cited by 43 Pith papers · 21 internal anchors

  1. [1]

    Pytorch 2: Faster machine learning through dynamic python bytecode transformation and graph compilation

    Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael Voznesensky, Bin Bao, Peter Bell, David Berard, Evgeni Burovski, et al. Pytorch 2: Faster machine learning through dynamic python bytecode transformation and graph compilation. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Langu...

  2. [2]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025

  3. [3]

    Imagen 3

    Jason Baldridge, Jakob Bauer, Mukul Bhutani, Nicole Brichtova, Andrew Bunner, Lluis Castrejon, Kelvin Chan, Yichang Chen, Sander Dieleman, Yuqing Du, et al. Imagen 3. arXiv preprint arXiv:2408.07009, 2024

  4. [4]

    Improving image generation with better captions

    James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science, https://cdn.openai.com/papers/dall-e-3.pdf, 2(3):8, 2023

  5. [5]

    Instructpix2pix: Learning to follow image editing instructions

    Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18392–18402, 2023

  6. [6]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

  7. [7]

    Hidream-i1: An open-source high-efficient image generative foundation model

    Qi Cai, Yehao Li, Yingwei Pan, Ting Yao, and Tao Mei. Hidream-i1: An open-source high-efficient image generative foundation model. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 13636–13639, 2025

  8. [8]

    HunyuanImage 3.0 technical report

    Siyu Cao, Hangting Chen, Peng Chen, Yiji Cheng, Yutao Cui, Xinchi Deng, Ying Dong, Kipper Gong, Tianpeng Gu, Xiusen Gu, et al. HunyuanImage 3.0 technical report. arXiv preprint arXiv:2509.23951, 2025

  9. [9]

    Oneig-bench: Omni-dimensional nuanced evaluation for image generation

    Jingjing Chang, Yixiao Fang, Peng Xing, Shuhan Wu, Wei Cheng, Rui Wang, Xianfang Zeng, Gang Yu, and Hai-Bao Chen. Oneig-bench: Omni-dimensional nuanced evaluation for image generation. arXiv preprint arXiv:2506.07977, 2025

  10. [10]

    Textdiffuser-2: Unleashing the power of language models for text rendering

    Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, and Furu Wei. Textdiffuser-2: Unleashing the power of language models for text rendering. In European Conference on Computer Vision, pages 386–402. Springer, 2024

  11. [11]

    BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

    Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, et al. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset. arXiv preprint arXiv:2505.09568, 2025

  12. [12]

    Pixart-𝜎: Weak-to-strong training of diffusion transformer for 4k text-to-image generation

    Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-𝜎: Weak-to-strong training of diffusion transformer for 4k text-to-image generation. In European Conference on Computer Vision, pages 74–91. Springer, 2024

  13. [13]

    Pixart-𝛼: Fast training of diffusion transformer for photorealistic text-to-image synthesis

    Junsong Chen, YU Jincheng, GE Chongjian, Lewei Yao, Enze Xie, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-𝛼: Fast training of diffusion transformer for photorealistic text-to-image synthesis. In The Twelfth International Conference on Learning Representations

  14. [14]

    Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811, 2025

  15. [15]

    Emerging Properties in Unified Multimodal Pretraining

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining, 2025. URL https://arxiv.org/abs/2505.14683

  16. [16]

    Cogview: Mastering text-to-image generation via transformers

    Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, et al. Cogview: Mastering text-to-image generation via transformers. Advances in neural information processing systems, 34:19822–19835, 2021

  17. [17]

    Textcrafter: Accurately rendering multiple texts in complex visual scenes

    Nikai Du, Zhennan Chen, Zhizhou Chen, Shan Gao, Xi Chen, Zhengkai Jiang, Jian Yang, and Ying Tai. Textcrafter: Accurately rendering multiple texts in complex visual scenes. arXiv preprint arXiv:2503.23461, 2025

  18. [18]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Proceedings of the International Conference on Machine Learning (ICML), 2024

  19. [19]

    Flux-reason-6m & prism-bench: A million-scale text-to-image reasoning dataset and comprehensive benchmark

    Rongyao Fang, Aldrich Yu, Chengqi Duan, Linjiang Huang, Shuai Bai, Yuxuan Cai, Kun Wang, Si Liu, Xihui Liu, and Hongsheng Li. Flux-reason-6m and prism-bench: A million-scale text-to-image reasoning dataset and comprehensive benchmark. arXiv preprint arXiv:2509.09680, 2025

  20. [20]

    Lumina-t2x: Transforming text into any modality, resolution, and duration via flow-based large diffusion transformers

    Peng Gao, Le Zhuo, Chris Liu, Ruoyi Du, Xu Luo, Longtian Qiu, Yuhang Zhang, et al. Lumina-t2x: Transforming text into any modality, resolution, and duration via flow-based large diffusion transformers. arXiv preprint arXiv:2405.05945, 2024

  21. [21]

    Seedream 3.0 Technical Report

    Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, et al. Seedream 3.0 technical report. arXiv preprint arXiv:2504.11346, 2025

  22. [22]

    X-omni: Reinforcement learning makes discrete autoregressive image generative models great again

    Zigang Geng, Yibing Wang, Yeyao Ma, Chen Li, Yongming Rao, Shuyang Gu, Zhao Zhong, Qinglin Lu, Han Hu, Xiaosong Zhang, et al. X-omni: Reinforcement learning makes discrete autoregressive image generative models great again. arXiv preprint arXiv:2507.22058, 2025

  23. [23]

    Geneval: An object-focused framework for evaluating text-to-image alignment

    Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems, 36:52132–52152, 2023

  24. [24]

    Dynamic few-shot visual learning without forgetting

    Spyros Gidaris and Nikos Komodakis. Dynamic few-shot visual learning without forgetting. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4367–4375, 2018

  25. [25]

    Gemini 2.5 flash & 2.5 flash image model card

    Google. Gemini 2.5 flash & 2.5 flash image model card. https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-2-5-Flash-Model-Card.pdf, 2025

  26. [26]

    Imagen 4 model card

    Google. Imagen 4 model card. https://storage.googleapis.com/deepmind-media/Model-Cards/Imagen-4-Model-Card.pdf, 2025

  27. [27]

    Nano banana pro

    Google. Nano banana pro. https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Pro-Image-Model-Card.pdf, 2025

  28. [28]

    Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis

    Jian Han, Jinlai Liu, Yi Jiang, Bin Yan, Yuqi Zhang, Zehuan Yuan, Bingyue Peng, and Xiaobing Liu. Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15733–15744, 2025

  29. [29]

    Classifier-free diffusion guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. Advances in Neural Information Processing Systems Workshops (NeurIPS Workshops), 2021

  30. [30]

    ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

    Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment. arXiv preprint arXiv:2403.05135, 2024

  31. [31]

    Distribution matching distillation meets reinforcement learning

    Dengyang Jiang, Dongyang Liu, Zanyi Wang, Qilong Wu, Xin Jin, David Liu, Zhen Li, Mengmeng Wang, Peng Gao, and Harry Yang. Distribution matching distillation meets reinforcement learning. arXiv preprint arXiv:2511.13649, 2025

  32. [32]

    Analyzing and improving the training dynamics of diffusion models

    Tero Karras, Miika Aittala, Jaakko Lehtinen, Janne Hellsten, Timo Aila, and Samuli Laine. Analyzing and improving the training dynamics of diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24174–24184, 2024

  33. [33]

    Kolors 2.0

    Kuaishou Kolors Team. Kolors 2.0. https://app.klingai.com/cn/, 2025

  34. [34]

    Flux

    Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2023

  35. [35]

    FLUX.2: State-of-the-Art Visual Intelligence

    Black Forest Labs. FLUX.2: State-of-the-Art Visual Intelligence. https://bfl.ai/blog/flux-2, 2025

  36. [36]

    FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space

    Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, et al. Flux.1 Kontext: Flow matching for in-context image generation and editing in latent space. arXiv preprint arXiv:2506.15742, 2025

  37. [37]

    Gpu server rental pricing

    LeaderGPU. Gpu server rental pricing. https://www.leadergpu.com/, 2025. Accessed: November 2025

  38. [38]

    Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation

    Daiqing Li, Aleks Kamko, Ehsan Akhgari, Ali Sabet, Linmiao Xu, and Suhail Doshi. Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation. arXiv preprint arXiv:2402.17245, 2024

  39. [39]

    Ragdiffusion: Faithful cloth generation via external knowledge assimilation

    Yuhan Li, Xianfeng Tan, Wenxiang Shang, Yubo Wu, Jian Wang, Xuanhong Chen, Yi Zhang, Hangcheng Zhu, and Bingbing Ni. Ragdiffusion: Faithful cloth generation via external knowledge assimilation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17485–17495, 2025

  40. [40]

    Hunyuan-dit: A powerful multi-resolution diffusion transformer with fine-grained Chinese understanding

    Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, et al. Hunyuan-dit: A powerful multi-resolution diffusion transformer with fine-grained Chinese understanding. arXiv preprint arXiv:2405.08748, 2024

  41. [41]

    Visualcloze: A universal image generation framework via visual in-context learning

    Zhong-Yu Li, Ruoyi Du, Juncheng Yan, Le Zhuo, Zhen Li, Peng Gao, Zhanyu Ma, and Ming-Ming Cheng. Visualcloze: A universal image generation framework via visual in-context learning. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2025

  42. [42]

    Uniworld-V2: Reinforce image editing with diffusion negative-aware finetuning and MLLM implicit feedback

    Zongjian Li, Zheyuan Liu, Qihui Zhang, Bin Lin, Shenghai Yuan, Zhiyuan Yan, Yang Ye, Wangbo Yu, Yuwei Niu, and Li Yuan. Uniworld-v2: Reinforce image editing with diffusion negative-aware finetuning and mllm implicit feedback. arXiv preprint arXiv:2510.16888, 2025

  43. [43]

    UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

    Bin Lin, Zongjian Li, Xinhua Cheng, Yuwei Niu, Yang Ye, Xianyi He, Shenghai Yuan, Wangbo Yu, Shaodong Wang, Yunyang Ge, et al. Uniworld: High-resolution semantic encoders for unified visual understanding and generation. arXiv preprint arXiv:2506.03147, 2025

  44. [44]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022

  45. [45]

    Decoupled dmd: Cfg augmentation as the spear, distribution matching as the shield

    Dongyang Liu, David Liu, Peng Gao, Ruoyi Du, Zhen Li, Qilong Wu, Xin Jin, Sihan Cao, Shifeng Zhang, Hongsheng Li, and Steven Hoi. Decoupled dmd: Cfg augmentation as the spear, distribution matching as the shield. arXiv preprint, 2025

  46. [46]

    Flow-GRPO: Training Flow Matching Models via Online RL

    Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl. arXiv preprint arXiv:2505.05470, 2025

  47. [47]

    Step1X-Edit: A Practical Framework for General Image Editing

    Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, et al. Step1x-edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761, 2025

  48. [48]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022

  49. [49]

    Omnicaptioner: One captioner to rule them all

    Yiting Lu, Jiakang Yuan, Zhen Li, Shitian Zhao, Qi Qin, Xinyue Li, Le Zhuo, Licheng Wen, Dongyang Liu, Yuewen Cao, et al. Omnicaptioner: One captioner to rule them all. arXiv preprint arXiv:2504.07089, 2025

  50. [50]

    Cosine normalization: Using cosine similarity instead of dot product in neural networks

    Chunjie Luo, Jianfeng Zhan, Xiaohe Xue, Lei Wang, Rui Ren, and Qiang Yang. Cosine normalization: Using cosine similarity instead of dot product in neural networks. In International conference on artificial neural networks, pages 382–391. Springer, 2018

  51. [51]

    Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation

    Yiyang Ma, Xingchao Liu, Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zizheng Pan, Zhenda Xie, Haowei Zhang, Liang Zhao, et al. Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation. arXiv preprint arXiv:2411.07975, 2024

  52. [52]

    Midjourney v7

    Midjourney. Midjourney v7. https://www.midjourney.com/home, 2025

  53. [53]

    Enhancing few-shot image classification with cosine transformer

    Quang-Huy Nguyen, Cuong Q Nguyen, Dung D Le, and Hieu H Pham. Enhancing few-shot image classification with cosine transformer. IEEE Access, 11:79659–79672, 2023

  54. [54]

    Cagra: Highly parallel graph construction and approximate nearest neighbor search for gpus

    Hiroyuki Ootomo, Akira Naruse, Corey Nolet, Ray Wang, Tamas Feher, and Yong Wang. Cagra: Highly parallel graph construction and approximate nearest neighbor search for gpus. In 2024 IEEE 40th International Conference on Data Engineering (ICDE), pages 4236–4247. IEEE, 2024

  55. [55]

    Gpt-image-1

    OpenAI. Gpt-image-1. https://openai.com/zh-Hans-CN/index/introducing-4o-image-generation/, 2025

  56. [56]

    The pagerank citation ranking: Bringing order to the web

    Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford infolab, 1999

  57. [57]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023

  58. [58]

    Lumina-image 2.0: A unified and efficient image generative framework

    Qi Qin, Le Zhuo, Yi Xin, Ruoyi Du, Zhen Li, Bin Fu, Yiting Lu, Jiakang Yuan, Xinyue Li, Dongyang Liu, et al. Lumina-image 2.0: A unified and efficient image generative framework. arXiv preprint arXiv:2503.21758, 2025

  59. [59]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023

  60. [60]

    cuGraph - RAPIDS Graph Analytics Library

    rapidsai. cuGraph - RAPIDS Graph Analytics Library. https://github.com/rapidsai/cugraph, 2018. Accessed: 2025-11-12

  61. [61]

    Recraft v3

    Recraft. Recraft v3. https://www.recraft.ai/docs/recraft-models/recraft-V3, 2024

  62. [62]

    The probabilistic relevance framework: Bm25 and beyond

    Stephen Robertson, Hugo Zaragoza, et al. The probabilistic relevance framework: Bm25 and beyond. Foundations and Trends® in Information Retrieval, 3(4):333–389, 2009

  63. [63]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

  64. [64]

    Seedream 4.0: Toward Next-generation Multimodal Image Generation

    Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, et al. Seedream 4.0: Toward next-generation multimodal image generation. arXiv preprint arXiv:2509.20427, 2025

  65. [65]

    Flashattention-3: Fast and accurate attention with asynchrony and low-precision

    Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. Flashattention-3: Fast and accurate attention with asynchrony and low-precision. Advances in Neural Information Processing Systems, 37:68658–68685, 2024

  66. [66]

    Deepseekmath: Pushing the limits of mathematical reasoning in open language models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024

  67. [67]

    Flux.1 krea [dev]

    FLUX-Krea Team. Flux.1 krea [dev]. https://github.com/krea-ai/flux-krea, 2025

  68. [68]

    From louvain to leiden: guaranteeing well-connected communities

    Vincent A Traag, Ludo Waltman, and Nees Jan Van Eck. From louvain to leiden: guaranteeing well-connected communities. Scientific reports, 9(1):1–12, 2019

  69. [69]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025

  70. [70]

    Anytext: Multilingual visual text generation and editing

    Yuxiang Tuo, Wangmeng Xiang, Jun-Yan He, Yifeng Geng, and Xuansong Xie. Anytext: Multilingual visual text generation and editing. 2023

  71. [71]

    Automatic data curation for self-supervised learning: A clustering-based approach

    Huy V Vo, Vasil Khalidov, Timothée Darcet, Théo Moutakanni, Nikita Smetanin, Marc Szafraniec, Hugo Touvron, Camille Couprie, Maxime Oquab, Armand Joulin, et al. Automatic data curation for self-supervised learning: A clustering-based approach. arXiv preprint arXiv:2405.15613, 2024

  72. [72]

    Emu3: Next-Token Prediction is All You Need

    Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024

  73. [73]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

  74. [74]

    Tiif-bench: How does your t2i model follow your instructions?

    Xinyu Wei, Jinrui Zhang, Zeqing Wang, Hongyang Wei, Zhen Guo, and Lei Zhang. Tiif-bench: How does your t2i model follow your instructions? arXiv preprint arXiv:2506.02161, 2025

  75. [75]

    Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time

    Mitchell Wortsman, Gabriel Ilharco, Samir Ya Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, et al. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In International conference on machine learning, pages 23965–23998. P...

  76. [76]

    Qwen-Image Technical Report

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Shengming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingkun...

  77. [77]

    Janus: Decoupling visual encoding for unified multimodal understanding and generation

    Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12966–12977, 2025

  78. [78]

    OmniGen2: Towards Instruction-Aligned Multimodal Generation

    Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, et al. Omnigen2: Exploration to advanced multimodal generation. arXiv preprint arXiv:2506.18871, 2025

  79. [79]

    Lightgen: Efficient image generation through knowledge distillation and direct preference optimization.arXiv preprint arXiv:2503.08619, 2025

    Xianfeng Wu, Yajing Bai, Haoze Zheng, Harold Haodong Chen, Yexin Liu, Zihao Wang, Xuran Ma, Wen-Jie Shu, Xianzu Wu, Harry Yang, et al. Lightgen: Efficient image generation through knowledge distillation and direct preference optimization. arXiv preprint arXiv:2503.08619, 2025

  80. [80]

    Omnigen: Unified image generation

    Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Chaofan Li, Shuting Wang, Tiejun Huang, and Zheng Liu. Omnigen: Unified image generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13294–13304, 2025

Showing first 80 references.