Recognition: no theorem link
Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer
Pith reviewed 2026-05-11 14:01 UTC · model grok-4.3
The pith
A 6-billion-parameter image model reaches commercial-level photorealism and text rendering with far less training compute than larger rivals.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Z-Image, built on the S3-DiT architecture, matches or surpasses leading competitors across various dimensions through systematic optimization of the whole model lifecycle, from curated data through the training curriculum to few-step distillation with reward post-training. It delivers exceptional photorealistic image generation and bilingual text rendering that rival top-tier commercial models, while requiring only 314K H800 GPU hours of training compute.
What carries the argument
The Scalable Single-Stream Diffusion Transformer (S3-DiT) architecture, which runs the denoising computation through a single efficient stream rather than through multiple separate processing paths.
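The paper's exact block design is not reproduced on this page; as a rough illustration of the single-stream idea under common DiT conventions, the minimal sketch below concatenates text and image tokens into one sequence sharing a single attention and MLP path (names such as SingleStreamBlock, txt, and img are illustrative, not taken from the paper).

```python
# Minimal sketch of a "single-stream" DiT block: text and image tokens are
# concatenated into one sequence and share one attention + MLP path, in
# contrast to dual-stream designs that keep separate per-modality blocks.
# Sizes and names are illustrative, not taken from the Z-Image paper.
import torch
import torch.nn as nn


class SingleStreamBlock(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, txt: torch.Tensor, img: torch.Tensor):
        # One joint sequence: every image token attends to every text token
        # (and vice versa) through the same weights -- the "single stream".
        x = torch.cat([txt, img], dim=1)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x[:, : txt.shape[1]], x[:, txt.shape[1]:]


if __name__ == "__main__":
    block = SingleStreamBlock()
    txt = torch.randn(2, 77, 256)    # prompt tokens
    img = torch.randn(2, 1024, 256)  # latent image patches
    t_out, i_out = block(txt, img)
    print(t_out.shape, i_out.shape)
```

Whether Z-Image adds timestep modulation, positional encodings, or modality-specific projections around this joint attention is a design detail of the paper that the sketch does not attempt to capture.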
If this is right
- Z-Image-Turbo enables sub-second inference on enterprise GPUs while remaining compatible with consumer hardware under 16GB VRAM.
- Z-Image-Edit provides strong instruction-following for image editing through the same omni-pre-training approach.
- Full training completes in 314K H800 GPU hours, roughly $630K in compute (see the arithmetic sketch after this list), lowering the barrier to high-performance model development.
- Open release of code, weights, and demo supports community extension of efficient generative models beyond current proprietary systems.
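The cost figure can be sanity-checked with back-of-envelope arithmetic; the per-hour H800 rate below is simply whatever makes the paper's two reported numbers consistent, not a rate quoted in the paper.

```python
# Back-of-envelope consistency check of the reported training budget.
# The ~$2 per H800 GPU-hour rate is implied by the paper's own totals,
# not a figure quoted in it.
gpu_hours = 314_000        # reported H800 GPU hours
total_cost_usd = 630_000   # reported approximate cost in USD
implied_rate = total_cost_usd / gpu_hours
print(f"implied rate: ${implied_rate:.2f} per GPU-hour")  # ~$2.01
```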
Where Pith is reading between the lines
- Efficiency-focused designs like this could allow more frequent retraining or domain-specific fine-tuning without massive infrastructure.
- The bilingual text strength suggests similar single-stream methods may improve multilingual handling in other vision-language tasks.
- Reduced overall compute opens possibilities for on-device or edge deployment of high-quality image generation.
- The pattern of combining architecture changes with data and curriculum optimizations may apply to related generative domains such as video.
Load-bearing premise
The claimed performance levels arise directly from the described data curation, training curriculum, S3-DiT design, and distillation methods rather than from undisclosed larger data scales or selective evaluation practices.
What would settle it
Independent side-by-side evaluation of the released Z-Image weights against the cited competitors on fixed public benchmarks for photorealism and text accuracy, using identical prompts and metrics with fully disclosed training data volume.
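One minimal form such a head-to-head comparison could take is sketched below; the generate_* and ocr_accuracy callables are hypothetical placeholders the evaluator would supply (e.g. wrappers around the released weights and an OCR engine), not APIs described in the paper.

```python
# Sketch of a fixed-prompt, side-by-side evaluation harness for the
# text-rendering claim. All callables are placeholders the evaluator supplies.
from statistics import mean
from typing import Callable


def side_by_side(prompts: list[str],
                 generate_z_image: Callable[[str], object],
                 generate_baseline: Callable[[str], object],
                 ocr_accuracy: Callable[[object, str], float]) -> dict:
    """Score two models on identical prompts with an identical metric."""
    z_scores, base_scores = [], []
    for prompt in prompts:
        z_scores.append(ocr_accuracy(generate_z_image(prompt), prompt))
        base_scores.append(ocr_accuracy(generate_baseline(prompt), prompt))
    wins = sum(z > b for z, b in zip(z_scores, base_scores))
    return {
        "z_image_mean": mean(z_scores),
        "baseline_mean": mean(base_scores),
        "z_image_win_rate": wins / len(prompts),
    }
```

Photorealism would need a complementary protocol, such as FID against a fixed reference set or blinded human preference, since an OCR-style metric only covers the text-rendering claim.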
read the original abstract
The landscape of high-performance image generation models is currently dominated by proprietary systems, such as Nano Banana Pro and Seedream 4.0. Leading open-source alternatives, including Qwen-Image, Hunyuan-Image-3.0 and FLUX.2, are characterized by massive parameter counts (20B to 80B), making them impractical for inference, and fine-tuning on consumer-grade hardware. To address this gap, we propose Z-Image, an efficient 6B-parameter foundation generative model built upon a Scalable Single-Stream Diffusion Transformer (S3-DiT) architecture that challenges the "scale-at-all-costs" paradigm. By systematically optimizing the entire model lifecycle -- from a curated data infrastructure to a streamlined training curriculum -- we complete the full training workflow in just 314K H800 GPU hours (approx. $630K). Our few-step distillation scheme with reward post-training further yields Z-Image-Turbo, offering both sub-second inference latency on an enterprise-grade H800 GPU and compatibility with consumer-grade hardware (<16GB VRAM). Additionally, our omni-pre-training paradigm also enables efficient training of Z-Image-Edit, an editing model with impressive instruction-following capabilities. Both qualitative and quantitative experiments demonstrate that our model achieves performance comparable to or surpassing that of leading competitors across various dimensions. Most notably, Z-Image exhibits exceptional capabilities in photorealistic image generation and bilingual text rendering, delivering results that rival top-tier commercial models, thereby demonstrating that state-of-the-art results are achievable with significantly reduced computational overhead. We publicly release our code, weights, and online demo to foster the development of accessible, budget-friendly, yet state-of-the-art generative models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Z-Image, a 6B-parameter image generation foundation model using a Scalable Single-Stream Diffusion Transformer (S3-DiT) architecture. It optimizes the full lifecycle via curated data infrastructure, streamlined training curriculum, few-step distillation with reward post-training (yielding Z-Image-Turbo for sub-second inference), and omni-pre-training for an editing variant (Z-Image-Edit). The central claim is that qualitative and quantitative experiments show performance comparable to or surpassing larger models (FLUX.2, Qwen-Image, commercial systems) in photorealism and bilingual text rendering, achieved at low cost (314K H800 GPU hours, ~$630K).
Significance. If the performance claims hold with rigorous evidence, the work would be significant for demonstrating that state-of-the-art image generation results are attainable with substantially smaller models and lower training compute than the current 20B–80B scale paradigm, potentially improving accessibility for fine-tuning and inference on consumer hardware.
major comments (2)
- [Abstract] Abstract: The assertion that 'qualitative and quantitative experiments demonstrate' comparable or superior performance is unsupported by any reported metrics (FID, CLIP score, OCR accuracy for bilingual text, human preference rates), baselines, ablation studies, or error analysis, leaving the central empirical claim without visible evidence.
- [Results and Experiments] Experimental claims: No dataset cardinality, image-text pair counts, filtering criteria, or evaluation protocol details (e.g., inference steps matched to baselines, statistical tests, blinded raters) are supplied, so it is impossible to determine whether the S3-DiT architecture, curriculum, or distillation—not undisclosed data advantages—produce the headline results rivaling 20B–80B models.
minor comments (2)
- The manuscript would benefit from a dedicated table or figure summarizing quantitative comparisons against named baselines with exact scores and standard deviations.
- Clarify the precise definition and implementation details of the 'Scalable Single-Stream Diffusion Transformer (S3-DiT)' early in the architecture section to aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address the major comments point by point below and will revise the manuscript to strengthen the empirical evidence and transparency of our claims.
read point-by-point responses
-
Referee: [Abstract] Abstract: The assertion that 'qualitative and quantitative experiments demonstrate' comparable or superior performance is unsupported by any reported metrics (FID, CLIP score, OCR accuracy for bilingual text, human preference rates), baselines, ablation studies, or error analysis, leaving the central empirical claim without visible evidence.
Authors: We agree that the abstract makes a broad claim without embedding specific quantitative metrics, which leaves the central assertion insufficiently supported at first reading. The full manuscript does contain quantitative results and comparisons in the Experiments section, but these are not summarized in the abstract. In the revised version we will update the abstract to explicitly report key metrics (FID, CLIP score, OCR accuracy for bilingual text, and human preference rates) together with the main baselines. We will also expand the Experiments section with additional ablations, error analysis, and clearer tabular presentation of all quantitative results so that the performance claims are directly evidenced. revision: yes
-
Referee: [Results and Experiments] Experimental claims: No dataset cardinality, image-text pair counts, filtering criteria, or evaluation protocol details (e.g., inference steps matched to baselines, statistical tests, blinded raters) are supplied, so it is impossible to determine whether the S3-DiT architecture, curriculum, or distillation—not undisclosed data advantages—produce the headline results rivaling 20B–80B models.
Authors: We acknowledge that the current manuscript provides only high-level descriptions of the data pipeline and evaluation setup, making it difficult to isolate the contributions of the S3-DiT architecture and training curriculum from potential data advantages. In the revision we will add a dedicated subsection detailing the total number of image-text pairs, filtering criteria, deduplication steps, and the full evaluation protocol (including inference steps used for all baselines, statistical tests, and human-study design with blinded raters). These additions will allow readers to assess the source of the reported performance gains. revision: yes
Circularity Check
No circularity: empirical performance claims rest on external benchmarks and experiments.
full rationale
The paper describes an empirical architecture (S3-DiT) and training pipeline whose headline results are asserted via qualitative/quantitative experiments on standard image-generation tasks. No equations, first-principles derivations, or fitted parameters are presented that reduce by construction to the inputs; the central claim is a comparative performance statement against external models (FLUX.2, Qwen-Image, etc.) rather than a self-referential prediction. Self-citations are absent from the provided text, and the training-cost figure (314K H800 hours) is an input cost, not a derived output. This is the normal non-circular case for a systems paper.
Axiom & Free-Parameter Ledger
invented entities (1)
-
S3-DiT
no independent evidence
Forward citations
Cited by 44 Pith papers
-
MiVE: Multiscale Vision-language features for reference-guided video Editing
MiVE repurposes VLMs as multiscale feature extractors integrated into a unified self-attention Diffusion Transformer, achieving top human preference in reference-guided video editing.
-
ImageAttributionBench: How Far Are We from Generalizable Attribution?
ImageAttributionBench is a benchmark dataset demonstrating that state-of-the-art image attribution methods lack robustness to image degradation and fail to generalize to semantically disjoint domains.
-
Asymmetric Flow Models
Asymmetric Flow Modeling restricts noise prediction to a low-rank subspace for high-dimensional flow generation, reaching 1.57 FID on ImageNet 256x256 and new state-of-the-art pixel text-to-image performance via finet...
-
UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation
UniCustom fuses ViT and VAE features before VLM encoding and uses two-stage training plus slot-wise regularization to improve subject consistency in multi-reference diffusion-based image generation.
-
RevealLayer: Disentangling Hidden and Visible Layers via Occlusion-Aware Image Decomposition
RevealLayer decomposes natural images into multiple RGBA layers using diffusion models with region-aware attention, occlusion-guided adaptation, and a composite loss, outperforming prior methods on a new benchmark dataset.
-
What Concepts Lie Within? Detecting and Suppressing Risky Content in Diffusion Transformers
A method using attention head vectors detects and suppresses risky content generation in Diffusion Transformers at inference time.
-
Weblica: Scalable and Reproducible Training Environments for Visual Web Agents
Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar op...
-
DirectEdit: Step-Level Accurate Inversion for Flow-Based Image Editing
DirectEdit achieves step-level accurate inversion for flow-based image editing by directly aligning forward paths, using attention feature injection and mask-guided noise blending to balance fidelity and editability w...
-
Evaluating Remote Sensing Image Captions Beyond Metric Biases
Unfine-tuned MLLMs outperform fine-tuned models on remote sensing image captioning when captions are scored by their ability to reconstruct the source image, and a training-free self-correction method achieves SOTA pe...
-
Generative Texture Filtering
A two-stage fine-tuning strategy on pre-trained generative models enables effective texture filtering that outperforms prior methods on challenging cases.
-
Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning
GDMD replaces raw-sample rewards with distillation-gradient rewards in RL-guided diffusion distillation, yielding 4-step models that surpass their multi-step teachers on GenEval and human preference metrics.
-
CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-Controllable Video Generation
CT-1 transfers spatial reasoning from vision-language models to estimate camera trajectories, which are then used in a video diffusion model with wavelet regularization to produce controllable videos, claiming 25.7% b...
-
Large-Scale Universal Defect Generation: Foundation Models and Datasets
A 300K quadruplet dataset and UniDG foundation model enable reference- or text-driven defect generation across categories, outperforming few-shot baselines on anomaly detection tasks.
-
FlowGuard: Towards Lightweight In-Generation Safety Detection for Diffusion Models via Linear Latent Decoding
FlowGuard detects unsafe content during diffusion image generation via linear latent decoding and curriculum learning, outperforming prior methods by over 30% F1 while reducing GPU memory by 97% and projection time to...
-
SurFITR: A Dataset for Surveillance Image Forgery Detection and Localisation
SurFITR is a new collection of 137k+ surveillance-style forged images that causes existing detectors to degrade while enabling substantial gains when used for training in both in-domain and cross-domain settings.
-
Beyond Few-Step Inference: Accelerating Video Diffusion Transformer Model Serving with Inter-Request Caching Reuse
Chorus accelerates video DiT serving up to 45% via inter-request caching reuse in a three-stage denoising strategy with token-guided attention amplification.
-
Skill-Aligned Annotation for Reliable Evaluation in Text-to-Image Generation
Skill-aligned annotation improves inter-annotator agreement and evaluation stability in text-to-image generation compared to uniform annotation baselines.
-
Early Semantic Grounding in Image Editing Models for Zero-Shot Referring Image Segmentation
Pretrained instruction-based image editing models exhibit early foreground-background separability that enables a training-free framework for zero-shot referring image segmentation using a single denoising step.
-
UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation
A unified visual conditioning approach fuses semantic and appearance features before VLM processing, with two-stage training and slot-wise regularization, to improve consistency in multi-reference image generation.
-
L2P: Unlocking Latent Potential for Pixel Generation
L2P repurposes pre-trained LDMs for direct pixel generation via large-patch tokenization and shallow-layer training on synthetic data, matching source performance with 8-GPU training and enabling native 4K output.
-
HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer
A pixel-space Diffusion Transformer with Unified Transformer architecture unifies image generation, editing, and personalization in an end-to-end model that maps all inputs to a shared token space and scales from 8B t...
-
SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation
SCOPE maintains semantic commitments via structured specifications and conditional skill orchestration, achieving 0.60 EGIP on the new Gen-Arena benchmark while outperforming baselines on WISE-V and MindBench.
-
DynT2I-Eval: A Dynamic Evaluation Framework for Text-to-Image Models
DynT2I-Eval creates fresh prompts via dimension decomposition and dynamic sampling to evaluate text-to-image models on text alignment, quality, and aesthetics while maintaining a stable leaderboard.
-
D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models
D-OPSD enables continuous supervised fine-tuning of few-step diffusion models via on-policy self-distillation where the model acts as both teacher (multimodal context) and student (text-only context) on its own roll-outs.
-
LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model
LLaDA2.0-Uni unifies multimodal understanding and generation inside one discrete diffusion large language model with a semantic tokenizer, MoE backbone, and diffusion decoder.
-
CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation
CoInteract adds a human-aware mixture-of-experts and spatially-structured co-generation to a diffusion transformer to synthesize videos with stable structures and physically plausible human-object contacts.
-
Any3DAvatar: Fast and High-Quality Full-Head 3D Avatar Reconstruction from Single Portrait Image
Any3DAvatar reconstructs full-head 3D Gaussian avatars from one image via one-step denoising on a Plücker-aware scaffold plus auxiliary view supervision, beating prior single-image methods on fidelity while running su...
-
Generative Refinement Networks for Visual Synthesis
GRN uses hierarchical binary quantization and entropy-guided refinement to set new ImageNet records of 0.56 rFID for reconstruction and 1.81 gFID for class-conditional generation while releasing code and models.
-
Continuous Adversarial Flow Models
Continuous adversarial flow models replace MSE in flow matching with adversarial training via a discriminator, improving guidance-free FID on ImageNet from 8.26 to 3.63 for SiT and similar gains for JiT and text-to-im...
-
Gen-Searcher: Reinforcing Agentic Search for Image Generation
Gen-Searcher is the first search-augmented image generation agent trained with SFT followed by agentic RL using dual text and image rewards on custom datasets and the KnowGen benchmark.
-
SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.
-
IdentiFace: Multi-Modal Iterative Diffusion Framework for Identifiable Suspect Face Generation in Crime Investigations
IdentiFace is a multi-modal iterative diffusion framework that generates identifiable suspect faces with improved identity retrieval for law enforcement applications.
-
Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling
Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...
-
Exploring Time Conditioning in Diffusion Generative Models from Disjoint Noisy Data Manifolds
Aligning the DDIM forward diffusion process with flow-matching manifold evolution enables high-quality generation without time conditioning, and class-conditional synthesis is possible with an unconditional denoiser b...
-
Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation
Tuna-2 shows pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive or superior results on understanding and generation benchmarks.
-
Context Unrolling in Omni Models
Omni is a multimodal model whose native training on diverse data types enables context unrolling, allowing explicit reasoning across modalities to better approximate shared knowledge and improve downstream performance.
-
CreatiParser: Generative Image Parsing of Raster Graphic Designs into Editable Layers
CreatiParser decomposes raster graphic designs into editable text, background, and sticker layers via a hybrid VLM-diffusion model with ParserReward and GRPO optimization, reporting 23.7% average metric gains on Parse...
-
On Semiotic-Grounded Interpretive Evaluation of Generative Art
SemJudge uses a Hierarchical Semiosis Graph based on Peircean theory to evaluate deeper artistic meaning in generative art and aligns better with human judgments than prior metrics.
-
Qwen-Image-2.0 Technical Report
Qwen-Image-2.0 unifies high-fidelity image generation and precise editing by coupling Qwen3-VL with a Multimodal Diffusion Transformer, improving text rendering, photorealism, and complex prompt following over prior versions.
-
Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation
JoyAI-Image unifies visual understanding, generation, and editing in one model and claims stronger spatial intelligence through bidirectional perception-generation loops.
-
Beyond Standard Benchmarks: A Systematic Audit of Vision-Language Model's Robustness to Natural Semantic Variation Across Diverse Tasks
Robust CLIP models amplify vulnerabilities to natural adversarial scenarios while standard CLIP shows large performance drops on natural language-induced adversarial examples in zero-shot classification, segmentation,...
-
Can Nano Banana 2 Replace Traditional Image Restoration Models? An Evaluation of Its Performance on Image Restoration Tasks
Nano Banana 2 delivers competitive perceptual quality on image restoration but produces over-enhanced results that diverge from input fidelity in ways standard metrics miss.
-
The Fourth Challenge on Image Super-Resolution ($\times$4) at NTIRE 2026: Benchmark Results and Method Overview
The NTIRE 2026 ×4 super-resolution challenge benchmarks 31 teams on bicubic-downsampled images using PSNR for the restoration track and perceptual scores for the realism track.
-
The Second Challenge on Real-World Face Restoration at NTIRE 2026: Methods and Results
The NTIRE 2026 real-world face restoration challenge report details outcomes from 9 valid team submissions advancing perceptual quality and identity consistency in degraded face images.
Reference graph
Works this paper leans on
-
[1]
Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael Voznesensky, Bin Bao, Peter Bell, David Berard, Evgeni Burovski, et al. Pytorch 2: Faster machine learning through dynamic python bytecode transformation and graph compilation. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Langu...
work page 2024
-
[2]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[3]
Imagen 3. arXiv preprint arXiv:2408.07009, 2024
Jason Baldridge, Jakob Bauer, Mukul Bhutani, Nicole Brichtova, Andrew Bunner, Lluis Castrejon, Kelvin Chan, Yichang Chen, Sander Dieleman, Yuqing Du, et al. Imagen 3. arXiv preprint arXiv:2408.07009, 2024
-
[4]
Improving image generation with better captions. Computer Science
James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, 2(3):8, 2023
work page 2023
-
[5]
Instructpix2pix: Learning to follow image editing instructions
Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18392–18402, 2023
work page 2023
-
[6]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020
work page 2020
-
[7]
Hidream-i1: An open-source high-efficient image generative foundation model
Qi Cai, Yehao Li, Yingwei Pan, Ting Yao, and Tao Mei. Hidream-i1: An open-source high-efficient image generative foundation model. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 13636–13639, 2025
work page 2025
-
[8]
Hunyuanimage 3.0 technical report. arXiv preprint arXiv:2509.23951, 2025
Siyu Cao, Hangting Chen, Peng Chen, Yiji Cheng, Yutao Cui, Xinchi Deng, Ying Dong, Kipper Gong, Tianpeng Gu, Xiusen Gu, et al. Hunyuanimage 3.0 technical report. arXiv preprint arXiv:2509.23951, 2025
-
[9]
Oneig-bench: Omni-dimensional nuanced evaluation for image generation
Jingjing Chang, Yixiao Fang, Peng Xing, Shuhan Wu, Wei Cheng, Rui Wang, Xianfang Zeng, Gang Yu, and Hai-Bao Chen. Oneig-bench: Omni-dimensional nuanced evaluation for image generation. arXiv preprint arXiv:2506.07977, 2025
-
[10]
Textdiffuser-2: Unleashing the power of language models for text rendering
Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, and Furu Wei. Textdiffuser-2: Unleashing the power of language models for text rendering. In European Conference on Computer Vision, pages 386–402. Springer, 2024
work page 2024
-
[11]
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset
Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, et al. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset. arXiv preprint arXiv:2505.09568, 2025
work page Pith review arXiv 2025
-
[12]
Pixart-𝜎: Weak-to-strong training of diffusion transformer for 4k text-to-image generation
Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-𝜎: Weak-to-strong training of diffusion transformer for 4k text-to-image generation. In European Conference on Computer Vision, pages 74–91. Springer, 2024
work page 2024
-
[13]
Pixart-𝛼: Fast training of diffusion transformer for photorealistic text-to-image synthesis
Junsong Chen, YU Jincheng, GE Chongjian, Lewei Yao, Enze Xie, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-𝛼: Fast training of diffusion transformer for photorealistic text-to-image synthesis. In The Twelfth International Conference on Learning Representations
-
[14]
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[15]
Emerging Properties in Unified Multimodal Pretraining
Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining, 2025. URL https://arxiv.org/abs/2505.14683
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[16]
Cogview: Mastering text-to-image generation via transformers
Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, et al. Cogview: Mastering text-to-image generation via transformers. Advances in neural information processing systems, 34:19822–19835, 2021
work page 2021
-
[17]
Textcrafter: Accurately rendering multiple texts in complex visual scenes
Nikai Du, Zhennan Chen, Zhizhou Chen, Shan Gao, Xi Chen, Zhengkai Jiang, Jian Yang, and Ying Tai. Textcrafter: Accurately rendering multiple texts in complex visual scenes. arXiv preprint arXiv:2503.23461, 2025
-
[18]
Scaling rectified flow transformers for high-resolution image synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Proceedings of the International Conference on Machine Learning (ICML), 2024
work page 2024
-
[19]
Rongyao Fang, Aldrich Yu, Chengqi Duan, Linjiang Huang, Shuai Bai, Yuxuan Cai, Kun Wang, Si Liu, Xihui Liu, and Hongsheng Li. Flux-reason-6m and prism-bench: A million-scale text-to-image reasoning dataset and comprehensive benchmark. arXiv preprint arXiv:2509.09680, 2025
-
[20]
Peng Gao, Le Zhuo, Chris Liu, Ruoyi Du, Xu Luo, Longtian Qiu, Yuhang Zhang, et al. Lumina-t2x: Transforming text into any modality, resolution, and duration via flow-based large diffusion transformers. arXiv preprint arXiv:2405.05945, 2024
-
[21]
Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, et al. Seedream 3.0 technical report. arXiv preprint arXiv:2504.11346, 2025
work page internal anchor Pith review arXiv 2025
-
[22]
X-omni: Reinforcement learning makes discrete autoregressive image generative models great again
Zigang Geng, Yibing Wang, Yeyao Ma, Chen Li, Yongming Rao, Shuyang Gu, Zhao Zhong, Qinglin Lu, Han Hu, Xiaosong Zhang, et al. X-omni: Reinforcement learning makes discrete autoregressive image generative models great again. arXiv preprint arXiv:2507.22058, 2025
-
[23]
Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems, 36:52132–52152, 2023
work page 2023
-
[24]
Dynamic few-shot visual learning without forgetting
Spyros Gidaris and Nikos Komodakis. Dynamic few-shot visual learning without forgetting. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4367–4375, 2018
work page 2018
-
[25]
Gemini 2.5 flash & 2.5 flash image model card
Google. Gemini 2.5 flash & 2.5 flash image model card. https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-2-5-Flash-Model-Card.pdf, 2025
work page 2025
-
[26]
Google. Imagen 4 model card. https://storage.googleapis.com/deepmind-media/Model-Cards/Imagen-4-Model-Card.pdf, 2025
work page 2025
-
[27]
Google. Nano banana pro. https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Pro-Image-Model-Card.pdf, 2025
work page 2025
-
[28]
Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis
Jian Han, Jinlai Liu, Yi Jiang, Bin Yan, Yuqi Zhang, Zehuan Yuan, Bingyue Peng, and Xiaobing Liu. Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15733–15744, 2025
work page 2025
-
[29]
Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. Advances in Neural Information Processing Systems Workshops (NeurIPS Workshops), 2021
work page 2021
-
[30]
ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment
Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment. arXiv preprint arXiv:2403.05135, 2024
work page internal anchor Pith review arXiv 2024
-
[31]
Dengyang Jiang, Dongyang Liu, Zanyi Wang, Qilong Wu, Xin Jin, David Liu, Zhen Li, Mengmeng Wang, Peng Gao, and Harry Yang. Distribution matching distillation meets reinforcement learning. arXiv preprint arXiv:2511.13649, 2025
-
[32]
Analyzing and improving the training dynamics of diffusion models
Tero Karras, Miika Aittala, Jaakko Lehtinen, Janne Hellsten, Timo Aila, and Samuli Laine. Analyzing and improving the training dynamics of diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24174–24184, 2024
work page 2024
-
[33]
Kolors 2.0. https://app.klingai.com/cn/, 2025
Kuaishou Kolors Team. Kolors 2.0. https://app.klingai.com/cn/, 2025
work page 2025
-
[34]
Flux. https://github.com/black-forest-labs/flux, 2023
Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2023
work page 2023
-
[35]
FLUX.2: State-of-the-Art Visual Intelligence
Black Forest Labs. FLUX.2: State-of-the-Art Visual Intelligence. https://bfl.ai/blog/flux-2, 2025
work page 2025
-
[36]
Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, et al. Flux.1 kontext: Flow matching for in-context image generation and editing in latent space. arXiv preprint arXiv:2506.15742, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[37]
LeaderGPU. Gpu server rental pricing. https://www.leadergpu.com/, 2025. Accessed: November 2025
work page 2025
-
[38]
Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation
Daiqing Li, Aleks Kamko, Ehsan Akhgari, Ali Sabet, Linmiao Xu, and Suhail Doshi. Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation. arXiv preprint arXiv:2402.17245, 2024
-
[39]
Ragdiffusion: Faithful cloth generation via external knowledge assimilation
Yuhan Li, Xianfeng Tan, Wenxiang Shang, Yubo Wu, Jian Wang, Xuanhong Chen, Yi Zhang, Hangcheng Zhu, and Bingbing Ni. Ragdiffusion: Faithful cloth generation via external knowledge assimilation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17485–17495, 2025
work page 2025
-
[40]
Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, et al. Hunyuan-dit: A powerful multi-resolution diffusion transformer with fine-grained chinese understanding. arXiv preprint arXiv:2405.08748, 2024
-
[41]
Visualcloze: A universal image generation framework via visual in-context learning
Zhong-Yu Li, Ruoyi Du, Juncheng Yan, Le Zhuo, Zhen Li, Peng Gao, Zhanyu Ma, and Ming-Ming Cheng. Visualcloze: A universal image generation framework via visual in-context learning. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2025
work page 2025
-
[42]
Zongjian Li, Zheyuan Liu, Qihui Zhang, Bin Lin, Shenghai Yuan, Zhiyuan Yan, Yang Ye, Wangbo Yu, Yuwei Niu, and Li Yuan. Uniworld-v2: Reinforce image editing with diffusion negative-aware finetuning and mllm implicit feedback. arXiv preprint arXiv:2510.16888, 2025
-
[43]
UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
Bin Lin, Zongjian Li, Xinhua Cheng, Yuwei Niu, Yang Ye, Xianyi He, Shenghai Yuan, Wangbo Yu, Shaodong Wang, Yunyang Ge, et al. Uniworld: High-resolution semantic encoders for unified visual understanding and generation. arXiv preprint arXiv:2506.03147, 2025
work page internal anchor Pith review arXiv 2025
-
[44]
Flow Matching for Generative Modeling
Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[45]
Dongyang Liu, David Liu, Peng Gao, Ruoyi Du, Zhen Li, Qilong Wu, Xin Jin, Sihan Cao, Shifeng Zhang, Hongsheng Li, and Steven Hoi. Decoupled dmd: Cfg augmentation as the spear, distribution matching as the shield. arXiv preprint, 2025
work page 2025
-
[46]
Flow-GRPO: Training Flow Matching Models via Online RL
Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl. arXiv preprint arXiv:2505.05470, 2025
work page internal anchor Pith review arXiv 2025
-
[47]
Step1X-Edit: A Practical Framework for General Image Editing
Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, et al. Step1x-edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761, 2025
work page internal anchor Pith review arXiv 2025
-
[48]
Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[49]
Omnicaptioner: One captioner to rule them all. arXiv preprint arXiv:2504.07089, 2025
Yiting Lu, Jiakang Yuan, Zhen Li, Shitian Zhao, Qi Qin, Xinyue Li, Le Zhuo, Licheng Wen, Dongyang Liu, Yuewen Cao, et al. Omnicaptioner: One captioner to rule them all. arXiv preprint arXiv:2504.07089, 2025
-
[50]
Cosine normalization: Using cosine similarity instead of dot product in neural networks
Chunjie Luo, Jianfeng Zhan, Xiaohe Xue, Lei Wang, Rui Ren, and Qiang Yang. Cosine normalization: Using cosine similarity instead of dot product in neural networks. In International conference on artificial neural networks, pages 382–391. Springer, 2018
work page 2018
-
[51]
Yiyang Ma, Xingchao Liu, Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zizheng Pan, Zhenda Xie, Haowei Zhang, Liang Zhao, et al. Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation. arXiv preprint arXiv:2411.07975, 2024
-
[52]
Midjourney v7. https://www.midjourney.com/home, 2025
Midjourney. Midjourney v7. https://www.midjourney.com/home, 2025
work page 2025
-
[53]
Enhancing few-shot image classification with cosine transformer. IEEE Access, 11:79659–79672, 2023
Quang-Huy Nguyen, Cuong Q Nguyen, Dung D Le, and Hieu H Pham. Enhancing few-shot image classification with cosine transformer. IEEE Access, 11:79659–79672, 2023
work page 2023
-
[54]
Cagra: Highly parallel graph construction and approximate nearest neighbor search for gpus
Hiroyuki Ootomo, Akira Naruse, Corey Nolet, Ray Wang, Tamas Feher, and Yong Wang. Cagra: Highly parallel graph construction and approximate nearest neighbor search for gpus. In 2024 IEEE 40th International Conference on Data Engineering (ICDE), pages 4236–4247. IEEE, 2024
work page 2024
-
[55]
OpenAI. Gpt-image-1. https://openai.com/zh-Hans-CN/index/introducing-4o-image-generation/, 2025
work page 2025
-
[56]
The pagerank citation ranking: Bringing order to the web
Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford infolab, 1999
work page 1999
-
[57]
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[58]
Qi Qin, Le Zhuo, Yi Xin, Ruoyi Du, Zhen Li, Bin Fu, Yiting Lu, Jiakang Yuan, Xinyue Li, Dongyang Liu, et al. Lumina-image 2.0: A unified and efficient image generative framework. arXiv preprint arXiv:2503.21758, 2025
-
[59]
Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023
work page 2023
-
[60]
cuGraph - RAPIDS Graph Analytics Library
rapidsai. cuGraph - RAPIDS Graph Analytics Library. https://github.com/rapidsai/cugraph, 2018. Accessed: 2025-11-12
work page 2018
-
[61]
Recraft v3. https://www.recraft.ai/docs/recraft-models/recraft-V3, 2024
Recraft. Recraft v3. https://www.recraft.ai/docs/recraft-models/recraft-V3, 2024
work page 2024
-
[62]
Stephen Robertson, Hugo Zaragoza, et al. The probabilistic relevance framework: Bm25 and beyond. Foundations and Trends® in Information Retrieval, 3(4):333–389, 2009
work page 2009
-
[63]
High-resolution image synthesis with latent diffusion models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022
work page 2022
-
[64]
Seedream 4.0: Toward Next-generation Multimodal Image Generation
Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, et al. Seedream 4.0: Toward next-generation multimodal image generation. arXiv preprint arXiv:2509.20427, 2025
work page internal anchor Pith review arXiv 2025
-
[65]
Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. Flashattention-3: Fast and accurate attention with asynchrony and low-precision. Advances in Neural Information Processing Systems, 37:68658–68685, 2024
work page 2024
-
[66]
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024
work page 2024
-
[67]
Flux.1 krea [dev]. https://github.com/krea-ai/flux-krea, 2025
FLUX-Krea Team. Flux.1 krea [dev]. https://github.com/krea-ai/flux-krea, 2025
work page 2025
-
[68]
From louvain to leiden: guaranteeing well-connected communities. Scientific reports, 9(1):1–12, 2019
Vincent A Traag, Ludo Waltman, and Nees Jan Van Eck. From louvain to leiden: guaranteeing well-connected communities. Scientific reports, 9(1):1–12, 2019
work page 2019
-
[69]
Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[70]
Anytext: Multilingual visual text generation and editing
Yuxiang Tuo, Wangmeng Xiang, Jun-Yan He, Yifeng Geng, and Xuansong Xie. Anytext: Multilingual visual text generation and editing. 2023
work page 2023
-
[71]
Automatic data curation for self-supervised learning: A clustering-based approach
Huy V Vo, Vasil Khalidov, Timothée Darcet, Théo Moutakanni, Nikita Smetanin, Marc Szafraniec, Hugo Touvron, Camille Couprie, Maxime Oquab, Armand Joulin, et al. Automatic data curation for self-supervised learning: A clustering-based approach. arXiv preprint arXiv:2405.15613, 2024
-
[72]
Emu3: Next-Token Prediction is All You Need
Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[73]
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022
work page 2022
-
[74]
Tiif-bench: How does your t2i model follow your instructions? arXiv preprint arXiv:2506.02161, 2025
Xinyu Wei, Jinrui Zhang, Zeqing Wang, Hongyang Wei, Zhen Guo, and Lei Zhang. Tiif-bench: How does your t2i model follow your instructions? arXiv preprint arXiv:2506.02161, 2025
-
[75]
Mitchell Wortsman, Gabriel Ilharco, Samir Ya Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, et al. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In International conference on machine learning, pages 23965–23998. P...
work page 2022
-
[76]
Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingkun...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[77]
Janus: Decoupling visual encoding for unified multimodal understanding and generation
Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12966–12977, 2025
work page 2025
-
[78]
OmniGen2: Towards Instruction-Aligned Multimodal Generation
Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, et al. Omnigen2: Exploration to advanced multimodal generation. arXiv preprint arXiv:2506.18871, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[79]
Xianfeng Wu, Yajing Bai, Haoze Zheng, Harold Haodong Chen, Yexin Liu, Zihao Wang, Xuran Ma, Wen-Jie Shu, Xianzu Wu, Harry Yang, et al. Lightgen: Efficient image generation through knowledge distillation and direct preference optimization. arXiv preprint arXiv:2503.08619, 2025
-
[80]
Omnigen: Unified image generation
Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Chaofan Li, Shuting Wang, Tiejun Huang, and Zheng Liu. Omnigen: Unified image generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13294–13304, 2025
work page 2025