pith. machine review for the scientific record. sign in

super hub

Qwen-Image Technical Report

128 Pith papers cite this work. Polarity classification is still indexing.

128 Pith papers citing it
abstract

We present Qwen-Image, an image generation foundation model in the Qwen series that achieves significant advances in complex text rendering and precise image editing. To address the challenges of complex text rendering, we design a comprehensive data pipeline that includes large-scale data collection, filtering, annotation, synthesis, and balancing. Moreover, we adopt a progressive training strategy that starts with non-text-to-text rendering, evolves from simple to complex textual inputs, and gradually scales up to paragraph-level descriptions. This curriculum learning approach substantially enhances the model's native text rendering capabilities. As a result, Qwen-Image not only performs exceptionally well in alphabetic languages such as English, but also achieves remarkable progress on more challenging logographic languages like Chinese. To enhance image editing consistency, we introduce an improved multi-task training paradigm that incorporates not only traditional text-to-image (T2I) and text-image-to-image (TI2I) tasks but also image-to-image (I2I) reconstruction, effectively aligning the latent representations between Qwen2.5-VL and MMDiT. Furthermore, we separately feed the original image into Qwen2.5-VL and the VAE encoder to obtain semantic and reconstructive representations, respectively. This dual-encoding mechanism enables the editing module to strike a balance between preserving semantic consistency and maintaining visual fidelity. Qwen-Image achieves state-of-the-art performance, demonstrating its strong capabilities in both image generation and editing across multiple benchmarks.

hub tools

citation-role summary

baseline 2 background 1

citation-polarity summary

claims ledger

  • abstract We present Qwen-Image, an image generation foundation model in the Qwen series that achieves significant advances in complex text rendering and precise image editing. To address the challenges of complex text rendering, we design a comprehensive data pipeline that includes large-scale data collection, filtering, annotation, synthesis, and balancing. Moreover, we adopt a progressive training strategy that starts with non-text-to-text rendering, evolves from simple to complex textual inputs, and gradually scales up to paragraph-level descriptions. This curriculum learning approach substantially

authors

co-cited works

representative citing papers

OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models

cs.CV · 2026-04-05 · unverdicted · novelty 8.0

OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.

LiWi: Layering in the Wild

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

LiWi uses an agent-driven data synthesis pipeline to build the LiWi-100k dataset and a model with shadow-guided and degradation-restoration objectives that achieves SoTA performance on RGB L1 and Alpha IoU for natural image layering decomposition.

Asymmetric Flow Models

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

Asymmetric Flow Modeling restricts noise prediction to a low-rank subspace for high-dimensional flow generation, reaching 1.57 FID on ImageNet 256x256 and new state-of-the-art pixel text-to-image performance via finetuning from latent models.

Inline Critic Steers Image Editing

cs.CV · 2026-05-12 · conditional · novelty 7.0

Inline Critic uses a learnable token to critique and steer a frozen image-editing model's intermediate layers during generation, delivering state-of-the-art results on GEdit-Bench, RISEBench, and KRIS-Bench.

DirectEdit: Step-Level Accurate Inversion for Flow-Based Image Editing

cs.CV · 2026-05-04 · unverdicted · novelty 7.0

DirectEdit achieves step-level accurate inversion for flow-based image editing by directly aligning forward paths, using attention feature injection and mask-guided noise blending to balance fidelity and editability without additional NFEs.

Long-Text-to-Image Generation via Compositional Prompt Decomposition

cs.CV · 2026-04-20 · unverdicted · novelty 7.0

PRISM lets pre-trained text-to-image models handle long prompts by breaking them into compositional parts, predicting noise separately, and merging outputs via energy-based conjunction, matching fine-tuned models while generalizing better to prompts over 500 tokens.

citing papers explorer

Showing 50 of 128 citing papers.