super hub Mixed citations

Qwen-Image Technical Report

Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan · 2025 · cs.CV · arXiv 2508.02324

Mixed citation behavior. Most common role is background (46%).

311 Pith papers citing it

Background 46% of classified citations

open full Pith review browse 311 citing papers more from Chenfei Wu arXiv PDF

abstract

We present Qwen-Image, an image generation foundation model in the Qwen series that achieves significant advances in complex text rendering and precise image editing. To address the challenges of complex text rendering, we design a comprehensive data pipeline that includes large-scale data collection, filtering, annotation, synthesis, and balancing. Moreover, we adopt a progressive training strategy that starts with non-text-to-text rendering, evolves from simple to complex textual inputs, and gradually scales up to paragraph-level descriptions. This curriculum learning approach substantially enhances the model's native text rendering capabilities. As a result, Qwen-Image not only performs exceptionally well in alphabetic languages such as English, but also achieves remarkable progress on more challenging logographic languages like Chinese. To enhance image editing consistency, we introduce an improved multi-task training paradigm that incorporates not only traditional text-to-image (T2I) and text-image-to-image (TI2I) tasks but also image-to-image (I2I) reconstruction, effectively aligning the latent representations between Qwen2.5-VL and MMDiT. Furthermore, we separately feed the original image into Qwen2.5-VL and the VAE encoder to obtain semantic and reconstructive representations, respectively. This dual-encoding mechanism enables the editing module to strike a balance between preserving semantic consistency and maintaining visual fidelity. Qwen-Image achieves state-of-the-art performance, demonstrating its strong capabilities in both image generation and editing across multiple benchmarks.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 34 baseline 23 method 10 dataset 2

citation-polarity summary

background 32 baseline 23 use method 10 unclear 2 use dataset 2

claims ledger

abstract We present Qwen-Image, an image generation foundation model in the Qwen series that achieves significant advances in complex text rendering and precise image editing. To address the challenges of complex text rendering, we design a comprehensive data pipeline that includes large-scale data collection, filtering, annotation, synthesis, and balancing. Moreover, we adopt a progressive training strategy that starts with non-text-to-text rendering, evolves from simple to complex textual inputs, and gradually scales up to paragraph-level descriptions. This curriculum learning approach substantially

authors

Chenfei Wu Jiahao Li Jingren Zhou Junyang Lin Kaiyuan Gao Kun Yan

co-cited works

representative citing papers

Do Models Share Safety Representations? Cross-Model Steering for Safe Visual Generation

cs.CV · 2026-06-03 · unverdicted · novelty 8.0

A safety direction estimated in a source LLM is transported to a target generator through lightweight alignment on benign data alone, matching native safety performance without any target-side unsafe data.

Towards Realistic 3D Emission Materials: Dataset, Baseline, and Evaluation for Emission Texture Generation

cs.CV · 2026-04-13 · unverdicted · novelty 8.0

The work creates the first dataset and baseline for generating emission textures on 3D objects to reproduce glowing materials from input images.

OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models

cs.CV · 2026-04-05 · unverdicted · novelty 8.0

OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.

Cornfigurator: Automated Planning for Any-to-Any Multimodal Model Serving

cs.LG · 2025-12-16 · conditional · novelty 8.0

Cornfigurator is the first automated deployment planner for generic any-to-any multimodal models that explores the full range of colocation-to-disaggregation strategies and delivers 1.12x to 6.32x higher goodput than existing systems or expert plans.

OrbitQuant: Data-Agnostic Quantization for Image and Video Diffusion Transformers

cs.CV · 2026-07-02 · unverdicted · novelty 7.0

OrbitQuant is a data-agnostic PTQ technique for DiTs that uses RPBH rotation in a normalized basis to enable a single codebook across all inputs, achieving SOTA low-bit performance on FLUX.1, CogVideoX and similar models.

Show Me Examples: Inferring Visual Concepts from Image Sets

cs.CV · 2026-07-02 · unverdicted · novelty 7.0

Introduces VICIS task and training framework for inferring visual concepts from image sets, with experiments showing better accuracy, diversity, and generalization than standard VLMs on synthetic and ImageNet data.

CORGI: Consistency-Aware 3D Dog Reconstruction from a Single Image in the Wild

cs.CV · 2026-07-01 · unverdicted · novelty 7.0

A new pipeline using canonical LoRAs for view synthesis, deformable 3D Gaussian splatting anchored on D-SMAL, and generative repair to produce animatable 3D dogs from single wild images without 3D supervision.

No Place to Hide: Benchmarking Video Hallucination with Background-Controlled Pairs

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

Introduces VidPair-Halluc benchmark of 1K background-controlled adversarial video pairs and 11K QA pairs generated via PairFlow pipeline to evaluate hallucination in LVMs.

PhyEditBench: A Real-World Multi-Stage Benchmark for Physics-Aware Image Editing

cs.CV · 2026-06-25 · unverdicted · novelty 7.0

PhyEditBench is a new benchmark for physics-aware image editing with real and synthetic instances plus a training-free PhyWorld baseline that uses test-time scaling to outperform SOTA models.

InterleaveThinker: Reinforcing Agentic Interleaved Generation

cs.CV · 2026-06-11 · unverdicted · novelty 7.0

InterleaveThinker is the first multi-agent pipeline enabling interleaved generation in any image generator through planner-critic agents, SFT on custom datasets, and GRPO RL with accuracy and step-wise rewards.

CineOrchestra: Unified Entity-Centric Conditioning for Cinematic Video Generation

cs.CV · 2026-06-11 · unverdicted · novelty 7.0

CineOrchestra unifies control of subjects, events, cameras, and shot transitions in cinematic video generation through entity-centric conditioning primitives and parameter-free coordinated rotary embeddings.

Quo Vadis, Visual In-Context Learning? A Unified Benchmark Across Domains and Tasks

cs.CV · 2026-06-09 · unverdicted · novelty 7.0

The paper constructs the VIBE benchmark and evaluates six visual in-context learning models on 14 datasets, 12 tasks, and 106 combinations under a unified one-shot protocol, revealing limitations and failure modes.

Can Image Models Imagine Time? ImageTime: A Novel Benchmark for Probing Visual World Modeling Through Spatiotemporal Consistency

cs.CV · 2026-06-09 · unverdicted · novelty 7.0

ImageTime is a benchmark that probes image generation models' visual world modeling by requiring coherent four-state sequences in single images, scored via VLM judge.

SSR-Merge: Subspace Signal Routing for Training-Free LoRA Merging in Diffusion Models

cs.CV · 2026-06-09 · unverdicted · novelty 7.0

SSR-Merge merges LoRAs via subspace construction, inverse correlation decorrelation, and directional steering, shown to match the OLS solution with a streaming implementation that outperforms prior merging methods.

Customization under Fire: Plugin Poisoning in Text-to-Image Ecosystem

cs.CR · 2026-06-08 · unverdicted · novelty 7.0

PoisonLoRA demonstrates ~100% attack success rates for stealthy LoRA poisoning via concept hijacking and task injection on real platforms, with robustness to base model transfer and multiple remixes.

ZIPP:Zero-shot Image Personalization from Personas

cs.AI · 2026-06-07 · unverdicted · novelty 7.0

ZIPP conditions diffusion models on LLM-rewritten prompts derived from graph-mined natural-language personas to achieve zero-shot personalization, reporting 13-20% gains and 79% human preference win rate over generic outputs.

Complexity-Balanced Diffusion Splitting

cs.CV · 2026-06-04 · unverdicted · novelty 7.0

CBS partitions the diffusion timeline into segments of equal approximation burden via Dirichlet energy and trajectory acceleration monitors estimated by an auxiliary model, yielding higher synthesis quality at fixed per-step cost across SiT, JiT and UNet backbones.

Beyond Absolute Scores: Relative Edit-induced Difference for Generalizable Image Aesthetic Assessment

cs.CV · 2026-06-04 · unverdicted · novelty 7.0 · 2 refs

RED-Aes learns aesthetic changes from edit-induced image pairs and a new RED-20k dataset via three-stage relative ranking training, claiming SOTA generalization over absolute MOS regression.

GeM-NR: Geometry-Aware Multi-View Editing for Nonrigid Scene Changes

cs.CV · 2026-06-03 · unverdicted · novelty 7.0

GeM-NR performs multi-view consistent nonrigid editing by aligning depth-derived point clouds between edited and unedited scenes then refining projections conditioned on the original query view.

UniCAD: A Unified Benchmark and Universal Model for Multi-Modal Multi-Task CAD

cs.CV · 2026-06-03 · unverdicted · novelty 7.0

UniCAD supplies a unified multi-modal benchmark and an end-to-end MLLM that performs reconstruction, generation, and QA on CAD data, reporting SOTA results on UniCAD and Fusion360.

MetaPoint: Unlocking Precise Spatial Control in Agentic Visual Generation

cs.CV · 2026-06-03 · unverdicted · novelty 7.0

MetaPoint represents 2D coordinates as special tokens in visual generative models to enable precise spatial control using existing positional encodings without architectural modifications.

Impostor: An Agent-Curated Benchmark for Realistic AIGC Manipulation Localization

cs.CV · 2026-06-03 · unverdicted · novelty 7.0

Introduces the Impostor benchmark dataset for localizing AIGC image manipulations via agent curation and the PANet model that uses phase and semantic consistency for better detection.

Diffusing in the Right Space: A Systematic Study of Latent Diffusability

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

A large-scale empirical study across tokenizers and diffusion backbones identifies Velocity Irreducible Variance (VIV) as one of the most stable predictors of latent diffusion generation quality.

Towards Characterizing Scientific Image Utility and Upgradability

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

The SIU²A framework evaluates scientific images for error detection, repair feasibility, and correction quality, showing current multimodal systems have major limitations in preserving scientific validity.

citing papers explorer

Showing 1 of 1 citing paper after filters.

GF-DiT: Scheduling Parallelism for Diffusion Transformer Serving cs.DC · 2026-06-11 · unreviewed · ref 42 · internal anchor

Qwen-Image Technical Report

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer