Qwen-Image Technical Report
Mixed citation behavior. Most common role is background (46%).
abstract
We present Qwen-Image, an image generation foundation model in the Qwen series that achieves significant advances in complex text rendering and precise image editing. To address the challenges of complex text rendering, we design a comprehensive data pipeline that includes large-scale data collection, filtering, annotation, synthesis, and balancing. Moreover, we adopt a progressive training strategy that starts with non-text-to-text rendering, evolves from simple to complex textual inputs, and gradually scales up to paragraph-level descriptions. This curriculum learning approach substantially enhances the model's native text rendering capabilities. As a result, Qwen-Image not only performs exceptionally well in alphabetic languages such as English, but also achieves remarkable progress on more challenging logographic languages like Chinese. To enhance image editing consistency, we introduce an improved multi-task training paradigm that incorporates not only traditional text-to-image (T2I) and text-image-to-image (TI2I) tasks but also image-to-image (I2I) reconstruction, effectively aligning the latent representations between Qwen2.5-VL and MMDiT. Furthermore, we separately feed the original image into Qwen2.5-VL and the VAE encoder to obtain semantic and reconstructive representations, respectively. This dual-encoding mechanism enables the editing module to strike a balance between preserving semantic consistency and maintaining visual fidelity. Qwen-Image achieves state-of-the-art performance, demonstrating its strong capabilities in both image generation and editing across multiple benchmarks.
representative citing papers
The work creates the first dataset and baseline for generating emission textures on 3D objects to reproduce glowing materials from input images.
OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.
Cornfigurator is the first automated deployment planner for generic any-to-any multimodal models that explores the full range of colocation-to-disaggregation strategies and delivers 1.12x to 6.32x higher goodput than existing systems or expert plans.
OrbitQuant is a data-agnostic PTQ technique for DiTs that uses RPBH rotation in a normalized basis to enable a single codebook across all inputs, achieving SOTA low-bit performance on FLUX.1, CogVideoX and similar models.
Introduces VICIS task and training framework for inferring visual concepts from image sets, with experiments showing better accuracy, diversity, and generalization than standard VLMs on synthetic and ImageNet data.
A new pipeline using canonical LoRAs for view synthesis, deformable 3D Gaussian splatting anchored on D-SMAL, and generative repair to produce animatable 3D dogs from single wild images without 3D supervision.
Introduces VidPair-Halluc benchmark of 1K background-controlled adversarial video pairs and 11K QA pairs generated via PairFlow pipeline to evaluate hallucination in LVMs.
PhyEditBench is a new benchmark for physics-aware image editing with real and synthetic instances plus a training-free PhyWorld baseline that uses test-time scaling to outperform SOTA models.
InterleaveThinker is the first multi-agent pipeline enabling interleaved generation in any image generator through planner-critic agents, SFT on custom datasets, and GRPO RL with accuracy and step-wise rewards.
CineOrchestra unifies control of subjects, events, cameras, and shot transitions in cinematic video generation through entity-centric conditioning primitives and parameter-free coordinated rotary embeddings.
The paper constructs the VIBE benchmark and evaluates six visual in-context learning models on 14 datasets, 12 tasks, and 106 combinations under a unified one-shot protocol, revealing limitations and failure modes.
ImageTime is a benchmark that probes image generation models' visual world modeling by requiring coherent four-state sequences in single images, scored via VLM judge.
SSR-Merge merges LoRAs via subspace construction, inverse correlation decorrelation, and directional steering, shown to match the OLS solution with a streaming implementation that outperforms prior merging methods.
PoisonLoRA demonstrates ~100% attack success rates for stealthy LoRA poisoning via concept hijacking and task injection on real platforms, with robustness to base model transfer and multiple remixes.
ZIPP conditions diffusion models on LLM-rewritten prompts derived from graph-mined natural-language personas to achieve zero-shot personalization, reporting 13-20% gains and 79% human preference win rate over generic outputs.
CBS partitions the diffusion timeline into segments of equal approximation burden via Dirichlet energy and trajectory acceleration monitors estimated by an auxiliary model, yielding higher synthesis quality at fixed per-step cost across SiT, JiT and UNet backbones.
RED-Aes learns aesthetic changes from edit-induced image pairs and a new RED-20k dataset via three-stage relative ranking training, claiming SOTA generalization over absolute MOS regression.
GeM-NR performs multi-view consistent nonrigid editing by aligning depth-derived point clouds between edited and unedited scenes then refining projections conditioned on the original query view.
UniCAD supplies a unified multi-modal benchmark and an end-to-end MLLM that performs reconstruction, generation, and QA on CAD data, reporting SOTA results on UniCAD and Fusion360.
MetaPoint represents 2D coordinates as special tokens in visual generative models to enable precise spatial control using existing positional encodings without architectural modifications.
Introduces the Impostor benchmark dataset for localizing AIGC image manipulations via agent curation and the PANet model that uses phase and semantic consistency for better detection.
A large-scale empirical study across tokenizers and diffusion backbones identifies Velocity Irreducible Variance (VIV) as one of the most stable predictors of latent diffusion generation quality.
The SIU²A framework evaluates scientific images for error detection, repair feasibility, and correction quality, showing current multimodal systems have major limitations in preserving scientific validity.
citing papers explorer
- Do Models Share Safety Representations? Cross-Model Steering for Safe Visual Generation
A safety direction estimated in a source LLM is transported to a target generator through lightweight alignment on benign data alone, matching native safety performance without any target-side unsafe data.
- Towards Realistic 3D Emission Materials: Dataset, Baseline, and Evaluation for Emission Texture Generation
The work creates the first dataset and baseline for generating emission textures on 3D objects to reproduce glowing materials from input images.
- OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models
OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.
- OrbitQuant: Data-Agnostic Quantization for Image and Video Diffusion Transformers
OrbitQuant is a data-agnostic PTQ technique for DiTs that uses RPBH rotation in a normalized basis to enable a single codebook across all inputs, achieving SOTA low-bit performance on FLUX.1, CogVideoX and similar models.
- Show Me Examples: Inferring Visual Concepts from Image Sets
Introduces VICIS task and training framework for inferring visual concepts from image sets, with experiments showing better accuracy, diversity, and generalization than standard VLMs on synthetic and ImageNet data.
- CORGI: Consistency-Aware 3D Dog Reconstruction from a Single Image in the Wild
A new pipeline using canonical LoRAs for view synthesis, deformable 3D Gaussian splatting anchored on D-SMAL, and generative repair to produce animatable 3D dogs from single wild images without 3D supervision.
- No Place to Hide: Benchmarking Video Hallucination with Background-Controlled Pairs
Introduces VidPair-Halluc benchmark of 1K background-controlled adversarial video pairs and 11K QA pairs generated via PairFlow pipeline to evaluate hallucination in LVMs.
- PhyEditBench: A Real-World Multi-Stage Benchmark for Physics-Aware Image Editing
PhyEditBench is a new benchmark for physics-aware image editing with real and synthetic instances plus a training-free PhyWorld baseline that uses test-time scaling to outperform SOTA models.
- InterleaveThinker: Reinforcing Agentic Interleaved Generation
InterleaveThinker is the first multi-agent pipeline enabling interleaved generation in any image generator through planner-critic agents, SFT on custom datasets, and GRPO RL with accuracy and step-wise rewards.
- CineOrchestra: Unified Entity-Centric Conditioning for Cinematic Video Generation
CineOrchestra unifies control of subjects, events, cameras, and shot transitions in cinematic video generation through entity-centric conditioning primitives and parameter-free coordinated rotary embeddings.
- Quo Vadis, Visual In-Context Learning? A Unified Benchmark Across Domains and Tasks
The paper constructs the VIBE benchmark and evaluates six visual in-context learning models on 14 datasets, 12 tasks, and 106 combinations under a unified one-shot protocol, revealing limitations and failure modes.
- Can Image Models Imagine Time? ImageTime: A Novel Benchmark for Probing Visual World Modeling Through Spatiotemporal Consistency
ImageTime is a benchmark that probes image generation models' visual world modeling by requiring coherent four-state sequences in single images, scored via VLM judge.
- SSR-Merge: Subspace Signal Routing for Training-Free LoRA Merging in Diffusion Models
SSR-Merge merges LoRAs via subspace construction, inverse correlation decorrelation, and directional steering, shown to match the OLS solution with a streaming implementation that outperforms prior merging methods.
- Complexity-Balanced Diffusion Splitting
CBS partitions the diffusion timeline into segments of equal approximation burden via Dirichlet energy and trajectory acceleration monitors estimated by an auxiliary model, yielding higher synthesis quality at fixed per-step cost across SiT, JiT and UNet backbones.
- Beyond Absolute Scores: Relative Edit-induced Difference for Generalizable Image Aesthetic Assessment
RED-Aes learns aesthetic changes from edit-induced image pairs and a new RED-20k dataset via three-stage relative ranking training, claiming SOTA generalization over absolute MOS regression.
- GeM-NR: Geometry-Aware Multi-View Editing for Nonrigid Scene Changes
GeM-NR performs multi-view consistent nonrigid editing by aligning depth-derived point clouds between edited and unedited scenes then refining projections conditioned on the original query view.
- UniCAD: A Unified Benchmark and Universal Model for Multi-Modal Multi-Task CAD
UniCAD supplies a unified multi-modal benchmark and an end-to-end MLLM that performs reconstruction, generation, and QA on CAD data, reporting SOTA results on UniCAD and Fusion360.
- MetaPoint: Unlocking Precise Spatial Control in Agentic Visual Generation
MetaPoint represents 2D coordinates as special tokens in visual generative models to enable precise spatial control using existing positional encodings without architectural modifications.
- Impostor: An Agent-Curated Benchmark for Realistic AIGC Manipulation Localization
Introduces the Impostor benchmark dataset for localizing AIGC image manipulations via agent curation and the PANet model that uses phase and semantic consistency for better detection.
- Diffusing in the Right Space: A Systematic Study of Latent Diffusability
A large-scale empirical study across tokenizers and diffusion backbones identifies Velocity Irreducible Variance (VIV) as one of the most stable predictors of latent diffusion generation quality.
- Towards Characterizing Scientific Image Utility and Upgradability
The SIU²A framework evaluates scientific images for error detection, repair feasibility, and correction quality, showing current multimodal systems have major limitations in preserving scientific validity.
- MemoGen: Can Past Experience Improve Future Text-to-Image Generation?
MemoGen is a training-free agentic framework that stores task understanding, references, visual feedback, and lessons from past generations as reusable memory to improve text-to-image output over evolution rounds.
- From Zero to Hero: Training-Free Custom Concept Spawning in World Models
SPAWN enables training-free insertion of custom visual concepts into autoregressive world models by swapping the pinned context-memory anchor over a short injection window.
- LL-Bench: Rethinking Low-Level Vision Evaluation in the Era of Large-Scale Generative Models
LL-Bench supplies a human-annotated dataset exposing generative model weaknesses in low-level restoration and introduces LL-Score as an MLLM evaluator that outperforms existing quality metrics and can serve as a training reward.
- Chameleon: Style-Content Disentangled Framework for Cross-Domain Object Compositing
Chameleon proposes the first large-scale cross-domain compositing dataset and a disentangled encoder plus gated diffusion transformer that outperforms prior in-domain and cross-domain methods on plausibility and fidelity.
- GGT-100K: Generative Ground Truth for Generalizable Real-World Image Restoration
GGT-100K is a 103k-pair LQ-HQ dataset generated via MFMs to enhance real-world generalization of image restoration models.
- CardioLens: Revealing the Clinical Reality Gap of MLLMs via Multi-Sequence Cardiac MRI Evaluations
CardioLens is a leakage-resistant CMR testbed of 473k slices and 13k QA pairs showing current MLLMs exhibit a large clinical reality gap with category-collapse failures on real workflows.
- Orthogonal Negative Guidance in Attention Feature Space for Text-to-Image Generation
Orthogonal Negative Guidance subtracts only the orthogonal component of negative-prompt attention features from positive ones in FLUX models to suppress concepts while preserving semantics and quality.
- Towards Anatomically Plausible Human Image Generation via Synthetic Localized Preferences
ASAP generates over 10K synthetic anatomical preference pairs via targeted degradation of high-fidelity images and applies a localized margin-bounded DPO to reduce anatomical errors in text-to-image human generation, supported by the new HAP dataset and HAF-Bench.
- Vision-Language Binding in In-Context Image Generation
Text tokens in FLUX.2 absorb reference image properties like color and style to influence outputs while pixel-exact details bypass them, localized to padding tokens via causal interventions.
- ETCHR: Editing To Clarify and Harness Reasoning
A decoupled question-conditioned image editor trained via supervised imitation then VLM-reward enhancement improves MLLM visual reasoning Pass@1 by 4.6-5.5 points across models and tasks.
- VINS-120K: Ultra High-Resolution Image Editing with A Large-Scale Dataset
VINS-120K supplies the first large-scale set of instruction-image-edited-image triplets at ultra-high resolution together with an adaptation strategy that improves detail synthesis.
- VDE: Training-Free Accelerating Rectified Flow Model via Velocity Decomposition and Estimation
VDE accelerates rectified flow models like Flux by 3.22x with LPIPS of 0.069 via velocity decomposition into parallel/orthogonal components plus periodic full-pass anchoring.
- MotiMotion: Motion-Controlled Video Generation with Visual Reasoning
MotiMotion adds visual reasoning via a training-free VLM to refine primary trajectories and hallucinate secondary motions, plus a confidence-aware guidance scheme, yielding more plausible interactions on the new MotiBench benchmark.
- VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis
VGenST-Bench is a new video benchmark for MLLM spatio-temporal reasoning built via generative synthesis, a multi-agent pipeline with human oversight, a 3x2x2 taxonomy, and hierarchical tasks separating perception from reasoning.
- Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning
Uni-Edit introduces a data synthesis pipeline turning VQA data into reasoning-intensive editing instructions, enabling single-task tuning that boosts all three capabilities in models like BAGEL and Janus-Pro.
- Linear-DPO: Linear Direct Preference Optimization for Diffusion and Flow-Matching Generative Models
Linear-DPO replaces sigmoid utility with linear utility and adds EMA reference to improve preference alignment in diffusion and flow-matching text-to-image models.
- ReAlign: Generalizable Image Forgery Detection via Reasoning-Aligned Representation
ReAlign distills LLM-generated reasoning texts into a lightweight AIGI forgery detector via contrastive image-text alignment to improve generalization on complex forgeries.
- From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing
A planner-orchestrator system learns long-horizon image editing by maximizing outcome-based rewards from a vision-language judge and refining plans from successful trajectories.
- Editor's Choice: Evaluating Abstract Intent in Image Editing through Atomic Entity Analysis
Presents Entity-Rubrics and AbstractEdit benchmark to measure image editing models on abstract intent, finding standard models struggle to balance edit intent with image preservation.
- Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling
Edit-Compass and EditReward-Compass are new unified benchmarks for fine-grained image editing evaluation and realistic reward modeling in reinforcement learning optimization.
- Asymmetric Flow Models
AsymFlow uses rank-asymmetric velocity prediction to reach 1.57 FID on ImageNet 256x256 and enables finetuning of latent flow models into superior pixel-space text-to-image generators.
- Inline Critic Steers Image Editing
Inline Critic uses a learnable token to critique and steer a frozen image-editing model's intermediate layers during generation, delivering state-of-the-art results on GEdit-Bench, RISEBench, and KRIS-Bench.
- Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation
INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.
- Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm
Proposes V2V-Zero, a training-free framework replacing text conditioning with VLM final-layer hidden states from visual pages, achieving 0.85 on GenEval and 32.7/100 on new Simple-V2V Bench across models including video extension.
- What Concepts Lie Within? Detecting and Suppressing Risky Content in Diffusion Transformers
A method using attention head vectors detects and suppresses risky content generation in Diffusion Transformers at inference time.
- EditRefiner: A Human-Aligned Agentic Framework for Image Editing Refinement
EditRefiner uses a perception-reasoning-action-evaluation agent loop and the EditFHF-15K human feedback dataset to refine text-guided image edits more accurately than prior methods.
- Sparkle: Realizing Lively Instruction-Guided Video Background Replacement via Decoupled Guidance
Sparkle supplies a large-scale dataset and benchmark for instruction-driven video background replacement, enabling models that generate more natural and temporally consistent new scenes than earlier approaches.
- MULTITEXTEDIT: Benchmarking Cross-Lingual Degradation in Text-in-Image Editing
MULTITEXTEDIT benchmark reveals that all tested text-in-image editing models show pronounced degradation on non-English languages, especially Hebrew and Arabic, mainly in text accuracy and script fidelity.
- SpecEdit: Training-Free Acceleration for Diffusion based Image Editing via Semantic Locking
SpecEdit accelerates diffusion-based image editing up to 10x by using a low-resolution draft to identify edit-relevant tokens via semantic discrepancies for selective high-resolution denoising.