A safety direction estimated in a source LLM is transported to a target generator through lightweight alignment on benign data alone, matching native safety performance without any target-side unsafe data.
Qwen-Image Technical Report
Mixed citation behavior. Most common role is background (46%).
abstract
We present Qwen-Image, an image generation foundation model in the Qwen series that achieves significant advances in complex text rendering and precise image editing. To address the challenges of complex text rendering, we design a comprehensive data pipeline that includes large-scale data collection, filtering, annotation, synthesis, and balancing. Moreover, we adopt a progressive training strategy that starts with non-text-to-text rendering, evolves from simple to complex textual inputs, and gradually scales up to paragraph-level descriptions. This curriculum learning approach substantially enhances the model's native text rendering capabilities. As a result, Qwen-Image not only performs exceptionally well in alphabetic languages such as English, but also achieves remarkable progress on more challenging logographic languages like Chinese. To enhance image editing consistency, we introduce an improved multi-task training paradigm that incorporates not only traditional text-to-image (T2I) and text-image-to-image (TI2I) tasks but also image-to-image (I2I) reconstruction, effectively aligning the latent representations between Qwen2.5-VL and MMDiT. Furthermore, we separately feed the original image into Qwen2.5-VL and the VAE encoder to obtain semantic and reconstructive representations, respectively. This dual-encoding mechanism enables the editing module to strike a balance between preserving semantic consistency and maintaining visual fidelity. Qwen-Image achieves state-of-the-art performance, demonstrating its strong capabilities in both image generation and editing across multiple benchmarks.
representative citing papers
The work creates the first dataset and baseline for generating emission textures on 3D objects to reproduce glowing materials from input images.
OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.
Cornfigurator is the first automated deployment planner for generic any-to-any multimodal models that explores the full range of colocation-to-disaggregation strategies and delivers 1.12x to 6.32x higher goodput than existing systems or expert plans.
OrbitQuant is a data-agnostic PTQ technique for DiTs that uses RPBH rotation in a normalized basis to enable a single codebook across all inputs, achieving SOTA low-bit performance on FLUX.1, CogVideoX and similar models.
Introduces VICIS task and training framework for inferring visual concepts from image sets, with experiments showing better accuracy, diversity, and generalization than standard VLMs on synthetic and ImageNet data.
A new pipeline using canonical LoRAs for view synthesis, deformable 3D Gaussian splatting anchored on D-SMAL, and generative repair to produce animatable 3D dogs from single wild images without 3D supervision.
Introduces VidPair-Halluc benchmark of 1K background-controlled adversarial video pairs and 11K QA pairs generated via PairFlow pipeline to evaluate hallucination in LVMs.
SharpMoE is a plug-and-play post-training method that uses clean latent features and a trajectory routing loss to enable accurate saliency-based routing in diffusion MoE models for improved visual generation.
New 3DLP benchmark with real-world 1K HDR pairs shows state-of-the-art image editing models vary in physical lighting consistency, with best models close to reality but error-prone in low-light regions.
PhyEditBench is a new benchmark for physics-aware image editing with real and synthetic instances plus a training-free PhyWorld baseline that uses test-time scaling to outperform SOTA models.
TryOnCrafter is the first DiT-based framework for camera-controllable video virtual try-on, built on a renderable 4D try-on proxy distilled from 2D priors into a 3DGS avatar animated with SMPL-X.
Introduces the 2M synthetic WATER-S dataset and the WATERec model, which achieves 90.40% accuracy on WordArt-Bench, outperforming prior STR methods and VLMs.
Forensic Knowledge Graphs integrate forensic traces, causal dependencies, and scene links via a new authentication network and Iterative Context Refinement to outperform standard detectors and VLMs on detection, localization, and justification.
REALM is the first unified red-teaming benchmark for physical-world VLMs that aligns diverse attack methods via an agentic target-generation pipeline and evaluates them on shared datasets showing text/typographic attacks as most effective.
Sparse Context achieves 2-4x faster inference in reference-conditioned diffusion models by fine-tuning with random token dropping and applying task-aware selection at inference time, without loss of visual quality.
A technique for controllable diversity in text-to-image generation by inducing structured semantic variations at the prompt level via VLM and agentic workflow.
DiT-Reward converts pretrained DiT models into reward predictors that outperform HPSv3 on four benchmarks while providing 1.65x inference speedup.
RS-Gen proposes a plug-and-play agentic framework with a closed-loop reasoning mechanism that augments base image models to achieve SOTA results on WISE Verified and RISEBench.
InterleaveThinker is the first multi-agent pipeline enabling interleaved generation in any image generator through planner-critic agents, SFT on custom datasets, and GRPO RL with accuracy and step-wise rewards.
CineOrchestra unifies control of subjects, events, cameras, and shot transitions in cinematic video generation through entity-centric conditioning primitives and parameter-free coordinated rotary embeddings.
The paper constructs the VIBE benchmark and evaluates six visual in-context learning models on 14 datasets, 12 tasks, and 106 combinations under a unified one-shot protocol, revealing limitations and failure modes.
ImageTime is a benchmark that probes image generation models' visual world modeling by requiring coherent four-state sequences in single images, scored via VLM judge.
SSR-Merge merges LoRAs via subspace construction, inverse correlation decorrelation, and directional steering, shown to match the OLS solution with a streaming implementation that outperforms prior merging methods.
citing papers explorer
- Cornfigurator: Automated Planning for Any-to-Any Multimodal Model Serving
  Cornfigurator is the first automated deployment planner for generic any-to-any multimodal models that explores the full range of colocation-to-disaggregation strategies and delivers 1.12x to 6.32x higher goodput than existing systems or expert plans.
- DiT-Reward: Generative Representations for Text-to-Image Reward Modeling
  DiT-Reward converts pretrained DiT models into reward predictors that outperform HPSv3 on four benchmarks while providing 1.65x inference speedup.
- Explicit Critic Guidance for Aligning Diffusion Models
  Introduces a state-aligned latent actor-critic framework that lets diffusion models act as their own timestep-conditioned value functions for trajectory-level RL post-training and inference steering.
- Concept-Based Abductive and Contrastive Explanations for Behaviors of Vision Models
  Concept-based abductive and contrastive explanations find minimal high-level concepts that causally determine vision model outcomes on individual images or groups sharing a specified behavior.
- Spectral-Progressive Thought Flow for Lightweight Multimodal Reasoning
  SpecFlow represents intermediate visual thoughts in fixed-size DCT space and uses classifier-free guidance to steer updates from textual thoughts, achieving up to 2.1x lower computation and KV cache costs.
- RT-Lynx: Putting the GEMM Sparsity In a Right Way for Diffusion Models
  RT-Lynx shifts DiT sparsity from weights to activations and reports up to 1.55x linear-layer speedup while preserving generation quality across multiple diffusion models.
- BigMac: Breaking the Pareto Frontier of Compute and Memory in Multimodal LLM Training
  BigMac uses a dependency-safe nested pipeline to achieve O(1) activation memory for encoders and generators in MLLM training while matching unlimited-memory compute efficiency and delivering 1.08-1.9x speedup.
- FeatMap: Understanding image manipulation in the feature space and its implications for feature space geometry
  Linear mappings in feature space can reconstruct a wide range of image manipulations including semantic edits, suggesting that feature representations are approximately linearly organized.
- Cornserve: A Distributed Serving System for Any-to-Any Multimodal Models
  Cornserve introduces a task abstraction and record-and-replay runtime for Any-to-Any multimodal models, achieving up to 3.81x higher throughput and 5.79x lower tail latency through component disaggregation and direct tensor forwarding.
- OTCache: Optimal Transport for Geometry-Aware Caching in Diffusion Models
  OTCache uses optimal transport to interpolate caching schedules between a graph-based reference and an Optuna-optimized anchor, delivering 3.66x-4.7x speedups on FLUX.1, Qwen-Image and HunyuanVideo with improved fidelity.
- FlowAWR: Online Adaptive Flow Reinforcement via Advantage-Weighted Rectification
  FlowAWR derives an advantage-weighted rectification for optimal velocity fields in flow models, claiming 2-5x faster convergence than DiffusionNFT on SD3.5-Medium.
- Diffusion Templates: A Unified Plugin Framework for Controllable Diffusion
  Diffusion Templates is a unified plugin framework that allows injecting various controllable capabilities into diffusion models through a standardized interface.