Qwen-Image Technical Report
128 Pith papers cite this work.
abstract
We present Qwen-Image, an image generation foundation model in the Qwen series that achieves significant advances in complex text rendering and precise image editing. To address the challenges of complex text rendering, we design a comprehensive data pipeline that includes large-scale data collection, filtering, annotation, synthesis, and balancing. Moreover, we adopt a progressive training strategy that starts with non-text-to-text rendering, evolves from simple to complex textual inputs, and gradually scales up to paragraph-level descriptions. This curriculum learning approach substantially enhances the model's native text rendering capabilities. As a result, Qwen-Image not only performs exceptionally well in alphabetic languages such as English, but also achieves remarkable progress on more challenging logographic languages like Chinese. To enhance image editing consistency, we introduce an improved multi-task training paradigm that incorporates not only traditional text-to-image (T2I) and text-image-to-image (TI2I) tasks but also image-to-image (I2I) reconstruction, effectively aligning the latent representations between Qwen2.5-VL and MMDiT. Furthermore, we separately feed the original image into Qwen2.5-VL and the VAE encoder to obtain semantic and reconstructive representations, respectively. This dual-encoding mechanism enables the editing module to strike a balance between preserving semantic consistency and maintaining visual fidelity. Qwen-Image achieves state-of-the-art performance, demonstrating its strong capabilities in both image generation and editing across multiple benchmarks.
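To make the dual-encoding idea from the abstract concrete, here is a minimal, self-contained sketch of how an original image could be encoded twice, once by a semantic encoder standing in for Qwen2.5-VL and once by a reconstructive VAE encoder, with both token streams conditioning an MMDiT-style editing block. Everything here is an illustrative assumption: `SemanticEncoder`, `VAEEncoder`, `EditingBlock`, `dual_encode_and_edit`, and all dimensions are hypothetical placeholders, not the actual Qwen-Image architecture or code.

```python
import torch
import torch.nn as nn

# NOTE: all modules and sizes below are hypothetical stand-ins used only to
# illustrate the dual-encoding flow described in the abstract; they are not
# the real Qwen2.5-VL, VAE, or MMDiT implementations.

class SemanticEncoder(nn.Module):
    """Placeholder for the VLM that yields semantic tokens from the image."""
    def __init__(self, dim=256, num_tokens=16):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # patchify
        self.num_tokens = num_tokens

    def forward(self, img):
        feat = self.proj(img).flatten(2).transpose(1, 2)   # (B, N, dim)
        return feat[:, : self.num_tokens]                   # a few semantic tokens


class VAEEncoder(nn.Module):
    """Placeholder for the VAE that yields a reconstructive latent grid."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.down = nn.Conv2d(3, latent_dim, kernel_size=8, stride=8)

    def forward(self, img):
        return self.down(img).flatten(2).transpose(1, 2)    # (B, M, latent_dim)


class EditingBlock(nn.Module):
    """One MMDiT-style block: noisy latents attend to both condition streams."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, noisy_latents, cond_tokens):
        # Cross-attention from noisy image latents to the concatenated
        # semantic + reconstructive condition tokens.
        out, _ = self.attn(noisy_latents, cond_tokens, cond_tokens)
        return self.norm(noisy_latents + out)


def dual_encode_and_edit(original_image, noisy_latents, semantic_enc, vae_enc, block):
    sem = semantic_enc(original_image)     # semantic-consistency signal
    rec = vae_enc(original_image)          # visual-fidelity signal
    cond = torch.cat([sem, rec], dim=1)    # both streams condition the edit
    return block(noisy_latents, cond)


if __name__ == "__main__":
    img = torch.randn(1, 3, 256, 256)      # image being edited
    noisy = torch.randn(1, 1024, 256)      # noisy latent tokens (B, M, dim)
    out = dual_encode_and_edit(img, noisy, SemanticEncoder(), VAEEncoder(), EditingBlock())
    print(out.shape)                       # torch.Size([1, 1024, 256])
```

The design intent this sketch tries to capture is that concatenating the two token streams lets the editing module attend to semantic cues (what the image depicts) and reconstructive cues (how it looks pixel-wise) at the same time, which is how the abstract frames the balance between semantic consistency and visual fidelity.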
citing papers explorer
-
Towards Realistic 3D Emission Materials: Dataset, Baseline, and Evaluation for Emission Texture Generation
The work creates the first dataset and baseline for generating emission textures on 3D objects to reproduce glowing materials from input images.
-
OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models
OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.
-
LiWi: Layering in the Wild
LiWi uses an agent-driven data synthesis pipeline to build the LiWi-100k dataset and a model with shadow-guided and degradation-restoration objectives that achieves SoTA performance on RGB L1 and Alpha IoU for natural image layering decomposition.
-
Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling
Edit-Compass and EditReward-Compass are new unified benchmarks for fine-grained image editing evaluation and realistic reward modeling in reinforcement learning optimization.
-
Asymmetric Flow Models
Asymmetric Flow Modeling restricts noise prediction to a low-rank subspace for high-dimensional flow generation, reaching 1.57 FID on ImageNet 256x256 and new state-of-the-art pixel text-to-image performance via finetuning from latent models.
-
Inline Critic Steers Image Editing
Inline Critic uses a learnable token to critique and steer a frozen image-editing model's intermediate layers during generation, delivering state-of-the-art results on GEdit-Bench, RISEBench, and KRIS-Bench.
-
Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation
INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.
-
Generate "Normal", Edit Poisoned: Branding Injection via Hint Embedding in Image Editing
Invisible hints such as logos embedded in images are re-rendered by diffusion models during text-guided editing, enabling phishing and model-poisoning attacks with average success rates of 44.4% and 32.2%.
-
What Concepts Lie Within? Detecting and Suppressing Risky Content in Diffusion Transformers
A method using attention head vectors detects and suppresses risky content generation in Diffusion Transformers at inference time.
-
EditRefiner: A Human-Aligned Agentic Framework for Image Editing Refinement
EditRefiner uses a perception-reasoning-action-evaluation agent loop and the EditFHF-15K human feedback dataset to refine text-guided image edits more accurately than prior methods.
-
Concept-Based Abductive and Contrastive Explanations for Behaviors of Vision Models
Concept-based abductive and contrastive explanations find minimal high-level concepts that causally determine vision model outcomes on individual images or groups sharing a specified behavior.
-
Sparkle: Realizing Lively Instruction-Guided Video Background Replacement via Decoupled Guidance
Sparkle supplies a large-scale dataset and benchmark for instruction-driven video background replacement, enabling models that generate more natural and temporally consistent new scenes than earlier approaches.
-
Secure Seed-Based Multi-bit Watermarking for Diffusion Models from First Principles
A theoretical framework decouples diffusion model generation from watermark decisions, enabling SSB to reach any security-robustness-fidelity regime without model-specific empirical tests.
-
CrossCult-KIBench: A Benchmark for Cross-Cultural Knowledge Insertion in MLLMs
CrossCult-KIBench is a new benchmark for evaluating cross-cultural knowledge insertion in MLLMs, paired with the MCKI baseline method, showing current approaches fail to balance adaptation and preservation.
-
MULTITEXTEDIT: Benchmarking Cross-Lingual Degradation in Text-in-Image Editing
MULTITEXTEDIT benchmark reveals that all tested text-in-image editing models show pronounced degradation on non-English languages, especially Hebrew and Arabic, mainly in text accuracy and script fidelity.
-
DirectEdit: Step-Level Accurate Inversion for Flow-Based Image Editing
DirectEdit achieves step-level accurate inversion for flow-based image editing by directly aligning forward paths, using attention feature injection and mask-guided noise blending to balance fidelity and editability without additional NFEs.
-
SpecEdit: Training-Free Acceleration for Diffusion based Image Editing via Semantic Locking
SpecEdit accelerates diffusion-based image editing up to 10x by using a low-resolution draft to identify edit-relevant tokens via semantic discrepancies for selective high-resolution denoising.
-
StyleID: A Perception-Aware Dataset and Metric for Stylization-Agnostic Facial Identity Recognition
StyleID supplies human-perception-aligned benchmarks and fine-tuned encoders that improve facial identity recognition robustness across stylization types and strengths.
-
Exploring Spatial Intelligence from a Generative Perspective
Fine-tuning multimodal models on a new synthetic spatial benchmark improves generative spatial compliance on real and synthetic tasks and transfers to better spatial understanding.
-
ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis
ReImagine decouples human appearance from temporal consistency via pretrained image backbones, SMPL-X motion guidance, and training-free video diffusion refinement to generate high-quality controllable videos.
-
HP-Edit: A Human-Preference Post-Training Framework for Image Editing
HP-Edit introduces a post-training framework and RealPref-50K dataset that uses a VLM-based HP-Scorer to align diffusion image editing models with human preferences, improving outputs on Qwen-Image-Edit-2509.
-
Long-Text-to-Image Generation via Compositional Prompt Decomposition
PRISM lets pre-trained text-to-image models handle long prompts by breaking them into compositional parts, predicting noise separately, and merging outputs via energy-based conjunction, matching fine-tuned models while generalizing better to prompts over 500 tokens.
-
View-Consistent 3D Scene Editing via Dual-Path Structural Correspondence and Semantic Continuity
A dual-path consistency framework for text-driven 3D scene editing that models cross-view dependencies via structural correspondence and semantic continuity, trained on a newly constructed paired multi-view dataset.
-
UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
UniGeo unifies geometric guidance across three levels in video models to reduce geometric drift and improve consistency in camera-controllable image editing.
-
UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs
UniEditBench unifies image and video editing evaluation with a nine-plus-eight operation taxonomy and cost-effective 4B/8B distilled MLLM evaluators that align with human judgments.
-
LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories
LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.
-
Beyond Visual Cues: Semantic-Driven Token Filtering and Expert Routing for Anytime Person ReID
STFER uses LVLM-generated identity-consistent semantic text to drive visual token filtering and expert routing for improved any-time person re-identification under clothing changes and modality shifts.
-
LottieGPT: Tokenizing Vector Animation for Autoregressive Generation
LottieGPT tokenizes Lottie animations into compact sequences and fine-tunes Qwen-VL to autoregressively generate coherent vector animations from natural language or visual prompts, outperforming prior SVG models.
-
HiddenObjects: Scalable Diffusion-Distilled Spatial Priors for Object Placement
A diffusion-based pipeline creates a 27M-annotation dataset of object placements that outperforms human annotations and baselines on image editing tasks, then distills it into a fast model.
-
RewardFlow: Generate Images by Optimizing What You Reward
RewardFlow unifies differentiable rewards including a new VQA-based one and uses a prompt-aware adaptive policy with Langevin dynamics to achieve state-of-the-art image editing and compositional generation.
-
FlowGuard: Towards Lightweight In-Generation Safety Detection for Diffusion Models via Linear Latent Decoding
FlowGuard detects unsafe content during diffusion image generation via linear latent decoding and curriculum learning, outperforming prior methods by over 30% F1 while reducing GPU memory by 97% and projection time to 0.2 seconds.
-
SurFITR: A Dataset for Surveillance Image Forgery Detection and Localisation
SurFITR is a new collection of 137k+ surveillance-style forged images that causes existing detectors to degrade while enabling substantial gains when used for training in both in-domain and cross-domain settings.
-
RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details
RefineAnything is a multimodal diffusion model using Focus-and-Refine crop-and-resize with blended paste-back to achieve high-fidelity local image refinement and near-perfect background preservation.
-
1.x-Distill: Breaking the Diversity, Quality, and Efficiency Barrier in Distribution Matching Distillation
1.x-Distill achieves better quality and diversity than prior few-step distillation methods at 1.67 and 1.74 effective NFEs on SD3 models with up to 33x speedup.
-
Banana100: Breaking NR-IQA Metrics by 100 Iterative Image Replications with Nano Banana Pro
Banana100 dataset shows that none of 21 popular NR-IQA metrics consistently rate images degraded by 100 iterative edits lower than clean originals.
-
Dress-ED: Instruction-Guided Editing for Virtual Try-On and Try-Off
Dress-ED is the first large-scale benchmark unifying virtual try-on, try-off, and text-guided garment editing with 146k verified samples plus a multimodal diffusion baseline.
-
Early Semantic Grounding in Image Editing Models for Zero-Shot Referring Image Segmentation
Pretrained instruction-based image editing models exhibit early foreground-background separability that enables a training-free framework for zero-shot referring image segmentation using a single denoising step.
-
Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm
V2V-Zero adapts frozen VLMs for visual conditioning via hidden states from specification pages, scoring 0.85 on GenEval and 32.7 on a new seven-task benchmark while revealing capability hierarchies in attribute binding and structural control.
-
UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation
A unified visual conditioning approach fuses semantic and appearance features before VLM processing, with two-stage training and slot-wise regularization, to improve consistency in multi-reference image generation.
-
L2P: Unlocking Latent Potential for Pixel Generation
L2P repurposes pre-trained LDMs for direct pixel generation via large-patch tokenization and shallow-layer training on synthetic data, matching source performance with 8-GPU training and enabling native 4K output.
-
AuDirector: A Self-Reflective Closed-Loop Framework for Immersive Audio Storytelling
AuDirector is a self-reflective closed-loop multi-agent framework that generates immersive audio narratives with improved structural coherence, emotional expressiveness, and acoustic fidelity via identity-aware voice adaptation and iterative correction.
-
GeoR-Bench: Evaluating Geoscience Visual Reasoning
GeoR-Bench shows top multimodal models reach only 42.7% strict accuracy on geoscience visual reasoning tasks while open-source models reach 10.3%, with outputs often visually plausible yet scientifically inaccurate.
-
FeatMap: Understanding image manipulation in the feature space and its implications for feature space geometry
Linear mappings in feature space can reconstruct a wide range of image manipulations including semantic edits, suggesting that feature representations are approximately linearly organized.
-
Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping
Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to reduce reward hacking and improve performance over GRPO baselines.
-
HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer
A pixel-space Diffusion Transformer with Unified Transformer architecture unifies image generation, editing, and personalization in an end-to-end model that maps all inputs to a shared token space and scales from 8B to over 200B parameters.
-
Pixal3D: Pixel-Aligned 3D Generation from Images
Pixal3D performs pixel-aligned 3D generation from images via back-projected multi-scale feature volumes, achieving fidelity close to reconstruction while supporting multi-view and scene synthesis.
-
How Mobile World Model Guides GUI Agents?
Mobile world models in text, image, and code modalities reach state-of-the-art on their benchmarks and improve downstream GUI agent performance, with code best for in-distribution accuracy and text more robust for out-of-distribution use.
-
Fashion130K: An E-commerce Fashion Dataset for Outfit Generation with Unified Multi-modal Condition
Fashion130K dataset and UMC framework align text and visual prompts to generate more consistent fashion outfits than prior state-of-the-art methods.
-
Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria
Auto-Rubric as Reward externalizes VLM preferences into structured rubrics and applies Rubric Policy Optimization to create more reliable binary rewards for multimodal generation, outperforming pairwise models on text-to-image and editing benchmarks.
-
SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation
SCOPE maintains semantic commitments via structured specifications and conditional skill orchestration, achieving 0.60 EGIP on the new Gen-Arena benchmark while outperforming baselines on WISE-V and MindBench.