hub Mixed citations

OmniGen2: Towards Instruction-Aligned Multimodal Generation

· 2025 · cs.CV · arXiv 2506.18871

Mixed citation behavior. Most common role is background (43%).

76 Pith papers citing it

Background 43% of classified citations

open full Pith review browse 76 citing papers arXiv PDF

abstract

In this work, we introduce OmniGen2, a versatile and open-source generative model designed to provide a unified solution for diverse generation tasks, including text-to-image, image editing, and in-context generation. Unlike OmniGen v1, OmniGen2 features two distinct decoding pathways for text and image modalities, utilizing unshared parameters and a decoupled image tokenizer. This design enables OmniGen2 to build upon existing multimodal understanding models without the need to re-adapt VAE inputs, thereby preserving the original text generation capabilities. To facilitate the training of OmniGen2, we developed comprehensive data construction pipelines, encompassing image editing and in-context generation data. Additionally, we introduce a reflection mechanism tailored for image generation tasks and curate a dedicated reflection dataset based on OmniGen2. Despite its relatively modest parameter size, OmniGen2 achieves competitive results on multiple task benchmarks, including text-to-image and image editing. To further evaluate in-context generation, also referred to as subject-driven tasks, we introduce a new benchmark named OmniContext. OmniGen2 achieves state-of-the-art performance among open-source models in terms of consistency. We will release our models, training code, datasets, and data construction pipeline to support future research in this field. Project Page: https://vectorspacelab.github.io/OmniGen2; GitHub Link: https://github.com/VectorSpaceLab/OmniGen2

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 17 baseline 15 other 2 dataset 1

citation-polarity summary

background 15 baseline 15 support 2 unclear 2 use dataset 1

claims ledger

abstract In this work, we introduce OmniGen2, a versatile and open-source generative model designed to provide a unified solution for diverse generation tasks, including text-to-image, image editing, and in-context generation. Unlike OmniGen v1, OmniGen2 features two distinct decoding pathways for text and image modalities, utilizing unshared parameters and a decoupled image tokenizer. This design enables OmniGen2 to build upon existing multimodal understanding models without the need to re-adapt VAE inputs, thereby preserving the original text generation capabilities. To facilitate the training of Omn

co-cited works

representative citing papers

Masked Generative Transformer Is What You Need for Image Editing

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

EditMGT applies masked generative transformers with attention consolidation and region-hold sampling to deliver state-of-the-art localized image editing at 6x the speed of diffusion methods.

Orthogonal Negative Guidance in Attention Feature Space for Text-to-Image Generation

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

Orthogonal Negative Guidance subtracts only the orthogonal component of negative-prompt attention features from positive ones in FLUX models to suppress concepts while preserving semantics and quality.

VINS-120K: Ultra High-Resolution Image Editing with A Large-Scale Dataset

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

VINS-120K supplies the first large-scale set of instruction-image-edited-image triplets at ultra-high resolution together with an adaptation strategy that improves detail synthesis.

MotiMotion: Motion-Controlled Video Generation with Visual Reasoning

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

MotiMotion adds visual reasoning via a training-free VLM to refine primary trajectories and hallucinate secondary motions, plus a confidence-aware guidance scheme, yielding more plausible interactions on the new MotiBench benchmark.

MetaEarth-MM: Unified Multimodal Remote Sensing Image Generation with Scene-centered Joint Modeling

cs.CV · 2026-05-19 · conditional · novelty 7.0

MetaEarth-MM unifies multi-modal remote sensing image generation and any-to-any translation across five modalities via scene-centered joint modeling on the new EarthMM dataset.

Aurora: Unified Video Editing with a Tool-Using Agent

cs.CV · 2026-05-18 · unverdicted · novelty 7.0

Aurora introduces a VLM-based agent that converts raw user video edit requests into structured conditioning inputs for a unified diffusion transformer, improving performance on underspecified tasks via a new benchmark.

From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

A planner-orchestrator system learns long-horizon image editing by maximizing outcome-based rewards from a vision-language judge and refining plans from successful trajectories.

Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

Edit-Compass and EditReward-Compass are new unified benchmarks for fine-grained image editing evaluation and realistic reward modeling in reinforcement learning optimization.

Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.

RevealLayer: Disentangling Hidden and Visible Layers via Occlusion-Aware Image Decomposition

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

RevealLayer decomposes natural images into multiple RGBA layers using diffusion models with region-aware attention, occlusion-guided adaptation, and a composite loss, outperforming prior methods on a new benchmark dataset.

UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning

cs.MM · 2026-05-12 · unverdicted · novelty 7.0

UniPath adaptively models coordination-path diversity in unified multimodal models by training a path-conditioned executor and using a lightweight planner for input-dependent selection, improving performance over fixed strategies.

EditRefiner: A Human-Aligned Agentic Framework for Image Editing Refinement

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

EditRefiner uses a perception-reasoning-action-evaluation agent loop and the EditFHF-15K human feedback dataset to refine text-guided image edits more accurately than prior methods.

MULTITEXTEDIT: Benchmarking Cross-Lingual Degradation in Text-in-Image Editing

cs.CV · 2026-05-04 · unverdicted · novelty 7.0

MULTITEXTEDIT benchmark reveals that all tested text-in-image editing models show pronounced degradation on non-English languages, especially Hebrew and Arabic, mainly in text accuracy and script fidelity.

Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models

cs.CV · 2026-04-27 · unverdicted · novelty 7.0

XTC-Bench reveals that strong performance on generation or understanding tasks in unified multimodal models does not guarantee cross-task semantic consistency, which instead depends on how tightly coupled the learning objectives are across modalities.

Edit Where You Mean: Region-Aware Adapter Injection for Mask-Free Local Image Editing

cs.CV · 2026-04-26 · unverdicted · novelty 7.0

A co-trained adapter framework enables mask-free local editing in DiTs by factorizing edit semantics from spatial location and jointly learning a mask predictor.

Exploring Spatial Intelligence from a Generative Perspective

cs.CV · 2026-04-22 · unverdicted · novelty 7.0

Fine-tuning multimodal models on a new synthetic spatial benchmark improves generative spatial compliance on real and synthetic tasks and transfers to better spatial understanding.

HP-Edit: A Human-Preference Post-Training Framework for Image Editing

cs.CV · 2026-04-21 · unverdicted · novelty 7.0

HP-Edit introduces a post-training framework and RealPref-50K dataset that uses a VLM-based HP-Scorer to align diffusion image editing models with human preferences, improving outputs on Qwen-Image-Edit-2509.

UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs

cs.CV · 2026-04-17 · unverdicted · novelty 7.0

UniEditBench unifies image and video editing evaluation with a nine-plus-eight operation taxonomy and cost-effective 4B/8B distilled MLLM evaluators that align with human judgments.

From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench

cs.AI · 2026-04-16 · unverdicted · novelty 7.0

ProVoice-Bench is the first framework to evaluate proactive voice agents, revealing that state-of-the-art multimodal LLMs struggle with over-triggering and context-aware reasoning.

OneHOI: Unifying Human-Object Interaction Generation and Editing

cs.CV · 2026-04-15 · unverdicted · novelty 7.0

OneHOI unifies HOI generation and editing in one conditional diffusion transformer using role-aware tokens, structured attention, and joint training on mixed datasets to reach SOTA on both tasks.

ASTRA: Enhancing Multi-Subject Generation with Retrieval-Augmented Pose Guidance and Disentangled Position Embedding

cs.CV · 2026-04-15 · unverdicted · novelty 7.0

ASTRA disentangles subject identity from pose structure in diffusion transformers via retrieval-augmented pose guidance, asymmetric EURoPE embeddings, and a DSM adapter to improve multi-subject generation.

Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models

cs.CV · 2026-04-13 · unverdicted · novelty 7.0

Unified multimodal models exhibit pseudo-unification due to modality-asymmetric entropy encoding and pattern-split responses between text and image generation.

AIM-Bench: Benchmarking and Improving Affective Image Manipulation via Fine-Grained Hierarchical Control

cs.CV · 2026-04-12 · unverdicted · novelty 7.0

AIM-Bench is the first dedicated benchmark for editing images to evoke specific emotions with fine-grained control, paired with AIM-40k dataset that delivers a 9.15% performance gain by correcting training data imbalances.

RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details

cs.CV · 2026-04-08 · unverdicted · novelty 7.0

RefineAnything is a multimodal diffusion model using Focus-and-Refine crop-and-resize with blended paste-back to achieve high-fidelity local image refinement and near-perfect background preservation.

citing papers explorer

Showing 26 of 76 citing papers.

HorizonWeaver: Generalizable Multi-Level Semantic Editing for Driving Scenes cs.CV · 2026-04-06 · unverdicted · none · ref 61 · internal anchor
HorizonWeaver enables photorealistic, instruction-driven multi-level editing of complex driving scenes with improved generalization via a new paired dataset, language-guided masks, and joint training losses.
FoleyDirector: Fine-Grained Temporal Steering for Video-to-Audio Generation via Structured Scripts cs.SD · 2026-03-20 · unverdicted · none · ref 41 · internal anchor
FoleyDirector introduces structured temporal scripts and a fusion module to enable precise timing control in DiT-based video-to-audio generation while preserving audio fidelity.
EmoCtrl: Controllable Emotional Image Content Generation cs.CV · 2025-12-27 · unverdicted · none · ref 37 · internal anchor
EmoCtrl generates images faithful to content prompts while expressing target emotions via textual/visual enhancement modules and emotion-driven preference optimization.
Benchmarking Attribute Discrimination in Infant-Scale Vision-Language Models cs.LG · 2025-12-22 · unverdicted · none · ref 24 · internal anchor
Infant-scale VLMs discriminate size and texture visually but perform poorly on color and struggle to ground attributes in text, while web-scale models excel at color grounding.
DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation cs.CV · 2025-11-24 · conditional · none · ref 65 · internal anchor
DeCo decouples high- and low-frequency generation in pixel diffusion via a DiT plus lightweight decoder and a frequency-aware flow-matching loss, reaching FID 1.62 at 256x256 and 2.22 at 512x512 on ImageNet while closing the gap to latent diffusion methods.
Emu3.5: Native Multimodal Models are World Learners cs.CV · 2025-10-30 · unverdicted · none · ref 107 · internal anchor
Emu3.5 is a native multimodal world model pre-trained on over 10 trillion vision-language tokens with next-token prediction, post-trained via reinforcement learning, and accelerated by Discrete Diffusion Adaptation for efficient interleaved generation and world exploration.
Adversarial Concept Distillation for One-Step Diffusion Personalization cs.CV · 2025-10-23 · unverdicted · none · ref 97 · internal anchor
OPAD enables reliable high-quality personalization of one-step diffusion models via multi-step teacher distillation combined with adversarial alignment losses.
GenEvolve: Self-Evolving Image Generation Agents via Tool-Orchestrated Visual Experience Distillation cs.CV · 2026-05-20 · unverdicted · none · ref 48 · 2 links · internal anchor
GenEvolve introduces a self-evolving agent framework for image generation using tool-orchestrated trajectories and Visual Experience Distillation to achieve claimed SOTA results on benchmarks.
SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture cs.CV · 2026-05-12 · unverdicted · none · ref 141 · internal anchor
SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.
InsHuman: Towards Natural and Identity-Preserving Human Insertion cs.CV · 2026-05-08 · unverdicted · none · ref 8 · internal anchor
InsHuman proposes Human-Background Adaptive Fusion, Face-to-Face ID-Preserving, and Bidirectional Data Pairing to enable natural human insertion in images without altering identity.
Steering Visual Generation in Unified Multimodal Models with Understanding Supervision cs.CV · 2026-05-07 · unverdicted · none · ref 57 · internal anchor
Using understanding tasks as direct supervision during post-training improves image generation and editing in unified multimodal models.
SmartPhotoCrafter: Unified Reasoning, Generation and Optimization for Automatic Photographic Image Editing cs.CV · 2026-04-21 · unverdicted · none · ref 56 · internal anchor
SmartPhotoCrafter performs automatic photographic image editing by coupling an Image Critic module that identifies deficiencies with a Photographic Artist module that generates edits, trained via multi-stage pretraining, reasoning supervision, and reinforcement learning.
UniCSG: Unified High-Fidelity Content-Constrained Style-Driven Generation via Staged Semantic and Frequency Disentanglement cs.CV · 2026-04-20 · unverdicted · none · ref 38 · internal anchor
UniCSG adds staged semantic disentanglement and frequency-aware reconstruction to DiT diffusion models to improve content preservation and style fidelity in both text- and reference-guided generation.
FineEdit: Fine-Grained Image Edit with Bounding Box Guidance cs.CV · 2026-04-13 · unverdicted · none · ref 57 · internal anchor
FineEdit adds multi-level bounding box injection to diffusion image editing, releases a 1.2M-pair dataset with box annotations, and shows better instruction following and background consistency than prior open models on new and existing benchmarks.
LongCat-Image Technical Report cs.CV · 2025-12-08 · unverdicted · none · ref 26 · internal anchor
LongCat-Image delivers a compact 6B-parameter bilingual image generation model that sets new standards for Chinese character rendering accuracy and photorealism while remaining efficient and fully open-source.
PSR: Scaling Multi-Subject Personalized Image Generation with Pairwise Subject-Consistency Rewards cs.CV · 2025-12-01 · conditional · none · ref 39 · internal anchor
A data-generation pipeline plus pairwise subject-consistency rewards in RL improve consistency and prompt adherence for multi-subject personalized image generation.
Qwen-Image Technical Report cs.CV · 2025-08-04 · unverdicted · none · ref 31 · internal anchor
Qwen-Image is a foundation model that reaches state-of-the-art results in image generation and editing by combining a large-scale text-focused data pipeline with curriculum learning and dual semantic-reconstructive encoding for editing consistency.
JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation cs.GR · 2026-05-05 · unverdicted · none · ref 86 · 2 links · internal anchor
JoyAI-Image unifies visual understanding and generation via an MLLM-MMDiT architecture with spatial training signals to reach competitive benchmark performance and stronger spatial intelligence.
MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings cs.CV · 2026-04-21 · unverdicted · none · ref 44 · internal anchor
MMCORE transfers VLM reasoning into diffusion-based image generation and editing via aligned latent embeddings from learnable queries, outperforming baselines on text-to-image and editing tasks.
OmniFysics: Towards Physical Intelligence Evolution via Omni-Modal Signal Processing and Network Optimization cs.CV · 2026-02-05 · unverdicted · none · ref 80 · internal anchor
OmniFysics is an omni-modal network using a dynamic physical data engine and evolutive tuning to improve performance on multimodal benchmarks and physics-oriented tasks.
Show-o2: Improved Native Unified Multimodal Models cs.CV · 2025-06-18 · unverdicted · none · ref 117 · internal anchor
Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.
Semantic Generative Tuning for Unified Multimodal Models cs.CV · 2026-05-18 · unreviewed · ref 70 · internal anchor
Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm cs.CV · 2026-05-12 · unreviewed · ref 33 · internal anchor
UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models cs.CV · 2026-04-19 · unreviewed · ref 74 · 2 links · internal anchor
Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling cs.CV · 2025-12-14 · unreviewed · ref 37 · internal anchor
Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer cs.CV · 2025-11-27 · unreviewed · ref 78 · internal anchor

OmniGen2: Towards Instruction-Aligned Multimodal Generation

hub tools

citation-role summary

citation-polarity summary

claims ledger

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer